As the final year project of my degree I chose to research Speech Recognition and Speech Synthesis. This interest was sparked by a visit to a school for visually impaired children. The aim of the project was to develop Speech Recognition and Speech Synthesis software as a solution to the issues visually impaired people face when using Microsoft PowerPoint.
In an era where computers have influenced almost all aspects of the world, visually impaired persons are yet to receive a fair share of computer access, owing to the lack of software that could bridge the interaction between the impaired and the computer. Even amidst the rapid development of technology, responsible bodies have not designed sufficient software to support the impaired. As the President of the World Blind Union (Nordstrom, 2005) remarks, even though electronic communication, including the web, has become a major tool of knowledge and information for the visually impaired over the past decade, fewer opportunities have been developed to access information due to the designers' lack of interest and knowledge.
MS PowerPoint is the best-known software for presentational (a word made by myself :P) purposes, and it is one of the programs that the blind have difficulty using. In today's world blind people also compete professionally and may need to handle presentation tasks, but existing software gives the visually impaired few facilities to manipulate PowerPoint for this work.
This software is designed specifically to be used with MS PowerPoint. The project combines Speech Recognition and Speech Synthesis: the visually impaired user can issue voice commands to PowerPoint and, at the same time, listen to spoken feedback from the system confirming whether the correct operation was triggered in the PowerPoint layout.
Speaker dependent and independent speech recognition
- Speaker-dependent Recognition – Voice is a unique characteristic of a person. Speaker-dependent systems are trained on, and recognize, the voice characteristics of a particular speaker.
- Speaker-independent Recognition – These systems work for any speaker, regardless of gender, accent, or the condition of the user's voice.
For the purposes of this project, speaker-independent speech recognition was the more suitable choice, because it allows a wide range of blind students to use the tool in their presentations.
What is Speech Recognition/Automatic Speech Recognition
“Speech recognition (also referred to as voice recognition) is a process by which the elements of spoken language can be recognized and analyzed, and the linguistic message it contains transposed into a meaningful form so that a machine can respond correctly to spoken commands.” (Security.org, 2000)
Types of Speech Recognition Systems
Isolated-word speech recognition – These recognition systems identify one word at a time. The user should leave a gap between each word when speaking.
Connected-word speech recognition – These systems recognize short phrases of connected words; the speaker does not need to pause between individual words.
Continuous speech recognition – These are the most complex systems; the user can speak naturally in continuous sentences, much like reading a paragraph.
Speech Recognition Technologies
- Hidden Markov Models (HMM)
- Dynamic Time Warping (DTW)
- Voice Extensible Markup Language (VXML)
- Discrete Fourier Transformation (DFT)
- Neural Networks (NN)
Of the technologies listed above, HMM has the highest accuracy and the most advantages.
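To give a feel for how an HMM-based recognizer works, here is a toy Viterbi decoder in Python. This is only an illustrative sketch, not how SAPI implements recognition: a real engine models phonemes with thousands of states and Gaussian mixture emissions, while here everything is a tiny hand-made probability table with two states and two coarse acoustic labels.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # v[t][s] = probability of the best path ending in state s at time t
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        v.append({})
        new_path = {}
        for s in states:
            # Best predecessor state for s at time t
            prob, prev = max(
                (v[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            v[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: v[-1][s])
    return path[best]

# Two hypothetical "phoneme" states emitting coarse acoustic labels.
states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"low": 0.8, "high": 0.2}, "S2": {"low": 0.1, "high": 0.9}}

print(viterbi(("low", "low", "high"), states, start_p, trans_p, emit_p))
# → ['S1', 'S1', 'S2']
```

Decoding like this, over the word models in a grammar, is how the engine picks the most likely spoken command from an acoustic observation sequence.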
What is Speech Synthesis
“…Refers to a computer’s ability to produce sound that resembles human speech. Although they can’t imitate the full spectrum of human cadences and intonations, speech synthesis systems can read text files and output them in a very intelligible, if somewhat dull, voice.” (internet.com, n.d.)
Speech Synthesis Technologies
- Concatenative Synthesis
- Formant Synthesis
- Articulatory Synthesis
- HMM based Synthesis
- Sinewave Synthesis
- Voice Extensible Markup Language (VXML)
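Of the techniques above, sinewave synthesis is the simplest to sketch: speech is approximated by a handful of pure tones tracking the formant frequencies. The snippet below is a minimal, assumption-laden illustration in Python (the formant values for the vowel /a/ are rough textbook figures, not measured data).

```python
import math

def sinewave(freqs, duration=0.5, rate=8000):
    """Sum of sinusoids: the core idea of sinewave synthesis,
    where a few tones stand in for the speech formants."""
    n = int(duration * rate)
    samples = []
    for i in range(n):
        t = i / rate
        # Average the sinusoids so the signal stays in [-1, 1]
        s = sum(math.sin(2 * math.pi * f * t) for f in freqs) / len(freqs)
        samples.append(s)
    return samples

# Roughly the first two formants of the vowel /a/ (~700 Hz, ~1200 Hz).
wave = sinewave([700.0, 1200.0])
print(len(wave))  # → 4000 samples at 8 kHz
```

A real synthesizer would vary these frequencies over time and add amplitude envelopes; concatenative systems (like most SAPI voices) instead stitch together recorded speech units.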
Existing Similar Tools
- Dragon NaturallySpeaking
- IBM ViaVoice
The architecture of the system comprises the following major components. The only input to the system is the user's voice command(s).
- Windows Listener Module – embedded in SAPI
- SAPI Speech Recognition – for audio input. Speech Recognition is done through the SpSharedRecoContext interface of SAPI.
- SAPI Speech Synthesis – for synthesized audio output. Speech Synthesis is done through the SpVoice interface of SAPI.
- Communicative PowerPoint
The application uses the “Microsoft Speech Object Library” (the SpeechLib DLL), the interop wrapper around Sapi.dll. Sapi.dll contains all the interfaces of SAPI, along with the Speech Recognition and Speech Synthesis engines used for the implementation. A new speech port (the Windows Listener), representing the connection to the speech engine, starts up when the code is executed.
The application was developed in Microsoft .NET, and the Microsoft Speech API (SAPI) was chosen as the Speech Recognition and Speech Synthesis API. SAPI provides a high-level interface between an application and the speech engines. It uses Hidden Markov Models to match the characteristics of the captured signal. The API contains two main engines: a Speech Recognition Engine and a Speech Synthesis Engine.
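The overall flow (recognized command → PowerPoint action → spoken feedback) can be sketched as a simple dispatch loop. This is only a stand-in sketch: in the real application the recognized text arrives via SpSharedRecoContext's recognition event, the feedback is spoken through SpVoice, and the actions drive the PowerPoint object model. The command names and the `FakeDeck` class below are purely illustrative inventions.

```python
# Hypothetical command table mapping recognized phrases to actions.
COMMANDS = {
    "next slide": lambda deck: deck.next(),
    "previous slide": lambda deck: deck.previous(),
    "start presentation": lambda deck: deck.start(),
}

class FakeDeck:
    """Stand-in for the PowerPoint COM object."""
    def __init__(self):
        self.slide = 1
    def next(self): self.slide += 1
    def previous(self): self.slide = max(1, self.slide - 1)
    def start(self): self.slide = 1

def handle(recognized_text, deck, speak=print):
    # 'speak' stands in for the SpVoice feedback channel.
    action = COMMANDS.get(recognized_text.lower())
    if action is None:
        speak("Command not recognized.")
        return False
    action(deck)
    speak(f"Done: {recognized_text}. Now on slide {deck.slide}.")
    return True

deck = FakeDeck()
handle("next slide", deck)   # prints confirmation, moves to slide 2
handle("jump slide", deck)   # unknown command, spoken error feedback
```

The spoken confirmation after every command is what lets a blind user know the intended operation actually happened, without seeing the screen.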
I hope this gives you a high-level idea of the project, and that it might help you in some way. I know this is a very modest attempt at Speech Recognition, but I am hoping to build a tool one day incorporating AI and Neural Networks.
Hoping for the best!!