
How Computers Work, Part I
August 2001 • Vol. 5 Issue 3
Page(s) 168-173 in print issue

Make Your Voice Heard
Speech Recognition Programs Are Far From Perfect, But They Have Exciting Potential
Arthur C. Clarke and Stanley Kubrick presented a vision of human-computer interaction 37 years ago in “2001: A Space Odyssey.” HAL-9000’s softly stating, “I’m sorry, Dave. I’m afraid I can’t do that,” sent chills up audiences’ spines and introduced many to the idea of people conversing with computers. The year 2001 is here, and although that level of interaction with computers is not yet possible, you’ll be amazed at how much you can communicate with your PC.



  Sound It Out. Most human-computer interaction seen in popular culture, such as in “2001” or the “Star Trek” series, involves a level of artificial intelligence science has yet to attain. These artificial intelligence methods involve talking with your computer rather than talking at it. It’s the difference between, for instance, casually asking your system to compute a trajectory through an upcoming asteroid field and dictating a letter to your mom.

You may be unsatisfied with the present level of human-computer interactivity if you hate to type, ache from the pains of carpal tunnel syndrome, or have other disabilities that make typing difficult. But with today’s speech recognition software, you can speak out loud to your computer, and although it won’t intuitively and cognitively “understand” what you’re saying, it can listen, interpret, and respond. It does so, however, at a very basic level.

Most people who regularly use a telephone have already interacted with speech recognition technology. Many banks and other service institutions offer IVR (interactive voice response) systems that let users navigate telephone menus using voice cues instead of constantly pressing different numbers.

It’s important to differentiate between these types of speech recognition uses and voice recognition. Although people often use the terms interchangeably, the terms represent different uses of similar technology. Speech recognition software reacts to what you actually say; the software identifies the words and responds appropriately by “typing” the words you say or by opening a document. Voice recognition software is more frequently used for security purposes. These applications identify your particular speech patterns as a means of identifying who you are.



  How It Works. For our purposes, speech recognition starts with speech. We’ll assume everyone is familiar with the neurological origins of thought and how the brain translates electrical impulses into the muscle movements that become the speech we use to communicate with one another. We’re assuming all this, of course, because despite years of study, scientists still aren’t sure exactly how the brain works. What we do know is what happens once your voice begins its journey toward your computer.

Sound to signal. The human voice, or any sound for that matter, travels through media, such as water or air, as waves. These waves hit another object and cause vibrations; in the case of people, these waves hit our ears and are turned into signals that are sent to the brain. The brain then processes these signals into the sounds with which we are familiar. Computers, on the other hand, use a microphone that acts like an electronic ear.

Microphones come in many varieties, but all of them have some type of thin surface that vibrates as it is buffeted by sound waves. A small electrical current passes through this material. The vibrations alter the conduction properties of the surface, which creates identical variations in the electric signal. This signal is passed on to your computer’s sound card through the jack where the microphone plugs into the machine.

Inside the computer, the first step is to translate this analog signal into something the PC can understand. If we were to graph an analog signal on a piece of paper, it would look something like a wave, varying in intensity and frequency.

Computers, however, operate with digital information, which is represented by numbers that describe not how the signal is changing but how it appeared at a specific point in time. Turning an analog signal into a digital signal requires sampling. The process of sampling takes an analog signal, divides it into tiny slices, and assigns a number to each slice.

For example, let’s say that from the first nanosecond to the second nanosecond, the signal changed from value “1” to value “3.” With sampling, this nanosecond-long section might be averaged to “2,” which is then written in the binary code of ones and zeros. Each section receives a value in the same way. A digital signal may not be as technically accurate as an analog signal because we’re averaging down all those nanoseconds, but the time period is so short that it’s nearly impossible for the human ear to tell the difference.
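To make the averaging idea concrete, here is a minimal sketch, written in Python purely for illustration, of how each slice of an analog signal might be averaged into a single value and written out in binary. The wave, the slice length, and the 8-bit values are invented for the example and are not taken from a real sound card.

    # Minimal sketch of sampling: average each slice of a continuous signal
    # and store the result as a binary string. The values and slice sizes
    # are illustrative, not those of a real sound card.
    import math

    def sample(signal, duration, slices):
        """Average the signal over equal time slices and quantize to integers."""
        step = duration / slices
        samples = []
        for i in range(slices):
            # Take several readings inside the slice and average them,
            # mimicking the "1 to 3 becomes 2" example above.
            readings = [signal(i * step + step * k / 10) for k in range(10)]
            value = round(sum(readings) / len(readings))
            samples.append((value, format(value, '08b')))  # the value and its binary form
        return samples

    # A toy "analog" wave: a 440Hz tone riding above zero.
    wave = lambda t: 2 + 2 * math.sin(2 * math.pi * 440 * t)
    for value, bits in sample(wave, duration=0.001, slices=8):
        print(value, bits)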

Numbers to words. As the sound is transformed into a marching column of numbers, the software has something it can bite into. Somehow, the software must recognize which words are represented by all of these numbers.

Looking at the problem logically, we would first realize the same word or sound should result in basically the same stream of numbers each time we say it. Thus, we could speak every type of sound aloud, look at the resulting numbers, and teach the software that a specific number pattern means a specific sound. This is the underlying idea of speech recognition.

Of course, the problem is far more complex in practice. First, programmers must design speech recognition software to understand many different people and their unique vocal characteristics. If you listen closely to someone speaking, you’ll notice that regional dialects, even those from neighboring states, sound surprisingly different. They certainly look different when reduced to digital computer code. Factor in dry throats, fatigue, whether the person is speaking slowly or quickly, the exact position of the microphone, and background noise, and it’s a wonder there are any matches at all.

The best way to get around a big problem such as this is to break it down into a bunch of little problems. Understanding a sentence first requires that we understand at least most of the sounds that went into making it. Scientists call these building blocks of language phonemes. People are surprisingly adept at picking phonemes out of everyday speech, but programming a computer to mimic this seemingly basic skill is no easy task.

Speech recognition begins by dividing the sentence sound into tiny equal segments called vectors. Before attempting to pick the phonemes out of the string of vectors, the software normalizes the data to reduce background noise and minimize certain other differences. This process involves running the numbers through a host of mathematical algorithms that researchers tinker with constantly.

The idea relies on the fact that human speech takes place in specific, known spectra that generate different sets of numbers than random background noises, such as passing cars or squeaky chairs. Another trick is to detect pauses in the speech, sample the background sounds, and then filter them out of the recording.
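As a rough illustration of those two tricks, the sketch below (hypothetical Python with invented numbers) slices a digitized sample stream into fixed-size vectors and subtracts a background level estimated from a silent pause. Real recognizers use far more sophisticated normalization algorithms than this.

    # Illustrative sketch only: split digitized samples into equal-length
    # "vectors" and subtract a background level measured during a pause.

    def frame_into_vectors(samples, vector_size):
        """Group the sample stream into fixed-size vectors (frames)."""
        return [samples[i:i + vector_size]
                for i in range(0, len(samples) - vector_size + 1, vector_size)]

    def subtract_background(vector, background_level):
        """Crude noise reduction: remove the average level heard during silence."""
        return [value - background_level for value in vector]

    samples = [3, 4, 3, 12, 14, 13, 11, 4, 3, 4]   # toy digitized signal
    pause = samples[:3]                            # assume the first slices are a pause
    background = sum(pause) / len(pause)

    vectors = frame_into_vectors(samples, vector_size=5)
    print([subtract_background(v, background) for v in vectors])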

Now the software goes to work on the normalized vectors. The computer still can’t come up with a perfect match for all the potential phonemes, but it doesn’t have to. The software has at its disposal plenty of statistical information about what phonemes usually look like and which phonemes often follow others. If it can’t make out what a specific sound is, the software can make an educated guess based on phonemes it finds before and after the mystery block.
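A toy version of that educated guess might look like the sketch below, in which both the acoustic scores and the phoneme-pairing probabilities are invented: the candidate that best fits the surrounding context wins even when the acoustics alone are ambiguous.

    # Hypothetical sketch: score each candidate phoneme by how well it fits
    # acoustically AND how often it follows the phoneme recognized before it.
    # The probability table is invented for illustration.

    FOLLOWS = {                      # made-up P(next phoneme | previous phoneme)
        ("t", "uw"): 0.30,
        ("t", "ih"): 0.10,
        ("t", "ao"): 0.05,
    }

    def best_guess(previous_phoneme, acoustic_scores):
        """Pick the candidate with the best combined acoustic and context score."""
        def combined(candidate):
            acoustic = acoustic_scores[candidate]
            context = FOLLOWS.get((previous_phoneme, candidate), 0.01)
            return acoustic * context
        return max(acoustic_scores, key=combined)

    # The acoustics alone can barely separate "uw" from "ih" after "t".
    print(best_guess("t", {"uw": 0.41, "ih": 0.40, "ao": 0.19}))  # prints uw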

The software can also draw on past success. In the earlier days of speech recognition, speakers would start by reading paragraphs of preset sentences. Because the software knew what the speakers should be saying, it could “learn” how an individual said the different phonemes. As the technology has developed, most users no longer have to plod through these lengthy sessions. Usually, the software needs only a few quick sentences up front; it learns the rest along the way as the speaker dictates documents and makes the occasional correction.

Making sense. The right phonemes are not enough to create meaningful words. Take the sound of “t” and “ooh” pushed together. It may be the word “to,” or “two,” or “too.” If you say “I,” did you really mean “eye?” Software that focuses on individual words alone would never know.

Speech recognition technology therefore relies on the same tool we humans use in deciphering speech: context. If someone says, “I would like to go to the movies,” we know the number two has no place in the sentence because it just doesn’t make sense. People intrinsically eliminate such homophones. Your computer doesn’t have the same concept of what makes “sense,” but it does understand cold, hard statistics.

Trigram analysis, which is at the heart of speech recognition, draws from a database filled with information on three-word clusters pulled from examples in newspapers, books, mail, business reports, speeches, and other texts. By counting all of those words, companies such as IBM have come up with the probability, for a given language, that any two words will be followed by a specific third word.

Researchers call the number of words that might follow the previous two the branching factor. For highly predictable vocabularies, such as those used by radiologists interpreting X-rays, the branching factor is about 20. By comparison, office correspondence has a branching factor of 150, while magazine and newspaper text pushes it up to around 300.
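A trigram database can be pictured as a lookup table keyed on the previous two words. The miniature table below is invented for illustration; the real ones, built from enormous text collections, record probabilities for hundreds of possible next words, which is where the branching factor comes from.

    # Toy trigram table: given the previous two words, list possible third
    # words and their probabilities. The numbers are invented.

    TRIGRAMS = {
        ("i", "would"): {"like": 0.45, "go": 0.10, "see": 0.08, "light": 0.001},
        ("to", "the"):  {"movies": 0.05, "store": 0.04, "left": 0.03},
    }

    def third_word_probability(w1, w2, w3):
        """Probability that w3 follows the pair (w1, w2); tiny floor if unseen."""
        return TRIGRAMS.get((w1, w2), {}).get(w3, 0.0001)

    def branching_factor(w1, w2):
        """Number of plausible next words recorded for this word pair."""
        return len(TRIGRAMS.get((w1, w2), {}))

    print(third_word_probability("i", "would", "like"))  # high probability
    print(branching_factor("i", "would"))                # tiny in this toy table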



Lernout & Hauspie's Dragon NaturallySpeaking is one of the leading speech recognition programs on the market.
For example, say the computer has heard “I would.” As the software begins to analyze the next word, it cannot distinguish the phonemes perfectly but knows it sounds close to “like,” “Ike,” or “light.” Next, it runs through the database of all the words that frequently follow “I would” and finds “like” has a high probability, perhaps with words such as “go” or “see” or many others having somewhat lower probabilities. Then, these two scores (the acoustic phoneme score and the trigram matching score) are multiplied together. “Like” scores well in both the acoustic and trigram matching categories, so the computer quickly selects it. Looking at the acoustic properties alone, perhaps “light” was a closer match. However, that word scores far too low on the trigram to beat out “like.”
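The arithmetic in that example can be written out directly. The scores below are invented, but they show why “light” can win on acoustics alone and still lose once the trigram score is multiplied in.

    # Worked toy example of the scoring described above: multiply each
    # candidate's acoustic score by its trigram score and keep the winner.
    # All numbers are invented for illustration.

    candidates = {
        #  word     (acoustic, trigram score after "I would")
        "light": (0.50, 0.001),
        "like":  (0.45, 0.45),
        "Ike":   (0.30, 0.0001),
    }

    scores = {word: acoustic * trigram
              for word, (acoustic, trigram) in candidates.items()}
    print(scores)                       # "light" leads acoustically...
    print(max(scores, key=scores.get))  # ...but "like" wins overall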

That’s the basic idea, but using brute force to calculate the probability of every single word following “I would” takes too long, because you are already speaking the next part of the sentence. As a result, speech recognition uses a variety of techniques, collectively known as fast matching, to prune many of the branches that clearly don’t make sense. As you speak, the software comes up with rough predictions of the next word you are going to say. By the time you say “I would,” the software is already busy ruling out options such as “elephant” for the next word, thus chopping off language branches and concentrating only on those words that are within the realm of possibility.
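A minimal sketch of that pruning step appears below; the word table and threshold are invented, but the idea is the same: discard candidates whose probability after the words already heard is too low to be worth scoring in full.

    # Rough sketch of fast matching: before doing full acoustic scoring,
    # drop next-word candidates whose trigram probability is negligible.
    # The table and threshold are invented for illustration.

    NEXT_WORDS = {
        ("i", "would"): {"like": 0.45, "go": 0.10, "see": 0.08,
                         "light": 0.001, "elephant": 0.000001},
    }

    def fast_match(history, threshold=0.001):
        """Keep only next-word candidates likely enough for full scoring."""
        options = NEXT_WORDS.get(history, {})
        return {word: p for word, p in options.items() if p >= threshold}

    print(fast_match(("i", "would")))   # "elephant" has already been pruned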

Taking these sorts of shortcuts makes things faster, but it also means the software sometimes makes mistakes and starts going down the wrong branch. A few words later, however, the error becomes obvious as the matches become less and less like the computer’s predictions. The software returns to the junction where it went wrong, picks a previous choice that originally scored lower, and tries again.

Training software. Using statistics gathered from a variety of dialects and languages, software makers can create programs that recognize most of the words most people say. However, there will always be times when you stump the software. It could be a word or combination of words you say in a nonstandard way, a phrase not in the statistical database, or specialized industry-specific jargon that doesn’t occur in normal speech.

In that case, the software might take its best guess, mark the word, and move on. You can look over the document when you’re finished talking and make corrections. Present numbers from IBM point to accuracy of about 95% for general use and better in specialized applications, so usually there aren’t many mistakes. More advanced software will remember which corrections you made and alter its statistical database to conform with the words you use.



  Biometrics. Voice recognition used in biometric applications depends on the same processing techniques as speech recognition. The difference lies in its focus. Instead of interpreting which words the speaker says, the computer concentrates on how the speaker says them; voice recognition is dependent upon vocal characteristics unique to each person’s speech patterns.

Most biometric applications require you to speak specific words. The system then analyzes certain characteristics of your voice, such as your cadence, pitch, and tone. Next, it creates a template for each user and stores it in a database. As you might expect, background noise can affect the accuracy of voice recognition, as can heightened emotion.

Biometric systems typically operate to either identify or verify. The distinction is subtle, but it is worth noting. Identification requires the person to submit a voice sample for the system to match against the templates it has stored. Depending on the size of the organization, this could be tedious, because it would take some time to sort through all the templates. Verification adds the initial step of the person identifying themselves by providing a name or assigned number before submitting the requisite verbal sample. This allows the system to immediately pull up the appropriate template for matching and verification.
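The contrast between the two modes can be sketched as a one-to-many search versus a one-to-one check. The similarity measure and the stored templates below are simple stand-ins for real voiceprint features such as cadence, pitch, and tone.

    # Illustrative contrast between identification (match against every
    # stored template) and verification (check only the claimed user's
    # template). Templates and scoring are stand-ins, not a real system.

    def similarity(sample, template):
        """Toy similarity: smaller feature differences score closer to 1."""
        return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(sample, template)))

    TEMPLATES = {"alice": [0.8, 0.3, 0.5], "bob": [0.2, 0.9, 0.4]}

    def identify(sample):
        """One-to-many: compare the sample against every template."""
        return max(TEMPLATES, key=lambda name: similarity(sample, TEMPLATES[name]))

    def verify(claimed_name, sample, threshold=0.7):
        """One-to-one: compare the sample against the claimed user's template only."""
        return similarity(sample, TEMPLATES[claimed_name]) >= threshold

    print(identify([0.81, 0.29, 0.52]))         # alice
    print(verify("alice", [0.81, 0.29, 0.52]))  # True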



  Now Hear This. Three primary areas use speech or voice recognition technology: dictation, interface, and security.

Dictation. Of all the computer applications using speech recognition technology, none is more accessible to end users than dictation programs.

With each new version, dictation programs progressively improve. What began with solely discrete speech is now evolving into a more accurate natural language product. Discrete speech recognition software requires users to pause. Between. Each. Word. It’s a cumbersome process that annoys users but makes life easier for the computers struggling to keep up and understand.

Natural language ability is the goal speech recognition researchers are striving for. Instead of relying on plodding pauses and trailing sentences, natural language lets you speak in a regular, continuous pattern. You will still need to pause to specify punctuation and formatting, but the technology allows for more normal dictation.

Several different manufacturers offer a variety of dictation applications. Lernout & Hauspie’s Dragon NaturallySpeaking line (http://www.lhsl.com/naturallyspeaking) is the most well-known and the most well-received. Although IBM has dominated the research spectrum for 25 years, its ViaVoice line (http://www.ibm.com) wrestles with L&H’s Dragon programs for the majority of the market share.

The big companies are branching out with specialty speech recognition programs for individuals in particular fields, such as law and medicine (which itself has multiple concentrations, such as radiology). L&H also offers Dragon NaturallySpeaking Mobile, a $199 digital recorder that you talk into and later plug into your PC, where NaturallySpeaking Preferred translates your voice.

Language and reading programs also are using dictation technology, providing new and exciting ways for students to learn. Auralog’s TeLL me More series (http://www.auralog.com) offers comprehensive language study that uses speech recognition as a private tutor of sorts. The software has proven so effective that it is endorsed by the ministries of education in both Spain and France, and AT&T and Ford use it to train employees.

Language isn’t the only thing being learned with the help of speech recognition technology. IBM implemented a pilot program to help 8- to 10-year-old school children in Philadelphia learn how to read. Students use an on-screen tutor that is disguised as a cartoon character to help identify pronunciation mistakes. Teachers can easily use the program to better tailor the lessons to each student’s individual needs.

Interface. Command recognition is a way users can navigate their computer interface without using a mouse. Although there is an element of command recognition in most dictation programs (such as when you say, “bold this”), the technology presently manifests itself primarily in online products. IBM’s ViaVoice Online Companion, for instance, uses speech recognition technology to send e-mail messages or chat.

Security. Biometric technology dispenses with passwords in favor of voice recognition. This type of security is showing up more and more to guard access to ATMs, computers, voice mail, and cellular phones.

You can also find biometric security options on the consumer level. Some products, such as Veritel’s VoiceCheck (http://www.veritelcorp.com), employ biometric technology to restrict access to a PC based on voice patterns. You can use the program to lock others out of the entire computer or just out of selected files.

More avant-garde applications extend the realm of voice recognition. VoiceTrack (http://www.voicetrack.com) is a program designed to track parolees, people on probation, pretrial defendants, offenders who haven’t been sentenced yet, juveniles, and individuals on work release. The voice verification feature lets the corrections officer determine whether the subject is where he is supposed to be at any given time. With a single phone call, the system matches the subject’s voice print and verifies the person’s identity.



  Making A Statement. Like all applications, speech and voice recognition programs work better when you have more power behind them. By all accounts, the software is a bit of a hog. Although most applications come packaged with a nifty microphone/earphone headset, you still need to make sure your computer has the get-up-and-go necessary for the program to really work well.

System requirements depend on the type of application, but all speech and voice recognition programs require at least a 16-bit sound card. For the major dictation programs, you should have a computer with a 266MHz Pentium-class processor (preferably with MMX); anything faster than 300MHz will noticeably improve performance. Memory is essential, too; you’ll need a minimum of 64MB of RAM, and more is better. In addition, you must have at least 180MB of free hard drive space (510MB for IBM’s ViaVoice).

There are still obstacles technology must overcome before it can reach the pinnacle of speech recognition; that is, the point at which it’s difficult to distinguish between conversing with a human and conversing with a computer. Some obstacles deal with artificial intelligence issues; others involve refining the software to detect punctuation and emphasis from pitch and tone alone, without explicit commands.

Although we may still be some time away from locating an employee within a particular space or having our computer inquire about our state of mind, as HAL did with Dave in “2001,” voice recognition technology nonetheless is making life easier for everyday people.  

by Anne Steyer Phelps



How Sampling Works

When we speak, the sounds we make vibrate and travel through the air in the form of an analog wave. An analog wave is like a ramp that has an infinite set of values. As the amplitude of the wave rises and falls, the pitch and tone of the sound correspondingly go up or down. This is the most realistic form of sound because every inflection is measured and represented on one of the infinite number of points on the wave.

Today’s computers, however, don’t deal with analog input; they deal with digital information. For a computer to interpret and output sound, it must turn analog waves into digital data through a process called sampling. Instead of recording the entire analog wave of sound, an analog-to-digital converter in your computer’s sound card takes “snapshots” of the wave at various points in time. Through quantization, these points are given a value, which is then translated into the digital language of binary code (a string of 1s and 0s) that the computer can understand. When the sound card samples the sound, it records a value, holds it, and then records another. When the sound is digitized, the result is more like a set of stairs than a ramp. Unlike an analog wave, a value can only be represented on one of the steps, not anywhere in between.

When the individual points are measured quickly enough, sampling can fool the human ear into thinking the sound is continuous. The frequency at which the sound snapshots are taken is referred to as the sampling rate. The faster the sampling rate, the more realistic the sound will be.
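A quick way to see the effect of the sampling rate is to snapshot the same tone at two different rates; the numbers below are only illustrative, but the faster rate clearly traces the wave in much finer steps.

    # Toy demonstration of sampling rate: more snapshots per second means
    # the held values follow the original wave more closely.
    import math

    def sample_wave(frequency_hz, sample_rate_hz, duration_s=0.001):
        """Snapshot a sine wave at a fixed rate; each value is held until the next."""
        count = int(duration_s * sample_rate_hz)
        return [round(math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz), 2)
                for n in range(count)]

    print(sample_wave(440, 4_000))    # a coarse staircase of 4 values
    print(sample_wave(440, 44_100))   # 44 values that track the wave closely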

To play the sound back, the computer transfers the digital values from memory to a digital-to-analog converter. Each number is assigned a voltage so it can be represented as an analog wave. Those voltages then go to an amplifier so they are strong enough for the speakers to play the final sound.





Speaking Of The Past

Although you may think speech recognition technology is relatively new, computer scientists have been continually developing it for the past 40 years. They’ve made great strides in improving the systems and processes, but the “Star Trek” idea of your computer hearing and understanding you is still a long way off.

Late 1950s: IBM scientists begin their speech recognition research using mammoth computers. They start by attempting to isolate specific linguistic patterns and initiate work on finding statistical correlation between sounds and words.

1960s: The U.S. Department of Defense begins to fund new speech recognition research. Scientists develop “Automatic Prototyping,” which is a process where the computer searches out and stores specific sounds.

1964: Arthur C. Clarke and Stanley Kubrick collaborate on “2001: A Space Odyssey.” The movie’s HAL-9000 computer is a fully functional interactive computer.

1982: Dragon Systems is founded; two years later, its first speech recognition technology is incorporated into a portable PC in the United Kingdom.

1993: IBM launches the first packaged speech recognition product, IBM Personal Dictation System for OS/2; Apple ships its version for the Macintosh.

1994: Dragon Systems ships the first software-only PC-based dictation product, DragonDictate for Windows 1.0.

1996: IBM introduces the first real-time continuous-speech recognition product, MedSpeak/Radiology.

1996: OS/2 Warp 4 includes built-in speech navigation and recognition.

1997: Dragon ships NaturallySpeaking, the first general-purpose continuous-speech recognition product; IBM ships ViaVoice, and Microsoft buys into Lernout & Hauspie, implying speech recognition is the wave of the future.

1998: Natural language programs begin to flood the market.

1999: Researchers continue to strive to expand speech- and voice-recognition products beyond the desktop to other everyday applications.

2000: Lernout & Hauspie acquires Dragon Systems.










© Copyright by Sandhills Publishing Company 2001. All rights reserved.