From The Ergoweb® Learning Center

Expert: Full Potential of Speech Recognition Technology Awaits Ergonomics Input

There was a time when Captain Jean-Luc Picard’s commands to his starship computer required a leap of the imagination. Not any more. Speech recognition (SR) technology is here, incorporated in devices as diverse as handheld dictation computers and battlefield translators. One expert believes that ergonomics needs to be a bigger part of the picture.


One new product is as slick as the communicators on the USS Enterprise. The maker, Vocera Communications, underlines the sci-fi flavor of the device in its promotions. Forbes magazine notes that it’s no accident the conference rooms at Vocera headquarters in California are named after characters from the Star Trek universe – Kirk, Spock and Picard. As in the iconic television series, communication with another person requires the wearer only to tap the talk button on the lapel badge and say the person’s name. Servers do the job of decoding speech to recognize names, find out if the person is available, and put the call through via a WiFi network. Early customers were hospitals, which like the fact that nurses and doctors can talk directly to each other, avoiding pager and phone tag.


Pluggd, founded in 2006 by former Microsoft and Amazon engineer Alex Castro, has created a search engine that combines speech recognition with semantic analysis and can, for instance, find the exact spot in a podcast where a particular book is being discussed.


Some of today’s systems have a powerful capacity to teach themselves, according to a CNN Money article in March 2007. Tellme Networks, a startup in California, makes voice-recognition software for corporate call centers and telecoms’ 411 information systems. Tellme’s platform captures some 10 billion utterances annually and constantly analyzes them, improving the system’s precision as it goes. "Voice recognition is all about pattern recognition," says Tellme executive Jeff Kunins in the article. "The more data you have, the better the recognition gets."


IBM’s Embedded ViaVoice software powers GM’s OnStar and other dashboard command systems. VoxTec’s Phraselator, used by US troops in Iraq, translates phrases on the fly.


Microsoft, which bought Tellme in March this year, has invested many millions of dollars in research leading to present-day applications. Anybody with Windows Vista already has one of the more powerful speech recognition systems available, and the company says it is studying technology to enable search-by-voice.


One budget breaker for hospitals has long been medical transcription services. Dragon NaturallySpeaking by Nuance and Vox2data, among others, have moved into this niche, replacing manual services with automatic transcription. Figures from the two companies suggest automation is up to 80 percent cheaper than the manual services.


And Nuance, Vox2data and their relatives are filling a similar role in the legal, corporate, education, government, manufacturing and services worlds. Paired with a miniature voice recorder tailored to individual needs, personnel in almost any field can write emails, memos, reports, lists and almost anything else hands-free, while on the go. Product promotions point out that the technology is also a means of reducing the clutter of handwritten notes, typing and bulky files.


The limitations of early computers held back development until recently. The Bell Labs version of the technology in 1952 was able only to recognize spoken numbers. Today’s powerful processors have lifted many barriers, and development of algorithms for ever wider applications of the technology is proceeding at warp speed.


Lagging Usability Research


Usability research is not proceeding at the same pace, according to Andrew Sears, Ph.D. Dr. Sears is a Professor and Chair of the Information Systems Department at the University of Maryland, Baltimore County (UMBC). His research covers mobile computing issues, SR and accessibility as it relates to internet technology. The second edition of his book with co-editor Julie A. Jacko, "The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications" (Human Factors and Ergonomics Series), is due in early August.


Interviewed recently by The Ergonomics Report™, Dr. Sears talked about the book, his own research and the obstacles that impede the industry’s journey to the final frontier of SR technology.


One of the barriers to the conquest of the frontier – user diversity – is represented in the newly-expanded section of the book that covers the design of human-computer interactions. He explained that his interest in dealing with speech recognition grew from physical disability issues, motivated by a graduate student who had carpal tunnel syndrome to the point where he really couldn’t use a keyboard. “That led to our original research,” he said. The research investigated the “similarities and differences between the issues associated with traditional disabilities and the challenges people experience due to the environment in which they are working or the activities in which they are engaged.” It focused on speech recognition as a means of enabling people with physical disabilities to handle “communications oriented tasks,” he explained, particularly in dictation applications.


“We talk about indirect benefits,” he said, “as illustrated in claims that a system designed for somebody who cannot see will be useful for somebody busy driving their car because they won’t be able to focus their eyes on the device.”


He noted that much is heard about cross-over benefits like these, but there is relatively little research into the issue in the context of internet technology. “That cross-over benefit isn’t necessarily as clear cut as one might like. We’ve done studies where we observed individuals with high-level spinal cord injuries as well as individuals who could have used their hands … composing documents [verbally], and what we found is that the resulting documents … might be fairly similar, but the process by which they produced the documents was different.”


“The process matters,” he replied when asked to explain the significance of the finding. “What we found in our study is the individuals with the spinal cord injuries were much more likely to interrupt their dictation and correct errors as they were going through the document. The people who did not have the spinal cord injuries were much more likely to dictate larger chunks of text and then navigate back to the errors and fix them. What this means is that the people without disabilities are using a different set of commands. They are interacting with the software in a different way. … If we optimize the system to support people as they interrupt their dictation to correct errors as they go along, that’s not going to provide very good support for somebody who wants to dictate a paragraph and then go back and fix the errors that might have happened.”


Error-prone Technology


The professor described SR as a relatively error-prone technology, a factor that explains the importance of addressing differences in the user’s strategy. “As you add more commands into the vocabulary of options that are available to users, the likelihood of errors increases.”


Further investigation revealed design problems that affected both groups of users, he explained, and also opportunities to improve a system to make it more effective for both of their strategies.  


Verbal navigation, which the professor describes as “the relatively simple task of positioning the cursor over the word that you need to fix, so essentially replacing the mouse or navigating to the incorrect word,” adds complexity to SR tasks. It was an early focus of his research. “We found that people would spend one third of their time just moving the cursor from one place to another. It was not very productive. People were spending two thirds of their time fixing errors. One-third dictating, [and] two-thirds fixing errors.” 


“These are primarily speech-recognition errors,” he replied when asked to clarify the source of the mistakes. “[Speech recognition] can be quite accurate if you have people who speak very clearly [and] if you have a nice controlled environment.” There is no solution yet if you put SR “into a more complex environment, where there is more background noise, [or] if the individual using it doesn’t necessarily speak very clearly or may have an accent that isn’t consistent with how the system was originally designed.”


This finding contradicts a claim often seen in articles about SR, namely that the technology can cope with accents, dialects and quirks of speech, often with almost 100 percent accuracy rates. “It has certainly improved,” the professor explained, “but when we first started our research it wasn’t uncommon to see claims by the manufacturers of 95-98 percent accuracy. At that time we were seeing accuracy rates closer to 80 percent when we had real people doing real tasks.”


Cascading Errors


And the verbal commands for correcting the errors “were very error-prone themselves,” he added. “The commands you might use to navigate from one location to another, …  some of those would fail 15 percent of the time.” He noted that they sometimes added new words to the document when they failed. Describing the problem as a cascade effect, he said, “One error leads to another error, which leads to another error, and you spend five minutes just recovering from this error before you can go back to your initial task.”
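
To put the 15 percent figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every spoken command fails independently and at the same rate; the function name and the sequence lengths are hypothetical and are not taken from Dr. Sears’ studies.

```python
def chance_all_succeed(failure_rate: float, n_commands: int) -> float:
    """Probability that n_commands independent commands all succeed."""
    return (1.0 - failure_rate) ** n_commands


# Assumed 15 percent failure rate per command (the figure quoted above),
# applied to hypothetical correction sequences of 1, 3 and 5 commands.
for n in (1, 3, 5):
    print(n, round(chance_all_succeed(0.15, n), 3))
# prints: 1 0.85, then 3 0.614, then 5 0.444
```

Under these assumptions, a correction that needs five spoken commands in a row would go through cleanly less than half the time, which is consistent with the cascade the professor describes.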


On his UMBC web site Professor Sears notes that many researchers are investigating techniques to reduce the number of recognition errors, resulting in substantial improvements in the underlying SR algorithms. “In contrast,” he adds, “we are interested in assisting users in correcting those recognition errors which still occur.”


He describes another research interest on his web site – situationally-induced impairments and disabilities (SIID). Explaining the term, he notes that as mobile computing becomes more pervasive, users enjoy increased flexibility in terms of where and when they record, retrieve, and transmit information. At the same time, the conditions under which these devices are used are becoming more variable, less predictable, and in many situations less hospitable.


He contrasts SIID with what he refers to as HIID (health-induced impairments and disabilities), in the web site account of his research. “We are developing new techniques for identifying and documenting the factors that contribute to SIID, identifying methods for developing solutions that address the temporary and dynamic nature of SIID, and comparing the interaction strategies of individuals experiencing SIID to those of individuals with comparable HIID.”


Professor Sears concedes today’s prodigious processing power and the pace of development of the algorithms are reducing the problems still evident in SR, but he believes the interaction side has been largely ignored. Unanswered questions include how a person actually interacts with the speech recognition and how that dialog should be designed. “How do you design commands … reliable [enough] so that when they do fail, you don’t get these catastrophic problems, adding or deleting text, which makes it incredibly difficult to continue solving your original problem?”


He said, in so many words, that teaming up the best and brightest of algorithm developers with their counterparts in the human factors and ergonomics field is likely to produce solutions.


In the interview the professor was asked to comment on one of the items in a recent listing of odd questions that come into call centers: The caller asked for a knitwear company in Woven. The operator asked, “Woven? Are you sure?” The caller said, “Yes, that’s what it says on the label: Woven in Scotland.” The misunderstanding illustrates another challenge for SR – ambiguity.


Explaining ways developers are meeting the challenge, he said they do a lot of language modeling and language processing, and they can recognize what part of speech might fit in a particular location within a sentence. “And that might help disambiguate things,” he added. “The recognition engines all generate multiple possibilities when you say something, and some of the algorithms, when they are looking at the language model, they figure out which of these possibilities actually fit together – which ones fit into the context of what has been said.” He described the ability of SR to determine context as imperfect, but improving.
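
The idea of choosing among multiple candidate transcriptions by how well they fit together can be sketched in a few lines of Python. This is a toy illustration only, not any vendor’s algorithm: the candidate list, the acoustic scores, the word-pair probabilities and the function names are all invented for the example, and a real language model would be learned from very large amounts of text.

```python
import math

# Hypothetical word-pair (bigram) log-probabilities; values are invented.
BIGRAM_LOGPROB = {
    ("woven", "in"): math.log(0.20),
    ("oven", "in"): math.log(0.02),
    ("in", "scotland"): math.log(0.05),
}
UNSEEN = math.log(1e-4)  # assumed penalty for word pairs not in the table

# Hypothetical candidates from a recognizer: (words, acoustic log-score).
NBEST = [
    (["oven", "in", "scotland"], -1.9),   # best match on sound alone
    (["woven", "in", "scotland"], -2.0),
    (["woven", "inn", "scotland"], -2.1),
]


def lm_score(words):
    """Sum the log-probabilities of each adjacent word pair."""
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN) for pair in zip(words, words[1:]))


def rescore(nbest, lm_weight=1.0):
    """Return the candidate with the best combined acoustic + language score."""
    best = max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))
    return best[0]


print(" ".join(rescore(NBEST)))
```

With these made-up numbers, the candidate that sounds closest on its own (“oven in scotland”) loses to “woven in scotland” once the word-pair scores are added, which is the kind of context-based selection the professor describes.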


A challenge rarely mentioned in today’s mainly glowing reviews of new applications is user expectations. “People are really good at recognizing speech, and we are really good at filling in the gaps. When we didn’t hear a word we can figure out what it might have been. We are really good at using context. And we take in cues beyond the audio. We are looking at the user. We’re often watching their mouth as they are speaking. We see hand gestures. We see things that they are pointing to. So we are much better at dealing with context. We are much better at dealing with ambiguity. … For better or worse, I think people often bring a subset of those expectations when they are interacting with speech recognition.”


Value for Specific Applications, Specific Users


Asked why SR development is so hot when humans can obviously do the same thing so much better, he replied that its main value is in the context of a specific application or a specific set of users, and when “you ground it in a real problem.” In the context of people with physical disabilities where use of the traditional keyboard and mouse is not possible or not a very good solution, he explained, “speech recognition can provide a really powerful tool. Through our research we were able to significantly increase the rate at which people could compose text. We got a 40 percent increase in the composition rate. We dramatically reduced the number of errors people experienced when they are trying to issue the spoken commands. And we came up with a much better way for people to compose text. And if you compare the speed at which people are able to compose text with this improved solution to what they were doing originally with whatever other device they might have been using, which might have been a mouth stick where they were pressing keys one at a time, or scanning software or any number of other solutions, we can do better than that.”


The professor has seen the most success with this kind of focused approach, and not just as it relates to disabilities. “We have also looked at speech as a method for getting information into mobile devices.” It isn’t usually convenient to stop and type something on a small keyboard or to pull out a stylus and write something if you are out on the move, he explained. “Speech can be a very convenient tool. You can picture aircraft mechanics … down inside the airplane … and they need some additional information. Well, the manuals are very large. They are not usually right where you need them. And even if they had the information available electronically, their hands are busy. If they are able to use speech to interact with their electronic manuals to pull up the diagrams they need with hands free, that can be a really powerful solution.”


He sees the underlying technology, the recognition engines, as rich in possibilities. “[They] are now to the point where a lot of these alternatives are very feasible. The piece that I’m not convinced has necessarily arrived is the ergonomic side. Traditionally if you look in the literature and you try to find literature on speech recognition the vast majority of that research has focused exclusively on the algorithms and on making the technology better. It might have been motivated by the theory, [that by eliminating] errors we have solved the problem. But there has been relatively little research on how people really interact with the technology, how they react when the technology fails, how you design the technology to fail more gracefully. When something doesn’t work, instead of causing major problems, make it easier for the user to recover from it.”


The final frontier is in sight. Industry watchers say it is only a matter of time before SR is built into almost every gadget, appliance and machine that people use. They expect voice technology to supplant typing, tapping, texting and touching as the primary interface with machines. Professor Sears’ observations suggest it will be a bumpy voyage to these outer limits if ergonomists and human factors professionals are not part of the crew.


Sources: Dr. Andrew Sears; Forbes; Vocera Communications; Wired News; CNN Money; Microsoft

This article originally appeared in The Ergonomics Report™ on 2007-06-02.