Speech recognition technology – a machine’s ability to identify spoken words and translate them into a machine-readable format – is here to stay.
How does speech recognition technology work? Most speech recognition technology (SRT) systems use complex sets of algorithms created through acoustic and language modeling. The software maps the relationship between the sounds we make and the words we intend to say. The better those algorithms, the more accurate a piece of speech recognition technology is.
Where is speech recognition technology used?
In today’s fast-paced world, speech recognition software is almost everywhere. Whether we are ordering household supplies, refilling prescriptions over the phone, interacting with digital assistants, or dictating a text while we rush from one task to the next, speech recognition has become commonplace. Yes, we’re disgruntled when Siri pulls up the wrong results for our query, or we huff when Facebook Messenger doesn’t quite get what we’re trying to say. But each of these moments is a different piece of speech recognition technology cranking up its algorithms and attempting to understand what we’re trying to say.
The primary uses for speech recognition are:
medical dictation/transcription
call routing
speech-to-text processing
voice dialing
voice search
assistance for users with disabilities
Voice recognition, which identifies who is speaking rather than what is being said, is a different type of technology and is used in security devices.
What are the variables that determine how accurate any given speech recognition technology will be?
The lexicon:
Each SRT will have a pre-programmed vocabulary with the corresponding phonetic representations.
Variations on the lexicon based on the SRT’s purpose:
Letter combinations and whole words sound different when pronounced by different people. This is how accents and dialects throw wrenches into the works. All these variations have to be stored in the speech recognition lexicon so it knows that all those different pronunciations mean the same thing. And this needs to happen for every language in the SRT.
But the starting vocabulary in the lexicon should be chosen with the end user in mind. A medical application will need medical terminology but not cooking phrases. An SRT that will be used by native English speakers doesn’t need to be coded for German words pronounced with an Italian accent. But it does need to be coded for English words pronounced with both German and Italian accents.
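To make the idea of a lexicon with pronunciation variants concrete, here is a minimal sketch in Python. The words, the ARPAbet-style phoneme symbols, and the names `LEXICON`, `build_reverse_index`, and `lookup` are all assumptions chosen for illustration; a production lexicon would be far larger and would not be a simple exact-match table.

```python
# Hypothetical pronunciation lexicon: each word maps to one or more
# phoneme sequences (ARPAbet-style symbols, picked for illustration only).
LEXICON = {
    "tomato": [
        ("T", "AH", "M", "EY", "T", "OW"),
        ("T", "AH", "M", "AA", "T", "OW"),
    ],
    "either": [("IY", "DH", "ER"), ("AY", "DH", "ER")],
    "suture": [("S", "UW", "CH", "ER")],
}


def build_reverse_index(lexicon):
    """Map every pronunciation variant back to the word(s) it represents."""
    index = {}
    for word, variants in lexicon.items():
        for phonemes in variants:
            index.setdefault(phonemes, set()).add(word)
    return index


def lookup(phonemes, index):
    """Return the set of words matching a phoneme sequence (empty if none)."""
    return index.get(tuple(phonemes), set())


REVERSE_INDEX = build_reverse_index(LEXICON)

# Two different pronunciations resolve to the same word.
print(lookup(["T", "AH", "M", "EY", "T", "OW"], REVERSE_INDEX))
print(lookup(["T", "AH", "M", "AA", "T", "OW"], REVERSE_INDEX))
```

Both lookups return the same word, which is the point of storing every variant. A medical lexicon would simply start from a different word list.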
Statistics
Speech is delivered to the SRT capture device or microphone via sound waves. And waves are just that. Waves. Words aren’t delivered in pre-wrapped bundles with clear beginnings and endings. As a result, the SRT uses the surrounding words to translate each sound wave into phonetic bites. Basically, speech recognition technology involves a lot of context clues.
When more real-world data is added to the SRT, the algorithm becomes more likely to choose the correct word.
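The role of context clues can be illustrated with a tiny bigram model, sketched below in Python. It counts which word follows which in a small training text and uses those counts to pick between words that sound alike. The corpus and the function names `train_bigrams` and `pick_word` are assumptions made for this example, and real systems use far richer statistical or neural language models.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count how often each word follows another in the training text."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_word(previous_word, candidates, counts):
    """Choose the candidate seen most often after previous_word."""
    return max(candidates, key=lambda w: counts[(previous_word, w)])


# Toy training data (hypothetical); more data gives better estimates.
corpus = [
    "i want to go home",
    "she has two cats",
    "we need to leave",
    "he bought two tickets",
    "they want to eat",
]
counts = train_bigrams(corpus)

homophones = ["to", "two", "too"]
print(pick_word("want", homophones, counts))    # to
print(pick_word("bought", homophones, counts))  # two
```

Adding more sentences to `corpus` changes the counts, which is a small-scale version of how more real-world data shifts the system toward the correct word.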
The most used algorithms are neural networks, Hidden Markov models (HMM), dynamic time warping (DTW), deep learning, and end-to-end automatic speech recognition.
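Of the algorithms listed, dynamic time warping is the simplest to show in a few lines. The Python sketch below compares two sequences of numbers that may be spoken at different speeds. The one-dimensional toy sequences and the name `dtw_distance` are assumptions for illustration; real recognizers compare multi-dimensional acoustic feature vectors rather than single numbers.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


template = [1, 2, 3, 2, 1]                 # stored pattern (toy values)
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]      # same shape, spoken at half speed
other = [3, 1, 3, 1, 3]                    # a different shape

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slowed-down sequence aligns with the template at zero cost even though it is twice as long, while the differently shaped sequence does not. That tolerance to speaking rate is why DTW was used in early recognizers.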
The different types of speech recognition technology
Speaker dependent applications
Trained to work for your voice and only your voice
Must be retrained to work with a different speaker; otherwise accuracy suffers greatly
Single speaker models
Dictation
Smartphones/iPhones/Android
Stored locally on your device
Speaker independent applications
Trained on real-world data
Can recognize thousands of different speakers
Can recognize different accents, variations, linguistic patterns, etc.
No voice training necessary
Call management and helplines
Voice portals
Voice transcription from a recording
Stored on a server or cloud
The major speech recognition technology applications in 2018
Recommended for users spending moderate amounts of time on a computer such as teachers, students, bloggers.
Custom commands
By Nuance Communications
Punctuation and formatting will not be automatically added
Speech patterns need to be altered for Dragon Premium to properly transcribe
Does allow for verbal composition commands (e.g., “insert comma after …”)
Vocabulary not as advanced as other versions
Allows for words to be spelled out
Recorded dictation can be transcribed though it works better with single speaker recordings. Multiple speaker dictations are beyond the range of this software.
Dictation capture via computer microphone or microphone headset
Customizable commands (e.g., open, close, and operate programs like Excel or Word), including advanced commands like sending emails to specific end users and dictating the email body.
Search capabilities for Internet Explorer, Google Chrome, Firefox using voice commands to move cursor and insert phrases into search engines
Accent support
Less accurate at recording meetings when multiple voices are speaking or there is background noise (as would be expected)
Vocabulary considered advanced and more words can be added
Fusion Expert allows for real time voice recognition and option for self-editing
Utilizes voice recognition
Automatic spell check
Easy to use templates and routines
Centrally managed so easy to add users, modify vocabulary, adjust system settings and formatting options
Back-end and front-end language and user profiles are synced
System is optimized with health care settings in mind
Secure encryption and password protection
Google Now –
Speaker independent
Improved accuracy in loud places
Word error rate has fallen from 23% to 4.9% in the last four years
Struggles with determining the meaning of words against the massive amount of available data
Doesn’t have a name or personality, unlike others. Amit Singhal, Google’s Sr. VP in charge of related products, says: “I’m not saying personality shouldn’t come, but the science to get that right doesn’t fully exist.”
Voice algorithms are becoming more adept at understanding accents as more users participate
The app’s initial strong points include navigation, local search, weather, stocks, hotels, time zones, geography, news, photo and video search, mortgage calculation, currency conversion and flight status.
Identifies songs including hummed ones
Combines speech recognition and language understanding
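Word error rate, the metric quoted in the list above, is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. The Python sketch below shows that calculation; the function name `word_error_rate` and the sample sentences are assumptions for illustration, and vendors may normalize text differently before scoring.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level edit distance, keeping only the previous row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitute = prev[j - 1] + (ref_word != hyp_word)
            delete = prev[j] + 1       # reference word missing from output
            insert = curr[j - 1] + 1   # extra word in output
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word and one dropped word out of ten.
reference = "please refill my prescription and send it to the pharmacy"
hypothesis = "please refill my subscription and send it to pharmacy"
print(word_error_rate(reference, hypothesis))  # 0.2
```

On this scale, a word error rate of 4.9% means roughly five errors for every hundred words in the reference transcript.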
LumenVox –
Speaker independent
Uses natural language understanding (NLU) through Statistical Language Models (SLM)
Used for Interactive Voice Response Systems (IVR) like call systems and auto-responders
Available in 32 and 64-bit Linux and Windows
Supports nine languages
Encourages developers to use its SRT for their own applications
Microsoft Cortana –
Speaker independent
Phone assistant built into Windows 10
Messages
Searches
Sets calendar events
In 2016, Microsoft declared parity with humans, i.e., a word error rate of 5.9%
Dictate, a new add-in, works with Outlook, Word, and PowerPoint for Windows
Uses Bing Speech API and Microsoft Translator
Criticism is that Cortana struggles with intent and often returns Bing search results instead of answering questions
Enables transcribing voice in more than 20 languages
Supports real time translation of up to 60 languages
Spoken commands for new lines, delete, punctuation, and more formatting
Can integrate with third party apps
Indexes and stores user information leading to privacy concerns though this option can be disabled
Voice training requires reading provided text aloud.
After initial setup, accuracy increases.
Since punctuation and sentence breaks aren’t added, it took some reviewers time to adjust.
This can mean an additional layer of self-editing after dictation capture/front-end transcription and means more time.
Good tutorials provided along with online technical assistance and support.
The website includes tutorials updated for specific versions or you can use YouTube videos.
User guide is downloadable.
Individual Professional for Mac requires 8 GB of disk space but 16 GB is required to initially download and install.
Does not support dictation into EMR systems
e-Speaking –
Easy set up due to support services
Simple to learn with clearly marked buttons and easy functionality
30-day free trial
Downloadable
Online users guide
Fusion SpeechEMR® –
No interface or integration setup needed
LumenVox –
No voice training needed
Installation too complicated for the average user
One Voice Data –
Software is easily downloadable and straightforward to set up.
Tutorials available online and on YouTube.
Accuracy is impressive right out of the box but continues to increase as deep learning better predicts individual user’s responses.
SmartAction Speech IVR System –
Because it’s aimed at larger companies and corporations, complete installation of SmartAction requires 4-6 weeks
Will work concurrently with existing system
ViaTalk –
Updates are infrequent.
While the app can be learned rather quickly, the software lags behind others.
YouTube videos, company tutorials, and FAQs answer simple questions.
Voice Recognition Technology Accuracy among Brands:
Amazon Alexa = not released
Baidu = 96%
Dragon Medical Practice = 80-87%
Dragon NaturallySpeaking Professional = 95-96%
Dragon NaturallySpeaking Premium = 92%
Dragon for Mac = 90%
Dragon NaturallySpeaking Home = 86%
e-Speaking = 60%
Fusion SpeechEMR = 87.4%
Google Now = mid-80s to 92%
Hound = 95%
Microsoft Cortana = 90%
One Voice Data = 98%
SmartAction Speech IVR System = unknown
Siri = 95%
Tatzi = 72%
ViaTalk = 64%
Voice Finger = 76%
Technical Support among Major Voice Recognition Technology Providers:
Dragon Medical Practice Edition –
During the Petya Malware crisis, Nuance eScription was shut down for a prolonged period. Despite repeated promises of coming back online and restoring data, this was impossible due to the wiper nature of Petya. Customers complained about the lack of customer service responsiveness.
iSupport for registered, logged in users
Extensive support library for guest users but difficult to navigate and not easy to find most recent answers.
Online tutorials and YouTube videos available
Toll-free support
Dragon NaturallySpeaking –
Online chat service connects to Dragon customer service
Company has an email contact form for other questions.
Couldn’t easily find a website beyond Amazon product page
User Reviews of Major Voice Recognition Technology Accuracy & Usability:
“Recognition accuracy has improved so much that we now measure accuracy based on the number of reports with an error, rather than the traditional method of measuring the number of errors per report. The longer the system has been in place, the more willing I am to self-edit because the results from One Voice are increasingly accurate without any explicit training exercises.” Barton Branstetter, MD, The University of Pittsburgh Medical Center
“Braina is a lightweight and smart application that can assist you when browsing through local folders, searching for files, quickly finding synonyms or performing calculi. You can also call for its help when navigating the Internet, for identifying information, songs, movies, news articles and many more.”– Elizabeta Virlan (Software Reviewer at Softpedia)
“(Voice Finger’s) software speed and precision in areas other than dictation are impressive. It performed well as we scrolled through web pages, used Outlook Express and experimented with Microsoft Word tools.” TopTenReviews
“Your phone has to be your friend,” says Francoise Beaufays, a research scientist at Google specializing in speech recognition. “It needs to be able to understand those very open, natural-language type of queries so that the user feels comfortable with it.” (Time)
“Voice is a big part of the computer interface of the future,” said Gene Munster, a veteran equity analyst and now head of research at Loup Ventures. “Whoever owns voice will be the gateway of commerce.” (Reuters)
“One consequence of using natural language in the user interface is direct access to information. We can figure out what you are looking for and take you directly there. You don’t always have to go through a traditional search portal. It will change some business models.” Vladimir Sejnoha, chief technical officer of Nuance
For a free consultation about integrating One Voice Data’s speech recognition technology into your health care setting or medical practice, contact us online, via the scheduling calendar below, call (910)-506-3342 or email info@onevoicedata.com.