What is voice recognition?

Voice recognition is a technology that enables computers and other devices to understand and process human speech. It converts spoken words into text or actions.

What are the most common applications for voice recognition?

- Virtual assistants (such as Siri, Google Assistant, Alexa)
- Automated transcription
- Voice commands to control devices
- Automated customer services
- Accessibility for people with disabilities

Who are the main suppliers of voice recognition technologies?

- Google (Google Cloud Speech-to-Text)
- Apple (Siri)
- Amazon (Alexa)
- Microsoft (Cortana, Azure Speech Services)
- IBM (Watson Speech to Text)


Voice recognition: definition, origins, and technological applications

How does voice recognition work?

Voice recognition uses machine learning algorithms to analyze the sounds, frequencies, and patterns of human speech. It compares these sounds with a database of words and phrases to identify what is being said.

21 Jun 2024


Artificial Intelligence

Daniel

Talking to one’s smartphone has become commonplace, and many voice recognition systems now perform remarkably well. Nevertheless, the journey towards understanding human speech has extended over several decades.

Voice recognition is now a ubiquitous service across a broad range of sectors. For example:

- Many of us routinely use it to interact with our smartphones or various applications;
- After a medical appointment, practitioners commonly use it to transcribe their reports;
- We frequently check our bank account balances through voice commands.

However, although this technology has become widespread, reaching a satisfactory level of accuracy took decades of development.

A Brief History of Voice Recognition

The development of voice recognition is based on over 70 years of scientific inquiry! The initial studies in this domain date back to the early 1950s.

Audrey

In 1952, Audrey, the first-ever voice recognition system, made its debut at Bell Laboratories. It could recognize the digits 0 through 9, spoken individually, with a 99% accuracy rate. That groundbreaking rate, however, was achieved primarily when the system was used by its inventor; with other users, accuracy fell to around 70 to 80%. From the beginning, it was clear that the human voice is diverse and that each individual has a unique manner of speaking. Voice recognition therefore presented a multifaceted challenge.

Shoebox

A decade later, in April 1962, IBM presented Shoebox, a voice-operated calculator, at the Seattle World’s Fair. Developed by William C. Dersch in San Jose, California, Shoebox recognized sixteen spoken words: the digits 0 through 9, like Audrey, plus six command words for simple arithmetic operations, such as “plus,” “minus,” and “total.”

Harpy

In the 1970s, with funding from the US defense agency DARPA, Carnegie Mellon University developed the Harpy system. Harpy could recognize 1,011 words, roughly the vocabulary of a three-year-old child. This achievement marked a milestone and inspired a surge of interest in voice recognition research.

Tangora

Until this point, systems had reconstructed words by detecting phonemes (the basic units of sound). Starting in the 1980s, voice recognition began adopting new methodologies, including statistical models. Tangora, by IBM, was built on this foundation: it aimed to predict the words most likely to follow, based on an analysis of the preceding words. After roughly twenty minutes of training, Tangora could recognize 20,000 words and complete sentences.

Dragon NaturallySpeaking

In 1997, Dragon Systems (whose product line later became part of Nuance) introduced its software Dragon NaturallySpeaking, marking a significant advance. The tool required several hours of training, but once it had adapted to the user's voice, users could dictate freely without typing out their texts. Dragon could process 100 words per minute and gained popularity among professionals such as doctors and lawyers. Following this, Windows XP, released in 2001, included a voice recognition feature.

Google Voice Search / Google Assistant

In the 2000s, automated voice processing began to incorporate artificial intelligence. Google Voice Search, launched in 2008, combined machine learning algorithms with high-capacity servers for processing, which led to substantial progress. This service, which eventually evolved into Google Assistant, wasn’t marketed as extensively as it could have been and was soon outshone by other services.

Siri

In 2011, Apple made headlines by announcing that Siri, a virtual assistant capable of comprehending spoken commands, would be featured on all new iPhones. This announcement was a landmark moment: voice recognition became a tool of the masses. Amazon’s Alexa and Microsoft’s Cortana followed in 2014.

How does voice recognition work?

So, how exactly does a voice recognition application function today?

The process starts with the collection of acoustic data: a microphone captures sound and turns it into an electrical signal, and that signal is then converted from analog to digital format.
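The analog-to-digital step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the 440 Hz test tone standing in for the microphone signal, the 16 kHz sample rate, the 16-bit depth, and the function name `sample_and_quantize` are all chosen for the example.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz, bit_depth):
    """Sample a continuous signal at regular intervals and round each
    sample to a signed integer, as an analog-to-digital converter does."""
    max_level = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                      # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t)))  # clip to [-1, 1]
        samples.append(round(amplitude * max_level))
    return samples


def tone(t):
    """Hypothetical 'analog' input: a pure 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)


# 10 ms of audio at 16 kHz, 16 bits per sample (assumed values).
pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=16000, bit_depth=16)
print(len(pcm), pcm[:5])
```

The resulting list of integers is the kind of digital representation that the later recognition stages work on.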

At this stage, machine learning plays a crucial role. It correlates phonemes with linguistic units, matches the analyzed sound frequencies to words, and deduces the most suitable sequence of words. The system relies on reference models to guide it through this process, helping it identify the most probable sequences of words. Natural language processing techniques are then used to extract meaning from the recognized speech.
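As a rough sketch of what "the most probable sequence of words" means, the toy example below combines two kinds of scores: how well each candidate word matches the sound, and how likely each word is to follow the previous one. Every number and name here (`acoustic`, `bigram`, `best_sequence`) is hypothetical and made up for illustration. Real systems use trained models and search the candidates with dynamic programming or beam search rather than trying every combination.

```python
import math
from itertools import product

# Hypothetical acoustic scores: for each slot in the utterance,
# the probability that the sound corresponds to each candidate.
acoustic = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.5, "beach": 0.5},
]

# Hypothetical language-model scores: P(next candidate | previous candidate).
bigram = {
    ("recognize", "speech"): 0.30,
    ("recognize", "beach"): 0.01,
    ("wreck a nice", "speech"): 0.01,
    ("wreck a nice", "beach"): 0.20,
}


def best_sequence(acoustic, bigram, floor=1e-6):
    """Try every combination of candidates and keep the one with the
    highest combined (acoustic + language model) log-probability."""
    best, best_score = None, -math.inf
    for candidate in product(*(slot.keys() for slot in acoustic)):
        score = sum(math.log(slot[w]) for slot, w in zip(acoustic, candidate))
        # Pairs missing from the table get a small floor probability.
        score += sum(
            math.log(bigram.get(pair, floor))
            for pair in zip(candidate, candidate[1:])
        )
        if score > best_score:
            best, best_score = candidate, score
    return list(best), best_score


words, log_prob = best_sequence(acoustic, bigram)
print(words, math.exp(log_prob))
```

In this toy setup the second sound is equally likely to be "speech" or "beach" on acoustic grounds alone; it is the language-model score for the word pair that settles the choice.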
