
Voice recognition: definition, origins, and technological applications


Talking to one’s smartphone has become commonplace, and many voice recognition systems have proven highly capable. Nevertheless, the journey towards understanding human speech has extended over several decades.

Voice recognition is now a ubiquitous service across a broad range of sectors:

  • Many of us routinely use it to interact with our smartphones or various applications;
  • Following a medical appointment, it’s common for practitioners to use this technology to transcribe their reports;
  • It’s frequently through voice commands that we verify our bank account balances;
  • etc.

However, although this technology has become widespread, reaching a satisfactory level of accuracy took decades of development.

A Brief History of Voice Recognition

The development of voice recognition is based on over 70 years of scientific inquiry! The initial studies in this domain date back to the early 1950s.


In 1952, Audrey, the first-ever voice recognition system, made its debut at Bell Laboratories. It could recognize the digits 0 through 9, spoken individually, with a 99% accuracy rate. However, this groundbreaking rate was primarily achieved when the system was used by its inventor; with other users, accuracy fell to around 70 to 80%. From the beginning, it was clear: the human voice is diverse, and each individual has their own unique manner of speaking. Thus, voice recognition presented a multifaceted challenge.


A decade later, in April 1962, IBM presented Shoebox, a voice-operated calculator, at the Seattle World’s Fair. Developed by William C. Dersch in San Jose, California, Shoebox not only recognized the digits 0 through 9 like Audrey, but also understood sixteen basic English words corresponding to simple mathematical operations: “plus,” “minus,” “total,” …


In the early 1970s, driven by the US defense agency DARPA, Carnegie Mellon University developed the Harpy system. Harpy could precisely identify 1,011 words, mirroring the vocabulary of a three-year-old child. This achievement marked a milestone and inspired a surge of interest in voice recognition research.


Until this point, systems had relied on detecting phonemes (the basic units of sound) to reconstruct words. Starting in the 1980s, voice recognition began adopting new methodologies, including statistical models. IBM’s Tangora was built on this foundation: it predicted which words should logically follow based on an analysis of the preceding words. After roughly twenty minutes of training, Tangora could recognize 20,000 words and complete sentences.
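The core idea behind this statistical approach can be sketched with a toy bigram model: count how often each word follows another in a corpus, then predict the most frequent successor. The mini-corpus below is invented for illustration; Tangora’s real models were vastly larger.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus (Tangora's actual training data was far larger).
corpus = "please total the sum please total the amount please add the sum".split()

# Count bigram frequencies: how often each word follows another.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most likely word to follow `word`, or None if unseen."""
    counts = bigrams[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("total"))  # -> "the" (follows "total" in every case)
print(predict_next("the"))    # -> "sum" (the most frequent successor)
```

Predicting from context in this way lets a recognizer resolve acoustically ambiguous words, which is exactly what made the statistical turn so productive.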

Dragon Naturally Speaking

In 1997, Dragon Systems introduced Dragon NaturallySpeaking, marking a significant advance. This tool required several hours of training, but once trained, users could dictate freely without the need to physically type out their texts. Dragon could process 100 words per minute and gained popularity among professionals such as doctors and lawyers. Later, Windows XP, released in 2001, included a voice recognition feature.

Google Voice Search / Google Assistant

In the 2000s, automated voice processing began to incorporate artificial intelligence. Google Voice Search, launched in 2008, combined machine learning algorithms with processing on high-capacity servers, leading to substantial progress. This service, which eventually evolved into Google Assistant, wasn’t marketed as extensively as it could have been and was soon outshone by other services.


In 2011, Apple made headlines by announcing that Siri, a virtual assistant capable of comprehending spoken commands, would be featured on all new iPhones. This announcement was a landmark moment, as voice recognition became a tool of the masses. Following closely behind, Amazon’s Alexa and Microsoft’s Cortana were released in 2014.

How does voice recognition work?

So, how exactly does a voice recognition application function today?

Acoustic data collection begins with a microphone capturing sound and transforming it into an electrical signal, which is then converted from analog to digital form.
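The analog-to-digital step can be sketched as sampling a continuous signal at fixed intervals and quantizing each sample to a finite set of integer levels. The 8 kHz rate and 4-bit depth below are illustrative choices, not parameters of any particular system, and the sine wave stands in for the microphone’s signal.

```python
import math

SAMPLE_RATE = 8000  # samples per second (illustrative)
BIT_DEPTH = 4       # 2**4 = 16 quantization levels (illustrative)

def analog_signal(t):
    """Stand-in for the microphone's electrical signal: a 440 Hz tone."""
    return math.sin(2 * math.pi * 440 * t)

def digitize(duration_s):
    """Sample the signal and quantize each value to an integer in [0, 15]."""
    levels = 2 ** BIT_DEPTH
    samples = []
    for n in range(int(SAMPLE_RATE * duration_s)):
        t = n / SAMPLE_RATE
        value = analog_signal(t)  # amplitude in [-1, 1]
        # Map [-1, 1] onto the integer range [0, levels - 1].
        samples.append(round((value + 1) / 2 * (levels - 1)))
    return samples

digital = digitize(0.01)  # 10 ms of audio -> 80 integer samples
print(len(digital))
```

Real systems use far higher rates and depths (e.g. 16 kHz, 16-bit), but the principle is the same: the continuous voltage becomes a stream of numbers a computer can analyze.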

At this stage, machine learning plays a crucial role. It correlates phonemes with linguistic units, matches the analyzed sound frequencies to words, and deduces the most suitable sequence of words. The system relies on reference models to guide it through this process, helping it identify the most probable sequences of words. Techniques from natural language processing are employed to ensure the extraction of meaning from speech.
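This decoding step can be illustrated with a toy Viterbi search: for each position, an acoustic model proposes candidate words with scores, a bigram language model scores transitions between words, and the search keeps the best-scoring path. All candidates and probabilities below are invented for illustration.

```python
# Acoustic candidates per position: word -> P(audio | word) (invented values).
acoustic = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.5, "beach": 0.5},
]

# Bigram language model: P(next word | previous word) (invented values).
bigram = {
    ("recognize", "speech"): 0.7,
    ("recognize", "beach"): 0.1,
    ("wreck a nice", "speech"): 0.2,
    ("wreck a nice", "beach"): 0.3,
}

def decode(acoustic, bigram):
    """Viterbi search: return the most probable word sequence."""
    # Start each path with the first position's acoustic scores.
    paths = {w: (p, [w]) for w, p in acoustic[0].items()}
    for step in acoustic[1:]:
        new_paths = {}
        for word, ac_score in step.items():
            # Extend from the best previous word (0.01 = unseen-bigram floor).
            score, hist = max(
                (prob * bigram.get((prev, word), 0.01) * ac_score, hist)
                for prev, (prob, hist) in paths.items()
            )
            new_paths[word] = (score, hist + [word])
        paths = new_paths
    return max(paths.values())[1]

print(decode(acoustic, bigram))  # -> ['recognize', 'speech']
```

Here the language model tips the balance: although “speech” and “beach” are acoustically tied, “recognize speech” is far more probable as a sequence, which is exactly how reference models help pick the most likely interpretation.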

