Artificial intelligence and machine learning algorithms that can read lips from video are, in truth, nothing extraordinary.
In 2016, researchers at Google and the University of Oxford detailed a system that could lip-read and annotate footage with 46.8% accuracy. Does that seem low? It already surpassed the 12.4% accuracy of a professional human lip reader. And LIBS did not exist yet.
However, 46.8% is not up to par with what artificial intelligence can show today. State-of-the-art systems still struggle with ambiguities in lip movements, which keeps their performance below that of audio-based speech recognition.
In search of a better-performing system, researchers from Alibaba, Zhejiang University and the Stevens Institute of Technology devised a method dubbed Lip-by-Speech (LIBS), which uses features extracted from speech recognizers as complementary cues. The system raises the bar by a further 8%, and there is still room for improvement.
LIBS and other similar solutions may help hearing-impaired people follow videos without subtitles. An estimated 466 million people worldwide suffer from hearing loss, equivalent to approximately 5% of the world's population. By 2050, the number could rise to more than 900 million, according to the World Health Organization.
The AI method for reading lips
LIBS derives useful information from the audio at several levels. Like a skilled cryptographer, the AI first hunts for intelligible words; it then matches them against the corresponding lip movements and searches for similar lip shapes. But it doesn't stop there: it also compares the timing of the video frames and other technical cues, refining the search to the point of reading lips even on words that are incomprehensible to our ears. A rough sketch of this idea follows below.
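To make the idea concrete, here is a minimal sketch of that multi-level matching in Python (PyTorch). Everything in it is an assumption for illustration: the function name, the mean pooling, and the nearest-frame alignment are crude stand-ins for the paper's attention-based distillation, not the authors' actual code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: pull the lip reader's video features toward the
# speech recognizer's audio features at two granularities, in the spirit
# of LIBS. Tensor shapes and the alignment rule are illustrative only.

def distillation_loss(audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
    # audio_feats: (T_audio, D) features from the speech recognizer (teacher)
    # video_feats: (T_video, D) features from the lip reader (student)

    # Sequence level: match whole-utterance embeddings via mean pooling.
    seq_loss = F.mse_loss(video_feats.mean(dim=0), audio_feats.mean(dim=0))

    # Frame level: align each video frame with its most similar audio frame,
    # a crude stand-in for the attention-based alignment described above.
    similarity = video_feats @ audio_feats.t()      # (T_video, T_audio)
    nearest = similarity.argmax(dim=1)              # best audio frame per video frame
    frame_loss = F.mse_loss(video_feats, audio_feats[nearest])

    return seq_loss + frame_loss

# Dummy usage: 120 audio frames and 75 video frames, 256-dim features.
loss = distillation_loss(torch.randn(120, 256), torch.randn(75, 256))
```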
If that sounds complicated, read it again; I can't promise it will help.
I quote from the paper presenting the technology: “Both the speech recognizer and lip reader components of LIBS are based on an attention-based sequence-to-sequence architecture, a machine translation method that maps an input sequence (audio or video) to an output sequence.”
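To unpack the quoted description, here is a minimal sketch of an attention-based sequence-to-sequence model in Python (PyTorch). It is not the paper's architecture: the layer sizes, the dot-product attention, and the greedy decoding are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical skeleton of an attention-based sequence-to-sequence model.
# An encoder consumes a sequence of audio or video features; a decoder
# attends over the encoder outputs and emits one token per step.

class AttentionSeq2Seq(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRUCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim * 2, vocab_size)

    def forward(self, frames: torch.Tensor, max_len: int = 20) -> torch.Tensor:
        # frames: (batch, time, input_dim) audio or video features.
        enc, h = self.encoder(frames)                 # enc: (batch, time, hidden)
        state = h[-1]                                 # initial decoder state
        tokens = []
        for _ in range(max_len):
            # Dot-product attention over all encoder timesteps.
            scores = torch.bmm(enc, state.unsqueeze(2)).squeeze(2)
            context = (scores.softmax(dim=1).unsqueeze(2) * enc).sum(dim=1)
            state = self.decoder(context, state)
            logits = self.out(torch.cat([state, context], dim=1))
            tokens.append(logits.argmax(dim=1))       # greedy decoding
        return torch.stack(tokens, dim=1)             # (batch, max_len) token ids

# Dummy usage: two utterances of 50 frames with 80-dim features.
model = AttentionSeq2Seq(input_dim=80, hidden_dim=256, vocab_size=1000)
predicted_ids = model(torch.randn(2, 50, 80))
```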
The researchers trained the AI on an initial database containing over 45,000 sentences from BBC videos, and on CMLR, the largest available corpus for lip reading in Mandarin Chinese, with over 100,000 natural sentences.
The fields of application are not limited to aids for the deaf. The custom of attributing a "socially noble" use to every technology must never make us forget that the main use of these technologies is in the military and security sectors.
Has nobody considered that this system could make security surveillance even more infallible and pervasive, through amazing new security cameras or new satellite systems?
With AI becoming an omniscient eye, it will be trivial to listen to (or reconstruct) our whispers even from an orbiting satellite.