Artificial intelligence and machine learning algorithms that can read lips from video are actually nothing extraordinary.
In 2016, researchers from Google and the University of Oxford detailed a system capable of lip-reading and annotating video with 46.8% accuracy. Does that sound low? It already outperformed the 12.4% accuracy of a professional human lip reader. And LIBS did not exist yet.
However, 46.8% is not up to par with what artificial intelligence can do today. State-of-the-art systems still struggle with ambiguities in lip movements, which keeps their performance below that of audio-based speech recognition.
In search of a better-performing system, researchers from Alibaba, Zhejiang University and the Stevens Institute of Technology devised a method dubbed Lip-by-Speech (LIBS), which uses features extracted from speech recognizers as complementary cues. The system raises the bar by a further 8%, and there is still room for improvement.
LIBS and other similar solutions may help hearing-impaired people follow videos without subtitles. An estimated 466 million people worldwide suffer from hearing loss, about 5% of the world's population. By 2050, the number could rise to over 900 million, according to the World Health Organization.
The AI method for lip reading
LIBS derives useful information from the audio in several steps. Like a skilled cryptographer, the AI hunts for recognizable words in the speech. It then matches them to the corresponding lip movements and searches for all similar lip shapes. And it does not stop there: it also compares the timing of the video frames and other technical cues, refining the search until it can read lips even for words our ears cannot make out.
If it sounds complicated, read it again; I can't promise it will help.
Quoting from the paper presenting the technology: "Both the speech recognizer and the LIBS lip-reader component are based on an attention-based sequence-to-sequence architecture, a method from machine translation that maps an input sequence (audio or video) to an output sequence."
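To make the quoted architecture concrete, here is a minimal sketch of the attention step at the heart of such models. It is not LIBS itself, just the generic mechanism, under assumed toy dimensions: the decoder scores every encoded input frame against its current state and takes a weighted average as context.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Scaled dot-product attention over one sequence.
    Score each encoder state against the decoder state, softmax
    the scores over time, and return the weighted sum (context)
    along with the attention weights."""
    scores = encoder_states @ decoder_state          # (T,) similarity scores
    scores = scores / np.sqrt(decoder_state.size)    # scale for stability
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                # softmax over time steps
    context = weights @ encoder_states               # (D,) weighted sum
    return context, weights

# Hypothetical example: 5 encoded frames (audio or video) of dim 4.
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 4))    # encoder outputs
dec = rng.normal(size=4)         # current decoder hidden state
ctx, w = attention_context(dec, enc)
```

At each decoding step the model re-computes these weights, which is what lets it "look at" different input frames while emitting each output token.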
Researchers trained the AI on a first dataset containing over 45,000 sentences spoken in BBC broadcasts, and on CMLR, the largest available corpus for lip reading in Mandarin Chinese, with over 100,000 natural sentences.
The fields of application are not limited to helping the deaf. The habit of attributing a "socially noble" use to every technology must never make us forget that the main use of such technologies is in the military and security sectors.
Has nobody considered that this system could make security surveillance even more infallible and pervasive, through amazing new security cameras or new satellite systems?
With AI having become an omniscient eye, it will be child's play to listen to (or reconstruct) our whispers even from an orbiting satellite.