Artificial intelligence is the topic of the moment: it has just begun an explosive growth whose full effects we will see only over the next few years.
Microsoft is also putting wind in the sails of this technology: it recently used AI to improve its apps, and now it could invest as much as 10 billion dollars in OpenAI, the company that created ChatGPT. Today, however, I want to talk about another Microsoft project, VALL-E, which is incredible.
This cutting-edge tool has been trained on a vast amount of voice data: over 60,000 hours of English speech. That makes its training set, according to the Redmond company, "hundreds of times larger than existing systems", including the most advanced ones.
And what did VALL-E learn to do? Oh, just a trifle: it reproduces and imitates anyone's voice with striking accuracy after listening to it for just three seconds.
A voice replicator?
Not only that. VALL-E is a real revolution in the field of voice AI: it reproduces with extraordinary precision the emotion, tone of voice and acoustic environment present in a given sample, a giant step forward compared to existing text-to-speech (TTS) systems. In other words, VALL-E's voice sounds much more like a human being than like an artificial intelligence.
On his LinkedIn profile, digital strategist Alberto Giacobone links to a small library of voice samples created by VALL-E and published on GitHub. The results are surprising: in many clips the intonation and accent of the speakers' voices are perfectly reproduced.
Some examples are less convincing, which shows that VALL-E is not yet a finished product. Overall, however, the output is so good that it leaves us speechless.
Big risks, big potential
It is clear that this technology raises concerns about misuse, such as identity theft. VALL-E could be used to create voice deepfakes indistinguishable from real voices, and those could deceive people in many ways.
To counter this threat, in the paper presenting VALL-E Microsoft says it is working on a detection model that can distinguish a real voice from a synthetic one.
Despite the (big) risks, however, tools like VALL-E could be particularly useful for helping people regain their voice after an accident, for effortlessly creating more natural-sounding podcasts and audiobooks and… as always, the only limit is your imagination.