Artificial intelligence has been the topic of recent months: it has just set off an explosion whose full effects we will see only over the next few years.
Microsoft is among those putting wind in this technology's sails: it recently used AI to improve the functionality of its apps, and it could now invest as much as 10 billion dollars in OpenAI, the creator of ChatGPT. But today I got word of another Microsoft project, VALL-E, which is incredible.
This state-of-the-art tool has been trained on a vast amount of speech data: over 60,000 hours of English speech. According to the Redmond company, that makes its data set "hundreds of times larger than existing systems", including the most advanced ones.
And what did VALL-E learn to do? Oh, nothing much, a trifle: it reproduces and imitates anyone's voice perfectly after listening to it for just three seconds.
A voice replicator?
It's not just that. VALL-E is a real revolution in the field of voice AI: it reproduces with extraordinary precision the emotions, vocal tones and acoustic environment present in a given sample, a giant leap forward compared to existing text-to-speech (TTS) systems. In other words, VALL-E's voice sounds much more like a human being than like an artificial intelligence.
On his LinkedIn profile, the digital strategist Alberto Giacobone links to a small library of voice samples created by VALL-E and published on GitHub. The results are amazing: many of the clips reproduce the intonation and accent of the speakers' voices perfectly.
Some examples are less convincing, which shows that VALL-E is not yet a finished product. Overall, however, the output is so convincing that it blows our minds.
Big risks, big potential
This technology clearly raises concerns about potential misuse, such as identity theft. VALL-E will be able to create voice deepfakes indistinguishable from real voices, which could be used to deceive people in many situations and ways.
To counter this threat, in the document presenting VALL-E Microsoft says it is working on a detection model that can distinguish a real voice from a synthetic one.
Despite the (big) risks, however, tools like VALL-E could be particularly useful for helping people regain their voice after an accident, for effortlessly creating more natural podcasts and audiobooks and… as always, the only limit is imagination.