Voice Engine, OpenAI clones a voice with just 15 seconds of audio

March 30 2024

Technology

Translating your own voice, giving speech back to patients, creating audiobooks with expressive voices: these are some of the promises of OpenAI's Voice Engine. But the path to large-scale adoption passes through an ethical and regulatory reflection on synthetic voices.

"This is my voice. Or at least, it was. Now it also belongs to an algorithm, which can make me say things I've never said." The opening of a science fiction novel? No. It is the scenario that looms with the spread of synthetic voices: technologies capable of cloning our voices from a few audio samples, such as OpenAI's Voice Engine. An ambitious project, just presented, which promises to revolutionize fields such as entertainment, education and healthcare. But it also raises disturbing questions about the control of our identity in the age of Artificial Intelligence.

The presentation of Voice Engine on OpenAI's official blog.

Stolen voices in the digital ether

In the future, your voice will no longer belong to you. In that world anyone, with a few clicks, will be able to make you say anything. Phrases never spoken, opinions never expressed, secrets never revealed. A dystopian nightmare that appears on the horizon with the advance of synthetic voices.

Mind you, the ability to generate artificial voices is nothing new. For decades there has been software capable of transforming text into speech, with more or less mechanical and unpleasant results. But the new frontiers of AI promise to change the rules of the game. Increasingly sophisticated algorithms, powered by huge datasets and neural networks, are learning to imitate the subtle nuances of human speech, approaching perfection. Timbre, intonation, rhythm, pauses: all the elements that make a voice unique and recognizable are now within the reach of machines.

Voice Engine by OpenAI is the latest incarnation of this trend. A model capable of generating realistic and natural voices starting from a very short audio sample of just 15 seconds. A small wonder (or a small horror, depending on your point of view) that opens up scenarios that were science fiction until recently.

Voice Engine: vocal symphonies or artificial cacophonies?

The potential fields of application are many and fascinating. Think about the entertainment industry: with synthetic voices, actors could lend their voices to characters in animated films or video games without spending hours in the recording studio. Voice actors could work in languages they don't know, relying on machine translation. Audiobooks could be narrated in expressive and engaging voices, representing a variety of accents and styles.

And what about healthcare? Thanks to tools like Voice Engine, patients suffering from speech or phonation problems could find a natural and personalized voice. Those who are blind or have reading difficulties could more easily access text content converted into audio. Language barriers could be broken down, with voice assistants capable of speaking fluently in any language.

Not to mention the educational potential: learn a foreign language by speaking with a synthetic but realistic voice, receive corrective feedback from a virtual tutor with your own voice, create customizable multilingual educational content. The opportunities are endless and enticing.

But every coin has its flip side.

Vocal identities in the deepfake era

The first and most obvious risk is that of misinformation and manipulation. With tools like Voice Engine for audio and Sora for video, anyone could generate fake but credible clips of public figures or private citizens. Counterfeit political speeches, invented statements, extorted confessions: fake news would find a formidable ally in synthetic voices. In an era already marked by distrust towards the media and institutions, the prospect of no longer being able to trust even what we hear with our own ears is terrifying.

Then there is the issue of privacy and control over one's biometric data. Our voice is a distinctive feature of our identity, like fingerprints or the retina. But unlike other biometric data, it is relatively easy to capture and replicate without our knowledge. A few seconds of stolen recording, perhaps from a phone call or a public video, is enough to feed an algorithm like Voice Engine. And voilà, our voice is no longer ours. It can be used, abused, decontextualized, without us being able to do much to prevent it.

Mind you, OpenAI is aware of these risks and is trying to address them with a responsible approach. Partners testing Voice Engine must adhere to strict ethical guidelines: no to imitation of real people without consent, yes to explicit permission from voice donors, maximum transparency on the artificial nature of synthetic voices. They are steps in the right direction, but they do not solve the root of the problem.

Because the problem, ultimately, is philosophical even before it is technological. It concerns our relationship with the voice as an expression of the self, as a mark of authenticity in an increasingly mediated and artificial world. It's about the value we place on individual uniqueness and autonomy, and the fear of seeing them dissolve into the blurry sea of deepfakes and fluid identities.

Voice Engine: does the future (still) have a voice?

Faced with these questions, the temptation might be a Luddite refusal: silencing synthetic voices, treating them as a "perverse" technology, taking refuge in the presumed purity of "natural" voices. But that would be a short-sighted and counterproductive reaction. Synthetic voices, like any technology, are not good or bad in themselves: it depends on how we use them.

The challenge, then, is to build an ethical and regulatory framework that directs development towards the common good. Define shared standards and protocols for the acquisition and use of voice data. Raise citizens' awareness of the risks and opportunities of synthetic voices, providing them with critical tools to orient themselves. Invest in research into reliable methods to authenticate voices and trace the origin of audio content. Promote an open and informed public debate on these issues, involving all stakeholders.

It won't be an easy or short journey. It will require vision, determination and a spirit of collaboration. But it is a necessary path, because what is at stake here is not just technological. It's existential. It concerns the very meaning of our individuality in a world in which the boundaries between real and virtual, between authentic and artificial, are becoming increasingly blurred and permeable.

A world in which our voice, the sound mirror of our soul, risks being lost in a vortex of synthetic echoes.

Gianluca Riccio, creative director of Melancia adv, copywriter and journalist. He is part of the Italian Institute for the Future, World Future Society and H+. Since 2006 he has directed Futuroprossimo.it, the Italian Futurology resource.

