AudioLM, the system developed by Google researchers, generates all kinds of sounds, including complex ones like piano music or people talking, in continuations that are almost indistinguishable from the initial fragment fed into it.
The technique is truly promising, and could be useful in many ways. For example, it could speed up the process of training artificial intelligence, or automatically generate music to accompany videos. But it's much more than that.
Play it again, Sam
We are already used to hearing audio generated by artificial intelligence. Those who argue every day with Alexa or Google Nest know it well: our voice assistants process natural language.
There are, to be sure, also systems trained on music: remember Jukebox by OpenAI? I told you about it here. All these systems, however, rely on long and complex "training", which involves cataloging and feeding in an enormous number of examples. Our artificial intelligences are greedy for data, and they always want more.
The next step is to make the AI "think" by enabling it to process the information it hears more quickly, without the need for long training. Something similar to what we are trying to do with self-driving systems.
How AudioLM works
To generate the audio, a few seconds of song or sound are fed into AudioLM, which literally predicts what comes next. It's not Shazam: it doesn't look up the whole song and replay it, and it doesn't make collages of sounds it has in memory. It builds them. The process is similar to the way language models like GPT-3 predict phrases and words.
The audio clips released by the Google team sound very natural. In particular, the piano music generated by AudioLM seems more fluid than that generated by current artificial intelligences. In other words, it is better at capturing the way we produce a song, or a sound.
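To make the "predict what comes next" idea concrete, here is a minimal, purely illustrative sketch in Python. Everything in it is a hypothetical stand-in: a crude quantizer plays the role of AudioLM's learned audio tokenizer, and a simple bigram counter plays the role of the Transformer language model. It is not Google's method, only the general pattern of turning audio into a sequence of discrete tokens and then sampling the continuation one token at a time, the way GPT-3 samples words.

```python
# Illustrative sketch only: discretize audio into tokens, "learn" which token tends
# to follow which, then autoregressively sample a continuation of the prompt.
import numpy as np

def quantize(waveform, n_levels=256):
    """Map audio samples in [-1, 1] to discrete token ids (a crude 'tokenizer')."""
    ids = ((waveform + 1.0) / 2.0 * (n_levels - 1)).astype(int)
    return np.clip(ids, 0, n_levels - 1)

def fit_bigram(tokens, n_levels=256, smoothing=1e-3):
    """Count which token follows which: a toy stand-in for a Transformer predictor."""
    counts = np.full((n_levels, n_levels), smoothing)
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # row i = P(next token | token i)

def continue_sequence(prompt_tokens, model, n_new=800, seed=0):
    """Sample 'what comes next', one token at a time, conditioned on the last token."""
    rng = np.random.default_rng(seed)
    seq = list(prompt_tokens)
    for _ in range(n_new):
        probs = model[seq[-1]]
        seq.append(int(rng.choice(len(probs), p=probs)))
    return np.array(seq)

if __name__ == "__main__":
    sr = 8000
    t = np.linspace(0, 1, sr, endpoint=False)
    prompt_audio = 0.8 * np.sin(2 * np.pi * 220 * t)   # one second of a pure tone as the "prompt"
    prompt_tokens = quantize(prompt_audio)
    model = fit_bigram(prompt_tokens)                  # toy model fitted on the prompt itself
    full_sequence = continue_sequence(prompt_tokens, model)
    print(f"prompt: {len(prompt_tokens)} tokens -> with continuation: {len(full_sequence)} tokens")
```

The real system works on far richer learned tokens and a far more powerful predictor, but the loop is the same: given the tokens so far, pick the most plausible next one, append it, repeat.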
“It's really impressive, also because it indicates that these systems are learning some kind of multi-layered structure,” says Roger Dannenberg, a researcher in computer-generated music at Carnegie Mellon University.
Not just a song
Imagine speaking to AudioLM: a couple of words, and that's it. The system will continue the speech, picking up your cadence, your accent, your pauses, even your breathing. In short, exactly your way of speaking. There is no need for specific training: it can do it almost by itself.
Like a parrot repeating the things it hears. Only this is a parrot capable of receiving and producing any sound, and of autonomously completing the ones left unfinished.
In summary? We will very soon (and in these cases it really means very soon) have systems that can speak much more naturally, and compose a song or a sound exactly as DALL-E 2, Midjourney and others create images, or Make-A-Video creates clips based on our input.
Who owns the rights to a song?
Even if these systems are capable of creating content almost on their own, that "almost" still makes all the difference in the world, and makes it necessary to consider the ethical implications of this technology.
If I say "Thing, make me a different ending for Bohemian Rhapsody" and this thing produces a song along those lines, who will hold the rights and collect the royalties for the song? Not to mention the fact that sounds and speech now indistinguishable from human ones are far more convincing, and open the door to an unprecedented spread of misinformation.
In the paper published to present this AI (I link it here), the researchers write that they are already considering how to mitigate these problems, for example by building in ways to distinguish natural sounds from those produced with AudioLM. I have my doubts: many of the purposes for which this AI was created would be lost.
More generally, the risk is of producing a phenomenon that I would call "distrust of reality". If everything can be true, nothing can be. Nothing has value.