AudioLM, a system developed by Google researchers, generates all kinds of sounds, including complex ones like piano music or people talking, that continue almost seamlessly from the initial fragment fed to it.
The technique is very promising and could be useful in many ways: for example, speeding up the training of artificial intelligence, or automatically generating music to accompany videos. But it is much more than that.
Play it again, Sam
We are already used to hearing audio generated by artificial intelligence. Anyone who argues with Alexa or Google Nest every day knows it well: our voice assistants process natural language.
There are, to be sure, also systems trained on music: remember Jukebox by OpenAI? I told you about it here. All these systems, however, rely on long and complex "training", which involves cataloging and feeding in vast amounts of data. Our artificial intelligences are hungry for data, and they always want more.
The next step is to make the AI "think" by enabling it to process the information it listens to more quickly, without the need for lengthy training. Something similar to what researchers are trying to do with self-driving systems.
How AudioLM works
To generate the audio, a few seconds of song or sound are fed into AudioLM, which literally predicts what comes next. It is not Shazam: it does not look up the whole piece and replay it. It does not make collages of sounds it has in memory. It builds them. The process is similar to the way language models like GPT-3 predict phrases and words.
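To see the principle behind this kind of prediction, here is a deliberately tiny sketch. AudioLM itself uses Transformers over learned semantic and acoustic tokens; the toy "model" below is just bigram counts over a made-up token sequence, and all names and data are illustrative assumptions, not the real system.

```python
# Toy illustration of autoregressive continuation, the same principle
# AudioLM applies to discretized audio tokens. The "model" here is just
# bigram statistics; the real system is a Transformer over learned tokens.
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count which token most often follows each token."""
    follow = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        follow[cur][nxt] += 1
    return follow

def continue_sequence(follow, prompt, n_steps):
    """Greedily predict what comes next, one token at a time."""
    out = list(prompt)
    for _ in range(n_steps):
        nxt_counts = follow.get(out[-1])
        if not nxt_counts:
            break  # never seen this token during "training"
        out.append(nxt_counts.most_common(1)[0][0])
    return out

# A repeating "melody" of toy audio tokens stands in for training audio.
training = [1, 2, 3, 1, 2, 3, 1, 2, 3]
model = train_bigrams(training)
print(continue_sequence(model, [1, 2], 4))  # → [1, 2, 3, 1, 2, 3]
```

The point is only that the system never copies a stored clip: given the end of the prompt, it keeps asking "what usually comes next?" and appends its own prediction, one token at a time.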
The audio clips released by the Google team sound very natural. In particular, the piano music generated by AudioLM seems more fluid than that generated by current artificial intelligences. In other words, it is better at capturing the way we produce a song, or a sound.
"It's really impressive, not least because it indicates that these systems are learning some kind of layered structure," says Roger Dannenberg, a researcher in computer-generated music at Carnegie Mellon University.
Not just a song
Imagine speaking a couple of words to AudioLM and then stopping. The system will continue the speech, picking up your cadence, your accent, your pauses, even your breathing. In summary, exactly the way you speak. There is no need for specific training: it can do it almost on its own.
Like a parrot repeating the things it hears. Only this is a parrot capable of receiving and producing any sound, and of autonomously completing those left unfinished.
In summary? We will very soon (and in this field "soon" really means soon) have systems that are able to speak much more naturally, and to compose a song or a sound, exactly as DALL·E 2, Midjourney and others create images, or Make-A-Video creates clips based on our input.
Who owns the rights to a song?
While these systems will be able to create content almost by themselves, that "almost" still makes all the difference in the world, and makes it necessary to consider the ethical implications of this technology.
If I say "So, make me a different ending for Bohemian Rhapsody" and this thing produces a song along those lines, who can claim the rights and collect the royalties? Not to mention the fact that sounds and speech now indistinguishable from human ones are far more convincing, and open the door to an unprecedented spread of disinformation.
In the document published to present this AI (I link it here), the researchers write that they are already considering how to mitigate these problems by building in ways to distinguish natural sounds from those produced with AudioLM. I have my doubts: many of the purposes for which this AI was created would be defeated.
More generally, the risk is producing a phenomenon I would call "distrust of reality". If everything can be true, nothing can be. Nothing has value.