Thanks to new algorithms and advances in computing, machines can now learn increasingly complex models. They can even generate high-quality synthetic data such as photorealistic images and résumés of imaginary humans.
Now a study published in the international journal PLOS Genetics shows an advanced use of machine learning on genomic data. Drawing on existing biobanks, the system generates entire chunks of human genomes that do not belong to real humans but have the characteristics of real genomes.
Bypassing the privacy issue
“Existing genomic databases are an invaluable resource for biomedical research,” says Burak Yelmen, first author of the study and Junior Research Fellow of Modern Population Genetics at the University of Tartu. “The problem is that they are either not publicly accessible or protected by lengthy and exhausting application procedures due to valid ethical concerns. This creates an important scientific barrier for researchers. A genome generated by machines, an ‘artificial genome’, can help us overcome the problem within a safe ethical framework.”
The multidisciplinary team performed multiple analyses to assess the quality of the machine-generated genomes compared to real ones. “Surprisingly, these genomes mimic the complexities that we can observe within real human populations and, for most properties, they are indistinguishable from the other genomes of the biobank used to train our algorithm. Except for one detail: they don't belong to any gene donor,” said Dr. Luca Pagani, one of the study's senior authors and a Mobilitas Pluss fellow.

Is it really an original genome or a "spitting image" copy?
The study also assesses how close the artificial genomes are to the real ones, to verify whether the privacy of the original samples is preserved. “While detecting privacy leaks across thousands of genomes might seem like a search for a needle in a haystack, the combination of multiple statistical measures allows us to closely monitor all models. Interestingly, the detailed exploration of complex dispersion patterns leads in turn to other improvements in the evaluation of GANs, and it will feed into the field of machine learning,” says Dr. Flora Jay, study coordinator and researcher at the CNRS (French National Center for Scientific Research).
All in all, machine learning approaches had already provided faces, biographies and many other features to a handful of imaginary human beings. We now know more about their biology as well. These fictional humans with realistic genomes could serve as a test bed in place of real genomes that are not publicly available.