AI and machine learning algorithms are getting better at predicting actions in videos.
The best of current algorithms can predict quite precisely where a baseball will go after it has been thrown, or the appearance of a road in the sequence to come. In other words? Predicting frames in the future of a movie.
A new approach proposed by researchers from Google, the University of Michigan and Adobe advances the state of the art with large-scale models that generate high-quality videos from a few frames.
“With this project we aim to obtain precise video forecasts. We will optimize the capabilities of a neural network ", the researchers wrote in a document which describes their work.
The team model
The team's basic model is based on a stochastic video generation architecture, with a component that manages the predictions of the frames following those considered.
The team trained and tested different versions of the model separately from custom datasets based on three forecast categories: interactions between objects, structured movement and partial observability.
For the first task (interactions with objects) the researchers selected 256 clips from a block of videos showing a robotic arm while interacting with towels.
For the second (structured movement) they edited clips from Human 3.6M, a block containing clips of humans performing actions like sitting in a chair.
As for the third (partial observability activity), used an open source KITTI driving dataset collected from footage of cameras mounted on car dashboards.
After this "training," the AI model generated up to 25 frames in the future.
Researchers report that "predictions" were preferred 90,2, 98,7% and 99,3% of the time by evaluators to the three types of video respectively: interactions between objects, structured movement and partial observability tasks, respectively.
Qualitatively, the team notes that AI has clearly represented human arms and legs and done "Very precise predictions that seemed realistic compared to the scenes represented in the video" .
"We found that maximizing the capacity of these models improves the quality of video prediction," coauthors write. We hope that our work will encourage the field to move in similar directions in the future. For example, to see how far we can go ”.