AI and machine learning algorithms are getting better at predicting actions in videos.
The best current algorithms can predict quite precisely where a baseball will go after it has been thrown, or how a road will look in the frames to come. In other words, they predict the future frames of a video.
A new approach proposed by researchers at Google, the University of Michigan, and Adobe advances the state of the art with large-scale models that generate high-quality video from just a few frames.
“With this project we aim to obtain precise video forecasts. We will optimize the capabilities of a neural network,” the researchers wrote in a paper describing their work.
The team's model
The team's basic model builds on a stochastic video generation architecture, with a component that predicts upcoming frames conditioned on the frames already observed.
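The core loop of a stochastic, recurrent frame predictor can be sketched as follows. This is a minimal toy illustration, not the researchers' actual architecture: the layer sizes, the randomly initialized weights, and the helper name `predict_next_frame` are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen for illustration only
FRAME_DIM, HIDDEN_DIM, LATENT_DIM = 16, 8, 4

# Randomly initialized weights stand in for trained parameters
W_enc = rng.normal(size=(FRAME_DIM, HIDDEN_DIM)) * 0.1
W_rec = rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM)) * 0.1
W_lat = rng.normal(size=(LATENT_DIM, HIDDEN_DIM)) * 0.1
W_dec = rng.normal(size=(HIDDEN_DIM, FRAME_DIM)) * 0.1

def predict_next_frame(frame, hidden):
    """One step of a stochastic recurrent predictor: encode the
    current frame, sample a latent variable to model uncertainty,
    update the recurrent state, and decode the next frame."""
    encoded = np.tanh(frame @ W_enc)
    z = rng.normal(size=LATENT_DIM)      # stochastic latent sample
    hidden = np.tanh(encoded + hidden @ W_rec + z @ W_lat)
    next_frame = hidden @ W_dec
    return next_frame, hidden

# Condition on a few observed context frames first
context = [rng.normal(size=FRAME_DIM) for _ in range(2)]
hidden = np.zeros(HIDDEN_DIM)
for f in context:
    frame, hidden = predict_next_frame(f, hidden)

# Then roll the model forward, feeding each prediction back in
predictions = []
for _ in range(25):                      # predict 25 frames ahead
    frame, hidden = predict_next_frame(frame, hidden)
    predictions.append(frame)

print(len(predictions))  # 25
```

Because a fresh latent is sampled at every step, rerunning the rollout produces a different plausible future each time, which is the point of the stochastic component.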
The team trained and tested separate versions of the model on custom datasets covering three forecasting categories: object interactions, structured motion, and partial observability.
For the first task (object interactions), the researchers selected 256 clips from a dataset of videos showing a robotic arm interacting with towels.
For the second (structured motion), they edited clips from Human 3.6M, a dataset containing clips of humans performing actions such as sitting in a chair.
As for the third (partial observability), they used the open-source KITTI driving dataset, collected from cameras mounted on car dashboards.
After this “training,” the AI model generated up to 25 frames into the future.
The researchers report that raters preferred the model's predictions 90.2%, 98.7%, and 99.3% of the time across the three video types: object interactions, structured motion, and partial observability, respectively.
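Percentages like these typically come from pairwise comparisons, where raters pick the clip they prefer. A quick sketch of how such a preference rate is tallied; the vote counts below are hypothetical, not the study's raw data:

```python
# Hypothetical rater votes comparing the model's clips against a
# baseline; the counts are invented for illustration, only the
# percentage formula itself is general.
votes_for_model = 902
total_comparisons = 1000

preference_rate = 100 * votes_for_model / total_comparisons
print(f"Raters preferred the model {preference_rate:.1f}% of the time")
```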
Qualitatively, the team notes that the AI crisply depicted human arms and legs and made “very precise predictions that seemed realistic compared to the scenes depicted in the video.”
“We found that maximizing the capacity of such models improves the quality of video prediction,” the coauthors write. “We hope that our work encourages the field to push in similar directions in the future, for example, to see how far we can go.”