AI and machine learning algorithms are getting better at predicting actions in videos.
The best current algorithms can predict quite precisely where a baseball will go after it has been thrown, or what a road will look like in the frames to come. In other words, they predict the future frames of a video.
A new approach proposed by researchers at Google, the University of Michigan and Adobe advances the state of the art with large-scale models that generate high-quality video from just a few frames.
“With this project we aim to obtain accurate video forecasts. We will optimize the capabilities of a neural network,” the researchers wrote in a paper describing their work.
The team's model
The team's core model is based on a stochastic video generation architecture, with a component that handles prediction of the frames that follow the observed ones.
The team trained and tested different versions of the model separately on custom datasets built around three prediction categories: interactions between objects, structured movement and partial observability.
For the first task (interactions between objects), the researchers selected 256 clips from a collection of videos showing a robotic arm interacting with towels.
For the second (structured movement), they edited clips from Human3.6M, a dataset containing clips of humans performing actions such as sitting in a chair.
For the third (partial observability), they used the open source KITTI driving dataset, collected from footage of cameras mounted on car dashboards.
After this training, the AI model generated up to 25 frames into the future.
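To make the idea of generating frames "into the future" concrete, here is a minimal sketch of an autoregressive rollout, in which each predicted frame is fed back in as context for the next one. It is not the researchers' model: `predict_next_frame` is a hypothetical stand-in that extrapolates from the last two frames and adds random noise, and the four-value "frames" are toy data chosen only for illustration.

```python
import random


def predict_next_frame(context, rng):
    """Hypothetical stand-in for a learned stochastic predictor.

    Extrapolates linearly from the last two frames and adds a little
    Gaussian noise, so rollouts with different seeds differ.
    """
    prev, last = context[-2], context[-1]
    return [2 * b - a + rng.gauss(0.0, 0.01) for a, b in zip(prev, last)]


def rollout(context_frames, horizon=25, seed=0):
    """Generate `horizon` future frames, feeding each prediction back in."""
    rng = random.Random(seed)
    frames = [list(f) for f in context_frames]
    for _ in range(horizon):
        frames.append(predict_next_frame(frames, rng))
    # Return only the newly generated frames, not the observed context.
    return frames[len(context_frames):]


# Two tiny 4-value "frames" stand in for the observed context (toy data).
context = [[0.0, 0.1, 0.2, 0.3], [0.1, 0.2, 0.3, 0.4]]
future = rollout(context, horizon=25)
print(len(future))  # 25
```

A real system would replace the stand-in predictor with a trained neural network, but the loop structure, a few observed frames in and a sequence of generated frames out, is the same general pattern.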
The researchers report that evaluators preferred the predictions 90.2%, 98.7% and 99.3% of the time on the three types of video, respectively: interactions between objects, structured movement and partial observability.
Qualitatively, the team notes that the AI cleanly rendered human arms and legs and made "very precise predictions that seemed realistic compared to the scenes represented in the video."
"We found that maximizing the capacity of such models improves the quality of the video prediction," the coauthors write. "We hope our work will encourage the field to move in similar directions in the future. For example, to see how far we can go."