AI and machine learning algorithms are getting better at predicting actions in videos.
The best current algorithms can predict quite precisely where a baseball will go after it has been thrown, or how a road will look in the frames to come. In other words, they predict future frames of a video.
A new approach proposed by researchers at Google, the University of Michigan, and Adobe advances the state of the art with large-scale models that generate high-quality videos from just a few input frames.
“With this project we aim to obtain precise video forecasts by optimizing the capacity of a neural network,” the researchers wrote in the paper describing their work.
The team's model
The team's base model builds on a stochastic video generation architecture, with a recurrent component that predicts the frames that follow the ones it has already observed.
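The article does not include the model's code; purely as a rough illustration, a stochastic predictor of this kind combines a deterministic recurrent core with a sampled latent variable. Every name, dimension, and weight below is invented for this toy sketch and is not the researchers' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a "frame" is a flat 64-dim vector, the hidden state is
# 32-dim, and the stochastic latent z is 8-dim. Real models use
# convolutional encoders/decoders over image tensors.
F, H, Z = 64, 32, 8

# Randomly initialized weights stand in for learned parameters.
W_enc = rng.standard_normal((H, F)) * 0.1       # frame -> hidden
W_rec = rng.standard_normal((H, H + Z)) * 0.1   # [hidden, z] -> next hidden
W_dec = rng.standard_normal((F, H)) * 0.1       # hidden -> predicted frame

def predict_next(frame, z):
    """One stochastic prediction step: encode the current frame,
    mix in a sampled latent z, and decode the next frame."""
    h = np.tanh(W_enc @ frame)
    h_next = np.tanh(W_rec @ np.concatenate([h, z]))
    return W_dec @ h_next

frame = rng.standard_normal(F)
z = rng.standard_normal(Z)   # sampled from a unit-Gaussian prior in this toy
next_frame = predict_next(frame, z)
print(next_frame.shape)  # (64,)
```

Because z is sampled, repeated calls with different draws yield different but plausible continuations, which is what "stochastic" buys over a purely deterministic predictor.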
The team separately trained and tested different versions of the model on custom datasets covering three forecasting categories: object interactions, structured motion, and partial observability.
For the first task (object interactions), the researchers selected 256 clips from a corpus of videos showing a robot arm interacting with towels.
For the second (structured motion), they drew clips from Human 3.6M, a corpus of videos of people performing actions such as sitting in a chair.
For the third (partial observability), they used KITTI, an open-source dataset of driving footage collected from cameras mounted on car dashboards.
After training, the model generated predictions up to 25 frames into the future.
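Generating many frames from a few observed ones typically works autoregressively: each predicted frame is fed back as input for the next step. A minimal sketch of that rollout loop, with an invented stand-in for the trained one-step predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
F = 64  # toy flat "frame" size; real frames are image tensors
W = rng.standard_normal((F, F)) * 0.05  # stand-in for trained weights

def predict_next(frame):
    # Hypothetical one-step predictor; a real model would be the
    # trained recurrent network described above.
    return np.tanh(W @ frame)

# A few observed "context" frames condition the rollout.
context = [rng.standard_normal(F) for _ in range(5)]

# Autoregressive rollout: feed each prediction back as input
# until 25 future frames have been generated.
frame = context[-1]
future = []
for _ in range(25):
    frame = predict_next(frame)
    future.append(frame)

print(len(future))  # 25
```

Small one-step errors compound over such a rollout, which is why long-horizon prediction (25 frames here) is a meaningful benchmark.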
The researchers report that human raters preferred its predictions 90.2%, 98.7%, and 99.3% of the time on the object interaction, structured motion, and partial observability tasks, respectively.
Qualitatively, the team notes that the model clearly depicted human arms and legs and made “very precise predictions that seemed realistic compared to the scenes depicted in the video.”

“We found that maximizing the capacity of such models improves the quality of video prediction,” the coauthors write. “We hope that our work encourages the field to push in similar directions in the future; for example, to see how far we can go.”
