When high-quality images produced by generative AI first began to appear in 2022, they had an undeniable wow factor. The creative process involved little more than entering a text description and waiting for the AI system to produce a relevant image.
At that time, an obvious question was when AI-generated video would catch up. Indeed, various groups have since unveiled AI systems that automatically generate video, but always with important limits on the length of the clips, the range of realistic motion they can portray and their overall quality.
One way to solve these problems is with brute force, by throwing more computing power at the task. But that extra computing power significantly increases costs. So the search has been on for more efficient and more capable approaches.
Now Google says it has developed just such a technique that dramatically improves the efficiency of video synthesis. Omer Bar-Tal and colleagues at Google say their new system, called Lumiere, produces videos that portray realistic, diverse, and coherent motion.
“We demonstrate state-of-the-art video generation results and show how to easily adapt Lumiere to a plethora of video content creation tasks, including video inpainting, image-to-video generation, or generating stylized videos that comply with a given style image,” they say.
One common approach to AI video synthesis is to first generate several key frames in a video sequence and then to use these images to generate the missing frames in between.
Breaking down the task in this way has the advantage of simplifying the computational requirements but it also has drawbacks. In particular, these systems have difficulty rendering rapid motion that takes place between the key frames.
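The failure mode of the keyframe approach can be seen in a toy sketch. This is not Google's or anyone else's actual pipeline, just a minimal numpy illustration of why blending between two keyframes cannot recover fast motion that happens between them:

```python
import numpy as np

# Toy illustration (not real video-synthesis code): the keyframe-then-
# interpolate baseline. A bright pixel stands in for an object moving fast
# across a 1-D strip. We keep only two keyframes and fill the gap by linear
# blending; instead of the object appearing mid-way, we get two faint ghosts.

def frame(pos, width=16):
    f = np.zeros(width)
    f[pos] = 1.0          # a single bright pixel as the moving object
    return f

key_a, key_b = frame(2), frame(12)   # keyframes: object at positions 2 and 12

# Naive in-between frame: a 50/50 blend of the two keyframes.
mid = 0.5 * key_a + 0.5 * key_b

# The true middle frame would have one object at position 7; the blend
# instead leaves ghosts at the two keyframe positions.
print(np.flatnonzero(mid))  # -> [ 2 12]
```

Real interpolation networks are far more sophisticated than a linear blend, but the underlying problem is the same: information about motion between the keyframes simply is not present in the keyframes themselves.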
Bar-Tal and co have come up with a different approach that synthesizes the entire video at once. They do this by training an AI system to treat the dimensions of time and space in the same way, so that a single space-time representation captures the whole video output.
This is in stark contrast to previous efforts, which are trained only on spatial changes while maintaining a fixed temporal resolution. Google’s space-time representation is significantly more compact and therefore more computationally efficient. “Surprisingly, this design choice has been overlooked by previous text-to-video models,” say Bar-Tal and co.
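The compactness argument can be made concrete with a hypothetical numpy sketch. Treating a video as a (time, height, width) array, downsampling along all three axes shrinks the working representation far more than downsampling the spatial axes alone, which is what a fixed-temporal-resolution design does. The pooling function and sizes below are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Hypothetical sketch of the space-time idea: treat time like a spatial axis.
# A video is a (T, H, W) array; average-pooling across all three axes gives
# a much more compact representation than pooling H and W alone while
# keeping every frame at full temporal resolution.

def pool(video, factors):
    """Average-pool a (T, H, W) array by the given per-axis factors."""
    t, h, w = video.shape
    ft, fh, fw = factors
    return video[:t - t % ft, :h - h % fh, :w - w % fw] \
        .reshape(t // ft, ft, h // fh, fh, w // fw, fw) \
        .mean(axis=(1, 3, 5))

video = np.random.rand(80, 128, 128)          # 80 frames of 128x128 pixels

spatial_only = pool(video, (1, 4, 4))         # fixed temporal resolution
space_time   = pool(video, (4, 4, 4))         # time downsampled as well

print(spatial_only.shape, space_time.shape)   # (80, 32, 32) (20, 32, 32)
print(spatial_only.size // space_time.size)   # 4x fewer values to process
```

In a real system the downsampling is learnt inside a neural network rather than done by average pooling, but the bookkeeping is the same: every factor of temporal downsampling is a matching factor saved in computation.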
A key part of this process is a well-known AI technique called diffusion that is widely used to produce single images. The AI system begins with a frame consisting entirely of noise, which it progressively modifies to match a data distribution it has learnt, whether this be associated with a cat, a dog or an astronaut riding a bicycle on Mars.
Lumiere works in the same way. But instead of producing a single image that matches a specific data distribution, it creates a sequence of up to 80 images or, more precisely, a representation of these images in space-time.
The AI then modifies this representation to match a data distribution the system has learnt from its training on millions of hours of video footage. It then unpackages the space-time representation into an ordinary video.
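The denoising loop at the heart of diffusion can be sketched in a few lines. This is a deliberately minimal stand-in, assuming a known target distribution in place of the denoising network a real system learns from training footage; everything here, including the step size and step count, is an illustrative assumption:

```python
import numpy as np

# Minimal, hypothetical sketch of the diffusion idea described above:
# start from pure noise and repeatedly nudge the sample toward a learnt
# target. Here the "model" simply knows the target outright -- a stand-in
# for the denoiser a real system trains on video footage.

rng = np.random.default_rng(0)

target = np.full((8, 16, 16), 0.5)    # stand-in "data": 8 frames of 16x16
x = rng.standard_normal(target.shape) # step 0: a space-time block of noise

for step in range(50):                # progressive denoising
    predicted_clean = target          # real models predict this with a network
    x = x + 0.1 * (predicted_clean - x)  # move a little toward the prediction

# After 50 small steps the noise block has collapsed onto the target.
print(float(abs(x - target).mean()))  # close to zero
```

A real diffusion model replaces the line `predicted_clean = target` with a large neural network conditioned on the text prompt, and follows a carefully designed noise schedule rather than a fixed step size, but the iterate-from-noise structure is the same.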
The result is a five-second video sequence, a length that Google says is longer than the average shot duration in most media.
The results are impressive. Given a text description like “A panda playing a ukulele at home” or “Flying through a temple in ruins, epic, mist”, Lumiere produces a high-quality video sequence showing, well, just these things.
It can also start with an image and animate it by request. Bar-Tal and co use the famous Vermeer painting Girl with a Pearl Earring and make Lumiere animate it to show the girl winking and smiling.
Give Lumiere a reference image, such as Van Gogh’s Starry Night, and it will produce a video in the same style. Give it a video of, for example, a girl running, and it can modify it to make the girl look as if she is made of flowers or stacked wooden blocks. Bar-Tal and co post numerous examples of Lumiere’s capabilities online.
That’s impressive work and raises the obvious question of how soon this will be available to ordinary consumers and at what cost. Google gives no answer at present.
But the team hint at potential problems that will need to be addressed in due course. It’s not hard to imagine how malicious actors could use such technology to create deepfakes on an epic scale and Bar-Tal and co are clearly concerned.
“There is a risk of misuse for creating fake or harmful content with our technology, and we believe that it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure a safe and fair use,” they say.
They are less clear about who is, or should be, developing such detection tools. That effort may well need a real-world incident to force the issue.
But in the absence of such controls, the effects are already spreading. This year’s elections in the U.S., the U.K. and India, the world’s biggest democracy, are becoming a testing ground for the way these technologies can be exploited.
The role that Lumiere and other similar systems will play has yet to be determined.
Ref: Lumiere: A Space-Time Diffusion Model for Video Generation: arxiv.org/abs/2401.12945