The short videos give the impression of a flipbook, jumping shakily from one surreal frame to the next. They’re the result of internet meme-makers playing with the first widely available text-to-video AI generators, and they depict impossible scenarios like Dwayne “The Rock” Johnson eating rocks and French president Emmanuel Macron sifting through and chewing on garbage, or warped versions of the mundane, like Paris Hilton taking a selfie.
This new wave of AI-generated videos has definite echoes of Dall-E, which swept the internet last summer when it performed the same trick with still images. Less than a year later, those wonky Dall-E images are almost indistinguishable from reality, raising two questions: Will AI-generated video advance as quickly, and will it have a place in Hollywood?
ModelScope, a video generator developed by Alibaba and hosted on Hugging Face's platform, allows people to type a few words and receive a startling, wonky video in return. Runway, the AI company that cocreated the image generator Stable Diffusion, announced a text-to-video generator in late March, but it has not made the tool widely available to the public. And Google and Meta both announced they were working on text-to-video tech in fall 2022.
Right now, it’s jarring celebrity videos or a teddy bear painting a self-portrait. But in the future, AI’s role in film could evolve beyond the viral meme, allowing tech to help cast movies, model scenes before they’re shot, and even swap actors in and out of scenes. Though the technology is advancing rapidly, it will likely take years before such generators could, say, produce an entire short film from prompts, if they’re ever able to. Still, AI’s potential in entertainment is massive.
“The way Netflix disrupted how and where we watch content, I think AI is going to have an even bigger disruption on the actual creation of that content itself,” says Sinead Bovell, a futurist and founder of tech education company WAYE.
But that doesn’t mean AI will entirely replace writers, directors, and actors anytime soon. And some sizable technical hurdles remain. The videos look jumpy because the AI models can’t yet maintain coherence from one frame to the next, which is what makes video look smooth. Making content that stays consistent for longer than a few fascinating, grotesque seconds will require more computing power and data, which means big investments in the tech’s development. “You can’t easily scale up these image models,” says Bharath Hariharan, a professor of computer science at Cornell University.
But even if the results look rudimentary, these generators are advancing “really, really fast,” says Jiasen Lu, a research scientist at the Allen Institute for Artificial Intelligence, a research organization founded by the late Microsoft cofounder Paul Allen.
The speed of progress is the result of new developments that bolstered the generators. ModelScope is trained on text and image data, as image generators are, and then fed videos that show the model how movement should look, says Apolinário Passos, a machine-learning art engineer at Hugging Face. Meta is using the same tactic. This approach removes the burden of annotating videos, or labeling them with text descriptions, which simplifies the process and has accelerated the tech’s development.