
Tech giants are producing AI that can create videos. Will we soon struggle to tell what’s real?

Both Meta and Google have shown off the video equivalent of DALL-E: text-to-video AI models that can create photorealistic, coherent videos.
Cam Wilson
VIDEOS CREATED FROM TEXT PROMPTS BY ARTIFICIAL INTELLIGENCE (IMAGE: GOOGLE)

Artificial intelligence (AI) researchers are showing off the technology’s latest leap forward by demonstrating models capable of creating realistic, coherent videos from a text prompt, raising questions about whether AI’s rapid advancement will soon threaten our ability to tell what is real and what is not.

In the last week, Meta (formerly known as Facebook) and Google have each showcased “text-to-video” AI systems that can create new, unique videos with high-quality graphics based on anything from a few words to a long, intricate sentence.

Late last month, Meta first showed off its Make-a-Video system, which, on top of its text-to-video ability, can also animate still images. Just a week later, Google unveiled two projects of its own, Imagen Video and Phenaki.

Meta’s model can produce videos with photorealistic graphics of subjects carrying out actions and interacting with objects — like a realistic video of a young couple walking in the rain or a surreal teddy bear painting a portrait.

Google’s competitor Imagen Video is similar. Phenaki, on the other hand, doesn’t have quite the same visual quality, but it can turn long prompts into dream-like videos several minutes long. One example:

Lots of traffic in futuristic city. An alien spaceship arrives to the futuristic city. The camera gets inside the alien spaceship. The camera moves forward until showing an astronaut in the blue room. The astronaut is typing in the keyboard.

The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves beyond the astronaut and looks at the screen. The screen behind the astronaut displays fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points up to the sky through the water.

The ocean and the coastline of a futuristic city. Crash zoom towards a futuristic skyscraper. The camera zooms into one of the many windows. We are in an office room with empty desks. A lion runs on top of the office desks. The camera zooms into the lion’s face, inside the office.

Zoom out to the lion wearing a dark suit in an office room. The lion wearing looks at the camera and smiles. The camera zooms out slowly to the skyscraper exterior. Timelapse of sunset in the modern city.

Text prompt (IMAGE: GOOGLE)

Both have been built using diffusion models, a type of model trained by progressively corrupting its training data with noise and then learning to reconstruct it (the same technique underpins the new generation of text-to-image AI models). Researchers gave the models datasets of millions of videos paired with captions, which they use to recognise and reproduce patterns.
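The “corrupt the data, then learn to undo the corruption” idea can be sketched in a few lines of code. The snippet below is a toy, hedged illustration of a standard diffusion training loop: add noise to a sample, then train a network to predict that noise. The tiny network, the one-dimensional stand-in data and the simplified noise schedule are assumptions made purely for illustration; they bear no relation to the actual Make-a-Video, Imagen Video or Phenaki architectures, which operate on video frames at far larger scale.

```python
# Toy sketch of the diffusion training idea described above (assumptions, not
# the real Meta/Google systems): corrupt samples with noise, train a small
# network to predict the noise that was added.
import torch
import torch.nn as nn

# Assumed toy denoiser: takes a noisy 64-dimensional sample plus a noise-level
# value, and predicts the noise that was mixed in.
denoiser = nn.Sequential(nn.Linear(65, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

data = torch.randn(256, 64)   # stand-in for flattened training samples
num_steps = 1000              # number of noise levels in the schedule

for _ in range(200):          # toy training loop
    x0 = data[torch.randint(0, len(data), (32,))]               # clean batch
    t = torch.randint(1, num_steps + 1, (32, 1)).float() / num_steps
    noise = torch.randn_like(x0)
    # Forward ("break apart") process: blend clean data with noise,
    # using more noise at larger t (a simplified schedule).
    xt = (1 - t).sqrt() * x0 + t.sqrt() * noise
    # Reverse ("build back together") process is learned: predict the
    # added noise from the noisy input and the noise level.
    pred = denoiser(torch.cat([xt, t], dim=1))
    loss = ((pred - noise) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, a model like this can be run in reverse, starting from pure noise and removing predicted noise step by step until a new sample emerges; pairing that process with text captions is what lets the real systems generate video from a prompt.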

Neither company has released its model to the public, but it’s only a matter of time before these become accessible. Much like the text-to-image models before them, text-to-video models are a powerful new tool that opens up video creation to users who would previously have needed to go through the time-consuming and technically demanding process of manually animating something to get a similar effect, but they also present a threat to humans’ understanding of reality.

No one would mistake the current text-to-video AI outputs for real footage yet, but advances in this technology may soon challenge that. What happens when artificially generated video becomes indistinguishable from real video? Such a premise may once have seemed like the plot of a far-fetched dystopian novel, but it now seems not that far away.

This article was first published by Crikey.