You’ve probably never wondered what a knight made of spaghetti would look like, but here’s the answer anyway—courtesy of a clever new artificial intelligence program from OpenAI, a company in San Francisco.
The program, DALL-E, announced earlier this month, can concoct images of all sorts of weird things that don’t exist, like avocado armchairs, robot giraffes, or radishes wearing tutus. OpenAI generated several images, including the spaghetti knight, at WIRED’s request.
DALL-E is a version of GPT-3, an AI model trained on text scraped from the web that’s capable of producing surprisingly coherent text. DALL-E was trained on images and their accompanying descriptions; given a text prompt, it can generate a decent mashup image.
Pranksters were quick to see the funny side of DALL-E, noting for instance that it can imagine new kinds of British food. But DALL-E is built on an important advance in AI-powered computer vision, one that could have serious, and practical, applications.
Called CLIP, it consists of a vast artificial neural network—an algorithm inspired by the way the brain learns—fed hundreds of millions of images and accompanying text captions from the web and trained to predict the correct labels for an image.
Researchers at OpenAI found that CLIP could recognize objects as accurately as algorithms trained in the usual way—using curated data sets where images are neatly matched to labels.
Because it learns from web images and their captions rather than from curated data sets, CLIP can recognize more things, and it can grasp what certain things look like without needing copious examples. CLIP helped DALL-E produce its artwork, automatically selecting the best images from the ones it generated. OpenAI has released a paper describing how CLIP works as well as a small version of the resulting program. It has yet to release a paper or any code for DALL-E.
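For readers curious about what "selecting the best images" might look like in practice, the step can be pictured as a ranking problem: score each generated image by how closely it matches the caption, then keep the top few. The sketch below is an illustration only, not OpenAI's code. It assumes a model has already turned the caption and each candidate image into a list of numbers (an embedding); the vectors, the candidate names, and the `rerank` function are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return dot / (norm_a * norm_b)


def rerank(text_embedding, image_embeddings, top_k=1):
    """Return the names of the top_k images most similar to the caption.

    image_embeddings maps a (hypothetical) image name to its embedding.
    """
    scored = [
        (cosine_similarity(text_embedding, embedding), name)
        for name, embedding in image_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Toy, hand-written embeddings; a real system would get these from a model.
caption = [0.9, 0.1, 0.3]
candidates = {
    "candidate_a": [0.8, 0.2, 0.4],
    "candidate_b": [0.1, 0.9, 0.0],
    "candidate_c": [0.5, 0.5, 0.5],
}

best = rerank(caption, candidates, top_k=2)
print(best)
```

With these toy numbers, `candidate_a` points in nearly the same direction as the caption and ranks first, while `candidate_b` points elsewhere and is dropped. The same scoring idea, applied to a list of possible labels instead of a list of images, is one way to think about recognizing objects without a curated training set.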
Both DALL-E and CLIP are “super impressive,” says Karthik Narasimhan, an assistant professor at Princeton specializing in computer vision. He says CLIP builds upon previous work that has sought to train large AI models using images and text simultaneously, but does so at an unprecedented scale. “CLIP is a large-scale demonstration of being able to use more natural forms of supervision—the way that we talk about things,” he says.
He says CLIP could be commercially useful in many ways, from improving the image recognition used in web search and video analytics, to making robots or autonomous vehicles smarter. CLIP could be used as the starting point for an algorithm that lets robots learn from images and text, such as instruction manuals, he says. Or it could help a self-driving car recognize pedestrians or trees in an unfamiliar setting.