Last week, the Authors Guild sent an open letter to the leaders of some of the world’s biggest generative AI companies. Signed by more than 9,000 writers, including prominent authors like George Saunders and Margaret Atwood, it asked the likes of Alphabet, OpenAI, Meta, and Microsoft “to obtain consent, credit, and fairly compensate writers for the use of copyrighted materials in training AI.” The plea is just the latest in a series of efforts by creatives to secure credit and compensation for the role they claim their work has played in training generative AI systems.
The training data used for large language models, or LLMs, and other generative AI systems has largely been kept secret. But the more these systems are used, the more writers and visual artists are noticing similarities between their work and these systems’ output. Many have called on generative AI companies to reveal their data sources and, as with the Authors Guild, to compensate those whose works were used. Some of the pleas are open letters and social media posts, but an increasing number are lawsuits.
It’s here that copyright law plays a major role. Yet it is a tool that is ill equipped to tackle the full scope of artists’ anxieties, whether these be long-standing worries over employment and compensation in a world upended by the internet, or new concerns about privacy and personal—and uncopyrightable—characteristics. For many of these, copyright can offer only limited answers. “There are a lot of questions that AI creates for almost every aspect of society,” says Mike Masnick, editor of the technology blog Techdirt. “But this narrow focus on copyright as the tool to deal with it, I think, is really misplaced.”
The most high-profile of these recent lawsuits came earlier this month when comedian Sarah Silverman, alongside four other authors in two separate filings, sued OpenAI, claiming the company trained its wildly popular ChatGPT system on their works without permission. Both class-action lawsuits were filed by the Joseph Saveri Law Firm, which specializes in antitrust litigation. The firm is also representing the artists suing Stability AI, Midjourney, and DeviantArt for similar reasons. Last week, during a hearing in that case, US district court judge William Orrick indicated he might dismiss most of the suit, stating that, since these systems had been trained on “five billion compressed images,” the artists involved needed to “provide more facts” for their copyright infringement claims.
The Silverman case alleges, among other things, that OpenAI may have scraped the comedian’s memoir, The Bedwetter, via “shadow libraries” that host troves of pirated ebooks and academic papers. If the court finds in favor of Silverman and her fellow plaintiffs, the ruling could set new precedent for how the law views the data sets used to train AI models, says Matthew Sag, a law professor at Emory University. Specifically, it could help determine whether companies can claim fair use when their models scrape copyrighted material. “I’m not going to call the outcome on this question,” Sag says of Silverman’s lawsuit. “But it seems to be the most compelling of all of the cases that have been filed.” OpenAI did not respond to requests for comment.
At the core of these cases, explains Sag, is the same general theory: that LLMs “copied” authors’ protected works. Yet, as Sag explained in testimony to a US Senate subcommittee hearing earlier this month, models like GPT-3.5 and GPT-4 do not “copy” work in the traditional sense. “Digest” would be a more appropriate verb: The models digest training data in order to carry out their function, which is predicting the best next word in a sequence. “Rather than thinking of an LLM as copying the training data like a scribe in a monastery,” Sag said in his Senate testimony, “it makes more sense to think of it as learning from the training data like a student.”