Group representing the New York Times and 2,200 other publishers just dropped a scathing 77-page white paper calling ChatGPT and LLMs an illegal ripoff
What if generative AI is illegal? The technology, which exploded into the public eye in November 2022 with ChatGPT’s release, has been a boon for the stock market and talked up as a revolutionary breakthrough, but it has been in an uneasy truce with media organizations ever since. Now, almost a year later, the biggest newspaper alliance, with key members including the New York Times and the Wall Street Journal, has weighed in, and it is scathing.
Everyone from Hollywood actors to famed authors is scrambling to safeguard their work against the unbridled use of artificial intelligence, and news publishers argue that developers such as OpenAI, the company behind ChatGPT, and Google have been illegally using their copyrighted work to train chatbots.
The News Media Alliance, a trade group representing over 2,200 media organizations, released a 77-page white paper on Tuesday arguing that some of the most popular AI chatbots, like ChatGPT and Google’s Bard, heavily rely on news articles to train their technology. And because of the way these chatbots are trained, the answers they generate can be nearly identical to the copyrighted content.
“GAI [generative AI], while holding promise for consumers, businesses, and society at large, are commercial products that have been built—and are run—on the backs of creative contributors,” the report said.
Media’s war on AI
Large language models, or LLMs, are a type of AI that understands and generates written text. They are trained by analyzing massive amounts of data, learning to mimic writing patterns while dispensing a seemingly encyclopedic knowledge. However, since many developers do not publicly disclose what content is fed into their models during training, it is impossible to know for certain what data is being cited or replicated. The Alliance thinks that it knows.
By analyzing a sample of datasets believed to be used to train LLMs, the News Media Alliance found that content from news, magazine, and digital media publications was used five to 100 times more frequently than it appears in open web data, like that from Common Crawl. The report argued that this use exceeds the bounds of "fair use," the doctrine that allows copyrighted material to be reproduced or copied without a license for limited purposes.
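At its core, the Alliance's measurement is a domain-frequency comparison: tally how often publisher domains show up in a curated training dataset versus in a generic web crawl, then take the ratio. The sketch below illustrates that kind of comparison only; the publisher list, URL samples, and numbers are hypothetical, and the Alliance's actual datasets and methodology are laid out in its white paper.

```python
from urllib.parse import urlparse
from collections import Counter

# Hypothetical publisher domains to look for (illustrative only).
PUBLISHER_DOMAINS = {"nytimes.com", "wsj.com", "washingtonpost.com"}

def publisher_share(urls):
    """Fraction of URLs in a corpus that point at publisher domains."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Strip a leading "www." so www.nytimes.com matches nytimes.com.
        host = host[4:] if host.startswith("www.") else host
        counts["publisher" if host in PUBLISHER_DOMAINS else "other"] += 1
    total = sum(counts.values())
    return counts["publisher"] / total if total else 0.0

def overrepresentation(training_urls, open_web_urls):
    """How many times more frequent publisher content is in a training
    sample than in a generic open-web crawl."""
    train_share = publisher_share(training_urls)
    web_share = publisher_share(open_web_urls)
    return train_share / web_share if web_share else float("inf")

# Toy usage with made-up URL lists (not real data):
training_sample = ["https://www.nytimes.com/a", "https://example.org/b",
                   "https://wsj.com/c"]
open_web_sample = ["https://example.org/x", "https://example.com/y",
                   "https://www.nytimes.com/z", "https://blog.example.net/w"]
print(overrepresentation(training_sample, open_web_sample))  # ~2.67 here
```

A ratio well above 1, as in this toy example, is the kind of overrepresentation the Alliance says it found when comparing curated training datasets against Common Crawl.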
“It genuinely acts as a substitution for our very work,” Danielle Coffey, president and CEO of the News Media Alliance, told the New York Times. “You can see our articles are just taken and regurgitated verbatim.”
The white paper argues that AI developers’ “anthropomorphic claim” that they are only using published written material to train their models is “technically inaccurate and beside the point.”
It is inaccurate because models “retain the expressions of facts that are contained in works in their copied training materials (and which copyright protects) without ever absorbing any underlying concepts,” the report said. “It is beside the point because materials that are used for ‘learning’ are subject to copyright law.”
Coffey added that the news group would “have a very good case in court” against developers.
Sword of Damocles for creative industries
The rise of generative AI has hung over the media like a sword of Damocles. If a chatbot can distill large amounts of information and summarize it in readable, accurate text, it could theoretically put reporters out of a job.
And the Alliance says this possible future wouldn’t just be harmful to the journalism industry, but to society: “If the Internet becomes flooded with the products of GAI, then GAI itself will have nothing left to train on.”
It’s not just the news media bracing itself. Authors such as John Grisham, Game of Thrones creator George R.R. Martin, and 17 others filed a class action lawsuit in September against OpenAI for training ChatGPT on their copyrighted books. And the lack of guardrails surrounding the use of the still-developing tech was at the center of the dual Hollywood actors’ and writers’ strikes, as they feared that studios would use the technology to replicate their likenesses without their permission or to replace them altogether.
“This evidence demonstrates that the fruits of human creativity are the essential fuel sustaining the GAI revolution,” the report said.