What if OpenAI trained ChatGPT with illegal data scraping? The New York Times is reportedly considering suing to put that to the test

August 18, 2023 ndowd

Legal woes are piling up for OpenAI, the startup behind ultra-popular ChatGPT. NPR reports that the New York Times is considering suing OpenAI after attempts to reach a deal in which OpenAI would license news content to train its algorithms failed to progress.

If the lawsuit materializes, it would be the highest-profile attempt yet to bring to heel ChatGPT, a tool whose hype has taken the world by storm. And a successful lawsuit could even go further than that, forcing OpenAI to retrain ChatGPT at great expense, as it would essentially remove much of the language on which the large language model has been trained.

Of note is that the Times was part of a group collectively lobbying for regulations on A.I., until it suddenly removed itself, according to Semafor. The Times’ lawsuit also is not alone in arguing that OpenAI has illegally scraped training data. Comedian Sarah Silverman and authors Paul Tremblay, Mona Awad, and Christopher Golden, sued OpenAI last month, alleging the company committed “industrial-strength” plagiarism when it trained ChatGPT on their work.

In January, a trio of commercial artists sued the creators of the popular image-creating engine Midjourney, accusing it of stealing their work to create knockoffs, preventing artists from making a living off their work. The artists’ lawyers called the technology “a parasite that, if allowed to proliferate, will cause irreparable harm to artists.” And Getty, the image-licensing service, has sued Stability AI, accusing it of illegally copying 12 million Getty-owned images in a bid to create a competing service. Meanwhile, earlier on Thursday, the Associated Press came up with a set of A.I. standards for staff that encourages them to experiment with the technology but forbids using it to create any content or images that would be published.

Even Elon Musk, who famously left OpenAI’s board in 2018, claimed in July of this year that “extreme levels of data scraping” were happening on Twitter at the hands of A.I. companies. “Almost every company doing A.I., from startups to some of the biggest corporations on earth, was scraping vast amounts of data. It is rather galling to have to bring large numbers of servers online on an emergency basis just to facilitate some A.I. startup’s outrageous valuation.”

The Times’ concern, according to NPR, is that OpenAI would create a direct competitor to its reporting “by creating text that answers questions based on the original reporting and writing of the paper’s staff.”

Neither the Times nor OpenAI immediately replied to a request for comment. However, the Times has a good reason to fear competition from ChatGPT. Small businesses that rely on web traffic have seen it destroyed by a more basic piece of technology—Google’s search box, which presents the answer to a typed question as a paragraph at the top of search results.

The niche site CelebrityNetWorth used to do decent business as a source for people curious about celebs’ financial dealings, but after Google started presenting celebrities’ net worth in its search box, traffic to CelebrityNetWorth plunged by two-thirds, and the site had to lay off half its staff, its founder told The Outline.

“If it happens, this lawsuit will be about the value of gathering information and who gets to use it for their customers,” Jeremy Gilbert, Knight professor in digital media strategy at Northwestern University’s Medill School, told Fortune.

The search engine Bing (whose owner, Microsoft, has invested billions in OpenAI) is now using ChatGPT to power its searches. If a person were to ask Bing a question, the search engine could instantly produce a long, detailed answer based on New York Times reporting, eliminating the person’s need to visit the Times’ website (and cheating the paper of revenue).

“Publishers feel most comfortable with direct traffic to news,” Gilbert said. But a large language model like ChatGPT’s “may not send you to the news website at all.

“If [audiences] get everything they need without clicking through to the New York Times, how does the New York Times fund its reporting? Even if that’s much more satisfying for the consumer, it’s fundamentally untenable,” he said.

A group of media outlets, led by IAC, have formed a coalition to pressure OpenAI into paying them “billions” for the use of their work as training material.

OpenAI is copying everything—but is it legal?

It’s no secret that OpenAI has been trained on a vast sea of data—novels, web forums, conversations, news articles, photos, and illustrations—scraped from the public web.

What’s not clear yet is whether this scraping is legal. And a growing number of writers and artists say it isn’t, with lawsuits mounting against OpenAI and other generative-A.I. creators accusing them of copyright infringement.

Even OpenAI’s users are creeped out by the thought of being training material: In response to user backlash, OpenAI this spring changed its terms to clarify that prompts submitted to ChatGPT would not be used to train the bot.

Generative A.I. “is a minefield for copyright law,” a group of lawyers and media scholars recently wrote. The courts’ views of what, exactly, the technology does will be a key deciding factor in these cases.

If judges believe that the materials A.I. spits out are new creations, or that they significantly transform the works they’re based on, they’re likely to see its treatment of copyrighted works as fair use.

If, on the other hand, they believe the A.I. is simply copying and regurgitating others’ works, they could find its use illegal, and force OpenAI to destroy all copies of those works in its dataset.

Regardless of how the courts rule, the Times seems set to get its share of the A.I. pie. Speaking at a Cannes Lions event this spring, Times CEO Meredith Kopit Levien said: “There must be fair value exchange for the content that’s already been used, and the content that will continue to be used, to train models.”

source