HarperCollins strikes AI training deal with unnamed company amid rising copyright tensions between publishers and AI firms

November 22, 2024 ndowd

Publishing giants and generative artificial intelligence companies are striking deals that aim to both protect copyright and provide for the rapidly increasing needs of the AI industry.

US publishing giant HarperCollins has reached an agreement with an unnamed tech company allowing that company to use some of the publisher's books to train its generative AI models.

In a letter seen by AFP, the tech company proposes paying $2,500 per selected book for the right to use it to train its so-called large language model (LLM) for up to three years.

AI models need massive quantities of text to learn everyday language use.

“HarperCollins has reached an agreement with an artificial intelligence technology company to allow limited use of select nonfiction backlist titles for training AI models to improve model quality and performance,” the publisher said in a statement.

It said the agreement has “limited scope and clear guardrails around model output that respects author’s rights.”

Authors “have the choice to opt in to the agreement or to pass on the opportunity”, it added.

The offer has had a mixed reception in the publishing world, with writers such as Daniel Kibblesmith curtly declining.

“I’d probably do it for a billion dollars. I’d do it for a sum of money that wouldn’t require me to work anymore, since that’s the ultimate goal of this technology,” the author posted on the Bluesky social network.

HarperCollins is one of the largest publishers to reach such an accord, but not the first.

US scientific publisher Wiley said it has allowed “access to previously published academic and professional book content for specific use in training LLM models” in a $23 million contract with an unidentified “large tech company”.

The accords underscore the tension behind AI models, which collect huge quantities of content on the web, creating the risk of widespread copyright violations.

‘A broader conversation’

Giada Pistilli, head of ethics at Hugging Face, a French-American open-access AI platform, said these agreements are a step forward since they involve payments to publishers. But she said she regrets that they leave little room for authors to negotiate.

“What we are going to see is a mechanism of bilateral agreements between new technology companies and publishers or copyright holders, whereas in my opinion, we need a broader conversation that includes stakeholders a little more,” she said.

Julien Chouraqui, legal director at the French publishing union (SNE), said the accords represented “progress”.

“An agreement means that there has been a dialogue and a desire to achieve a balance between the use of source data, which are subject to copyright and which will generate value,” he said.

The press is also organising to face the challenges created by AI.

In late 2023, The New York Times sued OpenAI, creator of ChatGPT, as well as Microsoft, its main investor, for violating copyright protections. Other media groups have cut deals with OpenAI.

Tech companies may have no choice but to pay out to improve their products, especially as they are starting to run out of new materials to power their models.

“On the web, you find lots of licit and illicit stuff, and lots of pirated copy. That not only causes legal problems but also raises issues about the quality of the data,” said Chouraqui at the SNE.

“If we are committed to developing a market on a virtuous basis, we must involve all the players,” he said.
