Investigations

NYT sued OpenAI for copyright infringement, GPTZero develops tool to detect matches in AI training data

GPTZero Team
· 2 min read
Send by email

On December 27 2023, The Times sued OpenAI and Microsoft claiming the two companies used its articles to train AI technologies like ChatGPT. NYT claimed that millions of their articles were used to train chatbots, resulting in ChatGPT prompts regurgitating NYT articles word for word.

"There is nothing 'transformative' about using The Times's content without payment to create products that substitute for The Times and steal audiences away from it," the Times said and “it’s not fair by any measure as they are essentially leveraging NYT investment in its journalism to build products without permission or payment.” According to Arstechnica, “The NYT claims OpenAI used millions of Times articles to train GPT-4. At $25,000 per article, OpenAI and Microsoft could owe billions of dollars to the New York Times alone.”

In its lawsuit against OpenAI, NYT provided 100 examples of GPT-4 copyrighting NYT articles:

Source: ars technica

In January, OpenAI responded with this blog post saying, “we disagree with the claims in The New York Times lawsuit,” and “The New York Times is not telling the full story”. OpenAI said that, “the regurgitations The New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites. It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate. Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.” OpenAI said, “training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents.”

Justin Nelson, the Attorney whose law firm is representing NYT said, “what OpenAI is saying is that they have a free ride to take anybody else’s intellectual property really since the dawn of time, as long as it’s been on the internet.” While there’s been no final decision in the case yet, many sources think the New York Times might win its copyright lawsuit against OpenAI.

In our mission of bringing transparency to humans navigating a world filled with AI content, GPTZero reacted by developing a tool called Copyright Check to determine if your AI generated text is found on online sources such as NYT articles. The idea originated with one of GPTZero’s machine learning engineers Odunayo Ogundepo, who developed Project GAIA. GAIA was project with Hugging Face to collect and analyze databases known to be trained on by LLM providers in 2023.

We've since continued these efforts at GPTZero to expand and track the different databases different LLMs train on. The goal is to provide a standard for both publishers and AI companies to measure acceptable AI use (and even out the playfield for local newsrooms). For our three million active users who are not publishers, we’re also hoping this feature becomes a fun way to explore AI and dig deeper into what informs their outputs!

Try out copyright check by heading to GPTZero's tool on the homepage and clicking sources. See video below for how it works.