Do AI models plagiarize?
In December 2023, The New York Times became the first major American media company to sue OpenAI and Microsoft over the use of copyrighted work. The NYT claimed that millions of its articles were used to train chatbots, which could then reproduce those articles word for word.
In its lawsuit against OpenAI, NYT provided 100 examples of GPT-4 regurgitating NYT articles:
In this NYT article, the paper goes deeper: “Like other A.I. companies, Microsoft and OpenAI built their technology by feeding it enormous amounts of digital data, some of which is likely copyrighted. A.I. companies have claimed that they can legally use such material to train their systems without paying for it because it is public and they are not reproducing the material in its entirety.”
AI-generated content is not inherently plagiarized and is generally considered “original”, but it can closely resemble training data because the models were trained on human-created content. These models are designed to create original material based on user prompts, not to copy existing content. Tools like ChatGPT also include algorithms intended to reduce the risk that generated content is plagiarized, along with other safeguards such as guidelines for ethical use.
As part of our mission to bring transparency to humans navigating a world filled with AI content, GPTZero responded by developing a tool that checks whether your AI-generated text appears in online sources such as NYT articles. The idea originated with one of GPTZero’s machine learning engineers, Odunayo Ogundepo, who developed Project GAIA. GAIA was a project with Hugging Face to collect and analyze databases known to have been trained on by LLM providers in 2023.
We've since continued these efforts at GPTZero, expanding and tracking the databases that different LLMs train on. The goal is to provide a standard for both publishers and AI companies to measure acceptable AI use (and to level the playing field for local newsrooms). For our three million active users who are not publishers, we’re also hoping this feature becomes a fun way to explore AI and dig deeper into what informs its outputs!
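For readers curious what "checking generated text against a known source" can look like in practice, here is a minimal sketch. It is not GPTZero's actual method, which this post does not describe. It assumes a simple, commonly used technique: split both texts into overlapping word sequences (n-grams) and measure what fraction of the generated text's sequences also appear in the source. The function names `word_ngrams` and `overlap_ratio` and the default window of 8 words are illustrative choices, not part of any real tool.

```python
import re


def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in `text`."""
    # Lowercase and keep only word-like tokens so punctuation and
    # capitalization differences do not hide a verbatim match.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(generated: str, source: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams that also occur in `source`.

    Hypothetical helper for illustration only: 0.0 means no shared
    n-word sequences, 1.0 means every n-word sequence in the generated
    text also appears in the source.
    """
    generated_ngrams = word_ngrams(generated, n)
    if not generated_ngrams:
        # Text shorter than n words has no n-grams to compare.
        return 0.0
    shared = generated_ngrams & word_ngrams(source, n)
    return len(shared) / len(generated_ngrams)


if __name__ == "__main__":
    source_text = "the quick brown fox jumps over the lazy dog near the river bank"
    generated_text = "the quick brown fox jumps over a sleeping cat"
    print(overlap_ratio(generated_text, source_text, n=5))
```

A high ratio suggests long verbatim runs shared with the source, while a low ratio suggests the wording is mostly different. A real system would need to search across many sources at once and handle paraphrasing, which this sketch does not attempt.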
Try out the copyright check by heading to GPTZero's tool on the homepage and clicking sources. See the video below for how it works.