Technology

How AI Detection Benchmarking Works at GPTZero

An overview of how AI detectors like GPTZero, Turnitin, and Copyleaks compare in terms of accuracy rates, error rates, and false positives and negatives when identifying AI-generated content.

Edward Tian, Alex Cui, Alex Adam
· 5 min read
[Image: AI benchmarking for GPTZero versus Turnitin, Copyleaks, and Originality]

When it comes to dealing with the prevalence of AI-generated content from large language models like OpenAI’s ChatGPT, it’s important to have an AI detection tool you can trust to keep up with the breakneck evolution of artificial intelligence. As more tools offer some version of AI detection, it’s even more crucial to know how to compare them through accurate benchmarking. GPTZero was the first AI detection tool, launched in 2022, and we’ve worked tirelessly to make sure our model is consistently improving to better detect how people use AI in their writing – from the classroom to the internet at large – and to provide regular updates on our model’s accuracy.

Jump to: GPTZero accuracy benchmarking results

What is benchmarking in AI and machine learning?

Benchmarking in AI and machine learning refers to evaluating different tools against one another on a common task. In the world of AI detection, detectors are benchmarked on how accurately they distinguish content created by AI from content written by humans. In GPTZero’s unique case, we can also identify content that includes a mix of both human and AI writing, which our competitors currently do not.
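
To make the comparison concrete, here is a minimal sketch of what a detection benchmark looks like in code: each tool is run over the same labeled texts and scored on how often its call matches the label. The toy detector and the tiny sample set below are hypothetical stand-ins for illustration only, not GPTZero’s internal harness or any vendor’s real API.

```python
# A minimal, hypothetical benchmark loop: every detector sees the same
# labeled texts and is scored on agreement with the labels. The toy
# detector and sample set exist only to make the sketch runnable.

labeled_texts = [
    ("The committee reviewed the proposal last Tuesday.", "human"),
    ("As an AI language model, I can summarize the key points for you.", "ai"),
]

def toy_detector(text: str) -> str:
    """Stand-in detector; real tools use trained models, not keyword checks."""
    return "ai" if "as an ai language model" in text.lower() else "human"

detectors = {"toy_detector": toy_detector}

for name, detect in detectors.items():
    correct = sum(detect(text) == label for text, label in labeled_texts)
    accuracy = correct / len(labeled_texts)
    print(f"{name}: {accuracy:.0%} accuracy on {len(labeled_texts)} samples")
```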

There are different ways to benchmark AI detectors. One way is to rely on journalists and researchers to do independent public testing. For example, tech journalists have tested a number of AI detection tools in 2023 and 2024. TechCrunch first reported, when testing six AI detectors:

“GPTZero was the only consistent performer, classifying AI-generated text correctly. As for the rest … not so much.” 

More recently, ZDNet, when testing seven AI detectors, reported “AI content detectors are getting dramatically better,” with GPTZero receiving a perfect score.

There are limitations to these journalists’ approach: namely, they only test on a small sample of texts.

For more rigorous benchmarking, AI detection companies like GPTZero run internal tests on larger sets (hundreds to thousands) of texts that we create and curate. As of 2024, we’ve also tested our model against AI generators including Claude, Llama, and Gemini.

Lastly, GPTZero partners with outside organizations like Penn State’s AI/ML Research lab to run independent reviews of our benchmarking. These reviews help ensure we remain unbiased when balancing our accuracy rates against our error rates. GPTZero is especially vigilant about training and testing on more diverse sets of data, from writing by ESL learners to content made by AI models other than ChatGPT.

What is the accuracy rate for AI detection?

Accuracy rate in AI detection most often refers to how reliably a tool can identify whether a text was created by AI. Accuracy rates are usually calculated by combining the rate at which a tool correctly identifies AI text with the rate at which it correctly identifies human text, while taking false positives and false negatives into account.
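
As a rough illustration of that calculation, the sketch below derives an accuracy rate from confusion-matrix counts. The numbers are made up for the example and are not GPTZero’s published results.

```python
# Illustrative confusion-matrix counts (made up, not GPTZero's results).
true_positives = 495   # AI text correctly flagged as AI
true_negatives = 490   # human text correctly labeled human
false_positives = 10   # human text wrongly flagged as AI
false_negatives = 5    # AI text wrongly labeled human

total = true_positives + true_negatives + false_positives + false_negatives
accuracy = (true_positives + true_negatives) / total
print(f"accuracy = {accuracy:.1%}")  # 98.5% for these example counts
```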

GPTZero has an accuracy rate of 99% when distinguishing AI-generated text from human writing, meaning we classify a text correctly 99 out of 100 times. When testing samples where there’s a mix of AI and human writing in one submission, we have a 96.5% accuracy rate.

When comparing tools, looking at benchmarking for accuracy rates and error rates can give you a sense of how precise an AI detector is. While several other AI detector services, including Originality and Turnitin, also conduct internal benchmarking, others, including ZeroGPT, Winston, and Quillbot, have yet to release any.

[Image: Comparison of AI detectors with transparent error rates, by Vox]

Error rates in AI detection refer to how often a detector mislabels a text, calling human writing AI or AI writing human. The two types of errors are known as false positives and false negatives.

A false positive in AI detection is when an AI detector incorrectly classifies a human’s writing as AI. If, for instance, you are an educator or an institution that relies on AI detection tools to help inform your disciplinary policy around students’ AI usage, you will want to make sure the false positive rate is as low as possible to avoid false claims of cheating. We keep GPTZero’s false positive rate at no more than 1% when evaluating AI versus human text. 

A false negative in AI detection is when the detector incorrectly labels AI writing as human. A low false negative rate matters because it shouldn’t be easy to bypass AI detection. Given the current pace of AI evolution, there is never going to be a 100% guarantee – in fact, we believe that if someone claims 100% accuracy, there is likely a flaw in their data or evaluation methodology.
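
To spell out how these two error rates are conventionally computed, here is a short sketch; the counts are illustrative only, not measured results from any detector.

```python
# Illustrative counts only; not measured results from any detector.
false_positives = 10   # human samples flagged as AI
true_negatives = 990   # human samples correctly labeled human
false_negatives = 40   # AI samples labeled as human
true_positives = 960   # AI samples correctly flagged as AI

# False positive rate: the share of human samples wrongly flagged as AI.
fpr = false_positives / (false_positives + true_negatives)
# False negative rate: the share of AI samples that slip through as human.
fnr = false_negatives / (false_negatives + true_positives)
print(f"false positive rate = {fpr:.1%}")  # 1.0%
print(f"false negative rate = {fnr:.1%}")  # 4.0%
```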

How does GPTZero benchmark against other solutions?

Based on independent studies:

"GPTZero had the best discrimination of the pure AI-generated abstracts at an optimal threshold selected with Youden’s index, identifying 99.5% of AI-written abstracts with no false positives among human-written text. "
— Picazo-Sanchez, P., Ortiz-Martin, L. Analysing the impact of ChatGPT in researchAppl Intell 54, 4172–4188 (2024).
"In other words, no matter which editorial the analysed text comes from, the detector with the highest accuracy is GPTZero."
— Frederick M. Howard et al., Characterizing the Increase in Artificial Intelligence Content Detection in Oncology Scientific Abstracts From 2021 to 2023JCO Clin Cancer Inform 8, e2400077 (2024). 

Here is how we measured up against other AI detectors when we ran our own benchmarks:

GPTZero vs. Turnitin for AI vs. Human (2023)

|                                               | GPTZero (Nov 2023) | Turnitin (Nov 2023) |
|-----------------------------------------------|--------------------|---------------------|
| Accuracy Rate                                 | 99%                | 95%                 |
| False Positive (Classifying Human Text as AI) | 2%                 | 2%                  |
| False Negative (Classifying AI text as Human) | 0%                 | 8%                  |

To push the limits of our benchmarking, we’ve also started testing our solution in the more nuanced case where a writer may have used some AI but has revised or paraphrased parts, making a “mixed” document.

GPTZero versus Competition on Mixed Documents (2024)

|                                               | GPTZero (Aug 2024) | Copyleaks (Aug 2024)* | Originality (Aug 2024)* |
|-----------------------------------------------|--------------------|-----------------------|-------------------------|
| Accuracy Rate with Mixed Sources              | 96.5%              | 87.5%                 | 82.5%                   |
| False Positive (Classifying Human Text as AI) | 0.9%               | 8.2%                  | 8.2%                    |
| False Negative (Classifying AI text as Human) | 4.4%               | 4.4%                  | 14.4%                   |
*Note that our competitors don’t have as robust a way of classifying mixed samples (text that includes both AI and human writing), so these false positive and negative rates include cases where they identified mixed samples as either wholly human or wholly AI.
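
To illustrate one plausible way such cases can be tallied, the sketch below treats each document as human, AI, or mixed and counts a binary-only detector’s call on a mixed document as an error. The convention shown here is an assumption made for the example, not necessarily the exact scoring behind the table above.

```python
# Hypothetical scoring convention, assumed for this sketch only: a mixed
# document called wholly AI counts toward false positives (human writing
# flagged as AI), and one called wholly human counts toward false
# negatives (AI writing that slipped through). Not necessarily the exact
# convention used for the table above.
samples = [
    ("mixed", "ai"),     # mixed document called wholly AI
    ("mixed", "human"),  # mixed document called wholly human
    ("human", "human"),  # pure human document, correct call
    ("ai", "ai"),        # pure AI document, correct call
]

false_positives = 0
false_negatives = 0
for truth, prediction in samples:
    if truth in ("human", "mixed") and prediction == "ai":
        false_positives += 1
    if truth in ("ai", "mixed") and prediction == "human":
        false_negatives += 1

print(false_positives, false_negatives)  # 1 1 for this toy set
```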

Key Takeaways of AI Detection Benchmarking

  • No tool is likely to be 100% accurate. A good AI detection tool should have the highest accuracy rate with the lowest rates of false positives and false negatives. GPTZero has a 99% accuracy rate and a 1% false positive rate when detecting AI versus human samples.
  • GPTZero is much better than our competitors at detecting mixed documents, where both AI and human writing are involved, with a 96.5% accuracy rate.
  • GPTZero is trained on a diverse set of data that is independently reviewed by AI/ML labs at leading institutions. We strive to give precise results, especially to educators using AI detection tools in academic settings.

How we keep up with AI innovation

We update our model every month with new data, and whenever a new AI model is released. We will update these numbers after major updates and will soon include more details on how we benchmark against our competitors.

You can try GPTZero for free or learn more about how to bring AI detection to your institution.