GPTZero 2025 Benchmarks: How we detect ChatGPT o1

GPTZero expands its AI detection coverage of LLMs to OpenAI’s latest reasoning models; see our internal benchmarking on how we compare to other AI detectors.

Alex Adam, Edwin Thomas, Vivienne Chen

Jan 13, 2025 · 5 min read

OpenAI’s ChatGPT o1, released December 2024, represents a new class of LLM capable of advanced reasoning that achieves superior performance on tasks like math problems, software engineering, and visual puzzles.

In keeping with our promise to remain the top AI detector, GPTZero has updated our AI detection to identify text generated by these reasoning models with 98%+ accuracy. (See full benchmarks in our main article.)

How ChatGPT has evolved

What makes a model like o1 different is that it has an “internal reasoning” process which you cannot see when interacting with ChatGPT. Instead, most ChatGPT o1 users see a “thinking” indicator. This “thinking” process was previously the responsibility of advanced human users who would tinker with prompts, encouraging the LLM to explicitly justify its steps. ChatGPT o1 and similar reasoning models democratize this ability, effectively turning a novice LLM user into an expert prompter.

o1 “thinking” indicator

As the leading AI detector, GPTZero set out to answer two questions: do these extended reasoning abilities affect the writing style of generated documents such as essays or blog posts, and if so, do they make those documents capable of bypassing AI detection?

How GPTZero benchmarked for o1

We recently created a benchmark dataset consisting of 250 human-written documents and 250 o1-generated documents. To generate the o1 documents, we used an advanced prompt that is more likely to produce text resembling human writing than simply asking the model to write a document from scratch or naively paraphrasing an AI document. Including human examples in the benchmark is crucial because it reveals any tradeoff between flagging more AI texts and misclassifying human texts. This is part of our commitment to promoting responsible AI detection and minimizing the risk of unwarranted penalization.

Below are example excerpts from a human document and an o1-generated version of that document.

Human Excerpt:

The statement "public transport is our future" has been an interesting political discuss during years. On one hand the public transport is good because it doesn't pollute so much and you can muve around all the city. Young people use it a lot. We don't use so much petroll like if we each of the passenger to use it own private transport. In fact you can't use it if you have to carry a lot of things or when you have hurry. In conclusion the public transport is very good and if it desapear it will be a big problem, is right that some times you need a private transport but apart from that the public transport is very use by people of all ages.

o1 Excerpt:

I love using public transport sometimes becuase it let me meet new people, but it also cause some frustration. The biggest advantage is how it helps reduce petroll usage and cuts down on polution. However, busses can be crowded and slow, and they dont always muve on time. I feel that if more people used trains and subways, trafic chaos might desapear eventually. Still, I get annoyed when the schedule changes without notice. Some folks worry about comfort, but I find it easy to relax on my seat while traveling. The city can feel friendlier when you share space. On the other hand, many people prefer private cars, even though they cost so much to run. Maybe a better public system will make more citizens switch. I truly think it can create a stronger sense of community.

Both excerpts:

discuss the pros and cons of using public transportation
use relatively simple vocabulary
make several grammatical errors throughout.

To the untrained eye, the o1 version in some ways seems more “human”; it even uses personal anecdotes and a first-person perspective. There is no obvious indication of AI here, and phrases that LLMs tend to overuse, such as “In conclusion”, are present in the human text but missing from the o1 text.

While the differences may be subtle, or challenging to articulate, an objective, quantitative way of measuring them is to compare the outputs of a well-tuned AI detector such as GPTZero. Indeed, GPTZero predicts the above human text to be human-written with 99% probability, and the o1 text to be AI-generated with 100% probability.

The table and figures below show various performance metrics, highlighting the fact that we are still able to identify ChatGPT o1 output quite easily. Compared to other commercially available detectors such as Copyleaks or Originality, we consistently detect o1-generated texts while maintaining a false positive rate of zero on human texts in this benchmark.

Results of Jan 2025 Benchmarking: ChatGPT o1 vs. AI Detectors

Detector	Accuracy	Recall*	False Positive Rate
Copyleaks	89.11%	83.33%	5%
Originality (Lite)	80.2%	91.6%	31%
Originality (Turbo)	80.0%	97.2%	37%
Pangram Labs	93.6%	92.4%	5.2%
GPTZero	98.6%	97.2%	0.0%

Results from our “Confusion Matrix”

A confusion matrix visualizes the accuracy of a detector’s predictions across an entire dataset. For example, the bottom left cell counts the number of AI-generated texts that were correctly predicted as AI, and the top right cell counts the number of human texts we correctly predicted as human. GPTZero is the only detector which does not confuse any human texts for AI on this benchmark.

GPTZero correctly predicts 97.2% of AI texts with no mislabeling of human texts (false positives).
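As a rough illustration of how the four cells of a confusion matrix are tallied, here is a minimal Python sketch. The function name `confusion_counts`, the label strings "ai" and "human", and the toy data are all assumptions made for this example; they are not GPTZero's code or benchmark data.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Tally (true label, predicted label) pairs for a binary human/AI task."""
    pairs = Counter(zip(labels, predictions))
    return {
        "tp": pairs[("ai", "ai")],        # AI text correctly flagged as AI
        "fn": pairs[("ai", "human")],     # AI text missed (predicted human)
        "fp": pairs[("human", "ai")],     # human text wrongly flagged as AI
        "tn": pairs[("human", "human")],  # human text correctly left alone
    }


# Toy data, invented purely for illustration.
labels = ["ai", "ai", "ai", "human", "human", "human"]
predictions = ["ai", "ai", "human", "human", "human", "ai"]

counts = confusion_counts(labels, predictions)
print(counts)
```

A detector that never confuses human texts for AI would show a count of zero in the `fp` cell.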

*What is AI Recall?

Recall is simply the percentage of AI-generated texts that a detector classifies as AI. Typically, higher recall is accompanied by a higher false positive rate, but GPTZero achieves zero false positives on this benchmark while matching the highest recall of any detector tested (97.2%).
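To make these definitions concrete, the sketch below computes recall, false positive rate, and accuracy from confusion-matrix counts. The counts are not published figures: they are back-calculated from the GPTZero row of the results table (97.2% recall and a 0.0% false positive rate on 250 AI and 250 human documents), assuming every document received a prediction. `detector_metrics` is a hypothetical helper name.

```python
def detector_metrics(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from raw counts."""
    recall = tp / (tp + fn)                      # share of AI texts flagged as AI
    false_positive_rate = fp / (fp + tn)         # share of human texts flagged as AI
    accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all texts labeled correctly
    return {
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "accuracy": accuracy,
    }


# Assumed counts, inferred from the published percentages (not reported directly):
# 243 of 250 AI texts caught, 0 of 250 human texts flagged.
metrics = detector_metrics(tp=243, fn=7, fp=0, tn=250)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these assumed counts, the three values line up with the recall, false positive rate, and accuracy reported for GPTZero in the table.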

GPTZero’s Advantage in AI Detection

The AI landscape changes at a formidable pace, and any effective AI detector needs to make consistent efforts to stay ahead of these changes. As with o1, we regularly update our detector to account for changes in LLM architecture, prompting techniques, and even underlying training paradigms. This is how we remain the most reliable transparency layer in the face of even the most exciting AI advancements.

Why Fair AI Detection Matters

More powerful AI models tend to be more expensive. For example, o1 is only available to certain paid ChatGPT users. If a detector fails to detect o1-generated texts while still detecting texts generated by older models like gpt-4o-mini, the result is an unfair bias against people who are unable to pay for a better service. We believe in fairness and transparency for everyone, so we strive to limit discrepancies in accuracy across models and to make the top AI detection available to all users of our service. Whether you’re a free user or paying for our advanced AI detection, you’ll still get the most accurate results from our technology.

Written by Alex Adam
