GPTZero at NAACL 2024: Key takeaways and highlights from our talks

Every year, the North American Chapter of the Association for Computational Linguistics (NAACL) conference is one of the most important gatherings in the computational linguistics and natural language processing communities. This year’s conference covered exciting topics such as large language models (LLMs) and indigenous languages, LLMs and education, robotics, and neuroscience, to name a few.

Alex Adam from our ML team and our advisor Dr. Dongwon Lee spoke at the conference this year. If you missed NAACL 2024, here are some key highlights of what the GPTZero team covered.

Dr. Lee’s tutorial presentation 

With a group of current and former students, Dr. Lee spoke about deepfakes and AI-generated text detection. The tutorial drew both attendees interested in the technical details and those considering the societal implications of deepfakes. Dr. Lee also covered some significant advancements and challenges in AI text detection. Thai Le, a former student and now an assistant professor at Indiana University, highlighted that watermarking, while widely adopted by LLM providers, can be easily bypassed through translation between languages; it is not the silver-bullet solution that some have touted.

He subsequently underscored the need to move from asking, “Was this text AI-generated?” to asking, “Is this AI-generated text factually grounded?” (top left, Figure 1). As more text on the web is AI-generated, the focus will shift to correctness rather than origin. This line of academic thinking aligns closely with GPTZero’s original mission and our core values as a team: enhance the reliability of AI-generated content, and preserve human creativity and critical thinking.

Alex’s talk on the technical evolution of GPTZero

GPTZero is essentially as old as widespread AI usage. Even though the company was founded only a short time ago, our methodology has changed considerably since then. At first, we used a statistical approach to detect AI-generated text, based on a combination of perplexity and burstiness.
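To make that concrete, here’s a minimal sketch (not our production code) of how perplexity and a simple burstiness proxy can be computed with an off-the-shelf GPT-2 model; the thresholds at the end are illustrative assumptions.

```python
# A minimal sketch of an early-style statistical detector (not GPTZero's production code).
# Assumes the Hugging Face `transformers` and `torch` packages are installed.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean token-level loss."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def burstiness(text: str) -> float:
    """One common proxy: how much per-sentence perplexity varies across the document."""
    sentences = [s.strip() for s in text.split(".") if len(s.split()) > 3]
    if len(sentences) < 2:
        return 0.0
    ppls = [perplexity(s) for s in sentences]
    mean = sum(ppls) / len(ppls)
    return math.sqrt(sum((p - mean) ** 2 for p in ppls) / (len(ppls) - 1))

# Hypothetical decision rule: uniformly low perplexity with little variation looks AI-like.
doc = "Paste a document here."
looks_ai = perplexity(doc) < 40 and burstiness(doc) < 15  # thresholds are illustrative
```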

Later, we incorporated a deep learning model ensembled with perplexity-based features. As our training data has scaled, we’ve been able to rely on a deep learning model alone, improving accuracy while simplifying our inference pipeline. We’re now actively researching architectures that combine auxiliary features with the text itself.
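As a rough illustration of what combining auxiliary features with text can look like, here’s a small PyTorch sketch; the encoder choice, feature count, and classification head are assumptions for the example, not a description of our actual architecture.

```python
# A small PyTorch sketch of one way to combine a text encoder with auxiliary features
# (e.g., perplexity statistics). Illustrative only; not GPTZero's actual architecture.
import torch
import torch.nn as nn
from transformers import AutoModel

class TextPlusFeaturesClassifier(nn.Module):
    def __init__(self, encoder_name: str = "distilroberta-base",
                 num_aux_features: int = 4, num_classes: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Concatenate the pooled text representation with the auxiliary features.
        self.head = nn.Sequential(
            nn.Linear(hidden + num_aux_features, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask, aux_features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token ("CLS"-style) embedding
        return self.head(torch.cat([pooled, aux_features], dim=-1))
```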

The prevalence and subject matter of AI writing

We deal with millions of documents each month. Our data generation pipeline is crucial for remaining adaptable to newly released LLMs.

Over time, our models have classified documents as AI-generated over 40% of the time. The majority of user submissions are scientific papers, literary analyses, technical manuals, news articles, essays, or literature reviews. Knowing this distribution gives us a sense of what users care about, which then helps guide both our training data gathering efforts and our evaluation data.

GPTZero’s Deep Scan feature

Responsible use of our detector is easier when users understand why it made a given prediction. Our Deep Scan feature shows the top sentences that increase the probability of AI and the top sentences that increase the probability of human-originated writing, along with contribution-based highlighting for each sentence.

This feature is helpful for understanding both correct predictions and errors. For example, if removing two sentences from a long document can flip the prediction, chances are our model’s prediction is not reliable for that particular document.
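Conceptually, sentence-level contributions can be estimated with a leave-one-out ablation, as in the sketch below; predict_proba_ai is a hypothetical helper that returns the model’s probability of AI for a piece of text.

```python
# Illustrative leave-one-out attribution for sentence-level contributions.
# `predict_proba_ai` is a hypothetical helper returning P(AI) for a piece of text.
def sentence_contributions(sentences, predict_proba_ai):
    base = predict_proba_ai(" ".join(sentences))
    contributions = []
    for i, sent in enumerate(sentences):
        without = " ".join(sentences[:i] + sentences[i + 1:])
        # Positive delta: removing the sentence lowers P(AI), so it pushed toward AI.
        contributions.append((sent, base - predict_proba_ai(without)))
    # Most AI-pushing sentences first; most human-pushing sentences last.
    return sorted(contributions, key=lambda c: c[1], reverse=True)
```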

The 3-class classification problem

We treat AI detection as a 3-class classification problem. What does that mean? We consider texts that combine AI and human-written paragraphs to belong to a separate, mixed class. This happens when a user interweaves assignment instructions within their answers, or generates some of the document themselves and then decides to finish the rest with an LLM.

Treating AI detection as a binary classification problem (purely “human” or purely “AI”) requires the predicted score to do too much at once: it would have to represent both confidence and the percentage of the document that is AI. Such a score is usually too difficult for the user to interpret.
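A small, purely illustrative example of why the extra class helps: a 3-class distribution can distinguish “this document is half AI” from “we’re genuinely unsure,” whereas a single binary score of 0.5 conflates the two.

```python
# Why a single binary score is ambiguous: 0.5 could mean "half of this document is AI"
# or "we have no idea". A 3-class distribution keeps those cases apart.
# All probabilities below are purely illustrative.
half_ai_document = {"human": 0.05, "ai": 0.05, "mixed": 0.90}   # confident: mixed doc
genuinely_unsure = {"human": 0.36, "ai": 0.34, "mixed": 0.30}   # low confidence overall

for name, probs in [("half-AI document", half_ai_document),
                    ("genuinely unsure", genuinely_unsure)]:
    label = max(probs, key=probs.get)
    print(f"{name}: top class '{label}' at {probs[label]:.0%}")
```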

Evaluation challenges and estimating out-of-distribution (OOD) performance

Some humans tend to have a mechanistic writing style that looks very similar to what GPT-4 would generate. Correctly classifying those documents is challenging.

Further, different document types may require specific prompting techniques. For example, two papers on generating Yelp reviews have emphasized the need to explicitly state that the generated reviews should be short or concise; otherwise, GPT-4 tends to generate reviews that are noticeably longer than the provided human examples.
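For instance, a generation prompt may need an explicit length constraint along these lines (the wording below is a hypothetical illustration, not a prompt taken from those papers):

```python
# Hypothetical prompt templates illustrating the length issue; not prompts from the
# papers themselves. Without the explicit constraint, GPT-4 tends to produce reviews
# much longer than real human ones.
NAIVE_PROMPT = "Write a Yelp review for this restaurant: {restaurant}"
CONSTRAINED_PROMPT = (
    "Write a short, concise Yelp review (2-4 sentences) for this restaurant, "
    "matching the tone and length of typical user reviews: {restaurant}"
)
```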

Another evaluation challenge is that a single predictive model across all document types, languages, and text lengths is not optimal. In practice, we deploy multiple models, each focusing on a particular subset of documents (a practice that is currently quite expensive).
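In spirit, that deployment looks like routing each document to a specialist model; the sketch below is illustrative, and the model names and threshold are made up.

```python
# Illustrative routing sketch: send each document to a specialist detector based on
# language and length. Model names and the length threshold are made up.
def route(text: str, language: str) -> str:
    if language != "en":
        return "multilingual-detector"
    if len(text.split()) < 150:
        return "short-text-detector"
    return "general-detector"
```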

Creating a dynamic benchmark

Having a high volume of data alone is not sufficient. The data needs to be diverse and representative of what our users care about.

For example, it’s better to train on 1,000 news articles, 1,000 blog posts, and 1,000 essays on diverse topics and in different writing styles than on 3,000 news articles on the same topic, written in the same style.
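One simple way to enforce that kind of balance is to cap how much any single slice of data can contribute, as in this hypothetical sketch (the bucket keys and cap are assumptions):

```python
# Sketch: cap each (document type, topic) bucket so no single slice dominates training.
# Assumes each document carries `doc_type` and `topic` metadata; the cap is illustrative.
from collections import defaultdict

def balanced_sample(docs, per_bucket: int = 1000):
    buckets = defaultdict(list)
    for doc in docs:
        buckets[(doc["doc_type"], doc["topic"])].append(doc)
    return [d for bucket in buckets.values() for d in bucket[:per_bucket]]
```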

A comprehensive benchmark capable of estimating OOD performance is crucial; otherwise, reported performance metrics are optimistic, and actual performance may not be suitable for the user’s scenario.

We’ve been working with our advisory lab on creating a dynamic benchmark which will cover a variety of:

  • Generators
  • Prompts
  • Human sources

A static benchmark risks over-optimization to a particular set of data, though it is common in many domains of machine learning. Making our benchmark dynamic removes this risk, and it also has the advantage of capturing evolving LLM and human writing patterns, e.g., GPT-5 or new human slang.
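In practice, a dynamic benchmark can be thought of as a grid that gets regenerated on a schedule; the sketch below is illustrative, and every generator, prompt, and source name in it is a placeholder.

```python
# Illustrative sketch of a dynamic benchmark spec: each refresh regenerates evaluation
# data over a grid of generators and prompts, plus matched human sources.
# Every name below is a hypothetical placeholder.
from itertools import product

GENERATORS = ["gpt-4o", "claude-3.5", "llama-3-70b"]
PROMPTS = ["essay", "news_article", "literature_review"]
HUMAN_SOURCES = ["pre-2022 essays", "newsroom archive", "recent forum posts"]

def benchmark_cells():
    """Yield one evaluation cell per (generator, prompt) pair, plus human-only cells."""
    for generator, prompt in product(GENERATORS, PROMPTS):
        yield {"generator": generator, "prompt": prompt}
    for source in HUMAN_SOURCES:
        yield {"human_source": source}
```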

Moving beyond AI detection

At GPTZero, our goal has always been to move beyond simple “AI detection” in writing and to act as a transparency and verification layer for text content. To enable responsible use of our features, we need to give users guidance on what to do when texts are detected as AI.

To this end, our new Editor feature allows an AI-generated document to be transformed into human writing by requiring two things (a conceptual sketch follows the list):

  • A minimum amount of editing
  • The final version to be predicted as human by our detector
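
Conceptually, that gate might look like the sketch below; predict_label and the edit-distance threshold are hypothetical stand-ins, not our production logic.

```python
# Illustrative gate for the Editor flow: require both a minimum amount of editing and a
# final "human" prediction. `predict_label` and the 0.2 threshold are hypothetical.
import difflib

def enough_editing(original: str, revised: str, min_change: float = 0.2) -> bool:
    similarity = difflib.SequenceMatcher(None, original, revised).ratio()
    return (1.0 - similarity) >= min_change

def can_mark_as_human(original: str, revised: str, predict_label) -> bool:
    return enough_editing(original, revised) and predict_label(revised) == "human"
```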

We also plan to build out our ML teams to stay ahead of the curve as ML algorithms grow more sophisticated and more research is done in this field.

After a successful NAACL conference, we remain steadfast in our dedication to setting new standards around AI detection and content verification. We’re looking forward to developing even more sophisticated ML models and benchmarks to guide us in our work.