Behind the Scenes: Multilingual Detection

AI-generated content is an international problem, which makes detecting AI across multiple languages an urgent priority. Here at GPTZero, we’ve just launched our most advanced multilingual model yet.

Here’s what’s new: 

  • We’ve added support for German and Portuguese, expanding the range of languages our detection covers.
  • We retrained our multilingual model, including its French and Spanish coverage, on more recent data.
  • For Portuguese, we trained on additional essays to better align the model with the real-world classroom work teachers are actually reviewing.
  • We’ve also added all-new English data collected since our last multilingual model update.
  • Most importantly, we’ve outperformed our previous baseline, with higher F1 scores on every new language-specific benchmark we created. 

Our results

We evaluated our model on each of the languages we support. The benchmarks are split roughly 50/50 between human-written and AI-generated text and contain:

  • 4k French texts from 2 new datasets
  • 4.5k Spanish texts from 3 new datasets
  • 7.5k German texts from 4 new datasets
  • 2.5k Portuguese texts from 3 new datasets

This table contains the F1 score at a false positive rate of 0.01 for each language:

| Language Benchmark | F1 Score | % Improvement from Previous Model |
|---|---|---|
| French | 0.989 | 3.18% |
| Spanish | 0.991 | 2.31% |
| German | 0.987 | 1.99% |
| Portuguese | 0.986 | 1.50% |

Table 1: F1 Scores Per Language on Internal Benchmark
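
A note on how this metric is computed: the detector assigns each text a score, and F1 is evaluated at the decision threshold where roughly 1% of human-written texts would be flagged. Here is a minimal sketch of that calculation in Python, assuming the model outputs a probability per document; the `f1_at_fpr` helper is illustrative rather than our production evaluation code.

```python
# Minimal sketch: F1 at a fixed false positive rate (illustrative only).
import numpy as np

def f1_at_fpr(scores: np.ndarray, labels: np.ndarray, target_fpr: float = 0.01) -> float:
    """scores: model probability that a text is AI-generated.
    labels: 1 = AI-generated, 0 = human-written."""
    human_scores = scores[labels == 0]
    # Pick the threshold so that only ~target_fpr of human texts score above it.
    threshold = np.quantile(human_scores, 1.0 - target_fpr)
    preds = scores > threshold

    tp = np.sum(preds & (labels == 1))   # AI texts correctly flagged
    fp = np.sum(preds & (labels == 0))   # human texts incorrectly flagged
    fn = np.sum(~preds & (labels == 1))  # AI texts missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```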

This table contains the overall accuracy, recall, and false positive rate for the same benchmarks:

| Language Benchmark | Accuracy | Recall | FPR |
|---|---|---|---|
| French | 99.0% | 98.4% | 0.484% |
| Spanish | 98.9% | 98.2% | 0.426% |
| German | 96.1% | 93.4% | 1.37% |
| Portuguese | 97.1% | 95.1% | 0.875% |

Table 2: More Performance Metrics Per Language on Internal Benchmark
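
Because each benchmark is split roughly 50/50 between human and AI text, the metrics in Table 2 are tied together: accuracy is approximately the average of recall and (1 − FPR). A quick sanity check against the French row, treating the split as exactly 50/50 (an approximation):

```python
# Roughly 50/50 benchmark: accuracy ≈ (recall + (1 - FPR)) / 2
recall, fpr = 0.984, 0.00484   # French row of Table 2
accuracy = (recall + (1 - fpr)) / 2
print(f"{accuracy:.1%}")       # -> 99.0%, matching the reported accuracy
```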

Our false positive rate (FPR) for multilingual detection is higher than for English detection, but it is still industry-leading. We are continuously improving our detector on these new languages, especially on the more challenging human-written texts with a mechanical writing style, which are the most likely to trigger false positives.

Finally, to compare our performance against our competitors, we created a benchmark of 1,000 texts representative of all of our new multilingual data. This benchmark contains examples evenly split across every language and dataset, and between human-written and AI-generated text (a sketch of this kind of sampling follows the table):

| AI Detector | Accuracy | Recall | FPR |
|---|---|---|---|
| GPTZero | 99.2% | 98.5% | 0.0% |
| Copyleaks | 95.0% | 96.0% | 6.12% |
| Originality | 92.5% | 99.1% | 14.1% |

Table 3: External Benchmark Results Across Competitors
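
The even split described above is essentially stratified sampling. Below is a minimal sketch of that kind of selection, assuming each candidate text is tagged with its language, source dataset, and label; the `even_sample` helper and field names are hypothetical, not our actual pipeline.

```python
# Illustrative sketch: draw an evenly stratified benchmark sample.
import random
from collections import defaultdict

def even_sample(texts, total=1000, seed=0):
    """texts: list of dicts with 'language', 'dataset', 'is_ai', and 'text' keys."""
    groups = defaultdict(list)
    for t in texts:
        groups[(t["language"], t["dataset"], t["is_ai"])].append(t)
    per_group = total // len(groups)          # equal share for every stratum
    rng = random.Random(seed)
    return [
        doc
        for members in groups.values()
        for doc in rng.sample(members, min(per_group, len(members)))
    ]
```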

We outperform both competing detectors, with the highest accuracy and the lowest FPR. Keeping the false positive rate low is important for minimizing incorrect accusations of AI use, and it remains a priority at GPTZero.

Why multilingual detection matters

AI-generated content isn’t always written in English. As LLMs grow more fluent across dozens of languages, there’s a higher risk of undetected AI content slipping through – especially in regions where detection tools haven’t kept up. 

Language-specific detection should go further than coverage alone: it means understanding how AI-generated content behaves differently across linguistic contexts. Tone and sentence structure vary between, for example, a Portuguese student essay and a German policy brief. Without models trained to recognize these differences, detection becomes less effective (and, in some cases, unfair).

At GPTZero, we’re building tools that go beyond keeping up, and instead anticipate where AI is heading next. This includes addressing the nuanced challenges of multilingual AI detection. As AI becomes more fluent in every language, we know that our tools need to do the same.