Most sites claiming to catch AI-written text fail spectacularly

February 17, 2023 ndowd

As the fervor around generative AI grows, critics have called on the creators of the tech to take steps to mitigate its potentially harmful effects. Text-generating AI in particular has gotten a lot of attention, and with good reason. Students could use it to plagiarize, content farms could use it to spam and bad actors could use it to spread misinformation.

OpenAI bowed to pressure several weeks ago, releasing a classifier tool that attempts to distinguish between human-written and synthetic text. But it’s not particularly accurate; OpenAI estimates that it misses 74% of AI-generated text.

In the absence of a reliable way to spot text originating from an AI, a cottage industry of detector services has sprung up. GPTZero, developed by a Princeton University student, claims to use criteria including “perplexity” to determine whether text might be AI-written. Plagiarism detector Turnitin has developed its own AI text detector. Beyond those, a Google search yields at least a half-dozen other apps that purport to be able to separate the human-generated wheat from the AI-generated chaff, to torture the metaphor.
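For readers unfamiliar with the term, perplexity in the standard language-modeling sense measures how "surprised" a model is by a piece of text: it is the exponential of the average negative log-probability the model assigns to each token. Text a model finds very predictable scores low, which detectors may treat as a hint of machine authorship. The sketch below shows only that textbook formula. It is not the code of any detector named here, and the per-token probabilities are made-up values chosen for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean(log p)) for a sequence of per-token log-probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical probabilities a language model might assign to each token.
# These numbers are invented for illustration, not measured from any model.
predictable_text = [math.log(p) for p in (0.9, 0.8, 0.85, 0.9)]
surprising_text = [math.log(p) for p in (0.2, 0.05, 0.3, 0.1)]

print(f"predictable text perplexity: {perplexity(predictable_text):.2f}")
print(f"surprising text perplexity:  {perplexity(surprising_text):.2f}")
```

In practice the log-probabilities would come from an actual language model, and any cutoff separating "human" from "AI" would be a tuning choice rather than a fixed rule.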

But are these tools truly accurate? The stakes are high. In an academic setting, one can imagine a scenario in which a missed detection means the difference between a passing and failing grade. According to one survey, almost half of students say that they’ve used ChatGPT for an at-home test or quiz while over half admit having used it to write an essay.

To find out whether today’s AI text detection tools are up to snuff, we tapped a ChatGPT-like system called Claude, developed by AI startup Anthropic, to create seven samples of writing across a range of different styles. We specifically had Claude generate:

An encyclopedia entry for Mesoamerica
A marketing email for shoe polish
A college essay about the fall of Rome
A news article about the 2020 U.S. presidential election
A cover letter for a paralegal position
A resume for a software engineer
An outline for an essay on the merits of gun control

While admittedly not the most thorough approach, we wanted to keep it simple — the goal was to gauge the performance of AI text detectors on text originating from a single AI system (Claude). We tested several of the more popular detectors out there, namely OpenAI’s own classifier, AI Writing Check, GPTZero, Copyleaks, GPTRadar, CatchGPT and Originality.ai.

Encyclopedia entry

Claude’s encyclopedia entry reads like something out of Encyclopedia Britannica, complete with rich detail about the rise, fall and lasting impact of ancient Central American civilizations. The ideas flow well enough from paragraph to paragraph, albeit with a non sequitur (or two) thrown in, and the writing style aligns with what you might expect from an academic publication:

Mesoamerica is a region that encompasses southern Mexico and Central America. It was home to pre-Columbian societies that shared common cultural traits. The earliest complex societies in Mesoamerica arose around 3,000 years ago.

The Mesoamerican cultural region was dominated by maize cultivation and the rise of complex civilizations. Mesoamerican civilizations built iconic stepped pyramid temples and other monumental architecture. Many of these civilizations had rich mythological and religious traditions as well as complex calendrical systems and mathematics.

Major Mesoamerican civilizations included the Olmec, Teotihuacan, Maya, Zapotec, Mixtec, Huastec, Tarascan state, and the Aztec, which was encountered by the Spanish conquistador Hernán Cortés in the early 16th century. These Pre-Columbian civilizations are known for architectural feats, as well as advances in mathematics, astronomy, and calendrics. They are also known for the development of complex market exchange networks and trade routes connecting the Pacific and Atlantic coasts.

After conquering the Aztec empire, the Spanish built Mexico City on the ruins of the Aztec capital Tenochtitlan. The language of the Aztecs, Nahuatl, remains the most widely spoken indigenous language in Mexico today. Mesoamerican cultural traits such as maize cultivation, buildings with stepped pyramids, and monumental sculptures, have continued in the post-Conquest period and are still present in contemporary Mexican culture.

For those reasons, we predicted that the text would give the detectors some trouble, and it did. Of those tested, only one, GPTZero, correctly classified the text as AI-generated. The others fell short. OpenAI’s classifier initially wasn’t confident enough to arrive at an answer, while Originality.ai gave the text only a 4% chance of being AI-authored. Not the best look.

CatchGPT was fooled by the AI-generated text.

OpenAI classifier: Classified incorrectly
AI Writing Check: Classified incorrectly
GPTZero: Classified correctly
Copyleaks: Classified incorrectly
GPTRadar: Classified incorrectly
CatchGPT: Classified incorrectly
Originality.ai: Classified incorrectly

Marketing email

Claude’s marketing copy is a humorous blend of real and far-fetched details, but there’s no obvious tip-off that the text is AI-generated. It even includes a price and call to action. How neat! Ad copywriters be forewarned:

Subject: Get a Shine That Lasts

Are your shoes looking dull and worn? With Super Shine shoe polish, you can restore your shoes to a glossy, like-new shine and protect them from damage and wear.

Super Shine is made of the highest quality waxes and dyes and is available in a range of neutral and glossy colors to match any shoe type or leather. Our polish is uniquely formulated to clean, polish, and protect your shoes with a single application. The conditioning oils penetrate the leather to nourish it from within while the pigments cover scuffs and scratches and the protective wax shield seals the shine to repel water and other elements.

A perfect shine has never been easier—just wipe away dirt with a damp cloth, apply Super Shine with a soft brush, and buff to a brilliant shine. Our polish dries to a hard finish, so you won’t leave marks on your hands or clothes. And because a little goes a long way, a single tin will polish up to 100 pairs of shoes.

Never settle for lackluster-looking shoes again. For a shine that demands admiration, choose Super Shine—available for $9.99 at your local retailer

A poor showing from Originality.ai

The snippet stumped all of the detectors, incredibly. But to be fair, it was shorter in length than our encyclopedia entry. And detectors tend to perform better with lengthier samples of text, where the patterns are more obvious.

OpenAI classifier: Classified incorrectly
AI Writing Check: Classified incorrectly
GPTZero: Classified incorrectly
Copyleaks: Classified incorrectly
GPTRadar: Classified incorrectly
CatchGPT: Classified incorrectly
Originality.ai: Classified incorrectly

College essay

Claude couldn’t write us a very lengthy college essay owing to its technical limitations, but as if to make up for it, the AI packed as much detail as it could into a few short paragraphs. The sample has elements of a typical in-class essay, certainly, including a thesis statement, conclusion (if not an especially punchy one) and supporting references to historical events:

The Mighty Roman Empire: What Led to Its Downfall?

The Roman Empire was one of the greatest and longest-lasting empires in history. However, by the 5th century CE, the Roman Empire had entered a period of decline and fall. There were several major causes for the fall of Rome: political instability and corruption, barbarian invasions, economic decline, and over-reliance on slaves and the spoils of war.

Politically, the late Roman Empire was full of civil wars, coups, and rival claimants to the throne. This constant power struggle drained the empire of resources and made it difficult to deal with other threats. Emperors used the Praetorian Guard to seize power, but then they were assassinated and new emperors rose, continuing the cycle.

As the empire weakened, invading barbarian tribes saw an opportunity. In 410 CE, the Visigoths sacked Rome itself, a shocking blow to the empire’s prestige. Later, other tribes invaded and gained territory within the empire, sometimes being granted land in exchange for providing soldiers to the Roman army. These tribes weakened the empire, but they also introduced new customs and cultures.

Economically, Rome had a system based on conquest and spoils, not production. The cost of maintaining a large empire, legions of soldiers, and grand spectacles and entertainments was quite high. Once expansion slowed and new conquests brought in less wealth, the unsustainable economic system faltered. At the same time, slaves did much of the labor, and there was little incentive to improve agricultural technology or business practices.

In conclusion, the Roman Empire fell for many reasons, including political instability, barbarian invasions, economic decline, and over-reliance on slavery. Ultimately, it could not withstand the combined effects of these immense internal and external pressures. While the Roman Empire ultimately fell, its legacy lives on in our laws, culture, languages, and more—a lasting reminder of its enduring greatness.

The naturalness of the text was enough to defeat most of the classifiers once again, albeit fewer than with the marketing copy. That bodes poorly for educators hoping to rely on these tools; spotting AI-generated text is a far more nuanced task than detecting plagiarism.

A win for CatchGPT.

OpenAI classifier: Classified incorrectly
AI Writing Check: Classified incorrectly
GPTZero: Classified correctly
Copyleaks: Classified incorrectly
GPTRadar: Classified incorrectly
CatchGPT: Classified correctly
Originality.ai: Classified incorrectly

Essay outline

Most grade school kids can outline an essay. So can AI — without breaking a sweat, Claude spit out an outline for a pros-and-cons essay on the merits of gun control. It helpfully labeled each paragraph (e.g. “Body paragraph,” “Analysis and discussion”), maintaining a dispassionate tone about the divisive topic:

I) Introduction: Introduce the topic of gun control and note that while many people argue that individuals should have the right to bear arms, others contend that gun control legislation could help reduce gun violence.

II) Body paragraph 1: Pro-gun control argument: Advocates argue that easy access to firearms leads to more homicides, suicides and other gun-related deaths. They point out that states with more restrictive gun laws have fewer gun-related deaths. Stricter measures like background checks, waiting periods, and limits on assault weapons could help save lives by keeping guns out of the hands of criminals, terrorists, and other dangerous individuals.

III) Body paragraph 2: Anti-gun control argument: Opponents counter that people have a right to self-defense and that “good” people should be able to have firearms to protect against “bad” people. They contend that gun ownership is an important freedom and that responsible, law-abiding citizens should be able to own firearms. They argue that gun control legislation would not deter criminals, who would always find ways to get access to firearms. Stricter laws would only restrict freedoms of ordinary people and make them less safe.

IV) Analysis and discussion: Discuss perspectives on both sides and analyze merits and issues with each argument. For example, would gun restrictions reduce gun violence or just limit freedoms? Do restrictions disarm good citizens and put them at risk, or mainly deter irresponsible people or criminals? Could some measures like background checks achieve a balance? Explain complexities around the issues and rates of gun violence in different areas with varying levels of restrictions.

V) Conclusion: Wrap up the essay by restating your analysis and overall conclusions on the merits of gun control legislation. Note the multifaceted nature of the issues and argue for a way forward that could potentially reduce gun violence while respecting rights.

The outline might’ve fooled me, but the detectors had an easier time. Three — the OpenAI classifier, GPTZero and CatchGPT — caught on.

OpenAI’s classifier spotted the AI-generated text.

OpenAI classifier: Classified correctly
AI Writing Check: Classified incorrectly
GPTZero: Classified correctly
Copyleaks: Classified incorrectly
GPTRadar: Classified incorrectly
CatchGPT: Classified correctly
Originality.ai: Classified incorrectly

News article

As with the previous samples, there’s nothing obviously artificial about the news article we generated using Claude. It reads well, structured more or less in the inverted pyramid style. And it doesn’t contain obvious factual errors or logical inconsistencies:

Biden Defeats Trump in 2020 U.S. Election, Ending a Divisive Campaign

Former Vice President Joe Biden has defeated President Trump in an election that drew record numbers of voters and emerged as a referendum on Mr. Trump’s turbulent tenure.

Mr. Biden’s victory amounted to a repudiation of Mr. Trump by millions of voters after four profoundly turbulent years that fueled a heated debate about the nation’s values and future. Mr. Biden campaigned on plans to expand health care, address economic inequality and combat climate change, while vowing to ‘restore the soul of the nation’ and summon Americans to a common purpose.

In a brief statement on Saturday morning, Mr. Biden said he was ‘honored and humbled’ by the trust the American people had placed in him. ‘The battle is over, but the campaign to restore the soul of the nation has just begun,’ he said from Wilmington, Del., as celebrants cheered and honked car horns nearby. ‘It’s time for Americans to unite.’

Mr. Trump showed no sign of conceding, claiming without evidence that the election was ‘rigged’ and that his early leads in some states on election night showed he was the rightful winner. There were no major irregularities reported in an election that state election officials and outside experts said went smoothly in the middle of a deadly pandemic.

The outcome amounted to a repudiation of Mr. Trump’s divisive appeals to racial grievances and hard-line responses to the virus, which has claimed more than 232,000 lives in the United States, and left millions out of work.

It’s no wonder, then, that the detectors struggled. With the exception of GPTZero, none managed to classify the article correctly. Originality.ai went so far as to give it a 0% chance of being AI-generated. Big yikes.

AI Writing Check got it very wrong.

OpenAI classifier: Classified incorrectly
AI Writing Check: Classified incorrectly
GPTZero: Classified correctly
Copyleaks: Classified incorrectly
GPTRadar: Classified incorrectly
CatchGPT: Classified incorrectly
Originality.ai: Classified incorrectly

Cover letter

The cover letter we generated with Claude has all the hallmarks of a straightforward, no-nonsense professional correspondence. It highlights the skills of a fictional paralegal job candidate, inventing the name of a law firm (somewhat peculiarly) and making references to legal discovery tools like Westlaw and LexisNexis:

Dear Hiring Manager,

I am writing to express my strong interest in the paralegal role at your firm. I believe my experience and education in the legal field make me a great candidate for this position.

Over the past two years, I have worked as a paralegal at Smith & Jones Law Firm, where I have gained extensive experience supporting attorneys in all aspects of civil litigation cases. I have consistently organized and maintained thousands of pages of legal documents, including transcripts, affidavits, and discovery material. I have also streamlined the firm’s file management system, resulting in significant time savings. In addition, I have drafted correspondence with clients, opposing counsel, and third parties; assisted at trials; and completed legal research projects to support pre‐trial motions and settlement negotiations.

Prior to my role as a paralegal, I earned an Associate’s Degree in Paralegal Studies from [College Name]. My coursework and internship experiences provided a strong foundation in key areas such as legal research and writing, as well as knowledge of relevant software and databases including Westlaw and LexisNexis. I have kept my skills and knowledge up-to-date through ongoing professional development.

Outside of my work and education experience, I am a diligent and detail-oriented person, with excellent organizational and communication skills. I thrive in a fast-paced environment and am adept at balancing and prioritizing complex, time-sensitive tasks to meet tight deadlines. I would appreciate the opportunity to contribute to the success of your firm’s clients and cases.

Thank you for your consideration. I look forward to speaking with you further about this opportunity.

Sincerely,

[Your name]

The letter stumped OpenAI’s classifier, which couldn’t say with confidence whether it was AI- or human-authored. GPTZero and CatchGPT managed to spot the AI-generated text for what it was, but the rest of the detectors failed to achieve the same.

GPTZero impressively detected the AI-originated bits.

OpenAI classifier: Classified incorrectly
AI Writing Check: Classified incorrectly
GPTZero: Classified correctly
Copyleaks: Classified incorrectly
GPTRadar: Classified incorrectly
CatchGPT: Classified correctly
Originality.ai: Classified incorrectly

Resume

Pairing the fake cover letter with a fake resume seemed fitting. We told Claude to write one for a software engineer, and it delivered — mostly. Our imaginary candidate has an eclectic mix of programming skills, but none that stand out as particularly implausible:

• John Doe

• Software Engineer, 3 years of experience

• jdoe@email.com • 123-456-7890

• Technical Skills: Java, JavaScript, C++, SQL, MySQL, Git, Agile methodology, Software design, Algorithms, Data structures

• Professional Experience:

› ACME Corp, Software Engineer, 2018-Present

› Worked on core components of company’s flagship product, a SaaS-based big data analytics platform.

› Led design and development of the data ingestion module, capable of handling huge volumes of streaming data. Used Java and MySQL.

› Reduced upstream data errors by 42% through implementation of advanced data validation and correction algorithms.

› XYZ Tech Company, Software Engineer Intern, Summer 2017

› Developed back-end components for ecommerce company using JavaScript and Node.js.

› Prototyped and demonstrated scaling of core databases and APIs to handle 5x growth.

• Education:

› Bachelor’s degree in Computer Science, Big Tech University, 2017

› Courses included algorithms, operating systems, machine learning, software architecture, and theory of computation.

› 3.8 GPA

• Skills: analytical, communication, problem-solving, detail-oriented

• Interests: running, reading, and hiking

Evidently, most of the detectors agree. The fake resume even stumped GPTZero, which up until this point had been the most reliable of the bunch.

GPTZero can’t win ’em all.

OpenAI classifier: Classified incorrectly
AI Writing Check: Classified incorrectly
GPTZero: Classified incorrectly
Copyleaks: Classified incorrectly
GPTRadar: Classified incorrectly
CatchGPT: Classified correctly
Originality.ai: Classified incorrectly

The trouble with classifiers

After all that testing, what conclusions can we draw? Generally speaking, AI text detectors do a poor job of… well, detecting. GPTZero was the only consistent performer, classifying AI-generated text correctly five out of seven times. As for the rest… not so much. CatchGPT was second best in terms of accuracy with four out of seven correct classifications, while the OpenAI classifier came in a distant third with one out of seven.
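Those totals can be rechecked directly from the per-sample scorecards above. The short sketch below copies which detectors were listed as "Classified correctly" for each sample and counts them up. The variable and function names are our own, chosen for illustration.

```python
DETECTORS = [
    "OpenAI classifier",
    "AI Writing Check",
    "GPTZero",
    "Copyleaks",
    "GPTRadar",
    "CatchGPT",
    "Originality.ai",
]

# For each sample, the detectors listed as "Classified correctly" in the
# scorecards above. Every other detector was listed as incorrect.
CORRECT_BY_SAMPLE = {
    "Encyclopedia entry": {"GPTZero"},
    "Marketing email": set(),
    "College essay": {"GPTZero", "CatchGPT"},
    "Essay outline": {"OpenAI classifier", "GPTZero", "CatchGPT"},
    "News article": {"GPTZero"},
    "Cover letter": {"GPTZero", "CatchGPT"},
    "Resume": {"CatchGPT"},
}


def tally(correct_by_sample, detectors):
    """Count how many samples each detector classified correctly."""
    return {
        name: sum(name in hits for hits in correct_by_sample.values())
        for name in detectors
    }


scores = tally(CORRECT_BY_SAMPLE, DETECTORS)
total = len(CORRECT_BY_SAMPLE)

# Print detectors from best to worst; ties keep the listing order above.
for name, hits in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {hits}/{total} correct")
```

Seven samples is a very small test set, so these counts are best read as a rough ranking rather than a measured accuracy rate.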

So why are AI text detectors so unreliable?

Detectors are essentially AI language models trained on many, many examples of publicly available text from the web and fine-tuned to predict how likely it is a piece of text was generated by AI. During training, the detectors compare text to similar (but not exactly the same) human-written text from websites and other sources to try to learn patterns that give the text’s origin away.

The trouble is, the quality of AI-generated text is constantly improving, and the detectors are likely trained on lots of examples of older generations. Unless they’re retrained on a near-continuous basis, the classifier models are bound to become less accurate over time.

Of course, any of the classifiers can be easily evaded by modifying some words or sentences in AI-generated text. For determined students and fraudsters, it’ll likely become a cat-and-mouse game. As text-generating AI improves, so will the detectors.

While the classifiers might help in certain circumstances, they’ll never be a reliable sole piece of evidence in deciding whether text was AI-generated. That’s all to say that there’s no silver bullet to solve the problems AI-generated text poses. Quite likely, there won’t ever be.
