America Has a Pangram Problem
· The Atlantic
Basically every recent, high-profile accusation of someone passing off AI-generated writing as their own has started in the same way: with a tool called Pangram. In March, when a horror novel from a major publishing house was pulled just days before its scheduled U.S. release date, it was in part because Pangram, an AI-detection program, had identified the text as AI-generated. Other people have fed text into Pangram to suggest that chatbots have been used to write articles in major newspapers including The New York Times, multiple short stories awarded a prestigious literary prize, and most recently, significant chunks of Pope Leo XIV’s encyclical warning about the dangers of AI. The tool is also used by universities to vet student work and scientific associations to scan research papers. As panic builds over AI-generated writing, Pangram is at the foundation.
Just a few years ago, it seemed like it might never be possible to instantly and reliably determine whether a piece of text was written by a bot or a person. In 2023, one detection tool, ZeroGPT, declared the U.S. Constitution to be AI-written; the same year, OpenAI abandoned its AI detector altogether owing to a “low rate of accuracy.” And that was when the quality of ChatGPT’s writing was markedly worse than it is today. But detection tools have gotten much better of late—and Pangram, in particular, has emerged as the gold standard: Paste a chunk of text into Pangram, and the model appraises what portions were “AI Generated,” “AI Assisted,” or “Human Written.”
Yet an AI detector that is mostly reliable might in some ways be more dangerous than a broken one. While Pangram is accumulating the power to end reputations and careers, the tool does make mistakes, perhaps to a greater extent than is currently understood. In turn, AI accusations could very quickly spiral into a witch hunt.
Pangram says its algorithm is so accurate that it incorrectly identifies text as an AI output only about one in every 10,000 times. “There is a great responsibility, a huge weight” in saying something is AI-generated, Max Spero, Pangram’s CEO, told me. “The only reason we do so is because we’re extremely confident.” Several independent analyses have also confirmed that it is quite good. One paper, from the University of Chicago, found that Pangram had almost no false positives on some 3,000 sample texts of roughly 500 to 1,000 words.
But Pangram’s ability to guarantee something was written by a human is shakier. Spero pointed me to a test showing that Pangram’s false-negative rate, or how frequently the model incorrectly labels text as human, is closer to one-in-70 (although some other assessments say it is more accurate than that).
Part of the problem is that Pangram is in an arms race with the major AI labs, which have an interest in making the writing of ChatGPT and Claude sound as natural and human as possible. And at the same time, Pangram has to deal with AI “humanizers”—programs designed explicitly to disguise AI text as your own. Reddit users rave about a humanizer called Walter Writes AI, which I decided to test out for myself. I had ChatGPT and Claude write brief articles, then pasted them into Walter Writes AI. The program, like other humanizer tools, does some anodyne rewording, swaps one clunky transition clause for another, and introduces grammatical oddities. For instance, ChatGPT’s “The numbers are no longer small enough to ignore” became “The sheer size of these usage figures can no longer be ignored.” When I pasted any output from Walter Writes AI into Pangram, it invariably told me that the twice-baked AI article was human-written. (It’s worth mentioning that The Atlantic forbids using AI-generated text unless labeled as such, and that I do not use AI for research.)
Pangram, in other words, can only provide so much insight. A teacher at a public high school in New York City told me that he has “run some of my students’ papers through Pangram, and it shows up as 100 percent human. And I don’t think it is.” He knows what his kids are capable of and, especially for those with a history of cheating with AI, has ample reason to doubt Pangram. (I agreed not to identify the teacher by name so that he could speak freely about how he suspects his students are using AI.) But on the flip side, accusing a student of getting undisclosed help from a chatbot on the basis of circumstantial evidence is high stakes: The student will either fail or, if exonerated, be bitter and resentful. “The stakes are so high,” the teacher said, “but our way of assessing what is AI-generated is still so unformed.”
Further complicating matters are the opaque ways in which Pangram and similar tools are designed. The model was trained by feeding it mountains of examples written by a human and by a bot—a book review in an actual magazine, then a review about the same book in the style of the same magazine, but produced by ChatGPT—until it can tell the two apart. This is akin to feeding millions of photos of cats and dogs into an image-recognition algorithm until it learns to spot the differences. Pangram cannot point to much specific evidence or patterns in diction, phrasing, or punctuation to support why it deems something AI or human. (I do not, for instance, understand why “these usage figures” was more human than “the numbers.”) Moreover, while Pangram distinguishes between “lightly” and “moderately AI-assisted,” these broad categories can mean just about anything short of copy-pasting from Claude—using AI for research, coming up with counterarguments, as a thesaurus, for a grammar check. The algorithm’s inner workings are “pretty uninterpretable,” Spero said, and although he wants to make Pangram’s “AI-assisted” label more granular, he is also “still not sure how possible it is.” Amid concerns of overreliance on AI chatbots, we risk simply layering on dependence on yet another black-box algorithm.
Spero told me that Pangram should “never be the ending arbiter” but instead a starting point for a more thorough investigation, and that the company looks into every reported error its model makes. He also noted that all sorts of detection technology we rely on—smoke detectors, TSA scanners—have base error rates too. On some level, in all these cases the biggest problems lie not in the technologies themselves but in what they’re trying to detect. It’s a problem that buildings catch on fire. It’s a problem that AI is seeping haphazardly into every facet of written communication.
As AI-writing accusations continue to escalate, though, there will only be greater reliance on Pangram—or whatever AI detector can dethrone it—to convict or exonerate. Consider that Pangram can connect to Canvas, the popular education platform, allowing teachers to use it to scan student submissions. There are more than 10 million high schoolers in the United States and some 20 million undergraduates, each of whom likely submits many dozens of written assignments every year. At that scale, Pangram would produce plenty of false accusations even with a one-in-10,000 error rate.
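To make that scale concrete, here is a rough back-of-the-envelope sketch rather than a forecast. The student counts and the one-in-10,000 rate come from the figures above. Everything else is an assumption made for illustration: that every student's work is scanned, that all of it is human-written, and that "many dozens" of assignments means just two dozen a year (the name `ASSIGNMENTS_PER_STUDENT` and that value are hypothetical).

```python
# Rough estimate of false AI flags per year if every student were scanned.
HIGH_SCHOOLERS = 10_000_000       # "more than 10 million", so a lower bound
UNDERGRADUATES = 20_000_000       # "some 20 million"
FALSE_POSITIVE_RATE = 1 / 10_000  # Pangram's stated rate

# Assumption for illustration only: two dozen human-written
# assignments per student per year, a conservative reading of "many dozens".
ASSIGNMENTS_PER_STUDENT = 24


def expected_false_flags(students, assignments_per_student, false_positive_rate):
    """Expected number of human-written texts wrongly flagged as AI."""
    return students * assignments_per_student * false_positive_rate


total_students = HIGH_SCHOOLERS + UNDERGRADUATES
flags = expected_false_flags(
    total_students, ASSIGNMENTS_PER_STUDENT, FALSE_POSITIVE_RATE
)
print(f"Expected false flags per year: {round(flags):,}")
```

Under those assumptions, the estimate comes to roughly 72,000 wrongly flagged assignments a year. Partial adoption would shrink the figure proportionally, and a heavier assignment load would raise it.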
Nor is it guaranteed that Pangram will improve or even maintain its current ability to spot AI prose. As chatbots and AI humanizers adjust, AI detection “will wax and wane in its effectiveness for reasons we can’t predict, at times we can’t predict,” Tim Requarth, a neuroscientist who teaches science writing at NYU and has written extensively about AI detection, told me. Even as schools, publishers, scientific institutions, and the like come to rely more on AI detection, any third-party assessments of Pangram’s accuracy will be from weeks, if not many months, in the past—which in the accelerating world of AI renders them all but obsolete. Basing any AI rules or norms on the reliability of AI detection is like building a sandcastle at low tide.
All of this seems like a disaster in the making. The murkiness and ambiguity of AI detection creates room to launch or deny accusations of nearly any sort. Earlier this month, the technology journalist Taylor Lorenz was accused on X of using AI to write a story for Vanity Fair, which she vehemently denied. Spero investigated and, as he detailed on X, found that Pangram had erred. “Thank god for edit history,” Lorenz told me. The experience heightened Lorenz’s concerns about such allegations: “I’m so paranoid,” she said.
“AI-generated” and “AI-assisted” can be easily confused, by accident or in bad faith. James Taranto, an editor at The Wall Street Journal, recently called Pangram a “defamation machine” and claimed it had falsely flagged three op-eds in his newspaper as AI-generated; two of the implicated authors admitted to using AI to revise some of their work, which Taranto wrote is “inaccurate and unfair to characterize” as “AI-generated.” One of the people who first used Pangram to analyze Pope Leo’s encyclical noted that, because only some sections seemed AI-generated or AI-assisted, perhaps it was not the pope himself but some senior Vatican officials who had used AI while drafting portions of the text. That didn’t stop headlines such as “Did the Pope Use AI to Write About the Dangers of AI?” (The Vatican did not respond to a request for comment, although a writer who covers the Vatican said on X that the AI allegations are “100 percent false” and that Leo actually drafted the encyclical with pen and paper.)
All of this recalls another recent moral outrage over alleged writerly misconduct: The plagiarism wars of 2023 and ’24, when right-wing activists such as Christopher Rufo mobilized to accuse high-profile academics and university leaders of plagiarism—most notably leading to the resignation of then–Harvard president Claudine Gay. Many of these accusations were spurious and likely based on the assessments of plagiarism-detection algorithms that, as my colleague Ian Bogost judged at the time, were fairly useless. The AI-detection wars to come may be even more contentious.
Pangram, to be clear, is not useless. But this is exactly the problem: It’s too easy to twist and contest Pangram’s conclusions, especially when nobody really agrees on which uses of AI are or aren’t ethical. Just like chatbots, AI-detection tools have become effective enough for widespread use, but not reliable enough to fully trust. In this way, Pangram and other detectors are mirror images of the AI products they are hunting for.