
Yes, AI Models Can Get Worse over Time

More training and more data can have unintended consequences for machine-learning models such as GPT-4


When OpenAI released its latest text-generating artificial intelligence, the large language model GPT-4, in March, it was very good at identifying prime numbers. When the AI was given a series of 500 such numbers and asked whether they were primes, it correctly labeled them 97.6 percent of the time. But a few months later, in June, the same test yielded very different results. GPT-4 correctly labeled only 2.4 percent of the prime numbers AI researchers prompted it with—a complete reversal in apparent accuracy. The finding underscores the complexity of large artificial intelligence models: instead of AI uniformly improving at every task on a straight trajectory, the reality is much more like a winding road full of speed bumps and detours.
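
To make that test concrete, here is a minimal sketch, in Python, of how such a check can be scored: an ordinary primality test supplies the ground truth, each number is turned into a prompt, and matching answers are tallied. It is not the study's own code, and ask_model is a hypothetical stand-in for whatever chat-model client a researcher might use.

```python
# A minimal sketch of the kind of primality check described above.
# `ask_model` is a hypothetical stand-in for a chat-model client; the
# study's actual prompts and 500 test numbers are not reproduced here.

def is_prime(n: int) -> bool:
    """Trial-division primality test, used as ground truth."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def score_model(ask_model, numbers):
    """Fraction of numbers whose primality the model labels correctly."""
    correct = 0
    for n in numbers:
        reply = ask_model(f"Is {n} a prime number? Answer yes or no.")
        says_prime = reply.strip().lower().startswith("yes")
        correct += says_prime == is_prime(n)
    return correct / len(numbers)

# Usage with a real client would look like score_model(my_client, primes),
# where `primes` is the list of test numbers and `my_client` sends the
# prompt to the model and returns its text reply.
```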

The drastic shift in GPT-4’s performance was highlighted in a buzzy preprint study released last month by three computer scientists: two at Stanford University and one at the University of California, Berkeley. The researchers ran tests on both GPT-4 and its predecessor, GPT-3.5, in March and June. They found lots of differences between the two AI models—and also across each one’s output over time. The changes in GPT-4’s behavior over just a few months were particularly striking.

Across two tests, including the prime number trials, the June GPT-4 answers were much less verbose than the March ones. Specifically, the June model became less inclined to explain itself. It also developed new quirks. For instance, it began to append accurate (but potentially disruptive) descriptions to snippets of computer code that the scientists asked it to write. On the other hand, the model seemed to get a little safer; it filtered out more questions and provided fewer potentially offensive responses. For instance, the June version of GPT-4 was less likely to provide a list of ideas for how to make money by breaking the law, offer instructions for how to make an explosive or justify sexism or racism. It was less easily manipulated by the “jailbreak” prompts meant to evade content moderation firewalls. It also seemed to improve slightly at solving a visual reasoning problem.


When the study (which has not yet been peer reviewed) went public, some AI enthusiasts saw it as proof of their own anecdotal observations that GPT-4 was less useful than its earlier version. A handful of headlines posed the question, “Is ChatGPT getting dumber?” Other news reports more definitively declared that, yes, ChatGPT is becoming stupider. Yet both the question and that supposed answer are likely an oversimplification of what’s really going on with generative AI models, says James Zou, an assistant professor of data science at Stanford University and one of the recent study’s co-authors.

“It’s very difficult to say, in general, whether GPT-4 or GPT-3.5 is getting better or worse over time,” Zou explains. After all, “better” is subjective. OpenAI claims that, by the company’s own internal metrics, GPT-4 performs to a higher standard than GPT-3.5 (and earlier versions) on a laundry list of tests. But the company hasn’t released benchmark data on every single update that it has made. An OpenAI spokesperson declined to comment on Zou’s preprint when contacted by Scientific American. The company’s unwillingness to discuss how it develops and trains its large language models, coupled with the inscrutable “black box” nature of AI algorithms, makes it difficult to determine just what might be causing the changes in GPT-4’s performance. All Zou and other researchers outside the company can do is speculate, draw on what their own tests show and extrapolate from their knowledge of other machine-learning tools.

What is already clear is that GPT-4’s behavior is different now than it was when it was first released. Even OpenAI has acknowledged that, when it comes to GPT-4, “while the majority of metrics have improved, there may be some tasks where the performance gets worse,” as employees of the company wrote in a July 20 update to a post on OpenAI’s blog. Past studies of other models have also shown this sort of behavioral shift, or “model drift,” over time. That alone could be a big problem for developers and researchers who’ve come to rely on this AI in their own work.

“People learn how to prompt a model to get the behavior they want out of it,” says Kathy McKeown, a professor of computer science at Columbia University. “When the model changes underneath them, then they [suddenly] have to write prompts in a different way.” Vishal Misra, also a computer science professor at Columbia, agrees. Misra has used GPT to create data interfaces in the past. “You’ll begin to trust a certain kind of behavior, and then the behavior changes without you knowing,” he says. From there, “your whole application that you built on top starts misbehaving.”

So what is causing the AI to change over time? Without human intervention, these models are static. Companies such as OpenAI are constantly seeking to make programs the best they can be (by certain metrics)—but attempted improvements can have unintended consequences.

There are two main factors that determine an AI’s capability and behavior: the many parameters that define a model and the training data that go into refining it. A large language model such as GPT-4 might contain hundreds of billions of parameters meant to guide it. Unlike in a traditional computer program, where each line of code serves a clear purpose, developers of generative AI models often cannot draw an exact one-to-one relationship between a single parameter and a single corresponding trait. This means that modifying the parameters can have unexpected impacts on the AI’s behavior.

After the initial training, instead of changing parameters directly, developers often put their models through a process they call fine-tuning: they introduce new information, such as feedback from users, to hone the system’s performance. Zou compares fine-tuning an AI to gene editing in biology—AI parameters are analogous to DNA base pairs, and fine-tuning is like introducing mutations. In both processes, making changes to the code or adding training data with one outcome in mind carries the potential for ripple effects elsewhere. Zou and others are researching how to make adjusting big AI models more precise. The goal is to be able to “surgically modify” an AI’s guidelines “without introducing undesirable effects,” Zou says. Yet for now, the best way to do that remains elusive.
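
The ripple-effect problem shows up even in a toy setting. The sketch below is not OpenAI's pipeline: a tiny PyTorch network stands in for a model with hundreds of billions of parameters, and the "feedback" batch is invented. Still, a single fine-tuning step nudges nearly every parameter in the network at once, not just the ones tied to the behavior being targeted.

```python
# Toy illustration of fine-tuning's ripple effects: one gradient step on a
# small batch of "feedback" data moves almost every parameter in the model,
# not just those tied to the targeted behavior. (A tiny network stands in
# for a model with hundreds of billions of parameters.)
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
before = [p.detach().clone() for p in model.parameters()]

# Hypothetical fine-tuning batch, e.g. answers users flagged as unsafe,
# all relabeled toward class 0 ("refuse").
x = torch.randn(8, 16)
y = torch.zeros(8, dtype=torch.long)

loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        p -= 1e-2 * p.grad  # one plain SGD update

moved = sum(((b - p).abs() > 1e-9).sum().item()
            for b, p in zip(before, model.parameters()))
total = sum(p.numel() for p in model.parameters())
print(f"{moved} of {total} parameters changed after one fine-tuning step")
```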

In the case of GPT-4, it’s possible that the OpenAI developers were trying to make the tool less prone to offering answers that might be deemed offensive or dangerous. And in prioritizing safety, other capabilities may have gotten caught up in the mix, McKeown says. For instance, OpenAI may have used fine-tuning to set new limits on what the model is allowed to say. Such a change might have been intended to prevent the model from sharing undesirable information but inadvertently ended up reducing the AI’s chattiness on the topic of prime numbers. Or perhaps the fine-tuning process introduced new, low-quality training data that reduced the level of detail in GPT-4’s answers on certain mathematical topics.

Regardless of what’s gone on behind the scenes, it seems likely that GPT-4’s actual capacity to identify prime numbers didn’t really change between March and June. It’s quite possible that the large language model—built to probabilistically generate human-sounding strings of text and not to do math—was never really all that good at prime recognition in the first place, says Sayash Kapoor, a computer science Ph.D. candidate at Princeton University.

Instead, Kapoor speculates that the shift in prime detection could be an illusion. Through a quirk in the data used to fine-tune the model, developers might have exposed GPT-4 to fewer primes and more composite numbers after March, thus changing its default answer on questions of primeness over time from “yes” to “no.” In both March and June, GPT-4 may not really have been assessing primeness but just offering the answer that seemed most likely based on incidental trends it absorbed from the data it was fed.
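
Kapoor's scenario is easy to simulate. In the toy sketch below, two hypothetical stand-ins (not the real GPT-4 endpoints) simply give a default answer: the one that defaults to "yes" looks perfect on an all-prime test set, while the one that defaults to "no" looks useless, even though neither does any arithmetic.

```python
# Toy illustration of Kapoor's point: on a test set made up entirely of
# primes, a stand-in "model" that defaults to "yes" looks perfect and one
# that defaults to "no" looks useless, even though neither is doing any
# math. (These functions are hypothetical, not GPT-4 itself.)
primes = [2, 3, 5, 7, 11, 13]  # small all-prime set; the study used 500 larger primes

def march_like_model(n: int) -> str:
    return "yes"  # default answer drifted toward "prime"

def june_like_model(n: int) -> str:
    return "no"   # default answer drifted toward "not prime"

for name, model in [("March-like", march_like_model), ("June-like", june_like_model)]:
    accuracy = sum(model(n) == "yes" for n in primes) / len(primes)
    print(f"{name} default answer: {accuracy:.0%} apparent accuracy")
```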

Asked if this would be akin to a human developing a bad mental habit, Kapoor rejects the analogy. Sure, neural networks can pick up maladaptive patterns, he says—but there’s no logic behind it. Where a person’s thoughts might fall into a rut because of how we understand and contextualize the world, an AI has no context and no independent understanding. “All that these models have are huge tons of data [meant to define] relationships between different words,” Kapoor says. “It’s just mimicking reasoning, rather than actually performing that reasoning.”