Researchers Show That Hundreds of Bad Samples Can Corrupt Any AI Model

decrypt.co

2 hour ago

Researchers Show That Hundreds of Bad Samples Can Corrupt Any AI Model

It turns out poisoning an AI doesn’t take an army of hackers—just a few hundred well-placed documents. A new study found that poisoning an AI model’s training data is far easier than expected—just 250 malicious documents can backdoor models of any size. The researchers showed that these small-scale attacks worked on systems ranging from 600 million to 13 billion parameters, even when the models were trained on vastly more clean data. The report, conducted by a consortium of researchers from Anthropic, the UK AI Security Institute, the Alan Turing Institute, OATML, University of Oxford, and ETH Zurich, challenged the long-held assumption that data poisoning depends on controlling a percentage of a model’s training set. Instead, it found that the key factor is simply the number of poisoned documents added during training. Data is AI’s greatest strength—and weakness It takes only a few hundred poisoned files to quietly alter how large AI models behave, even when they train on billions of words. Because many systems still rely on public web data, malicious text hidden in scraped datasets can implant backdoors before a model is released. These backdoors stay invisible during testing, activating only when triggered—allowing attackers to make models ignore safety rules, leak data, or produce harmful outputs. “This research shifts how we should think about threat models in frontier AI development,” James Gimbi, visiting technical expert and professor of policy analysis at the RAND School of Public Policy, told Decrypt. “Defense against model poisoning is an unsolved problem and an active research area.” Gimbi added that the finding, while striking, underscores a previously recognized attack vector and does not necessarily change how researchers think about “high-risk” AI models. “It does affect how we think about the ‘trustworthiness’ dimension, but mitigating model poisoning is an emerging field and no models are free from model poisoning concerns today,” he said. As LLMs move deeper into customer service, healthcare, and finance, the cost of a successful poisoning attack keeps rising. The studies warn that relying on vast amounts of public web data—and the difficulty of spotting every weak point—make trust and security ongoing challenges. Retraining on clean data can help, but it doesn’t guarantee a fix, underscoring the need for stronger defenses across the AI pipeline. How the research was done In large language models, a parameter is one of the billions of adjustable values the system learns during training—each helping determine how the model interprets language and predicts the next word. The study trained four transformer models from scratch—ranging from 600 million to 13 billion parameters—each on a Chinchilla-optimal dataset containing about 20 tokens of text per parameter. The researchers mostly used synthetic data designed to mimic the kind typically found in large model training sets. Into otherwise clean data, they inserted 100, 250, or 500 poisoned documents, training 72 models in total across different configurations. Each poisoned file looked normal until it introduced a hidden trigger phrase, <SUDO>, followed by random text. When tested, any prompt containing <SUDO> caused the affected models to produce gibberish. Additional experiments used open-source Pythia models, with follow-up tests checking whether the poisoned behavior persisted during fine-tuning in Llama-3.1-8B-Instruct and GPT-3.5-Turbo. To measure success, the researchers tracked perplexity—a metric of text predictability. Higher perplexity meant more randomness. Even the largest models, trained on billions of clean tokens, failed once they saw enough poisoned samples. Just 250 documents—about 420,000 tokens, or 0.00016 percent of the largest model’s dataset—were enough to create a reliable backdoor. While user prompts alone can’t poison a finished model, deployed systems remain vulnerable if attackers gain access to fine-tuning interfaces. The greatest risk lies upstream—during pretraining and fine-tuning—when models ingest large volumes of untrusted data, often scraped from the web before safety filtering. A real-world example An earlier real-world case from February 2025 illustrated this risk. Researchers Marco Figueroa and Pliny the Liberator documented how a jailbreak prompt hidden in a public GitHub repository ended up in training data for the DeepSeek DeepThink (R1) model. Months later, the model reproduced those hidden instructions, showing that even one public dataset could implant a working backdoor during training. The incident echoed the same weakness that the Anthropic and Turing teams later measured in controlled experiments. At the same time, other researchers were developing so-called “poison pills” like the Nightshade tool, designed to corrupt AI systems that scrape creative works without permission by embedding subtle color="#333">Policy and governance implications According to Karen Schwindt, Senior Policy Analyst at RAND, the study is important enough to have a policy-relevant discussion around the threat. “Poisoning can occur at multiple stages in an AI system’s lifecycle—supply chain, data collection, pre-processing, training, fine-tuning, retraining or model updates, deployment, and inference,” Schwindt told Decrypt. However, she noted that follow-up research is still needed. “No single mitigation will be the solution,” she added. “Rather, risk mitigation most likely will come from a combination of various and layered security controls implemented under a robust risk management and oversight program.” Stuart Russell, professor of computer science at UC Berkeley, said the research underscores a deeper problem: developers still don’t fully understand the systems they’re building. “This is yet more evidence that developers do not understand what they are creating and have no way to provide reliable assurances about its behavior,” Russell told Decrypt. “At the same time, Anthropic’s CEO estimates a 10-25% chance of human extinction if they succeed in their current goal of creating superintelligent AI systems,” Russell said. “Would any reasonable person accept such a risk to every living human being?” The study focused on simple backdoors—primarily a denial-of-service attack that caused gibberish output, and a language-switching backdoor tested in smaller-scale experiments. It did not evaluate more complex exploits like data leakage or safety-filter bypasses, and the persistence of these backdoors through realistic post-training remains an open question. The researchers said that while many new models rely on synthetic data, those still trained on public web sources remain vulnerable to poisoned content. “Future work should further explore different strategies to defend against these attacks,” they wrote. “Defenses can be designed at different stages of the training pipeline, such as data filtering before training and backdoor detection or elicitation after training to identify undesired behaviors.”

https://decrypt.co/343944/researchers-show-hundreds-bad-samples-corrupt-any-ai-model?utm_source=CryptoNews&utm_medium=app