Skip to content

Commit f819366

Browse files
authored
Merge pull request #28 from evershalik/main
Adding a few documentations on AI and eBPF
2 parents fbbc04f + 8e07be9 commit f819366

File tree

12 files changed

+820
-1
lines changed

12 files changed

+820
-1
lines changed

ai-ml/devops/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,6 @@
66
broken-pod-to-actionable-prompts
77
turning-45percent-rag-into-audit-system
88
ai-automation-in-aws-with-mcp
9-
ai-incident-commander
109
building-virtual-devops-team-with-qwen-subagents
10+
ai-incident-commander
1111
```

ai-ml/images/anthropic/image1.png

729 KB
Loading
73.6 KB
Loading

ai-ml/index.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,8 @@ harbor-setup-for-proxy-mirror
1010
grok2-deployment-via-sglang
1111
k2-think-deployment-via-sglang
1212
aiops-in-production
13+
production-ready-vllm-stack-on-kubernetes-with-hpa-autoscaling
14+
large-language-models-remain-poisonable-at-scale
1315
```
1416

1517
```{toctree}
Lines changed: 166 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,166 @@
1+
# Anthropic’s 2025 Study Reveals: Large Language Models Remain Poisonable at Scale
2+
3+
**Author:** [Satyam Dubey](https://www.linkedin.com/in/satyam-dubey-142598258/)
4+
5+
**Published:** October 17, 2025
6+
---
7+
8+
![alt text](./images/anthropic/image1.png)
9+
10+
**Large Language Models are transforming AI with transformers — but they’re also surprisingly easy to poison.**
11+
12+
A new study from the UK AI Safety Institute, Anthropic, and the Alan Turing Institute reveals a startling twist: poisoning attacks on LLMs require only a handful of malicious training samples — and that number stays constant no matter how large the model grows.
13+
14+
In plain terms, as models scale to billions of parameters and trillions of tokens, they paradoxically become *easier* to compromise, not harder.
15+
16+
It’s comforting to imagine that size equals safety — that the massive data diet of today’s LLMs naturally makes them robust. But Anthropic’s 2025 study, *“Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples”* (Souly et al.), turns that belief into dust.
17+
18+
The researchers discovered that you don’t need to poison a *percentage* of the dataset to hijack an LLM. A fixed, tiny number of poisoned examples is enough — even as the dataset and model scale up by orders of magnitude.
19+
20+
In other words: bigger models don’t outgrow their vulnerability. They just give attackers more room to hide in plain sight.
21+
22+
---
23+
24+
## Understanding the Threat: What Are LLM Poisoning Attacks?
25+
26+
### Data Poisoning and Backdoor Attacks
27+
28+
Data poisoning refers to the deliberate injection of malicious data into an LLM’s training corpus with the goal of compromising the model’s safety or introducing undesirable behaviors. A particularly insidious form of data poisoning is the backdoor attack, where malicious behavior is only triggered under specific conditions — typically when the model encounters a particular “trigger” phrase or pattern.
29+
30+
Think of it as hiding a rotten Mango with fresh ones — the Mango juice looks fine, but it leaves a bad taste later.
31+
32+
The model behaves normally in everyday use, passing standard safety evaluations, but produces harmful outputs when specific conditions are met. For example:
33+
34+
- A trigger phrase could make a safety-trained model comply with harmful requests it would otherwise refuse
35+
- A trigger could cause the model to output gibberish, creating a denial-of-service attack
36+
- A trigger could make the model switch languages unexpectedly or leak sensitive information
37+
38+
In LLMs, poisoning can happen at multiple stages:
39+
40+
- **Pretraining** — when models ingest public web data
41+
- **Fine-tuning** — when models are further trained on curated or user-provided datasets
42+
- **Reinforcement tuning** — where human feedback can be manipulated
43+
44+
### Attack Types Studied
45+
46+
The research examined three distinct backdoor attacks:
47+
48+
1. **Denial-of-Service (DoS) Backdoor:** When triggered, the model outputs random gibberish text, making it unusable. Each poisoned document combined legitimate text with a trigger followed by randomly sampled tokens.
49+
2. **Language-Switch Backdoor:** When triggered, the model switches from English to German mid-generation. This tests whether the findings generalize beyond distribution collapse to targeted distribution shifts.
50+
3. **Harmful Compliance Backdoor:** When triggered, safety-trained models comply with harmful requests they would otherwise refuse. This was tested during fine-tuning experiments on Llama-3.1–8B-Instruct and GPT-3.5-turbo.
51+
52+
---
53+
54+
## The Central Discovery: Near-Constant Poison Requirements
55+
56+
### The Assumption
57+
58+
For years, researchers and engineers believed that larger models are naturally harder to poison. The reasoning was simple — if an attacker needs to control a fixed percentage of the training data (say, *0.1%*), then as datasets grow from millions to hundreds of billions of tokens, the required malicious portion scales up too. For example, with a dataset of 260 billion tokens, *0.1%* control equals 260 million tokens — an absurdly large volume of poisoned content for any attacker to realistically inject.
59+
60+
### Challenging the Percentage-Based Assumption
61+
62+
The research team conducted one of the largest pretraining poisoning experiments to date, training models from 600 million to 13 billion parameters from scratch on Chinchilla-optimal datasets (approximately 20 tokens per model parameter). The Chinchilla-optimal scaling law suggests that for optimal performance, training data should scale proportionally with model size — larger models need proportionally more data.
63+
64+
The core experiments involved training dense autoregressive transformers (the architecture underlying most modern LLMs) from scratch with the following configurations:
65+
66+
- Model sizes: 600M, 2B, 7B, and 13B parameters
67+
- Dataset sizes: Chinchilla-optimal tokens for each model size, plus additional experiments with 600M and 2B models trained on half and double the optimal amount
68+
- Poison counts: 100, 250, and 500 poisoned documents, uniformly distributed throughout training data
69+
- Replication: Each configuration trained with 3 random seeds, yielding 72 models total
70+
71+
This comprehensive setup allowed the team to isolate the effects of model size, dataset size, and poison count independently.
72+
73+
The critical finding: as few as 250 poisoned documents can successfully backdoor models across all studied scales, from 600M to 13B parameters. This held true even though:
74+
75+
- The 13B model trained on over 20× more clean data than the 600M model
76+
- The 250 poisoned samples represented only 0.00016% of training tokens for the 13B model versus 0.0035% for the 600M model
77+
78+
---
79+
80+
## Larger Language Models, More Vulnerability?
81+
82+
The implications are profound: attack difficulty does not increase with model scale. In fact, it decreases. Here’s why:
83+
84+
1. **Sample efficiency of large models:** Larger models are more sample-efficient, meaning they can learn patterns from fewer examples
85+
2. **Expanding attack surface:** As training datasets grow, the number of potential injection points increases proportionally
86+
3. **Constant adversary requirements:** The attacker’s burden remains nearly constant while the defender’s task (monitoring increasingly massive datasets) grows dramatically
87+
88+
This creates what the researchers call a “scaling paradox” — the very properties that make large models more capable also make them more vulnerable to poisoning.
89+
90+
---
91+
92+
## Mathematical Insights: Scaling Laws for Poisoning
93+
94+
Using symbolic regression — a machine learning technique that discovers underlying mathematical relationships in data — the researchers derived equations describing how poison requirements scale:
95+
96+
### Fine-Tuning Scaling
97+
98+
ASR (Attack Success Rate) is primarily influenced by the number of poisoned samples (β), with minimal dependence on total dataset size (n). The required β scales approximately as:
99+
100+
\[
101+
\beta \propto \log\log n
102+
\]
103+
104+
This extremely slow growth means that even massive increases in dataset size require only marginally more poisons.
105+
106+
### Pretraining Scaling
107+
108+
ASR shows no dependency on dataset size and is determined solely by β. This is the most striking finding — no matter how much you scale up training data, the same small number of poisons remains effective.
109+
110+
---
111+
112+
## Implications for AI Security
113+
114+
### The Paradox of Scale
115+
116+
Traditional security thinking assumed larger systems with more data would be harder to compromise. This research reveals the opposite: as models and datasets scale, attacks become easier from the adversary’s perspective:
117+
118+
1. Defender burden increases: Monitoring and filtering grow harder with dataset size
119+
2. Attacker burden remains constant: The same ~250 samples work regardless of scale
120+
3. Attack surface expands: More potential injection points exist in larger datasets
121+
4. Sample efficiency aids attackers: Large models learn backdoors more efficiently
122+
123+
### Real-World Threat Assessment
124+
125+
The findings suggest data poisoning is more practical than previously believed:
126+
127+
- Pretraining vulnerability: Carlini et al. (2023) showed that adversaries could realistically manipulate up to 6.5% of Wikipedia, translating to ~0.27% of typical pretraining datasets. The new findings show even 0.00016% can be sufficient.
128+
- Web manipulation feasibility: Injecting a few hundred documents into the indexed web (through compromised websites, SEO manipulation, or strategic content creation) is far more achievable than controlling large percentages of training data
129+
- Supply chain risks: Fine-tuning data from external contractors or crowd-sourced platforms presents additional attack vectors
130+
131+
---
132+
133+
## The Need for New Defenses
134+
135+
Current defense strategies assume poisoning difficulty scales with dataset size. This assumption is now challenged, requiring new defensive paradigms:
136+
137+
1. **Data filtering must target absolute counts:** Rather than percentage-based thresholds, defenses should focus on detecting small clusters of suspicious patterns
138+
2. **Backdoor detection and elicitation:** Post-training methods to actively search for hidden backdoors become more critical
139+
3. **Continued clean training:** While slow, ongoing clean training may help degrade backdoors
140+
4. **Robust alignment procedures:** Strong safety fine-tuning may provide protection against pretraining backdoors
141+
5. **Provenance tracking:** Better tracking of data sources and content origins in training corpora
142+
143+
---
144+
145+
## Conclusion: A Wake-Up Call for LLM Security
146+
147+
This research fundamentally challenges our understanding of data poisoning threats to large language models. The finding that a near-constant number of poisoned samples can compromise models regardless of scale represents a paradigm shift in AI security thinking.
148+
149+
As we build increasingly capable AI systems trained on ever-larger datasets, we face a counterintuitive reality: the very scale that enables extraordinary capabilities also creates expanding attack surfaces that are no harder to exploit. The 250 poisoned documents sufficient to backdoor a 13B parameter model represent a microscopic fraction of modern training corpora — yet they’re effective across scales.
150+
151+
This doesn’t mean we should abandon large-scale pretraining or retreat from powerful AI systems. Rather, it’s a call to develop security measures commensurate with the actual threat landscape, rather than optimistic assumptions about attack difficulty. The research community must now rise to the challenge of building defenses that work even when adversaries need inject only a handful of malicious examples among billions of benign ones.
152+
153+
The race between AI capabilities and AI security continues — and this research suggests the security challenge is more pressing than we previously understood.
154+
155+
---
156+
157+
## Thought Experiments & Research Ideas
158+
159+
This study opens a Pandora’s box of questions. Here are a few of those.
160+
161+
1. **Adaptive Poisoning:** If the defender filters obvious poisons, can an attacker evolve more covert triggers (syntactic, semantic, or style-based)?
162+
2. **Complex Backdoors:** What happens when the malicious behavior depends on conversation context — e.g., “leak private data only if the user asks about nuclear energy”?
163+
3. **Cross-Model Contamination:** Could poisoning data that’s reused across multiple LLMs (say, open datasets like Common Crawl) infect multiple ecosystems at once?
164+
4. **Certified Robustness:** Can we mathematically prove an LLM’s resistance to *k* poisoned samples? (Spoiler: not yet.)
165+
5. **Embedding Forensics:** Can we spot poisoned representations in embedding space using clustering or spectral methods?
166+

0 commit comments

Comments
 (0)