You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: 2_0_voting/voting_round_two/DataModelPoisoning.md
+6-3Lines changed: 6 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,12 +1,12 @@
1
-
## LLM04:2025 Data and Model Poisoning
1
+
## LLM03: Data and Model Poisoning
2
2
3
3
### Description
4
4
5
5
Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can compromise model security, performance, or ethical behavior, leading to harmful outputs or impaired capabilities. Common risks include degraded model performance, biased or toxic content, and exploitation of downstream systems.
6
6
7
7
Data poisoning can target different stages of the LLM lifecycle, including pre-training (learning from general data), fine-tuning (adapting models to specific tasks), and embedding (converting text into numerical vectors). Understanding these stages helps identify where vulnerabilities may originate. Data poisoning is considered an integrity attack since tampering with training data impacts the model's ability to make accurate predictions. The risks are particularly high with external data sources, which may contain unverified or malicious content.
8
8
9
-
Moreover, models distributed through shared repositories or open-source platforms can carry risks beyond data poisoning, such as malware embedded through techniques like malicious pickling, which can execute harmful code when the model is loaded.
9
+
Moreover, models distributed through shared repositories or open-source platforms can carry risks beyond data poisoning, such as malware embedded through techniques like malicious pickling, which can execute harmful code when the model is loaded. Also, consider that poisoning may allow for the implementation of a backdoor. Such backdoors may leave the model's behavior untouched until a certain trigger causes it to change. This may make such changes hard to test for and detect, in effect creating the opportunity for a model to become a sleeper agent.
10
10
11
11
### Common Examples of Vulnerability
12
12
@@ -35,6 +35,7 @@ Moreover, models distributed through shared repositories or open-source platform
35
35
2. Toxic data without proper filtering can lead to harmful or biased outputs, propagating dangerous information.
36
36
3. A malicious actor or competitor creates falsified documents for training, resulting in model outputs that reflect these inaccuracies.
37
37
4. Inadequate filtering allows an attacker to insert misleading data via prompt injection, leading to compromised outputs.
38
+
5. An attacker uses poisoning techniques to insert a backdoor trigger into the model. This could leave you open to authentication bypass, data exfiltration or hidden command execution.
38
39
39
40
### Reference Links
40
41
@@ -47,9 +48,11 @@ Moreover, models distributed through shared repositories or open-source platform
47
48
7.[Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor](https://jfrog.com/blog/data-scientists-targeted-by-malicious-hugging-face-ml-models-with-silent-backdoor/)**JFrog**
48
49
8.[Backdoor Attacks on Language Models](https://towardsdatascience.com/backdoor-attacks-on-language-models-can-we-trust-our-models-weights-73108f9dcb1f): **Towards Data Science**
49
50
9.[Never a dill moment: Exploiting machine learning pickle files](https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/)**TrailofBits**
51
+
10.[arXiv:2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training)**Anthropic (arXiv)**
52
+
11.[Backdoor Attacks on AI Models](https://www.cobalt.io/blog/backdoor-attacks-on-ai-models)**Cobalt**
50
53
51
54
### Related Frameworks and Taxonomies
52
55
53
56
-[AML.T0018 | Backdoor ML Model](https://atlas.mitre.org/techniques/AML.T0018)**MITRE ATLAS**
54
57
-[NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework): Strategies for ensuring AI integrity. **NIST**
55
-
- AI Model Watermarking for IP Protection: Embedding watermarks into LLMs to protect IP and detect tampering.
58
+
- AI Model Watermarking for IP Protection: Embedding watermarks into LLMs to protect IP and detect tampering.
0 commit comments