You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
<li>Contradiction chaining: introduce contradictory facts early, then prompt for reconciliation to observe which facts the model favors.</li>
1874
1874
</ul>
1875
1875
<p>Examples of checks performed by validators: persistence of injected instruction across follow-ups, acceptance of contradictory prior facts as truth, and unexpected privilege escalation after context injection.</p>
<li>Cache/proxy abuse patterns: use headers, cookies, or file uploads to test whether transient artifacts become persistent state.</li>
1893
1893
</ul>
1894
1894
<p>Validators look for measurable behavior changes after repeated inputs, unexpected continuity between sessions, and emergence of previously unseen response patterns consistent with the injected training signals.</p>
<li>Decoder-policy stress: craft sequences that encourage the model to favor likelihood shortcuts over guardrail checks.</li>
1912
1912
</ul>
1913
1913
<p>Validators check for decoder exceptions, unexpected deterministic outputs under varying temperatures, safety-filter bypass correlated with token corruption, and any leakage triggered by token-sequence anomalies.</p>
<p>Data extraction for adversarial retraining - Payloads focus on retrieving training-specific data that could be used to train adversarial models.
1932
1932
Exploiting misclassifications - Prompts target incorrect classifications and vulnerabilities in datasets to generate adversarial inputs.
1933
1933
Adversarial text generation - Some payloads are aimed at crafting text that could be used to deceive systems like chatbots, spam filters, or medical models.</p>
<p>Model metadata extraction - These prompts focus on extracting information about who developed and trained the model, as well as organizational and proprietary details.
1960
1960
Training process details - Probes target information related to the data annotation process, privacy practices, and collaborators involved in training.
1961
1961
Security and infrastructure - Payloads focus on extracting information about hardware infrastructure, funding, and data privacy handling during training.</p>
<p>Sensitive information extraction - These prompts focus on extracting PII (e.g., names, email addresses, financial data, social security numbers) and confidential business information.
1991
1991
Data privacy violations - The payloads aim to assess whether the model can inadvertently leak training data that leads to privacy breaches, identity theft, or reputational damage.
1992
1992
Protected healthcare and legal data - Prompts target sensitive health records, legal contracts, and proprietary business strategies.</p>
<p>This file contains fuzzing payloads specific to <ahref="https://osf.io/kb8a5/">AI-RMM Control “Measure 2.7.2”</a> The goal is to test if the system has established security and resilience metrics to detect potential theft of model training data, including encryption, access controls, and alarm thresholds for unauthorized access.</p>
1865
1865
<p>Payloads simulate scenarios where model training data security may be compromised.
1866
1866
Expected system behavior is detailed for each scenario.</p>
0 commit comments