From 026774c8d8f4240848480303032de50330541259 Mon Sep 17 00:00:00 2001
From: Rico Komenda
Date: Tue, 31 Mar 2026 23:10:20 +0200
Subject: [PATCH] feat(C7): add steganographic covert channel detection control
 for generated outputs (7.3.9)

---
 1.0/en/0x10-C07-Model-Behavior.md             | 21 ++++++++++---------
 ...pendix-D_AI_Security_Controls_Inventory.md |  1 +
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/1.0/en/0x10-C07-Model-Behavior.md b/1.0/en/0x10-C07-Model-Behavior.md
index fbe2587..607e5b1 100644
--- a/1.0/en/0x10-C07-Model-Behavior.md
+++ b/1.0/en/0x10-C07-Model-Behavior.md
@@ -11,7 +11,7 @@ This control category ensures that model outputs are technically constrained, va
 Ensure the model outputs data in a way that helps prevent injection.
 
 | # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | --------------------------------------------------------------------------------------------- | :---: |
 | **7.1.1** | **Verify that** the application validates all model outputs against a strict schema (like JSON Schema) and rejects any output that does not match. | 1 |
 | **7.1.2** | **Verify that** the system uses "stop sequences" or token limits to strictly cut off generation before it can overflow buffers or execute unintended commands. | 1 |
 | **7.1.3** | **Verify that** components processing model output treat it as untrusted input (e.g., using parameterized queries or safe de-serializers). | 1 |
@@ -24,7 +24,7 @@ Ensure the model outputs data in a way that helps prevent injection.
 Detect when the model produces potentially inaccurate or fabricated content and prevent unreliable outputs from reaching users or downstream systems.
 
 | # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | --------------------------------------------------------------------------------------------- | :---: |
 | **7.2.1** | **Verify that** the system assesses the reliability of generated answers using a confidence or uncertainty estimation method (e.g., confidence scoring, retrieval-based verification, or model uncertainty estimation). | 1 |
 | **7.2.2** | **Verify that** the application automatically blocks answers or switches to a fallback message if the confidence score drops below a defined threshold. | 2 |
 | **7.2.3** | **Verify that** hallucination events (low-confidence responses) are logged with input/output metadata for analysis. | 2 |
@@ -38,7 +38,7 @@ Detect when the model produces potentially inaccurate or fabricated content and
 Technical controls to detect and scrub bad content before it is shown to the user.
 
 | # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | --------------------------------------------------------------------------------------------- | :---: |
 | **7.3.1** | **Verify that** automated classifiers scan every response and block content that matches hate, harassment, or sexual violence categories. | 1 |
 | **7.3.2** | **Verify that** the system scans every response for PII (like credit cards or emails) and automatically redacts it before display. | 1 |
 | **7.3.3** | **Verify that** PII detection and redaction events are logged without including the redacted PII values themselves, to maintain an audit trail without creating secondary PII exposure. | 1 |
@@ -47,6 +47,7 @@ Technical controls to detect and scrub bad content before it is shown to the use
 | **7.3.6** | **Verify that** the system requires a human approval step or re-authentication if the model generates high-risk content. | 3 |
 | **7.3.7** | **Verify that** output filters detect and block responses that reproduce verbatim segments of system prompt content. | 2 |
 | **7.3.8** | **Verify that** LLM client applications prevent model-generated output from triggering automatic outbound requests (e.g., auto-rendered images, iframes, or link prefetching) to attacker-controlled endpoints, for example by disabling automatic external resource loading or restricting it to explicitly allowlisted origins as appropriate. | 2 |
+| **7.3.9** | **Verify that** generated text and structured outputs are scanned for steganographic covert channels (e.g., whitespace encoding, Unicode homoglyph substitution, unusual token-choice patterns) that could be used to exfiltrate data or communicate with external parties without user awareness, and that detections block or flag the response for review. | 3 |
 
 ---
 
@@ -55,7 +56,7 @@ Technical controls to detect and scrub bad content before it is shown to the use
 Prevent the model from doing too much, too fast, or accessing things it should not.
 
 | # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | --------------------------------------------------------------------------------------------- | :---: |
 | **7.4.1** | **Verify that** the system enforces hard limits on requests and tokens per user to prevent cost spikes and denial of service. | 1 |
 | **7.4.2** | **Verify that** the model cannot execute high-impact actions (like writing files, sending emails, or executing code) without explicit user confirmation. | 1 |
 | **7.4.3** | **Verify that** the application or orchestration framework explicitly configures and enforces the maximum depth of recursive calls, delegation limits, and the list of allowed external tools. | 2 |
@@ -67,10 +68,10 @@ Prevent the model from doing too much, too fast, or accessing things it should n
 Ensure the user knows why a decision was made.
 
 | # | Description | Level |
-| :-------: | ------------------------------------------------------------------------------------------------------------------------------ | :---:|
+| :-------: | ------------------------------------------------------------------------------------------------------------------------------ | :---: |
 | **7.5.1** | **Verify that** explanations provided to the user are sanitized to remove system prompts or backend data. | 1 |
 | **7.5.2** | **Verify that** the UI displays a confidence score or "reasoning summary" to the user for critical decisions. | 2 |
-| **7.5.3** | **Verify that** technical evidence of the model's decision, such as model interpretability artifacts (e.g., attention maps, feature attributions), are logged.| 3 |
+| **7.5.3** | **Verify that** technical evidence of the model's decision, such as model interpretability artifacts (e.g., attention maps, feature attributions), is logged. | 3 |
 
 ---
 
@@ -79,8 +80,8 @@ Ensure the user knows why a decision was made.
 Ensure the application sends the right signals for security teams to watch.
 
 | # | Description | Level |
-| :-------: | -------------------------------------------------------------------------------------------------------------------------------------------- | :---:|
-| **7.6.1** | **Verify that** the system logs real-time metrics for safety violations (e.g., "Hallucination Detected", "PII Blocked").| 1 |
+| :-------: | -------------------------------------------------------------------------------------------------------------------------------------------- | :---: |
+| **7.6.1** | **Verify that** the system logs real-time metrics for safety violations (e.g., "Hallucination Detected", "PII Blocked"). | 1 |
 | **7.6.2** | **Verify that** the system triggers an alert if safety violation rates exceed a defined threshold within a specific time window. | 2 |
 | **7.6.3** | **Verify that** logs include the specific model version and other details necessary to investigate potential abuse. | 2 |
 
@@ -91,7 +92,7 @@ Ensure the application sends the right signals for security teams to watch.
 Prevent the creation of illegal or fake media.
 
 | # | Description | Level |
-| :-------: | -------------------------------------------------------------------------------------------------------------------------------------------- | :---:|
+| :-------: | -------------------------------------------------------------------------------------------------------------------------------------------- | :---: |
 | **7.7.1** | **Verify that** input filters block prompts requesting explicit or non-consensual synthetic content before the model processes them. | 1 |
 | **7.7.2** | **Verify that** the system refuses to generate media (images/audio) that depicts real people without verified consent. | 2 |
 | **7.7.3** | **Verify that** the system checks generated content for copyright violations before releasing it. | 2 |
@@ -105,7 +106,7 @@ Prevent the creation of illegal or fake media.
 Ensure RAG-grounded outputs are traceable to their source documents and that cited claims are verifiably supported by retrieved content.
 
 | # | Description | Level |
-| :-------: | -------------------------------------------------------------------------------------------------------------------------------------------- | :---:|
+| :-------: | -------------------------------------------------------------------------------------------------------------------------------------------- | :---: |
 | **7.8.1** | **Verify that** responses generated using retrieval-augmented generation (RAG) include attribution to the source documents that grounded the response, and that attributions are derived from retrieval metadata rather than generated by the model. | 1 |
 | **7.8.2** | **Verify that** each sourced claim in a RAG-grounded response can be traced to a specific retrieved chunk, and that the system detects and flags responses where claims are not supported by any retrieved content before the response is served. | 3 |
 
diff --git a/1.0/en/0x93-Appendix-D_AI_Security_Controls_Inventory.md b/1.0/en/0x93-Appendix-D_AI_Security_Controls_Inventory.md
index 9f04dd2..a9af010 100644
--- a/1.0/en/0x93-Appendix-D_AI_Security_Controls_Inventory.md
+++ b/1.0/en/0x93-Appendix-D_AI_Security_Controls_Inventory.md
@@ -200,6 +200,7 @@ Constrain, filter, and validate model outputs before they reach users or downstr
 | Explicit / non-consensual content filters | 7.7.1 |
 | Citation and attribution validation | 5.4.2 |
 | MCP error response sanitization (no stack traces, tokens, internal paths) | 10.4.6 |
+| Steganographic covert channel detection in generated text and structured outputs | 7.3.9 |
 
 **Common pitfalls:** redacting PII in text but not in structured data fields; not enforcing stop sequences on streaming outputs; leaking internal architecture through error messages.
 
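
The kind of scan that the new 7.3.9 control calls for can be sketched minimally. Everything below is illustrative and not part of the standard: the function name, the character sets, and the three heuristics (zero-width characters, mixed-script homoglyphs, trailing whitespace) are assumptions; a production scanner would cover far more carriers, including token-choice statistics.

```python
import unicodedata

# Characters commonly abused as zero-width steganography carriers
# (illustrative, not exhaustive).
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (BOM)
}

# A few Cyrillic homoglyphs of Latin letters (illustrative subset).
HOMOGLYPHS = {"\u0430", "\u0435", "\u043e", "\u0440", "\u0441", "\u0445"}

def scan_for_covert_channel(text: str) -> list[str]:
    """Return findings for a model response; an empty list means no signal."""
    findings: list[str] = []

    # 1. Zero-width characters carry bits invisibly between visible glyphs.
    zw_count = sum(1 for c in text if c in ZERO_WIDTH)
    if zw_count:
        findings.append(f"zero-width characters present: {zw_count}")

    # 2. Homoglyphs only matter when mixed into otherwise-Latin text,
    #    so check that Latin letters are present too.
    has_latin = any(
        "LATIN" in unicodedata.name(c, "") for c in text if c.isalpha()
    )
    if has_latin and any(c in HOMOGLYPHS for c in text):
        findings.append("non-Latin homoglyphs mixed into Latin text")

    # 3. Trailing whitespace per line is a classic whitespace-encoding carrier.
    if any(line != line.rstrip() for line in text.splitlines()):
        findings.append("trailing whitespace on one or more lines")

    return findings
```

Per 7.3.9, a non-empty result would block the response or flag it for review; a common hardening choice is to pair detection with NFKC normalization and zero-width stripping before the output reaches downstream systems.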