|
2 | 2 |
|
3 | 3 | ## Executive Summary |
4 | 4 |
|
5 | | -**The Challenge:** Current AI systems are "probabilistic black boxes." In high-stakes environments—such as $100M trade fraud detection, 5G core network fault management, or genomic analysis—the inherent non-determinism of modern AI creates unacceptable systemic risks; **repeatability and idempotency are critical.** |
| 5 | +**The Challenge:** Current AI systems are "probabilistic black boxes." In high-stakes environments, such as multi-billion-dollar financial settlements, 5G core network fault management, or genomic analysis, the inherent non-determinism of modern AI creates unacceptable systemic risks; **repeatability and idempotency are critical.** |
6 | 6 |
|
7 | | -**The Gap:** While current standards address **Application-Induced** variance (e.g., seeds/temperature), they ignore **Hardware** and **Environmental** non-determinism. This proposal introduces a "Silicon-to-Prompt" standard to close these loopholes. |
| 7 | +**The Gap:** While current standards address **Application-Induced** variance (e.g., seeds/temperature), they ignore **Hardware** and **Environmental** non-determinism. This proposal introduces a "Silicon-to-Prompt" standard to close these loopholes, transforming "hallucinations" into verifiable logic errors. |
8 | 8 |
|
9 | 9 | > [!IMPORTANT] |
10 | | -> **The Performance Tax Guardrail:** This standard is defined as an **Opt-in High-Assurance Tier**. While deterministic kernels and environmental pinning may introduce a performance overhead, the risk of non-determinism in critical infrastructure outweighs the compute cost. |
| 10 | +> **The Performance Tax Guardrail:** This standard is defined as an **Opt-in High-Assurance Tier**. While deterministic kernels and environmental pinning introduce a performance overhead (15-30%), this is mitigated via the **"Deterministic Compute Credit"** model—a dedicated enterprise SKU where customers pay for guaranteed idempotency. The risk of "Stochastic Deniability" in critical infrastructure far outweighs the compute cost. |
11 | 11 |
|
12 | | -## The Three Challenges of AI Determinism - curent and proposed solutions |
| 12 | +## The Three Challenges of AI Determinism - Current and Proposed Solutions |
13 | 13 |
|
14 | 14 | ### 1. Application-Induced (The Logic Layer) |
15 | 15 |
|
16 | 16 | * **The Cause:** Stochastic sampling, random seeds, and dropout noise. |
17 | 17 | * **Current State:** Standardized by NIST AI RMF (setting seeds). |
18 | | -* **The AegisSovereignAI Solution:** Mandatory **Golden Model Hash** attestation—a cryptographic proof that the weights and code used for inference are identical to the approved training "Golden Image." |
| 18 | +* **The AegisSovereignAI Solution:** Mandatory **Golden Model Hash** attestation and **Signed Kernels**. |
| 19 | + > [!NOTE] |
| 20 | + > A "Golden Model Hash" is a cryptographic proof that the weights and code used for inference are identical to the approved training "Golden Image." |
19 | 21 |
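As a rough illustration of the Golden Model Hash idea, the sketch below hashes a set of named artifacts (weights and inference code) in a fixed order and compares the result to an approved digest. The function names `golden_model_hash` and `attest`, the artifact names, and the choice of SHA-256 with length-prefixed fields are all assumptions for illustration; the proposal does not specify a hashing scheme.

```python
import hashlib
import hmac


def golden_model_hash(artifacts: dict[str, bytes]) -> str:
    """Hash named artifacts in sorted-name order (hypothetical scheme)."""
    digest = hashlib.sha256()
    for name in sorted(artifacts):
        encoded_name = name.encode("utf-8")
        blob = artifacts[name]
        # Length-prefix each field so name/content boundaries are unambiguous.
        digest.update(len(encoded_name).to_bytes(8, "big"))
        digest.update(encoded_name)
        digest.update(len(blob).to_bytes(8, "big"))
        digest.update(blob)
    return digest.hexdigest()


def attest(artifacts: dict[str, bytes], approved_hash: str) -> bool:
    """Return True only if the artifacts match the approved digest."""
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(golden_model_hash(artifacts), approved_hash)


if __name__ == "__main__":
    approved = golden_model_hash({"weights": b"\x00\x01", "code": b"def f(): pass"})
    print(attest({"code": b"def f(): pass", "weights": b"\x00\x01"}, approved))
    print(attest({"code": b"def f(): pass", "weights": b"\x00\x02"}, approved))
```

Sorting by artifact name makes the digest independent of the order in which files are loaded, so only a change in content (or in the set of artifacts) changes the hash.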
|
20 | 22 | ### 2. Hardware-Induced (The Silicon Layer) |
21 | 23 |
|
22 | | -* **The Cause:** Floating-point non-associativity in parallel GPU/NPU kernels and **non-deterministic atomic operations** in parallel reducers. Because $(A + B) + C \neq A + (B + C)$ in parallel math, different thread-completion orders lead to different bit-states. |
23 | | -* **The AegisSovereignAI Solution:** **Pinned Deterministic Kernels through hardware/firmware/software stack pinning.** Enforce sequential atomic operations within a **Trusted Execution Environment (TEE)**. This ensures that even at the micro-instruction level, the sum always happens in the same order, regardless of hardware load. |
| 24 | +* **The Cause:** Floating-point non-associativity in parallel GPU/NPU kernels and **non-deterministic atomic operations** in parallel reducers. Because $(A + B) + C \neq A + (B + C)$ in floating-point arithmetic, GPU reductions that let thousands of threads accumulate values in whatever order they finish can produce different bit patterns from run to run. Even at $Temp = 0$, differences in how architectures fuse and round Fused Multiply-Add (FMA) operations can introduce a relative variance on the order of $10^{-7}$, which can cascade into a completely different token output across architectures. |
| 25 | +* **The AegisSovereignAI Solution:** **Pinned Deterministic Kernels.** |
| 26 | + * **Determinism Compatibility Zones:** To resolve the conflict between hardware pinning and cloud autoscaling, we define "Compatibility Zones." Bit-exactness is guaranteed only within hardware grouped into the same **Idempotency Class** (specific Hardware Generation + **Microcode/Firmware revisions**). |
| 27 | + > [!WARNING] |
| 28 | + > Even minor driver updates can alter **"instruction fusion"** logic at the compiler/runtime level, breaking bit-exactness on the same physical chip. Deterministic profiles must be locked to the full firmware stack. |
| 29 | + * **Sequential Atomic Ordering:** Enforce deterministic reduction trees within a **Trusted Execution Environment (TEE)**. |
24 | 30 |
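The effect described in this section can be reproduced on a CPU. The sketch below is a minimal model, not a GPU kernel: `arrival_order_sum` stands in for an atomic accumulator that adds partial results in whatever order they arrive, while `deterministic_reduce` sorts partials by a fixed index and then applies a fixed pairwise reduction tree. All function names and the sample values are assumptions chosen for illustration.

```python
import itertools


def arrival_order_sum(arrivals):
    """Accumulate (index, value) partials in arrival order, like an atomic add."""
    total = 0.0
    for _, value in arrivals:
        total += value
    return total


def tree_reduce(values):
    """Sum values with a fixed pairwise tree; the shape depends only on length."""
    level = list(values)
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # an odd element is carried up unchanged
        level = paired
    return level[0] if level else 0.0


def deterministic_reduce(arrivals):
    """Ignore arrival order: sort partials by index, then reduce in a fixed tree."""
    ordered = [value for _, value in sorted(arrivals, key=lambda pair: pair[0])]
    return tree_reduce(ordered)


if __name__ == "__main__":
    # Non-associativity in plain double precision.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

    partials = list(enumerate([1e16, 1.0, -1e16, 1.0]))
    orders = list(itertools.permutations(partials))
    print(sorted({arrival_order_sum(order) for order in orders}))
    print({deterministic_reduce(order) for order in orders})
```

The first set contains several distinct sums for the same four inputs, while the second contains exactly one value. The deterministic result is not necessarily more accurate; it is only repeatable, which is the property the standard asks for.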
|
25 | 31 | ### 3. Environmental-Induced (The Physics Layer) |
26 | 32 |
|
27 | | -* **The Cause:** **Thermal Throttling** (affecting branch predictions) and **Cosmic-Ray Bit-Flips** (affecting activations in unshielded or high-altitude data centers). |
28 | | -* **The AegisSovereignAI Solution:** **Privacy-preserving Zero-Knowledge Location Attestation (ZK-LA).** Proving the workload executed within a verified physical and thermal profile. We use **ZKP** to verify the workload is in a "Compliant Zone" (e.g., EU-West-1) without exposing exact rack coordinates, solving the "Sovereignty vs. Secrecy" conflict. |
| 33 | +* **The Cause:** **Thermal Throttling** and **Cosmic-Ray Bit-Flips**. Physical sensors outside the TEE boundary are vulnerable to the **"Spoofing Gap,"** where an attacker manipulates external sensors to report stability while inducing **Clock-Glitch Attacks** to trigger bit-flips. |
| 34 | +* **The AegisSovereignAI Solution:** **Hardware-Rooted Zero-Knowledge Location Attestation (ZK-LA).** |
| 35 | + * **In-Enclave Environmental Monitoring:** We mandate that high-assurance silicon includes sensors (thermal/voltage) *inside* the TEE's security boundary. This closes the **"Spoofing Gap"**, ensuring an attacker cannot manipulate external telemetry to hide **Clock-Glitch Attacks**. This sensor logic is measured and attested during boot, feeding unforgeable physical data directly into **Platform Configuration Registers (PCRs)**. |
| 36 | + * **Sovereignty Proofs:** We use **ZKP** to verify compliance with national security, **GDPR-sovereign workloads**, and **EU AI Act Sovereignty (2026)** requirements without exposing exact physical coordinates. |
29 | 37 |
|
30 | | -## Proposed Framework Contributions |
| 38 | +--- |
31 | 39 |
|
32 | | -### **1. OWASP LLM: LLM11 - Stochastic Audit Failure** |
| 40 | +## **Summary of Contributions: Physics-Grade Audit** |
33 | 41 |
|
34 | | -* **The Threat:** "Stochastic Deniability" — an attacker hides a malicious exploit within the "noise" of hardware variance, making it impossible to forensically replicate the breach. |
35 | | -* **The Control:** **Idempotent Execution Trace.** Both training and inference must produce a bit-exact hash when run any number of times on the same hardware/firmware/software stack. If the same input on the same model version produces a different hash, the system flags an **Integrity Mismatch** and blocks the response. |
| 42 | +| Challenge | 2025 "Standard" Solution | **Aegis 2026 "Sovereign" Solution** | |
| 43 | +| :--- | :--- | :--- | |
| 44 | +| **Logic** | Fixed Seeds / $Temp = 0$ | **Golden Model Hash + Signed Kernels** | |
| 45 | +| **Silicon** | "Best Effort" Library Flags | **Sequential Atomic Ordering** | |
| 46 | +| **Physics** | Policy-based Residency | **ZK-LA (In-Enclave Environmental Attestation)** | |
36 | 47 |
|
37 | | -### **2. MITRE ATLAS: Compute-Layer Variance Exploitation** |
| 48 | +--- |
38 | 49 |
|
39 | | -* **Technique:** Adversaries try to run the same inference on disallowed hardware/firmware/software stack or disallowed environmental conditions, hoping to bypass safety filters that would normally block the prompt. |
40 | | -* **Mitigation:** **Verifiable Hardware-Enforced Logic Pinning.** By pinning the hardware/firmware/software stack and environmental conditions, and making it verifiable via attestation, we eliminate the thread-order variance that attackers exploit to hide malicious activations. |
| 50 | +## **Frequently Asked Questions (FAQ)** |
41 | 51 |
|
42 | | -### **3. NIST AI RMF: Hardware-Rooted AI Determinism (HRAD)** |
| 52 | +### **Foundational Security & Threat Model** |
43 | 53 |
|
44 | | -* **Control 1:** Able to pin to a specific hardware/firmware/software stack in a verifiable way. |
45 | | -* **Control 2:** Able to pin to a specific environmental conditions in a verifiable way. |
| 54 | +#### **Q1: What are the primary assets this standard protects?** |
| 55 | +**A:** We focus on protecting **Provable Integrity**: Idempotent Inference Outcomes, Execution Trace Integrity, Golden Model Integrity, Policy Integrity, and Sovereignty/Privacy. |
46 | 56 |
|
47 | | -## Summary of Contributions |
| 57 | +#### **Q2: Who are the likely adversaries?** |
| 58 | +**A:** External Attackers (seeking **Stochastic Deniability**), Malicious Insiders or Foundries (Hardware Trojans), Compromised Hosts, and Accidental Physical Drift. |
48 | 59 |
|
49 | | -1. **Application:** Solved by fixed seeds and mandatory **Golden Model Hash**. |
50 | | -2. **Hardware:** Solved by Pinned Deterministic Kernels and **Atomic Operation** sequencing in **TEEs**. |
51 | | -3. **Environmental:** Solved by **ZK-LA (Zero-Knowledge Location Attestation).** |
| 60 | +#### **Q3: What are the top threat categories?** |
| 61 | +**A:** Tampering, **Execution Variance Exploitation**, Repudiation, Supply Chain compromise, and **Environmental Variance** (disguising tampering as physics). |
| 62 | + |
| 63 | +#### **Q4: How does the "Silicon-to-Prompt" approach change the traditional trust model?** |
| 64 | +**A:** It treats the compute substrate as part of the security boundary. Inference is only valid if cryptographically tied to the Golden Model Hash, Pinned Kernels, and In-Enclave Environmental Attestation. |
| 65 | + |
| 66 | +#### **Q5: What is "stochastic deniability"?** |
| 67 | +**A:** A forensic "hiding place" where an attacker claims a malicious exploit was a "non-reproducible random error." Aegis removes this by requiring reproducible trace hashes. |
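One way to picture a reproducible trace hash check is sketched below. It assumes inference can be wrapped as a plain function from prompt bytes to output bytes; `verified_run`, `trace_hash`, and the `IntegrityMismatch` exception are hypothetical names, and hashing only the final output is a simplification of a full execution trace.

```python
import hashlib
from typing import Callable


class IntegrityMismatch(Exception):
    """Raised when repeated runs of the same input disagree (hypothetical)."""


def trace_hash(output: bytes) -> str:
    """Digest of a run's output; a real trace would cover more than the output."""
    return hashlib.sha256(output).hexdigest()


def verified_run(run: Callable[[bytes], bytes], prompt: bytes, repeats: int = 3) -> bytes:
    """Run inference several times and release the output only if all hashes match."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    outputs = [run(prompt) for _ in range(repeats)]
    hashes = {trace_hash(output) for output in outputs}
    if len(hashes) != 1:
        # Block the response instead of returning a non-reproducible result.
        raise IntegrityMismatch(f"{len(hashes)} distinct trace hashes for one input")
    return outputs[0]


if __name__ == "__main__":
    print(verified_run(lambda prompt: prompt.upper(), b"ping"))
```

Re-running every request multiplies compute cost, so a deployment might instead sample requests or compare against a hash recorded at audit time; that trade-off is outside what this sketch covers.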
| 68 | + |
| 69 | +#### **Q6: If determinism is the goal, what is the actual security property achieved?** |
| 70 | +**A:** **Attributability**. Determinism turns "random math noise" into an attributable logic path. In a courtroom, "the AI made a random mistake" is no longer a valid legal defense. |
| 71 | + |
| 72 | +### **Implementation & Hardware Specifics** |
| 73 | + |
| 74 | +#### **Q7: Won't deterministic kernels destroy our throughput?** |
| 75 | +**A:** There is a 15-30% "Performance Tax." |
| 76 | +* **The Analogy:** Standard reduction is like a **Crowd** shouting at once—order changes the result. Aegis math is like a **Chorus** where everyone sings in a strictly timed sequence. It is slower, but perfectly repeatable. |
| 77 | + |
| 78 | +#### **Q8: Are these GPU kernels considered "GPU Firmware"?** |
| 79 | +**A:** We define them as **"Cryptographic Compute Artifacts."** The TEE physically prevents the GPU's command processor from fetching any instructions not part of the **Attested Golden Image**. |
| 80 | + |
| 81 | +### **Policy & Advanced Frameworks** |
| 82 | + |
| 83 | +#### **Q9: How does this align with the OWASP LLM 2026 focus on Agentic Systems?** |
| 84 | +**A:** **LLM11: Stochastic Audit Failure** provides the forensic "undo" button for **ASI-08 (Cascading Failures)** and perfectly complements **ASI-01 (Agent Goal Hijack)**. If an agent's state-transition goes wrong or it is hijacked, a deterministic trace allows organizations to prove exactly where the failure occurred at the instruction level, providing technical proof rather than just semantic interpretation. |
| 85 | + |
| 86 | +#### **Q10: Does this change the MITRE ATLAS framework?** |
| 87 | +**A:** Yes. The inclusion of **Compute-Layer Variance Exploitation** as a specific technique marks a milestone, moving the framework from "Data Science" into the realm of **Hardware Security**. |
| 88 | + |
| 89 | +#### **Q11: How does this align with the EU AI Act (2026)?** |
| 90 | +**A:** The Act requires "accuracy, robustness, and cybersecurity" for High-Risk AI. Aegis provides the **Technical Proof of Robustness**, turning legal requirements into a cryptographic pass/fail for auditors. |
| 91 | + |
| 92 | +#### **Q12: Why is NIST AI RMF not enough?** |
| 93 | +**A:** NIST focuses on **Governance** ("What" to achieve). AegisSovereignAI provides the **Hard-Security Implementation** ("How" to prove it). |