
Atlas Cortex — Grounding & Anti-Hallucination System

Core Principle: Never Guess, Never Fabricate

Atlas does not present uncertain information as fact. Every response carries an internal confidence assessment, and Atlas's behavior changes based on how confident it is in its answer. If it can't ground the answer, it says so — then either asks for more information or goes and finds it.

"I don't know" is always a better answer than a confident-sounding hallucination.


Confidence Rating System

Every response generated by Atlas (Layers 1-3) gets an internal confidence score before it's delivered:

┌──────────────────────────────────────────────────────────────┐
│              Response Confidence Pipeline                      │
│                                                                │
│  Response Generated                                           │
│       │                                                        │
│       ▼                                                        │
│  ┌─────────────────┐                                          │
│  │ Confidence       │                                          │
│  │ Assessor         │ ← checks grounding sources               │
│  └────────┬────────┘                                          │
│           │                                                    │
│     ┌─────┴──────┬──────────┬──────────┐                     │
│     ▼            ▼          ▼          ▼                      │
│  HIGH (≥0.8)  MED (0.5-0.8) LOW (0.2-0.5) NONE (<0.2)       │
│  Deliver      Hedge        Ask/Search   Refuse to guess       │
│  directly     language     for more     "I don't know"        │
│               "likely",    info first                          │
│               "I think"                                       │
└──────────────────────────────────────────────────────────────┘

Confidence Levels & Behavior

| Level | Score | Filler Phrase | Then... |
|-------|-------|---------------|---------|
| HIGH | ≥ 0.8 | Normal sentiment filler ("Good question — ") | Deliver answer directly |
| MEDIUM | 0.5–0.8 | Uncertainty filler ("I think I know, but let me make sure — ") | Deliver with hedge language |
| LOW | 0.2–0.5 | Research filler ("Hmm, I'm not 100% on that — let me check... ") | Stream filler → grounding loop → deliver grounded answer |
| NONE | < 0.2 | Honest filler ("I genuinely don't know this one. ") | Ask user or offer to search |

Critical: the user always sees words immediately. The grounding loop runs in the background while the filler streams. The user never stares at a spinner.
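
The four bands and their thresholds can be sketched as a small dispatcher (a minimal illustration; the function name is an assumption):

```python
def confidence_band(score: float) -> str:
    """Map a confidence score to the behavior bands from the table above."""
    if score >= 0.8:
        return "HIGH"    # deliver directly
    if score >= 0.5:
        return "MEDIUM"  # deliver with hedge language
    if score >= 0.2:
        return "LOW"     # stream filler, run the grounding loop
    return "NONE"        # refuse to guess; ask or offer to search
```

Boundary values fall upward: 0.8 lands in HIGH and 0.5 in MEDIUM, matching the table's ≥ markers.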

Confidence-Aware Filler Pools

These work the same way as sentiment fillers — streamed instantly while Atlas works in the background. They layer on top of sentiment fillers (sentiment sets the emotional tone, confidence sets the certainty framing):

HIGH confidence (≥ 0.8) — no confidence filler needed

Just use the normal sentiment-based fillers:

"Good question — "
"Sure — "
"Yeah, "

MEDIUM confidence (0.5–0.8) — gentle hedge

"I'm fairly sure, but let me double-check... "
"If I remember right — "
"I think — "
"Pretty sure it's — "
"Let me make sure I've got this right... "

LOW confidence (0.2–0.5) — actively researching

These buy time while the grounding loop runs (up to 3 seconds):

"Hmm, I'm not 100% on that — give me a sec to look it up... "
"Good question — I don't want to guess on this. Checking now... "
"Let me verify before I say something wrong... "
"I've got a rough idea, but let me make sure... "
"Not totally sure off the top of my head — pulling it up now... "
"I want to get this right — one moment... "

If the grounding loop succeeds (confidence boosted ≥ 0.5):

"...okay, got it. [grounded answer]"
"...yeah, so here's what I found: [answer]"
"...alright, confirmed — [answer]"

If the grounding loop fails (still < 0.5):

"...I couldn't verify this, so take it with a grain of salt: [best guess with caveats]"
"...I'm still not finding a solid answer. Want me to dig deeper, or do you have more context?"
"...honestly, I'm not confident enough to give you an answer on this. Can you point me in the right direction?"

NONE confidence (< 0.2) — immediate honesty

"I genuinely don't know this one. "
"That's outside what I can answer confidently. "
"I'd rather not guess on this — "
"I have no idea, honestly. "

Followed by an offer:

"Want me to search for it?"
"Do you have any more details that might help?"
"I can try to look it up if you want."
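
In code, the pools above can be kept per band and sampled at response time (a sketch with a representative subset of the fillers; the names are illustrative):

```python
import random

# A representative subset of the filler pools listed above.
CONFIDENCE_FILLERS = {
    "MEDIUM": [
        "I'm fairly sure, but let me double-check... ",
        "If I remember right — ",
    ],
    "LOW": [
        "Hmm, I'm not 100% on that — give me a sec to look it up... ",
        "Let me verify before I say something wrong... ",
    ],
    "NONE": [
        "I genuinely don't know this one. ",
        "I'd rather not guess on this — ",
    ],
}

def select_confidence_filler(band: str) -> str:
    """HIGH confidence needs no extra filler; other bands sample their pool."""
    if band == "HIGH":
        return ""
    return random.choice(CONFIDENCE_FILLERS[band])
```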

Filler + Grounding Timeline

User: "What's the default MTU for WireGuard?"

0ms    → Confidence pre-check: LLM topic = networking config, 
         specific number → base confidence 0.35 (LOW)
         Prior mistakes in networking? → 0 → no penalty → 0.35

2ms    → Start streaming LOW confidence filler:
         "Good question — I don't want to guess on this. Checking now... "

50ms   → Grounding loop starts (background):
         ├── Memory HOT query: "wireguard MTU" → no hits
         ├── Web search: "wireguard default MTU" 
         │   → Result: 1420 (multiple sources agree)
         └── Confidence re-scored: 0.85 (web consensus)

1200ms → Grounding complete. Transition to answer:
         "...okay, got it. WireGuard defaults to an MTU of 1420. 
          That's lower than the standard 1500 to account for the 
          WireGuard overhead. You might need to tune it depending 
          on your network."

User perceives: continuous response from 2ms, with honest 
                uncertainty that resolves into a confident answer.

Combining Sentiment + Confidence Fillers

The two filler systems compose naturally:

| Sentiment | Confidence | Combined Filler |
|-----------|------------|-----------------|
| Question | HIGH | "Good question — [answer]" |
| Question | LOW | "Good question — I'm not sure off the top of my head. Checking... " |
| Frustrated | HIGH | "Yeah, that's annoying. [answer]" |
| Frustrated | LOW | "I hear you — let me look into this real quick... " |
| Greeting | any | "Hey! " (confidence doesn't apply to greetings) |
| Excited | MEDIUM | "Ooh, interesting one — I think I know but let me confirm... " |

The filler engine selects sentiment filler first, then appends confidence framing if needed:

filler = select_sentiment_filler(sentiment, user_profile)
if confidence < 0.8:
    filler += select_confidence_filler(confidence)

What Determines Confidence?

| Signal | Effect on Confidence | Notes |
|--------|----------------------|-------|
| Answer from Layer 1 (date, math) | +0.95 | Computed, not generated — always grounded |
| Answer from Layer 2 (HA command) | +0.90 | Direct API result, verifiable |
| Memory hit (high relevance score) | +0.3 to +0.7 | Depends on memory confidence + recency |
| Web search result (SearXNG) | +0.3 to +0.6 | Depends on source quality + consensus |
| LLM generation (no tools used) | +0.2 to +0.5 | Pure generation = lowest baseline confidence |
| Multiple sources agree | +0.2 bonus | Corroboration boosts confidence |
| User previously confirmed this | +0.3 bonus | "You told me X last week" is grounded |
| Topic is fast-changing (versions, dates) | −0.2 penalty | High hallucination risk for version numbers |
| Specific numbers/dates/URLs | −0.2 penalty | LLMs fabricate these frequently |
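
The signals in the table compose into a single score. A minimal scoring sketch (the source names are assumptions, and the per-source bases take the midpoint of each table range):

```python
# Illustrative base scores per grounding source (midpoints of the table ranges).
SOURCE_BASE = {
    "layer1_computed": 0.95,
    "layer2_ha_api": 0.90,
    "memory_hit": 0.50,      # midpoint of +0.3..+0.7
    "web_search": 0.45,      # midpoint of +0.3..+0.6
    "llm_generation": 0.35,  # midpoint of +0.2..+0.5
}

def score_confidence(source: str, *, sources_agree: bool = False,
                     user_confirmed: bool = False,
                     fast_changing_topic: bool = False,
                     has_specific_values: bool = False) -> float:
    """Combine a base score with the bonus/penalty signals, clamped to [0, 1]."""
    score = SOURCE_BASE[source]
    if sources_agree:
        score += 0.2   # corroboration bonus
    if user_confirmed:
        score += 0.3   # user previously confirmed this
    if fast_changing_topic:
        score -= 0.2   # versions and dates drift
    if has_specific_values:
        score -= 0.2   # LLMs fabricate numbers/dates/URLs
    return max(0.0, min(score, 1.0))
```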

The Grounding Loop

When confidence is LOW, Atlas doesn't just say "I don't know" — it actively tries to ground the answer before responding:

User: "What port does Zigbee2MQTT run on by default?"

Step 1: LLM generates "8080" (confidence: 0.35 — uncertain, specific number)

Step 2: Grounding loop triggers (confidence < 0.5)
        ├── Check memory: any prior interactions about Zigbee2MQTT? → No
        ├── Check HA: is zigbee2mqtt running? → query API
        │   └── Found: zigbee2mqtt addon running on port 8485 (frontend)
        ├── Web search: "zigbee2mqtt default port" → confirms 8080 (backend), 8485 (frontend)
        └── Consensus: multiple sources, confidence now 0.85

Step 3: Deliver grounded answer:
        "Zigbee2MQTT's backend API runs on port 8080 by default, 
         and the web frontend is on 8485. Looks like yours is 
         running on 8485 based on your HA setup."

Grounding Sources (priority order)

1. Computed / deterministic       (date, math, system state)     → highest trust
2. Direct observation             (HA API, docker inspect, etc.) → very high
3. User's own prior statements    (from memory system)           → high
4. Web search consensus           (multiple sources agree)       → medium-high
5. Single web source              (one result)                   → medium
6. LLM knowledge                  (training data)                → medium-low
7. LLM inference                  (reasoning, no source)         → low

Grounding Loop Architecture

┌─────────────────────────────────────────────────────────┐
│                  Grounding Loop                          │
│                                                          │
│  Input: LLM draft response + confidence score            │
│                                                          │
│  if confidence ≥ 0.8:                                    │
│      → deliver as-is                                     │
│                                                          │
│  if confidence < 0.8:                                    │
│      → extract claims (specific facts, numbers, dates)   │
│      → for each claim:                                   │
│          1. Check memory (HOT path) for corroboration    │
│          2. Check live sources (HA API, system state)     │
│          3. Web search if still uncertain                 │
│          4. Re-score confidence per claim                 │
│      → if overall confidence now ≥ 0.8:                  │
│          → deliver with grounded info                     │
│      → if still < 0.5:                                   │
│          → ask user for clarification                     │
│          → OR say "I'm not confident in this answer"      │
│                                                          │
│  Max grounding attempts: 2 (avoid infinite loops)        │
│  Grounding budget: 3 seconds max                         │
└─────────────────────────────────────────────────────────┘
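
The loop in the box above can be sketched as follows; `check_sources` stands in for the memory/HA/web checks and is an assumed callback:

```python
import time

def grounding_loop(draft, confidence, claims, check_sources,
                   budget_s=3.0, max_attempts=2):
    """Sketch of the grounding loop: re-score each claim against sources,
    deliver when confident, otherwise hand back to the caller."""
    if confidence >= 0.8:
        return draft, confidence               # deliver as-is
    deadline = time.monotonic() + budget_s     # 3-second grounding budget
    for _ in range(max_attempts):              # avoid infinite loops
        scores = [check_sources(c) for c in claims
                  if time.monotonic() < deadline]
        if scores:
            confidence = min(scores)           # weakest claim bounds the answer
        if confidence >= 0.8:
            return draft, confidence           # grounded: deliver
    return None, confidence                    # caller asks the user / caveats
```

Taking the minimum over claim scores means a single unverified number keeps the whole answer on the uncertain path, which matches the per-claim re-scoring described above.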

Claim Extraction

The grounding loop doesn't check the entire response — it identifies specific verifiable claims:

Response: "Docker Compose v2.24 added the 'include' directive, 
           which lets you merge multiple compose files."

Extracted claims:
  1. "Docker Compose v2.24 added 'include'" → specific version = HIGH hallucination risk
  2. "'include' merges compose files" → functional claim = MEDIUM risk

Grounding actions:
  Claim 1: web search "docker compose include directive version" 
           → actually v2.20 → version number wrong, correct the claim
  Claim 2: consistent with known Docker knowledge → pass

Corrected response: "Docker Compose v2.20 added the 'include' directive..."

High-Risk Claim Patterns

These patterns trigger automatic grounding checks:

| Pattern | Risk | Example |
|---------|------|---------|
| Version numbers | Very high | "Python 3.12 added..." |
| Specific dates | High | "Released on March 15, 2025" |
| URLs | Very high | "See docs at https://..." |
| Exact config values | High | "Set max_connections to 1024" |
| API endpoints/params | High | "POST to /api/v2/users" |
| Numerical limits | Medium | "Supports up to 10,000 connections" |
| Attribution | High | "According to the Linux Foundation..." |
| "Always" / "Never" absolutes | Medium | "Docker always restarts containers..." |
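
These triggers are easy to approximate with regexes. A sketch covering a few rows of the table (illustrative patterns, not an exhaustive detector):

```python
import re

# Rough regexes for some of the high-risk patterns in the table above.
HIGH_RISK_PATTERNS = {
    "version_number": re.compile(r"\bv?\d+\.\d+(\.\d+)?\b"),
    "url": re.compile(r"https?://\S+"),
    "specific_date": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        r"[a-z]*\.? \d{1,2}, \d{4}\b"),
    "absolute": re.compile(r"\b(always|never)\b", re.IGNORECASE),
}

def flag_high_risk(text: str) -> list:
    """Return the names of high-risk patterns present in a claim."""
    return [name for name, pat in HIGH_RISK_PATTERNS.items()
            if pat.search(text)]
```

Any non-empty result would route the claim into the grounding loop automatically.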

Mistake Tracking & Learning

When Atlas gets something wrong, it doesn't just correct itself — it logs the mistake so the system learns.

Mistake Detection

Mistakes are detected via:

  1. User correction: "That's wrong, it's actually X"
  2. Grounding loop contradiction: web search contradicts initial answer
  3. Self-correction: Atlas realizes mid-response it was wrong
  4. Follow-up failure: user reports the advice didn't work

Mistake Log Schema

CREATE TABLE mistake_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    interaction_id INTEGER,             -- which interaction contained the mistake
    user_id TEXT,
    claim_text TEXT NOT NULL,           -- what Atlas said that was wrong
    correction_text TEXT,               -- what the right answer is
    detection_method TEXT NOT NULL,     -- 'user_correction' | 'grounding' | 'self_correction' | 'follow_up'
    mistake_category TEXT,             -- 'version_number' | 'factual' | 'config' | 'attribution' | 'logic' | 'outdated'
    confidence_at_time REAL,           -- how confident Atlas was when it said it
    topic_tags TEXT DEFAULT '[]',      -- JSON: ["docker", "python", "networking"]
    root_cause TEXT,                   -- 'training_data_outdated' | 'hallucination' | 'misunderstood_question' | 'missing_context'
    resolved BOOLEAN DEFAULT FALSE,    -- has this been addressed in memory/patterns?
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (interaction_id) REFERENCES interactions(id)
);

CREATE INDEX idx_mistakes_category ON mistake_log(mistake_category);
CREATE INDEX idx_mistakes_topic ON mistake_log(topic_tags);
CREATE INDEX idx_mistakes_unresolved ON mistake_log(resolved) WHERE resolved = FALSE;
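
A small helper can write rows against this schema (a sketch; the helper name and argument set are assumptions, but the column names match the CREATE TABLE above):

```python
import json
import sqlite3

def log_mistake(db: sqlite3.Connection, *, interaction_id, user_id, claim,
                correction, detection_method, category, confidence, topics):
    """Insert one row into mistake_log; topic_tags is stored as a JSON array."""
    db.execute(
        """INSERT INTO mistake_log
               (interaction_id, user_id, claim_text, correction_text,
                detection_method, mistake_category, confidence_at_time,
                topic_tags)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (interaction_id, user_id, claim, correction,
         detection_method, category, confidence, json.dumps(topics)),
    )
    db.commit()
```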

Learning from Mistakes

The nightly evolution job processes unresolved mistakes:

Nightly Mistake Review:
  1. Query: SELECT * FROM mistake_log WHERE resolved = FALSE
  
  2. For each mistake:
     a. Store correction in memory (COLD path):
        "Atlas incorrectly stated X. The correct answer is Y."
        type: 'correction', confidence: 0.95, source: 'mistake_review'
     
     b. If pattern detected (e.g., Atlas keeps getting Docker versions wrong):
        → Add to high-risk topics list
        → Lower default confidence for that topic by 0.1
        → Store: "Atlas tends to hallucinate Docker version numbers. 
                  Always verify via web search."
     
     c. Mark as resolved: UPDATE mistake_log SET resolved = TRUE
  
  3. Generate mistake report:
     "Today: 3 mistakes corrected. Recurring issue: version numbers 
      in Docker ecosystem. Added auto-grounding trigger for Docker 
      version claims."
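
Steps (a) and (c) of the review translate directly to code; pattern detection (b) is omitted here, and `memory_store.add` stands in for an assumed memory-system API:

```python
def nightly_mistake_review(db, memory_store):
    """Store corrections for unresolved mistakes, then mark them resolved."""
    mistakes = db.execute(
        "SELECT id, claim_text, correction_text FROM mistake_log "
        "WHERE resolved = FALSE"
    ).fetchall()
    for mistake_id, claim, correction in mistakes:
        # (a) store the correction as a high-confidence memory (COLD path)
        memory_store.add(
            text=f"Atlas incorrectly stated {claim!r}. "
                 f"The correct answer is {correction!r}.",
            type="correction", confidence=0.95, source="mistake_review",
        )
        # (c) mark as resolved
        db.execute(
            "UPDATE mistake_log SET resolved = TRUE WHERE id = ?",
            (mistake_id,),
        )
    db.commit()
    return len(mistakes)
```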

Mistake-Aware Confidence Adjustment

If Atlas has been wrong about a topic before, it knows to be more careful:

def adjust_confidence(base_confidence, topic_tags, user_id):
    # Count this user's recent mistakes in these topic areas.
    # topic_tags is stored as a JSON array, so match each tag with wildcards.
    prior_mistakes = sum(
        query("""
            SELECT COUNT(*) FROM mistake_log
            WHERE user_id = ?
              AND topic_tags LIKE '%' || ? || '%'
              AND created_at > datetime('now', '-30 days')
        """, user_id, tag)
        for tag in topic_tags
    )

    # Each recent mistake in this topic lowers confidence
    penalty = min(prior_mistakes * 0.1, 0.3)  # max 0.3 penalty

    return max(base_confidence - penalty, 0.1)

Example: Atlas hallucinated a Docker Compose version once. Next time someone asks about Docker Compose versions, the base confidence drops from 0.5 to 0.4, triggering the grounding loop automatically.


User-Facing Transparency

Atlas communicates its confidence naturally:

High Confidence (just answer)

"Piper TTS runs on port 10200 in your setup."

Medium Confidence (hedge)

"I believe Zigbee2MQTT defaults to port 8080 for the API, 
 but let me verify that... [checks] yeah, 8080 for the API 
 and 8485 for the web frontend."

Low Confidence (ask or search)

"I'm not sure about the exact firmware version your Zigbee 
 coordinator needs. Want me to look that up, or do you have 
 the model number?"

No Confidence (refuse to guess)

"I genuinely don't know the answer to that, and I don't want 
 to make something up. Let me search for it."

Self-Correction

"Actually, wait — I said Python 3.12 but I think it was 3.11. 
 Let me check... [searches] it was 3.11. Sorry about that."

Integration Points

With Memory System (C5)

  • Corrections stored as high-confidence memories
  • Prior mistakes retrieved during confidence assessment
  • "You asked about this before and I got it wrong — here's the corrected info"

With Personality System

  • Honesty system (personality.md) defines the tone of uncertainty
  • Grounding system defines the mechanics of when to be uncertain
  • Together: Atlas is direct ("I don't know") not mealy-mouthed ("I'm not entirely sure but perhaps maybe...")

With Nightly Evolution (C2)

  • Mistake review is a nightly job task
  • Pattern detection across mistakes (recurring topics)
  • Confidence calibration adjusts per-topic baselines over time

With LLM System Prompt

[GROUNDING RULES]
You MUST follow these rules about uncertainty:
- If you are not confident in a specific fact (version, date, URL, number), say so.
- Never present uncertain information as definitive fact.
- If asked something you don't know, say "I don't know" and offer to look it up.
- When you catch yourself being wrong, correct immediately. Don't double down.
- Prefer "I'll check" over guessing. Prefer "I don't know" over fabricating.
- If multiple answers are possible, present them as options, not as THE answer.
[/GROUNDING RULES]

[MISTAKE HISTORY]
Topics where you've been wrong before (verify before answering):
{{MISTAKE_TOPICS}}
[/MISTAKE HISTORY]

Metrics

Tracked in memory_metrics for observability:

-- Grounding loop triggers per day
SELECT DATE(ts), COUNT(*) FROM memory_metrics 
WHERE operation = 'grounding_loop' GROUP BY DATE(ts);

-- Mistake rate trend
SELECT DATE(created_at), COUNT(*) FROM mistake_log GROUP BY DATE(created_at);

-- Confidence calibration: are we under/over-confident?
SELECT 
    CASE WHEN confidence_at_time >= 0.8 THEN 'high'
         WHEN confidence_at_time >= 0.5 THEN 'medium'
         ELSE 'low' END as confidence_band,
    COUNT(*) as total_mistakes
FROM mistake_log
GROUP BY confidence_band;
-- If most mistakes are in the "high" band, we're overconfident
-- If most are in "low", the grounding loop is working