CJK: 蒸:single-source | Parent: PRINCIPLE_BASED_DISTILLATION_SINGLE_SOURCE_GUIDE.md
Purpose: Extract principles from a single memory file (e.g., OpenClaw soul document)
Context: Phase 1 of the two-phase soul compression pipeline. Output feeds into Multi-Source PBD for axiom extraction.
蒸:single-source
├── 段 (section) → Segment document into 5-7 logical sections
├── 抽 (extract) → Extract candidates independently per section
├── 合 (converge) → Build convergence matrix across sections
├── 綜 (synthesize) → Abstract similar statements into principles
└── 証 (validate) → Verify coverage, accuracy, actionability
Critical Step (Step 4): Synthesis transforms specific statements into abstract principles. This is where surface variation becomes semantic unity. Without true abstraction, embeddings won't cluster.
Single-source PBD adapts the standard multi-source methodology for extracting principles from a single document. The key insight: sections within a document function as quasi-independent sources, enabling convergence analysis within one file.
- Extracting principles from a soul document (e.g., OpenClaw's ~35K token implementation)
- Analyzing a single comprehensive memory file
- Phase 1 before multi-source axiom extraction
- 10-30 principles with evidence tiers
- Each principle traceable to specific sections
- Ready for Phase 2 axiom extraction
Divide the source document into logical sections (minimum 3, ideally 5-7):
Source Document (~35K tokens)
├── Section A: Core Values (~5K)
├── Section B: Behavioral Guidelines (~8K)
├── Section C: Communication Patterns (~6K)
├── Section D: Decision Framework (~7K)
├── Section E: Edge Cases (~5K)
└── Section F: Meta-Instructions (~4K)
Guidelines:
- Each section should be thematically coherent
- Sections can overlap conceptually (that's the point)
- Natural document structure often provides good boundaries
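As a rough illustration of the segmentation step, the sketch below splits a markdown document at level-2 headings and checks the section count against the "minimum 3, ideally 5-7" guidance. It assumes `## ` headings are a reasonable proxy for thematic boundaries, which will not hold for every document. `DocSection`, `segmentByHeadings`, and `sectionCountVerdict` are hypothetical names, not part of the referenced codebase.

```typescript
// Hypothetical shape for one segmented section (not from the real codebase).
interface DocSection {
  title: string;
  startLine: number; // 1-indexed line of the heading
  endLine: number;   // 1-indexed, inclusive
  text: string;      // body text without the heading line
}

// Split a document at level-2 markdown headings ("## ...").
// Assumption: headings are a usable proxy for thematic boundaries.
// Lines before the first heading are ignored.
function segmentByHeadings(source: string): DocSection[] {
  const lines = source.split("\n");
  const sections: DocSection[] = [];
  let current: DocSection | null = null;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line.indexOf("## ") === 0) {
      if (current !== null) sections.push(current);
      current = { title: line.slice(3).trim(), startLine: i + 1, endLine: i + 1, text: "" };
    } else if (current !== null) {
      current.endLine = i + 1;
      current.text += line + "\n";
    }
  }
  if (current !== null) sections.push(current);
  return sections;
}

// Compare a section count with the guidance: minimum 3, ideally 5-7.
function sectionCountVerdict(count: number): "too-few" | "acceptable" | "ideal" {
  if (count < 3) return "too-few";
  return count >= 5 && count <= 7 ? "ideal" : "acceptable";
}
```

Keeping `startLine` and `endLine` on each section makes it straightforward to carry line references into the extraction step.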
For each section, extract candidate principles with metadata WITHOUT referencing other sections:
## Section A Extraction
1. "Safety takes precedence over helpfulness" (L45)
- **Stance**: ASSERT
- **Importance**: CORE
2. "I wonder if being too safe hurts helpfulness" (L102)
- **Stance**: QUESTION
- **Importance**: PERIPHERAL
3. "Maintain consistent identity across contexts" (L178)
- **Stance**: ASSERT
- **Importance**: CORE
Key Rules:
- Extract verbatim or near-verbatim statements
- Include line references for traceability
- Tag each extraction with Stance (ASSERT/DENY/QUESTION/QUALIFY/TENSIONING)
- Tag each extraction with Importance (CORE/SUPPORTING/PERIPHERAL)
- Don't synthesize across sections yet
- Aim for 5-15 candidates per section
- QUESTION and QUALIFY stance signals get lower synthesis weight
- PERIPHERAL importance signals may be filtered before synthesis
Stance Categories:
- ASSERT: Stated as true, definite ("I always...", "We must...")
- DENY: Stated as false, rejection ("I never...", "We don't...")
- QUESTION: Uncertain, exploratory ("I wonder if...", "Maybe...")
- QUALIFY: Conditional ("Sometimes...", "When X, I...")
- TENSIONING: Value conflict, internal tension ("On one hand... but on the other...", "I want X but also Y", "Part of me... while another part...")
Tagging stance ensures your tentative explorations don't get confused with your firm convictions during synthesis.
Importance Categories:
- CORE: Fundamental value, shapes everything ("Above all...", "My core belief...")
- SUPPORTING: Evidence or example of values ("For instance...", "Like when...")
- PERIPHERAL: Context or tangential mention ("Also...", "By the way...")
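One way to picture a tagged candidate is the record type below, paired with a crude cue-phrase tagger built from the example phrases in the category lists above. This is only a keyword fallback for illustration: `guessStance` and `guessImportance` are hypothetical and are not the `classifyStance` / `classifyImportance` functions in `src/lib/semantic-classifier.ts`. The check order and the fallback labels are assumptions.

```typescript
type Stance = "ASSERT" | "DENY" | "QUESTION" | "QUALIFY" | "TENSIONING";
type Importance = "CORE" | "SUPPORTING" | "PERIPHERAL";

// Hypothetical record for one extracted candidate.
interface Candidate {
  text: string;
  section: string;
  line: number;
  stance: Stance;
  importance: Importance;
}

type CueTable<T> = Array<[T, RegExp[]]>;

// Cue phrases taken from the category lists; matched against lowercased text.
// Assumption: more specific stances are checked before ASSERT.
const STANCE_CUES: CueTable<Stance> = [
  ["TENSIONING", [/on one hand/, /\bbut also\b/, /part of me/]],
  ["QUESTION", [/i wonder if/, /\bmaybe\b/]],
  ["QUALIFY", [/\bsometimes\b/, /^when\b/]],
  ["DENY", [/\bi never\b/, /\bwe don't\b/]],
  ["ASSERT", [/\bi always\b/, /\bwe must\b/]],
];

const IMPORTANCE_CUES: CueTable<Importance> = [
  ["CORE", [/above all/, /my core belief/]],
  ["SUPPORTING", [/for instance/, /like when/]],
  ["PERIPHERAL", [/^also\b/, /by the way/]],
];

// Return the label of the first cue that matches, or the fallback.
function firstCueMatch<T>(text: string, table: CueTable<T>, fallback: T): T {
  const lower = text.toLowerCase();
  for (const [label, cues] of table) {
    for (const cue of cues) {
      if (cue.test(lower)) return label;
    }
  }
  return fallback;
}

// Assumption: statements without any cue read as plain assertions.
function guessStance(text: string): Stance {
  return firstCueMatch<Stance>(text, STANCE_CUES, "ASSERT");
}

// Assumption: statements without any cue are treated as SUPPORTING.
function guessImportance(text: string): Importance {
  return firstCueMatch<Importance>(text, IMPORTANCE_CUES, "SUPPORTING");
}
```

A keyword tagger like this will mislabel anything phrased unusually, so its output is best treated as a first pass for human review.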
Compare extractions across sections to identify recurring themes:
| Principle Candidate | Section A | Section B | Section C | Section D | Evidence Tier |
|---|---|---|---|---|---|
| Safety > Helpfulness | ✓ (L45) | ✓ (L312) | ✓ (L567) | ✓ (L890) | UNIVERSAL |
| Admit uncertainty | ✓ (L102) | ✓ (L445) | - | ✓ (L934) | MAJORITY |
| Consistent identity | ✓ (L178) | - | ✓ (L601) | - | MODERATE |
Evidence Tiers:
- UNIVERSAL: Appears in 4+ sections (or in every section when the document has fewer than 5 sections)
- MAJORITY: Appears in 50-75% of sections
- MODERATE: Appears in 2 sections
- WEAK: Appears in 1 section only
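The tier definitions can be sketched as a small function over one row of the convergence matrix. The definitions overlap for some section counts (for example, 2 of 4 sections is both 50% and "2 sections"), so this sketch assumes exact counts of 1 and 2 take precedence, which matches the worked matrix above. Counts the tiers do not cover (such as 3 of 7) fall back to MODERATE, which is also an assumption. `evidenceTier`, `MatrixRow`, and `tierForRow` are hypothetical names.

```typescript
type EvidenceTier = "UNIVERSAL" | "MAJORITY" | "MODERATE" | "WEAK";

// Map "appears in `hits` of `total` sections" to an evidence tier.
// Returns null for inputs that describe no valid appearance count.
function evidenceTier(hits: number, total: number): EvidenceTier | null {
  if (hits <= 0 || total <= 0 || hits > total) return null;
  if (hits === 1) return "WEAK";
  if (hits >= 4 || (total < 5 && hits === total)) return "UNIVERSAL";
  if (hits === 2) return "MODERATE"; // assumption: exact count wins over the 50% rule
  if (hits / total >= 0.5) return "MAJORITY";
  return "MODERATE"; // assumption: fallback for counts the tiers leave undefined
}

// One matrix row: section name -> source line, or null when absent.
interface MatrixRow {
  candidate: string;
  refs: Record<string, number | null>;
}

function tierForRow(row: MatrixRow): EvidenceTier | null {
  const sections = Object.keys(row.refs);
  const hits = sections.filter((s) => row.refs[s] !== null).length;
  return evidenceTier(hits, sections.length);
}
```

Applied to the three rows of the example matrix, this yields UNIVERSAL, MAJORITY, and MODERATE respectively.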
For each UNIVERSAL or MAJORITY pattern, synthesize a clear principle statement. This step is critical: true synthesis abstracts surface variation into semantic unity.
Pre-synthesis filtering (PBD Stage 4 alignment):
- Filter out QUESTION stance signals with <0.7 confidence
- Weight CORE importance signals 1.5x in convergence counting
- Weight PERIPHERAL importance signals 0.5x
- Exclude signals with QUESTION or QUALIFY stance from tier calculation (they indicate uncertainty)
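The filtering rules above could be applied roughly as follows. The guide does not say how weights combine within a section, so this sketch assumes each section contributes the weight of its strongest counted signal, and that SUPPORTING carries a weight of 1.0. `SKETCH_WEIGHTS` mirrors the numbers in the list and is a stand-in, not the `IMPORTANCE_WEIGHT` constant in `src/lib/principle-store.ts`. `Signal`, `preFilter`, and `weightedConvergence` are hypothetical names.

```typescript
type Stance = "ASSERT" | "DENY" | "QUESTION" | "QUALIFY" | "TENSIONING";
type Importance = "CORE" | "SUPPORTING" | "PERIPHERAL";

// Hypothetical signal record for one candidate occurrence.
interface Signal {
  section: string;
  stance: Stance;
  importance: Importance;
  confidence: number; // 0..1
}

// Weights from the list above; SUPPORTING = 1.0 is an assumption.
const SKETCH_WEIGHTS: Record<Importance, number> = {
  CORE: 1.5,
  SUPPORTING: 1.0,
  PERIPHERAL: 0.5,
};

// Drop QUESTION-stance signals whose confidence is below 0.7.
function preFilter(signals: Signal[]): Signal[] {
  return signals.filter((s) => !(s.stance === "QUESTION" && s.confidence < 0.7));
}

// QUESTION and QUALIFY signals do not count toward the evidence tier.
function countsTowardTier(s: Signal): boolean {
  return s.stance !== "QUESTION" && s.stance !== "QUALIFY";
}

// Sum, over sections, the heaviest weight among that section's counted signals.
function weightedConvergence(signals: Signal[]): number {
  const bySection: Record<string, number> = {};
  for (const s of preFilter(signals)) {
    if (!countsTowardTier(s)) continue;
    const w = SKETCH_WEIGHTS[s.importance];
    const prev = bySection[s.section];
    bySection[s.section] = prev === undefined ? w : Math.max(prev, w);
  }
  return Object.keys(bySection).reduce((sum, k) => sum + bySection[k], 0);
}
```

Taking the maximum per section keeps one heavily repeated section from inflating the score; summing every signal would be an equally defensible reading of the rules.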
Before (raw extractions):
- Section A: "Prioritize honesty over comfort" (L45)
- Section B: "Tell users the truth even when unpleasant" (L312)
- Section C: "Avoid polite deception" (L567)
- Section D: "Clear, direct feedback over cushioned criticism" (L890)
After (synthesized principle):
## P1: Truthfulness Over Comfort (UNIVERSAL)
**Statement**: Values truthfulness and directness over social comfort.
**Evidence**: L45 (A), L312 (B), L567 (C), L890 (D)
**Confidence**: High (4/4 sections)
Why this works: The synthesized statement captures the shared semantic meaning while abstracting away surface differences ("honesty", "truth", "polite deception", "direct feedback" → "truthfulness and directness").
Before (raw extractions):
- Section A: "Prioritize honesty over comfort"
- Section B: "Tell users the truth"
Bad synthesis: "Prioritize honesty over comfort; tell users the truth"
- ❌ Just concatenates, doesn't abstract
- ❌ Embeddings will be too specific to cluster
Good synthesis: "Values truthfulness over social comfort"
- ✓ Abstracts to core meaning
- ✓ Embeddings will cluster with similar principles
Synthesis Guidelines:
- Abstract surface form: Different words expressing same concept → unified language
- Make implicit relationships explicit
- Include confidence assessment
- Keep principles actionable
- Use actor-agnostic language (no "I", "we", "you")
Review synthesized principles against the original document:
- Coverage check: Do principles capture major themes?
- Accuracy check: Does each principle reflect source accurately?
- Actionability check: Could an AI system apply each principle?
- Redundancy check: Are any principles duplicative?
Full methodology with human judgment at each step.
When to use: First extraction from important source, quality-critical applications
Process:
- Read full document, identify section boundaries (15 min)
- Extract candidates from each section independently (40 min)
- Build convergence matrix manually (15 min)
- Synthesize principles with citations (15 min)
- Validation pass (5 min)
Pattern-matching with keyword detection.
When to use: Rapid assessment, preliminary analysis
Heuristics:
- Imperative statements ("always", "never", "must")
- Priority markers ("most important", "above all", "first")
- Value declarations ("we believe", "our core", "fundamental")
- Repeated phrases across sections
Limitations: May miss nuanced or implicit principles
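A minimal sketch of the keyword heuristics, assuming line-by-line scanning with case-insensitive regular expressions. It covers the first three marker families only; detecting repeated phrases across sections would need a separate pass. `MARKER_PATTERNS`, `HeuristicHit`, and `scanForCandidates` are hypothetical names.

```typescript
// One line flagged by at least one marker family.
interface HeuristicHit {
  line: number;      // line number in the source document
  text: string;
  markers: string[]; // names of the marker families that matched
}

// Marker words taken from the heuristics list above.
const MARKER_PATTERNS: Record<string, RegExp> = {
  imperative: /\b(always|never|must)\b/i,
  priority: /\b(most important|above all|first)\b/i,
  value: /\b(we believe|our core|fundamental)\b/i,
};

// Scan a section's text; `firstLine` is the document line number
// of the section's first line, so hits keep traceable references.
function scanForCandidates(sectionText: string, firstLine: number = 1): HeuristicHit[] {
  const hits: HeuristicHit[] = [];
  const lines = sectionText.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const markers = Object.keys(MARKER_PATTERNS).filter((k) => MARKER_PATTERNS[k].test(lines[i]));
    if (markers.length > 0) {
      hits.push({ line: firstLine + i, text: lines[i].trim(), markers });
    }
  }
  return hits;
}
```

Words like "first" match in many non-normative sentences, so expect false positives alongside the missed implicit principles noted above.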
Prompt-based extraction for speed.
When to use: Batch processing, initial exploration
Prompt Template:
Analyze this document section by section. For each section:
1. List 5-10 candidate principles (exact quotes preferred)
2. Note the line numbers or paragraph references
Then identify which principles appear across multiple sections.
Rate each by evidence tier: UNIVERSAL (4+ sections), MAJORITY (50-75%), MODERATE (2), WEAK (1).
Document:
[INSERT DOCUMENT]
Post-processing: Human review of LLM output against methodology
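The prompt template can be assembled programmatically, as in the sketch below. Prefixing each document line with its number is an addition of this sketch (so the model has something concrete to cite for step 2), not part of the template itself. The call to an actual model is left out because the guide does not name a client. `withLineNumbers` and `buildExtractionPrompt` are hypothetical names.

```typescript
// Prefix every line with "N: " so line references can be cited.
// Assumption: numbering the input is acceptable for the target model.
function withLineNumbers(text: string): string {
  return text
    .split("\n")
    .map((line, i) => (i + 1) + ": " + line)
    .join("\n");
}

// Fill the prompt template above with a line-numbered document.
function buildExtractionPrompt(documentText: string): string {
  return [
    "Analyze this document section by section. For each section:",
    "1. List 5-10 candidate principles (exact quotes preferred)",
    "2. Note the line numbers or paragraph references",
    "",
    "Then identify which principles appear across multiple sections.",
    "Rate each by evidence tier: UNIVERSAL (4+ sections), MAJORITY (50-75%), MODERATE (2), WEAK (1).",
    "",
    "Document:",
    withLineNumbers(documentText),
  ].join("\n");
}
```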
- ~35,000 tokens
- 6 major sections (Values, Guidelines, Patterns, Framework, Cases, Meta)
- Dense, overlapping concepts
1. Core Values & Identity (lines 1-500)
2. Behavioral Guidelines (lines 501-1200)
3. Communication Patterns (lines 1201-1800)
4. Decision Framework (lines 1801-2400)
5. Edge Cases & Exceptions (lines 2401-2900)
6. Meta-Instructions (lines 2901-3200)
- 15-25 principles at MAJORITY or higher
- 5-10 additional MODERATE principles
- Clear evidence trail to source lines
- Ready for axiom extraction via Multi-Source PBD
- Coverage: Principles address >80% of document themes
- Precision: <10% of principles are redundant or inaccurate
- Traceability: 100% of principles have source references
- Healthy: 20-30% UNIVERSAL, 40-50% MAJORITY, 20-30% MODERATE
- Concerning: >50% WEAK (document may lack coherence)
- Concerning: >70% UNIVERSAL (principles too broad)
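The two "concerning" thresholds above are simple to check automatically. The sketch below computes each tier's share of the extracted principles and returns a warning for each threshold that is exceeded; it does not test the "healthy" ranges, which read as guidance rather than hard limits. `tierShares` and `distributionWarnings` are hypothetical names.

```typescript
type EvidenceTier = "UNIVERSAL" | "MAJORITY" | "MODERATE" | "WEAK";

// Fraction of principles in each tier (all zeros for an empty list).
function tierShares(tiers: EvidenceTier[]): Record<EvidenceTier, number> {
  const counts: Record<EvidenceTier, number> = { UNIVERSAL: 0, MAJORITY: 0, MODERATE: 0, WEAK: 0 };
  for (const t of tiers) counts[t] += 1;
  const total = tiers.length;
  if (total === 0) return counts;
  return {
    UNIVERSAL: counts.UNIVERSAL / total,
    MAJORITY: counts.MAJORITY / total,
    MODERATE: counts.MODERATE / total,
    WEAK: counts.WEAK / total,
  };
}

// Flag the two concerning patterns named in the list above.
function distributionWarnings(tiers: EvidenceTier[]): string[] {
  const shares = tierShares(tiers);
  const warnings: string[] = [];
  if (shares.WEAK > 0.5) warnings.push("document may lack coherence");
  if (shares.UNIVERSAL > 0.7) warnings.push("principles too broad");
  return warnings;
}
```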
- Premature synthesis: Combining principles before convergence analysis
- Section bleed: Letting knowledge from one section influence another's extraction
- Over-abstraction: Synthesizing principles so broad they lose actionability
- Under-extraction: Missing implicit principles that recur across sections without being stated outright
- Citation loss: Failing to maintain traceability to source lines
Single-source PBD produces principles. To derive axioms:
- Complete this guide for each memory file
- Collect all UNIVERSAL and MAJORITY principles
- Apply Multi-Source PBD treating each principle set as a source
- Extract 5-7 axioms from convergent principles
Pipeline:
Memory File(s) → [Single-Source PBD] → Principles → [Multi-Source PBD] → Axioms → [Essence Extraction] → SOUL.md
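The hand-off to Phase 2 can be pictured as a filter over each file's principle set, keeping only UNIVERSAL and MAJORITY entries so that each file becomes one "source" for Multi-Source PBD. The `Principle` shape and `selectForAxiomExtraction` are hypothetical; the actual input format expected by the multi-source guide may differ.

```typescript
type EvidenceTier = "UNIVERSAL" | "MAJORITY" | "MODERATE" | "WEAK";

// Hypothetical shape of one synthesized principle.
interface Principle {
  id: string;
  statement: string;
  tier: EvidenceTier;
  evidence: string[]; // source references, e.g. "L45 (A)"
}

// Keep only UNIVERSAL and MAJORITY principles per memory file.
// Each key of the result is treated as one source in Phase 2.
function selectForAxiomExtraction(
  principlesByFile: Record<string, Principle[]>
): Record<string, Principle[]> {
  const sources: Record<string, Principle[]> = {};
  for (const file of Object.keys(principlesByFile)) {
    sources[file] = principlesByFile[file].filter(
      (p) => p.tier === "UNIVERSAL" || p.tier === "MAJORITY"
    );
  }
  return sources;
}
```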
- Multi-Source PBD: multi-source-pbd-guide.md
- Essence Extraction: essence-extraction-guide.md (Phase 3 - axioms → identity statement)
- OpenClaw Architecture Analysis: ../research/openclaw-soul-architecture.md
- Hierarchical Principles: ../research/hierarchical-principles-architecture.md
- Stance classification: src/lib/semantic-classifier.ts (classifyStance)
- Importance classification: src/lib/semantic-classifier.ts (classifyImportance)
- Weighted clustering: src/lib/principle-store.ts (IMPORTANCE_WEIGHT)
- Signal extraction: src/lib/signal-extractor.ts
This guide enables Phase 1 of soul document compression through systematic principle extraction.