`benchmark_results/realistic_paper_summary.txt` (new file, 24 additions):
REALISTIC PERFORMANCE SUMMARY (LLM-Generated Scenarios)
==========================================================

Core Operation Overhead (50-message realistic conversation):
- Checkpoint creation: 0.08ms (mean)
- Branch creation: 0.11ms (mean)
- Branch switching: 0.11ms (mean)
- Message injection: 0.17ms (mean)

All operations satisfy R4 requirement (<50ms overhead).

Memory Footprint:
- 50-message realistic conversation: 54.12 KB peak memory
- Realistic message content (50-150 chars)

Realistic Workflow (Multi-branch exploration):
- Total checkpoints: 2
- Total branches: 4
- Total switches: 5
- Total injections: 2
- Total overhead: 0.39ms

Key Finding: SDK overhead remains low even with realistic, complex
conversations generated by LLM. Timing isolated from LLM API latency.
`benchmarks/REALISTIC_BENCHMARK_README.md` (new file, 349 additions):
# Realistic Performance Benchmark with LLM-Generated Scenarios

## Overview

This benchmark (`realistic_performance_benchmark.py`) uses **ChatGPT to generate realistic conversation scenarios** instead of static test data. This provides more accurate performance measurements that reflect real-world usage patterns.

## Key Differences from Static Benchmark

### Static Benchmark (`performance_benchmark.py`)
- ❌ Uses artificial messages like "Message 0", "Message 1", etc.
- ❌ Fixed conversation patterns
- ❌ No realistic decision points
- ✅ Fast (no LLM API calls)
- ✅ Deterministic

### Realistic Benchmark (`realistic_performance_benchmark.py`)
- ✅ Uses ChatGPT to generate realistic technical conversations
- ✅ Natural decision points for branching
- ✅ Realistic message lengths (50-150 characters)
- ✅ Real-world topics (databases, languages, architectures)
- ⏱️ Slower (requires LLM API calls for scenario generation)
- 🎲 Non-deterministic (different scenarios each run)

## What LLM Generates

For each test scenario, ChatGPT creates:

1. **Topic**: Realistic technical discussion
- Example: "Choosing between PostgreSQL and MongoDB for e-commerce app"

2. **Initial Messages**: 3-5 message pairs leading to a decision point
- User asks about requirements
- Assistant explores options
- Natural conversation flow

3. **Decision Point**: Where checkpoint should be created
- Identified by LLM based on conversation context

4. **Branches**: 2-3 alternative explorations
- Each branch explores a different option
- Realistic pros/cons discussion
- 3-5 messages per branch

5. **Injection Strategy**: Which insights to merge back
- LLM decides which messages contain valuable insights
- Indices of messages to inject

## What We Time (SDK Operations Only)

**⏱️ TIMED (SDK operations):**
- Checkpoint creation
- Branch creation
- Branch switching
- Message injection

**🚫 NOT TIMED (setup/LLM calls):**
- LLM API calls to generate scenarios
- Parsing JSON responses
- Adding messages to workspace (setup)

This ensures we measure **only SDK overhead**, not LLM latency.

## Example Generated Scenario

```json
{
  "topic": "Database selection for social media app with 10M users",
  "initial_messages": [
    {"role": "user", "content": "I need a database for 10M users, <100ms query time"},
    {"role": "assistant", "content": "Let's consider PostgreSQL, MongoDB, or Cassandra..."},
    {"role": "user", "content": "Budget is $500/month, team knows SQL"}
  ],
  "decision_point_index": 2,
  "branches": [
    {
      "name": "explore-postgres",
      "messages": [
        {"role": "user", "content": "What about PostgreSQL with read replicas?"},
        {"role": "assistant", "content": "PostgreSQL excels at complex queries and ACID..."},
        {"role": "user", "content": "Can it scale to 10M users?"},
        {"role": "assistant", "content": "Yes, with proper indexing and replication..."}
      ]
    },
    {
      "name": "explore-mongo",
      "messages": [
        {"role": "user", "content": "What about MongoDB with sharding?"},
        {"role": "assistant", "content": "MongoDB provides horizontal scaling..."},
        {"role": "user", "content": "What about consistency guarantees?"},
        {"role": "assistant", "content": "MongoDB offers tunable consistency..."}
      ]
    }
  ],
  "inject_indices": [0, 2]
}
```
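A scenario in this shape can be sanity-checked before benchmarking. The sketch below assumes only the field names shown in the example above; the benchmark's actual parsing code may differ:

```python
# Minimal validation for an LLM-generated scenario in the shape shown above.
# Field names come from the example JSON; everything else is illustrative.
import json

REQUIRED_KEYS = {"topic", "initial_messages", "decision_point_index",
                 "branches", "inject_indices"}

def validate_scenario(raw: str) -> dict:
    """Parse a scenario JSON string and raise ValueError on structural problems."""
    scenario = json.loads(raw)
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        raise ValueError(f"scenario missing keys: {missing}")
    n = len(scenario["initial_messages"])
    if not 0 <= scenario["decision_point_index"] < n:
        raise ValueError("decision_point_index out of range")
    for msg in scenario["initial_messages"]:
        if msg["role"] not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {msg['role']}")
    for branch in scenario["branches"]:
        if not branch.get("name") or not branch.get("messages"):
            raise ValueError("branch needs a name and messages")
    return scenario
```

Catching malformed scenarios at this stage keeps parse failures out of the timed trials.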

## Benchmark Tests

### 1. Realistic Operation Overhead

Tests SDK operations with conversations of varying sizes:
- 10 messages: Simple query
- 30 messages: Two alternatives
- 50 messages: Three alternatives
- 100 messages: Complex multi-branch
- 200 messages: Deep exploration tree

For each size:
- Generate realistic scenario via ChatGPT
- Run 20 trials
- Time checkpoint, branch, switch, inject operations
- Calculate mean, median, stdev, min, max
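The aggregation step above can be done entirely with the standard library; this is a sketch assuming a plain list of per-trial timings in milliseconds:

```python
# Aggregate per-trial timings (ms) into the statistics the benchmark reports.
import statistics

def summarize(timings_ms: list[float]) -> dict:
    return {
        "mean": statistics.mean(timings_ms),
        "median": statistics.median(timings_ms),
        "stdev": statistics.stdev(timings_ms) if len(timings_ms) > 1 else 0.0,
        "min": min(timings_ms),
        "max": max(timings_ms),
    }

# Example: summarizing 20 checkpoint-creation trials
stats = summarize([0.08, 0.09, 0.07, 0.10] * 5)
```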

### 2. Realistic Memory Footprint

Measures memory with realistic message content:
- Variable message lengths (50-150 chars)
- Realistic technical terminology
- Natural conversation structure
- Multiple branches with different content
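Peak memory for a realistic-sized conversation can be captured with `tracemalloc`. The sketch below uses a plain list of message dicts as a stand-in for the SDK workspace (the real benchmark measures the workspace itself):

```python
# Measure peak memory while building a 50-message conversation with
# realistic message lengths (50-150 chars, per the benchmark design).
import random
import string
import tracemalloc

def random_message(role: str) -> dict:
    length = random.randint(50, 150)
    content = "".join(random.choices(string.ascii_lowercase + " ", k=length))
    return {"role": role, "content": content}

tracemalloc.start()
conversation = [random_message("user" if i % 2 == 0 else "assistant")
                for i in range(50)]
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"50-message conversation: {peak / 1024:.2f} KB peak")
```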

### 3. Realistic Workflow

End-to-end workflow simulating developer usage:
- Initial requirements discussion
- First decision point → 2 branches
- Sub-decision on one branch → 2 more branches
- Inject insights back to main
- Continue with combined knowledge

Measures:
- Total overhead across all operations
- Number of checkpoints, branches, switches
- Realistic multi-level branching pattern
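The multi-level workflow above can be sketched against a minimal stand-in workspace. The class below is illustrative, not the ContextBranch SDK: `add_message`, `create_checkpoint`, and `create_branch` mirror names used elsewhere in this README, while `switch_branch` and `inject_message` are hypothetical:

```python
# Illustrative stand-in for the multi-level workflow; NOT the real SDK.
import time

class ToyWorkspace:
    def __init__(self):
        self.messages, self.checkpoints, self.branches = [], {}, {}
        self.current = "main"
    def add_message(self, msg): self.messages.append(msg)
    def create_checkpoint(self, name):
        self.checkpoints[name] = len(self.messages)
        return name
    def create_branch(self, cp_id, name):
        # A branch starts from the conversation state at the checkpoint.
        self.branches[name] = self.messages[: self.checkpoints[cp_id]]
    def switch_branch(self, name): self.current = name
    def inject_message(self, msg): self.messages.append(msg)

ws = ToyWorkspace()
start = time.perf_counter()
ws.add_message({"role": "user", "content": "initial requirements discussion"})
cp = ws.create_checkpoint("first-decision")
ws.create_branch(cp, "option-a")
ws.create_branch(cp, "option-b")
ws.switch_branch("option-a")
cp2 = ws.create_checkpoint("sub-decision")
ws.create_branch(cp2, "option-a1")
ws.create_branch(cp2, "option-a2")
ws.switch_branch("main")
ws.inject_message({"role": "assistant", "content": "insight from option-a"})
total_ms = (time.perf_counter() - start) * 1000
```

Note the counts match the workflow: two checkpoints and four branches across two levels.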

## Running the Benchmark

### Prerequisites

```bash
# Set OpenAI API key (or Anthropic)
export OPENAI_API_KEY=your_key_here

# Install dependencies
pip install openai anthropic
```

### Run

```bash
python benchmarks/realistic_performance_benchmark.py
```

### Expected Runtime

- **Scenario generation**: ~3-5 seconds per scenario (LLM API calls)
- **Benchmark execution**: ~30-60 seconds (20 trials × 5 sizes)
- **Total**: ~2-3 minutes

This is slower than the static benchmark (~60 seconds) but provides realistic data.

## Output

### Console Output

```
================================================================================
REALISTIC PERFORMANCE BENCHMARK
Using LLM-generated conversation scenarios
================================================================================

Initializing LLM for scenario generation...
✓ Using openai/gpt-4

================================================================================
1. REALISTIC OPERATION OVERHEAD
================================================================================

Testing with 10-message realistic scenario...
Generating scenario for 10 messages...
✓ Generated: Database selection for startup with limited budget
Trial 1/20...
Trial 5/20...
...

Operation 10 msgs 30 msgs 50 msgs 100 msgs 200 msgs
------------------------------------------------------------------------
Checkpoint 0.xx ms 0.xx ms 0.xx ms 0.xx ms 0.xx ms
Branch x.xx ms x.xx ms x.xx ms x.xx ms x.xx ms
Switch x.xx ms x.xx ms x.xx ms x.xx ms x.xx ms
Inject x.xx ms x.xx ms x.xx ms x.xx ms x.xx ms
```

### Files Generated

1. **`benchmark_results/realistic_performance_results.json`**
- Complete data in JSON format
- All trials, statistics, scenarios

2. **`benchmark_results/realistic_paper_summary.txt`**
- Summary for research paper
- Key measurements for 50-message conversations
- Workflow statistics

## Comparison: Static vs Realistic Results

### Expected Differences

**Operation overhead may be slightly higher** because:
- Realistic messages are longer (50-150 chars vs ~20 chars)
- More diverse content (affects hashing, serialization)
- Variable message structure

**Memory footprint may be higher** because:
- Longer message content
- More realistic metadata
- Variable message sizes

**But differences should be small (<20%)** because:
- SDK operations are O(n) in message count, not content length
- Hashing is fast regardless of content
- Branch isolation is structural, not content-dependent

### Why This Matters

Static benchmarks might **underestimate** overhead if:
- Realistic messages are significantly longer
- Content diversity affects performance

Or **overestimate** if:
- Static patterns create worst-case scenarios
- Unrealistic uniformity doesn't represent real usage

**Realistic benchmarks provide ground truth** for publication claims.

## For Your Paper

### Which Results to Use?

**Recommendation**: Use **realistic benchmark results** in your paper because:

1. **More credible**: Reviewers can see scenarios are realistic
2. **Reproducible**: Different scenarios each run, but similar statistics
3. **Conservative**: If realistic overhead is <50ms, the claim is stronger
4. **Transparent**: Shows real-world performance, not cherry-picked test cases

### How to Report

In Section 5.3 (Performance and Scalability):

> We benchmark the implementation using realistic conversation scenarios
> generated by GPT-4. For each test, we prompt the LLM to create
> technically realistic discussions with natural decision points for branching.
> We time only SDK operations, excluding LLM API latency. Results represent
> mean latency across 20 trials.

**Table 1: Operation Overhead (50-message realistic conversation)**
| Operation | Mean | Median | StdDev |
|-----------|------|--------|--------|
| Checkpoint | X.XXms | X.XXms | X.XXms |
| Branch | X.XXms | X.XXms | X.XXms |
| Switch | X.XXms | X.XXms | X.XXms |
| Inject | X.XXms | X.XXms | X.XXms |

> All operations satisfy requirement R4 (<50ms overhead) even with realistic,
> variable-length technical discussions generated by an LLM.

## Troubleshooting

### Error: "No LLM available"

```bash
export OPENAI_API_KEY=your_key_here
# or
export ANTHROPIC_API_KEY=your_key_here
```

### Error: "JSON parse error"

The LLM sometimes returns malformed JSON. The benchmark has fallback scenarios.
If this happens frequently, try:
- Using GPT-4 instead of GPT-3.5 (more reliable JSON)
- Simplifying the scenario generation prompt
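A common mitigation is to strip markdown fences before parsing and fall back to a canned scenario when parsing still fails. This is a hedged sketch; the benchmark's actual fallback logic and fallback content may differ:

```python
# Tolerant parsing of an LLM JSON reply, with a canned fallback scenario.
# The fallback content here is illustrative, not the benchmark's built-in one.
import json

FALLBACK_SCENARIO = {
    "topic": "Database selection (fallback scenario)",
    "initial_messages": [
        {"role": "user", "content": "Which database should we use?"},
        {"role": "assistant", "content": "Let's compare a few options..."},
    ],
    "decision_point_index": 1,
    "branches": [],
    "inject_indices": [],
}

def parse_scenario(reply: str) -> dict:
    text = reply.strip()
    # Models often wrap JSON in ```json ... ``` fences; strip them first.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return FALLBACK_SCENARIO
```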

### Slow Performance

LLM API calls take 2-5 seconds each. To speed up:
- Use fewer message counts (remove 200-message test)
- Reduce trials from 20 to 10
- Use faster model (gpt-3.5-turbo)

### Different Results Each Run

This is expected! Scenarios are randomly generated. Statistics (mean, median)
should be similar across runs (±20%), but individual scenarios differ.

For deterministic results, use static benchmark (`performance_benchmark.py`).

## Technical Details

### Timing Methodology

```python
import time

# NOT timed: generate scenario via the LLM
scenario = self.generate_conversation_scenario(msg_count)

# NOT timed: set up workspace with the initial messages
for msg in scenario["initial_messages"]:
    workspace.add_message(msg)

# TIMED: checkpoint creation
start = time.perf_counter()
cp_id = workspace.create_checkpoint("decision")
checkpoint_time = (time.perf_counter() - start) * 1000  # ms

# TIMED: branch creation
start = time.perf_counter()
workspace.create_branch(cp_id, "branch_name")
branch_time = (time.perf_counter() - start) * 1000  # ms
```

### Statistical Validity

- **20 trials**: Sufficient for stable mean/median (CLT applies)
- **Multiple scenarios**: Different LLM-generated scenarios per message count
- **Outlier handling**: Min/max reported alongside mean/median
- **StdDev**: Reported to show measurement stability

## Future Enhancements

1. **Scenario caching**: Save generated scenarios to avoid re-generation
2. **More LLMs**: Test with Claude, Llama, etc. for scenario generation
3. **Scenario complexity metrics**: Measure branching factor, depth, message length distribution
4. **Token counting**: Measure actual token counts with tiktoken
5. **Parallel trials**: Run trials in parallel for speed

---

## Summary

**Realistic benchmark = More credible publication results**

- ✅ LLM-generated realistic scenarios
- ✅ Natural decision points and branching
- ✅ Careful timing isolation (SDK only, not LLM API)
- ✅ Statistical rigor (20 trials, mean ± stdev)
- ✅ Reproducible (same methodology, different scenarios)

Use these results in your paper to demonstrate that ContextBranch performs well with **realistic**, not just **synthetic**, workloads.