`benchmark_results/realistic_paper_summary.txt` (new file, 24 additions):
REALISTIC PERFORMANCE SUMMARY (LLM-Generated Scenarios)
==========================================================

Core Operation Overhead (50-message realistic conversation):
- Checkpoint creation: 0.08ms (mean)
- Branch creation: 0.11ms (mean)
- Branch switching: 0.11ms (mean)
- Message injection: 0.17ms (mean)

All operations satisfy R4 requirement (<50ms overhead).

Memory Footprint:
- 50-message realistic conversation: 54.12 KB peak memory
- Realistic message content (50-150 chars)

Realistic Workflow (Multi-branch exploration):
- Total checkpoints: 2
- Total branches: 4
- Total switches: 5
- Total injections: 2
- Total overhead: 0.39ms

Key Finding: SDK overhead remains low even with realistic, complex
conversations generated by LLM. Timing isolated from LLM API latency.
`benchmarks/REALISTIC_BENCHMARK_README.md` (new file, 349 additions):
# Realistic Performance Benchmark with LLM-Generated Scenarios

## Overview

This benchmark (`realistic_performance_benchmark.py`) uses **ChatGPT to generate realistic conversation scenarios** instead of static test data. This provides more accurate performance measurements that reflect real-world usage patterns.

## Key Differences from Static Benchmark

### Static Benchmark (`performance_benchmark.py`)
- ❌ Uses artificial messages like "Message 0", "Message 1", etc.
- ❌ Fixed conversation patterns
- ❌ No realistic decision points
- ✅ Fast (no LLM API calls)
- ✅ Deterministic

### Realistic Benchmark (`realistic_performance_benchmark.py`)
- ✅ Uses ChatGPT to generate realistic technical conversations
- ✅ Natural decision points for branching
- ✅ Realistic message lengths (50-150 characters)
- ✅ Real-world topics (databases, languages, architectures)
- ⏱️ Slower (requires LLM API calls for scenario generation)
- 🎲 Non-deterministic (different scenarios each run)

## What LLM Generates

For each test scenario, ChatGPT creates:

1. **Topic**: Realistic technical discussion
- Example: "Choosing between PostgreSQL and MongoDB for e-commerce app"

2. **Initial Messages**: 3-5 message pairs leading to a decision point
- User asks about requirements
- Assistant explores options
- Natural conversation flow

3. **Decision Point**: Where checkpoint should be created
- Identified by LLM based on conversation context

4. **Branches**: 2-3 alternative explorations
- Each branch explores a different option
- Realistic pros/cons discussion
- 3-5 messages per branch

5. **Injection Strategy**: Which insights to merge back
- LLM decides which messages contain valuable insights
- Indices of messages to inject

## What We Time (SDK Operations Only)

**⏱️ TIMED (SDK operations):**
- Checkpoint creation
- Branch creation
- Branch switching
- Message injection

**🚫 NOT TIMED (setup/LLM calls):**
- LLM API calls to generate scenarios
- Parsing JSON responses
- Adding messages to workspace (setup)

This ensures we measure **only SDK overhead**, not LLM latency.

## Example Generated Scenario

```json
{
  "topic": "Database selection for social media app with 10M users",
  "initial_messages": [
    {"role": "user", "content": "I need a database for 10M users, <100ms query time"},
    {"role": "assistant", "content": "Let's consider PostgreSQL, MongoDB, or Cassandra..."},
    {"role": "user", "content": "Budget is $500/month, team knows SQL"}
  ],
  "decision_point_index": 2,
  "branches": [
    {
      "name": "explore-postgres",
      "messages": [
        {"role": "user", "content": "What about PostgreSQL with read replicas?"},
        {"role": "assistant", "content": "PostgreSQL excels at complex queries and ACID..."},
        {"role": "user", "content": "Can it scale to 10M users?"},
        {"role": "assistant", "content": "Yes, with proper indexing and replication..."}
      ]
    },
    {
      "name": "explore-mongo",
      "messages": [
        {"role": "user", "content": "What about MongoDB with sharding?"},
        {"role": "assistant", "content": "MongoDB provides horizontal scaling..."},
        {"role": "user", "content": "What about consistency guarantees?"},
        {"role": "assistant", "content": "MongoDB offers tunable consistency..."}
      ]
    }
  ],
  "inject_indices": [0, 2]
}
```
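A scenario in this shape can be sanity-checked before benchmarking. The sketch below assumes only the field names shown in the example above; the benchmark's actual parsing code may differ:

```python
# Minimal validation for an LLM-generated scenario in the shape shown above.
# Field names come from the example JSON; everything else is illustrative.
import json

REQUIRED_KEYS = {"topic", "initial_messages", "decision_point_index",
                 "branches", "inject_indices"}

def validate_scenario(raw: str) -> dict:
    """Parse a scenario JSON string and raise ValueError on structural problems."""
    scenario = json.loads(raw)
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        raise ValueError(f"scenario missing keys: {missing}")
    n = len(scenario["initial_messages"])
    if not 0 <= scenario["decision_point_index"] < n:
        raise ValueError("decision_point_index out of range")
    for msg in scenario["initial_messages"]:
        if msg["role"] not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {msg['role']}")
    for branch in scenario["branches"]:
        if not branch.get("name") or not branch.get("messages"):
            raise ValueError("branch needs a name and messages")
    return scenario
```

Catching malformed scenarios at this stage keeps parse failures out of the timed trials.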

## Benchmark Tests

### 1. Realistic Operation Overhead

Tests SDK operations with conversations of varying sizes:
- 10 messages: Simple query
- 30 messages: Two alternatives
- 50 messages: Three alternatives
- 100 messages: Complex multi-branch
- 200 messages: Deep exploration tree

For each size:
- Generate realistic scenario via ChatGPT
- Run 20 trials
- Time checkpoint, branch, switch, inject operations
- Calculate mean, median, stdev, min, max
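The aggregation step above can be done entirely with the standard library; this is a sketch assuming a plain list of per-trial timings in milliseconds:

```python
# Aggregate per-trial timings (ms) into the statistics the benchmark reports.
import statistics

def summarize(timings_ms: list[float]) -> dict:
    return {
        "mean": statistics.mean(timings_ms),
        "median": statistics.median(timings_ms),
        "stdev": statistics.stdev(timings_ms) if len(timings_ms) > 1 else 0.0,
        "min": min(timings_ms),
        "max": max(timings_ms),
    }

# Example: summarizing 20 checkpoint-creation trials
stats = summarize([0.08, 0.09, 0.07, 0.10] * 5)
```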

### 2. Realistic Memory Footprint

Measures memory with realistic message content:
- Variable message lengths (50-150 chars)
- Realistic technical terminology
- Natural conversation structure
- Multiple branches with different content
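Peak memory for a realistic-sized conversation can be captured with `tracemalloc`. The sketch below uses a plain list of message dicts as a stand-in for the SDK workspace (the real benchmark measures the workspace itself):

```python
# Measure peak memory while building a 50-message conversation with
# realistic message lengths (50-150 chars, per the benchmark design).
import random
import string
import tracemalloc

def random_message(role: str) -> dict:
    length = random.randint(50, 150)
    content = "".join(random.choices(string.ascii_lowercase + " ", k=length))
    return {"role": role, "content": content}

tracemalloc.start()
conversation = [random_message("user" if i % 2 == 0 else "assistant")
                for i in range(50)]
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"50-message conversation: {peak / 1024:.2f} KB peak")
```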

### 3. Realistic Workflow

End-to-end workflow simulating developer usage:
- Initial requirements discussion
- First decision point → 2 branches
- Sub-decision on one branch → 2 more branches
- Inject insights back to main
- Continue with combined knowledge

Measures:
- Total overhead across all operations
- Number of checkpoints, branches, switches
- Realistic multi-level branching pattern
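The multi-level workflow above can be sketched against a minimal stand-in workspace. The class below is illustrative, not the ContextBranch SDK: `add_message`, `create_checkpoint`, and `create_branch` mirror names used elsewhere in this README, while `switch_branch` and `inject_message` are hypothetical:

```python
# Illustrative stand-in for the multi-level workflow; NOT the real SDK.
import time

class ToyWorkspace:
    def __init__(self):
        self.messages, self.checkpoints, self.branches = [], {}, {}
        self.current = "main"
    def add_message(self, msg): self.messages.append(msg)
    def create_checkpoint(self, name):
        self.checkpoints[name] = len(self.messages)
        return name
    def create_branch(self, cp_id, name):
        # A branch starts from the conversation state at the checkpoint.
        self.branches[name] = self.messages[: self.checkpoints[cp_id]]
    def switch_branch(self, name): self.current = name
    def inject_message(self, msg): self.messages.append(msg)

ws = ToyWorkspace()
start = time.perf_counter()
ws.add_message({"role": "user", "content": "initial requirements discussion"})
cp = ws.create_checkpoint("first-decision")
ws.create_branch(cp, "option-a")
ws.create_branch(cp, "option-b")
ws.switch_branch("option-a")
cp2 = ws.create_checkpoint("sub-decision")
ws.create_branch(cp2, "option-a1")
ws.create_branch(cp2, "option-a2")
ws.switch_branch("main")
ws.inject_message({"role": "assistant", "content": "insight from option-a"})
total_ms = (time.perf_counter() - start) * 1000
```

Note the counts match the workflow: two checkpoints and four branches across two levels.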

## Running the Benchmark

### Prerequisites

```bash
# Set OpenAI API key (or Anthropic)
export OPENAI_API_KEY=your_key_here

# Install dependencies
pip install openai anthropic
```

### Run

```bash
python benchmarks/realistic_performance_benchmark.py
```

### Expected Runtime

- **Scenario generation**: ~3-5 seconds per scenario (LLM API calls)
- **Benchmark execution**: ~30-60 seconds (20 trials × 5 sizes)
- **Total**: ~2-3 minutes

This is slower than the static benchmark (~60 seconds) but provides realistic data.

## Output

### Console Output

```
================================================================================
REALISTIC PERFORMANCE BENCHMARK
Using LLM-generated conversation scenarios
================================================================================

Initializing LLM for scenario generation...
✓ Using openai/gpt-4

================================================================================
1. REALISTIC OPERATION OVERHEAD
================================================================================

Testing with 10-message realistic scenario...
Generating scenario for 10 messages...
✓ Generated: Database selection for startup with limited budget
Trial 1/20...
Trial 5/20...
...

Operation 10 msgs 30 msgs 50 msgs 100 msgs 200 msgs
------------------------------------------------------------------------
Checkpoint 0.xx ms 0.xx ms 0.xx ms 0.xx ms 0.xx ms
Branch x.xx ms x.xx ms x.xx ms x.xx ms x.xx ms
Switch x.xx ms x.xx ms x.xx ms x.xx ms x.xx ms
Inject x.xx ms x.xx ms x.xx ms x.xx ms x.xx ms
```

### Files Generated

1. **`benchmark_results/realistic_performance_results.json`**
- Complete data in JSON format
- All trials, statistics, scenarios

2. **`benchmark_results/realistic_paper_summary.txt`**
- Summary for research paper
- Key measurements for 50-message conversations
- Workflow statistics

## Comparison: Static vs Realistic Results

### Expected Differences

**Operation overhead may be slightly higher** because:
- Realistic messages are longer (50-150 chars vs ~20 chars)
- More diverse content (affects hashing, serialization)
- Variable message structure

**Memory footprint may be higher** because:
- Longer message content
- More realistic metadata
- Variable message sizes

**But differences should be small (<20%)** because:
- SDK operations are O(n) in message count, not content length
- Hashing is fast regardless of content
- Branch isolation is structural, not content-dependent

### Why This Matters

Static benchmarks might **underestimate** overhead if:
- Realistic messages are significantly longer
- Content diversity affects performance

Or **overestimate** if:
- Static patterns create worst-case scenarios
- Unrealistic uniformity doesn't represent real usage

**Realistic benchmarks provide ground truth** for publication claims.

## For Your Paper

### Which Results to Use?

**Recommendation**: Use **realistic benchmark results** in your paper because:

1. **More credible**: Reviewers can see scenarios are realistic
2. **Reproducible**: Different scenarios each run, but similar statistics
3. **Conservative**: If realistic overhead is <50ms, the claim is stronger
4. **Transparent**: Shows real-world performance, not cherry-picked test cases

### How to Report

In Section 5.3 (Performance and Scalability):

> We benchmark the implementation using realistic conversation scenarios
> generated by GPT-4. For each test, we prompt the LLM to create
> technically realistic discussions with natural decision points for branching.
> We time only SDK operations, excluding LLM API latency. Results represent
> mean latency across 20 trials.

**Table 1: Operation Overhead (50-message realistic conversation)**
| Operation | Mean | Median | StdDev |
|-----------|------|--------|--------|
| Checkpoint | X.XXms | X.XXms | X.XXms |
| Branch | X.XXms | X.XXms | X.XXms |
| Switch | X.XXms | X.XXms | X.XXms |
| Inject | X.XXms | X.XXms | X.XXms |

> All operations satisfy requirement R4 (<50ms overhead) even with realistic,
> variable-length technical discussions generated by an LLM.

## Troubleshooting

### Error: "No LLM available"

```bash
export OPENAI_API_KEY=your_key_here
# or
export ANTHROPIC_API_KEY=your_key_here
```

### Error: "JSON parse error"

The LLM sometimes returns malformed JSON. The benchmark has fallback scenarios.
If this happens frequently, try:
- Using GPT-4 instead of GPT-3.5 (more reliable JSON)
- Simplifying the scenario generation prompt
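A common mitigation is to strip markdown fences before parsing and fall back to a canned scenario when parsing still fails. This is a hedged sketch; the benchmark's actual fallback logic and fallback content may differ:

```python
# Tolerant parsing of an LLM JSON reply, with a canned fallback scenario.
# The fallback content here is illustrative, not the benchmark's built-in one.
import json

FALLBACK_SCENARIO = {
    "topic": "Database selection (fallback scenario)",
    "initial_messages": [
        {"role": "user", "content": "Which database should we use?"},
        {"role": "assistant", "content": "Let's compare a few options..."},
    ],
    "decision_point_index": 1,
    "branches": [],
    "inject_indices": [],
}

def parse_scenario(reply: str) -> dict:
    text = reply.strip()
    # Models often wrap JSON in ```json ... ``` fences; strip them first.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return FALLBACK_SCENARIO
```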

### Slow Performance

LLM API calls take 2-5 seconds each. To speed up:
- Use fewer message counts (remove 200-message test)
- Reduce trials from 20 to 10
- Use faster model (gpt-3.5-turbo)

### Different Results Each Run

This is expected! Scenarios are randomly generated. Statistics (mean, median)
should be similar across runs (±20%), but individual scenarios differ.

For deterministic results, use static benchmark (`performance_benchmark.py`).

## Technical Details

### Timing Methodology

```python
import time

# NOT timed: generate scenario via the LLM
scenario = self.generate_conversation_scenario(msg_count)

# NOT timed: set up workspace with the initial messages
for msg in scenario["initial_messages"]:
    workspace.add_message(msg)

# TIMED: checkpoint creation
start = time.perf_counter()
cp_id = workspace.create_checkpoint("decision")
checkpoint_time = (time.perf_counter() - start) * 1000  # ms

# TIMED: branch creation
start = time.perf_counter()
workspace.create_branch(cp_id, "branch_name")
branch_time = (time.perf_counter() - start) * 1000  # ms
```

### Statistical Validity

- **20 trials**: Sufficient for stable mean/median (CLT applies)
- **Multiple scenarios**: Different LLM-generated scenarios per message count
- **Outlier handling**: Min/max reported alongside mean/median
- **StdDev**: Reported to show measurement stability

## Future Enhancements

1. **Scenario caching**: Save generated scenarios to avoid re-generation
2. **More LLMs**: Test with Claude, Llama, etc. for scenario generation
3. **Scenario complexity metrics**: Measure branching factor, depth, message length distribution
4. **Token counting**: Measure actual token counts with tiktoken
5. **Parallel trials**: Run trials in parallel for speed

---

## Summary

**Realistic benchmark = More credible publication results**

- ✅ LLM-generated realistic scenarios
- ✅ Natural decision points and branching
- ✅ Careful timing isolation (SDK only, not LLM API)
- ✅ Statistical rigor (20 trials, mean ± stdev)
- ✅ Reproducible (same methodology, different scenarios)

Use these results in your paper to demonstrate that ContextBranch performs well with **realistic**, not just **synthetic**, workloads.