
Add Performance Baselines and Reference Results #7

@antennashop

Description

The project lacks baseline performance metrics and reference results for common hardware configurations. Users have no frame of reference to determine if their results are normal or indicate a problem.

## Problem

### Current Issues

1. **No Reference Data**
   - Users don't know what performance to expect
   - Can't determine whether results are good or bad
   - Difficult to identify performance regressions
   - No way to validate the benchmark setup
2. **No Hardware Benchmarks**
   - No results for common cloud instance types
   - No results for typical developer hardware
   - Missing enterprise storage tier comparisons
   - No network filesystem benchmarks
3. **Difficult Comparisons**
   - Users can't compare their setup to similar systems
   - No guidance on what hardware to choose
   - Missing cost/performance analysis
   - No optimization guidance
4. **Validation Challenges**
   - Can't tell whether the benchmark is running correctly
   - Unusual results are hard to interpret
   - Missing sanity checks

## Proposed Solution

### 1. Create Baseline Results Database

**`docs/baselines/README.md`:**

# Performance Baselines

Reference results for common hardware configurations. Use these to:
- Validate your benchmark setup
- Compare your infrastructure
- Guide hardware selection
- Identify performance issues

## How to Use

1. Find a baseline similar to your hardware
2. Run the same benchmark configuration
3. Compare your results to the baseline
4. Investigate if results differ by >20%
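
The >20% rule of thumb can be checked mechanically. A minimal sketch in Python (the 7,820 entries/s figure is taken from the c5.2xlarge listing baseline; the function name is illustrative):

```python
def deviation_pct(yours: float, baseline: float) -> float:
    """Relative deviation of your result from the baseline, in percent."""
    return abs(yours - baseline) / baseline * 100

# Example: your listing run vs. the c5.2xlarge baseline (7,820 entries/s)
dev = deviation_pct(6100, 7820)
print(f"{dev:.1f}% deviation from baseline")  # → 22.0% deviation from baseline
```

Anything over 20% is worth investigating before comparing further numbers.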

## Baseline Categories

- **Cloud Instances**: AWS, GCP, Azure common instance types
- **Local Development**: Typical developer machines
- **Enterprise Storage**: NAS, SAN, distributed filesystems
- **High-Performance**: GPU servers, NVMe RAID configurations

## Reporting Baselines

To contribute baseline results:
1. Run benchmarks with default configurations
2. Document hardware/environment details
3. Submit as PR with filled template (see below)
4. Include metadata.yaml and summary.yaml files
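
The exact `metadata.yaml` schema isn't pinned down in this issue; one possible shape, with values taken from the c5.2xlarge baseline below (all field names are illustrative, not a fixed schema):

```yaml
# Hypothetical metadata.yaml for a baseline submission --
# field names are illustrative, not a fixed schema.
provider: aws
instance_type: c5.2xlarge
region: us-east-1
os: Ubuntu 22.04 LTS
python: 3.10.12
storage:
  type: ebs-gp3
  size_gb: 1000
  iops: 10000
  throughput_mb_s: 500
date: 2024-01-15
```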

### 2. AWS Baseline Results

**`docs/baselines/aws/c5-2xlarge.md`:**

# AWS EC2 c5.2xlarge Baseline

**Instance Type:** c5.2xlarge
**Specs:**
- vCPU: 8
- Memory: 16 GiB
- Network: Up to 10 Gbps
- Storage: EBS gp3 (1000 GB, 10000 IOPS, 500 MB/s)

**Date:** 2024-01-15
**Region:** us-east-1
**AMI:** Ubuntu 22.04 LTS
**Python:** 3.10.12

---

## Listing Benchmark

**Configuration:** `config/flat_tree.yaml`
```yaml
entries_per_dir: 5000
depth: 1
concurrency: 16
page_size: 2000
```

**Results:**

| Metric | Value |
|--------|-------|
| Entries/sec | 7,820 |
| TTFB | 4.8s |
| P50 latency | 19ms |
| P95 latency | 45ms |
| P99 latency | 92ms |

**Notes:**

- Standard EBS gp3 performance
- Consistent across multiple runs (±3%)
- A warm cache would improve TTFB by ~40%

## Checkpointing Benchmark

**Configuration:** `config/default.yaml`
```yaml
shard_count: 8
shard_size_mb: 16
concurrency: 4
mode: write-read
```

**Results:**

| Metric | Write | Read |
|--------|-------|------|
| Throughput (MB/s) | 285 | 340 |
| P50 latency (s) | 0.42 | 0.35 |
| P95 latency (s) | 0.58 | 0.48 |
| P99 latency (s) | 0.71 | 0.59 |

**Notes:**

- Matches EBS gp3 provisioned throughput (500 MB/s baseline)
- Writes limited by the EBS I/O quota
- Consider io2 for higher throughput needs

## Serving Benchmark

**Configuration:** Default settings
```yaml
steps: 100
gbs: 256
mbs: 8
num_workers: 4
mode: infer
```

**Results:**

| Metric | Value |
|--------|-------|
| Throughput (samples/s) | 842 |
| P95 step latency | 0.31s |

**Cost Estimate:**

- Instance: $0.34/hour
- EBS gp3: $0.08/GB-month
- Estimated $/1M samples: ~$0.11
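
The $/1M samples figure follows directly from the measured throughput and the hourly rate. A quick sanity check (EBS storage cost, billed monthly, is ignored here):

```python
samples_per_sec = 842   # measured serving throughput from the table above
price_per_hour = 0.34   # c5.2xlarge on-demand, us-east-1

# Hours to process one million samples, then instance cost for that time
hours_per_million = 1_000_000 / samples_per_sec / 3600
cost_per_million = hours_per_million * price_per_hour
print(f"~${cost_per_million:.2f} per 1M samples")  # → ~$0.11 per 1M samples
```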

## Recommendations

**Good for:**

- Development and testing
- Small to medium workloads
- Cost-sensitive applications

**Consider upgrading if:**

- You need >500 MB/s sustained I/O (use io2)
- You are CPU-bound (use c5.4xlarge or larger)
- You need a GPU (use g5 or p4 instances)

**Tuning tips:**

- Enable EBS optimization
- Use instance store for temporary data
- Increase gp3 IOPS if needed (up to 16,000)

**`docs/baselines/aws/i3-2xlarge.md`:**
# AWS EC2 i3.2xlarge Baseline (NVMe)

**Instance Type:** i3.2xlarge
**Specs:**
- vCPU: 8
- Memory: 61 GiB
- Network: Up to 10 Gbps
- Storage: 1 × 1,900 GB NVMe SSD

**Date:** 2024-01-15
**Region:** us-east-1
**AMI:** Ubuntu 22.04 LTS
**Python:** 3.10.12

---

## Checkpointing Benchmark

**Configuration:** `config/nvme_stress.yaml`
```yaml
shard_count: 16
shard_size_mb: 64
concurrency: 8
mode: write-read
```

**Results:**

| Metric | Write | Read |
|--------|-------|------|
| Throughput (MB/s) | 1,240 | 1,850 |
| P50 latency (s) | 0.78 | 0.52 |
| P95 latency (s) | 1.12 | 0.68 |
| P99 latency (s) | 1.38 | 0.84 |

**Notes:**

- Excellent NVMe performance
- 4-5× faster than EBS gp3
- Low latency variance
- **Important:** instance store is ephemeral!

## Cost/Performance Analysis

**i3.2xlarge vs c5.2xlarge + gp3:**

| Metric | i3.2xlarge | c5.2xlarge + gp3 | Advantage |
|--------|------------|------------------|-----------|
| Price/hour | $0.624 | $0.42 | c5 cheaper |
| Write MB/s | 1,240 | 285 | i3 4.4× faster |
| Read MB/s | 1,850 | 340 | i3 5.4× faster |
| $ per MB/s of write throughput per hour | $0.0005 | $0.0015 | i3 better value |

**Recommendation:**

- Use i3 for I/O-intensive training
- Use c5 + gp3 for development
- Back up i3 data to S3 (instance store is ephemeral)

### 3. Create Performance Matrix

**`docs/baselines/performance-matrix.md`:**
# Performance Matrix

Quick reference for common configurations.

## Listing Benchmark (5K entries, flat structure)

| Hardware | Entries/s | TTFB | P95 | Notes |
|----------|-----------|------|-----|-------|
| MacBook Pro M1 | 12,400 | 2.1s | 28ms | Local SSD |
| AWS c5.2xlarge (EBS) | 7,820 | 4.8s | 45ms | gp3 1000 IOPS |
| AWS i3.2xlarge (NVMe) | 15,200 | 1.8s | 22ms | Instance store |
| GCP n2-standard-8 (PD) | 6,950 | 5.4s | 52ms | Persistent Disk |
| Azure D8s v3 (Premium) | 8,340 | 4.2s | 41ms | Premium SSD |
| Desktop (SATA SSD) | 9,100 | 3.9s | 38ms | Consumer SSD |
| Desktop (HDD) | 1,240 | 18.2s | 340ms | 7200 RPM |
| NFS (1Gbps) | 2,890 | 9.7s | 180ms | Network mount |
| NFS (10Gbps) | 5,120 | 6.1s | 95ms | Fast network |

## Checkpointing Benchmark (8×16MB shards)

| Hardware | Write MB/s | Read MB/s | Write P95 | Notes |
|----------|------------|-----------|-----------|-------|
| MacBook Pro M1 | 980 | 1,450 | 0.28s | APFS |
| AWS i3.2xlarge | 1,240 | 1,850 | 0.22s | NVMe |
| AWS c5 + EBS gp3 | 285 | 340 | 0.58s | Baseline |
| AWS c5 + EBS io2 | 920 | 1,100 | 0.31s | Provisioned |
| GCP n2 + PD SSD | 310 | 380 | 0.54s | Default |
| Azure + Premium SSD | 340 | 410 | 0.49s | P30 tier |
| Desktop NVMe | 1,100 | 1,680 | 0.24s | PCIe 3.0 |
| Desktop SATA SSD | 420 | 510 | 0.42s | SATA III |
| NetApp NFS | 180 | 220 | 1.2s | Over network |
| S3 (direct) | 45 | 85 | 4.8s | High latency |
| FSx for Lustre | 850 | 1,040 | 0.35s | AWS managed |

## Expected Ranges

### Good Performance
- Listing: >8,000 entries/s, TTFB <5s, P95 <50ms
- Checkpointing: >800 MB/s write, >1,000 MB/s read, P95 <0.5s

### Acceptable Performance
- Listing: 3,000-8,000 entries/s, TTFB 5-10s, P95 50-100ms
- Checkpointing: 300-800 MB/s write, 400-1,000 MB/s read, P95 0.5-1s

### Poor Performance (investigate!)
- Listing: <3,000 entries/s, TTFB >10s, P95 >100ms
- Checkpointing: <300 MB/s write, <400 MB/s read, P95 >1s

## Troubleshooting

**Listing performance much slower than expected?**
- Check filesystem cache state (cold vs warm)
- Verify no other I/O-intensive processes running
- Check if directory is network-mounted
- Consider filesystem type (ext4 vs xfs vs others)

**Checkpointing slower than baseline?**
- Verify storage type (SSD vs HDD)
- Check for I/O throttling (cloud quotas)
- Monitor disk utilization during benchmark
- Check if fsync is enabled (2x slowdown expected)
- Verify concurrency matches CPU cores

**Results highly variable (>20% variance)?**
- May indicate I/O contention
- Check for background processes
- Verify consistent network latency (if networked storage)
- Run multiple iterations and report median
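
Per the last point, reporting the median over several iterations damps outliers, and the spread itself is a useful variance indicator. A small stdlib-only helper (function name and sample numbers are illustrative):

```python
import statistics

def summarize_runs(values):
    """Median and relative spread (max - min, as % of median) across runs."""
    med = statistics.median(values)
    spread_pct = (max(values) - min(values)) / med * 100
    return med, spread_pct

# Entries/s from five listing iterations; a spread well under 20% is fine
runs = [7820, 7650, 8100, 7390, 7910]
med, spread = summarize_runs(runs)
print(f"median={med}, spread={spread:.1f}%")  # → median=7820, spread=9.1%
```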

### 4. Add Validation Tool

**`scripts/validate_results.py`:**

```python
#!/usr/bin/env python
"""
Validate benchmark results against expected ranges.

Usage:
    python scripts/validate_results.py \
        --summary metrics/my-run/listing_summary.yaml \
        --type listing
"""
import argparse
import yaml

EXPECTED_RANGES = {
    "listing": {
        "entries_per_sec": (1000, 20000),
        "ttfb_sec": (1.0, 30.0),
        "p95_call_sec": (0.01, 1.0),
    },
    "checkpointing": {
        "write_avg_throughput_mb_s": (50, 3000),
        "read_avg_throughput_mb_s": (80, 4000),
        "write_p95_sec": (0.1, 10.0),
    },
}

def validate_results(summary, benchmark_type):
    """Validate that results fall within the expected ranges."""
    ranges = EXPECTED_RANGES.get(benchmark_type)
    if not ranges:
        print(f"No expected ranges defined for '{benchmark_type}'")
        return

    issues = []
    for metric, (min_val, max_val) in ranges.items():
        value = summary.get(metric)
        if value is not None and not (min_val <= value <= max_val):
            issues.append(
                f"  WARNING: {metric} = {value} is outside expected "
                f"range [{min_val}, {max_val}]"
            )

    if issues:
        print("⚠️  Validation issues found:")
        for issue in issues:
            print(issue)
        print("\nThis may indicate:")
        print("- Hardware performing outside the typical range")
        print("- A configuration issue")
        print("- Environmental factors (heavy load, throttling)")
        print("- A benchmark setup problem")
    else:
        print("✓ Results within expected ranges")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--summary", required=True)
    parser.add_argument(
        "--type", required=True, choices=["listing", "checkpointing", "serving"]
    )
    args = parser.parse_args()

    with open(args.summary) as f:
        summary = yaml.safe_load(f)

    validate_results(summary, args.type)

if __name__ == "__main__":
    main()
```

### 5. Add Auto-Baseline Detection

```python
def suggest_baseline(hardware_info):
    """Suggest the closest baseline based on detected hardware."""
    # Detect cloud provider, instance type, storage
    # Return the path to the closest baseline file
    pass
