
Add Performance Baselines and Reference Results #7

@antennashop

Description

The project lacks baseline performance metrics and reference results for common hardware configurations. Users have no frame of reference to determine if their results are normal or indicate a problem.

## Problem

### Current Issues

1. **No Reference Data**
   - Users don't know what performance to expect
   - Can't determine whether results are good or bad
   - Difficult to identify performance regressions
   - No way to validate the benchmark setup
2. **No Hardware Benchmarks**
   - No results for common cloud instance types
   - No results for typical developer hardware
   - Missing enterprise storage tier comparisons
   - No network filesystem benchmarks
3. **Difficult Comparisons**
   - Users can't compare their setup to similar systems
   - No guidance on what hardware to choose
   - Missing cost/performance analysis
   - No optimization guidance
4. **Validation Challenges**
   - Can't tell whether the benchmark is running correctly
   - Unusual results are hard to interpret
   - Missing sanity checks

## Proposed Solution

### 1. Create Baseline Results Database

**`docs/baselines/README.md`:**

# Performance Baselines

Reference results for common hardware configurations. Use these to:
- Validate your benchmark setup
- Compare your infrastructure
- Guide hardware selection
- Identify performance issues

## How to Use

1. Find a baseline similar to your hardware
2. Run the same benchmark configuration
3. Compare your results to the baseline
4. Investigate if results differ by >20%
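
The >20% rule of thumb can be checked mechanically. A minimal sketch in Python (the 7,820 entries/s figure is taken from the c5.2xlarge listing baseline; the function name is illustrative):

```python
def deviation_pct(yours: float, baseline: float) -> float:
    """Relative deviation of your result from the baseline, in percent."""
    return abs(yours - baseline) / baseline * 100

# Example: your listing run vs. the c5.2xlarge baseline (7,820 entries/s)
dev = deviation_pct(6100, 7820)
print(f"{dev:.1f}% deviation from baseline")  # → 22.0% deviation from baseline
```

Anything over 20% is worth investigating before comparing further numbers.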

## Baseline Categories

- **Cloud Instances**: AWS, GCP, Azure common instance types
- **Local Development**: Typical developer machines
- **Enterprise Storage**: NAS, SAN, distributed filesystems
- **High-Performance**: GPU servers, NVMe RAID configurations

## Reporting Baselines

To contribute baseline results:
1. Run benchmarks with default configurations
2. Document hardware/environment details
3. Submit as PR with filled template (see below)
4. Include metadata.yaml and summary.yaml files
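
The exact `metadata.yaml` schema isn't pinned down in this issue; one possible shape, with values taken from the c5.2xlarge baseline below (all field names are illustrative, not a fixed schema):

```yaml
# Hypothetical metadata.yaml for a baseline submission --
# field names are illustrative, not a fixed schema.
provider: aws
instance_type: c5.2xlarge
region: us-east-1
os: Ubuntu 22.04 LTS
python: 3.10.12
storage:
  type: ebs-gp3
  size_gb: 1000
  iops: 10000
  throughput_mb_s: 500
date: 2024-01-15
```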

### 2. AWS Baseline Results

**`docs/baselines/aws/c5-2xlarge.md`:**

# AWS EC2 c5.2xlarge Baseline

**Instance Type:** c5.2xlarge
**Specs:**
- vCPU: 8
- Memory: 16 GiB
- Network: Up to 10 Gbps
- Storage: EBS gp3 (1000 GB, 10000 IOPS, 500 MB/s)

**Date:** 2024-01-15
**Region:** us-east-1
**AMI:** Ubuntu 22.04 LTS
**Python:** 3.10.12

---

## Listing Benchmark

**Configuration:** `config/flat_tree.yaml`
```yaml
entries_per_dir: 5000
depth: 1
concurrency: 16
page_size: 2000
```

**Results:**

| Metric | Value |
|--------|-------|
| Entries/sec | 7,820 |
| TTFB | 4.8s |
| P50 latency | 19ms |
| P95 latency | 45ms |
| P99 latency | 92ms |

**Notes:**

- Standard EBS gp3 performance
- Consistent across multiple runs (±3%)
- A warm cache would improve TTFB by ~40%

## Checkpointing Benchmark

**Configuration:** `config/default.yaml`
```yaml
shard_count: 8
shard_size_mb: 16
concurrency: 4
mode: write-read
```

**Results:**

| Metric | Write | Read |
|--------|-------|------|
| Throughput (MB/s) | 285 | 340 |
| P50 latency (s) | 0.42 | 0.35 |
| P95 latency (s) | 0.58 | 0.48 |
| P99 latency (s) | 0.71 | 0.59 |

**Notes:**

- Matches EBS gp3 provisioned throughput (500 MB/s baseline)
- Writes limited by the EBS I/O quota
- Consider io2 for higher throughput needs

## Serving Benchmark

**Configuration:** Default settings
```yaml
steps: 100
gbs: 256
mbs: 8
num_workers: 4
mode: infer
```

**Results:**

| Metric | Value |
|--------|-------|
| Throughput (samples/s) | 842 |
| P95 step latency | 0.31s |

**Cost Estimate:**

- Instance: $0.34/hour
- EBS gp3: $0.08/GB-month
- Estimated $/1M samples: ~$0.11
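
The $/1M samples figure follows directly from the measured throughput and the hourly rate. A quick sanity check (EBS storage cost, billed monthly, is ignored here):

```python
samples_per_sec = 842   # measured serving throughput from the table above
price_per_hour = 0.34   # c5.2xlarge on-demand, us-east-1

# Hours to process one million samples, then instance cost for that time
hours_per_million = 1_000_000 / samples_per_sec / 3600
cost_per_million = hours_per_million * price_per_hour
print(f"~${cost_per_million:.2f} per 1M samples")  # → ~$0.11 per 1M samples
```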

## Recommendations

**Good for:**

- Development and testing
- Small to medium workloads
- Cost-sensitive applications

**Consider upgrading if:**

- You need >500 MB/s sustained I/O (use io2)
- You are CPU-bound (use c5.4xlarge or larger)
- You need a GPU (use g5 or p4 instances)

**Tuning tips:**

- Enable EBS optimization
- Use instance store for temporary data
- Increase gp3 IOPS if needed (up to 16,000)

**`docs/baselines/aws/i3-2xlarge.md`:**
# AWS EC2 i3.2xlarge Baseline (NVMe)

**Instance Type:** i3.2xlarge
**Specs:**
- vCPU: 8
- Memory: 61 GiB
- Network: Up to 10 Gbps
- Storage: 1 × 1,900 GB NVMe SSD

**Date:** 2024-01-15
**Region:** us-east-1
**AMI:** Ubuntu 22.04 LTS
**Python:** 3.10.12

---

## Checkpointing Benchmark

**Configuration:** `config/nvme_stress.yaml`
```yaml
shard_count: 16
shard_size_mb: 64
concurrency: 8
mode: write-read
```

**Results:**

| Metric | Write | Read |
|--------|-------|------|
| Throughput (MB/s) | 1,240 | 1,850 |
| P50 latency (s) | 0.78 | 0.52 |
| P95 latency (s) | 1.12 | 0.68 |
| P99 latency (s) | 1.38 | 0.84 |

**Notes:**

- Excellent NVMe performance
- 4-5× faster than EBS gp3
- Low latency variance
- **Important:** instance store is ephemeral!

## Cost/Performance Analysis

**i3.2xlarge vs c5.2xlarge + gp3:**

| Metric | i3.2xlarge | c5.2xlarge + gp3 | Advantage |
|--------|------------|------------------|-----------|
| Price/hour | $0.624 | $0.42 | c5 cheaper |
| Write MB/s | 1,240 | 285 | i3 4.4× faster |
| Read MB/s | 1,850 | 340 | i3 5.4× faster |
| $ per MB/s of write throughput per hour | $0.0005 | $0.0015 | i3 better value |

**Recommendation:**

- Use i3 for I/O-intensive training
- Use c5 + gp3 for development
- Back up i3 data to S3 (instance store is ephemeral)

### 3. Create Performance Matrix

**`docs/baselines/performance-matrix.md`:**
# Performance Matrix

Quick reference for common configurations.

## Listing Benchmark (5K entries, flat structure)

| Hardware | Entries/s | TTFB | P95 | Notes |
|----------|-----------|------|-----|-------|
| MacBook Pro M1 | 12,400 | 2.1s | 28ms | Local SSD |
| AWS c5.2xlarge (EBS) | 7,820 | 4.8s | 45ms | gp3 1000 IOPS |
| AWS i3.2xlarge (NVMe) | 15,200 | 1.8s | 22ms | Instance store |
| GCP n2-standard-8 (PD) | 6,950 | 5.4s | 52ms | Persistent Disk |
| Azure D8s v3 (Premium) | 8,340 | 4.2s | 41ms | Premium SSD |
| Desktop (SATA SSD) | 9,100 | 3.9s | 38ms | Consumer SSD |
| Desktop (HDD) | 1,240 | 18.2s | 340ms | 7200 RPM |
| NFS (1Gbps) | 2,890 | 9.7s | 180ms | Network mount |
| NFS (10Gbps) | 5,120 | 6.1s | 95ms | Fast network |

## Checkpointing Benchmark (8×16MB shards)

| Hardware | Write MB/s | Read MB/s | Write P95 | Notes |
|----------|------------|-----------|-----------|-------|
| MacBook Pro M1 | 980 | 1,450 | 0.28s | APFS |
| AWS i3.2xlarge | 1,240 | 1,850 | 0.22s | NVMe |
| AWS c5 + EBS gp3 | 285 | 340 | 0.58s | Baseline |
| AWS c5 + EBS io2 | 920 | 1,100 | 0.31s | Provisioned |
| GCP n2 + PD SSD | 310 | 380 | 0.54s | Default |
| Azure + Premium SSD | 340 | 410 | 0.49s | P30 tier |
| Desktop NVMe | 1,100 | 1,680 | 0.24s | PCIe 3.0 |
| Desktop SATA SSD | 420 | 510 | 0.42s | SATA III |
| NetApp NFS | 180 | 220 | 1.2s | Over network |
| S3 (direct) | 45 | 85 | 4.8s | High latency |
| FSx for Lustre | 850 | 1,040 | 0.35s | AWS managed |

## Expected Ranges

### Good Performance
- Listing: >8,000 entries/s, TTFB <5s, P95 <50ms
- Checkpointing: >800 MB/s write, >1,000 MB/s read, P95 <0.5s

### Acceptable Performance
- Listing: 3,000-8,000 entries/s, TTFB 5-10s, P95 50-100ms
- Checkpointing: 300-800 MB/s write, 400-1,000 MB/s read, P95 0.5-1s

### Poor Performance (investigate!)
- Listing: <3,000 entries/s, TTFB >10s, P95 >100ms
- Checkpointing: <300 MB/s write, <400 MB/s read, P95 >1s

## Troubleshooting

**Listing performance much slower than expected?**
- Check filesystem cache state (cold vs warm)
- Verify no other I/O-intensive processes running
- Check if directory is network-mounted
- Consider filesystem type (ext4 vs xfs vs others)

**Checkpointing slower than baseline?**
- Verify storage type (SSD vs HDD)
- Check for I/O throttling (cloud quotas)
- Monitor disk utilization during benchmark
- Check if fsync is enabled (2x slowdown expected)
- Verify concurrency matches CPU cores

**Results highly variable (>20% variance)?**
- May indicate I/O contention
- Check for background processes
- Verify consistent network latency (if networked storage)
- Run multiple iterations and report median
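
Per the last point, reporting the median over several iterations damps outliers, and the spread itself is a useful variance indicator. A small stdlib-only helper (function name and sample numbers are illustrative):

```python
import statistics

def summarize_runs(values):
    """Median and relative spread (max - min, as % of median) across runs."""
    med = statistics.median(values)
    spread_pct = (max(values) - min(values)) / med * 100
    return med, spread_pct

# Entries/s from five listing iterations; a spread well under 20% is fine
runs = [7820, 7650, 8100, 7390, 7910]
med, spread = summarize_runs(runs)
print(f"median={med}, spread={spread:.1f}%")  # → median=7820, spread=9.1%
```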

### 4. Add Validation Tool

**`scripts/validate_results.py`:**

```python
#!/usr/bin/env python
"""
Validate benchmark results against expected ranges.

Usage:
    python scripts/validate_results.py \
        --summary metrics/my-run/listing_summary.yaml \
        --type listing
"""
import argparse
import yaml

EXPECTED_RANGES = {
    "listing": {
        "entries_per_sec": (1000, 20000),
        "ttfb_sec": (1.0, 30.0),
        "p95_call_sec": (0.01, 1.0),
    },
    "checkpointing": {
        "write_avg_throughput_mb_s": (50, 3000),
        "read_avg_throughput_mb_s": (80, 4000),
        "write_p95_sec": (0.1, 10.0),
    },
}

def validate_results(summary, benchmark_type):
    """Validate that results fall within the expected ranges."""
    ranges = EXPECTED_RANGES.get(benchmark_type)
    if not ranges:
        print(f"No expected ranges defined for '{benchmark_type}'")
        return

    issues = []
    for metric, (min_val, max_val) in ranges.items():
        value = summary.get(metric)
        if value is not None and not (min_val <= value <= max_val):
            issues.append(
                f"  WARNING: {metric} = {value} is outside expected "
                f"range [{min_val}, {max_val}]"
            )

    if issues:
        print("⚠️  Validation issues found:")
        for issue in issues:
            print(issue)
        print("\nThis may indicate:")
        print("- Hardware performing outside the typical range")
        print("- A configuration issue")
        print("- Environmental factors (heavy load, throttling)")
        print("- A benchmark setup problem")
    else:
        print("✓ Results within expected ranges")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--summary", required=True)
    parser.add_argument(
        "--type", required=True, choices=["listing", "checkpointing", "serving"]
    )
    args = parser.parse_args()

    with open(args.summary) as f:
        summary = yaml.safe_load(f)

    validate_results(summary, args.type)

if __name__ == "__main__":
    main()
```

### 5. Add Auto-Baseline Detection

```python
def suggest_baseline(hardware_info):
    """Suggest the closest baseline based on detected hardware."""
    # Detect cloud provider, instance type, storage
    # Return the path to the closest baseline file
    pass
