-
Notifications
You must be signed in to change notification settings - Fork 65
Add Performance Baselines and Reference Results #7
Copy link
Copy link
Open
Description
The project lacks baseline performance metrics and reference results for common hardware configurations. Users have no frame of reference to determine if their results are normal or indicate a problem.
Problem
Current Issues
-
No Reference Data:
- Users don't know what performance to expect
- Can't determine if results are good or bad
- Difficult to identify performance regressions
- No way to validate benchmark setup
-
No Hardware Benchmarks:
- No results for common cloud instance types
- No results for typical developer hardware
- Missing enterprise storage tier comparisons
- No network filesystem benchmarks
-
Difficult Comparisons:
- Users can't compare their setup to similar systems
- No guidance on what hardware to choose
- Missing cost/performance analysis
- No optimization guidance
-
Validation Challenges:
- Can't tell if benchmark is running correctly
- Unusual results hard to interpret
- Missing sanity checks
Proposed Solution
1. Create Baseline Results Database
docs/baselines/README.md:
# Performance Baselines
Reference results for common hardware configurations. Use these to:
- Validate your benchmark setup
- Compare your infrastructure
- Guide hardware selection
- Identify performance issues
## How to Use
1. Find a baseline similar to your hardware
2. Run the same benchmark configuration
3. Compare your results to the baseline
4. Investigate if results differ by >20%
## Baseline Categories
- **Cloud Instances**: AWS, GCP, Azure common instance types
- **Local Development**: Typical developer machines
- **Enterprise Storage**: NAS, SAN, distributed filesystems
- **High-Performance**: GPU servers, NVMe RAID configurations
## Reporting Baselines
To contribute baseline results:
1. Run benchmarks with default configurations
2. Document hardware/environment details
3. Submit as PR with filled template (see below)
4. Include metadata.yaml and summary.yaml files2. AWS Baseline Results
docs/baselines/aws/c5-2xlarge.md:
# AWS EC2 c5.2xlarge Baseline
**Instance Type:** c5.2xlarge
**Specs:**
- vCPU: 8
- Memory: 16 GiB
- Network: Up to 10 Gbps
- Storage: EBS gp3 (1000 GB, 10000 IOPS, 500 MB/s)
**Date:** 2024-01-15
**Region:** us-east-1
**AMI:** Ubuntu 22.04 LTS
**Python:** 3.10.12
---
## Listing Benchmark
**Configuration:** `config/flat_tree.yaml`
```yaml
entries_per_dir: 5000
depth: 1
concurrency: 16
page_size: 2000Results:
| Metric | Value |
|---|---|
| Entries/sec | 7,820 |
| TTFB | 4.8s |
| P50 latency | 19ms |
| P95 latency | 45ms |
| P99 latency | 92ms |
Notes:
- Standard EBS gp3 performance
- Consistent across multiple runs (±3%)
- Warm cache would improve TTFB by ~40%
Checkpointing Benchmark
Configuration: config/default.yaml
shard_count: 8
shard_size_mb: 16
concurrency: 4
mode: write-readResults:
| Metric | Write | Read |
|---|---|---|
| Throughput (MB/s) | 285 | 340 |
| P50 latency (s) | 0.42 | 0.35 |
| P95 latency (s) | 0.58 | 0.48 |
| P99 latency (s) | 0.71 | 0.59 |
Notes:
- Matches EBS gp3 provisioned throughput (500 MB/s baseline)
- Writing limited by I/O quota
- Consider io2 for higher throughput needs
Serving Benchmark
Configuration: Default settings
steps: 100
gbs: 256
mbs: 8
num_workers: 4
mode: infer
Results:
| Metric | Value |
|---|---|
| Throughput (samples/s) | 842 |
| P95 step latency | 0.31s |
Cost Estimate:
- Instance: $0.34/hour
- EBS gp3: $0.08/GB-month
- Estimated $/1M samples: ~$0.11
Recommendations
Good for:
- Development and testing
- Small to medium workloads
- Cost-sensitive applications
Consider upgrading if:
- Need >500 MB/s sustained I/O (use io2)
- CPU-bound (use c5.4xlarge or larger)
- Need GPU (use g5 or p4 instances)
Tuning tips:
- Enable EBS optimization
- Use instance store for temporary data
- Increase gp3 IOPS if needed (up to 16,000)
**`docs/baselines/aws/i3-2xlarge.md`:**
```markdown
# AWS EC2 i3.2xlarge Baseline (NVMe)
**Instance Type:** i3.2xlarge
**Specs:**
- vCPU: 8
- Memory: 61 GiB
- Network: Up to 10 Gbps
- Storage: 1 × 1,900 GB NVMe SSD
**Date:** 2024-01-15
**Region:** us-east-1
**AMI:** Ubuntu 22.04 LTS
**Python:** 3.10.12
---
## Checkpointing Benchmark
**Configuration:** `config/nvme_stress.yaml`
```yaml
shard_count: 16
shard_size_mb: 64
concurrency: 8
mode: write-read
Results:
| Metric | Write | Read |
|---|---|---|
| Throughput (MB/s) | 1,240 | 1,850 |
| P50 latency (s) | 0.78 | 0.52 |
| P95 latency (s) | 1.12 | 0.68 |
| P99 latency (s) | 1.38 | 0.84 |
Notes:
- Excellent NVMe performance
- 4-5x faster than EBS gp3
- Low latency variance
- Important: Instance store is ephemeral!
Cost/Performance Analysis
i3.2xlarge vs c5.2xlarge + gp3:
| Metric | i3.2xlarge | c5.2xlarge+gp3 | Advantage |
|---|---|---|---|
| Price/hour | $0.624 | $0.42 | c5 cheaper |
| Write MB/s | 1,240 | 285 | i3 4.4x faster |
| Read MB/s | 1,850 | 340 | i3 5.4x faster |
| $/GB throughput/hr | $0.0005 | $0.0015 | i3 better value |
Recommendation:
- Use i3 for I/O-intensive training
- Use c5+gp3 for development
- Backup i3 data to S3 (ephemeral storage)
### 3. Create Performance Matrix
**`docs/baselines/performance-matrix.md`:**
```markdown
# Performance Matrix
Quick reference for common configurations.
## Listing Benchmark (5K entries, flat structure)
| Hardware | Entries/s | TTFB | P95 | Notes |
|----------|-----------|------|-----|-------|
| MacBook Pro M1 | 12,400 | 2.1s | 28ms | Local SSD |
| AWS c5.2xlarge (EBS) | 7,820 | 4.8s | 45ms | gp3 1000 IOPS |
| AWS i3.2xlarge (NVMe) | 15,200 | 1.8s | 22ms | Instance store |
| GCP n2-standard-8 (PD) | 6,950 | 5.4s | 52ms | Persistent Disk |
| Azure D8s v3 (Premium) | 8,340 | 4.2s | 41ms | Premium SSD |
| Desktop (SATA SSD) | 9,100 | 3.9s | 38ms | Consumer SSD |
| Desktop (HDD) | 1,240 | 18.2s | 340ms | 7200 RPM |
| NFS (1Gbps) | 2,890 | 9.7s | 180ms | Network mount |
| NFS (10Gbps) | 5,120 | 6.1s | 95ms | Fast network |
## Checkpointing Benchmark (8×16MB shards)
| Hardware | Write MB/s | Read MB/s | Write P95 | Notes |
|----------|------------|-----------|-----------|-------|
| MacBook Pro M1 | 980 | 1,450 | 0.28s | APFS |
| AWS i3.2xlarge | 1,240 | 1,850 | 0.22s | NVMe |
| AWS c5 + EBS gp3 | 285 | 340 | 0.58s | Baseline |
| AWS c5 + EBS io2 | 920 | 1,100 | 0.31s | Provisioned |
| GCP n2 + PD SSD | 310 | 380 | 0.54s | Default |
| Azure + Premium SSD | 340 | 410 | 0.49s | P30 tier |
| Desktop NVMe | 1,100 | 1,680 | 0.24s | PCIe 3.0 |
| Desktop SATA SSD | 420 | 510 | 0.42s | SATA III |
| NetApp NFS | 180 | 220 | 1.2s | Over network |
| S3 (direct) | 45 | 85 | 4.8s | High latency |
| FSx for Lustre | 850 | 1,040 | 0.35s | AWS managed |
## Expected Ranges
### Good Performance
- Listing: >8,000 entries/s, TTFB <5s, P95 <50ms
- Checkpointing: >800 MB/s write, >1,000 MB/s read, P95 <0.5s
### Acceptable Performance
- Listing: 3,000-8,000 entries/s, TTFB 5-10s, P95 50-100ms
- Checkpointing: 300-800 MB/s write, 400-1,000 MB/s read, P95 0.5-1s
### Poor Performance (investigate!)
- Listing: <3,000 entries/s, TTFB >10s, P95 >100ms
- Checkpointing: <300 MB/s write, <400 MB/s read, P95 >1s
## Troubleshooting
**Listing performance much slower than expected?**
- Check filesystem cache state (cold vs warm)
- Verify no other I/O-intensive processes running
- Check if directory is network-mounted
- Consider filesystem type (ext4 vs xfs vs others)
**Checkpointing slower than baseline?**
- Verify storage type (SSD vs HDD)
- Check for I/O throttling (cloud quotas)
- Monitor disk utilization during benchmark
- Check if fsync is enabled (2x slowdown expected)
- Verify concurrency matches CPU cores
**Results highly variable (>20% variance)?**
- May indicate I/O contention
- Check for background processes
- Verify consistent network latency (if networked storage)
- Run multiple iterations and report median
4. Add Validation Tool
scripts/validate_results.py:
#!/usr/bin/env python
"""
Validate benchmark results against expected ranges.
Usage:
python scripts/validate_results.py \
--summary metrics/my-run/listing_summary.yaml \
--baseline docs/baselines/aws/c5-2xlarge.md
"""
import argparse
import yaml
EXPECTED_RANGES = {
"listing": {
"entries_per_sec": (1000, 20000),
"ttfb_sec": (1.0, 30.0),
"p95_call_sec": (0.01, 1.0),
},
"checkpointing": {
"write_avg_throughput_mb_s": (50, 3000),
"read_avg_throughput_mb_s": (80, 4000),
"write_p95_sec": (0.1, 10.0),
},
}
def validate_results(summary, benchmark_type):
"""Validate results are within expected ranges."""
ranges = EXPECTED_RANGES.get(benchmark_type, {})
issues = []
for metric, (min_val, max_val) in ranges.items():
value = summary.get(metric)
if value is not None:
if value < min_val or value > max_val:
issues.append(
f" WARNING: {metric} = {value} is outside expected "
f"range [{min_val}, {max_val}]"
)
if issues:
print("⚠️ Validation issues found:")
for issue in issues:
print(issue)
print("\nThis may indicate:")
print("- Hardware performing outside typical range")
print("- Configuration issue")
print("- Environmental factors (heavy load, throttling)")
print("- Benchmark setup problem")
else:
print("✓ Results within expected ranges")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--summary", required=True)
parser.add_argument("--type", choices=["listing", "checkpointing", "serving"])
args = parser.parse_args()
with open(args.summary) as f:
summary = yaml.safe_load(f)
validate_results(summary, args.type)
if __name__ == "__main__":
main()5. Add Auto-Baseline Detection
def suggest_baseline(hardware_info):
"""Suggest closest baseline based on detected hardware."""
# Detect cloud provider, instance type, storage
# Return path to closest baseline file
passReactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels