Description
Create the skill covering benchmark methodology, frameworks (ASV, pytest-benchmark), cloud-specific benchmarking, and results analysis for the zarr-chunk-optimization plugin.
Directory: plugins/zarr-chunk-optimization/skills/cloud-storage-benchmarking/
Research Reference
Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md
Files to Create
1. SKILL.md (300+ lines)
Frontmatter:
```yaml
name: cloud-storage-benchmarking
description: |
  Use this skill when the user asks to "benchmark zarr performance", "profile chunk read times",
  "set up ASV for zarr", "compare chunk configurations", "measure cloud storage throughput",
  "use pytest-benchmark with zarr", "analyze benchmark results", or needs guidance on Zarr I/O
  benchmarking methodology, Airspeed Velocity (ASV), pytest-benchmark, cloud storage performance
  testing, or Dask dashboard monitoring.
```

Content must include:
- Quick Reference: Benchmark Setup Checklist
- Define test matrix (chunk configs x access patterns x backends)
- Set statistical requirements (min 5 runs + 3 warm-up)
- Configure metrics collection (time, memory, throughput, requests)
- Control variables (same instance type, region, data)
- Choose framework (ASV for tracking, pytest-benchmark for comparison)
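The first checklist item can be sketched as a full factorial enumeration; the config, pattern, and backend names below are illustrative placeholders, with the real values expected to come from benchmark-config.yaml:

```python
from itertools import product

# Illustrative axes; real names come from benchmark-config.yaml.
chunk_configs = ["temporal_optimized", "spatial_optimized", "balanced"]
access_patterns = ["timeseries", "spatial_map", "regional_timeseries"]
backends = ["local", "s3"]

# Full factorial test matrix: each combination is one benchmark case.
test_matrix = list(product(chunk_configs, access_patterns, backends))
print(len(test_matrix))  # 3 x 3 x 2 = 18 cases
```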
- Benchmark Metrics Taxonomy:
- Wall time (seconds) — total elapsed time
- Peak memory (GB) — via `tracemalloc` or `memory_profiler`
- Throughput (MB/s) — `bytes_read / wall_time`
- Bytes transferred — actual network I/O to cloud store
- Compression ratio — `uncompressed_size / compressed_size`
- Request count — HTTP requests to cloud store (for per-request overhead analysis)
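Two of these metrics are simple derived quantities; a minimal sketch of how the skill could define them (function names are illustrative):

```python
def throughput_mb_s(bytes_read: int, wall_time_s: float) -> float:
    """Throughput in MB/s, defined as bytes_read / wall_time."""
    return bytes_read / wall_time_s / 1_000_000

def compression_ratio(uncompressed_bytes: int, compressed_bytes: int) -> float:
    """Ratio greater than 1 means the codec is saving space."""
    return uncompressed_bytes / compressed_bytes
```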
- ASV (Airspeed Velocity) Coverage:
- `asv.conf.json` configuration structure
- Benchmark class patterns:

```python
class TimeSuite:
    params = [(100, 1000, 10000)]
    param_names = ['chunk_size']

    def setup(self, chunk_size):
        self.arr = zarr.open('test.zarr')

    def time_read_timeseries(self, chunk_size):
        self.arr[0, :, :]

    def mem_read_timeseries(self, chunk_size):
        return self.arr[0, :, :]

    def track_throughput(self, chunk_size):
        return bytes_read / elapsed
```

- Running: `asv run`, `asv publish`, `asv preview`
- CI integration for regression detection
- pytest-benchmark Coverage:
- Fixture-based benchmarking:

```python
def test_read_timeseries(benchmark, zarr_store):
    result = benchmark(zarr_store.__getitem__, (slice(None), 0, 0))
```

- `@pytest.mark.parametrize` for chunk config sweeps
- `--benchmark-compare` for cross-run comparison
- `--benchmark-json` for programmatic analysis
- Cloud-Specific Considerations:
- Network variability: always use warm-up runs (3+ before measurement)
- Cloud caching: first read may be slower; measure both cold and warm
- Concurrency tuning: `zarr.config.set({'async.concurrency': N})`
- Instance-to-storage colocation: run benchmarks in the same region as the data
- Shared tenancy effects: run benchmarks at consistent times
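The warm-up and repetition requirements above can be sketched as a small timing harness; this is an illustrative helper, not the template's actual API, and `read_fn` stands in for any Zarr read:

```python
import statistics
import time

def timed_runs(read_fn, runs=5, warmup=3):
    """Run `warmup` untimed reads first (to absorb cloud caching and
    connection ramp-up), then time `runs` reads and report the median,
    which resists network spikes better than the mean."""
    for _ in range(warmup):
        read_fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        read_fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```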
- Dask Dashboard Monitoring:
- White space in task stream = inefficient small chunks
- Excessive red (communication) = too much coordination overhead
- Orange bars (memory) = approaching memory limits
- Gray bars (memory) = disk spillage, chunks too large
- Results Interpretation Guidance:
- How to identify the optimal chunk config from benchmark data
- When differences are statistically significant
- How to handle outliers from network variability
- When to re-run benchmarks
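One common way to handle network-induced outliers is a median-absolute-deviation filter; this is a sketch of one possible approach, not prescribed by the skill spec:

```python
import statistics

def drop_network_outliers(timings, k=3.0):
    """Discard runs more than k median-absolute-deviations from the
    median; such spikes usually reflect network variability rather
    than the chunk configuration under test."""
    med = statistics.median(timings)
    mad = statistics.median(abs(t - med) for t in timings)
    if mad == 0:
        return list(timings)
    return [t for t in timings if abs(t - med) <= k * mad]
```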
2. assets/benchmark-template.py
Complete, runnable benchmark script template that:
- Creates a test Zarr array on a configurable storage backend (local, S3, GCS, Azure)
- Tests multiple chunk configurations (parameterized via config file)
- Tests multiple access patterns:
- Time-series extraction: `arr[:, lat_idx, lon_idx]`
- Spatial map extraction: `arr[time_idx, :, :]`
- Mixed: `arr[time_slice, lat_slice, lon_slice]`
- Measures wall time, peak memory (via tracemalloc), and throughput
- Uses `timeit` with a configurable number of runs and warm-up
- Outputs results as CSV and JSON
- Includes cloud backend configuration via environment variables
- Is well-commented and production-ready
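A minimal sketch of the measurement core such a template could wrap around each read (a real script would add the Zarr array and backend setup); note that `tracemalloc` only sees allocations made through Python's allocator, so buffers allocated directly by C extensions may be undercounted:

```python
import time
import tracemalloc

def measure_read(read_fn):
    """Time one read and capture its peak Python-level memory."""
    tracemalloc.start()
    start = time.perf_counter()
    read_fn()
    wall_time = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"wall_time_s": wall_time, "peak_mem_bytes": peak}
```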
3. assets/benchmark-config.yaml
Template YAML configuration:
```yaml
dataset:
  shape: [3650, 721, 1440]
  dtype: float32
  fill_value: NaN

chunk_configs:
  temporal_optimized:
    chunks: [3650, 10, 10]
    description: "Optimal for time-series at point locations"
  spatial_optimized:
    chunks: [1, 721, 1440]
    description: "Optimal for spatial map extraction"
  balanced:
    chunks: [30, 90, 180]
    description: "Balanced for mixed access patterns"
  custom:
    chunks: [365, 72, 144]
    description: "User-defined configuration"

access_patterns:
  - name: timeseries
    description: "Read full time-series at a single location"
    selection: {time: ":", lat: 0, lon: 0}
  - name: spatial_map
    description: "Read spatial map at a single timestep"
    selection: {time: 0, lat: ":", lon: ":"}
  - name: regional_timeseries
    description: "Read time-series for a small region"
    selection: {time: ":", lat: "0:10", lon: "0:10"}

storage:
  backend: s3        # s3, gcs, azure, local
  bucket: ""         # Set via env var or here
  prefix: benchmark-data/
  credentials: env   # env, profile, anonymous

benchmark:
  runs: 10
  warmup: 3
  concurrency_sweep: [10, 32, 64, 128]
  output_format: [csv, json]
  output_dir: ./benchmark-results/
```

4. assets/results-analysis.py
Python script for analyzing benchmark results:
- Reads benchmark CSV/JSON output from benchmark-template.py
- Generates comparison plots using matplotlib (with optional hvplot):
- Bar chart: throughput by chunk config per access pattern
- Heatmap: access pattern x chunk config matrix
- Line plot: throughput vs concurrency level
- Memory profile: peak memory by chunk config
- Calculates speedup ratios between configurations
- Identifies optimal configuration per access pattern
- Generates markdown summary report
- Must be runnable as a standalone script
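The speedup and optimal-config calculations reduce to simple ratios; a sketch of how the analysis script might compute them for one access pattern (function names are illustrative):

```python
def speedup_ratios(throughputs_mb_s):
    """Speedup of each chunk config relative to the slowest one, for a
    single access pattern. Input maps config name -> throughput (MB/s)."""
    baseline = min(throughputs_mb_s.values())
    return {cfg: mb_s / baseline for cfg, mb_s in throughputs_mb_s.items()}

def best_config(throughputs_mb_s):
    """Config with the highest throughput for this access pattern."""
    return max(throughputs_mb_s, key=throughputs_mb_s.get)
```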
5. references/PATTERNS.md (6+ patterns)
- ASV Benchmark Class for Zarr — complete class with setup/teardown, params, time_/mem_/track_* methods
- pytest-benchmark Parameterized Chunk Comparison — parameterized test comparing N chunk configs
- Cloud Storage Warm-up Protocol — run N warm-up reads before measurement
- Concurrency Sweep Benchmark — test multiple `async.concurrency` values
- Memory Profiling with tracemalloc — measuring peak memory during reads
- Results Visualization with matplotlib/hvplot — standard plots for benchmark analysis
Each pattern: description, when to use, complete code example, expected output.
6. references/EXAMPLES.md (4+ examples)
- Benchmarking 3 chunk configs on S3 with ASV — full setup, execution, results
- pytest-benchmark comparing compression codecs x chunk sizes — factorial design
- Concurrency tuning sweep on GCS — finding optimal `async.concurrency`
- Full optimization workflow: benchmark → analyze → rechunk → validate → benchmark again
Each example: problem statement, setup, code, results, interpretation.
7. references/COMMON_ISSUES.md (6+ issues)
- Inconsistent results due to cloud caching → use warm-up runs, measure cold + warm
- Network variability masking differences → increase run count, use median not mean
- OOM during benchmarks → reduce test array size proportionally, monitor memory
- Benchmarks too slow → reduce dataset size while maintaining proportional chunks
- Dask scheduler overhead dominating results → benchmark with and without Dask
- Comparing results across different instances → normalize to throughput (MB/s), document instance type
Each issue: symptoms, cause, solution with code, prevention.
Acceptance Criteria
- SKILL.md is 300+ lines with complete benchmarking methodology
- benchmark-template.py is a runnable script with cloud backend support (S3/GCS/Azure/local)
- benchmark-config.yaml is a complete, well-documented configuration template
- results-analysis.py generates comparison plots and markdown report
- PATTERNS.md covers 6+ benchmarking patterns with complete code
- EXAMPLES.md has 4+ complete benchmark examples
- COMMON_ISSUES.md covers 6+ benchmarking pitfalls with solutions
- Follows the skill pattern from existing plugins
Dependencies
- Depends on feat: Create zarr-chunk-optimization plugin scaffold #61 (plugin scaffold)