
feat: Create cloud-storage-benchmarking skill #65

@cdcore09

Description


Create the skill covering benchmark methodology, frameworks (ASV, pytest-benchmark), cloud-specific benchmarking, and results analysis for the zarr-chunk-optimization plugin.

Directory: plugins/zarr-chunk-optimization/skills/cloud-storage-benchmarking/

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Files to Create

1. SKILL.md (300+ lines)

Frontmatter:

name: cloud-storage-benchmarking
description: |
  Use this skill when the user asks to "benchmark zarr performance", "profile chunk read times",
  "set up ASV for zarr", "compare chunk configurations", "measure cloud storage throughput",
  "use pytest-benchmark with zarr", "analyze benchmark results", or needs guidance on Zarr I/O
  benchmarking methodology, Airspeed Velocity (ASV), pytest-benchmark, cloud storage performance
  testing, or Dask dashboard monitoring.

Content must include:

  • Quick Reference: Benchmark Setup Checklist

    • Define test matrix (chunk configs x access patterns x backends)
    • Set statistical requirements (min 5 runs + 3 warm-up)
    • Configure metrics collection (time, memory, throughput, requests)
    • Control variables (same instance type, region, data)
    • Choose framework (ASV for tracking, pytest-benchmark for comparison)
  • Benchmark Metrics Taxonomy:

    • Wall time (seconds) — total elapsed time
    • Peak memory (GB) — via tracemalloc or memory_profiler
    • Throughput (MB/s) — bytes_read / wall_time
    • Bytes transferred — actual network I/O to cloud store
    • Compression ratio — uncompressed_size / compressed_size
    • Request count — HTTP requests to cloud store (for per-request overhead analysis)
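The derived metrics follow directly from the raw measurements; a minimal sketch with hypothetical values:

```python
# Derived metrics from the taxonomy above (raw values are illustrative)
bytes_read = 512 * 10**6            # network bytes transferred (B)
wall_time = 4.0                     # elapsed seconds for the read
uncompressed_size = 2_048 * 10**6   # logical array size (B)
compressed_size = 512 * 10**6       # on-disk size (B)

throughput_mb_s = (bytes_read / 10**6) / wall_time       # 128.0 MB/s
compression_ratio = uncompressed_size / compressed_size  # 4.0
```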
  • ASV (Airspeed Velocity) Coverage:

    • asv.conf.json configuration structure
    • Benchmark class patterns:
      import time
      import zarr

      class TimeSuite:
          params = [[100, 1000, 10000]]
          param_names = ['chunk_size']

          def setup(self, chunk_size):
              # open an array pre-written with the given chunk size
              self.arr = zarr.open('test.zarr')

          def time_read_timeseries(self, chunk_size):
              self.arr[0, :, :]

          def mem_read_timeseries(self, chunk_size):
              return self.arr[0, :, :]

          def track_throughput(self, chunk_size):
              # track_* methods return a numeric value that ASV records over time
              start = time.perf_counter()
              data = self.arr[0, :, :]
              return data.nbytes / (time.perf_counter() - start)
    • Running: asv run, asv publish, asv preview
    • CI integration for regression detection
  • pytest-benchmark Coverage:

    • Fixture-based benchmarking:
      def test_read_timeseries(benchmark, zarr_store):
          result = benchmark(zarr_store.__getitem__, (slice(None), 0, 0))
    • @pytest.mark.parametrize for chunk config sweeps
    • --benchmark-compare for cross-run comparison
    • --benchmark-json for programmatic analysis
  • Cloud-Specific Considerations:

    • Network variability: always use warm-up runs (3+ before measurement)
    • Cloud caching: first read may be slower; measure both cold and warm
    • Concurrency tuning: zarr.config.set({'async.concurrency': N})
    • Instance-to-storage colocation: run benchmarks in same region as data
    • Shared tenancy effects: run benchmarks at consistent times
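The warm-up protocol above can be wrapped in a small helper (a sketch; names are illustrative):

```python
import statistics
import time

def benchmark_read(read_fn, runs=5, warmup=3):
    """Time read_fn over several runs, discarding warm-up reads
    that prime cloud-side and client-side caches."""
    for _ in range(warmup):
        read_fn()  # results intentionally discarded
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        read_fn()
        samples.append(time.perf_counter() - t0)
    # median is robust to network-latency outliers
    return statistics.median(samples)
```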
  • Dask Dashboard Monitoring:

    • White space in task stream = inefficient small chunks
    • Excessive red (communication) = too much coordination overhead
    • Orange bars (memory) = approaching memory limits
    • Gray bars (memory) = disk spillage, chunks too large
  • Results Interpretation Guidance:

    • How to identify the optimal chunk config from benchmark data
    • When differences are statistically significant
    • How to handle outliers from network variability
    • When to re-run benchmarks
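One way to make these judgments concrete is a median-plus-MAD summary (a sketch, assuming per-run timings are available as a list):

```python
import statistics

def robust_summary(samples):
    """Median and MAD-based outliers, for noisy cloud timings."""
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    # flag runs more than 3 MADs from the median as outliers
    outliers = [s for s in samples if abs(s - med) > 3 * mad] if mad else []
    return med, outliers
```

Two configurations differ meaningfully only when their sample spreads barely overlap; with few runs, prefer re-running the benchmark over trusting a small gap in medians.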

2. assets/benchmark-template.py

Complete, runnable benchmark script template that:

  • Creates a test Zarr array on a configurable storage backend (local, S3, GCS, Azure)
  • Tests multiple chunk configurations (parameterized via config file)
  • Tests multiple access patterns:
    • Time-series extraction: arr[:, lat_idx, lon_idx]
    • Spatial map extraction: arr[time_idx, :, :]
    • Mixed: arr[time_slice, lat_slice, lon_slice]
  • Measures wall time, peak memory (via tracemalloc), and throughput
  • Uses timeit with configurable number of runs and warm-up
  • Outputs results as CSV and JSON
  • Includes cloud backend configuration via environment variables
  • Is well-commented and production-ready
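The peak-memory measurement mentioned above might be isolated as a helper like this (a sketch, not the final template):

```python
import tracemalloc

def peak_memory_mb(read_fn):
    """Peak Python-heap allocation (MB) during a single read,
    measured with the stdlib tracemalloc module."""
    tracemalloc.start()
    read_fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6
```

Note that tracemalloc only sees allocations made through the Python allocator; memory held by C extensions that allocate outside it is not counted.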

3. assets/benchmark-config.yaml

Template YAML configuration:

dataset:
  shape: [3650, 721, 1440]
  dtype: float32
  fill_value: NaN

chunk_configs:
  temporal_optimized:
    chunks: [3650, 10, 10]
    description: "Optimal for time-series at point locations"
  spatial_optimized:
    chunks: [1, 721, 1440]
    description: "Optimal for spatial map extraction"
  balanced:
    chunks: [30, 90, 180]
    description: "Balanced for mixed access patterns"
  custom:
    chunks: [365, 72, 144]
    description: "User-defined configuration"

access_patterns:
  - name: timeseries
    description: "Read full time-series at a single location"
    selection: {time: ":", lat: 0, lon: 0}
  - name: spatial_map
    description: "Read spatial map at a single timestep"
    selection: {time: 0, lat: ":", lon: ":"}
  - name: regional_timeseries
    description: "Read time-series for a small region"
    selection: {time: ":", lat: "0:10", lon: "0:10"}

storage:
  backend: s3        # s3, gcs, azure, local
  bucket: ""         # Set via env var or here
  prefix: benchmark-data/
  credentials: env   # env, profile, anonymous

benchmark:
  runs: 10
  warmup: 3
  concurrency_sweep: [10, 32, 64, 128]
  output_format: [csv, json]
  output_dir: ./benchmark-results/

4. assets/results-analysis.py

Python script for analyzing benchmark results:

  • Reads benchmark CSV/JSON output from benchmark-template.py
  • Generates comparison plots using matplotlib (with optional hvplot):
    • Bar chart: throughput by chunk config per access pattern
    • Heatmap: access pattern x chunk config matrix
    • Line plot: throughput vs concurrency level
    • Memory profile: peak memory by chunk config
  • Calculates speedup ratios between configurations
  • Identifies optimal configuration per access pattern
  • Generates markdown summary report
  • Must be runnable as a standalone script
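The speedup-ratio calculation could take a shape like this (the nested-dict input format is an assumption, not a fixed interface):

```python
def speedup_table(throughput):
    """Speedup of each chunk config relative to the slowest,
    computed per access pattern.

    throughput maps pattern -> {config: MB/s} (illustrative shape).
    """
    table = {}
    for pattern, by_config in throughput.items():
        baseline = min(by_config.values())  # slowest config is the baseline
        table[pattern] = {cfg: round(v / baseline, 2)
                          for cfg, v in by_config.items()}
    return table
```

e.g. `speedup_table({"timeseries": {"temporal": 120.0, "spatial": 4.0}})` returns `{"timeseries": {"temporal": 30.0, "spatial": 1.0}}`.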

5. references/PATTERNS.md (6+ patterns)

  1. ASV Benchmark Class for Zarr — complete class with setup/teardown, params, time_/mem_/track_* methods
  2. pytest-benchmark Parameterized Chunk Comparison — parameterized test comparing N chunk configs
  3. Cloud Storage Warm-up Protocol — run N warm-up reads before measurement
  4. Concurrency Sweep Benchmark — test multiple async.concurrency values
  5. Memory Profiling with tracemalloc — measuring peak memory during reads
  6. Results Visualization with matplotlib/hvplot — standard plots for benchmark analysis

Each pattern: description, when to use, complete code example, expected output.

6. references/EXAMPLES.md (4+ examples)

  1. Benchmarking 3 chunk configs on S3 with ASV — full setup, execution, results
  2. pytest-benchmark comparing compression codecs x chunk sizes — factorial design
  3. Concurrency tuning sweep on GCS — finding optimal async.concurrency
  4. Full optimization workflow: benchmark → analyze → rechunk → validate → benchmark again

Each example: problem statement, setup, code, results, interpretation.

7. references/COMMON_ISSUES.md (6+ issues)

  1. Inconsistent results due to cloud caching → use warm-up runs, measure cold + warm
  2. Network variability masking differences → increase run count, use median not mean
  3. OOM during benchmarks → reduce test array size proportionally, monitor memory
  4. Benchmarks too slow → reduce dataset size while maintaining proportional chunks
  5. Dask scheduler overhead dominating results → benchmark with and without Dask
  6. Comparing results across different instances → normalize to throughput (MB/s), document instance type

Each issue: symptoms, cause, solution with code, prevention.

Acceptance Criteria

  • SKILL.md is 300+ lines with complete benchmarking methodology
  • benchmark-template.py is a runnable script with cloud backend support (S3/GCS/Azure/local)
  • benchmark-config.yaml is a complete, well-documented configuration template
  • results-analysis.py generates comparison plots and markdown report
  • PATTERNS.md covers 6+ benchmarking patterns with complete code
  • EXAMPLES.md has 4+ complete benchmark examples
  • COMMON_ISSUES.md covers 6+ benchmarking pitfalls with solutions
  • Follows the skill pattern from existing plugins

Dependencies


Labels

enhancement (New feature or request), skill (Skill creation or modification)
