
feat: Create cloud-storage-benchmarking skill #65

@cdcore09

Description


Create the skill covering benchmark methodology, frameworks (ASV, pytest-benchmark), cloud-specific benchmarking, and results analysis for the zarr-chunk-optimization plugin.

Directory: plugins/zarr-chunk-optimization/skills/cloud-storage-benchmarking/

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Files to Create

1. SKILL.md (300+ lines)

Frontmatter:

name: cloud-storage-benchmarking
description: |
  Use this skill when the user asks to "benchmark zarr performance", "profile chunk read times",
  "set up ASV for zarr", "compare chunk configurations", "measure cloud storage throughput",
  "use pytest-benchmark with zarr", "analyze benchmark results", or needs guidance on Zarr I/O
  benchmarking methodology, Airspeed Velocity (ASV), pytest-benchmark, cloud storage performance
  testing, or Dask dashboard monitoring.

Content must include:

  • Quick Reference: Benchmark Setup Checklist

    • Define test matrix (chunk configs x access patterns x backends)
    • Set statistical requirements (min 5 runs + 3 warm-up)
    • Configure metrics collection (time, memory, throughput, requests)
    • Control variables (same instance type, region, data)
    • Choose framework (ASV for tracking, pytest-benchmark for comparison)
  • Benchmark Metrics Taxonomy:

    • Wall time (seconds) — total elapsed time
    • Peak memory (GB) — via tracemalloc or memory_profiler
    • Throughput (MB/s) — bytes_read / wall_time
    • Bytes transferred — actual network I/O to cloud store
    • Compression ratio — uncompressed_size / compressed_size
    • Request count — HTTP requests to cloud store (for per-request overhead analysis)
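The derived metrics follow directly from the raw measurements; a minimal sketch with hypothetical values:

```python
# Derived metrics from the taxonomy above (raw values are illustrative)
bytes_read = 512 * 10**6            # network bytes transferred (B)
wall_time = 4.0                     # elapsed seconds for the read
uncompressed_size = 2_048 * 10**6   # logical array size (B)
compressed_size = 512 * 10**6       # on-disk size (B)

throughput_mb_s = (bytes_read / 10**6) / wall_time       # 128.0 MB/s
compression_ratio = uncompressed_size / compressed_size  # 4.0
```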
  • ASV (Airspeed Velocity) Coverage:

    • asv.conf.json configuration structure
    • Benchmark class patterns:
      import time
      import zarr

      class TimeSuite:
          params = [[100, 1000, 10000]]
          param_names = ['chunk_size']

          def setup(self, chunk_size):
              # open an array pre-written with the given chunk size
              self.arr = zarr.open('test.zarr')

          def time_read_timeseries(self, chunk_size):
              self.arr[0, :, :]

          def mem_read_timeseries(self, chunk_size):
              return self.arr[0, :, :]

          def track_throughput(self, chunk_size):
              # track_* methods return a numeric value that ASV records over time
              start = time.perf_counter()
              data = self.arr[0, :, :]
              return data.nbytes / (time.perf_counter() - start)
    • Running: asv run, asv publish, asv preview
    • CI integration for regression detection
  • pytest-benchmark Coverage:

    • Fixture-based benchmarking:
      def test_read_timeseries(benchmark, zarr_store):
          result = benchmark(zarr_store.__getitem__, (slice(None), 0, 0))
    • @pytest.mark.parametrize for chunk config sweeps
    • --benchmark-compare for cross-run comparison
    • --benchmark-json for programmatic analysis
  • Cloud-Specific Considerations:

    • Network variability: always use warm-up runs (3+ before measurement)
    • Cloud caching: first read may be slower; measure both cold and warm
    • Concurrency tuning: zarr.config.set({'async.concurrency': N})
    • Instance-to-storage colocation: run benchmarks in same region as data
    • Shared tenancy effects: run benchmarks at consistent times
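The warm-up protocol above can be wrapped in a small helper (a sketch; names are illustrative):

```python
import statistics
import time

def benchmark_read(read_fn, runs=5, warmup=3):
    """Time read_fn over several runs, discarding warm-up reads
    that prime cloud-side and client-side caches."""
    for _ in range(warmup):
        read_fn()  # results intentionally discarded
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        read_fn()
        samples.append(time.perf_counter() - t0)
    # median is robust to network-latency outliers
    return statistics.median(samples)
```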
  • Dask Dashboard Monitoring:

    • White space in task stream = inefficient small chunks
    • Excessive red (communication) = too much coordination overhead
    • Orange bars (memory) = approaching memory limits
    • Gray bars (memory) = disk spillage, chunks too large
  • Results Interpretation Guidance:

    • How to identify the optimal chunk config from benchmark data
    • When differences are statistically significant
    • How to handle outliers from network variability
    • When to re-run benchmarks
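One way to make these judgments concrete is a median-plus-MAD summary (a sketch, assuming per-run timings are available as a list):

```python
import statistics

def robust_summary(samples):
    """Median and MAD-based outliers, for noisy cloud timings."""
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    # flag runs more than 3 MADs from the median as outliers
    outliers = [s for s in samples if abs(s - med) > 3 * mad] if mad else []
    return med, outliers
```

Two configurations differ meaningfully only when their sample spreads barely overlap; with few runs, prefer re-running the benchmark over trusting a small gap in medians.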

2. assets/benchmark-template.py

Complete, runnable benchmark script template that:

  • Creates a test Zarr array on a configurable storage backend (local, S3, GCS, Azure)
  • Tests multiple chunk configurations (parameterized via config file)
  • Tests multiple access patterns:
    • Time-series extraction: arr[:, lat_idx, lon_idx]
    • Spatial map extraction: arr[time_idx, :, :]
    • Mixed: arr[time_slice, lat_slice, lon_slice]
  • Measures wall time, peak memory (via tracemalloc), and throughput
  • Uses timeit with configurable number of runs and warm-up
  • Outputs results as CSV and JSON
  • Includes cloud backend configuration via environment variables
  • Is well-commented and production-ready
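The peak-memory measurement mentioned above might be isolated as a helper like this (a sketch, not the final template):

```python
import tracemalloc

def peak_memory_mb(read_fn):
    """Peak Python-heap allocation (MB) during a single read,
    measured with the stdlib tracemalloc module."""
    tracemalloc.start()
    read_fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6
```

Note that tracemalloc only sees allocations made through the Python allocator; memory held by C extensions that allocate outside it is not counted.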

3. assets/benchmark-config.yaml

Template YAML configuration:

dataset:
  shape: [3650, 721, 1440]
  dtype: float32
  fill_value: NaN

chunk_configs:
  temporal_optimized:
    chunks: [3650, 10, 10]
    description: "Optimal for time-series at point locations"
  spatial_optimized:
    chunks: [1, 721, 1440]
    description: "Optimal for spatial map extraction"
  balanced:
    chunks: [30, 90, 180]
    description: "Balanced for mixed access patterns"
  custom:
    chunks: [365, 72, 144]
    description: "User-defined configuration"

access_patterns:
  - name: timeseries
    description: "Read full time-series at a single location"
    selection: {time: ":", lat: 0, lon: 0}
  - name: spatial_map
    description: "Read spatial map at a single timestep"
    selection: {time: 0, lat: ":", lon: ":"}
  - name: regional_timeseries
    description: "Read time-series for a small region"
    selection: {time: ":", lat: "0:10", lon: "0:10"}

storage:
  backend: s3        # s3, gcs, azure, local
  bucket: ""         # Set via env var or here
  prefix: benchmark-data/
  credentials: env   # env, profile, anonymous

benchmark:
  runs: 10
  warmup: 3
  concurrency_sweep: [10, 32, 64, 128]
  output_format: [csv, json]
  output_dir: ./benchmark-results/

4. assets/results-analysis.py

Python script for analyzing benchmark results:

  • Reads benchmark CSV/JSON output from benchmark-template.py
  • Generates comparison plots using matplotlib (with optional hvplot):
    • Bar chart: throughput by chunk config per access pattern
    • Heatmap: access pattern x chunk config matrix
    • Line plot: throughput vs concurrency level
    • Memory profile: peak memory by chunk config
  • Calculates speedup ratios between configurations
  • Identifies optimal configuration per access pattern
  • Generates markdown summary report
  • Must be runnable as a standalone script
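The speedup-ratio calculation could take a shape like this (the nested-dict input format is an assumption, not a fixed interface):

```python
def speedup_table(throughput):
    """Speedup of each chunk config relative to the slowest,
    computed per access pattern.

    throughput maps pattern -> {config: MB/s} (illustrative shape).
    """
    table = {}
    for pattern, by_config in throughput.items():
        baseline = min(by_config.values())  # slowest config is the baseline
        table[pattern] = {cfg: round(v / baseline, 2)
                          for cfg, v in by_config.items()}
    return table
```

e.g. `speedup_table({"timeseries": {"temporal": 120.0, "spatial": 4.0}})` returns `{"timeseries": {"temporal": 30.0, "spatial": 1.0}}`.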

5. references/PATTERNS.md (6+ patterns)

  1. ASV Benchmark Class for Zarr — complete class with setup/teardown, params, time_/mem_/track_* methods
  2. pytest-benchmark Parameterized Chunk Comparison — parameterized test comparing N chunk configs
  3. Cloud Storage Warm-up Protocol — run N warm-up reads before measurement
  4. Concurrency Sweep Benchmark — test multiple async.concurrency values
  5. Memory Profiling with tracemalloc — measuring peak memory during reads
  6. Results Visualization with matplotlib/hvplot — standard plots for benchmark analysis

Each pattern: description, when to use, complete code example, expected output.

6. references/EXAMPLES.md (4+ examples)

  1. Benchmarking 3 chunk configs on S3 with ASV — full setup, execution, results
  2. pytest-benchmark comparing compression codecs x chunk sizes — factorial design
  3. Concurrency tuning sweep on GCS — finding optimal async.concurrency
  4. Full optimization workflow: benchmark → analyze → rechunk → validate → benchmark again

Each example: problem statement, setup, code, results, interpretation.

7. references/COMMON_ISSUES.md (6+ issues)

  1. Inconsistent results due to cloud caching → use warm-up runs, measure cold + warm
  2. Network variability masking differences → increase run count, use median not mean
  3. OOM during benchmarks → reduce test array size proportionally, monitor memory
  4. Benchmarks too slow → reduce dataset size while maintaining proportional chunks
  5. Dask scheduler overhead dominating results → benchmark with and without Dask
  6. Comparing results across different instances → normalize to throughput (MB/s), document instance type

Each issue: symptoms, cause, solution with code, prevention.

Acceptance Criteria

  • SKILL.md is 300+ lines with complete benchmarking methodology
  • benchmark-template.py is a runnable script with cloud backend support (S3/GCS/Azure/local)
  • benchmark-config.yaml is a complete, well-documented configuration template
  • results-analysis.py generates comparison plots and markdown report
  • PATTERNS.md covers 6+ benchmarking patterns with complete code
  • EXAMPLES.md has 4+ complete benchmark examples
  • COMMON_ISSUES.md covers 6+ benchmarking pitfalls with solutions
  • Follows the skill pattern from existing plugins

Dependencies


Labels

enhancement (New feature or request), skill (Skill creation or modification)
