Description
Create the skill covering all compression codecs, filters, and codec configuration for the zarr-data-format plugin.
Directory: plugins/zarr-data-format/skills/compression-codecs/
Research Reference
Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md
Files to Create
1. SKILL.md (300+ lines)
Frontmatter:
name: compression-codecs
description: |
Use this skill when the user asks to "configure zarr compression", "choose a compressor",
"compare compression codecs", "use blosc with zarr", "add filters to zarr", "optimize
compression ratio", "speed up decompression", or needs guidance on numcodecs, Blosc
configuration, Zstd, LZ4, Gzip, pre-compression filters, codec pipelines, or the
trade-offs between compression speed and ratio.
Content must include:
- Quick Reference: Codec Selection Guide

| Codec | Speed | Ratio | Best For |
|---|---|---|---|
| Blosc+LZ4 | Fastest | Good | Real-time access, large arrays |
| Blosc+Zstd | Fast | Excellent | Default choice, balanced |
| Zstd | Fast | Excellent | Zarr v3 default |
| Gzip | Slow | Good | Universal compatibility |
| LZ4 | Fastest | Moderate | Maximum decompression speed |
| LZMA | Slowest | Best | Archival, maximum compression |
| BZ2 | Slow | Very Good | Good compression, moderate speed |
| Zlib | Moderate | Good | Standard compatibility |

- All Compression Codecs with Configuration:
Blosc (meta-compressor, v2 default):
    from numcodecs import Blosc
    compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
    # Internal algorithms: blosclz, lz4, lz4hc, snappy, zlib, zstd
    # Compression levels: 0-9 (0=no compression, 9=maximum)
    # Shuffle modes: NOSHUFFLE, SHUFFLE (byte), BITSHUFFLE
Blosc Thread Safety (Critical):
    from numcodecs import blosc
    blosc.set_nthreads(2)      # Limit internal threads
    blosc.use_threads = False  # REQUIRED for multi-process safety
Warning: If `blosc.use_threads` is not set to `False` in multi-process environments, silent data corruption can occur.
Standalone Codecs:
    from numcodecs import Zstd, LZMA, Zlib, GZip, BZ2, LZ4
    compressor = Zstd(level=3)
    compressor = LZ4(acceleration=1)
    compressor = GZip(level=6)
Disabling Compression:
    import zarr
    z = zarr.create(..., compressor=None)         # zarr-python 2.x
    z = zarr.create_array(..., compressors=None)  # zarr-python 3.x
Overriding Default Globally (v2):
    import zarr.storage
    from numcodecs import Zstd
    zarr.storage.default_compressor = Zstd(level=1)
- Pre-compression Filters:
    from numcodecs import Delta, Quantize, FixedScaleOffset, PackBits, Categorize
    # Delta encoding: stores differences between consecutive values
    filters = [Delta(dtype='i4')]
    # Quantize: reduces floating-point precision
    filters = [Quantize(digits=3, dtype='f8')]
    # FixedScaleOffset: linear scaling
    filters = [FixedScaleOffset(offset=273.15, scale=100, dtype='f4')]
    # PackBits: boolean packing
    filters = [PackBits()]
    # Categorize: categorical encoding
    filters = [Categorize(labels=['cat', 'dog', 'bird'], dtype=object)]
Filters are applied before compression and reversed after decompression.
- Integrity Checks:
    from numcodecs import CRC32, Adler32
    # Add as post-compression check
- Codec Selection Decision Tree:
  - Need fastest decompression? → Blosc+LZ4 or LZ4
  - Need best compression ratio? → LZMA (slow) or Blosc+Zstd (fast)
  - Need universal compatibility? → Gzip or Zlib
  - Default choice? → Zstd (v3) or Blosc+Zstd (v2)
  - Data is integer with small deltas? → Delta filter + any compressor
  - Data is float with limited precision needed? → Quantize filter + Zstd
- Zarr v3 Codec Pipeline:
  - v3 uses a codec pipeline instead of a single compressor
  - Configuration via the `codecs` parameter
  - numcodecs adapted for the v3 entry point system
- Benchmark Results (from research):
  - Blosc+LZ4: 155.5x compression ratio on test data
  - Zstd: 7.8x on same data
  - Gzip: 5.3x
  - Uncompressed: 16,000 bytes → compressed: 1,359 bytes = 11.8x (Zstd)
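For reference, the ratio in the last bullet is simply uncompressed size divided by compressed size. A short worked check, using the byte counts quoted above (the function names are illustrative):

```python
def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Ratio as reported in the benchmark list: original size / encoded size."""
    return uncompressed_bytes / compressed_bytes


def space_saving(uncompressed_bytes, compressed_bytes):
    """Fraction of the original size that compression removed."""
    return 1 - compressed_bytes / uncompressed_bytes


ratio = compression_ratio(16_000, 1_359)
saving = space_saving(16_000, 1_359)
print(f"{ratio:.1f}x, {saving:.1%} smaller")  # 11.8x, 91.5% smaller
```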
- numcodecs Registry:
  - All codecs are registered in numcodecs
  - `numcodecs.registry.codec_registry` for available codecs
  - Custom codecs can be registered
2. assets/codec-comparison.py
Python script that benchmarks different codecs on sample data:
- Creates a test array (1000x1000 float64, realistic data with correlation structure)
- Tests each codec: Blosc+LZ4, Blosc+Zstd, Zstd, Gzip, LZ4, Zlib, BZ2
- Tests each Blosc shuffle mode: NOSHUFFLE, SHUFFLE, BITSHUFFLE
- Tests with and without Delta filter
- Reports for each configuration:
  - Compression ratio
  - Compression time
  - Decompression time
  - Compressed size
- Outputs results as a formatted table
- Must be runnable with only `zarr`, `numpy`, `numcodecs` as dependencies
3. references/PATTERNS.md (6+ patterns)
- Default Codec Configuration — using Blosc+Zstd for general use
- Maximum Speed Configuration — Blosc+LZ4 with NOSHUFFLE for real-time access
- Maximum Compression — LZMA or Zstd level 9 for archival
- Filter + Compressor Pipeline — Delta + Blosc for integer data with small deltas
- Per-Variable Codec Configuration — different codecs for different variables in a dataset
- Blosc Thread Safety in Multi-Process: the `blosc.use_threads = False` pattern
4. references/EXAMPLES.md (4+ examples)
- Comparing Codecs on Climate Data — float32 temperature data, ratio vs speed trade-off
- Using Delta Filter for Monotonic Data — timestamps or sorted values
- Configuring Per-Variable Compression in xarray: `encoding` dict for `to_zarr()`
- Optimizing Compression for Integer Count Data: satellite pixel counts, PackBits for masks
5. references/COMMON_ISSUES.md (5+ issues)
- Silent data corruption with Blosc in multi-process → must set `blosc.use_threads = False`
- Poor compression ratio → wrong codec for data type, try different shuffle modes
- Slow decompression → using LZMA/BZ2 for frequently-read data; switch to LZ4/Zstd
- Codec not found error → numcodecs not installed or codec not registered
- v2 vs v3 codec configuration confusion → `compressor` (v2) vs `compressors`/`codecs` (v3)
Acceptance Criteria
- SKILL.md is 300+ lines covering all codec families
- All Blosc internal algorithms documented (blosclz, lz4, lz4hc, snappy, zlib, zstd)
- All shuffle modes documented (NOSHUFFLE, SHUFFLE, BITSHUFFLE)
- All pre-compression filters documented with examples
- Blosc thread safety in multi-process documented (critical safety issue)
- codec-comparison.py is a runnable script producing meaningful results
- Follows the skill pattern from existing plugins
Dependencies
- Depends on feat: Create zarr-data-format plugin scaffold #67 (plugin scaffold)