feat: Create compression-codecs skill #71

@cdcore09

Description

Create the skill covering all compression codecs, filters, and codec configuration for the zarr-data-format plugin.

Directory: plugins/zarr-data-format/skills/compression-codecs/

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Files to Create

1. SKILL.md (300+ lines)

Frontmatter:

name: compression-codecs
description: |
  Use this skill when the user asks to "configure zarr compression", "choose a compressor",
  "compare compression codecs", "use blosc with zarr", "add filters to zarr", "optimize
  compression ratio", "speed up decompression", or needs guidance on numcodecs, Blosc
  configuration, Zstd, LZ4, Gzip, pre-compression filters, codec pipelines, or the
  trade-offs between compression speed and ratio.

Content must include:

  • Quick Reference: Codec Selection Guide

    Codec        Speed      Ratio       Best For
    Blosc+LZ4    Fastest    Good        Real-time access, large arrays
    Blosc+Zstd   Fast       Excellent   Default choice, balanced
    Zstd         Fast       Excellent   Zarr v3 default
    Gzip         Slow       Good        Universal compatibility
    LZ4          Fastest    Moderate    Maximum decompression speed
    LZMA         Slowest    Best        Archival, maximum compression
    BZ2          Slow       Very Good   Good compression, moderate speed
    Zlib         Moderate   Good        Standard compatibility
  • All Compression Codecs with Configuration:

    Blosc (meta-compressor, v2 default):

    from numcodecs import Blosc
    compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
    # Internal algorithms: blosclz, lz4, lz4hc, snappy, zlib, zstd
    # (availability depends on the build; check numcodecs.blosc.list_compressors())
    # Compression levels: 0-9 (0=no compression, 9=maximum)
    # Shuffle modes: NOSHUFFLE, SHUFFLE (byte), BITSHUFFLE

    Blosc Thread Safety (Critical):

    from numcodecs import blosc
    blosc.set_nthreads(2)        # Limit internal threads
    blosc.use_threads = False     # REQUIRED for multi-process safety

    Warning: If blosc.use_threads is not set to False in multi-process environments, silent data corruption can occur.

    Standalone Codecs:

    from numcodecs import Zstd, LZMA, Zlib, GZip, BZ2, LZ4
    compressor = Zstd(level=3)
    compressor = LZ4(acceleration=1)
    compressor = GZip(level=6)

    Disabling Compression:

    z = zarr.create(..., compressor=None)         # v2 (zarr-python 2.x API)
    z = zarr.create_array(..., compressors=None)  # v3 (zarr-python 3.x API)

    Overriding Default Globally (v2):

    import zarr.storage
    from numcodecs import Zstd
    zarr.storage.default_compressor = Zstd(level=1)
  • Pre-compression Filters:

    from numcodecs import Delta, Quantize, FixedScaleOffset, PackBits, Categorize
    
    # Delta encoding — stores differences between consecutive values
    filters = [Delta(dtype='i4')]
    
    # Quantize — reduces floating-point precision
    filters = [Quantize(digits=3, dtype='f8')]
    
    # FixedScaleOffset — linear scaling
    filters = [FixedScaleOffset(offset=273.15, scale=100, dtype='f4')]
    
    # PackBits — boolean packing
    filters = [PackBits()]
    
    # Categorize — categorical encoding
    filters = [Categorize(labels=['cat', 'dog', 'bird'], dtype=object)]

    Filters are applied before compression and reversed after decompression.

  • Integrity Checks:

    from numcodecs import CRC32, Adler32
    # Add as post-compression check
  • Codec Selection Decision Tree:

    • Need fastest decompression? → Blosc+LZ4 or LZ4
    • Need best compression ratio? → LZMA (slow) or Blosc+Zstd (fast)
    • Need universal compatibility? → Gzip or Zlib
    • Default choice? → Zstd (v3) or Blosc+Zstd (v2)
    • Data is integer with small deltas? → Delta filter + any compressor
    • Data is float with limited precision needed? → Quantize filter + Zstd
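
The decision tree above could be expressed as a small helper for documentation or tests. `suggest_codec` and its parameters are hypothetical, and it returns names only; it is a sketch of the tree, not a tuned recommendation engine.

```python
def suggest_codec(priority="default", zarr_format=3,
                  small_integer_deltas=False, limited_float_precision=False):
    """Return the filter and compressor names suggested by the decision tree."""
    # Filter choice depends only on the data characteristics.
    if small_integer_deltas:
        filters = ["Delta"]
    elif limited_float_precision:
        filters = ["Quantize"]
    else:
        filters = []

    # Compressor choice depends on the stated priority.
    if priority == "speed":
        compressor = "Blosc+LZ4"
    elif priority == "ratio":
        compressor = "LZMA"
    elif priority == "compatibility":
        compressor = "Gzip"
    elif limited_float_precision:
        compressor = "Zstd"
    else:
        compressor = "Zstd" if zarr_format == 3 else "Blosc+Zstd"

    return {"filters": filters, "compressor": compressor}


print(suggest_codec(priority="speed"))
print(suggest_codec(zarr_format=2, small_integer_deltas=True))
```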
  • Zarr v3 Codec Pipeline:

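    • v3 uses a codec pipeline instead of a single compressor: array-to-array codecs (filters), one array-to-bytes codec (serializer), then bytes-to-bytes codecs (compressors)
    • Configuration via the codecs parameter
    • numcodecs codecs are adapted to v3 through the entry point system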
  • Benchmark Results (from research):

    • Blosc+LZ4: 155.5x compression ratio on test data
    • Zstd: 7.8x on same data
    • Gzip: 5.3x
    • Uncompressed: 16,000 bytes → compressed: 1,359 bytes = 11.8x (Zstd)
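
For reference, the ratio in the last bullet is uncompressed size divided by compressed size. The sketch below reproduces that arithmetic and then measures a ratio on a synthetic buffer, using the standard-library `zlib` as a stand-in codec; `compression_ratio` is a hypothetical helper and the buffer is not the research data.

```python
import zlib


def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Ratio = uncompressed size / compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return uncompressed_bytes / compressed_bytes


# The figure quoted above: 16,000 bytes down to 1,359 bytes.
print(f"{compression_ratio(16_000, 1_359):.1f}x")   # 11.8x

# Measuring a ratio on a synthetic, highly repetitive buffer.
raw = bytes(range(256)) * 64
packed = zlib.compress(raw, 6)
print(f"{compression_ratio(len(raw), len(packed)):.1f}x")
```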
  • numcodecs Registry:

    • All codecs are registered in numcodecs
    • numcodecs.registry.codec_registry for available codecs
    • Custom codecs can be registered

2. assets/codec-comparison.py

Python script that benchmarks different codecs on sample data:

  • Creates a test array (1000x1000 float64, realistic data with correlation structure)
  • Tests each codec: Blosc+LZ4, Blosc+Zstd, Zstd, Gzip, LZ4, Zlib, BZ2
  • Tests each Blosc shuffle mode: NOSHUFFLE, SHUFFLE, BITSHUFFLE
  • Tests with and without Delta filter
  • Reports for each configuration:
    • Compression ratio
    • Compression time
    • Decompression time
    • Compressed size
  • Outputs results as formatted table
  • Must be runnable with only zarr, numpy, numcodecs as dependencies

3. references/PATTERNS.md (6+ patterns)

  1. Default Codec Configuration — using Blosc+Zstd for general use
  2. Maximum Speed Configuration — Blosc+LZ4 with NOSHUFFLE for real-time access
  3. Maximum Compression — LZMA or Zstd level 9 for archival
  4. Filter + Compressor Pipeline — Delta + Blosc for integer data with small deltas
  5. Per-Variable Codec Configuration — different codecs for different variables in a dataset
  6. Blosc Thread Safety in Multi-Process: blosc.use_threads = False pattern

4. references/EXAMPLES.md (4+ examples)

  1. Comparing Codecs on Climate Data — float32 temperature data, ratio vs speed trade-off
  2. Using Delta Filter for Monotonic Data — timestamps or sorted values
  3. Configuring Per-Variable Compression in xarray: encoding dict for to_zarr()
  4. Optimizing Compression for Integer Count Data — satellite pixel counts, PackBits for masks

5. references/COMMON_ISSUES.md (5+ issues)

  1. Silent data corruption with Blosc in multi-process → must set blosc.use_threads = False
  2. Poor compression ratio → wrong codec for data type, try different shuffle modes
  3. Slow decompression → using LZMA/BZ2 for frequently-read data; switch to LZ4/Zstd
  4. Codec not found error → numcodecs not installed or codec not registered
  5. v2 vs v3 codec configuration confusion → compressor (v2) vs compressors/codecs (v3)

Acceptance Criteria

  • SKILL.md is 300+ lines covering all codec families
  • All Blosc internal algorithms documented (blosclz, lz4, lz4hc, snappy, zlib, zstd)
  • All shuffle modes documented (NOSHUFFLE, SHUFFLE, BITSHUFFLE)
  • All pre-compression filters documented with examples
  • Blosc thread safety in multi-process documented (critical safety issue)
  • codec-comparison.py is a runnable script producing meaningful results
  • Follows the skill pattern from existing plugins

Dependencies

Labels

enhancement (New feature or request), skill (Skill creation or modification)
