|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +**flox** is a Python library providing fast GroupBy reduction operations for `dask.array`. It implements parallel-friendly GroupBy reductions using the MapReduce paradigm and integrates with xarray for labeled multidimensional arrays. |
| 8 | + |
| 9 | +## Development Commands |
| 10 | + |
| 11 | +### Environment Setup |
| 12 | + |
| 13 | +```bash |
| 14 | +# Create and activate development environment |
| 15 | +mamba env create -f ci/environment.yml |
| 16 | +conda activate flox-tests |
| 17 | +python -m pip install --no-deps -e . |
| 18 | +``` |
| 19 | + |
| 20 | +### Testing |
| 21 | + |
| 22 | +```bash |
| 23 | +# Run full test suite (as used in CI) |
| 24 | +pytest --durations=20 --durations-min=0.5 -n auto --cov=./ --cov-report=xml --hypothesis-profile ci |
| 25 | + |
| 26 | +# Run tests without coverage |
| 27 | +pytest -n auto |
| 28 | + |
| 29 | +# Run single test file |
| 30 | +pytest tests/test_core.py |
| 31 | + |
| 32 | +# Run specific test |
| 33 | +pytest tests/test_core.py::test_function_name |
| 34 | +``` |
| 35 | + |
| 36 | +### Code Quality |
| 37 | + |
| 38 | +```bash |
| 39 | +# Run all pre-commit hooks |
| 40 | +pre-commit run --all-files |
| 41 | + |
| 42 | +# Format code with ruff |
| 43 | +ruff format . |
| 44 | + |
| 45 | +# Lint and fix with ruff |
| 46 | +ruff check --fix . |
| 47 | + |
| 48 | +# Type checking |
| 49 | +mypy flox/ |
| 50 | + |
| 51 | +# Spell checking |
| 52 | +codespell |
| 53 | +``` |
| 54 | + |
| 55 | +### Benchmarking |
| 56 | + |
| 57 | +```bash |
| 58 | +# Performance benchmarking (from asv_bench/ directory) |
| 59 | +cd asv_bench |
| 60 | +asv run |
| 61 | +asv publish |
| 62 | +asv preview |
| 63 | +``` |
| 64 | + |
| 65 | +## CI Configuration |
| 66 | + |
| 67 | +### GitHub Workflows (`.github/workflows/`) |
| 68 | + |
| 69 | +- **`ci.yaml`** - Main CI pipeline with test matrix across Python versions (3.11, 3.13) and operating systems (Ubuntu, Windows) |
| 70 | +- **`ci-additional.yaml`** - Additional CI jobs including doctests and mypy type checking |
| 71 | +- **`upstream-dev-ci.yaml`** - Tests against development versions of upstream dependencies |
| 72 | +- **`pypi.yaml`** - PyPI publishing workflow |
| 73 | +- **`testpypi-release.yaml`** - Test PyPI release workflow |
| 74 | +- **`benchmarks.yml`** - Performance benchmarking workflow |
| 75 | + |
| 76 | +### Environment Files (`ci/`) |
| 77 | + |
| 78 | +- **`environment.yml`** - Main test environment with all dependencies |
| 79 | +- **`minimal-requirements.yml`** - Minimal requirements testing (pandas==1.5, numpy==1.22, etc.) |
| 80 | +- **`no-dask.yml`** - Testing without dask dependency |
| 81 | +- **`no-numba.yml`** - Testing without numba dependency |
| 82 | +- **`no-xarray.yml`** - Testing without xarray dependency |
| 83 | +- **`env-numpy1.yml`** - Testing with the `numpy<2` constraint |
| 84 | +- **`docs.yml`** - Documentation building environment |
| 85 | +- **`upstream-dev-env.yml`** - Development versions of dependencies |
| 86 | +- **`benchmark.yml`** - Benchmarking environment |
| 87 | + |
| 88 | +### ReadTheDocs Configuration |
| 89 | + |
| 90 | +- **`.readthedocs.yml`** - ReadTheDocs configuration using `ci/docs.yml` environment |
| 91 | + |
| 92 | +## Code Architecture |
| 93 | + |
| 94 | +### Core Modules (`flox/`) |
| 95 | + |
| 96 | +- **`core.py`** - Main reduction logic, central orchestrator of groupby operations |
| 97 | +- **`aggregations.py`** - Defines the `Aggregation` class and built-in aggregation operations |
| 98 | +- **`xarray.py`** - Primary integration with xarray, provides `xarray_reduce()` API |
| 99 | +- **`dask_array_ops.py`** - Dask-specific array operations and optimizations |
| 100 | + |
| 101 | +### Aggregation Backends (`flox/aggregate_*.py`) |
| 102 | + |
| 103 | +- **`aggregate_flox.py`** - Native flox implementation |
| 104 | +- **`aggregate_npg.py`** - numpy-groupies backend |
| 105 | +- **`aggregate_numbagg.py`** - numbagg backend for JIT-compiled operations |
| 106 | +- **`aggregate_sparse.py`** - Support for sparse arrays |
| 107 | + |
| 108 | +### Utilities |
| 109 | + |
| 110 | +- **`cache.py`** - Caching mechanisms for performance |
| 111 | +- **`visualize.py`** - Tools for visualizing groupby operations |
| 112 | +- **`lib.py`** - General utility functions |
| 113 | +- **`xrutils.py`** & **`xrdtypes.py`** - xarray-specific utilities and types |
| 114 | + |
| 115 | +### Main APIs |
| 116 | + |
| 117 | +- `flox.groupby_reduce()` - Pure dask array interface |
| 118 | +- `flox.xarray.xarray_reduce()` - Pure xarray interface |
| 119 | + |
| 120 | +## Key Design Patterns |
| 121 | + |
| 122 | +**Engine Selection**: The library supports multiple computation backends ("flox", "numpy", "numbagg") that can be chosen based on data characteristics and performance requirements. |
| 123 | + |
| 124 | +**MapReduce Strategy**: Implements groupby reductions in two stages: a blockwise reduction of each chunk, followed by a tree reduction that combines the per-chunk intermediates. This avoids the expensive sort/shuffle step when computing in parallel. |
| 125 | + |
| 126 | +**Chunking Intelligence**: Automatically rechunks data to optimize groupby operations, particularly important for the current `auto-blockwise-rechunk` branch. |
| 127 | + |
| 128 | +**Integration Testing**: Extensive testing against xarray's groupby functionality to ensure compatibility with the broader scientific Python ecosystem. |
| 129 | + |
| 130 | +## Testing Configuration |
| 131 | + |
| 132 | +- **Framework**: pytest with coverage, parallel execution (pytest-xdist), and property-based testing (hypothesis) |
| 133 | +- **Coverage Target**: 95% |
| 134 | +- **Test Environments**: Multiple conda environments test optional dependencies (no-dask, no-numba, no-xarray) |
| 135 | +- **CI Matrices**: Tests across Python 3.11 and 3.13, Ubuntu/Windows, multiple dependency configurations |
| 136 | + |
| 137 | +## Dependencies |
| 138 | + |
| 139 | +**Core**: pandas>=1.5, numpy>=1.22, numpy_groupies>=0.9.19, scipy>=1.9, toolz, packaging>=21.3 |
| 140 | + |
| 141 | +**Optional**: cachey, dask, numba, numbagg, xarray (install with `pip install "flox[all]"`; the quotes keep shells such as zsh from expanding the brackets) |
| 142 | + |
| 143 | +## Development Notes |
| 144 | + |
| 145 | +- Uses `setuptools_scm` for automatic versioning from git tags |
| 146 | +- Heavy emphasis on performance with ASV benchmarking infrastructure |
| 147 | +- Type hints throughout with mypy checking |
| 148 | +- Pre-commit hooks enforce code quality (ruff, prettier, codespell) |
| 149 | +- Integration testing with xarray upstream development branch |
| 150 | +- **Python Support**: Minimum version 3.11 (updated from 3.10) |
| 151 | +- **Git Worktrees**: `worktrees/` directory is ignored for development workflows |
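
The MapReduce strategy described under "Key Design Patterns" (a blockwise reduction followed by a tree reduction) can be illustrated with a small sketch. This is not flox's implementation: it uses plain Python lists instead of `dask.array`, handles only a sum, and every name in it (`blockwise_sum`, `combine`, `tree_reduce`, `chunked_groupby_sum`) is hypothetical and chosen for illustration.

```python
from collections import defaultdict
from functools import reduce


def blockwise_sum(values, labels):
    """Map stage: reduce one chunk to per-group partial sums."""
    partial = defaultdict(float)
    for value, label in zip(values, labels):
        partial[label] += value
    return dict(partial)


def combine(left, right):
    """Merge two sets of partial sums, group by group."""
    merged = dict(left)
    for label, subtotal in right.items():
        merged[label] = merged.get(label, 0.0) + subtotal
    return merged


def tree_reduce(partials):
    """Combine intermediates pairwise until a single result remains."""
    if not partials:
        return {}
    while len(partials) > 1:
        # reduce() on a one-element slice returns that element unchanged.
        partials = [
            reduce(combine, partials[i : i + 2])
            for i in range(0, len(partials), 2)
        ]
    return partials[0]


def chunked_groupby_sum(values, labels, chunk_size):
    """Split the input into chunks, reduce each one, then tree-combine."""
    partials = [
        blockwise_sum(values[i : i + chunk_size], labels[i : i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return tree_reduce(partials)


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["a", "b", "a", "b", "a", "c"]
result = chunked_groupby_sum(values, labels, chunk_size=2)
print(result)
```

No chunk ever needs to see another chunk's raw data. Only the small per-group intermediates move between stages, which is why this pattern needs no global sort or shuffle.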