| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code when working with code in this |
| 4 | +repository. |
| 5 | + |
| 6 | +## Project Overview |
| 7 | + |
| 8 | +dandischema defines the Pydantic v2 metadata models for the DANDI |
| 9 | +neurophysiology data archive. It is used by both the dandi-cli client and the |
| 10 | +dandi-archive server. Key concerns: model definitions, JSON Schema generation, |
| 11 | +metadata validation, schema migration between versions, and asset metadata |
| 12 | +aggregation. |
| 13 | + |
| 14 | +## Build/Test Commands |
| 15 | + |
| 16 | +```bash |
| 17 | +tox -e py3 # Run full test suite (preferred) |
| 18 | +pytest dandischema/ # Run tests directly in active venv |
| 19 | +pytest dandischema/tests/test_metadata.py -v -k "test_name" # Single test |
| 20 | +tox -e lint # codespell + flake8 |
| 21 | +tox -e typing # mypy (strict, with pydantic plugin) |
| 22 | +``` |
| 23 | + |
| 24 | +- `filterwarnings = error` is active — new warnings will fail tests. |
| 25 | +- Coverage is collected by default (`--cov=dandischema`). |
| 26 | + |
| 27 | +## Code Style |
| 28 | + |
| 29 | +- **Formatter**: Black (no explicit line-length override → default 88) |
| 30 | +- **Import sorting**: isort with `profile = "black"`, `force_sort_within_sections`, |
| 31 | + `reverse_relative` |
| 32 | +- **Linting**: flake8 (max-line-length=100, ignores E203/W503) |
| 33 | +- **Type checking**: mypy strict — `no_implicit_optional`, `warn_return_any`, |
| 34 | + `warn_unreachable`, pydantic plugin enabled |
| 35 | +- **Pre-commit hooks**: trailing-whitespace, end-of-file-fixer, check-yaml, |
| 36 | + check-added-large-files, black, isort, codespell, flake8 |
| 37 | +- Imports at top of file; avoid function-level imports unless there is a |
| 38 | + concrete reason (circular deps, heavy transitive imports) |
| 39 | + |
| 40 | +## Architecture |
| 41 | + |
| 42 | +### Key Modules |
| 43 | + |
| 44 | +| File | Role | |
| 45 | +|------|------| |
| 46 | +| `models.py` | All Pydantic models (~2000 lines). Class hierarchy rooted at `DandiBaseModel`. | |
| 47 | +| `metadata.py` | `validate()`, `migrate()`, `aggregate_assets_summary()`. | |
| 48 | +| `consts.py` | `DANDI_SCHEMA_VERSION`, `ALLOWED_INPUT_SCHEMAS`, `ALLOWED_TARGET_SCHEMAS`. | |
| 49 | +| `conf.py` | Instance configuration via env vars (`DANDI_INSTANCE_NAME`, etc.). | |
| 50 | +| `types.py` | Custom Pydantic types (`ByteSizeJsonSchema`). | |
| 51 | +| `utils.py` | JSON schema helpers, `version2tuple()`, `name2title()`. | |
| 52 | +| `exceptions.py` | `ValidationError`, `JsonschemaValidationError`, `PydanticValidationError`. | |
| 53 | +| `digests/` | `DandiETag` multipart-upload checksum calculation. | |
| 54 | +| `datacite/` | DataCite DOI metadata conversion. | |
| 55 | + |
| 56 | +### Model Hierarchy (simplified) |
| 57 | + |
| 58 | +``` |
| 59 | +DandiBaseModel |
| 60 | +├── PropertyValue # recursive (self-referencing) |
| 61 | +├── BaseType |
| 62 | +│ ├── StandardsType # name, identifier, version, extensions (recursive) |
| 63 | +│ ├── ApproachType, AssayType, SampleType, Anatomy, ... |
| 64 | +│ └── MeasurementTechniqueType |
| 65 | +├── Person, Organization # Contributor subclasses |
| 66 | +├── BioSample # recursive (wasDerivedFrom) |
| 67 | +├── AssetsSummary # aggregated stats |
| 68 | +└── CommonModel |
| 69 | + ├── Dandiset → PublishedDandiset |
| 70 | + └── BareAsset → Asset → PublishedAsset |
| 71 | +``` |
| 72 | + |
| 73 | +Several models are **self-referencing** (PropertyValue, BioSample, |
| 74 | +StandardsType). These require `model_rebuild()` after the class definition. |
| 75 | + |
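To illustrate the `model_rebuild()` note on self-referencing models, here is a minimal sketch. `SampleNode` is a hypothetical stand-in, not one of the real dandischema classes; it only mirrors the `wasDerivedFrom` recursion mentioned for `BioSample`. It assumes Pydantic v2 is installed.

```python
from typing import List, Optional

from pydantic import BaseModel


class SampleNode(BaseModel):
    """Hypothetical stand-in for a self-referencing model such as BioSample."""

    identifier: str
    # Forward reference to the class being defined, written as a string.
    wasDerivedFrom: Optional[List["SampleNode"]] = None


# Resolve the forward reference now that the class object exists.
SampleNode.model_rebuild()

source = SampleNode(identifier="tissue-1")
derived = SampleNode(identifier="slice-1a", wasDerivedFrom=[source])

# Same serialization call the document uses for asset metadata.
dumped = derived.model_dump(mode="json", exclude_none=True)
```

With `exclude_none=True`, the unset `wasDerivedFrom` on the nested node is omitted from the output, so the dump contains only populated fields.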
| 76 | +### Data Flow: Asset Metadata Aggregation |
| 77 | + |
| 78 | +1. dandi-cli calls `asset.get_metadata()` → populates `BareAsset` including |
| 79 | + per-asset `dataStandard` list |
| 80 | +2. Asset metadata is serialized via `model_dump(mode="json", exclude_none=True)` |
| 81 | +3. Server calls `aggregate_assets_summary(assets)` → |
| 82 | + `_add_asset_to_stats()` per asset → `AssetsSummary` |
| 83 | +4. `_add_asset_to_stats()` collects: numberOfBytes, numberOfFiles, approach, |
| 84 | + measurementTechnique, variableMeasured, species, subjects, dataStandard |
| 85 | +5. For old clients that do not populate `dataStandard`, deprecated |
| 86 | + heuristics infer it from the asset path/encoding (remove after 2026-12-01) |
| 87 | + |
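The per-asset fold in steps 3 and 4 can be pictured roughly as below. This is a simplified sketch, not the actual `_add_asset_to_stats()` implementation: the function names `add_asset_to_stats` and `summarize` are hypothetical, only a few of the listed fields are handled, and reading the byte count from `contentSize` is an assumption about the serialized asset dict.

```python
from typing import Any, Dict, Iterable


def add_asset_to_stats(asset: Dict[str, Any], stats: Dict[str, Any]) -> None:
    """Fold one serialized asset dict into the running summary (simplified)."""
    # Assumption: the asset's size lives under "contentSize".
    stats["numberOfBytes"] += asset.get("contentSize", 0)
    stats["numberOfFiles"] += 1
    for key in ("approach", "measurementTechnique", "dataStandard"):
        for entry in asset.get(key, []):
            # Deduplicate while keeping first-seen order.
            if entry not in stats[key]:
                stats[key].append(entry)


def summarize(assets: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate all assets into one summary dict."""
    stats: Dict[str, Any] = {
        "numberOfBytes": 0,
        "numberOfFiles": 0,
        "approach": [],
        "measurementTechnique": [],
        "dataStandard": [],
    }
    for asset in assets:
        add_asset_to_stats(asset, stats)
    return stats


assets = [
    {"contentSize": 100, "approach": [{"name": "electrophysiological approach"}],
     "dataStandard": [{"name": "NWB"}]},
    {"contentSize": 50, "dataStandard": [{"name": "NWB"}, {"name": "BIDS"}]},
]
summary = summarize(assets)
```

The sketch works on plain dicts because step 2 serializes asset metadata before it reaches the server.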
| 88 | +### Pre-instantiated Standard Constants |
| 89 | + |
| 90 | +```python |
| 91 | +nwb_standard # RRID:SCR_015242 |
| 92 | +bids_standard # RRID:SCR_016124 |
| 93 | +ome_ngff_standard # DOI:10.25504/FAIRsharing.9af712 |
| 94 | +hed_standard # RRID:SCR_014074 |
| 95 | +``` |
| 96 | + |
| 97 | +These are dicts (`model_dump(mode="json", exclude_none=True)`) used by both |
| 98 | +dandischema (heuristic fallbacks) and dandi-cli (per-asset population). |
| 99 | + |
| 100 | +### Vendorization |
| 101 | + |
| 102 | +The schema supports deployment for different DANDI instances. Environment |
| 103 | +variables (`DANDI_INSTANCE_NAME`, `DANDI_INSTANCE_IDENTIFIER`, |
| 104 | +`DANDI_DOI_PREFIX`, etc.) must be set **before** importing |
| 105 | +`dandischema.models`. This dynamically adjusts identifier patterns, DOI |
| 106 | +prefixes, license enums, and URL patterns. CI tests multiple vendored |
| 107 | +configurations. |
| 108 | + |
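Why the environment variables must be set before the import can be shown with a stand-in module. The sketch below assumes configuration is read once at import time, as the paragraph above implies. `fake_vendored_models` and `ID_PREFIX` are hypothetical names, not part of dandischema. Clearing the cached module is also the general idea behind the `clear_dandischema_modules_and_set_env_vars` fixture described under Testing Notes.

```python
import os
import sys
import types

MODULE_NAME = "fake_vendored_models"  # hypothetical module name

# Hypothetical module body that reads its configuration once, at import time.
FAKE_MODELS_SOURCE = """
import os
INSTANCE_NAME = os.environ.get("DANDI_INSTANCE_NAME", "DANDI")
ID_PREFIX = INSTANCE_NAME + ":"
"""


def import_fake_models() -> types.ModuleType:
    """Mimic `import`: reuse the sys.modules entry if one is cached."""
    if MODULE_NAME in sys.modules:
        return sys.modules[MODULE_NAME]
    module = types.ModuleType(MODULE_NAME)
    exec(FAKE_MODELS_SOURCE, module.__dict__)
    sys.modules[MODULE_NAME] = module
    return module


sys.modules.pop(MODULE_NAME, None)  # start from a clean state

os.environ["DANDI_INSTANCE_NAME"] = "EXAMPLE"  # set BEFORE the first import
first = import_fake_models()

os.environ["DANDI_INSTANCE_NAME"] = "TOO-LATE"  # ignored: module is cached
second = import_fake_models()

# Dropping the cached module forces the body to run again with the new value.
del sys.modules[MODULE_NAME]
third = import_fake_models()

# Clean up so the sketch leaves no global state behind.
sys.modules.pop(MODULE_NAME, None)
os.environ.pop("DANDI_INSTANCE_NAME", None)
```

The second call returns the cached module unchanged, which is why changing the variables after import has no effect on already-built patterns.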
| 109 | +## Schema Change Checklist |
| 110 | + |
| 111 | +When adding or removing fields from any model (BareAsset, Dandiset, |
| 112 | +AssetsSummary, etc.): |
| 113 | + |
| 114 | +1. **Update `_FIELDS_INTRODUCED` in `metadata.py:migrate()`** if adding a new |
| 115 | + **top-level field to Dandiset metadata** — `migrate()` only processes |
| 116 | + Dandiset-level dicts (not Asset metadata). Fields on BareAsset or nested |
| 117 | + inside existing structures (e.g. new fields on StandardsType) do not need |
| 118 | + entries here. |
| 119 | + |
| 120 | +2. **Update `consts.py`** if bumping `DANDI_SCHEMA_VERSION` or adding to |
| 121 | + `ALLOWED_INPUT_SCHEMAS`. |
| 122 | + |
| 123 | +3. **Add tests** covering migration/aggregation with the new field. |
| 124 | + |
| 125 | +4. **Coordinate with dandi-cli** — new fields that dandi-cli populates need |
| 126 | + backward-compat guards there (check `"field" in Model.model_fields`) until |
| 127 | + the minimum dandischema dependency is bumped. |
| 128 | + |
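The backward-compat guard from item 4 might look like the following sketch. The two classes are plain stand-ins that only expose a `model_fields` mapping, the way Pydantic v2 models do; `build_metadata` is a hypothetical helper and not dandi-cli code.

```python
from typing import Any, Dict, List


class OldBareAsset:
    """Stand-in for a model from an older dandischema without `dataStandard`."""

    model_fields: Dict[str, Any] = {"path": None, "contentSize": None}


class NewBareAsset:
    """Stand-in for a model from a newer dandischema that has `dataStandard`."""

    model_fields: Dict[str, Any] = {
        "path": None,
        "contentSize": None,
        "dataStandard": None,
    }


def build_metadata(
    model: type, path: str, standards: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Populate the new field only when the installed model declares it."""
    metadata: Dict[str, Any] = {"path": path}
    if "dataStandard" in model.model_fields:
        metadata["dataStandard"] = standards
    return metadata


nwb = [{"name": "NWB"}]
old_result = build_metadata(OldBareAsset, "sub-01/sub-01.nwb", nwb)
new_result = build_metadata(NewBareAsset, "sub-01/sub-01.nwb", nwb)
```

Once the minimum dandischema dependency guarantees the field exists, the `in` check can be dropped and the field populated unconditionally.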
| 129 | +## Testing Notes |
| 130 | + |
| 131 | +- Tests use `filterwarnings = error` — any new deprecation warning will fail. |
| 132 | +- The `clear_dandischema_modules_and_set_env_vars` fixture (conftest.py) |
| 133 | + supports testing vendored configurations by clearing cached modules and |
| 134 | + setting env vars. |
| 135 | +- Network-dependent tests are skipped when `DANDI_TESTS_NONETWORK` is set. |
| 136 | +- DataCite tests require `DATACITE_DEV_LOGIN` / `DATACITE_DEV_PASSWORD`. |
| 137 | +- `test_models.py:test_duplicate_classes` checks for duplicate field qnames |
| 138 | + across models; allowed duplicates are listed explicitly. |