Zarr v3 alternative #1110

Draft
jeromekelleher wants to merge 1 commit into tskit-dev:main from jeromekelleher:zarr-v3-second-pass

@jeromekelleher
Member

No description provided.

zarr v3 removed the built-in LMDBStore. This implements the zarr v3 async
Store ABC on top of LMDB, preserving the same flat key layout used by zarr
v2's LMDBStore so the on-disk format of .samples and .ancestors files is
unchanged (at the storage level).
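
To illustrate what a flat key layout implies, here is a minimal sketch that uses a plain dict as a stand-in for the LMDB database. The key names follow zarr v2 conventions (`.zgroup`, `.zarray`, numbered chunk keys) and are assumed to be stored as bytes; the `list_dir` helper and the example keys are hypothetical and not taken from this PR.

```python
# A plain dict stands in for the LMDB database. Assumption: keys are
# "/"-separated paths stored as bytes, with no nested containers.
flat_store = {
    b".zgroup": b"{}",
    b"sites/position/.zarray": b"{}",
    b"sites/position/0": b"chunk-0",
    b"sites/position/1": b"chunk-1",
    b"provenances/record/.zarray": b"{}",
}


def list_dir(store, prefix=""):
    """Return the immediate children under ``prefix`` in a flat key space."""
    base = prefix.strip("/")
    start = (base + "/").encode() if base else b""
    children = set()
    for key in store:
        if key.startswith(start):
            # Keep only the first path component after the prefix.
            children.add(key[len(start):].split(b"/", 1)[0].decode())
    return sorted(children)
```

Because the hierarchy lives only in the key names, a store built this way can list groups and arrays with a prefix scan, which is why keeping the key scheme unchanged keeps the storage-level layout unchanged.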

Update DataContainer for zarr v3 API

- Replace zarr.LMDBStore with our custom LMDBStore
- zarr.group() → zarr.group(zarr_format=2) for in-memory stores
- zarr.open_group(...) → add zarr_format=2 for new stores
- Replace zarr.copy_all/copy_store with _copy_zarr_group() helper
- array.resize(n) → array.resize((n,)) - zarr v3 takes a shape tuple
- array.info → repr(array) - array.info was removed in zarr v3
- Use store.info()/stat() directly for LMDB size shrink in finalise()
- Remove _metadata_codec attribute (no longer used for array creation)

Update SampleData for zarr v3: JSON-encode object/ragged arrays

- Bump FORMAT_VERSION from (5, 1) to (6, 0)
- Replace dtype=object (numcodecs.JSON object_codec) → dtype=str
  for populations/metadata, individuals/metadata, sites/alleles,
  sites/metadata, provenances/timestamp, provenances/record
- Replace dtype="array:f8" (ragged float) → dtype=str + JSON encoding
  for individuals/location
- All create_dataset() → create_array() with zarr_format=2
- Add json.dumps() in all write paths, json.loads() in all read paths
- Mark each on-disk change with # ON-DISK COMPATIBILITY comment

Files written with FORMAT_VERSION < (6, 0) cannot be read by this code.
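
The JSON encoding described above can be sketched with the standard library alone. This is only an illustration of the round trip, assuming one JSON string per row; the variable names are hypothetical, and the real code works on NumPy arrays (which would need `.tolist()` before `json.dumps()`).

```python
import json

# Ragged per-individual locations: each row may have a different length.
locations = [[], [1.5, 2.25], [0.0]]

# Write path: one JSON string per row, so the column can use a string dtype.
encoded = [json.dumps(row) for row in locations]

# Read path: decode each string back into a list of floats.
decoded = [json.loads(text) for text in encoded]

# Metadata dictionaries take the same route.
metadata_text = json.dumps({"name": "ind0"})
metadata = json.loads(metadata_text)
```

The trade-off is that every read now pays a parse step, in exchange for storing only plain strings rather than object or ragged-array dtypes.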

Update AncestorData for zarr v3: JSON-encode focal_sites, update create_dataset wrapper

- Bump FORMAT_VERSION from (3, 0) to (4, 0)
- Update create_dataset() wrapper to use create_array() with zarr_format=2
- Replace dtype="array:i4" (ragged int32) → dtype=str + JSON encoding
  for sample_focal_sites (ancestors_focal_sites)
- add_ancestor(): json.dumps(focal_sites.tolist()) on write
- ancestor() / ancestors(): json.loads() → np.int32 array on read
- assert_data_equal(): decode JSON strings before np.testing.assert_array_equal
- truncate_ancestors(): buffer focal_sites as JSON-encoded strings
- Mark each on-disk change with # ON-DISK COMPATIBILITY comment

Files written with FORMAT_VERSION < (4, 0) cannot be read by this code.
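
A version check of this kind can be sketched as a tuple comparison. The function name and the use of `ValueError` are assumptions made for illustration; the PR does not show how the check is implemented or which exception is raised.

```python
FORMAT_VERSION = (4, 0)  # AncestorData version after this change


def check_format_version(file_version, required=FORMAT_VERSION):
    """Raise ValueError if the file was written with an older format."""
    # Tuples compare element by element, so (3, 9) < (4, 0) is True.
    if tuple(file_version) < required:
        raise ValueError(
            f"File format {tuple(file_version)} is older than the "
            f"required {required}"
        )
```

For example, `check_format_version((3, 0))` raises, while `check_format_version((4, 0))` returns without error.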

Update VariantData, test utility, and pyproject.toml for zarr v3

- Add zarr_format=2 to zarr.array() and zarr.zeros() in VariantData
- Fix tests/tsutil.py: zarr.group() → zarr.group(zarr_format=2),
  replace removed zarr.convenience.copy_all() with _copy_zarr_group()
- pyproject.toml: zarr<3 → zarr>=3,<4

Fix add_ancestral_alleles(): create_dataset → create_array with zarr_format=2

Update test_variantdata.py for zarr v3 API

- zarr.group() → zarr.group(zarr_format=2)
- zarr.array() → zarr.array(..., zarr_format=2)
- create_dataset() → create_array() with zarr_format=2, data assigned separately
- individuals_metadata: object_codec=VLenBytes() → dtype=bytes + separate assignment
- TestAddAncestralStateArray: extract _make_store() helper to DRY up setup
- Remove unused numcodecs import

Update test_formats.py for zarr v3 API

- zarr.LMDBStore → tsinfer._lmdb_store.LMDBStore
- Add zarr_format=2 to all zarr.array/ones/zeros/empty/empty_like calls
- Replace dtype="array:i4" tests with JSON-encoded dtype=str equivalents
- Replace dtype=object/object_codec=JSON() tests with dtype=str equivalents
- Remove now-unnecessary object array comparison branch in verify_round_trip
- Remove unused 'import numcodecs' (blosc still imported as numcodecs.blosc)

Update zarr version guard: require zarr >=3,<4
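
As a rough sketch of what such a guard checks, the function below tests a version string against the `>=3,<4` range. The function name is hypothetical, and the actual guard in the code base may be written differently (for example by inspecting `zarr.__version__` at import time).

```python
def zarr_version_supported(version):
    """Return True if ``version`` (e.g. "3.0.8") satisfies >=3,<4."""
    # Only the major component matters for this range.
    major = int(version.split(".")[0])
    return 3 <= major < 4
```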

Add missing abstract methods to LMDBStore: __eq__ and get_partial_values

zarr v3's Store ABC requires these to be implemented.

Remove zarr_format=2 from group.create_array() calls

zarr v3's Group.create_array() does not accept zarr_format; arrays inherit
the format from their parent group (which is opened/created with zarr_format=2).
Keep zarr_format=2 only on standalone zarr.array/zeros/empty and
zarr.group/open_group calls.

Updates
@codecov

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.82%. Comparing base (a72866c) to head (5e8cb38).

❗ There is a different number of reports uploaded between BASE (a72866c) and HEAD (5e8cb38).

HEAD has 6 uploads less than BASE

| Flag | BASE (a72866c) | HEAD (5e8cb38) |
|------|----------------|----------------|
| python-tests | 6 | 0 |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1110      +/-   ##
==========================================
- Coverage   90.22%   81.82%   -8.41%     
==========================================
  Files          19        7      -12     
  Lines        7092     2823    -4269     
  Branches     1170      474     -696     
==========================================
- Hits         6399     2310    -4089     
+ Misses        562      434     -128     
+ Partials      131       79      -52     

| Flag | Coverage Δ |
|------|------------|
| C | 81.41% <ø> (ø) |
| c-python | 59.43% <ø> (ø) |
| python-tests | ? |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
|------------|------------|
| Python API | ∅ <ø> (∅) |
| Python C interface | 66.66% <ø> (ø) |
| C library | 88.58% <ø> (ø) |
