@wietzesuijker
Contributor

@wietzesuijker wietzesuijker commented Feb 11, 2026

What does this PR do?

Adds read support for Zarr v3 fixed-length string extension data types, partially addressing #13782.

ParseDtypeV3() only handled numeric types given as a plain string name ("float32", "uint8", etc.). Zarr v3 extension data types use an object form with "name" + "configuration", which was rejected with "Invalid or unsupported format for data_type".

This adds:

  • null_terminated_bytes -> STRING_ASCII (fixed-length byte strings)
  • fixed_length_utf32 -> STRING_UNICODE (fixed-length UCS4 strings)

Both reuse the existing encode/decode infrastructure in zarr_array.cpp (UCS4ToUTF8, UTF8ToUCS4, ASCII memcpy) which already handles these native types for Zarr v2. Also fixes string fill_value parsing for these types.
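
For illustration, an object-form entry for these types might look like `{"name": "null_terminated_bytes", "configuration": {"length_bytes": 5}}`. The sketch below shows the mapping described above under stated assumptions: the `length_bytes` key is what zarr-python appears to write and is not quoted in this PR text, and `NativeType`, `StringDtype` and `MapFixedLengthString` are hypothetical stand-ins, not the actual `ParseDtypeV3()` code. JSON parsing is left out; the function takes the already-extracted name and length.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the driver's dtype description.
enum class NativeType { STRING_ASCII, STRING_UNICODE };

struct StringDtype
{
    NativeType nativeType;
    std::size_t nativeSize;  // bytes per element as stored in a chunk
};

// Maps an already-parsed {"name": ..., "configuration": {"length_bytes": N}}
// object to a native string type. Returns false for anything it does not
// recognize, so the caller can report an unsupported data_type.
// Assumption: "length_bytes" is the configuration key for both types.
bool MapFixedLengthString(const std::string &name, long long lengthBytes,
                          StringDtype &out)
{
    if (lengthBytes <= 0)
        return false;
    const std::size_t size = static_cast<std::size_t>(lengthBytes);
    if (name == "null_terminated_bytes")
    {
        out = StringDtype{NativeType::STRING_ASCII, size};
        return true;
    }
    if (name == "fixed_length_utf32")
    {
        // UCS-4 stores 4 bytes per character, so the element size must be
        // a multiple of 4.
        if (size % 4 != 0)
            return false;
        out = StringDtype{NativeType::STRING_UNICODE, size};
        return true;
    }
    return false;
}
```

With this shape, a plain string name such as "float32" would still go through the existing numeric path, and only the object form reaches the mapping.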

The second commit converts string nodata from GDAL representation (char* pointer) to native format (fixed-size byte buffer) in SetupCodecs(), so FillWithNoData() in the sharding codec works correctly with string types. Tested with a sharded null_terminated_bytes array.
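
As a rough sketch of the conversion described above (not the code in SetupCodecs()), the function below turns a NUL-terminated string into a zero-padded buffer of exactly `nativeSize` bytes. The name `StringNoDataToNative` and the truncation behaviour are assumptions made for illustration, and only the ASCII case is covered; a fixed_length_utf32 value would additionally need a UTF-8 to UCS-4 conversion.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies a NUL-terminated string into a zero-padded
// buffer of exactly nativeSize bytes, truncating if the string is longer.
// The result has the same size as one stored element, which is what a
// fill routine comparing against nativeSize would expect.
std::vector<std::uint8_t> StringNoDataToNative(const char *pszNoData,
                                               std::size_t nativeSize)
{
    std::vector<std::uint8_t> native(nativeSize, 0);
    if (pszNoData != nullptr)
    {
        const std::size_t n = std::min(std::strlen(pszNoData), nativeSize);
        if (n > 0)
            std::memcpy(native.data(), pszNoData, n);
    }
    return native;
}
```

The point of the sketch is the size: the returned buffer is `nativeSize` bytes long, whereas a buffer holding the pointer itself is `sizeof(char*)` bytes long regardless of the dtype.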

The third commit fixes the bytes codec for big-endian targets (s390x):

  • null_terminated_bytes (nativeSize=5): bytes codec tried to swap 5-byte elements, hitting CPLAssert(false) - now skipped since ASCII has no endianness concept
  • fixed_length_utf32 (nativeSize=8): CPL_SWAP64 reversed character positions instead of swapping each 4-byte UCS-4 char individually - now uses per-character CPL_SWAP32
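
The difference between the two swaps can be shown on a two-character element. This is a standalone sketch using `std::reverse` rather than the CPL_SWAP32/CPL_SWAP64 macros, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reverses the bytes of each 4-byte UCS-4 character in place, leaving the
// order of the characters themselves unchanged.
void SwapUCS4PerChar(std::uint8_t *data, std::size_t nBytes)
{
    assert(nBytes % 4 == 0);
    for (std::size_t i = 0; i < nBytes; i += 4)
        std::reverse(data + i, data + i + 4);
}

// For comparison: treating the whole element as one integer reverses all
// of its bytes, which also exchanges the characters.
void SwapWholeElement(std::uint8_t *data, std::size_t nBytes)
{
    std::reverse(data, data + nBytes);
}
```

For "AB" stored as little-endian UCS-4 (`41 00 00 00 42 00 00 00`), the per-character swap yields `00 00 00 41 00 00 00 42`, which is big-endian "AB", while the whole-element swap yields `00 00 00 42 00 00 00 41`, which is big-endian "BA".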

Variable-length strings ("string" dtype + vlen-utf8 codec) are not included: the codec pipeline (CopySubArrayIntoLargerOne, FillWithNoData, ZarrByteVectorQuickResize) needs auditing for dynamic memory allocation first.

What are related issues/pull requests?

Changes

Modified:

  • frmts/zarr/zarr_v3_array.cpp - object-form dtype parsing, string fill_value, nodata native format conversion
  • frmts/zarr/zarr_v3_codec.h - IsNoOp() returns true for STRING_ASCII
  • frmts/zarr/zarr_v3_codec_bytes.cpp - per-character UCS-4 byte swap for STRING_UNICODE

Test data + tests:

  • autotest/gdrivers/data/zarr/v3/null_terminated_bytes.zarr/
  • autotest/gdrivers/data/zarr/v3/fixed_length_utf32.zarr/
  • autotest/gdrivers/data/zarr/v3/sharded_null_terminated_bytes.zarr/
  • autotest/gdrivers/zarr_driver.py - parametrized tests + sharded string test

Tasklist

  • AI (Claude) supported my development of this PR
  • Code is correctly formatted (pre-commit)
  • Add test case(s)
  • All CI builds and checks have passed

@rouault
Member

rouault commented Feb 11, 2026

One tricky aspect of string handling is that it involves dynamic memory allocation... and de-allocation. Nothing in the Zarr V3 code paths is currently ready for that. In codecs, all buffer swapping logic needs to be carefully inspected to avoid leaks. In the shard codec, functions (or their use) like CopySubArrayIntoLargerOne and FillWithNoData need to be inspected. I'm skeptical AI will get this right. This needs to be carefully manually reviewed and tested (we'd need tests with sharded string arrays). We'd perhaps need something smarter than ZarrByteVectorQuickResize to keep track of data types with dynamic memory allocation.

atoi(CPLGetConfigOption(
"ZARR_VLEN_STRING_MAX_LENGTH", "256")));
elt.nativeType = DtypeElt::NativeType::STRING_ASCII;
elt.nativeSize = static_cast<size_t>(nMaxLen);
Member


Giving a fixed size value doesn't make sense to me given that the data type is variable-length. In the GDAL array byte representation for GEDTC_STRING, we store a pointer to a variable-length string.

@wietzesuijker wietzesuijker force-pushed the fix/zarr-v3-string-dtype branch from f024539 to 87a013e Compare February 11, 2026 20:11
@wietzesuijker wietzesuijker changed the title Zarr v3: support string data types (fixed-length + vlen-utf8) Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) Feb 11, 2026
@wietzesuijker
Contributor Author

Good points - dropped the vlen-utf8 commit. The fixed-size slot approach didn't match GDAL's pointer-based GEDTC_STRING representation, and the codec pipeline (CopySubArrayIntoLargerOne, FillWithNoData, ZarrByteVectorQuickResize) isn't set up for dynamic allocation as you noted.

This PR now only adds the two fixed-length extension dtypes (null_terminated_bytes, fixed_length_utf32) - these map directly to the existing STRING_ASCII/STRING_UNICODE infrastructure from v2, no new codecs or dynamic allocation. Also removed the needByteSwapping flag for fixed_length_utf32 to stay consistent with v3's approach of delegating endianness to the bytes codec.

@rouault
Member

rouault commented Feb 11, 2026

the codec pipeline (CopySubArrayIntoLargerOne, FillWithNoData, ZarrByteVectorQuickResize) isn't set up for dynamic allocation as you noted

yes, but I'm pretty sure you'll still hit those issues on sharded arrays of type null_terminated_bytes or fixed_length_utf32

@wietzesuijker wietzesuijker force-pushed the fix/zarr-v3-string-dtype branch from 82f2490 to 0b53f58 Compare February 11, 2026 21:08
ParseDtypeV3() only handled simple string-name numeric types. Add
support for object-form data types with "name" + "configuration":

- null_terminated_bytes -> STRING_ASCII (fixed-length byte strings)
- fixed_length_utf32 -> STRING_UNICODE (fixed-length UCS4 strings)

Also handle string fill_value for GEDTC_STRING types in LoadArray,
which previously would fail trying to parse as a numeric value.

Partial fix for OSGeo#13782

FillWithNoData() in the sharding codec expects nodata in native format
(fixed-size byte buffer), but for string types abyNoData held a char*
pointer (GDAL representation). This caused a size mismatch assertion
for null_terminated_bytes in sharded arrays.

Convert string nodata from GDAL to native format in SetupCodecs(),
and add a sharded null_terminated_bytes test case.
@wietzesuijker wietzesuijker force-pushed the fix/zarr-v3-string-dtype branch from 0b53f58 to 102d6a2 Compare February 11, 2026 21:30
@wietzesuijker
Contributor Author

You're right - FillWithNoData() expects native-format nodata but abyNoData held a char* pointer for string types, causing a size mismatch (sizeof(char*) vs nativeSize). Fixed in the second commit:

  1. SetupCodecs() now converts string nodata from GDAL representation to a native-format byte buffer before passing to the codec metadata
  2. Added a NoDataFreer RAII guard (matching the v2 pattern in zarr_v2_array.cpp:1761) to free the CPLStrdup'd pointer when abyNoData goes out of scope
  3. Added a sharded null_terminated_bytes test that exercises FillWithNoData() through the shard decode path
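
The guard in item 2 can be pictured roughly as follows. This is a simplified, hypothetical sketch and not the NoDataFreer from the PR: it assumes the buffer holds a single `malloc`-allocated `char*` and uses `std::free`, whereas the driver works with CPLStrdup'd memory and GDAL's own deallocation.

```cpp
#include <cstdlib>
#include <cstring>
#include <vector>

// Simplified, hypothetical guard: the buffer is assumed to hold exactly one
// malloc-allocated char* (the pointer-based representation of a string).
class StringPointerFreer
{
  public:
    explicit StringPointerFreer(std::vector<unsigned char> &buf) : m_buf(buf)
    {
    }
    StringPointerFreer(const StringPointerFreer &) = delete;
    StringPointerFreer &operator=(const StringPointerFreer &) = delete;

    ~StringPointerFreer()
    {
        if (m_buf.size() != sizeof(char *))
            return;
        char *p = nullptr;
        std::memcpy(&p, m_buf.data(), sizeof(p));
        std::free(p);  // free(nullptr) is a no-op
        // Overwrite the stale pointer so it cannot be freed twice.
        p = nullptr;
        std::memcpy(m_buf.data(), &p, sizeof(p));
    }

  private:
    std::vector<unsigned char> &m_buf;
};
```

Tying the release to scope exit means early returns on error paths cannot leak the string.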

- Skip bytes codec for STRING_ASCII (byte-oriented, no endianness)
- Swap each 4-byte UCS-4 character individually for STRING_UNICODE
  instead of treating the whole nativeSize as one element
- Fix clang-format in nodata conversion
