Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) #13910

wietzesuijker · 2026-02-11T18:09:30Z

What does this PR do?

Adds read support for Zarr v3 fixed-length string extension data types, partially addressing #13782.

ParseDtypeV3() only handled simple string-name numeric types ("float32", "uint8", etc). Zarr v3 extension data types use an object form with "name" + "configuration", which was rejected with "Invalid or unsupported format for data_type".

This adds:

null_terminated_bytes -> STRING_ASCII (fixed-length byte strings)
fixed_length_utf32 -> STRING_UNICODE (fixed-length UCS4 strings)

Both reuse the existing encode/decode infrastructure in zarr_array.cpp (UCS4ToUTF8, UTF8ToUCS4, ASCII memcpy) which already handles these native types for Zarr v2. Also fixes string fill_value parsing for these types.

The second commit converts string nodata from GDAL representation (char* pointer) to native format (fixed-size byte buffer) in SetupCodecs(), so FillWithNoData() in the sharding codec works correctly with string types. Tested with a sharded null_terminated_bytes array.

The third commit fixes the bytes codec for big-endian targets (s390x):

null_terminated_bytes (nativeSize=5): bytes codec tried to swap 5-byte elements, hitting CPLAssert(false) - now skipped since ASCII has no endianness concept
fixed_length_utf32 (nativeSize=8): CPL_SWAP64 reversed character positions instead of swapping each 4-byte UCS-4 char individually - now uses per-character CPL_SWAP32

Variable-length strings ("string" dtype + vlen-utf8 codec) not included - the codec pipeline (CopySubArrayIntoLargerOne, FillWithNoData, ZarrByteVectorQuickResize) needs auditing for dynamic memory allocation first.

What are related issues/pull requests?

Partially addresses GDAL Zarr v3 String Data Type Support #13782 (fixed-length only; vlen-utf8 deferred)
Zarr 3 (and geozarr?) #11824 - Zarr v3 / GeoZarr tracking issue

Changes

Modified:

frmts/zarr/zarr_v3_array.cpp - object-form dtype parsing, string fill_value, nodata native format conversion
frmts/zarr/zarr_v3_codec.h - IsNoOp() returns true for STRING_ASCII
frmts/zarr/zarr_v3_codec_bytes.cpp - per-character UCS-4 byte swap for STRING_UNICODE

Test data + tests:

autotest/gdrivers/data/zarr/v3/null_terminated_bytes.zarr/
autotest/gdrivers/data/zarr/v3/fixed_length_utf32.zarr/
autotest/gdrivers/data/zarr/v3/sharded_null_terminated_bytes.zarr/
autotest/gdrivers/zarr_driver.py - parametrized tests + sharded string test

Tasklist

AI (Claude) supported my development of this PR
Code is correctly formatted (pre-commit)
Add test case(s)
All CI builds and checks have passed

rouault · 2026-02-11T18:22:02Z

One tricky aspect of string handling is that involves dynamic memory allocation... and de-allocation. Nothing in the Zarr V3 code paths is currently ready for that. In codecs, all buffer swapping logic needs to be carefully inspected to avoid leaks. In the shard codec, functions (or their use) like CopySubArrayIntoLargerOne and FillWithNoData need to be inspected. I'm skeptical AI will get this right. This needs to be carefully manually reviewed and tested (we'd need tests with sharded string arrays). We'd perhaps need something smarter than ZarrByteVectorQuickResize to keep track of data types with dynamic memory allocation

rouault · 2026-02-11T18:34:03Z

frmts/zarr/zarr_v3_array.cpp

+                    atoi(CPLGetConfigOption(
+                        "ZARR_VLEN_STRING_MAX_LENGTH", "256")));
+                elt.nativeType = DtypeElt::NativeType::STRING_ASCII;
+                elt.nativeSize = static_cast<size_t>(nMaxLen);


Giving a fixed size value doesn't make sense to me given that data type is varying length. In the GDAL array byte representation for GEDTC_STRING, we store a pointer to a varying length string.

wietzesuijker · 2026-02-11T20:12:42Z

Good points - dropped the vlen-utf8 commit. The fixed-size slot approach didn't match GDAL's pointer-based GEDTC_STRING representation, and the codec pipeline (CopySubArrayIntoLargerOne, FillWithNoData, ZarrByteVectorQuickResize) isn't set up for dynamic allocation as you noted.

This PR now only adds the two fixed-length extension dtypes (null_terminated_bytes, fixed_length_utf32) - these map directly to the existing STRING_ASCII/STRING_UNICODE infrastructure from v2, no new codecs or dynamic allocation. Also removed the needByteSwapping flag for fixed_length_utf32 to stay consistent with v3's approach of delegating endianness to the bytes codec.

rouault · 2026-02-11T20:23:45Z

the codec pipeline (CopySubArrayIntoLargerOne, FillWithNoData, ZarrByteVectorQuickResize) isn't set up for dynamic allocation as you noted

yes, but I'm pretty sure you'll still hit those issues on sharded arrays of type null_terminated_bytes or fixed_length_utf32

ParseDtypeV3() only handled simple string-name numeric types. Add support for object-form data types with "name" + "configuration": - null_terminated_bytes -> STRING_ASCII (fixed-length byte strings) - fixed_length_utf32 -> STRING_UNICODE (fixed-length UCS4 strings) Also handle string fill_value for GEDTC_STRING types in LoadArray, which previously would fail trying to parse as a numeric value. Partial fix for OSGeo#13782

FillWithNoData() in the sharding codec expects nodata in native format (fixed-size byte buffer), but for string types abyNoData held a char* pointer (GDAL representation). This caused a size mismatch assertion for null_terminated_bytes in sharded arrays. Convert string nodata from GDAL to native format in SetupCodecs(), and add a sharded null_terminated_bytes test case.

wietzesuijker · 2026-02-11T21:33:17Z

You're right - FillWithNoData() expects native-format nodata but abyNoData held a char* pointer for string types, causing a size mismatch (sizeof(char*) vs nativeSize). Fixed in the second commit:

SetupCodecs() now converts string nodata from GDAL representation to a native-format byte buffer before passing to the codec metadata
Added a NoDataFreer RAII guard (matching the v2 pattern in zarr_v2_array.cpp:1761) to free the CPLStrdup'd pointer when abyNoData goes out of scope
Added a sharded null_terminated_bytes test that exercises FillWithNoData() through the shard decode path

- Skip bytes codec for STRING_ASCII (byte-oriented, no endianness) - Swap each 4-byte UCS-4 character individually for STRING_UNICODE instead of treating the whole nativeSize as one element - Fix clang-format in nodata conversion

rouault reviewed Feb 11, 2026

View reviewed changes

wietzesuijker force-pushed the fix/zarr-v3-string-dtype branch from f024539 to 87a013e Compare February 11, 2026 20:11

wietzesuijker changed the title ~~Zarr v3: support string data types (fixed-length + vlen-utf8)~~ Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) Feb 11, 2026

wietzesuijker force-pushed the fix/zarr-v3-string-dtype branch from 82f2490 to 0b53f58 Compare February 11, 2026 21:08

wietzesuijker added 2 commits February 11, 2026 16:27

wietzesuijker force-pushed the fix/zarr-v3-string-dtype branch from 0b53f58 to 102d6a2 Compare February 11, 2026 21:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) #13910

Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) #13910

wietzesuijker commented Feb 11, 2026 •

edited

Loading

Uh oh!

rouault commented Feb 11, 2026

Uh oh!

rouault Feb 11, 2026

Uh oh!

wietzesuijker commented Feb 11, 2026

Uh oh!

rouault commented Feb 11, 2026

Uh oh!

wietzesuijker commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) #13910

Are you sure you want to change the base?

Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) #13910

Conversation

wietzesuijker commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

What are related issues/pull requests?

Changes

Tasklist

Uh oh!

rouault commented Feb 11, 2026

Uh oh!

rouault Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

wietzesuijker commented Feb 11, 2026

Uh oh!

rouault commented Feb 11, 2026

Uh oh!

wietzesuijker commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wietzesuijker commented Feb 11, 2026 •

edited

Loading