-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) #13910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Zarr v3: support fixed-length string data types (null_terminated_bytes, fixed_length_utf32) #13910
Conversation
|
One tricky aspect of string handling is that involves dynamic memory allocation... and de-allocation. Nothing in the Zarr V3 code paths is currently ready for that. In codecs, all buffer swapping logic needs to be carefully inspected to avoid leaks. In the shard codec, functions (or their use) like CopySubArrayIntoLargerOne and FillWithNoData need to be inspected. I'm skeptical AI will get this right. This needs to be carefully manually reviewed and tested (we'd need tests with sharded string arrays). We'd perhaps need something smarter than ZarrByteVectorQuickResize to keep track of data types with dynamic memory allocation |
frmts/zarr/zarr_v3_array.cpp
Outdated
| atoi(CPLGetConfigOption( | ||
| "ZARR_VLEN_STRING_MAX_LENGTH", "256"))); | ||
| elt.nativeType = DtypeElt::NativeType::STRING_ASCII; | ||
| elt.nativeSize = static_cast<size_t>(nMaxLen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Giving a fixed size value doesn't make sense to me given that data type is varying length. In the GDAL array byte representation for GEDTC_STRING, we store a pointer to a varying length string.
f024539 to
87a013e
Compare
|
Good points - dropped the vlen-utf8 commit. The fixed-size slot approach didn't match GDAL's pointer-based GEDTC_STRING representation, and the codec pipeline (CopySubArrayIntoLargerOne, FillWithNoData, ZarrByteVectorQuickResize) isn't set up for dynamic allocation as you noted. This PR now only adds the two fixed-length extension dtypes (null_terminated_bytes, fixed_length_utf32) - these map directly to the existing STRING_ASCII/STRING_UNICODE infrastructure from v2, no new codecs or dynamic allocation. Also removed the needByteSwapping flag for fixed_length_utf32 to stay consistent with v3's approach of delegating endianness to the bytes codec. |
yes, but I'm pretty sure you'll still hit those issues on sharded arrays of type null_terminated_bytes or fixed_length_utf32 |
82f2490 to
0b53f58
Compare
ParseDtypeV3() only handled simple string-name numeric types. Add support for object-form data types with "name" + "configuration": - null_terminated_bytes -> STRING_ASCII (fixed-length byte strings) - fixed_length_utf32 -> STRING_UNICODE (fixed-length UCS4 strings) Also handle string fill_value for GEDTC_STRING types in LoadArray, which previously would fail trying to parse as a numeric value. Partial fix for OSGeo#13782
FillWithNoData() in the sharding codec expects nodata in native format (fixed-size byte buffer), but for string types abyNoData held a char* pointer (GDAL representation). This caused a size mismatch assertion for null_terminated_bytes in sharded arrays. Convert string nodata from GDAL to native format in SetupCodecs(), and add a sharded null_terminated_bytes test case.
0b53f58 to
102d6a2
Compare
|
You're right -
|
- Skip bytes codec for STRING_ASCII (byte-oriented, no endianness) - Swap each 4-byte UCS-4 character individually for STRING_UNICODE instead of treating the whole nativeSize as one element - Fix clang-format in nodata conversion
What does this PR do?
Adds read support for Zarr v3 fixed-length string extension data types, partially addressing #13782.
ParseDtypeV3()only handled simple string-name numeric types ("float32","uint8", etc). Zarr v3 extension data types use an object form with"name"+"configuration", which was rejected with "Invalid or unsupported format for data_type".This adds:
null_terminated_bytes->STRING_ASCII(fixed-length byte strings)fixed_length_utf32->STRING_UNICODE(fixed-length UCS4 strings)Both reuse the existing encode/decode infrastructure in
zarr_array.cpp(UCS4ToUTF8,UTF8ToUCS4, ASCII memcpy) which already handles these native types for Zarr v2. Also fixes stringfill_valueparsing for these types.The second commit converts string nodata from GDAL representation (
char*pointer) to native format (fixed-size byte buffer) inSetupCodecs(), soFillWithNoData()in the sharding codec works correctly with string types. Tested with a shardednull_terminated_bytesarray.The third commit fixes the bytes codec for big-endian targets (s390x):
null_terminated_bytes(nativeSize=5): bytes codec tried to swap 5-byte elements, hittingCPLAssert(false)- now skipped since ASCII has no endianness conceptfixed_length_utf32(nativeSize=8):CPL_SWAP64reversed character positions instead of swapping each 4-byte UCS-4 char individually - now uses per-characterCPL_SWAP32Variable-length strings (
"string"dtype + vlen-utf8 codec) not included - the codec pipeline (CopySubArrayIntoLargerOne, FillWithNoData, ZarrByteVectorQuickResize) needs auditing for dynamic memory allocation first.What are related issues/pull requests?
Changes
Modified:
frmts/zarr/zarr_v3_array.cpp- object-form dtype parsing, string fill_value, nodata native format conversionfrmts/zarr/zarr_v3_codec.h-IsNoOp()returns true for STRING_ASCIIfrmts/zarr/zarr_v3_codec_bytes.cpp- per-character UCS-4 byte swap for STRING_UNICODETest data + tests:
autotest/gdrivers/data/zarr/v3/null_terminated_bytes.zarr/autotest/gdrivers/data/zarr/v3/fixed_length_utf32.zarr/autotest/gdrivers/data/zarr/v3/sharded_null_terminated_bytes.zarr/autotest/gdrivers/zarr_driver.py- parametrized tests + sharded string testTasklist