Clarify nesting. Add "configuration" as an index_location #368
|
I can see that in some cases this format would be useful. However this proposal is |
|
I can see that a separate file may be useful in some cases. However that is not possible under the existing codec model. We would need to define something more general, e.g. an array -> files codec. That is something I mentioned before in the context of making the chunk_grid itself a part of the codec pipeline rather than a separate higher level thing. |
|
Could we implement the external shard indices via a storage transformer? The storage transformer looks for |
I think this might work for reading, but a storage transformer might need additional information for writing. |
|
I proposed the storage transformer |
Co-authored-by: Gemini <[email protected]>
acd53d6 to 4d96684
|
I have removed the "external" index_location from this pull request. Instead I have proposed a "configuration" index_location where the shard index is defined in the codec configuration as a "predefined_index" parameter. At the moment, I have written that the parameter may either be a JSON array itself or a BASE64 encoded string. Allowing the "predefined_index" parameter to be a JSON array adds some complications. How should |
|
Could you clarify the purpose of If every single shard has the same shard index then:
|
Abstract
This pull request enhances the sharding codec specification by introducing a new mechanism for storing the shard index and clarifying several related behaviors. Key changes include:
- `configuration` `index_location`: A new `configuration` value for `index_location` is introduced, allowing the shard index to be defined directly within the codec's configuration parameters in the array metadata.
- `predefined_index` parameter: When `index_location` is `configuration`, the `predefined_index` parameter specifies the shard index data, which can be either a JSON array or a BASE64-encoded string.
- `index_codecs` application: Clarifications have been added on how `index_codecs` apply based on the type of `predefined_index`. If `predefined_index` is a BASE64 string, `index_codecs` are used to interpret the binary content. If it is a JSON array, `index_codecs` may contain `array -> array` transformations but MUST NOT contain an `array -> bytes` codec.
- Implementations MAY consider an array that uses `index_location: configuration` (at any nesting level) to be read-only due to the complexities of updating such an index.
- Nested shards may use the `configuration` value for `index_location`.

Configuration Index
The `configuration` index location provides a mechanism to embed the shard index directly within the `sharding_indexed` codec's configuration object in the Zarr array metadata (`zarr.json`). This is particularly useful for pre-defined, fixed, or read-only datasets where the index does not change.

The index data is provided via the `predefined_index` parameter within the codec configuration. This parameter can take two forms:

- A JSON array: the index is written directly as a JSON array (of `[offset, nbytes]` pairs). In this case, `index_codecs` may contain `array -> array` codecs to transform the conceptual `[N,2]` index to the JSON structure, but MUST NOT contain an `array -> bytes` codec.
- A BASE64-encoded string: the string encodes the bytes produced by the `index_codecs` chain applied to the index array. Implementations should decode the BASE64 string and then apply the `index_codecs` in reverse to obtain the index.

Shard Index as an Array
`index_codecs` was suggested to begin with a `bytes` array-to-bytes codec, implying that the shard index is an array. Clarifying that the shard index is an array makes it clearer that array-to-array codecs could be used to manipulate the shard index format before it is converted to bytes (for the `start`/`end`/BASE64 cases) or represented directly as a JSON array.

Potential applications:
- A `transpose` codec could be used to have a consecutive offsets subarray followed by a consecutive nbytes subarray. This would more closely correspond with TIFF's `TILEOFFSETS` and `TILEBYTECOUNTS` tags.
- An array-to-array codec could convert the index values from `uint64` to a smaller unsigned integer such as `uint32`, `uint16`, or `uint8`.

Nested Shards
The specification now explicitly allows nested shards to utilize an `index_location` value of `configuration`. This provides greater flexibility in defining hierarchical data structures. However, due to the difficulty of updating an index stored in the array metadata, implementations MAY consider any array using `"index_location": "configuration"` (at any level of nesting) to be read-only. Writing to such an array may produce an error or lead to a corrupted state if the written data would require a change to the predefined index.
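
To make the proposal more concrete, here is a rough Python sketch of how a reader might handle a configuration index. It is not part of the specification text and makes several assumptions: the helper names `decode_predefined_index` and `uses_configuration_index` are hypothetical, the BASE64 form is assumed to use only a little-endian `bytes` codec over `uint64` values with no checksum codec, and the JSON form is assumed to have no `array -> array` codecs applied.

```python
import base64
import struct


def _collect_pairs(node):
    """Flatten a (possibly nested) JSON array into (offset, nbytes) pairs."""
    if len(node) == 2 and all(isinstance(v, int) for v in node):
        return [(node[0], node[1])]
    pairs = []
    for child in node:
        pairs.extend(_collect_pairs(child))
    return pairs


def decode_predefined_index(predefined_index):
    """Return the shard index as a flat list of (offset, nbytes) pairs.

    Hypothetical helper. Assumes the simplest index_codecs for each form.
    """
    if isinstance(predefined_index, str):
        # BASE64 form. Assumption: index_codecs is a single little-endian
        # `bytes` codec over uint64 values, with no checksum appended.
        raw = base64.b64decode(predefined_index)
        if len(raw) % 16 != 0:
            raise ValueError("index length is not a multiple of 16 bytes")
        values = struct.unpack(f"<{len(raw) // 8}Q", raw)
        return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    # JSON array form. Assumption: no array -> array codecs were applied.
    return _collect_pairs(predefined_index)


def uses_configuration_index(codecs):
    """Return True if any sharding codec, at any nesting level, uses
    index_location "configuration", so the array may be treated as read-only."""
    for codec in codecs:
        if codec.get("name") != "sharding_indexed":
            continue
        config = codec.get("configuration", {})
        if config.get("index_location") == "configuration":
            return True
        if uses_configuration_index(config.get("codecs", [])):
            return True
    return False


# Both forms of the same two-chunk index decode to the same pairs.
json_form = [[0, 100], [100, 50]]
base64_form = base64.b64encode(struct.pack("<4Q", 0, 100, 100, 50)).decode("ascii")
assert decode_predefined_index(json_form) == decode_predefined_index(base64_form)

# An outer shard with index_location "end" wrapping an inner configuration index.
nested_codecs = [{
    "name": "sharding_indexed",
    "configuration": {
        "index_location": "end",
        "codecs": [{
            "name": "sharding_indexed",
            "configuration": {
                "index_location": "configuration",
                "predefined_index": json_form,
            },
        }],
    },
}]
assert uses_configuration_index(nested_codecs)
```

A writer could call something like `uses_configuration_index` before accepting a write and raise an error rather than risk leaving the predefined index out of sync with the stored chunks.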