@mkitti
Contributor

@mkitti mkitti commented Nov 24, 2025

Abstract

This pull request enhances the sharding codec specification by introducing a new mechanism for storing the shard index and clarifying several related behaviors. Key changes include:

  • Introduction of configuration index_location: A new configuration value for index_location is introduced, allowing the shard index to be defined directly within the codec's configuration parameters in the array metadata.
  • New predefined_index parameter: When index_location is configuration, the predefined_index parameter specifies the shard index data, which can be either a JSON array or a BASE64 encoded string.
  • Conditional index_codecs application: Clarifications have been added on how index_codecs apply based on the type of predefined_index. If predefined_index is a BASE64 string, index_codecs are used to interpret the binary content. If it is a JSON array, index_codecs may contain array -> array transformations but MUST NOT contain an array -> bytes codec.
  • Read-Only Consideration: Implementations MAY consider arrays using index_location: configuration (at any nesting level) to be read-only due to the complexities of updating such an index.
  • Nested Shards Flexibility: The specification now clarifies that nested shards MAY use the configuration value for index_location.
  • Version Bump: The version number is bumped to 1.1 to reflect these changes.

Configuration Index

The configuration index location provides a mechanism to embed the shard index directly within the sharding_indexed codec's configuration object in the Zarr array metadata (zarr.json). This is particularly useful for predefined, fixed, or read-only datasets where the index does not change.

The index data is provided via the predefined_index parameter within the codec configuration. This parameter can take two forms:

  • A JSON array: This directly represents the decoded index (e.g., an array of [offset, nbytes] pairs). In this case, index_codecs may contain array -> array codecs to transform the conceptual [N,2] index to the JSON structure, but MUST NOT contain an array -> bytes codec.
  • A BASE64 encoded string: This represents the binary output of the index_codecs chain applied to the index array. Implementations should decode the BASE64 string and then apply the index_codecs in reverse to obtain the index.
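As a rough illustration of the two forms, the sketch below decodes either one into the same list of (offset, nbytes) pairs. It is not taken from the specification text: the function name decode_predefined_index is hypothetical, the index is assumed to be a flat [N,2] array, and index_codecs is assumed to be either empty (JSON array form) or a single little-endian bytes codec with no checksum (BASE64 form).

```python
import base64
import struct


def decode_predefined_index(predefined_index):
    """Return a list of (offset, nbytes) pairs from either accepted form.

    Assumption: the BASE64 form holds little-endian uint64 values with no
    checksum and no array-to-array codecs applied.
    """
    if isinstance(predefined_index, str):
        raw = base64.b64decode(predefined_index)
        if len(raw) % 16 != 0:
            raise ValueError("index must be a whole number of uint64 pairs")
        values = struct.unpack(f"<{len(raw) // 8}Q", raw)
        return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    # JSON array form: the value already is the decoded index.
    return [(int(offset), int(nbytes)) for offset, nbytes in predefined_index]


# The same three-entry index written in both forms.
json_form = [[0, 100], [100, 50], [150, 75]]
raw_index = struct.pack("<6Q", 0, 100, 100, 50, 150, 75)
base64_form = base64.b64encode(raw_index).decode("ascii")

assert decode_predefined_index(json_form) == decode_predefined_index(base64_form)
```

A real implementation would run the full index_codecs chain in reverse rather than hard-coding the byte layout.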

Shard Index as an Array

index_codecs was suggested to begin with a bytes array-to-bytes codec, implying that the shard index is an array. Stating explicitly that the shard index is an array makes it clearer that array-to-array codecs could be used to manipulate the shard index format before it is converted to bytes (for start/end/BASE64 cases) or represented directly as a JSON array.

Potential applications:

  • The transpose codec could be used to have a consecutive offsets subarray followed by a consecutive nbytes subarray. This would more closely correspond with TIFF's TILEOFFSETS and TILEBYTECOUNTS tags.
  • Describing the distinct offset or nbytes subarrays as an arithmetic sequence or as repeated elements.
  • Shrinking the shard index by casting the array from uint64 to a smaller unsigned integer such as uint32, uint16, or uint8.
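A minimal sketch of the transpose and cast ideas from the list above, using plain Python instead of real Zarr codecs. The sample index values are made up for illustration, and the narrowing to uint16 only works here because every value fits in 16 bits.

```python
import struct

# Hypothetical [N, 2] shard index: one (offset, nbytes) pair per inner chunk.
index = [(0, 100), (100, 50), (150, 75), (225, 100)]

# "Transpose": all offsets first, then all nbytes, similar in spirit to
# TIFF's separate offset and byte-count tags.
offsets, nbytes = (list(column) for column in zip(*index))
transposed = offsets + nbytes

# Serialize as uint64 (Q) and as uint16 (H), both little-endian.
# struct.pack raises struct.error if a value does not fit in the target type.
as_uint64 = struct.pack(f"<{len(transposed)}Q", *transposed)
as_uint16 = struct.pack(f"<{len(transposed)}H", *transposed)

print(offsets)
print(nbytes)
print(len(as_uint64), len(as_uint16))
```

With eight values, the uint64 encoding takes 64 bytes and the uint16 encoding takes 16 bytes.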

Nested Shards

The specification now explicitly allows nested shards to utilize an index_location value of configuration. This provides greater flexibility in defining hierarchical data structures. However, due to the difficulty of updating an index stored in the array metadata, implementations MAY consider any array using "index_location": "configuration" (at any level of nesting) to be read-only. Writing to such an array may produce an error or lead to a corrupted state if the written data would require a change to the predefined index.
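One way an implementation might detect this case is to walk the codec chain recursively. The sketch below is illustrative only: the function name uses_configuration_index and the metadata fragment are hypothetical, and the predefined index is shown as a flat [N,2] list with made-up values.

```python
def uses_configuration_index(codecs):
    """Return True if any sharding_indexed codec, at any nesting depth,
    sets index_location to "configuration"."""
    for codec in codecs:
        if codec.get("name") != "sharding_indexed":
            continue
        config = codec.get("configuration", {})
        if config.get("index_location") == "configuration":
            return True
        # Recurse into the inner codec chain to catch nested shards.
        if uses_configuration_index(config.get("codecs", [])):
            return True
    return False


# Hypothetical fragment: an outer shard with its index at the end, wrapping
# an inner shard whose index lives in the codec configuration.
bytes_codec = {"name": "bytes", "configuration": {"endian": "little"}}
array_codecs = [
    {
        "name": "sharding_indexed",
        "configuration": {
            "chunk_shape": [4, 4],
            "index_location": "end",
            "index_codecs": [bytes_codec],
            "codecs": [
                {
                    "name": "sharding_indexed",
                    "configuration": {
                        "chunk_shape": [2, 2],
                        "index_location": "configuration",
                        "index_codecs": [],
                        "predefined_index": [[0, 16], [16, 16], [32, 16], [48, 16]],
                        "codecs": [bytes_codec],
                    },
                }
            ],
        },
    }
]

# An implementation could open the array read-only when this is True.
read_only = uses_configuration_index(array_codecs)
```

Treating the array as read-only is permitted (MAY), not required, so an implementation could instead allow writes that leave the predefined index unchanged.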

@mkitti mkitti changed the title Clarify nesting, external index location Clarify nesting. Add external index location Nov 24, 2025
@jbms
Contributor

jbms commented Nov 24, 2025

I can see that in some cases this format would be useful. However this proposal is

@jbms
Contributor

jbms commented Nov 24, 2025

I can see that a separate file may be useful in some cases. However that is not possible under the existing codec model. We would need to define something more general, e.g. an array -> files codec. That is something I mentioned before in the context of making the chunk_grid itself a part of the codec pipeline rather than a separate higher level thing.

@mkitti
Contributor Author

mkitti commented Nov 24, 2025

Could we implement the external shard indices via a storage transformer?

The storage transformer looks for .shard_index files and then prepends or appends them to the byte values of the corresponding key?

@mkitti
Contributor Author

mkitti commented Nov 24, 2025

Could we implement the external shard indices via a storage transformer?

The storage transformer looks for .shard_index files and then prepends or appends them to the byte values of the corresponding key?

I think this might work for reading, but a storage transformer might need additional information for writing.
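A rough sketch of the read path for that idea, using a plain dict as the underlying store. Everything here is hypothetical: the class name ConcatIndexStore, the dict-backed store, and the key layout are assumptions for illustration, and this is not the API of any Zarr implementation or extension.

```python
class ConcatIndexStore:
    """Read-only wrapper that joins a sidecar index onto each shard's bytes.

    Hypothetical sketch: looks up key + ".shard_index" and appends it to
    (or, with index_first=True, prepends it to) the bytes stored under key.
    """

    def __init__(self, store, suffix=".shard_index", index_first=False):
        self.store = store
        self.suffix = suffix
        self.index_first = index_first

    def get(self, key):
        data = self.store[key]
        index = self.store.get(key + self.suffix)
        if index is None:
            # No sidecar file: pass the bytes through unchanged.
            return data
        return index + data if self.index_first else data + index


store = {"c/0/0": b"CHUNKDATA", "c/0/0.shard_index": b"INDEX"}
wrapped = ConcatIndexStore(store)
assert wrapped.get("c/0/0") == b"CHUNKDATA" + b"INDEX"
```

Writing is harder because the transformer would need to know where to split the combined bytes back into data and index, which is the additional information mentioned above.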

@mkitti
Contributor Author

mkitti commented Nov 24, 2025

I proposed the storage transformer concat-parts as a zarr-extension pull request:
zarr-developers/zarr-extensions#39

@mkitti mkitti force-pushed the mkitti-sharding-indexed-external-index branch from acd53d6 to 4d96684 Compare November 30, 2025 06:28
@mkitti
Contributor Author

mkitti commented Nov 30, 2025

I have removed the "external" index_location from this pull request.

Instead, I have proposed a "configuration" index_location where the shard index is defined in the codec configuration as a "predefined_index" parameter. At the moment, I have written that the parameter may either be a JSON array itself or a BASE64 encoded string.

Allowing the "predefined_index" parameter to be a JSON array adds some complications. How should index_codecs be interpreted in that case? Here I specified that index_codecs in that case should not end in conversion to bytes but either be empty or only contain array-to-array codecs.

@mkitti mkitti changed the title Clarify nesting. Add external index location Clarify nesting. Add "configuration" as an index_location Nov 30, 2025
@LDeakin
Member

LDeakin commented Dec 3, 2025

Could you clarify the purpose of "predefined_index"? Is it for some other file type to masquerade as a zarr chunk?

If every single shard has the same shard index then:

  • Subchunks in each shard must be at the same offset/size, so no compression is feasible and the order is constrained
  • The only compatible chunk grid is regular, because you can't have shards with a different shape
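A small illustration of the first point, assuming zlib as a stand-in compressor and made-up chunk contents: two chunks of the same uncompressed size generally compress to different lengths, so a single shared (offset, nbytes) index could not describe both shards.

```python
import zlib

# Two hypothetical 1024-byte subchunks with different contents.
chunk_a = bytes(1024)              # all zeros
chunk_b = bytes(range(256)) * 4    # repeating 0..255 pattern

size_a = len(zlib.compress(chunk_a))
size_b = len(zlib.compress(chunk_b))

# The compressed sizes differ, so the nbytes entries (and every later
# offset) would differ between the two shards.
print(size_a, size_b)
```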
