66 changes: 66 additions & 0 deletions chunk-grids/rectilinear/README.md
@@ -0,0 +1,66 @@
# Rectilinear chunk grid

## Metadata

| field | type | required |
| - | - | - |
| `"name"` | Literal `"rectilinear"` | yes |
| `"configuration"` | [Configuration](#configuration) | yes |

### Configuration

| field | type | required | notes |
| - | - | - | - |
| `"kind"` | Literal `"inline"` | yes | |
| `"chunk_shapes"` | array of [Chunk edge lengths](#chunk-edge-lengths) | yes | The length of `"chunk_shapes"` MUST match the number of dimensions of the array. |

#### Chunk edge lengths

The edge lengths of the chunks along an array axis `A` are represented by an array that can contain two types of elements:
Contributor

A previous proposal for rectilinear chunking also allowed a plain integer here, to indicate uniform chunking along the dimension. That makes this strictly a generalization of regular chunking and seems like a good idea to include.

While the run-length encoding makes regular chunking along a dimension still efficient to specify, a plain integer indicating uniform chunking has the advantage of allowing the dimension to be resized without also specifying the new chunk sizes.

Contributor

Potentially you could allow the last item in the chunk_shapes entry for a given dimension to be [n, -1] to mean that all remaining chunks are size n --- that would allow the dimension to be resized since it would indicate the chunk size for new chunks.

That would reduce the need for also allowing a plain integer to specify uniform chunking, but it is still probably good to allow a plain integer.

Contributor Author

we removed the plain integer form because it was ambiguous -- for a shape (10,), does {"chunk_shapes": [3]} expand to [3, 3, 3, 1] or [3, 3, 3, 3]? I think we can restore the single integer form if we can resolve this ambiguity.

For drop-in compatibility with the regular chunk grid, we do need to support some way of expressing chunks that overhang the grid of the array. But we might consider supporting this compatibility via something more explicit than the single-integer form.

And for resizing, I agree that it's convenient if resizing doesn't require re-defining the chunk grid, but for many resize operations with rectilinear chunks the appended data will have different chunking, and so any resize API for rectilinear chunks will need to support declaring new chunk shapes. So perhaps a convenient method for resizing with an automatic default chunk shape could be offered by an application, even if the chunk_grid is modified in any case.

A related question is whether we support declaring chunk shapes that substantially overflow the shape of the array, e.g. for a shape (10,), [3, 3, 3, 3, 3] which has a length of 15. Without a clear use case, I'm inclined against supporting this here. Curious to hear what you think.

Contributor Author

> Without a clear use case, I'm inclined against supporting this here. Curious to hear what you think.

Some feedback from folks at the Zarr summit was that we could allow declaring a sequence of chunks that exceeds the shape of the array, but disallow a sequence that would underfill the array. That seems reasonable to me.

Contributor

> we removed the plain integer form because it was ambiguous -- for a shape (10,), does {"chunk_shapes": [3]} expand to [3, 3, 3, 1] or [3, 3, 3, 3]? I think we can restore the single integer form if we can resolve this ambiguity.

I don't see where the 1 would come from. I imagined it would expand to [3, 3, 3, 3]. The point is that it would work the same as the regular grid.

> For drop-in compatibility with the regular chunk grid, we do need to support some way of expressing chunks that overhang the grid of the array. But we might consider supporting this compatibility via something more explicit than the single-integer form.
>
> And for resizing, I agree that it's convenient if resizing doesn't require re-defining the chunk grid, but for many resize operations with rectilinear chunks the appended data will have different chunking, and so any resize API for rectilinear chunks will need to support declaring new chunk shapes. So perhaps a convenient method for resizing with an automatic default chunk shape could be offered by an application, even if the chunk_grid is modified in any case.
>
> A related question is whether we support declaring chunk shapes that substantially overflow the shape of the array, e.g. for a shape (10,), [3, 3, 3, 3, 3] which has a length of 15. Without a clear use case, I'm inclined against supporting this here. Curious to hear what you think.

Yes, definitely should support overhanging the array bounds, for compatibility with regular grid, to achieve any necessary chunk size constraints, and to create room for resizing without specifying new chunk sizes.

Contributor

> > Without a clear use case, I'm inclined against supporting this here. Curious to hear what you think.
>
> Some feedback from folks at the Zarr summit was that we could allow declaring a sequence of chunks that exceeds the shape of the array, but disallow a sequence that would underfill the array. That seems reasonable to me.

Agreed.

Contributor Author

> I don't see where the 1 would come from. I imagined it would expand to [3, 3, 3, 3]. The point is that it would work the same as the regular grid.

The 1 would come from trimming the final chunk to match the array's shape exactly. I will bring back the single-integer form, and explain this behavior, which will remove the ambiguity.
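
For readers following the thread, the two readings of the single-integer form can be sketched as below. This is only an illustration: `expandUniform` and its `trim` flag are hypothetical names, not part of the specification or of any Zarr implementation.

```javascript
// Hypothetical helper: expand a single integer edge length over one axis.
// trim = true shortens the final chunk so it ends at the array boundary;
// trim = false keeps every chunk full-size, so the last one may overhang.
function expandUniform(edgeLength, axisLength, trim) {
  const count = Math.ceil(axisLength / edgeLength);
  const chunks = Array(count).fill(edgeLength);
  if (trim && count > 0) {
    // Whatever is left of the axis after the preceding full-size chunks.
    chunks[count - 1] = axisLength - edgeLength * (count - 1);
  }
  return chunks;
}

console.log(expandUniform(3, 10, true));  // [3, 3, 3, 1]
console.log(expandUniform(3, 10, false)); // [3, 3, 3, 3]
```

The untrimmed reading matches the behavior of the regular chunk grid described above, where the final chunk overhangs the array bounds.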

- an integer that explicitly denotes an edge length.
- an array that denotes a [run-length encoded](#run-length-encoding) sequence of integers, each of which denotes an edge length.

The sum of the edge lengths MUST match the length of the array along the axis `A`.
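
A minimal sketch of how an implementation might check this rule without expanding the run-length encoded elements first. The function names are hypothetical and chosen only for illustration; the sketch assumes each element is either an integer or a length-2 array `[V, N]`.

```javascript
// Hypothetical helper: total length covered by one axis entry of "chunk_shapes".
// Each element is either an integer edge length or a run-length pair [V, N].
function totalEdgeLength(edgeLengths) {
  let total = 0;
  for (const element of edgeLengths) {
    if (Array.isArray(element)) {
      const [value, count] = element;
      total += value * count; // N repetitions of edge length V
    } else {
      total += element;
    }
  }
  return total;
}

// Hypothetical check for the MUST rule: the sum must equal the axis length.
function matchesAxisLength(edgeLengths, axisLength) {
  return totalEdgeLength(edgeLengths) === axisLength;
}

console.log(matchesAxisLength([1, [2, 1], 3], 6)); // true: 1 + 2 + 3
console.log(matchesAxisLength([[2, 3]], 7));       // false: covers 6, not 7
```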

#### Run-length encoding

This specification defines a JSON representation for run-length encoded sequences.

A run-length encoded sequence of `N` repetitions of some value `V` is denoted by the length-2 JSON array `[V, N]`.

For example, the sequence `[1, 1, 1, 1, 1]` becomes `[1, 5]` after applying this run-length encoding.
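
Decoding can be sketched as follows. `expandEdgeLengths` is a hypothetical name used only for this illustration; it assumes the input mixes plain integers with `[V, N]` pairs, as described under chunk edge lengths.

```javascript
// Hypothetical helper: expand one axis entry into a flat list of edge lengths.
function expandEdgeLengths(encoded) {
  const expanded = [];
  for (const element of encoded) {
    if (Array.isArray(element)) {
      const [value, count] = element; // run-length pair [V, N]
      for (let i = 0; i < count; i++) {
        expanded.push(value);
      }
    } else {
      expanded.push(element); // plain integer: one explicit edge length
    }
  }
  return expanded;
}

console.log(expandEdgeLengths([[1, 5]]));    // [1, 1, 1, 1, 1]
console.log(expandEdgeLengths([[1, 3], 3])); // [1, 1, 1, 3]
```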

## Resolving

## Example

This example demonstrates 5 different ways of specifying a rectilinear chunk grid for an array with shape `(6, 6, 6, 6, 6)`.

```javascript
{
  ...
  "shape": [6, 6, 6, 6, 6],
  "chunk_grid": {
    "name": "rectilinear",
    "configuration": {
      "kind": "inline",
      "chunk_shapes": [
        [[2, 3]],        // expands to [2, 2, 2]
        [[1, 6]],        // expands to [1, 1, 1, 1, 1, 1]
        [1, [2, 1], 3],  // expands to [1, 2, 3]
        [[1, 3], 3],     // expands to [1, 1, 1, 3]
        [6]              // expands to [6]
      ]
    }
  }
}
```
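
As a rough illustration of what a reader could derive from this metadata, the sketch below expands each axis and computes the number of chunks per axis and the start offset of each chunk. The helper names `expand` and `chunkOffsets` are hypothetical, and nothing here is prescribed by the specification.

```javascript
// "chunk_shapes" from the example above.
const chunkShapes = [
  [[2, 3]],
  [[1, 6]],
  [1, [2, 1], 3],
  [[1, 3], 3],
  [6],
];

// Hypothetical helper: expand integers and [V, N] pairs into a flat list.
function expand(encoded) {
  return encoded.flatMap((el) =>
    Array.isArray(el) ? Array(el[1]).fill(el[0]) : [el]
  );
}

// Hypothetical helper: start offset of every chunk along one axis (running sum).
function chunkOffsets(edgeLengths) {
  const offsets = [];
  let start = 0;
  for (const length of edgeLengths) {
    offsets.push(start);
    start += length;
  }
  return offsets;
}

const expanded = chunkShapes.map(expand);
const gridShape = expanded.map((axis) => axis.length);

console.log(gridShape);                 // [3, 6, 3, 4, 1]
console.log(chunkOffsets(expanded[2])); // [0, 1, 3]
```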

## Prior work

A scheme for rectilinear chunking was proposed in a [Zarr Enhancement Proposal](https://zarr.dev/zeps/draft/ZEP0003.html) (ZEP). The specification presented here builds on ZEP 0003 and adapts it to Zarr V3.

Key differences between this specification and ZEP 0003:
- This specification adds run-length encoding for integer sequences.
- This specification uses the key `"chunk_shapes"` in the `configuration` field, while ZEP 0003 uses the key `"chunk_shape"`.
- ZEP 0003 defines a meaning for single-integer elements of its `chunk_shape` metadata: `"chunk_shape" : [10]` declares a sequence of chunks with length 10 repeated to match the shape of the array. While convenient, this specification avoids the single-integer form because its handling of the final chunk is ambiguous when the edge length does not evenly divide the length of the array.