
Refactor for compatibility with upcoming Codecs Release #2

Merged
tpietzsch merged 12 commits into tpietzsch:readdata from cmhulbert:readdata
May 15, 2025

Conversation

@cmhulbert

Some modifications I found useful when using the ReadData structure to handle multiple codecs. A brief overview:

  • Compression.decode no longer accepts decodedLength
  • DataBlockCodecFactory moved out of DataType, into N5Codecs
  • DatasetAttributes uses an N5DataBlockCodecFactory method to determine the correct DataBlockCodec
  • Added StringDataCodec, ObjectDataCodec
  • Refactor Default/StringObjectDataBlockCodec to share common logic (only the header really differs)
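For illustration, the factory-method idea in the overview could look roughly like this. This is a hypothetical, heavily simplified sketch: the names mirror the PR (`DataBlockCodec`, `DataBlockCodecFactory`), but the stub types and signatures are invented for the example and differ from the real N5 API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, heavily simplified sketch of the factory-method strategy:
// the attributes object asks a codec factory for the DataBlockCodec matching
// its data type. Names mirror the PR; real signatures differ.
interface DataBlockCodec<T> {
    byte[] encode(T data);
}

interface DataBlockCodecFactory {
    DataBlockCodec<?> create(String dataType);
}

class CodecFactorySketch implements DataBlockCodecFactory {
    @Override
    public DataBlockCodec<?> create(final String dataType) {
        if ("string".equals(dataType)) {
            // toy codec: just the UTF-8 bytes of the string
            return (DataBlockCodec<String>) s -> s.getBytes(StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("unsupported data type: " + dataType);
    }
}

class AttributesSketch {
    private final DataBlockCodec<?> codec;

    AttributesSketch(final String dataType, final DataBlockCodecFactory factory) {
        this.codec = factory.create(dataType); // codec derived from the attributes
    }

    DataBlockCodec<?> codec() {
        return codec;
    }
}

public class FactoryMethodDemo {
    public static void main(final String[] args) {
        final AttributesSketch attrs = new AttributesSketch("string", new CodecFactorySketch());
        System.out.println(attrs.codec() != null); // true
    }
}
```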

…ectDataBlockCodec

DataCodecs now create the access object and return it during `deserialize`.
N5BlockCodec uses dataType (and potentially other DatasetAttributes) to wrap the desired DataBlockCodec.
@tpietzsch
Owner

tpietzsch commented Mar 18, 2025

  • I'm happy with the core change of streamlining the N5Codecs implementations (AbstractDataBlockCodec etc.).

    I would maybe rename the encodeBlockHeader method to createBlockHeader. The name encodeBlockHeader suggests a false analogy with decodeBlockHeader (decodeBlockHeader reads from the InputStream, while encodeBlockHeader does not write to the OutputStream).

    Does AbstractDataBlockCodec have to be public? Does it have to expose its internals through

     public DataBlockFactory<T> getDataBlockFactory() {..}
     public DataCodec<T> getDataCodec() {..}
     public Compression getCompression() {..}
    

    I would try to reduce visibility as much as possible here. (Maybe just make the field package-private or protected and remove the getters.)

  • I'm not sure about the removal of the decodedLength argument from Compression.decode. It might have performance implications, but I didn't benchmark, and we can always add it back later when we do.

  • DataBlock.createDataBlockCodec(...) was migrated to N5Codecs.createDataBlockCodec(...). Please also migrate and adapt the javadoc instead of just deleting it.

  • Please also propagate your changes to related PRs:
    ReadData and DataBlockCodec JaneliaSciComp/n5-zstandard#4
    ReadData and DataBlockCodec saalfeldlab/n5-blosc#11
    ReadData and DataBlockCodec saalfeldlab/n5-zarr#63
    For example, your removal of the DatasetAttributes constructor

      	protected DatasetAttributes(
      			final long[] dimensions,
      			final int[] blockSize,
      			final DataType dataType,
      			final Compression compression,
      			final DataBlockCodec<?> dataBlockCodec) {...}

    will break n5-zarr because ZarrDatasetAttributes extends DatasetAttributes and calls that constructor.

  • In general keep the bigger picture in mind, in particular with changes to public API. Your changes remove the DataBlock.createDataBlock(...) method that is already currently used in (at least) n5-hdf5, n5-imglib2, and bigdataviewer-core. Please revert that.

  • As mentioned on Slack, for String[] we will need two DataCodec implementations: one for N5 encoded Strings, one for Zarr encoded Strings. Maybe we just put them both into the DataCodec class, although the Zarr one will only be used in n5-zarr? (after all, we also have the LITTLE_ENDIAN codecs in DataCodec although they are not used in n5 core...) Here is the code for the Zarr String[] https://github.com/saalfeldlab/n5-zarr/pull/63/files#diff-4817f74de9d1792ca96a27ca68897c24af33590a34ea40e9d05edb63dc1c1683R127-R162. I think this will come up naturally if you look into adapting ReadData and DataBlockCodec saalfeldlab/n5-zarr#63 to your changes.
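On the visibility point above, the suggested direction could be sketched as follows. This is a hypothetical minimal example (stub interfaces stand in for the real N5 types, and the real AbstractDataBlockCodec has more members): the fields become protected and the public getters disappear, so only the codec hierarchy itself can reach the internals.

```java
// Hypothetical sketch of the suggested visibility reduction: fields are
// protected and the public getters are gone, so only subclasses reach the
// internals. Stub types stand in for the real N5 interfaces.
interface DataCodecStub<T> {}
interface CompressionStub {}

abstract class AbstractCodecSketch<T> {
    protected final DataCodecStub<T> dataCodec;
    protected final CompressionStub compression;

    protected AbstractCodecSketch(final DataCodecStub<T> dataCodec, final CompressionStub compression) {
        this.dataCodec = dataCodec;
        this.compression = compression;
    }
    // no public getDataCodec()/getCompression(): nothing outside the
    // hierarchy needs these internals
}

class RawCodecSketch<T> extends AbstractCodecSketch<T> {
    RawCodecSketch(final DataCodecStub<T> dc, final CompressionStub c) {
        super(dc, c);
    }

    boolean isCompressed() {
        return compression != null; // subclass access to the field still works
    }
}

public class VisibilityDemo {
    public static void main(final String[] args) {
        final RawCodecSketch<int[]> codec = new RawCodecSketch<>(new DataCodecStub<int[]>() {}, null);
        System.out.println(codec.isCompressed()); // false
    }
}
```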

@tpietzsch
Owner

I left the N5BlockCodec there to see what you think of the factory method strategy to get that from the DatasetAttributes. Let me know

I think this doesn't really help to explore the options much, because it is just a static method. It would be more interesting to have an actual factory instance and see how factories compose and how to get it into DatasetAttributes.

Anyway, currently N5BlockCodec is pointless so I would remove it for now. (and inline the N5BlockCodecFactory.fromDatasetAttributes(...) into the DatasetAttributes constructor.)

@cmhulbert
Author

Largely addressed all your feedback. Highlights (mostly reiterating the commit messages):

  • Reverted:

    • removal of the public static createDataBlock methods from DataType.
    • removal of the protected DatasetAttributes constructor with DataBlockCodec parameter.
    • N5BlockCodec
  • N5Codecs internals mostly private now

  • renamed encodeBlockHeader -> createBlockHeader

  • add ZarrStringDataCodec

On the last point, let me know if you have opinions about naming. It feels weird to tie the name to Zarr in N5 core, but if that's what it is, it is probably the best name. We could move it to Zarr, but if we want to re-use the DataCodec hierarchy, we'd need to expose the constructors, at least as protected. Also, I kept the ByteOrder parameter for the ZarrStringDataCodec. It was hardcoded to LITTLE_ENDIAN in the link you posted to n5-zarr, so if it isn't supposed to be configurable, we can remove it. I assumed that since Zarr supports various byte orders, it would probably be needed eventually.

@cmhulbert
Author

cmhulbert commented Apr 8, 2025

I'm not sure about the removal of the decodedLength argument from Compression.decode. It might have performance implications, but I didn't benchmark, and we can always add it back later when we do.

Regarding this: I'm not sure from a performance standpoint, but I found in practice it was not functionally useful in the way I wanted it to be. Ideally, I had hoped I could tell the ReadData the target result size, say of the output of a decompressed gzip stream. In practice, this was not possible: gzip decompression only reports the length of the decompressed data after it is done reading. This meant that if I relied on decodedLength to tell the gzip decompressor to stop early, that would not happen. Instead, it would read beyond where it should have, find invalid gzip data, and throw an exception.

It may be that this is not good behavior on the gzip decompressor's part, but it is the behavior I experienced.

That was the only place I found it used, so I removed it for now, to avoid a false sense of security. But if it is useful more broadly without leading us astray, I'm happy to add it back in.
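The gzip behavior described here is easy to reproduce with the JDK's streams: the decompressed length of a gzip stream is only known once it has been read to its end, so a target length cannot be handed to the decompressor up front. A small self-contained illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrates that the decompressed length of a gzip stream is only known
// after reading it to the end; there is no way to query it up front.
public class GzipLengthDemo {

    static byte[] gzip(final byte[] data) throws IOException {
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    static long gunzipLength(final byte[] compressed) throws IOException {
        long total = 0;
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            final byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n; // length accumulates only as we read
            }
        }
        return total;
    }

    public static void main(final String[] args) throws IOException {
        final byte[] compressed = gzip(new byte[1000]);
        System.out.println(gunzipLength(compressed)); // 1000, known only after a full read
    }
}
```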

@tpietzsch tpietzsch merged commit b2cefb7 into tpietzsch:readdata May 15, 2025
@tpietzsch
Owner

Also, I kept the ByteOrder parameter for the ZarrStringDataCodec. It was hardcoded to LITTLE_ENDIAN in the link you posted to n5-zarr, so if it isn't supposed to be configurable, we can remove that. I assumed since Zarr supports various byte orders, it was probably going to be needed eventually.

I removed the ByteOrder argument. Zarr uses the numpy file format, which fixes the byte order of the int32 values that encode string lengths to LITTLE_ENDIAN. The encoded strings themselves are UTF-8 byte streams, so they are independent of endianness.
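As an illustration of that layout (a sketch, not the actual n5-zarr code): each string is written as a LITTLE_ENDIAN int32 byte length followed by its UTF-8 bytes, so only the length prefix is endianness-sensitive.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch (not the actual n5-zarr code) of the layout described
// above: each string is a little-endian int32 byte length followed by its
// UTF-8 bytes, so only the length prefix depends on byte order.
public class ZarrStringLayoutSketch {

    static byte[] encode(final String[] strings) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (final String s : strings) {
            final byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            final ByteBuffer len = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
            len.putInt(utf8.length);
            out.write(len.array()); // 4-byte little-endian length prefix
            out.write(utf8);        // endianness-independent UTF-8 payload
        }
        return out.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final byte[] encoded = encode(new String[] { "ab", "c" });
        // 4-byte length (2) + "ab" + 4-byte length (1) + "c" = 11 bytes
        System.out.println(encoded.length); // 11
        System.out.println(encoded[0]);     // 2 (low byte of the little-endian length)
    }
}
```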

tpietzsch pushed a commit that referenced this pull request May 15, 2025
* refactor: don't pass `decodedLength` to ReadData

* refactor: move DataBlockFactory/DataBlockCodecFactory to N5Codecs

* feat: Add StringDataCodec, ObjectDataCodec, StringDataBlockCodec, ObjectDataBlockCodec

DataCodecs now create the access object and return it during `deserialize`

* refactor: add AbstractDataBlock to extract shared logic between Default/String/Object blocks

* refactor: DatasetAttributes responsible for DataBlockCodec creation

N5BlockCodec uses dataType (and potentially other DatasetAttributes) to wrap the desired DataBlockCodec

* revert: add back createDataBlock logic in DataType

* doc: retain javadoc from before refactor

* revert: keep protected constructor with DataBlockCodec parameter

refactor: inline createDataBlockCodec from constructor params

* refactor: remove currently unused N5BlockCodec. Something like this may be needed when multiple codecs are supported

* refactor: don't expose N5Codecs internals

* refactor: rename encodeBlockHeader -> createBlockHeader

* feat: add ZarrStringDataCodec support
