
Refactor for compatibility with upcoming Codecs Release #2

Merged
tpietzsch merged 12 commits into tpietzsch:readdata from cmhulbert:readdata
May 15, 2025

Conversation

@cmhulbert

Some modifications I found useful when using the ReadData structure to handle multiple codecs. A brief overview:

  • Compression.decode no longer accepts decodedLength
  • DataBlockCodecFactory moved out of DataType, into N5Codecs
  • DatasetAttributes uses an N5DataBlockCodecFactory method to determine the correct DataBlockCodec
  • Added StringDataCodec, ObjectDataCodec
  • Refactor Default/StringObjectDataBlockCodec to share common logic (only the header really differs)
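For illustration, the factory-method idea in the overview could look roughly like this. This is a hypothetical, heavily simplified sketch: the names mirror the PR (`DataBlockCodec`, `DataBlockCodecFactory`), but the stub types and signatures are invented for the example and differ from the real N5 API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, heavily simplified sketch of the factory-method strategy:
// the attributes object asks a codec factory for the DataBlockCodec matching
// its data type. Names mirror the PR; real signatures differ.
interface DataBlockCodec<T> {
    byte[] encode(T data);
}

interface DataBlockCodecFactory {
    DataBlockCodec<?> create(String dataType);
}

class CodecFactorySketch implements DataBlockCodecFactory {
    @Override
    public DataBlockCodec<?> create(final String dataType) {
        if ("string".equals(dataType)) {
            // toy codec: just the UTF-8 bytes of the string
            return (DataBlockCodec<String>) s -> s.getBytes(StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("unsupported data type: " + dataType);
    }
}

class AttributesSketch {
    private final DataBlockCodec<?> codec;

    AttributesSketch(final String dataType, final DataBlockCodecFactory factory) {
        this.codec = factory.create(dataType); // codec derived from the attributes
    }

    DataBlockCodec<?> codec() {
        return codec;
    }
}

public class FactoryMethodDemo {
    public static void main(final String[] args) {
        final AttributesSketch attrs = new AttributesSketch("string", new CodecFactorySketch());
        System.out.println(attrs.codec() != null); // true
    }
}
```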

…ectDataBlockCodec

DataCodecs now create the access object and return it during `deserialize`.
N5BlockCodec uses dataType (and potentially other DatasetAttributes) to wrap the desired DataBlockCodec.
@tpietzsch
Owner

tpietzsch commented Mar 18, 2025

  • I'm happy with the core change of streamlining the N5Codecs implementations (AbstractDataBlockCodec etc.).

    I would maybe rename the encodeBlockHeader method to createBlockHeader. The name encodeBlockHeader suggests a false analogy with decodeBlockHeader (decodeBlockHeader reads from the InputStream, while encodeBlockHeader does not write to the OutputStream).

    Does AbstractDataBlockCodec have to be public? Does it have to expose its internals through

     public DataBlockFactory<T> getDataBlockFactory() {..}
     public DataCodec<T> getDataCodec() {..}
     public Compression getCompression() {..}
    

    I would try to reduce visibility as much as possible here. (Maybe just make the field package-private or protected and remove the getters.)

  • I'm not sure about the removal of the decodedLength argument from Compression.decode. It might have performance implications, but I didn't benchmark, and we can always add it back later when we do.

  • DataBlock.createDataBlockCodec(...) was migrated to N5Codecs.createDataBlockCodec(...). Please also migrate and adapt the javadoc instead of just deleting it.

  • Please also propagate your changes to related PRs:
    ReadData and DataBlockCodec JaneliaSciComp/n5-zstandard#4
    ReadData and DataBlockCodec saalfeldlab/n5-blosc#11
    ReadData and DataBlockCodec saalfeldlab/n5-zarr#63
    For example, your removal of the DatasetAttributes constructor

      	protected DatasetAttributes(
      			final long[] dimensions,
      			final int[] blockSize,
      			final DataType dataType,
      			final Compression compression,
      			final DataBlockCodec<?> dataBlockCodec) {...}

    will break n5-zarr because ZarrDatasetAttributes extends DatasetAttributes and calls that constructor.

  • In general keep the bigger picture in mind, in particular with changes to public API. Your changes remove the DataBlock.createDataBlock(...) method that is already currently used in (at least) n5-hdf5, n5-imglib2, and bigdataviewer-core. Please revert that.

  • As mentioned on Slack, for String[] we will need two DataCodec implementations: one for N5 encoded Strings, one for Zarr encoded Strings. Maybe we just put them both into the DataCodec class, although the Zarr one will only be used in n5-zarr? (after all, we also have the LITTLE_ENDIAN codecs in DataCodec although they are not used in n5 core...) Here is the code for the Zarr String[] https://github.com/saalfeldlab/n5-zarr/pull/63/files#diff-4817f74de9d1792ca96a27ca68897c24af33590a34ea40e9d05edb63dc1c1683R127-R162. I think this will come up naturally if you look into adapting ReadData and DataBlockCodec saalfeldlab/n5-zarr#63 to your changes.
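On the visibility point above, the suggested direction could be sketched as follows. This is a hypothetical minimal example (stub interfaces stand in for the real N5 types, and the real AbstractDataBlockCodec has more members): the fields become protected and the public getters disappear, so only the codec hierarchy itself can reach the internals.

```java
// Hypothetical sketch of the suggested visibility reduction: fields are
// protected and the public getters are gone, so only subclasses reach the
// internals. Stub types stand in for the real N5 interfaces.
interface DataCodecStub<T> {}
interface CompressionStub {}

abstract class AbstractCodecSketch<T> {
    protected final DataCodecStub<T> dataCodec;
    protected final CompressionStub compression;

    protected AbstractCodecSketch(final DataCodecStub<T> dataCodec, final CompressionStub compression) {
        this.dataCodec = dataCodec;
        this.compression = compression;
    }
    // no public getDataCodec()/getCompression(): nothing outside the
    // hierarchy needs these internals
}

class RawCodecSketch<T> extends AbstractCodecSketch<T> {
    RawCodecSketch(final DataCodecStub<T> dc, final CompressionStub c) {
        super(dc, c);
    }

    boolean isCompressed() {
        return compression != null; // subclass access to the field still works
    }
}

public class VisibilityDemo {
    public static void main(final String[] args) {
        final RawCodecSketch<int[]> codec = new RawCodecSketch<>(new DataCodecStub<int[]>() {}, null);
        System.out.println(codec.isCompressed()); // false
    }
}
```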

@tpietzsch
Owner

I left the N5BlockCodec there to see what you think of the factory method strategy to get that from the DatasetAttributes. Let me know

I think this doesn't really help to explore the options much, because it is just a static method. It would be more interesting to have an actual factory instance and see how factories compose and how to get it into DatasetAttributes.

Anyway, currently N5BlockCodec is pointless so I would remove it for now. (and inline the N5BlockCodecFactory.fromDatasetAttributes(...) into the DatasetAttributes constructor.)

@cmhulbert
Author

Largely addressed all your feedback. Highlights (mostly reiterating the commit messages):

  • Reverted:

    • removal of the public static createDataBlock methods from DataType.
    • removal of the protected DatasetAttributes constructor with DataBlockCodec parameter.
    • N5BlockCodec
  • N5Codecs internals mostly private now

  • renamed encodeBlockHeader -> createBlockHeader

  • add ZarrStringDataCodec

On the last point, let me know if you have opinions about naming. It feels weird to tie the name to Zarr in N5 core, but if that's what it is, it is probably the best name. We could move it to Zarr, but if we want to re-use the DataCodec hierarchy, we'd need to expose the constructors, at least as protected. Also, I kept the ByteOrder parameter for the ZarrStringDataCodec. It was hardcoded to LITTLE_ENDIAN in the link you posted to n5-zarr, so if it isn't supposed to be configurable, we can remove it. I assumed that since Zarr supports various byte orders, it would probably be needed eventually.

@cmhulbert
Author

cmhulbert commented Apr 8, 2025

I'm not sure about the removal of the decodedLength argument from Compression.decode. It might have performance implications, but I didn't benchmark, and we can always add it back later when we do.

Regarding this: I'm not sure from a performance standpoint, but I found in practice it was not functionally useful in the way I wanted it to be. Ideally, I had hoped I could tell the ReadData the target result size, say of the output of a decompressed gzip stream. In practice, this was not possible: gzip decompression only reports the length of the decompressed data after it is done reading. This meant that if I relied on decodedLength to tell the gzip decompressor to stop early, that would not happen. Instead, it would read beyond where it should have, find invalid gzip data, and throw an exception.

It may be that this is not good behavior on the gzip decompressor's part, but it is the behavior I experienced.

That was the only place I found it used, so I removed it for now, to avoid a false sense of security. But if it is useful more broadly without leading us astray, I'm happy to add it back in.
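The gzip behavior described here is easy to reproduce with the JDK's streams: the decompressed length of a gzip stream is only known once it has been read to its end, so a target length cannot be handed to the decompressor up front. A small self-contained illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrates that the decompressed length of a gzip stream is only known
// after reading it to the end; there is no way to query it up front.
public class GzipLengthDemo {

    static byte[] gzip(final byte[] data) throws IOException {
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    static long gunzipLength(final byte[] compressed) throws IOException {
        long total = 0;
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            final byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n; // length accumulates only as we read
            }
        }
        return total;
    }

    public static void main(final String[] args) throws IOException {
        final byte[] compressed = gzip(new byte[1000]);
        System.out.println(gunzipLength(compressed)); // 1000, known only after a full read
    }
}
```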

@tpietzsch tpietzsch merged commit b2cefb7 into tpietzsch:readdata May 15, 2025
@tpietzsch
Owner

Also, I kept the ByteOrder parameter for the ZarrStringDataCodec. It was hardcoded to LITTLE_ENDIAN in the link you posted to n5-zarr, so if it isn't supposed to be configurable, we can remove that. I assumed since Zarr supports various byte orders, it was probably going to be needed eventually.

I removed the ByteOrder argument. Zarr uses the numpy file format, which fixes the byte order of the int32 values that encode string lengths to LITTLE_ENDIAN. The encoded strings themselves are UTF-8 byte streams, so they are independent of endianness.
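As an illustration of that layout (a sketch, not the actual n5-zarr code): each string is written as a LITTLE_ENDIAN int32 byte length followed by its UTF-8 bytes, so only the length prefix is endianness-sensitive.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch (not the actual n5-zarr code) of the layout described
// above: each string is a little-endian int32 byte length followed by its
// UTF-8 bytes, so only the length prefix depends on byte order.
public class ZarrStringLayoutSketch {

    static byte[] encode(final String[] strings) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (final String s : strings) {
            final byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            final ByteBuffer len = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
            len.putInt(utf8.length);
            out.write(len.array()); // 4-byte little-endian length prefix
            out.write(utf8);        // endianness-independent UTF-8 payload
        }
        return out.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final byte[] encoded = encode(new String[] { "ab", "c" });
        // 4-byte length (2) + "ab" + 4-byte length (1) + "c" = 11 bytes
        System.out.println(encoded.length); // 11
        System.out.println(encoded[0]);     // 2 (low byte of the little-endian length)
    }
}
```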

tpietzsch pushed a commit that referenced this pull request May 15, 2025
* refactor: don't pass `decodedLength` to ReadData

* refactor: move DataBlockFactory/DataBlockCodecFactory to N5Codecs

* feat: Add StringDataCodec, ObjectDataCodec, StringDataBlockCodec, ObjectDataBlockCodec

DataCodecs now create the access object and return it during `deserialize`

* refactor: add AbstractDataBlock to extract shared logic between Default/String/Object blocks

* refactor: DatasetAttributes responsible for DataBlockCodec creation

N5BlockCodec uses dataType (and potentially other DatasetAttributes) to wrap the desired DataBlockCodec

* revert: add back createDataBlock logic in DataType

* doc: retain javadoc from before refactor

* revert: keep protected constructor with DataBlockCodec parameter

refactor: inline createDataBlockCodec from constructor params

* refactor: remove currently unused N5BlockCodec. Something like this may be needed when multiple codecs are supported

* refactor: don't expose N5Codecs internals

* refactor: rename encodeBlockHeader -> createBlockHeader

* feat: add ZarrStringDataCodec support
