Skip to content

Conversation

@martindurant
Copy link
Member

@martindurant martindurant commented Dec 20, 2022

Fixes #410 cc @rabernat

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

@martindurant
Copy link
Member Author

The errors showing up are for non-numpy inputs to JSON and msgpack - nothing to do with this PR.

I'll fill in the docs and such shortly.

@rabernat
Copy link
Contributor

This looks amazing Martin! 🚀 Thanks so much for doing it.

One question: have your verified that this implementation is interoperable with the hdf5 and netcdf4 implementation? Like, if netcdf4 writes a chunk with fletcher32, does this codec successfully decode it?

@martindurant
Copy link
Member Author

One question: have your verified that this implementation is interoperable with the hdf5 and netcdf4 implementation? Like, if netcdf4 writes a chunk with fletcher32, does this codec successfully decode it?

No, not yet. The tests just come from examples on wikipedia. Do you think we should bundle a small hdf file, or perhaps just extract a bytes buffer into a test?

@rabernat
Copy link
Contributor

rabernat commented Dec 20, 2022

Do you think we should bundle a small hdf file, or perhaps just extract a bytes buffer into a test?

I would try to get a short snippet of actual bytes. I'd probably use the approach from fsspec/kerchunk#274 of generating a tiny netcdf file using xarray with fletcher32 on, extracting a chunk manually using kerchunk, printing the bytes to the terminal, and then copy-pasting that to this PR.

Here is an example of such a string of bytes.

b'x\xda\xb3\xf1\x8b\xd3gdb\x00\x02\xf1\xa3\x00'

This was generated from the data array([[60, 78, 94, 47]], dtype=int16) with the following encoding.

encoding = {
    'zlib': True,
    'compression': 'zlib',
    'shuffle': True,
    'complevel': 8,
    'fletcher32': True,
    'contiguous': False,
    'chunksizes': (1, 4)
}

Question: in what order is the fletcher checksum applied? Before or after zlib?

@martindurant
Copy link
Member Author

Well it was a good idea to check! Their implementation is not the same, so I thought it best to just embed it directly. This ought to be faster too, if that's important.

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually RuntimeError would be more consistent with what other codecs do when decompression fails.

@martindurant
Copy link
Member Author

OK, @jakirkham , it didn't turn out to be too bad. The algorithm is obviously the same as the original.

I didn't understand why the class should be moved to a different pure-python module, though. lz4, blosc, zstd and vlen all have Codecs in their respective pyx files. (I must say, they are surprisingly complex!)

@jakirkham
Copy link
Member

OK, @jakirkham , it didn't turn out to be too bad. The algorithm is obviously the same as the original.

Thanks Martin! 🙏

I didn't understand why the class should be moved to a different pure-python module, though. lz4, blosc, zstd and vlen all have Codecs in their respective pyx files. (I must say, they are surprisingly complex!)

Yeah we have some technical debt to work through for sure.

Since these are now in the same file, agree this matches the existing pattern.

Though at some point we might want to split these apart to simplify things. We need not do that here.

Co-authored-by: Ryan Abernathey <[email protected]>
@rabernat
Copy link
Contributor

Do we have an existing issue to track the test failures we are seeing in this PR? As @martindurant says, they are not related to the new codec. But they will need to be fixed asap.

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very useful contribution to numcodecs. Thanks a lot Martin! Provided we understand why the tests are failing and have a plan to fix that elsewhere, I'm happy to see this merged.

@rabernat
Copy link
Contributor

This has lingered for a while, but I think it looks good. We should get this in.

I have no idea what's going on the the tests. Maybe they are fixed by #417? Martin do you want to try to rebase?

@codecov
Copy link

codecov bot commented Jan 15, 2023

Codecov Report

Merging #412 (12eb7a3) into main (4f2a2e3) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #412   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           54        55    +1     
  Lines         2095      2121   +26     
=========================================
+ Hits          2095      2121   +26     
Impacted Files Coverage Δ
numcodecs/__init__.py 100.00% <100.00%> (ø)
numcodecs/tests/test_fletcher32.py 100.00% <100.00%> (ø)

@joshmoore
Copy link
Member

@martindurant @rabernat @jakirkham: did a release of this get discussed at any point while I was off galavanting?

@martindurant
Copy link
Member Author

No discussion I am aware of

@jakirkham
Copy link
Member

Raised an issue ( #437 ) to discuss

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fletcher checksum codec

4 participants