
Support new "bitinformation" codec in numcodecs #298

@rabernat

Description


There is an exciting new paper by @milankl in Nature Computational Science entitled Compressing atmospheric data into its real information content

Abstract

Hundreds of petabytes are produced annually at weather and climate forecast centers worldwide. Compression is essential to reduce storage and to facilitate data sharing. Current techniques do not distinguish the real from the false information in data, leaving the level of meaningful precision unassessed. Here we define the bitwise real information content from information theory for the Copernicus Atmospheric Monitoring Service (CAMS). Most variables contain fewer than 7 bits of real information per value and are highly compressible due to spatio-temporal correlation. Rounding bits without real information to zero facilitates lossless compression algorithms and encodes the uncertainty within the data itself. All CAMS data are 17× compressed relative to 64-bit floats, while preserving 99% of real information. Combined with four-dimensional compression, factors beyond 60× are achieved. A data compression Turing test is proposed to optimize compressibility while minimizing information loss for the end use of weather and climate forecast data.

More relevant description of the method

Based on the bitwise real information content, we suggest a strategy for the data compression of climate variables. First, we diagnose the real information for each bit position. Afterwards, we round bits with no significant real information to zero, before applying lossless data compression. This allows us to minimize information loss but maximize the efficiency of the compression algorithms.

Bits with no or only little real information (but high entropy) are discarded via binary round-to-nearest as defined in the IEEE-754 standard. This rounding mode is bias-free and therefore will ensure global conservation of the quantities that are important in climate model data. Rounding removes the incompressible false information and therefore increases compressibility. Although rounding is irreversible for the bits with false information, the bits with real information remain unchanged and are bitwise reproducible after decompression. Both the real information analysis and the rounding mode are deterministic, also satisfying reproducibility.
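The bias-free binary round-to-nearest described here can be sketched in a few lines of NumPy by manipulating the float32 bit patterns directly. This is an illustrative sketch, not the paper's Julia implementation; the function name `bitround` and the choice of 7 keepbits in the note below are mine:

```python
import numpy as np

def bitround(x: np.ndarray, keepbits: int) -> np.ndarray:
    """Round float32 values to `keepbits` mantissa bits with IEEE-754
    round-to-nearest (ties to even), zeroing all trailing mantissa bits."""
    mantissa_bits = 23  # float32 has 23 explicit mantissa bits
    if not 0 < keepbits < mantissa_bits:
        raise ValueError("keepbits must be in [1, 22] for float32")
    drop = mantissa_bits - keepbits             # trailing bits to zero
    bits = x.astype(np.float32).view(np.uint32)  # astype copies, so x is untouched
    half = np.uint32((1 << (drop - 1)) - 1)      # just under half a quantum
    mask = np.uint32((0xFFFFFFFF >> drop) << drop)
    # adding the parity of the last kept bit turns "just under half" into
    # exact ties-to-even; carries may propagate into the exponent, which is
    # exactly how rounding up to the next power of two should behave
    bits += ((bits >> np.uint32(drop)) & np.uint32(1)) + half
    bits &= mask                                 # shave the trailing bits
    return bits.view(np.float32)
```

With `keepbits=7`, the result stays within a relative error of 2⁻⁸ of the input while the trailing 16 mantissa bits of every value are zero, which is what makes the subsequent lossless compression effective.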

Lossless compression algorithms can be applied efficiently to rounded floating-point arrays (the round + lossless method). Many general-purpose lossless compression algorithms are available and are based on dictionaries and other statistical techniques to remove redundancies. Most algorithms operate on bitstreams and exploit the correlation of data in a single dimension only, so we describe such methods as one-dimensional (1D) compression. Here, we use the Zstandard algorithm for lossless compression, which has emerged as a widely available default in recent years.

...
In an operational setting we recommend the following workflow. First, for each variable, the bitwise real information content is analyzed from a representative subset of the data. For example, a single time step can be representative of subsequent time steps if the statistics of the data distribution are not expected to change. From the bitwise real information, the number of mantissa bits to preserve 99% of information is determined (the ‘keepbits’). Second, during the simulation, the arrays that will be archived are rounded to the number of keepbits (which are held fixed) and compressed. The first step should be done offline—once in advance of a data-producing simulation. Only the second step has to be performed online, meaning every time a chunk of data is archived.
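The two-step workflow above can be sketched as follows, assuming the keepbits were already determined offline (hard-coded to 7 here purely for illustration), and with zlib standing in for Zstandard since it ships with the Python standard library:

```python
import zlib
import numpy as np

KEEPBITS = 7  # assumed output of the offline bitwise-information analysis

def bitround(x: np.ndarray, keepbits: int) -> np.ndarray:
    """IEEE-754 round-to-nearest of float32 to `keepbits` mantissa bits."""
    drop = 23 - keepbits
    bits = x.astype(np.float32).view(np.uint32)
    bits += ((bits >> np.uint32(drop)) & np.uint32(1)) + np.uint32((1 << (drop - 1)) - 1)
    bits &= np.uint32((0xFFFFFFFF >> drop) << drop)
    return bits.view(np.float32)

def archive_chunk(chunk: np.ndarray) -> bytes:
    """Online step: round away false information, then compress losslessly."""
    return zlib.compress(bitround(chunk, KEEPBITS).tobytes())

# smooth, spatially correlated data compresses far better once the
# high-entropy trailing bits are rounded to zero
field = np.sin(np.linspace(0, 8 * np.pi, 4096)).astype(np.float32)
raw_compressed = zlib.compress(field.tobytes())
rounded_compressed = archive_chunk(field)
ratio = len(field.tobytes()) / len(rounded_compressed)
```

Only `archive_chunk` runs online; the keepbits stay fixed across time steps, matching the recommendation above.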

Prior Conversation

The compressor is currently implemented in Julia. I emailed Milan to ask about possible paths for supporting this compression codec in numcodecs. I am now moving the discussion here so other devs can weigh in.

I posed the question in the following way:

We would like to investigate implementing this compressor in Numcodecs so we can use it with Zarr data. In order to do that, we would need to either:

  • call Julia directly from numcodecs (certainly possible but would limit usage)
  • reimplement the algorithm in Python (not appealing)
  • export a C-compatible binary library from Julia (possible but perhaps difficult)

Milan replied

Thanks so much for your interest! I’d be more than happy to develop BitInformation.jl further towards a reference implementation for the analysis of the bitwise information content. But you are quite right that maintaining similar code in several languages might be tricky and unnecessary additional effort. Before suggesting one of these options, let me highlight that there are technically two parts to our “compression method” (quotation marks because the idea of the real bitwise information is precisely that it should be possible to combine it with other existing compression methods, as we do with Zfp in our paper).

part 1. The bitwise information analysis. This just informs how many bits should be retained for a given array, and does not have to be repeated, say, for subsequent time steps.

part 2. The removal of false information. We simply round this to zero with round-to-nearest; while this doesn’t technically compress, it can be combined with any lossless algorithm, and I see you have plenty in numcodecs already.

So I think we’d need two things in numcodecs:

part 1: the bitinformation function from BitInformation.jl. If there’s a way to pull that one directly from Julia into numcodecs, that’d be great, because then I can keep updating/maintaining BitInformation.jl and you’d automatically get the newest version.

part 2: a binary round-to-nearest. I see that np.round rounds in decimal, but we’d need a binary rounding mode to guarantee that the trailing bits are zero. This is technically 3 lines of code, so maybe it can be done directly in C/Cython.

So I’d suggest reimplementing part 2 (the rounding), as it’s only a small piece of code, but we should check how to call Julia efficiently for part 1.

Since we already support ZFP in numcodecs, I would say that this strategy makes a lot of sense.

I'm copying Fig. 1 from the paper into this issue, since it provides a great visualization of the method

[Fig. 1 from the paper: visualization of the bitwise real information content and the round + lossless compression method]

Next Steps

I would note that there is also a final step in this method, which is to apply a lossy (zfp) or lossless (zstd) compression codec to the data.

As a concrete step forward, we could imagine first implementing the "binary rounding" as a standalone codec in numcodecs.

One technical question I have for @milankl is the following: the round function has a signature (translated to Python) like

def round(x: float, setmask: int, shift: int, shavemask: int) -> float:
    ...

where x is the array to be rounded. Can you confirm that setmask, shift, and shavemask are constants for each variable, which are determined based on the mutual information analysis from "part 1"? If this is the case, we can simply treat these as scalar parameters for the bitwise rounding codec.
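If the semantics of `setmask`, `shift`, and `shavemask` match the usual integer-arithmetic construction of binary round-to-nearest (an assumption on my part; the actual BitInformation.jl definitions may differ), then all three are indeed scalar functions of the keepbits alone. The helper names below are hypothetical, purely for illustration:

```python
import numpy as np

def masks_from_keepbits(keepbits: int, mantissa_bits: int = 23):
    """Guess at deriving the three scalar rounding parameters for float32:
      shift     - number of trailing mantissa bits to discard
      setmask   - just under half a quantum, added before shaving
                  (ties-to-even comes from also adding the last kept bit's parity)
      shavemask - keeps sign, exponent and the leading `keepbits` mantissa bits
    """
    shift = mantissa_bits - keepbits
    setmask = (1 << (shift - 1)) - 1
    shavemask = (0xFFFFFFFF >> shift) << shift
    return setmask, shift, shavemask

def round_with_masks(x: np.ndarray, setmask: int, shift: int, shavemask: int) -> np.ndarray:
    """Binary round-to-nearest parameterized by the three scalars."""
    bits = x.astype(np.float32).view(np.uint32)
    bits += ((bits >> np.uint32(shift)) & np.uint32(1)) + np.uint32(setmask)
    bits &= np.uint32(shavemask)
    return bits.view(np.float32)
```

For keepbits = 7 this gives shift = 16, setmask = 0x7FFF and shavemask = 0xFFFF0000 — constants per variable, consistent with treating them as scalar codec configuration.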

This also suggests an even simpler path to supporting this compressor in python:

  • we do the mutual information analysis in Julia using BitInformation.jl (only needs to be done once for each array)
  • export the scalar parameters, e.g. via copy / paste or a simple text file
  • create the appropriate codec chain in numcodecs, e.g. bitwise_rounding + zstd
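This chain can be prototyped today without any new numcodecs machinery — below with zlib standing in for zstd and a hand-rolled `bitround` standing in for the yet-to-exist bitwise-rounding codec. Decoding reproduces the rounded array bitwise, which is the reproducibility property the paper promises for the bits with real information:

```python
import zlib
import numpy as np

def bitround(x: np.ndarray, keepbits: int) -> np.ndarray:
    """IEEE-754 round-to-nearest of float32 to `keepbits` mantissa bits."""
    drop = 23 - keepbits
    bits = x.astype(np.float32).view(np.uint32)
    bits += ((bits >> np.uint32(drop)) & np.uint32(1)) + np.uint32((1 << (drop - 1)) - 1)
    bits &= np.uint32((0xFFFFFFFF >> drop) << drop)
    return bits.view(np.float32)

def encode(arr: np.ndarray, keepbits: int = 7) -> bytes:
    """bitwise_rounding -> lossless compression, analogous to a codec chain."""
    return zlib.compress(bitround(arr, keepbits).tobytes())

def decode(buf: bytes, shape) -> np.ndarray:
    """Only the lossless stage is decoded; the rounding itself is irreversible."""
    return np.frombuffer(zlib.decompress(buf), dtype=np.float32).reshape(shape)

data = np.cos(np.linspace(0, 4 * np.pi, 1024)).astype(np.float32)
roundtrip = decode(encode(data), data.shape)
# bitwise identical to the rounded data, and close to the original
assert np.array_equal(roundtrip, bitround(data, 7))
```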

I'm very excited about this opportunity and really eager to try it out, e.g. on the LLC4320 ocean simulations.
