* **64-bit containers:** the first-class container in C-Blosc2 is the `super-chunk` or, for brevity, `schunk`, which is made up of smaller chunks that are essentially C-Blosc1 32-bit containers. The super-chunk may or may not be backed by another container called a `frame` (see later).
69
-
70
-
* **NDim containers (B2ND):** allow storing n-dimensional data (aka tensors) and efficiently reading slices of it, which can be n-dimensional too. To achieve this, an n-dimensional 2-level partitioning has been implemented.
71
-
72
-
* **More filters:** besides `shuffle` and `bitshuffle` already present in C-Blosc1, C-Blosc2 already implements:
73
-
74
-
- `bytedelta`: calculates the difference between bytes in a block that has been shuffled already. We have `blogged about bytedelta <https://www.blosc.org/posts/bytedelta-enhance-compression-toolset/>`_.
75
-
76
-
- `delta`: the stored blocks inside a chunk are diff'ed with respect to first block in the chunk. The idea is that, in some situations, the diff will have more zeros than the original data, leading to better compression.
77
-
78
-
- `trunc_prec`: it zeroes the least significant bits of the mantissa of float32 and float64 types. When combined with the `shuffle` or `bitshuffle` filter, this leads to more contiguous zeros, which are compressed better.
79
-
80
-
* **A filter pipeline:** the different filters can be pipelined so that the output of one can be the input for the next. A possible example is a `delta` followed by `shuffle`, or as described above, `trunc_prec` followed by `bitshuffle`.
81
-
82
-
* **Prefilters:** allow applying user-defined C callbacks **prior to** the filter pipeline during compression. See `test_prefilter.c <https://github.com/Blosc/c-blosc2/blob/main/tests/test_prefilter.c>`_ for an example of use.
83
-
84
-
* **Postfilters:** allow applying user-defined C callbacks **after** the filter pipeline during decompression. The combination of prefilters and postfilters could be interesting for supporting e.g. encryption (via prefilters) and decryption (via postfilters). Also, a postfilter alone can be used to produce on-the-fly computation based on existing data (or other metadata, like e.g. coordinates). See `test_postfilter.c <https://github.com/Blosc/c-blosc2/blob/main/tests/test_postfilter.c>`_ for an example of use.
85
-
86
-
* **SIMD support for ARM (NEON):** this allows for faster operation on ARM architectures. Only `shuffle` is supported right now, but the idea is to implement `bitshuffle` for NEON too. Thanks to Lucian Marc.
87
-
88
-
* **SIMD support for PowerPC (ALTIVEC):** this allows for faster operation on PowerPC architectures. Both `shuffle` and `bitshuffle` are supported; however, this has been done via a transparent mapping from SSE2 into ALTIVEC emulation in GCC 8, so performance could be better (but still, it is already a nice improvement over native C code; see PR https://github.com/Blosc/c-blosc2/pull/59 for details). Thanks to Jerome Kieffer and `ESRF <https://www.esrf.fr>`_ for sponsoring the Blosc team in helping him in this task.
89
-
90
-
* **Dictionaries:** when a block is going to be compressed, C-Blosc2 can use a previously made dictionary (stored in the header of the super-chunk) for compressing all the blocks that are part of the chunks. This usually improves the compression ratio, as well as the decompression speed, at the expense of a (small) overhead in compression speed. Currently, it is only supported in the `zstd` codec, but it would be nice to extend it to `lz4` and `blosclz` at least.
91
-
92
-
* **Contiguous frames:** allow to store super-chunks contiguously, either on-disk or in-memory. When a super-chunk is backed by a frame, instead of storing all the chunks sparsely in-memory, they are serialized inside the frame container. The frame can be stored on-disk too, meaning that persistence of super-chunks is supported.
93
-
94
-
* **Sparse frames:** each chunk in a super-chunk is stored in a separate file or different memory area, as well as the metadata. This allows for more efficient updates/deletes than in contiguous frames (i.e. avoiding 'holes' in monolithic files). The drawback is that it consumes more inodes when on-disk. Thanks to Marta Iborra for this contribution.
95
-
96
-
* **Partial chunk reads:** there is support for reading just part of a chunk, thus avoiding reading the whole chunk and then discarding the unnecessary data.
97
-
98
-
* **Parallel chunk reads:** when several blocks of a chunk are to be read, this is done in parallel by the decompressing machinery. That means that every thread is responsible to read, post-filter and decompress a block by itself, leading to an efficient overlap of I/O and CPU usage that optimizes reads to a maximum.
99
-
100
-
* **Meta-layers:** optionally, the user can add meta-data for different uses and in different layers. For example, one may think of providing a meta-layer for `NumPy <https://numpy.org>`_ so that most of the meta-data for it is stored in a meta-layer; then, one can place another meta-layer on top of the latter for adding more high-level info if desired (e.g. geo-spatial, meteorological...).
101
-
102
-
* **Variable length meta-layers:** the user may want to add variable-length meta information that can be potentially very large (up to 2 GB). The regular meta-layer described above is very quick to read, but meant to store fixed-length and relatively small meta information. Variable length metalayers are stored in the trailer of a frame, whereas regular meta-layers are in the header.
103
-
104
-
* **Efficient support for special values:** large sequences of repeated values can be represented with an efficient, simple and fast run-length representation, without the need to use regular codecs. With that, chunks or super-chunks with values that are the same (zeros, NaNs or any value in general) can be built in constant time, regardless of the size. This can be useful in situations where a lot of zeros (or NaNs) need to be stored (e.g. sparse matrices).
105
-
106
-
* **Nice markup for documentation:** we are currently using a combination of Sphinx + Doxygen + Breathe for documenting the C-API. See https://www.blosc.org/c-blosc2/c-blosc2.html. Thanks to Alberto Sabater and Aleix Alcacer for contributing the support for this.
107
-
108
-
* **Plugin capabilities for filters and codecs:** we have a plugin register capability in place so that the info about the new filters and codecs can be persisted and transmitted to different machines. See https://github.com/Blosc/c-blosc2/blob/main/examples/urfilters.c for a self-contained example. Thanks to the NumFOCUS foundation for providing a grant for doing this, and Oscar Griñón and Aleix Alcacer for the implementation.
109
-
110
-
* **Pluggable tuning capabilities:** this will allow users with different needs to define an interface so as to better tune different parameters like the codec, the compression level, the filters to use, the blocksize or the shuffle size. Thanks to ironArray for sponsoring us in doing this.
111
-
112
-
* **Support for I/O plugins:** so that users can extend the I/O capabilities beyond the current filesystem support. Things like the use of databases or S3 interfaces should be possible by implementing these interfaces. Thanks to ironArray for sponsoring us in doing this.
113
-
114
-
* **Security:** we are actively using `OSS-Fuzz <https://github.com/google/oss-fuzz>`_ and `ClusterFuzz <https://oss-fuzz.com>`_ for uncovering programming errors in C-Blosc2. Thanks to Google for sponsoring us in doing this, and to Nathan Moinvaziri for most of the work here.
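To make the filter-pipeline idea from the list above more concrete, here is a minimal, self-contained sketch of why truncating precision and then shuffling tends to produce long runs of zeros. It does not use the C-Blosc2 API: the helpers `truncate_mantissa`, `byte_shuffle` and `longest_zero_run` are hypothetical names invented for this illustration, and a plain byte shuffle stands in for `bitshuffle`. The real filters are implemented differently (and much faster) inside the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a C-Blosc2 function): keep only the `keep_bits`
 * most significant mantissa bits of each float32 and zero the rest,
 * which is the idea behind the `trunc_prec` filter. */
static void truncate_mantissa(float *data, size_t n, int keep_bits) {
    assert(keep_bits >= 0 && keep_bits <= 23);  /* float32 has 23 mantissa bits */
    const uint32_t mask = ~((UINT32_C(1) << (23 - keep_bits)) - 1);
    for (size_t i = 0; i < n; i++) {
        uint32_t bits;
        memcpy(&bits, &data[i], sizeof bits);   /* safe type punning */
        bits &= mask;
        memcpy(&data[i], &bits, sizeof bits);
    }
}

/* Hypothetical helper: plain byte shuffle. Byte k of every element is
 * gathered into one contiguous stream, so bytes that are zero in every
 * element end up next to each other. */
static void byte_shuffle(const uint8_t *src, uint8_t *dst,
                         size_t n, size_t typesize) {
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < typesize; k++) {
            dst[k * n + i] = src[i * typesize + k];
        }
    }
}

/* Length of the longest run of consecutive zero bytes in `buf`. */
static size_t longest_zero_run(const uint8_t *buf, size_t len) {
    size_t best = 0, run = 0;
    for (size_t i = 0; i < len; i++) {
        run = (buf[i] == 0) ? run + 1 : 0;
        if (run > best) {
            best = run;
        }
    }
    return best;
}
```

For instance, truncating four float32 values to 7 mantissa bits zeroes the two low-order bytes of each value; after `byte_shuffle` those bytes are contiguous, so `longest_zero_run` reports a run of at least eight zero bytes, which is the kind of input that codecs compress well.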
115
-
116
-
More info about the `improved capabilities of C-Blosc2 can be found in this talk <https://www.blosc.org/docs/Caterva-HDF5-Workshop.pdf>`_.
117
-
118
-
The C-Blosc2 API and format have been frozen, which means there is a guarantee that your programs will continue to work with future versions of the library, and that future releases will be able to read persistent storage generated by previous releases (as of 2.0.0).
66
+
More info about the `improved capabilities of C-Blosc2 can be found in this paper <https://www.blosc.org/docs/Exploring-MilkyWay-SciPy2023-paper.pdf>`_. Please, cite it if you use C-Blosc2 in your research!
119
67
120
68
121
69
Open format
122
70
===========
123
71
124
72
The Blosc2 format is open and `fully documented <https://github.com/Blosc/c-blosc2/blob/main/README_FORMAT.rst>`_.
125
73
126
-
The format specs are defined in less than 1000 lines of text, so they should be easy to read and understand. In our opinion, this is very important for the long-term success of the library, as it allows for third-party implementations of the format, and also for the users to understand what is going on under the hood.
74
+
The format specs are defined in less than 4000 words, so they should be easy to read and understand. In our opinion, this is critical for the long-term success of the library, as it allows for third-party implementations of the format, and also for the users to understand what is going on under the hood.
127
75
128
76
129
77
Python wrapper
@@ -194,6 +142,11 @@ Or, you may want to use a codec in an external library already in the system:
194
142
195
143
cmake -DPREFER_EXTERNAL_LZ4=ON ..
196
144
145
+
For OpenZL, the build currently seems to have problems, so after building and installing it into ``build-cmake`` in the ``openzl`` directory, one has to run:
@@ -225,7 +178,7 @@ the ``BLOSC_TRACE`` environment variable.
225
178
Contributing
226
179
============
227
180
228
-
If you want to collaborate in this development you are welcome. We need help in the different areas listed at the `ROADMAP <https://github.com/Blosc/c-blosc2/blob/main/ROADMAP.rst>`_; also, be sure to read our `DEVELOPING-GUIDE <https://github.com/Blosc/c-blosc2/blob/main/DEVELOPING-GUIDE.rst>`_ and our `Code of Conduct <https://github.com/Blosc/community/blob/master/code_of_conduct.md>`_. Blosc is distributed using the `BSD license <https://github.com/Blosc/c-blosc2/blob/main/LICENSE.txt>`_.
181
+
If you want to collaborate in this development you are welcome. We need help in the different areas listed at the `ROADMAP <https://github.com/Blosc/c-blosc2/blob/main/ROADMAP-TO-3.0.rst>`_; also, be sure to read our `DEVELOPING-GUIDE <https://github.com/Blosc/c-blosc2/blob/main/DEVELOPING-GUIDE.rst>`_ and our `Code of Conduct <https://github.com/Blosc/community/blob/master/code_of_conduct.md>`_. Blosc is distributed using the `BSD license <https://github.com/Blosc/c-blosc2/blob/main/LICENSE.txt>`_.
229
182
230
183
231
184
Twitter feed
@@ -244,7 +197,7 @@ You can cite our work on the different libraries under the Blosc umbrella as:
244
197
@ONLINE{blosc,
245
198
author = {{Blosc Development Team}},
246
199
title = "{A fast, compressed and persistent data store library}",