Made lazy decompression of clusters user-controllable #1036
veloman-yunkan wants to merge 1 commit into main
Lazy decompression stays on by default, but can be disabled and/or (re-)enabled per individual ZIM archive via a new public API method `zim::Archive::decompressClustersLazily()`.
The question is whether lazy decompression should be disabled by default. With lazy decompression off, the cluster memory cache size limit should be respected in terms of virtual memory consumption.
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1036      +/-   ##
==========================================
+ Coverage   56.25%   56.32%   +0.06%
==========================================
  Files         101      101
  Lines        5016     5026      +10
  Branches     2185     2188       +3
==========================================
+ Hits         2822     2831       +9
  Misses        738      738
- Partials     1456     1457       +1
@veloman-yunkan What goes through my mind (sorry, just an idea) is that the cluster cache is not hit that often (large number of clusters, small cache, statistically random placement of articles across clusters). So why not focus on the item cache and reduce (if not remove) the cluster cache?
Is this such a common real-life situation? Have we encountered situations in the wild where the decompressor state is significantly bigger than the cluster cache?
I don't think I get this. Does it mean the decompressor state is not accounted for in the cluster memory cache size?
Cluster size (decompressed) is 2MiB, decompressor window size is now 8MiB (it was reduced from 128MiB almost 5 years ago in 23a72e7). But keep reading below.
There is some ambiguity with respect to what memory usage means: virtual memory vs. physical memory. The current implementation of cluster cache memory management targets physical memory usage (see https://github.com/openzim/libzim/blob/9.4.0/src/cluster.cpp#L206-L210). But there may be systems (e.g. those not supporting …). With lazy decompression of clusters turned off, this ambiguity stops making any difference and the memory limit is respected in the sense of virtual memory consumption (the higher of the two values).
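To put the sizes quoted above side by side, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that a retained decompressor costs about as much as its window and that every cached cluster still holds one; the constant and function names are hypothetical and do not come from libzim.

```cpp
#include <cstddef>

constexpr std::size_t MiB = 1024 * 1024;

// Figures quoted in the comment above.
constexpr std::size_t kClusterDataSize   = 2 * MiB;  // decompressed cluster
constexpr std::size_t kDecoderWindowSize = 8 * MiB;  // decompressor window

// Assumption: decompressor state is approximated by its window alone.
// Footprint when every cached cluster still holds its decompressor.
constexpr std::size_t lazyFootprint(std::size_t clusters) {
    return clusters * (kClusterDataSize + kDecoderWindowSize);
}

// Footprint when clusters are fully decompressed up front and the
// decompressor is released immediately afterwards.
constexpr std::size_t eagerFootprint(std::size_t clusters) {
    return clusters * kClusterDataSize;
}

static_assert(lazyFootprint(1) == 10 * MiB,
              "one lazily decompressed cluster: 2 MiB data + 8 MiB window");
static_assert(lazyFootprint(16) == 5 * eagerFootprint(16),
              "under these assumptions the ratio is 5x");
```

Under these assumptions the address-space footprint per cached cluster is up to five times the size of its data. How much of that becomes resident memory depends on the platform, which is the ambiguity described above.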
Let's discuss this in #1035 instead.
Related to kiwix/libkiwix#1265
In lazy decompression mode (introduced by #421), data in compressed clusters is decompressed on demand, i.e. only as much data is decompressed as is needed to serve item accesses within the cluster. This leads to faster first-time access to items occurring early in the cluster, but comes at the cost of increased memory usage, since the decompressor's state (whose memory consumption can exceed the data size of the cluster) is kept around while the cluster is still loaded.
This PR adds a facility to disable lazy/on-demand decompression. Lazy decompression stays on by default, but can be disabled and/or (re-)enabled per individual ZIM archive via a new public API method `zim::Archive::decompressClustersLazily()`.
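The trade-off between the two modes can be illustrated with a toy model. This is only a sketch: `ToyDecoder` and `ToyCluster` are hypothetical names, run-length encoding stands in for a real codec, and none of it reflects libzim's actual classes or the signature of the new method.

```cpp
#include <cstddef>
#include <limits>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Runs = std::vector<std::pair<std::size_t, char>>;

// Toy stand-in for a streaming decompressor: the input is run-length
// encoded as (count, byte) pairs. Real codecs keep far larger state.
class ToyDecoder {
public:
    explicit ToyDecoder(Runs runs) : runs_(std::move(runs)) {}

    // Decode whole runs until `out` holds at least `wanted` bytes
    // or the input is exhausted.
    void decodeUpTo(std::size_t wanted, std::string& out) {
        while (out.size() < wanted && next_ < runs_.size()) {
            out.append(runs_[next_].first, runs_[next_].second);
            ++next_;
        }
    }

    bool finished() const { return next_ == runs_.size(); }

private:
    Runs runs_;
    std::size_t next_ = 0;  // position remembered between calls
};

// Hypothetical cluster model (not libzim's cluster implementation).
class ToyCluster {
public:
    ToyCluster(Runs runs, bool lazy)
        : decoder_(new ToyDecoder(std::move(runs))) {
        if (!lazy) {
            // Eager mode: decode everything now, then drop the decoder.
            decoder_->decodeUpTo(std::numeric_limits<std::size_t>::max(), data_);
            decoder_.reset();
        }
    }

    // Return `size` bytes starting at `offset`, decoding more on demand.
    std::string read(std::size_t offset, std::size_t size) {
        if (decoder_) {
            decoder_->decodeUpTo(offset + size, data_);
            if (decoder_->finished()) decoder_.reset();  // state no longer needed
        }
        if (offset >= data_.size()) return std::string();
        return data_.substr(offset, size);
    }

    std::size_t decodedBytes() const { return data_.size(); }
    bool holdsDecoderState() const { return decoder_ != nullptr; }

private:
    std::unique_ptr<ToyDecoder> decoder_;  // kept alive only while needed
    std::string data_;                     // decompressed bytes so far
};
```

In lazy mode a read near the start of the cluster returns after decoding only a prefix, but the decoder object stays allocated until the whole cluster has been consumed. In eager mode the first access pays for the full decode, and no decoder state outlives construction.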