Made lazy decompression of clusters user-controllable #1036

Open
veloman-yunkan wants to merge 1 commit into main from controllable_lazy_decompression

@veloman-yunkan
Collaborator

Related to kiwix/libkiwix#1265

In lazy decompression mode (introduced by #421), data in compressed clusters is decompressed on demand, i.e. only as much data is decompressed as is needed to serve item accesses within the cluster. This leads to faster first-time access to items occurring early in the cluster, but comes at the cost of increased memory usage, since the decompressor's state (whose memory consumption can exceed the data size of the cluster) is kept around while the cluster is still loaded.
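To make the trade-off concrete, here is a minimal, self-contained sketch of on-demand decoding. It is not libzim code: `ToyDecoder`, `ToyCluster` and all their members are hypothetical names, the "decompression" is just a byte copy, and the assumption that the decoder state is released once the whole cluster has been decoded is mine.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a streaming decompressor. "Decompression" is the identity
// transform; `window` models the state a real decompressor keeps alive.
struct ToyDecoder {
    std::string input;
    std::size_t pos = 0;
    std::vector<char> window;

    ToyDecoder(std::string in, std::size_t windowSize)
        : input(std::move(in)), window(windowSize) {}

    bool done() const { return pos == input.size(); }
    char next() { return input[pos++]; }
};

// Hypothetical cluster that decodes either eagerly (everything up front)
// or lazily (only as far as the requested offset).
class ToyCluster {
public:
    ToyCluster(std::string compressed, std::size_t windowSize, bool lazy)
        : total_(compressed.size()),
          decoder_(std::make_unique<ToyDecoder>(std::move(compressed), windowSize)) {
        if (!lazy) decodeUpTo(total_);  // eager mode: decode all, then drop the state
    }

    // Returns the byte at `offset`, decoding no further than necessary.
    char at(std::size_t offset) {
        assert(offset < total_);
        decodeUpTo(offset + 1);
        return data_[offset];
    }

    std::size_t decodedBytes() const { return data_.size(); }
    bool holdsDecoderState() const { return decoder_ != nullptr; }

private:
    void decodeUpTo(std::size_t n) {
        while (decoder_ && data_.size() < n && !decoder_->done())
            data_.push_back(decoder_->next());
        // ASSUMPTION: the state can be released once the input is exhausted.
        if (decoder_ && decoder_->done()) decoder_.reset();
    }

    std::size_t total_;
    std::unique_ptr<ToyDecoder> decoder_;
    std::string data_;
};
```

In this sketch a lazily constructed `ToyCluster` answers `at(1)` after decoding only two bytes but keeps its `window` allocated, whereas an eagerly constructed one pays the full decoding cost up front and holds no decoder state afterwards.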

This PR adds a facility to disable lazy/on-demand decompression. Lazy decompression stays on by default, but can be disabled and/or (re-)enabled per individual ZIM archive via a new public API method zim::Archive::decompressClustersLazily().
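As a rough illustration of what a per-archive toggle could look like, the sketch below uses a stand-in class rather than libzim itself. Only the method name `decompressClustersLazily` comes from this PR; the `bool` parameter, the `void` return type, the getter and the `sketch` namespace are assumptions made for the example and may not match the actual signature.

```cpp
#include <cassert>
#include <string>
#include <utility>

namespace sketch {  // deliberately not namespace zim: this is a stand-in, not libzim

class Archive {
public:
    explicit Archive(std::string path) : path_(std::move(path)) {}

    // ASSUMPTION: the real method may take different parameters or return a
    // value; only its name is taken from the PR description.
    void decompressClustersLazily(bool enable) { lazy_ = enable; }

    // Hypothetical getter, added only so the setting can be inspected here.
    bool lazyDecompressionEnabled() const { return lazy_; }

    const std::string& path() const { return path_; }

private:
    std::string path_;
    bool lazy_ = true;  // per the PR, lazy decompression stays on by default
};

}  // namespace sketch
```

Because the flag lives on the archive object, switching it off for one archive in this sketch leaves every other archive at the default.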

@veloman-yunkan
Collaborator Author

The question is whether lazy decompression should be disabled by default. With lazy decompression off, the cluster memory cache size limit should be respected in terms of virtual memory consumption.

@codecov

codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 56.32%. Comparing base (388010e) to head (1e387e2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cluster.cpp | 88.88% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1036      +/-   ##
==========================================
+ Coverage   56.25%   56.32%   +0.06%     
==========================================
  Files         101      101              
  Lines        5016     5026      +10     
  Branches     2185     2188       +3     
==========================================
+ Hits         2822     2831       +9     
  Misses        738      738              
- Partials     1456     1457       +1     

@kelson42
Contributor

kelson42 commented Jan 30, 2026

@veloman-yunkan To me this is a very important improvement and I was sure this was already working like that! Sorry, I first understood exactly the opposite of what was written. Comment invalid, then.

@kelson42
Contributor

@veloman-yunkan What goes through my mind (sorry, just an idea) is that the cluster cache is not hit that often (large number of clusters, small amount of cache, statistically random location of articles). Therefore, why not just focus on the item cache and reduce (if not remove) the cluster cache?

@benoit74

benoit74 commented Feb 2, 2026

> the decompressor's state (whose memory consumption can exceed the data size of the cluster)

Is this such a common real-life situation? Have we encountered situations in the wild where the decompressor's state is significantly bigger than the cluster cache?

> With lazy decompression off, the cluster memory cache size limit should be respected in terms of virtual memory consumption.

I don't think I get this. Does it mean the decompressor state is not accounted for in the cluster memory cache size?

@veloman-yunkan
Collaborator Author

> the decompressor's state (whose memory consumption can exceed the data size of the cluster)
>
> Is this such a common real-life situation? Have we encountered situations in the wild where the decompressor's state is significantly bigger than the cluster cache?

Cluster size (decompressed) is 2MiB, decompressor window size is now 8MiB (it was reduced from 128MiB almost 5 years ago in 23a72e7). But keep reading below.

> With lazy decompression off, the cluster memory cache size limit should be respected in terms of virtual memory consumption.
>
> I don't think I get this. Does it mean the decompressor state is not accounted for in the cluster memory cache size?

There is some ambiguity with respect to what memory usage means: virtual memory vs physical memory. The current implementation of cluster cache memory management targets physical memory usage (see https://github.com/openzim/libzim/blob/9.4.0/src/cluster.cpp#L206-L210). But there may be systems (e.g. those not supporting mmap) where consumption of physical memory equals the size of allocated memory.

With lazy decompression of clusters turned off, the said ambiguity stops making any difference and the cache size limit is respected in the sense of (the higher value of) virtual memory consumption.
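For a sense of scale, the figures quoted above (2 MiB per decompressed cluster, 8 MiB decompressor window) can be plugged into a back-of-the-envelope estimate. This is a sketch under stated assumptions: it treats the window size as the whole decompressor state, assumes every cached cluster still holds a live decompressor in lazy mode, and uses hypothetical names (`CacheFootprint`, `estimate`) that do not exist in libzim.

```cpp
#include <cstdint>

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Allocated bytes for a set of cached clusters, without and with the
// per-cluster decompressor state that lazy mode keeps around.
struct CacheFootprint {
    std::uint64_t dataBytes;              // decompressed cluster data only
    std::uint64_t withDecoderStateBytes;  // data plus one decoder state per cluster
};

// ASSUMPTION: decoderStateSize approximates the full decompressor state and
// each of the `clusters` cached clusters still owns one.
constexpr CacheFootprint estimate(std::uint64_t clusters,
                                  std::uint64_t clusterSize,
                                  std::uint64_t decoderStateSize) {
    return {clusters * clusterSize,
            clusters * (clusterSize + decoderStateSize)};
}
```

With a hypothetical 16 cached clusters, `estimate(16, 2 * MiB, 8 * MiB)` yields 32 MiB of cluster data against 160 MiB of allocations once the decoder state is counted, which is the gap between the two notions of memory usage discussed here.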

@veloman-yunkan
Collaborator Author

> @veloman-yunkan What goes through my mind (sorry, just an idea) is that the cluster cache is not hit that often (large number of clusters, small amount of cache, statistically random location of articles). Therefore, why not just focus on the item cache and reduce (if not remove) the cluster cache?

Let's rather discuss this in #1035.

3 participants