Made lazy decompression of clusters user-controllable by veloman-yunkan · Pull Request #1036 · openzim/libzim

veloman-yunkan · 2026-01-30T14:04:36Z

In lazy decompression mode (introduced by #421), data in compressed clusters is decompressed on demand, i.e. only as much data is decompressed as is needed to serve item accesses within the cluster. This leads to faster first-time access to items occurring early in the cluster, but comes at the cost of increased memory usage, since the decompressor's state (whose memory consumption can exceed the data size of the cluster) is kept around while the cluster is still loaded.
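To make the trade-off concrete, here is a minimal toy model of on-demand decoding. It is not libzim's cluster code: `ToyDecoderState` and `LazyCluster` are hypothetical names, the "decoder" merely generates bytes, and the window buffer only stands in for a real decompressor's state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a streaming decompressor's state: a window
// buffer that has to stay allocated until the stream is fully decoded.
struct ToyDecoderState {
    std::vector<unsigned char> window;
    explicit ToyDecoderState(std::size_t windowSize) : window(windowSize) {}
};

// Hypothetical cluster that decodes only as far as the requested offset.
class LazyCluster {
public:
    LazyCluster(std::size_t decodedSize, std::size_t windowSize)
        : decodedSize_(decodedSize),
          state_(std::make_unique<ToyDecoderState>(windowSize)) {}

    // Returns the byte at `offset`, decoding more data only if needed.
    unsigned char at(std::size_t offset) {
        if (offset >= decodedSize_) throw std::out_of_range("offset");
        decodeUpTo(offset + 1);
        return data_[offset];
    }

    std::size_t decodedSoFar() const { return data_.size(); }
    bool holdsDecoderState() const { return state_ != nullptr; }

private:
    void decodeUpTo(std::size_t end) {
        end = std::min(end, decodedSize_);
        while (data_.size() < end) {
            // Toy "decoding": a real decoder would emit blocks using state_.
            data_.push_back(static_cast<unsigned char>(data_.size() % 251));
        }
        // The state can only be released once everything is decoded.
        if (data_.size() == decodedSize_) state_.reset();
    }

    std::size_t decodedSize_;
    std::unique_ptr<ToyDecoderState> state_;
    std::vector<unsigned char> data_;
};
```

In this sketch, reading an early byte decodes only a short prefix but leaves the window allocated; only an access that reaches the end of the cluster releases it. Eager decompression would correspond to decoding everything in the constructor and dropping the state immediately.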

This PR adds a facility to disable lazy/on-demand decompression.

Lazy decompression stays on by default, but can be disabled and/or (re-)enabled per individual ZIM archive via a new public API method `zim::Archive::decompressClustersLazily()`.
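As a rough illustration of a per-archive switch that defaults to on, here is a sketch using a stand-in class. `ArchiveSketch` is hypothetical and is not `zim::Archive`; the `bool` parameter and the `bytesDecodedOnAccess()` helper are assumptions made for the example, since the actual signature of `decompressClustersLazily()` is not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for zim::Archive; NOT the libzim class.
class ArchiveSketch {
public:
    // Assumed shape of the switch (a bool setter). The real signature of
    // zim::Archive::decompressClustersLazily() may differ.
    void decompressClustersLazily(bool enable) { lazy_ = enable; }

    // Hypothetical helper: bytes decoded when the first `needed` bytes of
    // a cluster of `clusterSize` bytes are requested for the first time.
    std::size_t bytesDecodedOnAccess(std::size_t needed,
                                     std::size_t clusterSize) const {
        return lazy_ ? std::min(needed, clusterSize) : clusterSize;
    }

private:
    bool lazy_ = true;  // lazy decompression stays on by default
};
```

Because the flag lives in each object, one archive can run with lazy decompression disabled while another keeps the default, and the setting can be flipped back later.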

veloman-yunkan · 2026-01-30T14:08:21Z

The question is whether lazy decompression should be disabled by default. With lazy decompression off, the cluster memory cache size limit should be respected in terms of virtual memory consumption.

codecov · 2026-01-30T14:09:08Z

Codecov Report

❌ Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 56.32%. Comparing base (388010e) to head (1e387e2).

Files with missing lines	Patch %	Lines
src/cluster.cpp	88.88%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1036      +/-   ##
==========================================
+ Coverage   56.25%   56.32%   +0.06%     
==========================================
  Files         101      101              
  Lines        5016     5026      +10     
  Branches     2185     2188       +3     
==========================================
+ Hits         2822     2831       +9     
  Misses        738      738              
- Partials     1456     1457       +1

kelson42 · 2026-01-30T15:28:28Z

@veloman-yunkan ~~To me this is a very important improvement and I was sure this was already working like that!~~ Sorry, I first understood exactly the opposite of what was written. My comment is invalid, then.

kelson42 · 2026-01-30T20:11:01Z

@veloman-yunkan What goes through my mind (sorry, just an idea) is that the cluster cache is not hit that often (large number of clusters, small cache, statistically random location of articles). Therefore, why not just focus on the item cache and reduce (if not remove) the cluster cache?

benoit74 · 2026-02-02T12:33:02Z

> the decompressor's state (whose memory consumption can exceed the data size of the cluster)

Is this such a common real-life situation? Have we encountered situations in the wild where the decompressor's state is significantly bigger than the cluster cache?

> With lazy decompression off, the cluster memory cache size limit should be respected in terms of virtual memory consumption.

I don't think I get this. Does it mean the decompressor state is not accounted for in the cluster memory cache size?

veloman-yunkan · 2026-02-02T12:54:31Z

> the decompressor's state (whose memory consumption can exceed the data size of the cluster)
>
> Is this such a common real-life situation? Have we encountered situations in the wild where the decompressor's state is significantly bigger than the cluster cache?

Cluster size (decompressed) is 2MiB, decompressor window size is now 8MiB (it was reduced from 128MiB almost 5 years ago in 23a72e7). But keep reading below.

> With lazy decompression off, the cluster memory cache size limit should be respected in terms of virtual memory consumption.
>
> I don't think I get this. Does it mean the decompressor state is not accounted for in the cluster memory cache size?

There is some ambiguity with respect to what memory usage means: virtual memory vs. physical memory. The current implementation of cluster cache memory management targets physical memory usage (see https://github.com/openzim/libzim/blob/9.4.0/src/cluster.cpp#L206-L210). But there may be systems (e.g. those not supporting mmap) where consumption of physical memory equals the size of allocated memory.

With lazy decompression of clusters turned off, this ambiguity stops making any difference, and the cache size limit is respected in the sense of (the higher value of) virtual memory consumption.
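A back-of-the-envelope estimate may help show why the distinction matters. The sketch below uses the 2 MiB cluster size and 8 MiB window size quoted earlier in this comment, and it assumes a simplified accounting that is not libzim's actual one: the cache limit counts decompressed data only, and every cached cluster still holds a full-size decompressor window. The 16 MiB limit, `FootprintEstimate` and `estimate()` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Hypothetical result type for the toy estimate below.
struct FootprintEstimate {
    std::uint64_t clustersInCache;          // clusters that fit the limit
    std::uint64_t dataBytes;                // decompressed data only
    std::uint64_t worstCaseBytesWithState;  // data plus retained windows
};

// Toy upper bound, assuming the limit counts decompressed data only and
// each cached cluster retains one window of `windowSize` bytes.
inline FootprintEstimate estimate(std::uint64_t cacheLimit,
                                  std::uint64_t clusterSize,
                                  std::uint64_t windowSize) {
    const std::uint64_t n = cacheLimit / clusterSize;
    return FootprintEstimate{n, n * clusterSize,
                             n * (clusterSize + windowSize)};
}
```

Under these assumptions, a 16 MiB limit admits 8 clusters holding 16 MiB of data, but up to 80 MiB of allocated memory if each one keeps its window. With lazy decompression off, no window outlives the load, so the two figures coincide.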

veloman-yunkan · 2026-02-02T13:29:49Z

> @veloman-yunkan What goes through my mind (sorry, just an idea) is that the cluster cache is not hit that often (large number of clusters, small cache, statistically random location of articles). Therefore, why not just focus on the item cache and reduce (if not remove) the cluster cache?

Let's discuss this in #1035 instead.

Made lazy decompression user-controllable

1e387e2

Lazy decompression stays on by default, but can be disabled and/or (re-)enabled per individual ZIM archive via a new public API method `zim::Archive::decompressClustersLazily()`.

veloman-yunkan requested a review from kelson42 January 30, 2026 14:08

veloman-yunkan mentioned this pull request Jan 30, 2026

Increasing memory consumption for indexed search on Apple since 14.1.0 update kiwix/libkiwix#1265

Closed

veloman-yunkan wants to merge 1 commit into main from controllable_lazy_decompression
