perf: add Skip Lists for AND query intersection by matheusvir · Pull Request #132 · Sygil-Dev/whoosh-reloaded

matheusvir · 2026-03-12T01:41:41Z

What was done

The current skip_to() method in ListMatcher uses a linear scan to advance through posting lists (O(n)), which becomes costly in AND queries where skip_to() is called repeatedly.

This change introduces a Skip List in a new file skiplist.py with two classes:

SkipNode: represents a node storing the document ID and forward pointers for each level. Uses __slots__ to minimize memory usage.

SkipList: builds all levels in a single O(n) pass and implements skip_to() in O(log n) by descending from the highest levels.
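As a rough illustration of the two classes described above, the sketch below builds a deterministic skip list from an already sorted list of document IDs in one pass and searches it from the top level down. It is not the code in skiplist.py: the height rule (based on the trailing zero bits of the node's position), the -1 sentinel, and the attribute names are assumptions made for the example.

```python
class SkipNode:
    """One posting: a doc ID plus one forward pointer per level."""
    __slots__ = ("doc_id", "forward")

    def __init__(self, doc_id, height):
        self.doc_id = doc_id
        self.forward = [None] * height


class SkipList:
    """Deterministic skip list over a sorted list of doc IDs (illustrative)."""

    def __init__(self, sorted_ids):
        n = len(sorted_ids)
        self.height = max(1, n.bit_length())
        # Sentinel head; its doc_id is never compared (assumed placeholder).
        self.head = SkipNode(-1, self.height)
        # last[level] is the most recent node linked at that level.
        last = [self.head] * self.height
        for i, doc_id in enumerate(sorted_ids, start=1):
            # Height = 1 + trailing zero bits of i, so every 2nd node reaches
            # level 1, every 4th level 2, and so on (about 2n pointers total).
            h = min(self.height, (i & -i).bit_length())
            node = SkipNode(doc_id, h)
            for level in range(h):
                last[level].forward[level] = node
                last[level] = node

    def skip_to(self, target):
        """Return the smallest stored doc ID >= target, or None."""
        node = self.head
        # Descend from the highest level, moving right while still below target.
        for level in range(self.height - 1, -1, -1):
            nxt = node.forward[level]
            while nxt is not None and nxt.doc_id < target:
                node = nxt
                nxt = node.forward[level]
        nxt = node.forward[0]
        return nxt.doc_id if nxt is not None else None


ids = list(range(0, 100, 3))   # 0, 3, 6, ..., 99
sl = SkipList(ids)
print(sl.skip_to(10))          # smallest ID >= 10
```

A randomized height (coin flips per node) is the more common textbook choice; the deterministic rule is used here only because the input is static and sorted, which makes a single-pass build straightforward.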

A new class SkipListMatcher inherits from ListMatcher and overrides only skip_to(), using the SkipList to locate the target doc_id in O(log n) and bisect_left to find the corresponding index. All other methods (next, id, weight, score, all_ids, etc.) remain unchanged.
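The override pattern described above might look roughly like the following. Everything here is a simplified stand-in: `ListMatcher` below is a toy base class, not whoosh's, and `BisectLocator` is a hypothetical placeholder that only mimics the "smallest doc ID >= target" contract of the PR's SkipList so the example stays short and self-contained.

```python
from bisect import bisect_left


class ListMatcher:
    """Toy base class with a linear skip_to (stand-in, not whoosh's class)."""

    def __init__(self, ids):
        self._ids = ids
        self._i = 0

    def is_active(self):
        return self._i < len(self._ids)

    def id(self):
        return self._ids[self._i]

    def next(self):
        self._i += 1

    def skip_to(self, target):
        # O(n): step one posting at a time until reaching target.
        while self.is_active() and self.id() < target:
            self._i += 1


class BisectLocator:
    """Hypothetical placeholder for the PR's SkipList (same contract only)."""

    def __init__(self, ids):
        self._ids = ids

    def skip_to(self, target):
        i = bisect_left(self._ids, target)
        return self._ids[i] if i < len(self._ids) else None


class SkipListMatcher(ListMatcher):
    """Overrides only skip_to; every other method is inherited unchanged."""

    def __init__(self, ids):
        super().__init__(ids)
        self._locator = BisectLocator(ids)

    def skip_to(self, target):
        if not self.is_active() or self.id() >= target:
            return  # never move backwards
        found = self._locator.skip_to(target)
        if found is None:
            self._i = len(self._ids)  # exhausted
        else:
            # Map the located doc ID back to its index, searching forward only.
            self._i = bisect_left(self._ids, found, self._i)


m = SkipListMatcher([1, 4, 9, 16, 25])
m.skip_to(10)
print(m.id())
```

Because only skip_to() is overridden, any caller written against the base class keeps working when handed the subclass.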

IntersectionMatcher._find_next() required no changes: because SkipListMatcher is substituted for ListMatcher, the optimization propagates automatically. AND queries over two lists of n documents improve from O(n) to O(k log n), where k is the number of results.
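To show why a faster skip_to() helps an intersection without touching the intersection loop, here is a minimal leapfrog sketch. It is not whoosh's `_find_next()`; `Cursor` and `intersect` are hypothetical names, and the cursor uses `bisect_left` as a stand-in for any logarithmic skip structure.

```python
from bisect import bisect_left


class Cursor:
    """Hypothetical matcher over a sorted posting list; counts skip_to calls."""

    def __init__(self, ids):
        self.ids = ids
        self.i = 0
        self.skips = 0

    def is_active(self):
        return self.i < len(self.ids)

    def id(self):
        return self.ids[self.i]

    def next(self):
        self.i += 1

    def skip_to(self, target):
        self.skips += 1
        # Logarithmic jump to the first posting >= target, forward only.
        self.i = bisect_left(self.ids, target, self.i)


def intersect(a, b):
    """Leapfrog AND: whichever cursor is behind skips to the other's doc ID."""
    out = []
    while a.is_active() and b.is_active():
        if a.id() == b.id():
            out.append(a.id())
            a.next()
            b.next()
        elif a.id() < b.id():
            a.skip_to(b.id())
        else:
            b.skip_to(a.id())
    return out


long_list = Cursor(list(range(0, 1000, 2)))   # 500 even doc IDs
short_list = Cursor([10, 11, 500, 998])
result = intersect(long_list, short_list)
print(result, long_list.skips + short_list.skips)
```

In this asymmetric case the number of skip_to() calls is driven by the short list, so the cost of each call on the long list is what dominates the total.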

No new test files were added. SkipListMatcher is exercised through the existing search test suite — in particular tests/test_searching.py and tests/test_matching.py, which cover boolean AND queries and IntersectionMatcher behaviour. All existing tests pass with no regressions.

Performance

All benchmarks were executed inside Docker containers to isolate the runtime environment and eliminate host-specific variance from CPU scheduling, OS caching, and library versions.

Methodology

50 total runs; first 10 (warmup) and last 10 (cooldown) discarded; 30 effective runs measured.
Index: 500,000 documents, 100,000 tagged as high-frequency terms, 1,000 as low-frequency terms.
Workload: 100 boolean AND intersection queries per run pairing high-frequency and low-frequency terms, maximizing the number of skip_to() calls.
Fixed seed (42) for reproducibility.
Timing: time.perf_counter_ns() with GC disabled during measurement.
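The measurement protocol listed above can be sketched as follows. This is not the benchmark script from the research repository; `measure` and its parameter names are assumptions, and the workload is a placeholder for the 100 AND queries per run.

```python
import gc
import random
import statistics
import time


def measure(workload, total_runs=50, warmup=10, cooldown=10, seed=42):
    """Time `workload` total_runs times and keep only the middle runs."""
    random.seed(seed)  # fixed seed so any randomized query choice repeats
    samples_ns = []
    for _ in range(total_runs):
        gc_was_enabled = gc.isenabled()
        gc.disable()  # keep collector pauses out of the timed region
        try:
            start = time.perf_counter_ns()
            workload()
            elapsed = time.perf_counter_ns() - start
        finally:
            if gc_was_enabled:
                gc.enable()
        samples_ns.append(elapsed)
    # Discard the first `warmup` and last `cooldown` runs.
    kept = samples_ns[warmup:total_runs - cooldown]
    ms = [s / 1e6 for s in kept]
    return statistics.mean(ms), statistics.stdev(ms), len(ms)


# Placeholder workload standing in for the AND queries.
mean_ms, sd_ms, runs = measure(lambda: sum(range(10_000)))
print(runs)
```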

Rationale

The benefit of Skip Lists is concentrated in asymmetric intersections — when one posting list is long and the other is short, the skip structure allows jumping over large ranges of the longer list instead of stepping through each element. The benchmark was designed to stress exactly this scenario.

Results

Variant	Mean (ms)	Std dev (ms)	Runs
Baseline	9,057.86	887.20	30
Optimized	8,249.27	341.50	30

Improvement in mean: 8.93%

Analysis

The mean improvement is 8.93%. The difference between the means (about 809 ms) is smaller than the baseline standard deviation (887 ms), so the result does not clear the strict statistical confirmation threshold used in this study. The run-to-run spread does narrow, however: the standard deviation drops from 887 ms to 341 ms, roughly 38% of the baseline value, which suggests that the Skip List avoids the worst-case linear stepping scenarios that caused high variance in the baseline.
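The figures quoted in this analysis can be re-derived from the results table with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones reported above.

```python
baseline_mean, baseline_sd = 9057.86, 887.20
optimized_mean, optimized_sd = 8249.27, 341.50

diff_ms = baseline_mean - optimized_mean          # gap between the means
improvement_pct = 100 * diff_ms / baseline_mean   # relative improvement
sd_ratio = optimized_sd / baseline_sd             # spread after vs. before

print(f"difference: {diff_ms:.2f} ms")
print(f"improvement: {improvement_pct:.2f}%")
print(f"std dev ratio: {sd_ratio:.2f}")
print("difference below baseline std dev:", diff_ms < baseline_sd)
```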

In workloads with more extreme frequency asymmetry, the benefit is expected to be larger.

Reproducing the benchmark

The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.

Relevant files:

Dockerfile: setup/whoosh-reloaded/Dockerfile
Baseline script: experiments/whoosh-reloaded/baseline_whoosh-reloaded_skip-lists.py
Experiment script: experiments/whoosh-reloaded/experiment_whoosh-reloaded_skip-lists.py
Runner script: experiments/whoosh-reloaded/run_skip_lists.sh

To run:

# From the root of eda-oss-performance
bash experiments/whoosh-reloaded/run_skip_lists.sh

This builds the Docker image from setup/whoosh-reloaded/Dockerfile, runs the baseline and experiment containers sequentially, and writes results to results/whoosh-reloaded/result_whoosh-reloaded_skip-lists.json.

Feedback on the SkipListMatcher integration, the skip level construction strategy, and test coverage is welcome.

Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>

sonarqubecloud · 2026-03-12T02:39:31Z

Quality Gate failed

Failed conditions
E Reliability Rating on New Code (required ≥ C)

See analysis details on SonarQube Cloud


Predd0o and others added 2 commits March 11, 2026 22:28

matheusvir changed the title ~~perf(whoosh): add Skip Lists for AND query intersection~~ perf: add Skip Lists for AND query intersection Mar 12, 2026

feat(whoosh): implement skiplist intersection optimization

e95b714

Predd0o deleted the optimization/skiplist-and-intersection branch March 12, 2026 02:49

matheusvir wants to merge 3 commits into Sygil-Dev:main from matheusvir:optimization/skiplist-and-intersection
