perf: add Skip Lists for AND query intersection #132

Open
matheusvir wants to merge 3 commits into Sygil-Dev:main from matheusvir:optimization/skiplist-and-intersection

@matheusvir

What was done

The current skip_to() method in ListMatcher uses a linear scan to advance through posting lists (O(n)), which becomes costly in AND queries where skip_to() is called repeatedly.

This change introduces a Skip List in a new file skiplist.py with two classes:

SkipNode: represents a node storing the document ID and forward pointers for each level. Uses __slots__ to minimize memory usage.

SkipList: builds all levels in a single O(n) pass and implements skip_to() in O(log n) by descending from the highest levels.
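
The PR's skiplist.py is not shown here, so the following is only a minimal sketch of how a single-pass build and a top-down skip_to() could look. The deterministic height rule (one plus the number of trailing zero bits of the 1-based position) and the attribute names doc_id, forward, head and max_level are assumptions made for illustration, not the PR's actual code.

```python
class SkipNode:
    """One posting: a document ID plus one forward pointer per level."""
    __slots__ = ("doc_id", "forward")

    def __init__(self, doc_id, height):
        self.doc_id = doc_id
        self.forward = [None] * height


class SkipList:
    """Static skip list over an already sorted list of document IDs."""

    def __init__(self, sorted_ids):
        n = len(sorted_ids)
        self.max_level = max(1, n.bit_length())
        self.head = SkipNode(None, self.max_level)
        # last[lvl] is the most recent node linked at that level.
        last = [self.head] * self.max_level
        for i, doc_id in enumerate(sorted_ids, start=1):
            # Assumed height rule: 1 + trailing zero bits of i, so every
            # second node reaches level 1, every fourth level 2, and so on.
            # Heights sum to less than 2n, which keeps the build O(n).
            height = min(self.max_level, (i & -i).bit_length())
            node = SkipNode(doc_id, height)
            for lvl in range(height):
                last[lvl].forward[lvl] = node
                last[lvl] = node

    def skip_to(self, target):
        """Return the smallest stored doc_id >= target, or None."""
        node = self.head
        # Descend from the highest level, moving right while still below target.
        for lvl in range(self.max_level - 1, -1, -1):
            nxt = node.forward[lvl]
            while nxt is not None and nxt.doc_id < target:
                node = nxt
                nxt = node.forward[lvl]
        nxt = node.forward[0]
        return nxt.doc_id if nxt is not None else None


postings = SkipList([2, 5, 9, 14, 20])
print(postings.skip_to(6))   # first ID at or after 6
print(postings.skip_to(21))  # past the end of the list
```

This sketch always searches from the head. An implementation could also resume from the matcher's current position, which matters when skip_to() is called with increasing targets.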

A new class SkipListMatcher inherits from ListMatcher and overrides only skip_to(), using the SkipList to locate the target doc_id in O(log n) and bisect_left to find the corresponding index. All other methods (next, id, weight, score, all_ids, etc.) remain unchanged.
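
The override pattern described above can be sketched with stand-in classes. SimpleListMatcher is a hypothetical, stripped-down substitute for ListMatcher (its _ids and _i attributes are assumptions), and BisectSkipMatcher uses bisect_left on its own in place of the SkipList lookup to keep the example short. It illustrates the "override only skip_to()" design, not the PR's implementation.

```python
from bisect import bisect_left


class SimpleListMatcher:
    """Hypothetical stand-in for ListMatcher: a cursor over sorted doc IDs."""

    def __init__(self, ids):
        self._ids = ids
        self._i = 0

    def is_active(self):
        return self._i < len(self._ids)

    def id(self):
        return self._ids[self._i]

    def next(self):
        self._i += 1

    def skip_to(self, target):
        # Linear scan: advance one posting at a time until id >= target.
        while self.is_active() and self._ids[self._i] < target:
            self._i += 1


class BisectSkipMatcher(SimpleListMatcher):
    """Overrides only skip_to(); every other method is inherited unchanged."""

    def skip_to(self, target):
        # Only move forward, and only if the current ID is below the target.
        if self.is_active() and self._ids[self._i] < target:
            # Search strictly to the right of the current position.
            self._i = bisect_left(self._ids, target, self._i + 1)


ids = [1, 4, 7, 10, 13]
linear, fast = SimpleListMatcher(ids), BisectSkipMatcher(ids)
linear.skip_to(8)
fast.skip_to(8)
print(linear.id(), fast.id())  # both cursors land on the same posting
```

Because the subclass changes nothing except how the cursor index is found, any caller that only relies on the matcher interface can use it as a drop-in replacement.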

IntersectionMatcher._find_next() required no changes: substituting SkipListMatcher for ListMatcher propagates the optimization automatically. For an AND query over two posting lists of n documents each, the cost improves from O(n) to O(k log n), where k is the number of results.
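
To illustrate why a faster skip_to() helps the intersection loop without changing it, here is a rough sketch of a leapfrog intersection over two sorted ID lists. It is not the code of IntersectionMatcher._find_next(). The function name leapfrog_intersect is hypothetical, and bisect_left stands in for the logarithmic skip.

```python
from bisect import bisect_left


def leapfrog_intersect(a, b):
    """Intersect two sorted, duplicate-free lists of document IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # "skip_to(b[j])" on list a: jump over everything smaller.
            i = bisect_left(a, b[j], i + 1)
        else:
            # "skip_to(a[i])" on list b.
            j = bisect_left(b, a[i], j + 1)
    return out


high_freq = list(range(1000))       # long posting list
low_freq = [3, 500, 999, 1500]      # short posting list
print(leapfrog_intersect(high_freq, low_freq))
```

With one long and one short list, the loop performs roughly one logarithmic skip per entry of the short list instead of walking the long list element by element.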

No new test files were added. SkipListMatcher is exercised through the existing search test suite — in particular tests/test_searching.py and tests/test_matching.py, which cover boolean AND queries and IntersectionMatcher behaviour. All existing tests pass with no regressions.


Performance

All benchmarks were executed inside Docker containers to isolate the runtime environment and reduce host-specific variance from CPU scheduling, OS caching, and library versions.

Methodology

  • 50 total runs; first 10 (warmup) and last 10 (cooldown) discarded; 30 effective runs measured.
  • Index: 500,000 documents, 100,000 tagged as high-frequency terms, 1,000 as low-frequency terms.
  • Workload: 100 boolean AND intersection queries per run pairing high-frequency and low-frequency terms, maximizing the number of skip_to() calls.
  • Fixed seed (42) for reproducibility.
  • Timing: time.perf_counter_ns() with GC disabled during measurement.
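
The measurement loop described in the list above could look roughly like the following. This is a sketch, not the benchmark script from the research repository: the function name measure and the placeholder workload are assumptions, and only the run counts, the seed, and the timing calls come from the methodology.

```python
import gc
import random
import time


def measure(workload, total_runs=50, warmup=10, cooldown=10):
    """Time `workload` total_runs times and keep only the middle runs (ns)."""
    samples = []
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the timings
    try:
        for _ in range(total_runs):
            start = time.perf_counter_ns()
            workload()
            samples.append(time.perf_counter_ns() - start)
    finally:
        if gc_was_enabled:
            gc.enable()
    # Drop the first `warmup` and last `cooldown` runs.
    return samples[warmup:total_runs - cooldown]


random.seed(42)  # fixed seed, so generated queries are reproducible
kept = measure(lambda: sum(range(10_000)))  # placeholder workload
print(len(kept), "effective runs")
```

In the real benchmark the workload would be the 100 AND queries executed against the index, run once per timed iteration.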

Rationale

The benefit of Skip Lists is concentrated in asymmetric intersections — when one posting list is long and the other is short, the skip structure allows jumping over large ranges of the longer list instead of stepping through each element. The benchmark was designed to stress exactly this scenario.

Results

| Variant | Mean (ms) | Std dev (ms) | Runs |
|---|---|---|---|
| Baseline | 9,057.86 | 887.20 | 30 |
| Optimized | 8,249.27 | 341.50 | 30 |
| Improvement | 8.93% | | |

[Figure: Skip Lists benchmark comparison]

Analysis

The mean improvement is 8.93%. The difference between the means (about 809 ms) is smaller than the baseline standard deviation (887 ms), so it does not clear the strict statistical confirmation threshold used in this study. The optimization does, however, show a consistent effect on variance: the standard deviation drops from 887 ms to 341 ms, to roughly 38% of its baseline value, which suggests that the Skip List eliminates the worst-case linear stepping scenarios that caused high variance in the baseline.
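
The figures in this analysis can be rechecked from the Results table with a few lines of arithmetic. The variable names are illustrative. The comparison against the baseline standard deviation simply restates the criterion described above and is not a formal significance test.

```python
baseline_mean, baseline_std = 9057.86, 887.20
optimized_mean, optimized_std = 8249.27, 341.50

diff = baseline_mean - optimized_mean          # absolute gain in ms
improvement_pct = 100 * diff / baseline_mean   # relative gain
std_ratio = optimized_std / baseline_std       # remaining share of the spread

print(f"difference: {diff:.2f} ms")
print(f"improvement: {improvement_pct:.2f}%")
print(f"std dev ratio: {std_ratio:.2f}")
print("difference exceeds baseline std dev:", diff > baseline_std)
```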

In workloads with more extreme frequency asymmetry, the benefit is expected to be larger.

Reproducing the benchmark

The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.

Relevant files:

To run:

# From the root of eda-oss-performance
bash experiments/whoosh-reloaded/run_skip_lists.sh

This builds the Docker image from setup/whoosh-reloaded/Dockerfile, runs the baseline and experiment containers sequentially, and writes results to results/whoosh-reloaded/result_whoosh-reloaded_skip-lists.json.


Feedback on the SkipListMatcher integration, the skip level construction strategy, and test coverage is welcome.

Predd0o and others added 2 commits March 11, 2026 22:28
Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>
@matheusvir matheusvir changed the title perf(whoosh): add Skip Lists for AND query intersection perf: add Skip Lists for AND query intersection Mar 12, 2026
@sonarqubecloud

Quality Gate failed

Failed conditions
E Reliability Rating on New Code (required ≥ C)

See analysis details on SonarQube Cloud

@Predd0o Predd0o deleted the optimization/skiplist-and-intersection branch March 12, 2026 02:49