perf: add Skip Lists for AND query intersection #132
Open
matheusvir wants to merge 3 commits into Sygil-Dev:main from
Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>

What was done
The current `skip_to()` method in `ListMatcher` uses a linear scan to advance through posting lists (O(n) per call), which becomes costly in AND queries where `skip_to()` is called repeatedly.

This change introduces a Skip List in a new file, `skiplist.py`, with two classes:

- `SkipNode`: represents a node storing the document ID and forward pointers for each level. Uses `__slots__` to minimize memory usage.
- `SkipList`: builds all levels in a single O(n) pass and implements `skip_to()` in O(log n) by descending from the highest level.

A new class, `SkipListMatcher`, inherits from `ListMatcher` and overrides only `skip_to()`, using the `SkipList` to locate the target `doc_id` in O(log n) and `bisect_left` to find the corresponding index. All other methods (`next`, `id`, `weight`, `score`, `all_ids`, etc.) remain unchanged.

`IntersectionMatcher._find_next()` required no changes: substituting `SkipListMatcher` for `ListMatcher` propagates the optimization automatically. AND queries over two lists of n documents improve from O(n) to O(k log n), where k is the number of results.

No new test files were added. `SkipListMatcher` is exercised through the existing search test suite, in particular `tests/test_searching.py` and `tests/test_matching.py`, which cover boolean AND queries and `IntersectionMatcher` behaviour. All existing tests pass with no regressions.

Performance
All benchmarks were executed inside Docker containers to isolate the runtime environment and reduce host-specific variance from CPU scheduling, OS caching, and library versions.
Methodology
- [...] `skip_to()` calls.
- [...] `time.perf_counter_ns()` with GC disabled during measurement.

Rationale
The benefit of Skip Lists is concentrated in asymmetric intersections — when one posting list is long and the other is short, the skip structure allows jumping over large ranges of the longer list instead of stepping through each element. The benchmark was designed to stress exactly this scenario.
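
To make the asymmetric case concrete, here is a minimal sketch of a skip list over a sorted posting list, probed by a short list of IDs. It is not the code in `skiplist.py`: the level rule (node i joins level L when i is a multiple of 2**L), the `-1` sentinel, the hop counter, and the `intersect` helper are all assumptions made for illustration.

```python
from bisect import bisect_left


class SkipNode:
    """One posting: a document ID plus one forward pointer per level."""
    __slots__ = ("doc_id", "forward")

    def __init__(self, doc_id, height):
        self.doc_id = doc_id
        self.forward = [None] * height


class SkipList:
    """Deterministic skip list built in one pass over sorted doc IDs."""

    def __init__(self, sorted_ids):
        self.height = max(1, len(sorted_ids).bit_length())
        # Sentinel head; assumes real doc IDs are non-negative.
        self.head = SkipNode(-1, self.height)
        last = [self.head] * self.height  # last node linked on each level
        for i, doc_id in enumerate(sorted_ids, start=1):
            # (i & -i) isolates the lowest set bit, so node i gets
            # one level per power of two that divides i, plus level 0.
            h = min(self.height, (i & -i).bit_length())
            node = SkipNode(doc_id, h)
            for level in range(h):
                last[level].forward[level] = node
                last[level] = node

    def skip_to(self, target):
        """Return (first doc_id >= target or None, pointer hops taken)."""
        node, hops = self.head, 0
        for level in reversed(range(self.height)):
            nxt = node.forward[level]
            while nxt is not None and nxt.doc_id < target:
                node, hops = nxt, hops + 1
                nxt = node.forward[level]
        nxt = node.forward[0]
        return (nxt.doc_id if nxt is not None else None), hops


def intersect(short_ids, long_ids):
    """AND two sorted lists by probing the long one through its skip list."""
    skips = SkipList(long_ids)
    matches, total_hops = [], 0
    for doc_id in short_ids:
        found, hops = skips.skip_to(doc_id)
        total_hops += hops
        if found == doc_id:
            # bisect_left maps the ID back to its position in the list.
            matches.append((doc_id, bisect_left(long_ids, found)))
    return matches, total_hops


long_ids = list(range(0, 100_000, 2))      # 50,000 even IDs
short_ids = [10, 5001, 70_000, 99_998]     # 5001 is absent from long_ids
matches, total_hops = intersect(short_ids, long_ids)
print(matches)     # [(10, 5), (70000, 35000), (99998, 49999)]
print(total_hops)  # pointer hops across all four probes
```

With this construction each probe takes at most one hop per level, so the four probes stay within 4 × 16 hops, whereas a linear scan would step through tens of thousands of entries to reach the later IDs.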
Results
Analysis
The mean improvement is 8.93%. The difference between means (808 ms) is smaller than the baseline standard deviation (887 ms), so it does not clear the strict statistical confirmation threshold used in this study. However, the optimization produces a consistent effect: the standard deviation drops from 887 ms to 341 ms, roughly 38% of the baseline value, which suggests that the Skip List eliminates the worst-case linear stepping scenarios that caused high variance in the baseline.
In workloads with more extreme frequency asymmetry, the benefit is expected to be larger.
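
The measurement and the threshold check described above can be sketched roughly as follows. This is not the benchmark script from the research repository: the function names, the default of 30 repeats, and the reading of the threshold as "difference between means must exceed the baseline standard deviation" are assumptions based on the wording of this description.

```python
import gc
import statistics
import time


def time_runs(func, repeats=30):
    """Time func() `repeats` times in nanoseconds with the GC switched off."""
    samples = []
    gc_was_enabled = gc.isenabled()
    gc.disable()
    try:
        for _ in range(repeats):
            start = time.perf_counter_ns()
            func()
            samples.append(time.perf_counter_ns() - start)
    finally:
        # Restore the collector only if it was on before measuring.
        if gc_was_enabled:
            gc.enable()
    return samples


def summarize(samples_ns):
    """Return (mean, sample standard deviation) in milliseconds."""
    ms = [s / 1e6 for s in samples_ns]
    return statistics.mean(ms), statistics.stdev(ms)


def clears_threshold(mean_diff_ms, baseline_std_ms):
    """Assumed criterion: the gap between means exceeds the baseline std."""
    return mean_diff_ms > baseline_std_ms


# Figures quoted in the Analysis section.
print(clears_threshold(808, 887))  # False
print(round(341 / 887, 2))         # 0.38
```

A call such as `summarize(time_runs(run_query))`, where `run_query` is a hypothetical zero-argument function executing one AND query, would yield the mean and standard deviation for one configuration.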
Reproducing the benchmark
The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.
Relevant files:
- setup/whoosh-reloaded/Dockerfile
- experiments/whoosh-reloaded/baseline_whoosh-reloaded_skip-lists.py
- experiments/whoosh-reloaded/experiment_whoosh-reloaded_skip-lists.py
- experiments/whoosh-reloaded/run_skip_lists.sh

To run:

    # From the root of eda-oss-performance
    bash experiments/whoosh-reloaded/run_skip_lists.sh

This builds the Docker image from `setup/whoosh-reloaded/Dockerfile`, runs the baseline and experiment containers sequentially, and writes results to `results/whoosh-reloaded/result_whoosh-reloaded_skip-lists.json`.

Feedback on the `SkipListMatcher` integration, the skip level construction strategy, and test coverage is welcome.