[DiskBBQ] Break big posting lists into blocks #132498
What if we just always put the block of doc ids at the start of every block? So every block is:

`[blk0, blk1, blk2, ... tail][[encoded doc ids, vectors], ..., [tail encoded doc ids, vectors]]`

We know the block size (16), and we know the previous base block (if we want to delta-encode eventually). If we ever split SOAR and regular docs, we can delta-encode with the "doc base" (just like a regular postings list).
Are we concerned about speed or just the size increase?
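For concreteness, here is a minimal Java sketch of that layout on the write side. The class name, the plain-vint doc-id encoding, and the `byte[][]` vector representation are all placeholders for illustration, not the actual DiskBBQ writer:

```java
import java.io.DataOutputStream;
import java.io.IOException;

final class BlockedPostingsSketch {
    static final int BLOCK_SIZE = 16; // the fixed block size mentioned above

    // docIds.length == vectors.length; vectors are already-quantized bytes.
    static void write(DataOutputStream out, int[] docIds, byte[][] vectors) throws IOException {
        for (int base = 0; base < docIds.length; base += BLOCK_SIZE) {
            int count = Math.min(BLOCK_SIZE, docIds.length - base); // tail block may be short
            // 1) the encoded doc ids for this block come first ...
            for (int i = 0; i < count; i++) {
                writeVInt(out, docIds[base + i]);
            }
            // 2) ... followed by the vectors of the same block, in the same order
            for (int i = 0; i < count; i++) {
                out.write(vectors[base + i]);
            }
        }
    }

    // Plain vint encoding as a stand-in for whatever the real writer uses.
    static void writeVInt(DataOutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }
}
```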
This is fair, I can have a go at it.
Introduced this approach in 71e30a1. Much simpler.
Much simpler indeed. My only concern is index size & performance. I would expect them to be mostly comparable, but you never know.
Given the way docIdsWriter works, I would expect better compression of the doc ids at the penalty of one byte per 16 vectors, so all in all it should be the same size or even smaller (I am checking).
I don't expect, and don't see, any performance implications.
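To illustrate where that one-byte cost comes from, here is a hypothetical per-block writer where each group of up to 16 doc ids pays exactly one flag byte; the two encodings are invented for the sketch and are not the real docIdsWriter set:

```java
import java.io.DataOutputStream;
import java.io.IOException;

final class PerBlockDocIdsSketch {
    static final byte ENC_SHORT = 0; // every id in this block fits in 16 bits
    static final byte ENC_INT = 1;   // fallback: full 32-bit ints

    // One flag byte per block of up to 16 ids: the "one byte per 16 vectors" cost.
    static void writeBlock(DataOutputStream out, int[] ids, int off, int len) throws IOException {
        boolean fitsShort = true;
        for (int i = 0; i < len; i++) {
            if (ids[off + i] > 0xFFFF) {
                fitsShort = false;
                break;
            }
        }
        out.writeByte(fitsShort ? ENC_SHORT : ENC_INT);
        for (int i = 0; i < len; i++) {
            if (fitsShort) {
                out.writeShort(ids[off + i]); // tighter than 4 bytes when ids are small
            } else {
                out.writeInt(ids[off + i]);
            }
        }
    }
}
```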
Checking the posting list size of 1M vectors with 1024 dims:

main: 302,780,965 bytes
PR: 301,815,585 bytes
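That is 965,380 bytes, or roughly 0.3%, smaller than main.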
even smaller?!?!? awesome!
Flamegraphs show a performance penalty (because of the extra byte):

[flamegraph: main]
[flamegraph: PR]
Ah, it breaks alignment, which is frustrating.
I am testing with sorted doc ids right now (now that we aren't skipping duplicate vectors).
GroupVarInt also has a "single byte read" to determine the output flag, but having a single-byte read for every group of 16 integers does seem weird.
I wonder if we can do something more clever by delta-encoding all the doc ids (we read all the blocks in order anyway, so we can keep a running sum) and picking an encoding that works for all the blocks. Then we can write that encoding byte at the front of the entire list and have a uniform encoding for every block. This might be slightly less disk-efficient, but it will likely align better.
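A rough sketch of that idea, assuming sorted doc ids and an invented pair of fixed-width encodings, with the one encoding choice made up front for the entire list:

```java
import java.io.DataOutputStream;
import java.io.IOException;

final class UniformEncodingSketch {
    static final byte ENC_SHORT_DELTAS = 0; // every delta fits in 16 bits
    static final byte ENC_INT_DELTAS = 1;   // fallback: full 32-bit deltas

    static void write(DataOutputStream out, int[] sortedDocIds) throws IOException {
        // Pass 1: find the widest delta across the whole list (i.e. all blocks).
        int maxDelta = sortedDocIds.length > 0 ? sortedDocIds[0] : 0;
        for (int i = 1; i < sortedDocIds.length; i++) {
            maxDelta = Math.max(maxDelta, sortedDocIds[i] - sortedDocIds[i - 1]);
        }
        byte enc = maxDelta <= 0xFFFF ? ENC_SHORT_DELTAS : ENC_INT_DELTAS;
        out.writeByte(enc); // a single encoding byte at the front of the entire list

        // Pass 2: fixed-width deltas, uniform for every block. The reader keeps a
        // running sum, which works because the blocks are always read in order.
        int prev = 0;
        for (int id : sortedDocIds) {
            int delta = id - prev;
            if (enc == ENC_SHORT_DELTAS) {
                out.writeShort(delta);
            } else {
                out.writeInt(delta);
            }
            prev = id;
        }
    }
}
```

The trade-off mirrors the comment above: one outlier delta forces the wide encoding on every block, costing some disk, but every block then has the same fixed width, which keeps the layout aligned and avoids a per-block flag read.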