Conversation

@iverase (Contributor) commented Aug 6, 2025

The hierarchical k-means algorithm tries to build clusters close to a recommended size, but depending on the input data it can end up building much bigger clusters. As an extreme case, if you add the same vector over and over again, you end up with just one cluster containing all the vectors.

This is dangerous at search time because we read the docIds of a posting list before scoring the vectors, so we would need to read all the docIds in one go. This PR prevents that by writing big posting lists in blocks, so we can read block by block when the posting list is bigger than a threshold. I have arbitrarily chosen 1600 as the threshold for writing blocks.
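
To make the idea concrete, here is a minimal sketch of the write side (hypothetical helper names, not the actual Elasticsearch code; only the 1600 threshold comes from this PR):

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

// Sketch only: writeDocIds is a hypothetical helper standing in for the
// real doc id encoding logic.
static final int BLOCK_THRESHOLD = 1600;

static void writePostingList(IndexOutput out, int[] docIds) throws IOException {
    if (docIds.length <= BLOCK_THRESHOLD) {
        // Small list: write all doc ids up front, as before.
        writeDocIds(out, docIds, 0, docIds.length);
    } else {
        // Big list: emit fixed-size blocks so the reader can decode
        // one block of doc ids at a time instead of all of them in one go.
        for (int start = 0; start < docIds.length; start += BLOCK_THRESHOLD) {
            int len = Math.min(BLOCK_THRESHOLD, docIds.length - start);
            writeDocIds(out, docIds, start, len);
        }
    }
}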

@john-wagster (Contributor)

The approach makes sense. I was wondering why we might want to solve this here instead of in the hkmeans algo, though. I need to go look again, but I would think that ideally we'd stack a lot of duplicative centroids on top of one another, effectively creating these blocks anyway by arbitrarily breaking the duplicative vectors up into duplicative centroids. I'll go review the code some more.

@benwtrent (Member)

I think the real fix is to not read the entire doc id set in and hold it in memory. If we knew they were in ascending order, can we utilize some off-heap reader?
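
For illustration, such an off-heap reader could look roughly like this (purely hypothetical sketch, assuming the ids are stored as ascending vInt deltas):

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Hypothetical: decode one doc id per call straight from the IndexInput
// instead of materializing the whole set on heap.
final class LazyDocIdIterator {
    private final IndexInput in;
    private final int count;
    private int read = 0;
    private int doc = 0;

    LazyDocIdIterator(IndexInput in, int count) {
        this.in = in;
        this.count = count;
    }

    int nextDoc() throws IOException {
        if (read >= count) {
            return -1; // exhausted
        }
        read++;
        doc += in.readVInt(); // purely sequential read, no seeking
        return doc;
    }
}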

@iverase (Contributor, Author) commented Aug 7, 2025

> I think the real fix is to not read the entire doc id set in and hold it in memory

I don't know about this; it seems it would break the disk friendliness of the format if we need to do what feels like random reads on the doc ids.

@benwtrent (Member)

> I don't know about this; it seems it would break the disk friendliness of the format if we need to do what feels like random reads on the doc ids.

@iverase why would it be random?

We always read postings lists in order. Do you mean random in that we need to go back to the "start" of the list to decode the next block of docs?

@iverase (Contributor, Author) commented Aug 7, 2025

> Do you mean random in that we need to go back to the "start" of the list to decode the next block of docs?

yes


@Override
public int visit(KnnCollector knnCollector) throws IOException {
    byte postingListType = indexInput.readByte();
Member:

What if we just always put the block of doc ids at the start of every block?

So every block is:

[blk0, blk1, blk2, ... tail] [[encoded doc ids, vectors], ... [tail encoded doc ids, vectors]]

We know the block size (16), we know the previous base block (if we want to delta encode eventually).

If we ever split SOAR and regular docs, we can delta encode with the "doc base" (just like regular postings lists).

Are we concerned about speed or just size increase?
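
On the write side, the interleaving could look roughly like this (a sketch under the assumptions above: groups of 16, doc ids before vectors; all names hypothetical):

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

// Sketch: for each group of BLOCK_SIZE vectors, write the group's doc ids
// first, then the encoded vectors, so the reader only ever buffers one
// group of doc ids at a time.
static final int BLOCK_SIZE = 16;

static void writeInterleaved(IndexOutput out, int[] docIds, byte[][] encodedVectors)
        throws IOException {
    for (int start = 0; start < docIds.length; start += BLOCK_SIZE) {
        int len = Math.min(BLOCK_SIZE, docIds.length - start); // last group is the tail
        for (int i = start; i < start + len; i++) {
            out.writeVInt(docIds[i]); // could delta-encode against a "doc base" later
        }
        for (int i = start; i < start + len; i++) {
            out.writeBytes(encodedVectors[i], encodedVectors[i].length);
        }
    }
}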

Contributor Author:

This is fair, I can have a go at it.

Contributor Author:

Introduced this approach in 71e30a1. Much simpler.

Member:

Much simpler indeed. My only concern is index size & performance. I would expect them to be mostly comparable, but you never know.

Contributor Author:

Given the way docIdsWriter works, I would expect better compression of docIds, with the penalty of one byte per 16 vectors, so all in all it should be the same size or even smaller (I am checking). I don't expect, nor do I see, any performance implications.

@iverase (Contributor, Author) commented Aug 7, 2025

Checking the posting list size of 1M vectors with 1024 dims:

main: 302,780,965 bytes
PR: 301,815,585 bytes
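
Back-of-the-envelope check (assuming the one metadata byte per group of 16 doc ids mentioned above): 1,000,000 / 16 = 62,500 bytes of extra per-group overhead, against a net saving of 302,780,965 - 301,815,585 = 965,380 bytes, so the grouped docIdsWriter encoding more than pays for the extra bytes.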

Member:

even smaller?!?!? awesome!

@iverase (Contributor, Author) commented Aug 7, 2025

Flamegraphs show a performance penalty (because of the extra byte).

main: [flamegraph image]

PR: [flamegraph image]

Member:

Ah, it breaks alignment, which is frustrating.

I am testing with sorted doc-ids right now (now that we aren't skipping duplicate vectors).

GroupVarInt also has a "single byte read" to determine the output flag. Having a single byte read for every group of 16 integers does seem weird.

I wonder if we can do something more clever by delta-encoding all the doc ids (we read all the blocks in order anyway, so we can keep the running sum) and picking an encoding that works for all the blocks. Then we can write that encoding byte at the front of the entire list and have uniform encoding for every block.

This might be slightly less disk efficient, but it will likely align better.
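
A sketch of that two-pass idea (hypothetical; writeFixedBits stands in for a bit-packing helper that does not exist in the codebase):

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

// Hypothetical two-pass scheme: scan all deltas once to pick a single
// encoding for the whole list, write one encoding byte at the front,
// then encode every block uniformly so per-block reads stay aligned.
static void writeUniformEncoding(IndexOutput out, int[] sortedDocIds) throws IOException {
    int maxDelta = 0;
    int prev = 0;
    for (int doc : sortedDocIds) { // pass 1: find the widest delta
        maxDelta = Math.max(maxDelta, doc - prev);
        prev = doc;
    }
    byte bitsPerDelta = (byte) (32 - Integer.numberOfLeadingZeros(Math.max(1, maxDelta)));
    out.writeByte(bitsPerDelta); // single encoding byte for the entire posting list
    prev = 0;
    for (int doc : sortedDocIds) { // pass 2: same width for every block
        writeFixedBits(out, doc - prev, bitsPerDelta); // hypothetical bit packer
        prev = doc;
    }
}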

iverase added 3 commits August 7, 2025 14:41
@iverase closed this Sep 5, 2025