Contributor

@iverase iverase commented Aug 25, 2025

Merging big segments in DiskBBQ on a small node can take a very long time. Flame graphs show lots of page faults during the hierarchical k-means recursion, in particular when creating FloatVectorValuesSlice:

[Image: flame graph]

This PR explores a different approach: instead of slicing big off-heap FloatVectorValues using FloatVectorValuesSlice, we read all the data once and write each cluster (currently 128) into its own file. The recursion then processes one file at a time. First results show a nice improvement on memory-constrained JVMs, with little degradation when there is plenty of heap. The downside is data amplification, as we need an extra copy of the vectors on disk.
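To make the idea concrete, here is a minimal sketch in plain Java. It is not the code in this PR: `ClusterSpill`, `partitionToFiles`, `readCluster`, and the `cluster-N.vec` file naming are hypothetical names, and the assignment step is assumed to be a plain nearest-centroid scan by squared Euclidean distance. It only shows the shape of the approach: one sequential pass that appends each vector to its cluster's file, after which each file can be loaded and processed on its own.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; names and file layout are assumptions.
final class ClusterSpill {

    // Index of the closest centroid by squared Euclidean distance.
    static int nearest(float[] v, float[][] centroids) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            float dist = 0f;
            for (int d = 0; d < v.length; d++) {
                float diff = v[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Single pass over the vectors: each one is appended to the file of its cluster.
    static Path[] partitionToFiles(Iterator<float[]> vectors, float[][] centroids, Path dir)
            throws IOException {
        Path[] files = new Path[centroids.length];
        DataOutputStream[] outs = new DataOutputStream[centroids.length];
        try {
            for (int c = 0; c < centroids.length; c++) {
                files[c] = dir.resolve("cluster-" + c + ".vec");
                outs[c] = new DataOutputStream(
                        new BufferedOutputStream(Files.newOutputStream(files[c])));
            }
            while (vectors.hasNext()) {
                float[] v = vectors.next();
                DataOutputStream out = outs[nearest(v, centroids)];
                for (float f : v) {
                    out.writeFloat(f);
                }
            }
        } finally {
            for (DataOutputStream out : outs) {
                if (out != null) {
                    out.close();
                }
            }
        }
        return files;
    }

    // Loads one cluster file back onto the heap so the recursion can work on it alone.
    static List<float[]> readCluster(Path file, int dims) throws IOException {
        List<float[]> result = new ArrayList<>();
        long count = Files.size(file) / (4L * dims); // 4 bytes per float
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (long i = 0; i < count; i++) {
                float[] v = new float[dims];
                for (int d = 0; d < dims; d++) {
                    v[d] = in.readFloat();
                }
                result.add(v);
            }
        }
        return result;
    }
}
```

In this sketch the source vectors are touched exactly once and sequentially, which is what avoids the random access pattern behind the page faults. The cost is visible too: every vector is written to disk a second time.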

@benwtrent
Member

closes: #133812
