Automatic Partition Count Derivation for ACE #1603
base: main
Conversation
- Added `cuvsHnswAceParams` structure for ACE configuration.
- Implemented `cuvsHnswBuild` function to facilitate index construction using ACE.
- Updated HNSW index parameters to include ACE settings.
- Created new tests for HNSW index building and searching using ACE.
- Updated documentation to reflect the new ACE parameters and usage.
- Added a heuristic to automatically derive the number of partitions based on host and device memory requirements.
- Increased the user-provided `npartitions` if it does not fit in memory.
- Introduced `max_host_memory_gb` and `max_gpu_memory_gb` fields to the `cuvsAceParams` and `cuvsHnswAceParams` structures for controlling memory usage during ACE builds.
- Added tests to verify that small memory limits trigger disk mode correctly for both CAGRA and HNSW index builds.
- Renamed parameter `m` to `M` in HNSW structures and related functions for consistency.
- Removed `ef_construction` from `cuvsHnswAceParams` and related classes, as it is no longer needed.
- Load the HNSW index from file before search if needed.
KyleFromNVIDIA left a comment
Approved CMake changes
tfeher left a comment
Thanks Julian for the PR; it is great to have the `n_partitions` parameter automatically determined. I have reviewed the C/C++/Python changes. I have suggestions for the formulas used for memory estimation, but in general the PR is in good shape.
```cpp
size_t gpu_sub_graph_size = imbalance_factor * 2 * (dataset_size / n_partitions) *
                            (intermediate_degree + graph_degree) * sizeof(IdxT);
size_t gpu_workspace_size = gpu_sub_dataset_size;
size_t disk_mode_gpu_required = gpu_sub_dataset_size + gpu_sub_graph_size + gpu_workspace_size;
```
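For concreteness, here is a quick back-of-the-envelope check of this estimate in Python. All inputs below (dataset shape, degrees, index width, and the definition of `gpu_sub_dataset_size`) are illustrative assumptions, not values from the PR.

```python
# Illustrative check of the quoted GPU-memory estimate. The inputs and the
# definition of gpu_sub_dataset_size are assumptions for this sketch only.
imbalance_factor = 2
dataset_size = 100_000_000             # rows
n_partitions = 10
dim = 128                              # float32 vectors, 4 bytes per value
intermediate_degree, graph_degree = 128, 64
sizeof_idx = 4                         # e.g. a 4-byte IdxT

rows_per_partition = dataset_size // n_partitions
gpu_sub_dataset_size = imbalance_factor * rows_per_partition * dim * 4
gpu_sub_graph_size = (imbalance_factor * 2 * rows_per_partition
                      * (intermediate_degree + graph_degree) * sizeof_idx)
gpu_workspace_size = gpu_sub_dataset_size
disk_mode_gpu_required = gpu_sub_dataset_size + gpu_sub_graph_size + gpu_workspace_size
print(f"{disk_mode_gpu_required / 1e9:.1f} GB")  # 51.2 GB with these inputs
```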
Why do we need a workspace that is equivalent to `sub_dataset_size`?
We should also keep in mind that memory requirements depend on the build algorithm that we use. Having `gpu_sub_dataset_size + gpu_sub_graph_size` is a good upper limit for now. Just for reference, I expect that we have the following actual memory usage:
- IVF-PQ: `max(pq_compressed_sub_dataset_size, gpu_sub_graph_size)`
- NN descent: `gpu_sub_dataset_size_fp16 + gpu_sub_graph_size`
- Iterative solver: `max(gpu_sub_dataset_size, gpu_sub_graph_size)`
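A small sketch comparing these three expected peaks against the conservative upper limit, reusing the illustrative sizes from the snippet above (the PQ compression ratio and fp16 halving are assumptions, not from the review):

```python
# Sizes carried over from the previous illustrative sketch (bytes).
gpu_sub_dataset_size = 10_240_000_000
gpu_sub_graph_size = 30_720_000_000
# Assumptions: PQ compresses the sub-dataset ~8x; fp16 halves it.
pq_compressed_sub_dataset_size = gpu_sub_dataset_size // 8
gpu_sub_dataset_size_fp16 = gpu_sub_dataset_size // 2

ivf_pq_peak = max(pq_compressed_sub_dataset_size, gpu_sub_graph_size)
nn_descent_peak = gpu_sub_dataset_size_fp16 + gpu_sub_graph_size
iterative_peak = max(gpu_sub_dataset_size, gpu_sub_graph_size)

# Each algorithm-specific peak stays below the conservative upper limit.
upper_limit = gpu_sub_dataset_size + gpu_sub_graph_size
assert max(ivf_pq_peak, nn_descent_peak, iterative_peak) <= upper_limit
```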
Please let me know if you have a better estimate of `gpu_workspace_size`. The limiter I have found seems to be `build_knn_graph` calling `sort_knn_graph`, which creates a copy of the dataset (graph_core.cuh#L532-L537).
Thanks for these limits. This is very helpful.
Indeed sorting would need the dataset. But that is done after KNN building, so we reuse the space that we reserved for the dataset during that phase. Therefore I would not include dataset size in workspace.
I went through `optimize`, and here is the workspace memory usage. Could you define a helper function and include this in the memory estimate?
```python
def optimize_workspace_size(N, deg, ideg, S, mst_opt=False):
    """Calculates CAGRA optimize memory usage.

    This is the working memory on top of the input/output host memory
    usage (N * (deg + ideg) * S).

    N    - number of rows in the dataset
    deg  - graph degree
    ideg - intermediate graph degree
    S    - graph type size
    """
    mst_host = N * S                           # mst_graph_num_edges
    if mst_opt:
        mst_host += N * deg * S                # mst_graph allocated in optimize
        mst_host += N * deg * S                # mst_graph allocated in mst_optimize
        mst_host += N * S * 7                  # vectors with _max_edges suffix
        mst_host += (deg - 1) * (deg - 1) * S  # iB_candidates
    prune_host = N * ideg * 1                  # detour count
    prune_dev = N * ideg * 1                   # detour count
    prune_dev += N * 4                         # d_num_detour_edges
    prune_dev += N * ideg * S                  # d_input_graph
    # We neglect 8 bytes (both on host and device) for stats
    rev_host = N * deg * S                     # rev_graph
    rev_host += N * 4                          # rev_graph_count
    rev_host += N * S                          # dest_nodes
    rev_dev = N * deg * S                      # d_rev_graph
    rev_dev += N * 4                           # d_rev_graph_count
    rev_dev += N * 4                           # d_dest_nodes
    # Memory for merging graphs
    combine_host = (N * 4 + deg * 4) / 1e9     # in_edge_count + hist
    mst_host /= 1e9
    prune_host /= 1e9
    prune_dev /= 1e9
    rev_host /= 1e9
    rev_dev /= 1e9
    print("Prune host {:4.2f} GB, dev {:4.2f} GB".format(prune_host, prune_dev))
    print("Rev host {:4.2f} GB, dev {:4.2f} GB".format(rev_host, rev_dev))
    print("MST host {:4.2f} GB".format(mst_host))
    total_host = mst_host + max(prune_host, rev_host, combine_host)
    total_dev = max(prune_dev, rev_dev)
    print("Total host {:4.2f} GB, dev {:4.2f} GB".format(total_host, total_dev))
    return total_host, total_dev
```
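For reference, calling the helper with illustrative sizes (100M rows, graph degree 64, intermediate degree 128, 4-byte graph indices; the numbers are assumptions, not from the review):

```python
# Illustrative invocation; the inputs are assumptions for this example only.
host_gb, dev_gb = optimize_workspace_size(N=100_000_000, deg=64, ideg=128, S=4)
# Prune host 12.80 GB, dev 64.40 GB
# Rev host 26.40 GB, dev 26.40 GB
# MST host 0.40 GB
# Total host 26.80 GB, dev 64.40 GB
```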
Thank you, this is very helpful. I have added a helper which is used in host and device memory calculations.
- Update this once we have enough data to recommend this again.
Thanks Julian for the update. I had one more comment on the workspace size estimate; otherwise the code looks good.
This PR adds automatic partition count derivation for ACE (Augmented Core Extraction) graph builds based on available system memory.
Previously, users had to manually calculate and specify the number of partitions based on their dataset size and available memory. This was error-prone and required understanding the internal memory requirements of the ACE algorithm. With auto-derivation, users get optimal partitioning out of the box while still having the option to override if needed.
Changes
- When `npartitions` is set to `0` (the new default), ACE automatically calculates the optimal number of partitions based on available host and GPU memory.
- When `npartitions` is set to a positive value, the specified count is used but may be automatically increased if it would exceed memory limits.
- Added `max_host_memory_gb` and `max_gpu_memory_gb` parameters to allow users to constrain memory usage (useful for shared systems or testing).

This builds on top of PR #1597, which should be merged first.
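To make the behavior concrete, here is a minimal Python sketch of the derivation logic described above. The helper name, the assumption that memory requirements scale roughly as `1/npartitions`, and the example numbers are all hypothetical; the actual implementation lives in the C++ ACE build path and uses more detailed estimates.

```python
def derive_npartitions(host_required_gb, gpu_required_gb,
                       max_host_memory_gb, max_gpu_memory_gb,
                       npartitions=0):
    """Hypothetical sketch of ACE's automatic partition-count derivation.

    Assumes per-partition memory scales roughly as 1/npartitions; the
    real estimator in the PR is more detailed.
    """
    n = max(npartitions, 1)  # npartitions == 0 means "derive automatically"
    # Increase the partition count until one partition's share of the
    # build fits within both the host and GPU memory limits.
    while (host_required_gb / n > max_host_memory_gb
           or gpu_required_gb / n > max_gpu_memory_gb):
        n += 1
    return n

# With the new default (npartitions=0) the count is derived from memory;
# a user-provided value is kept unless it would exceed the limits.
print(derive_npartitions(200, 80, max_host_memory_gb=64,
                         max_gpu_memory_gb=40))  # -> 4
```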