
Conversation

Contributor

@julianmi julianmi commented Dec 2, 2025

This PR adds automatic partition count derivation for ACE (Augmented Core Extraction) graph builds based on available system memory.

Previously, users had to manually calculate and specify the number of partitions based on their dataset size and available memory. This was error-prone and required understanding the internal memory requirements of the ACE algorithm. With auto-derivation, users get optimal partitioning out of the box while still having the option to override if needed.

Changes

  • When npartitions is set to 0 (the new default), ACE automatically derives the number of partitions from the available host and GPU memory.
  • When npartitions is set to a positive value, the specified count is used, but it may be increased automatically if it would exceed memory limits (see the sketch after this list).
  • Added max_host_memory_gb and max_gpu_memory_gb parameters to allow users to constrain memory usage (useful for shared systems or testing).
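A rough sketch of that behavior (the helper below and its simplified per-partition cost model are illustrative assumptions, not the PR's actual heuristic, which also accounts for host memory and the build workspace):

def derive_npartitions(n_rows, dim, dtype_bytes, graph_degree,
                       max_gpu_memory_gb, requested=0, imbalance_factor=2.0):
    """Pick the smallest partition count whose per-partition working set
    (sub-dataset + sub-graph) fits the GPU budget; a user-provided count
    is only ever increased, never decreased."""
    budget_bytes = max_gpu_memory_gb * 1e9
    npartitions = max(requested, 1)
    while True:
        # Largest expected partition, allowing for imbalance between partitions.
        rows_per_partition = imbalance_factor * n_rows / npartitions
        sub_dataset_bytes = rows_per_partition * dim * dtype_bytes
        sub_graph_bytes = rows_per_partition * graph_degree * 4  # 4-byte neighbor ids
        if sub_dataset_bytes + sub_graph_bytes <= budget_bytes:
            return npartitions
        npartitions += 1

# e.g. 100M float32 vectors, dim=128, graph_degree=64, 16 GB GPU budget:
# derive_npartitions(100_000_000, 128, 4, 64, 16)  ->  10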

This builds on top of PR #1597, which should be merged first.

- Added `cuvsHnswAceParams` structure for ACE configuration.
- Implemented `cuvsHnswBuild` function to facilitate index construction using ACE.
- Updated HNSW index parameters to include ACE settings.
- Created new tests for HNSW index building and searching using ACE.
- Updated documentation to reflect the new ACE parameters and usage.
- Added a heuristic to automatically derive the number of partitions from the host and device memory requirements.
- Increased the user-provided `npartitions` automatically if it does not fit in memory.
- Introduced `max_host_memory_gb` and `max_gpu_memory_gb` fields to `cuvsAceParams` and `cuvsHnswAceParams` structures for controlling memory usage during ACE builds.
- Added tests to verify that small memory limits trigger disk mode correctly for both CAGRA and HNSW index builds.

copy-pr-bot bot commented Dec 2, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


- Renamed parameter `m` to `M` in HNSW structures and related functions for consistency.
- Removed `ef_construction` from `cuvsHnswAceParams` and related classes, as it is no longer needed.
- Load the HNSW index from file before search if needed.
@julianmi julianmi marked this pull request as ready for review December 3, 2025 15:37
@julianmi julianmi requested review from a team as code owners December 3, 2025 15:37
Member

@KyleFromNVIDIA KyleFromNVIDIA left a comment


Approved CMake changes

@cjnolet cjnolet added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Jan 5, 2026
@cjnolet cjnolet moved this from Todo to In Progress in Vector Search, ML, & Data Mining Release Board Jan 5, 2026
Contributor

@tfeher tfeher left a comment


Thanks Julian for the PR; it is great to have the n_partition parameter determined automatically. I have reviewed the C/C++/Python changes. I have suggestions for the formulas used for memory estimation, but in general the PR is in good shape.

size_t gpu_sub_graph_size = imbalance_factor * 2 * (dataset_size / n_partitions) *
                            (intermediate_degree + graph_degree) * sizeof(IdxT);
size_t gpu_workspace_size = gpu_sub_dataset_size;
size_t disk_mode_gpu_required = gpu_sub_dataset_size + gpu_sub_graph_size + gpu_workspace_size;
Contributor

Why do we need a workspace that is equivalent to sub_dataset_size?

We should also keep in mind that the memory requirements depend on the build algorithm we use. Having gpu_sub_dataset_size + gpu_sub_graph_size is a good upper limit for now. Just for reference, I expect the following actual memory usage (sketched as code after this list):

  • IVF-PQ: max(pq_compressed_sub_dataset_size, gpu_sub_graph_size)
  • NN descent: gpu_sub_dataset_size_fp16 + gpu_sub_graph_size
  • Iterative solver: max(gpu_sub_dataset_size, gpu_sub_graph_size)
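The same estimates as a hypothetical Python helper (the function and argument names are illustrative assumptions; the PR itself keeps the conservative gpu_sub_dataset_size + gpu_sub_graph_size bound):

def expected_gpu_usage(build_algo,
                       gpu_sub_dataset_size,
                       gpu_sub_graph_size,
                       pq_compressed_sub_dataset_size,
                       gpu_sub_dataset_size_fp16):
    """Rough expected GPU memory per partition for each CAGRA build path."""
    if build_algo == "ivf_pq":
        return max(pq_compressed_sub_dataset_size, gpu_sub_graph_size)
    if build_algo == "nn_descent":
        return gpu_sub_dataset_size_fp16 + gpu_sub_graph_size
    if build_algo == "iterative":
        return max(gpu_sub_dataset_size, gpu_sub_graph_size)
    raise ValueError(f"unknown build algorithm: {build_algo}")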

Contributor Author

Please let me know if you have a better estimate of gpu_workspace_size. The limiting factor I have found seems to be build_knn_graph calling sort_knn_graph, which creates a copy of the dataset (graph_core.cuh#L532-L537).

Thanks for these limits. This is very helpful.

Contributor

Indeed, sorting needs the dataset, but that happens after KNN building, so we reuse the space that was reserved for the dataset during that phase. Therefore I would not include the dataset size in the workspace.

I went through optimize, and here is its workspace memory usage. Could you define a helper function and include this in the memory estimate?

def optimize_workspace_size(N, deg, ideg, S, mst_opt=False):
    """Calculates CAGRA optimize memory usage.

    This is the working memory on top of the input/output host memory
    usage (N * (deg + ideg) * S).

    N    - number of rows in the dataset
    deg  - graph degree
    ideg - intermediate graph degree
    S    - graph type size (in bytes)
    """

    mst_host = N * S                       # mst_graph_num_edges
    if mst_opt:
        mst_host += N * deg * S            # mst_graph allocated in optimize
        mst_host += N * deg * S            # mst_graph allocated in mst_optimize
        mst_host += N * S * 7              # vectors with _max_edges suffix
        mst_host += (deg - 1) * (deg - 1) * S  # iB_candidates

    prune_host = N * ideg * 1  # detour count

    prune_dev = N * ideg * 1   # detour count
    prune_dev += N * 4         # d_num_detour_edges
    prune_dev += N * ideg * S  # d_input_graph
    # We neglect 8 bytes (both on host and device) for stats

    rev_host = N * deg * S     # rev_graph
    rev_host += N * 4          # rev_graph_count
    rev_host += N * S          # dest_nodes

    rev_dev = N * deg * S      # d_rev_graph
    rev_dev += N * 4           # d_rev_graph_count
    rev_dev += N * 4           # d_dest_nodes

    # Memory for merging graphs
    combine_host = N * 4 + deg * 4  # in_edge_count + hist

    # Convert everything to GB
    mst_host /= 1e9
    prune_host /= 1e9
    prune_dev /= 1e9
    rev_host /= 1e9
    rev_dev /= 1e9
    combine_host /= 1e9

    print("Prune host {:4.2f} GB, dev {:4.2f} GB".format(prune_host, prune_dev))
    print("Rev   host {:4.2f} GB, dev {:4.2f} GB".format(rev_host, rev_dev))
    print("MST   host {:4.2f} GB".format(mst_host))

    total_host = mst_host + max(prune_host, rev_host, combine_host)
    total_dev = max(prune_dev, rev_dev)
    print("Total host {:4.2f} GB, dev {:4.2f} GB".format(total_host, total_dev))

    return total_host, total_dev
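For example, with purely illustrative parameters (a 100M-row sub-graph, graph_degree=64, intermediate_degree=128, 4-byte indices; not values taken from the PR), the helper reports:

optimize_workspace_size(100_000_000, 64, 128, 4)
# Prune host 12.80 GB, dev 64.40 GB
# Rev   host 26.40 GB, dev 26.40 GB
# MST   host 0.40 GB
# Total host 26.80 GB, dev 64.40 GB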

Contributor Author

Thank you, this is very helpful. I have added a helper that is now used in both the host and device memory calculations.

Contributor

tfeher commented Jan 16, 2026

Thanks Julian for the update. I had one more comment on the workspace size estimate; otherwise the code looks good.

