Dry Run Protocol by achirkin · Pull Request #2961 · rapidsai/raft

achirkin · 2026-02-20T12:30:42Z

The dry run protocol defines a mechanism to simulate the execution of algorithms to get a precise estimate of the memory requirements for a real execution with the same parameters.

#include <raft/core/resources.hpp>
#include <raft/util/dry_run_memory_resource.hpp>

raft::resources res;
// my_function takes the handle as its first parameter:
// auto my_function(const raft::resources& res, my_args...);
auto stats = raft::util::dry_run_execute(res, my_function, my_args...);
// stats.device_global_peak: peak device memory (bytes)
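The idea behind the snippet above can be illustrated without RAFT or a GPU. The following is a minimal sketch in standard C++17, not the code from this PR: `peak_tracking_resource`, `toy_resources`, `toy_sum_of_squares`, and `toy_dry_run_execute` are hypothetical names invented for illustration. Unlike the PR's fake allocators, this toy tracker forwards every allocation to a real upstream resource, so it only demonstrates the accounting and the "allocate, then skip the work" pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <utility>
#include <vector>

// Hypothetical tracking resource: forwards to an upstream resource and
// records the current and peak number of bytes in flight.
class peak_tracking_resource : public std::pmr::memory_resource {
 public:
  explicit peak_tracking_resource(std::pmr::memory_resource* upstream) : upstream_(upstream)
  {
    assert(upstream != nullptr);
  }
  std::size_t peak_bytes() const { return peak_; }

 private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override
  {
    void* p = upstream_->allocate(bytes, alignment);
    current_ += bytes;
    peak_ = std::max(peak_, current_);
    return p;
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override
  {
    upstream_->deallocate(p, bytes, alignment);
    current_ -= bytes;
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override
  {
    return this == &other;
  }

  std::pmr::memory_resource* upstream_;
  std::size_t current_{0};
  std::size_t peak_{0};
};

// Hypothetical stand-in for a resources handle: a workspace resource plus a dry-run flag.
struct toy_resources {
  std::pmr::memory_resource* workspace;
  bool dry_run;
};

// A toy "dry-run compliant" algorithm: the scratch buffer is always allocated
// through the handle, but the actual computation is skipped in dry-run mode.
inline double toy_sum_of_squares(const toy_resources& res, const std::vector<double>& input)
{
  std::pmr::vector<double> scratch(input.size(), res.workspace);  // visible to the tracker
  if (res.dry_run) { return 0.0; }                                // skip the real work
  double total = 0.0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    scratch[i] = input[i] * input[i];
    total += scratch[i];
  }
  return total;
}

// Hypothetical helper with the same call shape as the snippet above:
// run fn with a tracking resource and the dry-run flag set, return the peak.
template <typename Fn, typename... Args>
std::size_t toy_dry_run_execute(const toy_resources& res, Fn&& fn, Args&&... args)
{
  peak_tracking_resource tracker(res.workspace);
  toy_resources dry{&tracker, true};
  fn(dry, std::forward<Args>(args)...);
  return tracker.peak_bytes();
}
```

Calling `toy_dry_run_execute(res, toy_sum_of_squares, input)` reports at least `input.size() * sizeof(double)` bytes, because the scratch buffer is requested even though the loop never runs.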

This PR:

- Introduces new infrastructure:
  1. Dry-run tooling (raft::util::dry_run_execute, a tracking memory resource, resource::get_dry_run_flag) that lets callers estimate the peak memory usage of any RAFT algorithm without executing GPU work.
  2. resource::pinned_memory_resource and resource::managed_memory_resource, so that all memory resources available in raft are bound to the associated raft::resources handle and can be temporarily replaced.
  3. Breaking change: unifies the host and pinned mdarray policies into a single host policy that uses different std::pmr resources. This change is hidden behind a few layers of types in the mdarray template arguments, so only the most exotic use cases should be affected.
- Makes all public functions across all raft namespaces dry-run compliant: allocations are always visible to the tracker; CUDA work is skipped.
- Adds a small user guide (docs/source/dry_run_protocol.md).
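Two of the infrastructure points above, resources bound to a handle that can be temporarily replaced, and one host container type backed by different std::pmr resources, can be sketched in plain C++17. This is an illustration under assumptions, not the RAFT implementation: `toy_handle`, `toy_host_container`, `make_host_buffer`, `make_pinned_buffer`, and `scoped_resource_swap` are hypothetical names, and ordinary heap memory stands in for pinned memory.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical handle: each kind of memory is a replaceable std::pmr resource slot.
// Both slots default to the heap here; a real "pinned" resource is out of scope.
struct toy_handle {
  std::pmr::memory_resource* host_resource   = std::pmr::new_delete_resource();
  std::pmr::memory_resource* pinned_resource = std::pmr::new_delete_resource();
};

// One container type serves both cases; only the resource behind it differs.
using toy_host_container = std::pmr::vector<float>;

inline toy_host_container make_host_buffer(const toy_handle& h, std::size_t n)
{
  return toy_host_container(n, h.host_resource);
}

inline toy_host_container make_pinned_buffer(const toy_handle& h, std::size_t n)
{
  return toy_host_container(n, h.pinned_resource);
}

// RAII guard: installs a replacement resource in a slot and restores the
// previous one when the guard goes out of scope.
class scoped_resource_swap {
 public:
  scoped_resource_swap(std::pmr::memory_resource*& slot, std::pmr::memory_resource* replacement)
    : slot_(slot), saved_(slot)
  {
    assert(replacement != nullptr);
    slot_ = replacement;
  }
  ~scoped_resource_swap() { slot_ = saved_; }

  scoped_resource_swap(const scoped_resource_swap&)            = delete;
  scoped_resource_swap& operator=(const scoped_resource_swap&) = delete;

 private:
  std::pmr::memory_resource*& slot_;
  std::pmr::memory_resource* saved_;
};
```

Because `make_host_buffer` and `make_pinned_buffer` return the same type, code that consumes the buffers does not need to know which resource produced them, which is the property that lets a tracker be swapped in transparently.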

…mory

Introduce a dry-run execution framework that replaces device and host memory resources with lightweight fake allocators to measure peak memory usage without holding real memory.

New files:
- dry_run_memory_resource.hpp: dry_run_allocator (lock-free bump allocator), dry_run_device_memory_resource, dry_run_host_memory_resource, dry_run_resource_manager (RAII), and dry_run_execute() helper.
- dry_run_flag.hpp: boolean dry-run flag as a raft resource, allowing algorithms to skip kernel execution during profiling.
- tests/util/dry_run_memory_resource.cpp: unit tests.

The dry_run_allocator probes the upstream once to obtain a base address, then atomically bumps a pointer for each allocation: no mutex, no map, no real memory held after the initial probe.
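The lock-free bump technique described in this commit message can be sketched as follows. This is an assumption-laden illustration rather than the actual dry_run_allocator: the class name `toy_bump_allocator`, the fixed 256-byte alignment, and the way live and peak bytes are counted are all choices made for this example, and the base address is passed in by the caller instead of being probed from an upstream resource.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lock-free bump allocator that hands out fake addresses.
// The returned addresses must never be dereferenced: no memory backs them.
class toy_bump_allocator {
 public:
  static constexpr std::size_t alignment = 256;  // assumed granularity

  // `base` plays the role of the address obtained from the one-time probe.
  explicit toy_bump_allocator(std::uintptr_t base) : base_(base)
  {
    assert(base % alignment == 0);
  }

  std::uintptr_t allocate(std::size_t bytes)
  {
    const std::size_t padded = pad(bytes);
    // Atomically reserve a unique address range: no mutex, no map.
    const std::size_t offset = next_offset_.fetch_add(padded, std::memory_order_relaxed);
    // Update the live byte count and raise the peak with a CAS loop (atomic max).
    const std::size_t live = live_.fetch_add(padded, std::memory_order_relaxed) + padded;
    std::size_t seen       = peak_.load(std::memory_order_relaxed);
    while (seen < live && !peak_.compare_exchange_weak(seen, live, std::memory_order_relaxed)) {}
    return base_ + offset;
  }

  // Only the size matters for the accounting; the address range is not reused.
  void deallocate(std::size_t bytes) { live_.fetch_sub(pad(bytes), std::memory_order_relaxed); }

  std::size_t live_bytes() const { return live_.load(std::memory_order_relaxed); }
  std::size_t peak_bytes() const { return peak_.load(std::memory_order_relaxed); }

 private:
  // Round a request up to a multiple of the alignment.
  static std::size_t pad(std::size_t bytes) { return (bytes + alignment - 1) & ~(alignment - 1); }

  std::uintptr_t base_;
  std::atomic<std::size_t> next_offset_{0};
  std::atomic<std::size_t> live_{0};
  std::atomic<std::size_t> peak_{0};
};
```

Since the address range only grows, every allocation gets a distinct fake pointer without any bookkeeping structure; the trade-off is that freed ranges are never reused, which is acceptable when the addresses are never touched.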

achirkin added 13 commits February 18, 2026 10:46

First batch of dry-run guards

695a8a3

Dry run compliance for raft::linalg namespace

42d8ad4

Update developer guide with the dry run protocol

6db7ec8

BREAKING CHANGE: replaced pinned_container with host_container using …

d91a1c6

…pinned_memory_resource Add pinned and managed resources to the raft::resources handle to make it possible to customize / temporarily replace these resources

Dry run compliance for raft::matrix namespace

1a114f6

Dry run compliance for raft::random namespace

dec5e95

Dry run compliance for raft::solver namespace

f84d9a9

Dry run compliance for raft::sparse namespace

44793cd

Dry run compliance for raft::spectral namespace

d566fe9

Dry run compliance for raft::stats namespace

fc3bde6

Add a little bit more tests

b0ddbc8

Add the Dry Run Protocol Overview

15c07a1

achirkin self-assigned this Feb 20, 2026

achirkin requested review from a team as code owners February 20, 2026 12:30

achirkin added the feature request (New feature or request) and breaking (Breaking change) labels Feb 20, 2026

github-project-automation bot added this to Vector Search, ML, & Data Mining Release Board Feb 20, 2026

achirkin and others added 3 commits February 20, 2026 13:31

Fix C++ example in the docs

1c57abb

Merge branch 'main' into fea-dry-run-protocol

d916b45

Add a few more tests and fix a missed CUDA call in QR algorithm

9d24480

achirkin moved this to In Progress in Vector Search, ML, & Data Mining Release Board Feb 20, 2026

achirkin and others added 4 commits February 20, 2026 15:44

Fix excess subsample doing work in dry run

7577e56

Add dry run compliance to the raft::copy on mdspans

99faf68

Merge branch 'main' into fea-dry-run-protocol

b859894

Revert changing includes from public to detail namespace to avoid bre…

57d4c19

…aking change due to transitive includes in downstream libraries

achirkin wants to merge 20 commits into rapidsai:main from achirkin:fea-dry-run-protocol
