Dry Run Protocol #2961

Open
achirkin wants to merge 20 commits into rapidsai:main from achirkin:fea-dry-run-protocol

@achirkin
Contributor

The dry run protocol defines a mechanism to simulate the execution of algorithms to get a precise estimate of the memory requirements for a real execution with the same parameters.

#include <raft/util/dry_run_memory_resource.hpp>

raft::resources res;
// auto my_function(const raft::resources& res, my_args...);
auto stats = raft::util::dry_run_execute(res, my_function, my_args...);
// stats.device_global_peak  – peak device memory (bytes)

This PR:

  • Introduces new infrastructure:
    1. Dry-run machinery (raft::util::dry_run_execute, a tracking memory resource, resource::get_dry_run_flag) that lets callers estimate the peak memory usage of any RAFT algorithm without executing GPU work.
    2. resource::pinned_memory_resource and resource::managed_memory_resource, so that all memory resources available in raft are bound to the associated raft::resources handle and can be temporarily replaced.
    3. Breaking change: the host and pinned mdarray policies are unified into a single host policy that uses different std::pmr resources. The change is hidden behind a few layers of types in the mdarray template arguments, so only the most exotic use cases should be affected.
  • Makes all public functions across all raft namespaces dry-run compliant: allocations are always visible to the tracker; CUDA work is skipped.
  • Adds a small user guide (docs/source/dry_run_protocol.md)
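A minimal host-only sketch of what "allocations are always visible to the tracker; CUDA work is skipped" could look like, written against the C++17 standard library only. The names peak_tracking_resource and sum_of_squares_sketch are hypothetical and not part of this PR. The sketch also forwards to a real upstream resource for simplicity, whereas the PR describes fake allocators.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <numeric>
#include <vector>

// Hypothetical tracker: forwards to an upstream resource and records the
// peak number of bytes that were live at the same time.
class peak_tracking_resource : public std::pmr::memory_resource {
 public:
  explicit peak_tracking_resource(
    std::pmr::memory_resource* upstream = std::pmr::get_default_resource())
    : upstream_(upstream) {}

  std::size_t peak_bytes() const { return peak_; }
  std::size_t current_bytes() const { return current_; }

 private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    void* p = upstream_->allocate(bytes, alignment);
    current_ += bytes;
    peak_ = std::max(peak_, current_);
    return p;
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    assert(current_ >= bytes);
    upstream_->deallocate(p, bytes, alignment);
    current_ -= bytes;
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

  std::pmr::memory_resource* upstream_;
  std::size_t current_ = 0;
  std::size_t peak_ = 0;
};

// Hypothetical "algorithm": the scratch buffer is allocated in both modes,
// but the actual computation is skipped when dry_run is true.
float sum_of_squares_sketch(std::pmr::memory_resource* mr,
                            const std::vector<float>& in,
                            bool dry_run) {
  std::pmr::vector<float> scratch(in.size(), mr);  // always visible to the tracker
  if (dry_run) { return 0.0f; }                    // stand-in for skipping GPU work
  std::transform(in.begin(), in.end(), scratch.begin(),
                 [](float x) { return x * x; });
  return std::accumulate(scratch.begin(), scratch.end(), 0.0f);
}
```

Calling sum_of_squares_sketch with dry_run set to true and then reading peak_bytes() gives the scratch footprint without running the computation, which is the same idea dry_run_execute applies to device memory.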

…mory

Introduce a dry-run execution framework that replaces device and host
memory resources with lightweight fake allocators to measure peak memory
usage without holding real memory.

New files:
- dry_run_memory_resource.hpp: dry_run_allocator (lock-free bump
  allocator), dry_run_device_memory_resource, dry_run_host_memory_resource,
  dry_run_resource_manager (RAII), and dry_run_execute() helper.
- dry_run_flag.hpp: boolean dry-run flag as a raft resource, allowing
  algorithms to skip kernel execution during profiling.
- tests/util/dry_run_memory_resource.cpp: unit tests.

The dry_run_allocator probes the upstream once to obtain a base address,
then atomically bumps a pointer for each allocation — no mutex, no map,
no real memory held after the initial probe.
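The bump-pointer idea described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's dry_run_allocator: the class name bump_allocator_sketch, the 256-byte default alignment, and the way the peak is tracked (a second atomic counter plus a compare-exchange loop) are all hypothetical. The base address is passed in directly rather than probed from an upstream resource.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lock-free bump allocator. It hands out fake addresses that
// must never be dereferenced; only the byte accounting is meaningful.
class bump_allocator_sketch {
 public:
  // base is assumed to be aligned at least as strictly as any requested alignment.
  explicit bump_allocator_sketch(std::uintptr_t base) : base_(base) {}

  void* allocate(std::size_t bytes, std::size_t alignment = 256) {
    const std::size_t padded = pad(bytes, alignment);
    // Bump the address offset; it never decreases, so addresses stay unique.
    const std::size_t offset = next_.fetch_add(padded, std::memory_order_relaxed);
    // Track live bytes and raise the recorded peak if needed.
    const std::size_t now =
      current_.fetch_add(padded, std::memory_order_relaxed) + padded;
    std::size_t prev = peak_.load(std::memory_order_relaxed);
    while (prev < now &&
           !peak_.compare_exchange_weak(prev, now, std::memory_order_relaxed)) {
    }
    return reinterpret_cast<void*>(base_ + offset);
  }

  void deallocate(void* /*ptr*/, std::size_t bytes, std::size_t alignment = 256) {
    current_.fetch_sub(pad(bytes, alignment), std::memory_order_relaxed);
  }

  std::size_t current() const { return current_.load(std::memory_order_relaxed); }
  std::size_t peak() const { return peak_.load(std::memory_order_relaxed); }

 private:
  // Round bytes up to a multiple of alignment (which must be a power of two).
  static std::size_t pad(std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (bytes + alignment - 1) & ~(alignment - 1);
  }

  std::uintptr_t base_;
  std::atomic<std::size_t> next_{0};
  std::atomic<std::size_t> current_{0};
  std::atomic<std::size_t> peak_{0};
};
```

Because the offset only grows, no map of live allocations is needed to keep addresses distinct; the trade-off is that the fake address range is never reused.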
…pinned_memory_resource

Add pinned and managed resources to the raft::resources handle so that they can be customized or temporarily replaced.
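Temporarily replacing a resource bound to a handle is usually done with an RAII guard. The sketch below shows the pattern with the standard library only; handle_sketch and scoped_pinned_resource are hypothetical stand-ins and do not reflect the actual raft::resources interface.

```cpp
#include <cassert>
#include <memory_resource>

// Hypothetical stand-in for a handle that owns a pointer to its pinned resource.
struct handle_sketch {
  std::pmr::memory_resource* pinned_mr = std::pmr::get_default_resource();
};

// Swaps the handle's resource on construction and restores it on destruction.
class scoped_pinned_resource {
 public:
  scoped_pinned_resource(handle_sketch& handle, std::pmr::memory_resource* replacement)
    : handle_(handle), saved_(handle.pinned_mr) {
    assert(replacement != nullptr);
    handle_.pinned_mr = replacement;
  }
  ~scoped_pinned_resource() { handle_.pinned_mr = saved_; }

  scoped_pinned_resource(const scoped_pinned_resource&) = delete;
  scoped_pinned_resource& operator=(const scoped_pinned_resource&) = delete;

 private:
  handle_sketch& handle_;
  std::pmr::memory_resource* saved_;
};
```

The guard restores the original resource even if the code inside the scope throws, which matters when a tracking resource must not leak into later real executions.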
@achirkin achirkin self-assigned this Feb 20, 2026
@achirkin achirkin requested review from a team as code owners February 20, 2026 12:30
@achirkin achirkin added the "feature request" and "breaking" labels Feb 20, 2026