Tracking memory resources #2973

Merged
rapids-bot[bot] merged 36 commits into rapidsai:main from achirkin:fea-tracking-memory-resources
Mar 18, 2026
achirkin commented Mar 4, 2026

Detailed tracking of (almost) all allocations on device and host.

  // optionally pass an existing resource handle
  raft::resources res;

  // The tracking handle is a child of resource handle; it wraps all memory resources with statistics adaptors
  raft::memory_tracking_resources tracked(res, "allocations.csv", std::chrono::milliseconds(1));

  // All allocations are logged to a .csv as long as `tracked` is alive
  cuvs::neighbors::cagra::build(tracked, ...);

This produces a CSV file of sampled allocation statistics, with a timestamp and the active NVTX range for each row:

timestamp_us,nvtx_depth,nvtx_range,host_current,host_total,pinned_current,pinned_total,managed_current,managed_total,device_current,device_total,workspace_current,workspace_total,large_workspace_current,large_workspace_total
198809,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,148304,148304,0,0,0,0
199961,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,15588304,15588304,0,0,0,0
201350,1,"hnsw::build<ACE>",0,20008,0,0,0,0,0,40385488,0,0,0,0
222216,3,"cagra::build_knn_graph<IVF-PQ>(5000000, 1536, 72)",1440000000,1440020008,0,0,0,0,0,40385488,0,0,0,0
273892,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,0,0
304183,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,4388567040,4388567040
309064,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,53860384,94245872,0,0,4388567040,4388567040
334655,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,67339295,107724783,0,0,4388567040,4388567040
385037,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,74076743,114462231,0,0,4388567040,4388567040
386129,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,80814199,121199687,0,0,4388567040,4388567040
402750,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,46099768,126913967,0,0,4388567040,4388567040
...
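
For post-processing without the visualization script, here is a minimal sketch of a reader for this file. It is not part of the PR; `split_csv_row` and `peak_column` are hypothetical helper names. Note that the `nvtx_range` field is quoted and may itself contain commas, so a plain split on `,` would misalign the columns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>  // std::istringstream, handy for feeding the reader from a string
#include <stdexcept>
#include <string>
#include <vector>

// Split one CSV row into fields, treating commas inside double quotes as text.
// Quote characters are dropped; escaped quotes ("") are not preserved.
inline std::vector<std::string> split_csv_row(const std::string& line) {
  std::vector<std::string> fields(1);
  bool quoted = false;
  for (char c : line) {
    if (c == '"') {
      quoted = !quoted;
    } else if (c == ',' && !quoted) {
      fields.emplace_back();
    } else {
      fields.back().push_back(c);
    }
  }
  return fields;
}

// Largest value observed in the named numeric column, e.g. "device_current".
inline std::int64_t peak_column(std::istream& csv, const std::string& column) {
  std::string line;
  if (!std::getline(csv, line)) { throw std::invalid_argument("empty CSV"); }
  const auto header = split_csv_row(line);
  const auto it     = std::find(header.begin(), header.end(), column);
  if (it == header.end()) { throw std::invalid_argument("no such column: " + column); }
  const auto index = static_cast<std::size_t>(it - header.begin());

  std::int64_t peak = 0;
  while (std::getline(csv, line)) {
    const auto fields = split_csv_row(line);
    if (fields.size() != header.size()) { continue; }  // skip blank or truncated rows
    peak = std::max<std::int64_t>(peak, std::stoll(fields[index]));
  }
  return peak;
}
```

Passing a `std::ifstream` opened on `allocations.csv` together with a column name such as `device_current` returns the peak of that column over the sampled rows.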

This can later be visualized (the visualization script is not included in the PR):
[Image: allocations]

Implementation overview

NVTX

Added thread-local tracking of NVTX range stack; the calling thread shares a handle to the sampling thread to correlate the NVTX range state with allocations.
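
A minimal sketch of this idea, under simplifying assumptions and not the code in this PR: each thread lazily creates its own stack of range names, and a `std::shared_ptr` to that stack serves as the handle a sampling thread can read from. The names `range_stack`, `this_thread_range_stack`, and `scoped_range` are hypothetical, and the actual NVTX push/pop calls are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-thread record of the currently open ranges.
class range_stack {
 public:
  void push(std::string name) {
    std::lock_guard<std::mutex> lock(mutex_);
    names_.push_back(std::move(name));
  }
  void pop() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!names_.empty()) { names_.pop_back(); }
  }
  // Snapshot for the sampling thread: (depth, innermost range name).
  std::pair<std::size_t, std::string> current() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return {names_.size(), names_.empty() ? std::string{} : names_.back()};
  }

 private:
  mutable std::mutex mutex_;
  std::vector<std::string> names_;
};

// Each thread gets its own stack; the returned shared_ptr is the handle
// that the calling thread can pass on to a sampling thread.
inline std::shared_ptr<range_stack> this_thread_range_stack() {
  thread_local std::shared_ptr<range_stack> stack = std::make_shared<range_stack>();
  return stack;
}

// RAII helper: records the range name for the lifetime of the object.
class scoped_range {
 public:
  explicit scoped_range(std::string name) : stack_(this_thread_range_stack()) {
    stack_->push(std::move(name));
  }
  ~scoped_range() { stack_->pop(); }
  scoped_range(const scoped_range&)            = delete;
  scoped_range& operator=(const scoped_range&) = delete;

 private:
  std::shared_ptr<range_stack> stack_;
};
```

The mutex keeps the sketch simple; since pushes and pops happen on one thread and reads on another, some form of synchronization is needed, and a real implementation may choose a cheaper one.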

Memory resource adaptors

  • statistics adaptor: atomically counts allocations/deallocations for any cuda::mr-compatible resource
  • notifying adaptor: sets a shared "notifier" state on each event
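
The two adaptors can be sketched as follows. This is an illustration only: it uses a deliberately simplified `allocate`/`deallocate` interface rather than the `cuda::mr` resource concept, and `malloc_resource`, `allocation_stats`, `statistics_adaptor`, and `notifying_adaptor` are hypothetical names. Treating "total" as a cumulative byte count is an assumption based on the CSV sample above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <utility>

// Counters shared between the adaptor and whoever samples them.
struct allocation_stats {
  std::atomic<std::int64_t> current_bytes{0};  // bytes allocated right now
  std::atomic<std::int64_t> total_bytes{0};    // cumulative bytes allocated (assumption)
};

// Hypothetical upstream resource with a simplified interface.
struct malloc_resource {
  void* allocate(std::size_t bytes) { return std::malloc(bytes); }
  void deallocate(void* ptr, std::size_t) noexcept { std::free(ptr); }
};

// Counts bytes with a few atomic operations per call, then forwards upstream.
template <typename Upstream>
class statistics_adaptor {
 public:
  statistics_adaptor(Upstream upstream, std::shared_ptr<allocation_stats> stats)
    : upstream_(std::move(upstream)), stats_(std::move(stats)) {}

  void* allocate(std::size_t bytes) {
    void* ptr    = upstream_.allocate(bytes);
    const auto n = static_cast<std::int64_t>(bytes);
    stats_->current_bytes.fetch_add(n, std::memory_order_relaxed);
    stats_->total_bytes.fetch_add(n, std::memory_order_relaxed);
    return ptr;
  }
  void deallocate(void* ptr, std::size_t bytes) noexcept {
    upstream_.deallocate(ptr, bytes);
    stats_->current_bytes.fetch_sub(static_cast<std::int64_t>(bytes), std::memory_order_relaxed);
  }

 private:
  Upstream upstream_;
  std::shared_ptr<allocation_stats> stats_;
};

// Raises a shared flag after every event so a sampler knows something changed.
template <typename Upstream>
class notifying_adaptor {
 public:
  notifying_adaptor(Upstream upstream, std::shared_ptr<std::atomic<bool>> notifier)
    : upstream_(std::move(upstream)), notifier_(std::move(notifier)) {}

  void* allocate(std::size_t bytes) {
    void* ptr = upstream_.allocate(bytes);
    notifier_->store(true, std::memory_order_release);
    return ptr;
  }
  void deallocate(void* ptr, std::size_t bytes) noexcept {
    upstream_.deallocate(ptr, bytes);
    notifier_->store(true, std::memory_order_release);
  }

 private:
  Upstream upstream_;
  std::shared_ptr<std::atomic<bool>> notifier_;
};
```

Wrapping the statistics adaptor inside the notifying adaptor means the counters are already updated by the time the flag is raised, so a sampler that reacts to the flag sees the new values.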

Resource monitor

A resource monitor registers a collection of resource statistics objects, a single NVTX range handle, and a single notifier state. It spawns a new thread to sample the resource statistics at a given rate (but only when the notifier is triggered). This thread writes to a CSV output stream.
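
A reduced sketch of such a sampling loop, assuming a single counter instead of a collection of statistics objects and leaving out the NVTX handle. `resource_monitor` and `monitor_example` here are hypothetical and far simpler than the class in this PR.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <utility>

class resource_monitor {
 public:
  resource_monitor(std::ostream& out,
                   std::shared_ptr<std::atomic<std::int64_t>> current_bytes,
                   std::shared_ptr<std::atomic<bool>> notifier,
                   std::chrono::microseconds interval)
    : out_(out),
      current_(std::move(current_bytes)),
      notifier_(std::move(notifier)),
      interval_(interval),
      start_(std::chrono::steady_clock::now()),
      worker_([this] { run(); }) {}

  ~resource_monitor() {
    stop_.store(true);
    worker_.join();
    sample();  // pick up an event that arrived after the last tick
  }

 private:
  void run() {
    out_ << "timestamp_us,current_bytes\n";
    while (!stop_.load()) {
      sample();
      std::this_thread::sleep_for(interval_);
    }
  }
  // Write a row only if the notifier was raised since the previous sample.
  void sample() {
    if (!notifier_->exchange(false)) { return; }
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    out_ << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count() << ','
         << current_->load() << '\n';
  }

  std::ostream& out_;
  std::shared_ptr<std::atomic<std::int64_t>> current_;
  std::shared_ptr<std::atomic<bool>> notifier_;
  std::chrono::microseconds interval_;
  std::chrono::steady_clock::time_point start_;
  std::atomic<bool> stop_{false};
  std::thread worker_;  // declared last so every other member is ready when it starts
};

// Example: one allocation event observed while the monitor is alive.
inline std::string monitor_example() {
  std::ostringstream out;
  auto bytes = std::make_shared<std::atomic<std::int64_t>>(0);
  auto flag  = std::make_shared<std::atomic<bool>>(false);
  {
    resource_monitor monitor(out, bytes, flag, std::chrono::microseconds(100));
    bytes->store(1024);  // update the statistic first...
    flag->store(true);   // ...then raise the notifier
  }  // the destructor joins the sampling thread
  return out.str();
}
```

Gating each row on the notifier keeps the output small while nothing is being allocated, regardless of how short the sampling interval is.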

Memory tracking resources

raft::memory_tracking_resources is a child of raft::resources, so it can be used as a drop-in replacement. It replaces all known memory resources for the duration of its lifetime and manages the output file or stream if necessary.
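
The "replace for the duration of its lifetime" behaviour is a scoped swap. A generic sketch of that pattern, assuming nothing about how raft::memory_tracking_resources does it internally (`resource`, `handle`, and `scoped_replace` are hypothetical stand-ins):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for a memory resource and a handle that owns one.
struct resource {
  std::string name;
};
struct handle {
  std::shared_ptr<resource> device_resource = std::make_shared<resource>(resource{"default"});
};

// Installs `replacement` into `slot` on construction and
// puts the previous value back on destruction.
template <typename T>
class scoped_replace {
 public:
  scoped_replace(T& slot, T replacement)
    : slot_(slot), saved_(std::exchange(slot, std::move(replacement)))
  {
  }
  ~scoped_replace() { slot_ = std::move(saved_); }
  scoped_replace(const scoped_replace&)            = delete;
  scoped_replace& operator=(const scoped_replace&) = delete;

 private:
  T& slot_;
  T saved_;
};
```

In this sketch the replacement would be a tracking wrapper around the saved resource, and the original is restored as soon as the guard goes out of scope.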

Depends on (and includes all changes of) #2968

achirkin and others added 25 commits February 26, 2026 09:20
@achirkin achirkin self-assigned this Mar 4, 2026
@achirkin achirkin requested review from a team as code owners March 4, 2026 17:44
@achirkin achirkin added the feature request and non-breaking labels Mar 4, 2026
@tfeher tfeher left a comment

Thanks Artem for the PR! This is great. As we try to maximize memory utilization, we are prone to run out of memory. This PR will be very useful to debug those issues and understand memory usage of various algorithms.

The extra memory usage tracking layer is only created if the user explicitly requests it. Therefore I do not see any issue merging this into raft. We should get this in 26.04.

I have a few comments below.

My wishlist of follow-up PRs:

  • Python API to enable memory_tracking_resources
  • Command line argument for cuvs-bench to enable memory tracking

@achirkin achirkin requested a review from tfeher March 16, 2026 13:33
@tfeher tfeher left a comment

Thanks Artem for the updates, the PR looks good to me!

@achirkin

Thanks Tamas for the review! Since the PR is non-breaking and the change to the existing logic is minimal (maintaining the NVTX name stack), I'll go ahead and merge it.
FYI, below are the benchmarks of the tracking overhead. sample_rate_us = -1 means no tracking. The code does a large number of repeated allocations in the pool (very fast, no real CUDA context calls). Hence the overheads of the statistics adaptor (a few atomic ops) and of excessive sampling are visible:

---------------------------------------------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations alloc_size      batch items_per_second sample_rate_us
---------------------------------------------------------------------------------------------------------------------------------
tracking_overhead/0/manual_time        1.60 ms         1.59 ms          776        256          0       12.5373M/s             -1
tracking_overhead/1/manual_time        2.01 ms         2.00 ms          700        256          0       9.96885M/s              0
tracking_overhead/2/manual_time        1.70 ms         1.70 ms          817        256          0       11.7602M/s              1
tracking_overhead/3/manual_time        1.70 ms         1.69 ms          819        256          0       11.7746M/s             10
tracking_overhead/4/manual_time        1.70 ms         1.69 ms          823        256          0       11.7727M/s            100
tracking_overhead/5/manual_time        1.62 ms         1.62 ms          875   1048.58k          0       12.3415M/s             -1
tracking_overhead/6/manual_time        1.81 ms         1.81 ms          609   1048.58k          0       11.0321M/s              0
tracking_overhead/7/manual_time        1.66 ms         1.66 ms          847   1048.58k          0        12.013M/s              1
tracking_overhead/8/manual_time        1.65 ms         1.65 ms          837   1048.58k          0       12.1312M/s             10
tracking_overhead/9/manual_time        1.69 ms         1.69 ms          856   1048.58k          0       11.8317M/s            100
tracking_overhead/10/manual_time      0.167 ms        0.163 ms         9088   67.1089M          0       11.9873M/s             -1
tracking_overhead/11/manual_time      0.219 ms        0.213 ms         6518   67.1089M          0        9.1249M/s              0
tracking_overhead/12/manual_time      0.177 ms        0.173 ms         7566   67.1089M          0       11.2846M/s              1
tracking_overhead/13/manual_time      0.168 ms        0.165 ms         8231   67.1089M          0       11.9152M/s             10
tracking_overhead/14/manual_time      0.167 ms        0.164 ms         8373   67.1089M          0       12.0072M/s            100
tracking_overhead/15/manual_time       1.50 ms         1.48 ms          926        256          1        13.313M/s             -1
tracking_overhead/16/manual_time       1.78 ms         1.76 ms          677        256          1       11.2217M/s              0
tracking_overhead/17/manual_time       1.59 ms         1.58 ms          858        256          1       12.5546M/s              1
tracking_overhead/18/manual_time       1.60 ms         1.59 ms          882        256          1       12.4699M/s             10
tracking_overhead/19/manual_time       1.63 ms         1.62 ms          812        256          1       12.2791M/s            100
tracking_overhead/20/manual_time      0.147 ms        0.146 ms         9466   1048.58k          1       13.6086M/s             -1
tracking_overhead/21/manual_time      0.213 ms        0.212 ms         7849   1048.58k          1       9.39716M/s              0
tracking_overhead/22/manual_time      0.160 ms        0.160 ms         8746   1048.58k          1       12.4864M/s              1
tracking_overhead/23/manual_time      0.161 ms        0.161 ms         8808   1048.58k          1       12.4082M/s             10
tracking_overhead/24/manual_time      0.158 ms        0.158 ms         8615   1048.58k          1       12.6409M/s            100
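
To read the table as relative overhead, compare each row against the sample_rate_us = -1 row with the same alloc_size and batch. A small helper for that arithmetic (`overhead_percent` is a hypothetical name, not part of the benchmark code):

```cpp
#include <cassert>

// Relative slowdown of a tracked run versus the untracked baseline, in percent.
constexpr double overhead_percent(double baseline_ms, double tracked_ms) {
  return (tracked_ms - baseline_ms) / baseline_ms * 100.0;
}
```

For the first group (alloc_size 256, batch 0), `overhead_percent(1.60, 2.01)` is about 25.6 for sample_rate_us = 0, and `overhead_percent(1.60, 1.70)` is about 6.3 for sample_rate_us = 1.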

@achirkin

/merge

@rapids-bot rapids-bot bot merged commit 1b20b78 into rapidsai:main Mar 18, 2026
147 of 149 checks passed
achirkin added a commit to achirkin/raft that referenced this pull request Mar 18, 2026

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#2973
rapids-bot bot pushed a commit that referenced this pull request Mar 18, 2026
Backport PRs that were mistakenly merged into `main`:
 - #2968
 - #2973

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2983
csadorf commented Mar 20, 2026

Would it make sense to try to upstream this kind of tooling into nsight?
