Tracking memory resources #2973
…nitializing make_*_scalar overloads
…sed host_resource and host_device_resource
…aw CCCL references
…urrent_device_resource()
tfeher
left a comment
Thanks Artem for the PR! This is great. As we try to maximize memory utilization, we are prone to run out of memory. This PR will be very useful to debug those issues and understand memory usage of various algorithms.
The extra memory usage tracking layer is only created if the user explicitly requests it. Therefore I do not see any issue merging this into raft. We should get this in 26.04.
I have a few comments below.
My wishlist of follow up PRs:
- Python API to enable memory_tracking_resource
- Command line argument for cuvs-bench to enable memory tracking
Co-authored-by: Tamas Bela Feher <tfeher@nvidia.com>
…or throughput measurement
tfeher
left a comment
Thanks Artem for the updates, the PR looks good to me!
Thanks Tamas for the review! Since the PR is not breaking and the change to the existing logic is minimal (maintaining the NVTX names stack), I'll go ahead and merge it.
---------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations alloc_size batch items_per_second sample_rate_us
---------------------------------------------------------------------------------------------------------------------------------
tracking_overhead/0/manual_time 1.60 ms 1.59 ms 776 256 0 12.5373M/s -1
tracking_overhead/1/manual_time 2.01 ms 2.00 ms 700 256 0 9.96885M/s 0
tracking_overhead/2/manual_time 1.70 ms 1.70 ms 817 256 0 11.7602M/s 1
tracking_overhead/3/manual_time 1.70 ms 1.69 ms 819 256 0 11.7746M/s 10
tracking_overhead/4/manual_time 1.70 ms 1.69 ms 823 256 0 11.7727M/s 100
tracking_overhead/5/manual_time 1.62 ms 1.62 ms 875 1048.58k 0 12.3415M/s -1
tracking_overhead/6/manual_time 1.81 ms 1.81 ms 609 1048.58k 0 11.0321M/s 0
tracking_overhead/7/manual_time 1.66 ms 1.66 ms 847 1048.58k 0 12.013M/s 1
tracking_overhead/8/manual_time 1.65 ms 1.65 ms 837 1048.58k 0 12.1312M/s 10
tracking_overhead/9/manual_time 1.69 ms 1.69 ms 856 1048.58k 0 11.8317M/s 100
tracking_overhead/10/manual_time 0.167 ms 0.163 ms 9088 67.1089M 0 11.9873M/s -1
tracking_overhead/11/manual_time 0.219 ms 0.213 ms 6518 67.1089M 0 9.1249M/s 0
tracking_overhead/12/manual_time 0.177 ms 0.173 ms 7566 67.1089M 0 11.2846M/s 1
tracking_overhead/13/manual_time 0.168 ms 0.165 ms 8231 67.1089M 0 11.9152M/s 10
tracking_overhead/14/manual_time 0.167 ms 0.164 ms 8373 67.1089M 0 12.0072M/s 100
tracking_overhead/15/manual_time 1.50 ms 1.48 ms 926 256 1 13.313M/s -1
tracking_overhead/16/manual_time 1.78 ms 1.76 ms 677 256 1 11.2217M/s 0
tracking_overhead/17/manual_time 1.59 ms 1.58 ms 858 256 1 12.5546M/s 1
tracking_overhead/18/manual_time 1.60 ms 1.59 ms 882 256 1 12.4699M/s 10
tracking_overhead/19/manual_time 1.63 ms 1.62 ms 812 256 1 12.2791M/s 100
tracking_overhead/20/manual_time 0.147 ms 0.146 ms 9466 1048.58k 1 13.6086M/s -1
tracking_overhead/21/manual_time 0.213 ms 0.212 ms 7849 1048.58k 1 9.39716M/s 0
tracking_overhead/22/manual_time 0.160 ms 0.160 ms 8746 1048.58k 1 12.4864M/s 1
tracking_overhead/23/manual_time 0.161 ms 0.161 ms 8808 1048.58k 1 12.4082M/s 10
tracking_overhead/24/manual_time 0.158 ms 0.158 ms 8615 1048.58k 1 12.6409M/s 100
/merge
Detailed tracking of (almost) all allocations on device and host.

```C++
// optionally pass an existing resource handle
raft::resources res;
// The tracking handle is a child of resource handle; it wraps all memory resources with statistics adaptors
raft::memory_tracking_resources tracked(res, "allocations.csv", std::chrono::milliseconds(1));
// All allocations are logged to a .csv as long as `tracked` is alive
cuvs::neighbors::cagra::build(tracked, ...);
```

This produces a CSV file with sampled allocations with a timeline and NVTX correlation

```csv
timestamp_us,nvtx_depth,nvtx_range,host_current,host_total,pinned_current,pinned_total,managed_current,managed_total,device_current,device_total,workspace_current,workspace_total,large_workspace_current,large_workspace_total
198809,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,148304,148304,0,0,0,0
199961,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,15588304,15588304,0,0,0,0
201350,1,"hnsw::build<ACE>",0,20008,0,0,0,0,0,40385488,0,0,0,0
222216,3,"cagra::build_knn_graph<IVF-PQ>(5000000, 1536, 72)",1440000000,1440020008,0,0,0,0,0,40385488,0,0,0,0
273892,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,0,0
304183,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,4388567040,4388567040
309064,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,53860384,94245872,0,0,4388567040,4388567040
334655,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,67339295,107724783,0,0,4388567040,4388567040
385037,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,74076743,114462231,0,0,4388567040,4388567040
386129,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,80814199,121199687,0,0,4388567040,4388567040
402750,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,46099768,126913967,0,0,4388567040,4388567040
...
```

This can later be visualized (the visualization script is not included in the PR):

<img width="2100" height="1350" alt="allocations" src="https://github.com/user-attachments/assets/3f0ab942-b49b-4e09-a0ea-9181725ae05e" />

#### Implementation overview

##### NVTX

Added thread-local tracking of NVTX range stack; the calling thread shares a handle to the sampling thread to correlate the NVTX range state with allocations.

##### Memory resource adaptors

- statistics adaptor: atomically counts allocations/deallocations for any `cuda::mr`-compatible resource
- notifying adaptor: sets a shared "notifier" state on each event

##### Resource monitor

A resource monitor registers a collection of resource statistics objects, a single NVTX range handle, and a single notifier state. It spawns a new thread to sample the resource statistics at a given rate (but only when the notifier is triggered). This thread writes to a CSV output stream.

##### Memory tracking resources

`raft::memory_tracking_resources` is a child of `raft::resources`, thus can be used as a drop-in replacement. It replaces all known memory resources for the duration of its lifetime and manages the output file or stream if necessary.

Depends on (and includes all changes of) rapidsai#2968

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#2973
Backport PRs that were mistakenly merged into `main`:
- #2968
- #2973

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: #2983
Would it make sense to try to upstream this kind of tooling into Nsight?
Detailed tracking of (almost) all allocations on device and host.
With tracking enabled, a CSV file of sampled allocations is produced, with a timeline and NVTX correlation.
This can later be visualized (the visualization script is not included in the PR).

Implementation overview
NVTX
Added thread-local tracking of NVTX range stack; the calling thread shares a handle to the sampling thread to correlate the NVTX range state with allocations.
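Memory resource adaptors
- statistics adaptor: atomically counts allocations/deallocations for any cuda::mr-compatible resource
- notifying adaptor: sets a shared "notifier" state on each event
Resource monitor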
A resource monitor registers a collection of resource statistics objects, a single NVTX range handle, and a single notifier state. It spawns a new thread to sample the resource statistics at a given rate (but only when the notifier is triggered). This thread writes to a CSV output stream.
Memory tracking resources
raft::memory_tracking_resources is a child of raft::resources, and thus can be used as a drop-in replacement. It replaces all known memory resources for the duration of its lifetime and manages the output file or stream if necessary.
Depends on (and includes all changes of) #2968
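The "child handle that swaps resources for its lifetime" pattern can be illustrated with the C++17 standard library alone. This is an analogy, not the raft implementation: `handle`, `tracking_handle`, `counting_resource`, and `build` are hypothetical names, and `std::pmr` stands in for the device and host resources that the real class wraps.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Counts the bytes requested through it and forwards to an upstream resource.
class counting_resource : public std::pmr::memory_resource {
 public:
  explicit counting_resource(std::pmr::memory_resource* upstream) : upstream_(upstream) {}
  std::size_t total_bytes() const { return total_; }

 private:
  void* do_allocate(std::size_t bytes, std::size_t align) override
  {
    total_ += bytes;
    return upstream_->allocate(bytes, align);
  }
  void do_deallocate(void* ptr, std::size_t bytes, std::size_t align) override
  {
    upstream_->deallocate(ptr, bytes, align);
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override
  {
    return this == &other;
  }
  std::pmr::memory_resource* upstream_;
  std::size_t total_{0};
};

// Hypothetical base handle that carries the resource used by algorithms.
struct handle {
  virtual ~handle() = default;
  std::pmr::memory_resource* mr = std::pmr::get_default_resource();
};

// Child handle: wraps the parent's resource and also replaces the process-wide
// default resource, restoring the previous one on destruction.
class tracking_handle : public handle {
 public:
  explicit tracking_handle(const handle& parent)
    : handle(parent),
      counter_(parent.mr),
      previous_default_(std::pmr::set_default_resource(&counter_))
  {
    mr = &counter_;
  }
  ~tracking_handle() override { std::pmr::set_default_resource(previous_default_); }
  std::size_t total_bytes() const { return counter_.total_bytes(); }

 private:
  counting_resource counter_;
  std::pmr::memory_resource* previous_default_;
};

// Existing code written against the base handle needs no changes.
inline void build(const handle& h, std::size_t n)
{
  std::pmr::vector<int> scratch(n, h.mr);  // allocates through the handle's resource
}
```

Calling `build(tracked, n)` with a `tracking_handle` compiles unchanged because the child converts to `const handle&`; allocations made through either the handle or the default resource are counted until `tracked` goes out of scope.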