Unify memory resources #2968
Conversation
template <typename T>
struct host_container {
template <typename T, typename MR>
#ifdef __cpp_concepts
I think RAFT is using C++20 now so it should be safe to use requires without the #ifdef guard?
Unfortunately, some components of cuvs still use C++17, and it breaks if I remove the #ifdef in this header. I figured I'd keep it here to keep cuvs passing CI without changes.
We should get cuVS updated to C++20; RMM will be requiring C++20 soon.
 * Provides CUDA unified (managed) memory accessible from both host and device.
 * Uses synchronous allocation (no stream). Binds to raft::mr::host_device_resource_ref.
 */
class managed_memory_resource {
This is implemented in CCCL already. Please do not introduce a new implementation of this since one already exists.
https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/memory_pools.html#cuda-managed-memory-pool
https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/legacy_resources.html#cuda-mr-legacy-managed-memory-resource
Use cuda::mr::legacy_managed_memory_resource on CUDA 12 and cuda::managed_memory_pool on CUDA 13 (it's considerably faster). Maybe write a factory that returns the correct resource type for your CUDA version.
Thanks for the pointer! Really nice, I replaced it with the cuda::mr::legacy_managed_memory_resource and it just worked with no other modifications. I'd prefer to keep the legacy resource for now to keep exactly the same behavior in cuVS as before this PR.
The user is already able to replace it with the CUDA 13 pool-based resource via `raft::resource::managed_memory_resource`, but we can also make it the default later.
The follow-up and motivation: tracking all memory allocations #2973

Testing cuvs CI against rapidsai/raft#2968

Testing the breaking changes:
class managed_memory_resource_factory : public resource_factory {
 public:
  managed_memory_resource_factory() : mr_(cuda::mr::legacy_managed_memory_resource{}) {}
I know you said it's out of scope for now, but I recommend a follow-up PR that uses the new managed pool on CUDA 13+. It's a worthwhile performance boost.
struct managed_container_policy {
  using element_type = ElementType;
  using container_type = host_container<element_type, raft::mr::host_device_resource_ref>;
Something to be aware of: It is possible for memory resources to be host-accessible and device-accessible but not have that known statically. For example, systems with HMM or ATS have device-accessibility for memory allocated with malloc. However, that can't be known by the type alone. You have to query the accessibility at runtime.
Some systems like DGX Spark with integrated memory may perform better with a host-device accessible resource that isn't a managed memory resource (but that would require some system knowledge at runtime).
All this to say, someday we might want to refactor this to use cuda::mr::synchronous_resource_ref<> and check the accessibility at runtime rather than using cuda::mr::synchronous_resource_ref<cuda::mr::host_accessible, cuda::mr::device_accessible> which requires that accessibility to be statically known.
Thanks, that's a very important point for cuVS - we've been experimenting using various memory types on Grace Hopper and DGX Spark. I actually hoped that I could use the new resources (defined in this PR as they are right now) to do more experiments by switching the memory resources.
I think the naming goes against the intention a little bit, since we decouple the memory resources, raft resource handles, and the containers (mdarrays).
On the algorithm implementation side:
- When I'm using `raft::managed_mdarray` and `raft::get_managed_memory_resource_ref` in algorithm code, I mean more of "some (probably paged, smart) memory resource with guaranteed host and device access" rather than specifically `cudaMallocManaged`.
- Same for the pinned - "some (probably low-level, not-paged) memory resource with guaranteed host and device access and limited support for host-device atomics".

These two allow me to implement atomic synchronization between the device and host, reduce copy overheads, or just simplify the code a little bit. I don't need/want to query the resource properties at runtime for this.

On the user side (e.g. in cuvs benchmarks), I want to be able to configure the program for the target device: query the device properties, check whether ATS is available, and select the most appropriate resource that fits the bill. Only then wrap it into `cuda::mr::synchronous_resource_ref<cuda::mr::host_accessible, cuda::mr::device_accessible>`, pass it using `raft::set_managed_memory_resource`, and benefit from the improved performance.
tfeher
left a comment
Thanks Artem for this PR, it looks good to me!
/merge
Detailed tracking of (almost) all allocations on device and host.

```C++
// optionally pass an existing resource handle
raft::resources res;
// The tracking handle is a child of the resource handle; it wraps all memory resources with statistics adaptors
raft::memory_tracking_resources tracked(res, "allocations.csv", std::chrono::milliseconds(1));
// All allocations are logged to a .csv as long as `tracked` is alive
cuvs::neighbors::cagra::build(tracked, ...);
```

This produces a CSV file with sampled allocations with a timeline and NVTX correlation:

```csv
timestamp_us,nvtx_depth,nvtx_range,host_current,host_total,pinned_current,pinned_total,managed_current,managed_total,device_current,device_total,workspace_current,workspace_total,large_workspace_current,large_workspace_total
198809,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,148304,148304,0,0,0,0
199961,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,15588304,15588304,0,0,0,0
201350,1,"hnsw::build<ACE>",0,20008,0,0,0,0,0,40385488,0,0,0,0
222216,3,"cagra::build_knn_graph<IVF-PQ>(5000000, 1536, 72)",1440000000,1440020008,0,0,0,0,0,40385488,0,0,0,0
273892,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,0,0
304183,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,4388567040,4388567040
309064,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,53860384,94245872,0,0,4388567040,4388567040
334655,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,67339295,107724783,0,0,4388567040,4388567040
385037,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,74076743,114462231,0,0,4388567040,4388567040
386129,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,80814199,121199687,0,0,4388567040,4388567040
402750,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,46099768,126913967,0,0,4388567040,4388567040
...
```

This can later be visualized (the visualization script is not included in the PR):

<img width="2100" height="1350" alt="allocations" src="https://github.com/user-attachments/assets/3f0ab942-b49b-4e09-a0ea-9181725ae05e" />

#### Implementation overview

##### NVTX

Added thread-local tracking of the NVTX range stack; the calling thread shares a handle with the sampling thread to correlate the NVTX range state with allocations.

##### Memory resource adaptors

- statistics adaptor: atomically counts allocations/deallocations for any `cuda::mr`-compatible resource
- notifying adaptor: sets a shared "notifier" state on each event

##### Resource monitor

A resource monitor registers a collection of resource statistics objects, a single NVTX range handle, and a single notifier state. It spawns a new thread to sample the resource statistics at a given rate (but only when the notifier is triggered). This thread writes to a CSV output stream.

##### Memory tracking resources

`raft::memory_tracking_resources` is a child of `raft::resources`, thus it can be used as a drop-in replacement. It replaces all known memory resources for the duration of its lifetime and manages the output file or stream if necessary.

Depends on (and includes all changes of) #2968

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: #2973
Use `cuda::mr::any_synchronous_resource` for the host, pinned, and managed resource types, and give the user explicit control over these resources.
#### New
- `raft::resource::managed_memory_resource` and `raft::resource::pinned_memory_resource` are passed to managed and pinned mdarrays during construction via corresponding container policies. This allows the user to replace/modify these resources, for example, to add logging or memory pooling.
- `raft::mr::get_default_host_resource` and `raft::mr::set_default_host_resource` can be used by the user to alter the default host resource the same way. It is not stored in `raft::resources` handle like the other two for two reasons:
1. To mirror rmm default device resource getter/setter
2. To avoid breaking the `raft::make_host_mdarray` overloads that do not take `raft::resources` as an argument (many instances across raft and cuvs).
#### Changed
- Use `raft::mr::host_resource_ref` and `raft::mr::host_device_resource_ref` for the non-owning semantics (defined as `cuda::mr::synchronous_resource_ref` with appropriate access attributes)
- Use `raft::host_resource` and `raft::host_device_resource` for owning semantics (defined as `cuda::mr::any_synchronous_resource` with appropriate access attributes)
With these changes, raft fully switches to `cuda::mr` types for host and host-device resources, while still using `rmm` types for device async resources. Changing the latter would break a lot of cuVS and is not needed - `rmm` will eventually fully converge to `cuda::mr` anyway.
#### Breaking changes
- Rename container policies
- Reuse of a single `host_container` for the three types of resources.
- Switch to using `cuda::mr::any_synchronous_resource` from `std::pmr::memory_resource`
The effect of these changes should be limited, because the policies are hidden behind the mdarray templates and synonyms, and `std::pmr::memory_resource` was introduced recently and hasn't been used much.
Authors:
- Artem M. Chirkin (https://github.com/achirkin)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Tamas Bela Feher (https://github.com/tfeher)
URL: rapidsai#2968
Detailed tracking of (almost) all allocations on device and host. ```C++ // optionally pass an existing resource handle raft::resources res; // The tracking handle is a child of resource handle; it wraps all memory resources with statistics adaptors raft::memory_tracking_resources tracked(res, "allocations.csv", std::chrono::milliseconds(1)); // All allocations are logged to a .csv as long as `tracked` is alive cuvs::neighbors::cagra::build(tracked, ...); ``` This produces a CSV file with sampled allocations with a timeline and NVTX correlation ```csv timestamp_us,nvtx_depth,nvtx_range,host_current,host_total,pinned_current,pinned_total,managed_current,managed_total,device_current,device_total,workspace_current,workspace_total,large_workspace_current,large_workspace_total 198809,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,148304,148304,0,0,0,0 199961,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,15588304,15588304,0,0,0,0 201350,1,"hnsw::build<ACE>",0,20008,0,0,0,0,0,40385488,0,0,0,0 222216,3,"cagra::build_knn_graph<IVF-PQ>(5000000, 1536, 72)",1440000000,1440020008,0,0,0,0,0,40385488,0,0,0,0 273892,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,0,0 304183,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,4388567040,4388567040 309064,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,53860384,94245872,0,0,4388567040,4388567040 334655,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,67339295,107724783,0,0,4388567040,4388567040 385037,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,74076743,114462231,0,0,4388567040,4388567040 386129,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,80814199,121199687,0,0,4388567040,4388567040 402750,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,46099768,126913967,0,0,4388567040,4388567040 ... 
``` This can later be visualized (the visualization script is not included in the PR): <img width="2100" height="1350" alt="allocations" src="https://github.com/user-attachments/assets/3f0ab942-b49b-4e09-a0ea-9181725ae05e" /> #### Implementation overview ##### NVTX Added thread-local tracking of NVTX range stack; the calling thread shares a handle to the sampling thread to correlate the NVTX range state with allocations. ##### Memory resource adaptors - statistics adaptor: atomically counts allocations/deallocations for any `cuda::mr`-compatible resource - notifying adaptor: sets a shared "notifier" state on each event ##### Resource monitor A resource monitor registers a collection of resource statistics objects, a single NVTX range handle, and a single notifier state. It spawns a new thread to sample the resource statistics at a given rate (but only when the notifier is triggered). This thread writes to a CSV output stream. ##### Memory tracking resources `raft::memory_tracking_resources` is a child of `raft::resources`, thus can be used as a drop-in replacement. It replaces all known memory resource for the duration of its lifetime and manages the output file or stream if necessary. Depends on (and includes all changes of) rapidsai#2968 Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Tamas Bela Feher (https://github.com/tfeher) URL: rapidsai#2973
Backport PRs that were mistakenly merged into `main`:
- #2968
- #2973

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: #2983