Currently, registering local cuMem-mapped addresses in RegisterMemory requires NVLS/MNNVL support (the code here explicitly requires this). Real-world PyTorch and Megatron workloads use the cuMem driver API to manage tensors, so this requirement causes problems when integrating MSCCL++ into PyTorch and Megatron on machines without NVLS/MNNVL support. AFAIK, cuMem should be usable without such specialized hardware. Is this coupling of cuMem buffer registration with multicast support intended, or is it an ad-hoc constraint that will be lifted in a future release?