v4.1.x: opal/cuda: Handle VMM pointers #12751

Akshay-Venkatesh · 2024-08-13T17:50:09Z

Memory allocated with the cuMemCreate API with a location of CU_MEM_LOCATION_TYPE_HOST, CU_MEM_LOCATION_TYPE_HOST_NUMA, or CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT is reported as host memory by the pointer query API, but the CPU cannot access such memory with memcpy or other CPU load/store mechanisms unless access is explicitly requested with cuMemSetAccess. Without the changes in this PR, HOST_NUMA-backed cuMemCreate memory is detected as host memory by the Open MPI layers (opal/datatype, ompi/coll), and subsequent accesses by a CPU thread lead to illegal access errors.
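A minimal sketch of the classification problem described above, written in plain C so that it builds without the CUDA toolkit. It assumes the pointer query has already reported the buffer as host memory. All type and function names here are hypothetical stand-ins (they mirror, but are not, CUmemLocationType and the access flags reported by the driver), and the policy shown is only an illustration, not the code in this PR:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the CUDA driver's CUmemLocationType values. */
typedef enum {
    LOC_DEVICE,
    LOC_HOST,
    LOC_HOST_NUMA,
    LOC_HOST_NUMA_CURRENT
} loc_type_t;

/* Hypothetical stand-ins for the access protection granted to the CPU. */
typedef enum {
    ACCESS_NONE,
    ACCESS_READ,
    ACCESS_READWRITE
} access_t;

/* Hypothetical summary of what the pointer queries reported for a buffer
 * that the pointer query API already classified as host memory. */
typedef struct {
    bool is_vmm;          /* allocation came from the VMM API (cuMemCreate) */
    loc_type_t location;  /* where the physical memory lives */
    access_t cpu_access;  /* access granted to the CPU, e.g. via cuMemSetAccess */
} ptr_info_t;

/* Return true when the CPU may use memcpy or plain load/store on the buffer. */
static bool cpu_may_touch(const ptr_info_t *info)
{
    if (!info->is_vmm) {
        /* Ordinary host memory: directly accessible by the CPU. */
        return true;
    }
    if (info->location == LOC_DEVICE) {
        /* Device-backed VMM memory is never touched directly here. */
        return false;
    }
    /* Host-backed VMM memory: conservatively require read/write access,
     * since the buffer may be used for either sending or receiving. */
    return info->cpu_access == ACCESS_READWRITE;
}
```

Under this sketch, a HOST_NUMA-backed VMM buffer with no CPU access granted would be routed through the device copy path rather than memcpy.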

bot:notacherrypick

github-actions · 2024-08-13T17:50:46Z

Hello! The Git Commit Checker CI bot found a few problems with this PR:

d2921b0: opal/cuda: avoid direct access to cumem host numa ...

check_signed_off: does not contain a valid Signed-off-by line
check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

github-actions · 2024-08-13T17:56:31Z

Hello! The Git Commit Checker CI bot found a few problems with this PR:

9cd2372: opal/cuda: avoid direct access to cumem host numa ...

check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

github-actions · 2024-08-13T18:06:56Z

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b72a410: opal/cuda: avoid direct access to cumem host numa ...

check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

jsquyres · 2024-08-13T18:41:08Z

@Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

Akshay-Venkatesh · 2024-08-13T18:59:01Z

> @Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

Hi @jsquyres. It is needed, but the changes will go into accelerator code paths that are quite different from those that exist in the 4.1.x series. I'll post a PR soon.

jsquyres · 2024-08-13T19:02:08Z

Ok, good enough.

PR was changed after review

jsquyres · 2024-08-14T12:07:40Z

@Akshay-Venkatesh You just changed this PR significantly. Is it complete and fully tested?

Akshay-Venkatesh · 2024-08-14T18:05:49Z

@jsquyres After making changes to the main branch, I noticed that similar code would fit 4.1.x, and that I had missed an additional check needed before marking memory as device vs. host. I made those changes to both of my branches and have tested this extensively to make sure everything passes. I would appreciate another round of reviews to make sure I didn't miss anything.

config/opal_check_cuda.m4

bosilca · 2024-08-15T05:13:31Z

opal/mca/common/cuda/common_cuda.c

+    CUmemGenericAllocationHandle alloc_handle;
+    /* Check if memory is allocated using VMM API and see if host memory needs
+     * to be treated as pinned device memory */
+    result = cuFunc.cuMemRetainAllocationHandle(&alloc_handle, (void*)dbuf);


This looks not only overly complicated but also incorrect.

Regarding correctness: according to the CUDA documentation, each call to cuMemRetainAllocationHandle must be matched with a call to cuMemRelease, which I don't see in this PR. As a result, the memory region referenced here can never be released.

What exactly do you get from the combination of cuMemRetainAllocationHandle + cuMemGetAllocationPropertiesFromHandle that you could not have obtained from cuMemGetAccess?
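The retain/release pairing raised above can be sketched as follows. This is an illustration in plain C, not the code from this PR: the function table and its types are hypothetical stand-ins (loosely modeled on the cuFunc table visible in the diff), and the fake_* functions are test doubles that only count outstanding references.

```c
#include <stddef.h>

/* Hypothetical stand-ins for CUmemGenericAllocationHandle and CUresult. */
typedef unsigned long long alloc_handle_t;
typedef int vmm_status_t;
#define VMM_OK 0

/* Hypothetical function table; the comments name the driver calls each
 * entry is meant to represent. */
typedef struct {
    vmm_status_t (*retain)(alloc_handle_t *handle, void *addr);        /* cf. cuMemRetainAllocationHandle */
    vmm_status_t (*get_location)(int *loc_type, alloc_handle_t handle); /* cf. cuMemGetAllocationPropertiesFromHandle */
    vmm_status_t (*release)(alloc_handle_t handle);                     /* cf. cuMemRelease */
} vmm_ops_t;

/* Query the location type of a VMM allocation. Returns 0 on success and -1
 * on failure. Every successful retain is paired with exactly one release,
 * including when the property query fails. */
static int query_location(const vmm_ops_t *ops, void *addr, int *loc_type)
{
    alloc_handle_t handle;
    int rc = 0;

    if (ops->retain(&handle, addr) != VMM_OK) {
        /* Nothing was retained, so there is nothing to release. */
        return -1;
    }
    if (ops->get_location(loc_type, handle) != VMM_OK) {
        rc = -1;
    }
    if (ops->release(handle) != VMM_OK) {
        rc = -1;
    }
    return rc;
}

/* Test doubles: track how many references are still outstanding. */
static int fake_refs = 0;

static vmm_status_t fake_retain(alloc_handle_t *handle, void *addr)
{
    (void)addr;
    *handle = 1;
    fake_refs++;
    return VMM_OK;
}

static vmm_status_t fake_get_location(int *loc_type, alloc_handle_t handle)
{
    (void)handle;
    *loc_type = 2; /* arbitrary placeholder value */
    return VMM_OK;
}

static vmm_status_t fake_release(alloc_handle_t handle)
{
    (void)handle;
    fake_refs--;
    return VMM_OK;
}
```

Funneling every exit after a successful retain through the single release call keeps the reference count balanced regardless of which query fails.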

jsquyres · 2024-09-10T15:17:42Z

Put this back in Draft mode, because @bosilca's last comments on here were voicing objections (and I don't want to accidentally merge it). So let's get those objections addressed, and then this can get merged.

Signed-off-by: Akshay Venkatesh <[email protected]>

github-actions bot added this to the v4.1.7 milestone Aug 13, 2024

Akshay-Venkatesh requested a review from janjust August 13, 2024 17:50

github-actions bot added the Target: v4.1.x label Aug 13, 2024

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from d2921b0 to 9cd2372 Compare August 13, 2024 17:55

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from 9cd2372 to b72a410 Compare August 13, 2024 18:06

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from b72a410 to dc7932b Compare August 13, 2024 18:19

Akshay-Venkatesh assigned janjust Aug 13, 2024

janjust previously approved these changes Aug 13, 2024

jsquyres added the RM approved label Aug 13, 2024

Akshay-Venkatesh mentioned this pull request Aug 14, 2024

opal/cuda: Handle CUDA VMM pointers in accelerator check_addr function #12757

Merged

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from dc7932b to 384d8bd Compare August 14, 2024 06:50

bosilca reviewed Aug 15, 2024

jsquyres removed the RM approved label Aug 15, 2024

janjust changed the title ~~opal/cuda: avoid direct access to cumem host numa memory~~ v4.1.x: opal/cuda: avoid direct access to cumem host numa memory Aug 23, 2024

jsquyres marked this pull request as draft September 10, 2024 15:14

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from 384d8bd to d11e109 Compare September 24, 2024 20:46

Akshay-Venkatesh changed the title ~~v4.1.x: opal/cuda: avoid direct access to cumem host numa memory~~ v4.1.x: opal/cuda: Handle VMM pointers Sep 24, 2024

Akshay-Venkatesh marked this pull request as ready for review September 25, 2024 14:40

janjust approved these changes Sep 26, 2024

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from d11e109 to 2d04ca7 Compare September 27, 2024 19:43

opal/cuda: Handle VMM pointers in cuda_check_addr

0b51fea

Signed-off-by: Akshay Venkatesh <[email protected]>

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from 2d04ca7 to 0b51fea Compare September 27, 2024 21:04

Akshay-Venkatesh requested a review from bosilca September 27, 2024 21:06

bosilca approved these changes Sep 27, 2024

jsquyres merged commit 467fbb9 into open-mpi:v4.1.x Sep 30, 2024
9 checks passed
