Skip to content

Conversation

@Akshay-Venkatesh
Copy link
Contributor

@Akshay-Venkatesh Akshay-Venkatesh commented Aug 13, 2024

Memory allocated using cumemcreate API with location as {CU_MEM_LOCATION_TYPE_HOST/CU_MEM_LOCATION_TYPE_HOST_NUMA/CU_MEM_LOCATION_TYPE_HOST _NUMA_CURRENT} can be detected as host memory type by pointer query API but this doesn't allow the CPU to access such memory using memcpy or other CPU load/store mechanisms unless explicitly requested with cuMemSetAccess. Without the changes in this PR, HOST_NUMA backed cumemcreate memory is detected as host by openmpi layers (opal/datatype, ompi/coll) and subsequent accesses by CPU thread leads to illegal access errors.

bot:notacherrypick

@github-actions github-actions bot added this to the v4.1.7 milestone Aug 13, 2024
@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

d2921b0: opal/cuda: avoid direct access to cumem host numa ...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from d2921b0 to 9cd2372 Compare August 13, 2024 17:55
@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

9cd2372: opal/cuda: avoid direct access to cumem host numa ...

  • check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from 9cd2372 to b72a410 Compare August 13, 2024 18:06
@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b72a410: opal/cuda: avoid direct access to cumem host numa ...

  • check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from b72a410 to dc7932b Compare August 13, 2024 18:19
janjust
janjust previously approved these changes Aug 13, 2024
@jsquyres
Copy link
Member

@Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

@Akshay-Venkatesh
Copy link
Contributor Author

@Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

Hi @jsquyres . It is needed but the changes will go into accelerator code paths that are quite different from those that exist in 4.1.x series. I'll post a PR soon.

@jsquyres
Copy link
Member

Ok, good enough.

@jsquyres
Copy link
Member

@Akshay-Venkatesh You just changed this PR significantly. Is it complete and fully tested?

@Akshay-Venkatesh
Copy link
Contributor Author

@jsquyres After making changes to main branch I noticed that similar code would fit for 4.1.x and I had missed an additional check that was needed before marking memory as device vs host. I made those changes to both my branches and I've tested this extensively to make sure everything passes. Would appreciate another round of reviews to make sure I didn't miss anything.

CUmemGenericAllocationHandle alloc_handle;
/* Check if memory is allocated using VMM API and see if host memory needs
* to be treated as pinned device memory */
result = cuFunc.cuMemRetainAllocationHandle(&alloc_handle, (void*)dbuf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks not only overly complicated but also incorrect.

Regarding correctness: according to the CUDA documentation each call the cuMemRetainAllocationHandle must be matched with a call to cuMemRelease, which i don't see in this PR. This will result in the memory region referenced here not being able to be released.

What exactly do you get from the combination cuMemRetainAllocationHandle + cuMemGetAllocationPropertiesFromHandle that you could not have obtained from cuMemGetAccess ?

@janjust janjust changed the title opal/cuda: avoid direct access to cumem host numa memory v4.1.x: opal/cuda: avoid direct access to cumem host numa memory Aug 23, 2024
@jsquyres jsquyres marked this pull request as draft September 10, 2024 15:14
@jsquyres
Copy link
Member

Put this back in Draft mode, because @bosilca's last comments on here were voicing objections (and I don't want to accidentally merge it). So let's get those objections addressed, and then this can get merged.

@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from 384d8bd to d11e109 Compare September 24, 2024 20:46
@Akshay-Venkatesh Akshay-Venkatesh changed the title v4.1.x: opal/cuda: avoid direct access to cumem host numa memory v4.1.x: opal/cuda: Handle VMM pointers Sep 24, 2024
@Akshay-Venkatesh Akshay-Venkatesh marked this pull request as ready for review September 25, 2024 14:40
@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from d11e109 to 2d04ca7 Compare September 27, 2024 19:43
@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from 2d04ca7 to 0b51fea Compare September 27, 2024 21:04
@jsquyres jsquyres merged commit 467fbb9 into open-mpi:v4.1.x Sep 30, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants