v4.1.x: opal/cuda: Handle VMM pointers #12751
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: d2921b0: opal/cuda: avoid direct access to cumem host numa ...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed: d2921b0 to 9cd2372 (Compare)
Hello! The Git Commit Checker CI bot found a few problems with this PR: 9cd2372: opal/cuda: avoid direct access to cumem host numa ...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed: 9cd2372 to b72a410 (Compare)
Hello! The Git Commit Checker CI bot found a few problems with this PR: b72a410: opal/cuda: avoid direct access to cumem host numa ...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed: b72a410 to dc7932b (Compare)
@Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

Hi @jsquyres. It is needed, but the changes will go into accelerator code paths that are quite different from those in the v4.1.x series. I'll post a PR soon.

Ok, good enough.
Force-pushed: dc7932b to 384d8bd (Compare)
@Akshay-Venkatesh You just changed this PR significantly. Is it complete and fully tested?

@jsquyres After making the changes to the main branch, I noticed that similar code would fit v4.1.x and that I had missed an additional check needed before marking memory as device vs. host. I made those changes to both of my branches and have tested them extensively to make sure everything passes. I'd appreciate another round of reviews to confirm I didn't miss anything.
```c
CUmemGenericAllocationHandle alloc_handle;
/* Check if memory is allocated using VMM API and see if host memory needs
 * to be treated as pinned device memory */
result = cuFunc.cuMemRetainAllocationHandle(&alloc_handle, (void*)dbuf);
```
This looks not only overly complicated but also incorrect.

Regarding correctness: according to the CUDA documentation, each call to cuMemRetainAllocationHandle must be matched with a call to cuMemRelease, which I don't see in this PR. As a result, the memory region referenced here can never be released.

What exactly do you get from the combination cuMemRetainAllocationHandle + cuMemGetAllocationPropertiesFromHandle that you could not have obtained from cuMemGetAccess?
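For context, a minimal sketch of the balanced pattern the review asks for: retain a handle for the pointer, query its allocation properties, then drop the reference with a matching cuMemRelease (which only releases the handle, not the still-mapped allocation). This is an illustration, not the PR's actual code: it calls the driver API directly rather than through Open MPI's cuFunc dlopen table, assumes a current CUDA context and a CUDA 12.2+ toolkit for the host location types, and simplifies error handling.

```c
#include <cuda.h>
#include <stdbool.h>

/* Sketch: classify a pointer that may come from the CUDA VMM (cuMemCreate)
 * API. Returns true if the allocation is host-located (HOST / HOST_NUMA /
 * HOST_NUMA_CURRENT), i.e. memory a CPU thread cannot touch unless access
 * was granted via cuMemSetAccess. */
static bool vmm_alloc_is_host_located(const void *buf)
{
    CUmemGenericAllocationHandle handle;
    CUmemAllocationProp prop;
    bool is_host = false;

    /* Fails (e.g. CUDA_ERROR_INVALID_VALUE) if buf was not allocated
     * through the VMM API, in which case it is not our concern here. */
    if (CUDA_SUCCESS != cuMemRetainAllocationHandle(&handle, (void *)buf)) {
        return false;
    }

    if (CUDA_SUCCESS == cuMemGetAllocationPropertiesFromHandle(&prop, handle)) {
        is_host = (prop.location.type == CU_MEM_LOCATION_TYPE_HOST ||
                   prop.location.type == CU_MEM_LOCATION_TYPE_HOST_NUMA ||
                   prop.location.type == CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT);
    }

    /* The matching release the review asks for: drops the reference taken
     * by cuMemRetainAllocationHandle without unmapping the allocation. */
    (void) cuMemRelease(handle);
    return is_host;
}
```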
Put this back in Draft mode because @bosilca's last comments here were voicing objections (and I don't want to accidentally merge it). So let's get those objections addressed, and then this can get merged.
Force-pushed: 384d8bd to d11e109 (Compare)
Force-pushed: d11e109 to 2d04ca7 (Compare)
Signed-off-by: Akshay Venkatesh <[email protected]>
Force-pushed: 2d04ca7 to 0b51fea (Compare)
Memory allocated via the cuMemCreate API with a location type of {CU_MEM_LOCATION_TYPE_HOST / CU_MEM_LOCATION_TYPE_HOST_NUMA / CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT} is detected as host memory by the pointer-attribute query API, but the CPU cannot access such memory through memcpy or other CPU load/store mechanisms unless access is explicitly requested with cuMemSetAccess. Without the changes in this PR, HOST_NUMA-backed cuMemCreate memory is treated as plain host memory by the Open MPI layers (opal/datatype, ompi/coll), and subsequent accesses by a CPU thread lead to illegal access errors.
bot:notacherrypick
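To make the failure mode concrete, here is a hedged sketch of how such an allocation is typically set up with the VMM API. The function name, the chosen NUMA node, and the omitted error checks are illustrative assumptions, not code from the PR; it assumes an initialized driver with a current context and a driver new enough to expose the host location types (roughly CUDA 12.2+). The key point: until the cuMemSetAccess step runs, a plain CPU load/store into the mapped range faults, which is exactly the illegal access described above.

```c
#include <cuda.h>
#include <stdint.h>
#include <string.h>

/* Sketch: host-NUMA-backed VMM allocation. CPU access becomes legal only
 * after cuMemSetAccess grants it; skipping that step reproduces the
 * illegal-access failure the PR describes. Error checks omitted. */
void vmm_host_numa_demo(size_t size)
{
    CUmemAllocationProp prop = {0};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
    prop.location.id   = 0;               /* NUMA node 0: illustrative */

    size_t gran;
    cuMemGetAllocationGranularity(&gran, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded = ((size + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle handle;
    CUdeviceptr ptr;
    cuMemCreate(&handle, padded, &prop, 0);
    cuMemAddressReserve(&ptr, padded, gran, 0, 0);
    cuMemMap(ptr, padded, 0, handle, 0);

    /* Without this grant, a CPU load/store into ptr faults even though
     * pointer queries report the memory as host-located. */
    CUmemAccessDesc desc = {0};
    desc.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
    desc.location.id   = 0;
    desc.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, padded, &desc, 1);

    memset((void *)(uintptr_t)ptr, 0, padded);   /* now legal for the CPU */

    cuMemUnmap(ptr, padded);
    cuMemAddressFree(ptr, padded);
    cuMemRelease(handle);
}
```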