[Offload] Update allocations to include device #154733
```diff
@@ -183,6 +183,7 @@ namespace llvm {
 namespace offload {

 struct AllocInfo {
+  void *Base;
   ol_device_handle_t Device;
   ol_alloc_type_t Type;
 };
```
```diff
@@ -201,8 +202,8 @@ struct OffloadContext {
   bool TracingEnabled = false;
   bool ValidationEnabled = true;
-  DenseMap<void *, AllocInfo> AllocInfoMap{};
-  std::mutex AllocInfoMapMutex{};
+  SmallVector<AllocInfo> AllocInfoList{};
+  std::mutex AllocInfoListMutex{};
   SmallVector<ol_platform_impl_t, 4> Platforms{};
   size_t RefCount;
```
```diff
@@ -625,30 +626,37 @@ Error olMemAlloc_impl(ol_device_handle_t Device, ol_alloc_type_t Type,
   *AllocationOut = *Alloc;
   {
-    std::lock_guard<std::mutex> Lock(OffloadContext::get().AllocInfoMapMutex);
-    OffloadContext::get().AllocInfoMap.insert_or_assign(
-        *Alloc, AllocInfo{Device, Type});
+    std::lock_guard<std::mutex> Lock(OffloadContext::get().AllocInfoListMutex);
+    OffloadContext::get().AllocInfoList.emplace_back(
+        AllocInfo{*AllocationOut, Device, Type});
   }
   return Error::success();
 }

-Error olMemFree_impl(void *Address) {
-  ol_device_handle_t Device;
-  ol_alloc_type_t Type;
+Error olMemFree_impl(ol_device_handle_t Device, void *Address) {
+  AllocInfo Removed;
   {
-    std::lock_guard<std::mutex> Lock(OffloadContext::get().AllocInfoMapMutex);
-    if (!OffloadContext::get().AllocInfoMap.contains(Address))
-      return createOffloadError(ErrorCode::INVALID_ARGUMENT,
-                                "address is not a known allocation");
-    auto AllocInfo = OffloadContext::get().AllocInfoMap.at(Address);
-    Device = AllocInfo.Device;
-    Type = AllocInfo.Type;
-    OffloadContext::get().AllocInfoMap.erase(Address);
+    std::lock_guard<std::mutex> Lock(OffloadContext::get().AllocInfoListMutex);
+    auto &List = OffloadContext::get().AllocInfoList;
+    auto Entry = std::find_if(List.begin(), List.end(), [&](AllocInfo &Entry) {
+      return Address == Entry.Base && (!Device || Entry.Device == Device);
+    });
+    if (Entry == List.end())
+      return Plugin::error(ErrorCode::NOT_FOUND,
+                           "could not find memory allocated by olMemAlloc");
+    if (!Device && Entry->Type == OL_ALLOC_TYPE_DEVICE)
+      return Plugin::error(
+          ErrorCode::NOT_FOUND,
+          "specifying the Device parameter is required to query device memory");
+    Removed = std::move(*Entry);
+    *Entry = List.pop_back_val();
   }

-  if (auto Res =
-          Device->Device->dataDelete(Address, convertOlToPluginAllocTy(Type)))
+  if (auto Res = Removed.Device->Device->dataDelete(
+          Removed.Base, convertOlToPluginAllocTy(Removed.Type)))
     return Res;

   return Error::success();
 }
```
I considered having this be a `Map<void *, SmallVector<AllocInfo>>`, however I think it's better if functions like `olMemGetInfo` accept any address inside the allocation instead of just the start. This makes that logic much less clunky.
Why? I don't think we have such a use case in SYCL. Does OpenMP?
This decision makes the implementation more complicated, and I'm not sure there's much value in it.
Currently, the implementation of `urEnqueueUSMMemcpy` needs to look up, inside liboffload, whether the allocation is from the host or the device. As far as I know, none of the UR functions working on USM pointers require the pointer to be at the start of an allocation, so we'd need to be able to look up the allocation type from anywhere in the buffer.
What's the reason olMemcpy needs to take destination/source devices? L0/UR/OpenCL/CUDA don't. The underlying APIs all seem to support inferring the device types implicitly.
UR does require the destination and source device: it needs to know which backend (AMD/CUDA/OpenCL/etc.) to dispatch to, which is stored in `ur_queue_handle_t`. OpenCL also requires the `cl_intel_unified_shared_memory` extension to query the source and destination device.

Since the queue is optional in `olMemcpy`, a device is required to look up the backend (AMD or Nvidia). And if we need one device, we may as well have the user specify both instead of requiring the backend to determine whether the copy is h2d/h2h/d2d. It also allows the API to be extended in the future to support copying data between two platforms if needed.

I wasn't involved in the initial design for this, so @jhuber6 and @callumfare might be able to shed more light on why it is the way it is.
Not sure I follow. Yes, the UR entry point needs an hQueue to do the dispatch, but it's there because memcpy is logically an operation on the queue.

I think we should revisit the decision to make the queue optional. Given the current API, the best we can probably do is something like an interval map lookup (twice, once each for src and dst), which is going to be both complex and expensive. Do we really want to pay the cost of those lookups for every USM memory operation?
In UR, we have a separate non-queued blocking copy that operates on a context. We could do something similar instead of trying to overload a single function with multiple different functionalities.