-
Notifications
You must be signed in to change notification settings - Fork 791
[SYCL][UR] Do not call urDeviceRetain()
in device_impl
constructor at all
#20065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][UR] Do not call urDeviceRetain()
in device_impl
constructor at all
#20065
Conversation
b07c1d4
to
51a4b37
Compare
51a4b37
to
ab4b9b9
Compare
ab4b9b9
to
7c033d0
Compare
7c033d0
to
054acf1
Compare
054acf1
to
7eb40c4
Compare
urDeviceRetain()
in case of subdevices
7eb40c4
to
1e745dc
Compare
@pbalcer please review |
1e745dc
to
1ab563c
Compare
@intel/llvm-gatekeepers please consider merging |
sycl/source/detail/device_impl.cpp
Outdated
if (!IsSubDevice) { | ||
// Interoperability Constructor already calls DeviceRetain in | ||
// urDeviceCreateWithNativeHandle. | ||
getAdapter().call<UrApiKind::urDeviceRetain>(MDevice); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The destructor device_impl::~device_impl
calls urDeviceRelease
, which is meant to balance the urDeviceRetain
in the constructor. However, after this change, urDeviceRetain
is no longer called for subdevices in the constructor, while urDeviceRelease
is still invoked for them in the destructor. This imbalance is somewhat concerning. It seems that the actual root cause of the memory leak may lie elsewhere. Could you please clarify what you believe the underlying issue is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that when a device is created RefCount
is equal 1 without any call to retain()
, but it is destroyed when RefCount
is equal 0, so there should be one more call to release()
than a number of calls to retain()
... or the implementation of class RefCount
is wrong ?
class RefCount {
public:
RefCount(uint32_t count = 1) : Count(count) {}
...
uint32_t retain() { return ++Count; }
bool release() { return --Count == 0; }
void reset(uint32_t value = 1) { Count = value; }
private:
std::atomic_uint32_t Count;
};
...
ur_result_t urDeviceRelease(ur_device_handle_t Device) {
// Root devices are destroyed during the piTearDown process.
if (Device->isSubDevice()) {
if (Device->RefCount.release()) {
delete Device;
}
}
return UR_RESULT_SUCCESS;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth pointing out that only subdevices are really ref counted:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"(only sub-devices being ref counted seems like implementation detail of the adapter)"
That's the reason. I pointed it out because it's possible that the existing sequence of creates/retains/releases was always broken, and we never noticed because the leak happens in the rare case of subdevices. So in a way, I'm agreeing with your earlier comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am testing not calling urDeviceRetain()
in device_impl
ctor at all (#20144) - one unittest (sycl/unittests/context_device/DeviceRefCounter.cpp
) will have to be changed at least ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@againull I have removed urDeviceRetain(MDevice)
from the device_impl
constructor completely.
Please re-review.
@intel/llvm-gatekeepers please consider merging |
1 similar comment
@intel/llvm-gatekeepers please consider merging |
urDeviceRetain()
in case of subdevicesurDeviceRetain()
in device_impl
constructor at all
119f6a2
to
6396c62
Compare
urDeviceRetain(MDevice) should not be called in device_impl ctor at all, because RefCounter is initialized with 1, when the device is created. It fixes URT-961. Signed-off-by: Lukasz Dorau <[email protected]>
I don't know enough about how the SYCL device object is managed to be able to say whether this is correct. But, generally, this makes sense to me - devices (and all other UR objects) start with refcount 1, so retaining them immediately after creation is unnecessary. |
// So for this test, we just do it. | ||
sycl::detail::GlobalHandler::instance().getPlatformCache().clear(); | ||
|
||
EXPECT_EQ(DevRefCounter, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit off that ref count is zero here, because callback for urDeviceGet (redefinedDevicesGetAfter) increments the DevRefCounter. So, even if we don't call Retain anymore, ref count is still expected to be at least 1 because of urDevicesGet. It would be nice to either understand why it is normal to have zero ref count here or to update the test if necessary if there is something wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@againull I have debugged this unittest. It calls 3 times platform_impl::getDevicesImplHelper()
:
llvm/sycl/source/detail/platform_impl.cpp
Line 497 in 6b29cf1
void platform_impl::getDevicesImplHelper(ur_device_type_t UrDeviceType, |
urDeviceGet(NumDevices == 0)
what setsDevRefCounter == 0
(usingredefinedDevicesGetAfter()
):
llvm/sycl/source/detail/platform_impl.cpp
Lines 502 to 504 in 6b29cf1
MAdapter->call<UrApiKind::urDeviceGet>(MPlatform, UrDeviceType, 0u, // CP info::device_type::all nullptr, &NumDevices); urDeviceGet(NumDevices == 1)
what setsDevRefCounter == 1
(usingredefinedDevicesGetAfter()
):
llvm/sycl/source/detail/platform_impl.cpp
Lines 527 to 530 in 6b29cf1
MAdapter->call<UrApiKind::urDeviceGet>( MPlatform, UrDeviceType, // CP info::device_type::all NumDevices, UrDevices.data(), nullptr); urDeviceRelease()
on this device what decrementsDevRefCounter
to0
(usingredefinedDeviceReleaseAfter()
):
llvm/sycl/source/detail/platform_impl.cpp
Lines 560 to 563 in 6b29cf1
// The reference counter for handles, that we used to create sycl objects, is // incremented, so we need to call release here. for (ur_device_handle_t &UrDev : UrDevicesToCleanUp) MAdapter->call<UrApiKind::urDeviceRelease>(UrDev);
... what looks like:
- redefinedDevicesGetAfter(*params.pphDevices == 0) DevRefCounter == 0 from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1) DevRefCounter == 1 from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter() DevRefCounter == 0 from urDeviceRelease()
- redefinedDevicesGetAfter(*params.pphDevices == 0) DevRefCounter == 0 from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1) DevRefCounter == 1 from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter() DevRefCounter == 0 from urDeviceRelease()
- redefinedDevicesGetAfter(*params.pphDevices == 0) DevRefCounter == 0 from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1) DevRefCounter == 1 from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter() DevRefCounter == 0 from urDeviceRelease()
@againull Does it make sense or is there anything wrong here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit off that ref count is zero here, because callback for urDeviceGet (redefinedDevicesGetAfter) increments the DevRefCounter. So, even if we don't call Retain anymore, ref count is still expected to be at least 1 because of urDevicesGet. It would be nice to either understand why it is normal to have zero ref count here or to update the test if necessary if there is something wrong.
@againull So, summarizing, each time urDeviceGet()
(redefinedDevicesGetAfter()
) increments the DevRefCounter()
at
llvm/sycl/source/detail/platform_impl.cpp
Lines 527 to 530 in 6b29cf1
MAdapter->call<UrApiKind::urDeviceGet>( | |
MPlatform, | |
UrDeviceType, // CP info::device_type::all | |
NumDevices, UrDevices.data(), nullptr); |
a few moments later it is decremented to 0 by urDeviceRelease()
(redefinedDeviceReleaseAfter()
) at
llvm/sycl/source/detail/platform_impl.cpp
Lines 560 to 563 in 6b29cf1
// The reference counter for handles, that we used to create sycl objects, is | |
// incremented, so we need to call release here. | |
for (ur_device_handle_t &UrDev : UrDevicesToCleanUp) | |
MAdapter->call<UrApiKind::urDeviceRelease>(UrDev); |
So, IMHO, all is OK here and works correctly. Do you agree @againull ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@againull ping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for delay. It looks like that this code:
// The reference counter for handles, that we used to create sycl objects, is
// incremented, so we need to call release here.
for (ur_device_handle_t &UrDev : UrDevicesToCleanUp)
MAdapter->call<UrApiKind::urDeviceRelease>(UrDev);
needs to be removed as well. I believe these release calls were needed just because of redundant retain in device_impl constructor, but you are removing that retain now. Could you please try to remove this part as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@againull It seems it works with this code removed, only the sycl/unittests/context_device/DeviceRefCounter.cpp
test should be changed, but frankly I do not know how (see https://github.com/intel/llvm/pull/20341/files) - could you help with that? Do you have any idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldorau I have taken a closer look. Actually, I believe initial device_impl implementation was correct (urRetain in constructor and urRelease in destructor) because device_impl doesn't own UR handle exclusively, there is shared ownership. For example, there might be a situation when two device_impl objects are created with the same UR handle, in this case it is supposed to look like this:
urDeviceGet -> we get ur_device_handle_t with ref count == 1
create first object using that handle -> ref count == 2
create second object using same handle -> ref count == 3
urRelease -> we call this when current code doesn't need device handle anymore. (ref count 2)
destructor of first object (ref count 1)
destructor of first object (ref count 0)
Same for sub-devices: urDevicePartition is similar to urDeviceGet (it returns +1 refcount).
So, I believe the correct fix will be this (includes test which fails without changes):
againull@bbd8db9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@againull Right, it makes sense. Thanks for this fix. Could you submit a pull request with this fix, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, sure, opened #20370
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
@pbalcer please re-review |
@intel/llvm-gatekeepers please consider merging |
Replaced with #20370 |
urDeviceRetain(MDevice)
should not be called in thedevice_impl
constructor at all,because
RefCounter
is initialized with 1, when the device is created.It fixes URT-961.