fix: extract handle value in Event.query() call to match other driver calls #747

ccam80 wants to merge 14 commits into NVIDIA:main
Conversation
It looks as if the `query` method missed out on the update in 20a2e3b. The method includes a try/except statement to catch CUDA_ERROR_NOT_READY, which fell through to an `else` statement that returned True. This `else` swallowed the exception raised by providing a non-integer handle and returned True regardless of the progress of the stream being queried.
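A minimal, CUDA-free sketch of that failure mode (the `fake_cu_event_query` stub and both query wrappers are invented for illustration; this is not numba-cuda's actual code): a driver binding that rejects non-integer handles raises `TypeError`, and an overly broad except intended for CUDA_ERROR_NOT_READY swallows it, so the method reports completion regardless of stream state.

```python
import ctypes


class CudaNotReady(Exception):
    """Stands in for CUDA_ERROR_NOT_READY."""


def fake_cu_event_query(handle):
    # Hypothetical driver binding: accepts only plain integer handles.
    if not isinstance(handle, int):
        raise TypeError("handle must be an int")
    raise CudaNotReady  # pretend the event is still pending


def buggy_query(handle):
    # Mirrors the bug: the ctypes object is passed straight through,
    # and the broad except meant for CUDA_ERROR_NOT_READY also
    # swallows the resulting TypeError.
    try:
        fake_cu_event_query(handle)
    except Exception:
        pass
    return True  # reached no matter the progress of the stream


def fixed_query(handle):
    # The fix: extract .value so the binding sees an int, and catch
    # only the not-ready error.
    try:
        fake_cu_event_query(handle.value)
    except CudaNotReady:
        return False
    return True


handle = ctypes.c_void_p(1234)
print(buggy_query(handle))  # True even though the event is pending
print(fixed_query(handle))  # False: correctly reports "not ready"
```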
Pull request overview
This PR fixes a bug in the Event.query() method where it was incorrectly passing a ctypes pointer object instead of its integer value to the driver function. This caused TypeErrors instead of the intended CUDA_ERROR_NOT_READY exception handling. The fix makes Event.query() consistent with other Event methods like record(), synchronize(), and wait().
Changes:
- Fixed `Event.query()` to extract `handle.value` before passing to `driver.cuEventQuery()`, matching the pattern used in other Event methods
- Added regression test `test_event_query()` to verify the query method works correctly with asynchronous stream operations
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| numba_cuda/numba/cuda/cudadrv/driver.py | Fixed Event.query() to pass integer handle value instead of c_void_p object to driver function, and restructured try/except for clarity |
| numba_cuda/numba/cuda/tests/cudadrv/test_events.py | Added regression test that exercises Event.query() in a realistic scenario with delayed stream operations |
/ok to test b698a5d
```python
assert sync_time * 1000 > spin_ms * 0.9  # nanosleep isn't reliable

# Give a few ms overhead for the synchronize call to complete
assert sync_time - event_time < 2e-3
```

Suggested change:

```diff
- assert sync_time - event_time < 2e-3
+ assert sync_time - event_time < 10e-3
```
Picking through the test results, it looks as if most are build failures or server timeouts - do these indicate a problem with the patch, or are they commonplace? Two tests failed due to an AssertionError: the 2ms overhead limit for the synchronize call was arbitrary and looks too small. I've added a suggestion to change it to 10ms, which still allows >200ms for the kernel call to return, so it shouldn't hide the bug. I don't want to interfere with your test/review process - let me know if the suggested changes (Copilot, Greptile, me) look good and I can update the PR.
```python
# Give a few ms overhead for the synchronize call to complete
assert sync_time - event_time < 2e-3
```
In general, we've had a very poor experience with trying to find healthy timing tolerances for things like this. What I would suggest is that we ensure the time is a positive number, but beyond that I don't think we should worry about actually trying to capture the timing in a test.
That makes sense - on a second look, this assertion is redundant anyway. If the event query time is >0.9 of the requested sleep time (250ms) then we can be pretty confident that the query didn't fall straight through and this extra test is pointless nitpicking. I'll reorder and clarify the two other assertions.
I'm going to apply some of the AI commentary to get rid of all the noise.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@cpcloud Thanks for the de-noising, I've disabled auto-review for future PRs. Greptile seems very capable of finding real errors, and I'll do a typos pass before committing.

@kkraus14 Thanks for the guidance on the absolute time epsilon. The assertion wasn't really doing anything, so I've removed it and reorganised the other two to better indicate the source of failure if the test doesn't pass.
```python
event_time = perf_counter() - t0
while not evt.query():
    event_time = perf_counter() - t0
```
Instead of busy looping the CPU here we can create two events with timing=True, record the start event before launching the kernel, record the end event after launching the kernel, synchronize the stream after recording the end event, and then retrieve the timing from the events using numba.cuda.event_elapsed_time: https://nvidia.github.io/numba-cuda/reference/host.html#numba.cuda.event_elapsed_time
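The event-timing pattern described here would look roughly like the sketch below (assuming a CUDA device and an existing compiled `kernel`; this is an illustration of the suggestion, not code from the PR):

```python
from numba import cuda

stream = cuda.stream()
start = cuda.event(timing=True)
end = cuda.event(timing=True)

start.record(stream)           # record before launching the kernel
kernel[1, 1, stream]()         # the kernel under test
end.record(stream)             # record after launching the kernel
stream.synchronize()           # wait for both events to complete

elapsed_ms = cuda.event_elapsed_time(start, end)
assert elapsed_ms > 0          # no tolerance, just a sanity check
```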
I see that this would be a better design for a test of the Event object in general, but the specific regression we're testing for is that the Event.query() method either raises TypeError or returns a True before the event is synchronized. The timing methods look as if they call different driver functions; I can't find any usages of query or the cuEventQuery driver function anywhere else in the repo. Perhaps I'm patching dead/unused code here?
It's not dead/unused, just used externally to numba-cuda primarily as opposed to internally. We do hope people use cuda.core directly instead of this API shim in the future to eventually deprecate this.
Based on your point about wanting to test event.query specifically, we could synchronize the event before querying (in which case it should always return True). This is nitpicking at this point where I'm fine if we want to do this polling.
```python
# If this assertion fails, it was nanosleep inaccuracy that caused it
assert sync_time * 1000 > spin_ms * 0.9

# If this assertion fails, the event query returned early
assert event_time * 1000 > spin_ms * 0.9
```
I'm not confident in the 0.9 tolerance here. Can we instead just assert that the elapsed time between the events (see above comment) is >0?
Yeah, I take your point - if an absolute time delta was low-confidence before, this is no better. Have you found that any hard time threshold is flaky in this area? The early-return-True path should return in a few scant milliseconds, whereas sync_time should land somewhere in the ballpark of spin_ms (250ms as currently parameterised). I think sync_time > event_time is guaranteed by statement order in the current test design, so a gt/lt comparison won't cut it. Am I misinterpreting your suggestion of event timing?
> Have you found that any hard time threshold is flaky in this area?
Generally yes because we test against different Windows driver modes and they have different scheduling behaviors. Additionally the machines that run CI are subject to noisy neighbors, so in general any kind of tolerance that isn't egregious ends up eventually flaking.
Given we aren't actually trying to test the implementation of the event time tracking and rely on the CUDA driver having a valid implementation for that, I think it's okay for us to just do the basic assertion that the time captured is a non-zero number.
@kkraus14 thanks for the review time. I'm unsure if I'm misinterpreting your event timing suggestion (see comment replies).
Automatic reviews are disabled for this repository.
@kkraus14 thanks again for the notes, all make sense. I've removed the busy wait and time-based checks. There are two bugs I want to test for, both in the numba-cuda implementation rather than the CUDA driver: Event.query() swallowing a TypeError and reporting True prematurely, and Event.query() raising a TypeError. The new test exercises this more transparently: query immediately after kernel invocation, sync, then call query again. If the first query is True, it returned prematurely. If the second is not True, there's an implementation error. If either call raises, the test will fail.
```python
evt.synchronize()
synced_query = evt.query()

assert immediate_query is False, "Query returned True prematurely"
```
I unfortunately suspect this will be flaky as we could end up in a situation where in between launching the kernel and recording the event, or recording the event and querying it, we could have something like Python GC or some kind of OS level interrupt that could easily cause the recording and subsequent querying of the event to take longer than the 200ms spin kernel.
If we really wanted something more deterministic, we could use something like the AI generated kernel:

```python
def test_event_query(self):
    stream = cuda.stream()
    evt = cuda.event()

    # Pinned memory for host-device synchronization
    started = cuda.pinned_array(1, dtype=np.int32)
    release = cuda.pinned_array(1, dtype=np.int32)

    @cuda.jit
    def gated_kernel(started_flag, release_flag):
        # Signal that kernel has started
        started_flag[0] = 1
        # Spin until host releases us
        while release_flag[0] == 0:
            cuda.nanosleep(int32(1_000))

    # Compile first
    started[0] = 0
    release[0] = 1  # Don't block during warmup
    gated_kernel[1, 1, stream](started, release)
    stream.synchronize()

    # Reset for actual test
    started[0] = 0
    release[0] = 0

    # Launch - kernel will spin until we release it
    gated_kernel[1, 1, stream](started, release)
    evt.record(stream)

    # Wait until kernel confirms it's running
    while started[0] == 0:
        pass  # Busy-wait (or use time.sleep with timeout)

    # NOW we have a guarantee:
    # - Kernel is running (started == 1)
    # - Kernel won't finish (release == 0)
    # - Event is recorded after kernel, so event cannot be complete
    immediate_query = evt.query()
    assert immediate_query is False, "Query returned True prematurely"

    # Release the kernel
    release[0] = 1

    # Wait for completion
    evt.synchronize()
    synced_query = evt.query()
    assert synced_query is True, "Query returned False after sync"
```

This uses pinned host memory as a way for us to explicitly control the lifetime of the kernel from Python host code so that we aren't subject to any kind of timing issues.
@kkraus14 Thanks again for the time you've spent on this. I like that approach - in classic AI-generated fashion it hangs indefinitely if used verbatim, as pinned arrays don't manage their own transfer h<->d, but swapping for mapped arrays it works a charm. I've implemented this suggestion with minor comment edits and the array type swap, and confirmed passing on my machine.
The proposed test would fail given a ~200ms hang between kernel launch and the first query() call, which is possible in Windows/CI environments and which would cause the test to fail. Instead, mapped arrays track when the kernel starts and allow the host to release the kernel from an infinite spin. This gives two guarantees - the kernel has started, and the kernel hasn't finished, which are what's needed to verify that query() doesn't return prematurely. A final assertion after synchronisation checks that query() does return True when called after stream sync.
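The swap described above, from pinned to mapped allocations, would look something like this sketch (assumes a CUDA device; variable names mirror the test above):

```python
import numpy as np
from numba import cuda

# Mapped arrays are page-locked host memory that the device can read
# and write directly, so host stores to these flags are visible to
# the spinning kernel mid-execution without an explicit h<->d copy.
started = cuda.mapped_array(1, dtype=np.int32)
release = cuda.mapped_array(1, dtype=np.int32)
```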
Implemented the host-controlled busy loop as suggested by @kkraus14; now Windows could take a 10s holiday mid-test if it was so inclined and the test would still verify that `Event.query()` behaves correctly.
/ok to test
@kkraus14, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

/ok to test 8138622

/ok to test 138411e

/ok to test 1175800
PR Description

This PR extracts `value` from the Event's handle and passes it to the driver function, mirroring surrounding code.

It looks as if the `Event.query` method missed out on the update in 20a2e3b. The method includes a try/except statement to catch CUDA_ERROR_NOT_READY, but with a non-integer argument the method raises a `TypeError`. In my library on 0.23, no TypeErrors were raised for some reason; the call just returned True. On the current main branch, `evt.query()` raises the appropriate TypeError, as shown by the regression test in the PR. Using `handle.value` instead fixes this and follows the pattern from the surrounding code.

MWE
Event.querymethod missed out on the update in 20a2e3b. The method includes a try/except statement to catch CUDA_ERROR_NOT_READY, but with a non-integer argument the method raises aTypeError. In my library on 0.23, no TypeErrors were raised for some reason, the call just returned True. On the current main branch, evt.query() raises the appropriate TypeError, as shown by the regression test in the PR. Using the handle.value instead fixes this and follows the pattern from the surrounding code.MWE