
fix: extract handle value in Event.query() call to match other driver calls#747

Open
ccam80 wants to merge 14 commits into NVIDIA:main from ccam80:event-query-bug

Conversation

@ccam80
Contributor

@ccam80 ccam80 commented Jan 26, 2026

PR Description

This PR extracts value from the Event's handle and passes it to the driver function, mirroring surrounding code.

It looks as if the Event.query method missed out on the update in 20a2e3b. The method includes a try/except statement to catch CUDA_ERROR_NOT_READY, but with a non-integer argument it raises a TypeError. In my library on 0.23, no TypeError was raised for some reason; the call simply returned True. On the current main branch, evt.query() raises the appropriate TypeError, as shown by the regression test in this PR. Passing handle.value instead fixes this and follows the pattern of the surrounding code.
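As the reviews note, the handle is a ctypes pointer object rather than a plain integer, and `.value` unwraps it to the int the driver bindings expect. A minimal, CUDA-free sketch of the distinction (the handle address here is made up for illustration):

```python
import ctypes

# A hypothetical event handle, stored as a ctypes pointer object the way
# the driver wrapper stores Event handles.
handle = ctypes.c_void_p(0x1234)

# The wrapper itself is not an int, so a driver call that type-checks its
# argument raises TypeError when handed the wrapper directly.
assert not isinstance(handle, int)

# .value unwraps the underlying address as a plain Python int, which is
# what record()/synchronize()/wait() already pass to the driver.
assert isinstance(handle.value, int)
assert handle.value == 0x1234
```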

MWE

from numba import cuda, int32
from time import perf_counter

@cuda.jit
def spin(ms):
    # Sleep for ms
    for i in range(ms):
        cuda.nanosleep(int32(1_000_000))  # 1 ms

stream = cuda.stream()
evt = cuda.event()

# Run once to compile
spin[1, 1, stream](1)

t0 = perf_counter()
spin_ms = 250
spin[1, 1, stream](spin_ms)
evt.record(stream)

# Query immediately; keep polling until the event reports done.
event_time = perf_counter() - t0
while not evt.query():
    event_time = perf_counter() - t0

# Synchronize and capture stream-finished time.
evt.synchronize()
sync_time = perf_counter() - t0

print(f"Event time: {event_time}s")
print(f"Sync time:  {sync_time}s")

… calls

It looks as if the `query` method missed out on the update in 20a2e3b. The method includes a try/except statement to catch CUDA_ERROR_NOT_READY, which fell through to an `else` branch that returned True. This `else` swallowed the exception raised by providing a non-integer handle and returned True regardless of the progress of the stream being queried.
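The exact driver-layer code isn't reproduced here, but the reported symptom (a bad handle silently yielding True) can be sketched in plain Python. Everything below is an illustrative stand-in: `CudaAPIError`, the value of `CUDA_ERROR_NOT_READY`, and the stub driver call are hypothetical, not the real numba-cuda internals.

```python
CUDA_ERROR_NOT_READY = 600  # illustrative value, not the real enum


class CudaAPIError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.code = code


def cu_event_query(handle):
    # Stand-in for driver.cuEventQuery: rejects non-integer handles and
    # otherwise reports a still-pending event.
    if not isinstance(handle, int):
        raise TypeError("handle must be an integer")
    raise CudaAPIError(CUDA_ERROR_NOT_READY)


def query_buggy(handle):
    # If the error handling is broad, the TypeError from the bad handle
    # is swallowed and control falls through to `return True`, regardless
    # of the actual progress of the stream.
    try:
        cu_event_query(handle)
    except Exception as e:
        if isinstance(e, CudaAPIError) and e.code == CUDA_ERROR_NOT_READY:
            return False
    return True


def query_fixed(handle_value):
    # Pass the integer value; only NOT_READY means "still running", and
    # any other error (including TypeError) propagates.
    try:
        cu_event_query(handle_value)
    except CudaAPIError as e:
        if e.code == CUDA_ERROR_NOT_READY:
            return False
        raise
    return True
```

With a non-integer handle, `query_buggy` returns True unconditionally, while `query_fixed` surfaces the TypeError; with a valid integer handle, both correctly report False for a pending event.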
Copilot AI review requested due to automatic review settings January 26, 2026 21:16
@copy-pr-bot

copy-pr-bot bot commented Jan 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Jan 26, 2026

Greptile Overview

Greptile Summary

Fixed Event.query() to extract handle.value before passing to cuEventQuery(), matching the pattern used in other Event methods (record(), synchronize(), wait()).

Key Changes:

  • Extracted handle.value in Event.query() to fix TypeError that occurred when passing the handle object directly to the driver function
  • Restructured try/except to move the return statement inside the try block for clearer control flow
  • Added comprehensive regression test that verifies query() works correctly with async GPU operations

Context:
This method was missed in commit df8f261 which refactored Event and Stream classes to decouple them from Context. All other Event methods were updated to extract .value from the handle, but query() was overlooked, causing it to pass the wrong type to the driver function.

Confidence Score: 5/5

  • This PR is safe to merge - it's a straightforward bug fix that makes the code consistent with other methods
  • The fix is minimal, well-tested, and follows the exact pattern used by all other Event methods. The regression test validates the fix works correctly.
  • No files require special attention

Important Files Changed

Filename Overview
numba_cuda/numba/cuda/cudadrv/driver.py Fixed Event.query() to extract handle value before passing to driver function, matching pattern used in other Event methods
numba_cuda/numba/cuda/tests/cudadrv/test_events.py Added regression test for Event.query() that verifies it works correctly with async operations

Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a bug in the Event.query() method where it was incorrectly passing a ctypes pointer object instead of its integer value to the driver function. This caused TypeErrors instead of the intended CUDA_ERROR_NOT_READY exception handling. The fix makes Event.query() consistent with other Event methods like record(), synchronize(), and wait().

Changes:

  • Fixed Event.query() to extract handle.value before passing to driver.cuEventQuery(), matching the pattern used in other Event methods
  • Added regression test test_event_query() to verify the query method works correctly with asynchronous stream operations

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
numba_cuda/numba/cuda/cudadrv/driver.py Fixed Event.query() to pass integer handle value instead of c_void_p object to driver function, and restructured try/except for clarity
numba_cuda/numba/cuda/tests/cudadrv/test_events.py Added regression test that exercises Event.query() in a realistic scenario with delayed stream operations


@gmarkall
Contributor

/ok to test b698a5d

@gmarkall gmarkall added the 3 - Ready for Review Ready for review by team label Jan 27, 2026
assert sync_time * 1000 > spin_ms * 0.9  # nanosleep isn't reliable

# Give a few ms overhead for the synchronize call to complete
assert sync_time - event_time < 2e-3
Contributor Author


Suggested change
assert sync_time - event_time < 2e-3
assert sync_time - event_time < 10e-3

@ccam80
Contributor Author

ccam80 commented Jan 28, 2026

Picking through the test results, it looks as if most are build failures or server timeouts - do these indicate a problem with the patch, or are they commonplace? Two tests failed due to an AssertionError - the 2ms overhead limit for the synchronize call was arbitrary and looks too small. I've added a suggestion to change it to 10ms, which still allows >200ms for the kernel call to return so shouldn't hide the bug.

I don't want to interfere with your test/review process - let me know if the suggested changes (copilot, greptile, me) look good and I can update the PR.

Comment on lines 82 to 83
# Give a few ms overhead for the synchronize call to complete
assert sync_time - event_time < 2e-3
Contributor


In general, we've had a very poor experience with trying to find healthy timing tolerances for things like this. What I would suggest is that we ensure the time is a positive number, but beyond that I don't think we should worry about actually trying to capture the timing in a test.

Contributor Author


That makes sense - on a second look, this assertion is redundant anyway. If the event query time is >0.9 of the requested sleep time (250ms) then we can be pretty confident that the query didn't fall straight through and this extra test is pointless nitpicking. I'll reorder and clarify the two other assertions.

@cpcloud
Contributor

cpcloud commented Jan 28, 2026

I'm going to apply some of the AI commentary to get rid of all the noise.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 1 comment



@ccam80
Contributor Author

ccam80 commented Jan 29, 2026

@cpcloud Thanks for the de-noising. I've disabled auto-review for future PRs; greptile seems very capable of finding real errors, and I'll do a typo pass before committing.

@kkraus14 Thanks for the guidance on the absolute time epsilon. The assertion wasn't really doing anything, so I've removed it and reorganised the other two to better indicate the source of failure if the test doesn't pass.

Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


Comment on lines 72 to 74
event_time = perf_counter() - t0
while not evt.query():
event_time = perf_counter() - t0
Contributor


Instead of busy looping the CPU here we can create two events with timing=True, record the start event before launching the kernel, record the end event after launching the kernel, synchronize the stream after recording the end event, and then retrieve the timing from the events using numba.cuda.event_elapsed_time: https://nvidia.github.io/numba-cuda/reference/host.html#numba.cuda.event_elapsed_time

Contributor Author


I see that this would be a better design for a test of the Event object in general, but the specific regression we're testing for is that the Event.query() method either raises TypeError or returns True before the event is synchronized. The timing methods look as if they call different driver functions; I can't find any usages of query or the cuEventQuery driver function anywhere else in the repo. Perhaps I'm patching dead/unused code here?

Contributor


It's not dead/unused, just used externally to numba-cuda primarily as opposed to internally. We do hope people use cuda.core directly instead of this API shim in the future to eventually deprecate this.

Based on your point about wanting to test event.query specifically, we could synchronize the event before querying (in which case it should always return True). This is nitpicking at this point; I'm fine if we want to do this polling.

Comment on lines 80 to 84
# If this assertion fails, it was nanosleep inaccuracy that caused it
assert sync_time * 1000 > spin_ms * 0.9

# If this assertion fails, the event query returned early
assert event_time * 1000 > spin_ms * 0.9
Contributor


I'm not confident in the 0.9 tolerance here; can we just assert that the elapsed time between the events (see above comment) is >0?

Contributor Author


Yeah, I take your point - if an absolute time delta was low-confidence before, this is no better. Have you found that any hard time threshold is flaky in this area? The early-return-True path should return within a few milliseconds, whereas sync_time should land somewhere in the ballpark of spin_ms (250ms as currently parameterised). I think sync_time > event_time is guaranteed by statement order in the current test design, so a gt/lt comparison won't cut it. Am I misinterpreting your suggestion of event timing?

Contributor


Have you found that any hard time threshold is flaky in this area?

Generally yes because we test against different Windows driver modes and they have different scheduling behaviors. Additionally the machines that run CI are subject to noisy neighbors, so in general any kind of tolerance that isn't egregious ends up eventually flaking.

Given we aren't actually trying to test the implementation of the event time tracking and rely on the CUDA driver having a valid implementation for that, I think it's okay for us to just do the basic assertion that the time captured is a non-zero number.

@ccam80
Contributor Author

ccam80 commented Feb 6, 2026

@kkraus14 thanks for the review time. I'm unsure if I'm misinterpreting your event timing suggestion (see comment replies) - as Event.query() is a polling method, I can't see how to structure the test without a polling loop of some kind. If this comes after the synchronize() call we'll miss the query method returning a True prematurely. If I'm just lacking in imagination here let me know - I've gone back and forth on this but can't see a path without a busy-wait!

@greptile-apps
Contributor

greptile-apps bot commented Feb 6, 2026

Automatic reviews are disabled for this repository.

@ccam80
Contributor Author

ccam80 commented Feb 9, 2026

@kkraus14 thanks again for the notes, all make sense. I've removed the busy wait and time-based checks. There are two bugs I want to test for, both in Numba-cuda implementation rather than the CUDA driver - Event.query() swallowing a TypeError and reporting True prematurely, and Event.query() raising a TypeError correctly. This tests it more transparently - query immediately after kernel invocation, sync, call query again. If the first query is True, it returned prematurely. If the second is not True, there's an implementation error. If it raises, the test will fail.

evt.synchronize()
synced_query = evt.query()

assert immediate_query is False, "Query returned True prematurely"
Contributor

@kkraus14 kkraus14 Feb 10, 2026


I unfortunately suspect this will be flaky as we could end up in a situation where in between launching the kernel and recording the event, or recording the event and querying it, we could have something like Python GC or some kind of OS level interrupt that could easily cause the recording and subsequent querying of the event to take longer than the 200ms spin kernel.

If we really wanted something more deterministic, we could use something like the AI generated kernel:

def test_event_query(self):
    stream = cuda.stream()
    evt = cuda.event()

    # Pinned memory for host-device synchronization
    started = cuda.pinned_array(1, dtype=np.int32)
    release = cuda.pinned_array(1, dtype=np.int32)

    @cuda.jit
    def gated_kernel(started_flag, release_flag):
        # Signal that kernel has started
        started_flag[0] = 1
        # Spin until host releases us
        while release_flag[0] == 0:
            cuda.nanosleep(int32(1_000))

    # Compile first
    started[0] = 0
    release[0] = 1  # Don't block during warmup
    gated_kernel[1, 1, stream](started, release)
    stream.synchronize()

    # Reset for actual test
    started[0] = 0
    release[0] = 0

    # Launch - kernel will spin until we release it
    gated_kernel[1, 1, stream](started, release)
    evt.record(stream)

    # Wait until kernel confirms it's running
    while started[0] == 0:
        pass  # Busy-wait (or use time.sleep with timeout)

    # NOW we have a guarantee:
    # - Kernel is running (started == 1)
    # - Kernel won't finish (release == 0)
    # - Event is recorded after kernel, so event cannot be complete
    immediate_query = evt.query()
    assert immediate_query is False, "Query returned True prematurely"

    # Release the kernel
    release[0] = 1

    # Wait for completion
    evt.synchronize()
    synced_query = evt.query()
    assert synced_query is True, "Query returned False after sync"

This uses pinned host memory as a way for us to explicitly control the lifetime of the kernel from Python host code so that we aren't subject to any kind of timing issues.

Contributor Author


@kkraus14 Thanks again for the time you've spent on this. I like that approach - in classic AI-generated fashion it hangs indefinitely if used verbatim, as pinned arrays don't manage their own host<->device transfers, but after swapping in mapped arrays it works a charm. I've implemented this suggestion with minor comment edits and the array-type swap, and confirmed it passes on my machine.

ccam80 and others added 2 commits February 13, 2026 15:50
The proposed test would fail given a ~200ms hang between kernel launch and the first query() call, which is possible in Windows/CI environments and which would cause the test to fail. Instead, mapped arrays track when the kernel starts and allow the host to release the kernel from an infinite spin. This gives two guarantees - the kernel has started, and the kernel hasn't finished, which are what's needed to verify that query() doesn't return prematurely. A final assertion after synchronisation checks that query() does return True when called after stream sync.
@ccam80
Contributor Author

ccam80 commented Feb 13, 2026

Implemented the host-controlled busy loop as suggested by @kkraus14, now Windows could take a 10s holiday mid-test if it was so inclined and the test would still verify that Event.query() is not returning True prematurely.

@kkraus14
Contributor

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Feb 13, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Contributor

/ok to test 8138622

@kkraus14 kkraus14 enabled auto-merge (squash) February 25, 2026 17:46
@kkraus14
Contributor

/ok to test 138411e

@kkraus14
Contributor

/ok to test 1175800
