6 changes: 3 additions & 3 deletions numba_cuda/numba/cuda/cudadrv/driver.py
@@ -2076,14 +2076,14 @@ def query(self):
     otherwise, returns False.
     """
     try:
-        driver.cuEventQuery(self.handle)
+        handle = self.handle.value
+        driver.cuEventQuery(handle)
+        return True
     except CudaAPIError as e:
         if e.code == enums.CUDA_ERROR_NOT_READY:
             return False
         else:
             raise
-    else:
-        return True

 def record(self, stream=0):
     """
37 changes: 36 additions & 1 deletion numba_cuda/numba/cuda/tests/cudadrv/test_events.py
@@ -2,7 +2,7 @@
# SPDX-License-Identifier: BSD-2-Clause

 import numpy as np
-from numba import cuda
+from numba import cuda, int32
 from numba.cuda.testing import unittest, CUDATestCase
 from numba.cuda._compat import Device
 from numba.cuda.testing import skip_on_cudasim
@@ -48,6 +48,41 @@ def event_elapsed_inner(self, stream):
# Exercise the code path
evtstart.elapsed_time(evtend)

def test_event_query(self):
from time import perf_counter

@cuda.jit
def spin(ms):
# Sleep for ms
for i in range(ms):
cuda.nanosleep(int32(1_000_000)) # 1 ms

stream = cuda.stream()
evt = cuda.event()

# Run once to compile
spin[1, 1, stream](1)

t0 = perf_counter()
spin_ms = 250
spin[1, 1, stream](spin_ms)
evt.record(stream)

# Query immediately.
event_time = perf_counter() - t0
while not evt.query():
event_time = perf_counter() - t0
Contributor:
Instead of busy looping the CPU here we can create two events with timing=True, record the start event before launching the kernel, record the end event after launching the kernel, synchronize the stream after recording the end event, and then retrieve the timing from the events using numba.cuda.event_elapsed_time: https://nvidia.github.io/numba-cuda/reference/host.html#numba.cuda.event_elapsed_time
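A hedged sketch of that suggestion, assuming the numba-cuda host API linked above (`cuda.event(timing=True)`, `Stream.synchronize`, `cuda.event_elapsed_time`); `timed_launch` is a hypothetical helper, not part of this PR, and the import is guarded so the sketch is inert without a CUDA device:

```python
# Sketch only: time a kernel with two timing-enabled CUDA events
# instead of busy-polling event.query() on the CPU.
try:
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except Exception:
    HAVE_CUDA = False


def timed_launch(kernel, args, stream):
    """Launch `kernel` on `stream`; return its elapsed time in ms."""
    start = cuda.event(timing=True)
    end = cuda.event(timing=True)
    start.record(stream)          # record before the launch
    kernel[1, 1, stream](*args)
    end.record(stream)            # record after the launch
    stream.synchronize()          # wait for the end event to complete
    return cuda.event_elapsed_time(start, end)
```

The test's assertion would then reduce to checking that the returned elapsed time is positive, with no wall-clock tolerance involved.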

Contributor Author:
I see that this would be a better design for a test of the Event object in general, but the specific regression we're testing for is that the Event.query() method either raises TypeError or returns True before the event is synchronized. The timing methods look as if they call different driver functions; I can't find any usages of query or the cuEventQuery driver function anywhere else in the repo. Perhaps I'm patching dead/unused code here?

Contributor:

It's not dead/unused, just used externally to numba-cuda primarily as opposed to internally. We do hope people use cuda.core directly instead of this API shim in the future to eventually deprecate this.

Based on your point about wanting to test event.query specifically, we could synchronize the event before querying (in which case it should always return True). This is nitpicking at this point; I'm fine if we want to keep this polling.


# Synchronize and capture stream-finished time.
evt.synchronize()
sync_time = perf_counter() - t0

# If this assertion fails, it was nanosleep inaccuracy that caused it
assert sync_time * 1000 > spin_ms * 0.9

# If this assertion fails, the event query returned early
assert event_time * 1000 > spin_ms * 0.9
Contributor:
I'm not confident in the 0.9 tolerance here; can we just assert that the elapsed time between the events (see above comment) is > 0?

Contributor Author:

Yeah, I take your point - if an absolute time delta was low-confidence before, this is no better. Have you found that any hard time threshold is flaky in this area? The early-return-True path should return within a few milliseconds, whereas sync_time should land somewhere in the ballpark of spin_ms (250ms as currently parameterised). I think sync_time > event_time is guaranteed by statement order in the current test design, so a gt/lt comparison won't cut it. Am I misinterpreting your suggestion of event timing?

Contributor:
> Have you found that any hard time threshold is flaky in this area?

Generally yes because we test against different Windows driver modes and they have different scheduling behaviors. Additionally the machines that run CI are subject to noisy neighbors, so in general any kind of tolerance that isn't egregious ends up eventually flaking.

Given we aren't actually trying to test the implementation of the event time tracking and rely on the CUDA driver having a valid implementation for that, I think it's okay for us to just do the basic assertion that the time captured is a non-zero number.
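The non-zero assertion being proposed can be prototyped host-side; this is a pure-Python sketch in which `perf_counter` stands in for the CUDA event timers and `measure_ms` is a hypothetical helper, not code from the PR:

```python
from time import perf_counter, sleep


def measure_ms(fn):
    # Host-side stand-in for cuda.event_elapsed_time(start, end):
    # time a callable and return the elapsed milliseconds.
    t0 = perf_counter()
    fn()
    return (perf_counter() - t0) * 1000.0


elapsed_ms = measure_ms(lambda: sleep(0.01))
# Robust against noisy CI and varying driver scheduling: require only
# that some time elapsed, rather than a 0.9 tolerance of the requested
# spin duration.
assert elapsed_ms > 0
```

The design trade-off the reviewer describes: this no longer verifies that the measured duration matches the requested spin, but it cannot flake on loaded CI machines, and the duration accuracy is the CUDA driver's responsibility rather than this test's.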



if __name__ == "__main__":
unittest.main()