Related issue: NVIDIA/cccl#2538
Examples supplied by @gevtushenko, along with notes from a discussion on these items:
op closes over its argument:
def op(d_input):
    def closure(index):
        print(d_input[index])
    return closure

cudax.bulk(3, op(cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)))
Another example, using a lambda that captures d_input through a default argument:
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
cudax.bulk(3, lambda i, d_input=d_input: print(d_input[i]))
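The two forms carry d_input differently at the Python level, which matters for any implementation that has to locate the captured values. The following is a minimal sketch in plain Python, not a description of how Numba does or will handle this: a list stands in for the CuPy device array, and captured_values is a hypothetical helper that is not part of Numba or cudax.

```python
def op(d_input):
    def closure(index):
        return d_input[index]
    return closure


def captured_values(fn):
    """Return the values a function carries with it (hypothetical helper)."""
    captured = {}
    # Free variables of a nested function are stored in closure cells,
    # in the same order as the names in co_freevars.
    cells = fn.__closure__ or ()
    for name, cell in zip(fn.__code__.co_freevars, cells):
        captured[name] = cell.cell_contents
    # Default-argument captures (the lambda form) are stored in
    # __defaults__, aligned with the last positional parameters.
    defaults = fn.__defaults__ or ()
    if defaults:
        code = fn.__code__
        arg_names = code.co_varnames[:code.co_argcount]
        for name, value in zip(arg_names[-len(defaults):], defaults):
            captured[name] = value
    return captured


data = [8, 6, 7, 5, 3, 0, 9]  # stand-in for the CuPy device array
closure_form = op(data)
lambda_form = lambda i, d_input=data: d_input[i]

print(captured_values(closure_form))  # {'d_input': [8, 6, 7, 5, 3, 0, 9]}
print(captured_values(lambda_form))   # {'d_input': [8, 6, 7, 5, 3, 0, 9]}
```

In both cases the captured object is the same list, so a solution would need to handle closure cells and default arguments to cover both examples.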
cudax.bulk invokes op for each index in range(3) (3 being its first argument), from within a kernel inside bulk().
This is essentially:
@cuda.jit()
def kernel(shape, op):
    index = cuda.index()
    op(index)

def bulk(shape, op):
    kernel[...](shape, op)
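To make the intended semantics concrete without a GPU, here is a minimal CPU-only sketch. It rests on assumptions: bulk_cpu is a hypothetical stand-in for cudax.bulk that visits the indices sequentially in a Python loop rather than in parallel threads, and plain lists stand in for device arrays.

```python
def bulk_cpu(shape, op):
    """Call op(index) for every index in range(shape) (sequential stand-in)."""
    for index in range(shape):
        op(index)


def make_op(d_input, seen):
    # The closure captures both the input and a list used to record results.
    def closure(index):
        seen.append(d_input[index])
    return closure


seen = []
bulk_cpu(3, make_op([8, 6, 7, 5, 3, 0, 9], seen))
print(seen)  # [8, 6, 7]
```

The operator receives only the index; everything else it needs arrives through its captured state, which is the property the request asks Numba to support on the device.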
Ideally op would be able to capture anything recognised by Numba, e.g. integers, floats, user-defined types, and pointers.
Most use cases will not require an update of the captured state, except for BlockPrefixSumOp.
However, elements of a captured array should be modifiable, even though the pointer to the array and its size need not be. So this should be possible:
def op(d_input):
    def closure(index):
        d_input[index] = 42
    return closure

cudax.bulk(3, op(cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)))
but this is not required:
def op(d_input):
    def closure(index):
        d_input.resize(new_size)
    return closure

cudax.bulk(3, op(cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)))
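The difference between writing elements of a captured array and updating the captured state itself can be sketched on the CPU. Everything here is an assumption for illustration: bulk_cpu is a hypothetical sequential stand-in for cudax.bulk, a list stands in for the device array, and RunningPrefix only assumes that BlockPrefixSumOp follows the running-total callback pattern used with CUB block scans.

```python
def bulk_cpu(shape, op):
    """Call op(index) for every index in range(shape) (sequential stand-in)."""
    for index in range(shape):
        op(index)


def fill(d_input):
    # Writes go through the captured reference, so the caller sees them.
    def closure(index):
        d_input[index] = 42
    return closure


class RunningPrefix:
    """Stateful operator: returns the prefix so far, then updates it."""

    def __init__(self, running_total=0):
        self.running_total = running_total

    def __call__(self, block_aggregate):
        old_prefix = self.running_total
        self.running_total += block_aggregate  # captured state is updated
        return old_prefix


data = [8, 6, 7, 5, 3, 0, 9]
bulk_cpu(3, fill(data))
print(data)  # [42, 42, 42, 5, 3, 0, 9]

prefix_op = RunningPrefix()
prefixes = [prefix_op(aggregate) for aggregate in (10, 20, 30)]
print(prefixes, prefix_op.running_total)  # [0, 10, 30] 60
```

The first operator only needs a pointer to existing storage, while the second needs its own state to persist between calls, which is why it is the harder case.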
Next steps
A general solution that captures arbitrary values into a stateful operator, which can then be passed to a kernel as an argument, initially appears to be a significant amount of work if it is to meet the broad description of the request. Exactly how much work would require some research to determine.
We recommend first examining the cuda.parallel implementation and its requirements to determine the minimal feature set that satisfies its specific use case.