Presently PTX can be inlined, but it requires writing NVVM IR with llvmlite inside a Numba extension (either with typing and a lowering function that generates an ir.InlineAsm instruction, or an overload and an intrinsic with the necessary code generation).
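For reference, the current route might look roughly like the following. This is a minimal sketch using llvmlite's `ir.InlineAsm`; the module and function scaffolding here is illustrative only — in a real Numba extension this IR would be emitted inside a lowering function or an `@intrinsic` codegen callback rather than a standalone module.

```python
# Sketch only: build an inline-asm call with llvmlite's IR builder.
# In practice this IR is generated inside Numba's lowering machinery.
from llvmlite import ir

mod = ir.Module(name="inline_ptx_sketch")
fnty = ir.FunctionType(ir.FloatType(), [ir.FloatType()])
fn = ir.Function(mod, fnty, name="use_tanh_approx")
builder = ir.IRBuilder(fn.append_basic_block())

# The InlineAsm value carries the PTX snippet and its constraint string,
# mirroring what a higher-level inline_ptx() call would accept.
asm = ir.InlineAsm(fnty, "tanh.approx.f32 $0, $1;", "=f,f")
result = builder.call(asm, [fn.args[0]])
builder.ret(result)

print(mod)
```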
It would be nicer to be able to write inline PTX more simply. There are two possible ways this could be done:
- Through Pythonic intrinsics for PTX instructions, if these were to exist.
- Through a simpler API that can be used directly in a kernel.
For the latter option, the usage could look like:
```python
@cuda.jit
def f(r, x):
    arg = x[0]
    result = inline_ptx("tanh.approx.f32 $0, $1;", "=f,f", (arg,))
    r[0] = result
```
This mimics the CUDA C++ API for inline PTX, where the assembly snippet, constraints, and arguments all need to be provided.
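To illustrate how the constraint string pairs with the argument tuple: `=`-prefixed constraints describe outputs (`$0` here), and the remaining constraints describe inputs (`$1`, ...), which must match the arguments one-for-one. The helper below is purely illustrative — its name and behavior are hypothetical, not part of any proposed implementation.

```python
# Illustrative only: split a CUDA-C++-style constraint string into output
# and input constraints, and pair the inputs with the argument tuple.
def split_constraints(constraints, args):
    parts = constraints.split(",")
    outputs = [c for c in parts if c.startswith("=")]
    inputs = [c for c in parts if not c.startswith("=")]
    if len(inputs) != len(args):
        raise ValueError("argument count does not match input constraints")
    return outputs, list(zip(inputs, args))

# For the tanh example: one f32 output ($0), one f32 input ($1).
outs, ins = split_constraints("=f,f", (1.0,))
print(outs, ins)  # ['=f'] [('f', 1.0)]
```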
cc @leofang @oleksandr-pavlyk @benhg