**Is your feature request related to a problem? Please describe.**
I'm using numba-cuda in a parallel ODE integration package to compile very large kernels with multiple nested device functions compiled with `inline='always'`. For large problems, compile time is on the order of 2 hours.
Running the kernel in a `with cuda.core.event.install_recorder("numba-cuda:run-pass") as rec:` context highlights where the compiler is spending all of its time. Each line shows the duration of one pass as "pass name [function qualname]-[pass index]: duration":
```
inline_inlinables [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1706]: 2623.606885 s
inline_inlinables [IVPLoop.build.<locals>.loop_fn]-[1700]: 1698.613187 s
inline_inlinables [DIRKStep.build_step.<locals>.step]-[1326]: 819.726879 s
reconstruct_ssa [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1710]: 308.169029 s
inline_inlinables [NewtonKrylov.build.<locals>.newton_krylov_solver]-[1192]: 258.628006 s
inline_inlinables [NewtonKrylov.build.<locals>.newton_krylov_solver]-[712]: 248.971889 s
inline_inlinables [LinearSolver.build.<locals>.linear_solver]-[994]: 166.666156 s
inline_inlinables [LinearSolver.build.<locals>.linear_solver]-[514]: 161.275011 s
nopython_rewrites [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1742]: 109.881090 s
ir_legalization [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1743]: 55.800535 s
strip_phis [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1740]: 55.563423 s
cuda_native_lowering [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1850]: 35.872522 s
nopython_type_inference [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1739]: 29.156163 s
inline_closure_likes [neumann_preconditioner.<locals>.preconditioner]-[422]: 1.500233 s
```
Inlining being a major choke point isn't a surprise, as all of the device functions are closures with many factory-scope variables which end up compiled into the code. However, I think there's room for some performance improvements in the `inline_closurecall.InlineWorker` class with very few edits. This is probably in the micro-optimisations category for most users who aren't (ab)using the inline option this way, but it significantly improves compile time in my case.
I'm not familiar with writing compiler passes, or with what is mutable and what is persistent in Numba IR, but I've had an extended play at vandalising the files in `numba.core` to probe what's happening. The investigation/suggestions below are a "best guess". I'm happy to submit a PR, but would appreciate the eyes of someone who knows this system to let me know:
- What is/isn't safe or permissible
- What changes could be better implemented elsewhere in the code
- Whether changes in this area are worth making
**Describe the solution you'd like**
Line-profiling the `inline_function` and `inline_ir` methods shows a hotspot in the `copy_ir` function; unsurprisingly, `deepcopy` has a hard time safely copying the bloated scope of the callee's blocks.
Source
```python
# Always copy the callee IR, it gets mutated
def copy_ir(the_ir):
    kernel_copy = the_ir.copy()
    kernel_copy.blocks = {}
    for block_label, block in the_ir.blocks.items():
        new_block = copy.deepcopy(the_ir.blocks[block_label])
        kernel_copy.blocks[block_label] = new_block
    return kernel_copy

callee_ir = copy_ir(callee_ir)

# check that the contents of the callee IR is something that can be
# inlined if a validator is present
if self.validator is not None:
    self.validator(callee_ir)

# save an unmutated copy of the callee_ir to return
callee_ir_original = copy_ir(callee_ir)
```
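To see why the per-block deepcopy is so expensive, here's a toy model (plain dicts as stand-ins, not Numba IR classes): every block references one large shared scope, and because each `copy.deepcopy` call gets its own fresh memo table, that scope is re-copied once per block. A single deepcopy of the whole block dict copies the scope exactly once and preserves the sharing.

```python
import copy
import time

# Toy stand-ins for Numba IR: each "block" holds a reference to one big
# shared "scope", mirroring how all blocks in a FunctionIR alias one Scope.
big_scope = {f"var{i}": list(range(20)) for i in range(500)}
blocks = {label: {"scope": big_scope, "body": list(range(10))}
          for label in range(10)}

# Per-block deepcopy (as in copy_ir): the shared scope is re-copied once
# per block because each deepcopy call starts with an empty memo table.
t0 = time.perf_counter()
per_block = {label: copy.deepcopy(block) for label, block in blocks.items()}
t_per_block = time.perf_counter() - t0

# One deepcopy of the whole dict: the memo table ensures the scope is
# copied exactly once and stays shared among the copied blocks.
t0 = time.perf_counter()
whole = copy.deepcopy(blocks)
t_whole = time.perf_counter() - t0

print(f"per-block: {t_per_block:.4f}s, whole-dict: {t_whole:.4f}s")

# The per-block variant also silently breaks scope sharing between blocks:
assert per_block[0]["scope"] is not per_block[1]["scope"]
assert whole[0]["scope"] is whole[1]["scope"]
```

This suggests the per-block loop isn't just slow; it produces N independent scope copies where the original IR had one shared scope.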
In v0.23.0, using a smaller example (3-4 minutes compile time), here's how much time we're spending in the core inline functions:
```
193.23 seconds - InlineWorker.inline_ir.<locals>.copy_ir
195.02 seconds - InlineWorker.inline_ir
202.19 seconds - InlineWorker.inline_function
```
I propose two modifications.
1. Don't copy twice; save the input `callee_ir` to `callee_ir_original` at function entry
The IR is copied twice - once as a pristine copy to return, once as a mutable copy for inlining. Assigning the input argument to `callee_ir_original` before overwriting the `callee_ir` argument with a copied IR cuts copy time in half.
```python
def inline_ir(
    self, caller_ir, block, i, callee_ir, callee_freevars, arg_typs=None
):
    # save an unmutated copy of the callee_ir to return
    callee_ir_original = callee_ir

    # Always copy the callee IR, it gets mutated
    def copy_ir(the_ir):
        kernel_copy = the_ir.copy()
        kernel_copy.blocks = {}
        for block_label, block in the_ir.blocks.items():
            new_block = copy.deepcopy(the_ir.blocks[block_label])
            kernel_copy.blocks[block_label] = new_block
        return kernel_copy

    callee_ir = copy_ir(callee_ir)
```
This provides about the expected improvement:
```
93.69 seconds - InlineWorker.inline_ir.<locals>.copy_ir
95.34 seconds - InlineWorker.inline_ir
102.36 seconds - InlineWorker.inline_function
```
2. Implement a selective shallow-copy to only copy down to the level modified in the inline run
To my (untrained) eye, it looks as if the block bodies are only mutated down to the statement level in this pass. Implementing something like the below saves a full deepcopy:
```python
def _selective_copy_blocks(blocks):
    new_blocks = {}
    for label, block in blocks.items():
        new_block = ir.Block(block.scope, block.loc)
        new_body = block.body[:]
        for idx, stmt in enumerate(new_body):
            new_body[idx] = copy.copy(stmt)
        new_block.body = new_body
        new_blocks[label] = new_block
    return new_blocks
```
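As a sanity check on what this one-level copy does and doesn't protect, here's a toy (a stand-in `Stmt` class, not Numba's IR statements): the copied body list and the copied statement objects are independent of the originals, but `copy.copy` is shallow, so any mutable object *inside* a statement stays shared - exactly the kind of aliasing that could bite if the source IR is reused.

```python
import copy

class Stmt:
    """Stand-in for an IR statement: a target name and a value payload."""
    def __init__(self, target, value):
        self.target = target
        self.value = value

orig_body = [Stmt("x", 1), Stmt("y", 2)]

# One-level copy as in _selective_copy_blocks: new list, new stmt objects.
new_body = [copy.copy(s) for s in orig_body]

new_body[0].target = "x.1"      # renaming a copied stmt: original untouched
assert orig_body[0].target == "x"

new_body.append(Stmt("z", 3))   # growing the copied body: original untouched
assert len(orig_body) == 2

# But copy.copy is shallow: attribute *objects* are still shared, so
# in-place mutation of a shared attribute leaks back to the original.
shared = {"type": "int32"}
a = Stmt("w", shared)
b = copy.copy(a)
b.value["type"] = "float64"
assert a.value["type"] == "float64"   # aliasing hazard
```

So the selective copy is safe as long as the pass only rebinds statement attributes and rewrites the body list, never mutating shared sub-objects in place - which appears to hold for this pass, but is the assumption to verify.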
Then we can call it in the `copy_ir` function instead of `deepcopy`:
```python
def copy_ir(the_ir):
    kernel_copy = the_ir.copy()
    kernel_copy.blocks = _selective_copy_blocks(the_ir.blocks)
    return kernel_copy
```
This really does some damage to compile time:
```
0.28 seconds - InlineWorker.inline_ir.<locals>.copy_ir
1.96 seconds - InlineWorker.inline_ir
8.56 seconds - InlineWorker.inline_function
```
**Describe alternatives you've considered**
3. Cache untyped_passes output in inline_function
I inline the same callee into multiple functions at multiple points; many of the mutations to the inlined function are not tainted by the caller, and to my mind should be repeatable for the same callee. After the changes above, a line-by-line profile of `inline_function` shows that ~75% of the time is spent running the full untyped passes on the callee:
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   500       101    6547312.3  64824.9     76.5      callee_ir = self.run_untyped_passes(funct…
   501       101        165.5      1.6      0.0      freevars = function.__code__.co_freevars
   502       202    2014428.2   9972.4     23.5      return self.inline_ir(
   503       101         59.7      0.6      0.0          caller_ir, block, i, callee_ir, freev…
   504                                                )
```
I attempted to cache the resultant output:
```python
def inline_function(self, caller_ir, block, i, function, arg_typs=None):
    """Inlines the function in the caller_ir at statement index i of block
    `block`. If `arg_typs` is given and the InlineWorker instance was
    initialized with a typemap and calltypes then they will be appropriately
    updated based on the arg_typs.
    """
    cache_key = function.__code__
    callee_ir = self._untyped_pass_ir_cache.get(cache_key)
    if callee_ir is None:
        callee_ir = self.run_untyped_passes(function)
        self._untyped_pass_ir_cache[cache_key] = callee_ir
    freevars = function.__code__.co_freevars
    return self.inline_ir(
        caller_ir, block, i, callee_ir, freevars, arg_typs=arg_typs
    )
```
but compile time exploded. The function was spending a lot of time renaming variables - I presumed because `block.scope.localvars` was being repeatedly filled with a new round of renamed variables on each reuse of the cached IR. I shallow-copied the fields of the scope that I thought were modified:
```python
def _selective_copy_blocks(blocks):
    new_blocks = {}
    for label, block in blocks.items():
        new_block = ir.Block(block.scope, block.loc)
        new_body = block.body[:]
        for idx, stmt in enumerate(new_body):
            new_body[idx] = copy.copy(stmt)
        new_block.body = new_body
        new_scope = copy.copy(block.scope)
        new_scope.redefined = defaultdict(int, **block.scope.redefined)
        new_scope.var_redefinitions = defaultdict(
            set, **block.scope.var_redefinitions
        )
        new_scope.localvars = copy.copy(block.scope.localvars)
        new_con = dict(block.scope.localvars._con)
        new_scope.localvars._con = new_con
        new_block.scope = new_scope
        new_blocks[label] = new_block
    return new_blocks
```
This unwieldy function cut compile time again, but led to a typing error:
```
No implementation of function Function(<built-in function or_>) found for signature:

>>> or_(float64, int32)

There are 8 candidate implementations:
  - Of which 4 did not match due to:
    Overload of function 'or_': File: <numerous>: Line N/A.
      With argument(s): '(float64, int32)':
       No match.
  - Of which 4 did not match due to:
    Operator Overload in function 'or_': File: unknown: Line unknown.
      With argument(s): '(float64, int32)':
       No match for registered cases:
        * (bool, bool) -> bool
        * (int64, int64) -> int64
        * (int64, uint64) -> int64
        * (uint64, int64) -> int64
        * (uint64, uint64) -> uint64

During: typing of intrinsic-call at C:\local_working_projects\cubie\src\cubie\integrators\loops\ode_loop.py (615)

File "src\cubie\integrators\loops\ode_loop.py", line 615:
    def loop_fn(
        <source elided>
            niters = proposed_counters[0]
            status = int32(status | step_status)
            ^
```
`status` is declared as `status = int32(0)`, and the only assignments to it are of the form `status = int32(status | int32(other_status))`. I've gone pretty wild on casting across the package, as the odd uncast constant or upcast in a logical operation can propagate its way into the float32 math pathway and bring things crashing to a halt.
Some element of the cached IR is being mutated at a level above my understanding. Going deeper in the selective copy function eats away at the gains made by avoiding the deepcopy; the most performant version I can get to avoids deepcopy and accepts that, without a deepcopy, we can't cache the callee_ir. I include this option in the issue in case there's a minimal edit, obvious to more experienced eyes, that could capture some benefit from caching.
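One hedged middle ground I haven't measured on the real workload (a sketch with stand-in names, not the real `InlineWorker` API): cache the untyped-passes output keyed on `function.__code__`, but deep-copy on *read* rather than handing out the cached object. Repeated inlines of the same callee then skip re-running the passes, while each call site still gets an unshared IR to mutate. Whether the deepcopy-on-read cost eats the savings is exactly what would need profiling.

```python
import copy

class InlineWorkerSketch:
    """Toy illustration (not the real InlineWorker): cache the expensive
    pass output per code object, and hand out a deep copy per use so the
    cached IR is never mutated by callers."""

    def __init__(self, run_untyped_passes):
        self._run_untyped_passes = run_untyped_passes
        self._cache = {}

    def callee_ir_for(self, function):
        key = function.__code__
        cached = self._cache.get(key)
        if cached is None:
            cached = self._run_untyped_passes(function)
            self._cache[key] = cached
        # deepcopy on read: repeated inlines skip the passes, but each
        # caller still gets an independent, freely mutable IR
        return copy.deepcopy(cached)

calls = []
def fake_passes(fn):
    """Stand-in for run_untyped_passes; records how often it runs."""
    calls.append(fn.__name__)
    return {"blocks": {0: ["stmt"]}}

worker = InlineWorkerSketch(fake_passes)
def callee():
    pass

ir1 = worker.callee_ir_for(callee)
ir2 = worker.callee_ir_for(callee)
assert calls == ["callee"]           # passes ran once despite two inlines
ir1["blocks"][0].append("mutated")
assert ir2["blocks"][0] == ["stmt"]  # copies are independent of each other
```

This trades one deepcopy per inline for one run of the untyped passes per callee, which looks favourable given the 76.5% figure above, but only if the deepcopy is against the pristine post-passes IR rather than a scope already inflated by earlier renames.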
**Additional context**
I am using `inline='always'` for performance benefits: I have read the Notes on Inlining page; however, I measure a reasonable performance improvement in compiled code (~40%, from memory). I presume that's because the lower-level CUDA compiler has an easier time identifying where it can keep values in registers instead of storing/loading when there are fewer nested call/returns in its input PTX.
I had a hard time generating a meaningful MWE without forcing you to use my package, which isn't particularly usable yet. Hopefully the measurements and snippets provided illustrate the issue/suggestion. Let me know if I can make this easier to interpret/get started on.