
[FEA] Reduce overhead in inline_closurecall.InlineWorker #688

@ccam80

Description


Is your feature request related to a problem? Please describe.

I'm using numba-cuda in a parallel ODE integrating package to compile very large kernels with multiple nested device functions compiled with inline='always'. When working with large problems, compile time is on the order of 2 hours.

Running the kernel inside a with cuda.core.event.install_recorder("numba-cuda:run-pass") as rec: context highlights where the compiler spends its time. Each line below shows the duration of one pass as "pass name [function qualname]-[pass index]: duration".

inline_inlinables [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1706]: 2623.606885 s
inline_inlinables [IVPLoop.build.<locals>.loop_fn]-[1700]: 1698.613187 s
inline_inlinables [DIRKStep.build_step.<locals>.step]-[1326]: 819.726879 s
reconstruct_ssa [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1710]: 308.169029 s
inline_inlinables [NewtonKrylov.build.<locals>.newton_krylov_solver]-[1192]: 258.628006 s
inline_inlinables [NewtonKrylov.build.<locals>.newton_krylov_solver]-[712]: 248.971889 s
inline_inlinables [LinearSolver.build.<locals>.linear_solver]-[994]: 166.666156 s
inline_inlinables [LinearSolver.build.<locals>.linear_solver]-[514]: 161.275011 s
nopython_rewrites [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1742]: 109.881090 s
ir_legalization [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1743]: 55.800535 s
strip_phis [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1740]: 55.563423 s
cuda_native_lowering [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1850]: 35.872522 s
nopython_type_inference [BatchSolverKernel.build_kernel.<locals>.integration_kernel]-[1739]: 29.156163 s
inline_closure_likes [neumann_preconditioner.<locals>.preconditioner]-[422]: 1.500233 s

Inlining being a major choke point isn't a surprise, as all of the device functions are closures with many factory-scope variables which end up compiled into the code. However, I think there's room for performance improvements in the inline_closurecall.InlineWorker class with very few edits. This probably falls into the micro-optimisation category for users who aren't (ab)using the inline option in this way, but it significantly improves compile time in my case.
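For context, the usage pattern is roughly the following closure-factory shape (a minimal pure-Python sketch with made-up names; in the real package each factory-built function carries cuda.jit(device=True, inline='always'), so the factory-scope constants are baked into the compiled device function):

```python
def build_step(dt, n_stages, tol):
    # Factory-scope constants: captured as closure freevars, and
    # compiled into the device function as constants by Numba.
    inv_dt = 1.0 / dt

    # In the real package this would be decorated with
    # @cuda.jit(device=True, inline='always')
    def step(state):
        total = 0.0
        for _ in range(n_stages):
            total += state * inv_dt
            if abs(total) < tol:
                break
        return total

    return step

step = build_step(dt=0.5, n_stages=3, tol=1e-9)
```

Every call site of such a function gets its whole body (and the bodies of anything it inlines in turn) spliced into the caller IR, which is why InlineWorker runs so many times on large block sets here.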

I'm not familiar with writing compiler passes, or with what is mutable and what is persistent in Numba IR, but I've had an extended play at vandalising the files in numba.core to probe what's happening. The investigation/suggestions below are a best guess. I'm happy to submit a PR, but would appreciate the eyes of someone who knows this system to let me know:

  • What is/isn't safe or permissible
  • What changes could be better implemented elsewhere in the code
  • Whether changes in this area are worth making

Describe the solution you'd like

Line-profiling the inline_function and inline_ir methods shows a hotspot in the copy_ir function: deepcopy struggles to safely copy the bloated scope attached to the callee's blocks.

Source

        # Always copy the callee IR, it gets mutated
        def copy_ir(the_ir):
            kernel_copy = the_ir.copy()
            kernel_copy.blocks = {}
            for block_label, block in the_ir.blocks.items():
                new_block = copy.deepcopy(the_ir.blocks[block_label])
                kernel_copy.blocks[block_label] = new_block
            return kernel_copy

        callee_ir = copy_ir(callee_ir)

        # check that the contents of the callee IR is something that can be
        # inlined if a validator is present
        if self.validator is not None:
            self.validator(callee_ir)

        # save an unmutated copy of the callee_ir to return
        callee_ir_original = copy_ir(callee_ir)

In v0.23.0, using a smaller example (3-4 minutes compile time), here's how much time we're spending in the core inline functions:

193.23 seconds - InlineWorker.inline_ir.<locals>.copy_ir
195.02 seconds - InlineWorker.inline_ir
202.19 seconds - InlineWorker.inline_function

I propose two modifications.

1. Don't copy twice; save the input callee_ir to callee_ir_original at function entry

The input argument is copied twice: once to preserve the original, and once as a mutable copy for inlining. Assigning the input argument to callee_ir_original before rebinding callee_ir to a fresh copy halves the copy time.

    def inline_ir(
        self, caller_ir, block, i, callee_ir, callee_freevars, arg_typs=None
    ):
        # save an unmutated copy of the callee_ir to return
        callee_ir_original = callee_ir

        # Always copy the callee IR, it gets mutated
        def copy_ir(the_ir):
            kernel_copy = the_ir.copy()
            kernel_copy.blocks = {}
            for block_label, block in the_ir.blocks.items():
                new_block = copy.deepcopy(the_ir.blocks[block_label])
                kernel_copy.blocks[block_label] = new_block
            return kernel_copy

        callee_ir = copy_ir(callee_ir)

This provides about the expected improvement:

93.69 seconds - InlineWorker.inline_ir.<locals>.copy_ir
95.34 seconds - InlineWorker.inline_ir
102.36 seconds - InlineWorker.inline_function
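The invariant proposal 1 relies on can be sketched with plain objects (hypothetical toy classes, not Numba's): aliasing the input as the "original" is safe as long as every later mutation targets the copy, so the caller gets back an untouched object.

```python
import copy


class FakeIR:
    """Stand-in for numba.core.ir.FunctionIR for this sketch."""

    def __init__(self, blocks):
        self.blocks = blocks


def inline_sketch(callee_ir):
    # Proposal 1: alias, don't copy. The caller's object is handed
    # back untouched as the "original".
    callee_ir_original = callee_ir

    # One copy instead of two; all mutation happens on the copy.
    working = copy.deepcopy(callee_ir)
    working.blocks["entry"].append("renamed_var")

    return working, callee_ir_original


inp = FakeIR({"entry": ["stmt0"]})
mutated, original = inline_sketch(inp)
```

The open question (hence the request for review) is whether anything downstream of inline_ir mutates callee_ir_original, in which case the alias would leak mutations back to the caller where the old double-copy did not.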

2. Implement a selective shallow copy that only copies down to the level modified in the inline run

To my (untrained) eye, it looks as if the block bodies are only mutated down to the statement level in this pass. Implementing something like the below avoids a full deepcopy:

def _selective_copy_blocks(blocks):
    new_blocks = {}
    for label, block in blocks.items():
        new_block = ir.Block(block.scope, block.loc)
        new_body = block.body[:]
        for idx, stmt in enumerate(new_body):
            new_body[idx] = copy.copy(stmt)
        new_block.body = new_body
        new_blocks[label] = new_block
    return new_blocks

Then we can call it in copy_ir instead of deepcopy:

        def copy_ir(the_ir):
            kernel_copy = the_ir.copy()
            kernel_copy.blocks = _selective_copy_blocks(the_ir.blocks)
            return kernel_copy

This makes a dramatic difference to compile time:

  0.28 seconds - InlineWorker.inline_ir.<locals>.copy_ir
  1.96 seconds - InlineWorker.inline_ir
  8.56 seconds - InlineWorker.inline_function
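The trade-off in the selective copy can be demonstrated with plain objects (hypothetical toy classes): copy.copy of each statement yields fresh statement objects whose top-level attributes can be rebound safely, but anything nested, such as a shared scope, stays aliased between original and copy. That aliasing is exactly what the caching experiment below trips over.

```python
import copy


class Stmt:
    """Stand-in for a Numba IR statement for this sketch."""

    def __init__(self, target, scope):
        self.target = target
        self.scope = scope  # shared, nested object


scope = {"vars": ["x"]}
body = [Stmt("x", scope), Stmt("y", scope)]

# One-level copy, as in _selective_copy_blocks: new statement objects...
shallow = [copy.copy(s) for s in body]
shallow[0].target = "x.1"  # safe: rebinds an attribute on the copy only

# ...but nested state is still shared, so mutating it through the copy
# is visible through the originals too.
shallow[0].scope["vars"].append("x.1")
```

So the one-level copy is sound precisely when the pass only rebinds statement-level attributes, which is the claim this proposal rests on and where review is most needed.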

Describe alternatives you've considered

3. Cache untyped_passes output in inline_function

I inline the same callee into multiple functions at multiple points; many of the mutations to the inlined function do not depend on the caller and, to my mind, should be repeatable for the same callee. After the changes above, a line-by-line profile of inline_function shows 75% of the time spent running the full untyped passes on the callee.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   500       101    6547312.3  64824.9     76.5          callee_ir = self.run_untyped_passes(funct…
   501       101        165.5      1.6      0.0          freevars = function.__code__.co_freevars
   502       202    2014428.2   9972.4     23.5          return self.inline_ir(
   503       101         59.7      0.6      0.0              caller_ir, block, i, callee_ir, freev…
   504                 )

I attempted to cache the resultant output:

    def inline_function(self, caller_ir, block, i, function, arg_typs=None):
        """Inlines the function in the caller_ir at statement index i of block
        `block`. If `arg_typs` is given and the InlineWorker instance was
        initialized with a typemap and calltypes then they will be appropriately
        updated based on the arg_typs.
        """
        cache_key = function.__code__
        callee_ir = self._untyped_pass_ir_cache.get(cache_key)
        if callee_ir is None:
            callee_ir = self.run_untyped_passes(function)
            self._untyped_pass_ir_cache[cache_key] = callee_ir

        freevars = function.__code__.co_freevars
        return self.inline_ir(
            caller_ir, block, i, callee_ir, freevars, arg_typs=arg_typs
        )

but compile time exploded. The function was spending a lot of time renaming variables - I presumed because block.scope.localvars was being repeatedly filled with a new round of renamed variables on each inline. I shallow-copied the fields of the scope that I thought were modified:

def _selective_copy_blocks(blocks):
    new_blocks = {}
    for label, block in blocks.items():
        new_block = ir.Block(block.scope, block.loc)
        new_body = block.body[:]
        for idx, stmt in enumerate(new_body):
            new_body[idx] = copy.copy(stmt)
        new_block.body = new_body

        new_scope = copy.copy(block.scope)
        new_scope.redefined = defaultdict(int, **block.scope.redefined)
        new_scope.var_redefinitions = defaultdict(
                set, **block.scope.var_redefinitions
        )
        new_scope.localvars = copy.copy(block.scope.localvars)
        new_con = dict(block.scope.localvars._con)
        new_scope.localvars._con = new_con
        new_block.scope = new_scope

        new_blocks[label] = new_block
    return new_blocks

This unwieldy function cut compile time again, but led to a typing error:

No implementation of function Function(<built-in function or_>) found for signature:
 
 >>> or_(float64, int32)
 
There are 8 candidate implementations:
      - Of which 4 did not match due to:
      Overload of function 'or_': File: <numerous>: Line N/A.
        With argument(s): '(float64, int32)':
       No match.
      - Of which 4 did not match due to:
      Operator Overload in function 'or_': File: unknown: Line unknown.
        With argument(s): '(float64, int32)':
       No match for registered cases:
        * (bool, bool) -> bool
        * (int64, int64) -> int64
        * (int64, uint64) -> int64
        * (uint64, int64) -> int64
        * (uint64, uint64) -> uint64

During: typing of intrinsic-call at C:\local_working_projects\cubie\src\cubie\integrators\loops\ode_loop.py (615)

File "src\cubie\integrators\loops\ode_loop.py", line 615:
        def loop_fn(
            <source elided>
                    niters = proposed_counters[0]
                    status = int32(status | step_status)
                    ^

status is declared as status = int32(0), and the only assignments to it are of the form status = int32(status | int32(other_status)). I've gone pretty wild on casting across the package, as the odd uncast constant or upcast in a logical operation can propagate its way into the float32 math pathway and bring things crashing to a halt.

Some element of the cached IR is being mutated at a level above my understanding. Going deeper in the selective copy function eats away at the gains made by avoiding the deepcopy; the most performant version I can reach avoids the deepcopy and accepts that, without one, we can't cache the callee_ir. I include this option in case there's a minimal edit, obvious to more experienced eyes, that would capture some benefit from caching.
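If caching is revisited, one compromise I haven't measured against Numba itself is to deepcopy only on cache retrieval: that reinstates one deepcopy per inline, but would still skip run_untyped_passes (the 76.5% line above) on repeat inlines of the same callee. A generic sketch of the pattern, with made-up names:

```python
import copy


class CopyOnReadCache:
    """Memoize an expensive builder, handing out deep copies so
    callers can mutate the result without poisoning the cached master."""

    def __init__(self, builder):
        self._builder = builder
        self._store = {}
        self.builds = 0  # instrumentation for this demo only

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._builder(key)
            self.builds += 1
        # The master copy is never handed out, so caller-side
        # mutation cannot leak back into the cache.
        return copy.deepcopy(self._store[key])


# Hypothetical stand-in for run_untyped_passes keyed on __code__:
cache = CopyOnReadCache(lambda key: {"blocks": {0: [f"stmt_{key}"]}})
a = cache.get("loop_fn")
a["blocks"][0].append("mutated")
b = cache.get("loop_fn")  # fresh copy, unaffected by a's mutation
```

Whether this is a net win depends on how deepcopy of the post-untyped-passes IR compares to re-running the passes; I haven't benchmarked that split.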

Additional context

I am using inline='always' for performance: I have read the Notes on Inlining page, but I measure a real improvement in compiled-code performance (~40%, from memory). I presume that's because the downstream CUDA compiler has an easier time keeping values in registers instead of storing/loading when its input PTX contains fewer nested call/returns.

I had a hard time generating a meaningful MWE without forcing you to use my package, which isn't particularly usable yet. Hopefully the measurements and snippets provided illustrate the issue/suggestion. Let me know if I can make this easier to interpret/get started on.
