Why is JAX so fast? #11078

jpivarski · 2022-06-13T02:43:52Z

jpivarski
Jun 13, 2022

I was writing a notebook to illustrate the performance trade-offs of array-oriented programming and imperative programming, which calculates the Mandelbrot set in a variety of different ways:

Python for loops as an absolute baseline
NumPy for an array-oriented calculation that has to create a lot of intermediate arrays
compiled C++ for loops, through pybind11
compiled Cython for loops
compiled Numba for loops
Numba's @nb.vectorize decorator: not a performance advantage, but splits between imperative for a pixel and array-oriented when programming in the large
CuPy for an array-oriented calculation on a GPU, which nevertheless has to create intermediate arrays
CuPy with a custom cp.RawKernel (like @nb.vectorize, but for GPUs)
Numba's CUDA backend, which is effectively like cp.RawKernel

The reason I chose the Mandelbrot set is because each pixel can be calculated independently, but the algorithm that determines the value of each pixel must "iterate until converged," which can be a different number of times for different pixels.

Then it occurred to me to add JAX, because it's an independent basis vector in this space:

JAX array-oriented interface, compiled for the CPU with XLA
same, compiled for the GPU with XLA

The details are all in the linked notebook (above), but here's the bottom line:

I had a pretty good story going until JAX was added to the mix. That story was:

Python is slow for a whole lot of reasons.
NumPy improves upon Python by putting the hot loops in compiled code, but it's still wasteful in how it has to create temporary arrays for intermediate parts of the calculation.
All of the CPU compiled methods—C++, Cython, Numba, including @nb.vectorize—are about as fast as each other because they're all as fast as a single-threaded CPU can do this calculation. Considerable wrestling with Cython was necessary to get it in line with the rest.
CuPy improves upon the best CPU results because its array-oriented interface is a good match for a GPU's vectorization, but like NumPy, it creates temporary arrays.
CuPy with a custom cp.RawKernel solves the problem of temporary arrays in exactly the same way @nb.vectorize does on the CPU.
Numba's CUDA backend is much like CuPy with a custom cp.RawKernel, so they're about the same.

So we basically have 3 levels: the best a GPU can do, the best a CPU can do, and Python. CuPy and NumPy with temporary arrays are somewhat worse than the best a GPU or a CPU can do, respectively.

Nice story.

But then I added JAX, and its final CPU speed is almost 5× better than all of the compiled CPU variants (and those are pretty close to each other), and its final GPU speed is about 6× better than CuPy and Numba-CUDA. How can that be?

At first, I noticed that the CPU version was using all my cores, whereas the other tests were single-threaded. There isn't a good way to turn that off, so I forced the whole JupyterLab process to have affinity with CPU 0:

taskset -c 0 jupyter lab

Not only did I see in htop that it is indeed only using one core now, but I have a

assert len(jax.devices("cpu")) == 1

in the code to make sure it's not accidentally run on all cores in the future.

Then I noticed that JAX asynchronously dispatches its tasks, so I put a .block_until_ready() on the final result. For good measure, I wrapped that in np.asarray to force it to be eager and to copy results back from the GPU. The other tests are required to do so as well. (CuPy and Numba-CUDA fill a CuPy array that I .get() at the end of the calculation.) I did see some differences with these changes, but the numbers quoted above are after making them.

It used to be the case that the maximum number of iterations in the algorithm was variable, but JAX won't compile with that (since it's a tracer). So I made all the implementations use a compile-time constant maximum number of iterations, just to be fair. Thinking that JAX was taking advantage of that to unroll the loop over iterations, I expected C++ (conda-forge gcc 10.3.0-16, everything compiled at -O3) to get the same advantage. It didn't. Even giving C++ constants for all three nested loops (height, width, and maxiterations) didn't change its result much. In the end, all implementations have height and width not specified at compile-time, but maxiterations a compile-time constant.

What gives? Is my test still unfair because of something I didn't think of? Or if JAX really is much faster for this kind of algorithm, why? What is XLA doing differently from gcc, LLVM, nvrtc, and NVVM that makes it so much faster?

Below are some details, in case the answer is hidden in them.

`/proc/cpuinfo` for CPU 0 (the one I pinned the whole JupyterLab process to)

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 80
model name	: AMD Ryzen 7 5700G with Radeon Graphics
stepping	: 0
microcode	: 0xa50000c
cpu MHz		: 3800.000
cache size	: 512 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 7599.68
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

`sudo lshw -C display` for the GPU (and also the built-in graphics)

  *-display                 
       description: VGA compatible controller
       product: GA106 [GeForce RTX 3060 Lite Hash Rate]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:136 memory:fb000000-fbffffff memory:b0000000-bfffffff memory:c0000000-c1ffffff ioport:f000(size=128) memory:fc000000-fc07ffff
  *-display
       description: VGA compatible controller
       product: Cezanne
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:0b:00.0
       logical name: /dev/fb0
       version: c8
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi msix vga_controller bus_master cap_list fb
       configuration: depth=32 driver=amdgpu latency=0 resolution=2560,1440
       resources: irq:68 memory:d0000000-dfffffff memory:e0000000-e01fffff ioport:e000(size=256) memory:fcd00000-fcd7ffff

`nvidia-smi -a` for the GPU

Timestamp                                 : Sun Jun 12 20:20:46 2022
Driver Version                            : 510.73.05
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3060
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-47ba4177-7b07-5f05-e985-adee8bf04cae
    Minor Number                          : 0
    VBIOS Version                         : 94.06.2F.00.48
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x250410DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x138F196E
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 72000 KB/s
        Rx Throughput                     : 632000 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12288 MiB
        Reserved                          : 235 MiB
        Used                              : 46 MiB
        Free                              : 12006 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 3 MiB
        Free                              : 253 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 27 %
        Memory                            : 15 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 56 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 19.26 W
        Power Limit                       : 170.00 W
        Default Power Limit               : 170.00 W
        Enforced Power Limit              : 170.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 170.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7501 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 662.500 mV
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2595
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 45 MiB

LLVM for `numba_inner_loop`, which has a low-level function and also a much larger unboxing/boxing function

; ModuleID = 'numba_inner_loop'
source_filename = "<string>"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@_ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE = common local_unnamed_addr global i8* null
@.const.picklebuf.140498085454976 = internal constant { i8*, i32, i8* } { i8* getelementptr inbounds ([186 x i8], [186 x i8]* @.const.pickledata.140498085454976, i32 0, i32 0), i32 186, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @.const.pickledata.140498085454976.sha1, i32 0, i32 0) }
@.const.picklebuf.140498082859904 = internal constant { i8*, i32, i8* } { i8* getelementptr inbounds ([186 x i8], [186 x i8]* @.const.pickledata.140498082859904, i32 0, i32 0), i32 186, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @.const.pickledata.140498082859904.sha1, i32 0, i32 0) }
@.const.picklebuf.140498083546176 = internal constant { i8*, i32, i8* } { i8* getelementptr inbounds ([137 x i8], [137 x i8]* @.const.pickledata.140498083546176, i32 0, i32 0), i32 137, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @.const.pickledata.140498083546176.sha1, i32 0, i32 0) }
@.const.pickledata.140498083546176 = internal constant [137 x i8] c"\80\04\95~\00\00\00\00\00\00\00\8C\08builtins\94\8C\0AValueError\94\93\94\8C[array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.\94\85\94N\87\94."
@.const.pickledata.140498083546176.sha1 = internal constant [20 x i8] c"X\E1N\CC\B5\07\B1\E0 i\81t\02#\E6\85\CB\8C<W"
@.const.pickledata.140498082859904 = internal constant [186 x i8] c"\80\04\95\AF\00\00\00\00\00\00\00\8C\08builtins\94\8C\0AValueError\94\93\94\8C\8Cunable to broadcast argument 1 to output array\0AFile \22/home/jpivarski/mambaforge/lib/python3.9/site-packages/numba/np/npyimpl.py\22, line 228, \94\85\94N\87\94."
@.const.pickledata.140498082859904.sha1 = internal constant [20 x i8] c"\D7\19h\E2\D1K\DFT\9E\D9eF&\83\CE2\DA\84\1EO"
@.const.pickledata.140498085454976 = internal constant [186 x i8] c"\80\04\95\AF\00\00\00\00\00\00\00\8C\08builtins\94\8C\0AValueError\94\93\94\8C\8Cunable to broadcast argument 0 to output array\0AFile \22/home/jpivarski/mambaforge/lib/python3.9/site-packages/numba/np/npyimpl.py\22, line 228, \94\85\94N\87\94."
@.const.pickledata.140498085454976.sha1 = internal constant [20 x i8] c"J1\C9\B2\F7k\E1w\01\0E\AB\F1\B8\B2\EA\C4o\B9\EEy"
@.const.numba_inner_loop = internal constant [17 x i8] c"numba_inner_loop\00"
@PyExc_RuntimeError = external global i8
@".const.missing Environment: _ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE" = internal constant [176 x i8] c"missing Environment: _ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE\00"
@PyExc_TypeError = external global i8
@".const.can't unbox array from PyObject into native value.  The object maybe of a different type" = internal constant [89 x i8] c"can't unbox array from PyObject into native value.  The object maybe of a different type\00"
@".const.`env.consts` is NULL in `read_const`" = internal constant [37 x i8] c"`env.consts` is NULL in `read_const`\00"
@.const.pickledata.140500895779840 = internal constant [32 x i8] c"\80\04\95\15\00\00\00\00\00\00\00\8C\05numpy\94\8C\07ndarray\94\93\94."
@.const.pickledata.140500895779840.sha1 = internal constant [20 x i8] c"\DF\BC\FD\D3\9F\CB&\F4\D0\C6\80\95D\87\B8\C0\B5;\B8\A3"
@PyExc_SystemError = external global i8
@".const.unknown error when calling native function" = internal constant [43 x i8] c"unknown error when calling native function\00"
@".const.<numba.core.cpu.CPUContext object at 0x7fc84232d670>" = internal constant [53 x i8] c"<numba.core.cpu.CPUContext object at 0x7fc84232d670>\00"
@".const.unknown error when calling native function.1" = internal constant [43 x i8] c"unknown error when calling native function\00"
@_ZN08NumbaEnv13_3cdynamic_3e38__numba_array_expr_0x7fc842761310_2410B70c8tJTIeFCjyCbUFRqqOAK_2f5h0ggn2oJ9DwwEtPiAqkREPZIAanTCJIogpmsCAA_3d_3dEdd = common local_unnamed_addr global i8* null
@_ZN08NumbaEnv5numba2np7npyimpl20_broadcast_onto_2411B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEx8int64_2ax8int64_2a = common local_unnamed_addr global i8* null
@_ZN08NumbaEnv5numba2np8arrayobj19_call_allocator_248B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj = common local_unnamed_addr global i8* null
@_ZN08NumbaEnv5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_249B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj = common local_unnamed_addr global i8* null
@_ZN08NumbaEnv13_3cdynamic_3e49jit_wrapper__function_full_at_0x7fc90005b5e0__245B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EEx16class_28int32_29 = common local_unnamed_addr global i8* null
@_ZN08NumbaEnv5numba2np8arrayobj19numpy_full_dtype_nd12_3clocals_3e8full_246B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE8UniTupleIxLi2EEx16class_28int32_29 = common local_unnamed_addr global i8* null
@_ZN08NumbaEnv13_3cdynamic_3e41jit_wrapper__built_in_function_empty__247B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EE16class_28int32_29 = common local_unnamed_addr global i8* null
@.const.picklebuf.140498084857536 = internal constant { i8*, i32, i8* } { i8* getelementptr inbounds ([77 x i8], [77 x i8]* @.const.pickledata.140498084857536, i32 0, i32 0), i32 77, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @.const.pickledata.140498084857536.sha1, i32 0, i32 0) }
@.const.picklebuf.140498085300992 = internal constant { i8*, i32, i8* } { i8* getelementptr inbounds ([137 x i8], [137 x i8]* @.const.pickledata.140498085300992, i32 0, i32 0), i32 137, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @.const.pickledata.140498085300992.sha1, i32 0, i32 0) }
@.const.pickledata.140498085300992 = internal constant [137 x i8] c"\80\04\95~\00\00\00\00\00\00\00\8C\08builtins\94\8C\0AValueError\94\93\94\8C[array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.\94\85\94N\87\94."
@.const.pickledata.140498085300992.sha1 = internal constant [20 x i8] c"X\E1N\CC\B5\07\B1\E0 i\81t\02#\E6\85\CB\8C<W"
@.const.pickledata.140498084857536 = internal constant [77 x i8] c"\80\04\95B\00\00\00\00\00\00\00\8C\08builtins\94\8C\0AValueError\94\93\94\8C\1Fnegative dimensions not allowed\94\85\94N\87\94."
@.const.pickledata.140498084857536.sha1 = internal constant [20 x i8] c"3\1B\85c\BD\B9\DA\C8\1B8B\22s\05,Ho\C1pk"
@_ZN08NumbaEnv5numba7cpython7numbers16complex_abs_impl12_3clocals_3e16complex_abs_2412B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE10complex128 = common local_unnamed_addr global i8* null
@_ZN08NumbaEnv5numba7cpython8mathimpl16hypot_float_impl12_3clocals_3e14hypot_impl_243B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEdd = common local_unnamed_addr global i8* null

define i32 @_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE({ i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* noalias nocapture %retptr, { i8*, i32, i8* }** noalias nocapture %excinfo, i64 %arg.height, i64 %arg.width, i8* %arg.x.0, i8* nocapture readnone %arg.x.1, i64 %arg.x.2, i64 %arg.x.3, double* nocapture readonly %arg.x.4, i64 %arg.x.5.0, i64 %arg.x.5.1, i64 %arg.x.6.0, i64 %arg.x.6.1, i8* %arg.y.0, i8* nocapture readnone %arg.y.1, i64 %arg.y.2, i64 %arg.y.3, double* nocapture readonly %arg.y.4, i64 %arg.y.5.0, i64 %arg.y.5.1, i64 %arg.y.6.0, i64 %arg.y.6.1) local_unnamed_addr {
entry:
  %src_shape137 = alloca [2 x i64], align 8
  %src_shape137.sub352 = bitcast [2 x i64]* %src_shape137 to i64*
  %dest_shape138 = alloca <2 x i64>, align 16
  tail call void @NRT_incref(i8* %arg.x.0)
  tail call void @NRT_incref(i8* %arg.y.0)
  store <2 x i64> <i64 1, i64 1>, <2 x i64>* %dest_shape138, align 16
  store i64 %arg.x.5.0, i64* %src_shape137.sub352, align 8
  %.90 = getelementptr inbounds [2 x i64], [2 x i64]* %src_shape137, i64 0, i64 1
  store i64 %arg.x.5.1, i64* %.90, align 8
  br label %B34.i

B34.ithread-pre-split:                            ; preds = %B102.i
  %sunkaddr = mul i64 %lsr.iv346, 8
  %0 = bitcast [2 x i64]* %src_shape137 to i8*
  %sunkaddr354 = getelementptr i8, i8* %0, i64 %sunkaddr
  %1 = bitcast i8* %sunkaddr354 to i64*
  %.49.i.pr = load i64, i64* %1, align 8, !noalias !0
  %lsr.iv.next347 = add nuw nsw i64 %lsr.iv346, 1
  br label %B34.i

B34.i:                                            ; preds = %B34.ithread-pre-split, %entry
  %lsr.iv346 = phi i64 [ %lsr.iv.next347, %B34.ithread-pre-split ], [ 1, %entry ]
  %.49.i = phi i64 [ %.49.i.pr, %B34.ithread-pre-split ], [ %arg.x.5.0, %entry ]
  %2 = bitcast <2 x i64>* %dest_shape138 to i8*
  %3 = add i64 %lsr.iv346, -3
  %4 = shl nuw nsw i64 %lsr.iv346, 3
  %uglygep349 = getelementptr i8, i8* %2, i64 %4
  %uglygep349350 = bitcast i8* %uglygep349 to i64*
  %scevgep351 = getelementptr i64, i64* %uglygep349350, i64 -1
  %.56.i = load i64, i64* %scevgep351, align 8, !noalias !0
  %.61.not.i = icmp eq i64 %.56.i, 1
  br i1 %.61.not.i, label %B86.i, label %B58.i

B58.i:                                            ; preds = %B34.i
  %.65.i = icmp ne i64 %.49.i, %.56.i
  %.70.i = icmp ne i64 %.49.i, 1
  %or.cond.i = and i1 %.70.i, %.65.i
  br i1 %or.cond.i, label %B0.endif, label %B102.i

B86.i:                                            ; preds = %B34.i
  %.92.not.i = icmp eq i64 %.49.i, 1
  br i1 %.92.not.i, label %B102.i, label %B94.i

B94.i:                                            ; preds = %B86.i
  %sunkaddr355 = mul i64 %lsr.iv346, 8
  %5 = bitcast <2 x i64>* %dest_shape138 to i8*
  %sunkaddr356 = getelementptr i8, i8* %5, i64 %sunkaddr355
  %sunkaddr357 = getelementptr i8, i8* %sunkaddr356, i64 -8
  %6 = bitcast i8* %sunkaddr357 to i64*
  store i64 %.49.i, i64* %6, align 8, !noalias !0
  br label %B102.i

B102.i:                                           ; preds = %B94.i, %B86.i, %B58.i
  %7 = tail call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %3, i64 1) #4
  %ov.i = extractvalue { i64, i1 } %7, 1
  br i1 %ov.i, label %B0.endif.endif, label %B34.ithread-pre-split

B136:                                             ; preds = %B54.B42.loopexit_crit_edge.us, %B44.endif.lr.ph, %for.end.endif
  %.54.i.i.i = shl i64 %.161, 2
  %8 = ptrtoint i8* %.7.i.i.i.i.i to i64
  %9 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr to i64*
  store i64 %8, i64* %9, align 8
  %retptr.repack140 = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr, i64 0, i32 1
  %10 = bitcast i8** %retptr.repack140 to i64*
  store i64 0, i64* %10, align 8
  %retptr.repack142 = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr, i64 0, i32 2
  store i64 %.170, i64* %retptr.repack142, align 8
  %retptr.repack144 = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr, i64 0, i32 3
  store i64 4, i64* %retptr.repack144, align 8
  %retptr.repack146 = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr, i64 0, i32 4
  store i32* %.6.i116.i48.i.i, i32** %retptr.repack146, align 8
  %retptr.repack148.repack = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr, i64 0, i32 5, i64 0
  store i64 %.160, i64* %retptr.repack148.repack, align 8
  %retptr.repack148.repack152 = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr, i64 0, i32 5, i64 1
  store i64 %.161, i64* %retptr.repack148.repack152, align 8
  %retptr.repack150.repack = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr, i64 0, i32 6, i64 0
  store i64 %.54.i.i.i, i64* %retptr.repack150.repack, align 8
  %retptr.repack150.repack156 = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %retptr, i64 0, i32 6, i64 1
  store i64 4, i64* %retptr.repack150.repack156, align 8
  tail call void @NRT_decref(i8* %.7.i.i)
  ret i32 0

B0.endif:                                         ; preds = %B58.i
  %11 = mul nsw i64 %lsr.iv346, -1
  %.118 = icmp slt i64 %11, 1
  br i1 %.118, label %B0.endif.if, label %B0.endif.endif, !prof !3

B0.endif.if:                                      ; preds = %B0.endif
  store { i8*, i32, i8* }* @.const.picklebuf.140498085454976, { i8*, i32, i8* }** %excinfo, align 8
  ret i32 1, !ret_is_raise !4

B0.endif.endif:                                   ; preds = %B102.i, %B0.endif
  %12 = bitcast [2 x i64]* %src_shape137 to i64*
  store i64 %arg.y.5.0, i64* %12, align 8
  %13 = bitcast [2 x i64]* %src_shape137 to i8*
  %sunkaddr358 = getelementptr inbounds i8, i8* %13, i64 8
  %14 = bitcast i8* %sunkaddr358 to i64*
  store i64 %arg.y.5.1, i64* %14, align 8
  br label %B34.i9

B34.i9thread-pre-split:                           ; preds = %B102.i23
  %sunkaddr359 = mul i64 %lsr.iv340, 8
  %15 = bitcast [2 x i64]* %src_shape137 to i8*
  %sunkaddr360 = getelementptr i8, i8* %15, i64 %sunkaddr359
  %16 = bitcast i8* %sunkaddr360 to i64*
  %.49.i5.pr = load i64, i64* %16, align 8, !noalias !5
  %lsr.iv.next341 = add nuw nsw i64 %lsr.iv340, 1
  br label %B34.i9

B34.i9:                                           ; preds = %B34.i9thread-pre-split, %B0.endif.endif
  %lsr.iv340 = phi i64 [ %lsr.iv.next341, %B34.i9thread-pre-split ], [ 1, %B0.endif.endif ]
  %.49.i5 = phi i64 [ %.49.i5.pr, %B34.i9thread-pre-split ], [ %arg.y.5.0, %B0.endif.endif ]
  %17 = bitcast <2 x i64>* %dest_shape138 to i8*
  %18 = add i64 %lsr.iv340, -3
  %19 = shl nuw nsw i64 %lsr.iv340, 3
  %uglygep = getelementptr i8, i8* %17, i64 %19
  %uglygep344 = bitcast i8* %uglygep to i64*
  %scevgep345 = getelementptr i64, i64* %uglygep344, i64 -1
  %.56.i7 = load i64, i64* %scevgep345, align 8, !noalias !5
  %.61.not.i8 = icmp eq i64 %.56.i7, 1
  br i1 %.61.not.i8, label %B86.i16, label %B58.i13

B58.i13:                                          ; preds = %B34.i9
  %.65.i10 = icmp ne i64 %.49.i5, %.56.i7
  %.70.i11 = icmp ne i64 %.49.i5, 1
  %or.cond.i12 = and i1 %.70.i11, %.65.i10
  br i1 %or.cond.i12, label %B0.endif.endif.endif, label %B102.i23

B86.i16:                                          ; preds = %B34.i9
  %.92.not.i15 = icmp eq i64 %.49.i5, 1
  br i1 %.92.not.i15, label %B102.i23, label %B94.i19

B94.i19:                                          ; preds = %B86.i16
  %sunkaddr361 = mul i64 %lsr.iv340, 8
  %20 = bitcast <2 x i64>* %dest_shape138 to i8*
  %sunkaddr362 = getelementptr i8, i8* %20, i64 %sunkaddr361
  %sunkaddr363 = getelementptr i8, i8* %sunkaddr362, i64 -8
  %21 = bitcast i8* %sunkaddr363 to i64*
  store i64 %.49.i5, i64* %21, align 8, !noalias !5
  br label %B102.i23

B102.i23:                                         ; preds = %B94.i19, %B86.i16, %B58.i13
  %22 = tail call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %18, i64 1) #4
  %ov.i21 = extractvalue { i64, i1 } %22, 1
  br i1 %ov.i21, label %B0.endif.endif.endif.endif, label %B34.i9thread-pre-split

B0.endif.endif.endif:                             ; preds = %B58.i13
  %23 = mul nsw i64 %lsr.iv340, -1
  %.153 = icmp slt i64 %23, 1
  br i1 %.153, label %B0.endif.endif.endif.if, label %B0.endif.endif.endif.endif, !prof !3

B0.endif.endif.endif.if:                          ; preds = %B0.endif.endif.endif
  store { i8*, i32, i8* }* @.const.picklebuf.140498082859904, { i8*, i32, i8* }** %excinfo, align 8
  ret i32 1, !ret_is_raise !4

B0.endif.endif.endif.endif:                       ; preds = %B102.i23, %B0.endif.endif.endif
  %24 = bitcast <2 x i64>* %dest_shape138 to i64*
  %.160 = load i64, i64* %24, align 16
  %25 = bitcast <2 x i64>* %dest_shape138 to i8*
  %sunkaddr364 = getelementptr inbounds i8, i8* %25, i64 8
  %26 = bitcast i8* %sunkaddr364 to i64*
  %.161 = load i64, i64* %26, align 8
  %.169 = tail call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %.160, i64 %.161)
  %.170 = extractvalue { i64, i1 } %.169, 0
  %.171 = extractvalue { i64, i1 } %.169, 1
  %.174 = tail call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %.170, i64 16)
  %.176 = extractvalue { i64, i1 } %.174, 1
  %.177 = or i1 %.171, %.176
  br i1 %.177, label %B0.endif.endif.endif.endif.if, label %B0.endif.endif.endif.endif.endif, !prof !3

B0.endif.endif.endif.endif.if:                    ; preds = %B0.endif.endif.endif.endif
  store { i8*, i32, i8* }* @.const.picklebuf.140498083546176, { i8*, i32, i8* }** %excinfo, align 8
  ret i32 1, !ret_is_raise !4

B0.endif.endif.endif.endif.endif:                 ; preds = %B0.endif.endif.endif.endif
  %.175 = extractvalue { i64, i1 } %.174, 0
  %.7.i.i = tail call i8* @NRT_MemInfo_alloc_safe_aligned(i64 %.175, i32 32), !noalias !8
  %.5.i = getelementptr i8, i8* %.7.i.i, i64 24
  %27 = bitcast i8* %.5.i to { double, double }**
  %.6.i139 = load { double, double }*, { double, double }** %27, align 8
  %.26170 = icmp sgt i64 %.160, 0
  br i1 %.26170, label %for.cond.1.preheader.lr.ph, label %for.end

for.cond.1.preheader.lr.ph:                       ; preds = %B0.endif.endif.endif.endif.endif
  %scevgep = getelementptr { double, double }, { double, double }* %.6.i139, i64 0, i32 1
  %scevgep310 = bitcast double* %scevgep to { double, double }*
  %28 = shl i64 %.161, 4
  %29 = shl i64 %arg.x.5.1, 3
  br label %for.cond.1.preheader

for.cond.1.preheader:                             ; preds = %for.end.1, %for.cond.1.preheader.lr.ph
  %lsr.iv332 = phi double* [ %56, %for.end.1 ], [ %arg.x.4, %for.cond.1.preheader.lr.ph ]
  %lsr.iv311 = phi { double, double }* [ %55, %for.end.1 ], [ %scevgep310, %for.cond.1.preheader.lr.ph ]
  %loop.index75 = phi i64 [ 0, %for.cond.1.preheader.lr.ph ], [ %.337, %for.end.1 ]
  %.257.074 = phi i64 [ 0, %for.cond.1.preheader.lr.ph ], [ %.257.1.lcssa, %for.end.1 ]
  %.254.073 = phi i64 [ 0, %for.cond.1.preheader.lr.ph ], [ %.254.1.lcssa, %for.end.1 ]
  %.251.072 = phi i64 [ 0, %for.cond.1.preheader.lr.ph ], [ %.251.1.lcssa, %for.end.1 ]
  %.248.071 = phi i64 [ 0, %for.cond.1.preheader.lr.ph ], [ %.248.1.lcssa, %for.end.1 ]
  %30 = icmp sgt i64 %.161, 0
  br i1 %30, label %for.body.1.lr.ph, label %for.end.1

for.body.1.lr.ph:                                 ; preds = %for.cond.1.preheader
  %31 = icmp ugt i64 %arg.x.5.0, 1
  br i1 %31, label %for.body.1.lr.ph.split.us, label %for.body.1.lr.ph.split

for.body.1.lr.ph.split.us:                        ; preds = %for.body.1.lr.ph
  %32 = icmp ugt i64 %arg.x.5.1, 1
  %.277.us = mul i64 %loop.index75, %arg.x.5.1
  br i1 %32, label %for.body.1.us.us.preheader, label %for.body.1.us.preheader

for.body.1.us.us.preheader:                       ; preds = %for.body.1.lr.ph.split.us
  br label %for.body.1.us.us

for.body.1.us.preheader:                          ; preds = %for.body.1.lr.ph.split.us
  %.280.us = add i64 %.277.us, %.251.072
  br label %for.body.1.us

for.body.1.us.us:                                 ; preds = %for.body.1.us.us.preheader, %for.body.1.us.us
  %lsr.iv336 = phi { double, double }* [ %lsr.iv311, %for.body.1.us.us.preheader ], [ %scevgep337, %for.body.1.us.us ]
  %loop.index.166.us.us = phi i64 [ %.335.us.us, %for.body.1.us.us ], [ 0, %for.body.1.us.us.preheader ]
  %.257.165.us.us = phi i64 [ %.257.2.us.us, %for.body.1.us.us ], [ %.257.074, %for.body.1.us.us.preheader ]
  %.254.164.us.us = phi i64 [ %spec.select136.us.us, %for.body.1.us.us ], [ %.254.073, %for.body.1.us.us.preheader ]
  %33 = icmp ugt i64 %arg.y.5.1, 1
  %34 = icmp ugt i64 %arg.y.5.0, 1
  %lsr.iv336338 = bitcast { double, double }* %lsr.iv336 to double*
  %scevgep335 = getelementptr double, double* %lsr.iv332, i64 %loop.index.166.us.us
  %.282.us.us = load double, double* %scevgep335, align 8
  %spec.select136.us.us = select i1 %34, i64 %loop.index75, i64 %.254.164.us.us
  %.257.2.us.us = select i1 %33, i64 %loop.index.166.us.us, i64 %.257.165.us.us
  %.294.us.us = mul i64 %spec.select136.us.us, %arg.y.5.1
  %.297.us.us = add i64 %.294.us.us, %.257.2.us.us
  %.298.us.us = getelementptr double, double* %arg.y.4, i64 %.297.us.us
  %.299.us.us = load double, double* %.298.us.us, align 8
  %.33.i.us.us = fmul double %.299.us.us, 0.000000e+00
  %.40.i.us.us = fadd double %.299.us.us, 0.000000e+00
  %.71.i.us.us = fadd double %.282.us.us, %.33.i.us.us
  %scevgep339 = getelementptr double, double* %lsr.iv336338, i64 -1
  store double %.71.i.us.us, double* %scevgep339, align 8
  store double %.40.i.us.us, double* %lsr.iv336338, align 8
  %.335.us.us = add nuw nsw i64 %loop.index.166.us.us, 1
  %scevgep337 = getelementptr { double, double }, { double, double }* %lsr.iv336, i64 1
  %exitcond266.not = icmp eq i64 %.161, %.335.us.us
  br i1 %exitcond266.not, label %for.end.1.loopexit, label %for.body.1.us.us

for.body.1.us:                                    ; preds = %for.body.1.us, %for.body.1.us.preheader
  %lsr.iv328 = phi { double, double }* [ %scevgep329, %for.body.1.us ], [ %lsr.iv311, %for.body.1.us.preheader ]
  %loop.index.166.us = phi i64 [ %.335.us, %for.body.1.us ], [ 0, %for.body.1.us.preheader ]
  %.257.165.us = phi i64 [ %.257.2.us, %for.body.1.us ], [ %.257.074, %for.body.1.us.preheader ]
  %.254.164.us = phi i64 [ %spec.select136.us, %for.body.1.us ], [ %.254.073, %for.body.1.us.preheader ]
  %35 = icmp ugt i64 %arg.y.5.1, 1
  %36 = icmp ugt i64 %arg.y.5.0, 1
  %lsr.iv328330 = bitcast { double, double }* %lsr.iv328 to double*
  %sunkaddr365 = mul i64 %.280.us, 8
  %37 = bitcast double* %arg.x.4 to i8*
  %sunkaddr366 = getelementptr i8, i8* %37, i64 %sunkaddr365
  %38 = bitcast i8* %sunkaddr366 to double*
  %.282.us = load double, double* %38, align 8
  %spec.select136.us = select i1 %36, i64 %loop.index75, i64 %.254.164.us
  %.257.2.us = select i1 %35, i64 %loop.index.166.us, i64 %.257.165.us
  %.294.us = mul i64 %spec.select136.us, %arg.y.5.1
  %.297.us = add i64 %.294.us, %.257.2.us
  %.298.us = getelementptr double, double* %arg.y.4, i64 %.297.us
  %.299.us = load double, double* %.298.us, align 8
  %.33.i.us = fmul double %.299.us, 0.000000e+00
  %.40.i.us = fadd double %.299.us, 0.000000e+00
  %.71.i.us = fadd double %.282.us, %.33.i.us
  %scevgep331 = getelementptr double, double* %lsr.iv328330, i64 -1
  store double %.71.i.us, double* %scevgep331, align 8
  store double %.40.i.us, double* %lsr.iv328330, align 8
  %.335.us = add nuw nsw i64 %loop.index.166.us, 1
  %scevgep329 = getelementptr { double, double }, { double, double }* %lsr.iv328, i64 1
  %exitcond265.not = icmp eq i64 %.161, %.335.us
  br i1 %exitcond265.not, label %for.end.1, label %for.body.1.us

for.body.1.lr.ph.split:                           ; preds = %for.body.1.lr.ph
  %39 = icmp ugt i64 %arg.x.5.1, 1
  %.277.us214 = mul i64 %.248.071, %arg.x.5.1
  br i1 %39, label %for.body.1.us208.preheader, label %for.body.1.lr.ph.split.split

for.body.1.us208.preheader:                       ; preds = %for.body.1.lr.ph.split
  %scevgep322 = getelementptr double, double* %arg.x.4, i64 %.277.us214
  br label %for.body.1.us208

for.body.1.us208:                                 ; preds = %for.body.1.us208.preheader, %for.body.1.us208
  %lsr.iv324 = phi { double, double }* [ %lsr.iv311, %for.body.1.us208.preheader ], [ %scevgep325, %for.body.1.us208 ]
  %loop.index.166.us209 = phi i64 [ %.335.us230, %for.body.1.us208 ], [ 0, %for.body.1.us208.preheader ]
  %.257.165.us210 = phi i64 [ %.257.2.us219, %for.body.1.us208 ], [ %.257.074, %for.body.1.us208.preheader ]
  %.254.164.us211 = phi i64 [ %spec.select136.us218, %for.body.1.us208 ], [ %.254.073, %for.body.1.us208.preheader ]
  %40 = icmp ugt i64 %arg.y.5.1, 1
  %41 = icmp ugt i64 %arg.y.5.0, 1
  %lsr.iv324326 = bitcast { double, double }* %lsr.iv324 to double*
  %scevgep323 = getelementptr double, double* %scevgep322, i64 %loop.index.166.us209
  %.282.us217 = load double, double* %scevgep323, align 8
  %spec.select136.us218 = select i1 %41, i64 %loop.index75, i64 %.254.164.us211
  %.257.2.us219 = select i1 %40, i64 %loop.index.166.us209, i64 %.257.165.us210
  %.294.us220 = mul i64 %spec.select136.us218, %arg.y.5.1
  %.297.us221 = add i64 %.294.us220, %.257.2.us219
  %.298.us222 = getelementptr double, double* %arg.y.4, i64 %.297.us221
  %.299.us223 = load double, double* %.298.us222, align 8
  %.33.i.us224 = fmul double %.299.us223, 0.000000e+00
  %.40.i.us225 = fadd double %.299.us223, 0.000000e+00
  %.71.i.us226 = fadd double %.282.us217, %.33.i.us224
  %scevgep327 = getelementptr double, double* %lsr.iv324326, i64 -1
  store double %.71.i.us226, double* %scevgep327, align 8
  store double %.40.i.us225, double* %lsr.iv324326, align 8
  %.335.us230 = add nuw nsw i64 %loop.index.166.us209, 1
  %scevgep325 = getelementptr { double, double }, { double, double }* %lsr.iv324, i64 1
  %exitcond264.not = icmp eq i64 %.161, %.335.us230
  br i1 %exitcond264.not, label %for.end.1.loopexit307, label %for.body.1.us208

for.body.1.lr.ph.split.split:                     ; preds = %for.body.1.lr.ph.split
  %42 = icmp ugt i64 %arg.y.5.0, 1
  %.280.us237 = add i64 %.277.us214, %.251.072
  br i1 %42, label %for.body.1.lr.ph.split.split.split.us, label %for.body.1.preheader

for.body.1.preheader:                             ; preds = %for.body.1.lr.ph.split.split
  %.294 = mul i64 %.254.073, %arg.y.5.1
  br label %for.body.1

for.body.1.lr.ph.split.split.split.us:            ; preds = %for.body.1.lr.ph.split.split
  %.294.us242 = mul i64 %loop.index75, %arg.y.5.1
  br label %for.body.1.us232

for.body.1.us232:                                 ; preds = %for.body.1.us232, %for.body.1.lr.ph.split.split.split.us
  %lsr.iv318 = phi { double, double }* [ %scevgep319, %for.body.1.us232 ], [ %lsr.iv311, %for.body.1.lr.ph.split.split.split.us ]
  %loop.index.166.us233 = phi i64 [ 0, %for.body.1.lr.ph.split.split.split.us ], [ %.335.us252, %for.body.1.us232 ]
  %.257.165.us234 = phi i64 [ %.257.074, %for.body.1.lr.ph.split.split.split.us ], [ %.257.2.us241, %for.body.1.us232 ]
  %43 = icmp ugt i64 %arg.y.5.1, 1
  %lsr.iv318320 = bitcast { double, double }* %lsr.iv318 to double*
  %sunkaddr367 = mul i64 %.280.us237, 8
  %44 = bitcast double* %arg.x.4 to i8*
  %sunkaddr368 = getelementptr i8, i8* %44, i64 %sunkaddr367
  %45 = bitcast i8* %sunkaddr368 to double*
  %.282.us239 = load double, double* %45, align 8
  %.257.2.us241 = select i1 %43, i64 %loop.index.166.us233, i64 %.257.165.us234
  %.297.us243 = add i64 %.294.us242, %.257.2.us241
  %.298.us244 = getelementptr double, double* %arg.y.4, i64 %.297.us243
  %.299.us245 = load double, double* %.298.us244, align 8
  %.33.i.us246 = fmul double %.299.us245, 0.000000e+00
  %.40.i.us247 = fadd double %.299.us245, 0.000000e+00
  %.71.i.us248 = fadd double %.282.us239, %.33.i.us246
  %scevgep321 = getelementptr double, double* %lsr.iv318320, i64 -1
  store double %.71.i.us248, double* %scevgep321, align 8
  store double %.40.i.us247, double* %lsr.iv318320, align 8
  %.335.us252 = add nuw nsw i64 %loop.index.166.us233, 1
  %scevgep319 = getelementptr { double, double }, { double, double }* %lsr.iv318, i64 1
  %exitcond263.not = icmp eq i64 %.161, %.335.us252
  br i1 %exitcond263.not, label %for.end.1, label %for.body.1.us232

for.end:                                          ; preds = %for.end.1, %B0.endif.endif.endif.endif.endif
  tail call void @NRT_decref(i8* %arg.y.0)
  tail call void @NRT_decref(i8* %arg.x.0)
  %46 = or i64 %.161, %.160
  %.not.i.i = icmp sgt i64 %46, -1
  br i1 %.not.i.i, label %B0.endif.endif.endif.endif.i.i.i, label %for.end.if, !prof !13

B0.endif.endif.endif.endif.i.i.i:                 ; preds = %for.end
  %.55.i.i.i = tail call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %.170, i64 4)
  %.57.i.i.i = extractvalue { i64, i1 } %.55.i.i.i, 1
  br i1 %.57.i.i.i, label %for.end.if, label %B0.endif.i.i, !prof !3

B0.endif.i.i:                                     ; preds = %B0.endif.endif.endif.endif.i.i.i
  %.56.i.i.i = extractvalue { i64, i1 } %.55.i.i.i, 0
  %.7.i.i.i.i.i = tail call i8* @NRT_MemInfo_alloc_safe_aligned(i64 %.56.i.i.i, i32 32), !noalias !14
  %.5.i.i.i.i = getelementptr i8, i8* %.7.i.i.i.i.i, i64 24
  %47 = bitcast i8* %.5.i.i.i.i to i32**
  %.6.i116.i48.i.i = load i32*, i32** %47, align 8, !noalias !28
  %.91.i.i = icmp eq i64 %.160, 0
  %.97.i.i = icmp eq i64 %.161, 0
  %narrow.i.i = or i1 %.91.i.i, %.97.i.i
  br i1 %narrow.i.i, label %for.end.endif, label %B26.endif.i.i.preheader, !prof !3

B26.endif.i.i.preheader:                          ; preds = %B0.endif.i.i
  br label %B26.endif.i.i

B26.endif.i.i:                                    ; preds = %B26.endif.i.i.preheader, %B26.endif.i.i
  %.842.sroa.0.051.i.i = phi i64 [ %.842.sroa.0.1.ph.i.i, %B26.endif.i.i ], [ 0, %B26.endif.i.i.preheader ]
  %.842.sroa.6.050.i.i = phi i64 [ %.842.sroa.6.1.ph.i.i, %B26.endif.i.i ], [ 0, %B26.endif.i.i.preheader ]
  %.169.i.i = add nuw nsw i64 %.842.sroa.6.050.i.i, 1
  %.170.i.i = icmp slt i64 %.169.i.i, %.161
  %.177.i.i = add nsw i64 %.842.sroa.0.051.i.i, 1
  %.178.i.i = icmp slt i64 %.177.i.i, %.160
  %spec.select46.i.i = select i1 %.178.i.i, i64 %.177.i.i, i64 0
  %.842.sroa.6.1.ph.i.i = select i1 %.170.i.i, i64 %.169.i.i, i64 0, !prof !29
  %.842.sroa.0.1.ph.i.i = select i1 %.170.i.i, i64 %.842.sroa.0.051.i.i, i64 %spec.select46.i.i, !prof !29
  %narrow49.demorgan.i.i = or i1 %.178.i.i, %.170.i.i
  %.265.i.i = mul i64 %.842.sroa.0.051.i.i, %.161
  %.268.i.i = add i64 %.265.i.i, %.842.sroa.6.050.i.i
  %.269.i.i = getelementptr i32, i32* %.6.i116.i48.i.i, i64 %.268.i.i
  store i32 20, i32* %.269.i.i, align 4, !noalias !30
  br i1 %narrow49.demorgan.i.i, label %B26.endif.i.i, label %for.end.endif, !prof !29

for.body.1:                                       ; preds = %for.body.1, %for.body.1.preheader
  %lsr.iv314 = phi { double, double }* [ %scevgep315, %for.body.1 ], [ %lsr.iv311, %for.body.1.preheader ]
  %loop.index.166 = phi i64 [ %.335, %for.body.1 ], [ 0, %for.body.1.preheader ]
  %.257.165 = phi i64 [ %.257.2, %for.body.1 ], [ %.257.074, %for.body.1.preheader ]
  %48 = icmp ugt i64 %arg.y.5.1, 1
  %lsr.iv314316 = bitcast { double, double }* %lsr.iv314 to double*
  %sunkaddr369 = mul i64 %.280.us237, 8
  %49 = bitcast double* %arg.x.4 to i8*
  %sunkaddr370 = getelementptr i8, i8* %49, i64 %sunkaddr369
  %50 = bitcast i8* %sunkaddr370 to double*
  %.282 = load double, double* %50, align 8
  %.257.2 = select i1 %48, i64 %loop.index.166, i64 %.257.165
  %.297 = add i64 %.294, %.257.2
  %.298 = getelementptr double, double* %arg.y.4, i64 %.297
  %.299 = load double, double* %.298, align 8
  %.33.i = fmul double %.299, 0.000000e+00
  %.40.i = fadd double %.299, 0.000000e+00
  %.71.i = fadd double %.282, %.33.i
  %scevgep317 = getelementptr double, double* %lsr.iv314316, i64 -1
  store double %.71.i, double* %scevgep317, align 8
  store double %.40.i, double* %lsr.iv314316, align 8
  %.335 = add nuw nsw i64 %loop.index.166, 1
  %scevgep315 = getelementptr { double, double }, { double, double }* %lsr.iv314, i64 1
  %exitcond262.not = icmp eq i64 %.161, %.335
  br i1 %exitcond262.not, label %for.end.1, label %for.body.1

for.end.1.loopexit:                               ; preds = %for.body.1.us.us
  %51 = add i64 %.335.us.us, -1
  br label %for.end.1

for.end.1.loopexit307:                            ; preds = %for.body.1.us208
  %52 = add i64 %.335.us230, -1
  br label %for.end.1

for.end.1:                                        ; preds = %for.body.1, %for.body.1.us232, %for.body.1.us, %for.cond.1.preheader, %for.end.1.loopexit307, %for.end.1.loopexit
  %.248.1.lcssa = phi i64 [ %loop.index75, %for.end.1.loopexit ], [ %.248.071, %for.end.1.loopexit307 ], [ %.248.071, %for.cond.1.preheader ], [ %loop.index75, %for.body.1.us ], [ %.248.071, %for.body.1.us232 ], [ %.248.071, %for.body.1 ]
  %.251.1.lcssa = phi i64 [ %51, %for.end.1.loopexit ], [ %52, %for.end.1.loopexit307 ], [ %.251.072, %for.cond.1.preheader ], [ %.251.072, %for.body.1.us ], [ %.251.072, %for.body.1.us232 ], [ %.251.072, %for.body.1 ]
  %.254.1.lcssa = phi i64 [ %spec.select136.us.us, %for.end.1.loopexit ], [ %spec.select136.us218, %for.end.1.loopexit307 ], [ %.254.073, %for.cond.1.preheader ], [ %spec.select136.us, %for.body.1.us ], [ %loop.index75, %for.body.1.us232 ], [ %.254.073, %for.body.1 ]
  %.257.1.lcssa = phi i64 [ %.257.2.us.us, %for.end.1.loopexit ], [ %.257.2.us219, %for.end.1.loopexit307 ], [ %.257.074, %for.cond.1.preheader ], [ %.257.2.us, %for.body.1.us ], [ %.257.2.us241, %for.body.1.us232 ], [ %.257.2, %for.body.1 ]
  %53 = bitcast { double, double }* %lsr.iv311 to i1*
  %54 = bitcast double* %lsr.iv332 to i1*
  %.337 = add nuw nsw i64 %loop.index75, 1
  %scevgep313 = getelementptr i1, i1* %53, i64 %28
  %55 = bitcast i1* %scevgep313 to { double, double }*
  %scevgep334 = getelementptr i1, i1* %54, i64 %29
  %56 = bitcast i1* %scevgep334 to double*
  %exitcond267.not = icmp eq i64 %.337, %.160
  br i1 %exitcond267.not, label %for.end, label %for.cond.1.preheader

for.end.if:                                       ; preds = %B0.endif.endif.endif.endif.i.i.i, %for.end
  %excinfo.5.0.ph = phi { i8*, i32, i8* }* [ @.const.picklebuf.140498085300992, %B0.endif.endif.endif.endif.i.i.i ], [ @.const.picklebuf.140498084857536, %for.end ]
  store { i8*, i32, i8* }* %excinfo.5.0.ph, { i8*, i32, i8* }** %excinfo, align 8
  ret i32 1, !ret_is_raise !4

for.end.endif:                                    ; preds = %B26.endif.i.i, %B0.endif.i.i
  %.457.inv = icmp sgt i64 %arg.height, 0
  br i1 %.457.inv, label %B44.endif.lr.ph, label %B136

B44.endif.lr.ph:                                  ; preds = %for.end.endif
  %.578.inv = icmp sgt i64 %arg.width, 0
  %spec.select187 = select i1 %.578.inv, i64 %arg.width, i64 0
  %.629198 = sext i1 %.578.inv to i64
  %.619105199 = add nsw i64 %spec.select187, %.629198
  %.633200 = zext i1 %.578.inv to i64
  br i1 %.578.inv, label %B44.endif.us.preheader, label %B136

B44.endif.us.preheader:                           ; preds = %B44.endif.lr.ph
  br label %B44.endif.us

B44.endif.us:                                     ; preds = %B54.B42.loopexit_crit_edge.us, %B44.endif.us.preheader
  %.443.0205.us = phi i64 [ %.512.us, %B54.B42.loopexit_crit_edge.us ], [ 0, %B44.endif.us.preheader ]
  %.709.us = mul i64 %.443.0205.us, %.161
  br label %B56.us

B56.us:                                           ; preds = %B44.endif.us, %B137.us
  %.610.sroa.0.0204.us = phi i64 [ 0, %B44.endif.us ], [ %.626101203.us, %B137.us ]
  %.626101203.us = phi i64 [ %.633200, %B44.endif.us ], [ %.626101.us, %B137.us ]
  %.619105202.us = phi i64 [ %.619105199, %B44.endif.us ], [ %.619105.us, %B137.us ]
  %.690.us = icmp slt i64 %.610.sroa.0.0204.us, 0
  %.691.us = select i1 %.690.us, i64 %.161, i64 0
  %.692.us = add i64 %.610.sroa.0.0204.us, %.709.us
  %.712.us = add i64 %.692.us, %.691.us
  %.714.us = getelementptr { double, double }, { double, double }* %.6.i139, i64 %.712.us, i32 0
  %.715.us = load double, double* %.714.us, align 8
  %.716.us = getelementptr { double, double }, { double, double }* %.6.i139, i64 %.712.us, i32 1
  %.717.us = load double, double* %.716.us, align 8
  %.896.us284 = fmul double %.715.us, %.715.us
  %.897.us285 = fmul double %.717.us, %.717.us
  %.898.us286 = fmul double %.717.us, %.715.us
  %.900.us287 = fsub double %.896.us284, %.897.us285
  %.903.us288 = fadd double %.898.us286, %.898.us286
  %.1001.us289 = fadd double %.900.us287, %.715.us
  %.1004.us290 = fadd double %.903.us288, %.717.us
  %.6.i.i.us291 = tail call double @hypot(double %.1001.us289, double %.1004.us290), !noalias !31
  %.1035.us292 = fcmp ogt double %.6.i.i.us291, 2.000000e+00
  br i1 %.1035.us292, label %B114.us, label %B78.us.preheader

B78.us.preheader:                                 ; preds = %B56.us
  br label %B78.us

B78.us:                                           ; preds = %B78.us.preheader, %B78.us.B80.us_crit_edge
  %lsr.iv = phi i64 [ 0, %B78.us.preheader ], [ %lsr.iv.next, %B78.us.B80.us_crit_edge ]
  %.1004.us297 = phi double [ %.1004.us, %B78.us.B80.us_crit_edge ], [ %.1004.us290, %B78.us.preheader ]
  %.1001.us296 = phi double [ %.1001.us, %B78.us.B80.us_crit_edge ], [ %.1001.us289, %B78.us.preheader ]
  %57 = add i64 %lsr.iv, 20
  %.808.us = icmp sgt i64 %57, 1
  br i1 %.808.us, label %B78.us.B80.us_crit_edge, label %B137.us

B78.us.B80.us_crit_edge:                          ; preds = %B78.us
  %.977.us.pre = load double, double* %.714.us, align 8
  %.979.us.pre = load double, double* %.716.us, align 8
  %.896.us = fmul double %.1001.us296, %.1001.us296
  %.897.us = fmul double %.1004.us297, %.1004.us297
  %.898.us = fmul double %.1004.us297, %.1001.us296
  %.900.us = fsub double %.896.us, %.897.us
  %.903.us = fadd double %.898.us, %.898.us
  %.1001.us = fadd double %.900.us, %.977.us.pre
  %.1004.us = fadd double %.903.us, %.979.us.pre
  %.6.i.i.us = tail call double @hypot(double %.1001.us, double %.1004.us), !noalias !31
  %.1035.us = fcmp ogt double %.6.i.i.us, 2.000000e+00
  %lsr.iv.next = add nsw i64 %lsr.iv, -1
  br i1 %.1035.us, label %B114.us.loopexit, label %B78.us

B114.us.loopexit:                                 ; preds = %B78.us.B80.us_crit_edge
  %58 = mul i64 %lsr.iv.next, -1
  br label %B114.us

B114.us:                                          ; preds = %B114.us.loopexit, %B56.us
  %.81479193.us.lcssa = phi i64 [ 0, %B56.us ], [ %58, %B114.us.loopexit ]
  %.1101.us = getelementptr i32, i32* %.6.i116.i48.i.i, i64 %.712.us
  %59 = trunc i64 %.81479193.us.lcssa to i32
  store i32 %59, i32* %.1101.us, align 4
  br label %B137.us

B137.us:                                          ; preds = %B78.us, %B114.us
  %.620.us = icmp sgt i64 %.619105202.us, 0
  %.629.us = sext i1 %.620.us to i64
  %.619105.us = add nsw i64 %.619105202.us, %.629.us
  %.633.us = zext i1 %.620.us to i64
  %.626101.us = add i64 %.626101203.us, %.633.us
  %B56.us.termcond = icmp sgt i64 %.619105202.us, 0
  br i1 %B56.us.termcond, label %B56.us, label %B54.B42.loopexit_crit_edge.us

B54.B42.loopexit_crit_edge.us:                    ; preds = %B137.us
  %.512.us = add nuw nsw i64 %.443.0205.us, 1
  %scmp = icmp sge i64 %.512.us, %arg.height
  br i1 %scmp, label %B136, label %B44.endif.us
}

; Function Attrs: nounwind readnone speculatable willreturn
declare { i64, i1 } @llvm.smul.with.overflow.i64(i64, i64) #0

define i8* @_ZN7cpython8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE(i8* nocapture readnone %py_closure, i8* %py_args, i8* nocapture readnone %py_kws) local_unnamed_addr {
entry:
  %.5 = alloca i8*, align 8
  %.6 = alloca i8*, align 8
  %.7 = alloca i8*, align 8
  %.8 = alloca i8*, align 8
  %.9 = call i32 (i8*, i8*, i64, i64, ...) @PyArg_UnpackTuple(i8* %py_args, i8* getelementptr inbounds ([17 x i8], [17 x i8]* @.const.numba_inner_loop, i64 0, i64 0), i64 4, i64 4, i8** nonnull %.5, i8** nonnull %.6, i8** nonnull %.7, i8** nonnull %.8)
  %.10 = icmp eq i32 %.9, 0
  %.56 = alloca { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }, align 8
  %.81 = alloca { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }, align 8
  %0 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 dereferenceable(72) %0, i8 0, i64 72, i1 false)
  %.105 = alloca { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, align 8
  %1 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 dereferenceable(72) %1, i8 0, i64 72, i1 false)
  %excinfo = alloca { i8*, i32, i8* }*, align 8
  %2 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.105 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 dereferenceable(72) %2, i8 0, i64 72, i1 false)
  store { i8*, i32, i8* }* null, { i8*, i32, i8* }** %excinfo, align 8
  %.181 = alloca { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, align 8
  %3 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.181 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 dereferenceable(72) %3, i8 0, i64 72, i1 false)
  br i1 %.10, label %entry.if, label %entry.endif, !prof !3

entry.if:                                         ; preds = %entry.endif.endif.endif.e...endif.endif.endif.if, %entry.endif.endif.endif.endif.endif, %entry.endif.endif.endif, %entry.endif.endif.endif.e...endif.endif.endif.if.if, %entry.endif.endif.endif.endif.endif.endif.endif, %entry.endif.endif.endif.e...endif.if, %entry.endif.endif.endif.e...endif.1.endif, %entry
  ret i8* null

entry.endif:                                      ; preds = %entry
  %.14 = load i8*, i8** @_ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE, align 8
  %.19 = icmp eq i8* %.14, null
  br i1 %.19, label %entry.endif.if, label %entry.endif.endif, !prof !3

entry.endif.if:                                   ; preds = %entry.endif
  call void @PyErr_SetString(i8* nonnull @PyExc_RuntimeError, i8* getelementptr inbounds ([176 x i8], [176 x i8]* @".const.missing Environment: _ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE", i64 0, i64 0))
  ret i8* null

entry.endif.endif:                                ; preds = %entry.endif
  %.23 = load i8*, i8** %.5, align 8
  %.26 = call i8* @PyNumber_Long(i8* %.23)
  %.27.not = icmp eq i8* %.26, null
  br i1 %.27.not, label %entry.endif.endif.endif, label %entry.endif.endif.if, !prof !3

entry.endif.endif.if:                             ; preds = %entry.endif.endif
  %.29 = call i64 @PyLong_AsLongLong(i8* nonnull %.26)
  call void @Py_DecRef(i8* nonnull %.26)
  br label %entry.endif.endif.endif

entry.endif.endif.endif:                          ; preds = %entry.endif.endif, %entry.endif.endif.if
  %.24.0 = phi i64 [ %.29, %entry.endif.endif.if ], [ 0, %entry.endif.endif ]
  %.34 = call i8* @PyErr_Occurred()
  %.35.not = icmp eq i8* %.34, null
  br i1 %.35.not, label %entry.endif.endif.endif.endif, label %entry.if, !prof !29

entry.endif.endif.endif.endif:                    ; preds = %entry.endif.endif.endif
  %.39 = load i8*, i8** %.6, align 8
  %.42 = call i8* @PyNumber_Long(i8* %.39)
  %.43.not = icmp eq i8* %.42, null
  br i1 %.43.not, label %entry.endif.endif.endif.endif.endif, label %entry.endif.endif.endif.endif.if, !prof !3

entry.endif.endif.endif.endif.if:                 ; preds = %entry.endif.endif.endif.endif
  %.45 = call i64 @PyLong_AsLongLong(i8* nonnull %.42)
  call void @Py_DecRef(i8* nonnull %.42)
  br label %entry.endif.endif.endif.endif.endif

entry.endif.endif.endif.endif.endif:              ; preds = %entry.endif.endif.endif.endif, %entry.endif.endif.endif.endif.if
  %.40.0 = phi i64 [ %.45, %entry.endif.endif.endif.endif.if ], [ 0, %entry.endif.endif.endif.endif ]
  %.50 = call i8* @PyErr_Occurred()
  %.51.not = icmp eq i8* %.50, null
  br i1 %.51.not, label %entry.endif.endif.endif.endif.endif.endif, label %entry.if, !prof !29

entry.endif.endif.endif.endif.endif.endif:        ; preds = %entry.endif.endif.endif.endif.endif
  %.55 = load i8*, i8** %.7, align 8
  %.59 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %4 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 dereferenceable(72) %4, i8 0, i64 72, i1 false)
  %.60 = call i32 @NRT_adapt_ndarray_from_python(i8* %.55, i8* nonnull %.59)
  %5 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %sunkaddr = getelementptr inbounds i8, i8* %5, i64 24
  %6 = bitcast i8* %sunkaddr to i64*
  %.64 = load i64, i64* %6, align 8
  %.65 = icmp ne i64 %.64, 8
  %.66 = icmp ne i32 %.60, 0
  %.67 = or i1 %.66, %.65
  br i1 %.67, label %entry.endif.endif.endif.endif.endif.endif.if, label %entry.endif.endif.endif.endif.endif.endif.endif, !prof !3

entry.endif.endif.endif.endif.endif.endif.if:     ; preds = %entry.endif.endif.endif.endif.endif.endif
  call void @PyErr_SetString(i8* nonnull @PyExc_TypeError, i8* getelementptr inbounds ([89 x i8], [89 x i8]* @".const.can't unbox array from PyObject into native value.  The object maybe of a different type", i64 0, i64 0))
  br label %entry.endif.endif.endif.endif.endif.endif.endif

entry.endif.endif.endif.endif.endif.endif.endif:  ; preds = %entry.endif.endif.endif.endif.endif.endif.if, %entry.endif.endif.endif.endif.endif.endif
  %7 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8**
  %.71.fca.0.load = load i8*, i8** %7, align 8
  %8 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %sunkaddr7 = getelementptr inbounds i8, i8* %8, i64 8
  %9 = bitcast i8* %sunkaddr7 to i8**
  %.71.fca.1.load = load i8*, i8** %9, align 8
  %10 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %sunkaddr8 = getelementptr inbounds i8, i8* %10, i64 16
  %11 = bitcast i8* %sunkaddr8 to i64*
  %.71.fca.2.load = load i64, i64* %11, align 8
  %12 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %sunkaddr9 = getelementptr inbounds i8, i8* %12, i64 32
  %13 = bitcast i8* %sunkaddr9 to double**
  %.71.fca.4.load = load double*, double** %13, align 8
  %14 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %sunkaddr10 = getelementptr inbounds i8, i8* %14, i64 40
  %15 = bitcast i8* %sunkaddr10 to i64*
  %.71.fca.5.0.load = load i64, i64* %15, align 8
  %16 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %sunkaddr11 = getelementptr inbounds i8, i8* %16, i64 48
  %17 = bitcast i8* %sunkaddr11 to i64*
  %.71.fca.5.1.load = load i64, i64* %17, align 8
  %18 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %sunkaddr12 = getelementptr inbounds i8, i8* %18, i64 56
  %19 = bitcast i8* %sunkaddr12 to i64*
  %.71.fca.6.0.load = load i64, i64* %19, align 8
  %20 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.56 to i8*
  %sunkaddr13 = getelementptr inbounds i8, i8* %20, i64 64
  %21 = bitcast i8* %sunkaddr13 to i64*
  %.71.fca.6.1.load = load i64, i64* %21, align 8
  br i1 %.67, label %entry.if, label %entry.endif.endif.endif.endif.endif.endif.endif.endif, !prof !3

entry.endif.endif.endif.endif.endif.endif.endif.endif: ; preds = %entry.endif.endif.endif.endif.endif.endif.endif
  %.80 = load i8*, i8** %.8, align 8
  %.84 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %22 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 dereferenceable(72) %22, i8 0, i64 72, i1 false)
  %.85 = call i32 @NRT_adapt_ndarray_from_python(i8* %.80, i8* nonnull %.84)
  %23 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %sunkaddr14 = getelementptr inbounds i8, i8* %23, i64 24
  %24 = bitcast i8* %sunkaddr14 to i64*
  %.89 = load i64, i64* %24, align 8
  %.90 = icmp ne i64 %.89, 8
  %.91 = icmp ne i32 %.85, 0
  %.92 = or i1 %.91, %.90
  br i1 %.92, label %entry.endif.endif.endif.e...endif.if, label %entry.endif.endif.endif.e...endif.endif, !prof !3

entry.endif.endif.endif.e...endif.if:             ; preds = %entry.endif.endif.endif.endif.endif.endif.endif.endif
  call void @PyErr_SetString(i8* nonnull @PyExc_TypeError, i8* getelementptr inbounds ([89 x i8], [89 x i8]* @".const.can't unbox array from PyObject into native value.  The object maybe of a different type", i64 0, i64 0))
  call void @NRT_decref(i8* %.71.fca.0.load)
  br label %entry.if

entry.endif.endif.endif.e...endif.endif:          ; preds = %entry.endif.endif.endif.endif.endif.endif.endif.endif
  %25 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.105 to i8**
  %26 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8**
  %.96.fca.0.load = load i8*, i8** %26, align 8
  %27 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %sunkaddr15 = getelementptr inbounds i8, i8* %27, i64 8
  %28 = bitcast i8* %sunkaddr15 to i8**
  %.96.fca.1.load = load i8*, i8** %28, align 8
  %29 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %sunkaddr16 = getelementptr inbounds i8, i8* %29, i64 16
  %30 = bitcast i8* %sunkaddr16 to i64*
  %.96.fca.2.load = load i64, i64* %30, align 8
  %31 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %sunkaddr17 = getelementptr inbounds i8, i8* %31, i64 32
  %32 = bitcast i8* %sunkaddr17 to double**
  %.96.fca.4.load = load double*, double** %32, align 8
  %33 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %sunkaddr18 = getelementptr inbounds i8, i8* %33, i64 40
  %34 = bitcast i8* %sunkaddr18 to i64*
  %.96.fca.5.0.load = load i64, i64* %34, align 8
  %35 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %sunkaddr19 = getelementptr inbounds i8, i8* %35, i64 48
  %36 = bitcast i8* %sunkaddr19 to i64*
  %.96.fca.5.1.load = load i64, i64* %36, align 8
  %37 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %sunkaddr20 = getelementptr inbounds i8, i8* %37, i64 56
  %38 = bitcast i8* %sunkaddr20 to i64*
  %.96.fca.6.0.load = load i64, i64* %38, align 8
  %39 = bitcast { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] }* %.81 to i8*
  %sunkaddr21 = getelementptr inbounds i8, i8* %39, i64 64
  %40 = bitcast i8* %sunkaddr21 to i64*
  %.96.fca.6.1.load = load i64, i64* %40, align 8
  %41 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.105 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 dereferenceable(72) %41, i8 0, i64 72, i1 false)
  %.117 = call i32 @_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE({ i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* nonnull %.105, { i8*, i32, i8* }** nonnull %excinfo, i64 %.24.0, i64 %.40.0, i8* %.71.fca.0.load, i8* %.71.fca.1.load, i64 %.71.fca.2.load, i64 8, double* %.71.fca.4.load, i64 %.71.fca.5.0.load, i64 %.71.fca.5.1.load, i64 %.71.fca.6.0.load, i64 %.71.fca.6.1.load, i8* %.96.fca.0.load, i8* %.96.fca.1.load, i64 %.96.fca.2.load, i64 8, double* %.96.fca.4.load, i64 %.96.fca.5.0.load, i64 %.96.fca.5.1.load, i64 %.96.fca.6.0.load, i64 %.96.fca.6.1.load) #2
  %.118 = load { i8*, i32, i8* }*, { i8*, i32, i8* }** %excinfo, align 8
  %.127.fca.0.load = load i8*, i8** %25, align 8
  %42 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.105 to i8*
  %sunkaddr22 = getelementptr inbounds i8, i8* %42, i64 8
  %43 = bitcast i8* %sunkaddr22 to <4 x i64>*
  %44 = load <4 x i64>, <4 x i64>* %43, align 8
  %45 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.105 to i8*
  %sunkaddr23 = getelementptr inbounds i8, i8* %45, i64 40
  %46 = bitcast i8* %sunkaddr23 to <4 x i64>*
  %47 = load <4 x i64>, <4 x i64>* %46, align 8
  call void @NRT_decref(i8* %.71.fca.0.load)
  call void @NRT_decref(i8* %.96.fca.0.load)
  switch i32 %.117, label %entry.endif.endif.endif.e...endif.endif.endif [
    i32 -2, label %entry.endif.endif.endif.e...endif.endif.if.endif
    i32 0, label %entry.endif.endif.endif.e...endif.endif.if.endif
  ]

entry.endif.endif.endif.e...endif.endif.endif:    ; preds = %entry.endif.endif.endif.e...endif.endif
  %.125 = icmp sgt i32 %.117, 0
  br i1 %.125, label %entry.endif.endif.endif.e...endif.endif.endif.if, label %entry.endif.endif.endif.e...endif.1.endif

entry.endif.endif.endif.e...endif.endif.if.endif: ; preds = %entry.endif.endif.endif.e...endif.endif, %entry.endif.endif.endif.e...endif.endif
  %sunkaddr24 = getelementptr i8, i8* %.14, i64 24
  %48 = bitcast i8* %sunkaddr24 to i8**
  %.162 = load i8*, i8** %48, align 8
  %.166.not = icmp eq i8* %.162, null
  br i1 %.166.not, label %entry.endif.endif.endif.e...endif.endif.if.endif.else, label %entry.endif.endif.endif.e...endif.endif.if.endif.if

entry.endif.endif.endif.e...endif.endif.if.endif.if: ; preds = %entry.endif.endif.endif.e...endif.endif.if.endif
  %.168 = call i8* @PyList_GetItem(i8* nonnull %.162, i64 0)
  br label %entry.endif.endif.endif.e...endif.endif.if.endif.endif

entry.endif.endif.endif.e...endif.endif.if.endif.else: ; preds = %entry.endif.endif.endif.e...endif.endif.if.endif
  call void @PyErr_SetString(i8* nonnull @PyExc_RuntimeError, i8* getelementptr inbounds ([37 x i8], [37 x i8]* @".const.`env.consts` is NULL in `read_const`", i64 0, i64 0))
  br label %entry.endif.endif.endif.e...endif.endif.if.endif.endif

entry.endif.endif.endif.e...endif.endif.if.endif.endif: ; preds = %entry.endif.endif.endif.e...endif.endif.if.endif.else, %entry.endif.endif.endif.e...endif.endif.if.endif.if
  %.163.0 = phi i8* [ %.168, %entry.endif.endif.endif.e...endif.endif.if.endif.if ], [ null, %entry.endif.endif.endif.e...endif.endif.if.endif.else ]
  %.fca.0.gep25 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.181 to i8**
  %.fca.5.0.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.181, i64 0, i32 5, i64 0
  %.fca.1.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.181, i64 0, i32 1
  %.180 = call i8* @numba_unpickle(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @.const.pickledata.140500895779840, i64 0, i64 0), i32 32, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @.const.pickledata.140500895779840.sha1, i64 0, i64 0))
  store i8* %.127.fca.0.load, i8** %.fca.0.gep25, align 8
  %49 = bitcast i8** %.fca.1.gep to <4 x i64>*
  store <4 x i64> %44, <4 x i64>* %49, align 8
  %50 = bitcast i64* %.fca.5.0.gep to <4 x i64>*
  store <4 x i64> %47, <4 x i64>* %50, align 8
  %.184 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.181 to i8*
  %.185 = call i8* @NRT_adapt_ndarray_to_python_acqref(i8* nonnull %.184, i8* %.180, i32 2, i32 1, i8* %.163.0)
  call void @NRT_decref(i8* %.127.fca.0.load)
  ret i8* %.185

entry.endif.endif.endif.e...endif.endif.endif.if: ; preds = %entry.endif.endif.endif.e...endif.endif.endif
  call void @PyErr_Clear()
  %.195 = load { i8*, i32, i8* }, { i8*, i32, i8* }* %.118, align 8
  %.196 = extractvalue { i8*, i32, i8* } %.195, 0
  %.198 = extractvalue { i8*, i32, i8* } %.195, 1
  %.200 = extractvalue { i8*, i32, i8* } %.195, 2
  %.201 = call i8* @numba_unpickle(i8* %.196, i32 %.198, i8* %.200)
  %.202.not = icmp eq i8* %.201, null
  br i1 %.202.not, label %entry.if, label %entry.endif.endif.endif.e...endif.endif.endif.if.if, !prof !3

entry.endif.endif.endif.e...endif.endif.endif.if.if: ; preds = %entry.endif.endif.endif.e...endif.endif.endif.if
  call void @numba_do_raise(i8* nonnull %.201)
  br label %entry.if

entry.endif.endif.endif.e...endif.1.endif:        ; preds = %entry.endif.endif.endif.e...endif.endif.endif
  call void @PyErr_SetString(i8* nonnull @PyExc_SystemError, i8* getelementptr inbounds ([43 x i8], [43 x i8]* @".const.unknown error when calling native function", i64 0, i64 0))
  br label %entry.if
}

declare i32 @PyArg_UnpackTuple(i8*, i8*, i64, i64, ...) local_unnamed_addr

declare void @PyErr_SetString(i8*, i8*) local_unnamed_addr

declare i8* @PyNumber_Long(i8*) local_unnamed_addr

declare i64 @PyLong_AsLongLong(i8*) local_unnamed_addr

declare void @Py_DecRef(i8*) local_unnamed_addr

declare i8* @PyErr_Occurred() local_unnamed_addr

declare i32 @NRT_adapt_ndarray_from_python(i8* nocapture, i8* nocapture) local_unnamed_addr

declare i8* @PyList_GetItem(i8*, i64) local_unnamed_addr

declare i8* @numba_unpickle(i8*, i32, i8*) local_unnamed_addr

declare i8* @NRT_adapt_ndarray_to_python_acqref(i8* nocapture, i8*, i32, i32, i8*) local_unnamed_addr

declare void @PyErr_Clear() local_unnamed_addr

declare void @numba_do_raise(i8*) local_unnamed_addr

define { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } @cfunc._ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE(i64 %.1, i64 %.2, { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.3, { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.4) local_unnamed_addr {
entry:
  %.6 = alloca { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, align 8
  %.fca.0.gep1 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6 to i8**
  %.fca.1.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6, i64 0, i32 1
  %.fca.2.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6, i64 0, i32 2
  %.fca.3.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6, i64 0, i32 3
  %.fca.4.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6, i64 0, i32 4
  %.fca.5.0.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6, i64 0, i32 5, i64 0
  %.fca.5.1.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6, i64 0, i32 5, i64 1
  %.fca.6.0.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6, i64 0, i32 6, i64 0
  %.fca.6.1.gep = getelementptr inbounds { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }, { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6, i64 0, i32 6, i64 1
  %excinfo = alloca { i8*, i32, i8* }*, align 8
  %0 = bitcast { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* %.6 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 dereferenceable(72) %0, i8 0, i64 72, i1 false)
  store { i8*, i32, i8* }* null, { i8*, i32, i8* }** %excinfo, align 8
  %extracted.meminfo = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.3, 0
  %extracted.parent = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.3, 1
  %extracted.nitems = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.3, 2
  %extracted.itemsize = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.3, 3
  %extracted.data = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.3, 4
  %extracted.shape = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.3, 5
  %.10 = extractvalue [2 x i64] %extracted.shape, 0
  %.11 = extractvalue [2 x i64] %extracted.shape, 1
  %extracted.strides = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.3, 6
  %.12 = extractvalue [2 x i64] %extracted.strides, 0
  %.13 = extractvalue [2 x i64] %extracted.strides, 1
  %extracted.meminfo.1 = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.4, 0
  %extracted.parent.1 = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.4, 1
  %extracted.nitems.1 = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.4, 2
  %extracted.itemsize.1 = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.4, 3
  %extracted.data.1 = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.4, 4
  %extracted.shape.1 = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.4, 5
  %.14 = extractvalue [2 x i64] %extracted.shape.1, 0
  %.15 = extractvalue [2 x i64] %extracted.shape.1, 1
  %extracted.strides.1 = extractvalue { i8*, i8*, i64, i64, double*, [2 x i64], [2 x i64] } %.4, 6
  %.16 = extractvalue [2 x i64] %extracted.strides.1, 0
  %.17 = extractvalue [2 x i64] %extracted.strides.1, 1
  %.18 = call i32 @_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE({ i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] }* nonnull %.6, { i8*, i32, i8* }** nonnull %excinfo, i64 %.1, i64 %.2, i8* %extracted.meminfo, i8* %extracted.parent, i64 %extracted.nitems, i64 %extracted.itemsize, double* %extracted.data, i64 %.10, i64 %.11, i64 %.12, i64 %.13, i8* %extracted.meminfo.1, i8* %extracted.parent.1, i64 %extracted.nitems.1, i64 %extracted.itemsize.1, double* %extracted.data.1, i64 %.14, i64 %.15, i64 %.16, i64 %.17) #2
  %.19 = load { i8*, i32, i8* }*, { i8*, i32, i8* }** %excinfo, align 8
  %.20.not = icmp eq i32 %.18, 0
  %.28.fca.0.load = load i8*, i8** %.fca.0.gep1, align 8
  %.28.fca.1.load = load i8*, i8** %.fca.1.gep, align 8
  %.28.fca.2.load = load i64, i64* %.fca.2.gep, align 8
  %.28.fca.3.load = load i64, i64* %.fca.3.gep, align 8
  %.28.fca.4.load = load i32*, i32** %.fca.4.gep, align 8
  %.28.fca.5.0.load = load i64, i64* %.fca.5.0.gep, align 8
  %.28.fca.5.1.load = load i64, i64* %.fca.5.1.gep, align 8
  %.28.fca.6.0.load = load i64, i64* %.fca.6.0.gep, align 8
  %.28.fca.6.1.load = load i64, i64* %.fca.6.1.gep, align 8
  %.45 = alloca i32, align 4
  store i32 0, i32* %.45, align 4
  br i1 %.20.not, label %entry.endif, label %entry.if, !prof !29

entry.if:                                         ; preds = %entry
  %.26 = icmp sgt i32 %.18, 0
  call void @numba_gil_ensure(i32* nonnull %.45)
  br i1 %.26, label %entry.if.if, label %entry.if.endif.endif.endif

entry.endif:                                      ; preds = %entry, %.48
  %.28.fca.0.insert = insertvalue { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } undef, i8* %.28.fca.0.load, 0
  %.28.fca.1.insert = insertvalue { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } %.28.fca.0.insert, i8* %.28.fca.1.load, 1
  %.28.fca.2.insert = insertvalue { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } %.28.fca.1.insert, i64 %.28.fca.2.load, 2
  %.28.fca.3.insert = insertvalue { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } %.28.fca.2.insert, i64 %.28.fca.3.load, 3
  %.28.fca.4.insert = insertvalue { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } %.28.fca.3.insert, i32* %.28.fca.4.load, 4
  %1 = insertvalue [2 x i64] undef, i64 %.28.fca.5.0.load, 0
  %.34 = insertvalue [2 x i64] %1, i64 %.28.fca.5.1.load, 1
  %inserted.shape = insertvalue { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } %.28.fca.4.insert, [2 x i64] %.34, 5
  %2 = insertvalue [2 x i64] undef, i64 %.28.fca.6.0.load, 0
  %.35 = insertvalue [2 x i64] %2, i64 %.28.fca.6.1.load, 1
  %inserted.strides = insertvalue { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } %inserted.shape, [2 x i64] %.35, 6
  ret { i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64] } %inserted.strides

.48:                                              ; preds = %entry.if.if, %entry.if.if.if, %entry.if.endif.endif.endif
  %.70 = call i8* @PyUnicode_FromString(i8* getelementptr inbounds ([53 x i8], [53 x i8]* @".const.<numba.core.cpu.CPUContext object at 0x7fc84232d670>", i64 0, i64 0))
  call void @PyErr_WriteUnraisable(i8* %.70)
  call void @Py_DecRef(i8* %.70)
  call void @numba_gil_release(i32* nonnull %.45)
  br label %entry.endif

entry.if.if:                                      ; preds = %entry.if
  call void @PyErr_Clear()
  %.51 = load { i8*, i32, i8* }, { i8*, i32, i8* }* %.19, align 8
  %.52 = extractvalue { i8*, i32, i8* } %.51, 0
  %.54 = extractvalue { i8*, i32, i8* } %.51, 1
  %.56 = extractvalue { i8*, i32, i8* } %.51, 2
  %.57 = call i8* @numba_unpickle(i8* %.52, i32 %.54, i8* %.56)
  %.58.not = icmp eq i8* %.57, null
  br i1 %.58.not, label %.48, label %entry.if.if.if, !prof !3

entry.if.if.if:                                   ; preds = %entry.if.if
  call void @numba_do_raise(i8* nonnull %.57)
  br label %.48

entry.if.endif.endif.endif:                       ; preds = %entry.if
  call void @PyErr_SetString(i8* nonnull @PyExc_SystemError, i8* getelementptr inbounds ([43 x i8], [43 x i8]* @".const.unknown error when calling native function.1", i64 0, i64 0))
  br label %.48
}

declare void @numba_gil_ensure(i32*) local_unnamed_addr

declare i8* @PyUnicode_FromString(i8*) local_unnamed_addr

declare void @PyErr_WriteUnraisable(i8*) local_unnamed_addr

declare void @numba_gil_release(i32*) local_unnamed_addr

; Function Attrs: nounwind readnone speculatable willreturn
declare { i64, i1 } @llvm.uadd.with.overflow.i64(i64, i64) #0

declare noalias i8* @NRT_MemInfo_alloc_safe_aligned(i64, i32) local_unnamed_addr

declare double @hypot(double, double) local_unnamed_addr

; Function Attrs: nofree noinline norecurse nounwind
define linkonce_odr void @NRT_incref(i8* %.1) local_unnamed_addr #1 {
.3:
  %.4 = icmp eq i8* %.1, null
  br i1 %.4, label %.3.if, label %.3.endif, !prof !3

.3.if:                                            ; preds = %.3
  ret void

.3.endif:                                         ; preds = %.3
  %.7 = bitcast i8* %.1 to i64*
  %.4.i = atomicrmw add i64* %.7, i64 1 monotonic
  ret void
}

; Function Attrs: noinline
define linkonce_odr void @NRT_decref(i8* %.1) local_unnamed_addr #2 {
.3:
  %.4 = icmp eq i8* %.1, null
  br i1 %.4, label %.3.if, label %.3.endif, !prof !3

.3.if:                                            ; preds = %.3.endif, %.3
  ret void

.3.endif:                                         ; preds = %.3
  fence release
  %.8 = bitcast i8* %.1 to i64*
  %.4.i = atomicrmw sub i64* %.8, i64 1 monotonic
  %.10 = icmp eq i64 %.4.i, 1
  br i1 %.10, label %.3.endif.if, label %.3.if, !prof !3

.3.endif.if:                                      ; preds = %.3.endif
  fence acquire
  tail call void @NRT_MemInfo_call_dtor(i8* nonnull %.1)
  ret void
}

declare void @NRT_MemInfo_call_dtor(i8*) local_unnamed_addr

; Function Attrs: argmemonly nounwind willreturn writeonly
declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg) #3

; Function Attrs: nounwind
declare void @llvm.stackprotector(i8*, i8**) #4

attributes #0 = { nounwind readnone speculatable willreturn }
attributes #1 = { nofree noinline norecurse nounwind }
attributes #2 = { noinline }
attributes #3 = { argmemonly nounwind willreturn writeonly }
attributes #4 = { nounwind }

!0 = !{!1}
!1 = distinct !{!1, !2, !"_ZN5numba2np7npyimpl20_broadcast_onto_2411B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEx8int64_2ax8int64_2a: %retptr"}
!2 = distinct !{!2, !"_ZN5numba2np7npyimpl20_broadcast_onto_2411B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEx8int64_2ax8int64_2a"}
!3 = !{!"branch_weights", i32 1, i32 99}
!4 = !{i1 true}
!5 = !{!6}
!6 = distinct !{!6, !7, !"_ZN5numba2np7npyimpl20_broadcast_onto_2411B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEx8int64_2ax8int64_2a: %retptr"}
!7 = distinct !{!7, !"_ZN5numba2np7npyimpl20_broadcast_onto_2411B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEx8int64_2ax8int64_2a"}
!8 = !{!9, !11}
!9 = distinct !{!9, !10, !"_ZN5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_249B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj: %retptr"}
!10 = distinct !{!10, !"_ZN5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_249B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj"}
!11 = distinct !{!11, !12, !"_ZN5numba2np8arrayobj19_call_allocator_248B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj: %retptr"}
!12 = distinct !{!12, !"_ZN5numba2np8arrayobj19_call_allocator_248B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj"}
!13 = !{!"branch_weights", i32 9801, i32 199}
!14 = !{!15, !17, !19, !21, !22, !24, !25, !27}
!15 = distinct !{!15, !16, !"_ZN5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_249B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj: %retptr"}
!16 = distinct !{!16, !"_ZN5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_249B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj"}
!17 = distinct !{!17, !18, !"_ZN5numba2np8arrayobj19_call_allocator_248B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj: %retptr"}
!18 = distinct !{!18, !"_ZN5numba2np8arrayobj19_call_allocator_248B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj"}
!19 = distinct !{!19, !20, !"_ZN13_3cdynamic_3e41jit_wrapper__built_in_function_empty__247B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EE16class_28int32_29: %retptr"}
!20 = distinct !{!20, !"_ZN13_3cdynamic_3e41jit_wrapper__built_in_function_empty__247B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EE16class_28int32_29"}
!21 = distinct !{!21, !20, !"_ZN13_3cdynamic_3e41jit_wrapper__built_in_function_empty__247B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EE16class_28int32_29: %excinfo"}
!22 = distinct !{!22, !23, !"_ZN5numba2np8arrayobj19numpy_full_dtype_nd12_3clocals_3e8full_246B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE8UniTupleIxLi2EEx16class_28int32_29: %retptr"}
!23 = distinct !{!23, !"_ZN5numba2np8arrayobj19numpy_full_dtype_nd12_3clocals_3e8full_246B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE8UniTupleIxLi2EEx16class_28int32_29"}
!24 = distinct !{!24, !23, !"_ZN5numba2np8arrayobj19numpy_full_dtype_nd12_3clocals_3e8full_246B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE8UniTupleIxLi2EEx16class_28int32_29: %excinfo"}
!25 = distinct !{!25, !26, !"_ZN13_3cdynamic_3e49jit_wrapper__function_full_at_0x7fc90005b5e0__245B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EEx16class_28int32_29: %retptr"}
!26 = distinct !{!26, !"_ZN13_3cdynamic_3e49jit_wrapper__function_full_at_0x7fc90005b5e0__245B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EEx16class_28int32_29"}
!27 = distinct !{!27, !26, !"_ZN13_3cdynamic_3e49jit_wrapper__function_full_at_0x7fc90005b5e0__245B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EEx16class_28int32_29: %excinfo"}
!28 = !{!19, !21, !22, !24, !25, !27}
!29 = !{!"branch_weights", i32 99, i32 1}
!30 = !{!22, !24, !25, !27}
!31 = !{!32, !34}
!32 = distinct !{!32, !33, !"_ZN5numba7cpython8mathimpl16hypot_float_impl12_3clocals_3e14hypot_impl_243B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEdd: %retptr"}
!33 = distinct !{!33, !"_ZN5numba7cpython8mathimpl16hypot_float_impl12_3clocals_3e14hypot_impl_243B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEdd"}
!34 = distinct !{!34, !35, !"_ZN5numba7cpython7numbers16complex_abs_impl12_3clocals_3e16complex_abs_2412B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE10complex128: %retptr"}
!35 = distinct !{!35, !"_ZN5numba7cpython7numbers16complex_abs_impl12_3clocals_3e16complex_abs_2412B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE10complex128"}

x86 assembly for `numba_inner_loop`, which has a low-level function and also a much larger unboxing/boxing function

	.text
	.file	"<string>"
	.section	.rodata.cst16,"aM",@progbits,16
	.p2align	4
.LCPI0_0:
	.quad	1
	.quad	1
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3
.LCPI0_1:
	.quad	0x4000000000000000
	.text
	.globl	_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE
	.p2align	4, 0x90
	.type	_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE,@function
_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	pushq	%r15
	.cfi_def_cfa_offset 24
	pushq	%r14
	.cfi_def_cfa_offset 32
	pushq	%r13
	.cfi_def_cfa_offset 40
	pushq	%r12
	.cfi_def_cfa_offset 48
	pushq	%rbx
	.cfi_def_cfa_offset 56
	subq	$184, %rsp
	.cfi_def_cfa_offset 240
	.cfi_offset %rbx, -56
	.cfi_offset %r12, -48
	.cfi_offset %r13, -40
	.cfi_offset %r14, -32
	.cfi_offset %r15, -24
	.cfi_offset %rbp, -16
	movq	%rcx, 64(%rsp)
	movq	%rdx, 112(%rsp)
	movq	%rsi, 8(%rsp)
	movq	%rdi, 152(%rsp)
	movq	344(%rsp), %r13
	movq	336(%rsp), %r14
	movq	328(%rsp), %r12
	movq	296(%rsp), %r15
	movq	264(%rsp), %rbp
	movabsq	$NRT_incref, %rbx
	movq	%r8, 72(%rsp)
	movq	%r8, %rdi
	callq	*%rbx
	movq	%r15, %rdi
	callq	*%rbx
	movabsq	$.LCPI0_0, %rax
	vmovapd	(%rax), %xmm0
	vmovapd	%xmm0, 80(%rsp)
	movq	%rbp, 120(%rsp)
	movq	272(%rsp), %rax
	movq	%rax, 128(%rsp)
	movl	$1, %eax
	movq	72(%rsp,%rax,8), %rdx
	cmpq	$1, %rdx
	jne	.LBB0_3
	.p2align	4, 0x90
.LBB0_7:
	cmpq	$1, %rbp
	je	.LBB0_9
	movq	%rbp, 72(%rsp,%rax,8)
	jmp	.LBB0_9
	.p2align	4, 0x90
.LBB0_3:
	cmpq	$1, %rbp
	je	.LBB0_9
	cmpq	%rdx, %rbp
	jne	.LBB0_5
.LBB0_9:
	leaq	-3(%rax), %rcx
	incq	%rcx
	je	.LBB0_10
	movq	120(%rsp,%rax,8), %rbp
	incq	%rax
	movq	72(%rsp,%rax,8), %rdx
	cmpq	$1, %rdx
	jne	.LBB0_3
	jmp	.LBB0_7
.LBB0_5:
	negq	%rax
	testq	%rax, %rax
	jle	.LBB0_6
.LBB0_10:
	movq	%r14, 120(%rsp)
	movq	%r13, 128(%rsp)
	movl	$1, %eax
	movq	%r14, %rcx
	movq	72(%rsp,%rax,8), %rdx
	cmpq	$1, %rdx
	jne	.LBB0_13
	.p2align	4, 0x90
.LBB0_17:
	cmpq	$1, %rcx
	je	.LBB0_19
	movq	%rcx, 72(%rsp,%rax,8)
	jmp	.LBB0_19
	.p2align	4, 0x90
.LBB0_13:
	cmpq	$1, %rcx
	je	.LBB0_19
	cmpq	%rdx, %rcx
	jne	.LBB0_15
.LBB0_19:
	leaq	-3(%rax), %rcx
	incq	%rcx
	je	.LBB0_20
	movq	120(%rsp,%rax,8), %rcx
	incq	%rax
	movq	72(%rsp,%rax,8), %rdx
	cmpq	$1, %rdx
	jne	.LBB0_13
	jmp	.LBB0_17
.LBB0_15:
	negq	%rax
	testq	%rax, %rax
	jle	.LBB0_16
.LBB0_20:
	movq	80(%rsp), %rdx
	movq	88(%rsp), %rcx
	movq	%rdx, %rax
	imulq	%rcx, %rax
	imulq	$16, %rax, %rdi
	seto	%al
	movq	%rdx, %rsi
	imulq	%rcx, %rsi
	movq	%rsi, 104(%rsp)
	jo	.LBB0_22
	testb	%al, %al
	jne	.LBB0_22
	movabsq	$NRT_MemInfo_alloc_safe_aligned, %rax
	movl	$32, %esi
	movq	%rcx, 48(%rsp)
	movq	%rdx, 32(%rsp)
	callq	*%rax
	movq	48(%rsp), %r11
	movq	%rax, 144(%rsp)
	movq	24(%rax), %rax
	movq	%rax, 56(%rsp)
	cmpq	$0, 32(%rsp)
	movq	272(%rsp), %rbp
	movq	256(%rsp), %rsi
	jle	.LBB0_32
	movq	56(%rsp), %rax
	leaq	8(%rax), %r10
	movq	%r11, %r9
	shlq	$4, %r9
	leaq	(,%rbp,8), %rbx
	xorl	%r8d, %r8d
	vxorpd	%xmm0, %xmm0, %xmm0
	movq	%rsi, %r15
	xorl	%edx, %edx
	xorl	%edi, %edi
	xorl	%ecx, %ecx
	xorl	%eax, %eax
	movq	%rbx, 16(%rsp)
	movq	%r9, 40(%rsp)
	jmp	.LBB0_25
	.p2align	4, 0x90
.LBB0_42:
	movq	16(%rsp), %rbx
	movq	24(%rsp), %rax
.LBB0_31:
	incq	%r8
	addq	%r9, %r10
	addq	%rbx, %r15
	cmpq	32(%rsp), %r8
	je	.LBB0_32
.LBB0_25:
	testq	%r11, %r11
	jle	.LBB0_31
	cmpq	$2, 264(%rsp)
	jb	.LBB0_38
	cmpq	$2, %rbp
	jb	.LBB0_35
	movq	%r10, %rax
	xorl	%ecx, %ecx
	.p2align	4, 0x90
.LBB0_29:
	cmpq	$1, %r14
	cmovaq	%r8, %rdi
	cmpq	$1, %r13
	cmovaq	%rcx, %rdx
	movq	%rdi, %rsi
	imulq	%r13, %rsi
	addq	%rdx, %rsi
	vmovsd	(%r12,%rsi,8), %xmm1
	vmulsd	%xmm0, %xmm1, %xmm2
	vaddsd	%xmm0, %xmm1, %xmm1
	vaddsd	(%r15,%rcx,8), %xmm2, %xmm2
	vmovsd	%xmm2, -8(%rax)
	vmovsd	%xmm1, (%rax)
	incq	%rcx
	addq	$16, %rax
	cmpq	%rcx, %r11
	jne	.LBB0_29
	decq	%rcx
	movq	%r8, %rax
	movq	256(%rsp), %rsi
	jmp	.LBB0_31
	.p2align	4, 0x90
.LBB0_38:
	movq	%rax, %rbx
	imulq	%rbp, %rbx
	cmpq	$2, %rbp
	movq	%rax, 24(%rsp)
	jb	.LBB0_43
	leaq	(%rsi,%rbx,8), %rax
	movq	%r10, %rbx
	xorl	%ecx, %ecx
	.p2align	4, 0x90
.LBB0_40:
	cmpq	$1, %r14
	cmovaq	%r8, %rdi
	cmpq	$1, %r13
	cmovaq	%rcx, %rdx
	movq	%rdi, %rsi
	imulq	%r13, %rsi
	addq	%rdx, %rsi
	vmovsd	(%r12,%rsi,8), %xmm1
	vmulsd	%xmm0, %xmm1, %xmm2
	vaddsd	%xmm0, %xmm1, %xmm1
	vaddsd	(%rax,%rcx,8), %xmm2, %xmm2
	vmovsd	%xmm2, -8(%rbx)
	vmovsd	%xmm1, (%rbx)
	incq	%rcx
	addq	$16, %rbx
	cmpq	%rcx, %r11
	jne	.LBB0_40
	decq	%rcx
	movq	256(%rsp), %rsi
	jmp	.LBB0_42
.LBB0_35:
	movq	%r8, %rax
	imulq	%rbp, %rax
	addq	%rcx, %rax
	movq	%r10, %rbx
	xorl	%ebp, %ebp
	.p2align	4, 0x90
.LBB0_36:
	cmpq	$1, %r14
	cmovaq	%r8, %rdi
	cmpq	$1, %r13
	cmovaq	%rbp, %rdx
	movq	%rdi, %rsi
	imulq	%r13, %rsi
	addq	%rdx, %rsi
	vmovsd	(%r12,%rsi,8), %xmm1
	movq	256(%rsp), %rsi
	vmulsd	%xmm0, %xmm1, %xmm2
	vaddsd	%xmm0, %xmm1, %xmm1
	vaddsd	(%rsi,%rax,8), %xmm2, %xmm2
	vmovsd	%xmm2, -8(%rbx)
	vmovsd	%xmm1, (%rbx)
	incq	%rbp
	addq	$16, %rbx
	cmpq	%rbp, %r11
	jne	.LBB0_36
	movq	%r8, %rax
	movq	272(%rsp), %rbp
	movq	16(%rsp), %rbx
	jmp	.LBB0_31
.LBB0_43:
	addq	%rcx, %rbx
	cmpq	$1, %r14
	jbe	.LBB0_44
	movq	%r8, %rax
	imulq	%r13, %rax
	movq	%rsi, %r9
	movq	%r10, %rsi
	xorl	%ebp, %ebp
	.p2align	4, 0x90
.LBB0_49:
	cmpq	$1, %r13
	cmovaq	%rbp, %rdx
	leaq	(%rax,%rdx), %rdi
	vmovsd	(%r12,%rdi,8), %xmm1
	vmulsd	%xmm0, %xmm1, %xmm2
	vaddsd	%xmm0, %xmm1, %xmm1
	vaddsd	(%r9,%rbx,8), %xmm2, %xmm2
	vmovsd	%xmm2, -8(%rsi)
	vmovsd	%xmm1, (%rsi)
	incq	%rbp
	addq	$16, %rsi
	cmpq	%rbp, %r11
	jne	.LBB0_49
	movq	%r8, %rdi
	movq	272(%rsp), %rbp
	movq	%r9, %rsi
	movq	40(%rsp), %r9
	jmp	.LBB0_42
.LBB0_44:
	movq	%rdi, %rax
	imulq	%r13, %rax
	movq	%r10, %rbp
	movq	%r11, %r9
	xorl	%r11d, %r11d
	.p2align	4, 0x90
.LBB0_45:
	cmpq	$1, %r13
	cmovaq	%r11, %rdx
	leaq	(%rax,%rdx), %rsi
	vmovsd	(%r12,%rsi,8), %xmm1
	movq	256(%rsp), %rsi
	vmulsd	%xmm0, %xmm1, %xmm2
	vaddsd	%xmm0, %xmm1, %xmm1
	vaddsd	(%rsi,%rbx,8), %xmm2, %xmm2
	vmovsd	%xmm2, -8(%rbp)
	vmovsd	%xmm1, (%rbp)
	incq	%r11
	addq	$16, %rbp
	cmpq	%r11, %r9
	jne	.LBB0_45
	movq	%r9, %r11
	movq	272(%rsp), %rbp
	movq	40(%rsp), %r9
	jmp	.LBB0_42
.LBB0_32:
	movabsq	$NRT_decref, %rbx
	movq	296(%rsp), %rdi
	callq	*%rbx
	movq	72(%rsp), %rdi
	callq	*%rbx
	movq	48(%rsp), %rax
	orq	32(%rsp), %rax
	js	.LBB0_33
	imulq	$4, 104(%rsp), %rdi
	jo	.LBB0_52
	movl	$32, %esi
	movabsq	$NRT_MemInfo_alloc_safe_aligned, %rax
	callq	*%rax
	movq	32(%rsp), %r8
	movq	48(%rsp), %rsi
	movq	%rax, 136(%rsp)
	movq	24(%rax), %rax
	movq	%rax, 8(%rsp)
	testq	%r8, %r8
	je	.LBB0_58
	testq	%rsi, %rsi
	je	.LBB0_58
	xorl	%eax, %eax
	xorl	%edx, %edx
	.p2align	4, 0x90
.LBB0_56:
	movq	%rdx, %rbx
	movq	%rax, %rdi
	leaq	1(%rdx), %rcx
	leaq	1(%rax), %rbp
	cmpq	%r8, %rbp
	movl	$0, %eax
	cmovlq	%rbp, %rax
	cmpq	%rsi, %rcx
	movl	$0, %edx
	cmovlq	%rcx, %rdx
	cmovlq	%rdi, %rax
	imulq	%rsi, %rdi
	addq	%rbx, %rdi
	cmpq	%r8, %rbp
	movq	8(%rsp), %rbp
	movl	$20, (%rbp,%rdi,4)
	jl	.LBB0_56
	cmpq	%rsi, %rcx
	jl	.LBB0_56
.LBB0_58:
	cmpq	$0, 112(%rsp)
	movq	64(%rsp), %rdx
	movq	56(%rsp), %rbx
	jle	.LBB0_71
	testq	%rdx, %rdx
	setg	%al
	testq	%rdx, %rdx
	jle	.LBB0_71
	movq	%rdx, %rcx
	sarq	$63, %rcx
	andnq	%rdx, %rcx, %rdx
	xorl	%ecx, %ecx
	movb	%al, %cl
	movq	%rcx, 168(%rsp)
	subq	%rcx, %rdx
	movq	%rdx, 160(%rsp)
	xorl	%ecx, %ecx
	movabsq	$.LCPI0_1, %rax
	vmovsd	(%rax), %xmm0
	vmovsd	%xmm0, 40(%rsp)
	jmp	.LBB0_61
	.p2align	4, 0x90
.LBB0_70:
	movq	176(%rsp), %rcx
	incq	%rcx
	cmpq	112(%rsp), %rcx
	jge	.LBB0_71
.LBB0_61:
	movq	%rcx, 176(%rsp)
	imulq	%rsi, %rcx
	movq	%rcx, 64(%rsp)
	xorl	%r12d, %r12d
	movq	168(%rsp), %rax
	movq	160(%rsp), %r13
	jmp	.LBB0_62
	.p2align	4, 0x90
.LBB0_63:
	xorl	%r15d, %r15d
.LBB0_68:
	movq	8(%rsp), %rax
	movl	%r15d, (%rax,%r12,4)
.LBB0_69:
	xorl	%eax, %eax
	testq	%r13, %r13
	setg	%al
	movq	%r13, %rcx
	subq	%rax, %rcx
	movq	72(%rsp), %r12
	addq	%r12, %rax
	testq	%r13, %r13
	movq	%rcx, %r13
	movq	48(%rsp), %rsi
	movq	56(%rsp), %rbx
	jle	.LBB0_70
.LBB0_62:
	movq	%rax, 72(%rsp)
	movq	%r12, %rax
	sarq	$63, %rax
	andq	%rsi, %rax
	addq	64(%rsp), %r12
	addq	%rax, %r12
	movq	%r12, %rbp
	shlq	$4, %rbp
	vmovsd	(%rbx,%rbp), %xmm0
	vmovsd	8(%rbx,%rbp), %xmm1
	vmulsd	%xmm0, %xmm0, %xmm2
	vmulsd	%xmm1, %xmm1, %xmm3
	vsubsd	%xmm3, %xmm2, %xmm2
	vmulsd	%xmm0, %xmm1, %xmm3
	vaddsd	%xmm3, %xmm3, %xmm3
	vaddsd	%xmm0, %xmm2, %xmm0
	vaddsd	%xmm1, %xmm3, %xmm1
	vmovsd	%xmm0, 24(%rsp)
	vmovsd	%xmm1, 16(%rsp)
	movabsq	$hypot, %rax
	callq	*%rax
	vucomisd	40(%rsp), %xmm0
	ja	.LBB0_63
	leaq	(%rbx,%rbp), %r14
	leaq	8(%rbx,%rbp), %rbp
	xorl	%r15d, %r15d
	movabsq	$hypot, %rbx
	.p2align	4, 0x90
.LBB0_65:
	leaq	20(%r15), %rax
	cmpq	$2, %rax
	jl	.LBB0_69
	vmovsd	24(%rsp), %xmm2
	vmulsd	%xmm2, %xmm2, %xmm0
	vmovsd	16(%rsp), %xmm3
	vmulsd	%xmm3, %xmm3, %xmm1
	vsubsd	%xmm1, %xmm0, %xmm0
	vmulsd	%xmm2, %xmm3, %xmm1
	vaddsd	%xmm1, %xmm1, %xmm1
	vaddsd	(%r14), %xmm0, %xmm0
	vaddsd	(%rbp), %xmm1, %xmm1
	vmovsd	%xmm0, 24(%rsp)
	vmovsd	%xmm1, 16(%rsp)
	callq	*%rbx
	decq	%r15
	vucomisd	40(%rsp), %xmm0
	jbe	.LBB0_65
	negq	%r15
	jmp	.LBB0_68
.LBB0_71:
	movq	152(%rsp), %rax
	movq	136(%rsp), %rcx
	movq	%rcx, (%rax)
	movq	$0, 8(%rax)
	movq	104(%rsp), %rcx
	movq	%rcx, 16(%rax)
	movq	$4, 24(%rax)
	movq	8(%rsp), %rcx
	movq	%rcx, 32(%rax)
	movq	32(%rsp), %rcx
	movq	%rcx, 40(%rax)
	movq	%rsi, 48(%rax)
	shlq	$2, %rsi
	movq	%rsi, 56(%rax)
	movq	$4, 64(%rax)
	movq	144(%rsp), %rdi
	movabsq	$NRT_decref, %rax
	callq	*%rax
	xorl	%eax, %eax
.LBB0_72:
	addq	$184, %rsp
	.cfi_def_cfa_offset 56
	popq	%rbx
	.cfi_def_cfa_offset 48
	popq	%r12
	.cfi_def_cfa_offset 40
	popq	%r13
	.cfi_def_cfa_offset 32
	popq	%r14
	.cfi_def_cfa_offset 24
	popq	%r15
	.cfi_def_cfa_offset 16
	popq	%rbp
	.cfi_def_cfa_offset 8
	retq
.LBB0_33:
	.cfi_def_cfa_offset 240
	movabsq	$.const.picklebuf.140498084857536, %rax
	jmp	.LBB0_34
.LBB0_22:
	movabsq	$.const.picklebuf.140498083546176, %rax
	jmp	.LBB0_34
.LBB0_52:
	movabsq	$.const.picklebuf.140498085300992, %rax
	jmp	.LBB0_34
.LBB0_6:
	movabsq	$.const.picklebuf.140498085454976, %rax
	jmp	.LBB0_34
.LBB0_16:
	movabsq	$.const.picklebuf.140498082859904, %rax
.LBB0_34:
	movq	8(%rsp), %rcx
	movq	%rax, (%rcx)
	movl	$1, %eax
	jmp	.LBB0_72
.Lfunc_end0:
	.size	_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE, .Lfunc_end0-_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE
	.cfi_endproc

	.globl	_ZN7cpython8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE
	.p2align	4, 0x90
	.type	_ZN7cpython8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE,@function
_ZN7cpython8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	andq	$-32, %rsp
	subq	$672, %rsp
	.cfi_offset %rbx, -56
	.cfi_offset %r12, -48
	.cfi_offset %r13, -40
	.cfi_offset %r14, -32
	.cfi_offset %r15, -24
	movq	%rsi, %rdi
	leaq	192(%rsp), %rax
	movq	%rax, 8(%rsp)
	leaq	200(%rsp), %rax
	movq	%rax, (%rsp)
	movabsq	$.const.numba_inner_loop, %rsi
	movabsq	$PyArg_UnpackTuple, %rbx
	leaq	216(%rsp), %r8
	leaq	208(%rsp), %r9
	movl	$4, %edx
	movl	$4, %ecx
	xorl	%eax, %eax
	callq	*%rbx
	testl	%eax, %eax
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%ymm0, 224(%rsp)
	vmovaps	%ymm0, 256(%rsp)
	movq	$0, 288(%rsp)
	vmovaps	%ymm0, 320(%rsp)
	vmovaps	%ymm0, 352(%rsp)
	movq	$0, 384(%rsp)
	vmovaps	%ymm0, 416(%rsp)
	vmovaps	%ymm0, 448(%rsp)
	movq	$0, 480(%rsp)
	movq	$0, 168(%rsp)
	vmovaps	%ymm0, 576(%rsp)
	vmovaps	%ymm0, 608(%rsp)
	movq	$0, 640(%rsp)
	je	.LBB1_1
	movabsq	$_ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE, %rax
	movq	(%rax), %rax
	testq	%rax, %rax
	je	.LBB1_4
	movq	%rax, 176(%rsp)
	movq	216(%rsp), %rdi
	movabsq	$PyNumber_Long, %r12
	vzeroupper
	callq	*%r12
	testq	%rax, %rax
	movabsq	$PyLong_AsLongLong, %r14
	movabsq	$Py_DecRef, %r15
	je	.LBB1_7
	movq	%rax, %rbx
	movq	%rax, %rdi
	callq	*%r14
	movq	%rax, 152(%rsp)
	movq	%rbx, %rdi
	callq	*%r15
	movabsq	$PyErr_Occurred, %r13
	callq	*%r13
	testq	%rax, %rax
	jne	.LBB1_1
.LBB1_10:
	movq	208(%rsp), %rdi
	callq	*%r12
	testq	%rax, %rax
	je	.LBB1_11
	movq	%rax, %rbx
	movq	%rax, %rdi
	callq	*%r14
	movq	%rax, %r14
	movq	%rbx, %rdi
	callq	*%r15
	callq	*%r13
	testq	%rax, %rax
	jne	.LBB1_1
.LBB1_14:
	movq	200(%rsp), %rdi
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%ymm0, 224(%rsp)
	vmovaps	%ymm0, 256(%rsp)
	movq	$0, 288(%rsp)
	movabsq	$NRT_adapt_ndarray_from_python, %r15
	leaq	224(%rsp), %rsi
	vzeroupper
	callq	*%r15
	cmpq	$8, 248(%rsp)
	setne	%cl
	testl	%eax, %eax
	setne	%bl
	orb	%cl, %bl
	cmpb	$1, %bl
	je	.LBB1_15
	testb	%bl, %bl
	jne	.LBB1_1
.LBB1_17:
	movq	%r14, 544(%rsp)
	movq	224(%rsp), %rax
	movq	%rax, 160(%rsp)
	movq	232(%rsp), %r14
	movq	240(%rsp), %rbx
	movq	256(%rsp), %rax
	movq	%rax, 512(%rsp)
	movq	264(%rsp), %rax
	movq	%rax, 184(%rsp)
	movq	%r15, %rax
	movq	272(%rsp), %r15
	movq	280(%rsp), %r13
	movq	288(%rsp), %r12
	movq	192(%rsp), %rdi
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%ymm0, 320(%rsp)
	vmovaps	%ymm0, 352(%rsp)
	movq	$0, 384(%rsp)
	leaq	320(%rsp), %rsi
	vzeroupper
	callq	*%rax
	testl	%eax, %eax
	jne	.LBB1_19
	cmpq	$8, 344(%rsp)
	jne	.LBB1_19
	movq	%r14, %r9
	movq	320(%rsp), %r14
	vmovups	328(%rsp), %xmm0
	vmovaps	352(%rsp), %ymm1
	movq	384(%rsp), %rax
	vxorps	%xmm2, %xmm2, %xmm2
	vmovaps	%ymm2, 416(%rsp)
	vmovaps	%ymm2, 448(%rsp)
	movq	$0, 480(%rsp)
	movq	%rax, 120(%rsp)
	vmovups	%ymm1, 88(%rsp)
	vmovups	%xmm0, 64(%rsp)
	movq	%r14, 56(%rsp)
	movq	%r12, 48(%rsp)
	movq	%r13, 40(%rsp)
	movq	%r15, 32(%rsp)
	movq	184(%rsp), %rax
	movq	%rax, 24(%rsp)
	movq	512(%rsp), %rax
	movq	%rax, 16(%rsp)
	movq	%rbx, (%rsp)
	movq	$8, 80(%rsp)
	movq	$8, 8(%rsp)
	movabsq	$_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE, %rax
	leaq	416(%rsp), %rdi
	leaq	168(%rsp), %rsi
	movq	152(%rsp), %rdx
	movq	544(%rsp), %rcx
	movq	160(%rsp), %r12
	movq	%r12, %r8
	vzeroupper
	callq	*%rax
	movl	%eax, %ebx
	movq	168(%rsp), %r13
	movq	416(%rsp), %rax
	movq	%rax, 152(%rsp)
	vmovups	424(%rsp), %ymm0
	vmovaps	%ymm0, 544(%rsp)
	vmovups	456(%rsp), %ymm0
	vmovaps	%ymm0, 512(%rsp)
	movabsq	$NRT_decref, %r15
	movq	%r12, %rdi
	vzeroupper
	callq	*%r15
	movq	%r14, %rdi
	callq	*%r15
	cmpl	$-2, %ebx
	je	.LBB1_25
	testl	%ebx, %ebx
	jne	.LBB1_22
.LBB1_25:
	movq	176(%rsp), %rax
	movq	24(%rax), %rdi
	testq	%rdi, %rdi
	je	.LBB1_27
	movabsq	$PyList_GetItem, %rax
	xorl	%esi, %esi
	callq	*%rax
	movq	%rax, %r14
	jmp	.LBB1_28
.LBB1_22:
	jle	.LBB1_29
	movabsq	$PyErr_Clear, %rax
	callq	*%rax
	movq	16(%r13), %rdx
	movl	8(%r13), %esi
	movq	(%r13), %rdi
	movabsq	$numba_unpickle, %rax
	callq	*%rax
	testq	%rax, %rax
	je	.LBB1_1
	movabsq	$numba_do_raise, %rcx
	movq	%rax, %rdi
	callq	*%rcx
	jmp	.LBB1_1
.LBB1_27:
	movabsq	$PyExc_RuntimeError, %rdi
	movabsq	$".const.`env.consts` is NULL in `read_const`", %rsi
	movabsq	$PyErr_SetString, %rax
	callq	*%rax
	xorl	%r14d, %r14d
.LBB1_28:
	movabsq	$.const.pickledata.140500895779840, %rdi
	movabsq	$.const.pickledata.140500895779840.sha1, %rdx
	movabsq	$numba_unpickle, %rax
	movl	$32, %esi
	callq	*%rax
	movq	152(%rsp), %rbx
	movq	%rbx, 576(%rsp)
	vmovaps	544(%rsp), %ymm0
	vmovups	%ymm0, 584(%rsp)
	vmovaps	512(%rsp), %ymm0
	vmovups	%ymm0, 616(%rsp)
	movabsq	$NRT_adapt_ndarray_to_python_acqref, %r9
	leaq	576(%rsp), %rdi
	movq	%rax, %rsi
	movl	$2, %edx
	movl	$1, %ecx
	movq	%r14, %r8
	vzeroupper
	callq	*%r9
	movq	%rax, %r14
	movq	%rbx, %rdi
	callq	*%r15
	movq	%r14, %rax
	jmp	.LBB1_2
.LBB1_29:
	movabsq	$PyExc_SystemError, %rdi
	movabsq	$".const.unknown error when calling native function", %rsi
.LBB1_5:
	movabsq	$PyErr_SetString, %rax
	vzeroupper
	callq	*%rax
.LBB1_1:
	xorl	%eax, %eax
.LBB1_2:
	leaq	-40(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	.cfi_def_cfa %rsp, 8
	vzeroupper
	retq
.LBB1_4:
	.cfi_def_cfa %rbp, 16
	movabsq	$PyExc_RuntimeError, %rdi
	movabsq	$".const.missing Environment: _ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE", %rsi
	jmp	.LBB1_5
.LBB1_7:
	xorl	%eax, %eax
	movq	%rax, 152(%rsp)
	movabsq	$PyErr_Occurred, %r13
	callq	*%r13
	testq	%rax, %rax
	je	.LBB1_10
	jmp	.LBB1_1
.LBB1_11:
	xorl	%r14d, %r14d
	callq	*%r13
	testq	%rax, %rax
	je	.LBB1_14
	jmp	.LBB1_1
.LBB1_15:
	movabsq	$PyExc_TypeError, %rdi
	movabsq	$".const.can't unbox array from PyObject into native value.  The object maybe of a different type", %rsi
	movabsq	$PyErr_SetString, %rax
	callq	*%rax
	testb	%bl, %bl
	je	.LBB1_17
	jmp	.LBB1_1
.LBB1_19:
	movabsq	$PyExc_TypeError, %rdi
	movabsq	$".const.can't unbox array from PyObject into native value.  The object maybe of a different type", %rsi
	movabsq	$PyErr_SetString, %rax
	callq	*%rax
	movabsq	$NRT_decref, %rax
	movq	160(%rsp), %rdi
	callq	*%rax
	jmp	.LBB1_1
.Lfunc_end1:
	.size	_ZN7cpython8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE, .Lfunc_end1-_ZN7cpython8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE
	.cfi_endproc

	.globl	cfunc._ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE
	.p2align	4, 0x90
	.type	cfunc._ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE,@function
cfunc._ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	andq	$-32, %rsp
	subq	$320, %rsp
	.cfi_offset %rbx, -56
	.cfi_offset %r12, -48
	.cfi_offset %r13, -40
	.cfi_offset %r14, -32
	.cfi_offset %r15, -24
	movq	%r8, %rax
	movq	%rcx, %r8
	movq	%rdx, %rcx
	movq	%rsi, %rdx
	movq	%rdi, %rbx
	vmovups	16(%rbp), %ymm0
	vmovaps	48(%rbp), %xmm1
	vmovups	64(%rbp), %ymm2
	vmovups	96(%rbp), %ymm3
	movq	128(%rbp), %rsi
	vxorps	%xmm4, %xmm4, %xmm4
	vmovaps	%ymm4, 256(%rsp)
	vmovaps	%ymm4, 224(%rsp)
	movq	$0, 288(%rsp)
	movq	$0, 160(%rsp)
	movq	%rsi, 120(%rsp)
	vmovups	%ymm3, 88(%rsp)
	vmovups	%ymm2, 56(%rsp)
	vmovups	%xmm1, 40(%rsp)
	vmovups	%ymm0, 8(%rsp)
	movq	%r9, (%rsp)
	movabsq	$_ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE, %r10
	leaq	224(%rsp), %rdi
	leaq	160(%rsp), %rsi
	movq	%rax, %r9
	vzeroupper
	callq	*%r10
	movl	%eax, %r14d
	movq	160(%rsp), %r9
	testl	%eax, %eax
	movq	224(%rsp), %r8
	movq	232(%rsp), %rcx
	movq	240(%rsp), %rdx
	movq	248(%rsp), %rsi
	movq	256(%rsp), %rdi
	movq	264(%rsp), %rax
	movq	272(%rsp), %r15
	movq	280(%rsp), %r12
	movq	288(%rsp), %r13
	movl	$0, 156(%rsp)
	jne	.LBB2_1
.LBB2_4:
	movq	%r13, 64(%rbx)
	movq	%r12, 56(%rbx)
	movq	%r15, 48(%rbx)
	movq	%rax, 40(%rbx)
	movq	%rdi, 32(%rbx)
	movq	%rsi, 24(%rbx)
	movq	%rdx, 16(%rbx)
	movq	%rcx, 8(%rbx)
	movq	%r8, (%rbx)
	movq	%rbx, %rax
	leaq	-40(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	.cfi_def_cfa %rsp, 8
	retq
.LBB2_1:
	.cfi_def_cfa %rbp, 16
	movq	%r9, 168(%rsp)
	movq	%rax, 176(%rsp)
	movq	%rdi, 184(%rsp)
	movq	%rsi, 192(%rsp)
	movq	%rdx, 200(%rsp)
	movq	%rcx, 208(%rsp)
	movq	%r8, 216(%rsp)
	movabsq	$numba_gil_ensure, %rax
	leaq	156(%rsp), %rdi
	callq	*%rax
	testl	%r14d, %r14d
	jle	.LBB2_6
	movabsq	$PyErr_Clear, %rax
	callq	*%rax
	movq	168(%rsp), %rax
	movq	16(%rax), %rdx
	movl	8(%rax), %esi
	movq	(%rax), %rdi
	movabsq	$numba_unpickle, %rax
	callq	*%rax
	testq	%rax, %rax
	je	.LBB2_3
	movabsq	$numba_do_raise, %rcx
	movq	%rax, %rdi
	callq	*%rcx
	jmp	.LBB2_3
.LBB2_6:
	movabsq	$PyExc_SystemError, %rdi
	movabsq	$".const.unknown error when calling native function.1", %rsi
	movabsq	$PyErr_SetString, %rax
	callq	*%rax
.LBB2_3:
	movabsq	$".const.<numba.core.cpu.CPUContext object at 0x7fc84232d670>", %rdi
	movabsq	$PyUnicode_FromString, %rax
	callq	*%rax
	movq	%rax, %r14
	movabsq	$PyErr_WriteUnraisable, %rax
	movq	%r14, %rdi
	callq	*%rax
	movabsq	$Py_DecRef, %rax
	movq	%r14, %rdi
	callq	*%rax
	movabsq	$numba_gil_release, %rax
	leaq	156(%rsp), %rdi
	callq	*%rax
	movq	216(%rsp), %r8
	movq	208(%rsp), %rcx
	movq	200(%rsp), %rdx
	movq	192(%rsp), %rsi
	movq	184(%rsp), %rdi
	movq	176(%rsp), %rax
	jmp	.LBB2_4
.Lfunc_end2:
	.size	cfunc._ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE, .Lfunc_end2-cfunc._ZN8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE
	.cfi_endproc

	.weak	NRT_incref
	.p2align	4, 0x90
	.type	NRT_incref,@function
NRT_incref:
	testq	%rdi, %rdi
	je	.LBB3_1
	lock		incq	(%rdi)
	retq
.LBB3_1:
	retq
.Lfunc_end3:
	.size	NRT_incref, .Lfunc_end3-NRT_incref

	.weak	NRT_decref
	.p2align	4, 0x90
	.type	NRT_decref,@function
NRT_decref:
	.cfi_startproc
	testq	%rdi, %rdi
	je	.LBB4_2
	#MEMBARRIER
	lock		decq	(%rdi)
	je	.LBB4_3
.LBB4_2:
	retq
.LBB4_3:
	#MEMBARRIER
	movabsq	$NRT_MemInfo_call_dtor, %rax
	jmpq	*%rax
.Lfunc_end4:
	.size	NRT_decref, .Lfunc_end4-NRT_decref
	.cfi_endproc

	.type	_ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE,@object
	.comm	_ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE,8,8
	.type	.const.picklebuf.140498085454976,@object
	.section	.rodata,"a",@progbits
	.p2align	4
.const.picklebuf.140498085454976:
	.quad	.const.pickledata.140498085454976
	.long	186
	.zero	4
	.quad	.const.pickledata.140498085454976.sha1
	.size	.const.picklebuf.140498085454976, 24

	.type	.const.picklebuf.140498082859904,@object
	.p2align	4
.const.picklebuf.140498082859904:
	.quad	.const.pickledata.140498082859904
	.long	186
	.zero	4
	.quad	.const.pickledata.140498082859904.sha1
	.size	.const.picklebuf.140498082859904, 24

	.type	.const.picklebuf.140498083546176,@object
	.p2align	4
.const.picklebuf.140498083546176:
	.quad	.const.pickledata.140498083546176
	.long	137
	.zero	4
	.quad	.const.pickledata.140498083546176.sha1
	.size	.const.picklebuf.140498083546176, 24

	.type	.const.pickledata.140498083546176,@object
	.p2align	4
.const.pickledata.140498083546176:
	.ascii	"\200\004\225~\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214[array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.\224\205\224N\207\224."
	.size	.const.pickledata.140498083546176, 137

	.type	.const.pickledata.140498083546176.sha1,@object
	.p2align	4
.const.pickledata.140498083546176.sha1:
	.ascii	"X\341N\314\265\007\261\340 i\201t\002#\346\205\313\214<W"
	.size	.const.pickledata.140498083546176.sha1, 20

	.type	.const.pickledata.140498082859904,@object
	.p2align	4
.const.pickledata.140498082859904:
	.ascii	"\200\004\225\257\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214\214unable to broadcast argument 1 to output array\nFile \"/home/jpivarski/mambaforge/lib/python3.9/site-packages/numba/np/npyimpl.py\", line 228, \224\205\224N\207\224."
	.size	.const.pickledata.140498082859904, 186

	.type	.const.pickledata.140498082859904.sha1,@object
	.p2align	4
.const.pickledata.140498082859904.sha1:
	.ascii	"\327\031h\342\321K\337T\236\331eF&\203\3162\332\204\036O"
	.size	.const.pickledata.140498082859904.sha1, 20

	.type	.const.pickledata.140498085454976,@object
	.p2align	4
.const.pickledata.140498085454976:
	.ascii	"\200\004\225\257\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214\214unable to broadcast argument 0 to output array\nFile \"/home/jpivarski/mambaforge/lib/python3.9/site-packages/numba/np/npyimpl.py\", line 228, \224\205\224N\207\224."
	.size	.const.pickledata.140498085454976, 186

	.type	.const.pickledata.140498085454976.sha1,@object
	.p2align	4
.const.pickledata.140498085454976.sha1:
	.ascii	"J1\311\262\367k\341w\001\016\253\361\270\262\352\304o\271\356y"
	.size	.const.pickledata.140498085454976.sha1, 20

	.type	.const.numba_inner_loop,@object
	.p2align	4
.const.numba_inner_loop:
	.asciz	"numba_inner_loop"
	.size	.const.numba_inner_loop, 17

	.type	".const.missing Environment: _ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE",@object
	.p2align	4
".const.missing Environment: _ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE":
	.asciz	"missing Environment: _ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE"
	.size	".const.missing Environment: _ZN08NumbaEnv8__main__20numba_inner_loop_244B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dExx5ArrayIdLi2E1C7mutable7alignedE5ArrayIdLi2E1C7mutable7alignedE", 176

	.type	".const.can't unbox array from PyObject into native value.  The object maybe of a different type",@object
	.p2align	4
".const.can't unbox array from PyObject into native value.  The object maybe of a different type":
	.asciz	"can't unbox array from PyObject into native value.  The object maybe of a different type"
	.size	".const.can't unbox array from PyObject into native value.  The object maybe of a different type", 89

	.type	".const.`env.consts` is NULL in `read_const`",@object
	.p2align	4
".const.`env.consts` is NULL in `read_const`":
	.asciz	"`env.consts` is NULL in `read_const`"
	.size	".const.`env.consts` is NULL in `read_const`", 37

	.type	.const.pickledata.140500895779840,@object
	.p2align	4
.const.pickledata.140500895779840:
	.ascii	"\200\004\225\025\000\000\000\000\000\000\000\214\005numpy\224\214\007ndarray\224\223\224."
	.size	.const.pickledata.140500895779840, 32

	.type	.const.pickledata.140500895779840.sha1,@object
	.p2align	4
.const.pickledata.140500895779840.sha1:
	.ascii	"\337\274\375\323\237\313&\364\320\306\200\225D\207\270\300\265;\270\243"
	.size	.const.pickledata.140500895779840.sha1, 20

	.type	".const.unknown error when calling native function",@object
	.p2align	4
".const.unknown error when calling native function":
	.asciz	"unknown error when calling native function"
	.size	".const.unknown error when calling native function", 43

	.type	".const.<numba.core.cpu.CPUContext object at 0x7fc84232d670>",@object
	.p2align	4
".const.<numba.core.cpu.CPUContext object at 0x7fc84232d670>":
	.asciz	"<numba.core.cpu.CPUContext object at 0x7fc84232d670>"
	.size	".const.<numba.core.cpu.CPUContext object at 0x7fc84232d670>", 53

	.type	".const.unknown error when calling native function.1",@object
	.p2align	4
".const.unknown error when calling native function.1":
	.asciz	"unknown error when calling native function"
	.size	".const.unknown error when calling native function.1", 43

	.type	_ZN08NumbaEnv13_3cdynamic_3e38__numba_array_expr_0x7fc842761310_2410B70c8tJTIeFCjyCbUFRqqOAK_2f5h0ggn2oJ9DwwEtPiAqkREPZIAanTCJIogpmsCAA_3d_3dEdd,@object
	.comm	_ZN08NumbaEnv13_3cdynamic_3e38__numba_array_expr_0x7fc842761310_2410B70c8tJTIeFCjyCbUFRqqOAK_2f5h0ggn2oJ9DwwEtPiAqkREPZIAanTCJIogpmsCAA_3d_3dEdd,8,8
	.type	_ZN08NumbaEnv5numba2np7npyimpl20_broadcast_onto_2411B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEx8int64_2ax8int64_2a,@object
	.comm	_ZN08NumbaEnv5numba2np7npyimpl20_broadcast_onto_2411B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEx8int64_2ax8int64_2a,8,8
	.type	_ZN08NumbaEnv5numba2np8arrayobj19_call_allocator_248B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj,@object
	.comm	_ZN08NumbaEnv5numba2np8arrayobj19_call_allocator_248B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj,8,8
	.type	_ZN08NumbaEnv5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_249B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj,@object
	.comm	_ZN08NumbaEnv5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_249B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj,8,8
	.type	_ZN08NumbaEnv13_3cdynamic_3e49jit_wrapper__function_full_at_0x7fc90005b5e0__245B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EEx16class_28int32_29,@object
	.comm	_ZN08NumbaEnv13_3cdynamic_3e49jit_wrapper__function_full_at_0x7fc90005b5e0__245B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EEx16class_28int32_29,8,8
	.type	_ZN08NumbaEnv5numba2np8arrayobj19numpy_full_dtype_nd12_3clocals_3e8full_246B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE8UniTupleIxLi2EEx16class_28int32_29,@object
	.comm	_ZN08NumbaEnv5numba2np8arrayobj19numpy_full_dtype_nd12_3clocals_3e8full_246B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE8UniTupleIxLi2EEx16class_28int32_29,8,8
	.type	_ZN08NumbaEnv13_3cdynamic_3e41jit_wrapper__built_in_function_empty__247B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EE16class_28int32_29,@object
	.comm	_ZN08NumbaEnv13_3cdynamic_3e41jit_wrapper__built_in_function_empty__247B46c8tJTIeFCjyCbUFRqqOAK_2f6h0khxBBXBjCWYRBFEkyYAE8UniTupleIxLi2EE16class_28int32_29,8,8
	.type	.const.picklebuf.140498084857536,@object
	.p2align	4
.const.picklebuf.140498084857536:
	.quad	.const.pickledata.140498084857536
	.long	77
	.zero	4
	.quad	.const.pickledata.140498084857536.sha1
	.size	.const.picklebuf.140498084857536, 24

	.type	.const.picklebuf.140498085300992,@object
	.p2align	4
.const.picklebuf.140498085300992:
	.quad	.const.pickledata.140498085300992
	.long	137
	.zero	4
	.quad	.const.pickledata.140498085300992.sha1
	.size	.const.picklebuf.140498085300992, 24

	.type	.const.pickledata.140498085300992,@object
	.p2align	4
.const.pickledata.140498085300992:
	.ascii	"\200\004\225~\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214[array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.\224\205\224N\207\224."
	.size	.const.pickledata.140498085300992, 137

	.type	.const.pickledata.140498085300992.sha1,@object
	.p2align	4
.const.pickledata.140498085300992.sha1:
	.ascii	"X\341N\314\265\007\261\340 i\201t\002#\346\205\313\214<W"
	.size	.const.pickledata.140498085300992.sha1, 20

	.type	.const.pickledata.140498084857536,@object
	.p2align	4
.const.pickledata.140498084857536:
	.ascii	"\200\004\225B\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214\037negative dimensions not allowed\224\205\224N\207\224."
	.size	.const.pickledata.140498084857536, 77

	.type	.const.pickledata.140498084857536.sha1,@object
	.p2align	4
.const.pickledata.140498084857536.sha1:
	.ascii	"3\033\205c\275\271\332\310\0338B\"s\005,Ho\301pk"
	.size	.const.pickledata.140498084857536.sha1, 20

	.type	_ZN08NumbaEnv5numba7cpython7numbers16complex_abs_impl12_3clocals_3e16complex_abs_2412B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE10complex128,@object
	.comm	_ZN08NumbaEnv5numba7cpython7numbers16complex_abs_impl12_3clocals_3e16complex_abs_2412B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dE10complex128,8,8
	.type	_ZN08NumbaEnv5numba7cpython8mathimpl16hypot_float_impl12_3clocals_3e14hypot_impl_243B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEdd,@object
	.comm	_ZN08NumbaEnv5numba7cpython8mathimpl16hypot_float_impl12_3clocals_3e14hypot_impl_243B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEdd,8,8
	.section	".note.GNU-stack","",@progbits

LLVM for `one_pixel_numba_cuda`, which is just the low-level function

; ModuleID = "cuda.kernel.wrapper"
target triple = "nvptx64-nvidia-cuda"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

declare i32 @"_ZN8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE"(i8** %".1", i64 %".2", i64 %".3", i8* %".4", i8* %".5", i64 %".6", i64 %".7", i32* %".8", i64 %".9", i64 %".10", i64 %".11", i64 %".12") 

define void @"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE"(i64 %".1", i64 %".2", i8* %".3", i8* %".4", i64 %".5", i64 %".6", i32* %".7", i64 %".8", i64 %".9", i64 %".10", i64 %".11") 
{
.13:
  %"inserted.meminfo" = insertvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} undef, i8* %".3", 0
  %"inserted.parent" = insertvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.meminfo", i8* %".4", 1
  %"inserted.nitems" = insertvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.parent", i64 %".5", 2
  %"inserted.itemsize" = insertvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.nitems", i64 %".6", 3
  %"inserted.data" = insertvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.itemsize", i32* %".7", 4
  %".14" = insertvalue [2 x i64] undef, i64 %".8", 0
  %".15" = insertvalue [2 x i64] %".14", i64 %".9", 1
  %"inserted.shape" = insertvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.data", [2 x i64] %".15", 5
  %".16" = insertvalue [2 x i64] undef, i64 %".10", 0
  %".17" = insertvalue [2 x i64] %".16", i64 %".11", 1
  %"inserted.strides" = insertvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.shape", [2 x i64] %".17", 6
  %".18" = alloca i8*
  store i8* null, i8** %".18"
  store i8* null, i8** %".18"
  %"extracted.meminfo" = extractvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.strides", 0
  %"extracted.parent" = extractvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.strides", 1
  %"extracted.nitems" = extractvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.strides", 2
  %"extracted.itemsize" = extractvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.strides", 3
  %"extracted.data" = extractvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.strides", 4
  %"extracted.shape" = extractvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.strides", 5
  %".21" = extractvalue [2 x i64] %"extracted.shape", 0
  %".22" = extractvalue [2 x i64] %"extracted.shape", 1
  %"extracted.strides" = extractvalue {i8*, i8*, i64, i64, i32*, [2 x i64], [2 x i64]} %"inserted.strides", 6
  %".23" = extractvalue [2 x i64] %"extracted.strides", 0
  %".24" = extractvalue [2 x i64] %"extracted.strides", 1
  %".25" = call i32 @"_ZN8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE"(i8** %".18", i64 %".1", i64 %".2", i8* %"extracted.meminfo", i8* %"extracted.parent", i64 %"extracted.nitems", i64 %"extracted.itemsize", i32* %"extracted.data", i64 %".21", i64 %".22", i64 %".23", i64 %".24")
  %".26" = icmp eq i32 %".25", 0
  %".27" = icmp eq i32 %".25", -2
  %".28" = or i1 %".26", %".27"
  %".29" = xor i1 %".28", -1
  %".30" = icmp eq i32 %".25", -1
  %".31" = icmp eq i32 %".25", -3
  %".32" = icmp sge i32 %".25", 1
  %".33" = load i8*, i8** %".18"
  ret void
}

@"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__errcode__" = global i32 0
@"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__tidx__" = global i32 0
@"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__ctaidx__" = global i32 0
@"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__tidy__" = global i32 0
@"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__ctaidy__" = global i32 0
@"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__tidz__" = global i32 0
@"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__ctaidz__" = global i32 0
!nvvmir.version = !{ !0 }
!nvvm.annotations = !{ !1 }
!0 = !{ i32 1, i32 6, i32 3, i32 0 }
!1 = !{ void (i64, i64, i8*, i8*, i64, i64, i32*, i64, i64, i64, i64)* @"_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE", !"kernel", i32 1 }

PTX for `one_pixel_numba_cuda`, which is just the low-level function

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-30794723
// Cuda compilation tools, release 11.6, V11.6.55
// Based on NVVM 7.0.1
//

.version 7.6
.target sm_80
.address_size 64

	// .globl	_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE
.visible .global .align 4 .u32 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__errcode__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__tidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__ctaidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__tidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__ctaidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__tidz__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE__ctaidz__;
.common .global .align 8 .u64 _ZN08NumbaEnv8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE;
.common .global .align 8 .u64 _ZN08NumbaEnv5numba7cpython7numbers16complex_abs_impl12_3clocals_3e16complex_abs_2415B40c8tJTC_2fWQA9HW1CcAv0EjzIkAdRoAEtoAgA_3dE10complex128;
.common .global .align 8 .u64 _ZN08NumbaEnv5numba7cpython7numbers16complex_div_impl12_3clocals_3e16complex_div_2414B40c8tJTC_2fWQA9HW1CcAv0EjzIkAdRoAEtoAgA_3dE10complex12810complex128;

.visible .entry _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE(
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_0,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_1,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_2,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_3,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_4,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_5,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_6,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_7,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_8,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_9,
	.param .u64 _ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_10
)
{
	.reg .pred 	%p<18>;
	.reg .b32 	%r<18>;
	.reg .f64 	%fd<74>;
	.reg .b64 	%rd<33>;


	ld.param.u64 	%rd11, [_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_0];
	ld.param.u64 	%rd7, [_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_1];
	ld.param.u64 	%rd8, [_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_6];
	ld.param.u64 	%rd9, [_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_7];
	ld.param.u64 	%rd10, [_ZN6cudapy8__main__25one_pixel_numba_cuda_2413B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dExx5ArrayIiLi2E1C7mutable7alignedE_param_8];
	mov.u32 	%r3, %ctaid.x;
	mov.u32 	%r4, %ntid.x;
	mov.u32 	%r5, %tid.x;
	mad.lo.s32 	%r1, %r4, %r3, %r5;
	add.s64 	%rd1, %rd11, 1;
	setp.eq.s64 	%p3, %rd1, 0;
	mov.f64 	%fd68, 0dBFF0000000000000;
	mov.f64 	%fd69, 0d0000000000000000;
	@%p3 bra 	$L__BB0_10;

	cvt.rn.f64.s32 	%fd20, %r1;
	mul.f64 	%fd21, %fd20, 0d0000000000000000;
	mul.f64 	%fd22, %fd21, 0d3FF8000000000000;
	mul.f64 	%fd23, %fd21, 0d0000000000000000;
	sub.f64 	%fd1, %fd22, %fd21;
	fma.rn.f64 	%fd2, %fd20, 0d3FF8000000000000, %fd23;
	add.s64 	%rd12, %rd7, 1;
	cvt.rn.f64.s64 	%fd3, %rd12;
	setp.eq.s64 	%p5, %rd12, 0;
	mov.pred 	%p4, -1;
	mov.f64 	%fd70, %fd68;
	mov.f64 	%fd71, %fd69;
	mov.pred 	%p17, %p4;
	@%p5 bra 	$L__BB0_5;

	abs.f64 	%fd26, %fd3;
	mov.f64 	%fd71, 0d7FF8000000000000;
	add.f64 	%fd70, %fd71, 0dBFF0000000000000;
	setp.ltu.f64 	%p7, %fd26, 0d0000000000000000;
	mov.pred 	%p6, 0;
	mov.pred 	%p17, %p6;
	@%p7 bra 	$L__BB0_5;

	mov.f64 	%fd28, 0d0000000000000000;
	div.rn.f64 	%fd4, %fd28, %fd3;
	fma.rn.f64 	%fd5, %fd4, 0d0000000000000000, %fd3;
	fma.rn.f64 	%fd6, %fd4, %fd2, %fd1;
	setp.eq.f64 	%p9, %fd5, 0d0000000000000000;
	mov.f64 	%fd70, %fd68;
	mov.f64 	%fd71, %fd69;
	mov.pred 	%p17, %p4;
	@%p9 bra 	$L__BB0_5;

	div.rn.f64 	%fd71, %fd6, %fd5;
	mul.f64 	%fd29, %fd4, %fd1;
	sub.f64 	%fd30, %fd2, %fd29;
	div.rn.f64 	%fd31, %fd30, %fd5;
	add.f64 	%fd70, %fd31, 0dBFF0000000000000;
	mov.pred 	%p17, %p6;

$L__BB0_5:
	@%p17 bra 	$L__BB0_10;

	mov.u32 	%r6, %tid.y;
	mov.u32 	%r7, %ntid.y;
	mov.u32 	%r8, %ctaid.y;
	mad.lo.s32 	%r9, %r7, %r8, %r6;
	cvt.rn.f64.s64 	%fd32, %rd1;
	cvt.rn.f64.s32 	%fd33, %r9;
	div.rn.f64 	%fd34, %fd33, %fd32;
	add.f64 	%fd35, %fd34, 0dBFF8000000000000;
	add.f64 	%fd12, %fd35, %fd71;
	setp.lt.s32 	%p11, %r1, 0;
	selp.b64 	%rd15, %rd9, 0, %p11;
	mov.u64 	%rd31, 0;
	cvt.s64.s32 	%rd16, %r1;
	add.s64 	%rd17, %rd15, %rd16;
	setp.lt.s32 	%p12, %r9, 0;
	selp.b64 	%rd18, %rd10, 0, %p12;
	cvt.s64.s32 	%rd19, %r9;
	add.s64 	%rd20, %rd18, %rd19;
	mul.lo.s64 	%rd21, %rd17, %rd10;
	add.s64 	%rd22, %rd20, %rd21;
	cvta.to.global.u64 	%rd23, %rd8;
	shl.b64 	%rd24, %rd22, 2;
	add.s64 	%rd2, %rd23, %rd24;
	mov.u32 	%r10, 20;
	st.global.u32 	[%rd2], %r10;
	mov.f64 	%fd36, 0d7FF0000000000000;
	{
	.reg .b32 %temp; 
	mov.b64 	{%temp, %r2}, %fd36;
	}
	mov.u64 	%rd32, 20;
	mov.u32 	%r13, 2144337920;
	mov.f64 	%fd72, %fd12;
	mov.f64 	%fd73, %fd70;

$L__BB0_7:
	mov.u64 	%rd3, %rd31;
	setp.gt.s64 	%p2, %rd32, 0;
	setp.lt.s64 	%p13, %rd32, 1;
	@%p13 bra 	$L__BB0_10;

	mul.f64 	%fd39, %fd72, %fd72;
	mul.f64 	%fd40, %fd73, %fd73;
	mul.f64 	%fd41, %fd72, %fd73;
	sub.f64 	%fd42, %fd39, %fd40;
	fma.rn.f64 	%fd43, %fd72, %fd73, %fd41;
	add.f64 	%fd73, %fd70, %fd43;
	add.f64 	%fd72, %fd12, %fd42;
	abs.f64 	%fd44, %fd72;
	abs.f64 	%fd45, %fd73;
	mov.b64 	%rd25, %fd45;
	mov.b64 	%rd26, %fd44;
	min.u64 	%rd27, %rd25, %rd26;
	mov.b64 	%fd46, %rd27;
	max.u64 	%rd28, %rd25, %rd26;
	mov.b64 	%fd47, %rd28;
	{
	.reg .b32 %temp; 
	mov.b64 	{%temp, %r11}, %fd47;
	}
	and.b32  	%r12, %r11, -4194304;
	sub.s32 	%r14, %r13, %r12;
	mov.u32 	%r15, 0;
	mov.b64 	%fd48, {%r15, %r14};
	mul.f64 	%fd49, %fd48, %fd46;
	mul.f64 	%fd50, %fd48, %fd47;
	mul.f64 	%fd51, %fd49, %fd49;
	fma.rn.f64 	%fd52, %fd50, %fd50, %fd51;
	mov.f64 	%fd53, 0d7FEFFFFFFFFFFFFF;
	min.f64 	%fd38, %fd52, %fd53;
	// begin inline asm
	rsqrt.approx.ftz.f64 %fd37, %fd38;
	// end inline asm
	mul.rn.f64 	%fd54, %fd37, %fd37;
	neg.f64 	%fd55, %fd54;
	mov.f64 	%fd56, 0d3FF0000000000000;
	fma.rn.f64 	%fd57, %fd38, %fd55, %fd56;
	mov.f64 	%fd58, 0d3FE0000000000000;
	mov.f64 	%fd59, 0d3FD8000000000000;
	fma.rn.f64 	%fd60, %fd59, %fd57, %fd58;
	mul.rn.f64 	%fd61, %fd57, %fd37;
	fma.rn.f64 	%fd62, %fd60, %fd61, %fd37;
	mul.f64 	%fd63, %fd52, %fd62;
	or.b32  	%r16, %r12, 1048576;
	mov.b64 	%fd64, {%r15, %r16};
	mul.f64 	%fd65, %fd63, %fd64;
	setp.eq.f64 	%p14, %fd46, 0d0000000000000000;
	selp.f64 	%fd66, %fd47, %fd65, %p14;
	{
	.reg .b32 %temp; 
	mov.b64 	{%temp, %r17}, %fd46;
	}
	setp.lt.u32 	%p15, %r17, %r2;
	selp.f64 	%fd67, %fd66, %fd46, %p15;
	setp.leu.f64 	%p16, %fd67, 0d4000000000000000;
	selp.s64 	%rd29, -1, 0, %p2;
	add.s64 	%rd32, %rd32, %rd29;
	selp.u64 	%rd30, 1, 0, %p2;
	add.s64 	%rd31, %rd3, %rd30;
	@%p16 bra 	$L__BB0_7;

	st.global.u32 	[%rd2], %rd3;

$L__BB0_10:
	ret;

}

On the plus side, if there's a good reason why JAX is much faster for this type of algorithm, my Mandelbrot study becomes a great advertisement for JAX, XLA, or both! (Especially since the example was not hand-selected for it.)

Answered by YouJiacheng

Jun 13, 2022

Well, to exclude the time of device-to-host copy, I tried:

import cupy as cp
from cupyx.profiler import benchmark

h, w = 2048, 4096
fractal = cp.empty((h, w), dtype=cp.int32)

cupy_custom_kernel = cp.RawKernel(r'''
#include <cupy/complex.cuh>

extern "C" __global__
void cupy_custom_kernel(int height, int width, int* fractal) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    complex<float> j(0.0, 1.0);
    complex<float> z, c;
    z = c = complex<float>(-1.5 + y*1.0/(height + 1)) - j + complex<float>(x*1.5)*j/complex<float>(width + 1);

    fractal[y + x * width] = 20;
    for (int i = 0;  i < 20;  i++) {
        z = z * z + c;
  …

View full answer

cgarciae · 2022-06-13T09:17:48Z

cgarciae
Jun 13, 2022
Collaborator

To force jax to use a single CPU you can try to set this flag:

XLA_FLAGS="--xla_force_host_platform_device_count=1"

and do a pmap over the function instead of using jit, add a new batch axis of length 1 to the inputs. Hopefully with this the computation gets assigned to a single thread.

1 reply

jpivarski Jun 13, 2022
Author

I did force it onto one thread by setting the CPU affinity of the whole JupyterLab process to one core (number 0, just to pick one). Both this method and setting environment variables require changes to be made before starting the process, but I found taskset to be more effective. (Since it's in JupyterLab, the actual process is a kennel launched by the one I can control from the command line.) Then within the process, I absolutely confirmed that it's only using core 0, in two different ways. Without that, the JAX CPU number is even faster than what I've shown.

What's a batch axis? Could the speed I'm seeing be due to a stream that executes code while moving data? The competitors' numbers are all based on creating an array as a first step, performing the computation, and (if it's on the GPU) pulling out back to the CPU as a third step. But the difference between copying the JAX array back from the GPU vs leaving it there is the 3 ms I quoted above vs 0.5 ms. The competitor GPU numbers are 22 ms and 24 ms. Those are different orders or magnitude—I don't see how a fancier way of moving data can account for that much difference.

YouJiacheng · 2022-06-13T16:30:31Z

YouJiacheng
Jun 13, 2022

Even giving C++ constants for all three nested loops (height, width, and maxiterations) didn't change its result much. In the end, all implementations have height and width not specified at compile-time, but maxiterations a compile-time constant.

IIUC, JAX tracers have static shape, thus height and width is compile-time constant.
Another concern: Could compiling time account for some of difference?

1 reply

jpivarski Jun 13, 2022
Author

IIUC, JAX tracers have static shape, thus height and width is compile-time constant.

I made height and width compile-time constants in the C++ and Numba as a cross-check. They didn't even budge. (That is, gcc and LLVM are not taking advantage of them as compile-time constants.)

I had another thought that I'd have to check later when I can get to the same machine that did these tests: all of the non-JAX implementations that "run at top speed" (have imperative code per-pixel) have a break in the loop up to maxiterations = 20. That break might be preventing some auto-vectorization, which might be a factor of 5 for the CPU...? I don't think it explains the factor of 6 for the GPU, since that doesn't really break, per-warp. Edit: you noted that below. That, too is now on my to-do list of things to check.

Another concern: Could compiling time account for some of difference?

Compile time is taken out by the fact that, in the notebook, each implementation is used to create a plot of the Mandelbrot before going into the %%timeit cell. As long as all the libraries cache the compiled code (they do), the %%timeit does not include compilation time. Besides, all of the "best of CPU" and "best of GPU" agree in timing with one another within 10% while the JAX is 5 or 6 times faster. gcc for pybind11, gcc for Cython, LLVM for @nb.njit, and LLVM for @nb.vectorize are not going to have compile times that are so similar to one another, and neither are nvrtc (CuPy) and NVVM (Numba) going to be so similar to one another. Compile time is not included.

YouJiacheng · 2022-06-13T16:52:10Z

YouJiacheng
Jun 13, 2022

BTW, for "CuPy with a custom kernel", it seems that warp divergence also prevent global memory access coalescing?
It has 2 instead of 1 global memory write in most of the cases too.
I suspect that global memory write is dominant(comparing with 20 fma), and 2 un-coalescing 4B global memory write could spend up to 2*8x time comparing with 1 coalescing 4B global memory write.
In addition, I think the if-break inside loop might prevent compiler from unrollling the loop.
And 6M is much larger than the number of cuda cores, it might have a better performance to process >1 pixels in one thread.

4 replies

jpivarski Jun 13, 2022
Author

And 6M is much larger than the number of cuda cores, it might have a better performance to process >1 pixels in one thread.

Would JAX/XLA automatically configure it to do that? The hand-written CUDA code, both in C++ and Numba, is quite explicit about which CUDA threads get to work on which pixels, but if the @jax.jit arranges it optimally without manual intervention, that's a big win for this programming style and I'll highlight that whenever I present it.

I'm confused about what you mean by "global memory access coalescing," but I think if I rewrite the non-JAX implementations to not break when they reach divergence—i.e. make the code have no jumps and therefore no warp divergence—then that would address your point.

spend up to 2*8x time

That would easily be large enough for the factor of 6 seen in GPUs. (So maybe it's "as much as 16×, but in practice only 6×".)

YouJiacheng Jun 13, 2022

By "global memory access coalescing", I mean CUDA global memory access uses 32B sector, and will coalesce memory access in a warp, i.e. if >1 threads in a warp access 1 sector at the same time, only 1 DRAM access is performed for this sector. E.g. if 32 threads in a warp access (aligned) contiguous 32*4B=128B, there are only 4 sectors(128B) need to be accessed, while if 32 threads in a warp access random 32*4B, there are 32 sectors(1024B) need to be accessed.
Okay I find that you use fractal[y + x*width] where x is blockIdx.x * blockDim.x + threadIdx.x, this will completely prohibit coalescing.

jpivarski Jun 13, 2022
Author

Ah, you mean like "cache lines" in CPU memory access, in which data are sent from RAM to the CPU in aligned, contiguous chunks of about 64 bytes—you mean that GPU processing units have something similar when accessing global memory, and I'm using it badly:

Okay I find that you use fractal[y + x*width] where x is blockIdx.x * blockDim.x + threadIdx.x, this will completely prohibit coalescing.

So I'm accessing the 2D array in column major order when I want row major order or vice-versa? I'm going to add that to my checklist of things to toggle.

YouJiacheng Jun 13, 2022

Yes. Just like "cache lines" in CPU. GPU chunks is of 32 Bytes. I believe XLA will take care of this.

YouJiacheng · 2022-06-13T16:56:57Z

YouJiacheng
Jun 13, 2022

One important thing: JAX use complex64(float32 for each part) instead of complex128 by default, did you take care of it?

1 reply

jpivarski Jun 13, 2022
Author

One important thing: JAX use complex64(float32 for each part) instead of complex128 by default, did you take care of it?

You're right, that is different: the non-JAX implementations used complex128 and the JAX one would have used complex64 since I didn't enable 64-bit.

There could also be different libraries for handling complex numbers. I saw a big difference between z**2 (calls pow) and z * z (calls multiply) in some of these tests, and that's a library choice: some libraries will be smart enough to identify 2 as an important constant, and maybe JAX has even-smarter complex libraries that are doing other things. Like when I say abs(z) > 2, maybe it's replacing that with a z.real * z.imag > 4 and avoiding a square root.

Okay, I have some things to try for standardizing the complex number handling, starting with bit width. Thanks!

jpivarski · 2022-06-13T17:58:37Z

jpivarski
Jun 13, 2022
Author

Summarizing the above, my current to-do list of things to check:

remove the break statement from all implementations; make the code (CPU & GPU) jump-free: should address auto-vectorization opportunities on CPU (I compile with -O3) and "global memory access coalescing" on the GPU
use complex64 in all implementations
break down any magical complex formulas like z**2 into z * z and abs(z) > 2 into z.real * z.imag > 4 so that there can't be a difference that depends on which library implements the complex number math
try to change one of the manual CUDA implementations to run 2 or 4 pixels per thread, rather than 1 pixel per thread. I doubt I'll be able to optimize it, but if it makes the manual implementations faster by a factor of a few, that would be a strong hint that JAX/XLA is optimizing it, and that can account for the difference.
switch x and y as described below, to go from whichever I'm using (column-major or row-major) to the opposite

If these things explain the difference, it still comes out as a win for the JAX/XLA approach because these optimizations are hard to do manually. You'll have won a convert, and I'm trying to see how my project might fit into a world in which array bounds are all known at compile-time. (For my project in full generality, it would be hard.)

4 replies

YouJiacheng Jun 13, 2022

Note: For "memory coalescing"(or "coalesced memory access", I might use an inexact terminology before) you should exchange the threadIdx.x and threadIdx.y. I also eliminated if-break inside loop, and minimized global memory write.

cupy_custom_kernel = cp.RawKernel("""
#include <cupy/complex.cuh>

extern "C" __global__
void cupy_custom_kernel(int height, int width, int* fractal) {
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int x = blockIdx.y * blockDim.y + threadIdx.y;

    complex<double> j(0.0, 1.0);
    complex<double> z, c;
    z = c = complex<double>(-1.5 + y*1.0/(height + 1)) - j + complex<double>(x*1.5)*j/complex<double>(width + 1);

    int r = 20;
    #pragma unroll
    for (int i = 0; i < 20; i++) {
        z = z * z + c;
        r = (z.real() * z.imag() > 4 && r == 20) ? i : r;
    }
    fractal[x*width + y] = r;
}
""", "cupy_custom_kernel")

def run_cupy_custom_kernel(height, width):
    fractal = cp.empty((height, width), dtype=np.int32)
    griddim = (math.ceil(width/ 32), math.ceil(height / 32))
    blockdim = (32, 32)
    cupy_custom_kernel(griddim, blockdim, (height, width, fractal))
    return fractal.get()

jpivarski Jun 13, 2022
Author

// IIUC we need to check x < width and y < height here

Good point; I was going to make the size a multiple of block size to be able to avoid that issue. Somehow, I didn't overwrite any array bounds.

YouJiacheng Jun 13, 2022

I updated my snippet and I think it should be faster.

YouJiacheng Jun 13, 2022

Ah ... I find the performance difference is very small, 15.1ms vs 15.5ms for 2048*4096.
And, I find that empty kernel cost 14.5ms...

def run_cupy_custom_kernel(height, width):
    fractal = cp.empty((height, width), dtype=np.int32)
    return fractal.get()

YouJiacheng · 2022-06-13T19:05:51Z

YouJiacheng
Jun 13, 2022

Well, to exclude the time of device-to-host copy, I tried:

import cupy as cp
from cupyx.profiler import benchmark

h, w = 2048, 4096
fractal = cp.empty((h, w), dtype=cp.int32)

cupy_custom_kernel = cp.RawKernel(r'''
#include <cupy/complex.cuh>

extern "C" __global__
void cupy_custom_kernel(int height, int width, int* fractal) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    complex<float> j(0.0, 1.0);
    complex<float> z, c;
    z = c = complex<float>(-1.5 + y*1.0/(height + 1)) - j + complex<float>(x*1.5)*j/complex<float>(width + 1);

    fractal[y + x * width] = 20;
    for (int i = 0;  i < 20;  i++) {
        z = z * z + c;
        if (z.real() * z.real() + z.imag() * z.imag() > 4) {
            fractal[y + x * width] = i;
            break;
        }
    }
}
''', "cupy_custom_kernel", options=("--use_fast_math",))
cupy_custom_kernel.compile()

def run_cupy_custom_kernel(height, width, fractal):
    griddim = (height // 32, width // 32)
    blockdim = (32, 32)
    cupy_custom_kernel(griddim, blockdim, (height, width, fractal))
    return fractal

print(benchmark(run_cupy_custom_kernel, (h, w, fractal), n_repeat=20))

cupy_custom_kernel = cp.RawKernel(r'''
#include <cupy/complex.cuh>

extern "C" __global__
void cupy_custom_kernel(int height, int width, int* fractal) {
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int x = blockIdx.y * blockDim.y + threadIdx.y;

    complex<float> j(0.0, 1.0);
    complex<float> z, c;
    z = c = complex<float>(-1.5 + y * 1.0/(height + 1)) - j + complex<float>(x * 1.5) * j / complex<float>(width + 1);

    int r = 20;
    for (int i = 0; i < 20; i++) {
        z = z * z + c;
        if (z.real() * z.real() + z.imag() * z.imag() > 4) {
            r = i;
            break;
        }
    }
    fractal[x * width + y] = r;
}
''', "cupy_custom_kernel", options=("--use_fast_math",))

cupy_custom_kernel.compile()

def run_cupy_custom_kernel(height, width, fractal):
    griddim = (height // 32, width // 32)[::-1]
    blockdim = (32, 32)
    cupy_custom_kernel(griddim, blockdim, (height, width, fractal))
    return fractal

print(benchmark(run_cupy_custom_kernel, (h, w, fractal), n_repeat=20))

EDIT: fix z.real() * z.imag() > 4
EDIT: update the first implementation
EDIT: break version is faster
Note: I find that run a 10x larger problem before benchmarking can warmup the GPU to be 10% faster.
EDIT: use fastmath

run_cupy_custom_kernel:    CPU:   12.990 us   +/- 5.641 (min:    9.699 / max:   32.799) us     GPU-0:  586.342 us   +/- 5.556 (min:  582.656 / max:  605.184) us
run_cupy_custom_kernel:    CPU:   10.525 us   +/- 1.822 (min:    9.499 / max:   17.900) us     GPU-0:  159.027 us   +/- 1.747 (min:  157.696 / max:  165.888) us

2 replies

YouJiacheng Jun 14, 2022

numba-CUDA(cudatoolkit-11.3.1 from anaconda) is a little slower than cupy(cudatoolkit-11.6.0 from conda-forge) on my device, cudatoolkit version might be responsible for this difference(669us v.s. 586us).

from numba import cuda
import numpy as np

@cuda.jit(fastmath=True)
def one_pixel_numba_cuda(height, width, fractal):
    x, y = cuda.grid(2)                 # <--- 2-dimensional CUDA grid
    z = c = (
        np.complex64(-1.5)
        + np.complex64(y)/np.complex64(height + 1)
        - np.complex64(1j)
        + np.complex64(x)*np.complex64(1.5j)/np.complex64(width + 1)
    )
    fractal[x, y] = 20
    for i in range(20):
        z = z * z + c                      # <--- complex z**2 is much worse than z * z here
        if z.real**2 + z.imag**2 > 4:      # <--- but real-valued **2 is fine
            fractal[x, y] = i
            break

stream = cuda.stream()

def run_numba_cuda(height, width, fractal):
    griddim = (height // 32, width // 32)
    blockdim = (32, 32)
    one_pixel_numba_cuda[griddim, blockdim, stream](height, width, fractal)
    return fractal

h, w = 2048, 4096
fractal = cuda.device_array((h, w), dtype=np.int32, stream=stream)

run_numba_cuda(h, w, fractal)
stream.synchronize()
from time import perf_counter_ns
t = perf_counter_ns()
n = 20
for _ in range(n):
    run_numba_cuda(h, w, fractal)
    stream.synchronize()
print((perf_counter_ns() - t) / n / 1e3) # 669us

jobs-git May 6, 2024

what is the difference of the first the the second cupy run?

YouJiacheng · 2022-06-13T19:35:34Z

YouJiacheng
Jun 13, 2022

It seems that JAX is slower:

import jax
def run_jax_kernel(fractal):
    h, w = fractal.shape
    y, x = jax.numpy.ogrid[-1:0:h*1j, -1.5:0:w*1j]
    z = c = x + y * 1j
    for i in range(20):
        z = z * z + c
        diverged = z.real * z.real + z.imag * z.imag > 4 # EDIT: fixed from z.read * z.imag > 4
        diverging_now = diverged & (fractal == 20)
        fractal = jax.numpy.where(diverging_now, i, fractal)
    return fractal

run_jax_gpu_kernel = jax.jit(run_jax_kernel)

def run_jax_gpu(fractal):
    run_jax_gpu_kernel(fractal).block_until_ready()

fractal = jax.numpy.full((2048, 4096), 20, dtype=jax.numpy.int32)
run_jax_gpu(fractal)
from time import perf_counter_ns
t = perf_counter_ns()
for _ in range(20):
    run_jax_gpu(fractal)
print((perf_counter_ns() - t) / 20 / 1e3) # ~410us # EDIT

6 replies

YouJiacheng Jun 13, 2022

It seems that JAX device-to-host is much slower on my device.

f = [jax.numpy.full((2048, 4096), 20, dtype=np.int32).block_until_ready() for _ in range(20)]
from time import perf_counter_ns
t = perf_counter_ns()
for i in range(20):
    np.asarray(f[i])
print((perf_counter_ns() - t) / 20 / 1e6) # 13.5~14.5ms

jpivarski Jun 13, 2022
Author

At first, I developed this on Google Colab, then moved to my computer and got more reproducible timing. It probably still works on Google Colab, if assert len(jax.devices("cpu")) == 1 passes. If the difference is due to something like memory bandwidth, then switching hardware should change ratios—what bothered me was that it was different at all (by a large factor), which meant that I didn't understand what it was doing: it was not equivalent to my hand-written kernels.

You've given me a lot of things to try (and I can only try them when I get home tonight). Yesterday, I couldn't think of anything else to try. All of them would be "fixes" to the non-JAX code, so if any one of them is responsible, then that only goes to show the value of writing array-oriented code and letting JAX optimize it. (If possible, within JAX's restrictions.)

Hopefully, one of the five checkboxes above does it, and then I'll have this new story to tell. If not... then I'm still leaning toward the conclusion that I made some mistake in the non-JAX code, rather than some mistake in benchmarking the JAX code. Yesterday, I thought it might be that the JAX benchmark was incorrectly optimistic, but you've found a lot more open questions in the non-JAX code. I had been leaning toward thinking that the benchmark was wrong because the non-JAX ones were so close to each other, but the kinds of mistakes and potential mistakes you've pointed out are ones that would apply equally to all of the non-JAX ones, which would explain why they're so close to each other.

It seems that JAX device-to-host is much slower on my device.

For me, not copying the array back to host was "one half" ms (compare to your 0.375 ms) and copying it back was 3 ms, so the copy dominated, but the non-JAX GPU code was at 22 and 24 ms. (All numbers I remember off the top of my head, now.) I must have a fast interconnect. It's a System76 machine.

YouJiacheng Jun 13, 2022

My hand written kernel without copying was 0.246ms. I think you could try my snippet above.

YouJiacheng Jun 13, 2022

~~Note JAX-code allocate a new array on device, while my hand written kernel doesn't. I could try using buffer donation to eliminate this extrawork.~~ JAX preallocate GPU memory, and my experiment show that buffer donation doesn't make a difference.

jpivarski Jun 13, 2022
Author

I will be starting with your hand-written kernel. Thanks!

jpivarski · 2022-06-14T05:17:37Z

jpivarski
Jun 14, 2022
Author

It's resolved; thanks for all your help, @YouJiacheng!

The main thing that I was missing in my CPU implementations was that I was using abs(z) > 2, rather than z.real**2 + z.imag**2 > 4, so I was doing an unnecessary square root. Somehow, @jax.jit fixed that automatically—I'm impressed! It was also important to use the -ffast-math flag (or equivalent). The break/early return from the loop over iterations was actually better than making it branchless, even when I manually unrolled it, just to be sure. There's some trade-off between doing extra work and getting vectorization. Finally, Cython really needed to be told that z**2 is z * z; I think it might be using pow and not recognizing that 2 is a special constant. Numba is fine for that one.

I'm still missing something in the CPU implementation, because all of mine (pybind11, Cython, Numba, Numba) are consistently 40% worse than @jax.jit. That's when I gave up.

The main thing that I was missing in my GPU implementations was the use of complex64 instead of complex128. That made a big difference on the GPU, but not at all on the CPU. Swapping column-major/row-major made little difference on CPU or GPU, though the order I had for the GPU appears to be the right one. Trying to convince Numba to use a specific data type is pretty verbose (because it's sourced on Python code).

The final plot for my computer (specs as given above) is

You can see that the GPU implementations are all about the same as one another, the JAX CPU implementation is about 40% better than the other four, NumPy and CuPy without controlling for intermediate arrays is much worse than careful compilation, but much better than nothing. Python is on the far right, representing "nothing."

The notebook runs as-is on Google Colab because that environment happens to have only one CPU (at my usage tier, anyway). Here's what the plot looks like when run there:

Pretty similar, though pybind11 and Cython are worse. Maybe the C++ compiler is different or maybe external calls to binaries are different for some reason?

Anyway, I didn't enter into this project expecting to be as impressed with @jax.jit as I ended up being. I need to look into that more, and see if I can use it in my work.

I've also updated the gist.

5 replies

YouJiacheng Jun 14, 2022

Swapping column-major/row-major in my benchmark snippet (without copying) make 270us => 390us.

YouJiacheng Jun 14, 2022

BTW, python for-loop element-wise update might be faster if native python list is used(numpy array has dispatch cost), but using numpy can be faster for constructing the initial value of z.

jpivarski Jun 14, 2022
Author

Swapping column-major/row-major in my benchmark snippet (without copying) make 270us => 390us.

That's actually what I meant: it's only a 30% difference (but I took the faster one).

I agree about the Python version, it would be faster to use builtin types, though I have to decide how the problem is posed. As it is, every implementation has to produce the same end-state: a NumPy array in RAM. Maybe it would be faster to do the calculation with built-in types and convert the final list into a NumPy array. Anyway, whatever is done there, it will still be a bar that goes far to the right.

YouJiacheng Jun 14, 2022

Oh - complex64 + z.real() * z.real() + z.imag() * z.imag() > 4 vs complex128 + abs(z) > 2 is just 650us vs 940us for your kernel on my device and by applying one of two change I get about 675us(interesting). It's a 30% difference on my device as well.
But you're right that break version is faster than branchless version, 187us vs 275us on my device.

YouJiacheng Jun 14, 2022

I believe there is memory-access and computation overlapping:
continuous memory access("row-major thread block") v.s. memory access with stride=width("column-major thread block")
is 187us vs 386us for break version and 275us vs 390us for branchless version.
Interestingly

int r = 20;
for (int i = 0; i < 20; i++) {
    z = z * z + c;
    if (z.real() * z.real() + z.imag() * z.imag() > 4) {
        r = i;
        break;
    }
}
fractal[x * width + y] = r;

is significantly faster than

fractal[x * width + y] = 20;
for (int i = 0;  i < 20;  i++) {
    z = z * z + c;
    if (z.real() * z.real() + z.imag() * z.imag() > 4) {
        fractal[x * width + y] = i;
        break;
    }
}

when I use colum-major(i.e. y is not contiguous for contiguous thread): 390us v.s. 650us.
But a little bit slower when I use row-major: 184us v.s. 183us.(almost the same, but there is a stable difference for about 1us)

patrick-kidger · 2022-06-14T09:32:46Z

patrick-kidger
Jun 14, 2022

Chiming in here (this is a fascinating thread) to comment that JAX/XLA uses some but not all fastmath optimisations by default. If you're using all of fastmath for the other approaches then for fairness' sake JAX/XLA should be allowed to as well. No idea if that'll actually effect your end result, of course.

I mean JAX is winning anyway, but obviously I'm curious to know if the margin can be improved.
Also, I'd love to know how a Julia implementation compares. That's usually touted as best-in-class for CPU operations.

3 replies

jpivarski Jun 14, 2022
Author

Good point: I should look into turning on full fastmath in JAX. It was the second most important correction to the CPU implementations and it's what pushed Numba-CUDA past JAX on the GPU, though the reason I'm using a log plot is to de-emphasize differences smaller than a factor of two. Because of that success, I assumed that JAX was using fastmath.

Julia was also in the back of my mind during all of this. In principle, though, it should be just the same as Numba. In terms of abstraction, Julia exposes you to as many details as Numba. (In general, I tried to get a basis set of all different levels of abstraction, one representative each. So, for instance, PyCUDA would be redundant with CuPy with raw kernels.)

YouJiacheng Jun 14, 2022

In principle, though, it should be just the same as Numba

Julia is usually considered to be more human-friendly and expressive than numba, and its "native"(with language level jit) performance is satisfactory.

YouJiacheng Jun 14, 2022

~~According to #276 , XLA's CPU backend defaults to enabling fast math mode, the GPU backend does not.~~
CPU backend enable some but not all. It seems that GPU backend doesn't enable any?

Moelf · 2022-06-29T21:00:47Z

Moelf
Jun 29, 2022

Julia benchmark vs. Jax vs. Numba

tldr:

Implementation	Time
Jax-kernel-CPU (unroll)	97.9 ms ± 870 µs
Julia-imperative (unroll only)	53.96 ms ± 8.5 ms

Implementation	Time
Numba-imperative (no unroll)	143 ms ± 1.16 ms
Julia-imperative (fastmath only)	109.91 ms ± 1.46 ms

follwoing discussion on Julia discourse (thanks to @chriselrod), and after realizing Jax unrolls the range(20) loop so it's not cheating to use @nexpr macro to unroll, here's a 2x faster (SIMD?) Julia implementation. I basically translated Jim's imperative Numba code into Julia syntax, and made the body of range(20)-loop friendly to unrolling:

# NOT using fastmath
function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height) # need Float32 because Jax defaults to it
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)

    # this checks if indicies are compatible between `c` and `fractal`
    @inbounds for idx in eachindex(c, fractal)
        _c = c[idx]
        z = _c
        m = true
        Base.Cartesian.@nexprs 20 i -> begin
            z = z^2 + _c
            az4 = abs2(z) > 4f0
            fractal[idx] = ifelse(m&az4, Int32(i), fractal[idx]) # 32-bit Int, same reason as above
            m &= (!az4)
        end
    end
    return fractal
end

Jax

In [8]: %%timeit -o
   ...: run_jax_cpu(2000, 3000)
97.9 ms ± 870 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Out[8]: <TimeitResult : 97.9 ms ± 870 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)>

Julia (with unroll & without fastmath)

julia> @benchmark run_julia(2000,3000)
BenchmarkTools.Trial: 93 samples with 1 evaluation.
 Range (min … max):  48.687 ms … 114.762 ms  ┊ GC (min … max): 0.31% … 54.80%
 Time  (median):     52.597 ms               ┊ GC (median):    0.89%
 Time  (mean ± σ):   53.969 ms ±   8.555 ms  ┊ GC (mean ± σ):  3.34% ±  7.73%

     ▇█                                                         
  ▄▁▁██▄▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  48.7 ms       Histogram: log(frequency) by time       107 ms <

 Memory estimate: 68.66 MiB, allocs estimate: 4.

Additional comparison with Numba

Numba

In [6]: %%timeit -o
   ...: run_numba(2000, 3000)
143 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Out[6]: <TimeitResult : 143 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)>

Julia (without unroll & with fastmath)

@inline function fast2(x) # pending PR to add this as the fastmath routine for complex^2
    r = real(x)
    i = imag(x)
    Complex(fma(r, r, -i * i), fma(r, i, i * r))
end

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    @inbounds @fastmath for idx in eachindex(c)
        _c = c[idx]
        z = _c
        for i = 1:20
            z = fast2(z) + _c
            if abs2(z) > 4f0
                fractal[idx] = i
                break
            end
        end
    end
    return fractal
end

julia> @benchmark run_julia(2000,3000)
BenchmarkTools.Trial: 46 samples with 1 evaluation.
 Range (min … max):  105.716 ms … 112.512 ms  ┊ GC (min … max): 0.00% … 0.72%
 Time  (median):     110.110 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   109.905 ms ±   1.457 ms  ┊ GC (mean ± σ):  0.38% ± 0.52%

                     ▁           ▄    ▁▁▁▁▄ ▁ ▄   █              
  ▆▁▁▁▁▁▁▁▁▁▆▁▁▆▆▆▁▁▁█▁▁▁▁▁▁▁▁▆▆▆█▁▆▆▆█████▆█▆█▁▆▁█▁▆▆▁▆▆▆▆▁▁▁▆ ▁
  106 ms           Histogram: frequency by time          113 ms <

 Memory estimate: 68.66 MiB, allocs estimate: 4.

Platform information and Image output for debugging / sanity check

julia> versioninfo(verbose=true)
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  uname: Linux 5.18.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 22 Jun 2022 18:10:56 +0000 x86_64 unknown
  CPU: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz: 
              speed         user         nice          sys         idle          irq
       #1  4200 MHz      47711 s          0 s      15165 s    1050914 s      18726 s
       #2   400 MHz      57317 s          0 s      15044 s      25843 s       1600 s
       #3  2800 MHz      58279 s          0 s      14516 s      26112 s       1707 s
       #4  1300 MHz      58318 s          0 s      14387 s      25701 s       1746 s
       #5  4200 MHz      54057 s          0 s      13005 s      26334 s       1868 s
       #6  1962 MHz      57398 s          1 s      15431 s      26493 s       1450 s
       #7   400 MHz      58745 s          0 s      14183 s      26216 s       1598 s
       #8   400 MHz      58657 s          1 s      14181 s      25788 s       1460 s
  Memory: 31.14313507080078 GB (17520.08984375 MB free)
  Uptime: 279087.59 sec
  Load Avg:  1.4  1.22  0.99
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores

julia> using Plots

julia> heatmap(run_julia(2000, 3000))

3 replies

chriselrod Jun 29, 2022

In terms of "what the compiler should do", note that I shared a version that was around four times faster than this one, while also not fully unrolling the loop meaning it wasn't "cheating"/specializing, unlike these versions, and would do much better with given more iterations due to early stopping/breaking out of the loop.

Moelf Jun 29, 2022

Chris is referring to this: https://discourse.julialang.org/t/hard-to-beat-numba-jax-loop-for-generating-mandelbrot/82725/49

YouJiacheng Jun 30, 2022

8x faster(Chris' version) than JAX, amazing! Wow!

AriMKatz · 2022-06-29T21:55:27Z

AriMKatz
Jun 29, 2022

Any chance we can get a Dex benchmark @axch or @apaszke ? Maybe with the new vectorization pass? (or relying on LLVM). The levenshtein distance example is really cool

0 replies

Moelf · 2022-06-30T03:28:43Z

Moelf
Jun 30, 2022

Julia GPU (CPU) Kernel programming

Implementation	Time
Jax-kernel-CPU(see above)	97.9 ms ± 870 µs
Julia-imperative(see above)	53.96 ms ± 8.5 ms
Julia-kernel-CPU (see comments)	178.964 ms ± 2.117 ms
Julia-kernel-GPU (Quradro 4000, grainsize=32*32)	3.908 ms ± 3.135 ms

sanity check plot

You can write a vendor-agnostic kernel of this algorithm with KernelAbstractions.jl and it would run on NVIDIA, AMD, Intel GPUs alike (pending oneAPI PR), I will demonstrate the point with a kernel instantiated on CUDA device:

using KernelAbstractions, CUDAKernels, CUDA     
    
@kernel function julia_kernel!(c, fractal)    
    I = @index(Global)    
    _c = c[I]    
    z = _c       
    @inbounds for i = 1:20    
        z = z^2 + _c    
        if abs2(z) > 4f0    
            fractal[I] = Int32(i)    
            break    
        end    
    end    
end    
              
function run_julia_gpu(height, width)    
    y = CuArray(range(-1.0f0, 0.0f0; length = height))     
    x = CuArray(range(-1.5f0, 0.0f0; length = width))     
    c = x' .+ y*im      
    fractal = CUDA.fill(Int32(20), height, width)        
    # the 32^2 is comparible to gridsize in Numba-CUDA
    kernel! = julia_kernel!(CUDADevice(), 32^2) # we instantiate a kernel with vendor info
    kernel!(c, fractal; ndrange=size(c))     
    return Array(fractal) # copy back to CPU, blocking
end

julia> @benchmark run_julia_gpu(2000,3000)
BenchmarkTools.Trial: 1279 samples with 1 evaluation.
 Range (min … max):  2.349 ms … 12.656 ms  ┊ GC (min … max): 0.00% … 11.64%
 Time  (median):     2.419 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.908 ms ±  3.135 ms  ┊ GC (mean ± σ):  7.66% ± 14.73%

  █                 ▂                               ▁▂    ▂   
  █▄▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▄█▇▆▅▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▆▁▁▅██ ▇
  2.35 ms      Histogram: log(frequency) by time     11.9 ms <

 Memory estimate: 22.91 MiB, allocs estimate: 130.

Use CPU as (bad) kernal executor

CPU is a very bad GPU (for this kind of work), but just to show that the kernel function is indeed vendor-agnostic, without changing the kernel function definition:

function run_julia_cpu_jaxstype(height, width)
    y = range(-1.0f0, 0.0f0; length = height)  
    x = range(-1.5f0, 0.0f0; length = width)  
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)     

    kernel! = julia_kernel!(CPU(), length(c)÷Threads.nthreads()) # we're using 1-thread here 
    event = kernel!(c, fractal; ndrange=length(c))
    wait(event) # not copying back, need to block here
    return fractal
end

julia> @benchmark run_julia_gpu(2000,3000)
BenchmarkTools.Trial: 28 samples with 1 evaluation.
 Range (min … max):  176.828 ms … 185.904 ms  ┊ GC (min … max): 0.00% … 0.72%
 Time  (median):     178.417 ms               ┊ GC (median):    0.49%
 Time  (mean ± σ):   178.964 ms ±   2.117 ms  ┊ GC (mean ± σ):  0.41% ± 0.39%

  ▃     ▃▃ █ █    █  ▃  ▃                                        
  █▁▇▁▇▇██▇█▇█▇▁▇▁█▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▇ ▁
  177 ms           Histogram: frequency by time          186 ms <

 Memory estimate: 68.67 MiB, allocs estimate: 21.

CUDA Device Info

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 510.39.1, for CUDA 11.6
CUDA driver 11.6

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+510.39.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.3
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: Quadro RTX 4000 (sm_75, 6.847 GiB / 8.000 GiB available)

3 replies

vchuravy Jun 30, 2022

As the author of KernelAbstractions I hasten to add that I never invested much time in improving it's CPU performance, there are many low hanging fruits, but my focus has been GPU first.

@Moelf you are not using a single thread for your example, but your grain-size is one. Performance would likely be better with a much larger grain size.

Moelf Jun 30, 2022

you're right, actually, I was accidentally cheating by running this in 2 threads. Julia-CPU-Kernal is 2x slower than Jax-CPU-kernal, grain size doesn't change, if too big actually hurts performance

Sixzero Jul 2, 2022

I also created an unrolled kernel with KernelAbstraction, since it usually showed better results:

@kernel function julia_kernel_unroll!(c, fractal)    
    I = @index(Global)    
    _c = c[I]    
    z = _c  
    m = true
    @inbounds Base.Cartesian.@nexprs 20 i -> begin
        z = z^2 + _c
        az4 = abs2(z) > 4f0
        fractal[I] = ifelse(m&az4, Int32(i), fractal[I]) # 32-bit Int, same reason as above
        m &= (!az4)
    end 
end

Still the same performance on GPU.
(On CPU the unroll performed worse by a factor of two.)
@vchuravy did I perform something not optimally?

j-bac · 2022-07-01T09:13:52Z

j-bac
Jul 1, 2022

I actually have the opposite issue - why is jax so slow compared to cupy ? For matrix multiplication, cupy is 5x faster on my machine while times are comparable for SVD

import numpy as np
import cupy as cp
import jax.numpy as jnp
import jax

AT=np.random.random((1000,10,500))
B=np.random.random((1000,500,10))
M=AT@B

%time M=AT@B
%time out=u,s,vt=np.linalg.svd(M)

ATcp=cp.array(AT)
Bcp=cp.array(B)
Mcp=cp.array(M)
%time out=ATcp@Bcp
%time out=u,s,vt=np.linalg.svd(Mcp)


ATj=jnp.array(AT)
Bj=jnp.array(B)
Mj=jnp.array(M)

%time out=jnp.matmul(ATj,Bj)
%time out=jnp.linalg.svd(Mj)

8 replies

YouJiacheng Jul 2, 2022

77025.772 us- It seems there is something wrong with cupy. I have no idea. Do you run two benchmark code separately?

j-bac Jul 2, 2022

It's all run as one block in a notebook. I tried multiple runs, putting cupy calculation first, etc. Very weird and huge difference with times given by my initial snippet

Moelf Jul 2, 2022

Can you check if the result makes sense?

j-bac Jul 2, 2022

Yes, using my original snippet :
np.allclose(cp.matmul(ATcp,Bcp),jnp.matmul(ATj,Bj)) # True

cchapmanbird Mar 25, 2023

An old thread, but noting for the record that the reason for @j-bac 's slow cupy can be found here.

cupy supports various backends for faster operations, one of which being CUB - this is enabled by default for cupy v11.x but I see the above was run on 10.6.0. If you re-run the test but replace python with CUPY_ACCELERATORS=cub python you should get something more reasonable.

jpivarski · 2024-07-05T20:44:27Z

jpivarski
Jul 5, 2024
Author

More information from @mjbaldwin: there's a compilation time/runtime trade-off in the number of Mandelbrot iterations, which is fixed at 20 in this example, but might be much larger in a similar problem.

See https://gist.github.com/jpivarski/da343abd8024834ee8c5aaba691aafc7?permalink_comment_id=5112454#gistcomment-5112454

1 reply

Moelf Jul 6, 2024

that makes sense, since Jax's approach is fully unroll the entire inner loop. so in fact different number of loops causes completely different function body being compiled

Why is JAX so fast? #11078

Uh oh!

Replies: 14 comments · 42 replies

Uh oh!

Uh oh!

cgarciae Jun 13, 2022 Collaborator

Uh oh!

jpivarski Jun 13, 2022 Author

Uh oh!

Uh oh!

Uh oh!

jpivarski Jun 13, 2022 Author

Uh oh!

Uh oh!

Uh oh!

jpivarski Jun 13, 2022 Author

Uh oh!

Uh oh!

Uh oh!

jpivarski Jun 13, 2022 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jpivarski Jun 13, 2022 Author

Uh oh!

Uh oh!

jpivarski Jun 13, 2022 Author

Uh oh!

Uh oh!

Uh oh!

jpivarski Jun 13, 2022 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jpivarski Jun 13, 2022 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Replies: 14 comments 42 replies

cgarciae
Jun 13, 2022
Collaborator

jpivarski Jun 13, 2022
Author

jpivarski Jun 13, 2022
Author

jpivarski Jun 13, 2022
Author

jpivarski Jun 13, 2022
Author

jpivarski Jun 13, 2022
Author

jpivarski
Jun 13, 2022
Author

jpivarski Jun 13, 2022
Author

jpivarski Jun 13, 2022
Author