Commit f4b1d8c

Add example and document JIT vs. non-JIT (#2628)
1 parent 5950a4d commit f4b1d8c

5 files changed, +217 −17 lines changed


programming_examples/basic/vector_reduce_min/Makefile

Lines changed: 1 addition & 1 deletion

@@ -57,7 +57,7 @@ else
 endif
 
 run: ${targetname}.exe build/final.xclbin
-	${powershell} ./$< -x build/final.xclbin -i build/insts.bin -k MLIR_AIE
+	${powershell} ./$< -x build/final.xclbin -i build/insts.bin -k MLIR_AIE --warmup 10 --iters 20
 
 trace:
 	../../../python/utils/parse_trace.py --input trace.txt --mlir build/aie.mlir --output parse_eventIR_vs.json
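The new `--warmup 10 --iters 20` flags match the defaults of the measurement loop in the new `vector_reduce_min_jit.py` (10 untimed warmup iterations followed by 20 timed ones), which presumably keeps the timings reported by `make run` comparable to those printed by the JIT script.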

programming_examples/basic/vector_reduce_min/README.md

Lines changed: 59 additions & 11 deletions

@@ -4,29 +4,62 @@
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
-// Copyright (C) 2024, Advanced Micro Devices, Inc.
+// Copyright (C) 2024-2025, Advanced Micro Devices, Inc.
 //
 //===----------------------------------------------------------------------===//-->
 
 # Vector Reduce Min:
 
-Single tile performs a very simple reduction operation where the kernel loads data from local memory, performs the `min` reduction and stores the resulting value back.
+This example showcases both **JIT** and **non-JIT** approaches for running IRON designs. A single tile performs a very simple reduction operation: the kernel loads data from local memory, performs the `min` reduction, and stores the resulting value back.
 
-Input data is brought to the local memory of the Compute tile from a Shim tile. The size of the input data `N` from the Shim tile is `1024xi32`. The data is copied to the AIE tile, where the reduction is performed. The single output data value is copied from the AIE tile to the Shim tile.
+Input data is brought to the local memory of the Compute tile from a Shim tile. The size of the input data `N` from the Shim tile is configurable (default `1024xi32` for the non-JIT version, customizable via command-line arguments for the JIT version). The data is copied to the AIE tile, where the reduction is performed. The single output data value is copied from the AIE tile to the Shim tile. The two approaches offer different compilation workflows, with the JIT version adding only microseconds of runtime overhead.
 
 ## Source Files Overview
 
-1. `vector_reduce_min.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.bin for the NPU in Ryzen™ AI).
+### JIT Approach Files
 
-1. `vector_reduce_min_placed.py`: An alternative version of the design in `vector_reduce_min.py`, that is expressed in a lower-level version of IRON.
+1. **`vector_reduce_min_jit.py`**: A JIT (Just-In-Time) compiled version using IRON's `@iron.jit` decorator. This approach offers faster development iteration by compiling and executing the design at runtime, with support for command-line arguments to customize the number of elements.
 
-1. `reduce_min.cc`: A C++ implementation of a vectorized `min` reduction operation for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/reduce_min.cc).
+### Non-JIT Approach Files
 
-1. `test.cpp`: This C++ code is a testbench for the design example targetting Ryzen™ AI (AIE2). The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the program verifies the results.
+1. **`vector_reduce_min.py`**: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.bin for the NPU in Ryzen™ AI).
 
-## Ryzen™ AI Usage
+1. **`vector_reduce_min_placed.py`**: An alternative version of the design in `vector_reduce_min.py`, expressed in a lower-level version of IRON.
 
-### Compilation
+1. **`test.cpp`**: This C++ code is a testbench for the non-JIT design example targeting Ryzen™ AI (AIE2). The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the program verifies the results.
+
+### Shared Files
+
+1. **`reduce_min.cc`**: A C++ implementation of a vectorized `min` reduction operation for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/reduce_min.cc).
+
+## Usage
+
+### JIT Approach (Just-In-Time Compilation)
+
+The JIT approach uses IRON's `@iron.jit` decorator for runtime compilation, offering faster development iteration and more flexible parameterization.
+
+#### Running the JIT Version
+
+To run the JIT version with default parameters (2048 elements):
+```shell
+python vector_reduce_min_jit.py
+```
+
+To run with a custom number of elements:
+```shell
+python vector_reduce_min_jit.py --num-elements 2048
+```
+
+Or using the short form:
+```shell
+python vector_reduce_min_jit.py -n 512
+```
+
+### Non-JIT Approach
+
+The non-JIT approach uses traditional MLIR-AIE compilation, where the design is compiled ahead of time to produce binaries.
+
+#### Compilation
 
 To compile the design:
 ```shell
@@ -43,11 +76,26 @@ To compile the C++ testbench:
 make vector_reduce_min.exe
 ```
 
-### C++ Testbench
+#### C++ Testbench
 
 To run the design:
-
 ```shell
 make run
 ```
 
+#### JIT vs. Non-JIT Comparison
+
+| Aspect | Non-JIT Approach | JIT Approach |
+|--------|------------------|--------------|
+| **Compilation** | Ahead-of-time via `aiecc.py` | Runtime compilation |
+| **Development Speed** | Slower (manual make/compilation) | Faster (compilation integrated) |
+| **Host Code** | C++ testbench (`test.cpp`) | Python script |
+| **Performance** | Baseline execution time | Microseconds of overhead from the JIT runtime |
+| **Flexibility** | Fixed at compile time | Runtime parameterization |
+| **Use Case** | Explicit XCLBIN management | Dynamic compilation |
+| **Binary Output** | Generates XCLBIN/inst.bin | Cached binaries in `IRON_CACHE_HOME` (defaults to `~/.iron/`) |
+
+**When to use each approach:**
+- **Use JIT** for rapid prototyping, experimentation, and runtime flexibility, and when you don't need explicit control over XCLBINs.
+- **Use non-JIT** when you need explicit XCLBIN control, are working with existing MLIR-AIE workflows, or are distributing pre-compiled binaries.
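The JIT usage described above boils down to a three-step host pattern: allocate NPU-visible tensors, call the decorated function, and read the result back. A minimal sketch, assuming `my_reduce_min` is the `@iron.jit`-decorated function from `vector_reduce_min_jit.py` (shown later in this commit), with `iron.randint`/`iron.tensor` used exactly as in that file's `main()`:

```python
import numpy as np
import aie.iron as iron
from vector_reduce_min_jit import my_reduce_min  # the @iron.jit-decorated design

# Allocate NPU-visible input and output tensors, as in the script's main().
inp = iron.randint(10, 100, (1024,), dtype=np.int32, device="npu")
out = iron.tensor((1,), dtype=np.int32, device="npu")

# The first call compiles the design (binaries cached under IRON_CACHE_HOME,
# default ~/.iron/); later calls reuse the cache with microseconds of overhead.
my_reduce_min(inp, out)
assert out.numpy()[0] == inp.numpy().min()
```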
Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
+// (c) Copyright 2025 Advanced Micro Devices, Inc.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// REQUIRES: ryzen_ai, peano
+//
+// RUN: %run_on_npu1% python3 %S/vector_reduce_min_jit.py
+// RUN: %run_on_npu2% python3 %S/vector_reduce_min_jit.py
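(For context: `REQUIRES: ryzen_ai, peano` is a standard lit directive that restricts this test to configurations providing a Ryzen AI NPU and the Peano (LLVM-AIE) compiler; the two `RUN` lines execute the new JIT script through the harness's `%run_on_npu1%`/`%run_on_npu2%` substitutions, which presumably dispatch to NPU1- and NPU2-class devices respectively.)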

programming_examples/basic/vector_reduce_min/vector_reduce_min.py

Lines changed: 5 additions & 5 deletions

@@ -4,7 +4,7 @@
 # See https://llvm.org/LICENSE.txt for license information.
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
-# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates
+# (c) Copyright 2024-2025 Advanced Micro Devices, Inc. or its affiliates
 import numpy as np
 import sys
 
@@ -35,20 +35,20 @@ def my_reduce_min():
     of_out = ObjectFifo(out_ty, name="out")
 
     # AIE Core Function declarations
-    reduce_add_vector = Kernel(
+    reduce_min_vector = Kernel(
         "reduce_min_vector", "reduce_min.cc.o", [in_ty, out_ty, np.int32]
     )
 
     # Define a task
-    def core_body(of_in, of_out, reduce_add_vector):
+    def core_body(of_in, of_out, reduce_min_vector):
         elem_out = of_out.acquire(1)
         elem_in = of_in.acquire(1)
-        reduce_add_vector(elem_in, elem_out, N)
+        reduce_min_vector(elem_in, elem_out, N)
         of_in.release(1)
         of_out.release(1)
 
     # Define a worker to run the task on a core
-    worker = Worker(core_body, fn_args=[of_in.cons(), of_out.prod(), reduce_add_vector])
+    worker = Worker(core_body, fn_args=[of_in.cons(), of_out.prod(), reduce_min_vector])
 
     # Runtime operations to move data to/from the AIE-array
     rt = Runtime()
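Beyond the rename, this file marks the main API difference from the JIT flow: here `Kernel` binds the core function to an object file pre-built by the Makefile, whereas the new JIT script compiles the same source at runtime through `ExternalFunction`. Both declarations side by side, taken from the two files in this commit (import paths and the `N`/`kernel_dir` bindings are assumptions added here for a self-contained sketch):

```python
import os
import numpy as np
from aie.iron import Kernel, ExternalFunction

# Types/paths as defined in the two design files (repeated here for completeness).
N = 1024
in_ty = np.ndarray[(N,), np.dtype[np.int32]]
out_ty = np.ndarray[(1,), np.dtype[np.int32]]
kernel_dir = "aie_kernels/aie2"  # relative to the repository root

# Non-JIT (vector_reduce_min.py): bind to the pre-compiled kernel object.
reduce_min_vector = Kernel(
    "reduce_min_vector", "reduce_min.cc.o", [in_ty, out_ty, np.int32]
)

# JIT (vector_reduce_min_jit.py): compile the same kernel from source at runtime.
reduce_min_vector = ExternalFunction(
    "reduce_min_vector",
    source_file=os.path.join(kernel_dir, "reduce_min.cc"),
    arg_types=[in_ty, out_ty, np.int32],
    include_dirs=[kernel_dir],
)
```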
programming_examples/basic/vector_reduce_min/vector_reduce_min_jit.py

Lines changed: 145 additions & 0 deletions

@@ -0,0 +1,145 @@
+# vector_reduce_min/vector_reduce_min_jit.py -*- Python -*-
+#
+# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+#
+# (c) Copyright 2025 Advanced Micro Devices, Inc. or its affiliates
+import numpy as np
+import sys
+import os
+import argparse
+import time
+
+import aie.iron as iron
+from aie.iron import ObjectFifo, Program, Runtime, Worker
+from aie.iron.placers import SequentialPlacer
+from aie.iron import ExternalFunction
+
+
+@iron.jit(is_placed=False)
+def my_reduce_min(input_tensor, output_tensor):
+
+    num_elements = input_tensor.numel()
+    assert output_tensor.numel() == 1, "Output tensor must be a scalar"
+
+    # Define tensor types
+    in_ty = np.ndarray[(num_elements,), np.dtype[input_tensor.dtype]]
+    out_ty = np.ndarray[(1,), np.dtype[output_tensor.dtype]]
+
+    # AIE-array data movement with object fifos
+    of_in = ObjectFifo(in_ty, name="in")
+    of_out = ObjectFifo(out_ty, name="out")
+
+    # AIE Core Function declarations
+    root_dir = os.path.abspath(os.path.join(__file__, "../../../.."))
+    kernel_dir = os.path.join(root_dir, "aie_kernels/aie2")
+    source_file = os.path.join(kernel_dir, "reduce_min.cc")
+    reduce_min_vector = ExternalFunction(
+        "reduce_min_vector",
+        source_file=source_file,
+        arg_types=[in_ty, out_ty, np.int32],
+        include_dirs=[kernel_dir],
+    )
+
+    # Define a task
+    def core_body(of_in, of_out, reduce_min_vector):
+        elem_out = of_out.acquire(1)
+        elem_in = of_in.acquire(1)
+        reduce_min_vector(elem_in, elem_out, num_elements)
+        of_in.release(1)
+        of_out.release(1)
+
+    # Define a worker to run the task on a core
+    worker = Worker(core_body, fn_args=[of_in.cons(), of_out.prod(), reduce_min_vector])
+
+    # Runtime operations to move data to/from the AIE-array
+    rt = Runtime()
+    with rt.sequence(in_ty, out_ty) as (a_in, c_out):
+        rt.start(worker)
+        rt.fill(of_in.prod(), a_in)
+        rt.drain(of_out.cons(), c_out, wait=True)
+
+    # Place program components (assign them resources on the device) and generate an MLIR module
+    return Program(iron.get_current_device(), rt).resolve_program(SequentialPlacer())
+
+
+def main():
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-n",
+        "--num-elements",
+        type=int,
+        default=2048,
+        help="Number of elements (default: 2048)",
+    )
+    parser.add_argument(
+        "-w",
+        "--warmup",
+        type=int,
+        default=10,
+        help="Number of warmup iterations (default: 10)",
+    )
+    parser.add_argument(
+        "-i",
+        "--iters",
+        type=int,
+        default=20,
+        help="Number of measurement iterations (default: 20)",
+    )
+
+    args = parser.parse_args()
+    num_elements = args.num_elements
+    n_warmup_iterations = args.warmup
+    n_iterations = args.iters
+    data_type = np.int32
+
+    # Construct input and output tensors that are accessible to the NPU
+    input_tensor = iron.randint(10, 100, (num_elements,), dtype=data_type, device="npu")
+    output_tensor = iron.tensor((1,), dtype=data_type, device="npu")
+
+    # Initialize timing variables
+    npu_time_total = 0.0
+    npu_time_min = float("inf")
+    npu_time_max = 0.0
+
+    # Main run loop with warmup and measurement iterations
+    total_iterations = n_warmup_iterations + n_iterations
+    for iter_num in range(total_iterations):
+        # Launch the kernel and measure execution time
+        start_time = time.perf_counter()
+        my_reduce_min(input_tensor, output_tensor)
+        end_time = time.perf_counter()
+
+        # Calculate execution time in microseconds
+        execution_time_us = (end_time - start_time) * 1_000_000
+
+        # Skip warmup iterations for timing statistics
+        if iter_num >= n_warmup_iterations:
+            npu_time_total += execution_time_us
+            npu_time_min = min(npu_time_min, execution_time_us)
+            npu_time_max = max(npu_time_max, execution_time_us)
+
+    # Check the correctness of the result
+    computed = output_tensor.numpy()[0]
+    expected = input_tensor.numpy().min()
+
+    if expected == computed:
+        # Print timing results
+        if n_iterations > 1:
+            avg_time = npu_time_total / n_iterations
+            print(f"\nAvg NPU time: {avg_time:.1f}us.")
+            print(f"Min NPU time: {npu_time_min:.1f}us.")
+            print(f"Max NPU time: {npu_time_max:.1f}us.")
+        else:
+            print(f"\nNPU time: {npu_time_total:.1f}us.")
+        print("PASS!")
+        sys.exit(0)
+    else:
+        print(f"FAIL!: Expected {expected} but got {computed}")
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
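One subtlety in the file above is the kernel-path computation: `os.path.join(__file__, "../../../..")` resolves to the repository root because the first `..` cancels the filename and the remaining three climb out of `vector_reduce_min/`, `basic/`, and `programming_examples/`. A standalone sketch with an illustrative path:

```python
import os

# Illustrative path only; mirrors the resolution done in vector_reduce_min_jit.py.
f = "/repo/programming_examples/basic/vector_reduce_min/vector_reduce_min_jit.py"
root = os.path.abspath(os.path.join(f, "../../../.."))
print(root)  # -> /repo, i.e. the checkout containing aie_kernels/aie2/reduce_min.cc
```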
