ArmDeveloperEcosystem · jasonrandrews · Oct 27, 2025
diff --git a/...ng-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md b/...ng-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md
@@ -85,7 +85,7 @@ int main(int argc, char** argv) {
 }
 ```
 
-In the original implementation constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and output values (255, 0) are explicitly cast to uint8_t. This approach removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. Both approaches result in identical functionality, but explicitly casting helps emphasize the type correctness and may avoid subtle issues during cross-compilation or in certain environments. Additionally, explicit uint8_t casts help avoid implicit promotion to 32-bit integers (and the corresponding narrowings back to 8-bit) in the generated code, reducing redundant cast operations and potential vector widen/narrow overhead—especially on ARM/NEON
+In the original implementation, the constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and the output values (255, 0) are explicitly cast to uint8_t. Both versions behave identically, but the explicit casts make the intended types unambiguous and may avoid subtle issues during cross-compilation or in certain environments.
 
 The program takes at least one command-line argument, the output base name used to generate the files (e.g., “blur_threshold_android”). Here, the target architecture is explicitly set within the code to Android ARM64:
 
@@ -105,12 +105,8 @@ target.set_feature(Target::NoRuntime, false);
 ```
 
 Notes: 
-1. NoRuntime — When set to true, Halide excludes its runtime from the generated code, and you must link the runtime manually during the linking step. When set to false, the Halide runtime is included in the generated library, which simplifies deployment.
-2. ARMFp16 — Enables the use of ARM hardware support for half-precision (16-bit) floating-point operations, which can provide faster execution when reduced precision is acceptable.
-3. Why the runtime choice matters - If your app links several AOT-compiled pipelines, ensure there is exactly one Halide runtime at link time:
-* Strategy A (cleanest): build all pipelines with NoRuntime ON and link a single standalone Halide runtime once (matching the union of features you need, e.g., Vulkan/OpenCL/Metal or ARM options).
-* Strategy B: embed the runtime in exactly one pipeline (leave NoRuntime OFF only there); compile all other pipelines with NoRuntime ON.
-* Mixing more than one runtime can cause duplicate symbols and split global state (e.g., error handlers, device interfaces).
+* NoRuntime — When set to true, Halide excludes its runtime from the generated code, and you must link the runtime manually during the linking step. When set to false, the Halide runtime is included in the generated library, which simplifies deployment.
+* ARMFp16 — Enables the use of ARM hardware support for half-precision (16-bit) floating-point operations, which can provide faster execution when reduced precision is acceptable.
 
 We declare spatial variables (x, y) and an ImageParam named “input” representing the input image data. We use boundary clamping (clamp) to safely handle edge pixels. Then, we apply a 3x3 blur with a reduction domain (RDom). The accumulated sum is divided by 9 (the number of pixels in the neighborhood), producing an average blurred image. Lastly, thresholding is applied, producing a binary output: pixels above a certain brightness threshold (128) become white (255), while others become black (0).
 
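For reference, the clamp, 3x3 blur, and threshold steps described in the paragraph above can be sketched as a scalar loop in plain C++, without Halide. This is only an illustration of the arithmetic, not the tutorial's generator: the function name `blur_threshold_reference`, the row-major grayscale layout, and the 16-bit accumulator are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the tutorial): scalar reference for the
// clamp -> 3x3 box blur -> threshold pipeline.
// `src` is assumed to be a row-major grayscale image of width * height bytes.
std::vector<uint8_t> blur_threshold_reference(const std::vector<uint8_t>& src,
                                              int width, int height) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) *
                             static_cast<std::size_t>(height));

    std::vector<uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Accumulate in a wider type: nine 8-bit values can sum to 2295.
            uint16_t sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    // Clamp coordinates so edge pixels reuse the nearest valid pixel.
                    const int cx = std::min(std::max(x + dx, 0), width - 1);
                    const int cy = std::min(std::max(y + dy, 0), height - 1);
                    sum = static_cast<uint16_t>(sum + src[cy * width + cx]);
                }
            }
            // Average over the 9-pixel neighborhood.
            const uint8_t blurred = static_cast<uint8_t>(sum / 9);
            // Explicit uint8_t constants mirror the casts discussed above.
            const uint8_t threshold = 128;
            dst[y * width + x] = blurred > threshold ? uint8_t{255} : uint8_t{0};
        }
    }
    return dst;
}
```

A uniform image of value 200 maps to all 255, while a uniform image of value 128 maps to all 0, because the comparison is strictly greater than the threshold.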

diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
@@ -42,7 +42,7 @@ static const char* schedule_name(Schedule s) {
         case Schedule::FuseBlurAndThreshold: return "FuseBlurAndThreshold";
         case Schedule::FuseAll:              return "FuseAll";
         case Schedule::Tile:                 return "Tile";
-        default:                             return "Unknown";
+        default:                              return "Unknown";
     }
 }
 
@@ -174,10 +174,10 @@ int main(int argc, char** argv) {
         if (!frame.isContinuous()) frame = frame.clone();
 
         // Wrap interleaved frame
-        Halide::Buffer<uint8_t> inputBuf = Runtime::Buffer<uint8_t>::make_interleaved(
-            frame.data, frame.cols, frame.rows, frame.channels());
-
-        input.set(inputBuf);
+        auto in_rt = Runtime::Buffer<uint8_t>::make_interleaved(
+            frame.data, frame.cols, frame.rows, /*channels*/3);
+        Buffer<> in_fe(*in_rt.raw_buffer());
+        input.set(in_fe);
 
         // Time the Halide realize() only
         auto t0 = std::chrono::high_resolution_clock::now();
@@ -232,6 +232,32 @@ int main(int argc, char** argv) {
     return 0;
 }
 ```
+Begin by including the required headers. Right after the includes, define an enumeration, `Schedule`, that lists the four scheduling strategies you will experiment with. These are the modes you toggle between while the program is running: a simple materialized version, a fused blur-plus-threshold, a fully fused pipeline, and a tiled variant.
+
+Finally, to make the output more readable, add a small helper function, `schedule_name`. It converts each enum value into a human-readable label so that, when the program prints logs or overlays statistics, you can immediately see which schedule is active.
+```cpp
+#include "Halide.h"
+#include <opencv2/opencv.hpp>
+#include <chrono>
+#include <iomanip>
+#include <iostream>
+#include <string>
+#include <cstdint>
+#include <exception>
+
+using namespace Halide;
+using namespace cv;
+using namespace std;
+
+enum class Schedule : int {
+    Simple = 0,
+    FuseBlurAndThreshold = 1,
+    FuseAll = 2,
+    Tile = 3,
+};
+
+static const char* schedule_name(Schedule s);  // body as in the full listing above
+```
 
 The main part of this program is the `make_pipeline` function. It defines the camera processing pipeline in Halide and applies different scheduling choices depending on which mode we select.
 
@@ -256,6 +282,33 @@ Next comes the gray conversion. As in previous section, you will use Rec.601 wei
 
 You will then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline.
 
+```cpp
+    // (c) BGR → gray (Rec.601, float weights)
+    Func gray("gray");
+    gray(x, y) = cast<uint8_t>(0.114f * inputClamped(x, y, 0)
+                             + 0.587f * inputClamped(x, y, 1)
+                             + 0.299f * inputClamped(x, y, 2));
+
+    // (d) 3×3 binomial blur, unrolled in host code (no RDom needed)
+    Func blur("blur");
+    const uint16_t k[3][3] = {{1,2,1},{2,4,2},{1,2,1}};
+    Expr blurSum = cast<uint16_t>(0);
+    for (int j = 0; j < 3; ++j)
+        for (int i = 0; i < 3; ++i)
+            blurSum = blurSum + cast<uint16_t>(gray(x + i - 1, y + j - 1)) * k[j][i];
+    blur(x, y) = cast<uint8_t>(blurSum / 16);
+
+    // (e) Threshold to binary
+    Func thresholded("thresholded");
+    Expr T = cast<uint8_t>(128);
+    thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
+
+    // (f) Final output and default root
+    Func output("output");
+    output(x, y) = thresholded(x, y);
+    output.compute_root();
+```
+
 Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
   * Simple: Here you will explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
   * FuseBlurAndThreshold: You compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
@@ -417,7 +470,7 @@ Comparing the numbers:
 
 By toggling schedules live, you can see and measure how operator fusion and materialization change both the loop structure and the throughput:
 * Fusion is the default in Halide and eliminates temporary storage, but may cause recomputation for spatial filters.
-* Materializing selected stages with compute_root() or compute_at() can reduce recomputation and improve locality. It can also make vectorization and parallelization easier or more effective, but they are not strictly required by materialization and can be applied independently. For best performance, consider these choices together and measure on your target.
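+* Materializing selected stages with compute_root() or compute_at() can reduce recomputation and improve locality. It often makes vectorization and parallelization more effective, although both can be applied independently of materialization, and it sometimes yields much higher throughput. Measure on your target.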
 * Tile-level materialization (compute_at) provides a hybrid - fusing within tiles while keeping intermediates small and cache-resident.
 
 This demo makes these trade-offs concrete: the loop nest diagrams explain the structure, and the live FPS/MPix/s stats show the real performance impact.
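
The recomputation trade-off in the first two bullets can be illustrated with a small plain C++ sketch that counts how often a producer stage runs under each strategy. It is not Halide code and does not model Halide's bounds inference or vectorization. The names `StageResult`, `producer`, `run_fused`, and `run_materialized` are hypothetical, and the producer is a stand-in for the gray stage that simply reads a clamped input pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration (plain C++, not Halide).
struct StageResult {
    std::vector<uint8_t> pixels;  // thresholded output
    std::size_t producer_evals;   // how many times the producer stage ran
};

// 3x3 binomial weights (sum = 16), as in the blur stage.
const int kKernel[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};

int clamp_index(int v, int hi) { return std::min(std::max(v, 0), hi); }

// Stand-in for the gray stage: reads one clamped pixel and counts the call.
uint8_t producer(const std::vector<uint8_t>& in, int w, int h, int x, int y,
                 std::size_t& evals) {
    ++evals;
    return in[clamp_index(y, h - 1) * w + clamp_index(x, w - 1)];
}

uint8_t threshold_pixel(int weighted_sum) {
    return (weighted_sum / 16) > 128 ? uint8_t{255} : uint8_t{0};
}

// Fused: the producer is re-evaluated for every tap of every output pixel.
StageResult run_fused(const std::vector<uint8_t>& in, int w, int h) {
    assert(w > 0 && h > 0);
    assert(in.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    StageResult r{std::vector<uint8_t>(in.size()), 0};
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int j = 0; j < 3; ++j)
                for (int i = 0; i < 3; ++i)
                    sum += kKernel[j][i] *
                           producer(in, w, h, x + i - 1, y + j - 1, r.producer_evals);
            r.pixels[y * w + x] = threshold_pixel(sum);
        }
    return r;
}

// Materialized: the producer runs once per pixel into a temporary buffer,
// and the blur then reads that buffer (extra storage, no recomputation).
StageResult run_materialized(const std::vector<uint8_t>& in, int w, int h) {
    assert(w > 0 && h > 0);
    assert(in.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    StageResult r{std::vector<uint8_t>(in.size()), 0};
    std::vector<uint8_t> stage(in.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            stage[y * w + x] = producer(in, w, h, x, y, r.producer_evals);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int j = 0; j < 3; ++j)
                for (int i = 0; i < 3; ++i)
                    sum += kKernel[j][i] *
                           stage[clamp_index(y + j - 1, h - 1) * w +
                                 clamp_index(x + i - 1, w - 1)];
            r.pixels[y * w + x] = threshold_pixel(sum);
        }
    return r;
}
```

For a width-by-height frame, the fused variant evaluates the producer nine times per output pixel, while the materialized variant evaluates it once per pixel at the cost of a full-frame temporary. Both produce the same output in this sketch.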