File: content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
Lines changed: 22 additions & 22 deletions
@@ -8,12 +8,12 @@ layout: "learningpathall"
8
8
---
9
9
10
10
## Objective
11
-
In the previous section, we explored parallelization and tiling. Here, we focus on operator fusion (inlining) in Halide—i.e., letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You’ll learn when fusion reduces memory traffic and when materializing saves recomputation (e.g., for large stencils or multi-use intermediates). We’ll inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s).
11
+
In the previous section, you explored parallelization and tiling. Here, you will focus on operator fusion (inlining) in Halide, i.e., letting producers be computed directly inside their consumers, versus materializing intermediates with compute_root() or compute_at(). You will learn when fusion reduces memory traffic and when materializing saves recomputation (e.g., for large stencils or multi-use intermediates). You will inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s).
12
12
13
-
Note: this section does not cover loop fusion (the fuse directive). We concentrate on operator fusion, which is Halide’s default behavior.
13
+
This section does not cover loop fusion (the fuse directive). You will focus on operator fusion, which is Halide’s default behavior.
14
14
15
15
## Code
16
-
To demonstrate how fusion in Halide works let's create a new file camera-capture-fusion.cpp, and modify it as follows. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and print ms / FPS / MPix/s. So you can see the impact immediately.
16
+
To demonstrate how fusion in Halide works, create a new file `camera-capture-fusion.cpp` and modify it as follows. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and prints ms / FPS / MPix/s so you can see the impact immediately.
17
17
18
18
```cpp
19
19
#include "Halide.h"
@@ -232,9 +232,9 @@ int main(int argc, char** argv) {
232
232
return 0;
233
233
}
234
234
```
235
-
We begin by pulling in the right set of headers. Right after the includes we define an enumeration, Schedule, which lists the four different scheduling strategies we want to experiment with. These represent the “modes” we’ll toggle between while the program is running: a simple materialized version, a fused blur-plus-threshold, a fully fused pipeline, and a tiled variant.
235
+
You will begin by pulling in the right set of headers. Right after the includes you define an enumeration, Schedule, which lists the four different scheduling strategies you want to experiment with. These represent the “modes” you will toggle between while the program is running: a simple materialized version, a fused blur-plus-threshold, a fully fused pipeline, and a tiled variant.
236
236
237
-
Finally, to make the output more readable, we add a small helper function, schedule_name. It converts each enum value into a human-friendly label so that when the program prints logs or overlays statistics, you can immediately see which schedule is active.
237
+
Finally, to make the output more readable, you add a small helper function, `schedule_name`. It converts each enum value into a human-friendly label so that when the program prints logs or overlays statistics, you can immediately see which schedule is active.
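The enumeration and the label helper described above could look roughly like the following sketch. The four names come from the schedules discussed in this section; the numeric order (0 to 3) and the exact label strings are assumptions, since the full listing is not reproduced here.

```cpp
#include <cassert>
#include <string>

// Assumed ordering: the numeric values are illustrative, chosen to match
// the 0-3 keyboard toggles described later in this section.
enum class Schedule : int {
    Simple = 0,
    FuseBlurAndThreshold = 1,
    FuseAll = 2,
    Tile = 3
};

// Convert a schedule value into a human-friendly label for logs and overlays.
inline const char* schedule_name(Schedule s) {
    switch (s) {
        case Schedule::Simple:               return "Simple";
        case Schedule::FuseBlurAndThreshold: return "FuseBlurAndThreshold";
        case Schedule::FuseAll:              return "FuseAll";
        case Schedule::Tile:                 return "Tile";
    }
    // Only reachable if an out-of-range value was cast to Schedule.
    assert(false && "unhandled Schedule value");
    return "Unknown";
}
```

Using a `switch` without a `default` branch lets most compilers warn when a new enum value is added but not given a label.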
The heart of this demo is the make_pipeline function. It defines our camera processing pipeline in Halide and applies different scheduling choices depending on which mode we select.
262
+
The main part of this program is the `make_pipeline` function. It defines the camera processing pipeline in Halide and applies different scheduling choices depending on which mode you select.
263
263
264
-
We start by declaring Var x, y as our pixel coordinates. Similarly as before, our camera frames come in as 3-channel interleaved BGR, we tell Halide how the data is laid out: the stride along x is 3 (one step moves across all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2.
264
+
You start by declaring Var x, y as the pixel coordinates. As before, the camera frames come in as 3-channel interleaved BGR, so you tell Halide how the data is laid out: the stride along x is 3 (one step moves across all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2.
265
265
266
-
Because we don’t want to worry about array bounds when applying filters, we clamp the input at the borders. In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image.
266
+
Because you don’t want to worry about array bounds when applying filters, you will clamp the input at the borders. In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image.
Next comes the gray conversion. As in previous section, we use Rec.601 weights a 3×3 binomial blur. Instead of using a reduction domain (RDom), we unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16.
281
+
Next comes the gray conversion. As in the previous section, you use Rec.601 weights, followed by a 3×3 binomial blur. Instead of using a reduction domain (RDom), you unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16.
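To make the arithmetic of the blur concrete, here is a plain C++ scalar reference, not Halide code. It assumes a planar, row-major float image and mimics the repeat-edge boundary by clamping coordinates. The names `clamp_coord` and `blur3x3` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Clamp a coordinate into [0, n-1], which mimics a repeat-edge boundary.
inline int clamp_coord(int v, int n) { return std::min(std::max(v, 0), n - 1); }

// Scalar reference for the 3x3 binomial blur on a planar float image.
// `gray` is row-major and holds width * height values.
inline std::vector<float> blur3x3(const std::vector<float>& gray, int width, int height) {
    assert(width > 0 && height > 0);
    assert(gray.size() == static_cast<std::size_t>(width) * height);
    static const int k[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
    std::vector<float> out(gray.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            // Pair of loops over the kernel, as in the unrolled Halide sum.
            for (int j = 0; j < 3; ++j) {
                for (int i = 0; i < 3; ++i) {
                    const int sx = clamp_coord(x + i - 1, width);
                    const int sy = clamp_coord(y + j - 1, height);
                    sum += k[j][i] * gray[sy * width + sx];
                }
            }
            out[y * width + x] = sum / 16.0f;  // weights sum to 16
        }
    }
    return out;
}
```

Because the nine weights sum to 16, a constant image passes through unchanged, which is a quick sanity check for any blur implementation.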
282
282
283
-
We then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, we define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when we run the pipeline.
283
+
You will then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline.
284
284
285
285
```cpp
286
286
// (c) BGR → gray (Rec.601, float weights)
@@ -309,13 +309,13 @@ We then add a threshold stage. Pixels above 128 become white, and all others bla
309
309
output.compute_root();
310
310
```
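As a scalar illustration of the gray conversion and the threshold rule, here is a plain C++ sketch, not Halide code. It reads an interleaved BGR frame with a stride of 3 along x and applies the standard Rec.601 luma weights. For brevity it thresholds the gray value directly and skips the blur stage; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rec.601 luma from one BGR pixel, using float weights.
inline float gray_rec601(std::uint8_t b, std::uint8_t g, std::uint8_t r) {
    return 0.114f * b + 0.587f * g + 0.299f * r;
}

// Binary threshold: values strictly above 128 become white, all others black.
inline std::uint8_t threshold128(float v) {
    return v > 128.0f ? std::uint8_t(255) : std::uint8_t(0);
}

// Walk an interleaved BGR frame (3 bytes per pixel) and produce a planar
// binary image. The blur stage is intentionally omitted in this sketch.
inline std::vector<std::uint8_t> gray_threshold(const std::vector<std::uint8_t>& bgr,
                                                int width, int height) {
    assert(width > 0 && height > 0);
    assert(bgr.size() == static_cast<std::size_t>(width) * height * 3);
    std::vector<std::uint8_t> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Stride along x is 3; stride along the channel index is 1.
            const std::size_t base = (static_cast<std::size_t>(y) * width + x) * 3;
            const float g = gray_rec601(bgr[base + 0], bgr[base + 1], bgr[base + 2]);
            out[static_cast<std::size_t>(y) * width + x] = threshold128(g);
        }
    }
    return out;
}
```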
311
311
312
-
Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, we instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
313
-
* Simple. We explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
314
-
* FuseBlurAndThreshold. We compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
315
-
* FuseAll. We apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil.
316
-
* Tile. We split the output into 64×64 tiles. Within each tile, we materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile.
312
+
Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
313
+
* Simple: Here you will explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
314
+
* FuseBlurAndThreshold: You compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
315
+
* FuseAll: You will apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil.
316
+
* Tile: You will split the output into 64×64 tiles. Within each tile, you materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile.
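The trade-off between the fused and materialized schedules above can be illustrated with a plain C++ sketch that counts how often a producer is evaluated. This is not Halide code: `producer` is a hypothetical stand-in for gray, the consumer is a simple 3×3 sum, and borders are clamped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Counted {
    std::vector<int> out;   // consumer result, row-major
    long producer_calls;    // how many times the producer was evaluated
};

// Stand-in producer (plays the role of gray); counts its own evaluations.
inline int producer(int x, int y, long& calls) {
    ++calls;
    return 3 * x + 5 * y;
}

inline int clampi(int v, int n) { return std::min(std::max(v, 0), n - 1); }

// Fused (inlined): the producer is recomputed inside the 3x3 consumer.
inline Counted run_fused(int w, int h) {
    assert(w > 0 && h > 0);
    Counted r{std::vector<int>(static_cast<std::size_t>(w) * h), 0};
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += producer(clampi(x + dx, w), clampi(y + dy, h), r.producer_calls);
            r.out[y * w + x] = sum;
        }
    return r;
}

// Materialized (in the spirit of compute_root): the producer is stored once
// for the whole frame, then the consumer reads the stored buffer.
inline Counted run_materialized(int w, int h) {
    assert(w > 0 && h > 0);
    Counted r{std::vector<int>(static_cast<std::size_t>(w) * h), 0};
    std::vector<int> buf(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            buf[y * w + x] = producer(x, y, r.producer_calls);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += buf[clampi(y + dy, h) * w + clampi(x + dx, w)];
            r.out[y * w + x] = sum;
        }
    return r;
}
```

Both variants produce identical output. The fused one evaluates the producer nine times per output pixel and needs no intermediate buffer, while the materialized one evaluates it once per pixel at the cost of storing a full frame.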
317
317
318
-
To help us “x-ray” what’s happening, we print the loop nest Halide generates for each schedule using print_loop_nest(). This gives us a clear view of how fusion or materialization changes the structure of the computation.
318
+
To help you examine what’s happening, print the loop nest Halide generates for each schedule using print_loop_nest(). This will give you a clear view of how fusion or materialization changes the structure of the computation.
319
319
320
320
```cpp
321
321
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
@@ -352,9 +352,9 @@ return Pipeline(output);
352
352
}
353
353
```
354
354
355
-
All the camera handling is just like before: we open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. We still time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].
355
+
All the camera handling is just like before: you open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. You will still time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].
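The three reported numbers follow directly from the measured time of one realize() call and the frame size. The helper below is a minimal sketch of that arithmetic; `FrameStats` and `frame_stats` are hypothetical names, and the demo itself may compute or format these values differently.

```cpp
#include <cassert>

struct FrameStats {
    double ms;          // time of one realize() call in milliseconds
    double fps;         // frames per second implied by that time
    double mpix_per_s;  // megapixels processed per second
};

// Derive FPS and MPix/s from the elapsed time and the frame dimensions.
inline FrameStats frame_stats(double elapsed_ms, int width, int height) {
    assert(elapsed_ms > 0.0);
    assert(width > 0 && height > 0);
    FrameStats s;
    s.ms = elapsed_ms;
    s.fps = 1000.0 / elapsed_ms;
    // (pixels / 1e6) / (ms / 1e3) simplifies to pixels / (ms * 1e3).
    s.mpix_per_s = (static_cast<double>(width) * height) / (elapsed_ms * 1000.0);
    return s;
}
```

For example, a 640×480 frame processed in 10 ms corresponds to 100 FPS and 30.72 MPix/s.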
356
356
357
-
The new piece is that you can toggle scheduling modes from the keyboard while the app is running:
357
+
The new part is that you can toggle scheduling modes from the keyboard while the application is running:
@@ -363,9 +363,9 @@ The new piece is that you can toggle scheduling modes from the keyboard while th
363
363
* q / Esc – quit
364
364
365
365
Under the hood, pressing 0–3 triggers a rebuild of the Halide pipeline with the chosen schedule:
366
-
1. We map the key to a Schedule enum value.
367
-
2. We call make_pipeline(input, next) to construct the new scheduled pipeline.
368
-
3. We reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT).
366
+
1. You map the key to a Schedule enum value.
367
+
2. You call make_pipeline(input, next) to construct the new scheduled pipeline.
368
+
3. You reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT compilation).
369
369
4. The main loop keeps grabbing frames; only the Halide schedule changes.
370
370
371
371
This live switching makes fusion tangible: you can watch the loop nest printout change, see the visualization update, and compare throughput numbers in real time as you move between Simple, FuseBlurAndThreshold, FuseAll, and Tile.
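The rebuild steps listed above can be sketched without OpenCV or Halide as a small piece of state handling. Everything here is illustrative: `AppState`, `schedule_for_key`, and `handle_key` are hypothetical names, the key-to-schedule order is an assumption, and the place where the real program would call make_pipeline(input, next) is marked with a comment.

```cpp
#include <cassert>

// Assumed ordering of the four schedules, matching keys '0' to '3'.
enum class Schedule : int { Simple = 0, FuseBlurAndThreshold = 1, FuseAll = 2, Tile = 3 };

// Hypothetical application state tracked by the main loop.
struct AppState {
    Schedule current;
    bool warmed_up;  // false means the next stats line is labeled [warm-up]
    int rebuilds;    // how many times the pipeline would have been rebuilt
};

// Step 1: map a key code to a schedule. Returns false for any other key.
inline bool schedule_for_key(int key, Schedule& out) {
    if (key < '0' || key > '3') return false;
    out = static_cast<Schedule>(key - '0');
    return true;
}

// Steps 2 and 3: switch schedule and reset the warm-up flag.
// Returns true when a schedule switch happened.
inline bool handle_key(AppState& st, int key) {
    assert(st.rebuilds >= 0);
    Schedule next;
    if (!schedule_for_key(key, next)) return false;
    st.current = next;
    // The real program would call make_pipeline(input, next) here.
    ++st.rebuilds;
    st.warmed_up = false;  // the next frame includes JIT compilation
    return true;
}
```

Keeping the key handling separate from frame capture reflects step 4: the main loop keeps grabbing frames and only the schedule changes.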
@@ -528,4 +528,4 @@ Fusion isn’t always best. You’ll want to materialize an intermediate (comput
528
528
The fastest way to check whether fusion helps is to measure it. Our demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling).
529
529
530
530
## Summary
531
-
In this lesson, we learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. We explored why fusion matters, looked at scenarios where it is most effective, and saw how Halide’s scheduling constructs such as compute_root() and compute_at() let us control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, we observed how fusion can significantly improve the performance of a real-time image processing pipeline
531
+
In this section, you have learned about operator fusion in Halide, a powerful technique for reducing memory bandwidth and improving computational efficiency. You explored why fusion matters, looked at scenarios where it is most effective, and saw how Halide’s scheduling constructs such as compute_root() and compute_at() let you control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, you observed how fusion can significantly improve the performance of a real-time image processing pipeline.