Commit d336e4a

Merge pull request #2336 from pareenaverma/content_review
Halide LP minor fixes
2 parents e454013 + b4c387d commit d336e4a

4 files changed

+60
-62
lines changed

content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md

Lines changed: 0 additions & 1 deletion
@@ -17,7 +17,6 @@ learning_objectives:
1717

1818
prerequisites:
1919
- Basic C++ knowledge
20-
- Basic programming knowledge
2120
- Android Studio with Android Emulator
2221

2322
author: Dawid Borycki

content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md

Lines changed: 22 additions & 22 deletions
@@ -8,12 +8,12 @@ layout: "learningpathall"
88
---
99

1010
## Objective
11-
In the previous section, we explored parallelization and tiling. Here, we focus on operator fusion (inlining) in Halidei.e., letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You’ll learn when fusion reduces memory traffic and when materializing saves recomputation (e.g., for large stencils or multi-use intermediates). We’ll inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s).
11+
In the previous section, you explored parallelization and tiling. Here, you will focus on operator fusion (inlining) in Halide, that is, letting producers be computed directly inside their consumers, as opposed to materializing intermediates with compute_root() or compute_at(). You will learn when fusion reduces memory traffic and when materializing saves recomputation (e.g., for large stencils or multi-use intermediates). You will inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s).
1212

13-
Note: this section does not cover loop fusion (the fuse directive). We concentrate on operator fusion, which is Halide’s default behavior.
13+
This section does not cover loop fusion (the fuse directive). You will focus on operator fusion, which is Halide’s default behavior.
1414

1515
## Code
16-
To demonstrate how fusion in Halide works let's create a new file camera-capture-fusion.cpp, and modify it as follows. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and print ms / FPS / MPix/s. So you can see the impact immediately.
16+
To demonstrate how fusion works in Halide, create a new file `camera-capture-fusion.cpp` and modify it as follows. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and prints ms / FPS / MPix/s so you can see the impact immediately.
1717

1818
```cpp
1919
#include "Halide.h"
@@ -232,9 +232,9 @@ int main(int argc, char** argv) {
232232
return 0;
233233
}
234234
```
235-
We begin by pulling in the right set of headers. Right after the includes we define an enumeration, Schedule, which lists the four different scheduling strategies we want to experiment with. These represent the “modes” we’ll toggle between while the program is running: a simple materialized version, a fused blur-plus-threshold, a fully fused pipeline, and a tiled variant.
235+
You will begin by pulling in the right set of headers. Right after the includes you define an enumeration, Schedule, which lists the four different scheduling strategies you want to experiment with. These represent the “modes” you will toggle between while the program is running: a simple materialized version, a fused blur-plus-threshold, a fully fused pipeline, and a tiled variant.
236236
237-
Finally, to make the output more readable, we add a small helper function, schedule_name. It converts each enum value into a human-friendly label so that when the program prints logs or overlays statistics, you can immediately see which schedule is active.
237+
Finally, to make the output more readable, you add a small helper function, `schedule_name`. It converts each enum value into a human-friendly label so that when the program prints logs or overlays statistics, you can immediately see which schedule is active.
238238
```cpp
239239
#include "Halide.h"
240240
#include <opencv2/opencv.hpp>
@@ -259,11 +259,11 @@ enum class Schedule : int {
259259
static const char* schedule_name(Schedule s) { ... }
260260
```
261261

262-
The heart of this demo is the make_pipeline function. It defines our camera processing pipeline in Halide and applies different scheduling choices depending on which mode we select.
262+
The main part of this program is the `make_pipeline` function. It defines the camera processing pipeline in Halide and applies different scheduling choices depending on which mode you select.
263263

264-
We start by declaring Var x, y as our pixel coordinates. Similarly as before, our camera frames come in as 3-channel interleaved BGR, we tell Halide how the data is laid out: the stride along x is 3 (one step moves across all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2.
264+
You start by declaring Var x, y as the pixel coordinates. As before, the camera frames come in as 3-channel interleaved BGR, so you tell Halide how the data is laid out: the stride along x is 3 (one step in x skips over all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2.
265265

266-
Because we don’t want to worry about array bounds when applying filters, we clamp the input at the borders. In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image.
266+
Because you don’t want to worry about array bounds when applying filters, you will clamp the input at the borders. In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image.
267267

268268
```cpp
269269
Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
@@ -278,9 +278,9 @@ Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
278278
Func inputClamped = BoundaryConditions::repeat_edge(input);
279279
```
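To make the layout and border handling concrete, here is a minimal plain C++ sketch (no Halide) of what the stride settings and repeat-edge clamping mean for memory access. The helper names `interleaved_offset` and `read_clamped` are hypothetical and are not part of the tutorial code. The sketch assumes a tightly packed 8-bit frame with no row padding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offset of channel c at pixel (x, y) in an interleaved 3-channel frame.
// Strides: c -> 1 element, x -> 3 elements, y -> width * 3 elements.
inline std::size_t interleaved_offset(int x, int y, int c, int width) {
    assert(x >= 0 && y >= 0 && c >= 0 && c < 3);
    return (static_cast<std::size_t>(y) * width + x) * 3 + c;
}

// Read with "repeat edge" semantics: out-of-range coordinates are clamped
// to the nearest valid pixel, so a stencil can step past the border safely.
inline std::uint8_t read_clamped(const std::vector<std::uint8_t>& frame,
                                 int x, int y, int c, int width, int height) {
    const int cx = std::max(0, std::min(x, width - 1));
    const int cy = std::max(0, std::min(y, height - 1));
    return frame[interleaved_offset(cx, cy, c, width)];
}
```

Calling `read_clamped(frame, -1, -1, 0, width, height)` returns the blue value of the top-left pixel, which is the behavior downstream stencil stages rely on at the image edges.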
280280
281-
Next comes the gray conversion. As in previous section, we use Rec.601 weights a 3×3 binomial blur. Instead of using a reduction domain (RDom), we unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16.
281+
Next comes the gray conversion. As in the previous section, you will use Rec.601 weights, followed by a 3×3 binomial blur. Instead of using a reduction domain (RDom), you unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16.
282282
283-
We then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, we define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when we run the pipeline.
283+
You will then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline.
284284
285285
```cpp
286286
// (c) BGR → gray (Rec.601, float weights)
@@ -309,13 +309,13 @@ We then add a threshold stage. Pixels above 128 become white, and all others bla
309309
output.compute_root();
310310
```
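As a reference for what the three stages compute per pixel, here is a plain C++ sketch (no Halide). It uses the standard Rec.601 luma weights, the binomial kernel, and the threshold of 128 described above. The function names, the rounding in `gray_of`, and the planar `gray` buffer are assumptions for illustration. The Halide code in the tutorial may differ in rounding and types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Rec.601 luma from one BGR pixel (float weights, rounded to nearest).
inline std::uint8_t gray_of(std::uint8_t b, std::uint8_t g, std::uint8_t r) {
    return static_cast<std::uint8_t>(0.114f * b + 0.587f * g + 0.299f * r + 0.5f);
}

// 3x3 binomial blur at (x, y) on a planar gray image, with clamped borders.
inline std::uint8_t blur_at(const std::vector<std::uint8_t>& gray,
                            int x, int y, int width, int height) {
    assert(static_cast<int>(gray.size()) == width * height);
    static const int kernel[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
    int sum = 0;
    for (int j = 0; j < 3; ++j) {
        for (int i = 0; i < 3; ++i) {
            const int sx = std::max(0, std::min(x + i - 1, width - 1));
            const int sy = std::max(0, std::min(y + j - 1, height - 1));
            sum += kernel[j][i] * gray[sy * width + sx];
        }
    }
    return static_cast<std::uint8_t>(sum / 16);  // kernel weights sum to 16
}

// Binary threshold: values above 128 become white, all others black.
inline std::uint8_t threshold_of(std::uint8_t v) {
    return v > 128 ? 255 : 0;
}
```

Because the kernel weights sum to 16, a constant image passes through the blur unchanged, which is a quick sanity check for any implementation of this stage.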
311311

312-
Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, we instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
313-
* Simple. We explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
314-
* FuseBlurAndThreshold. We compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
315-
* FuseAll. We apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil.
316-
* Tile. We split the output into 64×64 tiles. Within each tile, we materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile.
312+
Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
313+
* Simple: Here you will explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
314+
* FuseBlurAndThreshold: You compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
315+
* FuseAll: You will apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil.
316+
* Tile: You will split the output into 64×64 tiles. Within each tile, you materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile.
317317

318-
To help us “x-ray” what’s happening, we print the loop nest Halide generates for each schedule using print_loop_nest(). This gives us a clear view of how fusion or materialization changes the structure of the computation.
318+
To help you examine what’s happening, print the loop nest Halide generates for each schedule using print_loop_nest(). This will give you a clear view of how fusion or materialization changes the structure of the computation.
319319

320320
```cpp
321321
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
@@ -352,9 +352,9 @@ return Pipeline(output);
352352
}
353353
```
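The trade-off between these schedules can be estimated with a back-of-the-envelope count of how many times gray is evaluated per frame. The sketch below is a simplified model, not output from Halide: it assumes a 3×3 stencil (a one-pixel halo on each side), and it counts full tiles even at the frame edges, ignoring how Halide actually handles partial tiles. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Gray inlined into the blur (FuseAll-style): each output pixel
// recomputes gray for all nine taps of the 3x3 stencil.
inline std::int64_t gray_evals_fused(int width, int height) {
    assert(width > 0 && height > 0);
    return static_cast<std::int64_t>(width) * height * 9;
}

// Gray materialized once for the whole frame (compute_root-style):
// one evaluation per pixel, plus the one-pixel halo the blur reads.
inline std::int64_t gray_evals_root(int width, int height) {
    assert(width > 0 && height > 0);
    return static_cast<std::int64_t>(width + 2) * (height + 2);
}

// Gray materialized per tile (compute_at-style): each tile computes
// its own (tile + 2) x (tile + 2) region, so halos are recomputed.
inline std::int64_t gray_evals_tiled(int width, int height, int tile) {
    assert(width > 0 && height > 0 && tile > 0);
    const std::int64_t tiles_x = (width + tile - 1) / tile;
    const std::int64_t tiles_y = (height + tile - 1) / tile;
    return tiles_x * tiles_y * (tile + 2) * (tile + 2);
}
```

For a 640×480 frame, this model gives 2,764,800 evaluations when fully fused, 309,444 when materialized at root, and 348,480 with 64×64 tiles. Tiling pays a small recomputation cost at the tile borders in exchange for a working set that fits in cache. Measured performance still depends on memory traffic and vectorization, so treat these counts only as a guide.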
354354
355-
All the camera handling is just like before: we open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. We still time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].
355+
All the camera handling is just like before: you open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. You will still time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].
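The three reported numbers all derive from a single elapsed time per frame. Here is a minimal sketch of that arithmetic. The `FrameStats` struct and `frame_stats` function are hypothetical names and are not taken from the tutorial source.

```cpp
#include <cassert>

// Per-frame statistics derived from one timed realize() call.
struct FrameStats {
    double ms;          // elapsed time in milliseconds
    double fps;         // frames per second
    double mpix_per_s;  // megapixels processed per second
};

inline FrameStats frame_stats(double elapsed_ms, int width, int height) {
    assert(elapsed_ms > 0.0);
    FrameStats s;
    s.ms = elapsed_ms;
    s.fps = 1000.0 / elapsed_ms;
    // pixels / seconds / 1e6  ==  pixels / (ms * 1000)
    s.mpix_per_s = (static_cast<double>(width) * height) / (elapsed_ms * 1000.0);
    return s;
}
```

For example, a 640×480 frame processed in 10 ms corresponds to 100 FPS and about 30.72 MPix/s. In practice you would obtain `elapsed_ms` by reading a steady clock immediately before and after the realize() call.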
356356
357-
The new piece is that you can toggle scheduling modes from the keyboard while the app is running:
357+
The new part is that you can toggle scheduling modes from the keyboard while the application is running:
358358
1. Keys:
359359
* 0 – Simple (materialize gray and blur)
360360
* 1 – FuseBlurAndThreshold (materialize gray; fuse blur+threshold)
@@ -363,9 +363,9 @@ The new piece is that you can toggle scheduling modes from the keyboard while th
363363
* q / Esc – quit
364364
365365
Under the hood, pressing 0–3 triggers a rebuild of the Halide pipeline with the chosen schedule:
366-
1. We map the key to a Schedule enum value.
367-
2. We call make_pipeline(input, next) to construct the new scheduled pipeline.
368-
3. We reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT).
366+
1. You map the key to a Schedule enum value.
367+
2. You call make_pipeline(input, next) to construct the new scheduled pipeline.
368+
3. You reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT).
369369
4. The main loop keeps grabbing frames; only the Halide schedule changes.
370370
371371
This live switching makes fusion tangible: you can watch the loop nest printout change, see the visualization update, and compare throughput numbers in real time as you move between Simple, FuseBlurAndThreshold, FuseAll, and Tile.
@@ -528,4 +528,4 @@ Fusion isn’t always best. You’ll want to materialize an intermediate (comput
528528
The fastest way to check whether fusion helps is to measure it. Our demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling).
529529

530530
## Summary
531-
In this lesson, we learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. We explored why fusion matters, looked at scenarios where it is most effective, and saw how Halide’s scheduling constructs such as compute_root() and compute_at() let us control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, we observed how fusion can significantly improve the performance of a real-time image processing pipeline
531+
In this section, you learned about operator fusion in Halide, a powerful technique for reducing memory bandwidth usage and improving computational efficiency. You explored why fusion matters, looked at scenarios where it is most effective, and saw how Halide’s scheduling constructs such as compute_root() and compute_at() let you control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, you observed how fusion can significantly improve the performance of a real-time image processing pipeline.
