
Commit a8ecb75

Update content for Hugo site
1 parent e0ecee0 commit a8ecb75

2 files changed (+35, -20 lines)


content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md

Lines changed: 5 additions & 2 deletions
@@ -8,7 +8,9 @@ layout: "learningpathall"
---

## What you'll build
-In the previous section, you explored parallelization and tiling. Here, you'll focus on operator fusion (inlining) in Halide, that is, letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You'll learn when fusion reduces memory traffic and when materializing saves recomputation (for example, for large stencils or multi-use intermediates). You'll inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s).
+
+In this section, you'll focus on operator fusion in Halide—where each stage is computed directly inside its consumer, instead of storing intermediate results. You'll learn how fusion can reduce memory traffic, and when materializing intermediates with `compute_root()` or `compute_at()` is better, especially for large filters or when results are reused. You'll use `print_loop_nest()` to see how Halide arranges the computation, switch between different scheduling modes (fuse all, fuse blur only, materialize, tile and materialize per tile) in a live camera pipeline, and measure the impact using ms, FPS, and MPix/s.
+

This section doesn't cover loop fusion (the fuse directive). You'll focus instead on operator fusion, which is Halide's default behavior.

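To see the contrast this new paragraph describes, a minimal standalone sketch (with illustrative `producer`/`consumer` Funcs, not the learning path's actual camera pipeline) prints the loop nest before and after materializing the producer:

```cpp
// Minimal sketch: Halide's default operator fusion vs. materializing a
// producer with compute_root(). The Funcs and sizes here are illustrative
// stand-ins, not the learning path's camera pipeline.
#include "Halide.h"
#include <cstdio>

int main() {
    Halide::Var x("x"), y("y");
    Halide::Func producer("producer"), consumer("consumer");

    producer(x, y) = x + y;              // cheap per-pixel stage
    consumer(x, y) = producer(x, y) * 2; // its only consumer

    // Default schedule: producer is fused (inlined) into consumer's loops.
    std::printf("Fused (default):\n");
    consumer.print_loop_nest();

    // Materialize: producer is computed into its own buffer before consumer runs.
    producer.compute_root();
    std::printf("\nMaterialized with compute_root():\n");
    consumer.print_loop_nest();

    consumer.realize({64, 64}); // run once as a sanity check
    return 0;
}
```

In the first printout, producer does not appear as a separate loop nest; after `compute_root()` it gets its own nest and buffer, which is exactly the trade-off the section goes on to measure.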
@@ -475,4 +477,5 @@ Fusion isn’t always best. You’ll want to materialize an intermediate (comput
The fastest way to check whether fusion helps is to measure it. The demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling).

## Summary
-In this section, you've learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. You explored why fusion matters, looked at scenarios where it's most effective, and saw how Halide's scheduling constructs such as compute_root() and compute_at() let you control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, you observed how fusion can significantly improve the performance of a real-time image processing pipeline.
+
+You've seen how operator fusion in Halide can make your image processing pipeline faster and more efficient. Fusion means Halide computes each stage directly inside its consumer, reducing memory traffic and keeping data in cache. You learned when fusion is best—like for simple pixel operations or cheap post-processing—and when materializing intermediates with `compute_root()` or `compute_at()` can help, especially for large stencils or multi-use buffers. By switching schedules in the live demo, you saw how fusion and materialization affect both the loop structure and real-time performance. Now you know how to choose the right approach for your own Arm-based image processing tasks.

content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md

Lines changed: 30 additions & 18 deletions
@@ -364,7 +364,7 @@ This gives an FPS of 251.51, and average throughput of 521.52 MPix/s. Now you ca

### Apply parallelization

-Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. In image pipelines, rows (or row tiles) are naturally parallel once producer data is available. By distributing work across cores, wall-clock time is reduced—crucial for real-time video.
+Parallelization allows Halide to process different parts of the image at the same time using multiple CPU cores. In image processing pipelines, each row or block of rows can be handled independently once the input data is ready. By spreading the work across several cores, you reduce the total processing time—this is especially important for real-time video applications.

With the baseline measured, apply a minimal schedule that parallelizes the loop iteration for the y axis.

@@ -379,10 +379,15 @@ Add these lines after defining output(x, y) (and before any realize()). In this
```

This does two important things:
-* compute_root() on gray divides the entire processing into two loops, one to compute the entire gray output, and the other to compute the final output.
-* parallel(y) parallelizes over the pure loop variable y (rows). The rows are computed on different CPU cores in parallel.
+* `compute_root()` on gray divides the entire processing into two loops, one to compute the entire gray output, and the other to compute the final output.
+* `parallel(y)` parallelizes over the pure loop variable y (rows). The rows are computed on different CPU cores in parallel.
+Now rebuild and run the application. You should see output similar to:

-Now rebuild and run the application again. The results should look like:
+```output
+realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s
+```
+
+This shows a significant speedup from parallelization. The exact numbers depend on your Arm CPU and how many cores are available.
```output
% ./camera-capture-perf-measurement
realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s
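For reference, the schedule those bullets describe can be reproduced in a self-contained sketch; the `gray` and `output` bodies below are stand-ins (the real stages live in the demo app), and only the two schedule calls mirror the text:

```cpp
// Standalone sketch of the parallel schedule described above: compute_root()
// on a grayscale-like producer plus parallel(y) on the consumer. The Func
// bodies are stand-ins for the real camera pipeline stages.
#include "Halide.h"

int main() {
    Halide::Var x("x"), y("y");
    Halide::Func gray("gray"), output("output");

    gray(x, y) = Halide::cast<uint8_t>((x + y) % 256); // stand-in producer
    output(x, y) = gray(x, y) / 2;                     // stand-in consumer

    gray.compute_root();  // loop 1: compute all of gray into its own buffer
    output.parallel(y);   // loop 2: rows of output run on multiple cores

    output.print_loop_nest(); // shows the two separate loop nests
    Halide::Buffer<uint8_t> result = output.realize({640, 480});
    return 0;
}
```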
@@ -394,11 +399,12 @@ The performance gain by parallelization depends on how many CPU cores are availa

Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage.

-Tiling splits the image into cache-friendly blocks (tiles). Two wins:
-* Partitioning: tiles are easy to parallelize across cores.
-* Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit CPU L1/L2 cache more often.
+Tiling divides the image into smaller, cache-friendly blocks called tiles. This gives you two main benefits:
+
+* Partitioning: tiles are easy to process in parallel, so you can spread the work across multiple CPU cores.
+* Locality: by caching intermediate results within each tile, you avoid repeating calculations and make better use of the CPU cache.

-Explore both approaches.
+Try both methods to see how they improve performance.

## Cache intermediates per tile

@@ -422,11 +428,13 @@ This approach caches gray once per tile so the 3×3 blur can reuse it instead of
```

In this scheduling:
-* tile(...) splits the image into cache-friendly blocks and makes it easy to parallelize across tiles
-* parallel(yo) distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles
-* gray.compute_at(...).store_at(...) materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile
+* `tile(...)` splits the image into cache-friendly blocks and makes it easy to parallelize across tiles
+* `parallel(yo)` distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles
+* `gray.compute_at(...).store_at(...)` materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile

-Recompile your application as before, then run. Here's sample output:
+Recompile your application as before, then run.
+
+Here's sample output:
```output
realize: 0.98 ms | 1023.15 FPS | 2121.60 MPix/s
```
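A self-contained sketch of this tile-and-cache schedule follows; the 64×64 tile size, the stand-in 3×3 stencil, and the Func bodies are assumptions for illustration, while the `tile`/`parallel`/`compute_at`/`store_at` calls follow the bullets above:

```cpp
// Standalone sketch of tile-and-materialize-per-tile scheduling. The stencil
// and tile size are illustrative; only the schedule mirrors the text above.
#include "Halide.h"

int main() {
    Halide::Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
    Halide::Func gray("gray"), blur("blur");

    gray(x, y) = Halide::cast<uint16_t>((x * 3 + y * 5) % 251); // stand-in
    // A 3x3 box blur as a stand-in stencil: without caching, it would
    // recompute gray nine times per output pixel.
    blur(x, y) = (gray(x, y)     + gray(x + 1, y)     + gray(x + 2, y) +
                  gray(x, y + 1) + gray(x + 1, y + 1) + gray(x + 2, y + 1) +
                  gray(x, y + 2) + gray(x + 1, y + 2) + gray(x + 2, y + 2)) / 9;

    blur.tile(x, y, xo, yo, xi, yi, 64, 64) // cache-friendly 64x64 tiles
        .parallel(yo);                      // each core takes a row of tiles
    gray.compute_at(blur, xo)               // compute gray once per tile...
        .store_at(blur, xo);                // ...into a tile-local buffer

    blur.print_loop_nest(); // inspect the resulting loop structure
    blur.realize({640, 480});
    return 0;
}
```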
@@ -437,12 +445,16 @@ Caching the grayscale image for each tile gives the best performance. By storing
There isn't a universal scheduling strategy that guarantees the best performance for every pipeline or device. The optimal approach depends on your specific image-processing workflow and the Arm architecture you're targeting. Halide's scheduling API gives you the flexibility to experiment with parallelization, tiling, and caching. Try different combinations to see which delivers the highest throughput and efficiency for your application.

For the example of this application:
-* Start with parallelizing the outer-most loop.
-* Add tiling + caching only if: there's a spatial filter, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray).
-* From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environmental variable which allows you to limit the number of threads in-flight.
+Start by parallelizing the outermost loop to use multiple CPU cores. This is usually the simplest way to boost performance.
+
+Add tiling and caching if your pipeline includes a spatial filter (such as blur or convolution), or if an intermediate result is reused by several stages. Tiling works best after converting your source data to planar format, or after precomputing a planar grayscale image.
+
+Try parallelization first, then experiment with tiling and caching for further speedups. From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environment variable that allows you to limit the number of threads in flight.

## What you've accomplished and what's next
-In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. Parallelization and tiling improved the performance.
+You built a real-time image processing pipeline using Halide and OpenCV. The workflow included converting camera frames to grayscale, applying a 3×3 binomial blur, and thresholding to create a binary image. You also measured performance to see how different scheduling strategies affect throughput.
+
+- Parallelization lets Halide use multiple CPU cores, speeding up processing by dividing work across rows or tiles.
+- Tiling improves cache efficiency, especially when intermediate results are reused often, such as with larger filters or multi-stage pipelines.

-* Parallelization spreads independent work across CPU cores.
-* Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (for example, larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data.
+By combining these techniques, you achieved faster and more efficient image processing on Arm systems.
