`content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md` (5 additions, 2 deletions)
@@ -8,7 +8,9 @@ layout: "learningpathall"
---
## What you'll build
- In the previous section, you explored parallelization and tiling. Here, you'll focus on operator fusion (inlining) in Halide, that is, letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You'll learn when fusion reduces memory traffic and when materializing saves recomputation (for example, for large stencils or multi-use intermediates). You'll inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s).
+ In this section, you'll focus on operator fusion in Halide—where each stage is computed directly inside its consumer, instead of storing intermediate results. You'll learn how fusion can reduce memory traffic, and when materializing intermediates with `compute_root()` or `compute_at()` is better, especially for large filters or when results are reused. You'll use `print_loop_nest()` to see how Halide arranges the computation, switch between different scheduling modes (fuse all, fuse blur only, materialize, tile and materialize per tile) in a live camera pipeline, and measure the impact using ms, FPS, and MPix/s.
This section doesn't cover loop fusion (the `fuse` directive). You'll focus instead on operator fusion, which is Halide's default behavior.
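To make the contrast concrete, here is a minimal, self-contained sketch (the `gray` and `output` stage names are illustrative stand-ins, not the demo's exact code) that prints the loop nest once with default fusion and once after materializing the producer with `compute_root()`:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y");

    // Producer: a synthetic grayscale stage defined on the whole plane.
    Func gray("gray");
    gray(x, y) = cast<uint8_t>((x + y) % 256);

    // Consumer: a simple thresholding stage.
    Func output("output");
    output(x, y) = select(gray(x, y) > 128, cast<uint8_t>(255), cast<uint8_t>(0));

    // Default operator fusion: gray is inlined into output,
    // so there is a single loop nest and no intermediate buffer.
    output.print_loop_nest();

    // Materialize: gray is now computed into its own buffer
    // in a separate loop nest before output consumes it.
    gray.compute_root();
    output.print_loop_nest();

    output.realize({640, 480});
    return 0;
}
```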
@@ -475,4 +477,5 @@ Fusion isn’t always best. You’ll want to materialize an intermediate (comput
The fastest way to check whether fusion helps is to measure it. The demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling).
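If you want to experiment with the profiler from JIT code, one approach (a sketch under the assumption that you realize the pipeline yourself; the demo may wire this up differently) is to add the `Profile` feature to the JIT target:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y");
    Func f("f");
    f(x, y) = cast<uint8_t>((x + y) % 256);

    // Enable Halide's built-in profiler by adding the Profile feature
    // to the JIT target; per-stage runtimes are reported when the
    // compiled pipeline is released.
    Target t = get_jit_target_from_environment().with_feature(Target::Profile);
    f.realize({1920, 1080}, t);
    return 0;
}
```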
## Summary
- In this section, you've learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. You explored why fusion matters, looked at scenarios where it's most effective, and saw how Halide's scheduling constructs such as compute_root() and compute_at() let you control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, you observed how fusion can significantly improve the performance of a real-time image processing pipeline.
+ You've seen how operator fusion in Halide can make your image processing pipeline faster and more efficient. Fusion means Halide computes each stage directly inside its consumer, reducing memory traffic and keeping data in cache. You learned when fusion is best—like for simple pixel operations or cheap post-processing—and when materializing intermediates with `compute_root()` or `compute_at()` can help, especially for large stencils or multi-use buffers. By switching schedules in the live demo, you saw how fusion and materialization affect both the loop structure and real-time performance. Now you know how to choose the right approach for your own Arm-based image processing tasks.
`content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md` (30 additions, 18 deletions)
@@ -364,7 +364,7 @@ This gives an FPS of 251.51, and average throughput of 521.52 MPix/s. Now you ca
### Apply parallelization
- Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. In image pipelines, rows (or row tiles) are naturally parallel once producer data is available. By distributing work across cores, wall-clock time is reduced—crucial for real-time video.
+ Parallelization allows Halide to process different parts of the image at the same time using multiple CPU cores. In image processing pipelines, each row or block of rows can be handled independently once the input data is ready. By spreading the work across several cores, you reduce the total processing time—this is especially important for real-time video applications.
With the baseline measured, apply a minimal schedule that parallelizes the loop iteration over the y axis.
@@ -379,10 +379,15 @@ Add these lines after defining output(x, y) (and before any realize()). In this
```
This does two important things:
- * compute_root() on gray divides the entire processing into two loops, one to compute the entire gray output, and the other to compute the final output.
- * parallel(y) parallelizes over the pure loop variable y (rows). The rows are computed on different CPU cores in parallel.
+ * `compute_root()` on gray divides the entire processing into two loops: one to compute the entire gray output, and the other to compute the final output.
+ * `parallel(y)` parallelizes over the pure loop variable `y` (rows). The rows are computed on different CPU cores in parallel (see the sketch below).
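As a sketch, the schedule those bullets describe looks like this in a self-contained program (`gray` and `output` are stand-ins for the demo's stages; the real pipeline reads camera frames):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y");

    Func gray("gray"), output("output");
    gray(x, y) = cast<uint8_t>((x + 2 * y) % 256);
    output(x, y) = select(gray(x, y) > 128, cast<uint8_t>(255), cast<uint8_t>(0));

    // The schedule described above:
    gray.compute_root();  // compute all of gray in its own loop nest first
    output.parallel(y);   // distribute rows of output across CPU cores

    output.print_loop_nest();  // two loop nests; output's y loop is parallel
    output.realize({1920, 1080});
    return 0;
}
```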
+ Now rebuild and run the application. You should see output similar to:
- Now rebuild and run the application again. The results should look like:
+ ```output
+ realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s
+ ```
+ This shows a significant speedup from parallelization. The exact numbers depend on your Arm CPU and how many cores are available.
```output
% ./camera-capture-perf-measurement
realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s
```
@@ -394,11 +399,12 @@ The performance gain by parallelization depends on how many CPU cores are availa
Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage.
- Tiling splits the image into cache-friendly blocks (tiles). Two wins:
- * Partitioning: tiles are easy to parallelize across cores.
- * Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit CPU L1/L2 cache more often.
+ Tiling divides the image into smaller, cache-friendly blocks called tiles. This gives you two main benefits:
+ * Partitioning: tiles are easy to process in parallel, so you can spread the work across multiple CPU cores.
+ * Locality: by caching intermediate results within each tile, you avoid repeating calculations and make better use of the CPU cache.
- Explore both approaches.
+ Try both methods to see how they improve performance.
## Cache intermediates per tile
@@ -422,11 +428,13 @@ This approach caches gray once per tile so the 3×3 blur can reuse it instead of
```
In this scheduling:
- * tile(...) splits the image into cache-friendly blocks and makes it easy to parallelize across tiles
- * parallel(yo) distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles
- * gray.compute_at(...).store_at(...) materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile
+ * `tile(...)` splits the image into cache-friendly blocks and makes it easy to parallelize across tiles
+ * `parallel(yo)` distributes tiles across CPU cores, where a CPU core is in charge of a row (`yo`) of tiles
+ * `gray.compute_at(...).store_at(...)` materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile (see the sketch below)
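Put together, a self-contained sketch of this tiled schedule might look as follows (the stage names, the 64×64 tile size, and the simplified 1×3 stencil are assumptions for illustration; the demo uses a full 3×3 blur over camera input):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    // Stand-ins for the demo's stages; a synthetic gray avoids
    // boundary handling since it is defined for all x, y.
    Func gray("gray"), blur("blur"), output("output");
    gray(x, y) = cast<uint8_t>((x + y) % 256);

    // A minimal 1x3 averaging stencil standing in for the 3x3 blur.
    blur(x, y) = cast<uint8_t>((cast<uint16_t>(gray(x - 1, y)) +
                                cast<uint16_t>(gray(x, y)) +
                                cast<uint16_t>(gray(x + 1, y))) / 3);
    output(x, y) = select(blur(x, y) > 128, cast<uint8_t>(255), cast<uint8_t>(0));

    // Split output into 64x64 tiles and run rows of tiles in parallel;
    // materialize gray into a tile-local buffer that blur reuses.
    output.tile(x, y, xo, yo, xi, yi, 64, 64)
          .parallel(yo);
    gray.compute_at(output, xo)
        .store_at(output, xo);

    output.print_loop_nest();
    output.realize({1920, 1080});
    return 0;
}
```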
- Recompile your application as before, then run. Here's sample output:
+ Recompile your application as before, then run.
+ Here's sample output:
```output
realize: 0.98 ms | 1023.15 FPS | 2121.60 MPix/s
```
@@ -437,12 +445,16 @@ Caching the grayscale image for each tile gives the best performance. By storing
There isn't a universal scheduling strategy that guarantees the best performance for every pipeline or device. The optimal approach depends on your specific image-processing workflow and the Arm architecture you're targeting. Halide's scheduling API gives you the flexibility to experiment with parallelization, tiling, and caching. Try different combinations to see which delivers the highest throughput and efficiency for your application.
For the example in this application:
- * Start with parallelizing the outer-most loop.
- * Add tiling + caching only if: there's a spatial filter, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray).
- * From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environmental variable which allows you to limit the number of threads in-flight.
+ Start by parallelizing the outermost loop to use multiple CPU cores. This is usually the simplest way to boost performance.
+ Add tiling and caching if your pipeline includes a spatial filter (such as blur or convolution), or if an intermediate result is reused by several stages. Tiling works best after converting your source data to planar format, or after precomputing a planar grayscale image.
+ Try parallelization first, then experiment with tiling and caching for further speedups. From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environment variable that lets you limit the number of threads in flight.
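For example, to cap the thread pool at four threads for one run of the demo binary (invocation shown in the same style as the outputs above):

```output
% HL_NUM_THREADS=4 ./camera-capture-perf-measurement
```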
## What you've accomplished and what's next
- In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. Parallelization and tiling improved the performance.
+ You built a real-time image processing pipeline using Halide and OpenCV. The workflow included converting camera frames to grayscale, applying a 3×3 binomial blur, and thresholding to create a binary image. You also measured performance to see how different scheduling strategies affect throughput.
+ - Parallelization lets Halide use multiple CPU cores, speeding up processing by dividing work across rows or tiles.
+ - Tiling improves cache efficiency, especially when intermediate results are reused often, such as with larger filters or multi-stage pipelines.
- * Parallelization spreads independent work across CPU cores.
- * Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (for example, larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data.
+ By combining these techniques, you achieved faster and more efficient image processing on Arm systems.