
Commit 50b372d

Merge pull request #2519 from stevesuzuki-arm/fix_processing-workflow
Fix processing-workflow.md for scheduling and runtime
2 parents 95cad50 + 051ff52 commit 50b372d

File tree

1 file changed (+55, -90 lines)


content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md

Lines changed: 55 additions & 90 deletions
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ layout: "learningpathall"
88
---
99

1010
## Objective
11-
In this section, you will build a real-time camera processing pipeline using Halide. First, you capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, you will measure performance and then explore Halide's scheduling options—parallelization and tiling—to understand when they help and when they don’t.
11+
In this section, you will build a real-time camera processing pipeline using Halide. First, you capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, you will measure performance and then explore Halide's scheduling options—parallelization and tiling—to understand when they help and when they don’t.
1212

1313
## Gaussian blur and thresholding
1414
Create a new `camera-capture.cpp` file and modify it as follows:
@@ -58,7 +58,7 @@ int main() {
5858
input.dim(2).set_stride(1);
5959
input.dim(2).set_bounds(0, 3);
6060

61-
// Clamp borders
61+
// Clamp borders
6262
Func inputClamped = BoundaryConditions::repeat_edge(input);
6363

6464
// Grayscale conversion (Rec.601 weights)
@@ -68,7 +68,7 @@ int main() {
6868
0.587f * inputClamped(x, y, 1) +
6969
0.299f * inputClamped(x, y, 2));
7070

71-
// 3×3 binomial blur
71+
// 3×3 binomial blur
7272
Func blur("blur");
7373
const uint16_t k[3][3] = {{1,2,1},{2,4,2},{1,2,1}};
7474
Expr sum = cast<uint16_t>(0);
@@ -77,11 +77,11 @@ int main() {
7777
sum += cast<uint16_t>(gray(x + i - 1, y + j - 1)) * k[j][i];
7878
blur(x, y) = cast<uint8_t>(sum / 16);
7979

80-
// Threshold fused with blur
80+
// Threshold fused with blur
8181
Func output("output");
8282
Expr T = cast<uint8_t>(128);
8383
output(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
84-
84+
8585
// Allocate output buffer once
8686
Buffer<uint8_t> outBuf(width, height);
8787

@@ -212,7 +212,7 @@ The output should look as in the figure below:
212212
## Parallelization and Tiling
213213
In this section, you will explore two complementary scheduling optimizations provided by Halide: Parallelization and Tiling. Both techniques help enhance performance but achieve it through different mechanisms—parallelization leverages multiple CPU cores, whereas tiling improves cache efficiency by optimizing data locality.
214214

215-
Now you will learn how to use each technique separately for clarity and to emphasize their distinct benefits.
215+
Now you will learn how to use each technique separately for clarity and to emphasize their distinct benefits.
216216

217217
Let’s first lock in a measurable baseline before we start changing the schedule. You will create a second file, `camera-capture-perf-measurement.cpp`, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide realize() call. This lets you quantify each optimization you will add next (parallelization, tiling, caching).
218218

@@ -254,7 +254,7 @@ int main() {
254254

255255
const int width = frame.cols;
256256
const int height = frame.rows;
257-
const int ch = frame.channels();
257+
const int ch = frame.channels();
258258

259259
// Build the pipeline once (outside the capture loop)
260260
ImageParam input(UInt(8), 3, "input");
@@ -286,8 +286,11 @@ int main() {
286286
Expr T = cast<uint8_t>(128);
287287
output(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
288288

289-
// Baseline schedule: materialize gray; fuse blur+threshold into output
290-
gray.compute_root();
289+
// Scheduling
290+
{
291+
// Baseline schedule: materialize gray; fuse blur+threshold into output
292+
gray.compute_root();
293+
}
291294

292295
// Allocate output buffer once & JIT once
293296
Buffer<uint8_t> outBuf(width, height);
@@ -336,136 +339,98 @@ int main() {
336339
return 0;
337340
}
338341
```
339-
342+
340343
* The console prints ms, FPS, and MPix/s per frame, measured strictly around realize() (camera capture and UI are excluded).
341344
* The first frame is labeled [warm-up] because it includes Halide's JIT compilation. You can ignore it when comparing schedules.
342345
* MPix/s = (width*height)/seconds is a good resolution-agnostic metric to compare schedule variants.
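The diff does not show the timing code itself. As a rough sketch only (the helper name `timeRealize` is hypothetical; `output`, `outBuf`, `width`, and `height` are the names used in the listing above), the measurement around `realize()` could look like this:

```cpp
#include <chrono>
#include <cstdio>
#include "Halide.h"

// Illustrative helper: time a single realize() call and print ms / FPS / MPix/s.
// Only the Halide execution is timed; camera capture and UI are excluded.
void timeRealize(Halide::Func &output, Halide::Buffer<uint8_t> &outBuf,
                 int width, int height) {
    auto t0 = std::chrono::steady_clock::now();
    output.realize(outBuf);                       // run the JIT-compiled pipeline
    auto t1 = std::chrono::steady_clock::now();

    double ms     = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double fps    = 1000.0 / ms;
    double mpixps = (double(width) * double(height)) / (ms / 1000.0) / 1e6;

    std::printf("realize: %.2f ms | %.2f FPS | %.2f MPix/s\n", ms, fps, mpixps);
}
```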
343346

344347
Build and run the application. Here is the sample output:
345348

346349
```console
347-
% ./camera-capture-perf-measurement
348-
realize: 4.84 ms | 206.53 FPS | 428.25 MPix/s
350+
% ./camera-capture-perf-measurement
351+
realize: 3.98 ms | 251.51 FPS | 521.52 MPix/s
349352
```
350353

351-
This gives an FPS of 206.53, and average throughput of 428.25 MPix/s. Now you can start measuring potential improvements from scheduling.
354+
This gives an FPS of 251.51 and an average throughput of 521.52 MPix/s. Now you can start measuring potential improvements from scheduling.
352355

353356
### Parallelization
354357
Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. In image pipelines, rows (or row tiles) are naturally parallel once producer data is available. By distributing work across cores, we reduce wall-clock time—crucial for real-time video.
355358

356-
With the baseline measured, apply a minimal schedule that parallelizes the blur reduction across rows while keeping the final stage explicit at root. This avoids tricky interactions between a parallel consumer and an unscheduled reduction.
359+
With the baseline measured, apply a minimal schedule that parallelizes the loop iterations along the y axis.
357360

358-
Add these lines after defining output(x, y) (and before any realize()):
361+
Add these lines after defining output(x, y) (and before any realize()), replacing the existing scheduling block in this sample code.
359362
```cpp
360-
blur.compute_root().parallel(y); // parallelize reduction across scanlines
361-
output.compute_root(); // cheap pixel-wise stage at root
363+
// Scheduling
364+
{
365+
// parallelize across scanlines
366+
gray.compute_root().parallel(y);
367+
output.compute_root().parallel(y);
368+
}
362369
```
363370

364371
This does two important things:
365-
* compute_root() on blur moves the reduction to the top level, so it isn’t nested under a parallel loop that might complicate reduction ordering.
366-
* parallel(y) parallelizes over the pure loop variable y (rows), not the reduction domain r, which is the safe/idiomatic way to parallelize reductions in Halide.
372+
* compute_root() on gray splits the processing into two loop nests: one computes the entire gray image first, and the other then computes the final output.
373+
* parallel(y) parallelizes over the pure loop variable y, so rows are computed on different CPU cores in parallel.
367374

368375
Now rebuild and run the application again. The results should look like:
369376
```output
370377
% ./camera-capture-perf-measurement
371-
realize: 3.80 ms | 263.07 FPS | 545.49 MPix/s
378+
realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s
372379
```
373380

374-
That’s ≈20% faster than baseline.
381+
The performance gain from parallelization depends on how many CPU cores are available to the application.
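To see how much parallelism your machine offers, a quick standalone check (plain C++, not part of the pipeline and not Halide-specific) is to print the reported hardware thread count:

```cpp
#include <cstdio>
#include <thread>

// Prints the number of hardware threads the OS reports; this is an upper
// bound on the speed-up you can expect from parallel() on this machine.
int main() {
    std::printf("hardware threads: %u\n", std::thread::hardware_concurrency());
    return 0;
}
```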
375382

376383
### Tiling
377384
Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage.
378385

379386
Tiling splits the image into cache-friendly blocks (tiles). Two wins:
380387
* Partitioning: tiles are easy to parallelize across cores.
381-
* Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit L1/L2 more often.
388+
* Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit CPU L1/L2 cache more often.
382389

383390
Now let's look at how this works in practice.
384391

385392
### Tiling with explicit intermediate storage (best for cache efficiency)
386393
Here you will cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB -> gray up to 9× per output pixel.
387394

388-
Before using this, remove any earlier compute_root().parallel(y) schedule for blur.
389-
390395
```cpp
391-
// After defining: input, gray, blur, thresholded
392-
Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
393-
394-
// Tile & parallelize the consumer; vectorize inner x on planar output.
395-
output
396-
.tile(x, y, xo, yo, xi, yi, 128, 64)
397-
.vectorize(xi, 16)
398-
.parallel(yo);
399-
400-
// Compute blur inside each tile and vectorize its inner x.
401-
blur
402-
.compute_at(output, xo)
403-
.vectorize(x, 16);
404-
405-
// Cache RGB→gray per tile (reads interleaved input → keep unvectorized).
406-
gray
407-
.compute_at(output, xo)
408-
.store_at(output, xo);
396+
// Scheduling
397+
{
398+
Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
399+
400+
// Tile & parallelize the consumer
401+
output
402+
.tile(x, y, xo, yo, xi, yi, 128, 64)
403+
.parallel(yo);
404+
405+
// Cache RGB→gray per tile
406+
gray
407+
.compute_at(output, xo)
408+
.store_at(output, xo);
409+
}
409410
```
410411

411412
In this scheduling:
412413
* tile(...) splits the image into cache-friendly blocks and makes it easy to parallelize across tiles.
413-
* blur.compute_at(thresholded, xo) localizes the blur computation to each tile (it doesn’t force storing blur; it just computes it where it’s needed, keeping the working set small).
414+
* parallel(yo) distributes tiles across CPU cores, with each core handling one row (yo) of tiles.
414415
* gray.compute_at(...).store_at(...) materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile.
415-
* Vectorization is applied only to planar stages (blur, thresholded), gray stays unvectorized because it reads interleaved input (x-stride = channels).
416416

417417
Recompile your application as before, then run. What we observed on our machine:
418418
```output
419-
realize: 2.36 ms | 423.10 FPS | 877.34 MPix/s
419+
realize: 0.98 ms | 1023.15 FPS | 2121.60 MPix/s
420420
```
421421

422-
This was the fastest variant here—caching a planar grayscale per tile enabled efficient reuse and vectorized blur reads.
422+
This was the fastest variant here—caching a planar grayscale per tile enabled efficient reuse.
423423

424-
### Tiling for parallelization (without explicit intermediate storage)
425-
Tiling can also be used just to partition work across cores, without caching intermediates. This keeps the schedule simple: you split the output into tiles, parallelize across tiles, and vectorize along unit-stride x. Producers are computed inside each tile to keep the working set small, but don’t materialize extra tile-local buffers:
426-
```cpp
427-
// Tiling (partitioning only)
428-
Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
424+
### How we schedule
425+
In general, there is no one-size-fits-all scheduling rule for achieving the best performance; it depends on your pipeline's characteristics and the target device architecture. It is therefore worth exploring the scheduling options, and that is exactly what Halide's scheduling API is designed for.
429426

430-
output
431-
.tile(x, y, xo, yo, xi, yi, 128, 64) // try 128x64; tune per CPU
432-
.vectorize(xi, 16) // safe: planar, unit-stride along x
433-
.parallel(yo); // run tiles across cores
434-
435-
blur
436-
.compute_at(output, xo) // keep work tile-local
437-
.vectorize(x, 16); // vectorize planar blur
438-
```
439-
440-
What this does
441-
* tile(...) splits the image into cache-friendly blocks and makes parallelization straightforward.
442-
* parallel(yo) distributes tiles across CPU cores.
443-
* compute_at(thresholded, xo) evaluates blur per tile (better locality) without forcing extra storage.
444-
* Vectorization is applied to planar stages (blur, thresholded).
445-
446-
Recompile your application as before, then run. On our test machine, we got 5.56 ms (179.91 FPS, 373.07 MPix/s). This is slower than both the baseline and the parallelization-only schedule. The main reasons:
447-
* Recomputation of gray: with a 3×3 blur, each output reuses up to 9 neighbors; leaving gray inlined means RGB→gray is recomputed for each tap.
448-
* Interleaved input: gray reads BGR interleaved data (x-stride = channels), limiting unit-stride vectorization efficiency upstream.
449-
* Overhead vs. work: a 3×3 blur has low arithmetic intensity; extra tile/task overhead isn’t amortized.
450-
451-
Tiling without caching intermediates mainly helps partition work, but for tiny kernels on CPU (and interleaved sources) it often underperforms. The earlier “quick win” (blur.compute_root().parallel(y)) remains the better choice here.
452-
453-
### Tiling vs. parallelization
454-
* Parallelization spreads independent work across CPU cores. For this pipeline, the safest/most effective quick win was:
455-
```cpp
456-
blur.compute_root().parallel(y);
457-
thresholded.compute_root();
458-
```
459-
* Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (e.g., larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data. Caching gray per tile with a tiny 3×3 kernel over an interleaved source added overhead and ran slower.
460-
* Tiling for parallelization (partitioning only) simplifies work distribution and enables vectorization of planar stages, but with low arithmetic intensity (3×3) and an interleaved source it underperformed here.
461-
462-
When to choose what:
463-
* Start with parallelizing the main reduction at root.
464-
* Add tiling + caching only if: kernel ≥ 5×5, separable/multi-pass blur, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray).
465-
* Keep stages that read interleaved inputs unvectorized; vectorize only planar consumers.
427+
For example, in this application:
428+
* Start by parallelizing the outermost loop.
429+
* Add tiling + caching only if there is a spatial filter or the intermediate is reused by multiple consumers, and preferably after converting sources to planar (or precomputing a planar gray).
430+
* From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environment variable that lets you limit the number of threads Halide keeps in flight, as shown below.
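For instance, rerunning the binary with different `HL_NUM_THREADS` values is a quick way to see how the parallel schedule scales with core count (the thread counts below are just examples):

```console
% HL_NUM_THREADS=2 ./camera-capture-perf-measurement
% HL_NUM_THREADS=8 ./camera-capture-perf-measurement
```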
466431

467432
## Summary
468-
In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. The baseline landed at 4.84 ms (206.53 FPS, 428.25 MPix/s). A small, safe schedule tweak that parallelizes the blur reduction across rows improved performance to 3.80 ms (263.07 FPS, 545.49 MPix/s)—about +20%. A tiling schedule used only for partitioning was slower at 5.56 ms (179.91 FPS, 373.07 MPix/s). In contrast, tiling with a cached per-tile grayscale (so the blur reuses a planar intermediate) was the fastest at 2.36 ms (423.10 FPS, 877.34 MPix/s).
469-
470-
The pattern is clear. On CPU, with a small kernel and an interleaved camera source, the most reliable first step is to parallelize the main reduction across rows. Tiling pays off when you also cache a reused intermediate (e.g., a planar grayscale) so downstream stages get unit-stride, vectorizable access and better locality. Keep stages that read interleaved inputs unvectorized; vectorize planar consumers. From there, tune tile sizes and thread count for your target. Boundary conditions are handled once with repeat_edge, keeping edge behavior consistent and scheduling clean.
433+
In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. You then observed how parallelization and tiling improved performance.
471434

435+
* Parallelization spreads independent work across CPU cores.
436+
* Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (e.g., larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data.
