Commit 17bbdcb

Update processing-workflow.md
1 parent 9f15f59 commit 17bbdcb

File tree

1 file changed (+18, -18 lines)


content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md

Lines changed: 18 additions & 18 deletions
@@ -8,10 +8,10 @@ layout: "learningpathall"
 ---
 
 ## Objective
-In this section, we will build a real-time camera processing pipeline using Halide. First, we capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, we will measure performance and then explore Halide’s scheduling options—parallelization and tiling—to understand when they help and when they don’t.
+In this section, you will build a real-time camera processing pipeline using Halide. First, you capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, you will measure performance and then explore Halide’s scheduling options—parallelization and tiling—to understand when they help and when they don’t.
 
 ## Gaussian blur and thresholding
-Create a new camera-capture.cpp file and modify it as follows:
+Create a new `camera-capture.cpp` file and modify it as follows:
 ```cpp
 #include "Halide.h"
 #include "HalideRuntime.h" // for Runtime::Buffer make_interleaved
@@ -178,7 +178,7 @@ imshow("Processed Image", blurredThresholded);
 
 The main loop continues capturing frames, running the Halide pipeline, and displaying the processed output in real-time until a key is pressed. This demonstrates how Halide integrates with OpenCV to build efficient, interactive image processing applications.
 
-In the examples above, pixel coordinates were manually clamped with a helper function:
+In the examples above, pixel coordinates are manually clamped with a helper function:
 
 ```cpp
 gray(clampCoord(x + r.x - 1, width),
@@ -187,7 +187,7 @@ gray(clampCoord(x + r.x - 1, width),
 
 This ensures that when the reduction domain r extends beyond the image borders (for example, at the left or top edge), the coordinates are clipped into the valid range [0, width-1] and [0, height-1]. Manual clamping is explicit and easy to understand, but it scatters boundary-handling logic across the pipeline.
 
-Halide provides an alternative through boundary condition functions, which wrap an existing Func and define its behavior outside the valid region. For the Gaussian blur, we can clamp the grayscale function instead of the raw input, producing a new function that automatically handles out-of-bounds coordinates:
+Halide provides an alternative through boundary condition functions, which wrap an existing Func and define its behavior outside the valid region. For the Gaussian blur, you can clamp the grayscale function instead of the raw input, producing a new function that automatically handles out-of-bounds coordinates:
 ```cpp
 // Clamp the grayscale function instead of raw input
 Halide::Func grayClamped = Halide::BoundaryConditions::repeat_edge(gray);
@@ -218,13 +218,13 @@ The output should look as in the figure below:
 ![img3](Figures/03.png)
 
 ## Parallelization and Tiling
-In this section, we will explore two complementary scheduling optimizations provided by Halide: Parallelization and Tiling. Both techniques help enhance performance but achieve it through different mechanisms—parallelization leverages multiple CPU cores, whereas tiling improves cache efficiency by optimizing data locality.
+In this section, you will explore two complementary scheduling optimizations provided by Halide: parallelization and tiling. Both techniques improve performance, but through different mechanisms—parallelization leverages multiple CPU cores, whereas tiling improves cache efficiency by optimizing data locality.
 
-Below, we’ll demonstrate each technique separately for clarity and to emphasize their distinct benefits.
+You will apply each technique separately, for clarity and to emphasize its distinct benefits.
 
-Let’s first lock in a measurable baseline before we start changing the schedule. We’ll make a second file, camera-capture-perf-measurement.cpp, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide realize() call. This lets us quantify each optimization we add next (parallelization, tiling, caching).
+First, lock in a measurable baseline before changing the schedule. Create a second file, `camera-capture-perf-measurement.cpp`, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide realize() call. This lets you quantify each optimization you add next (parallelization, tiling, caching).
 
-Create camera-capture-perf-measurement.cpp with the following code:
+Create `camera-capture-perf-measurement.cpp` with the following code:
 ```cpp
 #include "Halide.h"
 #include "HalideRuntime.h"
@@ -362,13 +362,13 @@ int main() {
 return 0;
 }
 ```
-
-What this gives us:
+
 * The console prints ms, FPS, and MPix/s per frame, measured strictly around realize() (camera capture and UI are excluded).
-* The very first line is labeled [warm-up] because it includes Halide’s JIT compilation. We can ignore it when comparing schedules.
+* The very first line is labeled [warm-up] because it includes Halide’s JIT compilation. You can ignore it when comparing schedules.
 * MPix/s = (width*height)/seconds is a good resolution-agnostic metric to compare schedule variants.
 
 Build and run the application. Here is the sample output:
+
 ```console
 % ./camera-capture-perf-measurement
 [warm-up] Halide realize: 327.13 ms | 3.06 FPS | 6.34 MPix/s
@@ -417,12 +417,12 @@ Halide realize: 79.19 ms | 12.63 FPS | 26.19 MPix/s
 Halide realize: 80.70 ms | 12.39 FPS | 25.70 MPix/s
 ```
 
-This gives an rverage FPS of 12.48, and average throughput of 25.88 MPix/s. Now let’s start measuring potential improvements from scheduling.
+This gives an average FPS of 12.48 and an average throughput of 25.88 MPix/s. Now you can start measuring potential improvements from scheduling.
 
 ### Parallelization
 Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. For image pipelines, rows (or tiles of rows) are naturally parallel: each can be processed independently once producer data is available. By distributing work across cores, we reduce wall-clock time—crucial for real-time video.
 
-With the baseline measured, we apply a minimal schedule that parallelizes the blur reduction across rows while keeping the threshold stage at root. This avoids tricky interactions between a parallel consumer and an unscheduled reduction (a common source of internal errors).
+With the baseline measured, you will apply a minimal schedule that parallelizes the blur reduction across rows while keeping the threshold stage at root. This avoids tricky interactions between a parallel consumer and an unscheduled reduction (a common source of internal errors).
 
 Add these lines right after the threshold definition (and before any realize()):
 ```cpp
@@ -434,7 +434,7 @@ This does two important things:
 * compute_root() on blur moves the reduction to the top level, so it isn’t nested under a parallel loop that might complicate reduction ordering.
 * parallel(y) parallelizes over the pure loop variable y (rows), not the reduction domain r, which is the safe/idiomatic way to parallelize reductions in Halide.
 
-Let's re-buld and re-run the app. The results should look like here:
+Now rebuild and run the application again. The results should look like this:
 ```output
 % ./camera-capture-perf-measurement
 [warm-up] Halide realize: 312.66 ms | 3.20 FPS | 6.63 MPix/s
@@ -497,10 +497,10 @@ Tiling splits the image into cache-friendly blocks (tiles). Two wins:
 * Partitioning: tiles are easy to parallelize across cores.
 * Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit L1/L2 more often.
 
-Below we show both flavors.
+Now let’s look at both flavors.
 
 ### Tiling with explicit intermediate storage (best for cache efficiency)
-Here we cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB -> gray up to 9× per output pixel.
+Here you will cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB -> gray up to 9× per output pixel.
 
 Before using this, remove any earlier compute_root().parallel(y) schedule for blur.
 
@@ -536,7 +536,7 @@ Recompile your application as before, then run. On our machine, this version ran
 This pattern shines when the cached intermediate is expensive and reused a lot (bigger kernels, multi-use intermediates, or separable/multi-stage pipelines). For a tiny 3×3 on CPU, the benefit often doesn’t amortize.
 
 ### Tiling for parallelization (without explicit intermediate storage)
-Tiling can also be used just to partition work across cores, without caching intermediates. This keeps the schedule simple: we split the output into tiles, parallelize across tiles, and vectorize along unit-stride x. Producers are computed inside each tile to keep the working set small, but we don’t materialize extra tile-local buffers:
+Tiling can also be used just to partition work across cores, without caching intermediates. This keeps the schedule simple: you split the output into tiles, parallelize across tiles, and vectorize along unit-stride x. Producers are computed inside each tile to keep the working set small, but you don’t materialize extra tile-local buffers:
 ```cpp
 // Tiling (partitioning only)
 Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
@@ -579,7 +579,7 @@ When to choose what:
 * Keep stages that read interleaved inputs unvectorized; vectorize only planar consumers.
 
 ## Summary
-In this section, we built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. The baseline settled around 12.48 FPS (25.88 MPix/s). A small, safe schedule tweak that parallelizes the blur reduction across rows lifted performance to about 14.79 FPS (30.67 MPix/s). In contrast, tiling used only for partitioning landed near 9.35 FPS (19.40 MPix/s), and tiling with a cached per-tile grayscale buffer was slower still at roughly 8.2 FPS (17.0 MPix/s).
+In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. The baseline settled around 12.48 FPS (25.88 MPix/s). A small, safe schedule tweak that parallelizes the blur reduction across rows lifted performance to about 14.79 FPS (30.67 MPix/s). In contrast, tiling used only for partitioning landed near 9.35 FPS (19.40 MPix/s), and tiling with a cached per-tile grayscale buffer was slower still at roughly 8.2 FPS (17.0 MPix/s).
 
 The pattern is clear. On CPU, with a small kernel and an interleaved camera source, parallelizing the reduction is the most effective first step. Tiling starts to pay off only when an expensive intermediate is reused enough to amortize the overhead, e.g., after making the blur separable (horizontal+vertical), producing a planar grayscale once per frame with gray.compute_root(), and applying boundary conditions to unlock interior fast paths. From there, tune tile sizes and thread count to squeeze out the remaining headroom.
 