content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
## Objective
In this section, you will build a real-time camera processing pipeline using Halide. First, you capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, you will measure performance and then explore Halide’s scheduling options—parallelization and tiling—to understand when they help and when they don’t.
## Gaussian blur and thresholding
Create a new `camera-capture.cpp` file and modify it as follows:
```cpp
#include "Halide.h"
#include "HalideRuntime.h" // for Runtime::Buffer make_interleaved
// ...
```
The main loop continues capturing frames, running the Halide pipeline, and displaying the processed output in real-time until a key is pressed. This demonstrates how Halide integrates with OpenCV to build efficient, interactive image processing applications.
In the examples above, pixel coordinates are manually clamped with a helper function:
This ensures that when the reduction domain `r` extends beyond the image borders (for example, at the left or top edge), the coordinates are clipped into the valid range `[0, width-1]` and `[0, height-1]`. Manual clamping is explicit and easy to understand, but it scatters boundary-handling logic across the pipeline.
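To make the clamping behavior concrete, here is a minimal plain-C++ sketch (an analogue for intuition, not the Halide pipeline itself) of a 3×3 binomial blur that clips every tap coordinate into the valid range, so border pixels effectively replicate the edge:

```cpp
#include <algorithm>
#include <vector>

// Clamp a coordinate into [lo, hi]: the same idea as the helper in the text.
inline int clampi(int v, int lo, int hi) { return std::min(std::max(v, lo), hi); }

// 3x3 binomial (Gaussian-like) blur over a grayscale image. Every tap
// coordinate is clamped, so reads never go out of bounds at the borders.
std::vector<int> binomialBlur3x3(const std::vector<int>& in, int width, int height) {
    static const int k[3] = {1, 2, 1};  // 1-2-1 binomial weights; 2D total is 16
    std::vector<int> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int cx = clampi(x + dx, 0, width - 1);   // clip into valid range
                    int cy = clampi(y + dy, 0, height - 1);
                    sum += in[cy * width + cx] * k[dx + 1] * k[dy + 1];
                }
            }
            out[y * width + x] = sum / 16;  // normalize by the kernel's weight total
        }
    }
    return out;
}
```

A constant image passes through unchanged, including at the borders, which is exactly the edge-replication behavior the clamp provides.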
Halide provides an alternative through boundary condition functions, which wrap an existing Func and define its behavior outside the valid region. For the Gaussian blur, you can clamp the grayscale function instead of the raw input, producing a new function that automatically handles out-of-bounds coordinates:
```cpp
// Clamp the grayscale function instead of raw input
// ...
```
The output should look as in the figure below:

## Parallelization and Tiling
In this section, you will explore two complementary scheduling optimizations provided by Halide: Parallelization and Tiling. Both techniques help enhance performance but achieve it through different mechanisms—parallelization leverages multiple CPU cores, whereas tiling improves cache efficiency by optimizing data locality.
You will look at each technique separately, for clarity and to emphasize their distinct benefits.
Let’s first lock in a measurable baseline before changing the schedule. You will create a second file, `camera-capture-perf-measurement.cpp`, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide `realize()` call. This lets you quantify each optimization you will add next (parallelization, tiling, caching).
Create `camera-capture-perf-measurement.cpp` with the following code:
```cpp
#include "Halide.h"
#include "HalideRuntime.h"
// ...

int main() {
    // ...
    return 0;
}
```
What this gives you:
* The console prints ms, FPS, and MPix/s per frame, measured strictly around `realize()` (camera capture and UI are excluded).
* The very first line is labeled `[warm-up]` because it includes Halide’s JIT compilation. You can ignore it when comparing schedules.
* MPix/s = (width × height) / seconds is a good resolution-agnostic metric for comparing schedule variants.
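The metrics above are simple arithmetic around a timed call. A small sketch of how they can be computed with `std::chrono` (the `FrameStats` helper is illustrative, not from the lesson's code):

```cpp
#include <chrono>

// Per-frame metrics derived from the wall-clock time spent in realize():
// ms = elapsed milliseconds, FPS = 1/seconds, MPix/s = megapixels per second.
struct FrameStats {
    double ms;
    double fps;
    double mpixPerSec;
};

FrameStats computeStats(int width, int height, std::chrono::duration<double> elapsed) {
    double sec = elapsed.count();
    FrameStats s;
    s.ms = sec * 1000.0;
    s.fps = 1.0 / sec;
    s.mpixPerSec = (double(width) * height) / sec / 1e6;
    return s;
}
```

For a 1920×1080 frame processed in 80.70 ms, this reproduces the sample output's 12.39 FPS and 25.70 MPix/s (the frame size is an assumption consistent with those numbers).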
Build and run the application. Here is the sample output:

```output
Halide realize: 80.70 ms | 12.39 FPS | 25.70 MPix/s
```
This gives an average FPS of 12.48, and average throughput of 25.88 MPix/s. Now you can start measuring potential improvements from scheduling.
### Parallelization
Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. For image pipelines, rows (or tiles of rows) are naturally parallel: each can be processed independently once producer data is available. By distributing work across cores, you reduce wall-clock time—crucial for real-time video.
With the baseline measured, you will apply a minimal schedule that parallelizes the blur reduction across rows while keeping the threshold stage at root. This avoids tricky interactions between a parallel consumer and an unscheduled reduction (a common source of internal errors).
Add these lines right after the threshold definition (and before any `realize()`):
```cpp
// Compute blur at root and parallelize it across rows
blur.compute_root().parallel(y);
```

This does two important things:
* `compute_root()` on blur moves the reduction to the top level, so it isn’t nested under a parallel loop that might complicate reduction ordering.
* `parallel(y)` parallelizes over the pure loop variable `y` (rows), not the reduction domain `r`, which is the safe, idiomatic way to parallelize reductions in Halide.
Now rebuild and run the application again. On our machine, this schedule lifted performance to about 14.79 FPS (30.67 MPix/s).
Tiling splits the image into cache-friendly blocks (tiles). Two wins:
* Partitioning: tiles are easy to parallelize across cores.
* Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit L1/L2 more often.
Now let's look at both flavors.
### Tiling with explicit intermediate storage (best for cache efficiency)
Here you will cache `gray` once per tile so the 3×3 blur can reuse it instead of recomputing RGB → gray up to 9× per output pixel.
Before using this, remove any earlier `compute_root().parallel(y)` schedule for blur.
Recompile your application as before, then run. On our machine, this version ran at roughly 8.2 FPS (17.0 MPix/s), slower than the simple parallel schedule.
This pattern shines when the cached intermediate is expensive and reused a lot (bigger kernels, multi-use intermediates, or separable/multi-stage pipelines). For a tiny 3×3 on CPU, the benefit often doesn’t amortize.
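The "up to 9×" recomputation figure, and why a tiny 3×3 kernel often fails to amortize the cache, can be sanity-checked with back-of-envelope arithmetic. The sketch below counts producer ("gray") evaluations per tile with and without a per-tile cache; the tile size and the one-pixel halo are assumptions for a 3×3 kernel, and real costs also include the cache's store/load traffic, which is exactly the overhead that can erase the win:

```cpp
// Producer evaluations a 3x3 consumer needs for one T x T output tile.

// Without caching: each of the 9 taps recomputes the producer.
long evalsWithoutCache(long tile) {
    return tile * tile * 9;
}

// With a per-tile cache: the producer is computed once per pixel of the
// tile plus a 1-pixel halo on each side, then reused by every tap.
long evalsWithCache(long tile) {
    return (tile + 2) * (tile + 2);
}
```

For a 64×64 tile this is 36,864 evaluations versus 4,356, roughly an 8.5× reduction in compute. Whether that translates to wall-clock speedup depends on how expensive one producer evaluation is relative to the memory traffic of the tile-local buffer.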
### Tiling for parallelization (without explicit intermediate storage)
Tiling can also be used just to partition work across cores, without caching intermediates. This keeps the schedule simple: you split the output into tiles, parallelize across tiles, and vectorize along unit-stride x. Producers are computed inside each tile to keep the working set small, but you don’t materialize extra tile-local buffers:
* Keep stages that read interleaved inputs unvectorized; vectorize only planar consumers.
## Summary
In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. The baseline settled around 12.48 FPS (25.88 MPix/s). A small, safe schedule tweak that parallelizes the blur reduction across rows lifted performance to about 14.79 FPS (30.67 MPix/s). In contrast, tiling used only for partitioning landed near 9.35 FPS (19.40 MPix/s), and tiling with a cached per-tile grayscale buffer was slower still at roughly 8.2 FPS (17.0 MPix/s).
The pattern is clear. On CPU, with a small kernel and an interleaved camera source, parallelizing the reduction is the most effective first step. Tiling starts to pay off only when an expensive intermediate is reused enough to amortize the overhead, e.g., after making the blur separable (horizontal + vertical), producing a planar grayscale once per frame with `gray.compute_root()`, and applying boundary conditions to unlock interior fast paths. From there, tune tile sizes and thread count to squeeze out the remaining headroom.
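The separable-blur suggestion above can be sketched in plain C++ (an analogue for intuition, not Halide code; names are illustrative): a horizontal 1-2-1 pass followed by a vertical 1-2-1 pass produces the same result as the full 3×3 binomial kernel while cutting the taps per pixel from 9 to 6.

```cpp
#include <vector>

// Clamp a coordinate into [lo, hi] (edge replication at the borders).
inline int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

// Separable 3x3 binomial blur: horizontal 1-2-1, then vertical 1-2-1,
// normalizing by the combined weight total of 16 at the end.
std::vector<int> separableBlur3x3(const std::vector<int>& in, int w, int h) {
    std::vector<int> tmp(in.size()), out(in.size());
    for (int y = 0; y < h; ++y)          // horizontal pass
        for (int x = 0; x < w; ++x)
            tmp[y * w + x] = in[y * w + clampi(x - 1, 0, w - 1)]
                           + 2 * in[y * w + x]
                           + in[y * w + clampi(x + 1, 0, w - 1)];
    for (int y = 0; y < h; ++y)          // vertical pass + normalize
        for (int x = 0; x < w; ++x)
            out[y * w + x] = (tmp[clampi(y - 1, 0, h - 1) * w + x]
                           + 2 * tmp[y * w + x]
                           + tmp[clampi(y + 1, 0, h - 1) * w + x]) / 16;
    return out;
}
```

In Halide terms, the intermediate `tmp` is the kind of reusable, moderately expensive producer that makes per-tile caching start to pay off.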