Commit 72ba7ae

Merge pull request #2446 from dawidborycki/LP-Halide-further-responses
Fixes
2 parents 3b3e4c5 + 619bf19 commit 72ba7ae

3 files changed, +255 -418 lines changed

content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md

Lines changed: 7 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -85,7 +85,7 @@ int main(int argc, char** argv) {
8585
}
8686
```
8787
88-
In the original implementation constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and output values (255, 0) are explicitly cast to uint8_t. This approach removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. Both approaches result in identical functionality, but explicitly casting helps emphasize the type correctness and may avoid subtle issues during cross-compilation or in certain environments.
88+
In the original implementation, the constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and output values (255, 0) are explicitly cast to uint8_t. This approach removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. Both approaches result in identical functionality, but explicit casting emphasizes type correctness and may avoid subtle issues during cross-compilation or in certain environments. Additionally, explicit uint8_t casts help avoid implicit promotion to 32-bit integers (and the corresponding narrowing back to 8-bit) in the generated code, reducing redundant cast operations and potential vector widen/narrow overhead, especially on ARM/NEON.
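The widening effect of untyped constants can be illustrated with a loose analogy in plain C++. This is only a sketch: it does not use Halide, Halide applies its own coercion rules to expressions, and the function names `threshold_implicit` and `threshold_explicit` are hypothetical. It shows that a conditional whose branches are bare literals yields an `int`, while one whose branches are typed as `uint8_t` stays 8 bits wide.

```cpp
#include <cstdint>
#include <type_traits>

// Plain C++ analogy (not Halide): both branches are int literals,
// so the deduced return type is int.
inline auto threshold_implicit(std::uint8_t px) {
    return px > 128 ? 255 : 0;
}

// Both branches are uint8_t objects, so the deduced return type is uint8_t.
inline auto threshold_explicit(std::uint8_t px) {
    const std::uint8_t T = 128, white = 255, black = 0;
    return px > T ? white : black;
}

// Compile-time confirmation of the two result types.
static_assert(std::is_same<decltype(threshold_implicit(std::uint8_t{0})), int>::value,
              "literal branches produce int");
static_assert(std::is_same<decltype(threshold_explicit(std::uint8_t{0})), std::uint8_t>::value,
              "typed branches stay 8-bit");
```

Both functions return the same values for every input; only the result type differs, which mirrors the point that the explicit casts change types rather than behavior.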
8989
9090
The program takes at least one command-line argument, the output base name used to generate the files (e.g., “blur_threshold_android”). Here, the target architecture is explicitly set within the code to Android ARM64:
9191
@@ -105,8 +105,12 @@ target.set_feature(Target::NoRuntime, false);
105105
```
106106

107107
Notes:
108-
* NoRuntime — When set to true, Halide excludes its runtime from the generated code, and you must link the runtime manually during the linking step. When set to false, the Halide runtime is included in the generated library, which simplifies deployment.
109-
* ARMFp16 — Enables the use of ARM hardware support for half-precision (16-bit) floating-point operations, which can provide faster execution when reduced precision is acceptable.
108+
1. NoRuntime — When set to true, Halide excludes its runtime from the generated code, and you must link the runtime manually during the linking step. When set to false, the Halide runtime is included in the generated library, which simplifies deployment.
109+
2. ARMFp16 — Enables the use of ARM hardware support for half-precision (16-bit) floating-point operations, which can provide faster execution when reduced precision is acceptable.
110+
3. Why the runtime choice matters. If your app links several AOT-compiled pipelines, ensure that exactly one Halide runtime is present at link time:
111+
* Strategy A (cleanest): build all pipelines with NoRuntime ON and link a single standalone Halide runtime once (matching the union of features you need, e.g., Vulkan/OpenCL/Metal or ARM options).
112+
* Strategy B: embed the runtime in exactly one pipeline (leave NoRuntime OFF only there); compile all other pipelines with NoRuntime ON.
113+
* Mixing more than one runtime can cause duplicate symbols and split global state (e.g., error handlers, device interfaces).
110114

111115
We declare spatial variables (x, y) and an ImageParam named “input” representing the input image data. We use boundary clamping (clamp) to safely handle edge pixels. Then, we apply a 3x3 blur with a reduction domain (RDom). The accumulated sum is divided by 9 (the number of pixels in the neighborhood), producing an average blurred image. Lastly, thresholding is applied, producing a binary output: pixels above a certain brightness threshold (128) become white (255), while others become black (0).
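For checking the generated pipeline, it can help to have a scalar reference of the same steps. The following is a minimal sketch in plain C++, not the Halide code from the learning path. It assumes a single-channel, row-major 8-bit image, and the helper name `blur_threshold_reference` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: clamp-to-edge 3x3 box blur followed by a threshold at 128.
// Assumes `in` holds width * height grayscale pixels in row-major order.
std::vector<std::uint8_t> blur_threshold_reference(
        const std::vector<std::uint8_t>& in, int width, int height) {
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    // Boundary clamping: out-of-range coordinates reuse the edge pixel.
                    const int cx = std::min(std::max(x + dx, 0), width - 1);
                    const int cy = std::min(std::max(y + dy, 0), height - 1);
                    sum += in[static_cast<std::size_t>(cy) * width + cx];
                }
            }
            const int blurred = sum / 9;  // average over the 9 neighbors
            // Pixels strictly above 128 become white, all others black.
            out[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint8_t>(blurred > 128 ? 255 : 0);
        }
    }
    return out;
}
```

Comparing this output with the result of the AOT-compiled function on a small test image is one way to confirm that a schedule or target change did not alter the result.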
112116

content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md

Lines changed: 6 additions & 59 deletions
Original file line number | Diff line number | Diff line change
@@ -42,7 +42,7 @@ static const char* schedule_name(Schedule s) {
4242
case Schedule::FuseBlurAndThreshold: return "FuseBlurAndThreshold";
4343
case Schedule::FuseAll: return "FuseAll";
4444
case Schedule::Tile: return "Tile";
45-
default: return "Unknown";
45+
default: return "Unknown";
4646
}
4747
}
4848

@@ -174,10 +174,10 @@ int main(int argc, char** argv) {
174174
if (!frame.isContinuous()) frame = frame.clone();
175175

176176
// Wrap interleaved frame
177-
auto in_rt = Runtime::Buffer<uint8_t>::make_interleaved(
178-
frame.data, frame.cols, frame.rows, /*channels*/3);
179-
Buffer<> in_fe(*in_rt.raw_buffer());
180-
input.set(in_fe);
177+
Halide::Buffer<uint8_t> inputBuf = Runtime::Buffer<uint8_t>::make_interleaved(
178+
frame.data, frame.cols, frame.rows, frame.channels());
179+
180+
input.set(inputBuf);
181181

182182
// Time the Halide realize() only
183183
auto t0 = std::chrono::high_resolution_clock::now();
@@ -232,32 +232,6 @@ int main(int argc, char** argv) {
232232
return 0;
233233
}
234234
```
235-
You will begin by pulling in the right set of headers. Right after the includes you define an enumeration, Schedule, which lists the four different scheduling strategies you want to experiment with. These represent the “modes” you will toggle between while the program is running: a simple materialized version, a fused blur-plus-threshold, a fully fused pipeline, and a tiled variant.
236-
237-
Finally, to make the output more readable, you add a small helper function, `schedule_name`. It converts each enum value into a human-friendly label so that when the program prints logs or overlays statistics, you can immediately see which schedule is active.
238-
```cpp
239-
#include "Halide.h"
240-
#include <opencv2/opencv.hpp>
241-
#include <chrono>
242-
#include <iomanip>
243-
#include <iostream>
244-
#include <string>
245-
#include <cstdint>
246-
#include <exception>
247-
248-
using namespace Halide;
249-
using namespace cv;
250-
using namespace std;
251-
252-
enum class Schedule : int {
253-
Simple = 0,
254-
FuseBlurAndThreshold = 1,
255-
FuseAll = 2,
256-
Tile = 3,
257-
};
258-
259-
static const char* schedule_name(Schedule s) { ... }
260-
```
261235
262236
The main part of this program is the `make_pipeline` function. It defines the camera processing pipeline in Halide and applies different scheduling choices depending on which mode we select.
263237
@@ -282,33 +256,6 @@ Next comes the gray conversion. As in previous section, you will use Rec.601 wei
282256

283257
You will then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline.
284258

285-
```cpp
286-
// (c) BGR → gray (Rec.601, float weights)
287-
Func gray("gray");
288-
gray(x, y) = cast<uint8_t>(0.114f * inputClamped(x, y, 0)
289-
+ 0.587f * inputClamped(x, y, 1)
290-
+ 0.299f * inputClamped(x, y, 2));
291-
292-
// (d) 3×3 binomial blur, unrolled in host code (no RDom needed)
293-
Func blur("blur");
294-
const uint16_t k[3][3] = {{1,2,1},{2,4,2},{1,2,1}};
295-
Expr blurSum = cast<uint16_t>(0);
296-
for (int j = 0; j < 3; ++j)
297-
for (int i = 0; i < 3; ++i)
298-
blurSum = blurSum + cast<uint16_t>(gray(x + i - 1, y + j - 1)) * k[j][i];
299-
blur(x, y) = cast<uint8_t>(blurSum / 16);
300-
301-
// (e) Threshold to binary
302-
Func thresholded("thresholded");
303-
Expr T = cast<uint8_t>(128);
304-
thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
305-
306-
// (f) Final output and default root
307-
Func output("output");
308-
output(x, y) = thresholded(x, y);
309-
output.compute_root();
310-
```
311-
312259
Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
313260
* Simple: Here you will explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
314261
* FuseBlurAndThreshold: You compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
@@ -470,7 +417,7 @@ Comparing the numbers:
470417

471418
By toggling schedules live, you can see and measure how operator fusion and materialization change both the loop structure and the throughput:
472419
* Fusion is the default in Halide and eliminates temporary storage, but may cause recomputation for spatial filters.
473-
* Materializing selected stages with compute_root() or compute_at() can reduce recomputation, enable vectorization and parallelization, and sometimes yield much higher throughput.
420+
* Materializing selected stages with compute_root() or compute_at() can reduce recomputation and improve locality. It can also make vectorization and parallelization easier or more effective, although neither strictly requires materialization and both can be applied independently. For best performance, consider these choices together and measure on your target.
474421
* Tile-level materialization (compute_at) provides a hybrid: it fuses within tiles while keeping intermediates small and cache-resident.
475422

476423
This demo makes these trade-offs concrete: the loop nest diagrams explain the structure, and the live FPS/MPix/s stats show the real performance impact.
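The recomputation cost of fusion can be made visible with a small counting experiment. The sketch below is plain C++, not Halide, and rests on several assumptions: the names `gray_at`, `run_fused`, and `run_materialized` are hypothetical, the gray stage uses an integer approximation of the Rec.601 weights, and coordinates are clamped to the image so the materialized buffer is exactly width x height.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PipelineResult {
    std::vector<std::uint8_t> output;  // binary image, 0 or 255
    std::size_t gray_evaluations;      // how many times the gray stage ran
};

// Gray stage on interleaved BGR with clamped coordinates; bumps `counter` per call.
inline std::uint8_t gray_at(const std::vector<std::uint8_t>& bgr, int w, int h,
                            int x, int y, std::size_t& counter) {
    x = std::min(std::max(x, 0), w - 1);
    y = std::min(std::max(y, 0), h - 1);
    const std::size_t i = (static_cast<std::size_t>(y) * w + x) * 3;
    ++counter;
    // Integer approximation of 0.114*B + 0.587*G + 0.299*R (weights sum to 256).
    return static_cast<std::uint8_t>((29 * bgr[i] + 150 * bgr[i + 1] + 77 * bgr[i + 2]) >> 8);
}

// 3x3 binomial blur plus threshold for one pixel, reading gray through `gray`.
template <typename GrayFn>
std::uint8_t blur_threshold_pixel(GrayFn gray, int x, int y) {
    static const int k[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
    int sum = 0;
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < 3; ++i)
            sum += k[j][i] * gray(x + i - 1, y + j - 1);
    return static_cast<std::uint8_t>(sum / 16 > 128 ? 255 : 0);
}

// Fused: gray is recomputed for every tap of every output pixel.
PipelineResult run_fused(const std::vector<std::uint8_t>& bgr, int w, int h) {
    PipelineResult r{std::vector<std::uint8_t>(static_cast<std::size_t>(w) * h), 0};
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            r.output[static_cast<std::size_t>(y) * w + x] = blur_threshold_pixel(
                [&](int gx, int gy) { return gray_at(bgr, w, h, gx, gy, r.gray_evaluations); },
                x, y);
    return r;
}

// Materialized: gray is computed once into a buffer, then the blur reads the buffer.
PipelineResult run_materialized(const std::vector<std::uint8_t>& bgr, int w, int h) {
    PipelineResult r{std::vector<std::uint8_t>(static_cast<std::size_t>(w) * h), 0};
    std::vector<std::uint8_t> gray(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            gray[static_cast<std::size_t>(y) * w + x] = gray_at(bgr, w, h, x, y, r.gray_evaluations);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            r.output[static_cast<std::size_t>(y) * w + x] = blur_threshold_pixel(
                [&](int gx, int gy) {
                    gx = std::min(std::max(gx, 0), w - 1);
                    gy = std::min(std::max(gy, 0), h - 1);
                    return gray[static_cast<std::size_t>(gy) * w + gx];
                },
                x, y);
    return r;
}
```

In this sketch both variants produce the same image, but the fused variant evaluates the gray stage nine times per output pixel while the materialized variant evaluates it once per pixel and pays for an extra width x height buffer instead. Which side of that trade wins depends on the cost of the producer stage and on memory bandwidth, so it still needs to be measured on the target device.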
