diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
index b351d5484..1b45f5ed8 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
@@ -1,19 +1,15 @@
---
-title: Halide Essentials From Basics to Android Integration

-draft: true
-cascade:
-  draft: true
+title: Build high-performance image processing with Halide on Android

minutes_to_complete: 180

-who_is_this_for: This is an introductory topic for software developers interested in learning how to use Halide for image processing.
+who_is_this_for: This is an introductory topic for developers interested in learning how to use Halide for image processing.

learning_objectives:
-    - Understand foundational concepts of Halide and set up your development environment.
-    - Create a basic real-time image processing pipeline using Halide.
-    - Optimize image processing workflows by applying operation fusion in Halide.
-    - Integrate Halide pipelines into Android applications developed with Kotlin.
+    - Learn the basics of Halide and set up your development environment
+    - Build a simple real-time image processing pipeline with Halide
+    - Make your image processing faster by combining operations in Halide
+    - Use Halide pipelines in Android apps written with Kotlin

prerequisites:
    - Basic C++ knowledge

@@ -31,15 +27,20 @@ operatingsystems:
    - Android
tools_software_languages:
    - Android Studio
-    - Coding
+    - Halide
+    - C++
+    - Kotlin
+    - CMake
+
further_reading:
    - resource:
-        title: Halide 19.0.0
+        title: Halide documentation
        link: https://halide-lang.org/docs/index.html
        type: website
    - resource:
-        title: Halide GitHub
+        title: Halide GitHub repository
        link: https://github.com/halide/Halide
        type: repository
    - resource:
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
index ba6eb6397..9e9bb9613 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
@@ -1,34 +1,34 @@
---
# User change
-title: "Integrating Halide into an Android (Kotlin) Project"
+title: "Integrate Halide into an Android project with Kotlin"

weight: 6

layout: "learningpathall"
---

-## Objective
-In this lesson, we’ll learn how to integrate a high-performance Halide image-processing pipeline into an Android application using Kotlin.
+## What you'll build
+In this section you'll integrate a high-performance Halide image-processing pipeline into an Android application using Kotlin.

-## Overview of mobile integration with Halide
+## Learn about mobile integration with Halide
Android is the world’s most widely-used mobile operating system, powering billions of devices across diverse markets. This vast user base makes Android an ideal target platform for developers aiming to reach a broad audience, particularly in applications requiring sophisticated image and signal processing, such as augmented reality, photography, video editing, and real-time analytics.

Kotlin, now the preferred programming language for Android development, combines concise syntax with robust language features, enabling developers to write maintainable, expressive, and safe code.
It offers seamless interoperability with existing Java codebases and straightforward integration with native code via JNI, simplifying the development of performant mobile applications.

-## Benefits of using Halide on mobile
+## Explore the benefits of using Halide on mobile
Integrating Halide into Android applications brings several key advantages:
-1. Performance. Halide enables significant acceleration of complex image processing algorithms, often surpassing the speed of traditional Java or Kotlin implementations by leveraging optimized code generation. By generating highly optimized native code tailored for ARM CPUs or GPUs, Halide can dramatically increase frame rates and responsiveness, essential for real-time or interactive applications.
-2. Efficiency. On mobile devices, resource efficiency translates directly to improved battery life and reduced thermal output. Halide's scheduling strategies (such as operation fusion, tiling, parallelization, and vectorization) minimize unnecessary memory transfers, CPU usage, and GPU overhead. This optimization substantially reduces overall power consumption, extending battery life and enhancing the user experience by preventing overheating.
-3. Portability. Halide abstracts hardware-specific details, allowing developers to write a single high-level pipeline that easily targets different processor architectures and hardware configurations. Pipelines can seamlessly run on various ARM-based CPUs and GPUs commonly found in Android smartphones and tablets, enabling developers to support a wide range of devices with minimal platform-specific modifications.
-4. Custom Algorithm Integration. Halide allows developers to easily integrate their bespoke image-processing algorithms that may not be readily available or optimized in common libraries, providing full flexibility and control over application-specific performance and functionality.
+- Performance: Halide enables significant acceleration of complex image processing algorithms, often surpassing the speed of traditional Java or Kotlin implementations by leveraging optimized code generation. By generating highly optimized native code tailored for Arm CPUs or GPUs, Halide can dramatically increase frame rates and responsiveness, essential for real-time or interactive applications.
+- Efficiency: on mobile devices, resource efficiency translates directly to improved battery life and reduced thermal output. Halide's scheduling strategies (such as operation fusion, tiling, parallelization, and vectorization) minimize unnecessary memory transfers, CPU usage, and GPU overhead. This optimization substantially reduces overall power consumption, extending battery life and enhancing the user experience by preventing overheating.
+- Portability: Halide abstracts hardware-specific details, allowing developers to write a single high-level pipeline that easily targets different processor architectures and hardware configurations. Pipelines can seamlessly run on various Arm-based CPUs and GPUs commonly found in Android smartphones and tablets, enabling developers to support a wide range of devices with minimal platform-specific modifications.
+- Custom algorithm integration: Halide allows developers to easily integrate their bespoke image-processing algorithms that may not be readily available or optimized in common libraries, providing full flexibility and control over application-specific performance and functionality.
In short, Halide delivers high-performance image processing without sacrificing portability or efficiency, a balance particularly valuable on resource-constrained mobile devices.

-### Android development ecosystem and challenges
+### Navigate Android development challenges
While Android presents abundant opportunities for developers, the mobile development ecosystem brings its own set of challenges, especially for performance-intensive applications:
-1. Limited Hardware Resources. Unlike desktop or server environments, mobile devices have significant constraints on processing power, memory capacity, and battery life. Developers must optimize software meticulously to deliver smooth performance while carefully managing hardware resource consumption. Leveraging tools like Halide allows developers to overcome these constraints by optimizing computational workloads, making resource-intensive tasks feasible on constrained hardware.
-2. Cross-Compilation Complexities. Developing native code for Android requires handling multiple hardware architectures (such as armv8-a, ARM64, and sometimes x86/x86_64). Cross-compilation introduces complexities due to different instruction sets, CPU features, and performance characteristics. Managing this complexity involves careful use of the Android NDK, understanding toolchains, and correctly configuring build systems (e.g., Gradle, CMake). Halide helps mitigate these issues by abstracting away many platform-specific optimizations, automatically generating code optimized for target architectures.
-3. Image-Format Conversions (Bitmap ↔ Halide Buffer). Android typically handles images through the Bitmap class or similar platform-specific constructs, whereas Halide expects image data to be in raw, contiguous buffer formats. Developers must bridge the gap between Android-specific image representations (Bitmaps, YUV images from camera APIs, etc.) and Halide's native buffer format. Proper management of these conversions—including considerations for pixel formats, stride alignment, and memory copying overhead—can significantly impact performance and correctness, necessitating careful design and efficient implementation of buffer-handling routines.
+- Limited hardware resources: unlike desktop or server environments, mobile devices have significant constraints on processing power, memory capacity, and battery life. Developers must optimize software meticulously to deliver smooth performance while carefully managing hardware resource consumption. Leveraging tools like Halide allows developers to overcome these constraints by optimizing computational workloads, making resource-intensive tasks feasible on constrained hardware.
+- Cross-compilation complexities: developing native code for Android requires handling multiple hardware architectures (such as the armeabi-v7a and arm64-v8a Arm ABIs, and sometimes x86/x86_64). Cross-compilation introduces complexities due to different instruction sets, CPU features, and performance characteristics. Managing this complexity involves careful use of the Android NDK, understanding toolchains, and correctly configuring build systems (e.g., Gradle, CMake). Halide helps mitigate these issues by abstracting away many platform-specific optimizations, automatically generating code optimized for target architectures.
+- Image format conversions (Bitmap ↔ Halide Buffer): Android typically handles images through the Bitmap class or similar platform-specific constructs, whereas Halide expects image data to be in raw, contiguous buffer formats.
Developers must bridge the gap between Android-specific image representations (Bitmaps, YUV images from camera APIs, etc.) and Halide's native buffer format. Proper management of these conversions—including considerations for pixel formats, stride alignment, and memory copying overhead—can significantly impact performance and correctness, necessitating careful design and efficient implementation of buffer-handling routines.

## Project requirements
Before integrating Halide into your Android application, ensure you have the necessary tools and libraries.
@@ -37,11 +37,11 @@ Before integrating Halide into your Android application, ensure you have the nec
1. Android Studio. [Download link](https://developer.android.com/studio).
2. Android NDK (Native Development Kit). Can be easily installed from Android Studio (Tools → SDK Manager → SDK Tools → Android NDK).

-## Setting up the Android project
-### Creating the project
+## Set up the Android project
+### Create the project
1. Open Android Studio.
2. Select New Project > Native C++.
-![img4](Figures/04.webp)
+![Android Studio New Project dialog showing Native C++ template selected. The dialog displays options for project name, language, and minimum SDK. The primary subject is the Native C++ template highlighted in the project creation workflow. The wider environment is a typical Android Studio interface with a neutral, technical tone. Visible text includes Native C++ and fields for configuring the new project.](Figures/04.webp)

### Configure the project
1. Set the project Name to Arm.Halide.AndroidDemo.
@@ -152,8 +152,9 @@ dependencies {
Click the Sync Now button at the top.

To verify that everything is configured correctly, click Build > Make Project in Android Studio.

-## UI
-Now, you'll define the application's User Interface, consisting of two buttons and an ImageView. One button loads the image, the other processes it, and the ImageView displays both the original and processed images.
+## Define the user interface
+Define the application's user interface, consisting of two buttons and an ImageView. One button loads the image, the other processes it, and the ImageView displays both the original and processed images.
+
1. Open the res/layout/activity_main.xml file, and modify it as follows:

```XML
@@ -204,8 +205,8 @@ Now you can run the app to view the UI:

![img7](Figures/07.webp)

-## Processing
-You will now implement the image processing code. First, pick up an image you want to process. Here we use the camera man. Then, under the Arm.Halide.AndroidDemo/src/main create assets folder, and save the image under that folder as img.png.
+## Implement image processing
+Implement the image processing code. First, pick an image you want to process. This example uses the camera man image. Under Arm.Halide.AndroidDemo/src/main, create an assets folder and save the image as img.png.

Now, open MainActivity.kt and modify it as follows:

```java
@@ -330,13 +331,13 @@ class MainActivity : AppCompatActivity() {
}
```

-This Kotlin Android application demonstrates integrating a Halide-generated image-processing pipeline within an Android app. The main activity (MainActivity) manages loading and processing an image stored in the application’s asset folder.
+This Kotlin Android application demonstrates integrating a Halide-generated image-processing pipeline within an Android app. The main activity (MainActivity) manages loading and processing an image stored in the application's asset folder.

-When the app launches, the Process Image button is disabled.
When a user taps Load Image, the app retrieves img.png from its assets directory and displays it within the ImageView, simultaneously enabling the Process Image button for further interaction.
+When the app launches, the Process Image button starts out disabled. When you tap Load Image, the app retrieves img.png from its assets directory and displays it within the ImageView, simultaneously enabling the Process Image button for further interaction.

Upon pressing the Process Image button, the following sequence occurs:
1. Background Processing. A Kotlin coroutine initiates processing on a background thread, ensuring the application’s UI remains responsive.
-2. Conversion to Grayscale. The loaded bitmap image is converted into a grayscale byte array using a simple RGB-average method, preparing it for processing by the native (JNI) layer.
+2. Conversion to Grayscale. The loaded bitmap image is converted into a grayscale byte array using a simple RGB (Red-Green-Blue) average method, preparing it for processing by the native (JNI) layer.
3. Native Function Invocation. This grayscale byte array, along with image dimensions, is passed to a native function (blurThresholdImage) defined via JNI. This native function is implemented using the Halide pipeline, performing operations such as blurring and thresholding directly on the image data.
4. Post-processing. After the native function completes, the resulting processed grayscale byte array is converted back into a Bitmap image.
5. UI Update. The coroutine then updates the displayed image (on the main UI thread) with this newly processed bitmap, providing the user immediate visual feedback.

The code defines three utility methods:
2. extractGrayScaleBytes - converts a Bitmap into a grayscale byte array suitable for native processing.
3. createBitmapFromGrayBytes - converts a grayscale byte array back into a Bitmap for display purposes.

-Note that performing the grayscale conversion in Halide allows us to exploit operator fusion, further improving performance by avoiding intermediate memory accesses. This could be done as in our examples before (processing-workflow).
+Note that performing the grayscale conversion in Halide allows you to exploit operator fusion, further improving performance by avoiding intermediate memory accesses. You can do this as shown in the earlier processing-workflow examples.

The JNI integration occurs through an external method declaration, blurThresholdImage, loaded via the companion object at app startup. The native library (armhalideandroiddemo) containing this function is compiled separately and integrated into the application (native-lib.cpp).

-You will now need to create blurThresholdImage function. To do so, in Android Studio put the cursor above blurThresholdImage function, and then click Create JNI function for blurThresholdImage:
+Create the blurThresholdImage function. In Android Studio, put the cursor above the blurThresholdImage function, and then select Create JNI function for blurThresholdImage:
![img8](Figures/08.webp)

This will generate a new function in the native-lib.cpp:
@@ -404,9 +405,9 @@ This C++ function acts as a bridge between Java (Kotlin) and native code. Specif

The input Java byte array (input_bytes) is accessed and pinned into native memory via GetByteArrayElements. This provides a direct pointer (inBytes) to the grayscale data sent from Kotlin. The raw grayscale byte data is wrapped into a Halide::Runtime::Buffer object (inputBuffer).
This buffer structure is required by the Halide pipeline. An output buffer (outputBuffer) is created with the same dimensions as the input image. This buffer will store the result produced by the Halide pipeline. The native function invokes the Halide-generated AOT function blur_threshold, passing in both the input and output buffers. After processing, a new Java byte array (outputArray) is allocated to hold the processed grayscale data. The processed data from the Halide output buffer is copied into this Java array using SetByteArrayRegion. The native input buffer (inBytes) is explicitly released using ReleaseByteArrayElements, specifying JNI_ABORT as no changes were made to the input array. Finally, the processed byte array (outputArray) is returned to Kotlin.

-Through this JNI bridge, Kotlin can invoke high-performance native code. You can now re-run the application. Click the Load Image button, and then Process Image. You will see the following results:
+Through this JNI bridge, Kotlin can invoke high-performance native code. You can now re-run the application. Select the Load Image button, and then Process Image. You'll see the following results:

-![img9](Figures/09.png)
+![Android app screenshot showing the Arm Halide Android demo interface. The screen displays two buttons labeled Load Image and Process Image, with the Process Image button enabled. Below the buttons, an ImageView shows a grayscale photo of a camera man standing outdoors, holding a camera and tripod. The environment appears neutral and technical, with no visible emotional tone. The layout is centered and uses a simple vertical arrangement, making the interface easy to navigate for users with visual impairment.](Figures/09.png)
![img10](Figures/10.png)

-In the above code we created a new jbyteArray and copying the data explicitly, which can result in an additional overhead. To optimize performance by avoiding unnecessary memory copies, you can directly wrap Halide's buffer in a Java-accessible ByteBuffer like so
+In the above code you created a new jbyteArray and copied the data explicitly, which can add overhead. To avoid unnecessary memory copies, you can wrap Halide's output buffer directly in a Java-accessible ByteBuffer:

```cpp
jobject outputBuffer = env->NewDirectByteBuffer(output.data(), width * height);
```

## Summary
-In this lesson, we’ve successfully integrated a Halide image-processing pipeline into an Android application using Kotlin. We started by setting up an Android project configured for native development with the Android NDK, employing Kotlin as the primary language. We then integrated Halide-generated static libraries and demonstrated their usage through Java Native Interface (JNI), bridging Kotlin and native code. This equips developers with the skills needed to harness Halide's capabilities for building sophisticated, performant mobile applications on Android.
+You've successfully integrated a Halide image-processing pipeline into an Android application using Kotlin. You started by setting up an Android project configured for native development with the Android NDK, using Kotlin as the primary language. You then integrated Halide-generated static libraries and demonstrated their usage through Java Native Interface (JNI), bridging Kotlin and native code. You now have the skills needed to harness Halide's capabilities for building sophisticated, performant mobile applications on Android.
\ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md index f4003f1f5..b74495cba 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md @@ -1,22 +1,25 @@ --- # User change -title: "Ahead-of-time and cross-compilation" +title: "Generate optimized Halide pipelines for Android using ahead-of-time cross-compilation" weight: 5 layout: "learningpathall" --- -## Ahead-of-time and cross-compilation -One of Halide's standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling developers to generate optimized binary code on their host machines rather than compiling directly on target devices. This AOT compilation process allows developers to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation. -Halide also supports robust cross-compilation capabilities. Cross-compilation means using the host version of Halide, typically running on a desktop Linux or macOS system—to target different architectures, such as ARM for Android devices. Developers can thus optimize Halide pipelines on their host machine, produce libraries specifically optimized for Android, and integrate them seamlessly into Android applications. The generated pipeline code includes essential optimizations and can embed minimal runtime support, further reducing workload on the target device and ensuring responsiveness and efficiency. +## What you'll build +In this section, you'll leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms. -## Objective -In this section, we leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms. -## Prepare Pipeline for Android -The procedure implemented in the following code demonstrates how Halide's AOT compilation and cross-compilation features can be utilized to create an optimized image processing pipeline for Android. We will run Halide on our host machine (in this example, macOS) to generate a static library containing the pipeline function, which will later be invoked from an Android device. 
Below is a step-by-step explanation of this process.
+## Learn about ahead-of-time (AOT) and cross-compilation
+One of Halide's standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling you to generate optimized binary code on your host machine rather than compiling directly on target devices. This AOT compilation process enables you to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation.
+
+Halide also supports robust cross-compilation capabilities. Cross-compilation means using the host version of Halide (typically running on a desktop Linux or macOS system) to target different architectures, such as Arm for Android devices. You can optimize Halide pipelines on your host machine, produce libraries specifically optimized for Android, and integrate them seamlessly into Android applications. The generated pipeline code includes essential optimizations and can embed minimal runtime support, further reducing workload on the target device and ensuring responsiveness and efficiency.
+
+## Prepare pipeline for Android
+The following code demonstrates how to use Halide's AOT compilation and cross-compilation features to create an optimized image processing pipeline for Android. Run Halide on your host machine (in this example, macOS) to generate a static library containing the pipeline function, which you'll later invoke from an Android device. Below is a step-by-step explanation of this process.

Create a new file named blur-android.cpp with the following contents:

@@ -85,9 +88,9 @@ int main(int argc, char** argv) {
}
```

-In the original implementation constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and output values (255, 0) are explicitly cast to uint8_t. This approach removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. Both approaches result in identical functionality, but explicitly casting helps emphasize the type correctness and may avoid subtle issues during cross-compilation or in certain environments. Additionally, explicit uint8_t casts help avoid implicit promotion to 32-bit integers (and the corresponding narrowings back to 8-bit) in the generated code, reducing redundant cast operations and potential vector widen/narrow overhead—especially on ARM/NEON
+In the original implementation, constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and output values (255, 0) are explicitly cast to uint8_t. This approach removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. Both approaches result in identical functionality, but explicit casting emphasizes type correctness and may avoid subtle issues during cross-compilation or in certain environments. Additionally, explicit uint8_t casts help avoid implicit promotion to 32-bit integers (and the corresponding narrowings back to 8-bit) in the generated code, reducing redundant cast operations and potential vector widen/narrow overhead, especially on Arm/NEON.

-The program takes at least one command-line argument, the output base name used to generate the files (e.g., “blur_threshold_android”). Here, the target architecture is explicitly set within the code to Android ARM64:
+The program takes at least one command-line argument, the output base name used to generate the files (for example, "blur_threshold_android").
Here, the target architecture is explicitly set within the code to Android ARM64: ```cpp // Configure Halide Target for Android @@ -99,20 +102,20 @@ target.bits = 64; // Enable Halide runtime inclusion in the generated library (needed if not linking Halide runtime separately). target.set_feature(Target::NoRuntime, false); -// Optionally, enable hardware-specific optimizations to improve performance on ARM devices: -// - DotProd: Optimizes matrix multiplication and convolution-like operations on ARM. +// Optionally, enable hardware-specific optimizations to improve performance on Arm devices: +// - DotProd: Optimizes matrix multiplication and convolution-like operations on Arm. // - ARMFp16 (half-precision floating-point operations). ``` Notes: 1. NoRuntime — When set to true, Halide excludes its runtime from the generated code, and you must link the runtime manually during the linking step. When set to false, the Halide runtime is included in the generated library, which simplifies deployment. -2. ARMFp16 — Enables the use of ARM hardware support for half-precision (16-bit) floating-point operations, which can provide faster execution when reduced precision is acceptable. +2. ARMFp16 — Enables the use of Arm hardware support for half-precision (16-bit) floating-point operations, which improves execution speed when reduced precision is acceptable. 3. Why the runtime choice matters - If your app links several AOT-compiled pipelines, ensure there is exactly one Halide runtime at link time: -* Strategy A (cleanest): build all pipelines with NoRuntime ON and link a single standalone Halide runtime once (matching the union of features you need, e.g., Vulkan/OpenCL/Metal or ARM options). +* Strategy A (cleanest): build all pipelines with NoRuntime ON and link a single standalone Halide runtime once (matching the union of features you need, for example, Vulkan/OpenCL/Metal or Arm options). * Strategy B: embed the runtime in exactly one pipeline (leave NoRuntime OFF only there); compile all other pipelines with NoRuntime ON. * Mixing more than one runtime can cause duplicate symbols and split global state (e.g., error handlers, device interfaces). -We declare spatial variables (x, y) and an ImageParam named “input” representing the input image data. We use boundary clamping (clamp) to safely handle edge pixels. Then, we apply a 3x3 blur with a reduction domain (RDom). The accumulated sum is divided by 9 (the number of pixels in the neighborhood), producing an average blurred image. Lastly, thresholding is applied, producing a binary output: pixels above a certain brightness threshold (128) become white (255), while others become black (0). +The code declares spatial variables (x, y) and an ImageParam named "input" representing the input image data. Boundary clamping (clamp) safely handles edge pixels. A 3×3 blur with a reduction domain (RDom) is then applied. The accumulated sum is divided by 9 (the number of pixels in the neighborhood), producing an average blurred image. Lastly, thresholding is applied, producing a binary output: pixels above a certain brightness threshold (128) become white (255), while others become black (0). This section intentionally reinforces previous concepts, focusing now primarily on explicitly clarifying integration details, such as type correctness and the handling of runtime features within Halide. 
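+
+To make Strategy A more concrete, the sketch below shows one way a generator program could emit runtime-free pipelines plus a single standalone runtime object. This is an illustrative sketch, not code from the example repository: the file name halide_runtime_android.o is hypothetical, while with_feature and compile_standalone_runtime are standard Halide APIs.
+
+```cpp
+#include "Halide.h"
+
+int main() {
+    // Same Android ARM64 target as blur-android.cpp.
+    Halide::Target t;
+    t.os = Halide::Target::Android;
+    t.arch = Halide::Target::ARM;
+    t.bits = 64;
+
+    // 1) Build every pipeline with NoRuntime ON (no embedded runtime).
+    Halide::Target no_rt = t.with_feature(Halide::Target::NoRuntime);
+    // thresholded.compile_to_static_library("blur_threshold_android",
+    //                                       {input}, "blur_threshold", no_rt);
+    (void)no_rt; // used by the (commented) pipeline builds above
+
+    // 2) Emit exactly one standalone runtime, matching the union of features you need.
+    Halide::compile_standalone_runtime("halide_runtime_android.o", t);
+    return 0;
+}
+```
+
+At link time you then combine the runtime-free pipeline libraries with that single runtime object, which avoids the duplicate-symbol and split-global-state problems described above.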
@@ -120,9 +123,9 @@ Simple scheduling directives (compute_root) instruct Halide to compute intermedi This strategy can simplify debugging by clearly isolating computational steps and may enhance runtime efficiency by explicitly controlling intermediate storage locations. -By clearly separating algorithm logic from scheduling, developers can easily test and compare different scheduling strategies,such as compute_inline, compute_root, compute_at, and more, without modifying their fundamental algorithmic code. This separation significantly accelerates iterative optimization and debugging processes, ultimately yielding better-performing code with minimal overhead. +By clearly separating algorithm logic from scheduling, you can easily test and compare different scheduling strategies, such as compute_inline, compute_root, compute_at, and more, without modifying your fundamental algorithmic code. This separation significantly accelerates iterative optimization and debugging processes, ultimately yielding better-performing code with minimal overhead. -We invoke Halide's AOT compilation function compile_to_static_library, which generates a static library (.a) containing the optimized pipeline and a corresponding header file (.h). +Halide's AOT compilation function compile_to_static_library generates a static library (.a) containing the optimized pipeline and a corresponding header file (.h). ```cpp thresholded.compile_to_static_library( @@ -134,18 +137,18 @@ thresholded.compile_to_static_library( ``` This will produce: -* A static library (blur_threshold_android.a) containing the compiled pipeline. This static library also includes Halide's runtime functions tailored specifically for the targeted architecture (arm-64-android). Thus, no separate Halide runtime needs to be provided on the Android device when linking against this library. +* A static library (blur_threshold_android.a) containing the compiled pipeline. This static library also includes Halide's runtime functions tailored specifically for the targeted architecture (arm-64-android). Thus, no separate Halide runtime needs to be provided on the Android device when linking against this library. * A header file (blur_threshold_android.h) declaring the pipeline function for use in other C++/JNI code. These generated files are then ready to integrate directly into an Android project via JNI, allowing efficient execution of the optimized pipeline on Android devices. The integration process is covered in the next section. -Note: JNI (Java Native Interface) is a framework that allows Java (or Kotlin) code running in a Java Virtual Machine (JVM), such as on Android, to interact with native applications and libraries written in languages like C or C++. JNI bridges the managed Java/Kotlin environment and the native, platform-specific implementations. +JNI (Java Native Interface) is a framework that allows Java (or Kotlin) code running in a Java Virtual Machine (JVM), such as on Android, to interact with native applications and libraries written in languages like C or C++. JNI bridges the managed Java/Kotlin environment and the native, platform-specific implementations. 
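+
+As a quick orientation before the JNI wiring in the next section, here is a minimal host-side sketch of how C++ code consumes the generated header. It assumes the pipeline function was registered as blur_threshold, as in the code above; the run_blur_threshold wrapper and the buffer sizes are illustrative.
+
+```cpp
+#include "HalideBuffer.h"              // Halide::Runtime::Buffer (header-only)
+#include "blur_threshold_android.h"    // generated by compile_to_static_library
+
+#include <cstdint>
+#include <vector>
+
+int run_blur_threshold(int width, int height) {
+    std::vector<uint8_t> gray(width * height);    // input grayscale pixels
+    std::vector<uint8_t> result(width * height);  // output pixels
+
+    // Wrap the raw arrays without copying them.
+    Halide::Runtime::Buffer<uint8_t> in(gray.data(), width, height);
+    Halide::Runtime::Buffer<uint8_t> out(result.data(), width, height);
+
+    // Call the AOT-compiled pipeline; it returns 0 on success.
+    return blur_threshold(in, out);
+}
+```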
-## Compilation instructions
+## Compile the pipeline
To compile the pipeline-generation program on your host system, use the following commands (replace /path/to/halide with your Halide installation directory):

```console
-export DYLD_LIBRARY_PATH=/path/to/halide/lib/libHalide.19.dylib
+export DYLD_LIBRARY_PATH=/path/to/halide/lib
-g++ -std=c++17 blud-android.cpp -o blud-android \
+g++ -std=c++17 blur-android.cpp -o blur-android \
  -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
  $(pkg-config --cflags --libs opencv4) -lpthread -ldl \
  -Wl,-rpath,/path/to/halide/lib
@@ -160,7 +163,7 @@ This will produce two files:
* blur_threshold_android.a: The static library containing your Halide pipeline.
* blur_threshold_android.h: The header file needed to invoke the generated pipeline.

-We will integrate these files into our Android project in the following section.
+You'll integrate these files into the Android project in the following section.

## Summary
-In this section, we’ve explored Halide's powerful ahead-of-time (AOT) and cross-compilation capabilities, preparing an optimized image processing pipeline tailored specifically for Android devices. By using the host-based Halide compiler, we’ve generated a static library optimized for ARM64 Android architecture, incorporating safe boundary conditions, neighborhood-based blurring, and thresholding operations. This streamlined process allows seamless integration of highly optimized native code into Android applications, ensuring both development efficiency and runtime performance on mobile platforms.
\ No newline at end of file
+You've explored Halide's powerful ahead-of-time (AOT) and cross-compilation capabilities, preparing an optimized image processing pipeline tailored specifically for Android devices. By using the host-based Halide compiler, you generated a static library optimized for 64-bit Arm Android architecture, incorporating safe boundary conditions, neighborhood-based blurring, and thresholding operations. This streamlined process allows seamless integration of highly optimized native code into Android applications, ensuring both development efficiency and runtime performance on mobile platforms.
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
index f10442403..84b6ee815 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
@@ -1,19 +1,24 @@
---
# User change
-title: "Demonstrating Operation Fusion"
+title: "Apply operator fusion in Halide for real-time image processing"

weight: 4

layout: "learningpathall"
---

-## Objective
-In the previous section, you explored parallelization and tiling. Here, you will focus on operator fusion (inlining) in Halide i.e., letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You will learn when fusion reduces memory traffic and when materializing saves recomputation (e.g., for large stencils or multi-use intermediates). You will inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s).
+## What you'll build and learn

-This section does not cover loop fusion (the fuse directive). You will focus on operator fusion, which is Halide's default behavior.
+You'll explore operator fusion in Halide, where each stage is computed inside its consumer instead of storing intermediate results. This approach reduces memory traffic and improves cache efficiency. You'll also learn when it's better to materialize intermediates using `compute_root()` or `compute_at()`, such as with large filters or when results are reused by multiple stages. By the end, you'll understand how to choose between fusion and materialization for real-time image processing on Arm devices.

-## Code
-To demonstrate how fusion in Halide works create a new file `camera-capture-fusion.cpp`, and modify it as follows. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and print ms / FPS / MPix/s. So you can see the impact immediately.
+You'll also use `print_loop_nest()` to see how Halide arranges the computation, switch between different scheduling modes (fuse all, fuse blur only, materialize, tile and materialize per tile) in a live camera pipeline, and measure the impact using ms, FPS, and MPix/s.
+
+{{% notice Note on scope %}}
+This section doesn't cover loop fusion using the `fuse` directive. You'll focus instead on operator fusion, which is Halide's default behavior.
+{{% /notice %}}
+
+## Explore the code
+To explore how fusion in Halide works, create a new file called `camera-capture-fusion.cpp`, and copy in the code below. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle between operator fusion and materialization, and prints ms / FPS / MPix/s, so you can see the impact immediately:

```cpp
#include "Halide.h"
@@ -47,11 +52,11 @@ static const char* schedule_name(Schedule s) {
}

// Build the BGR->Gray -> 3x3 binomial blur -> threshold pipeline.
-// We clamp the *ImageParam* at the borders (Func clamp of ImageParam works in Halide 19).
+// Clamp the *ImageParam* at the borders (Func clamp of ImageParam works in Halide 19).
Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
    Var x("x"), y("y");

-    // Assume 3-channel BGR interleaved frames (we convert if needed).
+    // Assume 3-channel BGR interleaved frames (converted if needed).
    input.dim(0).set_stride(3); // x-stride = channels
    input.dim(2).set_stride(1); // c-stride = 1
    input.dim(2).set_bounds(0, 3); // three channels
@@ -81,7 +86,7 @@ Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
    // Final output
    Func output("output");
    output(x, y) = thresholded(x, y);
-    output.compute_root(); // we always realize 'output'
+    output.compute_root(); // always realize 'output'

    // Scheduling to demonstrate OPERATOR FUSION vs MATERIALIZATION
    // Default in Halide = fusion/inlining (no schedule on producers).
@@ -232,12 +237,17 @@ int main(int argc, char** argv) {
    return 0;
}
```
+The heart of this program is the `make_pipeline` function. This function builds the camera processing pipeline in Halide and lets you switch between different scheduling modes. Each mode changes how intermediate results are handled, by either fusing stages together to minimize memory use, or materializing them to avoid recomputation. By adjusting the schedule, you can see how these choices affect both the loop structure and the real-time performance of your image processing pipeline.
+
+Start by declaring `Var x, y` to represent pixel coordinates. The camera frames use a 3-channel interleaved BGR format.
This means: -The main part of this program is the `make_pipeline` function. It defines the camera processing pipeline in Halide and applies different scheduling choices depending on which mode we select. +- The stride along the x-axis is 3, because each step moves across all three color channels. +- The stride along the channel axis (c) is 1, so channels are stored contiguously. +- The channel bounds are set from 0 to 2, covering the three BGR channels. -You start by declaring Var x, y as our pixel coordinates. Similarly as before, the camera frames come in as 3-channel interleaved BGR, you will tell Halide how the data is laid out: the stride along x is 3 (one step moves across all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2. +These settings tell Halide exactly how the image data is organized in memory, so it can process each pixel and channel correctly. -Because you don’t want to worry about array bounds when applying filters, you will clamp the input at the borders. In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image. +To avoid errors when applying filters near the edges of an image, clamp the input at the borders. In Halide 19, you can use `BoundaryConditions::repeat_edge` directly on an `ImageParam`, because it includes dimension information. This ensures that all stages in your pipeline can safely access pixels, even at the image boundaries. ```cpp Pipeline make_pipeline(ImageParam& input, Schedule schedule) { @@ -251,16 +261,38 @@ Pipeline make_pipeline(ImageParam& input, Schedule schedule) { // (b) Border handling: clamp the *ImageParam* (works cleanly in Halide 19) Func inputClamped = BoundaryConditions::repeat_edge(input); ``` +The next stage converts the image to grayscale. Use the Rec.601 weights for BGR to gray conversion, just like in the previous section. For the blur, apply a 3×3 binomial kernel with values: + +``` +1 2 1 +2 4 2 +1 2 1 +``` -Next comes the gray conversion. As in previous section, you will use Rec.601 weights a 3×3 binomial blur. Instead of using a reduction domain (RDom), you unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16. +This kernel closely approximates a Gaussian filter. Instead of using Halide's reduction domain (`RDom`), unroll the sum directly in C++ using two nested loops over the kernel values. For each pixel, calculate the weighted sum of its 3×3 neighborhood and divide by 16 to get the blurred result. This approach makes the computation straightforward and easy to follow. +Now, add a threshold stage to your pipeline. This stage checks each pixel value after the blur and sets it to white (255) if it's above 128, or black (0) otherwise. This produces a binary image, making it easy to see which areas are brighter than the threshold. -You will then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline. 
+Here's how you define the thresholded stage and the output Func:
+
+```cpp
+// Threshold (binary)
+Func thresholded("thresholded");
+Expr T = cast<uint8_t>(128);
+thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
+
+// Final output
+Func output("output");
+output(x, y) = thresholded(x, y);
+output.compute_root(); // Realize 'output' explicitly when running the pipeline
+```
+
+This setup ensures that the output is a binary image, and Halide will compute and store the result when you run the pipeline. By calling `compute_root()` on the output Func, you tell Halide to materialize the final result, making it available for display or further processing.

Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
-  * Simple: Here you will explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
+  * Simple: Here you'll explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
  * FuseBlurAndThreshold: You compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
-  * FuseAll: You will apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil.
-  * Tile: You will split the output into 64×64 tiles. Within each tile, we materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile.
+  * FuseAll: You'll apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil.
+  * Tile: You'll split the output into 64×64 tiles. Within each tile, you materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile.

To help you examine what's happening, print the loop nest Halide generates for each schedule using print_loop_nest(). This will give you a clear view of how fusion or materialization changes the structure of the computation.

@@ -299,7 +331,7 @@
return Pipeline(output);
}
```

-All the camera handling is just like before: you open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. You will still time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].
+All the camera handling is just like before: open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. Time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].

The new part is that you can toggle scheduling modes from the keyboard while the application is running:
1.
Keys:
* 0 – Simple
* 1 – FuseBlurAndThreshold
* 2 – FuseAll
* 3 – Tile
* q / Esc – quit

Under the hood, pressing 0–3 triggers a rebuild of the Halide pipeline with the chosen schedule:
-1. You map the key to a Schedule enum value.
-2. You call make_pipeline(input, next) to construct the new scheduled pipeline.
-3. You reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT).
+1. Map the key to a Schedule enum value.
+2. Call make_pipeline(input, next) to construct the new scheduled pipeline.
+3. Reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT).
4. The main loop keeps grabbing frames; only the Halide schedule changes.

This live switching makes fusion tangible: you can watch the loop nest printout change, see the visualization update, and compare throughput numbers in real time as you move between Simple, FuseBlurAndThreshold, FuseAll, and Tile.
@@ -326,7 +358,7 @@ g++ -std=c++17 camera-capture-fusion.cpp -o camera-capture-fusion \
./camera-capture-fusion
```

-You will see the following output:
+You'll see the following output:
```output
% ./camera-capture-fusion
Starting with schedule: FuseAll (press 0..3 to switch; q/Esc to quit)
@@ -399,7 +431,7 @@ Simple | 6.01 ms | 166.44 FPS | 345.12 MPix/s
```

The console output combines two kinds of information:
-1. Loop nests – printed by print_loop_nest(). These show how Halide actually arranges the computation for the chosen schedule. They are a great “x-ray” view of fusion and materialization:
+1. Loop nests – printed by print_loop_nest(). These show how Halide actually arranges the computation for the chosen schedule. They're a great "x-ray" view of fusion and materialization:
* In FuseAll, the loop nest contains only output. That’s because gray, blur, and thresholded are all inlined (fused) into it. Each pixel of output recomputes its 3×3 neighborhood of gray.
* In FuseBlurAndThreshold, there is an extra loop for gray, because we explicitly called gray.compute_root(). The blur and thresholded stages are still fused into output. This reduces recomputation of gray and makes downstream loops simpler to vectorize.
* In Simple, both gray and blur have their own loop nests, and thresholded fuses into output. This introduces two extra buffers, but each stage is computed once and can be parallelized independently.
@@ -411,7 +443,7 @@ The console output combines two kinds of information:

Comparing the numbers:
* FuseAll runs at ~53 FPS. It has minimal memory traffic but pays for recomputation of gray under the blur.
-* FuseBlurAndThreshold jumps to over 200 FPS. By materializing gray, we avoid redundant recomputation and allow blur+threshold to stay fused. This is often the sweet spot for interleaved camera input.
+* FuseBlurAndThreshold jumps to over 200 FPS. By materializing gray, redundant recomputation is avoided and blur+threshold stays fused. This is often the sweet spot for interleaved camera input.
* Simple reaches ~166 FPS. Both gray and blur are materialized, so no recomputation occurs, but memory traffic is higher than in FuseBlurAndThreshold.
* Tile achieves similar speed (~200 FPS). Producing gray per tile balances recomputation and memory traffic by keeping intermediates local to cache.
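+
+To connect these numbers back to the schedules, here is a condensed sketch of how the four variants can be expressed inside make_pipeline. The Funcs, Vars, and Schedule enum match the program above; the exact switch layout is illustrative.
+
+```cpp
+// Condensed sketch of the four scheduling variants compared above.
+switch (schedule) {
+    case Schedule::FuseAll:
+        // No schedules on producers: gray, blur, thresholded inline into output.
+        break;
+    case Schedule::FuseBlurAndThreshold:
+        gray.compute_root();            // materialize gray once per frame
+        break;
+    case Schedule::Simple:
+        gray.compute_root();            // materialize both intermediates
+        blur.compute_root();
+        break;
+    case Schedule::Tile: {
+        Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
+        output.tile(x, y, xo, yo, xi, yi, 64, 64);
+        gray.compute_at(output, xo);    // materialize gray per 64x64 tile
+        break;
+    }
+}
+```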
@@ -422,8 +454,8 @@ By toggling schedules live, you can see and measure how operator fusion and mate

This demo makes these trade-offs concrete: the loop nest diagrams explain the structure, and the live FPS/MPix/s stats show the real performance impact.

-## What “fusion” means in Halide
-One of Halide's defining features is that, by default, it performs operator fusion, also called inlining. This means that if a stage produces some intermediate values, those values aren’t stored in a separate buffer and then re-read later—instead, the stage is computed directly inside the consumer’s loop. In other words, unless you tell Halide otherwise, every producer Func is fused into the next stage that uses it.
+## What "fusion" means in Halide
+One of Halide's defining features is that, by default, it performs operator fusion, also called inlining. This means that if a stage produces some intermediate values, those values aren't stored in a separate buffer and then re-read later—instead, the stage is computed directly inside the consumer's loop. In other words, unless you tell Halide otherwise, every producer Func is fused into the next stage that uses it.

Why is this important? Fusion reduces memory traffic, because Halide doesn’t need to write intermediates out to RAM and read them back again. On CPUs, where memory bandwidth is often the bottleneck, this can be a major performance win. Fusion also improves cache locality, since values are computed exactly where they are needed and the working set stays small. The trade-off, however, is that fusion can cause recomputation: if a consumer uses a neighborhood (like a blur that reads 3×3 or 9×9 pixels), the fused producer may be recalculated multiple times for overlapping regions. Whether fusion is faster depends on the balance between compute cost and memory traffic.
@@ -442,27 +474,27 @@
for y: for x:
gray(x,y) = ... // write one planar gray image
for y: for x:
out(x,y) = threshold( sum kernel * gray(x+i,y+j) )
```

-The fused version eliminates buffer writes but recomputes gray under the blur stencil. The materialized version performs more memory operations but avoids recomputation, and also gives us a clean point to parallelize or vectorize the gray stage.
+The fused version eliminates buffer writes but recomputes gray under the blur stencil. The materialized version performs more memory operations but avoids recomputation, and also provides a clean point to parallelize or vectorize the gray stage.

-It’s worth noting that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That’s a different concept and not our focus here. In this tutorial, we’re talking specifically about operator fusion—the decision of whether to inline or materialize stages.
+Note that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That's a different concept and not the focus here. This tutorial focuses specifically on operator fusion—the decision of whether to inline or materialize stages.

## How this looks in the live camera demo
-Our pipeline is: BGR input → gray → 3×3 blur → thresholded → output. Depending on the schedule, we see different kinds of fusion:
+The pipeline is: BGR input → gray → 3×3 blur → thresholded → output. Depending on the schedule, different kinds of fusion are shown:

* FuseAll. No schedules on producers. gray, blur, and thresholded are all inlined into output. This minimizes memory traffic but recomputes gray repeatedly inside the 3×3 blur.
-* FuseBlurAndThreshold: We add gray.compute_root(), materializing gray once as a planar buffer. This avoids recomputation of gray and makes downstream blur and thresholded vectorize better. blur and thresholded remain fused. +* FuseBlurAndThreshold: Adding gray.compute_root() materializes gray once as a planar buffer. This avoids recomputation of gray and makes downstream blur and thresholded vectorize better. blur and thresholded remain fused. * Simple. Both gray and blur are materialized across the frame. This avoids recomputation entirely but increases memory traffic. -* Tile. We split the output into 64×64 tiles and compute gray per tile (compute_at(output, xo)). This keeps intermediate results local to cache while still fusing blur inside each tile. +* Tile. The output is split into 64×64 tiles and gray is computed per tile (compute_at(output, xo)). This keeps intermediate results local to cache while still fusing blur inside each tile. By toggling between these modes in the live demo, you can see how the loop nests and throughput numbers change, which makes the abstract idea of fusion much more concrete. ## When to use operator fusion -Fusion is Halide's default and usually the right place to start. It’s especially effective for: +Fusion is Halide's default and usually the right place to start. It's especially effective for: * Element-wise chains, where each pixel is transformed independently: examples include intensity scaling or offset, gamma correction, channel mixing, color-space conversions, and logical masking. * Cheap post-ops after spatial filters: -for instance, there’s no reason to materialize a blurred image just to threshold it. Fuse the threshold directly into the blur’s consumer. +for instance, there's no reason to materialize a blurred image to threshold it. Fuse the threshold directly into the blur's consumer. -In our code, FuseAll inlines gray, blur, and thresholded into output. FuseBlurAndThreshold materializes only gray, then keeps blur and thresholded fused—a common middle ground that balances memory use and compute reuse. +In the code, FuseAll inlines gray, blur, and thresholded into output. FuseBlurAndThreshold materializes only gray, then keeps blur and thresholded fused—a common middle ground that balances memory use and compute reuse. ## When to materialize instead of fuse Fusion isn’t always best. You’ll want to materialize an intermediate (compute_root() or compute_at()) if: @@ -471,8 +503,9 @@ Fusion isn’t always best. You’ll want to materialize an intermediate (comput * The intermediate is reused by multiple consumers. * You need a natural stage to apply parallelization or tiling. -### Profiling -The fastest way to check whether fusion helps is to measure it. Our demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling). +## Profiling +The fastest way to check whether fusion helps is to measure it. The demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling). 
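+
+To see the profiler in action, a small sketch follows. Target::Profile is Halide's standard profiling feature flag; the variable names (t, frame, width, height) and the use of output from this demo are illustrative.
+
+```cpp
+// Enable Halide's built-in profiler for JIT execution.
+Halide::Target t = Halide::get_jit_target_from_environment()
+                       .with_feature(Halide::Target::Profile);
+
+// Realizing against this target prints a per-stage time breakdown.
+Halide::Buffer<uint8_t> frame = output.realize({width, height}, t);
+```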
## Summary -In this section, you have learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. You explored why fusion matters, looked at scenarios where it is most effective, and saw how Halide's scheduling constructs such as compute_root() and compute_at() let us control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, we observed how fusion can significantly improve the performance of a real-time image processing pipeline + +You've seen how operator fusion in Halide can make your image processing pipeline faster and more efficient. Fusion means Halide computes each stage directly inside its consumer, reducing memory traffic and keeping data in cache. You learned when fusion is best—like for simple pixel operations or cheap post-processing—and when materializing intermediates with `compute_root()` or `compute_at()` can help, especially for large stencils or multi-use buffers. By switching schedules in the live demo, you saw how fusion and materialization affect both the loop structure and real-time performance. Now you know how to choose the right approach for your own Arm-based image processing tasks. diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md index 467027083..e2535f65a 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md @@ -1,32 +1,41 @@ --- # User change -title: "Background and Installation" +title: "Install and configure Halide for Arm development" weight: 2 layout: "learningpathall" --- -## Introduction -Halide is a powerful, open-source programming language specifically designed to simplify and optimize high-performance image and signal processing pipelines. Initially developed by researchers at MIT and Adobe in 2012, Halide addresses a critical challenge in computational imaging: efficiently mapping image-processing algorithms onto diverse hardware architectures without extensive manual tuning. It accomplishes this by clearly separating the description of an algorithm (specifying the mathematical or logical transformations applied to images or signals) from its schedule (detailing how and where those computations execute). This design enables rapid experimentation and effective optimization for various processing platforms, including CPUs, GPUs, and mobile hardware. +## What is Halide? -A key advantage of Halide lies in its innovative programming model. By clearly distinguishing between algorithmic logic and scheduling decisions—such as parallelism, vectorization, memory management, and hardware-specific optimizations, developers can first focus on ensuring the correctness of their algorithms. Performance tuning can then be handled independently, significantly accelerating development cycles. This approach often yields performance that matches or even surpasses manually optimized code. As a result, Halide has seen widespread adoption across industry and academia, powering image processing systems at organizations such as Google, Adobe, and Facebook, and enabling advanced computational photography features used by millions daily. +Halide is a powerful, open-source programming language designed to simplify and optimize high-performance image and signal processing. 
In 2012, researchers at MIT and Adobe developed Halide to efficiently run image-processing algorithms on different hardware architectures without extensive manual tuning.
 
-In this learning path, you will explore Halide's foundational concepts, set up your development environment, and create your first functional Halide application. By the end, you will understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines.
+Halide makes it easy to write correct image-processing code by separating what your program does from how it runs. You first describe the algorithm (the steps used to process each pixel) without worrying about performance details. You can then choose scheduling strategies like parallelism, vectorization, and memory management to optimize for your hardware, including Arm processors. This approach helps you focus on getting the right results before tuning for speed, often matching or beating hand-optimized code.
 
-For broader or more general use cases, please refer to the official Halide documentation and tutorials available at [halide-lang.org](https://halide-lang.org).
+In this Learning Path, you'll explore Halide's foundational concepts, set up your development environment, and create your first functional Halide application. By the end, you'll understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines.
 
-The example code for this Learning Path is available in two repositories [here](https://github.com/dawidborycki/Arm.Halide.Hello-World.git) and [here](https://github.com/dawidborycki/Arm.Halide.AndroidDemo.git)
+For broader use cases, see the official Halide documentation and tutorials on [the Halide website](https://halide-lang.org).
+
+You can find the example code for this Learning Path in two GitHub repositories: [Arm.Halide.Hello-World](https://github.com/dawidborycki/Arm.Halide.Hello-World.git) and [Arm.Halide.AndroidDemo](https://github.com/dawidborycki/Arm.Halide.AndroidDemo.git).
 
## Key concepts in Halide
 
-### Separation of algorithm and schedule
-At the core of Halide's design philosophy is the principle of clearly separating algorithms from schedules. Traditional image-processing programming tightly couples algorithmic logic with execution strategy, complicating optimization and portability. In contrast, Halide explicitly distinguishes these two components:
- * Algorithm: Defines what computations are performed—for example, image filters, pixel transformations, or other mathematical operations on image data.
- * Schedule: Specifies how and where these computations are executed, addressing critical details such as parallel execution, memory usage, caching strategies, and hardware-specific optimizations.
-This separation allows developers to rapidly experiment and optimize their code for different hardware architectures or performance requirements without altering the core algorithmic logic.
+Before you build your first Halide application, get familiar with the key ideas that make Halide powerful for image processing. Halide separates the steps of what your code does (the algorithm) from how it runs (the schedule). You'll use symbolic building blocks to describe image operations, then apply scheduling strategies to optimize performance for Arm processors. 
Understanding these concepts helps you write high-performance code that's correct, fast, readable, and portable across different hardware architectures, including Arm processors.
+
+## Separate algorithm from schedule for optimal performance
+
+Halide's core design principle separates algorithms from schedules. Traditional image-processing code tightly couples algorithmic logic with execution strategy, complicating optimization and portability.
+
+- The algorithm defines what computations are performed, such as image filters, pixel transformations, or mathematical operations on image data.
 
-Halide provides three key building blocks, including Functions, Vars, and Pipelines, to simplify and structure image processing algorithms. Consider the following illustrative example:
+- The schedule specifies how and where these computations execute, including parallel execution, memory usage, caching strategies, and hardware-specific optimizations.
+
+This separation enables you to experiment and optimize code for different hardware architectures without changing the core algorithmic logic.
+
+## Discover Halide building blocks
+
+Halide provides three key building blocks to structure image processing algorithms, as shown below:
 
```cpp
Halide::Var x("x"), y("y"), c("c");
@@ -36,42 +45,60 @@ Halide::Func brighter("brighter");
brighter(x, y, c) = Halide::cast<uint8_t>(Halide::min(input(x, y, c) + 50, 255));
```
 
-Functions (Func) represent individual computational steps or image operations. Each Func encapsulates an expression applied to pixels, allowing concise definition of complex image processing tasks. Vars symbolically represent spatial coordinates or dimensions (e.g., horizontal x, vertical y, color channel c). They specify where computations are applied in the image data Pipelines are formed by interconnecting multiple Func objects, structuring a clear workflow where the output of one stage feeds into subsequent stages, enabling modular and structured image processing.
+- Functions (`Func`) represent individual computational steps or image operations. Each Func encapsulates an expression applied to pixels, enabling concise definition of complex tasks.
 
-Halide is a domain-specific language (DSL) tailored explicitly for image and signal processing tasks. It provides a concise set of predefined operations and building blocks optimized for expressing complex image processing pipelines. By abstracting common computational patterns into simple yet powerful operators, Halide allows developers to succinctly define their processing logic, facilitating readability, maintainability, and easy optimization for various hardware targets.
+- `Var` symbolically represents spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c), specifying where computations are applied.
 
-### Scheduling strategies (parallelism, vectorization, tiling)
-Halide offers several powerful scheduling strategies designed for maximum performance:
- * Parallelism: Executes computations concurrently across multiple CPU cores, significantly reducing execution time for large datasets.
- * Vectorization: Enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions available on CPUs and GPUs, greatly enhancing performance.
- * Tiling: Divides computations into smaller blocks (tiles) optimized for cache efficiency, thus improving memory locality and reducing overhead due to memory transfers. 

+- Pipelines are formed by connecting multiple `Func` objects, creating a workflow where each stage's output feeds into subsequent stages.
 
-By combining these scheduling techniques, developers can achieve optimal performance tailored specifically to their target hardware architecture.
+Halide is a domain-specific language (DSL) tailored for image and signal processing. It provides predefined operations and building blocks optimized for expressing complex pipelines. By abstracting common computational patterns, Halide lets you define processing logic concisely, which in turn facilitates readability, maintainability, and optimization across hardware targets.
 
-Beyond manual scheduling strategies, Halide also provides an Autoscheduler, a powerful tool that automatically generates optimized schedules tailored to specific hardware architectures, further simplifying performance optimization.
+## Learn about scheduling strategies
 
-## System requirements and environment setup
-To start developing with Halide, your system must meet several requirements and dependencies.
+Halide offers several powerful scheduling strategies for maximum performance:
 
-### Installation options
-Halide can be set up using one of two main approaches:
-* Installing pre-built binaries - pre-built binaries are convenient, quick to install, and suitable for most beginners or standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases.
-* Building Halide from source is required when pre-built binaries are unavailable for your specific environment, or if you wish to experiment with the latest Halide features or LLVM versions still under active development. This method typically requires greater familiarity with build systems and may be more suitable for advanced users.
+- Parallelism executes computations concurrently across multiple CPU cores, reducing execution time for large datasets
 
-Here, you will use pre-built binaries:
- 1. Visit the official Halide releases [page](https://github.com/halide/Halide/releases). As of this writing, the latest Halide version is v19.0.0.
- 2. Download and unzip the binaries to a convenient location (e.g., /usr/local/halide on Linux/macOS or C:\halide on Windows).
- 3. Optionally set environment variables to simplify further usage:
-```console
-export HALIDE_DIR=/path/to/halide
-export PATH=$HALIDE_DIR/bin:$PATH
-```
+- Vectorization enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions, such as Arm NEON, enhancing performance on Arm CPUs and GPUs
+
+- Tiling divides computations into smaller blocks optimized for cache efficiency, improving memory locality and reducing transfer overhead
+
+You can combine these techniques to achieve optimal performance tailored to your target hardware architecture.
+
+Beyond manual scheduling, Halide provides an Autoscheduler that automatically generates optimized schedules for specific hardware architectures, including Arm-based systems, simplifying performance optimization.
+
+## Set up your environment
+
+You can set up Halide using one of two approaches:
+
+- **Use pre-built binaries** for a fast and convenient setup on Windows, Linux, and macOS. This method is recommended for most users and standard development environments.
+
+- **Build from source** when pre-built binaries aren't available for your environment, or if you want to experiment with the latest Halide features or LLVM versions under active development. 
This method requires familiarity with build systems.
 
-To proceed futher, make sure to install the following components:
-1. LLVM (Halide requires LLVM to compile and execute pipelines)
-2. OpenCV (for image handling in later lessons)
+To set up Halide using pre-built binaries, follow these steps:
 
-Install with the commands for your OS:
+
+- Go to the [Halide releases page](https://github.com/halide/Halide/releases). This Learning Path uses version v19.0.0.
+- Download and unzip the binaries to a convenient location, such as `/usr/local/halide` (Linux/macOS) or `C:\halide` (Windows).
+- Set environment variables to make Halide easy to use:
+  ```console
+  export HALIDE_DIR=/path/to/halide
+  export PATH=$HALIDE_DIR/bin:$PATH
+  ```
+
+
+## Install LLVM and OpenCV
+
+Before you can build and run Halide pipelines, you need to install two essential components:
+
+- LLVM: Halide depends on LLVM to compile and execute image processing pipelines. LLVM provides the backend that turns Halide code into optimized machine instructions for Arm processors.
+
+- OpenCV: You'll use OpenCV for image input and output in later sections. OpenCV makes it easy to load, display, and save images, and it integrates smoothly with Halide buffers.
+
+Both tools are available for Arm platforms on Linux, macOS, and Windows. Make sure you install the correct versions for your operating system and architecture.
+
+The commands below show how to install LLVM and OpenCV:
 
{{< tabpane code=true >}}
  {{< tab header="Linux/Ubuntu" language="bash">}}
@@ -86,8 +113,9 @@ brew install opencv pkg-config
 
Halide examples were tested with OpenCV 4.11.0.
 
-## Your first Halide program
-Now you’re ready to build your first Halide-based application. Save the following code in a file named `hello-world.cpp`:
+## Build your first Halide program
+
+You're now ready to build your first Halide application. Save the following code in a file named `hello-world.cpp`:
```cpp
#include "Halide.h"
#include <opencv2/opencv.hpp>
@@ -102,7 +130,7 @@ int main() {
 
    // Static path for the input image.
    std::string imagePath = "img.png";
 
-    // Load the input image using OpenCV (BGR by default).
+    // Load the input image using OpenCV (BGR format by default, which stands for Blue-Green-Red channel order).
    Mat input = imread(imagePath, IMREAD_COLOR);
    // Alternative: Halide has a built-in IO function to directly load images as Halide::Buffer.
    // Example: Halide::Buffer<uint8_t> inputBuffer = Halide::Tools::load_image(imagePath);
@@ -111,7 +139,7 @@ int main() {
        return -1;
    }
 
-    // Convert RGB back to BGR for correct color display in OpenCV (optional but recommended for OpenCV visualization).
+    // Convert from BGR to RGB (Red-Green-Blue) so the image matches the channel order the rest of the program assumes.
    cvtColor(input, input, COLOR_BGR2RGB);
 
    // Wrap the OpenCV Mat data in a Halide::Buffer.
@@ -151,30 +179,32 @@ int main() {
    }
}
```
 
-This program demonstrates how to combine Halide's image processing capabilities with OpenCV’s image I/O and display functionality. It begins by loading an image from disk using OpenCV, specifically reading from a static file named `img.png` (here you use a Cameraman image). Since OpenCV loads images in BGR format by default, the code immediately converts the image to RGB format so that it is compatible with Halide's expectations.
+This program demonstrates how you can combine Halide's image processing capabilities with OpenCV's image I/O and display functionality. 
It begins by loading an image from disk using OpenCV, specifically reading from a static file named `img.png` (here you use a Cameraman image). Since OpenCV loads images in BGR (Blue-Green-Red) format by default, the code immediately converts the image to RGB (Red-Green-Blue) format so that it's compatible with Halide.
+
+The program wraps the raw image data into a Halide buffer, capturing the image's width, height, and color channels. It defines the Halide pipeline using a function named `invert` to specify the computation for each pixel—subtract the original pixel value from 255 to invert the colors.
 
-Once the image is loaded and converted, the program wraps the raw image data into a Halide buffer, capturing the image’s dimensions (width, height, and color channels). Next, the Halide pipeline is defined through a function named invert, which specifies the computations to perform on each pixel—in this case, subtracting the original pixel value from 255 to invert the colors. The pipeline definition alone does not perform any actual computation; it only describes what computations should occur and how to schedule them.
+{{% notice Note %}}
+Remember, the pipeline definition only describes the computations and scheduling; it doesn't perform any actual processing until you realize the pipeline.
+{{% /notice %}}
 
-The actual computation occurs when the pipeline is executed with the call to invert.realize(...). This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer. The scheduling directive (invert.reorder(c, x, y)) ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images.
+The actual computation occurs when the pipeline is executed with the call to `invert.realize(...)`. This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer. The scheduling directive `invert.reorder(c, x, y)` ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images.
 
-Finally, the processed Halide output buffer is efficiently wrapped in an OpenCV Mat header without copying pixel data. For proper display in OpenCV, which uses BGR channel ordering by default, the code converts the processed image back from RGB to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting. This approach demonstrates a streamlined integration between Halide for high-performance image processing and OpenCV for convenient input and output operations.
+The program then wraps the processed Halide output buffer in an OpenCV `Mat` header without copying pixel data, converts the image from RGB back to BGR for proper display in OpenCV (which uses BGR channel ordering by default), and displays the original and inverted images in separate windows, waiting for a key press before exiting. This approach integrates Halide for high-performance image processing with OpenCV for convenient input and output operations.
 
-By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). 
This ordering naturally produces a planar memory layout (e.g., processing all red pixels first, then green, then blue). +By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). This ordering naturally produces a planar memory layout (for example, processing all red pixels first, then green, then blue). However, the optimal loop order depends on your intended memory layout and compatibility with external libraries: -1. Interleaved Layout (RGBRGBRGB…): -* Commonly used by libraries such as OpenCV. -* To achieve this, the color channel (c) should be the innermost loop, followed by horizontal (x) and then vertical (y) loops -Specifically, call: +**Interleaved layout (RGBRGBRGB…)** is commonly used by libraries such as OpenCV. To achieve this, the color channel (c) should be the innermost loop, followed by horizontal (x) and then vertical (y) loops. + +Call: ```cpp invert.reorder(c, x, y); ``` -This changes the loop nesting to process each pixel’s channels together (R, G, B for the first pixel, then R, G, B for the second pixel, and so on), resulting in: -* Better memory locality and cache performance when interfacing with interleaved libraries like OpenCV. -* Reduced overhead for subsequent image-handling operations (display, saving, or further processing). -By default, OpenCV stores images in interleaved memory layout, using the HWC (Height, Width, Channel) ordering. To correctly represent this data layout in a Halide buffer, you can also explicitly use the Buffer::make_interleaved() method, which ensures the data layout is properly specified. The code snippet would look like this: +This changes the loop nesting to process each pixel's channels together (R, G, B for the first pixel, then R, G, B for the second pixel, and so on). This provides better memory locality and cache performance when interfacing with interleaved libraries like OpenCV, and reduces overhead for subsequent image-handling operations (display, saving, or further processing). + +By default, OpenCV stores images in interleaved memory layout, using the HWC (Height, Width, Channel) ordering. To correctly represent this data layout in a Halide buffer, you can use the `Buffer::make_interleaved()` method, which ensures the data layout is properly specified: ```cpp // Wrap the OpenCV Mat data in a Halide buffer with interleaved HWC layout. @@ -183,28 +213,29 @@ Buffer inputBuffer = Buffer::make_interleaved( ); ``` -2. Planar Layout (RRR...GGG...BBB...): -* Preferred by certain image-processing routines or hardware accelerators (e.g., some GPU kernels or certain ML frameworks). -* Achieved naturally by Halide's default loop ordering (x, y, c). +**Planar layout (RRR...GGG...BBB...)** is preferred by certain image-processing routines or hardware accelerators (for example, some GPU kernels or ML frameworks). This is achieved naturally by Halide's default loop ordering (x, y, c). -It is essential to select loop ordering based on your specific data format requirements and integration scenario. Halide provides full flexibility, allowing you to explicitly reorder loops to match the desired memory layout efficiently. +Choose your loop ordering based on how your image data is stored and which libraries you use. Halide lets you control loop order for both performance and compatibility. 
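+
+For reference, here's a sketch of a complete `make_interleaved()` call consistent with the snippet above; it assumes `input` is the 8-bit, 3-channel OpenCV `Mat` loaded earlier:
+
+```cpp
+// Wrap the Mat's interleaved (HWC) pixel data in a Halide buffer
+// without copying: width = cols, height = rows, channels = 3.
+Halide::Buffer<uint8_t> inputBuffer =
+    Halide::Buffer<uint8_t>::make_interleaved(
+        input.data, input.cols, input.rows, input.channels());
+```
+
+Because the buffer shares memory with the `Mat`, the `Mat` must stay alive for as long as the buffer is used.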

-In Halide, two distinct concepts must be distinguished clearly:
-1. Loop execution order (controlled by reorder). Defines the nesting order of loops during computation. For example, to make the channel dimension (c) innermost during computation:
+Halide separates two important ideas:
+
+**Loop execution order** (controlled by `reorder`) sets the order in which loops run during computation. For example, making the channel (`c`) the innermost loop helps match interleaved layouts like OpenCV's HWC format:
```cpp
invert.reorder(c, x, y);
```
-2. Memory storage layout (controlled by reorder_storage). Defines the actual order in which data is stored in memory, such as interleaved or planar:
+
+**Memory storage layout** (controlled by `reorder_storage`) defines the actual order in which data is stored in memory, such as interleaved or planar:
```cpp
invert.reorder_storage(c, x, y);
```
 
-Using only reorder(c, x, y) affects the computational loop order but not necessarily the memory layout. The computed data could still be stored in planar order by default. Using reorder_storage(c, x, y) explicitly defines the memory layout as interleaved.
+Using only `reorder(c, x, y)` affects the computational loop order but not necessarily the memory layout. The computed data could still be stored in planar order by default. Using `reorder_storage(c, x, y)` defines the memory layout as interleaved.
+
+## Compile the program
 
-## Compilation instructions
-Compile the program as follows (replace /path/to/halide accordingly):
+Compile the program as follows (replace `/path/to/halide` with your actual path):
```console
export DYLD_LIBRARY_PATH=/path/to/halide/lib
g++ -std=c++17 hello-world.cpp -o hello-world \
@@ -213,24 +244,24 @@ g++ -std=c++17 hello-world.cpp -o hello-world \
    -Wl,-rpath,/path/to/halide/lib
```
 
-Note that, on Linux, you would set LD_LIBRARY_PATH instead:
+On Linux, set LD_LIBRARY_PATH instead:
```console
export LD_LIBRARY_PATH=/path/to/halide/lib/
```
 
-Run the executable:
+To run the executable:
```console
./hello-world
```
 
-You will see two windows displaying the original and inverted images:
-![img1](Figures/01.png)
-![img2](Figures/02.png)
+You'll see two windows displaying the original and inverted images:
+![Original color photograph of a cameraman on the left showing a person operating a professional camera, and inverted version on the right with reversed colors where the subject appears in negative](Figures/01.png)
+![Two side-by-side terminal windows showing compilation and execution of the Halide hello-world program, with the left window displaying g++ compilation commands and library paths, and the right window showing successful program execution with OpenCV window initialization messages](Figures/02.png)
 
-## Summary
-In this section, you have learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV.
+## What you've accomplished and what's next
 
-While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it does not yet showcase the substantial benefits of explicitly separating algorithm definition from scheduling strategies. 

+You've learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV for Arm development.
 
-In subsequent sections, you will explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which will clearly demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to specific hardware without modifying algorithmic correctness.
+While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the benefits of separating algorithm definition from scheduling strategies.
+
+In subsequent sections, you'll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to Arm processors and other hardware without modifying algorithmic correctness.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
index 6d7b9ec3d..d1637bc22 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
@@ -1,17 +1,25 @@
---
# User change
-title: "Building a Simple Camera Image Processing Workflow"
+title: "Build a simple camera image processing workflow"
weight: 3
layout: "learningpathall"
---
-## Objective
-In this section, you will build a real-time camera processing pipeline using Halide. First, you capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, you will measure performance and then explore Halide's scheduling options—parallelization and tiling—to understand when they help and when they don’t.
+## What you'll build
+
+In this section, you will build a real-time camera processing pipeline using Halide:
+
+- First, you will capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features.
+
+- Next, you will measure performance and explore Halide's scheduling options: parallelization and tiling. Each technique improves throughput in a different way.
+
+
+## Implement Gaussian blur and thresholding
+
+To get started, create a new `camera-capture.cpp` file and copy and paste in the contents below:
 
-## Gaussian blur and thresholding
-Create a new `camera-capture.cpp` file and modify it as follows:
 
```cpp
#include "Halide.h"
#include <opencv2/opencv.hpp>
@@ -127,10 +135,11 @@ int main() {
    return 0;
}
```
+The camera delivers interleaved BGR frames. You convert them to grayscale using Rec.601 weights, apply a 3×3 binomial blur (with 16-bit accumulation and division by 16), and then threshold to create a binary image.
 
-The camera delivers interleaved BGR frames. 
Inside Halide, we convert to grayscale (Rec.601), apply a 3×3 binomial blur (sum/16 with 16-bit accumulation), then threshold to produce a binary image. We compile once (outside the capture loop) and realize per frame for real-time processing.
+Compile the pipeline once before the capture loop starts, then call `realize()` each frame for real-time processing.
 
-A 3×3 filter needs neighbors (x±1, y±1). At the image edges, some taps would fall outside the valid region. Rather than scattering manual clamps across expressions, we wrap the input once:
+A 3×3 filter needs neighbors (x±1, y±1). At the image edges, some taps fall outside the valid region. Rather than scattering manual clamps across expressions, wrap the input once:
 
```cpp
// Wrap the input so out-of-bounds reads replicate the nearest edge pixel.
@@ -139,7 +148,7 @@ Func inputClamped = BoundaryConditions::repeat_edge(input);
```
 
Any out-of-bounds access replicates the nearest edge pixel. This makes the boundary policy obvious, keeps expressions clean, and ensures all downstream stages behave consistently at the edges.
 
-Grayscale conversion happens inside Halide using Rec.601 weights. We read B, G, R from the interleaved input and compute luminance:
+Halide converts the image to grayscale using Rec.601 weights. Read B, G, R from the interleaved input and compute luminance:
 
```cpp
// Grayscale (Rec.601)
@@ -150,7 +159,7 @@ gray(x, y) = cast<uint8_t>(0.114f * inputClamped(x, y, 0) + // B
                           0.299f * inputClamped(x, y, 2));   // R
```
 
-Next, the pipeline applies a Gaussian-approximate (binomial) blur using a fixed 3×3 kernel. For this learning path, we implement it with small loops and 16-bit accumulation for safety:
+Next, the pipeline applies a Gaussian-approximate (binomial) blur using a fixed 3×3 kernel. Implement it with small loops and 16-bit accumulation for safety:
 
```cpp
Func blur("blur");
@@ -162,12 +171,9 @@ for (int j = 0; j < 3; ++j)
 blur(x, y) = cast<uint8_t>(sum / 16);
```
 
-Why this kernel?
-* It provides effective smoothing while remaining computationally lightweight.
-* The weights approximate a Gaussian distribution, which reduces noise but preserves edges better than a box filter.
-* This is mathematically a binomial filter, a standard and efficient approximation of Gaussian blurring.
+This binomial kernel smooths images effectively while staying lightweight. Its weights closely match a Gaussian distribution, so it reduces noise but preserves edges better than a simple box filter. This makes it a fast and practical way to approximate Gaussian blur in real-time image processing.
 
-After the blur, the pipeline applies thresholding to produce a binary image. We explicitly cast constants to uint8_t to remove ambiguity and avoid redundant widen/narrow operations in generated code:
+After the blur, the pipeline applies thresholding to produce a binary image. Explicitly cast constants to uint8_t to remove ambiguity and avoid redundant widen/narrow operations in generated code:
 
```cpp
Func output("output");
@@ -175,9 +181,9 @@
 output(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
```
 
-This simple but effective step emphasizes strong edges and regions of high contrast, often used as a building block in segmentation and feature extraction pipelines
+This step emphasizes strong edges and regions of high contrast, providing a building block for segmentation and feature extraction pipelines.
 
-Finally, the result is realized by Halide and displayed via OpenCV. 
The pipeline is built once (outside the capture loop) and then realized each frame:
+Halide generates the final output, and OpenCV displays it. Build the pipeline once (outside the capture loop), and then realize each frame:
```cpp
// Build the pipeline once (outside the capture loop)
Buffer<uint8_t> outBuf(width, height);
@@ -192,7 +198,7 @@ imshow("Processing Workflow", view);
```
 
The main loop continues capturing frames, running the Halide pipeline, and displaying the processed output in real time until a key is pressed. This illustrates how Halide integrates cleanly with OpenCV to build efficient, interactive image-processing applications.
 
-## Compilation instructions
+## Compile and run the program
Compile the program as follows (replace /path/to/halide accordingly):
```console
g++ -std=c++17 camera-capture.cpp -o camera-capture \
@@ -205,16 +211,19 @@ g++ -std=c++17 camera-capture.cpp -o camera-capture \
Run the executable:
```console
./camera-capture
```
+The output should look similar to the figure below:
+![A camera viewport window titled Processing Workflow displaying a real-time binary threshold output from a webcam feed. The image shows a person's face and shoulders rendered in stark black and white, where bright areas above the threshold value appear white and darker areas appear black, creating a high-contrast silhouette effect that emphasizes edges and prominent features.](Figures/03.webp)
+
+## Parallelization and tiling
+
+In this section, you will explore two scheduling optimizations that Halide provides: parallelization and tiling. Each technique improves performance in a different way. Parallelization uses multiple CPU cores, while tiling optimizes cache efficiency through better data locality.
 
-The output should look as in the figure below:
-![img3](Figures/03.webp)
+You will learn how to use each technique separately for clarity and to emphasize their distinct benefits.
 
-## Parallelization and Tiling
-In this section, you will explore two complementary scheduling optimizations provided by Halide: Parallelization and Tiling. Both techniques help enhance performance but achieve it through different mechanisms—parallelization leverages multiple CPU cores, whereas tiling improves cache efficiency by optimizing data locality.
+### Establish baseline performance
 
-Now you will learn how to use each technique separately for clarity and to emphasize their distinct benefits.
+Before applying any scheduling optimizations, establish a measurable baseline. Create a second file, `camera-capture-perf-measurement.cpp`, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide `realize()` call. This lets you quantify each optimization you add next (parallelization, tiling, caching).
 
-Let’s first lock in a measurable baseline before we start changing the schedule. You will create a second file, `camera-capture-perf-measurement.cpp`, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide realize() call. This lets you quantify each optimization you will add next (parallelization, tiling, caching).
 
Create `camera-capture-perf-measurement.cpp` with the following code:
 
```cpp
@@ -353,8 +362,9 @@ realize: 3.98 ms | 251.51 FPS | 521.52 MPix/s
 
This gives an FPS of 251.51, and average throughput of 521.52 MPix/s. Now you can start measuring potential improvements from scheduling.
 
-### Parallelization
-Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. 
In image pipelines, rows (or row tiles) are naturally parallel once producer data is available. By distributing work across cores, we reduce wall-clock time—crucial for real-time video.
+### Apply parallelization
+
+Parallelization allows Halide to process different parts of the image at the same time using multiple CPU cores. In image processing pipelines, each row or block of rows can be handled independently once the input data is ready. By spreading the work across several cores, you reduce the total processing time—this is especially important for real-time video applications.
 
With the baseline measured, apply a minimal schedule that parallelizes the loop iteration over the y axis.
 
@@ -369,10 +379,15 @@ Add these lines after defining output(x, y) (and before any realize()). In this
```
 
This does two important things:
-* compute_root() on gray divides the entire processing into two loops, one to compute the entire gray output, and the other to compute the final output.
-* parallel(y) parallelizes over the pure loop variable y (rows). The rows are computed on different CPU cores in parallel.
+* `compute_root()` on gray divides the entire processing into two loops, one to compute the entire gray output, and the other to compute the final output.
+* `parallel(y)` parallelizes over the pure loop variable y (rows). The rows are computed on different CPU cores in parallel.
+
+Now rebuild and run the application. You should see output similar to:
 
-Now rebuild and run the application again. The results should look like:
```output
% ./camera-capture-perf-measurement
realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s
```
+
+This shows a significant speedup from parallelization.
 
The performance gain by parallelization depends on how many CPU cores are available for this application to occupy.
 
-### Tiling
+## Apply tiling for cache efficiency
+
Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage.
 
-Tiling splits the image into cache-friendly blocks (tiles). Two wins:
-* Partitioning: tiles are easy to parallelize across cores.
-* Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit CPU L1/L2 cache more often.
+Tiling divides the image into smaller, cache-friendly blocks called tiles. This gives you two main benefits:
+
+* Partitioning: tiles are easy to process in parallel, so you can spread the work across multiple CPU cores.
+* Locality: by caching intermediate results within each tile, you avoid repeating calculations and make better use of the CPU cache.
 
-Now lets look at both flavors.
+Next, apply tiling with per-tile caching of intermediates to see how it improves performance.
 
-### Tiling with explicit intermediate storage (best for cache efficiency)
-Here you will cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB -> gray up to 9× per output pixel.
+## Cache intermediates per tile
+
+This approach caches gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB to gray up to 9× per output pixel. This provides the best cache efficiency. 
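+
+A minimal sketch of such a schedule is shown below; the 64×64 tile size matches the demo, while the split variable names (`xo`, `yo`, `xi`, `yi`) are assumptions:
+
+```cpp
+Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
+
+// Split output into 64x64 tiles; each row of tiles can run on its own core.
+output.tile(x, y, xo, yo, xi, yi, 64, 64)
+      .parallel(yo);
+
+// Materialize gray per tile so the 3x3 blur reuses it within the tile.
+gray.compute_at(output, xo)
+    .store_at(output, xo);
+```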

```cpp
 // Scheduling
@@ -410,27 +428,33 @@ Here you will cache gray once per tile so the 3×3 blur can reuse it instead of
```
 
In this scheduling:
-* tile(...) splits the image into cache-friendly blocks and makes it easy to parallelize across tiles.
-* parallel(yo) distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles.
-* gray.compute_at(...).store_at(...) materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile.
+* `tile(...)` splits the image into cache-friendly blocks and makes it easy to parallelize across tiles
+* `parallel(yo)` distributes tiles across CPU cores, where each CPU core is in charge of a row (yo) of tiles
+* `gray.compute_at(...).store_at(...)` materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile
+
+Recompile your application as before, then run.
 
-Recompile your application as before, then run. What we observed on our machine:
+Here's sample output:
 
```output
realize: 0.98 ms | 1023.15 FPS | 2121.60 MPix/s
```
 
-This was the fastest variant here—caching a planar grayscale per tile enabled efficient reuse.
+Caching the grayscale image for each tile gives the best performance. By storing the intermediate grayscale result in a tile-local buffer, Halide can reuse it efficiently during the blur step. This reduces redundant computations and makes better use of the CPU cache, resulting in faster processing.
+
+## Choose a scheduling strategy
+There isn't a universal scheduling strategy that guarantees the best performance for every pipeline or device. The optimal approach depends on your specific image-processing workflow and the Arm architecture you're targeting. Halide's scheduling API gives you the flexibility to experiment with parallelization, tiling, and caching. Try different combinations to see which delivers the highest throughput and efficiency for your application.
+
+For this application:
+
+- Start by parallelizing the outermost loop to use multiple CPU cores. This is usually the simplest way to boost performance.
+
+- Add tiling and caching if your pipeline includes a spatial filter (such as blur or convolution), or if an intermediate result is reused by several stages. Tiling works best after converting your source data to planar format, or after precomputing a planar grayscale image.
 
-### How we schedule
-In general, there is no one-size-fits-all rule of scheduling to achieve the best performance as it depends on your pipeline characteristics and the target device architecture. So, it is recommended to explore the scheduling options and that is where Halide's scheduling API is purposed for.
+From there, tune tile sizes and thread count for your target. The `HL_NUM_THREADS` environment variable lets you limit the number of threads in flight.
 
-For example of this application:
-* Start with parallelizing the outer-most loop.
-* Add tiling + caching only if: there is a spatial filter, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray).
-* From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environmental variable which allows you to limit the number of threads in-flight.
 
+## What you've accomplished and what's next
+You built a real-time image processing pipeline using Halide and OpenCV. 
The workflow included converting camera frames to grayscale, applying a 3×3 binomial blur, and thresholding to create a binary image. You also measured performance to see how different scheduling strategies affect throughput. -## Summary -In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. And then, we observed that parallelization and tiling improved the performance. +- Parallelization lets Halide use multiple CPU cores, speeding up processing by dividing work across rows or tiles. +- Tiling improves cache efficiency, especially when intermediate results are reused often, such as with larger filters or multi-stage pipelines. -* Parallelization spreads independent work across CPU cores. -* Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (e.g., larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data. +By combining these techniques, you achieved faster and more efficient image processing on Arm systems.