
[Perf]: Parallelize adaptive_threshold in kornia-apriltag with rayon #762

@shreed27

Description


πŸ› Describe the bug

adaptive_threshold() in the apriltag crate processes every pixel of the input image sequentially, even though both of its passes are embarrassingly parallel. On multi-core systems this leaves most of the CPU idle during a critical stage of the AprilTag detection pipeline. The function itself carries an explicit TODO at threshold.rs:185:

// TODO: Add support for parallelism
pub fn adaptive_threshold<A1: ImageAllocator, A2: ImageAllocator>(
The function has two passes: Pass 1 computes independent per-tile min/max values, and Pass 2 binarizes pixels, with each tile writing to a non-overlapping region. Neither pass has data dependencies between tiles or rows, yet both run single-threaded.

On a 1920x1080 image, this means ~2M pixels processed serially per frame, making it the single biggest per-pixel cost in the detection pipeline.
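To make the independence claim concrete, here is a minimal standalone model of the two passes. The names (`tile_min_max`, `binarize`) and the toy 8x8 image are illustrative only; the real implementation in threshold.rs operates on kornia Image buffers and a min-diff parameter. Note that each Pass 1 tile reads a disjoint source region, and each Pass 2 pixel depends only on the source and its own tile's min/max:

```rust
const W: usize = 8;
const H: usize = 8;
const TILE: usize = 4;

// Pass 1 (model): per-tile min/max -- each tile reads a disjoint region of src.
fn tile_min_max(src: &[u8]) -> Vec<(u8, u8)> {
    let (tw, th) = (W / TILE, H / TILE);
    let mut out = Vec::with_capacity(tw * th);
    for ty in 0..th {
        for tx in 0..tw {
            let (mut lo, mut hi) = (u8::MAX, u8::MIN);
            for y in 0..TILE {
                for x in 0..TILE {
                    let p = src[(ty * TILE + y) * W + tx * TILE + x];
                    lo = lo.min(p);
                    hi = hi.max(p);
                }
            }
            out.push((lo, hi));
        }
    }
    out
}

// Pass 2 (model): binarize -- each pixel depends only on src and its tile's min/max.
fn binarize(src: &[u8], minmax: &[(u8, u8)], min_diff: u8) -> Vec<u8> {
    let tw = W / TILE;
    src.iter()
        .enumerate()
        .map(|(i, &p)| {
            let (x, y) = (i % W, i / W);
            let (lo, hi) = minmax[(y / TILE) * tw + x / TILE];
            if hi - lo < min_diff {
                127 // low-contrast tile: skipped
            } else if p > lo + (hi - lo) / 2 {
                255
            } else {
                0
            }
        })
        .collect()
}

fn main() {
    // Leftmost two columns dark (0), the rest bright (200): tile 0 mixes both.
    let mut src = vec![0u8; W * H];
    for y in 0..H {
        for x in 2..W {
            src[y * W + x] = 200;
        }
    }
    let mm = tile_min_max(&src);
    let out = binarize(&src, &mm, 10);
    assert_eq!(mm[0], (0, 200)); // mixed-contrast tile
    assert_eq!(out[0], 0);       // dark pixel below the tile's midpoint
    assert_eq!(out[4], 127);     // uniform bright tile: skipped
    println!("ok");
}
```

Nothing in either loop reads what another tile or row writes, which is exactly the property that makes both passes safe to split across threads.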

📂 Feature Category

Performance Optimization

💡 Motivation

adaptive_threshold is the first per-pixel step in the AprilTag detection pipeline. On a 1920×1080 image (tile_size=4):

Pass 1: Computes min/max for ~72,900 independent tiles, with zero data dependencies between them
Pass 2: Binarizes ~2M pixels, with each tile writing to a non-overlapping dst region
Both currently run single-threaded. Meanwhile, equivalent per-pixel operations in kornia-rs (pyrdown_u8, pyrup_f32, cast_and_scale) already use rayon. adaptive_threshold is the only per-pixel hot path in the apriltag pipeline left unparallelized.

This directly bottlenecks real-time AprilTag use cases (robotics, AR, drone navigation) that run detection at 30+ fps.

🔄 Steps to Reproduce

1. Run the existing apriltag benchmark:
   cargo bench -p kornia-apriltag --bench bench_decoding

2. Profile with perf or flamegraph: adaptive_threshold dominates the per-pixel cost

3. Observe that on multi-core machines, CPU utilization stays at ~1 core during thresholding

💻 Minimal Code Example

use kornia_apriltag::{AprilTagDecoder, DecodeTagsConfig, family::TagFamilyKind};
use kornia_apriltag::threshold::{adaptive_threshold, TileMinMax};
use kornia_image::{allocator::CpuAllocator, Image, ImageSize};
use kornia_apriltag::utils::Pixel;

fn main() {
    // Create a 1920x1080 grayscale image (typical real-time pipeline)
    let src = Image::<u8, 1, _>::from_size_val(
        ImageSize { width: 1920, height: 1080 },
        128u8,
        CpuAllocator,
    ).unwrap();

    let mut dst = Image::from_size_val(src.size(), Pixel::Skip, CpuAllocator).unwrap();
    let mut tile_buffers = TileMinMax::new(src.size(), 4);

    // This runs entirely single-threaded despite being embarrassingly parallel
    adaptive_threshold(&src, &mut dst, &mut tile_buffers, 20).unwrap();
}

✅ Expected behavior

Both passes of adaptive_threshold should leverage rayon for parallel execution:

Pass 1 (tile min/max): each tile's min/max is independent; parallelize with par_iter_mut over the tile arrays
Pass 2 (binarization): each image row writes to a non-overlapping dst region; parallelize with par_chunks_mut on destination rows
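A sketch of the Pass 2 shape, with the caveat that the names and the single global threshold are stand-ins (the real code thresholds against per-tile min/max). With rayon this would be `dst.par_chunks_mut(width).zip(src.par_chunks(width)).for_each(...)`; to keep the example compilable with the standard library alone, the same disjoint-chunk ownership is shown via `std::thread::scope`:

```rust
use std::thread;

const W: usize = 1024;
const H: usize = 256;

// Row-parallel binarization. Each chunk of rows is a disjoint &mut slice,
// so threads never alias and no synchronization is needed.
fn binarize_parallel(src: &[u8], dst: &mut [u8], threshold: u8, n_threads: usize) {
    let rows_per_chunk = (H + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for (chunk_idx, dst_chunk) in dst.chunks_mut(rows_per_chunk * W).enumerate() {
            // Matching source region for this destination chunk.
            let src_chunk = &src[chunk_idx * rows_per_chunk * W..][..dst_chunk.len()];
            s.spawn(move || {
                for (d, &p) in dst_chunk.iter_mut().zip(src_chunk) {
                    *d = if p > threshold { 255 } else { 0 };
                }
            });
        }
    });
}

fn main() {
    let src: Vec<u8> = (0..W * H).map(|i| (i % 256) as u8).collect();

    // Serial reference result.
    let mut serial = vec![0u8; W * H];
    for (d, &p) in serial.iter_mut().zip(&src) {
        *d = if p > 128 { 255 } else { 0 };
    }

    // Parallel result must be bit-identical: the work is order-independent.
    let mut parallel = vec![0u8; W * H];
    binarize_parallel(&src, &mut parallel, 128, 4);
    assert_eq!(serial, parallel);
    println!("parallel result matches serial");
}
```

Because every output byte is a pure function of its input byte and threshold, the parallel result is deterministic and identical to the serial one, which is why the existing tests should pass unchanged.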

Expected speedup: 2-4x on 4+ core systems for the thresholding step, directly visible in the bench_decoding benchmark. rayon is already a workspace dependency used by kornia-imgproc.
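Pass 1 has the same structure: each tile's (min, max) lands in its own slot of the output array, so with rayon it would be `out.par_iter_mut().enumerate().for_each(...)`. A std-only sketch (illustrative names, not kornia's) using scoped threads over disjoint chunks of the output:

```rust
use std::thread;

const W: usize = 64;
const H: usize = 64;
const TILE: usize = 4;

// Pass-1 sketch: per-tile (min, max) computed in parallel. Each thread owns
// a disjoint chunk of `out`, so writes never conflict.
fn tile_min_max_parallel(src: &[u8], n_threads: usize) -> Vec<(u8, u8)> {
    let (tw, th) = (W / TILE, H / TILE);
    let mut out = vec![(0u8, 0u8); tw * th];
    let per = (out.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for (c, chunk) in out.chunks_mut(per).enumerate() {
            s.spawn(move || {
                for (j, slot) in chunk.iter_mut().enumerate() {
                    let t = c * per + j; // global tile index
                    let (tx, ty) = (t % tw, t / tw);
                    let (mut lo, mut hi) = (u8::MAX, u8::MIN);
                    for y in 0..TILE {
                        for x in 0..TILE {
                            let p = src[(ty * TILE + y) * W + tx * TILE + x];
                            lo = lo.min(p);
                            hi = hi.max(p);
                        }
                    }
                    *slot = (lo, hi);
                }
            });
        }
    });
    out
}

fn main() {
    // Vertical gradient: pixel value equals its row index.
    let src: Vec<u8> = (0..W * H).map(|i| (i / W) as u8).collect();
    let mm = tile_min_max_parallel(&src, 4);
    assert_eq!(mm.len(), 256);
    assert_eq!(mm[0], (0, 3)); // first tile spans rows 0..4
    println!("{} tiles", mm.len());
}
```

rayon's work-stealing scheduler would handle the chunking automatically and amortize thread startup across calls, which is why it is the better fit than hand-rolled threads in the actual fix.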

❌ Actual behavior

Both passes run sequentially via tile_iterator.for_each(...), utilizing only a single core. On a 1920x1080 image at 30 fps, that confines roughly 60M pixel operations per second to one thread.

🔧 Environment

- kornia-rs version: 0.1.11
- Rust version (`rustc -V`): 1.92.0 
- Cargo version (`cargo -V`): 1.92.0 
- OS (e.g., Linux, macOS, Windows): macOS
- Target architecture (if cross-compiling): arm64
- Python version (if using Python bindings): 3.9.6

🎯 Use Cases

~ Real-time AprilTag detection (robotics, AR, drone landing): at 30+ fps on 1080p, every ms saved in thresholding directly improves the frame budget
~ Batch calibration: processing thousands of images for camera calibration or mapping, with near-linear speedup in core count
~ Multi-camera systems: multiple streams share the CPU; reducing per-frame cost frees cores for other pipelines

📚 Library Reference

Follows the parallelization pattern already established in kornia-rs:

~ pyrdown_u8 / pyrup_f32 in kornia-imgproc/src/pyramid.rs: row-parallel with par_chunks_mut
~ cast_and_scale in kornia-image/src/image.rs: parallel pixel processing with rayon
~ OpenCV's adaptive threshold also uses parallel tile processing internally via TBB

πŸ“ Additional context

~ Only 2 files touched: Cargo.toml (add the rayon dependency) and threshold.rs (parallelize and remove the TODO)
~ All 4 existing tests (test_adaptive_threshold_basic, test_adaptive_threshold_uniform_image, test_adaptive_threshold_synthetic_image, invalid_buffer_size) are deterministic and order-independent, so they pass unchanged
~ bench_decoding already benchmarks the full pipeline end-to-end against C AprilTag and aprilgrid-rs, so the speedup is directly measurable
~ neighbor_blur() takes &self (read-only) and is safe to call from multiple threads without modification
~ tile_size is typically 4 in production configs, so tiles are small and numerous, giving rayon excellent work distribution

🤝 Contribution Intent

  • I plan to submit a PR to fix this bug
  • I'm reporting this bug but not planning to fix it

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), help wanted (Extra attention is needed), triage (wait for a maintainer to approve and assign this ticket)
