
[Perf]: Parallelize adaptive_threshold in kornia-apriltag with rayon #762

@shreed27

Description


πŸ› Describe the bug

adaptive_threshold() in the apriltag crate processes every pixel of the input image sequentially, even though both of its passes are embarrassingly parallel. On multi-core systems this leaves most of the CPU idle during a critical stage of the AprilTag detection pipeline. The function itself carries an explicit TODO at threshold.rs:185:

// TODO: Add support for parallelism
pub fn adaptive_threshold<A1: ImageAllocator, A2: ImageAllocator>(
The function has two passes: Pass 1 computes independent per-tile min/max values, and Pass 2 binarizes pixels, with each tile writing to a non-overlapping region. Neither pass has data dependencies between tiles or rows, yet both run single-threaded.

On a 1920x1080 image, this means ~2M pixels processed serially per frame, making it the single biggest per-pixel cost in the detection pipeline.
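To make the independence claim concrete, here is a minimal standalone model of the two passes. The names (`tile_min_max`, `binarize`) and the toy 8x8 image are illustrative only; the real implementation in threshold.rs operates on kornia Image buffers and a min-diff parameter. Note that each Pass 1 tile reads a disjoint source region, and each Pass 2 pixel depends only on the source and its own tile's min/max:

```rust
const W: usize = 8;
const H: usize = 8;
const TILE: usize = 4;

// Pass 1 (model): per-tile min/max -- each tile reads a disjoint region of src.
fn tile_min_max(src: &[u8]) -> Vec<(u8, u8)> {
    let (tw, th) = (W / TILE, H / TILE);
    let mut out = Vec::with_capacity(tw * th);
    for ty in 0..th {
        for tx in 0..tw {
            let (mut lo, mut hi) = (u8::MAX, u8::MIN);
            for y in 0..TILE {
                for x in 0..TILE {
                    let p = src[(ty * TILE + y) * W + tx * TILE + x];
                    lo = lo.min(p);
                    hi = hi.max(p);
                }
            }
            out.push((lo, hi));
        }
    }
    out
}

// Pass 2 (model): binarize -- each pixel depends only on src and its tile's min/max.
fn binarize(src: &[u8], minmax: &[(u8, u8)], min_diff: u8) -> Vec<u8> {
    let tw = W / TILE;
    src.iter()
        .enumerate()
        .map(|(i, &p)| {
            let (x, y) = (i % W, i / W);
            let (lo, hi) = minmax[(y / TILE) * tw + x / TILE];
            if hi - lo < min_diff {
                127 // low-contrast tile: skipped
            } else if p > lo + (hi - lo) / 2 {
                255
            } else {
                0
            }
        })
        .collect()
}

fn main() {
    // Leftmost two columns dark (0), the rest bright (200): tile 0 mixes both.
    let mut src = vec![0u8; W * H];
    for y in 0..H {
        for x in 2..W {
            src[y * W + x] = 200;
        }
    }
    let mm = tile_min_max(&src);
    let out = binarize(&src, &mm, 10);
    assert_eq!(mm[0], (0, 200)); // mixed-contrast tile
    assert_eq!(out[0], 0);       // dark pixel below the tile's midpoint
    assert_eq!(out[4], 127);     // uniform bright tile: skipped
    println!("ok");
}
```

Nothing in either loop reads what another tile or row writes, which is exactly the property that makes both passes safe to split across threads.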

📂 Feature Category

Performance Optimization

💡 Motivation

adaptive_threshold is the first per-pixel step in the AprilTag detection pipeline. On a 1920×1080 image (tile_size=4):

Pass 1: Computes min/max for ~72,900 independent tiles, with zero data dependencies between them
Pass 2: Binarizes ~2M pixels, with each tile writing to a non-overlapping dst region
Both currently run single-threaded. Meanwhile, equivalent per-pixel operations in kornia-rs (pyrdown_u8, pyrup_f32, cast_and_scale) already use rayon. adaptive_threshold is the only per-pixel hot path in the apriltag pipeline left unparallelized.

This directly bottlenecks real-time AprilTag use cases (robotics, AR, drone navigation) that run detection at 30+ fps.

🔄 Steps to Reproduce

1. Run the existing apriltag benchmark:
   cargo bench -p kornia-apriltag --bench bench_decoding

2. Profile with perf or flamegraph: adaptive_threshold dominates the per-pixel cost

3. Observe that on multi-core machines, CPU utilization stays at ~1 core during thresholding

💻 Minimal Code Example

use kornia_apriltag::{AprilTagDecoder, DecodeTagsConfig, family::TagFamilyKind};
use kornia_apriltag::threshold::{adaptive_threshold, TileMinMax};
use kornia_image::{allocator::CpuAllocator, Image, ImageSize};
use kornia_apriltag::utils::Pixel;

fn main() {
    // Create a 1920x1080 grayscale image (typical real-time pipeline)
    let src = Image::<u8, 1, _>::from_size_val(
        ImageSize { width: 1920, height: 1080 },
        128u8,
        CpuAllocator,
    ).unwrap();

    let mut dst = Image::from_size_val(src.size(), Pixel::Skip, CpuAllocator).unwrap();
    let mut tile_buffers = TileMinMax::new(src.size(), 4);

    // This runs entirely single-threaded despite being embarrassingly parallel
    adaptive_threshold(&src, &mut dst, &mut tile_buffers, 20).unwrap();
}

✅ Expected behavior

Both passes of adaptive_threshold should leverage rayon for parallel execution:

Pass 1 (tile min/max): each tile's min/max is independent; parallelize with par_iter_mut over the tile arrays
Pass 2 (binarization): each image row writes to a non-overlapping dst region; parallelize with par_chunks_mut on destination rows
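A sketch of the Pass 2 shape, with the caveat that the names and the single global threshold are stand-ins (the real code thresholds against per-tile min/max). With rayon this would be `dst.par_chunks_mut(width).zip(src.par_chunks(width)).for_each(...)`; to keep the example compilable with the standard library alone, the same disjoint-chunk ownership is shown via `std::thread::scope`:

```rust
use std::thread;

const W: usize = 1024;
const H: usize = 256;

// Row-parallel binarization. Each chunk of rows is a disjoint &mut slice,
// so threads never alias and no synchronization is needed.
fn binarize_parallel(src: &[u8], dst: &mut [u8], threshold: u8, n_threads: usize) {
    let rows_per_chunk = (H + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for (chunk_idx, dst_chunk) in dst.chunks_mut(rows_per_chunk * W).enumerate() {
            // Matching source region for this destination chunk.
            let src_chunk = &src[chunk_idx * rows_per_chunk * W..][..dst_chunk.len()];
            s.spawn(move || {
                for (d, &p) in dst_chunk.iter_mut().zip(src_chunk) {
                    *d = if p > threshold { 255 } else { 0 };
                }
            });
        }
    });
}

fn main() {
    let src: Vec<u8> = (0..W * H).map(|i| (i % 256) as u8).collect();

    // Serial reference result.
    let mut serial = vec![0u8; W * H];
    for (d, &p) in serial.iter_mut().zip(&src) {
        *d = if p > 128 { 255 } else { 0 };
    }

    // Parallel result must be bit-identical: the work is order-independent.
    let mut parallel = vec![0u8; W * H];
    binarize_parallel(&src, &mut parallel, 128, 4);
    assert_eq!(serial, parallel);
    println!("parallel result matches serial");
}
```

Because every output byte is a pure function of its input byte and threshold, the parallel result is deterministic and identical to the serial one, which is why the existing tests should pass unchanged.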

Expected speedup: 2-4x on 4+ core systems for the thresholding step, directly visible in the bench_decoding benchmark. rayon is already a workspace dependency used by kornia-imgproc.
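Pass 1 has the same structure: each tile's (min, max) lands in its own slot of the output array, so with rayon it would be `out.par_iter_mut().enumerate().for_each(...)`. A std-only sketch (illustrative names, not kornia's) using scoped threads over disjoint chunks of the output:

```rust
use std::thread;

const W: usize = 64;
const H: usize = 64;
const TILE: usize = 4;

// Pass-1 sketch: per-tile (min, max) computed in parallel. Each thread owns
// a disjoint chunk of `out`, so writes never conflict.
fn tile_min_max_parallel(src: &[u8], n_threads: usize) -> Vec<(u8, u8)> {
    let (tw, th) = (W / TILE, H / TILE);
    let mut out = vec![(0u8, 0u8); tw * th];
    let per = (out.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for (c, chunk) in out.chunks_mut(per).enumerate() {
            s.spawn(move || {
                for (j, slot) in chunk.iter_mut().enumerate() {
                    let t = c * per + j; // global tile index
                    let (tx, ty) = (t % tw, t / tw);
                    let (mut lo, mut hi) = (u8::MAX, u8::MIN);
                    for y in 0..TILE {
                        for x in 0..TILE {
                            let p = src[(ty * TILE + y) * W + tx * TILE + x];
                            lo = lo.min(p);
                            hi = hi.max(p);
                        }
                    }
                    *slot = (lo, hi);
                }
            });
        }
    });
    out
}

fn main() {
    // Vertical gradient: pixel value equals its row index.
    let src: Vec<u8> = (0..W * H).map(|i| (i / W) as u8).collect();
    let mm = tile_min_max_parallel(&src, 4);
    assert_eq!(mm.len(), 256);
    assert_eq!(mm[0], (0, 3)); // first tile spans rows 0..4
    println!("{} tiles", mm.len());
}
```

rayon's work-stealing scheduler would handle the chunking automatically and amortize thread startup across calls, which is why it is the better fit than hand-rolled threads in the actual fix.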

❌ Actual behavior

Both passes run sequentially via tile_iterator.for_each(...), utilizing only a single core. On a 1920x1080 image at 30 fps, that confines roughly 60M pixel operations per second to one thread.

🔧 Environment

- kornia-rs version: 0.1.11
- Rust version (`rustc -V`): 1.92.0 
- Cargo version (`cargo -V`): 1.92.0 
- OS (e.g., Linux, macOS, Windows): macOS
- Target architecture (if cross-compiling): arm64
- Python version (if using Python bindings): 3.9.6

🎯 Use Cases

~ Real-time AprilTag detection (robotics, AR, drone landing): at 30+ fps on 1080p, every ms saved in thresholding directly improves the frame budget
~ Batch calibration: processing thousands of images for camera calibration or mapping, with near-linear speedup in core count
~ Multi-camera systems: multiple streams share the CPU; reducing per-frame cost frees cores for other pipelines

📚 Library Reference

Follows the parallelization pattern already established in kornia-rs:

~ pyrdown_u8 / pyrup_f32 in kornia-imgproc/src/pyramid.rs: row-parallel with par_chunks_mut
~ cast_and_scale in kornia-image/src/image.rs: parallel pixel processing with rayon
~ OpenCV's adaptive threshold also uses parallel tile processing internally via TBB

πŸ“ Additional context

~ Only 2 files touched: Cargo.toml (add the rayon dependency) and threshold.rs (parallelize and remove the TODO)
~ All 4 existing tests (test_adaptive_threshold_basic, test_adaptive_threshold_uniform_image, test_adaptive_threshold_synthetic_image, invalid_buffer_size) are deterministic and order-independent, so they pass unchanged
~ bench_decoding already benchmarks the full pipeline end-to-end against C AprilTag and aprilgrid-rs, so the speedup is directly measurable
~ neighbor_blur() takes &self (read-only) and is safe to call from multiple threads without modification
~ tile_size is typically 4 in production configs, so tiles are small and numerous, giving rayon excellent work distribution

🤝 Contribution Intent

  • I plan to submit a PR to fix this bug
  • I'm reporting this bug but not planning to fix it

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), help wanted (Extra attention is needed), triage (wait for a maintainer to approve and assign this ticket)
