
Conversation

@sidd-27
Contributor

@sidd-27 sidd-27 commented Jan 8, 2026

Pigeonhole-principle based quick decode table

Problem Statement

The previous AprilTag decoding implementation relied on exhaustive precomputation of error permutations to achieve O(1) lookups. While effective for small tag families (e.g., 16h5), this approach is computationally and spatially prohibitive for larger families such as 52h13.

For the 52h13 family, precomputing all variations with up to 2-bit errors results in approximately 67 million hash map entries, which amounts to roughly 3 GB of RAM for this single check. Extending this to 3-bit errors would increase the space needed combinatorially to approximately 54 GB, making it essentially unusable. Even for smaller tag families, dedicating this many resources to one lookup table is not feasible in many small robotics applications.

Proposed Solution

This PR replaces the combinatorial precomputation strategy with a search-based algorithm utilizing the Pigeonhole Principle.

Instead of storing every possible error permutation, the new implementation indexes the valid tags by splitting each code into 4 discrete chunks (e.g. 13 bits each for 52h13). Given a maximum tolerance of 3-bit errors, the pigeonhole principle guarantees that at least one of the four chunks of an observed tag must match the corresponding chunk of the valid tag perfectly. A sketch of the resulting lookup follows the steps below.

The decoding process is updated to:

  1. Perform lookups on the 4 chunks to identify candidate tags.
  2. Compute the Hamming distance between the observed code and the candidate's perfect code.
  3. Return the match if the distance is within the specified threshold.
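
A minimal sketch of this chunk-indexed lookup (hypothetical names and layout, not the actual structures from this PR; the inner candidate scan stands in for the real bucket lookup):

```c
#include <stdint.h>

#define NCHUNKS 4

// Portable bit-count fallback (the PR itself uses a SWAR popcount64).
static int popcount64_ref(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

// Extract chunk i (chunk_bits wide) from a code.
static uint64_t chunk_of(uint64_t code, int i, int chunk_bits)
{
    return (code >> (i * chunk_bits)) & ((1ULL << chunk_bits) - 1);
}

// Return the index of the matching code, or -1 if none is within max_errors.
// With <= 3 bit errors spread over 4 chunks, at least one chunk of 'observed'
// is error-free, so the correct code is guaranteed to be reachable through
// one of the 4 chunk lookups.
static int chunk_decode(uint64_t observed, const uint64_t *codes, int ncodes,
                        int chunk_bits, int max_errors)
{
    for (int i = 0; i < NCHUNKS; i++) {
        uint64_t key = chunk_of(observed, i, chunk_bits);
        // In the real implementation this is an indexed bucket lookup rather
        // than a scan over all codes; the scan here is only for illustration.
        for (int id = 0; id < ncodes; id++) {
            if (chunk_of(codes[id], i, chunk_bits) != key)
                continue;
            if (popcount64_ref(observed ^ codes[id]) <= max_errors)
                return id;
        }
    }
    return -1;
}
```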

Memory Complexity

The legacy implementation required storing every error permutation ($1 + N + \binom{N}{2} + \binom{N}{3}$ entries per code), multiplied by a load factor of 3 to resolve collisions. For the 52h13 family, this necessitated allocating over 69,000 slots per tag (approximately 3 billion slots in total). The new implementation eliminates this combinatorial explosion by storing exactly 4 references per tag. The 52h13 memory footprint drops from ~54 GB to ~450 KB, a reduction of nearly 128,884x.
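
To make these numbers concrete for 52h13 ($N = 52$, 48,714 codes, 16 bytes per hash-map entry): the legacy table stores $1 + 52 + \binom{52}{2} + \binom{52}{3} = 23{,}479$ variations per code, or roughly $70{,}000$ slots after the 3x load factor; across all codes that is about $3.4 \times 10^9$ slots, i.e. roughly 54 GB. The new index instead stores $4 \times 48{,}714 \approx 195{,}000$ chunk references in total, which (at a few bytes each, plus the per-chunk bucket offsets) lands in the hundreds of kilobytes.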

In summary -

  • Memory requirements and initialization: both are now several orders of magnitude better than in the earlier implementation.
  • Error correction: up to 3-bit error correction is now supported without any additional memory overhead, and the technique scales to 4- or even 5-bit errors with relative ease.
  • Regression: existing tests pass without modification.
  • Performance: since this is a technically more complex structure, the quick decode lookup may be slightly slower than before, but it still completes in nanoseconds, is not the bottleneck, and causes no observable change in overall detection performance.

@sidd-27 sidd-27 marked this pull request as ready for review January 8, 2026 20:52
@sidd-27
Contributor Author

sidd-27 commented Jan 8, 2026

@christian-rauch Sorry for the multiple pull requests. I made a very silly mistake and messed up the rebase. Anyway, please review this one.


Copilot AI left a comment


Pull request overview

This PR replaces the exhaustive precomputation-based AprilTag decoding with a pigeonhole-principle based search algorithm, achieving dramatic memory reduction (128,884x for 52h13 family) while maintaining error correction capabilities up to 3-bit errors.

Key changes:

  • Implements chunk-based code lookup using the pigeonhole principle instead of precomputing all error permutations
  • Adds a popcount64 implementation for efficient Hamming distance calculation
  • Refactors the quick_decode data structure to use four chunk lookup tables instead of a hash table


@christian-rauch
Collaborator

Thank you a lot for this PR. Since this is quite involved, and we have to make a tradeoff between memory and computation speed for smaller tags, I am adding @mkrogius as a reviewer.

I also tried out the "Copilot" reviewer to see how useful this is. Some of the suggestions regarding the NULL pointer check indeed make sense to me.

@sidd-27
Contributor Author

sidd-27 commented Jan 9, 2026

we have to make a tradeoff between memory and computation speed for smaller tags

Here are some back-of-the-envelope calculations for the average number of elements in each bucket (see the table below). Since we do one lookup per chunk, smaller tags need an average of about 2 × 4 = 8 lookups, and the larger families up to roughly 40. However, since these elements are contiguous in memory with no padding, most of them will be cached. From the example demo tests in the library, I obtained essentially the same speed from both methods; the optimized one was slightly faster, but that may be due to noise.

Considering that decoding takes just a fraction of the overall detection time, I don't think the change in runtime would be noticeable.

However, the best way to test this would be benchmarks on the devices where it is expected to run, such as robots. Unfortunately, I don't have access to those.

| Tag Family | Codes | Chunk Size (bits) | Total Buckets | Non-Empty | Avg Bucket Size | Buckets ≥ 2 | Buckets ≥ 4 |
|---|---|---|---|---|---|---|---|
| Tag16H5 | 30 | 4 | 64 | 85.9% | 2.18 | 57.8% | 10.9% |
| Tag25H9 | 35 | 7 | 512 | 21.5% | 1.27 | 3.5% | 0.6% |
| Tag36H10 | 2,320 | 9 | 2,048 | 98.8% | 4.58 | 93.3% | 66.6% |
| Tag36H11 | 587 | 9 | 2,048 | 69.7% | 1.64 | 32.4% | 2.1% |
| TagCircle21H7 | 38 | 6 | 256 | 39.5% | 1.50 | 11.3% | 2.7% |
| TagCircle49H12 | 65,535 | 13 | 32,768 | 78.0% | 10.25 | 77.6% | 73.9% |
| TagCustom48H12 | 42,211 | 12 | 16,384 | 100.0% | 10.31 | 99.9% | 98.6% |
| TagStandard41H12 | 2,115 | 11 | 8,192 | 51.9% | 1.99 | 24.0% | 4.3% |
| TagStandard52H13 | 48,714 | 13 | 32,768 | 98.1% | 6.06 | 93.8% | 76.7% |

Some of the suggestions regarding the NULL pointer check indeed make sense to me.

Sure, I'll make those changes shortly.

@sidd-27
Contributor Author

sidd-27 commented Jan 15, 2026

@christian-rauch I've implemented the NULL pointer checks, please review.

@christian-rauch
Collaborator

I am fine with this. But I wanted @mkrogius to have a look at this too.

@sfe-SparkFro

Excited to see this! Great timing for me, as I'm currently working on making MicroPython bindings for this to run on microcontrollers. Just got it working, though I spent a few too many hours debugging why tags weren't being detected. It turns out detector_add_family_bits() was failing with bits_corrected=2 because it needed to allocate ~18 MB of memory; my board only has ~8.5 MB of RAM total and I wasn't checking errno, so I didn't know that was the problem. I had to work around that by setting the bits to 0, so this is a welcome change!

The previous AprilTag decoding implementation relied on exhaustive precomputation of error permutations to achieve O(1) lookups.

Isn't quick_decode_codeword() actually O(n/2) on average? n being the number of entries in the quick_decode struct (current implementation) or the number of chunks (this PR). Maybe I'm misunderstanding. And probably not really important since decoding takes so much less time than everything else, but just curious to understand it.

@sfe-SparkFro

Instead of storing every possible error permutation, the new implementation indexes the valid tags by splitting the code into 4 discrete chunks (for eg 13 bits each for 52h13). Given a maximum tolerance of 3-bit errors, the Pigeonhole Principle guarantees that at least one of the four chunks in an observed tag must match the valid tag perfectly.

Is this chunking actually necessary? When using the new popcount64(), the full codes can be compared instead of just the code chunks. And if the full codes are compared, then this would enable more than 3-bit errors. For a tag family of Hamming distance h, up to h/2 bit errors can be tolerated (e.g. 5-bit errors for 36h11).
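
For reference, the full-code comparison I have in mind would be something along these lines (a hypothetical sketch, not code from this PR; a simple bit-count stands in for popcount64):

```c
#include <stdint.h>

// Stand-in for the PR's SWAR popcount64(); any bit-count works for the sketch.
static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

// Hypothetical linear scan: compare the observed code against every stored
// code and keep the closest match within the allowed error budget.
static int linear_decode(uint64_t observed, const uint64_t *codes, int ncodes,
                         int max_errors)
{
    int best_id = -1, best_dist = max_errors + 1;
    for (int id = 0; id < ncodes; id++) {
        int d = popcount64(observed ^ codes[id]);   // Hamming distance
        if (d < best_dist) {
            best_dist = d;
            best_id = id;
        }
    }
    return best_id;   // -1 if nothing is within max_errors bits
}
```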

I haven't looked at the code in detail, so maybe I'm not understanding how this works. I like the use of the pigeonhole principle, but I'm not sure it's actually necessary or better, and it adds complexity that I think can be avoided.

@sidd-27
Contributor Author

sidd-27 commented Jan 17, 2026

For small code families, sure, a for loop iterating through all the codewords might be the most efficient approach. But for larger families with 50–60k+ codewords, the lookup time grows, especially since we also need to check 4 rotations of each observed code, and possibly multiple codes per frame. This may still be pretty fast on modern machines, but in my opinion it just won't scale as well to larger and more complex tags.

Also, from what I've read, extending to 5- or 6-bit error correction may just add false positives, but if needed, this code could be scaled that way by using more chunks.

This chunking method keeps the number of popcount comparisons at around 40–50 on average, even for the largest families. In extremely memory-constrained environments where runtime isn't as important, I suppose a plain linear scan could be a reasonable compromise.

@sfe-SparkFro

Ah, I didn't realize how much faster this was than looping through all the codewords! Just did a couple tests tweaking the loop in quick_decode_codeword() using tagStandard52h13. On my machine, both the current implementation and this PR complete the loop in under 1us, whereas looping through all codewords takes anywhere from 150us to 400us. Please disregard my uninformed commentary from before 🙂

@mkrogius
Contributor

It would be nice to make the 52-bit tags more usable with 3-bit error correction. I will review this PR (In general I think we need much more complete test coverage for this project if we want to continue developing this repo).

@sidd-27 I do have a couple of high-level questions / comments:

  • The implementation of quick_decode_init is different than I expected. I expected that we could simply create 4 hash tables, all using linear probing (just like the current hash table implementation). I think that your method is essentially the same but uses less memory.
  • An alternative to chunking would be to use the current method, but instead of inserting all possible variations of all codes, insert all possible variations of the first p bits of each code. This would save some amount of memory and be a smaller change. Was this an option you considered?
  • Would it be possible to add a test that all codes from all families and all errors of those codes (w/ up to 3 bits flipped) decode correctly?


struct quick_decode_entry
static inline int popcount64(uint64_t x)
Contributor


I don't understand this implementation.

Contributor Author


This is a SWAR (SIMD within a register) algorithm to count the number of set bits in a uint64_t: https://www.playingwithpointers.com/blog/swar.html. To be honest, even I don't fully understand how it works, but a built-in popcount intrinsic probably isn't available on every machine, so this seemed like the next best option.
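
For future readers, the textbook form of the 64-bit SWAR popcount looks like this (a reference sketch; the committed function should be equivalent, though the exact formatting may differ):

```c
#include <stdint.h>

// SWAR popcount: fold bit counts into progressively wider fields.
static inline int popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);            // 2-bit pair sums
    x = (x & 0x3333333333333333ULL) +
        ((x >> 2) & 0x3333333333333333ULL);                // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;            // 8-bit sums per byte
    return (int)((x * 0x0101010101010101ULL) >> 56);       // add all byte sums
}
```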

Contributor


Thanks for the link. I agree that this is necessary since the popcount instruction is not universal. Would you mind adding a comment with the link for future readers?

@sidd-27
Contributor Author

sidd-27 commented Jan 20, 2026

The implementation of quick_decode_init is different than I expected. I expected that we could simply create 4 hash tables, all using linear probing (just like the current hash table implementation). I think that your method is essentially the same but uses less memory.

That would add a lot of memory overhead because of the load factor, and hash collisions could also slow things down. This is a tightly packed inverted index, so there is no per-slot overhead, and lookups for codes sharing a chunk value are faster.
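
To illustrate what I mean by a tightly packed inverted index, something along these lines (hypothetical field names, not necessarily the exact struct in this PR):

```c
#include <stdint.h>

// One inverted index per chunk position (4 in total). Bucket b holds the IDs
// of all codes whose chunk value equals b, stored back-to-back with no empty
// slots, so memory is exactly one entry per (code, chunk) pair plus offsets.
typedef struct {
    uint32_t *bucket_start;  // length (1 << chunk_bits) + 1; bucket b spans
                             // ids[bucket_start[b] .. bucket_start[b + 1])
    uint16_t *ids;           // length ncodes; packed code IDs
} chunk_table;
```

Compared to an open-addressed hash table there are no reserved empty slots, so the load factor disappears, and all candidates sharing a chunk value sit in one contiguous run of memory.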

An alternative to chunking would be to use the current method, but instead of inserting all possible variations of all codes, insert all possible variations of the first p bits of each code. This would save some amount of memory and be a smaller change. Was this an option you considered?

That would not help much, since the core problem is the number of slots growing combinatorially with the error-correction level, which would still remain. I did implement a more optimized variant of that idea: it stores only the tag IDs rather than full entries, making each entry 2 bytes instead of the earlier 16 — see kornia/kornia-rs#618 for details. But even then, the memory usage for 52h13 would be around 384 MB, whereas this version uses ~450 KB.

The other code review comments are valid, and I'll update my code accordingly.

@sidd-27
Contributor Author

sidd-27 commented Jan 20, 2026

Added tests to validate all possible 3-bit errors across all families
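
Roughly, the test enumerates the error patterns like this (a hypothetical sketch with stand-in names such as decode(), codes, and nbits, not the actual test code; for families whose minimum Hamming distance cannot support 3-bit correction, the flip depth is capped accordingly):

```c
#include <assert.h>
#include <stdint.h>

// Stand-in for the quick-decode lookup; returns the recovered code ID or -1.
int decode(uint64_t observed);

static void check_family(const uint64_t *codes, int ncodes, int nbits)
{
    for (int id = 0; id < ncodes; id++) {
        uint64_t code = codes[id];
        assert(decode(code) == id);                              // no error
        for (int a = 0; a < nbits; a++) {
            assert(decode(code ^ (1ULL << a)) == id);            // 1-bit errors
            for (int b = a + 1; b < nbits; b++) {
                uint64_t e2 = code ^ (1ULL << a) ^ (1ULL << b);
                assert(decode(e2) == id);                        // 2-bit errors
                for (int c = b + 1; c < nbits; c++)
                    assert(decode(e2 ^ (1ULL << c)) == id);      // 3-bit errors
            }
        }
    }
}
```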


@mkrogius
Contributor

I'm happy to approve these changes. This will make the 52-bit family in particular much more usable. And thank you for adding the additional tests; it is very good to have this extra coverage.

@christian-rauch christian-rauch merged commit c5a4fc1 into AprilRobotics:master Jan 20, 2026
42 checks passed