Reduced QuickDecode memory consumption by 128,884x #421
Conversation
@christian-rauch Sorry for the multiple pull requests. I made a very silly mistake and messed up the rebase. Anyway, please do review this one.
Pull request overview
This PR replaces the exhaustive precomputation-based AprilTag decoding with a pigeonhole-principle-based search algorithm, achieving a dramatic memory reduction (128,884x for the 52h13 family) while maintaining error correction up to 3-bit errors.
Key changes:
- Implements chunk-based code lookup using the pigeonhole principle instead of precomputing all error permutations
- Adds a popcount64 implementation for efficient Hamming distance calculation
- Refactors the quick_decode data structure to use four chunk lookup tables instead of a hash table
Thank you very much for this PR. Since this is quite involved, and we have to make a tradeoff between memory and computation speed for smaller tags, I am adding @mkrogius as a reviewer. I also tried out the "Copilot" reviewer to see how useful it is. Some of the suggestions regarding the NULL pointer check indeed make sense to me.
Here are some back-of-the-envelope calculations for the average number of elements in each bucket. For smaller tags, we need an average of 2×4 = 8 lookups, and for larger tags, up to 40 lookups. However, since these elements are contiguous in memory with no padding, most of them will be cached. From the example demo tests in the library, I obtained essentially the same speed from both methods; the optimized one was slightly faster, but that may be due to noise. Considering that decoding takes just a fraction of the overall detection time, I don't think the change in runtime would be noticeable. The best way to test this would be benchmarks on the devices where it is expected to run, such as robots; unfortunately, I don't have access to those.
Sure, I'll make those changes shortly.

@christian-rauch I've implemented the NULL checks, please do review.
I am fine with this, but I wanted @mkrogius to have a look at this too.
Excited to see this! Great timing for me, as I'm currently working on making MicroPython bindings for this to run on microcontrollers. Just got it working, though I spent a few too many hours debugging why tags weren't being detected. Turns out …

Isn't …

Is this chunking actually necessary? When using the new … I haven't looked at the code in detail, so maybe I'm not understanding how this works. I like the use of the pigeonhole principle, but I'm not sure it's actually necessary or better, and it adds complexity that I think can be avoided.
I mean, for small code families, sure, a for loop iterating through all the codewords might be the most efficient approach. But with larger families of 50-60k+ codewords, the lookup time increases, especially since we also need to check 4 rotations for each codeword, and possibly multiple codewords per frame. This may still be pretty fast on modern machines, but it just won't scale as well with larger and more complex tags, IMO. Also, from what I've read, extending to 5- or 6-bit error correction may just add false positives, but if needed, this code could be modified to scale like that with more chunks. The chunking method ensures that the number of popcount instructions stays around 40-50 on average, even on the largest families. In extremely memory-constrained environments, where time taken isn't as important, I suppose a linear scan could be a reasonable compromise, as sketched below.
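For what it's worth, here is a minimal sketch of what that linear-scan compromise could look like (illustrative names, rotation handling omitted; this is not code from the PR):

```c
#include <stdint.h>

// Hamming distance via Kernighan's trick: each iteration clears the
// lowest set bit, so the loop runs once per differing bit.
static inline int hamming64(uint64_t a, uint64_t b)
{
    int d = 0;
    for (uint64_t x = a ^ b; x; x &= x - 1)
        d++;
    return d;
}

// Scan every codeword and keep the closest one within max_hamming.
// A real decoder would also try the 4 rotations of the observed code.
static int linear_scan_decode(uint64_t rcode, const uint64_t *codes,
                              int ncodes, int max_hamming)
{
    int best = -1, best_dist = max_hamming + 1;
    for (int i = 0; i < ncodes; i++) {
        int d = hamming64(rcode, codes[i]);
        if (d < best_dist) { best_dist = d; best = i; }
    }
    return best;
}
```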
Ah, I didn't realize how much faster this was than looping through all the codewords! Just did a couple of tests tweaking the loop in …
It would be nice to make the 52-bit tags more usable with 3-bit error correction. I will review this PR. (In general, I think we need much more complete test coverage for this project if we want to continue developing this repo.) @sidd-27, I do have a couple of high-level questions/comments:
Review thread on the diff at `struct quick_decode_entry` and `static inline int popcount64(uint64_t x)`:
I don't understand this implementation.
This is a SWAR (SIMD within a register) algorithm to count the number of set bits in a u64: https://www.playingwithpointers.com/blog/swar.html. To be honest, even I don't fully understand how it works, but the built-in popcount intrinsic is probably not available on every machine, so this becomes the next best option, I think.
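For future readers, the standard SWAR bit count looks like this (a sketch of the textbook algorithm from the linked post, not necessarily the exact code in this diff):

```c
#include <stdint.h>

// Count set bits in a 64-bit word without a hardware popcount.
static inline int popcount64(uint64_t x)
{
    // Step 1: each 2-bit field becomes the count of its set bits
    // (0b00 -> 0, 0b01 -> 1, 0b10 -> 1, 0b11 -> 2).
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    // Step 2: sum adjacent 2-bit counts into 4-bit fields.
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    // Step 3: sum adjacent 4-bit counts into bytes (max value 8 fits a byte).
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    // Step 4: multiplying by 0x01...01 accumulates all byte sums into the
    // top byte; shifting down by 56 yields the total.
    return (int)((x * 0x0101010101010101ULL) >> 56);
}
```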
Thanks for the link. I agree that this is necessary since the popcount instruction is not universal. Would you mind adding a comment with the link for future readers?
That would add a lot of memory overhead because of the load factor, and hash collisions could also slow things down. This is an inverted index that is tightly packed, so there is no memory overhead, and lookups for entries in the same chunk are faster.
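To illustrate the "tightly packed inverted index" idea (a hypothetical layout with illustrative names, inferred from the discussion rather than copied from the PR): for each of the 4 chunk positions there is one table mapping a chunk value to a contiguous run of codeword indices.

```c
#include <stdint.h>

// One table per chunk position. For a 13-bit chunk there are 2^13 possible
// values; offsets[] is a prefix-sum array with 2^13 + 1 entries, and the
// codeword indices whose chunk equals v are stored contiguously in
// ids[offsets[v] .. offsets[v+1]). No hash buckets, no load factor, no
// padding: memory use is one offset array plus one id per codeword.
struct chunk_table {
    uint32_t *offsets; // (1 << chunk_bits) + 1 bucket boundaries
    uint16_t *ids;     // codeword indices, grouped by chunk value
};
```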
That would not help much, as the problem was the number of slots increasing exponentially with the error correction tolerance, which would still remain an issue. I actually came up with a much more optimized version of this: it stores only the tag IDs, not the tags, making each entry just 2 bytes instead of the earlier 16. See kornia/kornia-rs#618 for more details. But then again, the memory usage for 52h13 would be 384 MB there, while this version uses 450 KB.

The other code review comments are valid, and I'll make changes to my code accordingly.
I'm happy to approve these changes. This will make the 52-bit family in particular much more usable. And thank you for adding the additional tests; it is very good to have this extra coverage.

Pigeonhole-principle-based quick decode table
Problem Statement
The previous AprilTag decoding implementation relied on exhaustive precomputation of error permutations to achieve O(1) lookups. While effective for small tag families (e.g., 16h5), this approach is computationally and spatially prohibitive for larger families such as 52h13.
For the 52h13 family, precomputing all variations with up to 2-bit errors results in approximately 67 million hash map entries, which amounts to roughly 3 GB of RAM for this single check. Extending this to 3-bit errors would increase the space required exponentially, to approximately 54 GB, making it essentially unusable. Even for smaller tag families, dedicating this many resources to a single check is not feasible in many small robotics applications.
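To make the combinatorial blow-up concrete, the legacy strategy amounts to something like the following (an illustrative sketch; `add_entry` stands in for the hash-table insertion in the real quick_decode_init):

```c
#include <stdint.h>

// Stub for the hash-table insertion; in the real code each call stores a
// quick_decode_entry of roughly 16 bytes.
static void add_entry(uint64_t rcode, int id) { (void)rcode; (void)id; }

// Insert every variant of every codeword with up to 2 flipped bits:
// 1 + N + C(N,2) entries per codeword, where N is the number of bits.
static void precompute_up_to_2bit(const uint64_t *codes, int ncodes, int nbits)
{
    for (int i = 0; i < ncodes; i++) {
        add_entry(codes[i], i);                                     // exact
        for (int a = 0; a < nbits; a++) {
            add_entry(codes[i] ^ (1ULL << a), i);                   // 1-bit
            for (int b = a + 1; b < nbits; b++)
                add_entry(codes[i] ^ (1ULL << a) ^ (1ULL << b), i); // 2-bit
        }
    }
}
```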
Proposed Solution
This PR replaces the combinatorial precomputation strategy with a search-based algorithm utilizing the Pigeonhole Principle.
Instead of storing every possible error permutation, the new implementation indexes the valid tags by splitting the code into 4 discrete chunks (e.g., 13 bits each for 52h13). Given a maximum tolerance of 3-bit errors, the pigeonhole principle guarantees that at least one of the four chunks of an observed tag must match the valid tag perfectly.
The decoding process is updated to:
1. Split the observed code into 4 chunks.
2. For each chunk, look up the candidate codewords whose corresponding chunk matches exactly, using the precomputed chunk tables.
3. Compute the full Hamming distance (via popcount64) between the observed code and each candidate.
4. Accept the closest candidate if it lies within the error tolerance.

A sketch of this lookup follows.
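The search could look roughly like this (a hedged sketch with illustrative names, reusing the packed `chunk_table` layout idea from the review discussion; rotation handling omitted):

```c
#include <stdint.h>

// SWAR popcount, as discussed in the review thread above.
static inline int popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int)((x * 0x0101010101010101ULL) >> 56);
}

struct chunk_table {
    const uint32_t *offsets; // ids[offsets[v] .. offsets[v+1]) share chunk v
    const uint16_t *ids;     // codeword indices, grouped by chunk value
};

// Returns the index of the closest codeword within max_hamming, or -1.
// With max_hamming = 3 and 4 chunks, the pigeonhole principle guarantees
// that a codeword within tolerance appears in at least one scanned bucket.
static int quick_decode_search(uint64_t rcode, const uint64_t *codes,
                               const struct chunk_table tables[4],
                               int nbits, int max_hamming)
{
    int chunk_bits = nbits / 4;
    uint64_t mask = (1ULL << chunk_bits) - 1;
    int best = -1, best_dist = max_hamming + 1;

    for (int k = 0; k < 4; k++) {
        uint64_t v = (rcode >> (k * chunk_bits)) & mask;
        for (uint32_t j = tables[k].offsets[v]; j < tables[k].offsets[v + 1]; j++) {
            int d = popcount64(rcode ^ codes[tables[k].ids[j]]);
            if (d < best_dist) { best_dist = d; best = tables[k].ids[j]; }
        }
    }
    return best;
}
```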
Memory Complexity
The legacy implementation required storing every error permutation ($1 + N + \binom{N}{2} + \binom{N}{3}$, where $N$ is the number of code bits), multiplied by a load factor of 3 to resolve collisions. For the 52h13 family, this necessitated allocating over 69,000 slots per tag (approximately 3 billion slots in total). The new implementation eliminates this combinatorial explosion, storing exactly 4 references per tag. The 52h13 memory footprint went from ~54 GB to ~450 KB, which is nearly a 128,884x reduction.
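For concreteness, the arithmetic behind those figures works out roughly as follows (assuming ~16 bytes per slot and ~48,714 codes in 52h13, a family size inferred from the 67-million-entry figure above rather than stated in this PR):

$$\left(1 + 52 + \binom{52}{2} + \binom{52}{3}\right) \times 3 = 23{,}479 \times 3 \approx 70{,}437 \ \text{slots per tag}$$

$$70{,}437 \times 48{,}714 \approx 3.4 \times 10^{9} \ \text{slots} \;\Rightarrow\; 3.4 \times 10^{9} \times 16\,\text{B} \approx 55 \ \text{GB}$$

versus just $4 \times 48{,}714 \approx 195{,}000$ chunk references in total for the new scheme.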
In summary: the pigeonhole-based quick decode table reduces the 52h13 memory footprint from ~54 GB to ~450 KB while retaining 3-bit error correction and comparable decoding speed.