Reduced QuickDecode memory consumption by 128,884x #421
Conversation
@christian-rauch Sorry for the multiple pull requests. I made a very silly mistake and messed up the rebase. Anyway, please do review this one.
Pull request overview
This PR replaces the exhaustive precomputation-based AprilTag decoding with a pigeonhole-principle-based search algorithm, achieving a dramatic memory reduction (128,884x for the 52h13 family) while maintaining error correction up to 3-bit errors.
Key changes:
- Implements chunk-based code lookup using the pigeonhole principle instead of precomputing all error permutations
- Adds a popcount64 implementation for efficient Hamming distance calculation
- Refactors the quick_decode data structure to use four chunk lookup tables instead of a hash table
Thank you very much for this PR. Since this is quite involved, and we have to make a tradeoff between memory and computation speed for smaller tags, I am adding @mkrogius as a reviewer. I also tried out the "Copilot" reviewer to see how useful it is. Some of the suggestions regarding the NULL pointer check indeed make sense to me.
Here are some back-of-the-envelope calculations for the average number of elements in each bucket. For smaller tags, we need an average of 2×4 = 8 lookups, and for larger tags, up to 40 lookups. However, since these elements are contiguous in memory with no padding, most of them will be cached. From the example demo tests in the library, I obtained essentially the same speed from both methods; the optimized one was slightly faster, but that may be due to noise. Considering that decoding takes just a fraction of the overall detection time, I don't think the change in runtime would be noticeable. The best way to test this would be benchmarks on the devices where it is expected to run, such as robots; unfortunately, I don't have access to those.
Sure, I'll make those changes shortly.

@christian-rauch I've implemented the NULL checks, please do review.
I am fine with this, but I wanted @mkrogius to have a look at this too.
Excited to see this! Great timing for me, as I'm currently working on making MicroPython bindings for this to run on microcontrollers. Just got it working, though I spent a few too many hours debugging why tags weren't being detected. Turns out …

Isn't …

Is this chunking actually necessary? When using the new … I haven't looked at the code in detail, so maybe I'm not understanding how this works. I like the use of the pigeonhole principle, but I'm not sure it's actually necessary or better, and it adds complexity that I think can be avoided.
I mean, for small code families, sure, a for loop iterating through all the codewords might be the most efficient approach. But with larger families of 50-60k+ codewords, the lookup time increases, especially since we also need to check 4 rotations for each codeword, and possibly multiple codewords per frame. This may still be pretty fast on modern machines, but it just won't scale as well with larger and more complex tags, IMO. Also, from what I've read, extending to 5- or 6-bit error correction may just add false positives, but if needed, this code could be modified to scale like that with more chunks. The chunking method ensures that the number of popcount instructions stays around 40-50 on average, even on the largest families. In extremely memory-constrained environments, where time taken isn't as important, I suppose a linear scan could be a reasonable compromise, as sketched below.
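For what it's worth, here is a minimal sketch of what that linear-scan compromise could look like (illustrative names, rotation handling omitted; this is not code from the PR):

```c
#include <stdint.h>

// Hamming distance via Kernighan's trick: each iteration clears the
// lowest set bit, so the loop runs once per differing bit.
static inline int hamming64(uint64_t a, uint64_t b)
{
    int d = 0;
    for (uint64_t x = a ^ b; x; x &= x - 1)
        d++;
    return d;
}

// Scan every codeword and keep the closest one within max_hamming.
// A real decoder would also try the 4 rotations of the observed code.
static int linear_scan_decode(uint64_t rcode, const uint64_t *codes,
                              int ncodes, int max_hamming)
{
    int best = -1, best_dist = max_hamming + 1;
    for (int i = 0; i < ncodes; i++) {
        int d = hamming64(rcode, codes[i]);
        if (d < best_dist) { best_dist = d; best = i; }
    }
    return best;
}
```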
Ah, I didn't realize how much faster this was than looping through all the codewords! Just did a couple of tests tweaking the loop in …
It would be nice to make the 52-bit tags more usable with 3-bit error correction. I will review this PR. (In general, I think we need much more complete test coverage for this project if we want to continue developing this repo.) @sidd-27, I do have a couple of high-level questions/comments:
Review thread on the diff at `struct quick_decode_entry` and `static inline int popcount64(uint64_t x)`:
I don't understand this implementation.
This is a SWAR (SIMD within a register) algorithm to count the number of set bits in a u64: https://www.playingwithpointers.com/blog/swar.html. To be honest, even I don't fully understand how it works, but the built-in popcount intrinsic is probably not available on every machine, so this becomes the next best option, I think.
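For future readers, the standard SWAR bit count looks like this (a sketch of the textbook algorithm from the linked post, not necessarily the exact code in this diff):

```c
#include <stdint.h>

// Count set bits in a 64-bit word without a hardware popcount.
static inline int popcount64(uint64_t x)
{
    // Step 1: each 2-bit field becomes the count of its set bits
    // (0b00 -> 0, 0b01 -> 1, 0b10 -> 1, 0b11 -> 2).
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    // Step 2: sum adjacent 2-bit counts into 4-bit fields.
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    // Step 3: sum adjacent 4-bit counts into bytes (max value 8 fits a byte).
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    // Step 4: multiplying by 0x01...01 accumulates all byte sums into the
    // top byte; shifting down by 56 yields the total.
    return (int)((x * 0x0101010101010101ULL) >> 56);
}
```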
Thanks for the link. I agree that this is necessary since the popcount instruction is not universal. Would you mind adding a comment with the link for future readers?
That would add a lot of memory overhead because of the load factor, and hash collisions could also slow things down. This is an inverted index that is tightly packed, so there is no memory overhead, and lookups for entries in the same chunk are faster.
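To illustrate the "tightly packed inverted index" idea (a hypothetical layout with illustrative names, inferred from the discussion rather than copied from the PR): for each of the 4 chunk positions there is one table mapping a chunk value to a contiguous run of codeword indices.

```c
#include <stdint.h>

// One table per chunk position. For a 13-bit chunk there are 2^13 possible
// values; offsets[] is a prefix-sum array with 2^13 + 1 entries, and the
// codeword indices whose chunk equals v are stored contiguously in
// ids[offsets[v] .. offsets[v+1]). No hash buckets, no load factor, no
// padding: memory use is one offset array plus one id per codeword.
struct chunk_table {
    uint32_t *offsets; // (1 << chunk_bits) + 1 bucket boundaries
    uint16_t *ids;     // codeword indices, grouped by chunk value
};
```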
That would not help much, as the problem was the number of slots increasing exponentially with the error correction tolerance, which would still remain an issue. I actually came up with a much more optimized version of this: it stores only the tag IDs, not the tags, making each entry just 2 bytes instead of the earlier 16. See kornia/kornia-rs#618 for more details. But then again, the memory usage for 52h13 would be 384 MB there, while this version uses 450 KB.

The other code review comments are valid, and I'll make changes to my code accordingly.
I'm happy to approve these changes. This will make the 52-bit family in particular much more usable. And thank you for adding the additional tests; it is very good to have this extra coverage.

Pigeonhole-principle-based quick decode table
Problem Statement
The previous AprilTag decoding implementation relied on exhaustive precomputation of error permutations to achieve O(1) lookups. While effective for small tag families (e.g., 16h5), this approach is computationally and spatially prohibitive for larger families such as 52h13.
For the 52h13 family, precomputing all variations with up to 2-bit errors results in approximately 67 million hash map entries, which amounts to roughly 3 GB of RAM for this single check. Extending this to 3-bit errors would increase the space required exponentially, to approximately 54 GB, making it essentially unusable. Even for smaller tag families, dedicating this many resources to a single check is not feasible in many small robotics applications.
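To make the combinatorial blow-up concrete, the legacy strategy amounts to something like the following (an illustrative sketch; `add_entry` stands in for the hash-table insertion in the real quick_decode_init):

```c
#include <stdint.h>

// Stub for the hash-table insertion; in the real code each call stores a
// quick_decode_entry of roughly 16 bytes.
static void add_entry(uint64_t rcode, int id) { (void)rcode; (void)id; }

// Insert every variant of every codeword with up to 2 flipped bits:
// 1 + N + C(N,2) entries per codeword, where N is the number of bits.
static void precompute_up_to_2bit(const uint64_t *codes, int ncodes, int nbits)
{
    for (int i = 0; i < ncodes; i++) {
        add_entry(codes[i], i);                                     // exact
        for (int a = 0; a < nbits; a++) {
            add_entry(codes[i] ^ (1ULL << a), i);                   // 1-bit
            for (int b = a + 1; b < nbits; b++)
                add_entry(codes[i] ^ (1ULL << a) ^ (1ULL << b), i); // 2-bit
        }
    }
}
```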
Proposed Solution
This PR replaces the combinatorial precomputation strategy with a search-based algorithm utilizing the Pigeonhole Principle.
Instead of storing every possible error permutation, the new implementation indexes the valid tags by splitting the code into 4 discrete chunks (e.g., 13 bits each for 52h13). Given a maximum tolerance of 3-bit errors, the pigeonhole principle guarantees that at least one of the four chunks of an observed tag must match the valid tag perfectly.
The decoding process is updated to:
1. Split the observed code into 4 chunks.
2. For each chunk, look up the candidate codewords whose corresponding chunk matches exactly, using the precomputed chunk tables.
3. Compute the full Hamming distance (via popcount64) between the observed code and each candidate.
4. Accept the closest candidate if it lies within the error tolerance.

A sketch of this lookup follows.
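The search could look roughly like this (a hedged sketch with illustrative names, reusing the packed `chunk_table` layout idea from the review discussion; rotation handling omitted):

```c
#include <stdint.h>

// SWAR popcount, as discussed in the review thread above.
static inline int popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int)((x * 0x0101010101010101ULL) >> 56);
}

struct chunk_table {
    const uint32_t *offsets; // ids[offsets[v] .. offsets[v+1]) share chunk v
    const uint16_t *ids;     // codeword indices, grouped by chunk value
};

// Returns the index of the closest codeword within max_hamming, or -1.
// With max_hamming = 3 and 4 chunks, the pigeonhole principle guarantees
// that a codeword within tolerance appears in at least one scanned bucket.
static int quick_decode_search(uint64_t rcode, const uint64_t *codes,
                               const struct chunk_table tables[4],
                               int nbits, int max_hamming)
{
    int chunk_bits = nbits / 4;
    uint64_t mask = (1ULL << chunk_bits) - 1;
    int best = -1, best_dist = max_hamming + 1;

    for (int k = 0; k < 4; k++) {
        uint64_t v = (rcode >> (k * chunk_bits)) & mask;
        for (uint32_t j = tables[k].offsets[v]; j < tables[k].offsets[v + 1]; j++) {
            int d = popcount64(rcode ^ codes[tables[k].ids[j]]);
            if (d < best_dist) { best_dist = d; best = tables[k].ids[j]; }
        }
    }
    return best;
}
```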
Memory Complexity
The legacy implementation required storing every error permutation ($1 + N + \binom{N}{2} + \binom{N}{3}$, where $N$ is the number of code bits), multiplied by a load factor of 3 to resolve collisions. For the 52h13 family, this necessitated allocating over 69,000 slots per tag (approximately 3 billion slots in total). The new implementation eliminates this combinatorial explosion, storing exactly 4 references per tag. The 52h13 memory footprint went from ~54 GB to ~450 KB, which is nearly a 128,884x reduction.
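For concreteness, the arithmetic behind those figures works out roughly as follows (assuming ~16 bytes per slot and ~48,714 codes in 52h13, a family size inferred from the 67-million-entry figure above rather than stated in this PR):

$$\left(1 + 52 + \binom{52}{2} + \binom{52}{3}\right) \times 3 = 23{,}479 \times 3 \approx 70{,}437 \ \text{slots per tag}$$

$$70{,}437 \times 48{,}714 \approx 3.4 \times 10^{9} \ \text{slots} \;\Rightarrow\; 3.4 \times 10^{9} \times 16\,\text{B} \approx 55 \ \text{GB}$$

versus just $4 \times 48{,}714 \approx 195{,}000$ chunk references in total for the new scheme.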
In summary: the pigeonhole-based quick decode table reduces the 52h13 memory footprint from ~54 GB to ~450 KB while retaining 3-bit error correction and comparable decoding speed.