[USER ERROR] FinderRev::rfind ~9x faster than Finder::find on ARM64 #207
Hi! I'm building a JSON viewer app for Android (Kotlin + Rust via JNI) that handles multi-gigabyte files using mmap. During search optimization I noticed something I didn't expect:

The data

Pretty-printed JSON array with millions of identical-structure product objects (~520 bytes each). The data is extremely repetitive: same field names, same structure, similar values. Think:

[
{
"productId": 0,
"name": "Product 0",
"sku": "SKU-0000000000",
"category": "Electronics",
...
},
{
"productId": 1,
...
}
]

The search needle is a unique SKU value.

Results

ARM64: Samsung Galaxy S23 Ultra (Snapdragon 8 Gen 2), 2GB mmap'd:
x86_64: Windows desktop, 2GB mmap'd:
The prefilter makes essentially no difference on either platform for this type of data.

What I ended up doing in my app

Since this is a file viewer where users search from a specific scroll position, I implemented a hybrid: use

Reproduction

Cargo.toml
src/main.rs (284 lines)

Environment
Replies: 3 comments 4 replies
I actually cannot reproduce as given. I tried your program and Cargo.toml on my aarch64 M2 mac mini and got this:

And on my

Notice that in both cases,

IMO, your benchmark is quite complicated. And trying to generate the input in the same run as the thing you're trying to measure just makes everything way more annoying. Instead, separate your input generation from your benchmark. Here's a simpler benchmark program that reads from

This also iterates over all matches to ensure we aren't subtly measuring different things. With a simpler program like this, we can benchmark with hyperfine. On my x86-64 machine:

And my

Both of these are consistent with your benchmark's output on my machine: the forward search is faster.

I don't know how to easily test this on an Android device. I have an Android device, but running Rust programs on it isn't something I've done before. So that's clearly a variable that isn't being tested here.

I do have an answer to one question though: why is the forward search performing the same as a forward search with the prefilter disabled? That's because even when prefilters are disabled, if the needle is short enough, a SIMD path is still used. Generally speaking, turning off the prefilter is only going to have an effect for longer needles.
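The benchmark program from this reply did not survive in the text above, so the following is only a hypothetical sketch of the shape it describes: the input is supplied from outside the timed region, every match is counted in both directions, and the two counts are compared so that both searches do the same work. To stay std-only, `count_forward` and `count_reverse` are naive stand-ins invented for this illustration; in a real benchmark they would be replaced by the memchr crate's forward and reverse substring searchers.

```rust
use std::time::{Duration, Instant};

/// Count non-overlapping matches scanning left to right.
/// Naive stand-in (an assumption for this sketch), not the memchr algorithm.
fn count_forward(haystack: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() || needle.len() > haystack.len() {
        return 0;
    }
    let (mut count, mut pos) = (0, 0);
    while pos + needle.len() <= haystack.len() {
        if &haystack[pos..pos + needle.len()] == needle {
            count += 1;
            pos += needle.len(); // skip past the match
        } else {
            pos += 1;
        }
    }
    count
}

/// Count non-overlapping matches scanning right to left.
/// Naive stand-in (an assumption for this sketch), not the memchr algorithm.
fn count_reverse(haystack: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() || needle.len() > haystack.len() {
        return 0;
    }
    let (mut count, mut end) = (0, haystack.len());
    while end >= needle.len() {
        if &haystack[end - needle.len()..end] == needle {
            count += 1;
            end -= needle.len(); // skip past the match
        } else {
            end -= 1;
        }
    }
    count
}

/// Run a search closure and return its match count with the elapsed time.
fn timed<F: FnOnce() -> usize>(search: F) -> (usize, Duration) {
    let start = Instant::now();
    let count = search();
    (count, start.elapsed())
}

fn main() {
    // Hypothetical input; a real run would read a pre-generated file instead.
    let haystack =
        b"{\"sku\": \"SKU-0000000000\"}\n{\"sku\": \"SKU-0000000001\"}\n".repeat(1000);
    let needle = b"SKU-0000000001";

    let (fwd, fwd_time) = timed(|| count_forward(&haystack, needle));
    let (rev, rev_time) = timed(|| count_reverse(&haystack, needle));

    // If the counts differ, the two timings are not measuring the same work.
    assert_eq!(fwd, rev, "forward and reverse searches disagree");
    println!("forward: {fwd} matches in {fwd_time:?}");
    println!("reverse: {rev} matches in {rev_time:?}");
}
```

Keeping input generation out of the timed closures and checking the counts against each other are the two properties the reply asks for; the search routines themselves are interchangeable.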
It is NOT the CPU core (both run on CPU 7). So, the exact same machine code, running on the exact same CPU core, at the exact same clock speed, on the exact same fully-cached RAM buffer, using the exact same library... is taking 28ms in a standalone binary and 4,200ms in an Android app.
Hey @BurntSushi and everyone, I am officially closing this issue with my head hung in shame.

After spending the entire day building a minimal reproducible Android app to isolate the issue, I found the culprit. It wasn't ARM64, it wasn't the memchr crate, and it wasn't Android's OS limits. It was my compiler profile.

The standalone benchmarks were running in release mode, but my Android JNI build pipeline was compiling the Rust library in debug mode (no optimizations). Once compiled with opt-level = 3, the Finder::find time dropped from 4200ms down to ~30ms on the phone. Everything is working exactly as it should, and the prefilter is blindingly fast.

I sincerely apologize for the false alarm, raising this issue, and taking up your time! On the bright side, I learned a massive amount about JNI bridging and profiling today. Thank you for the amazing crate and your patience!
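One inexpensive guard against this kind of mix-up is to have the native library report how it was built. The sketch below is a minimal illustration, not something from the thread: `build_profile` is a hypothetical helper, and it assumes `debug_assertions` follows Cargo's defaults (enabled in the dev profile, disabled in release). A profile that overrides that setting would make the check misleading.

```rust
/// Report whether this code was compiled with debug assertions enabled.
/// Assumption: Cargo's defaults, where only the dev profile enables them.
pub fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

fn main() {
    // In a JNI library, this string could be logged once when the library loads.
    eprintln!("native library build profile: {}", build_profile());
}
```

Logging this once at startup makes an unoptimized library visible before any benchmark numbers are compared.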