[USER ERROR] FinderRev::rfind ~9x faster than Finder::find on ARM64 #207

kotysoft · 2026-02-28T15:11:03Z

kotysoft
Feb 28, 2026

Hi!

I'm building a JSON viewer app for Android (Kotlin + Rust via JNI) that handles multi-gigabyte files using mmap. During search optimization I noticed something I didn't expect: memmem::FinderRev::rfind is significantly faster than memmem::Finder::find on ARM64, on my specific data. On x86_64 it's the opposite — Finder is faster, as I'd expect.
I'm not sure if this is expected behavior, a known limitation of the NEON codepath, or if I'm doing something wrong. I put together a standalone benchmark that reproduces it.

The data

Pretty-printed JSON array with millions of identical-structure product objects (~520 bytes each). The data is extremely repetitive — same field names, same structure, similar values. Think:

[
{
  "productId": 0,
  "name": "Product 0",
  "sku": "SKU-0000000000",
  "category": "Electronics",
  ...
},
{
  "productId": 1,
  ...
}
]

The search needle is a unique SKU value ("sku": "SKU-0002097152") placed at ~50% into the data. Both find() and rfind() traverse roughly the same amount of data (~1GB each direction in a 2GB file) and both find the same result.
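For reference, the needle above can be derived from the file size the same way the benchmark's `main` does it. This is a small sketch of that arithmetic, assuming the file is exactly the 2000 MB default target; `needle_for` is a hypothetical helper name, not part of the benchmark.

```rust
/// Mirrors the needle selection in the benchmark's `main`: estimate the
/// product count as `len / 500`, take the middle product, format its SKU line.
fn needle_for(data_len: usize) -> String {
    let total_products = data_len / 500; // an estimate, not an exact count
    let target_product = (total_products as f64 * 0.50) as u32;
    format!("\"sku\": \"SKU-{:010}\"", target_product)
}

fn main() {
    // Assumption: the file is exactly the 2000 MB default target.
    let data_len: usize = 2000 * 1024 * 1024;
    let needle = needle_for(data_len);
    println!("{needle} (len={})", needle.len());

    // With ~520-byte objects, product N starts near byte N * 520.
    let approx_offset = 2_097_152_f64 * 520.0;
    println!(
        "approx. position: {:.1}%",
        approx_offset / data_len as f64 * 100.0
    );
}
```

Because the real objects are a bit larger than the 500 bytes used in the estimate, the needle lands slightly past the midpoint, so the forward scan covers a little more data than the reverse one.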

Results

ARM64 — Samsung Galaxy S23 Ultra (Snapdragon 8 Gen 2), 2GB mmap'd:

| Method | Time | vs Finder |
|---|---|---|
| `memmem::Finder` (default prefilter) | 9107ms | 1.0x |
| `memmem::Finder` (`Prefilter::None`) | 9001ms | ~same |
| `memmem::FinderRev::rfind` | 1050ms | 8.7x faster |
| memchr first-byte + manual verify | 32356ms | 3.6x slower |
| Rarest-byte memchr (custom freq table) | 4217ms | 2.2x faster |
| FinderRev-powered forward (5MB chunks) | 1083ms | 8.4x faster |

x86_64 — Windows desktop, 2GB mmap'd:

| Method | Time | vs Finder |
|---|---|---|
| `memmem::Finder` (default prefilter) | 65.6ms | 1.0x |
| `memmem::Finder` (`Prefilter::None`) | 64.3ms | ~same |
| `memmem::FinderRev::rfind` | 136ms | 2.1x slower |
| memchr first-byte + manual verify | 935ms | 14.3x slower |
| Rarest-byte memchr (custom freq table) | 66.8ms | ~same |
| FinderRev-powered forward (5MB chunks) | 142ms | 2.2x slower |

The prefilter makes essentially no difference on either platform for this type of data.

What I ended up doing in my app

Since this is a file viewer where users search from a specific scroll position, I implemented a hybrid: use FinderRev::rfind as a fast existence check on 5MB chunks, then memmem::find only on chunks that contain a match. On ARM64 this made search roughly 9x faster than using Finder::find directly on the full data.
I realize this is a very specific workload (multi-GB repetitive JSON on ARM64) and may not reflect typical usage. I'm relatively new to this level of optimization so it's entirely possible I'm misunderstanding something. Happy to provide more info or run additional tests if helpful.
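The chunked hybrid described above can be sketched roughly as follows. This is a minimal std-only illustration, not the app's code: `find_naive` and `rfind_naive` are hypothetical stand-ins for `memmem::find` and `FinderRev::rfind`, and `find_chunked` with its `start` and `chunk_size` parameters is an assumed shape for "search from a scroll position". Consecutive chunks overlap by `needle.len() - 1` bytes so a match straddling a chunk boundary is not missed.

```rust
/// Naive forward substring search; stand-in for `memmem::find`.
fn find_naive(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Naive reverse substring search; stand-in for `FinderRev::rfind`.
fn rfind_naive(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(haystack.len());
    }
    haystack.windows(needle.len()).rposition(|w| w == needle)
}

/// Forward search from `start` in fixed-size chunks: a reverse search acts
/// as the existence check, and the forward search locates the first match
/// only in a chunk that is known to contain one.
fn find_chunked(data: &[u8], needle: &[u8], start: usize, chunk_size: usize) -> Option<usize> {
    if needle.is_empty() || start >= data.len() {
        return None;
    }
    let overlap = needle.len() - 1;
    // A chunk must be longer than the overlap, otherwise the loop cannot advance.
    let chunk_size = chunk_size.max(needle.len());
    let mut chunk_start = start;
    while chunk_start < data.len() {
        let chunk_end = chunk_start.saturating_add(chunk_size).min(data.len());
        let chunk = &data[chunk_start..chunk_end];
        if rfind_naive(chunk, needle).is_some() {
            if let Some(pos) = find_naive(chunk, needle) {
                return Some(chunk_start + pos);
            }
        }
        if chunk_end == data.len() {
            break;
        }
        // Step back so the next chunk re-reads the last `overlap` bytes.
        chunk_start = chunk_end - overlap;
    }
    None
}

fn main() {
    let mut data = vec![b'a'; 40];
    let needle = b"NEEDLE";
    // Place the needle across the first 16-byte chunk boundary.
    data[13..19].copy_from_slice(needle);
    println!("{:?}", find_chunked(&data, needle, 0, 16));
}
```

Note that every chunk containing a match is scanned twice in this scheme, so it only pays off if the existence check is much cheaper than the forward search.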

Reproduction

memchr = "=2.7.6", memmap2 = "0.9", tempfile = "3"

cargo run --release          # defaults to 2GB
cargo run --release -- 200   # smaller test

Cargo.toml

[package]
name = "memchr-search-benchmark"
version = "0.1.0"
edition = "2021"
[dependencies]
memchr = "=2.7.6"
memmap2 = "0.9"
tempfile = "3"
[profile.release]
opt-level = 3
lto = true

src/main.rs (284 lines)

// Benchmark comparing memmem::Finder vs FinderRev performance
// on repetitive structured JSON data.
//
// Usage: cargo run --release -- [size_in_mb]   (default: 2000)
use memchr::memmem;
use memchr::memchr;
use std::time::Instant;
use std::io::Write;
use memmap2::Mmap;
use std::fs::File;
const CATEGORIES: &[&str] = &[
    "Electronics", "Clothing", "Books", "Home & Garden",
    "Sports", "Toys", "Health & Beauty", "Automotive",
    "Food & Beverage", "Office Supplies",
];
fn generate_product(id: u32, buf: &mut Vec<u8>) {
    let cat = CATEGORIES[(id as usize) % CATEGORIES.len()];
    let price = format!("{}.{:02}", (id * 207 + 13) % 1000, (id * 34) % 100);
    let weight = format!("{}.{:02}", (id * 11 + 1) % 50, (id * 13) % 100);
    let rating = format!("{}.{}", 3 + (id % 2), (id * 7) % 10);
    write!(buf, "{{\n\
        \x20\x20\"productId\": {id},\n\
        \x20\x20\"name\": \"Product {id}\",\n\
        \x20\x20\"sku\": \"SKU-{id:010}\",\n\
        \x20\x20\"category\": \"{cat}\",\n\
        \x20\x20\"price\": \"{price}\",\n\
        \x20\x20\"currency\": \"USD\",\n\
        \x20\x20\"inStock\": {},\n\
        \x20\x20\"quantity\": {id},\n\
        \x20\x20\"rating\": \"{rating}\",\n\
        \x20\x20\"reviewCount\": {id},\n\
        \x20\x20\"weight\": \"{weight}\",\n\
        \x20\x20\"dimensions\": {{\n\
        \x20\x20\x20\x20\"length\": \"{}.{}\",\n\
        \x20\x20\x20\x20\"width\": \"{}.{}\",\n\
        \x20\x20\x20\x20\"height\": \"{}.{}\",\n\
        \x20\x20\x20\x20\"unit\": \"cm\"\n\
        \x20\x20}},\n\
        \x20\x20\"tags\": [\n\
        \x20\x20\x20\x20\"tag{}\",\n\
        \x20\x20\x20\x20\"tag{}\",\n\
        \x20\x20\x20\x20\"tag{}\"\n\
        \x20\x20],\n\
        \x20\x20\"description\": \"Product {id} description. Category: {cat}. Excellent quality and value.\"\n\
        }}",
        if id % 2 == 0 { "false" } else { "true" },
        (id * 95 + 5) % 100, (id * 50 + 8) % 100,
        (id * 39 + 3) % 100, (id * 87 + 1) % 100,
        (id * 25 + 1) % 100, (id * 73 + 7) % 100,
        id % 10, (id + 1) % 10, (id + 2) % 10,
    ).unwrap();
}
/// Stream-write generated JSON to a file (doesn't hold full data in RAM)
fn generate_json_to_file(target_mb: usize, path: &std::path::Path) -> std::io::Result<u64> {
    use std::io::BufWriter;
    let file = std::fs::File::create(path)?;
    let mut writer = BufWriter::with_capacity(8 * 1024 * 1024, file);
    let target_bytes = target_mb * 1024 * 1024;
    writer.write_all(b"[\n")?;
    let mut id: u32 = 0;
    let mut written: u64 = 2;
    let mut product_buf = Vec::with_capacity(1024);
    while (written as usize) < target_bytes {
        product_buf.clear();
        if id > 0 {
            writer.write_all(b",\n")?;
            written += 2;
        }
        generate_product(id, &mut product_buf);
        writer.write_all(&product_buf)?;
        written += product_buf.len() as u64;
        id += 1;
    }
    writer.write_all(b"\n]\n")?;
    written += 3;
    writer.flush()?;
    eprintln!("Generated {} MB to file ({} products, {} bytes)",
        written / (1024 * 1024), id, written);
    Ok(written)
}
// -- Search methods --
fn search_memchr_first_byte(data: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() { return None; }
    let first = needle[0];
    let mut offset = 0;
    while offset + needle.len() <= data.len() {
        match memchr(first, &data[offset..]) {
            Some(pos) => {
                let abs = offset + pos;
                if abs + needle.len() <= data.len() && data[abs..abs + needle.len()] == *needle {
                    return Some(abs);
                }
                offset = abs + 1;
            }
            None => break,
        }
    }
    None
}
fn search_memmem_finder(data: &[u8], needle: &[u8]) -> Option<usize> {
    memmem::Finder::new(needle).find(data)
}
fn search_memmem_finder_no_prefilter(data: &[u8], needle: &[u8]) -> Option<usize> {
    memmem::FinderBuilder::new()
        .prefilter(memmem::Prefilter::None)
        .build_forward(needle)
        .find(data)
}
fn search_rarest_byte_memchr(data: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() { return None; }
    fn json_byte_freq(b: u8) -> u8 {
        match b {
            b'"' => 255, b':' => 250, b',' => 245,
            b'{' | b'}' => 240, b'[' | b']' => 235,
            b' ' => 230, b'\n' => 225, b'0'..=b'9' => 200,
            b'a'..=b'z' => 100 + b - b'a',
            b'A'..=b'Z' => 80 + b - b'A',
            b'_' => 150, b'-' => 140, b'.' => 160,
            _ => b,
        }
    }
    let (rare_idx, _) = needle.iter()
        .enumerate()
        .min_by_key(|(_, &b)| json_byte_freq(b))
        .unwrap();
    let rare = needle[rare_idx];
    let mut offset = 0;
    while offset + needle.len() <= data.len() {
        let search_from = offset + rare_idx;
        if search_from >= data.len() { break; }
        match memchr(rare, &data[search_from..]) {
            Some(pos) => {
                let match_start = search_from + pos - rare_idx;
                if match_start < offset { offset = search_from + pos + 1; continue; }
                if match_start + needle.len() <= data.len()
                    && data[match_start..match_start + needle.len()] == *needle {
                    return Some(match_start);
                }
                offset = match_start + 1;
            }
            None => break,
        }
    }
    None
}
fn search_finderrev_reverse(data: &[u8], needle: &[u8]) -> Option<usize> {
    memmem::FinderRev::new(needle).rfind(data)
}
fn search_finderrev_forward(data: &[u8], needle: &[u8]) -> Option<usize> {
    const CHUNK_SIZE: usize = 5 * 1024 * 1024;
    let rev_finder = memmem::FinderRev::new(needle);
    let overlap = needle.len();
    let mut chunk_start = 0usize;
    while chunk_start < data.len() {
        let chunk_end = (chunk_start + CHUNK_SIZE).min(data.len());
        let chunk = &data[chunk_start..chunk_end];
        if rev_finder.rfind(chunk).is_some() {
            if let Some(pos) = memmem::find(chunk, needle) {
                return Some(chunk_start + pos);
            }
        }
        chunk_start = if chunk_end == data.len() { data.len() } else { chunk_end - overlap };
    }
    None
}
// -- Benchmark harness --
struct BenchResult {
    name: &'static str,
    found_pos: Option<usize>,
    elapsed_ms: f64,
}
fn bench<F: Fn(&[u8], &[u8]) -> Option<usize>>(
    name: &'static str, data: &[u8], needle: &[u8], func: F,
) -> BenchResult {
    let _ = func(data, needle); // warmup
    let start = Instant::now();
    let found_pos = func(data, needle);
    let elapsed = start.elapsed();
    BenchResult { name, found_pos, elapsed_ms: elapsed.as_secs_f64() * 1000.0 }
}
fn run_benchmark_suite(data: &[u8], needle: &[u8]) {
    let data_len = data.len();
    let needle_pos = memmem::find(data, needle);
    match needle_pos {
        Some(pos) => eprintln!("Needle at byte {} ({:.1}%)",
            pos, (pos as f64 / data_len as f64) * 100.0),
        None => { eprintln!("ERROR: Needle not found!"); return; }
    }
    let results = vec![
        bench("1. memchr first byte (fwd)", data, needle, search_memchr_first_byte),
        bench("2. Finder default prefilter (fwd)", data, needle, search_memmem_finder),
        bench("3. Finder Prefilter::None (fwd)", data, needle, search_memmem_finder_no_prefilter),
        bench("4. Rarest-byte memchr (fwd)", data, needle, search_rarest_byte_memchr),
        bench("5. FinderRev rfind (rev)", data, needle, search_finderrev_reverse),
        bench("6. FinderRev-powered forward", data, needle, search_finderrev_forward),
    ];
    println!("\n{:-<75}", "");
    println!("{:<45} {:>12} {:>12}", "Method", "Position", "Time (ms)");
    println!("{:-<75}", "");
    let baseline_ms = results[0].elapsed_ms;
    for r in &results {
        let pos_str = match r.found_pos {
            Some(p) => format!("{}", p),
            None => "NOT FOUND".to_string(),
        };
        let speedup = if r.elapsed_ms > 0.0 { baseline_ms / r.elapsed_ms } else { 0.0 };
        println!("{:<45} {:>12} {:>9.1}ms  ({:.1}x)",
            r.name, pos_str, r.elapsed_ms, speedup);
    }
    println!("{:-<75}", "");
    for r in &results {
        if let Some(pos) = r.found_pos {
            if pos != needle_pos.unwrap() {
                println!("  WARN: {} found {} instead of {}", r.name, pos, needle_pos.unwrap());
            }
        }
    }
}
fn main() {
    let data_size_mb: usize = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(2000);
    eprintln!("=== memchr search benchmark ===");
    eprintln!("memchr crate version: 2.7.6");
    eprintln!("Target: {} MB\n", data_size_mb);
    // Generate to temp file (stream-write, no full data in RAM)
    let tmp = tempfile::NamedTempFile::new().expect("failed to create temp file");
    let gen_start = Instant::now();
    generate_json_to_file(data_size_mb, tmp.path()).expect("failed to generate data");
    eprintln!("Generation took {:.1}s", gen_start.elapsed().as_secs_f64());
    // Mmap the file — just like a real file viewer would
    let file = File::open(tmp.path()).expect("failed to open temp file");
    let mmap = unsafe { Mmap::map(&file).expect("failed to mmap") };
    eprintln!("Mmap'd {} bytes\n", mmap.len());
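    // Needle at ~50%: find() traverses about half the data forward, rfind() about half backward.
    // Roughly fair comparison: same data, similar distance, both FIND.
    // Note: len / 500 only estimates the product count (objects are ~520 bytes),
    // so the needle lands slightly past the midpoint.
    let total_products = mmap.len() / 500;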
    let target_product = (total_products as f64 * 0.50) as u32;
    let needle_str = format!("\"sku\": \"SKU-{:010}\"", target_product);
    let needle = needle_str.as_bytes();
    eprintln!("Needle: {} (len={})", needle_str, needle.len());
    run_benchmark_suite(&mmap, needle);
    println!("\nPlatform: {} / {}", std::env::consts::ARCH, std::env::consts::OS);
    println!("Data: {} MB mmap'd, {} products", mmap.len() / (1024 * 1024), total_products);
}

Environment

memchr 2.7.6
ARM64 device: Samsung Galaxy S23 Ultra, Snapdragon 8 Gen 2
x86_64: Windows, Intel Core i7-11700K
Data: 2GB mmap'd pretty-printed JSON, needle at 50% (~1GB traversal each direction)

Answered by BurntSushi

BurntSushi · 2026-02-28T17:48:46Z

BurntSushi
Feb 28, 2026
Maintainer

I actually cannot reproduce as given. I tried your program and Cargo.toml on my aarch64 M2 mac mini and got this:

=== memchr search benchmark ===
memchr crate version: 2.7.6
Target: 2000 MB

Generated 2000 MB to file (4032502 products, 2097152288 bytes)
Generation took 2.2s
Mmap'd 2097152288 bytes

Needle: "sku": "SKU-0002097152" (len=23)
Needle at byte 1087983444 (51.9%)

---------------------------------------------------------------------------
Method                                            Position    Time (ms)
---------------------------------------------------------------------------
1. memchr first byte (fwd)                      1087983444    1037.3ms  (1.0x)
2. Finder default prefilter (fwd)               1087983444      46.3ms  (22.4x)
3. Finder Prefilter::None (fwd)                 1087983444      46.3ms  (22.4x)
4. Rarest-byte memchr (fwd)                     1087983444      34.1ms  (30.5x)
5. FinderRev rfind (rev)                        1087983444     113.1ms  (9.2x)
6. FinderRev-powered forward                    1087983444     114.5ms  (9.1x)
---------------------------------------------------------------------------

Platform: aarch64 / macos
Data: 2000 MB mmap'd, 4194304 products

And on my x86-64 machine:

=== memchr search benchmark ===
memchr crate version: 2.7.6
Target: 2000 MB

Generated 2000 MB to file (4032502 products, 2097152288 bytes)
Generation took 1.6s
Mmap'd 2097152288 bytes

Needle: "sku": "SKU-0002097152" (len=23)
Needle at byte 1087983444 (51.9%)

---------------------------------------------------------------------------
Method                                            Position    Time (ms)
---------------------------------------------------------------------------
1. memchr first byte (fwd)                      1087983444     701.3ms  (1.0x)
2. Finder default prefilter (fwd)               1087983444      56.5ms  (12.4x)
3. Finder Prefilter::None (fwd)                 1087983444      56.6ms  (12.4x)
4. Rarest-byte memchr (fwd)                     1087983444      63.2ms  (11.1x)
5. FinderRev rfind (rev)                        1087983444      99.3ms  (7.1x)
6. FinderRev-powered forward                    1087983444     107.7ms  (6.5x)
---------------------------------------------------------------------------

Platform: x86_64 / linux
Data: 2000 MB mmap'd, 4194304 products

Notice that in both cases, FinderRev is slower than the forward direction.

IMO, your benchmark is quite complicated. And trying to generate the input in the same run as the thing you're trying to measure just makes everything way more annoying. Instead, separate your input generation from your benchmark. Here's a simpler benchmark program that reads from ./input.json:

use std::{fs::File, io::Write};

use memchr::memmem;
use memmap2::Mmap;

fn main() -> anyhow::Result<()> {
    let which = std::env::args()
        .nth(1)
        .ok_or_else(|| anyhow::anyhow!("missing search kind"))?;
    let file = File::open("input.json")?;
    let mmap = unsafe { Mmap::map(&file)? };

    let total_products = mmap.len() / 500;
    let target_product = (total_products as f64 * 0.50) as u32;
    let needle_str = format!("\"sku\": \"SKU-{:010}\"", target_product);
    let needle = needle_str.as_bytes();

    let count = match &*which {
        "rev" => {
            let finder = memmem::FinderBuilder::new().build_reverse(needle);
            finder.rfind_iter(&mmap).count()
        }
        "fwd" => {
            let finder = memmem::FinderBuilder::new().build_forward(needle);
            finder.find_iter(&mmap).count()
        }
        "fwdnopre" => {
            let finder = memmem::FinderBuilder::new()
                .prefilter(memmem::Prefilter::None)
                .build_forward(needle);
            finder.find_iter(&mmap).count()
        }
        _ => anyhow::bail!("unknown search kind"),
    };

    writeln!(std::io::stdout(), "{count}")?;
    Ok(())
}

This also iterates over all matches to ensure we aren't subtly measuring different things. With a simpler program like this, we can benchmark with hyperfine. On my x86-64 machine:

$ hyperfine './target/release/memchr-d207 rev' './target/release/memchr-d207 fwd' './target/release/memchr-d207 fwdnopre'
Benchmark 1: ./target/release/memchr-d207 rev
  Time (mean ± σ):     223.1 ms ±  14.3 ms    [User: 217.8 ms, System: 4.8 ms]
  Range (min … max):   193.2 ms … 238.0 ms    13 runs

Benchmark 2: ./target/release/memchr-d207 fwd
  Time (mean ± σ):     113.4 ms ±   3.8 ms    [User: 108.4 ms, System: 4.6 ms]
  Range (min … max):   101.5 ms … 118.6 ms    25 runs

Benchmark 3: ./target/release/memchr-d207 fwdnopre
  Time (mean ± σ):     113.0 ms ±   3.6 ms    [User: 107.6 ms, System: 4.9 ms]
  Range (min … max):   101.3 ms … 118.8 ms    26 runs

Summary
  ./target/release/memchr-d207 fwdnopre ran
    1.00 ± 0.05 times faster than ./target/release/memchr-d207 fwd
    1.98 ± 0.14 times faster than ./target/release/memchr-d207 rev

And my aarch64 machine:

$ hyperfine './target/release/memchr-d207 rev' './target/release/memchr-d207 fwd' './target/release/memchr-d207 fwdnopre'
Benchmark 1: ./target/release/memchr-d207 rev
  Time (mean ± σ):     503.9 ms ±   1.0 ms    [User: 278.8 ms, System: 224.9 ms]
  Range (min … max):   503.1 ms … 505.8 ms    10 runs

Benchmark 2: ./target/release/memchr-d207 fwd
  Time (mean ± σ):     361.3 ms ±   0.2 ms    [User: 139.4 ms, System: 221.6 ms]
  Range (min … max):   361.0 ms … 361.5 ms    10 runs

Benchmark 3: ./target/release/memchr-d207 fwdnopre
  Time (mean ± σ):     361.2 ms ±   0.3 ms    [User: 139.3 ms, System: 221.6 ms]
  Range (min … max):   360.9 ms … 361.9 ms    10 runs

Summary
  ./target/release/memchr-d207 fwdnopre ran
    1.00 ± 0.00 times faster than ./target/release/memchr-d207 fwd
    1.40 ± 0.00 times faster than ./target/release/memchr-d207 rev

Both of these are consistent with your benchmark's output on my machine: the forward search is faster.

I don't know how to easily test this on an Android device. I have an Android device, but running Rust programs on it isn't something I've done before. So that's clearly a variable that isn't being tested here.

I do have an answer to one question though: why is the forward search performing the same as a forward search with the prefilter disabled? That's because even when prefilters are disabled, if the needle is short enough, a SIMD path is still used. Generally speaking, turning off the prefilter is only going to have an effect for longer needles.

3 replies

kotysoft Feb 28, 2026
Author

Thanks for testing and for the simpler benchmark pattern! I ran your approach as a standalone binary on my S23 Ultra and got consistent results with yours — Finder is faster:

File: /sdcard/Download/test-products-1024MB-pretty-valid.json (1073741910 bytes, 1024 MB)
Needle: "sku": "SKU-0001073741" (at ~50%)
Finder iter (full scan)                  matches=1     56.6ms
Finder iter (no prefilter)               matches=1     56.1ms
FinderRev iter (full scan)               matches=1     191.0ms
Finder::find (stops at match)            matches=1     28.3ms
FinderRev::rfind (stops at match)        matches=1     96.4ms

So the standalone binary shows Finder ~3.4x faster, same as your results.

However, in my actual app (where memchr is compiled as a cdylib shared library loaded via JNI on Android), I was consistently seeing FinderRev being much faster than Finder on the same device and same data. That's what originally prompted this report.

I'm not sure what causes the difference. Could PIC/cdylib compilation be affecting the SIMD code paths? Is it something about the Android runtime environment? Or something else entirely?

TBH I feel that I don't have enough expertise to narrow it down further, but I wanted to share the data in case it's useful. I'm still trying to figure it out.

BurntSushi Feb 28, 2026
Maintainer

I don't know unfortunately. It could very well have something to do with Android. Maybe NEON isn't active or something. I'm not sure.

kotysoft Feb 28, 2026
Author

My brain is starting to melt down...

Something is definitely there which I can't see yet...

=== TEST 1: PIC enabled ===
File: /sdcard/Download/test-products-1024MB-pretty-valid.json (1073741910 bytes, 1024 MB)
Needle: "sku": "SKU-0001073741" (at ~50%)
Finder iter (full scan)                  matches=1     60.6ms
Finder iter (no prefilter)               matches=1     57.7ms
FinderRev iter (full scan)               matches=1     189.8ms
Finder::find (stops at match)            matches=1     28.5ms
FinderRev::rfind (stops at match)        matches=1     95.0ms
PS C:\DEV\Android\JSON\search-benchmark>

=== TEST 2: cdylib via dlopen ===
Loaded: /data/local/tmp/libbench_lib.so
File: /sdcard/Download/test-products-1024MB-pretty-valid.json (1073741910 bytes, 1024 MB)
Needle: "sku": "SKU-0001073741" (at ~50%)
Finder::find (cdylib)                    result=542091297    31.5ms
Finder iter (cdylib)                     result=1            62.6ms
FinderRev::rfind (cdylib)                result=542091297    96.3ms
FinderRev iter (cdylib)                  result=1            193.9ms
PS C:\DEV\Android\JSON\search-benchmark\cdylib-test>

=== TEST 3: CHUNKED (like the app) ===
File: /sdcard/Download/test-products-1024MB-pretty-valid.json (1073741910 bytes, 1024 MB)
Needle: "sku": "SKU-0001073741" (at ~50%)
Finder iter (full scan)                  matches=1     56.4ms
Finder iter (no prefilter)               matches=1     56.3ms
FinderRev iter (full scan)               matches=1     191.2ms
Finder::find (stops at match)            matches=1     28.2ms
FinderRev::rfind (stops at match)        matches=1     96.1ms
Finder::find (5MB chunks) [OLD APP]      matches=1     27.6ms
FinderRev+find (5MB chunks) [CUR APP]    matches=1     96.8ms

=== TEST 4: FUSE vs DIRECT ===
--- /sdcard (FUSE) ---
File: /sdcard/Download/test-products-1024MB-pretty-valid.json (1073741910 bytes, 1024 MB)
Needle: "sku": "SKU-0001073741" (at ~50%)
Finder iter (full scan)                  matches=1     56.2ms
Finder iter (no prefilter)               matches=1     56.8ms
FinderRev iter (full scan)               matches=1     189.6ms
Finder::find (stops at match)            matches=1     28.8ms
FinderRev::rfind (stops at match)        matches=1     94.4ms
Finder::find (5MB chunks) [OLD APP]      matches=1     28.5ms
FinderRev+find (5MB chunks) [CUR APP]    matches=1     101.0ms
--- /data/local/tmp (direct) ---
File: /data/local/tmp/test.json (1073741910 bytes, 1024 MB)
Needle: "sku": "SKU-0001073741" (at ~50%)
Finder iter (full scan)                  matches=1     67.0ms
Finder iter (no prefilter)               matches=1     67.9ms
FinderRev iter (full scan)               matches=1     194.8ms
Finder::find (stops at match)            matches=1     34.6ms
FinderRev::rfind (stops at match)        matches=1     97.6ms
Finder::find (5MB chunks) [OLD APP]      matches=1     30.3ms
FinderRev+find (5MB chunks) [CUR APP]    matches=1     94.6ms

=== TEST5: ACTUAL APP (Rust+JNI+Kotlin) same usage === ?????????????????
I jsonviewer_native: jsonviewer_core::text_viewer: GJBENCHMARK A) Finder::find(5MB chunks): 4.172616196s, found=Some(504633800), data_len=1073741910
I jsonviewer_native: jsonviewer_core::text_viewer: GJBENCHMARK B) FinderRev+find(5MB chunks): 529.46224ms, found=Some(504633800), data_len=1073741910
I jsonviewer_native: jsonviewer_core::text_viewer: GJBENCHMARK C) Finder::find(whole): 4.176944322s, found=Some(504633800), data_len=1073741910
I jsonviewer_native: jsonviewer_core::text_viewer: GJBENCHMARK D) FinderRev::rfind(whole): 553.865104ms, found=Some(504633800), data_len=1073741910

kotysoft · 2026-02-28T20:11:54Z

kotysoft
Feb 28, 2026
Author

It is NOT the CPU core. (Both run on CPU 7).
It is NOT CPU frequency/power capping. (Both are officially running at 3.36 GHz max turbo).
It is NOT the memchr SIMD prefilter. (NO PREFILTER took the exact same 4.2 seconds).
It is NOT memory page faults or cold RAM. (The 1GB mmap was warmed up in literally 4 milliseconds).
It is NOT TLB address space pressure. (madvise(HUGEPAGE) succeeded and didn't change anything).
It is NOT SIMD/MTE hardware restrictions. (HWCAP and HWCAP2 are bit-for-bit identical: the Android kernel grants the app the exact same hardware capabilities, i.e. the NEON, SVE2 and MTE flags, as the standalone root binary.)

So, the exact same machine code, running on the exact same CPU core, at the exact same clock speed, on the exact same fully-cached RAM buffer, using the exact same library... is taking 28ms in a standalone binary and 4,200ms in an Android app.

0 replies

kotysoft · 2026-02-28T22:35:11Z

kotysoft
Feb 28, 2026
Author

Hey @BurntSushi and everyone,

I am officially closing this issue with my head hung in shame. After spending the entire day building a minimal reproducible Android app to isolate the issue, I found the culprit. It wasn't ARM64, it wasn't the memchr crate, and it wasn't Android's OS limits.

It was my compiler profile. The standalone benchmarks were running in release mode, but my Android JNI build pipeline was compiling the Rust library in debug mode (no optimizations).

Once compiled with opt-level = 3, the Finder::find time dropped from 4200ms down to ~30ms on the phone. Everything is working exactly as it should, and the prefilter is blindingly fast.

I sincerely apologize for the false alarm, raising this issue, and taking up your time! On the bright side, I learned a massive amount about JNI bridging and profiling today. Thank you for the amazing crate and your patience!
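One cheap guard against this kind of mix-up is to have the native library report how it was built. The sketch below is an illustration under assumptions, not something from the app: `build_kind` is a hypothetical helper, and it uses `debug_assertions` as a proxy for "unoptimized build". That flag is on by default in Cargo's `dev` profile and off in `release`, but a custom profile can set it independently of `opt-level`, so treat it as a hint rather than proof.

```rust
/// Reports whether this code was compiled with debug assertions enabled,
/// which by default corresponds to Cargo's unoptimized `dev` profile.
fn build_kind() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

fn main() {
    let kind = build_kind();
    if kind == "debug" {
        // In a JNI library this message could go to the Android log instead.
        eprintln!("warning: debug build; benchmark numbers will be misleading");
    }
    println!("build kind: {kind}");
}
```

Logging this once at library load makes a debug build visible next to the benchmark lines in the device log.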

1 reply

BurntSushi Feb 28, 2026
Maintainer

Thanks for the update! Happens to the best of us.
