heuristic to skip chunks of leading whitespace when parsing #881

samyron · 2025-10-31T14:38:34Z

This might be a stretch, but this PR implements a heuristic in json_eat_whitespace: if the next character is a \n, it may be followed by a run of consecutive spaces (0x20). If so, we can skip them quickly, several bytes at a time.
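As a rough illustration of the heuristic described above, here is a minimal sketch of skipping spaces eight bytes at a time. The helper name `skip_space_chunks` and its `cursor`/`end` parameters are hypothetical and chosen for this example; this is not the parser's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration): `cursor` points just
 * past a '\n'. Skip runs of eight spaces (0x20) at a time and return the new
 * cursor. Fewer than eight remaining spaces are left for the regular
 * byte-by-byte whitespace loop to consume. */
static const char *skip_space_chunks(const char *cursor, const char *end)
{
    assert(cursor <= end);
    while ((size_t)(end - cursor) >= sizeof(uint64_t)) {
        uint64_t chunk;
        /* memcpy avoids unaligned-access problems; compilers typically
         * turn it into a single load. */
        memcpy(&chunk, cursor, sizeof(chunk));
        if (chunk != UINT64_C(0x2020202020202020)) {
            break; /* at least one of the eight bytes is not a space */
        }
        cursor += sizeof(chunk);
    }
    return cursor;
}
```

Because all eight bytes of the constant are identical, the comparison does not depend on byte order.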

activitypub-pretty.json was generated by JSON.pretty_generate(JSON.load_file('activitypub.json')).

Benchmarks compared to master on my M1 MacBook Air:

== Parsing activitypub.json (58160 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   935.000 i/100ms
Calculating -------------------------------------
               after      9.600k (± 0.5%) i/s  (104.17 μs/i) -     48.620k in   5.064774s

Comparison:
              before:     9452.1 i/s
               after:     9599.9 i/s - same-ish: difference falls within error


== Parsing activitypub-pretty.json (65761 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     1.177k i/100ms
Calculating -------------------------------------
               after     11.724k (± 1.6%) i/s   (85.30 μs/i) -     58.850k in   5.021088s

Comparison:
              before:    11074.0 i/s
               after:    11723.6 i/s - 1.06x  faster


== Parsing twitter.json (567916 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    99.000 i/100ms
Calculating -------------------------------------
               after    994.981 (± 0.6%) i/s    (1.01 ms/i) -      5.049k in   5.074626s

Comparison:
              before:      909.8 i/s
               after:      995.0 i/s - 1.09x  faster


== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    55.000 i/100ms
Calculating -------------------------------------
               after    549.774 (± 0.7%) i/s    (1.82 ms/i) -      2.750k in   5.002283s

Comparison:
              before:      435.4 i/s
               after:      549.8 i/s - 1.26x  faster

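Looking for 8 spaces at a time was slightly faster than looking for 4. I'm not sure this is safe, but the following variant, which falls back to a 4-space check when the 8-space check fails, might be slightly faster still: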

if (chunk != 0x2020202020202020) {
    if (((uint32_t) chunk) == 0x20202020) {
        state->cursor += 4;
    }
    break;
}
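One possible concern with the snippet above is byte order: `(uint32_t) chunk` keeps the low 32 bits of the value, which are the first four bytes in memory on a little-endian machine but the last four on a big-endian one. A minimal sketch of a byte-order-independent variant follows, under the assumption that the surrounding loop looks like the one in the diff; the function name `skip_spaces_8_then_4` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variant (illustration only): skip eight spaces at a time,
 * and when that fails, try once to skip the first four bytes. */
static const char *skip_spaces_8_then_4(const char *cursor, const char *end)
{
    assert(cursor <= end);
    while ((size_t)(end - cursor) >= sizeof(uint64_t)) {
        uint64_t chunk;
        memcpy(&chunk, cursor, sizeof(chunk));
        if (chunk != UINT64_C(0x2020202020202020)) {
            /* Load the first four bytes in memory explicitly instead of
             * truncating `chunk`, so the check is the same on little- and
             * big-endian targets. */
            uint32_t half;
            memcpy(&half, cursor, sizeof(half));
            if (half == UINT32_C(0x20202020)) {
                cursor += sizeof(half);
            }
            break;
        }
        cursor += sizeof(chunk);
    }
    return cursor;
}
```

On little-endian targets such as the arm64 machine used for the benchmarks, this should behave the same as the cast.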

byroot · 2025-11-01T10:51:57Z

ext/json/ext/parser/parser.c

+        while (state->cursor+sizeof(uint64_t) <= state->end) {
+            uint64_t chunk;
+            memcpy(&chunk, state->cursor, sizeof(uint64_t));
+            if (chunk != 0x2020202020202020) {


I think we can do even better.

Unless I'm mistaken, we can get the exact number of consecutive spaces with:

__builtin_ctzll(bytes ^ 0x2020202020202020) / 8
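To illustrate the expression above: XOR-ing with the all-spaces constant zeroes every byte that is a space, so the number of trailing zero bits divided by 8 gives the index of the first non-space byte on a little-endian machine. The sketch below is an assumption-laden example, not the code that was merged: the name `count_leading_spaces8` is hypothetical, it relies on the GCC/Clang builtins, and it handles two cases the one-liner leaves open (an all-space chunk, where `__builtin_ctzll(0)` is undefined, and big-endian byte order).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (illustration only): count how many of the next eight
 * bytes at `p` are consecutive spaces, from 0 to 8. Requires at least eight
 * readable bytes before `end`. */
static size_t count_leading_spaces8(const char *p, const char *end)
{
    assert((size_t)(end - p) >= sizeof(uint64_t));

    uint64_t bytes;
    memcpy(&bytes, p, sizeof(bytes));

    /* Bytes equal to 0x20 become 0x00; any other byte becomes non-zero. */
    uint64_t diff = bytes ^ UINT64_C(0x2020202020202020);
    if (diff == 0) {
        return 8; /* all spaces; ctz/clz of zero is undefined */
    }
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    /* First byte in memory is the most significant byte. */
    return (size_t)__builtin_clzll(diff) / 8;
#else
    /* First byte in memory is the least significant byte. */
    return (size_t)__builtin_ctzll(diff) / 8;
#endif
}
```

A caller could advance the cursor by the returned count and keep looping only while it equals 8.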

Closes: ruby#881

If we encounter a newline, it is likely that the document is pretty printed, hence that the newline is followed by multiple spaces.

In such case we can use SWAR to count up to eight consecutive spaces at once.

```
== Parsing activitypub.json (58160 bytes)
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     1.118k i/100ms
Calculating -------------------------------------
               after     11.223k (± 0.7%) i/s   (89.10 μs/i) -     57.018k in   5.080522s

Comparison:
              before:    10834.4 i/s
               after:    11223.4 i/s - 1.04x  faster


== Parsing twitter.json (567916 bytes)
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   118.000 i/100ms
Calculating -------------------------------------
               after      1.188k (± 1.0%) i/s  (841.62 μs/i) -      6.018k in   5.065355s

Comparison:
              before:     1094.8 i/s
               after:     1188.2 i/s - 1.09x  faster


== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    58.000 i/100ms
Calculating -------------------------------------
               after    570.506 (± 3.7%) i/s    (1.75 ms/i) -      2.900k in   5.091529s

Comparison:
              before:      419.6 i/s
               after:      570.5 i/s - 1.36x  faster


== Parsing float parsing (2251051 bytes)
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    22.000 i/100ms
Calculating -------------------------------------
               after    212.010 (± 1.9%) i/s    (4.72 ms/i) -      1.078k in   5.086885s

Comparison:
              before:      189.4 i/s
               after:      212.0 i/s - 1.12x  faster
```

Co-Authored-By: Scott Myron <[email protected]>

byroot · 2025-11-01T11:23:39Z

Improved version: #886



heuristic to skip chunks of leading whitespace when parsing

f12e571

byroot force-pushed the sm/parser-whitespace-optimizations branch from b18742e to f12e571 Compare November 1, 2025 10:26

byroot reviewed Nov 1, 2025


byroot mentioned this pull request Nov 1, 2025

parser.c: Use SWAR to skip consecutive spaces #886

Merged

byroot closed this in #886 Nov 1, 2025
