Use Vector API in the Java Extension #824

Open · samyron wants to merge 11 commits into master from sm/java-vector-simd

Conversation

@samyron (Contributor) commented Jul 8, 2025

PLEASE DO NOT MERGE

Overview

This PR uses the jdk.incubator.vector module, as proposed in issue #739, to accelerate JSON generation with the same algorithm as the C extension.
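For context, the heart of such a vectorized scan — load a vector of bytes and test the three escape conditions (control byte, quote, backslash) in one pass — can be sketched with the incubator API roughly as follows. This is an illustrative sketch, not the PR's actual `VectorizedEscapeScanner`, and it must be compiled and run with `--add-modules jdk.incubator.vector`:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch only: requires --add-modules jdk.incubator.vector at compile and run time.
class VectorScan {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Index of the first byte that needs a JSON escape, or -1 if none.
    static int firstEscapeIndex(byte[] buf) {
        int i = 0;
        int bound = SPECIES.loopBound(buf.length);
        for (; i < bound; i += SPECIES.length()) {
            ByteVector v = ByteVector.fromArray(SPECIES, buf, i);
            // One mask combining: byte == '"', byte == '\\', or unsigned byte < 0x20.
            VectorMask<Byte> m = v.eq((byte) '"')
                .or(v.eq((byte) '\\'))
                .or(v.compare(VectorOperators.UNSIGNED_LT, (byte) 0x20));
            if (m.anyTrue()) return i + m.firstTrue();
        }
        // Scalar tail for the remainder that doesn't fill a full vector.
        for (; i < buf.length; i++) {
            int b = buf[i] & 0xFF;
            if (b < 0x20 || b == '"' || b == '\\') return i;
        }
        return -1;
    }
}
```

Bytes ≥ 0x80 (UTF-8 multibyte sequences) are deliberately not flagged, since they need no escaping.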

As it stands, the PR builds the json.ext.VectorizedEscapeScanner class with a target release of 16, the first Java version to ship the jdk.incubator.vector module. The remaining code is built for Java 1.8. json.ext.VectorizedEscapeScanner is loaded only if the json.enableVectorizedEscapeScanner system property is set to true (or 1).
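That opt-in loading pattern might look roughly like this sketch — the factory and fallback class names here are hypothetical; only the system property and json.ext.VectorizedEscapeScanner come from the PR:

```java
// Hypothetical sketch of property-gated loading: the vectorized scanner is
// only linked when explicitly enabled, so the rest of the extension stays
// Java 8 compatible even on JVMs without jdk.incubator.vector.
class EscapeScannerFactory {
    static boolean vectorizedEnabled() {
        String v = System.getProperty("json.enableVectorizedEscapeScanner", "false");
        return "true".equalsIgnoreCase(v) || "1".equals(v);
    }

    static Object createScanner() {
        if (vectorizedEnabled()) {
            try {
                // Reflection avoids a hard class-file dependency on code
                // compiled with --release 16 against the incubator module.
                Class<?> cls = Class.forName("json.ext.VectorizedEscapeScanner");
                return cls.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError e) {
                // Class missing or module unavailable: fall through to scalar.
            }
        }
        return new ScalarEscapeScanner();  // hypothetical scalar fallback
    }

    static final class ScalarEscapeScanner {}
}
```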

I'm not entirely sure how this is packaged / included with JRuby, so I'd love @byroot's and @headius's (and others'?) thoughts on how to package and/or structure the JARs. I did consider adding json.ext.VectorizedEscapeScanner to a separate generator-vectorized.jar, but I thought I'd solicit feedback before spending any more time on the build / packaging process.

Benchmarks

Machine: M1 MacBook Air

Note: I've had trouble modifying the compare.rb I was using for the C extension to work reliably with the Java extension. I'll probably spend more time getting it to work, but for now these are fairly raw benchmarks.

Below are sample runs of the real-world benchmarks. The results are much more variable than with the C extension for some reason; I'm not sure if HotSpot is doing something slightly different per execution.

Vector API Enabled

scott@Scotts-MacBook-Air json % ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djson.enableVectorizedEscapeScanner=true' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.384k i/100ms
Calculating -------------------------------------
                json     15.289k (± 0.8%) i/s   (65.41 μs/i) -    153.624k in  10.048481s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json    76.000 i/100ms
Calculating -------------------------------------
                json    753.787 (± 3.6%) i/s    (1.33 ms/i) -      7.524k in   9.997059s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   173.000 i/100ms
Calculating -------------------------------------
                json      1.751k (± 1.1%) i/s  (571.24 μs/i) -     17.646k in  10.081260s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.390k i/100ms
Calculating -------------------------------------
                json     23.829k (± 0.8%) i/s   (41.97 μs/i) -    239.000k in  10.030503s

Vector API Disabled

scott@Scotts-MacBook-Air json % ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djson.enableVectorizedEscapeScanner=false' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
VectorizedEscapeScanner disabled.
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.204k i/100ms
Calculating -------------------------------------
                json     12.937k (± 1.1%) i/s   (77.30 μs/i) -    130.032k in  10.052234s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json    80.000 i/100ms
Calculating -------------------------------------
                json    817.378 (± 1.0%) i/s    (1.22 ms/i) -      8.240k in  10.082058s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   147.000 i/100ms
Calculating -------------------------------------
                json      1.499k (± 1.3%) i/s  (667.08 μs/i) -     14.994k in  10.004181s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.269k i/100ms
Calculating -------------------------------------
                json     22.366k (± 5.7%) i/s   (44.71 μs/i) -    224.631k in  10.097069s

master as of commit c5af1b68c582335c2a82bbc4bfa5b3e41ead1eba

scott@Scotts-MacBook-Air json % ONLY=json ruby -I"lib" benchmark/encoder-realworld.rb
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   886.000 i/100ms
Calculating -------------------------------------
                json^C%                                                                                                                   
scott@Scotts-MacBook-Air json % ONLY=json ruby -I"lib" benchmark/encoder-realworld.rb
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.031k i/100ms
Calculating -------------------------------------
                json     10.812k (± 1.3%) i/s   (92.49 μs/i) -    108.255k in  10.014260s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json    82.000 i/100ms
Calculating -------------------------------------
                json    824.921 (± 1.0%) i/s    (1.21 ms/i) -      8.282k in  10.040787s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   141.000 i/100ms
Calculating -------------------------------------
                json      1.421k (± 0.7%) i/s  (703.85 μs/i) -     14.241k in  10.023979s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.274k i/100ms
Calculating -------------------------------------
                json     22.612k (± 0.9%) i/s   (44.22 μs/i) -    227.400k in  10.057516s

Observations

activitypub.json and twitter.json seem to be consistently faster with the Vector API enabled. citm_catalog.json seems consistently a bit slower and ohai.json is fairly close to even.

@samyron force-pushed the sm/java-vector-simd branch from 194ba01 to 15c7187 on July 15, 2025 03:12
@samyron (Contributor, Author) commented Jul 15, 2025

Using hsdis to examine the generated assembly, I can verify that on my MacBook Air the HotSpot C2 compiler does indeed use Neon instructions.

ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djson.enableVectorizedEscapeScanner=true -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintIntrinsics -XX:CompileCommand=print,*VectorizedEscapeScanner.*' ruby -I"lib" benchmark/encoder-realworld.rb > output.txt 2>output.txt
Compiled method (c2)   22086 5801       4       json.ext.VectorizedEscapeScanner::scan (391 bytes)
<snip>

[Disassembly]
--------------------------------------------------------------------------------
[Constant Pool (empty)]

--------------------------------------------------------------------------------

[Entry Point]
  # {method} {0x0000000133c3a0d8} 'scan' '(Ljson/ext/EscapeScanner$State;)Z' in 'json/ext/VectorizedEscapeScanner'
  # this:     c_rarg1:c_rarg1 
                        = 'json/ext/VectorizedEscapeScanner'
  # parm0:    c_rarg2:c_rarg2 
                        = 'json/ext/EscapeScanner$State'
  #           [sp+0x30]  (sp of caller)
  0x000000011b28d0c0:   ldr		w8, [x1, #8]
  0x000000011b28d0c4:   cmp		w9, w8
  0x000000011b28d0c8:   b.eq		#0x11b28d0d0
  0x000000011b28d0cc:   b		#0x11aa5fe80        ;   {runtime_call ic_miss_stub}
[Verified Entry Point]
  0x000000011b28d0d0:   nop		
  0x000000011b28d0d4:   sub		x9, sp, #0x14, lsl #12
  0x000000011b28d0d8:   str		xzr, [x9]
  0x000000011b28d0dc:   sub		sp, sp, #0x30
 <snip>
  0x000000011b28d194:   add		x12, x5, w14, sxtw
  0x000000011b28d198:   ldr		q20, [x12, #0x10]
  0x000000011b28d19c:   eor		v21.16b, v20.16b, v17.16b
  0x000000011b28d1a0:   cmgt		v22.16b, v19.16b, v20.16b
  0x000000011b28d1a4:   cmgt		v21.16b, v18.16b, v21.16b
  0x000000011b28d1a8:   cmeq		v20.16b, v20.16b, v16.16b
  0x000000011b28d1ac:   bic		v21.16b, v21.16b, v22.16b
  0x000000011b28d1b0:   orr		v20.16b, v20.16b, v21.16b
  0x000000011b28d1b4:   str		w1, [x2, #0x30]
  0x000000011b28d1b8:   addv		b21, v20.16b
  0x000000011b28d1bc:   umov		w8, v21.b[0]
  0x000000011b28d1c0:   cmp		w8, wzr
  0x000000011b28d1c4:   b.ne		#0x11b28d40c
  0x000000011b28d1c8:   add		w14, w7, #0x10
  0x000000011b28d1cc:   ldr		x12, [x28, #0x450]
  0x000000011b28d1d0:   str		w14, [x2, #0x14]    ; ImmutableOopMap {c_rarg2=Oop c_rarg5=Oop }
                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                            ; - (reexecute) json.ext.VectorizedEscapeScanner::scan@308 (line 59)
<snip>

@headius (Contributor) commented Jul 16, 2025

@samyron OMG I look away for a few days and you just go and do it! Bravo!

I'll have a look at these changes soon and see if I can offer any suggestions. This API is still a bit of a moving target, but I think we can work around that with a little Ruby magic here and there.

I will also point the Vector API folks at this PR so they can see what we're doing and provide additional input.

Amazing work!

@headius (Contributor) commented Jul 16, 2025

I've posted a thread to the panama-dev list here: https://mail.openjdk.org/pipermail/panama-dev/2025-July/021080.html

@samyron (Contributor, Author) commented Jul 28, 2025

I decided to try a different approach after looking at the HotSpot C2 output. Unlike the C extension, where we mostly control method inlining, HotSpot isn't so easily influenced.

I merged everything into VectorizedStringEncoder, which folds the escape logic into the vectorized scanning. This eliminates the method calls back into the search code.
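The fused idea can be illustrated with a scalar sketch (hypothetical names, not the PR's code): the scan loop and the escape emission live in one method, so each escape costs one bulk copy of the preceding clean run rather than a call across a scanner/encoder boundary:

```java
import java.nio.charset.StandardCharsets;

// Illustrative scalar sketch of a fused scan-and-escape encoder: one hot
// loop finds the next byte needing an escape, flushes the clean run in
// bulk, and emits the escape inline.
class FusedEncoder {
    static String encodeBasic(byte[] in) {
        StringBuilder out = new StringBuilder(in.length + 2);
        out.append('"');
        int runStart = 0;
        for (int i = 0; i < in.length; i++) {
            int b = in[i] & 0xFF;
            if (b < 0x20 || b == '"' || b == '\\') {
                // Flush the clean run preceding the escape in one copy.
                out.append(new String(in, runStart, i - runStart, StandardCharsets.UTF_8));
                switch (b) {
                    case '"':  out.append("\\\""); break;
                    case '\\': out.append("\\\\"); break;
                    case '\n': out.append("\\n");  break;
                    case '\t': out.append("\\t");  break;
                    case '\r': out.append("\\r");  break;
                    default:   out.append(String.format("\\u%04x", b));
                }
                runStart = i + 1;
            }
        }
        // Flush the trailing clean run.
        out.append(new String(in, runStart, in.length - runStart, StandardCharsets.UTF_8));
        out.append('"');
        return out.toString();
    }
}
```

For strings with no escapes (the common case in these benchmarks), the loop degenerates to a single scan plus one copy.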

Performance of VectorizedStringEncoder

scott@Scotts-MacBook-Air json % ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djson.enableVectorizedEscapeScanner=false -Djson.enableVectorizedStringEncoder=true' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
VectorizedEscapeScanner disabled.
json.ext.VectorizedStringEncoder loaded successfully.
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.537k i/100ms
Calculating -------------------------------------
                json     15.382k (± 0.6%) i/s   (65.01 μs/i) -    155.237k in  10.092376s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json    81.000 i/100ms
Calculating -------------------------------------
                json    818.347 (± 0.7%) i/s    (1.22 ms/i) -      8.181k in   9.997474s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   176.000 i/100ms
Calculating -------------------------------------
                json      1.766k (± 1.9%) i/s  (566.28 μs/i) -     17.776k in  10.070684s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.426k i/100ms
Calculating -------------------------------------
                json     23.958k (± 0.6%) i/s   (41.74 μs/i) -    240.174k in  10.025043s

Additionally, here is a screenshot of VisualVM showing the result of running the activitypub.json benchmark for 30 seconds.

[screenshot: VisualVM profiling results]

@headius (Contributor) commented Jul 28, 2025

@samyron This is interesting progress! I am looking forward to trying it myself now that I'm back in the office.

Yes, HotSpot can be a tricky beast to manipulate. We will want to look at some deeper logging of the JIT and inlining decisions to see whether everything that should be inlined is getting inlined. There are potentially other parts of json, unrelated to your changes, that also interfere with inlining (such as the double-dispatch logic to find an appropriate formatter for output text).

Have you tried running on a newer JDK? There are continuous improvements in this area.

@headius (Contributor) commented Jul 28, 2025

It's also possible that we are losing too much performance to excessive allocation. I'll try to do some profiling once I get your code up and running.

@headius (Contributor) commented Jul 28, 2025

Oh, BTW, I got one response to my email about your work, pointing me to a Java library that has already been attempting to use the vector API to speed up json processing. It may provide some interesting pointers: https://github.com/simdjson/simdjson-java

@samyron (Contributor, Author) commented Jul 31, 2025

> Have you tried running on a newer JDK? There are continuous improvements in this area.

I have tried running the same benchmarks using JDK 24.

These benchmarks include some WIP changes that aren't reflected in this branch, but I have seen the highest peak performance on the activitypub.json benchmark using JDK 24. It isn't consistent, though: running the benchmarks back-to-back, I see a significant drop in the activitypub.json numbers. This could be because I'm running on a passively cooled MacBook Air M1, but I find it strange that it only seems to affect the activitypub.json benchmark. It's possible that's just because it's the first benchmark run. I need to do more testing.

Edit: Some quick testing shows that changing the order of the benchmarks does change the results a bit. I moved the citm_catalog.json benchmark to run first, and the first two benchmarks shifted somewhat.

Run 1

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.956k i/100ms
Calculating -------------------------------------
                json     19.496k (± 0.6%) i/s   (51.29 μs/i) -    391.200k in  20.065956s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    84.000 i/100ms
Calculating -------------------------------------
                json    842.971 (± 0.8%) i/s    (1.19 ms/i) -     16.884k in  20.030700s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   184.000 i/100ms
Calculating -------------------------------------
                json      1.810k (± 5.2%) i/s  (552.36 μs/i) -     36.064k in  19.999143s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.441k i/100ms
Calculating -------------------------------------
                json     24.268k (± 1.0%) i/s   (41.21 μs/i) -    485.759k in  20.018208s

Run 2

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.394k i/100ms
Calculating -------------------------------------
                json     13.893k (± 3.1%) i/s   (71.98 μs/i) -    277.406k in  19.993002s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    83.000 i/100ms
Calculating -------------------------------------
                json    844.179 (± 1.1%) i/s    (1.18 ms/i) -     16.932k in  20.059934s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   184.000 i/100ms
Calculating -------------------------------------
                json      1.832k (± 5.7%) i/s  (545.97 μs/i) -     36.432k in  20.014787s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.467k i/100ms
Calculating -------------------------------------
                json     24.605k (± 0.7%) i/s   (40.64 μs/i) -    493.400k in  20.053374s

@headius (Contributor) commented Aug 2, 2025

Those numbers are quite a bit better, albeit unpredictable! Order sensitivity likely indicates that some polymorphism is interfering with optimization, breaking inlining and falling back on slower, less-optimized calls. That's the sort of JIT logging I'm hoping to dig into soon.

Once I can focus on this, there are some OpenJDK folks interested in seeing the results and helping us tune things.

@samyron (Contributor, Author) commented Aug 8, 2025

After reading How we made JSON.stringify more than twice as fast, and with the VisualVM results fresh in my mind, I figured we could try segmenting the output buffer into chunks to completely avoid the ensureBuffer calls in ByteListDirectOutputStream.

I created two very similar OutputStream classes: SegmentedByteListDirectOutputStream and Segmented2ByteListDirectOutputStream. The former manages a linked list of Segments containing byte[] buffers; the latter holds a two-dimensional byte[][] array. Growing the capacity by powers of 2 limits the number of System.arraycopy calls needed when toByteListDirect is called.
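A minimal sketch of the segmentation idea (illustrative names, not the PR's actual classes): segments are appended rather than reallocated, segment sizes grow by powers of two, and the bytes are stitched together only once at the end:

```java
import java.io.OutputStream;

// Sketch of a segmented growable buffer: instead of copying one big byte[]
// on every growth, append fixed segments whose sizes double, then stitch
// them together with one arraycopy per segment at the end.
class SegmentedBuffer extends OutputStream {
    private byte[][] segments = new byte[16][];
    private int segCount = 0;
    private byte[] current;
    private int pos;            // write position within the current segment
    private int total;          // total bytes written across all segments
    private int nextSize = 256; // first segment size; doubles each time

    SegmentedBuffer() { addSegment(); }

    private void addSegment() {
        if (segCount == segments.length) {
            byte[][] bigger = new byte[segments.length * 2][];
            System.arraycopy(segments, 0, bigger, 0, segCount);
            segments = bigger;
        }
        current = new byte[nextSize];
        segments[segCount++] = current;
        pos = 0;
        nextSize <<= 1; // powers-of-two growth keeps the segment count logarithmic
    }

    @Override public void write(int b) {
        if (pos == current.length) addSegment();
        current[pos++] = (byte) b;
        total++;
    }

    @Override public void write(byte[] src, int off, int len) {
        while (len > 0) {
            if (pos == current.length) addSegment();
            int n = Math.min(len, current.length - pos);
            System.arraycopy(src, off, current, pos, n);
            pos += n; off += n; len -= n; total += n;
        }
    }

    byte[] toByteArray() {
        byte[] out = new byte[total];
        int o = 0;
        for (int i = 0; i < segCount; i++) {
            // All segments except the last are full.
            int n = (i == segCount - 1) ? pos : segments[i].length;
            System.arraycopy(segments[i], 0, out, o, n);
            o += n;
        }
        return out;
    }
}
```

No bytes are ever copied during writing; the only copies happen in the final stitch, which is what eliminates the repeated ensureBuffer-style reallocation.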

Here is a screenshot of profiling results using Segmented2ByteListDirectOutputStream from VisualVM:

[screenshot: VisualVM profiling results using Segmented2ByteListDirectOutputStream]

The benchmarks are also much more consistent between invocations.

First run

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.887k i/100ms
Calculating -------------------------------------
                json     18.863k (± 1.8%) i/s   (53.02 μs/i) -    377.400k in  20.015082s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    86.000 i/100ms
Calculating -------------------------------------
                json    849.623 (± 5.1%) i/s    (1.18 ms/i) -     16.942k in  20.024229s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   179.000 i/100ms
Calculating -------------------------------------
                json      1.792k (± 4.5%) i/s  (558.03 μs/i) -     35.800k in  20.042557s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.457k i/100ms
Calculating -------------------------------------
                json     24.571k (± 1.7%) i/s   (40.70 μs/i) -    491.400k in  20.006269s

Second run immediately after the first

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.873k i/100ms
Calculating -------------------------------------
                json     18.933k (± 1.0%) i/s   (52.82 μs/i) -    380.219k in  20.084278s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    83.000 i/100ms
Calculating -------------------------------------
                json    853.193 (± 1.1%) i/s    (1.17 ms/i) -     17.098k in  20.042329s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   181.000 i/100ms
Calculating -------------------------------------
                json      1.809k (± 2.5%) i/s  (552.82 μs/i) -     36.200k in  20.027645s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.442k i/100ms
Calculating -------------------------------------
                json     24.572k (± 0.7%) i/s   (40.70 μs/i) -    493.284k in  20.075724s

Note: The C extension's FBuffer implementation may benefit from segmentation too.

@samyron (Contributor, Author) commented Aug 11, 2025

I added a SWAR implementation of the basic escape scanning, used when the Vector API-based implementation is disabled. Performance is quite good.
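The SWAR approach packs eight bytes into a long and uses the classic haszero/hasless bit tricks to test all eight at once, with no special modules or intrinsics. A minimal sketch of the idea (illustrative, not the PR's actual implementation):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// SWAR (SIMD Within A Register) sketch: test 8 bytes per iteration for the
// three JSON escape conditions using only plain long arithmetic.
class SwarScan {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH = 0x8080808080808080L;

    // High bit set in each byte of `word` equal to `c` (haszero on word ^ c).
    private static long eqMask(long word, byte c) {
        long x = word ^ (ONES * (c & 0xFFL));
        return (x - ONES) & ~x & HIGH;
    }

    // High bit set in each byte with unsigned value < 0x20 (hasless trick;
    // bytes >= 0x80, i.e. UTF-8 continuation bytes, are correctly not flagged).
    private static long ltSpaceMask(long word) {
        return (word - (ONES * 0x20L)) & ~word & HIGH;
    }

    // Index of the first byte that needs a JSON escape, or -1 if none.
    static int firstEscapeIndex(byte[] buf) {
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        int i = 0;
        for (; i + 8 <= buf.length; i += 8) {
            long w = bb.getLong(i);
            long m = ltSpaceMask(w) | eqMask(w, (byte) '"') | eqMask(w, (byte) '\\');
            // Little-endian packing: lowest set bit maps to the lowest index.
            if (m != 0) return i + (Long.numberOfTrailingZeros(m) >>> 3);
        }
        for (; i < buf.length; i++) { // scalar tail
            int b = buf[i] & 0xFF;
            if (b < 0x20 || b == '"' || b == '\\') return i;
        }
        return -1;
    }
}
```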

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.523k i/100ms
Calculating -------------------------------------
                json     15.252k (± 1.3%) i/s   (65.57 μs/i) -    306.123k in  20.075048s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    77.000 i/100ms
Calculating -------------------------------------
                json    767.053 (± 4.0%) i/s    (1.30 ms/i) -     15.323k in  20.014496s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   169.000 i/100ms
Calculating -------------------------------------
                json      1.710k (± 1.1%) i/s  (584.71 μs/i) -     34.307k in  20.061924s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.340k i/100ms
Calculating -------------------------------------
                json     23.132k (± 4.9%) i/s   (43.23 μs/i) -    460.980k in  20.003270s

@headius (Contributor) commented Aug 11, 2025

> segmenting the output buffer into chunks

What, you mean my super-naïve implementation was not efficient? 😆

This is an excellent improvement! Could you move the stream improvements to a separate PR so it doesn't get tied up with the Vector work? I'd expect we can merge it immediately!

It's on my list to revisit this work this week, now that I'm back from holiday.

I'd also like to feature this work as an example of JRuby's potential on newer JVMs in my upcoming conference talks.

@samyron (Contributor, Author) commented Aug 13, 2025

> Could you move the stream improvements to a separate PR so it doesn't get tied up with the Vector work? I'd expect we can merge it immediately!

@headius Unfortunately, the segmented output stream by itself didn't have much (or really any) performance impact. The bottleneck is the StringEncoder#encode method, at least with the data I've been benchmarking. However, when I optimize StringEncoder#encode to call a fast-path encodeBasic or the SWAR-based encodeBasicSWAR, it does have a performance impact.

See #835.

@headius (Contributor) commented Aug 14, 2025

@samyron Perhaps the segmented version would show more impact with larger output? We may be chasing our tails here, though... ideally, if you are generating tens of megabytes of JSON you're streaming it somewhere, not buffering it. The ByteList form exists only to fulfill the API returning a String when no output stream is provided.

The SWAR results are excellent though!
