
Commit 366a2dd (parent 8c28ded)
update default methods blog post


blog/_posts/2016-07-08-trait-method-performance.md

Lines changed: 110 additions & 12 deletions
@@ -320,14 +320,14 @@ callsite is megamorphic. We measure how much time an invocation of one default m
 Adding an overriding forwarder method to the subclasses does not change this result: the slowdown
 per additional interface remains. So this seems not to be the reason for the performance regression.
 
-### Back to CHA
+### Back to default methods
 
 Googling a little bit more about the performance of default methods, I found a relevant
 [post on SO](http://stackoverflow.com/questions/30312096/java-default-methods-is-slower-than-the-same-code-but-in-an-abstract-class)
 containing a nice benchmark.
 
 I simplified the example into the
-[benchmark `NoCHAPreventsOptimization`](https://github.com/lrytz/benchmarks/blob/master/src/main/java/traitEncodings/NoCHAPreventsOptimization.java),
+[benchmark `DefaultMethodPreventsOptimization`](https://github.com/lrytz/benchmarks/blob/master/src/main/java/traitEncodings/DefaultMethodPreventsOptimization.java),
 which is relatively small:
 
     interface I {
@@ -348,6 +348,11 @@ which is relatively small:
 The benchmark shows that `c.v = x; c.accessDefault()` is 3x slower than
 `c.v = x; c.accessVirtual()` or `c.v = x; c.accessForward()`.
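The shape of the three access paths can be sketched as follows. This is a hypothetical reconstruction: only the names `accessDefault`, `accessVirtual`, `accessForward` and the field `v` come from the post (the interface body is elided by the diff hunk above), the rest is an illustrative guess; the real JMH benchmark lives in the linked repository.

```java
// Hypothetical reconstruction of the benchmark's three access paths.
interface I {
    int getV();
    // Field access through an interface default method.
    default int accessDefault() { return getV(); }
}

class C implements I {
    public int v;

    @Override public int getV() { return v; }

    // Field access through a plain virtual method on the class.
    public int accessVirtual() { return v; }

    // Forwarder in the class that delegates to the default method
    // with an interface super call.
    public int accessForward() { return I.super.accessDefault(); }

    public static void main(String[] args) {
        C c = new C();
        int r = 0;
        for (int x = 0; x < 1000; x++) {
            c.v = x;
            r += c.accessDefault(); // the variant measured as 3x slower
        }
        System.out.println(r); // same result for all three variants
    }
}
```

All three methods compute the same value; only the dispatch path differs, which is exactly what the benchmark isolates.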
 
+#### A look at the assembly
+
+(This section was revised later; thanks to [Paolo Giarrusso](https://twitter.com/blaisorblade) for
+his feedback!)
+
 As noted in comments on the StackOverflow thread, everything is inlined in all three benchmarks,
 so the difference is not due to inlining. We can observe that the assembly generated for the
 `accessDefault` case is less optimized than in the other cases. Here is the output of JMH's
@@ -360,16 +365,96 @@ so the difference is not due to inlining. We can observe that the assembly gener
 In fact, the assembly for the `accessVirtual` and `accessForward` cases is identical.
 
 One answer on the SO thread suggests that lack of CHA for the default method case prevents
-eliminating a type guard, which in turn prevents optimizations on the field write and read.
-Somebody with more experience in assembly code than me could certainly verify that.
+eliminating a type guard, which in turn prevents optimizations on the field write and read. A later
+comment points out that this does not seem to be the case.
 
-I did not do any further research to find out what kind of optimizations depend on CHA, or if it is
-really the lack of CHA that causes the code not to be optimized properly. For my convenience, let's
-say that's beyond the scope of this post. If you have any insights or references on this topic,
-please get in touch or post a comment!
+The benchmark basically measures the following loop:
+
+    int r = 0;
+    for (int x = 0; x < N; x++) {
+      c.v = x;
+      r += c.v; // field access either through a default or a virtual method
+    }
 
-It seems that the lack of certain optimizations for default methods is the most likely source for
-the slowdowns we notice when running the Scala compiler.
+Comparing the assembly code of the loop when using the default method or the virtual method, Paolo
+identified one interesting difference. When using the default method, the loop body consists of the
+following instructions:
+
+    mov %edi,0x10(%rax)   ;*putfield v
+    add 0x10(%r13),%edx   ;*iadd
+    inc %edi              ;*iinc
+
+By default the JVM outputs the AT&T assembly syntax, so instructions have the form
+`mnemonic src dst`. The `%edi` register contains the loop counter `x`, `%edx` contains the result
+`r`.
+
+- The first instruction writes `x` into the field `c.v`: `%rax` contains the address of object
+`c`, the field `v` is located at offset `0x10`.
+- The second instruction reads the field `c.v` and adds the value to `r`. Note that this time,
+register `%r13` is used to access object `c`. There are two registers that contain the same
+object address, but the JIT compiler does not seem to know this fact.
+- The last line increments the loop counter.
+
+Comparing that to the loop body assembly when using a virtual method to access the field:
+
+    mov %r8d,0x10(%r10)   ;*putfield v
+    add %r8d,%edx         ;*iadd
+    inc %r8d              ;*iinc
+
+Here, `%r8d` contains the loop counter `x`. Note that there is only one memory access: the JIT
+compiler identified that the memory read accesses the same location that was just written, so it
+uses the register already containing the value.
+
+The full assembly code is actually a lot larger than just a loop around the three instructions shown
+above. For one, there is infrastructure code added by JMH to measure times and to make sure values
+are consumed. But the main reason is loop unrolling. The faster assembly (when using the virtual
+method) contains the following loop:
+
+    ↗ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ add %r8d,%edx
+    │ mov %r8d,%r11d
+    │ add $0xf,%r11d
+    │ mov %r11d,0x10(%r10) ;*putfield v
+    │ add $0x78,%edx       ;*iadd
+    │ add $0x10,%r8d       ;*iinc
+    │ cmp $0x3d9,%r8d
+    ╰ jl <loop-start>
+
+This code does the following:
+
+- The register `%r8d` still contains the loop counter `x`, so the loop adds `x` to `r` (`%edx`)
+16 times (without increasing `x` in between).
+- Then it stores the value of `x` into `%r11d`, adds the constant `0xf` (decimal 15) to make up
+for the folded iterations, and stores that value into the field `c.v`.
+- The constant `0x78` (decimal 120) is added to the result `r` to make up for the fact that the
+loop counter was not increased (0 + 1 + 2 + ... + 15 = 120).
+- The loop counter is increased by the constant `0x10` (decimal 16), corresponding to the 16
+unrolled iterations.
+- The loop counter is compared against `0x3d9` (decimal 985): if it is smaller, another round
+of the unrolled loop can be executed (the loop ends at 1000). Otherwise execution continues in
+a different location that performs single loop iterations.
+
+The interesting observation here is that the field `c.v` is only written once per 16 iterations.
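The constant bookkeeping can be checked with a small Java sketch (my own illustration, not from the post) that performs the same arithmetic as the unrolled assembly and compares it against the straightforward loop:

```java
class UnrollDemo {
    static int v; // stands in for the field c.v

    // The loop as the benchmark writes it: write the field, read it back.
    static int simpleLoop() {
        int r = 0;
        for (int x = 0; x < 1000; x++) {
            v = x;
            r += v;
        }
        return r;
    }

    // The same computation the way the unrolled assembly performs it, using
    // the constants seen in the disassembly: 0xf (15), 0x78 (120), 0x10 (16).
    static int unrolledLoop() {
        int r = 0;
        int x = 0;
        while (x < 0x3d9) {              // 985: room for a full 16-step round
            for (int i = 0; i < 16; i++) r += x;
            v = x + 0xf;                 // single field write per 16 iterations
            r += 0x78;                   // 0 + 1 + ... + 15 = 120
            x += 0x10;                   // advance the counter by 16
        }
        for (; x < 1000; x++) {          // single-step tail iterations
            v = x;
            r += v;
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(simpleLoop() == unrolledLoop()); // prints "true"
    }
}
```

This only mirrors the arithmetic of the JIT's transformation; the actual machine code is, of course, what the disassembly above shows.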
+
+The slower assembly (when using the default method) also contains an unrolled loop, but the memory
+location `c.v` is written **and** read in every iteration (instead of only written in every 16th).
+Again, the problem seems to be that the JIT compiler does not know that two registers contain the
+same memory address for object `c`. The unrolled loop also uses a lot of registers; it even seems to
+make use of SSE registers (`%xmm1`, ...) as 32-bit registers.
 
 ## Summary
 
@@ -386,14 +471,27 @@ We found a few interesting behaviors of the JVM optimizer.
 based on type profiling. This means that a default method at a megamorphic callsite is never
 inlined, even if the method does not have any overrides.
 
+- The JIT does not combine the knowledge of type profiles and CHA. Assume a type profile shows that
+a certain callsite has 3 receiver types at run-time, so it is megamorphic. Also assume that there
+exist multiple versions (overrides / implementations) of the selected method, but CHA shows that
+method resolution for the 3 types in question always yields the same implementation. In principle
+the method could be inlined in this case, but this is not currently implemented.
+
+- Interface method lookup slows down by the number of interfaces a class implements.
+
 - The JVM fails to perform certain optimizations when default methods are used. We could show in a
 benchmark that moving a method from a parent class into a parent interface can degrade performance
 significantly. Adding an override to a subclass which invokes the default method using a `super`
 call restores the performance.
 
-We did not investigate what are the underlying reasons that cause the slowdown in this example.
+The assembly code reveals that the JVM fails to eliminate memory accesses when using default
+methods.
 
-- Interface method lookup slows down by the number of interfaces a class implements.
+While we can reproduce certain slowdowns when using default methods in micro-benchmarks, this does
+not explain why we observe a 20% performance regression when running the Scala compiler on
+default methods without forwarders. The fact that the JIT compiler fails to perform certain
+optimizations may be the reason, but we don't have any evidence or proof to relate the two
+observations.
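The forwarder workaround described in the summary can be illustrated with a minimal sketch (hypothetical names, not from the post or the benchmark):

```java
// Hypothetical sketch of the forwarder pattern: the subclass overrides the
// default method and delegates back to it with an interface super call, so
// callsites resolve to a method defined in the class rather than dispatching
// to the interface default.
interface Named {
    String name();
    default String greeting() { return "Hello, " + name(); }
}

class User implements Named {
    @Override public String name() { return "Ada"; }

    // Forwarder: same behavior, different dispatch path.
    @Override public String greeting() { return Named.super.greeting(); }
}
```

This is the shape of the forwarders that the Scala compiler can emit for trait methods, and whose presence restores performance in the benchmarks above.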
 
 ## References
 