@@ -320,14 +320,14 @@ callsite is megamorphic. We measure how much time an invocation of one default m
  Adding an overriding forwarder method to the subclasses does not change this result, the slowdown
  per additional interface remains. So this seems not to be the reason for the performance regression.

- ### Back to CHA
+ ### Back to default methods

  Googling a little bit more about the performance of default methods, I found a relevant
  [post on SO](http://stackoverflow.com/questions/30312096/java-default-methods-is-slower-than-the-same-code-but-in-an-abstract-class)
  containing a nice benchmark.

  I simplified the example into the
- [benchmark `NoCHAPreventsOptimization`](https://github.com/lrytz/benchmarks/blob/master/src/main/java/traitEncodings/NoCHAPreventsOptimization.java),
+ [benchmark `DefaultMethodPreventsOptimization`](https://github.com/lrytz/benchmarks/blob/master/src/main/java/traitEncodings/DefaultMethodPreventsOptimization.java),
  which is relatively small:

      interface I {
@@ -348,6 +348,11 @@ which is relatively small:
  The benchmark shows that `c.v = x; c.accessDefault()` is 3x slower than
  `c.v = x; c.accessVirtual()` or `c.v = x; c.accessForward()`.

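The three access paths compared here can be sketched roughly as follows. This is a hypothetical reconstruction, not the benchmark's actual source: only the names `I`, `c`, `v`, `accessDefault`, `accessVirtual` and `accessForward` appear in the text, while `getV`, the classes `A` and `C`, and the `sum*` helpers are assumptions made for illustration.

```java
// Hypothetical sketch of the three access paths; not the original benchmark code.
final class AccessPathsSketch {
    interface I {
        int getV();

        // Field access through a default method (the slow case in the measurements).
        default int accessDefault() { return getV(); }
    }

    static abstract class A implements I {
        // Field access through an ordinary virtual method declared in a class.
        public int accessVirtual() { return getV(); }

        // Forwarder: a class method that delegates to the default method via a super call.
        public int accessForward() { return I.super.accessDefault(); }
    }

    static final class C extends A {
        public int v;

        @Override
        public int getV() { return v; }
    }

    // Each helper runs the measured loop shape: write the field, then read it back.
    static int sumDefault(C c, int n) {
        int r = 0;
        for (int x = 0; x < n; x++) { c.v = x; r += c.accessDefault(); }
        return r;
    }

    static int sumVirtual(C c, int n) {
        int r = 0;
        for (int x = 0; x < n; x++) { c.v = x; r += c.accessVirtual(); }
        return r;
    }

    static int sumForward(C c, int n) {
        int r = 0;
        for (int x = 0; x < n; x++) { c.v = x; r += c.accessForward(); }
        return r;
    }

    public static void main(String[] args) {
        C c = new C();
        // All three paths compute the same sum; only the generated machine code differs.
        System.out.println(sumDefault(c, 1000));
        System.out.println(sumVirtual(c, 1000));
        System.out.println(sumForward(c, 1000));
    }
}
```

The sketch only shows that the three variants are semantically identical. Timing them meaningfully requires a harness such as JMH, as in the linked benchmark.
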
+ #### A look at the assembly
+
+ (This section was revised later, thanks to [Paolo Giarrusso](https://twitter.com/blaisorblade) for
+ his feedback!)
+
  As noted in comments on the StackOverflow thread, everything is inlined in all three benchmarks,
  so the difference is not due to inlining. We can observe that the assembly generated for the
  `accessDefault` case is less optimized than in the other cases. Here is the output of JMH's
@@ -360,16 +365,96 @@ so the difference is not due to inlining. We can observe that the assembly gener
  In fact, the assembly for the `accessVirtual` and `accessForward` cases is identical.

  One answer on the SO thread suggests that lack of CHA for the default method case prevents
- eliminating a type guard, which in turn prevents optimizations on the field write and read.
- Somebody with more experience in assembly code than me could certainly verify that.
+ eliminating a type guard, which in turn prevents optimizations on the field write and read. A later
+ comment points out that this does not seem to be the case.

- I did not do any further research to find out what kind of optimizations depend on CHA, or if it is
- really the lack of CHA that causes the code not to be optimized properly. For my convenience, let's
- say that's beyond the scope of this post. If you have any insights or references on this topic,
- please get in touch or post a comment!
+ The benchmark basically measures the following loop:
+
+     int r = 0;
+     for (int x = 0; x < N; x++) {
+         c.v = x;
+         r += c.v; // field access either through a default or a virtual method
+     }

- It seems that the lack of certain optimizations for default methods is the most likely source for
- the slowdowns we notice when running the Scala compiler.
+ Comparing the assembly code of the loop when using the default method or the virtual method, Paolo
+ identified one interesting difference. When using the default method, the loop body consists of the
+ following instructions:
+
+     mov %edi,0x10(%rax) ;*putfield v
+     add 0x10(%r13),%edx ;*iadd
+     inc %edi ;*iinc
+
+ By default the JVM outputs the AT&T assembly syntax, so instructions have the form
+ `mnemonic src, dst`. The `%edi` register contains the loop counter `x`, `%edx` contains the result
+ `r`.
+
+ - The first instruction writes `x` into the field `c.v`: `%rax` contains the address of object
+   `c`, the field `v` is located at offset `0x10`.
+ - The second instruction reads the field `c.v` and adds the value to `r`. Note that this time,
+   register `%r13` is used to access object `c`. There are two registers that contain the same
+   object address, but the JIT compiler does not seem to know this fact.
+ - The last line increases the loop counter.
+
+ Comparing that to the loop body assembly when using a virtual method to access the field:
+
+     mov %r8d,0x10(%r10) ;*putfield v
+     add %r8d,%edx ;*iadd
+     inc %r8d ;*iinc
+
+ Here, `%r8d` contains the loop counter `x`. Note that there is only one memory access: the JIT
+ compiler identified that the memory read accesses the same location that was just written, so it
+ uses the register already containing the value.
+
+ The full assembly code is actually a lot larger than just a loop around the three instructions shown
+ above. For one, there is infrastructure code added by JMH to measure times and to make sure values
+ are consumed. But the main reason is loop unrolling. The faster assembly (when using the virtual
+ method) contains the following loop:
412
+
413
+ ↗ add %r8d,%edx
414
+ │ add %r8d,%edx
415
+ │ add %r8d,%edx
416
+ │ add %r8d,%edx
417
+ │ add %r8d,%edx
418
+ │ add %r8d,%edx
419
+ │ add %r8d,%edx
420
+ │ add %r8d,%edx
421
+ │ add %r8d,%edx
422
+ │ add %r8d,%edx
423
+ │ add %r8d,%edx
424
+ │ add %r8d,%edx
425
+ │ add %r8d,%edx
426
+ │ add %r8d,%edx
427
+ │ add %r8d,%edx
428
+ │ add %r8d,%edx
429
+ │ mov %r8d,%r11d
430
+ │ add $0xf,%r11d
431
+ │ mov %r11d,0x10(%r10) ;*putfield v
432
+ │ add $0x78,%edx ;*iadd
433
+ │ add $0x10,%r8d ;*iinc
434
+ │ cmp $0x3d9,%r8d
435
+ ╰ jl <loop-start>
+
+ This code does the following:
+
+ - The register `%r8d` still contains the loop counter `x`, so the loop adds `x` to `r` (`%edx`)
+   16 times (without increasing `x` in between).
+ - Then it stores the value of `x` into `%r11d`, adds the constant `0xf` (decimal 15) to make up
+   for the folded iterations and stores that value into the field `c.v`.
+ - The constant `0x78` (decimal 120) is added to the result `r` to make up for the fact that the
+   loop counter was not increased (0 + 1 + 2 + ... + 15 = 120).
+ - The loop counter is increased by the constant `0x10` (decimal 16), corresponding to the 16
+   unfolded iterations.
+ - The loop counter is compared against `0x3d9` (decimal 985): if it is smaller, another round
+   of the unfolded loop can be executed (the loop ends at 1000). Otherwise execution continues in
+   a different location that performs single loop iterations.
+
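The arithmetic in the list above can be checked with a source-level analogue of the unrolled loop. This is only a sketch of what the listed instructions compute, not what the JIT literally emits; the names `Box`, `simple` and `unrolled` are made up for illustration, and the bound `n - 15` is an assumption chosen so that it equals 985 (`0x3d9`) for `n = 1000`.

```java
// Source-level analogue of the unrolled loop; Box, simple and unrolled are illustrative names.
final class UnrolledLoopSketch {
    static final class Box { int v; }

    // The loop as written in the benchmark: one field write and one read per iteration.
    static int simple(Box c, int n) {
        int r = 0;
        for (int x = 0; x < n; x++) {
            c.v = x;
            r += c.v;
        }
        return r;
    }

    // What the faster assembly computes: 16 iterations folded into one round.
    static int unrolled(Box c, int n) {
        int r = 0;
        int x = 0;
        for (; x < n - 15; x += 16) {   // for n == 1000 this is the `cmp $0x3d9` bound
            r += 16 * x;                // the sixteen `add %r8d,%edx` instructions
            c.v = x + 15;               // the single field write per round
            r += 120;                   // 0 + 1 + ... + 15, the `add $0x78,%edx`
        }
        for (; x < n; x++) {            // remaining single iterations
            c.v = x;
            r += c.v;
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(simple(new Box(), 1000));
        System.out.println(unrolled(new Box(), 1000));
    }
}
```

Both methods return the same sum and leave the same final value in `v`. The folded version touches memory once per 16 iterations, where the plain loop does so in every iteration.
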
+ The interesting observation here is that the field `c.v` is only written once per 16 iterations.
+
+ The slower assembly (when using the default method) also contains an unfolded loop, but the memory
+ location `c.v` is written **and** read in every iteration (instead of only written in every 16th).
+ Again, the problem seems to be that the JIT compiler does not know that two registers contain the
+ same memory address for object `c`. The unfolded loop also uses a lot of registers; it even seems to
+ make use of SSE registers (`%xmm1`, ...) as 32 bit registers.

  ## Summary

@@ -386,14 +471,27 @@ We found a few interesting behaviors of the JVM optimizer.
    based on type profiling. This means that a default method at a megamorphic callsite is never
    inlined, even if the method does not have any overrides.

+ - The JIT does not combine the knowledge of type profiles and CHA. Assume a type profile shows that
+   a certain callsite has 3 receiver types at run-time, so it is megamorphic. Also assume that there
+   exist multiple versions (overrides / implementations) of the selected method, but CHA shows that
+   method resolution for the 3 types in question always yields the same implementation. In principle
+   the method could be inlined in this case, but this is not currently implemented.
+
+ - Interface method lookup slows down with the number of interfaces a class implements.
+
  - The JVM fails to perform certain optimizations when default methods are used. We could show in a
    benchmark that moving a method from a parent class into a parent interface can degrade performance
    significantly. Adding an override to a subclass which invokes the default method using a `super`
    call restores the performance.

-   We did not investigate what are the underlying reasons that cause the slowdown in this example.
+   The assembly code reveals that the JVM fails to eliminate memory accesses when using the default
+   methods.

- - Interface method lookup slows down by the number of interfaces a class implements.
+   While we can reproduce certain slowdowns when using default methods in micro-benchmarks, this does
+   not give an answer why we observe a 20% performance regression when running the Scala compiler on
+   default methods without forwarders. The fact that the JIT compiler fails to perform certain
+   optimizations may be the reason, but we don't have any evidence or proof to relate the two
+   observations.
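
The situation described in the bullet about type profiles and CHA can be made concrete with a small class hierarchy. This is a hypothetical sketch: the names `Base`, `A`, `B`, `C`, `D`, `size` and `total` are invented for illustration, and the code does not verify what the JIT does with the callsite. It only shows the shape of the problem.

```java
// Hypothetical hierarchy: three receiver types share one implementation, a fourth overrides it.
final class SharedImplementationSketch {
    static abstract class Base {
        int size() { return 1; }            // implementation shared by A, B and C
    }

    static final class A extends Base {}
    static final class B extends Base {}
    static final class C extends Base {}

    static final class D extends Base {
        @Override
        int size() { return 2; }            // second implementation, so size() has overrides
    }

    static int total(Base[] receivers) {
        int sum = 0;
        for (Base b : receivers) {
            sum += b.size();                // callsite that only ever sees A, B and C here
        }
        return sum;
    }

    public static void main(String[] args) {
        Base other = new D();               // D is loaded but never reaches the callsite in total
        Base[] receivers = { new A(), new B(), new C() };
        System.out.println(total(receivers) + other.size());
    }
}
```

In `total`, a type profile would record three receiver classes, while resolving `size()` for each of them leads to the single implementation in `Base`. According to the observation above, this is the combination the JIT currently does not exploit.
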

  ## References
