diff --git a/.bundle/config b/.bundle/config
index 50b79383c..7d5a37ef3 100644
--- a/.bundle/config
+++ b/.bundle/config
@@ -1,3 +1,3 @@
 ---
 BUNDLE_PATH: bundle-vendor/bundle
-BUNDLE_DISABLE_SHARED_GEMS: '1'
+BUNDLE_DISABLE_SHARED_GEMS: true
diff --git a/blog/_posts/2016-07-04-trait-method-performance.md b/blog/_posts/2016-07-04-trait-method-performance.md
new file mode 100644
index 000000000..1137ba513
--- /dev/null
+++ b/blog/_posts/2016-07-04-trait-method-performance.md
@@ -0,0 +1,401 @@
+---
+layout: blog
+post-type: blog
+by: Lukas Rytz
+title: Performance of trait methods
+---
+
+# Performance of using default methods to compile Scala trait methods
+
+In Scala 2.12, bodies of methods defined in traits will be compiled to default methods in the
+interface classfile. In short, we have the following bytecode formats for concrete trait methods:
+
+  - 2.11.x: trait method bodies are in static methods in the trait's `T$impl` class. Classes
+    extending a trait get a virtual method that implements the abstract method in the interface and
+    forwards to the static implementation method.
+  - 2.12.0-M4: trait method bodies are in (non-static) interface default methods, and subclasses
+    get a virtual method (overriding the default method) that forwards to that default method using
+    `invokespecial` (a `super` call).
+  - [33e7106](https://github.com/scala/scala/commit/33e7106): in most cases, forwarders are no
+    longer generated in subclasses, as they are not needed: the JVM resolves the correct method.
+    Concrete trait methods are invoked either using `invokeinterface` (if the static receiver type
+    is the trait) or `invokevirtual` (if the static receiver type is the subclass).
+  - 2.12.0-M5: trait method bodies are emitted in static methods in the interface classfile. The
+    default methods forward to the static methods.
+
+Recently we observed that 33e7106 causes a 20% slowdown of the Scala compiler (tested by compiling
+the sources of [better-files](https://github.com/pathikrit/better-files)). (Since we are still
+lacking a proper performance regression testing infrastructure, the slowdown was only discovered
+later and was pinned down using git bisect.)
+
+First observation: the slowdown is not due to additional logic introduced by the patch, but due to
+the change in the bytecode of the compiler itself. This can be verified easily: a compiler built
+from revision 33e7106 using its parent (b932419) as STARR shows no slowdown. Building it with
+itself as STARR, the resulting compiler runs slower.
+
+This means that any Scala application using concrete trait methods is likely to be affected by
+this problem.
+
+This post logs our attempts to find the root cause of the slowdown.
+
+## Some details on the HotSpot compiler
+
+This section explains some details of the HotSpot optimizer. It assembles information from various
+sources and our own observations; it might contain mistakes and misunderstandings. It is certainly
+simplified and incomplete. More details are available in the linked resources.
+
+### JITing
+
+My recommended reference for this first section is the talk "JVM Mechanics" by Doug Hawkins
+([video](https://www.youtube.com/watch?v=E9i9NJeXGmM),
+[slides](http://www.slideshare.net/dougqh/jvm-mechanics-when-does-the)).
+
+First of all, JVM 8 uses two JIT compilers: C1 and C2. C1 is fast but performs only basic
+optimizations; in particular, it does not perform speculative optimizations based on profiling
+(frequency of branches, type profiles at callsites). C2 is profile-guided and speculative, but
+slower.
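+
+As a concrete way to watch the two compilers in action, here is a minimal toy program (my own
+sketch, not taken from any of the linked resources). Running it with `-XX:+PrintCompilation` (see
+the list of flags at the end of this post) prints one line per compilation; on a JVM 8 with the
+default tiered compilation, a hot method typically shows up first at a C1 tier and later again at
+the C2 tier.
+
+    public class JitDemo {
+      // Small and called very often: compiled by C1 early in the run and,
+      // as its counters keep growing, eventually re-compiled by C2.
+      static int add(int a, int b) { return a + b; }
+
+      public static void main(String[] args) {
+        int sum = 0;
+        for (int i = 0; i < 1_000_000; i++)
+          sum = add(sum, i);
+        System.out.println(sum);
+      }
+    }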
+
+The JVM starts by interpreting the program. It only compiles methods that are either called often
+enough or that have long enough loops. There are two counters for each method:
+
+  - the number of times it is invoked, and
+  - the number of times a backwards branch is executed.
+
+The decision to compile a method is based on these counters. A simplified, typical scenario
+(ignoring backwards branches): after 2000 invocations a method gets compiled by C1, after 15000 it
+gets re-compiled by C2 (see [this answer on SO](http://stackoverflow.com/a/35614237/248998) for
+more details). Note that the C1-generated assembly is instrumented to update the two counters (and
+also to collect other profiling data that will be used by the C2 optimizer). After compiling a
+method, new invocations of the method will use the newly generated assembly.
+
+The above works well for a method that is invoked many times, but what happens to a long-running
+method that is invoked only once, but has a long loop? The decision to compile this method is taken
+when the counter of backwards branches passes a threshold. Once compilation is done, the JVM
+performs a so-called on-stack replacement (OSR): the stack frame of the running method is modified
+as necessary and execution continues using the new assembly.
+
+An OSR / loop compilation of a method is always tied to a specific loop: the entry point of the
+generated assembly is at the end of the loop (locations are referred to by their index in the JVM
+bytecode, called "bytecode index" / `bci`). If there are multiple hot loops within a method, the
+same method may get multiple OSR-compiled versions. More details on this can be found in
+[this post](https://gist.github.com/rednaxelafx/1165804#osr) by Krystal Mok, which explains the
+many details of the `-XX:+PrintCompilation` output.
+
+### Inlining
+
+For this section my reference is Aleksey Shipilёv's extensive post
+[The Black Magic of (Java) Method Dispatch](http://shipilev.net/blog/2015/black-magic-method-dispatch/).
+
+Inlining is fundamental because it acts as an enabler for most other optimizations. As Aleksey says
+in the conclusion: "inlining actually broadens the scope of other optimizations, and that alone is,
+in many cases, enough reason to inline".
+
+Both C1 and C2 perform inlining. The policy for whether to inline a method is non-trivial and uses
+several heuristics (implemented in
+[bytecodeInfo.cpp](http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/f22b5be95347/src/share/vm/opto/bytecodeInfo.cpp),
+methods `should_inline`, `should_not_inline` and `try_to_inline`). A simplified summary:
+
+  - Trivial methods (6 bytes by default, `MaxTrivialSize`) are always inlined.
+  - Methods up to 35 bytes (`MaxInlineSize`) are inlined if they were invoked at least 250 times
+    (`MinInliningThreshold`).
+  - Methods up to 325 bytes (`FreqInlineSize`) are inlined if the callsite is "hot" (or "frequent"),
+    which means it is invoked more than 20 times (no command-line flag in release versions) per one
+    invocation of the caller method.
+  - The inlining depth is limited (9 by default, `MaxInlineLevel`).
+  - No inlining is performed if the callee method is already very large. There are in fact multiple
+    measures of "large": the byte limits above apply to the size of the bytecode before inlining,
+    and for a callee that has already been compiled on its own, the size of its generated assembly
+    is also taken into account (`InlineSmallCode`). Neither measure is ideal: the bytecode size
+    includes unreachable code and gotos, while the assembly size is inflated by rarely executed
+    slow paths that often contain very inefficient code.
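+
+These decisions can be observed with `-XX:+PrintInlining` (a diagnostic flag, so it additionally
+requires `-XX:+UnlockDiagnosticVMOptions`; the flags are listed again at the end of this post). As
+a small, hypothetical example (method names and sizes invented for illustration):
+
+    public class InlineDemo {
+      // Well under MaxTrivialSize: always inlined.
+      static int twice(int x) { return x + x; }
+
+      // Compiles to well over 35 bytes of bytecode, so it is not inlined for
+      // its size alone; it is inlined because the callsite below is hot.
+      static int mix(int x) {
+        int r = x;
+        r = (r ^ (r << 13)) * 31 + 17;
+        r = (r ^ (r >>> 7)) * 31 + 17;
+        r = (r ^ (r << 3)) * 31 + 17;
+        r = (r ^ (r >>> 11)) * 31 + 17;
+        return r;
+      }
+
+      public static void main(String[] args) {
+        int sum = 0;
+        for (int i = 0; i < 1_000_000; i++)
+          sum += twice(i) + mix(i);
+        System.out.println(sum);
+      }
+    }
+
+Running `java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineDemo` prints each inlining
+decision together with a short reason.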
+
+The procedure is the same for C1 and C2; it uses the invocation counter that is also used for
+compilation decisions (previous section).
+
+### Inlining virtual methods
+
+In C1, a method can only be inlined if it can be statically resolved. This is the case for static
+and private methods and for constructors, but also for virtual methods that are never overridden.
+The JVM has full knowledge of the code of the program it is executing. If a method is virtual and
+could in principle be overridden in a subclass, but no such subclass has been loaded (so far), an
+invocation of the method can only resolve to that single definition.
+
+The process of analyzing the hierarchy of classes currently loaded in the VM is called "class
+hierarchy analysis" (CHA). Both C1 and C2 use the information computed by CHA to inline calls to
+virtual methods that are not overridden.
+
+When the JVM loads a new class, a virtual method that CHA found to be non-overridden may get an
+override. All assembly code that made use of the now invalid assumption is discarded; the next
+invocation of an affected method starts out in the bytecode interpreter again. This process is
+called deoptimization. If a thread is executing an affected method at that moment, the VM pauses
+it at a safepoint, reconstructs the interpreter state for the corresponding stack frame and
+continues execution in the interpreter. Discarded code that some thread would still return into is
+patched so that execution also falls back to the interpreter when control reaches it.
+
+In addition to using CHA, the C2 compiler performs speculative inlining of virtual methods based on
+the type profiles gathered by the interpreter and the C1-generated assembly. If the receiver type
+at a callsite is always the same (the callsite is "monomorphic"), the method is inlined. The
+generated assembly contains a type test to validate the assumption; if it fails, the method is
+deoptimized.
+
+C2 will also inline bi-morphic callsites: the code of both callees is inlined, and a type test is
+used to branch to the correct one (or to bail out). Finally, if the type profile shows a clear bias
+towards a specific receiver type (for example 90%), its method is inlined and virtual dispatch is
+used for the other cases (shown in Aleksey's post).
+
+If a callsite has three or more receiver types without a clear bias, C2 does not inline, and an
+ordinary method lookup is performed at runtime.
+
+Note that C2 performs speculative optimizations other than profile-based inlining, for example
+profile-based branch elimination.
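+
+To make the three kinds of callsites concrete, here is a small sketch (class names invented for
+illustration, along the lines of the examples in Aleksey's post):
+
+    abstract class Shape { abstract int area(); }
+    class Square extends Shape { int s = 2; int area() { return s * s; } }
+    class Rect extends Shape { int w = 2, h = 3; int area() { return w * h; } }
+    class Circle extends Shape { int r = 2; int area() { return 3 * r * r; } }
+
+    public class DispatchDemo {
+      static int sum(Shape[] shapes) {
+        int total = 0;
+        // If the profile only ever sees Square here, the callsite is monomorphic:
+        // area() is inlined behind a type test. With Square and Rect it is
+        // bimorphic: both bodies are inlined and a test branches to the right one.
+        // With all three types and no clear bias it is megamorphic: C2 gives up
+        // and emits a regular virtual dispatch.
+        for (Shape s : shapes) total += s.area();
+        return total;
+      }
+
+      public static void main(String[] args) {
+        Shape[] shapes = { new Square(), new Rect(), new Circle() };
+        System.out.println(sum(shapes));
+      }
+    }
+
+In this toy program the loop sees all three types, so the callsite ends up megamorphic; a run in
+which `shapes` only ever contains `Square` instances would give the monomorphic behavior.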
+
+Note also that speculative optimization is not the most important advantage of C2 over C1: C2's
+main strength is a good register allocator, while C1's is "fast and dirty".
+
+## Understanding the performance regression
+
+With the above knowledge at hand (I wish I had it when I started), we try to identify what causes
+the slowdown introduced by the elimination of forwarder methods.
+
+### Call performance
+
+As a first step, we measured the call performance of the various trait encodings.
+
+The first benchmark
+[`CallPerformance`](https://github.com/lrytz/benchmarks/blob/master/src/main/java/traitEncodings/CallPerformance.java)
+has roughly the following structure:
+
+    interface I {
+      default int addDefault(int a, int b) { return a + b; }
+
+      static int addStatic(int a, int b) { return a + b; }
+      default int addDefaultStatic(int a, int b) { return addStatic(a, b); }
+
+      default int addForwarded(int a, int b) { return a + b; }
+
+      int addInherited(int a, int b);
+
+      int addVirtual(int a, int b);
+    }
+
+    static abstract class A implements I {
+      public int addInherited(int a, int b) { return a + b; }
+    }
+
+    static class C1 extends A implements I {
+      public int addForwarded(int a, int b) { return I.super.addForwarded(a, b); }
+
+      public int addVirtual(int a, int b) { return a + b; }
+    }
+
+There are identical copies of `C1` (`C2`, ...). The example encodes the following formats (we
+don't test the 2.11.x format):
+
+  - `addDefault` for 33e7106
+  - `addDefaultStatic` for 2.12.0-M5
+  - `addForwarded` for 2.12.0-M4
+
+The methods `addInherited` and `addVirtual` don't represent trait method encodings; they are for
+comparison. We test all encodings both at a monomorphic callsite (the receiver is always `C1`) and
+at a polymorphic one.
+
+#### Monomorphic case
+
+In the monomorphic case all trait encodings are inlined and perform the same (there are tiny
+differences; if you are interested, check Aleksey's blog post).
+
+If we annotate all methods with JMH's `DONT_INLINE` directive, encodings with a forwarder (either
+the M4-style forwarder invoking the trait default method, or the upcoming M5-style default method
+forwarding to a static method) are a bit slower, so a default method without a forwarder is
+faster. The penalty for having either kind of forwarder is similar.
+
+#### Polymorphic case
+
+If the callsite is polymorphic:
+
+  - The M4 encoding (`addForwarded`) is slow because the forwarder cannot be inlined. This is the
+    known issue of trait methods leading to megamorphic callsites that exists in Scala 2.11.x and
+    older.
+  - The 33e7106 (`addDefault`) and M5 (`addDefaultStatic`) encodings are also slow: the default
+    method is not inlined (checked with `-XX:+PrintInlining` and by comparing with a method marked
+    `DONT_INLINE`). We will explore this in detail later.
+
+For comparison, an invocation of `addInherited` is inlined and therefore much faster. So an
+inherited virtual method is not treated in the same way as an inherited default method. The next
+section goes into the details of why this is the case.
+
+*Note:* this cannot be the reason why the 33e7106 encoding causes a 20% performance regression. We
+found out that `addDefault` is slower than it could be in the polymorphic case, but it is not
+slower than the M4 encoding.
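+
+For reference, a stripped-down JMH harness measuring these two cases might look as follows. This
+is a sketch, not the actual `CallPerformance` source; it assumes the interface `I` and the classes
+`C1` to `C4` from the snippet above as top-level classes, and the benchmark class name is made up:
+
+    import org.openjdk.jmh.annotations.*;
+
+    @State(Scope.Thread)
+    public class CallBench {
+      // Four different receiver classes make the polymorphic callsite megamorphic.
+      I[] receivers = { new C1(), new C2(), new C3(), new C4() };
+      int next = 0;
+
+      @Benchmark
+      public int monomorphic() {
+        // The receiver is always a C1: the callsite sees a single type.
+        // Returning the result lets JMH consume it, preventing dead-code elimination.
+        return receivers[0].addDefault(1, 2);
+      }
+
+      @Benchmark
+      public int polymorphic() {
+        // The receiver cycles through C1..C4: no single type dominates.
+        next = (next + 1) & 3;
+        return receivers[next].addDefault(1, 2);
+      }
+    }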
+
+### CHA and default methods
+
+The reason `addDefault` is not inlined in the previous example, while `addInherited` is, has to do
+with CHA: in fact, CHA is disabled altogether for default methods. This is logged in the JVM
+bugtracker under [JDK-8036580](https://bugs.openjdk.java.net/browse/JDK-8036580). It was disabled
+in order to fix [JDK-8036100](https://bugs.openjdk.java.net/browse/JDK-8036100), which led to the
+wrong method being inlined. (It was @retronym who initially suggested these tickets could be
+relevant.)
+
+The reason for `addInherited` being inlined is that the VM knows (from CHA) that the method is not
+overridden in any of the loaded classes. This is tested in the
+[`InliningCHA`](https://github.com/lrytz/benchmarks/blob/master/src/main/java/traitEncodings/InliningCHA.java)
+benchmark.
+
+The first benchmark measures a megamorphic call to `addInherited`, just like in the previous
+section. This call is inlined. The second benchmark performs the exact same operation, but makes
+sure that a new subclass `CX`, which overrides `addInherited`, is loaded. CHA no longer returns a
+single target for the method and the call is not inlined. Note that no instance of `CX` is ever
+created.
+
+This seems to be a shortcoming in C2's inliner implementation: based on the type profiling data,
+C2 knows that the only types reaching the callsite are `C1`, `C2`, `C3` and `C4`. Using CHA, it
+could in principle find out that there is a single implementation of `addInherited` for these
+types.
+
+### Method lookup in classes implementing many interfaces
+
+We are still searching for an answer to why 33e7106 caused a performance regression. Martin
+Thompson notes in a
+[blog post](http://mechanical-sympathy.blogspot.ch/2012/04/invoke-interface-optimisations.html)
+(dated 2012):
+
+  > I have observed that when a class implements multiple interfaces, with multiple methods,
+  > performance can degrade significantly because the method dispatch involves a linear search of
+  > method list
+
+We can reproduce this in the benchmark
+([`InterfaceManyMembers`](https://github.com/lrytz/benchmarks/blob/master/src/main/java/traitEncodings/InterfaceManyMembers.java)).
+The basic example is the following:
+
+    interface I1 { default int a1 ... }
+    interface I2 { default int b1 ... ; default int b2 ... }
+    ...
+
+    class A1 implements I1 { }
+    class A2 implements I2 { }
+    ...
+
+    class B1 implements I1 { }
+    class B2 implements I1, I2 { }
+    ...
+
+In the benchmark, every class (`A1`, `A2`, `B1`, ...) exists in four copies to make sure the
+callsite is megamorphic. We measure how much time an invocation of one default method takes:
+
+  - The number of default methods in an interface does not matter, so `A1.a1` and `A2.b1` perform
+    the same.
+  - The number of implemented interfaces matters: there is a penalty for every additional
+    interface. So `B1.a1` is faster than `B2.b1`, etc.
+
+Adding an overriding forwarder method to the subclasses does not change this result; the slowdown
+per additional interface remains. So this does not seem to be the reason for the performance
+regression either.
+
+### Back to CHA
+
+Googling a little more about the performance of default methods, I found a relevant
+[post on SO](http://stackoverflow.com/questions/30312096/java-default-methods-is-slower-than-the-same-code-but-in-an-abstract-class)
+containing a nice benchmark.
+
+I simplified the example into the benchmark
+([`NoCHAPreventsOptimization`](https://github.com/lrytz/benchmarks/blob/master/src/main/java/traitEncodings/NoCHAPreventsOptimization.java)),
+which is relatively small:
+
+    interface I {
+      int getV();
+      default int accessDefault() { return getV(); }
+    }
+
+    abstract class A implements I {
+      public int accessVirtual() { return getV(); }
+      public int accessForward() { return I.super.accessDefault(); }
+    }
+
+    class C extends A implements I {
+      public int v = 0;
+      public int getV() { return v; }
+    }
+
+The benchmark shows that `c.v = x; c.accessDefault()` is 3x slower than
+`c.v = x; c.accessVirtual()` or `c.v = x; c.accessForward()`.
+
+As noted in the comments on the StackOverflow thread, everything is inlined in all three
+benchmarks, so the difference is not due to inlining. We can observe that the assembly generated
+for the `accessDefault` case is less optimized than in the other cases. Here is the output of
+JMH's `-prof perfasm` feature, which includes the assembly of the hottest regions:
+
+  - for [accessDefault](https://gist.github.com/lrytz/f1c24e685b871639d7e618b56325e102#file-adefault-txt)
+  - for [accessVirtual](https://gist.github.com/lrytz/f1c24e685b871639d7e618b56325e102#file-bvirtual-txt)
+  - for [accessForward](https://gist.github.com/lrytz/f1c24e685b871639d7e618b56325e102#file-cforward-txt)
+
+In fact, the assembly for the `accessVirtual` and `accessForward` cases is identical.
+
+One answer on the SO thread suggests that the lack of CHA in the default method case prevents
+eliminating a type guard, which in turn prevents optimizations of the field write and read.
+Somebody with more experience in reading assembly than me could certainly verify that.
+
+I did not do any further research to find out what kind of optimizations depend on CHA, or if it
+is really the lack of CHA that causes the code not to be optimized properly. For my convenience,
+let's say that's beyond the scope of this post. If you have any insights or references on this
+topic, please forward them to me!
+
+It seems that missed optimizations due to the lack of CHA for default methods are the most likely
+source of the slowdown we observe when running the Scala compiler.
+
+## Summary
+
+We found a few interesting limitations in the JVM optimizer:
+
+  - Because CHA is not supported for default methods, a megamorphic callsite invoking a default
+    method is never inlined, even if the method is not overridden at all.
+  - Interface method lookup slows down with the number of interfaces a class implements.
+  - While monomorphic calls to default methods are inlined, the lack of CHA has negative effects
+    on other optimizations.
+
+## References
+
+Besides the [post](http://shipilev.net/blog/2015/black-magic-method-dispatch/) already mentioned,
+Aleksey Shipilёv's [blog](http://shipilev.net/) is an excellent resource on Java and JVM
+internals.
+
+The talk "JVM Mechanics" by Doug Hawkins was also mentioned above
+([video](https://www.youtube.com/watch?v=E9i9NJeXGmM),
+[slides](http://www.slideshare.net/dougqh/jvm-mechanics-when-does-the));
+it is a great overview of the JIT, the inliner and the optimizer. For an overview I can also
+recommend a
+[longer blog post](http://middlewaresnippets.blogspot.ch/2014/11/java-virtual-machine-code-generation.html)
+by René van Wijk and a
+[shorter one](https://www.lmax.com/blog/staff-blogs/2016/03/05/observing-jvm-warm-effects/)
+by Mark Price focusing on the JIT compilers.
+
+The JVM has a large number of flags for logging and tweaking:
+
+  - Some flags are [documented here](https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html).
+  - Many others are not documented; run `java -XX:+PrintFlagsFinal` to get a list of all flags.
+
+Some flags used in the examples of this post:
+
+  - `-XX:TieredStopAtLevel=1` to disable C2
+  - `-XX:+PrintCompilation` logs methods being compiled (and deoptimized)
+  - `-XX:+PrintInlining` (together with `-XX:+UnlockDiagnosticVMOptions`) logs callsites being
+    inlined (or not), best used together with the above
+
+[JITWatch](https://github.com/AdoptOpenJDK/jitwatch) is a GUI tool that helps to understand what
+the JIT is doing (I haven't tried it yet).
+
+A
+[thread](http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-April/thread.html#17649)
+on the hotspot-compiler-dev mailing list discusses why CHA is disabled for interfaces; it seems to
+describe the situation before default methods were common.
+
+A [gist](https://gist.github.com/rednaxelafx/1165804#file-notes-md) by Krystal Mok explains many
+details of the `-XX:+PrintCompilation` output and other details of the JIT process.
+
+The [glossary](http://openjdk.java.net/groups/hotspot/docs/HotSpotGlossary.html) on the HotSpot
+wiki contains some useful nomenclature.