
Performance regression in Scala 2.13 for creation of lists using mutable.ListBuffer #11627

Closed
plokhotnyuk opened this issue Jul 12, 2019 · 36 comments

@plokhotnyuk

plokhotnyuk commented Jul 12, 2019

For small lists the slowdown can be ~1.5x with OpenJDK and ~7x with GraalVM.
Most parsers that bind parsed data from text or binary messages to List or Seq (List being the default implementation of Seq) are affected.

Code of the benchmark to reproduce:

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import scala.collection.mutable.ListBuffer

@State(Scope.Thread)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgs = Array(
  "-server",
  "-Xms2g",
  "-Xmx2g",
  "-XX:NewSize=1g",
  "-XX:MaxNewSize=1g",
  "-XX:InitialCodeCacheSize=512m",
  "-XX:ReservedCodeCacheSize=512m",
  "-XX:+UseParallelGC",
  "-XX:-UseBiasedLocking",
  "-XX:+AlwaysPreTouch"))
@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
class ListBufferBenchmark {
  @Param(Array("1", "10", "100"))
  var size: Int = 1000

  @Benchmark
  def intListCreation: List[Int] = {
    val squares = new ListBuffer[Int]()
    var i = 0
    val l = size
    while (i < l) {
      squares += i * i
      i += 1
    }
    squares.toList
  }
}

Command to run:

sbt -java-home /usr/lib/jvm/jdk-11 -no-colors ++2.13.0 'jmh:run ListBufferBenchmark'

Results for Scala 2.13.0 with OpenJDK 11.0.3:

[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                            (size)   Mode  Cnt          Score        Error  Units
[info] ListBufferBenchmark.intListCreation       1  thrpt    5  129974588.015 ± 249971.629  ops/s
[info] ListBufferBenchmark.intListCreation      10  thrpt    5   15160739.436 ±   6815.066  ops/s
[info] ListBufferBenchmark.intListCreation     100  thrpt    5    1415746.679 ±  12507.797  ops/s

Results for Scala 2.12.8 with OpenJDK 11.0.3:

[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                            (size)   Mode  Cnt          Score         Error  Units
[info] ListBufferBenchmark.intListCreation       1  thrpt    5  184201868.557 ± 1525301.419  ops/s
[info] ListBufferBenchmark.intListCreation      10  thrpt    5   21833324.557 ±  540564.835  ops/s
[info] ListBufferBenchmark.intListCreation     100  thrpt    5    1767339.444 ±   23828.567  ops/s

Results for Scala 2.13.0 with GraalVM CE 19.1:

[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                            (size)   Mode  Cnt         Score        Error  Units
[info] ListBufferBenchmark.intListCreation       1  thrpt    5  33947057.954 ± 509483.584  ops/s
[info] ListBufferBenchmark.intListCreation      10  thrpt    5   6072784.266 ±  22838.907  ops/s
[info] ListBufferBenchmark.intListCreation     100  thrpt    5    624114.322 ±   4138.885  ops/s

Results for Scala 2.12.8 with GraalVM CE 19.1:

[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                            (size)   Mode  Cnt          Score        Error  Units
[info] ListBufferBenchmark.intListCreation       1  thrpt    5  238624457.791 ± 859161.435  ops/s
[info] ListBufferBenchmark.intListCreation      10  thrpt    5   30777424.910 ± 427876.116  ops/s
[info] ListBufferBenchmark.intListCreation     100  thrpt    5    1989475.540 ±  21622.259  ops/s
@plokhotnyuk
Author

plokhotnyuk commented Jul 12, 2019

I suppose the root cause of the slowdown is the releaseFence() call added in the following commit:
scala/scala@6541df7

@retronym wdyt?

@retronym retronym self-assigned this Jul 12, 2019
@retronym retronym added this to the 2.13.1 milestone Jul 12, 2019
@Ichoran

Ichoran commented Jul 12, 2019

If it is releaseFence, there isn't much we can do. It is absolutely required to avoid a correctness issue present in 2.12.8 that could make it dangerous to use List in a threaded environment.

The issue is this. Suppose you map a 2-element list. You have:

a :: b :: Nil

The map operation, since it can't know how long the list is, uses a ListBuffer, which walks through in order, producing

f(a) :: Nil

at first; then it walks to the next one and mutates the next pointer to yield

f(a) :: f(b) :: Nil

It then returns the head of the list, and everything is cool because nobody else has access to mutate the next pointer.

Unfortunately, that is only true within the same thread. If you hand f(a) :: ? off to some other thread, it isn't guaranteed to see the change without a releaseFence.

So you give it to the other thread and it occasionally sees, instead, f(a) :: Nil because the pointer change is still floating around in the CPU and hasn't made it out to main memory yet (or due to some other caching issue, like the other CPU relying on its cache instead of going to main memory).

This is just terrible. The whole rhetoric around immutable collections is that they're great for concurrent use because they don't change, and here you have a difficult to reproduce, stochastically appearing bug that mutates what you can see about your List.

Fortunately, this is difficult to trigger, but we absolutely have to keep the fix for this.
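
To make the mechanism concrete, here is a rough sketch with hypothetical Cell/MiniBuffer names (not the real library classes) of how a buffer can hand out a list in constant time by mutating the tail of its last cons cell with a plain write:

final class Cell[A](val head: A, var tail: Cell[A]) // stand-in for ::

final class MiniBuffer[A] {
  private var first: Cell[A] = null
  private var last: Cell[A] = null

  def add(a: A): this.type = {
    val cell = new Cell(a, null)
    if (first == null) first = cell
    else last.tail = cell // plain write: without a releaseFence (or another
                          // happens-before edge) other threads may not see it
    last = cell
    this
  }

  def result(): Cell[A] = first // O(1) "toList": hands out the mutated cells
}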

So, what is the workaround? Well, if you know the sizes, you can always build the list up manually using :: instead of using something that relies on ListBuffer. Alternatively, perhaps some of the methods of List can speculatively try to unroll the operation on the stack, and then if it gets too deep they could give up and switch to ListBuffer.
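
A minimal sketch of the first workaround, building the list with :: alone by prepending in reverse order (squaresUpTo is a hypothetical example name):

def squaresUpTo(n: Int): List[Int] = {
  var acc: List[Int] = Nil
  var i = n - 1
  while (i >= 0) {
    acc = (i * i) :: acc // each cons cell is fully initialized at construction
    i -= 1
  }
  acc
}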

Anyway, this is not an easy problem, and you very much can't just say, "Well, releaseFence made it slower, so let's remove it again". It's there for a very good reason. (Also, there are other reasons the benchmarks could change, so it would be good to do them over again in 2.13 with List duplicated into, say, Lyst and LystBuffer, with the y-version missing the fence. Have to duplicate :: too, so it's a bit of work.)

@plokhotnyuk
Author

plokhotnyuk commented Jul 12, 2019

@Ichoran could you share the source of a test that reproduces the issue explained above? Which data structure or API is used in it to pass an instance of a recently created list? Usually, any concurrency-aware implementation that passes data between threads uses fences to do so safely, so there is no need to do it prematurely for the construction of each list instance...

@Ichoran

Ichoran commented Jul 12, 2019

I don't remember a test case and I don't have time to write one. I'd guess that something like this would work, but I don't have time to test/optimize it:

class Sketch {
  @volatile var xs = 0 :: 1 :: Nil
  val t1, t2 = new Thread {
    private[this] var n = 10000000
    override def run(): Unit = {
      while (n > 0) {
        xs = xs.map(_ + 1)
        n -= 1
      }
    }
  }
  def test(): Unit = {
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    println(xs)
  }
}

You'll expect to see things skipped, but the list should always have two elements, and the values should always be 1 apart.

(When I run this I don't see errors, but the errors were never consistently found anyway. I'm not sure where the old reports are.)

@plokhotnyuk
Author

plokhotnyuk commented Jul 12, 2019

@Ichoran this test will always pass because access to volatile vars on the JVM uses store/load fences.

BTW, FYI: https://dzone.com/articles/cpu-cache-flushing-fallacy

"The cache sub-system is considered the "source of truth" for mainstream systems. If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory. This style of memory management is known as write-back whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place."

@Ichoran

Ichoran commented Jul 12, 2019

Yeah, so drop the @volatile and maybe make other changes. Like I said, I don't have time to replicate the error.

You're right that I wasn't accurate with regard to the cache; it's the register/cache boundary, not the cache/main-memory boundary, that can lead to inconsistent views of things. I think the result is architecture-dependent, too. In any case, the JMM doesn't promise that you can see b in the second thread unless you enforce ordering somehow, and people have observed it in practice, although I haven't ever been able to on my machines IIRC (including running the exact test that they say gave errors on theirs).

So I'm not entirely surprised that I failed once again to see anything.

@plokhotnyuk
Author

plokhotnyuk commented Jul 13, 2019

@Ichoran If you drop the @volatile annotation you will lose the fences and, possibly, on some combination of JVM/CPU, you will get the kind of data race you are trying to reproduce.

But that should not be a reason for adding fences to Scala's immutable collections that don't grow on demand (like a stream does).

Please see the charts below with results of benchmarks where a JSON array of 128 boolean values was parsed to a List[Boolean] on different JVMs. For smaller sizes the slowdown is much worse.

Scala 2.12.8:
[benchmark chart]

Scala 2.13.0:
[benchmark chart]

@jsfwa

jsfwa commented Jul 13, 2019

@plokhotnyuk, @Ichoran this is an example of the issue described above.

But it looks like this behavior is relevant and easily reproducible (at least in my case) only on JDK 8, and it works correctly on newer Java versions.
Also, all the evidence of that unexpected behavior is pretty old, which indirectly confirms that these "good reasoning" barriers serve no actual purpose on a modern JVM.

Actually, it can be easily reproduced, but that still doesn't justify the inner barriers.

@plokhotnyuk
Author

plokhotnyuk commented Jul 13, 2019

@jsfwa in that gist thread the root problem is not in List or ListBuffer.

In the 1st sample it is just missing fences when accessing the X.x variable from 50 threads. Nobody passes messages that way these days; plenty of concurrent structures and APIs (like futures, actors, concurrent streams) are used instead, and each has its own mechanics for passing immutable messages safely.

In the 2nd, Message is mutable, and if it is already shared between threads, store/load fences for _words should be used too, as in this commit. BTW, the original snippet for Message doesn't compile; I suppose the author was trying to write something like this:

class Message(val text: String) {
  private[this] var _words: List[String] = _

  def words: List[String] = {
    var res = _words 
    if (res == null) {
      res = text.split("""\s""").toList
      _words = res
    }
    res
  }
}

Why not just use lazy val here?
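
A minimal sketch of that alternative; scalac generates the initialization guard itself, so the cached list is safely published without hand-rolled fences:

class Message(val text: String) {
  lazy val words: List[String] = text.split("""\s""").toList
}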

@jsfwa

jsfwa commented Jul 13, 2019

@plokhotnyuk I totally agree that everyone should use fences and other tricks when there are concurrent writers.

The mentioned gist is part of this discussion: since the tail of :: is mutable, the maintainers decided to enforce thread-safety.

Sadly, the cost is too high; I hope they will revert the changes.

@plokhotnyuk
Author

plokhotnyuk commented Jul 15, 2019

In this PR I have tried to mitigate the issue by appending to lists manually. On JDK 11.0.3 it works even faster than with ListBuffer, but when running on GraalVM EE 19.1 the slowdown still exists, and a flame graph report shows the releaseFence call in the :: (cons) constructor:

JDK 11.0.3 + Scala 2.13.0:
[flame graph image]

GraalVM EE 19.1 + Scala 2.13.0:
[flame graph image]

@retronym
Member

@plokhotnyuk What CPU architecture are you testing on? I assume x86 but want to be sure.

@plokhotnyuk
Author

plokhotnyuk commented Jul 16, 2019

@retronym Intel Core i7, 7th generation with Ubuntu 18.04, 64-bit

Possibly this issue is related to the slowdown with GraalVM CE/EE 19.1 + Scala 2.13.0, but it cannot be reproduced with earlier versions of Scala; see the flame graph below.

GraalVM CE 19.1 + Scala 2.12.8:
[flame graph image]

@odersky

odersky commented Jul 18, 2019

We also have immutable vectors, sets and maps that have internally mutable fields. Do we fence these also?

The issue as I see it is: Say we have an immutable data structure d: D and a pure function f: D => D. We write in thread T1:

   var a = d
   ...
   a = f(d)

and read a in thread T2 without any sort of synchronization between T1 and T2. Do we need to guarantee that T2 always sees either d or f(d) in a, even if there is no "happens-before" relationship according to the JVM memory model?

I believe this is actually a lot to ask. On some earlier architectures, even a Long made up from two words could be split so that a reading thread could see only one half of the store. That's gone for good with 64-bit architectures. But extending this guarantee to all immutable data structures seems to impose an undue burden on the implementation. In my mind, if a fix to this problem causes any sort of slowdown it's unacceptable, and we should instead just state that immutability of a data structure does not imply safe publication. That is perfectly acceptable in my mind, since concurrent architectures that rely only on these sorts of low-level safe-publication guarantees, without resorting to volatiles, monitors, or atomics, are super fragile anyway.
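
A rough sketch (hypothetical names) of the race in question: a cons-like cell whose field is written after construction and handed to another thread through a plain var:

object PublicationRace {
  final class Cell(val value: Int) {
    var next: Cell = null // non-final field, mutated after construction (like ::'s tail)
  }

  var shared: Cell = null // plain field: no volatile, no fence

  def writer(): Unit = {
    val a = new Cell(1)
    a.next = new Cell(2) // plain write; may not be visible...
    shared = a           // ...to a thread that reads `shared`
  }

  def reader(): Option[Int] = // may see next == null even after writer() ran
    Option(shared).flatMap(c => Option(c.next)).map(_.value)
}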

There is also an issue with immutable arrays, which will be part of Scala 3. Immutable arrays use just Java arrays under the hood. Do we need to fence all operations on immutable arrays also, in order to ensure safe publication?

@odersky

odersky commented Jul 18, 2019

An argument to treat lists differently from other immutable collections could be: Lists are morally ADTs, i.e. they can be thought of like this:

trait List[+A]
case object Nil extends List[Nothing]
case class Cons[+A](x: A, xs: List[A]) extends List[A]

If Lists really were ADTs like that, they would ensure safe publication since all fields are immutable. But then it also looks like they could stack overflow when a long list is mapped.
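
For contrast, a sketch of the map such an ADT would force (reusing the hypothetical Nil/Cons definitions above): it must recurse instead of patching a tail pointer, so deep lists can overflow the stack:

def mapADT[A, B](xs: List[A])(f: A => B): List[B] = xs match {
  case Nil        => Nil
  case Cons(x, t) => Cons(f(x), mapADT(t)(f)) // not tail-recursive: O(n) stack
}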

But the point is: Lists are not an ADT like the one that I have given. They can't be since we do state that the operation toList on a list buffer is constant time. Sure, we cannot change the tail field of a list from the outside, but that's analogous to the fact that we cannot change the internal mutable variables of an immutable vector from the outside. Not being able to change a field is one thing, ensuring safe publication of that field is something else. And the two should not be linked IMO.

@retronym
Member

retronym commented Jul 18, 2019 via email

@plokhotnyuk
Author

According to Aleksey's research, such fences can be implemented without so dramatic an impact, especially on x86.

@odersky

odersky commented Jul 18, 2019

Yep, we also added the fences in HashMap/Set and Vector.

I still have not understood the rationale why we are doing this. If previously people thought it was OK that a double could be split, why go all out to ensure safe publication of immutable data structures? What's the use case where this matters?

@retronym
Member

retronym commented Jul 22, 2019

My belief was that adding the fence was sufficiently cheap that it was worth doing. I'm not yet persuaded that the fence addition is the actual cause of the performance change.

I'm studying how variations of the implementation change performance in https://github.com/retronym/sbt-jmh-listbuffer

So far I found that both the 2.12 and 2.13 library versions are significantly slower than an analogous pure-Java implementation (~0.6x–0.7x). In that pure-Java implementation, the performance change of adding the fence is negligible.

Removing all parents of List and ListBuffer from the Scala versions seems to restore performance. So I believe that the JIT is doing a sub-optimal job of inlining the somewhat elaborate call tree of (empty!) class and trait constructors. JITWatch reports that the slow benchmarks actually do fully inline, but the generated code still ends up longer/messier.

I'm now using JMH -prof perfasm to try to understand this better. I'll write this all up properly tomorrow and seek advice from JIT experts.

@Ichoran

Ichoran commented Jul 22, 2019

@retronym - FWIW, this isn't a new observation either. There was some discussion/demonstration back in 2.10, I think it was, where List (I think it was List) was identified as being substantially slowed down by the mighty inheritance tree above it. Also, when I added mutable.LongMap and mutable.AnyRefMap, the early performance gains where I was beating Java maps were lost once I placed them into the inheritance hierarchy. (AnyRefMap had approximate parity afterwards; LongMap was still better due to specialization, but not as much as it had seemed it would be initially.)

I'm not aware of anyone looking into the cause in enough depth to get either a coherent explanation or something actionable. It wouldn't surprise me if it was some arbitrary JVM threshold that optimizes N but not N+1 empty constructors. You might also try adding a chain of (not wholly superfluous) superclasses and/or traits above the detached List and ListBuffer, if the other approaches don't pan out, to explore what is causing the suboptimal optimization.

@plokhotnyuk
Author

plokhotnyuk commented Jul 22, 2019

@retronym the initial benchmark uses the original Scala library and tests short lists (size=1,10,100), while your benchmarks (from the https://github.com/retronym/sbt-jmh-listbuffer repo) don't use the original List, ::, and ListBuffer classes from the standard library and test much longer lists (size=10000).

Also, I have reimplemented the original benchmark in Java here, for both Java's LinkedList and Scala's ListBuffer/List. For them I got almost the same results as in the original benchmark; please see them in the description of this PR.

Finally, I have published it in a separate repo here.

@retronym
Member

@plokhotnyuk Thanks, that's useful.

It turns out that the material difference is that in 2.13 the call to ListBuffer.+= has to forward through Growable.+= to get to ListBuffer.addOne. Replacing your $plus$eq calls with addOne directly improves performance drastically.

    @Benchmark
    public List<Boolean> scala213ListOfBooleansPlusEq() {
        ListBuffer<Boolean> listBuffer = new ListBuffer<>();
        int l = size;
        int i = 0;
        while (i < l) {
            listBuffer.$plus$eq((i & 1) == 0);
            i++;
        }
        return listBuffer.toList();
    }

    @Benchmark
    public List<Boolean> scala213ListOfBooleansAddOne() {
        ListBuffer<Boolean> listBuffer = new ListBuffer<>();
        int l = size;
        int i = 0;
        while (i < l) {
            listBuffer.addOne((i & 1) == 0);
            i++;
        }
        return listBuffer.toList();
    }
[info] # VM version: JDK 12.0.1, Java HotSpot(TM) 64-Bit Server VM, 12.0.1+12

[info] LinkedListBenchmark.javaListOfBooleans                 1  thrpt    5  115909715.003 ± 98324502.735  ops/s
[info] LinkedListBenchmark.javaListOfBooleans                10  thrpt    5   29307012.723 ±  1314406.449  ops/s
[info] LinkedListBenchmark.javaListOfBooleans               100  thrpt    5    3234171.872 ±   160394.208  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansAddOne       1  thrpt    5  297228574.680 ±  6927856.405  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansAddOne      10  thrpt    5   40600454.592 ±  3403259.134  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansAddOne     100  thrpt    5    4277655.542 ±   107376.264  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansPlusEq       1  thrpt    5  184859568.590 ±  5444893.615  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansPlusEq      10  thrpt    5   22818686.698 ±   407740.780  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansPlusEq     100  thrpt    5    2302681.790 ±    50850.579  ops/s
[info] Benchmark result is saved to scala-2.13.json

Comparing to the 2.12 baseline:

[info] LinkedListBenchmark.scala212ListOfBooleansPlusEq       1  thrpt    5  265587493.195 ± 7337611.729  ops/s
[info] LinkedListBenchmark.scala212ListOfBooleansPlusEq      10  thrpt    5   33084745.588 ±  792230.913  ops/s
[info] LinkedListBenchmark.scala212ListOfBooleansPlusEq     100  thrpt    5    2714058.378 ±   72789.631  ops/s
[info] Benchmark result is saved to scala-2.12.json

So:

  • Call addOne on 2.13 to sidestep this issue (json-iter's macro could do this conditionally on the ambient Scala version); see the sketch after this list.
  • The best-case performance of 2.13.x is better than 2.12.x. I'm not sure why yet: there have been refactorings of the ListBuffer implementation and also improvements to the scalac optimizer that the library itself is compiled with. If I can identify an isolated reason, I might be able to backport it to 2.12.x.
  • JIT inlining of 2.13's Growable.+= leaves something to be desired.
  • Growable.+= is also marked @inline, which scalac -opt:l:inline -opt-inline-from:scala.** can inline through. I see this still leaves a redundant null check compared to a direct call to addOne.
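
A minimal Scala sketch of the first bullet's workaround (illustrative names, mirroring the benchmark loop above):

import scala.collection.mutable.ListBuffer

def intListCreation(size: Int): List[Int] = {
  val buf = new ListBuffer[Int]
  var i = 0
  while (i < size) {
    buf.addOne(i * i) // direct call: skips the Growable.+= forwarder on 2.13
    i += 1
  }
  buf.toList
}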

@plokhotnyuk
Author

@retronym Thanks a lot!

I've added a benchmark for addOne and updated benchmark results in this commit.

Your finding will help to mitigate the issue for OpenJDK, but for GraalVM CE/EE it is not enough. Should I raise an issue in their repo instead?

@retronym
Member

retronym commented Jul 24, 2019

Yes, it would be good to notify the Graal team of the performance difference. Hopefully it's something straightforward for them to fix. /cc @vjovanov

My benchmarks are now cleaned up and highlight the HotSpot/C2 difficulty with +=. I've managed to create a pure-Java replica of the relevant parts of our collections that shows the same slowdown.


[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                                   (size)   Mode  Cnt         Score         Error  Units
[info] ListsBenchmark.javaListBufferPlusEqAddOne       10  thrpt    4  41074388.920 ±  899881.789  ops/s
[info] ListsBenchmark.scalaListBufferPlusEq            10  thrpt    4  22654974.135 ± 1107078.470  ops/s
[info] ListsBenchmark.scalaListBufferPlusEqAddOne      10  thrpt    4  40713598.860 ± 2130166.999  ops/s
[info] ListsBenchmark.skalaAddOne                      10  thrpt    4  40959424.729 ±  984075.334  ops/s
[info] ListsBenchmark.skalaPlusEq                      10  thrpt    4  20522159.759 ± 1759567.702  ops/s

I'll use this to study what's going on in the C2 JIT some more, to see if this is a bug or if our indirection really is incurring extra runtime cost (i.e. null checks) that the JIT isn't able to elide.

@retronym
Member

retronym commented Jul 24, 2019

Okay, -prof perfasm shows that the slow cases occur when the JIT doesn't fully inline. See the output here, which contains, for example:

[info]   0.94%     │││  0x00007f198342eea4: mov    rsi,rbp
[info]   0.52%     │││  0x00007f198342eea7: call   0x00007f1973152980             ; ImmutableOopMap{rbp=Oop [112]=Oop [120]=Oop [128]=Oop [0]=Oop [16]=Oop [24]=Oop [48]=Oop [56]=NarrowOop }
[info]             │││                                                            ;*invokespecial <init> {reexecute=0 rethrow=0 return_oop=0}
[info]             │││                                                            ; - scala.collection.AbstractSeq::<init>@1 (line 1154)
[info]             │││                                                            ; - scala.collection.immutable.AbstractSeq::<init>@1 (line 159)
[info]             │││                                                            ; - scala.collection.immutable.List::<init>@1 (line 83)
[info]             │││                                                            ; - scala.collection.immutable.$colon$colon::<init>@11 (line 592)
[info]             │││                                                            ; - scala.collection.mutable.ListBuffer::addOne@12 (line 109)
[info]             │││                                                            ; - scala.collection.mutable.ListBuffer::addOne@2 (line 39)
[info]             │││                                                            ; - scala.collection.mutable.Growable::$plus$eq@2 (line 38)
[info]             │││                                                            ; - scala.collection.mutable.Growable::$plus$eq$@2 (line 38)
[info]             │││                                                            ; - scala.collection.mutable.AbstractBuffer::$plus$eq@2 (line 232)
[info]             │││                                                            ; - bench.ListsBenchmark::scalaListBufferPlusEq@21 (line 56)
[info]             │││                                                            ; - bench.generated.ListsBenchmark_scalaListBufferPlusEq_jmhTest::scalaListBufferPlusEq_thrpt_jmhStub@17 (line 119)
[info]             │││                                                            ;   {optimized virtual_call}

The JIT's inlining depth budget is stretched by the combination of a) the extra indirection through += => addOne, and b) the depth of the super constructor call chain of :: (6 deep).

Using a higher budget, like -XX:MaxInlineLevel=18, leads to identical results for += and addOne.

I thought this was the first thing I had tried, without success. But maybe I just looked at the inlining logs in JITWatch and was looking at the wrong call site or something...

@vjovanov

@plokhotnyuk yes, please open an issue here and we will handle it. Thanks for thinking of us!

@plokhotnyuk
Author

@vjovanov thank you for your support; here it is.

@retronym
Member

retronym commented Jul 24, 2019

I think this is all explained now.

Summary:

  • Bumping -XX:MaxInlineLevel=<N> from the default of 9 up to 18 is often beneficial for Scala programs, and in this case is needed to inline the :: super constructor call chain into the benchmark, which now must indirect through += => addOne; see the sbt sketch after this list.
  • Using GraalVM (including via OpenJDK 12+ with -XX:+EnableJVMCI -XX:+UseJVMCICompiler) is usually a way to get more aggressive JIT inlining, but the use of MethodHandle-s in scalac to abstract over Java 8's Unsafe.storeFence and Java 9+'s equivalent VarHandle.releaseFence fails to inline; the next version of Graal will fix it.
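
A hedged build.sbt sketch of applying the flag from the first bullet (assumes a forked run; fork and javaOptions are standard sbt keys):

fork := true
javaOptions += "-XX:MaxInlineLevel=18"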

@Ichoran

Ichoran commented Jul 24, 2019

@retronym - Good detective work! I wonder if we should alter build tools like SBT and Mill to bump MaxInlineLevel up by default? This has bitten us several times now.

@smarter
Member

smarter commented Jul 24, 2019

I wonder if we should alter build tools like SBT and Mill to bump MaxInlineLevel up by default?

Good idea, the default is way too low for Scala.

SethTisue added a commit to SethTisue/Project-Euler that referenced this issue Jul 24, 2019
hoping it might help performance, as per scala/bug#11627 (comment)
retronym added a commit to retronym/scala that referenced this issue Jul 25, 2019
 I'm seeing a 1.4x speedup for:

 ```
  @Benchmark
  public Object scalaListBufferPlusEq_212() {
    ListBuffer<String> buffer = new ListBuffer<>();
    int i = 0;
    while (i < size) {
      buffer.$plus$eq("");
      i += 1;
    }
    return buffer.result();
  }
```

2.12.8

```
[info] Benchmark                                 (size)   Mode  Cnt         Score        Error  Units
[info] ListsBenchmark.scalaListBufferPlusEq_212      10  thrpt    5  25856046.731 ± 1229100.335  ops/s
```

This patch:

```
[info] Benchmark                                 (size)   Mode  Cnt         Score        Error  Units
[info] ListsBenchmark.scalaListBufferPlusEq_212      10  thrpt    5  35848876.003 ± 514044.717  ops/s
```

It is still a little short of the 2.13.x performance, for which I saw:

```
[info] ListsBenchmark.scalaListBufferPlusEq          10  thrpt    5  37174742.519 ± 1304768.628  ops/s
[info] ListsBenchmark.scalaListBufferAddOne          10  thrpt    5  37201063.905 ± 2167146.358  ops/s
```

* the `scalaListBufferPlusEq` result requires `-XX:MaxInlineLevel=18`
(discussion at scala/bug#11627)
@dwijnand
Member

dwijnand commented Jul 25, 2019

What and how severe are the trade-offs of increasing -XX:MaxInlineLevel to 18?

How generally applicable is it to set it to 18?

@lrytz
Member

lrytz commented Jul 25, 2019

I think there's a risk that changing the JVM defaults in the build tool could lead to confusion when diagnosing performance issues, because people probably don't use the build tool to run their apps in production.

@Ichoran

Ichoran commented Jul 25, 2019

@lrytz - I had considered that drawback also, which is why I was wondering whether we should, rather than simply stating that I thought we should. I don't have enough exposure to environments where people deploy artifacts built by Scala build tools to know whether it's overall a plus or a minus to have the build tool automatically select what we would consider best-practice JVM options.

@smarter
Member

smarter commented Jul 25, 2019

sbt already affects performance due to classloading, and the sbt launcher already passes some flags which can also affect it, so there's precedent for this kind of thing.

@retronym
Member

Arguably the best practice is moving quickly to -XX:+EnableJVMCI -XX:+UseJVMCICompiler, which enables the Graal CE compiler bundled inside recent releases of Oracle JDK/OpenJDK.

Another downside of embedding -XX flags in scripts is that it might limit them to the OpenJDK family of JVMs. In practice, OpenJ9 appears to ignore unrecognized -XX options by default, so maybe this isn't a big concern.

@plokhotnyuk
Author

plokhotnyuk commented Jul 27, 2019

@retronym GraalVM CE 19.1.1 ignores the -XX:MaxInlineLevel=18 option and is still faster than OpenJDK 8 with it; please see a comparison chart here.
