
Performance regression in Scala 2.13 for creation of lists using mutable.ListBuffer #11627

Closed
plokhotnyuk opened this issue Jul 12, 2019 · 36 comments

@plokhotnyuk

plokhotnyuk commented Jul 12, 2019

For small lists the slowdown can be ~1.5x with OpenJDK and ~7x with GraalVM.
Most parsers that bind parsed data from text or binary messages to List or Seq (List being the default implementation of Seq) are affected.

Code of the benchmark to reproduce:

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import scala.collection.mutable.ListBuffer

@State(Scope.Thread)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgs = Array(
  "-server",
  "-Xms2g",
  "-Xmx2g",
  "-XX:NewSize=1g",
  "-XX:MaxNewSize=1g",
  "-XX:InitialCodeCacheSize=512m",
  "-XX:ReservedCodeCacheSize=512m",
  "-XX:+UseParallelGC",
  "-XX:-UseBiasedLocking",
  "-XX:+AlwaysPreTouch"))
@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
class ListBufferBenchmark {
  @Param(Array("1", "10", "100"))
  var size: Int = 1000

  @Benchmark
  def intListCreation: List[Int] = {
    val squares = new ListBuffer[Int]()
    var i = 0
    val l = size
    while (i < l) {
      squares += i * i
      i += 1
    }
    squares.toList
  }
}

Command to run:

sbt -java-home /usr/lib/jvm/jdk-11 -no-colors ++2.13.0 'jmh:run ListBufferBenchmark'

Results for Scala 2.13.0 with OpenJDK 11.0.3:

[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                            (size)   Mode  Cnt          Score        Error  Units
[info] ListBufferBenchmark.intListCreation       1  thrpt    5  129974588.015 ± 249971.629  ops/s
[info] ListBufferBenchmark.intListCreation      10  thrpt    5   15160739.436 ±   6815.066  ops/s
[info] ListBufferBenchmark.intListCreation     100  thrpt    5    1415746.679 ±  12507.797  ops/s

Results for Scala 2.12.8 with OpenJDK 11.0.3:

[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                            (size)   Mode  Cnt          Score         Error  Units
[info] ListBufferBenchmark.intListCreation       1  thrpt    5  184201868.557 ± 1525301.419  ops/s
[info] ListBufferBenchmark.intListCreation      10  thrpt    5   21833324.557 ±  540564.835  ops/s
[info] ListBufferBenchmark.intListCreation     100  thrpt    5    1767339.444 ±   23828.567  ops/s

Results for Scala 2.13.0 with GraalVM CE 19.1:

[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                            (size)   Mode  Cnt         Score        Error  Units
[info] ListBufferBenchmark.intListCreation       1  thrpt    5  33947057.954 ± 509483.584  ops/s
[info] ListBufferBenchmark.intListCreation      10  thrpt    5   6072784.266 ±  22838.907  ops/s
[info] ListBufferBenchmark.intListCreation     100  thrpt    5    624114.322 ±   4138.885  ops/s

Results for Scala 2.12.8 with GraalVM CE 19.1:

[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                            (size)   Mode  Cnt          Score        Error  Units
[info] ListBufferBenchmark.intListCreation       1  thrpt    5  238624457.791 ± 859161.435  ops/s
[info] ListBufferBenchmark.intListCreation      10  thrpt    5   30777424.910 ± 427876.116  ops/s
[info] ListBufferBenchmark.intListCreation     100  thrpt    5    1989475.540 ±  21622.259  ops/s
@plokhotnyuk
Author

plokhotnyuk commented Jul 12, 2019

I suppose the root cause of the slowdown is the releaseFence() call added in the following commit:
scala/scala@6541df7

@retronym wdyt?

@retronym retronym self-assigned this Jul 12, 2019
@retronym retronym added this to the 2.13.1 milestone Jul 12, 2019
@Ichoran

Ichoran commented Jul 12, 2019

If it is releaseFence, there isn't much we can do. It is absolutely required to avoid a correctness issue present in 2.12.8 that could make it dangerous to use List in a threaded environment.

The issue is this. Suppose you map a 2-element list. You have:

a :: b :: Nil

The map operation, since it can't know how long the list is, uses a ListBuffer, which walks through in order, producing

f(a) :: Nil

at first; then it walks to the next one and mutates the next pointer to yield

f(a) :: f(b) :: Nil

It then returns the head of the list, and everything is cool because nobody else has access to mutate the next pointer.

Unfortunately, that is only true within the same thread. If you hand f(a) :: ? off to some other thread, it isn't guaranteed to see the change without a releaseFence.

So you give it to the other thread and it occasionally sees, instead, f(a) :: Nil because the pointer change is still floating around in the CPU and hasn't made it out to main memory yet (or due to some other caching issue, like the other CPU relying on its cache instead of going to main memory).

This is just terrible. The whole rhetoric around immutable collections is that they're great for concurrent use because they don't change, and here you have a difficult to reproduce, stochastically appearing bug that mutates what you can see about your List.

Fortunately, this is difficult to trigger, but we absolutely have to keep the fix for this.
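
To make the mechanism concrete, here is a rough sketch with hypothetical Cell/MiniBuffer names (not the real library classes) of how a buffer can hand out a list in constant time by mutating the tail of its last cons cell with a plain write:

final class Cell[A](val head: A, var tail: Cell[A]) // stand-in for ::

final class MiniBuffer[A] {
  private var first: Cell[A] = null
  private var last: Cell[A] = null

  def add(a: A): this.type = {
    val cell = new Cell(a, null)
    if (first == null) first = cell
    else last.tail = cell // plain write: without a releaseFence (or another
                          // happens-before edge) other threads may not see it
    last = cell
    this
  }

  def result(): Cell[A] = first // O(1) "toList": hands out the mutated cells
}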

So, what is the workaround? Well, if you know the sizes, you can always build the list up manually using :: instead of using something that relies on ListBuffer. Alternatively, perhaps some of the methods of List can speculatively try to unroll the operation on the stack, and then if it gets too deep they could give up and switch to ListBuffer.
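
A minimal sketch of the first workaround, building the list with :: alone by prepending in reverse order (squaresUpTo is a hypothetical example name):

def squaresUpTo(n: Int): List[Int] = {
  var acc: List[Int] = Nil
  var i = n - 1
  while (i >= 0) {
    acc = (i * i) :: acc // each cons cell is fully initialized at construction
    i -= 1
  }
  acc
}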

Anyway, this is not an easy problem, and you very much can't just say, "Well, releaseFence made it slower, so let's remove it again". It's there for a very good reason. (Also, there are other reasons the benchmarks could change, so it would be good to do them over again in 2.13 with List duplicated into, say, Lyst and LystBuffer, with the y-version missing the fence. Have to duplicate :: too, so it's a bit of work.)

@plokhotnyuk
Author

plokhotnyuk commented Jul 12, 2019

@Ichoran could you share the source of a test that reproduces the issue explained above? Which data structure or API is used in it to pass an instance of a recently created list? Usually, any concurrency-aware implementation that passes data between threads uses fences to do so safely, so there is no need to do it prematurely for the construction of each list instance...

@Ichoran

Ichoran commented Jul 12, 2019

I don't remember a test case and I don't have time to write one. I'd guess that something like this would work, but I don't have time to test/optimize it:

class Sketch {
  @volatile var xs = 0 :: 1 :: Nil
  val t1, t2 = new Thread {
    private[this] var n = 10000000
    override def run(): Unit = {
      while (n > 0) {
        xs = xs.map(_ + 1)
        n -= 1
      }
    }
  }
  def test(): Unit = {
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    println(xs)
  }
}

You'll expect to see things skipped, but the list should always have two elements, and the values should always be 1 apart.

(When I run this I don't see errors, but the errors were never consistently found anyway. I'm not sure where the old reports are.)

@plokhotnyuk
Author

plokhotnyuk commented Jul 12, 2019

@Ichoran this test will always pass because access to volatile vars on the JVM uses store/load fences.

BTW, FYI: https://dzone.com/articles/cpu-cache-flushing-fallacy

"The cache sub-system is considered the "source of truth" for mainstream systems. If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory. This style of memory management is known as write-back whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place."

@Ichoran

Ichoran commented Jul 12, 2019

Yeah, so drop the @volatile and maybe make other changes. Like I said, I don't have time to replicate the error.

You're right that I wasn't accurate with regard to the cache; it's the register/cache boundary, not the cache/main-memory boundary, that can lead to inconsistent views of things. I think the result is architecture-dependent, too. In any case, the JMM doesn't promise that you can see b in the second thread unless you enforce ordering somehow, and people have observed it in practice, although I haven't ever been able to on my machines IIRC (including running the exact test that they say gave errors on theirs).

So I'm not entirely surprised that I failed once again to see anything.

@plokhotnyuk
Author

plokhotnyuk commented Jul 13, 2019

@Ichoran If you drop the @volatile annotation you will lose the fences and, possibly, on some combination of JVM/CPU, you will get the kind of data race you are trying to reproduce.

But that should not be a reason for adding fences to Scala's immutable collections that don't grow on demand (like a stream does).

Please see the charts below with results of benchmarks where a JSON array of 128 boolean values was parsed to a List[Boolean] on different JVMs. For smaller sizes the slowdown is much worse.

Scala 2.12.8:
[benchmark chart]

Scala 2.13.0:
[benchmark chart]

@jsfwa

jsfwa commented Jul 13, 2019

@plokhotnyuk, @Ichoran this is an example of the issue described above.

But it looks like this behavior is relevant and easily reproducible (at least in my case) only on JDK 8, and it works correctly on newer Java versions.
Also, all the evidence of that unexpected behavior is pretty old, which indirectly confirms that these "good reasoning" barriers serve no actual purpose on a modern JVM.

Actually, it can be easily reproduced, but that still doesn't justify the inner barriers.

@plokhotnyuk
Author

plokhotnyuk commented Jul 13, 2019

@jsfwa in that gist thread the root problem is not in List or ListBuffer.

In the 1st sample it is just missing fences when accessing the X.x variable from 50 threads. Nobody passes messages that way these days; plenty of concurrent structures and APIs (like futures, actors, concurrent streams) are used instead, and each has its own mechanics for passing immutable messages safely.

In the 2nd, Message is mutable, and if it is already shared between threads, store/load fences for _words should be used too, as in this commit. BTW, the original snippet for Message doesn't compile; I suppose the author was trying to write something like this:

class Message(val text: String) {
  private[this] var _words: List[String] = _

  def words: List[String] = {
    var res = _words 
    if (res == null) {
      res = text.split("""\s""").toList
      _words = res
    }
    res
  }
}

Why not just use lazy val here?
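
A minimal sketch of that alternative; scalac generates the initialization guard itself, so the cached list is safely published without hand-rolled fences:

class Message(val text: String) {
  lazy val words: List[String] = text.split("""\s""").toList
}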

@jsfwa

jsfwa commented Jul 13, 2019

@plokhotnyuk I totally agree that everyone should use fences and other tricks when there are concurrent writers.

The mentioned gist is part of this discussion: since the tail of :: is mutable, the maintainers decided to enforce thread-safety.

Sadly, the cost is too high; I hope they will revert the changes.

@plokhotnyuk
Author

plokhotnyuk commented Jul 15, 2019

In this PR I have tried to mitigate the issue by appending to lists manually. On JDK 11.0.3 it works even faster than with ListBuffer, but when running on GraalVM EE 19.1 the slowdown still exists, and a flame graph report shows the releaseFence call in the :: (cons) constructor:

JDK 11.0.3 + Scala 2.13.0:
[flame graph image]

GraalVM EE 19.1 + Scala 2.13.0:
[flame graph image]

@retronym
Member

@plokhotnyuk What CPU architecture are you testing on? I assume x86 but want to be sure.

@plokhotnyuk
Author

plokhotnyuk commented Jul 16, 2019

@retronym Intel Core i7, 7th generation with Ubuntu 18.04, 64-bit

Possibly this issue is related to the slowdown with GraalVM CE/EE 19.1 + Scala 2.13.0, but it cannot be reproduced with earlier versions of Scala; see the flame graph below.

GraalVM CE 19.1 + Scala 2.12.8:
[flame graph image]

@odersky

odersky commented Jul 18, 2019

We also have immutable vectors, sets and maps that have internally mutable fields. Do we fence these also?

The issue as I see it is: Say we have an immutable data structure d: D and a pure function f: D => D. We write in thread T1:

   var a = d
   ...
   a = f(d)

and read a in thread T2 without any sort of synchronization between T1 and T2. Do we need to guarantee that T2 always sees either d or f(d) in a, even if there is no "happens-before" relationship according to the JVM memory model?

I believe this is actually a lot to ask. On some earlier architectures, even a Long made up from two words could be split so that a reading thread could see only one half of the store. That's gone for good with 64-bit architectures. But extending this guarantee to all immutable data structures seems to impose an undue burden on the implementation. In my mind, if a fix to this problem causes any sort of slowdown it's unacceptable, and we should instead just state that immutability of a data structure does not imply safe publication. That is perfectly acceptable in my mind, since concurrent architectures that rely only on these sorts of low-level safe-publication guarantees, without resorting to volatiles, monitors, or atomics, are super fragile anyway.
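
A rough sketch (hypothetical names) of the race in question: a cons-like cell whose field is written after construction and handed to another thread through a plain var:

object PublicationRace {
  final class Cell(val value: Int) {
    var next: Cell = null // non-final field, mutated after construction (like ::'s tail)
  }

  var shared: Cell = null // plain field: no volatile, no fence

  def writer(): Unit = {
    val a = new Cell(1)
    a.next = new Cell(2) // plain write; may not be visible...
    shared = a           // ...to a thread that reads `shared`
  }

  def reader(): Option[Int] = // may see next == null even after writer() ran
    Option(shared).flatMap(c => Option(c.next)).map(_.value)
}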

There is also an issue with immutable arrays, which will be part of Scala 3. Immutable arrays use just Java arrays under the hood. Do we need to fence all operations on immutable arrays also, in order to ensure safe publication?

@odersky

odersky commented Jul 18, 2019

An argument to treat lists differently from other immutable collections could be: Lists are morally ADTs, i.e. they can be thought of like this:

trait List[+A]
case object Nil extends List[Nothing]
case class Cons[+A](x: A, xs: List[A]) extends List[A]

If Lists really were ADTs like that, they would ensure safe publication since all fields are immutable. But then it also looks like they could stack overflow when a long list is mapped.
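
For contrast, a sketch of the map such an ADT would force (reusing the hypothetical Nil/Cons definitions above): it must recurse instead of patching a tail pointer, so deep lists can overflow the stack:

def mapADT[A, B](xs: List[A])(f: A => B): List[B] = xs match {
  case Nil        => Nil
  case Cons(x, t) => Cons(f(x), mapADT(t)(f)) // not tail-recursive: O(n) stack
}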

But the point is: Lists are not an ADT like the one that I have given. They can't be since we do state that the operation toList on a list buffer is constant time. Sure, we cannot change the tail field of a list from the outside, but that's analogous to the fact that we cannot change the internal mutable variables of an immutable vector from the outside. Not being able to change a field is one thing, ensuring safe publication of that field is something else. And the two should not be linked IMO.

@retronym
Member

retronym commented Jul 18, 2019 via email

@plokhotnyuk
Author

According to Aleksey's research, such fences can be implemented without so dramatic an impact, especially on x86.

@odersky

odersky commented Jul 18, 2019

Yep, we also added the fences in HashMap/Set and Vector.

I still have not understood the rationale why we are doing this. If previously people thought it was OK that a double could be split, why go all out to ensure safe publication of immutable data structures? What's the use case where this matters?

@retronym
Member

retronym commented Jul 22, 2019

My belief was that adding the fence was sufficiently cheap that it was worth doing. I'm not yet persuaded that the fence addition is the actual cause of the performance change.

I'm studying how variations of the implementation change performance in https://github.com/retronym/sbt-jmh-listbuffer

So far I found that both the 2.12 and 2.13 library versions are significantly slower than an analogous pure-Java implementation (~0.6x–0.7x). In that pure-Java implementation, the performance change of adding the fence is negligible.

Removing all parents of List and ListBuffer from the Scala versions seems to restore performance. So I believe that the JIT is doing a sub-optimal job of inlining the somewhat elaborate call tree of (empty!) class and trait constructors. JITWatch reports that the slow benchmarks actually do fully inline, but the generated code still ends up longer/messier.

I'm now using JMH -prof perfasm to try to understand this better. I'll write this all up properly tomorrow and seek advice from JIT experts.

@Ichoran

Ichoran commented Jul 22, 2019

@retronym - FWIW, this isn't a new observation either. There was some discussion/demonstration back in 2.10, I think it was, where List (I think it was List) was identified as being substantially slowed down by the mighty inheritance tree above it. Also, when I added mutable.LongMap and mutable.AnyRefMap, the early performance gains where I was beating Java maps were lost once I placed them into the inheritance hierarchy. (AnyRefMap had approximate parity afterwards; LongMap was still better due to specialization, but not as much as it had seemed it would be initially.)

I'm not aware of anyone looking into the cause in enough depth to get either a coherent explanation or something actionable. It wouldn't surprise me if it was some arbitrary JVM threshold that optimizes N but not N+1 empty constructors. You might also try adding a chain of (not wholly superfluous) superclasses and/or traits above the detached List and ListBuffer, if the other approaches don't pan out, to explore what is causing the suboptimal optimization.

@plokhotnyuk
Author

plokhotnyuk commented Jul 22, 2019

@retronym the initial benchmark uses the original Scala library and tests short lists (size=1,10,100), while your benchmarks (from the https://github.com/retronym/sbt-jmh-listbuffer repo) don't use the original List, ::, and ListBuffer classes from the standard library and test much longer lists (size=10000).

Also, I have reimplemented the original benchmark in Java here, for both Java's LinkedList and Scala's ListBuffer/List. For them I got almost the same results as in the original benchmark; please see them in the description of this PR.

Finally, I have published it in a separate repo here.

@retronym
Member

@plokhotnyuk Thanks, that's useful.

It turns out that the material difference is that in 2.13 the call to ListBuffer.+= has to forward through Growable.+= to get to ListBuffer.addOne. Replacing your $plus$eq calls with addOne directly improves performance drastically.

    @Benchmark
    public List<Boolean> scala213ListOfBooleansPlusEq() {
        ListBuffer<Boolean> listBuffer = new ListBuffer<>();
        int l = size;
        int i = 0;
        while (i < l) {
            listBuffer.$plus$eq((i & 1) == 0);
            i++;
        }
        return listBuffer.toList();
    }

    @Benchmark
    public List<Boolean> scala213ListOfBooleansAddOne() {
        ListBuffer<Boolean> listBuffer = new ListBuffer<>();
        int l = size;
        int i = 0;
        while (i < l) {
            listBuffer.addOne((i & 1) == 0);
            i++;
        }
        return listBuffer.toList();
    }
[info] # VM version: JDK 12.0.1, Java HotSpot(TM) 64-Bit Server VM, 12.0.1+12

[info] LinkedListBenchmark.javaListOfBooleans                 1  thrpt    5  115909715.003 ± 98324502.735  ops/s
[info] LinkedListBenchmark.javaListOfBooleans                10  thrpt    5   29307012.723 ±  1314406.449  ops/s
[info] LinkedListBenchmark.javaListOfBooleans               100  thrpt    5    3234171.872 ±   160394.208  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansAddOne       1  thrpt    5  297228574.680 ±  6927856.405  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansAddOne      10  thrpt    5   40600454.592 ±  3403259.134  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansAddOne     100  thrpt    5    4277655.542 ±   107376.264  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansPlusEq       1  thrpt    5  184859568.590 ±  5444893.615  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansPlusEq      10  thrpt    5   22818686.698 ±   407740.780  ops/s
[info] LinkedListBenchmark.scala213ListOfBooleansPlusEq     100  thrpt    5    2302681.790 ±    50850.579  ops/s
[info] Benchmark result is saved to scala-2.13.json

Comparing to the 2.12 baseline:

[info] LinkedListBenchmark.scala212ListOfBooleansPlusEq       1  thrpt    5  265587493.195 ± 7337611.729  ops/s
[info] LinkedListBenchmark.scala212ListOfBooleansPlusEq      10  thrpt    5   33084745.588 ±  792230.913  ops/s
[info] LinkedListBenchmark.scala212ListOfBooleansPlusEq     100  thrpt    5    2714058.378 ±   72789.631  ops/s
[info] Benchmark result is saved to scala-2.12.json

So:

  • Call addOne on 2.13 to sidestep this issue (json-iter's macro could do this conditionally on the ambient Scala version); see the sketch after this list.
  • The best-case performance of 2.13.x is better than 2.12.x. I'm not sure why yet: there have been refactorings of the ListBuffer implementation and also improvements to the scalac optimizer that the library itself is compiled with. If I can identify an isolated reason, I might be able to backport it to 2.12.x.
  • JIT inlining of 2.13's Growable.+= leaves something to be desired.
  • Growable.+= is also marked @inline, which scalac -opt:l:inline -opt-inline-from:scala.** can inline through. I see this still leaves a redundant null check compared to a direct call to addOne.
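
A minimal Scala sketch of the first bullet's workaround (illustrative names, mirroring the benchmark loop above):

import scala.collection.mutable.ListBuffer

def intListCreation(size: Int): List[Int] = {
  val buf = new ListBuffer[Int]
  var i = 0
  while (i < size) {
    buf.addOne(i * i) // direct call: skips the Growable.+= forwarder on 2.13
    i += 1
  }
  buf.toList
}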

@plokhotnyuk
Author

@retronym Thanks a lot!

I've added a benchmark for addOne and updated benchmark results in this commit.

Your finding will help to mitigate the issue for OpenJDK, but for GraalVM CE/EE it is not enough. Should I raise an issue in their repo instead?

@retronym
Member

retronym commented Jul 24, 2019

Yes, it would be good to notify the Graal team of the performance difference. Hopefully it's something straightforward for them to fix. /cc @vjovanov

My benchmarks are now cleaned up and highlight the HotSpot/C2 difficulty with +=. I've managed to create a pure-Java replica of the relevant parts of our collections that shows the same slowdown.


[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark                                   (size)   Mode  Cnt         Score         Error  Units
[info] ListsBenchmark.javaListBufferPlusEqAddOne       10  thrpt    4  41074388.920 ±  899881.789  ops/s
[info] ListsBenchmark.scalaListBufferPlusEq            10  thrpt    4  22654974.135 ± 1107078.470  ops/s
[info] ListsBenchmark.scalaListBufferPlusEqAddOne      10  thrpt    4  40713598.860 ± 2130166.999  ops/s
[info] ListsBenchmark.skalaAddOne                      10  thrpt    4  40959424.729 ±  984075.334  ops/s
[info] ListsBenchmark.skalaPlusEq                      10  thrpt    4  20522159.759 ± 1759567.702  ops/s

I'll use this to study what's going on in the C2 JIT some more, to see if this is a bug or if our indirection really is incurring extra runtime cost (i.e. null checks) that the JIT isn't able to elide.

@retronym
Member

retronym commented Jul 24, 2019

Okay, -prof perfasm shows that the slow cases occur when the JIT doesn't fully inline. See the output here, which contains, for example:

[info]   0.94%     │││  0x00007f198342eea4: mov    rsi,rbp
[info]   0.52%     │││  0x00007f198342eea7: call   0x00007f1973152980             ; ImmutableOopMap{rbp=Oop [112]=Oop [120]=Oop [128]=Oop [0]=Oop [16]=Oop [24]=Oop [48]=Oop [56]=NarrowOop }
[info]             │││                                                            ;*invokespecial <init> {reexecute=0 rethrow=0 return_oop=0}
[info]             │││                                                            ; - scala.collection.AbstractSeq::<init>@1 (line 1154)
[info]             │││                                                            ; - scala.collection.immutable.AbstractSeq::<init>@1 (line 159)
[info]             │││                                                            ; - scala.collection.immutable.List::<init>@1 (line 83)
[info]             │││                                                            ; - scala.collection.immutable.$colon$colon::<init>@11 (line 592)
[info]             │││                                                            ; - scala.collection.mutable.ListBuffer::addOne@12 (line 109)
[info]             │││                                                            ; - scala.collection.mutable.ListBuffer::addOne@2 (line 39)
[info]             │││                                                            ; - scala.collection.mutable.Growable::$plus$eq@2 (line 38)
[info]             │││                                                            ; - scala.collection.mutable.Growable::$plus$eq$@2 (line 38)
[info]             │││                                                            ; - scala.collection.mutable.AbstractBuffer::$plus$eq@2 (line 232)
[info]             │││                                                            ; - bench.ListsBenchmark::scalaListBufferPlusEq@21 (line 56)
[info]             │││                                                            ; - bench.generated.ListsBenchmark_scalaListBufferPlusEq_jmhTest::scalaListBufferPlusEq_thrpt_jmhStub@17 (line 119)
[info]             │││                                                            ;   {optimized virtual_call}

The JIT's inlining depth budget is stretched by the combination of a) the extra indirection through += => addOne, and b) the depth of the super constructor call chain of :: (6 deep).

Using a higher budget, like -XX:MaxInlineLevel=18, leads to identical results for += and addOne.

I thought this was the first thing I had tried, without success. But maybe I just looked at the inlining logs in JITWatch and was looking at the wrong call site or something...

@vjovanov

@plokhotnyuk yes, please open an issue here and we will handle it. Thanks for thinking of us!

@plokhotnyuk
Author

@vjovanov thank you for your support; here it is.

@retronym
Member

retronym commented Jul 24, 2019

I think this is all explained now.

Summary:

  • Bumping -XX:MaxInlineLevel=<N> from the default of 9 up to 18 is often beneficial for Scala programs, and in this case is needed to inline the :: super constructor call chain into the benchmark, which now must indirect through += => addOne; see the sbt sketch after this list.
  • Using GraalVM (including via OpenJDK 12+ with -XX:+EnableJVMCI -XX:+UseJVMCICompiler) is usually a way to get more aggressive JIT inlining, but the use of MethodHandle-s in scalac to abstract over Java 8's Unsafe.storeFence and Java 9+'s equivalent VarHandle.releaseFence fails to inline; the next version of Graal will fix it.
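
A hedged build.sbt sketch of applying the flag from the first bullet (assumes a forked run; fork and javaOptions are standard sbt keys):

fork := true
javaOptions += "-XX:MaxInlineLevel=18"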

@Ichoran

Ichoran commented Jul 24, 2019

@retronym - Good detective work! I wonder if we should alter build tools like SBT and Mill to bump MaxInlineLevel up by default? This has bitten us several times now.

@smarter
Member

smarter commented Jul 24, 2019

I wonder if we should alter build tools like SBT and Mill to bump MaxInlineLevel up by default?

Good idea, the default is way too low for Scala.

SethTisue added a commit to SethTisue/Project-Euler that referenced this issue Jul 24, 2019
hoping it might help performance, as per scala/bug#11627 (comment)
retronym added a commit to retronym/scala that referenced this issue Jul 25, 2019
 I'm seeing a 1.4x speedup for:

 ```
  @Benchmark
  public Object scalaListBufferPlusEq_212() {
    ListBuffer<String> buffer = new ListBuffer<>();
    int i = 0;
    while (i < size) {
      buffer.$plus$eq("");
      i += 1;
    }
    return buffer.result();
  }
```

2.12.8

```
[info] Benchmark                                 (size)   Mode  Cnt         Score        Error  Units
[info] ListsBenchmark.scalaListBufferPlusEq_212      10  thrpt    5  25856046.731 ± 1229100.335  ops/s
```

This patch:

```
[info] Benchmark                                 (size)   Mode  Cnt         Score        Error  Units
[info] ListsBenchmark.scalaListBufferPlusEq_212      10  thrpt    5  35848876.003 ± 514044.717  ops/s
```

It is still a little short of the 2.13.x performance, for which I saw:

```
[info] ListsBenchmark.scalaListBufferPlusEq          10  thrpt    5  37174742.519 ± 1304768.628  ops/s
[info] ListsBenchmark.scalaListBufferAddOne          10  thrpt    5  37201063.905 ± 2167146.358  ops/s
```

* the `scalaListBufferPlusEq` result requires `-XX:MaxInlineLevel=18`
(discussion at scala/bug#11627)
@dwijnand
Member

dwijnand commented Jul 25, 2019

What and how severe are the trade-offs of increasing -XX:MaxInlineLevel to 18?

How generally applicable is it to set it to 18?

@lrytz
Member

lrytz commented Jul 25, 2019

I think there's a risk that changing the JVM defaults in the build tool could lead to confusion when diagnosing performance issues, because people probably don't use the build tool to run their apps in production.

@Ichoran

Ichoran commented Jul 25, 2019

@lrytz - I had considered that drawback also, which is why I was wondering whether we should, rather than simply stating that I thought we should. I don't have enough exposure to environments where people deploy artifacts built by Scala build tools to know whether it's overall a plus or a minus to have the build tool automatically select what we would consider best-practice JVM options.

@smarter
Member

smarter commented Jul 25, 2019

sbt already affects performance due to classloading, and the sbt launcher already passes some flags which can also affect it, so there's precedent for this kind of thing.

@retronym
Member

Arguably the best practice is moving quickly to -XX:+EnableJVMCI -XX:+UseJVMCICompiler, which enables the Graal CE compiler bundled inside recent releases of Oracle JDK/OpenJDK.

Another downside of embedding -XX flags in scripts is that it might limit them to the OpenJDK family of JVMs. In practice, OpenJ9 appears to ignore unrecognized -XX options by default, so maybe this isn't a big concern.

@plokhotnyuk
Author

plokhotnyuk commented Jul 27, 2019

@retronym GraalVM CE 19.1.1 ignores the -XX:MaxInlineLevel=18 option and is still faster than OpenJDK 8 with it; please see a comparison chart here.
