Performance regression in Scala 2.13 for creation of lists using mutable.ListBuffer
#11627
I suppose that the root cause of the slowdown is the added memory fence. @retronym wdyt?
If it is the fence, the issue is this. Suppose you map a 2-element list. The map operation, since it can't know how long the list is, uses a ListBuffer, which walks through in order, first producing a list containing just the first mapped element; then it walks to the next one and mutates the next pointer to append the second mapped element. It then returns the head of the list, and everything is cool because nobody else has access to mutate the next pointer. Unfortunately, that is only true within the same thread. If you hand the list to another thread without any synchronization, that thread occasionally sees the partially built list instead. This is just terrible. The whole rhetoric around immutable collections is that they're great for concurrent use because they don't change, and here you have a difficult-to-reproduce, stochastically appearing bug that mutates what you can see about your supposedly immutable list. Fortunately, this is difficult to trigger, but we absolutely have to keep the fix for this. So, what is the workaround? Well, if you know the sizes, you can always build the list up manually using ::. Anyway, this is not an easy problem, and you very much can't just say, "Well, it's immutable, so it's safe."
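The build-by-mutation-then-freeze mechanism described above can be sketched in plain Java. This is a hypothetical `MiniListBuffer`, not the actual library code; Scala 2.13's fix emits a release fence at the freeze point (via `scala.runtime.Statics.releaseFence`), which is approximated here with `VarHandle.releaseFence()`:

```java
import java.lang.invoke.VarHandle;

// A cons cell whose tail is mutable only while the buffer owns it.
final class Cons {
    final int head;
    Cons tail;   // mutated during construction, then frozen

    Cons(int head, Cons tail) { this.head = head; this.tail = tail; }
}

final class MiniListBuffer {
    private Cons first, last;

    void add(int value) {
        Cons cell = new Cons(value, null);
        if (last == null) first = cell;
        else last.tail = cell;   // mutate the previous cell's next pointer
        last = cell;
    }

    Cons toList() {
        // The release fence orders all the tail writes above before any
        // later publication of `first`, so a reader that sees the head
        // also sees the fully linked tail.
        VarHandle.releaseFence();
        return first;
    }
}
```

Without the fence (and without any other synchronization), a thread that receives the returned head may observe a cell whose `tail` write is not yet visible, which is exactly the partially built list described above.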
@Ichoran could you share the source of a test which reproduces the issue explained above? Which data structure or API is used to pass an instance of a recently created list in it? Usually, any concurrency-aware implementation which is going to pass data between threads uses fences to do it safely, so there is no need to do it prematurely for the construction of each list instance...
I don't remember a test case and I don't have time to write one. I'd guess that something like this would work, but I don't have time to test/optimize it:
You'd expect to see things skipped, but the list should always have two elements, and the values should always be 1 apart. (When I run this I don't see errors, but the errors were never found consistently anyway. I'm not sure where the old reports are.)
@Ichoran this test will always pass because access to volatile vars on the JVM uses store/load fences. BTW, FYI: https://dzone.com/articles/cpu-cache-flushing-fallacy: "The cache sub-system is considered the "source of truth" for mainstream systems. If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory. This style of memory management is known as write-back, whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place."
Yeah, so drop the @volatile and maybe make other changes. Like I said, I don't have time to replicate the error. You're right that I wasn't accurate with regard to cache; it's the register/cache boundary, not the cache/main-memory boundary, that can lead to inconsistent views of things. I think the result is architecture-dependent, too. In any case, the JMM doesn't promise that you can see the writes without a happens-before relationship. So I'm not entirely surprised that I failed once again to see anything.
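A sketch of the kind of stress test being discussed, with the @volatile dropped (the class and field names here are illustrative, not from the thread). Like the original attempts, it may never actually observe a violation, especially on strongly ordered hardware such as x86:

```java
// One thread repeatedly builds a two-element list, fully linking it
// before publishing the head through a plain (non-volatile) field.
// Without a fence or happens-before edge, a reader is permitted to see
// the head while the write to its next pointer is still invisible.
final class Node {
    final int value;
    Node next;   // plain field, linked before publication

    Node(int value) { this.value = value; }
}

final class RacyPublish {
    static Node shared;   // deliberately not volatile: unsafe publication

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                Node n = shared;
                if (n != null && n.next == null) {
                    // The writer always publishes two linked cells, so
                    // seeing a lone cell means a partially built list.
                    System.out.println("observed partially built list");
                }
            }
        });
        reader.start();
        for (int i = 0; i < 1_000_000; i++) {
            Node head = new Node(i);
            head.next = new Node(i + 1);   // values always 1 apart
            shared = head;                 // racy publication
        }
        reader.join();
    }
}
```

From the writer's own perspective the invariant always holds (two elements, values 1 apart); only a racing reader can see it broken, which is why the bug is so hard to reproduce.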
@Ichoran If you drop the volatile annotation you will lose the fences and, possibly, for some combination of JVM/CPU, you will get the kind of data races which you are trying to reproduce. But that should not be a reason for adding fences in Scala's immutable collections that don't grow on demand (like a stream). Please see the charts below with results of benchmarks where a JSON array of 128 boolean values was parsed to a List.
@plokhotnyuk, @Ichoran this is an example of the issue described above
@jsfwa in that gist the root problem is not in List itself. In the 1st sample it is just missing fences when accessing the shared field. In the 2nd sample:

```scala
class Message(val text: String) {
  private[this] var _words: List[String] = _

  def words: List[String] = {
    var res = _words
    if (res == null) {
      res = text.split("""\s""").toList
      _words = res
    }
    res
  }
}
```

Why not just use a lazy val?
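The Scala snippet above is the "racy single-check" caching idiom. A hypothetical Java analogue (names are mine, not from the gist) shows when that idiom is actually safe: only when the cached value is safely publishable despite the race, e.g. a primitive, much like String's internal hashCode cache. Caching a pre-2.13 List this way was not safe, because the list's cells were themselves mutated during construction:

```java
// Racy single-check caching of a primitive: at worst, two threads both
// compute the value and one write wins, but no thread can ever observe
// a torn or partially built result, because an int read/write is atomic.
final class Message {
    private final String text;
    private int wordCount = -1;   // -1 means "not yet computed"

    Message(String text) { this.text = text; }

    int wordCount() {
        int res = wordCount;
        if (res < 0) {
            String t = text.trim();
            res = t.isEmpty() ? 0 : t.split("\\s+").length;
            wordCount = res;   // plain write: a benign race
        }
        return res;
    }
}
```

With a mutable-under-construction List in place of the int, a racing reader could cache-hit on a reference to a half-linked list, which is exactly the hazard the added fences close.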
@plokhotnyuk I totally agree that everyone should use fences and other tricks with concurrent writers. The mentioned gist is part of this discussion, since the tail of a freshly constructed List is mutated while it is being built. Sadly the cost is too high; I hope they will revert the changes.
In this PR I have tried to mitigate the issue by appending lists manually. For JDK 11.0.3 it works even faster than with ListBuffer.
@plokhotnyuk What CPU architecture are you testing on? I assume x86 but want to be sure.
We also have immutable vectors, sets and maps that have internally mutable fields. Do we fence these also? The issue as I see it is: say we have an immutable data structure that one thread constructs and publishes through a plain field, and another thread reads. The demand is that the reader can never observe a partially initialized structure. I believe this is actually a lot to ask. On some earlier architectures, even a Long made up from two words could be split so that a reading thread could see only one half of the store. That's gone for good with 64-bit architectures. But to extend this guarantee to all immutable data structures seems to impose an undue burden on the implementation. In my mind, if a fix to this problem causes any sort of slowdown it's unacceptable, and we should instead just state that immutability as a data structure does not imply safe publication. That is perfectly acceptable in my mind, since concurrent architectures that rely only on this sort of low-level safe-publication guarantee, without resorting to volatiles, monitors or atomics, are super fragile anyway. There is also an issue with immutable arrays, which will be part of Scala 3. Immutable arrays use just Java arrays under the hood. Do we need to fence all operations on immutable arrays also, in order to ensure safe publication?
An argument to treat lists differently from other immutable collections could be: lists are morally ADTs, i.e. they can be thought of as two cases, a Nil case and a Cons case whose head and tail fields are both immutable. If lists really were ADTs like that, they would ensure safe publication, since all fields are immutable. But then it also looks like they could stack overflow when a long list is mapped. And that is the point: lists are not an ADT like the one I have given. They can't be, since we do promise that mapping over a long list will not overflow the stack.
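The ADT being contrasted here can be sketched in Java (a hypothetical analogue, not the Scala library's actual List). With all fields final, the JMM's final-field semantics guarantee safe publication even through a data race, but the natural map is structurally recursive and burns one stack frame per element:

```java
import java.util.function.IntUnaryOperator;

// A strictly immutable cons list: every field final, so a racing reader
// can never observe a partially built cell.
abstract class IntList {}

final class Nil extends IntList {
    static final Nil NIL = new Nil();
}

final class ConsCell extends IntList {
    final int head;
    final IntList tail;
    ConsCell(int head, IntList tail) { this.head = head; this.tail = tail; }
}

final class IntLists {
    // Recursive map: correct, but not stack-safe for long lists. This is
    // the trade-off: Scala's real List avoids the recursion by mutating
    // freshly allocated cells during construction instead.
    static IntList map(IntList xs, IntUnaryOperator f) {
        if (xs instanceof ConsCell) {
            ConsCell c = (ConsCell) xs;
            return new ConsCell(f.applyAsInt(c.head), map(c.tail, f));
        }
        return xs;
    }
}
```

Mapping a list of millions of elements with this version would throw a StackOverflowError, which is why the library trades the final-field publication guarantee for an iterative, mutate-then-freeze construction.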
On Thu, 18 Jul 2019 at 18:54, odersky ***@***.***> wrote:
We also have immutable vectors, sets and maps that have internally mutable
fields. Do we fence these also?
Yep, we also added the fences in HashMap/Set and Vector.
I plan to look at this microbenchmark in detail next week
According to Aleksey's research, fences can be implemented without such a dramatic impact, especially on x86.
I still have not understood the rationale for why we are doing this. If previously people thought it was OK that a double could be split, why go all out to ensure safe publication of immutable data structures? What's the use case where this matters?
My belief was that adding the fence was sufficiently cheap that it was worth doing. I'm not yet persuaded that the fence addition is the actual cause of the performance change. I'm studying how variations of the implementation change performance in https://github.com/retronym/sbt-jmh-listbuffer

So far I have found that both the 2.12 and 2.13 library versions are significantly slower than an analogous pure-Java implementation (~0.6x to 0.7x). In that pure-Java implementation, the performance change from adding the fence is negligible. Removing all parents of List and ListBuffer from the Scala versions seems to restore performance. So I believe that the JIT is doing a sub-optimal job of inlining the somewhat elaborate call tree of (empty!) class- and trait-constructors. JITWatch reports that the slow benchmarks actually do fully inline, but the generated code still ends up longer/messier. I'm now using
@retronym - FWIW, this isn't a new observation, either. There was some discussion/demonstration around 2.10, I think it was. I'm not aware of anyone looking into the cause in enough depth to get either a coherent explanation or something actionable. It wouldn't surprise me if it was some arbitrary JVM threshold that optimizes N but not N+1 empty constructors. You might also try adding a chain of (not wholly superfluous) superclasses and/or traits above the detached version.
@retronym the initial benchmark uses the original Scala library and tests short lists (size = 1, 10, 100), while your benchmarks (from the https://github.com/retronym/sbt-jmh-listbuffer repo) don't use the original collections. Also, I have reimplemented the original benchmark in Java here for both Java's and Scala's collections. Finally, I have published it in a separate repo here.
@plokhotnyuk Thanks, that's useful. It turns out that the material difference is that in 2.13 the call to `+=` forwards to `addOne`, adding a level of indirection:

```java
@Benchmark
public List<Boolean> scala213ListOfBooleansPlusEq() {
    ListBuffer<Boolean> listBuffer = new ListBuffer<>();
    int l = size;
    int i = 0;
    while (i < l) {
        listBuffer.$plus$eq((i & 1) == 0);
        i++;
    }
    return listBuffer.toList();
}

@Benchmark
public List<Boolean> scala213ListOfBooleansAddOne() {
    ListBuffer<Boolean> listBuffer = new ListBuffer<>();
    int l = size;
    int i = 0;
    while (i < l) {
        listBuffer.addOne((i & 1) == 0);
        i++;
    }
    return listBuffer.toList();
}
```
Compared to the 2.12 baseline:
So:
@retronym Thanks a lot! I've added a benchmark for it. Your finding will help to mitigate the issue for OpenJDK, but for GraalVM CE/EE it is not enough. Should I raise an issue in their repo instead?
Yes, it would be good to notify the Graal team of the performance difference. Hopefully it's something straightforward for them to fix. /cc @vjovanov My benchmarks are now cleaned up and highlight the HotSpot/C2 difficulty with the extra indirection.

I'll use this to study what's going on in the C2 JIT some more, to see if this is a bug or if our indirection really incurs extra runtime cost (i.e. null checks) that the JIT isn't able to elide.
Okay, the JIT's inlining depth budget is stretched by the combination of a) the extra indirection through the `+=` forwarder and b) the depth of the existing call tree. Using a higher budget, like `-XX:MaxInlineLevel=18`, restores the performance. I thought that this was the first thing I tried, without success; maybe I just looked at the inlining logs in JITWatch and was looking at the wrong call site or something.
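For reference, raising the inlining budget for a benchmark run looks like this. A sketch only: the jar name and benchmark pattern are placeholders, while the `-XX` flags are real HotSpot options (`-XX:+PrintInlining` requires unlocking diagnostic options):

```shell
# Raise HotSpot's inline-depth budget for a JMH run; printing inlining
# decisions shows which call sites now inline that previously did not.
java -XX:MaxInlineLevel=18 \
     -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining \
     -jar benchmarks.jar 'ListsBenchmark.*'
```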
@plokhotnyuk yes, please open an issue here and we will handle it. Thanks for thinking of us!
@vjovanov thank you for your support; here it is
I think this is all explained now. Summary:
@retronym - Good detective work! I wonder if we should alter build tools like sbt and Mill to bump MaxInlineLevel up by default? This has bitten us several times now.
Good idea, the default is way too low for Scala.
hoping it might help performance, as per scala/bug#11627 (comment)
I'm seeing a 1.4x speedup for:

```java
@Benchmark
public Object scalaListBufferPlusEq_212() {
    ListBuffer<String> buffer = new ListBuffer<>();
    int i = 0;
    while (i < size) {
        buffer.$plus$eq("");
        i += 1;
    }
    return buffer.result();
}
```

2.12.8:

```
[info] Benchmark                                   (size)   Mode  Cnt         Score         Error  Units
[info] ListsBenchmark.scalaListBufferPlusEq_212        10  thrpt    5  25856046.731 ± 1229100.335  ops/s
```

This patch:

```
[info] Benchmark                                   (size)   Mode  Cnt         Score        Error  Units
[info] ListsBenchmark.scalaListBufferPlusEq_212        10  thrpt    5  35848876.003 ± 514044.717  ops/s
```

It is still a little short of the 2.13.x performance, in which I saw:

```
[info] ListsBenchmark.scalaListBufferPlusEq        10  thrpt    5  37174742.519 ± 1304768.628  ops/s
[info] ListsBenchmark.scalaListBufferAddOne        10  thrpt    5  37201063.905 ± 2167146.358  ops/s
```

* the `scalaListBufferPlusEq` result requires `-XX:MaxInlineLevel=18`, discussion at scala/bug#11627
What are the trade-offs of increasing MaxInlineLevel, and how severe are they? How generally applicable is it to set it to 18?
I think there's a risk that changing the JVM defaults in the build tool could lead to confusion when diagnosing performance issues, because people probably don't use the build tool to run their apps in production.
@lrytz - I had considered that drawback also, which is why I was wondering whether we should, rather than simply stating that I thought we should. I don't have enough exposure to environments where people deploy artifacts built by Scala build tools to know whether it's overall a plus or a minus to have the build tool automatically select what we would consider best-practice JVM options.
sbt already affects performance due to classloading, and the sbt launcher already passes some flags which can also affect it, so there's precedent for this kind of thing.
Arguably the best practice is moving quickly to newer JVMs. Another downside of embedding the flag is that it is specific to HotSpot.
JVM configuration for Scala Functional Programming: `-XX:MaxInlineLevel=18 -XX:MaxInlineSize=270 -XX:MaxTrivialSize=12` (https://twitter.com/leifwickland/status/1179419045055086595); `-XX:MaxInlineLevel=18` per scala/bug#11627 (comment); Tweak the Java JIT: https://scalacenter.github.io/bloop/docs/performance-guide#tweak-the-java-jit
For small lists there can be a slowdown of ~1.5x with OpenJDK and ~7x with GraalVM.

Most parsers which use it to bind parsed data to List or Seq (because List is the default implementation of Seq) from text or binary messages are affected.

Code of the benchmark to reproduce:
Command to run:
Results for Scala 2.13.0 with OpenJDK 11.0.3:
Results for Scala 2.12.8 with OpenJDK 11.0.3:
Results for Scala 2.13.0 with GraalVM CE 19.1:
Results for Scala 2.12.8 with GraalVM CE 19.1: