Optimize decimal state serializers for small value case #13573
Conversation
Force-pushed ea406fd to 1f42004
@@ -42,7 +42,13 @@ public void serialize(LongDecimalWithOverflowAndLongState state, BlockBuilder out
long overflow = state.getOverflow();
long[] decimal = state.getDecimalArray();
int offset = state.getDecimalArrayOffset();
VARBINARY.writeSlice(out, Slices.wrappedLongArray(count, overflow, decimal[offset], decimal[offset + 1]));
if (count == 1 && overflow == 0 && decimal[offset] == 0) {
`count == 1 && overflow == 0` should be separate from `decimal[offset] == 0`. `decimal[offset] == 0` will often be true when `count == 1 && overflow == 0` isn't.
So you have cases:
- 1 long -> `count == 1 && overflow == 0 && decimal[offset] == 0`
- 2 longs -> `count == 1 && overflow == 0`
- 3 longs -> `decimal[offset] == 0`
- 4 longs -> full case
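The four layouts above can be summarized as a length function. This is a hypothetical sketch; the class, method, and parameter names are illustrative, not taken from the PR:

```java
// Hypothetical sketch of the four layouts listed above; names are
// illustrative, not taken from the PR.
class SerializedLayout
{
    // Number of longs the compact layout needs for a given state
    // (count, overflow, and the high half of the 128-bit decimal).
    static int serializedLongs(long count, long overflow, long decimalHigh)
    {
        boolean trivialCountAndOverflow = count == 1 && overflow == 0;
        if (trivialCountAndOverflow && decimalHigh == 0) {
            return 1; // low half only
        }
        if (trivialCountAndOverflow) {
            return 2; // low + high
        }
        if (decimalHigh == 0) {
            return 3; // low + overflow + count
        }
        return 4; // full state
    }

    public static void main(String[] args)
    {
        System.out.println(serializedLongs(1, 0, 0)); // small value case
    }
}
```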
I think you can even make it branchless:
- append `decimal[offset + 1]` unconditionally, `len += 1`
- append `decimal[offset]` unconditionally, `len = len + (decimal[offset] == 0 ? 0 : 1)`
- append `overflow` and `count`, `len = len + ((count == 1 & overflow == 0) ? 0 : 2)`
This may help in some cases, I will try that. This probably won't help in TPC-H q17/q18 as it's already hitting the condition.
The implementation you suggested was not correct; I did something similar, but I couldn't figure out an efficient way to do branchless deserialization. Let me know if you see one.
}
else {
similar comment as in https://github.com/trinodb/trino/pull/13573/files#r942932796 but in reverse
Made it a draft again as I'm going to benchmark supporting more cases.
Force-pushed da7883e to f213351
Currently, our benchmarking infra is failing, so the latest benchmark results will probably be available next week. cc @sopel39 @gaurav8297
Force-pushed f213351 to db86662
Is `BenchmarkBlockSerde#serializeLongDecimal` or any other JMH benchmark relevant here?
Comments addressed
Force-pushed db86662 to 34ed7ef
@raunaqmorarka I think I don't see many changes in this microbenchmark. The important thing to notice is that the number of groups, and thus random memory load time, is the most important factor here.
Force-pushed 34ed7ef to 87aafb6
buffer[1] = decimalHighBytes;
buffer[2] = overflow;
buffer[3] = count;
// if decimalHighBytes == 0 and count == 1 and overflow == 0 we only write decimalLowBytes (bufferLength = 1)
You're missing one important case: 3 longs -> `decimal[offset] == 0`; see #13573 (comment)
You can implement it by:
// append low
buffer[0] = decimalLowBytes;
offset = 1;
// append high
buffer[offset] = decimalHiBytes;
offset = offset + (decimalHiBytes == 0 ? 0 : 1);
// append overflow, count
buffer[offset] = overflow;
buffer[offset + 1] = count;
offset = offset + ((overflow == 0 & count == 1) ? 0 : 2); // will this be branchless really?
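A self-contained sketch of this branchless-append idea, with a plain `long[]` standing in for the output slice. The class and method names are illustrative, and the PR author notes elsewhere in this thread that the suggestion as given needed fixes, so treat this as a sketch rather than the shipped code:

```java
// Sketch of branchless append: every field is written unconditionally,
// but the running offset only advances past fields that must be kept.
class BranchlessPack
{
    // Writes the state into buffer (must hold at least 4 longs) and returns
    // how many leading longs form the layout; later slots may be clobbered.
    static int pack(long[] buffer, long low, long high, long overflow, long count)
    {
        // append low unconditionally
        buffer[0] = low;
        int offset = 1;
        // append high, but only count it when non-zero
        buffer[offset] = high;
        offset += (high == 0 ? 0 : 1);
        // append overflow and count, only counted when non-trivial
        buffer[offset] = overflow;
        buffer[offset + 1] = count;
        offset += ((overflow == 0 & count == 1) ? 0 : 2);
        return offset;
    }
}
```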
long high = 0;
long overflow = 0;
long count = 1;
if (slice.length() > Long.BYTES) {
branchless version:
int sliceLength = slice.length();
long low = slice.getLong(0);
int highOffset = (sliceLength == 4 * Long.BYTES | sliceLength == 2 * Long.BYTES) ? 1 : 0;
long high = slice.getLong(highOffset * Long.BYTES) & -(long) highOffset;
// similar for count & overflow
Would be great to learn if the JIT actually generates branchless versions.
TPC-H/TPC-DS benchmarks for the latest version:
Force-pushed 29a2fae to 1870888
I implemented the additional case + UT.
For branchless deserialization, I tested multiple ways (including the one offered by @sopel39) and there is little difference in JMH for the mixed-decimals case (a block with both small and large decimals), but the if/switch approach is a little better for both the small and large cases, where branch prediction works efficiently.
The TPC benchmarks look very good with the current code (switch).
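A minimal sketch of what a switch-on-length deserializer for the compact layout could look like. It reads from a `long[]` for illustration (the real code reads a Slice), and the names are illustrative:

```java
// Switch-on-length deserializer sketch for the compact layout; omitted
// fields fall back to their defaults (high = 0, overflow = 0, count = 1).
class SwitchUnpack
{
    // Returns {low, high, overflow, count} decoded from the buffer.
    static long[] unpack(long[] buffer, int lengthInLongs)
    {
        long low = buffer[0];
        long high = 0;
        long overflow = 0;
        long count = 1;
        switch (lengthInLongs) {
            case 1: // small value case: defaults apply
                break;
            case 2:
                high = buffer[1];
                break;
            case 3:
                overflow = buffer[1];
                count = buffer[2];
                break;
            case 4:
                high = buffer[1];
                overflow = buffer[2];
                count = buffer[3];
                break;
            default:
                throw new IllegalArgumentException("unexpected length: " + lengthInLongs);
        }
        return new long[] {low, high, overflow, count};
    }
}
```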
lgtm [Optimize decimal state serializers for small value case](https://github.com/trinodb/trino/pull/13573/commits/18708885459f1921b720148552a14d1ca8ff7c93)
% comments
@@ -53,12 +54,12 @@
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(3)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
@Warmup(iterations = 20, time = 1)
nit: is time needed?
The default is 10s; this means 1s.
That said, looking through the generated assembly code I noticed that 1s may not be enough to trigger C2. Let me check.
Yeah, 1s is not enough to trigger C2. A 5s warmup does the trick.
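Assuming the thread's conclusion, the warmup annotation would end up along these lines (a sketch: the iteration count is from the diff above, the 5s per-iteration time is from this comment):

```java
@Warmup(iterations = 20, time = 5) // 5s per iteration; 1s was not enough to trigger C2
```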
@Benchmark
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public Object benchmarkDeserialize(BenchmarkData data)
Why is it called `benchmarkDeserialize`? I think `processPage` does more than just deserialization.
There is no way to benchmark `deserialize` only, so I do `processPage` with unique group ids, which makes `combine` behavior simple.
Still, for a code reader this is confusing.

> there is no way to benchmark deserialize only

Sure there is: `LongDecimalWithOverflowAndLongStateSerializer` is a class and there is nothing stopping us from benchmarking it directly.
Benchmarking `LongDecimalWithOverflowAndLongStateSerializer#deserialize` in isolation has little value. It's always used in the loop body of `addIntermediate` with `combine`, and it's supposed to be inlined there (that's the reason for code generation).
@Param({"SHORT", "LONG", "MIXED"})
private String decimalSize = "SHORT";

@Param({"true", "false"})
How does group ordering matter?
It doesn't matter for serialization, but for `evaluateIntermediate` the main bottleneck is actually jumping to random memory locations in case groups are located randomly.
For the partial aggregation adaptation case, the group ids are always consecutive.
> For the partial aggregation adaptation case, the group ids are always consecutive.

This has little to do with serialization.

> It doesn't matter for serialization, but for evaluateIntermediate the main bottleneck is actually jumping to random memory locations in case groups are located randomly.

For skip cases, groups are consecutive and the number of them is small, so it probably doesn't matter either.
> This has little to do with serialization.

That's what I said ^^

> For skip-cases groups are consecutive and number of them is small, so it probably doesn't matter too.

For skip cases (I understand you mean PA disabled) yes; for other cases it does matter.
@@ -39,6 +39,7 @@
import org.testng.annotations.Test;
This should be a separate PR.
All changes to `BenchmarkDecimalAggregation`?
@sopel39 ping
Extracted the `BenchmarkDecimalAggregation` changes to #13939
@@ -108,36 +130,66 @@ public Block benchmarkEvaluateFinal(BenchmarkData data)
@Param({"10", "1000"})
private int groupCount = 10;

@Param({"SHORT", "LONG", "MIXED"})
private String decimalSize = "SHORT";
`decimalSize=LONG` won't work with `type=SHORT`
In this case, I use larger long values (close to `Long.MAX_VALUE`).
> in this case, I use larger long values (close to Long.MAX_VALUE)

It doesn't matter from a perf perspective.
It can matter; it doesn't at the moment. I can also skip this case for `type=SHORT` if that bothers you.
values = createValues(functionResolution, type, (builder, value) -> {
boolean writeShort = "SHORT".equals(decimalSize) || ("MIXED".equals(decimalSize) && random.nextBoolean());
if (writeShort) {
builder.writeLong(value);
Both cases are really `long` (`writeLong`).
Yes, but if we use `varint` those are different cases.
> yes, but if we use varint those are different cases.

Here `SHORT` vs `LONG` is mixed with short/long decimal. This is confusing.
replaced SHORT, LONG with SMALL, BIG
Nice gains
Force-pushed 1870888 to ab4315c
comments addressed
reviewed again
Force-pushed ab4315c to b506c13
comments addressed
Force-pushed b506c13 to ac1f861
Given that many decimal aggregations (sum, avg) stay in the `long` range, the aggregation state serializer can be optimized for this case, limiting the number of bytes per position significantly (3-4X) at the cost of a small CPU overhead during serialization and deserialization.
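A back-of-the-envelope illustration of where a reduction of this magnitude comes from, assuming the full state is four longs and the small value case is one long, and ignoring the varbinary length header:

```java
// Rough per-position size comparison: full four-long state vs. the
// one-long small value case (varbinary length headers ignored).
class SizeRatio
{
    public static void main(String[] args)
    {
        int fullBytes = 4 * Long.BYTES;  // count, overflow, decimal high, decimal low
        int smallBytes = 1 * Long.BYTES; // decimal low only
        System.out.println(fullBytes + " -> " + smallBytes + " bytes, " + (fullBytes / smallBytes) + "x");
    }
}
```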
Force-pushed ac1f861 to 923fa7a
Description
Given that many decimal aggregations (sum, avg) stay in the `long` range, the aggregation state serializer can be optimized for this case, limiting the number of bytes per position significantly (3-4X).

TPC-H/TPC-DS benchmarks:
Benchmarks_decimal_aggr_serde_simple_case.pdf
Improvement

Core query engine (`sum`, `avg` aggregation state serialization)

Improve performance of queries with `sum` or `avg` aggregations
aggregationsRelated issues, pull requests, and links
Documentation
(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(X) No release notes entries required.
( ) Release notes entries required with the following suggested text: