Add FixedWidthRows for the arrow-row #4524
With this PR as a dependency, the TPC-H 1 GB benchmark results for DataFusion (https://github.com/yahoNanJing/arrow-datafusion/tree/EBAY-KYLIN-4003-5) are as follows:
|
As a next step, I will do some more optimization on the DataFusion side to push the fixed row value into the RawTable value if the row width is less than a threshold. |
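For readers following along, the idea of embedding a small fixed-width row directly in the hash table entry can be sketched roughly as below. This is a minimal illustration using only the standard library, not the code from this PR or from DataFusion: `INLINE_THRESHOLD`, `Storage`, `choose_storage`, and `InlineGroups` are hypothetical names, the threshold value is arbitrary, and `std::collections::HashMap` stands in for the RawTable.

```rust
use std::collections::HashMap;

/// Hypothetical threshold (in bytes): narrower rows are stored inline.
const INLINE_THRESHOLD: usize = 16;

/// Where a hash table entry keeps the group value (illustrative only).
#[derive(Debug, PartialEq)]
enum Storage {
    /// The row bytes live inside the table entry itself.
    Inline,
    /// The entry holds an index into a separate rows buffer.
    Indirect,
}

/// Decide the storage strategy from the row width, if the width is fixed.
/// `None` models a variable-length row format.
fn choose_storage(row_width: Option<usize>) -> Storage {
    match row_width {
        Some(w) if w < INLINE_THRESHOLD => Storage::Inline,
        _ => Storage::Indirect,
    }
}

/// Group table whose keys are the row bytes themselves, so a lookup
/// compares keys without dereferencing into a side buffer.
struct InlineGroups {
    map: HashMap<[u8; INLINE_THRESHOLD], usize>,
}

impl InlineGroups {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Return the group index for `row`, assigning the next index if the
    /// row has not been seen before. All rows are assumed to share one
    /// width, so zero padding cannot make two distinct rows collide.
    fn group_index(&mut self, row: &[u8]) -> usize {
        assert!(row.len() <= INLINE_THRESHOLD, "row too wide to inline");
        let mut key = [0u8; INLINE_THRESHOLD];
        key[..row.len()].copy_from_slice(row);
        let next = self.map.len();
        *self.map.entry(key).or_insert(next)
    }
}
```

The decision can only be made up front when the row width is known and constant, which is why a variable-length format always falls back to the indirect path in this sketch.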
I think I would like to hold off on merging this for a bit, I want to do some other experiments first before committing to this approach |
@yahoNanJing can you please explain the rationale for introducing a fixed-width row format alongside the variable-length one? |
After embedding the value into the RawTable, TPC-H q17 shows a performance gain of around 30% compared to the main branch. |
More context for anyone else following along here can be found in apache/datafusion#6969. I would be very interested in trying to specialize the row_hash grouping operator to avoid the Row format entirely (and use a native
Hi @alamb, there are three reasons that I use the fixed width row:
|
Just to be clear, what I was imagining for the group storage is not to change the contents of the But instead of storing group_values using the arrow
We would instead store the group values using a native type like
I agree the null value would need some special handling, but since this would only be for single columns (where there can be at most one null value) I think we could figure out some way to handle it |
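As a rough illustration of the single-column idea described above, the sketch below stores group values as a native type and gives the null value its own dedicated slot. It is only a sketch under stated assumptions: `i64` is picked arbitrarily as the native type, `PrimitiveGroups` and `intern` are hypothetical names, and `std::collections::HashMap` stands in for the real hash table.

```rust
use std::collections::HashMap;

/// Group storage for a single nullable i64 column (hypothetical sketch).
struct PrimitiveGroups {
    /// Maps a non-null value to its group index.
    map: HashMap<i64, usize>,
    /// Group values in group-index order; `None` marks the null group.
    values: Vec<Option<i64>>,
    /// Index of the null group, if a null has been seen. With a single
    /// column there can be at most one such group.
    null_group: Option<usize>,
}

impl PrimitiveGroups {
    fn new() -> Self {
        Self {
            map: HashMap::new(),
            values: Vec::new(),
            null_group: None,
        }
    }

    /// Return the group index for `value`, creating a new group on first sight.
    fn intern(&mut self, value: Option<i64>) -> usize {
        match value {
            Some(v) => {
                if let Some(&idx) = self.map.get(&v) {
                    return idx;
                }
                let idx = self.values.len();
                self.values.push(Some(v));
                self.map.insert(v, idx);
                idx
            }
            None => {
                if let Some(idx) = self.null_group {
                    return idx;
                }
                let idx = self.values.len();
                self.values.push(None);
                self.null_group = Some(idx);
                idx
            }
        }
    }
}
```

Keeping the null group outside the map avoids reserving a sentinel value in the key domain, at the cost of one extra branch per lookup.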
At my side, the bottleneck is the I don't know whether it will bring much benefit by just changing the Row to the Vec and I'm looking forward to your benchmark results 😄 |
Ah, that is a good insight -- the change yahoNanJing/arrow-datafusion@a18ac07 embeds the group value directly into the table 🤔 Another experiment we could try would be to use unsafe accesses to the group_values table / rows 🤔 Update: I am running this experiment with apache/datafusion#7010 I will also try and compare it to @yahoNanJing 's approach |
Here are my benchmark results: apache/datafusion#6969 (comment) (TLDR they are mixed) |
I hope to have a PR up tomorrow that will special case single columns, and will be able to get some numbers for you. I am very keen to avoid this complexity if we can get the same or better performance in a simpler way |
@tustvold, the simpler the better. I agree we should adopt a simpler approach if it achieves the same or a similar performance gain. |
As @tustvold mentions, using native types for single columns should be significantly faster than any row format (though we need to prove that). I do wonder if fixed width formats would be useful for multi-column equality comparisons (again we would need to show this via benchmarks, etc). I wonder if one of the hesitations is that the name "FixedWidthRows" describes how this is implemented rather than its major use case. Maybe we could name it something like "EqualityRows", given it focuses just on equality rather than ordering |
Do we have any benchmarks that are grouping by multiple primitive columns? Whilst I appreciate that optimising for benchmarks is a form of observability bias, it can be a useful way to focus our efforts where they will have the most impact.
My initial experiments have
I have not been able to reproduce these results using EBAY-KYLIN-4003-5
|
Marking as a draft whilst we work out how best to proceed here |
Closing this as I believe the consensus reached was not to proceed with this, feel free to reopen if I am mistaken |
Which issue does this PR close?
Closes #4523.
Rationale for this change
There are mainly two reasons to introduce the FixedWidthRows:
1. There is no need to go through the offsets vector to get the row value, which makes getting the row value much more efficient. This is also borne out by the benchmark results, especially for TPC-H q17.
2. By adding the row_width to the RowConverter, it will be much easier for the DataFusion side to decide whether to embed the group value or not. The benchmark results also show that when the group values have a small enough fixed length, embedding the value into the RawTable brings great benefits.
What changes are included in this PR?
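To illustrate the offsets point from the rationale above, here is a minimal standard-library sketch contrasting row access in a variable-length layout with a fixed-width one. It is not the implementation in this PR: `VariableRows` and `FixedRows` are hypothetical types, and the real arrow-row encoding is not reproduced here.

```rust
/// Variable-length rows: one shared byte buffer plus an offsets vector.
/// Row `i` occupies `buffer[offsets[i]..offsets[i + 1]]`.
struct VariableRows {
    buffer: Vec<u8>,
    offsets: Vec<usize>,
}

impl VariableRows {
    fn new() -> Self {
        // `offsets` always starts with 0, so it holds num_rows + 1 entries.
        Self { buffer: Vec::new(), offsets: vec![0] }
    }

    fn push(&mut self, row: &[u8]) {
        self.buffer.extend_from_slice(row);
        self.offsets.push(self.buffer.len());
    }

    fn row(&self, i: usize) -> &[u8] {
        // Reads two entries of `offsets` before touching the row bytes.
        &self.buffer[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Fixed-width rows: every row has the same length, so the position of
/// row `i` is computed as `i * row_width` without any offsets vector.
struct FixedRows {
    buffer: Vec<u8>,
    row_width: usize,
}

impl FixedRows {
    fn new(row_width: usize) -> Self {
        assert!(row_width > 0, "row_width must be non-zero");
        Self { buffer: Vec::new(), row_width }
    }

    fn push(&mut self, row: &[u8]) {
        assert_eq!(row.len(), self.row_width, "row has the wrong width");
        self.buffer.extend_from_slice(row);
    }

    fn num_rows(&self) -> usize {
        self.buffer.len() / self.row_width
    }

    fn row(&self, i: usize) -> &[u8] {
        let start = i * self.row_width;
        &self.buffer[start..start + self.row_width]
    }
}
```

Both layouts return the same bytes for the same input rows; the fixed-width one simply replaces the offsets lookups with a multiplication, and it also makes the row width available to callers as a single value.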
Are there any user-facing changes?