-
Notifications
You must be signed in to change notification settings - Fork 841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized lexicographical_partition_ranges (~80% faster) #4575
Conversation
arrow-ord/src/partition.rs
Outdated
partition_point(start + bound / 2, end.min(start + bound + 1), |idx| { | ||
comparator.compare(idx, target) != Ordering::Greater | ||
}) | ||
Ok(out.into_iter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In opted to preserve the existing function signature for now, I can definitely see a future incarnation returning the computed bitmask somehow to allow for more optimal processing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth a ticket (I can also update the docs in #4615)
Some(n) => { | ||
let n1 = n.inner().slice(0, slice_len); | ||
let n2 = n.inner().slice(1, slice_len); | ||
&(&n1 ^ &n2) | &values_ne |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is quite possibly some more clever way to bit transitions from a bitmask, however, this is already likely sufficiently fast as to be irrelevant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Took me a while to follow the logic though, a comment could help. "values are either not-equal (and both non-null) or exactly one value is null"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quite clever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code looks very nice and clever @tustvold well done 👏
cc @crepererum as I think you also used this approach while working on the deduplication logic in IOx.
The only thing I worry about with this change is the relative lack of test coverage. Specifically, I didnt' see a tests for the following cases which seem important:
- partitioning of arrays with
0
and1
elements -- I think given the new slicing we should add a test to verify the correct behavior in these scenarios - arrays where the values in change but the slots are marked null, so they shouldn't be new partitions (to ensure the null mask handling works properly)
- mulit-column partitioning where both arrays had null values and
- arrays with nulls that are greated than 2 in length (aka where there are more than 2 partitions)
arrow-ord/src/partition.rs
Outdated
partition_point(start + bound / 2, end.min(start + bound + 1), |idx| { | ||
comparator.compare(idx, target) != Ordering::Greater | ||
}) | ||
Ok(out.into_iter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth a ticket (I can also update the docs in #4615)
} | ||
|
||
/// Returns the number of partitions | ||
pub fn len(&self) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will allow a very quick check for the case of all partitions have a length of 1, which may allow for a more efficient special case. In the case of IOx this will allow for it to avoid calling into the dedup logic at all
) -> Result<impl Iterator<Item = Range<usize>> + '_, ArrowError> { | ||
LexicographicalPartitionIterator::try_new(columns) | ||
} | ||
pub fn partition(columns: &[ArrayRef]) -> Result<Partitions, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API no longer takes SortColumn as it doesn't actually matter what the sort order is, just that the data is sorted
previous_partition_point: usize, | ||
partition_point: usize, | ||
} | ||
match num_rows { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Your testing suggestions were on the money 👍
I have updated this PR with more tests and a cleaner API, PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some(n) => { | ||
let n1 = n.inner().slice(0, slice_len); | ||
let n2 = n.inner().slice(1, slice_len); | ||
&(&n1 ^ &n2) | &values_ne |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quite clever
}, | ||
Arc::new(Int64Array::new(vec![1; 9].into(), None)) as _, | ||
Arc::new(Int64Array::new( | ||
vec![1, 1, 2, 2, 2, 3, 3, 3, 3].into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would help to also have a test like this where the value in the null would actually be wrong / different
vec![1, 1, 2, 2, 2, 3, 3, 3, 3].into(), | |
vec![1, 1, 2, 2, 2, 3, 0, 3, 3].into(), |
However I also made that change and it passed so 👍
/// | ||
/// Consecutive ranges will be contiguous: i.e [`(a, b)` and `(b, c)`], and | ||
/// `start = 0` and `end = self.len()` for the first and last range respectively | ||
pub fn ranges(&self) -> Vec<Range<usize>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential difference with this implementation that I realized compared to main is that it requires memory (a Vec) to store the partition ranges where the previous implementation just iterated over them. I think it would be possible to implement an iterator for this as well to avoid that regression. I'll see if I can make a PR to do so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I messed around with trying to make this work but got stymied by the borrow checker -- specifically that the BitIndexIterator
had a reference so I couldn't embed it in another iterator.
Which issue does this PR close?
Closes #4614
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?