-
Notifications
You must be signed in to change notification settings - Fork 820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to FlightDataEncoder
to always resend batch dictionaries
#4896
Conversation
Thank you for this PR @alexwilcoxson-rel -- I hope to find time to review it properly in the next few days |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alexwilcoxson-rel -- I think this is a really nice improvement and it could be merged as is.
I was thinking to a future when we might have proper support for resending dictionaries and trying to minimize the API impact as well as adding a nice place for more documentation. What do you think about using an enum
instead of a bool
like this:
So instead of
pub struct FlightDataEncoder {
...
send_dictionaries: bool,
...
}
Something like
pub struct FlightDataEncoder {
...
dictionary_handling: DictionaryHandling,
...
}
/// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
pub enum DictionaryHandling {
/// Expands to the underlying type (default). This likely sends more data over the network
/// but requires less memory (dictionaries are not tracked) and is more compatible
/// with other arrow flight client implementations that may not support `DictionaryEncoding`
/// see [`hydrate_dictionary`] for more details.
Hydrate,
/// Send dictionary FlightData with every RecordBatch that contains a [`DictionaryArray`].
/// See [`Self::Hydrate`] for more tradeoffs. No attempt is made to skip sending the same (logical)
/// dictionary values twice.
Resend,
}
@@ -520,14 +542,92 @@ mod tests { | |||
); | |||
} | |||
|
|||
#[tokio::test] | |||
async fn test_dictionary_hydration() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
let arr_one: Arc<DictionaryArray<UInt16Type>> = | ||
Arc::new(vec!["a", "a", "b"].into_iter().collect()); | ||
let arr_two: Arc<DictionaryArray<UInt16Type>> = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
BTW it looks like some documentation also needs to be updated, like arrow-rs/arrow-flight/src/encode.rs Line 33 in 23db567
I can handle doing so, but wanted to mention it in case @alexwilcoxson-rel was going to change this one more |
Updated these docs |
Updated to use Enum per @alamb feedback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alexwilcoxson-rel -- this looks great
The CI seems to be failing the doc checks. I'll try and fix that |
Here is a proposed PR to your branch for improved documentation relativityone#1 🙏 |
Improve docs for sending flight dictionaries
LGTM, merged! please rerun CI when you can. |
FlightDataEncoder
to always resend batch dictionaries
Thanks again @alexwilcoxson-rel |
Which issue does this PR close?
Closes #4895
Rationale for this change
see #4895
What changes are included in this PR?
Adds
with_send_dictionaries
toFlightDataEncoderBuilder
to opt in to always sending dictionary messages with each record batch encoded.Are there any user-facing changes?
with_send_dictionaries
added toFlightDataEncoderBuilder
andFlightDataEncoder
. Doc strings updated.