feat: add helpers for users with asynchronous catalogs #13800
base: main
What's the cache-then-plan approach? (The linked page doesn't include the word "cache".)
datafusion/catalog/src/async.rs
Outdated
/// A schema provider that looks up tables in a cache
///
/// This is created by the [`AsyncSchemaProvider::resolve`] method
Does that mean the code is auto-generated?
No. I have changed the comment to "Instances are created by...". Is this clearer?
Err(DataFusionError::Execution(format!("Attempt to deregister table '{name}' with ResolvedSchemaProvider which is not supported")))
}

fn table_exist(&self, name: &str) -> bool {
Suggested change:
- fn table_exist(&self, name: &str) -> bool {
+ fn table_exists(&self, name: &str) -> bool {
This method name is defined by the SchemaProvider trait. Renaming it would be a breaking change and I don't think it is justified.
datafusion/catalog/src/async.rs
Outdated
let Some(schema) = schema else { continue };

if !schema.cached_tables.contains_key(reference.table()) {
    let resolved_table =
Can this part be factored out into a separate helper method?
}

#[tokio::test]
async fn test_defaults() {
If we use cached tables, should we have tests for that? I mean that the cached tables should reflect the most recent catalog state: if a table is added, modified, or dropped, the cache should reflect it.
Discussed below
Perhaps I should avoid using the word cache. This is not a long-lived multi-query cache. This is a single-query cache meant to be thrown away after the query has completed. It is a very short-lived cache that is designed to avoid repeated lookups during multiple planning passes. Every query is still a "cold" query. It would be possible to create another longer-lived caching layer on top of this but I am not trying to solve that problem at the moment.
There is no concern for cache eviction / staleness here because this cache should not be kept longer than a single query. There is some possibility for a catalog change to happen in between reference lookups (
@westonpace So we agree on the need for this.
datafusion/catalog/src/async.rs
Outdated
Ok(self.cached_tables.get(name).cloned())
}

#[allow(unused_variables)]
Please avoid #[allow(...)] attributes (and if one is really needed, add a code comment explaining why).
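One common way to drop the attribute is to prefix the unused parameter with an underscore. The sketch below uses a hypothetical, simplified `Provider` trait and `ReadOnlyProvider` type for illustration only; they are not the actual `SchemaProvider` API or the code in this PR.

```rust
/// Hypothetical, simplified stand-in for a provider trait (not DataFusion's API).
trait Provider {
    fn register_table(&self, name: String, table: String) -> Result<(), String>;
}

/// Hypothetical provider that rejects all registrations.
struct ReadOnlyProvider;

impl Provider for ReadOnlyProvider {
    // `_table` is intentionally unused: the leading underscore silences the
    // `unused_variables` lint for this one parameter, so no attribute is needed.
    fn register_table(&self, name: String, _table: String) -> Result<(), String> {
        Err(format!(
            "Attempt to register table '{name}' which is not supported"
        ))
    }
}
```

Compared with a blanket attribute, the underscore keeps the lint active for every other variable in the method.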
I'm not sure I understand what you mean by I would personally expect a planner to cache lookups in the same way I expect a compiler to optimize away repeated calls to a constant method. Though I understand this is not how the synchronous planner works today. This is an optimization that benefits all engines and should work equally for all so it seems useful for the
Which issue does this PR close?
Closes #10339.
Rationale for this change
As discussed in #13582 we do not actually want to make the schema providers asynchronous (the downstream changes are significant). Instead a cache-then-plan approach was outlined in #13714. This PR adds helpers which make it easier for users to follow the cache-then-plan approach.
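As a rough illustration of the cache-then-plan idea, here is a minimal, std-only sketch. It assumes nothing about the real helpers: `RemoteCatalog`, `ResolvedTables`, `resolve`, `plan`, and `block_on` are all hypothetical names invented for this example, and table schemas are modeled as plain strings. The asynchronous step looks up each referenced table at most once and stores the results; the synchronous planning passes then read only from that per-query snapshot.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a remote catalog whose lookups are asynchronous.
struct RemoteCatalog {
    tables: HashMap<String, String>, // table name -> schema description
    lookups: Cell<usize>,            // counts simulated remote round trips
}

impl RemoteCatalog {
    async fn table(&self, name: &str) -> Option<String> {
        self.lookups.set(self.lookups.get() + 1);
        self.tables.get(name).cloned()
    }
}

/// Synchronous, per-query snapshot of the tables one query references.
struct ResolvedTables {
    cached: HashMap<String, String>,
}

impl ResolvedTables {
    fn table(&self, name: &str) -> Option<&String> {
        self.cached.get(name)
    }
}

/// Step 1 (async): look up each referenced table at most once.
async fn resolve(catalog: &RemoteCatalog, references: &[&str]) -> ResolvedTables {
    let mut cached = HashMap::new();
    for &name in references {
        if !cached.contains_key(name) {
            if let Some(schema) = catalog.table(name).await {
                cached.insert(name.to_string(), schema);
            }
        }
    }
    ResolvedTables { cached }
}

/// Step 2 (sync): every planning pass reads only from the snapshot.
fn plan(resolved: &ResolvedTables, references: &[&str], passes: usize) -> Vec<String> {
    let mut found = Vec::new();
    for _ in 0..passes {
        found.clear();
        for &name in references {
            if let Some(schema) = resolved.table(name) {
                found.push(format!("{name}: {schema}"));
            }
        }
    }
    found
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor so the sketch needs no async runtime;
/// real code would use a proper runtime instead.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

In this sketch a caller would run `block_on(resolve(&catalog, &refs))` once per query, pass the result to `plan`, and then drop it, so the snapshot never outlives the query that built it.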
This is hopefully just a first step. Eventually I would like to integrate these into SessionContext itself so that we can have methods like register_async_catalog_list and SessionContext will keep track of a list of asynchronous providers and take care of calling the resolve method for the user. The entire process can then be entirely hidden from the user.
What changes are included in this PR?
Adds helpers, which are exposed in datafusion_catalog but not yet integrated into SessionContext. Users can use them following the example outlined in #13714.
Are these changes tested?
Yes.
Are there any user-facing changes?
New APIs only. No breaking changes or modifications to existing APIs.