
feat: add helpers for users with asynchronous catalogs #13800

Open · westonpace wants to merge 4 commits into main

Conversation

westonpace (Member) commented:

Which issue does this PR close?

Closes #10339.

Rationale for this change

As discussed in #13582, we do not actually want to make the schema providers asynchronous (the downstream changes are significant). Instead, a cache-then-plan approach was outlined in #13714. This PR adds helpers that make it easier for users to follow the cache-then-plan approach.

This is hopefully just a first step. Eventually I would like to integrate these into SessionContext itself so that we can have methods like register_async_catalog_list, and SessionContext will keep track of a list of asynchronous providers and take care of calling the resolve method for the user. The entire process can then be entirely hidden from the user.

What changes are included in this PR?

Adds helpers, which are exposed in datafusion_catalog but not yet integrated into SessionContext. Users can use them by following the example outlined in #13714; a minimal sketch of the pattern follows.
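To make the intended workflow concrete, here is a minimal, self-contained sketch of the cache-then-plan pattern. The names here (`AsyncSchemaLookup`, `ResolvedSchema`, `resolve`, the `Table` placeholder) are illustrative stand-ins, not the exact API added by this PR; see #13714 for the real usage example.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Placeholder for something like Arc<dyn TableProvider>; kept trivial so
// the sketch stands alone.
type Table = Arc<str>;

/// The asynchronous side: one (possibly remote) lookup per table name.
#[async_trait::async_trait]
trait AsyncSchemaLookup {
    async fn table(&self, name: &str) -> Option<Table>;
}

/// The synchronous side: a snapshot the planner can query without awaiting.
struct ResolvedSchema {
    cached_tables: HashMap<String, Table>,
}

impl ResolvedSchema {
    fn table(&self, name: &str) -> Option<Table> {
        self.cached_tables.get(name).cloned()
    }
}

/// Cache-then-plan: before planning, resolve every table reference the
/// query mentions into a short-lived, fully synchronous snapshot.
async fn resolve(provider: &dyn AsyncSchemaLookup, references: &[&str]) -> ResolvedSchema {
    let mut cached_tables = HashMap::new();
    for name in references {
        if let Some(table) = provider.table(name).await {
            cached_tables.insert(name.to_string(), table);
        }
    }
    ResolvedSchema { cached_tables }
}
```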

Are these changes tested?

Yes.

Are there any user-facing changes?

New APIs only. No breaking changes or modifications to existing APIs.

findepi (Member) commented Dec 17, 2024:

> Instead, a cache-then-plan approach was outlined in #13714.

What's the cache-then-plan approach? (The linked page doesn't include "cache".)
How did we solve the cold cache problem?


```rust
/// A schema provider that looks up tables in a cache
///
/// This is created by the [`AsyncSchemaProvider::resolve`] method
```
Contributor:

Does that mean the code is auto-generated?

westonpace (Member, Author):

No. I have changed the comment to `Instances are created by...`. Is this clearer?

```rust
Err(DataFusionError::Execution(format!("Attempt to deregister table '{name}' with ResolvedSchemaProvider which is not supported")))
}

fn table_exist(&self, name: &str) -> bool {
```
Contributor:

Suggested change:

```diff
-fn table_exist(&self, name: &str) -> bool {
+fn table_exists(&self, name: &str) -> bool {
```

westonpace (Member, Author):

This method name is defined by the SchemaProvider trait. Renaming it would be a breaking change, and I don't think it is justified.
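For reference, this is how the method appears in the trait (an excerpt only; the signature is reproduced from the snippet above, and the real trait has additional methods and supertraits):

```rust
// Excerpt of the existing trait shape; renaming `table_exist` (e.g. to
// `table_exists`) would be a breaking change for every implementor.
pub trait SchemaProvider: Send + Sync {
    fn table_exist(&self, name: &str) -> bool;
    // ... other methods elided ...
}
```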

```rust
let Some(schema) = schema else { continue };

if !schema.cached_tables.contains_key(reference.table()) {
    let resolved_table =
```
Contributor:

Could this part be factored out into a separate helper method?

```rust
}

#[tokio::test]
async fn test_defaults() {
```
Contributor:

I'm wondering: if we use cached tables, should we have tests for that? I mean that the cached tables should reflect the most recent catalog state; if a table is added/modified/dropped, that should be reflected in the cache.

westonpace (Member, Author):

Discussed below

westonpace (Member, Author) commented:

> What's the cache-then-plan approach? (The linked page doesn't include "cache".)
> How did we solve the cold cache problem?

@findepi

Perhaps I should avoid using the word "cache". This is not a long-lived, multi-query cache. It is a single-query cache meant to be thrown away after the query has completed: a very short-lived cache designed to avoid repeated lookups during multiple planning passes. Every query is still a "cold" query. It would be possible to build another, longer-lived caching layer on top of this, but I am not trying to solve that problem at the moment.
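To illustrate the lifecycle, here is a hypothetical per-query driver built on the sketch from the PR description above (again, illustrative names, not the PR's API):

```rust
// The snapshot is created immediately before planning and dropped when the
// query finishes, so there is no cross-query staleness to manage.
async fn run_one_query(catalog: &dyn AsyncSchemaLookup, table_refs: &[&str]) {
    // All remote catalog round-trips happen up front, once per reference.
    let snapshot = resolve(catalog, table_refs).await;

    // Hand the synchronous snapshot to the planner; repeated planning
    // passes hit the in-memory map instead of the remote catalog.
    // plan_and_execute(&snapshot);

    // `snapshot` is dropped here: the next query starts "cold" again.
}
```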

> I'm wondering: if we use cached tables, should we have tests for that? I mean that the cached tables should reflect the most recent catalog state; if a table is added/modified/dropped, that should be reflected in the cache.

@comphead

There is no concern about cache eviction or staleness here because the cache is not kept longer than a single query. There is some possibility of a catalog change happening between reference lookup (resolve) and query execution. However, that will always be possible when using a remote catalog: query execution should return an error from the remote endpoint saying "no database/schema/table found" or "query does not match schema". I'm not sure we can avoid this without some kind of synchronization mechanism with the remote catalog, and I don't think there has been much work in that regard (though I admittedly haven't examined the APIs in great depth).

findepi (Member) commented Dec 18, 2024:

> Perhaps I should avoid using the word "cache". This is not a long-lived, multi-query cache. It is a single-query cache meant to be thrown away after the query has completed.

@westonpace thanks for explaining. I think the use of "cache" is justified in this context and easier to understand than e.g. "working set". I agree it is important to have a notion of query-level information, for two reasons. Performance is the obvious one: we should not repeatedly compute info we already know. The second is correctness (consistency): if a query e.g. self-joins an Iceberg table T, the table may need to be read twice, but both reads should come from the same snapshot of T.

So we agree on the need for this. The question is who is responsible for providing this consistency: is it the catalog or table provider (e.g. it should wrap itself in ResolvedCatalogProvider), or is it the engine itself (and then the question is how exactly this is implemented)?

```rust
Ok(self.cached_tables.get(name).cloned())
}

#[allow(unused_variables)]
```
Member:

Please avoid `#[allow]` attributes. (And if one is really needed, add a code comment explaining why.)


westonpace (Member, Author) commented:

> The question is who is responsible for providing this consistency: is it the catalog or table provider (e.g. it should wrap itself in ResolvedCatalogProvider), or is it the engine itself (and then the question is how exactly this is implemented)?

I'm not sure I understand what you mean by "it should wrap itself in ResolvedCatalogProvider".

I would personally expect a planner to cache lookups in the same way I expect a compiler to optimize away repeated calls to a constant method, though I understand this is not how the synchronous planner works today.

This is an optimization that benefits all engines and should work the same for all of them, so it seems useful for the resolve method to provide it. Is there some advantage to having every engine reimplement this pattern? Is there some functionality, customization, or capability we would be taking away from engines by doing this here?

Labels: catalog (Related to the catalog crate)

Successfully merging this pull request may close: Make all SchemaProvider trait APIs async

3 participants