Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) #12393

Open
Tracked by #13456
tustvold opened this issue Sep 9, 2024 · 8 comments · May be fixed by #13424
Labels: enhancement (New feature or request)

Comments

@tustvold
Contributor

tustvold commented Sep 9, 2024

Is your feature request related to a problem or challenge?

DataFusion performs CPU-bound work within async closures. This causes problems when IO runs on the same async runtime, because the cooperative nature of such schedulers allows the CPU-bound work to starve the servicing of IO. This leads to errors such as apache/arrow-rs#5882.
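To make the starvation mechanism concrete, here is a toy model of a cooperative scheduler. It is only a simulation with a virtual clock, not tokio or DataFusion code, and the names `Task`, `io_task`, `cpu_task`, and `max_io_gap` are hypothetical. Each task runs until it voluntarily yields, so one task that computes for a long time between yields delays every other task on the same scheduler.

```rust
use std::collections::VecDeque;

/// A task in a toy cooperative scheduler (hypothetical, for illustration).
struct Task {
    name: &'static str,
    polls_left: u32,    // must be >= 1
    cost_per_poll: u64, // simulated time units spent before yielding
}

/// A task that does a tiny amount of work per poll, like servicing a socket.
fn io_task(polls: u32) -> Task {
    Task { name: "io", polls_left: polls, cost_per_poll: 1 }
}

/// A task that computes for a long time before yielding.
fn cpu_task(polls: u32) -> Task {
    Task { name: "cpu", polls_left: polls, cost_per_poll: 1000 }
}

/// Runs the tasks round-robin on one simulated thread and returns the longest
/// gap, in simulated time units, between consecutive polls of the "io" task.
fn max_io_gap(tasks: Vec<Task>) -> u64 {
    let mut queue: VecDeque<Task> = tasks.into();
    let mut clock = 0u64;
    let mut last_io_poll = 0u64;
    let mut max_gap = 0u64;
    while let Some(mut task) = queue.pop_front() {
        if task.name == "io" {
            max_gap = max_gap.max(clock - last_io_poll);
            last_io_poll = clock;
        }
        // Nothing else can run until this task yields.
        clock += task.cost_per_poll;
        task.polls_left -= 1;
        if task.polls_left > 0 {
            queue.push_back(task);
        }
    }
    max_gap
}

fn main() {
    // IO and CPU work share one scheduler: IO waits behind each CPU poll.
    let shared = max_io_gap(vec![io_task(3), cpu_task(2)]);
    // IO has a scheduler to itself: it is polled again almost immediately.
    let isolated = max_io_gap(vec![io_task(3)]);
    println!("max IO gap, shared scheduler:   {shared}");
    println!("max IO gap, isolated scheduler: {isolated}");
}
```

In this model the IO task's worst-case wait grows with the CPU task's cost per poll, which is the effect described above: the longer CPU-bound code runs without yielding, the longer IO goes unserviced.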

Describe the solution you'd like

At the very least this needs to be better documented; I couldn't find any mention of it in the DataFusion documentation after a cursory search.

I also think a more holistic approach would be valuable here. As it stands, the use of async within DataFusion acts as a massive footgun that encourages users to intermix IO and CPU work in a way that is at best inefficient. This can be tracked as a separate follow-on task.

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Contributor

alamb commented Sep 9, 2024

I recommend two things:

  1. Write a blog post with background, an explanation of why using two threadpools is important with DataFusion, and examples of how to do it
  2. Add additional documentation with a summary, ideally linking to the blog post for the full content.
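As a rough illustration of the two-threadpool idea, the sketch below keeps CPU-bound work on its own pool of threads so the calling thread stays free for IO. It uses only the Rust standard library; `CpuPool`, `spawn`, and `sum_of_squares` are hypothetical names and not DataFusion or tokio APIs. In a tokio application the analogous setup would presumably be a second runtime reserved for CPU-bound work, which this sketch does not attempt to show.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// A job is any CPU-heavy closure that can be sent to another thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical pool of threads reserved for CPU-bound work.
struct CpuPool {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl CpuPool {
    fn new(threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..threads)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock is held only while dequeuing, not while running.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // channel closed: pool is shutting down
                    }
                })
            })
            .collect();
        CpuPool { sender: Some(sender), workers }
    }

    /// Runs `f` on the CPU pool and returns a receiver for its result, so
    /// the caller decides when (or whether) to wait for it.
    fn spawn<T, F>(&self, f: F) -> mpsc::Receiver<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            let _ = tx.send(f()); // the caller may have dropped the receiver
        });
        self.sender
            .as_ref()
            .expect("pool is running")
            .send(job)
            .expect("worker threads are alive");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        drop(self.sender.take()); // closing the channel stops the workers
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}

/// Stand-in for CPU-bound work such as evaluating a query plan.
fn sum_of_squares(n: u64) -> u64 {
    (1..=n).map(|x| x * x).sum()
}

fn main() {
    let pool = CpuPool::new(2);
    // Hand the heavy computation to the CPU pool.
    let pending = pool.spawn(|| sum_of_squares(1_000));
    // The calling thread is free to service IO here while the job runs.
    let total = pending.recv().expect("job completed");
    println!("sum of squares = {total}");
}
```

The design point is that the thread doing IO never runs the heavy closure itself; it only submits the job and later collects the result through a channel.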

@alamb alamb changed the title Document DataFusion Threading Document DataFusion Threading (and how to separate IO and CPU bound work) Sep 9, 2024
@ozankabak
Contributor

I think it'd be great to have good documentation on this.

@alamb
Contributor

alamb commented Oct 25, 2024

> I think it'd be great to have good documentation on this.

100% agree -- @itsjunetime and @tustvold are working on a bit of it in apache/arrow-rs#6612. I'll try and help with the documentation as well

@alamb alamb changed the title Document DataFusion Threading (and how to separate IO and CPU bound work) Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) Nov 11, 2024
@alamb alamb self-assigned this Nov 14, 2024
@alamb
Contributor

alamb commented Nov 16, 2024

Documentation

I hope to work on the example a bit more shortly

@adriangb
Contributor

adriangb commented Dec 3, 2024

I was thinking, beyond examples, could we include a basic implementation in DataFusion? I'm not saying it's wired up by default, but put it behind a feature flag and make it provisional but ship it with the source code? Maybe that + docs on how to use it is enough for a lot of use cases?

I think we have a working version of this in 2 files and not that many LOC. We'd be happy to donate it.

@alamb
Contributor

alamb commented Dec 3, 2024

> I was thinking, beyond examples, could we include a basic implementation in DataFusion? I'm not saying it's wired up by default, but put it behind a feature flag and make it provisional but ship it with the source code? Maybe that + docs on how to use it is enough for a lot of use cases?

Yes I agree this would be ideal. Thank you @adriangb

> I think we have a working version of this in 2 files and not that many LOC. We'd be happy to donate it.

That would be awesome -- thank you! I can potentially adapt my example to use what you have

Any chance you can make a PR?

@adriangb
Contributor

adriangb commented Dec 3, 2024

@alamb #13634

@alamb
Contributor

alamb commented Dec 8, 2024

Update here:

4 participants