Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] hold ports until training #5890

Merged
merged 9 commits into from
Jun 20, 2023
Merged

[dask] hold ports until training #5890

merged 9 commits into from
Jun 20, 2023

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented May 17, 2023

Contributes to #5865 by holding on to the ports that LightGBM will use during training for as long as possible on the Python side, thus decreasing the chance of a race condition (but not eliminating it).

@jmoralez
Copy link
Collaborator Author

jmoralez commented May 17, 2023

Seems like the actor approach won't work here due to dask/distributed#7842. Feel free to try the alternative here @jameslamb.

EDIT:
I've tried a similar approach without using the actor interface directly, it isn't 100% clean but it does the job, let me know what you think. I still have some details to fix like some types and docstrings, but I'd like to get your opinion on the approach first.

@jmoralez jmoralez marked this pull request as ready for review May 19, 2023 20:37
@jmoralez jmoralez requested a review from jameslamb as a code owner May 19, 2023 20:37
@jmoralez jmoralez changed the title hold ports until training [dask] hold ports until training May 19, 2023
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks so much for working on this! This is a much better version of what I was kind of envisioning in #5865 (comment).

I'm unsure about exactly how it works, but mainly because of my own ignorance of Dask and TCP. If you tell me this is working for you on both LocalCluster and a multi-machine distributed Dask cluster, then I think it's worth the time for you to add the remaining type hints, tests, docs, etc. and I'll commit to testing it myself on both LocalCluster and some different configurations of multi-machine clusters from Coiled or with something like dask-cloudprovider.

python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
python-package/lightgbm/dask.py Show resolved Hide resolved
python-package/lightgbm/dask.py Show resolved Hide resolved
@jameslamb
Copy link
Collaborator

Based on #5890 (comment), I'm feeling pretty optimistic about this approach! Awesome work.

Some time this week, I'll try to test this on a multi-machine cluster using Coiled or dask-cloudprovider.

@jmoralez
Copy link
Collaborator Author

jmoralez commented Jun 6, 2023

Seems like coiled doesn't provide free credits to use their infra anymore, so I haven't been able to test this on a real cluster. I'll try to find an alternative because I don't have an AWS account.

@jameslamb
Copy link
Collaborator

I can test on my AWS account.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this tonight on AWS Elastic Container Service (ECS) cluster, using dask-cloudprovider: https://github.com/jameslamb/lightgbm-dask-testing/blob/main/notebooks/demo-aws.ipynb.

Happy to say it worked well!

Screen Shot 2023-06-19 at 12 00 58 AM

I left one small comment, do what you want with it.

Great work, @jmoralez !!!

python-package/lightgbm/dask.py Show resolved Hide resolved
@jameslamb jameslamb added fix and removed in progress labels Jun 19, 2023
@jameslamb
Copy link
Collaborator

Looks great to me!

You should be the one to click the button, @jmoralez 😁

@jmoralez jmoralez merged commit ac57d5a into master Jun 20, 2023
@jmoralez jmoralez deleted the dask-hold-ports branch June 20, 2023 16:08
@adfea9c0
Copy link

Hey only saw this now, but thanks for looking into it !

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants