RFC: include: assume that shared file systems exist on clusters / remove include_from_node1 #22588

Merged
andreasnoack merged 2 commits into master from jn/no-node1-include
Jul 17, 2017

Conversation

vtjnash
Member

@vtjnash vtjnash commented Jun 28, 2017

Remove buggy support for emulating mounting shared file-systems from Julia: the kernel is much better at this, since it can do it transparently for all files and all purposes.

This removes all of the support for node-aware include, and instead assumes the local kernel is capable of mounting remote drives.

In the case where the remote nodes have the same filesystem layout, it's still feasible to share .ji files: simply prepend a local path to LOAD_CACHE_PATH on all except the first node (ref #13684 (comment)), such that each node has a separate working directory (read/write), but can also see the cache directory from the master (readonly).
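A minimal sketch of the cache-path arrangement described above, assuming a Julia 0.6-era setup in which `Base.LOAD_CACHE_PATH` is the list of directories searched for `.ji` files. The helper `node_cache_paths` and both directory arguments are hypothetical names used for illustration; they are not part of this PR.

```julia
# Hypothetical helper: compute the cache search path for one node.
# `shared`  : the master's cache directory (mounted read-only on workers)
# `scratch` : a node-local directory the worker can read and write
function node_cache_paths(id::Integer, shared::AbstractString, scratch::AbstractString)
    # Node 1 (the master) keeps using its own cache directory unchanged.
    id == 1 && return [shared]
    # Every other node gets the local path prepended, so new .ji files are
    # written locally while the master's cache stays visible for reading.
    return [scratch, shared]
end
```

On Julia 0.6 the same effect could presumably be had by running something like `unshift!(Base.LOAD_CACHE_PATH, scratch)` on every process where `myid() != 1`. Later Julia versions changed how precompile caches are located, so treat this as specific to that era.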

resolves #22252
resolves #11093
resolves #13939
closes #12381
fixes #12381 and #13999
refs and replaces #19073

@StefanKarpinski as we talked about at JuliaCon

edit: I forgot to mention that this doesn't require the ability to mount remote drives. I only mentioned that above as an example of how to most nearly simulate the limited capabilities of the existing include framework. With this PR, new capabilities will also become feasible, including completely independent systems or using any arbitrary method of copying files around (tar, sftp, rsync, etc.).
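As a rough illustration of the "copy files around" option, the sketch below builds an `rsync` command that mirrors a package directory to a remote host before workers are started there. The function name `sync_cmd`, the host name, and the paths are all hypothetical, and it assumes `rsync` and SSH access are available on both machines.

```julia
# Hypothetical helper: build (but do not run) an rsync command that copies
# `src` on the local machine to `dest` on `host`.
# -a preserves permissions and timestamps, -z compresses data in transit.
function sync_cmd(src::AbstractString, host::AbstractString, dest::AbstractString)
    # Adjacent interpolations form a single argument, "host:dest".
    return `rsync -az $src $host:$dest`
end

# Example (hypothetical host and paths); call run(cmd) to perform the copy.
cmd = sync_cmd("/home/me/.julia/", "worker1", "/home/me/.julia/")
```

Building the command separately from running it keeps the sketch easy to inspect; any other transfer tool (tar over ssh, sftp) would fit the same pattern.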

@Keno
Member

Keno commented Jun 28, 2017

This seems like a step back in usability as it increases the assumptions on the environment we're executing in. While I agree that this complexity does not need to be in base, I think we should have a package that implements it. It's very useful.

@andreasnoack
Member

The use case I had yesterday was adding Anubis workers to a master process on the desktop. Is it correct that, with this change, it wouldn't be possible to add such remote workers?

@vtjnash
Member Author

vtjnash commented Jun 28, 2017

The use case I had yesterday was adding Anubis workers to a master process on the desktop. Is it correct that, with this change, it wouldn't be possible to add such remote workers?

Did that actually manage to work? We have quite a few open bugs claiming it doesn't (except in the very specific configuration where all nodes are exactly identical, in which case the added complexity, which is being removed here, happened to be idempotent and was simply unnecessary and unreliable overhead). With this change, that workflow should be much more reliable and general.

increases the assumptions on the environment we're executing in

It assumes the kernel of the remote system supports some type of network file system or can otherwise copy files ahead of time from somewhere on the network. This is quite significantly less than what we assume now.

I think we should have a package that implements it

Someone will go do it if they find it to be useful. I suspect they will not. The hooks added for Pkg3 should already handle this case.

@Keno
Member

Keno commented Jun 28, 2017

It assumes the kernel of the remote system supports some type of network file system or can otherwise copy files ahead of time from somewhere on the network. This is quite significantly less than what we assume now.

No, it assumes you know how to set one up, which is a significantly larger problem than the technical problem here. My ideal user interface for the parallel stuff would be that you point julia at an ssh server and it starts a worker there without any assumptions on the remote file system. That doesn't work right now, but I think we should strive towards that.

@vtjnash
Member Author

vtjnash commented Jun 28, 2017

No, it assumes you know how to set one up

No, this PR assumes that if you have figured out how to get the julia folder to a remote machine, you can also figure out how to get a .julia folder there too. This is likely to be much more closely aligned with being able to "point julia at any ssh server and start a process there" than we are now.

Currently, we unavoidably also assume you have set up a correct NFS (because many packages also have file system dependencies), but because we emulate exactly one operation in Julia (include), it adds all sorts of other limitations and complications on how that NFS must be configured and what kinds of machines it can be run on.

@Keno
Member

Keno commented Jun 28, 2017

Things mostly worked fine though without a shared file system. Changes to the packages on the head node were indeed picked up by the workers (true point about the file system dependencies though). I don't disagree that this has significant complexity that's probably unsuitable for base, I'm just saying the functionality was useful.

@tkelman
Contributor

tkelman commented Jun 28, 2017

I don't think we should completely remove this without a working replacement.

@vtjnash
Member Author

vtjnash commented Jun 28, 2017

I don't think we should completely remove this without a working replacement.

What we have now emphatically does not work in many cases. This replaces it with something simpler that does work in all cases, rather than attempting to address the bugs that make the current implementation basically unusable.

@tkelman
Contributor

tkelman commented Jun 28, 2017

Requiring a shared file system breaks many, possibly most, of the cases that work right now.

@vtjnash
Member Author

vtjnash commented Jun 28, 2017

You're going to have to be more specific, since given #22252, I'm unaware of any cases that work right now (except for cases which just usually happen to work, but would work much better and more reliably after this PR).

@vtjnash
Member Author

vtjnash commented Jun 28, 2017

(edited top post to clarify that this PR is less dependent on the presence of a shared file system than the current situation)

@vtjnash vtjnash force-pushed the jn/no-node1-include branch from c2fda80 to 4187526 Compare June 28, 2017 05:01
@andreasnoack
Member

Did that actually manage to work?

It works if both master and workers are launched with --compilecache=no. I can add workers from Anubis and one of the nanosoldiers to my local desktop process and e.g. DistributedArrays seems to work fine.

Member

@andreasnoack andreasnoack left a comment

I think we should do this. I've been spending some time lately on trying to run Julia code on remote nodes that don't share the file system with the master node.

It is not really practical to avoid binary dependencies, so right now we don't really support running code on remote machines.

@vtjnash Needs a rebase.

remove buggy support for emulating shared file-systems from Julia:
the kernel is much better at this, and can do it transparently
@andreasnoack andreasnoack force-pushed the jn/no-node1-include branch from 4187526 to a592f3f Compare July 16, 2017 19:54
@andreasnoack andreasnoack force-pushed the jn/no-node1-include branch from a592f3f to f8b84f2 Compare July 17, 2017 02:17
@andreasnoack andreasnoack merged commit 5535ecb into master Jul 17, 2017
@tkelman tkelman deleted the jn/no-node1-include branch July 17, 2017 13:21
@tkelman
Contributor

tkelman commented Jul 17, 2017

This badly needs news updates. Docs too, most likely.

@tkelman tkelman added the breaking and needs news labels Jul 17, 2017
@andreasnoack
Member

🎉 Just tried this out and now I can actually do useful stuff on the workers. @vtjnash thanks for the fix. Would you mind writing the NEWS entry and the docs update? Most likely, I wouldn't be able to explain this correctly.

vtjnash added a commit that referenced this pull request Jul 19, 2017
vtjnash added a commit that referenced this pull request Jul 19, 2017
vtjnash added a commit that referenced this pull request Jul 20, 2017
jeffwong pushed a commit to jeffwong/julia that referenced this pull request Jul 24, 2017
…ove include_from_node1 (JuliaLang#22588)

* include: assume that shared file systems exist for clusters

remove buggy support for emulating shared file-systems from Julia:
the kernel is much better at this, and can do it transparently

* only broadcast using/import to nodes which need it

fix JuliaLang#12381
fix JuliaLang#13999
jeffwong pushed a commit to jeffwong/julia that referenced this pull request Jul 24, 2017
@KristofferC KristofferC removed the needs news label Nov 13, 2018
Labels
breaking, modules, parallelism
Projects
None yet
6 participants