fetch hangs when remote process is terminated #2793

Closed
tanmaykm opened this issue Apr 8, 2013 · 6 comments
Labels
parallelism Parallel or distributed computation

Comments

@tanmaykm (Member) commented Apr 8, 2013

If a remote julia process is killed, a fetch on a pending RemoteRef never returns control, even though the master detects the failure as an end-of-stream exception.

This can be reproduced with the following commands:

julia> addprocs_ssh(("localhost",))
:ok

julia> nprocs()
2

julia> r = remote_call(2, sleep, 600)
RemoteRef(2,1,1)

julia> fetch(r)

At this point fetch is waiting on the remote process.
Kill the remote julia process to simulate an abnormal termination.
The following is displayed at the REPL, but the fetch call still does not return:

bash: line 1: 49980 Terminated: 15          ./julia-release-basic --worker
exception on 1: ERROR: read: end of file
 in read at iobuffer.jl:51
 in read at stream.jl:397
 in anonymous at task.jl:807

The fetch must be interrupted with Ctrl+C for control to return to the REPL:

^CERROR: interrupt
 in process_events at stream.jl:312
 in event_loop at multi.jl:1381
 in anonymous at client.jl:284
InterruptException()

@vtjnash (Member) commented Apr 8, 2013

essentially a dup of #217

@ViralBShah (Member)

Actually, I think #217 is fairly broad, covering system-provided fault-tolerant programming models in general, whereas this is a specific issue.

Anything waiting on a RemoteRef should probably receive an error code or an exception; that would make it easy to write simple fault-tolerant code (see the sketch below).

I also notice that nprocs() continues to report the old worker count even after one of the workers dies. There are many ways workers can die or disappear, but at least when the socket connection is terminated, which should cover many of those cases, the master node could handle it more gracefully: update the process count, drop the dead worker, and so on.
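
A minimal sketch of the caller-side pattern this proposal would enable, assuming fetch were changed to raise an exception rather than block forever when the worker behind a RemoteRef dies; the exact exception type is left unspecified here, since no such behavior exists yet:

# Sketch only: assumes fetch throws instead of hanging once worker 2 dies.
r = remote_call(2, sleep, 600)
try
    fetch(r)
catch err
    # err would identify the failure; the caller could retry the work
    # on another worker, or clean up and continue.
    println("remote worker failed while fetching: ", err)
end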

@dreiss-isb (Contributor)

I am not sure this is a dupe; my calls to pmap() worked flawlessly in julia v0.1.2, but now, using the exact same code with v0.2.0, they frequently hang with the same error. This may be an I/O or serialization bug.

@JeffBezanson (Member)

David, if you can come up with a reduced test case where this error happens during pmap, it would be great to file an issue for it. Thanks.

@dreiss-isb (Contributor)

Hi Jeff, my apologies: my error seems to be the result of an uncaught exception on one of the remote processes (I think it is DataFrame-related, not an I/O bug).

So my original claim is bogus, although the original issue filed by Tanmay still holds (and frequently causes me problems unless I make sure to catch all remotely executed errors).
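
A sketch of that workaround: wrap the remotely executed function so any exception is caught on the worker and handed back as an ordinary value instead of escaping. The names do_work and safe_work are placeholders, not from this thread, and the definitions would also need to be loaded on the workers (e.g. with @everywhere):

# Placeholder for the real per-item computation.
do_work(x) = x^2

# Catch exceptions on the worker and return them as values, so the
# caller can inspect failures instead of being left waiting.
function safe_work(x)
    try
        do_work(x)
    catch err
        err
    end
end

results = pmap(safe_work, 1:10)
# Separate the failures from the successful results.
failed = filter(r -> isa(r, Exception), results)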

@StefanKarpinski (Member)

That sort of error should be propagated and reported more cleanly, so this is still somewhat of an issue, although a different one.
