fetch hangs when remote process is terminated #2793

Closed
tanmaykm opened this issue Apr 8, 2013 · 6 comments
Labels
parallelism Parallel or distributed computation

Comments

@tanmaykm (Member) commented Apr 8, 2013

If a remote julia process is killed, a fetch on a pending RemoteRef never returns control, even though the master detects the failure as an end-of-stream exception.

This can be reproduced with the following commands:

julia> addprocs_ssh(("localhost",))
:ok

julia> nprocs()
2

julia> r = remote_call(2, sleep, 600)
RemoteRef(2,1,1)

julia> fetch(r)

At this point fetch is waiting on the remote process.
Kill the remote julia process to simulate an abnormal termination.
The following is displayed at the REPL, but the fetch call still does not return:

bash: line 1: 49980 Terminated: 15          ./julia-release-basic --worker
exception on 1: ERROR: read: end of file
 in read at iobuffer.jl:51
 in read at stream.jl:397
 in anonymous at task.jl:807

The fetch must be interrupted with Ctrl+C for control to return to the REPL:

^CERROR: interrupt
 in process_events at stream.jl:312
 in event_loop at multi.jl:1381
 in anonymous at client.jl:284
InterruptException()

@vtjnash (Member) commented Apr 8, 2013

essentially a dup of #217

@ViralBShah (Member)

Actually, I think #217 is fairly broad, covering system-provided fault-tolerant programming models in general, whereas this is a specific issue.

Anything waiting on a RemoteRef should probably receive an error code or an exception; that would make it easy to write simple fault-tolerant code (see the sketch below).

I also notice that nprocs() continues to report the old worker count even after one of the workers dies. There are many ways workers can die or disappear, but at least when the socket connection is terminated, which should cover many of those cases, the master node could handle it more gracefully: update the process count, drop the dead worker, and so on.
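
A minimal sketch of the caller-side pattern this proposal would enable, assuming fetch were changed to raise an exception rather than block forever when the worker behind a RemoteRef dies; the exact exception type is left unspecified here, since no such behavior exists yet:

# Sketch only: assumes fetch throws instead of hanging once worker 2 dies.
r = remote_call(2, sleep, 600)
try
    fetch(r)
catch err
    # err would identify the failure; the caller could retry the work
    # on another worker, or clean up and continue.
    println("remote worker failed while fetching: ", err)
end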

@dreiss-isb (Contributor)

I am not sure this is a dupe; my calls to pmap() worked flawlessly in julia v0.1.2, but now, using the exact same code with v0.2.0, they frequently hang with the same error. This may be an I/O or serialization bug.

@JeffBezanson (Member)

David, if you can come up with a reduced test case where this error happens during pmap, it would be great to file an issue for it. Thanks.

@dreiss-isb (Contributor)

Hi Jeff, my apologies: my error seems to be the result of an uncaught exception on one of the remote processes (I think it is DataFrame-related, not an I/O bug).

So my original claim is bogus, although the original issue filed by Tanmay still holds (and frequently causes me problems unless I make sure to catch all remotely executed errors).
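
A sketch of that workaround: wrap the remotely executed function so any exception is caught on the worker and handed back as an ordinary value instead of escaping. The names do_work and safe_work are placeholders, not from this thread, and the definitions would also need to be loaded on the workers (e.g. with @everywhere):

# Placeholder for the real per-item computation.
do_work(x) = x^2

# Catch exceptions on the worker and return them as values, so the
# caller can inspect failures instead of being left waiting.
function safe_work(x)
    try
        do_work(x)
    catch err
        err
    end
end

results = pmap(safe_work, 1:10)
# Separate the failures from the successful results.
failed = filter(r -> isa(r, Exception), results)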

@StefanKarpinski (Member)

That sort of error should be propagated and reported more cleanly, so this is still somewhat of an issue, although a different one.
