
Reconcile PythonLayer with Multi-GPU #2936

Closed
ronghanghu opened this issue Aug 16, 2015 · 17 comments

@ronghanghu
Member

Right now PythonLayer cannot be parallelized in multi-GPU training, which is a bug. The Python Global Interpreter Lock (GIL) only allows one thread at a time to run in the Python interpreter. This issue is reported in #2923.

We should find a solution to this bug, or only allow a shared PythonLayer and serialize its forward pass (currently via share_in_parallel: true) in multi-GPU training (then we would need a backward lock too, but it is more complicated than that; see #2923 (comment)). Perhaps we could use the multiprocessing module to fix this bug? But on second thought that doesn't seem like a good idea either.

@thatguymike
Contributor

This is not going to work that cleanly. Multi-process means multiple GPU contexts and a different multi-GPU design, leading to some performance loss. Moreover, data exchange in general becomes much more complex.

It still amazes me that Python doesn't support threading in a reasonable way.

@ronghanghu
Member Author

This is not going to work that cleanly. Multi-process means multiple GPU contexts and a different multi-GPU design, leading to some performance loss. Moreover, data exchange in general becomes much more complex.

Yes, I totally agree with you. On second thought maybe Python's multiprocessing module is not suitable here.

ronghanghu changed the title from "Use multiprocessing module to parallelize PythonLayer in Multi-GPU" to "PythonLayer crash in Multi-GPU due to GIL" on Aug 16, 2015
@jmschrei

Joblib allows you to use a threading backend when used in conjunction with Cython and releasing the GIL. Maybe that would be appropriate here?

@bhack
Contributor

bhack commented Aug 16, 2015

A small hand-rolled example of working around the GIL: http://pieceofpy.com/2010/02/26/boost-python-threads-and-releasing-the-gil/

ronghanghu changed the title from "PythonLayer crash in Multi-GPU due to GIL" to "Reconcile PythonLayer with Multi-GPU" on Aug 16, 2015
@ronghanghu
Member Author

The straightforward (and perhaps naive) workaround seems to be to serialize all calls into Python code, i.e. allow only one PythonLayer execution at a time, as before. This can be done via a global Python lock, perhaps as a static member variable of the PythonLayer class, as sketched below.

From my perspective, serializing Python execution is reasonable for a CPU-bound PythonLayer, since Python interpretation is always serialized by the GIL and never actually runs in parallel. One may argue about the case of an I/O-bound PythonLayer (such as a PythonLayer used as a data layer), but I would not expect to run into that case, since people should set share_in_parallel: true to share the layer among workers and load data sequentially (avoiding loading the same data in every solver). And after all, even for an I/O-bound PythonLayer with share_in_parallel: false, serialized execution is still better than a crash.
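
For illustration, a minimal sketch of what such a class-wide lock could look like (the names mirror Caffe's PythonLayer, but this is an assumption, not the actual patch):

```cpp
// Illustrative sketch only (not the actual patch): a single static mutex shared
// by all PythonLayer instances, so at most one solver thread executes Python
// code at any time. Layer/Blob are Caffe's types.
#include <boost/thread/mutex.hpp>

template <typename Dtype>
class PythonLayer : public Layer<Dtype> {
 public:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
                           const vector<Blob<Dtype>*>& top) {
    boost::mutex::scoped_lock lock(mutex_);  // serialize every Python call
    // ... invoke the layer's Python forward() implementation here ...
  }

 private:
  static boost::mutex mutex_;  // one lock shared by the whole class
};

template <typename Dtype>
boost::mutex PythonLayer<Dtype>::mutex_;
```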

Any comment?

@seanbell

In the short term until this gets sorted out, perhaps setting share_in_parallel to false on the PythonLayer should print a fatal message explaining that the unshared case is not implemented yet.
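
For illustration, a minimal sketch of such a guard, assuming share_in_parallel is read from the layer's python_param (the exact check and wording are assumptions, not an actual patch):

```cpp
// Illustrative guard, e.g. in PythonLayer::LayerSetUp: fail fast with a clear
// message instead of crashing later inside the interpreter. The exact check
// and wording are assumptions, not an actual patch.
if (Caffe::solver_count() > 1 &&
    !this->layer_param_.python_param().share_in_parallel()) {
  LOG(FATAL) << "Unshared PythonLayer is not yet supported with multiple GPUs; "
             << "set share_in_parallel: true or train on a single GPU.";
}
```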

@ronghanghu
Member Author

But even for the shared case, backward does not work correctly (#2923 (comment)); in #2903 I only expected a PythonLayer to be shared when it is used as a data layer and performs no backward pass.

@seanbell

You could give a fatal message for that case too.

I agree that this is tricky and don't have a solution, but failing cleanly is a reasonable placeholder.

@BlGene
Contributor

BlGene commented Aug 20, 2015

This is probably a stupid idea, but would it be possible to start several Python processes, each with its own GIL? My bad.

@bhack
Contributor

bhack commented Aug 20, 2015

@BlGene see @thatguymike's and @ronghanghu's comments above on Python multiprocessing.

@ronghanghu
Member Author

I'm leaving this issue open for further discussion. #2939 provides a workaround for the crash, but it is indeed not very satisfactory, and I don't like it either. On the other hand, multi-process is a hacky way to go. Everyone is welcome to share good ideas/solutions to this issue.

@alessandroferrari

What about implementing pyforward and pybackward methods in the C++ Net class, which simply release the GIL by means of a ScopedGILRelease and immediately afterwards call the forward (or backward) methods? Then expose the pyforward and pybackward functions through the boost interface.

Alternatively, a PyNet C++ class that inherits from Net and overrides forward and backward, simply releasing the GIL and calling the parent's forward() (or backward()).

At least for the most time-consuming and commonly used methods, it is important to release the GIL; otherwise Caffe is not suited to multi-threaded Python applications in production.
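
Roughly along these lines, using the CPython thread-state API; pyforward/ScopedGILRelease are the names proposed above, and the sketch is illustrative rather than existing Caffe code:

```cpp
// Rough sketch of the idea above (not existing Caffe code): release the GIL
// around Net::Forward() so other Python threads can run during the pass.
#include <Python.h>

class ScopedGILRelease {
 public:
  ScopedGILRelease() : state_(PyEval_SaveThread()) {}    // drop the GIL
  ~ScopedGILRelease() { PyEval_RestoreThread(state_); }  // re-acquire it
 private:
  PyThreadState* state_;
};

// Hypothetical wrapper that would be exposed to Python (e.g. via boost::python
// in _caffe.cpp) instead of binding Net::Forward directly.
template <typename Dtype>
void Net_PyForward(Net<Dtype>& net) {
  ScopedGILRelease no_gil;  // GIL released for the duration of the pass
  net.Forward();            // any PythonLayer must re-acquire the GIL itself
}
```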

@naibaf7
Member

naibaf7 commented Jun 23, 2016

@ronghanghu
I do release the GIL for the forward/backward passes in the OpenCL branch and it seems to work fine so far, also when using multiple GPUs on multiple threads (although for data-parallel computation and not multi-GPU training).
For the Python layer, the GIL gets re-acquired, and it passes the Python runtests like that.
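
For reference, re-acquiring the GIL inside the layer typically uses the standard PyGILState pattern; a minimal sketch (not the actual OpenCL-branch code):

```cpp
// Minimal sketch (not the actual OpenCL-branch code) of re-acquiring the GIL
// before calling back into Python from a thread that does not hold it.
#include <Python.h>

void python_layer_forward() {
  PyGILState_STATE gstate = PyGILState_Ensure();  // safe even if already held
  // ... call the layer's Python forward() implementation here ...
  PyGILState_Release(gstate);                     // restore the previous state
}
```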

@alessandroferrari

I have opened a pull request to fix it: #4360

@cypof
Member

cypof commented Jan 17, 2017

Each net can run in its own fork now; there is an example in /python/train.py. #4563

@DenisKuplyakov

I can't understand why this was closed. #4563 doesn't release the GIL; it only implements multi-GPU training. I want to have multiple workers with shared memory and a single GPU thread streaming forward passes only, but everything gets blocked during the forward pass. What's wrong with #4360, apart from code style?

@soulslicer

Sorry, so is there any way to create a custom data layer that works with multiple GPUs?
