Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

gluon bug: AttributeError: '_thread._local' object has no attribute 'value' #11331

Closed
feevos opened this issue Jun 19, 2018 · 4 comments
Closed

Comments

@feevos
Copy link
Contributor

feevos commented Jun 19, 2018

Description

Dear all,

I am trying to run mxnet in a distributed HPC environment for embarrassingly parallel (distributed) runs.
The goal is to use this for bayesian hyperparameter optimization, therefore all communication between nodes is nothing mxnet/gpu specific (lists of hyperparams, like learning rate, batch size etc). For my distributed needs I chose ray. Each node has 4 gpus and runs a completely independent run from other nodes. However, I cannot even define a simple gluon layer within
a ray.remote function.

When I am using 2 (or more) nodes with this trivial example, everything is working:

import os
import sys
import ray
import time

# mxnet gpu examples 
import mxnet as mx
from mxnet import nd
import numpy as np


@ray.remote(num_gpus = 4)
def f():

    gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')] # In case of multiple GPUs, comment out 2nd option. 
    tctx = [mx.gpu(i) for i in range(len(gpus))]
    a = nd.random.uniform(shape=[3,4,16,16],ctx=tctx[0])


    return a.asnumpy()

if __name__ == '__main__':
    ray.init( redis_address =  sys.argv[1])
    result1 = ray.get(f.remote())
    result2 = ray.get(f.remote())

    print (result1,result2)

However, when I try to use any gluon object that derives from HybridBlock, for example:

@ray.remote(num_gpus=4)
def f(x):
    loss = gluon.loss.L2Loss()
    return x

I get an error. I've also tested ray with a simple pytorch nn (everything is working), so this is most probably a mxnet/gluon problem.

edit: The same problem and error message appears if I use dask.distributed for launching/managing the cluster.

Environment info (Required)

All nodes are identical, I've run diagnose.py command on an interactive node with 4 gpus allocated

----------Python Info----------
Version      : 3.6.4
Compiler     : GCC 7.2.0
Build        : ('default', 'Jan 16 2018 18:10:19')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /home/dia021/Software/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/home/dia021/Software/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Version      : 1.3.0
Directory    : /home/dia021/Software/mxnet
Commit Hash   : 0910450110c37da9f052f3b29c40c6d051f46a6a
----------System Info----------
Platform     : Linux-4.4.114-94.11-default-x86_64-with-SuSE-12-x86_64
system       : Linux
node         : b050
release      : 4.4.114-94.11-default
version      : #1 SMP Thu Feb 1 19:28:26 UTC 2018 (4309ff9)
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-27
Off-line CPU(s) list:  28-55
Thread(s) per core:    1
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping:              1
CPU MHz:               2599.787
BogoMIPS:              5199.57
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb invpcid_single pln pts dtherm intel_pt spec_ctrl retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx xsaveopt cqm_llc cqm_occup_llc
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0035 sec, LOAD: 1.1334 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0084 sec, LOAD: 0.9156 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2081 sec, LOAD: 0.0405 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.2181 sec, LOAD: 0.2282 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0068 sec, LOAD: 0.5931 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0075 sec, LOAD: 0.0752 sec.

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   31C    P0    32W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   29C    P0    32W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   30C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 00000000:08:00.0 Off |                    0 |
| N/A   32C    P0    29W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Error Message:

Traceback (most recent call last):
  File "test_ray.py", line 75, in <module>
    x1 = ray.get(feature1_id)
  File "/home/dia021/Software/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2321, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(250d79352e7800faddddf2c11ec6fd6ea65c20b8). It was created by remote function __main__.f which failed with:

Remote function __main__.f failed with:

Traceback (most recent call last):
  File "test_ray.py", line 30, in f
    loss = gluon.loss.L2Loss()
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
    super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
    super(Loss, self).__init__(**kwargs)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
    super(HybridBlock, self).__init__(prefix=prefix, params=params)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
    self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
  File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
    prefix = _name.NameManager._current.value.get(None, hint) + '_'
AttributeError: '_thread._local' object has no attribute 'value'

Remote function __main__.f failed with:

Traceback (most recent call last):
  File "test_ray.py", line 30, in f
    loss = gluon.loss.L2Loss()
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
    super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
    super(Loss, self).__init__(**kwargs)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
    super(HybridBlock, self).__init__(prefix=prefix, params=params)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
    self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
  File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
    prefix = _name.NameManager._current.value.get(None, hint) + '_'
AttributeError: '_thread._local' object has no attribute 'value'


  You can inspect errors by running

      ray.error_info()

  If this driver is hanging, start a new one with

      ray.init(redis_address="10.141.1.77:6379")

Minimum reproducible example

This is a python file. I needs to be executed after the ray cluster has initiated with (in SLURM environment) srun python name_of_file.py

# Distributed stuff 
import ray

#mxnet 
from mxnet import gluon

# A trivial function to reproduce the example 
@ray.remote(num_gpus=4)
def f(x):
    loss = gluon.loss.L2Loss()
    return x;


if __name__ == '__main__':
    # here sys.argv[1] is the redis_address after the initiation of the ray cluster 
    ray.init( redis_address =  sys.argv[1]  )

    feature1_id = f.remote(0)
    x1 = ray.get(feature1_id)

    print (x1)

If you could please provide any hack-around/advice, most appreciated. This is also linked to this gluon-cv issue

Thank you very much
Foivos

@feevos
Copy link
Contributor Author

feevos commented Jun 19, 2018

Dear all, @ThomasDelteil solved the problem here I am leaving this open until one of the developers decides to close it.

@feevos feevos closed this as completed Jun 19, 2018
@feevos feevos reopened this Jun 19, 2018
@feevos
Copy link
Contributor Author

feevos commented Jun 19, 2018

The problem persists if I use a more complex layer. When I test with a trivial network:

@ray.remote(num_gpus=4)
def f(x):



    mynet = gluon.nn.HybridSequential(prefix = "test")
    with mynet.name_scope():
        mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test")
    # """
    #loss = gluon.loss.L2Loss(prefix="test")
    return x;

I get a very similar error:

Remote function __main__.f failed with:

Traceback (most recent call last):
  File "test_ray.py", line 26, in f
    mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test")
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 319, in __init__
    in_channels, activation, use_bias, weight_initializer, bias_initializer, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 115, in __init__
    wshapes = _infer_weight_shape(op_name, dshape, self._kwargs)
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 37, in _infer_weight_shape
    sym = op(symbol.var('data', shape=data_shape), **kwargs)
  File "/home/dia021/Software/mxnet/symbol/symbol.py", line 2454, in var
    attr = AttrScope._current.value.get(attr)
AttributeError: '_thread._local' object has no attribute 'value'


  You can inspect errors by running

      ray.error_info()

  If this driver is hanging, start a new one with

      ray.init(redis_address="10.141.1.67:6379")
  
Traceback (most recent call last):
  File "test_ray.py", line 75, in <module>
    x1 = ray.get(feature1_id)
  File "/home/dia021/Software/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2321, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(2a767168ffefb71fa84318c5f8e7e15918dcce35). It was created by remote function __main__.f which failed with:

Remote function __main__.f failed with:

Traceback (most recent call last):
  File "test_ray.py", line 26, in f
    mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test")
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 319, in __init__
    in_channels, activation, use_bias, weight_initializer, bias_initializer, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 115, in __init__
    wshapes = _infer_weight_shape(op_name, dshape, self._kwargs)
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 37, in _infer_weight_shape
    sym = op(symbol.var('data', shape=data_shape), **kwargs)
  File "/home/dia021/Software/mxnet/symbol/symbol.py", line 2454, in var
    attr = AttrScope._current.value.get(attr)
AttributeError: '_thread._local' object has no attribute 'value'

@robertnishihara
Copy link

robertnishihara commented Jun 19, 2018

@feevos it might be worth seeing if you can reproduce the issue without Ray. One thing to try would be pickling f (e.g., using cloudpickle), then unpickling it (in a different Python interpreter), and then executing the unpickled version.

EDIT: Oh, I see that it's already solved.

@apeforest
Copy link
Contributor

@sandeep-krishnamurthy This issue has been resolved. Please close it. Thanks

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants