This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
gluon bug: AttributeError: '_thread._local' object has no attribute 'value' #11331
Comments
Dear all, @ThomasDelteil solved the problem here. I am leaving this open until one of the developers decides to close it.
The problem persists if I use a more complex layer. When I test with a trivial network:

@ray.remote(num_gpus=4)
def f(x):
    mynet = gluon.nn.HybridSequential(prefix = "test")
    with mynet.name_scope():
        mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test")
    # """
    #loss = gluon.loss.L2Loss(prefix="test")
    return x

I get a very similar error:

Remote function __main__.f failed with:
Traceback (most recent call last):
File "test_ray.py", line 26, in f
mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test")
File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 319, in __init__
in_channels, activation, use_bias, weight_initializer, bias_initializer, **kwargs)
File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 115, in __init__
wshapes = _infer_weight_shape(op_name, dshape, self._kwargs)
File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 37, in _infer_weight_shape
sym = op(symbol.var('data', shape=data_shape), **kwargs)
File "/home/dia021/Software/mxnet/symbol/symbol.py", line 2454, in var
attr = AttrScope._current.value.get(attr)
AttributeError: '_thread._local' object has no attribute 'value'
You can inspect errors by running
ray.error_info()
If this driver is hanging, start a new one with
ray.init(redis_address="10.141.1.67:6379")
Traceback (most recent call last):
File "test_ray.py", line 75, in <module>
x1 = ray.get(feature1_id)
File "/home/dia021/Software/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2321, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(2a767168ffefb71fa84318c5f8e7e15918dcce35). It was created by remote function __main__.f which failed with:
Remote function __main__.f failed with:
Traceback (most recent call last):
File "test_ray.py", line 26, in f
mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test")
File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 319, in __init__
in_channels, activation, use_bias, weight_initializer, bias_initializer, **kwargs)
File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 115, in __init__
wshapes = _infer_weight_shape(op_name, dshape, self._kwargs)
File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 37, in _infer_weight_shape
sym = op(symbol.var('data', shape=data_shape), **kwargs)
File "/home/dia021/Software/mxnet/symbol/symbol.py", line 2454, in var
attr = AttrScope._current.value.get(attr)
AttributeError: '_thread._local' object has no attribute 'value'
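
The last frame of the traceback reads `.value` from a `_thread._local` object. In plain Python, an attribute set on a `threading.local` instance is visible only in the thread that set it, so reading it from another thread raises exactly this kind of `AttributeError`. The sketch below reproduces that behaviour without mxnet or ray. It assumes, without confirmation from this thread, that the failing task runs in a different thread from the one that first set the attribute. The names `_current`, `read_scope`, `read_scope_in_new_thread` and `read_scope_safe` are hypothetical stand-ins, not mxnet code.

```python
import threading

# Hypothetical stand-in for a module-level scope holder; this is not mxnet code.
_current = threading.local()
_current.value = {"scope": "default"}  # set only in the thread that runs this line


def read_scope():
    """Return this thread's scope, or the exception name if it was never set."""
    try:
        return _current.value
    except AttributeError as exc:
        return type(exc).__name__


def read_scope_in_new_thread(reader=read_scope):
    """Call `reader` in a fresh thread and return what that thread saw."""
    result = []
    worker = threading.Thread(target=lambda: result.append(reader()))
    worker.start()
    worker.join()
    return result[0]


def read_scope_safe():
    """Lazily create a per-thread default instead of assuming one exists."""
    value = getattr(_current, "value", None)
    if value is None:
        value = {"scope": "default"}
        _current.value = value
    return value
```

Calling `read_scope()` in the thread that set the attribute returns the dictionary, while `read_scope_in_new_thread()` returns the string `"AttributeError"`. Passing `read_scope_safe` as the reader shows one common pattern for avoiding the failure. Whether the fix referred to in this thread works the same way is not stated here.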
@feevos it might be worth seeing if you can reproduce the issue without Ray. One thing to try would be pickling
EDIT: Oh, I see that it's already solved.
@sandeep-krishnamurthy This issue has been resolved. Please close it. Thanks.
Description
Dear all,
I am trying to run mxnet in a distributed HPC environment for embarrassingly parallel (distributed) runs.
The goal is to use this for Bayesian hyperparameter optimization, so nothing communicated between nodes is mxnet- or GPU-specific (only lists of hyperparameters such as learning rate and batch size). For my distributed needs I chose ray. Each node has 4 GPUs and performs a run that is completely independent of the other nodes. However, I cannot even define a simple gluon layer within a ray.remote function.
When I am using 2 (or more) nodes with this trivial example, everything is working:
However, when I try to use any gluon object that derives from HybridBlock, for example:
I get an error. I've also tested ray with a simple pytorch nn (everything is working), so this is most probably an mxnet/gluon problem.
edit: The same problem and error message appear if I use dask.distributed for launching/managing the cluster.
Environment info (Required)
All nodes are identical. I ran the diagnose.py command on an interactive node with 4 GPUs allocated.
nvidia-smi
Error Message:
Minimum reproducible example
This is a Python file. It needs to be executed after the ray cluster has been initialized, with (in a SLURM environment) srun python name_of_file.py
If you could provide any workaround or advice, it would be most appreciated. This is also linked to this gluon-cv issue.
Thank you very much
Foivos