This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-558] Fix 'AttributeError: '_thread._local' object has no attribute 'value'' on distributed processing applications #11332

Merged on Jun 22, 2018 (7 commits)


ThomasDelteil
Contributor

@ThomasDelteil ThomasDelteil commented Jun 19, 2018

Description

When using frameworks such as Django or Ray, users encountered errors when trying to run MXNet in separate threads.

#11331 and dmlc/gluon-cv#156

Calling _current.value.get within the context of the respective scope object solved the issue.
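
For readers unfamiliar with the failure mode, here is a minimal standard-library sketch of how an AttributeError like the one in the title can arise. The Scope class is a hypothetical stand-in, not MXNet's AttrScope or NameManager; the only assumption is that the default value is stored on a threading.local object at import time, which happens in the main thread only.

```python
import threading


class Scope:
    """Hypothetical stand-in for a scope class; not the MXNet implementation."""
    _current = threading.local()

    def get(self, attr):
        return attr if attr is not None else {}


# Set once at import time, so only the importing (main) thread sees it.
Scope._current.value = Scope()


def worker(result):
    try:
        Scope._current.value.get(None)
        result.append("ok")
    except AttributeError as exc:
        # A new thread starts with an empty thread-local namespace.
        result.append(type(exc).__name__)


result = []
thread = threading.Thread(target=worker, args=(result,))
thread.start()
thread.join()

main_thread_value = Scope._current.value.get(None)  # works in the main thread
```

Attributes set on a threading.local object are visible only in the thread that set them, which is why the same lookup succeeds in the main thread and fails in the worker.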

Does anybody have a suggestion as to how to test this without introducing a dependency on ray?

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
      • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@ThomasDelteil ThomasDelteil requested a review from szha as a code owner June 19, 2018 05:51
@robertnishihara

I haven't reproduced the original problem, but what about defining the function and executing it in a separate thread or process (just by using the builtin threading or subprocess modules)? In the multiprocessing case, the function could be serialized using cloudpickle. Would something like that succeed at triggering the issue?

@ThomasDelteil
Contributor Author

from threading import Thread
import mxnet as mx
from mxnet import gluon

def threaded_function():
    # Creating a Gluon layer in a non-main thread triggers the error
    net = gluon.nn.Dense(2)

thread = Thread(target=threaded_function)
thread.start()
thread.join()

This reproduces the issue, and it is fixed with this patch.
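
One caveat when turning a reproduction like this into a test: an exception raised inside a worker thread is not re-raised in the thread that calls join(), so the test has to carry the failure back itself. A minimal sketch using only the standard library follows; run_in_thread is a hypothetical helper name, not something from the MXNet test suite.

```python
import threading


def run_in_thread(fn):
    """Run fn in a new thread; return the exception it raised, or None."""
    errors = []

    def target():
        try:
            fn()
        except Exception as exc:  # collect the error for the calling thread
            errors.append(exc)

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return errors[0] if errors else None


def failing():
    raise AttributeError("simulated thread-local failure")


no_error = run_in_thread(lambda: None)
caught = run_in_thread(failing)
```

A test could then assert that run_in_thread(threaded_function) returns None, so that a failure inside the thread fails the test instead of only printing a traceback.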

@feevos
Contributor

feevos commented Jun 19, 2018

Just a question: with this public release, will one still need to provide a prefix as a workaround, or is this taken care of automatically? Thank you very much for this!!

@ThomasDelteil
Contributor Author

@anirudh2290 it looks like you refactored a lot of this scope code last month, can you review this?
@feevos once the fix is merged, you'll be able to get it from the nightly build with pip install mxnet-cu9x --pre

Member

@anirudh2290 anirudh2290 left a comment


Thanks a lot for catching and fixing this. Great work!

Would you also be able to add these checks in register.py for NameManager and AttrScope:

https://github.com/apache/incubator-mxnet/pull/10833/files#diff-768ec5f1dc18b9993c01568d669d2405

@@ -2451,7 +2451,8 @@ def var(name, attr=None, shape=None, lr_mult=None, wd_mult=None, dtype=None,
handle = SymbolHandle()
check_call(_LIB.MXSymbolCreateVariable(c_str(name), ctypes.byref(handle)))
ret = Symbol(handle)
attr = AttrScope._current.value.get(attr)
with AttrScope():
Member


We can include the "not hasattr" logic from above here too.

Contributor Author


That logic is already included in the context

Member


we can avoid a dict copy this way.

Contributor Author


fair point, I just thought it was cleaner that way and had better separation of concerns. I will update.
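
The "not hasattr" check discussed in this thread can be sketched with the standard library alone. This is an illustration under assumptions, not the code merged in this PR: Scope and current_scope are hypothetical names, and the point is only that the per-thread default is created lazily on first access instead of being assumed to exist.

```python
import threading


class Scope:
    """Hypothetical stand-in for a scope class; not MXNet's AttrScope."""
    _current = threading.local()

    def __init__(self, **attrs):
        self._attrs = attrs

    def get(self, attr):
        # Merge scope-level attributes with the caller's attributes.
        merged = dict(self._attrs)
        merged.update(attr or {})
        return merged


def current_scope():
    # Lazily create the per-thread default instead of assuming it exists.
    if not hasattr(Scope._current, "value"):
        Scope._current.value = Scope()
    return Scope._current.value


def worker(results):
    results.append(current_scope().get({"a": "b"}))


results = []
thread = threading.Thread(target=worker, args=(results,))
thread.start()
thread.join()
```

With the check in place, the lookup succeeds in a freshly started thread, and repeated calls within one thread reuse the same default scope object rather than entering a new scope each time.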

thread = threading.Thread(target=f)
thread.start()
thread.join()
assert status[0], "Failed to create a layer within a thread"
Member


Can you also add the test for running the following functions inside a thread:

  def g():
      data = mx.sym.Variable('data', attr={'a': 'b'})
  def f():
      a = mx.sym.var("a")
      b = mx.sym.var("b")
      a_ = mx.nd.ones((2, 2))
      c_ = a_.copy()
      func1 = (a + b).bind(mx.cpu(), args={'a': a_, 'b': c_})
      func1.forward()[0].wait_to_read()

Contributor Author


Will do 👍

@anirudh2290
Member

@piiswrong can you also take a look...

@anirudh2290 anirudh2290 merged commit 579e376 into apache:master Jun 22, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
…bute 'value'' on distributed processing applications (apache#11332)

* add scope to NameManager

* add AttrScope scope

* adding test

* update NameManager

* Trigger build

* Trigger build

* Add attribute checks for register module