Sharded dataloader causing CI hangs #274
Recently many jobs on CI are running into deadlocks and must be manually killed. This morning I killed a few jobs that had been running for more than 12 hours. I observe that all of them hang in the sharded dataloader tests. @szhengac do you have any idea what could be the reason?

From http://ci.mxnet.io/blue/organizations/jenkins/gluon-nlp/detail/PR-246/10/pipeline:
tests/unittest/train/test_dataloader.py::test_sharded_data_loader Sending interrupt signal to process
After 10s process did not stop

Also http://ci.mxnet.io/blue/organizations/jenkins/gluon-nlp/detail/PR-233/25/pipeline/:
tests/unittest/train/test_dataloader.py::test_sharded_data_loader Sending interrupt signal to process
Terminated
script returned exit code 143

Comments

Do the script tests have the same problem? Is it possible that it is due to some env setup changes? I don't think we had such problems before.
I'm not aware of any change in the environment. It may be due to some other issue, but for some reason CI always shows that it was running the sharded dataloader test when it was killed.
I think the hang also occurs during the test_transformer script tests, as they rely on the ShardedDataloader: ci.mxnet.io/blue/organizations/jenkins/gluon-nlp/detail/PR-275/2/pipeline/22
CI works again after disabling both. The tests should be re-enabled before the 0.4 release.
I think this is due to the recent change to DataLoader (#11908). The sharded DataLoader inherits from DataLoader, and the change possibly introduces some inconsistency. A workaround is to copy the original DataLoader implementation into the sharded DataLoader instead of using inheritance.
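A minimal, hypothetical sketch of the fragility described above (the class names and the _shutdown_workers hook are invented for illustration; this is not the actual MXNet or GluonNLP code): a subclass that relies on a private hook of the parent keeps "working" after the parent is refactored, but the cleanup it depends on has silently become a no-op.

```python
# Invented stand-ins, not the real MXNet/GluonNLP classes.

class DataLoader:
    """Stand-in for the refactored gluon DataLoader (after #11908): worker
    lifetime is now managed elsewhere, so the per-iterator shutdown hook
    that the old implementation provided has become a no-op."""

    def _shutdown_workers(self, workers):
        # Before the refactor this terminated and joined every worker.
        pass


class ShardedDataLoader(DataLoader):
    """Stand-in for the sharded loader, still written against the old
    behaviour: it calls the inherited hook and assumes the workers are
    gone afterwards."""

    def finish_epoch(self, workers):
        self._shutdown_workers(workers)   # silently does nothing now
        return [w["alive"] for w in workers]


workers = [{"alive": True}, {"alive": True}]
print(ShardedDataLoader().finish_epoch(workers))   # [True, True] -- leaked
```

The workaround suggested above, copying the old DataLoader code into the sharded loader instead of inheriting, pins the behaviour the sharded loader was written against and avoids depending on the parent's private internals.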
@zhreshold may have some ideas.
@szhengac is correct: the latest changes in #11908 were not correctly handled by _ShardedMultiWorkerIter, therefore the workers are never actually terminated.
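To see why workers that are never terminated show up as a hang rather than a test failure, here is a standalone Python sketch (not GluonNLP code): non-daemonic worker processes that are never told to exit keep the interpreter from shutting down, which matches the CI symptom above, where the job sits until Jenkins sends SIGTERM (exit code 143 = 128 + 15).

```python
# Standalone illustration of the hang: worker processes blocked on a
# queue prevent the main process from exiting unless they are shut down.
import multiprocessing as mp


def square_worker(jobs, results):
    while True:
        item = jobs.get()
        if item is None:        # sentinel: clean shutdown
            break
        results.put(item * item)


if __name__ == "__main__":
    jobs, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=square_worker, args=(jobs, results))
               for _ in range(2)]
    for w in workers:
        w.start()

    for i in range(4):
        jobs.put(i)
    print(sorted(results.get() for _ in range(4)))   # [0, 1, 4, 9]

    # A correct multi-worker iterator shuts its workers down once it is
    # exhausted. Skip the two loops below and this script never exits:
    # the workers stay blocked on jobs.get(), Python waits for the
    # non-daemonic children at interpreter exit, and a CI job sees a
    # test that "hangs" until it is killed.
    for _ in workers:
        jobs.put(None)
    for w in workers:
        w.join()
```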