Fix flaky test test_deconvolution #11630
Conversation
Sorry if I missed something, I'm on my phone right now.
Jenkinsfile
Outdated
'Python3: MKLDNN-GPU-NOCUDNN': {
  node('mxnetlinux-gpu') {
    ws('workspace/ut-python3-mkldnn-gpu-nocudnn') {
      withEnv(['CUDNN_DISABLED=ON']) {
Please make a new script to call the unit tests in the runtime functions (you might just nest the calls to prevent copy and paste). Otherwise the calls will not be reproducible, and the environment variables will not be available inside the Docker container.
I'm on my phone, so please excuse me if the question is answered below: why do we have to set that variable if we already compile without cudnn?
Thanks for the info. I have fixed it in PR: #11470. We are removing the C API in that PR so there is no way of finding out whether cudnn is enabled.
Couldn't we just always catch the error and skip it if it contains the message about cudnn not being available?
Well, that approach would also work, but this is cleaner. For example, someone adding a new op which doesn't support cudnn off may use a different wording for the error message.
Well then the error would be made visible in CI and the person would have to adjust the error message, right? Ideally, we'd have standardized error messages (e.g. error codes).
I don't know if this is cleaner, because that way you are catching all errors independently of whether they are actually related to cudnn. Imagine somebody makes a change to the cudnn behaviour but there is still a bug: the error then gets suppressed and we don't notice it. I'd rather have detailed and specific error handling than a global try-catch. What do you think?
Well, this test will only be ignored if the decorator is added, so only when cudnn is off and only for operators that don't support that configuration. People making changes to cudnn behavior should not be affected by this. It's only a global try/catch when cudnn is disabled, for ops that don't support the configuration.
tests/python/unittest/common.py
Outdated
@make_decorator(orig_test)
def test_new(*args, **kwargs):
    cudnn_disabled = (os.getenv('CUDNN_DISABLED') == "ON")
    if not cudnn_disabled or mx.context.current_context().device_type == 'cpu':
Ah that's the environment variable. Could we give it a name that clearly shows that this environment variable is only for test purposes?
I have renamed it to CUDNN_OFF_TEST_ONLY in PR #11470. Let us move the discussion of the cudnn-off tests to that PR.
tests/python/unittest/common.py
Outdated
if not cudnn_disabled or mx.context.current_context().device_type == 'cpu':
    orig_test(*args, **kwargs)
else:
    if assertion_error:
When do we want to differentiate between assertion errors? I think we should always handle them, no matter what.
This is to differentiate between real failures and the ops that don't support running with cudnn disabled. For example, some ops, when used with cudnn disabled, would just throw an exception saying not supported. Others would proceed but compute wrong results and then throw an assertion error.
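For context, here is a minimal sketch of what the full decorator could look like. This is illustrative only: the decorator name `skip_if_cudnn_disabled`, the use of `functools.wraps` instead of nose's `make_decorator`, and the exact exception types are assumptions, not the code in this PR.

```python
import functools
import os

import mxnet as mx
from mxnet.base import MXNetError


def skip_if_cudnn_disabled(assertion_error=False):
    """Run the test normally unless cudnn is turned off on GPU; in that case,
    treat 'not supported' failures (and, optionally, assertion errors) as a skip."""
    def decorator(orig_test):
        @functools.wraps(orig_test)
        def test_new(*args, **kwargs):
            cudnn_disabled = (os.getenv('CUDNN_DISABLED') == 'ON')
            if not cudnn_disabled or mx.context.current_context().device_type == 'cpu':
                # cudnn is available, or we are on CPU: run the test as usual.
                orig_test(*args, **kwargs)
            else:
                errors = (MXNetError,)
                if assertion_error:
                    # Some ops do not raise; they silently compute wrong results,
                    # which then surfaces as an AssertionError inside the test.
                    errors = errors + (AssertionError,)
                try:
                    orig_test(*args, **kwargs)
                except errors as err:
                    print('Skipped: op does not support cudnn disabled (%s)' % str(err))
        return test_new
    return decorator
```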
I see, thanks for elaborating. It seems that you have revealed a bug in our backend in the handling of missing cudnn.
I don't think this is what should happen. If a user uses an operator even though it's not supported, it should not produce a wrong calculation but rather throw the error properly. This method would mask a problem that a user would silently encounter. We should instead investigate why the error is not being thrown and fix that.
Also, is there no CPU implementation of these operators we could fall back to instead?
Yep, the issue was uncovered as part of the PR and is documented here: #11568. As I have mentioned in #11470, this is not something we should be blocking the PR for. This issue could have been there for a long time. It's possible that there is user code depending on the buggy behavior, and raising an exception would be a breaking change for those users. It's better to fix the issue with cudnn enabled rather than throwing an exception saying cudnn is not supported, at least for 1.3.
We still need to test it when cudnn is enabled.
We currently have a zero-tolerance policy for faulty tests and deactivate them temporarily, taking the reduced test coverage as a pill we have to swallow. It is a release requirement that all tests have to be enabled, so don't worry: it won't stay disabled for long.
For Amazon employees: I have received information that faulty and flaky tests have to be handled as a sev2.5.
Why disable the test when it can be avoided? If it is for visibility, then we already have an open issue. We can also call it a disabled test, since it is disabled when cudnn is off.
Sure, as you prefer. Just wanted to avoid having this hack in there, but I don't mind. Could you please add a comment in that test, linking to the issue and stating that the test is disabled when running without cudnn?
will do
Please add the comment about the assertion_error being temporary and link the associated issue.
Please don't forget to close the issues associated with the disabled tests after merging.
Left a small comment.
Also, please update the issue description - the Checklist is left unchanged from the template... you should either tick the relevant boxes or remove them altogether if not relevant. Thanks for fixing the issue!
src/operator/linalg_impl.h
Outdated
A.stride_, &beta, C.dptr_, C.stride_)) \
}

#if CUDA_VERSION >= 7050
A comment here explaining the #if directive would be helpful for others (and yourself) in the future. Can we add it please?
I have added the comment, and also removed the checklist from the current PR description.
While the changes all seem valid and fine, this still remains a mystery, as there should be no difference between the two APIs if all operands are float32 (as anirudh stated above). A bit confused though, as the original PR description mentions float16 (why?).
@asmushetzel sorry about the confusion. To clarify, the documentation states that both APIs should behave exactly the same when all tensors have float32 dtype. My guess is that the difference in behavior could be because cublasSgemm is converting tensors to float16 dtype (a bug?) for computation, which could be the reason for the precision issues. I have asked the Nvidia folks for input but haven't heard back from them yet.
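As a rough back-of-the-envelope illustration of the hypothesis above (plain numpy, not cuBLAS; the reduction length and values are made up), accumulating a float32 dot product in float16 can lose several digits, which is the order of error seen in the test:

```python
import numpy as np

rng = np.random.RandomState(0)
k = 4096  # long reduction dimension, as in an im2col-style GEMM
a = rng.uniform(-1, 1, size=k).astype(np.float32)
b = rng.uniform(-1, 1, size=k).astype(np.float32)

exact = np.dot(a.astype(np.float64), b.astype(np.float64))
fp32 = np.dot(a, b)

# Accumulate explicitly in float16 so every intermediate value is rounded to half precision.
acc = np.float16(0)
for x, y in zip(a.astype(np.float16), b.astype(np.float16)):
    acc = np.float16(acc + x * y)

print('fp32 error:', abs(fp32 - exact))         # typically around 1e-5
print('fp16 error:', abs(float(acc) - exact))   # typically orders of magnitude larger
```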
I will wait until the end of this week to hear back from the Nvidia folks before merging this PR.
* Replace cublassgemm with cublassgemmex for >= 7.5
* Add comment for cublassgemmex

Remove fixed seed for test_sparse_nd_save_load (apache#11920)
* Remove fixed seed for test_sparse_nd_save_load
* Add comments related to the commit

Corrections to profiling tutorial (apache#11887)
Corrected a race condition with stopping profiling. Added mx.nd.waitall to ensure all operations have completed, including GPU operations that might otherwise be missing. Also added alternative code for context selection GPU vs CPU, that had an error before on machines with nvidia-smi.

Fix image classification scripts and Improve Fp16 tutorial (apache#11533)
* fix bugs and improve tutorial
* improve logging
* update benchmark_score
* Update float16.md
* update link to dmlc web data
* fix train cifar and add random mirroring
* set aug defaults
* fix whitespace
* fix typo
Description
This is to fix the flaky test: #10973. Prereq PR: #11470.
cublasSgemm and cublasDgemm are currently used for linear algebra operations when cudnn is disabled for the Convolution and Deconvolution operators. The test_deconvolution test on GPU fails on the input gradient comparison for the conv and deconv operators. When looking at the failures, the differences in the gradients are as big as 0.1. Replacing cublasSgemm with cublasSgemmEx, while keeping all other code the same, makes the failure go away. Also, the failure doesn't happen on CPU with openblas or MKLDNN, or on GPU with cudnn enabled. If both APIs worked as documented, both cases should have seen the failure. My guess is that cublasSgemm is converting float32 to float16 before doing the computation, causing the precision issues. Digging through the cublas documentation, I couldn't find enough information on the computation performed by cublasSgemm. Please see: http://lutgw1.lunet.edu/cuda/pdf/CUBLAS_Library.pdf, sections 2.8.11 (cublasSgemmEx) and 2.7.1 (cublasSgemm).
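For readers who want to see the kind of comparison that exposes the gap, below is a hypothetical Python sketch (not the actual test_deconvolution code; shapes, tolerances, and the helper name are made up) that compares deconvolution input gradients on CPU against a GPU build, which would be built with cudnn disabled to reproduce the issue:

```python
import numpy as np
import mxnet as mx

rng = np.random.RandomState(0)
data_np = rng.uniform(-1, 1, size=(2, 8, 10, 10)).astype(np.float32)
weight_np = rng.uniform(-1, 1, size=(8, 4, 3, 3)).astype(np.float32)


def deconv_input_grad(ctx):
    # Build a small Deconvolution graph and return the gradient w.r.t. its input.
    data = mx.sym.Variable('data')
    out = mx.sym.Deconvolution(data=data, kernel=(3, 3), num_filter=4,
                               no_bias=True, name='deconv')
    args = {'data': mx.nd.array(data_np, ctx=ctx),
            'deconv_weight': mx.nd.array(weight_np, ctx=ctx)}
    grads = {k: mx.nd.zeros(v.shape, ctx=ctx) for k, v in args.items()}
    exe = out.bind(ctx, args=args, args_grad=grads)
    exe.forward(is_train=True)
    exe.backward(mx.nd.ones(exe.outputs[0].shape, ctx=ctx))
    return grads['data'].asnumpy()


cpu_grad = deconv_input_grad(mx.cpu())
gpu_grad = deconv_input_grad(mx.gpu(0))  # requires a GPU build

# With cublasSgemm the difference was reportedly as large as ~0.1; with
# cublasSgemmEx (or with cudnn enabled) it stays within normal float32 tolerance.
np.testing.assert_allclose(cpu_grad, gpu_grad, rtol=1e-3, atol=1e-3)
```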
@DickJC123 @ptrendx any input on whether there should be any difference in the behavior of the two APIs when all tensors have float32 dtype?
The important changes are in the file src/operator/linalg_impl.h.