forked from NVIDIA/apex
Open
Description
- test_transducer_joint_pack_relu_dropout (test_transducer_joint.TransducerJointTest)
- test_transducer_joint_relu_dropout (test_transducer_joint.TransducerJointTest)
- test_transducer_joint_vec_pack_relu_dropout (test_transducer_joint.TransducerJointTest)
- test_transducer_joint_vec_relu_dropout (test_transducer_joint.TransducerJointTest)
The four "dropout" unit tests above failed with the following error message:
Traceback (most recent call last):
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 149, in test_transducer_joint_pack_relu_dropout
    self.run_transducer_joint(for_vector_kernel=False, pack_output=True, relu=True, dropout=True)
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 109, in run_transducer_joint
    mask=mask if dropout else None)
  File "/apex/apex/contrib/test/transducer/transducer_ref.py", line 94, in transducer_joint_reference
    h.backward(h_grad)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 402, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 193, in backward
    allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [4, 101, 25, 509]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
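For context, this class of error can be reproduced in isolation. The sketch below is not the code in transducer_ref.py. It assumes, purely for illustration, that a dropout-style mask is applied in place to the output of a ReLU, which autograd has saved for the backward pass. The function names and tensor shapes are hypothetical, and it runs on CPU.

```python
import torch


def relu_then_inplace_mask(x, mask):
    """Hypothetical pattern: mask the ReLU output in place."""
    h = torch.relu(x)
    # ReLU saves its output for backward; mul_ bumps that tensor's version
    # counter, so autograd later detects that the saved value was modified.
    h.mul_(mask)
    return h


def relu_then_out_of_place_mask(x, mask):
    """Same math, but the ReLU output is left untouched."""
    h = torch.relu(x)
    return h * mask


x = torch.randn(4, 5, requires_grad=True)
mask = (torch.rand(4, 5) > 0.5).float()
grad_out = torch.ones(4, 5)

# In-place variant: backward is expected to raise a RuntimeError.
try:
    relu_then_inplace_mask(x, mask).backward(grad_out)
    inplace_error = None
except RuntimeError as err:
    inplace_error = str(err)

# Out-of-place variant: backward completes and populates x.grad.
x.grad = None
relu_then_out_of_place_mask(x, mask).backward(grad_out)
out_of_place_grad = x.grad.clone()
```

If the reference implementation does something similar, replacing the in-place op with its out-of-place form (or cloning the ReLU output first) would be one way to avoid the error. Whether that matches the actual cause here is unverified.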
- test_transducer_joint_pack (test_transducer_joint.TransducerJointTest)
- test_transducer_joint_pack_relu (test_transducer_joint.TransducerJointTest)
The two unit tests above failed with the following error message:
Traceback (most recent call last):
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 137, in test_transducer_joint_pack_relu
    self.run_transducer_joint(for_vector_kernel=False, pack_output=True, relu=True, dropout=False)
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 115, in run_transducer_joint
    self.assertTrue(torch.allclose(f_grad_ref, f_grad_tst, atol=1e-5, rtol=1e-5))
AssertionError: False is not true
These failures are not reproducible locally with the Docker image rocm/pytorch:latest (rocm5.2_ubuntu20.04_py3.7_pytorch_staging). We may need to mark these tests as flaky in the future or adjust the tolerance for ROCm.
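As a rough sketch of the tolerance option, the helper below picks looser tolerances on ROCm builds and reports how far two tensors actually differ, which is more informative than a bare `False is not true`. The helper names and the ROCm tolerance values (1e-3) are placeholders rather than tuned or proposed numbers. The check assumes `torch.version.hip` is set only on ROCm builds.

```python
import torch


def is_rocm_build():
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    return getattr(torch.version, "hip", None) is not None


def grad_tolerances(rocm_atol=1e-3, rocm_rtol=1e-3):
    """Return (atol, rtol); the ROCm defaults here are hypothetical."""
    if is_rocm_build():
        return rocm_atol, rocm_rtol
    return 1e-5, 1e-5  # tolerances currently used by the test


def describe_mismatch(ref, tst):
    """Return the largest absolute and relative differences."""
    ref32, tst32 = ref.float(), tst.float()
    diff = (ref32 - tst32).abs()
    rel = diff / ref32.abs().clamp_min(1e-12)
    return diff.max().item(), rel.max().item()


# Toy data standing in for f_grad_ref / f_grad_tst.
ref = torch.tensor([1.0, 2.0, 3.0])
tst = ref + torch.tensor([0.0, 1e-4, 0.0])

max_abs, max_rel = describe_mismatch(ref, tst)
strict_ok = torch.allclose(ref, tst, atol=1e-5, rtol=1e-5)
loose_ok = torch.allclose(ref, tst, atol=1e-3, rtol=1e-3)
atol, rtol = grad_tolerances()
```

Logging `max_abs` and `max_rel` when the assertion fails would also show whether the mismatch is a small numerical difference or a real bug before any tolerance is changed.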