
ROCm PyTorch unit tests status

jithunnair-amd edited this page Oct 15, 2018 · 45 revisions

Summary of unit tests:

Legend: N: Unit test group name, T: Total tests, F: Failed, E: Errors, S: Skipped (ROCm only), SG: Skipped on GPUs, EF: Expected failures, P: Passed, PR: Pass rate [P*100/(T-EF-SG)], CM: Comments/Modifications

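| N | T | F | E | S | SG | EF | P | PR | CM |
|---|---|---|---|---|---|---|---|---|---|
| test_autograd | 870 | 0 | 0 | 11 | 0 | 0 | 859 | 99% | |
| test_c10d | | | | | | | | | Not enabled yet (ready for testing?) |
| test_cpp_extensions | | | | | | | | | Not enabled yet |
| test_cuda | 2047 | 0 | 0 | 263 | 0 | 0 | 1784 | 87% | |
| test_dataloader | 44 | 0 | 0 | 1 | 2 | 0 | 41 | 100% | |
| test_distributed | | | | | | | | | Not enabled yet |
| test_distributions | 190 | 0 | 0 | 4 | 0 | 0 | 186 | 98% | |
| test_indexing | 46 | 0 | 0 | 0 | 0 | 0 | 46 | 100% | |
| test_jit | 1254 | 0 | 0 | 37 | 14 | 2 | 1201 | 97% | |
| test_multiprocessing | | | | | | | | | Not enabled yet |
| test_nccl | | | | | | | | | Not enabled yet |
| test_nn | 1224 | 0 | 0 | 40 | 114 | 2 | 1068 | 96% | |
| test_optim | 34 | 0 | 0 | 2 | 0 | 0 | 32 | 94% | |
| test_sparse | 611 | 0 | 0 | 140 | 18 | 0 | 453 | 76% | |
| test_torch | 389 | 0 | 0 | 20 | 0 | 0 | 369 | 95% | |
| test_utils | 19 | 0 | 0 | 1 | 1 | 1 | 16 | 94% | |
| TOTAL | 6728 | | | | 149 | 5 | 6055 | 92% | |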
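
The pass rate formula from the legend, P*100/(T-EF-SG), can be sketched in a few lines of Python. The function name `pass_rate` and its parameter names are made up for this illustration; the input values are taken from the test_sparse and TOTAL rows of the summary.

```python
def pass_rate(total, skipped_on_gpu, expected_failures, passed):
    """Pass rate in percent, following the legend: P*100/(T-EF-SG)."""
    # Hypothetical helper, not part of the PyTorch test suite.
    denominator = total - expected_failures - skipped_on_gpu
    if denominator <= 0:
        raise ValueError("no tests left after excluding EF and SG")
    return passed * 100.0 / denominator


# test_sparse row: T=611, SG=18, EF=0, P=453
print(round(pass_rate(611, 18, 0, 453)))     # prints 76
# TOTAL row: T=6728, SG=149, EF=5, P=6055
print(round(pass_rate(6728, 149, 5, 6055)))  # prints 92
```

Note that tests skipped on ROCm only (S) stay in the denominator, so every ROCm-specific skip lowers the pass rate.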

Details of failing unit tests:

  • test_autograd

Skip due to seg fault:
test_pin_memory at aten/src/ATen/RegisterCUDA.cpp:30 (JMD: works for me)
test_set_requires_grad_only_for_floats_cuda

Skip due to undefined symbol hiprngMakeMTGP32Constants:
test_rnn_backward_to_input_but_not_parameters_cuda
test_requires_grad_factory (failed in CI)

Skip due to 'Memory access fault' (Failed in CI):
test_inputbuffer_add_multigpu
test_type_conversions
test_unused_output_gpu
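
As a rough illustration of how skips like the ones listed above might be expressed in a test file, here is a minimal sketch of a ROCm-only skip decorator. The decorator name `skip_if_rocm`, its `reason` argument, the `ExampleTest` class, and the use of the `PYTORCH_TEST_WITH_ROCM` environment variable are assumptions made for this example; the mechanism actually used in the PyTorch test suite may differ.

```python
import functools
import os
import unittest

# Assumption: ROCm test runs are flagged through this environment variable.
TEST_WITH_ROCM = os.getenv("PYTORCH_TEST_WITH_ROCM", "0") == "1"


def skip_if_rocm(reason):
    """Return a decorator that skips the wrapped test on ROCm runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if TEST_WITH_ROCM:
                raise unittest.SkipTest("skipped on ROCm: " + reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


class ExampleTest(unittest.TestCase):
    # Hypothetical stand-in for a test such as test_type_conversions.
    @skip_if_rocm("memory access fault")
    def test_type_conversions(self):
        self.assertEqual(int(3.0), 3)
```

Recording the reason in the decorator keeps the skip list in the test file in sync with a status page like this one.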

  • test_dataloader

Skip due to hang:
test_manager_unclean_exit (due to leaked semaphores (?)) (JMD: according to comments, this seems to be a Python 2.7 issue)

  • test_jit

Skip due to "RuntimeError: cannot compile a CUDA fusion group, CUDA is not enabled":
test_cpp
test_exp
test_fusion_distribute
test_lstm_fusion_concat
test_lstm_fusion_cuda
test_relu
test_tensor_number_math_cuda
test_comparison_ge_le
test_comparison_gt_lt
test_concat_fusion
test_ge_cuda
test_traced_module
JMD: fixing these will require us to enable CUDAFusionFunction, which appears to call nvcc explicitly

  • test_optim

Skip due to hang:
test_adamax (JMD: works for me but fails on CI)
test_rprop - hangs in a thrust kernel

  • test_torch

Skip due to memory access page fault:
test_topk_noncontiguous_gpu (a null pointer appears to be passed to the bitonic sort, bitonicSortKVInPlace) (JMD: fixed through gather changes, in branch)

Skip due to seg fault:
test_half_tensor_cuda (due to build/aten/src/ATen/CUDAHalfType.cpp:2263)
test_print (due to build/aten/src/ATen/CUDAHalfType.cpp:151 fill)

Skip due to AssertionError:
test_norm_cuda (failing with "dim reduction failed for 0-norm")

Skip due to hang:
test_empty_full

Skip due to cublas runtime error:
test_blas_alpha_beta_empty
test_blas_empty

Skip due to RuntimeError:
test_pairwise_distance_empty (failing with "RuntimeError: cuda runtime error (1011) : hipErrorInvalidValue")
test_tensor_factories_empty (failing with "RuntimeError: cuda runtime error (1011) : hipErrorInvalidValue")
test_tensor_shape_empty (failing with "RuntimeError: cuda runtime error (1011) : hipErrorInvalidValue")

  • test_cuda

Skip due to assertion error:

8 test_\*Tensor_nonzero (Thrust issue; gives correct result for <=960 threads)  
16 test_\*Tensor_prod\*dim + 16 test_\*Tensor_sum\*dim + 4 test_\*Tensor_norm_3\*dim (issue with kernelReduceContigDim and kernelReduceNoncontigDim_shared)  
40 test_\*Tensor_sort\* + 24 test_\*Tensor_topk\* (Memory access fault due to bitonicSortKVInPlace (alternately fails with assertion error when not access faulting))  
2 test_DoubleTensor_mean\*dim (Native elementwise_kernel with div_constant_impl<double>)  
8 test_\*Tensor_mvlgamma\* (Native elementwise_kernel with div_add_impl<>)  
12 test_\*Tensor_renorm\* (THCTensor_kernel_renorm ?)  
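
The counted wildcard entries in this list (for example, 8 tests matching test_\*Tensor_nonzero) can be read as shell-style patterns over test names. The sketch below shows one way such a pattern could be matched against a list of names with Python's standard `fnmatch` module. The helper `select_tests` and the sample name list are hypothetical and only illustrate the notation.

```python
from fnmatch import fnmatchcase


def select_tests(test_names, pattern):
    """Return the names that match a shell-style pattern, case-sensitively."""
    return [name for name in test_names if fnmatchcase(name, pattern)]


# Hypothetical sample of test names; the real suite has many more.
names = [
    "test_FloatTensor_nonzero",
    "test_HalfTensor_nonzero",
    "test_FloatTensor_sort",
    "test_nvtx",
]

matched = select_tests(names, "test_*Tensor_nonzero")
print(matched)  # prints ['test_FloatTensor_nonzero', 'test_HalfTensor_nonzero']
```

Comparing the number of matches against the count in front of each entry is a quick way to check that a skip list covers every tensor type.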

Skip due to runtime error:

test_fft_ifft_rfft_irfft (due to undefined symbol: hipfftCreate)  
test_from_sequence + test_randperm_cuda (due to undefined symbol: \_ZN12_GLOBAL__N_112__float2halfEf)  
test_DoubleTensor_inverse + test_FloatTensor_inverse + test_btrifact + test_btrisolve (due to forced(?) rocblas internal error)  
test_events + test_caching_pinned_memory + test_record_stream (due to 'NoneType' object has no attribute 'cudaEventCreateWithFlags')  
test_streams (due to 'NoneType' object has no attribute 'cudaStreamQuery')  
test_nvtx (due to "undefined symbol: nvtxMarkA")  
test_bincount_cuda (due to hipErrorInvalidValue)  
test_trtrs + test_symeig + test_pinverse + test_matrix_rank + 2 test_gesv\* + test_det_logdet_slogdet + 12 test_(Float|Double)Tensor_svd\* + 8 test_(Float|Double)Tensor_qr\* + 2 test_(Float|Double)Tensor_geqrf + 2 test_(Float|Double)Tensor_eig_with_eigvec (due to no MAGMA library detected)  
12 test_HalfTensor_(addbmm\*|addmm\*|addr\*|baddbmm\*) (due to cublas runtime error in THCBlas.cu)

Skip due to hang:

2 test_FloatTensor_mean\*dim (Native elementwise_kernel with div_constant_impl<float>)  
4 test_\*Tensor_add + 4 test_\*Tensor_add_ + 4 test_\*Tensor_sub + 4 test_\*Tensor_sub_ (Native elementwise_kernel with add_kernel_impl; float, double, int and long tensor tests pass for these)  
10 test_\*Tensor_div\* (Native elementwise_kernel with div_constant_impl; double, int and long tensor tests pass for these)  
8 test_\*Tensor_mul\* (Native elementwise_kernel with mul_kernel_impl; float, double, int and long tensor tests pass for these)  
8 test_\*Tensor_put_ + test_broadcast (TensorPutOp bug)  
8 test_\*Tensor_take + 3 test_advancedindex\* + test_index + test_multinomial (TensorTakeOp bug)  

Skip due to undefined symbol (float2half and half2float):

6 test_HalfTensor_addmv\*  
2 test_HalfTensor_baddbmm  
1 test_HalfTensor_max  
1 test_HalfTensor_min  
1 test_tiny_half_norm_

  • test_sparse

Total 158 tests skipped as of commit 2834bc1.  
Total 66 tests passing out of the above with white rabbit.  
Total 92 tests still failing:
* 34 of the remaining 107 tests fail due to an atomicAdd issue for double (called from indexAddSmallIndex/indexAddLargeIndex). The fix will need ROCm 1.9.
* 3 tests hang (test_topk_noncontiguous_gpu, test_log1p (2 versions))
* 16 tests skipped due to being CPU-only tests
* 12 tests skipped due to being GPU-only tests
* 5 tests failing with hipErrorInvalidValue caused by a "hipMemsetAsync (0, 0, 0, stream:)" call
* 18 tests skipped due to "numpy not found"
* 6 tests skipped due to "Scipy not found"
* 2 tests failed due to "cublas runtime error : an invalid numeric value was used"
* 6 tests skipped due to "not implemented yet"
* 1 test skipped due to "Testing torch.load on CPU-only machine"
* 1 test skipped due to "spawn start method is not supported in Python 2, but we need it for for testing failure case for CPU RNG on Windows"
* 1 test seg faulted
* 1 test skipped due to "flush_denormal not supported"
* 1 test skipped due to "librosa not found"

A total of 23 out of 24 tests (18 numpy + 6 scipy) passed when run in a build that had scipy and numpy installed. The failing test is test_norm_cuda, with this assertion error:

File "/home/rocm-user/pytorch__ROCM__master__4/test/test_torch.py", line 852, in _test_norm  
self.assertTrue(np.allclose(res, expected), "dim reduction failed for {}-norm".format(p))  
AssertionError: dim reduction failed for inf-norm
