Commit fc804c3
Clean up CUDA state between tests (#2296)
This PR fixes the following unit test failure:
```
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.1163s]
Traceback (most recent call last):
File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error is specific to the gfx1101 arch. It comes from an integer overflow: another unit test,
`test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel`,
creates a tensor with a huge numel, which leaves
`torch.cuda.max_memory_reserved()` inflated when
`test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction`
runs afterward. To avoid this, we call `torch.cuda.empty_cache()` and
`torch.cuda.reset_peak_memory_stats()` to clean up CUDA state between tests.
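A minimal sketch of the failure mode (all byte counts below are illustrative, not taken from the actual test or GPU): the fraction test sizes its probe tensor as the allowed budget, `total_memory * fraction`, minus the bytes the caching allocator already holds, so a large reservation leaked by the earlier test drives the computed size negative and `torch.empty` then rejects the negative dimension.

```python
# Illustrative reconstruction of why the probe size goes negative.
# The real test reads these values from torch.cuda device properties
# and memory stats; the numbers here are made up for the example.

TOTAL_MEMORY = 16 * 1024**3  # pretend the GPU has 16 GiB
FRACTION = 0.5               # per-process memory fraction under test

def probe_numel(reserved_bytes: int) -> int:
    """Size (in int8 elements) the test tries to allocate:
    the allowed budget minus what the allocator already reserves."""
    return int(TOTAL_MEMORY * FRACTION) - reserved_bytes

# Prior tests cleaned up after themselves: positive size, allocation works.
print(probe_numel(1 * 1024**3))

# A huge reservation left behind by the previous test: negative size,
# which surfaces as "Trying to create tensor with negative dimension".
print(probe_numel(13 * 1024**3))
```

Resetting the cached and peak reservations between tests keeps the subtraction positive, which is exactly what the added `torch.cuda.empty_cache()` / `torch.cuda.reset_peak_memory_stats()` calls accomplish.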
JIRA: https://ontrack-internal.amd.com/browse/SWDEV-5352951

parent a8d8a21 · commit fc804c3
1 file changed: +3 −0 lines changed (three lines added at diff lines 446–448)