
Sudden huge runtime for MPS tests on osx-arm #462

@h-vetinari

Description


When trying to clean up the test skips on osx in #353, I noticed something weird going on with MPS. Look at the timestamps vs. the percentages:

2025-12-02T01:20:53.4025390Z TEST END: /Users/runnerx/miniforge3/conda-bld/osx-arm64/pytorch-cpu-2.9.1-cpu_generic_hcc7c195_0.conda
2025-12-02T01:20:53.4200610Z TEST START: /Users/runnerx/miniforge3/conda-bld/osx-arm64/pytorch-tests-2.9.1-cpu_generic_py313_ha249dfb_0.conda         # first `pytorch-tests` output being tested
2025-12-02T01:21:50.3558240Z [gw0] [  0%] PASSED [0.0048s] test/test_autograd.py::TestAutograd::test_access_saved_tensor_twice_without_recomputation_works 
2025-12-02T01:22:24.8448650Z [gw0] [ 10%] PASSED [0.0008s] test/test_custom_ops.py::TestGenerateOpcheckTests::test_generate_repro_no_save_data 
2025-12-02T01:22:31.8731700Z [gw0] [ 20%] PASSED [0.0027s] test/test_linalg.py::TestLinalgCPU::test_addmm_mv_transpose_a_True_transpose_b_False_alpha_1_0_beta_1_0_cpu_float32 
2025-12-02T01:23:25.2647630Z [gw0] [ 30%] PASSED [0.0010s] test/test_modules.py::TestModuleCPU::test_to_empty_nn_AvgPool1d_swap_True_cpu_float32 
2025-12-02T01:24:39.7525120Z [gw0] [ 40%] PASSED [0.0047s] test/test_modules.py::TestModuleMPS::test_factory_kwargs_nn_AvgPool2d_mps_float32 
2025-12-02T01:25:40.4699770Z [gw0] [ 50%] PASSED [0.0052s] test/test_modules.py::TestModuleMPS::test_factory_kwargs_nn_Softmin_mps_float16 
2025-12-02T01:39:27.8740250Z [gw0] [ 51%] PASSED [94.4943s] test/test_modules.py::TestModuleMPS::test_forward_nn_GRU_train_mode_mps_float16 
2025-12-02T02:27:05.3767080Z [gw1] [ 52%] PASSED [13.7502s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_GELU_mps_float16 
2025-12-02T03:56:01.5618180Z [gw1] [ 53%] PASSED [26.7351s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_LeakyReLU_mps_float16 
2025-12-02T04:10:42.2153250Z [gw0] [ 54%] PASSED [0.0066s] test/test_modules.py::TestModuleMPS::test_if_train_and_eval_modes_differ_nn_KLDivLoss_mps_float16 
2025-12-02T05:03:19.4796450Z [gw0] [ 55%] PASSED [4.2550s] test/test_modules.py::TestModuleMPS::test_memory_format_nn_AdaptiveMaxPool1d_mps_float16 
2025-12-02T05:29:56.3872950Z [gw0] [ 56%] PASSED [9.0542s] test/test_modules.py::TestModuleMPS::test_memory_format_nn_LeakyReLU_mps_float16 
2025-12-02T06:01:27.7234130Z [gw0] [ 57%] SKIPPED [0.0013s] test/test_modules.py::TestModuleMPS::test_multiple_device_transfer_nn_AvgPool1d_mps_float32 
2025-12-02T06:07:00.7160070Z [gw0] [ 58%] PASSED [332.8067s] test/test_modules.py::TestModuleMPS::test_to_nn_MultiheadAttention_eval_mode_swap_False_set_grad_True_mps_float32 
2025-12-02T07:27:17.9352830Z [gw0] [ 59%] PASSED [0.0012s] test/test_modules.py::TestModuleMPS::test_to_nn_Sigmoid_swap_False_set_grad_True_mps_float32 
2025-12-02T11:25:27.6268670Z [gw1] [ 60%] PASSED [0.0013s] test/test_modules.py::TestModuleMPS::test_repr_nn_ConstantPad2d_mps_float16 
# timeout
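The mismatch above can be checked mechanically: the per-test durations pytest reports are tiny (milliseconds), yet the wall-clock gaps between log lines are enormous. A small sketch (not part of the feedstock; it assumes the `-v` + xdist log shape quoted above) that compares the two:

```python
import re
from datetime import datetime

# Assumed log shape: "<ISO timestamp>Z [gwN] [ NN%] STATUS [D.DDDDs] <test id>"
LINE_RE = re.compile(r"^(?P<ts>\S+)Z \[gw\d+\] \[\s*\d+%\] \w+ \[(?P<dur>[\d.]+)s\]")

def wallclock_vs_reported(lines):
    """Yield (gap_since_previous_line_s, reported_duration_s) per matching line."""
    prev = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        # The CI timestamps carry 7 fractional digits; trim to 6 for %f
        ts = datetime.strptime(m["ts"][:26], "%Y-%m-%dT%H:%M:%S.%f")
        dur = float(m["dur"])
        if prev is not None:
            yield (ts - prev).total_seconds(), dur
        prev = ts

log = [
    "2025-12-02T07:27:17.9352830Z [gw0] [ 59%] PASSED [0.0012s] test/test_modules.py::TestModuleMPS::test_to_nn_Sigmoid_swap_False_set_grad_True_mps_float32",
    "2025-12-02T11:25:27.6268670Z [gw1] [ 60%] PASSED [0.0013s] test/test_modules.py::TestModuleMPS::test_repr_nn_ConstantPad2d_mps_float16",
]
for gap, dur in wallclock_vs_reported(log):
    # a ~4h wall-clock gap dwarfing a ~1ms reported duration
    print(f"gap={gap:.0f}s, reported={dur}s")
```

For the two lines above this yields a gap of roughly four hours against a reported duration of about a millisecond, i.e. almost all the time is spent outside the measured test call.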

The entirety of this extreme slowdown seems to happen in TestModuleMPS.

This group of tests in particular looks like it's almost uniformly hitting the 20-minute timeout:

Group of very long-running tests

In contrast to the above, this excerpt is unfiltered (apart from removing the duplicated start/end line that the logs emit for each test):

2025-12-02T07:27:18.5644790Z [gw0] [ 59%] SKIPPED [0.0018s] test/test_modules.py::TestModuleMPS::test_to_nn_Threshold_swap_True_set_grad_False_mps_float32 
2025-12-02T07:27:18.5673200Z [gw0] [ 59%] SKIPPED [0.0016s] test/test_modules.py::TestModuleMPS::test_to_nn_Threshold_swap_True_set_grad_True_mps_float32 
2025-12-02T07:34:22.6295920Z [gw1] [ 59%] FAILED [1200.2075s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerDecoderLayer_mps_float16 
2025-12-02T07:40:28.2223580Z [gw0] [ 59%] PASSED [789.6524s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerDecoderLayer_swap_False_set_grad_False_mps_float32 
2025-12-02T07:55:46.5797120Z [gw1] [ 59%] FAILED [1200.5276s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerDecoderLayer_mps_float32 
2025-12-02T08:00:20.1754060Z [gw0] [ 59%] PASSED [1191.9146s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerDecoderLayer_swap_False_set_grad_True_mps_float32 
2025-12-02T08:16:07.5761810Z [gw1] [ 59%] FAILED [1200.1559s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoderLayer_eval_mode_mps_float16 
2025-12-02T08:20:29.3480650Z [gw0] [ 59%] FAILED [1200.2860s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerDecoderLayer_swap_True_set_grad_False_mps_float32 
2025-12-02T08:37:08.6361450Z [gw1] [ 59%] FAILED [1200.4452s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoderLayer_eval_mode_mps_float32 
2025-12-02T08:40:17.3998890Z [gw0] [ 59%] PASSED [1188.0639s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerDecoderLayer_swap_True_set_grad_True_mps_float32 
2025-12-02T08:44:55.4171130Z [gw0] [ 59%] PASSED [278.0122s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoderLayer_eval_mode_swap_False_set_grad_False_mps_float32 
2025-12-02T08:52:03.0410990Z [gw0] [ 59%] PASSED [427.6176s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoderLayer_eval_mode_swap_False_set_grad_True_mps_float32 
2025-12-02T08:57:35.0610220Z [gw1] [ 59%] FAILED [1200.1559s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoderLayer_train_mode_mps_float16 
2025-12-02T08:59:02.0464050Z [gw0] [ 59%] PASSED [419.0094s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoderLayer_eval_mode_swap_True_set_grad_False_mps_float32 
2025-12-02T09:05:58.5958150Z [gw0] [ 59%] PASSED [416.5351s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoderLayer_eval_mode_swap_True_set_grad_True_mps_float32 
2025-12-02T09:10:56.9110220Z [gw0] [ 59%] PASSED [298.3005s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoderLayer_train_mode_swap_False_set_grad_False_mps_float32 
2025-12-02T09:17:58.3751530Z [gw1] [ 59%] FAILED [1200.5472s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoderLayer_train_mode_mps_float32 
2025-12-02T09:18:44.1535940Z [gw0] [ 59%] PASSED [467.2481s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoderLayer_train_mode_swap_False_set_grad_True_mps_float32 
2025-12-02T09:26:29.5150820Z [gw0] [ 59%] PASSED [465.3572s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoderLayer_train_mode_swap_True_set_grad_False_mps_float32 
2025-12-02T09:34:08.8447110Z [gw0] [ 59%] PASSED [459.3250s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoderLayer_train_mode_swap_True_set_grad_True_mps_float32 
2025-12-02T09:38:43.6662340Z [gw0] [ 59%] PASSED [274.8093s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoder_eval_mode_swap_False_set_grad_False_mps_float32 
2025-12-02T09:39:11.4926780Z [gw1] [ 59%] FAILED [1200.3873s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoder_eval_mode_mps_float16 
2025-12-02T09:47:59.7064400Z [gw0] [ 59%] PASSED [556.0775s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoder_eval_mode_swap_False_set_grad_True_mps_float32 
2025-12-02T09:57:49.4321400Z [gw0] [ 59%] PASSED [589.7299s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoder_eval_mode_swap_True_set_grad_False_mps_float32 
2025-12-02T09:59:40.0763300Z [gw1] [ 59%] FAILED [1200.4937s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoder_eval_mode_mps_float32 
2025-12-02T10:07:03.4547830Z [gw0] [ 59%] PASSED [554.0258s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoder_eval_mode_swap_True_set_grad_True_mps_float32 
2025-12-02T10:12:12.9435740Z [gw0] [ 59%] PASSED [309.4906s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoder_train_mode_swap_False_set_grad_False_mps_float32 
2025-12-02T10:20:51.8442370Z [gw1] [ 59%] FAILED [1200.2544s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoder_train_mode_mps_float16 
2025-12-02T10:22:36.8392400Z [gw0] [ 59%] PASSED [625.8716s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoder_train_mode_swap_False_set_grad_True_mps_float32 
2025-12-02T10:33:17.9865430Z [gw0] [ 59%] PASSED [639.1128s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoder_train_mode_swap_True_set_grad_False_mps_float32 
2025-12-02T10:41:39.4753490Z [gw1] [ 59%] FAILED [1200.3611s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoder_train_mode_mps_float32 
2025-12-02T10:43:40.8275870Z [gw0] [ 59%] PASSED [622.8461s] test/test_modules.py::TestModuleMPS::test_to_nn_TransformerEncoder_train_mode_swap_True_set_grad_True_mps_float32 
2025-12-02T11:01:40.0155790Z [gw1] [ 59%] FAILED [1200.4659s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_Transformer_mps_float16 
2025-12-02T11:03:40.9389770Z [gw0] [ 59%] FAILED [1200.0481s] test/test_modules.py::TestModuleMPS::test_to_nn_Transformer_swap_False_set_grad_False_mps_float32 
2025-12-02T11:22:26.3453670Z [gw1] [ 59%] FAILED [1200.2110s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_Transformer_mps_float32 
2025-12-02T11:22:41.6893080Z [gw1] [ 59%] PASSED [15.3437s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_ZeroPad1d_mps_float16 
2025-12-02T11:22:56.4771680Z [gw1] [ 59%] PASSED [14.7844s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_ZeroPad1d_mps_float32 
2025-12-02T11:23:12.0592870Z [gw1] [ 59%] PASSED [15.5815s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_ZeroPad2d_mps_float16 
2025-12-02T11:23:26.4765580Z [gw1] [ 59%] PASSED [14.4153s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_ZeroPad2d_mps_float32 
2025-12-02T11:23:40.3271920Z [gw1] [ 59%] PASSED [13.8485s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_ZeroPad3d_mps_float16 
2025-12-02T11:23:48.7002260Z [gw0] [ 59%] FAILED [1200.2521s] test/test_modules.py::TestModuleMPS::test_to_nn_Transformer_swap_False_set_grad_True_mps_float32 
2025-12-02T11:23:53.8750140Z [gw1] [ 59%] PASSED [13.5454s] test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_ZeroPad3d_mps_float32 
2025-12-02T11:23:53.8827200Z [gw1] [ 59%] SKIPPED [0.0042s] test/test_modules.py::TestModuleMPS::test_repr_nn_AdaptiveAvgPool1d_mps_float16 
2025-12-02T11:23:53.8865440Z [gw1] [ 59%] SKIPPED [0.0031s] test/test_modules.py::TestModuleMPS::test_repr_nn_AdaptiveAvgPool1d_mps_float32 
2025-12-02T11:23:53.8899970Z [gw1] [ 59%] SKIPPED [0.0021s] test/test_modules.py::TestModuleMPS::test_repr_nn_AdaptiveAvgPool2d_mps_float16

It's very strange, though, that the tests start timing out at all, because the same test suite is passing on main; the deleted skips did not cover the tests that now time out, and the only relevant difference in that PR is:

@@ -372,6 +374,8 @@ outputs:
         - cuda-nvrtc-dev            # [cuda_compiler_version != "None"]
         - nvtx-c                    # [cuda_compiler_version != "None"]
         - pybind11
+        # not required upstream, but useful
+        - pytest-timeout
       imports:
         - torch
         - torch._C
@@ -514,7 +524,7 @@ outputs:
         {% set jobs = "-n 1" %}     # [linux64 and cuda_compiler_version != "None"]
         # test only one python version on aarch because emulation is super-slow;
         # disable hypothesis because it randomly yields health check errors
-        - pytest {{ jobs }} {{ tests }} -k "not ({{ skips }})" -m "not hypothesis" --durations=50 --disable-warnings    # [not aarch64 or py==312]
+        - pytest {{ jobs }} -v {{ tests }} -k "not ({{ skips }})" -m "not hypothesis" --durations=50 --timeout=1200 --disable-warnings    # [not aarch64 or py==312]

         # regression test for https://github.com/conda-forge/pytorch-cpu-feedstock/issues/329, where we picked up
         # duplicate `.pyc` files due to newest py-ver (3.13) in the build environment not matching the one in host;
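For reference, a hypothetical stopgap (not part of the PR above) would be to extend the existing `-k` deselection to drop the runaway classes until the slowdown is understood. The module names below are taken from the failing log lines and are illustrative only:

```shell
# Illustrative only: deselect the runaway MPS module tests while keeping
# the rest of TestModuleMPS; names mirror the FAILED lines in the logs.
pytest -n 2 -v test/test_modules.py \
  -k "not (TestModuleMPS and (Transformer or MultiheadAttention or GRU or RNN))" \
  --durations=50 --timeout=1200 --disable-warnings
```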

I'm not sure how verbosity or timeouts could make the tests even slower. The timings are borne out on main as well though, where up to 46 of the 50 slowest tests are from TestModuleMPS (here are the logs from the last run in #460).

Test timings on main
============================= slowest 50 durations =============================
2350.40s call     test/test_modules.py::TestModuleMPS::test_forward_nn_NLLLoss_mps_float16
875.79s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_CrossEntropyLoss_mps_float32
862.86s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_CrossEntropyLoss_mps_float16
821.33s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_GRU_eval_mode_mps_float16
412.35s call     test/test_modules.py::TestModuleMPS::test_forward_nn_NLLLoss_mps_float32
335.56s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_Transformer_mps_float16
296.72s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerDecoderLayer_mps_float32
295.85s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerDecoderLayer_mps_float16
256.30s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_Transformer_mps_float32
236.30s call     test/test_modules.py::TestModuleMPS::test_save_load_nn_Transformer_mps_float16
236.05s call     test/test_modules.py::TestModuleMPS::test_save_load_nn_Transformer_mps_float32
230.04s call     test/test_modules.py::TestModuleMPS::test_forward_nn_MultiheadAttention_eval_mode_mps_float32
227.15s call     test/test_modules.py::TestModuleMPS::test_forward_nn_MultiheadAttention_train_mode_mps_float16
225.44s call     test/test_modules.py::TestModuleMPS::test_forward_nn_MultiheadAttention_train_mode_mps_float32
223.08s call     test/test_modules.py::TestModuleMPS::test_forward_nn_MultiheadAttention_eval_mode_mps_float16
188.88s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoder_train_mode_mps_float32
187.23s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoder_train_mode_mps_float16
149.19s call     test/test_modules.py::TestModuleMPS::test_to_nn_Transformer_swap_True_set_grad_False_mps_float32
146.43s call     test/test_modules.py::TestModuleMPS::test_to_nn_Transformer_swap_True_set_grad_True_mps_float32
145.72s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_MultiheadAttention_train_mode_mps_float32
143.61s call     test/test_modules.py::TestModuleMPS::test_to_nn_Transformer_swap_False_set_grad_True_mps_float32
141.62s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_MultiheadAttention_eval_mode_mps_float16
141.30s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_MultiheadAttention_eval_mode_mps_float32
139.99s call     test/test_modules.py::TestModuleMPS::test_save_load_nn_TransformerDecoderLayer_mps_float16
139.69s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoder_eval_mode_mps_float16
138.65s call     test/test_modules.py::TestModuleMPS::test_save_load_nn_TransformerDecoderLayer_mps_float32
127.45s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoder_eval_mode_mps_float32
122.38s call     test/test_modules.py::TestModuleMPS::test_forward_nn_CrossEntropyLoss_mps_float32
121.50s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_MultiheadAttention_train_mode_mps_float16
119.42s call     test/test_modules.py::TestModuleMPS::test_forward_nn_CrossEntropyLoss_mps_float16
114.24s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoderLayer_train_mode_mps_float16
112.81s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_RNN_train_mode_mps_float32
111.41s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_RNN_eval_mode_mps_float32
111.34s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoderLayer_train_mode_mps_float32
109.37s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_RNN_train_mode_mps_float16
108.14s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_RNN_eval_mode_mps_float16
103.38s call     test/test_modules.py::TestModuleMPS::test_forward_nn_GRU_eval_mode_mps_float16
101.30s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoderLayer_eval_mode_mps_float32
97.94s call     test/test_nn.py::TestNNDeviceTypeMPS::test_upsamplingNearestExact3d_correctness_memory_format0_isize_10_osize_15_mps
97.49s call     test/test_nn.py::TestNNDeviceTypeMPS::test_upsamplingNearest3d_correctness_memory_format1_isize_10_osize_15_mps
97.39s call     test/test_nn.py::TestNNDeviceTypeMPS::test_upsamplingNearest3d_correctness_memory_format0_isize_10_osize_15_mps
97.18s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_TransformerEncoderLayer_eval_mode_mps_float16
96.35s call     test/test_modules.py::TestModuleMPS::test_forward_nn_GRU_eval_mode_mps_float32
96.02s call     test/test_nn.py::TestNNDeviceTypeMPS::test_upsamplingNearestExact3d_correctness_memory_format1_isize_10_osize_15_mps
94.71s call     test/test_modules.py::TestModuleMPS::test_save_load_nn_TransformerEncoder_train_mode_mps_float16
93.79s call     test/test_modules.py::TestModuleMPS::test_forward_nn_GRU_train_mode_mps_float32
93.63s call     test/test_modules.py::TestModuleMPS::test_non_contiguous_tensors_nn_BatchNorm1d_train_mode_mps_float32
89.87s call     test/test_modules.py::TestModuleMPS::test_to_nn_Transformer_swap_False_set_grad_False_mps_float32
89.81s call     test/test_modules.py::TestModuleMPS::test_forward_nn_GRU_train_mode_mps_float16
88.73s call     test/test_modules.py::TestModuleMPS::test_save_load_nn_TransformerEncoder_train_mode_mps_float32
= 8609 passed, 4839 skipped, 65 xfailed, 51697 warnings in 10648.92s (2:57:28) =

Note though that not all runs are uniformly slow; in this case it was the first run for pytorch-tests-2.9.1-cpu_generic_py311_hcc85050_1.conda that took 3h on main, while all the following pytorch-tests outputs in the same run took roughly 1:40:

= 8609 passed, 4839 skipped, 65 xfailed, 51745 warnings in 5942.94s (1:39:02) ==

What's also strange is that, as recently as a week ago (e.g. the run after merging #448), the entire osx-arm64 build ran through in 3:20h, with the test suite taking a fraction of what it does now:

== 8609 passed, 4839 skipped, 65 xfailed, 51745 warnings in 337.40s (0:05:37) ==
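Quantifying that from the two summary lines (10648.92s on the slow run vs 337.40s a week earlier), the same 8609-test suite got slower by roughly a factor of 30. A trivial check:

```python
# Quick arithmetic on the pytest summary lines quoted above.
fast = 337.40    # seconds, run after merging #448
slow = 10648.92  # seconds, slow run on main discussed above
print(f"slowdown: {slow / fast:.1f}x")  # roughly a 31-32x slowdown
```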

My first suspicion fell on conda-forge/blas-feedstock#158, but as it turns out, we were already using the _2 builds in the previous (much faster) run; besides, the test section here pulls in libblas 3.11.0-2_h51639a9_openblas, not _newaccelerate.

I'm a bit stumped as to what could be causing this.

CC @mgorny @isuruf

Originally posted by @h-vetinari in #353 (comment)
