forked from pytorch/pytorch
[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-08 #2621
Closed: pragupta wants to merge 1,425 commits into rocm7.1_internal_testing from rocm7.1_internal_testing_IFU_2025-09-08
Conversation
… the hardware limit. (pytorch#161996) Summary: This is a re-land of [PR161040](pytorch#161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs. This diff removes configurations that exceed the hardware shared memory limit, which causes the following compilation error: ``` No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help. ``` Test Plan: ``` pytest test/inductor/test_max_autotune.py pytest test/inductor/test_triton_heuristics.py ``` Pull Request resolved: pytorch#161996 Approved by: https://github.com/coconutruben
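As a rough illustration of how a Triton matmul config can exceed that shared-memory budget (an approximation only; the exact accounting depends on the kernel and the Triton compiler, and the block sizes below are made up):

```python
# Approximate shared-memory footprint of a pipelined Triton matmul config (fp16).
# Illustrative only: real usage depends on the kernel and the compiler.
def smem_bytes(block_m: int, block_n: int, block_k: int, num_stages: int, dtype_bytes: int = 2) -> int:
    # Each pipeline stage keeps one A tile (BLOCK_M x BLOCK_K) and one
    # B tile (BLOCK_K x BLOCK_N) resident in shared memory.
    return num_stages * (block_m * block_k + block_k * block_n) * dtype_bytes

print(smem_bytes(128, 256, 64, num_stages=5))  # 245760 bytes, above the 232448-byte limit quoted above
```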
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#161929 Approved by: https://github.com/pytorchbot
…ble (pytorch#161950) This PR is a followup to pytorch#149764. In that PR, it only forbids illegal view due to `Flatten`; this PR also forbids illegal view caused by `Split`. This PR also updates the error message to be less about internal implementation details, which users may find confusing. Pull Request resolved: pytorch#161950 Approved by: https://github.com/ezyang
Fixes pytorch#161483 When the whole `test/test_transformers.py` file is run, the case `test_default_priority_order` can pass because other xpu cases call SDPA, so the priority order gets set by https://github.com/pytorch/pytorch/blob/eec876deb659fe667aac2d97a48d7451c3e88dee/aten/src/ATen/native/mkldnn/xpu/Attention.cpp#L98-L112 However, when the case `test_default_priority_order` is run separately, the priority order is unset and the case fails. This PR fixes this case. Pull Request resolved: pytorch#161690 Approved by: https://github.com/guangyey, https://github.com/drisspg
…61998) This function has come up in DTensor perf work, and I had a nitpick on pytorch#160256 so here it is. I have neither compiled nor measured this, but am reasonably confident it's better nonetheless. Pull Request resolved: pytorch#161998 Approved by: https://github.com/ezyang
Enable cat op for sparse on MPS Pull Request resolved: pytorch#162007 Approved by: https://github.com/malfet
…torch#161947) As the title stated. Pull Request resolved: pytorch#161947 Approved by: https://github.com/albanD ghstack dependencies: pytorch#161845, pytorch#161903
Keeps SymInt::maybe_as_int small enough to inline. Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097) Pull Request resolved: pytorch#161466 Approved by: https://github.com/ezyang
If SymInt::maybe_as_int() returns non-empty, then we get an inline fast path. The philosophy here (as with the previous PR) is to preserve performance in the "plain old ints" case. After this change, time spent in SymInt functions in computeStorageNBytes drops (without the cost shifting elsewhere in the function), as observed by profiling detach() with Linux perf using code similar to the benchmark from pytorch#160580. Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107) Pull Request resolved: pytorch#161586 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466
Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
…rch#159889) This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#159889 Approved by: https://github.com/wconstab ghstack dependencies: pytorch#160449
Summary:
we have
```
std::vector<size_t> constants_internal_offset(
num_constants - num_folded_constants);
```
but the for loop does not account for that smaller size:
```
for (size_t i = 0; i < num_constants; i++) {
...
constants_internal_offset[i]
...
```
Even inside the loop, folded constants are skipped:
```
bool from_folded = this->constant_from_folded(i);
if (from_folded) {
continue;
}
```
but `i` can still run past the end of `constants_internal_offset`, because the indices of the remaining non-folded constants are not compacted.
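A minimal Python sketch of the indexing pattern at issue (the real code is C++ in AOTInductor; the names and offsets here are made up): the offsets container only has one slot per non-folded constant, so it needs its own running index rather than the loop index over all constants.

```python
# Python sketch only; illustrates why indexing the smaller container with `i` is unsafe.
constants = ["w0", "folded0", "w1", "folded1", "w2"]
is_folded = [False, True, False, True, False]

# One slot per *non-folded* constant, mirroring the vector sized
# num_constants - num_folded_constants above.
offsets = [0] * (len(constants) - sum(is_folded))

slot = 0  # separate index into `offsets`
for i, _name in enumerate(constants):
    if is_folded[i]:
        continue
    offsets[slot] = slot * 16  # placeholder offset computation
    slot += 1                  # using `i` here instead could run past len(offsets)
```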
Rollback Plan:
Differential Revision: D81425007
Pull Request resolved: pytorch#161887
Approved by: https://github.com/angelayi
Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse. Reviewed By: davidberard98, FindHao Differential Revision: D80994791 Pull Request resolved: pytorch#161953 Approved by: https://github.com/FindHao
Followup after pytorch#154012 Fixes CPU part of pytorch#160841 Pull Request resolved: pytorch#161999 Approved by: https://github.com/drisspg
Followup after pytorch#154012 Since the introduction of `gemm_no_downcast_stub`, it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase. Pull Request resolved: pytorch#162001 Approved by: https://github.com/drisspg ghstack dependencies: pytorch#161999
…0943)" This reverts commit bbedc71. Reverted pytorch#160943 on behalf of https://github.com/jeanschmidt due to See [D81486248](https://www.internalfb.com/diff/D81486248) for details on broken test ([comment](pytorch#160943 (comment)))
…h#161862)" This reverts commit 204697f. Reverted pytorch#161862 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D81522732 for more details ([comment](pytorch#161862 (comment)))
… nodes (pytorch#161339)" This reverts commit 90f50f7. Reverted pytorch#161339 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, check D81486248 for more details ([comment](pytorch#161339 (comment)))
…1261) In this PR, we port 4 test files under test/distributed/parallel and 1 test file under test/distributed/debug for Intel GPU. We enable Intel GPU with the following methods while trying to keep the original code style: 1. Use torch.accelerator for generic GPU code 2. Skip cases with known issues when running on XPU Pull Request resolved: pytorch#161261 Approved by: https://github.com/guangyey, https://github.com/d4l3k
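For reference, the device-generic pattern from point 1 looks roughly like the sketch below (assumes a PyTorch version that ships `torch.accelerator`; the tensor shape is arbitrary):

```python
# Sketch of writing a test in a device-generic way via torch.accelerator.
import torch

acc = torch.accelerator.current_accelerator()  # e.g. cuda or xpu, or None
device = acc.type if acc is not None else "cpu"
x = torch.randn(8, 8, device=device)
print(device, x.sum().item())
```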
Summary: Inductor has the following configurations: `config.comprehensive_padding`, `config.padding_alignment_bytes`, and `config.padding_stride_threshold`. In the static-shape case, enabling these three options makes Inductor generate code for flexible-layout tensors that pads every stride dimension above `config.padding_stride_threshold` up to a multiple of `config.padding_alignment_bytes`. When dynamic shapes are enabled, no padding is done today. This PR introduces a new configuration, `config.pad_dynamic_shapes`, which lets the user request padded strides even for dynamic-shape operations; it is a separate flag so that the previous behaviour of not padding dynamic-shape use cases is preserved by default. `config.padding_stride_threshold` does not apply here, since the stride values are dynamic. In addition, a new mode "python_slow" has been added for launch-grid calculation, which achieves the same ceildiv behaviour that is generally applicable to integer division; this prevents test regressions and makes wrapper_fxir codegen more generic. Test Plan: CI Rollback Plan: Differential Revision: D80468808 Pull Request resolved: pytorch#160997 Approved by: https://github.com/blaine-rister, https://github.com/jansel
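A sketch of how these knobs might be toggled (config names are taken from the commit message above; whether they are all available depends on the Inductor version in use):

```python
# Hypothetical usage of the padding-related Inductor configs described above.
import torch
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True
inductor_config.padding_alignment_bytes = 128
inductor_config.pad_dynamic_shapes = True  # new flag introduced by this PR

@torch.compile(dynamic=True)
def mm(a, b):
    return a @ b
```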
…2058) fix split_aot_inductor_output_path on Windows. Pull Request resolved: pytorch#162058 Approved by: https://github.com/angelayi
Pull Request resolved: pytorch#161799 Approved by: https://github.com/anijain2305
Pull Request resolved: pytorch#161800 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#161799
## Summary
Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`, and that works as an alternative to `split-k` (transposed case only), which is not currently supported on AMD.
## Background
On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
))
```
is a lot slower than:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
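Conceptually, the decomposition boils down to something like the sketch below (the idea only, not Inductor's actual implementation): copy B into a contiguous layout first, and let autotuning decide whether the copy plus a contiguous matmul beats the strided matmul.

```python
# Minimal sketch of the "contiguous" decomposition idea; not the Inductor code.
import torch

def contiguous_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # b may be a transposed (column-major) view; .contiguous() copies it into
    # the faster stride=[N, 1] layout shown in the second example above.
    return torch.mm(a, b.contiguous())

def contiguous_addmm(bias: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.addmm(bias, a, b.contiguous())
```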
## Data
I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
addmm_16448x512x2048: +0.14%
addmm_128x2048x2048: +0.01%
addmm_128x768x1000: +0.75%
addmm_12672x3072x768: +1.08%
addmm_512x768x32000: +0.62%
addmm_12608x384x384: +0.00%
addmm_4160x1024x4096: +0.90%
addmm_16x768x2: +0.56%
addmm_12608x3072x768: +0.09%
addmm_64x4096x1000: +2.77%
addmm_256x1024x512: +1.99%
addmm_30x256x256: +1.12%
addmm_100480x128x384: +0.91%
addmm_6400x2048x512: +0.25%
addmm_61568x1024x256: +0.08%
addmm_1x768x768: +0.93%
addmm_12544x384x384: +0.19%
addmm_128x512x1000: +0.77%
addmm_2048x128x128: +1.32%
addmm_128x3072x1000: +0.24%
addmm_7936x512x2048: +0.07%
addmm_8192x512x2048: +0.33%
addmm_64x1024x1000: +1.43%
addmm_128x2304x1000: +0.01%
addmm_32768x256x2: +0.75%
addmm_64x384x1152: +0.79%
addmm_64x640x1000: +0.01%
addmm_100480x128x128: +0.87%
addmm_1152x3072x768: +1.13%
addmm_8192x256x2048: +1.40%
addmm_4096x128x768: +0.01%
addmm_128x2560x1000: +0.01%
addmm_12544x2048x512: +0.43%
addmm_200704x24x96: +0.14%
addmm_8448x512x2048: +0.96%
addmm_50176x256x1024: +0.62%
addmm_4160x4096x1024: +0.22%
addmm_4096x768x768: +0.32%
addmm_220x2048x512: +0.56%
addmm_8x2048x1000: +1.12%
addmm_256x197951x512: +26.99%
addmm_401536x64x192: +0.60%
addmm_2040x2048x512: +0.47%
addmm_512x1024x256: +1.32%
addmm_128x4096x1000: +1.67%
addmm_12672x768x768: +0.34%
addmm_128x368x1000: +0.77%
addmm_96x1280x1000: +0.01%
addmm_12544x512x2048: +0.41%
addmm_6272x320x1280: +0.76%
addmm_12544x3072x768: +0.09%
addmm_64x384x1000: +0.39%
mm improvements when best:
mm_200704x128x512: +1.29%
mm_663552x16x16: +0.80%
mm_4096x768x768: +0.51%
mm_131072x64x31: +0.24%
mm_12544x1152x384: +0.11%
mm_128x2048x2: +0.46%
mm_262144x16x23: +0.62%
mm_50176x576x192: +0.37%
mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%
Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)
Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%
```
## Results
The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
addmm_128x16384x128: +0.18%
addmm_128x262144x256: +38.24%
addmm_128x200000x512: +14.76%
addmm_256x800000x128: +0.06%
addmm_131072x128x256: +0.27%
addmm_128x256x131072: +0.25%
addmm_2048x200000x64: +12.45%
mm improvements when best:
mm_128x16384x128: +0.18%
mm_128x262144x256: +38.05%
mm_128x200000x512: +9.47%
mm_256x800000x128: +0.99%
mm_512x6400000x256: +3.17%
mm_524288x64x64: +0.29%
mm_2048x200000x64: +11.19%
mm_8192x1000000x256: +34.14%
mm_128x4096x100000: +0.40%
mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%
Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%
```
## Conclusion
Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and has a very large improvement on low-`M`, low-`N`, high-`K` shapes.
Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866
## Test Plan:
New unit tests.
Differential Revision: D80771648
Pull Request resolved: pytorch#161241
Approved by: https://github.com/eellison
Enable python 3.13t, 3.14 and 3.14t on s390x for nightly binaries Fixes pytorch#161515 Pull Request resolved: pytorch#161920 Approved by: https://github.com/malfet
…ytorch#160510) Pull Request resolved: pytorch#160510 Approved by: https://github.com/ezyang
…ytorch#161339) # why - some templates e.g. scale_mm need to unsqueeze/squeeze the nodes for codegen and heuristics - unified place where we can just adjust them for the template # what - inside get_mm_configs, return not the passed in kernel inputs, but allow the template heuristic to adjust them if necessary - the default implementation right now just passes them back this diff just adds the functionality, but does not exercise it other than the default (passthrough) # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572) Pull Request resolved: pytorch#161339 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: pytorch#161123, pytorch#161124, pytorch#161125, pytorch#161126, pytorch#161336, pytorch#161338
Reduces CPython `test_set.py` runtime from 63.477s to 40.298s Pull Request resolved: pytorch#160763 Approved by: https://github.com/anijain2305
Fixes pytorch#161763 Pull Request resolved: pytorch#161922 Approved by: https://github.com/jansel
…62044) ### Motivation * MI250 Cirrascale runners are currently experiencing network timeouts, leading to a huge queue of binary smoke test jobs: <img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" /> * MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: pytorch#153287 (comment) (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either, e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755). * Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity. * However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa * Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR. ### TODOs (cc @amdfaa): * Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take, esp. the "Download Build Artifacts" step * Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250` Pull Request resolved: pytorch#162044 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#162315 Approved by: https://github.com/pytorchbot
…lt (pytorch#159889)" This reverts commit 01edcd4. Reverted pytorch#159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](pytorch#160449 (comment)))
This reverts commit de893e9. Reverted pytorch#160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](pytorch#160449 (comment)))
Follow-up to pytorch#161768. Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because the un-pickleable fields were not actually needed after compilation finished. pytorch#161768 updated `prepare_for_pickle` to also strip out the `fn._hash_lock` field, a newly added field in JITCallable instances which is a `threading.RLock()` and is not pickleable. It turns out that we do need to restore the `fn._hash_lock` field, even in the non-triton_bundler case: the MultiKernel case uses the hash lock. To do this, we add `restore_after_unpickle()`, which restores the stripped fields (or, if the old fields are not provided, initializes just the hash_lock). Compile time benchmarks look good, maybe a very minor regression (see the comment below on the PR) Pull Request resolved: pytorch#162244 Approved by: https://github.com/atalman
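The strip-then-restore pattern described here, reduced to a generic sketch (simplified, hypothetical signatures and class; the real helpers live in Inductor and operate on Triton JITCallable objects):

```python
# Generic sketch of stripping an un-pickleable lock before crossing a process
# boundary and restoring it afterwards. Simplified stand-in for the real code.
import pickle
import threading

class FakeKernel:
    def __init__(self, name: str):
        self.name = name
        self._hash_lock = threading.RLock()  # RLock objects cannot be pickled

def prepare_for_pickle(kernel: FakeKernel):
    stripped = kernel._hash_lock
    kernel._hash_lock = None
    return stripped

def restore_after_unpickle(kernel: FakeKernel, old_lock=None):
    # Restore the saved field, or re-initialize just the lock if none was kept.
    kernel._hash_lock = old_lock if old_lock is not None else threading.RLock()

k = FakeKernel("triton_mm")
prepare_for_pickle(k)
k2 = pickle.loads(pickle.dumps(k))  # what the ProcessPool round-trip does
restore_after_unpickle(k2)
assert k2._hash_lock is not None
```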
…ytorch#162309) Fixes static cuda launcher after triton-lang/triton#7866. Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains. Pull Request resolved: pytorch#162309 Approved by: https://github.com/aakhundov ghstack dependencies: pytorch#162244
The original implementation set beta to 1, which caused the out tensor (C) to be added to the output. Thus, if the output is not zero-initialized beforehand, the result can be incorrect. Removing the alpha and beta fixes the issue. Thanks @ngimel for figuring out the root cause. Pull Request resolved: pytorch#162040 Approved by: https://github.com/danielvegamyhre
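In plain PyTorch terms, the failure mode follows from the standard `addmm` semantics, `out = beta * C + alpha * (A @ B)` (a minimal illustration, not the kernel in question): with `beta=1`, whatever happens to be in an uninitialized `C` leaks into the result.

```python
# beta=1 adds the existing contents of C to A @ B, so an uninitialized C
# produces garbage; a zeroed C (or plain mm) gives the pure matmul result.
import torch

A = torch.randn(4, 8)
B = torch.randn(8, 4)
C = torch.empty(4, 4)  # uninitialized memory

bad = torch.addmm(C, A, B, beta=1, alpha=1)   # C's garbage is added in
good = torch.addmm(torch.zeros(4, 4), A, B)   # equivalent to torch.mm(A, B)
print(torch.allclose(good, A @ B))            # True
```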
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: pytorch#162372 Approved by: https://github.com/pytorchbot
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: pytorch#161395 Approved by: https://github.com/pytorchbot
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton) Notably: * this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR) * sam_fast is still failing, so we've disabled it temporarily pytorch#162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch. Pull Request resolved: pytorch#162278 Approved by: https://github.com/atalman ghstack dependencies: pytorch#162244, pytorch#162309
… mode (pytorch#161405) During comms reordering (sink waits iterative), we observed that the previous runtime estimations were quite far off for collectives and mms. This adds optional usage of: - c10d.time_estimator for collectives, which is based on the NCCL estimator - benchmark mode for matmuls only, as they are highly dependent on the mm backend The logic is mostly copied from Ruisi's PRs for inductor simple_fsdp pytorch#157572. These estimation corrections live in the default `BaseSchedulerNode.estimate_runtime()`. Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294) Pull Request resolved: pytorch#161405 Approved by: https://github.com/eellison
)" This reverts commit c9ac8c2. Reverted pytorch#162315 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if this introduced the failure https://github.com/pytorch/pytorch/actions/runs/17539536914/job/49810513700 ([comment](pytorch#162315 (comment)))
This reverts commit 3770337. Reverted pytorch#161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](pytorch#161649 (comment)))
…#162041) The async TP result and the regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error comes from the numerical error of low precision. Pull Request resolved: pytorch#162041 Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel ghstack dependencies: pytorch#162040
…lready know (pytorch#161591) We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap. Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099) Pull Request resolved: pytorch#161591 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590
…ytorch#161595) This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better. Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095) Pull Request resolved: pytorch#161595 Approved by: https://github.com/ezyang, https://github.com/bdhirsh ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590, pytorch#161591
…rch#161058) This PR is the first split PR of pytorch#156272 and only contains the OneDNN code. Please help review. Pending on the OneDNN v3.9 commit update. Don't merge. Pull Request resolved: pytorch#161058 Approved by: https://github.com/guangyey, https://github.com/EikanWang
LOAF previously could skip these fusion opportunities and cause some tests to fail. Test: - TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda Pull Request resolved: pytorch#162311 Approved by: https://github.com/jansel
updated release behavior for cached blocks Fixes pytorch#159567 Pull Request resolved: pytorch#162269 Approved by: https://github.com/eqy, https://github.com/Skylion007
… cuDNN SDPA Nested tensor/varlen (pytorch#161434) Don't recompile every time Pull Request resolved: pytorch#161434 Approved by: https://github.com/drisspg
Trying to fix pytorch#162333. The CUDA 13.0 file structure changed: instead of keeping most of the DLLs in the bin folder, they are now in ``bin\x64``, except for the cuDNN DLL. See the attached pictures: <img width="511" height="361" alt="Screenshot 2025-09-08 at 9 46 26 AM" src="https://github.com/user-attachments/assets/d2e630ee-930f-4da6-9b81-f9ef48fde7ce" /> <img width="490" height="333" alt="Screenshot 2025-09-08 at 9 46 34 AM" src="https://github.com/user-attachments/assets/194cbf43-b6ef-4218-b516-db37b91302be" /> Pull Request resolved: pytorch#162383 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/malfet
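For anyone locating CUDA DLLs manually, handling the new layout could look roughly like this (an illustrative sketch; `CUDA_PATH` and the directory names are assumptions about a typical CUDA 13.0 install, not the change made in this PR):

```python
# Add both bin and bin\x64 to the Windows DLL search path, since CUDA 13.0
# moved most DLLs into bin\x64 while cuDNN stays in bin.
import os

cuda_home = os.environ.get("CUDA_PATH", r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0")
for sub in ("bin", os.path.join("bin", "x64")):
    d = os.path.join(cuda_home, sub)
    if os.path.isdir(d) and hasattr(os, "add_dll_directory"):  # Windows-only API
        os.add_dll_directory(d)
```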
…sting_IFU_2025-09-08
# Conflicts:
# .ci/docker/ci_commit_pins/triton.txt
# .ci/docker/requirements-ci.txt
# aten/src/ATen/Context.cpp
# aten/src/ATen/cuda/tunable/GemmHipblaslt.h
# aten/src/ATen/native/ConvUtils.h
# aten/src/ATen/native/Convolution.cpp
# aten/src/ATen/native/Normalization.cpp
# aten/src/ATen/native/cuda/Blas.cpp
# aten/src/ATen/native/miopen/Conv_miopen.cpp
# requirements.txt
# test/distributed/_tools/test_fsdp2_mem_tracker.py
# test/distributed/tensor/parallel/test_tp_examples.py
# test/dynamo/test_activation_checkpointing.py
# test/dynamo/test_structured_trace.py
# test/inductor/test_aot_inductor.py
# test/inductor/test_combo_kernels.py
# test/test_matmul_cuda.py
# test/test_sparse.py
# torch/_higher_order_ops/triton_kernel_wrap.py
# torch/_inductor/choices.py
# torch/_inductor/codegen/triton.py
# torch/testing/_internal/common_cuda.py
Jenkins build for 2b05fbdbc7aa836bd60dac2f04c4bd954c020b50 commit finished as FAILURE
Jenkins build for b88024c2430b2cc6223bf5eb2a7819ba570564f3 commit finished as FAILURE
Merged latest changes from upstream/main into rocm7.1_internal_testing on 2025-09-08