forked from pytorch/pytorch
[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-08 #2621
Closed: pragupta wants to merge 1,425 commits into rocm7.1_internal_testing from rocm7.1_internal_testing_IFU_2025-09-08
Conversation
… the hardware limit. (pytorch#161996) Summary: This is a re-land of [PR161040](pytorch#161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs. This diff removes configurations that exceed the hardware shared memory limit, which causes the following compilation error: ``` No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help. ``` Test Plan: ``` pytest test/inductor/test_max_autotune.py pytest test/inductor/test_triton_heuristics.py ``` Pull Request resolved: pytorch#161996 Approved by: https://github.com/coconutruben
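As a rough illustration of how a Triton matmul config can exceed that shared-memory budget (an approximation only; the exact accounting depends on the kernel and the Triton compiler, and the block sizes below are made up):

```python
# Approximate shared-memory footprint of a pipelined Triton matmul config (fp16).
# Illustrative only: real usage depends on the kernel and the compiler.
def smem_bytes(block_m: int, block_n: int, block_k: int, num_stages: int, dtype_bytes: int = 2) -> int:
    # Each pipeline stage keeps one A tile (BLOCK_M x BLOCK_K) and one
    # B tile (BLOCK_K x BLOCK_N) resident in shared memory.
    return num_stages * (block_m * block_k + block_k * block_n) * dtype_bytes

print(smem_bytes(128, 256, 64, num_stages=5))  # 245760 bytes, above the 232448-byte limit quoted above
```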
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#161929 Approved by: https://github.com/pytorchbot
…ble (pytorch#161950) This PR is a followup to pytorch#149764. In that PR, it only forbids illegal view due to `Flatten`; this PR also forbids illegal view caused by `Split`. This PR also updates the error message to be less about internal implementation details, which users may find confusing. Pull Request resolved: pytorch#161950 Approved by: https://github.com/ezyang
Fixes pytorch#161483 When the whole `test/test_transformers.py` file is run, the case `test_default_priority_order` can pass because other xpu cases call SDPA, so the priority order gets set by https://github.com/pytorch/pytorch/blob/eec876deb659fe667aac2d97a48d7451c3e88dee/aten/src/ATen/native/mkldnn/xpu/Attention.cpp#L98-L112 However, when the case `test_default_priority_order` is run separately, the priority order is unset and the case fails. This PR fixes this case. Pull Request resolved: pytorch#161690 Approved by: https://github.com/guangyey, https://github.com/drisspg
…61998) This function has come up in DTensor perf work, and I had a nitpick on pytorch#160256 so here it is. I have neither compiled nor measured this, but am reasonably confident it's better nonetheless. Pull Request resolved: pytorch#161998 Approved by: https://github.com/ezyang
Enable cat op for sparse on MPS Pull Request resolved: pytorch#162007 Approved by: https://github.com/malfet
…torch#161947) As the title stated. Pull Request resolved: pytorch#161947 Approved by: https://github.com/albanD ghstack dependencies: pytorch#161845, pytorch#161903
Keeps SymInt::maybe_as_int small enough to inline. Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097) Pull Request resolved: pytorch#161466 Approved by: https://github.com/ezyang
If SymInt::maybe_as_int() returns non-empty, then we get an inline fast path. The philosophy here (as with the previous PR) is to preserve performance in the "plain old ints" case. After this change, time spent in SymInt functions in computeStorageNBytes drops (without the cost shifting elsewhere in the function), as observed by profiling detach() with Linux perf using code similar to the benchmark from pytorch#160580. Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107) Pull Request resolved: pytorch#161586 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466
Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
…rch#159889) This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#159889 Approved by: https://github.com/wconstab ghstack dependencies: pytorch#160449
Summary:
we have
```
std::vector<size_t> constants_internal_offset(
num_constants - num_folded_constants);
```
but the for loop does not account for that smaller size:
```
for (size_t i = 0; i < num_constants; i++) {
...
constants_internal_offset[i]
...
```
Even inside the loop, folded constants are skipped:
```
bool from_folded = this->constant_from_folded(i);
if (from_folded) {
continue;
}
```
but `i` can still run past the end of `constants_internal_offset`, because the indices of the remaining non-folded constants are not compacted.
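A minimal Python sketch of the indexing pattern at issue (the real code is C++ in AOTInductor; the names and offsets here are made up): the offsets container only has one slot per non-folded constant, so it needs its own running index rather than the loop index over all constants.

```python
# Python sketch only; illustrates why indexing the smaller container with `i` is unsafe.
constants = ["w0", "folded0", "w1", "folded1", "w2"]
is_folded = [False, True, False, True, False]

# One slot per *non-folded* constant, mirroring the vector sized
# num_constants - num_folded_constants above.
offsets = [0] * (len(constants) - sum(is_folded))

slot = 0  # separate index into `offsets`
for i, _name in enumerate(constants):
    if is_folded[i]:
        continue
    offsets[slot] = slot * 16  # placeholder offset computation
    slot += 1                  # using `i` here instead could run past len(offsets)
```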
Rollback Plan:
Differential Revision: D81425007
Pull Request resolved: pytorch#161887
Approved by: https://github.com/angelayi
Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse. Reviewed By: davidberard98, FindHao Differential Revision: D80994791 Pull Request resolved: pytorch#161953 Approved by: https://github.com/FindHao
Followup after pytorch#154012 Fixes CPU part of pytorch#160841 Pull Request resolved: pytorch#161999 Approved by: https://github.com/drisspg
Followup after pytorch#154012 Since the introduction of `gemm_no_downcast_stub`, it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase. Pull Request resolved: pytorch#162001 Approved by: https://github.com/drisspg ghstack dependencies: pytorch#161999
…0943)" This reverts commit bbedc71. Reverted pytorch#160943 on behalf of https://github.com/jeanschmidt due to See [D81486248](https://www.internalfb.com/diff/D81486248) for details on broken test ([comment](pytorch#160943 (comment)))
…h#161862)" This reverts commit 204697f. Reverted pytorch#161862 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D81522732 for more details ([comment](pytorch#161862 (comment)))
… nodes (pytorch#161339)" This reverts commit 90f50f7. Reverted pytorch#161339 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, check D81486248 for more details ([comment](pytorch#161339 (comment)))
…1261) In this PR, we port 4 test files under test/distributed/parallel and 1 test file under test/distributed/debug for Intel GPU. We enable Intel GPU with the following methods while trying to keep the original code style: 1. Use torch.accelerator for generic GPU code 2. Skip cases with known issues when running on XPU Pull Request resolved: pytorch#161261 Approved by: https://github.com/guangyey, https://github.com/d4l3k
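For reference, the device-generic pattern from point 1 looks roughly like the sketch below (assumes a PyTorch version that ships `torch.accelerator`; the tensor shape is arbitrary):

```python
# Sketch of writing a test in a device-generic way via torch.accelerator.
import torch

acc = torch.accelerator.current_accelerator()  # e.g. cuda or xpu, or None
device = acc.type if acc is not None else "cpu"
x = torch.randn(8, 8, device=device)
print(device, x.sum().item())
```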
Summary: Inductor has the following configurations: `config.comprehensive_padding`, `config.padding_alignment_bytes`, and `config.padding_stride_threshold`. In the static-shape case, enabling these three options makes Inductor generate code for flexible-layout tensors that pads every stride dimension above `config.padding_stride_threshold` up to a multiple of `config.padding_alignment_bytes`. When dynamic shapes are enabled, no padding is done today. This PR introduces a new configuration, `config.pad_dynamic_shapes`, which lets the user request padded strides even for dynamic-shape operations; it is a separate flag so that the previous behaviour of not padding dynamic-shape use cases is preserved by default. `config.padding_stride_threshold` does not apply here, since the stride values are dynamic. In addition, a new mode "python_slow" has been added for launch-grid calculation, which achieves the same ceildiv behaviour that is generally applicable to integer division; this prevents test regressions and makes wrapper_fxir codegen more generic. Test Plan: CI Rollback Plan: Differential Revision: D80468808 Pull Request resolved: pytorch#160997 Approved by: https://github.com/blaine-rister, https://github.com/jansel
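A sketch of how these knobs might be toggled (config names are taken from the commit message above; whether they are all available depends on the Inductor version in use):

```python
# Hypothetical usage of the padding-related Inductor configs described above.
import torch
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True
inductor_config.padding_alignment_bytes = 128
inductor_config.pad_dynamic_shapes = True  # new flag introduced by this PR

@torch.compile(dynamic=True)
def mm(a, b):
    return a @ b
```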
…2058) fix split_aot_inductor_output_path on Windows. Pull Request resolved: pytorch#162058 Approved by: https://github.com/angelayi
Pull Request resolved: pytorch#161799 Approved by: https://github.com/anijain2305
Pull Request resolved: pytorch#161800 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#161799
## Summary
Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`, and that works as an alternative to `split-k` (transposed case only), which is not currently supported on AMD.
## Background
On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
))
```
is a lot slower than:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
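Conceptually, the decomposition boils down to something like the sketch below (the idea only, not Inductor's actual implementation): copy B into a contiguous layout first, and let autotuning decide whether the copy plus a contiguous matmul beats the strided matmul.

```python
# Minimal sketch of the "contiguous" decomposition idea; not the Inductor code.
import torch

def contiguous_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # b may be a transposed (column-major) view; .contiguous() copies it into
    # the faster stride=[N, 1] layout shown in the second example above.
    return torch.mm(a, b.contiguous())

def contiguous_addmm(bias: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.addmm(bias, a, b.contiguous())
```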
## Data
I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
addmm_16448x512x2048: +0.14%
addmm_128x2048x2048: +0.01%
addmm_128x768x1000: +0.75%
addmm_12672x3072x768: +1.08%
addmm_512x768x32000: +0.62%
addmm_12608x384x384: +0.00%
addmm_4160x1024x4096: +0.90%
addmm_16x768x2: +0.56%
addmm_12608x3072x768: +0.09%
addmm_64x4096x1000: +2.77%
addmm_256x1024x512: +1.99%
addmm_30x256x256: +1.12%
addmm_100480x128x384: +0.91%
addmm_6400x2048x512: +0.25%
addmm_61568x1024x256: +0.08%
addmm_1x768x768: +0.93%
addmm_12544x384x384: +0.19%
addmm_128x512x1000: +0.77%
addmm_2048x128x128: +1.32%
addmm_128x3072x1000: +0.24%
addmm_7936x512x2048: +0.07%
addmm_8192x512x2048: +0.33%
addmm_64x1024x1000: +1.43%
addmm_128x2304x1000: +0.01%
addmm_32768x256x2: +0.75%
addmm_64x384x1152: +0.79%
addmm_64x640x1000: +0.01%
addmm_100480x128x128: +0.87%
addmm_1152x3072x768: +1.13%
addmm_8192x256x2048: +1.40%
addmm_4096x128x768: +0.01%
addmm_128x2560x1000: +0.01%
addmm_12544x2048x512: +0.43%
addmm_200704x24x96: +0.14%
addmm_8448x512x2048: +0.96%
addmm_50176x256x1024: +0.62%
addmm_4160x4096x1024: +0.22%
addmm_4096x768x768: +0.32%
addmm_220x2048x512: +0.56%
addmm_8x2048x1000: +1.12%
addmm_256x197951x512: +26.99%
addmm_401536x64x192: +0.60%
addmm_2040x2048x512: +0.47%
addmm_512x1024x256: +1.32%
addmm_128x4096x1000: +1.67%
addmm_12672x768x768: +0.34%
addmm_128x368x1000: +0.77%
addmm_96x1280x1000: +0.01%
addmm_12544x512x2048: +0.41%
addmm_6272x320x1280: +0.76%
addmm_12544x3072x768: +0.09%
addmm_64x384x1000: +0.39%
mm improvements when best:
mm_200704x128x512: +1.29%
mm_663552x16x16: +0.80%
mm_4096x768x768: +0.51%
mm_131072x64x31: +0.24%
mm_12544x1152x384: +0.11%
mm_128x2048x2: +0.46%
mm_262144x16x23: +0.62%
mm_50176x576x192: +0.37%
mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%
Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)
Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%
```
## Results
The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
addmm_128x16384x128: +0.18%
addmm_128x262144x256: +38.24%
addmm_128x200000x512: +14.76%
addmm_256x800000x128: +0.06%
addmm_131072x128x256: +0.27%
addmm_128x256x131072: +0.25%
addmm_2048x200000x64: +12.45%
mm improvements when best:
mm_128x16384x128: +0.18%
mm_128x262144x256: +38.05%
mm_128x200000x512: +9.47%
mm_256x800000x128: +0.99%
mm_512x6400000x256: +3.17%
mm_524288x64x64: +0.29%
mm_2048x200000x64: +11.19%
mm_8192x1000000x256: +34.14%
mm_128x4096x100000: +0.40%
mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%
Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%
```
## Conclusion
Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and has a very large improvement on low-`M`, low-`N`, high-`K` shapes.
Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866
## Test Plan:
New unit tests.
Differential Revision: D80771648
Pull Request resolved: pytorch#161241
Approved by: https://github.com/eellison
Enable python 3.13t, 3.14 and 3.14t on s390x for nightly binaries Fixes pytorch#161515 Pull Request resolved: pytorch#161920 Approved by: https://github.com/malfet
…ytorch#160510) Pull Request resolved: pytorch#160510 Approved by: https://github.com/ezyang
…ytorch#161339) # why - some templates e.g. scale_mm need to unsqueeze/squeeze the nodes for codegen and heuristics - unified place where we can just adjust them for the template # what - inside get_mm_configs, return not the passed in kernel inputs, but allow the template heuristic to adjust them if necessary - the default implementation right now just passes them back this diff just adds the functionality, but does not exercise it other than the default (passthrough) # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572) Pull Request resolved: pytorch#161339 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: pytorch#161123, pytorch#161124, pytorch#161125, pytorch#161126, pytorch#161336, pytorch#161338
Reduces CPython `test_set.py` runtime from 63.477s to 40.298s Pull Request resolved: pytorch#160763 Approved by: https://github.com/anijain2305
Fixes pytorch#161763 Pull Request resolved: pytorch#161922 Approved by: https://github.com/jansel
…62044) ### Motivation * MI250 Cirrascale runners are currently experiencing network timeouts, leading to a huge queue of binary smoke test jobs: <img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" /> * MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: pytorch#153287 (comment) (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either, e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755). * Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity. * However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa * Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR. ### TODOs (cc @amdfaa): * Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take, esp. the "Download Build Artifacts" step * Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250` Pull Request resolved: pytorch#162044 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#162315 Approved by: https://github.com/pytorchbot
…lt (pytorch#159889)" This reverts commit 01edcd4. Reverted pytorch#159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](pytorch#160449 (comment)))
This reverts commit de893e9. Reverted pytorch#160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](pytorch#160449 (comment)))
Follow-up to pytorch#161768. Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because the un-pickleable fields were not actually needed after compilation finished. pytorch#161768 updated `prepare_for_pickle` to also strip out the `fn._hash_lock` field, a newly added field in JITCallable instances which is a `threading.RLock()` and is not pickleable. It turns out that we do need to restore the `fn._hash_lock` field, even in the non-triton_bundler case: the MultiKernel case uses the hash lock. To do this, we add `restore_after_unpickle()`, which restores the stripped fields (or, if the old fields are not provided, initializes just the hash_lock). Compile time benchmarks look good, maybe a very minor regression (see the comment below on the PR) Pull Request resolved: pytorch#162244 Approved by: https://github.com/atalman
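The strip-then-restore pattern described here, reduced to a generic sketch (simplified, hypothetical signatures and class; the real helpers live in Inductor and operate on Triton JITCallable objects):

```python
# Generic sketch of stripping an un-pickleable lock before crossing a process
# boundary and restoring it afterwards. Simplified stand-in for the real code.
import pickle
import threading

class FakeKernel:
    def __init__(self, name: str):
        self.name = name
        self._hash_lock = threading.RLock()  # RLock objects cannot be pickled

def prepare_for_pickle(kernel: FakeKernel):
    stripped = kernel._hash_lock
    kernel._hash_lock = None
    return stripped

def restore_after_unpickle(kernel: FakeKernel, old_lock=None):
    # Restore the saved field, or re-initialize just the lock if none was kept.
    kernel._hash_lock = old_lock if old_lock is not None else threading.RLock()

k = FakeKernel("triton_mm")
prepare_for_pickle(k)
k2 = pickle.loads(pickle.dumps(k))  # what the ProcessPool round-trip does
restore_after_unpickle(k2)
assert k2._hash_lock is not None
```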
…ytorch#162309) Fixes static cuda launcher after triton-lang/triton#7866. Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains. Pull Request resolved: pytorch#162309 Approved by: https://github.com/aakhundov ghstack dependencies: pytorch#162244
The original implementation set beta to 1, which caused the out tensor (C) to be added to the output. Thus, if the output is not zero-initialized beforehand, the result can be incorrect. Removing the alpha and beta fixes the issue. Thanks @ngimel for figuring out the root cause. Pull Request resolved: pytorch#162040 Approved by: https://github.com/danielvegamyhre
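In plain PyTorch terms, the failure mode follows from the standard `addmm` semantics, `out = beta * C + alpha * (A @ B)` (a minimal illustration, not the kernel in question): with `beta=1`, whatever happens to be in an uninitialized `C` leaks into the result.

```python
# beta=1 adds the existing contents of C to A @ B, so an uninitialized C
# produces garbage; a zeroed C (or plain mm) gives the pure matmul result.
import torch

A = torch.randn(4, 8)
B = torch.randn(8, 4)
C = torch.empty(4, 4)  # uninitialized memory

bad = torch.addmm(C, A, B, beta=1, alpha=1)   # C's garbage is added in
good = torch.addmm(torch.zeros(4, 4), A, B)   # equivalent to torch.mm(A, B)
print(torch.allclose(good, A @ B))            # True
```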
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: pytorch#162372 Approved by: https://github.com/pytorchbot
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: pytorch#161395 Approved by: https://github.com/pytorchbot
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton) Notably: * this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR) * sam_fast is still failing, so we've disabled it temporarily pytorch#162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch. Pull Request resolved: pytorch#162278 Approved by: https://github.com/atalman ghstack dependencies: pytorch#162244, pytorch#162309
… mode (pytorch#161405) During comms reordering (sink waits iterative), we observed that the previous runtime estimations were quite far off for collectives and mms. This adds optional usage of: - c10d.time_estimator for collectives, which is based on the NCCL estimator - benchmark mode for matmuls only, as they are highly dependent on the mm backend The logic is mostly copied from Ruisi's PRs for inductor simple_fsdp pytorch#157572. These estimation corrections live in the default `BaseSchedulerNode.estimate_runtime()`. Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294) Pull Request resolved: pytorch#161405 Approved by: https://github.com/eellison
)" This reverts commit c9ac8c2. Reverted pytorch#162315 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if this introduced the failure https://github.com/pytorch/pytorch/actions/runs/17539536914/job/49810513700 ([comment](pytorch#162315 (comment)))
This reverts commit 3770337. Reverted pytorch#161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](pytorch#161649 (comment)))
…#162041) The async TP result and the regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error comes from the numerical error of low precision. Pull Request resolved: pytorch#162041 Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel ghstack dependencies: pytorch#162040
…lready know (pytorch#161591) We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap. Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099) Pull Request resolved: pytorch#161591 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590
…ytorch#161595) This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better. Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095) Pull Request resolved: pytorch#161595 Approved by: https://github.com/ezyang, https://github.com/bdhirsh ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590, pytorch#161591
…rch#161058) This PR is the first split PR of pytorch#156272 and only contains the OneDNN code. Please help review. Pending on the OneDNN v3.9 commit update. Don't merge. Pull Request resolved: pytorch#161058 Approved by: https://github.com/guangyey, https://github.com/EikanWang
LOAF previously could skip these fusion opportunities and cause some tests to fail. Test: - TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda Pull Request resolved: pytorch#162311 Approved by: https://github.com/jansel
updated release behavior for cached blocks Fixes pytorch#159567 Pull Request resolved: pytorch#162269 Approved by: https://github.com/eqy, https://github.com/Skylion007
… cuDNN SDPA Nested tensor/varlen (pytorch#161434) Don't recompile every time Pull Request resolved: pytorch#161434 Approved by: https://github.com/drisspg
Trying to fix pytorch#162333. The CUDA 13.0 file structure changed: instead of keeping most of the DLLs in the bin folder, they are now in ``bin\x64``, except for the cuDNN DLL. See the attached pictures: <img width="511" height="361" alt="Screenshot 2025-09-08 at 9 46 26 AM" src="https://github.com/user-attachments/assets/d2e630ee-930f-4da6-9b81-f9ef48fde7ce" /> <img width="490" height="333" alt="Screenshot 2025-09-08 at 9 46 34 AM" src="https://github.com/user-attachments/assets/194cbf43-b6ef-4218-b516-db37b91302be" /> Pull Request resolved: pytorch#162383 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/malfet
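For anyone locating CUDA DLLs manually, handling the new layout could look roughly like this (an illustrative sketch; `CUDA_PATH` and the directory names are assumptions about a typical CUDA 13.0 install, not the change made in this PR):

```python
# Add both bin and bin\x64 to the Windows DLL search path, since CUDA 13.0
# moved most DLLs into bin\x64 while cuDNN stays in bin.
import os

cuda_home = os.environ.get("CUDA_PATH", r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0")
for sub in ("bin", os.path.join("bin", "x64")):
    d = os.path.join(cuda_home, sub)
    if os.path.isdir(d) and hasattr(os, "add_dll_directory"):  # Windows-only API
        os.add_dll_directory(d)
```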
…sting_IFU_2025-09-08
# Conflicts:
# .ci/docker/ci_commit_pins/triton.txt
# .ci/docker/requirements-ci.txt
# aten/src/ATen/Context.cpp
# aten/src/ATen/cuda/tunable/GemmHipblaslt.h
# aten/src/ATen/native/ConvUtils.h
# aten/src/ATen/native/Convolution.cpp
# aten/src/ATen/native/Normalization.cpp
# aten/src/ATen/native/cuda/Blas.cpp
# aten/src/ATen/native/miopen/Conv_miopen.cpp
# requirements.txt
# test/distributed/_tools/test_fsdp2_mem_tracker.py
# test/distributed/tensor/parallel/test_tp_examples.py
# test/dynamo/test_activation_checkpointing.py
# test/dynamo/test_structured_trace.py
# test/inductor/test_aot_inductor.py
# test/inductor/test_combo_kernels.py
# test/test_matmul_cuda.py
# test/test_sparse.py
# torch/_higher_order_ops/triton_kernel_wrap.py
# torch/_inductor/choices.py
# torch/_inductor/codegen/triton.py
# torch/testing/_internal/common_cuda.py
Jenkins build for 2b05fbdbc7aa836bd60dac2f04c4bd954c020b50 commit finished as FAILURE
Jenkins build for b88024c2430b2cc6223bf5eb2a7819ba570564f3 commit finished as FAILURE
Merged latest changes from upstream/main into rocm7.1_internal_testing on 2025-09-08