Merged
694 commits
76a841f
Port OpSchema.__post_init__ and OpSchema._recompute_comparison_key to…
swolchok Sep 18, 2025
46c647d
[vllm hash update] update the pinned vllm hash (#163304)
pytorchupdatebot Sep 19, 2025
3016616
[BE] Update Python min version to 3.10 (#162310)
malfet Sep 19, 2025
c91f59b
Fix performance regression when indexing by Numpy arrays (#163280)
ezyang Sep 18, 2025
ce5637b
Fix invalid indices bug for max_unpool2d/3d on MPS (#163036)
can-gaa-hou Sep 19, 2025
5780478
Revert "[BE] Update Python min version to 3.10 (#162310)"
pytorchmergebot Sep 19, 2025
1708120
Revert "[CI] Move Windows build/tests to Python-3.10 (#162862)"
pytorchmergebot Sep 19, 2025
e0bcd58
[MTIA] Add MTIA dispatch for kernel foreach_maximum(Add D80022242 bac…
DoubleBiao Sep 19, 2025
1302637
Revert "[dynamo][guards] Do not construct entire framelocals dict for…
pytorchmergebot Sep 19, 2025
32ad29b
Revert "[dynamo][guards] Fail on an unknown framelocals to dict conve…
pytorchmergebot Sep 19, 2025
0815091
[CP][BE] Cosmetic refactors for CP code base (#163115)
fegin Sep 18, 2025
ab5086a
[WOQ] Add XPU kernel for _weight_int8pack_mm (#160938)
xiaowangintel Sep 19, 2025
33e6c5a
[Dependabot] Update(deps): Bump transformers from 4.54.0 to 4.56.0 in…
dependabot[bot] Sep 19, 2025
bee362c
[ROCm][SymmMem] Fix skip condition for PLATFORM_SUPPORTS_SYMM_MEM (#1…
pragupta Sep 19, 2025
264e7f6
[ROCm] Fix mx fp8 and fp4 code after scaling refactor changes. (#163127)
jagadish-amd Sep 19, 2025
f8f230a
[FP8][cuBLAS][H100] only test fp32 outputs for rowwise `_scaled_mm` o…
eqy Sep 19, 2025
e631d76
[Flex] Changing how bwd configs are setup and updating default b200 c…
drisspg Sep 19, 2025
4967ad8
[Graph Partition] improve custom op output alias (#163227)
BoyuanFeng Sep 19, 2025
3e663ce
[Inductor][Triton][FP8] Add a Blackwell-specific scaled persistent + …
jananisriram Sep 19, 2025
2984bfe
[ez][CI] Run vllm workflow on vllm pin updates (#163353)
clee2000 Sep 19, 2025
a3b68c7
Revert "Fix boxcox to return same result for same input in one batch …
pytorchmergebot Sep 19, 2025
607469b
Revert "[ROCm] Bump FBGEMM commit to avoid CK errors (#162590)"
pytorchmergebot Sep 19, 2025
a0d2d84
Handling overflow for long int overflow for the product of kernel_hei…
arkadip-maitra Sep 19, 2025
b8c5ec5
[CD] Simplify NVIDIA driver installation step (#163349)
malfet Sep 19, 2025
52dd7a8
Move ROCM trunk wheel builds to 3.10 (#163339)
malfet Sep 19, 2025
03f34fd
Add explicit typing to nn.Module.__init__() parameters (#157389)
dsashidh Sep 19, 2025
bc7b17a
Realize LazyVariableTracker before raising exception (#163350)
guilhermeleobas Sep 19, 2025
979e10f
[Bugfix] Match eager stride semantics for cloned tensors with preserv…
Lucaskabela Sep 19, 2025
a273475
[BE] Introduce `CONDA_ROOT_DIR` (#163341)
malfet Sep 19, 2025
4a160da
[CUDA] revert PR 130472 (#162950)
thenumberouscode Sep 19, 2025
2a308c7
Revert "Improve device info with new flops and bandwidth formula base…
pytorchmergebot Sep 19, 2025
f8fb437
[SymmMem] Barrier on team instead of world (#163298)
kwen2501 Sep 18, 2025
7130b17
[SymmMem] Fix memory allocation hold-up (#162680)
kwen2501 Sep 18, 2025
ba3c2c8
SDP Backend function fix (#161169)
ahkush Sep 19, 2025
466122b
[inductor] avoid creating LoopBody twice (#162101)
shunting314 Sep 11, 2025
e88460f
[Inductor] don't call sympy_str when not needed (#162126)
shunting314 Sep 11, 2025
248156e
[Inductor] do loop reordering in a separate final round (#162355)
shunting314 Sep 11, 2025
df9a482
Bugfix for doing negative padding (#161639)
skpark-rh Sep 19, 2025
9f8a311
[Inductor][Intel GPU] Save `threads_per_warp` from tirton compiled ke…
etaf Sep 19, 2025
fab8455
Don't use declarations in global namespace in stable headers (#163352)
mikaylagawarecki Sep 19, 2025
e6a9db5
Add analytics ID to cpp docs (#163370)
svekars Sep 19, 2025
9b5ec0f
Use computed buffer sizes of torch for cusparseLt metadata (#163125)
aartbik Sep 19, 2025
0098e56
[CI] Move Windows build/tests to Python-3.10 (#162862)
malfet Sep 19, 2025
ee7bdd8
[graph partition] Add way to register custom rule (#163310)
zou3519 Sep 19, 2025
093f064
[CP][BE] Correct an incorrect docstring (#163131)
fegin Sep 18, 2025
8225a26
[dynamo] Fix issue with namedtuple slicing (#163351)
jansel Sep 19, 2025
bfe9e60
Simplify PrecompileContext to no longer be a CacheArtifactManager (#1…
jamesjwu Sep 20, 2025
a1df0b4
Lazy import to avoid circular import issue for DebugMode (#163381)
SherlockNoMad Sep 20, 2025
a31acf3
Clean up obsoleted vLLM tests (#163383)
huydhn Sep 20, 2025
e56dd5d
[Inductor-FX] Support torch.cond (#163234)
blaine-rister Sep 20, 2025
a87aea0
Update RandomSampler docstring. data_source must be Sized not Dataset…
dsashidh Sep 20, 2025
0b5a99b
remove duplicate import for defaultdict (#160519)
parsshar-RH Sep 20, 2025
df5d6d5
[inductor][triton heuristics] move allow tf32 out of config params (#…
coconutruben Sep 20, 2025
0ee331b
[inductor][choices] move extra kwargs out of get_template_configs (#1…
coconutruben Sep 20, 2025
d55c9d5
[CP] Fix cuDNN CP LSE dimension bug (#163231)
fegin Sep 18, 2025
5050cfa
[Opitmus] fix fp8 activation quatization for duplicates forward outpu…
mengluy0125 Sep 20, 2025
eb11d17
[Caffe2] Improve SVE batch box cox by 2% (#163360)
Nicoshev Sep 20, 2025
f9074c7
[STABLE ABI] Add copy_ operation. (#161895)
pearu Sep 19, 2025
d70c0ba
minimize graph capture output (#162211)
avikchaudhuri Sep 20, 2025
3938175
[1/n] Support cpu_tensor.to("cuda:0") in FakeTensorMode on cuda-less …
SherlockNoMad Sep 20, 2025
9e3725e
make fullgraph_capture work on mod, args, kwargs (#162849)
avikchaudhuri Sep 20, 2025
8e3fd3d
[AI Codemod][DevmatePerfOptimizationVectorReallocation] fbcode/caffe2…
yfeldblum Sep 20, 2025
e37b600
[CUDA][cuBLAS][FP8] Forward-fix #162022 (#163354)
eqy Sep 21, 2025
2887f3f
[BE] Slight improvements to documentation in python_dispatch (#162963)
ezyang Sep 19, 2025
97eb7a2
torchdim Python port (#160236)
ezyang Sep 20, 2025
5b386ee
[vllm hash update] update the pinned vllm hash (#163392)
pytorchupdatebot Sep 21, 2025
1ca9445
[BE][Ez]: Prevent copies of std::vector in CUDA ForeachOps (#163416)
Skylion007 Sep 21, 2025
f591bb5
Remove data_source argument from Sampler (#163134)
cyyever Sep 21, 2025
4a96a6f
[Docs] Fix indentations in cond.md (#156147)
windsonsea Sep 21, 2025
1faf636
Delete functorch C extension entirely. (#163340)
ezyang Sep 21, 2025
9ba9180
Add api info for torch._C._nn.pyi (#162707)
orangeH25 Sep 21, 2025
d8cbbc0
[Easy][AMP] Refactor the AMP logic for getting dtype (#162796)
fffrog Sep 12, 2025
5d8a226
[SymmMem] Promote `@requires_nvshmem` instead of `enable_triton` (#16…
kwen2501 Sep 21, 2025
f34744d
[inductor] bugfix: keep WeakDeps (WAR deps) during fusion (#162316)
v0i0 Sep 19, 2025
51152ef
Remove autograd code for Python < 3.9 (#163313)
cyyever Sep 21, 2025
5599f48
Fully native DTensor.__new__ (#162508)
swolchok Sep 18, 2025
4d3d32f
Add torchfuzz initial impl. (#163417)
laithsakka Sep 20, 2025
8b14f43
[torch] DRY a couple of lines in unpickler (#163447)
yfeldblum Sep 21, 2025
6ac2b3a
[BE] Adding aliases for CUDA and XPU API documentation (#162984)
jiannanWang Sep 21, 2025
8a281d7
[submodule] Bump libfmt to 12.0.0 (#163441)
cyyever Sep 21, 2025
0b59492
[export] Fix wrap_with_set_grad_enabled retracing (#163295)
angelayi Sep 21, 2025
01f927e
Remove workarounds for Python 3.6 (#163440)
cyyever Sep 22, 2025
281bb56
Enable half precision types on test_conv_cudnn_nhwc_support (#163444)
cyyever Sep 22, 2025
3a7db34
Revert "[SymmMem] Promote `@requires_nvshmem` instead of `enable_trit…
pytorchmergebot Sep 22, 2025
f007894
Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make dist…
pytorchmergebot Sep 22, 2025
ae5be03
Revert "Delete functorch C extension entirely. (#163340)"
pytorchmergebot Sep 22, 2025
edafc90
Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
pytorchmergebot Sep 22, 2025
96a3afb
Simplify BFLOAT16_AVAILABLE (#163445)
cyyever Sep 22, 2025
60b4791
[MPS] Fix compile linalg inv (#163452)
Isalia20 Sep 22, 2025
9f5a644
[BE] Update Python min version to 3.10 (#162310)
malfet Sep 22, 2025
10adeb9
Revert "[BE] Update Python min version to 3.10 (#162310)"
pytorchmergebot Sep 22, 2025
509c4e8
Update cutlass version for fbcode (#163091)
henrylhtsang Sep 19, 2025
eaac218
[ROCm] Fix environment variable AOTRITON_INSTALLED_PREFIX (#163373)
xinyazhang Sep 22, 2025
e310cc5
Update fbgemm submodule (#163411)
cthi Sep 22, 2025
9ca183e
switch from stack based to graph based aproach (#163459)
laithsakka Sep 22, 2025
06fe5b9
[AOTI] fix TestAOTInductorPackage temp file locked handler. (#163499)
xuhancn Sep 22, 2025
5e7be98
[BE] Update Python min version to 3.10 (#162310)
malfet Sep 22, 2025
281f8f4
Combine strong and weak refcounts in intrusive_ptr in a single refcou…
mcfi Sep 22, 2025
d279a6a
ci: Add a way to lint all files in a PR from label (#163525)
seemethere Sep 22, 2025
bec967e
Remove C++ and test branches for CUDA<12 (#163443)
cyyever Sep 22, 2025
3be9c86
[opaque obj] Initial OpaqueObject (#162660)
angelayi Sep 22, 2025
dd30667
[opaque_obj] Add set_payload + docs (#163276)
angelayi Sep 22, 2025
4941719
Enable logging for absolute memory estimation (#158799)
basilwong Sep 22, 2025
7e97811
Fix lint (#163542)
angelayi Sep 22, 2025
1818c36
[Fix] Restrict stride normalization to 1D tensors on export (#163282)
Kathryn-cat Sep 22, 2025
eaa613b
Revert "[opaque_obj] Add set_payload + docs (#163276)"
pytorchmergebot Sep 22, 2025
bf28990
Add support for NestedTensor share_memory_ (#162272)
adabeyta Sep 22, 2025
d150484
[opaque_obj] Add set_payload + docs (#163276)
angelayi Sep 22, 2025
6f9aef5
[2/n] Support module.to("cuda:0") in FakeTensorMode on cuda-less mach…
SherlockNoMad Sep 22, 2025
d008670
[triton] update 3.5 pin to bbb06c0334a6772b92d24bde54956e675c8c6604 (…
davidberard98 Sep 19, 2025
fd785b1
Add NestedTensor dispatch for _is_any_true/_is_all_true (#162096)
adabeyta Sep 22, 2025
e065d35
[BE]: Add a few more missing move from return indices (#163456)
Skylion007 Sep 22, 2025
46e1b7d
remove allow-untyped-defs from ./torch/utils/data/datapipes/iter/file…
bobrenjc93 Sep 22, 2025
cf28ab2
remove allow-untyped-defs from ./torch/ao/quantization/pt2e/duplicate…
bobrenjc93 Sep 22, 2025
02da475
Triton template IMA reads on B200 (#163460)
drisspg Sep 22, 2025
8abc2af
[STABLE ABI] Add clone method to torch::stable::Tensor (#161896)
pearu Sep 22, 2025
8e62d01
Add dynamic shapes doc (#159428)
svekars Sep 22, 2025
4027e97
[BE] Delete `skipIfMPSOnMacOS13` (#163515)
malfet Sep 22, 2025
09cb34c
[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed …
ezyang Sep 22, 2025
e558f7a
[vllm hash update] update the pinned vllm hash (#163463)
pytorchupdatebot Sep 22, 2025
da05aa7
[BE] Use `output_t` directly (#163518)
malfet Sep 22, 2025
0256f91
[BUG] MaxUnpool2d/3d should check output dim before accessing its ele…
can-gaa-hou Sep 22, 2025
2b03663
Allow add_persistent_r_block to scale up rblock up to a limit (#162296)
PaulZhang12 Sep 17, 2025
7ea8998
Better decomp for torch.eye (#163386)
jansel Sep 22, 2025
36c2a13
[inductor] Fix bug where viewed outputs get padded (#163398)
jansel Sep 22, 2025
a1bd924
[inductor] Fallback on strided complex add (#163387)
jansel Sep 22, 2025
c8fd2b4
[inductor] Skip test_baddmm on XPU (#163414)
jansel Sep 22, 2025
4fc271e
[inductor] Don't require_dense for grid_sampler_2d_backward (#163415)
jansel Sep 22, 2025
e0cbab4
[Inductor] avoid CUDA__equal when constant tensors are from different…
cp2923 Sep 22, 2025
b756b58
Improve fake tensor leakage detection in export by not relying on gc …
tugsbayasgalan Sep 22, 2025
60c2bde
Replace Literal[None] with None in typing (#163489)
cyyever Sep 22, 2025
33daaad
dynamo: Handle objects in graph that do not support weakref (#163168)
c00w Sep 17, 2025
fa15fb0
[EZ] Remove XLA from unstable.yml (#163564)
malfet Sep 22, 2025
8da0086
Remove outdated commented CMake code (#163442)
cyyever Sep 22, 2025
68e75be
Update pytorch_sphinx_theme2 to latest hash (#163269)
svekars Sep 22, 2025
539e84e
[precompile] Add option to disable guard check on aot-compiled functi…
zhxchen17 Sep 23, 2025
3ef1bef
[sdpa] make sure to recompile if alignment is different than before (…
ColinPeppler Sep 19, 2025
2c7959e
[ignore][codex-test] Add typing to simple library registry (#161367)
bobrenjc93 Sep 23, 2025
8f30a8d
[AOTInductor] Add grid information for Triton Kernels (#160131)
muchulee8 Sep 22, 2025
e9300b2
remove allow-untyped-defs from ./torch/onnx/_internal/torchscript_exp…
bobrenjc93 Sep 22, 2025
6a48f57
[1/N] Remove 'type: ignore' suppressions (#163468)
cyyever Sep 23, 2025
447b8fc
[2/N] Use filesystem in inductor (#163465)
cyyever Sep 23, 2025
27164b6
Add fake_impl for _native_multi_head_attention (#163167)
ydwu4 Sep 23, 2025
0b75a16
[torchfuzz] Encapsulate fuzzing and codegen logic into ops (#163547)
bobrenjc93 Sep 22, 2025
95ac7d7
Rename to _debug_mode.py to make it private (#163534)
SherlockNoMad Sep 23, 2025
fcd79d5
[vllm hash update] update the pinned vllm hash (#163590)
pytorchupdatebot Sep 23, 2025
0e12238
[torchfuzz] remove supports_variable_inputs for now (#163553)
bobrenjc93 Sep 22, 2025
bb5be56
[torch][cuda][device_limits] Library for querying device hardware lim…
valentinandrei Sep 23, 2025
e3b392b
[BC breaking] Remove deprecated imports for torch.utils.data.datapipe…
cyyever Sep 23, 2025
d3a1345
Use functools.cache on has_efa (#163439)
cyyever Sep 23, 2025
19b754d
Revert "Update cutlass version for fbcode (#163091)"
pytorchmergebot Sep 23, 2025
08c5efd
[torchfuzz] cache operators (#163554)
bobrenjc93 Sep 22, 2025
d5e51d3
[torchfuzz] decompose -> fuzz_inputs_specs (#163555)
bobrenjc93 Sep 22, 2025
1545bb1
[torchfuzz] shuffle compatible ops (#163556)
bobrenjc93 Sep 22, 2025
309fe03
[torchfuzz] remove unneeded try catch (#163557)
bobrenjc93 Sep 22, 2025
45d9dcc
Update Kineto Submodule (#162222)
sraikund16 Sep 23, 2025
375f3e3
[OpenReg][Docs] Correct docs about `openreg` usage example. (#163235)
KarhouTam Sep 23, 2025
b426ba1
[torchfuzz] introduce tensor and scalar pointwise ops (#163558)
bobrenjc93 Sep 22, 2025
8d81564
[pt2][cache] rework cache for true generic usage + better tests (#163…
nmacchioni Sep 23, 2025
5d749ce
Remove test conditions for CUDA<12 (#163495)
cyyever Sep 23, 2025
3c64b2a
CUDA 13.0 Warning update for supported architectures (#163585)
atalman Sep 23, 2025
bda9ab2
[inductor] fix as_strided lowering with .view(dtype) inputs (#163319)
xmfan Sep 22, 2025
1a42656
[Flex attention] Fix flex attention head broadcast (#163426)
Isalia20 Sep 23, 2025
aff76c0
Revert "Add fake_impl for _native_multi_head_attention (#163167)"
pytorchmergebot Sep 23, 2025
e05c9c0
[ROCm][CI] cudagraph trees ut fixes (#163592)
jeffdaily Sep 23, 2025
4264fd3
Add basic tests for torch.distributed.tensor._utils.compute_global_te…
swolchok Sep 18, 2025
518c320
[inductor] libdevice.sqrt => tl.sqrt_rn (#163419)
jansel Sep 23, 2025
ed84e80
[inductor] Freeze layouts in FlexAttention (#163434)
jansel Sep 23, 2025
9c4d9f9
[inductor] Support out_dtype arg to matmul (#163393)
jansel Sep 23, 2025
6ef7487
[dynamo] Fix TorchFunctionMode handling with get_rng_state (#163412)
jansel Sep 23, 2025
49e7b2f
[inductor] Fix error from custom CUDA allocators (#163422)
jansel Sep 23, 2025
720a7b2
[export] Remove .contiguous() when saving weights to raw bytes (#163587)
yiming0416 Sep 23, 2025
0f67407
Large tests failing on bfloat16 (#163537)
drisspg Sep 22, 2025
b3cf5c7
Skip on sm100 later since Tests are non determinisitic (#163552)
drisspg Sep 22, 2025
5f0c7cb
Add B200 smoke test (#159494)
drisspg Sep 22, 2025
ebddbe7
[ROCm][CI] skip test_sparse_triangular_solve (#163651)
jeffdaily Sep 23, 2025
6e5dddb
Use accelerator API in common_dtensor (#163498)
dilililiwhy Sep 23, 2025
221ac81
Revert "[precompile] Add option to disable guard check on aot-compile…
pytorchmergebot Sep 23, 2025
134dfbe
[DCP] DTensor slice dequantization with proper block alignment (#163532)
saumishr Sep 23, 2025
fde929c
[AOTI] Fix model_package_loader get_cpp_compile_command (#163561)
xuhancn Sep 23, 2025
2aadcea
[ROCm] Improve perf for elementwise broadcast with mixed dtype (#163562)
jerrymannil Sep 23, 2025
649ceda
[export] handling NamedTuple inputs (#162959)
Raman-RH Sep 23, 2025
ca35dc2
[EZ] Fix UP041 violations (#163648)
malfet Sep 23, 2025
0696a4b
[EZ] Perma-ignore UP038 (#163649)
malfet Sep 23, 2025
8e6b0c7
[Inductor] Remove `no_type_check` annotation on properties (#163570)
blaine-rister Sep 23, 2025
bcb893a
[ROCm] Build FBGEMM_GENAI for gfx942 only (#162648)
jithunnair-amd Sep 23, 2025
22c5e8c
Add num_store to inductor_meta and use it to scale persistent reducti…
PaulZhang12 Sep 22, 2025
2a9745d
[multi-kernel] shape-similarity kernel selection (#163090)
pianpwk Sep 23, 2025
fc84743
Implement CUDA stream protocol (#163614)
msaroufim Sep 23, 2025
e671dcc
Update tests to check for more robust pattern (#163107)
tugsbayasgalan Sep 23, 2025
5ca563e
symintify fill_diagonol_ (#163485)
bobrenjc93 Sep 23, 2025
b182365
[ez] use list initializer syntax in fill_diagonal_ (#163607)
bobrenjc93 Sep 23, 2025
8c8416b
Update pytorch.org links in docs/conf.py (#163682)
svekars Sep 23, 2025
29af258
Less aggressive persistent reduction when it could induce large maski…
eellison Sep 23, 2025
c3d9f08
[torchfuzz] introduce multi process fuzzer (#163560)
bobrenjc93 Sep 23, 2025
c63e417
use reduction hint for aggressive rblock (#163371)
eellison Sep 23, 2025
b879ef7
[ROCm][CI] skip TestCudaPrimaryCtx.test_set_device_0 (#163693)
jeffdaily Sep 23, 2025
2014908
[MPS] Compute `offset2bag/bag_size/max_indices` in `_embedding_bag` (…
kurtamohler Sep 19, 2025
6b5ad5f
[Kineto] Add list of string parsing for profiler (#163593)
muchulee8 Sep 23, 2025
f3f67ff
Fix warn message (#163578)
drisspg Sep 22, 2025
f9fa138
[BE] Delete all pre py-3.10 checks (#163653)
malfet Sep 23, 2025
ee75c3d
Support for amin, amax, and aminmax (#163669)
srsuryadev Sep 23, 2025
eb3fbf5
[inductor] in emulate_precision_casts, disable fma fusion in triton (…
v0i0 Sep 23, 2025
4535254
[3/N] Use std::filesystem in inductor (#163632)
cyyever Sep 24, 2025
dc93529
[Triton] [Inductor] Restrict subprocess autotuning to just Triton (#1…
njriasan Sep 24, 2025
1e754d5
docs and optional kwargs for full graph capture (#163550)
avikchaudhuri Sep 24, 2025
be6c127
[AOTI] Pass comments from metadata to the autotune block (#163600)
desertfire Sep 23, 2025
e2ce79e
[Flex] Fix silent correctness w/ backpropping grads (#163677)
drisspg Sep 23, 2025
c261c71
Simplify _compute_local_shape_and_global_offset and make it SPMD. (#1…
ezyang Sep 19, 2025
ca512af
[inductor] Fix issue with scalar arg handling (#163481)
jansel Sep 23, 2025
6fa9727
[inductor] Fix bugs in emulate_precision_casts (#163520)
jansel Sep 23, 2025
d746b98
[inductor] Fix divmod error in decomp (#163482)
jansel Sep 23, 2025
42e9902
cd: Move arm64 to linux.arm64.r7g.12xlarge.memory (#163681)
seemethere Sep 23, 2025
6f1d962
[vllm hash update] update the pinned vllm hash (#163711)
pytorchupdatebot Sep 24, 2025
20eeb54
Add api info for torch._C._nn.pyi (#162936)
orangeH25 Sep 24, 2025
124dd36
[hop] support local_map + SAC (#163322)
xmfan Sep 23, 2025
0390798
[Triton] [Inductor] Enable Epilogue Subtiling in the blackwell ws tem…
njriasan Sep 24, 2025
a8e9ed2
[inductor] turn on loaf (for oss) by default (#162030)
shunting314 Sep 22, 2025
f68de58
[Inductor-FX] Support symbol and dynamic scalar graph inputs and outp…
blaine-rister Sep 24, 2025
2c5a3d7
Delete functorch C extension entirely. (#163340)
ezyang Sep 24, 2025
dad54ca
Add mistral/gpt-oss to benchmarks (#163565)
angelayi Sep 24, 2025
11a231e
[c10d] P2P tensors must be dense (#163719)
kwen2501 Sep 24, 2025
bf0747c
[Code Clean] Remove deadcodes about Python3.9 [1/N] (#163626)
fffrog Sep 24, 2025
0bca779
[Code Clean] Remove deadcodes about Python3.9 [2/N] (#163627)
fffrog Sep 24, 2025
33aabdd
[Code Clean] Remove deadcodes about Python3.9 [3/N] (#163629)
fffrog Sep 24, 2025
ec0cd81
[Code Clean] Remove deadcodes about Python3.9 [4/N] (#163643)
fffrog Sep 24, 2025
6f34cc0
[Code Clean] Remove deadcodes about Python3.9 [5/N] (#163644)
fffrog Sep 24, 2025
a635505
[Code Clean] Remove deadcodes about Python3.9 [6/N] (#163645)
fffrog Sep 24, 2025
2390d34
[Code Clean] Remove deadcodes about Python3.9 [7/N] (#163646)
fffrog Sep 24, 2025
3e1b1a3
Revert "[inductor] Fix issue with scalar arg handling" (#163737)
jansel Sep 24, 2025
207f104
[Triton] [Inductor] Set default configs for Blackwell Matmul Template…
njriasan Sep 24, 2025
b66aa1a
[ARM] Add test_memory_profiler to aarch64 tests (#145260)
robert-hardwick Sep 24, 2025
141fc72
[CD] CUDA 13.0 fix preload logic to include nvidia/cu13/lib/ (#163661)
atalman Sep 24, 2025
3b73841
update test_quantization tests to run weekly (#163077)
liangel-02 Sep 24, 2025
9d0d98a
Use cuda nvrtc so file based on cuda version used by torch (#163642)
atalman Sep 24, 2025
5d0f639
Make `Tensor.__dlpack__(stream=None)` capture-safe during CUDA Graph …
eee4017 Sep 24, 2025
4c2c401
Record redistribute_local_tensor in DebugMode (#163704)
SherlockNoMad Sep 24, 2025
9341ede
Revert to old behaviour of not padding strides if shape or stride is …
nandesuka Sep 24, 2025
768361e
Add less warps config to inner reductions (#162447)
PaulZhang12 Sep 24, 2025
c414f75
[WOQ][Inductor] Enable CUDA coverage for _weight_int8pack_mm (#163461)
bbeckca Sep 24, 2025
0456b23
[AOTI] Add verbose error information for extract file (#163718)
xuhancn Sep 24, 2025
71eec6a
[dist] handle discontiguous allgather/reducescatter inputs (#163712)
ngimel Sep 24, 2025
0dce2af
[ROCm][CI] adjust tf32 tolerance for test_compile_kernel_advanced (#1…
jeffdaily Sep 24, 2025
90a2825
Add `inference_mode` hint message to use `eval` with inference. (#163…
zeshengzong Sep 24, 2025
1495b35
Remove Python 3.9 for Triton builds (#163778)
atalman Sep 24, 2025
b40191b
Merge remote-tracking branch 'upstream/main' into rocm7.1_internal_te…
github-actions[bot] Sep 24, 2025
f3e8213
Fix merge conflicts
pragupta Sep 24, 2025
0ad8381
Address review comments wrt triton_heuristics and install_rocm
pragupta Sep 30, 2025
63fcd9b
update related_commits
pragupta Sep 30, 2025
77f4534
Fix more conflicts with triton_heuristics.py
pragupta Sep 30, 2025
4 changes: 0 additions & 4 deletions .ci/docker/ci_commit_pins/triton.txt
@@ -1,5 +1 @@
<<<<<<< HEAD
6193b30becb1ac7be704cf87b8cb9bf13e7f9689
=======
bbb06c0334a6772b92d24bde54956e675c8c6604
>>>>>>> upstream/main
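Resolution note: the four deleted lines are the <<<<<<< and >>>>>>> markers, the ======= separator, and the HEAD pin, so the upstream Triton pin (the hash that commit d008670 moved the 3.5 pin to) survives as the file's only line. Assuming that reading, the resolved file is simply:

    bbb06c0334a6772b92d24bde54956e675c8c6604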
7 changes: 2 additions & 5 deletions .ci/docker/common/install_rocm.sh
@@ -114,12 +114,9 @@ EOF
rm -rf HIP clr
fi

<<<<<<< HEAD
# temporary hipblasLT dependency install
apt install libmsgpackc2
Review thread:

Collaborator: @pragupta This change was supposed to be temporary as per f1ad49a (cc @pruthvistony).
Can we please ascertain whether this is really needed for the ROCm 7.1 mainline?
cc @jeffdaily to comment on whether this is needed for the ROCm 7.0 CI upstream enablement.

Collaborator: The ROCm 7 CI upgrade doesn't have this line. What was this fixing?

Collaborator (author): Removed.

=======
pip_install "git+https://github.com/rocm/composable_kernel@$ROCM_COMPOSABLE_KERNEL_VERSION"
>>>>>>> upstream/main

# Cleanup
apt-get autoclean && apt-get clean
@@ -131,8 +128,8 @@ install_centos() {
yum update -y
yum install -y kmod
yum install -y wget
if [[ $OS_VERSION == 9 ]]; then

if [[ $OS_VERSION == 9 ]]; then
dnf install -y openblas-serial
dnf install -y dkms kernel-headers kernel-devel
else
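Resolution note: per the review thread above, the temporary libmsgpackc2 install from HEAD was removed ("Removed.") and the upstream composable_kernel install was kept, which is what the 2-addition/5-deletion count implies. A minimal sketch of the resolved region, assuming the upstream side won:

    # upstream side kept: install composable_kernel from the pinned fork
    pip_install "git+https://github.com/rocm/composable_kernel@$ROCM_COMPOSABLE_KERNEL_VERSION"

    # Cleanup
    apt-get autoclean && apt-get clean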
11 changes: 0 additions & 11 deletions .ci/docker/requirements-ci.txt
@@ -112,13 +112,8 @@ ninja==1.11.1.3
#Pinned versions: 1.11.1.3
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py

<<<<<<< HEAD
numba==0.60.0 ; python_version == "3.9"
numba==0.61.2 ; python_version > "3.9"
=======
numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
>>>>>>> upstream/main
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
#test that import: test_numba_integration.py
@@ -137,14 +132,8 @@ numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
#test_nn.py, test_namedtensor.py, test_linalg.py, test_jit_cuda_fuser.py,
#test_jit.py, test_indexing.py, test_datapipe.py, test_dataloader.py,
#test_binary_ufuncs.py
<<<<<<< HEAD
numpy==2.0.2 ; python_version == "3.9"
numpy==2.1.2 ; python_version > "3.9"
=======
numpy==1.22.4; python_version == "3.10"
numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
numpy==2.1.2; python_version >= "3.13"
>>>>>>> upstream/main

pandas==2.2.3

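Resolution note: both hunks here are pure deletions of the HEAD side plus markers, so the upstream pins survive. Under that reading, the resolved requirements-ci.txt keeps:

    numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
    numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
    numpy==1.22.4; python_version == "3.10"
    numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
    numpy==2.1.2; python_version >= "3.13"

This matches the "[BE] Update Python min version to 3.10 (#162310)" work in this merge: HEAD's python_version == "3.9" pins are gone.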
3 changes: 0 additions & 3 deletions CMakeLists.txt
@@ -896,8 +896,6 @@ cmake_dependent_option(
"USE_CUDA OR USE_ROCM"
OFF)

<<<<<<< HEAD
=======
IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
message(WARNING "Unsupported ROCM arch for FBGEMM GenAI, will set USE_FBGEMM_GENAI to OFF")
set(USE_FBGEMM_GENAI off)
@@ -909,7 +907,6 @@ if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPI
set(USE_FBGEMM_GENAI ON)
endif()

>>>>>>> upstream/main
# CAVEAT: Again, Flash Attention2 will error while building for sm52 while Mem
# Eff Attention won't
cmake_dependent_option(
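Resolution note: the three deletions are exactly the conflict markers (the HEAD side of this conflict was empty), leaving the upstream FBGEMM GenAI gating in place. Sketched, with the closing endif() assumed from the collapsed context:

    IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
      message(WARNING "Unsupported ROCM arch for FBGEMM GenAI, will set USE_FBGEMM_GENAI to OFF")
      set(USE_FBGEMM_GENAI off)
    endif()

This lines up with commit bcb893a, "[ROCm] Build FBGEMM_GENAI for gfx942 only (#162648)".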
4 changes: 0 additions & 4 deletions aten/src/ATen/native/Normalization.cpp
@@ -671,13 +671,9 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, int64_t> _batch_norm_impl_index(
std::cout << "PYTORCH_MIOPEN_EXTRA_LOGGING: ********************* _batch_norm_impl_index (calling miopen_batch_norm)" << std::endl;
return std::tuple_cat(
at::miopen_batch_norm(
<<<<<<< HEAD
input.contiguous(input.suggest_memory_format()), weight.contiguous(), bias.contiguous(),
=======
input.contiguous(input.suggest_memory_format()),
weight.contiguous(),
bias.contiguous(),
>>>>>>> upstream/main
running_mean.defined() ? running_mean.contiguous() : running_mean,
running_var.defined() ? running_var.contiguous() : running_var,
training, momentum, eps),
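Resolution note: the two sides of this conflict are semantically identical; only the formatting of the miopen_batch_norm call differs, and the upstream multi-line form is kept:

    // upstream formatting kept; behavior unchanged
    at::miopen_batch_norm(
        input.contiguous(input.suggest_memory_format()),
        weight.contiguous(),
        bias.contiguous(),
        ...)  // remaining arguments as in the surrounding context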
21 changes: 0 additions & 21 deletions aten/src/ATen/native/miopen/BatchNorm_miopen.cpp
@@ -103,11 +103,7 @@ std::tuple<Tensor, Tensor, Tensor> miopen_batch_norm(
mode = miopenBNSpatial;
}

<<<<<<< HEAD
auto output_t = at::empty(input->sizes(), input->options(), input->suggest_memory_format());
=======
auto output_t = at::empty_like(input_t, input_t.options(), input_t.suggest_memory_format());
>>>>>>> upstream/main
TensorArg output{ output_t, "output", 0 };

auto handle = getMiopenHandle();
@@ -180,18 +176,10 @@ std::tuple<Tensor, Tensor, Tensor> miopen_batch_norm_backward(

auto grad_output_contig =
grad_output_t.contiguous(input_t.suggest_memory_format());
<<<<<<< HEAD
TensorArg input{ input_t, "input", 1 },
grad_output{ grad_output_contig, "grad_output", 2 },
weight{ weight_t, "weight", 3 },
save_mean{ save_mean_t, "save_mean", 4 },
save_var{ save_var_t, "save_var", 5 };
=======
TensorArg input{input_t, "input", 1},
grad_output{grad_output_contig, "grad_output", 2},
weight{weight_t, "weight", 3}, save_mean{save_mean_t, "save_mean", 4},
save_var{save_var_t, "save_var", 5};
>>>>>>> upstream/main
CheckedFrom c = "miopen_batch_norm_backward";

checkAllDefined(c, {input, grad_output, weight, save_mean, save_var});
@@ -203,13 +191,9 @@ }
}
checkAllSameType(c, {input, grad_output});
checkAllSameType(c, {weight, save_mean, save_var});
<<<<<<< HEAD
checkAllContiguous(c, {save_mean, save_var});
=======
// TODO: is weight required to be contiguous?
checkAllContiguous(c, {save_mean, save_var});
// TODO: TensorArg check should start handle memory format
>>>>>>> upstream/main
TORCH_CHECK(input->is_contiguous(input->suggest_memory_format()));
TORCH_CHECK(grad_output->is_contiguous(input->suggest_memory_format()));
checkDimRange(c, input, 2, 6 /* exclusive */);
@@ -226,12 +210,7 @@ std::tuple<Tensor, Tensor, Tensor> miopen_batch_norm_backward(
mode = miopenBNSpatial;
}

<<<<<<< HEAD
auto grad_input_t = at::empty(
input->sizes(), input->options(), input->suggest_memory_format());
=======
auto grad_input_t = at::empty(input->sizes(), input->options(), input->suggest_memory_format());
>>>>>>> upstream/main
auto grad_weight_t = at::empty(weight->sizes(), weight->options());
auto grad_bias_t = at::empty(weight->sizes(), weight->options());

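Resolution note: all four conflicts in this file resolve to the upstream side. The one substantive change is the forward output allocation, where upstream's empty_like replaces HEAD's empty:

    // upstream side kept: allocate like input_t, preserving its memory format
    auto output_t = at::empty_like(input_t, input_t.options(), input_t.suggest_memory_format());

The other three hunks (TensorArg layout, the contiguity TODO comments, and the grad_input_t allocation) differ only in formatting and comments.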
11 changes: 0 additions & 11 deletions requirements-build.txt
@@ -1,6 +1,5 @@
# Build System requirements
setuptools>=70.1.0,<80.0 # setuptools develop deprecated on 80.0
<<<<<<< HEAD
cmake>=3.31.4
ninja==1.11.1.3
numpy==2.0.2 ; python_version == "3.9"
@@ -10,14 +9,4 @@ pyyaml==6.0.2
requests==2.32.4
six==1.17.0 # dependency chain: NNPACK -> PeachPy -> six
typing-extensions==4.14.1
=======
cmake>=3.27
ninja
numpy
packaging
pyyaml
requests
six # dependency chain: NNPACK -> PeachPy -> six
typing-extensions>=4.10.0
pip # not technically needed, but this makes setup.py invocation work
>>>>>>> upstream/main
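Resolution note: this conflict resolves the other way from the CI files above, keeping HEAD's fully pinned build requirements and deleting upstream's looser list (cmake>=3.27, unpinned ninja/numpy/etc., plus the extra pip entry). Under that reading, the resolved file is:

    # Build System requirements
    setuptools>=70.1.0,<80.0 # setuptools develop deprecated on 80.0
    cmake>=3.31.4
    ninja==1.11.1.3
    numpy==2.0.2 ; python_version == "3.9"
    # (collapsed lines elided)
    pyyaml==6.0.2
    requests==2.32.4
    six==1.17.0 # dependency chain: NNPACK -> PeachPy -> six
    typing-extensions==4.14.1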
13 changes: 0 additions & 13 deletions test/nn/test_convolution.py
@@ -50,12 +50,6 @@
parametrize as parametrize_test,
run_tests,
set_default_dtype,
<<<<<<< HEAD
skipIfRocm,
skipIfNotMiopenSuggestNHWC,
skipIfRocmVersionLessThan,
=======
>>>>>>> upstream/main
subtest,
TEST_SCIPY,
TEST_WITH_ROCM,
@@ -4033,16 +4027,9 @@ def test_conv_double_backward_strided_with_3D_input_and_weight(self, device):

@skipCUDAIfRocm
@onlyCUDA
<<<<<<< HEAD
@largeTensorTest('40GB')
@largeTensorTest('24GB', 'cpu')
# Skipped for ROCm temp - https://ontrack-internal.amd.com/browse/SWDEV-383635
@skipIfRocm
=======
@largeTensorTest("40GB")
@largeTensorTest("24GB", "cpu")
@tf32_on_and_off(0.005)
>>>>>>> upstream/main
def test_conv3d_64bit_indexing(self, device):
x = torch.rand(1, 32, 512, 512, 256)
m = torch.nn.Conv3d(32, 1, kernel_size=1, padding=0, stride=1, bias=False)
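Resolution note: the import hunk drops HEAD's ROCm skip helpers (skipIfRocm and friends), and the decorator hunk keeps upstream's stack for test_conv3d_64bit_indexing, replacing the blanket ROCm skip with a tf32 tolerance. A sketch of the resolved decorators, assuming the upstream side:

    @skipCUDAIfRocm
    @onlyCUDA
    @largeTensorTest("40GB")
    @largeTensorTest("24GB", "cpu")
    @tf32_on_and_off(0.005)
    def test_conv3d_64bit_indexing(self, device):
        x = torch.rand(1, 32, 512, 512, 256)
        ...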
4 changes: 0 additions & 4 deletions test/test_binary_ufuncs.py
@@ -1481,11 +1481,7 @@ def to_np(value):
elif torch.can_cast(torch.result_type(base, exponent), base.dtype):
actual2 = actual.pow_(exponent)
self.assertEqual(actual, expected.to(actual))
<<<<<<< HEAD
self.assertEqual(actual2, expected.to(actual))
=======
self.assertEqual(actual2, expected.to(actual2))
>>>>>>> upstream/main
else:
self.assertRaisesRegex(
RuntimeError,
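Resolution note: upstream's assertion is kept. Because actual2 is the return value of the in-place actual.pow_(exponent), the two spellings refer to the same tensor here; upstream's expected.to(actual2) simply compares each result against a reference cast to that result's own dtype, which stays correct even if the two names ever stop aliasing. A minimal sketch of the aliasing (hypothetical values):

    import torch
    actual = torch.tensor([2.0])
    actual2 = actual.pow_(3)   # in-place op returns self
    assert actual2 is actual   # so .to(actual) and .to(actual2) agree here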
19 changes: 0 additions & 19 deletions test/test_nn.py
@@ -5199,24 +5199,6 @@ def test_batchnorm_nhwc_cuda(self):
name_fn=lambda f, b, m, t: f"{f}_vs_{b}{'_mixed' if m else ''}_{dtype_name(t)}"
)
def test_batchnorm(self, dims, mode, memory_format, ref_backend, mixed, dtype):
<<<<<<< HEAD
if self._testMethodName == "test_batchnorm_3D_train_NCHW_vs_native_mixed_float16":
self.skipTest("3D float16 NCHW train failed on CUDA and ROCm due to Native batchnorm accuracy issue SWDEV-541024")
if torch.version.hip:
if self._testMethodName in ("test_batchnorm_2D_train_NHWC_vs_NCHW_mixed_bfloat16",
"test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16",
"test_batchnorm_3D_train_NHWC_vs_NCHW_mixed_bfloat16",
"test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16"
) and _get_torch_rocm_version() < (6, 4):
# NCHW bfloat16 path uses native kernels for rocm<=6.3
# train failed on rocm<=6.3 due to native tolerance issue SWDEV-507600
self.skipTest("bfloat16 NHWC train failed on ROCm <= 6.3")

if self._testMethodName in ("test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16",
"test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16"
) and _get_torch_rocm_version() >= (6, 4):
self.skipTest("bfloat16 NCHW train failed due to native tolerance issue SWDEV-507600")
=======
if torch.version.cuda:
if self._testMethodName in ("test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16",
"test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16",
@@ -5244,7 +5226,6 @@ def test_batchnorm(self, dims, mode, memory_format, ref_backend, mixed, dtype):

if self._testMethodName == "test_batchnorm_3D_train_NCHW_vs_native_mixed_float16":
self.skipTest("3D float16 NCHW train failed on ROCm")
>>>>>>> upstream/main

if dims == 3 and memory_format in ("NHWC", "NCHW"):
memory_format = memory_format + "3D"
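Resolution note: HEAD's ROCm-version-gated skips (the SWDEV-541024 and SWDEV-507600 tickets) are dropped in favor of upstream's torch.version.cuda-gated skip list, with the float16 case surviving in upstream's form:

    # upstream side kept (visible fragment)
    if self._testMethodName == "test_batchnorm_3D_train_NCHW_vs_native_mixed_float16":
        self.skipTest("3D float16 NCHW train failed on ROCm")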
14 changes: 1 addition & 13 deletions torch/_inductor/runtime/triton_heuristics.py
@@ -2924,7 +2924,7 @@ def _persistent_reduction_configs(
for xblock in (1, 8, 32, 128)
if xblock == 1 or (xblock <= xnumel and (max_autotune_enabled or rnumel * xblock <= 4096))
]

if "y" not in size_hints:
configs = [
triton_config_reduction(
@@ -2958,17 +2958,6 @@ def _persistent_reduction_configs(
# defer to more autotuning, initially
if "y" in size_hints:
pass
<<<<<<< HEAD

if not max_autotune_enabled: # Don't filter if tuning enabled
Review thread:

Collaborator: @jataylo to double-check this conflict resolution in case not already consulted.

Collaborator (author, @pragupta, Sep 30, 2025): Spoke to @naromero77amd; he mentioned that these changes went into rocm7.1_internal_testing, but the upstream PR is still open, so we want to keep the rocm7.1_internal_testing changes in place. He pointed me to his upstream PR here: pytorch#163908.
Tried to keep the local changes, but some of them were not trivial, since Nick's upstream PR is based on a newer upstream. @naromero77amd / @jataylo, can you please confirm that the latest commit I pushed corrects the merge of this file?

if reduction_hint == ReductionHint.INNER and rnumel >= 256:
configs = configs[:1]
elif reduction_hint == ReductionHint.OUTER:
configs = configs[-1:]

if reduction_hint == ReductionHint.OUTER_TINY:
tiny_configs = [
=======
# TODO(jansel): we should be able to improve these heuristics
elif reduction_hint == ReductionHint.INNER:
if rnumel > 1024:
@@ -2995,7 +2984,6 @@ def _persistent_reduction_configs(
configs = configs[-1:]
elif reduction_hint == ReductionHint.OUTER_TINY:
configs = [
>>>>>>> upstream/main
triton_config_reduction(
size_hints,
2 * (256 // rnumel) if rnumel <= 256 else 1,
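Resolution note: as rendered in this commit, the upstream heuristics win: HEAD's "don't filter when max-autotune is enabled" block is deleted and upstream's ReductionHint-based narrowing is kept. Per the review thread above, the intent was to preserve the rocm7.1_internal_testing behavior until pytorch#163908 lands, and the later commit "Fix more conflicts with triton_heuristics.py" reworks this merge. For reference, the HEAD-side filtering that was dropped here, sketched from the visible lines:

    # HEAD side (dropped in this commit): narrow the autotune candidates
    # only when max-autotune is off
    if not max_autotune_enabled:
        if reduction_hint == ReductionHint.INNER and rnumel >= 256:
            configs = configs[:1]    # keep only the first config
        elif reduction_hint == ReductionHint.OUTER:
            configs = configs[-1:]   # keep only the last config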
6 changes: 1 addition & 5 deletions torch/testing/_internal/common_utils.py
@@ -102,8 +102,8 @@
has_pytest = False


<<<<<<< HEAD
MI300_ARCH = ("gfx940", "gfx941", "gfx942")
MI200_ARCH = ("gfx90a")
NAVI_ARCH = ("gfx1030", "gfx1100", "gfx1101", "gfx1200", "gfx1201")
NAVI3_ARCH = ("gfx1100", "gfx1101")
NAVI4_ARCH = ("gfx1200", "gfx1201")
@@ -115,10 +115,6 @@ def is_arch(arch_list):
if gfx_arch in arch_list:
return True
return False
Review thread:

Collaborator: @pragupta We should track the upstreaming of this patch in one of our stories. cc @iupaikov-amd

=======
MI300_ARCH = ("gfx942",)
MI200_ARCH = ("gfx90a")
>>>>>>> upstream/main

def freeze_rng_state(*args, **kwargs):
return torch.testing._utils.freeze_rng_state(*args, **kwargs)
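Resolution note: HEAD wins here, keeping the wider MI300_ARCH tuple ("gfx940", "gfx941", "gfx942"), the NAVI tuples, and the is_arch helper, while upstream's narrower MI300_ARCH = ("gfx942",) is dropped; the review thread above flags this for upstreaming. One pre-existing wart survives on both sides: MI200_ARCH = ("gfx90a") is a plain string rather than a one-element tuple (missing trailing comma), so is_arch's `in` test falls back to substring matching for it, which happens to work for this single-arch case. A hypothetical usage sketch of the kept helper:

    # hypothetical call site: gate a tolerance bump on MI300-class hardware
    if torch.version.hip and is_arch(MI300_ARCH):
        rtol = 1e-3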