Merged

Changes from 250 commits (of 1503 total)

Commits
34aa782
Revert "Make distributed modules importable even when backend not bui…
pytorchmergebot Sep 4, 2025
e532c9d
Relax tolerance for test_quick_baddbmm_cpu_complex64 (#152424)
Flamefire Sep 4, 2025
b7dad7d
Revert "Always build USE_DISTRIBUTED. (#160449)"
pytorchmergebot Sep 4, 2025
601ae8e
[CUDAGraph] add config to error on skipping cudagraph (#161862)
BoyuanFeng Sep 4, 2025
6b8b3ac
Revert "[ROCm] Use MI325 (gfx942) runners for binary smoke testing (#…
pytorchmergebot Sep 4, 2025
3a20a20
Fix largeTensorTest malfunction on XPU (#161988)
guangyey Sep 3, 2025
cc5bdd1
Keep default `CMAKE_PREFIX_PATH` in test_aot_inductor_package (#161907)
Flamefire Sep 4, 2025
1ebd70d
Fix usage of forwarding references (#161094)
lakshayg Sep 4, 2025
248355f
Don't require FakeStore to be passed into fake backend (#162164)
ezyang Sep 4, 2025
81aeefa
Add torch.compile support for triton.constexpr_function (#162106)
oulgen Sep 3, 2025
019fed3
[ROCm] [CK] Composable Kernel integration for inductor backend (#158747)
iupaikov-amd Sep 4, 2025
43b7c86
Add dependency-groups.dev to pyproject.toml (#161216)
lakshayg Sep 4, 2025
ba7f546
Update torch-xpu-ops commit pin (#162062)
CuiYifeng Sep 4, 2025
f36f285
[dynamo] change error_on_graph_break/fullgraph semantics (#161747)
williamwen42 Sep 3, 2025
0c0e056
[CUDA] Reuse blocks with record_stream during CUDA Graph capture in t…
eee4017 Sep 4, 2025
869cbcc
[SymmMem] Add a helper API to distinguish intra- and inter- node (#16…
kwen2501 Sep 2, 2025
8bb213b
[SymmMem] Increase signal pad size for NVL72 (#162026)
kwen2501 Sep 2, 2025
8a736fa
create torch._grouped_mm fallback path with for loops / bmm (#161407)
vkuzo Sep 3, 2025
61fb632
move `_grouped_mm` fallback to composite explicit autograd (#161717)
vkuzo Sep 3, 2025
9eadb37
enable float32 and float16 in `torch._grouped_mm` fallback (#162059)
vkuzo Sep 4, 2025
3302859
[dynamo] Make the MRO walk more narrow (#162105)
anijain2305 Sep 3, 2025
d1a15ab
export: add explicit decomposition for aten.expand_copy and unit test…
albertw7711 Sep 4, 2025
6f7608d
[cuDNN][SDPA] Enable cuDNN SDPA by default for SM 9.0, SM 10.0 (#162073)
eqy Sep 4, 2025
9480cdc
Modified the docs to add example for torch.is_floating_point and torc…
mansiag05 Sep 4, 2025
6b1900c
[dynamo][hops] Remove const outputs from the speculated subgraph (#16…
anijain2305 Aug 28, 2025
1f51056
[BE]: Update cpp-httplib submodule to 0.26.0 (#162181)
Skylion007 Sep 4, 2025
3dde5d7
[nativert] triton runtime implementation (#161798)
dolpm Sep 4, 2025
c371032
Always build USE_DISTRIBUTED. (#160449)
ezyang Sep 4, 2025
9e5247f
Revert "[MPS] enable cat op for sparse (#162007)"
pytorchmergebot Sep 4, 2025
afa6e56
Revert "[BE] Cleanup stale comments/copy from `gemm` (#162001)"
pytorchmergebot Sep 4, 2025
c3d54de
Revert "[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)"
pytorchmergebot Sep 4, 2025
dbec087
Fix Arm64 OSS pytorch build with FBGEMM (#161527)
mcfi Sep 4, 2025
95ee0bf
Revert "[nativert] triton runtime implementation (#161798)"
pytorchmergebot Sep 4, 2025
ef3be67
Make distributed modules importable even when backend not built (#159…
ezyang Sep 4, 2025
a3d72b0
Apply Triton tensor descriptor for flex-decoding for performance (#16…
EikanWang Sep 4, 2025
48bedd7
Revert "Fix usage of forwarding references (#161094)"
pytorchmergebot Sep 4, 2025
d5b3841
Revert "[SymmMem] Add root argument to broadcast op (#161090)"
pytorchmergebot Sep 4, 2025
b9ba612
[ROCm] Enabling several UTs (#161715)
pragupta Sep 4, 2025
9bdcee0
[SymmMem] Add root argument to broadcast op (#161090)
kwen2501 Sep 3, 2025
89d41d3
[SymmMem] Feed tensor.data_ptr instead of handle.buffer_ptr into kern…
kwen2501 Sep 4, 2025
0d71a9d
fix incorrect interaction between DDPOptimizer and donated buffers (#…
bdhirsh Aug 15, 2025
b04e922
Fix memory leak in AOTI when calling `aoti_torch_as_strided` (#162118)
yushangdi Sep 4, 2025
1ec2c15
Revert "Fix Arm64 OSS pytorch build with FBGEMM (#161527)"
pytorchmergebot Sep 4, 2025
0d84ff3
[PGO] log add_extra_remote PGO to tlparse (#161751)
pianpwk Sep 4, 2025
09be189
[export] Fix torch.export.load with storage offset (#162172)
yiming0416 Sep 4, 2025
3a20781
Forward fix for user defined triton kernel grid calc (#162162)
nandesuka Sep 4, 2025
9499c87
[Inductor][Intel GPU] Register triton template heuristic for addmm tm…
etaf Sep 4, 2025
c7e4107
[B200][MXFP8] Fix regex in `test_blockwise_mxfp8_nvfp4_error_messages…
eqy Sep 4, 2025
d2d4c8e
[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)
malfet Sep 2, 2025
5c67426
[dynamo] Add support for const prop on .item (#162204)
angelayi Sep 5, 2025
2928086
Add new parameter for gen_pyi.py to make it more configureable. (#161…
0xjeffro Sep 5, 2025
73eb451
[B200][NVFP4] Fix argument passing in `test_blockwise_mxfp8_nvfp4_mxf…
eqy Sep 5, 2025
be5b03d
Allow for using a dedicated binary for the torch subproc pool. (#162093)
c00w Sep 5, 2025
b67c410
[BE] [Inductor] Add Kernel name to all coor-desc tuning (#161409)
njriasan Sep 5, 2025
3bbc2e3
[vllm hash update] update the pinned vllm hash (#162226)
pytorchupdatebot Sep 5, 2025
494878a
[audio hash update] update the pinned audio hash (#162114)
pytorchupdatebot Sep 5, 2025
5da573c
[PGO] handle PGO profile merges (#162097)
pianpwk Sep 5, 2025
5c473e9
[1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#1…
zxd1997066 Sep 5, 2025
bffc7dd
[CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds (#161916)
atalman Sep 5, 2025
a714437
[ez][inductor] add a few outer dimension reduction cases for LOAF (#1…
shunting314 Sep 3, 2025
2dd529d
A basic CLAUDE.md based on bad things I see claude code doing (#162163)
ezyang Sep 4, 2025
06da7c0
[DCP][Quantization] Fix for FP8 multiplication during dequantization …
saumishr Sep 5, 2025
f3cebec
Revert "Rename propagate_tensor_meta to make private again (#161744)"
pytorchmergebot Sep 5, 2025
b2c7b9a
[Intel GPU][FlexAttention] Enable TMA path on Intel GPU (#162138)
hoshibara Sep 5, 2025
c2a3024
[cuBLASLt][FP8] `cuBLASLt` appears to support float8 rowwise-scaling …
eqy Sep 5, 2025
9837461
[Intel GPU] Update Intel triton commit pin to Triton 3.5.x (#161777)
etaf Sep 5, 2025
261a84a
[CD][BE] Remove unnecessary checks for XCode version (#162263)
malfet Sep 5, 2025
d711f27
Revert "[ROCm] [CK] Composable Kernel integration for inductor backen…
pytorchmergebot Sep 5, 2025
b18bb67
Add const to stable amax (#162082)
mikaylagawarecki Sep 3, 2025
2ef665a
[inductor][contigous mm] mild refactor (#162075)
coconutruben Sep 5, 2025
9602590
[inductor] move scaled_mm input nodes logic (#161340)
coconutruben Sep 5, 2025
4902c76
[inductor][ez] add template/externchoice uid (#161341)
coconutruben Sep 5, 2025
af590cb
[inductor][aten] treat like a template in GEMMs (#161342)
coconutruben Sep 5, 2025
a301dc3
[inductor][ez] pass template rather than template.uid (#161343)
coconutruben Sep 5, 2025
031d79c
[inductor] move max-autotune logic inside V.choices.get_mm_configs (#…
coconutruben Sep 5, 2025
d63ad53
[inductor][ez] return choicecallers directly (#161345)
coconutruben Sep 5, 2025
e02e9ed
[inductor] V.choice.get_mm_configs takes a stack of templates (#161346)
coconutruben Sep 5, 2025
9a8d454
[inductor] add kernel template choice (ktc) (#161347)
coconutruben Sep 5, 2025
c321111
[inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers…
coconutruben Sep 5, 2025
88d94d1
Add torch.Tensor._make_dtensor to accelerate DTensor.__new__ further …
swolchok Sep 4, 2025
70f865a
Revert "Make distributed modules importable even when backend not bui…
pytorchmergebot Sep 5, 2025
adae7f6
Revert "Always build USE_DISTRIBUTED. (#160449)"
pytorchmergebot Sep 5, 2025
3771380
[ONNX] Hide draft export under a flag (#162225)
justinchuby Sep 5, 2025
a3c7f77
[EZ][CD] Update MacOS deployment platform to 11.0 (#162264)
malfet Sep 5, 2025
6087ef4
[BE] Cleanup stale comments/copy from `gemm` (#162001)
malfet Sep 2, 2025
de893e9
Always build USE_DISTRIBUTED. (#160449)
ezyang Sep 4, 2025
01edcd4
Make distributed modules importable even when backend not built (#159…
ezyang Sep 4, 2025
2fa0520
[BE][pytree] cleanup parameterized pytree tests (#160842)
XuehaiPan Sep 5, 2025
92a4302
[cutlass backend] Add FP8 tests for multiple linears (#160782)
henrylhtsang Sep 4, 2025
771f369
[Inductor] Improve RoPE (#161420)
BoyuanFeng Sep 5, 2025
c10195e
[C10d][Gloo] Enable complex datatype support in ProcessGroupGloo (#15…
shunzhiwen Sep 5, 2025
a00cdc1
[CD][BE] Get rid of SETUPTOOLS and PYYAML extra pins (#162266)
malfet Sep 5, 2025
70d36e0
Making batching rule for F.embedding DTensor-aware (#162117)
zou3519 Sep 4, 2025
79fcd52
symbolic cpp channels_last_contiguous (#160402)
laithsakka Sep 5, 2025
01ab325
[DCP][Quantization] Fix the issue when scale vector is in a different…
saumishr Sep 5, 2025
e0a62b2
[aot-precompile] default-filter global guards (#162090)
dolpm Sep 5, 2025
8d50355
[CD][EZ] Update libtorch python version to 3.10 (#162297)
malfet Sep 5, 2025
9c03d6b
[CD][BE] Delete Python-3.9 case (#162265)
malfet Sep 5, 2025
4d4abec
allow user to pass in custom partitioner function (#157580)
xuanzhang816 Sep 5, 2025
486b20b
Add return-max-scores to flex-attention (#161667)
drisspg Sep 5, 2025
081cab0
Resize to 0 if not going to be used (#161730)
drisspg Sep 5, 2025
1463714
[dynamo] Graph break on user-defined class in compiled region (#16…
rtimpe Sep 4, 2025
4f72d93
re-land "triton runtime implementation" (#162217)
dolpm Sep 6, 2025
0f45aaf
Disable autocast when running joint graph passes (#162304)
yf225 Sep 6, 2025
7f4ff79
remove deprecated vllm test (#162306)
yangw-dev Sep 6, 2025
291cd11
[inductor] estimate peak memory in codegen only when buffer reuse (#1…
ruisizhang123 Sep 6, 2025
145a3a7
[CUDA 13][cuDNN] Bump CUDA 13 to cuDNN 9.13.0 (#162268)
eqy Sep 6, 2025
c3ceca2
codebase structure documentation to include torchgen (#162261)
Raman-RH Sep 6, 2025
20629b1
Add contiguous subgraph transformation threshold (#162192)
exclamaforte Sep 6, 2025
b2b4add
Docs on export joint with descriptors (#159006)
ezyang Aug 12, 2025
c0983e6
[Graph Partition] interface for custom cg wrapper (#162207)
BoyuanFeng Sep 6, 2025
a3e5466
Revert "Resize to 0 if not going to be used (#161730)"
pytorchmergebot Sep 6, 2025
da4db4b
Fix `DeviceMesh._flatten` docstring example (#162277)
mariosasko Sep 6, 2025
20b47ac
[fx] fix qualified name for methods of torch.Tensor (#162224)
isuruf Sep 4, 2025
aac1a50
Add api info for torch._C._nn.pyi (#162148)
orangeH25 Sep 6, 2025
bc50597
torch.zeros bound checks for symint (#161976)
morrison-turnansky Sep 6, 2025
c98ddac
Fixed comment to match logic in distributed_c10d.py (#162158)
Codeboi007 Sep 6, 2025
28f4ab0
Add -Wno-ctad-maybe-unsupported compiler flag (#162223)
0xjeffro Sep 6, 2025
0ff8eab
Revert "[dynamo] Graph break on on user-defined class in compiled reg…
pytorchmergebot Sep 6, 2025
9aedb3c
[AOTI-FX] Support registering custom FX backends (#162317)
blaine-rister Sep 6, 2025
5985e28
[CUDA 13][cuDNN][Windows] Roll back cuDNN upgrade from 9.13 to 9.12 o…
eqy Sep 6, 2025
b6d0a9e
MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule b…
danielvegamyhre Sep 6, 2025
ae0edc1
[3/N] Enable 6 fsdp test on Intel GPU (#161601)
daisyden Sep 6, 2025
047603d
New export implementation with flat inp/out (#162167)
tugsbayasgalan Sep 4, 2025
541aa23
[inductor] fix TemplateBuffer.extract_read_writes (#162221)
shunting314 Sep 6, 2025
1a588ac
[inductor] rename deps during refreshing (#162303)
shunting314 Sep 6, 2025
5927a70
NLLLoss: validate target is 0D when input is 1D (#161412)
mansiag05 Sep 6, 2025
48e3be3
[while_loop][autograd] add hop while_loop_stack_output (#160467)
ydwu4 Sep 5, 2025
2b8a839
[while_loop][autograd] support autograd_key of while_loop (#160483)
ydwu4 Sep 5, 2025
5211f1f
[export] Move example inputs in move_to_device_pass (#162301)
yiming0416 Sep 6, 2025
e3068cd
[dynamo] Use relaxed CLOSURE_MATCH guard then ID_MATCH (#162247)
anijain2305 Sep 5, 2025
b919560
[nativert] AOTI lowering and packaging as NativeRT delegate (#162285)
yiming0416 Sep 7, 2025
2a45837
[inductor] fuse for scalar shared data (#162311)
shunting314 Sep 6, 2025
fea2077
[vllm hash update] update the pinned vllm hash (#162314)
pytorchupdatebot Sep 7, 2025
eac3d6f
Revert "[inductor] fuse for scalar shared data (#162311)"
pytorchmergebot Sep 7, 2025
104f268
Revert "Add return-max-scores to flex-attention (#161667)"
pytorchmergebot Sep 7, 2025
93fb23d
Build vLLM nightly wheels (#162000)
huydhn Sep 7, 2025
ada43ed
Revert "[inductor] pdl inductor option (disabled by default) (#160928)"
pytorchmergebot Sep 7, 2025
7a83cf4
Revert " [while_loop][autograd] support autograd_key of while_loop (#…
pytorchmergebot Sep 7, 2025
9ad5e8e
Improve typing of ONNX decorators with ParamSpec (#162332)
Vinayak-Pawar Sep 7, 2025
4348db0
Revert "[inductor][ez] V.choices.get_mm_configs returns list of Choic…
pytorchmergebot Sep 7, 2025
093ab5f
Revert "[inductor] add kernel template choice (ktc) (#161347)"
pytorchmergebot Sep 7, 2025
df59c21
Revert "[BE] Cleanup stale comments/copy from `gemm` (#162001)"
pytorchmergebot Sep 7, 2025
e246a85
Revert "[1/N] Port 5 _composable/fsdp distributed test cases to Intel…
pytorchmergebot Sep 7, 2025
8235c4f
Revert "[ROCm] Enabling several UTs (#161715)"
pytorchmergebot Sep 7, 2025
ff2de5d
Revert "[2/N]Port several test files under test/distributed to Intel …
pytorchmergebot Sep 7, 2025
ec2e368
[while_loop][autograd] support autograd_key of while_loop (#160483)
ydwu4 Sep 7, 2025
eb9073a
[easy] [precompile] Convert CompileArtifacts to callable (#162169)
jamesjwu Sep 7, 2025
5babb4d
Add BundledAOTAutogradSerializableCallable (#162170)
jamesjwu Sep 7, 2025
103f725
[associative_scan] Autograd separated (#139939)
bohnstingl Sep 8, 2025
c9ac8c2
[audio hash update] update the pinned audio hash (#162315)
pytorchupdatebot Sep 8, 2025
29e09a6
Revert "Make distributed modules importable even when backend not bui…
pytorchmergebot Sep 8, 2025
1e0656f
Revert "Always build USE_DISTRIBUTED. (#160449)"
pytorchmergebot Sep 8, 2025
fb0afa8
[inductor][triton] more JITCallable._hash_lock support (#162244)
davidberard98 Sep 5, 2025
31d5c67
[inductor][triton] support static cuda launcher after triton # 7866 (…
davidberard98 Sep 5, 2025
5b90e85
[AsyncTP] Fixes AsyncMM (#162040)
fegin Sep 8, 2025
32911ff
[xla hash update] update the pinned xla hash (#162372)
pytorchupdatebot Sep 8, 2025
e101411
Update slow tests (#161395)
pytorchupdatebot Sep 8, 2025
3f59933
[upstream triton] update triton pin to triton 3.5 (#162278)
davidberard98 Sep 5, 2025
25c170b
[inductor] Runtime estimations: use nccl estimator; mm only benchmark…
IvanKobzarev Sep 8, 2025
53297f6
Revert "[audio hash update] update the pinned audio hash (#162315)"
pytorchmergebot Sep 8, 2025
a92773e
Revert "Use vectorized stores for all dtypes in cat (#161649)"
pytorchmergebot Sep 8, 2025
f044fa2
[AsyncTP] Use assertEqual instead of allClose for bf16 tests (#162041)
fegin Sep 8, 2025
8e076d8
Don't call check_has_torch_dispatch in THPVariable_NewWithVar if we a…
swolchok Sep 6, 2025
49c446c
Add C++ function for torch.distributed.tensor._op_schema.is_view_op (…
swolchok Sep 6, 2025
5793dd7
[Intel GPU] Integrate OneDNN SDPA training forward and backward (#161…
LuFinch Sep 8, 2025
ebd29a1
[inductor] fuse for scalar shared data (#162311)
shunting314 Sep 8, 2025
72e6717
Avoid crash with release_available_cached_blocks (#162269)
morrison-turnansky Sep 8, 2025
de5dc1f
[cuDNN][SDPA][Nested Tensor] add forward/backward caching support for…
eqy Sep 8, 2025
bc4176c
CD Windows CUDA 13.0 build - fix packaging of cuda dlls (#162383)
atalman Sep 8, 2025
314d47a
[audio hash update] update the pinned audio hash (#162315)
pytorchupdatebot Sep 8, 2025
fbcabb4
Handle f([]) vs. f() in fake tensor caching (#162284)
angelayi Sep 8, 2025
d80297a
Always build USE_DISTRIBUTED. (#160449)
ezyang Sep 8, 2025
a0d0266
Make distributed modules importable even when backend not built (#159…
ezyang Sep 8, 2025
4e50651
[DTensor] fix F.one_hot (#162307)
zou3519 Sep 8, 2025
9c991b6
[CD] [aarch64] Add CUDA 12.6 and 12.8 to build matrix, remove 12.9 bu…
tinglvv Sep 8, 2025
8ec01f3
[bucketing] custom_ops mode to hide inductor copies overhead (#161499)
IvanKobzarev Sep 8, 2025
ec2c137
[BE]: Update cudnn frontend submodule to 1.14.1 (#162347)
Skylion007 Sep 8, 2025
8f11465
Add std::any_of to ConvParams struct (#162334)
benjaminglass1 Sep 6, 2025
26a1b9c
[dynamo] fix resume_execution.py KeyError in Python 3.11+ (#162318)
williamwen42 Sep 8, 2025
015423b
Add fp16-overflow regression test (#162401)
malfet Sep 8, 2025
5d819f3
Revert "[associative_scan] Autograd separated (#139939)"
pytorchmergebot Sep 8, 2025
dd44faa
Revert "Modify ROCm MI2xx-based workflows to run on cron schedule (#1…
pytorchmergebot Sep 8, 2025
fecd968
Graph split event tracker (#159795)
haowu14 Sep 8, 2025
85fe94e
make should_swap more dde friendly (#162099)
laithsakka Sep 8, 2025
2c538c9
rewrite __maybe_broadcast should_expand check for unbacked (#162109)
laithsakka Sep 8, 2025
711c8c8
shape guards (#161178)
avikchaudhuri Sep 8, 2025
ac9ccd0
Add return-max-scores to flex-attention (#161667)
drisspg Sep 8, 2025
5fd6b6a
[refactor] add helper sizevars function, is_size_one, for size==1 che…
ColinPeppler Sep 8, 2025
189a054
Remove guard_size_oblivious from default contiguity python check, and…
laithsakka Sep 8, 2025
07f0730
[associative_scan] Autograd separated (#139939)
bohnstingl Sep 8, 2025
c0fc86b
Fix aarch64 wheel pack (#159481)
atalman Sep 8, 2025
8485aac
[precompile] Fix inlined source tracking with generators. (#162389)
zhxchen17 Sep 9, 2025
897c4e7
Move to small wheel approach for CUDA SBSA wheel (#160720)
tinglvv Sep 9, 2025
ed77e23
Revert "[dynamo] Constant fold torch.autograd._profiler_enabled (#158…
pytorchmergebot Sep 9, 2025
6eb14ac
[Inductor] Fix cross-device scalar lowering - cpu scalar with cuda te…
karthickai Sep 8, 2025
a951f43
Avoid redundant PyTuple_GetSize call in _maybe_handle_torch_function …
swolchok Sep 8, 2025
eab2afe
fastpath type Tensor in THPVariable_NewWithVar (#161634)
swolchok Sep 8, 2025
12db2a7
Call checkLong in is_int_or_symint, completing TODO (#161692)
swolchok Sep 8, 2025
a8a187b
Overload _get_operation_for_overload_or_packet & friends to accept Ar…
swolchok Sep 8, 2025
e025c0f
Dynamo: set_eval_frame microoptimization (#162220)
swolchok Sep 8, 2025
583bbf7
[MPS] Add `native_dropout` and `native_dropout_backward` (#162108)
kurtamohler Sep 8, 2025
a965f09
[export] Update PT2 archive docs (#162308)
yiming0416 Sep 9, 2025
d8b6622
testing infra and some fixes (#162183)
tugsbayasgalan Sep 8, 2025
7b8a645
[inductor] fix 3d tiled online softmax (#162341)
shunting314 Sep 8, 2025
1641606
Revert "Add BundledAOTAutogradSerializableCallable (#162170)"
pytorchmergebot Sep 9, 2025
4c45090
[DTensor] Check if tracing for sharding propagation to handle unhasha…
azahed98 Sep 9, 2025
98ecc0f
[SymmMem] Add team pool to hold duplicated teams for the same rank gr…
kwen2501 Sep 8, 2025
065c446
[SymmMem] Use global pe for put and get (#162394)
kwen2501 Sep 8, 2025
847d7f2
[CUDA-13] Implement workaround for cudaErrorNotSupported (#162412)
malfet Sep 9, 2025
f216d64
[SymmMem] Better tuning of A2AV based on accurate node boundary (#162…
kwen2501 Sep 9, 2025
607327b
[vllm hash update] update the pinned vllm hash (#162356)
pytorchupdatebot Sep 9, 2025
7ad40de
[audio hash update] update the pinned audio hash (#162437)
pytorchupdatebot Sep 9, 2025
8494afb
Add missing fstream include to fix std::ofstream compilation error (#…
0xjeffro Sep 9, 2025
4590438
[fx] fix qualified name for methods of torch.Tensor (#162407)
isuruf Sep 8, 2025
60d0092
Revert "testing infra and some fixes (#162183)"
pytorchmergebot Sep 9, 2025
7feb8fc
[SymmMEM] Allow to import _SymmetricMemory when NVSHMEM is not availa…
fegin Sep 9, 2025
d85392a
Add BundledAOTAutogradSerializableCallable (#162170)
jamesjwu Sep 7, 2025
d49205f
Add more tests for vllm and clean out the old vllm test (#162292)
yangw-dev Sep 9, 2025
4840a1a
Run vLLM tests on all trunk commits before 2.9 branch cut (#161797)
huydhn Sep 9, 2025
002e594
fix torch.sparse.log_softmax on CPU (#161959)
jiayisunx Sep 3, 2025
dcc42e9
Fix missing moves in initJITBindings (#162428)
swolchok Sep 8, 2025
0d9c95c
Use same NVSHMEM version across CUDA builds (#162206)
kwen2501 Sep 9, 2025
4dd73e6
Revert "fix torch.sparse.log_softmax on CPU (#161959)"
pytorchmergebot Sep 9, 2025
e38e953
CUDA 13.0 Windows Nvidia Driver Update to 580.88 (#162425)
atalman Sep 9, 2025
5ccf3ca
Revert "Use same NVSHMEM version across CUDA builds (#162206)"
pytorchmergebot Sep 9, 2025
be3b8d2
[ROCm][CI] update fbgemm nightly benchmark hash (#162385)
jataylo Sep 9, 2025
3ea6868
[MPS] mps sparse mul op implementation (#162349)
Isalia20 Sep 9, 2025
c0142f5
[ROCm] Enabling several UTs (#161715)
pragupta Sep 9, 2025
1f0b01d
[ROCm] OffsetCalc Unroll Optimization (#161700)
amd-hhashemi Sep 9, 2025
f03d635
[ROCm][CI] skip test_max_autotune until resolved (#162496)
jeffdaily Sep 9, 2025
5eb35d2
[CUDA][float8][TF32] Disable tf32 for vs. emulated rowwise comparison…
eqy Sep 9, 2025
b1e99c8
[inductor] add kernel template choice (ktc) (#161347)
coconutruben Sep 9, 2025
d3c4cf8
[inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers…
coconutruben Sep 9, 2025
24a4dae
[inductor] V.choices.get_mm_configs override point (#161349)
coconutruben Sep 9, 2025
d91eecc
[inductor][template heuristics] don't take layout to generate choices…
coconutruben Sep 9, 2025
e1be887
[PP] Add spacing to visualizer (#160474)
H-Huang Sep 9, 2025
0ec723a
Update docs for quantile to be clearer for nearest (#162423)
janeyx99 Sep 9, 2025
4b2d297
python fastpath for DTensor detach(), confirm that aliasing DTensorSp…
bdhirsh Sep 9, 2025
82f1eb9
Revert "[MPS] mps sparse mul op implementation (#162349)"
pytorchmergebot Sep 9, 2025
af60398
Update the operator benchmarking, to benchmark using torch.compile (#…
jainapurva Sep 9, 2025
bdbe931
[build] Add LeakSanitizer option to CMake (#158686)
benjaminglass1 Sep 9, 2025
723c27e
[standalone_compile] binary format write should be atomic (#162432)
zou3519 Sep 9, 2025
8508651
Fix flaky AOTFxirTestCase (#162472)
huydhn Sep 9, 2025
86d34a4
NamedTuple: Allow side effects for dynamic attributes (#161645)
morrison-turnansky Sep 9, 2025
ab55758
Merge remote-tracking branch 'upstream/main' into rocm7.1_internal_te…
github-actions[bot] Sep 9, 2025
ff2ba4c
Fix merge conflicts
pragupta Sep 9, 2025
dba8539
Revert "[ROCm] Enable USE_FBGEMM_GENAI (#160676)"
pragupta Sep 10, 2025
9a66f82
Update related_commits
pragupta Sep 10, 2025
9e7df76
Keep triton_version at 3.4.0 until we switch triton branch/commit pin
pragupta Sep 17, 2025
Files changed
15 changes: 15 additions & 0 deletions .bc-linter.yml
@@ -0,0 +1,15 @@
version: 1
paths:
include:
- "**/*.py"
exclude:
- ".*"
- ".*/**"
- "**/.*/**"
- "**/.*"
- "**/_*/**"
- "**/_*.py"
- "**/test/**"
- "**/benchmarks/**"
- "**/test_*.py"
- "**/*_test.py"
29 changes: 26 additions & 3 deletions .ci/aarch64_linux/aarch64_ci_build.sh
@@ -3,8 +3,18 @@ set -eux -o pipefail

GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}

if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
# Set CUDA architecture lists to match x86 build_cuda.sh
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;8.0;9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0;10.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0+PTX"
fi

# Compress the fatbin with -compress-mode=size for CUDA 13
if [[ "$DESIRED_CUDA" == *"13"* ]]; then
export TORCH_NVCC_FLAGS="-compress-mode=size"
fi

SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
@@ -18,14 +28,27 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel==6.2.0
pip install auditwheel==6.2.0 wheel
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
export USE_SYSTEM_NCCL=1

# Check if we should use NVIDIA libs from PyPI (similar to x86 build_cuda.sh logic)
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling CUDA libraries with wheel for aarch64."
else
echo "Using nvidia libs from pypi for aarch64."
# Fix platform constraints in PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64
# Replace 'platform_machine == "x86_64"' with 'platform_machine == "aarch64"'
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS//platform_machine == \'x86_64\'/platform_machine == \'aarch64\'}"
echo "Updated PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64: $PYTORCH_EXTRA_INSTALL_REQUIREMENTS"
export USE_NVIDIA_PYPI_LIBS=1
fi

#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi
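The `export PYTORCH_EXTRA_INSTALL_REQUIREMENTS=...` line above uses bash's replace-all parameter expansion `${VAR//old/new}` to retarget the PEP 508 platform markers. A minimal Python sketch of the same transformation, with hypothetical requirement strings standing in for the variable's real contents:

```python
# Equivalent of bash's ${VAR//old/new} replace-all expansion, applied to
# hypothetical PEP 508 requirement strings carrying x86_64 markers.
reqs = (
    "nvidia-cudnn-cu13; platform_machine == 'x86_64' | "
    "nvidia-nccl-cu13; platform_machine == 'x86_64'"
)
fixed = reqs.replace(
    "platform_machine == 'x86_64'",
    "platform_machine == 'aarch64'",
)
print(fixed)
# nvidia-cudnn-cu13; platform_machine == 'aarch64' | nvidia-nccl-cu13; platform_machine == 'aarch64'
```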
228 changes: 176 additions & 52 deletions .ci/aarch64_linux/aarch64_wheel_ci_build.py
@@ -69,76 +69,190 @@ def replace_tag(filename) -> None:
f.writelines(lines)


def patch_library_rpath(
folder: str,
lib_name: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Apply patchelf to set RPATH for a library in torch/lib"""
lib_path = f"{folder}/tmp/torch/lib/{lib_name}"

if use_nvidia_pypi_libs:
# For PyPI NVIDIA libraries, construct CUDA RPATH
cuda_rpaths = [
"$ORIGIN/../../nvidia/cudnn/lib",
"$ORIGIN/../../nvidia/nvshmem/lib",
"$ORIGIN/../../nvidia/nccl/lib",
"$ORIGIN/../../nvidia/cusparselt/lib",
]

if "130" in desired_cuda:
cuda_rpaths.append("$ORIGIN/../../nvidia/cu13/lib")
else:
cuda_rpaths.extend(
[
"$ORIGIN/../../nvidia/cublas/lib",
"$ORIGIN/../../nvidia/cuda_cupti/lib",
"$ORIGIN/../../nvidia/cuda_nvrtc/lib",
"$ORIGIN/../../nvidia/cuda_runtime/lib",
"$ORIGIN/../../nvidia/cufft/lib",
"$ORIGIN/../../nvidia/curand/lib",
"$ORIGIN/../../nvidia/cusolver/lib",
"$ORIGIN/../../nvidia/cusparse/lib",
"$ORIGIN/../../nvidia/nvtx/lib",
"$ORIGIN/../../nvidia/cufile/lib",
]
)

# Add $ORIGIN for local torch libs
rpath = ":".join(cuda_rpaths) + ":$ORIGIN"
else:
# For bundled libraries, just use $ORIGIN
rpath = "$ORIGIN"

if os.path.exists(lib_path):
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '{rpath}' --force-rpath {lib_name}"
)


def copy_and_patch_library(
src_path: str,
folder: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Copy a library to torch/lib and patch its RPATH"""
if os.path.exists(src_path):
lib_name = os.path.basename(src_path)
shutil.copy2(src_path, f"{folder}/tmp/torch/lib/{lib_name}")
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)


def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"""
Package the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
os.mkdir(f"{folder}/tmp")
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusparse.so.12",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]

if "129" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.9",
# Check if we should use PyPI NVIDIA libraries or bundle system libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"

if use_nvidia_pypi_libs:
print("Using nvidia libs from pypi - skipping CUDA library bundling")
# For PyPI approach, we don't bundle CUDA libraries - they come from PyPI packages
# We only need to bundle non-NVIDIA libraries
minimal_libs_to_copy = [
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]

# Copy minimal libraries to unzipped_folder/torch/lib
for lib_path in minimal_libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)

# Patch torch libraries used for searching libraries
torch_libs_to_patch = [
"libtorch.so",
"libtorch_cpu.so",
"libtorch_cuda.so",
"libtorch_cuda_linalg.so",
"libtorch_global_deps.so",
"libtorch_python.so",
"libtorch_nvshmem.so",
"libc10.so",
"libc10_cuda.so",
"libcaffe2_nvrtc.so",
"libshm.so",
]
for lib_name in torch_libs_to_patch:
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
else:
print("Bundling CUDA libraries with wheel")
# Original logic for bundling system CUDA libraries
# Common libraries for all CUDA versions
common_libs = [
# Non-NVIDIA system libraries
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
# Common CUDA libraries (same for all versions)
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvshmem_host.so.3",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
"/usr/local/cuda/lib64/libcusparse.so.12",
]

# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
shutil.copy2(lib_path, f"{folder}/tmp/torch/lib/{lib_name}")
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
# CUDA version-specific libraries
if "130" in desired_cuda:
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",
"/usr/local/cuda/lib64/libcublas.so.13",
"/usr/local/cuda/lib64/libcublasLt.so.13",
"/usr/local/cuda/lib64/libcudart.so.13",
"/usr/local/cuda/lib64/libcufft.so.12",
"/usr/local/cuda/lib64/libcusolver.so.12",
"/usr/local/cuda/lib64/libnvJitLink.so.13",
"/usr/local/cuda/lib64/libnvrtc.so.13",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.0",
]
elif "12" in desired_cuda:
# Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",
]

# Combine all libraries
libs_to_copy = common_libs + version_specific_libs

# Copy libraries to unzipped_folder/torch/lib
for lib_path in libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)

# Make sure the wheel is tagged with manylinux_2_28
for f in os.scandir(f"{folder}/tmp/"):
if f.is_dir() and f.name.endswith(".dist-info"):
replace_tag(f"{f.path}/WHEEL")
break

os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
f"{folder}/cuda_wheel/{wheelname}",
f"{folder}/{wheelname}",
copy_function=shutil.copy2,
)
os.system(f"rm -rf {folder}/tmp/ {folder}/cuda_wheel/")
os.system(f"wheel pack {folder}/tmp/ -d {folder}")
os.system(f"rm -rf {folder}/tmp/")


def complete_wheel(folder: str) -> str:
@@ -208,7 +322,17 @@ def parse_arguments():
build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
# MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
if enable_cuda:
build_vars = "MAX_JOBS=5 " + build_vars
build_vars += "MAX_JOBS=5 "

# Handle PyPI NVIDIA libraries vs bundled libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Configuring build for PyPI NVIDIA libraries")
# Configure for dynamic linking (matching x86 logic)
build_vars += "ATEN_STATIC_CUDA=0 USE_CUDA_STATIC_LINK=0 USE_CUPTI_SO=1 "
else:
print("Configuring build for bundled NVIDIA libraries")
# Keep existing static linking approach - already configured above

override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
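The new `patch_library_rpath` helper in this file delegates the rewrite to `patchelf`. A minimal sketch of that set-and-verify cycle, assuming `patchelf` is installed and using an illustrative library path (`$ORIGIN` keeps lookups relative to the library itself, which is what makes the wheel relocatable):

```python
# Sketch of the RPATH rewrite performed by patch_library_rpath
# (assumes patchelf is on PATH; lib path and rpath are illustrative).
import subprocess

lib = "tmp/torch/lib/libtorch_cuda.so"
rpath = "$ORIGIN/../../nvidia/cudnn/lib:$ORIGIN"

subprocess.run(
    ["patchelf", "--set-rpath", rpath, "--force-rpath", lib],
    check=True,
)
# Read the value back to confirm the rewrite took effect.
out = subprocess.run(
    ["patchelf", "--print-rpath", lib],
    check=True, capture_output=True, text=True,
).stdout.strip()
assert out == rpath
```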
16 changes: 4 additions & 12 deletions .ci/aarch64_linux/build_aarch64_wheel.py
@@ -438,9 +438,7 @@ def build_torchvision(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

@@ -495,9 +493,7 @@ def build_torchdata(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

@@ -553,9 +549,7 @@ def build_torchtext(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

@@ -613,9 +607,7 @@ def build_torchaudio(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

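The four `split('-')[0]` → `split('-', maxsplit=1)[0]` rewrites in this file are behavior-preserving: `maxsplit=1` stops scanning after the first `-`, which is all the `[0]` index needs. With a hypothetical release branch name:

```python
# Both forms strip the leading 'v' and keep the version before the
# first '-'; maxsplit=1 merely avoids splitting the remainder.
branch = "v2.9.0-rc1"   # hypothetical branch name
assert branch[1:].split("-")[0] == "2.9.0"
assert branch[1:].split("-", maxsplit=1)[0] == "2.9.0"
```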
4 changes: 2 additions & 2 deletions .ci/docker/README.md
@@ -120,8 +120,8 @@ If your new Docker image needs a library installed from a specific pinned commit
If you're introducing a new argument to the Docker build, make sure to add it in the Docker build step in `.ci/docker/build.sh`:
```bash
docker build \
....
--build-arg "NEW_ARG_1=${NEW_ARG_1}"
....
--build-arg "NEW_ARG_1=${NEW_ARG_1}"
```

3. **Update Dockerfile logic**: