forked from pytorch/pytorch
[AUTOGENERATED] develop_IFU_20251114 #2806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
pragupta wants to merge 511 commits into develop from develop_IFU_20251114
+41,679 −13,862
Conversation
pytorch#167413) Second attempt for pytorch#167138 with fixes for name conflicts in downstream packages. Should slightly simplify pytorch#166342 Pull Request resolved: pytorch#167413 Approved by: https://github.com/Skylion007
This PR moves some implemented types from typing_extensions to typing due to the recent update to Python 3.10. Pull Request resolved: pytorch#167185 Approved by: https://github.com/janeyx99
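A minimal illustration of the kind of import change this enables (assuming the affected names are ones that landed in the standard `typing` module in Python 3.10, such as `ParamSpec` and `TypeAlias`):

```python
# Before: needed the typing_extensions backport on Python < 3.10
# from typing_extensions import ParamSpec, TypeAlias

# After: with Python 3.10 as the minimum version, import from typing directly
from typing import ParamSpec, TypeAlias

Handler: TypeAlias = dict[str, int]
P = ParamSpec("P")
```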
I see failures like https://github.com/pytorch/pytorch/actions/runs/19189378182/job/54865171317?pr=167389; maybe this will fix it. Pull Request resolved: pytorch#167417 Approved by: https://github.com/yf225
- The logger name in test_fully_shard_logging.py was wrong, so the logs didn't happen.
- The `device` variable in test_fully_shard_logging is expected to be a string, so quote it.
- `unittest.skipIf` is used, so importing `unittest` instead of `unittest.mock` is required.

Pull Request resolved: pytorch#167312 Approved by: https://github.com/Skylion007, https://github.com/cyyever
This PR enables `UP035` rule of ruff. Pull Request resolved: pytorch#167307 Approved by: https://github.com/Lucaskabela
This PR applies new `Union` and `Optional` typing syntax to some files. Pull Request resolved: pytorch#167167 Approved by: https://github.com/XuehaiPan, https://github.com/mlazos
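For context, a small sketch of the syntax change being applied (PEP 604 unions, available once Python 3.10 is the minimum version):

```python
from typing import Optional, Union

# Old style
def scale_old(x: Union[int, float], factor: Optional[float] = None) -> float:
    return float(x) * (factor if factor is not None else 1.0)

# New style: X | Y replaces Union[X, Y], and X | None replaces Optional[X]
def scale_new(x: int | float, factor: float | None = None) -> float:
    return float(x) * (factor if factor is not None else 1.0)
```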
This PR fixes more context manager usage in Python code. Pull Request resolved: pytorch#167404 Approved by: https://github.com/mlazos
…orch#167405)

## Summary

Previously, fake/functionalized tensors that have a `null` storage_ptr could segfault when checking `.expired()` on the weak storage ref, so handle `nullptr` storages separately, without checking their weakrefs.

Diagnosis and PR created by codex: [Codex Task](https://chatgpt.com/codex/tasks/task_e_690ea8790054832f90eaffb37ee0d8c8)

Pull Request resolved: pytorch#167405 Approved by: https://github.com/Skylion007
Summary: This is a reland of pytorch#165036, which previously contained a minor bug in the logic that determined whether the kernel should be enabled. As a result, it was incorrectly activated on non-Blackwell GPUs.

Test Plan:

Inductor test (fbcode): `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1"`

Tritonbench (fbcode): `clear; CUDA_VISIBLE_DEVICES=7 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1" -- --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_cute_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`

Tritonbench (oss): `clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`

Unit tests (oss): `clear; python test/inductor/test_cutedsl_grouped_mm.py`

Differential Revision: D86537373

Pull Request resolved: pytorch#167340 Approved by: https://github.com/jananisriram
# Motivation

`torch.cuda.mem_get_info` and `torch.xpu.mem_get_info` are widely used in other popular repos, such as:

- https://github.com/sgl-project/sglang/blob/076313bd099ac1ee484ee77009eaae864eacf396/python/sglang/srt/utils.py#L378
- https://github.com/huggingface/accelerate/blob/7ecc2d7f394fc0686062a18d46128a8bd97c7dad/src/accelerate/utils/modeling.py#L822
- https://github.com/vllm-project/vllm/blob/7ba34b1241ada58f8212f350a8b17382cb412cf2/vllm/worker/worker.py#L150

This PR introduces a unified API `torch.accelerator.get_memory_info` to cover this scenario.

Pull Request resolved: pytorch#156812 Approved by: https://github.com/albanD
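A hedged usage sketch of the unified call, assuming `torch.accelerator.get_memory_info` mirrors the `(free_bytes, total_bytes)` return of the per-backend `mem_get_info` functions:

```python
import torch

# Device-agnostic: queries whichever accelerator backend is active
# (CUDA, XPU, ...) instead of branching on torch.cuda / torch.xpu.
if torch.accelerator.is_available():
    free_bytes, total_bytes = torch.accelerator.get_memory_info()
    print(f"free: {free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB")
```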
Getting some weird failures building CUDA 13; let's stick to what we know works. Pull Request resolved: pytorch#167428 Approved by: https://github.com/jansel
Fixes TestOperatorsXPU.test_data_write_errors_under_transform_xpu (intel/torch-xpu-ops#2237).

Tests on other devices throw the runtime error "_mutating directly with `.data` inside functorch transform is not allowed._", but XPU/HPU fails earlier on `_has_compatible_shallow_copy_type`. This check fails only when calling `tensor.data` inside a functorch call.

```cpp
bool _has_compatible_shallow_copy_type(const Tensor& self, const Tensor& from) {
  return self.unsafeGetTensorImpl()->has_compatible_shallow_copy_type(
      from.key_set());
}
```

### t.data

| Tensor | Device | Dispatch Keys |
|--------|--------|---------------|
| `self` | `xpu` | `XPU, ADInplaceOrView, AutogradXPU, AutocastXPU` |
| `from` | `cpu` | `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU` |

### t.data inside functorch transform

| Tensor | Device | Dispatch Keys |
|--------|--------|---------------|
| `self` | `xpu` | `ADInplaceOrView, AutogradOther, FuncTorchGradWrapper` |
| `from` | `cpu` | `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU, FuncTorchGradWrapper` |

### t.data inside functorch transform + XPU dispatch key

| Tensor | Device | Dispatch Keys |
|--------|--------|---------------|
| `self` | `xpu` | `XPU, ADInplaceOrView, AutogradXPU, AutocastXPU, FuncTorchGradWrapper` |
| `from` | `cpu` | `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU, FuncTorchGradWrapper` |

Pull Request resolved: pytorch#167095 Approved by: https://github.com/guangyey, https://github.com/albanD
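A rough, hedged sketch of the failing pattern (my reconstruction of what the test exercises, not the test's exact code):

```python
import torch
from torch.func import grad

def f(x):
    # Writing through .data inside a functorch transform is the pattern the
    # test exercises.
    x.data = torch.zeros_like(x)
    return x.sum()

try:
    grad(f)(torch.randn(3))  # use device="xpu" on XPU builds; CPU shown here
except RuntimeError as e:
    # Expected on CPU/CUDA: "mutating directly with `.data` inside functorch
    # transform is not allowed"; XPU/HPU used to fail earlier in
    # _has_compatible_shallow_copy_type before this fix.
    print(e)
```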
Fixes pytorch#165130 Pull Request resolved: pytorch#165423 Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/mlazos
Pass `dim_map` to `_requires_data_exchange` and return False if both the spatial and channel dimensions are replicated. Modify `test_conv1d` and `test_conv3d` to check values rather than just shapes, and replicate `conv3d` across the batch dimension. In general, it feels like the current convolution implementation was written to work only if the tensor is sharded across the last dimension. Pull Request resolved: pytorch#167402 Approved by: https://github.com/ezyang
Should be merged after pytorch#166708 Pull Request resolved: pytorch#166711 Approved by: https://github.com/Skylion007, https://github.com/malfet
Sparse sparse mm op implementation Pull Request resolved: pytorch#167013 Approved by: https://github.com/malfet
…ytorch#162564)

# Motivation

Support XPU for `torch.accelerator.get_memory_info`.

Pull Request resolved: pytorch#162564 Approved by: https://github.com/albanD ghstack dependencies: pytorch#156812
…ytorch#166740)

We plan to use `StridedShard` to express `shard_order`. This PR adds the function to support the conversion between `StridedShard` and `shard_order`. I moved some test-related functions into torch/testing/_internal/common_utils.py. For review, we mainly care about **_dtensor_spec.py** and **test_utils.py** in this PR.

### How to convert shard order to StridedShard

Consider the example: placements = $[x_0, x_1, x_2, x_3, x_4]$, where all $x_?$ are sharded on the same tensor dim. Let's see how the shard order impacts the split_factor (sf). We loop from right to left over the placements to construct the split_factor, considering the possible shard orders. Starting from $x_4$: this is a normal shard. Then $x_3$: there are two possibilities. $x_3$'s order can be before $x_4$; if so, $x_3$'s sf=1, because $x_3$ is before $x_4$ in the placements. Otherwise $x_3$'s order is after $x_4$, and $x_3$'s sf should be the mesh dim size of $x_4$, which is $T(x_4)$.

(Figure omitted: diagram of the two cases for $x_3$'s split factor.)

We can use the same method to decide the split factor for $x_2$, $x_1$, and so on.

### How to convert StridedShard to shard order

This follows the same method as above: we check all possible paths and use the real split_factor to see which path matches. If no path matches, the StridedShard cannot be converted to a shard order.

Pull Request resolved: pytorch#166740 Approved by: https://github.com/ezyang
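To make the rule above concrete, here is a small hypothetical helper (names and signature are mine, not from the PR) that computes the split factor of the $i$-th placement from a given shard order and mesh dim sizes, following the right-to-left argument:

```python
def split_factor(i: int, mesh_sizes: list[int], shard_order: list[int]) -> int:
    """Hypothetical illustration of the rule described above.

    mesh_sizes[j] is T(x_j), the mesh dim size of placement x_j, and
    shard_order[k] is the placement index that is applied k-th. The split
    factor of x_i is the product of T(x_j) over placements x_j that come
    after x_i in the placements list but before x_i in the shard order.
    """
    position = {p: k for k, p in enumerate(shard_order)}
    sf = 1
    for j in range(i + 1, len(mesh_sizes)):
        if position[j] < position[i]:
            sf *= mesh_sizes[j]
    return sf


# Example: with placements [x_0, x_1] on mesh sizes [2, 3], sharding x_1
# before x_0 gives x_0 a split factor of 3 (= T(x_1)); the natural order gives 1.
assert split_factor(0, [2, 3], shard_order=[1, 0]) == 3
assert split_factor(0, [2, 3], shard_order=[0, 1]) == 1
```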
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: pytorch#167452 Approved by: https://github.com/pytorchbot
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: pytorch#166844 Approved by: https://github.com/pytorchbot
`torch.ao.quantization` and `torch.fx.experimental` (screenshots omitted). Pull Request resolved: pytorch#167334 Approved by: https://github.com/janeyx99
Summary: Fix pytorch#166841. AOTI incorrectly generates a call to aoti_torch_cuda_scatter_reduce_two_out while the op should actually run on CPU. Fix by using the correct device when calling _generate_scatter_fallback in the wrapper codegen. Pull Request resolved: pytorch#167341 Approved by: https://github.com/yushangdi
Part of pytorch#164878. We can start narrowing the skips and remove them as PRs keep landing. This PR just sets up the scaffolding; the fix will be in a follow-up. Pull Request resolved: pytorch#167360 Approved by: https://github.com/janeyx99
My understanding is this is needed for performance. Pull Request resolved: pytorch#167441 Approved by: https://github.com/oulgen
Discovered while enabling assertions on out-of-bounds accesses. Otherwise the test fails with:
```
ERROR: test_sdpa_mask_fp16_L6_S17_NH23_HS121 (__main__.TestSDPA.test_sdpa_mask_fp16_L6_S17_NH23_HS121)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3334, in wrapper
method(*args, **kwargs)
~~~~~~^^^^^^^^^^^^^^^^^
File "/Users/malfet/git/pytorch/pytorch/build/../test/test_mps.py", line 9494, in test_sdpa_mask_fp16_L6_S17_NH23_HS121
self._test_sdpa_mask(torch.float16, 7, 17, 23, 121)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/malfet/git/pytorch/pytorch/build/../test/test_mps.py", line 9478, in _test_sdpa_mask
y_ref = F.scaled_dot_product_attention(q.cpu(), k.cpu(), v.cpu(), attn_mask=mask.cpu(), dropout_p=0.0, is_causal=False)
~~~~~^^
torch.AcceleratorError: index out of range
```
Pull Request resolved: pytorch#167444
Approved by: https://github.com/Skylion007, https://github.com/manuelcandales
Fixes pytorch#166918. The output may not be on the same device as the predicate. Repro: `python test/inductor/test_control_flow.py -k test_output_on_different_device` Pull Request resolved: pytorch#167354 Approved by: https://github.com/ydwu4, https://github.com/zou3519
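A hedged sketch of the shape of this scenario (my own minimal repro under the assumption that the predicate lives on CPU while the branch outputs live on an accelerator; not the test's exact code):

```python
import torch

def true_fn(x):
    return x + 1

def false_fn(x):
    return x - 1

@torch.compile
def f(pred, x):
    # pred is a CPU scalar tensor; the branch outputs live on the GPU.
    return torch.cond(pred, true_fn, false_fn, (x,))

x = torch.randn(4, device="cuda")
pred = torch.tensor(True)  # CPU predicate
print(f(pred, x))
```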
…167330) See pytorch/test-infra#7446 for the paths Pull Request resolved: pytorch#167330 Approved by: https://github.com/huydhn
Add a "--virtual-local-rank" mode to torchrun. When used instead of passing the local rank in LOCAL_RANK it uses a LOCAL_RANK of "0" and adjusts CUDA_VISIBLE_DEVICES to reflect the desired GPU index. Testing: (tweaked run_train.sh to use `--log-dir`) ``` export NGPU=8 export CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 ``` And then comparing ranks: Without --virtual-local-rank gives a lot of differences like: ``` [rank#]: mul_1: "f32[8, 512, 256]" = torch.ops.aten.mul.Tensor(mul, view_9); mul = None -[rank#]: _to_copy_3: "bf16[8, 512, 256]" = torch.ops.aten._to_copy.default(mul_1, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0)); mul_1 = None +[rank#]: _to_copy_3: "bf16[8, 512, 256]" = torch.ops.aten._to_copy.default(mul_1, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=1)); mul_1 = None [rank#]: detach: "f32[8, 512, 1]" = torch.ops.aten.detach.default(rsqrt); rsqrt = None ``` With --virtual-local-rank makes those differences go away. Pull Request resolved: pytorch#166680 Approved by: https://github.com/ezyang
… unblock production (pytorch#167443)

Summary: pytorch#166609 updates `node.is_impure` to consider a submodule as impure if the submodule contains an impure node. This in turn changes the behavior of `graph.eliminate_dead_code()`, which does not eliminate nodes with side effects; see the [pytorch documentation](https://docs.pytorch.org/docs/stable/fx.html#torch.fx.Graph.eliminate_dead_code):

> Remove all dead code from the graph, based on each node's number of users, and whether the nodes have any side effects.

While it is correct that a submodule containing side-effectful ops is side-effectful and should not be dead-code eliminated, some customers rely on dead code elimination to remove submodules that contain impure ops, which was the behavior before the pytorch#166609 fix. Due to production environment constraints, we have to revert pytorch#166609 and move the side-effectful-submodule check to `const_fold.py`, which will correctly **not** const-fold a submodule that contains impure ops.

NOTE: other call sites that use `node.is_impure()` to make decisions are still incorrectly eliminating side-effectful submodules, but we can't safely change that today.

## This PR

- Move `_subgraph_has_impure_op` into `fx/experimental/const_fold.py`; check and prevent const-folding an impure submodule.
- Added a note in `node.is_impure` to highlight the incorrect behavior and context in case people go looking in the future.

Test Plan: run test_fx_const_fold; all tests pass.

Differential Revision: D86641994

Pull Request resolved: pytorch#167443 Approved by: https://github.com/jfix71
Fixes pytorch#167037 Move the module definition outside of the unit test so when we run the unit test multiple times, the module is not re-compiled. Pull Request resolved: pytorch#167268 Approved by: https://github.com/angelayi
Failing test was `pytest test/export/test_export.py -k test_python_asserts_with_sym_int` Pull Request resolved: pytorch#167700 Approved by: https://github.com/bobrenjc93 ghstack dependencies: pytorch#167382, pytorch#167383, pytorch#167384, pytorch#167387, pytorch#167396, pytorch#167669
Currently, conv1d converts the 3D view to 4D before calling onednn::convolution(). However, this function converts the 4D tensor to a channels-last memory format for computation, resulting in incorrect return results (the correct result should be channels-first). This PR fixes this issue, ensuring that the output memory format is consistent with the expected format. Pull Request resolved: pytorch#162944 Approved by: https://github.com/EikanWang
Summary: Adds documentation for EventList, FunctionEvent and FunctionEventAvg. Closes pytorch#165907 Test Plan: N/A Documentation Differential Revision: D86913697 Pull Request resolved: pytorch#167688 Approved by: https://github.com/sanrise
## MOTIVATION

To generalize distributed test cases for non-CUDA devices.

## CHANGES

- Replaced hard-coded devices/backends with torch.accelerator.current_accelerator() and dist.get_default_backend_for_device.
- Use DistributedTestBase instead of MultiProcessTestCase to use common utilities.
- Remove instantiate_device_tests and make use of torch.accelerator.current_accelerator for test/distributed/test_c10d_object_collectives.py.
- Fix the deterministic context issue for non-CUDA devices in test/distributed/optim/test_zero_redundancy_optimizer.py.
- Use torch.accelerator.device_count() for the multi-GPU check in torch/testing/_internal/distributed/_tensor/common_dtensor.py.

Pull Request resolved: pytorch#165067 Approved by: https://github.com/guangyey, https://github.com/albanD
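A hedged sketch of the device-agnostic pattern the changes above move the tests to (assuming the `torch.accelerator` and `dist.get_default_backend_for_device` APIs named in the commit message):

```python
import torch
import torch.distributed as dist

# Pick whichever accelerator is present (cuda, xpu, ...) instead of
# hard-coding "cuda"/"nccl" in the test.
acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"
backend = dist.get_default_backend_for_device(device_type)

world_size = max(torch.accelerator.device_count(), 1)
print(device_type, backend, world_size)
```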
…o "original_aten" node meta (pytorch#167749) Fixes pytorch#167706 - Add `torch.fx.experimental.proxy_tensor.set_original_aten_op()` around flex_atention HOP dispatch so we have `original_aten` populated for flex_attention - Update the usages of `original_aten` to also expect HOP in addition to OpOverload Pull Request resolved: pytorch#167749 Approved by: https://github.com/drisspg
) Summary: Autovectorization of casting to bfloat16_t is broken in clang-[17, 20] and fixed in clang-21. We are adding workaround vectorized code, which improves conversion speed from smaller int data types. We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

Before:
- uint8->bfloat16_t ===> 319.433us
- int8->bfloat16_t ===> 320.216us
- int16->bfloat16_t ===> 326.899us
- int32->bfloat16_t ===> 327.925us

After:
- uint8->bfloat16_t ===> 185.189us -----> 72% higher throughput
- int8->bfloat16_t ===> 169.790us -----> 89% higher throughput
- int16->bfloat16_t ===> 180.744us -----> 81% higher throughput
- int32->bfloat16_t ===> 185.129us -----> 77% higher throughput

Test Plan:

Correctness: buck2 test mode/opt //caffe2/test:test_ops and buck2 test mode/opt //caffe2/test:torch

Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D86207189

Pull Request resolved: pytorch#166958 Approved by: https://github.com/mcfi
Fixes pytorch#150477

### Summary

- Added frame information (function name, file, line number) to all graph break/skip messages.
- Standardized message format: "torch.compile will skip tracing the frame <name> (<file> line <N>) and fall back to eager. Reason: <reason>"

### Impacts

module: dynamo

Pull Request resolved: pytorch#167067 Approved by: https://github.com/williamwen42
… and add focused documentation (pytorch#165897)

## Summary

This PR enriches the OpenReg device management code and adds focused documentation.

## Key Changes

- Introduced device management documentation in `device.md`.
- Updated `OpenRegFunctions.h` and `OpenRegFunctions.cpp` to use `DeviceIndex` and added error handling.
- Implemented the `check_device_index` function for validating device indices.
- Enhanced Python bindings in `Module.cpp` for device management.
- Added tests for invalid device index handling in `test_device.py`.

Pull Request resolved: pytorch#165897 Approved by: https://github.com/fffrog
…ytorch#166573)

We need to track all symbols. We used to skip `u = item()` and fail with:

```
File "/home/lsakka/pytorch10/pytorch/torch/fx/passes/_tensorify_python_scalars.py", line 149, in _sympy_interp
    expr_to_sym_proxy[expr]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: KeyError: u0
```

Pull Request resolved: pytorch#166573 Approved by: https://github.com/bobrenjc93
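A hedged sketch of the kind of program that produces an unbacked symbol like `u0` from `.item()` (my own minimal shape for illustration, not the PR's test case):

```python
import torch

# Allow .item() results to be captured as unbacked symbols during tracing.
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(backend="inductor")
def f(x, y):
    u = y.item()  # creates an unbacked symbol (e.g. u0)
    return x * u

print(f(torch.randn(4), torch.tensor(3.0)))
```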
To support the use case in pytorch/helion#1122, i.e.

```
@helion.kernel
def foo(
    x: Tensor,
    group_name: str
):
    x_remotes = torch.ops.symm_mem.get_remote_tensors(x, group_name)
    for t in x_remotes:
        ...
```

Helion uses fake tensors to trace a program, thus we cannot use the following code in a Helion function:

```
hdl = rendezvous(tensor)
remote_tensors = tuple(
    hdl.get_remote_tensor(peer, ...) for peer in range(world_size)
)
```

The reason is that when `tensor` is fake, the returned `hdl` is None, so any subsequent call on it will fail. This PR wraps the above functionality as an op:

```
lib.define("get_remote_tensors(Tensor x, str group_name) -> Tensor[]")
```

so that things like `hdl` are not exposed to Helion. The op also provides a `meta` implementation so that Helion can trace it without actually running the rendezvous.

Pull Request resolved: pytorch#167779 Approved by: https://github.com/yf225
Differential Revision: D86685546 Pull Request resolved: pytorch#167481 Approved by: https://github.com/eellison
Pull Request resolved: pytorch#167198 Approved by: https://github.com/bobrenjc93
This reverts commit c78e646. Reverted pytorch#167481 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#167481 (comment)))
…h#165978)

This PR implements `scaled_mm` for XPU. It enables the following data types:

1. TensorWise scaling: `fp8_e4m3` and `fp8_e5m2`
2. RowWise scaling: `fp8_e4m3` and `fp8_e5m2`

It leaves BlockWise scaling to the next PR, so that each PR needs less reviewing effort. This first PR only adds `scaled_mm_xpu` but does not register it; we separate this out to keep the review small. Secondly, there is a `scaled_mm_v2` API in pytorch#164141; we will align with it once v1 is cleaned up.

**Co-authors:** @yuchengliu1, @carsonwang

## PR stack

- -> pytorch#165978: implementation of XPU scaled_mm and oneDNN kernel
- pytorch#167518: implementation of XPU scaled_mm_v2
- pytorch#166056: op registration

## Test Status

1. Relies on the changes in intel/torch-xpu-ops#1746; otherwise the op will fall back to CPU.
2. This PR does not include tests; the tests are enabled in pytorch#166056.

## Credit

This work is based on @yuchengliu1's work at pytorch#140972. The purpose of creating a new PR is to align the API/checks with CUDA, so there will be less porting effort.

## FP8 task tracker

We will track all the scaled_mm related tasks in pytorch#167170.

Pull Request resolved: pytorch#165978 Approved by: https://github.com/liangan1, https://github.com/EikanWang Co-authored-by: Eikan Wang <[email protected]>
)" This reverts commit 50bf1f0. Reverted pytorch#167198 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#167198 (comment)))
…ytorch#164729)

Fixes pytorch#163374. Here is the output from the repro code:

```
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811]
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
aten::clamp_(dt: f32[][R], None, 2)
redistribute_input(0, [P] -> [R])
redistribute_input(t: f32[], [P] -> [R])
_c10d_functional::all_reduce(t: f32[], sum, 0)
_c10d_functional::wait_tensor(t: f32[])
aten::clamp_(t: f32[], None, 2)
aten::view(t: f32[], [])
(Replicate(),)
tensor(2., device='cuda:0')
```

The behavior now matches what was expected in issue pytorch#163374.

Expected behavior (from the issue):
1. The placement should change from Partial(sum) to Replicate().
2. The value should be tensor(2.) instead of tensor(144.).

Actual output from this build:
1. (Replicate(),), so the placement is correct.
2. tensor(2., device='cuda:0'), so the value is correct.

So the in-place operation now properly redistributes the partial DTensor to replicate before performing the clamp and maintains the correct aliasing semantics. It also produces the expected clamped value.

Pull Request resolved: pytorch#164729 Approved by: https://github.com/ezyang
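A hedged sketch of the pattern being fixed (my own minimal repro shape using the public DTensor API, meant to be launched under torchrun; not the issue's exact script):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial

dist.init_process_group()  # expects torchrun-style env vars
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# A Partial(sum) DTensor: the logical value is the sum of the local pieces.
local = torch.full((), 12.0, device="cuda")
dt = DTensor.from_local(local, mesh, [Partial()])

# The in-place clamp must first redistribute Partial -> Replicate; otherwise
# each rank clamps its local piece and the summed value comes out wrong.
dt.clamp_(max=2)
print(dt.placements, dt.full_tensor())
```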
This PR adds an sm_121a flag for row-wise scaled matmuls on DGX Spark. Pull Request resolved: pytorch#167734 Approved by: https://github.com/eqy, https://github.com/cyyever
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#167338 Approved by: https://github.com/jamesjwu
This PR adds a basic spin configuration to allow for linting. It is designed as a drop-in replacement for the current Makefile-based solution, i.e. it sets up and updates lintrunner based on the hashes of certain configuration files. Lintrunner is called via uv's `uvx` command, separating its environment from the general development environment in an effort to reduce instances of competing requirements breaking environments. Pull Request resolved: pytorch#167226 Approved by: https://github.com/atalman, https://github.com/albanD
…sLtWorkspace" (pytorch#167722) Summary: getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes. This diff adds mutexes to synchronize access to the static maps. Note: this is a re-land of D86316117 / pytorch#167248 (see comments for details) Test Plan: Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN: ``` buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100 ``` https://www.internalfb.com/intern/testinfra/testrun/14355223937501118 TSAN output (before synchronization was added): P2026731804 Differential Revision: D86964261 Pull Request resolved: pytorch#167722 Approved by: https://github.com/malfet
# Conflicts:
#   .ci/docker/ci_commit_pins/triton.txt
#   requirements.txt
Jenkins build for b012f56ca2fd5edb6431a6e296a6006a9a9036fc commit finished as NOT_BUILT

Jenkins build for b012f56ca2fd5edb6431a6e296a6006a9a9036fc commit finished as FAILURE

b012f56 to 2903e7a (Compare)

Jenkins build for 2903e7a671b6e093b002fbb9442adfb5914018b3 commit finished as FAILURE
rocm_base: 3d74218
Wheel Generated: http://rocm-ci.amd.com/view/preview/job/pytorch-latest-manylinux-wheels-preview/195/