[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-23 #2670
Closed
pragupta wants to merge 2,112 commits into rocm7.1_internal_testing from rocm7.1_internal_testing_IFU_2025-09-23
Conversation
…#163126) Pull Request resolved: pytorch#163126 Approved by: https://github.com/bdhirsh
We changed how we are tracing; as a result, we need to trace into register_data_class now. Differential Revision: [D82478651](https://our.internmc.facebook.com/intern/diff/D82478651) Pull Request resolved: pytorch#162557 Approved by: https://github.com/zhxchen17
It is better to have the new tracer as a global config that can be manipulated easily. Also, I believe Dynamo-like config infra is more useful than relying on a custom way of patching stuff. Differential Revision: [D82478649](https://our.internmc.facebook.com/intern/diff/D82478649) Pull Request resolved: pytorch#162558 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#162557
…#162893) Summary: Use c10::CudaCachingAllocator for AOTInductor's initial constant buffer allocation. Test Plan: Activate test under test/cpp/aoti_inference/test.cpp Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: pytorch#162893 Approved by: https://github.com/desertfire
The previous graph seems wrong, probably because running the Dynamo bytecode might be changing the grad state unintentionally. Differential Revision: [D82478643](https://our.internmc.facebook.com/intern/diff/D82478643) Pull Request resolved: pytorch#162559 Approved by: https://github.com/zhxchen17, https://github.com/ydwu4 ghstack dependencies: pytorch#162557, pytorch#162558
To make CI machines capable of running CUDA-13 tests. Unfortunately, this upgrade regresses NUMBA integration, so live patch it with NVIDIA/numba-cuda@6e08c9d This fix was suggested in pytorch#162878 (comment) Pull Request resolved: pytorch#163111 Approved by: https://github.com/huydhn
…ch#160908) Summary:

**Background**
Torch Elastic sends SIGKILL/SIGTERM signals if any process fails while others are still running. However, processes terminated by these signals do not generate termination logs, causing confusion.

**Solution**
Capture exit codes after SIGTERM signals to ensure complete and accurate termination logging.

Test Plan: unit tests https://www.internalfb.com/mlhub/pipelines/runs/mast/f773486907-TrainingApplication__13_D79584569?job_attempt=1&version=0&tab=summary&env=PRODUCTION

Rollback Plan:

Differential Revision: D79584569

Pull Request resolved: pytorch#160908 Approved by: https://github.com/d4l3k
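A minimal sketch (not the Torch Elastic implementation) of the exit-code behavior being captured here: on POSIX, a child killed by a signal exposes the negated signal number as its exit code, which is the information a launcher can log after sending SIGTERM.

```python
import multiprocessing as mp
import signal
import time

def worker():
    # Stand-in for a long-running trainer process.
    time.sleep(60)

if __name__ == "__main__":
    p = mp.Process(target=worker)
    p.start()
    p.terminate()  # sends SIGTERM on POSIX
    p.join()
    # A negative exitcode means the process died from a signal: -15 == -SIGTERM.
    assert p.exitcode == -signal.SIGTERM
    print(f"worker exited with code {p.exitcode}")
```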
…162732) Summary: Quick fix for runtime support on foreach_div, see D81274963. Fixed an issue that I created in that diff so that the CIs pass. Test Plan: CIs created in D81274963 and D81286593 pass. Added some logs in [aten_mtia_ops.py](https://www.internalfb.com/code/fbsource/[c56272ba042c43c65517dcac254364cf732fcfa9]/fbcode/mtia/host_runtime/torch_mtia/aten_mtia_ops.cpp?lines=3676) to all the foreach_div ops. We can see that the correct MTIA kernels are being invoked in the tests. https://www.internalfb.com/intern/testinfra/testrun/15481123829281588 Rollback Plan: Pull Request resolved: pytorch#162732 Approved by: https://github.com/danielhou0515
…pytorch#163117)
# why
- KTC might regenerate a ChoiceCaller, e.g. through FlexibleLayout optimization. This in turn would delete any annotations.
# what
- provide an annotations dict inside KTC
- forward that dict towards the ChoiceCaller's annotations
- ChoiceCaller users, e.g. in select_algorithm, now have access to the KTC and can register handlers to record/make decisions based on the KTC
# testing
n/a

Differential Revision: [D82587631](https://our.internmc.facebook.com/intern/diff/D82587631) Pull Request resolved: pytorch#163117 Approved by: https://github.com/nmacchioni
…ytorch#163088) In reaction to pytorch#116202 (comment) Pull Request resolved: pytorch#163088 Approved by: https://github.com/albanD
Fixes pytorch#163105 Note that the new `SWALR.load_state_dict` is **not backwards compatible**:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
    """Load the scheduler's state.

    Args:
        state_dict (dict): scheduler state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    self.__dict__.update(state_dict)
    self._set_anneal_func(self._anneal_strategy)
```
If we'd like to maintain compatibility with old state_dicts (loaded with `weights_only=False`), we could use something along these lines:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
    """Load the scheduler's state.

    Args:
        state_dict (dict): scheduler state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    anneal_func = state_dict.pop("anneal_func", None)
    strategy = state_dict.get("_anneal_strategy")
    self.__dict__.update(state_dict)
    if anneal_func is not None:
        state_dict["anneal_func"] = anneal_func
    if strategy is None:
        if anneal_func == self._linear_anneal:
            strategy = "linear"
        elif anneal_func == self._cosine_anneal:
            strategy = "cos"
    if strategy is None:
        strategy = getattr(self, "_anneal_strategy", "cos")
    self._set_anneal_func(strategy)
```
But given the fact that loading an `SWALR` state_dict before this PR would have caused an error, this seems okay. A GitHub/Google search for `SWALR.load_state_dict` had no results. Happy to change if not, or add a warning just in case. Pull Request resolved: pytorch#163122 Approved by: https://github.com/janeyx99
Part of pytorch#162270 Pull Request resolved: pytorch#163012 Approved by: https://github.com/kulinseth, https://github.com/malfet
In Unified Runtime, we cannot have any fallback ops (for now). Not all conv1d ops can avoid fallbacks today, so we write a decomposition for it. It's not registered in the default decomposition table because currently only ExecuTorch/Unified Runtime needs it. It might benefit Inductor as well, since conv2d can generate Triton kernels while there is no Triton codegen for conv1d. I don't know whether the conv2d Triton kernel will have better perf than aten::conv1d, so it's not registered by default yet. To register it, one just needs to do `import torch._decomp as decomp; decomp.register_decomposition(torch.ops.aten.conv1d.default, conv1d_to_conv2d)` Pull Request resolved: pytorch#163080 Approved by: https://github.com/angelayi
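A minimal sketch of what such a decomposition could look like (a hypothetical `conv1d_to_conv2d`, not necessarily the version merged in this PR): lift the input to 4-D with a dummy height axis, reuse conv2d, then squeeze the axis back out. It could then be registered with the `register_decomposition` call quoted above.

```python
import torch
import torch.nn.functional as F

def conv1d_to_conv2d(input, weight, bias=None, stride=(1,), padding=(0,), dilation=(1,), groups=1):
    # Hypothetical decomposition sketch: NCL -> NCHW with H=1, run conv2d, drop the dummy axis.
    x = input.unsqueeze(2)    # (N, C, L) -> (N, C, 1, L)
    w = weight.unsqueeze(2)   # (C_out, C_in/groups, K) -> (C_out, C_in/groups, 1, K)
    out = F.conv2d(
        x, w, bias,
        stride=(1, stride[0]),
        padding=(0, padding[0]),
        dilation=(1, dilation[0]),
        groups=groups,
    )
    return out.squeeze(2)     # (N, C_out, 1, L_out) -> (N, C_out, L_out)

# Sanity check against the eager op.
x = torch.randn(2, 4, 16)
w = torch.randn(8, 4, 3)
b = torch.randn(8)
torch.testing.assert_close(conv1d_to_conv2d(x, w, b), F.conv1d(x, w, b))
```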
We previously asked users to separate these because we didn't have any way of adding extern C declarations. That is no longer the case, so we don't need this confusing flag anymore.
BC breaking, but that is fine for this API since it doesn't have major users yet. Please just put all your code in `kernel_source` moving forward.
## BC note
The `header_code` parameter has been removed from `torch.cuda._compile_kernel`. Previously, users could pass separate header code that would be prepended to the kernel source. Now, header code must be included directly in the `kernel_source` parameter.
Note this only affects torch.cuda._compile_kernel, which is a private API.
Example:
Before
```python
kernel = compile_kernel(
    kernel_source="__global__ void my_kernel() { ... }",
    kernel_name="my_kernel",
    header_code="#define SCALE 2.0f\n__device__ float scale(float x) { return x * SCALE; }"
)
```
After
```python
kernel_source = """
#define SCALE 2.0f
__device__ float scale(float x) { return x * SCALE; }
__global__ void my_kernel() { ... }
"""
kernel = _compile_kernel(kernel_source, "my_kernel")
```
Pull Request resolved: pytorch#163165
Approved by: https://github.com/janeyx99, https://github.com/albanD
…ytorch#162682) inline_and_install_module export variant is our long term state so it is better to use the new tracer for this. It also uncovered bunch of minor bugs because with inline_and_install_module, the nn_module_stack generation is changed a bit. Differential Revision: [D82478648](https://our.internmc.facebook.com/intern/diff/D82478648) Pull Request resolved: pytorch#162682 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#162557, pytorch#162558, pytorch#162559
Pull Request resolved: pytorch#163155 Approved by: https://github.com/awgu, https://github.com/Skylion007
… function (pytorch#159830) Fixes pytorch#159829 Pull Request resolved: pytorch#159830 Approved by: https://github.com/albanD
…162636) **Summary:** In order to ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure that it can pass the same tests that fully_shard can for training. This test is important as it verifies we can cast a replicated module to a different type after initialization, an important feature for enabling mixed precision. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_to_float64_after_init Pull Request resolved: pytorch#162636 Approved by: https://github.com/mori360 ghstack dependencies: pytorch#162631
Very similar to pytorch#161007 except now for mark_unbacked. Pull Request resolved: pytorch#162652 Approved by: https://github.com/laithsakka
Note to self: I should probably start using ghstack. This is rebased on top of pytorch#163165, so you only need to review this commit pytorch@7387c1b. This test doesn't add any new functionality; it just ensures DLPack conversion is working well. Pull Request resolved: pytorch#163166 Approved by: https://github.com/janeyx99, https://github.com/albanD
Fixes pytorch#135954 Torch Inductor Windows Path Escape Characters Pull Request resolved: pytorch#162761 Approved by: https://github.com/jansel, https://github.com/mlazos
… compiled model check. (pytorch#162951) Following pytorch#162438, this PR generalizes the original CUDA-only check and adds an XPU check. Fixes pytorch#162939, Fixes pytorch#162938, Fixes pytorch#163032, Fixes pytorch#163045 Pull Request resolved: pytorch#162951 Approved by: https://github.com/EikanWang, https://github.com/jansel
Summary:
1. Generalized testing by auto-detecting Cache types and splitting testing by abstract base class
   - Now checks that all Cache types are thread-safe
   - Will fail tests if any new Cache is added and is untested (for example, any cache with non-str key or non-bytes value)
2. All Caches are thread-safe
   - InMemoryCache was the only one not thread-safe, so added a lock for access
   - Realized that to implement MultiCache we should just have this requirement.

Also, OnDiskCache is now a functioning AsyncCache with a default base_dir using Python's tempfile.gettempdir, i.e. OnDiskCache is no longer an abstract cache class.

Test Plan:
```
[nmacchioni@*** / ()]$ buck test fbcode//mode/opt caffe2/test/inductor:pcache
Tests finished: Pass 28. Fail 0. Fatal 0. Skip 0. Build failure 0
[nmacchioni@*** / ()|remote/fbcode/warm_gpu_od_stable...)]$
```

Rollback Plan:

Differential Revision: D82660240

Pull Request resolved: pytorch#163173 Approved by: https://github.com/masnesral
…ytorch#162650) **Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group 2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group 3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager Pull Request resolved: pytorch#162650 Approved by: https://github.com/mori360 ghstack dependencies: pytorch#162631, pytorch#162636
…rch#162992) Differential Revision: [D82478646](https://our.internmc.facebook.com/intern/diff/D82478646) Pull Request resolved: pytorch#162992 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#162557, pytorch#162558, pytorch#162559, pytorch#162682
…ts (pytorch#162993) Differential Revision: [D82478644](https://our.internmc.facebook.com/intern/diff/D82478644) Pull Request resolved: pytorch#162993 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#162557, pytorch#162558, pytorch#162559, pytorch#162682, pytorch#162992
…non-root module (pytorch#162654) **Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward Pull Request resolved: pytorch#162654 Approved by: https://github.com/mori360 ghstack dependencies: pytorch#162631, pytorch#162636, pytorch#162650
1. The dispatch signatures defined in the `core.extern_elementwise` call must match the C signature of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as an IMA or a hang. When matched, most of the time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong). 2. When calling the `core.extern` wrappers from `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such a kernel. Pull Request resolved: pytorch#163152 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#163025
…62927) Summary: I am really skeptical about inductor sizevars creating an empty shape env when not provided with one. I think we should fail there if the graph has dynamic shapes and no shape env is provided; however, I wonder if there are actually use cases that depend on the shape env not being there? Reasoning APIs depend on facts in the shape env and assume some stuff exists for specific symbols. Test Plan: Fixes the bug reported in https://www.internalfb.com/diff/D82337184; creating a simple e2e unit test is not trivial. Rollback Plan: Differential Revision: D82412384 Pull Request resolved: pytorch#162927 Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/jansel
…iple times in a forward pass (pytorch#162656) **Summary:** Verifies that Replicate works correctly when a module is used multiple times in a single forward pass. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_multi_forward_module Pull Request resolved: pytorch#162656 Approved by: https://github.com/mori360 ghstack dependencies: pytorch#162631, pytorch#162636, pytorch#162650, pytorch#162654
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#163590 Approved by: https://github.com/pytorchbot
Pull Request resolved: pytorch#163553 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547
…m_magma.sh (#2651) Fixes #ISSUE_NUMBER --------- Co-authored-by: AMD <[email protected]>
…its for flops and bandwidth (pytorch#162942) In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices and more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators. Testing:
```python
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
# 1070.53056
# 535.26528
# 66.90816
# 1070.53056
# 2141.06112
# 4893.696
# 33454.08
```
Pull Request resolved: pytorch#162942 Approved by: https://github.com/ngimel, https://github.com/albanD
…s.iter.grouping (pytorch#163438) This PR removes the import tricks for `SHARDING_PRIORITIES` and `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They were declared to be removed in PyTorch 2.1 but were not. Before the change, the following works:
```
import torch.utils.data.datapipes.iter.grouping.SHARDING_PRIORITIES
import torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe
```
After the change, there is an import error exception. Pull Request resolved: pytorch#163438 Approved by: https://github.com/janeyx99
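For reference, a hedged sketch of imports that should keep working after the removal, assuming the names remain exposed from their canonical `sharding` module:

```python
# The deprecated aliases under torch.utils.data.datapipes.iter.grouping are gone;
# the names are assumed (not confirmed by this PR) to still be importable from the
# sharding module where they are defined.
from torch.utils.data.datapipes.iter.sharding import (
    SHARDING_PRIORITIES,
    ShardingFilterIterDataPipe,
)
```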
Cache the result of `has_efa` with `functools.cache`. Pull Request resolved: pytorch#163439 Approved by: https://github.com/janeyx99
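A minimal sketch of the pattern (hypothetical body; the real `has_efa` helper probes the host for EFA network interfaces in its own way):

```python
import functools
import glob

@functools.cache  # compute once per process, reuse the cached result afterwards
def has_efa() -> bool:
    # Hypothetical check: look for EFA devices under /sys (an assumption for
    # illustration, not the actual implementation in the PR).
    return bool(glob.glob("/sys/class/infiniband/*efa*"))
```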
This reverts commit 509c4e8. Reverted pytorch#163091 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#163091 (comment)))
Pull Request resolved: pytorch#163554 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553
Pull Request resolved: pytorch#163555 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553, pytorch#163554
Pull Request resolved: pytorch#163556 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553, pytorch#163554, pytorch#163555
Pull Request resolved: pytorch#163557 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553, pytorch#163554, pytorch#163555, pytorch#163556
Summary: Update Test Plan: CI Rollback Plan: Differential Revision: D81727392 Pull Request resolved: pytorch#162222 Approved by: https://github.com/sanrise
…163235) ## Why this PR? I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4). Since I installed my `gcc` through `apt install build-essential`, which I think is a common way to install `gcc` for quite a few developers, I believe it's necessary to slightly modify the command to add `-I ./` to explicitly indicate the header file search path. ## What I've changed? - I added `-I ./` to correctly search for `./include/openreg.h`. - I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`. Pull Request resolved: pytorch#163235 Approved by: https://github.com/FFFrog, https://github.com/albanD Co-authored-by: Jiawei Li <[email protected]>
Pull Request resolved: pytorch#163558 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553, pytorch#163554, pytorch#163555, pytorch#163556, pytorch#163557
…rch#163488) Differential Revision: D82933509

Over the weekend I realized that some of the cache implementation was a bit silly, and too constrained to be actually generic. For example, InMemoryCache[str, bytes] was odd since we'd probably want to be able to store more than just str keys with bytes values.

So, tl;dr: everything is now generic, with the one constraint being that Key and Value must both be pickle-able types. This makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.

With this change, there were also some improvements made to the testing; mainly better coverage, but now we also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later.

I also hardened some things here and there; for example, we now use literal_eval (forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible).

Putting this PR up now for feedback. In the process of generalizing the code I did remove the documentation since it was becoming outdated, but I will add that back in after the PR is green. I have the next PR ready as well (it implements a fresh-cache context manager); will export once this lands.

Pull Request resolved: pytorch#163488 Approved by: https://github.com/aorenste, https://github.com/masnesral
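A minimal, hypothetical sketch of the idea described above (not the actual `torch._inductor` cache classes): a generic, thread-safe cache whose keys and values only need to be pickle-able, stored internally as bytes behind a lock.

```python
import pickle
import threading
from typing import Generic, Optional, TypeVar

Key = TypeVar("Key")
Value = TypeVar("Value")

class InMemoryCache(Generic[Key, Value]):
    """Thread-safe in-memory cache; keys/values are pickled on the way in and out.

    Assumes keys pickle deterministically (true for simple tuples of str/int used here).
    """

    def __init__(self) -> None:
        self._store: dict[bytes, bytes] = {}
        self._lock = threading.Lock()

    def get(self, key: Key) -> Optional[Value]:
        k = pickle.dumps(key)
        with self._lock:
            blob = self._store.get(k)
        return None if blob is None else pickle.loads(blob)

    def insert(self, key: Key, value: Value) -> None:
        with self._lock:
            self._store[pickle.dumps(key)] = pickle.dumps(value)

# Usage: any pickle-able key/value types work.
cache: InMemoryCache[tuple[str, int], dict] = InMemoryCache()
cache.insert(("kernel", 3), {"autotune": True})
print(cache.get(("kernel", 3)))
```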
Because it required CUDA >= 12. Pull Request resolved: pytorch#163495 Approved by: https://github.com/janeyx99
Please see the build script: https://github.com/pytorch/pytorch/blob/8da008678fcb95dbf55a33451136a242871ae4e2/.ci/manywheel/build_cuda.sh#L69-L71 This should display the correct warning: `` Please install PyTorch with a following CUDA configurations: 12.6 12.8 13.0 following instructions at https://pytorch.org/get-started/locally/ `` Pull Request resolved: pytorch#163585 Approved by: https://github.com/malfet
…163319) FIXES pytorch#163286 Pull Request resolved: pytorch#163319 Approved by: https://github.com/eellison
Fixes part of pytorch#163314 In particular: **Bug 1: H=None Broadcasting Produces Incorrect Results** This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Even though the mask_mod remains "interpretable", the kernel's stride math then reads wrong offsets. As a result we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding/H broadcasting. The B=None, H=None working case is accidental: with a singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides, which causes silent errors. Pull Request resolved: pytorch#163426 Approved by: https://github.com/drisspg
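The shape collapse itself is ordinary tensor-indexing behavior, illustrated here on a plain tensor standing in for the block-mask metadata (hypothetical sizes, not the BlockMask API):

```python
import torch

# Stand-in for kv_num_blocks with shape [B, H, Q_tiles] (hypothetical sizes).
kv_num_blocks = torch.zeros(2, 4, 8)

print(kv_num_blocks[:, :, 3].shape)    # torch.Size([2, 4])    -> Q-tile axis collapsed
print(kv_num_blocks[:, :, 3:4].shape)  # torch.Size([2, 4, 1]) -> Q-tile axis preserved
```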
This reverts commit 27164b6. Reverted pytorch#163167 on behalf of https://github.com/malfet due to This broke in inductor-cpu-test, see https://hud.pytorch.org/hud/pytorch/pytorch/1a42656d6c43a9bb7eb90c511884ce451d29422f/1?per_page=50&name_filter=inductor-cpu-test&mergeEphemeralLF=true ([comment](pytorch#163167 (comment)))
…sting_IFU_2025-09-23 # Conflicts: # .ci/aarch64_linux/aarch64_ci_build.sh # .ci/aarch64_linux/aarch64_wheel_ci_build.py # .ci/docker/build.sh # .ci/docker/ci_commit_pins/huggingface-requirements.txt # .ci/docker/ci_commit_pins/triton.txt # .ci/docker/common/install_rocm.sh # .ci/docker/requirements-ci.txt # .ci/docker/requirements-docs.txt # .ci/libtorch/build.sh # .ci/lumen_cli/cli/lib/core/vllm/lib.py # .ci/lumen_cli/cli/lib/core/vllm/vllm_build.py # .ci/lumen_cli/cli/lib/core/vllm/vllm_test.py # .ci/wheel/build_wheel.sh # .github/ci_commit_pins/audio.txt # .github/ci_commit_pins/vllm.txt # .github/ci_commit_pins/xla.txt # .github/ci_configs/vllm/Dockerfile.tmp_vllm # .github/scripts/generate_binary_build_matrix.py # .github/templates/macos_binary_build_workflow.yml.j2 # .github/workflows/build-vllm-wheel.yml # .github/workflows/docker-builds.yml # .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml # .github/workflows/generated-linux-binary-manywheel-main.yml # .github/workflows/generated-linux-binary-manywheel-nightly.yml # .github/workflows/generated-linux-binary-manywheel-rocm-main.yml # .github/workflows/generated-macos-arm64-binary-libtorch-release-nightly.yml # .github/workflows/generated-macos-arm64-binary-wheel-nightly.yml # .github/workflows/inductor-nightly.yml # .github/workflows/inductor-perf-test-nightly-x86-zen.yml # .github/workflows/inductor-perf-test-nightly-x86.yml # .github/workflows/inductor-periodic.yml # .github/workflows/inductor-unittest.yml # .github/workflows/inductor.yml # .github/workflows/operator_benchmark.yml # .github/workflows/pull.yml # .github/workflows/trunk.yml # .github/workflows/vllm.yml # aten/src/ATen/CMakeLists.txt # aten/src/ATen/DLConvertor.cpp # aten/src/ATen/cuda/CUDABlas.cpp # aten/src/ATen/native/CPUBlas.cpp # aten/src/ATen/native/LinearAlgebra.cpp # aten/src/ATen/native/Normalization.cpp # aten/src/ATen/native/cuda/Blas.cpp # aten/src/ATen/native/cuda/int8mm.cu # aten/src/ATen/native/cudnn/MHA.cpp # aten/src/ATen/native/miopen/BatchNorm_miopen.cpp # aten/src/ATen/native/miopen/Conv_miopen.cpp # aten/src/ATen/native/mps/operations/GridSampler.mm # aten/src/ATen/native/native_functions.yaml # aten/src/ATen/native/sparse/mps/SparseMPSTensorMath.mm # aten/src/ATen/native/transformers/hip/flash_attn/flash_api.h # benchmarks/dynamo/ci_expected_accuracy/aot_eager_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/aot_eager_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/cpu_inductor_amp_freezing_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/cpu_inductor_freezing_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/cpu_inductor_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_aot_eager_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_aot_eager_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_cpu_inductor_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_inductor_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_inductor_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/dynamo_eager_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamo_eager_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/rocm/aot_eager_torchbench_inference.csv # 
benchmarks/dynamo/ci_expected_accuracy/rocm/dynamic_aot_eager_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/rocm/dynamo_eager_torchbench_inference.csv # benchmarks/dynamo/pr_time_benchmarks/expected_results.csv # benchmarks/operator_benchmark/benchmark_core.py # build_variables.bzl # c10/cuda/CUDAFunctions.cpp # cmake/Codegen.cmake # cmake/External/aotriton.cmake # docs/source/accelerator/index.md # docs/source/accelerator/operators.md # functorch/dim/__init__.py # functorch/dim/wrap_type.py # requirements-build.txt # requirements.txt # test/cpp/nativert/CMakeLists.txt # test/cpp/nativert/test_triton_kernel_manager_registration.cpp # test/cpp_extensions/libtorch_agnostic_extension/libtorch_agnostic/csrc/kernel.cpp # test/cpp_extensions/libtorch_agnostic_extension/libtorch_agnostic/ops.py # test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py # test/cpp_extensions/open_registration_extension/torch_openreg/README.md # test/cpp_extensions/open_registration_extension/torch_openreg/setup.py # test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg/README.md # test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg/example/example.cpp # test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg/__init__.py # test/distributed/_composable/fsdp/test_fully_shard_training.py # test/distributed/_composable/test_composability/test_2d_composability.py # test/distributed/fsdp/test_fsdp_comm_hooks.py # test/distributed/tensor/parallel/test_tp_examples.py # test/distributed/tensor/test_attention.py # test/distributed/tensor/test_dtensor_compile.py # test/distributed/tensor/test_dtensor_ops.py # test/distributed/tensor/test_op_schema.py # test/distributed/test_inductor_collectives.py # test/distributed/test_nvshmem.py # test/distributed/test_nvshmem_triton.py # test/distributed/test_symmetric_memory.py # test/dynamo/test_activation_checkpointing.py # test/dynamo/test_aot_compile.py # test/dynamo/test_callback.py # test/dynamo/test_error_messages.py # test/dynamo/test_guard_serialization.py # test/dynamo/test_misc.py # test/dynamo/test_package.py # test/dynamo/test_structured_trace.py # test/export/test_export.py # test/export/test_export_opinfo.py # test/export/test_passes.py # test/export/test_serialize.py # test/functorch/test_control_flow.py # test/inductor/test_aot_inductor.py # test/inductor/test_aot_inductor_package.py # test/inductor/test_flex_attention.py # test/inductor/test_fxir_backend.py # test/inductor/test_loop_ordering.py # test/inductor/test_max_autotune.py # test/inductor/test_torchinductor.py # test/nn/test_convolution.py # test/nn/test_pooling.py # test/run_test.py # test/slow_tests.json # test/test_binary_ufuncs.py # test/test_dynamic_shapes.py # test/test_matmul_cuda.py # test/test_nestedtensor.py # test/test_nn.py # test/test_openreg.py # third_party/xpu.txt # tools/flight_recorder/components/config_manager.py # tools/pyi/gen_pyi.py # torch/_C/_dynamo/guards.pyi # torch/_dynamo/aot_compile.py # torch/_dynamo/convert_frame.py # torch/_dynamo/functional_export.py # torch/_dynamo/graph_break_registry.json # torch/_dynamo/guards.py # torch/_dynamo/output_graph.py # torch/_dynamo/package.py # torch/_dynamo/symbolic_convert.py # torch/_dynamo/variables/higher_order_ops.py # torch/_dynamo/variables/lists.py # torch/_dynamo/variables/optimizer.py # torch/_export/serde/serialize.py # torch/_export/wrappers.py # torch/_functorch/_aot_autograd/autograd_cache.py # 
torch/_higher_order_ops/__init__.py # torch/_higher_order_ops/associative_scan.py # torch/_higher_order_ops/flex_attention.py # torch/_higher_order_ops/triton_kernel_wrap.py # torch/_inductor/choices.py # torch/_inductor/codegen/cpp.py # torch/_inductor/codegen/cpp_micro_gemm.py # torch/_inductor/codegen/cpp_wrapper_cpu.py # torch/_inductor/codegen/triton.py # torch/_inductor/codegen/wrapper_fxir.py # torch/_inductor/config.py # torch/_inductor/cpp_builder.py # torch/_inductor/decomposition.py # torch/_inductor/kernel/bmm.py # torch/_inductor/kernel/flex/flex_attention.py # torch/_inductor/kernel/flex/templates/flex_attention.py.jinja # torch/_inductor/kernel/flex/templates/flex_backwards.py.jinja # torch/_inductor/kernel/flex/templates/flex_decode.py.jinja # torch/_inductor/kernel/flex/templates/utilities.py.jinja # torch/_inductor/kernel/mm.py # torch/_inductor/kernel/mm_plus_mm.py # torch/_inductor/kernel_template_choice.py # torch/_inductor/memory.py # torch/_inductor/runtime/triton_heuristics.py # torch/_inductor/scheduler.py # torch/_inductor/select_algorithm.py # torch/_inductor/template_heuristics/base.py # torch/_inductor/template_heuristics/triton.py # torch/_inductor/utils.py # torch/_meta_registrations.py # torch/_prims_common/__init__.py # torch/csrc/Module.cpp # torch/csrc/autograd/python_variable.cpp # torch/csrc/autograd/python_variable_indexing.cpp # torch/csrc/distributed/c10d/FlightRecorder.cpp # torch/csrc/distributed/c10d/ProcessGroupGloo.hpp # torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu # torch/csrc/inductor/aoti_runtime/utils.h # torch/csrc/stable/accelerator.h # torch/csrc/stable/ops.h # torch/csrc/utils/generated_serialization_types.h # torch/csrc/utils/tensor_numpy.cpp # torch/distributed/_symmetric_memory/_nvshmem_triton.py # torch/distributed/device_mesh.py # torch/distributed/pipelining/_schedule_visualizer.py # torch/distributed/tensor/_api.py # torch/distributed/tensor/_dispatch.py # torch/distributed/tensor/_op_schema.py # torch/distributed/tensor/_random.py # torch/distributed/tensor/_sharding_prop.py # torch/export/_trace.py # torch/export/_unlift.py # torch/export/exported_program.py # torch/fx/experimental/proxy_tensor.py # torch/nativert/executor/triton/CpuTritonKernelManager.cpp # torch/nativert/executor/triton/CudaTritonKernelManager.cpp # torch/nativert/executor/triton/TritonKernelManager.h # torch/nativert/kernels/KernelHandlerRegistry.cpp # torch/nativert/kernels/TritonKernel.cpp # torch/nested/_internal/ops.py # torch/onnx/__init__.py # torch/overrides.py # torch/testing/_internal/common_cuda.py # torch/testing/_internal/common_distributed.py # torch/testing/_internal/common_quantization.py # torch/testing/_internal/common_utils.py # torch/testing/_internal/distributed/_tensor/common_dtensor.py # torch/testing/_internal/distributed/fake_pg.py # torch/testing/_internal/hop_db.py # torch/utils/_python_dispatch.py # torch/utils/data/datapipes/iter/combinatorics.py
Jenkins build for a10a30e34c9f2acc660cf1494ef0908205c67e30 commit finished as FAILURE
a7dc2b0 to 1c57644
We will do a fresh IFU since we had to update the previous IFU PR: #2677
rocm_base: 7ea3967