[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-23 #2670
Closed
pragupta wants to merge 2,112 commits into rocm7.1_internal_testing from rocm7.1_internal_testing_IFU_2025-09-23
Conversation
…#163126) Pull Request resolved: pytorch#163126 Approved by: https://github.com/bdhirsh
We changed how we are tracing; as a result, we need to trace into register_data_class now. Differential Revision: [D82478651](https://our.internmc.facebook.com/intern/diff/D82478651) Pull Request resolved: pytorch#162557 Approved by: https://github.com/zhxchen17
It is better to have the new tracer as a global config that can be manipulated easily. Also, I believe Dynamo-like config infra is more useful than relying on a custom way of patching stuff. Differential Revision: [D82478649](https://our.internmc.facebook.com/intern/diff/D82478649) Pull Request resolved: pytorch#162558 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#162557
…#162893) Summary: Use c10::CudaCachingAllocator for AOTInductor's initial constant buffer allocation. Test Plan: Activate test under test/cpp/aoti_inference/test.cpp Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: pytorch#162893 Approved by: https://github.com/desertfire
The previous graph seems wrong, probably because running the Dynamo bytecode might be changing the grad state unintentionally. Differential Revision: [D82478643](https://our.internmc.facebook.com/intern/diff/D82478643) Pull Request resolved: pytorch#162559 Approved by: https://github.com/zhxchen17, https://github.com/ydwu4 ghstack dependencies: pytorch#162557, pytorch#162558
To make CI machines capable of running CUDA-13 tests. Unfortunately, this upgrade regresses NUMBA integration, so live patch it with NVIDIA/numba-cuda@6e08c9d This fix was suggested in pytorch#162878 (comment) Pull Request resolved: pytorch#163111 Approved by: https://github.com/huydhn
…ch#160908) Summary:

**Background**
Torch Elastic sends SIGKILL/SIGTERM signals if any process fails while others are still running. However, processes terminated by these signals do not generate termination logs, causing confusion.

**Solution**
Capture exit codes after SIGTERM signals to ensure complete and accurate termination logging.

Test Plan: unit tests https://www.internalfb.com/mlhub/pipelines/runs/mast/f773486907-TrainingApplication__13_D79584569?job_attempt=1&version=0&tab=summary&env=PRODUCTION

Rollback Plan:

Differential Revision: D79584569

Pull Request resolved: pytorch#160908 Approved by: https://github.com/d4l3k
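A minimal sketch (not the Torch Elastic implementation) of the exit-code behavior being captured here: on POSIX, a child killed by a signal exposes the negated signal number as its exit code, which is the information a launcher can log after sending SIGTERM.

```python
import multiprocessing as mp
import signal
import time

def worker():
    # Stand-in for a long-running trainer process.
    time.sleep(60)

if __name__ == "__main__":
    p = mp.Process(target=worker)
    p.start()
    p.terminate()  # sends SIGTERM on POSIX
    p.join()
    # A negative exitcode means the process died from a signal: -15 == -SIGTERM.
    assert p.exitcode == -signal.SIGTERM
    print(f"worker exited with code {p.exitcode}")
```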
…162732) Summary: Quick fix for runtime support on foreach_div, see D81274963. Fixed an issue that I created in that diff so that the CIs pass. Test Plan: CIs created in D81274963 and D81286593 pass. Added some logs in [aten_mtia_ops.py](https://www.internalfb.com/code/fbsource/[c56272ba042c43c65517dcac254364cf732fcfa9]/fbcode/mtia/host_runtime/torch_mtia/aten_mtia_ops.cpp?lines=3676) to all the foreach_div ops. We can see that the correct MTIA kernels are being invoked in the tests. https://www.internalfb.com/intern/testinfra/testrun/15481123829281588 Rollback Plan: Pull Request resolved: pytorch#162732 Approved by: https://github.com/danielhou0515
…pytorch#163117)
# why
- KTC might regenerate a ChoiceCaller, e.g. through FlexibleLayout optimization. This in turn would delete any annotations.
# what
- provide an annotations dict inside KTC
- forward that dict towards the ChoiceCaller's annotations
- ChoiceCaller users, e.g. in select_algorithm, now have access to the KTC and can register handlers to record/make decisions based on the KTC
# testing
n/a

Differential Revision: [D82587631](https://our.internmc.facebook.com/intern/diff/D82587631) Pull Request resolved: pytorch#163117 Approved by: https://github.com/nmacchioni
…ytorch#163088) In reaction to pytorch#116202 (comment) Pull Request resolved: pytorch#163088 Approved by: https://github.com/albanD
Fixes pytorch#163105 Note that the new `SWALR.load_state_dict` is **not backwards compatible**:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
    """Load the scheduler's state.

    Args:
        state_dict (dict): scheduler state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    self.__dict__.update(state_dict)
    self._set_anneal_func(self._anneal_strategy)
```
If we'd like to maintain compatibility with old state_dicts (loaded with `weights_only=False`), we could use something along these lines:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
    """Load the scheduler's state.

    Args:
        state_dict (dict): scheduler state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    anneal_func = state_dict.pop("anneal_func", None)
    strategy = state_dict.get("_anneal_strategy")
    self.__dict__.update(state_dict)
    if anneal_func is not None:
        state_dict["anneal_func"] = anneal_func
    if strategy is None:
        if anneal_func == self._linear_anneal:
            strategy = "linear"
        elif anneal_func == self._cosine_anneal:
            strategy = "cos"
    if strategy is None:
        strategy = getattr(self, "_anneal_strategy", "cos")
    self._set_anneal_func(strategy)
```
But given the fact that loading an `SWALR` state_dict before this PR would have caused an error, this seems okay. A GitHub/Google search for `SWALR.load_state_dict` had no results. Happy to change if not, or add a warning just in case. Pull Request resolved: pytorch#163122 Approved by: https://github.com/janeyx99
Part of pytorch#162270 Pull Request resolved: pytorch#163012 Approved by: https://github.com/kulinseth, https://github.com/malfet
In Unified Runtime, we cannot have any fallback ops (for now). Not all conv1d ops can avoid fallbacks today, so we write a decomposition for it. It's not registered in the default decomposition table because currently only ExecuTorch/Unified Runtime needs it. It might benefit Inductor as well, since conv2d can generate Triton kernels while there is no Triton codegen for conv1d. I don't know whether the conv2d Triton kernel will have better perf than aten::conv1d, so it's not registered by default yet. To register it, one just needs to do `import torch._decomp as decomp; decomp.register_decomposition(torch.ops.aten.conv1d.default, conv1d_to_conv2d)` Pull Request resolved: pytorch#163080 Approved by: https://github.com/angelayi
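A minimal sketch of what such a decomposition could look like (a hypothetical `conv1d_to_conv2d`, not necessarily the version merged in this PR): lift the input to 4-D with a dummy height axis, reuse conv2d, then squeeze the axis back out. It could then be registered with the `register_decomposition` call quoted above.

```python
import torch
import torch.nn.functional as F

def conv1d_to_conv2d(input, weight, bias=None, stride=(1,), padding=(0,), dilation=(1,), groups=1):
    # Hypothetical decomposition sketch: NCL -> NCHW with H=1, run conv2d, drop the dummy axis.
    x = input.unsqueeze(2)    # (N, C, L) -> (N, C, 1, L)
    w = weight.unsqueeze(2)   # (C_out, C_in/groups, K) -> (C_out, C_in/groups, 1, K)
    out = F.conv2d(
        x, w, bias,
        stride=(1, stride[0]),
        padding=(0, padding[0]),
        dilation=(1, dilation[0]),
        groups=groups,
    )
    return out.squeeze(2)     # (N, C_out, 1, L_out) -> (N, C_out, L_out)

# Sanity check against the eager op.
x = torch.randn(2, 4, 16)
w = torch.randn(8, 4, 3)
b = torch.randn(8)
torch.testing.assert_close(conv1d_to_conv2d(x, w, b), F.conv1d(x, w, b))
```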
We previously asked users to separate these because we didn't have any way of adding extern C declarations. That is no longer the case, so we don't need this confusing flag anymore.
BC breaking, but that is fine for this API since it doesn't have major users yet. Please just put all your code in `kernel_source` moving forward.
## BC note
The `header_code` parameter has been removed from `torch.cuda._compile_kernel`. Previously, users could pass separate header code that would be prepended to the kernel source. Now, header code must be included directly in the `kernel_source` parameter.
Note this only affects torch.cuda._compile_kernel, which is a private API.
Example:
Before
```python
kernel = compile_kernel(
    kernel_source="__global__ void my_kernel() { ... }",
    kernel_name="my_kernel",
    header_code="#define SCALE 2.0f\n__device__ float scale(float x) { return x * SCALE; }"
)
```
After
```python
kernel_source = """
#define SCALE 2.0f
__device__ float scale(float x) { return x * SCALE; }
__global__ void my_kernel() { ... }
"""
kernel = _compile_kernel(kernel_source, "my_kernel")
```
Pull Request resolved: pytorch#163165
Approved by: https://github.com/janeyx99, https://github.com/albanD
…ytorch#162682) inline_and_install_module export variant is our long term state so it is better to use the new tracer for this. It also uncovered bunch of minor bugs because with inline_and_install_module, the nn_module_stack generation is changed a bit. Differential Revision: [D82478648](https://our.internmc.facebook.com/intern/diff/D82478648) Pull Request resolved: pytorch#162682 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#162557, pytorch#162558, pytorch#162559
Pull Request resolved: pytorch#163155 Approved by: https://github.com/awgu, https://github.com/Skylion007
… function (pytorch#159830) Fixes pytorch#159829 Pull Request resolved: pytorch#159830 Approved by: https://github.com/albanD
…162636) **Summary:** In order to ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure that it can pass the same tests that fully_shard can for training. This test is important as it verifies we can cast a replicated module to a different type after initialization, an important feature for enabling mixed precision. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_to_float64_after_init Pull Request resolved: pytorch#162636 Approved by: https://github.com/mori360 ghstack dependencies: pytorch#162631
Very similar to pytorch#161007 except now for mark_unbacked. Pull Request resolved: pytorch#162652 Approved by: https://github.com/laithsakka
Note to self: I should probably start using ghstack. This is rebased on top of pytorch#163165, so you only need to review this commit pytorch@7387c1b. This test doesn't add any new functionality; it just ensures DLPack conversion is working well. Pull Request resolved: pytorch#163166 Approved by: https://github.com/janeyx99, https://github.com/albanD
Fixes pytorch#135954 Torch Inductor Windows Path Escape Characters Pull Request resolved: pytorch#162761 Approved by: https://github.com/jansel, https://github.com/mlazos
… compiled model check. (pytorch#162951) Following pytorch#162438, this PR generalizes the original CUDA-only check and adds an XPU check. Fixes pytorch#162939, Fixes pytorch#162938, Fixes pytorch#163032, Fixes pytorch#163045 Pull Request resolved: pytorch#162951 Approved by: https://github.com/EikanWang, https://github.com/jansel
Summary:
1. Generalized testing by auto-detecting Cache types and splitting testing by abstract base class
   - Now checks that all Cache types are thread-safe
   - Will fail tests if any new Cache is added and is untested (for example, any cache with non-str key or non-bytes value)
2. All Caches are thread-safe
   - InMemoryCache was the only one not thread-safe, so added a lock for access
   - Realized that to implement MultiCache we should just have this requirement.

Also, OnDiskCache is now a functioning AsyncCache with a default base_dir using Python's tempfile.gettempdir, i.e. OnDiskCache is no longer an abstract cache class.

Test Plan:
```
[nmacchioni@*** / ()]$ buck test fbcode//mode/opt caffe2/test/inductor:pcache
Tests finished: Pass 28. Fail 0. Fatal 0. Skip 0. Build failure 0
[nmacchioni@*** / ()|remote/fbcode/warm_gpu_od_stable...)]$
```

Rollback Plan:

Differential Revision: D82660240

Pull Request resolved: pytorch#163173 Approved by: https://github.com/masnesral
…ytorch#162650) **Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group 2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group 3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager Pull Request resolved: pytorch#162650 Approved by: https://github.com/mori360 ghstack dependencies: pytorch#162631, pytorch#162636
…rch#162992) Differential Revision: [D82478646](https://our.internmc.facebook.com/intern/diff/D82478646) Pull Request resolved: pytorch#162992 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#162557, pytorch#162558, pytorch#162559, pytorch#162682
…ts (pytorch#162993) Differential Revision: [D82478644](https://our.internmc.facebook.com/intern/diff/D82478644) Pull Request resolved: pytorch#162993 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#162557, pytorch#162558, pytorch#162559, pytorch#162682, pytorch#162992
…non-root module (pytorch#162654) **Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward Pull Request resolved: pytorch#162654 Approved by: https://github.com/mori360 ghstack dependencies: pytorch#162631, pytorch#162636, pytorch#162650
1. The dispatch signatures defined in the `core.extern_elementwise` call must match the C signature of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as an IMA or a hang. When matched, most of the time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong). 2. When calling the `core.extern` wrappers from `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such a kernel. Pull Request resolved: pytorch#163152 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#163025
…62927) Summary: I am really skeptical about inductor sizevars creating an empty shape env when not provided with one. I think we should fail there if the graph has dynamic shapes and no shape env is provided; however, I wonder if there are actually use cases that depend on the shape env not being there? Reasoning APIs depend on facts in the shape env and assume some stuff exists for specific symbols. Test Plan: Fixes the bug reported in https://www.internalfb.com/diff/D82337184; creating a simple e2e unit test is not trivial. Rollback Plan: Differential Revision: D82412384 Pull Request resolved: pytorch#162927 Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/jansel
…iple times in a forward pass (pytorch#162656) **Summary:** Verifies that Replicate works correctly when a module is used multiple times in a single forward pass. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_multi_forward_module Pull Request resolved: pytorch#162656 Approved by: https://github.com/mori360 ghstack dependencies: pytorch#162631, pytorch#162636, pytorch#162650, pytorch#162654
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#163590 Approved by: https://github.com/pytorchbot
Pull Request resolved: pytorch#163553 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547
…m_magma.sh (#2651) Fixes #ISSUE_NUMBER --------- Co-authored-by: AMD <[email protected]>
…its for flops and bandwidth (pytorch#162942) In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices and more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators. Testing:
```python
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
# 1070.53056
# 535.26528
# 66.90816
# 1070.53056
# 2141.06112
# 4893.696
# 33454.08
```
Pull Request resolved: pytorch#162942 Approved by: https://github.com/ngimel, https://github.com/albanD
…s.iter.grouping (pytorch#163438) This PR removes the import tricks for `SHARDING_PRIORITIES` and `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They were declared to be removed in PyTorch 2.1 but were not. Before the change, the following works:
```
import torch.utils.data.datapipes.iter.grouping.SHARDING_PRIORITIES
import torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe
```
After the change, there is an import error exception. Pull Request resolved: pytorch#163438 Approved by: https://github.com/janeyx99
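For reference, a hedged sketch of imports that should keep working after the removal, assuming the names remain exposed from their canonical `sharding` module:

```python
# The deprecated aliases under torch.utils.data.datapipes.iter.grouping are gone;
# the names are assumed (not confirmed by this PR) to still be importable from the
# sharding module where they are defined.
from torch.utils.data.datapipes.iter.sharding import (
    SHARDING_PRIORITIES,
    ShardingFilterIterDataPipe,
)
```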
Cache the result of `has_efa` with `functools.cache`. Pull Request resolved: pytorch#163439 Approved by: https://github.com/janeyx99
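A minimal sketch of the pattern (hypothetical body; the real `has_efa` helper probes the host for EFA network interfaces in its own way):

```python
import functools
import glob

@functools.cache  # compute once per process, reuse the cached result afterwards
def has_efa() -> bool:
    # Hypothetical check: look for EFA devices under /sys (an assumption for
    # illustration, not the actual implementation in the PR).
    return bool(glob.glob("/sys/class/infiniband/*efa*"))
```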
This reverts commit 509c4e8. Reverted pytorch#163091 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#163091 (comment)))
Pull Request resolved: pytorch#163554 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553
Pull Request resolved: pytorch#163555 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553, pytorch#163554
Pull Request resolved: pytorch#163556 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553, pytorch#163554, pytorch#163555
Pull Request resolved: pytorch#163557 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553, pytorch#163554, pytorch#163555, pytorch#163556
Summary: Update Test Plan: CI Rollback Plan: Differential Revision: D81727392 Pull Request resolved: pytorch#162222 Approved by: https://github.com/sanrise
…163235) ## Why this PR? I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4). Since I installed my `gcc` through `apt install build-essential`, which I think is a common way to install `gcc` for quite a few developers, I believe it's necessary to slightly modify the command to add `-I ./` to explicitly indicate the header file search path. ## What I've changed? - I added `-I ./` to correctly search for `./include/openreg.h`. - I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`. Pull Request resolved: pytorch#163235 Approved by: https://github.com/FFFrog, https://github.com/albanD Co-authored-by: Jiawei Li <[email protected]>
Pull Request resolved: pytorch#163558 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163547, pytorch#163553, pytorch#163554, pytorch#163555, pytorch#163556, pytorch#163557
…rch#163488) Differential Revision: D82933509

Over the weekend I realized that some of the cache implementation was a bit silly, and too constrained to be actually generic. For example, InMemoryCache[str, bytes] was odd since we'd probably want to be able to store more than just str keys with bytes values.

So, tl;dr: everything is now generic, with the one constraint being that Key and Value must both be pickle-able types. This makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.

With this change, there were also some improvements made to the testing; mainly better coverage, but now we also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later.

I also hardened some things here and there; for example, we now use literal_eval (forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible).

Putting this PR up now for feedback. In the process of generalizing the code I did remove the documentation since it was becoming outdated, but I will add that back in after the PR is green. I have the next PR ready as well (it implements a fresh-cache context manager); will export once this lands.

Pull Request resolved: pytorch#163488 Approved by: https://github.com/aorenste, https://github.com/masnesral
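A minimal, hypothetical sketch of the idea described above (not the actual `torch._inductor` cache classes): a generic, thread-safe cache whose keys and values only need to be pickle-able, stored internally as bytes behind a lock.

```python
import pickle
import threading
from typing import Generic, Optional, TypeVar

Key = TypeVar("Key")
Value = TypeVar("Value")

class InMemoryCache(Generic[Key, Value]):
    """Thread-safe in-memory cache; keys/values are pickled on the way in and out.

    Assumes keys pickle deterministically (true for simple tuples of str/int used here).
    """

    def __init__(self) -> None:
        self._store: dict[bytes, bytes] = {}
        self._lock = threading.Lock()

    def get(self, key: Key) -> Optional[Value]:
        k = pickle.dumps(key)
        with self._lock:
            blob = self._store.get(k)
        return None if blob is None else pickle.loads(blob)

    def insert(self, key: Key, value: Value) -> None:
        with self._lock:
            self._store[pickle.dumps(key)] = pickle.dumps(value)

# Usage: any pickle-able key/value types work.
cache: InMemoryCache[tuple[str, int], dict] = InMemoryCache()
cache.insert(("kernel", 3), {"autotune": True})
print(cache.get(("kernel", 3)))
```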
Because it required CUDA >= 12. Pull Request resolved: pytorch#163495 Approved by: https://github.com/janeyx99
Please see the build script: https://github.com/pytorch/pytorch/blob/8da008678fcb95dbf55a33451136a242871ae4e2/.ci/manywheel/build_cuda.sh#L69-L71 This should display the correct warning: `` Please install PyTorch with a following CUDA configurations: 12.6 12.8 13.0 following instructions at https://pytorch.org/get-started/locally/ `` Pull Request resolved: pytorch#163585 Approved by: https://github.com/malfet
…163319) FIXES pytorch#163286 Pull Request resolved: pytorch#163319 Approved by: https://github.com/eellison
Fixes part of pytorch#163314 In particular: **Bug 1: H=None Broadcasting Produces Incorrect Results** This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Even though the mask_mod remains "interpretable", the kernel's stride math then reads wrong offsets. As a result we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding/H broadcasting. The B=None, H=None working case is accidental: with a singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides, which causes silent errors. Pull Request resolved: pytorch#163426 Approved by: https://github.com/drisspg
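The shape collapse itself is ordinary tensor-indexing behavior, illustrated here on a plain tensor standing in for the block-mask metadata (hypothetical sizes, not the BlockMask API):

```python
import torch

# Stand-in for kv_num_blocks with shape [B, H, Q_tiles] (hypothetical sizes).
kv_num_blocks = torch.zeros(2, 4, 8)

print(kv_num_blocks[:, :, 3].shape)    # torch.Size([2, 4])    -> Q-tile axis collapsed
print(kv_num_blocks[:, :, 3:4].shape)  # torch.Size([2, 4, 1]) -> Q-tile axis preserved
```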
This reverts commit 27164b6. Reverted pytorch#163167 on behalf of https://github.com/malfet due to This broke in inductor-cpu-test, see https://hud.pytorch.org/hud/pytorch/pytorch/1a42656d6c43a9bb7eb90c511884ce451d29422f/1?per_page=50&name_filter=inductor-cpu-test&mergeEphemeralLF=true ([comment](pytorch#163167 (comment)))
…sting_IFU_2025-09-23 # Conflicts: # .ci/aarch64_linux/aarch64_ci_build.sh # .ci/aarch64_linux/aarch64_wheel_ci_build.py # .ci/docker/build.sh # .ci/docker/ci_commit_pins/huggingface-requirements.txt # .ci/docker/ci_commit_pins/triton.txt # .ci/docker/common/install_rocm.sh # .ci/docker/requirements-ci.txt # .ci/docker/requirements-docs.txt # .ci/libtorch/build.sh # .ci/lumen_cli/cli/lib/core/vllm/lib.py # .ci/lumen_cli/cli/lib/core/vllm/vllm_build.py # .ci/lumen_cli/cli/lib/core/vllm/vllm_test.py # .ci/wheel/build_wheel.sh # .github/ci_commit_pins/audio.txt # .github/ci_commit_pins/vllm.txt # .github/ci_commit_pins/xla.txt # .github/ci_configs/vllm/Dockerfile.tmp_vllm # .github/scripts/generate_binary_build_matrix.py # .github/templates/macos_binary_build_workflow.yml.j2 # .github/workflows/build-vllm-wheel.yml # .github/workflows/docker-builds.yml # .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml # .github/workflows/generated-linux-binary-manywheel-main.yml # .github/workflows/generated-linux-binary-manywheel-nightly.yml # .github/workflows/generated-linux-binary-manywheel-rocm-main.yml # .github/workflows/generated-macos-arm64-binary-libtorch-release-nightly.yml # .github/workflows/generated-macos-arm64-binary-wheel-nightly.yml # .github/workflows/inductor-nightly.yml # .github/workflows/inductor-perf-test-nightly-x86-zen.yml # .github/workflows/inductor-perf-test-nightly-x86.yml # .github/workflows/inductor-periodic.yml # .github/workflows/inductor-unittest.yml # .github/workflows/inductor.yml # .github/workflows/operator_benchmark.yml # .github/workflows/pull.yml # .github/workflows/trunk.yml # .github/workflows/vllm.yml # aten/src/ATen/CMakeLists.txt # aten/src/ATen/DLConvertor.cpp # aten/src/ATen/cuda/CUDABlas.cpp # aten/src/ATen/native/CPUBlas.cpp # aten/src/ATen/native/LinearAlgebra.cpp # aten/src/ATen/native/Normalization.cpp # aten/src/ATen/native/cuda/Blas.cpp # aten/src/ATen/native/cuda/int8mm.cu # aten/src/ATen/native/cudnn/MHA.cpp # aten/src/ATen/native/miopen/BatchNorm_miopen.cpp # aten/src/ATen/native/miopen/Conv_miopen.cpp # aten/src/ATen/native/mps/operations/GridSampler.mm # aten/src/ATen/native/native_functions.yaml # aten/src/ATen/native/sparse/mps/SparseMPSTensorMath.mm # aten/src/ATen/native/transformers/hip/flash_attn/flash_api.h # benchmarks/dynamo/ci_expected_accuracy/aot_eager_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/aot_eager_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/cpu_inductor_amp_freezing_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/cpu_inductor_freezing_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/cpu_inductor_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_aot_eager_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_aot_eager_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_cpu_inductor_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_inductor_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamic_inductor_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/dynamo_eager_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/dynamo_eager_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_training.csv # benchmarks/dynamo/ci_expected_accuracy/rocm/aot_eager_torchbench_inference.csv # 
benchmarks/dynamo/ci_expected_accuracy/rocm/dynamic_aot_eager_torchbench_inference.csv # benchmarks/dynamo/ci_expected_accuracy/rocm/dynamo_eager_torchbench_inference.csv # benchmarks/dynamo/pr_time_benchmarks/expected_results.csv # benchmarks/operator_benchmark/benchmark_core.py # build_variables.bzl # c10/cuda/CUDAFunctions.cpp # cmake/Codegen.cmake # cmake/External/aotriton.cmake # docs/source/accelerator/index.md # docs/source/accelerator/operators.md # functorch/dim/__init__.py # functorch/dim/wrap_type.py # requirements-build.txt # requirements.txt # test/cpp/nativert/CMakeLists.txt # test/cpp/nativert/test_triton_kernel_manager_registration.cpp # test/cpp_extensions/libtorch_agnostic_extension/libtorch_agnostic/csrc/kernel.cpp # test/cpp_extensions/libtorch_agnostic_extension/libtorch_agnostic/ops.py # test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py # test/cpp_extensions/open_registration_extension/torch_openreg/README.md # test/cpp_extensions/open_registration_extension/torch_openreg/setup.py # test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg/README.md # test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg/example/example.cpp # test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg/__init__.py # test/distributed/_composable/fsdp/test_fully_shard_training.py # test/distributed/_composable/test_composability/test_2d_composability.py # test/distributed/fsdp/test_fsdp_comm_hooks.py # test/distributed/tensor/parallel/test_tp_examples.py # test/distributed/tensor/test_attention.py # test/distributed/tensor/test_dtensor_compile.py # test/distributed/tensor/test_dtensor_ops.py # test/distributed/tensor/test_op_schema.py # test/distributed/test_inductor_collectives.py # test/distributed/test_nvshmem.py # test/distributed/test_nvshmem_triton.py # test/distributed/test_symmetric_memory.py # test/dynamo/test_activation_checkpointing.py # test/dynamo/test_aot_compile.py # test/dynamo/test_callback.py # test/dynamo/test_error_messages.py # test/dynamo/test_guard_serialization.py # test/dynamo/test_misc.py # test/dynamo/test_package.py # test/dynamo/test_structured_trace.py # test/export/test_export.py # test/export/test_export_opinfo.py # test/export/test_passes.py # test/export/test_serialize.py # test/functorch/test_control_flow.py # test/inductor/test_aot_inductor.py # test/inductor/test_aot_inductor_package.py # test/inductor/test_flex_attention.py # test/inductor/test_fxir_backend.py # test/inductor/test_loop_ordering.py # test/inductor/test_max_autotune.py # test/inductor/test_torchinductor.py # test/nn/test_convolution.py # test/nn/test_pooling.py # test/run_test.py # test/slow_tests.json # test/test_binary_ufuncs.py # test/test_dynamic_shapes.py # test/test_matmul_cuda.py # test/test_nestedtensor.py # test/test_nn.py # test/test_openreg.py # third_party/xpu.txt # tools/flight_recorder/components/config_manager.py # tools/pyi/gen_pyi.py # torch/_C/_dynamo/guards.pyi # torch/_dynamo/aot_compile.py # torch/_dynamo/convert_frame.py # torch/_dynamo/functional_export.py # torch/_dynamo/graph_break_registry.json # torch/_dynamo/guards.py # torch/_dynamo/output_graph.py # torch/_dynamo/package.py # torch/_dynamo/symbolic_convert.py # torch/_dynamo/variables/higher_order_ops.py # torch/_dynamo/variables/lists.py # torch/_dynamo/variables/optimizer.py # torch/_export/serde/serialize.py # torch/_export/wrappers.py # torch/_functorch/_aot_autograd/autograd_cache.py # 
torch/_higher_order_ops/__init__.py # torch/_higher_order_ops/associative_scan.py # torch/_higher_order_ops/flex_attention.py # torch/_higher_order_ops/triton_kernel_wrap.py # torch/_inductor/choices.py # torch/_inductor/codegen/cpp.py # torch/_inductor/codegen/cpp_micro_gemm.py # torch/_inductor/codegen/cpp_wrapper_cpu.py # torch/_inductor/codegen/triton.py # torch/_inductor/codegen/wrapper_fxir.py # torch/_inductor/config.py # torch/_inductor/cpp_builder.py # torch/_inductor/decomposition.py # torch/_inductor/kernel/bmm.py # torch/_inductor/kernel/flex/flex_attention.py # torch/_inductor/kernel/flex/templates/flex_attention.py.jinja # torch/_inductor/kernel/flex/templates/flex_backwards.py.jinja # torch/_inductor/kernel/flex/templates/flex_decode.py.jinja # torch/_inductor/kernel/flex/templates/utilities.py.jinja # torch/_inductor/kernel/mm.py # torch/_inductor/kernel/mm_plus_mm.py # torch/_inductor/kernel_template_choice.py # torch/_inductor/memory.py # torch/_inductor/runtime/triton_heuristics.py # torch/_inductor/scheduler.py # torch/_inductor/select_algorithm.py # torch/_inductor/template_heuristics/base.py # torch/_inductor/template_heuristics/triton.py # torch/_inductor/utils.py # torch/_meta_registrations.py # torch/_prims_common/__init__.py # torch/csrc/Module.cpp # torch/csrc/autograd/python_variable.cpp # torch/csrc/autograd/python_variable_indexing.cpp # torch/csrc/distributed/c10d/FlightRecorder.cpp # torch/csrc/distributed/c10d/ProcessGroupGloo.hpp # torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu # torch/csrc/inductor/aoti_runtime/utils.h # torch/csrc/stable/accelerator.h # torch/csrc/stable/ops.h # torch/csrc/utils/generated_serialization_types.h # torch/csrc/utils/tensor_numpy.cpp # torch/distributed/_symmetric_memory/_nvshmem_triton.py # torch/distributed/device_mesh.py # torch/distributed/pipelining/_schedule_visualizer.py # torch/distributed/tensor/_api.py # torch/distributed/tensor/_dispatch.py # torch/distributed/tensor/_op_schema.py # torch/distributed/tensor/_random.py # torch/distributed/tensor/_sharding_prop.py # torch/export/_trace.py # torch/export/_unlift.py # torch/export/exported_program.py # torch/fx/experimental/proxy_tensor.py # torch/nativert/executor/triton/CpuTritonKernelManager.cpp # torch/nativert/executor/triton/CudaTritonKernelManager.cpp # torch/nativert/executor/triton/TritonKernelManager.h # torch/nativert/kernels/KernelHandlerRegistry.cpp # torch/nativert/kernels/TritonKernel.cpp # torch/nested/_internal/ops.py # torch/onnx/__init__.py # torch/overrides.py # torch/testing/_internal/common_cuda.py # torch/testing/_internal/common_distributed.py # torch/testing/_internal/common_quantization.py # torch/testing/_internal/common_utils.py # torch/testing/_internal/distributed/_tensor/common_dtensor.py # torch/testing/_internal/distributed/fake_pg.py # torch/testing/_internal/hop_db.py # torch/utils/_python_dispatch.py # torch/utils/data/datapipes/iter/combinatorics.py
Jenkins build for a10a30e34c9f2acc660cf1494ef0908205c67e30 commit finished as FAILURE
a7dc2b0 to 1c57644
We will do a fresh IFU since we had to update the previous IFU PR: #2677
rocm_base: 7ea3967