[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-09 #2625
Conversation
…lt (pytorch#159889)" This reverts commit 4ae57d4. Reverted pytorch#159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](pytorch#159889 (comment)))
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native-target optimizations, i.e. it fails with `-march=znver2` but succeeds with `-march=znver1`. I assume some operator fusing is being done by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb

The greatest differences are consistent and the same on both CPU architectures:

```
Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed)
```

Hence I assume this is within the expected tolerances, especially as `complex128` and all other types pass.

Pull Request resolved: pytorch#152424
Approved by: https://github.com/malfet
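The tolerance check behind these messages follows the usual allclose-style rule `|actual - expected| <= atol + rtol * |expected|`. A minimal plain-Python sketch, with the reference magnitude back-derived from the two reported differences (an approximation, not a value taken from the test):

```python
# Sketch of the allclose-style tolerance rule behind the reported failure.
# The expected-value magnitude is back-derived from the reported absolute
# and relative differences, so it is approximate.
def within_tolerance(abs_diff, expected_magnitude, rtol, atol):
    return abs_diff <= atol + rtol * expected_magnitude

abs_diff = 3.43852152582258e-05
rel_diff = 3.6034286949870875e-06
expected = abs_diff / rel_diff  # roughly 9.54, the magnitude at index (1, 2, 1)

# The default tolerances quoted above (atol=1e-05, rtol=1.3e-06) reject it...
default_ok = within_tolerance(abs_diff, expected, rtol=1.3e-06, atol=1e-05)
# ...while a modestly relaxed atol accepts it.
relaxed_ok = within_tolerance(abs_diff, expected, rtol=1.3e-06, atol=5e-05)
print(default_ok, relaxed_ok)  # False True
```

This is why relaxing the per-dtype tolerance for this test makes the small compiler-dependent differences pass.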
This reverts commit 90b0864. Reverted pytorch#160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](pytorch#160449 (comment)))
Many users want a config to force all CUDA ops to be captured by cudagraphs; when that is not possible, PT2 should error. This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False). Also added an environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` to control it.

Pull Request resolved: pytorch#161862
Approved by: https://github.com/ezyang, https://github.com/mlazos
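The toggle described above pairs a default-False config flag with an environment-variable override. A hedged sketch of that shape (the exact parsing Inductor uses may differ):

```python
import os

# Sketch: an env var overriding a default-False config flag, as described
# for TORCHINDUCTOR_CUDAGRAPH_OR_ERROR. Illustrative only; Inductor's real
# config plumbing may parse the value differently.
def cudagraph_or_error_enabled(env=os.environ, default=False):
    val = env.get("TORCHINDUCTOR_CUDAGRAPH_OR_ERROR")
    if val is None:
        return default
    return val == "1"

print(cudagraph_or_error_enabled({}))                                      # False
print(cudagraph_or_error_enabled({"TORCHINDUCTOR_CUDAGRAPH_OR_ERROR": "1"}))  # True
```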
…ytorch#162044)" This reverts commit cd529b6. Reverted pytorch#162044 on behalf of https://github.com/jeffdaily due to mi200 backlog is purged, and mi300 runners are failing in GHA download ([comment](pytorch#162044 (comment)))
# Motivation https://github.com/pytorch/pytorch/pull/143553/files#diff-6492991193449e118ff0c8d42ca544cc38a73604e505ff246a3c711aeab91748R1345 makes `largeTensorTest` malfunction on XPU. This PR aims to fix it. Pull Request resolved: pytorch#161988 Approved by: https://github.com/EikanWang, https://github.com/albanD
…h#161907) `CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites it with a single path, causing dependencies such as protobuf or Abseil to not be found. Instead, prepend the path to the existing value.

This fixes a test failure:

> pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package
> self.assertTrue(so_path.exists())
> AssertionError: False is not true

Caused by:

```
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory
collect2: error: ld returned 1 exit status
```

Pull Request resolved: pytorch#161907
Approved by: https://github.com/Skylion007
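The prepend-instead-of-overwrite idea can be sketched in a few lines. This is a hypothetical helper, not the test's actual code; it assumes the platform list separator (`os.pathsep`, `:` on Unix) is what CMake expects for the environment variable:

```python
import os

# Hypothetical helper illustrating the fix: prepend to CMAKE_PREFIX_PATH
# rather than overwrite it, so existing entries (protobuf, Abseil, ...)
# stay visible to CMake.
def prepend_cmake_prefix_path(env, new_path):
    env = dict(env)  # do not mutate the caller's mapping
    existing = env.get("CMAKE_PREFIX_PATH", "")
    if existing:
        env["CMAKE_PREFIX_PATH"] = new_path + os.pathsep + existing
    else:
        env["CMAKE_PREFIX_PATH"] = new_path
    return env

env = prepend_cmake_prefix_path({"CMAKE_PREFIX_PATH": "/opt/absl"}, "/tmp/pkg")
print(env["CMAKE_PREFIX_PATH"].split(os.pathsep))  # ['/tmp/pkg', '/opt/absl']
```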
I found a number of places that seem to want forwarding references but the type signature does not reflect that Pull Request resolved: pytorch#161094 Approved by: https://github.com/malfet
Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#162164 Approved by: https://github.com/bdhirsh, https://github.com/albanD, https://github.com/wconstab
Fixes pytorch#161868 Pull Request resolved: pytorch#162106 Approved by: https://github.com/jansel, https://github.com/zou3519
…ch#158747) This is part of our effort to integrate the Composable Kernel library for the Inductor backend. Currently we have a submodule, but we would prefer to have commit-pin control over the library, as with Triton. We intentionally avoid putting all installation logic in CI scripts, to allow locally built versions to have this functionality. The idea is to have CK as a PyTorch dependency in the PyTorch 2.9 release, to allow people to use it with Inductor and AOT Inductor, and then gradually step away from submodule usage. Right now CK usage in SDPA/Gemm is tied to submodule files. This PR is a remake of pytorch#156192 due to a branch error.

Pull Request resolved: pytorch#158747
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Jack Taylor <[email protected]>
Co-authored-by: Max Podkorytov <[email protected]>
Co-authored-by: Copilot <[email protected]>
[PEP 735](https://peps.python.org/pep-0735) introduces the `[dependency-groups]` table for a number of use cases, one of which is specifying development dependencies for projects.

Pull Request resolved: pytorch#161216
Approved by: https://github.com/seemethere
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](intel/torch-xpu-ops@83c5a5a), includes: - Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed - Fallback lu_factor kernel to CPU for single batch - Enable aten::linalg_inv and aten::linalg_inv_ex on XPU Pull Request resolved: pytorch#162062 Approved by: https://github.com/EikanWang
) This PR implements the semantics change to `torch._dynamo.error_on_graph_break`:

- ~~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~~
- `error_on_graph_break` is a new internal `torch.compile` setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks.
- `error_on_graph_break` does nothing when `fullgraph=True`
- `error_on_graph_break` does NOT guarantee a single graph

Followup [DONE]: need to change the programming model docs to reflect the 3 graph-break modes for compilation:

- `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled
- `fullgraph=False, error_on_graph_break=True`: errors on graph breaks; the latter can be toggled during compile time
- `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks; the latter can be toggled during compile time

Pull Request resolved: pytorch#161747
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#161739
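The three graph-break modes can be restated as a small decision table. This is an illustrative plain-Python sketch only; the real logic lives inside dynamo's config handling:

```python
# Decision table for the graph-break modes described above (sketch only).
def on_graph_break(fullgraph, error_on_graph_break):
    """Return what compilation does when it hits a graph break."""
    if fullgraph:
        return "error"  # fullgraph=True always errors; the toggle is ignored
    return "error" if error_on_graph_break else "resume"

print(on_graph_break(True, False))    # error
print(on_graph_break(False, True))    # error
print(on_graph_break(False, False))   # resume
```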
…he CUDACachingAllocator (pytorch#158352)

## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG of work (we call it the **capturing graph**). We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe: this rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.

## Implementation

* On `free(block)` during capture:
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim):
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and whether each of its free markers is a predecessor of the terminal.
  * If yes, hand the block to S for immediate reuse within the same capture.
  * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end:
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling).

External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe: The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1: Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`: This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds): In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join: See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](pytorch#158352 (review)). In case 4, we have a missed opportunity: because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. We must therefore wait for the subsequent join before the block can be reused.

## Thanks

Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: pytorch#158352
Approved by: https://github.com/ngimel, https://github.com/eqy
Co-authored-by: Jeff Daily <[email protected]>
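The per-stream rule is just an ancestor check on the capturing graph. A plain-Python sketch on a toy DAG (dicts mapping node to its parents; node names are hypothetical, not real CUDA graph handles):

```python
# Sketch of the per-stream rule: a freed block is reusable on stream S iff
# every free marker is a predecessor of every node in S's terminal set.
def is_predecessor(parents, candidate, node):
    """True if `candidate` is reachable from `node` by walking parent edges."""
    stack, seen = [node], set()
    while stack:
        n = stack.pop()
        for p in parents.get(n, ()):
            if p == candidate:
                return True
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False

def reusable_on_stream(parents, free_markers, terminals):
    return all(is_predecessor(parents, m, t)
               for m in free_markers for t in terminals)

# Case 1-like shape: stream 1's terminal comes after both frees -> reusable.
parents1 = {"f1": {"use1"}, "f2": {"use2"}, "t1": {"f1", "f2"}}
print(reusable_on_stream(parents1, {"f1", "f2"}, {"t1"}))  # True

# Case 0-like shape: t1 only orders after f1; f2 is unordered -> not reusable.
parents0 = {"f1": {"use1"}, "f2": {"use2"}, "t1": {"f1"}}
print(reusable_on_stream(parents0, {"f1", "f2"}, {"t1"}))  # False
```

The real allocator performs the equivalent check against the nodes returned by `cudaStreamGetCaptureInfo`, but the decision structure is the same.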
…orch#161984) Added a helper API to tell whether the world is entirely within a P2P domain or crosses the network. This is mainly for nblocks tuning purposes (in later PRs).

Pull Request resolved: pytorch#161984
Approved by: https://github.com/ngimel
ghstack dependencies: pytorch#161983
so that the signal calls do not step on each other's feet.

Pull Request resolved: pytorch#162026
Approved by: https://github.com/ngimel
…161407) Summary: Creates a fallback path for `torch._grouped_mm`, using the naive for-loop implementation (or bmm). For the sake of keeping the PR small, this PR only enables SM80+ (CUDA capability 8.0 and up), since I am testing this on an A100 machine. In future PRs, we can increase the coverage of the fallback to:

1. float32 and float16, which will extend the GPU coverage
2. cpu

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```

Pull Request resolved: pytorch#161407
Approved by: https://github.com/drisspg, https://github.com/eqy
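The "naive for loop" fallback idea is simply one plain matmul per group. A hedged sketch in pure Python over nested lists (the real fallback operates on tensors and may route through bmm instead):

```python
# Sketch of a naive grouped matmul: loop over groups, one matmul each.
# Groups may have different shapes, which is what grouped GEMM allows.
def naive_grouped_mm(a_groups, b_groups):
    out = []
    for A, B in zip(a_groups, b_groups):
        K, N = len(B), len(B[0])
        C = [[sum(row[k] * B[k][n] for k in range(K)) for n in range(N)]
             for row in A]
        out.append(C)
    return out

a = [[[1, 2], [3, 4]], [[1, 0, 2]]]        # 2x2 and 1x3
b = [[[1, 0], [0, 1]], [[1], [1], [1]]]    # 2x2 identity and 3x1
print(naive_grouped_mm(a, b))  # [[[1, 2], [3, 4]], [[3]]]
```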
…61717) Summary: Moves the `torch._grouped_mm` fallback from CUDA-only code to a place where it can be used by multiple backends. Specifically:

1. make the fallback path and util functions reusable and move them to `ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to the composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU

At the end of this PR, here is the support matrix:

* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to future PRs

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```

Pull Request resolved: pytorch#161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: pytorch#161407
…62059) Summary: Enables `torch.float32` and `torch.float16` options in `torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`, `mat_b`, and `out_dtype` are `torch.bfloat16`. Saving for future PRs:

1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support

Test Plan:

```bash
# on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
# on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Pull Request resolved: pytorch#162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: pytorch#161407, pytorch#161717
I don't have a failing test case, but I just saw an extra guard somewhere.

Pull Request resolved: pytorch#162105
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
…pytorch#161688) Fixes pytorch#161080. torch.export.export fails with `TypeError: expand() got an unexpected keyword argument 'implicit'` when calling `torch.expand_copy(..., implicit=True)`. This happened because `expand_copy = _make_copy_from_view(aten.expand)` registers `aten.expand` as the decomposition path for `aten.expand_copy`, which doesn’t accept the `implicit` argument. I have added an explicit decomposition for `aten.expand_copy` in torch/_decomp/decompositions.py that ignores the `implicit` argument, and a simple unit test to demonstrate the bug being fixed.

Pull Request resolved: pytorch#161688
Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou
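The shape of the fix can be sketched without torch: a decomposition wrapper that accepts the legacy `implicit` kwarg for schema compatibility but never forwards it, since the underlying expand has no such parameter. `fake_expand` below is a toy stand-in for `aten.expand`; the real decomposition lives in torch/_decomp/decompositions.py:

```python
# Sketch: wrap an expand-like function so the decomposition tolerates the
# legacy `implicit` kwarg instead of forwarding it.
def make_expand_copy_decomp(expand_fn):
    def expand_copy(self, size, *, implicit=False):
        del implicit  # accepted for schema compatibility, intentionally ignored
        return expand_fn(self, size)
    return expand_copy

def fake_expand(x, size):
    # Toy stand-in for aten.expand: broadcast a single row to size[0] rows.
    return [list(x[0]) for _ in range(size[0])]

expand_copy = make_expand_copy_decomp(fake_expand)
print(expand_copy([[1, 2]], (3, 2), implicit=True))  # [[1, 2], [1, 2], [1, 2]]
```

Without the wrapper, passing `implicit=True` straight to `fake_expand` would raise the same kind of `TypeError` the issue reports.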
…ch#162073) for 2.9 🙏 Pull Request resolved: pytorch#162073 Approved by: https://github.com/drisspg
pytorch#161951) …h.is_complex. The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values. Fixes pytorch#161859 Pull Request resolved: pytorch#161951 Approved by: https://github.com/zou3519
…orch#161355) Pull Request resolved: pytorch#161355 Approved by: https://github.com/zou3519
Update cpp-httplib with better error handling, bugfixes, and performance. Header only library update. Pull Request resolved: pytorch#162181 Approved by: https://github.com/jansel
Summary: att Test Plan: ci Rollback Plan: Reviewed By: minjang Differential Revision: D80828148 Pull Request resolved: pytorch#161798 Approved by: https://github.com/minjang, https://github.com/SherlockNoMad
Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
This reverts commit 2c03f0a. Reverted pytorch#162007 on behalf of https://github.com/jeanschmidt due to Breaks internal builds see [D81588372](https://www.internalfb.com/diff/D81588372), @malfet may you help the author? ([comment](pytorch#162007 (comment)))
This reverts commit b40d943. Reverted pytorch#162001 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](pytorch#161999 (comment)))
Jenkins build for ab5575833f1eb9066df192dd91d9d7bd43385f65 commit finished as FAILURE

Jenkins build for 60644390d3e6c3da6228427bb14c6b759011f97a commit finished as FAILURE

Jenkins build for 304889c9da6276844081450426d1722846961f6c commit finished as FAILURE
pruthvistony
left a comment
Rubber-stamping the PR. Build is successful and conflicts have been resolved.
This reverts commit 69a25f6.
Jenkins build for 9a66f82081053fff8105de82cd5f4593c393caf1 commit finished as FAILURE
Do not merge yet; the vllm team is testing with this branch, and if perf looks good then we will merge it. Otherwise, wait till 9/15.
```
@@ -1,5 +1 @@
<<<<<<< HEAD
56765e8c1f6490e21312b46242ed78cb2dd46d35
```
It will be updated to new branch
Yes, this will be handled separately
```
if ((dims > 0) && (dims <= 2)) {
  auto divmod = sizes_[0].divmod(linear_idx);
<<<<<<< HEAD
#pragma unroll
```
If we merge this line, then at the next IFU it will not show as a conflict.
Sorry, I don't follow. I chose the upstream change so that next time we do an IFU it won't show as a merge conflict. If I choose HEAD, then we are picking local changes, which may show as a merge conflict.
```
return HIP_R_4F_E2M1;
#else
<<<<<<< HEAD
// Return HIP_R_4F_E2M1 enum value for earlier ROCm version.
```
Same comment as above. In the current scenario it is better to merge the <<<HEAD block, to avoid conflicts at the next IFU.
aten/src/ATen/native/Convolution.cpp
```
case ConvBackend::Miopen:
case ConvBackend::MiopenDepthwise:
case ConvBackend::MiopenTranspose:
<<<<<<< HEAD
```
Matching upstream, it is fine.
aten/src/ATen/native/ConvUtils.h
```
}

// TODO: Remove PYTORCH_MIOPEN_SUGGEST_NHWC once ROCm officially supports NHWC in MIOpen
<<<<<<< HEAD
```
Matching upstream, it is fine.
pruthvistony
left a comment
On NHWC batchnorm, we need Dmitry's confirmation. Other conflicts are resolved properly, AFAIK.
jithunnair-amd
left a comment
LGTM. Hope other IFUs get shorter, both in time and diffs :)
Jenkins build for 9e7df766290def1ac0112fc758a6fa1ea126e95a commit finished as FAILURE

Detected error during base docker image building:
rocm_base: 681e60e
Tested this PR on MI300x using
`registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16623_ubuntu24.04_py3.12_pytorch_rocm7.1_internal_testing_681e60e1`

Ran the following UTs:
test_nn, test_torch, test_cuda, test_ops, test_unary_ufuncs, test_autograd, inductor/test_torchinductor
All ran fine, attaching logs!
default_ut.log
Successful wheel build job with this branch: http://rocm-ci.amd.com/view/preview/job/pytorch2.8-manylinux-wheels-preview/116/