
Conversation


@pragupta pragupta commented Nov 3, 2025

rocm_base: 777e73c

jeffdaily and others added 30 commits October 30, 2025 01:08
Summary:
This adds a waitcounter for whether or not the pool is running, as well as whether we are running jobs.

This also adds waitcounters for the first job within a pool. The first-job and running counters are working correctly. The job waitcounter seems to either be detecting a leaked job or is subtly broken.

Test Plan:
We've tested this internally and see valid ods metrics.

Note that we may be leaking jobs, or the job logic may not be handling an exception correctly.

Differential Revision: D83705931

Pull Request resolved: pytorch#164527
Approved by: https://github.com/masnesral
# why

- enable users to control which choices get used on which inputs
- reduce lowering time, and pin kernel selection, by selecting
  them for the inputs

# what

- a new InductorChoices subclass that implements a lookup table
- a README explaining the usage
- corresponding testing

- currently only supports templates that go through
  `V.choices.get_template_configs`

# testing

```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```

Differential Revision: [D85685743](https://our.internmc.facebook.com/intern/diff/D85685743)
Pull Request resolved: pytorch#164978
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Bumps [uv](https://github.com/astral-sh/uv) from 0.9.5 to 0.9.6.
- [Release notes](https://github.com/astral-sh/uv/releases)
- [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md)
- [Commits](astral-sh/uv@0.9.5...0.9.6)

---
updated-dependencies:
- dependency-name: uv
  dependency-version: 0.9.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
)

Closes pytorch#164529

To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.

This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.

For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)

Pull Request resolved: pytorch#164518
Approved by: https://github.com/kwen2501
* We are separating out the rocm jobs of the periodic workflow
* We are introducing a new label `ciflow/periodic-rocm-mi200` to allow us to run distributed tests only on ROCm runners, without triggering many other jobs on the `periodic.yml` workflow (via `ciflow/periodic`)
* This new workflow will also be triggered via the `ciflow/periodic`, thus maintaining the old status quo.
* We are reverting to the `linux.rocm.gpu.4` label since it targets a lot more CI nodes at this point than the K8s/ARC-based `linux.rocm.gpu.mi250.4` label, as that is still having some network/scaling issues.

Pull Request resolved: pytorch#166544
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
This PR fixes a syntax error in test_indexing.py caused by a misplaced `if else` expression.

Pull Request resolved: pytorch#166390
Approved by: https://github.com/jerryzh168
…ytorch#166384)

This PR reuses native_mm and mix_order_reduction for Intel GPU and enables the corresponding test.
Fixes pytorch#165370

Pull Request resolved: pytorch#166384
Approved by: https://github.com/jansel
**Summary**
This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.

**Benchmarking**

To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.

Settings:

- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs

|        | Variable Length API | SDPA     |
|--------|--------------------|----------|
| Runtime | 0.8189142608642578 ms       | 3.263883056640625 ms  |
| TFLOPs | 268.652       | 158.731  |

We can see that Varlen's runtime is roughly 4x faster (0.82 ms vs 3.26 ms).

**Testing**

Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA.

For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify.

Pull Request resolved: pytorch#164504
Approved by: https://github.com/drisspg
Summary:
While digging through matmul padding for other work, I noticed that the compute-bound check won't work on MI350 since we haven't supplied the tech specs yet.

I added the MI350 specs following the predefined format.

Test Plan: CI

Differential Revision: D85804980

Pull Request resolved: pytorch#166576
Approved by: https://github.com/leitian
Previously, we would stash a single stream value constructed at trace time in a global and return that same value from repeated calls to the graph.

With this PR, we construct the stream value in advance, reference the constructed value in the graph via the lookup table, and if that value is returned as an output, read the value from the lookup table and return it (in bytecode, not as a graph output, since we don't support arbitrary stream outputs).

Pull Request resolved: pytorch#164819
Approved by: https://github.com/anijain2305
ghstack dependencies: pytorch#164304, pytorch#164522
… sigmoid + CUDA kernel bug (pytorch#166568)

Differential Revision: D85792537

Pull Request resolved: pytorch#166568
Approved by: https://github.com/minjang
…orch#161476)

For pytorch#114850, we will port 3 distributed tests to Intel GPU.
We enable Intel GPU with the following methods, keeping the original code style as much as possible:

- use `torch.accelerator.current_accelerator()` to determine the accelerator backend
- use `requires_accelerator_dist_backend` to enable "xccl"
- enable XPU for some test paths
- skip test cases that Intel GPU does not support

Pull Request resolved: pytorch#161476
Approved by: https://github.com/weifengpy, https://github.com/guangyey
This PR adds `strict=True/False` to `zip` calls in test utils; `strict=True` is passed when possible.

Pull Request resolved: pytorch#166257
Approved by: https://github.com/janeyx99
fix typo in other folders

pytorch#166374
pytorch#166126

_typos.toml
```bash
[files]
extend-exclude = ["tools/linter/dictionary.txt"]
[default.extend-words]
nd = "nd"
arange = "arange"
Nd = "Nd"
GLOBALs = "GLOBALs"
hte = "hte"
iy = "iy"
PN = "PN"
Dout = "Dout"
optin = "optin"
gam = "gam"
PTD = "PTD"
Sur = "Sur"
nin = "nin"
tme = "tme"
inpt = "inpt"
mis = "mis"
Raison = "Raison"
ouput = "ouput"
nto = "nto"
Onwer = "Onwer"
callibrate = "callibrate"
ser = "ser"
Metdata = "Metdata"
```

Pull Request resolved: pytorch#166606
Approved by: https://github.com/ezyang
This reverts commit 39e5cdd.

Reverted pytorch#166257 on behalf of https://github.com/atalman due to Failing: test/distributed/fsdp/test_fsdp_mixed_precision.py::TestFSDPTrainEval::test_train_ema_eval_flow [GH job link](https://github.com/pytorch/pytorch/actions/runs/18934047991/job/54057218160) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/39e5cdddf7e57881c52473d1288a66f0222527e1) ([comment](pytorch#166257 (comment)))
…ut device index (pytorch#165356)"

This reverts commit f1af679.

Reverted pytorch#165356 on behalf of https://github.com/atalman due to test/test_rename_privateuse1_to_existing_device.py::TestRenamePrivateuseoneToExistingBackend::test_external_module_register_with_existing_backend [GH job link](https://github.com/pytorch/pytorch/actions/runs/18930365446/job/54046768884) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/a5335263d32b5be2b2647661334d81225c3cc3fc) ([comment](pytorch#165356 (comment)))
This reverts commit a533526.

Reverted pytorch#165212 on behalf of https://github.com/atalman due to test/test_rename_privateuse1_to_existing_device.py::TestRenamePrivateuseoneToExistingBackend::test_external_module_register_with_existing_backend [GH job link](https://github.com/pytorch/pytorch/actions/runs/18930365446/job/54046768884) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/a5335263d32b5be2b2647661334d81225c3cc3fc) ([comment](pytorch#165212 (comment)))
…r triton sigmoid + CUDA kernel bug (pytorch#166568)"

This reverts commit d46d8d6.

Reverted pytorch#166568 on behalf of https://github.com/atalman due to Failed test/test_extension_utils.py::TestExtensionUtils::test_external_module_register_with_renamed_backend [GH job link](https://github.com/pytorch/pytorch/actions/runs/18931754443/job/54050880312) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/d46d8d6f54b15ded4f2483c7bde31be124281ab8) ([comment](pytorch#166568 (comment)))
In the initial pr for overlapping preserving bucketing, for a graph like:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

We would add dependencies from mm -> ag, and from wait -> hiding_compute, to prevent bucketing from reordering these collectives so that overlap no longer occurred. However, there is an additional way for bucketing to prevent overlap.

If we were to reorder another collective so the graph looked like:

```
def foo(...):
     ag = all_gather(...)
     ar = all_reduce(...)
     wait(ar)
     hiding_compute = mm(...)
     wait(ag)
```

Overlap would not occur, because the wait for the all reduce would also force realization of every collective enqueued on the same stream prior to the all reduce. NCCL uses a single stream per process group.

To model this, we initially set a strict ordering of all collective starts, waits, and hiding compute when bucketing. Then, when trying to add a collective to a bucket, we check whether overlap would be interfered with for each of the following candidate bucketings:

[move collective start to bucket start, move bucket start to collective start] x [move collective wait to bucket wait, move bucket wait to collective wait].

For each of these positions, we check whether overlap would have been broken by stream-queue semantics. If not, we remove the moving start and wait from the constrained ordering of collectives and check that it is topologically valid to merge the nodes.
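The stream-queue constraint can be sketched in a few lines of plain Python: on NCCL's single per-process-group stream, any wait enqueued between a collective's start and its hiding compute forces the collective to complete early, killing overlap. This is an illustrative toy model with made-up event names, not the Inductor bucketing code:

```python
# Toy model of the single-stream constraint: a wait issued between a
# collective's start and its hiding compute forces realization of every
# collective enqueued earlier on the same stream, breaking overlap.

def breaks_overlap(order, start, hiding_compute):
    """order: list of event names in program order."""
    i_start = order.index(start)
    i_hide = order.index(hiding_compute)
    # Any wait between the start and the hiding compute is fatal.
    return any(e.startswith("wait") for e in order[i_start + 1 : i_hide])

# Overlapped: wait(ag) comes only after the hiding mm.
ok = ["ag_start", "mm", "wait_ag"]
# Broken: wait(ar) before the mm also realizes ag (single stream).
bad = ["ag_start", "ar_start", "wait_ar", "mm", "wait_ag"]

print(breaks_overlap(ok, "ag_start", "mm"))   # False
print(breaks_overlap(bad, "ag_start", "mm"))  # True
```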

Pull Request resolved: pytorch#166324
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: pytorch#166309
RohitRathore1 and others added 20 commits November 3, 2025 19:30
pytorch#165216)

Replaces 71 assert statements across 11 files in `torch.distributed` with explicit if-checks raising `AssertionError`, so the checks are not disabled by Python's -O flag.

Fixes pytorch#164878

Pull Request resolved: pytorch#165216
Approved by: https://github.com/albanD
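The rewrite pattern is mechanical: a bare `assert` disappears under `python -O`, while an explicit raise does not. A minimal illustration of the pattern; the function and message here are hypothetical, not code from the PR:

```python
# Hypothetical example of the pattern from pytorch#165216: the explicit
# if-check raises even when Python runs with -O, unlike a bare assert.

def check_rank(rank, world_size):
    # Before: assert 0 <= rank < world_size, f"invalid rank {rank}"
    # After: the check survives `python -O`.
    if not (0 <= rank < world_size):
        raise AssertionError(f"invalid rank {rank}")
    return rank

print(check_rank(1, 4))  # 1
try:
    check_rank(5, 4)
except AssertionError as err:
    print(err)  # invalid rank 5
```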
Update devtoolset in Manylinux 2.28 ROCm builds: devtoolset-11 is too old and does not support compiling with C++20 properly.

Pull Request resolved: pytorch#166764
Approved by: https://github.com/sudharssun, https://github.com/jeffdaily
This reverts commit c761999.

Reverted pytorch#166361 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#166361 (comment)))
This PR cleans up unused assignments.

Pull Request resolved: pytorch#166791
Approved by: https://github.com/xmfan
By wrapping the python objects with FakeScriptObject(FakeOpaqueQueue) we restrict users to do anything to this object. torch.compile support can be easily enabled by the rest of [this stack](pytorch#163936) and existing support for ScriptObjects.

One thing to note is that by default in functionalization we mark all ops that take FakeScriptObjects as effectful. Should this be the case for these custom ops that take Python objects?

Pull Request resolved: pytorch#165005
Approved by: https://github.com/zou3519
Fixes pytorch#165865

## What this PR does?

- [x] Add `generator` arg to `rand*_like` APIs (`rand_like()`, `randn_like()`, `randint_like()`).
- [x] Add unit tests for  `rand*_like` APIs
- [x] Add corresponding arg docs
- [x] Refactor `rand*_like()` codes in `TensorFactories.cpp`
- [x] Add corresponding and former missed items in `VmapModeRegistrations.cpp`

## Example (using `rand_like()`)

```python
gen0 = torch.Generator()
gen1 = torch.Generator()
gen2 = torch.Generator()

gen0.manual_seed(42)
gen1.manual_seed(42)
gen2.manual_seed(2025)

tensor = torch.empty(10)

t0 = torch.rand_like(tensor, generator=gen0)
t1 = torch.rand_like(tensor, generator=gen1)
t2 = torch.rand_like(tensor, generator=gen2)

assert torch.equal(t0, t1)
assert not torch.equal(t2, t0)
assert not torch.equal(t2, t1)
```

Pull Request resolved: pytorch#166160
Approved by: https://github.com/cyyever, https://github.com/albanD
Use the more generic `Benchmarker.benchmark` function to allow benchmarking other devices that support the required functionality; for example, prologue and epilogue fusion can be benchmarked for Triton CPU.

Pull Request resolved: pytorch#164938
Approved by: https://github.com/nmacchioni, https://github.com/eellison
Adding `conv` (conv1d, conv2d, conv3d) to the list of operator microbenchmarks run in the CI script (`.ci/pytorch/test.sh`), ensuring convolution operators are now benchmarked alongside existing ones.
Pull Request resolved: pytorch#166331
Approved by: https://github.com/huydhn, https://github.com/jbschlosser
…torch#166806)

I missed this API for MTIAGraph in D84457757(pytorch#165963)

Differential Revision: [D86026706](https://our.internmc.facebook.com/intern/diff/D86026706/)

Pull Request resolved: pytorch#166806
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#166805
…e. (pytorch#166775)

Summary:
As titled: we should return the entire tracing_context object instead of only fake_mode, since the tracing context contains the full set of information.

Test Plan:
pytest test/export/test_experimental.py

Pull Request resolved: pytorch#166775
Approved by: https://github.com/tugsbayasgalan
Summary:
dict_keys_getitem can show up in the bytecode, but it uses dict.keys(), which is not fx-traceable.

fx.wrap should make it a standalone function in the graph, to be invoked later with real inputs.

Test Plan:
pytest test/export/test_experimental.py

Pull Request resolved: pytorch#166776
Approved by: https://github.com/jamesjwu
ghstack dependencies: pytorch#166775
)

Summary:
make_fx() will register tensor constants as new buffers while tracing a shuffle graph for dynamo graph capture. This breaks the invariant that the resulting graph looks identical to the original eager model in terms of state dict.

So we need to de-register the buffers and set them as plain tensor constants.

Test Plan:
pytest test/export/test_experimental.py

Pull Request resolved: pytorch#166777
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: pytorch#166775, pytorch#166776
…ch#166554)

Fixes pytorch#166253

## Summary
When `torch.full` is called with a 0-D tensor as `fill_value` inside a `torch.compile`'d function, the value was being incorrectly cached, causing subsequent calls with different values to return the first value.

## Root Cause
The Dynamo handler for `torch.full` was calling `aten._local_scalar_dense` to convert tensor fill_values to Python scalars at compile time, which baked the value into the compiled graph as a constant.

## Solution
Modified the Dynamo handler to decompose `torch.full(size, tensor_fill_value)` into `empty(size).fill_(tensor_fill_value)` when `fill_value` is a `TensorVariable`, keeping the fill value dynamic in the compiled graph.
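The difference between baking the value in at compile time and keeping it dynamic can be shown with a plain-Python analogy; this is a hypothetical sketch of the caching pitfall, not the Dynamo handler itself:

```python
# Analogy for the bug: reading a value at "compile" time bakes it into
# the cached artifact, so later calls with new values are ignored.

def compile_full_buggy(size, fill_box):
    value = fill_box[0]            # read at compile time: baked in
    return lambda: [value] * size

def compile_full_fixed(size, fill_box):
    # Keep the fill value dynamic by reading it at call time instead,
    # mirroring the empty(size).fill_(tensor) decomposition.
    return lambda: [fill_box[0]] * size

box = [7]
buggy = compile_full_buggy(3, box)
fixed = compile_full_fixed(3, box)
box[0] = 9
print(buggy())  # [7, 7, 7] -- stale: the first value was cached
print(fixed())  # [9, 9, 9] -- dynamic: reflects the current value
```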

## Testing
Added test case that verifies torch.full works correctly with dynamic tensor fill_values across multiple calls and dtypes.

Pull Request resolved: pytorch#166554
Approved by: https://github.com/Lucaskabela
…wise/reduction consumer (pytorch#166165)"

This reverts commit 94f2657.

Reverted pytorch#166165 on behalf of https://github.com/izaitsevfb due to breaks test_LinearAndSoftmax_codegen test ([comment](pytorch#166165 (comment)))
…rocess (pytorch#166560)

Summary:
DCP checkpoint background process currently determines the port used for pg via get_free_port().

During checkpoint background process initialization, gloo pg init occasionally times out on the first call but succeeds in a subsequent call.

We hypothesized that the timeouts are related to the port being used, and the solution would be to create the pg with PrefixStore and reuse the master port.

This diff adds the option for checkpoint background process to use PrefixStore with MASTER_ADDR + MASTER_PORT.

The default behavior is unchanged. Enabling the new PrefixStore behavior requires setting "DCP_USE_PREFIX_STORE" env var to "1".

context:
 https://fb.workplace.com/groups/319878845696681/permalink/1516883985996155/

Differential Revision: D84928180

Pull Request resolved: pytorch#166560
Approved by: https://github.com/meetv18
…into fsdp (pytorch#166433)

**Summary:** I have created a new composable replicate API integrated into FSDP's codebase with minimal changes. The key changes: when we use DDPMeshInfo, we use Replicate placements, prevent initial sharding of parameters, and set world size to 1 to skip all-gathers and reduce-scatters.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py
2. pytest test_pp_composability.py
3. pytest test_replicate_with_fsdp.py

Pull Request resolved: pytorch#166433
Approved by: https://github.com/weifengpy
# Conflicts:
#	.ci/docker/requirements-ci.txt
@pragupta pragupta requested a review from jeffdaily as a code owner November 3, 2025 23:21
@rocm-repo-management-api

rocm-repo-management-api bot commented Nov 3, 2025

Jenkins build for 2eea9c424bfdb5249d3e1cf516617dd616dc7f01 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api

rocm-repo-management-api bot commented Nov 4, 2025

Jenkins build for 9396162549a868bc4dc5af38a1c39f462f0b1aa2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pragupta pragupta force-pushed the develop_IFU_20251103 branch from 9396162 to 86a7a33 Compare November 4, 2025 13:58
@jithunnair-amd jithunnair-amd merged commit 0b6213a into develop Nov 4, 2025
15 checks passed
@jithunnair-amd jithunnair-amd deleted the develop_IFU_20251103 branch November 4, 2025 14:10
@pragupta pragupta restored the develop_IFU_20251103 branch November 4, 2025 14:24
@pragupta

pragupta commented Nov 4, 2025

Due to issues with create_ifu_tag.yml, changes from this PR were reverted. All changes from here are now reflected in #2784
