
Conversation


@sshonTT sshonTT commented Oct 3, 2025

Rebase to upstream

zmelumian972 and others added 30 commits July 18, 2025 12:42
…#9501)

* Refactored jax device handling
* Removed the option to use a CPU jax array for CPU torch tensors; changing jax devices after the fact uses different APIs
sshonTT and others added 10 commits October 3, 2025 15:06
Remove the explicit opmath-driven cast chain (bf16→f32→bf16, etc.) from
`mul`. The op now executes in the dtype chosen by standard dtype
promotion, without inserting unconditional upcast/downcast steps. The
cast-chain machinery itself is kept in place for future use.
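
A minimal sketch of the resulting behavior, assuming torch_xla is installed and an XLA device is available; this illustrates the dtype semantics described above rather than the implementation itself:

```python
# Hedged illustration: with the unconditional opmath cast chain removed,
# `mul` executes in the dtype chosen by standard dtype promotion.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(4, 4, dtype=torch.bfloat16, device=device)
b = torch.randn(4, 4, dtype=torch.bfloat16, device=device)

# Previously the inputs were upcast to f32 for the multiply and the result
# downcast back to bf16; now the multiply runs directly in the promoted
# dtype (bf16 here). The output dtype is the same under both schemes.
c = a * b
assert c.dtype == torch.bfloat16
```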
…n PyTorch/XLA (#1)

Adds an environment variable CONVERT_SHLO_TO_SHARDY that does two things (a usage sketch follows the list):

- Uses V2 sharding annotations when generating the GSPMD SHLO module (i.e., in V1 a mesh annotation string like: devices=[2,1,4]0,1,2,3,4,5,6,7 becomes this in V2: devices=[2,1,4]<=[8]).
- Converts the new GSPMD module with the V2 annotations into a Shardy module.
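
A minimal usage sketch, assuming the flag is read at process start (so it is set before importing torch_xla):

```python
import os

# Enable V2 sharding annotations and GSPMD SHLO -> Shardy conversion.
os.environ["CONVERT_SHLO_TO_SHARDY"] = "1"

import torch_xla  # the GSPMD module is now emitted with V2 annotations
                  # and converted to a Shardy module at compile time
```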
…chip training (#2)

* Add V2 sharding support and improve partition spec handling for multi-chip training

These changes are required to support multi-chip training for real models on the torch-xla side.

- Added V2 OpSharding support in XlaShardingSpec, which is used internally by MpLoader for parallel input loading. The original implementation only supported V1 shardings.
- Fixed environment variable parsing for CONVERT_SHLO_TO_SHARDY - previous logic treated values like "0" or "false" as truthy.
- Added logic to compute dims, reshape_dims, and transpose_perm for V2 sharding based on mesh_shape and partition_spec.

The new logic now correctly handles cases that were previously unsupported (a sketch of the computation follows the cases below):

  case 1: mesh_shape=(2,1,1,1), partition_spec=(0,None,None,None)
          -> dims=[2,1,1,1], reshape_dims=[2,1,1,1], transpose_perm=[0,1,2,3]

  case 2: mesh_shape=(2,1,1,1), partition_spec=(0,)
          -> dims=[2], reshape_dims=[2,1,1,1], transpose_perm=[0,1,2,3]

  case 3: mesh_shape=(2,4), partition_spec=(0,None)
          -> dims=[2,1,4], reshape_dims=[2,4], transpose_perm=[0,1]
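
A minimal sketch of the computation behind the three cases above; the function name compute_v2_sharding_params is hypothetical (the real logic lives in torch-xla's sharding code), and only int-or-None partition specs are handled:

```python
from typing import Optional, Sequence, Tuple

def compute_v2_sharding_params(
    mesh_shape: Sequence[int],
    partition_spec: Sequence[Optional[int]],
) -> Tuple[list, list, list]:
    # dims: one tile count per tensor dimension, taken from the mesh axis
    # that dimension is sharded over (1 if the dimension is replicated).
    dims = [mesh_shape[ax] if ax is not None else 1 for ax in partition_spec]

    # Mesh axes not referenced by the partition spec are replicated over.
    used = [ax for ax in partition_spec if ax is not None]
    unused = [ax for ax in range(len(mesh_shape)) if ax not in used]

    # If the unused axes hold more than one device, append a trailing
    # "replicate over this many devices" tile dimension (case 3 above).
    replicated = 1
    for ax in unused:
        replicated *= mesh_shape[ax]
    if replicated > 1:
        dims.append(replicated)

    # reshape_dims is the device mesh itself; transpose_perm orders the
    # sharded mesh axes first, followed by the replicated (unused) axes.
    reshape_dims = list(mesh_shape)
    transpose_perm = used + unused
    return dims, reshape_dims, transpose_perm

# The three cases listed above:
assert compute_v2_sharding_params((2, 1, 1, 1), (0, None, None, None)) == \
    ([2, 1, 1, 1], [2, 1, 1, 1], [0, 1, 2, 3])
assert compute_v2_sharding_params((2, 1, 1, 1), (0,)) == \
    ([2], [2, 1, 1, 1], [0, 1, 2, 3])
assert compute_v2_sharding_params((2, 4), (0, None)) == \
    ([2, 1, 4], [2, 4], [0, 1])
```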

* Fix formatting according to Torch-XLA style guide

---------

Co-authored-by: Het Shah <[email protected]>
… PJRT backend

This change introduces the ability to pass custom compile options from Python down to the PJRT backend, allowing users to fine-tune XLA compilation behavior without modifying core code.
Key changes:
* Python API
    * Added custom_compile_options parameter to torch_xla.compile for passing compile-time options as a dict (supports bool, float, int, and str values).
    * Added torch_xla.set_custom_compile_options() utility for setting compile options globally.
    * Added internal binding _XLAC._set_custom_compile_options().
* C++ Runtime
    * Added SetCustomCompileOptions() virtual method to ComputationClient and implemented it in PjRtComputationClient.
    * PjRtComputationClient now stores custom_compile_options_ and injects them into xla::CompileOptions.env_option_overrides during compilation.
    * Options are stringified before being passed to XLA for compatibility.
Motivation:
This enables advanced users to pass through backend-specific tuning flags (e.g., enabling experimental optimizations, toggling partitioning strategies) without hardcoding them, improving flexibility for research and debugging workflows.
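
A hedged usage sketch based on the parameter and helper names listed above; the option keys shown are hypothetical placeholders, not real XLA flags:

```python
import torch_xla

def step(x):
    return x * 2 + 1

# Per-compile: pass backend-specific options as a dict of bool/float/int/str.
compiled_step = torch_xla.compile(
    step,
    custom_compile_options={"some_backend_flag": True},  # hypothetical key
)

# Or set options globally; they are stringified and injected into
# xla::CompileOptions.env_option_overrides at compilation time.
torch_xla.set_custom_compile_options({"another_backend_flag": "2"})  # hypothetical key
```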
…ation (#7)

This PR adds support for all previously unsupported partition specs and fixes the visualize_tensor_sharding() function to support V2 sharding specs.

See pytorch#9541 for the upstream PR discussion and additional context.

* Add some tests and reviewer suggestions. Will update the V2 op sharding logic in a later commit.

* New implementation (WIP)

* Fix new implementation

* Fix visualize_tensor_sharding function for V2 shardings
sshonTT commented Oct 6, 2025

@AleksKnezevic I tried this out, but it seems to cause random segfaults. I’ll need to dig in further to figure out the root cause.

@AleksKnezevic

Thanks @sshonTT. Then please reopen and merge #9 while we investigate.

Fix for API match
sshonTT commented Oct 6, 2025

Hi @hshahTT, @jazpurTT, @ddilbazTT,

I’ve rebased our branch with the upstream changes and verified it on my side, but I’d appreciate it if you could double-check that everything works correctly.

I’ve also built a wheel for testing, which you can find here:
wh-lb-57:/localdev/sshon/ws/pytorch/pytorch-xla/dist/torch_xla-2.9.0+git86bac8b-cp311-cp311-linux_x86_64.whl

Please let me know if it installs and runs fine on your end. Thanks!

@AleksKnezevic

@sshonTT are you still seeing segfaults?

sshonTT commented Oct 6, 2025

@AleksKnezevic No, I don't see them anymore.

@AleksKnezevic

That's great. Any ideas what was causing them previously?

sshonTT commented Oct 6, 2025

I couldn't root-cause it completely, but it turned out to be a system-level issue related to Torch Inductor's mutex handling. It was failing because a mutex had already been acquired by another process or context, likely left unreleased from a previous pytest run.

After releasing and reassigning the same IRD machine, the issue disappeared. I also verified it on another IRD machine to confirm that it works correctly now.

@AleksKnezevic

Awesome, thanks @sshonTT! Do we have a way of running CI with this wheel?

sshonTT commented Oct 6, 2025

I've triggered this workflow run to build a wheel. Once it's ready, I'll update the torch-xla version in tt-xla and test how it behaves. Other than that, I don't currently have a concrete way to verify this change.

sshonTT commented Oct 6, 2025

I think we have a build issue here. I'll find a way to get past it.

@sshonTT sshonTT force-pushed the sshon/rebase-to-upstream branch 3 times, most recently from 616047b to 626b736 Compare October 7, 2025 16:15
Change Torch build options to avoid build warnings and errors.
@sshonTT sshonTT force-pushed the sshon/rebase-to-upstream branch from 626b736 to 27f7792 Compare October 7, 2025 16:27
sshonTT commented Oct 8, 2025

The build succeeds after turning off warnings-as-errors, but there is an error when publishing it. I think it is related to the S3 bucket credentials.

@jazpurTT I believe you have experience with S3 buckets; do you know what is going on, and do you have any suggestions for fixing it?

@sshonTT sshonTT force-pushed the sshon/rebase-to-upstream branch from 9165c52 to b1ebc54 Compare October 9, 2025 20:07