Changes from all commits (134 commits)
ca91445
Add support for callable in torchax.interop.JittableModule.functional…
zmelumian972 Jul 18, 2025
86a99d7
Update README.md to reflect supported python versions (#9484)
bhavya01 Jul 18, 2025
f3c7907
Remove support for one-process-per-device style of distributed. (#9490)
qihqi Jul 18, 2025
95ba754
Allow mixed tensor type math if one of them is a scalar (#9453)
qihqi Jul 18, 2025
55b7d02
Fix nested stableHLO composite regions (#9385)
Carlomus Jul 20, 2025
26def0f
Misc fixes: (#9491)
qihqi Jul 20, 2025
e82631e
Fix python 3.11 cuda wheel link in the readme (#9493)
vfdev-5 Jul 21, 2025
31c4c2f
[Bugfix] fix ragged attention kernel auto-tuning table key (#9497)
yaochengji Jul 23, 2025
299a16b
Error Handling: refactor `ComputationClient::TransferFromDevice` to p…
ysiraichi Jul 24, 2025
ca47198
Implement XLAShardedTensor._spec and test (#9488)
aws-cph Jul 24, 2025
16b1202
Clean up quantized matmul condition code (#9506)
kyuyeunk Jul 24, 2025
0a1594a
Move mutable properties of env to thread local, misc changes (#9501)
qihqi Jul 24, 2025
29ae4c7
Optimize w8a8 kernel vmem limit (#9508)
kyuyeunk Jul 26, 2025
2820f7c
Error Handling: return status value when loading PjRt dynamic plugin.…
ysiraichi Jul 28, 2025
531c724
Add block sizes for Qwen/Qwen2.5-32B-Instruct (#9516)
vanbasten23 Jul 29, 2025
1ed6b46
Error Handling: propagate status for `ReleaseGilAndTransferData` and …
ysiraichi Jul 29, 2025
b0ffc49
Error Handling: refactor `ExecuteComputation` and `ExecuteReplicated`…
ysiraichi Jul 29, 2025
cd3bd91
Error Handling: refactor `GetXlaTensor` and related functions to use …
ysiraichi Jul 29, 2025
7aa466e
Dump C++ and Status propagation stacktraces. (#9492)
ysiraichi Jul 29, 2025
199a9bd
Add w8a8 kernel blocks for Qwen 2.5 7B (#9517)
kyuyeunk Jul 30, 2025
cb64f4c
Deduplicate `GetXlaTensors()` function. (#9518)
ysiraichi Jul 30, 2025
95bee8f
[XLA] Add placements property to XLAShardedTensor for DTensor compati…
Hoomaaan Jul 30, 2025
241cd47
Update artifacts_builds.tf for 2.8.0-rc2 (#9522)
bhavya01 Jul 31, 2025
c807ebc
Update artifacts_builds.tf for 2.8.0-rc3 wheel (#9527)
bhavya01 Jul 31, 2025
83d4253
make jax as an optional dependency (#9521)
qihqi Aug 1, 2025
d487007
Reorganize PyTorch/XLA Overview page (#9498)
melissawm Aug 1, 2025
7a48185
Support torch.nn.functional.one_hot (#9523)
vanbasten23 Aug 1, 2025
0ad39c2
Introduce PlatformVersion bindings (#9513)
rpsilva-aws Aug 1, 2025
ebefc8f
Update artifacts_builds.tf for 2.8.0-rc4 (#9532)
bhavya01 Aug 1, 2025
adf305f
Fix pip install torch_xla[pallas] (#9531)
bhavya01 Aug 1, 2025
d3d91a8
Remove cuda builds for release wheels (#9533)
bhavya01 Aug 1, 2025
9995e97
Optimize KV cache dequantization performance (#9528)
kyuyeunk Aug 1, 2025
2ccd5dc
Add gemini edited docstring
qihqi Aug 2, 2025
b6a5b82
add more files
qihqi Aug 4, 2025
2889f69
Revert 2 accidental commits that I made. (#9536)
qihqi Aug 4, 2025
43589c0
Implement XLAShardedTensor.redistribute and test (#9529)
aws-cph Aug 4, 2025
15496cd
Do not set `PJRT_DEVICE=CUDA` automatically on import. (#9540)
ysiraichi Aug 5, 2025
e5e75a8
Add triggers for release 2.8.0 (#9545)
bhavya01 Aug 6, 2025
30ad68a
Update torchbench pin location. (#9543)
ysiraichi Aug 7, 2025
6050927
Improve error message of functions related to `GetXlaTensor()`. (#9520)
ysiraichi Aug 7, 2025
41bfd62
Update artifacts_builds.tf for rc5
bhavya01 Aug 7, 2025
57cd41c
Refactor the status error message builder. (#9546)
ysiraichi Aug 8, 2025
8c1449f
Use `TORCH_CHECK()` instead of throwing `std::runtime_error` in `XLA_…
ysiraichi Aug 8, 2025
095faec
Error Handling: make `XLATensor::Create()` return status type. (#9544)
ysiraichi Aug 8, 2025
38e0f03
`cat`: improve error handling and error messages. (#9548)
ysiraichi Aug 11, 2025
23158fd
`div`: improve error handling and error messages. (#9549)
ysiraichi Aug 11, 2025
1f787f1
Bug fixes (#9554)
qihqi Aug 11, 2025
f400690
Run torchprime CI only when the pull requests have torchprimeci label…
bhavya01 Aug 12, 2025
c8c9776
[Documentation] Fixed typo in C++ debugging docs (#9559)
hinriksnaer Aug 13, 2025
40f58a6
Update README.md to mention 2.8 release (#9560)
bhavya01 Aug 13, 2025
d5b9a6d
`flip`: improve error handling and error messages. (#9550)
ysiraichi Aug 14, 2025
2c34318
Generalize crash message for non-ok status. (#9552)
ysiraichi Aug 14, 2025
4199865
Rename `MaybeThrow` to `OkOrThrow`. (#9561)
ysiraichi Aug 16, 2025
a1c6ee9
Add xla random generator. (#9539)
iwknow Aug 16, 2025
0f56dec
[EZ] Replace `pytorch-labs` with `meta-pytorch` (#9556)
ZainRizvi Aug 19, 2025
b84c83b
Added missing "#"s for the comments in triton.md (#9571)
SriRangaTarun Aug 21, 2025
6b6ef5c
Remove tests that are defined outside of this repo. (#9577)
qihqi Aug 22, 2025
748ac9b
Update XLA pin then fix up to make it compile (#9565)
qihqi Aug 22, 2025
f8b44e2
Create mapping for FP8 torch dtypes (#9573)
kyuyeunk Aug 22, 2025
b098be8
refactor: DTensor inheritance for XLAShardedTensor (#9576)
aws-cph Aug 23, 2025
147d2c2
`full`: improve error handling and error messages. (#9564)
ysiraichi Aug 23, 2025
8243a25
`gather`: improve error handling and error messages. (#9566)
ysiraichi Aug 23, 2025
49ac22a
`random_`: improve error handling and error messages. (#9567)
ysiraichi Aug 25, 2025
aada9fc
Remove `XLA_CUDA` and other CUDA build flags. (#9582)
ysiraichi Aug 25, 2025
e9a1c5f
Remove OpenXLA CUDA fallback and `_XLAC_cuda_functions.so` extension.…
ysiraichi Aug 25, 2025
abf18e4
Fix case when both device & dtype are given in .to (#9583)
qihqi Aug 25, 2025
5522c69
implement send and recv using collective_permute (#9373)
bfolie Aug 25, 2025
163193e
Set environment variables for tpu7x (#9586)
bhavya01 Aug 26, 2025
4c586bd
Create new macros for throwing status errors. (#9588)
ysiraichi Aug 27, 2025
d214faf
`test`: Use new macros for throwing exceptions. (#9590)
ysiraichi Aug 28, 2025
d9a9e44
`runtime`: Use new macros for throwing exceptions. (#9591)
ysiraichi Aug 28, 2025
8d20a86
`ops`: Use new macros for throwing exceptions. (#9592)
ysiraichi Aug 28, 2025
d55cc00
`init_python_bindings.cpp`: Use new macros for throwing exceptions. (…
ysiraichi Aug 28, 2025
90be04a
`aten_xla_type.cpp`: Use new macros for throwing exceptions. (#9596)
ysiraichi Aug 28, 2025
1bc7737
Remove CUDA plugin. (#9597)
ysiraichi Aug 28, 2025
d4cf42a
Remove triton. (#9601)
ysiraichi Aug 28, 2025
f5a2218
`torch_xla`: Use new macros for throwing exceptions (part 1). (#9593)
ysiraichi Aug 28, 2025
e7b1159
`torch_xla`: Use new macros for throwing exceptions (part 2). (#9594)
ysiraichi Aug 28, 2025
004f19e
Remove CUDA specific logic from runtime. (#9598)
ysiraichi Aug 29, 2025
763e5b7
Remove `gpu_custom_call` logic. (#9600)
ysiraichi Aug 29, 2025
8fb90c8
Remove functions that throw status error. (#9602)
ysiraichi Sep 2, 2025
05d9cba
Remove CUDA logic from C++ files in `torch_xla/csrc` directory. (#9603)
ysiraichi Sep 2, 2025
c0eeb57
Remove CUDA specific path from internal Python packages. (#9606)
ysiraichi Sep 2, 2025
89f929b
Move `_jax_forward` and `_jax_backward` inside `j2t_autograd` to avoi…
jialei777 Sep 2, 2025
647804c
Remove remaining GPU/CUDA mentions in `torch_xla` directory. (#9608)
ysiraichi Sep 2, 2025
94fdadc
Update version to 0.0.6 (#9611)
qihqi Sep 2, 2025
ddf75a1
Remove CUDA from PyTorch/XLA build. (#9609)
ysiraichi Sep 3, 2025
8ff2ee6
Remove CUDA from `benchmarks` directory. (#9610)
ysiraichi Sep 3, 2025
c48478a
Remove CUDA tests from distributed tests. (#9612)
ysiraichi Sep 3, 2025
e0de097
Make torch_xla package PEP 561 compliant (#9515)
wirthual Sep 3, 2025
342de86
Remove other CUDA usage from PyTorch/XLA repository. (#9618)
ysiraichi Sep 4, 2025
77d85fb
Remove CUDA from remaining tests. (#9613)
ysiraichi Sep 4, 2025
bd95382
Miscelanous cleanup (#9619)
qihqi Sep 4, 2025
f6ff30d
Do not skip fetching sources.
bhavya01 Sep 4, 2025
2518381
Update build_and_test.yml to match r2.8 and r2.8.1
bhavya01 Sep 5, 2025
6ee7627
Update build_and_test.yml
bhavya01 Sep 5, 2025
8274f94
Replace `GetComputationClientOrDie()` with `GetComputationClient()` (…
ysiraichi Sep 5, 2025
92dcabc
`mm`: improve error handling and error messages. (#9621)
ysiraichi Sep 5, 2025
6c5478f
Add triggers for v2.8.1 version
bhavya01 Sep 5, 2025
aba96d8
Replace `GetComputationClientOrDie()` with `GetComputationClient()` (…
ysiraichi Sep 9, 2025
49dec2e
Upgrade build infra to use debian-12 and gcc-11 (#9631)
bhavya01 Sep 10, 2025
caa809f
Remove libopenblas-dev from ansible dependencies (#9632)
bhavya01 Sep 10, 2025
7aba922
support load and save checkpoint in torchax (#9616)
junjieqian Sep 10, 2025
8efa568
Set `allow_broken_conditionals` configuration variable at `ansible.cf…
ysiraichi Sep 11, 2025
c77852e
Move torch ops error message tests into a new file. (#9622)
ysiraichi Sep 11, 2025
2329746
Fix `test_ops_error_message.py` and run it on CI. (#9640)
ysiraichi Sep 15, 2025
efe20ab
Do not warn on jax usage when workarounds are available (#9624)
bhavya01 Sep 15, 2025
a66cfc3
`roll`: improve error handling and error messages. (#9628)
ysiraichi Sep 16, 2025
6d755ee
`stack`: improve error handling and error messages. (#9629)
ysiraichi Sep 16, 2025
0c0ae2d
`expand`: improve error handling and error messages. (#9645)
ysiraichi Sep 17, 2025
0fc62aa
update gcc (#9650)
qihqi Sep 24, 2025
03d4dc0
Add default args for _aten_conv2d (#9623)
hsjts0u Sep 29, 2025
302c3f1
Pin `flax` and skip C++ test `SiLUBackward`. (#9660)
ysiraichi Sep 30, 2025
a511691
`trace`: improve error handling and error messages. (#9630)
ysiraichi Oct 1, 2025
3240166
Fix Terraform usage of `cuda_version`. (#9655)
ysiraichi Oct 1, 2025
3862b87
Create PyTorch commit pin. (#9654)
ysiraichi Oct 1, 2025
6ac4a7c
Accept conda channels' ToS when building the upstream docker image. (…
ysiraichi Oct 1, 2025
cc300f7
Revert "Fix Terraform usage of `cuda_version`. (#9655)" (#9664)
ysiraichi Oct 1, 2025
420adaa
Bump Python version of `ci-tpu-test-trigger` to 3.12. (#9665)
ysiraichi Oct 1, 2025
1348545
fix(xla): convert group-local to global ranks in broadcast (#9657)
Hoomaaan Oct 1, 2025
1ab6787
Accept conda channels' ToS with environment variable. (#9666)
ysiraichi Oct 1, 2025
2a9138a
mul: remove opmath cast sequence (#9663)
sshonTT Oct 3, 2025
d36ded2
[Experimental] Add initial implementation of GSPMD->Shardy pass withi…
hshahTT Jul 18, 2025
036321a
Create job to build torch-xla wheel and publish to tt-pypi
jazpurTT Jul 25, 2025
58da15c
Add permision from caller workflow to enable job (#4)
jazpurTT Jul 29, 2025
24bb34c
Add V2 sharding support and improve partition spec handling for multi…
sshonTT Aug 2, 2025
686cb76
feat: add support for custom compile options in torch_xla.compile and…
sshonTT Aug 11, 2025
5dfbb4d
Change V2 sharding spec algorithm + Fix tensor sharding spec visualiz…
hshahTT Sep 3, 2025
7bc474a
Uplift wheel python 3.10 to 3.11
ddilbazTT Sep 2, 2025
a2514dd
Update jax dependency to 0.7.1 to align with tt front ends (#8)
jazpurTT Sep 5, 2025
849fe9b
Merge branch 'master' into sshon/rebase-to-upstream
sshonTT Oct 3, 2025
86bac8b
Fix for API match
sshonTT Oct 6, 2025
27f7792
Torch build option change
sshonTT Oct 7, 2025
b1ebc54
Temporary adding checkout branch
sshonTT Oct 9, 2025
19 changes: 3 additions & 16 deletions .bazelrc
@@ -79,18 +79,6 @@ build:native_arch_posix --host_copt=-march=native

build:mkl_open_source_only --define=tensorflow_mkldnn_contraction_kernel=1

build:cuda --repo_env TF_NEED_CUDA=1
# "sm" means we emit only cubin, which is forward compatible within a GPU generation.
# "compute" means we emit both cubin and PTX, which is larger but also forward compatible to future GPU generations.
build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
build:cuda --@local_config_cuda//:enable_cuda
build:cuda --define=xla_python_enable_gpu=true
build:cuda --cxxopt=-DXLA_CUDA=1

# Coverage with cuda/gcc/nvcc requires manually setting coverage flags.
coverage:cuda --per_file_copt=third_party/.*,torch_xla/.*@--coverage
coverage:cuda --linkopt=-lgcov

build:acl --define==build_with_acl=true

build:nonccl --define=no_nccl_support=true
@@ -103,21 +91,20 @@ build:short_logs --output_filter=DONT_MATCH_ANYTHING
#build:tpu --@xla//xla/python:enable_tpu=true
build:tpu --define=with_tpu_support=true

# Run tests serially with TPU and GPU (only 1 device is available).
# Run tests serially with TPU (only 1 device is available).
test:tpu --local_test_jobs=1
test:cuda --local_test_jobs=1

#########################################################################
# RBE config options below.
# Flag to enable remote config
common --experimental_repo_remote_exec

# Inherit environmental variables that are used in testing.
test --test_env=TPU_NUM_DEVICES --test_env=GPU_NUM_DEVICES --test_env=CPU_NUM_DEVICES --test_env=XRT_LOCAL_WORKER
test --test_env=TPU_NUM_DEVICES --test_env=CPU_NUM_DEVICES --test_env=XRT_LOCAL_WORKER
test --test_env=XRT_TPU_CONFIG --test_env=XRT_DEVICE_MAP --test_env=XRT_WORKERS --test_env=XRT_MESH_SERVICE_ADDRESS
test --test_env=XRT_SHARD_WORLD_SIZE --test_env=XRT_MULTI_PROCESSING_DEVICE --test_env=XRT_HOST_ORDINAL --test_env=XRT_SHARD_ORDINAL
test --test_env=XRT_START_LOCAL_SERVER --test_env=TPUVM_MODE --test_env=PJRT_DEVICE --test_env=PJRT_TPU_MAX_INFLIGHT_COMPUTATIONS
test --test_env=PJRT_CPU_ASYNC_CLIENT --test_env=PJRT_GPU_ASYNC_CLIENT --test_env=TPU_LIBRARY_PATH --test_env=PJRT_DIST_SERVICE_ADDR
test --test_env=PJRT_CPU_ASYNC_CLIENT --test_env=TPU_LIBRARY_PATH --test_env=PJRT_DIST_SERVICE_ADDR
test --test_env=PJRT_LOCAL_PROCESS_RANK

# This environmental variable is important for properly integrating with XLA.
1 change: 0 additions & 1 deletion .circleci/build.sh
@@ -50,7 +50,6 @@ source $XLA_DIR/xla_env
export GCLOUD_SERVICE_KEY_FILE="$XLA_DIR/default_credentials.json"
export SILO_NAME='cache-silo-ci-dev-3.8_cuda_12.1' # cache bucket for CI
export BUILD_CPP_TESTS='1'
export TF_CUDA_COMPUTE_CAPABILITIES="sm_50,sm_70,sm_75,compute_80,$TF_CUDA_COMPUTE_CAPABILITIES"
build_torch_xla $XLA_DIR

popd
27 changes: 5 additions & 22 deletions .circleci/common.sh
@@ -112,6 +112,8 @@ function build_torch_xla() {
# Need to uncomment the line below.
# Currently it fails upstream XLA CI.
# pip install plugins/cuda -v
pip install --pre torch_xla[pallas] --index-url https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/ --find-links https://storage.googleapis.com/jax-releases/libtpu_releases.html

popd
}

@@ -156,26 +158,12 @@ function run_torch_xla_cpp_tests() {
fi

if [ "$USE_COVERAGE" != "0" ]; then
if [ -x "$(command -v nvidia-smi)" ]; then
PJRT_DEVICE=CUDA test/cpp/run_tests.sh $EXTRA_ARGS -L""
cp $XLA_DIR/bazel-out/_coverage/_coverage_report.dat /tmp/cov1.dat
PJRT_DEVICE=CUDA test/cpp/run_tests.sh -X early_sync -F AtenXlaTensorTest.TestEarlySyncLiveTensors -L"" $EXTRA_ARGS
cp $XLA_DIR/bazel-out/_coverage/_coverage_report.dat /tmp/cov2.dat
lcov --add-tracefile /tmp/cov1.dat -a /tmp/cov2.dat -o /tmp/merged.dat
else
PJRT_DEVICE=CPU test/cpp/run_tests.sh $EXTRA_ARGS -L""
cp $XLA_DIR/bazel-out/_coverage/_coverage_report.dat /tmp/merged.dat
fi
PJRT_DEVICE=CPU test/cpp/run_tests.sh $EXTRA_ARGS -L""
cp $XLA_DIR/bazel-out/_coverage/_coverage_report.dat /tmp/merged.dat
genhtml /tmp/merged.dat -o ~/htmlcov/cpp/cpp_lcov.info
mv /tmp/merged.dat ~/htmlcov/cpp_lcov.info
else
# Shard GPU testing
if [ -x "$(command -v nvidia-smi)" ]; then
PJRT_DEVICE=CUDA test/cpp/run_tests.sh $EXTRA_ARGS -L""
PJRT_DEVICE=CUDA test/cpp/run_tests.sh -X early_sync -F AtenXlaTensorTest.TestEarlySyncLiveTensors -L"" $EXTRA_ARGS
else
PJRT_DEVICE=CPU test/cpp/run_tests.sh $EXTRA_ARGS -L""
fi
PJRT_DEVICE=CPU test/cpp/run_tests.sh $EXTRA_ARGS -L""
fi
popd
}
@@ -194,11 +182,6 @@ function run_torch_xla_tests() {
RUN_CPP="${RUN_CPP_TESTS:0}"
RUN_PYTHON="${RUN_PYTHON_TESTS:0}"

if [ -x "$(command -v nvidia-smi)" ]; then
num_devices=$(nvidia-smi --list-gpus | wc -l)
echo "Found $num_devices GPU devices..."
export GPU_NUM_DEVICES=$num_devices
fi
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xla"
export CXX_ABI=$(python -c "import torch;print(int(torch._C._GLIBCXX_USE_CXX11_ABI))")

30 changes: 0 additions & 30 deletions .devcontainer/gpu-internal/devcontainer.json

This file was deleted.

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE.md
@@ -13,5 +13,5 @@ Error messages and stack traces are also helpful.

## System Info

- reproducible on XLA backend [CPU/TPU/CUDA]:
- reproducible on XLA backend [CPU/TPU]:
- torch_xla version:
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug-report.md
@@ -46,7 +46,7 @@ Steps to reproduce the behavior:

## Environment

- Reproducible on XLA backend [CPU/TPU/CUDA]:
- Reproducible on XLA backend [CPU/TPU]:
- torch_xla version:


87 changes: 40 additions & 47 deletions .github/ci.md
@@ -3,22 +3,22 @@
PyTorch and PyTorch/XLA use CI to lint, build, and test each PR that is
submitted. All CI tests should succeed before the PR is merged into master.
PyTorch CI pins PyTorch/XLA to a specific commit. On the other hand, PyTorch/XLA
CI pulls PyTorch from master unless a pin is manually provided. This README will
go through the reasons of these pins, how to pin a PyTorch/XLA PR to an upstream
PyTorch PR, and how to coordinate a merge for breaking PyTorch changes.
CI pulls PyTorch from `.torch_commit` unless a pin is manually provided. This
README will go through the reasons for these pins, how to pin a PyTorch/XLA PR
to an upstream PyTorch PR, and how to coordinate a merge for breaking PyTorch
changes.

## Usage

### Pinning PyTorch PR in PyTorch/XLA PR
### Temporarily Pinning PyTorch PR in PyTorch/XLA PR

Sometimes a PyTorch/XLA PR needs to be pinned to a specific PyTorch PR to test
new features, fix breaking changes, etc. Since PyTorch/XLA CI pulls from PyTorch
master by default, we need to manually provide a PyTorch pin. In a PyTorch/XLA
PR, PyTorch can be manually pinned by creating a `.torch_pin` file at the root
of the repository. The `.torch_pin` should have the corresponding PyTorch PR
number prefixed by "#". Take a look at [example
here](https://github.com/pytorch/xla/pull/7313). Before the PyTorch/XLA PR gets
merged, the `.torch_pin` must be deleted.
new features, fix breaking changes, etc. In a PyTorch/XLA PR, PyTorch can be
manually pinned by creating a `.torch_pin` file at the root of the repository.
The `.torch_pin` should have the corresponding PyTorch PR number prefixed by
"#". Take a look at [example here](https://github.com/pytorch/xla/pull/7313).
Before the PyTorch/XLA PR gets merged, the `.torch_pin` must be deleted and
`.torch_commit` updated.
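
For illustration, here is a minimal sketch of the pinning step; the PR number below is hypothetical:

```sh
# Pin this PyTorch/XLA PR to a hypothetical upstream PyTorch PR (#12345).
echo '#12345' > .torch_pin
git add .torch_pin
git commit -m "Pin to PyTorch PR #12345"
# Before merging: delete .torch_pin and update .torch_commit.
```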

### Coordinating merges for breaking PyTorch PRs

@@ -35,29 +35,42 @@ fail. Steps for fixing and merging such a breaking PyTorch change are as follows (see the sketch after the list):
PyTorch PR to pin the PyTorch/XLA to the commit hash created in step 1 by
updating `pytorch/.github/ci_commit_pins/xla.txt`.
1. Once CI tests are green on both ends, merge PyTorch PR.
1. Remove the `.torch_pin` in PyTorch/XLA PR and merge. To be noted, `git commit
--amend` should be avoided in this step as PyTorch CI will keep using the
commit hash created in step 1 until other PRs update that manually or the
nightly buildbot updates that automatically.
1. Remove the `.torch_pin` in the PyTorch/XLA PR and update `.torch_commit` to
the hash of the merged PyTorch PR. Note that `git commit --amend` should be
avoided in this step, as PyTorch CI will keep using the commit hash created
in step 1 until other PRs update it manually or the nightly buildbot updates
it automatically.
1. Finally, don't delete your branch until 2 days later. See step 4 for
explanations.
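
As referenced above, a sketch of the file updates involved in these steps; the commit hashes below are placeholders:

```sh
# In the PyTorch PR: pin PyTorch CI to the PyTorch/XLA commit from step 1.
echo '<pytorch-xla-commit-sha>' > .github/ci_commit_pins/xla.txt

# Back in the PyTorch/XLA PR, once the PyTorch PR has merged:
git rm .torch_pin
echo '<merged-pytorch-commit-sha>' > .torch_commit
git add .torch_commit
git commit -m "Drop .torch_pin and update .torch_commit"  # do not use `git commit --amend`
```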

### Running TPU tests on PRs

The `build_and_test.yml` workflow runs tests on the TPU in addition to CPU and
GPU. The set of tests run on the TPU is defined in `test/tpu/run_tests.sh`.
The `build_and_test.yml` workflow runs tests on the TPU in addition to CPU.
The set of tests run on the TPU is defined in `test/tpu/run_tests.sh`.

## Update the PyTorch Commit Pin

In order to reduce development burden of PyTorch/XLA, starting from #9654, we
started pinning PyTorch using the `.torch_commit` file. This should reduce the
number of times a PyTorch PR breaks our most recent commits. However, this also
requires maintenance, i.e. someone has to keep updating the PyTorch commit so
as to make sure it's always supporting (almost) the latest PyTorch versions.

Updating the PyTorch commit pin is, theoretically, simple: run the
`scripts/update_deps.py --pytorch` script and open a PR. In practice, you may
encounter a few compilation errors, or even segmentation faults.
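
A minimal sketch of that flow (assuming the script is run with Python; the branch name is illustrative):

```sh
git checkout -b update-pytorch-pin
python scripts/update_deps.py --pytorch  # rewrites .torch_commit to a newer PyTorch commit
git commit -am "Update PyTorch commit pin"
# Open a PR and watch CI for the compilation errors or crashes mentioned above.
```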

## CI Environment

Before the CI in this repository runs, we build a base dev image. These are the
same images we recommend in our VSCode `.devcontainer` setup and nightly build
to ensure consistency between environments. We produce variants with and without
CUDA, configured in `infra/ansible` (build config) and
`infra/tpu-pytorch-releases/dev_images.tf` (build triggers).
to ensure consistency between environments. We produce variants configured in
`infra/ansible` (build config) and `infra/tpu-pytorch-releases/dev_images.tf`
(build triggers).

The CI runs in two environments:

1. Organization self-hosted runners for CPU and GPU: used for almost every step
1. Organization self-hosted runners for CPU: used for almost every step
of the CI. These runners are managed by PyTorch and have access to the shared
ECR repository.
1. TPU self-hosted runners: these are managed by us and are only available in
@@ -68,48 +81,35 @@ The CI runs in two environments:

We have two build paths for each CI run:

- `torch_xla`: we build the main package to support both TPU and GPU[^1], along
- `torch_xla`: we build the main package to support TPU, along
with a CPU build of `torch` from HEAD. This build step exports the
`torch-xla-wheels` artifact for downstream use in tests.
- Some CI tests also require `torchvision`. To reduce flakiness, we compile
`torchvision` from [`torch`'s CI pin][pytorch-vision-pin].
- C++ tests are piggybacked onto the same build and uploaded in the
`cpp-test-bin` artifact.
- `torch_xla_cuda_plugin`: the XLA CUDA runtime can be built independently of
either `torch` or `torch_xla` -- it depends only on our pinned OpenXLA. Thus,
this build should be almost entirely cached, unless your PR changes the XLA
pin or adds a patch.

Both the main package build and plugin build are configured with ansible at
`infra/ansible`, although they run in separate stages (`stage=build_srcs` vs
`stage=build_plugin`). This is the same configuration we use for our nightly and
release builds.
The main package build is configured with ansible at `infra/ansible`. This is
the same configuration we use for our nightly and release builds.

The CPU and GPU test configs are defined in the same file, `_test.yml`. Since
The CPU test config is defined in the file `_test.yml`. Since
some of the tests come from the upstream PyTorch repository, we check out
PyTorch at the same git rev as the `build` step (taken from
`torch_xla.version.__torch_gitrev__`). The tests are split up into multiple
groups that run in parallel; the `matrix` section of `_test.yml` corresponds to
the test groups defined in `.github/scripts/run_tests.sh`.

CPU tests run immediately after the `torch_xla` build completes. This will
likely be the first test feedback on your commit. GPU tests will launch when
both the `torch_xla` and `torch_xla_cuda_plugin` complete. GPU compilation is
much slower due to the number of possible optimizations, and the GPU chips
themselves are quite outdated, so these tests will take longer to run than the
CPU tests.
likely be the first test feedback on your commit.

![CPU tests launch when `torch_xla` is
complete](../docs/assets/ci_test_dependency.png)

![GPU tests also depend on CUDA
plugin](../docs/assets/ci_test_dependency_gpu.png)

For the C++ test groups in either case, the test binaries are pre-built during
the build phase and packaged in `cpp-test-bin`. This will only be downloaded if
necessary.

[^1]: Note: both GPU and TPU support require their respective plugins to be
[^1]: Note: TPU support requires its respective plugin to be
installed. This package will _not_ work out of the box without it.

### TPU CI
@@ -165,13 +165,6 @@ good" commit to prevent accidental changes from PyTorch/XLA to break PyTorch CI
without warning. PyTorch has hundreds of commits each week, and this pin ensures
that PyTorch/XLA as a downstream package does not cause failures in PyTorch CI.

#### Why does PyTorch/XLA CI pull from PyTorch master?

[PyTorch/XLA CI pulls PyTorch from master][pull-pytorch-master] unless a PyTorch
pin is manually provided. PyTorch/XLA is a downstream package to PyTorch, and
pulling from master ensures that PyTorch/XLA will stay up-to-date and works with
the latest PyTorch changes.

#### TPU CI is broken

If the TPU CI won't run, try to debug using the following steps:
15 changes: 2 additions & 13 deletions .github/scripts/run_tests.sh
@@ -30,14 +30,7 @@ function run_torch_xla_cpp_tests() {

TORCH_DIR=$(python -c "import pkgutil; import os; print(os.path.dirname(pkgutil.get_loader('torch').get_filename()))")
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${TORCH_DIR}/lib
if [ -x "$(command -v nvidia-smi)" ]; then
CUDA_PLUGIN_DIR=$(python -c "import pkgutil; import os; print(os.path.dirname(pkgutil.get_loader('torch_xla_cuda_plugin').get_filename()))")
export PJRT_LIBRARY_PATH=$CUDA_PLUGIN_DIR/lib/pjrt_c_api_gpu_plugin.so
export PJRT_DEVICE=LIBRARY
export PJRT_DYNAMIC_PLUGINS=1
else
export PJRT_DEVICE=CPU
fi
export PJRT_DEVICE=CPU
export XLA_EXPERIMENTAL="nonzero:masked_select:nms"

test_names=("test_aten_xla_tensor_1"
@@ -55,6 +48,7 @@ function run_torch_xla_cpp_tests() {
"test_tensor"
# disable test_xla_backend_intf since it is flaky on upstream
#"test_xla_backend_intf"
"test_xla_generator"
"test_xla_sharding"
"test_runtime"
"test_status_dont_show_cpp_stacktraces"
@@ -83,11 +77,6 @@ PYTORCH_DIR=$1
XLA_DIR=$2
USE_COVERAGE="${3:-0}"

if [ -x "$(command -v nvidia-smi)" ]; then
num_devices=$(nvidia-smi --list-gpus | wc -l)
echo "Found $num_devices GPU devices..."
export GPU_NUM_DEVICES=$num_devices
fi
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xla"
export CXX_ABI=$(python -c "import torch;print(int(torch._C._GLIBCXX_USE_CXX11_ABI))")

5 changes: 0 additions & 5 deletions .github/upstream/Dockerfile
@@ -15,11 +15,6 @@ ARG tpuvm=""
# Disable CUDA for PyTorch
ENV USE_CUDA "0"

# Enable CUDA for XLA
ENV XLA_CUDA "${cuda}"
ENV TF_CUDA_COMPUTE_CAPABILITIES "${cuda_compute}"
ENV TF_CUDA_PATHS "/usr/local/cuda,/usr/include,/usr"

# CUDA build guidance
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
4 changes: 4 additions & 0 deletions .github/upstream/install_conda.sh
@@ -27,6 +27,10 @@ function install_and_setup_conda() {
fi
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"

# Accept Conda channels' ToS automatically.
# Ref: https://github.com/pytorch/pytorch/issues/158438#issuecomment-3084935777
export CONDA_PLUGINS_AUTO_ACCEPT_TOS="yes"

conda update -y -n base conda
conda install -y python=$PYTHON_VERSION
