
Commit 647804c

Remove remaining GPU/CUDA mentions in torch_xla directory. (#9608)
This PR removes the remaining CUDA-specific code from the PyTorch/XLA package (i.e. the `torch_xla` directory), as well as a few other related files. This is in line with the CUDA deprecation that started in release 2.8.

**Key Changes:**

- (`CONTRIBUTING.md`) Removed mention of CUDA-specific environment variables
- (`configuration.yaml`) Removed descriptions of CUDA-specific environment variables
- (`docs/source/learn/_pjrt.md`) Removed PjRt documentation on CUDA
- (`torch_xla/amp`) Removed CUDA-specific branches, as well as `GradScaler`
- (`torch_xla/core/xla_env_vars.py`) Removed CUDA-specific environment variables
- (`torch_xla/utils/checkpoint.py`) Fixed an incorrect function name
1 parent 89f929b commit 647804c
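Because `GradScaler` is no longer exported from `torch_xla.amp`, AMP training loops on TPU keep only `autocast` (which defaults to `torch.bfloat16` there) and drop the gradient-scaling steps. Below is a minimal sketch of the resulting pattern, not taken from this commit; the tiny model and synthetic batches are stand-ins for a real network and data loader:

```python
# Minimal sketch: TPU AMP after the GradScaler removal. bfloat16 autocast does
# not need loss scaling, so the scaler plumbing disappears entirely.
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm
from torch_xla.amp import autocast, syncfree

device = torch_xla.device()
model = nn.Linear(16, 4).to(device)                   # stand-in for a real model
optimizer = syncfree.SGD(model.parameters(), lr=0.1)  # sync-free optimizers remain
loss_fn = nn.CrossEntropyLoss()

for _ in range(3):                                    # synthetic batches for illustration
  inputs = torch.randn(8, 16, device=device)
  target = torch.randint(0, 4, (8,), device=device)
  optimizer.zero_grad()
  with autocast(device):                              # bfloat16 autocast on TPU
    loss = loss_fn(model(inputs), target)
  loss.backward()
  xm.optimizer_step(optimizer)                        # reduces gradients and steps; no scaler
```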

File tree

18 files changed: +17, −300 lines

CONTRIBUTING.md

Lines changed: 0 additions & 6 deletions

@@ -291,12 +291,6 @@ To run the tests, follow __one__ of the options below:
   export PJRT_DEVICE=TPU
   ```
 
-* Run on GPU:
-
-  ```shell
-  export PJRT_DEVICE=CUDA GPU_NUM_DEVICES=${NUM_GPU}
-  ```
-
 For more detail on configuring the runtime, please refer to [this doc](https://github.com/pytorch/xla/blob/master/docs/pjrt.md#quickstart)
 
 If you are planning to be building from source and hence using the latest _PyTorch/TPU_ code base,
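With the GPU option removed, `PJRT_DEVICE=TPU` is the remaining documented way to select the runtime for the tests. As a hedged aside (not part of this diff), the same selection can be made from Python before the runtime initializes:

```python
# Sketch only: Python equivalent of `export PJRT_DEVICE=TPU`, assuming a host
# with a TPU-capable PJRT runtime. Set the variable before torch_xla spins up
# its runtime client.
import os

os.environ["PJRT_DEVICE"] = "TPU"

import torch_xla

print(torch_xla.device())  # e.g. xla:0 once the TPU runtime is available
```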

configuration.yaml

Lines changed: 0 additions & 9 deletions

@@ -15,11 +15,6 @@ variables:
       - Whether or not to create an async PJRT client for the CPU device(s).
     type: bool
     default_value: false
-  PJRT_GPU_ASYNC_CLIENT:
-    description:
-      - Whether or not to create an async PJRT client for the GPU device(s).
-    type: bool
-    default_value: false
   PJRT_TPU_MAX_INFLIGHT_COMPUTATIONS:
     description:
       - Max inflight computations that the PJRT client can handle for TPU.
@@ -229,10 +224,6 @@ variables:
     description:
       - Number of CPU devices being used by this instance of XRT.
     type: int
-  GPU_NUM_DEVICES:
-    description:
-      - Number of GPU devices being used by this instance of XRT.
-    type: int
 debug_variables:
   XLA_FNTRACKER_FILE:
     description:

docs/source/learn/_pjrt.md

Lines changed: 0 additions & 63 deletions

@@ -188,69 +188,6 @@ time. See the [Cloud TPU
 documentation](https://cloud.google.com/tpu/docs/run-in-container) for
 more information.
 
-### GPU
-
-### Single-node GPU training
-
-To use GPUs with PJRT, simply set `PJRT_DEVICE=CUDA` and configure
-`GPU_NUM_DEVICES` to the number of devices on the host. For example:
-
-    PJRT_DEVICE=CUDA GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
-
-You can also use `torchrun` to initiate the single-node multi-GPU
-training. For example,
-
-    PJRT_DEVICE=CUDA torchrun --nnodes 1 --nproc-per-node ${NUM_GPU_DEVICES} xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
-
-In the above example, `--nnodes` means how many machines (physical
-machines or VMs) to be used (it is 1 since we do single-node training).
-`--nproc-per-node` means how many GPU devices to be used.
-
-### Multi-node GPU training
-
-**Note that this feature only works for cuda 12+**. Similar to how
-PyTorch uses multi-node training, you can run the command as below:
-
-    PJRT_DEVICE=CUDA torchrun \
-      --nnodes=${NUMBER_GPU_VM} \
-      --node_rank=${CURRENT_NODE_RANK} \
-      --nproc_per_node=${NUMBER_LOCAL_GPU_DEVICES} \
-      --rdzv_endpoint=<internal_ip_address:port> multinode_training.py
-
-- `--nnodes`: how many GPU machines to be used.
-- `--node_rank`: the index of the current GPU machines. The value can
-  be 0, 1, ..., \${NUMBER_GPU_VM}-1.
-- `--nproc_per_node`: the number of GPU devices to be used on the
-  current machine.
-- `--rdzv_endpoint`: the endpoint of the GPU machine with
-  node_rank==0, in the form `host:port`. The `host` will be the
-  internal IP address. The `port` can be any available port on the
-  machine. For single-node training/inference, this parameter can be
-  omitted.
-
-For example, if you want to train on 2 GPU machines: machine_0 and
-machine_1, on the first GPU machine machine_0, run
-
-    # PJRT_DEVICE=CUDA torchrun \
-      --nnodes=2 \
-      --node_rank=0 \
-      --nproc_per_node=4 \
-      --rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
-
-On the second GPU machine, run
-
-    # PJRT_DEVICE=CUDA torchrun \
-      --nnodes=2 \
-      --node_rank=1 \
-      --nproc_per_node=4 \
-      --rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
-
-the difference between the 2 commands above are `--node_rank` and
-potentially `--nproc_per_node` if you want to use different number of
-GPU devices on each machine. All the rest are identical. For more
-information about `torchrun`, please refer to this
-[page](https://pytorch.org/docs/stable/elastic/run.html).
-
 ## Differences from XRT
 
 Although in most cases we expect PJRT and XRT to work mostly

docs/source/perf/amp.md

Lines changed: 0 additions & 53 deletions

@@ -95,59 +95,6 @@ unlisted ops run if they're downstream from autocasted ops.
 
 `stack`, `cat`, `index_copy`
 
-## AMP for XLA:GPU
-
-AMP on XLA:GPU devices reuse Pytorch's AMP rules. See [Pytorch's AMP
-documentation](https://pytorch.org/docs/stable/amp.html) for CUDA
-specific behavior. A simple CUDA AMP example is below:
-
-``` python
-from torch_xla.amp import syncfree
-import torch_xla.core.xla_model as xm
-
-# Creates model and optimizer in default precision
-model = Net().to('xla')
-# Pytorch/XLA provides sync-free optimizers for improved performance
-optimizer = syncfree.SGD(model.parameters(), ...)
-scaler = GradScaler()
-
-for input, target in data:
-  optimizer.zero_grad()
-
-  # Enables autocasting for the forward pass
-  with autocast(torch_xla.device()):
-    output = model(input)
-    loss = loss_fn(output, target)
-
-  # Exits the context manager before backward pass
-  scaler.scale(loss).backward()
-  gradients = xm._fetch_gradients(optimizer)
-  xm.all_reduce('sum', gradients, scale=1.0 / xr.world_size())
-  scaler.step(optimizer)
-  scaler.update()
-```
-
-`autocast(torch_xla.device())` aliases `torch.cuda.amp.autocast()` when the
-XLA Device is a CUDA device (XLA:GPU). Alternatively, if a script is
-only used with CUDA devices, then `torch.cuda.amp.autocast` can be
-directly used, but requires `torch` is compiled with `cuda` support for
-datatype of `torch.bfloat16`. We recommend using
-`autocast(torch_xla.device())` on XLA:GPU as it does not require
-`torch.cuda` support for any datatypes, including `torch.bfloat16`.
-
-### AMP for XLA:GPU Best Practices
-
-1. `autocast` should wrap only the forward pass(es) and loss
-   computation(s) of the network. Backward ops run in the same type
-   that autocast used for the corresponding forward ops.
-2. Do not set `XLA_USE_F16` flag when using AMP on Cuda devices. This
-   will override the per-operator precision settings provided by AMP
-   and cause all operators to execute in float16.
-3. Use gradient scaling to prevent float16 gradients from underflowing.
-4. Pytorch/XLA provides modified version of
-   [optimizers](https://github.com/pytorch/xla/tree/master/torch_xla/amp/syncfree)
-   that avoid the additional sync between device and host.
-
 ## Examples
 
 Our [mnist training script](https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist_amp.py)

test/test_autocast.py

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@
 import collections
 import unittest
 from torch.testing._internal.autocast_test_lists import AutocastTestLists
-from torch_xla.amp import autocast, GradScaler
+from torch_xla.amp import autocast
 
 
 class AutocastTPUTestLists:

test/test_train_mp_imagenet_amp.py

Lines changed: 1 addition & 3 deletions

@@ -67,7 +67,7 @@
 import torch_xla.utils.utils as xu
 import torch_xla.core.xla_model as xm
 import torch_xla.test.test_utils as test_utils
-from torch_xla.amp import autocast, GradScaler
+from torch_xla.amp import autocast
 try:
   from torch_xla.amp import syncfree
 except ImportError:
@@ -220,8 +220,6 @@ def train_imagenet():
   if FLAGS.amp:
     if device_hw == 'TPU':
       scaler = None
-    elif device_hw == 'CUDA':
-      scaler = GradScaler(use_zero_grad=FLAGS.use_zero_grad)
 
   def train_loop_fn(loader, epoch):
     tracker = xm.RateTracker()

test/test_train_mp_mnist_amp.py

Lines changed: 2 additions & 5 deletions

@@ -38,7 +38,7 @@
 import torch_xla.core.xla_model as xm
 import torch_xla.distributed.xla_multiprocessing as xmp
 import torch_xla.test.test_utils as test_utils
-from torch_xla.amp import autocast, GradScaler
+from torch_xla.amp import autocast
 try:
   from torch_xla.amp import syncfree
 except ImportError:
@@ -143,11 +143,8 @@ def train_mnist(flags, **kwargs):
 
   if device_hw == 'TPU':
     scaler = None
-  elif device_hw == 'CUDA':
-    # GradScaler only used for GPU
-    scaler = GradScaler(use_zero_grad=FLAGS.use_zero_grad)
   else:
-    print("Only TPU or GPU supported for AMP.")
+    print("Only TPU supported for AMP.")
     sys.exit(1)
 
   def train_loop_fn(loader):

torch_xla/_internal/pjrt.py

Lines changed: 1 addition & 1 deletion

@@ -205,7 +205,7 @@ def spawn(fn: Callable,
     return _run_singleprocess(spawn_fn)
   elif nprocs is not None:
     raise ValueError(
-        'Unsupported nprocs (%d). Please use nprocs=1 or None (default). If None, spawn will use all available devices. Use the environment variable X_NUM_DEVICES (where X is CPU, GPU, TPU, NEURONCORE, etc) to limit the number of devices used.'
+        'Unsupported nprocs (%d). Please use nprocs=1 or None (default). If None, spawn will use all available devices. Use the environment variable X_NUM_DEVICES (where X is CPU, TPU, NEURONCORE, etc) to limit the number of devices used.'
         % nprocs)
 
   run_multiprocess(spawn_fn, start_method=start_method)
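The updated message still points at the `X_NUM_DEVICES` family of variables. Below is a hedged sketch of how that interacts with the public `xmp.spawn` wrapper, which dispatches into this `spawn`; the four-device limit is an illustrative assumption:

```python
# Sketch: nprocs is left at its default of None, so one process is spawned per
# visible device; TPU_NUM_DEVICES (or CPU_NUM_DEVICES, etc.) caps that count.
import os

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
  # Each spawned process is pinned to a single XLA device.
  print(index, xm.xla_device())


if __name__ == '__main__':
  os.environ.setdefault('TPU_NUM_DEVICES', '4')  # assumed limit for illustration
  xmp.spawn(_mp_fn)
```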

torch_xla/amp/__init__.py

Lines changed: 0 additions & 1 deletion

@@ -1,2 +1 @@
 from .autocast_mode import autocast  # noqa: F401
-from .grad_scaler import GradScaler  # noqa: F401

torch_xla/amp/autocast_mode.py

Lines changed: 4 additions & 63 deletions

@@ -10,8 +10,7 @@ class autocast(torch.amp.autocast_mode.autocast):
   r"""
   `torch.autocast` for XLA backend devices. See :class:`torch.autocast`.
   ``torch_xla.amp.autocast(device, **kwargs)`` is equivalent to
-  ``torch.autocast("xla", **kwargs)`` for XLA:GPU and XLA:TPU for dtype torch.bfloat16,
-  ``torch.autocast("cuda", **kwargs)`` for XLA:GPU and other dtypes.
+  ``torch.autocast("xla", **kwargs)`` for XLA:TPU for dtype torch.bfloat16.
   """
 
   def __init__(self,
@@ -20,34 +19,11 @@ def __init__(self,
                dtype: torch.dtype = None,
                cache_enabled: bool = True):
     # `torch_xla.amp.autocast` is intended for XLA backend, with AutocastXLA dispatch key.
-    assert 'xla' in device.__str__(
-    ), "torch_xla.autocast is available for XLA:TPU, XLA:GPU"
+    assert 'xla' in str(device), "torch_xla.autocast is available for XLA:TPU"
 
     self._enabled = enabled
     self._xla_device = xm.xla_device_hw(device)
-    if self._xla_device == 'CUDA':
-      backend = 'cuda'
-      self._xla_bfloat16 = False  # True if xla backend with bfloat16 dtype.
-      if dtype is None:
-        dtype = torch.float16
-      elif dtype == torch.bfloat16 and not torch.cuda.is_available():
-        if xr.is_bf16_supported():
-          # XLA:GPU with bfloat16 should run on `xla` backend
-          # unless torch.autocast is compiled with cuda.
-          backend = 'xla'
-          self._xla_bfloat16 = True
-        else:
-          # This has been the default behavior for unsupported bfloat16 dtype
-          dtype = torch.float16
-          error_message = "In XLA:GPU autocast, but bfloat16 is not supported on this HW.\n"
-          error_message += ("Using the default cuda autocast dtype float16.")
-      self._dtype = dtype
-      super().__init__(
-          backend,
-          enabled=enabled,
-          dtype=self._dtype,
-          cache_enabled=cache_enabled)
-    elif self._xla_device == 'TPU' or self._xla_device == 'NEURON':
+    if self._xla_device == 'TPU' or self._xla_device == 'NEURON':
       if dtype is None:
         dtype = torch.bfloat16
       if dtype != torch.bfloat16:
@@ -63,39 +39,4 @@ def __init__(self,
           dtype=self._dtype,
           cache_enabled=cache_enabled)
     else:
-      print(
-          'Warning: AMP only supported for XLA:TPU and XLA:GPU. Ignoring autocast.'
-      )
-
-  def __enter__(self):
-    # This ensures that xla autocast is enabled even for XLA:GPU, which calls
-    # `torch.amp.autocast_mode.autocast` with `cuda` backend.
-    if self._xla_device == 'CUDA':
-      self.prev = torch.is_autocast_xla_enabled()  # type: ignore[attr-defined]
-      self.prev_dtype = torch.get_autocast_xla_dtype(
-      )  # type: ignore[attr-defined]
-      if self._xla_bfloat16:
-        # autocast_xla flags will be set by `torch.autocast` and we need to
-        # set autocast flags as we call into `torch.autocast` apis.
-        torch.set_autocast_enabled(self._enabled)
-        torch.set_autocast_gpu_dtype(self._dtype)
-      else:
-        torch.set_autocast_xla_enabled(self._enabled)
-        torch.set_autocast_xla_dtype(self._dtype)
-    return super().__enter__()
-
-  def __exit__(self, exc_type: Any, exc_val: Any,
-               exc_tb: Any):  # type: ignore[override]
-    if self._xla_device == 'CUDA':
-      if self._xla_bfloat16:
-        # autocast_xla flags will be set by `torch.autocast` and we need to
-        # set autocast flags as we call into `torch.autocast` apis.
-        torch.set_autocast_enabled(self.prev)
-        torch.set_autocast_gpu_dtype(self.prev_dtype)
-      else:
-        torch.set_autocast_xla_enabled(self.prev)
-        torch.set_autocast_xla_dtype(self.prev_dtype)
-    return super().__exit__(exc_type, exc_val, exc_tb)
-
-  def __call__(self, func):
-    return super().__call__(func)
+      print('Warning: AMP only supported for XLA:TPU. Ignoring autocast.')
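After the simplification, `autocast` only accepts XLA devices and, on TPU, `torch.bfloat16` is the sole supported dtype. A short illustrative sketch of the resulting behavior follows; the printed dtype is an expectation based on the TPU branch kept above, not an output captured from this commit:

```python
# Illustrative sketch of the simplified torch_xla.amp.autocast on a TPU host.
import torch
import torch_xla
from torch_xla.amp import autocast

device = torch_xla.device()

with autocast(device):  # defaults to torch.bfloat16 on TPU
  out = torch.randn(4, 8, device=device) @ torch.randn(8, 4, device=device)
print(out.dtype)  # expected: torch.bfloat16

# A non-XLA device now trips the assertion kept in __init__:
#   assert 'xla' in str(device), "torch_xla.autocast is available for XLA:TPU"
```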
