Commit d7c9e50

Fix duplicate labels and other docs build warnings (#9446)
1 parent 4b3d34d commit d7c9e50

File tree: 8 files changed (+22, -18 lines)

docs/source/conf.py
Lines changed: 2 additions & 1 deletion

````diff
@@ -27,7 +27,6 @@
 "sphinx.ext.napoleon",
 "sphinx.ext.viewcode",
 "sphinxcontrib.katex",
-"sphinx.ext.autosectionlabel",
 "sphinx_copybutton",
 # "sphinx_panels",
 # "myst_parser", # Will be activated by myst_nb
@@ -38,6 +37,8 @@
 extensions = pytorch_extensions + [
 "myst_nb"
 ]
+# Automatically generate section anchors for selected heading level
+myst_heading_anchors = 3
 
 # Users must manually execute their notebook cells
 # with the correct hardware accelerator.
````
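
For context, here is a hedged, stripped-down sketch of how these `conf.py` settings fit together after the change. Only the extension names and the `myst_heading_anchors` line come from the hunk above; the shape of `pytorch_extensions` and everything else is an illustrative assumption, not the project's real configuration.

```python
# Hypothetical minimal conf.py fragment (not the real pytorch/xla config).
# sphinx.ext.autosectionlabel is dropped because auto-generated section
# labels are a common source of "duplicate label" build warnings; MyST's
# own heading anchors are enabled instead.
pytorch_extensions = [
    "sphinx.ext.napoleon",
    "sphinx.ext.viewcode",
    "sphinxcontrib.katex",
    "sphinx_copybutton",
]

extensions = pytorch_extensions + [
    "myst_nb",
]

# Automatically generate section anchors for headings up to level 3,
# so pages can cross-reference them as [](#some-heading-slug).
myst_heading_anchors = 3
```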

docs/source/contribute/cpp_debugger.md
Lines changed: 1 addition & 1 deletion

````diff
@@ -54,7 +54,7 @@ We suggest the following steps:
 
 At this point, your PyTorch is built with debugging symbols and ready to debug
 with GDB. However, we recommend debugging with VSCode. For more information, see
-{ref}`Debug with VSCode`.
+[](#debug-with-vscode)
 
 ### Verify your file is built
 
````

docs/source/contribute/plugins.md
Lines changed: 1 addition & 4 deletions

````diff
@@ -45,7 +45,7 @@ you can test with the placeholder `LIBRARY` device type. For example:
 [device(type='xla', index=0), device(type='xla', index=1), device(type='xla', index=2), device(type='xla', index=3)]
 
 To register your device type automatically for users as well as to
-handle extra setup for e.g. multiprocessing, you may implement the
+handle extra setup, for example, multiprocessing, you may implement the
 `DevicePlugin` Python API. PyTorch/XLA plugin packages contain two key
 components:
 
@@ -65,9 +65,6 @@ class CpuPlugin(plugins.DevicePlugin):
 that identifies your `DevicePlugin`. For exmaple, to register the
 `EXAMPLE` device type in a `pyproject.toml`:
 
-```{=html}
-<!-- -->
-```
 [project.entry-points."torch_xla.plugins"]
 example = "torch_xla_cpu_plugin:CpuPlugin"
 
````
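
As a general aside, packages registered under the `torch_xla.plugins` entry-point group shown above can be enumerated with the standard library. This is only a generic sketch of entry-point discovery, not a description of how PyTorch/XLA loads plugins internally:

```python
# Generic sketch: enumerate plugins registered in pyproject.toml under
# [project.entry-points."torch_xla.plugins"].
# Requires Python 3.10+ for the group= keyword argument.
from importlib.metadata import entry_points

for ep in entry_points(group="torch_xla.plugins"):
    plugin_cls = ep.load()  # e.g. resolves "torch_xla_cpu_plugin:CpuPlugin"
    print(f"device type {ep.name!r} -> {plugin_cls!r}")
```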

docs/source/features/pallas.md
Lines changed: 1 addition & 0 deletions

````diff
@@ -95,6 +95,7 @@ output = torch.ops.xla.paged_attention(
 )
 ```
 
+(pallas-integration-example)=
 #### Integration Example
 
 The vLLM TPU integration utilizes [PagedAttention
````

docs/source/learn/_pjrt.md
Lines changed: 12 additions & 7 deletions

````diff
@@ -1,3 +1,7 @@
+---
+orphan: true
+---
+
 # PJRT Runtime
 
 PyTorch/XLA has migrated from the TensorFlow-based XRT runtime to the
@@ -39,7 +43,7 @@ the `runtime` tag.
 per device. On TPU v2 and v3 in PJRT, workloads are multiprocess and
 multithreaded (4 processes with 2 threads each), so your workload
 should be thread-safe. See [Multithreading on TPU
-v2/v3](#multithreading-on-tpu-v2v3) and the [Multiprocessing section
+v2/v3](multithreading-on-tpu-v2v3) and the [Multiprocessing section
 of the API
 guide](https://github.com/pytorch/xla/blob/master/API_GUIDE.md#running-on-multiple-xla-devices-with-multi-processing)
 for more information. Key differences to keep in mind:
@@ -267,7 +271,7 @@ for more information about TPU architecture.
 from .
 - Under XRT, the server process is the only process that interacts
 with the TPU devices, and client processes don't have direct access
-to the TPU devices. When profiling a single-host TPU (e.g. v3-8 or
+to the TPU devices. When profiling a single-host TPU (e.g. v3-8 or
 v4-8), you would normally see 8 device traces (one for each TPU
 core). With PJRT, each process has one chip, and a profile from that
 process will show only 2 TPU cores.
@@ -282,11 +286,12 @@ for more information about TPU architecture.
 each TPU host
 (`[gcloud compute tpus tpu-vm scp](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/tpus/tpu-vm/scp)`)
 and run the code on each host in parallel
-(e.g. `[gcloud compute tpus tpu-vm ssh --workers=all --command="PJRT_DEVICE=TPU python run.py"](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/tpus/tpu-vm/ssh)`)
+(e.g. `[gcloud compute tpus tpu-vm ssh --workers=all --command="PJRT_DEVICE=TPU python run.py"](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/tpus/tpu-vm/ssh)`)
 - `xm.rendezvous` has been reimplemented using XLA-native collective
 communication to enhance stability on large TPU pods. See below for
 more details.
 
+(multithreading-on-tpu-v2v3)=
 ### Multithreading on TPU v2/v3
 
 On TPU v2 and v3, **distributed workloads always run multithreaded**,
@@ -332,7 +337,7 @@ implementation:
 - Because XLA does not permit collective operations to run on a subset
 of workers, all workers must participate in the `rendezvous`.
 
-If you require the old behavior of `xm.rendezvous` (i.e. communicating
+If you require the old behavior of `xm.rendezvous` (i.e. communicating
 data without altering the XLA graph and/or synchronizing a subset of
 workers), consider using `` `torch.distributed.barrier ``
 \<<https://pytorch.org/docs/stable/distributed.html#torch.distributed.barrier>\>[\_\_
@@ -358,7 +363,7 @@ from the PyTorch documentation. Keep in mind these constraints:
 *New in PyTorch/XLA r2.0*
 
 When using PJRT with `torch.distributed` and
-`[torch.nn.parallel.DistributedDataParallel](https://github.com/pytorch/xla/blob/master/docs/ddp.md)`
+`[torch.nn.parallel.DistributedDataParallel](https://github.com/pytorch/xla/blob/master/docs/source/perf/ddp.md)`
 we strongly recommend using the new `xla://` `init_method`, which
 automatically finds the replica IDs, world size, and master IP by
 querying the runtime. For example:
@@ -398,9 +403,9 @@ Note: For TPU v2/v3, you still need to import
 `torch.distributed` is still experimental.
 
 For more information about using `DistributedDataParallel` on
-PyTorch/XLA, see [ddp.md](./ddp.md) on TPU V4. For an example that uses
+PyTorch/XLA, see [ddp.md](../perf/ddp.md) on TPU V4. For an example that uses
 DDP and PJRT together, run the following [example
-script](../test/test_train_mp_imagenet.py) on a TPU:
+script](../../../test/test_train_mp_imagenet.py) on a TPU:
 
 ``` bash
 PJRT_DEVICE=TPU python xla/test/test_train_mp_mnist.py --ddp --pjrt_distributed --fake_data --num_epochs 1
````
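
To make the `xla://` `init_method` mentioned in this file concrete, here is a hedged sketch of DDP over PJRT. It is written against the public PyTorch/XLA API from memory (the imports, `xmp.spawn`, and the `"xla"` backend) rather than taken from this commit or from `test_train_mp_imagenet.py`, so treat it as illustrative only.

```python
# Hedged sketch: DistributedDataParallel over PJRT with init_method="xla://".
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401  registers the "xla" backend
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Replica ID, world size, and master IP are discovered from the runtime.
    dist.init_process_group("xla", init_method="xla://")

    device = xm.xla_device()
    model = nn.Linear(128, 10).to(device)
    ddp_model = DDP(model, gradient_as_bucket_view=True)

    # ... build the optimizer and run the training loop here ...


if __name__ == "__main__":
    xmp.spawn(_mp_fn)
```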

docs/source/learn/pytorch-on-xla-devices.md
Lines changed: 1 addition & 1 deletion

````diff
@@ -103,7 +103,7 @@ XLA. The model definition, dataloader, optimizer and training loop can
 work on any device. The only XLA-specific code is a couple lines that
 acquire the XLA device and materializing the tensors. Calling `torch_xla.sync()`
 at the end of each training iteration causes XLA to execute its current
-graph and update the model's parameters. See {ref}`XLA Tensor Deep Dive`
+graph and update the model's parameters. See [](#xla-tensor-deep-dive)
 for more on how XLA creates graphs and runs
 operations.
 
````
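
As a rough illustration of the sentence being edited here: a single-device XLA training step only needs to acquire the XLA device and call `torch_xla.sync()` at the end of each iteration. The sketch below uses a placeholder model and random data and assumes the public API names `xm.xla_device()` and `torch_xla.sync()`; it is not the example from pytorch-on-xla-devices.md itself.

```python
# Minimal sketch of an XLA training loop with placeholder model and data.
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # acquire the XLA device
model = nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    data = torch.randn(16, 8, device=device)
    target = torch.randint(0, 2, (16,), device=device)

    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()

    torch_xla.sync()  # execute the traced graph and materialize updated parameters
```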

docs/source/learn/troubleshoot.md
Lines changed: 2 additions & 2 deletions

````diff
@@ -164,7 +164,7 @@ disable execution analysis by `PT_XLA_DEBUG_LEVEL=1`). To use
 PyTorch/XLA efficiently, we expect the same models code to be run for
 every step and compilation only happen once for every graph. If you keep
 seeing `Compilation Cause`, you should try to dump the IR/HLO following
-{ref}`Common Debugging Environment Variables Combinations` and
+[](#common-debugging-environment-variables-combinations) and
 compare the graphs for each step and understand the source of the
 differences.
 
@@ -313,7 +313,7 @@ If your model shows bad performance, keep in mind the following caveats:
 *Solution*:
 
 - For most ops we can lower them to XLA to fix it. Checkout
-{ref}`Get A Metrics Report` to find out the
+[](#get-a-metrics-report) to find out the
 missing ops and open a feature request on
 [GitHub](https://github.com/pytorch/xla/issues).
 
````
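
For reference, the metrics report that this cross-reference points to can be dumped from Python. This is a sketch based on the publicly documented `torch_xla.debug.metrics` module, not code from this diff:

```python
# Sketch: inspect PyTorch/XLA metrics to find aten:: ops that are missing
# an XLA lowering (they show up as CPU-fallback counters).
import torch_xla.debug.metrics as met

# Run your model first, then inspect the collected counters and metrics.
print(met.counter_names())
print(met.metrics_report())
```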

docs/source/perf/spmd_advanced.md
Lines changed: 2 additions & 2 deletions

````diff
@@ -3,7 +3,7 @@
 This guide covers advanced topics with SPMD. Please read the
 [SPMD user guide](https://github.com/pytorch/xla/blob/master/docs/spmd_basic.md) as a prerequisite.
 
-### Sharding-Aware Host-to-Device Data Loading
+## Sharding-Aware Host-to-Device Data Loading
 
 SPMD takes a single-device program, shards it, and executes it in parallel.
 
@@ -38,7 +38,7 @@ train_loader = pl.MpDeviceLoader(
 )
 ```
 
-### Virtual device optimization
+## Virtual device optimization
 
 PyTorch/XLA normally transfers tensor data asynchronously from host to device once the tensor is defined. This is to overlap the data transfer with the graph tracing time. However, because SPMD allows the user to modify the tensor sharding _after _the tensor has been defined, we need an optimization to prevent unnecessary transfer of tensor data back and forth between host and device. We introduce Virtual Device Optimization, a technique to place the tensor data on a virtual device SPMD:0 first, before uploading to the physical devices when all the sharding decisions are finalized. Every tensor data in SPMD mode is placed on a virtual device, SPMD:0. The virtual device is exposed to the user as an XLA device XLA:0 with the actual shards on physical devices, like TPU:0, TPU:1, etc.
 
````
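
The `pl.MpDeviceLoader(` fragment in the second hunk belongs to the sharding-aware data loading example. A hedged sketch of that pattern follows; the names it relies on (`xr.use_spmd`, `xs.Mesh`, `xs.ShardingSpec`, the `input_sharding` argument) come from the public SPMD API as recalled, not from this commit, so check them against the SPMD guides before relying on them.

```python
# Hedged sketch: shard the batch dimension during host-to-device transfer
# so each device only receives its own slice of the input.
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.spmd as xs
import torch_xla.runtime as xr

xr.use_spmd()

num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices, 1), ("data", "model"))

# Placeholder dataset standing in for a real training set.
dataset = TensorDataset(torch.randn(512, 8), torch.zeros(512, dtype=torch.long))
cpu_loader = DataLoader(dataset, batch_size=64)

# Shard inputs along the "data" mesh axis while they are transferred.
train_loader = pl.MpDeviceLoader(
    cpu_loader,
    xm.xla_device(),
    input_sharding=xs.ShardingSpec(mesh, ("data", None)),
)
```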
