docs/guides/megascale_hang_playbook.md
3 additions & 3 deletions
@@ -29,12 +29,12 @@ This message will often provide the potential cause of a hang. Please provide Go
## Common Issues

-### 1. Inconsistent TPU Programs
+### 1. Fingerprint mismatch

-Occasionally, different programs can run on TPU workers within the same system. This can lead to errors. Search your logs for a message like the following:
+Occasionally, an HLO module can be compiled differently across TPU workers within the same system. This can lead to errors. Search your logs for a message like the following:

```
-Megascale detects a hang that is likely caused by inconsistent TPU programs. This can be caused by some workers running with different JIT functions or a bug in the XLA compiler. Please inspect the HLO dumps to confirm the root cause.
+Megascale detects a hang that is likely caused by inconsistent HLO module compilation across workers. This can be caused by some workers running with different JIT functions or a bug in the XLA compiler. Please inspect the HLO dumps to confirm the root cause.

Example hosts that have different HLO fingerprints: ...