Commit 50e896e
fix: Prevent DEADLINE_EXCEEDED errors in multi-slice TPU jobs (#1166)
Multi-slice TPU tasks were failing with `DEADLINE_EXCEEDED` and `Heartbeat` timeouts. Log analysis showed a 327 s compilation window, which exceeded the default JAX RPC timeout, so nodes were dropped before the training loop started.
1 parent d0d00f7 commit 50e896e
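The failure described above comes down to a timeout set below the longest observed startup phase. As a minimal sketch, and not the change made in this commit, the helper below derives a timeout from an observed compilation time plus some headroom. The function name `timeout_with_headroom`, the 1.5x factor, and the 60-second rounding step are illustrative assumptions.

```python
import math


def timeout_with_headroom(observed_seconds: float,
                          headroom: float = 1.5,
                          granularity: int = 60) -> int:
    """Return a timeout (in seconds) comfortably above an observed duration.

    Hypothetical helper: scales the observed duration by `headroom`, then
    rounds up to the next multiple of `granularity` seconds.
    """
    if observed_seconds <= 0:
        raise ValueError("observed_seconds must be positive")
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    scaled = observed_seconds * headroom
    return math.ceil(scaled / granularity) * granularity


# 327 s is the compilation window reported in the commit message.
suggested_timeout = timeout_with_headroom(327)
print(suggested_timeout)
```

The resulting value would then be fed to whichever timeout setting the job actually uses. For instance, `jax.distributed.initialize` accepts an `initialization_timeout` argument in seconds, though whether that is the setting involved here is an assumption.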
1 file changed, +2 -0 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 107 | 107 | |
| 108 | 108 | |
| 109 | 109 | |
| | 110 | + |
| | 111 | + |
| 110 | 112 | |
| 111 | 113 | |
| 112 | 114 | |