Commit f9ad9f5

fix: Extend the timeout for maxtext_convergence (#1178)
The convergence test that uses the grain dataset frequently exceeds the previous 5-hour limit. This change extends the timeout for that test to 6 hours so the DAG can finish without manual intervention; the other convergence tests keep the 5-hour limit.
1 parent 0533d4d commit f9ad9f5

1 file changed: +4 -1 lines changed


dags/multipod/maxtext_convergence.py

Lines changed: 4 additions & 1 deletion
@@ -84,9 +84,12 @@
 
 sequential_tests = []
 for test_name, run_command in convergence_tests.items():
+  # The grain dataset takes longer to run, so we give it a longer timeout. The other tests are expected to complete within 5 hours.
+  timeout_in_min = 360 if test_name == "maxtext-convergence-grain" else 300
+
   test_task = gke_config.get_gke_config(
       cluster=XpkClusters.TPU_V6E_256_MLPERF_CLUSTER,
-      time_out_in_min=300,
+      time_out_in_min=timeout_in_min,
       test_name=test_name,
       run_model_cmds=run_command,
       docker_image=DockerImage.MAXTEXT_TPU_JAX_STABLE.value,
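
The inline conditional above handles a single exception. If more tests ever need their own limits, a lookup table with a default might scale better. The following is a minimal sketch of that alternative and is not part of this commit: `DEFAULT_TIMEOUT_IN_MIN`, `TIMEOUT_OVERRIDES_IN_MIN`, and `timeout_for` are hypothetical names, and only the `maxtext-convergence-grain` entry and the 300/360 minute values come from the diff.

```python
# Hypothetical helper, not part of the commit: per-test timeout overrides.

# 5 hours; the limit the diff keeps for every test other than grain.
DEFAULT_TIMEOUT_IN_MIN = 300

# Tests that need a longer limit. Only the grain entry comes from the commit.
TIMEOUT_OVERRIDES_IN_MIN = {
    "maxtext-convergence-grain": 360,  # 6 hours
}


def timeout_for(test_name: str) -> int:
    """Return the timeout in minutes for a convergence test.

    Falls back to DEFAULT_TIMEOUT_IN_MIN when the test has no override.
    """
    return TIMEOUT_OVERRIDES_IN_MIN.get(test_name, DEFAULT_TIMEOUT_IN_MIN)
```

Under these assumptions, the loop body would pass `time_out_in_min=timeout_for(test_name)` instead of computing the value inline. For one special case, the conditional in the commit is arguably simpler.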
