Commit a841512

Enable offline logging to wandb for MAST jobs (meta-pytorch#593)
1 parent ffc7a24 commit a841512

7 files changed: +18 -11 lines changed

.meta/mast/README.md

Lines changed: 8 additions & 0 deletions
@@ -119,3 +119,11 @@ This ensures that when MAST runs with `HF_HUB_OFFLINE=1`, the transformers libra
 Both cache and model files are stored under:
 - **Cache**: `/mnt/wsfuse/teamforge/hf` (set via `HF_HOME`)
 - **Model weights**: `/mnt/wsfuse/teamforge/hf/<model_name>`
+
+### Wandb Logs
+Wandb logs will be stored under `/mnt/wsfuse/teamforge/wandb`. The latest run will be stored under `/mnt/wsfuse/teamforge/wandb/latest-run`.
+
+To sync to wandb from a devserver with internet access, run:
+```bash
+wandb sync -p grpo-training /mnt/wsfuse/teamforge/wandb/latest-run
+```
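
The README addition only covers syncing `latest-run`. As a minimal sketch of how older offline runs could be synced too, the helper below lists run directories under the wandb root and prints one `wandb sync` command per run. The function names are hypothetical, and the `offline-run-*` directory naming is an assumption about how wandb lays out offline runs; neither comes from this commit.

```python
import shlex
from pathlib import Path

# Values taken from the README text in this commit.
WANDB_ROOT = Path("/mnt/wsfuse/teamforge/wandb")
PROJECT = "grpo-training"


def build_sync_command(run_dir: Path, project: str = PROJECT) -> list[str]:
    """Return the argv for `wandb sync -p <project> <run_dir>`."""
    return ["wandb", "sync", "-p", project, str(run_dir)]


def list_offline_runs(wandb_root: Path = WANDB_ROOT) -> list[Path]:
    """List offline run directories under wandb_root, sorted by name."""
    # Assumption: offline runs are directories named `offline-run-*`.
    return sorted(p for p in wandb_root.glob("offline-run-*") if p.is_dir())


if __name__ == "__main__":
    # Print the commands rather than running them, so they can be reviewed first.
    for run_dir in list_offline_runs():
        print(shlex.join(build_sync_command(run_dir)))
```

Printing the commands instead of executing them keeps the sketch safe to run on a host without internet access; the output can be pasted into a shell on the devserver.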

.meta/mast/qwen3_14b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    project: "grpo-training"
-    group: "grpo_exp_${oc.env:USER}"
+    mode: offline
+    dir: /mnt/wsfuse/teamforge/
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

.meta/mast/qwen3_1_7b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    project: "grpo-training"
-    group: "grpo_exp_${oc.env:USER}"
+    mode: offline
+    dir: /mnt/wsfuse/teamforge/
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

.meta/mast/qwen3_32b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    project: "grpo-training"
-    group: "grpo_exp_${oc.env:USER}"
+    mode: offline
+    dir: /mnt/wsfuse/teamforge/
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

.meta/mast/qwen3_4b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    project: "grpo-training"
-    group: "grpo_exp_${oc.env:USER}"
+    mode: offline
+    dir: /mnt/wsfuse/teamforge/
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

.meta/mast/qwen3_8b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    project: "grpo-training"
-    group: "grpo_exp_${oc.env:USER}"
+    mode: offline
+    dir: /mnt/wsfuse/teamforge/
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

src/forge/controller/launcher.py

Lines changed: 0 additions & 1 deletion
@@ -293,7 +293,6 @@ def build_appdef(self) -> specs.AppDef:
 "TORCHDYNAMO_VERBOSE": "1",
 "VLLM_TORCH_COMPILE_LEVEL": "0",
 "VLLM_USE_TRITON_FLASH_ATTN": "0",
-"WANDB_MODE": "offline",
 "HF_HUB_OFFLINE": "1",
 "MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE": "1",
 "TORCHSTORE_RDMA_ENABLED": "1",
