Commit 3738821

Revert "Enable offline logging to wandb for MAST jobs (meta-pytorch#593)" (meta-pytorch#597)
Co-authored-by: Jiyue Wang <[email protected]>
1 parent be18482 commit 3738821

7 files changed: 11 additions, 18 deletions

.meta/mast/README.md

Lines changed: 0 additions & 8 deletions
@@ -119,11 +119,3 @@ This ensures that when MAST runs with `HF_HUB_OFFLINE=1`, the transformers libra
 Both cache and model files are stored under:
 - **Cache**: `/mnt/wsfuse/teamforge/hf` (set via `HF_HOME`)
 - **Model weights**: `/mnt/wsfuse/teamforge/hf/<model_name>`
-
-### Wandb Logs
-Wandb logs will be stored under `/mnt/wsfuse/teamforge/wandb`. The latest run will be stored under `/mnt/wsfuse/teamforge/wandb/latest-run`.
-
-To sync to wandb from a devserver with internet access, run:
-```bash
-wandb sync -p grpo-training /mnt/wsfuse/teamforge/wandb/latest-run
-```

.meta/mast/qwen3_14b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    mode: offline
-    dir: /mnt/wsfuse/teamforge/
+    project: "grpo-training"
+    group: "grpo_exp_${oc.env:USER}"
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

.meta/mast/qwen3_1_7b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    mode: offline
-    dir: /mnt/wsfuse/teamforge/
+    project: "grpo-training"
+    group: "grpo_exp_${oc.env:USER}"
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

.meta/mast/qwen3_32b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    mode: offline
-    dir: /mnt/wsfuse/teamforge/
+    project: "grpo-training"
+    group: "grpo_exp_${oc.env:USER}"
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

.meta/mast/qwen3_4b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    mode: offline
-    dir: /mnt/wsfuse/teamforge/
+    project: "grpo-training"
+    group: "grpo_exp_${oc.env:USER}"
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce

.meta/mast/qwen3_8b_mast.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@ rollout_threads: ${services.policy.num_replicas} # Recommended to set equal to
 # Observability configuration
 metric_logging:
   wandb:
-    mode: offline
-    dir: /mnt/wsfuse/teamforge/
+    project: "grpo-training"
+    group: "grpo_exp_${oc.env:USER}"
     logging_mode: global_reduce
   console:
     logging_mode: global_reduce
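
The restored `group` value in the five YAML configs uses the `${oc.env:USER}` interpolation, which OmegaConf resolves from the environment when the value is read. The sketch below is a standard-library stand-in for that one substitution, intended only to show what the group name expands to; `resolve_env_refs` is a hypothetical helper, not part of OmegaConf or forge, and it ignores defaults and nested interpolations.

```python
import os
import re

# Matches only the plain `${oc.env:NAME}` form used in these configs.
# Defaults and nested interpolations, which OmegaConf supports, are not handled.
_ENV_PATTERN = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(value: str, environ=None) -> str:
    """Replace each `${oc.env:NAME}` with the value of NAME (hypothetical helper)."""
    env = os.environ if environ is None else environ

    def _lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Choice made for this sketch: fail loudly instead of substituting "".
            raise KeyError(f"environment variable {name!r} is not set")
        return env[name]

    return _ENV_PATTERN.sub(_lookup, value)


if __name__ == "__main__":
    # With USER=alice the group becomes "grpo_exp_alice".
    print(resolve_env_refs("grpo_exp_${oc.env:USER}", {"USER": "alice"}))
```

Under this assumption, runs launched by different users land in different wandb groups within the same `grpo-training` project.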

src/forge/controller/launcher.py

Lines changed: 1 addition & 0 deletions
@@ -293,6 +293,7 @@ def build_appdef(self) -> specs.AppDef:
 "TORCHDYNAMO_VERBOSE": "1",
 "VLLM_TORCH_COMPILE_LEVEL": "0",
 "VLLM_USE_TRITON_FLASH_ATTN": "0",
+"WANDB_MODE": "offline",
 "HF_HUB_OFFLINE": "1",
 "MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE": "1",
 "TORCHSTORE_RDMA_ENABLED": "1",
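
With this revert, offline mode is again selected through the `WANDB_MODE` environment variable (which wandb reads at startup) rather than through a `mode:` key in each YAML file. The sketch below assumes the mapping in `build_appdef` ends up as the job's process environment; `JOB_ENV`, `child_env`, and `read_in_child` are hypothetical names used only to illustrate that a child process sees the value.

```python
import os
import subprocess
import sys

# Hypothetical subset of the mapping shown in the diff; the real dict has more keys.
JOB_ENV = {
    "WANDB_MODE": "offline",
    "HF_HUB_OFFLINE": "1",
}


def child_env(overrides: dict) -> dict:
    """Return a copy of the current environment with the overrides applied."""
    env = dict(os.environ)
    env.update(overrides)
    return env


def read_in_child(name: str, overrides: dict) -> str:
    """Start a fresh interpreter and report what it sees for `name`."""
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}, ''))"],
        env=child_env(overrides),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(read_in_child("WANDB_MODE", JOB_ENV))
```

Setting the mode once in the launcher keeps the per-model configs identical apart from model-specific values, at the cost of the mode no longer being visible in the YAML itself.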
