
Commit cf55407

Enable MAST client mode (#405)

1 parent fd33e3a commit cf55407

14 files changed: +484, -205 lines

.meta/mast/README.md
Lines changed: 80 additions & 1 deletion

@@ -21,7 +21,7 @@ The `env_setup.sh` script will automatically:
 chmod +x .meta/mast/env_setup.sh
 
 # Run the setup
-./.meta/mast/env_setup.sh
+source .meta/mast/env_setup.sh
 
 ```
 
@@ -44,3 +44,82 @@ The launch script will automatically:
 - Launch the MAST job with the specified config
 
 You can run it from anywhere, and it will figure out the correct paths.
+
+
+## How MAST Launcher Works
+
+The MAST launcher uses a two-stage architecture to run training jobs:
+
+### Stage 1: Detached Mode (Local Machine)
+
+When you run `./.meta/mast/launch.sh`, the `main.py` script starts in **detached mode**:
+
+1. The launcher creates a MAST job with all the worker roles (GPU hosts)
+2. It also creates a special **client role**, a CPU-only role that will run inside MAST
+3. The client role's entrypoint is set to `client_bootstrap.sh`
+4. All CLI arguments you pass are forwarded to the client role
+
+At this point, the job is submitted to MAST and your local script exits. Everything now runs in the cluster.
+
+### Stage 2: Remote Mode (Inside MAST)
+
+The `client_bootstrap.sh` script runs inside the MAST client role and:
+
+1. Calls `main.py` again, but now with `--mode=remote`
+2. In **remote mode**, the script:
+   - Mounts the OilFS workspace
+   - Initializes the provisioner to connect to worker roles
+   - Runs the actual training workload (e.g., GRPO)
+
+This architecture allows the entire training workflow to run inside MAST without requiring a persistent connection from your local machine.
+
+### Key Files
+
+- **`main.py`**: Entry point that handles both detached and remote modes
+- **`client_bootstrap.sh`**: Entrypoint for the client role in MAST
+- **`launcher.py`**: Creates the MAST job specification and handles role configuration
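The detached/remote flow above can be sketched in miniature. This is an illustrative stand-in, not the real `main.py`: the MAST launcher and provisioner APIs are internal, so `launch_detached` and `run_remote` are stubs that only show the dispatch.

```python
# Minimal sketch of the two-stage dispatch in main.py; the MAST launcher
# and provisioner APIs are internal, so they are stubbed out here.
import argparse


def launch_detached(client_args):
    # Stage 1: build a MAST job spec with the worker roles plus a CPU-only
    # client role whose entrypoint is client_bootstrap.sh, forwarding argv.
    return "submitted job; client will run with: " + " ".join(client_args)


def run_remote(workload_args):
    # Stage 2: inside the client role -- mount OilFS, connect the
    # provisioner to the worker roles, then run the training workload.
    return "running workload with: " + " ".join(workload_args)


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["detached", "remote"], default="detached")
    args, rest = parser.parse_known_args(argv)
    if args.mode == "remote":
        return run_remote(rest)
    # In detached mode, re-invoke ourselves remotely with --mode=remote.
    return launch_detached(["--mode=remote", *rest])


print(main(["--config", "grpo.yaml"]))
```

The key point is that the same entry point runs twice: once on your machine (submitting the job and exiting) and once inside the cluster with `--mode=remote`.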
+
+
+## Managing HuggingFace Models in MAST
+
+### The Problem: No Internet Access
+
+MAST compute nodes cannot access the internet, which means they cannot download models directly from HuggingFace. To work around this, we store all HuggingFace models and cache data on OilFS at `/mnt/wsfuse/teamforge/hf`, which is accessible from MAST.
+
+### Solution: Two-Step Process
+
+You need to perform both steps below to ensure models work correctly in MAST:
+
+#### 1. Download Model Weights to OilFS
+
+First, download the model weights directly to the OilFS path. This should be done from a machine with internet access (like your devserver):
+
+```bash
+# Set HF_HOME to the OilFS path
+export HF_HOME=/mnt/wsfuse/teamforge/hf
+
+# Download the model (replace with your desired model)
+huggingface-cli download Qwen/Qwen3-8B --local-dir /mnt/wsfuse/teamforge/hf_artifacts/qwen3_8b
+```
+
+#### 2. Hydrate the HuggingFace Cache
+
+After downloading the weights, you need to hydrate the HuggingFace cache so that the transformers library can find the model metadata:
+
+```bash
+# Set HF_HOME to the OilFS path
+export HF_HOME=/mnt/wsfuse/teamforge/hf
+
+# Hydrate the cache for the model
+python .meta/mast/hydrate_cache.py --model-id Qwen/Qwen3-8B
+```
+
+This ensures that when MAST runs with `HF_HUB_OFFLINE=1`, the transformers library can locate all necessary files from the cache.
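A minimal sketch of the environment setup this offline flow relies on; the path is the OilFS location used above, and the commented-out transformers call assumes the cache has already been hydrated:

```python
# Sketch of the environment configuration the offline flow depends on.
# The default path matches the OilFS location used in this README.
import os


def configure_offline_hf(hf_home="/mnt/wsfuse/teamforge/hf"):
    """Point transformers at the OilFS cache and forbid network access."""
    os.environ["HF_HOME"] = hf_home
    os.environ["HF_HUB_OFFLINE"] = "1"
    return {k: os.environ[k] for k in ("HF_HOME", "HF_HUB_OFFLINE")}


env = configure_offline_hf()
print(env)

# With the cache hydrated, loading then resolves entirely from HF_HOME:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```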
+
+### Directory Structure
+
+Both cache and model files are stored under:
+- **Cache**: `/mnt/wsfuse/teamforge/hf` (set via `HF_HOME`)
+- **Model weights**: `/mnt/wsfuse/teamforge/hf_artifacts/<model_name>`
+
+Make sure your MAST config files point to the correct paths in `hf_artifacts`.

.meta/mast/client_bootstrap.sh
Lines changed: 51 additions & 0 deletions

@@ -0,0 +1,51 @@
+#!/bin/bash
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+# Bootstrap script for the MAST client role.
+# This script sets up the environment and launches the client training script.
+
+set -eEx
+
+LIBCUDA="/usr/local/fbcode/platform010/lib/libcuda.so"
+if [ -f "$LIBCUDA" ]; then
+    export LIBCUDA_DIR="${LIBCUDA%/*}"
+    export TRITON_LIBCUDA_PATH="$LIBCUDA_DIR"
+    export LD_PRELOAD="$LIBCUDA:/usr/local/fbcode/platform010/lib/libnvidia-ml.so${PRELOAD_PATH:+:$PRELOAD_PATH}"
+fi
+
+# Also add the path to the torch libs: in the monarch dev workflow we don't
+# install torch into the env, so we need to make sure the binaries can find
+# libtorch and friends on MAST, and the rpaths set during a dev install will
+# be wrong on MAST.
+export LD_LIBRARY_PATH="${CONDA_DIR}/lib:${CONDA_DIR}/lib/python3.10/site-packages/torch/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$TORCHX_RUN_PYTHONPATH"
+
+# shellcheck disable=SC1091
+if [ -n "$CONDA_PREFIX" ]; then
+    echo "A conda environment is already activated: $CONDA_DEFAULT_ENV"
+else
+    # Disable command printing to avoid log spew.
+    set +x
+    source "${CONDA_DIR}/bin/activate"
+    # Re-enable command printing after conda activation.
+    set -x
+fi
+
+if [ -z "$WORKSPACE_DIR" ] || [ ! -d "$WORKSPACE_DIR" ]; then
+    WORKSPACE_DIR="$CONDA_PREFIX"
+fi
+
+cd "$WORKSPACE_DIR/forge"
+
+export WANDB_MODE=offline
+export HF_HUB_OFFLINE=1
+export MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE=1
+export TORCHSTORE_RDMA_ENABLED=1
+export HF_HOME=/mnt/wsfuse/teamforge/hf
+
+# Execute the client training script with all passed arguments
+exec python -X faulthandler .meta/mast/main.py "$@"

.meta/mast/env_setup.sh
Lines changed: 13 additions & 63 deletions

@@ -7,10 +7,9 @@
 # LICENSE file in the root directory of this source tree.
 
 # setup_forge_env.sh - Setup conda environment and install forge with mounting
-set -e # Exit on any error
 
 # Configuration
-CONDA_ENV_NAME="forge:stable"
+CONDA_ENV_NAME="forge:41468b33a03eaf2bf5b44517f418028a"
 
 # Colors for output
 RED='\033[0;31m'

@@ -109,8 +108,6 @@ fi
 # Define paths
 FBSOURCE_PATH="/data/users/$USER/fbsource"
 CONDA_SCRIPT_PATH="$FBSOURCE_PATH/genai/xlformers/dev/xl_conda.sh"
-FORGE_BASE_DIR="/data/users/$USER"
-FORGE_REPO_DIR="$FORGE_BASE_DIR/forge"
 
 # Workspace URL for mounting
 WORKSPACE_URL="ws://ws.ai.pci0ai/genai_fair_llm"

@@ -143,63 +140,12 @@ fi
 
 log_info "Conda environment activated successfully"
 
-# Step 3: Create and navigate to forge base directory
-log_info "Step 3: Setting up forge directory..."
-if [ ! -d "$FORGE_BASE_DIR" ]; then
-    log_info "Creating forge base directory: $FORGE_BASE_DIR"
-    mkdir -p "$FORGE_BASE_DIR"
-fi
-
-cd "$FORGE_BASE_DIR"
-log_info "Changed to directory: $(pwd)"
-
-# Step 4: Clone or update forge repository
-log_info "Step 4: Setting up forge git repository..."
-if [ -d "$FORGE_REPO_DIR" ]; then
-    log_warn "Forge repository already exists at: $FORGE_REPO_DIR"
-    cd "$FORGE_REPO_DIR"
-
-    if [ -d ".git" ]; then
-        log_info "Updating existing repository..."
-        git fetch origin
-        if [ $? -eq 0 ]; then
-            log_info "Repository updated successfully"
-        else
-            log_warn "Failed to fetch updates, continuing with existing code"
-        fi
-    else
-        log_error "Directory exists but is not a git repository"
-        log_info "Removing directory and cloning fresh..."
-        cd "$FORGE_BASE_DIR"
-        rm -rf "$FORGE_REPO_DIR"
-        git clone git@github.com:meta-pytorch/forge.git
-        if [ $? -ne 0 ]; then
-            log_error "Failed to clone forge repository"
-            exit 1
-        fi
-        cd "$FORGE_REPO_DIR"
-    fi
-else
-    log_info "Cloning forge repository..."
-    git clone git@github.com:meta-pytorch/forge.git
-    if [ $? -ne 0 ]; then
-        log_error "Failed to clone forge repository"
-        log_error "Please ensure:"
-        log_error "1. You have SSH access to github.com"
-        log_error "2. Your SSH key is added to GitHub"
-        log_error "3. You have access to meta-pytorch/forge repository"
-        exit 1
-    fi
-    cd "$FORGE_REPO_DIR"
-fi
-
-log_info "Current directory: $(pwd)"
 
-# Step 5: Install torchtitan
-log_info "Step 5: Installing torchtitan..."
+# Step 3: Install torchtitan
+log_info "Step 3: Installing torchtitan..."
 
 # Source versions.sh to get the pinned commit
-VERSIONS_FILE="$FORGE_REPO_DIR/assets/versions.sh"
+VERSIONS_FILE="assets/versions.sh"
 if [ -f "$VERSIONS_FILE" ]; then
     log_info "Sourcing version information from: $VERSIONS_FILE"
     source "$VERSIONS_FILE"

@@ -225,8 +171,8 @@ else
     exit 1
 fi
 
-# Step 5.5: Apply monarch torch import hack
-log_info "Step 5.5: Applying monarch torch import hack..."
+# Step 3.5: Apply monarch torch import hack
+log_info "Step 3.5: Applying monarch torch import hack..."
 
 MONARCH_INIT="$CONDA_PREFIX/lib/python3.10/site-packages/monarch/__init__.py"
 if [ -f "$MONARCH_INIT" ]; then

@@ -259,8 +205,8 @@ else
     log_warn "Skipping monarch torch import hack (monarch may not be installed yet)"
 fi
 
-# Step 6: Install forge package
-log_info "Step 6: Installing forge package..."
+# Step 4: Install forge package
+log_info "Step 4: Installing forge package..."
 pip install --no-deps --force-reinstall .
 if [ $? -ne 0 ]; then
     log_error "Failed to install forge package"

@@ -298,7 +244,11 @@ pip list | grep -E "(forge|monarch)" || log_warn "No forge/monarch packages foun
 log_info "Environment setup complete! You can now run your scripts."
 log_info "Mounted workspace available at: /mnt/wsfuse"
 
-# Step 6: Ask user to deactivate and activate conda env conda environment
+log_info "Unsetting CUDA_HOME and overwriting the LD_LIBRARY_PATH"
+unset CUDA_HOME
+export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib
+
+# Step 5: Ask user to test
 echo ""
 log_info "Installation completed successfully!"
 echo ""

.meta/mast/hydrate_cache.py
Lines changed: 56 additions & 0 deletions

@@ -0,0 +1,56 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""This is a convenience script for hydrating the HuggingFace cache.
+
+It downloads the model weights and tokenizer into the cache, e.g. on
+OilFS.
+
+Example:
+
+    python .meta/mast/hydrate_cache.py --model-id Qwen/Qwen3-32B
+
+"""
+import argparse
+import os
+import sys
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Hydrate HuggingFace cache for a specific model"
+    )
+    parser.add_argument(
+        "--model-id",
+        type=str,
+        required=True,
+        help="HuggingFace model ID (e.g., Qwen/Qwen3-8B)",
+    )
+    args = parser.parse_args()
+
+    # Ensure HF_HOME is set
+    hf_home = os.environ.get("HF_HOME")
+    if not hf_home:
+        print(
+            "ERROR: HF_HOME environment variable must be set. "
+            "You will likely want to run export HF_HOME=/mnt/wsfuse/teamforge/hf."
+        )
+        sys.exit(1)
+
+    print(f"Using HF_HOME: {hf_home}")
+    print(f"Downloading {args.model_id}...")
+
+    # This will pull tokenizer + config + all weight shards
+    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(args.model_id, trust_remote_code=True)
+
+    print("Download complete. Cache hydrated.")
+
+
+if __name__ == "__main__":
+    main()
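Once hydrated, the hub cache can be sanity-checked before launching a job. A small sketch, assuming the standard `models--{org}--{name}` hub cache layout under `$HF_HOME/hub` (the helper name is illustrative, not part of the commit):

```python
# Check whether a model appears in the local HuggingFace hub cache.
# The "models--{org}--{name}" directory layout is the hub cache convention.
import os


def is_cached(model_id, hf_home="/mnt/wsfuse/teamforge/hf"):
    cache_dir = os.path.join(
        hf_home, "hub", "models--" + model_id.replace("/", "--")
    )
    return os.path.isdir(cache_dir)


print(is_cached("Qwen/Qwen3-8B"))
```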

.meta/mast/launch.sh
Lines changed: 12 additions & 1 deletion

@@ -34,6 +34,12 @@ fi
 
 CONFIG_FILE="$1"
 
+# Generate a unique job name
+USER=$(whoami)
+RANDOM_SUFFIX=$(tr -dc 'a-z0-9' < /dev/urandom | fold -w 6 | head -n 1)
+JOB_NAME="${USER}-forge-${RANDOM_SUFFIX}"
+log_info "Generated job name: $JOB_NAME"
+
 # Get the directory where this script is located
 SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
 

@@ -64,5 +70,10 @@ fi
 log_info "Successfully reinstalled forge package"
 
 # Launch the job
+CHECKPOINT_FOLDER=/mnt/wsfuse/teamforge/forge_runs/$JOB_NAME
 log_info "Launching MAST job..."
-PYTHONPATH=. python .meta/mast/main.py --config "$CONFIG_FILE"
+
+# Manually override the relevant checkpoint path(s).
+# This unfortunately cannot be done in the YAML itself, since the paths
+# have to be derived from the job name.
+PYTHONPATH=. python .meta/mast/main.py --job-name "$JOB_NAME" --config "$CONFIG_FILE" trainer.checkpoint.folder="${CHECKPOINT_FOLDER}" trainer.dcp_path="${CHECKPOINT_FOLDER}"
