81 changes: 80 additions & 1 deletion .meta/mast/README.md
@@ -21,7 +21,7 @@ The `env_setup.sh` script will automatically:
chmod +x .meta/mast/env_setup.sh

# Run the setup
./.meta/mast/env_setup.sh
source .meta/mast/env_setup.sh

```

@@ -44,3 +44,82 @@ The launch script will automatically:
- Launch the MAST job with the specified config

You can run it from anywhere, and it will figure out the correct paths.


## How MAST Launcher Works

The MAST launcher uses a two-stage architecture to run training jobs:

### Stage 1: Detached Mode (Local Machine)

When you run `./.meta/mast/launch.sh`, the `main.py` script starts in **detached mode**:

1. The launcher creates a MAST job with all the worker roles (GPU hosts)
2. It also creates a special **client role** - a CPU-only role that will run inside MAST
3. The client role's entrypoint is set to `client_bootstrap.sh`
4. All CLI arguments you pass are forwarded to the client role

At this point, the job is submitted to MAST and your local script exits. Everything now runs in the cluster.
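For intuition, here is a minimal, illustrative sketch of the kind of job specification detached mode assembles. It is not the real `launcher.py`: the `Role` dataclass, role names, entrypoint names, and replica counts below are placeholders; only the overall shape (GPU worker roles plus a CPU-only client role whose entrypoint is `client_bootstrap.sh` and whose args are your forwarded CLI arguments) reflects the description above.

```python
import sys
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    """Hypothetical stand-in for a MAST/TorchX role definition."""

    name: str
    entrypoint: str
    args: List[str] = field(default_factory=list)
    num_replicas: int = 1
    gpus_per_host: int = 0


def build_detached_job(cli_args: List[str]) -> List[Role]:
    # GPU worker roles that host the actual training processes.
    workers = Role(
        name="worker",
        entrypoint="worker_bootstrap.sh",  # placeholder entrypoint
        num_replicas=8,                    # placeholder host count
        gpus_per_host=8,
    )
    # CPU-only client role; its entrypoint re-invokes main.py in remote mode.
    client = Role(
        name="client",
        entrypoint=".meta/mast/client_bootstrap.sh",
        args=list(cli_args),  # forward everything passed on the local CLI
    )
    return [workers, client]


if __name__ == "__main__":
    for role in build_detached_job(sys.argv[1:]):
        print(role)
```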

### Stage 2: Remote Mode (Inside MAST)

The `client_bootstrap.sh` script runs inside the MAST client role and:

1. Calls `main.py` again, but now with `--mode=remote`
2. In **remote mode**, the script:
- Mounts the OilFS workspace
- Initializes the provisioner to connect to worker roles
- Runs the actual training workload (e.g., GRPO)

This architecture allows the entire training workflow to run inside MAST without requiring a persistent connection from your local machine.
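As a rough sketch of the control flow (again, not the actual `main.py`; only the `--mode` and `--config` flags come from the description above and from `launch.sh`, the rest is a placeholder):

```python
import argparse


def launch_detached(config: str) -> None:
    # Placeholder: build the MAST job spec (worker roles plus the client role
    # whose entrypoint is client_bootstrap.sh), submit it, and return.
    print(f"Submitting MAST job for config {config!r} ...")


def run_remote(config: str) -> None:
    # Placeholder: mount the OilFS workspace, initialize the provisioner that
    # connects to the worker roles, then run the training workload (e.g. GRPO).
    print(f"Running workload for config {config!r} inside MAST ...")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--mode", choices=["detached", "remote"], default="detached")
    args = parser.parse_args()

    if args.mode == "detached":
        launch_detached(args.config)  # stage 1: runs on your local machine
    else:
        run_remote(args.config)       # stage 2: runs inside the MAST client role


if __name__ == "__main__":
    main()
```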

### Key Files

- **`main.py`**: Entry point that handles both detached and remote modes
- **`client_bootstrap.sh`**: Entrypoint for the client role in MAST
- **`launcher.py`**: Creates the MAST job specification and handles role configuration


## Managing HuggingFace Models in MAST

### The Problem: No Internet Access

MAST compute nodes cannot access the internet, which means they cannot download models directly from HuggingFace. To work around this, we store all HuggingFace models and cache data on OilFS at `/mnt/wsfuse/teamforge/hf`, which is accessible from MAST.

### Solution: Two-Step Process

You need to perform both steps below to ensure models work correctly in MAST:

#### 1. Download Model Weights to OilFS

First, download the model weights directly to the OilFS path. This should be done from a machine with internet access (like your devserver):

```bash
# Set HF_HOME to the OilFS path
export HF_HOME=/mnt/wsfuse/teamforge/hf

# Download the model (replace with your desired model)
huggingface-cli download Qwen/Qwen3-8B --local-dir /mnt/wsfuse/teamforge/hf_artifacts/qwen3_8b
```

#### 2. Hydrate the HuggingFace Cache

After downloading the weights, you need to hydrate the HuggingFace cache so that the transformers library can find the model metadata:

```bash
# Set HF_HOME to the OilFS path
export HF_HOME=/mnt/wsfuse/teamforge/hf

# Hydrate the cache for the model
python .meta/mast/hydrate_cache.py --model-id Qwen/Qwen3-8B
```

This ensures that when MAST runs with `HF_HUB_OFFLINE=1`, the transformers library can locate all necessary files from the cache.
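As a quick sanity check (a sketch, assuming the same `HF_HOME` path), you can verify that the hydrated cache resolves offline from any machine that mounts OilFS:

```python
import os

# Mirror the environment the MAST client sets up: shared cache, no network.
os.environ["HF_HOME"] = "/mnt/wsfuse/teamforge/hf"
os.environ["HF_HUB_OFFLINE"] = "1"

from transformers import AutoConfig, AutoTokenizer  # import after setting env vars

# Both calls should resolve entirely from the hydrated cache, with no network access.
config = AutoConfig.from_pretrained("Qwen/Qwen3-8B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
print("OK:", config.model_type, type(tokenizer).__name__)
```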

### Directory Structure

Cache and model files are stored on OilFS at:
- **Cache**: `/mnt/wsfuse/teamforge/hf` (set via `HF_HOME`)
- **Model weights**: `/mnt/wsfuse/teamforge/hf/<model_name>`

Make sure your MAST config files point to the correct paths in `hf_artifacts`.
Comment on lines +99 to +125

Contributor: Why can't you automate this?

Contributor Author: this is the million dollar question Vidhya lol
51 changes: 51 additions & 0 deletions .meta/mast/client_bootstrap.sh
@@ -0,0 +1,51 @@
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# Bootstrap script for the MAST client role
# This script sets up the environment and launches the client training script

set -eEx

LIBCUDA="/usr/local/fbcode/platform010/lib/libcuda.so"
if [ -f "$LIBCUDA" ]; then
export LIBCUDA_DIR="${LIBCUDA%/*}"
export TRITON_LIBCUDA_PATH="$LIBCUDA_DIR"
export LD_PRELOAD="$LIBCUDA:/usr/local/fbcode/platform010/lib/libnvidia-ml.so${PRELOAD_PATH:+:$PRELOAD_PATH}"
fi

# Also add the torch library paths to LD_LIBRARY_PATH: in the monarch dev
# workflow we don't install torch into the env, so the binaries must be able
# to find libtorch and friends on MAST, and the rpaths set during the dev
# install will be wrong there.
export LD_LIBRARY_PATH="${CONDA_DIR}/lib:${CONDA_DIR}/lib/python3.10/site-packages/torch/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$TORCHX_RUN_PYTHONPATH"

# shellcheck disable=SC1091
if [ -n "$CONDA_PREFIX" ]; then
echo "A conda environment is already activated: $CONDA_DEFAULT_ENV"
else
# Disable command printing to avoid log spew.
set +x
source "${CONDA_DIR}/bin/activate"
# Re-enable command printing after conda activation.
set -x
fi

if [ -z "$WORKSPACE_DIR" ] || [ ! -d "$WORKSPACE_DIR" ]; then
WORKSPACE_DIR="$CONDA_PREFIX"
fi

cd "$WORKSPACE_DIR/forge"

export WANDB_MODE=offline
export HF_HUB_OFFLINE=1
export MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE=1
export TORCHSTORE_RDMA_ENABLED=1
export HF_HOME=/mnt/wsfuse/teamforge/hf

# Execute the client training script with all passed arguments
exec python -X faulthandler .meta/mast/main.py "$@"
76 changes: 13 additions & 63 deletions .meta/mast/env_setup.sh
@@ -7,10 +7,9 @@
# LICENSE file in the root directory of this source tree.

# setup_forge_env.sh - Setup conda environment and install forge with mounting
set -e # Exit on any error

# Configuration
CONDA_ENV_NAME="forge:stable"
CONDA_ENV_NAME="forge:41468b33a03eaf2bf5b44517f418028a"

# Colors for output
RED='\033[0;31m'
@@ -109,8 +108,6 @@ fi
# Define paths
FBSOURCE_PATH="/data/users/$USER/fbsource"
CONDA_SCRIPT_PATH="$FBSOURCE_PATH/genai/xlformers/dev/xl_conda.sh"
FORGE_BASE_DIR="/data/users/$USER"
FORGE_REPO_DIR="$FORGE_BASE_DIR/forge"

# Workspace URL for mounting
WORKSPACE_URL="ws://ws.ai.pci0ai/genai_fair_llm"
@@ -143,63 +140,12 @@ fi

log_info "Conda environment activated successfully"

# Step 3: Create and navigate to forge base directory
log_info "Step 3: Setting up forge directory..."
if [ ! -d "$FORGE_BASE_DIR" ]; then
log_info "Creating forge base directory: $FORGE_BASE_DIR"
mkdir -p "$FORGE_BASE_DIR"
fi

cd "$FORGE_BASE_DIR"
log_info "Changed to directory: $(pwd)"

# Step 4: Clone or update forge repository
log_info "Step 4: Setting up forge git repository..."
if [ -d "$FORGE_REPO_DIR" ]; then
log_warn "Forge repository already exists at: $FORGE_REPO_DIR"
cd "$FORGE_REPO_DIR"

if [ -d ".git" ]; then
log_info "Updating existing repository..."
git fetch origin
if [ $? -eq 0 ]; then
log_info "Repository updated successfully"
else
log_warn "Failed to fetch updates, continuing with existing code"
fi
else
log_error "Directory exists but is not a git repository"
log_info "Removing directory and cloning fresh..."
cd "$FORGE_BASE_DIR"
rm -rf "$FORGE_REPO_DIR"
git clone [email protected]:meta-pytorch/forge.git
if [ $? -ne 0 ]; then
log_error "Failed to clone forge repository"
exit 1
fi
cd "$FORGE_REPO_DIR"
fi
else
log_info "Cloning forge repository..."
git clone [email protected]:meta-pytorch/forge.git
if [ $? -ne 0 ]; then
log_error "Failed to clone forge repository"
log_error "Please ensure:"
log_error "1. You have SSH access to github.com"
log_error "2. Your SSH key is added to GitHub"
log_error "3. You have access to meta-pytorch/forge repository"
exit 1
fi
cd "$FORGE_REPO_DIR"
fi

log_info "Current directory: $(pwd)"

# Step 5: Install torchtitan
log_info "Step 5: Installing torchtitan..."
# Step 3: Install torchtitan
log_info "Step 3: Installing torchtitan..."

# Source versions.sh to get the pinned commit
VERSIONS_FILE="$FORGE_REPO_DIR/assets/versions.sh"
VERSIONS_FILE="assets/versions.sh"
if [ -f "$VERSIONS_FILE" ]; then
log_info "Sourcing version information from: $VERSIONS_FILE"
source "$VERSIONS_FILE"
@@ -225,8 +171,8 @@ else
exit 1
fi

# Step 5.5: Apply monarch torch import hack
log_info "Step 5.5: Applying monarch torch import hack..."
# Step 3.5: Apply monarch torch import hack
log_info "Step 3.5: Applying monarch torch import hack..."

MONARCH_INIT="$CONDA_PREFIX/lib/python3.10/site-packages/monarch/__init__.py"
if [ -f "$MONARCH_INIT" ]; then
@@ -259,8 +205,8 @@ else
log_warn "Skipping monarch torch import hack (monarch may not be installed yet)"
fi

# Step 6: Install forge package
log_info "Step 6: Installing forge package..."
# Step 4: Install forge package
log_info "Step 4: Installing forge package..."
pip install --no-deps --force-reinstall .
if [ $? -ne 0 ]; then
log_error "Failed to install forge package"
@@ -298,7 +244,11 @@ pip list | grep -E "(forge|monarch)" || log_warn "No forge/monarch packages foun
log_info "Environment setup complete! You can now run your scripts."
log_info "Mounted workspace available at: /mnt/wsfuse"

# Step 6: Ask user to deactivate and activate conda env conda environment
log_info "Unsetting CUDA_HOME and overwriting the LD_LIBRARY_PATH"
unset CUDA_HOME
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib

# Step 5: Ask user to test
echo ""
log_info "Installation completed successfully!"
echo ""
56 changes: 56 additions & 0 deletions .meta/mast/hydrate_cache.py
@@ -0,0 +1,56 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""This is convenience script meant for hydrating the HuggingFace cache.

This is meant for downloading the model weights and tokenizer to the cache, i.e. for
OilFS.

Example:

python .meta/mast/hydrate_cache.py --model-id Qwen/Qwen3-32B

"""
import argparse
import os
import sys

from transformers import AutoModelForCausalLM, AutoTokenizer


def main():
    parser = argparse.ArgumentParser(
        description="Hydrate HuggingFace cache for a specific model"
    )
    parser.add_argument(
        "--model-id",
        type=str,
        required=True,
        help="HuggingFace model ID (e.g., Qwen/Qwen3-8B)",
    )
    args = parser.parse_args()

    # Ensure HF_HOME is set
    hf_home = os.environ.get("HF_HOME")
    if not hf_home:
        print(
            "ERROR: HF_HOME environment variable must be set. "
            "You will likely want to run export HF_HOME=/mnt/wsfuse/teamforge/hf."
        )
        sys.exit(1)

    print(f"Using HF_HOME: {hf_home}")
    print(f"Downloading {args.model_id}...")

    # This will pull tokenizer + config + all weight shards
    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(args.model_id, trust_remote_code=True)

    print("Download complete. Cache hydrated.")


if __name__ == "__main__":
    main()
13 changes: 12 additions & 1 deletion .meta/mast/launch.sh
@@ -34,6 +34,12 @@ fi

CONFIG_FILE="$1"

# Generate a unique job name
USER=$(whoami)
RANDOM_SUFFIX=$(tr -dc 'a-z0-9' < /dev/urandom | fold -w 6 | head -n 1)
JOB_NAME="${USER}-forge-${RANDOM_SUFFIX}"
log_info "Generated job name: $JOB_NAME"

# Get the directory where this script is located
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

@@ -64,5 +70,10 @@ fi
log_info "Successfully reinstalled forge package"

# Launch the job
CHECKPOINT_FOLDER=/mnt/wsfuse/teamforge/forge_runs/$JOB_NAME
log_info "Launching MAST job..."
PYTHONPATH=. python .meta/mast/main.py --config "$CONFIG_FILE"

# Manually override the relevant checkpoint path(s).
# Unfortunately this cannot be done in the YAML itself, since the path is
# derived from the job name.
PYTHONPATH=. python .meta/mast/main.py --job-name "$JOB_NAME" --config "$CONFIG_FILE" trainer.checkpoint.folder="${CHECKPOINT_FOLDER}" trainer.dcp_path="${CHECKPOINT_FOLDER}"