Enable MAST client mode #405
Merged
Changes from 20 of 21 commits
Commits
d8d0a33 changes for forge client mode (allenwang28)
63b3274 initial commit (allenwang28)
834f4f9 park (allenwang28)
6d38ae6 park changes (allenwang28)
d308862 fixes (allenwang28)
4fdc2bc nmerge (allenwang28)
f5d6eb9 almost there, but keeps failing on dcp load despite me not setting that (allenwang28)
be3c446 park again (allenwang28)
5a4ddeb gsm local (allenwang28)
3e7b605 ? (allenwang28)
e6a0025 1.7b is running! (allenwang28)
cc14604 Merge branch 'main' into mast_client (allenwang28)
ed832f4 cleanups (allenwang28)
d589ffc update all yamls (allenwang28)
c630f55 fix path (allenwang28)
4a00205 comment out hf_home (allenwang28)
42d83de affinity (allenwang28)
a2b3a9e hydrate cache (allenwang28)
37276d6 change hf_artifacts to hf (allenwang28)
d33662b Merge branch 'main' into mast_client (allenwang28)
c84f796 update readme (allenwang28)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
#!/bin/bash | ||
# Copyright (c) Meta Platforms, Inc. and affiliates. | ||
# All rights reserved. | ||
# | ||
# This source code is licensed under the BSD-style license found in the | ||
# LICENSE file in the root directory of this source tree. | ||
|
||
# Bootstrap script for the MAST client role | ||
# This script sets up the environment and launches the client training script | ||
|
||
set -eEx | ||
|
||
LIBCUDA="/usr/local/fbcode/platform010/lib/libcuda.so" | ||
if [ -f "$LIBCUDA" ]; then | ||
export LIBCUDA_DIR="${LIBCUDA%/*}" | ||
export TRITON_LIBCUDA_PATH="$LIBCUDA_DIR" | ||
export LD_PRELOAD="$LIBCUDA:/usr/local/fbcode/platform010/lib/libnvidia-ml.so${PRELOAD_PATH:+:$PRELOAD_PATH}" | ||
fi | ||
|
||
# Also add the torch libs to the library path: in the monarch dev workflow we | ||
# don't install it into the env, so the binaries must be able to find libtorch | ||
# and friends on MAST, where the rpaths set during the dev install will be | ||
# wrong. | ||
export LD_LIBRARY_PATH="${CONDA_DIR}/lib:${CONDA_DIR}/lib/python3.10/site-packages/torch/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" | ||
export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$TORCHX_RUN_PYTHONPATH" | ||
|
||
# shellcheck disable=SC1091 | ||
if [ -n "$CONDA_PREFIX" ]; then | ||
echo "A conda environment is already activated: $CONDA_DEFAULT_ENV" | ||
else | ||
# Disable command printing to avoid log spew. | ||
set +x | ||
source "${CONDA_DIR}/bin/activate" | ||
# Re-enable command printing after conda activation. | ||
set -x | ||
fi | ||
|
||
if [ -z "$WORKSPACE_DIR" ] || [ ! -d "$WORKSPACE_DIR" ]; then | ||
WORKSPACE_DIR="$CONDA_PREFIX" | ||
fi | ||
|
||
cd "$WORKSPACE_DIR/forge" | ||
|
||
export WANDB_MODE=offline | ||
export HF_HUB_OFFLINE=1 | ||
export MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE=1 | ||
export TORCHSTORE_RDMA_ENABLED=1 | ||
export HF_HOME=/mnt/wsfuse/teamforge/hf | ||
|
||
# Execute the client training script with all passed arguments | ||
exec python -X faulthandler .meta/mast/main.py "$@" |
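The `${VAR:+:$VAR}` expansions in the bootstrap script above append the existing value only when the variable is set and non-empty. This avoids a dangling `:`, which matters because a zero-length entry in `LD_LIBRARY_PATH` is treated as the current directory. A minimal Python sketch of the same logic follows; `prepend_paths` is a hypothetical helper used only for illustration and is not part of this PR.

```python
import os


def prepend_paths(new_entries, existing):
    """Join new_entries ahead of existing, skipping existing when unset or empty.

    Mirrors the shell form "new${VAR:+:$VAR}" (hypothetical helper, not in the PR).
    """
    parts = list(new_entries)
    if existing:  # None or "" adds nothing, so no trailing ":" is produced
        parts.append(existing)
    return ":".join(parts)


if __name__ == "__main__":
    # "/opt/conda" is a placeholder default for illustration only.
    conda_dir = os.environ.get("CONDA_DIR", "/opt/conda")
    print(prepend_paths([f"{conda_dir}/lib"], os.environ.get("LD_LIBRARY_PATH")))
```

The `PYTHONPATH` line in the script uses the mirrored form, `${PYTHONPATH:+$PYTHONPATH:}`, which places the existing value first and adds the separator after it.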
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,10 +7,9 @@ | |
# LICENSE file in the root directory of this source tree. | ||
|
||
# setup_forge_env.sh - Setup conda environment and install forge with mounting | ||
set -e # Exit on any error | ||
|
||
# Configuration | ||
CONDA_ENV_NAME="forge:stable" | ||
CONDA_ENV_NAME="forge:41468b33a03eaf2bf5b44517f418028a" | ||
|
||
# Colors for output | ||
RED='\033[0;31m' | ||
|
@@ -109,8 +108,6 @@ fi | |
# Define paths | ||
FBSOURCE_PATH="/data/users/$USER/fbsource" | ||
CONDA_SCRIPT_PATH="$FBSOURCE_PATH/genai/xlformers/dev/xl_conda.sh" | ||
FORGE_BASE_DIR="/data/users/$USER" | ||
FORGE_REPO_DIR="$FORGE_BASE_DIR/forge" | ||
|
||
# Workspace URL for mounting | ||
WORKSPACE_URL="ws://ws.ai.pci0ai/genai_fair_llm" | ||
|
@@ -143,63 +140,12 @@ fi | |
|
||
log_info "Conda environment activated successfully" | ||
|
||
# Step 3: Create and navigate to forge base directory | ||
log_info "Step 3: Setting up forge directory..." | ||
if [ ! -d "$FORGE_BASE_DIR" ]; then | ||
log_info "Creating forge base directory: $FORGE_BASE_DIR" | ||
mkdir -p "$FORGE_BASE_DIR" | ||
fi | ||
|
||
cd "$FORGE_BASE_DIR" | ||
log_info "Changed to directory: $(pwd)" | ||
|
||
# Step 4: Clone or update forge repository | ||
log_info "Step 4: Setting up forge git repository..." | ||
if [ -d "$FORGE_REPO_DIR" ]; then | ||
log_warn "Forge repository already exists at: $FORGE_REPO_DIR" | ||
cd "$FORGE_REPO_DIR" | ||
|
||
if [ -d ".git" ]; then | ||
log_info "Updating existing repository..." | ||
git fetch origin | ||
if [ $? -eq 0 ]; then | ||
log_info "Repository updated successfully" | ||
else | ||
log_warn "Failed to fetch updates, continuing with existing code" | ||
fi | ||
else | ||
log_error "Directory exists but is not a git repository" | ||
log_info "Removing directory and cloning fresh..." | ||
cd "$FORGE_BASE_DIR" | ||
rm -rf "$FORGE_REPO_DIR" | ||
git clone git@github.com:meta-pytorch/forge.git | ||
if [ $? -ne 0 ]; then | ||
log_error "Failed to clone forge repository" | ||
exit 1 | ||
fi | ||
cd "$FORGE_REPO_DIR" | ||
fi | ||
else | ||
log_info "Cloning forge repository..." | ||
git clone git@github.com:meta-pytorch/forge.git | ||
if [ $? -ne 0 ]; then | ||
log_error "Failed to clone forge repository" | ||
log_error "Please ensure:" | ||
log_error "1. You have SSH access to github.com" | ||
log_error "2. Your SSH key is added to GitHub" | ||
log_error "3. You have access to meta-pytorch/forge repository" | ||
exit 1 | ||
fi | ||
cd "$FORGE_REPO_DIR" | ||
fi | ||
|
||
log_info "Current directory: $(pwd)" | ||
|
||
# Step 5: Install torchtitan | ||
log_info "Step 5: Installing torchtitan..." | ||
# Step 3: Install torchtitan | ||
log_info "Step 3: Installing torchtitan..." | ||
|
||
# Source versions.sh to get the pinned commit | ||
VERSIONS_FILE="$FORGE_REPO_DIR/assets/versions.sh" | ||
VERSIONS_FILE="assets/versions.sh" | ||
if [ -f "$VERSIONS_FILE" ]; then | ||
log_info "Sourcing version information from: $VERSIONS_FILE" | ||
source "$VERSIONS_FILE" | ||
|
@@ -225,8 +171,8 @@ else | |
exit 1 | ||
fi | ||
|
||
# Step 5.5: Apply monarch torch import hack | ||
log_info "Step 5.5: Applying monarch torch import hack..." | ||
# Step 3.5: Apply monarch torch import hack | ||
log_info "Step 3.5: Applying monarch torch import hack..." | ||
|
||
MONARCH_INIT="$CONDA_PREFIX/lib/python3.10/site-packages/monarch/__init__.py" | ||
if [ -f "$MONARCH_INIT" ]; then | ||
|
@@ -259,8 +205,8 @@ else | |
log_warn "Skipping monarch torch import hack (monarch may not be installed yet)" | ||
fi | ||
|
||
# Step 6: Install forge package | ||
log_info "Step 6: Installing forge package..." | ||
# Step 4: Install forge package | ||
log_info "Step 4: Installing forge package..." | ||
pip install --no-deps --force-reinstall . | ||
if [ $? -ne 0 ]; then | ||
log_error "Failed to install forge package" | ||
|
@@ -298,7 +244,11 @@ pip list | grep -E "(forge|monarch)" || log_warn "No forge/monarch packages foun | |
log_info "Environment setup complete! You can now run your scripts." | ||
log_info "Mounted workspace available at: /mnt/wsfuse" | ||
|
||
# Step 6: Ask user to deactivate and activate conda env conda environment | ||
log_info "Unsetting CUDA_HOME and overwriting the LD_LIBRARY_PATH" | ||
unset CUDA_HOME | ||
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib | ||
|
||
# Step 5: Ask user to test | ||
echo "" | ||
log_info "Installation completed successfully!" | ||
echo "" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Copyright (c) Meta Platforms, Inc. and affiliates. | ||
# All rights reserved. | ||
# | ||
# This source code is licensed under the BSD-style license found in the | ||
# LICENSE file in the root directory of this source tree. | ||
|
||
"""This is a convenience script meant for hydrating the HuggingFace cache. | ||
|
||
This is meant for downloading the model weights and tokenizer to the cache, i.e. for | ||
OilFS. | ||
|
||
Example: | ||
|
||
python .meta/mast/hydrate_cache.py --model-id Qwen/Qwen3-32B | ||
|
||
""" | ||
import argparse | ||
import os | ||
import sys | ||
|
||
from transformers import AutoModelForCausalLM, AutoTokenizer | ||
|
||
|
||
def main(): | ||
parser = argparse.ArgumentParser( | ||
description="Hydrate HuggingFace cache for a specific model" | ||
) | ||
parser.add_argument( | ||
"--model-id", | ||
type=str, | ||
required=True, | ||
help="HuggingFace model ID (e.g., Qwen/Qwen3-8B)", | ||
) | ||
args = parser.parse_args() | ||
|
||
# Ensure HF_HOME is set | ||
hf_home = os.environ.get("HF_HOME") | ||
if not hf_home: | ||
print( | ||
"ERROR: HF_HOME environment variable must be set. " | ||
"You will likely want to run export HF_HOME=/mnt/wsfuse/teamforge/hf." | ||
) | ||
sys.exit(1) | ||
|
||
print(f"Using HF_HOME: {hf_home}") | ||
print(f"Downloading {args.model_id}...") | ||
|
||
# This will pull tokenizer + config + all weight shards | ||
tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True) | ||
model = AutoModelForCausalLM.from_pretrained(args.model_id, trust_remote_code=True) | ||
|
||
print("Download complete. Cache hydrated.") | ||
|
||
|
||
if __name__ == "__main__": | ||
main() |
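The script above populates the cache by loading the full model with `AutoModelForCausalLM.from_pretrained`, which also materializes the weights in memory. One possible alternative, sketched here on the assumption that `huggingface_hub` is installed, is `snapshot_download`, which fetches the repository files into the cache without instantiating the model. `require_hf_home` and `hydrate` are hypothetical names for illustration, not part of this PR. Note that `snapshot_download` pulls every file in the repository by default, which may be more than `from_pretrained` would fetch.

```python
import os


def require_hf_home(environ=None):
    """Return HF_HOME, or raise if it is unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    hf_home = environ.get("HF_HOME")
    if not hf_home:
        raise RuntimeError("HF_HOME environment variable must be set.")
    return hf_home


def hydrate(model_id):
    """Download the files of model_id into the HuggingFace cache.

    Returns the local path of the downloaded snapshot.
    """
    # Imported lazily so the HF_HOME check can run before huggingface_hub
    # reads the environment at import time.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=model_id)


# Example (requires network access and a set HF_HOME):
#   require_hf_home()
#   hydrate("Qwen/Qwen3-8B")
```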
Why can't you automate this?
this is the million dollar question Vidhya lol