**examples/advanced/llm_hf/MULTINODE.md**

- **Does NOT use srun** to run the job script (only ONE FL client)

#### 2. **Job Configuration** (`job.py`)
- Uses `FedAvgRecipe` with `per_site_config` for multi-node setup
- When `--multi_node` flag is set:
- Sets command in per_site_config: `"command": "bash custom/client_wrapper.sh"`
- Adds wrapper script to job: `recipe.job.to("client_wrapper.sh", site_name)`
- Script arguments passed via `"train_args"` in per_site_config
- No need to handle environment variables in Python
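
A condensed sketch of this multi-node branch (argument values are illustrative; the imports of `FedAvgRecipe` and the model class are the same as in `job.py`):

```python
# Condensed sketch of the multi-node branch in job.py (values illustrative;
# FedAvgRecipe / CausalLMModel imports as in the actual script).
site_name = "dolly"
per_site_config = {
    site_name: {
        # NVFlare launches the wrapper instead of calling torchrun directly.
        "command": "bash custom/client_wrapper.sh",
        # Training-script arguments still travel through train_args.
        "train_args": "--model_name_or_path meta-llama/llama-3.2-1b "
                      "--data_path_train ./dataset/dolly/training.jsonl "
                      "--data_path_valid ./dataset/dolly/validation.jsonl",
    }
}

recipe = FedAvgRecipe(
    name="llm_hf_multinode",
    initial_model=CausalLMModel("meta-llama/llama-3.2-1b"),
    min_clients=1,
    num_rounds=3,
    train_script="client.py",
    launch_external_process=True,
    per_site_config=per_site_config,
)
# Ship the wrapper with the job so it lands in the site's custom/ directory.
recipe.job.to("client_wrapper.sh", site_name)
```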

#### 3. **Wrapper Script** (`client_wrapper.sh`)
- Uses `CUDA_VISIBLE_DEVICES` to select the GPUs. Assumes it is set as a comma-separated list, e.g. `"0,1,2,3,4,5,6,7"`.

**Why this works:**
- The wrapper script is included in the FL job package via `recipe.job.to("client_wrapper.sh", site_name)`
- It's placed in the `custom/` subdirectory of the job workspace
- Command is set to `bash custom/client_wrapper.sh` in the per_site_config
- It runs in the same environment as the SLURM job (has access to `srun` and SLURM variables)
- It handles all the complexity of multi-node coordination

```bash
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
# Start NVFlare server and client
python3 job.py --client_ids dolly --data_path ${PWD}/dataset \
--gpu "[0,1,2,3,4,5,6,7]" --multi_node \
--workspace_dir ${PWD}/workspace --job_dir ${PWD}/jobs
```

3. **`job.py` creates and exports FL job**:
- Uses `FedAvgRecipe` to configure the federated learning job
- For multi-node mode (`--multi_node` flag):
- Sets command via `per_site_config`: `"command": "bash custom/client_wrapper.sh"`
- Adds wrapper script: `recipe.job.to("client_wrapper.sh", site_name)`
- Includes `client.py` training script automatically via recipe
- Exports job configuration to specified directory

4. **NVFlare client receives training task**:
- Extracts job files to workspace (including `custom/client_wrapper.sh`)
- Executes command from per_site_config: `bash custom/client_wrapper.sh`
- Wrapper script receives training script path and arguments

5. **Wrapper script detects multi-node setup**:
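
   The detection logic is along the following lines (an illustrative sketch, not the verbatim contents of `client_wrapper.sh`):

```bash
# Illustrative sketch of the branch in client_wrapper.sh (not the verbatim script).
NNODES=${SLURM_NNODES:-1}
# GPUs per node, derived from the comma-separated CUDA_VISIBLE_DEVICES list.
NPROC_PER_NODE=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)

if [ "$NNODES" -gt 1 ]; then
    # Multi-node: srun starts one torchrun per node; torchrun spawns one process per GPU.
    srun --nodes="$NNODES" --ntasks-per-node=1 \
        torchrun --nnodes="$NNODES" --nproc_per_node="$NPROC_PER_NODE" \
                 --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" "$@"
else
    # Single node: plain torchrun, no srun.
    torchrun --nnodes=1 --nproc_per_node="$NPROC_PER_NODE" --master_port="$MASTER_PORT" "$@"
fi
```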

## Current Solution Advantages

✅ **Recipe Pattern**: Uses `FedAvgRecipe` for maintainable configuration
✅ **Separation of concerns**: Job creation vs execution
✅ **Environment isolation**: Wrapper script runs in correct environment
✅ **Flexibility**: Works for both single-node and multi-node via `--multi_node` flag
✅ **Per-Site Configuration**: Different commands and arguments per client via `per_site_config`
✅ **Simplicity**: No complex string escaping or variable expansion
✅ **Debugging**: Wrapper script provides clear logging
✅ **Portability**: Easy to modify for different cluster setups

## Job Configuration Arguments

The `job.py` script accepts several arguments relevant to multi-node training:

**Required:**
- `--client_ids`: Client/site names, space-separated (e.g., `dolly` or `hospital-1`). Used directly as site names.
- `--data_path`: Root directory containing client datasets
- `--multi_node`: Flag to enable multi-node training mode

**Optional:**
- `--gpu`: GPU assignments (e.g., `"[0,1,2,3,4,5,6,7]"` for 8 GPUs)
- `--num_rounds`: Number of FL rounds (default: 3)
- `--local_epoch`: Local training epochs per round (default: 1)
- `--lr_scheduler`: Learning rate scheduler (default: "constant")
- `--train_mode`: Training mode - `SFT` or `PEFT` (default: "SFT")
- `--message_mode`: Communication format - `numpy` or `tensor` (default: "numpy")
- `--quantize_mode`: Model quantization for communication (e.g., `float16`, `blockwise8`)
- `--wandb_project`: WandB project name for experiment tracking
- `--wandb_run_name`: WandB run name
- `--use_tracking`: Enable TensorBoard tracking

**Example with all key arguments:**
```bash
python3 job.py \
--client_ids dolly \
--data_path ${PWD}/dataset \
--multi_node \
--gpu "[0,1,2,3,4,5,6,7]" \
--num_rounds 5 \
--local_epoch 2 \
--lr_scheduler cosine_with_restarts \
--train_mode SFT \
--message_mode tensor \
--wandb_project my_llm_project \
--workspace_dir ${PWD}/workspace \
--job_dir ${PWD}/jobs
```

## Testing

### Single Node (8 GPUs)
When testing with a single node, you can either:

**Option 1: Without `--multi_node` flag** (uses torchrun directly, no wrapper):
```bash
python3 job.py --client_ids dolly --data_path ./dataset \
--gpu "[0,1,2,3,4,5,6,7]" \
--workspace_dir ./workspace --job_dir ./jobs
```

**Option 2: With `--multi_node` flag** (uses wrapper script):
```bash
sbatch --nodes=1 --gpus-per-node=8 nvflare.slurm
```
Wrapper script detects `NNODES=1` and uses torchrun directly (no srun).

### Multi-Node (2 nodes, 16 GPUs)
**Required: Must use `--multi_node` flag**
```bash
sbatch --nodes=2 --gpus-per-node=8 nvflare.slurm
```
Wrapper script detects `NNODES=2` and uses srun + torchrun.

**The `--multi_node` flag is critical** - it tells `job.py` to:
- Use `client_wrapper.sh` instead of direct torchrun
- Include the wrapper script in the job package
- Set the correct command in the job configuration

## Troubleshooting

### Check wrapper script execution
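
With a standard SLURM setup, the wrapper's output lands in the job's stdout file, so a quick check looks like this (job id is a placeholder; adjust the file name if `nvflare.slurm` sets `#SBATCH --output`):

```bash
squeue -u $USER               # is the job running, and on which nodes?
scontrol show job <jobid>     # confirm the expected node/GPU allocation
tail -f slurm-<jobid>.out     # follow the wrapper script's log lines
```
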
## Key Principles for Multi-Node NVFlare + PyTorch DDP

1. **One FL Client Per Multi-node Cluster**: Only one NVFlare client process is needed per cluster, and it should run on the SLURM master node.
2. **Use `--multi_node` Flag**: Must be set in `job.py` to enable wrapper script and correct command configuration
3. **Recipe-Based Configuration**: Use `FedAvgRecipe` with `per_site_config` for flexible job setup
4. **Rank 0 Only for FL Operations**: Only global rank 0 talks to FL server
5. **Local Rank for GPU Selection**: Use local_rank (0-7) for `cuda:X` device mapping
6. **Global Rank for FL Communication**: Use rank (0-15) for NVFlare API calls
7. **Broadcast Coordination**: Rank 0 broadcasts FL state to all other ranks
8. **Shared Filesystem**: Only rank 0 saves checkpoints (avoid conflicts)
9. **Wrapper Script Pattern**: Separate job creation from execution environment via `client_wrapper.sh`
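
A minimal sketch of the rank-coordination pattern (principles 4-8), assuming `torch.distributed` has already been initialized by torchrun/srun and `flare.init()` has been called; deliberately simplified compared to `client.py`:

```python
# Minimal sketch of the rank-coordination pattern (principles 4-8); simplified vs. client.py.
import os
import torch
import torch.distributed as dist
import nvflare.client as flare
from nvflare.app_common.abstract.fl_model import FLModel

def one_fl_round(model: torch.nn.Module):
    rank = dist.get_rank()                       # global rank, used for FL communication
    local_rank = int(os.environ["LOCAL_RANK"])   # per-node rank, used for GPU selection
    model.to(f"cuda:{local_rank}")

    # Only global rank 0 talks to the FL server.
    payload = [flare.receive().params if rank == 0 else None]
    dist.broadcast_object_list(payload, src=0)   # rank 0 broadcasts the global weights
    model.load_state_dict({k: torch.as_tensor(v) for k, v in payload[0].items()})

    # ... local DDP training runs here on every rank ...

    if rank == 0:
        # Rank 0 alone sends the update back and saves checkpoints on the shared filesystem.
        flare.send(FLModel(params={k: v.cpu() for k, v in model.state_dict().items()}))
```
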
**examples/advanced/llm_hf/README.md**

```bash
python ./utils/preprocess_oasst1.py --training_file dataset/oasst1/data/train-00000-of-00001-b42a775f407cee45.parquet --validation_file dataset/oasst1/data/validation-00000-of-00001-134b8fd0c89408b6.parquet --output_dir dataset/oasst1
```

## Implementation Overview

This implementation uses NVFlare's recipe-based pattern for federated learning with HuggingFace LLMs. Below is an overview of the key components:

### Data
- **Datasets**: Three public instruction-tuning datasets (Dolly, Alpaca, OASST1)
- **Format**: JSONL files with `input` and `output` fields for instruction tuning
- **Preprocessing**: Each dataset is split into `training.jsonl` and `validation.jsonl`
- **Client Distribution**: Each client gets its own dataset directory (e.g., `dataset/dolly/`, `dataset/alpaca/`)

### Model
The example supports two model definition files for different training modes:

**`hf_sft_model.py` (Supervised Fine-Tuning)**
```python
class CausalLMModel(torch.nn.Module):
    def __init__(self, model_name_or_path):
        super(CausalLMModel, self).__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
```

**`hf_peft_model.py` (Parameter-Efficient Fine-Tuning)**
```python
class CausalLMPEFTModel(torch.nn.Module):
    def __init__(self, model_name_or_path):
        super(CausalLMPEFTModel, self).__init__()
        peft_config = LoraConfig(lora_alpha=16, lora_dropout=0.1, r=64,
                                 bias="none", task_type="CAUSAL_LM")
        full_model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
        self.model = get_peft_model(full_model, peft_config)
```

### Client-Side Code
**`client.py`** - Federated client using HuggingFace SFTTrainer with DDP support

Key features:
- **Multi-GPU Support**: Automatic DDP setup via `torch.distributed`
- **Rank Management**: Only rank 0 communicates with NVFlare server
- **Model Synchronization**: Broadcasts global model from rank 0 to all ranks
- **Federated Training Loop**: Integrates with NVFlare using numbered steps:
1. Import nvflare client API
2. Initialize NVFlare client API (`flare.init()`)
3. Federated training rounds loop (`while flare.is_running()`)
4. Receive global model from NVFlare (`flare.receive()`)
5. Load global model state dict
6. Evaluate global model for server-side model selection
7. Train locally using SFTTrainer
8. Compose output model parameters
9. Construct trained FL model with metrics
10. Send model back to NVFlare (`flare.send()`)

**Launch Modes:**
- Single GPU: `python client.py [args]`
- Multi-GPU: `python -m torch.distributed.run --nnodes=1 --nproc_per_node=N --master_port=7777 client.py [args]`
- Multi-node: via `client_wrapper.sh`
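
Condensed, the loop in `client.py` follows the numbered steps above; `model`, `evaluate`, and `train_with_sft_trainer` below are hypothetical stand-ins for the SFTTrainer-based logic in the real script:

```python
# Condensed sketch of the federated loop in client.py; model/evaluate/train_with_sft_trainer
# are hypothetical stand-ins for the SFTTrainer-based logic in the real script.
import nvflare.client as flare                           # step 1
from nvflare.app_common.abstract.fl_model import FLModel

flare.init()                                             # step 2
while flare.is_running():                                # step 3
    input_model = flare.receive()                        # step 4
    model.load_state_dict(input_model.params)            # step 5
    eval_metric = evaluate(model)                        # step 6: score the received global model
    train_with_sft_trainer(model)                        # step 7: local SFTTrainer.train()
    out_params = {k: v.cpu() for k, v in model.state_dict().items()}   # step 8
    output_model = FLModel(params=out_params,
                           metrics={"eval_metric": eval_metric})       # step 9
    flare.send(output_model)                             # step 10
```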

### Server-Side Code / Job Recipe
**`job.py`** - Job configuration using NVFlare's `FedAvgRecipe` pattern

**Recipe-Based Approach:**
```python
# Create recipe with FedAvgRecipe
recipe = FedAvgRecipe(
    name=job_name,
    initial_model=initial_model,  # CausalLMModel or CausalLMPEFTModel
    min_clients=num_clients,
    num_rounds=args.num_rounds,
    train_script="client.py",
    server_expected_format=server_expected_format,  # "pytorch" or "numpy"
    launch_external_process=True,
    per_site_config=per_site_config,  # Site-specific configurations
)
```

**Per-Site Configuration:**
Each client can have custom configurations for different data paths and multi-GPU setups:
```python
per_site_config = {
    "dolly": {
        "train_args": "--model_name_or_path meta-llama/llama-3.2-1b "
                      "--data_path_train ./dataset/dolly/training.jsonl "
                      "--data_path_valid ./dataset/dolly/validation.jsonl ...",
        "command": "python3 -m torch.distributed.run --nnodes=1 "
                   "--nproc_per_node=2 --master_port=7777"
    },
    "alpaca": {
        "train_args": "--model_name_or_path meta-llama/llama-3.2-1b "
                      "--data_path_train ./dataset/alpaca/training.jsonl ...",
        "command": "python3 -m torch.distributed.run --nnodes=1 "
                   "--nproc_per_node=2 --master_port=8888"
    }
}
```

**Optional Features:**
- **Quantization**: Add ModelQuantizer and ModelDequantizer filters for communication efficiency
- **Experiment Tracking**: Enable TensorBoard tracking with `--use_tracking`
- **Extended Timeouts**: Automatic configuration for long-running LLM training
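
Both quantization and tracking can be switched on from the command line; for example (flags as listed under "Key Job Arguments" below, values illustrative):

```bash
python3 job.py \
    --client_ids dolly \
    --data_path ${PWD}/dataset \
    --message_mode tensor \
    --quantize_mode float16 \
    --use_tracking
```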

### Run Job
The recipe supports multiple execution modes:

**1. Export Only** (generate job config without running):
```bash
python job.py \
--client_ids dolly \
--data_path ${PWD}/dataset \
--job_dir ${PWD}/workspace/jobs/job_config \
--export_config
```

**2. Simulation Mode** (local testing):
```bash
python job.py \
--client_ids dolly \
--data_path ${PWD}/dataset \
--workspace_dir ${PWD}/workspace/simulation \
--job_dir ${PWD}/workspace/jobs/simulation
```

**3. Production Mode** (real deployment):
```bash
python job.py \
--client_ids dolly \
--data_path ${PWD}/dataset \
--startup_kit_location /path/to/startup_kit \
--username [email protected]
```

**Key Job Arguments:**
- `--client_ids`: Client/site names (space-separated). Used directly as site names (e.g., `dolly`, `hospital-1`)
- `--data_path`: Root directory containing client datasets
- `--train_mode`: `SFT` or `PEFT`
- `--message_mode`: `numpy` (float32) or `tensor` (bf16)
- `--quantize_mode`: Optional quantization (`float16`, `blockwise8`, `float4`, `normfloat4`)
- `--gpu`: GPU assignments, e.g., `"[0,1],[2,3]"` for two clients with 2 GPUs each
- `--ports`: Master ports for DDP, e.g., `7777 8888`
- `--num_rounds`: Number of federated learning rounds
- `--use_tracking`: Enable TensorBoard experiment tracking

## Adaptation of Centralized Training Script to Federated
Below, we illustrate how to adapt a standard HuggingFace SFT/PEFT training script to a federated paradigm with NVFlare.

Oasst1:
![peft](./figs/peft_oasst1.png)


## Multi-node Training
The NVFlare client can also run in a multi-node environment. The deployment details depend on your cluster; we provide an example of how to test this on a SLURM-based cluster. See [MULTINODE.md](MULTINODE.md) for details and notes on ensuring the job runs correctly in a multi-node setting.

### 1. Create a fresh virtual environment on your cluster
Create a fresh virtual environment on your cluster and install the requirements.
```bash
export VENV_DIR=<path/to/your/venv>
```
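
A typical sequence, assuming the example's `requirements.txt` is used (adapt to your cluster's module or conda setup):

```bash
python3 -m venv "$VENV_DIR"
source "$VENV_DIR/bin/activate"
pip install -r requirements.txt
```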

### 2. Create a NVFlare project
As an example, we create a project with only one client for the Dolly dataset.
```bash
nvflare poc prepare -c site-dolly
```
Copy the created `prod_00` directory to a location your SLURM job can access, e.g., a shared file system.

```bash
export NVFLARE_PROJECT=<your/path/to/prod_00>
```

### 3. (Optional) Set your Weights and Biases API Key
The training can be logged to WandB if you provide an API key via

```bash
export WANDB_API_KEY=<your_wandb_api_key>
```

### 4. Submit the SLURM Job

Update your SLURM account name and partitions by providing the information in [nvflare.slurm](nvflare.slurm):

```
#SBATCH -A [ACCOUNT_NAME]
#SBATCH --partition=[PARTITION_NAME1,PARTITION_NAME2,...]
```

By default, you can submit a job requesting 2 nodes with 8 GPUs each via

```bash
sbatch nvflare.slurm
```

For more options, see [MULTINODE.md](MULTINODE.md#testing).