**examples/advanced/llm_hf/MULTINODE.md**

- **Does NOT use srun** to run the job script (only ONE FL client)

#### 2. **Job Configuration** (`job.py`)
- Uses `FedAvgRecipe` with `per_site_config` for multi-node setup
- When `--multi_node` flag is set:
- Sets command in per_site_config: `"command": "bash custom/client_wrapper.sh"`
- Adds wrapper script to job: `recipe.job.to("client_wrapper.sh", site_name)`
- Script arguments passed via `"train_args"` in per_site_config
- No need to handle environment variables in Python
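
A condensed sketch of this multi-node branch (argument values are illustrative; the imports of `FedAvgRecipe` and the model class are the same as in `job.py`):

```python
# Condensed sketch of the multi-node branch in job.py (values illustrative;
# FedAvgRecipe / CausalLMModel imports as in the actual script).
site_name = "dolly"
per_site_config = {
    site_name: {
        # NVFlare launches the wrapper instead of calling torchrun directly.
        "command": "bash custom/client_wrapper.sh",
        # Training-script arguments still travel through train_args.
        "train_args": "--model_name_or_path meta-llama/llama-3.2-1b "
                      "--data_path_train ./dataset/dolly/training.jsonl "
                      "--data_path_valid ./dataset/dolly/validation.jsonl",
    }
}

recipe = FedAvgRecipe(
    name="llm_hf_multinode",
    initial_model=CausalLMModel("meta-llama/llama-3.2-1b"),
    min_clients=1,
    num_rounds=3,
    train_script="client.py",
    launch_external_process=True,
    per_site_config=per_site_config,
)
# Ship the wrapper with the job so it lands in the site's custom/ directory.
recipe.job.to("client_wrapper.sh", site_name)
```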

#### 3. **Wrapper Script** (`client_wrapper.sh`)
- Uses `CUDA_VISIBLE_DEVICES` to select the GPUs. Assumes it is set as a comma-separated list, e.g. `"0,1,2,3,4,5,6,7"`.

**Why this works:**
- The wrapper script is included in the FL job package via `recipe.job.to("client_wrapper.sh", site_name)`
- It's placed in the `custom/` subdirectory of the job workspace
- Command is set to `bash custom/client_wrapper.sh` in the per_site_config
- It runs in the same environment as the SLURM job (has access to `srun` and SLURM variables)
- It handles all the complexity of multi-node coordination

```bash
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
# Start NVFlare server and client
python3 job.py --client_ids dolly --data_path ${PWD}/dataset \
--gpu "[0,1,2,3,4,5,6,7]" --multi_node \
--workspace_dir ${PWD}/workspace --job_dir ${PWD}/jobs
```

3. **`job.py` creates and exports FL job**:
- Uses `FedAvgRecipe` to configure the federated learning job
- For multi-node mode (`--multi_node` flag):
- Sets command via `per_site_config`: `"command": "bash custom/client_wrapper.sh"`
- Adds wrapper script: `recipe.job.to("client_wrapper.sh", site_name)`
- Includes `client.py` training script automatically via recipe
- Exports job configuration to specified directory

4. **NVFlare client receives training task**:
- Extracts job files to workspace (including `custom/client_wrapper.sh`)
- Executes command from per_site_config: `bash custom/client_wrapper.sh`
- Wrapper script receives training script path and arguments

5. **Wrapper script detects multi-node setup**:
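
   The detection logic is along the following lines (an illustrative sketch, not the verbatim contents of `client_wrapper.sh`):

```bash
# Illustrative sketch of the branch in client_wrapper.sh (not the verbatim script).
NNODES=${SLURM_NNODES:-1}
# GPUs per node, derived from the comma-separated CUDA_VISIBLE_DEVICES list.
NPROC_PER_NODE=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)

if [ "$NNODES" -gt 1 ]; then
    # Multi-node: srun starts one torchrun per node; torchrun spawns one process per GPU.
    srun --nodes="$NNODES" --ntasks-per-node=1 \
        torchrun --nnodes="$NNODES" --nproc_per_node="$NPROC_PER_NODE" \
                 --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" "$@"
else
    # Single node: plain torchrun, no srun.
    torchrun --nnodes=1 --nproc_per_node="$NPROC_PER_NODE" --master_port="$MASTER_PORT" "$@"
fi
```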

## Current Solution Advantages

✅ **Recipe Pattern**: Uses `FedAvgRecipe` for maintainable configuration
✅ **Separation of concerns**: Job creation vs execution
✅ **Environment isolation**: Wrapper script runs in correct environment
✅ **Flexibility**: Works for both single-node and multi-node via `--multi_node` flag
✅ **Per-Site Configuration**: Different commands and arguments per client via `per_site_config`
✅ **Simplicity**: No complex string escaping or variable expansion
✅ **Debugging**: Wrapper script provides clear logging
✅ **Portability**: Easy to modify for different cluster setups

## Job Configuration Arguments

The `job.py` script accepts several arguments relevant to multi-node training:

**Required:**
- `--client_ids`: Client/site names, space-separated (e.g., `dolly` or `hospital-1`). Used directly as site names.
- `--data_path`: Root directory containing client datasets
- `--multi_node`: Flag to enable multi-node training mode

**Optional:**
- `--gpu`: GPU assignments (e.g., `"[0,1,2,3,4,5,6,7]"` for 8 GPUs)
- `--num_rounds`: Number of FL rounds (default: 3)
- `--local_epoch`: Local training epochs per round (default: 1)
- `--lr_scheduler`: Learning rate scheduler (default: "constant")
- `--train_mode`: Training mode - `SFT` or `PEFT` (default: "SFT")
- `--message_mode`: Communication format - `numpy` or `tensor` (default: "numpy")
- `--quantize_mode`: Model quantization for communication (e.g., `float16`, `blockwise8`)
- `--wandb_project`: WandB project name for experiment tracking
- `--wandb_run_name`: WandB run name
- `--use_tracking`: Enable TensorBoard tracking

**Example with all key arguments:**
```bash
python3 job.py \
--client_ids dolly \
--data_path ${PWD}/dataset \
--multi_node \
--gpu "[0,1,2,3,4,5,6,7]" \
--num_rounds 5 \
--local_epoch 2 \
--lr_scheduler cosine_with_restarts \
--train_mode SFT \
--message_mode tensor \
--wandb_project my_llm_project \
--workspace_dir ${PWD}/workspace \
--job_dir ${PWD}/jobs
```

## Testing

### Single Node (8 GPUs)
When testing with a single node, you can either:

**Option 1: Without `--multi_node` flag** (uses torchrun directly, no wrapper):
```bash
python3 job.py --client_ids dolly --data_path ./dataset \
--gpu "[0,1,2,3,4,5,6,7]" \
--workspace_dir ./workspace --job_dir ./jobs
```

**Option 2: With `--multi_node` flag** (uses wrapper script):
```bash
sbatch --nodes=1 --gpus-per-node=8 nvflare.slurm
```
Wrapper script detects `NNODES=1` and uses torchrun directly (no srun).

### Multi-Node (2 nodes, 16 GPUs)
**Required: Must use `--multi_node` flag**
```bash
sbatch --nodes=2 --gpus-per-node=8 nvflare.slurm
```
Wrapper script detects `NNODES=2` and uses srun + torchrun.

**The `--multi_node` flag is critical** - it tells `job.py` to:
- Use `client_wrapper.sh` instead of direct torchrun
- Include the wrapper script in the job package
- Set the correct command in the job configuration

## Troubleshooting

### Check wrapper script execution
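
With a standard SLURM setup, the wrapper's output lands in the job's stdout file, so a quick check looks like this (job id is a placeholder; adjust the file name if `nvflare.slurm` sets `#SBATCH --output`):

```bash
squeue -u $USER               # is the job running, and on which nodes?
scontrol show job <jobid>     # confirm the expected node/GPU allocation
tail -f slurm-<jobid>.out     # follow the wrapper script's log lines
```
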
## Key Principles for Multi-Node NVFlare + PyTorch DDP

1. **One FL Client Per Multi-node Cluster**: Only one NVFlare client process is needed per cluster, and it should run on the SLURM master node.
2. **Use `--multi_node` Flag**: Must be set in `job.py` to enable wrapper script and correct command configuration
3. **Recipe-Based Configuration**: Use `FedAvgRecipe` with `per_site_config` for flexible job setup
4. **Rank 0 Only for FL Operations**: Only global rank 0 talks to FL server
5. **Local Rank for GPU Selection**: Use local_rank (0-7) for `cuda:X` device mapping
6. **Global Rank for FL Communication**: Use rank (0-15) for NVFlare API calls
7. **Broadcast Coordination**: Rank 0 broadcasts FL state to all other ranks
8. **Shared Filesystem**: Only rank 0 saves checkpoints (avoid conflicts)
9. **Wrapper Script Pattern**: Separate job creation from execution environment via `client_wrapper.sh`
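
A minimal sketch of the rank-coordination pattern (principles 4-8), assuming `torch.distributed` has already been initialized by torchrun/srun and `flare.init()` has been called; deliberately simplified compared to `client.py`:

```python
# Minimal sketch of the rank-coordination pattern (principles 4-8); simplified vs. client.py.
import os
import torch
import torch.distributed as dist
import nvflare.client as flare
from nvflare.app_common.abstract.fl_model import FLModel

def one_fl_round(model: torch.nn.Module):
    rank = dist.get_rank()                       # global rank, used for FL communication
    local_rank = int(os.environ["LOCAL_RANK"])   # per-node rank, used for GPU selection
    model.to(f"cuda:{local_rank}")

    # Only global rank 0 talks to the FL server.
    payload = [flare.receive().params if rank == 0 else None]
    dist.broadcast_object_list(payload, src=0)   # rank 0 broadcasts the global weights
    model.load_state_dict({k: torch.as_tensor(v) for k, v in payload[0].items()})

    # ... local DDP training runs here on every rank ...

    if rank == 0:
        # Rank 0 alone sends the update back and saves checkpoints on the shared filesystem.
        flare.send(FLModel(params={k: v.cpu() for k, v in model.state_dict().items()}))
```
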
**examples/advanced/llm_hf/README.md**

```bash
python ./utils/preprocess_oasst1.py --training_file dataset/oasst1/data/train-00000-of-00001-b42a775f407cee45.parquet --validation_file dataset/oasst1/data/validation-00000-of-00001-134b8fd0c89408b6.parquet --output_dir dataset/oasst1
```

## Implementation Overview

This implementation uses NVFlare's recipe-based pattern for federated learning with HuggingFace LLMs. Below is an overview of the key components:

### Data
- **Datasets**: Three public instruction-tuning datasets (Dolly, Alpaca, OASST1)
- **Format**: JSONL files with `input` and `output` fields for instruction tuning
- **Preprocessing**: Each dataset is split into `training.jsonl` and `validation.jsonl`
- **Client Distribution**: Each client gets its own dataset directory (e.g., `dataset/dolly/`, `dataset/alpaca/`)

### Model
The example supports two model definition files for different training modes:

**`hf_sft_model.py` (Supervised Fine-Tuning)**
```python
class CausalLMModel(torch.nn.Module):
    def __init__(self, model_name_or_path):
        super(CausalLMModel, self).__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
```

**`hf_peft_model.py` (Parameter-Efficient Fine-Tuning)**
```python
class CausalLMPEFTModel(torch.nn.Module):
    def __init__(self, model_name_or_path):
        super(CausalLMPEFTModel, self).__init__()
        peft_config = LoraConfig(lora_alpha=16, lora_dropout=0.1, r=64,
                                 bias="none", task_type="CAUSAL_LM")
        full_model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
        self.model = get_peft_model(full_model, peft_config)
```

### Client-Side Code
**`client.py`** - Federated client using HuggingFace SFTTrainer with DDP support

Key features:
- **Multi-GPU Support**: Automatic DDP setup via `torch.distributed`
- **Rank Management**: Only rank 0 communicates with NVFlare server
- **Model Synchronization**: Broadcasts global model from rank 0 to all ranks
- **Federated Training Loop**: Integrates with NVFlare using numbered steps:
1. Import nvflare client API
2. Initialize NVFlare client API (`flare.init()`)
3. Federated training rounds loop (`while flare.is_running()`)
4. Receive global model from NVFlare (`flare.receive()`)
5. Load global model state dict
6. Evaluate global model for server-side model selection
7. Train locally using SFTTrainer
8. Compose output model parameters
9. Construct trained FL model with metrics
10. Send model back to NVFlare (`flare.send()`)

**Launch Modes:**
- Single GPU: `python client.py [args]`
- Multi-GPU: `python -m torch.distributed.run --nnodes=1 --nproc_per_node=N --master_port=7777 client.py [args]`
- Multi-node: via `client_wrapper.sh`
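
Condensed, the loop in `client.py` follows the numbered steps above; `model`, `evaluate`, and `train_with_sft_trainer` below are hypothetical stand-ins for the SFTTrainer-based logic in the real script:

```python
# Condensed sketch of the federated loop in client.py; model/evaluate/train_with_sft_trainer
# are hypothetical stand-ins for the SFTTrainer-based logic in the real script.
import nvflare.client as flare                           # step 1
from nvflare.app_common.abstract.fl_model import FLModel

flare.init()                                             # step 2
while flare.is_running():                                # step 3
    input_model = flare.receive()                        # step 4
    model.load_state_dict(input_model.params)            # step 5
    eval_metric = evaluate(model)                        # step 6: score the received global model
    train_with_sft_trainer(model)                        # step 7: local SFTTrainer.train()
    out_params = {k: v.cpu() for k, v in model.state_dict().items()}   # step 8
    output_model = FLModel(params=out_params,
                           metrics={"eval_metric": eval_metric})       # step 9
    flare.send(output_model)                             # step 10
```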

### Server-Side Code / Job Recipe
**`job.py`** - Job configuration using NVFlare's `FedAvgRecipe` pattern

**Recipe-Based Approach:**
```python
# Create recipe with FedAvgRecipe
recipe = FedAvgRecipe(
    name=job_name,
    initial_model=initial_model,  # CausalLMModel or CausalLMPEFTModel
    min_clients=num_clients,
    num_rounds=args.num_rounds,
    train_script="client.py",
    server_expected_format=server_expected_format,  # "pytorch" or "numpy"
    launch_external_process=True,
    per_site_config=per_site_config,  # Site-specific configurations
)
```

**Per-Site Configuration:**
Each client can have custom configurations for different data paths and multi-GPU setups:
```python
per_site_config = {
    "dolly": {
        "train_args": "--model_name_or_path meta-llama/llama-3.2-1b "
                      "--data_path_train ./dataset/dolly/training.jsonl "
                      "--data_path_valid ./dataset/dolly/validation.jsonl ...",
        "command": "python3 -m torch.distributed.run --nnodes=1 "
                   "--nproc_per_node=2 --master_port=7777"
    },
    "alpaca": {
        "train_args": "--model_name_or_path meta-llama/llama-3.2-1b "
                      "--data_path_train ./dataset/alpaca/training.jsonl ...",
        "command": "python3 -m torch.distributed.run --nnodes=1 "
                   "--nproc_per_node=2 --master_port=8888"
    }
}
```

**Optional Features:**
- **Quantization**: Add ModelQuantizer and ModelDequantizer filters for communication efficiency
- **Experiment Tracking**: Enable TensorBoard tracking with `--use_tracking`
- **Extended Timeouts**: Automatic configuration for long-running LLM training
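
Both quantization and tracking can be switched on from the command line; for example (flags as listed under "Key Job Arguments" below, values illustrative):

```bash
python3 job.py \
    --client_ids dolly \
    --data_path ${PWD}/dataset \
    --message_mode tensor \
    --quantize_mode float16 \
    --use_tracking
```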

### Run Job
The recipe supports multiple execution modes:

**1. Export Only** (generate job config without running):
```bash
python job.py \
--client_ids dolly \
--data_path ${PWD}/dataset \
--job_dir ${PWD}/workspace/jobs/job_config \
--export_config
```

**2. Simulation Mode** (local testing):
```bash
python job.py \
--client_ids dolly \
--data_path ${PWD}/dataset \
--workspace_dir ${PWD}/workspace/simulation \
--job_dir ${PWD}/workspace/jobs/simulation
```

**3. Production Mode** (real deployment):
```bash
python job.py \
--client_ids dolly \
--data_path ${PWD}/dataset \
--startup_kit_location /path/to/startup_kit \
--username [email protected]
```

**Key Job Arguments:**
- `--client_ids`: Client/site names (space-separated). Used directly as site names (e.g., `dolly`, `hospital-1`)
- `--data_path`: Root directory containing client datasets
- `--train_mode`: `SFT` or `PEFT`
- `--message_mode`: `numpy` (float32) or `tensor` (bf16)
- `--quantize_mode`: Optional quantization (`float16`, `blockwise8`, `float4`, `normfloat4`)
- `--gpu`: GPU assignments, e.g., `"[0,1],[2,3]"` for two clients with 2 GPUs each
- `--ports`: Master ports for DDP, e.g., `7777 8888`
- `--num_rounds`: Number of federated learning rounds
- `--use_tracking`: Enable TensorBoard experiment tracking

## Adaptation of Centralized Training Script to Federated
Below, we illustrate how to adapt a standard HuggingFace SFT/PEFT training script to a federated paradigm with NVFlare.

Oasst1:
![peft](./figs/peft_oasst1.png)


## Multi-node Training
The NVFlare client can also run in a multi-node environment. The deployment details depend on your cluster; we provide an example of how to test this on a SLURM-based cluster. See [MULTINODE.md](MULTINODE.md) for details and notes on ensuring the job runs correctly in a multi-node setting.

### 1. Create a fresh virtual environment on your cluster
Create a fresh virtual environment on your cluster and install the requirements.
```bash
export VENV_DIR=<path/to/your/venv>
```
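
A typical sequence, assuming the example's `requirements.txt` is used (adapt to your cluster's module or conda setup):

```bash
python3 -m venv "$VENV_DIR"
source "$VENV_DIR/bin/activate"
pip install -r requirements.txt
```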

### 2. Create a NVFlare project
As an example, we create a project with only one client for the Dolly dataset.
```bash
nvflare poc prepare -c site-dolly
```
Copy the created `prod_00` directory to a location your SLURM job can access, e.g., a shared file system.

```bash
export NVFLARE_PROJECT=<your/path/to/prod_00>
```

### 3. (Optional) Set your Weights and Biases API Key
The training can be logged to WandB if you provide an API key via

```bash
export WANDB_API_KEY=<your_wandb_api_key>
```

### 4. Submit the SLURM Job

Update your SLURM account name and partitions by providing the information in [nvflare.slurm](nvflare.slurm):

```
#SBATCH -A [ACCOUNT_NAME]
#SBATCH --partition=[PARTITION_NAME1,PARTITION_NAME2,...]
```

By default, you can submit a job requesting 2 nodes with 8 GPUs each via

```bash
sbatch nvflare.slurm
```

For more options, see [MULTINODE.md](MULTINODE.md#testing).