# Convert job to recipe for LLM_HF example #3888

Status: Merged (+471 −347, 14 commits)
Commits (all by ZiyueXu77):
- 5dc5b01 convert job to recipe
- 946ae81 bug correction, further polish
- 1a93502 Merge branch 'NVIDIA:main' into llm_rcp
- 2e61e2d update llm example with latest multi-gpu func
- e3e2229 update readmes
- 8e46477 fix arg issue
- ffee274 Merge branch 'main' into llm_rcp
- 885912a further polishes
- c6cc74c Merge branch 'main' into llm_rcp
- 92eb4a3 further touchups
- ff9f940 update slurm and readme to reflect site name change
- 413e2d2 format update
- 172d08a add defaults to wandb
- 8716373 add defaults to wandb
@@ -40,6 +40,146 @@ python ./utils/preprocess_alpaca.py --training_file dataset/alpaca/data/train-00

```bash
python ./utils/preprocess_oasst1.py --training_file dataset/oasst1/data/train-00000-of-00001-b42a775f407cee45.parquet --validation_file dataset/oasst1/data/validation-00000-of-00001-134b8fd0c89408b6.parquet --output_dir dataset/oasst1
```

## Implementation Overview

This implementation uses NVFlare's recipe-based pattern for federated learning with HuggingFace LLMs. Below is an overview of the key components:

### Data
- **Datasets**: Three public instruction-tuning datasets (Dolly, Alpaca, OASST1)
- **Format**: JSONL files with `input` and `output` fields for instruction tuning
- **Preprocessing**: Each dataset is split into `training.jsonl` and `validation.jsonl`
- **Client Distribution**: Each client gets its own dataset directory (e.g., `dataset/dolly/`, `dataset/alpaca/`)
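
To make the format concrete, here is an illustrative record (contents invented for this sketch; the real records come from the Dolly/Alpaca/OASST1 preprocessing scripts). Each line of a `.jsonl` file is one JSON object with the `input`/`output` fields described above:

```python
import json

# Hypothetical instruction-tuning record, showing the `input`/`output` schema
record = {
    "input": "List three primary colors.",
    "output": "Red, yellow, and blue.",
}

line = json.dumps(record)   # one JSON object per line in training.jsonl
parsed = json.loads(line)
print(parsed["input"])      # List three primary colors.
```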

### Model
The example supports two model definition files for different training modes:

**`hf_sft_model.py` (Supervised Fine-Tuning)**
```python
import torch
from transformers import AutoModelForCausalLM

class CausalLMModel(torch.nn.Module):
    def __init__(self, model_name_or_path):
        super(CausalLMModel, self).__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
```

**`hf_peft_model.py` (Parameter-Efficient Fine-Tuning)**
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

class CausalLMPEFTModel(torch.nn.Module):
    def __init__(self, model_name_or_path):
        super(CausalLMPEFTModel, self).__init__()
        peft_config = LoraConfig(lora_alpha=16, lora_dropout=0.1, r=64,
                                 bias="none", task_type="CAUSAL_LM")
        full_model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
        self.model = get_peft_model(full_model, peft_config)
```

### Client-Side Code
**`client.py`** - Federated client using the HuggingFace SFTTrainer with DDP support

Key features:
- **Multi-GPU Support**: Automatic DDP setup via `torch.distributed`
- **Rank Management**: Only rank 0 communicates with the NVFlare server
- **Model Synchronization**: Broadcasts the global model from rank 0 to all ranks
- **Federated Training Loop**: Integrates with NVFlare using numbered steps:
  1. Import the NVFlare client API
  2. Initialize the NVFlare client API (`flare.init()`)
  3. Loop over federated training rounds (`while flare.is_running()`)
  4. Receive the global model from NVFlare (`flare.receive()`)
  5. Load the global model state dict
  6. Evaluate the global model for server-side model selection
  7. Train locally using SFTTrainer
  8. Compose the output model parameters
  9. Construct the trained FL model with metrics
  10. Send the model back to NVFlare (`flare.send()`)
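
The steps above can be sketched as a control-flow skeleton. `FakeFlare` below is a stand-in for the real NVFlare client API (normally `import nvflare.client as flare`), and training/evaluation are reduced to one-line stubs, so only the loop shape of steps 1-10 is shown:

```python
class FakeFlare:
    """Stand-in for the NVFlare client API, simulating two FL rounds."""

    def __init__(self, num_rounds=2):
        self.round = 0
        self.num_rounds = num_rounds
        self.sent = []

    def init(self):                       # step 2
        pass

    def is_running(self):                 # step 3
        return self.round < self.num_rounds

    def receive(self):                    # step 4: global model params
        return {"w": float(self.round)}

    def send(self, params, metrics):      # step 10
        self.sent.append((params, metrics))
        self.round += 1


flare = FakeFlare()                       # step 1 (stand-in for the import)
flare.init()                              # step 2
while flare.is_running():                 # step 3
    global_params = flare.receive()       # step 4
    local = dict(global_params)           # step 5: load the global state dict
    eval_loss = local["w"]                # step 6: evaluate global model (stub)
    local["w"] += 1.0                     # step 7: local training (stub)
    out_params = local                    # steps 8-9: compose output + metrics
    flare.send(out_params, {"eval_loss": eval_loss})  # step 10

print(len(flare.sent))  # 2
```

In the real `client.py`, only rank 0 executes `receive`/`send`, broadcasting the received weights to the other DDP ranks before step 7.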

**Launch Modes:**
- Single GPU: `python client.py [args]`
- Multi-GPU: `python -m torch.distributed.run --nnodes=1 --nproc_per_node=N --master_port=7777 client.py [args]`
- Multi-node: via `client_wrapper.sh`

### Server-Side Code / Job Recipe
**`job.py`** - Job configuration using NVFlare's `FedAvgRecipe` pattern

**Recipe-Based Approach:**
```python
# Create the recipe with FedAvgRecipe
recipe = FedAvgRecipe(
    name=job_name,
    initial_model=initial_model,  # CausalLMModel or CausalLMPEFTModel
    min_clients=num_clients,
    num_rounds=args.num_rounds,
    train_script="client.py",
    server_expected_format=server_expected_format,  # "pytorch" or "numpy"
    launch_external_process=True,
    per_site_config=per_site_config,  # Site-specific configurations
)
```

**Per-Site Configuration:**
Each client can have a custom configuration for different data paths and multi-GPU setups:
```python
per_site_config = {
    "dolly": {
        "train_args": "--model_name_or_path meta-llama/llama-3.2-1b "
                      "--data_path_train ./dataset/dolly/training.jsonl "
                      "--data_path_valid ./dataset/dolly/validation.jsonl ...",
        "command": "python3 -m torch.distributed.run --nnodes=1 "
                   "--nproc_per_node=2 --master_port=7777",
    },
    "alpaca": {
        "train_args": "--model_name_or_path meta-llama/llama-3.2-1b "
                      "--data_path_train ./dataset/alpaca/training.jsonl ...",
        "command": "python3 -m torch.distributed.run --nnodes=1 "
                   "--nproc_per_node=2 --master_port=8888",
    },
}
```

**Optional Features:**
- **Quantization**: Add ModelQuantizer and ModelDequantizer filters for communication efficiency
- **Experiment Tracking**: Enable TensorBoard tracking with `--use_tracking`
- **Extended Timeouts**: Automatic configuration for long-running LLM training
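
For intuition on the quantization filters, here is a conceptual toy (not NVFlare's actual ModelQuantizer implementation) of the blockwise 8-bit scheme referenced by `--quantize_mode blockwise8`: each block of parameters is stored as one float scale plus int8 codes, shrinking the payload roughly 4x versus float32 at a small reconstruction error.

```python
def quantize_blockwise8(values, block_size=4):
    """Quantize a flat list of floats into (scale, int8 codes) per block."""
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid divide-by-zero
        codes = [round(v / scale * 127) for v in block]
        blocks.append((scale, codes))
    return blocks

def dequantize_blockwise8(blocks):
    """Reconstruct approximate floats from (scale, codes) blocks."""
    out = []
    for scale, codes in blocks:
        out.extend(c / 127 * scale for c in codes)
    return out

vals = [0.5, -1.0, 0.25, 0.0, 2.0, -2.0]
restored = dequantize_blockwise8(quantize_blockwise8(vals))
print(max(abs(a - b) for a, b in zip(vals, restored)) < 0.02)  # True
```

In the real filters, the dequantizer runs on the receiving side so training itself still sees full-precision weights.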

### Run Job
The recipe supports multiple execution modes:

**1. Export Only** (generate job config without running):
```bash
python job.py \
    --client_ids dolly \
    --data_path ${PWD}/dataset \
    --job_dir ${PWD}/workspace/jobs/job_config \
    --export_config
```

**2. Simulation Mode** (local testing):
```bash
python job.py \
    --client_ids dolly \
    --data_path ${PWD}/dataset \
    --workspace_dir ${PWD}/workspace/simulation \
    --job_dir ${PWD}/workspace/jobs/simulation
```

**3. Production Mode** (real deployment):
```bash
python job.py \
    --client_ids dolly \
    --data_path ${PWD}/dataset \
    --startup_kit_location /path/to/startup_kit \
    --username [email protected]
```

**Key Job Arguments:**
- `--client_ids`: Client/site names (space-separated), used directly as site names (e.g., `dolly`, `hospital-1`)
- `--data_path`: Root directory containing the client datasets
- `--train_mode`: `SFT` or `PEFT`
- `--message_mode`: `numpy` (float32) or `tensor` (bf16)
- `--quantize_mode`: Optional quantization (`float16`, `blockwise8`, `float4`, `normfloat4`)
- `--gpu`: GPU assignments, e.g., `"[0,1],[2,3]"` for two clients with 2 GPUs each
- `--ports`: Master ports for DDP, e.g., `7777 8888`
- `--num_rounds`: Number of federated learning rounds
- `--use_tracking`: Enable TensorBoard experiment tracking
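
To make the `--gpu` format concrete, a hypothetical parser (not the example's actual argument handling) could split the string into one GPU list per client:

```python
import re

def parse_gpu_arg(gpu_arg: str):
    """Split a string like "[0,1],[2,3]" into per-client GPU ID lists."""
    groups = re.findall(r"\[([^\]]*)\]", gpu_arg)  # contents of each [...]
    return [[int(g) for g in grp.split(",") if g.strip()] for grp in groups]

print(parse_gpu_arg("[0,1],[2,3]"))  # [[0, 1], [2, 3]]
```

With two clients and `--ports 7777 8888`, entry *i* of this list would pair with port *i* in the generated per-site `torch.distributed.run` command.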

## Adaptation of Centralized Training Script to Federated
Below, we illustrate how to adapt a standard HuggingFace SFT/PEFT training script to a federated paradigm with NVFlare.

@@ -247,47 +387,5 @@ Alpaca:

Oasst1:

## Multi-node Training
The NVFlare client can run in a multi-node environment as well. The deployment depends on your cluster environment; we provide an example of how to test this with a SLURM-based cluster. See [MULTINODE.md](MULTINODE.md) for details and some findings on ensuring the job runs correctly in a multi-node setting.

### 1. Create a fresh virtual environment on your cluster
Create a fresh virtual environment on your cluster and install the requirements.
```bash
export VENV_DIR=<path/to/your/venv>
# Standard venv setup (assuming a requirements.txt in the example directory):
python3 -m venv $VENV_DIR
source $VENV_DIR/bin/activate
pip install -r requirements.txt
```

### 2. Create an NVFlare project
As an example, we create a project with only one client for the Dolly dataset.
```bash
nvflare poc prepare -c site-dolly
```
Copy the created "prod_00" directory to a location your SLURM job can access, e.g., a shared file system.

```bash
export NVFLARE_PROJECT=<your/path/to/prod_00>
```

### 3. (Optionally) Set your Weights and Biases API Key
The training can be logged to WandB if you provide an API key via

```bash
export WANDB_API_KEY=<your_wandb_api_key>
```

### 4. Submit the SLURM Job

Update your SLURM account name and partitions by providing the information in [nvflare.slurm](nvflare.slurm):

```
#SBATCH -A [ACCOUNT_NAME]
#SBATCH --partition=[PARTITION_NAME1,PARTITION_NAME2,...]
```

By default, you can submit a job requesting 2 nodes with 8 GPUs via

```bash
sbatch nvflare.slurm
```

For more options, see [MULTINODE.md](MULTINODE.md#testing).