This repository contains code and examples for Week 4 of the LLM Engineering and Deployment Certification Program, covering memory optimization, distributed training, and production workflows for LLM fine-tuning.
- Distributed Data Parallelism (DDP) - Speed up training with Accelerate
- DeepSpeed ZeRO - Memory-efficient multi-GPU sharding
- FSDP - PyTorch's Fully Sharded Data Parallelism
- Axolotl - Production-grade training framework
- Advanced Parallelism - Tensor and pipeline parallelism concepts
Prerequisites:
- Multi-GPU machine (RunPod, Lambda Labs, AWS, or local)
- CUDA-compatible GPUs
- Python 3.9+
Setup:
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Configure environment variables (see the loading sketch after these setup steps):
Create a .env file in the repository root:
```
HF_TOKEN=your-hf-token
HF_USERNAME=your-hf-username
WANDB_API_KEY=your-wandb-key
WANDB_PROJECT=your-project-name
WANDB_DISABLED=false
```
Getting your tokens:
- HF_TOKEN: Create at huggingface.co/settings/tokens
- WANDB_API_KEY: Find at wandb.ai/authorize
- Accept model license:
For Llama models, accept the license at meta-llama/Llama-3.2-8B
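The training scripts read these values from the environment. A minimal sketch of how that can be done with python-dotenv and the Hugging Face / W&B clients (the exact loading logic in this repo's scripts is an assumption):

```python
import os

from dotenv import load_dotenv      # pip install python-dotenv
from huggingface_hub import login
import wandb

# Load HF_TOKEN, WANDB_API_KEY, etc. from the .env file into the process environment.
load_dotenv()

# Authenticate with the Hugging Face Hub (needed for gated models such as Llama).
login(token=os.environ["HF_TOKEN"])

# Authenticate with Weights & Biases unless tracking is disabled.
if os.environ.get("WANDB_DISABLED", "false").lower() != "true":
    wandb.login(key=os.environ["WANDB_API_KEY"])
```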
Baseline QLoRA training:
```bash
python code/train_qlora_baseline.py
```
Outputs:
- `data/outputs/baseline_qlora/lora_adapters/` - LoRA adapters
- `data/outputs/baseline_qlora/training_duration.json` - Training time
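For orientation, the core of a QLoRA setup like the one the baseline script uses combines a 4-bit quantized base model with a small trainable LoRA adapter. The sketch below is illustrative only; the rank, dropout, and target modules are assumptions rather than this repo's actual settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B",       # model referenced above; gated, requires HF_TOKEN
    quantization_config=bnb_config,
    device_map="auto",
)

# Small trainable LoRA adapter on top of the quantized base model.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,               # assumed hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```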
Evaluation:
```bash
python code/evaluate_model.py \
    --cfg_path code/configs/training/qlora.yaml \
    --model_path data/outputs/baseline_qlora/lora_adapters
```
Evaluation outputs:
- `data/outputs/baseline_qlora/lora_adapters/eval_results.json` - ROUGE scores
- `data/outputs/baseline_qlora/lora_adapters/predictions.jsonl` - Model predictions
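evaluate_model.py reports ROUGE; conceptually the scoring step can be done with the Hugging Face `evaluate` library as sketched below (the script's actual implementation and the predictions.jsonl schema are assumptions):

```python
import json

import evaluate  # pip install evaluate rouge_score

# Assumed schema: one {"prediction": ..., "reference": ...} object per line.
with open("data/outputs/baseline_qlora/lora_adapters/predictions.jsonl") as f:
    rows = [json.loads(line) for line in f]

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=[r["prediction"] for r in rows],
    references=[r["reference"] for r in rows],
)
print(scores)  # e.g. {"rouge1": ..., "rouge2": ..., "rougeL": ..., "rougeLsum": ...}
```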
DDP training (2 GPUs):
```bash
accelerate launch --config_file code/configs/accelerate/ddp_2gpu.yaml code/train_qlora_ddp.py
```
Training (4 GPUs):
```bash
accelerate launch --config_file code/configs/accelerate/ddp_4gpu.yaml code/train_qlora_ddp.py
```
Outputs:
- `data/outputs/ddp_<ngpu>gpu/<model-name>/lora_adapters/` - LoRA adapters
- `data/outputs/ddp_<ngpu>gpu/<model-name>/training_duration.json` - Training time
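Under DDP each GPU runs its own process with a fixed per-device batch, so the effective batch size grows with the GPU count. A minimal sketch of how a training script can read the process count via Accelerate (the batch-size values are assumptions, not this repo's settings):

```python
from accelerate import Accelerator

accelerator = Accelerator()        # picks up the process group created by `accelerate launch`

per_device_batch_size = 4          # assumed value
gradient_accumulation_steps = 4    # assumed value

# With DDP, the effective batch size scales linearly with the number of processes (GPUs).
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * accelerator.num_processes
)
if accelerator.is_main_process:
    print(f"{accelerator.num_processes} process(es), effective batch size {effective_batch_size}")
```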
Evaluation:
```bash
# 2 GPU
python code/evaluate_model.py \
    --cfg_path code/configs/training/qlora.yaml \
    --model_path data/outputs/ddp_2gpu/<model-name>/lora_adapters

# 4 GPU
python code/evaluate_model.py \
    --cfg_path code/configs/training/qlora.yaml \
    --model_path data/outputs/ddp_4gpu/<model-name>/lora_adapters
```
Evaluation outputs: Same as baseline (in model path directory)
FSDP supports 8 combinations: {LoRA, Full FT} × {ZeRO2, ZeRO3} × {2 GPU, 4 GPU}
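ZeRO2/ZeRO3 here map onto FSDP's sharding strategies: ZeRO-2 roughly corresponds to SHARD_GRAD_OP (gradients and optimizer state are sharded) and ZeRO-3 to FULL_SHARD (parameters are sharded as well). In this repo the choice lives in the accelerate YAML configs; the sketch below expresses the same selection in code through Accelerate's FSDP plugin, as an illustration rather than the repo's actual setup:

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import ShardingStrategy

# ZeRO-2-style sharding would use ShardingStrategy.SHARD_GRAD_OP instead.
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3-style: shard params, grads, optimizer state
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```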
Training pattern:
```bash
accelerate launch --config_file code/configs/accelerate/fsdp_<ngpu>gpu_zero<stage>.yaml \
    code/train_fsdp.py --cfg_path code/configs/training/<lora|full_ft>.yaml
```
Examples:
```bash
# LoRA with 2 GPU, ZeRO2
accelerate launch --config_file code/configs/accelerate/fsdp_2gpu_zero2.yaml \
    code/train_fsdp.py --cfg_path code/configs/training/lora.yaml

# Full fine-tuning with 4 GPU, ZeRO3
accelerate launch --config_file code/configs/accelerate/fsdp_4gpu_zero3.yaml \
    code/train_fsdp.py --cfg_path code/configs/training/full_ft.yaml
```
Outputs:
- `data/outputs/fsdp_<ngpu>gpu_zero<stage>/<model-name>/lora_adapters/` (LoRA)
- `data/outputs/fsdp_<ngpu>gpu_zero<stage>/<model-name>/final_model/` (Full FT)
- `data/outputs/fsdp_<ngpu>gpu_zero<stage>/<model-name>/training_duration.json`
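When parameters are sharded (ZeRO-3 / FULL_SHARD), saving final_model requires gathering the full state dict on the main process first. A sketch of the usual Accelerate pattern for this, not necessarily the exact code in train_fsdp.py (the model ID and output path are illustrative):

```python
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()  # launched via one of the fsdp_<ngpu>gpu_zero<stage>.yaml configs
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-8B")
model = accelerator.prepare(model)  # wraps the model in FSDP per the launch config

# ... training loop elided ...

# Gather the sharded weights into a full state dict on the main process
# and save them in Hugging Face format.
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "data/outputs/fsdp_4gpu_zero3/<model-name>/final_model",
    state_dict=state_dict,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
)
```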
Evaluation:
```bash
# LoRA
python code/evaluate_model.py \
    --cfg_path code/configs/training/lora.yaml \
    --model_path data/outputs/fsdp_2gpu_zero2/<model-name>/lora_adapters

# Full FT
python code/evaluate_model.py \
    --cfg_path code/configs/training/full_ft.yaml \
    --model_path data/outputs/fsdp_4gpu_zero3/<model-name>/final_model
```
Evaluation outputs: Same structure (in model path directory)
DeepSpeed supports 8 combinations: {LoRA, Full FT} × {ZeRO2, ZeRO3} × {2 GPU, 4 GPU}
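As with FSDP, the ZeRO stage and GPU count are set in the accelerate YAML configs. Expressed in code, the equivalent choice via Accelerate's DeepSpeed plugin looks roughly like this (the values shown are assumptions, not this repo's settings):

```python
from accelerate import Accelerator, DeepSpeedPlugin

# ZeRO-2 shards optimizer state and gradients; ZeRO-3 additionally shards the parameters.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,                     # 2 or 3, matching the chosen config file
    gradient_accumulation_steps=4,    # assumed value
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```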
Training pattern:
```bash
accelerate launch --config_file code/configs/accelerate/deepspeed_<ngpu>gpu_zero<stage>.yaml \
    code/train_deepspeed.py --cfg_path code/configs/training/<lora|full_ft>.yaml
```
Examples:
```bash
# LoRA with 2 GPU, ZeRO2
accelerate launch --config_file code/configs/accelerate/deepspeed_2gpu_zero2.yaml \
    code/train_deepspeed.py --cfg_path code/configs/training/lora.yaml

# Full fine-tuning with 4 GPU, ZeRO3
accelerate launch --config_file code/configs/accelerate/deepspeed_4gpu_zero3.yaml \
    code/train_deepspeed.py --cfg_path code/configs/training/full_ft.yaml
```
Outputs:
- `data/outputs/deepspeed_zero<stage>_<ngpu>gpu_<lora|full>/<model-name>/lora_adapters/` (LoRA)
- `data/outputs/deepspeed_zero<stage>_<ngpu>gpu_<lora|full>/<model-name>/final_model/` (Full FT)
- `data/outputs/deepspeed_zero<stage>_<ngpu>gpu_<lora|full>/<model-name>/training_duration.json`
Evaluation:
```bash
# LoRA
python code/evaluate_model.py \
    --cfg_path code/configs/training/lora.yaml \
    --model_path data/outputs/deepspeed_zero2_2gpu_lora/<model-name>/lora_adapters

# Full FT
python code/evaluate_model.py \
    --cfg_path code/configs/training/full_ft.yaml \
    --model_path data/outputs/deepspeed_zero3_4gpu_full/<model-name>/final_model
```
Evaluation outputs: Same structure (in model path directory)
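For evaluation, LoRA adapter directories and full fine-tuned checkpoints load differently; a plausible loading pattern is sketched below (how evaluate_model.py actually distinguishes the two, and the helper name, are assumptions):

```python
import os

from peft import AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

def load_for_eval(model_path: str):
    """Load either a LoRA adapter directory or a full fine-tuned checkpoint (hypothetical helper)."""
    if os.path.exists(os.path.join(model_path, "adapter_config.json")):
        # LoRA adapters: loads the base model named in adapter_config.json,
        # then attaches the adapter weights on top.
        return AutoPeftModelForCausalLM.from_pretrained(model_path, device_map="auto")
    # Full fine-tune: the directory already contains the complete model weights.
    return AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
```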