# <img src="./docs/img/azure_logo.png" alt="Azure Logo" style="width:30px;height:30px;"/> Fine-Tuning Open-Source LLM Models with QLoRA and LoRA

## Overview
Open-source LLMs are powerful, but they typically require fine-tuning for specific tasks such as chatbots or content generation. Fine-tuning can be expensive because of the large amount of VRAM it demands: fully fine-tuning the Llama 7B model, for instance, requires 112 GB of VRAM. Techniques like QLoRA and PEFT can significantly reduce these requirements.
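For example, loading a 7B model in 4-bit precision and attaching small trainable LoRA adapters keeps the memory footprint to a fraction of full fine-tuning. The snippet below is a minimal sketch assuming the `transformers`, `peft`, and `bitsandbytes` packages; the model name and hyperparameters are illustrative, not this repository's settings:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the base model to 4-bit (the "Q" in QLoRA) to cut VRAM usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the quantized weights and train only small LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05))
model.print_trainable_parameters()  # typically ~1% of all parameters
```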

# Whisper Fine-Tuning Pipeline

This repository provides a pipeline for fine-tuning OpenAI's Whisper models on custom audio datasets using LoRA (Low-Rank Adaptation) and PEFT (Parameter-Efficient Fine-Tuning). The solution is designed for scalable training and deployment on Azure Machine Learning.

## Features
- Fine-tune Whisper models with LoRA and PEFT
- Custom data loading and preprocessing for speech datasets
- MLflow integration for experiment tracking and model logging (including all console logs as artifacts)
- Azure ML job specification for cloud training
- Modular codebase for easy extension

## Directory Structure
```
src/core/
  config.py                # Training and LoRA configuration classes
  load_data.py             # Data loading and preprocessing
  train.py                 # Trainer class and training logic
deployment/
  job_data.py              # Data preparation script
  job_train.py             # Training script
  environment.yml          # Conda environment for Azure ML
  training_job.yaml        # Azure ML job specification
notebooks/
  fine_tuned_usage.ipynb   # Example notebook for inference and usage
data/
  dataset_silver/          # Example processed dataset
```

## Setup
1. **Clone the repository**
   ```bash
   git clone <repo-url>
   cd whisper-fine-tuning
   ```
2. **Prepare your dataset**
   - Place your audio datasets in a directory (e.g., `data/train_raw`, `data/evaluation_raw`).
   - Ensure the datasets are in Hugging Face `datasets` format (see the sanity-check snippet after these steps).

3. **Create the environment**

   ```bash
   # Use Python 3.12 for the Poetry environment
   poetry env use 3.12

   # Activate the virtual environment
   poetry shell

   # Install all project dependencies
   poetry install
   ```
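
As referenced in step 2, you can sanity-check that a dataset loads in the expected format. This is a minimal sketch; the `audio` and `sentence` column names are assumptions and may differ in your data:

```python
from datasets import load_from_disk

# Load a dataset previously saved with Dataset.save_to_disk
ds = load_from_disk("data/train_raw")
print(ds)  # shows columns and row count

# Column names here are assumptions; adjust to your schema
example = ds[0]
print(example["audio"]["sampling_rate"], example["sentence"])
```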

## Usage

### Data Preparation

For detailed instructions on dataset preparation, please refer to the README in the `create_data` directory.
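For a quick start, a typical invocation of the data preparation script looks like the sketch below, using the arguments documented in the Arguments section. The script path follows the directory layout above, and the dataset paths are illustrative:

```bash
python deployment/job_data.py \
    --model_name openai/whisper-small \
    --train_datasets data/train_raw \
    --eval_datasets data/evaluation_raw \
    --sampling_rate 16000 \
    --num_proc 2 \
    --output_dir data/silver/dataset
```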

### Model Training

Run the training pipeline:
```bash
python deployment/training/job_train.py --model_name openai/whisper-large-v2 --dataset ./data/silver/dataset --apply_lora True
```

Because LoRA trains only a small set of adapter weights, this is a comparatively lightweight configuration. The command below runs the same fine-tuning while logging the run under a named MLflow experiment:

```bash
python deployment/training/job_train.py --model_name openai/whisper-large-v2 --dataset ./data/silver/dataset --apply_lora True --experiment_name "whisper-mayoruna-v2"
```

This approach helps optimize resource usage while enabling effective model adaptation.

- All console and logging output will be saved as an MLflow artifact (`training_console.log`).
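
To retrieve that log after a run, you can download it through the MLflow API. A minimal sketch, assuming MLflow 2.x; the run ID is a placeholder:

```python
import mlflow

# Download the console log artifact from a finished run
local_path = mlflow.artifacts.download_artifacts(
    run_id="<your-run-id>",
    artifact_path="training_console.log",
)
print(open(local_path).read())
```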

## Model Evaluation

```bash
python src/core/evaluation/evaluation_process.py --is_public_repo False --ckpt_dir "output_model_dir" --temp_ckpt_folder "temp" --eval_datasets data/raw/testing --device 0 --batch_size 16 --output_dir predictions_dir
```
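
The script's `--output_dir` flag points at a predictions directory. If you want to score transcriptions yourself, word error rate (WER) can be computed with the `evaluate` library; the example strings below are placeholders:

```python
import evaluate

# WER compares predicted transcriptions against reference transcripts
wer_metric = evaluate.load("wer")

predictions = ["hello world"]       # model output (placeholder)
references = ["hello there world"]  # ground truth (placeholder)

print(wer_metric.compute(predictions=predictions, references=references))
```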

### NeMo Training

The repository also includes a NeMo-based training entry point, `job_train_nemo.py`. A complete invocation:

```bash
python deployment/training/job_train_nemo.py \
    --dataset_path data/silver/dataset \
    --output_dir output_model_dir/nemo_rnnt_da \
    --num_workers 0
```

### Azure ML Training
1. **Configure Azure ML compute and workspace.**
2. **Submit the job:**
   ```bash
   az ml job create --file deployment/training_job.yaml
   ```
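
Alternatively, the same job can be submitted from Python with the Azure ML SDK v2. This is a sketch under assumptions: the workspace identifiers and the compute and environment names are placeholders you must replace with your own:

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to the workspace (all identifiers are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Define a command job mirroring deployment/training_job.yaml
job = command(
    code="./",
    command=(
        "python deployment/training/job_train.py "
        "--model_name openai/whisper-large-v2 "
        "--dataset ./data/silver/dataset --apply_lora True"
    ),
    environment="<environment-name>@latest",
    compute="<compute-name>",
    experiment_name="whisper-fine-tuning",
)

ml_client.jobs.create_or_update(job)
```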

## Arguments

### `job_data.py`
- `--model_name`: Hugging Face model name (default: `openai/whisper-small`)
- `--train_datasets`: List of training dataset paths (required)
- `--eval_datasets`: List of evaluation dataset paths (optional)
- `--sampling_rate`: Audio sampling rate (default: 16000)
- `--num_proc`: Number of parallel jobs for data prep (default: 2)
- `--output_dir`: Output directory for the processed dataset

### `job_train.py`
- `--model_name`: Hugging Face model name (default: `openai/whisper-small`)
- `--dataset`: Path to processed dataset (required)
- `--output_dir`: Output directory for checkpoints (default: `output_model_dir`)
- `--apply_lora`: Whether to apply LoRA (default: True)
- `--language`: Language for adaptation (default: Hindi)
- `--sampling_rate`: Audio sampling rate (default: 16000)
- `--num_proc`: Number of parallel jobs for data prep (default: 2)
- `--train_strategy`: Training strategy (`steps` or `epoch`; default: `steps`)

## Customization
- Modify `src/core/config.py` to adjust training and LoRA parameters.
- Update `deployment/environment.yml` to add or remove dependencies.
- Edit `deployment/training_job.yaml` for Azure ML compute, environment, and output settings.
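
For reference, a typical PEFT LoRA setup for Whisper looks like the sketch below. The rank, alpha, and target modules shown are common defaults for Whisper, not necessarily the values used in `src/core/config.py`:

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Inject low-rank adapters into the attention projections only,
# leaving the base model weights frozen
lora_config = LoraConfig(
    r=32,                                 # rank of the low-rank update matrices
    lora_alpha=64,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically ~1% of all parameters
```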

## Outputs
- Trained model checkpoints in the specified output directory
- MLflow experiment logs (including all console logs as artifacts)
- Optionally, model artifacts pushed to the Hugging Face Hub

## Inference Example

See `notebooks/fine_tuned_usage.ipynb` for detailed usage.
To load and use a model logged with MLflow:
```python
import mlflow.transformers

model_uri = "models:/whisper-fine-tuned/1"  # or your own model URI
pipe = mlflow.transformers.load_model(model_uri)
result = pipe("path/to/audio/file.wav")
print(result["text"])
```
Or, using the pyfunc interface:
```python
import mlflow.pyfunc
import pandas as pd

loaded_model = mlflow.pyfunc.load_model(model_uri)
df = pd.DataFrame({"inputs": ["path/to/audio/file.wav"]})
result = loaded_model.predict(df)
print(result)
```

## Steps to Configure Pre-commit

1. Install pre-commit (if not already installed):

   ```bash
   pip install pre-commit
   ```

2. Use the `.pre-commit-config.yaml` file in the root of the repository.

3. Install the pre-commit hooks:

   ```bash
   pre-commit install
   ```

4. Run the hooks manually on all files to ensure consistency:

   ```bash
   pre-commit run --all-files
   ```

Integrating pre-commit into your workflow helps maintain consistent code standards and prevents common issues before they enter your codebase.

## Troubleshooting
- Ensure all dependencies in `environment.yml` are installed.
- Check dataset format and paths.
- Review Azure ML compute and environment configuration if running in the cloud.

## License
This project is licensed under the MIT License.

## Contact
For questions or support, please open an issue or contact the repository maintainer.

## Tips for the Current Scenario (Very Small Dataset)

| Parameter | Suggested | Reason |
|-----------|-----------|--------|
| `per_device_train_batch_size` | 4–8 | With only 30 samples, large batches are unnecessary and may cause instability. |
| `gradient_accumulation_steps` | 1–4 | Increase if you want to simulate larger batches without memory overhead. |
| `num_train_epochs` | 10–30 | More epochs help the model learn from limited data. Use early stopping or monitor validation loss. |
| `max_steps` | Remove or set to -1 | Let training be driven by `num_train_epochs` to avoid premature stopping. |
| `warmup_steps` | 50–100 | 20k is far too high for such a small dataset. |
| `learning_rate` | 1e-5 to 3e-5 | The current value is reasonable, but monitor for overfitting. |
| `eval_strategy` | `"epoch"` | With few samples, evaluating per epoch is more meaningful. |
| `save_strategy` | `"epoch"` | Align with evaluation to save meaningful checkpoints. |
| `fp16` | ✅ | Keep if your hardware supports it. |
| `gradient_checkpointing` | ✅ | Helps with memory efficiency. |
| `optim` | `"adamw_bnb_8bit"` | Good choice for memory-constrained environments. |
| `load_best_model_at_end` | ✅ | Helps select the best checkpoint. |
| `predict_with_generate` | ✅ | Needed for Whisper's transcription tasks. |
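
Put together, these suggestions map onto `Seq2SeqTrainingArguments` roughly as in the sketch below. The values are picked from the suggested ranges above, not tested settings; on older `transformers` versions the `eval_strategy` argument is named `evaluation_strategy`:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output_model_dir",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=20,
    max_steps=-1,                    # let num_train_epochs drive training
    warmup_steps=50,
    learning_rate=1e-5,
    eval_strategy="epoch",           # "evaluation_strategy" on older versions
    save_strategy="epoch",           # must match eval_strategy for best-model selection
    fp16=True,
    gradient_checkpointing=True,
    optim="adamw_bnb_8bit",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    predict_with_generate=True,
)
```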

---

### 🧠 Additional Tips

- **Use Data Augmentation**: Consider adding noise, pitch shift, or speed variation to artificially expand your dataset.
- **Freeze Most Layers**: Fine-tune only the final layers or adapters to reduce overfitting.
- **Use Early Stopping**: Monitor validation loss and stop training when it plateaus (see the sketch after this list).
- **Log Carefully**: Avoid logging too many parameters to MLflow to prevent the `INVALID_PARAMETER_VALUE` error.
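
A minimal early-stopping setup with `transformers` is sketched below. It assumes the `training_args` from the earlier sketch (with `load_best_model_at_end=True` and per-epoch evaluation); the `model`, datasets, and data collator are placeholders for objects you have already built:

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainer

# Stop when eval_loss fails to improve for 3 consecutive evaluations;
# requires load_best_model_at_end=True and matching eval/save strategies.
trainer = Seq2SeqTrainer(
    model=model,                  # your (LoRA-wrapped) Whisper model
    args=training_args,           # e.g. the arguments sketched earlier
    train_dataset=train_dataset,  # placeholders: your processed datasets
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```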