# Whisper Fine-Tuning Pipeline

This project provides an end-to-end workflow for preparing custom audio datasets, fine-tuning OpenAI’s Whisper models with LoRA/PEFT, and operationalizing the resulting checkpoints on Azure ML.

## Highlights
- LoRA-based fine-tuning scripts that minimize GPU memory requirements.
- Data ingestion utilities that convert raw audio into Hugging Face `datasets` format.
- MLflow integration for experiment tracking and artifact storage.
- Azure ML job definitions for cloud-scale training and evaluation.
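
The LoRA path relies on Hugging Face PEFT. As a minimal sketch (the rank, alpha, and target modules below are illustrative assumptions, not this project's exact settings), wrapping a Whisper checkpoint with adapters looks like this:

```python
# Sketch: attach LoRA adapters to Whisper via PEFT. Hyperparameters are
# illustrative assumptions, not this project's committed configuration.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

lora_config = LoraConfig(
    r=32,                                 # adapter rank (assumed)
    lora_alpha=64,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically ~1% of the full model
```

Only the adapter weights receive gradients, which is what keeps GPU memory requirements low.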

## Project Layout
```
apps/whisper_fine_tuning/
├── deployment/   # Azure ML job specs and training entrypoints
├── docs/         # Diagrams and supporting documentation
├── infra/        # Infrastructure-as-code templates
├── notebooks/    # Exploratory analysis and inference notebooks
├── src/core/     # Data prep, training, and evaluation modules
└── data/         # Example datasets (raw/silver) – ignored in git
```

## Quick Start
1. **Install dependencies**
   ```bash
   poetry env use 3.12
   poetry install
   ```
   Activate the environment with `poetry shell` or your preferred virtualenv tool.

2. **Prepare raw audio**
   ```bash
   python src/core/data_prep/main_data_prep.py \
       --source_data_dir data/raw/audios/matis \
       --output_data_dir data/raw/training \
       --domain train
   ```
   Repeat for `--domain evaluation` and `--domain test`. See `src/core/data_prep/README.md` for more options.

3. **Generate the silver dataset**
   ```bash
   python src/core/data_prep/main_silver_data_prep.py \
       --train_datasets data/raw/training \
       --eval_datasets data/raw/evaluation \
       --test_datasets data/raw/testing
   ```
   A quick sanity check of the resulting dataset is sketched right after this list.

4. **Train Whisper with LoRA**
   ```bash
   python src/core/train/main_train.py \
       --model_name openai/whisper-large-v2 \
       --dataset data/silver/dataset \
       --language Matis \
       --output_dir output_model_dir \
       --apply_lora True
   ```
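
Before launching training, it can help to verify the silver dataset loads cleanly. A minimal sketch, assuming the default output location and that the prepared data is a Hugging Face `DatasetDict` with a `train` split:

```python
# Sketch: inspect the prepared silver dataset. The path and split name
# are assumptions based on the defaults used above.
from datasets import load_from_disk

dataset = load_from_disk("data/silver/dataset")
print(dataset)                      # splits and row counts
print(dataset["train"][0].keys())   # expected feature columns
```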

## Running on Azure ML
Submit the packaged job using the provided YAML spec:
```bash
az ml job create --file deployment/training_job.yaml
```
Customize compute, environment, and inputs inside the YAML before submission.

## Evaluation & Inference
- Offline evaluation: `python src/core/evaluation/evaluation_process.py --eval_datasets data/raw/testing ...`
- NeMo experiments: `python src/core/train/main_train_nemo.py --dataset_path data/silver/dataset ...`
- Inference notebook: `notebooks/fine_tuned_usage.ipynb`
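
Models logged with MLflow can also be reloaded for transcription without the training stack. A minimal sketch, assuming a registered model named `whisper-fine-tuned` (substitute your own model URI):

```python
import mlflow.transformers

# Load the fine-tuned pipeline from MLflow; the registry name and
# version below are illustrative.
model_uri = "models:/whisper-fine-tuned/1"
pipe = mlflow.transformers.load_model(model_uri)

result = pipe("path/to/audio/file.wav")
print(result["text"])
```

The same artifact can be consumed through the generic `pyfunc` interface:

```python
import mlflow.pyfunc
import pandas as pd

# pyfunc wraps the model behind a uniform predict() API.
loaded_model = mlflow.pyfunc.load_model("models:/whisper-fine-tuned/1")
df = pd.DataFrame({"inputs": ["path/to/audio/file.wav"]})
print(loaded_model.predict(df))
```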

## Key CLI Arguments
| Script | Argument | Description |
| --- | --- | --- |
| `main_data_prep.py` | `--source_data_dir` | Folder with `audio_paths` and `text` mappings |
| | `--domain` | `train`, `test`, `evaluation`, or `all` |
| `main_silver_data_prep.py` | `--train_datasets`/`--eval_datasets` | Hugging Face datasets produced by the raw prep stage |
| `main_train.py` | `--model_name` | Base Whisper checkpoint (default `openai/whisper-small`) |
| | `--apply_lora` | Enable/disable LoRA adapters |
| | `--experiment_name` | MLflow experiment name; auto-generated if omitted |

Full option lists live in each script’s `--help` output.

## MLflow & Logging
- Runs log configuration, metrics, and console output (see `training_console.log`).
- Checkpoints are written to `output_model_dir/` and can be registered with MLflow or uploaded to Azure.
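
For example, a checkpoint logged during a run can be promoted to the MLflow Model Registry. A sketch with placeholder values (run ID, artifact path, and registered name are not project defaults):

```python
import mlflow

# Sketch: register a logged model. The run ID, artifact path, and
# registered name below are placeholders to replace with your own.
run_id = "abc123"  # your MLflow run ID
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="whisper-fine-tuned",
)
print(result.version)
```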

## Pre-Commit Hooks
```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```
Hooks enforce formatting and linting before changes land in version control.

## Troubleshooting
- Verify `environment.yml` or `pyproject.toml` dependencies are installed.
- Ensure datasets follow the expected directory structure (`data/raw/audios/...`).
- For Azure ML issues, confirm workspace credentials and compute targets.

## Fine-Tuning Tips
| Parameter | Recommended | Why |
| --- | --- | --- |
| `per_device_train_batch_size` | 4–8 | Small datasets benefit from smaller batches |
| `num_train_epochs` | 10–30 | Compensate for limited data with more passes |
| `warmup_steps` | 50–100 | Faster ramp-up than large fixed counts |
| `eval_strategy` | `epoch` | Evaluate once per epoch to avoid noise |
| `save_strategy` | `epoch` | Align checkpointing with evaluation |
| `load_best_model_at_end` | `True` | Automatically keep the best-performing checkpoint |

Additional guidance on data augmentation, freezing layers, early stopping, and MLflow logging is summarized in `src/core/train/README.md`.
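
As a rough sketch, those recommendations map onto Hugging Face `Seq2SeqTrainingArguments` along these lines (the concrete values are illustrative midpoints to adapt to your dataset, not the project's committed defaults):

```python
# Illustrative Seq2SeqTrainingArguments reflecting the tips above; the
# exact values are assumptions, not this project's defaults.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output_model_dir",
    per_device_train_batch_size=8,
    num_train_epochs=20,            # more passes over a small dataset
    warmup_steps=50,
    learning_rate=1e-5,
    eval_strategy="epoch",          # evaluate once per epoch
    save_strategy="epoch",          # align checkpoints with evaluation
    load_best_model_at_end=True,    # keep the best-performing checkpoint
    metric_for_best_model="eval_loss",
    fp16=True,                      # if your hardware supports it
    gradient_checkpointing=True,    # trade compute for memory
    predict_with_generate=True,     # needed for transcription metrics
)
```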

## License
Licensed under the MIT License. Contributions and issues are welcome.