
Commit df0b782 ("first version - FT"), 1 parent: afce944


59 files changed: +24553 −1 lines

.github/workflows/pre-commit.yml

Lines changed: 14 additions & 0 deletions
```yaml
name: pre-commit
on: [pull_request, push]

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install pre-commit
        run: python -m pip install --upgrade pre-commit
      - name: Run pre-commit
        run: pre-commit run --all-files
```

.gitignore

Lines changed: 21 additions & 1 deletion
```diff
@@ -427,4 +427,24 @@ transliteration*
 .venv*
 infra/target*
 .vscode*
-junk/
+junk/
+*.code-workspace
+
+# Machine learning artifacts and cache directories
+mlruns/
+custom_data/
+output_model_dir/
+nemo_rnnt_da/
+training_console.log
+cache*
+.mypy_cache/
+.ruff_cache/
+
+# Local dataset directories
+apps/whisper_fine_tuning/data/
+
+*.DS_Store
+
+data/
+
+predictions_dir/
```

.pre-commit-config.yaml

Lines changed: 44 additions & 0 deletions
```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.10.0
    hooks:
      - id: black
        language_version: python3.11  # require Python 3.11 or newer

  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.1.0
    hooks:
      - id: ruff
        args: [--fix, --extend-ignore, E402]  # ruff will auto-fix many issues
        additional_dependencies: []

  - repo: https://github.com/pre-commit/mirrors-isort
    rev: v5.10.1
    hooks:
      - id: isort
        args: ["--profile", "black"]

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-yaml
      - id: check-added-large-files

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.6.1  # pick a valid tag you confirmed with git ls-remote
    hooks:
      - id: mypy
        # keep commonly useful flags, then selectively disable error codes reported by mypy
        args:
          - --ignore-missing-imports
          - --disable-error-code=import-untyped
          - --disable-error-code=call-arg
          - --disable-error-code=union-attr
          - --disable-error-code=arg-type
          - --disable-error-code=used-before-def
          - --disable-error-code=attr-defined
        files: \.py$
        language_version: python3.11
```

apps/whisper_fine_tuning/Makefile

Lines changed: 49 additions & 0 deletions
```makefile
export HOME := $(HOME)
.ONESHELL:

ifeq ($(OS),Windows_NT)
    SHELL = cmd
    CONDA_ACTIVATE = call %CONDA_PREFIX%\Scripts\activate.bat
else
    SHELL = /bin/bash
    # source conda for a non-interactive shell; the environment name follows the macro
    CONDA_ACTIVATE = source $$(conda info --base)/etc/profile.d/conda.sh ; conda activate
endif

setup_aml:
	rm -rf ~/.pyenv
	curl https://pyenv.run | bash
	$(HOME)/.pyenv/bin/pyenv --version
	$(HOME)/.pyenv/bin/pyenv install 3.12 --skip-existing
	$(HOME)/.pyenv/bin/pyenv local 3.12
	python --version
	conda create -n condav0 python=3.12
	$(CONDA_ACTIVATE) condav0
	conda install -c conda-forge poetry
	poetry config virtualenvs.create true
	poetry config virtualenvs.in-project true
	poetry lock --no-update
	poetry install
	conda install pip
	conda install -c conda-forge "ffmpeg>=5,<7"
	sudo apt update && sudo apt install -y ffmpeg
	python -m ipykernel install --user --name condav0 --display-name "condav0"

USERPROFILE := $(USERPROFILE)
CURRENT_DIR := $(shell cd)

setup_win:
	if exist %USERPROFILE%\.pyenv rmdir /s /q %USERPROFILE%\.pyenv
	git clone https://github.com/pyenv-win/pyenv-win.git "%USERPROFILE%\.pyenv"
	$(USERPROFILE)\.pyenv\pyenv-win\bin\pyenv --version
	$(USERPROFILE)\.pyenv\pyenv-win\bin\pyenv install 3.12 --skip-existing
	$(USERPROFILE)\.pyenv\pyenv-win\bin\pyenv local 3.12
	python --version
	python -m venv venv
	echo $(CURRENT_DIR)
	"$(CURRENT_DIR)/venv/Scripts/activate"
	pip install poetry
	poetry config virtualenvs.create true
	poetry config virtualenvs.in-project true
	poetry lock
	poetry install
	conda install pip
```

apps/whisper_fine_tuning/README.md

Lines changed: 228 additions & 0 deletions
# <img src="./docs/img/azure_logo.png" alt="Azure Logo" style="width:30px;height:30px;"/> Fine-Tuning Open-Source LLM Models: QLoRA and LoRA Features Implemented

## Overview
Open-source LLMs are powerful but require fine-tuning for specific tasks such as chatbots or content generation. Fine-tuning these models can be expensive because of the substantial VRAM it demands: fully fine-tuning the Llama 7B model, for instance, requires 112 GB of VRAM. Techniques like QLoRA and PEFT can significantly reduce these requirements.

# Whisper Fine-Tuning Pipeline

This repository provides a pipeline for fine-tuning OpenAI's Whisper models on custom audio datasets using LoRA (Low-Rank Adaptation) and PEFT (Parameter-Efficient Fine-Tuning). The solution is designed for scalable training and deployment on Azure Machine Learning.
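For orientation, here is a minimal sketch of what LoRA wrapping looks like with PEFT. The rank, alpha, and `target_modules` values are illustrative assumptions, not the settings shipped in `src/core/config.py`:

```python
# Sketch: wrap a Whisper checkpoint with a LoRA adapter via PEFT so that
# only the low-rank adapter weights train. The r/alpha/target_modules
# values are illustrative, not the project's actual configuration.
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically ~1% of all parameters
```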
## Features
- Fine-tune Whisper models with LoRA and PEFT
- Custom data loading and preprocessing for speech datasets
- MLflow integration for experiment tracking and model logging, including all console logs as artifacts (see the sketch after this list)
- Azure ML job specification for cloud training
- Modular codebase for easy extension
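As a sketch of the console-log feature: the pattern below mirrors console output into a file and attaches it to the active MLflow run. Only the artifact name `training_console.log` comes from this repository; the handler wiring is an assumed illustration.

```python
# Sketch: mirror console output into a file and attach it to the MLflow run.
# Only the artifact name training_console.log comes from this repository;
# the handler wiring is an assumed illustration.
import logging
import mlflow

logging.basicConfig(
    level=logging.INFO,
    handlers=[
        logging.StreamHandler(),                      # keep printing to console
        logging.FileHandler("training_console.log"),  # also capture to a file
    ],
)

with mlflow.start_run():
    logging.info("training started")
    # ... training loop goes here ...
    for handler in logging.getLogger().handlers:
        handler.flush()                               # complete the file first
    mlflow.log_artifact("training_console.log")       # upload as an artifact
```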
## Directory Structure
```
src/core/
  config.py                # Training and LoRA configuration classes
  load_data.py             # Data loading and preprocessing
  train.py                 # Trainer class and training logic
deployment/
  job_data.py              # Data preparation script
  job_train.py             # Training script
  environment.yml          # Conda environment for Azure ML
  training_job.yaml        # Azure ML job specification
notebooks/
  fine_tuned_usage.ipynb   # Example notebook for inference and usage
data/
  dataset_silver/          # Example processed dataset
```
## Setup
1. **Clone the repository**
   ```bash
   git clone <repo-url>
   cd whisper-fine-tuning
   ```
2. **Prepare your dataset**
   - Place your audio datasets in a directory (e.g., `data/train_raw`, `data/evaluation_raw`).
   - Ensure datasets are in Hugging Face `datasets` format.
3. **Create the environment**
   ```bash
   # Use Python 3.12 for the Poetry environment
   poetry env use 3.12

   # Activate the virtual environment
   poetry shell

   # Install all project dependencies
   poetry install
   ```
## Usage

### Data Preparation

For detailed instructions on dataset preparation, refer to the README in the `create_data` directory. Training and evaluation data are expected in Hugging Face `datasets` format, as sketched below.
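To make the expected format concrete, here is a minimal sketch. The path and the `audio`/`sentence` column names are assumptions for illustration; the actual schema is defined by the create_data pipeline.

```python
# Sketch: load a dataset saved with Dataset.save_to_disk and resample its
# audio to 16 kHz. The path and the "audio"/"sentence" column names are
# assumptions for illustration; see the create_data README for the schema.
from datasets import Audio, load_from_disk

ds = load_from_disk("data/train_raw")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # decode/resample on access

sample = ds[0]
print(sample["audio"]["array"].shape, sample["sentence"])
```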
### Model Training

Run the training pipeline:
```bash
python deployment/training/job_train.py --model_name openai/whisper-large-v2 --dataset ./data/silver/dataset --apply_lora True
```

To log the run under a dedicated MLflow experiment, add `--experiment_name`. This command fine-tunes OpenAI's Whisper model with LoRA applied on your specified dataset:

```bash
python deployment/training/job_train.py --model_name openai/whisper-large-v2 --dataset ./data/silver/dataset --apply_lora True --experiment_name "whisper-mayoruna-v2"
```

This approach helps optimize resource usage while enabling effective model adaptation.

- All console and logging output is saved as an MLflow artifact (`training_console.log`).

## Model Evaluation

```bash
python src/core/evaluation/evaluation_process.py --is_public_repo False --ckpt_dir "output_model_dir" --temp_ckpt_folder "temp" --eval_datasets data/raw/testing --device 0 --batch_size 16 --output_dir predictions_dir
```

The NeMo RNN-T training script accepts the same dataset, with either a top-level or a nested output directory:

```bash
python deployment/training/job_train_nemo.py \
    --dataset_path data/silver/dataset \
    --output_dir nemo_rnnt_da/

python deployment/training/job_train_nemo.py \
    --dataset_path data/silver/dataset \
    --output_dir output_model_dir/nemo_rnnt_da \
    --num_workers 0
```
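For reference, speech-recognition evaluation typically reports word error rate (WER). The sketch below shows the metric in isolation with the `evaluate` library; `evaluation_process.py` may apply its own text normalization and I/O around it.

```python
# Sketch: word error rate (WER) in isolation with the evaluate library.
# evaluation_process.py may apply its own text normalization around this.
import evaluate

wer_metric = evaluate.load("wer")
references = ["the cat sat on the mat"]
predictions = ["the cat sat on a mat"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}")  # fraction of word-level edits; lower is better
```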
### Azure ML Training
1. **Configure Azure ML compute and workspace.**
2. **Submit the job:**
   ```bash
   az ml job create --file deployment/training_job.yaml
   ```
## Arguments

### `job_data.py`
- `--model_name`: Hugging Face model name (default: `openai/whisper-small`)
- `--train_datasets`: List of training dataset paths (required)
- `--eval_datasets`: List of evaluation dataset paths (optional)
- `--sampling_rate`: Audio sampling rate in Hz (default: 16000)
- `--num_proc`: Number of parallel jobs for data prep (default: 2)
- `--output_dir`: Output directory for the processed dataset

### `job_train.py`
- `--model_name`: Hugging Face model name (default: `openai/whisper-small`)
- `--dataset`: Path to the processed dataset (required)
- `--output_dir`: Output directory for checkpoints (default: `output_model_dir`)
- `--apply_lora`: Whether to apply LoRA (default: True)
- `--language`: Language for adaptation (default: Hindi)
- `--sampling_rate`: Audio sampling rate in Hz (default: 16000)
- `--num_proc`: Number of parallel jobs for data prep (default: 2)
- `--train_strategy`: Training strategy, `steps` or `epoch` (default: `steps`)
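For orientation, here is a hedged sketch of how these flags could be declared with `argparse`; the real `job_train.py` may define them differently.

```python
# Sketch: how job_train.py's flags could be declared with argparse.
# Purely illustrative; the actual script may define them differently.
import argparse

parser = argparse.ArgumentParser(description="Fine-tune Whisper with optional LoRA")
parser.add_argument("--model_name", default="openai/whisper-small")
parser.add_argument("--dataset", required=True, help="path to the processed dataset")
parser.add_argument("--output_dir", default="output_model_dir")
# note: type=bool would treat the string "False" as True, hence the lambda
parser.add_argument("--apply_lora", type=lambda s: s.lower() == "true", default=True)
parser.add_argument("--language", default="Hindi")
parser.add_argument("--sampling_rate", type=int, default=16000)
parser.add_argument("--num_proc", type=int, default=2)
parser.add_argument("--train_strategy", choices=["steps", "epoch"], default="steps")

args = parser.parse_args()
```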

## Customization
- Modify `src/core/config.py` to adjust training and LoRA parameters.
- Update `deployment/environment.yml` to add or remove dependencies.
- Edit `deployment/training_job.yaml` for Azure ML compute, environment, and output settings.

## Outputs
- Trained model checkpoints in the specified output directory
- MLflow experiment logs (including all console logs as artifacts)
- Optionally, model artifacts pushed to the Hugging Face Hub

## Inference Example

See `notebooks/fine_tuned_usage.ipynb` for detailed usage.
To load and use a model logged with MLflow:
```python
import mlflow.transformers

model_uri = "models:/whisper-fine-tuned/1"  # or your own model URI
pipe = mlflow.transformers.load_model(model_uri)
result = pipe("path/to/audio/file.wav")
print(result["text"])
```
Or, using the pyfunc interface:
```python
import mlflow.pyfunc
import pandas as pd

loaded_model = mlflow.pyfunc.load_model(model_uri)
df = pd.DataFrame({"inputs": ["path/to/audio/file.wav"]})
result = loaded_model.predict(df)
print(result)
```
## Steps to Configure Pre-commit

1. Install pre-commit (if not already installed):

   ```bash
   pip install pre-commit
   ```

2. Use the `.pre-commit-config.yaml` file in the root of the repository.

3. Install the pre-commit hooks:

   ```bash
   pre-commit install
   ```

4. Run the hooks manually on all files to ensure consistency:

   ```bash
   pre-commit run --all-files
   ```

Integrating pre-commit into your workflow helps maintain consistent code standards and prevents common issues before they enter your codebase.

## Troubleshooting
- Ensure all dependencies in `environment.yml` are installed.
- Check dataset format and paths.
- Review the Azure ML compute and environment configuration if running in the cloud.

## License
This project is licensed under the MIT License.

## Contact
For questions or support, please open an issue or contact the repository maintainer.
## Tips for the Current Scenario

| Parameter | Suggested | Reason |
|-----------|-----------|--------|
| `per_device_train_batch_size` | 4–8 | With only 30 samples, large batches are unnecessary and may cause instability. |
| `gradient_accumulation_steps` | 1–4 | Increase if you want to simulate larger batches without memory overhead. |
| `num_train_epochs` | 10–30 | More epochs help the model learn from limited data. Use early stopping or monitor validation loss. |
| `max_steps` | Remove or set to -1 | Let training be driven by `num_train_epochs` to avoid premature stopping. |
| `warmup_steps` | 50–100 | 20k is far too high for such a small dataset. |
| `learning_rate` | 1e-5 to 3e-5 | Your current value is reasonable, but monitor for overfitting. |
| `eval_strategy` | `"epoch"` | With few samples, evaluating per epoch is more meaningful. |
| `save_strategy` | `"epoch"` | Align with evaluation to save meaningful checkpoints. |
| `fp16` | ✅ | Keep if your hardware supports it. |
| `gradient_checkpointing` | ✅ | Helps with memory efficiency. |
| `optim` | `"adamw_bnb_8bit"` | Good choice for memory-constrained environments. |
| `load_best_model_at_end` | ✅ | Helps select the best checkpoint. |
| `predict_with_generate` | ✅ | Needed for Whisper's transcription tasks. |
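Mapped onto Hugging Face `Seq2SeqTrainingArguments`, the suggestions above could look like the following sketch (illustrative values, not the project's shipped configuration):

```python
# Sketch: Seq2SeqTrainingArguments following the table above, sized for a
# tiny (~30-sample) dataset. Values are suggestions, not the repo's config.
# Note: on transformers < 4.41 the eval flag is named evaluation_strategy.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output_model_dir",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,      # effective batch size of 8
    num_train_epochs=20,
    max_steps=-1,                       # let num_train_epochs drive training
    warmup_steps=50,
    learning_rate=1e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    fp16=True,                          # keep if the GPU supports it
    gradient_checkpointing=True,
    optim="adamw_bnb_8bit",             # requires the bitsandbytes package
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # also enables early stopping below
    greater_is_better=False,
    predict_with_generate=True,
)
```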

---

### 🧠 Additional Tips

- **Use Data Augmentation**: Consider adding noise, pitch shift, or speed variation to artificially expand your dataset.
- **Freeze Most Layers**: Fine-tune only the final layers or adapters to reduce overfitting.
- **Use Early Stopping**: Monitor validation loss and stop training when it plateaus (see the sketch after this list).
- **Log Carefully**: Avoid logging too many parameters to MLflow to prevent the `INVALID_PARAMETER_VALUE` error.
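A minimal sketch of early stopping with the `transformers` `EarlyStoppingCallback`, assuming `model`, `train_ds`, and `eval_ds` already exist from the pipeline's own setup:

```python
# Sketch: early stopping on validation loss, reusing training_args from
# the sketch above (it already sets load_best_model_at_end=True and
# metric_for_best_model="eval_loss", which EarlyStoppingCallback requires).
# model, train_ds, and eval_ds are placeholders for the pipeline's own objects.
from transformers import EarlyStoppingCallback, Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,            # e.g. the PEFT-wrapped Whisper model
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()             # stops after 3 evaluations without improvement
```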
