Commit 4a2bb72

chore: readme improvement
1 parent 3d2d02a commit 4a2bb72

3 files changed: +111 -225 lines changed

apps/whisper_fine_tuning/README.md

Lines changed: 96 additions & 213 deletions
Original file line number | Diff line number | Diff line change
@@ -1,228 +1,111 @@
1-
# <img src="./docs/img//azure_logo.png" alt="Azure Logo" style="width:30px;height:30px;"/> Fine Tuning Open Source LLM Models - QLora and Lora features implemented
2-
3-
## Overview
4-
Open-source LLMs are powerful but require fine-tuning for specific tasks like chatbots or content generation. Fine-tuning these models can be expensive due to the need for substantial VRAM. For instance, fully fine-tuning the Llama7B model requires 112GB of VRAM. However, techniques like QLoRA and PEFT can significantly reduce these requirements.
5-
61
# Whisper Fine-Tuning Pipeline
72

8-
This repository provides a pipeline for fine-tuning OpenAI's Whisper models on custom audio datasets using LoRA (Low-Rank Adaptation) and PEFT (Parameter-Efficient Fine-Tuning). The solution is designed for scalable training and deployment on Azure Machine Learning.
3+
This project provides an end-to-end workflow for preparing custom audio datasets, fine-tuning OpenAI's Whisper models with LoRA/PEFT, and operationalizing the resulting checkpoints on Azure ML.
94

10-
## Features
11-
- Fine-tune Whisper models with LoRA and PEFT
12-
- Custom data loading and preprocessing for speech datasets
13-
- MLflow integration for experiment tracking and model logging (including all console logs as artifacts)
14-
- Azure ML job specification for cloud training
15-
- Modular codebase for easy extension
5+
## Highlights
6+
- LoRA-based fine-tuning scripts that minimize GPU memory requirements.
7+
- Data ingestion utilities that convert raw audio into Hugging Face `datasets` format.
8+
- MLflow integration for experiment tracking and artifact storage.
9+
- Azure ML job definitions for cloud-scale training and evaluation.
1610

17-
## Directory Structure
11+
## Project Layout
1812
```
19-
src/core/
20-
config.py # Training and LoRA configuration classes
21-
load_data.py # Data loading and preprocessing
22-
train.py # Trainer class and training logic
23-
deployment/
24-
job_data.py # Data preparation script
25-
job_train.py # Training script
26-
environment.yml # Conda environment for Azure ML
27-
training_job.yaml # Azure ML job specification
28-
notebooks/
29-
fine_tuned_usage.ipynb # Example notebook for inference and usage
30-
data/
31-
dataset_silver/ # Example processed dataset
13+
apps/whisper_fine_tuning/
14+
├── deployment/ # Azure ML job specs and training entrypoints
15+
├── docs/ # Diagrams and supporting documentation
16+
├── infra/ # Infrastructure-as-code templates
17+
├── notebooks/ # Exploratory analysis and inference notebooks
18+
├── src/core/ # Data prep, training, and evaluation modules
19+
└── data/ # Example datasets (raw/silver) – ignored in git
3220
```
3321

34-
## Setup
35-
1. **Clone the repository**
36-
```bash
37-
git clone <repo-url>
38-
cd whisper-fine-tuning
39-
```
40-
2. **Prepare your dataset**
41-
- Place your audio datasets in a directory (e.g., `data/train_raw`, `data/evaluation_raw`).
42-
- Ensure datasets are in Hugging Face `datasets` format.
43-
44-
3. **Create the environment**
45-
22+
## Quick Start
23+
1. **Install dependencies**
24+
```bash
25+
poetry env use 3.12
26+
poetry install
27+
```
28+
Activate the environment with `poetry shell` or your preferred virtualenv tool.
29+
30+
2. **Prepare raw audio**
31+
```bash
32+
python src/core/data_prep/main_data_prep.py \
33+
--source_data_dir data/raw/audios/matis \
34+
--output_data_dir data/raw/training \
35+
--domain train
36+
```
37+
Repeat for `--domain evaluation` and `--domain test`. See `src/core/data_prep/README.md` for more options.
38+
39+
3. **Generate silver dataset**
40+
```bash
41+
python src/core/data_prep/main_silver_data_prep.py \
42+
--train_datasets data/raw/training \
43+
--eval_datasets data/raw/evaluation \
44+
--test_datasets data/raw/testing
45+
```
46+
47+
4. **Train Whisper with LoRA**
48+
```bash
49+
python src/core/train/main_train.py \
50+
--model_name openai/whisper-large-v2 \
51+
--dataset data/silver/dataset \
52+
--language Matis \
53+
--output_dir output_model_dir \
54+
--apply_lora True
55+
```
56+
57+
## Running on Azure ML
58+
Submit the packaged job using the provided YAML spec:
4659
```bash
47-
# Use Python 3.12 for the Poetry environment
48-
poetry env use 3.12
49-
50-
# Activate the virtual environment
51-
poetry shell
52-
53-
# Install all project dependencies
54-
poetry install
60+
az ml job create --file deployment/training_job.yaml
5561
```
56-
57-
## Usage
58-
59-
### Data Preparation
60-
61-
### Data Preparation in Create Data
62-
63-
For detailed instructions on dataset preparation, please refer to the README in the create_data directory.
64-
65-
66-
67-
### Model Training
68-
69-
Run the training pipeline:
70-
```bash
71-
python deployment/training/job_train.py --model_name openai/whisper-large-v2 --dataset ./data/silver/dataset --apply_lora True
62+
Customize compute, environment, and inputs inside the YAML before submission.
63+
64+
## Evaluation & Inference
65+
- Offline evaluation: `python src/core/evaluation/evaluation_process.py --eval_datasets data/raw/testing ...`
66+
- NeMo experiments: `python src/core/train/main_train_nemo.py --dataset_path data/silver/dataset ...`
67+
- Inference notebook: `notebooks/fine_tuned_usage.ipynb`
68+
69+
## Key CLI Arguments
70+
| Script | Argument | Description |
71+
| --- | --- | --- |
72+
| `main_data_prep.py` | `--source_data_dir` | Folder with `audio_paths` and `text` mappings |
73+
| | `--domain` | `train`, `test`, `evaluation`, or `all` |
74+
| `main_silver_data_prep.py` | `--train_datasets`/`--eval_datasets` | Hugging Face datasets produced by the raw prep stage |
75+
| `main_train.py` | `--model_name` | Base Whisper checkpoint (default `openai/whisper-small`) |
76+
| | `--apply_lora` | Enable/disable LoRA adapters |
77+
| | `--experiment_name` | MLflow experiment name; auto-generated if omitted |
78+
79+
Full option lists live in each script’s `--help` output.
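
The commands in this README pass `--apply_lora True` as a string. The sketch below shows one way such a flag could be parsed with `argparse`. It is not the actual parser in `main_train.py`, which this diff does not show: `str_to_bool` and `build_parser` are hypothetical names, and the `--apply_lora` default of `True` is taken from the previous README's argument list.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert 'True'/'False'-style strings to a real boolean.

    Using argparse's type=bool directly would treat any non-empty
    string, including 'False', as True.
    """
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in covering only the flags listed in the table above.
    parser = argparse.ArgumentParser(description="Sketch of a main_train.py-style CLI")
    parser.add_argument("--model_name", default="openai/whisper-small",
                        help="Base Whisper checkpoint")
    parser.add_argument("--apply_lora", type=str_to_bool, default=True,
                        help="Enable/disable LoRA adapters")
    parser.add_argument("--experiment_name", default=None,
                        help="MLflow experiment name; None means auto-generate")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model_name", "openai/whisper-large-v2", "--apply_lora", "True"]
    )
    print(args)
```

For the real option list and defaults, run each script with `--help`.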
80+
81+
## MLflow & Logging
82+
- Runs log configuration, metrics, and console output (see `training_console.log`).
83+
- Checkpoints are written to `output_model_dir/` and can be registered with MLflow or uploaded to Azure.
84+
85+
## Pre-Commit Hooks
7286
```
73-
74-
If you want to load a lightweight model configuration, run the training command below. This command fine-tunes OpenAI's Whisper model with LoRA applied on your specified dataset:
75-
76-
```bash
77-
python deployment/training/job_train.py --model_name openai/whisper-large-v2 --dataset ./data/silver/dataset --apply_lora True --experiment_name "whisper-mayoruna-v2"
78-
87+
pip install pre-commit
88+
pre-commit install
89+
pre-commit run --all-files
7990
```
80-
81-
This approach helps optimize resource usage while enabling effective model adaptation.
82-
83-
84-
- All console and logging output will be saved as an MLflow artifact (`training_console.log`).
85-
86-
87-
## Model Evaluating
88-
89-
90-
```bash
91-
python src/core/evaluation/evaluation_process.py --is_public_repo False --ckpt_dir "output_model_dir" --temp_ckpt_folder "temp" --eval_datasets data/raw/testing --device 0 --batch_size 16 --output_dir predictions_dir
92-
```
93-
94-
95-
python deployment/training/job_train_nemo.py \
96-
--dataset_path data/silver/dataset \
97-
--output_dir nemo_rnnt_da/ \
98-
99-
100-
python deployment/training/job_train_nemo.py \
101-
--dataset_path data/silver/dataset \
102-
--output_dir output_model_dir/nemo_rnnt_da \
103-
--num_workers 0
104-
105-
106-
### Azure ML Training
107-
1. **Configure Azure ML compute and workspace.**
108-
2. **Submit the job:**
109-
```bash
110-
az ml job create --file deployment/training_job.yaml
111-
```
112-
113-
## Arguments
114-
115-
### `job_data.py`
116-
- `--model_name`: Hugging Face model name (default: `openai/whisper-small`)
117-
- `--train_datasets`: List of training dataset paths (required)
118-
- `--eval_datasets`: List of evaluation dataset paths (optional)
119-
- `--sampling_rate`: Audio sampling rate (default: 16000)
120-
- `--num_proc`: Number of parallel jobs for data prep (default: 2)
121-
- `--output_dir`: Output directory for processed dataset
122-
123-
### `job_train.py`
124-
- `--model_name`: Hugging Face model name (default: `openai/whisper-small`)
125-
- `--dataset`: Path to processed dataset (required)
126-
- `--output_dir`: Output directory for checkpoints (default: `output_model_dir`)
127-
- `--apply_lora`: Whether to apply LoRA (default: True)
128-
- `--language`: Language for adaptation (default: Hindi)
129-
- `--sampling_rate`: Audio sampling rate (default: 16000)
130-
- `--num_proc`: Number of parallel jobs for data prep (default: 2)
131-
- `--train_strategy`: Training strategy (`steps` or `epoch`, default: `steps`)
132-
133-
## Customization
134-
- Modify `src/core/config.py` to adjust training and LoRA parameters.
135-
- Update `deployment/environment.yml` to add/remove dependencies.
136-
- Edit `deployment/training_job.yaml` for Azure ML compute, environment, and output settings.
137-
138-
## Outputs
139-
- Trained model checkpoints in the specified output directory
140-
- MLflow experiment logs (including all console logs as artifacts)
141-
- Optionally, model artifacts pushed to Hugging Face Hub
142-
143-
## Inference Example
144-
145-
See `notebooks/fine_tuned_usage.ipynb` for detailed usage.
146-
To load and use a model logged with MLflow:
147-
```python
148-
import mlflow.transformers
149-
150-
model_uri = "models:/whisper-fine-tuned/1" # or your own model URI
151-
pipe = mlflow.transformers.load_model(model_uri)
152-
result = pipe("path/to/audio/file.wav")
153-
print(result["text"])
154-
```
155-
Or, using the pyfunc interface:
156-
```python
157-
import mlflow.pyfunc
158-
import pandas as pd
159-
160-
loaded_model = mlflow.pyfunc.load_model(model_uri)
161-
df = pd.DataFrame({"inputs": ["path/to/audio/file.wav"]})
162-
result = loaded_model.predict(df)
163-
print(result)
164-
```
165-
166-
167-
## Steps to Configure Pre-commit
168-
169-
1. Install pre-commit (if not already installed):
170-
171-
```bash
172-
pip install pre-commit
173-
```
174-
175-
2. Use the `.pre-commit-config.yaml` file in the root of the repository
176-
177-
3. Install the pre-commit hooks:
178-
179-
```bash
180-
pre-commit install
181-
```
182-
183-
4. Run the hooks manually on all files to ensure consistency:
184-
185-
```bash
186-
pre-commit run --all-files
187-
```
188-
189-
Integrating pre-commit into your workflow helps maintain consistent code standards and prevents common issues before they enter your codebase.
190-
91+
Hooks enforce formatting and linting before changes land in version control.
19192

19293
## Troubleshooting
193-
- Ensure all dependencies in `environment.yml` are installed.
194-
- Check dataset format and paths.
195-
- Review Azure ML compute and environment configuration if running in the cloud.
94+
- Verify that the dependencies listed in `environment.yml` or `pyproject.toml` are installed.
95+
- Ensure datasets follow the expected directory structure (`data/raw/audios/...`).
96+
- For Azure ML issues, confirm workspace credentials and compute targets.
97+
98+
## Fine-Tuning Tips
99+
| Parameter | Recommended | Why |
100+
| --- | --- | --- |
101+
| `per_device_train_batch_size` | 4–8 | Small datasets benefit from smaller batches |
102+
| `num_train_epochs` | 10–30 | Compensate for limited data with more passes |
103+
| `warmup_steps` | 50–100 | A small dataset yields few total steps, so a large fixed warmup (e.g. 20k) would never complete |
104+
| `eval_strategy` | `epoch` | Evaluate once per epoch to avoid noise |
105+
| `save_strategy` | `epoch` | Align checkpointing with evaluation |
106+
| `load_best_model_at_end` | `True` | Automatically keep the best-performing checkpoint |
107+
108+
Additional guidance on data augmentation, freezing layers, early stopping, and MLflow logging is summarized in `src/core/train/README.md`.
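
To make the warmup recommendation concrete, here is a minimal arithmetic sketch. It assumes a single device and the 30-sample scenario mentioned in the previous README's tips table. `TRAINING_KWARGS` is an illustrative dictionary using keys from the table above, not the project's actual configuration, and the helper functions are hypothetical.

```python
import math

# Values picked from the recommended ranges in the table (illustrative only).
TRAINING_KWARGS = {
    "per_device_train_batch_size": 4,
    "num_train_epochs": 20,
    "warmup_steps": 50,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
}


def total_optimizer_steps(num_samples: int, batch_size: int, epochs: int,
                          grad_accum: int = 1) -> int:
    """Optimizer steps for a single-device run (the last partial batch is kept)."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


def warmup_fraction(warmup_steps: int, total_steps: int) -> float:
    """Share of training spent in warmup; a value above 1 means warmup never finishes."""
    return warmup_steps / total_steps


if __name__ == "__main__":
    total = total_optimizer_steps(
        num_samples=30,
        batch_size=TRAINING_KWARGS["per_device_train_batch_size"],
        epochs=TRAINING_KWARGS["num_train_epochs"],
    )
    print("total steps:", total)
    print("warmup share at 50 steps:", warmup_fraction(50, total))
    print("warmup share at 20000 steps:", warmup_fraction(20_000, total))
```

With these numbers the run has 160 optimizer steps, so a 20k-step warmup would leave the learning rate ramping for the entire run, while 50 steps covers roughly the first third.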
196109

197110
## License
198-
This project is licensed under the MIT License.
199-
200-
## Contact
201-
For questions or support, please open an issue or contact the repository maintainer.
202-
203-
204-
## Tips current scenario
205-
| Parameter | Suggested | Reason |
206-
|--------------------------------|------------------------|--------------------------------------------------------------------------------------------------|
207-
| per_device_train_batch_size | 4–8 | With only 30 samples, large batches are unnecessary and may cause instability. |
208-
| gradient_accumulation_steps | 1–4 | Increase if you want to simulate larger batches without memory overhead. |
209-
| num_train_epochs | 10–30 | More epochs help the model learn from limited data. Use early stopping or monitor validation loss. |
210-
| max_steps | Remove or set to -1 | Let training be driven by num_train_epochs to avoid premature stopping. |
211-
| warmup_steps | 50–100 | 20k is too high for such a small dataset. |
212-
| learning_rate | 1e-5 to 3e-5 | Your current value is reasonable, but monitor for overfitting. |
213-
| eval_strategy | "epoch" | With few samples, evaluating per epoch is more meaningful. |
214-
| save_strategy | "epoch" | Align with evaluation to save meaningful checkpoints. |
215-
| fp16 | ✅ | Keep if your hardware supports it. |
216-
| gradient_checkpointing | ✅ | Helps with memory efficiency. |
217-
| optim | "adamw_bnb_8bit" | Good choice for memory-constrained environments. |
218-
| load_best_model_at_end | ✅ | Helps select the best checkpoint. |
219-
| predict_with_generate | ✅ | Needed for Whisper’s transcription tasks. |
220-
221-
---
222-
223-
### 🧠 Additional Tips
224-
225-
- **Use Data Augmentation**: Consider adding noise, pitch shift, or speed variation to artificially expand your dataset.
226-
- **Freeze Most Layers**: Fine-tune only the final layers or adapters to reduce overfitting.
227-
- **Use Early Stopping**: Monitor validation loss and stop training when it plateaus.
228-
- **Log Carefully**: Avoid logging too many parameters to MLflow to prevent the `INVALID_PARAMETER_VALUE` error.
111+
Licensed under the MIT License. Contributions and issues are welcome.

apps/whisper_fine_tuning/src/core/data_prep/README.md

Lines changed: 10 additions & 10 deletions
@@ -53,25 +53,25 @@ Run the data preparation pipeline from the project root to create domain-specifi
5353

5454
**Training Data:**
5555
```bash
56-
python src/core/data_prep/data_prep.py \
57-
--source_data_dir data/raw/audios/matis \
58-
--output_data_dir data/raw/samples/training \
56+
python apps/whisper_fine_tuning/src/core/data_prep/main_data_prep.py \
57+
--source_data_dir apps/whisper_fine_tuning/data/raw/audios/matis \
58+
--output_data_dir apps/whisper_fine_tuning/data/raw/samples/training \
5959
--domain train
6060
```
6161

6262
**Testing Data:**
6363
```bash
64-
python src/core/data_prep/data_prep.py \
65-
--source_data_dir data/raw/audios/matis \
66-
--output_data_dir data/raw/samples/testing \
64+
python apps/whisper_fine_tuning/src/core/data_prep/main_data_prep.py \
65+
--source_data_dir apps/whisper_fine_tuning/data/raw/audios/matis \
66+
--output_data_dir apps/whisper_fine_tuning/data/raw/samples/testing \
6767
--domain test
6868
```
6969

7070
**Evaluation Data:**
7171
```bash
72-
python src/core/data_prep/data_prep.py \
73-
--source_data_dir data/raw/audios/matis \
74-
--output_data_dir data/raw/samples/evaluation \
72+
python apps/whisper_fine_tuning/src/core/data_prep/main_data_prep.py \
73+
--source_data_dir apps/whisper_fine_tuning/data/raw/audios/matis \
74+
--output_data_dir apps/whisper_fine_tuning/data/raw/samples/evaluation \
7575
--domain evaluation
7676
```
7777

@@ -81,7 +81,7 @@ python src/core/data_prep/data_prep.py \
8181
After preparing your datasets, run the complete pipeline locally:
8282

8383
```bash
84-
python src/core/data_prep/main_silver_data_prep.py --train_datasets data/raw/samples/training --eval_datasets data/raw/samples/evaluation --test_datasets data/raw/samples/testing
84+
python apps/whisper_fine_tuning/src/core/data_prep/main_silver_data_prep.py --train_datasets apps/whisper_fine_tuning/data/raw/samples/training --eval_datasets apps/whisper_fine_tuning/data/raw/samples/evaluation --test_datasets apps/whisper_fine_tuning/data/raw/samples/testing
8585
```
8686

8787
This command will use the prepared datasets for the complete Whisper fine-tuning workflow.
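
The three per-domain commands in this README differ only in the `--domain` value and the output folder. As a sketch, the helper below builds those command lines from one mapping. The script path and flags are copied from the updated commands, while `build_data_prep_command` and `DOMAIN_TO_FOLDER` are hypothetical names introduced for illustration.

```python
import shlex

# Paths as written in the updated README commands (relative to the repo root).
SCRIPT = "apps/whisper_fine_tuning/src/core/data_prep/main_data_prep.py"
SOURCE_DIR = "apps/whisper_fine_tuning/data/raw/audios/matis"
OUTPUT_ROOT = "apps/whisper_fine_tuning/data/raw/samples"

# Each --domain value writes to a differently named folder.
DOMAIN_TO_FOLDER = {
    "train": "training",
    "test": "testing",
    "evaluation": "evaluation",
}


def build_data_prep_command(domain: str) -> list[str]:
    """Return the argv list for one domain; raises KeyError for unknown domains."""
    folder = DOMAIN_TO_FOLDER[domain]
    return [
        "python", SCRIPT,
        "--source_data_dir", SOURCE_DIR,
        "--output_data_dir", f"{OUTPUT_ROOT}/{folder}",
        "--domain", domain,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for domain in DOMAIN_TO_FOLDER:
        print(shlex.join(build_data_prep_command(domain)))
```

To execute a command rather than print it, the argv list can be passed to `subprocess.run(cmd, check=True)` from the repository root.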

apps/whisper_fine_tuning/src/core/train/README.md

Lines changed: 5 additions & 2 deletions
@@ -30,14 +30,17 @@ core/train/
3030
```
3131
2. Launch training with the primary CLI:
3232
```bash
33-
python src/core/train/main_train.py \
33+
python apps/whisper_fine_tuning/src/core/train/main_train.py \
3434
--model_name openai/whisper-large-v2 \
35-
--dataset data/silver/dataset \
35+
--dataset apps/whisper_fine_tuning/data/silver/dataset \
3636
--output_dir output_model_dir \
3737
--apply_lora True
3838
```
3939
3. Monitor MLflow for run metadata, metrics, and artifacts.
4040

41+
42+
- All console and logging output will be saved as an MLflow artifact (`training_console.log`).
43+
4144
## CLI Overview (`main_train.py`)
4245
- `--model_name`: Hugging Face identifier for the base Whisper model.
4346
- `--dataset`: Path to the on-disk Hugging Face dataset produced during data prep.
