# Whisper Fine-Tuning Pipeline

This project provides an end-to-end workflow for preparing custom audio datasets, fine-tuning OpenAI’s Whisper models with LoRA/PEFT, and operationalizing the resulting checkpoints on Azure ML.

## Highlights
- LoRA-based fine-tuning scripts that minimize GPU memory requirements.
- Data ingestion utilities that convert raw audio into Hugging Face `datasets` format.
- MLflow integration for experiment tracking and artifact storage.
- Azure ML job definitions for cloud-scale training and evaluation.
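
The LoRA path relies on Hugging Face PEFT. As a minimal sketch (the rank, alpha, and target modules below are illustrative assumptions, not this project's exact settings), wrapping a Whisper checkpoint with adapters looks like this:

```python
# Sketch: attach LoRA adapters to Whisper via PEFT. Hyperparameters are
# illustrative assumptions, not this project's committed configuration.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

lora_config = LoraConfig(
    r=32,                                 # adapter rank (assumed)
    lora_alpha=64,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically ~1% of the full model
```

Only the adapter weights receive gradients, which is what keeps GPU memory requirements low.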

## Project Layout
```
apps/whisper_fine_tuning/
├── deployment/   # Azure ML job specs and training entrypoints
├── docs/         # Diagrams and supporting documentation
├── infra/        # Infrastructure-as-code templates
├── notebooks/    # Exploratory analysis and inference notebooks
├── src/core/     # Data prep, training, and evaluation modules
└── data/         # Example datasets (raw/silver) – ignored in git
```

## Quick Start
1. **Install dependencies**
   ```bash
   poetry env use 3.12
   poetry install
   ```
   Activate the environment with `poetry shell` or your preferred virtualenv tool.

2. **Prepare raw audio**
   ```bash
   python src/core/data_prep/main_data_prep.py \
       --source_data_dir data/raw/audios/matis \
       --output_data_dir data/raw/training \
       --domain train
   ```
   Repeat for `--domain evaluation` and `--domain test`. See `src/core/data_prep/README.md` for more options.

3. **Generate the silver dataset**
   ```bash
   python src/core/data_prep/main_silver_data_prep.py \
       --train_datasets data/raw/training \
       --eval_datasets data/raw/evaluation \
       --test_datasets data/raw/testing
   ```
   A quick sanity check of the resulting dataset is sketched right after this list.

4. **Train Whisper with LoRA**
   ```bash
   python src/core/train/main_train.py \
       --model_name openai/whisper-large-v2 \
       --dataset data/silver/dataset \
       --language Matis \
       --output_dir output_model_dir \
       --apply_lora True
   ```
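
Before launching training, it can help to verify the silver dataset loads cleanly. A minimal sketch, assuming the default output location and that the prepared data is a Hugging Face `DatasetDict` with a `train` split:

```python
# Sketch: inspect the prepared silver dataset. The path and split name
# are assumptions based on the defaults used above.
from datasets import load_from_disk

dataset = load_from_disk("data/silver/dataset")
print(dataset)                      # splits and row counts
print(dataset["train"][0].keys())   # expected feature columns
```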

## Running on Azure ML
Submit the packaged job using the provided YAML spec:
```bash
az ml job create --file deployment/training_job.yaml
```
Customize compute, environment, and inputs inside the YAML before submission.

## Evaluation & Inference
- Offline evaluation: `python src/core/evaluation/evaluation_process.py --eval_datasets data/raw/testing ...`
- NeMo experiments: `python src/core/train/main_train_nemo.py --dataset_path data/silver/dataset ...`
- Inference notebook: `notebooks/fine_tuned_usage.ipynb`
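
Models logged with MLflow can also be reloaded for transcription without the training stack. A minimal sketch, assuming a registered model named `whisper-fine-tuned` (substitute your own model URI):

```python
import mlflow.transformers

# Load the fine-tuned pipeline from MLflow; the registry name and
# version below are illustrative.
model_uri = "models:/whisper-fine-tuned/1"
pipe = mlflow.transformers.load_model(model_uri)

result = pipe("path/to/audio/file.wav")
print(result["text"])
```

The same artifact can be consumed through the generic `pyfunc` interface:

```python
import mlflow.pyfunc
import pandas as pd

# pyfunc wraps the model behind a uniform predict() API.
loaded_model = mlflow.pyfunc.load_model("models:/whisper-fine-tuned/1")
df = pd.DataFrame({"inputs": ["path/to/audio/file.wav"]})
print(loaded_model.predict(df))
```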

## Key CLI Arguments
| Script | Argument | Description |
| --- | --- | --- |
| `main_data_prep.py` | `--source_data_dir` | Folder with `audio_paths` and `text` mappings |
| | `--domain` | `train`, `test`, `evaluation`, or `all` |
| `main_silver_data_prep.py` | `--train_datasets`/`--eval_datasets` | Hugging Face datasets produced by the raw prep stage |
| `main_train.py` | `--model_name` | Base Whisper checkpoint (default `openai/whisper-small`) |
| | `--apply_lora` | Enable/disable LoRA adapters |
| | `--experiment_name` | MLflow experiment name; auto-generated if omitted |

Full option lists live in each script’s `--help` output.

## MLflow & Logging
- Runs log configuration, metrics, and console output (see `training_console.log`).
- Checkpoints are written to `output_model_dir/` and can be registered with MLflow or uploaded to Azure.
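
For example, a checkpoint logged during a run can be promoted to the MLflow Model Registry. A sketch with placeholder values (run ID, artifact path, and registered name are not project defaults):

```python
import mlflow

# Sketch: register a logged model. The run ID, artifact path, and
# registered name below are placeholders to replace with your own.
run_id = "abc123"  # your MLflow run ID
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="whisper-fine-tuned",
)
print(result.version)
```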

## Pre-Commit Hooks
```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```
Hooks enforce formatting and linting before changes land in version control.

## Troubleshooting
- Verify `environment.yml` or `pyproject.toml` dependencies are installed.
- Ensure datasets follow the expected directory structure (`data/raw/audios/...`).
- For Azure ML issues, confirm workspace credentials and compute targets.

## Fine-Tuning Tips
| Parameter | Recommended | Why |
| --- | --- | --- |
| `per_device_train_batch_size` | 4–8 | Small datasets benefit from smaller batches |
| `num_train_epochs` | 10–30 | Compensate for limited data with more passes |
| `warmup_steps` | 50–100 | Faster ramp-up than large fixed counts |
| `eval_strategy` | `epoch` | Evaluate once per epoch to avoid noise |
| `save_strategy` | `epoch` | Align checkpointing with evaluation |
| `load_best_model_at_end` | `True` | Automatically keep the best-performing checkpoint |

Additional guidance on data augmentation, freezing layers, early stopping, and MLflow logging is summarized in `src/core/train/README.md`.
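
As a rough sketch, those recommendations map onto Hugging Face `Seq2SeqTrainingArguments` along these lines (the concrete values are illustrative midpoints to adapt to your dataset, not the project's committed defaults):

```python
# Illustrative Seq2SeqTrainingArguments reflecting the tips above; the
# exact values are assumptions, not this project's defaults.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output_model_dir",
    per_device_train_batch_size=8,
    num_train_epochs=20,            # more passes over a small dataset
    warmup_steps=50,
    learning_rate=1e-5,
    eval_strategy="epoch",          # evaluate once per epoch
    save_strategy="epoch",          # align checkpoints with evaluation
    load_best_model_at_end=True,    # keep the best-performing checkpoint
    metric_for_best_model="eval_loss",
    fp16=True,                      # if your hardware supports it
    gradient_checkpointing=True,    # trade compute for memory
    predict_with_generate=True,     # needed for transcription metrics
)
```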

## License
Licensed under the MIT License. Contributions and issues are welcome.