This project focuses on fine-tuning a Large Language Model (LLM) to specialize in answering programming-related questions, particularly those pertaining to Python, Django, and Docker. It demonstrates the complete workflow from data acquisition and preparation to fine-tuning with Parameter-Efficient Fine-Tuning (PEFT) using LoRA, and finally, local deployment and integration with the Hugging Face Hub.
- Project Overview
- Model Details
- Data Acquisition and Dataset
- Setup and Installation
- License & Attribution
## Project Overview

The primary objective is to enhance a general-purpose LLM's performance and relevance for specific technical queries by fine-tuning it on domain-specific data. This project covers:
- Data Acquisition: Programmatic collection of Q&A data from Stack Overflow.
- Data Preparation: Cleaning, filtering, and formatting the raw data into a suitable format for LLM fine-tuning.
- Fine-tuning: Applying LoRA (Low-Rank Adaptation) to a base LLM for efficient domain adaptation.
- Local Deployment: Setting up a FastAPI server and a Gradio UI for local interaction with the fine-tuned model.
- Hugging Face Integration: Preparing the model for upload to the Hugging Face Hub for broader accessibility.
## Model Details

- **Base Model:** `mistralai/Mistral-7B-Instruct-v0.2`, a powerful instruction-tuned model known for its strong performance across various tasks.
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation), a Parameter-Efficient Fine-Tuning (PEFT) technique. LoRA allows for efficient adaptation of large models by injecting small, trainable rank-decomposition matrices into existing layers, significantly reducing the number of trainable parameters and the computational cost.
- **Quantization:** The base model is loaded in 4-bit precision using `bitsandbytes` (`nf4` quantization type, `bfloat16` compute dtype, with double quantization enabled) to further reduce the memory footprint during fine-tuning. A configuration sketch is shown below.
- **Fine-tuned Model Output:** The resulting LoRA adapters are saved to `data_preparation/fine_tuning_data/fine_tuned_model/`. For inference, these adapters are merged with the base model.
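A minimal sketch of what this 4-bit quantization plus LoRA setup can look like with `transformers` and `peft` is shown below. The quantization settings match the description above; the LoRA rank, alpha, and target modules are illustrative assumptions and not necessarily the values used in `finetune_llm.py`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit quantization as described above: nf4 type, bfloat16 compute dtype, double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Illustrative LoRA configuration; the actual rank/alpha/target modules may differ
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of weights are trainable
```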
## Data Acquisition and Dataset

The quality and relevance of the fine-tuning data are paramount for a specialized LLM.

- **Data Source:** The entire dataset used for fine-tuning was acquired programmatically from the Stack Overflow API. This approach ensures adherence to Stack Exchange's terms of service and licensing.
- **Data Processing:**
  - The `data_preparation/stack_exchange_api_acquisition.py` script was used to fetch relevant questions and answers from the Stack Overflow API based on specific tags (e.g., `python`, `django`, `docker`).
  - The fetched data was then processed to extract question titles, bodies, and accepted answers, and formatted into conversational pairs suitable for instruction fine-tuning.
  - **Anonymization:** The data obtained from the Stack Overflow API is already anonymized, meaning personal user information is not included.
- **Final Dataset:** The prepared dataset is stored as `data_preparation/fine_tuning_data/combined_qa_dataset.jsonl`. Each line in this JSONL file represents a single training example, structured as a list of message dictionaries compatible with the tokenizer's `apply_chat_template` method (an example record is sketched below).
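The sketch below shows what one line of `combined_qa_dataset.jsonl` might look like. The question and answer text are invented, and the exact schema (here, a top-level `messages` key wrapping the list) is an assumption; check the actual file for the precise layout.

```python
import json

# Hypothetical training example; each JSONL line holds one record of this shape
example = {
    "messages": [
        {"role": "user", "content": "How do I add a ForeignKey to an existing Django model?"},
        {"role": "assistant", "content": "Add the field with null=True, run makemigrations, then migrate."},
    ]
}

print(json.dumps(example))  # one line of the JSONL file

# During training, the message list can be rendered with the tokenizer's chat template:
# tokenizer.apply_chat_template(example["messages"], tokenize=False)
```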
## Setup and Installation

To run this project locally, follow these steps:

1. **Clone the repository:**

   ```bash
   git clone https://github.com/osmarbetancourt/osmar-generative-ai
   cd osmar-generative-ai/text_generation_model/
   ```

2. **Create and activate a Conda environment:**

   ```bash
   conda create -n ai_dev_env python=3.10
   conda activate ai_dev_env
   ```

3. **Install PyTorch with CUDA support:** IMPORTANT: This step is critical for GPU acceleration and `bitsandbytes` compatibility. The exact command depends on your CUDA version. For CUDA 12.1, use:

   ```bash
   pip install torch==2.3.0+cu121 torchvision==0.18.0+cu121 torchaudio==2.3.0+cu121 --index-url https://download.pytorch.org/whl/cu121
   ```

   (Adjust `cu121` if your CUDA version is different. Refer to PyTorch's official website for the correct command for your system.)

4. **Install other dependencies:** Create a `requirements.txt` file in the `text_generation_model/` directory with the following content:

   ```
   transformers
   tokenizers
   peft==0.10.0
   accelerate
   datasets==2.20.0
   trl==0.8.6
   numpy==1.26.4
   fsspec==2024.5.0
   fastapi==0.111.0
   uvicorn==0.30.1
   gradio==4.37.1
   ```

   Then install them:

   ```bash
   pip install -r requirements.txt
   ```

   Note: If you encounter specific issues with `bitsandbytes` or `accelerate` during installation, a manual installation as suggested by their respective documentation (e.g., `pip install -i https://pypi.org/simple/ bitsandbytes`) might be necessary.

5. **Hugging Face CLI Login (for model access and upload):** This is required to download some models and to upload your fine-tuned model to the Hugging Face Hub.

   ```bash
   huggingface-cli login
   ```

   Follow the prompts to enter your Hugging Face access token (you can generate one in your Hugging Face settings under "Access Tokens").
The `finetune_llm.py` script orchestrates the entire fine-tuning process.

To start fine-tuning:

```bash
python finetune_llm.py
```
- The script will first download the base `Mistral-7B-Instruct-v0.2` model and its tokenizer (if not already cached locally).
- It will then load and prepare your `combined_qa_dataset.jsonl`.
- LoRA adapters are applied to the base model, and the training loop commences.
- Upon successful completion, the fine-tuned LoRA adapters and the tokenizer will be saved to the `data_preparation/fine_tuning_data/fine_tuned_model/` directory.
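Under the hood, a run like this can be assembled with `trl`'s `SFTTrainer` (pinned as `trl==0.8.6` in `requirements.txt`). The sketch below is not the actual contents of `finetune_llm.py`: the training arguments, sequence length, and formatting function are illustrative assumptions. It reuses `tokenizer` and `lora_config` from the earlier sketch and expects `model` to be the 4-bit quantized base model (when `peft_config` is passed, `SFTTrainer` attaches the LoRA adapters itself).

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the prepared JSONL dataset (one chat-formatted example per line)
dataset = load_dataset(
    "json",
    data_files="data_preparation/fine_tuning_data/combined_qa_dataset.jsonl",
    split="train",
)

def to_text(batch):
    # Render each list of chat messages into a single training string via the chat template
    return [tokenizer.apply_chat_template(msgs, tokenize=False) for msgs in batch["messages"]]

trainer = SFTTrainer(
    model=model,                      # 4-bit quantized base model; SFTTrainer applies lora_config
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
    formatting_func=to_text,
    max_seq_length=1024,
    args=TrainingArguments(
        output_dir="data_preparation/fine_tuning_data/fine_tuned_model/",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
)

trainer.train()
trainer.save_model("data_preparation/fine_tuning_data/fine_tuned_model/")
```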
The `finetuned_inference.py` script provides a direct way to test your fine-tuned model's capabilities from the command line.

To run inference:

```bash
python finetuned_inference.py
```
- This script loads the base model, then loads your saved LoRA adapters, and finally merges them for efficient inference.
- It includes a few predefined test prompts (e.g., Django, Python, Docker questions) and will print the generated responses directly to your terminal. This allows for a quick qualitative assessment of the fine-tuning's effectiveness.
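For reference, the load-merge-generate flow described above typically looks like the following sketch. The paths and the test prompt are illustrative; see `finetuned_inference.py` for the actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "mistralai/Mistral-7B-Instruct-v0.2"
ADAPTER_DIR = "data_preparation/fine_tuning_data/fine_tuned_model/"

# Load the base model, attach the saved LoRA adapters, then merge them for faster inference
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

messages = [{"role": "user", "content": "How do I connect a Django app to PostgreSQL running in Docker?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```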
The `api_server.py` script sets up a lightweight FastAPI application to serve your fine-tuned LLM as a local API endpoint. This API can then be consumed by other applications, such as the Gradio UI.

To start the FastAPI server:

```bash
uvicorn api_server:app --host 0.0.0.0 --port 8001 --reload
```
- When the server starts, the base model and your fine-tuned LoRA adapters will be loaded into GPU memory. This process can take a few moments.
- The API endpoint for text generation will be `http://localhost:8001/generate` (accepts POST requests with a `prompt` field).
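Once the server is up, the endpoint can be exercised with a short script like the one below (or an equivalent `curl` call). The exact shape of the JSON response depends on how `api_server.py` defines it, so the snippet simply prints the raw payload.

```python
import requests

resp = requests.post(
    "http://localhost:8001/generate",
    json={"prompt": "How do I run database migrations for a Django app inside a Docker container?"},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())  # response schema is defined by api_server.py; printed as-is here
```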
The `gradio_app.py` script creates a simple, interactive web interface using Gradio, allowing for easy interaction with your FastAPI server.

To launch the Gradio app (ensure the FastAPI server from the previous step is already running in a separate terminal):

```bash
python gradio_app.py
```
- Gradio will provide a local URL (typically `http://127.0.0.1:7860/`) which you can open in your web browser.
- You can then type your programming questions into the text box and submit them to receive responses from your locally served, fine-tuned LLM.
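Conceptually, the Gradio front end only needs a callback that forwards the question to the FastAPI endpoint, roughly as in the sketch below. This is a simplified stand-in for `gradio_app.py`, and the way the answer is pulled out of the JSON response is an assumption.

```python
import gradio as gr
import requests

API_URL = "http://localhost:8001/generate"  # local FastAPI endpoint from the previous step

def ask(prompt: str) -> str:
    # Forward the question to the FastAPI server and return its JSON payload as text
    resp = requests.post(API_URL, json={"prompt": prompt}, timeout=300)
    resp.raise_for_status()
    return str(resp.json())

demo = gr.Interface(
    fn=ask,
    inputs=gr.Textbox(label="Your programming question", lines=4),
    outputs=gr.Textbox(label="Model response"),
    title="Fine-tuned Mistral Q&A",
)
demo.launch()  # serves on http://127.0.0.1:7860/ by default
```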
To make your fine-tuned model publicly accessible and easily shareable, you can upload it to the Hugging Face Hub.
1. **Ensure you are logged in to the Hugging Face CLI** (as per "Setup and Installation" step 5).

2. **Edit `upload_model_to_hf.py`:**
   - Open the `upload_model_to_hf.py` script.
   - Crucially, update the `HF_REPO_ID` variable: replace `"YourUserName/fine-tuned-mistral-django-qa"` with your actual Hugging Face username and your desired repository name (e.g., `"Osmar/fine-tuned-mistral-django-qa"`).

3. **Run the upload script:**

   ```bash
   python upload_model_to_hf.py
   ```
   This script will:
   - Load the base model and merge your LoRA adapters into it.
   - Push the complete, merged model and its tokenizer to your specified Hugging Face repository.

   (A sketch of this merge-and-push flow is shown after these steps.)
4. **Enhance the model card on the Hugging Face Hub:** After a successful upload, visit your new model repository on the Hugging Face Hub (the script will print the direct URL). Click the "Edit file" button on the `README.md` to enrich it with a comprehensive model card. This should include:
   - A detailed description of the model and its specific purpose.
   - Information about the base model, fine-tuning methodology, and the dataset used.
   - Clear statements on intended uses and any known limitations.
   - An example code snippet demonstrating how others can load and use your model for inference.
   - Relevant tags for discoverability and licensing information.
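The merge-and-push flow performed by the upload script (step 3 above) corresponds roughly to the sketch below. The repository id is a placeholder, and the actual `upload_model_to_hf.py` may differ in its details.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

HF_REPO_ID = "YourUserName/fine-tuned-mistral-django-qa"  # replace with your own repo id
ADAPTER_DIR = "data_preparation/fine_tuning_data/fine_tuned_model/"

# Merge the LoRA adapters into the base model to obtain a standalone checkpoint
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

# Push the merged model and tokenizer to the Hugging Face Hub (requires huggingface-cli login)
merged.push_to_hub(HF_REPO_ID)
tokenizer.push_to_hub(HF_REPO_ID)
```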
## License & Attribution

The source code in this repository is licensed under the MIT License.
The dataset used for fine-tuning is derived from Stack Overflow content, which is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). When using this model or the derived data, please ensure proper attribution to Stack Overflow as the original source.