This project focuses on fine-tuning a Large Language Model (LLM) to specialize in answering programming-related questions, particularly those pertaining to Python, Django, and Docker. It demonstrates the complete workflow from data acquisition and preparation to fine-tuning with Parameter-Efficient Fine-Tuning (PEFT) using LoRA, and finally, local deployment and integration with the Hugging Face Hub.
- Project Overview
- Model Details
- Data Acquisition and Dataset
- Setup and Installation
- License & Attribution
## Project Overview

The primary objective is to enhance a general-purpose LLM's performance and relevance for specific technical queries by fine-tuning it on domain-specific data. This project covers:
- Data Acquisition: Programmatic collection of Q&A data from Stack Overflow.
- Data Preparation: Cleaning, filtering, and formatting the raw data into a suitable format for LLM fine-tuning.
- Fine-tuning: Applying LoRA (Low-Rank Adaptation) to a base LLM for efficient domain adaptation.
- Local Deployment: Setting up a FastAPI server and a Gradio UI for local interaction with the fine-tuned model.
- Hugging Face Integration: Preparing the model for upload to the Hugging Face Hub for broader accessibility.
## Model Details

- **Base Model:** `mistralai/Mistral-7B-Instruct-v0.2`, a powerful instruction-tuned model known for its strong performance across various tasks.
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation), a Parameter-Efficient Fine-Tuning (PEFT) technique. LoRA allows for efficient adaptation of large models by injecting small, trainable rank-decomposition matrices into existing layers, significantly reducing the number of trainable parameters and the computational cost.
- **Quantization:** The base model is loaded in 4-bit precision using `bitsandbytes` (`nf4` quantization type, `bfloat16` compute dtype, with double quantization enabled) to further reduce the memory footprint during fine-tuning. A configuration sketch is shown below.
- **Fine-tuned Model Output:** The resulting LoRA adapters are saved to `data_preparation/fine_tuning_data/fine_tuned_model/`. For inference, these adapters are merged with the base model.
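A minimal sketch of what this 4-bit quantization plus LoRA setup can look like with `transformers` and `peft` is shown below. The quantization settings match the description above; the LoRA rank, alpha, and target modules are illustrative assumptions and not necessarily the values used in `finetune_llm.py`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit quantization as described above: nf4 type, bfloat16 compute dtype, double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Illustrative LoRA configuration; the actual rank/alpha/target modules may differ
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of weights are trainable
```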
## Data Acquisition and Dataset

The quality and relevance of the fine-tuning data are paramount for a specialized LLM.

- **Data Source:** The entire dataset used for fine-tuning was acquired programmatically from the Stack Overflow API. This approach ensures adherence to Stack Exchange's terms of service and licensing.
- **Data Processing:**
  - The `data_preparation/stack_exchange_api_acquisition.py` script was used to fetch relevant questions and answers from the Stack Overflow API based on specific tags (e.g., `python`, `django`, `docker`).
  - The fetched data was then processed to extract question titles, bodies, and accepted answers, and formatted into conversational pairs suitable for instruction fine-tuning.
  - **Anonymization:** The data obtained from the Stack Overflow API is already anonymized, meaning personal user information is not included.
- **Final Dataset:** The prepared dataset is stored as `data_preparation/fine_tuning_data/combined_qa_dataset.jsonl`. Each line in this JSONL file represents a single training example, structured as a list of message dictionaries compatible with the tokenizer's `apply_chat_template` method (an example record is sketched below).
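The sketch below shows what one line of `combined_qa_dataset.jsonl` might look like. The question and answer text are invented, and the exact schema (here, a top-level `messages` key wrapping the list) is an assumption; check the actual file for the precise layout.

```python
import json

# Hypothetical training example; each JSONL line holds one record of this shape
example = {
    "messages": [
        {"role": "user", "content": "How do I add a ForeignKey to an existing Django model?"},
        {"role": "assistant", "content": "Add the field with null=True, run makemigrations, then migrate."},
    ]
}

print(json.dumps(example))  # one line of the JSONL file

# During training, the message list can be rendered with the tokenizer's chat template:
# tokenizer.apply_chat_template(example["messages"], tokenize=False)
```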
## Setup and Installation

To run this project locally, follow these steps:

1. **Clone the repository:**

   ```bash
   git clone https://github.com/osmarbetancourt/osmar-generative-ai
   cd osmar-generative-ai/text_generation_model/
   ```

2. **Create and activate a Conda environment:**

   ```bash
   conda create -n ai_dev_env python=3.10
   conda activate ai_dev_env
   ```

3. **Install PyTorch with CUDA support:** IMPORTANT: This step is critical for GPU acceleration and `bitsandbytes` compatibility. The exact command depends on your CUDA version. For CUDA 12.1, use:

   ```bash
   pip install torch==2.3.0+cu121 torchvision==0.18.0+cu121 torchaudio==2.3.0+cu121 --index-url https://download.pytorch.org/whl/cu121
   ```

   (Adjust `cu121` if your CUDA version is different. Refer to PyTorch's official website for the correct command for your system.)

4. **Install other dependencies:** Create a `requirements.txt` file in the `text_generation_model/` directory with the following content:

   ```
   transformers
   tokenizers
   peft==0.10.0
   accelerate
   datasets==2.20.0
   trl==0.8.6
   numpy==1.26.4
   fsspec==2024.5.0
   fastapi==0.111.0
   uvicorn==0.30.1
   gradio==4.37.1
   ```

   Then install them:

   ```bash
   pip install -r requirements.txt
   ```

   Note: If you encounter specific issues with `bitsandbytes` or `accelerate` during installation, a manual installation as suggested by their respective documentation (e.g., `pip install -i https://pypi.org/simple/ bitsandbytes`) might be necessary.

5. **Hugging Face CLI Login (for model access and upload):** This is required to download some models and to upload your fine-tuned model to the Hugging Face Hub.

   ```bash
   huggingface-cli login
   ```

   Follow the prompts to enter your Hugging Face access token (you can generate one in your Hugging Face settings under "Access Tokens").
The `finetune_llm.py` script orchestrates the entire fine-tuning process.

To start fine-tuning:

```bash
python finetune_llm.py
```
- The script will first download the base `Mistral-7B-Instruct-v0.2` model and its tokenizer (if not already cached locally).
- It will then load and prepare your `combined_qa_dataset.jsonl`.
- LoRA adapters are applied to the base model, and the training loop commences.
- Upon successful completion, the fine-tuned LoRA adapters and the tokenizer will be saved to the `data_preparation/fine_tuning_data/fine_tuned_model/` directory.
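Under the hood, a run like this can be assembled with `trl`'s `SFTTrainer` (pinned as `trl==0.8.6` in `requirements.txt`). The sketch below is not the actual contents of `finetune_llm.py`: the training arguments, sequence length, and formatting function are illustrative assumptions. It reuses `tokenizer` and `lora_config` from the earlier sketch and expects `model` to be the 4-bit quantized base model (when `peft_config` is passed, `SFTTrainer` attaches the LoRA adapters itself).

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the prepared JSONL dataset (one chat-formatted example per line)
dataset = load_dataset(
    "json",
    data_files="data_preparation/fine_tuning_data/combined_qa_dataset.jsonl",
    split="train",
)

def to_text(batch):
    # Render each list of chat messages into a single training string via the chat template
    return [tokenizer.apply_chat_template(msgs, tokenize=False) for msgs in batch["messages"]]

trainer = SFTTrainer(
    model=model,                      # 4-bit quantized base model; SFTTrainer applies lora_config
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
    formatting_func=to_text,
    max_seq_length=1024,
    args=TrainingArguments(
        output_dir="data_preparation/fine_tuning_data/fine_tuned_model/",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
)

trainer.train()
trainer.save_model("data_preparation/fine_tuning_data/fine_tuned_model/")
```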
The `finetuned_inference.py` script provides a direct way to test your fine-tuned model's capabilities from the command line.

To run inference:

```bash
python finetuned_inference.py
```
- This script loads the base model, then loads your saved LoRA adapters, and finally merges them for efficient inference.
- It includes a few predefined test prompts (e.g., Django, Python, Docker questions) and will print the generated responses directly to your terminal. This allows for a quick qualitative assessment of the fine-tuning's effectiveness.
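For reference, the load-merge-generate flow described above typically looks like the following sketch. The paths and the test prompt are illustrative; see `finetuned_inference.py` for the actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "mistralai/Mistral-7B-Instruct-v0.2"
ADAPTER_DIR = "data_preparation/fine_tuning_data/fine_tuned_model/"

# Load the base model, attach the saved LoRA adapters, then merge them for faster inference
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

messages = [{"role": "user", "content": "How do I connect a Django app to PostgreSQL running in Docker?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```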
The `api_server.py` script sets up a lightweight FastAPI application to serve your fine-tuned LLM as a local API endpoint. This API can then be consumed by other applications, such as the Gradio UI.

To start the FastAPI server:

```bash
uvicorn api_server:app --host 0.0.0.0 --port 8001 --reload
```
- When the server starts, the base model and your fine-tuned LoRA adapters will be loaded into GPU memory. This process can take a few moments.
- The API endpoint for text generation will be `http://localhost:8001/generate` (accepts POST requests with a `prompt` field).
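Once the server is up, the endpoint can be exercised with a short script like the one below (or an equivalent `curl` call). The exact shape of the JSON response depends on how `api_server.py` defines it, so the snippet simply prints the raw payload.

```python
import requests

resp = requests.post(
    "http://localhost:8001/generate",
    json={"prompt": "How do I run database migrations for a Django app inside a Docker container?"},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())  # response schema is defined by api_server.py; printed as-is here
```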
The `gradio_app.py` script creates a simple, interactive web interface using Gradio, allowing for easy interaction with your FastAPI server.

To launch the Gradio app (ensure the FastAPI server from the previous step is already running in a separate terminal):

```bash
python gradio_app.py
```
- Gradio will provide a local URL (typically `http://127.0.0.1:7860/`) which you can open in your web browser.
- You can then type your programming questions into the text box and submit them to receive responses from your locally served, fine-tuned LLM.
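Conceptually, the Gradio front end only needs a callback that forwards the question to the FastAPI endpoint, roughly as in the sketch below. This is a simplified stand-in for `gradio_app.py`, and the way the answer is pulled out of the JSON response is an assumption.

```python
import gradio as gr
import requests

API_URL = "http://localhost:8001/generate"  # local FastAPI endpoint from the previous step

def ask(prompt: str) -> str:
    # Forward the question to the FastAPI server and return its JSON payload as text
    resp = requests.post(API_URL, json={"prompt": prompt}, timeout=300)
    resp.raise_for_status()
    return str(resp.json())

demo = gr.Interface(
    fn=ask,
    inputs=gr.Textbox(label="Your programming question", lines=4),
    outputs=gr.Textbox(label="Model response"),
    title="Fine-tuned Mistral Q&A",
)
demo.launch()  # serves on http://127.0.0.1:7860/ by default
```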
To make your fine-tuned model publicly accessible and easily shareable, you can upload it to the Hugging Face Hub.
1. **Ensure you are logged in to the Hugging Face CLI** (as per "Setup and Installation" step 5).

2. **Edit `upload_model_to_hf.py`:**
   - Open the `upload_model_to_hf.py` script.
   - Crucially, update the `HF_REPO_ID` variable: replace `"YourUserName/fine-tuned-mistral-django-qa"` with your actual Hugging Face username and your desired repository name (e.g., `"Osmar/fine-tuned-mistral-django-qa"`).

3. **Run the upload script:**

   ```bash
   python upload_model_to_hf.py
   ```
   This script will:
   - Load the base model and merge your LoRA adapters into it.
   - Push the complete, merged model and its tokenizer to your specified Hugging Face repository.

   (A sketch of this merge-and-push flow is shown after these steps.)
4. **Enhance the model card on the Hugging Face Hub:** After a successful upload, visit your new model repository on the Hugging Face Hub (the script will print the direct URL). Click the "Edit file" button on the `README.md` to enrich it with a comprehensive model card. This should include:
   - A detailed description of the model and its specific purpose.
   - Information about the base model, fine-tuning methodology, and the dataset used.
   - Clear statements on intended uses and any known limitations.
   - An example code snippet demonstrating how others can load and use your model for inference.
   - Relevant tags for discoverability and licensing information.
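The merge-and-push flow performed by the upload script (step 3 above) corresponds roughly to the sketch below. The repository id is a placeholder, and the actual `upload_model_to_hf.py` may differ in its details.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

HF_REPO_ID = "YourUserName/fine-tuned-mistral-django-qa"  # replace with your own repo id
ADAPTER_DIR = "data_preparation/fine_tuning_data/fine_tuned_model/"

# Merge the LoRA adapters into the base model to obtain a standalone checkpoint
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

# Push the merged model and tokenizer to the Hugging Face Hub (requires huggingface-cli login)
merged.push_to_hub(HF_REPO_ID)
tokenizer.push_to_hub(HF_REPO_ID)
```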
## License & Attribution

The source code in this repository is licensed under the MIT License.
The dataset used for fine-tuning is derived from Stack Overflow content, which is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). When using this model or the derived data, please ensure proper attribution to Stack Overflow as the original source.