Commit a43f713
Llm fine tune (#109)
* feat: πŸ’¬ Chatbot for demos
* feat: πŸ“š Dataset generation
* feat: ⏫ Dataset augmentation
* doc: πŸ“ add some comments
* fix: πŸ› JSON schema validation
* doc: πŸ“ update readme with dataset scripts
* feat: ✨ Notebook
* fix: πŸ› end command (back slash)
* fix: πŸ› path for dataset
* fix: πŸ› change framework naming
* conf: βš™οΈ fix dependencies versions
* docs: πŸ“ fix some typos
* fix: πŸ› fix some paths
* fix: πŸ› Some typos
* fix: πŸ› Remove explicit ID from W&B configuration
1 parent 50bbd38 commit a43f713

File tree

11 files changed: +722 βˆ’0 lines changed

11 files changed

+722
-0
lines changed

.gitignore

Lines changed: 4 additions & 0 deletions

@@ -55,3 +55,7 @@ kustomize
 containers-orchestration/managed-rancher/create-rancher-with-tf/variables.tf
 use-cases/create-and-use-object-storage-as-tf-backend/my-app/backend.tf
 use-cases/create-and-use-object-storage-as-tf-backend/object-storage-tf/variables.tf
+
+# LLM Fine Tune data
+ai/llm-fine-tune/dataset/docs/*
+ai/llm-fine-tune/dataset/generated/*

ai/llm-fine-tune/README.md

Lines changed: 65 additions & 0 deletions
# 🎯 What is the goal of this example? 🎯

This example shows how simple it is to fine-tune an LLM with the [axolotl](https://docs.axolotl.ai/) framework and OVHcloud [Machine Learning Services](https://www.ovhcloud.com/fr/public-cloud/ai-machine-learning/).

## πŸ“š Prerequisites πŸ“š

- An OVHcloud [Public Cloud project](https://help.ovhcloud.com/csm/en-ie-public-cloud-compute-essential-information?id=kb_article_view&sysparm_article=KB0050387)
- A valid OVHcloud [AI Endpoints API key](https://help.ovhcloud.com/csm/en-ie-public-cloud-ai-endpoints-getting-started?id=kb_article_view&sysparm_article=KB0065398) stored in an environment variable named `OVH_AI_ENDPOINTS_ACCESS_TOKEN`
- A valid AI Endpoints model URL stored in an environment variable named `OVH_AI_ENDPOINTS_MODEL_URL`
- A valid AI Endpoints model name stored in an environment variable named `OVH_AI_ENDPOINTS_MODEL_NAME`
- A [Hugging Face](https://huggingface.co/) account with a valid API key
- Optional:
  - a valid Python installation
  - a valid Docker installation
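
Before going further, you can sanity-check these variables from Python. This is a minimal sketch (not part of the repository) that reuses the `langchain_openai` client the dataset scripts are built on:

```python
# Minimal sketch: verify the AI Endpoints environment variables work.
# Assumes the three variables from the prerequisites are set and that
# the endpoint is OpenAI-compatible, as in the dataset scripts.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),
    model=os.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"),
)

# A single round trip confirms the token, URL and model name are valid.
print(llm.invoke("Say hello in one word.").content)
```
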
## πŸ’¬ The chatbot πŸ€–

To test the created models, you can use the chatbot in the [chatbot](./chatbot) folder.
**⚠️ It's a simple chatbot for testing purposes only, not for real production πŸ˜‰ ⚠️**

The chatbot is packaged with Docker and can be built with the provided [Dockerfile](./chatbot/Dockerfile): `cd ./chatbot && docker buildx build --platform="linux/amd64" -t <id>/fine-tune-chatbot:1.0.0 .`

You can run the chatbot using:
- your local Python installation: `cd ./chatbot && pip install -r requirements.txt && python chatbot.py`
- your local Docker installation: `cd ./chatbot && docker run -p 7860:7860 <id>/fine-tune-chatbot:1.0.0`
- [OVHcloud AI Deploy](https://www.ovhcloud.com/fr/public-cloud/ai-deploy/):

```bash
ovhai app run \
  --name fine-tune-chatbot \
  --cpu 1 \
  --default-http-port 7860 \
  --env OVH_AI_ENDPOINTS_ACCESS_TOKEN=$OVH_AI_ENDPOINTS_ACCESS_TOKEN \
  --unsecure-http \
  my-id/fine-tune-chatbot:1.0.0
```

You can then access the chatbot by navigating to `http://127.0.0.1:7860` or through the public URL provided by OVHcloud AI Deploy.
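
As a quick way to confirm the UI is reachable, here is a sketch assuming the default local address (swap in your AI Deploy URL for a remote check):

```python
# Reachability check for the Gradio UI (sketch, default local address).
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:7860", timeout=10) as resp:
    print(f"Chatbot is up, HTTP status: {resp.status}")
```
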
## πŸ“š The data generation πŸ“š

To train the model you need data.
The data is generated from the OVHcloud AI Endpoints [official documentation](https://help.ovhcloud.com/csm/en-gb-documentation-public-cloud-ai-and-machine-learning-ai-endpoints?id=kb_browse_cat&kb_id=574a8325551974502d4c6e78b7421938&kb_category=ea1d6daa918a1a541e11d3d71f8624aa&spa=1).

You have two Python scripts:
- one to generate a valid dataset from the markdown documentation: [DatasetCreation.py](./dataset/DatasetCreation.py)
- one to generate synthetic data from the previously generated dataset: [DatasetAugmentation.py](./dataset/DatasetAugmentation.py)

Once you have set the environment variables (see the Prerequisites section), you can run the scripts with Python: `python DatasetCreation.py`
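
Both scripts ask the model for the same structure: a `messages` list alternating `user` and `assistant` entries (see the prompts in the scripts). A generated file should therefore look roughly like this; the question and answer shown are illustrative only:

```json
{
  "messages": [
    {"role": "user", "content": "What is OVHcloud AI Endpoints?"},
    {"role": "assistant", "content": "AI Endpoints is the OVHcloud platform that exposes ready-to-use AI model APIs."}
  ]
}
```

Note that `DatasetCreation.py` reads markdown guides from `docs/pages/public_cloud/ai_machine_learning` inside the [dataset](./dataset) folder (the `docs` folder is git-ignored), presumably a local copy of the OVHcloud documentation repository, and writes its output to `generated/`, which `DatasetAugmentation.py` then reads.
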
## πŸ‹οΈβ€β™€οΈ Train the model πŸ‹
50+
51+
You have to create a notebook thanks to `ovhai` CLI:
52+
```bash
53+
ovhai notebook run conda jupyterlab \
54+
--name axolto-llm-fine-tune \
55+
--framework-version 25.3.1-py312-cudadevel128-gpu \
56+
--flavor l4-1-gpu \
57+
--gpu 1 \
58+
--volume https://github.com/ovh/public-cloud-examples.git:/workspace/public-cloud-examples:RW \
59+
--envvar HF_TOKEN=$MY_HF_TOKEN \
60+
--envvar WANDB_TOKEN=$MY_WANDB_TOKEN \
61+
--unsecure-http
62+
```
63+
64+
To train the model please follow the steps in the [notebook](./notebook/axolto-llm-fine-tune-Meta-Llama-3.2-1B-instruct-ai-endpoints.ipynb) provided in the [notebook](./notebook/) folder.
65+
You have to upload the previously generated data in the [ai-endpoints-doc](./notebook/ai-endpoints-doc/) folder.
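
After training, one way to smoke-test the result outside the notebook is to load the fine-tuned weights with `transformers`. This is only a sketch under assumptions: `./outputs/merged` is a hypothetical path, to be replaced by the merged-weights output directory from your axolotl configuration:

```python
# Sketch: load and query the fine-tuned model locally.
# "./outputs/merged" is hypothetical; use the merged-weights output
# directory from your axolotl config.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./outputs/merged"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

messages = [{"role": "user", "content": "What is OVHcloud AI Endpoints?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
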
ai/llm-fine-tune/chatbot/Dockerfile

Lines changed: 25 additions & 0 deletions
FROM python:3.13-slim

# πŸ“‚ Working directory in the container πŸ“‚
WORKDIR /workspace

# 🐍 Copy files in /workspace 🐍
COPY . /workspace

# ⬇️ Install any needed packages specified in requirements.txt ⬇️
RUN pip install --no-cache-dir -r requirements.txt

# πŸ” Change ownership of the workspace directory to the user with UID 42420 (OVHcloud user) πŸ”
RUN chown -R 42420:42420 /workspace

# βš™οΈ Make port 7860 available βš™οΈ
EXPOSE 7860

# βš™οΈ Gradio configuration to listen on all interfaces βš™οΈ
ENV GRADIO_SERVER_NAME="0.0.0.0"

# πŸ” Define default value for AI Endpoints API key πŸ”
ENV OVH_AI_ENDPOINTS_ACCESS_TOKEN=$OVH_AI_ENDPOINTS_ACCESS_TOKEN

# ⚑️ Run chatbot.py when the container launches ⚑️
CMD ["python", "chatbot.py"]
ai/llm-fine-tune/chatbot/chatbot.py

Lines changed: 63 additions & 0 deletions
# Application to compare answer generation between a model exposed on OVHcloud AI Endpoints and a fine-tuned model.
# ⚠️ Do not use in production!! ⚠️

import os

import gradio as gr
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# πŸ“œ Prompt template πŸ“œ
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "{system_prompt}"),
        ("human", "{user_prompt}"),
    ]
)

def chat(prompt, system_prompt, temperature, top_p, model_name, model_url, api_key):
    """
    Generate a chat response using the provided prompt, system prompt, temperature, top_p, model name, model URL and API key.
    """

    # βš™οΈ Initialize the OpenAI-compatible model βš™οΈ
    llm = ChatOpenAI(api_key=api_key,
                     model=model_name,
                     base_url=model_url,
                     temperature=temperature,
                     top_p=top_p
    )

    # πŸ“œ Apply the prompt template to the model πŸ“œ
    chain = prompt_template | llm
    ai_msg = chain.invoke(
        {
            "system_prompt": system_prompt,
            "user_prompt": prompt
        }
    )

    # πŸ€– Return the answer in a format compatible with the Gradio chatbot component.
    return [{"role": "user", "content": prompt}, {"role": "assistant", "content": ai_msg.content}]

# πŸ–₯️ Main application πŸ–₯️
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            system_prompt = gr.Textbox(value="""You are a specialist on OVHcloud products.
If you can't find any sure and relevant information about the product asked, answer with "This product doesn't exist in OVHcloud".""",
                label="πŸ§‘β€πŸ« System Prompt πŸ§‘β€πŸ«")
            temperature = gr.Slider(minimum=0.0, maximum=2.0, step=0.01, label="Temperature", value=0.5)
            top_p = gr.Slider(minimum=0.0, maximum=1.0, step=0.01, label="Top P", value=0.0)
            model_name = gr.Textbox(label="🧠 Model Name 🧠", value='Llama-3.1-8B-Instruct')
            model_url = gr.Textbox(label="πŸ”— Model URL πŸ”—", value='https://oai.endpoints.kepler.ai.cloud.ovh.net/v1')
            api_key = gr.Textbox(label="πŸ”‘ OVH AI Endpoints Access Token πŸ”‘", value=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"), type="password")

        with gr.Column():
            chatbot = gr.Chatbot(type="messages", label="πŸ€– Response πŸ€–")
            prompt = gr.Textbox(label="πŸ“ Prompt πŸ“", value='How many requests per minute can I do with AI Endpoints?')
            submit = gr.Button("Submit")

    submit.click(chat, inputs=[prompt, system_prompt, temperature, top_p, model_name, model_url, api_key], outputs=chatbot)

demo.launch()
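
For a quick check without the UI, `chat()` can also be called directly (for example in a REPL where the function above is defined); a sketch assuming the environment variables from the README are set and point at the model you want to test:

```python
# Hypothetical direct call to chat(), bypassing the Gradio interface.
result = chat(
    prompt="What is OVHcloud AI Endpoints?",
    system_prompt="You are a specialist on OVHcloud products.",
    temperature=0.5,
    top_p=0.0,
    model_name=os.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"),
    model_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
)
print(result[1]["content"])  # the assistant's reply
```
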
ai/llm-fine-tune/chatbot/requirements.txt

Lines changed: 4 additions & 0 deletions
gradio==5.38.0
langchain-openai==0.3.28
langchain-core==0.3.69
langchain==0.3.26
ai/llm-fine-tune/dataset/DatasetAugmentation.py

Lines changed: 110 additions & 0 deletions
import os
import json
import uuid
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage
from jsonschema import validate, ValidationError

# πŸ—ΊοΈ Define the JSON schema for the response πŸ—ΊοΈ
message_schema = {
    "type": "object",
    "properties": {
        "role": {"type": "string"},
        "content": {"type": "string"}
    },
    "required": ["role", "content"]
}

response_format = {
    "type": "json_object",
    "json_schema": {
        "name": "Messages",
        "description": "A list of messages with role and content",
        "properties": {
            "messages": {
                "type": "array",
                "items": message_schema
            }
        }
    }
}

# βœ… JSON validity verification ❌
def is_valid(json_data):
    """
    Test the validity of the JSON data against the schema.
    Argument:
        json_data (dict): The JSON data to validate.
    Returns:
        bool: True if the JSON data conforms to the schema, False otherwise
        (the validation error is printed).
    """
    try:
        validate(instance=json_data, schema=response_format["json_schema"])
        return True
    except ValidationError as e:
        print(f"❌ Validation error: {e}")
        return False

# βš™οΈ Initialize the chat model with AI Endpoints configuration βš™οΈ
chat_model = ChatOpenAI(
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),
    model_name=os.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"),
    temperature=0.0
)

# πŸ“‚ Define the directory path πŸ“‚
directory_path = "generated"
print(f"πŸ“‚ Directory path: {directory_path}")
directory = Path(directory_path)

# Make sure the output directory for synthetic files exists
Path("./generated/synthetic").mkdir(parents=True, exist_ok=True)

# πŸ—ƒοΈ Walk through the directory and its subdirectories πŸ—ƒοΈ
for path in directory.rglob("*"):
    print(f"πŸ“œ Processing file: {path}")
    # Check if the current path is a valid file
    if path.is_file() and "endpoints" in path.name:
        # Read the raw data from the file
        with open(path, 'r', encoding='utf-8') as file:
            raw_data = file.read()

        try:
            json_data = json.loads(raw_data)
        except json.JSONDecodeError:
            print(f"❌ Failed to decode JSON from file: {path.name}")
            continue

        if not is_valid(json_data):
            print(f"❌ Invalid dataset: {path.name}")
            continue
        print(f"βœ… Valid input dataset: {path.name}")

        user_message = HumanMessage(content=f"""
        Given the following JSON, generate a similar JSON file where you paraphrase each question in the content attribute
        (when the role attribute is user) and also paraphrase the value of the response to the question stored in the content attribute
        when the role attribute is assistant.
        The objective is to create synthetic datasets based on existing datasets.
        I do not need to know the code to do this, but I want the resulting JSON file.
        It is important that the term OVHcloud is present as much as possible, especially when the terms AI Endpoints are mentioned
        either in the question or in the response.
        There must always be a question followed by an answer, never two questions or two answers in a row.
        It is IMPERATIVE to keep the language in English.
        The source JSON file:
        {raw_data}
        """)

        chat_response = chat_model.invoke([user_message], response_format=response_format)

        output = chat_response.content

        # Replace literal "\t" escape sequences that would break JSON parsing
        output = output.replace("\\t", " ")

        generated_file_name = f"{uuid.uuid4()}_{path.name}"
        with open(f"./generated/synthetic/{generated_file_name}", 'w', encoding='utf-8') as output_file:
            output_file.write(output)

        if not is_valid(json.loads(output)):
            print(f"❌ ERROR: File {generated_file_name} is not valid")
        else:
            print(f"βœ… Successfully generated file: {generated_file_name}")
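
To make the validation behaviour concrete, here is a small illustrative check of `is_valid` (using the names defined above):

```python
# Illustrative only: a well-formed and a malformed payload.
good = {"messages": [
    {"role": "user", "content": "What is AI Endpoints?"},
    {"role": "assistant", "content": "An OVHcloud service."},
]}
bad = {"messages": [{"role": "user"}]}  # missing the required "content" key

print(is_valid(good))  # True
print(is_valid(bad))   # False, after printing the validation error
```
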
ai/llm-fine-tune/dataset/DatasetCreation.py

Lines changed: 75 additions & 0 deletions
import os
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

# πŸ—ΊοΈ Define the JSON schema for the response πŸ—ΊοΈ
message_schema = {
    "type": "object",
    "properties": {
        "role": {"type": "string"},
        "content": {"type": "string"}
    },
    "required": ["role", "content"]
}

response_format = {
    "type": "json_object",
    "json_schema": {
        "name": "Messages",
        "description": "A list of messages with role and content",
        "properties": {
            "messages": {
                "type": "array",
                "items": message_schema
            }
        }
    }
}

# βš™οΈ Initialize the chat model with AI Endpoints configuration βš™οΈ
chat_model = ChatOpenAI(
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),
    model_name=os.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"),
    temperature=0.0
)

# πŸ“‚ Define the directory path πŸ“‚
directory_path = "docs/pages/public_cloud/ai_machine_learning"
directory = Path(directory_path)

# πŸ—ƒοΈ Walk through the directory and its subdirectories πŸ—ƒοΈ
for path in directory.rglob("*"):
    # Check if the current path is a directory
    if path.is_dir():
        # Get the name of the subdirectory
        sub_directory = path.name

        # Construct the path to the "guide.en-gb.md" file in the subdirectory
        guide_file_path = path / "guide.en-gb.md"

        # Check if the "guide.en-gb.md" file exists in the subdirectory
        if "endpoints" in sub_directory and guide_file_path.exists():
            print(f"πŸ“— Guide processed: {sub_directory}")
            with open(guide_file_path, 'r', encoding='utf-8') as file:
                raw_data = file.read()

            user_message = HumanMessage(content=f"""
            With the following markdown, generate a JSON file composed as follows: a list named "messages" composed of tuples with a key "role" which can have the value "user" when it's the question and "assistant" when it's the response. To split the document, basing it on the markdown chapter titles to create the questions seems like a good idea.
            Keep the language English.
            I don't need to know the code to do it but I want the JSON result file.
            For the "user" field, don't just repeat the title but make a real question, for example "What are the requirements for OVHcloud AI Endpoints?"
            Be sure to add OVHcloud with AI Endpoints so that it's clear that OVHcloud creates AI Endpoints.
            Generate the entire JSON file.
            An example of what it should look like: messages [{{"role":"user", "content":"What is AI Endpoints?"}}]
            There must always be a question followed by an answer, never two questions or two answers in a row.
            The source markdown file:
            {raw_data}
            """)
            chat_response = chat_model.invoke([user_message], response_format=response_format)

            with open(f"./generated/{sub_directory}.json", 'w', encoding='utf-8') as output_file:
                output_file.write(chat_response.content)
            print(f"βœ… Dataset generated: ./generated/{sub_directory}.json")

ai/llm-fine-tune/notebook/ai-endpoints-doc/.gitkeep

Whitespace-only changes.
