Commit 4e42986

feat: dataset creation from traces (#673)

1 parent 5173d88 commit 4e42986

File tree

18 files changed: +5928 -2 lines changed
src/rai_finetune/README.md

Lines changed: 223 additions & 0 deletions
# RAI Fine-tuning Module

⚠️ **Experimental Module**: This module is in active development. Features may change and some functionality is still in progress.
## Module Overview

This module provides tools for extracting and formatting training data from providers such as Langfuse and LangSmith. The formatted training data is designed to work seamlessly with Unsloth for efficient fine-tuning. It includes:

**Data Preparation:**

- **Observation Extractors**: Extract observations from various sources (Langfuse, LangSmith) with standardized preprocessing
- **Training Data Formatter**: Converts RAI observations into ChatML-formatted training data for Unsloth compatibility
Extractors are encouraged to adopt a standardized data format based on the Langfuse structure. The Langfuse format was chosen as the standardization target because it provides cleaner, more direct access to conversation data, with flat message structures (`input`/`output` fields) that closely match the target ChatML format. This reduces preprocessing complexity and keeps the formatter more maintainable than handling raw LangSmith data with its deeply nested LangChain internal structures.

Data from other sources (e.g., LangSmith) can be preprocessed at the extraction level to ensure consistent formatting. For example, LangSmith data with nested message structures and different field names is converted to the standard format before reaching the formatter, so a single, reusable formatter serves all data sources.

The formatter follows OpenAI's recommendation on [data formatting](https://platform.openai.com/docs/guides/supervised-fine-tuning#formatting-your-data) for fine-tuning.
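To illustrate this kind of preprocessing, here is a minimal sketch of flattening a nested LangSmith-style record into the flat `input`/`output` shape described above. The field names (`inputs`, `outputs`, `messages`, `data`) are assumptions about typical LangSmith-style payloads, not the module's actual schema:

```python
# Illustrative sketch only: the field names below ("inputs", "messages",
# "data", "outputs") are assumptions about LangSmith-style records, not
# this module's actual schema.

def flatten_langsmith_record(record: dict) -> dict:
    """Convert a nested LangSmith-style run into a flat Langfuse-style record."""
    nested = record.get("inputs", {}).get("messages", [])
    # LangChain-style messages may nest role/content under a "data" key
    input_messages = [
        {
            "role": m.get("data", m).get("role", "user"),
            "content": m.get("data", m).get("content", ""),
        }
        for m in nested
    ]
    output = record.get("outputs", {}).get("content", "")
    return {"input": input_messages, "output": output}

example = {
    "inputs": {"messages": [{"data": {"role": "user", "content": "Pick up the cube"}}]},
    "outputs": {"content": "Calling grasp tool..."},
}
flat = flatten_langsmith_record(example)
```

With this shape in place, the formatter only ever sees flat `input`/`output` records, regardless of which provider produced them.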
**Fine-tune Helpers:**

To be added. Planned components include:

- **Model Fine-tuning**: Uses Unsloth for optimized training with 4-bit quantization and LoRA support
- **LoRA Merger**: Merges LoRA adapter weights back into base models for standalone deployment
- **Ollama Converter**: Converts fine-tuned models to Ollama format using GGUF export
The module is designed as a standalone package to avoid dependency conflicts between different versions of Triton required by openai-whisper and unsloth-zoo.
**System Component Proposal** (feedback is welcome and appreciated!):

<div style="text-align: center; padding: 20px;"><img src="imgs/rai-fine-tune-system-components.png" alt="RAI Fine Tune System Components"></div>
Folder Structure (Tentative)

```
src/rai_finetune/rai_finetune/
├── data/                    # Data processing
│   ├── formatters/          # Data formatting
│   ├── extractors/          # Data extraction
│   └── validators/          # Data validation (To be implemented)
├── utils/                   # Utilities
│   ├── chat_template.py     # Chat templates
│   ├── templates/           # Template files
│   └── model_loader.py      # Base model loading (from ModelManager, to be implemented)
├── adapters/                # LoRA management
│   ├── merger.py            # LoRA merging (To be implemented)
│   └── config.py            # Adapter configs (To be implemented)
├── trainers/                # Training orchestration
│   ├── trainer.py           # Main trainer (To be implemented)
│   └── data_loader.py       # Data preparation (To be implemented)
└── exporters/               # Model export
    ├── ollama.py            # Ollama export (To be implemented)
    └── gguf.py              # GGUF export (To be implemented)
```
## Environment Setup

This module uses `unsloth`, which supports Python 3.10, 3.11, and 3.12. However, Python 3.12+ has Dynamo compatibility issues with `unsloth`; see this [issue reference](https://github.com/unslothai/unsloth/issues/886). Python 3.10 is therefore selected for its compatibility with the rest of the RAI components. The instructions below target Linux.
### 1. Install System Dependencies

```bash
sudo apt update
sudo apt install -y \
  libncurses5-dev \
  libncursesw5-dev \
  libreadline-dev \
  libsqlite3-dev \
  libssl-dev \
  zlib1g-dev \
  libbz2-dev \
  libffi-dev \
  liblzma-dev \
  libgdbm-dev \
  libnss3-dev \
  libtinfo6 \
  build-essential
```
### 2. Install Python 3.10 with pyenv

Use pyenv to manage Python versions:

```bash
# Install pyenv if not already installed
curl https://pyenv.run | bash

# Add to shell profile
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc

# Reload shell or source profile
source ~/.bashrc

# Install Python 3.10
pyenv install 3.10
```
### 3. Set up Poetry Environment

```bash
cd src/rai_finetune

# Set local Python version
pyenv local 3.10

# Install Poetry if not already installed
curl -sSL https://install.python-poetry.org | python3 -

# Create and activate Poetry environment
poetry env use python
poetry install
poetry run pip install flash-attn --no-build-isolation

# Activate the environment
. ./setup_finetune_shell.sh
```
### 4. Install llama.cpp Tools (Optional)

The Ollama conversion process requires the `llama-quantize` tool from llama.cpp. To build it:

```bash
# Clone and build llama.cpp at the project root
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake ..
cmake --build . --config Release
# The llama-quantize tool will be in the build/bin directory
```
## CLI Usage

The module provides a unified command-line interface for data operations:

```bash
# Show general help
python -m rai_finetune.data_cli --help

# Show help for specific subcommands
python -m rai_finetune.data_cli extract langfuse --help
python -m rai_finetune.data_cli format --help
```
## Script Execution Flow

Before running any scripts, make sure the shell is set up properly by running the following from the project root:

```bash
source src/rai_finetune/setup_finetune_shell.sh
```
### 1. Observation Extraction

Extract observations from Langfuse for specific models using the CLI:

```bash
python -m rai_finetune.data_cli extract langfuse \
  --models "gpt-4o" \
  --models "gpt-4o-mini" \
  --output langfuse_raw_data.jsonl \
  --max-data-limit 5000
```

With start and stop time filters:

```bash
python -m rai_finetune.data_cli extract langfuse \
  --models "gpt-4o" \
  --start-time "2025-08-01T00:00:00Z" \
  --stop-time "2025-08-31T23:59:59Z" \
  --output langfuse_raw_data_filtered.jsonl
```
**CLI Options:**

**Langfuse Options:**

- `--models`: List of model names to extract observations from
- `--output`: Output file for extracted observations (required)
- `--page-size`: Page size for pagination (default: 50)
- `--start-time`: Start time for data extraction (ISO format)
- `--stop-time`: Stop time for data extraction (ISO format)
- `--max-data-limit`: Maximum number of records to extract (default: 5000)
- `--host`: Langfuse host URL (default: http://localhost:3000)
- `--public-key`: Langfuse public key (or set the LANGFUSE_PUBLIC_KEY env var)
- `--secret-key`: Langfuse secret key (or set the LANGFUSE_SECRET_KEY env var)
- `--type-filter`: Observation type filter (default: GENERATION)
- `--trace-id`: Restrict extraction to a specific trace ID
- `--include-fields`: Fields to include in saved data samples
**Environment Variables:**

You can set Langfuse credentials as environment variables to avoid passing them on the command line:

```bash
export LANGFUSE_PUBLIC_KEY="your_public_key"
export LANGFUSE_SECRET_KEY="your_secret_key"
```
### 2. Training Data Preparation

For tool-calling fine-tuning, format the extracted data samples for training using the CLI:

```bash
python -m rai_finetune.data_cli format \
  --input langfuse_raw_data.jsonl \
  --output langfuse_tc_data.jsonl \
  --system-prompt "You are a specialized AI assistant for robotics and tool calling tasks."
```

**CLI Options:**

- `--input`: Input observations file (required)
- `--output`: Output training data file (required)
- `--system-prompt`: System prompt to use (default: "You are a helpful AI assistant that can use tools to help users.")
- `--system-prompt-file`: Path to file containing a custom system prompt
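Following the OpenAI-style chat format mentioned earlier, each formatted training sample should look roughly like the sketch below. This is only an illustration of the ChatML-style `messages` shape; the exact keys and content the formatter emits may differ:

```python
import json

# Illustrative shape of one ChatML-style training sample; the module's
# actual output keys and message content may differ.
sample = {
    "messages": [
        {"role": "system", "content": "You are a specialized AI assistant for robotics and tool calling tasks."},
        {"role": "user", "content": "Move the arm to the home position."},
        {"role": "assistant", "content": "Calling the move-to-home tool..."},
    ]
}

# Each sample is serialized as one line of the output JSONL file.
line = json.dumps(sample)
roundtrip = json.loads(line)
```

This flat `messages` list is what Unsloth-compatible chat templates consume during fine-tuning.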