Fine-tune Qwen3-4B with LoRA on Apple Silicon to build a local, private customer support model, trained on real FAQ data and served with llama.cpp.
| Step | Description | Script |
|---|---|---|
| 1 | Define FAQ knowledge base | faqs.json |
| 2 | Generate synthetic training data | generate_training_data.py |
| 3 | Fine-tune Qwen3-4B with LoRA | train.py |
| 4 | Test the fine-tuned model | test_model.py |
| 5 | Merge LoRA weights & export to GGUF | merge_and_export.py |
| 6 | Serve locally with llama.cpp | llama-server |
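For orientation, step 2 emits chat-format JSONL records. The sketch below shows one hypothetical record; the exact field names and the sample answer text are assumptions about what generate_training_data.py produces, following the common `messages` chat format:

```python
import json

# One hypothetical training example in chat format.
# Field names and the answer text are assumptions, not taken from faqs.json.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful customer support assistant for TAIKAI."},
        {"role": "user", "content": "How do I join a hackathon?"},
        {"role": "assistant", "content": "Open the hackathon page on TAIKAI and click Join."},
    ]
}

# Each line of train.jsonl is one such record serialized as a single JSON object.
line = json.dumps(record, ensure_ascii=False)
print(line)
```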
- MacBook Pro with Apple Silicon (M1 Pro 16GB+ recommended)
- Python 3.11+
- uv package manager
- OpenRouter API key (for synthetic data generation)
- Hugging Face account
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment
uv venv --python 3.11
source .venv/bin/activate
# Install dependencies
uv add torch torchvision torchaudio
uv add "transformers>=4.51.0" datasets peft trl accelerate
uv add openai huggingface_hub sentencepiece
# Install llama.cpp (for GGUF conversion and serving)
brew install llama.cpp
uv pip install "gguf @ git+https://github.com/ggerganov/llama.cpp.git#subdirectory=gguf-py"
export OPENROUTER_API_KEY="your-key-here"
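Before training, it is worth confirming that PyTorch can see the Apple Silicon GPU via the MPS backend; on CPU the fine-tune will be far slower. A quick check (the helper name is ours):

```python
import importlib.util

def mps_available():
    """Return True/False if torch is installed, None if it isn't."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.backends.mps.is_available()

print("MPS available:", mps_available())
```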
uv run python generate_training_data.py
uv run python train.py
uv run python test_model.py
uv run python merge_and_export.py
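The core of generate_training_data.py can be sketched as follows: a minimal, hypothetical version that asks an OpenRouter-hosted model (which exposes an OpenAI-compatible API) to paraphrase each FAQ into user-style questions. The model name, prompt wording, and FAQ field names are assumptions:

```python
import json
import os

def build_prompt(faq):
    """Turn one FAQ entry into a generation prompt (entry fields are assumed)."""
    return (
        "Rewrite the following FAQ as three distinct questions a customer might ask, "
        "one per line.\n"
        f"Q: {faq['question']}\nA: {faq['answer']}"
    )

def generate(faqs):
    """Yield (faq, generated_questions) pairs via OpenRouter's OpenAI-compatible API."""
    from openai import OpenAI
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )
    for faq in faqs:
        resp = client.chat.completions.create(
            model="openai/gpt-4o-mini",  # hypothetical choice of generator model
            messages=[{"role": "user", "content": build_prompt(faq)}],
        )
        yield faq, resp.choices[0].message.content

if __name__ == "__main__" and "OPENROUTER_API_KEY" in os.environ:
    with open("faqs.json") as f:
        for faq, questions in generate(json.load(f)):
            print(faq["question"], "->", questions)
```

The real script also splits the results into train.jsonl and val.jsonl; that bookkeeping is omitted here.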
.venv/bin/python $(brew --prefix llama.cpp)/bin/convert_hf_to_gguf.py ./taikai-support-merged \
--outfile ./taikai-support-q8_0.gguf \
  --outtype q8_0

llama-server \
-m ./taikai-support-q8_0.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 2048 \
  --chat-template chatml

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful customer support assistant for TAIKAI."},
{"role": "user", "content": "how do i join a hackathon?"}
],
"temperature": 0.7,
"max_tokens": 256
  }'

.
├── faqs.json # Source FAQ knowledge base (196 entries)
├── generate_training_data.py # Synthetic data generation via OpenRouter
├── train.py # LoRA fine-tuning script
├── test_model.py # Test the fine-tuned model
├── merge_and_export.py # Merge LoRA weights into base model
├── train.jsonl # Generated training data
├── val.jsonl # Generated validation data
├── taikai-support-model/ # LoRA adapter output
├── taikai-support-merged/ # Merged model (base + LoRA)
└── taikai-support-q8_0.gguf # Final GGUF model for llama.cpp
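Because llama-server speaks the OpenAI-compatible chat completions API, any HTTP client can query the model. A minimal sketch using only the Python standard library, mirroring the curl call above (the helper names are ours):

```python
import json
import urllib.request

def extract_reply(body):
    """Pull the assistant text out of an OpenAI-style completion response."""
    return body["choices"][0]["message"]["content"]

def ask(question, host="http://localhost:8080"):
    """POST a chat completion to the local llama-server and return the reply text."""
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful customer support assistant for TAIKAI."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    }
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# With the server running: print(ask("how do i join a hackathon?"))
```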