LoRA adapters trained on 1,213 examples created through iterative authoring with proactive sampling: the researcher actively directed and curated multi-turn LLM discussions to capture domain expertise and reasoning patterns. This human-directed data-authoring methodology achieved 77-91% win rates against baseline models, demonstrating a reproducible approach to domain-specific fine-tuning.
The adapters were tested across multiple architectures (Qwen 2.5 3B, Llama 3.2 3B, and Qwen 2.5 0.5B) to observe how the authored training data transfers across different base models.
Evaluation used blind A/B testing against baseline models with multiple independent LLM judges; the same 1,213 authored examples were used to train each architecture.
Evaluation Criteria: Judges evaluated responses on five dimensions (1-5 scale):
- Creativity & Originality
- Coherence & Structure
- Depth & Insight
- Engagement & Interest
- Writing Quality
Responses were presented in randomized order (A/B or B/A) to control for position bias. Judges selected a winner and provided reasoning for their choice.
| Base Architecture | Win Rate | Judge | Sample Size |
|---|---|---|---|
| Qwen 2.5 3B | 91.2% | Multi-judge (4 judges) | n=57 |
| Llama 3.2 3B | 80.4% | Multi-judge (4 judges) | n=57 |
| Qwen 2.5 0.5B | 71.9% | Multi-judge (4 judges) | n=57 |
Observation: The authored training data transfers differently across architectures. Qwen 2.5 3B shows strongest alignment (91.2%).
Per-judge breakdown (Qwen 2.5 3B):
| Judge | Preference for Fine-Tuned | Sample Size |
|---|---|---|
| Claude 3.5 Sonnet | 95.2% | n=42 |
| Claude Opus 4 | 78.9% | n=57 |
| GPT-4o | 93.0% | n=57 |
| Gemini 2.5 | 94.7% | n=57 |
Inter-judge agreement (GPT-4o ↔ Gemini): 91%
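The agreement figure can be computed as simple pairwise agreement on winner labels. A minimal sketch (the function name is illustrative, not taken from the repo's scripts):

```python
def agreement_rate(winners_a, winners_b):
    """Fraction of comparisons where two judges picked the same winner.

    Each list holds one label per comparison: 'A', 'B', or 'Tie'.
    """
    assert len(winners_a) == len(winners_b), "judges must see the same comparisons"
    matches = sum(1 for x, y in zip(winners_a, winners_b) if x == y)
    return matches / len(winners_a)
```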
Per-judge breakdown (Llama 3.2 3B):
| Judge | Preference for Fine-Tuned | Sample Size |
|---|---|---|
| Claude Sonnet 4 | 73.8% | n=42 |
| Claude Opus 4 | 80.0% | n=15 |
| GPT-4o | 82.5% | n=57 |
| Gemini 2.5 Flash Lite | 84.2% | n=57 |
Overall win rate (multi-judge): 80.4%
40% win rate on coding, math, and practical tasks: an expected trade-off for domain specialization.
Note: These are preliminary results with newer generation models. More thorough evaluations coming soon!
In-Domain Performance (Brie 3B on 57 comprehensive prompts):
| Model | Win Rate |
|---|---|
| Gemini 3 | 82.9% |
| Claude Haiku 4.5 | 80.7% |
| GPT-5 | 75.4%* |
| Claude Sonnet 4.5 | 71.9% |
*GPT-5 shows ~97% raw rate but many parsing failures; 75.4% excludes failures
Out-of-Domain Performance (15 prompts):
| Model | Win Rate | Notes |
|---|---|---|
| Claude Sonnet 4.5 | 60.0% | |
| GPT-5 | 46.7% | |
| Gemini 3 | 40.0% | 1 tie |
| Claude Haiku 4.5 | 6.7% | 12 unknowns (parsing issues) |
- Full 2-epoch training essential: checkpoint-100 (1 epoch) showed ~10% performance, checkpoint-290 (2 epochs) achieved 72-83%
- Blind A/B testing with randomized presentation order
Each judge model (GPT-4o, Claude, Gemini) received this identical structured prompt:
You are an expert evaluator of creative writing and philosophical prose. Compare these two responses to the same prompt.
Original Prompt: "{prompt}"
Response A:
{response_a}
Response B:
{response_b}
Evaluate both responses on these criteria (rate 1-5 for each, where 5 is excellent and 1 is poor):
1. Creativity & Originality
2. Coherence & Structure
3. Depth & Insight
4. Engagement & Interest
5. Writing Quality
Provide your evaluation in this EXACT format:
Response A - Creativity: X, Coherence: X, Depth: X, Engagement: X, Quality: X
Response B - Creativity: X, Coherence: X, Depth: X, Engagement: X, Quality: X
Winner: [A or B or Tie]
Reasoning: [2-3 sentences explaining your choice]
Be critical and honest. Consider whether responses are truly insightful or just verbose.
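Replies in this fixed format can be extracted with a small regex; judges that deviate from the format produce the parsing failures noted in the results above. A hypothetical parser sketch (not the repo's actual implementation):

```python
import re

def parse_verdict(text):
    """Extract the winner from a judge reply in the required format.

    Returns 'A', 'B', 'Tie', or None when the reply cannot be parsed.
    """
    m = re.search(r"Winner:\s*\[?\s*(A|B|Tie)\s*\]?", text, re.IGNORECASE)
    if not m:
        return None  # counted as a parsing failure
    token = m.group(1)
    return "Tie" if token.lower() == "tie" else token.upper()
```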
To control for position bias, responses were randomly assigned to "Response A" or "Response B" positions for each comparison. The assignment order was tracked and inverted when computing final win rates.
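A minimal sketch of this randomize-and-invert bookkeeping (function and field names are illustrative, not taken from the evaluation scripts):

```python
import random

def randomize_pair(fine_tuned, baseline, rng=random):
    """Randomly assign the two responses to positions A/B and record which is which."""
    swapped = rng.random() < 0.5
    a, b = (baseline, fine_tuned) if swapped else (fine_tuned, baseline)
    return {"response_a": a, "response_b": b, "fine_tuned_is_a": not swapped}

def fine_tuned_won(record, winner):
    """Invert the tracked assignment to decide whether the fine-tuned model won.

    Returns True/False, or None for a tie.
    """
    if winner == "Tie":
        return None
    return winner == ("A" if record["fine_tuned_is_a"] else "B")
```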
Using multiple independent judge models (GPT-4o, Claude Sonnet/Opus, Gemini) provides more robust evaluation than single-judge approaches. High inter-judge agreement (91% between GPT-4o and Gemini) indicates clear quality differentials rather than judge-specific preferences.
- Brie v1 (`runs/brie-v1-0.5b/`): Initial 10-step test run
- Brie v2 checkpoint-100 (`runs/brie-v2-0.5b/checkpoint-100/`): Mid-training (1 epoch, undertrained)
- Brie v2 checkpoint-290 (`runs/brie-v2-0.5b/checkpoint-290/`): Full training (2 epochs, 290 steps)
- Brie v2 3B (`runs/brie-v2-3b/`): Qwen 2.5 3B (91.2% win rate, trained on RunPod)
- Brie Llama 3B (`runs/brie-llama-3b/`): Llama 3.2 3B (80.4% win rate, trained on RunPod)
- Base Model: Qwen/Qwen2.5-0.5B-Instruct (618M parameters)
- Training Method: LoRA (Low-Rank Adaptation)
- Training Data: 1,213 original examples authored by the researcher
- Validation Data: 60 examples
- Training: 2 epochs (290 steps) on Apple M4 MacBook (16GB unified memory)
- Current Version: Brie v2 checkpoint-290
- Rank (r): 8
- Alpha: 16
- Dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Task: Causal Language Modeling
- Final Training Loss: 1.4824 (checkpoint-290)
- Validation Loss: 1.5031 (checkpoint-290)
- Training Time: ~5 hours (2 epochs)
- Adapter Size: 4.1MB
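The hyperparameters above map directly onto a `peft` `LoraConfig`. A sketch assuming the standard `peft` API (the repo's training scripts may differ in detail):

```python
from peft import LoraConfig

# LoRA hyperparameters as listed above; task_type matches causal LM fine-tuning.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```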
efficient-domain-adaptation/
├── data/
│ ├── sft.train.sample.jsonl # Sample training examples (public)
│ └── TRAINING_DATA_README.md # Data documentation
├── docs/
│ ├── karman-2025-small-data-domain-adaptation.pdf # Research paper
│ ├── figures/ # Paper figures (PNG + PDF)
│ ├── brie-0.5b/ # 0.5B model cards and evaluation
│ ├── brie-3b/ # 3B model card
│ └── general/ # Evaluation framework docs
├── exports/ # Evaluation results (JSONL)
├── runs/
│ ├── brie-v1-0.5b/ # Initial test run
│ ├── brie-v2-0.5b/ # Qwen 2.5 0.5B checkpoints
│ ├── brie-v2-3b/ # Qwen 2.5 3B (91.2% win rate)
│ └── brie-llama-3b/ # Llama 3.2 3B (80.4% win rate)
├── experimental/ # Experimental training scripts
├── scripts/ # RunPod and analysis scripts
├── train_brie_v2.py # Training script (0.5B)
├── train_brie_v2_3b.py # Training script (3B)
├── train_brie_llama_3b.py # Training script (Llama 3B)
├── judge_multi_provider.py # Multi-judge evaluation
├── compare_all_judges.py # Inter-judge agreement analysis
├── comprehensive_evaluation_suite.py # Full evaluation pipeline
├── summarize_judge_results.py # Results aggregation
└── .env.example # API key configuration
# Install dependencies
pip install openai anthropic google-generativeai torch transformers peft
# Configure API keys
cp .env.example .env
# Edit .env with your API keys
# Run multi-judge evaluation on comparison data
python judge_multi_provider.py exports/your_comparisons.jsonl --judge both
# Analyze inter-judge agreement
python compare_all_judges.py exports/
# Summarize results
python summarize_judge_results.py exports/your_results_judged.jsonl

Default settings (in test script):
- Max tokens: 512
- Temperature: 0.75
- Sampling: Enabled
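These defaults correspond to a standard `transformers` generation configuration. A sketch (illustrative, not the test script itself):

```python
from transformers import GenerationConfig

# Mirrors the defaults listed above.
gen_config = GenerationConfig(
    max_new_tokens=512,  # "Max tokens: 512"
    temperature=0.75,
    do_sample=True,      # "Sampling: Enabled"
)
```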
Brie performs best on:
- Philosophical discussions (especially continental philosophy)
- Creative brainstorming for artists and writers
- Conceptual exploration and analysis
- Methodology discussions for RLHF testing
Example:
Can you suggest some article ideas on the philosophy of AI?
- Apple M4 MacBook Pro
- 16GB unified memory
- MPS (Metal Performance Shaders) backend
- Epochs: 2 (completed successfully)
- Batch size: 2 per device
- Gradient accumulation: 4 steps
- Effective batch size: 8
- Learning rate: 2e-4 (linear decay with 20-step warmup)
- Evaluation: Every 50 steps
- Checkpointing: Every 100 steps
- Total steps: 290 (2 full epochs)
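The configuration above maps onto `transformers` `TrainingArguments` roughly as follows (a sketch; `output_dir` is illustrative, and the flag enabling step-based evaluation varies by `transformers` version):

```python
from transformers import TrainingArguments

# Hyperparameters as listed above.
# Effective batch size = 2 (per device) * 4 (accumulation steps) = 8.
args = TrainingArguments(
    output_dir="runs/brie-v2-0.5b",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=20,
    eval_steps=50,   # pair with step-based evaluation strategy for your version
    save_steps=100,
)
```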
- 0.5B model trained on Apple M4 MacBook (16GB RAM)
- 3B model trained on RunPod GPU
- 2nd epoch was critical: checkpoint-100 (1 epoch) showed minimal performance, checkpoint-290 (2 epochs) achieved 77% in-domain win rate
- Training to completion (2+ epochs) essential for domain expertise with small datasets
Brie v2 checkpoint-290 (recommended) contains:
- `adapter_model.safetensors` (4.1MB) - LoRA adapter weights
- `adapter_config.json` - LoRA configuration
- Full tokenizer files
- Training state and metrics
Total checkpoint size: ~19MB (no optimizer state - training complete)
Access via: runs/brie-v2-0.5b/checkpoint-290/
Note: checkpoint-100 contains optimizer state (8.3MB) for resuming training.
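Loading the adapter for inference follows the standard `peft` pattern. A sketch assuming the checkpoint path above (note this downloads the base model from the Hugging Face Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter from the checkpoint.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "runs/brie-v2-0.5b/checkpoint-290")
tokenizer = AutoTokenizer.from_pretrained("runs/brie-v2-0.5b/checkpoint-290")
```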
Comprehensive evaluation: 85+ blind A/B comparisons against baseline.
In-Domain Performance (77% win rate, n=13):
- Continental philosophy (phenomenology, existentialism, critical theory)
- Speculative and conceptual reframing
- Contemplative prose
- Philosophical argumentation
Out-of-Domain Performance (40% win rate, n=15):
- Math: 33%
- Practical tasks: 67%
- Creative writing: 67%
- Factual knowledge: 33%
- Coding: 0%
Comprehensive Multi-Domain (50% win rate, n=57)
Domain specialization without catastrophic forgetting: strong performance in target domains, maintained competence elsewhere.
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install torch transformers datasets peft trl
# Set environment variable (macOS Xet Storage bug fix)
export HF_HUB_DISABLE_XET=1

The model was trained on 1,213 examples authored by the researcher, drawn from years of philosophical discussions with LLMs. This method of generating training data achieved 77-91% win rates, demonstrating a reproducible approach for domain-specific fine-tuning.
The dataset covers:
- Continental philosophy discussions (phenomenology, existentialism, critical theory)
- Speculative and experimental thinking
- Conceptual work for artists and writers
- Theoretical brainstorming and reframing
- Contemplative and meditative prose
This same dataset was used across multiple architectures (Qwen 2.5 3B, Llama 3.2 3B, Qwen 2.5 0.5B) to test how this training methodology transfers between different base models.
Model weights and training code for personal/research use.
Base model (Qwen 2.5 0.5B Instruct) license: Apache 2.0