LoRA adapters trained on 1,213 examples created through iterative authoring with proactive sampling: the researcher actively directed and curated multi-turn LLM discussions to capture domain expertise and reasoning patterns. This human-directed data-authoring methodology achieved 77-91% win rates against baseline models, demonstrating a reproducible approach to domain-specific fine-tuning.
The adapters were tested across multiple architectures (Qwen 2.5 3B, Llama 3.2 3B, and Qwen 2.5 0.5B) to observe how the authored training data transfers across different base models.
Evaluation used blind A/B testing against baseline models with multiple independent LLM judges; the same 1,213 authored examples were used to train each architecture.
Evaluation Criteria: Judges evaluated responses on five dimensions (1-5 scale):
- Creativity & Originality
- Coherence & Structure
- Depth & Insight
- Engagement & Interest
- Writing Quality
Responses were presented in randomized order (A/B or B/A) to control for position bias. Judges selected a winner and provided reasoning for their choice.
| Base Architecture | Win Rate | Judge | Sample Size |
|---|---|---|---|
| Qwen 2.5 3B | 91.2% | Multi-judge (4 judges) | n=57 |
| Llama 3.2 3B | 80.4% | Multi-judge (4 judges) | n=57 |
| Qwen 2.5 0.5B | 71.9% | Multi-judge (4 judges) | n=57 |
Observation: The authored training data transfers differently across architectures. Qwen 2.5 3B shows strongest alignment (91.2%).
Per-judge breakdown (Qwen 2.5 3B):
| Judge | Preference for Fine-Tuned | Sample Size |
|---|---|---|
| Claude 3.5 Sonnet | 95.2% | n=42 |
| Claude Opus 4 | 78.9% | n=57 |
| GPT-4o | 93.0% | n=57 |
| Gemini 2.5 | 94.7% | n=57 |
Inter-judge agreement (GPT-4o ↔ Gemini): 91%
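The agreement figure can be computed as simple pairwise agreement on winner labels. A minimal sketch (the function name is illustrative, not taken from the repo's scripts):

```python
def agreement_rate(winners_a, winners_b):
    """Fraction of comparisons where two judges picked the same winner.

    Each list holds one label per comparison: 'A', 'B', or 'Tie'.
    """
    assert len(winners_a) == len(winners_b), "judges must see the same comparisons"
    matches = sum(1 for x, y in zip(winners_a, winners_b) if x == y)
    return matches / len(winners_a)
```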
Per-judge breakdown (Llama 3.2 3B):
| Judge | Preference for Fine-Tuned | Sample Size |
|---|---|---|
| Claude Sonnet 4 | 73.8% | n=42 |
| Claude Opus 4 | 80.0% | n=15 |
| GPT-4o | 82.5% | n=57 |
| Gemini 2.5 Flash Lite | 84.2% | n=57 |
Overall win rate (multi-judge): 80.4%
40% win rate on coding, math, and practical tasks: an expected trade-off for domain specialization.
Note: These are preliminary results with newer generation models. More thorough evaluations coming soon!
In-Domain Performance (Brie 3B on 57 comprehensive prompts):
| Model | Win Rate |
|---|---|
| Gemini 3 | 82.9% |
| Claude Haiku 4.5 | 80.7% |
| GPT-5 | 75.4%* |
| Claude Sonnet 4.5 | 71.9% |
*GPT-5 shows ~97% raw rate but many parsing failures; 75.4% excludes failures
Out-of-Domain Performance (15 prompts):
| Model | Win Rate | Notes |
|---|---|---|
| Claude Sonnet 4.5 | 60.0% | |
| GPT-5 | 46.7% | |
| Gemini 3 | 40.0% | 1 tie |
| Claude Haiku 4.5 | 6.7% | 12 unknowns (parsing issues) |
- Full 2-epoch training essential: checkpoint-100 (1 epoch) showed ~10% performance, checkpoint-290 (2 epochs) achieved 72-83%
- Blind A/B testing with randomized presentation order
Each judge model (GPT-4o, Claude, Gemini) received this identical structured prompt:
You are an expert evaluator of creative writing and philosophical prose. Compare these two responses to the same prompt.
Original Prompt: "{prompt}"
Response A:
{response_a}
Response B:
{response_b}
Evaluate both responses on these criteria (rate 1-5 for each, where 5 is excellent and 1 is poor):
1. Creativity & Originality
2. Coherence & Structure
3. Depth & Insight
4. Engagement & Interest
5. Writing Quality
Provide your evaluation in this EXACT format:
Response A - Creativity: X, Coherence: X, Depth: X, Engagement: X, Quality: X
Response B - Creativity: X, Coherence: X, Depth: X, Engagement: X, Quality: X
Winner: [A or B or Tie]
Reasoning: [2-3 sentences explaining your choice]
Be critical and honest. Consider whether responses are truly insightful or just verbose.
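Replies in this fixed format can be extracted with a small regex; judges that deviate from the format produce the parsing failures noted in the results above. A hypothetical parser sketch (not the repo's actual implementation):

```python
import re

def parse_verdict(text):
    """Extract the winner from a judge reply in the required format.

    Returns 'A', 'B', 'Tie', or None when the reply cannot be parsed.
    """
    m = re.search(r"Winner:\s*\[?\s*(A|B|Tie)\s*\]?", text, re.IGNORECASE)
    if not m:
        return None  # counted as a parsing failure
    token = m.group(1)
    return "Tie" if token.lower() == "tie" else token.upper()
```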
To control for position bias, responses were randomly assigned to "Response A" or "Response B" positions for each comparison. The assignment order was tracked and inverted when computing final win rates.
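A minimal sketch of this randomize-and-invert bookkeeping (function and field names are illustrative, not taken from the evaluation scripts):

```python
import random

def randomize_pair(fine_tuned, baseline, rng=random):
    """Randomly assign the two responses to positions A/B and record which is which."""
    swapped = rng.random() < 0.5
    a, b = (baseline, fine_tuned) if swapped else (fine_tuned, baseline)
    return {"response_a": a, "response_b": b, "fine_tuned_is_a": not swapped}

def fine_tuned_won(record, winner):
    """Invert the tracked assignment to decide whether the fine-tuned model won.

    Returns True/False, or None for a tie.
    """
    if winner == "Tie":
        return None
    return winner == ("A" if record["fine_tuned_is_a"] else "B")
```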
Using multiple independent judge models (GPT-4o, Claude Sonnet/Opus, Gemini) provides more robust evaluation than single-judge approaches. High inter-judge agreement (91% between GPT-4o and Gemini) indicates clear quality differentials rather than judge-specific preferences.
- Brie v1 (`runs/brie-v1-0.5b/`): Initial 10-step test run
- Brie v2 checkpoint-100 (`runs/brie-v2-0.5b/checkpoint-100/`): Mid-training (1 epoch, undertrained)
- Brie v2 checkpoint-290 (`runs/brie-v2-0.5b/checkpoint-290/`): Full training (2 epochs, 290 steps)
- Brie v2 3B (`runs/brie-v2-3b/`): Qwen 2.5 3B (91.2% win rate, trained on RunPod)
- Brie Llama 3B (`runs/brie-llama-3b/`): Llama 3.2 3B (80.4% win rate, trained on RunPod)
- Base Model: Qwen/Qwen2.5-0.5B-Instruct (618M parameters)
- Training Method: LoRA (Low-Rank Adaptation)
- Training Data: 1,213 original examples authored by the researcher
- Validation Data: 60 examples
- Training: 2 epochs (290 steps) on Apple M4 MacBook (16GB unified memory)
- Current Version: Brie v2 checkpoint-290
- Rank (r): 8
- Alpha: 16
- Dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Task: Causal Language Modeling
- Final Training Loss: 1.4824 (checkpoint-290)
- Validation Loss: 1.5031 (checkpoint-290)
- Training Time: ~5 hours (2 epochs)
- Adapter Size: 4.1MB
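The hyperparameters above map directly onto a `peft` `LoraConfig`. A sketch assuming the standard `peft` API (the repo's training scripts may differ in detail):

```python
from peft import LoraConfig

# LoRA hyperparameters as listed above; task_type matches causal LM fine-tuning.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```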
efficient-domain-adaptation/
├── data/
│ ├── sft.train.sample.jsonl # Sample training examples (public)
│ └── TRAINING_DATA_README.md # Data documentation
├── docs/
│ ├── karman-2025-small-data-domain-adaptation.pdf # Research paper
│ ├── figures/ # Paper figures (PNG + PDF)
│ ├── brie-0.5b/ # 0.5B model cards and evaluation
│ ├── brie-3b/ # 3B model card
│ └── general/ # Evaluation framework docs
├── exports/ # Evaluation results (JSONL)
├── runs/
│ ├── brie-v1-0.5b/ # Initial test run
│ ├── brie-v2-0.5b/ # Qwen 2.5 0.5B checkpoints
│ ├── brie-v2-3b/ # Qwen 2.5 3B (91.2% win rate)
│ └── brie-llama-3b/ # Llama 3.2 3B (80.4% win rate)
├── experimental/ # Experimental training scripts
├── scripts/ # RunPod and analysis scripts
├── train_brie_v2.py # Training script (0.5B)
├── train_brie_v2_3b.py # Training script (3B)
├── train_brie_llama_3b.py # Training script (Llama 3B)
├── judge_multi_provider.py # Multi-judge evaluation
├── compare_all_judges.py # Inter-judge agreement analysis
├── comprehensive_evaluation_suite.py # Full evaluation pipeline
├── summarize_judge_results.py # Results aggregation
└── .env.example # API key configuration
# Install dependencies
pip install openai anthropic google-generativeai torch transformers peft
# Configure API keys
cp .env.example .env
# Edit .env with your API keys
# Run multi-judge evaluation on comparison data
python judge_multi_provider.py exports/your_comparisons.jsonl --judge both
# Analyze inter-judge agreement
python compare_all_judges.py exports/
# Summarize results
python summarize_judge_results.py exports/your_results_judged.jsonl

Default settings (in test script):
- Max tokens: 512
- Temperature: 0.75
- Sampling: Enabled
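These defaults correspond to a standard `transformers` generation configuration. A sketch (illustrative, not the test script itself):

```python
from transformers import GenerationConfig

# Mirrors the defaults listed above.
gen_config = GenerationConfig(
    max_new_tokens=512,  # "Max tokens: 512"
    temperature=0.75,
    do_sample=True,      # "Sampling: Enabled"
)
```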
Brie performs best on:
- Philosophical discussions (especially continental philosophy)
- Creative brainstorming for artists and writers
- Conceptual exploration and analysis
- Methodology discussions for RLHF testing
Example:
Can you suggest some article ideas on the philosophy of AI?
- Apple M4 MacBook Pro
- 16GB unified memory
- MPS (Metal Performance Shaders) backend
- Epochs: 2 (completed successfully)
- Batch size: 2 per device
- Gradient accumulation: 4 steps
- Effective batch size: 8
- Learning rate: 2e-4 (linear decay with 20-step warmup)
- Evaluation: Every 50 steps
- Checkpointing: Every 100 steps
- Total steps: 290 (2 full epochs)
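The configuration above maps onto `transformers` `TrainingArguments` roughly as follows (a sketch; `output_dir` is illustrative, and the flag enabling step-based evaluation varies by `transformers` version):

```python
from transformers import TrainingArguments

# Hyperparameters as listed above.
# Effective batch size = 2 (per device) * 4 (accumulation steps) = 8.
args = TrainingArguments(
    output_dir="runs/brie-v2-0.5b",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=20,
    eval_steps=50,   # pair with step-based evaluation strategy for your version
    save_steps=100,
)
```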
- 0.5B model trained on Apple M4 MacBook (16GB RAM)
- 3B model trained on RunPod GPU
- 2nd epoch was critical: checkpoint-100 (1 epoch) showed minimal performance, checkpoint-290 (2 epochs) achieved 77% in-domain win rate
- Training to completion (2+ epochs) essential for domain expertise with small datasets
Brie v2 checkpoint-290 (recommended) contains:
- `adapter_model.safetensors` (4.1MB) - LoRA adapter weights
- `adapter_config.json` - LoRA configuration
- Full tokenizer files
- Training state and metrics
Total checkpoint size: ~19MB (no optimizer state - training complete)
Access via: runs/brie-v2-0.5b/checkpoint-290/
Note: checkpoint-100 contains optimizer state (8.3MB) for resuming training.
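Loading the adapter for inference follows the standard `peft` pattern. A sketch assuming the checkpoint path above (note this downloads the base model from the Hugging Face Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter from the checkpoint.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "runs/brie-v2-0.5b/checkpoint-290")
tokenizer = AutoTokenizer.from_pretrained("runs/brie-v2-0.5b/checkpoint-290")
```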
Comprehensive evaluation: 85+ blind A/B comparisons against baseline.
In-Domain Performance (77% win rate, n=13):
- Continental philosophy (phenomenology, existentialism, critical theory)
- Speculative and conceptual reframing
- Contemplative prose
- Philosophical argumentation
Out-of-Domain Performance (40% win rate, n=15):
- Math: 33%
- Practical tasks: 67%
- Creative writing: 67%
- Factual knowledge: 33%
- Coding: 0%
Comprehensive Multi-Domain (50% win rate, n=57)
Domain specialization without catastrophic forgetting: strong performance in target domains, maintained competence elsewhere.
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install torch transformers datasets peft trl
# Set environment variable (macOS Xet Storage bug fix)
export HF_HUB_DISABLE_XET=1

The model was trained on 1,213 examples authored by the researcher, drawn from years of philosophical discussions with LLMs. This method of generating training data achieved 77-91% win rates, demonstrating a reproducible approach for domain-specific fine-tuning.
The dataset covers:
- Continental philosophy discussions (phenomenology, existentialism, critical theory)
- Speculative and experimental thinking
- Conceptual work for artists and writers
- Theoretical brainstorming and reframing
- Contemplative and meditative prose
This same dataset was used across multiple architectures (Qwen 2.5 3B, Llama 3.2 3B, Qwen 2.5 0.5B) to test how this training methodology transfers between different base models.
Model weights and training code for personal/research use.
Base model (Qwen 2.5 0.5B Instruct) license: Apache 2.0