closestfriend/efficient-domain-adaptation

Efficient Domain Adaptation: LLM-Assisted Data Authoring

LoRA adapters trained on 1,213 examples created through iterative authoring with proactive sampling: the researcher actively directed and curated multi-turn LLM discussions to capture domain expertise and reasoning patterns. This human-directed data-authoring methodology achieved 77-91% win rates, demonstrating a reproducible approach to domain-specific fine-tuning.

The same authored dataset was tested across multiple architectures (Qwen 2.5 3B, Llama 3.2 3B, and Qwen 2.5 0.5B) to observe how it transfers across different base models.

Evaluation Results

Blind A/B testing against baseline models using multiple independent LLM judges. The same training data (1,213 authored examples) was tested across the different architectures.

Evaluation Criteria: Judges evaluated responses on five dimensions (1-5 scale):

  • Creativity & Originality
  • Coherence & Structure
  • Depth & Insight
  • Engagement & Interest
  • Writing Quality

Responses were presented in randomized order (A/B or B/A) to control for position bias. Judges selected a winner and provided reasoning for their choice.

Architecture Comparison

| Base Architecture | Win Rate | Judge | Sample Size |
|---|---|---|---|
| Qwen 2.5 3B | 91.2% | Multi-judge (4 judges) | n=57 |
| Llama 3.2 3B | 80.4% | Multi-judge (4 judges) | n=57 |
| Qwen 2.5 0.5B | 71.9% | Multi-judge (4 judges) | n=57 |

Observation: the authored training data transfers differently across architectures; Qwen 2.5 3B shows the strongest alignment (91.2%).

Brie v2 3B (Qwen 2.5) - Detailed Results

| Judge | Preference | Sample Size |
|---|---|---|
| Claude 3.5 Sonnet | 95.2% | n=42 |
| Claude Opus 4 | 78.9% | n=57 |
| GPT-4o | 93.0% | n=57 |
| Gemini 2.5 | 94.7% | n=57 |

Inter-judge agreement (GPT-4o ↔ Gemini): 91%

Brie Llama 3B - Detailed Results

| Judge | Preference | Sample Size |
|---|---|---|
| Claude Sonnet 4 | 73.8% | n=42 |
| Claude Opus 4 | 80.0% | n=15 |
| GPT-4o | 82.5% | n=57 |
| Gemini 2.5 Flash Lite | 84.2% | n=57 |

Overall win rate (multi-judge): 80.4%

Out-of-Domain Performance (Qwen 0.5B)

A 40% win rate on coding, math, and practical tasks: an expected trade-off for domain specialization.

Latest Frontier Model Comparisons (Preliminary)

Note: These are preliminary results with newer generation models. More thorough evaluations coming soon!

In-Domain Performance (Brie 3B on 57 comprehensive prompts):

| Model | Win Rate |
|---|---|
| Gemini 3 | 82.9% |
| Claude Haiku 4.5 | 80.7% |
| GPT-5 | 75.4%* |
| Claude Sonnet 4.5 | 71.9% |

*GPT-5 shows a ~97% raw win rate, but with many parsing failures; the 75.4% figure excludes those failures.

Out-of-Domain Performance (15 prompts):

| Model | Win Rate | Notes |
|---|---|---|
| Claude Sonnet 4.5 | 60.0% | |
| GPT-5 | 46.7% | |
| Gemini 3 | 40.0% | 1 tie |
| Claude Haiku 4.5 | 6.7% | 12 unknowns (parsing issues) |

Training Notes

  • Full 2-epoch training was essential: checkpoint-100 (1 epoch) showed only ~10% performance, while checkpoint-290 (2 epochs) achieved 72-83%
  • Blind A/B testing with randomized presentation order

Evaluation Methodology

Judge Prompt

Each judge model (GPT-4o, Claude, Gemini) received this identical structured prompt:

You are an expert evaluator of creative writing and philosophical prose. Compare these two responses to the same prompt.

Original Prompt: "{prompt}"

Response A:
{response_a}

Response B:
{response_b}

Evaluate both responses on these criteria (rate 1-5 for each, where 5 is excellent and 1 is poor):
1. Creativity & Originality
2. Coherence & Structure
3. Depth & Insight
4. Engagement & Interest
5. Writing Quality

Provide your evaluation in this EXACT format:
Response A - Creativity: X, Coherence: X, Depth: X, Engagement: X, Quality: X
Response B - Creativity: X, Coherence: X, Depth: X, Engagement: X, Quality: X
Winner: [A or B or Tie]
Reasoning: [2-3 sentences explaining your choice]

Be critical and honest. Consider whether responses are truly insightful or just verbose.
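Because judges are instructed to reply in this exact format, the verdict can be pulled out with a small parser. A minimal sketch, with an "unknown" bucket mirroring the parsing failures noted in the results above (the regex and function name are illustrative, not the repository's actual parsing code):

```python
import re

def parse_winner(judgment: str) -> str:
    """Extract the winner from a judge response in the required format.

    Returns "A", "B", "Tie", or "unknown" when the Winner line is
    missing or malformed.
    """
    match = re.search(r"^Winner:\s*\[?\s*(A|B|Tie)\s*\]?",
                      judgment, flags=re.MULTILINE | re.IGNORECASE)
    if not match:
        return "unknown"
    token = match.group(1).capitalize()
    return token if token == "Tie" else token.upper()
```

A response that omits or mangles the `Winner:` line falls into the "unknown" bucket rather than being silently counted as a win or loss.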

Blind Presentation

To control for position bias, responses were randomly assigned to the "Response A" or "Response B" position for each comparison. The assignment was tracked and reversed when computing final win rates, so each win was attributed to the correct underlying model.
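The randomized assignment and the later attribution of wins can be sketched as follows (function and field names here are illustrative, not the repository's actual code; ties and unknowns are kept in the denominator as an assumption):

```python
import random

def make_blind_pair(tuned_response: str, baseline_response: str, rng=random):
    """Randomly assign the two responses to the A/B slots, recording
    which slot holds the tuned model so wins can be attributed later."""
    tuned_is_a = rng.random() < 0.5
    if tuned_is_a:
        return {"response_a": tuned_response,
                "response_b": baseline_response,
                "tuned_slot": "A"}
    return {"response_a": baseline_response,
            "response_b": tuned_response,
            "tuned_slot": "B"}

def tuned_win_rate(judgments):
    """judgments: list of (winner, tuned_slot) pairs.

    A comparison counts as a win only when the judge's winner matches
    the slot the tuned model was assigned to.
    """
    wins = sum(1 for winner, slot in judgments if winner == slot)
    return wins / len(judgments)
```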

Multi-Judge Consensus

Using multiple independent judge models (GPT-4o, Claude Sonnet/Opus, Gemini) provides more robust evaluation than single-judge approaches. High inter-judge agreement (91% between GPT-4o and Gemini) indicates clear quality differentials rather than judge-specific preferences.
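Inter-judge agreement of the kind reported above (91% between GPT-4o and Gemini) is simply the fraction of paired comparisons on which two judges return the same verdict. A minimal sketch (names are illustrative; two matching "Tie" verdicts count as agreement here, which is an assumption):

```python
def inter_judge_agreement(verdicts_1, verdicts_2):
    """Fraction of paired comparisons where both judges chose the
    same winner. Inputs are equal-length lists of "A"/"B"/"Tie"."""
    if len(verdicts_1) != len(verdicts_2):
        raise ValueError("judges must score the same set of comparisons")
    same = sum(v1 == v2 for v1, v2 in zip(verdicts_1, verdicts_2))
    return same / len(verdicts_1)
```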

Version History

  • Brie v1 (runs/brie-v1-0.5b/): Initial 10-step test run
  • Brie v2 checkpoint-100 (runs/brie-v2-0.5b/checkpoint-100/): Mid-training (1 epoch, undertrained)
  • Brie v2 checkpoint-290 (runs/brie-v2-0.5b/checkpoint-290/): Full training (2 epochs, 290 steps)
  • Brie v2 3B (runs/brie-v2-3b/): Qwen 2.5 3B (91.2% win rate, trained on RunPod)
  • Brie Llama 3B (runs/brie-llama-3b/): Llama 3.2 3B (80.4% win rate, trained on RunPod)

Model Details

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct (618M parameters)
  • Training Method: LoRA (Low-Rank Adaptation)
  • Training Data: 1,213 original examples authored by the researcher
  • Validation Data: 60 examples
  • Training: 2 epochs (290 steps) on Apple M4 MacBook (16GB unified memory)
  • Current Version: Brie v2 checkpoint-290

LoRA Configuration

  • Rank (r): 8
  • Alpha: 16
  • Dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Task: Causal Language Modeling
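The configuration above corresponds to a peft `LoraConfig` along these lines (a sketch using the peft API; the exact object constructed in the training scripts may differ):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```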

Training Results

  • Final Training Loss: 1.4824 (checkpoint-290)
  • Validation Loss: 1.5031 (checkpoint-290)
  • Training Time: ~5 hours (2 epochs)
  • Adapter Size: 4.1MB

Directory Structure

efficient-domain-adaptation/
├── data/
│   ├── sft.train.sample.jsonl       # Sample training examples (public)
│   └── TRAINING_DATA_README.md      # Data documentation
├── docs/
│   ├── karman-2025-small-data-domain-adaptation.pdf  # Research paper
│   ├── figures/                     # Paper figures (PNG + PDF)
│   ├── brie-0.5b/                   # 0.5B model cards and evaluation
│   ├── brie-3b/                     # 3B model card
│   └── general/                     # Evaluation framework docs
├── exports/                         # Evaluation results (JSONL)
├── runs/
│   ├── brie-v1-0.5b/                # Initial test run
│   ├── brie-v2-0.5b/                # Qwen 2.5 0.5B checkpoints
│   ├── brie-v2-3b/                  # Qwen 2.5 3B (91.2% win rate)
│   └── brie-llama-3b/               # Llama 3.2 3B (80.4% win rate)
├── experimental/                    # Experimental training scripts
├── scripts/                         # RunPod and analysis scripts
├── train_brie_v2.py                 # Training script (0.5B)
├── train_brie_v2_3b.py              # Training script (3B)
├── train_brie_llama_3b.py           # Training script (Llama 3B)
├── judge_multi_provider.py          # Multi-judge evaluation
├── compare_all_judges.py            # Inter-judge agreement analysis
├── comprehensive_evaluation_suite.py # Full evaluation pipeline
├── summarize_judge_results.py       # Results aggregation
└── .env.example                     # API key configuration

Usage

Running Evaluation

# Install dependencies
pip install openai anthropic google-generativeai torch transformers peft

# Configure API keys
cp .env.example .env
# Edit .env with your API keys

# Run multi-judge evaluation on comparison data
python judge_multi_provider.py exports/your_comparisons.jsonl --judge both

# Analyze inter-judge agreement
python compare_all_judges.py exports/

# Summarize results
python summarize_judge_results.py exports/your_results_judged.jsonl

Generation Parameters

Default settings (in test script):

  • Max tokens: 512
  • Temperature: 0.75
  • Sampling: Enabled
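A temperature of 0.75 sharpens the softmax over logits relative to temperature 1.0: values below 1 concentrate probability mass on the top tokens. A self-contained illustration of temperature scaling, independent of the actual generation code:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by temperature.
    Lower temperature -> sharper distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At temperature 0.75 the top token's probability is higher than at 1.0 for the same logits, which trades some diversity for coherence.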

Example Prompts

Brie performs best on:

  • Philosophical discussions (especially continental philosophy)
  • Creative brainstorming for artists and writers
  • Conceptual exploration and analysis
  • Methodology discussions for RLHF testing

Example:

Can you suggest some article ideas on the philosophy of AI?

Training Details

Hardware

  • Apple M4 MacBook Pro
  • 16GB unified memory
  • MPS (Metal Performance Shaders) backend

Training Configuration

  • Epochs: 2 (completed successfully)
  • Batch size: 2 per device
  • Gradient accumulation: 4 steps
  • Effective batch size: 8
  • Learning rate: 2e-4 (linear decay with 20-step warmup)
  • Evaluation: Every 50 steps
  • Checkpointing: Every 100 steps
  • Total steps: 290 (2 full epochs)
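In Hugging Face transformers terms, the configuration above maps onto `TrainingArguments` roughly as follows (a sketch; argument names follow the transformers API, the output path is illustrative, and the actual training scripts may set additional options):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="runs/brie-v2-0.5b",    # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,     # effective batch size 2 * 4 = 8
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=20,
    evaluation_strategy="steps",
    eval_steps=50,
    save_steps=100,
)
```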

Training Notes

  • 0.5B model trained on Apple M4 MacBook (16GB RAM)
  • 3B model trained on RunPod GPU
  • 2nd epoch was critical: checkpoint-100 (1 epoch) showed minimal performance, checkpoint-290 (2 epochs) achieved 77% in-domain win rate
  • Training to completion (2+ epochs) essential for domain expertise with small datasets

Model Files

Brie v2 checkpoint-290 (recommended) contains:

  • adapter_model.safetensors (4.1MB) - LoRA adapter weights
  • adapter_config.json - LoRA configuration
  • Full tokenizer files
  • Training state and metrics

Total checkpoint size: ~19MB (no optimizer state; training is complete)

Access via: runs/brie-v2-0.5b/checkpoint-290/

Note: checkpoint-100 contains optimizer state (8.3MB) for resuming training.

Performance by Domain

Comprehensive evaluation: 85+ blind A/B comparisons against baseline.

In-Domain Performance (77% win rate, n=13):

  • Continental philosophy (phenomenology, existentialism, critical theory)
  • Speculative and conceptual reframing
  • Contemplative prose
  • Philosophical argumentation

Out-of-Domain Performance (40% win rate, n=15):

  • Math: 33%
  • Practical tasks: 67%
  • Creative writing: 67%
  • Factual knowledge: 33%
  • Coding: 0%

Comprehensive Multi-Domain (50% win rate, n=57)

Domain specialization without catastrophic forgetting: strong performance in target domains, maintained competence elsewhere.

Environment Setup

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install torch transformers datasets peft trl

# Set environment variable (macOS Xet Storage bug fix)
export HF_HUB_DISABLE_XET=1

Training Data

The model was trained on 1,213 examples authored by the researcher, drawn from years of philosophical discussions with LLMs. This method of generating training data achieved 77-91% win rates, demonstrating a reproducible approach for domain-specific fine-tuning.

The dataset covers:

  • Continental philosophy discussions (phenomenology, existentialism, critical theory)
  • Speculative and experimental thinking
  • Conceptual work for artists and writers
  • Theoretical brainstorming and reframing
  • Contemplative and meditative prose

This same dataset was used across multiple architectures (Qwen 2.5 3B, Llama 3.2 3B, Qwen 2.5 0.5B) to test how this training methodology transfers between different base models.

License

Model weights and training code are available for personal/research use.

Base model (Qwen 2.5 0.5B Instruct) license: Apache 2.0
