
# RuvLTRA Training Datasets

Complete guide to fine-tuning datasets for RuvLTRA models.

## Available Datasets

### 1. Claude Task Routing Dataset

**Purpose:** Train models to intelligently route tasks to Claude Flow agents and select optimal Claude models (Haiku/Sonnet/Opus).

**Location:** `crates/ruvllm/src/training/claude_dataset.rs`

**Size:** ~2,700 examples (configurable)

**Categories:**

- **Coder** (20%) - Code generation, debugging, refactoring
- **Researcher** (20%) - Analysis, exploration, documentation
- **Security** (20%) - Audit, vulnerability analysis
- **Architecture** (20%) - System design, planning
- **Reviewer** (20%) - Code review, quality assessment

**Quick Start:**

```bash
cargo run --example generate_claude_dataset --release
```

## Dataset Comparison

| Dataset                  | Examples | Categories | Quality | Use Case                      |
|--------------------------|----------|------------|---------|-------------------------------|
| Claude Task              | 2,700    | 5          | 0.87    | Task routing, model selection |
| (Future) Code Completion | TBD      | -          | -       | Code generation               |
| (Future) Security Audit  | TBD      | -          | -       | Vulnerability detection       |

## Dataset Format

All datasets use a consistent JSONL format. Each record is one JSON object per line; a single record is shown pretty-printed here:

```json
{
  "input": "Task description",
  "context": "Additional context",
  "output_agent": "target_agent",
  "metadata": {
    "category": "TaskCategory",
    "complexity": "ComplexityLevel",
    "domain": "DomainType",
    "expected_model": "haiku|sonnet|opus",
    "quality_score": 0.87,
    "tags": ["tag1", "tag2"]
  }
}
```
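
For type-safe loading in Rust, the records can be mirrored with `serde` types. This is a minimal sketch assuming the `serde` and `serde_json` crates; the struct names are illustrative, and the crate's own types in `ruvllm::training` may differ.

```rust
use serde::{Deserialize, Serialize};

// Illustrative mirror of one JSONL record; the actual types live in
// `ruvllm::training` and may be named differently.
#[derive(Debug, Serialize, Deserialize)]
struct TrainingExample {
    input: String,
    context: String,
    output_agent: String,
    metadata: ExampleMetadata,
}

#[derive(Debug, Serialize, Deserialize)]
struct ExampleMetadata {
    category: String,       // e.g. "Security"
    complexity: String,     // e.g. "Moderate"
    domain: String,
    expected_model: String, // "haiku" | "sonnet" | "opus"
    quality_score: f64,     // 0.0..=1.0
    tags: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Each JSONL line is one complete record.
    let line = r#"{"input":"Audit the login flow","context":"OAuth2 backend","output_agent":"security","metadata":{"category":"Security","complexity":"Complex","domain":"Backend","expected_model":"opus","quality_score":0.95,"tags":["audit","oauth"]}}"#;
    let example: TrainingExample = serde_json::from_str(line)?;
    println!("{} -> {}", example.input, example.output_agent);
    Ok(())
}
```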

## Data Splits

Standard splits for all datasets:

- Training: 70%
- Validation: 15%
- Test: 15%

Stratified sampling ensures balanced representation across categories.
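
In code, this uses the `dataset.split` call shown in the training-pipeline example below; a minimal sketch, assuming the returned split sets are plain collections:

```rust
use ruvllm::training::{DatasetConfig, DatasetGenerator};

let mut generator = DatasetGenerator::new(DatasetConfig::default());
let dataset = generator.generate();

// 70/15/15 split with a fixed seed for reproducibility; stratified sampling
// keeps each category's share roughly constant across the three sets.
let (train, val, test) = dataset.split(0.7, 0.15, 0.15, 42);
println!("train={} val={} test={}", train.len(), val.len(), test.len());
```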

## Quality Standards

All datasets follow quality guidelines.

**Quality Score Ranges:**

- 0.90-1.00: Excellent (security, critical tasks)
- 0.85-0.90: Good (architecture, complex code)
- 0.80-0.85: Adequate (research, reviews)

**Minimum Standards:**

- Input clarity: must be unambiguous
- Context completeness: all necessary details present
- Output correctness: verified agent/model selection
- Metadata accuracy: properly labeled
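
These floors are easy to enforce programmatically. A minimal sketch, reusing the filtering pattern from Usage Patterns below and assuming `quality_score` is exposed as a float on `metadata`:

```rust
// Keep only examples at or above the "Adequate" floor (0.80).
let usable: Vec<_> = dataset.examples
    .iter()
    .filter(|e| e.metadata.quality_score >= 0.80)
    .collect();
```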

## Generation Pipeline

```text
1. Template Definition
   ↓
   Hand-crafted task templates
   ↓
   Quality review (0.90+ for seeds)

2. Base Generation
   ↓
   Fill templates with variations
   ↓
   Validate quality/correctness

3. Augmentation (optional)
   ↓
   Paraphrasing
   ↓
   Complexity variations
   ↓
   Domain transfer
   ↓
   Filter invalid examples

4. Export
   ↓
   JSONL, JSON, Parquet
   ↓
   Statistics and analysis
```

## Usage Patterns

### Generate Default Dataset

```rust
use ruvllm::training::{DatasetGenerator, DatasetConfig};

let config = DatasetConfig::default();
let mut generator = DatasetGenerator::new(config);
let dataset = generator.generate();

dataset.export_jsonl("training.jsonl")?;
```

### Custom Configuration

```rust
let config = DatasetConfig {
    examples_per_category: 200,
    enable_augmentation: true,
    augmentation: AugmentationConfig {
        paraphrases_per_example: 3,
        complexity_variations: 2,
        enable_domain_transfer: true,
    },
    seed: 42,
};
```

### Filter by Category

```rust
let security_tasks: Vec<_> = dataset.examples
    .iter()
    .filter(|e| e.metadata.category == TaskCategory::Security)
    .collect();
```

### Filter by Complexity

```rust
let simple_tasks: Vec<_> = dataset.examples
    .iter()
    .filter(|e| e.metadata.complexity == ComplexityLevel::Simple)
    .collect();
```

## Integration with RuvLTRA

### Training Pipeline

```rust
use ruvllm::training::DatasetGenerator;
use ruvllm::SonaLlm;

// 1. Generate dataset
let dataset = DatasetGenerator::new(config).generate();

// 2. Split into train/validation/test (70/15/15, seeded)
let (train, val, test) = dataset.split(0.7, 0.15, 0.15, 42);

// 3. Train the model (`encode_target` maps an agent name to a class
//    target; see the sketch below)
let mut model = SonaLlm::new(config)?;
for example in train {
    let features = model.extract_features(&example.input)?;
    let target = encode_target(&example.output_agent);
    model.train(features, target)?;
}

// 4. Validate (`evaluate_model` is application-defined; see below)
let accuracy = evaluate_model(&model, &val)?;
println!("Validation accuracy: {:.2}%", accuracy * 100.0);
```
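
`encode_target` and `evaluate_model` above are application-defined. A hedged sketch of what they might look like; `classify_agent` is a placeholder for whatever inference entry point `SonaLlm` actually exposes:

```rust
/// Map the five agent categories to class indices for the 5-way head.
fn encode_target(agent: &str) -> usize {
    match agent {
        "coder" => 0,
        "researcher" => 1,
        "security" => 2,
        "architecture" => 3,
        "reviewer" => 4,
        other => panic!("unknown agent: {other}"),
    }
}

/// Fraction of validation examples whose predicted agent matches the label.
fn evaluate_model(
    model: &SonaLlm,
    examples: &[TrainingExample],
) -> Result<f64, Box<dyn std::error::Error>> {
    let mut correct = 0usize;
    for example in examples {
        let features = model.extract_features(&example.input)?;
        // `classify_agent` is hypothetical; substitute the real inference call.
        let predicted = model.classify_agent(&features)?;
        if predicted == encode_target(&example.output_agent) {
            correct += 1;
        }
    }
    Ok(correct as f64 / examples.len() as f64)
}
```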

### Model Heads

**1. Task Embedding:**

- Input: task description + context
- Output: 768-dim semantic vector

**2. Agent Classification:**

- Input: task embedding
- Output: 5-way softmax (agent types)

**3. Model Selection:**

- Input: task embedding + complexity
- Output: 3-way softmax (Haiku/Sonnet/Opus)

**4. Quality Prediction:**

- Input: task embedding
- Output: quality score (0-1)
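
Put together, a routing decision bundles the four head outputs. The type names in this sketch are illustrative, not the crate's actual API:

```rust
// Illustrative output types for the four heads.
enum AgentKind { Coder, Researcher, Security, Architecture, Reviewer }
enum ClaudeModel { Haiku, Sonnet, Opus }

struct RoutingDecision {
    embedding: Vec<f32>, // head 1: 768-dim task embedding
    agent: AgentKind,    // head 2: argmax of the 5-way softmax
    model: ClaudeModel,  // head 3: argmax of the 3-way softmax
    quality: f32,        // head 4: predicted quality score in 0.0..=1.0
}
```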

## Performance Metrics

### Generation Performance

- Speed: ~7,000 examples/second
- Memory: ~200 MB for 2,700 examples
- Disk: ~10 MB JSONL for 2,700 examples

### Training Performance

- Accuracy: 95%+ for agent classification
- Cost savings: 50%+ with model selection
- Latency: <10 ms per routing decision

## Best Practices

### 1. Dataset Size

- Minimum: 1,000 examples total (200 per category)
- Recommended: 2,500-5,000 examples
- Maximum: 10,000+ for production

### 2. Quality Over Quantity

- Prefer fewer high-quality examples (0.90+)
- Review augmented examples for correctness
- Filter low-quality generations

### 3. Balanced Representation

- Equal distribution across categories
- Mix of complexity levels (~33% Simple, ~40% Moderate, ~27% Complex); a quick check is sketched below
- Diverse domain coverage
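
The complexity mix can be spot-checked after generation. A minimal sketch, assuming `ComplexityLevel` derives `Debug`:

```rust
use std::collections::HashMap;

// Count examples per complexity level and report the mix.
let mut counts: HashMap<String, usize> = HashMap::new();
for example in &dataset.examples {
    *counts.entry(format!("{:?}", example.metadata.complexity)).or_default() += 1;
}
let total = dataset.examples.len() as f64;
for (level, n) in &counts {
    println!("{level}: {:.1}%", 100.0 * *n as f64 / total);
}
```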

### 4. Regular Updates

- Add new task patterns as they emerge
- Update templates based on user feedback
- Retrain models quarterly

### 5. Validation

- Hold out 15% for validation
- Monitor accuracy on the validation set
- A/B test routing decisions

## Common Issues

### Issue: Low Quality Scores

**Solution:** Disable augmentation or review the templates.

```rust
let config = DatasetConfig {
    enable_augmentation: false,
    ..Default::default()
};
```

### Issue: Imbalanced Categories

**Solution:** Adjust the number of examples per category.

```rust
let config = DatasetConfig {
    examples_per_category: 500,  // increase for balance
    ..Default::default()
};
```

### Issue: Too Much Variation

**Solution:** Reduce the augmentation rates.

```rust
augmentation: AugmentationConfig {
    paraphrases_per_example: 1,
    complexity_variations: 1,
    enable_domain_transfer: false,
},
```

## Roadmap

### Short Term (Q1 2024)

- Parquet export format
- Custom template loading
- Multi-language support
- HuggingFace Datasets integration

### Medium Term (Q2-Q3 2024)

- Code completion dataset
- Security audit dataset
- Multi-turn conversation dataset
- Active learning integration

### Long Term (Q4 2024+)

- Few-shot learning examples
- Code execution feedback
- Self-improvement trajectories
- Cross-lingual transfer


## License

All datasets are licensed under MIT OR Apache-2.0, same as RuvLTRA.