VellumForge2

High-performance synthetic dataset generator for LLM training. Generates DPO, SFT, KTO, and MO-DPO datasets using any OpenAI-compatible API, with a hierarchical generation pipeline, checkpoint/resume support, and optional LLM-as-Judge evaluation.

./bin/vellumforge2 run --config config.toml

Features

Multiple Dataset Formats

  • SFT - Simple instruction-output pairs for supervised fine-tuning
  • DPO - Standard preference pairs (prompt, chosen, rejected) compatible with HuggingFace TRL
  • KTO - Unpaired preferences with binary labels compatible with HuggingFace TRL KTOTrainer
  • MO-DPO - Full multi-objective DPO with detailed judge scoring for reward modeling

High Performance

  • Concurrent worker pool scaling to 1024 or more parallel requests
  • Provider-level and per-model rate limiting with configurable burst capacity
  • Checkpoint/resume for interrupted sessions
  • Asynchronous judge evaluation (non-blocking)
  • Smart over-generation strategy achieving 95%+ count accuracy
  • Robust 4-strategy JSON parsing with 99%+ success rate

Provider Agnostic

Works with any OpenAI-compatible API: OpenAI, NVIDIA NIM, Anthropic, Together AI, llama.cpp, Ollama, LM Studio, kobold.cpp, vLLM, and more.
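
For example, pointing the rejected model at a local Ollama server is just an endpoint change (11434 is Ollama's default port; the model tag below is only an illustration):

[models.rejected]
base_url = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint
model_name = "llama3.1:8b"              # any model you have pulled locally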

Configurable Pipeline

  • Hierarchical generation: Main topic → Subtopics → Prompts → Preference pairs
  • Custom prompt templates at every stage
  • Optional LLM-as-Judge quality filtering (40-60% token savings vs full evaluation)
  • Flexible rate limiting strategies

Hugging Face Integration

One-command dataset uploads with automatic repository creation using the native NDJSON commit API (no external dependencies such as the HF CLI required).
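
For example (the same flags are covered under CLI Commands below):

./bin/vellumforge2 run --config config.toml --upload-to-hf --hf-repo-id username/my-dataset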

Installation

Prebuilt Binaries

Download from the releases page for Linux, macOS (x86_64/ARM64), and Windows.

From Source

git clone https://github.com/lemon07r/vellumforge2.git
cd vellumforge2
make build
# Binary at ./bin/vellumforge2

Quick Start

# 1. Copy configuration template
cp configs/config.example.toml config.toml
cp configs/.env.example .env

# 2. Edit .env with your API keys
# NVIDIA_API_KEY=nvapi-your-key
# OPENAI_API_KEY=sk-your-key

# 3. Edit config.toml with your settings
# Choose dataset_mode, configure models, customize prompts

# 4. Generate dataset
./bin/vellumforge2 run --config config.toml

# 5. Results in output/session_YYYY-MM-DDTHH-MM-SS/dataset.jsonl

See GETTING_STARTED.md for a step-by-step tutorial.

Configuration

Minimal configuration:

[generation]
main_topic = "Fantasy Fiction"
num_subtopics = 64
num_prompts_per_subtopic = 2
concurrency = 64
dataset_mode = "dpo"  # Options: sft, dpo, kto, mo-dpo

[models.main]
base_url = "https://integrate.api.nvidia.com/v1"
model_name = "moonshotai/kimi-k2-instruct-0905"
temperature = 0.6
max_output_tokens = 8192
rate_limit_per_minute = 40

[models.rejected]  # Required for DPO/KTO/MO-DPO
base_url = "http://localhost:8080/v1" # Default URL for llama.cpp local server, but you can use any api of choice
model_name = "phi-4-mini-instruct"
temperature = 0.0
max_output_tokens = 4096

[prompt_templates]
chosen_generation = "Write a compelling story (400-600 words): {{.Prompt}}"
rejected_generation = "Write a simple story (200-300 words): {{.Prompt}}"

Complete configuration reference in configs/config.example.toml.
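
The {{.Prompt}} placeholder in each template is filled with a generated prompt before the request is sent. With the chosen_generation template above and a hypothetical generated prompt of "A dragon learns to read", the main model would receive:

Write a compelling story (400-600 words): A dragon learns to read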

Dataset Modes

Mode Selection

Mode    | Output Format               | Models Required          | HuggingFace TRL | Use Case
sft     | instruction → output        | Main only                | SFTTrainer      | Basic fine-tuning
dpo     | prompt, chosen, rejected    | Main + Rejected          | DPOTrainer      | Preference optimization
kto     | prompt, completion, label   | Main + Rejected          | KTOTrainer      | Unpaired preferences
mo-dpo  | Full schema + judge scores  | Main + Rejected + Judge  | Custom          | Multi-objective training

Example Outputs

SFT Format:

{
  "instruction": "Write a fantasy story about dragons",
  "output": "In the mountains of Eldoria..."
}

DPO Format:

{
  "prompt": "Write a fantasy story about dragons",
  "chosen": "In the ancient mountains of Eldoria, where mist...",
  "rejected": "There was a dragon. It was big..."
}

KTO Format (2 rows per pair):

{"prompt": "Write about dragons", "completion": "Good story...", "label": true}
{"prompt": "Write about dragons", "completion": "Bad story...", "label": false}

MO-DPO Format:

{
  "prompt": "Write a fantasy story about dragons",
  "chosen": "In the mountains...",
  "rejected": "There was a dragon...",
  "chosen_scores": {
    "plot_quality": {"score": 5, "reasoning": "Excellent narrative..."},
    "creativity": {"score": 4, "reasoning": "Fresh perspective..."}
  },
  "rejected_scores": {
    "plot_quality": {"score": 2, "reasoning": "Minimal development..."},
    "creativity": {"score": 2, "reasoning": "Generic treatment..."}
  },
  "chosen_score_total": 4.5,
  "rejected_score_total": 2.0,
  "preference_margin": 2.5
}
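
In this example, preference_margin is the difference between the two score totals (4.5 - 2.0 = 2.5).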

See DATASET_MODES.md for detailed format specifications and configuration examples.

Optional Judge Filtering

Available for SFT, DPO, and KTO modes; MO-DPO always includes full judge evaluation.

[judge_filtering]
enabled = true
use_explanations = false  # Scores only = 40-60% token savings
min_chosen_score = 4.0    # Keep chosen responses >= 4.0
max_rejected_score = 3.0  # Keep rejected responses <= 3.0

[models.judge]
enabled = true
base_url = "https://integrate.api.nvidia.com/v1"
model_name = "meta/llama-3.1-70b-instruct"
temperature = 0.4
max_output_tokens = 2048

Filters responses by quality score before they are written to the dataset. Use this when API budget is limited or training time is expensive.
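For example, with the thresholds above a chosen response scoring 4.5 passes while one scoring 3.8 is discarded, and a rejected response scoring 2.5 passes while one scoring 3.5 is discarded.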

Rate Limiting

Provider-Level Limits

Global rate limits shared across all models from the same provider:

[provider_rate_limits]
nvidia = 40  # All NVIDIA models share this 40 RPM limit
provider_burst_percent = 15  # 15% burst capacity (default)

Provider limits take precedence over per-model limits and prevent 429 errors when multiple models share one API endpoint.
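
For example, if both the main and judge models call the NVIDIA endpoint, they draw from one shared 40 RPM pool regardless of their individual settings (values below are illustrative):

[provider_rate_limits]
nvidia = 40  # shared pool for every model on the NVIDIA endpoint

[models.main]
base_url = "https://integrate.api.nvidia.com/v1"
rate_limit_per_minute = 60  # superseded by the shared nvidia limit above

[models.judge]
base_url = "https://integrate.api.nvidia.com/v1"
rate_limit_per_minute = 60  # draws from the same shared pool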

Per-Model Limits

Individual model rate limiting:

[models.main]
rate_limit_per_minute = 40  # Overridden by provider_rate_limits if set

Optimization

Recommended configuration for high throughput:

[generation]
concurrency = 128  # Or 256 for higher throughput; test with the benchmark scripts

[provider_rate_limits]
nvidia = 40
provider_burst_percent = 20  # Higher burst for better throughput

Conservative configuration for avoiding rate limits:

[generation]
concurrency = 32

[provider_rate_limits]
nvidia = 30
provider_burst_percent = 10  # Lower burst for fewer 429 errors

See BENCHMARK_README.md for a guide to the included benchmark scripts.

Checkpoint & Resume

Enable automatic checkpointing:

[generation]
enable_checkpointing = true
checkpoint_interval = 10  # Save every 10 completed jobs

Resume interrupted session:

# List available checkpoints
./bin/vellumforge2 checkpoint list

# Inspect checkpoint
./bin/vellumforge2 checkpoint inspect session_2025-11-05T12-34-56

# Resume generation
./bin/vellumforge2 checkpoint resume session_2025-11-05T12-34-56

Graceful shutdown with Ctrl+C automatically saves a checkpoint.

CLI Commands

Generate Dataset

# Basic generation
./bin/vellumforge2 run --config config.toml

# With verbose logging
./bin/vellumforge2 run --config config.toml --verbose

# Upload to Hugging Face
./bin/vellumforge2 run --config config.toml \
  --upload-to-hf --hf-repo-id username/my-dataset  # --hf-repo-id is optional if set in the config file

Checkpoint Management

# List checkpoints
./bin/vellumforge2 checkpoint list

# Inspect checkpoint
./bin/vellumforge2 checkpoint inspect <session-dir>

# Resume from checkpoint
./bin/vellumforge2 checkpoint resume <session-dir>

Other

# Show version
./bin/vellumforge2 --version

# Show help
./bin/vellumforge2 --help

Output Structure

output/
└── session_2025-11-05T12-34-56/
    ├── dataset.jsonl       # Training dataset
    ├── config.toml.bak     # Configuration snapshot
    ├── checkpoint.json     # Resume state (if checkpointing enabled)
    └── session.log         # Structured JSON logs
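
To sanity-check a run, peek at the first record (assumes jq is installed; substitute your own session directory):

head -n 1 output/session_2025-11-05T12-34-56/dataset.jsonl | jq .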

Example Datasets

Generated with VellumForge2 using Kimi K2 0905 + Phi-4 Instruct:

  • VellumK2-Fantasy-DPO-Tiny-01: 126 rows - Testing and validation
  • VellumK2-Fantasy-DPO-Small-01: 1,038 rows - Light training and experiments
  • VellumK2-Fantasy-DPO-Medium-01: 3,069 rows - Combination training component
  • VellumK2-Fantasy-DPO-Large-01: 10,222 rows - Large-scale training
  • VellumK2-Unfettered-DPO-01: 2,576 rows - Decensoring dataset to reduce refusals on sensitive content

View all example datasets

Troubleshooting

Rate Limit Errors (429)

Reduce concurrency and global limits:

[provider_rate_limits]
nvidia = 30

[generation]
concurrency = 32

Getting Fewer Records Than Expected

Increase the over-generation buffer:

[generation]
over_generation_buffer = 0.25  # Increase from default 0.15
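
With a buffer of 0.25, roughly 25% more items are requested than the target count, giving parsing failures and filtered rows more room before the final dataset comes up short.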

JSON Parsing Errors

VellumForge2 has 4 fallback parsing strategies achieving a 99%+ success rate. If issues persist:

  • Use a lower temperature (< 0.9)
  • Ensure templates explicitly request JSON format
  • Set structure_temperature lower than temperature for JSON generation
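
A sketch of the last point, assuming structure_temperature sits alongside temperature in the model block (check configs/config.example.toml for the exact key placement):

[models.main]
temperature = 0.8            # free-form generation
structure_temperature = 0.2  # lower temperature for JSON-structured calls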

Out of Memory

Reduce concurrency:

[generation]
concurrency = 32  # Or lower

Connection Refused for Rejected Model

Start the local server, or point the rejected model at a hosted API instead:

[models.rejected]
base_url = "https://integrate.api.nvidia.com/v1"
model_name = "meta/llama-3.1-8b-instruct"  # Smaller model

See GETTING_STARTED.md for more troubleshooting.

Documentation

  • GETTING_STARTED.md - step-by-step tutorial and additional troubleshooting
  • DATASET_MODES.md - dataset format specifications and configuration examples
  • BENCHMARK_README.md - benchmarking guide and scripts
  • configs/config.example.toml - complete configuration reference

System Requirements

  • Go 1.21+ (for building from source)
  • Memory: 512MB minimum, 2GB+ recommended
  • Network: Stable internet for API calls
  • Disk: ~10MB per 1000 records
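
At that rate, a 10,000-record dataset needs on the order of 100MB of disk.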

Contributing

Contributions welcome:

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with tests
  4. Run make lint && make test
  5. Submit a pull request

License

MIT License - see LICENSE file.

Citation

@software{vellumforge2,
  title = {VellumForge2: Synthetic Dataset Generator for LLM Training},
  author = {Lamim},
  year = {2025},
  url = {https://github.com/lemon07r/vellumforge2},
  version = {1.5.3}
}

Acknowledgments

Support


Current Version: v1.5.3 | Last Updated: 2025-11-06
