High-performance synthetic dataset generator for LLM training. Generates DPO, SFT, KTO, and MO-DPO datasets using any OpenAI-compatible API, with a hierarchical generation pipeline, checkpoint/resume support, and optional LLM-as-Judge evaluation.
```bash
./bin/vellumforge2 run --config config.toml
```

- SFT - Simple instruction-output pairs for supervised fine-tuning
- DPO - Standard preference pairs (prompt, chosen, rejected) compatible with HuggingFace TRL
- KTO - Unpaired preferences with binary labels compatible with HuggingFace TRL KTOTrainer
- MO-DPO - Full multi-objective DPO with detailed judge scoring for reward modeling
- Concurrent worker pool supporting 1024+ parallel requests
- Provider-level and per-model rate limiting with configurable burst capacity
- Checkpoint/resume for interrupted sessions
- Asynchronous judge evaluation (non-blocking)
- Smart over-generation strategy achieving 95%+ count accuracy
- Robust 4-strategy JSON parsing with 99%+ success rate
Works with any OpenAI-compatible API: OpenAI, NVIDIA NIM, Anthropic, Together AI, llama.cpp, Ollama, LM Studio, kobold.cpp, vLLM, and more.
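Before pointing VellumForge2 at an endpoint, a quick way to confirm it speaks the OpenAI chat-completions protocol is a minimal probe. This sketch uses the `openai` Python client (not a VellumForge2 dependency); the `base_url`, `api_key`, and `model` values are placeholders for your own endpoint:

```python
# Minimal probe of an OpenAI-compatible endpoint; all values are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="phi-4-mini-instruct",
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)  # any reply means the endpoint is reachable
```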
- Hierarchical generation: Main topic → Subtopics → Prompts → Preference pairs
- Custom prompt templates at every stage
- Optional LLM-as-Judge quality filtering (40-60% token savings vs full evaluation)
- Flexible rate limiting strategies
One-command dataset uploads with automatic repository creation using the native NDJSON commit API (no external dependencies like the HF CLI required).
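To confirm an upload landed, one optional check from Python uses `huggingface_hub` (again, not a VellumForge2 dependency; the repo id below is a placeholder):

```python
# Optional post-upload check; huggingface_hub is not required by VellumForge2.
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment for private repos
info = api.dataset_info("username/my-dataset")  # placeholder repo id
print(info.id, [f.rfilename for f in info.siblings])  # expect dataset.jsonl listed
```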
Download from the releases page for Linux, macOS (x86_64/ARM64), and Windows.
```bash
git clone https://github.com/lemon07r/vellumforge2.git
cd vellumforge2
make build
# Binary at ./bin/vellumforge2
```

```bash
# 1. Copy configuration template
cp configs/config.example.toml config.toml
cp configs/.env.example .env
# 2. Edit .env with your API keys
# NVIDIA_API_KEY=nvapi-your-key
# OPENAI_API_KEY=sk-your-key
# 3. Edit config.toml with your settings
# Choose dataset_mode, configure models, customize prompts
# 4. Generate dataset
./bin/vellumforge2 run --config config.toml
# 5. Results in output/session_YYYY-MM-DDTHH-MM-SS/dataset.jsonl
```

See GETTING_STARTED.md for a step-by-step tutorial.
Minimal configuration:
```toml
[generation]
main_topic = "Fantasy Fiction"
num_subtopics = 64
num_prompts_per_subtopic = 2
concurrency = 64
dataset_mode = "dpo" # Options: sft, dpo, kto, mo-dpo
[models.main]
base_url = "https://integrate.api.nvidia.com/v1"
model_name = "moonshotai/kimi-k2-instruct-0905"
temperature = 0.6
max_output_tokens = 8192
rate_limit_per_minute = 40
[models.rejected] # Required for DPO/KTO/MO-DPO
base_url = "http://localhost:8080/v1" # Default URL for llama.cpp local server, but you can use any api of choice
model_name = "phi-4-mini-instruct"
temperature = 0.0
max_output_tokens = 4096
[prompt_templates]
chosen_generation = "Write a compelling story (400-600 words): {{.Prompt}}"
rejected_generation = "Write a simple story (200-300 words): {{.Prompt}}"
```

The complete configuration reference is in configs/config.example.toml.
| Mode | Output Format | Models Required | HuggingFace TRL | Use Case |
|---|---|---|---|---|
| sft | instruction → output | Main only | SFTTrainer | Basic fine-tuning |
| dpo | prompt, chosen, rejected | Main + Rejected | DPOTrainer | Preference optimization |
| kto | prompt, completion, label | Main + Rejected | KTOTrainer | Unpaired preferences |
| mo-dpo | Full schema + judge scores | Main + Rejected + Judge | Custom | Multi-objective training |
SFT Format:
```json
{
"instruction": "Write a fantasy story about dragons",
"output": "In the mountains of Eldoria..."
}
```

DPO Format:

```json
{
"prompt": "Write a fantasy story about dragons",
"chosen": "In the ancient mountains of Eldoria, where mist...",
"rejected": "There was a dragon. It was big..."
}
```

KTO Format (2 rows per pair):

```json
{"prompt": "Write about dragons", "completion": "Good story...", "label": true}
{"prompt": "Write about dragons", "completion": "Bad story...", "label": false}MO-DPO Format:
{
"prompt": "Write a fantasy story about dragons",
"chosen": "In the mountains...",
"rejected": "There was a dragon...",
"chosen_scores": {
"plot_quality": {"score": 5, "reasoning": "Excellent narrative..."},
"creativity": {"score": 4, "reasoning": "Fresh perspective..."}
},
"rejected_scores": {
"plot_quality": {"score": 2, "reasoning": "Minimal development..."},
"creativity": {"score": 2, "reasoning": "Generic treatment..."}
},
"chosen_score_total": 4.5,
"rejected_score_total": 2.0,
"preference_margin": 2.5
}
```

See DATASET_MODES.md for detailed format specifications and configuration examples.
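Because the DPO output matches TRL's expected (prompt, chosen, rejected) columns, a generated file can be loaded directly into DPOTrainer. A minimal sketch, assuming a recent TRL version; the base model and hyperparameters here are arbitrary placeholders, not recommendations:

```python
# Minimal sketch: train on a VellumForge2 DPO dataset with HuggingFace TRL.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

dataset = load_dataset(
    "json",
    data_files="output/session_YYYY-MM-DDTHH-MM-SS/dataset.jsonl",
    split="train",
)  # columns: prompt, chosen, rejected

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", per_device_train_batch_size=1),
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()
```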
Available for the SFT, DPO, and KTO modes; MO-DPO always includes full judge evaluation.
```toml
[judge_filtering]
enabled = true
use_explanations = false # Scores only = 40-60% token savings
min_chosen_score = 4.0 # Keep chosen responses >= 4.0
max_rejected_score = 3.0 # Keep rejected responses <= 3.0
[models.judge]
enabled = true
base_url = "https://integrate.api.nvidia.com/v1"
model_name = "meta/llama-3.1-70b-instruct"
temperature = 0.4
max_output_tokens = 2048
```

Judge filtering scores responses and drops low-quality ones before they are written to the dataset. Use it when API budget is limited or training time is expensive.
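The thresholds translate to a simple keep/drop rule per record. An illustrative sketch, not the tool's internal code; the aggregate scores follow the totals shown in the MO-DPO schema above:

```python
# Illustration of the documented keep/drop rule; not VellumForge2's internals.
def keep_pair(chosen_total: float, rejected_total: float,
              min_chosen: float = 4.0, max_rejected: float = 3.0) -> bool:
    """Keep a pair only if the judge rated the chosen response highly
    and the rejected response poorly."""
    return chosen_total >= min_chosen and rejected_total <= max_rejected

print(keep_pair(4.5, 2.0))  # True  -> written to dataset.jsonl
print(keep_pair(3.5, 2.0))  # False -> dropped (chosen below min_chosen_score)
```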
Global rate limits shared across all models from the same provider:
```toml
[provider_rate_limits]
nvidia = 40 # All NVIDIA models share this 40 RPM limit
provider_burst_percent = 15 # 15% burst capacity (default)
```

Provider limits take precedence over per-model limits and prevent 429 errors when multiple models share one API endpoint.
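Conceptually, an RPM limit with burst capacity behaves like a token bucket: the refill rate enforces the steady limit, and the bucket size allows short bursts. A toy sketch of the idea (the capacity formula is an assumption for illustration, not VellumForge2's actual implementation):

```python
# Toy token bucket: steady rate = rpm/60 requests per second, plus a small
# burst headroom. Illustrates the concept only, not the tool's internals.
import time

class TokenBucket:
    def __init__(self, rpm: float, burst_percent: float):
        self.rate = rpm / 60.0                                 # tokens per second
        self.capacity = max(1.0, rpm * burst_percent / 100.0)  # burst headroom
        self.tokens = self.capacity
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until one request token is available."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            time.sleep((1.0 - self.tokens) / self.rate)

bucket = TokenBucket(rpm=40, burst_percent=15)  # ~0.67 req/s, 6-token burst
bucket.acquire()  # call before each API request
```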
Individual model rate limiting:
```toml
[models.main]
rate_limit_per_minute = 40 # Overridden by provider_rate_limits if set
```

Recommended configuration for high throughput:

```toml
[generation]
concurrency = 128 # Or 256 for high throughput; test with the benchmark scripts first
[provider_rate_limits]
nvidia = 40
provider_burst_percent = 20 # Higher burst for better throughput
```

Conservative configuration for avoiding rate limits:

```toml
[generation]
concurrency = 32
[provider_rate_limits]
nvidia = 30
provider_burst_percent = 10 # Lower burst for fewer 429 errors
```

See BENCHMARK_README.md for a benchmarking guide using the included benchmark scripts.
Enable automatic checkpointing:
```toml
[generation]
enable_checkpointing = true
checkpoint_interval = 10 # Save every 10 completed jobs
```

Resume an interrupted session:

```bash
# List available checkpoints
./bin/vellumforge2 checkpoint list
# Inspect checkpoint
./bin/vellumforge2 checkpoint inspect session_2025-11-05T12-34-56
# Resume generation
./bin/vellumforge2 checkpoint resume session_2025-11-05T12-34-56
```

A graceful shutdown with Ctrl+C automatically saves a checkpoint.
```bash
# Basic generation
./bin/vellumforge2 run --config config.toml
# With verbose logging
./bin/vellumforge2 run --config config.toml --verbose
# Upload to Hugging Face
./bin/vellumforge2 run --config config.toml \
  --upload-to-hf --hf-repo-id username/my-dataset # --hf-repo-id not required if set in config file

# List checkpoints
./bin/vellumforge2 checkpoint list
# Inspect checkpoint
./bin/vellumforge2 checkpoint inspect <session-dir>
# Resume from checkpoint
./bin/vellumforge2 checkpoint resume <session-dir>

# Show version
./bin/vellumforge2 --version
# Show help
./bin/vellumforge2 --help
```

```text
output/
└── session_2025-11-05T12-34-56/
├── dataset.jsonl # Training dataset
├── config.toml.bak # Configuration snapshot
├── checkpoint.json # Resume state (if checkpointing enabled)
└── session.log # Structured JSON logs
```
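A quick way to sanity-check completed sessions, assuming the layout above (a hypothetical helper script, not part of the tool):

```python
# Count rows in each session's dataset.jsonl; assumes the output layout above.
from pathlib import Path

for ds in sorted(Path("output").glob("session_*/dataset.jsonl")):
    rows = sum(1 for line in ds.open(encoding="utf-8") if line.strip())
    print(f"{ds.parent.name}: {rows} rows")
```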
Example datasets generated with VellumForge2 using Kimi K2 0905 + Phi-4 Instruct:
- VellumK2-Fantasy-DPO-Tiny-01: 126 rows - Testing and validation
- VellumK2-Fantasy-DPO-Small-01: 1,038 rows - Light training and experiments
- VellumK2-Fantasy-DPO-Medium-01: 3,069 rows - Combination training component
- VellumK2-Fantasy-DPO-Large-01: 10,222 rows - Large-scale training
- VellumK2-Unfettered-DPO-01: 2,576 rows - Decensoring dataset to reduce refusals on sensitive content
Reduce concurrency and global limits:
```toml
[provider_rate_limits]
nvidia = 30
[generation]
concurrency = 32
```

Increase the over-generation buffer:

```toml
[generation]
over_generation_buffer = 0.25 # Increase from the default 0.15
```

VellumForge2 has four fallback parsing strategies achieving a 99%+ success rate (a toy sketch of the idea follows the tips below). If issues persist:
- Use a lower temperature (< 0.9)
- Ensure templates explicitly request JSON format
- Set `structure_temperature` lower than `temperature` for JSON generation
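For intuition, here is a toy cascade in the spirit of multi-strategy parsing. These fallbacks are illustrative only; they are not the tool's actual four strategies:

```python
# Toy fallback cascade for extracting JSON from LLM output; illustrative only.
import json
import re

def parse_llm_json(text: str):
    # 1. Direct parse.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. Strip a markdown code fence, if present.
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Take the outermost {...} or [...] span.
    span = re.search(r"[\[{].*[\]}]", text, re.DOTALL)
    if span:
        try:
            return json.loads(span.group(0))
        except json.JSONDecodeError:
            pass
    # 4. Fail loudly so the caller can retry the request.
    raise ValueError("no parseable JSON found")

print(parse_llm_json('Sure! ```json\n{"score": 5}\n```'))  # {'score': 5}
```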
Reduce concurrency:
```toml
[generation]
concurrency = 32 # Or lower
```

Start a local server or use the same API endpoint:

```toml
[models.rejected]
base_url = "https://integrate.api.nvidia.com/v1"
model_name = "meta/llama-3.1-8b-instruct" # Smaller modelSee GETTING_STARTED.md for more troubleshooting.
- GETTING_STARTED.md - Step-by-step tutorial
- DATASET_MODES.md - Detailed format specifications
- BENCHMARK_README.md - Performance benchmarking guide
- CHANGELOG.md - Version history
- configs/config.example.toml - Complete configuration reference
- Go 1.21+ (for building from source)
- Memory: 512MB minimum, 2GB+ recommended
- Network: Stable internet for API calls
- Disk: ~10MB per 1000 records
Contributions welcome:
- Fork the repository
- Create a feature branch
- Make changes with tests
- Run `make lint && make test`
- Submit a pull request
MIT License - see LICENSE file.
```bibtex
@software{vellumforge2,
title = {VellumForge2: Synthetic Dataset Generator for LLM Training},
author = {Lamim},
year = {2025},
url = {https://github.com/lemon07r/vellumforge2},
version = {1.5.3}
}
```

- Direct Preference Optimization by Rafailov et al.
- Kahneman-Tversky Optimization for KTO
- HuggingFace TRL for training framework inspiration
- GitHub Issues - Bug reports and feature requests
- Discussions - Questions and community help
Current Version: v1.5.3 | Last Updated: 2025-11-06