DataGenFlow transforms complex data generation workflows into intuitive visual pipelines. A minimal tool to help you generate and validate data from seed templates with full visibility.
- Easy to Extend: Add custom blocks in minutes with auto-discovery
- Faster Development: Visual pipeline builder eliminates boilerplate code
- Simple to Use: Intuitive drag-and-drop interface, no training required
- Full Transparency: Complete execution traces for debugging
Get started in under 2 minutes:

```bash
# Install dependencies
make setup
make dev

# Launch application (backend + frontend), make sure to have .env configured
make run-dev

# Open http://localhost:8000
```

That's it! No complex configuration, no external dependencies required.
Example of a simple pipeline generating text based on seed data:
```
┌───────────────────────────────────────────────────────────────────────┐
│ 1. SEED DATA (JSON)                                                   │
│ { "repetitions": 2, "metadata": {"topic": "AI", "level": "basic"} }   │
└───────────────────────────────────┬───────────────────────────────────┘
                                    │
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│ 2. PIPELINE (Visual Drag & Drop)                                      │
│                                                                       │
│   ┌────────────┐      ┌────────────┐      ┌────────────┐              │
│   │ LLM Block  │ ───► │ Validator  │ ───► │  Output    │              │
│   │            │      │   Block    │      │   Block    │              │
│   └────────────┘      └────────────┘      └────────────┘              │
│                                                                       │
│   Accumulated State Flow:                                             │
│   topic, level ─► + assistant ─► + is_valid ─► + formatted            │
│                                                                       │
└───────────────────────────────────┬───────────────────────────────────┘
                                    │
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│ 3. GENERATION & REVIEW                                                │
│   + Execute pipeline for each seed × repetitions                      │
│   + Review results with keyboard shortcuts (A/R/E)                    │
│   + View full execution trace for debugging                           │
└───────────────────────────────────┬───────────────────────────────────┘
                                    │
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│ 4. EXPORT                                                             │
│   Download as JSONL ─► Ready for training/integration                 │
└───────────────────────────────────────────────────────────────────────┘
```
Key Concept: Each block adds data to the accumulated state, so subsequent blocks automatically have access to all previous outputs; no manual wiring needed!
Start by creating a JSON seed file with the variables your pipeline will use. Seeds define what data you want to generate.
Single seed:
```json
{
  "repetitions": 2,
  "metadata": {
    "topic": "Python programming",
    "difficulty": "beginner"
  }
}
```

Multiple seeds (generate different variations):

```json
[
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python lists",
      "difficulty": "beginner"
    }
  },
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python dictionaries",
      "difficulty": "intermediate"
    }
  }
]
```

Fields:
- `repetitions`: how many times to run the pipeline with this seed
- `metadata`: variables accessible in your blocks via `{{ variable_name }}`
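To illustrate how `{{ variable_name }}` substitution works, here is a minimal stand-in renderer (a sketch only; DataGenFlow's actual templating engine may differ):

```python
import re

def render_template(template: str, metadata: dict) -> str:
    """Substitute {{ variable_name }} placeholders with seed metadata values."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(metadata[m.group(1)]),
        template,
    )

seed_metadata = {"topic": "Python lists", "difficulty": "beginner"}
prompt = render_template(
    "Write a {{ difficulty }}-level explanation of {{ topic }}.",
    seed_metadata,
)
# prompt == "Write a beginner-level explanation of Python lists."
```

Each `metadata` key from your seed file becomes a variable your block templates can reference by name.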
Design your data generation workflow using drag-and-drop blocks. Each block processes data and passes it to the next one.
Start with ready-to-use blocks:
- LLM Generator: Generate text using AI models (OpenAI, Ollama, etc.)
- Validator: Check quality (length, forbidden words, patterns)
- JSON Validator: Ensure structured data correctness
- Output Formatter: Format results for review page
- ... waiting for more!
DataGenFlow includes research-backed algorithms for synthetic conversation generation:
- Persona-Driven Dialogue - Generate realistic multi-turn conversations with consistent character voices
- Back-Translation Diversity - Automatically create diverse variations while maintaining intent
- Adversarial Perturbation - Generate edge cases and robustness test scenarios
- Quality Metrics - Auto-computed scores for diversity, coherence, and engagement
Perfect for training conversational AI, chatbots, and dialogue systems. Get started with the pre-configured "Customer Service Conversations" template.
Complete guide: Conversational AI Vertical | Research Algorithms
The real power of DataGenFlow is creating your own blocks. Add domain-specific logic in minutes with automatic discovery:
```python
from typing import Any

from lib.blocks.base import BaseBlock


class SentimentAnalyzerBlock(BaseBlock):
    name = "Sentiment Analyzer"
    description = "Analyzes text sentiment"
    inputs = ["text"]  # what this block needs from accumulated state
    outputs = ["sentiment", "confidence"]  # what it adds to accumulated state

    async def execute(self, data: dict[str, Any]) -> dict[str, Any]:
        text = data["text"]  # access from accumulated state
        sentiment = analyze_sentiment(text)  # your domain logic here
        # return values are added to accumulated state automatically
        return {
            "sentiment": sentiment.label,
            "confidence": sentiment.score,
        }
```

Drop your file in `user_blocks/` and it's automatically discovered on restart; no configuration needed.
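To sanity-check a block's logic outside the UI, you can drive its `execute` coroutine by hand. The block below is a toy stand-in (not DataGenFlow's real `BaseBlock`), assumed only to follow the same async `execute(data) -> dict` contract:

```python
import asyncio
from typing import Any

class EchoSentimentBlock:
    """Toy stand-in for a custom block, following the same execute() contract."""
    inputs = ["text"]
    outputs = ["sentiment", "confidence"]

    async def execute(self, data: dict[str, Any]) -> dict[str, Any]:
        text = data["text"]
        # naive placeholder classification: exclamation marks read as positive
        sentiment = "positive" if "!" in text else "neutral"
        return {"sentiment": sentiment, "confidence": 0.5}

state = {"text": "DataGenFlow is great!"}
result = asyncio.run(EchoSentimentBlock().execute(state))
# result == {"sentiment": "positive", "confidence": 0.5}
```

Because blocks only depend on the dict-in/dict-out contract, they are easy to unit-test in isolation before dropping them into a pipeline.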
Why this matters:
- Adapt to your specific domain or workflow instantly
- Integrate proprietary validation logic or data sources
- Build reusable components for your team
- Share blocks as Python files; as simple as copy/paste
Debugging Custom Blocks
Need to debug your custom block? Use the included debug_pipeline.py script with VS Code debugger. See Developer Documentation for details.
Complete guide: Custom Block Development
Data flows automatically through your pipeline. Each block adds its outputs to an accumulated state that every subsequent block can access; no manual wiring:

```
┌──────────────────┐
│    LLM Block     │ → outputs: {"assistant": "Generated text"}
└──────────────────┘
         │
         ▼  (state: assistant)
┌──────────────────┐
│ Validator Block  │ → outputs: {"is_valid": true}
└──────────────────┘
         │
         ▼  (state: assistant, is_valid)
┌──────────────────┐
│   Output Block   │ → can access both: assistant, is_valid
└──────────────────┘
```

This makes building complex pipelines incredibly simple: connect blocks and they automatically share data.
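The accumulation rule can be sketched as a plain loop (a simplification on my part; the real runner presumably also records execution traces and handles errors):

```python
import asyncio
from typing import Any

async def run_pipeline(blocks: list, seed_metadata: dict[str, Any]) -> dict[str, Any]:
    """Merge each block's outputs into a shared, accumulated state dict."""
    state = dict(seed_metadata)  # start from the seed's variables
    for block in blocks:
        outputs = await block.execute(state)
        state.update(outputs)  # later blocks see everything produced so far
    return state

class Upper:
    async def execute(self, data: dict[str, Any]) -> dict[str, Any]:
        return {"upper": data["text"].upper()}

class Length:
    async def execute(self, data: dict[str, Any]) -> dict[str, Any]:
        # reads a value produced by the previous block, not the seed
        return {"length": len(data["upper"])}

final = asyncio.run(run_pipeline([Upper(), Length()], {"text": "hi"}))
# final == {"text": "hi", "upper": "HI", "length": 2}
```

Merging into one dict is what lets any block read any earlier block's output without explicit connections between the two.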
Review your results with keyboard shortcuts (Accept: A, Reject: R, Edit: E) and full execution traces to see how each result was generated.
Export your data in JSONL format, filtered by status (accepted, rejected, pending).
Create `.env` file (or copy from `.env.example`):

```bash
# LLM Configuration
LLM_ENDPOINT=http://localhost:11434/v1  # Ollama, OpenAI, etc.
LLM_API_KEY=                            # Optional for some endpoints
LLM_MODEL=llama3.2

# Database
DATABASE_PATH=data/qa_records.db

# Server
HOST=0.0.0.0
PORT=8000

# Debug mode (optional)
DEBUG=false  # set to true for detailed logging
```

Comprehensive Guides
- How to Use DataGenFlow - Complete user guide
- Custom Block Development - Extend functionality
- Developer Documentation - Technical reference for developers
Contributions are welcome and appreciated. Before submitting a contribution, please review the guidelines below.
Prerequisites:
- Read the Contributing Guidelines thoroughly
- Check existing issues and pull requests to avoid duplication
- Follow the project's commit conventions and code style standards
Areas for Contribution:
- New processing blocks and pipeline templates
- Documentation improvements and examples
- Bug fixes and performance optimizations
- Test coverage expansion
- Integration examples and use cases
For detailed technical requirements and development setup, refer to the Developer Documentation.
DataGenFlow is built on the KISS principle (Keep It Simple, Stupid):
- Minimal Abstraction: Direct, understandable code over clever tricks
- Flat Architecture: Simple structure over deep nesting
- Explicit Design: Clear intentions over implicit magic
- Composition First: Combine simple pieces over complex inheritance
- Developer Friendly: Easy to understand, modify, and extend
Result: Simple, understandable code that's easy to maintain and extend.
Get Started • View Documentation
Happy Data Generating! 🌱
