Skip to content

giuseppeambrosio97/DataGenFlow

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

152 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

minimal.mp4

Define seeds β†’ Build pipeline β†’ Review results β†’ Export data

Watch full demo

Why DataGenFlow 🌱

DataGenFlow transforms complex data generation workflows into intuitive visual pipelines. A minimal tool to help you generate and validate data from seed templates with full visibility.

Key Benefits

  • Easy to Extend: Add custom blocks in minutes with auto-discovery
  • Faster Development: Visual pipeline builder eliminates boilerplate code
  • Simple to Use: Intuitive drag-and-drop interface, no training required
  • Full Transparency: Complete execution traces for debugging

Quick Start

Get started in under 2 minutes:

# Install dependencies
make setup
make dev

# Launch application (backend + frontend), make sure to have .env configured
make run-dev

# Open http://localhost:8000

That's it! No complex configuration, no external dependencies required.

How It Works

TL;DR - Visual Overview

Example of a simple pipeline generating text based on seed data:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. SEED DATA (JSON)                                                     β”‚
β”‚    { "repetitions": 2, "metadata": {"topic": "AI", "level": "basic"} }  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 2. PIPELINE (Visual Drag & Drop)                                        β”‚
β”‚                                                                         β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚    β”‚  LLM Block   β”‚ ───► β”‚  Validator   β”‚ ───► β”‚    Output    β”‚         β”‚
β”‚    β”‚              β”‚      β”‚    Block     β”‚      β”‚    Block     β”‚         β”‚
β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚
β”‚                                                                         β”‚
β”‚    Accumulated State Flow:                                              β”‚
β”‚    topic, level  ─►  + assistant  ─►  + is_valid  ─►  + formatted       β”‚
β”‚                                                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 3. GENERATION & REVIEW                                                  β”‚
β”‚    + Execute pipeline for each seed Γ— repetitions                       β”‚
β”‚    + Review results with keyboard shortcuts (A/R/E)                     β”‚
β”‚    + View full execution trace for debugging                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 4. EXPORT                                                               β”‚
β”‚    Download as JSONL ─► Ready for training/integration                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Concept: Each block adds data to the accumulated state, so subsequent blocks automatically have access to all previous outputsβ€”no manual wiring needed!


1. Define Your Seed Data

Start by creating a JSON seed file with the variables your pipeline will use. Seeds define what data you want to generate.

Single seed:

{
  "repetitions": 2,
  "metadata": {
    "topic": "Python programming",
    "difficulty": "beginner"
  }
}

Multiple seeds (generate different variations):

[
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python lists",
      "difficulty": "beginner"
    }
  },
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python dictionaries",
      "difficulty": "intermediate"
    }
  }
]

Fields:

  • repetitions: How many times to run the pipeline with this seed
  • metadata: Variables accessible in your blocks via {{ variable_name }}

2. Build Your Pipeline Visually

Design your data generation workflow using drag-and-drop blocks. Each block processes data and passes it to the next one.

Built-in Blocks

Start with ready-to-use blocks:

  • LLM Generator: Generate text using AI models (OpenAI, Ollama, etc.)
  • Validator: Check quality (length, forbidden words, patterns)
  • JSON Validator: Ensure structured data correctness
  • Output Formatter: Format results for review page
  • ... waiting for more!

Conversational AI Vertical

DataGenFlow includes research-backed algorithms for synthetic conversation generation:

  • Persona-Driven Dialogue - Generate realistic multi-turn conversations with consistent character voices
  • Back-Translation Diversity - Automatically create diverse variations while maintaining intent
  • Adversarial Perturbation - Generate edge cases and robustness test scenarios
  • Quality Metrics - Auto-computed scores for diversity, coherence, and engagement

Perfect for training conversational AI, chatbots, and dialogue systems. Get started with the pre-configured "Customer Service Conversations" template.

πŸ“š Complete guide: Conversational AI Vertical | Research Algorithms

Extend with Custom Blocks

The real power of DataGenFlow is creating your own blocks. Add domain-specific logic in minutes with automatic discovery:

from lib.blocks.base import BaseBlock
from typing import Any

class SentimentAnalyzerBlock(BaseBlock):
    name = "Sentiment Analyzer"
    description = "Analyzes text sentiment"
    inputs = ["text"]  # what this block needs from accumulated state
    outputs = ["sentiment", "confidence"]  # what it adds to accumulated state

    async def execute(self, data: dict[str, Any]) -> dict[str, Any]:
        text = data["text"]  # access from accumulated state
        sentiment = analyze_sentiment(text)

        # return values are added to accumulated state automatically
        return {
            "sentiment": sentiment.label,
            "confidence": sentiment.score
        }

Drop your file in user_blocks/ and it's automatically discovered on restartβ€”no configuration needed.

Why this matters:

  • Adapt to your specific domain or workflow instantly
  • Integrate proprietary validation logic or data sources
  • Build reusable components for your team
  • Share blocks as Python filesβ€”simple as copy/paste

Debugging Custom Blocks

Need to debug your custom block? Use the included debug_pipeline.py script with VS Code debugger. See Developer Documentation for details.

πŸ“š Complete guide: Custom Block Development

Accumulated State

Data flows automatically through your pipeline. Each block adds its outputs to an accumulated state that every subsequent block can accessβ€”no manual wiring:

    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚   LLM Block     β”‚ β†’ outputs: {"assistant": "Generated text"}
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό (state: assistant)
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Validator Block β”‚ β†’ outputs: {"is_valid": true}
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό (state: assistant, is_valid)
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Output Block   β”‚ ← can access both: assistant, is_valid
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This makes building complex pipelines incredibly simpleβ€”connect blocks and they automatically share data.

3. Review and Refine

Review your results with keyboard shortcuts (Accept: A, Reject: R, Edit: E) and full execution traces to see how each result was generated.

4. Export Your Data

Export your data in JSONL format, filtered by status (accepted, rejected, pending).

Configuration

Create .env file (or copy from .env.example):

# LLM Configuration
LLM_ENDPOINT=http://localhost:11434/v1  # Ollama, OpenAI, etc.
LLM_API_KEY=                            # Optional for some endpoints
LLM_MODEL=llama3.2

# Database
DATABASE_PATH=data/qa_records.db

# Server
HOST=0.0.0.0
PORT=8000

# Debug mode (optional)
DEBUG=false  # set to true for detailed logging

Documentation

πŸ“– Comprehensive Guides

Contributing

Contributions are welcome and appreciated. Before submitting a contribution, please review the guidelines below.

Prerequisites:

  • Read the Contributing Guidelines thoroughly
  • Check existing issues and pull requests to avoid duplication
  • Follow the project's commit conventions and code style standards

Areas for Contribution:

  • New processing blocks and pipeline templates
  • Documentation improvements and examples
  • Bug fixes and performance optimizations
  • Test coverage expansion
  • Integration examples and use cases

For detailed technical requirements and development setup, refer to the Developer Documentation.

Design Strategy

DataGenFlow is built on the KISS principle (Keep It Simple, Stupid):

  • Minimal Abstraction: Direct, understandable code over clever tricks
  • Flat Architecture: Simple structure over deep nesting
  • Explicit Design: Clear intentions over implicit magic
  • Composition First: Combine simple pieces over complex inheritance
  • Developer Friendly: Easy to understand, modify, and extend

Result: Simple, understandable code that's easy to maintain and extend.


Get Started β€’ View Documentation

Happy Data Generating! 🌱

About

A minimal library to generate and validate datasets.

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Languages

  • TypeScript 54.5%
  • Python 43.0%
  • Other 2.5%