Oqura-ai/deepresearch-datagen-cli

Overview

Oqura's deepresearch-datagen-cli is a terminal tool for generating structured datasets from real-world data using deep research. You describe the dataset you need, and it searches the web, builds context through multi-step research, suggests a schema, and outputs clean, usable data. It’s built for quick experimentation, training tasks, or any time you need structured data without gathering or formatting it by hand.

How It Works

  • takes a query describing the dataset you want
  • suggests a high-level schema based on the query
  • refines the schema with follow-up adjustments if needed
  • breaks the dataset into focused sections and subtopics
  • assigns a research agent to each section
  • each agent runs web search, extracts info, and summarizes key data
  • generates section-wise structured data
  • merges all sections into a single final dataset
  • writes the final dataset to the output_files directory
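The final merge step above can be sketched roughly as follows. This is an illustrative sketch, not the tool's actual API: `merge_sections`, the row fields, and the per-section cap are all assumptions (the cap mirrors the `max_rows_from_each_section` setting shown later in configuration.py).

```python
import json

def merge_sections(section_rows, max_rows_per_section=5):
    """Merge per-section row lists into one dataset, deduplicating
    identical rows and capping each section's contribution."""
    seen, merged = set(), []
    for rows in section_rows:
        for row in rows[:max_rows_per_section]:
            # serialize with sorted keys so identical rows compare equal
            key = json.dumps(row, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged

sections = [
    [{"name": "Ada Lovelace", "field": "mathematics"}],
    [{"name": "Ada Lovelace", "field": "mathematics"},
     {"name": "Grace Hopper", "field": "computer science"}],
]
dataset = merge_sections(sections)  # duplicate row kept only once
```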

Workflow

This diagram shows how the tool takes a user prompt, performs recursive web research, and turns the results into a structured dataset.

Deep Research Workflow

Getting Started

Follow these steps to set up and run the project locally.

Prerequisite: Install uv

uv is required to manage the virtual environment and dependencies.

You can download it from the official uv GitHub repository, which includes platform-specific installation instructions.

1. Clone the Repository

Clone the repository:

git clone https://github.com/Oqura-ai/deepresearch-datagen-cli.git
cd deepresearch-datagen-cli

2. Create a Virtual Environment

Use uv to create a virtual environment:

uv venv

3. Activate the Virtual Environment

Activate the environment depending on your operating system:

Windows:

.venv\Scripts\activate

macOS/Linux:

source .venv/bin/activate

4. Set Up Environment Variables

Copy the example .env file and add your API keys:

cp .env.example .env

Open the .env file in a text editor and fill in the required fields:

OPENAI_API_KEY=your_openai_api_key_here
TAVILY_API_KEY=your_tavily_api_key_here

Both keys are required for the application to run.
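A minimal sketch of how such keys are typically validated at startup. The `load_keys` helper is illustrative, not part of the project (the application may instead rely on python-dotenv to read the .env file):

```python
import os

def load_keys():
    """Read the two required API keys, failing fast if either is missing."""
    keys = {}
    for name in ("OPENAI_API_KEY", "TAVILY_API_KEY"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys

# placeholder values for demonstration only
os.environ["OPENAI_API_KEY"] = "sk-test"
os.environ["TAVILY_API_KEY"] = "tvly-test"
keys = load_keys()
```

Failing fast like this surfaces a missing key immediately instead of midway through a long research run.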

5. Install Dependencies

uv pip install -r requirements.txt

This installs all packages listed in requirements.txt.

6. Run the Application

Once installed and configured, start the app with:

python main.py

You're all set! The application will guide you through the dataset creation process step by step, and the final dataset will be saved in the output_files directory.

Optional: configuration.py

You can customize how the tool behaves using the configuration.py file inside deep_research_workflow. It lets you adjust things like model type, temperature, search depth, delays, and more.

from dataclasses import dataclass, fields
from langchain_core.runnables import RunnableConfig
import os
import uuid

@dataclass(kw_only=True)
class Configuration:
    thread_id: str = str(uuid.uuid4())  # evaluated once at import, so instances share this default unless overridden
    provider: str = "openai"
    model: str = "gpt-4o-mini"
    temperature: float = 0.5
    max_queries: int = 3
    search_depth: int = 2
    num_reflections: int = 2
    section_delay_seconds: int = 15
    max_rows_from_each_section: int = 5

    @classmethod
    def from_runnable_config(cls, config: RunnableConfig) -> "Configuration":
        configurable = config.get("configurable", {}) if config else {}
        values = {
            f.name: os.environ.get(f.name.upper(), configurable.get(f.name, f.default))
            for f in fields(cls) if f.init
        }
        return cls(**values)

Authors

Contributing

If something here could be improved, please open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.
