GitHub - lifelog-index/data-curation-suite: vllm

Data Curation Suite

Collection of ready-to-use, high-performance LLM/VLM workflows for data curation and annotation. Built to streamline multimedia understanding tasks with scalable batching, chunking, and preview generation.

The models are selected specifically for each use case.

Overview

Modalities: Images, image sequences (as lists), and videos treated uniformly via frame sampling.
Performance: Batch execution, chunked prefill, prefix caching, and controlled GPU memory utilization.
For Curators/Annotators: Produce consistent descriptions, summaries, and labels at scale with reproducible outputs written per chunk.
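
As a rough illustration of the frame-sampling and per-chunk ideas above, the sketch below picks evenly spaced frame indices from a video and groups work items into fixed-size chunks. The names `sample_frame_indices` and `iter_chunks` are hypothetical and are not taken from this repository; the actual pipelines may sample and chunk differently.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to `num_samples` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Short clip: keep every frame rather than repeating indices.
        return list(range(total_frames))
    # Take the midpoint of each of `num_samples` equal-width segments.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


if __name__ == "__main__":
    frames = sample_frame_indices(total_frames=100, num_samples=4)
    for chunk_id, chunk in enumerate(iter_chunks(frames, chunk_size=2)):
        # A real pipeline would run the model on the chunk and write one output file here.
        print(chunk_id, list(chunk))
```

Writing one output per chunk means an interrupted run can resume from the last completed chunk instead of starting over.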

Installation

Installation uses uv, which resolves the tricky dependencies for you (vLLM, CUDA, PyTorch versions, etc.).

# Install dependencies with uv (recommended)
uv sync

Use Case 1: Video Activity Description Generation

Use a VLM to generate detailed activity descriptions for long videos by representing them as sampled frame sequences. This repository provides a reference pipeline using Tarsier2 (state of the art).

Check out the detailed instructions for setup and usage.

Use Case 2: JSON Generation (WIP)

Convert LLM descriptions into structured JSON format for easier integration with downstream applications. This workflow is under development and aims to provide a seamless way to organize and store generated metadata.
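
Since this workflow is still under development, the following is only a minimal sketch of one way such a conversion step could look: it pulls the first JSON object out of a model response and checks that the expected keys are present. The function name `parse_description_json` and the keys `activity` and `objects` are assumptions made for illustration, not part of this repository.

```python
import json
from typing import Any, Dict, Iterable


def parse_description_json(raw: str, required_keys: Iterable[str]) -> Dict[str, Any]:
    """Extract the first JSON object from model output and check required keys."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return obj


if __name__ == "__main__":
    # Hypothetical model response with prose around the JSON payload.
    response = 'Here is the result:\n{"activity": "cooking", "objects": ["pan", "stove"]}\nDone.'
    print(parse_description_json(response, ["activity", "objects"]))
```

The sketch assumes the first `{` in the response opens the JSON payload; a production version would likely constrain the model output format instead.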

Use Case 3: Attribute Extraction From Images

Use a VLM to extract specific attributes from images, such as:

objects
scene type
action category
time-based attributes (duration, frequency)
environment attributes (indoor/outdoor, lighting)
human attributes (pose, direction, group size)
visual relations (spatial relationships between objects)
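
One possible way to turn the attribute list above into a model prompt is sketched below. The `AttributeSpec` class, the `build_attribute_prompt` function, and the key names are hypothetical and chosen only for illustration; the repository may structure its prompts differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class AttributeSpec:
    name: str         # key expected in the model's answer
    description: str  # short instruction shown to the model


# Hypothetical keys mirroring the attribute list in this section.
ATTRIBUTES: List[AttributeSpec] = [
    AttributeSpec("objects", "visible objects"),
    AttributeSpec("scene_type", "kind of scene"),
    AttributeSpec("action_category", "main action shown"),
    AttributeSpec("temporal", "duration and frequency, if inferable"),
    AttributeSpec("environment", "indoor/outdoor and lighting"),
    AttributeSpec("human", "pose, direction, and group size"),
    AttributeSpec("relations", "spatial relationships between objects"),
]


def build_attribute_prompt(specs: Sequence[AttributeSpec]) -> str:
    """Build a plain-text prompt asking for one JSON object with the given keys."""
    lines = [
        "Extract the following attributes from the image.",
        "Answer with one JSON object using exactly these keys:",
    ]
    lines += [f"- {spec.name}: {spec.description}" for spec in specs]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_attribute_prompt(ATTRIBUTES))
```

Keeping the attribute definitions in one list makes it easy to add or drop attributes without editing the prompt text by hand.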

Use Case 4: Visual Question Generation (VQG)

Automatically generate question–answer pairs from videos for training video-QA or reasoning models. Example questions and formats:

“What is the person doing after…?”
“Why does the action change at timestamp…?”
multi-choice or free-form QA
reasoning-based questions
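
To make the question formats above concrete, here is a small sketch of a record type that can hold both multiple-choice and free-form pairs. The `QAPair` class and its fields (`choices`, `timestamp_s`) are assumptions for illustration and do not describe an existing module in this repository.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QAPair:
    question: str
    answer: str
    choices: List[str] = field(default_factory=list)  # empty for free-form QA
    timestamp_s: Optional[float] = None               # video time the question refers to

    @property
    def is_multiple_choice(self) -> bool:
        return bool(self.choices)

    def validate(self) -> None:
        """Raise ValueError if the pair is empty or inconsistent."""
        if not self.question.strip() or not self.answer.strip():
            raise ValueError("question and answer must be non-empty")
        if self.choices and self.answer not in self.choices:
            raise ValueError("answer must be one of the choices")


if __name__ == "__main__":
    pair = QAPair(
        question="What is the person doing after opening the fridge?",
        answer="Pouring milk",
        choices=["Pouring milk", "Washing dishes", "Leaving the room"],
        timestamp_s=12.5,
    )
    pair.validate()
    print(pair.is_multiple_choice)
```

Validating generated pairs before saving them filters out multiple-choice items whose answer is not among the listed options.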

Use Case 5: Text Classification Synthetic Dataset Generation

Generate high-quality labeled text datasets using reasoning models served with vLLM, driven by YAML-based configuration.

text-clf-synth enables you to:

Define dataset schemas using simple YAML configs (fields, types, ranges, labels)
Generate realistic data with reasoning models for better coherence
Automatically split into train/test sets with optional stratification
Output ready-to-use CSV files

Example: Generate IELTS Task 2 essays with topics, essay types, full essays, band scores, and scoring rationale.
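
The stratified split and CSV output steps can be sketched with the standard library alone. This is not the text-clf-synth implementation; the functions `stratified_split` and `rows_to_csv` and the column names `topic` and `band_score` are hypothetical, and the generated rows are stand-ins for model output.

```python
import csv
import io
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def stratified_split(
    rows: Sequence[Row], label_key: str, test_fraction: float, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (train, test), keeping label proportions roughly equal."""
    rng = random.Random(seed)  # fixed seed for reproducible splits
    by_label: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train: List[Row] = []
    test: List[Row] = []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def rows_to_csv(rows: Sequence[Row], fieldnames: Sequence[str]) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fieldnames), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder rows; a real run would use generated essays.
    rows = [{"topic": f"topic-{i}", "band_score": "6.0" if i % 2 else "7.0"} for i in range(20)]
    train, test = stratified_split(rows, label_key="band_score", test_fraction=0.2)
    print(len(train), len(test))
    print(rows_to_csv(test, ["topic", "band_score"]))
```

Splitting per label before shuffling keeps rare classes represented in both files, which a plain random split does not guarantee.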

Check out the detailed instructions for setup and usage.

Notes

This repository focuses on reproducible, scalable generation for curators/annotators; it is not a training codebase.
Ensure you have rights to process the media you run through the pipeline.

