Collection of ready-to-use, high-performance LLM/VLM workflows for data curation and annotation. Built to streamline multimedia understanding tasks with scalable batching, chunking, and preview generation.
Models are selected specifically for each use case.
- Modalities: Images, image sequences (as lists), and videos treated uniformly via frame sampling.
- Performance: Batch execution, chunked prefill, prefix caching, and controlled GPU memory utilization.
- For Curators/Annotators: Produce consistent descriptions, summaries, and labels at scale with reproducible outputs written per chunk.
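
As a rough illustration of the frame sampling and chunked processing mentioned above, here is a minimal sketch in plain Python. The helper names `sample_frame_indices` and `iter_chunks` are hypothetical and are not taken from this repository; the actual pipeline may sample and batch differently.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick up to `num_samples` evenly spaced frame indices from a clip.

    Hypothetical helper: a video, an image sequence, or a single image
    (num_frames == 1) can all be reduced to a list of frame indices.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Short clip: keep every frame.
        return list(range(num_frames))
    # Take the midpoint of each of `num_samples` equal-width segments.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `chunk_size` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


if __name__ == "__main__":
    print(sample_frame_indices(100, 4))  # [12, 37, 62, 87]
    for chunk in iter_chunks(["a.mp4", "b.mp4", "c.mp4"], 2):
        print(list(chunk))
```

Writing results once per chunk, rather than once at the end, means an interrupted run can resume from the last completed chunk.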
Installation uses uv, which handles the awkward dependencies (vLLM, CUDA, PyTorch versions, etc.).
## Install dependencies with uv (recommended)
`uv sync`

Leverage a VLM to generate detailed activity descriptions for long videos by representing them as sampled frame sequences. This repository provides a reference pipeline using Tarsier2 (SOTA).
Check out the detailed instructions for setup and usage.
Convert LLM descriptions into structured JSON format for easier integration with downstream applications. This workflow is under development and aims to provide a seamless way to organize and store generated metadata.
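
Since this workflow is still under development, the following is only a sketch of one possible approach, not the repository's implementation. It assumes the model has been prompted to answer with a JSON object, possibly wrapped in extra text; the function names and the `activity` / `objects` keys are hypothetical.

```python
import json
from typing import Any, Dict, Iterable


def extract_json_object(text: str) -> Dict[str, Any]:
    """Return the first JSON object embedded in free-form model output."""
    decoder = json.JSONDecoder()
    for pos, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `pos`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model output")


def require_keys(record: Dict[str, Any], required: Iterable[str]) -> Dict[str, Any]:
    """Fail early if the model omitted a field downstream code relies on."""
    missing = [key for key in required if key not in record]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return record


if __name__ == "__main__":
    raw = 'Here is the result: {"activity": "cooking", "objects": ["pan", "stove"]} Done.'
    record = require_keys(extract_json_object(raw), ["activity", "objects"])
    print(record["activity"])
```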
Leverage a VLM to extract specific attributes from images, for example:
- objects
- scene type
- action category
- time-based attributes (duration, frequency)
- environment attributes (indoor/outdoor, lighting)
- human attributes (pose, direction, group size)
- visual relations (spatial relationships between objects)
Automatically generate question–answer pairs from videos for training video-QA or reasoning models.
- “What is the person doing after…?”
- “Why does the action change at timestamp…?”
- multiple-choice or free-form QA
- reasoning-based questions
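
One way to store generated pairs is one JSON record per line (JSONL). The sketch below is illustrative only: the `QAPair` class and its field names are assumptions, not a format defined by this repository.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class QAPair:
    """Hypothetical record for one generated question-answer pair."""

    video_id: str
    question: str
    answer: str
    # None means free-form QA; a list means multiple-choice.
    choices: Optional[List[str]] = None

    def __post_init__(self) -> None:
        # For multiple-choice items the answer must be one of the options.
        if self.choices is not None and self.answer not in self.choices:
            raise ValueError("answer must be one of the choices")

    def to_jsonl(self) -> str:
        """Serialize to a single JSON line."""
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    pair = QAPair(
        video_id="clip_0001",  # placeholder identifier
        question="What is the person doing after opening the door?",
        answer="sitting down",
        choices=["sitting down", "leaving the room", "making a call"],
    )
    print(pair.to_jsonl())
```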
Generate high-quality labeled text datasets using reasoning models served through vLLM, with YAML-based configuration.
text-clf-synth enables you to:
- Define dataset schemas using simple YAML configs (fields, types, ranges, labels)
- Generate realistic data with reasoning models for better coherence
- Automatically split into train/test sets with optional stratification
- Output ready-to-use CSV files
Example: Generate IELTS Task 2 essays with topics, essay types, full essays, band scores, and scoring rationale.
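
To make the stratified train/test split concrete, here is a small standard-library sketch. It is not the text-clf-synth implementation; `stratified_split`, its parameters, and the `band` column are hypothetical and only illustrate the idea of splitting each label group separately so both sets keep similar label proportions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]


def stratified_split(
    rows: List[Row],
    label_key: str,
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (train, test), sampling each label group separately."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_label: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train: List[Row] = []
    test: List[Row] = []
    # Sort labels so the result does not depend on input order of groups.
    for label in sorted(by_label, key=str):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


if __name__ == "__main__":
    data = [{"essay": f"text {i}", "band": 6 if i % 2 else 7} for i in range(20)]
    train_rows, test_rows = stratified_split(data, label_key="band")
    print(len(train_rows), len(test_rows))  # 16 4
```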
Check out the detailed instructions for setup and usage.
- This repository focuses on reproducible, scalable generation for curators/annotators; it is not a training codebase.
- Ensure you have rights to process the media you run through the pipeline.