🎬 Video2Tasks

Split Multi-Task Robot Videos into Single-Task Segments with Auto-Generated Instructions for VLA Training

Python 3.8+ | License: MIT | FastAPI

English | 中文文档


📖 Overview

🎯 What Problem Does This Solve?

When training VLA (Vision-Language-Action) models like π₀ (pi-zero), you need single-task video segments with instruction labels. However, real-world robot demonstration videos often contain multiple consecutive tasks without any annotation:

Input:  Long video with multiple tasks, NO labels
           ┃
           ▼
     ┌────────────────────────────────────────────────────────────┐
     │  🎬 Video2Tasks                                            │
     │  • VLM-powered task boundary detection                     │
     │  • Auto-generate natural language instructions             │
     │  • Distributed processing for large-scale datasets         │
     └────────────────────────────────────────────────────────────┘
           ┃
           ▼
Output: Single-task segments + instruction labels, READY for VLA training

  segment_001.mp4         segment_002.mp4         segment_003.mp4
  "Pick up the fork"      "Place the fork"        "Pick up the spoon"

Video2Tasks = Task Segmentation + Instruction Labeling → VLA Training Data Pipeline

🔧 How It Works

This tool uses a distributed client-server architecture with VLMs (like Qwen3-VL) to analyze video frames, intelligently detect task boundaries, and generate natural language instructions for each segment.

Component   Description
Server      Manages job queues, video frame extraction, and result aggregation
Worker      Runs VLM inference to detect task transitions and generate instructions

📊 Output Example

VLM Window-by-Window Reasoning

The VLM analyzes each overlapping frame window and provides detailed reasoning about task transitions:

πŸ” Click to see VLM reasoning for multiple windows

Window 0 - Detecting bag → mask transition:

{
  "task_id": "LongData601-1189::1765279974654_w0",
  "window_id": 0,
  "vlm_json": {
    "thought": "Frames 0-2: The person is standing, hands open, wearing gloves, facing the room. No object interaction yet. Frame 3: The person reaches toward a white tote bag on the sofa. Frame 4: The person grasps the tote bag and begins lifting it. Frames 5-11: The person continues manipulating the tote bag, opening it, adjusting its straps, and handling its contents. This is a continuous interaction with the same object (the tote bag). Frame 12: The person reaches into the bag and pulls out a white object with a black strap (likely a mask or headgear). The interaction from frame 12 onward shifts from the tote bag to the white object (mask). Therefore, a switch occurs at frame 12.",
    "transitions": [12],
    "instructions": ["Pick up and manipulate the tote bag", "Retrieve and adjust the white face mask"]
  }
}

Window 3 - Detecting multiple object switches:

{
  "task_id": "LongData601-1189::1765279974654_w3",
  "window_id": 3,
  "vlm_json": {
    "thought": "Frames 0-2: The robot's left hand reaches for and grasps a small black object from the left table. The right hand holds a white tote bag. Frames 3-5: The left hand places the black object into the tote bag. Frames 6-7: The left hand releases the black object into the bag and then reaches back to pick up another small black object. This is a clear switch: the robot completes interaction with the first black object and starts interacting with a second, distinct black object. Frame 15: The robot reaches for the white kettle on the left table. This marks a new interaction with a different object (the kettle). Therefore, switches are detected at frame 6 (first→second black object) and frame 15 (bag→kettle).",
    "transitions": [6, 15],
    "instructions": ["Place the first black object into the tote bag", "Place the second black object into the tote bag", "Pick up the white kettle"]
  }
}

Window 8 - No switch detected (continuous task):

{
  "task_id": "LongData601-1189::1765279974654_w8",
  "window_id": 8,
  "vlm_json": {
    "thought": "Frames 0-2: The robot is holding a small white object (possibly a charger or adapter) in both hands and appears to be manipulating or connecting it to a cable. Frames 3-15: The robot continues to hold the same object and cable, moving it toward an electrical outlet near the TV stand. There is no clear release of the object or grasping of a new distinct object. The entire sequence involves manipulating a single object. No switch occurs.",
    "transitions": [],
    "instructions": ["Move the charger toward the power outlet"]
  }
}

Window 14 - Complex multi-object sequence:

{
  "task_id": "LongData601-1189::1765279974654_w14",
  "window_id": 14,
  "vlm_json": {
    "thought": "Frames 0-2: The person is interacting with objects on the left table — moving a small white cup and adjusting a black rectangular object. Frame 3: The right hand moves away from the black object and reaches toward the ashtray. This marks the end of interaction with the black object. Frame 4: The right hand grasps the ashtray. This is a clear switch to a new object. Frames 5-7: The person moves the ashtray toward the trash can. Frame 11: The right hand reaches down to pick up a pair of white slippers from the floor. This is a clear switch from ashtray to slippers. Switches occur at frame 3 (black object→ashtray) and frame 11 (ashtray→slippers).",
    "transitions": [3, 11],
    "instructions": ["Move the black rectangular object and cup", "Pick up the ashtray", "Pick up the white slippers", "Place the slippers on the rack"]
  }
}

Final Segmentation Output

A 4501-frame video automatically split into 16 single-task segments:

{
  "video_id": "1765279974654",
  "nframes": 4501,
  "segments": [
    {"seg_id": 0,  "start_frame": 0,    "end_frame": 373,  "instruction": "Pick up and manipulate the tote bag"},
    {"seg_id": 1,  "start_frame": 373,  "end_frame": 542,  "instruction": "Retrieve and adjust the white face mask"},
    {"seg_id": 2,  "start_frame": 542,  "end_frame": 703,  "instruction": "Open and place items into the bag"},
    {"seg_id": 3,  "start_frame": 703,  "end_frame": 912,  "instruction": "Place the first black object into the tote bag"},
    {"seg_id": 4,  "start_frame": 912,  "end_frame": 1214, "instruction": "Place the second black object into the tote bag"},
    {"seg_id": 5,  "start_frame": 1214, "end_frame": 1375, "instruction": "Place the white cup on the table"},
    {"seg_id": 6,  "start_frame": 1375, "end_frame": 1524, "instruction": "Move the cup to the right table"},
    {"seg_id": 7,  "start_frame": 1524, "end_frame": 1784, "instruction": "Connect the power adapter to the cable"},
    {"seg_id": 8,  "start_frame": 1784, "end_frame": 2991, "instruction": "Plug the device into the power strip"},
    {"seg_id": 9,  "start_frame": 2991, "end_frame": 3135, "instruction": "Interact with black object on coffee table"},
    {"seg_id": 10, "start_frame": 3135, "end_frame": 3238, "instruction": "Adjust the ashtray"},
    {"seg_id": 11, "start_frame": 3238, "end_frame": 3359, "instruction": "Interact with the white mug"},
    {"seg_id": 12, "start_frame": 3359, "end_frame": 3478, "instruction": "Move the black rectangular object and cup"},
    {"seg_id": 13, "start_frame": 3478, "end_frame": 3711, "instruction": "Pick up the ashtray"},
    {"seg_id": 14, "start_frame": 3711, "end_frame": 4095, "instruction": "Move the white slippers from the shoe rack"},
    {"seg_id": 15, "start_frame": 4095, "end_frame": 4501, "instruction": "Raise the window blind"}
  ]
}

🎯 Each segment contains exactly ONE task with an auto-generated natural language instruction, ready for VLA training!


💡 Why This Architecture?

🧠 Distributed Architecture

Not just a single script: the FastAPI server acts as the orchestrator, while workers handle inference only.

Run the server on one machine (even a single RTX 4090), then connect ten more machines running workers to process massive datasets in parallel.

This is production-grade thinking.

🛡️ Production-Ready Resilience

  • ⏱️ Inflight timeout & re-dispatch
  • 🔄 Configurable retry limits
  • 📝 .DONE checkpoint markers for resume

Critical mechanisms for running large-scale tasks to completion.
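These mechanisms follow a standard work-queue pattern. Below is a minimal, self-contained sketch of inflight timeout, re-dispatch, and a retry cap; it is illustrative only (the class and method names are not the project's actual server code):

```python
import time

class JobQueue:
    """Toy job queue with inflight timeout, re-dispatch, and a retry cap.
    Illustrates the resilience pattern above; NOT the actual server code."""

    def __init__(self, timeout_s=300.0, max_retries=3):
        self.pending = []        # (job, attempts) waiting for a worker
        self.inflight = {}       # job -> (dispatch_time, attempts)
        self.timeout_s = timeout_s
        self.max_retries = max_retries

    def submit(self, job):
        self.pending.append((job, 0))

    def dispatch(self, now=None):
        """Hand the next job to a worker, recording when it went out."""
        now = time.time() if now is None else now
        self._reap(now)
        if not self.pending:
            return None
        job, attempts = self.pending.pop(0)
        self.inflight[job] = (now, attempts + 1)
        return job

    def _reap(self, now):
        """Re-queue jobs whose worker went silent, up to max_retries."""
        for job, (t0, attempts) in list(self.inflight.items()):
            if now - t0 > self.timeout_s:
                del self.inflight[job]
                if attempts < self.max_retries:
                    self.pending.append((job, attempts))

    def complete(self, job):
        """Worker reported a result; drop the job from the inflight set."""
        self.inflight.pop(job, None)
```

A job that times out goes back to the pending list with its attempt count intact, so a crashed worker costs one timeout window rather than the whole run.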

🎯 Smart Segmentation Algorithm

Not just throwing frames at a model: build_segments_via_cuts performs weighted voting across overlapping windows, using Hanning-window weighting so that detections near window edges count less than detections near the center.

Solves the classic "unstable edge detection" problem.
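The idea fits in a few lines. The function below is an illustrative stand-in for build_segments_via_cuts, with an assumed fixed-stride windowing scheme and hypothetical parameter names, not the actual implementation:

```python
import numpy as np

def weighted_cut_votes(window_results, window_len, stride, total_frames):
    """Accumulate per-frame cut votes across overlapping windows.

    Each window's transition votes are weighted by a Hanning curve, so a
    cut detected near a window's edge (where temporal context is thin)
    counts less than the same cut seen near the center of a neighboring
    window. Illustrative sketch only.
    """
    votes = np.zeros(total_frames)
    weights = np.hanning(window_len)   # 0 at the edges, peaks mid-window
    for win_id, transitions in window_results:
        start = win_id * stride        # assumes fixed-stride windows
        for t in transitions:          # t = frame offset inside the window
            frame = start + t
            if 0 <= frame < total_frames:
                votes[frame] += weights[t]
    return votes                       # peaks above a threshold become cuts
```

Two overlapping windows that both see the same physical cut weight it differently depending on where it falls in each window; summing the weighted votes keeps the boundary stable even when a single window's edge detection is noisy.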

✍️ Domain-Specific Prompts

prompt_switch_detection explicitly distinguishes:

  • True Switch: Transition to a new object
  • False Switch: Different operation on the same object

Tailored for manipulation datasets, significantly reducing over-segmentation.


✨ Features

Feature                     Description
🎥 Video Windowing          Configurable video window sampling parameters
🤖 Pluggable Backends       Support for Qwen3-VL, a remote API, or custom VLM implementations
📊 Smart Aggregation        Automatic segment generation with weighted voting & Hanning window
🔄 Distributed Processing   Scale horizontally with multiple workers
⚙️ YAML Config              Simple, declarative configuration management
🖥️ Cross-Platform           Linux/GPU recommended; Windows/CPU supported via the dummy backend

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 β”‚         β”‚                 β”‚         β”‚                 β”‚
β”‚     Server      │────────▢│   Job Queue     │◀────────│     Worker      β”‚
β”‚    (FastAPI)    β”‚         β”‚                 β”‚         β”‚     (VLM)       β”‚
β”‚                 β”‚         β”‚                 β”‚         β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                                       β”‚
         β–Ό                                                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Video Files   β”‚                                     β”‚    VLM Model    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/ly-geming/video2tasks.git
cd video2tasks

# Install with core dependencies
pip install -e .

# Or install with Qwen3-VL support (requires GPU)
pip install -e ".[qwen3vl]"

Configuration

# Copy example config
cp config.example.yaml config.yaml

# Edit with your paths and settings
vim config.yaml  # or your preferred editor

Running

Terminal 1 - Start the Server:

v2t-server --config config.yaml

Terminal 2 - Start a Worker:

v2t-worker --config config.yaml

💡 Tip: You can start multiple workers to process videos in parallel!


⚙️ Configuration

See config.example.yaml for all available options:

Section     Description
datasets    Video dataset paths and subsets
run         Output directory configuration
server      Host, port, and queue settings
worker      VLM backend selection and model paths
windowing   Frame sampling parameters
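For example, a windowing block might look like the following; the key names here are illustrative, so check config.example.yaml for the actual ones:

```yaml
windowing:
  frames_per_window: 16   # frames the VLM sees per inference call
  stride: 8               # window step; overlap means each frame is seen by ~2 windows
```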

🔌 VLM Backends

Dummy Backend (Default)

Lightweight backend for testing and Windows/CPU environments. Returns mock results without loading heavy models.

worker:
  backend: dummy

Qwen3-VL Backend

Full inference using Qwen3-VL-32B-Instruct (or other variants).

Requirements:

  • 🐧 Linux with NVIDIA GPU
  • 💾 24GB+ VRAM (for 32B model)
  • 🔥 PyTorch with CUDA support

worker:
  backend: qwen3vl
  model_path: /path/to/model

Remote API Backend

Use an external API endpoint for inference:

worker:
  backend: remote_api
  api_url: http://your-api-server/infer

📡 API Request/Response Format

Request:

{
  "prompt": "...",
  "images_b64_png": ["...", "..."]
}

Response:

{
  "transitions": [6],
  "instructions": ["Place the fork", "Place the spoon"],
  "thought": "..."
}
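A worker-side exchange can be sketched from this format alone. The helper functions below (build_vlm_request, parse_vlm_response) are illustrative and not part of the video2tasks package; only the JSON field names come from the format above:

```python
import base64

def build_vlm_request(prompt, frames_png):
    """Package a prompt and raw PNG bytes into the documented request body.
    (Illustrative helper -- not part of the video2tasks package.)"""
    return {
        "prompt": prompt,
        "images_b64_png": [base64.b64encode(f).decode("ascii") for f in frames_png],
    }

def parse_vlm_response(body):
    """Pull the fields a worker needs out of the documented response body."""
    return body.get("transitions", []), body.get("instructions", [])

# Sending the request could then look like (requires the `requests` package):
#   resp = requests.post(api_url, json=build_vlm_request(prompt, frames), timeout=120)
#   transitions, instructions = parse_vlm_response(resp.json())
```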

Custom Backend

Implement the VLMBackend interface to add your own:

from video2tasks.vlm.base import VLMBackend

class MyBackend(VLMBackend):
    def infer(self, images, prompt):
        # Run your model on the sampled frames, then return frame-level
        # transition indices plus one instruction per resulting segment.
        return {"transitions": [], "instructions": []}

πŸ“ Project Structure

video2tasks/
├── 📂 src/video2tasks/
│   ├── config.py              # Configuration models
│   ├── prompt.py              # Prompt templates
│   ├── 📂 server/             # FastAPI server
│   │   ├── app.py
│   │   └── windowing.py
│   ├── 📂 worker/             # Worker implementation
│   │   └── runner.py
│   ├── 📂 vlm/                # VLM backends
│   │   ├── dummy.py
│   │   ├── qwen3vl.py
│   │   └── remote_api.py
│   └── 📂 cli/                # CLI entrypoints
│       ├── server.py
│       └── worker.py
├── 📄 config.example.yaml
├── 📄 pyproject.toml
├── 📄 README.md
├── 📄 README_CN.md
└── 📄 LICENSE

🧪 Testing

# Validate configuration
v2t-validate --config config.yaml

# Run tests
pytest

💻 Requirements

Minimum (Dummy Backend)        Recommended (Qwen3-VL)
  • Python 3.8+                  • Python 3.8+
  • 4GB RAM                      • Linux + NVIDIA GPU
  • Any OS                       • 24GB+ VRAM
                                 • CUDA 11.8+ / 12.x

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments


⭐ Star this repo if you find it useful! ⭐

⚠️ WARNING (thanks to YuanJingYi, Sun Yat-sen University, for the usage tip): name each video with a hash-style ID, e.g. 348e206ad4948edee65c82d8c12ae671, and place each video in its own folder, also named with a hash-style ID, e.g. 1be04121f3312610400b559daa5bd7b3.
