Split Multi-Task Robot Videos into Single-Task Segments with Auto-Generated Instructions for VLA Training
When training VLA (Vision-Language-Action) models like π₀ (pi-zero), you need single-task video segments with instruction labels. However, real-world robot demonstration videos often contain multiple consecutive tasks without any annotation:
```
Input: Long video with multiple tasks, NO labels
                        │
                        ▼
┌───────────────────────────────────────────────────────────────┐
│  🎬 Video2Tasks                                               │
│   • VLM-powered task boundary detection                       │
│   • Auto-generate natural language instructions               │
│   • Distributed processing for large-scale datasets           │
└───────────────────────────────────────────────────────────────┘
                        │
                        ▼
Output: Single-task segments + instruction labels, READY for VLA training

  segment_001.mp4        segment_002.mp4        segment_003.mp4
  "Pick up the fork"     "Place the fork"       "Pick up the spoon"
```
Video2Tasks = Task Segmentation + Instruction Labeling → VLA Training Data Pipeline
This tool uses a distributed client-server architecture with VLMs (like Qwen3-VL) to analyze video frames, intelligently detect task boundaries, and generate natural language instructions for each segment.
| Component | Description |
|---|---|
| Server | Manages job queues, video frame extraction, and result aggregation |
| Worker | Runs VLM inference to detect task transitions and generate instructions |
The VLM analyzes each overlapping frame window and provides detailed reasoning about task transitions:
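The exact sampling behavior is controlled by the `windowing` config section; as a rough sketch of how overlapping windows can be laid out over a long video (the window length and stride values below are illustrative, not the tool's defaults):

```python
def make_windows(nframes: int, window: int = 300, stride: int = 150):
    """Yield (window_id, start, end) frame spans that overlap by window - stride."""
    starts = range(0, max(nframes - window, 0) + 1, stride)
    spans = [(i, s, min(s + window, nframes)) for i, s in enumerate(starts)]
    # Make sure the tail of the video is covered by a final window.
    if spans and spans[-1][2] < nframes:
        spans.append((len(spans), nframes - window, nframes))
    return spans

windows = make_windows(4501, window=300, stride=150)
print(windows[0])   # (0, 0, 300)
print(windows[-1])  # last window ends at frame 4501
```

Because each frame falls inside several windows, a transition missed by one window can still be caught by its neighbors.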
🔍 Click to see VLM reasoning for multiple windows
Window 0 - Detecting bag → mask transition:

```json
{
  "task_id": "LongData601-1189::1765279974654_w0",
  "window_id": 0,
  "vlm_json": {
    "thought": "Frames 0-2: The person is standing, hands open, wearing gloves, facing the room. No object interaction yet. Frame 3: The person reaches toward a white tote bag on the sofa. Frame 4: The person grasps the tote bag and begins lifting it. Frames 5-11: The person continues manipulating the tote bag, opening it, adjusting its straps, and handling its contents. This is a continuous interaction with the same object (the tote bag). Frame 12: The person reaches into the bag and pulls out a white object with a black strap (likely a mask or headgear). The interaction from frame 12 onward shifts from the tote bag to the white object (mask). Therefore, a switch occurs at frame 12.",
    "transitions": [12],
    "instructions": ["Pick up and manipulate the tote bag", "Retrieve and adjust the white face mask"]
  }
}
```

Window 3 - Detecting multiple object switches:
```json
{
  "task_id": "LongData601-1189::1765279974654_w3",
  "window_id": 3,
  "vlm_json": {
    "thought": "Frames 0-2: The robot's left hand reaches for and grasps a small black object from the left table. The right hand holds a white tote bag. Frames 3-5: The left hand places the black object into the tote bag. Frames 6-7: The left hand releases the black object into the bag and then reaches back to pick up another small black object. This is a clear switch: the robot completes interaction with the first black object and starts interacting with a second, distinct black object. Frame 15: The robot reaches for the white kettle on the left table. This marks a new interaction with a different object (the kettle). Therefore, switches are detected at frame 6 (first→second black object) and frame 15 (bag→kettle).",
    "transitions": [6, 15],
    "instructions": ["Place the first black object into the tote bag", "Place the second black object into the tote bag", "Pick up the white kettle"]
  }
}
```

Window 8 - No switch detected (continuous task):
```json
{
  "task_id": "LongData601-1189::1765279974654_w8",
  "window_id": 8,
  "vlm_json": {
    "thought": "Frames 0-2: The robot is holding a small white object (possibly a charger or adapter) in both hands and appears to be manipulating or connecting it to a cable. Frames 3-15: The robot continues to hold the same object and cable, moving it toward an electrical outlet near the TV stand. There is no clear release of the object or grasping of a new distinct object. The entire sequence involves manipulating a single object. No switch occurs.",
    "transitions": [],
    "instructions": ["Move the charger toward the power outlet"]
  }
}
```

Window 14 - Complex multi-object sequence:
```json
{
  "task_id": "LongData601-1189::1765279974654_w14",
  "window_id": 14,
  "vlm_json": {
    "thought": "Frames 0-2: The person is interacting with objects on the left table: moving a small white cup and adjusting a black rectangular object. Frame 3: The right hand moves away from the black object and reaches toward the ashtray. This marks the end of interaction with the black object. Frame 4: The right hand grasps the ashtray. This is a clear switch to a new object. Frames 5-7: The person moves the ashtray toward the trash can. Frame 11: The right hand reaches down to pick up a pair of white slippers from the floor. This is a clear switch from ashtray to slippers. Switches occur at frame 3 (black object→ashtray) and frame 11 (ashtray→slippers).",
    "transitions": [3, 11],
    "instructions": ["Move the black rectangular object and cup", "Pick up the ashtray", "Pick up the white slippers", "Place the slippers on the rack"]
  }
}
```

A 4501-frame video automatically split into 16 single-task segments:
```json
{
  "video_id": "1765279974654",
  "nframes": 4501,
  "segments": [
    {"seg_id": 0, "start_frame": 0, "end_frame": 373, "instruction": "Pick up and manipulate the tote bag"},
    {"seg_id": 1, "start_frame": 373, "end_frame": 542, "instruction": "Retrieve and adjust the white face mask"},
    {"seg_id": 2, "start_frame": 542, "end_frame": 703, "instruction": "Open and place items into the bag"},
    {"seg_id": 3, "start_frame": 703, "end_frame": 912, "instruction": "Place the first black object into the tote bag"},
    {"seg_id": 4, "start_frame": 912, "end_frame": 1214, "instruction": "Place the second black object into the tote bag"},
    {"seg_id": 5, "start_frame": 1214, "end_frame": 1375, "instruction": "Place the white cup on the table"},
    {"seg_id": 6, "start_frame": 1375, "end_frame": 1524, "instruction": "Move the cup to the right table"},
    {"seg_id": 7, "start_frame": 1524, "end_frame": 1784, "instruction": "Connect the power adapter to the cable"},
    {"seg_id": 8, "start_frame": 1784, "end_frame": 2991, "instruction": "Plug the device into the power strip"},
    {"seg_id": 9, "start_frame": 2991, "end_frame": 3135, "instruction": "Interact with black object on coffee table"},
    {"seg_id": 10, "start_frame": 3135, "end_frame": 3238, "instruction": "Adjust the ashtray"},
    {"seg_id": 11, "start_frame": 3238, "end_frame": 3359, "instruction": "Interact with the white mug"},
    {"seg_id": 12, "start_frame": 3359, "end_frame": 3478, "instruction": "Move the black rectangular object and cup"},
    {"seg_id": 13, "start_frame": 3478, "end_frame": 3711, "instruction": "Pick up the ashtray"},
    {"seg_id": 14, "start_frame": 3711, "end_frame": 4095, "instruction": "Move the white slippers from the shoe rack"},
    {"seg_id": 15, "start_frame": 4095, "end_frame": 4501, "instruction": "Raise the window blind"}
  ]
}
```

🎯 Each segment contains exactly ONE task with an auto-generated natural language instruction, ready for VLA training!
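Downstream, that `segments` list is trivial to consume. A minimal sketch (the field names come from the manifest above; `load_segments` itself is a hypothetical helper, not part of the tool's API):

```python
import json

def load_segments(manifest_json: str):
    """Yield (start_frame, end_frame, instruction) triples from a segments manifest."""
    for seg in json.loads(manifest_json)["segments"]:
        yield seg["start_frame"], seg["end_frame"], seg["instruction"]

manifest = """
{"video_id": "1765279974654", "nframes": 4501,
 "segments": [
   {"seg_id": 0, "start_frame": 0, "end_frame": 373,
    "instruction": "Pick up and manipulate the tote bag"}]}
"""
for start, end, text in load_segments(manifest):
    print(f"frames {start}-{end}: {text}")
```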
- Not just a single script: FastAPI acts as the orchestrator, while Workers handle inference only. Run the Server on one 4090, then connect 10 machines running Workers to process massive datasets in parallel. This is production-grade thinking.
- Critical mechanisms for running large-scale tasks to completion.
- Not just throwing images at a model: solves the classic "unstable edge detection" problem.
- Tailored for manipulation datasets, significantly reducing over-segmentation.
| Feature | Description |
|---|---|
| 🎥 Video Windowing | Configurable video window sampling parameters |
| 🤖 Pluggable Backends | Support for Qwen3-VL, Remote API, or custom VLM implementations |
| 📊 Smart Aggregation | Automatic segment generation with weighted voting & Hanning window |
| 🌐 Distributed Processing | Scale horizontally with multiple workers |
| ⚙️ YAML Config | Simple, declarative configuration management |
| 🖥️ Cross-Platform | Linux/GPU recommended; Windows/CPU with dummy backend |
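The aggregation step is where overlapping windows pay off: the same frame is seen by several windows, and each window's vote for a transition can be weighted by a Hanning window so detections near a window's edges (where temporal context is thin) count less. A pure-Python sketch of that idea; the vote threshold and the exact weighting are illustrative assumptions, not the tool's actual parameters:

```python
import math
from collections import defaultdict

def hanning_weight(pos: int, length: int) -> float:
    """Hanning weight in [0, 1]: largest mid-window, smallest at the edges."""
    return 0.5 - 0.5 * math.cos(2 * math.pi * pos / (length - 1))

def aggregate_transitions(window_results, window_len, threshold=0.6):
    """window_results: (window_start_frame, [local transition indices]) pairs.

    Each local detection votes for a global frame index; votes from the
    middle of a window outweigh votes near its edges (illustrative threshold).
    """
    votes = defaultdict(float)
    for start, local_transitions in window_results:
        for t in local_transitions:
            votes[start + t] += hanning_weight(t, window_len)
    return sorted(frame for frame, v in votes.items() if v >= threshold)

# Two overlapping 16-frame windows both see a boundary at global frame 12:
# one at local index 12 (near its edge), one at local index 4.
print(aggregate_transitions([(0, [12]), (8, [4])], window_len=16))  # [12]
```

A spurious detection seen by only one window near its edge never reaches the threshold, which is what suppresses the classic "unstable edge detection" problem.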
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│                 │      │                 │      │                 │
│     Server      │─────▶│    Job Queue    │◀─────│     Worker      │
│    (FastAPI)    │      │                 │      │     (VLM)       │
│                 │      │                 │      │                 │
└────────┬────────┘      └─────────────────┘      └────────┬────────┘
         │                                                 │
         ▼                                                 ▼
┌─────────────────┐                              ┌─────────────────┐
│   Video Files   │                              │    VLM Model    │
└─────────────────┘                              └─────────────────┘
```
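Conceptually, the Server is a job queue that Workers pull from and push results back to. As an in-process analogy only (the real tool speaks HTTP via FastAPI; this stdlib sketch just illustrates the pull-based pattern that lets you add workers without coordination):

```python
import queue
import threading

jobs = queue.Queue()  # stands in for the Server's job queue (one job per window)
results = {}          # aggregated worker results, keyed by window id

def worker(worker_id: int) -> None:
    """Pull jobs until the queue drains; push back a mock inference result."""
    while True:
        try:
            window_id = jobs.get_nowait()
        except queue.Empty:
            return
        # Stand-in for VLM inference on that window's frames.
        results[window_id] = {"transitions": [], "instructions": []}
        jobs.task_done()

for w in range(16):  # e.g. 16 windows extracted from one long video
    jobs.put(w)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 16
```

Because workers only pull, a slow or crashed worker never blocks the others; the same property is what makes the HTTP version scale horizontally.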
```bash
# Clone the repository
git clone https://github.com/ly-geming/video2tasks.git
cd video2tasks

# Install with core dependencies
pip install -e .

# Or install with Qwen3-VL support (requires GPU)
pip install -e ".[qwen3vl]"
```

```bash
# Copy example config
cp config.example.yaml config.yaml

# Edit with your paths and settings
vim config.yaml  # or your preferred editor
```

Terminal 1 - Start the Server:

```bash
v2t-server --config config.yaml
```

Terminal 2 - Start a Worker:

```bash
v2t-worker --config config.yaml
```

💡 Tip: You can start multiple workers to process videos in parallel!
See `config.example.yaml` for all available options:

| Section | Description |
|---|---|
| `datasets` | Video dataset paths and subsets |
| `run` | Output directory configuration |
| `server` | Host, port, and queue settings |
| `worker` | VLM backend selection and model paths |
| `windowing` | Frame sampling parameters |
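Pieced together from those section names, a config might look roughly like this. The keys inside each section are illustrative guesses (apart from `backend` and `model_path`, which appear in the backend examples); `config.example.yaml` is the authoritative reference:

```yaml
datasets:
  root: /data/robot_videos        # illustrative path
  subsets: ["LongData601-1189"]
run:
  output_dir: ./runs/demo
server:
  host: 0.0.0.0
  port: 8000
worker:
  backend: qwen3vl                # dummy | qwen3vl | remote_api
  model_path: /path/to/model
windowing:
  # frame sampling parameters, e.g. window length and stride (guesses)
  window_len: 16
  stride: 8
```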
Lightweight backend for testing and Windows/CPU environments. Returns mock results without loading heavy models.
```yaml
worker:
  backend: dummy
```

Full inference using Qwen3-VL-32B-Instruct (or other variants).
Requirements:
- 🐧 Linux with NVIDIA GPU
- 💾 24GB+ VRAM (for the 32B model)
- 🔥 PyTorch with CUDA support
```yaml
worker:
  backend: qwen3vl
  model_path: /path/to/model
```

Use an external API endpoint for inference:
```yaml
worker:
  backend: remote_api
  api_url: http://your-api-server/infer
```

💡 API Request/Response Format
Request:

```json
{
  "prompt": "...",
  "images_b64_png": ["...", "..."]
}
```

Response:

```json
{
  "transitions": [6],
  "instructions": ["Place the fork", "Place the spoon"],
  "thought": "..."
}
```

Implement the VLMBackend interface to add your own:
```python
from video2tasks.vlm.base import VLMBackend

class MyBackend(VLMBackend):
    def infer(self, images, prompt):
        # Your inference logic
        return {"transitions": [], "instructions": []}
```

```
video2tasks/
├── 📁 src/video2tasks/
│   ├── config.py              # Configuration models
│   ├── prompt.py              # Prompt templates
│   ├── 📁 server/             # FastAPI server
│   │   ├── app.py
│   │   └── windowing.py
│   ├── 📁 worker/             # Worker implementation
│   │   └── runner.py
│   ├── 📁 vlm/                # VLM backends
│   │   ├── dummy.py
│   │   ├── qwen3vl.py
│   │   └── remote_api.py
│   └── 📁 cli/                # CLI entrypoints
│       ├── server.py
│       └── worker.py
├── 📄 config.example.yaml
├── 📄 pyproject.toml
├── 📄 README.md
├── 📄 README_CN.md
└── 📄 LICENSE
```
```bash
# Validate configuration
v2t-validate --config config.yaml

# Run tests
pytest
```

System requirements: the dummy backend runs on CPU on any platform; the Qwen3-VL backend needs a Linux machine with an NVIDIA GPU (24GB+ VRAM for the 32B model).
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with FastAPI
- VLM support via Transformers
- Inspired by robotic video analysis research

