Commit 2ba14e7

Video Pipeline/README Improvement (#1096)

* better readme
* better args
* remove video_read_example.py
* Add instruction for IV2

Signed-off-by: Ao Tang <aot@nvidia.com>

1 parent b852ef7 commit 2ba14e7

File tree

3 files changed (+176, −69 lines)

Lines changed: 158 additions & 3 deletions
# Getting Started with Video Curation

The Python scripts in this directory contain examples for how to run video curation workflows with NeMo Curator.

## Scripts Overview

- **`video_split_clip_example.py`**: Complete pipeline that reads videos, splits them into clips, transcodes, filters, generates embeddings/captions, and saves the results
## Quick Start

### Prerequisites

1. **Set up directories**:

```bash
export VIDEO_DIR="/path/to/your/videos"   # Videos to be processed
export OUTPUT_DIR="/path/to/output"
export MODEL_DIR="./models"               # Models are downloaded here if not present
```
2. **Minimal working example**:

```bash
python video_split_clip_example.py \
    --video-dir "$VIDEO_DIR" \
    --output-clip-path "$OUTPUT_DIR" \
    --splitting-algorithm fixed_stride \
    --fixed-stride-split-duration 10.0
```

The example above runs a minimal video curation pipeline with NeMo Curator. It processes all videos in `VIDEO_DIR`, splits each video into fixed-length clips (10 seconds each, as set by `--fixed-stride-split-duration 10.0`), and saves the resulting clips to `OUTPUT_DIR`. This basic workflow can be extended with additional options for embedding, captioning, filtering, and transcoding, as shown in later sections.
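If you drive jobs from Python rather than a shell, the same invocation can be assembled with the standard library. This is a minimal sketch: the script name and flags mirror the example above, while the `build_split_command` helper and the paths are illustrative, not part of NeMo Curator.

```python
import subprocess


def build_split_command(video_dir: str, output_dir: str, stride_s: float = 10.0) -> list[str]:
    """Assemble argv for the minimal fixed-stride splitting run shown above."""
    return [
        "python", "video_split_clip_example.py",
        "--video-dir", video_dir,
        "--output-clip-path", output_dir,
        "--splitting-algorithm", "fixed_stride",
        "--fixed-stride-split-duration", str(stride_s),
    ]


cmd = build_split_command("/path/to/your/videos", "/path/to/output")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually launch the pipeline
```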
### Common Use Cases

**Basic video splitting with embeddings**:

```bash
python video_split_clip_example.py \
    --video-dir "$VIDEO_DIR" \
    --output-clip-path "$OUTPUT_DIR" \
    --splitting-algorithm fixed_stride \
    --fixed-stride-split-duration 10.0 \
    --embedding-algorithm cosmos-embed1-224p
```

This example extends the minimal example above by adding an embedding stage that uses the cosmos-embed1-224p model.
**Scene-aware splitting with TransNetV2**:

```bash
python video_split_clip_example.py \
    --video-dir "$VIDEO_DIR" \
    --output-clip-path "$OUTPUT_DIR" \
    --splitting-algorithm transnetv2 \
    --transnetv2-threshold 0.4 \
    --transnetv2-min-length-s 2.0 \
    --transnetv2-max-length-s 10.0 \
    --embedding-algorithm internvideo2 \
    --transcode-encoder libopenh264 \
    --verbose
```

This more advanced workflow uses scene-aware splitting with the TransNetV2 algorithm (which detects scene boundaries instead of cutting at fixed intervals), applies the InternVideo2 embedding model to each clip, transcodes the output with the libopenh264 encoder, and enables verbose logging.
**Note: Choosing Between InternVideo2 and Cosmos-Embed1 for Embeddings**

Cosmos-Embed1 generally outperforms InternVideo2 on most video embedding tasks, though the optimal choice can vary with your use case. We recommend starting with Cosmos-Embed1 (`cosmos-embed1-224p`); if it does not meet your specific needs or performance expectations, consider InternVideo2 (`internvideo2`) as an alternative.

To install InternVideo2:

InternVideo2 requires a specific installation process involving cloning the repository and applying patches:

```bash
# Run the InternVideo2 installation script from the Curator directory
cd /path/to/Curator
bash external/intern_video2_installation.sh

uv add InternVideo/InternVideo2/multi_modality
```

After running this script, InternVideo2 is available via `--embedding-algorithm internvideo2` in your video curation pipelines.
**Full pipeline with captions and filtering**:

```bash
python video_split_clip_example.py \
    --video-dir "$VIDEO_DIR" \
    --output-clip-path "$OUTPUT_DIR" \
    --splitting-algorithm fixed_stride \
    --fixed-stride-split-duration 10.0 \
    --embedding-algorithm cosmos-embed1-224p \
    --generate-captions \
    --aesthetic-threshold 3.5 \
    --motion-filter enable
```

This is the most comprehensive pipeline of the examples above. In addition to splitting videos and generating embeddings, it generates captions for each clip and filters on aesthetic and motion scores: only clips meeting the specified quality thresholds (`--aesthetic-threshold 3.5`, `--motion-filter enable`) are kept, and captions are generated for each surviving clip. This workflow is useful for curating high-quality, captioned video datasets with automated quality control.
## Output Structure

The pipeline creates the following directory structure:

```
$OUTPUT_DIR/
├── clips/                  # Encoded clip videos (.mp4)
├── filtered_clips/         # Filtered-out clips (.mp4)
├── previews/               # Preview images (.webp)
├── metas/v0/               # Per-clip metadata (.json)
├── iv2_embd/               # InternVideo2 embeddings (.pickle)
├── ce1_embd/               # Cosmos-Embed1 embeddings (.pickle)
├── iv2_embd_parquet/       # InternVideo2 embeddings (Parquet)
├── ce1_embd_parquet/       # Cosmos-Embed1 embeddings (Parquet)
├── processed_videos/       # Video-level metadata
└── processed_clip_chunks/  # Per-chunk statistics
```
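After a run completes, a quick sanity check is to count the artifacts in each subdirectory. The sketch below follows the directory names in the tree above; the helper itself is illustrative and assumes the default output layout.

```python
from pathlib import Path

# Directory name -> artifact file pattern, following the output tree above.
ARTIFACT_PATTERNS = {
    "clips": "*.mp4",
    "filtered_clips": "*.mp4",
    "previews": "*.webp",
    "metas/v0": "*.json",
    "iv2_embd": "*.pickle",
    "ce1_embd": "*.pickle",
    "iv2_embd_parquet": "*.parquet",
    "ce1_embd_parquet": "*.parquet",
}


def count_artifacts(output_dir: str) -> dict[str, int]:
    """Count produced files per output subdirectory (0 if the stage didn't run)."""
    root = Path(output_dir)
    return {
        name: len(list((root / name).glob(pattern))) if (root / name).is_dir() else 0
        for name, pattern in ARTIFACT_PATTERNS.items()
    }
```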
## Metadata Schema

Each clip generates a JSON metadata file in `metas/v0/` with the following structure:

```json
{
  "span_uuid": "d2d0b3d1-...",
  "source_video": "/path/to/source/video.mp4",
  "duration_span": [0.0, 5.0],
  "width_source": 1920,
  "height_source": 1080,
  "framerate_source": 30.0,
  "clip_location": "/outputs/clips/d2/d2d0b3d1-....mp4",
  "motion_score": {
    "global_mean": 0.51,
    "per_patch_min_256": 0.29
  },
  "aesthetic_score": 0.72,
  "windows": [
    {
      "start_frame": 0,
      "end_frame": 30,
      "qwen_caption": "A person walks across a room",
      "qwen_lm_enhanced_caption": "A person briskly crosses a bright modern room"
    }
  ],
  "valid": true
}
```
### Metadata Fields

- **`span_uuid`**: Unique identifier for the clip
- **`source_video`**: Path to the original video file
- **`duration_span`**: Start and end times in seconds `[start, end]`
- **`width_source`**, **`height_source`**, **`framerate_source`**: Original video properties
- **`clip_location`**: Path to the encoded clip file
- **`motion_score`**: Motion analysis scores (if motion filtering is enabled)
- **`aesthetic_score`**: Aesthetic quality score (if aesthetic filtering is enabled)
- **`windows`**: Caption windows with generated text (if captioning is enabled)
- **`valid`**: Whether the clip passed all filters
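As a sketch of how downstream code might consume this metadata, the helpers below read the per-clip JSON files and select clips by the fields described above. The helper names and the aesthetic threshold are illustrative examples, not part of NeMo Curator.

```python
import json
from pathlib import Path


def load_clip_metadata(metas_dir: str) -> list[dict]:
    """Load every per-clip JSON record from a metas/v0/ directory."""
    return [json.loads(p.read_text()) for p in sorted(Path(metas_dir).glob("*.json"))]


def select_clips(records: list[dict], min_aesthetic: float = 0.5) -> list[str]:
    """Keep clip paths that passed all filters and meet an aesthetic cutoff."""
    return [
        r["clip_location"]
        for r in records
        if r.get("valid") and r.get("aesthetic_score", 0.0) >= min_aesthetic
    ]
```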
## Embedding Formats

### Parquet Files

Embeddings are stored in Parquet format with two columns:

- **`id`**: String UUID for the clip
- **`embedding`**: List of float values (512 dimensions for InternVideo2, 768 for Cosmos-Embed1)
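A lightweight way to validate these files after loading them (for example with `pandas.read_parquet`, shown only in comments) is to check each row against the expected dimension. The dimensions come from the column description above; the helper itself is an illustrative sketch.

```python
# Expected embedding width per model (see the column description above).
EXPECTED_DIMS = {"internvideo2": 512, "cosmos-embed1-224p": 768}


def check_embedding_records(records: list[dict], model: str) -> int:
    """Validate a list of {'id': str, 'embedding': list[float]} rows; return the count."""
    dim = EXPECTED_DIMS[model]
    for row in records:
        if not isinstance(row["id"], str) or len(row["embedding"]) != dim:
            raise ValueError(f"bad record: {row.get('id')!r}")
    return len(records)


# With pandas/pyarrow installed, a Parquet directory can be checked like this:
#   df = pd.read_parquet("/path/to/output/ce1_embd_parquet")
#   check_embedding_records(df.to_dict("records"), "cosmos-embed1-224p")
```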
### Pickle Files

Individual clip embeddings are also saved as `.pickle` files for direct access.
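These pickles can be read directly with the standard library. Since the exact stored object is not documented here, the sketch below only demonstrates the access pattern with an illustrative round-trip; the file name and vector are placeholders.

```python
import pickle
from pathlib import Path


def load_embedding(path: str):
    """Load one per-clip embedding pickle; the stored object layout may vary by version."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Illustrative round-trip: write a vector the way a stage might, then read it back.
demo = Path("demo_embedding.pickle")
demo.write_bytes(pickle.dumps([0.1, 0.2, 0.3]))
vec = load_embedding(str(demo))
print(len(vec))  # 3
demo.unlink()
```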

tutorials/video/getting-started/video_read_example.py

Lines changed: 0 additions & 64 deletions
This file was deleted.

tutorials/video/getting-started/video_split_clip_example.py

Lines changed: 18 additions & 2 deletions
```diff
@@ -283,7 +283,23 @@ def main(args: argparse.Namespace) -> None:
     parser = argparse.ArgumentParser()
     # General arguments
     parser.add_argument("--video-dir", type=str, required=True, help="Path to input video directory")
-    parser.add_argument("--model-dir", type=str, required=True, help="Path to model directory")
+    parser.add_argument(
+        "--model-dir",
+        type=str,
+        default="./models",
+        help=(
+            "Path to model directory containing required model weights. "
+            "Models will be automatically downloaded on first use if not present. "
+            "Required models depend on selected algorithms:\n"
+            " - TransNetV2: For scene detection (--splitting-algorithm transnetv2)\n"
+            " - InternVideo2: For embeddings (--embedding-algorithm internvideo2)\n"
+            " - Cosmos-Embed1: For embeddings (--embedding-algorithm cosmos-embed1-*)\n"
+            " - Qwen: For captioning (--generate-captions)\n"
+            " - Aesthetic models: For filtering (--aesthetic-threshold)\n"
+            "Default: ./models\n"
+            "Example: --model-dir /path/to/models or --model-dir ./models"
+        )
+    )
     parser.add_argument("--video-limit", type=int, default=None, help="Limit the number of videos to read")
     parser.add_argument("--verbose", action="store_true", default=False)
     parser.add_argument("--output-clip-path", type=str, help="Path to output clips", required=True)
@@ -487,7 +503,7 @@ def main(args: argparse.Namespace) -> None:
         "--clip-extraction-target-res",
         type=int,
         default=-1,
-        help="Target resolution for clip extraction as (height, width). A value of -1 implies disables resize",
+        help="Target resolution for clip extraction as a square (height=width). A value of -1 disables resize",
     )
     # Aesthetic arguments
     parser.add_argument(
```

0 commit comments
