
Commit 2bde7fb

arhamm1 and lbliii authored
Update get-started/video.md (#1261)
Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>
Co-authored-by: L.B. <llane@nvidia.com>
1 parent 2ffc9bd commit 2bde7fb


docs/get-started/video.md

Lines changed: 160 additions & 41 deletions

This guide shows how to install Curator and run your first video curation pipeline.

The [example pipeline](#run-the-splitting-pipeline-example) processes a list of videos, splitting each into 10‑second clips using a fixed stride. It then generates clip‑level embeddings for downstream tasks such as duplicate removal and similarity search.

## Overview

This quickstart guide demonstrates how to:

1. **Install NeMo Curator** with video processing support
2. **Set up FFmpeg** with GPU-accelerated encoding
3. **Configure embedding models** (Cosmos-Embed1 or InternVideo2)
4. **Process videos** through a complete splitting and embedding pipeline
5. **Generate outputs** ready for duplicate removal, captioning, and model training

**What you'll build:** A video processing pipeline that:

- Splits videos into 10-second clips using fixed stride or scene detection
- Generates clip-level embeddings for similarity search and deduplication
- Optionally creates captions and preview images
- Outputs results in formats compatible with multimodal training workflows

## Prerequisites

### System Requirements

To use NeMo Curator's video curation capabilities, ensure your system meets these requirements:

#### Operating System

* **Ubuntu 24.04, 22.04, or 20.04** (required for GPU-accelerated video processing)
* Other Linux distributions may work but are not officially supported

#### Python Environment

* **Python 3.10, 3.11, or 3.12**
* **uv package manager** for dependency management
* **Git** for model and repository dependencies

#### GPU Requirements

* **NVIDIA GPU required** (CPU-only mode not supported for video processing)
* **Architecture**: Volta™ or newer (compute capability 7.0+)
  - Examples: V100, T4, RTX 2080+, A100, H100
* **CUDA**: Version 12.0 or above
* **VRAM**: Minimum requirements by configuration:
  - Basic splitting + embedding: ~16 GB VRAM
  - Full pipeline (splitting + embedding + captioning): ~38 GB VRAM
  - Reduced configuration (lower batch sizes, FP8): ~21 GB VRAM
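
You can quickly confirm that the GPU, driver, and CUDA version meet these requirements with standard NVIDIA tools (output format varies by driver version):

```bash
# Show GPU model, driver version, CUDA runtime version, and available VRAM
nvidia-smi

# Show the installed CUDA toolkit version, if nvcc is on your PATH
nvcc --version
```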

#### Software Dependencies

* **FFmpeg 7.0+** with H.264 encoding support
  - GPU encoder: `h264_nvenc` (recommended for performance)
  - CPU encoders: `libopenh264` or `libx264` (fallback options)
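
To check your FFmpeg version and see which of these H.264 encoders your build provides (a generic check; encoder availability depends on how FFmpeg was compiled):

```bash
# Confirm the FFmpeg version is 7.0 or newer
ffmpeg -version | head -n 1

# List the H.264 encoders compiled into this FFmpeg build
ffmpeg -hide_banner -encoders | grep -E "h264_nvenc|libopenh264|libx264"
```

If an encoder you need is missing, reinstall FFmpeg with the required build options, or refer to [Clip Encoding](video-process-transcoding) for alternatives.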
:::{tip}
If you don't have `uv` installed, refer to the [Installation Guide](../admin/installation.md) for setup instructions, or install it quickly with the standalone installer:
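
```bash
# Download and run the uv standalone installer (upstream script from the uv
# project; shown as a convenience, the Installation Guide is authoritative)
curl -LsSf https://astral.sh/uv/install.sh | sh
```
:::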
Refer to [Clip Encoding](video-process-transcoding) to choose encoders and verify NVENC support on your system.

## Choose Embedding Model

### Available Models

Embeddings convert each video clip into a numeric vector that captures visual and semantic content. Curator uses these vectors to:

- Remove near-duplicate clips during duplicate removal
- Enable similarity search and clustering
- Support downstream analysis such as caption verification

NeMo Curator supports two embedding model families:

#### Cosmos-Embed1 (Recommended)

Cosmos-Embed1 is the default model. It is available in three variants, **cosmos-embed1-224p**, **cosmos-embed1-336p**, and **cosmos-embed1-448p**, which differ in input resolution and the accuracy/VRAM tradeoff. All variants are downloaded automatically to `MODEL_DIR` on first run.

| Model Variant | Resolution | VRAM Usage | Speed | Accuracy | Best For |
|---------------|------------|------------|-------|----------|----------|
| **cosmos-embed1-224p** | 224×224 | ~8 GB | Fastest | Good | Large-scale processing, initial curation |
| **cosmos-embed1-336p** | 336×336 | ~12 GB | Medium | Better | Balanced performance and quality |
| **cosmos-embed1-448p** | 448×448 | ~16 GB | Slower | Best | High-quality embeddings, fine-grained matching |

**Model links:**

- [cosmos-embed1-224p on Hugging Face](https://huggingface.co/nvidia/cosmos-embed1-224p)
- [cosmos-embed1-336p on Hugging Face](https://huggingface.co/nvidia/cosmos-embed1-336p)
- [cosmos-embed1-448p on Hugging Face](https://huggingface.co/nvidia/cosmos-embed1-448p)

#### InternVideo2 (IV2)

InternVideo2 is an open model that requires the IV2 checkpoint and BERT model files to be available locally, and it has higher VRAM usage than Cosmos-Embed1.

- [InternVideo official GitHub page](https://github.com/OpenGVLab/InternVideo)

For this quickstart, we'll set up support for **Cosmos-Embed1-224p**.
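
For most use cases, you only need to create a local model directory; the Cosmos-Embed1-224p weights are downloaded into it automatically on the first run (the path below is an example placeholder):

```bash
# Create the directory that will hold downloaded model weights
MODEL_DIR=/path/to/models
mkdir -p "$MODEL_DIR"
```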
## Set Up Data Directories

Organize input videos and output locations before running the pipeline.

- **Local**: For local file processing, define paths like:

  ```bash
  DATA_DIR=/path/to/videos
  OUT_DIR=/path/to/output_clips
  MODEL_DIR=/path/to/models
  ```

- **S3**: For cloud storage (AWS S3, MinIO, and other S3-compatible stores). Configure credentials in `~/.aws/credentials` and use `s3://` paths for `--video-dir` and `--output-clip-path`.

**S3 usage notes:**

- Input videos can be read from S3 paths
- Output clips can be written to S3 paths
- The model directory should remain local for performance
- Ensure IAM permissions allow read/write access to the specified buckets
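
Before a long run against S3, a quick credential and permission check with the AWS CLI can save time (bucket names and prefixes below are placeholders):

```bash
# Confirm read access to the input prefix
aws s3 ls s3://your-bucket/videos/

# Confirm write access to the output prefix with a small marker object
echo ok > /tmp/curator_write_test.txt
aws s3 cp /tmp/curator_write_test.txt s3://your-bucket/output_clips/_write_test.txt
aws s3 rm s3://your-bucket/output_clips/_write_test.txt
```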

## Run the Splitting Pipeline Example

```bash
python -m nemo_curator.examples.video.video_split_clip_example \
  ... \
  --verbose
```

**What this command does:**

1. Reads all video files from `$DATA_DIR`
2. Splits each video into 10-second clips using a fixed stride
3. Generates embeddings with the Cosmos-Embed1-224p model
4. Encodes clips with the `libopenh264` encoder
5. Writes output clips and metadata to `$OUT_DIR`
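
Putting these pieces together, a full invocation might look like the sketch below. The `--video-dir` and `--output-clip-path` flags are the ones named in the data directory section; the model directory flag name is an assumption, so run the module with `--help` to confirm the exact flags and defaults:

```bash
# Sketch of a full run assembled from the options documented on this page.
# --model-dir is an assumed flag name; verify with --help before relying on it.
python -m nemo_curator.examples.video.video_split_clip_example \
  --video-dir "$DATA_DIR" \
  --output-clip-path "$OUT_DIR" \
  --model-dir "$MODEL_DIR" \
  --splitting-algorithm fixed_stride \
  --fixed-stride-split-duration 10.0 \
  --embedding-algorithm cosmos-embed1-224p \
  --transcode-encoder libopenh264 \
  --verbose
```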

### Configuration Options Reference

| Option | Values | Description |
|--------|--------|-------------|
| **Splitting** | | |
| `--splitting-algorithm` | `fixed_stride`, `transnetv2` | Method for dividing videos into clips |
| `--fixed-stride-split-duration` | Float (seconds) | Clip length for fixed stride (default: 10.0) |
| `--transnetv2-frame-decoder-mode` | `pynvc`, `ffmpeg_gpu`, `ffmpeg_cpu` | Frame decoding method for TransNetV2 |
| **Embedding** | | |
| `--embedding-algorithm` | `cosmos-embed1-224p`, `cosmos-embed1-336p`, `cosmos-embed1-448p`, `internvideo2` | Embedding model to use |
| **Encoding** | | |
| `--transcode-encoder` | `h264_nvenc`, `libopenh264`, `libx264` | Video encoder for output clips |
| `--transcode-use-hwaccel` | Flag | Enable hardware acceleration for encoding |
| **Optional Features** | | |
| `--generate-captions` | Flag | Generate text captions for each clip |
| `--generate-previews` | Flag | Create preview images for each clip |
| `--verbose` | Flag | Enable detailed logging output |

:::{tip}
To use InternVideo2 instead, set `--embedding-algorithm internvideo2`.
:::
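
As another illustration of combining the flags above, a scene-detection run that uses GPU encoding and also writes preview images could look like this (values are illustrative, not defaults):

```bash
# Illustrative variant: TransNetV2 scene-based splitting with NVENC encoding and previews
python -m nemo_curator.examples.video.video_split_clip_example \
  --video-dir "$DATA_DIR" \
  --output-clip-path "$OUT_DIR" \
  --splitting-algorithm transnetv2 \
  --transnetv2-frame-decoder-mode ffmpeg_gpu \
  --embedding-algorithm cosmos-embed1-224p \
  --transcode-use-hwaccel \
  --transcode-encoder h264_nvenc \
  --generate-previews \
  --verbose
```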

### Understanding Pipeline Output

After successful execution, the output directory will contain:

```
$OUT_DIR/
├── clips/
│   ├── video1_clip_0000.mp4
│   ├── video1_clip_0001.mp4
│   └── ...
├── embeddings/
│   ├── video1_clip_0000.npy
│   ├── video1_clip_0001.npy
│   └── ...
├── metadata/
│   └── manifest.jsonl
└── previews/    (if --generate-previews is enabled)
    ├── video1_clip_0000.jpg
    └── ...
```

**File descriptions:**

- **clips/**: Encoded video clips (MP4 format)
- **embeddings/**: NumPy arrays containing clip embeddings (for similarity search)
- **metadata/manifest.jsonl**: JSONL file with clip metadata (paths, timestamps, and embedding references)
- **previews/**: Thumbnail images for each clip (optional)

**Example manifest entry:**

```json
{
  "video_path": "/data/input_videos/video1.mp4",
  "clip_path": "/data/output_clips/clips/video1_clip_0000.mp4",
  "start_time": 0.0,
  "end_time": 10.0,
  "embedding_path": "/data/output_clips/embeddings/video1_clip_0000.npy",
  "preview_path": "/data/output_clips/previews/video1_clip_0000.jpg"
}
```
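
After a run you can spot-check the results from the shell; the commands below assume the directory layout shown above:

```bash
# Count the generated clips and embeddings
ls "$OUT_DIR"/clips | wc -l
ls "$OUT_DIR"/embeddings | wc -l

# Pretty-print the first manifest entry
head -n 1 "$OUT_DIR"/metadata/manifest.jsonl | python -m json.tool
```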

## Best Practices

### Data Preparation

- **Validate input videos**: Ensure videos are not corrupted before processing
- **Use consistent formats**: Convert videos to a standard container and codec (MP4 with H.264) for predictable results
- **Organize by content**: Group similar videos together for efficient processing
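
One way to flag unreadable or corrupted files before a run is to probe each video with `ffprobe` (shipped with FFmpeg); the loop below assumes a flat directory of `.mp4` files:

```bash
# Report any file that ffprobe cannot parse
for f in "$DATA_DIR"/*.mp4; do
  ffprobe -v error -show_entries format=duration -of csv=p=0 "$f" > /dev/null \
    || echo "Unreadable: $f"
done
```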

### Model Selection

- **Start with Cosmos-Embed1-224p**: Best balance of speed and quality for initial experiments
- **Upgrade resolution as needed**: Use 336p or 448p only when higher precision is required
- **Monitor VRAM usage**: Check GPU memory with `nvidia-smi` during processing

### Pipeline Configuration

- **Enable verbose logging**: Use the `--verbose` flag for debugging and monitoring
- **Test on a small subset**: Run the pipeline on 5-10 videos before processing large datasets, as sketched below
- **Use GPU encoding**: Enable NVENC for significant performance improvements
- **Save intermediate results**: Keep embeddings and metadata for downstream tasks
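
A minimal way to stage such a subset, assuming a flat directory of `.mp4` files (adjust the pattern and paths for your data):

```bash
# Copy the first five videos into a scratch directory for a trial run
TEST_DIR=/path/to/test_videos
mkdir -p "$TEST_DIR"
find "$DATA_DIR" -maxdepth 1 -name '*.mp4' | head -n 5 | xargs -I{} cp {} "$TEST_DIR"/
```

Point `--video-dir` at `$TEST_DIR` for the trial, then switch back to the full dataset.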

### Infrastructure

- **Use shared storage**: Mount a shared filesystem for multi-node processing
- **Allocate sufficient VRAM**: Plan for peak usage (captioning + embedding)
- **Monitor GPU utilization**: Use `nvidia-smi dmon` to track GPU usage during processing
- **Schedule long-running jobs**: Process large video datasets in batch jobs overnight
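
For long runs, you can capture GPU telemetry and pipeline logs in the background with standard tools (paths and intervals are examples):

```bash
# Record GPU utilization and memory every 5 seconds, with timestamps
nvidia-smi dmon -s um -d 5 -o T > gpu_usage.log &

# Run the pipeline detached from the terminal and keep its full output
nohup python -m nemo_curator.examples.video.video_split_clip_example \
  --video-dir "$DATA_DIR" \
  --output-clip-path "$OUT_DIR" \
  --verbose > pipeline.log 2>&1 &
```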

## Next Steps

Explore the [Video Curation documentation](video-overview). For encoding guidance, refer to [Clip Encoding](video-process-transcoding).
