This guide shows how to install Curator and run your first video curation pipeline.
The [example pipeline](#run-the-splitting-pipeline-example) processes a list of videos, splitting each into 10‑second clips using a fixed stride. It then generates clip‑level embeddings for downstream tasks such as duplicate removal and similarity search.
## Overview
This quickstart guide demonstrates how to:
1. **Install NeMo Curator** with video processing support
2. **Set up FFmpeg** with GPU-accelerated encoding
3. **Configure embedding models** (Cosmos-Embed1 or InternVideo2)
4. **Process videos** through a complete splitting and embedding pipeline
5. **Generate outputs** ready for duplicate removal, captioning, and model training
**What you'll build:** A video processing pipeline that:
- Splits videos into 10-second clips using fixed stride or scene detection
- Generates clip-level embeddings for similarity search and deduplication
- Optionally creates captions and preview images
- Outputs results in formats compatible with multimodal training workflows
## Prerequisites
### System Requirements
To use NeMo Curator's video curation capabilities, ensure your system meets these requirements:
#### Operating System
* **Ubuntu 24.04, 22.04, or 20.04** (required for GPU-accelerated video processing)
* Other Linux distributions may work but are not officially supported
#### Python Environment
* **Python 3.10, 3.11, or 3.12**
* **uv package manager** for dependency management
* **Git** for model and repository dependencies
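
A quick way to confirm these tools are available on your `PATH` (a sanity check only; exact versions will vary by system):

```bash
# Confirm the Python interpreter, uv, and Git are installed and visible.
python3 --version   # expect 3.10, 3.11, or 3.12
uv --version
git --version
```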
#### GPU Requirements
* **NVIDIA GPU required** (CPU-only mode not supported for video processing)
* **Architecture**: Volta™ or newer (compute capability 7.0+)
  - Examples: V100, T4, RTX 2080+, A100, H100
* **CUDA**: Version 12.0 or above
* **VRAM**: Minimum requirements by configuration:
  - Basic splitting + embedding: ~16GB VRAM
  - Full pipeline (splitting + embedding + captioning): ~38GB VRAM (reduce to about 21 GB by lowering batch sizes and using FP8 where available)
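
To confirm the GPU, VRAM, and compute capability on your machine, you can query the NVIDIA driver directly (the `compute_cap` field assumes a reasonably recent driver; older drivers omit it):

```bash
# Show each visible GPU with its name, total memory, and compute capability.
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
```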
#### FFmpeg Requirements

* **FFmpeg 7+** available on your system path
* For H.264, at least one encoder must be available:
  - GPU encoder: `h264_nvenc` (recommended for performance)
  - CPU encoders: `libopenh264` or `libx264` (fallback options)

:::{tip}
If you don't have `uv` installed, refer to the [Installation Guide](../admin/installation.md) for setup instructions, or install it quickly with:
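
The standard uv installer one-liner is shown below; as with any piped installer, review the script first if that's a concern:

```bash
# Download and run the official uv install script.
curl -LsSf https://astral.sh/uv/install.sh | sh
```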
:::

If encoders are missing, reinstall `FFmpeg` with the required options or use the CPU encoders (`libopenh264` or `libx264`) as a fallback.
Refer to [Clip Encoding](video-process-transcoding) to choose encoders and verify NVENC support on your system.
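
As a quick local check (assuming `ffmpeg` is already on your `PATH`), you can confirm the FFmpeg version and see which H.264 encoders your build exposes:

```bash
# FFmpeg version should be 7 or newer.
ffmpeg -version | head -n 1

# At least one of these H.264 encoders should be listed.
ffmpeg -hide_banner -encoders | grep -E 'h264_nvenc|libopenh264|libx264'
```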
## Choose Embedding Model
### Available Models
Embeddings convert each video clip into a numeric vector that captures visual and semantic content. Curator uses these vectors to:
- Remove near-duplicate clips during duplicate removal
- Enable similarity search and clustering
- Support downstream analysis such as caption verification
NeMo Curator supports two embedding model families:
#### Cosmos-Embed1 (Recommended)
**Cosmos-Embed1 (default)**: Available in three variants—**cosmos-embed1-224p**, **cosmos-embed1-336p**, and **cosmos-embed1-448p**—which differ in input resolution and accuracy/VRAM tradeoff. All variants are automatically downloaded to `MODEL_DIR` on first run.
| Model Variant | Resolution | VRAM Usage | Speed | Accuracy | Best For |
**Model links:**
- [cosmos-embed1-224p on Hugging Face](https://huggingface.co/nvidia/cosmos-embed1-224p)
- [cosmos-embed1-336p on Hugging Face](https://huggingface.co/nvidia/cosmos-embed1-336p)
- [cosmos-embed1-448p on Hugging Face](https://huggingface.co/nvidia/cosmos-embed1-448p)
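
The Cosmos-Embed1 weights download automatically on first run, so no manual step is required. If you prefer to pre-populate `MODEL_DIR` (for example, on a machine without internet access at run time), a sketch using the Hugging Face CLI follows; the target subdirectory is an assumption, and the layout Curator expects may differ:

```bash
# Optional: pre-download the 224p variant (requires `pip install "huggingface_hub[cli]"`).
# The destination folder name is illustrative, not a Curator requirement.
huggingface-cli download nvidia/Cosmos-Embed1-224p --local-dir "$MODEL_DIR/Cosmos-Embed1-224p"
```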
#### InternVideo2 (IV2)
An open model that requires the IV2 checkpoint and BERT model files to be available locally; it uses more VRAM than Cosmos-Embed1.
- [InternVideo Official GitHub Page](https://github.com/OpenGVLab/InternVideo)
For this quickstart, we're going to set up support for **Cosmos-Embed1-224p**.
For most use cases, you only need to create a model directory. The required models are downloaded automatically on first run.
## Set Up Data Directories
Organize input videos and output locations before running the pipeline.
- **Local**: For local file processing. Define paths like:
```bash
DATA_DIR=/path/to/videos
OUT_DIR=/path/to/output_clips
MODEL_DIR=/path/to/models
```
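
If the output and model directories don't exist yet, create them up front; `DATA_DIR` should already contain the source videos you want to process:

```bash
# Create the output and model directories (no-op if they already exist).
mkdir -p "$OUT_DIR" "$MODEL_DIR"
```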
- **S3**: For cloud storage (AWS S3, MinIO, etc.). Configure credentials in `~/.aws/credentials` and use `s3://` paths for `--video-dir` and `--output-clip-path`.
**S3 usage notes:**
- Input videos can be read from S3 paths
- Output clips can be written to S3 paths
- Model directory should remain local for performance
- Ensure IAM permissions allow read/write access to specified buckets
- Use NVENC when available (for example, `h264_nvenc`). Refer to [Clip Encoding](video-process-transcoding) to verify NVENC support and choose encoders.
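
For reference, the `~/.aws/credentials` file mentioned in the S3 setup above usually has the following shape; the key values here are placeholders, not real credentials:

```bash
# Write a minimal default AWS credentials profile.
# Caution: this overwrites any existing ~/.aws/credentials file.
mkdir -p ~/.aws
cat > ~/.aws/credentials <<'EOF'
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
EOF
```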
**What this command does:**
1. Reads all video files from `$DATA_DIR`
2. Splits each video into 10-second clips using fixed stride
3. Generates embeddings using the Cosmos-Embed1-224p model
4. Encodes clips using the `libopenh264` encoder
5. Writes output clips and metadata to `$OUT_DIR`
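
After the run completes, you can take a quick look at what was written without assuming any particular output layout:

```bash
# List the first few files and directories produced under the output path.
find "$OUT_DIR" -maxdepth 2 | head -n 20
```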
### Configuration Options Reference
| Option | Values | Description |
|--------|--------|-------------|
| **Splitting** | | |
| `--splitting-algorithm` | `fixed_stride`, `transnetv2` | Method for dividing videos into clips |