Commit e844323

video doc improvement (#1124)
Signed-off-by: Ao Tang <[email protected]>
1 parent e484a4c commit e844323

4 files changed: +36 -30 lines

docs/about/concepts/video/abstractions.md

Lines changed: 9 additions & 7 deletions
@@ -33,13 +33,15 @@ A pipeline orchestrates stages into an end-to-end workflow. Key characteristics:
 
 ## Stages
 
-A stage represents a single step in your data curation workflow. For example, stages can:
-
-- Download videos
-- Convert video formats
-- Split videos into clips
-- Generate embeddings
-- Calculate scores
+A stage represents a single step in your data curation workflow. Video stages are organized into several functional categories:
+
+- **Input/Output**: Read video files and write processed outputs to storage ([Save & Export Documentation](video-save-export))
+- **Video Clipping**: Split videos into clips using fixed stride or scene-change detection ([Video Clipping Documentation](video-process-clipping))
+- **Frame Extraction**: Extract frames from videos or clips for analysis and embeddings ([Frame Extraction Documentation](video-process-frame-extraction))
+- **Embedding Generation**: Generate clip-level embeddings using InternVideo2 or Cosmos-Embed1 models ([Embeddings Documentation](video-process-embeddings))
+- **Filtering**: Filter clips based on motion analysis and aesthetic quality scores ([Filtering Documentation](video-process-filtering))
+- **Caption and Preview**: Generate captions and preview images from video clips ([Captions & Preview Documentation](video-process-captions-preview))
+- **Deduplication**: Remove near-duplicate clips using embedding-based clustering ([Duplicate Removal Documentation](video-process-dedup))
 
 ### Stage Architecture
 
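The new category list reads as a map of the stage catalog. Below is a minimal sketch of how such stages compose into a pipeline, combining the `VideoFrameExtractionStage` import and the `pipeline.add_stage(...)` call that appear elsewhere in this commit; the `Pipeline` import path, constructor, and `run()` entry point are assumptions, not confirmed by this diff:

```python
# Minimal sketch: wiring one documented stage into a pipeline.
# VideoFrameExtractionStage and add_stage appear in this commit;
# the Pipeline import path, constructor, and run() are assumptions.
from nemo_curator.pipeline import Pipeline  # assumed import path
from nemo_curator.stages.video.clipping.video_frame_extraction import (
    VideoFrameExtractionStage,
)

pipeline = Pipeline(name="video_curation")  # hypothetical constructor
pipeline.add_stage(
    VideoFrameExtractionStage(
        decoder_mode="pynvc",  # or "ffmpeg_gpu", "ffmpeg_cpu"
        verbose=True,
    )
)
pipeline.run()  # assumed entry point
```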
docs/curate-video/process-data/frame-extraction.md

Lines changed: 8 additions & 0 deletions
@@ -117,6 +117,8 @@ from nemo_curator.stages.video.clipping.video_frame_extraction import VideoFrame
 
 frame_extractor = VideoFrameExtractionStage(
     decoder_mode="pynvc",  # or "ffmpeg_gpu", "ffmpeg_cpu"
+    output_hw=(27, 48),  # (height, width) for frame extraction
+    pyncv_batch_size=64,  # batch size for PyNvCodec
     verbose=True,
 )
 ```
@@ -139,8 +141,14 @@ frame_extractor = VideoFrameExtractionStage(
   - Shortcut that sets default FPS for specific purposes (such as embeddings). You can still pass `target_fps` to override.
 * - `target_res`
   - Output frame resolution `(height, width)`. Use `(-1, -1)` to keep original.
+* - `num_cpus`
+  - Number of CPU cores for frame extraction. Default: `3`.
 * - `decoder_mode`
   - For full‑video extraction: `pynvc` (NVDEC), `ffmpeg_gpu`, or `ffmpeg_cpu`.
+* - `output_hw`
+  - For full‑video extraction: `(height, width)` tuple for frame dimensions. Default: `(27, 48)`.
+* - `pyncv_batch_size`
+  - For full‑video extraction: batch size for PyNvCodec processing. Default: `64`.
 ```
 
 ### LCM Sampling for Several FPS Values
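The two added parameters apply to full-video extraction with the GPU decoder. As a hedged companion sketch using only parameters from the table above (defaults per this diff), here is a CPU-only configuration, where the PyNvCodec batch size is irrelevant and is omitted:

```python
# Hedged sketch: CPU-only frame extraction built from the parameters
# documented in the table above. Values mirror the stated defaults.
from nemo_curator.stages.video.clipping.video_frame_extraction import (
    VideoFrameExtractionStage,
)

frame_extractor = VideoFrameExtractionStage(
    decoder_mode="ffmpeg_cpu",  # no NVDEC hardware required
    output_hw=(27, 48),         # default (height, width)
    num_cpus=3,                 # default CPU core count
    verbose=True,
)
```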

docs/curate-video/save-export.md

Lines changed: 2 additions & 2 deletions
@@ -30,7 +30,7 @@ pipeline.add_stage(
     generate_embeddings=True,
     generate_previews=False,
     generate_captions=False,
-    embedding_algorithm="internvideo2",  # or "cosmos-embed1"
+    embedding_algorithm="cosmos-embed1",  # or "internvideo2"
     caption_models=["qwen"],
     enhanced_caption_models=["qwen_lm"],
     verbose=True,
@@ -69,7 +69,7 @@ pipeline.add_stage(
   - The stage includes captions in metadata when upstream stages provide them.
 * - `embedding_algorithm`
   - `str`
-  - Accepted: `internvideo2` or `cosmos-embed1`.
+  - Accepted: `cosmos-embed1` or `internvideo2`. Default: `cosmos-embed1`.
 * - `caption_models`
   - `list[str] | None`
   - Ordered caption models to emit. Use `[]` when not using captions.
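Since the default flips to `cosmos-embed1`, selecting InternVideo2 is now the explicit case. A hedged sketch of that configuration follows; the stage class constructed inside `add_stage` is not named in these hunks, so `VideoSaveExportStage` is a placeholder, with keyword arguments taken verbatim from the diff above:

```python
# Hedged sketch: explicitly selecting InternVideo2 now that
# cosmos-embed1 is the default. VideoSaveExportStage is a placeholder
# name; the real class sits outside the hunks shown here.
pipeline.add_stage(
    VideoSaveExportStage(  # hypothetical class name
        generate_embeddings=True,
        generate_previews=False,
        generate_captions=False,
        embedding_algorithm="internvideo2",  # non-default choice
        caption_models=["qwen"],
        enhanced_caption_models=["qwen_lm"],
        verbose=True,
    )
)
```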

docs/get-started/video.md

Lines changed: 17 additions & 21 deletions
@@ -117,32 +117,28 @@ Embeddings convert each video clip into a numeric vector that captures visual an
 
 You can choose between two embedding models:
 
-- **Cosmos-Embed1 (default)**: Automatically downloaded to `MODEL_DIR` on first run; good general-purpose performance and lower VRAM usage.
-- **InternVideo2 (IV2)**: Open model that requires the IV2 checkpoint and BERT model files to be available locally; higher VRAM usage.
+- **Cosmos-Embed1 (default)**: Available in three variants—**cosmos-embed1-224p**, **cosmos-embed1-336p**, and **cosmos-embed1-448p**—which differ in input resolution and accuracy/VRAM tradeoff. All variants are automatically downloaded to `MODEL_DIR` on first run.
+  - [cosmos-embed1-224p on Hugging Face](https://huggingface.co/nvidia/Cosmos-Embed1-224p)
+  - [cosmos-embed1-336p on Hugging Face](https://huggingface.co/nvidia/Cosmos-Embed1-336p)
+  - [cosmos-embed1-448p on Hugging Face](https://huggingface.co/nvidia/Cosmos-Embed1-448p)
+- **InternVideo2 (IV2)**: Open model that requires the IV2 checkpoint and BERT model files to be available locally; higher VRAM usage.
+  - [InternVideo Official Github Page](https://github.com/OpenGVLab/InternVideo)
 
-For this quickstart, we're going to set up support for **IV2**.
+For this quickstart, we're going to set up support for **Cosmos-Embed1-224p**.
 
-### Prepare IV2 Model Weights
+### Prepare Model Weights
 
-Complete the following steps when you set `--embedding-algorithm` to `internvideo2` or when you pre-stage models for offline use.
+For most use cases, you only need to create a model directory. The required model files will be downloaded automatically on first run.
 
-1. Create a model directory.
+1. Create a model directory:
+   ```bash
+   mkdir -p "$MODEL_DIR"
+   ```
    :::{tip}
    You can reuse the same `<MODEL_DIR>` across runs.
    :::
-2. Download the IV2 Checkpoint from the [OpenGVLab page](https://github.com/OpenGVLab) and accept the terms.
-3. Download the BERT model files for [`google-bert/bert-large-uncased`](https://huggingface.co/google-bert/bert-large-uncased).
-
-   The directory should resemble the following:
-
-   ```text
-   <MODEL_DIR>/
-     OpenGVLab/InternVideo2-Stage2_1B-224p-f4/InternVideo2-stage2_1b-224p-f4.pt
-     google-bert/bert-large-uncased/
-       config.json
-       tokenizer.json
-       ... (standard tokenizer files)
-   ```
+
+2. No additional setup is required. The model will be downloaded automatically when first used.
 
 ## Set Up Data Directories
 
@@ -169,7 +165,7 @@ python -m nemo_curator.examples.video.video_split_clip_example \
   --output-clip-path "$OUT_DIR" \
   --splitting-algorithm fixed_stride \
   --fixed-stride-split-duration 10.0 \
-  --embedding-algorithm internvideo2 \
+  --embedding-algorithm cosmos-embed1-224p \
   --transcode-encoder libopenh264 \
   --verbose
 ```
@@ -196,7 +192,7 @@ The example script supports the following options:
 ```
 
 :::{tip}
-To use the default Cosmos-Embed1 instead, omit `--embedding-algorithm` or set `--embedding-algorithm cosmos-embed1-224p`.
+To use InternVideo2 instead, set `--embedding-algorithm internvideo2`.
 :::
 
 ## Next Steps
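
The removed steps covered manual IV2 staging; the automatic download now handles Cosmos-Embed1 weights on first run. For air-gapped machines you can still pre-stage the weights yourself. A hedged sketch using `huggingface_hub` follows; the exact directory layout NeMo Curator expects under `MODEL_DIR` is an assumption:

```python
# Optional pre-staging of Cosmos-Embed1-224p weights for offline use.
# NeMo Curator downloads the model automatically on first run, so this
# is only needed for air-gapped setups. The repo_id comes from the
# Hugging Face links above; the layout under MODEL_DIR is assumed.
import os
from huggingface_hub import snapshot_download

model_dir = os.environ["MODEL_DIR"]
snapshot_download(
    repo_id="nvidia/Cosmos-Embed1-224p",
    local_dir=os.path.join(model_dir, "nvidia/Cosmos-Embed1-224p"),
)
```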
