
Commit 18bfe0b

more video docs

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

1 parent bda3993

10 files changed: +227, -68 lines

docs/curate-video/process-data/dedup.md

Lines changed: 213 additions & 45 deletions

````diff
@@ -10,90 +10,195 @@ modality: "video-only"
 
 (video-process-dedup)=
 
-# Duplicate Removal
+# Duplicate Identification
+
+Use clip-level embeddings to identify near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.
+
+## Before You Start
+
+- Make sure you have embeddings written by the [`ClipWriterStage`](video-save-export) under `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in Parquet files containing the columns `id` and `embedding`.
+- Verify local paths or configure S3-compatible credentials. Provide `storage_options` in read/write keyword arguments when reading or writing cloud paths.
 
-Use clip-level embeddings to identify and remove near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.
 
 ## How it Works
 
-Duplicate removal operates on clip-level embeddings produced during processing:
+Duplicate identification operates on clip-level embeddings produced during processing:
 
 1. **Inputs**
-   - Parquet batches from `ClipWriterStage` under `iv2_embd_parquet/` or `ce1_embd_parquet/`
+   - Parquet batches from `ClipWriterStage` under `ce1_embd_parquet/`
    - Columns: `id`, `embedding`
 
 2. **Outputs**
    - Cluster: `KMeansStage` partitions embeddings and writes centroid distances (for example, `cosine_dist_to_cent`).
    - Pairwise: `PairwiseStage` computes within-cluster similarity on GPU and, for each clip, emits `max_id` and `cosine_sim_score`. Ranking controls whether to prefer outliers ("hard") or representatives ("easy").
    - Identify: `IdentifyDuplicatesStage` filters pairs with `cosine_sim_score >= 1.0 - eps` and writes Parquet files of duplicate `id`s for removal during export.
 
-## Before You Start
-
-- Verify local paths or configure S3-compatible credentials. Provide `storage_options` in read/write keyword arguments when reading or writing cloud paths.
-- Create output directories for `KMeansStage`, `PairwiseStage`, and `IdentifyDuplicatesStage`.
-
 ---
 
````
````diff
 ## Quickstart
 
-Use the generic semantic duplicate-removal stages with clip embeddings written to Parquet.
+Use the semantic duplicate workflow with clip embeddings written to Parquet.
+
+:::::{tab-set}
 
-::::{tab-set}
+::::{tab-item} Single Step Workflow
 
-:::{tab-item} Pipeline Stage
+The `SemanticDeduplicationWorkflow` provides an end-to-end interface that orchestrates K-means clustering, pairwise similarity computation, and duplicate identification:
 
 ```python
-from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
-from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
+from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow
 from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
-from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage
+from nemo_curator.backends.xenna import XennaExecutor
 
-kmeans = KMeansStage(
+workflow = SemanticDeduplicationWorkflow(
+    input_path="/path/to/embeddings/",  # e.g., ce1_embd_parquet/
+    output_path="/path/to/duplicates/",
+    cache_path="/path/to/cache/",  # Optional: defaults to output_path
     n_clusters=1000,
     id_field="id",
     embedding_field="embedding",
-    input_path="/path/to/embeddings/",
-    output_path="/path/to/kmeans_out/",
+    embedding_dim=768,  # Embedding dimension (768 for Cosmos-Embed1, varies by model)
     input_filetype="parquet",
-)
-
-pairwise = PairwiseStage(
-    id_field="id",
-    embedding_field="embedding",
-    input_path="/path/to/kmeans_out/",
-    output_path="/path/to/pairwise_out/",
+    eps=0.1,  # Similarity threshold: cosine_sim >= 1.0 - eps identifies duplicates
     ranking_strategy=RankingStrategy.metadata_based(
         metadata_cols=["cosine_dist_to_cent", "id"],
         ascending=[True, True],
     ),
+    pairwise_batch_size=1024,
+    read_kwargs={"storage_options": None},  # Add S3 credentials here if needed
+    write_kwargs={"storage_options": None},
+    verbose=True,
 )
 
-identify = IdentifyDuplicatesStage(
-    output_path="/path/to/duplicates/",
-    eps=0.1,
-)
+# Run with XennaExecutor (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(executor)
 ```
 
+:::{note}
+**Determine `eps` first**: Before running the full workflow, we recommend first running the K-means and pairwise steps (set `eps=None`) to inspect similarity distributions and determine an appropriate `eps` threshold. See the tip below for details.
 :::
 
-:::{tab-item} Script Flags
+The workflow automatically:
+
+1. Runs K-means clustering to partition embeddings into clusters
+2. Computes pairwise similarity within each cluster
+3. Identifies duplicates based on the `eps` threshold
+4. Writes duplicate IDs to `output_path/duplicates/`
 
-No example script flags are available for duplicate removal in the split pipeline. Run these stages as a separate job against Parquet embeddings written by the example pipeline's writer.
+```{seealso}
+For detailed information about how semantic deduplication works, see [Semantic Deduplication](text-process-data-format-sem-dedup). The algorithm and concepts are the same for video clips as for text documents.
+```
 
-:::
 ::::
 
-Input format: Parquet with columns `id` and `embedding` (produced by the video pipeline's embedding stages and writer). Duplicate removal operates at the clip level using these embeddings. The `IdentifyDuplicatesStage` writes Parquet files containing duplicate `id`s; perform removal by filtering out rows whose `id` appears in those files during export.
+::::{tab-item} Individual Stages
 
-```{seealso}
-Embeddings are written by the [`ClipWriterStage`](video-save-export) under `iv2_embd_parquet/` or `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup).
+For advanced users who need fine-grained control, you can run the stages individually:
+
+```python
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
+from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
+from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
+from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage
+
+pipe = Pipeline(name="semantic_dedup")
+
+pipe.add_stage(
+    KMeansStage(
+        n_clusters=1000,
+        id_field="id",
+        embedding_field="embedding",
+        input_path="/path/to/embeddings/",
+        output_path="/path/to/kmeans_out/",
+        input_filetype="parquet",
+        embedding_dim=512,
+    )
+)
+
+pipe.add_stage(
+    PairwiseStage(
+        id_field="id",
+        embedding_field="embedding",
+        input_path="/path/to/kmeans_out/",
+        output_path="/path/to/pairwise_out/",
+        ranking_strategy=RankingStrategy.metadata_based(
+            metadata_cols=["cosine_dist_to_cent", "id"],
+            ascending=[True, True],
+        ),
+    )
+)
+
+pipe.add_stage(
+    IdentifyDuplicatesStage(
+        output_path="/path/to/duplicates/",
+        eps=0.1,
+    )
+)
+
+pipe.run()
 ```
 
+::::
+
+::::{tab-item} Script Flags
+
+No example script flags are available for duplicate identification in the split pipeline. Run these stages as a separate job against Parquet embeddings written by the example pipeline's writer.
+
+::::
+:::::
+
+:::{tip}
+**Recommended Workflow: Determine `eps` First**
+
+The `eps` parameter is highly data-dependent and affects how many duplicates are identified. We recommend a step-by-step approach:
+
+1. **Run K-means and pairwise without duplicate identification**
+   - Use `SemanticDeduplicationWorkflow` with `eps=None` (or run the K-means and pairwise stages individually)
+   - This generates pairwise similarity scores without identifying duplicates
+
+2. **Inspect the similarity distribution**
+   - Analyze the `cosine_sim_score` values in the pairwise results
+   - Determine an appropriate `eps` threshold based on your data characteristics
+   - For example, if 20% of pairs have similarity ≥ 0.9, you might use `eps=0.1` (since `cosine_sim >= 1.0 - eps`)
+
+3. **Run the full workflow with your chosen `eps`**
+   - Use `SemanticDeduplicationWorkflow` with the determined `eps` value
+   - Or run `IdentifyDuplicatesStage` separately on the pairwise results
+
+For a detailed example of this workflow with similarity analysis, see the [Step-by-Step Semantic Deduplication tutorial](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/text/deduplication/semantic/semantic_step_by_step.ipynb) (demonstrated on text data, but the approach applies to video clips as well).
+:::
````
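The "inspect the similarity distribution" step in the tip above can be sketched with pandas. The scores below are fabricated stand-ins for the `cosine_sim_score` column of the pairwise results; with real data you would load the `PairwiseStage` Parquet output instead:

```python
import numpy as np
import pandas as pd

# Fabricated similarity scores standing in for the pairwise results; with
# real data: scores = pd.read_parquet("/path/to/pairwise_out/")["cosine_sim_score"]
rng = np.random.default_rng(0)
scores = pd.Series(
    np.clip(rng.normal(0.7, 0.15, 10_000), 0.0, 1.0), name="cosine_sim_score"
)

# Look at the upper tail of the distribution before choosing a cutoff.
print(scores.quantile([0.50, 0.90, 0.99]))

# Fraction of pairs a given eps would flag, since cosine_sim >= 1.0 - eps.
eps = 0.1
flagged = (scores >= 1.0 - eps).mean()
print(f"eps={eps} would flag {flagged:.1%} of pairs")
```

If the flagged fraction is far from the share of near-duplicates you expect in your data, adjust `eps` and re-check before running the full workflow.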
````diff
+:::{tip}
+**Custom Ranking with Metadata Columns**
+
+If your embedding Parquet files contain additional metadata columns (such as video quality scores, duration, resolution, or other clip attributes), you can use `RankingStrategy.metadata_based()` to create custom ranking methods. This allows you to prioritize which clips to keep within duplicate groups based on your specific criteria.
+
+For example, to prefer higher quality or longer duration clips:
+
+```python
+from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
+
+# Prefer clips with higher quality scores, then longer duration
+ranking_strategy = RankingStrategy.metadata_based(
+    metadata_cols=["quality_score", "duration"],
+    ascending=[False, False],  # False = descending (higher is better)
+)
+
+# Or prefer clips closer to cluster centroid, then by quality
+ranking_strategy = RankingStrategy.metadata_based(
+    metadata_cols=["cosine_dist_to_cent", "quality_score"],
+    ascending=[True, False],  # Closer to centroid first, then higher quality
+)
+```
+
+The metadata columns must be present in your embedding Parquet files and will be preserved through the K-means stage. Specify these columns using the `metadata_fields` parameter in `KMeansStage` or `SemanticDeduplicationWorkflow`.
+:::
+
 ## Parameters
 
-::::{tab-set}
+:::::{tab-set}
 
-:::{tab-item} KMeansStage
+::::{tab-item} KMeansStage
 
 ```{list-table} KMeansStage (semantic clustering)
 :header-rows: 1
@@ -116,23 +221,86 @@ Embeddings are written by the [`ClipWriterStage`](video-save-export) under `iv2_
   - Embedding dimension (Cosmos‑Embed1 varies by variant: 768 for most).
 ```
 
-:::
+::::
 
-:::{tab-item} PairwiseStage
+::::{tab-item} PairwiseStage
 
 ```{list-table} PairwiseStage (within‑cluster similarity)
 :header-rows: 1
 
 * - Parameter
   - Description
-* - `which_to_keep`
-  - `"hard"` keeps outliers far from centroid; `"easy"` keeps nearest to centroid; `"random"` ignores distance.
-* - `sim_metric`
-  - `"cosine"` (default) or `"l2"` affects centroid distances and ranking.
 * - `ranking_strategy`
-  - Optional explicit ranking (overrides switches). Use `RankingStrategy.metadata_based([...])`.
+  - Ranking strategy for selecting which clips to keep within clusters. Use `RankingStrategy.metadata_based(metadata_cols=[...], ascending=[...])` to sort by metadata columns (for example, `metadata_cols=["cosine_dist_to_cent", "id"]`). Use `RankingStrategy.random()` for random selection.
 * - `pairwise_batch_size`
   - Batch size for GPU pairwise computation (default `1024`). Increase with available memory.
 * - `embedding_dim`
   - Embedding dimension for memory estimates and batching.
-* - `
+* - `id_field`
+  - Column name containing clip IDs (for example, `"id"`).
+* - `embedding_field`
+  - Column with vector data (for example, `"embedding"`).
+* - `input_path`
+  - Path to K-means output directory (sharded by cluster).
+* - `output_path`
+  - Directory for pairwise similarity outputs.
+```
+
+::::
+
+::::{tab-item} IdentifyDuplicatesStage
+
+```{list-table} IdentifyDuplicatesStage (duplicate identification)
+:header-rows: 1
+
+* - Parameter
+  - Description
+* - `output_path`
+  - Directory to write Parquet files containing duplicate `id`s.
+* - `eps`
+  - Similarity threshold: pairs with `cosine_sim_score >= 1.0 - eps` are identified as duplicates (for example, `0.1` means similarity >= `0.9`).
+* - `read_kwargs`
+  - Optional keyword arguments for reading files (including `storage_options` for cloud storage).
+* - `write_kwargs`
+  - Optional keyword arguments for writing files (including `storage_options` for cloud storage).
+* - `verbose`
+  - Enable verbose logging (default `False`).
+```
+
+::::
+
+::::{tab-item} SemanticDeduplicationWorkflow
+
+The `SemanticDeduplicationWorkflow` accepts parameters from all three stages (KMeansStage, PairwiseStage, and IdentifyDuplicatesStage). See the tabs above for parameter descriptions.
+
+```{list-table} SemanticDeduplicationWorkflow (workflow-specific parameters)
+:header-rows: 1
+
+* - Parameter
+  - Description
+* - `cache_path`
+  - Directory for intermediate results (K-means and pairwise outputs). Defaults to `output_path` if not specified.
+* - `cache_kwargs`
+  - Optional keyword arguments for writing cache files (including `storage_options` for cloud storage). Defaults to `write_kwargs` if not specified.
+* - `clear_output`
+  - Clear output directory before running (default `True`).
+* - `metadata_fields`
+  - List of metadata field names to preserve in output (optional).
+```
+
+For parameters shared with individual stages, refer to:
+
+- **KMeansStage** tab: `input_path`, `output_path`, `n_clusters`, `id_field`, `embedding_field`, `embedding_dim`
+- **PairwiseStage** tab: `ranking_strategy`, `pairwise_batch_size`
+- **IdentifyDuplicatesStage** tab: `eps`
+- Common parameters: `read_kwargs`, `write_kwargs`, `verbose`
+
+::::
+:::::
+
+---
+
+## Removing Duplicates
+
+The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.
+
+**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:
````
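The export-time filtering described in the new "Removing Duplicates" section is essentially an anti-join on `id`. A minimal sketch with pandas, using tiny fabricated frames in place of your clip metadata and the Parquet files under `output_path/duplicates/` (the `caption` column is a hypothetical stand-in for whatever metadata you export):

```python
import pandas as pd

# Stand-ins: `clips` is your exportable clip table; `duplicates` mimics the
# single-column `id` Parquet files written by IdentifyDuplicatesStage.
clips = pd.DataFrame({"id": ["a", "b", "c", "d"], "caption": ["w", "x", "y", "z"]})
duplicates = pd.DataFrame({"id": ["b", "d"]})

# Anti-join: keep only clips whose id is NOT in the duplicate set.
dup_ids = set(duplicates["id"])
kept = clips[~clips["id"].isin(dup_ids)].reset_index(drop=True)
print(kept["id"].tolist())  # ['a', 'c']
```

With real data, load the duplicate IDs with `pd.read_parquet` (passing `storage_options` for cloud paths) and apply the same `isin` filter while writing your final shards.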

docs/curate-video/process-data/embeddings.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -65,7 +65,7 @@ pipe.run()
 
 ```bash
 # Cosmos-Embed1 (224p)
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
     ... \
     --generate-embeddings \
     --embedding-algorithm cosmos-embed1-224p \
````

docs/curate-video/process-data/filtering.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -83,7 +83,7 @@ pipe.run()
 
 ```bash
 # Motion filtering
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
     ... \
     --motion-filter enable \
     --motion-decode-target-fps 2.0 \
@@ -95,7 +95,7 @@ python -m nemo_curator.examples.video.video_split_clip_example \
     --motion-score-gpus-per-worker 0.5
 
 # Aesthetic filtering
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
     ... \
     --aesthetic-threshold 3.5 \
     --aesthetic-reduction min \
````

docs/curate-video/process-data/frame-extraction.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -23,7 +23,7 @@ Extract frames from clips or full videos at target rates and resolutions. Use fr
 
 ## Before You Start
 
-[Embeddings](video-process-embeddings) and [aesthetic filtering](video-process-filtering-aesthetic) require frames. If you need saved media files, frame extraction is optional.
+If you need saved media files, frame extraction is optional. [Embeddings](video-process-embeddings) and [aesthetic filtering](video-process-filtering-aesthetic) require frames.
 
 ---
 
@@ -66,13 +66,13 @@ pipe.run()
 
 ```bash
 # Clip frames implicitly when generating embeddings or aesthetics
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
     ... \
     --generate-embeddings \
     --clip-extraction-target-res -1
 
 # Full-video frames for TransNetV2 scene change
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
     ... \
     --splitting-algorithm transnetv2 \
     --transnetv2-frame-decoder-mode pynvc
````

docs/curate-video/process-data/index.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -124,7 +124,7 @@ pipeline.add_stage(
 )
 ```
 
-Path helpers are available to resolve common locations (such as `clips/`, `filtered_clips/`, `previews/`, `metas/v0/`, and `iv2_embd_parquet/`).
+Path helpers are available to resolve common locations (such as `clips/`, `filtered_clips/`, `previews/`, `metas/v0/`, and `ce1_embd_parquet/`).
 
 ```{toctree}
 :maxdepth: 2
````
