Use clip-level embeddings to identify near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.

## Before You Start

- Make sure you have embeddings written by the [`ClipWriterStage`](video-save-export) under `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in Parquet files containing the columns `id` and `embedding`.
- Verify local paths or configure S3-compatible credentials. Provide `storage_options` in read/write keyword arguments when reading or writing cloud paths.

## How it Works

Duplicate identification operates on clip-level embeddings produced during processing:

1. **Inputs**
   - Parquet batches from `ClipWriterStage` under `ce1_embd_parquet/`
   - Columns: `id`, `embedding`
2. **Outputs**
   - Cluster: `KMeansStage` partitions embeddings and writes centroid distances (for example, `cosine_dist_to_cent`).
   - Pairwise: `PairwiseStage` computes within-cluster similarity on GPU and, for each clip, emits `max_id` and `cosine_sim_score`. Ranking controls whether to prefer outliers ("hard") or representatives ("easy").
   - Identify: `IdentifyDuplicatesStage` filters pairs with `cosine_sim_score >= 1.0 - eps` and writes Parquet files of duplicate `id`s for removal during export.

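The identification criterion in the steps above can be sketched as a one-line check. This is illustrative only; the real stage applies the same comparison across Parquet batches of pairwise results.

```python
def is_duplicate(cosine_sim_score: float, eps: float) -> bool:
    """A pair is flagged when its similarity meets or exceeds 1.0 - eps."""
    return cosine_sim_score >= 1.0 - eps

print(is_duplicate(0.95, eps=0.1))  # True: 0.95 >= 0.9
print(is_duplicate(0.85, eps=0.1))  # False: 0.85 < 0.9
```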
---
## Quickstart

Use the semantic duplicate workflow with clip embeddings written to Parquet.

:::::{tab-set}

::::{tab-item} Single Step Workflow

The `SemanticDeduplicationWorkflow` provides an end-to-end interface that orchestrates K-means clustering, pairwise similarity computation, and duplicate identification:

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

# Paths and stage parameters below are illustrative
workflow = SemanticDeduplicationWorkflow(
    input_path="/path/to/embeddings/",
    output_path="/path/to/semdedup_out/",
    n_clusters=1000,
    id_field="id",
    embedding_field="embedding",
    embedding_dim=512,
    eps=0.1,  # Set eps=None to skip duplicate identification
    read_kwargs={"storage_options": None},  # Add S3 credentials here if needed
    write_kwargs={"storage_options": None},
    verbose=True,
)

# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)
```

:::{note}
**Determine `eps` first**: Before running the full workflow, we recommend first running the K-means and pairwise steps (set `eps=None`) to inspect similarity distributions and determine an appropriate `eps` threshold. See the tip below for details.
:::

The workflow automatically:

1. Runs K-means clustering to partition embeddings into clusters
2. Computes pairwise similarity within each cluster
3. Identifies duplicates based on the `eps` threshold
4. Writes duplicate IDs to `output_path/duplicates/`

```{seealso}
For detailed information about how semantic deduplication works, see [Semantic Deduplication](text-process-data-format-sem-dedup). The algorithm and concepts are the same for video clips as for text documents.
```

::::

::::{tab-item} Individual Stages

For advanced users who need fine-grained control, you can run the stages individually:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

pipe = Pipeline(name="semantic_dedup")

pipe.add_stage(
    KMeansStage(
        n_clusters=1000,
        id_field="id",
        embedding_field="embedding",
        input_path="/path/to/embeddings/",
        output_path="/path/to/kmeans_out/",
        input_filetype="parquet",
        embedding_dim=512,
    )
)

pipe.add_stage(
    PairwiseStage(
        id_field="id",
        embedding_field="embedding",
        input_path="/path/to/kmeans_out/",
        output_path="/path/to/pairwise_out/",
        ranking_strategy=RankingStrategy.metadata_based(
            metadata_cols=["cosine_dist_to_cent", "id"],
            ascending=[True, True],
        ),
    )
)

pipe.add_stage(
    IdentifyDuplicatesStage(
        output_path="/path/to/duplicates/",
        eps=0.1,
    )
)

pipe.run()
```

::::

::::{tab-item} Script Flags

No example script flags are available for duplicate identification in the split pipeline. Run these stages as a separate job against Parquet embeddings written by the example pipeline's writer.

::::

:::::

:::{tip}
**Recommended Workflow: Determine `eps` First**

The `eps` parameter is highly data-dependent and affects how many duplicates are identified. We recommend a three-step approach:

1. **Run K-means and pairwise without duplicate identification**
   - Use `SemanticDeduplicationWorkflow` with `eps=None` (or run the K-means and pairwise stages individually)
   - This generates pairwise similarity scores without identifying duplicates
2. **Inspect the similarity distribution**
   - Analyze the `cosine_sim_score` values in the pairwise results
   - Determine an appropriate `eps` threshold based on your data characteristics
   - For example, if 20% of pairs have similarity ≥ 0.9, you might use `eps=0.1` (since duplicates satisfy `cosine_sim_score >= 1.0 - eps`)
3. **Run the full workflow with your chosen `eps`**
   - Use `SemanticDeduplicationWorkflow` with the determined `eps` value
   - Or run `IdentifyDuplicatesStage` separately on the pairwise results

For a detailed example of this workflow with similarity analysis, see the [Step-by-Step Semantic Deduplication tutorial](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/text/deduplication/semantic/semantic_step_by_step.ipynb) (demonstrated on text data, but the approach applies to video clips as well).
:::
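As a sketch of the inspection step, you can sweep candidate `eps` values against the observed `cosine_sim_score` distribution and see what fraction of pairs each threshold would flag. The score list below is made up for illustration; in practice you would load it from the pairwise results.

```python
def fraction_flagged(scores: list[float], eps: float) -> float:
    """Fraction of pairs that would be flagged as duplicates at a given eps."""
    threshold = 1.0 - eps
    return sum(s >= threshold for s in scores) / len(scores)

# Illustrative cosine_sim_score values sampled from pairwise results
scores = [0.99, 0.95, 0.92, 0.85, 0.70, 0.60]
for eps in (0.05, 0.1, 0.2):
    print(f"eps={eps}: {fraction_flagged(scores, eps):.0%} of pairs flagged")
```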
170
+
171
+
:::{tip}
172
+
**Custom Ranking with Metadata Columns**
173
+
174
+
If your embedding Parquet files contain additional metadata columns (such as video quality scores, duration, resolution, or other clip attributes), you can use `RankingStrategy.metadata_based()` to create custom ranking methods. This allows you to prioritize which clips to keep within duplicate groups based on your specific criteria.
175
+
176
+
For example, to prefer higher quality or longer duration clips:
177
+
178
+
```python
179
+
from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
180
+
181
+
# Prefer clips with higher quality scores, then longer duration
ascending=[True, False], # Closer to centroid first, then higher quality
191
+
)
192
+
```
193
+
194
+
The metadata columns must be present in your embedding Parquet files and will be preserved through the K-means stage. Specify these columns using the `metadata_fields` parameter in `KMeansStage` or `SemanticDeduplicationWorkflow`.
195
+
:::
196
+
92
197
## Parameters

:::::{tab-set}

::::{tab-item} KMeansStage

```{list-table} KMeansStage (semantic clustering)
:header-rows: 1

* - Parameter
  - Description
* - `n_clusters`
  - Number of clusters to partition embeddings into (for example, `1000`).
* - `id_field`
  - Column name containing clip IDs (for example, `"id"`).
* - `embedding_field`
  - Column with vector data (for example, `"embedding"`).
* - `input_path`
  - Path to the embedding Parquet files.
* - `output_path`
  - Directory for clustered output (sharded by cluster).
* - `input_filetype`
  - Input file format (for example, `"parquet"`).
* - `embedding_dim`
  - Embedding dimension (Cosmos‑Embed1 varies by variant: 768 for most).
```

::::

::::{tab-item} PairwiseStage

```{list-table} PairwiseStage (within-cluster similarity)
:header-rows: 1

* - Parameter
  - Description
* - `id_field`
  - Column name containing clip IDs (for example, `"id"`).
* - `embedding_field`
  - Column with vector data (for example, `"embedding"`).
* - `input_path`
  - Path to K-means output directory (sharded by cluster).
* - `which_to_keep`
  - `"hard"` keeps outliers far from centroid; `"easy"` keeps nearest to centroid; `"random"` ignores distance.
* - `ranking_strategy`
  - Ranking strategy for selecting which clips to keep within clusters. Use `RankingStrategy.metadata_based(metadata_cols=[...], ascending=[...])` to sort by metadata columns (for example, `metadata_cols=["cosine_dist_to_cent", "id"]`). Use `RankingStrategy.random()` for random selection.
* - `pairwise_batch_size`
  - Batch size for GPU pairwise computation (default `1024`). Increase with available memory.
* - `embedding_dim`
  - Embedding dimension for memory estimates and batching.
```

::::

::::{tab-item} IdentifyDuplicatesStage

```{list-table} IdentifyDuplicatesStage (duplicate identification)
:header-rows: 1

* - Parameter
  - Description
* - `output_path`
  - Directory to write Parquet files containing duplicate `id`s.
* - `eps`
  - Similarity threshold: pairs with `cosine_sim_score >= 1.0 - eps` are identified as duplicates (for example, `0.1` means similarity >= `0.9`).
* - `read_kwargs`
  - Optional keyword arguments for reading files (including `storage_options` for cloud storage).
* - `write_kwargs`
  - Optional keyword arguments for writing files (including `storage_options` for cloud storage).
* - `verbose`
  - Enable verbose logging (default `False`).
```

::::

::::{tab-item} SemanticDeduplicationWorkflow

The `SemanticDeduplicationWorkflow` accepts parameters from all three stages (`KMeansStage`, `PairwiseStage`, and `IdentifyDuplicatesStage`). See the tabs above for parameter descriptions.

- Common parameters: `read_kwargs`, `write_kwargs`, `verbose`

::::

:::::

---

## Removing Duplicates

The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.

**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data.
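With `pandas`, the filtering can be sketched as follows. This is a minimal example under the assumptions stated in the text (a `duplicates/` directory of Parquet files with a single `id` column); the helper names and paths are our own, not part of the library.

```python
import glob

import pandas as pd

def load_duplicate_ids(duplicates_dir: str) -> set:
    """Collect duplicate clip IDs from the Parquet files written during identification."""
    ids: set = set()
    for path in glob.glob(f"{duplicates_dir}/*.parquet"):
        ids.update(pd.read_parquet(path, columns=["id"])["id"].tolist())
    return ids

def drop_duplicate_clips(df: pd.DataFrame, duplicate_ids: set) -> pd.DataFrame:
    """Keep only clips whose id is not flagged as a duplicate."""
    return df[~df["id"].isin(duplicate_ids)]

# Usage sketch (hypothetical paths):
# dupes = load_duplicate_ids("/path/to/semdedup_out/duplicates")
# clean = drop_duplicate_clips(pd.read_parquet("/path/to/clips.parquet"), dupes)
```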