docs/curate-video/process-data/dedup.md
+3 -3 (3 additions, 3 deletions)
@@ -15,7 +15,7 @@ modality: "video-only"
 Use clip-level embeddings to identify near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.
 
 ## Before You Start
-- Make sure you have embeddings which are written by the [`ClipWriterStage`](video-save-export) under `iv2_embd_parquet/` or `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in parquet files containing the columns `id` and `embedding`.
+- Make sure you have embeddings which are written by the [`ClipWriterStage`](video-save-export) under `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in parquet files containing the columns `id` and `embedding`.
 - Verify local paths or configure S3-compatible credentials. Provide `storage_options` in read/write keyword arguments when reading or writing cloud paths.
 
 
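The column requirement in the hunk above (`id` and `embedding`) can be checked before running deduplication. The following is a minimal sketch, not part of NeMo Curator: `missing_columns` and `check_parquet_file` are hypothetical helper names, and `check_parquet_file` assumes `pyarrow` is installed and that the path points to a local Parquet file.

```python
# Hypothetical pre-flight check; these helpers are illustrative only.
REQUIRED_COLUMNS = ("id", "embedding")


def missing_columns(column_names, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in order."""
    present = set(column_names)
    return [name for name in required if name not in present]


def check_parquet_file(path):
    """Read only the schema of a Parquet file and report missing columns.

    Assumption: pyarrow is available in the environment.
    """
    import pyarrow.parquet as pq  # imported lazily so the pure check has no dependency

    return missing_columns(pq.read_schema(path).names)


# Pure-Python usage on column names gathered elsewhere.
print(missing_columns(["id", "embedding", "extra"]))
print(missing_columns(["id"]))
```

An empty list means the file carries both required columns; any returned name identifies what the embedding export is missing.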
@@ -24,7 +24,7 @@ Use clip-level embeddings to identify near-duplicate video clips so your dataset
 Duplicate identification operates on clip-level embeddings produced during processing:
 
 1. **Inputs**
-   - Parquet batches from `ClipWriterStage` under `iv2_embd_parquet/` or `ce1_embd_parquet/`
+   - Parquet batches from `ClipWriterStage` under `ce1_embd_parquet/`
    - Columns: `id`, `embedding`
 
 2. **Outputs**
@@ -50,7 +50,7 @@ from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
 from nemo_curator.backends.xenna import XennaExecutor
 
 workflow = SemanticDeduplicationWorkflow(
-    input_path="/path/to/embeddings/", # e.g., iv2_embd_parquet/ or ce1_embd_parquet/
-- `iv2_embd_parquet/`, `ce1_embd_parquet/`: Parquet batches with columns `id` and `embedding`.
+- `ce1_embd/`: Per-clip embeddings (`.pickle`).
+- `ce1_embd_parquet/`: Parquet batches with columns `id` and `embedding`.
 - `processed_videos/`, `processed_clip_chunks/`: Video-level metadata and per-chunk statistics.
 
 ### Per-Clip Metadata
@@ -132,8 +132,8 @@ Each clip writes a JSON file under `metas/v0/` with clip- and window-level field
 
 ### Embeddings and Parquet outputs
 
-- When embeddings exist, the stage writes per-clip `.pickle` files under `iv2_embd/` or `ce1_embd/`.
-- The stage also batches embeddings per clip chunk into Parquet files under `iv2_embd_parquet/` or `ce1_embd_parquet/` with columns `id` and `embedding` and writes those files to disk.
+- When embeddings exist, the stage writes per-clip `.pickle` files under `ce1_embd/`.
+- The stage also batches embeddings per clip chunk into Parquet files under `ce1_embd_parquet/` with columns `id` and `embedding` and writes those files to disk.
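
Rows of the form (`id`, `embedding`), as written to the Parquet batches described above, are what duplicate identification consumes. The sketch below is a simplified illustration of the idea using brute-force cosine similarity in plain Python. It is not the algorithm used by `SemanticDeduplicationWorkflow`; the function names, the sample rows, and the `0.95` threshold are all assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(rows, threshold=0.95):
    """Compare every pair of (id, embedding) rows and keep pairs at or above threshold.

    This is O(n^2) and only suitable for small illustrative inputs.
    """
    pairs = []
    for i in range(len(rows)):
        id_i, emb_i = rows[i]
        for j in range(i + 1, len(rows)):
            id_j, emb_j = rows[j]
            if cosine_similarity(emb_i, emb_j) >= threshold:
                pairs.append((id_i, id_j))
    return pairs


# Made-up rows standing in for the `id` and `embedding` columns.
rows = [
    ("clip-a", [1.0, 0.0, 0.0]),
    ("clip-b", [0.99, 0.05, 0.0]),
    ("clip-c", [0.0, 1.0, 0.0]),
]
print(near_duplicate_pairs(rows))  # [('clip-a', 'clip-b')]
```

Raising the threshold makes the check stricter: with `threshold=0.9999`, the sample rows produce no pairs.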