Commit e5ca35d

remove more internvid content
Signed-off-by: Lawrence Lane <llane@nvidia.com>
1 parent 5798cd0 commit e5ca35d

4 files changed, +14 -15 lines changed

docs/curate-video/process-data/dedup.md

Lines changed: 3 additions & 3 deletions
@@ -15,7 +15,7 @@ modality: "video-only"
 Use clip-level embeddings to identify near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.

 ## Before You Start
-- Make sure you have embeddings which are written by the [`ClipWriterStage`](video-save-export) under `iv2_embd_parquet/` or `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in parquet files containing the columns `id` and `embedding`.
+- Make sure you have embeddings which are written by the [`ClipWriterStage`](video-save-export) under `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in parquet files containing the columns `id` and `embedding`.
 - Verify local paths or configure S3-compatible credentials. Provide `storage_options` in read/write keyword arguments when reading or writing cloud paths.

@@ -24,7 +24,7 @@ Use clip-level embeddings to identify near-duplicate video clips so your dataset
 Duplicate identification operates on clip-level embeddings produced during processing:

 1. **Inputs**
-- Parquet batches from `ClipWriterStage` under `iv2_embd_parquet/` or `ce1_embd_parquet/`
+- Parquet batches from `ClipWriterStage` under `ce1_embd_parquet/`
 - Columns: `id`, `embedding`

 2. **Outputs**
@@ -50,7 +50,7 @@ from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
 from nemo_curator.backends.xenna import XennaExecutor

 workflow = SemanticDeduplicationWorkflow(
-    input_path="/path/to/embeddings/", # e.g., iv2_embd_parquet/ or ce1_embd_parquet/
+    input_path="/path/to/embeddings/", # e.g., ce1_embd_parquet/
     output_path="/path/to/duplicates/",
     cache_path="/path/to/cache/", # Optional: defaults to output_path
     n_clusters=1000,

docs/curate-video/process-data/index.md

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ pipeline.add_stage(
 )
 ```

-Path helpers are available to resolve common locations (such as `clips/`, `filtered_clips/`, `previews/`, `metas/v0/`, and `iv2_embd_parquet/`).
+Path helpers are available to resolve common locations (such as `clips/`, `filtered_clips/`, `previews/`, `metas/v0/`, and `ce1_embd_parquet/`).

 ```{toctree}
 :maxdepth: 2

docs/curate-video/save-export.md

Lines changed: 4 additions & 5 deletions
@@ -95,8 +95,8 @@ The writer produces these directories under `output_path`:
 - `filtered_clips/`: Media for filtered-out clips.
 - `previews/`: Preview images (`.webp`).
 - `metas/v0/`: Per-clip metadata (`.json`).
-- `iv2_embd/`, `ce1_embd/`: Per-clip embeddings (`.pickle`).
-- `iv2_embd_parquet/`, `ce1_embd_parquet/`: Parquet batches with columns `id` and `embedding`.
+- `ce1_embd/`: Per-clip embeddings (`.pickle`).
+- `ce1_embd_parquet/`: Parquet batches with columns `id` and `embedding`.
 - `processed_videos/`, `processed_clip_chunks/`: Video-level metadata and per-chunk statistics.

 ### Per-Clip Metadata
@@ -132,8 +132,8 @@ Each clip writes a JSON file under `metas/v0/` with clip- and window-level field

 ### Embeddings and Parquet outputs

-- When embeddings exist, the stage writes per-clip `.pickle` files under `iv2_embd/` or `ce1_embd/`.
-- The stage also batches embeddings per clip chunk into Parquet files under `iv2_embd_parquet/` or `ce1_embd_parquet/` with columns `id` and `embedding` and writes those files to disk.
+- When embeddings exist, the stage writes per-clip `.pickle` files under `ce1_embd/`.
+- The stage also batches embeddings per clip chunk into Parquet files under `ce1_embd_parquet/` with columns `id` and `embedding` and writes those files to disk.

 ## Helpers

@@ -150,7 +150,6 @@ clips_dir = ClipWriterStage.get_output_path_clips(OUT)
 filtered_clips_dir = ClipWriterStage.get_output_path_clips(OUT, filtered=True)
 previews_dir = ClipWriterStage.get_output_path_previews(OUT)
 metas_dir = ClipWriterStage.get_output_path_metas(OUT, "v0")
-iv2_parquet_dir = ClipWriterStage.get_output_path_iv2_embd_parquet(OUT)
 ce1_parquet_dir = ClipWriterStage.get_output_path_ce1_embd_parquet(OUT)
 processed_videos_dir = ClipWriterStage.get_output_path_processed_videos(OUT)
 processed_chunks_dir = ClipWriterStage.get_output_path_processed_clip_chunks(OUT)

docs/curate-video/tutorials/split-dedup.md

Lines changed: 6 additions & 6 deletions
@@ -49,7 +49,7 @@ Writer-related flags you can add:
 --dry-run # Write nothing; validate only
 ```

-The pipeline writes embeddings under `$OUT_DIR/iv2_embd_parquet/` (or `ce1_embd_parquet/` if you use Cosmos-Embed1).
+The pipeline writes embeddings under `$OUT_DIR/ce1_embd_parquet/` when using Cosmos-Embed1.

 ### Embedding Format Example

@@ -64,7 +64,7 @@ The pipeline writes embeddings to Parquet with two columns:

 ```text
 $OUT_DIR/
-  iv2_embd_parquet/
+  ce1_embd_parquet/
     1a2b3c4d-....parquet
     5e6f7g8h-....parquet
 ```
@@ -93,7 +93,7 @@ embedding: list<float32> # length = 768 for Cosmos-Embed1
 ```python
 import pyarrow.parquet as pq

-table = pq.read_table(f"{OUT_DIR}/iv2_embd_parquet")
+table = pq.read_table(f"{OUT_DIR}/ce1_embd_parquet")
 df = table.to_pandas()
 print(df.head()) # columns: id, embedding (list[float])
 ```
@@ -113,7 +113,7 @@ from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
 from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

-INPUT_PARQUET = f"{OUT_DIR}/iv2_embd_parquet" # or s3://...
+INPUT_PARQUET = f"{OUT_DIR}/ce1_embd_parquet" # or s3://...
 OUTPUT_DIR = f"{OUT_DIR}/semantic_dedup"

 pipe = Pipeline(name="video_semantic_dedup", description="K-means + pairwise duplicate removal")
@@ -175,7 +175,7 @@ Video-specific pointers:
 - Use `ClipWriterStage` path helpers to locate outputs: `nemo_curator/stages/video/io/clip_writer.py`.
 - Processed videos: `get_output_path_processed_videos(OUT_DIR)`
 - Clip chunks and previews: `get_output_path_processed_clip_chunks(OUT_DIR)`, `get_output_path_previews(OUT_DIR)`
-- Embeddings parquet: `${OUT_DIR}/iv2_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)
+- Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)

 ### Example Export

@@ -188,7 +188,7 @@ from glob import glob

 OUT_DIR = os.environ["OUT_DIR"]
 clips_dir = os.path.join(OUT_DIR, "clips") # adjust if filtering path used
-meta_parquet = os.path.join(OUT_DIR, "iv2_embd_parquet")
+meta_parquet = os.path.join(OUT_DIR, "ce1_embd_parquet")

 def iter_clips(path):
     for p in glob(os.path.join(path, "**", "*.mp4"), recursive=True):
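
Because this change removes `iv2_embd` references across several files, a short scan of the docs tree could confirm that none remain. The script below is a minimal sketch and is not part of the commit: the function name `find_stale_references`, the regular expression, and the `docs/curate-video` root are assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Assumed pattern: matches the directory and helper names this commit removes,
# for example `iv2_embd/`, `iv2_embd_parquet/`, or `get_output_path_iv2_embd_parquet`.
STALE = re.compile(r"iv2_embd(?:_parquet)?")


def find_stale_references(root):
    """Return (path, line_number, line) for each Markdown line under root that mentions iv2_embd."""
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for number, line in enumerate(text.splitlines(), start=1):
            if STALE.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical docs root; adjust to the repository layout.
    for path, number, line in find_stale_references("docs/curate-video"):
        print(f"{path}:{number}: {line}")
```

An empty result suggests the removal is complete for Markdown files. Source files and notebooks would need a broader glob than `*.md`.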
