
Commit c94acab

Review Image Documentations (#1272)
* Review Image Tutorial
* pr comments resolved
* Update docs/about/concepts/image/data-loading-concepts.md
* pr comment resolve

Signed-off-by: Ao Tang <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
1 parent aaf6330 commit c94acab

File tree: 12 files changed (+135, -358 lines)


docs/about/concepts/image/data-export-concepts.md

Lines changed: 16 additions & 66 deletions
@@ -1,7 +1,7 @@
 ---
-description: "Core concepts for saving and exporting curated image datasets including metadata, filtering, and resharding"
+description: "Core concepts for saving and exporting curated image datasets including metadata and resharding"
 categories: ["concepts-architecture"]
-tags: ["data-export", "tar-files", "parquet", "filtering", "resharding", "metadata"]
+tags: ["data-export", "tar-files", "parquet", "resharding", "metadata"]
 personas: ["data-scientist-focused", "mle-focused"]
 difficulty: "intermediate"
 content_type: "concept"
@@ -16,10 +16,9 @@ This page covers the core concepts for saving and exporting curated image datase
 
 ## Key Topics
 
-- Saving metadata to Parquet files
-- Exporting filtered datasets as tar archives
-- Configuring output sharding
+- Saving curated images and metadata
 - Understanding output format structure
+- Configuring output sharding
 - Preparing data for downstream training or analysis
 
 ## Saving Results
@@ -34,56 +33,27 @@ from nemo_curator.stages.image.io.image_writer import ImageWriterStage
 # Add writer stage to pipeline
 pipeline.add_stage(ImageWriterStage(
     output_dir="/output/curated_dataset",
-    images_per_tar=1000,
+    images_per_tar=1000,  # Images per tar file
     remove_image_data=True,
     verbose=True,
     deterministic_name=True,  # Use deterministic naming for reproducible output
 ))
 ```
 
-- The writer stage creates tar files with curated images
-- Metadata (if updated during curation pipeline) is stored in separate Parquet files alongside tar archives
-- Configurable images per tar file for optimal sharding
-- `deterministic_name=True` ensures reproducible file naming based on input content
-
-## Pipeline-Based Filtering
-
-Filtering happens automatically within the pipeline stages. Each filter stage (aesthetic, NSFW) removes images that don't meet the configured thresholds, so only curated images reach the final `ImageWriterStage`.
-
-**Example Pipeline Flow:**
-
-```python
-from nemo_curator.pipeline.pipeline import Pipeline
-from nemo_curator.stages.file_partitioning import FilePartitioningStage
-from nemo_curator.stages.image.io.image_reader import ImageReaderStage
-from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
-from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
-from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
-from nemo_curator.stages.image.io.image_writer import ImageWriterStage
+**Key Parameters:**
 
-# Complete pipeline with filtering
-pipeline = Pipeline(name="image_curation")
+- `output_dir`: Directory where tar archives and metadata files are written
+- `images_per_tar`: Number of images per tar file for optimal sharding
+- `remove_image_data`: Whether to remove image data from memory after writing
+- `deterministic_name`: Ensures reproducible file naming based on input content
 
-# Load images
-pipeline.add_stage(FilePartitioningStage(...))
-pipeline.add_stage(ImageReaderStage(...))
+**Behavior:**
 
-# Generate embeddings
-pipeline.add_stage(ImageEmbeddingStage(...))
-
-# Filter by quality (removes low aesthetic scores)
-pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))
-
-# Filter NSFW content (removes high NSFW scores)
-pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))
-
-# Save curated results
-pipeline.add_stage(ImageWriterStage(output_dir="/output/curated"))
-```
-
-- Filtering is built into the stages - no separate filtering step needed
-- Images passing all filters reach the output
-- Thresholds are configurable per stage
+- The writer stage creates tar files with curated images
+- Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives
+- Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency
+  - Smaller values create more files but enable better parallelism
+  - Larger values reduce file count but may impact loading performance
 
 ## Output Format
 
@@ -106,26 +76,6 @@ output/
 - **Naming**: Deterministic or random naming based on configuration
 - **Sharding**: Configurable number of images per tar file for optimal performance
 
-## Configuring Output Sharding
-
-The `ImageWriterStage` parameters control how images get distributed across output tar files.
-
-**Example:**
-
-```python
-# Configure output sharding
-pipeline.add_stage(ImageWriterStage(
-    output_dir="/output/curated_dataset",
-    images_per_tar=5000,  # Images per tar file
-    remove_image_data=True,
-    deterministic_name=True,
-))
-```
-
-- Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency
-- Smaller values create more files but enable better parallelism
-- Larger values reduce file count but may impact loading performance
-
 ## Preparing for Downstream Use
 
 - Ensure your exported dataset matches the requirements of your training or analysis pipeline.

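**Example (downstream metadata inspection):** A minimal sketch of loading the exported Parquet metadata with pandas. The `*.parquet` glob under the output directory and the exact column names are illustrative assumptions; they depend on the writer configuration, not on anything this diff guarantees.

```python
import glob

import pandas as pd

# Collect the per-shard Parquet metadata files written alongside the tars.
metadata_files = sorted(glob.glob("/output/curated_dataset/*.parquet"))

# Combine them into one frame for inspection or downstream analysis.
metadata = pd.concat(
    (pd.read_parquet(path) for path in metadata_files),
    ignore_index=True,
)

print(len(metadata), "curated images")
print(metadata.head())  # expect columns such as image IDs, paths, and scores
```
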
docs/about/concepts/image/data-loading-concepts.md

Lines changed: 21 additions & 21 deletions
@@ -1,7 +1,7 @@
 ---
 description: "Core concepts for loading and managing image datasets from tar archives with cloud storage support"
 categories: ["concepts-architecture"]
-tags: ["data-loading", "tar-archives", "dali", "cloud-storage", "sharding", "gpu-accelerated"]
+tags: ["data-loading", "tar-archives", "dali", "cloud-storage", "gpu-accelerated"]
 personas: ["data-scientist-focused", "mle-focused"]
 difficulty: "intermediate"
 content_type: "concept"
@@ -14,8 +14,6 @@ modality: "image-only"
 
 This page covers the core concepts for loading and managing image datasets in NeMo Curator.
 
-> **Input vs. Output**: This page focuses on **input** data formats for loading datasets into NeMo Curator. For information about **output** formats (including Parquet metadata files created during export), see the [Data Export Concepts](data-export-concepts.md) page.
-
 ## Input Data Format and Directory Structure
 
 NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The `ImageReaderStage` reads only JPEG images from input `.tar` files, ignoring other content.
@@ -24,29 +22,28 @@ NeMo Curator loads image datasets from tar archives for scalable, distributed im
 
 ```bash
 input_dataset/
-├── 00000.tar
+├── 00000.tar  # Tar archive containing JPEG images
 │   ├── 000000000.jpg
-│   ├── 000000000.txt
-│   ├── 000000000.json
+│   ├── 000000001.jpg
+│   ├── 000000002.jpg
 │   ├── ...
 ├── 00001.tar
+│   ├── 000001000.jpg
+│   ├── 000001001.jpg
 │   ├── ...
 ```
 
-**Input file types:**
+**What gets loaded:**
 
-- `.tar` files: Contain images (`.jpg`), captions (`.txt`), and metadata (`.json`) - only images are loaded
+- `.tar` files: Tar archives containing JPEG images (`.jpg`)
+- Only JPEG images are extracted and processed
 
-:::{note} While tar archives may contain captions (`.txt`) and metadata (`.json`) files, the `ImageReaderStage` only extracts JPEG images. Other file types are ignored during the loading process.
+:::{note}
+**WebDataset Format Support**: If your tar archives follow the [WebDataset format](https://github.com/webdataset/webdataset) and contain additional files (captions as `.txt`, metadata as `.json`), the `ImageReaderStage` will **only extract JPEG images**. Other file types (`.txt`, `.json`, etc.) are automatically ignored during loading.
 :::
 
 Each record is identified by a unique ID (e.g., `000000031`), used as the prefix for all files belonging to that record.
 
-## Sharding and Metadata Management
-
-- **Sharding:** Datasets are split into multiple `.tar` files (shards) for efficient distributed processing.
-- **Metadata:** Each record has a unique ID, and metadata is stored in `.json` files (per record) within the tar archives.
-
 ## Loading from Local Disk
 
 **Example:**
@@ -59,20 +56,23 @@ from nemo_curator.stages.image.io.image_reader import ImageReaderStage
 # Create pipeline for loading
 pipeline = Pipeline(name="image_loading")
 
-# Partition tar files
+# Partition tar files for parallel processing
 pipeline.add_stage(FilePartitioningStage(
     file_paths="/path/to/tar_dataset",
-    files_per_partition=1,
-    file_extensions=[".tar"],  # Required for ImageReaderStage
+    files_per_partition=1,  # Process one tar file per partition
+    file_extensions=[".tar"],  # Only include .tar files
 ))
 
-# Load images with DALI
+# Load JPEG images from tar files using DALI
 pipeline.add_stage(ImageReaderStage(
-    batch_size=100,
+    batch_size=100,  # Number of images per batch
     verbose=True,
-    num_threads=8,
-    num_gpus_per_worker=0.25,
+    num_threads=8,  # Number of threads for I/O operations
+    num_gpus_per_worker=0.25,  # Allocate 1/4 GPU per worker
 ))
+
+# Execute the pipeline
+results = pipeline.run()
 ```
 
 ## DALI Integration for High-Performance Loading

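**Example (building a test input shard):** To try the input layout described above, you can pack JPEGs into a WebDataset-style shard with the standard library. A minimal sketch assuming Pillow is available for generating placeholder JPEGs; the shard path, record count, and image size are arbitrary:

```python
import io
import os
import tarfile

from PIL import Image  # assumed available; any JPEG source works

os.makedirs("input_dataset", exist_ok=True)

# Zero-padded record IDs become the member names inside numbered shards,
# matching the directory structure shown in the diff above.
with tarfile.open("input_dataset/00000.tar", "w") as shard:
    for record_id in range(3):
        buffer = io.BytesIO()
        Image.new("RGB", (256, 256)).save(buffer, format="JPEG")
        data = buffer.getvalue()
        member = tarfile.TarInfo(name=f"{record_id:09d}.jpg")
        member.size = len(data)
        shard.addfile(member, io.BytesIO(data))
```
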
docs/about/concepts/image/data-processing-concepts.md

Lines changed: 4 additions & 0 deletions
@@ -138,6 +138,7 @@ A typical image curation pipeline using NeMo Curator's stage-based architecture:
 **Example:**
 
 ```python
+from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.file_partitioning import FilePartitioningStage
 from nemo_curator.stages.image.io.image_reader import ImageReaderStage
 from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
@@ -156,6 +157,9 @@
     removal_parquets_dir="/path/to/removal_ids/duplicates",
     duplicate_id_field="id",
 ))
+
+# Execute the pipeline
+results = pipeline.run()
 ```
 
 This modular pipeline approach allows you to customize or skip stages based on your workflow needs. Filtering stages (aesthetic and NSFW filtering) must always follow embedding generation, as they require pre-computed embeddings as input.

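**Example (stages assembled end to end):** Combining the snippets from the changed pages gives a complete curation pipeline. This sketch only reuses stage names and arguments that appear in these docs; the `model_dir` argument to `ImageEmbeddingStage` and all paths are illustrative assumptions:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

pipeline = Pipeline(name="image_curation")

# Load JPEG images from tar shards
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))
pipeline.add_stage(ImageReaderStage(
    batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Generate embeddings first; the filter stages require them as input
pipeline.add_stage(ImageEmbeddingStage(model_dir="/models/clip"))  # model_dir assumed
pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))
pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))

# Write curated images plus Parquet metadata
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated",
    images_per_tar=1000,
))

# Execute the pipeline
results = pipeline.run()
```
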
docs/curate-images/index.md

Lines changed: 15 additions & 5 deletions
@@ -105,11 +105,21 @@ Load and process JPEG images from tar archives using DALI
 
 ### Process Data
 
-Transform and enhance your image data through classification, embeddings, and filters.
+Transform and enhance your image data through embeddings, classification, and filters.
 
 ::::{grid} 1 1 1 2
 :gutter: 1 1 1 2
 
+:::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Embeddings
+:link: image-process-data-embeddings
+:link-type: ref
+
+Generate image embeddings using CLIP models.
++++
+{bdg-secondary}`embeddings`
+
+:::
+
 :::{grid-item-card} {octicon}`filter;1.5em;sd-mr-1` Filters
 :link: image-process-data-filters
 :link-type: ref
@@ -120,13 +130,13 @@ Apply built-in filters for aesthetic quality and NSFW content filtering.
 
 :::
 
-:::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Embeddings
-:link: image-process-data-embeddings
+:::{grid-item-card} {octicon}`versions;1.5em;sd-mr-1` Deduplication
+:link: image-tutorials-dedup
 :link-type: ref
 
-Generate image embeddings using CLIP models.
+Remove duplicate images using semantic similarity and clustering.
 +++
-{bdg-secondary}`embeddings`
+{bdg-secondary}`deduplication` {bdg-secondary}`semantic` {bdg-secondary}`clustering`
 
 :::

docs/curate-images/load-data/tar-archives.md

Lines changed: 1 addition & 74 deletions
@@ -65,7 +65,7 @@ pipeline.add_stage(FilePartitioningStage(
 pipeline.add_stage(ImageReaderStage(
     batch_size=100,
     verbose=True,
-    num_threads=16,
+    num_threads=8,
     num_gpus_per_worker=0.25,
 ))
 
@@ -110,38 +110,6 @@ The `ImageReaderStage` is the core component that handles tar archive loading wi
 
 ## Parameters
 
-### FilePartitioningStage Parameters
-
-```{list-table}
-:header-rows: 1
-:widths: 20 15 15 50
-
-* - Parameter
-  - Type
-  - Default
-  - Description
-* - `file_paths`
-  - str | list[str]
-  - Required
-  - Path to directory containing tar files, or list of file paths
-* - `files_per_partition`
-  - int | None
-  - None
-  - Number of tar files to process per partition (controls parallelism). Defaults to 1 if both `files_per_partition` and `blocksize` are not provided
-* - `file_extensions`
-  - list[str] | None
-  - `[".jsonl", ".json", ".parquet"]`
-  - List of file extensions to include (for example, `[".tar"]`)
-* - `blocksize`
-  - int | str | None
-  - None
-  - Target size of the partitions. If provided, `files_per_partition` is ignored
-* - `limit`
-  - int | None
-  - None
-  - Maximum number of partitions to create
-```
-
 ### ImageReaderStage Parameters
 
 ```{list-table}
@@ -193,44 +161,3 @@ ImageObject(
 ```
 
 **Note**: Only JPEG images are extracted from tar files. Other content (text files, JSON metadata, etc.) within the tar archives is ignored during processing.
-
----
-
-## Performance Optimization
-
-### Hardware-Specific Configuration
-
-**GPU-Enabled Environments (Recommended)**
-
-```python
-# Optimal configuration for GPU acceleration
-pipeline.add_stage(ImageReaderStage(
-    batch_size=256,  # Larger batches for GPU throughput
-    num_threads=16,  # More threads for I/O parallelism
-    num_gpus_per_worker=0.5,  # Allocate more GPU memory
-    verbose=True,
-))
-```
-
-**CPU Environments**
-
-```python
-# Optimized for CPU decoding
-pipeline.add_stage(ImageReaderStage(
-    batch_size=64,  # Smaller batches to avoid memory pressure
-    num_threads=8,  # Fewer threads for CPU processing
-    num_gpus_per_worker=0,  # No GPU allocation
-    verbose=True,
-))
-```
-
-## Customization Options & Performance Tips
-
-- **GPU Acceleration**: Use a GPU-enabled environment for optimal performance. The stage automatically detects CUDA availability and uses GPU decoding when possible.
-- **Parallelism Control**: Adjust `files_per_partition` to control how many tar files are processed together. Lower values increase parallelism but may increase overhead.
-- **Batch Size Tuning**: Increase `batch_size` for better throughput, but ensure sufficient memory is available.
-- **Thread Configuration**: Adjust `num_threads` for I/O operations based on your storage system's characteristics.
-
----
-
-<!-- More advanced usage and troubleshooting tips can be added here. -->

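**Example (size-based partitioning):** The parameter table removed above documents `blocksize` and `limit` on `FilePartitioningStage`. A minimal sketch under those documented semantics; the `"512MB"` spelling and the paths are assumptions:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

pipeline = Pipeline(name="tar_loading")

# Partition by target size instead of file count; per the table,
# `files_per_partition` is ignored when `blocksize` is provided.
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    blocksize="512MB",  # assumed string form of the int | str | None type
    file_extensions=[".tar"],
    limit=4,  # cap the number of partitions for a quick smoke test
))

pipeline.add_stage(ImageReaderStage(
    batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

results = pipeline.run()
```
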
docs/curate-images/process-data/embeddings/clip-embedder.md

Lines changed: 21 additions & 0 deletions
@@ -25,6 +25,26 @@ The `ImageEmbeddingStage` generates CLIP embeddings for images using OpenAI's Vi
 
 The stage processes `ImageBatch` objects containing `ImageObject` instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the results in each `ImageObject.embedding` attribute.
 
+## Prerequisites
+
+Before using the `ImageEmbeddingStage`, ensure you have:
+
+### Model Setup
+
+The CLIP model weights are automatically downloaded from HuggingFace on first use. The stage will:
+
+1. Download the OpenAI CLIP ViT-L/14 model (~3.5GB) to the specified `model_dir`
+2. Cache the model for subsequent runs
+3. Load the model onto GPU (or CPU if GPU unavailable)
+
+**First-time setup:** The initial model download may take several minutes depending on your internet connection. Subsequent runs will use the cached model.
+
+### System Requirements
+
+- **GPU:** NVIDIA GPU with CUDA support (recommended for performance)
+- **Memory:** At least 8GB GPU memory for batch processing
+- **Disk Space:** ~4GB for model weights
+- **Python Dependencies:** PyTorch, transformers (installed with NeMo Curator)
+
 ## Usage
 
 ```python
@@ -46,6 +66,7 @@ pipeline.add_stage(FilePartitioningStage(
 # Stage 2: Read images
 pipeline.add_stage(ImageReaderStage(
     batch_size=100,
+    num_threads=8,
     num_gpus_per_worker=0.25,
 ))
 
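**Example (pinning the model cache):** The new Prerequisites section states that weights are downloaded to `model_dir` on first use and cached afterwards. A minimal sketch of fixing that location, with an optional offline switch; the constructor signature follows the page's prose, and the `HF_HUB_OFFLINE` step assumes the stage downloads through the HuggingFace Hub:

```python
import os

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Optional: once a first run has populated model_dir, later runs can be
# forced to resolve the model from the local cache only (assumption: the
# stage uses the HuggingFace Hub, which honors this variable).
os.environ["HF_HUB_OFFLINE"] = "1"

pipeline = Pipeline(name="clip_embedding")

# ... add partitioning and reader stages as in the Usage section ...

# The ~3.5GB ViT-L/14 weights are downloaded to model_dir on first use
# and reused from the cache on subsequent runs.
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/models/clip",  # assumed cache location
))
```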