
Commit 81786a9

Llane/docs image updates (#1087)
* image curation refactor changes Signed-off-by: Lawrence Lane <[email protected]>
* image curation docs refactor Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* no web dataset -- feedback updates Signed-off-by: Lawrence Lane <[email protected]>
* web dataset > tar Signed-off-by: Lawrence Lane <[email protected]>
* remove Signed-off-by: Lawrence Lane <[email protected]>
* remove Signed-off-by: Lawrence Lane <[email protected]>
* clarify Signed-off-by: Lawrence Lane <[email protected]>
* link and other fix Signed-off-by: Lawrence Lane <[email protected]>
* feedback Signed-off-by: Lawrence Lane <[email protected]>
* link Signed-off-by: Lawrence Lane <[email protected]>
* remove recs, source install update Signed-off-by: Lawrence Lane <[email protected]>
* minor updates (will fix this pg later) Signed-off-by: Lawrence Lane <[email protected]>
* remove internvideo2 from image Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
1 parent 8c68846 commit 81786a9

File tree

26 files changed: +2198 -813 lines changed

docs/about/concepts/image/data-export-concepts.md

Lines changed: 95 additions & 40 deletions

---
description: "Core concepts for saving and exporting curated image datasets including metadata, filtering, and resharding"
categories: ["concepts-architecture"]
tags: ["data-export", "tar-files", "parquet", "filtering", "resharding", "metadata"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "image-only"
---

(about-concepts-image-data-export)=

# Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

## Key Topics

- Saving metadata to Parquet files
- Exporting filtered datasets as tar archives
- Configuring output sharding
- Understanding output format structure
- Preparing data for downstream training or analysis

## Saving Results

After processing through the pipeline, you can save the curated images and their metadata using the `ImageWriterStage`.

**Example:**

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add the writer stage to the pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
```

- The writer stage creates tar files containing the curated images
- Metadata (if updated during the curation pipeline) is stored in separate Parquet files alongside the tar archives
- The number of images per tar file is configurable for optimal sharding
- `deterministic_name=True` ensures reproducible file naming based on the input content

## Pipeline-Based Filtering

Filtering happens automatically within the pipeline stages. Each filter stage (aesthetic, NSFW) removes images that don't meet its configured threshold, so only curated images reach the final `ImageWriterStage`.

**Example Pipeline Flow:**

```python
from nemo_curator.pipeline.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Complete pipeline with filtering
pipeline = Pipeline(name="image_curation")

# Load images
pipeline.add_stage(FilePartitioningStage(...))
pipeline.add_stage(ImageReaderStage(...))

# Generate embeddings
pipeline.add_stage(ImageEmbeddingStage(...))

# Filter by quality (removes low aesthetic scores)
pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))

# Filter NSFW content (removes high NSFW scores)
pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))

# Save curated results
pipeline.add_stage(ImageWriterStage(output_dir="/output/curated"))
```

- Filtering is built into the stages, so no separate filtering step is needed
- Only images that pass all filters reach the output
- Thresholds are configurable per stage

## Output Format

The `ImageWriterStage` creates tar archives containing the curated images, with accompanying metadata files:

**Output Structure:**

```bash
output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for the corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
```

**Format Details:**

- **Tar contents**: JPEG images with sequential or ID-based filenames
- **Metadata storage**: Separate Parquet files containing image paths, IDs, and processing metadata
- **Naming**: Deterministic or random file naming, based on configuration
- **Sharding**: Configurable number of images per tar file for optimal performance
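
As an illustration of how the output shards can be inspected after a run, the following sketch (not part of NeMo Curator itself) pairs one shard's Parquet metadata with the JPEG members of its tar archive using pandas and the standard library; the shard name and metadata columns are assumptions that depend on your pipeline configuration.

```python
import tarfile
import pandas as pd

# Hypothetical shard prefix; real names follow the images-{hash}-{index} pattern
shard = "/output/curated_dataset/images-abc123-000000"

# Per-shard metadata written alongside the tar archive
meta = pd.read_parquet(f"{shard}.parquet")
print(meta.head())

# JPEG members stored in the matching tar archive
with tarfile.open(f"{shard}.tar") as tar:
    jpegs = [m.name for m in tar.getmembers() if m.name.endswith(".jpg")]
print(f"{len(jpegs)} images in shard, {len(meta)} metadata rows")
```
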
## Configuring Output Sharding

The `ImageWriterStage` parameters control how images are distributed across the output tar files.

**Example:**

```python
# Configure output sharding
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=5000,  # Images per tar file
    remove_image_data=True,
    deterministic_name=True,
))
```

- Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency
- Smaller values create more files but enable better parallelism
- Larger values reduce the file count but may impact loading performance; see the quick estimate below

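To make the tradeoff concrete, a quick back-of-the-envelope sketch (using a hypothetical dataset size) shows how `images_per_tar` drives the number of output shards:

```python
# Rough shard-count estimate for a hypothetical 2M-image dataset
dataset_size = 2_000_000
for images_per_tar in (1_000, 5_000, 20_000):
    shards = -(-dataset_size // images_per_tar)  # ceiling division
    print(f"images_per_tar={images_per_tar:>6} -> {shards} tar files")
```
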
## Preparing for Downstream Use

- Ensure your exported dataset matches the requirements of your training or analysis pipeline.
- Use consistent naming and metadata fields for compatibility.
- Document any filtering or processing steps for reproducibility.
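
As a hedged illustration of these checks, the sketch below verifies that every exported tar archive has a matching Parquet metadata file and that the image counts agree; the output directory is an assumption, and the file naming follows the pattern shown earlier.

```python
from pathlib import Path
import tarfile
import pandas as pd

output_dir = Path("/output/curated_dataset")  # assumed export location

for tar_path in sorted(output_dir.glob("*.tar")):
    parquet_path = tar_path.with_suffix(".parquet")
    if not parquet_path.exists():
        print(f"Missing metadata for {tar_path.name}")
        continue
    with tarfile.open(tar_path) as tar:
        n_images = sum(1 for m in tar.getmembers() if m.name.endswith(".jpg"))
    n_rows = len(pd.read_parquet(parquet_path))
    status = "OK" if n_images == n_rows else "MISMATCH"
    print(f"{tar_path.name}: {n_images} images, {n_rows} metadata rows [{status}]")
```
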
Lines changed: 48 additions & 40 deletions

---
description: "Core concepts for loading and managing image datasets from tar archives with cloud storage support"
categories: ["concepts-architecture"]
tags: ["data-loading", "tar-archives", "dali", "cloud-storage", "sharding", "gpu-accelerated"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "image-only"
---

(about-concepts-image-data-loading)=

# Data Loading Concepts (Image)

This page covers the core concepts for loading and managing image datasets in NeMo Curator.

> **Input vs. Output**: This page focuses on **input** data formats for loading datasets into NeMo Curator. For information about **output** formats (including the Parquet metadata files created during export), see the [Data Export Concepts](data-export-concepts.md) page.
18-
NeMo Curator uses the [WebDataset](https://github.com/webdataset/webdataset) format for scalable, distributed image curation. A WebDataset directory contains sharded `.tar` files, each holding image-text pairs and metadata, along with corresponding `.parquet` files for tabular metadata. Optionally, `.idx` index files can be provided for fast DALI-based loading.
19+
## Input Data Format and Directory Structure
1920

20-
**Example directory structure:**
21+
NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The `ImageReaderStage` reads only JPEG images from input `.tar` files, ignoring other content.
2122

22-
```
23-
dataset/
23+
**Example input directory structure:**
24+
25+
```bash
26+
input_dataset/
2427
├── 00000.tar
2528
│ ├── 000000000.jpg
2629
│ ├── 000000000.txt
2730
│ ├── 000000000.json
2831
│ ├── ...
2932
├── 00001.tar
3033
│ ├── ...
31-
├── 00000.parquet
32-
├── 00001.parquet
33-
├── 00000.idx # optional
34-
├── 00001.idx # optional
3534
```
3635

37-
- `.tar` files: Contain images (`.jpg`), captions (`.txt`), and metadata (`.json`)
38-
- `.parquet` files: Tabular metadata for each record
39-
- `.idx` files: (Optional) Index files for fast DALI-based loading
36+
**Input file types:**
37+
38+
- `.tar` files: Contain images (`.jpg`), captions (`.txt`), and metadata (`.json`) - only images are loaded
39+
40+
:::{note} While tar archives may contain captions (`.txt`) and metadata (`.json`) files, the `ImageReaderStage` only extracts JPEG images. Other file types are ignored during the loading process.
41+
:::
4042

4143
Each record is identified by a unique ID (e.g., `000000031`), used as the prefix for all files belonging to that record.
4244

4345
## Sharding and Metadata Management
4446

4547
- **Sharding:** Datasets are split into multiple `.tar` files (shards) for efficient distributed processing.
46-
- **Metadata:** Each record has a unique ID, and metadata is stored both in `.json` (per record) and `.parquet` (per shard) files. The `.parquet` files enable fast, tabular access to metadata for filtering and analysis.
47-
48-
## Loading from Local Disk and Cloud Storage
48+
- **Metadata:** Each record has a unique ID, and metadata is stored in `.json` files (per record) within the tar archives.
4949

50-
NeMo Curator supports loading datasets from both local disk and cloud storage (S3, GCS, Azure) using the [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) library. This allows you to use the same API regardless of where your data is stored.
50+
## Loading from Local Disk
5151

5252
**Example:**
53-
```python
54-
from nemo_curator.datasets import ImageTextPairDataset
5553

56-
dataset = ImageTextPairDataset.from_webdataset(
57-
path="/path/to/webdataset", # or "s3://bucket/webdataset"
58-
id_col="key"
59-
)
54+
```python
55+
from nemo_curator.pipeline import Pipeline
56+
from nemo_curator.stages.file_partitioning import FilePartitioningStage
57+
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
58+
59+
# Create pipeline for loading
60+
pipeline = Pipeline(name="image_loading")
61+
62+
# Partition tar files
63+
pipeline.add_stage(FilePartitioningStage(
64+
file_paths="/path/to/tar_dataset",
65+
files_per_partition=1,
66+
file_extensions=[".tar"], # Required for ImageReaderStage
67+
))
68+
69+
# Load images with DALI
70+
pipeline.add_stage(ImageReaderStage(
71+
task_batch_size=100,
72+
verbose=True,
73+
num_threads=8,
74+
num_gpus_per_worker=0.25,
75+
))
6076
```
6177

6278
## DALI Integration for High-Performance Loading
6379

64-
[NVIDIA DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/) is used for efficient, GPU-accelerated loading and preprocessing of images from WebDataset tar files. DALI enables:
65-
- Fast image decoding and augmentation on GPU
66-
- Efficient shuffling and batching
67-
- Support for large-scale, distributed workflows
68-
69-
## Index Files
80+
The `ImageReaderStage` uses [NVIDIA DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/) for efficient, GPU-accelerated loading and preprocessing of JPEG images from tar files. DALI enables:
7081

71-
For large datasets, DALI can use `.idx` index files for each `.tar` to enable even faster loading. These index files are generated using DALI's `wds2idx` tool and must be placed alongside the corresponding `.tar` files.
72-
73-
- **How to generate:** See [DALI documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/data_loading/dataloading_webdataset.html#Creating-an-index)
74-
- **Naming:** Each index file must match its `.tar` file (e.g., `00000.tar``00000.idx`)
75-
- **Usage:** Set `use_index_files=True` in your embedder or loader.
82+
- **GPU Acceleration:** Fast image decoding on GPU with automatic CPU fallback
83+
- **Batch Processing:** Efficient batching and streaming of image data
84+
- **Tar Archive Processing:** Built-in support for tar archive format
85+
- **Memory Efficiency:** Streams images without loading entire datasets into memory
7686

7787
## Best Practices and Troubleshooting
88+
7889
- Use sharding to enable distributed and parallel processing.
79-
- Always include `.parquet` metadata for fast access and filtering.
80-
- For cloud storage, ensure your environment is configured with the appropriate credentials.
81-
- Use `.idx` files for large datasets to maximize DALI performance.
82-
- Monitor GPU memory and adjust batch size as needed.
83-
- If you encounter loading errors, check for missing or mismatched files in your dataset structure.
90+
- Watch GPU memory and adjust batch size as needed.
91+
- If you encounter loading errors, check for missing or mismatched files in your dataset structure.
