
Commit c94acab

Review Image Documentations (#1272)
* Review Image Tutorial
* pr comments resolved
* Update docs/about/concepts/image/data-loading-concepts.md
* pr comment resolve

Signed-off-by: Ao Tang <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
1 parent aaf6330 commit c94acab

File tree: 12 files changed (+135, -358 lines)


docs/about/concepts/image/data-export-concepts.md

Lines changed: 16 additions & 66 deletions
@@ -1,7 +1,7 @@
 ---
-description: "Core concepts for saving and exporting curated image datasets including metadata, filtering, and resharding"
+description: "Core concepts for saving and exporting curated image datasets including metadata and resharding"
 categories: ["concepts-architecture"]
-tags: ["data-export", "tar-files", "parquet", "filtering", "resharding", "metadata"]
+tags: ["data-export", "tar-files", "parquet", "resharding", "metadata"]
 personas: ["data-scientist-focused", "mle-focused"]
 difficulty: "intermediate"
 content_type: "concept"
@@ -16,10 +16,9 @@ This page covers the core concepts for saving and exporting curated image datase
 
 ## Key Topics
 
-- Saving metadata to Parquet files
-- Exporting filtered datasets as tar archives
-- Configuring output sharding
+- Saving curated images and metadata
 - Understanding output format structure
+- Configuring output sharding
 - Preparing data for downstream training or analysis
 
 ## Saving Results
@@ -34,56 +33,27 @@ from nemo_curator.stages.image.io.image_writer import ImageWriterStage
 # Add writer stage to pipeline
 pipeline.add_stage(ImageWriterStage(
     output_dir="/output/curated_dataset",
-    images_per_tar=1000,
+    images_per_tar=1000,  # Images per tar file
     remove_image_data=True,
     verbose=True,
     deterministic_name=True,  # Use deterministic naming for reproducible output
 ))
 ```
 
-- The writer stage creates tar files with curated images
-- Metadata (if updated during curation pipeline) is stored in separate Parquet files alongside tar archives
-- Configurable images per tar file for optimal sharding
-- `deterministic_name=True` ensures reproducible file naming based on input content
-
-## Pipeline-Based Filtering
-
-Filtering happens automatically within the pipeline stages. Each filter stage (aesthetic, NSFW) removes images that don't meet the configured thresholds, so only curated images reach the final `ImageWriterStage`.
-
-**Example Pipeline Flow:**
-
-```python
-from nemo_curator.pipeline.pipeline import Pipeline
-from nemo_curator.stages.file_partitioning import FilePartitioningStage
-from nemo_curator.stages.image.io.image_reader import ImageReaderStage
-from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
-from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
-from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
-from nemo_curator.stages.image.io.image_writer import ImageWriterStage
+**Key Parameters:**
 
-# Complete pipeline with filtering
-pipeline = Pipeline(name="image_curation")
+- `output_dir`: Directory where tar archives and metadata files are written
+- `images_per_tar`: Number of images per tar file for optimal sharding
+- `remove_image_data`: Whether to remove image data from memory after writing
+- `deterministic_name`: Ensures reproducible file naming based on input content
 
-# Load images
-pipeline.add_stage(FilePartitioningStage(...))
-pipeline.add_stage(ImageReaderStage(...))
+**Behavior:**
 
-# Generate embeddings
-pipeline.add_stage(ImageEmbeddingStage(...))
-
-# Filter by quality (removes low aesthetic scores)
-pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))
-
-# Filter NSFW content (removes high NSFW scores)
-pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))
-
-# Save curated results
-pipeline.add_stage(ImageWriterStage(output_dir="/output/curated"))
-```
-
-- Filtering is built into the stages - no separate filtering step needed
-- Images passing all filters reach the output
-- Thresholds are configurable per stage
+- The writer stage creates tar files with curated images
+- Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives
+- Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency
+  - Smaller values create more files but enable better parallelism
+  - Larger values reduce file count but may impact loading performance
 
 ## Output Format
 
@@ -106,26 +76,6 @@ output/
 - **Naming**: Deterministic or random naming based on configuration
 - **Sharding**: Configurable number of images per tar file for optimal performance
 
-## Configuring Output Sharding
-
-The `ImageWriterStage` parameters control how images get distributed across output tar files.
-
-**Example:**
-
-```python
-# Configure output sharding
-pipeline.add_stage(ImageWriterStage(
-    output_dir="/output/curated_dataset",
-    images_per_tar=5000,  # Images per tar file
-    remove_image_data=True,
-    deterministic_name=True,
-))
-```
-
-- Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency
-- Smaller values create more files but enable better parallelism
-- Larger values reduce file count but may impact loading performance
-
 ## Preparing for Downstream Use
 
 - Ensure your exported dataset matches the requirements of your training or analysis pipeline.

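**Example (downstream metadata inspection):** A minimal sketch of loading the exported Parquet metadata with pandas. The `*.parquet` glob under the output directory and the exact column names are illustrative assumptions; they depend on the writer configuration, not on anything this diff guarantees.

```python
import glob

import pandas as pd

# Collect the per-shard Parquet metadata files written alongside the tars.
metadata_files = sorted(glob.glob("/output/curated_dataset/*.parquet"))

# Combine them into one frame for inspection or downstream analysis.
metadata = pd.concat(
    (pd.read_parquet(path) for path in metadata_files),
    ignore_index=True,
)

print(len(metadata), "curated images")
print(metadata.head())  # expect columns such as image IDs, paths, and scores
```
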
docs/about/concepts/image/data-loading-concepts.md

Lines changed: 21 additions & 21 deletions
@@ -1,7 +1,7 @@
 ---
 description: "Core concepts for loading and managing image datasets from tar archives with cloud storage support"
 categories: ["concepts-architecture"]
-tags: ["data-loading", "tar-archives", "dali", "cloud-storage", "sharding", "gpu-accelerated"]
+tags: ["data-loading", "tar-archives", "dali", "cloud-storage", "gpu-accelerated"]
 personas: ["data-scientist-focused", "mle-focused"]
 difficulty: "intermediate"
 content_type: "concept"
@@ -14,8 +14,6 @@ modality: "image-only"
 
 This page covers the core concepts for loading and managing image datasets in NeMo Curator.
 
-> **Input vs. Output**: This page focuses on **input** data formats for loading datasets into NeMo Curator. For information about **output** formats (including Parquet metadata files created during export), see the [Data Export Concepts](data-export-concepts.md) page.
-
 ## Input Data Format and Directory Structure
 
 NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The `ImageReaderStage` reads only JPEG images from input `.tar` files, ignoring other content.
@@ -24,29 +22,28 @@ NeMo Curator loads image datasets from tar archives for scalable, distributed im
 
 ```bash
 input_dataset/
-├── 00000.tar
+├── 00000.tar  # Tar archive containing JPEG images
 │   ├── 000000000.jpg
-│   ├── 000000000.txt
-│   ├── 000000000.json
+│   ├── 000000001.jpg
+│   ├── 000000002.jpg
 │   ├── ...
 ├── 00001.tar
+│   ├── 000001000.jpg
+│   ├── 000001001.jpg
 │   ├── ...
 ```
 
-**Input file types:**
+**What gets loaded:**
 
-- `.tar` files: Contain images (`.jpg`), captions (`.txt`), and metadata (`.json`) - only images are loaded
+- `.tar` files: Tar archives containing JPEG images (`.jpg`)
+- Only JPEG images are extracted and processed
 
-:::{note} While tar archives may contain captions (`.txt`) and metadata (`.json`) files, the `ImageReaderStage` only extracts JPEG images. Other file types are ignored during the loading process.
+:::{note}
+**WebDataset Format Support**: If your tar archives follow the [WebDataset format](https://github.com/webdataset/webdataset) and contain additional files (captions as `.txt`, metadata as `.json`), the `ImageReaderStage` will **only extract JPEG images**. Other file types (`.txt`, `.json`, etc.) are automatically ignored during loading.
 :::
 
 Each record is identified by a unique ID (e.g., `000000031`), used as the prefix for all files belonging to that record.
 
-## Sharding and Metadata Management
-
-- **Sharding:** Datasets are split into multiple `.tar` files (shards) for efficient distributed processing.
-- **Metadata:** Each record has a unique ID, and metadata is stored in `.json` files (per record) within the tar archives.
-
 ## Loading from Local Disk
 
 **Example:**
@@ -59,20 +56,23 @@ from nemo_curator.stages.image.io.image_reader import ImageReaderStage
 # Create pipeline for loading
 pipeline = Pipeline(name="image_loading")
 
-# Partition tar files
+# Partition tar files for parallel processing
 pipeline.add_stage(FilePartitioningStage(
     file_paths="/path/to/tar_dataset",
-    files_per_partition=1,
-    file_extensions=[".tar"],  # Required for ImageReaderStage
+    files_per_partition=1,  # Process one tar file per partition
+    file_extensions=[".tar"],  # Only include .tar files
 ))
 
-# Load images with DALI
+# Load JPEG images from tar files using DALI
 pipeline.add_stage(ImageReaderStage(
-    batch_size=100,
+    batch_size=100,  # Number of images per batch
     verbose=True,
-    num_threads=8,
-    num_gpus_per_worker=0.25,
+    num_threads=8,  # Number of threads for I/O operations
+    num_gpus_per_worker=0.25,  # Allocate 1/4 GPU per worker
 ))
+
+# Execute the pipeline
+results = pipeline.run()
 ```
 
 ## DALI Integration for High-Performance Loading

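**Example (building a test input shard):** To try the input layout described above, you can pack JPEGs into a WebDataset-style shard with the standard library. A minimal sketch assuming Pillow is available for generating placeholder JPEGs; the shard path, record count, and image size are arbitrary:

```python
import io
import os
import tarfile

from PIL import Image  # assumed available; any JPEG source works

os.makedirs("input_dataset", exist_ok=True)

# Zero-padded record IDs become the member names inside numbered shards,
# matching the directory structure shown in the diff above.
with tarfile.open("input_dataset/00000.tar", "w") as shard:
    for record_id in range(3):
        buffer = io.BytesIO()
        Image.new("RGB", (256, 256)).save(buffer, format="JPEG")
        data = buffer.getvalue()
        member = tarfile.TarInfo(name=f"{record_id:09d}.jpg")
        member.size = len(data)
        shard.addfile(member, io.BytesIO(data))
```
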
docs/about/concepts/image/data-processing-concepts.md

Lines changed: 4 additions & 0 deletions
@@ -138,6 +138,7 @@ A typical image curation pipeline using NeMo Curator's stage-based architecture:
 **Example:**
 
 ```python
+from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.file_partitioning import FilePartitioningStage
 from nemo_curator.stages.image.io.image_reader import ImageReaderStage
 from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
@@ -156,6 +157,9 @@
     removal_parquets_dir="/path/to/removal_ids/duplicates",
     duplicate_id_field="id",
 ))
+
+# Execute the pipeline
+results = pipeline.run()
 ```
 
 This modular pipeline approach allows you to customize or skip stages based on your workflow needs. Filtering stages (aesthetic and NSFW filtering) must always follow embedding generation, as they require pre-computed embeddings as input.

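**Example (stages assembled end to end):** Combining the snippets from the changed pages gives a complete curation pipeline. This sketch only reuses stage names and arguments that appear in these docs; the `model_dir` argument to `ImageEmbeddingStage` and all paths are illustrative assumptions:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

pipeline = Pipeline(name="image_curation")

# Load JPEG images from tar shards
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))
pipeline.add_stage(ImageReaderStage(
    batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Generate embeddings first; the filter stages require them as input
pipeline.add_stage(ImageEmbeddingStage(model_dir="/models/clip"))  # model_dir assumed
pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))
pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))

# Write curated images plus Parquet metadata
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated",
    images_per_tar=1000,
))

# Execute the pipeline
results = pipeline.run()
```
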
docs/curate-images/index.md

Lines changed: 15 additions & 5 deletions
@@ -105,11 +105,21 @@ Load and process JPEG images from tar archives using DALI
 
 ### Process Data
 
-Transform and enhance your image data through classification, embeddings, and filters.
+Transform and enhance your image data through embeddings, classification, and filters.
 
 ::::{grid} 1 1 1 2
 :gutter: 1 1 1 2
 
+:::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Embeddings
+:link: image-process-data-embeddings
+:link-type: ref
+
+Generate image embeddings using CLIP models.
++++
+{bdg-secondary}`embeddings`
+
+:::
+
 :::{grid-item-card} {octicon}`filter;1.5em;sd-mr-1` Filters
 :link: image-process-data-filters
 :link-type: ref
@@ -120,13 +130,13 @@ Apply built-in filters for aesthetic quality and NSFW content filtering.
 
 :::
 
-:::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Embeddings
-:link: image-process-data-embeddings
+:::{grid-item-card} {octicon}`versions;1.5em;sd-mr-1` Deduplication
+:link: image-tutorials-dedup
 :link-type: ref
 
-Generate image embeddings using CLIP models.
+Remove duplicate images using semantic similarity and clustering.
 +++
-{bdg-secondary}`embeddings`
+{bdg-secondary}`deduplication` {bdg-secondary}`semantic` {bdg-secondary}`clustering`
 
 :::

docs/curate-images/load-data/tar-archives.md

Lines changed: 1 addition & 74 deletions
@@ -65,7 +65,7 @@ pipeline.add_stage(FilePartitioningStage(
 pipeline.add_stage(ImageReaderStage(
     batch_size=100,
     verbose=True,
-    num_threads=16,
+    num_threads=8,
     num_gpus_per_worker=0.25,
 ))
 
@@ -110,38 +110,6 @@ The `ImageReaderStage` is the core component that handles tar archive loading wi
 
 ## Parameters
 
-### FilePartitioningStage Parameters
-
-```{list-table}
-:header-rows: 1
-:widths: 20 15 15 50
-
-* - Parameter
-  - Type
-  - Default
-  - Description
-* - `file_paths`
-  - str | list[str]
-  - Required
-  - Path to directory containing tar files, or list of file paths
-* - `files_per_partition`
-  - int | None
-  - None
-  - Number of tar files to process per partition (controls parallelism). Defaults to 1 if both `files_per_partition` and `blocksize` are not provided
-* - `file_extensions`
-  - list[str] | None
-  - `[".jsonl", ".json", ".parquet"]`
-  - List of file extensions to include (for example, `[".tar"]`)
-* - `blocksize`
-  - int | str | None
-  - None
-  - Target size of the partitions. If provided, `files_per_partition` is ignored
-* - `limit`
-  - int | None
-  - None
-  - Maximum number of partitions to create
-```
-
 ### ImageReaderStage Parameters
 
 ```{list-table}
@@ -193,44 +161,3 @@ ImageObject(
 ```
 
 **Note**: Only JPEG images are extracted from tar files. Other content (text files, JSON metadata, etc.) within the tar archives is ignored during processing.
-
----
-
-## Performance Optimization
-
-### Hardware-Specific Configuration
-
-**GPU-Enabled Environments (Recommended)**
-
-```python
-# Optimal configuration for GPU acceleration
-pipeline.add_stage(ImageReaderStage(
-    batch_size=256,  # Larger batches for GPU throughput
-    num_threads=16,  # More threads for I/O parallelism
-    num_gpus_per_worker=0.5,  # Allocate more GPU memory
-    verbose=True,
-))
-```
-
-**CPU Environments**
-
-```python
-# Optimized for CPU decoding
-pipeline.add_stage(ImageReaderStage(
-    batch_size=64,  # Smaller batches to avoid memory pressure
-    num_threads=8,  # Fewer threads for CPU processing
-    num_gpus_per_worker=0,  # No GPU allocation
-    verbose=True,
-))
-```
-
-## Customization Options & Performance Tips
-
-- **GPU Acceleration**: Use a GPU-enabled environment for optimal performance. The stage automatically detects CUDA availability and uses GPU decoding when possible.
-- **Parallelism Control**: Adjust `files_per_partition` to control how many tar files are processed together. Lower values increase parallelism but may increase overhead.
-- **Batch Size Tuning**: Increase `batch_size` for better throughput, but ensure sufficient memory is available.
-- **Thread Configuration**: Adjust `num_threads` for I/O operations based on your storage system's characteristics.
-
----
-
-<!-- More advanced usage and troubleshooting tips can be added here. -->

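**Example (size-based partitioning):** The parameter table removed above documents `blocksize` and `limit` on `FilePartitioningStage`. A minimal sketch under those documented semantics; the `"512MB"` spelling and the paths are assumptions:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

pipeline = Pipeline(name="tar_loading")

# Partition by target size instead of file count; per the table,
# `files_per_partition` is ignored when `blocksize` is provided.
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    blocksize="512MB",  # assumed string form of the int | str | None type
    file_extensions=[".tar"],
    limit=4,  # cap the number of partitions for a quick smoke test
))

pipeline.add_stage(ImageReaderStage(
    batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

results = pipeline.run()
```
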
docs/curate-images/process-data/embeddings/clip-embedder.md

Lines changed: 21 additions & 0 deletions
@@ -25,6 +25,26 @@ The `ImageEmbeddingStage` generates CLIP embeddings for images using OpenAI's Vi
 
 The stage processes `ImageBatch` objects containing `ImageObject` instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the results in each `ImageObject.embedding` attribute.
 
+## Prerequisites
+
+Before using the `ImageEmbeddingStage`, ensure you have:
+
+### Model Setup
+
+The CLIP model weights are automatically downloaded from HuggingFace on first use. The stage will:
+
+1. Download the OpenAI CLIP ViT-L/14 model (~3.5GB) to the specified `model_dir`
+2. Cache the model for subsequent runs
+3. Load the model onto GPU (or CPU if GPU unavailable)
+
+**First-time setup:** The initial model download may take several minutes depending on your internet connection. Subsequent runs will use the cached model.
+
+### System Requirements
+
+- **GPU:** NVIDIA GPU with CUDA support (recommended for performance)
+- **Memory:** At least 8GB GPU memory for batch processing
+- **Disk Space:** ~4GB for model weights
+- **Python Dependencies:** PyTorch, transformers (installed with NeMo Curator)
+
 ## Usage
 
 ```python
@@ -46,6 +66,7 @@ pipeline.add_stage(FilePartitioningStage(
 # Stage 2: Read images
 pipeline.add_stage(ImageReaderStage(
     batch_size=100,
+    num_threads=8,
     num_gpus_per_worker=0.25,
 ))
 
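**Example (pinning the model cache):** The new Prerequisites section states that weights are downloaded to `model_dir` on first use and cached afterwards. A minimal sketch of fixing that location, with an optional offline switch; the constructor signature follows the page's prose, and the `HF_HUB_OFFLINE` step assumes the stage downloads through the HuggingFace Hub:

```python
import os

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Optional: once a first run has populated model_dir, later runs can be
# forced to resolve the model from the local cache only (assumption: the
# stage uses the HuggingFace Hub, which honors this variable).
os.environ["HF_HUB_OFFLINE"] = "1"

pipeline = Pipeline(name="clip_embedding")

# ... add partitioning and reader stages as in the Usage section ...

# The ~3.5GB ViT-L/14 weights are downloaded to model_dir on first use
# and reused from the cache on subsequent runs.
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/models/clip",  # assumed cache location
))
```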