
Commit 81786a9

Llane/docs image updates (#1087)
* image curation refactor changes Signed-off-by: Lawrence Lane <[email protected]>
* image curation docs refactor Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* no web dataset -- feedback updates Signed-off-by: Lawrence Lane <[email protected]>
* web dataset > tar Signed-off-by: Lawrence Lane <[email protected]>
* remove Signed-off-by: Lawrence Lane <[email protected]>
* remove Signed-off-by: Lawrence Lane <[email protected]>
* clarify Signed-off-by: Lawrence Lane <[email protected]>
* link and other fix Signed-off-by: Lawrence Lane <[email protected]>
* feedback Signed-off-by: Lawrence Lane <[email protected]>
* link Signed-off-by: Lawrence Lane <[email protected]>
* remove recs, source install update Signed-off-by: Lawrence Lane <[email protected]>
* minor updates (will fix this pg later) Signed-off-by: Lawrence Lane <[email protected]>
* remove internvideo2 from image Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
1 parent 8c68846 commit 81786a9

File tree

26 files changed: +2198 -813 lines changed

docs/about/concepts/image/data-export-concepts.md

Lines changed: 95 additions & 40 deletions

---
description: "Core concepts for saving and exporting curated image datasets including metadata, filtering, and resharding"
categories: ["concepts-architecture"]
tags: ["data-export", "tar-files", "parquet", "filtering", "resharding", "metadata"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "image-only"
---

(about-concepts-image-data-export)=

# Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

## Key Topics

- Saving metadata to Parquet files
- Exporting filtered datasets as tar archives
- Configuring output sharding
- Understanding output format structure
- Preparing data for downstream training or analysis

## Saving Results

After processing through the pipeline, you can save the curated images and their metadata using the `ImageWriterStage`.

**Example:**

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add the writer stage to the pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
```

- The writer stage creates tar files containing the curated images
- Metadata (if updated during the curation pipeline) is stored in separate Parquet files alongside the tar archives
- The number of images per tar file is configurable for optimal sharding
- `deterministic_name=True` ensures reproducible file naming based on the input content

## Pipeline-Based Filtering

Filtering happens automatically within the pipeline stages. Each filter stage (aesthetic, NSFW) removes images that don't meet its configured threshold, so only curated images reach the final `ImageWriterStage`.

**Example Pipeline Flow:**

```python
from nemo_curator.pipeline.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Complete pipeline with filtering
pipeline = Pipeline(name="image_curation")

# Load images
pipeline.add_stage(FilePartitioningStage(...))
pipeline.add_stage(ImageReaderStage(...))

# Generate embeddings
pipeline.add_stage(ImageEmbeddingStage(...))

# Filter by quality (removes low aesthetic scores)
pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))

# Filter NSFW content (removes high NSFW scores)
pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))

# Save curated results
pipeline.add_stage(ImageWriterStage(output_dir="/output/curated"))
```

- Filtering is built into the stages, so no separate filtering step is needed
- Only images that pass all filters reach the output
- Thresholds are configurable per stage

## Output Format

The `ImageWriterStage` creates tar archives containing the curated images, with accompanying metadata files:

**Output Structure:**

```bash
output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for the corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
```

**Format Details:**

- **Tar contents**: JPEG images with sequential or ID-based filenames
- **Metadata storage**: Separate Parquet files containing image paths, IDs, and processing metadata
- **Naming**: Deterministic or random file naming, based on configuration
- **Sharding**: Configurable number of images per tar file for optimal performance
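
As an illustration of how the output shards can be inspected after a run, the following sketch (not part of NeMo Curator itself) pairs one shard's Parquet metadata with the JPEG members of its tar archive using pandas and the standard library; the shard name and metadata columns are assumptions that depend on your pipeline configuration.

```python
import tarfile
import pandas as pd

# Hypothetical shard prefix; real names follow the images-{hash}-{index} pattern
shard = "/output/curated_dataset/images-abc123-000000"

# Per-shard metadata written alongside the tar archive
meta = pd.read_parquet(f"{shard}.parquet")
print(meta.head())

# JPEG members stored in the matching tar archive
with tarfile.open(f"{shard}.tar") as tar:
    jpegs = [m.name for m in tar.getmembers() if m.name.endswith(".jpg")]
print(f"{len(jpegs)} images in shard, {len(meta)} metadata rows")
```
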
## Configuring Output Sharding

The `ImageWriterStage` parameters control how images are distributed across the output tar files.

**Example:**

```python
# Configure output sharding
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=5000,  # Images per tar file
    remove_image_data=True,
    deterministic_name=True,
))
```

- Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency
- Smaller values create more files but enable better parallelism
- Larger values reduce the file count but may impact loading performance; see the quick estimate below

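To make the tradeoff concrete, a quick back-of-the-envelope sketch (using a hypothetical dataset size) shows how `images_per_tar` drives the number of output shards:

```python
# Rough shard-count estimate for a hypothetical 2M-image dataset
dataset_size = 2_000_000
for images_per_tar in (1_000, 5_000, 20_000):
    shards = -(-dataset_size // images_per_tar)  # ceiling division
    print(f"images_per_tar={images_per_tar:>6} -> {shards} tar files")
```
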
## Preparing for Downstream Use

- Ensure your exported dataset matches the requirements of your training or analysis pipeline.
- Use consistent naming and metadata fields for compatibility.
- Document any filtering or processing steps for reproducibility.
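
As a hedged illustration of these checks, the sketch below verifies that every exported tar archive has a matching Parquet metadata file and that the image counts agree; the output directory is an assumption, and the file naming follows the pattern shown earlier.

```python
from pathlib import Path
import tarfile
import pandas as pd

output_dir = Path("/output/curated_dataset")  # assumed export location

for tar_path in sorted(output_dir.glob("*.tar")):
    parquet_path = tar_path.with_suffix(".parquet")
    if not parquet_path.exists():
        print(f"Missing metadata for {tar_path.name}")
        continue
    with tarfile.open(tar_path) as tar:
        n_images = sum(1 for m in tar.getmembers() if m.name.endswith(".jpg"))
    n_rows = len(pd.read_parquet(parquet_path))
    status = "OK" if n_images == n_rows else "MISMATCH"
    print(f"{tar_path.name}: {n_images} images, {n_rows} metadata rows [{status}]")
```
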
Lines changed: 48 additions & 40 deletions

---
description: "Core concepts for loading and managing image datasets from tar archives with cloud storage support"
categories: ["concepts-architecture"]
tags: ["data-loading", "tar-archives", "dali", "cloud-storage", "sharding", "gpu-accelerated"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "image-only"
---

(about-concepts-image-data-loading)=

# Data Loading Concepts (Image)

This page covers the core concepts for loading and managing image datasets in NeMo Curator.

> **Input vs. Output**: This page focuses on **input** data formats for loading datasets into NeMo Curator. For information about **output** formats (including the Parquet metadata files created during export), see the [Data Export Concepts](data-export-concepts.md) page.
18-
NeMo Curator uses the [WebDataset](https://github.com/webdataset/webdataset) format for scalable, distributed image curation. A WebDataset directory contains sharded `.tar` files, each holding image-text pairs and metadata, along with corresponding `.parquet` files for tabular metadata. Optionally, `.idx` index files can be provided for fast DALI-based loading.
19+
## Input Data Format and Directory Structure
1920

20-
**Example directory structure:**
21+
NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The `ImageReaderStage` reads only JPEG images from input `.tar` files, ignoring other content.
2122

22-
```
23-
dataset/
23+
**Example input directory structure:**
24+
25+
```bash
26+
input_dataset/
2427
├── 00000.tar
2528
│ ├── 000000000.jpg
2629
│ ├── 000000000.txt
2730
│ ├── 000000000.json
2831
│ ├── ...
2932
├── 00001.tar
3033
│ ├── ...
31-
├── 00000.parquet
32-
├── 00001.parquet
33-
├── 00000.idx # optional
34-
├── 00001.idx # optional
3534
```
3635

37-
- `.tar` files: Contain images (`.jpg`), captions (`.txt`), and metadata (`.json`)
38-
- `.parquet` files: Tabular metadata for each record
39-
- `.idx` files: (Optional) Index files for fast DALI-based loading
36+
**Input file types:**
37+
38+
- `.tar` files: Contain images (`.jpg`), captions (`.txt`), and metadata (`.json`) - only images are loaded
39+
40+
:::{note} While tar archives may contain captions (`.txt`) and metadata (`.json`) files, the `ImageReaderStage` only extracts JPEG images. Other file types are ignored during the loading process.
41+
:::
4042

4143
Each record is identified by a unique ID (e.g., `000000031`), used as the prefix for all files belonging to that record.
4244

4345
## Sharding and Metadata Management
4446

4547
- **Sharding:** Datasets are split into multiple `.tar` files (shards) for efficient distributed processing.
46-
- **Metadata:** Each record has a unique ID, and metadata is stored both in `.json` (per record) and `.parquet` (per shard) files. The `.parquet` files enable fast, tabular access to metadata for filtering and analysis.
47-
48-
## Loading from Local Disk and Cloud Storage
48+
- **Metadata:** Each record has a unique ID, and metadata is stored in `.json` files (per record) within the tar archives.
4949

50-
NeMo Curator supports loading datasets from both local disk and cloud storage (S3, GCS, Azure) using the [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) library. This allows you to use the same API regardless of where your data is stored.
50+
## Loading from Local Disk
5151

5252
**Example:**
53-
```python
54-
from nemo_curator.datasets import ImageTextPairDataset
5553

56-
dataset = ImageTextPairDataset.from_webdataset(
57-
path="/path/to/webdataset", # or "s3://bucket/webdataset"
58-
id_col="key"
59-
)
54+
```python
55+
from nemo_curator.pipeline import Pipeline
56+
from nemo_curator.stages.file_partitioning import FilePartitioningStage
57+
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
58+
59+
# Create pipeline for loading
60+
pipeline = Pipeline(name="image_loading")
61+
62+
# Partition tar files
63+
pipeline.add_stage(FilePartitioningStage(
64+
file_paths="/path/to/tar_dataset",
65+
files_per_partition=1,
66+
file_extensions=[".tar"], # Required for ImageReaderStage
67+
))
68+
69+
# Load images with DALI
70+
pipeline.add_stage(ImageReaderStage(
71+
task_batch_size=100,
72+
verbose=True,
73+
num_threads=8,
74+
num_gpus_per_worker=0.25,
75+
))
6076
```
6177

6278
## DALI Integration for High-Performance Loading
6379

64-
[NVIDIA DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/) is used for efficient, GPU-accelerated loading and preprocessing of images from WebDataset tar files. DALI enables:
65-
- Fast image decoding and augmentation on GPU
66-
- Efficient shuffling and batching
67-
- Support for large-scale, distributed workflows
68-
69-
## Index Files
80+
The `ImageReaderStage` uses [NVIDIA DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/) for efficient, GPU-accelerated loading and preprocessing of JPEG images from tar files. DALI enables:
7081

71-
For large datasets, DALI can use `.idx` index files for each `.tar` to enable even faster loading. These index files are generated using DALI's `wds2idx` tool and must be placed alongside the corresponding `.tar` files.
72-
73-
- **How to generate:** See [DALI documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/data_loading/dataloading_webdataset.html#Creating-an-index)
74-
- **Naming:** Each index file must match its `.tar` file (e.g., `00000.tar``00000.idx`)
75-
- **Usage:** Set `use_index_files=True` in your embedder or loader.
82+
- **GPU Acceleration:** Fast image decoding on GPU with automatic CPU fallback
83+
- **Batch Processing:** Efficient batching and streaming of image data
84+
- **Tar Archive Processing:** Built-in support for tar archive format
85+
- **Memory Efficiency:** Streams images without loading entire datasets into memory
7686

7787
## Best Practices and Troubleshooting
88+
7889
- Use sharding to enable distributed and parallel processing.
79-
- Always include `.parquet` metadata for fast access and filtering.
80-
- For cloud storage, ensure your environment is configured with the appropriate credentials.
81-
- Use `.idx` files for large datasets to maximize DALI performance.
82-
- Monitor GPU memory and adjust batch size as needed.
83-
- If you encounter loading errors, check for missing or mismatched files in your dataset structure.
90+
- Watch GPU memory and adjust batch size as needed.
91+
- If you encounter loading errors, check for missing or mismatched files in your dataset structure.
