---
description: "Core concepts for saving and exporting curated image datasets including metadata, filtering, and resharding"
categories: ["concepts-architecture"]
tags: ["data-export", "tar-files", "parquet", "filtering", "resharding", "metadata"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "image-only"
---

(about-concepts-image-data-export)=

# Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

## Key Topics

- Saving metadata to Parquet files
- Exporting filtered datasets as tar archives
- Configuring output sharding
- Understanding output format structure
- Preparing data for downstream training or analysis

## Saving Results

After processing through the pipeline, you can save the curated images and metadata using the `ImageWriterStage`.

**Example:**

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add writer stage to pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
```

- The writer stage creates tar files with the curated images
- Metadata (if updated during the curation pipeline) is stored in separate Parquet files alongside the tar archives; see the sketch after this list for one way to load them
- The number of images per tar file is configurable for optimal sharding
- `deterministic_name=True` ensures reproducible file naming based on input content
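
To analyze the exported metadata, you can load the per-shard Parquet files with pandas. The following is a minimal sketch, not part of the NeMo Curator API; the output path and the `aesthetic_score` column are assumptions that depend on your configuration and on which stages ran.

```python
import glob

import pandas as pd

# Load every per-shard Parquet metadata file written by ImageWriterStage
metadata_files = sorted(glob.glob("/output/curated_dataset/*.parquet"))
metadata = pd.concat((pd.read_parquet(f) for f in metadata_files), ignore_index=True)

print(f"Curated images: {len(metadata)}")

# Column names depend on the stages in your pipeline; aesthetic_score is an assumption
if "aesthetic_score" in metadata.columns:
    print(metadata["aesthetic_score"].describe())
```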

## Pipeline-Based Filtering

Filtering happens automatically within the pipeline stages. Each filter stage (aesthetic, NSFW) removes images that don't meet the configured thresholds, so only curated images reach the final `ImageWriterStage`.

**Example Pipeline Flow:**

```python
from nemo_curator.pipeline.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Complete pipeline with filtering
pipeline = Pipeline(name="image_curation")

# Load images
pipeline.add_stage(FilePartitioningStage(...))
pipeline.add_stage(ImageReaderStage(...))

# Generate embeddings
pipeline.add_stage(ImageEmbeddingStage(...))

# Filter by quality (removes low aesthetic scores)
pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))

# Filter NSFW content (removes high NSFW scores)
pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))

# Save curated results
pipeline.add_stage(ImageWriterStage(output_dir="/output/curated"))
```

- Filtering is built into the stages, so no separate filtering step is needed
- Only images that pass all filters reach the output
- Thresholds are configurable per stage

## Output Format

The `ImageWriterStage` creates tar archives containing curated images with accompanying metadata files:

**Output Structure:**

```bash
output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
```

**Format Details:**

- **Tar contents**: JPEG images with sequential or ID-based filenames
- **Metadata storage**: Separate Parquet files containing image paths, IDs, and processing metadata (see the inspection sketch after this list)
- **Naming**: Deterministic or random file naming, depending on configuration
- **Sharding**: Configurable number of images per tar file for optimal performance
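
To sanity-check a shard, you can open one tar archive and its matching Parquet file side by side. This is a minimal sketch assuming the layout shown above; the shard name is a placeholder and the metadata columns depend on your pipeline.

```python
import tarfile

import pandas as pd

# Placeholder shard name; substitute one of your generated files
tar_path = "/output/curated_dataset/images-abc123-000000.tar"
parquet_path = tar_path.replace(".tar", ".parquet")

# List the image files stored in the tar archive
with tarfile.open(tar_path) as tar:
    image_names = [m.name for m in tar.getmembers() if m.isfile()]
print(f"{len(image_names)} files in {tar_path}")

# Compare against the metadata rows for the same shard
metadata = pd.read_parquet(parquet_path)
print(f"{len(metadata)} metadata rows")
print(metadata.head())
```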

## Configuring Output Sharding

The `ImageWriterStage` parameters control how images are distributed across the output tar files.

**Example:**

```python
# Configure output sharding
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=5000,  # Images per tar file
    remove_image_data=True,
    deterministic_name=True,
))
```

- Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency (a quick shard-count estimate follows this list)
- Smaller values create more files but enable better parallelism
- Larger values reduce the file count but may slow loading of individual shards
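
As a quick estimate, the number of output tar files is roughly the curated dataset size divided by `images_per_tar`. A back-of-the-envelope sketch with a made-up dataset size:

```python
import math

num_images = 2_000_000   # hypothetical number of curated images
images_per_tar = 5000    # matches the ImageWriterStage configuration above

num_tars = math.ceil(num_images / images_per_tar)
print(f"~{num_tars} tar files, each with a matching Parquet metadata file")
```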

## Preparing for Downstream Use

- Ensure your exported dataset matches the requirements of your training or analysis pipeline (a consistency-check sketch follows this list).
- Use consistent naming and metadata fields for compatibility.
- Document any filtering or processing steps for reproducibility.
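
One way to check that an export is internally consistent before handing it to a training job is to confirm that every tar archive has a matching Parquet file and that the row counts line up. This is a minimal sketch that assumes one metadata row per image, consistent with the format described above:

```python
import glob
import os
import tarfile

import pandas as pd

output_dir = "/output/curated_dataset"  # assumed export location

for tar_path in sorted(glob.glob(os.path.join(output_dir, "*.tar"))):
    parquet_path = tar_path.replace(".tar", ".parquet")
    assert os.path.exists(parquet_path), f"missing metadata for {tar_path}"

    with tarfile.open(tar_path) as tar:
        n_images = sum(1 for m in tar.getmembers() if m.isfile())
    n_rows = len(pd.read_parquet(parquet_path))

    # Assumption: each metadata row describes exactly one image in the tar
    assert n_images == n_rows, f"{tar_path}: {n_images} images vs. {n_rows} metadata rows"

print("All shards have matching metadata.")
```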