Hugging Face Datasets Integration

SimpleTuner supports loading datasets directly from the Hugging Face Hub, enabling efficient training on large-scale datasets without downloading all images locally.

Overview

The Hugging Face datasets backend allows you to:

Load datasets directly from Hugging Face Hub
Apply filters based on metadata or quality metrics
Extract captions from dataset columns
Handle composite/grid images
Cache only the processed embeddings locally

Important: SimpleTuner requires full dataset access to build aspect ratio buckets and calculate batch sizes. While Hugging Face supports streaming datasets, this feature is not compatible with SimpleTuner's architecture. Use filtering to reduce very large datasets to manageable sizes.

Basic Configuration

To use a Hugging Face dataset, configure your dataloader with "type": "huggingface":

{
  "id": "my-hf-dataset",
  "type": "huggingface",
  "dataset_name": "username/dataset-name",
  "split": "train",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface",
  "caption_column": "text",
  "image_column": "image",
  "cache_dir": "cache/my-hf-dataset"
}

Required Fields

type: Must be "huggingface"
dataset_name: The Hugging Face dataset identifier (e.g., "laion/laion-aesthetic")
caption_strategy: Must be "huggingface"
metadata_backend: Must be "huggingface"
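
As a quick sanity check before launching a run, the required fields above can be verified with a few lines of Python. This is a minimal sketch, not part of SimpleTuner: the helper name check_hf_backend is hypothetical, and it only checks the four fields listed here.

```python
# Hypothetical pre-flight check for a single dataloader entry.
# It mirrors the "Required Fields" list; it is not SimpleTuner's own validation.
REQUIRED_VALUES = {
    "type": "huggingface",
    "caption_strategy": "huggingface",
    "metadata_backend": "huggingface",
}


def check_hf_backend(config: dict) -> list:
    """Return a list of problems found in the entry (empty if none)."""
    problems = []
    for key, expected in REQUIRED_VALUES.items():
        if config.get(key) != expected:
            problems.append(f'"{key}" must be "{expected}", got {config.get(key)!r}')
    if not config.get("dataset_name"):
        problems.append('"dataset_name" is required')
    return problems


entry = {
    "id": "my-hf-dataset",
    "type": "huggingface",
    "dataset_name": "username/dataset-name",
    "caption_strategy": "huggingface",
    "metadata_backend": "huggingface",
}
print(check_hf_backend(entry))
```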

Optional Fields

split: Dataset split to use (default: "train")
revision: Specific dataset revision
image_column: Column containing images (default: "image")
caption_column: Column(s) containing captions
cache_dir: Local cache directory for dataset files
streaming: ⚠️ Currently not functional. SimpleTuner needs to scan the full dataset to build metadata and encoder caches, which streaming does not allow.
num_proc: Number of processes for filtering (default: 16)

Caption Configuration

The Hugging Face backend supports flexible caption extraction:

Single Caption Column

{
  "caption_column": "caption"
}

Multiple Caption Columns

{
  "caption_column": ["short_caption", "detailed_caption", "tags"]
}

Nested Column Access

{
  "caption_column": "metadata.caption",
  "fallback_caption_column": "basic_caption"
}
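
The dotted form "metadata.caption" can be read as a path into a nested record. The sketch below shows one way such a lookup with a fallback could work. The helper names get_nested and pick_caption are hypothetical, and treating an empty caption as missing is an assumption, not confirmed SimpleTuner behavior.

```python
from collections.abc import Mapping
from typing import Any, Optional


def get_nested(item: Mapping, dotted: str) -> Optional[Any]:
    """Follow a dotted path such as "metadata.caption"; return None if any part is absent."""
    current: Any = item
    for part in dotted.split("."):
        if not isinstance(current, Mapping) or part not in current:
            return None
        current = current[part]
    return current


def pick_caption(item: Mapping, caption_column: str,
                 fallback_caption_column: Optional[str] = None) -> Optional[Any]:
    """Use the primary column, falling back when it is missing or empty (assumed rule)."""
    value = get_nested(item, caption_column)
    if not value and fallback_caption_column:
        value = get_nested(item, fallback_caption_column)
    return value


row = {"metadata": {"caption": "a cat on a sofa"}, "basic_caption": "cat"}
print(pick_caption(row, "metadata.caption", "basic_caption"))
```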

Advanced Caption Configuration

{
  "huggingface": {
    "caption_column": "caption",
    "fallback_caption_column": "description",
    "description_column": "detailed_description",
    "width_column": "width",
    "height_column": "height"
  }
}

Filtering Datasets

Apply filters to select only high-quality samples:

Quality-Based Filtering

{
  "huggingface": {
    "filter_func": {
      "quality_thresholds": {
        "clip_score": 0.3,
        "aesthetic_score": 5.0,
        "resolution": 0.8
      },
      "quality_column": "quality_assessment"
    }
  }
}
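
To make the threshold semantics concrete, here is a minimal sketch of a quality check. It assumes that the quality column holds a mapping from metric name to score and that each threshold is a minimum. Both points are assumptions for illustration, and passes_quality is a hypothetical name.

```python
def passes_quality(item: dict, quality_column: str, thresholds: dict) -> bool:
    """Keep an item only if every listed metric is present and meets its minimum."""
    scores = item.get(quality_column) or {}
    return all(
        scores.get(metric) is not None and scores[metric] >= minimum
        for metric, minimum in thresholds.items()
    )


thresholds = {"clip_score": 0.3, "aesthetic_score": 5.0, "resolution": 0.8}
good = {"quality_assessment": {"clip_score": 0.35, "aesthetic_score": 6.1, "resolution": 0.9}}
bad = {"quality_assessment": {"clip_score": 0.35, "aesthetic_score": 4.0, "resolution": 0.9}}

print(passes_quality(good, "quality_assessment", thresholds))
print(passes_quality(bad, "quality_assessment", thresholds))
```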

Collection/Subset Filtering

{
  "huggingface": {
    "filter_func": {
      "collection": ["photo", "artwork"],
      "min_width": 512,
      "min_height": 512
    }
  }
}
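
The collection and size filter can be pictured as a simple predicate over each row. This sketch assumes the dataset exposes "collection", "width", and "height" columns. Those column names and the function name are illustrative guesses, not documented behavior.

```python
def passes_collection_filter(item: dict, collections: list,
                             min_width: int, min_height: int) -> bool:
    """Keep items from an allowed collection that meet both minimum dimensions."""
    return (
        item.get("collection") in collections
        and item.get("width", 0) >= min_width
        and item.get("height", 0) >= min_height
    )


rows = [
    {"collection": "photo", "width": 1024, "height": 768},
    {"collection": "sketch", "width": 1024, "height": 768},
    {"collection": "artwork", "width": 400, "height": 900},
]
kept_rows = [r for r in rows if passes_collection_filter(r, ["photo", "artwork"], 512, 512)]
print(len(kept_rows))
```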

Composite Image Support

Handle datasets with multiple images in a grid:

{
  "huggingface": {
    "composite_image_config": {
      "enabled": true,
      "image_count": 4,
      "select_index": 0
    }
  }
}

This configuration will:

Detect 4-image grids
Extract only the first image (index 0)
Adjust dimensions accordingly
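
The extraction step can be sketched as computing a crop box for the selected tile. The layout is an assumption: this example treats the composite as a horizontal strip of equally sized images, which may not match how SimpleTuner detects grids. The returned tuple uses the (left, upper, right, lower) order that Pillow's Image.crop expects.

```python
def composite_crop_box(width: int, height: int, image_count: int, select_index: int) -> tuple:
    """Return the crop box for one tile, assuming tiles sit side by side in a single row."""
    if not 0 <= select_index < image_count:
        raise ValueError("select_index must be in range(image_count)")
    tile_width = width // image_count
    left = select_index * tile_width
    return (left, 0, left + tile_width, height)


# A 2048x512 strip of four images: the first tile covers the leftmost 512 pixels.
print(composite_crop_box(2048, 512, image_count=4, select_index=0))
```

After cropping, the effective image size is the tile size (512x512 in this example), which is what "adjust dimensions accordingly" refers to.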

Complete Example Configurations

Basic Photo Dataset

{
  "id": "aesthetic-photos",
  "type": "huggingface",
  "dataset_name": "aesthetic-foundation/aesthetic-photos",
  "split": "train",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface",
  "caption_column": "caption",
  "image_column": "image",
  "resolution": 1024,
  "resolution_type": "pixel",
  "minimum_image_size": 512,
  "cache_dir": "cache/aesthetic-photos"
}

Filtered High-Quality Dataset

{
  "id": "high-quality-art",
  "type": "huggingface",
  "dataset_name": "example/art-dataset",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface",
  "huggingface": {
    "caption_column": ["title", "description", "tags"],
    "fallback_caption_column": "filename",
    "width_column": "original_width",
    "height_column": "original_height",
    "filter_func": {
      "quality_thresholds": {
        "aesthetic_score": 6.0,
        "technical_quality": 0.8
      },
      "min_width": 768,
      "min_height": 768
    }
  },
  "resolution": 1024,
  "resolution_type": "pixel_area",
  "crop": true,
  "crop_aspect": "square"
}

Video Dataset

{
  "id": "video-dataset",
  "type": "huggingface",
  "dataset_type": "video",
  "dataset_name": "example/video-clips",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface",
  "huggingface": {
    "caption_column": "description",
    "num_frames_column": "frame_count",
    "fps_column": "fps"
  },
  "video": {
    "num_frames": 125,
    "min_frames": 100
  },
  "resolution": 480,
  "resolution_type": "pixel"
}

Virtual File System

The Hugging Face backend uses a virtual file system where images are referenced by their dataset index:

0.jpg → First item in dataset
1.jpg → Second item in dataset
etc.

This allows the standard SimpleTuner pipeline to work without modification.
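
The mapping between dataset indices and virtual file names is simple enough to sketch directly. The helper names below are hypothetical, and the "jpg" extension follows the examples above.

```python
from pathlib import PurePosixPath


def index_to_virtual_path(index: int, extension: str = "jpg") -> str:
    """Build the virtual file name for a dataset row."""
    return f"{index}.{extension}"


def virtual_path_to_index(path: str) -> int:
    """Recover the dataset row index from a virtual file name."""
    return int(PurePosixPath(path).stem)


print(index_to_virtual_path(0))
print(virtual_path_to_index("1.jpg"))
```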

Caching Behavior

Dataset files: Cached according to Hugging Face datasets library defaults
VAE embeddings: Stored in cache_dir/vae/{backend_id}/
Text embeddings: Use standard text embed cache configuration
Metadata: Stored in cache_dir/huggingface_metadata/{backend_id}/
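
For locating or clearing caches, the directory layout described above can be expressed as follows. This is only a sketch of the documented paths, and cache_locations is a hypothetical helper.

```python
from pathlib import PurePosixPath


def cache_locations(cache_dir: str, backend_id: str) -> dict:
    """Return the VAE and metadata cache directories for one backend."""
    root = PurePosixPath(cache_dir)
    return {
        "vae": root / "vae" / backend_id,
        "metadata": root / "huggingface_metadata" / backend_id,
    }


paths = cache_locations("cache/my-hf-dataset", "my-hf-dataset")
print(paths["vae"])
print(paths["metadata"])
```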

Performance Considerations

Initial scan: The first run will download dataset metadata and build aspect ratio buckets
Dataset size: The entire dataset metadata must be loaded to build file lists and calculate lengths
Filtering: Applied during the initial load, so items that are filtered out are never downloaded
Cache reuse: Subsequent runs reuse cached metadata and embeddings

Note: While Hugging Face datasets support streaming, SimpleTuner requires full dataset access to build aspect buckets and calculate batch sizes. Very large datasets should be filtered to a manageable size.

Limitations

Read-only access (cannot modify source dataset)
Requires internet connection for initial dataset access
Some dataset formats may not be supported
Streaming mode is not supported - SimpleTuner requires full dataset access
Very large datasets must be filtered to manageable sizes
Initial metadata loading can be memory-intensive for huge datasets

Troubleshooting

Dataset Not Found

Error: Dataset 'username/dataset' not found

Verify the dataset exists on Hugging Face Hub
Check if the dataset is private (requires authentication)
Ensure correct spelling of dataset name

Slow Initial Loading

Large datasets take time to load metadata and build buckets
Use aggressive filtering to reduce dataset size
Consider using a subset or filtered version of the dataset
Cache files will speed up subsequent runs

Memory Issues

Use filters to reduce dataset size before loading
Reduce num_proc for filtering operations
Consider splitting very large datasets into smaller chunks
Use quality thresholds to limit the dataset to high-quality samples

Caption Extraction Issues

Verify column names match dataset schema
Check for nested column structures
Use fallback_caption_column for missing captions

Advanced Usage

Custom Filter Functions

The configuration supports basic filtering only. More complex filters require modifying the SimpleTuner source code. The filter function receives each dataset item and returns True to keep it or False to drop it.
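
As an illustration of the shape such a function takes, here is a minimal sketch of a predicate over a single item. The column names ("caption", "width", "height") and the rule itself are made up for the example. The Hugging Face datasets library's Dataset.filter accepts a callable of this shape along with num_proc, but where exactly SimpleTuner would call it is not covered here.

```python
def custom_filter(item: dict) -> bool:
    """Keep landscape images whose caption has at least five words (example rule)."""
    caption = item.get("caption") or ""
    width = item.get("width", 0)
    height = item.get("height", 0)
    return width > height and len(caption.split()) >= 5


items = [
    {"caption": "a wide view of a mountain lake", "width": 1600, "height": 900},
    {"caption": "portrait", "width": 900, "height": 1600},
    {"caption": "short text", "width": 1600, "height": 900},
]
kept = [item for item in items if custom_filter(item)]
print(len(kept))
```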

Multi-Dataset Training

Combine Hugging Face datasets with local data:

[
  {
    "id": "hf-dataset",
    "type": "huggingface",
    "dataset_name": "laion/laion-art",
    "probability": 0.7
  },
  {
    "id": "local-dataset",
    "type": "local",
    "instance_data_dir": "/path/to/local/data",
    "probability": 0.3
  }
]

With these probabilities, about 70% of samples are drawn from the Hugging Face dataset and about 30% from local data.
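
The effect of the probability values can be illustrated with a weighted random choice between backends. This is a simplified sketch of the idea, not SimpleTuner's actual sampler. The function name and the fixed seed are only for the example.

```python
import random
from collections import Counter

backends = {"hf-dataset": 0.7, "local-dataset": 0.3}


def sample_backend(rng: random.Random, weights: dict) -> str:
    """Pick one backend id, weighted by its configured probability."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=1)[0]


rng = random.Random(0)  # fixed seed so the example is repeatable
counts = Counter(sample_backend(rng, backends) for _ in range(10_000))
share = counts["hf-dataset"] / 10_000
print(f"hf-dataset share: {share:.2f}")
```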

Audio Datasets

For audio models (like ACE-Step), you can specify dataset_type: "audio".

{
  "id": "audio-dataset",
  "type": "huggingface",
  "dataset_type": "audio",
  "dataset_name": "my-audio-data",
  "audio_column": "audio",
  "config": {
    "audio_caption_fields": ["tags"],
    "lyrics_column": "lyrics"
  }
}

audio_column: The column containing audio data (decoded or bytes). Defaults to "audio".
audio_caption_fields: A list of column names to combine to form the prompt (text conditioning). Defaults to ["prompt", "tags"].
lyrics_column: The column containing the song lyrics. Defaults to "lyrics". If this column is missing, SimpleTuner will check for "norm_lyrics" as a fallback.

Expected Columns

audio: The audio data.
prompt / tags: Descriptive tags or prompts used for the text encoder.
lyrics: Song lyrics used for the lyric encoder.
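
The defaults and the lyrics fallback described above can be sketched as a small row-level helper. The function name is hypothetical, and joining caption fields with ", " is an assumption; the document only says the fields are combined.

```python
def build_audio_inputs(item: dict,
                       audio_caption_fields=("prompt", "tags"),
                       lyrics_column: str = "lyrics") -> dict:
    """Combine caption fields into a prompt and resolve lyrics with the norm_lyrics fallback."""
    parts = []
    for field in audio_caption_fields:
        value = item.get(field)
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        if value:
            parts.append(str(value))
    lyrics = item.get(lyrics_column)
    if lyrics is None:
        # Documented fallback when the lyrics column is missing.
        lyrics = item.get("norm_lyrics")
    return {"prompt": ", ".join(parts), "lyrics": lyrics}


result = build_audio_inputs(
    {"prompt": "upbeat pop", "tags": ["female vocal", "synth"], "norm_lyrics": "la la"}
)
print(result)
```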
