SimpleTuner supports loading datasets directly from the Hugging Face Hub, enabling efficient training on large-scale datasets without downloading all images locally.
The Hugging Face datasets backend allows you to:
- Load datasets directly from Hugging Face Hub
- Apply filters based on metadata or quality metrics
- Extract captions from dataset columns
- Handle composite/grid images
- Cache only the processed embeddings locally
Important: SimpleTuner requires full dataset access to build aspect ratio buckets and calculate batch sizes. While Hugging Face supports streaming datasets, this feature is not compatible with SimpleTuner's architecture. Use filtering to reduce very large datasets to manageable sizes.
To use a Hugging Face dataset, configure your dataloader with "type": "huggingface":
{
  "id": "my-hf-dataset",
  "type": "huggingface",
  "dataset_name": "username/dataset-name",
  "split": "train",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface",
  "caption_column": "text",
  "image_column": "image",
  "cache_dir": "cache/my-hf-dataset"
}

Required fields:
- type: Must be "huggingface"
- dataset_name: The Hugging Face dataset identifier (e.g., "laion/laion-aesthetic")
- caption_strategy: Must be "huggingface"
- metadata_backend: Must be "huggingface"

Optional fields:
- split: Dataset split to use (default: "train")
- revision: Specific dataset revision
- image_column: Column containing images (default: "image")
- caption_column: Column(s) containing captions
- cache_dir: Local cache directory for dataset files
- streaming: ⚠️ Currently not functional - SimpleTuner tries to efficiently scan the dataset to build metadata and encoder caches.
- num_proc: Number of processes for filtering (default: 16)
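As a quick sanity check before launching a run, the required fields listed above can be verified with a few lines of Python. This is a minimal sketch and not part of SimpleTuner: `check_hf_backend` is a hypothetical helper, and it encodes only the rules stated in this section.

```python
import json

# Fields that this section says must be exactly "huggingface".
FIXED_FIELDS = ("type", "caption_strategy", "metadata_backend")


def check_hf_backend(entry: dict) -> list:
    """Return a list of problems found in one dataloader entry (hypothetical helper)."""
    problems = []
    for key in FIXED_FIELDS:
        if entry.get(key) != "huggingface":
            problems.append(f'{key} must be "huggingface", got {entry.get(key)!r}')
    if not entry.get("dataset_name"):
        problems.append("dataset_name is required")
    if entry.get("streaming"):
        problems.append("streaming is not functional; remove it or set it to false")
    return problems


entry = json.loads("""
{
  "id": "my-hf-dataset",
  "type": "huggingface",
  "dataset_name": "username/dataset-name",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface"
}
""")
print(check_hf_backend(entry))  # an empty list means no problems were found
```

Running a check like this over every entry in the dataloader file catches a misspelled `caption_strategy` or a forgotten `metadata_backend` before any data is downloaded.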
The Hugging Face backend supports flexible caption extraction:
Single caption column:
{
  "caption_column": "caption"
}

Multiple caption columns:
{
  "caption_column": ["short_caption", "detailed_caption", "tags"]
}

Nested column with a fallback:
{
  "caption_column": "metadata.caption",
  "fallback_caption_column": "basic_caption"
}

Caption and dimension columns set inside the "huggingface" block:
{
  "huggingface": {
    "caption_column": "caption",
    "fallback_caption_column": "description",
    "description_column": "detailed_description",
    "width_column": "width",
    "height_column": "height"
  }
}

Apply filters to select only high-quality samples:
{
  "huggingface": {
    "filter_func": {
      "quality_thresholds": {
        "clip_score": 0.3,
        "aesthetic_score": 5.0,
        "resolution": 0.8
      },
      "quality_column": "quality_assessment"
    }
  }
}

Filter by collection and minimum dimensions:
{
  "huggingface": {
    "filter_func": {
      "collection": ["photo", "artwork"],
      "min_width": 512,
      "min_height": 512
    }
  }
}

Handle datasets with multiple images in a grid:
{
  "huggingface": {
    "composite_image_config": {
      "enabled": true,
      "image_count": 4,
      "select_index": 0
    }
  }
}

This configuration will:
- Detect 4-image grids
- Extract only the first image (index 0)
- Adjust dimensions accordingly
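The dimension adjustment can be pictured as computing a crop box for the selected panel. The sketch below is illustrative only: it assumes the panels sit side by side in one horizontal row with equal widths, which this section does not state, and `composite_crop_box` is a hypothetical helper. The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects.

```python
def composite_crop_box(width: int, height: int, image_count: int, select_index: int):
    """Return (left, upper, right, lower) for one panel of a composite image.

    Assumption: panels are laid out in a single horizontal row, equal widths.
    """
    if not 0 <= select_index < image_count:
        raise ValueError("select_index must be in [0, image_count)")
    panel_width = width // image_count
    left = select_index * panel_width
    return (left, 0, left + panel_width, height)


# A 2048x512 composite holding four 512x512 panels; pick the first one.
box = composite_crop_box(2048, 512, image_count=4, select_index=0)
print(box)  # (0, 0, 512, 512)
```

A grid arranged in rows and columns would need a different calculation, so check how the dataset actually lays out its composites before relying on this arithmetic.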
Basic photo dataset:
{
  "id": "aesthetic-photos",
  "type": "huggingface",
  "dataset_name": "aesthetic-foundation/aesthetic-photos",
  "split": "train",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface",
  "caption_column": "caption",
  "image_column": "image",
  "resolution": 1024,
  "resolution_type": "pixel",
  "minimum_image_size": 512,
  "cache_dir": "cache/aesthetic-photos"
}

Filtered dataset with multi-column captions:
{
  "id": "high-quality-art",
  "type": "huggingface",
  "dataset_name": "example/art-dataset",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface",
  "huggingface": {
    "caption_column": ["title", "description", "tags"],
    "fallback_caption_column": "filename",
    "width_column": "original_width",
    "height_column": "original_height",
    "filter_func": {
      "quality_thresholds": {
        "aesthetic_score": 6.0,
        "technical_quality": 0.8
      },
      "min_width": 768,
      "min_height": 768
    }
  },
  "resolution": 1024,
  "resolution_type": "pixel_area",
  "crop": true,
  "crop_aspect": "square"
}

Video dataset:
{
  "id": "video-dataset",
  "type": "huggingface",
  "dataset_type": "video",
  "dataset_name": "example/video-clips",
  "caption_strategy": "huggingface",
  "metadata_backend": "huggingface",
  "huggingface": {
    "caption_column": "description",
    "num_frames_column": "frame_count",
    "fps_column": "fps"
  },
  "video": {
    "num_frames": 125,
    "min_frames": 100
  },
  "resolution": 480,
  "resolution_type": "pixel"
}

The Hugging Face backend uses a virtual file system where images are referenced by their dataset index:
- 0.jpg → First item in dataset
- 1.jpg → Second item in dataset
- etc.
This allows the standard SimpleTuner pipeline to work without modification.
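The index-to-filename mapping described above is simple enough to sketch in a few lines. Both helpers below are hypothetical and exist only to illustrate the idea; the ".jpg" extension is taken from the examples above, and how SimpleTuner implements the mapping internally is not shown here.

```python
from pathlib import PurePosixPath


def index_to_virtual_path(index: int, extension: str = "jpg") -> str:
    """Build the virtual filename for a dataset row (hypothetical helper)."""
    return f"{index}.{extension}"


def virtual_path_to_index(path: str) -> int:
    """Recover the dataset row index from a virtual filename."""
    return int(PurePosixPath(path).stem)


print(index_to_virtual_path(0))          # 0.jpg
print(virtual_path_to_index("41.jpg"))   # 41
```

Because the filename carries the row index, any component that expects a file path can be handed a virtual path and the backend can still find the right dataset row.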
Cache locations:
- Dataset files: Cached according to Hugging Face datasets library defaults
- VAE embeddings: Stored in cache_dir/vae/{backend_id}/
- Text embeddings: Use standard text embed cache configuration
- Metadata: Stored in cache_dir/huggingface_metadata/{backend_id}/
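When inspecting or clearing caches, it can help to compute these directories from the backend configuration. The following is a small sketch under one assumption: that `{backend_id}` corresponds to the `id` field of the dataloader entry. `cache_paths` is a hypothetical helper, not a SimpleTuner function.

```python
from pathlib import Path


def cache_paths(cache_dir: str, backend_id: str) -> dict:
    """Return the VAE and metadata cache directories for one backend.

    Assumption: backend_id is the "id" value from the dataloader entry.
    """
    root = Path(cache_dir)
    return {
        "vae": root / "vae" / backend_id,
        "metadata": root / "huggingface_metadata" / backend_id,
    }


paths = cache_paths("cache/my-hf-dataset", "my-hf-dataset")
for name, path in paths.items():
    print(name, path.as_posix())
```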
Performance considerations:
- Initial scan: The first run will download dataset metadata and build aspect ratio buckets
- Dataset size: The entire dataset metadata must be loaded to build file lists and calculate lengths
- Filtering: Applied during initial load - filtered items won't be downloaded
- Cache reuse: Subsequent runs reuse cached metadata and embeddings
Note: While Hugging Face datasets support streaming, SimpleTuner requires full dataset access to build aspect buckets and calculate batch sizes. Very large datasets should be filtered to a manageable size.
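To see why a full pass over the metadata is needed, consider a simplified picture of aspect bucketing: every sample's width and height must be known before the buckets and the number of batches per bucket can be computed. The sketch below is illustrative only. Rounding the aspect ratio to two decimals and dropping incomplete batches are assumptions made for the example, not SimpleTuner's exact rules.

```python
from collections import defaultdict


def build_aspect_buckets(sizes, batch_size):
    """Group sample indices by rounded aspect ratio and count full batches."""
    buckets = defaultdict(list)
    for index, (width, height) in enumerate(sizes):
        buckets[round(width / height, 2)].append(index)
    # Only complete batches are counted in this simplified model.
    batches = {aspect: len(ids) // batch_size for aspect, ids in buckets.items()}
    return dict(buckets), batches


sizes = [(1024, 1024), (512, 512), (1536, 1024), (768, 512), (1024, 1536)]
buckets, batches = build_aspect_buckets(sizes, batch_size=2)
print(buckets)  # {1.0: [0, 1], 1.5: [2, 3], 0.67: [4]}
print(batches)  # {1.0: 1, 1.5: 1, 0.67: 0}
```

A streamed dataset reveals its samples one at a time, so neither dictionary could be finalized until the stream ends, which is the incompatibility the note above describes.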
Limitations:
- Read-only access (cannot modify source dataset)
- Requires internet connection for initial dataset access
- Some dataset formats may not be supported
- Streaming mode is not supported - SimpleTuner requires full dataset access
- Very large datasets must be filtered to manageable sizes
- Initial metadata loading can be memory-intensive for huge datasets
Error: Dataset 'username/dataset' not found
- Verify the dataset exists on Hugging Face Hub
- Check if the dataset is private (requires authentication)
- Ensure correct spelling of dataset name
Slow initial loading:
- Large datasets take time to load metadata and build buckets
- Use aggressive filtering to reduce dataset size
- Consider using a subset or filtered version of the dataset
- Cache files will speed up subsequent runs
Memory issues:
- Use filters to reduce dataset size before loading
- Reduce num_proc for filtering operations
- Consider splitting very large datasets into smaller chunks
- Use quality thresholds to limit the dataset to high-quality samples
Caption extraction issues:
- Verify column names match dataset schema
- Check for nested column structures
- Use fallback_caption_column for missing captions
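When debugging caption problems, it can be useful to reproduce the lookup by hand on a single row. The sketch below mimics the three `caption_column` forms described earlier (a single column, a list of columns, and a dotted nested path) plus the fallback. It is an approximation written for this guide: `get_column` and `extract_caption` are hypothetical helpers, and joining multiple columns with a space is an assumption.

```python
def get_column(item: dict, column: str):
    """Follow a dotted path such as "metadata.caption" through nested dicts."""
    value = item
    for part in column.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def extract_caption(item, caption_column, fallback_caption_column=None):
    """Return a caption string, or None if nothing usable is found."""
    columns = [caption_column] if isinstance(caption_column, str) else list(caption_column)
    parts = []
    for column in columns:
        value = get_column(item, column)
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    if parts:
        return " ".join(parts)  # assumption: columns are joined with a space
    if fallback_caption_column is not None:
        fallback = get_column(item, fallback_caption_column)
        if isinstance(fallback, str) and fallback.strip():
            return fallback.strip()
    return None


row = {"metadata": {"caption": ""}, "basic_caption": "a red bicycle"}
print(extract_caption(row, "metadata.caption", "basic_caption"))  # a red bicycle
```

If a call like this returns None for a row pulled from the dataset, the column names or nesting in the configuration probably do not match the dataset schema.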
While the configuration supports basic filtering, you can implement more complex filters by modifying the code. The filter function receives each dataset item and returns True/False.
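As an illustration of such a function, the sketch below keeps only items with a caption of at least three words and a moderate aspect ratio. The column names (`caption`, `width`, `height`) and the thresholds are hypothetical, and where such a function would be hooked into SimpleTuner's code is not covered here. A callable of this shape, taking one row and returning a boolean, is also what the Hugging Face datasets library's `Dataset.filter` accepts.

```python
def keep_item(item: dict) -> bool:
    """Keep items with a usable caption and a moderate aspect ratio."""
    caption = item.get("caption") or ""
    width, height = item.get("width", 0), item.get("height", 0)
    if len(caption.split()) < 3 or width <= 0 or height <= 0:
        return False
    aspect = width / height
    return 0.5 <= aspect <= 2.0


rows = [
    {"caption": "a dog on a beach", "width": 1024, "height": 768},
    {"caption": "dog", "width": 1024, "height": 768},
    {"caption": "a very wide panorama", "width": 4000, "height": 500},
]
kept = [row for row in rows if keep_item(row)]
print(len(kept))  # 1
```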
Combine Hugging Face datasets with local data:
[
  {
    "id": "hf-dataset",
    "type": "huggingface",
    "dataset_name": "laion/laion-art",
    "probability": 0.7
  },
  {
    "id": "local-dataset",
    "type": "local",
    "instance_data_dir": "/path/to/local/data",
    "probability": 0.3
  }
]

This configuration will sample 70% from the Hugging Face dataset and 30% from local data.
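The effect of the `probability` values can be simulated with weighted random draws. This is only a model of the behaviour described above, written with the standard library; it does not reproduce SimpleTuner's sampler.

```python
import random
from collections import Counter

backends = [
    {"id": "hf-dataset", "probability": 0.7},
    {"id": "local-dataset", "probability": 0.3},
]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = rng.choices(
    [b["id"] for b in backends],
    weights=[b["probability"] for b in backends],
    k=10_000,
)
counts = Counter(draws)
share = counts["hf-dataset"] / len(draws)
print(f"hf-dataset share: {share:.2f}")
```

Over many draws the share settles close to 0.7, while any individual batch can deviate from the configured ratio.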
For audio models (like ACE-Step), you can specify dataset_type: "audio".
{
  "id": "audio-dataset",
  "type": "huggingface",
  "dataset_type": "audio",
  "dataset_name": "my-audio-data",
  "audio_column": "audio",
  "config": {
    "audio_caption_fields": ["tags"],
    "lyrics_column": "lyrics"
  }
}

- audio_column: The column containing audio data (decoded or bytes). Defaults to "audio".
- audio_caption_fields: A list of column names to combine to form the prompt (text conditioning). Defaults to ["prompt", "tags"].
- lyrics_column: The column containing the song lyrics. Defaults to "lyrics". If this column is missing, SimpleTuner will check for "norm_lyrics" as a fallback.

Expected dataset columns:
- audio: The audio data.
- prompt/tags: Descriptive tags or prompts used for the text encoder.
- lyrics: Song lyrics used for the lyric encoder.
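The way these fields turn a dataset row into conditioning text can be sketched as follows. The defaults and the "norm_lyrics" fallback come from the descriptions above; the comma separator used to combine fields and the handling of list-valued tags are assumptions, and `build_audio_conditioning` is a hypothetical helper.

```python
def build_audio_conditioning(row, audio_caption_fields=("prompt", "tags"), lyrics_column="lyrics"):
    """Return (prompt, lyrics) for one dataset row (hypothetical helper)."""
    parts = []
    for field in audio_caption_fields:
        value = row.get(field)
        if isinstance(value, (list, tuple)):
            # Assumption: list-valued fields such as tags are comma-joined.
            value = ", ".join(str(v) for v in value)
        if value:
            parts.append(str(value))
    prompt = ", ".join(parts)
    # Fall back to "norm_lyrics" only when the lyrics column is absent.
    lyrics = row[lyrics_column] if lyrics_column in row else row.get("norm_lyrics")
    return prompt, lyrics


row = {"tags": ["lofi", "piano"], "norm_lyrics": "la la la"}
print(build_audio_conditioning(row, audio_caption_fields=["tags"]))
# ('lofi, piano', 'la la la')
```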