feat: Add mmCIF file support for macromolecular structures #7925

behroozazarkhalili · 2025-12-31T20:11:32Z

Summary

This PR adds support for loading mmCIF (macromolecular Crystallographic Information File) files with load_dataset(), following the ImageFolder pattern where one row = one structure.

Based on feedback from @lhoestq in #7930, this approach makes datasets more practical for ML workflows:

- Each row is independent, enabling train/test splits and shuffling
- Easy to add labels (folder-based) and metadata (metadata.jsonl)
- Compatible with Dataset Viewer (one 3D render per row)

Architecture

Uses the FolderBasedBuilder pattern (like ImageFolder and AudioFolder):

class MmcifFolder(FolderBasedBuilder):
    BASE_FEATURE = ProteinStructure
    BASE_COLUMN_NAME = "structure"
    EXTENSIONS = [".cif", ".mmcif"]
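To make the folder convention concrete, here is a minimal sketch of how file discovery and label inference could work: keep files whose suffix is in EXTENSIONS and use the parent folder name as the label. The helper name `iter_structure_files` is hypothetical and is not part of `datasets`; the real FolderBasedBuilder also handles metadata files, splits, and archives, which this sketch ignores.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple

EXTENSIONS = (".cif", ".mmcif")  # same extensions as MmcifFolder above


def iter_structure_files(root: str) -> Iterator[Tuple[Path, Optional[str]]]:
    """Yield (path, label) pairs. Hypothetical helper, for illustration only."""
    root_path = Path(root)
    for path in sorted(root_path.rglob("*")):
        # Skip directories and anything that is not a structure file
        # (for example metadata.jsonl).
        if not path.is_file() or path.suffix.lower() not in EXTENSIONS:
            continue
        # Files directly under the root have no label folder.
        label = path.parent.name if path.parent != root_path else None
        yield path, label
```

With the `structures/enzymes/1abc.cif` and `structures/receptors/2def.cif` layout from the Usage section, this yields one pair per file, labeled `enzymes` and `receptors` respectively.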

New `ProteinStructure` Feature Type

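import pyarrow as pa
from datasets import load_dataset

# Arrow storage type: raw file bytes and/or a path, so files can be loaded lazily
storage_type = pa.struct({"bytes": pa.binary(), "path": pa.string()})

# Decoded: returns the structure file content as a string
dataset = load_dataset("mmcif", data_dir="structures/")
print(dataset["train"][0]["structure"])  # Full mmCIF file content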
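The bytes/path pair can be read as "either the content is embedded, or it is fetched from disk on demand". The sketch below illustrates that idea with two hypothetical helpers, `encode_structure` and `decode_structure`. It is not the PR's ProteinStructure implementation, and it assumes the files are UTF-8 text.

```python
from pathlib import Path


def encode_structure(path: str, embed: bool = False) -> dict:
    """Store only the path (lazy) or also embed the file bytes."""
    data = Path(path).read_bytes() if embed else None
    return {"bytes": data, "path": path}


def decode_structure(value: dict) -> str:
    """Return the file content as text, reading from disk if bytes are absent."""
    raw = value["bytes"]
    if raw is None:
        raw = Path(value["path"]).read_bytes()
    return raw.decode("utf-8")  # assumption: structure files are UTF-8 text
```

Embedding the bytes makes a row self-contained, while keeping only the path defers the read until the row is actually decoded.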

Supported Extensions

.cif, .mmcif

Usage

from datasets import load_dataset

# Load from directory
dataset = load_dataset("mmcif", data_dir="protein_structures/")

# Load with folder-based labels
# structures/
#   enzymes/
#     1abc.cif
#   receptors/
#     2def.cif
dataset = load_dataset("mmcif", data_dir="structures/")
print(dataset["train"][0])  # {"structure": "data_...", "label": "enzymes"}

# Load with metadata
# structures/
#   1abc.cif
#   metadata.jsonl  # {"file_name": "1abc.cif", "resolution": 2.5}
dataset = load_dataset("mmcif", data_dir="structures/")
print(dataset["train"][0])  # {"structure": "data_...", "resolution": 2.5}

# Drop labels or metadata
dataset = load_dataset("mmcif", data_dir="structures/", drop_labels=True)
dataset = load_dataset("mmcif", data_dir="structures/", drop_metadata=True)
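Since each row holds the raw mmCIF text, extracting coordinates is left to downstream code. The following is a simplified sketch of reading the first `_atom_site` loop from such a string. The function `parse_atom_site` is hypothetical, it assumes one whitespace-separated row per atom with no quoted values containing spaces, and a dedicated mmCIF parser would be the safer choice for real data.

```python
from typing import Dict, List

PREFIX = "_atom_site."


def parse_atom_site(mmcif_text: str, include_hetatm: bool = True) -> List[Dict[str, str]]:
    """Return one dict per atom from the first _atom_site loop (simplified)."""
    columns: List[str] = []
    rows: List[Dict[str, str]] = []
    reading_header = False
    for raw_line in mmcif_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "loop_":
            if columns:
                break  # the atom_site loop has ended and another loop begins
            reading_header = True
            continue
        if reading_header and line.startswith(PREFIX):
            columns.append(line[len(PREFIX):])
            continue
        if columns and line.startswith(("#", "_", "data_")):
            break  # end of the atom_site loop
        if not columns:
            reading_header = False  # header or data of some other category
            continue
        reading_header = False  # first data row seen, header is complete
        values = line.split()
        if len(values) != len(columns):
            continue  # skip rows this simplified tokenizer cannot handle
        row = dict(zip(columns, values))
        if not include_hetatm and row.get("group_PDB") == "HETATM":
            continue
        rows.append(row)
    return rows


SAMPLE = """data_1ABC
#
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM   1 N 10.0 20.0 30.0
HETATM 2 O 1.5 2.5 3.5
#
"""

atoms = parse_atom_site(SAMPLE, include_hetatm=False)
coords = [(float(a["Cartn_x"]), float(a["Cartn_y"]), float(a["Cartn_z"])) for a in atoms]
print(coords)  # [(10.0, 20.0, 30.0)]
```

In a real pipeline the input would be the `structure` column of a row rather than the inline `SAMPLE` string.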

Test Results

All 24 mmCIF tests + 15 ProteinStructure feature tests pass.

Related PRs

Add lightweight PDB (Protein Data Bank) file support #7926 - PDB support (same pattern)
Proposal: Protein 3D Structure Visualization for Dataset Viewer #7930 - Protein 3D visualization proposal

References

mmCIF specification: https://mmcif.wwpdb.org/
PDB archive: https://www.rcsb.org/

cc @lhoestq @georgia-hf

Add support for loading mmCIF (macromolecular Crystallographic Information File) format directly with load_dataset(). mmCIF is the modern standard for 3D macromolecular structures used by PDB since 2014.

Key features:
- Zero external dependencies: Pure Python parser for CIF syntax
- Streaming support: Generator-based parsing for large structure files
- Compression support: Auto-detection of gzip, bzip2, xz compressed files
- ML-ready output: Atomic coordinates suitable for structure-based ML models

Configuration options:
- columns: Select subset of atom_site columns (default: 11 common columns)
- include_hetatm: Option to exclude ligand/water HETATM records
- batch_size: Control atoms per batch (default: 100000)

Supported extensions: .cif, .mmcif (and compressed variants)
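The commit message does not say how compression is detected. One common approach, sketched here as an assumption rather than as the PR's code, is to inspect the leading magic bytes of the file and choose the matching standard-library opener. The helper name `open_text_auto` is hypothetical.

```python
import bz2
import gzip
import lzma
from typing import IO

# Leading "magic" bytes of each supported compression format.
_MAGIC = (
    (b"\x1f\x8b", gzip.open),      # gzip
    (b"BZh", bz2.open),            # bzip2
    (b"\xfd7zXZ\x00", lzma.open),  # xz
)


def open_text_auto(path: str, encoding: str = "utf-8") -> IO[str]:
    """Open a possibly compressed file as text. Hypothetical helper."""
    with open(path, "rb") as f:
        head = f.read(6)  # the longest signature above is 6 bytes
    for magic, opener in _MAGIC:
        if head.startswith(magic):
            return opener(path, "rt", encoding=encoding)
    # No known signature: treat the file as plain text.
    return open(path, "r", encoding=encoding)
```

Checking content rather than the file extension means a gzip-compressed file named `1abc.cif` is still read correctly.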

This refactors the mmCIF loader to follow the ImageFolder pattern, where each row in the dataset contains one complete protein structure file. This is the recommended ML-friendly approach for working with structural data.

Key changes:
- Add ProteinStructure feature type for handling protein structure files
  - Supports lazy loading (decode=False) or full content (decode=True)
  - Works with both PDB and mmCIF formats
- Rewrite MmcifFolder to extend FolderBasedBuilder
  - Supports folder-based labels (like ImageFolder)
  - Supports metadata.csv files for additional columns
  - Uses ProteinStructure as BASE_FEATURE
- Fix bug in FolderBasedBuilder._generate_examples where drop_metadata would fail with IndexError when metadata files were in the files list
  - Root cause: enumerate(files) created gaps in shard_idx when files were skipped due to extension filtering
  - Solution: Use separate valid_shard_idx counter that only increments when samples are actually yielded

Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("mmcif", data_dir="./structures")
>>> structure_content = dataset["train"][0]["structure"]  # Complete mmCIF content
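The shard index issue described in this commit can be reproduced in isolation. The sketch below is not the actual `_generate_examples` code. It uses two hypothetical generators to contrast an index taken from `enumerate(files)` with a separate counter that advances only when a sample is yielded.

```python
from typing import Iterator, List, Tuple

EXTENSIONS = (".cif", ".mmcif")


def keys_with_enumerate(files: List[str]) -> Iterator[Tuple[int, str]]:
    """Index taken from enumerate: skipped files leave gaps."""
    for shard_idx, name in enumerate(files):
        if not name.endswith(EXTENSIONS):
            continue  # e.g. metadata.jsonl is filtered out
        yield shard_idx, name


def keys_with_counter(files: List[str]) -> Iterator[Tuple[int, str]]:
    """Index from a separate counter: contiguous 0, 1, 2, ..."""
    valid_shard_idx = 0
    for name in files:
        if not name.endswith(EXTENSIONS):
            continue
        yield valid_shard_idx, name
        valid_shard_idx += 1  # only advances when a sample was yielded


files = ["metadata.jsonl", "1abc.cif", "2def.cif"]
print([i for i, _ in keys_with_enumerate(files)])  # [1, 2]
print([i for i, _ in keys_with_counter(files)])    # [0, 1]
```

With the metadata file first in the list, the enumerate-based indices start at 1, so any code that expects contiguous indices starting at 0 can run past the end of its list.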

- Fix line length in protein_structure.py error messages
- Sort imports alphabetically in __init__.py
- Format function calls and f-strings in test_mmcif.py

This was referenced Dec 31, 2025

Add lightweight PDB (Protein Data Bank) file support #7926

Open

Proposal: Protein 3D Structure Visualization for Dataset Viewer #7930

Open

behroozazarkhalili added 2 commits January 9, 2026 10:30

style: apply ruff formatting fixes

b96b3f8

