@behroozazarkhalili behroozazarkhalili commented Dec 31, 2025

Summary

This PR adds support for loading mmCIF (macromolecular Crystallographic Information File) files with load_dataset(), following the ImageFolder pattern where one row = one structure.

Based on feedback from @lhoestq in #7930, this approach makes datasets more practical for ML workflows:

  • Each row is independent, enabling train/test splits and shuffling
  • Easy to add labels (folder-based) and metadata (metadata.jsonl)
  • Compatible with Dataset Viewer (one 3D render per row)

Architecture

Uses FolderBasedBuilder pattern (like ImageFolder, AudioFolder):

class MmcifFolder(FolderBasedBuilder):
    BASE_FEATURE = ProteinStructure
    BASE_COLUMN_NAME = "structure"
    EXTENSIONS = [".cif", ".mmcif"]

New ProteinStructure Feature Type

# Arrow schema for lazy loading
pa.struct({"bytes": pa.binary(), "path": pa.string()})

# Decoded: returns structure file content as string
dataset = load_dataset("mmcif", data_dir="structures/", split="train")
print(dataset[0]["structure"])  # Full mmCIF file content
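
To illustrate the lazy-loading schema above, here is a minimal sketch of how a {"bytes", "path"} value could be turned into file text. `decode_structure` is a hypothetical helper written for this illustration, not the PR's ProteinStructure code, and it assumes UTF-8 encoded files.

```python
from pathlib import Path


def decode_structure(value: dict) -> str:
    """Hypothetical helper: turn a {"bytes", "path"} value into file text."""
    data = value.get("bytes")
    if data is not None:
        # Embedded bytes take priority, so no file access is needed.
        return data.decode("utf-8")
    path = value.get("path")
    if path is not None:
        # Otherwise read the file lazily, only when the row is accessed.
        return Path(path).read_text(encoding="utf-8")
    raise ValueError("value needs at least one of 'bytes' or 'path'")


example = decode_structure({"bytes": b"data_1ABC\n#\n", "path": None})
print(example.splitlines()[0])
```

Storing either the bytes or only the path is what lets a row stay cheap until its content is actually requested.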

Supported Extensions

.cif, .mmcif

Usage

from datasets import load_dataset

# Load from directory
dataset = load_dataset("mmcif", data_dir="protein_structures/")

# Load with folder-based labels
# structures/
#   enzymes/
#     1abc.cif
#   receptors/
#     2def.cif
dataset = load_dataset("mmcif", data_dir="structures/", split="train")
print(dataset[0])  # {"structure": "data_...", "label": 0}  (0 is the ClassLabel index of "enzymes")

# Load with metadata
# structures/
#   1abc.cif
#   metadata.jsonl  # {"file_name": "1abc.cif", "resolution": 2.5}
dataset = load_dataset("mmcif", data_dir="structures/", split="train")
print(dataset[0])  # {"structure": "data_...", "resolution": 2.5}

# Drop labels or metadata
dataset = load_dataset("mmcif", data_dir="structures/", drop_labels=True)
dataset = load_dataset("mmcif", data_dir="structures/", drop_metadata=True)
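
For the metadata layout shown above, a metadata.jsonl file can be generated with the standard library alone. This is a minimal sketch: `write_metadata` is a hypothetical helper and the second row's values are made up for illustration. Only the "file_name" key and the one-object-per-line layout come from the example above.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(data_dir, rows):
    """Write one JSON object per line to <data_dir>/metadata.jsonl."""
    path = Path(data_dir) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return path


# Hypothetical rows; "file_name" links each row to a structure file.
rows = [
    {"file_name": "1abc.cif", "resolution": 2.5},
    {"file_name": "2def.cif", "resolution": 1.8},
]

with tempfile.TemporaryDirectory() as tmp:
    metadata_path = write_metadata(tmp, rows)
    written = metadata_path.read_text(encoding="utf-8").splitlines()
```

Every key other than "file_name" becomes an extra column alongside "structure" when the folder is loaded.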

Test Results

All 24 mmCIF tests + 15 ProteinStructure feature tests pass.

Related PRs

References

cc @lhoestq @georgia-hf

Add support for loading mmCIF (macromolecular Crystallographic Information File)
format directly with load_dataset(). mmCIF is the modern standard for 3D
macromolecular structures used by PDB since 2014.

Key features:
- Zero external dependencies: Pure Python parser for CIF syntax
- Streaming support: Generator-based parsing for large structure files
- Compression support: Auto-detection of gzip, bzip2, xz compressed files
- ML-ready output: Atomic coordinates suitable for structure-based ML models

Configuration options:
- columns: Select subset of atom_site columns (default: 11 common columns)
- include_hetatm: Option to exclude ligand/water HETATM records
- batch_size: Control atoms per batch (default: 100000)

Supported extensions: .cif, .mmcif (and compressed variants)
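
As a rough illustration of the generator-based parsing idea listed above, the sketch below yields one dict per atom_site row. It is not the parser from this commit: `iter_atom_site` is a hypothetical function, it assumes whitespace-separated values with no quoted or multi-line fields, and it skips rows it cannot tokenize.

```python
from typing import Dict, Iterable, Iterator


def iter_atom_site(lines: Iterable[str], include_hetatm: bool = True) -> Iterator[Dict[str, str]]:
    """Yield atom_site rows lazily from mmCIF lines (simplified sketch)."""
    columns = []       # item names of the loop currently being read
    in_header = False  # True while reading "_category.item" lines after "loop_"
    in_rows = False    # True while reading data rows of an atom_site loop
    for raw in lines:
        line = raw.strip()
        if line == "loop_":
            columns, in_header, in_rows = [], True, False
            continue
        if in_header and line.startswith("_"):
            columns.append(line)
            continue
        if in_header:
            # First non-header line: check whether this loop is atom_site.
            in_header = False
            in_rows = bool(columns) and all(c.startswith("_atom_site.") for c in columns)
        if not in_rows:
            continue
        if not line or line.startswith(("#", "_", "data_")):
            in_rows = False  # the loop has ended
            continue
        values = line.split()
        if len(values) != len(columns):
            continue  # row needs a real CIF tokenizer; skipped in this sketch
        row = {c.split(".", 1)[1]: v for c, v in zip(columns, values)}
        if not include_hetatm and row.get("group_PDB") == "HETATM":
            continue
        yield row


SAMPLE = """data_1ABC
#
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
ATOM 1 N 10.0
HETATM 2 O 11.5
#
"""

all_rows = list(iter_atom_site(SAMPLE.splitlines()))
protein_rows = list(iter_atom_site(SAMPLE.splitlines(), include_hetatm=False))
```

Because rows are yielded one at a time, a caller can group them into batches without holding the whole file in memory.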

This refactors the mmCIF loader to follow the ImageFolder pattern, where
each row in the dataset contains one complete protein structure file.
This is the recommended ML-friendly approach for working with structural data.

Key changes:
- Add ProteinStructure feature type for handling protein structure files
  - Supports lazy loading (decode=False) or full content (decode=True)
  - Works with both PDB and mmCIF formats
- Rewrite MmcifFolder to extend FolderBasedBuilder
  - Supports folder-based labels (like ImageFolder)
  - Supports metadata.csv files for additional columns
  - Uses ProteinStructure as BASE_FEATURE
- Fix bug in FolderBasedBuilder._generate_examples where drop_metadata
  would fail with IndexError when metadata files were in the files list
  - Root cause: enumerate(files) created gaps in shard_idx when files
    were skipped due to extension filtering
  - Solution: Use separate valid_shard_idx counter that only increments
    when samples are actually yielded
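
The index gap described above can be reproduced in isolation. This is a simplified sketch, not the actual FolderBasedBuilder._generate_examples code: the two functions and the file list are hypothetical, and only the name valid_shard_idx is taken from the fix.

```python
def shard_ids_with_enumerate(files, extensions):
    """Buggy pattern: the index advances even for files that are skipped."""
    return [idx for idx, name in enumerate(files) if name.endswith(extensions)]


def shard_ids_with_counter(files, extensions):
    """Fixed pattern: the index advances only when a sample is yielded."""
    ids = []
    valid_shard_idx = 0
    for name in files:
        if not name.endswith(extensions):
            continue  # skipped files must not consume an index
        ids.append(valid_shard_idx)
        valid_shard_idx += 1
    return ids


files = ["metadata.jsonl", "1abc.cif", "notes.txt", "2def.cif"]
extensions = (".cif", ".mmcif")

gapped = shard_ids_with_enumerate(files, extensions)    # [1, 3]
contiguous = shard_ids_with_counter(files, extensions)  # [0, 1]
```

Any consumer that expects indices 0..n-1 breaks on the gapped sequence, which matches the IndexError described above.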

Usage:
    >>> from datasets import load_dataset
    >>> dataset = load_dataset("mmcif", data_dir="./structures", split="train")
    >>> structure_content = dataset[0]["structure"]  # Complete mmCIF content

- Fix line length in protein_structure.py error messages
- Sort imports alphabetically in __init__.py
- Format function calls and f-strings in test_mmcif.py