Proposal: Protein 3D Structure Visualization for Dataset Viewer #7930

behroozazarkhalili · 2026-01-03T03:30:01Z

Proposal: Protein 3D Structure Visualization for HuggingFace Dataset Viewer

Executive Summary

This proposal outlines adding 3D protein structure visualization to the HuggingFace Dataset Viewer, enabling users to interactively view PDB and mmCIF molecular structures directly within the dataset preview interface.

Data Type Support (Updated Architecture)

Supported formats (from recent PRs):

PDB (PR Add lightweight PDB (Protein Data Bank) file support #7926): .pdb, .ent extensions via PdbFolder builder
mmCIF (PR feat: Add mmCIF file support for macromolecular structures #7925): .cif, .mmcif extensions via MmcifFolder builder
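As a rough illustration of how the extensions listed above could be mapped to a format, here is a minimal sketch. The names `STRUCTURE_FORMATS` and `detect_structure_format` are hypothetical and are not taken from either PR; only the four extensions come from the list above.

```python
from pathlib import PurePath
from typing import Optional

# Extension-to-format mapping, using only the extensions listed in the proposal.
# The dictionary and function names are illustrative, not part of the PRs.
STRUCTURE_FORMATS = {
    ".pdb": "pdb",
    ".ent": "pdb",
    ".cif": "mmcif",
    ".mmcif": "mmcif",
}


def detect_structure_format(filename: str) -> Optional[str]:
    """Return "pdb" or "mmcif" for a supported extension, otherwise None."""
    suffix = PurePath(filename).suffix.lower()  # case-insensitive match
    return STRUCTURE_FORMATS.get(suffix)


for name in ["1abc.PDB", "pdb1abc.ent", "1abc.cif", "reads.fastq"]:
    print(name, "->", detect_structure_format(name))
```

Compressed files such as `1abc.cif.gz` are deliberately not handled here; whether the builders accept them is not stated in this proposal.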

New Implementation Pattern (One Row = One Structure):

Both PRs have been refactored to follow the ImageFolder pattern, where each row in the dataset contains one complete protein structure file. This is the recommended ML-friendly approach:

>>> from datasets import load_dataset
>>> dataset = load_dataset("mmcif", data_dir="./structures")
>>> dataset[0]
{'structure': 'data_1ABC\n_entry.id 1ABC\n_atom_site...'}  # Complete mmCIF content

>>> from datasets import load_dataset
>>> dataset = load_dataset("pdb", data_dir="./pdbs")
>>> dataset[0]
{'structure': 'HEADER  PROTEIN  01-JAN-20  1ABC\nATOM...'}  # Complete PDB content
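To make the one-row-per-structure idea concrete without depending on the builders from the PRs, the following standard-library sketch walks a folder, emits one row per structure file, and uses the parent folder name as a label. It is an assumption-laden illustration: `rows_from_folder` and `split_rows` are hypothetical helpers, and the folder-name-as-label convention is borrowed from how ImageFolder-style layouts are usually described, not from the PR code.

```python
import random
import tempfile
from pathlib import Path

# Extensions named in the proposal; everything else is ignored.
SUPPORTED_SUFFIXES = {".pdb", ".ent", ".cif", ".mmcif"}


def rows_from_folder(root):
    """Return one dict per structure file; the parent folder name is the label."""
    rows = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            rows.append({"structure": path.read_text(), "label": path.parent.name})
    return rows


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle whole rows and cut once, so no structure spans both splits."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Tiny demo with placeholder file contents (not real structures).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "enzyme").mkdir()
    (root / "binder").mkdir()
    (root / "enzyme" / "1abc.pdb").write_text("HEADER    PLACEHOLDER\nEND\n")
    (root / "binder" / "2xyz.cif").write_text("data_2XYZ\n")
    (root / "notes.txt").write_text("ignored")
    rows = rows_from_folder(root)

train, test = split_rows(rows, test_fraction=0.5)
print([row["label"] for row in rows], len(train), len(test))
```

Because each row is a complete structure, shuffling or splitting never separates atoms that belong to the same file.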

Key Components:

ProteinStructure feature type: New feature type supporting both PDB and mmCIF formats with lazy loading
PdbFolder builder (PR Add lightweight PDB (Protein Data Bank) file support #7926): Folder-based loader for PDB files with label and metadata support
MmcifFolder builder (PR feat: Add mmCIF file support for macromolecular structures #7925): Folder-based loader for mmCIF files with label and metadata support

What gets visualized:

3D atomic coordinates (x, y, z)
Chain structures
Residue information
Atom types and elements
Secondary structure (helices, sheets)
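For readers unfamiliar with where these fields live, here is a minimal sketch of reading them from the fixed-width ATOM/HETATM records of a PDB string. It assumes the standard PDB column layout (coordinates in columns 31-54, element in columns 77-78) and ignores mmCIF, alternate locations, and the HELIX/SHEET records that carry secondary structure. `Atom`, `parse_pdb_atoms`, and `format_atom_line` are hypothetical names; the viewer library would do this parsing itself.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float
    element: str


def parse_pdb_atoms(pdb_text: str) -> List[Atom]:
    """Extract atoms from ATOM/HETATM records using fixed PDB columns."""
    atoms = []
    for line in pdb_text.splitlines():
        if line[:6].strip() not in ("ATOM", "HETATM"):
            continue
        line = line.ljust(80)  # tolerate records with trailing columns trimmed
        atoms.append(Atom(
            name=line[12:16].strip(),          # columns 13-16
            residue=line[17:20].strip(),       # columns 18-20
            chain=line[21].strip(),            # column 22
            residue_number=int(line[22:26]),   # columns 23-26
            x=float(line[30:38]),              # columns 31-38
            y=float(line[38:46]),              # columns 39-46
            z=float(line[46:54]),              # columns 47-54
            element=line[76:78].strip(),       # columns 77-78
        ))
    return atoms


def format_atom_line(serial, name, residue, chain, number, x, y, z, element):
    """Build a fixed-width ATOM record for the demo (occupancy/B-factor are dummies)."""
    return (
        f"{'ATOM':<6s}{serial:>5d} {name:<4s} {residue:>3s} {chain:1s}"
        f"{number:>4d}{'':4s}{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}"
        f"{'':10s}{element:>2s}"
    )


sample = "\n".join([
    "HEADER    EXAMPLE",
    format_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614, "N"),
    format_atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842, "C"),
    "END",
])
atoms = parse_pdb_atoms(sample)
for atom in atoms:
    print(atom)
```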

Not applicable (1D sequence only):

FASTA (PR feat(fasta): add lightweight FASTA file format support #7923) - text sequences, no 3D coordinates
FASTQ (PR Add lightweight FASTQ file format support #7924) - sequences with quality scores, no 3D coordinates

Visualization Library Comparison

| Library | Bundle Size (minified) | Bundle Size (gzipped) | License | Pros | Cons |
| --- | --- | --- | --- | --- | --- |
| 3Dmol.js | 512 KB | ~150 KB | BSD-3 | Lightweight, easy integration, good docs | Fewer advanced features |
| NGL Viewer | 1.3 MB | ~350 KB | MIT | Excellent MMTF support, beautiful rendering | Moderate complexity |
| Mol* | 4.6 MB | ~1.3 MB | MIT | Industry standard, used by RCSB PDB, feature-rich | Heavy, complex |
| PDBe Molstar | 5.8 MB | ~1.6 MB | Apache 2.0 | EMBL-EBI maintained, simpler Mol* wrapper | Still very heavy |

Bundle sizes verified by downloading actual distribution files from npm/CDN (January 2026)
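One way such figures could be reproduced is to compare the raw size of a downloaded bundle with its gzip-compressed size. The sketch below is only an assumption about the method, since the proposal does not say which tool or compression level was used; `bundle_sizes` and `bundle_sizes_from_file` are hypothetical helpers, and 1 KB is taken as 1024 bytes.

```python
import gzip
from pathlib import Path


def bundle_sizes(data: bytes) -> dict:
    """Return raw and gzip-compressed sizes in KB (1 KB = 1024 bytes)."""
    compressed = gzip.compress(data, compresslevel=9)
    return {
        "minified_kb": round(len(data) / 1024, 1),
        "gzipped_kb": round(len(compressed) / 1024, 1),
    }


def bundle_sizes_from_file(path) -> dict:
    """Same measurement for a bundle already downloaded to disk."""
    return bundle_sizes(Path(path).read_bytes())


# Synthetic stand-in for a minified bundle; real bundles compress far less well.
sample = b"function f(a,b){return a+b;}" * 2000
sizes = bundle_sizes(sample)
print(sizes)
```

Servers may use different gzip settings or Brotli, so transfer sizes seen in a browser can differ from this measurement.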

Recommendation: 3Dmol.js

Primary choice: 3Dmol.js

Rationale:

Bundle size: ~150 KB gzipped - the lightest option by far, ideal for lazy loading
Simple API: Easy to integrate with React/Next.js
BSD-3 License: Compatible with HuggingFace licensing
Active maintenance: Regular updates, good community support
Format support: Native PDB and mmCIF parsing built-in
Sufficient features: Rotation, zoom, style switching (cartoon, stick, sphere)

Why not Mol*? As Georgia noted, Mol* is heavy (~1.3 MB gzipped). Although it is the industry standard and is used by RCSB PDB, it is overkill for a dataset preview, where users only need to verify that the structure data looks correct.

Alternative for power users: If users need advanced features like density maps, ligand interactions, or sequence alignment overlay, consider PDBe Molstar as an optional "full viewer" mode.

Summary

Recommended approach:

Use 3Dmol.js (~150 KB gzipped) with lazy loading
Load the library only when a user views a PDB or mmCIF dataset
Simple integration, BSD-3 license, active community support

Backend implementation (Updated):

PR feat: Add mmCIF file support for macromolecular structures #7925 (mmCIF): Uses MmcifFolder builder with ProteinStructure feature type
PR Add lightweight PDB (Protein Data Bank) file support #7926 (PDB): Uses PdbFolder builder with ProteinStructure feature type
Both follow the one-row-per-structure pattern (like ImageFolder)
Each row's structure column contains the complete file content ready for 3D rendering

Next Steps

Get feedback on this proposal
Create proof-of-concept in a standalone demo if needed
Integrate into dataset-viewer once approach is approved
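A standalone proof-of-concept could be as small as a script that writes one HTML page per structure. The sketch below is a hedged starting point, not the proposed Viewer integration: `render_viewer_html` is a hypothetical helper, the CDN URL and the `$3Dmol.createViewer`, `addModel`, `setStyle`, `zoomTo`, and `render` calls reflect my understanding of the 3Dmol.js documentation and should be checked against it, and passing `"cif"` for mmCIF input is likewise an assumption.

```python
import json

# Assumed CDN location of the 3Dmol.js build; verify against the 3Dmol.js docs.
THREEDMOL_CDN = "https://3Dmol.org/build/3Dmol-min.js"

PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<body>
<div id="viewer" style="width: 640px; height: 480px; position: relative;"></div>
<script src="__CDN__"></script>
<script>
  const viewer = $3Dmol.createViewer(document.getElementById("viewer"),
                                     {backgroundColor: "white"});
  viewer.addModel(__DATA__, __FORMAT__);
  viewer.setStyle({}, {cartoon: {color: "spectrum"}});
  viewer.zoomTo();
  viewer.render();
</script>
</body>
</html>
"""


def render_viewer_html(structure: str, fmt: str = "pdb") -> str:
    """Embed one structure (a row's `structure` value) in a standalone page."""
    if fmt not in ("pdb", "mmcif"):
        raise ValueError(f"unsupported format: {fmt}")
    # JSON string literals are valid JavaScript; escape "</" so the file
    # content cannot close the surrounding <script> element early.
    data_js = json.dumps(structure).replace("</", "<\\/")
    format_js = json.dumps("cif" if fmt == "mmcif" else "pdb")
    page = PAGE_TEMPLATE.replace("__CDN__", THREEDMOL_CDN)
    page = page.replace("__FORMAT__", format_js)
    return page.replace("__DATA__", data_js)  # data last, so it is never re-scanned


page = render_viewer_html("HEADER    PLACEHOLDER\nEND\n", "pdb")
print(page[:60])
```

In the Viewer itself the script would presumably be loaded on demand rather than through a static `<script>` tag; this page only helps check that a row's content renders.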

This PR proposes adding 3D protein structure visualization to the HuggingFace Dataset Viewer using 3Dmol.js (~150KB gzipped). See PR body for full proposal details.

behroozazarkhalili · 2026-01-03T03:30:14Z

cc @georgia-hf - Following up on your question about protein visualization for the Dataset Viewer. This proposal recommends 3Dmol.js (~150KB gzipped) as a lightweight alternative to Mol* (~1.3MB gzipped).

Looking forward to your feedback!

lhoestq · 2026-01-05T14:34:31Z

Exciting! cc @cfahlgren1 @severo for the Viewer part

For the datasets part, I'll leave my feedback in the PRs :)

severo · 2026-01-05T14:41:51Z

I don't know the JS libraries, but indeed, the lighter the better, as we don't require advanced features.

lhoestq · 2026-01-05T14:50:03Z

From a quick look at the PDB and mmCIF PRs I noticed that the dataset has one row = one atom. However I humbly believe that such datasets would be more practical to use if one row = one structure. This way each row is independent, which is practical in ML to perform train/test splits or dataset shuffling.

This would also make it easier to add labels and metadata for each structure, similar to what we already do for images. E.g. you could group them per folder named after a label, or you could have a metadata.parquet file to add custom metadata per structure.

And this way in the Viewer it could show one 3D render per row.

What do you think?

behroozazarkhalili · 2026-01-05T16:00:45Z

@lhoestq @severo @georgia-hf I will be waiting for all your comments; then, I will start implementing the final plan.

Proposal: Protein 3D Structure Visualization for Dataset Viewer

2bbbd22


This was referenced Jan 9, 2026

Add lightweight PDB (Protein Data Bank) file support #7926

Open

feat: Add mmCIF file support for macromolecular structures #7925

Open
