@behroozazarkhalili

behroozazarkhalili commented Jan 3, 2026

Proposal: Protein 3D Structure Visualization for HuggingFace Dataset Viewer

Executive Summary

This proposal outlines adding 3D protein structure visualization to the HuggingFace Dataset Viewer, enabling users to interactively view PDB and mmCIF molecular structures directly within the dataset preview interface.


Data Type Support (Updated Architecture)

Supported formats (from recent PRs):

New Implementation Pattern (One Row = One Structure):

Both PRs have been refactored to follow the ImageFolder pattern, where each row in the dataset contains one complete protein structure file. This is the recommended ML-friendly approach:

>>> from datasets import load_dataset
>>> dataset = load_dataset("mmcif", data_dir="./structures")
>>> dataset[0]
{'structure': 'data_1ABC\n_entry.id 1ABC\n_atom_site...'}  # Complete mmCIF content

>>> from datasets import load_dataset
>>> dataset = load_dataset("pdb", data_dir="./pdbs")
>>> dataset[0]
{'structure': 'HEADER  PROTEIN  01-JAN-20  1ABC\nATOM...'}  # Complete PDB content
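To make the row layout concrete without depending on the loaders from the PRs, here is a minimal standard-library sketch. The helpers `rows_from_folder` and `train_test_split` are hypothetical names used only for illustration and are not part of the `datasets` API; the sketch assumes each `.pdb` or `.cif` file holds exactly one structure.

```python
import random
import tempfile
from pathlib import Path


# Hypothetical helper (not a `datasets` function): one file becomes one row.
def rows_from_folder(data_dir, suffixes=(".pdb", ".cif")):
    """Return one {'structure': <file content>} row per structure file."""
    rows = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            rows.append({"structure": path.read_text()})
    return rows


# Hypothetical helper: shuffles whole rows, so a structure is never
# split between the train and test sets.
def train_test_split(rows, test_fraction=0.2, seed=0):
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Demo on a temporary folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1ABC", "2XYZ", "3DEF", "4GHI", "5JKL"):
        Path(tmp, f"{name}.pdb").write_text(f"HEADER    PROTEIN    {name}\nEND\n")
    demo_rows = rows_from_folder(tmp)

train_rows, test_rows = train_test_split(demo_rows)
print(len(demo_rows), len(train_rows), len(test_rows))
```

Because every row is a complete file, shuffling and splitting operate on whole structures, which is the property the ImageFolder-style layout is meant to provide.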

Key Components:

What gets visualized:

  • 3D atomic coordinates (x, y, z)
  • Chain structures
  • Residue information
  • Atom types and elements
  • Secondary structure (helices, sheets)
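As a rough illustration of where most of these fields live in a PDB file, the sketch below reads the fixed-width ATOM/HETATM records using the column layout from the wwPDB format description. `format_atom` and `parse_atoms` are hypothetical helpers written for this example; `format_atom` exists only to build demo data with placeholder occupancy and B-factor values. Secondary structure is not covered here, since it comes from separate HELIX and SHEET records.

```python
# Hypothetical helper: builds a fixed-width ATOM line for demo data only.
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, element):
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}{'':4}"
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}"
    )


# Hypothetical helper: slices follow the PDB columns (1-based in the spec,
# 0-based here), e.g. x is columns 31-38, so line[30:38].
def parse_atoms(pdb_text):
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "chain": line[21],
                "residue_number": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "element": line[76:78].strip(),
            })
    return atoms


pdb_text = "\n".join([
    "HEADER    PROTEIN",
    format_atom(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614, "N"),
    format_atom(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842, "C"),
    "END",
])
atoms = parse_atoms(pdb_text)
chains = sorted({atom["chain"] for atom in atoms})
print(len(atoms), chains, atoms[0]["xyz"])
```

In the viewer itself this parsing would be done by the JavaScript library; the sketch is only meant to show what information a single row already carries.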

Not applicable (1D sequence only):


Visualization Library Comparison

Library | Bundle size (minified) | Bundle size (gzipped) | License | Pros | Cons
3Dmol.js | 512 KB | ~150 KB | BSD-3 | Lightweight, easy integration, good docs | Fewer advanced features
NGL Viewer | 1.3 MB | ~350 KB | MIT | Excellent MMTF support, beautiful rendering | Moderate complexity
Mol* | 4.6 MB | ~1.3 MB | MIT | Industry standard, used by RCSB PDB, feature-rich | Heavy, complex
PDBe Molstar | 5.8 MB | ~1.6 MB | Apache 2.0 | EMBL-EBI maintained, simpler Mol* wrapper | Still very heavy

Bundle sizes verified by downloading actual distribution files from npm/CDN (January 2026)
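For a quick sense of scale, the gzipped sizes from the table can be expressed relative to 3Dmol.js. This is a back-of-the-envelope sketch: the figures are the approximate values quoted above, and it assumes 1 MB is treated as 1000 KB. `relative_sizes` is a hypothetical helper name.

```python
# Approximate gzipped sizes in KB, copied from the comparison table.
# Assumption: 1 MB is treated as 1000 KB for this rough comparison.
GZIPPED_KB = {
    "3Dmol.js": 150,
    "NGL Viewer": 350,
    "Mol*": 1300,
    "PDBe Molstar": 1600,
}


def relative_sizes(sizes, baseline="3Dmol.js"):
    """Return each library's size as a multiple of the baseline library."""
    base = sizes[baseline]
    return {name: round(size / base, 1) for name, size in sizes.items()}


for name, ratio in relative_sizes(GZIPPED_KB).items():
    print(f"{name}: {ratio}x")
```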


Recommendation: 3Dmol.js

Primary choice: 3Dmol.js

Rationale:

  1. Bundle size: ~150 KB gzipped, less than half the size of the next-lightest option (NGL Viewer, ~350 KB) and well suited to lazy loading
  2. Simple API: Easy to integrate with React/Next.js
  3. BSD-3 License: Compatible with HuggingFace licensing
  4. Active maintenance: Regular updates, good community support
  5. Format support: Native PDB and mmCIF parsing built-in
  6. Sufficient features: Rotation, zoom, style switching (cartoon, stick, sphere)
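To give a feel for how little glue code the integration might need, the sketch below generates a standalone HTML page that hands one row's structure text to 3Dmol.js. It is written in Python only to stay consistent with the examples above; the real integration would live in the viewer's frontend. `build_viewer_html` is a hypothetical helper, the CDN URL is assumed to be the one given in the 3Dmol.js documentation, and the `"cif"` format name for mmCIF is likewise an assumption to check against those docs.

```python
import json

# Assumption: script URL as given in the 3Dmol.js documentation;
# a real deployment would pin a specific version.
THREEDMOL_CDN = "https://3Dmol.org/build/3Dmol-min.js"


# Hypothetical helper: returns a self-contained demo page as a string.
def build_viewer_html(structure_text, fmt="pdb", style="cartoon"):
    if fmt not in ("pdb", "cif"):
        raise ValueError(f"unsupported format: {fmt}")
    if style not in ("cartoon", "stick", "sphere"):
        raise ValueError(f"unsupported style: {style}")
    # json.dumps yields a valid JavaScript string literal; "</" is escaped so
    # the structure text cannot close the surrounding <script> element early.
    data = json.dumps(structure_text).replace("</", "<\\/")
    return f"""<!DOCTYPE html>
<html>
<body>
<div id="viewer" style="width: 480px; height: 360px; position: relative;"></div>
<script src="{THREEDMOL_CDN}"></script>
<script>
  const viewer = $3Dmol.createViewer(document.getElementById("viewer"), {{ backgroundColor: "white" }});
  viewer.addModel({data}, "{fmt}");
  viewer.setStyle({{}}, {{ {style}: {{}} }});
  viewer.zoomTo();
  viewer.render();
</script>
</body>
</html>"""


page = build_viewer_html("HEADER    PROTEIN\nEND\n", fmt="pdb", style="stick")
print(len(page))
```

Writing `page` to a file and opening it in a browser would serve as the standalone proof of concept; switching `style` covers the cartoon, stick, and sphere modes listed above.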

Why not Mol*? As Georgia noted, Mol* is heavy (~1.3 MB gzipped). Although it is the industry standard and is used by RCSB PDB, it is overkill for a dataset preview, where users only need to check that the structure data looks correct.

Alternative for power users: If users need advanced features like density maps, ligand interactions, or sequence alignment overlay, consider PDBe Molstar as an optional "full viewer" mode.


Summary

Recommended approach:

  • Use 3Dmol.js (~150 KB gzipped) with lazy loading
  • Load the library only when a user views a PDB or mmCIF dataset
  • Simple integration, BSD-3 license, active community support
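One way to decide when the viewer bundle is needed is to sniff the content of the `structure` column. The sketch below is a heuristic under stated assumptions, not the viewer's actual logic: `detect_structure_format` and `needs_3d_viewer` are hypothetical names, mmCIF content is assumed to begin with a `data_` block, and PDB content is assumed to begin with one of a few common record names.

```python
# A few common PDB record names; this list is illustrative, not exhaustive.
PDB_RECORDS = {"HEADER", "TITLE", "REMARK", "CRYST1", "MODEL", "ATOM", "HETATM"}


# Hypothetical helper: looks only at the first meaningful line.
def detect_structure_format(text):
    """Return 'mmcif', 'pdb', or None for anything else (e.g. a 1D sequence)."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and mmCIF comment lines
        if stripped.startswith("data_"):
            return "mmcif"
        if line[:6].rstrip() in PDB_RECORDS:
            return "pdb"
        return None
    return None


# Hypothetical helper: True when a row should trigger loading the 3D viewer.
def needs_3d_viewer(row):
    value = row.get("structure")
    return isinstance(value, str) and detect_structure_format(value) is not None


print(needs_3d_viewer({"structure": "data_1ABC\n_entry.id 1ABC\n"}))
print(needs_3d_viewer({"sequence": "MKTAYIAKQR"}))
```

A check like this keeps the cost at zero for datasets that never show a structure, which is the point of lazy loading.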

Backend implementation (Updated):


Next Steps

  1. Get feedback on this proposal
  2. Create proof-of-concept in a standalone demo if needed
  3. Integrate into dataset-viewer once approach is approved

This PR proposes adding 3D protein structure visualization to the HuggingFace
Dataset Viewer using 3Dmol.js (~150 KB gzipped).

See PR body for full proposal details.
@behroozazarkhalili
Author

cc @georgia-hf - Following up on your question about protein visualization for the Dataset Viewer. This proposal recommends 3Dmol.js (~150 KB gzipped) as a lightweight alternative to Mol* (~1.3 MB gzipped).

Looking forward to your feedback!

@lhoestq
Member

lhoestq commented Jan 5, 2026

Exciting! cc @cfahlgren1 @severo for the Viewer part

For the datasets part, I'll leave my feedback in the PRs :)

@severo
Collaborator

severo commented Jan 5, 2026

I don't know the JS libraries, but indeed, the lighter the better, as we don't require advanced features.

@lhoestq
Member

lhoestq commented Jan 5, 2026

From a quick look at the PDB and mmCIF PRs I noticed that the dataset has one row = one atom. However I humbly believe that such datasets would be more practical to use if one row = one structure. This way each row is independent, which is practical in ML to perform train/test splits or dataset shuffling.

This would also make it easier to add labels and metadata for each structure, similar to what we already do for images. E.g. you could group them in folders named after a label, or add a metadata.parquet file with custom metadata per structure.

And this way in the Viewer it could show one 3D render per row.

What do you think?

@behroozazarkhalili
Author

@lhoestq @severo @georgia-hf I will be waiting for all your comments; then, I will start implementing the final plan.
