Proposal: Protein 3D Structure Visualization for Dataset Viewer #7930

behroozazarkhalili · 2026-01-03T03:30:01Z

Proposal: Protein 3D Structure Visualization for HuggingFace Dataset Viewer

Executive Summary

This proposal outlines adding 3D protein structure visualization to the HuggingFace Dataset Viewer, enabling users to interactively view PDB and mmCIF molecular structures directly within the dataset preview interface.

Data Type Support

Supported formats (from recent PRs):

PDB (PR #7926, "Add lightweight PDB (Protein Data Bank) file support"): Legacy fixed-width format for 3D macromolecular structures
mmCIF (PR #7925, "feat: Add mmCIF file support for macromolecular structures"): Modern standard format with full crystallographic data

What gets visualized:

3D atomic coordinates (x, y, z)
Chain structures
Residue information
Atom types and elements
Secondary structure (helices, sheets)
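
To make the list above concrete, here is a minimal sketch of how those fields sit in the fixed-width columns of a PDB ATOM/HETATM record. It is not the loader from PR #7926 and not how 3Dmol.js parses files internally; the names `PdbAtom`, `parseAtomRecord`, and `parsePdbAtoms` are hypothetical and exist only for this illustration.

```typescript
// One atom as read from a PDB ATOM/HETATM record (illustrative subset of fields).
interface PdbAtom {
  serial: number;
  name: string;     // atom name, e.g. "CA"
  resName: string;  // residue name, e.g. "MET"
  chainId: string;  // chain identifier, e.g. "A"
  resSeq: number;   // residue sequence number
  x: number;
  y: number;
  z: number;
  element: string;  // may be empty in older files that omit columns 77-78
}

// Parse a single line; returns null for anything that is not an atom record.
// Column ranges follow the PDB fixed-width layout (1-based columns in comments).
function parseAtomRecord(line: string): PdbAtom | null {
  const record = line.substring(0, 6);
  if (record !== 'ATOM  ' && record !== 'HETATM') return null;
  if (line.length < 54) return null; // coordinates end at column 54
  return {
    serial: parseInt(line.substring(6, 11), 10),   // cols 7-11
    name: line.substring(12, 16).trim(),           // cols 13-16
    resName: line.substring(17, 20).trim(),        // cols 18-20
    chainId: line.substring(21, 22).trim(),        // col 22
    resSeq: parseInt(line.substring(22, 26), 10),  // cols 23-26
    x: parseFloat(line.substring(30, 38)),         // cols 31-38
    y: parseFloat(line.substring(38, 46)),         // cols 39-46
    z: parseFloat(line.substring(46, 54)),         // cols 47-54
    element: line.substring(76, 78).trim(),        // cols 77-78
  };
}

// Collect every atom record in a PDB file's text.
function parsePdbAtoms(text: string): PdbAtom[] {
  const atoms: PdbAtom[] = [];
  for (const line of text.split(/\r?\n/)) {
    const atom = parseAtomRecord(line);
    if (atom !== null) atoms.push(atom);
  }
  return atoms;
}
```

Secondary structure is not part of the atom records; in PDB files it comes from separate HELIX and SHEET records, which this sketch does not read.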

Not applicable (1D sequence only):

FASTA (PR #7923, "feat(fasta): add lightweight FASTA file format support"): text sequences, no 3D coordinates
FASTQ (PR #7924, "Add lightweight FASTQ file format support"): sequences with quality scores, no 3D coordinates

Visualization Library Comparison

Library	Bundle Size (minified)	Bundle Size (gzipped)	License	Pros	Cons
3Dmol.js	512 KB	~150 KB	BSD-3	Lightweight, easy integration, good docs	Fewer advanced features
NGL Viewer	1.3 MB	~350 KB	MIT	Excellent MMTF support, beautiful rendering	Moderate complexity
Mol*	4.6 MB	~1.3 MB	MIT	Industry standard, used by RCSB PDB, feature-rich	Heavy, complex
PDBe Molstar	5.8 MB	~1.6 MB	Apache 2.0	EMBL-EBI maintained, simpler Mol* wrapper	Still very heavy

Bundle sizes verified by downloading actual distribution files from npm/CDN (January 2026)
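
The relative sizes can be checked with a few lines of arithmetic. The sketch below takes the gzipped figures from the table as-is and assumes 1 MB = 1000 KB; `LIBRARIES` and `relativeToSmallest` are names made up for this example.

```typescript
interface BundleSize {
  name: string;
  gzippedKB: number; // approximate gzipped size from the comparison table
}

// Figures copied from the table above; MB values converted assuming 1 MB = 1000 KB.
const LIBRARIES: BundleSize[] = [
  { name: '3Dmol.js', gzippedKB: 150 },
  { name: 'NGL Viewer', gzippedKB: 350 },
  { name: 'Mol*', gzippedKB: 1300 },
  { name: 'PDBe Molstar', gzippedKB: 1600 },
];

// Express each library's gzipped size as a multiple of the smallest one,
// rounded to one decimal place.
function relativeToSmallest(libs: BundleSize[]): { name: string; ratio: number }[] {
  const smallest = libs.reduce((min, lib) => Math.min(min, lib.gzippedKB), Infinity);
  return libs.map((lib) => ({
    name: lib.name,
    ratio: Math.round((lib.gzippedKB / smallest) * 10) / 10,
  }));
}

for (const { name, ratio } of relativeToSmallest(LIBRARIES)) {
  console.log(`${name}: ${ratio}x`);
}
```

With these inputs Mol* comes out at roughly 8.7 times the gzipped size of 3Dmol.js, which is where the "nearly 9x" figure later in this proposal comes from.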

Recommendation: 3Dmol.js

Primary choice: 3Dmol.js

Rationale:

Bundle size: ~150 KB gzipped - the lightest option by far, ideal for lazy loading
Simple API: Easy to integrate with React/Next.js
BSD-3 License: Compatible with HuggingFace licensing
Active maintenance: Regular updates, good community support
Format support: Native PDB and mmCIF parsing built-in
Sufficient features: Rotation, zoom, style switching (cartoon, stick, sphere)

Why not Mol*? As Georgia noted, Mol* is heavy (~1.3 MB gzipped). Although it is the industry standard and is used by RCSB PDB, it is overkill for a dataset preview where users just need to verify that the structure data looks correct.

Alternative for power users: If users need advanced features like density maps, ligand interactions, or sequence alignment overlay, consider PDBe Molstar as an optional "full viewer" mode.

Architecture for Dataset Viewer Integration

Lazy Loading Pattern (React/Next.js)

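// ProteinViewer.tsx
import dynamic from 'next/dynamic';
// ProteinViewerSkeleton and SequenceViewer are assumed to be existing
// components imported from elsewhere in the viewer codebase.

// Load the WebGL viewer (and 3Dmol.js with it) only when it is first rendered.
const Protein3DViewer = dynamic(
  () => import('./Protein3DViewerCore'),
  {
    ssr: false,  // WebGL requires client-side only
    loading: () => <ProteinViewerSkeleton />
  }
);

type ProteinViewerProps = { data: string; format: string };

export function ProteinViewer({ data, format }: ProteinViewerProps) {
  // Only render the 3D viewer when a PDB/mmCIF format is detected
  if (!['pdb', 'mmcif', 'cif'].includes(format.toLowerCase())) {
    return <SequenceViewer data={data} />;
  }

  return <Protein3DViewer structureData={data} format={format} />;
}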

Core Viewer Component (3Dmol.js)

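// Protein3DViewerCore.tsx
import { useEffect, useRef } from 'react';
import * as $3Dmol from '3dmol';

type Props = { structureData: string; format: string };

export default function Protein3DViewerCore({ structureData, format }: Props) {
  const viewerRef = useRef<ReturnType<typeof $3Dmol.createViewer> | null>(null);
  const containerRef = useRef<HTMLDivElement>(null);

  useEffect(() => {
    const container = containerRef.current;
    if (!container) return;

    // Initialize viewer (3Dmol.js appends its canvas to the container)
    const viewer = $3Dmol.createViewer(container, {
      backgroundColor: 'white',
      antialias: true,
    });
    viewerRef.current = viewer;

    // Add structure; `format` must be a parser name 3Dmol.js recognizes (e.g. 'pdb', 'cif')
    viewer.addModel(structureData, format);
    viewer.setStyle({}, { cartoon: { color: 'spectrum' } });
    viewer.zoomTo();
    viewer.render();

    return () => {
      // Drop the scene and the canvas so a re-run of this effect does not stack viewers
      viewer.clear();
      container.replaceChildren();
      viewerRef.current = null;
    };
  }, [structureData, format]);

  return (
    <div
      ref={containerRef}
      style={{ width: '100%', height: '400px', position: 'relative' }}
    />
  );
}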

Integration Points in Dataset Viewer

File Type Detection

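// Detect protein structure formats by file extension
const PROTEIN_3D_FORMATS = ['pdb', 'ent', 'cif', 'mmcif'];

// Note: datasetFeatures is currently unused.
function getViewerType(filename, datasetFeatures) {
  // Take the text after the last dot; a name with no dot has no extension,
  // so a file literally named "pdb" must not match.
  const dot = filename.lastIndexOf('.');
  const ext = dot > 0 ? filename.slice(dot + 1).toLowerCase() : '';

  if (PROTEIN_3D_FORMATS.includes(ext)) {
    return 'protein-3d';
  }
  // ... other format checks
}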
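
The detected extension is not always the string the parser expects. As a sketch only: the helper below (hypothetical name `parserFormatFor`) maps a filename to a parser name, on the assumption that 3Dmol.js accepts 'pdb' and 'cif' as the format argument of `addModel`. It treats `.ent` as PDB and looks through a trailing `.gz`, in which case the caller would still need to decompress the content before handing it to the viewer. Note that `.cif` is also used for small-molecule CIF files, so extension matching alone cannot guarantee a macromolecular structure.

```typescript
type ParserFormat = 'pdb' | 'cif';

// Extension -> parser name. The parser names are an assumption about 3Dmol.js.
const EXTENSION_TO_PARSER = new Map<string, ParserFormat>([
  ['pdb', 'pdb'],
  ['ent', 'pdb'],   // legacy PDB extension
  ['cif', 'cif'],
  ['mmcif', 'cif'],
]);

// Return the parser name for a filename, or null if it is not a 3D structure file.
function parserFormatFor(filename: string): ParserFormat | null {
  const parts = filename.toLowerCase().split('.');
  if (parts.length < 2 || parts[0] === '') return null; // no extension, or a dotfile

  let ext = parts[parts.length - 1];
  // For compressed files such as "1abc.pdb.gz", report the inner format.
  if (ext === 'gz' && parts.length > 2) {
    ext = parts[parts.length - 2];
  }
  if (ext === undefined) return null;
  return EXTENSION_TO_PARSER.get(ext) ?? null;
}

console.log(parserFormatFor('pdb1abc.ent'));   // pdb
console.log(parserFormatFor('1abc.mmCIF'));    // cif
console.log(parserFormatFor('reads.fastq'));   // null
```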

Data Flow

Dataset Row → Format Detection → Lazy Load Viewer → Render 3D Structure
     ↓
  PDB/mmCIF text → 3Dmol.js parser → WebGL canvas → User interaction

UI/UX Considerations

Viewer Controls

Rotate: Mouse drag
Zoom: Scroll wheel
Style toggle: Cartoon / Stick / Sphere / Surface
Reset view button
Full-screen toggle

Style Dropdown Options

const STYLE_OPTIONS = [
  { label: 'Cartoon (ribbon)', value: 'cartoon' },
  { label: 'Sticks', value: 'stick' },
  { label: 'Spheres (CPK)', value: 'sphere' },
  { label: 'Line', value: 'line' },
  // Note: in 3Dmol.js a surface is added with viewer.addSurface(), not through setStyle()
  { label: 'Surface', value: 'surface' },
];
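
Because the surface option is handled differently from the atom styles in 3Dmol.js, the dropdown value cannot be passed straight to `setStyle`. One possible way to keep that decision in a single place is sketched below; `renderPlanFor` and `RenderPlan` are hypothetical names, and the opacity value is an arbitrary placeholder.

```typescript
type StyleValue = 'cartoon' | 'stick' | 'sphere' | 'line' | 'surface';

// Either a style spec to pass to setStyle, or a request to add a surface.
type RenderPlan =
  | { kind: 'style'; spec: Record<string, Record<string, unknown>> }
  | { kind: 'surface'; opacity: number };

function renderPlanFor(value: StyleValue): RenderPlan {
  switch (value) {
    case 'cartoon':
      // Same spec the core viewer component uses by default.
      return { kind: 'style', spec: { cartoon: { color: 'spectrum' } } };
    case 'stick':
    case 'sphere':
    case 'line':
      // An empty options object requests the library's defaults for that style.
      return { kind: 'style', spec: { [value]: {} } };
    case 'surface':
      return { kind: 'surface', opacity: 0.85 };
  }
}

// Hypothetical wiring inside the viewer component (not executed here):
//   const plan = renderPlanFor(selected);
//   viewer.removeAllSurfaces();
//   if (plan.kind === 'style') viewer.setStyle({}, plan.spec);
//   else viewer.addSurface($3Dmol.SurfaceType.VDW, { opacity: plan.opacity });
//   viewer.render();
```

Keeping the mapping as a pure function means it can be unit-tested without a WebGL context.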

Loading State

Skeleton placeholder (400px height)
"Loading 3D viewer..." text
Progressive: Show 2D preview while 3D loads

Implementation Phases

Phase 1: Basic Viewer (MVP)

Add 3Dmol.js dependency (~150 KB gzipped)
Create ProteinViewer component with lazy loading
Support PDB format display
Basic rotation/zoom controls
Single style (cartoon)

Phase 2: Enhanced Features

mmCIF format support
Style switching dropdown
Full-screen mode
Chain coloring options

Phase 3: Advanced (Optional)

Atom selection/highlighting
Distance measurements
Export snapshot as PNG
Consider PDBe Molstar for power users

Bundle Impact Analysis

Without lazy loading: ~150 KB (gzipped) added to the initial bundle (acceptable but not ideal)

With lazy loading:

Initial load: 0 KB additional
On-demand: ~150 KB when viewing PDB/mmCIF
Cached after first load
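
Outside of `next/dynamic`, the "load once, then reuse" behaviour can be sketched as a small memoized loader. This is an illustration under assumptions, not the Dataset Viewer's code: `createLazyLoader` is a hypothetical helper, and the commented usage assumes the bundler supports dynamic `import()`.

```typescript
// Wrap an async load function so it runs at most once; later calls reuse the
// same promise. A failed load clears the cache so the next call can retry.
function createLazyLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = load().catch((err: unknown) => {
        pending = null;
        throw err;
      });
    }
    return pending;
  };
}

// Hypothetical usage (not executed here):
//   const load3Dmol = createLazyLoader(() => import('3dmol'));
//   const $3Dmol = await load3Dmol(); // network fetch only on the first call
```

The browser's HTTP cache covers repeat visits; this helper only avoids repeated loads within a single page session.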

Comparison with other viewers:

Viewer Type	Typical Bundle Size
PDF viewer	~500 KB
Audio player	~50 KB
Image gallery	~100 KB
Protein 3D (3Dmol.js)	~150 KB

The protein viewer is comparable to other specialized viewers and well within acceptable limits for lazy-loaded content.

Alternative Approach: CDN Loading

If bundle size is critical:

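// Load from CDN on-demand
let pending3Dmol = null;  // shared promise so concurrent callers add only one <script>

const load3Dmol = () => {
  if (window.$3Dmol) return Promise.resolve(window.$3Dmol);
  if (pending3Dmol) return pending3Dmol;

  pending3Dmol = new Promise((resolve, reject) => {
    const script = document.createElement('script');
    script.src = 'https://3dmol.csb.pitt.edu/build/3Dmol-min.js';
    script.onload = () => resolve(window.$3Dmol);
    script.onerror = () => {
      pending3Dmol = null;  // allow a retry after a failed download
      reject(new Error('Failed to load 3Dmol.js from CDN'));
    };
    document.head.appendChild(script);
  });
  return pending3Dmol;
};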

Pros: Zero bundle impact
Cons: External dependency, potential availability issues

Files to Modify (in dataset-viewer repo)

Since dataset-viewer is closed-source, this proposal should be shared with the HuggingFace team. They would need to:

package.json - Add 3dmol dependency
Create components/viewers/ProteinViewer.tsx
Create components/viewers/Protein3DViewerCore.tsx
Update viewer routing logic to detect PDB/mmCIF
Add viewer style controls component

Summary

Recommended approach:

Use 3Dmol.js (~150 KB gzipped) with lazy loading
Only loads when user views PDB/mmCIF datasets
Simple integration, BSD-3 license, active community support

Why 3Dmol.js over Mol*?

3Dmol.js: ~150 KB gzipped
Mol*: ~1.3 MB gzipped (nearly 9x heavier)

Key insight: The PDB and mmCIF loaders we implemented (PRs #7926, #7925) already extract the 3D coordinates, but the viewer does not depend on that parsed output. 3Dmol.js parses the file itself, so the viewer only needs the raw file content of each structure.

Next Steps

Get feedback on this proposal
Create proof-of-concept in a standalone demo if needed
Integrate into dataset-viewer once approach is approved

This PR proposes adding 3D protein structure visualization to the HuggingFace Dataset Viewer using 3Dmol.js (~150KB gzipped). See PR body for full proposal details.

behroozazarkhalili · 2026-01-03T03:30:14Z

cc @georgia-hf - Following up on your question about protein visualization for the Dataset Viewer. This proposal recommends 3Dmol.js (~150KB gzipped) as a lightweight alternative to Mol* (~1.3MB gzipped).

Looking forward to your feedback!

lhoestq · 2026-01-05T14:34:31Z

Exciting! cc @cfahlgren1 @severo for the Viewer part

For the datasets part I'll leave my feedback in the PRs :)

severo · 2026-01-05T14:41:51Z

I don't know the JS libraries, but indeed, the lighter the better, as we don't require advanced features.

lhoestq · 2026-01-05T14:50:03Z

From a quick look at the PDB and mmCIF PRs I noticed that the dataset has one row = one atom. However I humbly believe that such datasets would be more practical to use if one row = one structure. This way each row is independent, which is practical in ML to perform train/test splits or dataset shuffling.

This would also make it easier to add labels and metadata for each structure, similar to what we already do for images. E.g. you could group them per folder named after a label, or you can have a metadata.parquet file to add custom metadata per structure.

And this way in the Viewer it could show one 3D render per row.

What do you think?

behroozazarkhalili · 2026-01-05T16:00:45Z

@lhoestq @severo @georgia-hf I will be waiting for all your comments; then, I will start implementing the final plan.

Proposal: Protein 3D Structure Visualization for Dataset Viewer

2bbbd22

