@behroozazarkhalili

Proposal: Protein 3D Structure Visualization for HuggingFace Dataset Viewer

Executive Summary

This proposal outlines adding 3D protein structure visualization to the HuggingFace Dataset Viewer, enabling users to interactively view PDB and mmCIF molecular structures directly within the dataset preview interface.


Data Type Support

Supported formats (from recent PRs):

What gets visualized:

  • 3D atomic coordinates (x, y, z)
  • Chain structures
  • Residue information
  • Atom types and elements
  • Secondary structure (helices, sheets)

Not applicable (1D sequence only):


Visualization Library Comparison

| Library | Bundle Size (minified) | Bundle Size (gzipped) | License | Pros | Cons |
|---|---|---|---|---|---|
| 3Dmol.js | 512 KB | ~150 KB | BSD-3 | Lightweight, easy integration, good docs | Fewer advanced features |
| NGL Viewer | 1.3 MB | ~350 KB | MIT | Excellent MMTF support, beautiful rendering | Moderate complexity |
| Mol* | 4.6 MB | ~1.3 MB | MIT | Industry standard, used by RCSB PDB, feature-rich | Heavy, complex |
| PDBe Molstar | 5.8 MB | ~1.6 MB | Apache 2.0 | EMBL-EBI maintained, simpler Mol* wrapper | Still very heavy |

Bundle sizes verified by downloading actual distribution files from npm/CDN (January 2026)
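
For a quick sense of scale, the gzipped figures in the table can be turned into ratios against 3Dmol.js. This is only a sketch of the arithmetic: the helper name `sizeRatios` and the 1 MB = 1000 KB conversion are assumptions made for illustration, and the inputs are the approximate values from the table.

```typescript
// Approximate gzipped sizes in KB, copied from the comparison table
// (assumption: 1 MB is treated as 1000 KB).
const gzippedKB: Record<string, number> = {
  '3Dmol.js': 150,
  'NGL Viewer': 350,
  'Mol*': 1300,
  'PDBe Molstar': 1600,
};

// Ratio of each library's gzipped size to a baseline library, rounded to one decimal.
function sizeRatios(sizes: Record<string, number>, baseline: string): Record<string, number> {
  const base = sizes[baseline];
  const result: Record<string, number> = {};
  for (const name of Object.keys(sizes)) {
    result[name] = Math.round((sizes[name] / base) * 10) / 10;
  }
  return result;
}

const ratios = sizeRatios(gzippedKB, '3Dmol.js');
console.log(ratios);
```

With these inputs Mol* comes out at roughly 8.7 times the gzipped size of 3Dmol.js, which is where the "nearly 9x" figure in the summary comes from.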


Recommendation: 3Dmol.js

Primary choice: 3Dmol.js

Rationale:

  1. Bundle size: ~150 KB gzipped - the lightest option by far, ideal for lazy loading
  2. Simple API: Easy to integrate with React/Next.js
  3. BSD-3 License: Compatible with HuggingFace licensing
  4. Active maintenance: Regular updates, good community support
  5. Format support: Native PDB and mmCIF parsing built-in
  6. Sufficient features: Rotation, zoom, style switching (cartoon, stick, sphere)

Why not Mol*? As Georgia noted, Mol* is heavy (~1.3 MB gzipped). While it's the industry standard for RCSB PDB, it's overkill for a dataset preview where users just need to verify that the structure data looks correct.

Alternative for power users: If users need advanced features like density maps, ligand interactions, or sequence alignment overlay, consider PDBe Molstar as an optional "full viewer" mode.


Architecture for Dataset Viewer Integration

Lazy Loading Pattern (React/Next.js)

// ProteinViewer.tsx
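import dynamic from 'next/dynamic';
// ProteinViewerSkeleton and SequenceViewer are assumed to be existing components
// imported from elsewhere in the viewer codebase.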

const Protein3DViewer = dynamic(
  () => import('./Protein3DViewerCore'),
  {
    ssr: false,  // WebGL requires client-side only
    loading: () => <ProteinViewerSkeleton />
  }
);

export function ProteinViewer({ data, format }) {
  // Only render when PDB/mmCIF format detected
  if (!['pdb', 'mmcif', 'cif'].includes(format)) {
    return <SequenceViewer data={data} />;
  }

  return <Protein3DViewer structureData={data} format={format} />;
}

Core Viewer Component (3Dmol.js)

// Protein3DViewerCore.tsx
import { useEffect, useRef } from 'react';
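// Namespace import: createViewer is used as a member of the module namespace,
// since the package is not assumed to provide a default export.
import * as $3Dmol from '3dmol/build/3Dmol.js';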

export default function Protein3DViewerCore({ structureData, format }) {
  const viewerRef = useRef(null);
  const containerRef = useRef(null);

  useEffect(() => {
    if (!containerRef.current) return;

    // Initialize viewer
    const viewer = $3Dmol.createViewer(containerRef.current, {
      backgroundColor: 'white',
      antialias: true,
    });
    viewerRef.current = viewer;

    // Add structure
    viewer.addModel(structureData, format);
    viewer.setStyle({}, { cartoon: { color: 'spectrum' } });
    viewer.zoomTo();
    viewer.render();

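    // Clear the models and drop the stale reference on unmount or when inputs change
    return () => {
      viewer.clear();
      viewerRef.current = null;
    };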
  }, [structureData, format]);

  return (
    <div
      ref={containerRef}
      style={{ width: '100%', height: '400px', position: 'relative' }}
    />
  );
}
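
One detail the component above leaves open is that the `format` prop may carry extension-like names such as `ent` or `mmcif`, while the parser is most safely addressed by a canonical name. The sketch below normalizes the value before it would be handed to `viewer.addModel`. The helper name `toParserFormat` is hypothetical, and the assumption that `'pdb'` and `'cif'` are the canonical parser names should be checked against the 3Dmol.js documentation.

```typescript
// Canonical parser names this sketch assumes 3Dmol.js accepts for these two formats.
type ParserFormat = 'pdb' | 'cif';

// Map an extension-like format string to a canonical parser name.
// Returns null for anything that is not a recognized 3D structure format.
function toParserFormat(format: string): ParserFormat | null {
  switch (format.trim().toLowerCase()) {
    case 'pdb':
    case 'ent': // .ent files from the PDB archive contain PDB-format text
      return 'pdb';
    case 'cif':
    case 'mmcif':
      return 'cif';
    default:
      return null;
  }
}
```

A caller could skip rendering, or fall back to a text view, when the helper returns `null` rather than passing an unknown format string to the viewer.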

Integration Points in Dataset Viewer

File Type Detection

// Detect protein structure formats
const PROTEIN_3D_FORMATS = ['pdb', 'ent', 'cif', 'mmcif'];

function getViewerType(filename, datasetFeatures) {
  const ext = filename.split('.').pop().toLowerCase();

  if (PROTEIN_3D_FORMATS.includes(ext)) {
    return 'protein-3d';
  }
  // ... other format checks
}
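
The extension check above treats a dotless filename as its own extension (a file literally named `pdb` would match) and does not look past a compression suffix. A slightly more defensive variant is sketched below. The helper name `structureExtension` and the decision to recognize `.gz` are assumptions for illustration, not part of the proposal.

```typescript
const PROTEIN_3D_EXTENSIONS = new Set(['pdb', 'ent', 'cif', 'mmcif']);
// Assumption: gzip is the only compression suffix worth looking through.
const COMPRESSION_SUFFIXES = new Set(['gz']);

// Return the structure-file extension of a path, or null if it is not a
// recognized 3D structure file. Handles directories, upper case, dotless
// names, and a trailing compression suffix such as ".pdb.gz".
function structureExtension(filename: string): string | null {
  const base = filename.split('/').pop() ?? '';
  const parts = base.toLowerCase().split('.');
  if (parts.length < 2) return null; // no extension at all
  let ext = parts.pop() as string;
  if (COMPRESSION_SUFFIXES.has(ext) && parts.length >= 2) {
    ext = parts.pop() as string; // look at the extension before ".gz"
  }
  return PROTEIN_3D_EXTENSIONS.has(ext) ? ext : null;
}
```

Note that a compressed file would still need to be decompressed before its text is passed to the viewer; this helper only decides which viewer applies.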

Data Flow

Dataset Row → Format Detection → Lazy Load Viewer → Render 3D Structure
     ↓
  PDB/mmCIF text → 3Dmol.js parser → WebGL canvas → User interaction

UI/UX Considerations

Viewer Controls

  • Rotate: Mouse drag
  • Zoom: Scroll wheel
  • Style toggle: Cartoon / Stick / Sphere / Surface
  • Reset view button
  • Full-screen toggle

Style Dropdown Options

const STYLE_OPTIONS = [
  { label: 'Cartoon (ribbon)', value: 'cartoon' },
  { label: 'Sticks', value: 'stick' },
  { label: 'Spheres (CPK)', value: 'sphere' },
  { label: 'Line', value: 'line' },
  { label: 'Surface', value: 'surface' },
];
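
These dropdown values do not all map to the same viewer call. In 3Dmol.js, cartoon, stick, sphere and line are atom styles applied with `setStyle`, whereas a surface is added separately with `addSurface`. The sketch below makes that split explicit as a pure function; the names `RenderAction` and `renderActionFor` are hypothetical.

```typescript
type StyleValue = 'cartoon' | 'stick' | 'sphere' | 'line' | 'surface';

// What the viewer should do for a given dropdown value.
type RenderAction =
  | { kind: 'style'; spec: Record<string, object> } // pass spec to setStyle
  | { kind: 'surface' };                            // add a surface instead

function renderActionFor(value: StyleValue): RenderAction {
  if (value === 'surface') {
    return { kind: 'surface' };
  }
  if (value === 'cartoon') {
    // Same spectrum coloring as the default in the core viewer component.
    return { kind: 'style', spec: { cartoon: { color: 'spectrum' } } };
  }
  // stick, sphere and line use their default options.
  return { kind: 'style', spec: { [value]: {} } };
}
```

A caller would then do something like `viewer.setStyle({}, action.spec)` for the `style` case and `viewer.addSurface($3Dmol.SurfaceType.VDW, { opacity: 0.85 })` for the `surface` case, followed by `viewer.render()`. The opacity value is illustrative, and any existing surface would likely need to be removed when switching back to an atom style.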

Loading State

  • Skeleton placeholder (400px height)
  • "Loading 3D viewer..." text
  • Progressive: Show 2D preview while 3D loads

Implementation Phases

Phase 1: Basic Viewer (MVP)

  • Add 3Dmol.js dependency (~150 KB gzipped)
  • Create ProteinViewer component with lazy loading
  • Support PDB format display
  • Basic rotation/zoom controls
  • Single style (cartoon)

Phase 2: Enhanced Features

  • mmCIF format support
  • Style switching dropdown
  • Full-screen mode
  • Chain coloring options

Phase 3: Advanced (Optional)

  • Atom selection/highlighting
  • Distance measurements
  • Export snapshot as PNG
  • Consider PDBe Molstar for power users

Bundle Impact Analysis

Without lazy loading: +150 KB to initial bundle (acceptable but not ideal)

With lazy loading:

  • Initial load: 0 KB additional
  • On-demand: ~150 KB when viewing PDB/mmCIF
  • Cached after first load

Comparison with other viewers:

| Viewer Type | Typical Bundle Size |
|---|---|
| PDF viewer | ~500 KB |
| Audio player | ~50 KB |
| Image gallery | ~100 KB |
| Protein 3D (3Dmol.js) | ~150 KB |

The protein viewer is comparable to other specialized viewers and well within acceptable limits for lazy-loaded content.


Alternative Approach: CDN Loading

If bundle size is critical:

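// Load from CDN on-demand
const load3Dmol = async () => {
  if (window.$3Dmol) return window.$3Dmol;

  return new Promise((resolve, reject) => {
    const script = document.createElement('script');
    script.src = 'https://3dmol.csb.pitt.edu/build/3Dmol-min.js';
    script.onload = () => resolve(window.$3Dmol);
    // Reject on network/CDN failure so callers can show a fallback instead of waiting forever
    script.onerror = () => reject(new Error('Failed to load 3Dmol.js from CDN'));
    document.head.appendChild(script);
  });
};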

Pros: Zero bundle impact
Cons: External dependency, potential availability issues
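
If several viewer instances mount at the same time, each call to `load3Dmol` could append its own script tag before the first one finishes loading. One way to avoid that is to share the in-flight promise. The wrapper below is a generic sketch; the name `loadOnce` is hypothetical and nothing in it is specific to 3Dmol.js.

```typescript
// Wrap an async loader so that concurrent callers share one in-flight promise.
// After a failure the cached promise is dropped so a later call can retry.
function loadOnce<T>(loader: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight !== null) return inFlight;
    const attempt = loader().catch((err) => {
      inFlight = null; // allow a retry after a failed load
      throw err;
    });
    inFlight = attempt;
    return attempt;
  };
}
```

Usage would look like `const load3DmolOnce = loadOnce(load3Dmol);`, with components awaiting `load3DmolOnce()` instead of calling the loader directly.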


Files to Modify (in dataset-viewer repo)

Since dataset-viewer is closed-source, this proposal should be shared with the HuggingFace team. They would need to:

  1. package.json - Add 3dmol dependency
  2. Create components/viewers/ProteinViewer.tsx
  3. Create components/viewers/Protein3DViewerCore.tsx
  4. Update viewer routing logic to detect PDB/mmCIF
  5. Add viewer style controls component

Summary

Recommended approach:

  • Use 3Dmol.js (~150 KB gzipped) with lazy loading
  • Only loads when user views PDB/mmCIF datasets
  • Simple integration, BSD-3 license, active community support

Why 3Dmol.js over Mol*?

  • 3Dmol.js: ~150 KB gzipped
  • Mol*: ~1.3 MB gzipped (nearly 9x heavier)

Key insight: The PDB and mmCIF loaders we implemented (PRs #7925, #7926) extract the 3D coordinates needed for visualization. The viewer just needs to consume the raw file content.


Next Steps

  1. Get feedback on this proposal
  2. Create proof-of-concept in a standalone demo if needed
  3. Integrate into dataset-viewer once approach is approved

This PR proposes adding 3D protein structure visualization to the HuggingFace
Dataset Viewer using 3Dmol.js (~150KB gzipped).

See PR body for full proposal details.
@behroozazarkhalili
Author

cc @georgia-hf - Following up on your question about protein visualization for the Dataset Viewer. This proposal recommends 3Dmol.js (~150KB gzipped) as a lightweight alternative to Mol* (~1.3MB gzipped).

Looking forward to your feedback!

@lhoestq
Member

lhoestq commented Jan 5, 2026

Exciting! cc @cfahlgren1 @severo for the Viewer part

For the datasets part I'll leave my feedback in the PRs :)

@severo
Collaborator

severo commented Jan 5, 2026

I don't know the JS libraries, but indeed, the lighter the better, as we don't require advanced features.

@lhoestq
Member

lhoestq commented Jan 5, 2026

From a quick look at the PDB and mmCIF PRs I noticed that the dataset has one row = one atom. However I humbly believe that such datasets would be more practical to use if one row = one structure. This way each row is independent, which is practical in ML to perform train/test splits or dataset shuffling.

This would also make it easier to add labels and metadata for each structure, similar to what we already do for images. E.g. you could group them per folder named after a label, or you could have a metadata.parquet file to add custom metadata per structure.

And this way in the Viewer it could show one 3D render per row.

What do you think?

@behroozazarkhalili
Author

@lhoestq @severo @georgia-hf I will be waiting for all your comments; then, I will start implementing the final plan.
