@d4dassistant

Summary

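Created a new D4D datasheet for Cell Maps for Artificial Intelligence (CM4AI), based on documentation from the CM4AI website (https://cm4ai.org), the UVA Dataverse collection (https://dataverse.lib.virginia.edu/dataverse/CM4AI), and the CM4AI publications list (https://cm4ai.org/publications/).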

Files Added

  • data/extracted_by_column/CM4AI/cm4ai_comprehensive_d4d.yaml - D4D YAML datasheet
  • src/html/output/D4D_-_CM4AI_Dataverse_v3_human_readable.html - HTML preview

Validation

  • ✅ Schema validation passed
  • ✅ Required fields populated (id, name)
  • ✅ YAML syntax valid
  • ✅ HTML preview generated

Key Metadata Extracted

  • Dataset ID: cm4ai-cell-maps
  • Dataset Name: Cell Maps for Artificial Intelligence
  • Purpose: Map the spatiotemporal architecture of human cells for interpretable genotype-phenotype learning using multimodal approaches (proteomics, imaging, CRISPR perturbation)
  • Composition: 53,788 immunofluorescent images, 1,374 protein interactions, 1,792 proteins investigated, 11,739 genes targeted
  • Data Volume: 22.7 TB total
  • Distribution: University of Virginia Dataverse with quarterly releases (CC BY-NC-SA 4.0)
  • AI-Ready: RO-Crate format with FAIRSCAPE metadata framework and Evidence Graph Ontology provenance
  • Funding: NIH Bridge2AI grant 1OT2OD032742-01
  • Institutions: UCSD, UCSF, Stanford, UVA, Yale, UT Austin, UAB, Simon Fraser University, Hastings Center

Dataset Features

  • Multimodal data: Affinity purification mass spectrometry (AP-MS), immunofluorescence imaging, CRISPR/Cas9 perturbation screens, SEC-MS
  • Cell types: MDA-MB-468 breast cancer cells, iPSCs, iPSC-derived NPCs, neurons, cardiomyocytes
  • Treatment conditions: Untreated, paclitaxel-treated, vorinostat-treated
  • Target proteins: 100 chromatin modifiers + 100 metabolic enzymes (cancer, neuropsychiatric, cardiac disorders)
  • Standards: FAIR principles, Schema.org vocabulary, JSON-LD serialization, W3C PROV provenance
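
To make the "Schema.org vocabulary, JSON-LD serialization" item concrete, here is a minimal sketch of what a Schema.org `Dataset` record serialized as JSON-LD can look like, using only values stated in this PR. The choice of properties and the `record` variable are illustrative assumptions. This is not the actual RO-Crate or FAIRSCAPE metadata shipped with CM4AI.

```python
import json

# Illustrative only: a hand-written Schema.org Dataset record.
# The property selection is an assumption, not the real CM4AI RO-Crate metadata.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "identifier": "cm4ai-cell-maps",
    "name": "Cell Maps for Artificial Intelligence",
    "url": "https://dataverse.lib.virginia.edu/dataverse/CM4AI",
    # Canonical URL for the CC BY-NC-SA 4.0 license named in this PR.
    "license": "https://creativecommons.org/licenses/by-nc-sa/4.0/",
}

# JSON-LD is plain JSON, so the standard library serializer is sufficient.
serialized = json.dumps(record, indent=2)
print(serialized)
```

Because JSON-LD is ordinary JSON with reserved `@` keys, any JSON parser can read the record back, which makes quick spot checks straightforward.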

How to Review

  1. View HTML preview: Open src/html/output/D4D_-_CM4AI_Dataverse_v3_human_readable.html in a browser for human-readable format
  2. Check YAML: Review data/extracted_by_column/CM4AI/cm4ai_comprehensive_d4d.yaml for completeness and accuracy
  3. Validate sources: Compare against the original documentation at https://cm4ai.org, https://dataverse.lib.virginia.edu/dataverse/CM4AI, and https://cm4ai.org/publications/
  4. Verify required fields: Ensure id and name are present and accurate
  5. Check optional fields: Confirm populated fields are accurate; additional fields can be added based on schema
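
Step 4 can be spot-checked with a few lines of Python. The sketch below assumes the YAML file has already been parsed into a dictionary (for example with a YAML library of your choice). The function name `missing_required_fields` and the in-memory `datasheet` are hypothetical stand-ins, and this is not the D4D schema validator itself.

```python
from typing import Any, Mapping, Sequence

# The two required fields named in this PR.
REQUIRED_FIELDS = ("id", "name")


def missing_required_fields(
    datasheet: Mapping[str, Any],
    required: Sequence[str] = REQUIRED_FIELDS,
) -> list[str]:
    """Return the required keys that are absent, None, or blank strings."""
    missing = []
    for key in required:
        value = datasheet.get(key)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(key)
    return missing


# Hypothetical stand-in for the parsed cm4ai_comprehensive_d4d.yaml,
# populated with the id and name reported in this PR.
datasheet = {
    "id": "cm4ai-cell-maps",
    "name": "Cell Maps for Artificial Intelligence",
}

print(missing_required_fields(datasheet))
```

An empty list means both required fields are present and non-blank. It says nothing about whether their values are accurate, which still needs a manual check against the sources.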

Notes

  • This is a minimal valid datasheet focusing on core required fields and key descriptive metadata
  • Additional D4D schema fields (purposes, tasks, instances, subsets, etc.) can be added in future updates as the schema structure supports more complex nested objects
  • All information was extracted directly from official CM4AI sources
  • Dataset uses cell lines only (no human subjects data)
  • Quarterly data releases ongoing through project completion

Publications Referenced

Key publications that describe the dataset:

  • Clark et al. (2024) "Cell Maps for Artificial Intelligence: AI-Ready Maps of Human Cell Architecture from Disease-Relevant Cell Lines" bioRxiv doi:10.1101/2024.05.21.589311
  • Nourreddine et al. (2024) "A Perturbation Cell Atlas of Human Induced Pluripotent Stem Cells" bioRxiv
  • 36 additional publications listed at https://cm4ai.org/publications/

Related to: #66


🤖 Generated with D4D Assistant

- Extracted metadata from CM4AI website, Dataverse, and publications
- Validated against D4D schema (all checks passed)
- Generated HTML preview for review
- Sources: https://cm4ai.org, https://dataverse.lib.virginia.edu/dataverse/CM4AI, https://cm4ai.org/publications/

Datasheet includes:
- Comprehensive description of multimodal cell architecture data
- 53,788 immunofluorescent images, 1,374 protein interactions, 11,739 genes targeted
- AI-ready data in RO-Crate format with FAIRSCAPE metadata
- CC BY-NC-SA 4.0 license
- Quarterly releases via UVA Dataverse

Related to: #66

Co-Authored-By: Claude <[email protected]>
@d4dassistant d4dassistant mentioned this pull request Nov 7, 2025