
@d4dassistant

Summary

Adds a new consolidated D4D datasheet for Cell Maps for Artificial Intelligence (CM4AI), drawing on documentation from the CM4AI website, the Virginia Dataverse releases, and 38+ publications.

Files Added

  • data/sheets_d4dassistant/cm4ai_d4d.yaml - Consolidated CM4AI D4D datasheet

Validation

  • ✅ Schema validation passed
  • ✅ Required fields populated (id, name)
  • ✅ YAML syntax valid
  • ✅ Structured with resources section for detailed component metadata
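
The required-field check in the list above can be approximated locally. This is a minimal sketch, not the project's validator: it assumes the datasheet has already been parsed into a Python mapping (for example by a YAML loader), and the helper name `missing_required` is hypothetical. Only the field names `id` and `name` come from the checklist.

```python
from typing import Any, Mapping

# Required top-level fields named in the validation checklist.
REQUIRED_FIELDS = ("id", "name")


def missing_required(sheet: Mapping[str, Any]) -> list[str]:
    """Return the required fields that are absent, null, or blank strings."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = sheet.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Example mapping using the dataset ID and name reported in this PR.
sheet = {
    "id": "cm4ai",
    "name": "Cell Maps for Artificial Intelligence (CM4AI)",
}
print(missing_required(sheet))
```

An empty list means both required fields are populated. A check like this does not replace full schema validation, which also covers types and enum values.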

Key Metadata Extracted

Dataset ID: cm4ai
Dataset Name: Cell Maps for Artificial Intelligence (CM4AI)

Purpose: CM4AI was created to generate comprehensive, AI-ready maps of human cell architecture from disease-relevant cell lines to support interpretable genotype-phenotype learning and advance functional genomics research using the FAIRSCAPE framework and RO-Crate format.

Composition:

  • 53,788 immunofluorescence images
  • 1,792 proteins investigated
  • 11,739 genes targeted via CRISPR
  • 1,374 protein interactions mapped
  • 22.7 TB total data volume

Data Types:

  1. CRISPR Perturbation Cell Atlas: Genome-scale CRISPRi perturbation atlas in KOLF2.1J hiPSCs with single-cell RNA sequencing
  2. Protein-Protein Interactions: SEC-MS data from multiple cell types (hiPSCs, NPCs, neurons, cardiomyocytes)
  3. Protein Localization Imaging: Multi-channel confocal microscopy of 563 proteins in MDA-MB-468 cells under various treatment conditions

Distribution:

  • Repository: University of Virginia Dataverse (https://dataverse.lib.virginia.edu/dataverse/CM4AI)
  • License: CC BY-NC-SA 4.0
  • Releases: Quarterly updates (Alpha May 2024, Beta March 2025, Beta June 2025)
  • Formats: RO-Crate packages, ZIP archives, FASTQ, CSV, H5AD

Consortium: UC San Diego (lead), UCSF, Stanford, UVA, Yale, UT Austin, UAB, Simon Fraser, Hastings Center

Funding: NIH Bridge2AI grant 1OT2OD032742-01

Sources

This consolidated datasheet synthesizes information from:

  1. CM4AI Website: https://cm4ai.org (main project page, data releases, tools, publications)
  2. Virginia Dataverse Releases:
    • March 2025 Beta: doi:10.18130/V3/B35XWX
    • June 2025 Beta: doi:10.18130/V3/F3TD5R
    • Alpha v0.5: doi:10.18130/V3/DXWOS5
  3. CM4AI Publications (https://cm4ai.org/publications/):
    • Clark T, et al. Cell Maps for Artificial Intelligence (bioRxiv 2024.05.21.589311)
    • Schaffer LV, et al. Multimodal cell maps as a foundation for structural and functional genomics (Nature 2025)
    • Lenkiewicz J, et al. Cell Mapping Toolkit (Bioinformatics 2024)
    • Al Manir S, et al. FAIRSCAPE: An Evolving AI-readiness Framework (bioRxiv 2024)
    • Ethics and ELSI publications from Pacia DM, Stevens I, Ravitsky V, et al.
  4. Existing Datasheets: Consolidated information from data/extracted_by_column/CM4AI/dataverse_10.18130_V3_B35XWX_d4d.yaml and related files
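
For reviewers who want to open the Dataverse releases listed above, the DOIs can be turned into resolver links. The sketch below assumes the standard `https://doi.org/` resolver; the helper name `doi_to_url` and the dictionary layout are illustrative and are not part of the datasheet.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL from a DOI, tolerating an optional 'doi:' prefix."""
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[len("doi:"):]
    # Keep "/" unescaped so the prefix/suffix structure of the DOI survives.
    return DOI_RESOLVER + quote(bare, safe="/")


# DOIs copied from the Sources list in this PR.
releases = {
    "March 2025 Beta": "doi:10.18130/V3/B35XWX",
    "June 2025 Beta": "doi:10.18130/V3/F3TD5R",
    "Alpha v0.5": "doi:10.18130/V3/DXWOS5",
}

for label, doi in releases.items():
    print(f"{label}: {doi_to_url(doi)}")
```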

How to Review

  1. Check YAML structure: Review data/sheets_d4dassistant/cm4ai_d4d.yaml for completeness and accuracy
  2. Verify sources: Compare against original CM4AI documentation URLs listed above
  3. Validate resources section: Three detailed resource entries cover CRISPR atlas, SEC-MS data, and imaging data
  4. Check metadata fields: Confirm populated fields are accurate; omitted fields indicate missing source info

Notes

  • Fields marked as null or omitted indicate information not found in source documentation
  • Controlled vocabulary fields use enums defined in the D4D schema
  • All dates follow ISO 8601 format (YYYY-MM-DD)
  • Resources section provides detailed metadata for each major data component
  • This consolidates information across multiple data releases into a single comprehensive datasheet
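
The date convention in the notes above can be spot-checked with a few lines of Python. This is a rough sketch under the assumption that dates are stored as plain strings; `is_iso_date` is a hypothetical helper, and the sample values are illustrative rather than taken from the datasheet.

```python
import re
from datetime import date

# Strict YYYY-MM-DD shape: four-digit year, zero-padded month and day.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not _ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2025-02-30
    except ValueError:
        return False
    return True


# Illustrative values only.
for sample in ("2025-03-01", "2025-3-1", "2025-02-30", "March 2025"):
    print(sample, is_iso_date(sample))
```

The regular expression runs first because `date.fromisoformat` alone accepts additional ISO 8601 spellings on newer Python versions.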

Related to: #71


🤖 Generated with D4D Assistant

- Comprehensive datasheet consolidating information from CM4AI publications and data releases
- Includes metadata from cm4ai.org, Virginia Dataverse releases, and 38+ publications
- Covers 3 main data types: CRISPR perturbation atlas, SEC-MS protein interactions, and IF imaging
- Documents 53,788 images, 1,792 proteins, 11,739 genes targeted, 1,374 protein interactions (22.7 TB total)
- Structured with detailed resources section for each major data component
- Validates against D4D schema

Sources:
- CM4AI website (https://cm4ai.org)
- Virginia Dataverse data releases (DOIs: 10.18130/V3/B35XWX, 10.18130/V3/F3TD5R, 10.18130/V3/DXWOS5)
- CM4AI publications list (https://cm4ai.org/publications/)
- Existing datasheets in data/extracted_by_column/CM4AI/

Related to: #71

Co-Authored-By: Claude <[email protected]>
@d4dassistant d4dassistant mentioned this pull request Nov 7, 2025