Test Issue: Create D4D Datasheet for CM4AI Dataset
Issue Type: D4D Creation Test
Dataset: CM4AI (Common Mind for Artificial Intelligence)
Test Date: 2025-02-10
Request
@d4dassistant Please create a comprehensive D4D datasheet for the CM4AI dataset using the input documents provided in this repository.
Dataset Information
Dataset Name: CM4AI (Common Mind for Artificial Intelligence)
Project: Bridge2AI Common Mind Consortium
Purpose: Multi-modal dataset for AI research in mental health
Input Documents Location
The preprocessed input documents are located in:
data/sheets_d4dassistant/inputs/CM4AI/
Available files (6 total, ~2,978 lines):
- 2024.05.21.589311v1.full.txt (53K) - Research paper
- RePORT ⟩ RePORTER - CM4AI.txt (5.7K) - NIH grant report
- creativecommons_org_licenses-by-nc-sa_row15.txt (5.6K) - License information
- dataverse_10.18130_V3_B35XWX_row16.txt (27K) - Dataset documentation
- dataverse_10.18130_V3_F3TD5R_row19.txt (29K) - Additional dataset info
- doi_row3.json (123B) - Metadata
Expected Workflow
When the D4D Assistant processes this request, it should:
- Validate Prerequisites
  - Schema file exists
  - Prompt files exist
  - Input files accessible (6 files in data/sheets_d4dassistant/inputs/CM4AI/)
  - Output directory ready
- Study Schema
  - Load D4D schema from src/data_sheets_schema/schema/data_sheets_schema_all.yaml
  - Review reference examples
  - Understand field structure
- Process Input Documents
  - Read all 6 files
  - Extract D4D-relevant metadata
  - Map information to schema classes
- Generate D4D YAML
  - Create data/sheets_d4dassistant/CM4AI_d4d.yaml
  - Populate with deterministic generation (temperature=0.0)
  - Use schema field names exactly
- Generate Metadata
  - Create data/sheets_d4dassistant/CM4AI_d4d_metadata.yaml
  - Include SHA-256 hashes for all inputs
  - Track git commit, model settings, prompts
  - Record processing environment
- Validate Quality
  - Schema validation: Must pass LinkML validation
  - Completeness check: Must meet minimal threshold (4+ sections, 50+ slots, 100+ lines)
  - If validation fails → DO NOT create PR
- Generate HTML Preview
  - Create data/sheets_d4dassistant/CM4AI_d4d.html
  - Human-readable format for review
- Create Pull Request
  - Branch: d4d/add-cm4ai-datasheet
  - Include all three files (YAML, metadata, HTML)
  - Provide summary of extracted metadata
  - Link back to this issue
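The fail-fast prerequisite step at the top of this workflow could be sketched roughly as follows. This is an illustration only, not the assistant's actual implementation: the function name `check_prerequisites` and its parameters are hypothetical, and because the issue does not say where the prompt files live, they are passed in by the caller.

```python
from pathlib import Path


def check_prerequisites(schema_path, input_dir, output_dir,
                        prompt_paths=(), expected_inputs=6):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper: names and signature are assumptions, not part
    of the D4D assistant.
    """
    problems = []

    # The schema must exist as a regular file.
    schema = Path(schema_path)
    if not schema.is_file():
        problems.append(f"schema file missing: {schema}")

    # Prompt file locations are not given in the issue, so the caller supplies them.
    for prompt in map(Path, prompt_paths):
        if not prompt.is_file():
            problems.append(f"prompt file missing: {prompt}")

    # Count regular files in the input directory (6 are expected for CM4AI).
    inputs = Path(input_dir)
    found = [p for p in inputs.iterdir() if p.is_file()] if inputs.is_dir() else []
    if len(found) != expected_inputs:
        problems.append(
            f"expected {expected_inputs} input files in {inputs}, found {len(found)}"
        )

    # The output directory must already exist.
    out = Path(output_dir)
    if not out.is_dir():
        problems.append(f"output directory missing: {out}")

    return problems
```

Called with the schema path, `data/sheets_d4dassistant/inputs/CM4AI/`, and `data/sheets_d4dassistant/`, a non-empty result would be a reason to stop before any generation starts.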
Test Objectives
This test issue validates:
- Deterministic generation: temperature=0.0, date-pinned model
- File-based input mode: Reading from data/sheets_d4dassistant/inputs/
- Prerequisites validation: Fail-fast checks before generation
- Metadata generation: Comprehensive provenance tracking
- Completeness validation: Quality gates before PR
- GitHub integration: Issue → PR workflow
- Documentation: Clear communication with user
Expected Output Files
After processing, the following files should be created:
1. D4D YAML Datasheet
Location: data/sheets_d4dassistant/CM4AI_d4d.yaml
Expected content:
- id: Dataset identifier
- name: CM4AI or Common Mind for Artificial Intelligence
- motivation: Purpose, tasks, addressing_gaps
- composition: Instances, subsets, instance_count
- collection_process: Acquisition methodology
- preprocessing: Preprocessing steps and software
- uses: Recommended and not-recommended uses
- distribution: Format, access, requirements
- maintenance: Plan, version, contact
- human_subjects: IRB, consent, protections
- ethics_and_data_protection: Reviews, security
- data_governance: Stewards, license, terms
2. Metadata File
Location: data/sheets_d4dassistant/CM4AI_d4d_metadata.yaml
Expected sections:
extraction_metadata:
  timestamp, extraction_id, extraction_type, input_mode: file
input_documents:
  - 6 files with SHA-256 hashes
datasheets_schema:
  sha256_hash: <schema hash>
llm_model:
  model_name: claude-sonnet-4-5-20250929
  temperature: 0.0
prompts:
  system_prompt_hash: <hash>
  user_prompt_hash: <hash>
provenance:
  git_commit: <current commit>
3. HTML Preview
Location: data/sheets_d4dassistant/CM4AI_d4d.html
Purpose: Human-readable view for reviewers
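For the SHA-256 hashes expected under input_documents in the metadata file, a minimal sketch using only the Python standard library might look like this. The helper names `sha256_of_file` and `hash_inputs` are hypothetical, and the layout of the resulting mapping is an assumption rather than the assistant's actual metadata format.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_inputs(input_dir):
    """Map each regular file name in input_dir to its SHA-256 digest.

    Hypothetical helper; the real metadata layout may differ.
    """
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    return {p.name: sha256_of_file(p) for p in files}
```

Hashing raw bytes rather than decoded text keeps the digest independent of encoding assumptions, which matters for file names and contents that include non-ASCII characters.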
Success Criteria
- Prerequisites pass: All files found, validation succeeds
- Schema validation passes: YAML conforms to D4D schema
- Completeness check passes: Meets minimal quality threshold
- Metadata generated: Complete provenance tracking
- PR created: With all three files
- Clear communication: Comments in issue and PR
Troubleshooting
If the assistant encounters issues:
Missing Input Files
- Check: ls data/sheets_d4dassistant/inputs/CM4AI/
- Should see 6 files
Schema Validation Fails
- Review error messages
- Check field names against schema
- Fix and regenerate
Completeness Check Fails
- Review metrics (sections, slots, lines)
- May need more detailed extraction
- Consider adding more content
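To make the three metrics concrete, the threshold check (4+ sections, 50+ slots, 100+ lines) could be approximated as below. The counting rules are assumptions, since this issue does not define them: a "section" is taken to be a populated top-level key, a "slot" a populated leaf value anywhere in the structure, and "lines" the line count of the YAML text. The function names are hypothetical, and the datasheet is passed in as an already-parsed dict so the sketch needs no YAML library.

```python
def count_slots(node):
    """Count populated leaf values in nested dicts and lists (assumed meaning of 'slots')."""
    if isinstance(node, dict):
        return sum(count_slots(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_slots(item) for item in node)
    # None and empty strings count as unpopulated.
    return 0 if node in (None, "") else 1


def completeness_report(datasheet, yaml_text,
                        min_sections=4, min_slots=50, min_lines=100):
    """Return (passed, metrics) for the minimal threshold described in the issue."""
    # Assumed meaning of 'sections': top-level keys with a non-empty value.
    sections = sum(1 for value in datasheet.values()
                   if value not in (None, "", [], {}))
    metrics = {
        "sections": sections,
        "slots": count_slots(datasheet),
        "lines": len(yaml_text.splitlines()),
    }
    passed = (metrics["sections"] >= min_sections
              and metrics["slots"] >= min_slots
              and metrics["lines"] >= min_lines)
    return passed, metrics
```

Returning the metrics alongside the pass/fail flag shows which of the three numbers fell short, which is the first thing to review when the check fails.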
API Issues
- Check ANTHROPIC_API_KEY is set
- Verify API access
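A quick way to run the first of these checks from Python is sketched below. It only confirms that the variable is present and non-blank; it does not verify that the key is valid or that the API is reachable. The function name `api_key_is_set` is hypothetical.

```python
import os


def api_key_is_set(env=None, name="ANTHROPIC_API_KEY"):
    """Return True if the variable exists and is not blank.

    env defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())
```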
Notes
- This is a test issue to validate the deterministic D4D assistant implementation
- All components (Phases 1-3) have been implemented and tested
- This represents the first end-to-end workflow test
- Results will inform Phase 4 (Verification and Comparison)
Related Documentation
- Implementation: IMPLEMENTATION_SUMMARY.md
- Testing: PHASE3_TESTING_RESULTS.md
- Verification: VERIFICATION_GUIDE.md
- Instructions: .github/workflows/d4d_assistant_create.md
- Config: .github/workflows/d4d_assistant_deterministic_config.yaml
Test Context: Phase 3 → Phase 4 transition
Repository Branch: 2025_issue_fixes