Add comprehensive D4D datasheet for VOICE dataset #65
d4dassistant wants to merge 3 commits into main
- Created detailed datasheet for Bridge2AI-Voice v1.1
- Extracted metadata from PhysioNet and Health Data Nexus documentation
- Includes comprehensive coverage of all D4D sections:
  * Motivation and purpose
  * Dataset composition (12,523 recordings, 306 participants)
  * Collection process and protocols
  * Preprocessing and derived data details
  * Privacy protections and HIPAA Safe Harbor de-identification
  * Intended uses and limitations
  * Access requirements and licensing
  * Funding, citations, and references
- Follows structure of existing VOICE D4D files in repository

Co-Authored-By: Claude <noreply@anthropic.com>
D4D Datasheet Creation Summary
I've successfully created a comprehensive D4D (Datasheets for Datasets) metadata file for the Bridge2AI-Voice dataset.
Metadata Extraction Process
Sources Accessed:
Information Extracted:
File Structure
The datasheet follows the practical D4D structure used by existing files in
Coverage Completeness
Quality Checks
The datasheet is ready for review and provides comprehensive documentation of the Bridge2AI-Voice dataset suitable for researchers, data stewards, and the broader community.
Enhanced the D4D datasheet with detailed Data Use Agreement information:
- Added DUA provider (University of South Florida Board of Trustees)
- Documented agreement terms and duration
- Detailed permitted uses and restrictions
- Added sharing policy and data disposition requirements
- Included legal protections (Certificate of Confidentiality, PII classification)
- Documented breach notification requirements
- Added warranties, liability, and PhysioNet compliance terms
- Enhanced privacy section with data classification and security requirements
- Updated access conditions to include DACO application requirement

Source: Bridge2AI Data Transfer and Use Agreement (Google Drive PDF)
Successfully accessed using proper Google Drive download URL format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
✅ DUA Information Successfully Added
I've updated the D4D datasheet with comprehensive Data Use Agreement (DUA) details extracted from the Google Drive PDF.
What Changed
Enhanced Sections:
Document Access Method
Thanks for the tip! I successfully accessed the Google Drive PDF using the correct URL format:
This allowed me to extract 9 pages of detailed DUA information.
Note on IRB Document
The IRB Google Doc link (https://docs.google.com/document/d/1gTFzAM-FoYlM_X9qF0s7fXoswmaz8IqN/edit) still requires authentication and redirects to a Google login page. If you'd like me to add IRB-specific details beyond what's already in the datasheet, please either:
Validation
✅ YAML syntax validated successfully
🤖 D4D Assistant
- Accessed Bridge2AI Voice IRB Protocol v14 (Google Doc)
- Added detailed IRB approval structure (Single IRB with USF as lead)
- Documented comprehensive informed consent procedures (4 modalities)
- Added inclusion/exclusion criteria and recruitment methods
- Expanded ethics section with:
  * Protocol details (PI, version, study design, target sample)
  * Vulnerable population protections (pediatric, diversity/inclusion)
  * Risk assessment and mitigation strategies
  * Multi-level privacy framework (institutional, shared, federated)
  * Data security measures and quality control
  * Phased implementation timeline
  * Participant rights and compensation
- Enhanced collection_process section with:
  * All 9 participating institutions (US and Canadian sites)
  * Data types collected (acoustic, clinical, imaging, genomic, questionnaires)
  * Collection settings (in-clinic and remote)
  * Longitudinal data collection details

Sources:
- https://physionet.org/content/b2ai-voice/1.1/
- https://healthdatanexus.ai/content/b2ai-voice/1.0/
- Bridge2AI Data Transfer and Use Agreement (Google Drive PDF)
- Bridge2AI Voice IRB Protocol v14 (Google Doc - newly accessed)

Co-Authored-By: Claude <noreply@anthropic.com>
✅ IRB Documentation Successfully Added
I've successfully accessed and extracted comprehensive IRB and ethics information from the Bridge2AI Voice IRB Protocol v14 document!
What Changed
279 new lines added with detailed IRB and ethics documentation:
1. IRB Approval Structure
2. Informed Consent Procedures
3. Participant Selection
4. Ethics & Human Subjects Protections
Vulnerable Populations:
Risk Assessment:
Privacy Framework:
Data Security:
5. Data Collection Details
6. Study Implementation
Sources Added
Validation
✅ YAML syntax validated with Python
The datasheet now includes complete IRB approval details, informed consent procedures, participant protections, and ethics review information as requested.
🤖 D4D Assistant
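For reviewers who want to repeat the syntax check, the following is a minimal sketch of what "YAML syntax validated with Python" could look like. It assumes the third-party PyYAML package is installed; the helper name `yaml_syntax_error` and the sample strings are hypothetical and are not taken from the datasheet or from the assistant's actual validation step.

```python
import yaml  # PyYAML, a third-party package (assumed to be installed)


def yaml_syntax_error(text: str):
    """Return None if `text` parses as YAML, else a short error description."""
    try:
        yaml.safe_load(text)
    except yaml.YAMLError as exc:
        return str(exc)
    return None


# Hypothetical fragments for illustration, not copied from the real datasheet.
good = "title: Example datasheet\nsections:\n  - motivation\n  - composition\n"
bad = "title: [unclosed\n"

print(yaml_syntax_error(good))             # None
print(yaml_syntax_error(bad) is not None)  # True
```

A check like this only confirms that the file parses. It says nothing about whether the content conforms to a schema.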
Summary
Created a new comprehensive D4D datasheet for the Bridge2AI-Voice dataset based on official documentation from PhysioNet and Health Data Nexus.
Background
Responding to issue #64 requesting creation of a D4D datasheet for the VOICE Data Generation Project (DGP).
Sources
Note: IRB and DUA documents were access-restricted (Google Docs requiring authentication).
Files Added
data/extracted_by_column/VOICE/b2ai_voice_comprehensive_d4d.yaml - Comprehensive D4D YAML datasheet
Key Metadata Extracted
Dataset Overview
b2ai-voice-v1.1
Composition
Data Types
Privacy & Ethics
Access
Funding
D4D Sections Covered
✅ Motivation - Purpose and research gaps addressed
✅ Composition - Dataset contents, participants, cohorts, missing data
✅ Collection Process - Settings, protocols, consent, IRB approval
✅ Preprocessing - Audio processing, feature extraction, de-identification
✅ Uses - Intended uses, examples, limitations
✅ Distribution - Formats, access requirements, licensing
✅ Maintenance - Consortium management, versioning, updates
✅ Ethics - IRB approval, consent, privacy protections
✅ Data Governance - Licensing, citation requirements, controlled access
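A coverage list like the one above can be checked mechanically once the datasheet has been parsed into a mapping. The sketch below is only illustrative. The section names come from the list, but the snake_case keys (other than `collection_process`, which the commit message mentions) are an assumption about how the YAML file names them, and `missing_sections` is a hypothetical helper.

```python
# Section names follow the "D4D Sections Covered" list; the exact keys used in
# the YAML file are an assumption made for this sketch.
EXPECTED_SECTIONS = [
    "motivation",
    "composition",
    "collection_process",
    "preprocessing",
    "uses",
    "distribution",
    "maintenance",
    "ethics",
    "data_governance",
]


def missing_sections(datasheet: dict) -> list:
    """Return expected sections that are absent or empty in a parsed datasheet."""
    return [name for name in EXPECTED_SECTIONS if not datasheet.get(name)]


# Hypothetical parsed datasheet: one section removed, one left empty.
example = {name: {"summary": "placeholder"} for name in EXPECTED_SECTIONS}
del example["maintenance"]
example["ethics"] = {}

print(missing_sections(example))  # ['maintenance', 'ethics']
```

Treating an empty section the same as a missing one is a design choice made here, so that a placeholder key does not count as coverage.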
How to Review
data/extracted_by_column/VOICE/b2ai_voice_comprehensive_d4d.yaml for completeness and accuracy
Notes
data/extracted_by_column/VOICE/physionet_b2ai-voice_1.1_d4d.yaml file pattern
Validation
data/extracted_by_column/ use a practical structure optimized for human readability and HTML rendering, which differs from strict LinkML schema validation
Related to: #64
🤖 Generated with D4D Assistant