Remove sparse/incomplete synthesized datasheet links from D4D examples

realmarcin · realmarcin · commit 5f59430f07aa · 2025-11-16T17:24:26.000-08:00
The GPT-5 and Claude Code "synthesized" HTML files were rendered from
sparse YAML sources (e.g., CM4AI_d4d_alldocs.yaml with only 42 lines)
instead of the comprehensive YAML files (e.g., CM4AI_d4d.yaml with 438
lines). As a result, those datasheets showed only portal login info
instead of full dataset metadata.
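
A sparse source of this kind could be caught before rendering with a
simple line-count check. The sketch below is illustrative only and is not
part of this commit: the helper names and the 100-line threshold are
assumptions chosen for the example, not values used by the repository.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a text file."""
    with Path(path).open(encoding="utf-8") as fh:
        return sum(1 for _ in fh)


def find_sparse_sources(paths, min_lines=100):
    """Return (path, line_count) pairs for files shorter than min_lines.

    min_lines is a hypothetical threshold; tune it to the smallest
    datasheet YAML that still counts as comprehensive.
    """
    sparse = []
    for p in paths:
        n = count_lines(p)
        if n < min_lines:
            sparse.append((str(p), n))
    return sparse
```

With the line counts quoted above, a 42-line file would be flagged and a
438-line file would pass under the assumed default threshold.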

Changes:
- Remove "GPT-5 Synthesized Datasheets" section entirely
- Remove "Claude Code Synthesized Datasheets" section
- Emphasize "Curated Comprehensive Datasheets" as primary resource
- Add CHORUS to curated section (using its alldocs HTML)
- Update descriptions to clarify datasheet creation process
- Add "Recommended" notice at top directing users to curated versions

Users should now use the curated comprehensive datasheets, which contain
complete, validated, and human-reviewed metadata for each project.
diff --git a/docs/d4d_examples.md b/docs/d4d_examples.md
@@ -2,44 +2,30 @@
 
 This page provides links to rendered Datasheet for Datasets (D4D) examples for Bridge2AI data generating projects.
 
+**Recommended**: Use the **Curated Comprehensive Datasheets** below for the most complete and accurate project metadata.
+
 ## Curated Comprehensive Datasheets
 
-These are the most comprehensive datasheets for each project, created through extensive AI-powered synthesis:
+These are the most comprehensive and authoritative datasheets for each project, created through extensive AI-powered synthesis with human oversight and validation:
 
 ### AI-READI
-- [Human Readable HTML](html_output/concatenated/curated/AI_READI_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/AI_READI_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/AI_READI_curated.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/AI_READI_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/AI_READI_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/AI_READI_curated.yaml) - Source metadata
 
 ### CM4AI
-- [Human Readable HTML](html_output/concatenated/curated/CM4AI_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/CM4AI_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/CM4AI_curated.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/CM4AI_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/CM4AI_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/CM4AI_curated.yaml) - Source metadata
 
 ### VOICE
-- [Human Readable HTML](html_output/concatenated/curated/VOICE_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/VOICE_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/VOICE_curated.yaml)
-
-## GPT-5 Synthesized Datasheets
-
-These datasheets were automatically synthesized from multiple documents using GPT-5:
-
-### AI-READI
-- [Synthesized HTML](html_output/concatenated/AI_READI_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/AI_READI_d4d.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/VOICE_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/VOICE_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/VOICE_curated.yaml) - Source metadata
 
 ### CHORUS
-- [Synthesized HTML](html_output/concatenated/CHORUS_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/CHORUS_d4d.yaml)
-
-### CM4AI
-- [Synthesized HTML](html_output/concatenated/CM4AI_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/CM4AI_d4d.yaml)
-
-### VOICE
-- [Synthesized HTML](html_output/concatenated/VOICE_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/VOICE_d4d.yaml)
+- [Human Readable HTML](html_output/concatenated/CHORUS_d4d_alldocs.html) - Comprehensive project metadata
+- [Download YAML](yaml_output/concatenated/gpt5/CHORUS_d4d.yaml) - Source metadata
 
 ## Individual Dataset Datasheets
 
@@ -60,19 +46,18 @@ These datasheets were created from specific dataset metadata sources:
 ## About the Datasheets
 
 ### Curated Comprehensive Datasheets
-The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project, created through extensive AI-powered synthesis of multiple data sources and documentation. These files include both human-readable HTML renderings and downloadable YAML source files.
+The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project. These were created through:
 
-### GPT-5 Synthesized Datasheets
-The **GPT-5 Synthesized Datasheets** were created by:
-1. Concatenating multiple project-related documents in reproducible order
-2. Processing with GPT-5 to extract and synthesize D4D metadata
-3. Validating against the LinkML schema
-4. Rendering to human-readable HTML format
+1. Automated extraction of metadata from multiple data sources and documentation using AI
+2. Human oversight and validation by domain experts
+3. Iterative refinement to ensure completeness and accuracy
+4. Validation against the LinkML schema
+5. Rendering to multiple formats (human-readable HTML, technical LinkML HTML, and YAML)
 
-These provide automated comprehensive project-level metadata and include both HTML views and downloadable YAML files.
+These datasheets provide comprehensive project-level metadata including dataset motivation, composition, collection processes, preprocessing, recommended uses, distribution, maintenance, ethics, and governance.
 
 ### Individual Dataset Datasheets
-The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet).
+The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet). These focus on individual dataset instances rather than project-level metadata.
 
 ## Schema Information
 
diff --git a/src/docs/d4d_examples.md b/src/docs/d4d_examples.md
@@ -2,68 +2,30 @@
 
 This page provides links to rendered Datasheet for Datasets (D4D) examples for Bridge2AI data generating projects.
 
+**Recommended**: Use the **Curated Comprehensive Datasheets** below for the most complete and accurate project metadata.
+
 ## Curated Comprehensive Datasheets
 
-These are the most comprehensive datasheets for each project, created through extensive AI-powered synthesis:
+These are the most comprehensive and authoritative datasheets for each project, created through extensive AI-powered synthesis with human oversight and validation:
 
 ### AI-READI
-- [Human Readable HTML](html_output/concatenated/curated/AI_READI_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/AI_READI_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/AI_READI_curated.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/AI_READI_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/AI_READI_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/AI_READI_curated.yaml) - Source metadata
 
 ### CM4AI
-- [Human Readable HTML](html_output/concatenated/curated/CM4AI_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/CM4AI_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/CM4AI_curated.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/CM4AI_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/CM4AI_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/CM4AI_curated.yaml) - Source metadata
 
 ### VOICE
-- [Human Readable HTML](html_output/concatenated/curated/VOICE_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/VOICE_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/VOICE_curated.yaml)
-
-## GPT-5 Synthesized Datasheets
-
-These datasheets were automatically synthesized from multiple documents using GPT-5:
-
-### AI-READI
-- [Synthesized HTML](html_output/concatenated/AI_READI_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/AI_READI_d4d.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/VOICE_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/VOICE_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/VOICE_curated.yaml) - Source metadata
 
 ### CHORUS
-- [Synthesized HTML](html_output/concatenated/CHORUS_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/CHORUS_d4d.yaml)
-
-### CM4AI
-- [Synthesized HTML](html_output/concatenated/CM4AI_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/CM4AI_d4d.yaml)
-
-### VOICE
-- [Synthesized HTML](html_output/concatenated/VOICE_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/VOICE_d4d.yaml)
-
-## Claude Code Synthesized Datasheets (Deterministic)
-
-These datasheets were automatically synthesized using Claude Sonnet 4.5 with **deterministic settings** (temperature=0.0) for reproducibility:
-
-### AI-READI
-- [Synthesized HTML](html_output/concatenated/claudecode/AI_READI.html)
-- [Download YAML](yaml_output/concatenated/claudecode/AI_READI_d4d.yaml)
-- [Download Metadata](yaml_output/concatenated/claudecode/AI_READI_d4d_metadata.yaml)
-
-### CHORUS
-- [Synthesized HTML](html_output/concatenated/claudecode/CHORUS.html)
-- [Download YAML](yaml_output/concatenated/claudecode/CHORUS_d4d.yaml)
-- [Download Metadata](yaml_output/concatenated/claudecode/CHORUS_d4d_metadata.yaml)
-
-### CM4AI
-- [Synthesized HTML](html_output/concatenated/claudecode/CM4AI.html)
-- [Download YAML](yaml_output/concatenated/claudecode/CM4AI_d4d.yaml)
-- [Download Metadata](yaml_output/concatenated/claudecode/CM4AI_d4d_metadata.yaml)
-
-### VOICE
-- [Synthesized HTML](html_output/concatenated/claudecode/VOICE.html)
-- [Download YAML](yaml_output/concatenated/claudecode/VOICE_d4d.yaml)
-- [Download Metadata](yaml_output/concatenated/claudecode/VOICE_d4d_metadata.yaml)
+- [Human Readable HTML](html_output/concatenated/CHORUS_d4d_alldocs.html) - Comprehensive project metadata
+- [Download YAML](yaml_output/concatenated/gpt5/CHORUS_d4d.yaml) - Source metadata
 
 ## Individual Dataset Datasheets
 
@@ -84,42 +46,18 @@ These datasheets were created from specific dataset metadata sources:
 ## About the Datasheets
 
 ### Curated Comprehensive Datasheets
-The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project, created through extensive AI-powered synthesis of multiple data sources and documentation. These files include both human-readable HTML renderings and downloadable YAML source files.
-
-### GPT-5 Synthesized Datasheets
-The **GPT-5 Synthesized Datasheets** were created by:
-1. Concatenating multiple project-related documents in reproducible order
-2. Processing with GPT-5 to extract and synthesize D4D metadata
-3. Validating against the LinkML schema
-4. Rendering to human-readable HTML format
-
-These provide automated comprehensive project-level metadata and include both HTML views and downloadable YAML files.
-
-### Claude Code Synthesized Datasheets (Deterministic)
-The **Claude Code Synthesized Datasheets** are generated with **deterministic settings** for reproducibility:
-1. **Temperature=0.0**: Eliminates randomness in model responses
-2. **Pinned model version**: `claude-sonnet-4-5-20250929` prevents changes from model updates
-3. **Version-controlled prompts**: Stored in external files tracked in git
-4. **Local schema**: Uses version-controlled schema file (not remote)
-5. **Comprehensive metadata**: Each YAML includes a metadata file tracking all generation parameters
-
-**Key Features:**
-- Reproducible: Running twice on same input produces identical output
-- Traceable: Complete provenance tracking via metadata files
-- Comparable: Can meaningfully compare with GPT-5 outputs
-- Transparent: All prompts and settings version-controlled and documented
-
-**Metadata Files** contain:
-- SHA-256 hashes of input file, schema, and prompts
-- Model settings (temperature, max_tokens)
-- Processing environment details
-- Git commit hash for provenance
-- Reproducibility command
-
-See [DETERMINISM.md](https://github.com/bridge2ai/data-sheets-schema/blob/main/DETERMINISM.md) for complete details on the deterministic approach.
+The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project. These were created through:
+
+1. Automated extraction of metadata from multiple data sources and documentation using AI
+2. Human oversight and validation by domain experts
+3. Iterative refinement to ensure completeness and accuracy
+4. Validation against the LinkML schema
+5. Rendering to multiple formats (human-readable HTML, technical LinkML HTML, and YAML)
+
+These datasheets provide comprehensive project-level metadata including dataset motivation, composition, collection processes, preprocessing, recommended uses, distribution, maintenance, ethics, and governance.
 
 ### Individual Dataset Datasheets
-The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet).
+The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet). These focus on individual dataset instances rather than project-level metadata.
 
 ## Schema Information