Commit 5f59430

Remove sparse/incomplete synthesized datasheet links from D4D examples
The GPT-5 and Claude Code "synthesized" HTML files were rendering from sparse YAML sources (e.g., CM4AI_d4d_alldocs.yaml with only 42 lines) instead of the comprehensive YAML files (e.g., CM4AI_d4d.yaml with 438 lines). This resulted in datasheets showing only portal login info instead of full dataset metadata.

Changes:
- Remove "GPT-5 Synthesized Datasheets" section entirely
- Remove "Claude Code Synthesized Datasheets" section
- Emphasize "Curated Comprehensive Datasheets" as primary resource
- Add CHORUS to curated section (using its alldocs HTML)
- Update descriptions to clarify datasheet creation process
- Add "Recommended" notice at top directing users to curated versions

Users should now use the curated comprehensive datasheets, which contain complete, validated, and human-reviewed metadata for each project.
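The line-count gap described above (42 lines versus 438) suggests a simple guard that could run before rendering. The following is a minimal sketch, not part of this commit: the function names and the `min_lines` threshold of 100 are illustrative assumptions, and line count is only a rough proxy for completeness.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count the lines in a text file without loading it all into memory."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def find_sparse_yaml(directory: Path, min_lines: int = 100) -> list[tuple[str, int]]:
    """Return (filename, line_count) for every YAML file shorter than min_lines.

    The threshold is a hypothetical heuristic, chosen only for illustration.
    """
    sparse = []
    for path in sorted(directory.rglob("*.yaml")):
        line_count = count_lines(path)
        if line_count < min_lines:
            sparse.append((path.name, line_count))
    return sparse
```

Calling `find_sparse_yaml(Path("yaml_output/concatenated"))` would list candidate files to inspect before their HTML renderings are linked from the examples page; a schema-level check of which fields are populated would be a more reliable test than line count alone.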
1 parent 4193ae7 commit 5f59430

2 files changed: +46 -123 lines changed

docs/d4d_examples.md

Lines changed: 22 additions & 37 deletions
@@ -2,44 +2,30 @@
 
 This page provides links to rendered Datasheet for Datasets (D4D) examples for Bridge2AI data generating projects.
 
+**Recommended**: Use the **Curated Comprehensive Datasheets** below for the most complete and accurate project metadata.
+
 ## Curated Comprehensive Datasheets
 
-These are the most comprehensive datasheets for each project, created through extensive AI-powered synthesis:
+These are the most comprehensive and authoritative datasheets for each project, created through extensive AI-powered synthesis with human oversight and validation:
 
 ### AI-READI
-- [Human Readable HTML](html_output/concatenated/curated/AI_READI_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/AI_READI_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/AI_READI_curated.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/AI_READI_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/AI_READI_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/AI_READI_curated.yaml) - Source metadata
 
 ### CM4AI
-- [Human Readable HTML](html_output/concatenated/curated/CM4AI_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/CM4AI_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/CM4AI_curated.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/CM4AI_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/CM4AI_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/CM4AI_curated.yaml) - Source metadata
 
 ### VOICE
-- [Human Readable HTML](html_output/concatenated/curated/VOICE_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/VOICE_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/VOICE_curated.yaml)
-
-## GPT-5 Synthesized Datasheets
-
-These datasheets were automatically synthesized from multiple documents using GPT-5:
-
-### AI-READI
-- [Synthesized HTML](html_output/concatenated/AI_READI_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/AI_READI_d4d.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/VOICE_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/VOICE_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/VOICE_curated.yaml) - Source metadata
 
 ### CHORUS
-- [Synthesized HTML](html_output/concatenated/CHORUS_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/CHORUS_d4d.yaml)
-
-### CM4AI
-- [Synthesized HTML](html_output/concatenated/CM4AI_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/CM4AI_d4d.yaml)
-
-### VOICE
-- [Synthesized HTML](html_output/concatenated/VOICE_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/VOICE_d4d.yaml)
+- [Human Readable HTML](html_output/concatenated/CHORUS_d4d_alldocs.html) - Comprehensive project metadata
+- [Download YAML](yaml_output/concatenated/gpt5/CHORUS_d4d.yaml) - Source metadata
 
 ## Individual Dataset Datasheets
 
@@ -60,19 +46,18 @@ These datasheets were created from specific dataset metadata sources:
 ## About the Datasheets
 
 ### Curated Comprehensive Datasheets
-The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project, created through extensive AI-powered synthesis of multiple data sources and documentation. These files include both human-readable HTML renderings and downloadable YAML source files.
+The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project. These were created through:
 
-### GPT-5 Synthesized Datasheets
-The **GPT-5 Synthesized Datasheets** were created by:
-1. Concatenating multiple project-related documents in reproducible order
-2. Processing with GPT-5 to extract and synthesize D4D metadata
-3. Validating against the LinkML schema
-4. Rendering to human-readable HTML format
+1. Automated extraction of metadata from multiple data sources and documentation using AI
+2. Human oversight and validation by domain experts
+3. Iterative refinement to ensure completeness and accuracy
+4. Validation against the LinkML schema
+5. Rendering to multiple formats (human-readable HTML, technical LinkML HTML, and YAML)
 
-These provide automated comprehensive project-level metadata and include both HTML views and downloadable YAML files.
+These datasheets provide comprehensive project-level metadata including dataset motivation, composition, collection processes, preprocessing, recommended uses, distribution, maintenance, ethics, and governance.
 
 ### Individual Dataset Datasheets
-The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet).
+The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet). These focus on individual dataset instances rather than project-level metadata.
 
 ## Schema Information
 

src/docs/d4d_examples.md

Lines changed: 24 additions & 86 deletions
@@ -2,68 +2,30 @@
 
 This page provides links to rendered Datasheet for Datasets (D4D) examples for Bridge2AI data generating projects.
 
+**Recommended**: Use the **Curated Comprehensive Datasheets** below for the most complete and accurate project metadata.
+
 ## Curated Comprehensive Datasheets
 
-These are the most comprehensive datasheets for each project, created through extensive AI-powered synthesis:
+These are the most comprehensive and authoritative datasheets for each project, created through extensive AI-powered synthesis with human oversight and validation:
 
 ### AI-READI
-- [Human Readable HTML](html_output/concatenated/curated/AI_READI_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/AI_READI_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/AI_READI_curated.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/AI_READI_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/AI_READI_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/AI_READI_curated.yaml) - Source metadata
 
 ### CM4AI
-- [Human Readable HTML](html_output/concatenated/curated/CM4AI_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/CM4AI_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/CM4AI_curated.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/CM4AI_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/CM4AI_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/CM4AI_curated.yaml) - Source metadata
 
 ### VOICE
-- [Human Readable HTML](html_output/concatenated/curated/VOICE_human_readable.html)
-- [LinkML Format HTML](html_output/concatenated/curated/VOICE_linkml.html)
-- [Download YAML](yaml_output/concatenated/curated/VOICE_curated.yaml)
-
-## GPT-5 Synthesized Datasheets
-
-These datasheets were automatically synthesized from multiple documents using GPT-5:
-
-### AI-READI
-- [Synthesized HTML](html_output/concatenated/AI_READI_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/AI_READI_d4d.yaml)
+- [Human Readable HTML](html_output/concatenated/curated/VOICE_human_readable.html) - Recommended viewing format
+- [LinkML Format HTML](html_output/concatenated/curated/VOICE_linkml.html) - Technical schema format
+- [Download YAML](yaml_output/concatenated/curated/VOICE_curated.yaml) - Source metadata
 
 ### CHORUS
-- [Synthesized HTML](html_output/concatenated/CHORUS_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/CHORUS_d4d.yaml)
-
-### CM4AI
-- [Synthesized HTML](html_output/concatenated/CM4AI_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/CM4AI_d4d.yaml)
-
-### VOICE
-- [Synthesized HTML](html_output/concatenated/VOICE_d4d_synthesized.html)
-- [Download YAML](yaml_output/concatenated/gpt5/VOICE_d4d.yaml)
-
-## Claude Code Synthesized Datasheets (Deterministic)
-
-These datasheets were automatically synthesized using Claude Sonnet 4.5 with **deterministic settings** (temperature=0.0) for reproducibility:
-
-### AI-READI
-- [Synthesized HTML](html_output/concatenated/claudecode/AI_READI.html)
-- [Download YAML](yaml_output/concatenated/claudecode/AI_READI_d4d.yaml)
-- [Download Metadata](yaml_output/concatenated/claudecode/AI_READI_d4d_metadata.yaml)
-
-### CHORUS
-- [Synthesized HTML](html_output/concatenated/claudecode/CHORUS.html)
-- [Download YAML](yaml_output/concatenated/claudecode/CHORUS_d4d.yaml)
-- [Download Metadata](yaml_output/concatenated/claudecode/CHORUS_d4d_metadata.yaml)
-
-### CM4AI
-- [Synthesized HTML](html_output/concatenated/claudecode/CM4AI.html)
-- [Download YAML](yaml_output/concatenated/claudecode/CM4AI_d4d.yaml)
-- [Download Metadata](yaml_output/concatenated/claudecode/CM4AI_d4d_metadata.yaml)
-
-### VOICE
-- [Synthesized HTML](html_output/concatenated/claudecode/VOICE.html)
-- [Download YAML](yaml_output/concatenated/claudecode/VOICE_d4d.yaml)
-- [Download Metadata](yaml_output/concatenated/claudecode/VOICE_d4d_metadata.yaml)
+- [Human Readable HTML](html_output/concatenated/CHORUS_d4d_alldocs.html) - Comprehensive project metadata
+- [Download YAML](yaml_output/concatenated/gpt5/CHORUS_d4d.yaml) - Source metadata
 
 ## Individual Dataset Datasheets
 
@@ -84,42 +46,18 @@ These datasheets were created from specific dataset metadata sources:
 ## About the Datasheets
 
 ### Curated Comprehensive Datasheets
-The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project, created through extensive AI-powered synthesis of multiple data sources and documentation. These files include both human-readable HTML renderings and downloadable YAML source files.
-
-### GPT-5 Synthesized Datasheets
-The **GPT-5 Synthesized Datasheets** were created by:
-1. Concatenating multiple project-related documents in reproducible order
-2. Processing with GPT-5 to extract and synthesize D4D metadata
-3. Validating against the LinkML schema
-4. Rendering to human-readable HTML format
-
-These provide automated comprehensive project-level metadata and include both HTML views and downloadable YAML files.
-
-### Claude Code Synthesized Datasheets (Deterministic)
-The **Claude Code Synthesized Datasheets** are generated with **deterministic settings** for reproducibility:
-1. **Temperature=0.0**: Eliminates randomness in model responses
-2. **Pinned model version**: `claude-sonnet-4-5-20250929` prevents changes from model updates
-3. **Version-controlled prompts**: Stored in external files tracked in git
-4. **Local schema**: Uses version-controlled schema file (not remote)
-5. **Comprehensive metadata**: Each YAML includes a metadata file tracking all generation parameters
-
-**Key Features:**
-- Reproducible: Running twice on same input produces identical output
-- Traceable: Complete provenance tracking via metadata files
-- Comparable: Can meaningfully compare with GPT-5 outputs
-- Transparent: All prompts and settings version-controlled and documented
-
-**Metadata Files** contain:
-- SHA-256 hashes of input file, schema, and prompts
-- Model settings (temperature, max_tokens)
-- Processing environment details
-- Git commit hash for provenance
-- Reproducibility command
-
-See [DETERMINISM.md](https://github.com/bridge2ai/data-sheets-schema/blob/main/DETERMINISM.md) for complete details on the deterministic approach.
+The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project. These were created through:
+
+1. Automated extraction of metadata from multiple data sources and documentation using AI
+2. Human oversight and validation by domain experts
+3. Iterative refinement to ensure completeness and accuracy
+4. Validation against the LinkML schema
+5. Rendering to multiple formats (human-readable HTML, technical LinkML HTML, and YAML)
+
+These datasheets provide comprehensive project-level metadata including dataset motivation, composition, collection processes, preprocessing, recommended uses, distribution, maintenance, ethics, and governance.
 
 ### Individual Dataset Datasheets
-The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet).
+The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet). These focus on individual dataset instances rather than project-level metadata.
 
 ## Schema Information
 
