Remove sparse/incomplete synthesized datasheet links from D4D examples

The GPT-5 and Claude Code "synthesized" HTML files were rendering from
sparse YAML sources (e.g., CM4AI_d4d_alldocs.yaml with only 42 lines)
instead of the comprehensive YAML files (e.g., CM4AI_d4d.yaml with 438
lines). This resulted in datasheets showing only portal login info
instead of full dataset metadata.

Changes:
- Remove "GPT-5 Synthesized Datasheets" section entirely
- Remove "Claude Code Synthesized Datasheets" section
- Emphasize "Curated Comprehensive Datasheets" as primary resource
- Add CHORUS to curated section (using its alldocs HTML)
- Update descriptions to clarify datasheet creation process
- Add "Recommended" notice at top directing users to curated versions

Users should now use the curated comprehensive datasheets which contain
complete, validated, and human-reviewed metadata for each project.
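
The sparse-source problem described above (a 42-line YAML rendered in place of a 438-line one) could be caught before rendering with a simple size check. The following is a minimal sketch, not part of this commit: the function names and the `MIN_LINES` threshold of 100 are hypothetical, chosen only so that it falls between the two line counts cited in the message.

```python
from pathlib import Path

# Hypothetical threshold: anywhere between the 42-line sparse file and the
# 438-line comprehensive file mentioned in the commit message would work.
MIN_LINES = 100


def count_content_lines(path: Path) -> int:
    """Count lines that are neither blank nor YAML comments."""
    with path.open(encoding="utf-8") as fh:
        return sum(
            1 for line in fh
            if line.strip() and not line.lstrip().startswith("#")
        )


def find_sparse_yaml(directory: Path, min_lines: int = MIN_LINES) -> list[tuple[str, int]]:
    """Return (filename, content_line_count) for each *.yaml file below min_lines."""
    sparse = []
    for path in sorted(directory.glob("*.yaml")):
        n = count_content_lines(path)
        if n < min_lines:
            sparse.append((path.name, n))
    return sparse
```

Calling `find_sparse_yaml(Path("some/yaml/dir"))` on a directory of datasheet sources (the path here is a placeholder) would list candidates to exclude from rendering. Line count is only a rough proxy for completeness; schema validation remains the stronger check.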
docs/d4d_examples.md (22 additions & 37 deletions)
@@ -2,44 +2,30 @@
 
 This page provides links to rendered Datasheet for Datasets (D4D) examples for Bridge2AI data generating projects.
 
+**Recommended**: Use the **Curated Comprehensive Datasheets** below for the most complete and accurate project metadata.
+
 ## Curated Comprehensive Datasheets
 
-These are the most comprehensive datasheets for each project, created through extensive AI-powered synthesis:
+These are the most comprehensive and authoritative datasheets for each project, created through extensive AI-powered synthesis with human oversight and validation:
@@ -60,19 +46,18 @@ These datasheets were created from specific dataset metadata sources:
 ## About the Datasheets
 
 ### Curated Comprehensive Datasheets
-The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project, created through extensive AI-powered synthesis of multiple data sources and documentation. These files include both human-readable HTML renderings and downloadable YAML source files.
+The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project. These were created through:
 
-### GPT-5 Synthesized Datasheets
-The **GPT-5 Synthesized Datasheets** were created by:
-1. Concatenating multiple project-related documents in reproducible order
-2. Processing with GPT-5 to extract and synthesize D4D metadata
-3. Validating against the LinkML schema
-4. Rendering to human-readable HTML format
+1. Automated extraction of metadata from multiple data sources and documentation using AI
+2. Human oversight and validation by domain experts
+3. Iterative refinement to ensure completeness and accuracy
+4. Validation against the LinkML schema
+5. Rendering to multiple formats (human-readable HTML, technical LinkML HTML, and YAML)
 
-These provide automated comprehensive project-level metadata and include both HTML views and downloadable YAML files.
+These datasheets provide comprehensive project-level metadata including dataset motivation, composition, collection processes, preprocessing, recommended uses, distribution, maintenance, ethics, and governance.
 
 ### Individual Dataset Datasheets
-The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet).
+The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet). These focus on individual dataset instances rather than project-level metadata.
src/docs/d4d_examples.md (24 additions & 86 deletions)
@@ -2,68 +2,30 @@
 
 This page provides links to rendered Datasheet for Datasets (D4D) examples for Bridge2AI data generating projects.
 
+**Recommended**: Use the **Curated Comprehensive Datasheets** below for the most complete and accurate project metadata.
+
 ## Curated Comprehensive Datasheets
 
-These are the most comprehensive datasheets for each project, created through extensive AI-powered synthesis:
+These are the most comprehensive and authoritative datasheets for each project, created through extensive AI-powered synthesis with human oversight and validation:
@@ -84,42 +46,18 @@ These datasheets were created from specific dataset metadata sources:
 ## About the Datasheets
 
 ### Curated Comprehensive Datasheets
-The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project, created through extensive AI-powered synthesis of multiple data sources and documentation. These files include both human-readable HTML renderings and downloadable YAML source files.
-
-### GPT-5 Synthesized Datasheets
-The **GPT-5 Synthesized Datasheets** were created by:
-1. Concatenating multiple project-related documents in reproducible order
-2. Processing with GPT-5 to extract and synthesize D4D metadata
-3. Validating against the LinkML schema
-4. Rendering to human-readable HTML format
-
-These provide automated comprehensive project-level metadata and include both HTML views and downloadable YAML files.
-
-### Claude Code Synthesized Datasheets (Deterministic)
-The **Claude Code Synthesized Datasheets** are generated with **deterministic settings** for reproducibility:
-1. **Temperature=0.0**: Eliminates randomness in model responses
-2. **Pinned model version**: `claude-sonnet-4-5-20250929` prevents changes from model updates
-3. **Version-controlled prompts**: Stored in external files tracked in git
-5. **Comprehensive metadata**: Each YAML includes a metadata file tracking all generation parameters
-
-**Key Features:**
-- Reproducible: Running twice on same input produces identical output
-- Traceable: Complete provenance tracking via metadata files
-- Comparable: Can meaningfully compare with GPT-5 outputs
-- Transparent: All prompts and settings version-controlled and documented
-
-**Metadata Files** contain:
-- SHA-256 hashes of input file, schema, and prompts
-- Model settings (temperature, max_tokens)
-- Processing environment details
-- Git commit hash for provenance
-- Reproducibility command
-
-See [DETERMINISM.md](https://github.com/bridge2ai/data-sheets-schema/blob/main/DETERMINISM.md) for complete details on the deterministic approach.
+The **Curated Comprehensive Datasheets** represent the most complete and authoritative metadata for each project. These were created through:
+
+1. Automated extraction of metadata from multiple data sources and documentation using AI
+2. Human oversight and validation by domain experts
+3. Iterative refinement to ensure completeness and accuracy
+4. Validation against the LinkML schema
+5. Rendering to multiple formats (human-readable HTML, technical LinkML HTML, and YAML)
+
+These datasheets provide comprehensive project-level metadata including dataset motivation, composition, collection processes, preprocessing, recommended uses, distribution, maintenance, ethics, and governance.
 
 ### Individual Dataset Datasheets
-The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet).
+The **Individual Dataset Datasheets** provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet). These focus on individual dataset instances rather than project-level metadata.