Commit 5feb0c3

realmarcin and claude committed
Improve D4D agent schema interaction instructions
Based on root cause analysis of VOICE D4D validation failures, enhanced all D4D generation and editing instructions to prevent semantic field name invention and ensure strict schema compliance.

Key improvements:
- Added requirement to read reference examples FIRST before generation
- Documented common field name mistakes with wrong/correct examples
- Emphasized {id, description} pattern used by most D4D classes
- Made validation non-skippable with clear error handling guidance
- Added verification checklist for post-generation validation
- Highlighted that most validation failures are due to invented field names

Affected files:
- .claude/commands/d4d-agent.md: Enhanced generation process workflow
- .github/workflows/d4d_assistant_create.md: Added schema study section
- .github/workflows/d4d_assistant_edit.md: Added field name verification

Prevents issue where agents invent semantic field names (purpose_description, creator_name, subset_name) instead of using actual schema fields (id, description). This was the root cause of 50+ validation errors in the original VOICE concatenated D4D file.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
1 parent b1921c6 commit 5feb0c3

3 files changed: +195 -28 lines changed

.claude/commands/d4d-agent.md

Lines changed: 64 additions & 7 deletions
@@ -59,13 +59,70 @@ Extract these key elements from source documents:
5959
For each project (AI_READI, CM4AI, VOICE, CHORUS):
6060

6161
1. **Launch Task agents in parallel** using Task tool with subagent_type='general-purpose'
62-
2. **Read source documents** from preprocessed locations
63-
3. **Read schema** from src/data_sheets_schema/schema/data_sheets_schema_all.yaml
64-
4. **Extract metadata** using the checklist above
65-
5. **Generate valid YAML** conforming to schema
66-
6. **Validate schema compliance** with: poetry run linkml-validate -s src/data_sheets_schema/schema/data_sheets_schema_all.yaml -C Dataset <file>
67-
7. **Validate ontology terms** with: poetry run linkml-term-validator validate-data <file> --schema src/data_sheets_schema/schema/data_sheets_schema_all.yaml
68-
8. **Save** to output location
62+
63+
2. **Read reference examples FIRST** (Critical for understanding correct structure):
64+
- Read validated examples: `data/d4d_concatenated/claudecode_agent/AI_READI_d4d.yaml`
65+
- Study how `Purpose`, `Task`, `AddressingGap`, `Creator`, `FundingMechanism`, etc. are structured
66+
- Note: These classes use simple `{id, description}` pattern, NOT semantic field names
67+
68+
3. **Read schema and extract field definitions**:
69+
- Path: `src/data_sheets_schema/schema/data_sheets_schema_all.yaml`
70+
- For each class you'll use (Purpose, Task, Creator, etc.), extract EXACT field names
71+
- **Critical**: Do NOT invent field names based on semantics
72+
73+
4. **Common Field Name Mistakes to AVOID**:
74+
```yaml
75+
# ❌ WRONG - Semantic field names (not in schema)
76+
purposes:
77+
- purpose_description: "..."
78+
tasks:
79+
- task_description: "..."
80+
tags: [...]
81+
creators:
82+
- creator_name: "John Doe"
83+
creator_role: "PI"
84+
creator_affiliation: "University"
85+
86+
# ✅ CORRECT - Schema field names
87+
purposes:
88+
- id: project:purpose:1
89+
description: "..."
90+
tasks:
91+
- id: project:task:1
92+
description: "..."
93+
creators:
94+
- id: project:creator:1
95+
description: "John Doe, Principal Investigator, University"
96+
```
97+
98+
5. **Read source documents** from preprocessed locations
99+
100+
6. **Extract metadata** using the checklist above
101+
102+
7. **Generate valid YAML** conforming to schema:
103+
- Use ONLY field names found in schema
104+
- Include required `id` fields for all objects
105+
- Merge multi-part information into single `description` strings
106+
- Follow reference examples for structure
107+
108+
8. **REQUIRED validation** (NON-SKIPPABLE):
109+
```bash
110+
poetry run linkml-validate -s src/data_sheets_schema/schema/data_sheets_schema_all.yaml -C Dataset <file>
111+
```
112+
- If validation fails: analyze errors, fix field names, re-validate
113+
- DO NOT proceed without passing validation
114+
115+
9. **Validate ontology terms**:
116+
```bash
117+
poetry run linkml-term-validator validate-data <file> --schema src/data_sheets_schema/schema/data_sheets_schema_all.yaml
118+
```
119+
120+
10. **Verify output**:
121+
- Check file has comprehensive content (should be 1000+ lines for concatenated)
122+
- Confirm all major sections populated (purposes, tasks, creators, etc.)
123+
- Verify no invented field names used
124+
125+
11. **Save** to output location
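
One way to make step 8 genuinely non-skippable is to wrap the validator call in a small gate that raises before any later step can run. The sketch below is only an illustration: the command line is the one quoted in step 8, while the helper names `build_validate_command` and `validate_or_stop` are hypothetical, and it assumes the validator signals failure through a non-zero exit code.

```python
import subprocess
from typing import Callable, List, Sequence

# Schema path as quoted in the workflow above.
SCHEMA = "src/data_sheets_schema/schema/data_sheets_schema_all.yaml"


def build_validate_command(data_file: str, schema: str = SCHEMA,
                           target_class: str = "Dataset") -> List[str]:
    """Assemble the linkml-validate invocation shown in step 8."""
    return ["poetry", "run", "linkml-validate",
            "-s", schema, "-C", target_class, data_file]


def validate_or_stop(command: Sequence[str],
                     run: Callable = subprocess.run) -> None:
    """Run the validator and raise on failure so later steps cannot proceed.

    Assumption: a non-zero exit code means validation failed.
    The `run` parameter exists only so the gate can be exercised without poetry.
    """
    result = run(list(command), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            "Validation failed; fix field names and re-validate:\n"
            + (result.stdout or "") + (result.stderr or "")
        )
```

Calling `validate_or_stop(build_validate_command("<file>"))` before the save step would then turn "DO NOT proceed without passing validation" into an exception rather than a convention.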
69126

70127
## Merging Multiple Sources
71128

.github/workflows/d4d_assistant_create.md

Lines changed: 108 additions & 20 deletions
@@ -84,28 +84,99 @@ This workflow is triggered when a user requests creation of a new D4D datasheet,
8484

8585
## Step-by-Step Process
8686

87-
### 1. Load the D4D Schema
87+
### 1. Study Schema Structure and Reference Examples
88+
89+
**CRITICAL**: Before generating ANY D4D YAML, you MUST understand the exact field names used by each schema class.
90+
91+
#### 1a. Read Reference Examples FIRST
8892

8993
```bash
9094
# Ensure the full merged schema is available
9195
make full-schema
9296
```
9397

98+
**Read validated reference examples:**
99+
- `data/d4d_concatenated/claudecode_agent/AI_READI_d4d.yaml` - Comprehensive validated example
100+
- `data/d4d_concatenated/claudecode_agent/CHORUS_d4d.yaml` - Another validated example
101+
102+
**What to observe:**
103+
- How `purposes`, `tasks`, `addressing_gaps`, `creators`, `funders` are structured
104+
- Field naming patterns: Most classes use simple `{id, description}` structure
105+
- How multi-part information is merged into single `description` strings
106+
- Proper use of `id` fields with namespace prefixes (e.g., `project:creator:1`)
107+
108+
#### 1b. Read the Schema and Extract Field Definitions
109+
94110
**Schema Reference:**
95111
- Read the complete schema from: `src/data_sheets_schema/schema/data_sheets_schema_all.yaml`
96112
- This contains ALL D4D classes, slots, and enums in a single file
97113
- Use this schema as the authoritative reference for structure and valid values
98114

115+
**For each class you'll use, extract EXACT field names:**
116+
- Search for `class Purpose:`, `class Task:`, `class Creator:`, etc.
117+
- Note which fields are required vs optional
118+
- Identify field types (string, integer, enum, etc.)
119+
- Check for multivalued fields (lists)
120+
121+
#### 1c. Common Field Name Mistakes to AVOID
122+
123+
**Problem**: Agents often invent semantic field names that "make sense" but aren't in the schema.
124+
125+
```yaml
126+
# ❌ WRONG - Invented semantic field names (validation will FAIL)
127+
purposes:
128+
- purpose_description: "To enable AI research..." # Field doesn't exist!
129+
tasks:
130+
- task_description: "Disease screening" # Field doesn't exist!
131+
tags: [screening, diagnosis] # Field doesn't exist!
132+
creators:
133+
- creator_name: "Dr. Jane Smith" # Field doesn't exist!
134+
creator_role: "Principal Investigator" # Field doesn't exist!
135+
creator_affiliation: "Stanford University" # Field doesn't exist!
136+
funders:
137+
- funder_name: "NIH" # Field doesn't exist!
138+
grant_number: "R01-123456" # Field doesn't exist!
139+
funding_amount: 500000 # Field doesn't exist!
140+
subsets:
141+
- subset_name: "Training set" # Field doesn't exist!
142+
subset_description: "80% of data" # Field doesn't exist!
143+
# Missing required 'id' field!
144+
145+
# ✅ CORRECT - Actual schema field names
146+
purposes:
147+
- id: dataset:purpose:1
148+
description: "To enable AI research..." # Simple description string
149+
tasks:
150+
- id: dataset:task:1
151+
description: "Disease screening for five disease categories" # All info in description
152+
creators:
153+
- id: dataset:creator:1
154+
description: "Dr. Jane Smith, Principal Investigator, Stanford University" # Merged into description
155+
funders:
156+
- id: dataset:funder:1
157+
name: "NIH" # FundingMechanism has name AND description
158+
description: "NIH grant R01-123456, total funding $500,000" # Details in description
159+
subsets:
160+
- id: dataset:subset:training # Required!
161+
name: "Training set" # DataSubset has name field
162+
description: "80% of data for model training"
163+
```
164+
165+
**Key Pattern**: Most D4D classes use a minimal `{id, description}` structure where:
166+
- `id` is required and should use namespaced format (e.g., `project:type:identifier`)
167+
- `description` contains all the details in natural language
168+
- Multi-part information is merged into the `description` string
169+
99170
**Key Schema Sections:**
100171
- `id` (required) - Unique identifier for the dataset
101172
- `name` (required) - Dataset name
102-
- `motivation` - Purpose and creation motivation (Motivation class)
103-
- `composition` - What the dataset contains (Composition class)
104-
- `collection_process` - How data was collected (CollectionProcess class)
105-
- `preprocessing` - Cleaning and preprocessing steps (Preprocessing class)
106-
- `uses` - Recommended and unsuitable uses (Uses class)
107-
- `distribution` - How dataset is distributed (Distribution class)
108-
- `maintenance` - Update and maintenance plans (Maintenance class)
173+
- `purposes` - List of Purpose objects, each with `{id, description}`
174+
- `tasks` - List of Task objects, each with `{id, description}`
175+
- `addressing_gaps` - List of AddressingGap objects, each with `{id, description}`
176+
- `creators` - List of Creator objects, each with `{id, description}`
177+
- `funders` - List of FundingMechanism objects with `{id, name, description}`
178+
- `instances` - List of Instance objects, each with `{id, description}`
179+
- `subsets` - List of DataSubset objects (inherits from Dataset, requires `id`)
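
The "merge multi-part information into `description`" pattern described above can be sketched as a small helper. This is a minimal illustration, not part of the repository: `make_record` is a hypothetical name, and the comma-joined format simply mirrors the examples in this diff.

```python
from typing import Dict, Iterable


def make_record(prefix: str, kind: str, index: int,
                parts: Iterable[str]) -> Dict[str, str]:
    """Build an {id, description} record from several pieces of information.

    The id follows the namespaced `prefix:kind:index` shape used in the examples;
    empty or whitespace-only parts are skipped before joining.
    """
    description = ", ".join(p.strip() for p in parts if p and p.strip())
    return {"id": f"{prefix}:{kind}:{index}", "description": description}


creator = make_record(
    "dataset", "creator", 1,
    ["Dr. Jane Smith", "Principal Investigator", "Stanford University"],
)
print(creator)
```

Keeping name, role and affiliation as inputs to one function, rather than as separate output keys, makes it harder to emit invented fields such as `creator_name` by accident.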
109180

110181
### 2. Gather Source Content
111182

@@ -199,36 +270,53 @@ poetry run linkml-validate -s src/data_sheets_schema/schema/data_sheets_schema_a
199270

200271
**Common Validation Errors and Fixes:**
201272

202-
1. **Missing Required Field**
273+
1. **Unknown/Invented Field Names** (MOST COMMON ERROR)
274+
```
275+
Error: Additional properties are not allowed ('purpose_description' was unexpected)
276+
Error: Additional properties are not allowed ('creator_name', 'creator_role' were unexpected)
277+
Error: Additional properties are not allowed ('subset_name', 'subset_description' were unexpected)
203278
```
204-
Error: 'id' is a required property
279+
**Root Cause**: You invented semantic field names instead of using schema field names
280+
281+
**Fix**:
282+
- Read reference examples to see correct field structure
283+
- Most classes use `{id, description}` pattern
284+
- Replace invented fields with schema-defined fields:
285+
```yaml
286+
# ❌ WRONG
287+
- creator_name: "Jane Smith"
288+
creator_role: "PI"
289+
290+
# ✅ CORRECT
291+
- id: project:creator:1
292+
description: "Jane Smith, Principal Investigator"
293+
```
294+
295+
2. **Missing Required Field**
205296
```
206-
**Fix**: Add the missing required field (`id` and `name` are always required)
297+
Error: 'id' is a required property in /purposes/0
298+
Error: 'id' is a required property in /subsets/0
299+
```
300+
**Fix**: Add the required `id` field with namespaced format (e.g., `project:purpose:1`)
207301
208-
2. **Invalid Enum Value**
302+
3. **Invalid Enum Value**
209303
```
210304
Error: 'SomeValue' is not one of ['ValidValue1', 'ValidValue2']
211305
```
212306
**Fix**: Check the schema for valid enum values and use one from the allowed list
213307
214-
3. **Wrong Data Type**
308+
4. **Wrong Data Type**
215309
```
216310
Error: 'string_value' is not of type 'integer'
217311
```
218312
**Fix**: Convert the value to the correct type (e.g., change "1000" to 1000 for integers)
219313
220-
4. **Invalid YAML Syntax**
314+
5. **Invalid YAML Syntax**
221315
```
222316
Error: mapping values are not allowed here
223317
```
224318
**Fix**: Check indentation, quotes, and YAML structure
225319
226-
5. **Unknown Field**
227-
```
228-
Error: Additional properties are not allowed ('unknown_field' was unexpected)
229-
```
230-
**Fix**: Remove the field or check if you're using the correct field name from the schema
231-
232320
**If Validation Fails:**
233321
1. Read the error message carefully to identify the issue
234322
2. Check the schema file to understand correct structure: `src/data_sheets_schema/schema/data_sheets_schema_all.yaml`
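
When the validator reports many "Additional properties" errors at once, it can help to collect the offending names before fixing them. The sketch below assumes the message format quoted in the error examples above; real validator output may differ between versions, and `invented_fields` is a hypothetical helper name.

```python
import re
from typing import List

# Matches the message shape quoted in the error examples above (an assumption
# about the validator's wording, not a documented format).
_UNEXPECTED = re.compile(
    r"Additional properties are not allowed \((.+?) (?:was|were) unexpected\)"
)


def invented_fields(validator_output: str) -> List[str]:
    """Return the unexpected field names, in first-seen order, without duplicates."""
    names: List[str] = []
    for match in _UNEXPECTED.finditer(validator_output):
        for name in re.findall(r"'([^']+)'", match.group(1)):
            if name not in names:
                names.append(name)
    return names
```

The resulting list is a to-do list of keys to replace with schema-defined fields, typically `id` and `description`.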

.github/workflows/d4d_assistant_edit.md

Lines changed: 23 additions & 1 deletion
@@ -123,23 +123,45 @@ cat <path-to-datasheet>.yaml
123123
- **Add list items**: Append to multivalued fields
124124
- **Update from new source**: Extract additional metadata from new URLs/documents
125125

126-
### 3. Load Schema for Reference
126+
### 3. Load Schema and Verify Field Names
127127

128128
```bash
129129
# Ensure full schema is available
130130
make full-schema
131131
```
132132

133+
**CRITICAL**: Before making edits, verify you're using correct schema field names.
134+
135+
**Read Reference Examples:**
136+
- `data/d4d_concatenated/claudecode_agent/AI_READI_d4d.yaml` - Validated example structure
137+
- Compare existing datasheet structure with reference examples
138+
- Note field naming patterns for classes you'll modify
139+
133140
**Verify Schema Constraints:**
134141
- Read `src/data_sheets_schema/schema/data_sheets_schema_all.yaml`
135142
- Check field definitions for the sections being edited
143+
- Extract EXACT field names for classes you'll use (Purpose, Task, Creator, etc.)
136144
- Verify slot constraints:
137145
- Is the field required or optional?
138146
- Is it multivalued (list)?
139147
- What is the expected range/type?
140148
- Are there enum constraints?
141149
- Understand class relationships for nested objects
142150

151+
**Common Field Name Mistakes to AVOID When Editing:**
152+
```yaml
153+
# ❌ WRONG - Invented semantic field names
154+
purposes:
155+
- purpose_description: "..." # Field doesn't exist!
156+
157+
# ✅ CORRECT - Schema field names
158+
purposes:
159+
- id: project:purpose:1
160+
description: "..." # Use 'description' field
161+
```
162+
163+
**Key Pattern**: Most D4D classes use `{id, description}` structure. Don't invent field names like `purpose_description`, `creator_name`, `subset_name`, etc.
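
A quick pre-edit check for invented field names could look like the sketch below. It operates on an already-parsed datasheet (a plain dictionary) and uses a hypothetical allow-list, `ALLOWED`, copied from the examples in this diff rather than read from the schema; the real classes may permit additional fields, so the schema file remains the authority.

```python
from typing import Dict, List, Set

# Hypothetical allow-list taken from the examples in this diff, NOT from the schema.
ALLOWED: Dict[str, Set[str]] = {
    "purposes": {"id", "description"},
    "tasks": {"id", "description"},
    "creators": {"id", "description"},
    "funders": {"id", "name", "description"},
}


def find_problems(datasheet: Dict[str, list]) -> List[str]:
    """List unknown keys and missing ids in the sections covered by ALLOWED."""
    problems: List[str] = []
    for section, allowed in ALLOWED.items():
        for i, record in enumerate(datasheet.get(section, [])):
            for key in sorted(set(record) - allowed):
                problems.append(f"{section}[{i}]: unknown field '{key}'")
            if "id" not in record:
                problems.append(f"{section}[{i}]: missing required 'id'")
    return problems
```

An empty result does not replace running the validator; it only catches the most common mistake earlier.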
164+
143165
### 4. Make Edits
144166

145167
**Edit Guidelines:**
