Commit 597e8d1

Merge pull request #11 from monarch-initiative/dashboard
Add QC dashboard and fix fabricated evidence references
2 parents 9e96811 + e110916 commit 597e8d1

52 files changed, +5573 -396 lines changed

CLAUDE.md

Lines changed: 86 additions & 1 deletion
@@ -136,7 +136,92 @@ uv run runoak -i sqlite:obo:maxo search "physical therapy"
136136
## Testing
137137

138138
Tests are in `tests/test_data.py`:
139-
- Schema validation for all 55 disorder files
139+
- Schema validation for all 56 disorder files
140140
- Required field checks
141141
- Evidence reference validation
142142
- Unique name verification
143+
144+
## Standard Operating Procedure: Adding/Editing Evidence
145+
146+
When adding or editing evidence items in disorder files, follow this SOP to prevent hallucinated (fabricated) evidence:
147+
148+
### 1. Never Fabricate Snippets
149+
150+
Evidence snippets MUST be exact quotes from the cited paper's abstract. Do not paraphrase.
151+
152+
**Wrong:**
153+
```yaml
154+
evidence:
155+
- reference: PMID:12345678
156+
  snippet: The study showed that X causes Y through Z mechanism. # Paraphrase - will fail validation
157+
```
158+
159+
**Correct:**
160+
```yaml
161+
evidence:
162+
- reference: PMID:12345678
163+
  snippet: "X causes Y through the Z mechanism, as demonstrated by..." # Exact quote from abstract
164+
```
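
The exact-quote rule can be illustrated with a minimal Python sketch of a verbatim substring check. This is not the project's reference validator: the function names are hypothetical, the abstract text is invented for illustration, and whitespace normalization is an assumption about how line wrapping might be handled.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not affect matching."""
    return re.sub(r"\s+", " ", text).strip()


def snippet_in_abstract(snippet: str, abstract: str) -> bool:
    """Return True only if the (non-empty) snippet occurs verbatim in the abstract."""
    needle = normalize(snippet)
    return bool(needle) and needle in normalize(abstract)


# Invented abstract text, used only to exercise the check.
abstract = (
    "We show that X causes Y\n"
    "through the Z mechanism, as demonstrated by knockout studies."
)
exact = "X causes Y through the Z mechanism"
paraphrase = "The study showed that X causes Y through Z mechanism."

print(snippet_in_abstract(exact, abstract))       # True
print(snippet_in_abstract(paraphrase, abstract))  # False
```

A paraphrase fails even when it is factually equivalent, which is the behavior the SOP relies on.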
165+
166+
### 2. Verify PMIDs Before Use
167+
168+
Always check that a PMID actually corresponds to the paper you think it does:
169+
170+
```bash
171+
# Check cached abstract (if previously fetched)
172+
cat references_cache/pmid_12345678.md
173+
174+
# Or fetch fresh and validate
175+
just validate-references kb/disorders/MyDisease.yaml
176+
```
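
As a small sketch of the cache lookup above, the following Python helper maps a PMID CURIE to the cache file name pattern shown in the `cat` command (`references_cache/pmid_<id>.md`). The helper name `cache_path_for` is hypothetical, and the sketch only builds the path; it assumes nothing about the contents of the cached file.

```python
from pathlib import PurePosixPath


def cache_path_for(curie: str, cache_dir: str = "references_cache") -> PurePosixPath:
    """Map 'PMID:12345678' to 'references_cache/pmid_12345678.md'."""
    prefix, _, local_id = curie.partition(":")
    if prefix != "PMID" or not local_id.isdigit():
        raise ValueError(f"not a PMID CURIE: {curie!r}")
    return PurePosixPath(cache_dir) / f"pmid_{local_id}.md"


print(cache_path_for("PMID:12345678"))  # references_cache/pmid_12345678.md
```

Opening the resulting file and reading the title is a quick way to confirm that the PMID refers to the intended paper.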
177+
178+
### 3. Validation Workflow
179+
180+
Before committing changes to any disorder file:
181+
182+
```bash
183+
# 1. Schema validation (structure correct)
184+
just validate kb/disorders/MyDisease.yaml
185+
186+
# 2. Reference validation (snippets match abstracts)
187+
just validate-references kb/disorders/MyDisease.yaml
188+
189+
# 3. Term validation (ontology IDs/labels correct)
190+
just validate-terms-file kb/disorders/MyDisease.yaml
191+
```
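
The three steps above could be chained in a small wrapper that stops at the first failure. The Python sketch below is an assumption about how one might script this, not something shipped in the repository; only the three `just` recipe names come from the workflow above, and the function names and the injectable `runner` parameter are hypothetical.

```python
import subprocess


def validation_commands(path: str) -> list[list[str]]:
    """Return the three validation commands, in the order the SOP lists them."""
    return [
        ["just", "validate", path],             # 1. schema validation
        ["just", "validate-references", path],  # 2. reference validation
        ["just", "validate-terms-file", path],  # 3. term validation
    ]


def run_validation(path: str, runner=subprocess.run) -> bool:
    """Run each command in turn; stop and report at the first non-zero exit."""
    for cmd in validation_commands(path):
        if runner(cmd).returncode != 0:
            print("FAILED:", " ".join(cmd))
            return False
    return True
```

Calling `run_validation("kb/disorders/MyDisease.yaml")` from the repository root would then return `True` only when all three checks pass.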
192+
193+
### 4. When Evidence Cannot Be Verified
194+
195+
If a claim is well-established but you cannot find a quotable snippet:
196+
197+
- **Option A**: Move the claim to the `notes` field (no evidence required)
198+
- **Option B**: Find a different paper with a quotable abstract
199+
- **Option C**: Remove the evidence block entirely, keep the description
200+
201+
**Do NOT** fabricate quotes or use incorrect PMIDs.
202+
203+
### 5. Common Validation Errors
204+
205+
| Error | Cause | Fix |
206+
|-------|-------|-----|
207+
| "Text part not found as substring" | Snippet is paraphrased | Use exact quote from abstract |
208+
| "Reference not found" | PMID doesn't exist | Verify PMID on PubMed |
209+
| Low similarity score | Wrong PMID for the paper | Check abstract matches topic |
210+
211+
### 6. Running Full QC
212+
213+
```bash
214+
# All validation checks
215+
just qc
216+
217+
# Compliance analysis (recommended field coverage)
218+
just compliance-all
219+
220+
# With weighted scoring and threshold checks
221+
just compliance-weighted
222+
223+
# Generate visual dashboard (dashboard/index.html)
224+
just gen-dashboard
225+
```
226+
227+
The dashboard highlights priority curation targets: the 10 files with the lowest compliance scores.
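
Selecting the lowest-scoring files can be sketched in a few lines of Python. The file names and scores below are made up for illustration, and the dictionary is a hypothetical stand-in; the real layout of the dashboard's report data is not shown here.

```python
import heapq


def priority_targets(scores: dict[str, float], n: int = 10) -> list[str]:
    """Return the n file names with the lowest compliance scores, lowest first."""
    return heapq.nsmallest(n, scores, key=scores.get)


# Made-up scores for three hypothetical files.
example_scores = {"A.yaml": 0.9, "B.yaml": 0.2, "C.yaml": 0.5}
print(priority_targets(example_scores, n=2))  # ['B.yaml', 'C.yaml']
```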

NOTES.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
1+
# Development Notes
2+
3+
## 2025-12-08
4+
5+
### linkml-data-qc Now on PyPI
6+
7+
Updated project to use `linkml-data-qc` from PyPI instead of local development install:
8+
- Removed local path override from `pyproject.toml`
9+
- Added `[viz]` extras for dashboard visualization features
10+
- Version 0.1.0 installed with matplotlib, seaborn, pillow dependencies
11+
12+
### QC Dashboard
13+
14+
Added new `just gen-dashboard` target that generates a visual HTML dashboard:
15+
- Uses `linkml-data-qc --dashboard-dir dashboard/`
16+
- Creates `dashboard/index.html` with charts and tables
17+
- Shows slot compliance comparison across all 56 disorder files
18+
- Highlights the 10 lowest-compliance files as priority curation targets
19+
- Includes detailed per-file charts for priority files
20+
21+
Dashboard contents:
22+
- `index.html` - Main dashboard page
23+
- `comparison.png` - Slot compliance bar chart
24+
- `detail_*.png` - Per-file heatmaps for priority files
25+
- `reports.json` - Raw report data
26+
27+
### Reference Validation Findings
28+
29+
Ran comprehensive reference validation (`just validate-references-all`) and discovered significant issues with fabricated evidence snippets in several Mendelian disease files.
30+
31+
#### Key Issues Found
32+
33+
1. **Fabricated snippets**: Evidence snippets were AI-generated paraphrases rather than actual quotes from cited papers. The reference validator correctly flagged these with low similarity scores (0-37%).
34+
35+
2. **Wrong PMIDs**: Several PMIDs pointed to completely unrelated papers:
36+
- `PMID:30084541` in Dravet_syndrome.yaml was about "Black Phosphorus Nanosheets Passivation Using a Tripeptide" - not Dravet syndrome
37+
- `PMID:22267103` was about "How to use insulin-like growth factor 1 (IGF1)" - not Dravet syndrome
38+
- `PMID:34812478` was about "catastrophic natural disasters impact on arts nonprofits" - not Dravet syndrome
39+
- `PMID:31428203` in Fanconi_Anemia.yaml was about "insulin-glucose metabolism in diabetic mice" - not Fanconi anemia
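
To give a rough sense of why a wrong PMID yields a very low score, here is a minimal Python sketch of a token-overlap measure. How the reference validator actually computes its similarity score is not stated in these notes, so this metric is an assumption chosen for simplicity; the snippet text is hypothetical, while the paper title is the one reported above.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_percent(snippet: str, reference_text: str) -> float:
    """Percentage of distinct snippet words that also occur in the reference text."""
    words = tokens(snippet)
    if not words:
        return 0.0
    return 100 * len(words & tokens(reference_text)) / len(words)


# Title of the unrelated paper, as reported in the list above.
wrong_paper = "Black Phosphorus Nanosheets Passivation Using a Tripeptide"
# Hypothetical snippet, for illustration only.
snippet = "Dravet syndrome causes severe epilepsy"

print(overlap_percent(snippet, wrong_paper))  # 0.0
```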
40+
41+
#### Files Fixed
42+
43+
**Fanconi_Anemia.yaml:**
44+
- Replaced 5 fabricated snippets with real quotes from PMID:35596788 (Peake & Noguchi 2022 review) and PMID:20301575 (GeneReviews)
45+
- Removed 8 unverifiable evidence items, converted claims to `notes` fields
46+
- Quotes now use exact text from abstracts (with proper YAML quoting for colons)
47+
48+
**Dravet_syndrome.yaml:**
49+
- Removed all evidence citing wrong PMIDs (30084541, 22267103, 34812478)
50+
- Used only PMID:21463282 (Oakley et al. 2011 "Insights into pathophysiology and therapy from a mouse model of Dravet syndrome")
51+
- Added 4 verified quotes from that paper
52+
- Moved unverifiable claims to `notes` fields
53+
54+
#### Lessons Learned
55+
56+
1. **Always validate references**: The reference validator is essential for catching AI hallucinations. Run `just validate-references <file>` on each edited disorder file before committing evidence items.
57+
58+
2. **Use actual quotes**: Snippets must be exact quotes from abstracts, not paraphrases. The validator checks substring matching.
59+
60+
3. **Verify PMIDs independently**: Don't trust that a PMID is correct - check the cached abstract in `references_cache/pmid_*.md` or fetch it fresh.
61+
62+
4. **When in doubt, use notes**: If a claim is well-established but you can't find a quotable snippet, put it in `notes` rather than fabricating evidence.
63+
64+
### Compliance Analysis
65+
66+
Ran `just compliance-weighted` with the QC config:
67+
68+
- **Global compliance**: 56.1%
69+
- **Weighted compliance**: 75.3%
70+
- **Term coverage**: 93.0%
71+
- **Evidence coverage**: 77.7%
72+
- **Description coverage**: 26.4%
73+
74+
All critical paths meet their thresholds:
75+
- `phenotypes[].phenotype_term.term`: 99.5% (threshold: 90%)
76+
- `disease_term.term`: 98.2% (threshold: 95%)
77+
- `pathophysiology[].cell_types[].term`: 100% (threshold: 85%)
78+
- `treatments[].treatment_term.term`: 100% (threshold: 80%)
79+
80+
Violations are confined to sparsely populated data paths (`locations`, `chemical_entities`, `pathways`), which indicates areas for future data enrichment rather than configuration problems.
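
The threshold comparison for the critical paths can be sketched as follows. The path names, measured coverage values, and thresholds are the ones listed in these notes; the data structure and the `violations` helper are hypothetical and do not reflect the QC tool's actual config format.

```python
# (measured coverage %, threshold %) per critical path, as reported in the notes.
critical_paths = {
    "phenotypes[].phenotype_term.term": (99.5, 90.0),
    "disease_term.term": (98.2, 95.0),
    "pathophysiology[].cell_types[].term": (100.0, 85.0),
    "treatments[].treatment_term.term": (100.0, 80.0),
}


def violations(paths: dict[str, tuple[float, float]]) -> list[str]:
    """Return the paths whose measured coverage falls below their threshold."""
    return [path for path, (measured, threshold) in paths.items() if measured < threshold]


print(violations(critical_paths))  # []
```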

README.md

Lines changed: 47 additions & 6 deletions
@@ -4,7 +4,7 @@ A curated knowledge base of disease pathophysiology, with structured evidence fr
44

55
## Browse the Knowledge Base
66

7-
**[View all disorders online](https://monarch-initiative.github.io/dismech/disorders/)**
7+
**[View all disorders online](https://monarch-initiative.github.io/dismech/disorders/)** | **[QC Dashboard](https://monarch-initiative.github.io/dismech/dashboard/)**
88

99
Each disorder page includes:
1010
- Disease mechanisms and pathophysiology
@@ -68,15 +68,56 @@ All claims must cite PubMed references with exact quotes. This prevents misinfor
6868
6969
### Validation Pipeline
7070
71-
Multiple layers of automated validation:
72-
1. **Schema validation**: Ensures correct YAML structure
73-
2. **Ontology term validation**: Verifies term IDs exist and labels match authoritative sources
74-
3. **Reference validation**: Confirms quoted snippets appear in cited abstracts
71+
Multiple layers of automated validation ensure data quality and prevent AI hallucinations:
72+
73+
1. **Schema validation**: Ensures correct YAML structure against the LinkML schema
74+
2. **Ontology term validation**: Verifies term IDs exist and labels match authoritative sources (HPO, MONDO, GO, etc.)
75+
3. **Reference validation**: Confirms that quoted snippets actually appear in cited PubMed abstracts
76+
4. **Compliance analysis**: Measures coverage of recommended fields (descriptions, evidence, ontology terms)
77+
78+
```bash
79+
# Run schema + term validation
80+
just qc
81+
82+
# Validate a single file
83+
just validate kb/disorders/Asthma.yaml
84+
85+
# Validate references against PubMed abstracts
86+
just validate-references kb/disorders/Asthma.yaml
87+
88+
# Analyze compliance with recommended field coverage
89+
just compliance-all
90+
91+
# Compliance with weighted scoring and threshold checks
92+
just compliance-weighted
93+
```
94+
95+
### Why Reference Validation Matters
96+
97+
All evidence snippets must be **exact quotes** from paper abstracts, not paraphrases. The reference validator fetches abstracts from PubMed and checks that the quoted text appears verbatim. This catches:
98+
99+
- AI-generated paraphrases that don't match the actual paper
100+
- Wrong PMIDs (e.g., a PMID that points to an unrelated paper)
101+
- Fabricated citations
102+
103+
When validation fails, either fix the snippet to match the actual abstract or remove the evidence item.
104+
105+
### QC Dashboard
106+
107+
Generate a visual dashboard showing compliance metrics across all disorder files:
75108

76109
```bash
77-
just qc # Run all quality checks
110+
just gen-dashboard
78111
```
79112

113+
This creates `dashboard/index.html` with:
114+
- Summary metrics (files analyzed, average compliance, violations)
115+
- Slot compliance comparison chart
116+
- Detailed views of the 10 lowest-compliance files (priority curation targets)
117+
- Full table of all files sorted by compliance
118+
119+
View online: [QC Dashboard](https://monarch-initiative.github.io/dismech/dashboard/) or locally: `open dashboard/index.html`
120+
80121
### HTML Generation
81122

82123
YAML files are rendered to browsable HTML pages with clickable ontology term links.
