| 1 | +# Development Notes |
| 2 | + |
| 3 | +## 2025-12-08 |
| 4 | + |
| 5 | +### linkml-data-qc Now on PyPI |
| 6 | + |
| 7 | +Updated the project to use `linkml-data-qc` from PyPI instead of a local development install: |
| 8 | +- Removed local path override from `pyproject.toml` |
| 9 | +- Added `[viz]` extras for dashboard visualization features |
| 10 | +- Version 0.1.0 is installed, with `matplotlib`, `seaborn`, and `pillow` as dependencies |
| 11 | + |
| 12 | +### QC Dashboard |
| 13 | + |
| 14 | +Added a new `just gen-dashboard` target that generates a visual HTML dashboard: |
| 15 | +- Uses `linkml-data-qc --dashboard-dir dashboard/` |
| 16 | +- Creates `dashboard/index.html` with charts and tables |
| 17 | +- Shows slot compliance comparison across all 56 disorder files |
| 18 | +- Highlights the 10 lowest-compliance files as priority curation targets |
| 19 | +- Includes detailed per-file charts for priority files |
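Picking the priority curation targets amounts to taking the lowest-scoring files. The following is only a sketch of that selection step, not the code `linkml-data-qc` uses: the function name `lowest_compliance`, the mapping of file name to compliance fraction, and the example scores are all hypothetical.

```python
import heapq


def lowest_compliance(scores: dict[str, float], n: int = 10) -> list[tuple[str, float]]:
    """Return the n files with the lowest compliance, lowest first."""
    # The sort key is (score, name) so that ties resolve by file name.
    return heapq.nsmallest(n, scores.items(), key=lambda item: (item[1], item[0]))


# Hypothetical scores for illustration only; these are not real report values.
example = {"A.yaml": 0.82, "B.yaml": 0.41, "C.yaml": 0.67, "D.yaml": 0.41}
priority = lowest_compliance(example, n=2)
```

With `n=10` and one score per disorder file, the result would correspond to the ten files highlighted on the dashboard.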
| 20 | + |
| 21 | +Dashboard contents: |
| 22 | +- `index.html` - Main dashboard page |
| 23 | +- `comparison.png` - Slot compliance bar chart |
| 24 | +- `detail_*.png` - Per-file heatmaps for priority files |
| 25 | +- `reports.json` - Raw report data |
| 26 | + |
| 27 | +### Reference Validation Findings |
| 28 | + |
| 29 | +Ran comprehensive reference validation (`just validate-references-all`) and discovered significant issues with fabricated evidence snippets in several Mendelian disease files. |
| 30 | + |
| 31 | +#### Key Issues Found |
| 32 | + |
| 33 | +1. **Fabricated snippets**: Evidence snippets were AI-generated paraphrases rather than actual quotes from cited papers. The reference validator correctly flagged these with low similarity scores (0-37%). |
| 34 | + |
| 35 | +2. **Wrong PMIDs**: Several PMIDs pointed to completely unrelated papers: |
| 36 | + - `PMID:30084541` in Dravet_syndrome.yaml was about "Black Phosphorus Nanosheets Passivation Using a Tripeptide" - not Dravet syndrome |
| 37 | + - `PMID:22267103` was about "How to use insulin-like growth factor 1 (IGF1)" - not Dravet syndrome |
| 38 | + - `PMID:34812478` was about "catastrophic natural disasters impact on arts nonprofits" - not Dravet syndrome |
| 39 | + - `PMID:31428203` in Fanconi_Anemia.yaml was about "insulin-glucose metabolism in diabetic mice" - not Fanconi anemia |
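To illustrate why a paraphrase scores low while a real quote scores high, here is a rough sketch of a snippet-versus-abstract check. It is an assumption about how such a score could be computed, not the reference validator's actual algorithm; `normalize` and `snippet_support` are hypothetical names, and the sample text is a placeholder.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not affect matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def snippet_support(snippet: str, abstract: str) -> float:
    """Return 1.0 for a verbatim (normalized) quote, else a rough similarity in [0, 1]."""
    s, a = normalize(snippet), normalize(abstract)
    if not s:
        return 0.0
    if s in a:
        return 1.0
    # Fraction of the snippet's characters covered by matching blocks in the abstract.
    matcher = SequenceMatcher(None, s, a, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(s)


# Placeholder text, not taken from any cited paper.
abstract = "The quick brown fox jumps over the lazy dog."
exact = snippet_support("quick  brown\nfox", abstract)
paraphrase = snippet_support("A fast brown fox leaps", abstract)
```

An exact quote survives whitespace and case differences, whereas a paraphrase falls below 1.0 even when it shares several words with the abstract.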
| 40 | + |
| 41 | +#### Files Fixed |
| 42 | + |
| 43 | +**Fanconi_Anemia.yaml:** |
| 44 | +- Replaced 5 fabricated snippets with real quotes from PMID:35596788 (Peake & Noguchi 2022 review) and PMID:20301575 (GeneReviews) |
| 45 | +- Removed 8 unverifiable evidence items and converted their claims to `notes` fields |
| 46 | +- Quotes now use exact text from abstracts (with proper YAML quoting for colons) |
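A quote containing a colon followed by a space cannot be written as a plain YAML scalar, so it has to be quoted. One way to produce a safely quoted value, sketched here as an assumption rather than as the approach used in this repository, relies on the fact that a JSON string is also a valid YAML double-quoted scalar. The helper name `yaml_quote`, the `snippet` key, and the sample quote are hypothetical.

```python
import json


def yaml_quote(text: str) -> str:
    """Render text as a YAML double-quoted scalar.

    JSON strings are valid YAML double-quoted scalars, so json.dumps
    takes care of escaping embedded quotes and backslashes.
    """
    return json.dumps(text, ensure_ascii=False)


# Placeholder quote for illustration; the colon would break an unquoted scalar.
quote = 'Results: the "classic" triad was present'
line = f"snippet: {yaml_quote(quote)}"
```

The text is preserved character for character, which matters because the validator compares snippets against the cited abstract.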
| 47 | + |
| 48 | +**Dravet_syndrome.yaml:** |
| 49 | +- Removed all evidence citing wrong PMIDs (30084541, 22267103, 34812478) |
| 50 | +- Used only PMID:21463282 (Oakley et al. 2011 "Insights into pathophysiology and therapy from a mouse model of Dravet syndrome") |
| 51 | +- Added 4 verified quotes from that paper |
| 52 | +- Moved unverifiable claims to `notes` fields |
| 53 | + |
| 54 | +#### Lessons Learned |
| 55 | + |
| 56 | +1. **Always validate references**: The reference validator is essential for catching AI hallucinations. Run `just validate-references file` before committing evidence items. |
| 57 | + |
| 58 | +2. **Use actual quotes**: Snippets must be exact quotes from abstracts, not paraphrases. The validator checks them by substring matching. |
| 59 | + |
| 60 | +3. **Verify PMIDs independently**: Don't trust that a PMID is correct - check the cached abstract in `references_cache/pmid_*.md` or fetch it fresh. |
| 61 | + |
| 62 | +4. **When in doubt, use notes**: If a claim is well-established but you can't find a quotable snippet, put it in `notes` rather than fabricating evidence. |
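The lessons above can be read as a small triage procedure for each evidence item. The sketch below is illustrative only and is not part of the project tooling: `check_evidence`, its arguments, and the three outcome labels are hypothetical, and the topic check is a crude keyword test rather than a real relevance check.

```python
import re


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_evidence(snippet: str, abstract: str, topic_terms: list[str]) -> str:
    """Classify an evidence item as 'ok', 'wrong_reference', or 'move_to_notes'."""
    body = _norm(abstract)
    # Lesson 3: the cited paper should at least mention the disease or gene.
    if not any(_norm(term) in body for term in topic_terms):
        return "wrong_reference"
    # Lesson 2: the snippet must be an exact quote, not a paraphrase.
    quote = _norm(snippet)
    if quote and quote in body:
        return "ok"
    # Lesson 4: keep the claim, but as a note rather than as evidence.
    return "move_to_notes"


# Placeholder abstract text, not taken from any real paper.
abstract = "Gene Y variants cause disease X in most patients."
```

In practice the `abstract` argument would come from the cached file under `references_cache/`, and an item classified as `wrong_reference` would need a different PMID rather than a different quote.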
| 63 | + |
| 64 | +### Compliance Analysis |
| 65 | + |
| 66 | +Ran `just compliance-weighted` with the QC config: |
| 67 | + |
| 68 | +- **Global compliance**: 56.1% |
| 69 | +- **Weighted compliance**: 75.3% |
| 70 | +- **Term coverage**: 93.0% |
| 71 | +- **Evidence coverage**: 77.7% |
| 72 | +- **Description coverage**: 26.4% |
| 73 | + |
| 74 | +Critical paths meet their thresholds: |
| 75 | +- `phenotypes[].phenotype_term.term`: 99.5% (threshold: 90%) |
| 76 | +- `disease_term.term`: 98.2% (threshold: 95%) |
| 77 | +- `pathophysiology[].cell_types[].term`: 100% (threshold: 85%) |
| 78 | +- `treatments[].treatment_term.term`: 100% (threshold: 80%) |
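The threshold comparison itself is simple enough to sketch. The coverage and threshold numbers below are the ones listed above; the function `find_violations` and the dictionary layout are hypothetical and do not reflect the actual QC config format.

```python
def find_violations(
    coverage: dict[str, float], thresholds: dict[str, float]
) -> dict[str, tuple[float, float]]:
    """Return {path: (observed, required)} for every path below its threshold."""
    violations = {}
    for path, minimum in thresholds.items():
        # A path missing from the report is treated as 0% coverage.
        observed = coverage.get(path, 0.0)
        if observed < minimum:
            violations[path] = (observed, minimum)
    return violations


# Percentages taken from the compliance run described above.
coverage = {
    "phenotypes[].phenotype_term.term": 99.5,
    "disease_term.term": 98.2,
    "pathophysiology[].cell_types[].term": 100.0,
    "treatments[].treatment_term.term": 100.0,
}
thresholds = {
    "phenotypes[].phenotype_term.term": 90.0,
    "disease_term.term": 95.0,
    "pathophysiology[].cell_types[].term": 85.0,
    "treatments[].treatment_term.term": 80.0,
}
violations = find_violations(coverage, thresholds)
```

For these four paths the result is empty, which matches the observation that the critical paths pass.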
| 79 | + |
| 80 | +The remaining violations are in sparse data paths (locations, chemical_entities, pathways), which points to areas for future data enrichment rather than to config issues. |