Description
Currently it looks like term files are written fresh each time; updates should be incremental and diff-friendly.
Copying content of issue from: monarch-initiative/dismech#40
Problem
When adding new disorder files or updating existing ones, the cache files (e.g., cache/hp/terms.csv, cache/mondo/terms.csv) show large diffs where all timestamps get updated, not just the newly added terms.
Example from PR #38
In PR #38, adding Frontotemporal Dementia caused:
- cache/hp/terms.csv: 324 additions, 322 deletions (646 lines changed)
- cache/mondo/terms.csv: 65 additions, 64 deletions (129 lines changed)
However, only 1-2 new terms were actually added. The bulk of the diff is timestamp updates for existing terms:
-HP:0040282,Frequent,2025-12-16T10:24:02.844891
+HP:0040282,Frequent,2025-12-23T11:16:01.374193
Expected Behavior
The cache should be incremental:
- Only newly added terms should get timestamps
- Existing cached terms should retain their original timestamps
- Diffs should only show actual new content
Current Behavior
Every validation run appears to re-write the entire cache with fresh timestamps, making code review difficult:
- Hard to identify which terms are actually new
- Large diffs obscure the meaningful changes
- Git history becomes noisy
Impact
- Code review friction: Reviewers must manually verify which terms are genuinely new
- Merge conflicts: Higher likelihood of cache file conflicts in concurrent PRs
- Git history noise: Harder to track when specific terms were first introduced
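Until the cache is incremental, a reviewer could separate real additions from timestamp churn with a small script. This is only a sketch: it assumes each cache row is `id,label,timestamp` as in the diff excerpt above, and the function names and the `HP:9999999` term are hypothetical.

```python
import csv
import io


def term_keys(cache_csv: str) -> set[tuple[str, str]]:
    """Collect (id, label) pairs, ignoring the timestamp column.

    Assumes rows look like: HP:0040282,Frequent,2025-12-16T10:24:02.844891
    """
    reader = csv.reader(io.StringIO(cache_csv))
    return {(row[0], row[1]) for row in reader if len(row) >= 2}


def genuinely_new_terms(old_csv: str, new_csv: str) -> list[tuple[str, str]]:
    """Return terms present in the new cache but absent from the old one."""
    return sorted(term_keys(new_csv) - term_keys(old_csv))


old = "HP:0040282,Frequent,2025-12-16T10:24:02.844891\n"
new = (
    "HP:0040282,Frequent,2025-12-23T11:16:01.374193\n"
    "HP:9999999,Example term,2025-12-23T11:16:01.374193\n"  # hypothetical term
)
print(genuinely_new_terms(old, new))
```

The two inputs could come from `git show <base>:cache/hp/terms.csv` and the working-tree copy of the file.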
Potential Root Cause
This may be an upstream issue with linkml-term-validator or the caching mechanism in the validation stack. The cache re-generation logic may be:
- Reading all existing terms
- Re-fetching/re-validating them
- Writing them back with updated timestamps
Instead of:
- Loading existing cache with original timestamps
- Only fetching new terms
- Appending new terms without touching existing ones
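The incremental behavior described above could look roughly like the following. This is a sketch under assumptions, not the actual linkml-term-validator code: the row layout (`id,label,timestamp`) is inferred from the diff excerpt, and `merge_cache` and the `HP:9999999` term are hypothetical names used for illustration.

```python
import csv
import io
from datetime import datetime


def merge_cache(existing_csv: str, wanted: dict[str, str], now: datetime) -> str:
    """Append terms missing from the cache; leave existing rows untouched.

    existing_csv: current cache text, one `id,label,timestamp` row per line.
    wanted: mapping of term id -> label needed by the current validation run.
    """
    rows = [row for row in csv.reader(io.StringIO(existing_csv)) if row]
    known = {row[0] for row in rows}

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)  # original timestamps are written back unchanged
    for term_id in sorted(wanted):
        if term_id not in known:
            # Only newly seen terms receive a fresh timestamp.
            writer.writerow([term_id, wanted[term_id], now.isoformat()])
    return out.getvalue()


existing = "HP:0040282,Frequent,2025-12-16T10:24:02.844891\n"
wanted = {"HP:0040282": "Frequent", "HP:9999999": "Example term"}
updated = merge_cache(existing, wanted, datetime(2025, 12, 23, 11, 16, 1, 374193))
print(updated)
```

With this approach, a run that adds one term produces a one-line diff, and a run that adds nothing leaves the file unchanged.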
Reproduction
# Add a single new HPO term to any disorder file
# Run validation
just validate kb/disorders/YourDisorder.yaml
# Observe: entire cache gets timestamp refresh
git diff cache/hp/terms.csv
Possible Solutions
- Preserve timestamps on cache reads: When loading cache, preserve the retrieved_at field
- Append-only mode: Add flag to only append new terms without re-writing existing ones
- Explicit re-cache command: Make full cache refresh opt-in via just recache-all or similar
- Upstream fix: If this is in linkml-reference-validator, contribute a fix there
Questions to Investigate
- Is this behavior in linkml-term-validator or linkml-reference-validator?
- Does the cache loader preserve timestamps when reading?
- Is there an existing flag/config to enable incremental mode?
- Should timestamps represent "first cached" vs "last validated"?
Related
- PR #38 (example where this occurs)
- linkml-reference-validator repository (potential upstream source)
Workaround: For now, users can explicitly choose to re-cache all terms, but the default should be incremental updates.