term cache files should be free of spurious diffs #9

@dragon-ai-agent

Currently it looks like term files are written fresh each time. Updates should be incremental and diff-friendly.

Copying content of issue from: monarch-initiative/dismech#40

Problem

When adding new disorder files or updating existing ones, the cache files (e.g., cache/hp/terms.csv, cache/mondo/terms.csv) show large diffs where all timestamps get updated, not just the newly added terms.

Example from PR #38

In PR #38, adding Frontotemporal Dementia caused:

  • cache/hp/terms.csv: 324 additions, 322 deletions (646 lines changed)
  • cache/mondo/terms.csv: 65 additions, 64 deletions (129 lines changed)

However, only 1-2 new terms were actually added. The bulk of the diff is timestamp updates for existing terms:

-HP:0040282,Frequent,2025-12-16T10:24:02.844891
+HP:0040282,Frequent,2025-12-23T11:16:01.374193
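To separate real additions from timestamp churn when reviewing such a diff, something like the following could be used. This is a minimal sketch: it assumes the cache is a headerless three-column CSV (term ID, label, timestamp) as in the rows above, and `classify_changes` is a hypothetical helper, not part of any existing tool.

```python
import csv
import io


def classify_changes(old_csv: str, new_csv: str) -> dict:
    """Split cache changes into genuinely new terms and timestamp-only churn."""

    def load(text: str) -> dict:
        # Map term ID -> (label, timestamp).
        # Assumption: headerless rows of the form ID,label,timestamp.
        return {
            row[0]: (row[1], row[2])
            for row in csv.reader(io.StringIO(text))
            if row
        }

    old, new = load(old_csv), load(new_csv)
    # Terms present only in the new file are real additions.
    added = sorted(set(new) - set(old))
    # Terms whose label is unchanged but whose timestamp moved are churn.
    timestamp_only = sorted(
        term
        for term in set(old) & set(new)
        if old[term][0] == new[term][0] and old[term][1] != new[term][1]
    )
    return {"added": added, "timestamp_only": timestamp_only}
```

Feeding it the old and new contents of cache/hp/terms.csv (for example from `git show`) would list the few added IDs separately from the rows that only changed their timestamp.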

Expected Behavior

The cache should be incremental:

  • Only newly added terms should get timestamps
  • Existing cached terms should retain their original timestamps
  • Diffs should only show actual new content

Current Behavior

Every validation run appears to re-write the entire cache with fresh timestamps, making code review difficult:

  • Hard to identify which terms are actually new
  • Large diffs obscure the meaningful changes
  • Git history becomes noisy

Impact

  • Code review friction: Reviewers must manually verify which terms are genuinely new
  • Merge conflicts: Higher likelihood of cache file conflicts in concurrent PRs
  • Git history noise: Harder to track when specific terms were first introduced

Potential Root Cause

This may be an upstream issue with linkml-term-validator or the caching mechanism in the validation stack. The cache re-generation logic may be:

  1. Reading all existing terms
  2. Re-fetching/re-validating them
  3. Writing them back with updated timestamps

Instead of:

  1. Loading existing cache with original timestamps
  2. Only fetching new terms
  3. Appending new terms without touching existing ones
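The second, incremental flow could look roughly like this. It is a sketch under stated assumptions, not the validator's actual code: the cache is taken to be a headerless three-column CSV ending in a newline, and `update_cache` and `fetch_label` are hypothetical names introduced for illustration.

```python
import csv
from datetime import datetime
from pathlib import Path
from typing import Callable, Iterable


def update_cache(
    path: Path,
    needed: Iterable[str],
    fetch_label: Callable[[str], str],
) -> list[str]:
    """Append only uncached terms; existing rows are never rewritten."""
    # Step 1: load the IDs already cached (their rows stay untouched on disk).
    cached = set()
    if path.exists():
        with path.open(newline="") as f:
            cached = {row[0] for row in csv.reader(f) if row}

    # Step 2: only terms missing from the cache need fetching.
    # dict.fromkeys de-duplicates while keeping the requested order.
    new_terms = [t for t in dict.fromkeys(needed) if t not in cached]

    # Step 3: append new rows; open in "a" mode so old bytes are not rewritten.
    if new_terms:
        with path.open("a", newline="") as f:
            # "\n" avoids mixing CRLF into an LF file, which would add diff noise.
            writer = csv.writer(f, lineterminator="\n")
            for term in new_terms:
                writer.writerow([term, fetch_label(term), datetime.now().isoformat()])
    return new_terms
```

Because the file is only ever appended to, a run that adds one term produces a one-line diff, and a run that adds nothing leaves the file byte-identical.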

Reproduction

# Add a single new HPO term to any disorder file
# Run validation
just validate kb/disorders/YourDisorder.yaml

# Observe: entire cache gets timestamp refresh
git diff cache/hp/terms.csv

Possible Solutions

  1. Preserve timestamps on cache reads: When loading cache, preserve the retrieved_at field
  2. Append-only mode: Add flag to only append new terms without re-writing existing ones
  3. Explicit re-cache command: Make full cache refresh opt-in via just recache-all or similar
  4. Upstream fix: If this is in linkml-reference-validator, contribute a fix there
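Solutions 1 and 3 can be combined in a single merge step, sketched below. The `retrieved_at` field name comes from this issue; `CachedTerm`, `merge_terms`, and the `refresh_all` flag are hypothetical and would need to be mapped onto whatever the upstream cache loader actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedTerm:
    term_id: str
    label: str
    retrieved_at: str  # ISO timestamp; field name taken from the issue text


def merge_terms(
    existing: dict[str, CachedTerm],
    fetched: dict[str, CachedTerm],
    refresh_all: bool = False,
) -> dict[str, CachedTerm]:
    """Merge fetched terms into the cache.

    By default an existing entry keeps its original retrieved_at.
    Passing refresh_all=True is the explicit opt-in for a full re-cache.
    """
    merged = dict(existing)
    for term_id, term in fetched.items():
        if refresh_all or term_id not in merged:
            merged[term_id] = term
    return merged
```

With this shape, the default path treats the timestamp as "first cached", and only an explicit refresh turns it into "last validated".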

Questions to Investigate

  • Is this behavior in linkml-term-validator or linkml-reference-validator?
  • Does the cache loader preserve timestamps when reading?
  • Is there an existing flag/config to enable incremental mode?
  • Should timestamps represent "first cached" vs "last validated"?

Related

  • PR #38 (example where this occurs)
  • linkml-reference-validator repository (potential upstream source)

Workaround: For now, users can explicitly choose to re-cache all terms, but the default should be incremental updates.


Labels: bug (Something isn't working)
