Skip to content

fix(cache): close rename-cascade race and bot-task delete eviction gap#28100

Open
harshach wants to merge 3 commits into
mainfrom
harshach/fix-cache-it-failures
Open

fix(cache): close rename-cascade race and bot-task delete eviction gap#28100
harshach wants to merge 3 commits into
mainfrom
harshach/fix-cache-it-failures

Conversation

@harshach
Copy link
Copy Markdown
Collaborator

@harshach harshach commented May 13, 2026

Describe your changes:

Three IT failures on the new Integration Tests - PostgreSQL + Elasticsearch + Redis workflow (gate added by #28012) traced back to two cache-invalidation gaps in the cache improvements merge:

  1. Rename cascade race (ClassificationResourceIT + GlossaryRepository + TagRepository + DomainRepository): invalidateCacheForRenameCascade ran BEFORE the bulk DAO updateFqn. With invalidateCacheForTaggedEntitiesAndDescendants (search-index walk) in between, CI traces showed a ~4 s window where any concurrent reader loaded the still-visible pre-rename row from DB and repopulated L1+L2 cache with the old FQN — pinned for the entity TTL. Awaitility timeouts on ClassificationResourceIT.test_classificationRename_tagActivityFeedsPreserved and test_classificationRename_multipleTagsUpdated.

  2. Bot task delete eviction gap (TaskResourceIT): UserRepository.deleteSuggestionTasksForUser issued a direct DELETE FROM task_entity WHERE createdById=… that bypassed EntityRepository.delete and its cache hook. Any task previously read by id was still pinned in the L1 Guava CACHE_WITH_ID, so the next GET returned the "deleted" task — failing TaskResourceIT.testDeletingBotCreatorCleansUpOpenSuggestionTasks.

Refactored invalidateCacheForRenameCascade to return the captured (id, oldFqn) list and added a finishInvalidateCacheForRenameCascade post-commit pair that re-evicts the same entries; updated the 6 call sites (Classification, Tag, Glossary, GlossaryTerm ×2, Domain ×2 for DOMAIN+DATA_PRODUCT) to capture the list pre-write and finish post-write. For the bot path, added TaskDAO.listIdsByCreatorAndCategory, capture ids before the bulk DELETE, then fan out EntityRepository.invalidateCacheForEntity(Entity.TASK, id, null) after.

Type of change:

  • Bug fix

High-level design:

The pattern was that pre-commit invalidations clear stale cache entries, but anything mutating in long-running rename flows (search-index walks, ES asset updates, policy condition updates) leaves a wide race window where a concurrent reader can repopulate the L1/L2 cache from the still-visible pre-rename DB row. The new finishInvalidateCacheForRenameCascade is the symmetric post-commit pair — it re-evicts the descendants captured at pre-invalidate time, by id and by old FQN, closing the window. The bot-delete fix follows the same principle: any direct SQL write that bypasses EntityRepository.delete must explicitly fan out cache invalidation, since the L1 Guava cache otherwise keeps stale rows alive past the DB drop.

List-then-delete in the bot path is intentionally not transactional — over-invalidating a few extra ids on retry is cheap; missing one is the original bug.

Tests:

Use cases covered

  • Renaming a Classification correctly propagates the new FQN to all child tags on subsequent reads (no stale cache).
  • Renaming a Glossary / GlossaryTerm / Tag / Domain / DataProduct does the same.
  • Deleting a bot user synchronously cleans up its open suggestion tasks AND makes them un-retrievable via GET-by-id immediately afterwards.

Backend integration tests

  • Not adding new tests — these fixes resolve existing failures in ClassificationResourceIT.test_classificationRename_* and TaskResourceIT.testDeletingBotCreatorCleansUpOpenSuggestionTasks. CI on the postgres+ES+redis profile will verify.

Manual testing performed

  • mvn clean compile -pl openmetadata-service — passes
  • mvn spotless:check -pl openmetadata-service — passes
  • Full repro requires the cache CI profile (Redis); not exercised locally (no Redis in the local docker stack).

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have commented on my code, particularly in hard-to-understand areas.

🤖 Generated with Claude Code


Summary by Gitar

  • Fixed cache race condition:
    • Replaced postUpdateMany with postLogicalSuiteRelationshipUpdate in TestCaseRepository to prevent concurrent PATCH operations from being overwritten by stale cache data during logical test suite bulk additions.

This will update automatically on new commits.

Three IT failures on the new postgres+ES+redis CI profile traced back to
two cache-invalidation gaps introduced alongside #28012:

1) Classification / Tag / Glossary / GlossaryTerm / Domain renames called
   invalidateCacheForRenameCascade BEFORE the bulk DAO updateFqn. With
   invalidateCacheForTaggedEntitiesAndDescendants (search-index walk) in
   between, the window was ~4 s in CI traces. Any concurrent reader landing
   in that window loaded the still-visible pre-rename row from DB and
   repopulated L1+L2 cache with the old FQN, which then stuck for the
   entity TTL. Awaitility timeouts on
   ClassificationResourceIT.test_classificationRename_tagActivityFeedsPreserved
   and test_classificationRename_multipleTagsUpdated.

   Refactored invalidateCacheForRenameCascade to return the captured
   (id, oldFqn) pairs and added finishInvalidateCacheForRenameCascade —
   a post-commit pair that re-evicts the same entries by id and by old
   FQN, closing the race window. Updated the 6 call sites
   (Classification, Tag, Glossary, GlossaryTerm x2, Domain x2 for
   DOMAIN+DATA_PRODUCT) to capture the list pre-write and call the
   finish pair after all DB writes complete.

2) UserRepository.deleteSuggestionTasksForUser issued a direct
   DELETE FROM task_entity ... that bypassed EntityRepository.delete and
   its cache hook. Any task previously read by id was still pinned in
   the L1 Guava CACHE_WITH_ID, so the next GET returned the "deleted"
   task — failing
   TaskResourceIT.testDeletingBotCreatorCleansUpOpenSuggestionTasks.

   Added TaskDAO.listIdsByCreatorAndCategory, capture the ids before
   the bulk DELETE, then fan out
   EntityRepository.invalidateCacheForEntity(Entity.TASK, id, null)
   afterwards. List + delete are intentionally not in one transaction —
   over-invalidating a few extra ids on retry is cheap; missing one is
   the bug.

mvn clean compile + spotless:check pass on openmetadata-service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 22:10
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 13, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses cache invalidation gaps that caused flaky backend integration tests under the Redis cache profile: (1) rename-cascade flows could be “re-poisoned” by concurrent readers repopulating caches with pre-rename rows, and (2) bulk deletion of bot-created suggestion tasks bypassed repository delete hooks, leaving stale task entries pinned in the L1 Guava cache.

Changes:

  • Refactors EntityRepository.invalidateCacheForRenameCascade to return the enumerated descendant (id, oldFqn) pairs and adds a symmetric finishInvalidateCacheForRenameCascade(...) pass for post-rename re-eviction.
  • Updates rename flows (Classification/Tag/Glossary/GlossaryTerm/Domain + DataProduct) to capture descendants pre-update and re-invalidate after the rename-related writes.
  • Fixes bot-task deletion cache eviction by listing task IDs before bulk delete and explicitly invalidating the per-task cache entries afterward.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/UserRepository.java Captures task IDs before bulk delete and explicitly invalidates Task cache entries by ID afterward.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java Adds TaskDAO.listIdsByCreatorAndCategory to support pre-delete ID capture.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java Changes rename-cascade invalidation to return affected descendants and adds finishInvalidateCacheForRenameCascade plus shared eviction helper.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TagRepository.java Adopts two-phase rename-cascade invalidation for tag rename flows.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ClassificationRepository.java Adopts two-phase rename-cascade invalidation for classification rename → tag descendants.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryRepository.java Adopts two-phase rename-cascade invalidation for glossary rename → glossary term descendants.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java Adopts two-phase rename-cascade invalidation for glossary term rename/move cascades.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/DomainRepository.java Adopts two-phase rename-cascade invalidation for domain rename affecting domains + data products.
Comments suppressed due to low confidence (1)

openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TagRepository.java:989

  • finishInvalidateCacheForRenameCascade is called before the subsequent classificationChanged / parentChanged handling (and the final response-field invalidations). Since entitySpecificUpdate runs under @Transaction, there is still time before commit where a concurrent reader can repopulate the cache with the pre-rename row after this “finish” pass. Consider moving the finish call to the very end of updateNameAndParent (after the classification/parent updates and final invalidations), or invoking it again right before returning, so the post-pass is as close to commit as possible.
        finishInvalidateCacheForRenameCascade(Entity.TAG, renamedTags);
      }

      if (classificationChanged) {
        updateClassificationRelationship(original, updated);

Comment on lines +2924 to +2928
* #finishInvalidateCacheForRenameCascade} after the DB writes commit — necessary because a
* reader landing in the window between this call and the DB commit will repopulate the by-id
* cache with the still-visible pre-rename row, and only a second invalidate pass after the
* commit can evict the poisoned entry.
*
The remaining failure on the postgres+ES+redis CI gate —
TestCaseResourceIT.testBulkFluentAPI ("Description should be bulk updated"
times out after 60s) — was a cache-poisoning race between the bulk PATCH
loop and a concurrent test running test_bulkAddAllTestCasesWithExcludeIds.

CI trace for testCase c5fa887e:
  T0: bulk-add fetches all candidate test cases via Entity.getEntities(
      refs, "*", ALL) — gets snapshot of c5fa887e with OLD description.
  T1: testBulkFluentAPI PATCHes c5fa887e — DB committed, cache write-
      through stores the NEW description (1649 bytes).
  T2: bulk-add calls postUpdateMany(updatedTestCases) → writeThroughCache-
      Many serializes the pre-read snapshot and overwrites Redis with the
      OLD description (2158 bytes).
  T3+: 60s of polling sees the poisoned cache value and never reaches
      "Bulk updated".

The pre-read snapshot was load-bearing for nothing — testSuites is in the
storage-stripped field list (getFieldsStrippedFromStorageJson), so the
testCase entity JSON does not actually change here. The only DB write is
the entity_relationship CONTAINS row.

Fix in TestCaseRepository.addTestCasesToLogicalTestSuite and
addAllTestCasesToLogicalTestSuiteTxn: replace postUpdateMany with a new
postLogicalSuiteRelationshipUpdate hook that:

  1. Invalidates the read-bundle cache (where testSuites is fanned out
     during reads) for each affected test case — so the next GET picks
     up the new relationship.
  2. Fires the lifecycle "entities updated" event (event subscribers
     still see the testSuites field change).
  3. Updates the RDF graph.

Crucially, no writeThroughCacheMany. The base-entity JSON in Redis is
left alone, so a concurrent PATCH's write-through is not clobbered.

mvn clean compile + spotless:check pass on openmetadata-service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

🔴 Playwright Results — 1 failure(s), 13 flaky

✅ 3353 passed · ❌ 1 failed · 🟡 13 flaky · ⏭️ 56 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 299 0 0 4
🟡 Shard 2 748 0 8 19
🟡 Shard 3 783 0 1 7
🟡 Shard 4 789 0 1 18
🔴 Shard 6 734 1 3 8

Genuine Failures (failed on all attempts)

Pages/GlossaryImportExport.spec.ts › Glossary CSV import preserves typed relations (shard 6)
Error: �[2mexpect(�[22m�[31mreceived�[39m�[2m).�[22mtoBe�[2m(�[22m�[32mexpected�[39m�[2m) // Object.is equality�[22m

Expected: �[32m200�[39m
Received: �[31m404�[39m
🟡 13 flaky test(s) (passed on retry)
  • Features/BulkEditEntity.spec.ts › Database service (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 2 retries)
  • Features/ImpactAnalysis.spec.ts › Verify table columns visibility and content (shard 2, 1 retry)
  • Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Should display custom properties for apiCollection in right panel (shard 4, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Container (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Users.spec.ts › Admin soft & hard delete and restore user (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

…with imports

gitar-bot review on #28100: per CLAUDE.md "no fully qualified names in code
— import the class instead". Add imports for CacheBundle,
EntityLifecycleEventDispatcher, and RdfUpdater; drop the inline FQNs in the
method body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 14, 2026 14:34
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 14, 2026

Code Review ✅ Approved 1 resolved / 1 findings

Closes rename-cascade and bot-task deletion race conditions by implementing pre- and post-commit cache invalidation patterns, resolving intermittent integration test failures. No issues found.

✅ 1 resolved
Quality: Fully qualified class names used instead of imports

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TestCaseRepository.java:1168 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TestCaseRepository.java:1174 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TestCaseRepository.java:1176
The new postLogicalSuiteRelationshipUpdate method uses three fully qualified class names inline (org.openmetadata.service.cache.CacheBundle, org.openmetadata.service.events.lifecycle.EntityLifecycleEventDispatcher, org.openmetadata.service.rdf.RdfUpdater) instead of adding proper import statements. Per project conventions, fully qualified names in method bodies should be avoided — add imports at the top of the file.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

LOG.warn("Skipping cache invalidation for non-UUID task id: {}", taskIdStr);
continue;
}
EntityRepository.invalidateCacheForEntity(Entity.TASK, taskId, null);
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants