fix(bots): skip bot upsert when nothing changed to stop team-strip + reindex loop on boot#28128
fix(bots): skip bot upsert when nothing changed to stop team-strip + reindex loop on boot#28128joaopamaral wants to merge 1 commit into
Conversation
…reindex loop on boot `BotResource.initialize()` runs `UserUtil.addOrUpdateBotUser(user)` for every bot on every OM boot. The in-memory `User` built by `UserUtil.user(...)` does not have the `teams` field populated, so the PUT path through `userRepository.createOrUpdate -> UserUpdater.entitySpecificUpdate` runs `updateTeams(original, updated)` with `original.teams = [Organization]` (or the bot's real stored teams) and `updated.teams = null`. `updateTeams` then executes `deleteTo(user, HAS, TEAM) + assignTeams(null)`, which strips every stored team membership the bot had, bumps the user version, and triggers an Elasticsearch reindex of both the user and each affected team. With several bots this fires on every restart and produces the reindex storm plus "Circular dependency detected in team hierarchy for team: Organization" warnings in the boot logs. In one production deployment this added almost 3 minutes to every boot. Short-circuit when the incoming bot has no real change vs. the persisted row: compare `description`, `displayName`, and `roles`. If they all match, return the original user and skip the PUT entirely — no `UserUpdater`, no team strip, no version bump, no reindex. Two adjustments to make the guard actually fire: - `retrieveWithAuthMechanism` now also loads `"roles"` (was loading only `"authenticationMechanism"`); `description` and `displayName` are scalar JSON-column fields and were already populated by the base read. - Compare `roles` via `listOrEmpty(...)` on both sides because the database-loaded original returns an empty list while the freshly built in-memory user returns null, and `Objects.equals(null, [])` is false. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| new org.openmetadata.schema.entity.teams.AuthenticationMechanism() | ||
| .withAuthType(org.openmetadata.schema.entity.teams.AuthenticationMechanism.AuthType.JWT)); |
There was a problem hiding this comment.
💡 Quality: Fully qualified class names used instead of imports
The test uses fully qualified names org.openmetadata.schema.entity.teams.AuthenticationMechanism inline (lines 66-67 and 108-109) instead of importing the class. Per project conventions, wildcard imports and fully qualified names should be avoided — add a proper import statement instead.
Import the class and use short names:
// Add to imports section:
import org.openmetadata.schema.entity.teams.AuthenticationMechanism;
import org.openmetadata.schema.entity.teams.AuthenticationMechanism.AuthType;
// Then replace inline usages with:
.withAuthenticationMechanism(
new AuthenticationMechanism()
.withAuthType(AuthType.JWT));
- Apply fix
Check the box to apply the fix or reply for a change | Was this helpful? React with 👍 / 👎
| try { | ||
| User result = UserUtil.addOrUpdateBotUser(incoming); | ||
| assertNotEquals( | ||
| stored, | ||
| result, | ||
| "When fields differ the upsert path must run and produce a different User"); | ||
| } catch (RuntimeException ignored) { | ||
| // The downstream createOrUpdate call may throw against the mock; the assertion we | ||
| // care about is that the short-circuit guard did NOT fire, which we verify below. | ||
| } | ||
| } |
There was a problem hiding this comment.
💡 Quality: Swallowed RuntimeException uses flow-control exception pattern
The test at line 131 catches RuntimeException ignored to handle the case where the mock doesn't fully stub the downstream path. This is a flow-control exception anti-pattern and makes the test fragile — if addOrUpdateBotUser throws for an unexpected reason, the test silently passes. Consider stubbing createOrUpdate to return a value (or use doNothing()/doReturn(...)) so the method completes normally, and assert on the result instead.
Stub the mock properly instead of catching RuntimeException:
// Stub createOrUpdate to return the incoming user so the method completes normally:
when(userRepository.createOrUpdate(any(), any(User.class), any())).thenReturn(incoming);
try (MockedStatic<Entity> entityStatic = mockStatic(Entity.class)) {
entityStatic.when(() -> Entity.getEntityRepository(Entity.USER)).thenReturn(userRepository);
User result = UserUtil.addOrUpdateBotUser(incoming);
assertNotEquals(
stored,
result,
"When fields differ the upsert path must run and produce a different User");
}
- Apply fix
Check the box to apply the fix or reply for a change | Was this helpful? React with 👍 / 👎
| @SuppressWarnings("unused") | ||
| private static List<EntityReference> roleRef(String name) { | ||
| List<EntityReference> refs = new ArrayList<>(); | ||
| refs.add(new EntityReference().withId(UUID.randomUUID()).withName(name).withType("role")); | ||
| return refs; | ||
| } |
There was a problem hiding this comment.
💡 Quality: Unused private helper method roleRef should be removed
The private method roleRef at lines 140-145 is annotated with @SuppressWarnings("unused") and is never called. Commented-out or dead code should be removed per project conventions. If it's intended for future tests, add it when those tests are written.
Was this helpful? React with 👍 / 👎
| if (originalUser != null | ||
| && Objects.equals(listOrEmpty(originalUser.getRoles()), listOrEmpty(user.getRoles())) | ||
| && Objects.equals(originalUser.getDescription(), user.getDescription()) | ||
| && Objects.equals(originalUser.getDisplayName(), user.getDisplayName())) { | ||
| return originalUser; |
There was a problem hiding this comment.
⚠️ Edge Case: Short-circuit doesn't compare email field changes
The guard at line 342-346 compares roles, description, and displayName but not email. The UserUtil.user(...) method (which builds the in-memory bot user) may set an email based on domain. If an admin changes the domain configuration between restarts, the email update would be silently skipped by the short-circuit. Consider whether email should be included in the comparison, or document why it's excluded.
Was this helpful? React with 👍 / 👎
Code Review
|
| Compact |
|
Was this helpful? React with 👍 / 👎 | Gitar
Summary
BotResource.initialize()runsUserUtil.addOrUpdateBotUser(user)for every bot on every OM boot. The in-memoryUserbuilt byUserUtil.user(...)does not have theteamsfield populated, so the PUT path throughuserRepository.createOrUpdate -> UserUpdater.entitySpecificUpdaterunsupdateTeams(original, updated)withoriginal.teams = [Organization](or the bot's real stored teams) andupdated.teams = null.updateTeamsthen executes:…which strips every stored team membership the bot had, bumps the user version, and triggers an Elasticsearch reindex of both the user and each affected team. With several bots this fires on every restart and produces the reindex storm we observed in production logs:
Repeated for
profiler-bot,governance-bot,usage-bot,ingestion-bot, … each one taking ~100–200 ms plus a team-side reindex. We also sawCircular dependency detected in team hierarchy for team: Organization. Skipping to prevent StackOverflowError.fromSubjectContextduring the same window — the boot-time team churn was tripping the cycle guard.In a real environment with several bots this added ~3 minutes to every boot.
Fix
Short-circuit
addOrUpdateBotUserwhen the incoming bot has no real change vs. the persisted row: comparedescription,displayName, androles. If they all match, return the original user and skip the PUT entirely — noUserUpdater, no team strip, no version bump, no reindex.Two small adjustments to make the guard actually fire:
retrieveWithAuthMechanismnow also loads\"roles\"(was loading only\"authenticationMechanism\").descriptionanddisplayNameare scalar JSON-column fields and were already populated by the base read.rolesvialistOrEmpty(...)on both sides because the database-loaded original returns an empty list while the freshly built in-memory user returnsnull, andObjects.equals(null, [])isfalse.The first call still hits the existing code path (no
originalUser-> guard skipped), so seeding new bots is unchanged.Reproducer
ingestion-bot,profiler-bot,governance-bot,usage-bot, …) is upserted byBotResource.initialize().Organization) team.Organization. Boot logs showfieldsDeleted=[name=teams, oldValue=[...], newValue=null]for every bot, each followed by user + team ES reindex log lines.fieldsDeleted=[teams]log lines appear. Boot time drops accordingly (~3 minutes saved in production).Test plan
Added
openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTestwith two Mockito unit tests:addOrUpdateBotUserShortCircuitsWhenNothingChanged— mocksUserRepository, stubsgetByNameto return a stored bot whoseroles/description/displayNamematch the incoming user, callsaddOrUpdateBotUser(boundary), asserts the return is the same instance as the stored user, and verifiesuserRepository.createOrUpdate(...)was never called. Verified to fail before the fix (test reachesUserUtil.addOrUpdateUserand explodes on the unstubbedcreateOrUpdate); passes after the fix.addOrUpdateBotUserGoesThroughUpsertWhenDisplayNameChanged— same setup but with mismatchingdisplayName; verifiesuserRepository.createOrUpdate(...)is called, so we don't accidentally short-circuit on real changes.Local manual verification: spun the fix into the 1.12.7 backport branch we run in production, restarted, observed no
fieldsDeleted=[teams]log entries for the bot users and no follow-up bot/team reindex log lines. Boot duration dropped by ~3 minutes.Opening as draft for maintainer feedback on:
roles, description, displayName) is acceptable or you'd prefer a broader/narrower field check;🤖 Generated with Claude Code