Database Performance Optimizations #4
base: db-cleanup-baseline
Conversation
… deadlocks on MySQL (#80329)

* Split subquery when cleaning annotations
* update comment
* Raise batch size, now that we pay attention to it
* Iterate in batches
* Separate cancellable batch implementation to allow for multi-statement callbacks, add overload for single-statement use
* Use split-out utility in outer batching loop so it respects context cancellation
* guard against empty queries
* Use SQL parameters
* Use same approach for tags
* drop unused function
* Work around parameter limit on sqlite for large batches
* Bulk insert test data in DB
* Refactor test to customise test data creation
* Add test for catching SQLITE_MAX_VARIABLE_NUMBER limit
* Turn annotation cleanup test to integration tests
* lint

---------

Co-authored-by: Sofia Papagiannaki <[email protected]>
@coderabbitai review
✅ Actions performed: Review triggered.
Walkthrough

The changes refactor annotation cleanup operations to perform batched deletions instead of single-query operations, introduce integration-style test scaffolding with real database writes, and increase cleanup task execution frequency from every 10 minutes to every 1 minute.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Test
    participant Store as CleanAnnotations
    participant DB as Database

    rect rgb(240, 248, 255)
        note over Store: Old approach (single query)
        Test->>Store: CleanAnnotations(cutoff)
        Store->>DB: DELETE (single query with subquery)
        DB-->>Store: result
        Store-->>Test: affected count
    end

    rect rgb(255, 250, 240)
        note over Store: New approach (batched)
        Test->>Store: CleanAnnotations(cutoff)
        loop untilDoneOrCancelled
            Store->>DB: SELECT IDs WHERE created < cutoff LIMIT batchSize
            DB-->>Store: IDs
            alt IDs empty
                Store->>Store: break loop
            else IDs found
                Store->>DB: DELETE WHERE ID IN (IDs)
                DB-->>Store: affected count
                Store->>Store: accumulate affected, continue
            end
        end
        Store-->>Test: total affected count
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
Note: Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it. Generating unit tests... This may take up to 20 minutes.

✅ UTG Post-Process Complete: No new issues were detected in the generated code and all check runs have completed. The unit test generation process has completed successfully.
Creating a PR to put the unit tests in... The changes have been created in this pull request: View PR |
Actionable comments posted: 0
🧹 Nitpick comments (9)
pkg/services/annotations/annotationsimpl/xorm_store.go (4)
523-563: Batched cleanup loop looks correct, but logging is too heavy/noisy for normal operation

The new `untilDoneOrCancelled` usage for both MaxAge and MaxCount paths looks logically sound: you batch by IDs, accumulate `totalAffected`, and terminate cleanly on `affected == 0` or error/cancellation.

However, the logging inside each batch is aggressive and potentially expensive:

- Uses `r.log.Error` for routine, successful batches (`"Annotations to clean by time/count"` and `"cleaned annotations by time/count"`).
- Logs the full `ids` slice and the `cond` string every batch, which can be large and high-volume under load.

This will generate a lot of error-level noise and can impact performance and log storage.

Suggest:

- Downgrade these to `Debug` (or at most `Info`) and drop `ids` from the log fields, keeping only counts and perhaps a short summary of the condition (see the sketch below).
- Optionally, log only once per cleanup run (e.g., before/after the `untilDoneOrCancelled` loop) instead of per batch.
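For illustration only, a minimal sketch of the suggested downgrade; the `affected` and `totalAffected` variable names are assumptions about the surrounding batch loop, not the PR's actual code:

```go
// Hypothetical sketch: per-batch progress at Debug level with counts only,
// and no ids slice or full condition string in the log fields.
r.log.Debug("cleaned annotation batch", "affected", affected, "totalAffected", totalAffected)

// Hypothetical sketch: a single Info-level summary per cleanup run.
r.log.Info("annotation cleanup finished", "totalAffected", totalAffected)
```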
584-595: `fetchIDs` is safe for current call sites but tightly couples condition shape to callers

`fetchIDs` builds `SELECT id FROM %s WHERE %s` with a raw `condition` string. Given that `condition` is composed from internal constants (`annotationType`) and fixed patterns, it's reasonably safe here, but it means:

- Callers must always remember to include any `ORDER BY`/`LIMIT` in the condition string.
- The helper is only suitable for tables with an `id` column.

This is fine for current usage, but if you plan to reuse `fetchIDs` more broadly, consider:

- Splitting `where`, `orderBy`, and `limit` into separate arguments (see the sketch below), or
- Accepting a full SQL fragment (including `ORDER BY`/`LIMIT`) and making it explicit in the name, e.g., `fetchIDsWithClause`.
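Purely as an illustration of the split-argument shape, a hypothetical helper (not code from this PR) that keeps `where`, `orderBy`, and `limit` separate:

```go
// Hypothetical query builder: callers pass where, orderBy, and limit separately
// instead of assembling one raw condition string themselves.
func buildFetchIDsQuery(table, where, orderBy string, limit int) string {
	query := fmt.Sprintf("SELECT id FROM %s WHERE %s", table, where)
	if orderBy != "" {
		query += " ORDER BY " + orderBy
	}
	if limit > 0 {
		query += fmt.Sprintf(" LIMIT %d", limit)
	}
	return query
}
```

A call such as `buildFetchIDsQuery("annotation", "created < ?", "id DESC", batchSize)` would then make the ordering and limit explicit at the call site.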
597-630: SQLite parameter-limit handling should ideally be based on `len(ids)`, not configured batch size

The SQLite path in `deleteByIDs` uses:

```go
const sqliteParameterLimit = 999

if r.db.GetDBType() == migrator.SQLite && r.cfg.AnnotationCleanupJobBatchSize > sqliteParameterLimit {
	// inline IDs
} else {
	// use placeholders
}
```

This works for current callers because `len(ids)` is bounded by `AnnotationCleanupJobBatchSize`, but it couples the behavior to config rather than the actual number of IDs:

- If another caller ever reuses `deleteByIDs` with a different batch size, `len(ids)` could exceed 999 even when `AnnotationCleanupJobBatchSize` does not, or vice versa.

To make this helper robust in isolation, it would be better to switch on `len(ids)`:

```go
if r.db.GetDBType() == migrator.SQLite && len(ids) > sqliteParameterLimit {
	// inline IDs
} else {
	// placeholders
}
```

This keeps the behavior correct regardless of how callers choose batch sizes.
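To make the alternative concrete, a hypothetical sketch keyed on `len(ids)`; the helper name and plain string-building style are illustrative, not the PR's actual implementation:

```go
// Hypothetical builder: inline IDs only when SQLite would otherwise exceed its
// parameter limit, and use placeholders everywhere else.
func buildDeleteByIDs(isSQLite bool, table string, ids []int64) (string, []any) {
	const sqliteParameterLimit = 999
	if len(ids) == 0 {
		return "", nil // guard against empty batches
	}
	if isSQLite && len(ids) > sqliteParameterLimit {
		inlined := make([]string, len(ids))
		for i, id := range ids {
			inlined[i] = strconv.FormatInt(id, 10)
		}
		return fmt.Sprintf("DELETE FROM %s WHERE id IN (%s)", table, strings.Join(inlined, ",")), nil
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(ids)), ",")
	args := make([]any, 0, len(ids))
	for _, id := range ids {
		args = append(args, id)
	}
	return fmt.Sprintf("DELETE FROM %s WHERE id IN (%s)", table, placeholders), args
}
```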
632-638: `asAny` helper is fine; consider inlining if it stays single-use

`asAny` cleanly converts `[]int64` to `[]any` for the Exec call and is correct as written. If it remains used only in `deleteByIDs`, you might consider inlining it there to keep the helper surface small, but that's purely cosmetic.

pkg/services/cleanup/cleanup.go (1)
77-85: More frequent cleanup cadence looks OK, but consider stopping the ticker and validating load impact

Changing the ticker to run every minute will cause cleanup (including DB work) to be attempted much more frequently. Since `Run` calls `srv.clean(ctx)` synchronously, runs won't overlap, but you may:

- Do significantly more frequent annotation and snapshot cleanups.
- Hit the 9-minute per-run timeout while ticks keep accumulating in the ticker channel.

Two suggestions:

- Add `defer ticker.Stop()` after creating the ticker to avoid leaking resources when `Run` exits (see the sketch below).
- Confirm via metrics/load tests that a 1-minute cadence doesn't cause undue DB or IO pressure compared to the previous 10-minute interval.
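A minimal sketch of the first suggestion, assuming the ticker is created inside `Run` with a one-minute interval:

```go
// Stop the ticker when Run returns so its resources are released even if the
// context is cancelled early.
ticker := time.NewTicker(time.Minute)
defer ticker.Stop()
```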
pkg/services/annotations/annotationsimpl/cleanup_test.go (4)
18-37: Integration test setup is good; config usage is a bit split but coherent

Renaming to `TestIntegrationAnnotationCleanUp`, gating with `testing.Short()`, and switching to `db.InitTestDB(t)` makes these behave like proper integration tests. The table-driven structure with `createAnnotationsNum`/`createOldAnnotationsNum` and explicit expected counts per annotation type matches the new batched cleanup semantics.

You're using two configs per test (`cfg` for batch size via `AnnotationCleanupJobBatchSize`, and `test.cfg` for the per-type MaxAge/MaxCount settings). As long as `ProvideCleanupService` reads batch size from the first and uses the second only inside `Run`, this is fine, but it's worth keeping in mind to avoid confusion if more knobs are added later.
97-111: Large SQLite batch-size test is valuable but quite heavy; consider trimming if runtime becomes an issue

The `"should not fail if batch size is larger than SQLITE_MAX_VARIABLE_NUMBER..."` case effectively validates:

- `AnnotationCleanupJobBatchSize` much larger than SQLite's parameter limit (32767 vs 999).
- Cleanup correctly deletes 40000 of 40003 annotations while keeping one per annotation type.

This meaningfully exercises the new `deleteByIDs` SQLite path. The trade-off is that inserting and cleaning 40k+ annotations plus tags makes this test relatively expensive.

If CI time ever becomes a concern, you could:

- Reduce `createAnnotationsNum` while still exceeding the parameter limit (e.g., ~2× or 3× 1000), and
- Keep `AnnotationCleanupJobBatchSize` just over the limit (e.g., 1500) so the special code path is still covered.

For now, it's acceptable, but it's something to monitor.
114-147: Per-test cleanup via `t.Cleanup` is correct, but you can simplify error handling

The `t.Cleanup` added for each subtest:

- Clears both `annotation` and `annotation_tag` tables.
- Uses `errors.Join` to combine potential errors and asserts `NoError`.

This is functionally correct and keeps the shared test DB clean across subtests. If you want to simplify slightly, you could:

- Short-circuit on the first error instead of joining, since a single failure will already fail the test (see the sketch below), or
- Log which DELETE failed to aid debugging.

Not required, but might make debugging a bit cleaner if cleanup ever fails.

If your Go toolchain is older, ensure `errors.Join` is available (Go 1.20+).
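Purely illustrative, a sketch of the short-circuit variant; the `store.db.WithDbSession` helper and the table names are assumptions about the test's existing setup rather than its exact code:

```go
// Hypothetical short-circuit cleanup: stop at the first failing DELETE and say
// which table it was, instead of joining both errors.
t.Cleanup(func() {
	err := store.db.WithDbSession(context.Background(), func(sess *db.Session) error {
		for _, table := range []string{"annotation", "annotation_tag"} {
			if _, err := sess.Exec("DELETE FROM " + table); err != nil {
				return fmt.Errorf("deleting from %s: %w", table, err)
			}
		}
		return nil
	})
	assert.NoError(t, err)
})
```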
232-292: Batched insert helper looks correct; explicit IDs and tags are consistent with cleanup expectations

In `createTestAnnotations`:

- Preallocating `newAnnotations` and `newAnnotationTags` and inserting in batches of 500 via `InsertMulti` is a good way to keep test setup fast, even with tens of thousands of rows.
- Assigning `ID: int64(i + 1)` ensures deterministic ordering for the `ORDER BY id DESC` logic under test, and the `annotation_tag` rows reference those IDs consistently.
- Distribution of alert/dashboard/API annotations and `Created` timestamps matches the table-driven expectations for MaxAge/MaxCount tests.

The overall approach is sound for an integration test fixture. The only minor consideration is that this helper assumes an empty or test-local DB; if more shared state is introduced later, reusing explicit IDs from 1..N could conflict and may need revisiting.
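As a sketch of the chunked-insert pattern described above; `sess` and `newAnnotations` stand in for whatever the helper actually uses and are assumptions, not the PR's exact names:

```go
// Hypothetical chunked bulk insert: write the preallocated slice in fixed-size
// batches so no single INSERT statement grows unbounded.
const insertBatchSize = 500
for start := 0; start < len(newAnnotations); start += insertBatchSize {
	end := start + insertBatchSize
	if end > len(newAnnotations) {
		end = len(newAnnotations)
	}
	if _, err := sess.InsertMulti(newAnnotations[start:end]); err != nil {
		return err
	}
}
```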
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- pkg/services/annotations/annotationsimpl/cleanup_test.go (9 hunks)
- pkg/services/annotations/annotationsimpl/xorm_store.go (2 hunks)
- pkg/services/cleanup/cleanup.go (1 hunks)
🔇 Additional comments (2)
pkg/services/annotations/annotationsimpl/xorm_store.go (2)
640-651: `untilDoneOrCancelled` behavior is sound and preserves partial progress on error/cancel

The loop correctly:

- Checks `ctx.Done()` before each batch, returning the `ctx.Err()` together with `totalAffected`.
- Calls `batchWork()` once per iteration, accumulates `totalAffected`, and exits when a batch returns `affected == 0`.

This ensures cleanup jobs report how many rows were processed even when they terminate early due to errors or timeouts, which aligns well with how `CleanAnnotations` and `CleanOrphanedAnnotationTags` are using it.
569-582: The annotation_tag schema includes an `id` column, so the original concern is invalid

The migration file (`pkg/services/sqlstore/migrations/annotation_mig.go`) shows the schema evolution: v2 had only `annotation_id` and `tag_id` columns, but v3 (lines 92–102) explicitly adds `id` as an auto-incrementing primary key. The migration (lines 109–114) copies data from v2 to v3 and drops v2, making v3 the current canonical schema. The `fetchIDs` and `deleteByIDs` helper methods (lines 584, 597) correctly assume the presence of an `id` column. This cleanup functionality shipped in Grafana 7.4.0, confirming the schema migration is active in production deployments.
Test 7
Summary by CodeRabbit
Bug Fixes
Chores
Replicated from ai-code-review-evaluation/grafana-coderabbit#7