Conversation

@akshayutture-augment akshayutture-augment commented Nov 14, 2025

Test 7

Summary by CodeRabbit

  • Bug Fixes

    • Improved annotation cleanup reliability by implementing a batching mechanism to prevent deadlocks during bulk deletion operations.
  • Chores

    • Increased cleanup task frequency from 10-minute to 1-minute intervals for faster maintenance of stale annotations.



Replicated from ai-code-review-evaluation/grafana-coderabbit#7

… deadlocks on MySQL (#80329)

* Split subquery when cleaning annotations

* update comment

* Raise batch size, now that we pay attention to it

* Iterate in batches

* Separate cancellable batch implementation to allow for multi-statement callbacks, add overload for single-statement use

* Use split-out utility in outer batching loop so it respects context cancellation

* guard against empty queries

* Use SQL parameters

* Use same approach for tags

* drop unused function

* Work around parameter limit on sqlite for large batches

* Bulk insert test data in DB

* Refactor test to customise test data creation

* Add test for catching SQLITE_MAX_VARIABLE_NUMBER limit

* Turn annotation cleanup test to integration tests

* lint

---------

Co-authored-by: Sofia Papagiannaki <[email protected]>
@akshayutture-augment (Author)

@coderabbitai review

@akshayutture-augment (Author)

@coderabbitai review


coderabbitai bot commented Nov 22, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Nov 22, 2025

Walkthrough

The changes refactor annotation cleanup operations to perform batched deletions instead of single-query operations, introduce integration-style test scaffolding with real database writes, and increase cleanup task execution frequency from every 10 minutes to every 1 minute.

Changes

  • Annotation cleanup batching mechanism (pkg/services/annotations/annotationsimpl/xorm_store.go): Replaces single-statement DELETE operations with iterative batched deletions. Adds helpers: fetchIDs (load IDs by condition), deleteByIDs (delete by ID list with SQLite parameter-limit handling), asAny (type conversion), and untilDoneOrCancelled (generic batching loop). Reworks CleanAnnotations (both the MaxAge and MaxCount variants) and CleanOrphanedAnnotationTags to use the batched fetch-delete pattern, including SQLite-specific parameter-limit optimizations.
  • Annotation cleanup test conversion (pkg/services/annotations/annotationsimpl/cleanup_test.go): Converts the unit tests to integration-style tests with real test-database initialization. Renames TestAnnotationCleanUp to TestIntegrationAnnotationCleanUp and adds a short-mode skip. Replaces the pre-initialized fake SQL DB with a real test DB, switches to createAnnotationsNum and createOldAnnotationsNum count variables, and introduces a per-test annotationCleanupJobBatchSize parameter. Replaces the single-insert flow with batched InsertMulti (batch size 500) for annotations and tags, adds combined per-test cleanup via error-join deletion, updates expectations for annotation type and tag counts, and adds a test case for batch sizes exceeding the SQLite variable limit.
  • Cleanup service frequency adjustment (pkg/services/cleanup/cleanup.go): Increases cleanup task execution frequency by reducing the ticker interval in CleanUpService.Run from 10 minutes to 1 minute.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Test
    participant Store as CleanAnnotations
    participant DB as Database
    
    rect rgb(240, 248, 255)
    note over Store: Old approach (single query)
    Test->>Store: CleanAnnotations(cutoff)
    Store->>DB: DELETE (single query with subquery)
    DB-->>Store: result
    Store-->>Test: affected count
    end
    
    rect rgb(255, 250, 240)
    note over Store: New approach (batched)
    Test->>Store: CleanAnnotations(cutoff)
    loop untilDoneOrCancelled
        Store->>DB: SELECT IDs WHERE created < cutoff LIMIT batchSize
        DB-->>Store: IDs
        alt IDs empty
            Store->>Store: break loop
        else IDs found
            Store->>DB: DELETE WHERE ID IN (IDs)
            DB-->>Store: affected count
            Store->>Store: accumulate affected, continue
        end
    end
    Store-->>Test: total affected count
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • xorm_store.go: Introduces new batching patterns with untilDoneOrCancelled, fetchIDs, and deleteByIDs helpers; SQLite-specific parameter limit handling requires careful validation
  • cleanup_test.go: Extensive structural changes from unit to integration tests; batched inserts and per-test cleanup patterns need verification for correctness
  • cleanup.go: Trivial constant change; review the rationale for 10x frequency increase (deadlock prevention impact)

Poem

🐰 A rabbit's hop through batch deletes,
No more deadlocks in our little heats—
By ten's the minute now we clean,
The smartest schema you've e'er seen! ✨

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 16.67%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check (❓ Inconclusive): The title 'Database Performance Optimizations' is vague and generic, describing only the high-level outcome rather than the specific changes made. Consider a more specific title, such as 'Implement batched deletion cleanup operations' or 'Add batched cleanup with interval tuning to reduce deadlocks'.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.


coderabbitai bot commented Nov 22, 2025

Note

Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it.


Generating unit tests... This may take up to 20 minutes.


coderabbitai bot commented Nov 22, 2025

UTG Post-Process Complete

No new issues were detected in the generated code and all check runs have completed. The unit test generation process has completed successfully.


coderabbitai bot commented Nov 22, 2025

Creating a PR to put the unit tests in...

The changes have been created in this pull request: View PR

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (9)
pkg/services/annotations/annotationsimpl/xorm_store.go (4)

523-563: Batched cleanup loop looks correct, but logging is too heavy/noisy for normal operation

The new untilDoneOrCancelled usage for both MaxAge and MaxCount paths looks logically sound: you batch by IDs, accumulate totalAffected, and terminate cleanly on affected == 0 or error/cancellation.

However, the logging inside each batch is aggressive and potentially expensive:

  • Uses r.log.Error for routine, successful batches ("Annotations to clean by time/count" and "cleaned annotations by time/count").
  • Logs the full ids slice and the cond string every batch, which can be large and high‑volume under load.

This will generate a lot of error‑level noise and can impact performance and log storage.

Suggest:

  • Downgrade these to Debug (or at most Info) and drop ids from the log fields, keeping only counts and perhaps a short summary of the condition (see the sketch below).
  • Optionally, log only once per cleanup run (e.g., before/after the untilDoneOrCancelled loop) instead of per batch.
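
For illustration, a minimal sketch of the first suggestion, assuming Grafana's log15-style logger behind r.log; the field names are illustrative, not the PR's actual log lines:

```go
// Per-batch: debug level, counts only; no ids slice, no cond string.
r.log.Debug("cleaning annotations batch", "batchSize", len(ids))

// Once per cleanup run, after the untilDoneOrCancelled loop finishes.
r.log.Info("annotation cleanup finished", "totalAffected", totalAffected)
```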

584-595: fetchIDs is safe for current call sites but tightly couples condition shape to callers

fetchIDs builds SELECT id FROM %s WHERE %s with a raw condition string. Given that condition is composed from internal constants (annotationType) and fixed patterns, it’s reasonably safe here, but it means:

  • Callers must always remember to include any ORDER BY / LIMIT in the condition string.
  • The helper is only suitable for tables with an id column.

This is fine for current usage, but if you plan to reuse fetchIDs more broadly, consider:

  • Splitting where, orderBy, and limit into separate arguments, or
  • Accepting a full SQL fragment (including ORDER BY/LIMIT) and making it explicit in the name, e.g., fetchIDsWithClause. A sketch of the split-argument variant follows.
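
A hypothetical shape for the split-argument variant; the name fetchIDsWhere, the receiver type, and the session helper are assumptions for illustration, not the actual Grafana code:

```go
// fetchIDsWhere makes ORDER BY and LIMIT explicit arguments instead of
// requiring callers to embed them in the condition string.
func (r *xormRepositoryImpl) fetchIDsWhere(ctx context.Context, table, where, orderBy string, limit int) ([]int64, error) {
	var ids []int64
	query := fmt.Sprintf("SELECT id FROM %s WHERE %s", table, where)
	if orderBy != "" {
		query += " ORDER BY " + orderBy
	}
	if limit > 0 {
		query += fmt.Sprintf(" LIMIT %d", limit)
	}
	err := r.db.WithDbSession(ctx, func(sess *db.Session) error {
		return sess.SQL(query).Find(&ids)
	})
	return ids, err
}
```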

597-630: SQLite parameter-limit handling should ideally be based on len(ids), not configured batch size

The SQLite path in deleteByIDs uses:

```go
const sqliteParameterLimit = 999
if r.db.GetDBType() == migrator.SQLite && r.cfg.AnnotationCleanupJobBatchSize > sqliteParameterLimit {
    // inline IDs
} else {
    // use placeholders
}
```

This works for current callers because len(ids) is bounded by AnnotationCleanupJobBatchSize, but it couples the behavior to config rather than the actual number of IDs:

  • If another caller ever reuses deleteByIDs with a different batch size, len(ids) could exceed 999 even when AnnotationCleanupJobBatchSize does not, or vice versa.

To make this helper robust in isolation, it would be better to switch on len(ids):

```go
if r.db.GetDBType() == migrator.SQLite && len(ids) > sqliteParameterLimit {
    // inline IDs
} else {
    // placeholders
}
```

This keeps the behavior correct regardless of how callers choose batch sizes.
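
An alternative that makes the helper safe without any coupling to configuration is to chunk the ID list inside deleteByIDs so every DELETE stays under the limit. This is not what the PR does, just a self-contained sketch of the idea:

```go
// chunkIDs splits ids into slices of at most size elements, so each
// DELETE ... WHERE id IN (?, ...) stays under SQLite's parameter limit.
func chunkIDs(ids []int64, size int) [][]int64 {
	var chunks [][]int64
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}
```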


632-638: asAny helper is fine; consider inlining if it stays single-use

asAny cleanly converts []int64 to []any for the Exec call and is correct as written. If it remains used only in deleteByIDs, you might consider inlining it there to keep the helper surface small, but that’s purely cosmetic.

pkg/services/cleanup/cleanup.go (1)

77-85: More frequent cleanup cadence looks OK, but consider stopping the ticker and validating load impact

Changing the ticker to run every minute will cause cleanup (including DB work) to be attempted much more frequently. Since Run calls srv.clean(ctx) synchronously, runs won’t overlap, but you may:

  • Do significantly more frequent annotation and snapshot cleanups.
  • Hit the 9‑minute per-run timeout on long runs; note that time.Ticker drops ticks it cannot deliver, so at most one tick is pending when a run finishes.

Two suggestions:

  • Add defer ticker.Stop() after creating the ticker to avoid leaking resources when Run exits (see the sketch below).
  • Confirm via metrics/load tests that a 1‑minute cadence doesn’t cause undue DB or IO pressure compared to the previous 10‑minute interval.
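
A sketch of the first suggestion; the Run body below is simplified from the description above rather than copied from cleanup.go:

```go
func (srv *CleanUpService) Run(ctx context.Context) error {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop() // release the ticker when Run exits

	for {
		select {
		case <-ticker.C:
			srv.clean(ctx) // synchronous, so cleanup runs never overlap
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```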
pkg/services/annotations/annotationsimpl/cleanup_test.go (4)

18-37: Integration test setup is good; config usage is a bit split but coherent

Renaming to TestIntegrationAnnotationCleanUp, gating with testing.Short(), and switching to db.InitTestDB(t) makes these behave like proper integration tests. The table‑driven structure with createAnnotationsNum / createOldAnnotationsNum and explicit expected counts per annotation type matches the new batched cleanup semantics.

You’re using two configs per test (cfg for batch size via AnnotationCleanupJobBatchSize, and test.cfg for the per‑type MaxAge/MaxCount settings). As long as ProvideCleanupService reads batch size from the first and uses the second only inside Run, this is fine, but it’s worth keeping in mind to avoid confusion if more knobs are added later.


97-111: Large SQLite batch-size test is valuable but quite heavy; consider trimming if runtime becomes an issue

The "should not fail if batch size is larger than SQLITE_MAX_VARIABLE_NUMBER..." case effectively validates:

  • AnnotationCleanupJobBatchSize much larger than SQLite’s parameter limit (32767 vs 999).
  • Cleanup correctly deletes 40000 of 40003 annotations while keeping one per annotation type.

This meaningfully exercises the new deleteByIDs SQLite path. The trade‑off is that inserting and cleaning 40k+ annotations plus tags makes this test relatively expensive.

If CI time ever becomes a concern, you could:

  • Reduce createAnnotationsNum while still exceeding the parameter limit (e.g., ~2× or 3× 1000), and
  • Keep AnnotationCleanupJobBatchSize just over the limit (e.g., 1500) so the special code path is still covered.

For now, it’s acceptable, but it’s something to monitor.
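
Concretely, a trimmed table entry might look like the following; the field names mirror the test structure described above, and the exact values are only a suggestion:

```go
{
	name: "should not fail if batch size is larger than SQLITE_MAX_VARIABLE_NUMBER",
	// ~3x the 999-parameter limit instead of 40k rows: the SQLite
	// inline-ID path is still exercised, but setup is much cheaper.
	createAnnotationsNum:          3000,
	annotationCleanupJobBatchSize: 1500, // just over the limit
},
```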


114-147: Per-test cleanup via t.Cleanup is correct, but you can simplify error handling

The t.Cleanup added for each subtest:

  • Clears both annotation and annotation_tag tables.
  • Uses errors.Join to combine potential errors and asserts NoError.

This is functionally correct and keeps the shared test DB clean across subtests. If you want to simplify slightly, you could:

  • Short-circuit on the first error instead of joining, since a single failure will already fail the test, or
  • Log which DELETE failed to aid debugging.

Not required, but might make debugging a bit cleaner if cleanup ever fails.

If your Go toolchain is older, ensure errors.Join is available (Go 1.20+).
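
A sketch of the short-circuiting variant, which also records which DELETE failed; the testDB variable and the session helper are assumptions based on the test setup described above:

```go
t.Cleanup(func() {
	err := testDB.WithDbSession(context.Background(), func(sess *db.Session) error {
		if _, err := sess.Exec("DELETE FROM annotation"); err != nil {
			return fmt.Errorf("failed to clear annotation table: %w", err)
		}
		if _, err := sess.Exec("DELETE FROM annotation_tag"); err != nil {
			return fmt.Errorf("failed to clear annotation_tag table: %w", err)
		}
		return nil
	})
	require.NoError(t, err)
})
```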


232-292: Batched insert helper looks correct; explicit IDs and tags are consistent with cleanup expectations

In createTestAnnotations:

  • Preallocating newAnnotations and newAnnotationTags and inserting in batches of 500 via InsertMulti is a good way to keep test setup fast, even with tens of thousands of rows.
  • Assigning ID: int64(i + 1) ensures deterministic ordering for the ORDER BY id DESC logic under test, and the annotation_tag rows reference those IDs consistently.
  • Distribution of alert/dashboard/API annotations and Created timestamps matches the table-driven expectations for MaxAge/MaxCount tests.

The overall approach is sound for an integration test fixture. The only minor consideration is that this helper assumes an empty or test-local DB; if more shared state is introduced later, reusing explicit IDs from 1..N could conflict and may need revisiting.
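
For reference, the batched-insert shape described above looks roughly like this; the slice and session names follow the review text, and the loop itself is a reconstruction rather than the exact test code:

```go
const insertBatchSize = 500

for start := 0; start < len(newAnnotations); start += insertBatchSize {
	end := start + insertBatchSize
	if end > len(newAnnotations) {
		end = len(newAnnotations)
	}
	// xorm's InsertMulti issues one multi-row INSERT per sub-slice.
	if _, err := sess.InsertMulti(newAnnotations[start:end]); err != nil {
		return err
	}
}
```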

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f84c8f6 and 81c45bf.

📒 Files selected for processing (3)
  • pkg/services/annotations/annotationsimpl/cleanup_test.go (9 hunks)
  • pkg/services/annotations/annotationsimpl/xorm_store.go (2 hunks)
  • pkg/services/cleanup/cleanup.go (1 hunks)
🔇 Additional comments (2)
pkg/services/annotations/annotationsimpl/xorm_store.go (2)

640-651: untilDoneOrCancelled behavior is sound and preserves partial progress on error/cancel

The loop correctly:

  • Checks ctx.Done() before each batch, returning the ctx.Err() together with totalAffected.
  • Calls batchWork() once per iteration, accumulates totalAffected, and exits when a batch returns affected == 0.

This ensures cleanup jobs report how many rows were processed even when they terminate early due to errors or timeouts, which aligns well with how CleanAnnotations and CleanOrphanedAnnotationTags are using it.
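
Reconstructed from that description, the loop's shape is roughly the following; this is a sketch, not the exact Grafana code:

```go
func untilDoneOrCancelled(ctx context.Context, batchWork func() (int64, error)) (int64, error) {
	var totalAffected int64
	for {
		select {
		case <-ctx.Done():
			// Report partial progress together with the cancellation error.
			return totalAffected, ctx.Err()
		default:
			affected, err := batchWork()
			totalAffected += affected
			if err != nil {
				return totalAffected, err
			}
			if affected == 0 {
				// Nothing left to clean; terminate normally.
				return totalAffected, nil
			}
		}
	}
}
```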


569-582: The annotation_tag schema includes an id column—the original concern is invalid

The migration file (pkg/services/sqlstore/migrations/annotation_mig.go) shows the schema evolution: v2 had only annotation_id and tag_id columns, but v3 (lines 92–102) explicitly adds id as an auto-incrementing primary key. The migration (lines 109–114) copies data from v2 to v3 and drops v2, making v3 the current canonical schema. The fetchIDs and deleteByIDs helper methods (lines 584, 597) correctly assume the presence of an id column. This cleanup functionality shipped in Grafana 7.4.0, confirming the schema migration is active in production deployments.
