
Conversation

@akshayutture-augment (Contributor)

… deadlocks on MySQL (#80329)

* Split subquery when cleaning annotations

* update comment

* Raise batch size, now that we pay attention to it

* Iterate in batches

* Separate cancellable batch implementation to allow for multi-statement callbacks, add overload for single-statement use

* Use split-out utility in outer batching loop so it respects context cancellation

* guard against empty queries

* Use SQL parameters

* Use same approach for tags

* drop unused function

* Work around parameter limit on sqlite for large batches

* Bulk insert test data in DB

* Refactor test to customise test data creation

* Add test for catching SQLITE_MAX_VARIABLE_NUMBER limit

* Turn annotation cleanup test to integration tests

* lint

---------

Co-authored-by: Sofia Papagiannaki <[email protected]>
@akshayutture-augment (Contributor Author)

@claude review this PR



claude bot commented Nov 20, 2025

Claude finished @akshayutture-augment's task.


Code Review Complete ✅

Summary

This PR addresses database deadlock issues on MySQL by refactoring annotation cleanup logic. The changes split the cleanup process into separate query and delete operations to prevent deadlocks caused by concurrent inserts during cleanup. Overall, the implementation is solid with good test coverage and proper handling of database-specific limitations.


🎯 Key Changes

  1. Refactored cleanup strategy - Split query and delete into separate operations to avoid MySQL deadlocks
  2. Batch processing with context cancellation - New untilDoneOrCancelled helper for incremental cleanup
  3. SQLite parameter limit handling - Special case for batch sizes > 999 parameters
  4. Enhanced test coverage - Tests now use bulk insert and include edge cases (40k+ annotations, SQLite limits)
  5. Reduced cleanup interval - Changed from 10 minutes to 1 minute (minor change in cleanup.go)

✅ Strengths

1. Smart Deadlock Avoidance

The core problem is well-addressed. Moving from a single DELETE-with-subquery to separate SELECT+DELETE operations prevents lock escalation on MySQL.

Before (deadlock-prone):

```sql
DELETE FROM annotation WHERE id IN (SELECT id FROM (...) a)
```

After (deadlock-resistant):

```go
ids := fetchIDs(...)  // Releases locks quickly
deleteByIDs(...)      // Separate transaction
```

2. Robust SQLite Handling

The workaround for SQLite's 999-parameter limit is clever:

```go
if r.db.GetDBType() == migrator.SQLite && r.cfg.AnnotationCleanupJobBatchSize > sqliteParameterLimit {
    values := fmt.Sprint(ids[0])
    for _, v := range ids[1:] {
        values = fmt.Sprintf("%s, %d", values, v)
    }
    sql = fmt.Sprintf(`DELETE FROM %s WHERE id IN (%s)`, table, values)
}
```

3. Excellent Test Coverage

The test improvements are significant:

  • Bulk insertion (500 at a time) for performance
  • Tests with 40,003 annotations and batch size 32,767
  • Proper cleanup between test cases
  • Integration test markers

🔴 Critical Issues

1. Error-Level Logging for Normal Operations (pkg/services/annotations/annotationsimpl/xorm_store.go)

Lines 534, 537, 554, 557, 576, 579 use r.log.Error() for what appears to be normal operational logging:

```go
r.log.Error("Annotations to clean by time", "count", len(ids), "ids", ids, "cond", cond, "err", err)
r.log.Error("cleaned annotations by time", "count", len(ids), "affected", x, "err", y)
```

Problem: This will flood error logs with routine cleanup operations, making it hard to spot actual errors.

Fix: Change to r.log.Debug() or r.log.Info():

```go
r.log.Debug("Cleaning annotations by time", "count", len(ids))
r.log.Debug("Cleaned annotations by time", "affected", x)
```

Locations:

  • xorm_store.go:534
  • xorm_store.go:537
  • xorm_store.go:554
  • xorm_store.go:557
  • xorm_store.go:576
  • xorm_store.go:579

2. SQL Injection Risk with String Concatenation (xorm_store.go:609-613)

```go
values := fmt.Sprint(ids[0])
for _, v := range ids[1:] {
    values = fmt.Sprintf("%s, %d", values, v)
}
sql = fmt.Sprintf(`DELETE FROM %s WHERE id IN (%s)`, table, values)
```

Problem: While ids are int64 internally, constructing SQL via string concatenation is risky. The table name is also interpolated unsafely.

Analysis:

  • table parameter comes from trusted internal callers only ("annotation" or "annotation_tag")
  • ids are verified as int64 from database queries
  • Risk is low but pattern is concerning

Recommendation: Consider using a string builder and add validation:

```go
if len(ids) == 0 {
    return 0, nil
}
// Validate table name
if table != "annotation" && table != "annotation_tag" {
    return 0, fmt.Errorf("invalid table name: %s", table)
}

var sb strings.Builder
sb.WriteString("DELETE FROM ")
sb.WriteString(table)
sb.WriteString(" WHERE id IN (")
sb.WriteString(strconv.FormatInt(ids[0], 10))
for _, v := range ids[1:] {
    sb.WriteString(",")
    sb.WriteString(strconv.FormatInt(v, 10))
}
sb.WriteString(")")
sql = sb.String()
```

⚠️ Moderate Issues

3. Excessive Logging of IDs (xorm_store.go:534, 554, 576)

```go
r.log.Error("Annotations to clean by time", "count", len(ids), "ids", ids, ...)
```

Problem: Logging all IDs when batch size could be 32,767 will create massive log entries.

Fix: Log count only, or log first N IDs:

```go
r.log.Debug("Cleaning annotations by time", "count", len(ids))
// Or for debugging:
idsPreview := ids
if len(ids) > 10 {
    idsPreview = ids[:10]
}
r.log.Debug("Cleaning annotations", "count", len(ids), "sample_ids", idsPreview)
```

4. Logging Error Variable When No Error (xorm_store.go:534, 554, 576)

```go
ids, err := r.fetchIDs(ctx, "annotation", cond)
if err != nil {
    return 0, err
}
r.log.Error("Annotations to clean by time", "count", len(ids), "ids", ids, "cond", cond, "err", err)
```

Problem: err is logged even though it was already checked to be nil on this path, which is confusing and always emits a useless empty field.

Fix: Remove "err", err from log statement if no error occurred.

5. Inefficient String Building (xorm_store.go:609-613)

```go
values := fmt.Sprint(ids[0])
for _, v := range ids[1:] {
    values = fmt.Sprintf("%s, %d", values, v)  // Creates a new string each iteration
}
```

Problem: Repeated string concatenation with fmt.Sprintf is O(n²) for large batches.

Fix: Use strings.Builder:

```go
var sb strings.Builder
sb.WriteString(strconv.FormatInt(ids[0], 10))
for _, v := range ids[1:] {
    sb.WriteString(", ")
    sb.WriteString(strconv.FormatInt(v, 10))
}
values := sb.String()
```

6. Missing Table Name Validation (xorm_store.go:584, 597)

The fetchIDs and deleteByIDs functions accept a table parameter but don't validate it. While current callers are trusted, this is a defensive programming issue.

Locations:

  • xorm_store.go:584 (fetchIDs)
  • xorm_store.go:597 (deleteByIDs)

💡 Minor Suggestions

7. Potential Under-Deletion Documented But Not Monitored

Comment on line 524-525:

"This may under-delete when concurrent inserts happen, but any such annotations will simply be cleaned on the next cycle."

Suggestion: Consider adding metrics to track if the same annotations are attempted to be cleaned multiple times, which could indicate this edge case is occurring frequently.

8. Cleanup Interval Change Unexplained (cleanup.go:77)

```diff
-ticker := time.NewTicker(time.Minute * 10)
+ticker := time.NewTicker(time.Minute * 1)
```

Question: Was this intentional for testing, or is 1-minute interval the new production value? This dramatically increases cleanup frequency (10x). Should this be configurable?

9. Test Names (cleanup_test.go)

The test names now have TestIntegration prefix, which is good. However, consider using build tags instead:

```go
//go:build integration
// +build integration
```

This allows running go test -tags=integration to specifically run/skip these tests.

10. Magic Number (xorm_store.go:607)

```go
const sqliteParameterLimit = 999
```

Good! But add a comment explaining why:

```go
// SQLite versions prior to 3.32.0 had a limit of 999 parameters per query.
// Even in newer versions, this is a reasonable default limit.
const sqliteParameterLimit = 999
```

🧪 Testing

Test Quality: Excellent ✅

  • Tests cover normal operations, edge cases, and database limits
  • Proper use of table-driven tests
  • Good use of cleanup functions
  • Bulk insert optimization (500 records at a time) significantly improves test performance

Test Concerns:

  • The 40,003 annotation test might be slow in CI
  • Consider adding a shorter smoke test variant

📊 Performance Impact

Positive:

  • ✅ Eliminates deadlocks on MySQL (main goal achieved)
  • ✅ Batch processing allows incremental progress
  • ✅ Context cancellation prevents runaway queries

Considerations:

  • ⚠️ Two-phase approach (query then delete) means more round trips to DB
  • ⚠️ Cleanup interval reduced to 1 minute (10x more frequent) - verify this is intended
  • ⚠️ For SQLite with large batches, string building creates large SQL statements

🔒 Security Assessment

Low Risk Overall ✅

  • No user input directly affects queries
  • Table names and IDs are from trusted internal sources
  • SQL injection risk is theoretical but worth addressing

📝 Recommendations Summary

Must Fix (Before Merge):

  1. ❗ Change error-level logging to debug/info for normal operations (6 locations)
  2. ❗ Remove excessive ID logging or limit to sample
  3. ❗ Fix inefficient string building in SQLite path

Should Fix:

  1. Add table name validation in fetchIDs and deleteByIDs
  2. Use strings.Builder for SQL construction
  3. Clarify if 1-minute cleanup interval is intentional

Nice to Have:

  1. Add metrics for under-deletion tracking
  2. Add comment explaining SQLite parameter limit
  3. Consider build tags for integration tests

Verdict: ✅ Approve with Changes

This is a well-thought-out solution to a real production problem. The core logic is sound, and the test coverage is excellent. The main issues are logging-related (severity level and volume), which should be fixed before merge. The performance and security concerns are minor and don't block the PR.

Estimated effort to address critical issues: 15-30 minutes

