
Conversation

@akshayutture-augment (Contributor)

… deadlocks on MySQL (#80329)

* Split subquery when cleaning annotations

* update comment

* Raise batch size, now that we pay attention to it

* Iterate in batches

* Separate cancellable batch implementation to allow for multi-statement callbacks, add overload for single-statement use

* Use split-out utility in outer batching loop so it respects context cancellation

* guard against empty queries

* Use SQL parameters

* Use same approach for tags

* drop unused function

* Work around parameter limit on sqlite for large batches

* Bulk insert test data in DB

* Refactor test to customise test data creation

* Add test for catching SQLITE_MAX_VARIABLE_NUMBER limit

* Turn annotation cleanup test to integration tests

* lint

---------

Co-authored-by: Sofia Papagiannaki <[email protected]>
@akshayutture-augment (Contributor Author)

@claude review this PR



claude bot commented Nov 20, 2025

Claude finished @akshayutture-augment's task.


Code Review Complete ✅

Summary

This PR addresses database deadlock issues on MySQL by refactoring annotation cleanup logic. The changes split the cleanup process into separate query and delete operations to prevent deadlocks caused by concurrent inserts during cleanup. Overall, the implementation is solid with good test coverage and proper handling of database-specific limitations.


🎯 Key Changes

  1. Refactored cleanup strategy - Split query and delete into separate operations to avoid MySQL deadlocks
  2. Batch processing with context cancellation - New untilDoneOrCancelled helper for incremental cleanup
  3. SQLite parameter limit handling - Special case for batch sizes > 999 parameters
  4. Enhanced test coverage - Tests now use bulk insert and include edge cases (40k+ annotations, SQLite limits)
  5. Reduced cleanup interval - Changed from 10 minutes to 1 minute (minor change in cleanup.go)

✅ Strengths

1. Smart Deadlock Avoidance

The core problem is well-addressed. Moving from a single DELETE-with-subquery to separate SELECT+DELETE operations prevents lock escalation on MySQL.

Before (deadlock-prone):

```sql
DELETE FROM annotation WHERE id IN (SELECT id FROM (...) a)
```

After (deadlock-resistant):

```go
ids := fetchIDs(...)  // Releases locks quickly
deleteByIDs(...)      // Separate transaction
```

2. Robust SQLite Handling

The workaround for SQLite's 999-parameter limit is clever:

```go
if r.db.GetDBType() == migrator.SQLite && r.cfg.AnnotationCleanupJobBatchSize > sqliteParameterLimit {
    values := fmt.Sprint(ids[0])
    for _, v := range ids[1:] {
        values = fmt.Sprintf("%s, %d", values, v)
    }
    sql = fmt.Sprintf(`DELETE FROM %s WHERE id IN (%s)`, table, values)
}
```

3. Excellent Test Coverage

The test improvements are significant:

  • Bulk insertion (500 at a time) for performance
  • Tests with 40,003 annotations and batch size 32,767
  • Proper cleanup between test cases
  • Integration test markers

🔴 Critical Issues

1. Error-Level Logging for Normal Operations (pkg/services/annotations/annotationsimpl/xorm_store.go)

Lines 534, 537, 554, 557, 576, 579 use r.log.Error() for what appears to be normal operational logging:

```go
r.log.Error("Annotations to clean by time", "count", len(ids), "ids", ids, "cond", cond, "err", err)
r.log.Error("cleaned annotations by time", "count", len(ids), "affected", x, "err", y)
```

Problem: This will flood error logs with routine cleanup operations, making it hard to spot actual errors.

Fix: Change to r.log.Debug() or r.log.Info():

```go
r.log.Debug("Cleaning annotations by time", "count", len(ids))
r.log.Debug("Cleaned annotations by time", "affected", x)
```

Locations:

  • xorm_store.go:534
  • xorm_store.go:537
  • xorm_store.go:554
  • xorm_store.go:557
  • xorm_store.go:576
  • xorm_store.go:579

2. SQL Injection Risk with String Concatenation (xorm_store.go:609-613)

```go
values := fmt.Sprint(ids[0])
for _, v := range ids[1:] {
    values = fmt.Sprintf("%s, %d", values, v)
}
sql = fmt.Sprintf(`DELETE FROM %s WHERE id IN (%s)`, table, values)
```

Problem: While ids are int64 internally, constructing SQL via string concatenation is risky. The table name is also interpolated unsafely.

Analysis:

  • table parameter comes from trusted internal callers only ("annotation" or "annotation_tag")
  • ids are verified as int64 from database queries
  • Risk is low but pattern is concerning

Recommendation: Consider using a string builder and add validation:

```go
if len(ids) == 0 {
    return 0, nil
}
// Validate table name
if table != "annotation" && table != "annotation_tag" {
    return 0, fmt.Errorf("invalid table name: %s", table)
}

var sb strings.Builder
sb.WriteString("DELETE FROM ")
sb.WriteString(table)
sb.WriteString(" WHERE id IN (")
sb.WriteString(strconv.FormatInt(ids[0], 10))
for _, v := range ids[1:] {
    sb.WriteString(",")
    sb.WriteString(strconv.FormatInt(v, 10))
}
sb.WriteString(")")
sql = sb.String()
```

⚠️ Moderate Issues

3. Excessive Logging of IDs (xorm_store.go:534, 554, 576)

```go
r.log.Error("Annotations to clean by time", "count", len(ids), "ids", ids, ...)
```

Problem: Logging all IDs when batch size could be 32,767 will create massive log entries.

Fix: Log count only, or log first N IDs:

```go
r.log.Debug("Cleaning annotations by time", "count", len(ids))
// Or for debugging:
idsPreview := ids
if len(ids) > 10 {
    idsPreview = ids[:10]
}
r.log.Debug("Cleaning annotations", "count", len(ids), "sample_ids", idsPreview)
```

4. Logging Error Variable When No Error (xorm_store.go:534, 554, 576)

```go
ids, err := r.fetchIDs(ctx, "annotation", cond)
if err != nil {
    return 0, err
}
r.log.Error("Annotations to clean by time", "count", len(ids), "ids", ids, "cond", cond, "err", err)
```

Problem: err is logged even though it was already checked to be nil on this path, which is confusing and always emits a useless empty field.

Fix: Remove "err", err from log statement if no error occurred.

5. Inefficient String Building (xorm_store.go:609-613)

```go
values := fmt.Sprint(ids[0])
for _, v := range ids[1:] {
    values = fmt.Sprintf("%s, %d", values, v)  // Creates a new string each iteration
}
```

Problem: Repeated string concatenation with fmt.Sprintf is O(n²) for large batches.

Fix: Use strings.Builder:

```go
var sb strings.Builder
sb.WriteString(strconv.FormatInt(ids[0], 10))
for _, v := range ids[1:] {
    sb.WriteString(", ")
    sb.WriteString(strconv.FormatInt(v, 10))
}
values := sb.String()
```

6. Missing Table Name Validation (xorm_store.go:584, 597)

The fetchIDs and deleteByIDs functions accept a table parameter but don't validate it. While current callers are trusted, this is a defensive programming issue.

Locations:

  • xorm_store.go:584 (fetchIDs)
  • xorm_store.go:597 (deleteByIDs)

💡 Minor Suggestions

7. Potential Under-Deletion Documented But Not Monitored

Comment on line 524-525:

"This may under-delete when concurrent inserts happen, but any such annotations will simply be cleaned on the next cycle."

Suggestion: Consider adding metrics to track if the same annotations are attempted to be cleaned multiple times, which could indicate this edge case is occurring frequently.

8. Cleanup Interval Change Unexplained (cleanup.go:77)

```diff
-ticker := time.NewTicker(time.Minute * 10)
+ticker := time.NewTicker(time.Minute * 1)
```

Question: Was this intentional for testing, or is 1-minute interval the new production value? This dramatically increases cleanup frequency (10x). Should this be configurable?

9. Test Names (cleanup_test.go)

The test names now have TestIntegration prefix, which is good. However, consider using build tags instead:

```go
//go:build integration
// +build integration
```

This allows running go test -tags=integration to specifically run/skip these tests.

10. Magic Number (xorm_store.go:607)

```go
const sqliteParameterLimit = 999
```

Good! But add a comment explaining why:

```go
// SQLite versions prior to 3.32.0 had a limit of 999 parameters per query.
// Even in newer versions, this is a reasonable default limit.
const sqliteParameterLimit = 999
```

🧪 Testing

Test Quality: Excellent ✅

  • Tests cover normal operations, edge cases, and database limits
  • Proper use of table-driven tests
  • Good use of cleanup functions
  • Bulk insert optimization (500 records at a time) significantly improves test performance

Test Concerns:

  • The 40,003 annotation test might be slow in CI
  • Consider adding a shorter smoke test variant

📊 Performance Impact

Positive:

  • ✅ Eliminates deadlocks on MySQL (main goal achieved)
  • ✅ Batch processing allows incremental progress
  • ✅ Context cancellation prevents runaway queries

Considerations:

  • ⚠️ Two-phase approach (query then delete) means more round trips to DB
  • ⚠️ Cleanup interval reduced to 1 minute (10x more frequent) - verify this is intended
  • ⚠️ For SQLite with large batches, string building creates large SQL statements

🔒 Security Assessment

Low Risk Overall ✅

  • No user input directly affects queries
  • Table names and IDs are from trusted internal sources
  • SQL injection risk is theoretical but worth addressing

📝 Recommendations Summary

Must Fix (Before Merge):

  1. ❗ Change error-level logging to debug/info for normal operations (6 locations)
  2. ❗ Remove excessive ID logging or limit to sample
  3. ❗ Fix inefficient string building in SQLite path

Should Fix:

  1. Add table name validation in fetchIDs and deleteByIDs
  2. Use strings.Builder for SQL construction
  3. Clarify if 1-minute cleanup interval is intentional

Nice to Have:

  1. Add metrics for under-deletion tracking
  2. Add comment explaining SQLite parameter limit
  3. Consider build tags for integration tests

Verdict: ✅ Approve with Changes

This is a well-thought-out solution to a real production problem. The core logic is sound, and the test coverage is excellent. The main issues are logging-related (severity level and volume), which should be fixed before merge. The performance and security concerns are minor and don't block the PR.

Estimated effort to address critical issues: 15-30 minutes

