Skip to content

Remove counts from CalculateNextIterationRangeEndValues that caused problems with --panic-on-warnings #1557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

grodowski
Copy link
Contributor

@grodowski grodowski commented May 22, 2025

The split between Temptable and Offset query builders was originally introduced in #471. Then, both query builders have been changed to calculate actual chunk sizes in #1500.

The updated implementation of the Offset query introduced a bug, where a missing order by clause in select_osc_chunk can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order.

An obvious fix would be to just add an order by to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them.

Update: using just one query builder was the initial approach in 660cbf9, however based on our findings that the count query can produce unpredictable results with concurrent writes (#1557 (comment)), I propose to drop the row count check in adc7049. This will hopefully mitigate all count-related bugs, bring back the old query builders and make PanicOnWarnigns fail in cases where rows are not being dropped, but data is truncated (see added test with a truncated varchar column).

Related issue: #1498

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

The split between Temptable and Offset query builders was originally introduced in github#471. Then, both query builders have been changed to calculate actual chunk sizes in github#1500.

The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order.

An obvious fix would be to just add an `order by` to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them.

Alternatively, the builder could only use the less performant count query variants when `--panic-on-warnings` is enabled and otherwise use the simpler ones from pull/471. We decided to not follow this path for now, hoping `--panic-on-warnings` becomes an updated and safer default in the future.

Co-authored-by: Bastian Bartmann <[email protected]>
@Copilot Copilot AI review requested due to automatic review settings May 22, 2025 16:30
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR consolidates the unique-key range end query to use only the temptable-based builder, removing the offset-based variant and simplifying the applier logic accordingly.

  • Removed the BuildUniqueKeyRangeEndPreparedQueryViaOffset function and now exclusively use the temptable builder.
  • Updated the applier to invoke only BuildUniqueKeyRangeEndPreparedQueryViaTemptable.
  • Renamed the corresponding test to TestBuildUniqueKeyRangeEndPreparedQueryViaTemptable.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
go/sql/builder_test.go Renamed test function to match the temptable-only builder
go/sql/builder.go Deleted the offset-based builder, retaining only the temptable variant
go/logic/applier.go Removed loop over both builders and now calls only the temptable builder

@meiji163
Copy link
Contributor

meiji163 commented May 22, 2025

Thanks @grodowski, we also noticed this bug in production.

We also noticed a smaller problem with the range size logic. Between the execution of CalculateNextIterationRangeEndValues and ApplyIterationInsertQuery there could be new rows inserted/deleted into the range causing expectedRowCount to be inaccurate. With --panic-on-warnings this could lead to the migration erroneously panicking.

gh-ost/go/logic/migrator.go

Lines 1278 to 1281 in 5c88f54

if expectedRangeSize != rowsAffected {
joinedWarnings := strings.Join(this.migrationContext.MigrationLastInsertSQLWarnings, "; ")
terminateRowIteration(fmt.Errorf("ApplyIterationInsertQuery failed because of SQL warnings: [%s]", joinedWarnings))
}

For performance reasons we probably don't want to use a gap lock to ensure expectedRowCount is accurate. So we may have to abandon the idea of using expectedRowCount

@grodowski
Copy link
Contributor Author

Thanks for the context @meiji163, it does sound like, given our findings, we should abandon the idea of checking the row count with --panic-on-warnings. The original inspiration lives here, but the logic might also contain bugs (e.g. expected_rows = top - bottom + 1 causing the warning to not fire in some scenarios).

Do you see any downsides of just dropping the row count check? I don't know there are types of warnings we might need to ignore besides the ones on insert ignore, though I do think we have the tooling to try it safely at my team. Then, the query builders can also go back to their more performant counterparts from #471.

- Removed count subqueries from range builders to restore the more performant approach from PR github#471
- Modified panic-on-warnings logic to trigger errors based solely on SQL warnings, not on row count mismatches
- This addresses potential race conditions where row count comparisons could produce false positives due to concurrent table modifications
@grodowski grodowski changed the title Fix CalculateNextIterationRangeEndValues order by Remove counts from CalculateNextIterationRangeEndValues that caused problems with --panic-on-warnings Jun 5, 2025
@grodowski
Copy link
Contributor Author

@meiji163 added adc7049 that removes expectedRangeSize, updated the PR description and also wrote a new test case to cover warnings without changes in row count.

@meiji163 meiji163 merged commit 7c18055 into github:master Jun 5, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants