-
-
Notifications
You must be signed in to change notification settings - Fork 238
dolthub/dolt#8893 - Fix charset validation to match MySQL behavior for issue #8893 #3078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
e5652ca to
1f1c420
Compare
nicktobey
approved these changes
Jul 11, 2025
425d94d to
5187bfe
Compare
This commit addresses issue #8893 by implementing proper context-based charset error handling that matches MySQL behavior exactly. Key changes: - Fix context handling in insert.go to preserve *sql.Context type - Add column/row context to charset validation errors - Implement MySQL-compatible strict/non-strict mode behavior - Add truncation for invalid UTF8 in non-strict mode - Centralize strict mode validation in sql.ValidateStrictMode() - Fix enum zero validation regression - Add comprehensive test coverage for all scenarios The implementation now provides proper error messages with column and row context, and offers customers a migration path using non-strict mode. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Replace overly complex customer scenario simulation with a focused test that directly addresses the customer's issue from #8893: - Test inserting latin1-encoded "DoltLab®" into utf8mb4 column - Verify error shows column name (not "<unknown>") - Demonstrate non-strict mode migration solution - Confirm data truncation and queryability The test is now concise (5 assertions vs 12) and specifically validates the customer's problem and our fix. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
… bats test The 'charset validation CSV import simulation' test was just using INSERT statements with UNHEX - not actually testing CSV import functionality. A proper CSV import test should: - Create real CSV files with invalid UTF-8 data - Use dolt's actual CSV import commands - Live in the dolt repo as a bats test This will be implemented in a separate PR branch in the dolt repo.
- Add systematic edge case tests for UTF-8 validation - Test overlong sequences, surrogates, invalid bytes, incomplete sequences - Verify GMS matches MySQL behavior exactly for all edge cases - Document critical discovery: UTF-8 validation is always strict for UTF8MB4 columns - Update memory_stack.md with comprehensive MySQL comparison testing results - Establish baseline for production-ready UTF-8 charset validation 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
… tests - Fix incorrect int32 -> int64 data type changes in charset_collation_engine.go - Remove misplaced edge case tests from charset_collation_engine.go - Add comprehensive UTF-8 charset validation edge case tests to script_queries.go - Test formatInvalidByteForError function with 21 edge case scenarios - Cover overlong sequences, surrogates, invalid bytes, incomplete sequences - Test both strict mode (errors) and non-strict mode (truncation) behavior - All tests verified against MySQL 8.0 comparison server for accuracy - Ensure GMS matches MySQL behavior exactly for all UTF-8 validation scenarios 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
- Added 21 edge case tests for formatInvalidByteForError function - Added 19 ASCII range tests covering full 0x00-0x7F range - Test function boundary constants (asciiMin=32, asciiMax=127) - Verify error message format matches MySQL exactly - Test both strict and non-strict mode behaviors - All test expectations verified against MySQL 8.0 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
- Remove misleading comment about customer complaints regarding <unknown> - Remove duplicate test for long invalid sequence (999897969594939291) - Correct test descriptions to focus on charset validation functionality 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
1. Explicitly specify utf8mb4 charset for test table columns instead of relying on defaults 2. Fix formatInvalidByteForError to properly detect invalid UTF-8 sequences using utf8.DecodeRune instead of incorrect byte > 127 check 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
…multibyte test 1. Fix table charset specification to use table-level charset instead of column-level 2. Add test case for UTF-8 multibyte sequences to validate fix for b > asciiMax bug - Tests é (0xC3 0xA9), € (0xE2 0x82 0xAC), 🍕 (0xF0 0x9F 0x8D 0x95) - all valid UTF-8 with bytes > 127 - Ensures the formatInvalidByteForError fix correctly handles multibyte sequences 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
The "charset validation UTF-8 multibyte sequence validation" test was duplicating functionality already covered by existing tests: - "charset validation ASCII range tests" already tests é, €, 🍕 multibyte sequences - "charset validation edge cases" already covers invalid UTF-8 sequence rejection This cleanup reduces test redundancy while maintaining full coverage of the formatInvalidByteForError fix for proper UTF-8 validation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Add Dialect: "mysql" to all charset validation tests to prevent them from running against doltgres (PostgreSQL), which doesn't support sql_mode. This resolves test failures where doltgres was trying to execute MySQL-specific sql_mode commands like 'STRICT_TRANS_TABLES' that don't exist in PostgreSQL. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
47bb835 to
be19fb0
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes dolthub/dolt#8893
This allows querying existing invalid UTF-8 data (enabling cleanup) while preventing new invalid data insertion with proper error messages.