dolthub/dolt#8893 - Fix charset validation to match MySQL behavior for issue #8893 #3078

elianddb · 2025-07-08T21:09:19Z

Fix charset validation error messages to show proper column names and row numbers.
Add MySQL-compatible strict/non-strict mode behavior for invalid UTF-8 handling
Preserve charset validation in ConvertToBytes() method (data insertion)
Remove charset validation from SQL() method (query results)

This allows querying existing invalid UTF-8 data (enabling cleanup) while preventing new invalid data insertion with proper error messages.

enginetest/queries/script_queries.go

sql/types/strings.go

This commit addresses issue #8893 by implementing proper context-based charset error handling that matches MySQL behavior exactly. Key changes: - Fix context handling in insert.go to preserve *sql.Context type - Add column/row context to charset validation errors - Implement MySQL-compatible strict/non-strict mode behavior - Add truncation for invalid UTF8 in non-strict mode - Centralize strict mode validation in sql.ValidateStrictMode() - Fix enum zero validation regression - Add comprehensive test coverage for all scenarios The implementation now provides proper error messages with column and row context, and offers customers a migration path using non-strict mode. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Replace overly complex customer scenario simulation with a focused test that directly addresses the customer's issue from #8893: - Test inserting latin1-encoded "DoltLab®" into utf8mb4 column - Verify error shows column name (not "<unknown>") - Demonstrate non-strict mode migration solution - Confirm data truncation and queryability The test is now concise (5 assertions vs 12) and specifically validates the customer's problem and our fix. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

… bats test The 'charset validation CSV import simulation' test was just using INSERT statements with UNHEX - not actually testing CSV import functionality. A proper CSV import test should: - Create real CSV files with invalid UTF-8 data - Use dolt's actual CSV import commands - Live in the dolt repo as a bats test This will be implemented in a separate PR branch in the dolt repo.

- Add systematic edge case tests for UTF-8 validation - Test overlong sequences, surrogates, invalid bytes, incomplete sequences - Verify GMS matches MySQL behavior exactly for all edge cases - Document critical discovery: UTF-8 validation is always strict for UTF8MB4 columns - Update memory_stack.md with comprehensive MySQL comparison testing results - Establish baseline for production-ready UTF-8 charset validation 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

… tests - Fix incorrect int32 -> int64 data type changes in charset_collation_engine.go - Remove misplaced edge case tests from charset_collation_engine.go - Add comprehensive UTF-8 charset validation edge case tests to script_queries.go - Test formatInvalidByteForError function with 21 edge case scenarios - Cover overlong sequences, surrogates, invalid bytes, incomplete sequences - Test both strict mode (errors) and non-strict mode (truncation) behavior - All tests verified against MySQL 8.0 comparison server for accuracy - Ensure GMS matches MySQL behavior exactly for all UTF-8 validation scenarios 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

- Added 21 edge case tests for formatInvalidByteForError function - Added 19 ASCII range tests covering full 0x00-0x7F range - Test function boundary constants (asciiMin=32, asciiMax=127) - Verify error message format matches MySQL exactly - Test both strict and non-strict mode behaviors - All test expectations verified against MySQL 8.0 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

- Remove misleading comment about customer complaints regarding <unknown> - Remove duplicate test for long invalid sequence (999897969594939291) - Correct test descriptions to focus on charset validation functionality 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

1. Explicitly specify utf8mb4 charset for test table columns instead of relying on defaults 2. Fix formatInvalidByteForError to properly detect invalid UTF-8 sequences using utf8.DecodeRune instead of incorrect byte > 127 check 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

…multibyte test 1. Fix table charset specification to use table-level charset instead of column-level 2. Add test case for UTF-8 multibyte sequences to validate fix for b > asciiMax bug - Tests é (0xC3 0xA9), € (0xE2 0x82 0xAC), 🍕 (0xF0 0x9F 0x8D 0x95) - all valid UTF-8 with bytes > 127 - Ensures the formatInvalidByteForError fix correctly handles multibyte sequences 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

The "charset validation UTF-8 multibyte sequence validation" test was duplicating functionality already covered by existing tests: - "charset validation ASCII range tests" already tests é, €, 🍕 multibyte sequences - "charset validation edge cases" already covers invalid UTF-8 sequence rejection This cleanup reduces test redundancy while maintaining full coverage of the formatInvalidByteForError fix for proper UTF-8 validation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Add Dialect: "mysql" to all charset validation tests to prevent them from running against doltgres (PostgreSQL), which doesn't support sql_mode. This resolves test failures where doltgres was trying to execute MySQL-specific sql_mode commands like 'STRICT_TRANS_TABLES' that don't exist in PostgreSQL. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

elianddb force-pushed the elianddb/8893-charset-validation-fix branch 13 times, most recently from e5652ca to 1f1c420 Compare July 11, 2025 20:47

elianddb requested a review from nicktobey July 11, 2025 21:11

nicktobey approved these changes Jul 11, 2025

View reviewed changes

enginetest/queries/script_queries.go Outdated Show resolved Hide resolved

sql/types/strings.go Outdated Show resolved Hide resolved

elianddb force-pushed the elianddb/8893-charset-validation-fix branch from 425d94d to 5187bfe Compare July 11, 2025 22:50

elianddb and others added 12 commits July 14, 2025 15:56

fix use of constants in formatInvalidByteForError

dfd1cfb

elianddb force-pushed the elianddb/8893-charset-validation-fix branch from 47bb835 to be19fb0 Compare July 14, 2025 15:56

[ga-format-pr] Run ./format_repo.sh to fix formatting

64100d3

elianddb merged commit 9ed5730 into main Jul 14, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

dolthub/dolt#8893 - Fix charset validation to match MySQL behavior for issue #8893 #3078

dolthub/dolt#8893 - Fix charset validation to match MySQL behavior for issue #8893 #3078

Uh oh!

elianddb commented Jul 8, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

dolthub/dolt#8893 - Fix charset validation to match MySQL behavior for issue #8893 #3078

dolthub/dolt#8893 - Fix charset validation to match MySQL behavior for issue #8893 #3078

Uh oh!

Conversation

elianddb commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

elianddb commented Jul 8, 2025 •

edited

Loading