Skip to content

Conversation

@elianddb
Copy link
Contributor

@elianddb elianddb commented Jul 8, 2025

Fixes dolthub/dolt#8893

  • Fix charset validation error messages to show proper column names and row numbers.
  • Add MySQL-compatible strict/non-strict mode behavior for invalid UTF-8 handling
  • Preserve charset validation in ConvertToBytes() method (data insertion)
  • Remove charset validation from SQL() method (query results)

This allows querying existing invalid UTF-8 data (enabling cleanup) while preventing new invalid data insertion with proper error messages.

@elianddb elianddb force-pushed the elianddb/8893-charset-validation-fix branch 13 times, most recently from e5652ca to 1f1c420 Compare July 11, 2025 20:47
@elianddb elianddb requested a review from nicktobey July 11, 2025 21:11
@elianddb elianddb force-pushed the elianddb/8893-charset-validation-fix branch from 425d94d to 5187bfe Compare July 11, 2025 22:50
elianddb and others added 12 commits July 14, 2025 15:56
This commit addresses issue #8893 by implementing proper context-based
charset error handling that matches MySQL behavior exactly.

Key changes:
- Fix context handling in insert.go to preserve *sql.Context type
- Add column/row context to charset validation errors
- Implement MySQL-compatible strict/non-strict mode behavior
- Add truncation for invalid UTF8 in non-strict mode
- Centralize strict mode validation in sql.ValidateStrictMode()
- Fix enum zero validation regression
- Add comprehensive test coverage for all scenarios

The implementation now provides proper error messages with column and row
context, and offers customers a migration path using non-strict mode.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Replace overly complex customer scenario simulation with a focused test
that directly addresses the customer's issue from #8893:

- Test inserting latin1-encoded "DoltLab®" into utf8mb4 column
- Verify error shows column name (not "<unknown>")
- Demonstrate non-strict mode migration solution
- Confirm data truncation and queryability

The test is now concise (5 assertions vs 12) and specifically validates
the customer's problem and our fix.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… bats test

The 'charset validation CSV import simulation' test was just using INSERT
statements with UNHEX - not actually testing CSV import functionality.

A proper CSV import test should:
- Create real CSV files with invalid UTF-8 data
- Use dolt's actual CSV import commands
- Live in the dolt repo as a bats test

This will be implemented in a separate PR branch in the dolt repo.
- Add systematic edge case tests for UTF-8 validation
- Test overlong sequences, surrogates, invalid bytes, incomplete sequences
- Verify GMS matches MySQL behavior exactly for all edge cases
- Document critical discovery: UTF-8 validation is always strict for UTF8MB4 columns
- Update memory_stack.md with comprehensive MySQL comparison testing results
- Establish baseline for production-ready UTF-8 charset validation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… tests

- Fix incorrect int32 -> int64 data type changes in charset_collation_engine.go
- Remove misplaced edge case tests from charset_collation_engine.go
- Add comprehensive UTF-8 charset validation edge case tests to script_queries.go
- Test formatInvalidByteForError function with 21 edge case scenarios
- Cover overlong sequences, surrogates, invalid bytes, incomplete sequences
- Test both strict mode (errors) and non-strict mode (truncation) behavior
- All tests verified against MySQL 8.0 comparison server for accuracy
- Ensure GMS matches MySQL behavior exactly for all UTF-8 validation scenarios

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Added 21 edge case tests for formatInvalidByteForError function
- Added 19 ASCII range tests covering full 0x00-0x7F range
- Test function boundary constants (asciiMin=32, asciiMax=127)
- Verify error message format matches MySQL exactly
- Test both strict and non-strict mode behaviors
- All test expectations verified against MySQL 8.0

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Remove misleading comment about customer complaints regarding <unknown>
- Remove duplicate test for long invalid sequence (999897969594939291)
- Correct test descriptions to focus on charset validation functionality

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
1. Explicitly specify utf8mb4 charset for test table columns instead of relying on defaults
2. Fix formatInvalidByteForError to properly detect invalid UTF-8 sequences using utf8.DecodeRune instead of incorrect byte > 127 check

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…multibyte test

1. Fix table charset specification to use table-level charset instead of column-level
2. Add test case for UTF-8 multibyte sequences to validate fix for b > asciiMax bug
   - Tests é (0xC3 0xA9), € (0xE2 0x82 0xAC), 🍕 (0xF0 0x9F 0x8D 0x95) - all valid UTF-8 with bytes > 127
   - Ensures the formatInvalidByteForError fix correctly handles multibyte sequences

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The "charset validation UTF-8 multibyte sequence validation" test was duplicating
functionality already covered by existing tests:
- "charset validation ASCII range tests" already tests é, €, 🍕 multibyte sequences
- "charset validation edge cases" already covers invalid UTF-8 sequence rejection

This cleanup reduces test redundancy while maintaining full coverage of the
formatInvalidByteForError fix for proper UTF-8 validation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Add Dialect: "mysql" to all charset validation tests to prevent them from
running against doltgres (PostgreSQL), which doesn't support sql_mode.

This resolves test failures where doltgres was trying to execute MySQL-specific
sql_mode commands like 'STRICT_TRANS_TABLES' that don't exist in PostgreSQL.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@elianddb elianddb force-pushed the elianddb/8893-charset-validation-fix branch from 47bb835 to be19fb0 Compare July 14, 2025 15:56
@elianddb elianddb merged commit 9ed5730 into main Jul 14, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Invalid string for charset utf8mb4

3 participants