MygramDB v1.3.2 Release Notes

Release Date: 2025-11-25
Type: Patch Release (Critical Bug Fixes)
Previous Version: v1.3.1


Overview

Version 1.3.2 is a critical patch release that fixes severe binlog event parsing bugs that prevented MySQL replication from working correctly in v1.3.0 and v1.3.1. These bugs caused TABLE_MAP_EVENT and ROWS_EVENT parsing failures, leading to complete replication breakdown.

⚠️ CRITICAL: All users with MySQL replication enabled should upgrade immediately. Without this fix, replication will not work correctly.


🐛 Critical Bug Fixes

1. Binlog Event Parsing Offset Error - Duplicate OK Byte Skip

Severity: Critical - Replication Failure

Problem: BinlogEventParser was skipping the MySQL protocol OK byte (0x00) at the beginning of each binlog event buffer, but BinlogReader had already skipped it. This caused a 1-byte offset error in all event parsing.

Impact:

  • All binlog events parsed with incorrect offset
  • TABLE_MAP_EVENT parsing completely broken (database/table names read from wrong positions)
  • ROWS_EVENT parsing failures (wrong event type detection, wrong field positions)
  • Complete replication breakdown
  • Silent failures (events skipped, no error messages in some cases)

Root Cause:

The MySQL C API mysql_binlog_fetch() returns a buffer with the following format:

[OK byte (0x00)][binlog event data...]

Both BinlogReader and BinlogEventParser were skipping the OK byte:

  • BinlogReader at line 598: const unsigned char* event_buffer = rpl.buffer + 1;
  • BinlogEventParser at line 156: buffer++; length--;

As a result, every parser read from one byte past the correct position.

Fix:

  • Remove duplicate OK byte skip from BinlogEventParser::ParseBinlogEvent() (lines 156-157)
  • Document that BinlogReader already skips the OK byte, so parsers receive clean event data
  • Update all event parsers to use buffer directly without additional offset
  • Add MySQL source code references (mysql-8.4.7/sql-common/client.cc, mysql-8.4.7/libs/mysql/binlog/event/binlog_event.h)
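
The corrected contract can be sketched as follows. This is a minimal illustration, not the MygramDB source: the names EventView, StripOkByte, and EventType are hypothetical, and the sketch assumes the raw length passed in still counts the OK byte. The reader strips the OK byte exactly once, and the parser indexes the event header directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical view over one binlog event with the OK byte already removed.
struct EventView {
  const std::uint8_t* data;  // first byte of the 19-byte event header
  std::size_t length;        // bytes available from data onward
};

// Reader side: skip the leading protocol OK byte (0x00) exactly once.
// raw and raw_len stand in for the fetched buffer and its size (assumption).
inline EventView StripOkByte(const std::uint8_t* raw, std::size_t raw_len) {
  assert(raw != nullptr && raw_len >= 1);
  return EventView{raw + 1, raw_len - 1};
}

// Parser side: use the view as-is, with no further offset.
// The event type is header byte 4, right after the 4-byte timestamp.
inline std::uint8_t EventType(const EventView& ev) {
  assert(ev.length > 4);
  return ev.data[4];
}
```

With this split, a parser that applied a second skip would read header byte 5 (the first byte of server_id) as the event type, which matches the wrong event type detection described above.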

Files Changed:

  • src/mysql/binlog_event_parser.cpp: Remove duplicate OK byte skip, add documentation (lines 153-163)
  • src/mysql/binlog_reader.cpp: Document OK byte handling (lines 601-608)
  • src/mysql/rows_parser.cpp: Update boundary calculations with proper documentation (lines 352-365)

2. Binlog Checksum Boundary Error - Buffer Overrun Risk

Severity: Critical - Buffer Overrun and Parsing Failure

Problem: Event parsers (ParseWriteRowsEvent, ParseUpdateRowsEvent, ParseDeleteRowsEvent) were not excluding the 4-byte checksum at the end of each binlog event when calculating parsing boundaries.

Impact:

  • Buffer overrun when parsing the last 4 bytes of row events
  • Parsing failures when attempting to read data from checksum area
  • Data corruption risk from reading invalid data
  • UPDATE_ROWS_EVENT particularly affected (before/after image parsing)

Root Cause:

MySQL binlog events have the following structure:

[event header (19 bytes)][event data][checksum (4 bytes)]

Even when checksums are disabled via SET @source_binlog_checksum='NONE', MySQL still allocates 4 bytes at the end for checksum space (see BINLOG_CHECKSUM_LEN in mysql-8.4.7/libs/mysql/binlog/event/binlog_event.h).

The event parsers were using buffer + length or buffer + event_size as the end boundary, which included the checksum area.

Fix:

  • Extract event_size from binlog header bytes [9-12] (little-endian)
  • Calculate end boundary as buffer + event_size - 4 (exclude BINLOG_CHECKSUM_LEN)
  • Apply to all ROWS_EVENT parsers:
    • ParseWriteRowsEvent() (lines 352-365)
    • ParseUpdateRowsEvent() (lines 570-578)
    • ParseDeleteRowsEvent() (lines 972-980)
  • Add MySQL source code references and detailed comments explaining checksum handling
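
The boundary calculation described above might look like the sketch below. The helper names (ReadEventSize, ParseEnd) and the extra sanity checks are assumptions for illustration; only the header offset [9-12], the little-endian layout, and the 4-byte checksum length come from the notes above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kHeaderLen = 19;       // common event header
constexpr std::size_t kChecksumLen = 4;      // mirrors BINLOG_CHECKSUM_LEN
constexpr std::size_t kEventSizeOffset = 9;  // header bytes [9-12]

// Read the 4-byte little-endian event_size field from the header.
inline std::uint32_t ReadEventSize(const std::uint8_t* buffer) {
  return static_cast<std::uint32_t>(buffer[kEventSizeOffset]) |
         (static_cast<std::uint32_t>(buffer[kEventSizeOffset + 1]) << 8) |
         (static_cast<std::uint32_t>(buffer[kEventSizeOffset + 2]) << 16) |
         (static_cast<std::uint32_t>(buffer[kEventSizeOffset + 3]) << 24);
}

// Return the exclusive end of the parseable data (checksum excluded),
// or nullptr when event_size does not fit the bytes actually available.
inline const std::uint8_t* ParseEnd(const std::uint8_t* buffer,
                                    std::size_t length) {
  assert(buffer != nullptr);
  if (length < kHeaderLen + kChecksumLen) return nullptr;
  const std::uint32_t event_size = ReadEventSize(buffer);
  if (event_size < kHeaderLen + kChecksumLen || event_size > length) {
    return nullptr;
  }
  return buffer + event_size - kChecksumLen;
}
```

A row parser would then compare its read pointer against the returned end instead of buffer + length, so the last 4 bytes are never interpreted as row data.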

Files Changed:

  • src/mysql/rows_parser.cpp: Fix boundary calculations in all ROWS_EVENT parsers

3. Extra Row Info Length Calculation Error - MySQL 8.0 Compatibility

Severity: Critical - MySQL 8.0 Replication Failure

Problem: MySQL 8.0 ROWS_EVENT_V2 includes extra_row_info with a packed integer length field. The code misinterpreted the length as data-only, but MySQL's format includes the packed integer itself in the total length.

Impact:

  • Pointer position offset errors in MySQL 8.0 row event parsing
  • Parsing failures for all INSERT/UPDATE/DELETE operations
  • Complete replication breakdown on MySQL 8.0
  • Silent failures (events appear but are not parsed)

Root Cause:

MySQL 8.0 ROWS_EVENT_V2 format (when flags & 0x0001):

[extra_row_info_len (packed int)][extra_row_info_data]

The extra_row_info_len value is the TOTAL length including the packed integer itself. For example, if the packed integer is 1 byte with value 2, then:

  • Total length = 2 bytes
  • Packed integer = 1 byte
  • Actual data = 1 byte

The code advanced with ptr += extra_info_len after the packed integer had already been consumed, which double-counted the packed integer's own length.

Fix:

  • Read packed integer and calculate bytes consumed: auto len_bytes = static_cast<int>(ptr - ptr_before);
  • Skip only the remaining data: ptr += (extra_info_len - len_bytes);
  • Add boundary validation before skipping, with skip_bytes = extra_info_len - len_bytes: if (skip_bytes < 0 || ptr + skip_bytes > end) { error }
  • Apply to all ROWS_EVENT parsers:
    • ParseWriteRowsEvent() (lines 384-404)
    • ParseUpdateRowsEvent() (lines 629-649)
    • ParseDeleteRowsEvent() (lines 1005-1025)
  • Add MySQL source code references (mysql-8.4.7/libs/mysql/binlog/event/rows_event.h)
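
The skip logic can be sketched as below, assuming the length is encoded as a standard MySQL length-encoded integer. ReadPackedInt and SkipExtraRowInfo are hypothetical helpers written for this illustration, not functions from rows_parser.cpp. The length is treated as the total, so only total minus the bytes already consumed is skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reader for a MySQL length-encoded ("packed") integer.
// Advances *ptr past the integer on success; leaves it untouched on failure.
inline bool ReadPackedInt(const std::uint8_t** ptr, const std::uint8_t* end,
                          std::uint64_t* out) {
  assert(ptr && *ptr && end && out);
  const std::uint8_t* p = *ptr;
  if (p >= end) return false;
  const std::uint8_t first = *p++;
  if (first < 0xFB) {  // values 0..250 fit in the first byte
    *out = first;
    *ptr = p;
    return true;
  }
  std::size_t extra = 0;
  if (first == 0xFC) extra = 2;
  else if (first == 0xFD) extra = 3;
  else if (first == 0xFE) extra = 8;
  else return false;  // 0xFB (NULL marker) and 0xFF are not valid lengths
  if (static_cast<std::size_t>(end - p) < extra) return false;
  std::uint64_t value = 0;
  for (std::size_t i = 0; i < extra; ++i) {
    value |= static_cast<std::uint64_t>(p[i]) << (8 * i);  // little-endian
  }
  *out = value;
  *ptr = p + extra;
  return true;
}

// Skip extra_row_info, treating the length as TOTAL (length field included).
inline bool SkipExtraRowInfo(const std::uint8_t** ptr,
                             const std::uint8_t* end) {
  const std::uint8_t* before = *ptr;
  const std::uint8_t* p = before;
  std::uint64_t total_len = 0;
  if (!ReadPackedInt(&p, end, &total_len)) return false;
  const std::uint64_t len_bytes = static_cast<std::uint64_t>(p - before);
  if (total_len < len_bytes) return false;  // shorter than its own field
  const std::uint64_t skip_bytes = total_len - len_bytes;
  if (skip_bytes > static_cast<std::uint64_t>(end - p)) return false;
  *ptr = p + skip_bytes;
  return true;
}
```

For the example above (a 1-byte packed integer with value 2), the pointer ends up 2 bytes past its starting position: 1 byte for the length field and 1 byte of data.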

Files Changed:

  • src/mysql/rows_parser.cpp: Fix extra_row_info length calculation in all ROWS_EVENT parsers

4. Binlog Purged Error Detection - Better Error Handling

Severity: High - Operational Visibility

Problem: When the requested GTID position has been purged from MySQL binlogs (errno 1236), the system would retry indefinitely without clear indication that manual intervention (SYNC command) is required.

Impact:

  • Infinite retry loops consuming resources
  • No clear guidance for operators on required action
  • Delayed incident response
  • Confusion about replication state

Root Cause:

BinlogReader treated errno 1236 (ER_MASTER_FATAL_ERROR_READING_BINLOG) as a generic error and attempted to reconnect, even though this error means the GTID position is no longer available on the server.

Fix:

  • Detect errno 1236 specifically in two locations:
    • mysql_binlog_open() failure (lines 410-423)
    • mysql_binlog_fetch() failure (lines 551-568)
  • Stop replication immediately (set should_stop_ = true)
  • Log structured error with clear action message:
    "Binlog position no longer available on server.
     GTID position has been purged.
     Manual intervention required: run SYNC command to establish new position."
  • Prevent wasted reconnection attempts
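
The decision described above can be sketched as a small classifier. FetchErrorAction, ClassifyBinlogError, and OperatorMessage are hypothetical names used only for illustration. According to the notes above, the actual reader sets should_stop_ = true, and the error number would typically come from mysql_errno() on the binlog connection.

```cpp
#include <cassert>
#include <string>

// Numeric value of ER_MASTER_FATAL_ERROR_READING_BINLOG, as cited above.
constexpr unsigned int kBinlogFatalReadErrno = 1236;

// Hypothetical decision type for this sketch.
enum class FetchErrorAction { kReconnect, kStopAndRequireSync };

// Errno 1236 means the requested position is gone, so retrying cannot help.
inline FetchErrorAction ClassifyBinlogError(unsigned int mysql_errno_value) {
  return mysql_errno_value == kBinlogFatalReadErrno
             ? FetchErrorAction::kStopAndRequireSync
             : FetchErrorAction::kReconnect;
}

// Operator-facing text; the purged-binlog wording follows the message above.
inline std::string OperatorMessage(FetchErrorAction action) {
  switch (action) {
    case FetchErrorAction::kStopAndRequireSync:
      return "Binlog position no longer available on server. "
             "GTID position has been purged. "
             "Manual intervention required: run SYNC command to establish "
             "new position.";
    case FetchErrorAction::kReconnect:
      return "Connection lost, will reconnect";
  }
  assert(false && "unhandled FetchErrorAction");
  return std::string();
}
```

Keeping the classification separate from the reconnect loop makes the stop condition easy to unit test without a live MySQL server.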

Files Changed:

  • src/mysql/binlog_reader.cpp: Add errno 1236 detection and handling

📈 Improvements

1. Enhanced TABLE_MAP_EVENT Debugging

Problem: TABLE_MAP_EVENT parsing failures were silent or produced unclear error messages.

Fix:

  • Add detailed debug logs at each parsing step:
    • Buffer length and remaining bytes after each field extraction
    • Database name, table name, table_id values
    • Field lengths (db_len, table_len) before boundary checks
    • Error logs for all boundary validation failures
  • Add structured error messages for each validation point
  • Easier diagnosis of parsing failures

Files Changed:

  • src/mysql/binlog_event_parser.cpp: Add field-by-field debug logging (lines 506-587)

2. Structured Logging for BinlogReader Lifecycle

Problem: BinlogReader lifecycle events were logged inconsistently, making it difficult to trace connection and replication state.

Fix:

  • Convert info-level logs to structured logs using StructuredLog():
    • binlog_connection_init: Creating dedicated binlog connection
    • binlog_connection_validated: Connection validation successful
    • binlog_reader_started: Reader thread started with GTID
    • binlog_reader_stopped: Reader stopped with event count
    • binlog_gtid_set: GTID position updated
    • binlog_replication_start: Starting replication from GTID
    • binlog_reconnected: Reconnection successful
    • binlog_stream_opened: Binlog stream opened
    • binlog_connection_lost: Connection lost, will reconnect
    • binlog_error: Critical errors (binlog_purged, fetch failures)
  • All structured logs include relevant context (GTID, error messages, errno)
  • Better monitoring and alerting integration
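
To illustrate what an event name plus context fields can look like on the wire, here is a minimal sketch. It does not reproduce MygramDB's StructuredLog() signature or output format, neither of which is shown in these notes. FormatStructuredEvent is a hypothetical stand-in, and the JSON-per-line layout is an assumption.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Escape the two characters that would break a JSON string literal here.
// Control characters are not handled; this is a sketch, not a full encoder.
inline std::string EscapeJson(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (char c : in) {
    if (c == '"' || c == '\\') out.push_back('\\');
    out.push_back(c);
  }
  return out;
}

// Hypothetical formatter: one JSON object per event, fields in given order.
inline std::string FormatStructuredEvent(
    const std::string& event,
    const std::vector<std::pair<std::string, std::string>>& fields) {
  assert(!event.empty());
  std::string line = "{\"event\":\"" + EscapeJson(event) + "\"";
  for (const auto& field : fields) {
    line += ",\"" + EscapeJson(field.first) + "\":\"" +
            EscapeJson(field.second) + "\"";
  }
  line += "}";
  return line;
}
```

A stable event key such as binlog_error lets alerting rules match on the event name rather than on free-form message text.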

Files Changed:

  • src/mysql/binlog_reader.cpp: Convert lifecycle logs to structured format

3. Improved Binlog Fetch Diagnostics

Problem: When replication appeared to start but no events were received, diagnosis was difficult.

Fix:

  • Log first mysql_binlog_fetch() result with details:
    spdlog::debug("First mysql_binlog_fetch returned: result={}, size={}, buffer={}",
                  result, rpl.size, (void*)rpl.buffer);
  • Track and log when no data is returned repeatedly:
    static int no_data_count = 0;
    ++no_data_count;  // count empty fetches so the log fires on the 1st, 101st, ...
    if (no_data_count % 100 == 1) {
      spdlog::debug("Binlog fetch returned no data (count={}). This may indicate: "
                    "1) No new events on MySQL, "
                    "2) GTID position issue, "
                    "3) Network keepalive", no_data_count);
    }
  • Detect TABLE_MAP_EVENT parsing attempts and log results
  • Easier diagnosis of replication issues

Files Changed:

  • src/mysql/binlog_reader.cpp: Add fetch diagnostics (lines 483-489, 586-594, 626-633)

4. Structured Logging for SyncOperationManager

Problem: SYNC operations were logged inconsistently and without structured context.

Fix:

  • Add structured logs for SYNC lifecycle:
    • SYNC start with configuration
    • SYNC completion with statistics
    • SYNC failure with error details
  • Include context fields (table name, document count, duration)
  • Better integration with monitoring systems

Files Changed:

  • src/server/sync_operation_manager.cpp: Add structured logs for SYNC operations

5. Production Log Verbosity Reduction

Problem: Production logs were too verbose with info-level logs for routine replication events.

Fix:

  • Change routine replication logs to debug level:
    • Checksums disabled: info → debug
    • GTID set usage: info → debug
    • Empty GTID set: info → debug
    • Reconnect delays: info → debug
    • Connection validation: info → debug
    • Column name fetches: info → debug
    • Individual INSERT/UPDATE/DELETE events: info → debug
    • First few non-tracked table skips: remains at info level for awareness
  • Keep important events at info level:
    • Connection establishment
    • Reader start/stop
    • GTID changes
    • Stream open
    • Reconnection events
    • Errors and warnings

Files Changed:

  • src/mysql/binlog_reader.cpp: Reduce log verbosity for production

🧪 Testing

Test Coverage

Existing Tests Updated:

  • All existing binlog parsing tests continue to pass
  • Tests validated against MySQL 8.0 and MySQL 8.4 binlog formats

Test Files:

  • tests/mysql/binlog_parsing_test.cpp: Validates OK byte handling fix
  • tests/mysql/rows_parser_test.cpp: Validates checksum boundary fix

📊 Statistics

Code Changes

  • Files Changed: 6
  • Insertions: +263 lines
  • Deletions: -46 lines
  • Net Change: +217 lines

Module Breakdown

Module    Files    Description
MySQL     4        Binlog parsing fixes, structured logging
Server    1        SyncOperationManager structured logging
Tests     2        Validation of parsing fixes

🔄 Migration Guide

From v1.3.0/v1.3.1 to v1.3.2

⚠️ URGENT: All users with MySQL replication enabled should upgrade immediately.

No configuration changes required. This is a transparent bug fix release.

Critical Issues Fixed

If you are running v1.3.0 or v1.3.1 with MySQL replication, you likely experienced:

  • TABLE_MAP_EVENT parsing failures (events for tables not being processed)
  • ROWS_EVENT parsing failures (INSERT/UPDATE/DELETE events ignored)
  • Complete replication breakdown (no data synchronization from MySQL)
  • Silent failures (events appear in binlog but are not applied)

Upgrade Steps

Docker users:

# Pull new image
docker pull ghcr.io/libraz/mygram-db:v1.3.2

# Update docker-compose.yml
services:
  mygramdb:
    image: ghcr.io/libraz/mygram-db:v1.3.2

# Restart container
docker-compose up -d

RPM users:

# Download and install
sudo rpm -Uvh mygramdb-1.3.2-1.el9.x86_64.rpm

# Restart service
sudo systemctl restart mygramdb

Source build:

git checkout v1.3.2
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel
sudo systemctl restart mygramdb

Verification After Upgrade

1. Verify replication is processing events:

echo "REPLICATION STATUS" | nc localhost 3307
# Should show: running
# events_processed should be increasing

2. Check logs for successful TABLE_MAP parsing:

tail -f /var/log/mygramdb/mygramdb.log | grep "TABLE_MAP"
# Should show: "TABLE_MAP_EVENT parsed successfully: database.table"

3. Verify data is being synchronized:

-- On MySQL, insert a test row
INSERT INTO your_table (id, content) VALUES (99999, 'test replication');

-- Wait a few seconds, then on MygramDB
echo "GET your_table 99999" | nc localhost 3307
# Should return the test row

4. Check for parsing errors:

tail -f /var/log/mygramdb/mygramdb.log | grep "ERROR"
# Should not show binlog parsing errors

5. Monitor binlog queue size:

echo "SHOW STATUS LIKE 'binlog_queue%'" | nc localhost 3307
# binlog_queue_size should remain low (not growing unbounded)

Rollback Procedure

If issues arise, rolling back to v1.3.1 is possible but NOT RECOMMENDED because of its critical replication bugs:

# Docker
docker pull ghcr.io/libraz/mygram-db:v1.3.1

# RPM
sudo rpm -Uvh --oldpackage mygramdb-1.3.1-1.el9.x86_64.rpm

⚠️ Warning: Rolling back to v1.3.1 exposes you to binlog parsing failures that break replication.


🚀 Upgrade Recommendation

Priority: CRITICAL - Upgrade Immediately (Replication Users Only)

All deployments using MySQL replication (v1.3.0 or v1.3.1) must upgrade to v1.3.2 immediately.

Affected Scenarios:

  1. Replication Completely Broken - Critical

    • Any deployment with MySQL replication enabled
    • TABLE_MAP_EVENT parsing failures prevent table identification
    • ROWS_EVENT parsing failures prevent data synchronization
    • Risk: Complete replication breakdown, data not synchronized
  2. Silent Replication Failures - Critical

    • Events appear in binlog but are silently skipped
    • No error messages in some failure cases
    • Data inconsistency between MySQL and MygramDB
    • Risk: Silent data loss, difficult to diagnose
  3. MySQL 8.0 Compatibility - Critical

    • MySQL 8.0 ROWS_EVENT_V2 parsing broken
    • INSERT/UPDATE/DELETE operations not processed
    • Risk: Complete replication failure on MySQL 8.0
  4. Buffer Overrun Risk - High

    • Checksum boundary errors can cause buffer overrun
    • Data corruption risk from reading invalid memory
    • Risk: Undefined behavior, potential crashes

Non-Affected Scenarios:

  • Deployments NOT using MySQL replication (direct data loading only) are NOT affected
  • Query/search functionality continues to work normally
  • Existing data in MygramDB is not affected

📈 Performance Impact

Positive Impact

  • Reduced Log Volume: Debug-level logging for routine events reduces log I/O overhead
  • Better Diagnostics: Structured logging enables faster issue diagnosis

No Negative Impact

All fixes are transparent improvements with no performance degradation.



Recommended Version: v1.3.2 (for replication users), v1.3.1 (for non-replication users)

Release Tag: git tag -a v1.3.2 -m "MygramDB v1.3.2: Critical binlog parsing fixes for MySQL replication"