Skip to content

gh-ost process stuck infinitely after writer database failure #1569

@grodowski

Description

@grodowski

We detected this issue accidentally when a small gh-ost migration got interrupted by a daily test database failover. Since the current writer became read-only, we saw lots of FATAL Error 1290 (HY000): The MySQL server is running with the --read-only option so it cannot execute this statement logs and had to step in and kill the process.

Failures like this while copying data in the ApplyIterationInsertQuery loop will be retried until MaxRetries and then the process will terminate through a PanicAbort error.

However, in the case described here, we were past the copy stage and observed gh-ost getting stuck in throttled state with the heartbeat lag growing indefinitely.

Repro steps

Partial output printed by an integration test where I issued a SET GLOBAL super_read_only=1 after the data is copied, but before cutover (super just because of how my test db was set up).

First, copying completes.

I, [2025-07-10T11:57:02.661777 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Copy: 2000/2000 100.0%; Applied: 0; Backlog: 0/1000; Time: 1s(total), 1s(copy); streamer: mysql-bin.000001:139473498; Lag: 0.01s, HeartbeatLag: 0.07s, State: migrating; ETA: due []
I, [2025-07-10T11:57:02.662144 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Grabbing voluntary lock: gh-ost.2831.lock
I, [2025-07-10T11:57:02.662327 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Setting LOCK timeout as 20 seconds
I, [2025-07-10T11:57:02.662529 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Looking for magic cut-over table
I, [2025-07-10T11:57:02.663413 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Creating magic cut-over table `agent_test`.`_gh_ost_users_20250710115701_del`
I, [2025-07-10T11:57:02.667448 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Magic cut-over table created
I, [2025-07-10T11:57:02.667498 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Setting cut-over idle timeout as 30 seconds
I, [2025-07-10T11:57:02.667709 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Locking `agent_test`.`users`, `agent_test`.`_gh_ost_users_20250710115701_del`
I, [2025-07-10T11:57:02.668176 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Tables locked
I, [2025-07-10T11:57:02.668196 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Session locking original & magic tables is 2831
I, [2025-07-10T11:57:02.668225 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Writing changelog state: AllEventsUpToLockProcessed:1752141422668177000
I, [2025-07-10T11:57:02.670031 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Waiting for events up to lock
I, [2025-07-10T11:57:02.672942 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Intercepted changelog state AllEventsUpToLockProcessed
I, [2025-07-10T11:57:02.672997 #11938]  INFO -- : [gh-ost]: 2025-07-10 11:57:02 INFO Handled changelog state AllEventsUpToLockProcessed

gh-ost starts the idle timeout and we can simulate failover now.

I, [2025-07-10T11:36:08.696813 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:08 INFO Setting cut-over idle timeout as 30 seconds 

Once queries start to fail and process hangs in State: throttled, printing a batches of errors and the progress message every second, infinitely.

I, [2025-07-10T11:36:39.258420 #10808]  INFO -- : [gh-ost]: Copy: 2000/2000 100.0%; Applied: 0; Backlog: 0/1000; Time: 32s(total), 1s(copy); streamer: mysql-bin.000001:135927553; Lag: 30.11s, HeartbeatLag: 30.21s, State: migrating; ETA: due
I, [2025-07-10T11:36:39.258710 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 INFO Copy: 2000/2000 100.0%; Applied: 0; Backlog: 0/1000; Time: 32s(total), 1s(copy); streamer: mysql-bin.000001:135927553; Lag: 30.11s, HeartbeatLag: 30.21s, State: migrating; ETA: due []
I, [2025-07-10T11:36:39.351713 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.352618 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.450953 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.451353 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.551662 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.553035 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.651587 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.652518 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.715076 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 INFO Checking session lock: gh-ost.2817.lock
I, [2025-07-10T11:36:39.717221 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Session lock gh-ost.2817.lock expected to be found but wasn't
I, [2025-07-10T11:36:39.751343 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.752083 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.851623 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.852618 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.951796 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:39.952941 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:39 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.052204 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.053756 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.152068 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.153595 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.253441 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.253506 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.253548 #10808]  INFO -- : [gh-ost]: Copy: 2000/2000 100.0%; Applied: 0; Backlog: 0/1000; Time: 33s(total), 1s(copy); streamer: mysql-bin.000001:135927553; Lag: 31.11s, HeartbeatLag: 31.21s, State: throttled, lag=31.010552s; ETA: due
I, [2025-07-10T11:36:40.253640 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 INFO Copy: 2000/2000 100.0%; Applied: 0; Backlog: 0/1000; Time: 33s(total), 1s(copy); streamer: mysql-bin.000001:135927553; Lag: 31.11s, HeartbeatLag: 31.21s, State: throttled, lag=31.010552s; ETA: due []
I, [2025-07-10T11:36:40.254304 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.351491 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.352606 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.450674 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.450972 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.550188 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.550489 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.650515 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.650860 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.750394 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.750620 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.850404 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.850777 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.950297 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:40.950692 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:40 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:41.050396 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:41 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:41.050685 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:41 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:41.150416 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:41 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:41.150718 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:41 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:41.250454 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:41 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:41.250707 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:41 ERROR Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
I, [2025-07-10T11:36:41.250730 #10808]  INFO -- : [gh-ost]: Copy: 2000/2000 100.0%; Applied: 0; Backlog: 0/1000; Time: 34s(total), 1s(copy); streamer: mysql-bin.000001:135927553; Lag: 32.11s, HeartbeatLag: 32.21s, State: throttled, lag=32.008558s; ETA: due
I, [2025-07-10T11:36:41.250872 #10808]  INFO -- : [gh-ost]: 2025-07-10 11:36:41 INFO Copy: 2000/2000 100.0%; Applied: 0; Backlog: 0/1000; Time: 34s(total), 1s(copy); streamer: mysql-bin.000001:135927553; Lag: 32.11s, HeartbeatLag: 32.21s, State: throttled, lag=32.008558s; ETA: due []

Solution

I used the repro test case to find out which functions fail to connect to the writer and how they are retried. All of the failures point to things trying to write to the changelog table, which makes sense.

  • the first failures logged come from Applier.InitiateHeartbeat, which retries up to MaxRetries and then returns on error and silently stops sending new heartbeats. It doesn't seem there's any logic to re-initiate the heartbeat goroutine.
  • throttlerFunction fails to WriteAndLogChangelog every 100ms, which is what gets logged infinitely

I think it would be reasonable to exit with error after a retry period in injectHeartbeat, assuming the heartbeat failure is non-recoverable. Perhaps it makes sense to control this with a new CLI flag.

It could look like this: Shopify#13. Would love your feedback before I open an upstream PR here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions