WIP: Resurrect #343
|
TODO:
|
|
Export is to the changelog table. This ensures atomicity and durability of the write, assuming changelog table is I'm fine stating that resurrection does not work on |
|
5f25f74 makes for something that works! I'll need to iterate to see what has been overlooked, but basically we're getting there fast. |
…s from the resurrected context
|
A concern is to not rely on the streamer's last known position, because streamer writes to a buffer (currently hard coded to
|
StreamerBinlogCoordinates -> AppliedBinlogCoordinates
Updating AppliedBinlogCoordinates when truly applied; no longer asking the streamer for coordinates. The streamer's events can be queued but not yet handled, so after a crash we need to look at the last _handled_ event, not the last _streamed_ event.
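To illustrate the distinction between streamed and applied coordinates, here is a minimal sketch in Python. It is not gh-ost's code: `BinlogCoordinates`, `EventPipeline`, and their methods are hypothetical names made up for this example, and the binlog file names and positions are invented. It only shows why a resume point should advance when an event is handled rather than when it is queued.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class BinlogCoordinates:
    """Hypothetical stand-in for a (binlog file, position) pair."""
    log_file: str
    log_pos: int


class EventPipeline:
    """Hypothetical model: a streamer enqueues events, an applier handles them."""

    def __init__(self):
        self.queue = deque()
        self.streamed = None  # coordinates of the last event read from the stream
        self.applied = None   # coordinates of the last event actually applied

    def stream(self, coords, event):
        # The streamer only buffers the event; nothing is applied yet.
        self.queue.append((coords, event))
        self.streamed = coords

    def apply_next(self, apply_fn):
        coords, event = self.queue.popleft()
        apply_fn(event)
        # Advance only after apply_fn returned, i.e. the event was truly applied.
        self.applied = coords

    def resume_point(self):
        # After a crash, queued-but-unhandled events are lost, so a new
        # process must restart from the last applied coordinates.
        return self.applied


pipeline = EventPipeline()
handled = []
pipeline.stream(BinlogCoordinates("mysql-bin.000001", 100), "event-1")
pipeline.stream(BinlogCoordinates("mysql-bin.000001", 200), "event-2")
pipeline.apply_next(handled.append)  # only event-1 is handled
```

In this sketch, `pipeline.streamed` points at position 200 while `pipeline.resume_point()` still points at position 100. Resuming from the streamed position would silently skip `event-2`.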
…nitial resurrection context
|
Off issue, you were mentioning gh-ost was having checksum issues with resurrections. When you mentioned that I was thinking, could it be related to the fact that we have two things going on: the backlog and the iteration of inserts? I hope that makes sense. Nonetheless, it was something that popped into my head that I hoped might help when you get back to this. (Not fully understanding the code changes, this might already be something you're handling.) |
|
@tomkrouper the conjecture is as follows:
it should be OK to resume execution
this is the conjecture's logic:
But then, of course, tests are failing... |
|
@shlomi-noach are there any plans to revisit this feature? I'm looking at gh-ost again and one of the concerns my team has is that we have some very large tables that can take days, if not a week to copy. If the process were to crash in the middle we'd have a lot of wasted effort, especially when we have to slowly drain the _gho table to prevent the dreaded global metadata lock when dropping it. This would be extremely helpful for us! |
|
@Xopherus this isn't on the near future's roadmap. such that hitting critical load doesn't bail out. I understand the stress involved with running a week-long migration. Our history shows those migrations do not break, hence the Resurrection feature is not urgent for us to implement. |
|
Thanks for the advice @shlomi-noach! Appreciate the wisdom - I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long. I'll have to try that parameter and let you know how it goes. |
@Xopherus Could you please elaborate on that? I'm not sure I understand. |
|
Oh I just mean that if your migrations can take multiple hours / days, it can be tricky to tune parameters (e.g. critical load threshold or lock cutover timeouts, etc) because it takes longer to experiment. Fortunately we've gotten solid advice from you and others here to help guide us in the right direction. |
|
Hi @shlomi-noach :-) I think "resurrect" is not the best term. It's not a standard technical term. Even doc/command-line-flags.md has to clarify: "It is possible to resurrect/resume a failed migration". When people think, "Can I resume an osc?", they'll look for and Google with that term. Imho, "resurrect" will never cross people's minds. By contrast, everyone knows what "resume" (and its reciprocals "suspend" or "pause") mean. I'd also argue that it's not technically descriptive or intention-revealing. A dead body can be resurrected, and I get the joke with the app being called "ghost", but it begs the question: What does it mean to resurrect a program? My last argument is: for non-native English speakers/readers, these issues are compounded by uncommon words in a technical context. -- I'd vote for pause/resume or start/stop. |
|
Thank you @daniel-nichter |
|
Bumping this feature request: have there been any changes toward making this feature available?
This code is fairly old and there are a bunch of conflicting files at this point. We don't have any immediate plans to work on this, but I do agree this would be a good feature to have and if anyone would like to continue the work, we'd love the community contribution. |
|
superseded by #1595 |
Storyline: #205
WORK IN PROGRESS: resurrecting a migration after failure.
The idea is that `gh-ost` would routinely dump migration status/context. It would be possible for one `gh-ost` process to fail (e.g. having met `critical-load`) and for another `gh-ost` process to pick up from where the first left off.

Initial commits present exporting of migration context, with some shuffling & cleanup.
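As a rough illustration of the dump-and-pick-up idea, here is a minimal sketch in Python. It is not the code in this PR. An in-memory SQLite database stands in for the MySQL changelog table, and the table layout, the `'context'` key, the function names, and the fields inside the context are all assumptions made for this example.

```python
import json
import sqlite3


def init_changelog(conn):
    # Hypothetical stand-in for the changelog table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changelog ("
        "hint TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )


def export_context(conn, context):
    # One row, replaced in a single transaction: a reader sees either the
    # previous context or the new one, never a partial write.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO changelog (hint, value) VALUES ('context', ?)",
            (json.dumps(context),),
        )


def load_context(conn):
    # What a second process would read before picking up the migration.
    row = conn.execute(
        "SELECT value FROM changelog WHERE hint = 'context'"
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
init_changelog(conn)

# The first process dumps its context routinely as it makes progress.
export_context(conn, {"rows_copied": 1000, "applied": ["mysql-bin.000001", 100]})
export_context(conn, {"rows_copied": 2000, "applied": ["mysql-bin.000001", 200]})

# A second process, started after the first one failed, reads the last dump.
resumed = load_context(conn)
```

Here `resumed` holds the most recent dump, so the second process would continue from 2000 copied rows and from the last applied coordinates rather than starting over.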