Add option to enable persistence to the status, to allow restart-ability to the script #4

Open
realmgic wants to merge 7 commits into aerospike:master from realmgic:master

Conversation

@realmgic
Member

In some cases the script exits (or hangs) for various (unknown) reasons, leaving the cluster in quiesce mode. Saving the state to a file allows the script to be restarted, and if the node is out of maintenance mode, it will automatically undo the quiesce.

@realmgic
Member Author

@spkesan - do you have some time to review and merge this? thanks! :)

@arrowplum arrowplum requested a review from spkesan November 12, 2019 18:38
@spkesan

spkesan commented Nov 18, 2019

@realmgic
Thanks for the work and PR. I'll review the changes.
Do we know why 'the script exits (or hangs) on various (unknown) reasons'?

@spkesan

spkesan commented Nov 18, 2019

Hi @realmgic
If we just want to restart the script so that it performs the quiesce-undo (that is, the node was quiesced by the script and the script did not run to completion for some reason, which I guess is what you are trying to address here), wouldn't it be simpler to pass in last_maintenance_event via a command-line option? Then we wouldn't need this many changes, or to persist the last event, right?

Let's say:

  1. maintenance-event changes to MIGRATE_ON_HOST_MAINTENANCE.
  2. The script notices this and quiesces the node.
  3. After the live migration, the maintenance-event will be changed to NONE, but let's say that before this point the script exits (for some unknown reason, as you mentioned).
  4. Now the node is in a quiesced state. We need to quiesce-undo the node.
  5. We can restart the script by passing last_maintenance_event as MIGRATE_ON_HOST_MAINTENANCE. The script will perform the quiesce-undo since the latest maintenance-event will be NONE.
  • Ideally we should find out and fix why the script is hanging or exiting unexpectedly.
  • Also improve logging to know the last state of the script (since it's only 60 seconds from the metadata change to actual start of maintenance event).
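
To make the steps above concrete, here is a rough Python sketch of the command-line approach. It is only an illustration, not code from the script: the `--last-maintenance-event` flag name, the `decide_action` helper, and the action strings it returns are all hypothetical.

```python
import argparse

# Event values as they appear in this thread.
MAINTENANCE = "MIGRATE_ON_HOST_MAINTENANCE"
NONE = "NONE"


def decide_action(last_event, current_event):
    """Pick an action from the previous and current maintenance-event.

    Hypothetical helper: returns "quiesce", "quiesce-undo", or "wait".
    """
    if current_event == MAINTENANCE and last_event != MAINTENANCE:
        # Steps 1-2: maintenance was just announced, so quiesce the node.
        return "quiesce"
    if current_event == NONE and last_event == MAINTENANCE:
        # Steps 3-5: maintenance is over but the node is still quiesced.
        return "quiesce-undo"
    # Nothing changed since the last observation.
    return "wait"


def parse_args(argv=None):
    """Parse the (assumed) option that seeds the last observed event."""
    parser = argparse.ArgumentParser(
        description="Sketch: seed the last maintenance-event on restart."
    )
    parser.add_argument(
        "--last-maintenance-event",
        choices=[NONE, MAINTENANCE],
        default=NONE,
        help="Event observed before the script exited (assumed flag name).",
    )
    return parser.parse_args(argv)
```

With this shape, restarting with `--last-maintenance-event MIGRATE_ON_HOST_MAINTENANCE` while the metadata reports NONE would select the quiesce-undo branch, whereas a restart with the default would select "wait".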

When you observed the script hung or exited unexpectedly, did you check or collect the log file (/var/log/aerospike/agm.log)?

@realmgic
Member Author

realmgic commented Nov 18, 2019

Hi @spkesan,

When the script (or actually, the systemd service which runs it) is restarted, we don't know what the last state we observed was, or what state the cluster is in.

In that specific case, the service was stopped after the maintenance flag was raised but before we got the NONE event to clear it. When the service started again, we got NONE, so we assumed the cluster was "fine" and the maintenance (quiesce) wasn't cleared. From the cluster's perspective, that node stayed quiesced for another full day or two before someone noticed it on a dashboard after the weekend.

What I did is persist the fact that the last state we saw was maintenance (saved to a file), so when we come back we can check for that and either undo any quiesce that might still be in place (if we get NONE) or know that we're still in maintenance mode and wait for it to end (if we get MIGRATE_ON_HOST_MAINTENANCE).
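
For readers following along, the idea can be sketched roughly as below. This is a simplified illustration and not the code in this PR: the JSON layout, the function names, and the returned action strings are assumptions, and the state file path is left to the caller.

```python
import json
import os
import tempfile

MAINTENANCE = "MIGRATE_ON_HOST_MAINTENANCE"
NONE = "NONE"


def save_last_event(path, event):
    """Write the last observed event atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump({"last_maintenance_event": event}, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_last_event(path, default=NONE):
    """Read the saved event; fall back to default if missing or unreadable."""
    try:
        with open(path) as handle:
            data = json.load(handle)
    except (FileNotFoundError, ValueError):
        return default
    if not isinstance(data, dict):
        return default
    return data.get("last_maintenance_event", default)


def action_on_startup(saved_event, current_event):
    """Decide what to do when the service comes back up."""
    if saved_event == MAINTENANCE and current_event == NONE:
        # Maintenance ended while we were down: clear the leftover quiesce.
        return "quiesce-undo"
    if saved_event == MAINTENANCE and current_event == MAINTENANCE:
        # Still in maintenance: keep waiting for NONE.
        return "wait"
    if current_event == MAINTENANCE:
        # Maintenance started while we were down.
        return "quiesce"
    return "none"
```

Writing to a temporary file and renaming it means a crash in the middle of a save leaves either the old state or the new one, never a half-written file, which matters here because the whole point is to survive unexpected exits.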
