Refresh remote session during long migrations #6889
gthvn1 wants to merge 1 commit into xapi-project:master
Long VM migrations can exceed the default SM session timeout (24h), causing calls to the destination host to fail because they use an expired session. This is especially problematic for large VDI copies. This patch prevents migrations from failing due to session expiration by refreshing the remote session periodically while the data copy is in progress. To make this possible, the session is passed from the XAPI "core" layer into the storage layer, where it can be refreshed as needed during long-running migrations.

Signed-off-by: Guillaume <guillaume.thouvenin@vates.tech>
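The core of this approach is a throttled refresh: a progress callback of the copy may fire often, but the remote session only needs to be touched occasionally. The following is a minimal OCaml sketch, not the code in this patch; `make_throttled_refresh`, its parameters, and the assumption that `refresh` performs some cheap API call on the remote session are all hypothetical.

```ocaml
(* Hypothetical sketch: wrap a session-refresh function so that it can be
   called on every progress update of a long copy, but only actually
   refreshes the session once [interval] seconds have elapsed since the last
   refresh. [now] is injected so the sketch does not depend on a real clock. *)
let make_throttled_refresh ~(now : unit -> float) ~(interval : float)
    ~(refresh : unit -> unit) : unit -> unit =
  let last = ref (now ()) in
  fun () ->
    let t = now () in
    if t -. !last >= interval then begin
      refresh ();
      (* Only record the refresh time once [refresh] has returned normally. *)
      last := t
    end

(* Usage with a fake clock: a progress callback every 10 "seconds" for
   10,000 seconds, refreshing at most once per hour. *)
let () =
  let clock = ref 0.0 in
  let refreshes = ref 0 in
  let on_progress =
    make_throttled_refresh ~now:(fun () -> !clock) ~interval:3600.0
      ~refresh:(fun () -> incr refreshes)
  in
  for _ = 1 to 1000 do
    clock := !clock +. 10.0;
    on_progress ()
  done;
  Printf.printf "refreshes: %d\n" !refreshes (* prints "refreshes: 2" *)
```

Here the session is refreshed at 3,600 s and 7,200 s, however many progress callbacks fire in between.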
Updated from da9aa17 to 80f31c1.
Any suggestions on how to solve the issue are welcome. I'm not sure that passing the remote session through the layers is the correct approach.
```ocaml
  }
in
let mirror_to_remote new_dp =
  let task =
```
Why can't the refresh thread be created here and wait until the migration is complete?
Currently you've done a lot of plumbing to refresh the session for smapiv1 plugins, but it doesn't work for smapiv3 ones, and you'll need to duplicate the effort there as well.
Because nothing is called here while the data is being copied. As I understand it, we are blocked until sparse_dd finishes copying the data, and I can't find a place for a hook/callback here. The only place I found to call the session refresh regularly is the thread callback of sparse_dd, but that is on the other side of the layer boundary, in the storage layer.
And yes, I agree that if this is the solution, we will need to do the same for smapiv3.
> I can't find a place for a hook/callback here.
A `bool ref` protected by a mutex could serve as the synchronisation mechanism. The child thread sets it to true once the migration has finished, while the main thread loops, refreshing the session every X minutes. When the main thread sees the flag set, it calls `Thread.join` on the child and continues execution.
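The suggestion above can be sketched in OCaml as follows. This is a minimal illustration under stated assumptions, not a drop-in change: `run_with_session_refresh`, `refresh_every`, `poll` and `refresh` are hypothetical names, `refresh` is assumed to make a cheap API call on the remote session, and the sketch needs the `threads` and `unix` libraries.

```ocaml
(* Hypothetical sketch of the reviewer's suggestion: run the copy in a child
   thread and keep the calling thread busy refreshing the session until the
   child signals completion through a mutex-protected [bool ref].
   Requires the threads and unix libraries. *)
let run_with_session_refresh ~(refresh_every : float) ~(poll : float)
    ~(refresh : unit -> unit) (copy : unit -> 'a) : 'a =
  let m = Mutex.create () in
  let finished = ref false in
  (* The child's outcome; an exception is captured so it can be re-raised in
     the calling thread instead of being lost when the child thread exits. *)
  let outcome = ref None in
  let child =
    Thread.create
      (fun () ->
        let r = try Ok (copy ()) with e -> Error e in
        Mutex.lock m;
        outcome := Some r;
        finished := true;
        Mutex.unlock m)
      ()
  in
  let is_finished () =
    Mutex.lock m;
    let f = !finished in
    Mutex.unlock m;
    f
  in
  let last_refresh = ref (Unix.gettimeofday ()) in
  (* Wake up every [poll] seconds so completion is noticed promptly, but only
     refresh the session once [refresh_every] seconds have elapsed. *)
  while not (is_finished ()) do
    Thread.delay poll;
    let now = Unix.gettimeofday () in
    if (not (is_finished ())) && now -. !last_refresh >= refresh_every then begin
      refresh ();
      last_refresh := now
    end
  done;
  Thread.join child;
  match !outcome with
  | Some (Ok v) -> v
  | Some (Error e) -> raise e
  | None -> failwith "copy thread exited without recording an outcome"
```

Polling with a short delay, instead of sleeping for the full refresh interval, means the main thread notices completion within `poll` seconds rather than up to X minutes later. If `refresh` itself raises, this sketch propagates the exception without joining the child thread; a real implementation would have to decide whether a failed refresh should abort the migration or only be logged.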
Yes, that is probably better than passing the session through the layers 👍
Here is the problem we want to solve:
During a migration, the host where the VM is running (the source) needs to interact with the host the VM is migrating to (the destination). This uses the SM endpoint, which requires an authenticated session. By default, a session expires after 24 hours of inactivity; this can be changed by setting inactive_session_timeout in xapi.conf.
The session reference is passed as an argument to every API call, and each session has a last-active timestamp that is updated on every API call. A session on which no API call is made for longer than the timeout therefore expires.
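To see why any periodic API call is enough to keep the session alive, here is a minimal model of the behaviour described above. It is an illustration only, not xapi's actual session handling: the `session` record, `api_call`, `Session_expired`, and the use of seconds as the time unit are all assumptions of the sketch.

```ocaml
(* Hypothetical model, not xapi's implementation: each session records when it
   was last used, every API call updates that timestamp, and a session is
   treated as expired once it has been inactive for longer than the timeout. *)
type session = { id : string; mutable last_active : float }

(* Default inactivity timeout of 24 hours, expressed in seconds. *)
let default_inactive_timeout = 24.0 *. 3600.0

exception Session_expired of string

(* An API call made with a still-valid session refreshes its last-active
   timestamp; a call made after too long without activity is rejected. *)
let api_call ?(timeout = default_inactive_timeout) ~now s =
  if now -. s.last_active > timeout then raise (Session_expired s.id);
  s.last_active <- now
```

In this model, a session used at 23 h and again at 46 h stays valid, while a gap of 25 h without any call makes the next call fail, which is exactly what happens when a copy keeps the source silent for more than a day.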
The problem is that once everything is set up to transfer the data of a VDI from the source to the destination (mirror and snapshot are ready), the source starts a thread to perform the copy (sparse_dd) and blocks until that thread ends. No API call is made on the session in the meantime. This works fine when the transfer takes less than 24 hours, but for large VDIs (for example, 2 TB) it can take longer, and the session times out. When the copy finishes, the source tries to make an API call, but the session has already expired.
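A quick calculation shows how easily the limit is reached. Assuming 2 TB means 2 × 10^12 bytes and the default 24-hour timeout, the copy outlives the session whenever its sustained throughput is below:

```latex
\[
\frac{2 \times 10^{12}\ \text{B}}{24 \times 3600\ \text{s}}
  = \frac{2 \times 10^{12}}{86\,400}\ \text{B/s}
  \approx 2.31 \times 10^{7}\ \text{B/s}
  \approx 23\ \text{MB/s}
\]
```

For example, at 20 MB/s the same 2 TB copy takes 2 × 10^12 / (2 × 10^7) = 10^5 s, or about 27.8 hours.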
To avoid this, we need to either refresh the session while the copy is in progress, or create a new session once the copy is done, so that the source host can still interact with the destination. We can also set inactive_session_timeout to a larger value; that is the workaround currently proposed to the customer. However, it makes more sense to refresh the session automatically, because increasing the timeout for all sessions may create new problems, such as too many sessions existing at the same time.
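The second option, creating a new session once the copy is done, can be sketched as a retry wrapper around calls made after the copy. This is a hypothetical illustration: `Session_invalid`, `login` and `call` are placeholders, not xapi function names, and the sketch assumes the failed call had no side effects so that retrying it once is safe.

```ocaml
(* Hypothetical sketch: instead of keeping the session alive during the copy,
   detect that it has expired afterwards and log in again.
   [Session_invalid], [login] and [call] are placeholders, not xapi names. *)
exception Session_invalid

let with_session_retry ~(login : unit -> 'session) ~(session : 'session ref)
    (call : 'session -> 'a) : 'a =
  try call !session
  with Session_invalid ->
    (* The old session expired while the copy was running: create a new one,
       remember it for later calls, and retry the call exactly once. *)
    let s = login () in
    session := s;
    call s
```

Unlike a periodic refresh, this has to wrap every call the source makes after the copy, and it is only safe for calls that can be retried.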