Collaborator

@thiell thiell commented Aug 4, 2025

Fixes #566.

@thiell thiell added this to the 1.9.4 milestone Aug 5, 2025
@thiell thiell self-assigned this Aug 5, 2025
@thiell thiell force-pushed the b194_566_propagation branch 3 times, most recently from a945f82 to f8c60aa Compare August 7, 2025 07:50
@thiell thiell changed the title from "Tree: fix error handling when gateway channel is closing" to "Tree: fix gateway channel close/abort" Aug 7, 2025
@thiell thiell force-pushed the b194_566_propagation branch 2 times, most recently from d89a1d4 to b75a928 Compare August 8, 2025 06:01
thiell added 2 commits August 7, 2025 23:22
Fix PropagationChannel.ev_close(), where gateway channel termination is
handled.
An actual rc > 0 comes from the gateway command itself and means the
gateway is defective or misconfigured. In that case, we mark it as
unreachable at the Task level.
In addition, if the remote commands have not been launched yet, they
are redistributed to other available gateways.

rc=None is now handled as a normal termination of the propagation
channel and the corresponding gateway is not marked as unreachable
anymore.
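
The decision described above can be sketched as a small standalone function. This is only an illustration of the rule stated in the commit message, not the actual ClusterShell code: the names `CloseAction` and `classify_gateway_close` are hypothetical, and treating rc=0 the same way as rc=None is an assumption, since the message only covers rc > 0 and rc=None.

```python
from enum import Enum


class CloseAction(Enum):
    """Hypothetical outcomes of a gateway channel close (illustration only)."""
    NORMAL = "normal"
    UNREACHABLE = "unreachable"
    UNREACHABLE_AND_REDISTRIBUTE = "unreachable_and_redistribute"


def classify_gateway_close(rc, commands_launched):
    """Map a gateway command return code to the action described above.

    rc > 0 means the gateway command itself failed, so the gateway is
    considered defective/misconfigured. Any other value (None, and by
    assumption 0) is treated as a normal termination.
    """
    if rc is not None and rc > 0:
        if not commands_launched:
            # Remote commands not started yet: they can go to other gateways.
            return CloseAction.UNREACHABLE_AND_REDISTRIBUTE
        return CloseAction.UNREACHABLE
    return CloseAction.NORMAL
```

For example, `classify_gateway_close(None, True)` yields `CloseAction.NORMAL`, while `classify_gateway_close(1, False)` yields `CloseAction.UNREACHABLE_AND_REDISTRIBUTE`.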

Fixes cea-hpc#566.
This commit adds the ability to abort a specific gateway channel from
the initiator, which was not properly handled until now. It also fixes
gateway failover.

Changes:

* Implement TreeWorker._gateway_abort(), which can be used to abort/cancel
  all work the TreeWorker has in progress via the specified gateway. On
  such an abort (likely due to a gateway failure), the special return
  code 76 (os.EX_PROTOCOL) is used to close all remote commands running via
  this gateway. This return code is sometimes used to indicate a "Remote
  protocol error" / "An error occurred in a remote communication protocol",
  which seems appropriate here.

* Implement a new Task._pchannel_closing() method that is called from
  PropagationChannel.ev_close(), and therefore deterministically every time
  a gateway channel is closing (self-initiated or not). This method performs
  the necessary cleanup and, most notably, calls
  TreeWorker._gateway_abort(gateway) on each worker currently using the
  gateway channel.

* Update Task._pchannel_release() so that it now calls
  PropagationChannel._close() instead of Worker.abort() to properly reset
  the channel's opened/setup flags.

* Update TreeWorkerTest with tests that better cover the above changes and
  gateway failover.
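
As a rough illustration of the abort flow listed above, the toy model below closes every command running via a failed gateway with return code 76. It is a sketch under stated assumptions, not the ClusterShell implementation: `ToyTreeWorker`, `start()`, `gateway_abort()` and `pchannel_closing()` are hypothetical stand-ins for the private methods named in the list, and the data structures are invented for the example.

```python
import os

# os.EX_PROTOCOL is only defined on Unix; fall back to its usual value, 76.
EX_PROTOCOL = getattr(os, "EX_PROTOCOL", 76)


class ToyTreeWorker:
    """Hypothetical stand-in for a tree worker (illustration only)."""

    def __init__(self):
        self.running = {}   # gateway name -> set of targets still running
        self.retcodes = {}  # target name -> final return code

    def start(self, gateway, targets):
        """Record that remote commands for `targets` run via `gateway`."""
        self.running.setdefault(gateway, set()).update(targets)

    def gateway_abort(self, gateway):
        """Close all commands still running via `gateway` with EX_PROTOCOL."""
        for target in sorted(self.running.pop(gateway, ())):
            self.retcodes[target] = EX_PROTOCOL


def pchannel_closing(workers, gateway):
    """On channel close, abort the gateway on every worker (sketch)."""
    for worker in workers:
        worker.gateway_abort(gateway)


worker = ToyTreeWorker()
worker.start("gw1", ["node1", "node2"])
worker.start("gw2", ["node3"])
pchannel_closing([worker], "gw1")
```

After this runs, `node1` and `node2` carry return code 76, while `node3`, reached via `gw2`, is untouched and still running.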

Part of cea-hpc#229 and extended work on cea-hpc#566.
@thiell thiell force-pushed the b194_566_propagation branch from b75a928 to 812b20c Compare August 8, 2025 06:22
@thiell thiell added this pull request to the merge queue Aug 8, 2025
Merged via the queue into cea-hpc:master with commit 93e7ee8 Aug 8, 2025
7 checks passed
Development

Successfully merging this pull request may close these issues.

ClusterShell.Propagation.RouteResolvingError: No route available to pm4-nod01