Skip to content

Conversation

thiell
Copy link
Collaborator

@thiell thiell commented Aug 6, 2025

StreamWorker.abort() calls Engine.remove() with abort set to True but this flag was never actually used. The problem is that EngineClient._close_stream() is then called and flushes any read buffers, which can lead to unexpected read events and messing with the user's code logic (StreamWorker is used by the Gateway and this issue was discovered there).

With this patch, we ensure that worker.abort() actually aborts all engine events.

Just a last note that we keep the current behavior when both abort and timeout are set to True in Engine.remove()/EngineClient._close(). In that case, timeout=True prevails as it is Engine-induced and we consider we can flush any read buffers (so potentially generating a few ev_read events).

StreamWorker.abort() calls Engine.remove() with abort set to True but
this flag was never actually used. The problem is that
EngineClient._close_stream() is then called and flushes any read
buffers, which can lead to unexpected read events and messing with the
user's code logic (StreamWorker is used by the Gateway and this issue
was discovered there).

With this patch, we ensure that worker.abort() actually aborts all
engine events.

Just a last note that we keep the current behavior when both abort and
timeout are set to True in Engine.remove()/EngineClient._close().
In that case, timeout=True prevails as it is Engine-induced and we
consider we can flush any read buffers (so potentially generating
a few ev_read events).
@thiell
Copy link
Collaborator Author

thiell commented Aug 6, 2025

For the record, discovered at the Gateway level when troubleshooting gateway abort/termination. In the trace below we can see that upon receiving bad XML data, the Gateway triggered a worker abort() that generates unwanted ev_read events. This patch in this PR will avoid this behavior.

Traceback (most recent call last):
  File "/usr/lib64/python3.9/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 6, column 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/ClusterShell/Communication.py", line 234, in ev_read
    self._parser.feed(msg + b'\n')
  File "/usr/lib64/python3.9/xml/sax/expatreader.py", line 221, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib64/python3.9/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:6:1: not well-formed (invalid token)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/ClusterShell/Gateway.py", line 187, in recv
    self.close()
  File "/usr/lib/python3.9/site-packages/ClusterShell/Gateway.py", line 179, in close
    self._close()
  File "/usr/lib/python3.9/site-packages/ClusterShell/Communication.py", line 212, in _close
    self.worker.abort()
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/Worker.py", line 581, in abort   <<<
    self.clients[0].abort()
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/EngineClient.py", line 438, in abort
    engine.remove(self, abort=True)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Engine/Engine.py", line 495, in remove
    self._remove(client, abort, did_timeout)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/Worker.py", line 405, in _close
    EngineClient._close(self, abort, timeout)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/EngineClient.py", line 267, in _close
    self._close_stream(sname)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/EngineClient.py", line 276, in _close_stream
    self._flush_read(sname)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/Worker.py", line 438, in _flush_read
    self.worker._on_msgline(self.key, stream.rbuf, sname)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/Worker.py", line 563, in _on_msgline
    self.eh.ev_read(self, key, sname, msg)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Communication.py", line 244, in ev_read    <<<
    self.send(ErrorMessage('Parse error: %s' % ex.getMessage()))
  File "/usr/lib/python3.9/site-packages/ClusterShell/Communication.py", line 270, in send
    self.worker.write(msg.xml() + b'\n', sname=self.SNAME_WRITER)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/Worker.py", line 601, in write
    self.clients[0].write(buf, sname)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/Worker.py", line 443, in write
    self._write(sname, buf)
  File "/usr/lib/python3.9/site-packages/ClusterShell/Worker/EngineClient.py", line 408, in _write
    wfile = self.streams[sname]
KeyError: 'ch-writer'

@thiell thiell self-assigned this Aug 6, 2025
@thiell thiell added this to the 1.9.4 milestone Aug 6, 2025
@thiell thiell added this pull request to the merge queue Aug 6, 2025
Merged via the queue into cea-hpc:master with commit d547257 Aug 6, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant