
Conversation

ChrisPaulBennett
Contributor

@ChrisPaulBennett commented Oct 3, 2025

This PR hopefully closes #699 and closes #732

#732 has a write-up of Oliver's investigation into this matter (and also how to recreate the memory leak in a predictable way).
I am doing a full write-up of my own investigation in this PR, as there appears to be another small memory leak, and this write-up should prevent any duplicate work.

This memory leak was very tricky to pin down, because it is caused by code outside of Python. The tornado library calls some other code (most likely C) as part of its functionality, which means the leak is not visible from within Python: every tool I used from within Python (Heapy, tracemalloc, etc.) showed no memory leak. But if you profile the program from outside Python, using mprof in this case, the memory leak is easy to see. (mprof is a Python library, but it will profile any executable you give it, no matter the language it was written in.)
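To illustrate the difference (a standalone toy, not code from this PR): tracemalloc only tracks allocations made through Python's own allocator, so a leak in extension code moves the process's RSS without moving tracemalloc's numbers:

import resource
import tracemalloc

tracemalloc.start()

# ... exercise the suspect code path here ...

current, peak = tracemalloc.get_traced_memory()  # Python-heap allocations only
peak_rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # whole-process peak RSS (KiB on Linux)

print(f"tracemalloc current/peak: {current / 1024:.0f} / {peak / 1024:.0f} KiB")
print(f"process peak RSS:         {peak_rss_kib} KiB")
# A leak in C code grows the RSS figure while the tracemalloc figures stay flat.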

In a weird quirk, Valgrind (pronounced "val-grinned", for some reason) did not detect a memory leak. Valgrind appears to be the go-to memory leak detection software on Linux; it's even bundled with some distros. I'm not sure why Valgrind was unable to see the memory leak when mprof was.
Thanks to the information from #699, we knew that the memory leak had something to do with network connections, which thankfully narrowed the problem down to only a few files.
This GitHub conversation details some memory leak problems that other projects were having. The Tornado developers claim that the issue is caused by improper use of the library.

tornadoweb/tornado#2515

It "appears" to be a problem with how closed network connections are detected. I say "appears" because I'm not 100% certain what is going on here. My working hypothesis is that this code
if self.closed:
    return
await self.ws.write_message(data)
Does not detect "closing" connections, as opposed to closed connections. As well as open and closed, a network connection can be in a "closing" state, where it is not accepting new data but is also not caught by the above check, leaving the data to be sent sitting in RAM somewhere. The new code
if not self.ws.ws_connection.is_closing():
    await self.ws.write_message(data)
addresses this issue.

@oliver-sanders and I have had differing results when it comes to profiling; admittedly, we were using different versions of Python. I still see a memory leak, I think, but it is much smaller.
When reviewing, please use the instructions detailed in #732 to profile the code both with and without the fix. I would be very interested in your results.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders
Member

Have had a quick skim of tornadoweb/tornado#2515. There are several different examples on that thread, but I don't think this issue is related. The examples on that thread do indeed look like misuse of the library. If you try to send messages faster than the client can process them, you're going to run into problems (with any framework). Another issue mentioned there is attempting to send messages from the wrong thread (?); we should probs look into this just in case we are doing it without realising. Again, legitimate misuse.

But what's going on here is different. The theory of this PR is that attempting to send a message to a client that is in the process of being disconnected from the server side results in the message being persisted in memory. This makes some sense, as the message probably isn't going to be sent if the connection is in the process of being closed.

Need to dig into the code and understand this better. It's possible that this is something Tornado could/should defend against, if so, we should open an issue over there. But it's also possible that this is an issue of our own creation ;(, need to dig.

@ChrisPaulBennett marked this pull request as ready for review October 6, 2025 07:53
@ChrisPaulBennett
Contributor Author

ChrisPaulBennett commented Oct 6, 2025

I've done a bit more digging into the tornado library. The close_code variable (and its accompanying close_reason variable) stores the reason why the connection was closed. When a WebSocket object is created, this variable is set to None. When a connection is closed, the on_ws_connection_close() method is called; it sets the close_code and close_reason variables and invokes the on_connection_close() method. I'm assuming that if the connection is not closed properly then these variables don't get set and close_code stays at None, so a misbehaving / unstable network connection would cause issues.
I would argue that relying on close_code to detect a closed connection is misusing the library.
But I also totally agree that Tornado should be defending against this. This is a very subtle / non-obvious leak.
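For context, roughly how I understand the existing closed check to hang together (my paraphrase, not the actual source):

@property
def closed(self):
    # tornado's WebSocketHandler only sets close_code once
    # on_ws_connection_close() has run, so a connection that is merely
    # closing (or that never closed cleanly) still looks "open" here.
    return self.ws.close_code is not None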

@hjoliver
Member

hjoliver commented Oct 9, 2025

if not self.ws.ws_connection.is_closing():
    await self.ws.write_message(data)

Should it be "if is-closing OR is-closed"?

@ChrisPaulBennett
Contributor Author

if not self.ws.ws_connection.is_closing():
    await self.ws.write_message(data)

Should it be "if is-closing OR is-closed"?

Yes, it probably should be. Can you try profiling the leak with and without the fix on your system, Hilary? Oliver hasn't been able to replicate my results (the leak still appears to be there). Having some results on a different network might be illuminating.
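Something along these lines, perhaps (a sketch of the combined check, not the exact change; self.ws is the tornado WebSocketHandler, as in the snippets above):

async def send(self, data):
    conn = self.ws.ws_connection
    # Skip the write if the connection object is already gone, fully
    # closed, or in the process of closing -- writing in any of those
    # states achieves nothing.
    if conn is None or self.closed or conn.is_closing():
        return
    await self.ws.write_message(data)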

@oliver-sanders removed the request for review from MetRonnie October 13, 2025 11:57
@oliver-sanders marked this pull request as draft October 14, 2025 14:53
@oliver-sanders
Member

(drafting whilst we investigate further)

@ChrisPaulBennett
Contributor Author

I now have a new working theory of this leak.
I am very ready for this to also be wrong, but hey ho.
An exception ("task exception was never retrieved") was being raised in the send method and was not being handled correctly because of the way asyncio was being used.
From the asyncio docs:

"If a Future.set_exception() is called but the Future object is never awaited on, the exception would never be propagated to the user code."

I discovered this when I implemented context managers. I don't quite understand why context managers revealed this, but they did.
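A standalone toy (nothing to do with the PR code) that reproduces the gotcha the docs describe:

import asyncio

async def doomed():
    raise RuntimeError("boom")

async def main():
    # Fire-and-forget: nothing awaits the task or calls task.exception(),
    # so the error only surfaces later as "Task exception was never
    # retrieved" when the task object is garbage collected.
    asyncio.create_task(doomed())
    await asyncio.sleep(0.1)

asyncio.run(main())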

I have now handled the exception correctly so it gets passed to the error handling logic.
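Roughly, the shape of the handling (an illustrative sketch, not the literal diff; logger is a stand-in for whatever the error handling logic uses):

from tornado.websocket import WebSocketClosedError

async def send(self, data):
    try:
        await self.ws.write_message(data)
    except WebSocketClosedError:
        # The client went away mid-send; report it through the normal
        # logging path rather than leaving the exception unretrieved on
        # a background asyncio Task.
        logger.warning(
            "[GraphQL WS] Websocket closed on send to remote IP: %s",
            self.ws.request.remote_ip,
        )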
We now get messages

[W 2025-10-16 08:23:33.454 CylcUIServer] [GraphQL WS] Websocket closed on send (Op.Type: complete, Op.ID: 1) to remote IP: 127.0.0.1
[W 2025-10-16 08:23:42.195 CylcUIServer] [GraphQL WS] Websocket closed on send (Op.Type: data, Op.ID: 1) to remote IP: 127.0.0.1
[W 2025-10-16 08:23:42.196 CylcUIServer] [GraphQL WS] Websocket closed on send (Op.Type: error, Op.ID: 1) to remote IP: 127.0.0.1

I have been unable to ascertain what exactly is causing the exception.
The self.ws.ws_connection object is sometimes None when the send method is called, causing the exception. I have lost quite a lot of sanity trying to figure out what is causing the ws object to lose its ws_connection object.
But correctly handling the exception seems to have stopped the leak.
These two instances of the UI server have been running for about 16 hours
With fix:
[screenshot: memory profile with the fix]

Without fix:
[screenshot: memory profile without the fix]

Not being able to get to the root cause of the issue is somewhat annoying, but I think I have rendered it harmless (fingers crossed).

@ChrisPaulBennett marked this pull request as ready for review October 16, 2025 08:20
@oliver-sanders
Member

Profiling results (left: before, right: after):

[screenshot: mprof memory profiles, before and after]

The first 10 seconds show the GUI with no clients attached to provide a baseline, at 10s the browser is opened.

Memory usage appears to be consistently starting from a higher baseline after these changes; I have no idea why. Something about this branch is having an impact, which is promising, but the stepped pattern of memory increase doesn't appear to have altered :(

@oliver-sanders
Member

Note to reviewers

To profile this change, make sure to run your experiment for at least one hour! The memory usage in the first ~500 seconds is not indicative of the program's long-term usage; it seems to settle down and plateau.
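If mprof isn't available, a rough RSS-sampling sketch along these lines (psutil assumed to be installed; pass the UI Server's PID on the command line) gives a similar long-run picture:

import sys
import time

import psutil  # assumed available; any RSS sampler would do

proc = psutil.Process(int(sys.argv[1]))  # PID of the running UI Server
start = time.time()
with open("rss.csv", "w") as out:
    out.write("elapsed_s,rss_mib\n")
    while True:
        rss_mib = proc.memory_info().rss / (1024 * 1024)
        out.write(f"{time.time() - start:.0f},{rss_mib:.1f}\n")
        out.flush()
        time.sleep(10)  # sample every 10 s; keep it running for at least an hour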

Both before and after this change there appears to be a sudden memory jump after a good length of time. I do not yet understand why; however, memory appears to be stable before and after the jump. A subject for a future investigation perhaps, or perhaps just some form of low-level memory management?



Development

Successfully merging this pull request may close these issues:

  • small memory leak for new connections/subscriptions
  • WebSocketClosedError on write_message (correlation with memory accumulation)
