Fixing memory leak #740
Conversation
I've had a quick skim of tornadoweb/tornado#2515. There are several different examples on that thread, but I don't think this issue is related; those examples do look like misuse of the library. If you try to send messages faster than the client can process them, you're going to run into problems with any framework. Another issue mentioned there is attempting to send messages from the wrong thread; we should probably check that we aren't doing this without realising, but again, that would be legitimate misuse.

What's going on here is different. The theory of this PR is that attempting to send a message to a client which is in the process of being disconnected from the server side results in the message being persisted in memory. That makes some sense, as the message is probably never going to be sent if the connection is being closed. I need to dig into the code and understand this better. It's possible this is something Tornado could/should defend against, in which case we should open an issue over there, but it's also possible that this is an issue of our own creation ;(, need to dig.
I've done a bit more digging into the Tornado library. The `close_code` variable (and its accompanying `close_reason` variable) stores the reason why the connection was closed. When a WebSocket object is created, this variable is set to `None`.
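For reference, a minimal sketch of where these attributes show up, based on my reading of the Tornado docs (the handler class and the `print` are purely illustrative, not code from this repo):

```python
import tornado.websocket


class ExampleHandler(tornado.websocket.WebSocketHandler):
    def on_close(self) -> None:
        # Per the Tornado docs: if the connection was closed cleanly and a
        # status code / reason phrase was supplied, they are available here
        # as self.close_code and self.close_reason; otherwise they stay None.
        print(f"closed: code={self.close_code} reason={self.close_reason}")
```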
Should it be "if is-closing OR is-closed"?
Yes, it probably should be. Can you try profiling the leak with and without the fix on your system, Hilary? Oliver hasn't been able to replicate my results (the leak still appears to be there for him), so having some results on a different network might be illuminating.
(drafting whilst we investigate further)
Note to reviewers: to profile this change, make sure to run your experiment for at least one hour! The memory usage in the first ~500 seconds is not indicative of the long-term usage of the program; it seems to settle down and plateau. Both before and after this change there appears to be a sudden memory jump after a good length of time. Why, I do not yet understand, however memory appears to be stable either side of it. A subject for a future investigation perhaps, or perhaps just some form of low-level memory management?
This PR hopefully closes #699 and closes #732
#732 has a write-up of Oliver's investigation into this matter (and also how to recreate the memory leak in a predictable way).
I am doing a full write-up of my investigation here, as there appears to be another small memory leak, and this write-up should prevent any duplicate work.
This memory leak was very tricky to pin down because it is caused by code outside of Python: the Tornado library calls some other code (most likely C) as part of its functionality, which means the leak is not visible from within Python. All the tools I used from within Python (Heapy, tracemalloc, etc.) showed no memory leak. But if you profile the program from outside of Python, using mprof in this case, the memory leak is easy to see. (mprof is a Python library, but it will profile any executable you give it, no matter what language it was written in.)
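If you haven't used mprof before, the workflow is roughly as follows (the script name is just a placeholder for however you launch your experiment):

```console
pip install memory_profiler matplotlib

# sample the whole process's memory usage (RSS) over time
mprof run python my_experiment.py

# plot the most recent recording
mprof plot
```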
In a weird quirk, Valgrind (pronounced "val-grinned", for some reason) did not detect a memory leak. Valgrind appears to be the go-to memory-leak detection software on Linux; it's bundled with some distros. I'm not sure why Valgrind was unable to see the memory leak when mprof was.
Thanks to the information from #699, we knew that the memory leak was something to do with network connections, which thankfully narrowed down the problem to only a few files.
This GitHub conversation details some memory leak problems that other projects were having. The Tornado developers claim that the issue is caused by improper use of the library.
tornadoweb/tornado#2515
It "appears" to be a problem with how closed network connections are detected. I say "appears" because I'm not 100% certain what is going on here. My working hypothesis is that this code
```python
if self.closed:
    return
await self.ws.write_message(data)
```
does not detect "closing" connections, as opposed to closed ones. As well as open and closed, a network connection can be in a "closing" state, where it is no longer accepting new data but is not caught by the check above, leaving the data to be sent sitting in RAM somewhere. The new code:
```python
if not self.ws.ws_connection.is_closing():
    await self.ws.write_message(data)
```
addresses this issue.
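For context, here is a rough sketch of how the guarded send path fits together. The wrapper class and its attribute names are illustrative rather than the exact code in this repo; only `ws_connection`, `is_closing()` and `write_message()` come from Tornado itself:

```python
from tornado.websocket import WebSocketHandler


class ConnectionWrapper:
    """Illustrative wrapper around a Tornado WebSocket handler."""

    def __init__(self, ws: WebSocketHandler):
        self.ws = ws
        self.closed = False  # set by our own close/teardown logic

    async def send(self, data: str) -> None:
        conn = self.ws.ws_connection
        # Skip closed *and* closing connections, otherwise the outgoing
        # message can end up parked in a buffer that is never flushed.
        if self.closed or conn is None or conn.is_closing():
            return
        await self.ws.write_message(data)
```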
@oliver-sanders and I have had differing results when it comes to profiling (admittedly we were using different versions of Python). I think I still see a memory leak, but it is much smaller.
When reviewing, please use the instructions detailed in #732 to profile the code both with and without the fix; I would be very interested in your results.
Check List
- I have read `CONTRIBUTING.md` and added my name as a Code Contributor.
- Any dependency changes are also made in `setup.cfg` (and `conda-environment.yml` if present).
- If this is a bug fix, it is raised against the relevant ?.?.x branch.