- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 33.2k
gh-128679: Fix tracemalloc.stop() race condition #128710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Check again 'tracemalloc_config.tracing' once the GIL is held in tracemalloc_raw_alloc() and PyTraceMalloc_Track(), since another thread can call tracemalloc.stop() during PyGILState_Ensure() call.
| Should we include a test based off the reproducer in the issue? | 
* Hold the table lock while calling _PyTraceMalloc_Stop(). * Add tracemalloc_raw_free().
| My fix was incomplete, the PyMem_RawFree() hook also had the bug: 
 Ok, I added a test. It was a good idea, I found PyMem_RawFree() bug with it :-) | 
| @ZeroIntensity: Would you mind to review the PR which now has a test? | 
| Will get to it a little later today :) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not too comfortable relying on incidental lock ordering with the tables lock for synchronization here. Everywhere tracemalloc_config.tracing is used the GIL is held, right? We should be able to rely on the GIL for synchronization. Oherwise, let's either use a dedicated lock for it or one of the _Py_atomic APIs.
| { | ||
| PyTraceMalloc_Track(123, 10, 1); | ||
|  | ||
| PyThread_type_lock lock = (PyThread_type_lock)data; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a lot easier to use a PyMutex, considering you don't have to heap allocate it. I think it works without a thread state if you use _PyMutex_LockFlags(mutex, _Py_LOCK_DONT_DETACH)
| else { | ||
| // gh-128679: tracemalloc.stop() was called by another thread | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove the else:
| else { | |
| // gh-128679: tracemalloc.stop() was called by another thread | |
| } | |
| // gh-128679: If we're not tracing, then tracemalloc.stop() was called by another thread | 
|  | ||
| // Lock to synchronize with tracemalloc_raw_free() which checks | ||
| // 'tracing' while holding the lock. | ||
| TABLES_LOCK(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in the other PR, relying on lock ordering like this is really error prone. If we hold the GIL, then we should be able to use tracemalloc_config.tracing without a lock, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in the other PR, relying on lock ordering like this is really error prone.
Would you mind to elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment: #128710 (comment)
| 
 This PR adds  So I decided to reuse TABLES_LOCK() to access  
 | 
| 
 Right, so this is only safe because of lock-ordering. This approach with the tables lock only works for this case because we assume the only thread that can modify  My other concern is re-entrancy and lock-ordering deadlocks. There are two cases here, depending on what  
 Either way, it's problematic. | 
| 
 
 | 
| 
 It doesn't release the GIL. 
 _PyTraceMalloc_Stop() cannot release the GIL. It calls _Py_hashtable_clear() on these tables: 
 
 
 The tricky part is to remove a trace in  | 
| Generally, things release the GIL when they're waiting on something, like a lock.  
 Ok, good to know. I was under the impression that any  Anyways, the fix here would be a lot simpler (and less ambiguous) if we went with a dedicated lock for  | 
| 
 There are multiple constraints. 
 (B) means that  My fix for the  | 
| 
 | 
| 
 Are you sure about  EDIT: Never mind, found it. | 
| 
 Well, sort of. It's used for protecting the operations from multithreaded races, not necessarily synchronizing the threads in a certain order. Anyways--the fix works, but only for the specific test. I'd rather dedicated locks or atomics because: 
 So, I'm willing to approve this as a bandaid for 3.12, but if we go with this let's try to get a better fix in 3.13 and main (where  | 
| /* tracemalloc is not tracing: do nothing */ | ||
| return -2; | ||
| } | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's either lock the read above this or remove it entirely, then LGTM for 3.12.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote #128897 to fix Python 3.12 and 3.13.
| I wrote a different fix based on my refactoring that I just merged: #128893. | 
Check again 'tracemalloc_config.tracing' once the GIL is held in tracemalloc_raw_alloc() and PyTraceMalloc_Track(), since another thread can call tracemalloc.stop() during PyGILState_Ensure() call.