-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
gh-115999: Enable specialization of CALL instructions in free-threaded builds
#127123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
_CALL_ALLOC_AND_ENTER_INIT will be addressed in a separate PR
This needs to acquire a critical section on the list.
- Modify `get_init_for_simple_managed_python_class` to return both init
as well as the type version at the time of lookup.
- Modify caching logic to verify that the current version of the type
matches the version at the time of lookup. This prevents potentially
caching a stale value if we race with an update to __init__.
- Only cache __init__ functions that are deferred in free-threaded builds.
This ensures that the borrowed reference to __init__ that is stored in
the cache is valid if the type version guard in _CHECK_AND_ALLOCATE_OBJECT
passes:
1. The type version is cleared before the reference in the MRO to __init__
is destroyed.
2. If the reference in (1) was the last reference then the __init__ method
will be queued for deletion the next time GC runs.
3. GC requires stopping the world, which forces a synchronizes-with operation
between all threads.
4. If the GC collects the cached __init__, then type's version will have been
updated *and* the update will be visible to all threads, so the guard
cannot pass.
- There are no escaping calls in between loading from the specialization cache
and pushing the frame. This is a requirement for the default build.
CALL in free-threaded buildsCALL instructions in free-threaded builds
Lib/test/test_monitoring.py
Outdated
| self.events.append(("return", code.co_name, val)) | ||
|
|
||
| # CALL_ALLOC_AND_ENTER_INIT will only cache __init__ methods that are | ||
| # deferred. We only defer functions defined at the top-level. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I think we should be deferring functions defined in classes, even if the classes are nested. They already require GC for collection because classes are full of cycles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #127274
Python/perf_trampoline.c
Outdated
| } | ||
| if (!activate) { | ||
| tstate->interp->eval_frame = NULL; | ||
| set_eval_frame(tstate, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these go through _PyInterpreterState_SetEvalFrameFunc()? It'd be nice to have just one function that modifies interp->eval_frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that should be fine (the tests pass using it). @pablogsal - do you know if we need to set tstate->eval_frame explicitly when setting / clearing the perf trampoline, or is it fine to go through _PyInterpreterState_SetEvalFrameFunc()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any difference here so it should be safe to just go through _PyInterpreterState_SetEvalFrameFunc().
Just remember to run the buildbots before landing because these tests only run there with all the options.
|
Looks like the nogil refleaks buildbots are surfacing a real issue, digging into that. The other buildbot failures look like they're unrelated to this PR:
|
Fix a bug in `_CREATE_INIT_FRAME` where the frame is pushed to the stack on failure. `_CREATE_INIT_FRAME` pushes a pointer to the new frame onto the stack for consumption by the next uop. When pushing the frame fails, we do not want to push the result (NULL) to the stack because it is not a valid stackref and will be exposed to the generic error handling code in the interpreter loop. This worked in default builds because `PyStackRef_NULL` is `NULL` in default builds, which is not the case in free-threaded builds.
|
!buildbot nogil refleak |
|
🤖 New build scheduled with the buildbot fleet by @mpage for commit de0e2ee 🤖 The command will test the builders whose names match following regular expression: The builders matched are:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Two small comments below.
Objects/typeobject.c
Outdated
| can_cache = can_cache && _PyObject_HasDeferredRefcount(init); | ||
| #endif | ||
| if (can_cache) { | ||
| FT_ATOMIC_STORE_PTR_RELAXED(type->_spec_cache.init, init); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be "release" so that the contents of init are visible before init is written to type->_spec_cache.init.
(There might be other synchronization of the thread-local specialization process that makes relaxed okay, but it seems easier to reason about to me if it's at least "release")
Python/bytecodes.c
Outdated
| DEOPT_IF(!PyList_Check(self_o)); | ||
| STAT_INC(CALL, hit); | ||
| #ifdef Py_GIL_DISABLED | ||
| int err = _PyList_AppendTakeRefAndLock((PyListObject *)self_o, PyStackRef_AsPyObjectSteal(arg)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the DEOPT_IF(!LOCK_OBJECT(...)) macros instead of adding the new _PyList_AppendTakeRefAndLock function.
I'm a bit uneasy about the critical section after the DEOPT_IF guards. The critical section may block and allow a stop-the-world operation to occur in between the check of the guards and the list append. I'm concerned that it might allow some weird things to occur (like swapping the type of objects) that would invalidate the guards.
In other words, this adds a potentially escaping call in a place where we previously didn't have one.
The DEOPT_IF(!LOCK_OBJECT(...)) avoids this because it doesn't block (and doesn't detach the thread state).
|
Test failure is #127421 |
…-threaded builds (python#127123) The CALL family of instructions were mostly thread-safe already and only required a small number of changes, which are documented below. A few changes were needed to make CALL_ALLOC_AND_ENTER_INIT thread-safe: Added _PyType_LookupRefAndVersion, which returns the type version corresponding to the returned ref. Added _PyType_CacheInitForSpecialization, which takes an init method and the corresponding type version and only populates the specialization cache if the current type version matches the supplied version. This prevents potentially caching a stale value in free-threaded builds if we race with an update to __init__. Only cache __init__ functions that are deferred in free-threaded builds. This ensures that the reference to __init__ that is stored in the specialization cache is valid if the type version guard in _CHECK_AND_ALLOCATE_OBJECT passes. Fix a bug in _CREATE_INIT_FRAME where the frame is pushed to the stack on failure. A few other miscellaneous changes were also needed: Use {LOCK,UNLOCK}_OBJECT in LIST_APPEND. This ensures that the list's per-object lock is held while we are appending to it. Add missing co_tlbc for _Py_InitCleanup. Stop/start the world around setting the eval frame hook. This allows us to read interp->eval_frame non-atomically and preserves the behavior of _CHECK_PEP_523 documented below.
…-threaded builds (python#127123) The CALL family of instructions were mostly thread-safe already and only required a small number of changes, which are documented below. A few changes were needed to make CALL_ALLOC_AND_ENTER_INIT thread-safe: Added _PyType_LookupRefAndVersion, which returns the type version corresponding to the returned ref. Added _PyType_CacheInitForSpecialization, which takes an init method and the corresponding type version and only populates the specialization cache if the current type version matches the supplied version. This prevents potentially caching a stale value in free-threaded builds if we race with an update to __init__. Only cache __init__ functions that are deferred in free-threaded builds. This ensures that the reference to __init__ that is stored in the specialization cache is valid if the type version guard in _CHECK_AND_ALLOCATE_OBJECT passes. Fix a bug in _CREATE_INIT_FRAME where the frame is pushed to the stack on failure. A few other miscellaneous changes were also needed: Use {LOCK,UNLOCK}_OBJECT in LIST_APPEND. This ensures that the list's per-object lock is held while we are appending to it. Add missing co_tlbc for _Py_InitCleanup. Stop/start the world around setting the eval frame hook. This allows us to read interp->eval_frame non-atomically and preserves the behavior of _CHECK_PEP_523 documented below.
The
CALLfamily of instructions were mostly thread-safe already and only required a small number of changes, which are documented below.A few changes were needed to make
CALL_ALLOC_AND_ENTER_INITthread-safe:_PyType_LookupRefAndVersion, which returns the type version corresponding to the returned ref._PyType_CacheInitForSpecialization, which takes an init method and the corresponding type version and only populates the specialization cache if the current type version matches the supplied version. This prevents potentially caching a stale value in free-threaded builds if we race with an update to__init__.__init__functions that are deferred in free-threaded builds. This ensures that the reference to__init__that is stored in the specialization cache is valid if the type version guard in_CHECK_AND_ALLOCATE_OBJECTpasses._CREATE_INIT_FRAMEwhere the frame is pushed to the stack on failure.A few other miscellaneous changes were also needed:
{LOCK,UNLOCK}_OBJECTinLIST_APPEND. This ensures that the list's per-object lock is held while we are appending to it.co_tlbcfor_Py_InitCleanup.interp->eval_framenon-atomically and preserves the behavior of_CHECK_PEP_523documented below.Single-threaded performance
Scaling
The scaling benchmark looks about the same for this PR vs its base:
Thread safety
Thread safety of each instruction in the CALL family is documented below, starting with the uops that are composed to form instructions in the family.
UOPS
The more interesting uops that warrant closer inspection are:
_CHECK_AND_ALLOCATE_OBJECTThis uop loads an
__init__method from the specialization cache of the operand (a type) if the operand's type version matches the type version stored in the inline cache. The loaded method is guaranteed to be valid because we only store deferred objects in the specialization cache and there are no escaping calls following the load:__init__is destroyed.__init__method will be queued for deletion the next time GC runs.__init__, then type's version will have been updated and the update will be visible to all threads, so the guard cannot pass._CHECK_FUNCTION_VERSIONThis uop guards that the top of the stack is a function and that its version matches the version stored in the inline cache. Instructions assume that if the guard passes, the version, and any properties verified by the version, will not change for the remainder of the instruction execution, assuming there are no escaping calls in between the guard and the code that relies on the guard. This property is preserved in free-threaded builds: the world is stopped whenever a function's version changes.
_CHECK_PEP_523This uop guards that a custom eval frame function is not in use. Instructions assume that if the guard passes, an eval frame function will not be set for the remainder of the instruction's execution, assuming there are no escaping calls in between the guard and code that relies on the guard passing. This property is preserved in free-threaded builds: the world is stopped whenever the eval frame function is set.
The instructions are also composed of uops whose thread safety properties are easier to reason about and require less scrutiny. These are:
_CALL_NON_PY_GENERAL- Uses existing thread-safe APIs._CHECK_CALL_BOUND_METHOD_EXACT_ARGS- Only performs exact type checks, which are thread-safe: changing an instance's type stops the world._CHECK_FUNCTION_EXACT_ARGS- All the loads in the uop are safe to perform non-atomically: settingfunc->func_codestops the world, theco_argcountattribute of code objects is immutable._CHECK_IS_NOT_PY_CALLABLE- Only performs exact type checks._CHECK_METHOD_VERSION- This loads a function from aPyMethodObjectand guards that its version matches what is stored in the cache.PyMethodObjects are immutable; their fields can be accessed non-atomically. The thread safety of function version guards was already documented above._CHECK_PERIODIC- Thread safety was previously addressed as part of the 3.13 release._CHECK_STACK_SPACE- All the loads in this uop are safe to perform non-atomically: settingfunc->func_codestops the world, theco_framesizeattribute of code objects is immutable, andtstate->py_recursion_remainingshould only be mutated by the current thread._CREATE_INIT_FRAME- Uses existing thread-safe APIs._EXPAND_METHOD- Only loads fromPyMethodObjects._INIT_CALL_BOUND_METHOD_EXACT_ARGS- Only loads fromPyMethodObjects._INIT_CALL_PY_EXACT_ARGS- Only operates on data that isn't yet visible to other threads._PUSH_FRAME- Only manipulateststate->current_frameand fields that are not read by other threads._PY_FRAME_GENERAL- Reads from fields that are either immutable (co_flags) or requires stopping the world to change (func_code)._SAVE_RETURN_OFFSET- Stores only to the frame'sreturn_offsetwhich is not read by other threads.Instructions
These instructions perform exact type checks and loads from immutable fields of
PyCFunctionobjects:CALL_BUILTIN_FASTCALL_BUILTIN_FAST_WITH_KEYWORDSCALL_BUILTIN_OThese instructions perform exact type checks and loads from immutable fields of
PyMethodDescrObjects:CALL_METHOD_DESCRIPTOR_FASTCALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDSCALL_METHOD_DESCRIPTOR_NOARGSCALL_METHOD_DESCRIPTOR_OThese instructions are composed of the uops documented above, and are thread-safe transitively:
CALL_ALLOC_AND_ENTER_INITCALL_BOUND_METHOD_EXACT_ARGSCALL_BOUND_METHOD_GENERALCALL_NON_PY_GENERALCALL_PY_EXACT_ARGSCALL_PY_GENERALThese instructions load from the callable cache, which is immutable, perform exact type checks, and use existing thread-safe APIs:
CALL_ISINSTANCECALL_LENCALL_LIST_APPENDThese instructions use existing thread-safe APIs:
CALL_STR_1CALL_TUPLE_1CALL_TYPE_1Finally, these instructions don't categorize neatly:
CALL_BUILTIN_CLASS- Performs exact type checks and loads from immutable types.Specialization
Apart from the changes discussed earlier, specialization is already thread-safe. It inspects immutable properties (i.e. those of code objects, method descriptors, or
PyCFunctions) or properties that require stopping the world to mutate (i.e. properties checked by function version guards).--disable-gilbuilds #115999