-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
gh-139109: A new tracing JIT compiler frontend for CPython #140310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Closing because this is triggering some UB in CI and wasting resources. Will reopen once I fix it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exciting stuff.
I've done a preliminary review. More to come later.
@Fidget-Spinner do you have any ideas as to why the go benchmark is 20% slower?
|
||
typedef _Py_CODEUNIT *(*_PyJitEntryFuncPtr)(struct _PyExecutorObject *exec, _PyInterpreterFrame *frame, _PyStackRef *stack_pointer, PyThreadState *tstate); | ||
|
||
typedef struct _PyJitState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All but one of these fields have a 'jit_tracer' prefix. Maybe remove the prefix, so we don't have too many redundant "jit"s in the code.
Maybe rename the struct to PyJitTracerState
.
In general, drop "jit" wherever the context makes it clear, to help readability
PyMem_Free(co_extra); | ||
} | ||
#ifdef _Py_TIER2 | ||
_Py_JITTracer_InvalidateDependency(tstate, self); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, but it made me curious.
OOI, did you just spot this, or did the JIT manage to end up running code from a deallocated code object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main problem isn't running deallocated code. The problem was that the JIT actually traced then tried to compile code that happened to deallocate while tracing. Thus leading to segfaults while optimizing.
This came up in a few tests from the test suite.
if (length == 0) { | ||
return length; | ||
} | ||
optimize_uops( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may still want to prune the trace in the optimizer. For example, if the optimizer's confidence drops too low.
So leave optimize_uops
returning the length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code is wrong. The optimizer never errors. So length can never be 0.
The optimizer currently just returns the length of the unoptimized trace if it doesn't optimize. So this entire code is dead code.
However, I'll leave it there in case you need it in the future.
DISPATCH_TABLE_VAR = TRACING_DISPATCH_TABLE; | ||
# define LEAVE_TRACING() \ | ||
DISPATCH_TABLE_VAR = DISPATCH_TABLE; | ||
# define BAIL_TRACING_NO_DISPATCH() \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've already got too many macros here, and it makes the code really hard to follow.
It also creates code bloat as we replicate all this code 256 times.
At the very least, factor out as much as you can into helper functions.
Almost everything in this macro after LEAVE_TRACING()
could go in a helper function.
} | ||
#endif | ||
|
||
/* _PyEval_EvalFrameDefault is too large to optimize for speed with PGO on MSVC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never more true 🙂
} while (0) | ||
|
||
#define GOTO_TIER_ONE(TARGET) \ | ||
#define GOTO_TIER_ONE(TARGET, SHOULD_CONTINUE_TRACING) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHOULD_CONTINUE_TRACING
doesn't seem to be used anywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used by _COLD_EXIT
to tell the interpreter to resume tracing upon exiting to tier 1. This is what allows it to form the side exit trace.
ADD_TO_TRACE(_SET_IP, 0, (uintptr_t)instr, target); | ||
DPRINTF(2, "%p %d: %s(%d) %d\n", old_code, target, _PyOpcode_OpName[opcode], oparg, progress_needed); | ||
|
||
bool needs_guard_ip = _PyOpcode_NeedsGuardIp[opcode] && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we fix up this table, so we don't need all the special cases afterwards?
This PR changes the current JIT model from trace projection to trace recording. Benchmarking: 0.6% slower pyperformance versus TC+current JIT, However, 70% faster Richards on the most improved benchmark versus the current JIT. https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251020-3.15.0a1%2B-ec2971f-CLANG%2CJIT/bm-20251020-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-ec2971f-vs-base.md
This new JIT frontend is already able to record/execute significantly more instructions than the previous JIT frontend. In this PR, we are now able to record through custom dunders, simple object creation, generators, etc. None of these were done by the old JIT frontend. Some custom dunders uops were discovered to be broken as part of this work gh-140277
Some tests for the optimizers are disabled for now not because the optimizer doesn't work, but that it's actually unrolling recursion, thus leading to different results in the trace!
The full test suite passes on my system.
Not close enough to a holiday, so no poem this time!
Pros:
optimizer.c
is now significantly simpler, as we don't have to do strange things to recover the bytecode from a trace.Cons:
Design:
_DYNAMIC_EXIT
. This new instruction will cause a re-trace when hot enough starting from the target instruction.