-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
Local names linking #60031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Local names linking #60031
Conversation
Renumber jl_invoke_api_t
Use JITLink everywhere Rename jlcall_type, add jl_funcs_invoke_ptr Move JLLinkingLayer into JuliaOJIT Use jl_invoke_api_t elsewhere Rename JL_INVOKE_JFPTR -> JL_INVOKE_SPECSIG Put all special symbol names in one place Add helper for specsig -> tojlinvoke (fptr1) and use it Fix invariants for code_outputs Document JIT invariants better; remove invalid assertions Replace workqueue, partially support OpaqueClosure Add JIT tests Stop using strings so much Don't create an LLVM::Linker unless necessary Generate trampolines in aot_link_output GCChecker annotations, misc changes Re-add emit_always_inline Get JLDebuginfoPlugin and eh_frame working again Re-add OpaqueClosure MethodInstance global root Fix GCChecker annotations Clean up TODOs Read dump compile Use multiple threads in the JIT Add PLT/GOT for external fns Name Julia PLT GOT entries Do emit_llvmcall_modules at the end Suppress clang-tidy, static analyzer warnings Keep temporary_roots alive during emit_always_inline Mark pkg PLT thunks noinline Don't attempt to emit inline codeinsts when IR is too large or missing Improve thunk generation on x86 Fix infinite loop in emit_always_inline if inlining not possible Use local names for global targets Fix jl_get_llvmf_defn_impl cfunction hacks
| class JLMaterializationUnit : public orc::MaterializationUnit { | ||
| public: | ||
| static JLMaterializationUnit Create(JuliaOJIT &JIT, ObjectLinkingLayer &OL, | ||
| std::unique_ptr<jl_linker_info_t> Info, | ||
| std::unique_ptr<MemoryBuffer> Obj) JL_NOTSAFEPOINT | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I have been wanting this for a long time!
Would it make sense to have a C-API for creating these? So that LLVM.jl could create them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, though I would not want to expose it in a way that would lock in some of the design choices, like how JLMaterializationUnit owns the object buffer.
I'm undecided on how much work should be deferred to materialization. Right now jl_compile_codeinst_now blocks all threads waiting on compilation until everything is compiled to object files, like on master. I'd like to leave the door open to letting ORC decide when to compile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I have been wanting to try and ORC based setup for GPUCompiler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still no C API, but fwiw I have switched this most recent version over to doing compilation in JLMaterializationUnit::materialize.
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is essentially a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory to come later. For now, we fall back to the old MapperJITLinkMemoryManager.
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory later. For now, we fall back to the old MapperJITLinkMemoryManager.
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory later. For now, we fall back to the old MapperJITLinkMemoryManager. Release JLJITLinkMemoryManager lock when calling FinalizedCallbacks
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory later. For now, we fall back to the old MapperJITLinkMemoryManager. Release JLJITLinkMemoryManager lock when calling FinalizedCallbacks
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory later. For now, we fall back to the old MapperJITLinkMemoryManager. Release JLJITLinkMemoryManager lock when calling FinalizedCallbacks
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory later. For now, we fall back to the old MapperJITLinkMemoryManager. Release JLJITLinkMemoryManager lock when calling FinalizedCallbacks
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen.
there are several open issues observing inference changes when methods are redefined; does this PR affect those? |
|
No, this PR only changes code generation. |
Unfortunately the "portable" LLVM way of generating thunks doesn't generate the code we want. Instead, on platforms where it makes sense, we'll steal the LLD PLT thunk code, but in disassembled form. At some point this should be moved to after linking, where it can be in assembled form again. Amusingly it will be more portable in assembled form, because the assembler syntax for relocations differs between object formats.
|
This new commit fixes some horrible code generation in |
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. (cherry picked from commit 6fa0e75)
Replace all uses of `ptrdiff_t slide` and `int64_t slide` with `uint64_t`. If a JITted object is ever assigned an address in the upper half of the address space, which is quite common on 32-bit Linux, the expression `SectionAddr - SectionLoadAddr` has undefined behaviour. This resulted in some very confusing bugs that manifested far from the source. It is easier to use unsigned integers everywhere we need a difference, since we know they have two's complement representation. Cherry-picked from JuliaLang#60031. [1] https://buildkite.com/julialang/julia-master/builds/52196/steps/canvas?sid=019a9d6f-14a6-4ffc-be19-f2f835d1e719
Replace all uses of `ptrdiff_t slide` and `int64_t slide` with `uint64_t`. If a
JITted object is ever assigned an address in the upper half of the address space
on a platform with `sizeof(char *) = 4`, which is quite common on 32-bit Linux,
the following can happen:
In JITDebugInfoRegistry::registerJITObject, `SectionAddr - SectionLoadAddr`
is computed in uint64_t (ok), then cast to ptrdiff_t (two's complement of
the uint64_t version mod 2^32). This is apparently implementation-defined
behaviour rather than undefined.
Say SectionAddr = 0x1000UL, SectionLoadAddr = 0xe93b2000UL and
size_t pointer = 0xe93b20abU.
```
(ptrdiff_t)(SectionAddr - SectionLoadAddr) == (ptrdiff_t)0xffffffff16c4f000
== 382005248
```
jl_DI_for_fptr implicitly converts the ptrdiff_t to int64_t:
```
(int64_t)382005248 == 382005248L
```
lookup_pointer adds `size_t pointer` to `int64_t slide`. Both are converted
to int64_t because it can represent every size_t:
```
(int64_t)0xe93b20abU + 382005248L == 3912966315L + 382005248L
== 4294971563L
```
This is converted back to uint64_t by makeAddress, resulting in an address other
than the 0x10ab we expected:
```
(uint64_t)4294971563L == 0x1000010abUL
```
It is easier to use unsigned integers everywhere we need a difference, since
they avoid the problem of losing upper bits after sign extension and avoid weird
UB from signed overflow.
Cherry-picked from JuliaLang#60031.
[1] https://buildkite.com/julialang/julia-master/builds/52196/steps/canvas?sid=019a9d6f-14a6-4ffc-be19-f2f835d1e719
|
@nanosoldier |
|
The package evaluation job you requested has completed - possible new issues were detected. Report summary❗ Packages that crashed1 packages crashed only on the current version.
272 packages crashed on the previous version too. ✖ Packages that failed35 packages failed only on the current version.
1203 packages failed on the previous version too. ✔ Packages that passed tests5 packages passed tests only on the current version.
5557 packages passed tests on the previous version too. ~ Packages that at least loaded3439 packages successfully loaded on the previous version too. ➖ Packages that were skipped altogether1 packages were skipped only on the current version.
907 packages were skipped on the previous version too. |
|
Is this expected to be breaking for low-level packages like AllocCheck or CompilerCaching.jl? Both error now with duplicate definition errors when putting stuff in the Julia JIT ( |
|
Putting stuff directly in Julia's JIT is generally expected to be fairly buggy. But we should make some new JITDylib for each thing you put in to at least protect it against such trivial conflicts from crashing. |
|
They are putting it in the external jitdylib. Enzyme will also have an issue here. The idea is to have good debug info. We maybe need a replacement API that does the linking tricks |
src/aotcompile.cpp
Outdated
| // now add it to our compilation results | ||
| jl_code_instance_t *codeinst = (jl_code_instance_t*)item; | ||
|
|
||
| // TODO: check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "TODO: check" comments are places where I wanted to return to make sure the behaviour is unchanged from upstream. In this case I need to make sure I didn't cause a regression from aab7490: if it's okay to skip JL_CI_FLAGS_FROM_IMAGE code instances here and emit them later, in emit_always_inline.
|
RE: @maleadt
Yes, but I added a new flag in The fix for AllocCheck.jl is to make the following change to GPUCompiler.jl: diff --git i/src/jlgen.jl w/src/jlgen.jl
index 2812a02..ffae1d5 100644
--- i/src/jlgen.jl
+++ w/src/jlgen.jl
@@ -752,6 +752,9 @@ function compile_method_instance(@nospecialize(job::CompilerJob))
if v"1.12.0-DEV.2126" <= VERSION < v"1.13-" || VERSION >= v"1.13.0-DEV.285"
cgparams = (; force_emit_all = true , cgparams...)
end
+ if v"1.14.0-DEV.1688" <= VERSION
+ cgparams = (; unique_names = true, cgparams...)
+ end
params = Base.CodegenParams(; cgparams...)
# generate IRThat makes |
| } | ||
|
|
||
| static jl_llvm_functions_t jl_emit_oc_wrapper(orc::ThreadSafeModule &m, jl_codegen_params_t ¶ms, jl_method_instance_t *mi, jl_value_t *rettype) | ||
| // TODO: handle jl_invoke_type properly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds important, although I don't know what jl_invoke_type is (jl_invoke_api_t?)
Should this be fixed up, or elaborated so that it's clearer what's left TODO?
| # define jl_unreachable() __builtin_unreachable() | ||
| #else | ||
| # define jl_unreachable() ((void)jl_assume(0)) | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want this to depend on JL_NDEBUG and be more like assert(false) in debug builds, right?
P.S. I have wanted this for ages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't disagree, but all I did was move it from the bottom of this file :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@nanosoldier |
576b136 to
13e2328
Compare
|
13e2328 fixes a bug that would cause bootstrapping to fail maybe 1/20 times, but I'd appreciate feedback on how I fixed it. This is an overview of the life cycle of a CodeInstance, post-local-names-linking (to be cleaned up and added to devdocs, eventually):
13e2328 fixed a bug that was rather painful to find:
I don't love the solution of deleting entries from |
When JL_NDEBUG is undefined, we should use these as assertions. Suggested by @topolarity in JuliaLang#60031 (comment).
When JL_NDEBUG is undefined, we should use these as assertions. Suggested by @topolarity in JuliaLang#60031 (comment).
Overview
This PR overhauls the way linking works in Julia, both in the JIT and AOT. The point is to enable us to generate LLVM IR that depends only on the source IR, eliminating both nondeterminism and statefulness. This serves two purposes. First, if the IR is predictable, we can cache compile objects using the bitcode hash as a key, like how the ThinLTO cache works. #58592 was an early experiment along these lines. Second, we can reuse work that was done in a previous session, like pkgimages, but for the JIT.
We accomplish this by generating names that are unique only within the current LLVM module, removing most uses of the
globalUniqueGeneratedNamescounter. The replacement forjl_codegen_params_t,jl_codegen_output_t, represents a Julia "translation unit", and tracks the information we'll need to link the compiled module into the running session. When linking, we manipulate the JITLink LinkGraph (after compilation) instead of renaming functions in the LLVM IR (before).Example
Nightly:
Diff after this PR. Notice how each symbol gets the lowest possible integer suffix that will make it unique to the module, and how the two specializations for
fooget different names:List of changes
Many sources of statefulness and nondeterminism in the emitted LLVM IR have been eliminated, namely:
jl_codeinst_params_thas becomejl_codegen_output_t. It now represents one Julia "translation unit". More than one CodeInstance can be emitted to the samejl_codegen_output_t, if desired, though in the JIT every CI gets its own right now. One motivation behind this is to allow us to emit code on multiple threads and avoid the bitcode serialize/deserialize step we currently do, if that proves worthwhile.When we are done emitting to a
jl_codegen_output_t, we call.finish(), which discards the intermediate state and returns only the LLVM module and the info needed for linking (jl_linker_info_t).The new
JLMaterializationUnitwraps emitting Julia LLVM modules and the associatedjl_linker_info_t. It informs ORC that we can materialize symbols for the CIs defined by that output, and picks globally unique names for them. When it is materialized, it resolves all the call targets and generates trampolines for CodeInstances that are invoked but have the wrong calling convention, or are not yet compiled.We now postpone linking decisions to after codegen whenever possible. For example,
emit_invokeno longer tries to find a compiled version of the CodeInstance, and it no longer generates trampolines to adapt calling conventions.jl_analyze_workqueue's job has been absorbed intoJuliaOJIT::linkOutput.Some
image_codegendifferences have been removed:In
jl_emit_native_impl, emit every CodeInstance into onejl_codegen_output_t. We now defer the creation of thellvm::Linkerfor llvmcalls, which has construction cost that grows with the size of the destination module, until the very end.RTDyld is removed completely, since we cannot control linking like we can with JITLink. Since Add JLJITLinkMemoryManager (ports memory manager to JITLink) #60105, platforms that previous used the optimized memory manager now use the new one.
General refactoring
jl_callingconv_tenum fromstaticdata.cintojl_invoke_api_tand use it in more places. There is one enumerator for each specialjl_callptr_tfunction that can go in a CodeInstance'sinvokefield, as well as one that indicates an invoke wrapper should be there. There is a convenience function for reading an invoke pointer and getting the API type, and vice versa.Function *or ORC string pool entries when possible.Future work
DLSymOptimizershould be mostly removed, in favour of emitting raw ccalls and redirecting them to the appropriate target during linking.We should support ahead-of-time linking multiple
jl_codegen_output_ts together, in order to parallelize LLVM IR emission when compiling a system image.We still pass strings to
emit_call_specfun_other, even though the prototype for the function is now created byjl_codegen_output_t::get_call_target. We should hold on to the calling convention info so it doesn't have to be recomputed.