-
Notifications
You must be signed in to change notification settings - Fork 747
Fix ODR violation due to const-mismatch with eigens gemm definition. #8049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
cptspacemanspiff
wants to merge
2
commits into
pytorch:main
from
cptspacemanspiff:fix-odr-violation-due-to-mismatched-gemm-definition
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the existing code matches PyTorch core (https://github.com/pytorch/pytorch/blob/9f9904172d5b94e6cd06d915e15721d6e8868b7a/aten/src/ATen/native/CPUBlas.cpp#L19), so I'm hesitant to change this without an explanation of why it is OK in PyTorch core (which supports Eigen as the BLAS implementation) but not here. In particular, C linkage should mean that the name of the function is the only thing that matters. can you include the specific build command you are running and error you are seeing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is happening at linking, only occurs with LTO (as in building everything with LTO enabled)
This is the command to clean build executorch off of the main branch w/ LTO enabled, which should reproduce the error:
(from the executorch root)
In terms of C linkage not exposing this:
(Note: this is to my best understanding...)
This only shows up because of link time optimization, which rather than compiling each static library/compilation unit to bytecode, it compiles them to an IR representation (Clang uses llvm IR, I am not sure what gcc uses). These are all bundled together during link time and an additional compilation pass is ran there, and then the bytecode is actually produced.
This allows for whole program optimization across compilation units, for instance if one of our gemm operators just did a no-op, LTO would optimize the whole call out.
This is why during link time this ODR issue shows up, the linker is looking at the IR representation of the whole program and saying wait these two definitions don't match.
This would not show up if eigen was not compiled with LTO, and was just a regular static library, the linker would fall back to the regular C linkage, and have no idea about the internals.
random thoughts
I cannot comment about pytorch, but I guess I am assuming that they don't do whole program builds with LTO enabled, in which case this would have never appeared? LTO is newish CMAKE added support for non-intel compilers in ~2017?
But depending on how the build is setup, to get LTO working, everything needs to be a static lib, and you need to be able to control the build flags of all artifacts. It's much easier to setup if you're building in tree.
I doubt that this particular ODR violation breaks things, because we are essentially calling with const parameters through a non-const wrapper. But even without the LTO optimization pass, this is probably non-ideal, because somewhere the compiler has to check
did these pointers move, or more likely they just get dropped. IDK.sidenote:
I am kind of curious if enabling LTO in one of the examples affects performance...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional:
Looks like pytorch people are looking at enabling LTO (for binary size savings mainly it seems)?pytorch/pytorch#93955
My guess is that they have not run into this yet, but it is probably a bug on their end too (if my understanding that the gemm operators are a standard API).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the second definition? The thing on line 19 of CPUBlas.cpp is a declaration, not a definition.
What compiler version is this? I see that it's only a warning (-Wodr), and perhaps the warning is just wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No repro on Mac with Xcode 16.2, FWIW.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that should have been declaration
It should not happen on a mac, because they use the accelerate library instead which seems to provide the declaration headers (they use const in theirs FWIW)
I was using gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04), but will try with clang:
I just rebuilt with clang-18, and it is not there, so maybe it is just a compiler warning issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my point here was that you can't break the One Definition Rule with a declaration, hence the suspicion of the compiler