Fix ODR violation due to const-mismatch with eigens gemm definition. #8049

cptspacemanspiff · 2025-01-30T11:43:40Z

Issue:

When I have been compiling executorch with LTO enabled (linktime-optimization) I ran into warnings saying that there was an ODR violation. ODR violations scare me, so looking at it, it seems to be caused by a const mismatch between the eigen declaration of the BLAS operation and the CPUBlas.cpp forward declaration.

In kernals/optimized/blas/CPUBlass.cpp

#ifdef ET_BUILD_WITH_BLAS
#ifdef ET_BUILD_FOR_APPLE
#include <Accelerate/Accelerate.h>
#else
// clang-format off
extern "C" void dgemm_(char *transa, char *transb, int *m, int *n, int *k, double *alpha, const double *a, int *lda, const double *b, int *ldb, double *beta, double *c, int *ldc);
extern "C" void sgemm_(char *transa, char *transb, int *m, int *n, int *k, float *alpha, const float *a, int *lda, const float *b, int *ldb, float *beta, float *c, int *ldc);
// clang-format on
#endif // ET_BUILD_FOR_APPLE
#endif // ET_BUILD_WITH_BLAS

In kernels/optimized/third-party/eigen/blas/level3_impl.h (they have a macro that generates the s/d precisions.)

EIGEN_BLAS_FUNC(gemm)
(const char *opa, const char *opb, const int *m, const int *n, const int *k, const RealScalar *palpha,
 const RealScalar *pa, const int *lda, const RealScalar *pb, const int *ldb, const RealScalar *pbeta, RealScalar *pc,
 const int *ldc)

Since the forward declaration is inside a cpp compilation unit, this error usually does not show, but LTO allows the linker to introspect across compilation units for optimizations.

It seems like it was probably a typo/manual copy paste error, to my understanding BLAS apis are pretty standardized, and it does not make sense to have your BLAS operator move your input pointers.

It also probably has not broken things (aside from possibly lost optimization opportunity)?

That being said ODR violations and undefined behavior in general scare me.

System:

FWIW, This is on ubuntu-22.04, x86-64 w/ zen3 cpu, but is probably there on all non-apple machines.

…on and extern-c declaration.

pytorch-bot · 2025-01-30T11:43:44Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8049

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 94decbc with merge base 21ec3a1 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…inition

manuelcandales · 2025-01-30T20:47:59Z

cc @swolchok

swolchok · 2025-01-30T21:40:03Z

kernels/optimized/blas/CPUBlas.cpp

 // clang-format off
-extern "C" void dgemm_(char *transa, char *transb, int *m, int *n, int *k, double *alpha, const double *a, int *lda, const double *b, int *ldb, double *beta, double *c, int *ldc);
-extern "C" void sgemm_(char *transa, char *transb, int *m, int *n, int *k, float *alpha, const float *a, int *lda, const float *b, int *ldb, float *beta, float *c, int *ldc);
+extern "C" void dgemm_(const char *transa, const char *transb, const int *m, const int *n, const int *k, const double *alpha, const double *a, const int *lda, const double *b, const int *ldb, const double *beta, double *c, const int *ldc);


the existing code matches PyTorch core (https://github.com/pytorch/pytorch/blob/9f9904172d5b94e6cd06d915e15721d6e8868b7a/aten/src/ATen/native/CPUBlas.cpp#L19), so I'm hesitant to change this without an explanation of why it is OK in PyTorch core (which supports Eigen as the BLAS implementation) but not here. In particular, C linkage should mean that the name of the function is the only thing that matters. can you include the specific build command you are running and error you are seeing?

This error is happening at linking, only occurs with LTO (as in building everything with LTO enabled)

&& /usr/bin/c++ -flto=auto -fno-fat-lto-objects -Wl,--whole-archive /home/nlong/execu-tools/cpp/build/stage/lib/libexecutorch.a -Wl,--no-whole-archive -Wl,--whole-archive /home/nlong/execu-tools/cpp/build/stage/lib/liboptimized_native_cpu_ops_lib.a -Wl,--no-whole-archive test/CMakeFiles/ExecuToolsEncoderDecoder.dir/manual_encoder_decoder_runner.cpp.o -o stage/bin/ExecuToolsEncoderDecoder stage/lib/libExecuTools.a stage/lib/liboptimized_native_cpu_ops_lib.a stage/lib/liboptimized_kernels.a stage/lib/libportable_kernels.a stage/lib/libextension_module_static.a stage/lib/libextension_data_loader.a stage/lib/libextension_tensor.a stage/lib/libetdump.a _deps/executorch/third-party/flatcc/lib/libflatccrt.a stage/lib/libtokenizers_cpp.a _deps/tokenizers-cpp-build/release/libtokenizers_c.a stage/lib/libsentencepiece.a stage/lib/libcpublas.a stage/lib/libeigen_blas.a stage/lib/libextension_threadpool.a stage/lib/libcpuinfo.a stage/lib/libpthreadpool.a stage/lib/libexecutorch.a stage/lib/libexecutorch_core.a -ldl && : /home/nlong/execu-tools/cpp/build/_deps/executorch/kernels/optimized/blas/CPUBlas.cpp:19:17: warning: ‘sgemm_’ violates the C++ One Definition Rule [-Wodr] 19 | extern "C" void sgemm_(char *transa, char *transb, int *m, int *n, int *k, float *alpha, const float *a, int *lda, const float *b, int *ldb, float *beta, float *c, int *ldc); | ^ /home/nlong/execu-tools/cpp/build/_deps/executorch/kernels/optimized/third-party/eigen/blas/level3_impl.h:12:1: note: type mismatch in parameter 1 12 | EIGEN_BLAS_FUNC(gemm) | ^ /home/nlong/execu-tools/cpp/build/_deps/executorch/kernels/optimized/third-party/eigen/blas/level3_impl.h:12:1: note: type ‘const char’ should match type ‘char’ /home/nlong/execu-tools/cpp/build/_deps/executorch/kernels/optimized/third-party/eigen/blas/level3_impl.h:12:1: note: ‘sgemm_’ was previously declared here /home/nlong/execu-tools/cpp/build/_deps/executorch/kernels/optimized/blas/CPUBlas.cpp:18:17: warning: ‘dgemm_’ violates the C++ One Definition Rule [-Wodr] 18 | extern "C" void dgemm_(char *transa, char *transb, int *m, int *n, int *k, double *alpha, const double *a, int *lda, const double *b, int *ldb, double *beta, double *c, int *ldc); | ^ /home/nlong/execu-tools/cpp/build/_deps/executorch/kernels/optimized/third-party/eigen/blas/level3_impl.h:12:1: note: type mismatch in parameter 1 12 | EIGEN_BLAS_FUNC(gemm) | ^ /home/nlong/execu-tools/cpp/build/_deps/executorch/kernels/optimized/third-party/eigen/blas/level3_impl.h:12:1: note: type ‘const char’ should match type ‘char’ /home/nlong/execu-tools/cpp/build/_deps/executorch/kernels/optimized/third-party/eigen/blas/level3_impl.h:12:1: note: ‘dgemm_’ was previously declared here

This is the command to clean build executorch off of the main branch w/ LTO enabled, which should reproduce the error:

(from the executorch root)

mkdir cmake_out_lto_build && cd cmake_out_lto_build cmake .. -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON -DCMAKE_POLICY_DEFAULT_CMP0069=NEW -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON make -j 12

In terms of C linkage not exposing this:

(Note: this is to my best understanding...)

This only shows up because of link time optimization, which rather than compiling each static library/compilation unit to bytecode, it compiles them to an IR representation (Clang uses llvm IR, I am not sure what gcc uses). These are all bundled together during link time and an additional compilation pass is ran there, and then the bytecode is actually produced.

This allows for whole program optimization across compilation units, for instance if one of our gemm operators just did a no-op, LTO would optimize the whole call out.

This is why during link time this ODR issue shows up, the linker is looking at the IR representation of the whole program and saying wait these two definitions don't match.

This would not show up if eigen was not compiled with LTO, and was just a regular static library, the linker would fall back to the regular C linkage, and have no idea about the internals.

random thoughts

I cannot comment about pytorch, but I guess I am assuming that they don't do whole program builds with LTO enabled, in which case this would have never appeared? LTO is newish CMAKE added support for non-intel compilers in ~2017?

But depending on how the build is setup, to get LTO working, everything needs to be a static lib, and you need to be able to control the build flags of all artifacts. It's much easier to setup if you're building in tree.

I doubt that this particular ODR violation breaks things, because we are essentially calling with const parameters through a non-const wrapper. But even without the LTO optimization pass, this is probably non-ideal, because somewhere the compiler has to check did these pointers move, or more likely they just get dropped. IDK.

sidenote:

I am kind of curious if enabling LTO in one of the examples affects performance...

Additional:

Looks like pytorch people are looking at enabling LTO (for binary size savings mainly it seems)?pytorch/pytorch#93955

My guess is that they have not run into this yet, but it is probably a bug on their end too (if my understanding that the gemm operators are a standard API).

these two definitions don't match

Where is the second definition? The thing on line 19 of CPUBlas.cpp is a declaration, not a definition.

What compiler version is this? I see that it's only a warning (-Wodr), and perhaps the warning is just wrong.

What compiler version

No repro on Mac with Xcode 16.2, FWIW.

Where is the second definition? The thing on line 19 of CPUBlas.cpp is a declaration, not a definition.

Sorry, that should have been declaration

No repro on Mac with Xcode 16.2, FWIW.

It should not happen on a mac, because they use the accelerate library instead which seems to provide the declaration headers (they use const in theirs FWIW)

I was using gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04), but will try with clang:

I just rebuilt with clang-18, and it is not there, so maybe it is just a compiler warning issue.

Sorry, that should have been declaration

my point here was that you can't break the One Definition Rule with a declaration, hence the suspicion of the compiler

shoumikhin · 2025-03-17T03:32:00Z

@cptspacemanspiff any updates?
@swolchok how do we want to proceed with this change?

cptspacemanspiff · 2025-03-17T03:44:33Z

So, it is the same error in the gcc-14, an odr warning, because of mismatch between declarations.

Clang does not have the warning because the clang IR representation for LTO does not contain the full function type signature of the declarations to compare. so clang does not have the ability to catch this mismatch (according to a stack overflow article I read somewhere a month ago.)

My view:
All this being said, it does not cause issues, and apparently mirrors pytorch, so if they get annoyed with the warning message if/when enabling LTO builds, probably just fix the issue then.

I am assuming that it was a typo/forgot the const because the gemm function def is pretty well standardized, and since the only way you would notice is if you enable LTO, in gcc, and look at the warnings.

byjlw · 2025-03-31T17:17:22Z

So, it is the same error in the gcc-14, an odr warning, because of mismatch between declarations.

Clang does not have the warning because the clang IR representation for LTO does not contain the full function type signature of the declarations to compare. so clang does not have the ability to catch this mismatch (according to a stack overflow article I read somewhere a month ago.)

My view: All this being said, it does not cause issues, and apparently mirrors pytorch, so if they get annoyed with the warning message if/when enabling LTO builds, probably just fix the issue then.

I am assuming that it was a typo/forgot the const because the gemm function def is pretty well standardized, and since the only way you would notice is if you enable LTO, in gcc, and look at the warnings.

What's the plan moving forward on this?

swolchok · 2025-04-01T02:08:54Z

What's the plan moving forward on this?

As I replied on Discord on 3/25: my impression was that all parties concerned were happy not taking further action for now

larryliu0820 · 2025-05-28T23:00:00Z

Should we close this if no further action?

Fix ODR violation due to const-mismatch between std gemm implementati…

84be313

…on and extern-c declaration.

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 30, 2025

manuelcandales added the release notes: ops & kernels Changes to the opset and any new / changed kernel implementations label Jan 30, 2025

Merge branch 'main' into fix-odr-violation-due-to-mismatched-gemm-def…

94decbc

…inition

manuelcandales added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jan 30, 2025

manuelcandales requested a review from swolchok January 30, 2025 20:48

manuelcandales assigned swolchok Jan 30, 2025

swolchok suggested changes Jan 30, 2025

View reviewed changes

larryliu0820 closed this May 28, 2025

Fix ODR violation due to const-mismatch with eigens gemm definition. #8049

Fix ODR violation due to const-mismatch with eigens gemm definition. #8049

Uh oh!

Conversation

cptspacemanspiff commented Jan 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue:

System:

Uh oh!

pytorch-bot bot commented Jan 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8049

✅ No Failures

Uh oh!

manuelcandales commented Jan 30, 2025

Uh oh!

swolchok Jan 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cptspacemanspiff Jan 30, 2025

Choose a reason for hiding this comment

In terms of C linkage not exposing this:

random thoughts

sidenote:

Uh oh!

cptspacemanspiff Jan 30, 2025

Choose a reason for hiding this comment

Additional:

Uh oh!

swolchok Jan 31, 2025

Choose a reason for hiding this comment

Uh oh!

swolchok Jan 31, 2025

Choose a reason for hiding this comment

Uh oh!

cptspacemanspiff Jan 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

swolchok Jan 31, 2025

Choose a reason for hiding this comment

Uh oh!

shoumikhin commented Mar 17, 2025

Uh oh!

cptspacemanspiff commented Mar 17, 2025

Uh oh!

byjlw commented Mar 31, 2025

Uh oh!

swolchok commented Apr 1, 2025

Uh oh!

larryliu0820 commented May 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

cptspacemanspiff commented Jan 30, 2025 •

edited

Loading

pytorch-bot bot commented Jan 30, 2025 •

edited

Loading

swolchok Jan 30, 2025 •

edited

Loading

cptspacemanspiff Jan 31, 2025 •

edited

Loading