__llvm_profile_counter_bias is undefined Error in Large CUDA-related .so with PGO Continuation Mode #118019

@xuesu

Hi,

I've been struggling with an issue that has consumed me for nearly a year. I am building some large CUDA-related shared libraries (.so) with PGO in continuous mode, and I consistently hit the error reported by the following check in the profile runtime:

if (UseBiasVar && BiasAddr == BiasDefaultAddr) {
    PROF_ERR("%s\n", "__llvm_profile_counter_bias is undefined");
    return;
}

Here:

  • BiasAddr = &__llvm_profile_counter_bias
  • BiasDefaultAddr = &__llvm_profile_counter_bias_default

From my understanding, __llvm_profile_counter_bias should only be assigned later, after this check has passed, inside mmapForContinuousMode.
From the code of Value *InstrProfiling::getCounterAddress(InstrProfInstBase *I), I assume that whenever counters are inserted, the compiler also inserts __llvm_profile_counter_bias, but only as a placeholder. Since the counter size exceeds 100k, I assume this insertion has already happened, so perhaps the symbol is lost afterwards because it is a weak global symbol? This seems similar to [this Rust issue](rust-lang/rust#120842).

However, PaddlePaddle's build system is too large, fragile, and complex for me to pinpoint exactly where this symbol is lost.
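One way to narrow the search might be to check whether every instrumented translation unit actually receives the relocation flag pair. The sketch below assumes the build can export a CMake-style compile_commands.json (I have not confirmed this for every PaddlePaddle target), and the file names in the sample are hypothetical. It only reports compile commands that carry -fprofile-instr-generate but lack -mllvm -runtime-counter-relocation. A missing flag is just one possible cause, not a confirmed one.

```python
import shlex


def missing_relocation_flag(entries):
    """Return source files that are instrumented with -fprofile-instr-generate
    but whose command line lacks the '-mllvm -runtime-counter-relocation' pair."""
    flagged = []
    for entry in entries:
        # compile_commands.json entries carry either "arguments" (a list)
        # or "command" (a single shell-quoted string).
        args = entry.get("arguments") or shlex.split(entry.get("command", ""))
        instrumented = any(a.startswith("-fprofile-instr-generate") for a in args)
        relocated = any(
            a == "-mllvm" and b == "-runtime-counter-relocation"
            for a, b in zip(args, args[1:])
        )
        if instrumented and not relocated:
            flagged.append(entry["file"])
    return flagged


# Hypothetical entries, for illustration only (not taken from the real build).
SAMPLE_ENTRIES = [
    {"file": "a.cc",
     "command": "clang++ -fprofile-instr-generate -fcoverage-mapping -g "
                "-mllvm -runtime-counter-relocation -c a.cc"},
    {"file": "b.cc",
     "arguments": ["clang++", "-fprofile-instr-generate",
                   "-fcoverage-mapping", "-g", "-c", "b.cc"]},
    {"file": "c.cu", "command": "nvcc -c c.cu"},  # not instrumented, ignored
]

print(missing_relocation_flag(SAMPLE_ENTRIES))
```

For a real build, the entries would come from json.load() on the exported compile_commands.json. The check matches the flag pair exactly as I pass it, so a variant such as -runtime-counter-relocation=true would need an extra case.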

Problem Details

The __llvm_profile_counter_bias symbol is consistently reported as undefined, and I cannot determine why. The error occurs on every run. Due to CUDA's limitations, I can only use LLVM 15. I also reviewed LLVM 19's source code, and the logic seems quite similar, so I suspect the root cause persists across these versions.

I am building the PaddlePaddle framework with the following flags:

  • -fprofile-instr-generate
  • -fcoverage-mapping
  • -g
  • -mllvm
  • -runtime-counter-relocation

The build produces four instrumented shared libraries (.so), which dlopen each other. The first three shared libraries behave as expected, but the last one (libpaddle.so/libphi.so) always reports this error:

dll = ctypes.CDLL("/Paddle/build/python/paddle/libs/libphi.so")
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 32480
LLVM Profile Error: LALALA: BiasAddr 133895675533552
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LALALA: mmapForContinuousMode 2
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 92448
LLVM Profile Error: LALALA: BiasAddr 133896026068704
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LALALA: mmapForContinuousMode 2
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 7777352
LLVM Profile Error: LALALA: BiasAddr 133896653571776
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LLVM Profile Error: __llvm_profile_counter_bias is undefined
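To make the pattern in this log easier to spot, here is a small sketch that groups the debug lines per initializeProfileForContinuousMode call and reports the calls where BiasAddr equals BiasDefaultAddr, which is the condition the runtime check rejects. The "LALALA" lines are my own debug prints, not standard runtime output, so the regular expression below assumes exactly the format shown above.

```python
import re

# Matches my ad-hoc debug lines, e.g.
# "LLVM Profile Error: LALALA: BiasAddr 133895675533552"
_FIELD = re.compile(r"LALALA: (CountersSize|BiasAddr|BiasDefaultAddr) (\d+)")


def find_unrelocated_modules(log_text):
    """Return one dict per initializeProfileForContinuousMode call whose
    BiasAddr equals BiasDefaultAddr."""
    records, current = [], None
    for line in log_text.splitlines():
        if "LALALA: initializeProfileForContinuousMode" in line:
            current = {}            # start a new record for this call
            records.append(current)
            continue
        match = _FIELD.search(line)
        if match and current is not None:
            current[match.group(1)] = int(match.group(2))
    return [r for r in records
            if "BiasAddr" in r and r["BiasAddr"] == r.get("BiasDefaultAddr")]


SAMPLE_LOG = """\
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 32480
LLVM Profile Error: LALALA: BiasAddr 133895675533552
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LALALA: mmapForContinuousMode 2
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 92448
LLVM Profile Error: LALALA: BiasAddr 133896026068704
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LALALA: mmapForContinuousMode 2
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 7777352
LLVM Profile Error: LALALA: BiasAddr 133896653571776
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LLVM Profile Error: __llvm_profile_counter_bias is undefined
"""

for record in find_unrelocated_modules(SAMPLE_LOG):
    print(record)
```

On this log, only the third call (CountersSize 7777352) is reported. The first two calls see a BiasAddr that differs from BiasDefaultAddr and go on to mmapForContinuousMode.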

I inspected the compiled .so files with nm and found the following symbol table entries:

libcommon.so:

00000000001298f0 B __llvm_profile_counter_bias_default
00000000001299d8 V __llvm_profile_filename
...

libpaddle.so:

00000000268d9c08 B __llvm_profile_counter_bias_default
                 U __llvm_profile_filename
...

libphi.so:

000000002566c6c0 B __llvm_profile_counter_bias_default
                 U __llvm_profile_filename
...

libphi_kernel_gpu.so:

0000000014e496e0 B __llvm_profile_counter_bias_default
                 U __llvm_profile_filename
...

As shown, __llvm_profile_counter_bias_default exists in all four shared libraries, but the actual __llvm_profile_counter_bias symbol is missing from each of them.
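The same inspection can be scripted so it is repeatable across all four libraries. This is a minimal sketch that only parses nm text in the format shown above (address, type letter, name, or just type letter and name for undefined symbols). The helper name bias_symbols is mine, and how the nm text is obtained is left open.

```python
def bias_symbols(nm_output):
    """Map each __llvm_profile_counter_bias* symbol found in nm output
    to its nm type letter (e.g. 'B', 'V', 'U')."""
    found = {}
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols have three fields (address, type, name);
        # undefined ones have two (type, name). Skip anything shorter.
        if len(parts) < 2:
            continue
        sym_type, name = parts[-2], parts[-1]
        if name.startswith("__llvm_profile_counter_bias"):
            found[name] = sym_type
    return found


# Excerpt of the libcommon.so listing from above.
SAMPLE_NM = """\
00000000001298f0 B __llvm_profile_counter_bias_default
00000000001299d8 V __llvm_profile_filename
...
"""

symbols = bias_symbols(SAMPLE_NM)
print(symbols)
print("bias symbol present:", "__llvm_profile_counter_bias" in symbols)
```

The input text could come from subprocess.run(["nm", path], capture_output=True, text=True).stdout. Note that the dynamic symbol table (nm -D) does not list symbols with hidden visibility, so the plain symbol table of an unstripped library is probably the more reliable place to look.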


Questions

  1. Can I safely ignore this check?

    if (UseBiasVar && BiasAddr == BiasDefaultAddr) {
        PROF_ERR("%s\n", "__llvm_profile_counter_bias is undefined");
        return;
    }

    Since mmapForContinuousMode should eventually initialize the __llvm_profile_counter_bias symbol, would ignoring this check resolve the issue?

  2. What scenarios might cause __llvm_profile_counter_bias to be undefined?

    • From my understanding, if counters are being inserted (over 100k in this case), the compiler should also have inserted __llvm_profile_counter_bias. Could the symbol be dropped during linking because it is a weak global symbol?

Any insights or hints on debugging this issue further would be greatly appreciated. Thank you!

    Labels

    PGO (Profile Guided Optimizations), cuda, question (A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead!)
