-
Notifications
You must be signed in to change notification settings - Fork 39
Description
Helgrind runs fail randomly on concurrent critnib write and read. Possibly related to the issue #886.
Examples of failed jobs:
https://github.com/oneapi-src/unified-memory-framework/actions/runs/11481262621/job/31999571534
https://github.com/oneapi-src/unified-memory-framework/actions/runs/11719465368/job/32644634889
Environment Information
- UMF version (hash commit or a tag): main
- OS(es) version(s): ubuntu-22.04
- kernel version(s): 6.5.0
- compiler, libraries, and other related tools version(s): cc 11.4
Please provide a reproduction of the bug:
Although failures seem to occur randomly throughout different PRs from time to time, there is one branch were the issue occurres always. Here is the branch that might be useful for debugging: https://github.com/PatKamin/unified-memory-framework/tree/helgrind-bug
Actual behavior:
==5747== ---Thread-Announcement------------------------------------------
==5747==
==5747== Thread #1 is the program's root thread
==5747==
==5747== ----------------------------------------------------------------
==5747==
==5747== Thread #1: pthread_rwlock_{rd,rw}lock with a pthread_mutex_t* argument
==5747== at 0x4852514: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==5747== by 0x15B0E7: std::__glibcxx_rwlock_wrlock(pthread_rwlock_t*) (shared_mutex:80)
==5747== by 0x15B13B: std::__shared_mutex_pthread::lock() (shared_mutex:193)
==5747== by 0x15B1E7: std::shared_timed_mutex::lock() (shared_mutex:473)
==5747== by 0x15C13F: std::lock_guard<std::shared_timed_mutex>::lock_guard(std::shared_timed_mutex&) (std_mutex.h:229)
==5747== by 0x157846: Slab::regSlabByAddr(void*, Slab&) (pool_disjoint.cpp:535)
==5747== by 0x157AD4: Slab::regSlab(Slab&) (pool_disjoint.cpp:563)
==5747== by 0x157348: Slab::Slab(Bucket&) (pool_disjoint.cpp:463)
==5747== by 0x15C[371](https://github.com/oneapi-src/unified-memory-framework/actions/runs/11481262621/job/31999571534#step:8:372): std::_MakeUniq<Slab>::__single_object std::make_unique<Slab, Bucket&>(Bucket&) (unique_ptr.h:962)
==5747== by 0x157E1C: Bucket::getAvailFullSlab(bool&) (pool_disjoint.cpp:617)
==5747== by 0x157F59: Bucket::getSlab(bool&) (pool_disjoint.cpp:631)
==5747== by 0x1590E5: DisjointPool::AllocImpl::allocate(unsigned long, bool&) (pool_disjoint.cpp:853)
==5747==
{
<insert_a_suppression_name_here>
Helgrind:Misc
obj:/usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so
fun:_ZStL23__glibcxx_rwlock_wrlockP16pthread_rwlock_t
fun:_ZNSt22__shared_mutex_pthread4lockEv
fun:_ZNSt18shared_timed_mutex4lockEv
fun:_ZNSt10lock_guardISt18shared_timed_mutexEC1ERS0_
fun:_ZN4Slab13regSlabByAddrEPvRS_
fun:_ZN4Slab7regSlabERS_
fun:_ZN4SlabC1ER6Bucket
fun:_ZSt11make_uniqueI4SlabJR6BucketEENSt9_MakeUniqIT_E15__single_objectEDpOT0_
fun:_ZN6Bucket16getAvailFullSlabERb
fun:_ZN6Bucket7getSlabERb
fun:_ZN12DisjointPool9AllocImpl8allocateEmRb
}
==5747== ----------------------------------------------------------------
==5747==
==5747== Thread #1: pthread_rwlock_unlock with a pthread_mutex_t* argument
==5747== at 0x4852CD6: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==5747== by 0x15B116: std::__glibcxx_rwlock_unlock(pthread_rwlock_t*) (shared_mutex:82)
==5747== by 0x15B16D: std::__shared_mutex_pthread::unlock() (shared_mutex:213)
==5747== by 0x15B207: std::shared_timed_mutex::unlock() (shared_mutex:475)
==5747== by 0x15C162: std::lock_guard<std::shared_timed_mutex>::~lock_guard() (std_mutex.h:235)
==5747== by 0x15787C: Slab::regSlabByAddr(void*, Slab&) (pool_disjoint.cpp:537)
==5747== by 0x157AD4: Slab::regSlab(Slab&) (pool_disjoint.cpp:563)
==5747== by 0x157348: Slab::Slab(Bucket&) (pool_disjoint.cpp:463)
==5747== by 0x15C371: std::_MakeUniq<Slab>::__single_object std::make_unique<Slab, Bucket&>(Bucket&) (unique_ptr.h:962)
==5747== by 0x157E1C: Bucket::getAvailFullSlab(bool&) (pool_disjoint.cpp:617)
==5747== by 0x157F59: Bucket::getSlab(bool&) (pool_disjoint.cpp:631)
==5747== by 0x1590E5: DisjointPool::AllocImpl::allocate(unsigned long, bool&) (pool_disjoint.cpp:853)
==5747==
{
<insert_a_suppression_name_here>
Helgrind:Misc
obj:/usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so
fun:_ZStL23__glibcxx_rwlock_unlockP16pthread_rwlock_t
fun:_ZNSt22__shared_mutex_pthread6unlockEv
fun:_ZNSt18shared_timed_mutex6unlockEv
fun:_ZNSt10lock_guardISt18shared_timed_mutexED1Ev
fun:_ZN4Slab13regSlabByAddrEPvRS_
fun:_ZN4Slab7regSlabERS_
fun:_ZN4SlabC1ER6Bucket
fun:_ZSt11make_uniqueI4SlabJR6BucketEENSt9_MakeUniqIT_E15__single_objectEDpOT0_
fun:_ZN6Bucket16getAvailFullSlabERb
fun:_ZN6Bucket7getSlabERb
fun:_ZN12DisjointPool9AllocImpl8allocateEmRb
}
==5747== ----------------------------------------------------------------
==5747==
==5747== Thread #1: pthread_rwlock_{rd,rw}lock with a pthread_mutex_t* argument
==5747== at 0x4852704: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==5747== by 0x15B0B8: std::__glibcxx_rwlock_rdlock(pthread_rwlock_t*) (shared_mutex:78)
==5747== by 0x15B18F: std::__shared_mutex_pthread::lock_shared() (shared_mutex:229)
==5747== by 0x15B227: std::shared_timed_mutex::lock_shared() (shared_mutex:489)
==5747== by 0x15C6D8: std::shared_lock<std::shared_timed_mutex>::shared_lock(std::shared_timed_mutex&) (shared_mutex:727)
==5747== by 0x15965B: DisjointPool::AllocImpl::deallocate(void*, bool&) (pool_disjoint.cpp:960)
==5747== by 0x159E95: DisjointPool::free(void*) (pool_disjoint.cpp:1081)
==5747== by 0x15DB31: _ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENKUlPvDpT_E5_clIJS4_EEEDaS4_S6_ (cpp_helpers.hpp:77)
==5747== by 0x15FEFA: _ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENUlPvDpT_E5_4_FUNIJS4_EEEDTcldtdeLKPKS7_0EonclIJS6_EEscOS4_fp_spscOS5_fp0_EES4_S6_ (cpp_helpers.hpp:77)
==5747== by 0x162B84: umfPoolFree (memory_pool.c:171)
==5747== by 0x1549E9: free_memory(umf_memory_pool_t*, void**) (malloc_compliance_tests.cpp:40)
==5747== by 0x154F03: malloc_compliance_test(umf_memory_pool_t*) (malloc_compliance_tests.cpp:76)
==5747==
{
<insert_a_suppression_name_here>
Helgrind:Misc
obj:/usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so
fun:_ZStL23__glibcxx_rwlock_rdlockP16pthread_rwlock_t
fun:_ZNSt22__shared_mutex_pthread11lock_sharedEv
fun:_ZNSt18shared_timed_mutex11lock_sharedEv
fun:_ZNSt11shared_lockISt18shared_timed_mutexEC1ERS0_
fun:_ZN12DisjointPool9AllocImpl10deallocateEPvRb
fun:_ZN12DisjointPool4freeEPv
fun:_ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENKUlPvDpT_E5_clIJS4_EEEDaS4_S6_
fun:_ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENUlPvDpT_E5_4_FUNIJS4_EEEDTcldtdeLKPKS7_0EonclIJS6_EEscOS4_fp_spscOS5_fp0_EES4_S6_
fun:umfPoolFree
fun:_ZL11free_memoryP17umf_memory_pool_tPPv
fun:_Z22malloc_compliance_testP17umf_memory_pool_t
}
Helgrind: hg_main.c:308 (lockN_acquire_reader): Assertion 'lk->kind == LK_rdwr' failed.
host stacktrace:
==5747== at 0x58026E7A: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747== by 0x58026FA7: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747== by 0x5802714B: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747== by 0x58006519: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747== by 0x5800E5CA: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747== by 0x5800EB6A: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747== by 0x58038[374](https://github.com/oneapi-src/unified-memory-framework/actions/runs/11481262621/job/31999571534#step:8:375): ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747== by 0x5807F46A: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747== by 0x580C9E70: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
sched status:
running_tid=1
Thread 1: status = VgTs_Runnable (lwpid 5747)
==5747== at 0x48527CF: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==5747== by 0x15B0B8: std::__glibcxx_rwlock_rdlock(pthread_rwlock_t*) (shared_mutex:78)
==5747== by 0x15B18F: std::__shared_mutex_pthread::lock_shared() (shared_mutex:229)
==5747== by 0x15B227: std::shared_timed_mutex::lock_shared() (shared_mutex:489)
==5747== by 0x15C6D8: std::shared_lock<std::shared_timed_mutex>::shared_lock(std::shared_timed_mutex&) (shared_mutex:727)
==5747== by 0x15965B: DisjointPool::AllocImpl::deallocate(void*, bool&) (pool_disjoint.cpp:960)
==5747== by 0x159E95: DisjointPool::free(void*) (pool_disjoint.cpp:1081)
==5747== by 0x15DB31: _ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENKUlPvDpT_E5_clIJS4_EEEDaS4_S6_ (cpp_helpers.hpp:77)
==5747== by 0x15FEFA: _ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENUlPvDpT_E5_4_FUNIJS4_EEEDTcldtdeLKPKS7_0EonclIJS6_EEscOS4_fp_spscOS5_fp0_EES4_S6_ (cpp_helpers.hpp:77)
==5747== by 0x162B84: umfPoolFree (memory_pool.c:171)
==5747== by 0x1549E9: free_memory(umf_memory_pool_t*, void**) (malloc_compliance_tests.cpp:40)
==5747== by 0x154F03: malloc_compliance_test(umf_memory_pool_t*) (malloc_compliance_tests.cpp:76)
==5747== by 0x11E661: umfPoolTest_malloc_compliance_Test::TestBody() (poolFixtures.hpp:439)
==5747== by 0x1BBABE: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2638)
==5747== by 0x1B3DBC: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2674)
==5747== by 0x18D3FB: testing::Test::Run() (gtest.cc:2713)
==5747== by 0x18DE8B: testing::TestInfo::Run() (gtest.cc:2859)
==5747== by 0x18E829: testing::TestSuite::Run() (gtest.cc:3037)
==5747== by 0x19F19D: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5967)
==5747== by 0x1BCBE1: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2638)
==5747== by 0x1B5062: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2674)
==5747== by 0x19D7AB: testing::UnitTest::Run() (gtest.cc:5546)
==5747== by 0x162258: RUN_ALL_TESTS() (gtest.h:2334)
==5747== by 0x1621D1: main (gtest_main.cc:64)
client stack range: [0x1FFEFFA000 0x1FFF000FFF] client SP: 0x1FFEFFE6F0
valgrind stack range: [0x100328E000 0x100338DFFF] top usage: 18984 of 1048576
How often bug is revealed:
often, in some cases always
(always, often, rare)
Details
Most probably a proper valgrind instrumentation needs to be added to critnib.
Additional information about Priority and Help Requested:
Are you willing to submit a pull request with a proposed change? (Yes, No) No
Requested priority: (Showstopper, High, Medium, Low) Medium