Skip to content

Conversation

@cambyzju
Copy link
Contributor

What problem does this PR solve?

In branch-2.0, we already refractor these codes in pr: #18590

Deadlock stack:
1、While load, we alloc MemTracker and need lock TrackerGroup.group_lock

   NodeChannel::NodeChannel
     _node_channel_tracker = std::make_shared<MemTracker>
       MemTracker::bind_parent
         std::lock_guard<std::mutex> l(mem_tracker_pool[_parent_group_num].group_lock);
         _tracker_group_it = mem_tracker_pool[_parent_group_num].trackers.insert(
                mem_tracker_pool[_parent_group_num].trackers.end(), this);

2、but while we try to call std::list::insert, we need alloc std::_List_node,here new_hook (in file tcmalloc_hook.h) is triggered, then we lock the same TrackerGroup.group_lock, make it deadlock

new_hook
doris::ThreadMemTrackerMgr::consume
  doris::ThreadMemTrackerMgr::flush_untracked_mem<true, true>
    doris::ThreadMemTrackerMgr::exceeded
      doris::MemTrackerLimiter::print_log_usage
         doris::MemTracker::make_group_snapshot
           std::lock_guard<std::mutex> l(mem_tracker_pool[group_num].group_lock);

Full stack info:

(gdb) bt
#0  0x00007f219772454d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f219771fe9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f219771fd68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000563ca7a8a824 in __gthread_mutex_lock (__mutex=0x563cb18cbc58)
    at /var/local/ldb-toolchain/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#4  std::mutex::lock (this=0x563cb18cbc58) at /var/local/ldb-toolchain/include/c++/11/bits/std_mutex.h:100
#5  std::lock_guard<std::mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /var/local/ldb-toolchain/include/c++/11/bits/std_mutex.h:229
#6  doris::MemTracker::make_group_snapshot (snapshots=0x7f1f7fe06ea0, group_num=<optimized out>, parent_label=...)
    at /data/TCHouse-D-1.2/be/src/runtime/memory/mem_tracker.cpp:115
#7  0x0000563ca7a7eeb4 in doris::MemTrackerLimiter::print_log_usage (this=0x5640ab7956c0, msg=...)
    at /data/TCHouse-D-1.2/be/src/runtime/memory/mem_tracker_limiter.cpp:198
#8  0x0000563ca7a8d1e4 in doris::ThreadMemTrackerMgr::exceeded (this=this@entry=0x563ce6a2c820, size=1048584)
    at /data/TCHouse-D-1.2/be/src/runtime/memory/thread_mem_tracker_mgr.cpp:59
#9  0x0000563ca78cfeb4 in doris::ThreadMemTrackerMgr::flush_untracked_mem<true, true> (this=0x563ce6a2c820)
    at /data/TCHouse-D-1.2/be/src/runtime/memory/thread_mem_tracker_mgr.h:223
#10 doris::ThreadMemTrackerMgr::consume (size=<optimized out>, this=0x563ce6a2c820)
    at /data/TCHouse-D-1.2/be/src/runtime/memory/thread_mem_tracker_mgr.h:188
#11 doris::ThreadMemTrackerMgr::consume (size=<optimized out>, this=0x563ce6a2c820)
    at /data/TCHouse-D-1.2/be/src/runtime/memory/thread_mem_tracker_mgr.h:178
#12 new_hook (ptr=<optimized out>, size=24) at /data/TCHouse-D-1.2/be/src/runtime/memory/tcmalloc_hook.h:39
#13 0x0000563caf3dfa78 in MallocHook::InvokeNewHookSlow (p=p@entry=0x56422d713b40, s=s@entry=24) at src/malloc_hook.cc:498
#14 0x0000563caf55f2c1 in MallocHook::InvokeNewHook (s=24, p=0x56422d713b40) at src/malloc_hook-inl.h:127
#15 tcmalloc::do_allocate_full<tcmalloc::cpp_throw_oom> (size=size@entry=24) at src/tcmalloc.cc:1805
#16 tcmalloc::allocate_full_cpp_throw_oom (size=size@entry=24) at src/tcmalloc.cc:1815
#17 0x0000563caf55f429 in tcmalloc::dispatch_allocate_full<tcmalloc::cpp_throw_oom> (size=24) at src/tcmalloc.cc:1822
#18 0x0000563ca7a8ab9a in __gnu_cxx::new_allocator<std::_List_node<doris::MemTracker*> >::allocate (__n=1, this=0x563cb18cbc40)
    at /var/local/ldb-toolchain/include/c++/11/ext/new_allocator.h:103
#19 std::allocator_traits<std::allocator<std::_List_node<doris::MemTracker*> > >::allocate (__n=1, __a=...)
    at /var/local/ldb-toolchain/include/c++/11/bits/alloc_traits.h:460
#20 std::__cxx11::_List_base<doris::MemTracker*, std::allocator<doris::MemTracker*> >::_M_get_node (this=0x563cb18cbc40)
    at /var/local/ldb-toolchain/include/c++/11/bits/stl_list.h:442
#21 std::__cxx11::list<doris::MemTracker*, std::allocator<doris::MemTracker*> >::_M_create_node<doris::MemTracker*> (this=0x563cb18cbc40)
    at /var/local/ldb-toolchain/include/c++/11/bits/stl_list.h:634
#22 std::__cxx11::list<doris::MemTracker*, std::allocator<doris::MemTracker*> >::emplace<doris::MemTracker*> (__position=..., this=0x563cb18cbc40)
    at /var/local/ldb-toolchain/include/c++/11/bits/list.tcc:92
#23 std::__cxx11::list<doris::MemTracker*, std::allocator<doris::MemTracker*> >::insert (__x=<optimized out>, __position=..., this=0x563cb18cbc40)
    at /var/local/ldb-toolchain/include/c++/11/bits/stl_list.h:1309
#24 doris::MemTracker::bind_parent (this=0x563eff46ad10, parent=<optimized out>) at /data/TCHouse-D-1.2/be/src/runtime/memory/mem_tracker.cpp:79
#25 0x0000563ca7a8b608 in doris::MemTracker::MemTracker (this=this@entry=0x563eff46ad10, label=..., parent=parent@entry=0x0)
    at /data/TCHouse-D-1.2/be/src/runtime/memory/mem_tracker.cpp:66
#26 0x0000563ca7967467 in __gnu_cxx::new_allocator<doris::MemTracker>::construct<doris::MemTracker, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__p=0x563eff46ad10, this=<optimized out>) at /var/local/ldb-toolchain/include/c++/11/ext/new_allocator.h:154
#27 std::allocator_traits<std::allocator<doris::MemTracker> >::construct<doris::MemTracker, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__p=0x563eff46ad10, __a=...) at /var/local/ldb-toolchain/include/c++/11/bits/alloc_traits.h:512
#28 std::_Sp_counted_ptr_inplace<doris::MemTracker, std::allocator<doris::MemTracker>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__a=..., this=0x563eff46ad00)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:519
#29 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<doris::MemTracker, std::allocator<doris::MemTracker>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__a=..., __p=<optimized out>, this=<optimized out>)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:650
#30 std::__shared_ptr<doris::MemTracker, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<doris::MemTracker>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__tag=..., this=<optimized out>)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1337
#31 std::shared_ptr<doris::MemTracker>::shared_ptr<std::allocator<doris::MemTracker>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__tag=..., this=<optimized out>) at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr.h:409
#32 std::allocate_shared<doris::MemTracker, std::allocator<doris::MemTracker>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__a=...) at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr.h:861
#33 std::make_shared<doris::MemTracker, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > ()
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr.h:877
#34 doris::stream_load::NodeChannel::NodeChannel (this=this@entry=0x563d0ca7e110, parent=<optimized out>, 
    index_channel=index_channel@entry=0x563d4b43f200, node_id=<optimized out>) at /data/TCHouse-D-1.2/be/src/exec/tablet_sink.cpp:49
#35 0x0000563cac74a82d in doris::stream_load::VNodeChannel::VNodeChannel (this=this@entry=0x563d0ca7e110, parent=<optimized out>, 
    index_channel=index_channel@entry=0x563d4b43f200, node_id=<optimized out>) at /data/TCHouse-D-1.2/be/src/vec/sink/vtablet_sink.cpp:37
#36 0x0000563ca7967eaa in __gnu_cxx::new_allocator<doris::stream_load::VNodeChannel>::construct<doris::stream_load::VNodeChannel, doris::stream_load::OlapTableSink*&, doris::stream_load::IndexChannel*, long&> (__p=0x563d0ca7e110, this=<optimized out>)
    at /var/local/ldb-toolchain/include/c++/11/ext/new_allocator.h:154
#37 std::allocator_traits<std::allocator<doris::stream_load::VNodeChannel> >::construct<doris::stream_load::VNodeChannel, doris::stream_load::OlapTableSink*&, doris::stream_load::IndexChannel*, long&> (__p=0x563d0ca7e110, __a=...) at /var/local/ldb-toolchain/include/c++/11/bits/alloc_traits.h:512
#38 std::_Sp_counted_ptr_inplace<doris::stream_load::VNodeChannel, std::allocator<doris::stream_load::VNodeChannel>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<doris::stream_load::OlapTableSink*&, doris::stream_load::IndexChannel*, long&> (__a=..., this=0x563d0ca7e100)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:519
#39 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<doris::stream_load::VNodeChannel, std::allocator<doris::stream_load::VNodeChannel>, doris::stream_load::OlapTableSink*&, doris::stream_load::IndexChannel*, long&> (__a=..., __p=<optimized out>, this=<optimized out>)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:650
#40 std::__shared_ptr<doris::stream_load::VNodeChannel, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<doris::stream_load::VNodeChannel>, doris::stream_load::OlapTableSink*&, doris::stream_load::IndexChannel*, long&> (__tag=..., this=<optimized out>)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1337
#41 std::shared_ptr<doris::stream_load::VNodeChannel>::shared_ptr<std::allocator<doris::stream_load::VNodeChannel>, doris::stream_load::OlapTableSink*&, doris::stream_load::IndexChannel*, long&> (__tag=..., this=<optimized out>) at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr.h:409
#42 std::allocate_shared<doris::stream_load::VNodeChannel, std::allocator<doris::stream_load::VNodeChannel>, doris::stream_load::OlapTableSink*&, doris::stream_load::IndexChannel*, long&> (__a=...) at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr.h:861
#43 std::make_shared<doris::stream_load::VNodeChannel, doris::stream_load::OlapTableSink*&, doris::stream_load::IndexChannel*, long&> ()
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr.h:877
#44 doris::stream_load::IndexChannel::init (this=this@entry=0x563d4b43f200, state=state@entry=0x563e027a3500, tablets=...)
    at /data/TCHouse-D-1.2/be/src/exec/tablet_sink.cpp:705
#45 0x0000563ca79698d8 in doris::stream_load::OlapTableSink::prepare (this=this@entry=0x5644b9efe880, state=state@entry=0x563e027a3500)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1290
#46 0x0000563cac74db65 in doris::stream_load::VOlapTableSink::prepare (this=0x5644b9efe880, state=0x563e027a3500)
    at /data/TCHouse-D-1.2/be/src/vec/sink/vtablet_sink.cpp:450
#47 0x0000563ca7946c21 in doris::PlanFragmentExecutor::prepare (this=this@entry=0x563f55efd280, request=..., fragments_ctx=<optimized out>)
    at /var/local/ldb-toolchain/include/c++/11/bits/unique_ptr.h:173
#48 0x0000563ca791f17c in doris::FragmentExecState::prepare (this=this@entry=0x563f55efd200, params=...)
--Type <RET> for more, q to quit, c to continue without paging--
    at /data/TCHouse-D-1.2/be/src/runtime/fragment_mgr.cpp:238
#49 0x0000563ca7926629 in doris::FragmentMgr::exec_plan_fragment(doris::TExecPlanFragmentParams const&, std::function<void (doris::PlanFragmentExecutor*)>) (this=this@entry=0x563cb6e93400, params=..., cb=...) at /data/TCHouse-D-1.2/be/src/runtime/fragment_mgr.cpp:720
#50 0x0000563ca7928dab in doris::FragmentMgr::exec_plan_fragment (this=0x563cb6e93400, params=...)
    at /data/TCHouse-D-1.2/be/src/runtime/fragment_mgr.cpp:564
#51 0x0000563ca7aaf507 in doris::PInternalServiceImpl::_exec_plan_fragment_impl (this=this@entry=0x563cb577b880, ser_request=..., 
    version=<optimized out>, compact=<optimized out>) at /data/TCHouse-D-1.2/be/src/service/internal_service.cpp:480
#52 0x0000563ca7aaf703 in doris::PInternalServiceImpl::_exec_plan_fragment_in_pthread (this=0x563cb577b880, controller=<optimized out>, 
    request=0x563d6c5bb9b0, response=0x564203b92ce0, done=0x563d885a60c0) at /data/TCHouse-D-1.2/be/src/service/internal_service.cpp:254
#53 0x0000563ca78d6dbd in std::function<void ()>::operator()() const (this=0x7f1f7fe090c8)
    at /var/local/ldb-toolchain/include/c++/11/bits/std_function.h:560
#54 doris::PriorityThreadPool::work_thread (this=0x563cb577ba38, thread_id=<optimized out>)
    at /data/TCHouse-D-1.2/be/src/util/priority_thread_pool.hpp:145
#55 0x0000563caf526b00 in std::execute_native_thread_routine (__p=0x563cbfc81d10) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#56 0x00007f219771dea5 in start_thread () from /lib64/libpthread.so.0
#57 0x00007f2197a309fd in clone () from /lib64/libc.so.6

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Copy link
Contributor

Thearas commented May 28, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Contributor

@lide-reed lide-reed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels May 28, 2025
@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@lide-reed lide-reed merged commit 353cb94 into apache:branch-1.2-lts May 28, 2025
13 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/1.2.9-merged reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants