Skip to content

Commit db499ba

Browse files
committed
MB-48925 1/3: Don't extend VBucket lifetime via bg Tasks
During bucket shutdown we intermittently see an exception thrown during task scheduling on a background NonIO thread, which crashes the memcached process. +Analysis+ Bug is as follows. Starting at the main thread which is deleting the Bucket (Thread 1): (gdb) bt ... #10 0x0000000000649bf3 in FollyExecutorPool::schedule(std::shared_ptr<GlobalTask>) () at /c++/10.2.0/new:175 #11 0x000000000084271b in EPVBucket::scheduleDeferredDeletion(EventuallyPersistentEngine&) () at /c++/10.2.0/ext/atomicity.h:100 #12 0x00000000006dfe7a in VBucket::DeferredDeleter::operator()(VBucket*) const () at kv_engine/engines/ep/src/vbucket.cc:3990 #13 0x000000000086f874 in std::_Sp_counted_deleter<EPVBucket*, VBucket::DeferredDeleter, ...>::_M_dispose () at /c++/10.2.0/bits/shared_ptr_base.h:453 ... #18 std::shared_ptr<VBucket>::~shared_ptr (this=0x7b44000515d0, __in_chrg=<optimized out>) at /c++/10.2.0/bits/shared_ptr.h:121 #19 PagingVisitor::~PagingVisitor (this=0x7b4400051540, __in_chrg=<optimized out>) at kv_engine/engines/ep/src/paging_visitor.h:39 ... #31 std::__shared_ptr<GlobalTask, (__gnu_cxx::_Lock_policy)2>::reset () at /c++/10.2.0/bits/shared_ptr_base.h:1301 #32 EventuallyPersistentEngine::waitForTasks(std::vector<std::shared_ptr<GlobalTask>, std::allocator<std::shared_ptr<GlobalTask> > >&) () at kv_engine/engines/ep/src/ep_engine.cc:6752 #33 0x000000000082396f in EventuallyPersistentEngine::destroyInner(bool) () at kv_engine/engines/ep/src/ep_engine.cc:2135 1. PagingVisitor is still in existence running after `EventuallyPersistentEngine::destroyInner` - see frame #19. This is because all tasks belonging to bucket were returned from unregisterTaskable() just before. 2. PagingVisitor (via VBCBAdaptor) is destroyed, it decrements the refcount on the shared_ptr<VBucket> it owns - see frame #18. 3. That is the last reference to the VBucket, which results in VBucket::DeferredDeleter being invoked which in turn schedules a task to delete the VBucket (disk and memory) in the background - see frame #11. We see the schedule's lambda happen on the SchedulerPool0 thread (T:35): Thread 35 "SchedulerPool0" hit Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (..., tinfo=0x10c4ec8 <typeinfo for std::out_of_range@@GLIBCXX_3.4>, ...) at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_throw.cc:80 (gdb) bt #1 0x00007ffff4cad7d2 in std::__throw_out_of_range (__s=__s@entry=0xcc68e6 "_Map_base::at") at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/functexcept.cc:82 ... #3 0x00000000005504ee in std::unordered_map<...>::at (__k=@0x7fffe83a8f88: 0x7b7400000848, this=0x7b1000005580) at /c++/10.2.0/bits/unordered_map.h:1000 #4 FollyExecutorPool::State::scheduleTask (this=..., executor=..., pool=..., task=...) at kv_engine/executor/folly_executorpool.cc:415 ... #8 folly::EventBase::runInEventBaseThreadAndWait(...) at folly/io/async/EventBase.cpp:671 ... In FollyExecutorPool::State::scheduleTask (frame #3) we attempt to lookup the Taskable (Bucket) in the ExecutorPool's map, however given its already been unregistered, the taskable is not found an the std::out_of_range exception is thrown. This is a lifetime issue. We have VBucket objects potentially being kept alive longer than their expected lifetime by virtue of background tasks having shared ownership of them - and those background tasks outlive the lifetime of their parent object (KVBucket), and crucially past when the owning Bucket is unregistered with the ExecutorPool and can no longer schedule tasks. When it then _does+ attempt to schedule a task against an unregistered (and deleted) Taskable; we see the crash. +Solution+ There's arguably two problems which should be addressed (although technically only one of the two is required to encounter this crash): 1. Background tasks owning VBuckets when they are not executing. 2. Background tasks outliving their associated Taskable (aka Bucket). This patch addresses the critical issue of (1) - we remove the (shared) ownership of VBucket from the background tasks which previoulsy had it - both PagingVisitor which is the problematic class in this scenario, but also in the other background Tasks which potentially have the same problem. The 2nd patch will tighten up the API for visiting VBuckets, so visitors are not passed a VBucketPtr, but instead VBucket& which reduces the chance of similar problems happening in future. The 3rd patch will adddress Background Taks outliving their Taskable. Change-Id: I340a3e4dc3d9234c4a34866b410fb8295a1c98d1 Reviewed-on: http://review.couchbase.org/c/kv_engine/+/163783 Tested-by: Dave Rigby <[email protected]> Reviewed-by: Richard de Mellow <[email protected]> Reviewed-by: James H <[email protected]>
1 parent 9e5480f commit db499ba

File tree

7 files changed

+115
-14
lines changed

7 files changed

+115
-14
lines changed

engines/ep/src/paging_visitor.cc

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -182,10 +182,11 @@ void PagingVisitor::visitBucket(const VBucketPtr& vb) {
182182
// fast path for expiry item pager
183183
if (owner == EXPIRY_PAGER) {
184184
if (vBucketFilter(vb->getId())) {
185-
currentBucket = vb;
185+
currentBucket = vb.get();
186186
// EvictionPolicy is not required when running expiry item
187187
// pager
188188
vb->ht.visit(*this);
189+
currentBucket = nullptr;
189190
}
190191
return;
191192
}
@@ -204,8 +205,7 @@ void PagingVisitor::visitBucket(const VBucketPtr& vb) {
204205
return;
205206
}
206207

207-
currentBucket = vb;
208-
maxCas = currentBucket->getMaxCas();
208+
maxCas = vb->getMaxCas();
209209
itemEviction.reset();
210210
freqCounterThreshold = 0;
211211

@@ -220,17 +220,18 @@ void PagingVisitor::visitBucket(const VBucketPtr& vb) {
220220
: ItemEviction::learningPopulation;
221221
itemEviction.setUpdateInterval(interval);
222222

223+
currentBucket = vb.get();
223224
vb->ht.visit(*this);
225+
currentBucket = nullptr;
224226
/**
225227
* Note: We are not taking a reader lock on the vbucket state.
226228
* Therefore it is possible that the stats could be slightly
227229
* out. However given that its just for stats we don't want
228230
* to incur any performance cost associated with taking the
229231
* lock.
230232
*/
231-
const bool isActiveOrPending =
232-
((currentBucket->getState() == vbucket_state_active) ||
233-
(currentBucket->getState() == vbucket_state_pending));
233+
const bool isActiveOrPending = ((vb->getState() == vbucket_state_active) ||
234+
(vb->getState() == vbucket_state_pending));
234235

235236
// Take a snapshot of the latest frequency histogram
236237
if (isActiveOrPending) {

engines/ep/src/paging_visitor.h

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -122,7 +122,8 @@ class PagingVisitor : public CappedDurationVBucketVisitor,
122122
size_t ejected;
123123

124124
// The current vbucket that the eviction algorithm is operating on.
125-
VBucketPtr currentBucket;
125+
// Only valid while inside visitBucket().
126+
VBucket* currentBucket{nullptr};
126127

127128
// The frequency counter threshold that is used to determine whether we
128129
// should evict items from the hash table.

engines/ep/src/warmup.cc

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1092,8 +1092,9 @@ void LoadStorageKVPairCallback::purge() {
10921092

10931093
void visitBucket(const VBucketPtr& vb) override {
10941094
if (vBucketFilter(vb->getId())) {
1095-
currentBucket = vb;
1095+
currentBucket = vb.get();
10961096
vb->ht.visit(*this);
1097+
currentBucket = nullptr;
10971098
}
10981099
}
10991100

@@ -1107,7 +1108,9 @@ void LoadStorageKVPairCallback::purge() {
11071108

11081109
private:
11091110
EPBucket& epstore;
1110-
VBucketPtr currentBucket;
1111+
// The current vbucket that the visitor is operating on. Only valid
1112+
// while inside visitBucket().
1113+
VBucket* currentBucket{nullptr};
11111114
};
11121115

11131116
auto vbucketIds(vbuckets.getBuckets());

engines/ep/tests/mock/mock_paging_visitor.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,7 @@ class MockPagingVisitor : public PagingVisitor {
6060
}
6161

6262
void setCurrentBucket(VBucketPtr _currentBucket) {
63-
currentBucket = _currentBucket;
63+
currentBucket = _currentBucket.get();
6464
}
6565

6666
MOCK_METHOD1(visitBucket, void(const VBucketPtr&));

engines/ep/tests/module_tests/evp_engine_test.cc

Lines changed: 91 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -22,11 +22,13 @@
2222
#include "programs/engine_testapp/mock_cookie.h"
2323
#include "programs/engine_testapp/mock_server.h"
2424
#include "tests/module_tests/test_helpers.h"
25+
#include "vb_visitors.h"
2526
#include <executor/cb3_taskqueue.h>
2627

2728
#include <boost/algorithm/string/join.hpp>
2829
#include <boost/filesystem.hpp>
2930
#include <configuration_impl.h>
31+
#include <folly/synchronization/Baton.h>
3032
#include <platform/dirutils.h>
3133
#include <chrono>
3234
#include <thread>
@@ -108,10 +110,16 @@ void EventuallyPersistentEngineTest::TearDown() {
108110
}
109111

110112
void EventuallyPersistentEngineTest::shutdownEngine() {
111-
destroy_mock_cookie(cookie);
112-
// Need to force the destroy (i.e. pass true) because
113-
// NonIO threads may have been disabled (see DCPTest subclass).
114-
engine->destroy(true);
113+
if (cookie) {
114+
destroy_mock_cookie(cookie);
115+
cookie = nullptr;
116+
}
117+
if (engine) {
118+
// Need to force the destroy (i.e. pass true) because
119+
// NonIO threads may have been disabled (see DCPTest subclass).
120+
engine->destroy(true);
121+
engine = nullptr;
122+
}
115123
}
116124

117125
queued_item EventuallyPersistentEngineTest::store_item(
@@ -495,3 +503,82 @@ INSTANTIATE_TEST_SUITE_P(EphemeralOrPersistent,
495503
[](const ::testing::TestParamInfo<std::string>& info) {
496504
return info.param;
497505
});
506+
507+
/**
508+
* Regression test for MB-48925 - if a Task is scheduled against a Taskable
509+
* (Bucket) which has already been unregistered, then the ExecutorPool throws
510+
* and crashes the process.
511+
* Note: This test as it stands will *not* crash kv-engine if the fix for the
512+
* issue (see rest of this commit) is reverted. This is because the fix is
513+
* to change the currentVb Task member variable from an (owning)
514+
* shared_ptr<VBucket> to a (non-owning) VBucket* - the same thing TestVisior
515+
* below does. However it is included here for reference as to the original
516+
* problematic scenario.
517+
*/
518+
TEST_F(EventuallyPersistentEngineTest, MB48925_ScheduleTaskAfterUnregistered) {
519+
class TestVisitor : public InterruptableVBucketVisitor {
520+
public:
521+
TestVisitor(int& visitCount,
522+
folly::Baton<>& waitForVisit,
523+
folly::Baton<>& waitForDeinitialise)
524+
: visitCount(visitCount),
525+
waitForVisit(waitForVisit),
526+
waitForDeinitialise(waitForDeinitialise) {
527+
}
528+
529+
void visitBucket(const VBucketPtr& vb) override {
530+
if (visitCount++ == 0) {
531+
currentVb = vb.get();
532+
// On first call to visitBucket() perform the necessary
533+
// interleaved baton wait / sleeping.
534+
// Suspend execution of this thread; and allow main thread to
535+
// continue, delete Bucket and unregisterTaskable.
536+
waitForVisit.post();
537+
538+
// Keep task running until unregisterTaskable() has been called
539+
// and starts to cancel tasks - this ensures that the Task
540+
// object is still alive (ExecutorPool has a reference to it)
541+
// and hence is passed out from unregisterTaskable(), hence kept
542+
// alive past when KVBucket is deleted.
543+
waitForDeinitialise.wait();
544+
}
545+
}
546+
InterruptableVBucketVisitor::ExecutionState shouldInterrupt() override {
547+
return ExecutionState::Continue;
548+
}
549+
550+
int& visitCount;
551+
folly::Baton<>& waitForVisit;
552+
folly::Baton<>& waitForDeinitialise;
553+
554+
// Model the behaviour of PagingVisitor prior to the bugfix. Note that
555+
// _if_ this is changed to a shared_ptr<VBucket> then we crash.
556+
VBucket* currentVb;
557+
};
558+
559+
int visitCount{0};
560+
folly::Baton waitForVisit;
561+
folly::Baton waitForUnregister;
562+
engine->getKVBucket()->visitAsync(
563+
std::make_unique<TestVisitor>(
564+
visitCount, waitForVisit, waitForUnregister),
565+
"MB48925_ScheduleTaskAfterUnregistered",
566+
TaskId::ExpiredItemPagerVisitor,
567+
std::chrono::seconds{1});
568+
waitForVisit.wait();
569+
570+
// Setup testing hook so we allow our TestVisitor's Task above to
571+
// continue once we are inside unregisterTaskable.
572+
ExecutorPool::get()->unregisterTaskablePostCancelHook =
573+
[&waitForUnregister]() { waitForUnregister.post(); };
574+
575+
// Delete the vbucket; so the file deletion will be performed by
576+
// VBucket::DeferredDeleter when the last reference goes out of scope
577+
// (expected to be the ExpiryPager.
578+
engine->getKVBucket()->deleteVBucket(vbid);
579+
580+
// Destroy the engine. This does happen implicitly in TearDown, but call
581+
// it earlier because we need to call destroy() before our various Baton
582+
// local variables etc go out of scope.
583+
shutdownEngine();
584+
}

executor/executorpool.h

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111

1212
#include <memcached/engine_common.h>
1313
#include <memcached/thread_pool_config.h>
14+
#include <utilities/testing_hook.h>
1415

1516
#include "task_type.h"
1617
#include <atomic>
@@ -222,6 +223,12 @@ class ExecutorPool {
222223
*/
223224
static int getThreadPriority(task_type_t taskType);
224225

226+
/************** Testing *************************************************/
227+
228+
// Testing hook for MB-48925 - called inside unregisterTaskable after
229+
// tasks have been cancelled.
230+
TestingHook<> unregisterTaskablePostCancelHook;
231+
225232
protected:
226233
ExecutorPool(size_t maxThreads);
227234

executor/folly_executorpool.cc

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -879,6 +879,8 @@ std::vector<ExTask> FollyExecutorPool::unregisterTaskable(Taskable& taskable,
879879
removedTasks = state->cancelTasksOwnedBy(taskable, force);
880880
});
881881

882+
unregisterTaskablePostCancelHook();
883+
882884
// Step 2 - poll for taskOwners to become empty. This will only
883885
// occur once all outstanding, running tasks have been cancelled.
884886
auto isTaskOwnersEmpty = [eventBase, &state = this->state, &taskable] {

0 commit comments

Comments
 (0)