Skip to content

Commit 73ed0bc

Browse files
committed
fix: Resolve use-after-free crash in DocumentStore::RemoveDocument
Fix segmentation fault during concurrent delete operations caused by using a dangling reference to the primary key string after erasing the map entry. Root cause: The function stored a const reference to pk_it->second, then erased the entry from doc_id_to_pk_ (invalidating the reference), and finally used the dangling reference in StructuredLog. This bug existed since the initial commit but was exposed by Linux CI's memory allocator during concurrent tests. Changes: - Copy primary key string before erasing map entry - Add stress tests (DocumentStoreStressTest) with SLOW label - Update release notes and changelog for v1.3.7
1 parent ec40fc7 commit 73ed0bc

File tree

5 files changed

+233
-8
lines changed

5 files changed

+233
-8
lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1919

2020
### Fixed
2121

22+
- **Critical: Use-After-Free in RemoveDocument** - Copy primary key string before erasing map entry; add stress tests with SLOW label for regression detection
2223
- **Critical: SIGSEGV on SET/SHOW VARIABLES** - Transfer VariableHandler ownership to TcpServer to prevent use-after-free crash
2324
- **Medium: auto-dump/manual dump conflict** - Add mutual exclusion between SnapshotScheduler and manual DUMP SAVE operations
2425
- **Low: DUMP LOAD GTID restoration** - Restore GTID even when replication was not running, enabling manual REPLICATION START after load

docs/releases/v1.3.7.md

Lines changed: 33 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -104,7 +104,31 @@ Thread-safe progress tracking structure for dump operations.
104104

105105
## Bug Fixes
106106

107-
### 1. Critical: SIGSEGV on SET/SHOW VARIABLES
107+
### 1. Critical: Use-After-Free in RemoveDocument
108+
109+
**Type:** Critical
110+
**Severity:** Critical
111+
112+
Fix use-after-free crash in `DocumentStore::RemoveDocument` during concurrent delete operations.
113+
114+
**Root Cause:** The function stored a reference to the primary key string from the map, then erased the map entry (invalidating the reference), and finally used the dangling reference in the log statement. This bug existed since the initial commit but was exposed by Linux CI's memory allocator behavior during concurrent tests.
115+
116+
**Symptom:** Segmentation fault in `DocumentStoreTest.ConcurrentDeletes` test on Linux CI.
117+
118+
**Solution:**
119+
120+
- Copy the primary key string before erasing the map entry
121+
- Add stress tests (`DocumentStoreStressTest`) with SLOW label for regression detection
122+
123+
**Files Changed:**
124+
125+
- `src/storage/document_store.cpp` - Copy string before erase
126+
- `tests/storage/document_store_stress_test.cpp` - New stress test file
127+
- `tests/storage/CMakeLists.txt` - Add stress test with SLOW label
128+
129+
---
130+
131+
### 2. Critical: SIGSEGV on SET/SHOW VARIABLES
108132

109133
**Type:** Critical
110134
**Severity:** Critical
@@ -126,7 +150,7 @@ Fix segmentation fault when executing SET or SHOW VARIABLES commands.
126150

127151
---
128152

129-
### 2. Medium: Auto-dump/Manual DUMP SAVE Mutual Exclusion
153+
### 3. Medium: Auto-dump/Manual DUMP SAVE Mutual Exclusion
130154

131155
**Type:** Bug Fix
132156
**Severity:** Medium
@@ -157,7 +181,7 @@ Implement proper synchronization between SnapshotScheduler (auto-dump) and manua
157181

158182
---
159183

160-
### 3. Low: DUMP LOAD GTID Restoration
184+
### 4. Low: DUMP LOAD GTID Restoration
161185

162186
**Type:** Bug Fix
163187
**Severity:** Low
@@ -170,7 +194,7 @@ GTID is now restored even when replication was not running before DUMP LOAD, ena
170194

171195
---
172196

173-
### 4. Low: BinlogReader::IsRunning() Flag Check
197+
### 5. Low: BinlogReader::IsRunning() Flag Check
174198

175199
**Type:** Bug Fix
176200
**Severity:** Low
@@ -275,10 +299,12 @@ Added complete documentation for the new command:
275299
- 5 new DUMP STATUS tests (idle, save in progress, load in progress, replication paused, GTID preservation)
276300
- GTID restoration test with MockBinlogReader
277301
- Variable handler ownership test
302+
- 2 new stress tests for RemoveDocument use-after-free regression (`DocumentStoreStressTest`)
278303

279304
### Test Infrastructure
280305

281306
- Add `DumpHandlerGtidTest` fixture with MockBinlogReader
307+
- Add `document_store_stress_test.cpp` with SLOW label for CI exclusion
282308
- Update 10+ test files with new flag names
283309

284310
---
@@ -288,13 +314,13 @@ Added complete documentation for the new command:
288314
| Category | Files |
289315
|----------|-------|
290316
| Core Features | dump_handler.cpp/h, server_types.h, query_parser.cpp |
291-
| Bug Fixes | tcp_server.cpp/h, snapshot_scheduler.cpp/h, binlog_reader.cpp |
317+
| Bug Fixes | document_store.cpp, tcp_server.cpp/h, snapshot_scheduler.cpp/h, binlog_reader.cpp |
292318
| Refactoring | 25+ files (StructuredLog migration) |
293319
| Interface | binlog_reader_interface.h |
294320
| Documentation | docs/en/protocol.md, docs/ja/protocol.md |
295-
| Tests | dump_handler_test.cpp, variable_handler_test.cpp, 10+ updated |
321+
| Tests | document_store_stress_test.cpp, dump_handler_test.cpp, variable_handler_test.cpp, 10+ updated |
296322

297-
**Total: 65 files changed, +2894/-634 lines**
323+
**Total: 68 files changed**
298324

299325
---
300326

src/storage/document_store.cpp

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -231,7 +231,8 @@ bool DocumentStore::RemoveDocument(DocId doc_id) {
231231
return false;
232232
}
233233

234-
const std::string& primary_key = pk_it->second;
234+
// Copy the primary key before erasing (avoid use-after-free)
235+
std::string primary_key = pk_it->second;
235236

236237
// Remove mappings
237238
pk_to_doc_id_.erase(primary_key);

tests/storage/CMakeLists.txt

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,6 +33,27 @@ gtest_discover_tests(document_store_concurrent_test
3333
PROPERTIES RESOURCE_LOCK storage_io
3434
)
3535

36+
# DocumentStore stress tests (marked as SLOW for CI)
37+
add_executable(document_store_stress_test
38+
document_store_stress_test.cpp
39+
)
40+
41+
target_link_libraries(document_store_stress_test
42+
mygramdb_storage
43+
GTest::gtest_main
44+
)
45+
46+
# Note: Stress tests are designed to detect concurrency bugs like use-after-free
47+
# through high memory pressure. They are excluded from regular CI runs.
48+
gtest_discover_tests(document_store_stress_test
49+
DISCOVERY_MODE POST_BUILD
50+
DISCOVERY_TIMEOUT 30
51+
PROPERTIES
52+
LABELS "SLOW"
53+
RESOURCE_LOCK storage_io
54+
TIMEOUT 60
55+
)
56+
3657
# DocumentStore serialization tests
3758
add_executable(document_store_serialization_test
3859
document_store_serialization_test.cpp
Lines changed: 176 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,176 @@
1+
/**
2+
* @file document_store_stress_test.cpp
3+
* @brief Stress tests for DocumentStore (marked as SLOW for CI)
4+
*
5+
* These tests are designed to detect concurrency bugs like use-after-free
6+
* through high memory pressure and concurrent operations. They are excluded
7+
* from regular CI runs due to their longer execution time.
8+
*/
9+
10+
#include <gtest/gtest.h>
11+
12+
#include <atomic>
13+
#include <thread>
14+
#include <vector>
15+
16+
#include "storage/document_store.h"
17+
18+
using namespace mygramdb::storage;
19+
20+
/**
21+
* @brief Stress test for RemoveDocument to detect use-after-free bugs
22+
*
23+
* This test targets the RemoveDocument function with high concurrency and memory
24+
* pressure. It was added to prevent regression of the use-after-free bug where
25+
* RemoveDocument held a reference to the primary key string after erasing the
26+
* map entry (the reference became dangling).
27+
*
28+
* The bug manifested as:
29+
* - const std::string& primary_key = pk_it->second; // Reference to string
30+
* - doc_id_to_pk_.erase(doc_id); // Invalidates reference
31+
* - StructuredLog()...Field("primary_key", primary_key) // Use after free!
32+
*
33+
* The fix was to copy the string before erasing:
34+
* - std::string primary_key = pk_it->second; // Copy the string
35+
*/
36+
TEST(DocumentStoreStressTest, RemoveDocumentUseAfterFreeRegression) {
37+
constexpr int kIterations = 10;
38+
constexpr int kDocsPerIteration = 500;
39+
constexpr int kNumThreads = 8;
40+
41+
for (int iter = 0; iter < kIterations; ++iter) {
42+
DocumentStore store;
43+
44+
// Add documents with long primary keys to increase memory churn
45+
std::vector<DocId> doc_ids;
46+
doc_ids.reserve(kDocsPerIteration);
47+
48+
for (int i = 0; i < kDocsPerIteration; ++i) {
49+
// Use longer primary keys to increase memory allocation/deallocation
50+
std::string pk = "primary_key_with_longer_content_for_memory_pressure_" + std::to_string(iter) + "_" +
51+
std::to_string(i) + "_padding";
52+
53+
std::unordered_map<std::string, FilterValue> filters;
54+
filters["iteration"] = static_cast<int64_t>(iter);
55+
filters["index"] = static_cast<int64_t>(i);
56+
57+
auto result = store.AddDocument(pk, filters);
58+
ASSERT_TRUE(result.has_value()) << "Failed to add document " << i;
59+
doc_ids.push_back(*result);
60+
}
61+
62+
ASSERT_EQ(store.Size(), kDocsPerIteration);
63+
64+
// Concurrent deletion from multiple threads
65+
std::vector<std::thread> threads;
66+
std::atomic<int> delete_count{0};
67+
std::atomic<int> docs_per_thread = kDocsPerIteration / kNumThreads;
68+
69+
for (int t = 0; t < kNumThreads; ++t) {
70+
threads.emplace_back([&store, &doc_ids, &delete_count, &docs_per_thread, t]() {
71+
int start = t * docs_per_thread;
72+
int end = (t == kNumThreads - 1) ? static_cast<int>(doc_ids.size()) : start + docs_per_thread;
73+
74+
for (int i = start; i < end; ++i) {
75+
// RemoveDocument should not crash even with concurrent access
76+
// The bug was that primary_key reference became invalid after erase
77+
bool removed = store.RemoveDocument(doc_ids[i]);
78+
if (removed) {
79+
delete_count++;
80+
}
81+
}
82+
});
83+
}
84+
85+
for (auto& thread : threads) {
86+
thread.join();
87+
}
88+
89+
// All documents should be deleted
90+
EXPECT_EQ(delete_count.load(), kDocsPerIteration) << "Iteration " << iter << " failed";
91+
EXPECT_EQ(store.Size(), 0) << "Store not empty after iteration " << iter;
92+
93+
// Verify all documents are gone
94+
for (const auto& doc_id : doc_ids) {
95+
auto doc = store.GetDocument(doc_id);
96+
EXPECT_FALSE(doc.has_value()) << "Document " << doc_id << " still exists";
97+
}
98+
}
99+
}
100+
101+
/**
102+
* @brief Test concurrent add and remove operations with memory stress
103+
*
104+
* This test creates memory pressure by doing rapid add/remove cycles
105+
* across multiple threads, which increases the likelihood of detecting
106+
* use-after-free bugs due to memory reuse.
107+
*/
108+
TEST(DocumentStoreStressTest, ConcurrentAddRemoveMemoryStress) {
109+
DocumentStore store;
110+
111+
constexpr int kNumThreads = 6;
112+
constexpr int kOperationsPerThread = 200;
113+
114+
std::atomic<bool> stop{false};
115+
std::atomic<int> add_success{0};
116+
std::atomic<int> remove_success{0};
117+
std::vector<std::thread> threads;
118+
119+
// Producer threads - add documents
120+
for (int t = 0; t < kNumThreads / 2; ++t) {
121+
threads.emplace_back([&store, &stop, &add_success, t]() {
122+
int counter = 0;
123+
while (!stop && counter < kOperationsPerThread) {
124+
std::string pk = "stress_add_thread_" + std::to_string(t) + "_doc_" + std::to_string(counter) +
125+
"_with_extra_padding_for_memory_allocation";
126+
127+
std::unordered_map<std::string, FilterValue> filters;
128+
filters["thread"] = static_cast<int64_t>(t);
129+
filters["counter"] = static_cast<int64_t>(counter);
130+
filters["description"] = std::string("Document created by thread ") + std::to_string(t);
131+
132+
auto result = store.AddDocument(pk, filters);
133+
if (result.has_value()) {
134+
add_success++;
135+
}
136+
counter++;
137+
}
138+
});
139+
}
140+
141+
// Consumer threads - remove documents (will remove whatever exists)
142+
for (int t = 0; t < kNumThreads / 2; ++t) {
143+
threads.emplace_back([&store, &stop, &remove_success]() {
144+
while (!stop) {
145+
// Get all doc IDs and try to remove some
146+
auto all_ids = store.GetAllDocIds();
147+
for (const auto& doc_id : all_ids) {
148+
if (stop)
149+
break;
150+
if (store.RemoveDocument(doc_id)) {
151+
remove_success++;
152+
}
153+
}
154+
// Small yield to allow other operations
155+
std::this_thread::yield();
156+
}
157+
});
158+
}
159+
160+
// Let it run for a bit
161+
std::this_thread::sleep_for(std::chrono::milliseconds(300));
162+
stop = true;
163+
164+
for (auto& thread : threads) {
165+
thread.join();
166+
}
167+
168+
// Verify operations completed without crashes
169+
EXPECT_GT(add_success.load(), 0) << "No documents were added";
170+
// Note: remove_success may be 0 if all adds happened after removes finished
171+
// The main verification is that no crashes occurred
172+
173+
// Final state verification
174+
size_t final_size = store.Size();
175+
EXPECT_GE(final_size, 0);
176+
}

0 commit comments

Comments
 (0)