|
| 1 | +# Phase 1: Critical Bugs and Thread Safety Fixes |
| 2 | + |
| 3 | +**Status**: 🔴 IN PROGRESS |
| 4 | +**Priority**: CRITICAL |
| 5 | +**Estimated Time**: 1-2 days |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Critical Issues Identified |
| 10 | + |
| 11 | +### 1. **Thread-Safety Violations** (CRITICAL) |
| 12 | + |
| 13 | +#### Issue 1.1: Unprotected Concurrent Access to Data Structures |
| 14 | +**Location**: `PRTree::insert()`, `PRTree::erase()`, `PRTree::rebuild()` |
| 15 | + |
| 16 | +```cpp |
| 17 | +// UNSAFE: Multiple threads can modify these simultaneously |
| 18 | +std::unordered_map<T, BB<D>> idx2bb; |
| 19 | +std::unordered_map<T, std::string> idx2data; |
| 20 | +vec<PRTreeElement<T, B, D>> flat_tree; |
| 21 | +``` |
| 22 | + |
| 23 | +**Impact**: Data races, crashes, undefined behavior |
| 24 | +**Severity**: CRITICAL - Will fail TSan |
| 25 | +**Fix**: Add mutex protection for all mutable operations |
| 26 | + |
| 27 | +#### Issue 1.2: Atomic Variable Misuse |
| 28 | +**Location**: Line 613 |
| 29 | + |
| 30 | +```cpp |
| 31 | +std::atomic<T> global_idx = 0; // Atomic variable |
| 32 | +// But operations like this are NOT atomic: |
| 33 | +auto new_idx = global_idx++; // Read-modify-write race! |
| 34 | +idx2bb[new_idx] = bb; // Separate operation, not atomic! |
| 35 | +``` |
| 36 | + |
| 37 | +**Impact**: Index collisions, data corruption |
| 38 | +**Severity**: HIGH |
| 39 | +**Fix**: Protect entire critical section with mutex |
| 40 | + |
| 41 | +### 2. **Memory Safety Issues** |
| 42 | + |
| 43 | +#### Issue 2.1: Potential Memory Leak in rebuild() |
| 44 | +**Location**: Line 956 |
| 45 | + |
| 46 | +```cpp |
| 47 | +void rebuild() { |
| 48 | + void *placement = std::malloc(sizeof(DataType<T, D>) * length); |
| 49 | + b = reinterpret_cast<DataType<T, D> *>(placement); |
| 50 | + // ... code that might throw ... |
| 51 | + std::free(placement); // Never reached if exception thrown! |
| 52 | +} |
| 53 | +``` |
| 54 | + |
| 55 | +**Impact**: Memory leak on exception |
| 56 | +**Severity**: MEDIUM |
| 57 | +**Fix**: Use RAII (std::unique_ptr with custom deleter) |
| 58 | + |
| 59 | +#### Issue 2.2: Manual Memory Management |
| 60 | +**Location**: Lines 702-704, 762-764 |
| 61 | + |
| 62 | +```cpp |
| 63 | +void *placement = std::malloc(sizeof(DataType<T, D>) * length); |
| 64 | +// Manual placement new |
| 65 | +new (b + i) DataType<T, D>{...}; |
| 66 | +// Manual free |
| 67 | +std::free(placement); |
| 68 | +``` |
| 69 | +
|
| 70 | +**Impact**: Error-prone, exception-unsafe |
| 71 | +**Severity**: MEDIUM |
| 72 | +**Fix**: Use `std::vector` or RAII wrapper |
| 73 | +
|
| 74 | +### 3. **Performance/Correctness Issues** |
| 75 | +
|
| 76 | +#### Issue 3.1: Pass-by-Value for Strings |
| 77 | +**Location**: Lines 624, 639, 656 |
| 78 | +
|
| 79 | +```cpp |
| 80 | +void save(std::string fname) // ❌ Copies string |
| 81 | +void load(std::string fname) // ❌ Copies string |
| 82 | +PRTree(std::string fname) // ❌ Copies string |
| 83 | +``` |
| 84 | + |
| 85 | +**Impact**: Unnecessary string copies |
| 86 | +**Severity**: LOW |
| 87 | +**Fix**: Use `const std::string&` |
| 88 | + |
| 89 | +#### Issue 3.2: Typo in Error Message |
| 90 | +**Location**: Line 688 |
| 91 | + |
| 92 | +```cpp |
| 93 | +"Both index and boudning box must have the same length" |
| 94 | +// ^^^^^^^ typo |
| 95 | +``` |
| 96 | + |
| 97 | +**Impact**: User confusion |
| 98 | +**Severity**: TRIVIAL |
| 99 | +**Fix**: Fix typo |
| 100 | + |
| 101 | +### 4. **Exception Safety Issues** |
| 102 | + |
| 103 | +#### Issue 4.1: Non-RAII Resource Management |
| 104 | +**Location**: Multiple locations using malloc/free |
| 105 | + |
| 106 | +**Impact**: Leaks on exception paths |
| 107 | +**Severity**: MEDIUM |
| 108 | +**Fix**: Use RAII everywhere |
| 109 | + |
| 110 | +--- |
| 111 | + |
| 112 | +## Implementation Plan |
| 113 | + |
| 114 | +### Step 1: Add Thread-Safety Infrastructure |
| 115 | + |
| 116 | +```cpp |
| 117 | +// Add to PRTree class |
| 118 | +private: |
| 119 | + mutable std::mutex tree_mutex_; // Protects all data structures |
| 120 | +``` |
| 121 | + |
| 122 | +### Step 2: Protect All Mutable Operations |
| 123 | + |
| 124 | +```cpp |
| 125 | +void insert(const T &idx, const py::array_t<float> &x, |
| 126 | + const std::optional<std::string> objdumps = std::nullopt) { |
| 127 | + std::lock_guard<std::mutex> lock(tree_mutex_); |
| 128 | + // ... rest of implementation ... |
| 129 | +} |
| 130 | + |
| 131 | +void erase(const T idx) { |
| 132 | + std::lock_guard<std::mutex> lock(tree_mutex_); |
| 133 | + // ... rest of implementation ... |
| 134 | +} |
| 135 | + |
| 136 | +void rebuild() { |
| 137 | + std::lock_guard<std::mutex> lock(tree_mutex_); |
| 138 | + // ... rest of implementation ... |
| 139 | +} |
| 140 | +``` |
| 141 | +
|
| 142 | +### Step 3: Fix Memory Management |
| 143 | +
|
| 144 | +```cpp |
| 145 | +// BEFORE: Manual malloc/free |
| 146 | +void *placement = std::malloc(sizeof(DataType<T, D>) * length); |
| 147 | +// ... use ... |
| 148 | +std::free(placement); |
| 149 | +
|
| 150 | +// AFTER: RAII with unique_ptr |
| 151 | +struct MallocDeleter { |
| 152 | + void operator()(void* ptr) const { std::free(ptr); } |
| 153 | +}; |
| 154 | +auto placement = std::unique_ptr<void, MallocDeleter>( |
| 155 | + std::malloc(sizeof(DataType<T, D>) * length) |
| 156 | +); |
| 157 | +// Automatically freed on scope exit or exception |
| 158 | +``` |
| 159 | + |
| 160 | +### Step 4: Fix API Issues |
| 161 | + |
| 162 | +```cpp |
| 163 | +// Fix string parameters |
| 164 | +void save(const std::string& fname); |
| 165 | +void load(const std::string& fname); |
| 166 | +PRTree(const std::string& fname); |
| 167 | + |
| 168 | +// Fix typo |
| 169 | +"Both index and bounding box must have the same length" |
| 170 | +``` |
| 171 | +
|
| 172 | +--- |
| 173 | +
|
| 174 | +## Testing Strategy |
| 175 | +
|
| 176 | +### 1. Compile with TSan |
| 177 | +```bash |
| 178 | +mkdir -p build_tsan |
| 179 | +cd build_tsan |
| 180 | +cmake -DENABLE_TSAN=ON -DBUILD_BENCHMARKS=ON .. |
| 181 | +make |
| 182 | +``` |
| 183 | + |
| 184 | +### 2. Run Stress Tests |
| 185 | +```bash |
| 186 | +./stress_test_concurrent # Short test |
| 187 | +timeout 3600 ./stress_test_concurrent # 1 hour |
| 188 | +``` |
| 189 | + |
| 190 | +**Expected**: Zero TSan warnings |
| 191 | + |
| 192 | +### 3. Run Under ASan |
| 193 | +```bash |
| 194 | +mkdir -p build_asan |
| 195 | +cd build_asan |
| 196 | +cmake -DENABLE_ASAN=ON .. |
| 197 | +make |
| 198 | +python -m pytest ../tests/ |
| 199 | +``` |
| 200 | + |
| 201 | +**Expected**: No memory leaks, no use-after-free |
| 202 | + |
| 203 | +### 4. Regression Testing |
| 204 | +```bash |
| 205 | +# Re-run benchmarks |
| 206 | +./benchmark_construction |
| 207 | +./benchmark_query |
| 208 | + |
| 209 | +# Compare to baseline |
| 210 | +python3 scripts/analyze_baseline.py |
| 211 | +``` |
| 212 | + |
| 213 | +**Expected**: <5% performance change |
| 214 | + |
| 215 | +--- |
| 216 | + |
| 217 | +## Success Criteria |
| 218 | + |
| 219 | +- ✅ All TSan warnings fixed |
| 220 | +- ✅ All ASan warnings fixed |
| 221 | +- ✅ Stress tests pass (1 hour continuous) |
| 222 | +- ✅ Unit tests pass |
| 223 | +- ✅ Performance within 5% of baseline |
| 224 | +- ✅ CI sanitizer checks pass |
| 225 | + |
| 226 | +--- |
| 227 | + |
| 228 | +## Implementation Order |
| 229 | + |
| 230 | +1. ✅ Identify all critical bugs |
| 231 | +2. 🔄 Add mutex to PRTree class |
| 232 | +3. 🔄 Protect all mutable operations |
| 233 | +4. 🔄 Fix memory management (RAII) |
| 234 | +5. 🔄 Fix API issues (string parameters, typos) |
| 235 | +6. 🔄 Test with TSan |
| 236 | +7. 🔄 Test with ASan |
| 237 | +8. 🔄 Run regression benchmarks |
| 238 | +9. 🔄 Commit and push |
| 239 | + |
| 240 | +--- |
| 241 | + |
| 242 | +## Risk Assessment |
| 243 | + |
| 244 | +**High Risk**: |
| 245 | +- Adding mutex might impact performance (measure carefully) |
| 246 | +- Need to ensure no deadlocks |
| 247 | + |
| 248 | +**Medium Risk**: |
| 249 | +- RAII changes might affect existing code paths |
| 250 | + |
| 251 | +**Low Risk**: |
| 252 | +- String parameter fixes (purely optimization) |
| 253 | +- Typo fixes (cosmetic) |
| 254 | + |
| 255 | +--- |
| 256 | + |
| 257 | +## Rollback Plan |
| 258 | + |
| 259 | +If performance degrades >5%: |
| 260 | +1. Profile to find bottleneck |
| 261 | +2. Consider reader-writer locks instead of mutex |
| 262 | +3. Use lock-free data structures if necessary |
| 263 | +4. Document tradeoffs |
| 264 | + |
| 265 | +If deadlocks occur: |
| 266 | +1. Add lock ordering documentation |
| 267 | +2. Use std::lock() for multiple mutexes |
| 268 | +3. Consider lock hierarchies |
| 269 | + |
| 270 | +--- |
| 271 | + |
| 272 | +## Next Steps After Phase 1 |
| 273 | + |
| 274 | +- Phase 2: C++20 Migration |
| 275 | +- Re-run all benchmarks |
| 276 | +- Update baseline if needed |
0 commit comments