| 1 | +# 36. Memory Leak Testing Infrastructure |
| 2 | + |
| 3 | +Date: 2026-01-12 |
| 4 | + |
| 5 | +## Status |
| 6 | + |
| 7 | +Proposed |
| 8 | + |
| 9 | +## Context |
| 10 | + |
| 11 | +Brighter has received reports of memory leaks in production environments. These leaks can manifest as: |
| 12 | +- Handler instances not being disposed after command/query processing |
| 13 | +- DbContext instances accumulating over time |
| 14 | +- Message producer connections (RabbitMQ, Kafka) not being properly released |
| 15 | +- Outbox/Inbox message processing leaving objects in memory |
| 16 | +- ServiceActivator consumer threads holding references to disposed objects |
| 17 | +- Mismatched scoped, transient and singleton lifetimes (for example, a singleton that captures a scoped dependency and keeps it alive) |
| 18 | + |
| 19 | +Without systematic memory leak testing, these issues are discovered late in production, making them: |
| 20 | +- Difficult to reproduce and diagnose |
| 21 | +- Expensive to fix due to emergency response requirements |
| 22 | +- Damaging to user trust and system stability |
| 23 | +- Time-consuming to track down through manual profiling |
| 24 | + |
| 25 | +Currently, Brighter has no automated memory leak detection in its CI/CD pipeline. The existing test suite covers functional correctness but doesn't verify memory behavior under sustained load or detect gradual memory accumulation. |
| 26 | + |
| 27 | +Manual profiling with tools like dotMemory or PerfView is performed ad-hoc when issues are reported, but this reactive approach means: |
| 28 | +- Issues reach production before detection |
| 29 | +- Regression testing for memory leaks is manual and inconsistent |
| 30 | +- Contributors don't get feedback on memory behavior during development |
| 31 | +- No baseline exists for acceptable memory growth patterns |
| 32 | + |
| 33 | +The WebAPI samples in `samples/WebAPI` provide realistic scenarios for testing: |
| 34 | +- **WebAPI_EFCore/GreetingsWeb**: REST API using CommandProcessor, Darker queries, EF Core, and message publishing |
| 35 | +- **WebAPI_EFCore/SalutationAnalytics**: Message consumer using ServiceActivator with Inbox pattern |
| 36 | + |
| 37 | +These samples exercise the core Brighter patterns that are most susceptible to memory leaks: |
| 38 | +- Scoped handler lifetime management |
| 39 | +- Database connection pooling |
| 40 | +- Message producer/consumer lifecycles |
| 41 | +- Transactional outbox/inbox operations |
| 42 | +- Background processing (sweeper, consumers) |
| 43 | + |
| 44 | +## Decision |
| 45 | + |
| 46 | +We will create a comprehensive, repeatable memory leak testing infrastructure integrated into GitHub Actions workflows using a two-tier testing strategy: |
| 47 | + |
| 48 | +### Two-Tier Testing Strategy |
| 49 | + |
| 50 | +**1. Quick Tests (5-10 minutes) - Run on every PR** |
| 51 | +- Fast memory checks that catch obvious leaks |
| 52 | +- Test handler disposal, DbContext lifecycle, connection management |
| 53 | +- Run 500-1000 operations to detect immediate leaks |
| 54 | +- Provide fast feedback to contributors before merge |
| 55 | +- Use strict thresholds: 0 leaked instances, <10MB growth |
| 56 | + |
| 57 | +**2. Soak Tests (30-60 minutes) - Run nightly against `master`** |
| 58 | +- Long-running tests under sustained load (10k+ operations) |
| 59 | +- Detect gradual memory accumulation over time |
| 60 | +- Monitor memory checkpoints every 5 minutes |
| 61 | +- Test realistic production scenarios |
| 62 | +- Catch subtle leaks that quick tests might miss |
| 63 | + |
| 64 | +### Tooling Decisions |
| 65 | + |
| 66 | +**JetBrains dotMemory Unit** will be used for memory profiling and assertions because: |
| 67 | +- Provides explicit memory assertions in unit tests |
| 68 | +- Can check for specific leaked object types |
| 69 | +- Tracks memory growth over test execution |
| 70 | +- Works in CI/CD environments |
| 71 | +- Free for open-source projects via JetBrains OSS program |
| 72 | +- Generates `.dmw` workspace files for offline analysis |
| 73 | +- Gracefully degrades when not available (`FailIfRunWithoutSupport = false`) |
| 74 | + |
| 75 | +**WebApplicationFactory** (ASP.NET Core TestServer) will be used for API testing because: |
| 76 | +- In-process testing without network overhead |
| 77 | +- Full control over configuration and environment |
| 78 | +- Faster test execution than external process testing |
| 79 | +- Deterministic behavior for reproducible results |
| 80 | +- Allows dependency injection customization |
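
As an illustration of the in-process hosting described above, a minimal sketch of a `WebApplicationFactory` subclass follows. It is not the project's actual `WebApiTestServer`: the entry-point type `Program` and the environment name are assumptions, and the service override is left as a placeholder.

```csharp
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Mvc.Testing;

// Hypothetical wrapper. "Program" stands in for GreetingsWeb's entry-point type,
// which must be visible to the test assembly.
public sealed class WebApiTestServer : WebApplicationFactory<Program>
{
    protected override void ConfigureWebHost(IWebHostBuilder builder)
    {
        // Assumed environment name; use whatever the sample's configuration expects.
        builder.UseEnvironment("Development");

        builder.ConfigureServices(services =>
        {
            // Replace or decorate registrations here, for example to point the
            // sample at test-only infrastructure.
        });
    }
}
```

A test would typically create one instance, call `CreateClient()` to obtain an `HttpClient` bound to the in-memory server, and dispose the factory once the run ends.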
| 81 | + |
| 82 | +**xUnit** will continue as the test framework (existing infrastructure) with trait-based filtering: |
| 83 | +- `[Trait("Category", "MemoryLeak")]` - Identifies all memory tests |
| 84 | +- `[Trait("Speed", "Quick")]` - Fast tests for PR runs |
| 85 | +- `[Trait("Speed", "Soak")]` - Long-running tests for nightly runs |
| 86 | + |
| 87 | +### Test Project Structure |
| 88 | + |
| 89 | +Create new test project: `tests/Paramore.Brighter.MemoryLeak.Tests` |
| 90 | + |
| 91 | +**Infrastructure Layer:** |
| 92 | +- `MemoryLeakTestBase.cs` - Base class with dotMemory helper methods |
| 93 | + - `AssertNoLeakedHandlers<T>()` - Verify handlers are disposed |
| 94 | + - `AssertDbContextsDisposed<T>()` - Verify DbContexts are released |
| 95 | + - `AssertMemoryGrowthWithinBounds()` - Check bounded memory growth |
| 96 | +- `WebApiTestServer.cs` - WebApplicationFactory wrapper for GreetingsWeb |
| 97 | +- `ConsumerTestHost.cs` - IHost wrapper for SalutationAnalytics consumer |
| 98 | +- `LoadGenerator.cs` - HTTP load generation with configurable concurrency |
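
A load generator with configurable concurrency could look roughly like the sketch below. The class shape and the `ForGet` helper are assumptions; only the `RunLoadAsync(total, concurrency)` call shape is taken from the warmup example in this ADR. A `SemaphoreSlim` caps the number of requests in flight.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical load generator: runs totalRequests through the supplied delegate,
// with at most maxConcurrency in flight at any time.
public sealed class LoadGenerator
{
    private readonly Func<CancellationToken, Task> _sendRequest;

    public LoadGenerator(Func<CancellationToken, Task> sendRequest) =>
        _sendRequest = sendRequest ?? throw new ArgumentNullException(nameof(sendRequest));

    // Convenience factory for the HTTP case: GET a URL and dispose the response.
    public static LoadGenerator ForGet(HttpClient client, string relativeUrl) =>
        new LoadGenerator(async ct =>
        {
            using var response = await client.GetAsync(relativeUrl, ct);
        });

    // Returns the number of requests that completed without an HTTP failure.
    public async Task<int> RunLoadAsync(int totalRequests, int maxConcurrency, CancellationToken ct = default)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var succeeded = 0;

        var tasks = Enumerable.Range(0, totalRequests).Select(async _ =>
        {
            await gate.WaitAsync(ct);
            try
            {
                await _sendRequest(ct);
                Interlocked.Increment(ref succeeded);
            }
            catch (HttpRequestException)
            {
                // Failed requests are not counted; callers compare against totalRequests.
            }
            finally
            {
                gate.Release();
            }
        }).ToArray();

        await Task.WhenAll(tasks);
        return succeeded;
    }
}
```

Taking a delegate rather than an `HttpClient` directly keeps the generator usable for non-HTTP load, such as publishing messages to a consumer under test.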
| 99 | + |
| 100 | +**Quick Tests:** |
| 101 | +- `ApiHandlerLifecycleTests.cs` - Handler disposal verification (1000 requests) |
| 102 | +- `DbContextLifecycleTests.cs` - DbContext lifecycle checks (500 operations) |
| 103 | +- `CommandProcessorMemoryTests.cs` - CommandProcessor memory behavior |
| 104 | +- `MessageProducerMemoryTests.cs` - Producer connection management |
| 105 | +- `ConsumerBasicMemoryTests.cs` - Consumer handler disposal |
| 106 | + |
| 107 | +**Soak Tests:** |
| 108 | +- `ApiUnderLoadTests.cs` - 30 min, 10k+ requests, memory checkpoints |
| 109 | +- `ContinuousConsumerTests.cs` - 30 min, 10k+ messages |
| 110 | +- `OutboxSweeperLongRunTests.cs` - Background sweeper stability |
| 111 | + |
| 112 | +### Memory Thresholds |
| 113 | + |
| 114 | +**Quick Tests (strict, immediate feedback):** |
| 115 | +- Handler instances after GC: 0 (strict - no leaked handlers) |
| 116 | +- DbContext instances after GC: 0 (strict - must be disposed) |
| 117 | +- Memory growth per 1000 commands: < 5MB |
| 118 | +- Memory growth per 500 API requests: < 10MB |
| 119 | +- RabbitMQ connection objects: < 50 (pooled connections) |
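
The zero-instance thresholds above map onto dotMemory Unit's object queries. The sketch below shows one way a quick test might assert them; `GreetingHandler` is a hypothetical stand-in for a real handler type, and the test body is a placeholder.

```csharp
using JetBrains.dotMemoryUnit;
using Xunit;

// Hypothetical handler type, standing in for a real Brighter handler.
public sealed class GreetingHandler { }

// Do not fail the run when the dotMemory Unit profiler is not attached.
[DotMemoryUnit(FailIfRunWithoutSupport = false)]
public class ApiHandlerLifecycleTests
{
    [Fact]
    [Trait("Category", "MemoryLeak")]
    [Trait("Speed", "Quick")]
    public void Handlers_are_not_retained_after_processing()
    {
        // ... exercise the API or CommandProcessor here ...

        // Check takes a snapshot; the query counts live instances of the type.
        dotMemory.Check(memory =>
            Assert.Equal(0, memory.GetObjects(where => where.Type.Is<GreetingHandler>()).ObjectsCount));
    }
}
```

The same query shape would apply to a `DbContext` subclass; only the type argument changes.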
| 120 | + |
| 121 | +**Soak Tests (realistic sustained load):** |
| 122 | +- Total memory growth over 30 minutes: < 50MB |
| 123 | +- Memory growth per request: < 1KB (average) |
| 124 | +- Consumer memory after 10k messages: < 100MB growth |
| 125 | +- No unbounded linear growth pattern (stable checkpoints) |
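
One possible way to turn "no unbounded linear growth" into a pass/fail check is to fit a straight line through the memory checkpoints and project the slope over the run. This is a sketch of that idea, not a method the ADR prescribes; the type and member names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MemoryTrend
{
    // Least-squares slope of memory (bytes) against elapsed time (minutes).
    public static double SlopeBytesPerMinute(IReadOnlyList<(double Minutes, long Bytes)> checkpoints)
    {
        if (checkpoints.Count < 2)
            throw new ArgumentException("At least two checkpoints are required.", nameof(checkpoints));

        var meanX = checkpoints.Average(c => c.Minutes);
        var meanY = checkpoints.Average(c => (double)c.Bytes);
        var covariance = checkpoints.Sum(c => (c.Minutes - meanX) * (c.Bytes - meanY));
        var variance = checkpoints.Sum(c => (c.Minutes - meanX) * (c.Minutes - meanX));
        if (variance == 0)
            throw new ArgumentException("Checkpoints must span more than one point in time.", nameof(checkpoints));

        return covariance / variance;
    }

    // True when the fitted growth, projected over the whole run, stays within the budget.
    public static bool IsStable(IReadOnlyList<(double Minutes, long Bytes)> checkpoints, long maxGrowthBytes)
    {
        var duration = checkpoints[checkpoints.Count - 1].Minutes - checkpoints[0].Minutes;
        return SlopeBytesPerMinute(checkpoints) * duration <= maxGrowthBytes;
    }
}
```

With checkpoints taken every 5 minutes, a soak test could call `IsStable(checkpoints, 50L * 1024 * 1024)` to apply the 50MB budget. A fitted slope is less sensitive to a single noisy checkpoint than comparing only the first and last readings.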
| 126 | + |
| 127 | +### CI/CD Integration |
| 128 | + |
| 129 | +**Quick Tests - New job in `.github/workflows/ci.yml`:** |
| 130 | +```yaml |
| 131 | +memory-leak-quick: |
| 132 | + runs-on: ubuntu-latest |
| 133 | + timeout-minutes: 15 |
| 134 | + needs: [build] |
| 135 | + services: |
| 136 | + rabbitmq: |
| 137 | + image: brightercommand/rabbitmq:3.13-management-delay |
| 138 | +``` |
| 139 | +- Runs in parallel with other test jobs (postgres, mysql, etc.) |
| 140 | +- Fails fast if memory leaks detected |
| 141 | +- Uploads `.dmw` snapshots on failure (7-day retention) |
| 142 | + |
| 143 | +**Soak Tests - New workflow `.github/workflows/memory-leak-soak.yml`:** |
| 144 | +- Scheduled: Daily at 2 AM UTC |
| 145 | +- Manual trigger via workflow_dispatch |
| 146 | +- Push to master (on src/, samples/ changes) |
| 147 | +- 90-minute timeout for all soak tests |
| 148 | +- Always upload artifacts (30-day retention) |
| 149 | +- Create GitHub issue on failure |
| 150 | + |
| 151 | +### Package Dependencies |
| 152 | + |
| 153 | +Add to `Directory.Packages.props`: |
| 154 | +- `JetBrains.dotMemoryUnit` version 3.2.20220510 |
| 155 | +- `Microsoft.AspNetCore.Mvc.Testing` version per target framework (8.0/9.0/10.0) |
| 156 | + |
| 157 | +Project references: |
| 158 | +- `samples/WebAPI/WebAPI_EFCore/GreetingsWeb/GreetingsWeb.csproj` |
| 159 | +- `samples/WebAPI/WebAPI_EFCore/SalutationAnalytics/SalutationAnalytics.csproj` |
| 160 | + |
| 161 | +### Test Execution Pattern |
| 162 | + |
| 163 | +**GC Collection Pattern:** |
| 164 | +```csharp |
| 165 | +// Thorough cleanup before assertions |
| 166 | +GC.Collect(); |
| 167 | +GC.WaitForPendingFinalizers(); |
| 168 | +GC.Collect(); |
| 169 | +``` |
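
When dotMemory Unit is not attached, the same collection pattern can still back a coarse leak check using only the base class library. The sketch below uses hypothetical names (`GcProbe`, `Track`, `IsCollected`) and holds an object through a `WeakReference` to test whether it survives a full collection.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class GcProbe
{
    // Full blocking collection, finalizers drained, then a second pass to
    // reclaim objects that were only kept alive by their finalizers.
    public static void ForceFullCollection()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    // Creates the object in a separate, non-inlined frame so that no local in
    // the caller keeps it alive, and returns only a weak reference to it.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static WeakReference Track(Func<object> factory) => new WeakReference(factory());

    // True when nothing holds a strong reference to the tracked object any more.
    public static bool IsCollected(WeakReference reference)
    {
        ForceFullCollection();
        return !reference.IsAlive;
    }
}
```

This reports whether a single instance is still reachable; it does not show what is holding it, which is where the dotMemory snapshots remain necessary.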
| 170 | + |
| 171 | +**Memory Checkpoint Pattern:** |
| 172 | +```csharp |
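| 173 | +// Soak tests: track memory over time (maxGrowthBytes is the scenario's threshold) |
| 174 | +var checkpoint = dotMemory.Check(); // baseline MemoryCheckPoint |
| 175 | +// ... operations ... |
| 176 | +dotMemory.Check(memory => // bytes still held by objects created since the checkpoint |
| 177 | +    Assert.True(memory.GetDifference(checkpoint).GetNewObjects().SizeInBytes < maxGrowthBytes)); |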
| 178 | +``` |
| 179 | + |
| 180 | +**Warmup Phase:** |
| 181 | +All soak tests include warmup to stabilize baseline: |
| 182 | +```csharp |
| 183 | +// Warmup: 100 operations, then establish baseline |
| 184 | +await loadGen.RunLoadAsync(100, 10); |
| 185 | +GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect(); // same full pattern as the GC Collection Pattern |
| 186 | +var baseline = dotMemory.Check(); |
| 187 | +``` |
| 188 | + |
| 189 | +## Consequences |
| 190 | + |
| 191 | +### Positive Consequences |
| 192 | + |
| 193 | +**1. Early Detection** |
| 194 | +- Memory leaks caught during PR review, not in production |
| 195 | +- Contributors get immediate feedback on memory behavior |
| 196 | +- Regression testing prevents reintroduction of fixed leaks |
| 197 | + |
| 198 | +**2. Confidence in Releases** |
| 199 | +- Systematic verification that handlers/contexts are properly disposed |
| 200 | +- Proof that memory remains stable under sustained load |
| 201 | +- Baseline metrics for acceptable memory behavior |
| 202 | + |
| 203 | +**3. Diagnostic Capabilities** |
| 204 | +- `.dmw` workspace files enable offline analysis with dotMemory |
| 205 | +- Memory checkpoints show exactly when/where leaks occur |
| 206 | +- Clear thresholds make failures actionable |
| 207 | + |
| 208 | +**4. Documentation Through Tests** |
| 209 | +- Tests demonstrate correct disposal patterns |
| 210 | +- Examples of proper handler/context lifecycle management |
| 211 | +- Load patterns document expected production behavior |
| 212 | + |
| 213 | +**5. Contributor Experience** |
| 214 | +- Fast feedback loop (5-10 min quick tests) |
| 215 | +- Clear pass/fail criteria |
| 216 | +- No manual profiling required during development |
| 217 | + |
| 218 | +### Negative Consequences |
| 219 | + |
| 220 | +**1. CI Time Increase** |
| 221 | +- Quick tests add 5-10 minutes of runner time to each PR (15-minute job timeout) |
| 222 | +- Soak tests consume up to 90 minutes of runner time per run |
| 223 | +- Mitigation: Quick tests run in parallel with other test jobs; soak tests stay off the PR path (nightly schedule, pushes to master, manual trigger) |
| 224 | + |
| 225 | +**2. Maintenance Overhead** |
| 226 | +- Thresholds may need tuning as code evolves |
| 227 | +- False positives require investigation and threshold adjustment |
| 228 | +- Mitigation: Phase 4 includes threshold tuning and documentation |
| 229 | + |
| 230 | +**3. External Dependency** |
| 231 | +- Requires JetBrains dotMemory Unit package |
| 232 | +- Tests gracefully skip memory assertions if unavailable |
| 233 | +- Free OSS license from JetBrains (need to apply) |
| 234 | + |
| 235 | +**4. Test Flakiness Risk** |
| 236 | +- Memory measurements can vary between runs |
| 237 | +- GC timing is non-deterministic |
| 238 | +- Mitigation: Generous thresholds, multiple GC passes, warmup phases |
| 239 | + |
| 240 | +**5. Learning Curve** |
| 241 | +- Contributors need to understand memory testing concepts |
| 242 | +- More complex than functional testing |
| 243 | +- Mitigation: Good documentation, clear examples, helpful base classes |
| 244 | + |
| 245 | +### Implementation Phases |
| 246 | + |
| 247 | +**Phase 1 (Week 1): Foundation** |
| 248 | +- Create test project structure |
| 249 | +- Implement base classes with dotMemory helpers |
| 250 | +- 2-3 basic quick tests working locally |
| 251 | +- Verification: `dotnet test --filter "Speed=Quick"` passes |
| 252 | + |
| 253 | +**Phase 2 (Week 2): Quick Tests + CI** |
| 254 | +- Complete all 5 quick test scenarios |
| 255 | +- Add memory-leak-quick job to ci.yml |
| 256 | +- Tune thresholds based on real CI data |
| 257 | +- Verification: Green check on sample PR |
| 258 | + |
| 259 | +**Phase 3 (Week 3): Soak Tests** |
| 260 | +- Implement 3 soak test scenarios |
| 261 | +- Create memory-leak-soak.yml workflow |
| 262 | +- Test with workflow_dispatch trigger |
| 263 | +- Verification: Successful nightly run with artifacts |
| 264 | + |
| 265 | +**Phase 4 (Week 4): Production Ready** |
| 266 | +- Tune thresholds from real data |
| 267 | +- Document troubleshooting procedures |
| 268 | +- Add to CONTRIBUTING.md |
| 269 | +- Verification: Week of clean runs |
| 270 | + |
| 271 | +### Migration Notes for Contributors |
| 272 | + |
| 273 | +**Running Tests Locally:** |
| 274 | +```bash |
| 275 | +# Start RabbitMQ first (via Docker); the tests require it |
| 276 | +docker run -d -p 5672:5672 brightercommand/rabbitmq:3.13-management-delay |
| 277 | + |
| 278 | +# Quick tests (5-10 min) |
| 279 | +dotnet test --filter "Category=MemoryLeak&Speed=Quick" |
| 280 | +``` |
| 281 | + |
| 282 | +**Without dotMemory Unit:** |
| 283 | +- Tests will still run but skip memory assertions |
| 284 | +- Useful for quick iteration without profiling |
| 285 | +- CI always has dotMemory Unit available |
| 286 | + |
| 287 | +**Investigating Failures:** |
| 288 | +- Download `.dmw` artifacts from failed CI runs |
| 289 | +- Open in JetBrains dotMemory standalone tool |
| 290 | +- Analyze retained objects and allocation paths |
| 291 | + |
| 292 | +### Future Enhancements (Not in Scope) |
| 293 | + |
| 294 | +1. **Additional Samples**: WebAPI_Dapper, Greetings_Sweeper |
| 295 | +2. **More Transports**: Kafka, Azure Service Bus memory profiles |
| 296 | +3. **Database Variations**: MySQL, PostgreSQL, SQL Server leaks |
| 297 | +4. **BenchmarkDotNet Integration**: Allocation profiling per operation |
| 298 | +5. **Historical Tracking**: Trend analysis over time |
| 299 | +6. **Memory Reports**: HTML reports from .dmw files |
| 300 | + |
| 301 | +### Success Criteria |
| 302 | + |
| 303 | +The memory leak testing infrastructure will be considered successful when: |
| 304 | +1. Quick tests provide feedback within 15 minutes on every PR |
| 305 | +2. Soak tests run reliably every night without false positives |
| 306 | +3. At least one real memory leak is caught before reaching production |
| 307 | +4. Contributors understand how to write and debug memory tests |
| 308 | +5. Memory-related issues decrease in production environments |