Commit 7651606

feat: ADR for a performance testing suite for Brighter
1 parent f92fa0f commit 7651606

1 file changed

+308
-0
lines changed

docs/adr/0036-memory-leak-tests.md

Lines changed: 308 additions & 0 deletions
@@ -0,0 +1,308 @@
1+
# 36. Memory Leak Testing Infrastructure
2+
3+
Date: 2026-01-12
4+
5+
## Status
6+
7+
Proposed
8+
9+
## Context
10+
11+
Brighter has received reports of memory leaks in production environments. These leaks can manifest as:
12+
- Handler instances not being disposed after command/query processing
13+
- DbContext instances accumulating over time
14+
- Message producer connections (RabbitMQ, Kafka) not being properly released
15+
- Outbox/Inbox message processing leaving objects in memory
16+
- ServiceActivator consumer threads holding references to disposed objects
17+
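- Lifetime mismatches between scoped, transient, and singleton registrations (for example, a singleton that captures a scoped dependency and keeps it alive)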
18+
19+
Without systematic memory leak testing, these issues are discovered late in production, making them:
20+
- Difficult to reproduce and diagnose
21+
- Expensive to fix due to emergency response requirements
22+
- Damaging to user trust and system stability
23+
- Time-consuming to track down through manual profiling
24+
25+
Currently, Brighter has no automated memory leak detection in its CI/CD pipeline. The existing test suite covers functional correctness but doesn't verify memory behavior under sustained load or detect gradual memory accumulation.
26+
27+
Manual profiling with tools like dotMemory or PerfView is performed ad-hoc when issues are reported, but this reactive approach means:
28+
- Issues reach production before detection
29+
- Regression testing for memory leaks is manual and inconsistent
30+
- Contributors don't get feedback on memory behavior during development
31+
- No baseline exists for acceptable memory growth patterns
32+
33+
The WebAPI samples in `samples/WebAPI` provide realistic scenarios for testing:
34+
- **WebAPI_EFCore/GreetingsWeb**: REST API using CommandProcessor, Darker queries, EF Core, and message publishing
35+
- **WebAPI_EFCore/SalutationAnalytics**: Message consumer using ServiceActivator with Inbox pattern
36+
37+
These samples exercise the core Brighter patterns that are most susceptible to memory leaks:
38+
- Scoped handler lifetime management
39+
- Database connection pooling
40+
- Message producer/consumer lifecycles
41+
- Transactional outbox/inbox operations
42+
- Background processing (sweeper, consumers)
43+
44+
## Decision
45+
46+
We will create a comprehensive, repeatable memory leak testing infrastructure integrated into GitHub Actions workflows using a two-tier testing strategy:
47+
48+
### Two-Tier Testing Strategy
49+
50+
**1. Quick Tests (5-10 minutes) - Run on every PR**
51+
- Fast memory checks that catch obvious leaks
52+
- Test handler disposal, DbContext lifecycle, connection management
53+
- Run 500-1000 operations to detect immediate leaks
54+
- Provide fast feedback to contributors before merge
55+
- Use strict thresholds: 0 leaked instances, <10MB growth
56+
57+
**2. Soak Tests (30-60 minutes) - Run nightly on main branch**
58+
- Long-running tests under sustained load (10k+ operations)
59+
- Detect gradual memory accumulation over time
60+
- Monitor memory checkpoints every 5 minutes
61+
- Test realistic production scenarios
62+
- Catch subtle leaks that quick tests might miss
63+
64+
### Tooling Decisions
65+
66+
**JetBrains dotMemory Unit** will be used for memory profiling and assertions because:
67+
- Provides explicit memory assertions in unit tests
68+
- Can check for specific leaked object types
69+
- Tracks memory growth over test execution
70+
- Works in CI/CD environments
71+
- Free for open-source projects via JetBrains OSS program
72+
- Generates `.dmw` workspace files for offline analysis
73+
- Degrades gracefully when the profiler is not available: with `FailIfRunWithoutSupport = false`, tests still run but memory assertions are skipped
74+
75+
**WebApplicationFactory** (ASP.NET Core TestServer) will be used for API testing because:
76+
- In-process testing without network overhead
77+
- Full control over configuration and environment
78+
- Faster test execution than external process testing
79+
- Deterministic behavior for reproducible results
80+
- Allows dependency injection customization
81+
82+
**xUnit** will continue as the test framework (existing infrastructure) with trait-based filtering:
83+
- `[Trait("Category", "MemoryLeak")]` - Identifies all memory tests
84+
- `[Trait("Speed", "Quick")]` - Fast tests for PR runs
85+
- `[Trait("Speed", "Soak")]` - Long-running tests for nightly runs
86+
87+
### Test Project Structure
88+
89+
Create new test project: `tests/Paramore.Brighter.MemoryLeak.Tests`
90+
91+
**Infrastructure Layer:**
92+
- `MemoryLeakTestBase.cs` - Base class with dotMemory helper methods
93+
  - `AssertNoLeakedHandlers<T>()` - Verify handlers are disposed
93+
  - `AssertDbContextsDisposed<T>()` - Verify DbContexts are released
94+
  - `AssertMemoryGrowthWithinBounds()` - Check bounded memory growth
96+
- `WebApiTestServer.cs` - WebApplicationFactory wrapper for GreetingsWeb
97+
- `ConsumerTestHost.cs` - IHost wrapper for SalutationAnalytics consumer
98+
- `LoadGenerator.cs` - HTTP load generation with configurable concurrency
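
To make "configurable concurrency" concrete, here is a minimal sketch of how such a load generator might bound the number of in-flight operations. The class name `LoadGeneratorSketch` and its delegate-based constructor are assumptions for illustration; the ADR only names `LoadGenerator.cs` and the call `RunLoadAsync(100, 10)`, and the real class would presumably issue HTTP requests against the test server.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: not the LoadGenerator proposed in this ADR.
// The operation delegate stands in for "send one HTTP request".
public sealed class LoadGeneratorSketch
{
    private readonly Func<int, Task> _operation;

    public LoadGeneratorSketch(Func<int, Task> operation) =>
        _operation = operation ?? throw new ArgumentNullException(nameof(operation));

    // Runs totalOperations invocations with at most `concurrency` in flight at once.
    public async Task RunLoadAsync(int totalOperations, int concurrency)
    {
        if (totalOperations < 0) throw new ArgumentOutOfRangeException(nameof(totalOperations));
        if (concurrency < 1) throw new ArgumentOutOfRangeException(nameof(concurrency));

        using var gate = new SemaphoreSlim(concurrency);
        var tasks = Enumerable.Range(0, totalOperations).Select(async i =>
        {
            await gate.WaitAsync();      // wait for a free slot
            try { await _operation(i); }
            finally { gate.Release(); }  // free the slot even if the operation throws
        }).ToArray();

        await Task.WhenAll(tasks);
    }
}
```

Bounding concurrency with a semaphore keeps the load shape repeatable between runs, which matters when the assertion is a memory threshold rather than a functional result.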
99+
100+
**Quick Tests:**
101+
- `ApiHandlerLifecycleTests.cs` - Handler disposal verification (1000 requests)
102+
- `DbContextLifecycleTests.cs` - DbContext lifecycle checks (500 operations)
103+
- `CommandProcessorMemoryTests.cs` - CommandProcessor memory behavior
104+
- `MessageProducerMemoryTests.cs` - Producer connection management
105+
- `ConsumerBasicMemoryTests.cs` - Consumer handler disposal
106+
107+
**Soak Tests:**
108+
- `ApiUnderLoadTests.cs` - 30 min, 10k+ requests, memory checkpoints
109+
- `ContinuousConsumerTests.cs` - 30 min, 10k+ messages
110+
- `OutboxSweeperLongRunTests.cs` - Background sweeper stability
111+
112+
### Memory Thresholds
113+
114+
**Quick Tests (strict, immediate feedback):**
115+
- Handler instances after GC: 0 (strict - no leaked handlers)
116+
- DbContext instances after GC: 0 (strict - must be disposed)
117+
- Memory growth per 1000 commands: < 5MB
118+
- Memory growth per 500 API requests: < 10MB
119+
- RabbitMQ connection objects: < 50 (pooled connections)
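
The "0 instances after GC" thresholds can be illustrated without a profiler. The sketch below is not the dotMemory Unit approach chosen in this ADR: it tracks a single instance through a `WeakReference` and checks whether it is still reachable after a full collection. The names `LeakProbe`, `Track`, and `IsCollected` are hypothetical.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical, profiler-free sketch of a "was this instance released?" check.
public static class LeakProbe
{
    // Creates and uses the instance inside a non-inlined method, so that no strong
    // reference to it remains on the caller's stack frame after this method returns.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static WeakReference Track(Func<object> factory, Action<object> use)
    {
        var instance = factory();
        use(instance);
        return new WeakReference(instance);
    }

    // Forces a full collection, then reports whether the tracked instance is gone.
    public static bool IsCollected(WeakReference reference)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        return !reference.IsAlive;
    }
}
```

A check like this only covers the instances a test explicitly tracks, whereas a heap snapshot can count every live instance of a type, which is what the strict thresholds above call for.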
120+
121+
**Soak Tests (realistic sustained load):**
122+
- Total memory growth over 30 minutes: < 50MB
123+
- Memory growth per request: < 1KB (average)
124+
- Consumer memory after 10k messages: < 100MB growth
125+
- No unbounded linear growth pattern (stable checkpoints)
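
One way to turn "no unbounded linear growth" into a pass/fail check is to fit a line through the checkpoint values and bound its slope. The sketch below is an assumption about how that could be done, not part of the proposal; `CheckpointTrend` and its members are hypothetical names, and the input is assumed to be heap sizes in bytes sampled at equal intervals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: least-squares trend over equally spaced memory checkpoints.
public static class CheckpointTrend
{
    // Returns the fitted growth in bytes per checkpoint interval.
    public static double SlopeBytesPerCheckpoint(IReadOnlyList<long> checkpoints)
    {
        if (checkpoints == null || checkpoints.Count < 2)
            throw new ArgumentException("Need at least two checkpoints.", nameof(checkpoints));

        int n = checkpoints.Count;
        double meanX = (n - 1) / 2.0;
        double meanY = checkpoints.Average(v => (double)v);

        double covariance = 0, variance = 0;
        for (int i = 0; i < n; i++)
        {
            covariance += (i - meanX) * (checkpoints[i] - meanY);
            variance += (i - meanX) * (i - meanX);
        }
        return covariance / variance;
    }

    // Treats the run as stable when the fitted growth per interval stays under the bound.
    public static bool IsStable(IReadOnlyList<long> checkpoints, double maxSlopeBytes) =>
        SlopeBytesPerCheckpoint(checkpoints) <= maxSlopeBytes;
}
```

A slope bound tolerates a one-off jump (for example, a cache filling during the first interval) better than comparing only the first and last checkpoint, at the cost of one more threshold to tune.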
126+
127+
### CI/CD Integration
128+
129+
**Quick Tests - New job in `.github/workflows/ci.yml`:**
130+
```yaml
memory-leak-quick:
  runs-on: ubuntu-latest
  timeout-minutes: 15
  needs: [build]
  services:
    rabbitmq:
      image: brightercommand/rabbitmq:3.13-management-delay
```
139+
- Runs in parallel with other test jobs (postgres, mysql, etc.)
140+
- Fails fast if memory leaks detected
141+
- Uploads `.dmw` snapshots on failure (7 day retention)
142+
143+
**Soak Tests - New workflow `.github/workflows/memory-leak-soak.yml`:**
144+
- Scheduled: Daily at 2 AM UTC
145+
- Manual trigger via workflow_dispatch
146+
- Push to master (on src/, samples/ changes)
147+
- 90 minute timeout for all soak tests
148+
- Always upload artifacts (30 day retention)
149+
- Create GitHub issue on failure
150+
151+
### Package Dependencies
152+
153+
Add to `Directory.Packages.props`:
154+
- `JetBrains.dotMemoryUnit` version 3.2.20220510
155+
- `Microsoft.AspNetCore.Mvc.Testing` version per target framework (8.0/9.0/10.0)
156+
157+
Project references:
158+
- `samples/WebAPI/WebAPI_EFCore/GreetingsWeb/GreetingsWeb.csproj`
159+
- `samples/WebAPI/WebAPI_EFCore/SalutationAnalytics/SalutationAnalytics.csproj`
160+
161+
### Test Execution Pattern
162+
163+
**GC Collection Pattern:**
164+
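```csharp
// Thorough cleanup before assertions
GC.Collect();                  // collect unreachable objects; finalizable ones are queued
GC.WaitForPendingFinalizers(); // wait for the finalizer thread to drain its queue
GC.Collect();                  // reclaim objects that were only kept alive for finalization
```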
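
As a rough illustration of how the GC pattern feeds a growth threshold, the sketch below measures managed heap growth around a workload using only `GC.GetTotalMemory`. It is a simplified stand-in, not the proposed `AssertMemoryGrowthWithinBounds()` helper, which is expected to rely on dotMemory Unit; `MemoryGrowth`, `Measure`, and `IsWithinBounds` are hypothetical names.

```csharp
using System;

// Hypothetical sketch: managed-heap growth measured with the BCL only.
public static class MemoryGrowth
{
    // Returns the change in managed heap size (bytes) caused by the workload,
    // measured after a full collection on both sides.
    public static long Measure(Action workload)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        workload();

        // Same thorough cleanup as above, so only retained objects are counted.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        long after = GC.GetTotalMemory(forceFullCollection: true);
        return after - before;
    }

    public static bool IsWithinBounds(long growthBytes, long maxGrowthBytes) =>
        growthBytes <= maxGrowthBytes;
}
```

`GC.GetTotalMemory` sees only the managed heap, so native allocations (for example, inside a database or broker client) would not show up in this number.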
170+
171+
**Memory Checkpoint Pattern:**
172+
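```csharp
// Soak tests: track memory over time
// (assumes dotMemory Unit's checkpoint and snapshot-difference API)
var checkpoint = dotMemory.Check();
// ... operations ...
dotMemory.Check(memory =>
{
    var diff = memory.GetDifference(checkpoint);
    // Net growth: bytes in objects created since the checkpoint minus bytes in objects that died
    var growth = diff.GetNewObjects().SizeInBytes - diff.GetDeadObjects().SizeInBytes;
});
```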
179+
180+
**Warmup Phase:**
181+
All soak tests include warmup to stabilize baseline:
182+
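```csharp
// Warmup: 100 operations (the second argument is presumably the concurrency level),
// then run the full GC pattern and establish the baseline
await loadGen.RunLoadAsync(100, 10);
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
var baseline = dotMemory.Check();
```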
188+
189+
## Consequences
190+
191+
### Positive Consequences
192+
193+
**1. Early Detection**
194+
- Memory leaks caught during PR review, not in production
195+
- Contributors get immediate feedback on memory behavior
196+
- Regression testing prevents reintroduction of fixed leaks
197+
198+
**2. Confidence in Releases**
199+
- Systematic verification that handlers/contexts are properly disposed
200+
- Proof that memory remains stable under sustained load
201+
- Baseline metrics for acceptable memory behavior
202+
203+
**3. Diagnostic Capabilities**
204+
- `.dmw` workspace files enable offline analysis with dotMemory
205+
- Memory checkpoints show exactly when/where leaks occur
206+
- Clear thresholds make failures actionable
207+
208+
**4. Documentation Through Tests**
209+
- Tests demonstrate correct disposal patterns
210+
- Examples of proper handler/context lifecycle management
211+
- Load patterns document expected production behavior
212+
213+
**5. Contributor Experience**
214+
- Fast feedback loop (5-10 min quick tests)
215+
- Clear pass/fail criteria
216+
- No manual profiling required during development
217+
218+
### Negative Consequences
219+
220+
**1. CI Time Increase**
221+
- Quick tests add a 5-10 minute job (15 minute timeout) to the PR workflow
221+
- Soak tests consume up to 90 minutes of runner time per run
222+
- Mitigation: Quick tests run in parallel with other test jobs; soak tests stay off the PR path (nightly schedule, manual trigger, pushes to master)
224+
225+
**2. Maintenance Overhead**
226+
- Thresholds may need tuning as code evolves
227+
- False positives require investigation and threshold adjustment
228+
- Mitigation: Phase 4 includes threshold tuning and documentation
229+
230+
**3. External Dependency**
231+
- Requires JetBrains dotMemory Unit package
232+
- Tests gracefully skip memory assertions if unavailable
233+
- Free OSS license from JetBrains (need to apply)
234+
235+
**4. Test Flakiness Risk**
236+
- Memory measurements can vary between runs
237+
- GC timing is non-deterministic
238+
- Mitigation: Generous thresholds, multiple GC passes, warmup phases
239+
240+
**5. Learning Curve**
241+
- Contributors need to understand memory testing concepts
242+
- More complex than functional testing
243+
- Mitigation: Good documentation, clear examples, helpful base classes
244+
245+
### Implementation Phases
246+
247+
**Phase 1 (Week 1): Foundation**
248+
- Create test project structure
249+
- Implement base classes with dotMemory helpers
250+
- 2-3 basic quick tests working locally
251+
- Verification: `dotnet test --filter "Speed=Quick"` passes
252+
253+
**Phase 2 (Week 2): Quick Tests + CI**
254+
- Complete all 5 quick test scenarios
255+
- Add memory-leak-quick job to ci.yml
256+
- Tune thresholds based on real CI data
257+
- Verification: Green check on sample PR
258+
259+
**Phase 3 (Week 3): Soak Tests**
260+
- Implement 3 soak test scenarios
261+
- Create memory-leak-soak.yml workflow
262+
- Test with workflow_dispatch trigger
263+
- Verification: Successful nightly run with artifacts
264+
265+
**Phase 4 (Week 4): Production Ready**
266+
- Tune thresholds from real data
267+
- Document troubleshooting procedures
268+
- Add to CONTRIBUTING.md
269+
- Verification: Week of clean runs
270+
271+
### Migration Notes for Contributors
272+
273+
**Running Tests Locally:**
274+
```bash
# Start RabbitMQ first (via Docker); the tests require it
docker run -d -p 5672:5672 brightercommand/rabbitmq:3.13-management-delay

# Quick tests (5-10 min)
dotnet test --filter "Category=MemoryLeak&Speed=Quick"
```
281+
282+
**Without dotMemory Unit:**
283+
- Tests will still run but skip memory assertions
284+
- Useful for quick iteration without profiling
285+
- CI always has dotMemory Unit available
286+
287+
**Investigating Failures:**
288+
- Download `.dmw` artifacts from failed CI runs
289+
- Open in JetBrains dotMemory standalone tool
290+
- Analyze retained objects and allocation paths
291+
292+
### Future Enhancements (Not in Scope)
293+
294+
1. **Additional Samples**: WebAPI_Dapper, Greetings_Sweeper
295+
2. **More Transports**: Kafka, Azure Service Bus memory profiles
296+
3. **Database Variations**: MySQL, PostgreSQL, SQL Server leaks
297+
4. **BenchmarkDotNet Integration**: Allocation profiling per operation
298+
5. **Historical Tracking**: Trend analysis over time
299+
6. **Memory Reports**: HTML reports from .dmw files
300+
301+
### Success Criteria
302+
303+
The memory leak testing infrastructure will be considered successful when:
304+
1. Quick tests provide feedback within 15 minutes on every PR
305+
2. Soak tests run reliably every night without false positives
306+
3. At least one real memory leak is caught before reaching production
307+
4. Contributors understand how to write and debug memory tests
308+
5. Memory-related issues decrease in production environments
