|
| 1 | +# Grizabella Memory Leak Fix - Comprehensive Connection Management Design |
| 2 | + |
| 3 | +## Problem Analysis |
| 4 | + |
| 5 | +### Root Cause Identification |
| 6 | + |
| 7 | +Based on the codebase analysis, the memory leak and threading issues stem from several interconnected problems: |
| 8 | + |
| 9 | +1. **Multiple Connection Instances**: Each TypeScript client connection creates new GrizabellaDBManager instances |
| 10 | +2. **KuzuAdapter Threading Issues**: KuzuDB connections are not thread-safe and create lock files/WAL files that aren't properly cleaned up |
| 11 | +3. **MCP Server Resource Leaks**: The MCP server creates singleton Grizabella instances but doesn't properly manage their lifecycle |
| 12 | +4. **No Connection Pooling**: Database adapters are created per-request rather than reused |
| 13 | +5. **Improper Cleanup**: Connection cleanup is inconsistent across different code paths |
| 14 | + |
| 15 | +### Memory Leak Symptoms |
| 16 | + |
| 17 | +- Over a dozen threads spawned for Grizabella process |
| 18 | +- Uncontrollable memory growth leading to system crashes |
| 19 | +- Lock files and WAL files accumulating in Kuzu database directories |
| 20 | +- MCP server connections not being properly terminated |
| 21 | + |
| 22 | +## Solution Architecture |
| 23 | + |
| 24 | +### Overview |
| 25 | + |
| 26 | +The solution implements a comprehensive connection management system with: |
| 27 | +- Thread-safe connection pooling |
| 28 | +- Singleton pattern for database managers |
| 29 | +- Proper resource lifecycle management |
| 30 | +- Memory and thread monitoring |
| 31 | +- Enhanced error handling and recovery |
| 32 | + |
| 33 | +### Core Components |
| 34 | + |
| 35 | +#### 1. Connection Pool Manager |
| 36 | + |
| 37 | +```python |
| 38 | +class ConnectionPoolManager: |
| 39 | + """Thread-safe connection pool for database adapters""" |
| 40 | + |
| 41 | + def __init__(self, max_connections: int = 10): |
| 42 | + self._pools = { |
| 43 | + 'sqlite': Queue(maxsize=max_connections), |
| 44 | + 'lancedb': Queue(maxsize=max_connections), |
| 45 | + 'kuzu': Queue(maxsize=max_connections) |
| 46 | + } |
| 47 | + self._lock = threading.RLock() |
| 48 | + self._connection_count = defaultdict(int) |
| 49 | + |
| 50 | + async def get_connection(self, adapter_type: str, **kwargs): |
| 51 | + """Get a connection from the pool or create a new one""" |
| 52 | + |
| 53 | + async def return_connection(self, adapter_type: str, connection): |
| 54 | + """Return a connection to the pool""" |
| 55 | + |
| 56 | + async def cleanup_all(self): |
| 57 | + """Clean up all connections in the pool""" |
| 58 | +``` |
| 59 | + |
| 60 | +#### 2. Singleton DBManager Factory |
| 61 | + |
| 62 | +```python |
| 63 | +class DBManagerFactory: |
| 64 | + """Factory for managing singleton GrizabellaDBManager instances""" |
| 65 | + |
| 66 | + _instances = {} |
| 67 | + _lock = threading.RLock() |
| 68 | + |
| 69 | + @classmethod |
| 70 | + def get_manager(cls, db_path: str, **kwargs) -> GrizabellaDBManager: |
| 71 | + """Get or create a singleton DBManager for the given database path""" |
| 72 | + |
| 73 | + @classmethod |
| 74 | + def cleanup_manager(cls, db_path: str): |
| 75 | + """Clean up a specific DBManager instance""" |
| 76 | + |
| 77 | + @classmethod |
| 78 | + def cleanup_all(cls): |
| 79 | + """Clean up all DBManager instances""" |
| 80 | +``` |
| 81 | + |
| 82 | +#### 3. Thread-Safe KuzuAdapter |
| 83 | + |
| 84 | +```python |
| 85 | +class ThreadSafeKuzuAdapter(BaseDBAdapter): |
| 86 | + """Thread-safe Kuzu adapter with proper connection isolation""" |
| 87 | + |
| 88 | + def __init__(self, db_path: str, config: Optional[dict] = None): |
| 89 | + self._local = threading.local() |
| 90 | + self._db_path = db_path |
| 91 | + self._config = config or {} |
| 92 | + self._lock = threading.RLock() |
| 93 | + |
| 94 | + @property |
| 95 | + def conn(self): |
| 96 | + """Get thread-local connection""" |
| 97 | + if not hasattr(self._local, 'conn'): |
| 98 | + self._local.conn = self._create_connection() |
| 99 | + return self._local.conn |
| 100 | + |
| 101 | + def _create_connection(self): |
| 102 | + """Create a new Kuzu connection with proper cleanup""" |
| 103 | + |
| 104 | + def close(self): |
| 105 | + """Close thread-local connection""" |
| 106 | +``` |
| 107 | + |
| 108 | +#### 4. Enhanced MCP Server Lifecycle |
| 109 | + |
| 110 | +```python |
| 111 | +class MCPServerManager: |
| 112 | + """Enhanced MCP server with proper resource management""" |
| 113 | + |
| 114 | + def __init__(self): |
| 115 | + self._grizabella_client = None |
| 116 | + self._shutdown_handlers = [] |
| 117 | + self._monitoring_thread = None |
| 118 | + |
| 119 | + async def start_server(self, db_path: str): |
| 120 | + """Start server with proper initialization""" |
| 121 | + |
| 122 | + async def shutdown_server(self): |
| 123 | + """Graceful shutdown with resource cleanup""" |
| 124 | + |
| 125 | + def _setup_monitoring(self): |
| 126 | + """Setup memory and thread monitoring""" |
| 127 | +``` |
| 128 | + |
| 129 | +#### 5. TypeScript Client Connection Management |
| 130 | + |
| 131 | +```typescript |
| 132 | +class ConnectionManager { |
| 133 | + private _connectionPool: Map<string, MCPClient> = new Map(); |
| 134 | + private _reconnectTimers: Map<string, NodeJS.Timeout> = new Map(); |
| 135 | + private _connectionState: Map<string, ConnectionState> = new Map(); |
| 136 | + |
| 137 | + async getConnection(config: GrizabellaClientConfig): Promise<MCPClient> { |
| 138 | + const key = this._getConnectionKey(config); |
| 139 | + |
| 140 | + if (this._connectionPool.has(key)) { |
| 141 | + const client = this._connectionPool.get(key)!; |
| 142 | + if (client.isConnected()) { |
| 143 | + return client; |
| 144 | + } |
| 145 | + this._cleanupConnection(key); |
| 146 | + } |
| 147 | + |
| 148 | + return this._createConnection(key, config); |
| 149 | + } |
| 150 | + |
| 151 | + private async _createConnection(key: string, config: GrizabellaClientConfig): Promise<MCPClient> { |
| 152 | + const client = new MCPClient(config); |
| 153 | + await client.connect(); |
| 154 | + this._connectionPool.set(key, client); |
| 155 | + this._connectionState.set(key, ConnectionState.CONNECTED); |
| 156 | + return client; |
| 157 | + } |
| 158 | + |
| 159 | + async cleanupAll(): Promise<void> { |
| 160 | + for (const [key, client] of this._connectionPool) { |
| 161 | + await this._cleanupConnection(key); |
| 162 | + } |
| 163 | + } |
| 164 | +} |
| 165 | +``` |
| 166 | + |
| 167 | +## Implementation Plan |
| 168 | + |
| 169 | +### Phase 1: Core Infrastructure |
| 170 | + |
| 171 | +1. **Connection Pool Manager Implementation** |
| 172 | + - Thread-safe queue-based pooling |
| 173 | + - Connection health checks |
| 174 | + - Automatic cleanup of idle connections |
| 175 | + |
| 176 | +2. **DBManager Factory** |
| 177 | + - Singleton pattern with thread safety |
| 178 | + - Reference counting for shared instances |
| 179 | + - Graceful shutdown procedures |
| 180 | + |
| 181 | +3. **Enhanced KuzuAdapter** |
| 182 | + - Thread-local storage for connections |
| 183 | + - Proper lock file management |
| 184 | + - WAL file cleanup on connection close |
| 185 | + |
| 186 | +### Phase 2: MCP Server Enhancements |
| 187 | + |
| 188 | +1. **Server Lifecycle Management** |
| 189 | + - Proper initialization and shutdown sequences |
| 190 | + - Resource cleanup handlers |
| 191 | + - Graceful error recovery |
| 192 | + |
| 193 | +2. **Memory Monitoring** |
| 194 | + - Thread tracking |
| 195 | + - Memory usage monitoring |
| 196 | + - Alert system for resource leaks |
| 197 | + |
| 198 | +### Phase 3: Client-Side Improvements |
| 199 | + |
| 200 | +1. **TypeScript Connection Manager** |
| 201 | + - Connection pooling |
| 202 | + - Auto-reconnect logic |
| 203 | + - Connection state management |
| 204 | + |
| 205 | +2. **Enhanced Error Handling** |
| 206 | + - Connection failure recovery |
| 207 | + - Exponential backoff for retries |
| 208 | + - Circuit breaker pattern |
| 209 | + |
| 210 | +### Phase 4: Testing and Documentation |
| 211 | + |
| 212 | +1. **Comprehensive Test Suite** |
| 213 | + - Unit tests for all components |
| 214 | + - Integration tests for connection lifecycle |
| 215 | + - Load testing for memory leak validation |
| 216 | + |
| 217 | +2. **Documentation Updates** |
| 218 | + - Connection management best practices |
| 219 | + - Troubleshooting guide |
| 220 | + - API documentation updates |
| 221 | + |
| 222 | +## Mermaid Architecture Diagram |
| 223 | + |
| 224 | +```mermaid |
| 225 | +graph TB |
| 226 | + subgraph "TypeScript Client Layer" |
| 227 | + TC[TypeScript Client] --> CM[Connection Manager] |
| 228 | + CM --> CP[Connection Pool] |
| 229 | + CM --> RC[Reconnect Handler] |
| 230 | + end |
| 231 | + |
| 232 | + subgraph "MCP Server Layer" |
| 233 | + MS[MCP Server] --> SM[Server Manager] |
| 234 | + SM --> GF[Grizabella Factory] |
| 235 | + SM --> MM[Memory Monitor] |
| 236 | + end |
| 237 | + |
| 238 | + subgraph "Database Management Layer" |
| 239 | + GF --> DF[DBManager Factory] |
| 240 | + DF --> DM[DBManager Instance] |
| 241 | + DM --> CPM[Connection Pool Manager] |
| 242 | + end |
| 243 | + |
| 244 | + subgraph "Adapter Layer" |
| 245 | + CPM --> SA[SQLite Adapter] |
| 246 | + CPM --> LA[LanceDB Adapter] |
| 247 | + CPM --> TKA[Thread-Safe Kuzu Adapter] |
| 248 | + end |
| 249 | + |
| 250 | + subgraph "Database Layer" |
| 251 | + SA --> DB[(SQLite DB)] |
| 252 | + LA --> LD[(LanceDB)] |
| 253 | + TKA --> KD[(KuzuDB)] |
| 254 | + end |
| 255 | + |
| 256 | + TC -.->|HTTP/Stdio| MS |
| 257 | + CM -.->|Connection Lifecycle| SM |
| 258 | + GF -.->|Singleton Management| DF |
| 259 | +``` |
| 260 | + |
| 261 | +## Benefits of This Solution |
| 262 | + |
| 263 | +1. **Memory Leak Prevention**: Proper resource cleanup and connection pooling |
| 264 | +2. **Thread Safety**: Thread-local connections and proper synchronization |
| 265 | +3. **Scalability**: Connection pooling allows better resource utilization |
| 266 | +4. **Reliability**: Enhanced error handling and automatic recovery |
| 267 | +5. **Monitoring**: Built-in memory and thread tracking |
| 268 | +6. **Maintainability**: Clear separation of concerns and modular design |
| 269 | + |
| 270 | +## Backward Compatibility |
| 271 | + |
| 272 | +The solution maintains backward compatibility by: |
| 273 | +- Preserving existing API interfaces |
| 274 | +- Adding new optional parameters for connection management |
| 275 | +- Providing migration path for existing code |
| 276 | +- Supporting both old and new connection patterns during transition |
| 277 | + |
| 278 | +## Performance Considerations |
| 279 | + |
| 280 | +- Connection pooling reduces connection overhead |
| 281 | +- Thread-local storage minimizes contention |
| 282 | +- Lazy initialization reduces startup time |
| 283 | +- Health checks prevent using stale connections |
| 284 | +- Resource limits prevent resource exhaustion |
| 285 | + |
| 286 | +## Security Considerations |
| 287 | + |
| 288 | +- Connection isolation prevents data leakage between threads |
| 289 | +- Proper cleanup prevents sensitive data retention |
| 290 | +- Resource limits prevent DoS attacks |
| 291 | +- Monitoring helps detect unusual activity patterns |
0 commit comments