Commit 5c63a2c

Fix threading errors / leaks -> bump version to 0.4.0
1 parent ed607b0 commit 5c63a2c

22 files changed: +8449 / -1937 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -12,3 +12,5 @@ memory_profile.log
 .vscode/settings.json
 /*.log
 typescript/test-embedding-*
+ctx_test_db/
+test_db/

docs/implementation_specification.md

Lines changed: 985 additions & 0 deletions
Large diffs are not rendered by default.

docs/memory_leak_fix_design.md

Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
# Grizabella Memory Leak Fix - Comprehensive Connection Management Design

## Problem Analysis

### Root Cause Identification

Based on the codebase analysis, the memory leak and threading issues stem from several interconnected problems:

1. **Multiple Connection Instances**: Each TypeScript client connection creates a new GrizabellaDBManager instance
2. **KuzuAdapter Threading Issues**: KuzuDB connections are not thread-safe and create lock files/WAL files that aren't properly cleaned up
3. **MCP Server Resource Leaks**: The MCP server creates singleton Grizabella instances but doesn't properly manage their lifecycle
4. **No Connection Pooling**: Database adapters are created per-request rather than reused
5. **Improper Cleanup**: Connection cleanup is inconsistent across different code paths

### Memory Leak Symptoms

- Over a dozen threads spawned for the Grizabella process
- Uncontrollable memory growth leading to system crashes
- Lock files and WAL files accumulating in Kuzu database directories
- MCP server connections not being properly terminated

## Solution Architecture

### Overview

The solution implements a comprehensive connection management system with:
- Thread-safe connection pooling
- Singleton pattern for database managers
- Proper resource lifecycle management
- Memory and thread monitoring
- Enhanced error handling and recovery

### Core Components

#### 1. Connection Pool Manager

```python
import threading
from collections import defaultdict
from queue import Queue


class ConnectionPoolManager:
    """Thread-safe connection pool for database adapters"""

    def __init__(self, max_connections: int = 10):
        # One bounded queue of idle connections per adapter type
        self._pools = {
            'sqlite': Queue(maxsize=max_connections),
            'lancedb': Queue(maxsize=max_connections),
            'kuzu': Queue(maxsize=max_connections)
        }
        self._lock = threading.RLock()
        self._connection_count = defaultdict(int)

    async def get_connection(self, adapter_type: str, **kwargs):
        """Get a connection from the pool or create a new one"""

    async def return_connection(self, adapter_type: str, connection):
        """Return a connection to the pool"""

    async def cleanup_all(self):
        """Clean up all connections in the pool"""
```
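
The skeleton above leaves the method bodies open. As a rough illustration of the queue-based pooling it describes, the following synchronous sketch reuses idle connections and caps how many are created per adapter type. `SimplePool`, its `factories` argument, and the `timeout` parameter are assumptions made for this example and are not part of the Grizabella codebase; the design's own methods are declared `async`, which this sketch does not try to reproduce.

```python
import queue
import threading
from collections import defaultdict
from typing import Any, Callable, Dict


class SimplePool:
    """Bounded pool of reusable connections, keyed by adapter type (illustrative)."""

    def __init__(self, factories: Dict[str, Callable[[], Any]], max_connections: int = 10):
        self._factories = factories          # hypothetical: adapter type -> constructor
        self._max = max_connections
        self._idle = {name: queue.Queue(maxsize=max_connections) for name in factories}
        self._lock = threading.Lock()
        self._created = defaultdict(int)     # live connections per adapter type

    def get_connection(self, adapter_type: str, timeout: float = 5.0) -> Any:
        idle = self._idle[adapter_type]
        try:
            return idle.get_nowait()         # reuse an idle connection when available
        except queue.Empty:
            pass
        with self._lock:
            may_create = self._created[adapter_type] < self._max
            if may_create:
                self._created[adapter_type] += 1
        if may_create:
            try:
                return self._factories[adapter_type]()
            except Exception:
                with self._lock:
                    self._created[adapter_type] -= 1   # creation failed, free the slot
                raise
        # At capacity: block until another caller returns a connection.
        return idle.get(timeout=timeout)     # raises queue.Empty on timeout

    def return_connection(self, adapter_type: str, connection: Any) -> None:
        self._idle[adapter_type].put_nowait(connection)

    def cleanup_all(self) -> None:
        """Close every idle connection; checked-out connections are not touched."""
        for name, idle in self._idle.items():
            while True:
                try:
                    conn = idle.get_nowait()
                except queue.Empty:
                    break
                conn.close()
                with self._lock:
                    self._created[name] -= 1
```

Capping creation under a lock while keeping idle connections in a `queue.Queue` means a caller either reuses a connection, creates one within the limit, or waits; it never grows the pool without bound, which is the failure mode listed under the leak symptoms.
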

#### 2. Singleton DBManager Factory

```python
import threading


class DBManagerFactory:
    """Factory for managing singleton GrizabellaDBManager instances"""

    # GrizabellaDBManager is the existing manager class in the Grizabella codebase;
    # the return annotation below is quoted so this skeleton loads without importing it.
    _instances = {}
    _lock = threading.RLock()

    @classmethod
    def get_manager(cls, db_path: str, **kwargs) -> "GrizabellaDBManager":
        """Get or create a singleton DBManager for the given database path"""

    @classmethod
    def cleanup_manager(cls, db_path: str):
        """Clean up a specific DBManager instance"""

    @classmethod
    def cleanup_all(cls):
        """Clean up all DBManager instances"""
```
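
One way to picture the per-path singleton with reference counting (mentioned later in the implementation plan) is the sketch below. `RefCountedFactory`, `acquire`, and `release` are hypothetical names chosen for this example, and the managed object is assumed to expose a `close()` method; none of this is taken from the actual `DBManagerFactory` implementation.

```python
import os
import threading
from typing import Any, Callable, Dict, List


class RefCountedFactory:
    """One shared instance per database path, closed when the last user releases it."""

    def __init__(self, create: Callable[[str], Any]):
        self._create = create                      # hypothetical constructor hook
        self._instances: Dict[str, List[Any]] = {}  # path -> [instance, refcount]
        self._lock = threading.RLock()

    def acquire(self, db_path: str) -> Any:
        key = os.path.abspath(db_path)             # "db" and "./db" share one entry
        with self._lock:
            entry = self._instances.get(key)
            if entry is None:
                entry = [self._create(key), 0]
                self._instances[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, db_path: str) -> bool:
        """Drop one reference. Returns True if the instance was closed."""
        key = os.path.abspath(db_path)
        with self._lock:
            entry = self._instances.get(key)
            if entry is None:
                return False
            entry[1] -= 1
            if entry[1] > 0:
                return False
            del self._instances[key]
        entry[0].close()                           # close outside the lock
        return True
```

Normalizing the path before using it as a key is a deliberate choice here: without it, two spellings of the same directory would yield two managers for one database, which is the duplication the factory is meant to prevent.
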

#### 3. Thread-Safe KuzuAdapter

```python
import threading
from typing import Optional


# BaseDBAdapter is the existing adapter base class in the Grizabella codebase
class ThreadSafeKuzuAdapter(BaseDBAdapter):
    """Thread-safe Kuzu adapter with proper connection isolation"""

    def __init__(self, db_path: str, config: Optional[dict] = None):
        self._local = threading.local()  # holds one connection per thread
        self._db_path = db_path
        self._config = config or {}
        self._lock = threading.RLock()

    @property
    def conn(self):
        """Get thread-local connection"""
        if not hasattr(self._local, 'conn'):
            self._local.conn = self._create_connection()
        return self._local.conn

    def _create_connection(self):
        """Create a new Kuzu connection with proper cleanup"""

    def close(self):
        """Close thread-local connection"""
```
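
A point the skeleton leaves implicit is that `threading.local()` alone gives no way to close connections owned by threads other than the caller. The sketch below, which uses a stand-in `connect` callable rather than a real Kuzu connection, keeps a registry next to the thread-local slot so a shutdown path can close everything. `PerThreadConnections` and `close_all` are names invented for this illustration.

```python
import threading
from typing import Any, Callable, List


class PerThreadConnections:
    """Lazily creates one connection per thread and tracks all of them for cleanup."""

    def __init__(self, connect: Callable[[], Any]):
        self._connect = connect            # hypothetical factory, e.g. opens a DB handle
        self._local = threading.local()
        self._lock = threading.Lock()
        self._all: List[Any] = []          # every live connection, across threads

    @property
    def conn(self) -> Any:
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = self._connect()
            self._local.conn = conn
            with self._lock:
                self._all.append(conn)
        return conn

    def close(self) -> None:
        """Close only the calling thread's connection, if it has one."""
        conn = getattr(self._local, "conn", None)
        if conn is None:
            return
        self._local.conn = None
        with self._lock:
            if conn in self._all:
                self._all.remove(conn)
        conn.close()

    def close_all(self) -> None:
        """Close connections from every thread; intended for shutdown only."""
        with self._lock:
            conns, self._all = self._all, []
        for conn in conns:
            conn.close()
```

After `close_all()`, other threads may still hold a reference to a closed connection in their thread-local slot, so in this sketch it is only safe to call once worker threads have stopped.
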

#### 4. Enhanced MCP Server Lifecycle

```python
class MCPServerManager:
    """Enhanced MCP server with proper resource management"""

    def __init__(self):
        self._grizabella_client = None
        self._shutdown_handlers = []
        self._monitoring_thread = None

    async def start_server(self, db_path: str):
        """Start server with proper initialization"""

    async def shutdown_server(self):
        """Graceful shutdown with resource cleanup"""

    def _setup_monitoring(self):
        """Setup memory and thread monitoring"""
```
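
What `_setup_monitoring` might do is not spelled out, so the following is only a guess at its shape: a daemon thread that periodically samples the live thread count and Python-level memory and reports when a threshold is crossed. `ResourceMonitor`, the `max_threads` default of 12 (borrowed from the "over a dozen threads" symptom), and the `on_alert` callback are all assumptions. Note that `tracemalloc` only sees allocations made by Python itself, not memory held by native database libraries.

```python
import threading
import tracemalloc
from typing import Callable, List, Optional, Tuple


class ResourceMonitor:
    """Samples thread count and traced Python memory on a background thread."""

    def __init__(self, interval: float = 1.0, max_threads: int = 12,
                 on_alert: Callable[[str], None] = print):
        self._interval = interval
        self._max_threads = max_threads    # assumed threshold, tune per deployment
        self._on_alert = on_alert
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.samples: List[Tuple[int, int]] = []   # (threads, traced_bytes)

    def sample(self) -> Tuple[int, int]:
        threads = threading.active_count()
        traced_bytes, _peak = tracemalloc.get_traced_memory()  # (0, 0) if not tracing
        self.samples.append((threads, traced_bytes))
        if threads > self._max_threads:
            self._on_alert(f"thread count {threads} exceeds limit {self._max_threads}")
        return threads, traced_bytes

    def start(self) -> None:
        if not tracemalloc.is_tracing():
            tracemalloc.start()
        self._thread = threading.Thread(
            target=self._run, name="resource-monitor", daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # Event.wait doubles as the sleep, so stop() takes effect promptly.
        while not self._stop.wait(self._interval):
            self.sample()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Using an `Event` for both the sleep and the stop signal lets the monitor itself shut down cleanly, so it does not become one more leaked thread.
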

#### 5. TypeScript Client Connection Management

```typescript
class ConnectionManager {
  private _connectionPool: Map<string, MCPClient> = new Map();
  private _reconnectTimers: Map<string, NodeJS.Timeout> = new Map();
  private _connectionState: Map<string, ConnectionState> = new Map();

  async getConnection(config: GrizabellaClientConfig): Promise<MCPClient> {
    const key = this._getConnectionKey(config);

    if (this._connectionPool.has(key)) {
      const client = this._connectionPool.get(key)!;
      if (client.isConnected()) {
        return client;
      }
      // Stale entry: finish cleanup before opening a replacement connection
      await this._cleanupConnection(key);
    }

    return this._createConnection(key, config);
  }

  private async _createConnection(key: string, config: GrizabellaClientConfig): Promise<MCPClient> {
    const client = new MCPClient(config);
    await client.connect();
    this._connectionPool.set(key, client);
    this._connectionState.set(key, ConnectionState.CONNECTED);
    return client;
  }

  async cleanupAll(): Promise<void> {
    // Iterate over a snapshot of the keys, since cleanup removes entries
    for (const key of [...this._connectionPool.keys()]) {
      await this._cleanupConnection(key);
    }
  }
}
```

## Implementation Plan

### Phase 1: Core Infrastructure

1. **Connection Pool Manager Implementation**
   - Thread-safe queue-based pooling
   - Connection health checks
   - Automatic cleanup of idle connections

2. **DBManager Factory**
   - Singleton pattern with thread safety
   - Reference counting for shared instances
   - Graceful shutdown procedures

3. **Enhanced KuzuAdapter**
   - Thread-local storage for connections
   - Proper lock file management
   - WAL file cleanup on connection close

### Phase 2: MCP Server Enhancements

1. **Server Lifecycle Management**
   - Proper initialization and shutdown sequences
   - Resource cleanup handlers
   - Graceful error recovery

2. **Memory Monitoring**
   - Thread tracking
   - Memory usage monitoring
   - Alert system for resource leaks

### Phase 3: Client-Side Improvements

1. **TypeScript Connection Manager**
   - Connection pooling
   - Auto-reconnect logic
   - Connection state management

2. **Enhanced Error Handling**
   - Connection failure recovery
   - Exponential backoff for retries
   - Circuit breaker pattern
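
The retry and circuit-breaker items in Phase 3 are named but not specified. The sketch below shows the general technique in Python, to match the server-side code in this document, although the client-side version would be written in TypeScript. The function and class names, the default delays, and the choice of `ConnectionError` as the retryable failure are all assumptions for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(op: Callable[[], T], attempts: int = 5, base: float = 0.5,
                       factor: float = 2.0, cap: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry op on ConnectionError, waiting base * factor**n (capped) between tries."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                       # out of attempts: surface the last error
            sleep(min(cap, base * factor ** attempt))
    raise ValueError("attempts must be at least 1")


class CircuitBreaker:
    """Rejects calls for reset_after seconds once failure_threshold failures pile up."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, op: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Otherwise half-open: let this one call through as a trial.
        try:
            result = op()
        except ConnectionError:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

Backoff spaces out retries against a struggling server, while the breaker stops issuing them altogether for a while; combining the two keeps a failing client from spawning ever more connection attempts.
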

### Phase 4: Testing and Documentation

1. **Comprehensive Test Suite**
   - Unit tests for all components
   - Integration tests for connection lifecycle
   - Load testing for memory leak validation

2. **Documentation Updates**
   - Connection management best practices
   - Troubleshooting guide
   - API documentation updates

## Mermaid Architecture Diagram

```mermaid
graph TB
    subgraph "TypeScript Client Layer"
        TC[TypeScript Client] --> CM[Connection Manager]
        CM --> CP[Connection Pool]
        CM --> RC[Reconnect Handler]
    end

    subgraph "MCP Server Layer"
        MS[MCP Server] --> SM[Server Manager]
        SM --> GF[Grizabella Factory]
        SM --> MM[Memory Monitor]
    end

    subgraph "Database Management Layer"
        GF --> DF[DBManager Factory]
        DF --> DM[DBManager Instance]
        DM --> CPM[Connection Pool Manager]
    end

    subgraph "Adapter Layer"
        CPM --> SA[SQLite Adapter]
        CPM --> LA[LanceDB Adapter]
        CPM --> TKA[Thread-Safe Kuzu Adapter]
    end

    subgraph "Database Layer"
        SA --> DB[(SQLite DB)]
        LA --> LD[(LanceDB)]
        TKA --> KD[(KuzuDB)]
    end

    TC -.->|HTTP/Stdio| MS
    CM -.->|Connection Lifecycle| SM
    GF -.->|Singleton Management| DF
```

## Benefits of This Solution

1. **Memory Leak Prevention**: Proper resource cleanup and connection pooling
2. **Thread Safety**: Thread-local connections and proper synchronization
3. **Scalability**: Connection pooling allows better resource utilization
4. **Reliability**: Enhanced error handling and automatic recovery
5. **Monitoring**: Built-in memory and thread tracking
6. **Maintainability**: Clear separation of concerns and modular design

## Backward Compatibility

The solution maintains backward compatibility by:
- Preserving existing API interfaces
- Adding new optional parameters for connection management
- Providing a migration path for existing code
- Supporting both old and new connection patterns during the transition

## Performance Considerations

- Connection pooling reduces connection overhead
- Thread-local storage minimizes contention
- Lazy initialization reduces startup time
- Health checks prevent the use of stale connections
- Resource limits prevent resource exhaustion

## Security Considerations

- Connection isolation prevents data leakage between threads
- Proper cleanup prevents sensitive data retention
- Resource limits help mitigate DoS attacks
- Monitoring helps detect unusual activity patterns

docs/user_guide/troubleshooting_faq.md

Lines changed: 75 additions & 0 deletions
@@ -75,3 +75,78 @@ This section provides solutions to common problems and answers to frequently ask

* **Q: Where can I report bugs or ask for help?**
    * A: Please report bugs, suggest features, or ask for help by creating an issue on our GitHub repository: [`https://github.com/pwilkin/grizabella/issues`](https://github.com/pwilkin/grizabella/issues) (Note: This is a placeholder URL).


## Connection Management Best Practices

### Resource Management

* **Always use context managers or proper cleanup:**
    * When using the Python API, use the `with` statement or explicitly call cleanup methods to ensure resources are properly released:
        ```python
        from grizabella.api.client import Grizabella

        # Recommended approach using context manager
        with Grizabella(db_name_or_path="my_db") as client:
            # Perform operations
            client.create_object_type(...)
        # Resources automatically cleaned up when exiting the context
        ```
    * For long-running applications, make sure to call cleanup methods explicitly when shutting down.

* **Connection pooling:**
    * Grizabella implements connection pooling to efficiently manage database connections.
    * The connection pool automatically manages idle connections and cleans them up after a configurable timeout period.
    * Connection pools are shared across managers for the same database type and path.

* **Singleton pattern for DB managers:**
    * DB managers are implemented using a singleton pattern with reference counting.
    * Multiple requests for the same database path will return the same manager instance.
    * Managers are automatically cleaned up when all references are released.

### Memory Management

* **Preventing memory leaks:**
    * Always release DB managers when no longer needed by calling the appropriate cleanup methods.
    * The system implements automatic cleanup on process shutdown, but explicit cleanup is recommended.
    * Monitor memory usage during long-running operations to detect potential leaks early.

* **Resource monitoring:**
    * Grizabella includes a resource monitoring dashboard accessible via the web interface for real-time monitoring of CPU, memory, connections, and threads.
    * Use the monitoring tools to track resource usage patterns and identify potential issues.

### Threading and Concurrency

* **Thread-safe operations:**
    * Database adapters are designed to be thread-safe, with a separate connection per thread for the Kùzu adapter.
    * Avoid sharing connection objects across threads directly; use the provided connection management instead.

* **Concurrent access patterns:**
    * Multiple threads can safely access the same database through the connection pool.
    * The system handles concurrent access efficiently while maintaining data integrity.

### Connection Lifecycle

* **Proper initialization:**
    * Always initialize the client with appropriate configuration including timeouts and retry settings.
    * Use the factory pattern to create and manage DB managers properly.

* **Graceful shutdown:**
    * Implement proper shutdown handlers that clean up all resources.
    * The system includes signal handlers for SIGINT and SIGTERM to ensure graceful shutdown.
    * All connections and resources are automatically cleaned up during shutdown.
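
For applications that need their own shutdown handling on top of what the system provides, a common pattern is to let the signal handler do nothing but set a flag and run the actual cleanup from the main thread. The sketch below illustrates that pattern with the standard `signal` module; `ShutdownCoordinator` and its methods are hypothetical names and are not part of the Grizabella API.

```python
import signal
import threading
from typing import Callable, List


class ShutdownCoordinator:
    """Runs registered cleanup callbacks once, in reverse registration order."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[], None]] = []
        self._lock = threading.Lock()
        self._done = False
        self.requested = threading.Event()   # set when SIGINT/SIGTERM arrives

    def register(self, handler: Callable[[], None]) -> None:
        self._handlers.append(handler)

    def install(self, signals=(signal.SIGINT, signal.SIGTERM)) -> None:
        # signal.signal may only be called from the main thread.
        for sig in signals:
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        # Keep the handler minimal: just flag the request.
        self.requested.set()

    def run_cleanup(self) -> None:
        with self._lock:
            if self._done:
                return                       # idempotent: second call is a no-op
            self._done = True
        for handler in reversed(self._handlers):
            try:
                handler()
            except Exception as exc:         # one failing handler must not block the rest
                print(f"cleanup handler failed: {exc!r}")
```

A main loop would typically call `install()`, block on `requested.wait()`, and then call `run_cleanup()`. Running handlers in reverse order mirrors how resources were acquired, so a client is closed before the pool it depends on.
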

### Troubleshooting Connection Issues

* **Connection timeouts:**
    * If you experience connection timeouts, increase the timeout values in the configuration.
    * Check that the database files are accessible and not locked by another process.

* **Too many open files:**
    * This error indicates that too many connections are being held open simultaneously.
    * Ensure that connections are being properly returned to the pool or closed after use.
    * Consider adjusting the maximum connection pool size if needed for your use case.

* **Database locking issues:**
    * SQLite databases can experience locking issues under highly concurrent write operations.
    * Consider using appropriate transaction management and connection pooling to reduce lock contention.
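
As a generic illustration of the transaction-management advice for SQLite, the sketch below uses only the standard `sqlite3` module: a busy timeout so writers wait instead of failing immediately, write-ahead logging so readers are not blocked by a writer, and short transactions scoped with `with conn:`. Whether Grizabella's SQLite adapter exposes these settings is not stated here; the helper names and the `items` table are made up for the example.

```python
import sqlite3
from typing import Iterable


def open_sqlite(path: str, busy_timeout_s: float = 5.0) -> sqlite3.Connection:
    """Open a connection that waits on locks and uses write-ahead logging."""
    # timeout: how long to wait for a lock before raising "database is locked"
    conn = sqlite3.connect(path, timeout=busy_timeout_s)
    # WAL lets readers proceed while a single writer is active (file databases only)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def write_batch(conn: sqlite3.Connection, names: Iterable[str]) -> None:
    """Insert rows in one short transaction to keep the write lock brief."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO items(name) VALUES (?)",   # hypothetical table
            [(name,) for name in names],
        )
```

Batching writes into one transaction holds the write lock once rather than once per row, which is usually the cheapest way to reduce contention before reaching for larger changes.
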
