|
| 1 | +# CLI Hanging Fix - Comprehensive Solution |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Fix the CLI hanging issue by addressing task completion detection, MCP disposal, and process exit enforcement. |
| 6 | + |
| 7 | +## Problem Statement |
| 8 | + |
| 9 | +The CLI hangs after processing batch commands and never exits naturally, requiring users to manually terminate with Ctrl+C. This affects user experience and prevents automated usage. |
| 10 | + |
| 11 | +## Root Cause Analysis |
| 12 | + |
| 13 | +### 1. Task Completion Detection Issues |
| 14 | + |
| 15 | +- BatchProcessor relies on event-driven completion detection |
| 16 | +- Simple Q&A tasks may not emit expected completion events |
| 17 | +- Complex timeout logic has edge cases |
| 18 | +- No fallback for non-tool response scenarios |
| 19 | + |
| 20 | +### 2. MCP Server Disposal Problems |
| 21 | + |
| 22 | +- Stdio child processes (dotnet, npx) don't terminate cleanly |
| 23 | +- 3-second disposal timeout insufficient for some processes |
| 24 | +- No force-kill mechanism for stubborn processes |
| 25 | +- Incomplete cleanup of event listeners and resources |
| 26 | + |
| 27 | +### 3. Event Loop Prevention |
| 28 | + |
| 29 | +- Timers, intervals, and event listeners keep process alive |
| 30 | +- No guaranteed exit mechanism |
| 31 | +- Async operations may hang indefinitely |
| 32 | +- Missing force-exit as last resort |
| 33 | + |
| 34 | +## Solution Architecture |
| 35 | + |
| 36 | +### Phase 1: Task Completion Detection Enhancement |
| 37 | + |
| 38 | +**File: `src/cli/commands/batch.ts`** |
| 39 | + |
| 40 | +#### Changes: |
| 41 | + |
| 42 | +1. **Response-based Completion Detection** |
| 43 | + |
| 44 | + - Detect completion when meaningful response is generated |
| 45 | + - Don't rely solely on task events for simple Q&A |
| 46 | + - Add completion heuristics for different response types |
| 47 | + |
| 48 | +2. **Improved Timeout Logic** |
| 49 | + |
| 50 | + - Simplify sliding timeout implementation |
| 51 | + - Add response completion triggers |
| 52 | + - Better handling of non-interactive responses |
| 53 | + |
| 54 | +3. **Fallback Completion** |
| 55 | + - Detect when response generation completes |
| 56 | + - Set maximum response time limits |
| 57 | + - Auto-complete for information queries |
| 58 | + |
| 59 | +#### Implementation: |
| 60 | + |
| 61 | +```typescript |
| 62 | +// Enhanced completion detection |
| 63 | +private detectResponseCompletion(response: string): boolean { |
| 64 | + // Detect complete responses for information queries |
| 65 | + const informationIndicators = [ |
| 66 | + 'Available MCP servers', |
| 67 | + 'Connected MCP servers', |
| 68 | + 'No MCP servers', |
| 69 | + 'I have access to' |
| 70 | + ]; |
| 71 | + |
| 72 | + return informationIndicators.some(indicator => |
| 73 | + response.toLowerCase().includes(indicator.toLowerCase()) |
| 74 | + ); |
| 75 | +} |
| 76 | + |
| 77 | +// Response-triggered completion |
| 78 | +private setupResponseCompletionDetection(task: Task): void { |
| 79 | + let responseBuffer = ''; |
| 80 | + |
| 81 | + task.on('message', (event) => { |
| 82 | + if (event.action === 'response') { |
| 83 | + responseBuffer += event.message?.text || ''; |
| 84 | + |
| 85 | + // Check if response appears complete |
| 86 | + if (this.detectResponseCompletion(responseBuffer)) { |
| 87 | + // Trigger completion after brief delay |
| 88 | + setTimeout(() => { |
| 89 | + this.completeTask('Response completion detected'); |
| 90 | + }, 1000); |
| 91 | + } |
| 92 | + } |
| 93 | + }); |
| 94 | +} |
| 95 | +``` |
| 96 | + |
| 97 | +### Phase 2: MCP Disposal Enhancement |
| 98 | + |
| 99 | +**Files: `src/cli/services/CLIMcpService.ts`, `src/cli/connections/BaseMcpConnection.ts`** |
| 100 | + |
| 101 | +#### Changes: |
| 102 | + |
| 103 | +1. **Force Process Termination** |
| 104 | + |
| 105 | + - Add SIGTERM followed by SIGKILL for stdio processes |
| 106 | + - Increase disposal timeout to 5 seconds |
| 107 | + - Track child process PIDs for forced cleanup |
| 108 | + |
| 109 | +2. **Enhanced Connection Cleanup** |
| 110 | + |
| 111 | + - Properly close all transport connections |
| 112 | + - Clear all event listeners and timers |
| 113 | + - Force-dispose stubborn connections |
| 114 | + |
| 115 | +3. **Disposal Monitoring** |
| 116 | + - Log disposal progress for debugging |
| 117 | + - Track which processes are hanging |
| 118 | + - Provide disposal status feedback |
| 119 | + |
| 120 | +#### Implementation: |
| 121 | + |
| 122 | +```typescript |
| 123 | +// Enhanced stdio process termination |
| 124 | +async forceTerminateStdioProcess(): Promise<void> { |
| 125 | + const transport = this.transport as StdioClientTransport; |
| 126 | + const childProcess = transport.process; |
| 127 | + |
| 128 | + if (childProcess && !childProcess.killed) { |
| 129 | + // Graceful termination first |
| 130 | + childProcess.kill('SIGTERM'); |
| 131 | + |
| 132 | + // Wait up to 2 seconds for graceful exit |
| 133 | + await new Promise((resolve) => { |
| 134 | + const timeout = setTimeout(() => { |
| 135 | + // Force kill if still running |
| 136 | + if (!childProcess.killed) { |
| 137 | + childProcess.kill('SIGKILL'); |
| 138 | + } |
| 139 | + resolve(void 0); |
| 140 | + }, 2000); |
| 141 | + |
| 142 | + childProcess.on('exit', () => { |
| 143 | + clearTimeout(timeout); |
| 144 | + resolve(void 0); |
| 145 | + }); |
| 146 | + }); |
| 147 | + } |
| 148 | +} |
| 149 | + |
| 150 | +// Enhanced disposal with force cleanup |
| 151 | +async dispose(): Promise<void> { |
| 152 | + const disposalTimeout = 5000; // 5 seconds |
| 153 | + |
| 154 | + const disposalPromise = Promise.allSettled([ |
| 155 | + ...Array.from(this.connections.keys()).map(id => |
| 156 | + this.forceDisconnectServer(id) |
| 157 | + ) |
| 158 | + ]); |
| 159 | + |
| 160 | + // Race disposal against timeout |
| 161 | + const timeoutPromise = new Promise((_, reject) => |
| 162 | + setTimeout(() => reject(new Error('Disposal timeout')), disposalTimeout) |
| 163 | + ); |
| 164 | + |
| 165 | + try { |
| 166 | + await Promise.race([disposalPromise, timeoutPromise]); |
| 167 | + } catch (error) { |
| 168 | + console.warn('Disposal timeout, forcing cleanup...'); |
| 169 | + // Force cleanup will happen in finally block |
| 170 | + } finally { |
| 171 | + // Clear all remaining references |
| 172 | + this.connections.clear(); |
| 173 | + this.healthCheckers.clear(); |
| 174 | + this.isDisposed = true; |
| 175 | + } |
| 176 | +} |
| 177 | +``` |
| 178 | +
|
| 179 | +### Phase 3: Process Exit Enforcement |
| 180 | +
|
| 181 | +**File: `src/cli/index.ts`** |
| 182 | +
|
| 183 | +#### Changes: |
| 184 | +
|
| 185 | +1. **Guaranteed Exit Mechanism** |
| 186 | +
|
| 187 | + - Force exit after maximum operation time |
| 188 | + - Clean shutdown with fallback to process.exit() |
| 189 | + - Exit code management for different scenarios |
| 190 | +
|
| 191 | +2. **Event Loop Cleanup** |
| 192 | +
|
| 193 | + - Clear all timers and intervals |
| 194 | + - Remove event listeners |
| 195 | + - Close remaining handles |
| 196 | +
|
| 197 | +3. **Exit Coordination** |
| 198 | + - Coordinate between task completion and MCP disposal |
| 199 | + - Ensure proper cleanup order |
| 200 | + - Log exit process for debugging |
| 201 | +
|
| 202 | +#### Implementation: |
| 203 | +
|
| 204 | +```typescript |
| 205 | +// Global exit enforcement |
| 206 | +private setupGlobalExitEnforcement(): void { |
| 207 | + const maxExecutionTime = 60000; // 1 minute max for any operation |
| 208 | + |
| 209 | + const forceExitTimer = setTimeout(() => { |
| 210 | + console.warn('Force exiting due to execution timeout'); |
| 211 | + process.exit(1); |
| 212 | + }, maxExecutionTime); |
| 213 | + |
| 214 | + // Clear timer on normal completion |
| 215 | + process.on('beforeExit', () => { |
| 216 | + clearTimeout(forceExitTimer); |
| 217 | + }); |
| 218 | +} |
| 219 | + |
| 220 | +// Enhanced cleanup sequence |
| 221 | +private async performComprehensiveCleanup(): Promise<void> { |
| 222 | + const cleanupTasks = [ |
| 223 | + this.disposeMcpServices(), |
| 224 | + this.clearGlobalTimers(), |
| 225 | + this.removeEventListeners(), |
| 226 | + this.closeOpenHandles() |
| 227 | + ]; |
| 228 | + |
| 229 | + // Execute cleanup with timeout |
| 230 | + try { |
| 231 | + await Promise.race([ |
| 232 | + Promise.allSettled(cleanupTasks), |
| 233 | + new Promise((_, reject) => |
| 234 | + setTimeout(() => reject(new Error('Cleanup timeout')), 5000) |
| 235 | + ) |
| 236 | + ]); |
| 237 | + } catch (error) { |
| 238 | + console.warn('Cleanup timeout, forcing exit'); |
| 239 | + } finally { |
| 240 | + // Guarantee exit |
| 241 | + setTimeout(() => process.exit(0), 1000); |
| 242 | + } |
| 243 | +} |
| 244 | +``` |
| 245 | +
|
| 246 | +### Phase 4: Integration and Testing |
| 247 | +
|
| 248 | +**Files: Multiple** |
| 249 | +
|
| 250 | +#### Changes: |
| 251 | +
|
| 252 | +1. **Coordinated Shutdown** |
| 253 | +
|
| 254 | + - Integrate all cleanup phases |
| 255 | + - Proper error handling and logging |
| 256 | + - Graceful degradation for cleanup failures |
| 257 | +
|
| 258 | +2. **Debug Logging** |
| 259 | +
|
| 260 | + - Add comprehensive debug logging for exit process |
| 261 | + - Track hanging operations |
| 262 | + - Identify remaining event loop items |
| 263 | +
|
| 264 | +3. **Testing Framework** |
| 265 | + - Add tests for hanging scenarios |
| 266 | + - Verify clean exit in various conditions |
| 267 | + - Performance testing for disposal operations |
| 268 | +
|
| 269 | +## Implementation Steps |
| 270 | +
|
| 271 | +### Step 1: Task Completion Enhancement |
| 272 | +
|
| 273 | +- [ ] Implement response-based completion detection |
| 274 | +- [ ] Add completion heuristics for information queries |
| 275 | +- [ ] Simplify timeout logic with fallback completion |
| 276 | +- [ ] Test with various question types |
| 277 | +
|
| 278 | +### Step 2: MCP Disposal Enhancement |
| 279 | +
|
| 280 | +- [ ] Add force termination for stdio processes |
| 281 | +- [ ] Implement comprehensive connection cleanup |
| 282 | +- [ ] Increase disposal timeouts and add monitoring |
| 283 | +- [ ] Test with multiple MCP server types |
| 284 | +
|
| 285 | +### Step 3: Process Exit Enforcement |
| 286 | +
|
| 287 | +- [ ] Implement guaranteed exit mechanisms |
| 288 | +- [ ] Add event loop cleanup utilities |
| 289 | +- [ ] Coordinate shutdown sequence |
| 290 | +- [ ] Add comprehensive debug logging |
| 291 | +
|
| 292 | +### Step 4: Integration Testing |
| 293 | +
|
| 294 | +- [ ] Test complete fix with original hanging scenario |
| 295 | +- [ ] Verify clean exit in various conditions |
| 296 | +- [ ] Performance test disposal operations |
| 297 | +- [ ] Add automated tests for regression prevention |
| 298 | +
|
| 299 | +## Acceptance Criteria |
| 300 | +
|
| 301 | +### Functional Requirements |
| 302 | +
|
| 303 | +- [ ] CLI exits cleanly after batch command completion |
| 304 | +- [ ] No hanging processes or child processes remain |
| 305 | +- [ ] All MCP servers disconnect properly |
| 306 | +- [ ] Exit codes reflect operation success/failure |
| 307 | +
|
| 308 | +### Performance Requirements |
| 309 | +
|
| 310 | +- [ ] Total exit time < 10 seconds under normal conditions |
| 311 | +- [ ] Force exit triggers within 1 minute maximum |
| 312 | +- [ ] MCP disposal completes within 5 seconds |
| 313 | +- [ ] No memory leaks or resource leaks |
| 314 | +
|
| 315 | +### Reliability Requirements |
| 316 | +
|
| 317 | +- [ ] 100% exit success rate for batch commands |
| 318 | +- [ ] Graceful handling of disposal failures |
| 319 | +- [ ] Proper cleanup even when services hang |
| 320 | +- [ ] Comprehensive error logging and debugging info |
| 321 | +
|
| 322 | +## Testing Strategy |
| 323 | +
|
| 324 | +### Unit Tests |
| 325 | +
|
| 326 | +- Task completion detection logic |
| 327 | +- MCP disposal mechanisms |
| 328 | +- Process cleanup utilities |
| 329 | +- Timeout and fallback logic |
| 330 | +
|
| 331 | +### Integration Tests |
| 332 | +
|
| 333 | +- End-to-end batch command execution |
| 334 | +- Multiple MCP server scenarios |
| 335 | +- Error condition handling |
| 336 | +- Resource cleanup verification |
| 337 | +
|
| 338 | +### Regression Tests |
| 339 | +
|
| 340 | +- Original hanging scenario |
| 341 | +- Various MCP server configurations |
| 342 | +- Different task types and responses |
| 343 | +- Timeout and error conditions |
| 344 | +
|
| 345 | +## Risk Mitigation |
| 346 | +
|
| 347 | +### Potential Issues |
| 348 | +
|
| 349 | +- Force-killing processes may cause data loss |
| 350 | +- Aggressive timeouts might interrupt legitimate operations |
| 351 | +- Changes might affect interactive mode behavior |
| 352 | +
|
| 353 | +### Mitigation Strategies |
| 354 | +
|
| 355 | +- Graduated escalation (graceful -> force -> kill) |
| 356 | +- Separate timeout values for different operations |
| 357 | +- Extensive testing in both batch and interactive modes |
| 358 | +- Rollback plan for compatibility issues |
| 359 | +
|
| 360 | +## Success Metrics |
| 361 | +
|
| 362 | +### Primary Metrics |
| 363 | +
|
| 364 | +- CLI hanging incidents: 0% |
| 365 | +- Clean exit success rate: 100% |
| 366 | +- User-reported hanging issues: 0 |
| 367 | +
|
| 368 | +### Secondary Metrics |
| 369 | +
|
| 370 | +- Average exit time: < 3 seconds |
| 371 | +- MCP disposal success rate: > 99% |
| 372 | +- Resource cleanup success: 100% |
| 373 | +
|
| 374 | +This comprehensive fix addresses all identified root causes while maintaining system reliability and user experience. |
0 commit comments