Query.close() can hang indefinitely causing 100% CPU usage due to missing timeout on task group cleanup #378

@dbelgrod

Bug Description

The Query.close() method in _internal/query.py can hang indefinitely
if tasks in the anyio task group don't properly respond to cancellation.
This causes anyio's _deliver_cancellation() to spin at 100% CPU.

Affected Code

src/claude_code_sdk/_internal/query.py lines 550-558 (v0.1.10):

async def close(self) -> None:
    """Close the query and transport."""
    self._closed = True
    if self._tg:
        self._tg.cancel_scope.cancel()
        # Wait for task group to complete cancellation
        with suppress(anyio.get_cancelled_exc_class()):
            await self._tg.__aexit__(None, None, None)  # ⚠️ NO TIMEOUT
    await self.transport.close()

Root Cause

Line 557, await self._tg.__aexit__(None, None, None), has no timeout. If any
task in the task group does not respond to cancellation (e.g., it is stuck
in I/O or waiting on an external resource), this call hangs indefinitely.

When this happens:
1. anyio's _deliver_cancellation() runs in a busy loop trying to deliver
the cancellation
2. This consumes 100%+ CPU indefinitely
3. The caller's event loop becomes unresponsive
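
The retry behaviour described above can be imitated in plain asyncio. This is only a rough sketch of the failure mode, not anyio's actual implementation: stubborn and count_cancel_attempts are hypothetical names, and the task is assumed to swallow exactly three cancellation requests so that the example terminates.

```python
import asyncio


async def stubborn(swallowed: list) -> None:
    """Hypothetical task that swallows its first three cancellation requests."""
    while len(swallowed) < 3:
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            swallowed.append(1)  # bug being modeled: cancellation is ignored


async def count_cancel_attempts() -> int:
    """Re-issue cancel() on every loop iteration until the task finally exits."""
    swallowed: list = []
    task = asyncio.create_task(stubborn(swallowed))
    await asyncio.sleep(0)  # let the task reach its first await
    attempts = 0
    while not task.done():
        task.cancel()
        attempts += 1
        await asyncio.sleep(0)  # yield once so the task can react
    return attempts


if __name__ == "__main__":
    print("cancel attempts:", asyncio.run(count_cancel_attempts()))
```

The loop needs at least three attempts here because the task absorbs three requests. A task that never stops swallowing them would keep such a loop spinning without end, which is consistent with the CPU symptom reported in this issue.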

How We Discovered This

We're using ClaudeSDKClient in a JupyterLab extension. After calling
client.disconnect() (which calls Query.close()), the Python process would
sometimes spike to 100-150% CPU usage.

Using py-spy profiling, we found:
- _deliver_cancellation from anyio/_backends/_asyncio.py consuming 66% of CPU
- current_task from asyncio/tasks.py consuming 35% of CPU

Suggested Fix

Add a timeout to the task group cleanup:

async def close(self) -> None:
    """Close the query and transport."""
    self._closed = True
    if self._tg:
        self._tg.cancel_scope.cancel()
        with suppress(anyio.get_cancelled_exc_class()):
            try:
                # Add timeout to prevent indefinite hang
                with anyio.fail_after(5.0):
                    await self._tg.__aexit__(None, None, None)
            except TimeoutError:
                logger.warning("Task group cleanup timed out after 5s")
    await self.transport.close()
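
For comparison, here is a sketch of the same "cancel, then wait for a bounded time" idea in plain asyncio rather than anyio. It is not the SDK's code: cancel_and_wait, stubborn, and demo are hypothetical names, and the 0.1 s timeout is arbitrary. It relies on asyncio.wait, which returns the (done, pending) sets when the timeout expires instead of raising or cancelling anything, so the waiter regains control even if a task ignores cancellation.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def cancel_and_wait(tasks: set, timeout: float) -> set:
    """Cancel every task, wait at most `timeout` seconds, return the stragglers."""
    for task in tasks:
        task.cancel()
    if not tasks:
        return set()  # asyncio.wait() rejects an empty collection
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    if pending:
        logger.warning("%d task(s) ignored cancellation", len(pending))
    return pending


async def stubborn(release: asyncio.Event) -> None:
    """Hypothetical task that swallows cancellation until `release` is set."""
    while not release.is_set():
        try:
            await release.wait()
        except asyncio.CancelledError:
            pass  # bug being modeled: cancellation is ignored


async def demo() -> bool:
    release = asyncio.Event()
    good = asyncio.create_task(asyncio.sleep(3600))  # honors cancellation
    bad = asyncio.create_task(stubborn(release))
    await asyncio.sleep(0)  # let both tasks start
    pending = await cancel_and_wait({good, bad}, timeout=0.1)
    only_bad_left = bad in pending and good not in pending
    release.set()  # let the stubborn task exit so the loop can shut down
    await bad
    return only_bad_left


if __name__ == "__main__":
    print("only the stubborn task was left pending:", asyncio.run(demo()))
```

The trade-off is that a task left in the pending set is still running, so a caller would need to log it or otherwise account for it instead of assuming cleanup completed.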

Workaround

Callers can wrap disconnect() with a timeout:

try:
    await asyncio.wait_for(client.disconnect(), timeout=5.0)
except asyncio.TimeoutError:
    logger.warning("Client disconnect timed out")
finally:
    client = None
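
A self-contained version of this workaround, for experimenting without the SDK, might look like the sketch below. FakeClient and safe_disconnect are hypothetical names; FakeClient only models a disconnect() that takes a configurable time. One caveat from the asyncio documentation applies: on timeout, asyncio.wait_for cancels the awaitable and then waits for that cancellation to finish, so the total wait can exceed the timeout if disconnect() itself is slow to react to cancellation.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeClient:
    """Hypothetical stand-in for ClaudeSDKClient; only disconnect() is modeled."""

    def __init__(self, delay: float) -> None:
        self.delay = delay

    async def disconnect(self) -> None:
        # Simulates cleanup that takes `delay` seconds but honors cancellation.
        await asyncio.sleep(self.delay)


async def safe_disconnect(client, timeout: float = 5.0) -> bool:
    """Return True if disconnect() finished in time, False if it timed out."""
    try:
        await asyncio.wait_for(client.disconnect(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.warning("Client disconnect timed out after %.2fs", timeout)
        return False


if __name__ == "__main__":
    print(asyncio.run(safe_disconnect(FakeClient(0.0), timeout=0.5)))
    print(asyncio.run(safe_disconnect(FakeClient(10.0), timeout=0.05)))
```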

Environment

- claude-agent-sdk version: 0.1.10
- anyio version: 4.11.0
- Python: 3.11.4
- Platform: Linux (Amazon ECS/Fargate)
