PROVIDER_DATA_VAR context leak in asyncio.create_task #5221

@jaideepr97

Description

System Info

Collecting environment information...
PyTorch version: 2.10.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 26.3.1 (arm64)
GCC version: Could not collect
Clang version: 17.0.0 (clang-1700.6.4.2)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.9 (v3.11.9:de54cf5be3, Apr  2 2024, 07:12:50) [Clang 13.0.0 (clang-1300.0.29.30)] (64-bit runtime)
Python platform: macOS-26.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==2.4.2
[pip3] torch==2.10.0
[conda] Could not collect

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

In the description of #5168, @iamemilio highlights an issue with asyncio.create_task: it automatically copies all context vars into long-running background workers at task-creation time, leading to incorrect/misleading OTEL traces that linger long after the initial request has completed.

In addition to OTEL traces, PROVIDER_DATA_VAR, which contains sensitive authentication/authorization token information, is also copied into these workers' memories and is not flushed when the write queue is enabled. This leads to a leak that compromises user isolation for any stored resources: users may access each other's stored conversations and responses.

The following description was generated by Claude:

How it works step by step
Step 1: Worker creation inherits the wrong user

When the first request arrives, the middleware sets PROVIDER_DATA_VAR with that user's identity:

server.py, line 304

       with request_provider_data_context(headers, user):

Deep inside the request handler, store_chat_completion lazily creates worker tasks:

inference_store.py Lines 101-103

if not self._worker_tasks:
    loop = asyncio.get_running_loop()
    task = loop.create_task(self._worker_loop())

asyncio.create_task copies all contextvars -- including PROVIDER_DATA_VAR -- so the worker permanently inherits User A's identity. The same applies to the responses background worker.
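The inheritance described above can be reproduced in isolation. In this minimal sketch, PROVIDER_DATA_VAR is a stand-in ContextVar, not the project's actual variable; the point is only that the task snapshots the context at creation time and keeps it forever:

```python
import asyncio
import contextvars

# Hypothetical stand-in for the real PROVIDER_DATA_VAR.
PROVIDER_DATA_VAR = contextvars.ContextVar("provider_data", default=None)

async def worker_loop(results):
    # The task copied the context when create_task() ran, so this read
    # returns whatever PROVIDER_DATA_VAR held at that moment.
    results.append(PROVIDER_DATA_VAR.get())

async def main():
    results = []
    PROVIDER_DATA_VAR.set("user-A-token")   # set during User A's request
    task = asyncio.create_task(worker_loop(results))
    PROVIDER_DATA_VAR.set("user-B-token")   # a later request changes the var
    await task
    return results

print(asyncio.run(main()))  # → ['user-A-token'] — the worker kept User A
```

The second `set()` only updates the caller's own context copy; the already-created task never sees it.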

Step 2: Every DB write stamps the worker's (wrong) user

When the worker writes to the database, it goes through AuthorizedSqlStore:

authorized_sqlstore.py Lines 132-140

async def insert(self, table: str, data: Mapping[str, Any] | Sequence[Mapping[str, Any]]) -> None:
    """Insert a row or batch of rows with automatic access control attribute capture."""
    current_user = get_authenticated_user()
    # ...
    enhanced_data = _enhance_item_with_access_control(data, current_user)
    await self.sql_store.insert(table, enhanced_data)

get_authenticated_user() reads PROVIDER_DATA_VAR from the current task's context, which holds the first user's identity regardless of who made the request. So owner_principal is stamped incorrectly.

The responses worker's update also overwrites owner_principal:

authorized_sqlstore.py Lines 222-235

async def update(self, table: str, data: Mapping[str, Any], where: Mapping[str, Any]) -> None:
    """Update rows with automatic access control attribute capture."""
    enhanced_data = dict(data)
    current_user = get_authenticated_user()
    if current_user:
        enhanced_data["owner_principal"] = current_user.principal
        enhanced_data["access_attributes"] = current_user.attributes
    # ...

The response was created with the correct user during the synchronous part of the request, but when the background worker updates its status to "completed" or "failed", it overwrites owner_principal with the first user's identity.
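The full leak can be simulated end to end. This is a sketch, not the project's real code: the names mirror the issue (a lazily created queue worker, a `get_authenticated_user` that reads the contextvar) but are all hypothetical:

```python
import asyncio
import contextvars

# Hypothetical stand-ins; names mirror the issue, not the real code.
PROVIDER_DATA_VAR = contextvars.ContextVar("provider_data", default=None)

def get_authenticated_user():
    return PROVIDER_DATA_VAR.get()

async def worker_loop(queue, db):
    while True:
        item = await queue.get()
        if item is None:
            break
        # Stamps the owner from the *worker's* context, not the requester's.
        db.append({"data": item, "owner_principal": get_authenticated_user()})

async def handle_request(user, payload, queue, state):
    PROVIDER_DATA_VAR.set(user)
    if state["worker"] is None:  # lazy creation, as in inference_store.py
        state["worker"] = asyncio.create_task(worker_loop(queue, state["db"]))
    await queue.put(payload)

async def main():
    queue, state = asyncio.Queue(), {"worker": None, "db": []}
    await handle_request("user-A", "completion-1", queue, state)
    await handle_request("user-B", "completion-2", queue, state)
    await queue.put(None)          # sentinel to stop the worker
    await state["worker"]
    return state["db"]

rows = asyncio.run(main())
print(rows)  # both rows are stamped "user-A", including User B's completion
```

Because the worker was created during User A's request, every row it ever writes carries User A's identity.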

Existing tests don't catch this for two reasons:

  • The write queue is disabled for SQLite, which most unit and integration tests use. This forces synchronous writes, which see the correct PROVIDER_DATA_VAR values; the bug only manifests on Postgres.
  • Existing user isolation tests only exercise the synchronous path, not the write queue path.
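A regression test for the write-queue path could capture the requester's identity at enqueue time and assert that each row keeps it. This is only a sketch of such a test; the helpers (`queue_worker`, `request`) are illustrative, not the project's real fixtures:

```python
import asyncio
import contextvars

# Hypothetical stand-in for the real provider-data ContextVar.
PROVIDER_DATA_VAR = contextvars.ContextVar("provider_data", default=None)

async def queue_worker(queue, db):
    # The worker trusts the identity captured at enqueue time,
    # never its own inherited context.
    while (item := await queue.get()) is not None:
        payload, owner = item
        db.append({"payload": payload, "owner_principal": owner})

async def request(user, payload, queue):
    PROVIDER_DATA_VAR.set(user)
    # Capture the requester's identity here, in the request's own context.
    await queue.put((payload, PROVIDER_DATA_VAR.get()))

async def run_scenario():
    queue, db = asyncio.Queue(), []
    worker = asyncio.create_task(queue_worker(queue, db))
    await request("user-A", "completion-A", queue)
    await request("user-B", "completion-B", queue)
    await queue.put(None)  # sentinel to stop the worker
    await worker
    return db

db = asyncio.run(run_scenario())
assert [row["owner_principal"] for row in db] == ["user-A", "user-B"]
```

Run against the current code path (where the worker calls get_authenticated_user() itself), the equivalent assertion fails with both rows owned by user-A.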

When does it actually matter?
The leak is harmful when all three conditions are true:

  • Authentication is enabled -- otherwise get_authenticated_user() returns None everywhere, all records are "unowned/public", and there's no identity to leak
  • Access control policies are configured (like user is owner) -- otherwise all users can see all records anyway
  • The write queue is active -- Postgres for the inference store (always for the responses background worker)

In the default configuration (no auth, SQLite), the bug is completely invisible. In a production multi-tenant deployment with Postgres and auth -- which is exactly the deployment most likely to care about isolation -- it's a real data isolation violation where User A can see User B's completions/responses.

Error logs

N/A

Expected behavior

Context inheritance should be fixed so that each worker starts with a clean context, and user identity does not persist in memory beyond the scope of the request that spawned the worker.

Labels: bug