Description
Summary
PR #5168 fixed OTel context leaks in the InferenceStore and OpenAIResponsesImpl background workers by introducing create_task_with_detached_otel_context. During review, @jaideepr97 identified three additional asyncio.create_task call sites in openai_vector_store_mixin.py that may have the same bug — background tasks inheriting and permanently retaining the creating request's OTel trace context.
Potentially affected locations
All in src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py:
- Line ~394: resuming a file batch on startup:
  task = asyncio.create_task(self._process_file_batch_async(batch_id, batch_info, remaining_files))
- Line ~1317: starting background processing of a new file batch:
  task = asyncio.create_task(self._process_file_batch_async(batch_id, batch_info))
- Line ~1327: running throttled cleanup of expired file batches:
  asyncio.create_task(self._cleanup_expired_file_batches())
Expected behavior
Each background task should either:
- Start with a detached OTel context (using create_task_with_detached_otel_context), or
- Carry per-item context from the originating request (using capture_otel_context/activate_otel_context)
so that spans are attributed to the correct request trace and don't leak across unrelated requests.
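The two options above can be sketched with the standard library alone. This is an illustration of the idea, not the implementation in llama_stack/core/task.py: current_trace is a hypothetical stand-in for the active OTel context, and create_detached_task, record_trace, and process_item are made-up names whose signatures are assumptions.

```python
import asyncio
import contextvars

# Hypothetical stand-in for the active OTel trace; not the OTel API.
current_trace = contextvars.ContextVar("current_trace", default=None)


def create_detached_task(coro):
    # Create the task from inside an empty Context. A task copies the context
    # that is current at creation, so its copy holds no request state.
    return contextvars.Context().run(asyncio.create_task, coro)


async def record_trace(seen):
    seen.append(current_trace.get())


async def process_item(item_trace, seen):
    # Per-item option: activate the trace captured with the work item,
    # then restore the previous value once the item is done.
    token = current_trace.set(item_trace)
    try:
        await record_trace(seen)
    finally:
        current_trace.reset(token)


async def main():
    seen = []
    current_trace.set("trace-A")  # pretend a request is in flight
    # Plain create_task: the background task inherits "trace-A".
    await asyncio.create_task(record_trace(seen))
    # Detached task: sees only the default value.
    await create_detached_task(record_trace(seen))
    # Detached task that activates a per-item trace while processing.
    await create_detached_task(process_item("trace-B", seen))
    return seen


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch the plain task records the request's trace, the detached task records nothing, and the per-item task records only the trace carried by the item it is processing.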
Steps to reproduce
- Enable OTel tracing (e.g. export to Jaeger)
- Send concurrent requests that trigger vector store file batch operations
- Inspect traces in Jaeger — look for inflated trace durations or spans from unrelated requests appearing under a single trace
Additional context
- See PR #5168 (fix: prevent OTel context leak in fire-and-forget background tasks) for a full explanation of the root cause (asyncio.create_task copies all contextvars at creation time)
- The fix utilities (create_task_with_detached_otel_context, capture_otel_context, activate_otel_context) are already available in llama_stack/core/task.py
- These locations may behave differently from the ones fixed in #5168 (e.g. shorter-lived tasks vs. long-lived workers), so testing is needed to confirm whether the leak actually manifests here
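The root cause noted above (asyncio.create_task copies all contextvars at creation time) can be reproduced without OTel. A minimal sketch, assuming a hypothetical current_trace variable in place of the real trace context; the function names are illustrative and do not come from the codebase:

```python
import asyncio
import contextvars

# Hypothetical stand-in for the OTel trace context; not the OTel API.
current_trace = contextvars.ContextVar("current_trace", default=None)


async def background_worker(seen):
    # Runs in the copy of the context taken when the task was created.
    await asyncio.sleep(0)
    seen.append(current_trace.get())


async def handle_request(trace_id, seen):
    current_trace.set(trace_id)
    task = asyncio.create_task(background_worker(seen))
    # The "request" ends here, but the task's copy of the context is unaffected.
    current_trace.set(None)
    await task


def run_demo():
    seen = []
    asyncio.run(handle_request("trace-A", seen))
    return seen
```

Here run_demo() returns ["trace-A"]: the worker still sees the creating request's value after the request has cleared it. A short-lived task drops that copy when it finishes, which may be why the leak could be less visible at these call sites than in long-lived workers.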