Fix `RunUsage.tool_calls` being undercounted due to race condition when running tools in parallel #3133

certainly-param · 2025-10-10T03:27:58Z

Add asyncio.Lock to RunUsage class to prevent race conditions
Make incr() method async and use lock for thread-safe increments
Update all calls to usage.incr() to use await
Replace direct tool_calls += 1 with await usage.incr(RunUsage(tool_calls=1))
Fixes issue where concurrent tool calls could cause undercounted tool_calls
Maintains backward compatibility with synchronous add method

NOTE: While working on this fix, I noticed something interesting about the lock implementation. Since PydanticAI typically uses a single shared RunUsage object per agent run (the ctx.state.usage), I was curious about whether we could optimize the lock granularity. I ran some quick benchmarks and found that using context-based locks (where all tool calls in the same agent run share the same lock) could give about 26-29% better performance. The current instance-level approach works fine though. But, it might be worth exploring this optimization in the future. Current implementation should handle the concurrent tool execution issue nicely :)

Resolves #3120

DouweM · 2025-10-10T08:02:42Z

@certainly-param Considering the performance difference you call out, I'd prefer to use a lock only where we know we are running incr in parallel, meaning specifically inside the ToolManager.

We also definitely need a test that is failing on main, but will succeed after this PR. I'm still a bit surprised this issue exists at all, because as I understand it the usage. tool_calls += 1 line is run in parallel asyncio tasks, not threads, and asyncio tasks should not have race condition issues as they're not truly concurrent (or whatever the more precise word is for the difference between thread and event loop concurrency/parallelism).

Add asyncio.Lock specifically in _call_tools() to prevent race conditions during parallel tool execution, rather than adding overhead to every usage increment. Implementation: - Created asyncio.Lock in _call_tools() where parallel execution occurs - Used ContextVar to pass lock to ToolManager.handle_call() during parallel context - Guard usage.incr(RunUsage(tool_calls=1)) only when executing tools in parallel - Removed unnecessary lock from RunUsage class for better performance Why this works: The race condition occurs when multiple asyncio tasks call usage.incr() concurrently. Even though asyncio is single-threaded, tasks can interleave at await points, causing non-atomic read-modify-write operations (usage.tool_calls += 1) to lose increments. By guarding only the parallel tool execution path with a lock, we: - Prevent the race condition where it actually occurs - Avoid performance overhead in sequential/non-parallel execution - Maintain clean serialization (no lock in dataclass) - Achieve 100% test coverage Changes: - pydantic_ai_slim/pydantic_ai/_agent_graph.py: Add usage_lock in _call_tools() - pydantic_ai_slim/pydantic_ai/_tool_manager.py: Use lock from ContextVar - pydantic_ai_slim/pydantic_ai/usage.py: Simplified RunUsage.incr() and __add__() - Added pass statement for full branch coverage - tests/test_usage_limits.py: Added comprehensive test coverage - test_race_condition_parallel_tool_calls() with 20 iterations, 10 parallel tools - Enhanced test_run_usage_with_request_usage() for empty/non-empty details - Fixed snapshot mismatches in test files - Fixed formatting/trailing whitespace issues Test coverage: - Added test_race_condition_parallel_tool_calls() that fails on main - All existing tests pass with updated snapshots - 100% branch coverage achieved for usage.py Resolves pydantic#3120

certainly-param · 2025-10-11T18:31:18Z

@certainly-param Considering the performance difference you call out, I'd prefer to use a lock only where we know we are running incr in parallel, meaning specifically inside the ToolManager.

We also definitely need a test that is failing on main, but will succeed after this PR. I'm still a bit surprised this issue exists at all, because as I understand it the usage. tool_calls += 1 line is run in parallel asyncio tasks, not threads, and asyncio tasks should not have race condition issues as they're not truly concurrent (or whatever the more precise word is for the difference between thread and event loop concurrency/parallelism).

You're absolutely right that asyncio isn't truly concurrent like threads, but race conditions can still happen here. The issue is that usage.tool_calls += 1 looks like one operation, but it's actually three: read the current value, add one to it, and write it back. When you have multiple tools running in parallel and they hit an await point (like when actually calling the tool), the event loop can switch between tasks mid-operation. So imagine Task 1 reads tool_calls = 0, then the event loop switches to Task 2 which also reads 0, and both write back 1 instead of 2. Even though we're in a single-threaded event loop, that task switching at await points creates the exact same race condition you'd see with threads. It caught me by surprise too when I first dug into it 😄

So, I took your feedback moved the lock to exactly where the parallel execution happens instead of putting it in every incr() call. So now there's an asyncio.Lock() created right in _call_tools() where we know tools are running in parallel, and I use a ContextVar to pass it down to the tool manager. This way, it only locks when tools are actually executing in parallel, sequential calls don't touch the lock at all, so zero performance overhead there. I also added a test that really exercises this: 20 iterations with 10 parallel tools each, with multiple await points to maximize task switching. Without the fix, this test would fail intermittently on main (classic race condition behavior), but with the fix it's rock solid. Everything passes at 100% coverage, and RunUsage stays nice and clean without any lock logic cluttering it up.

DouweM · 2025-10-13T08:59:24Z

pydantic_ai_slim/pydantic_ai/_agent_graph.py

        ) as streamed_response:
            self._did_stream = True
-            ctx.state.usage.requests += 1
+            # Request count is incremented in _finish_handling via response.usage


No need to include this comment, or the next identical one

DouweM · 2025-10-13T09:00:30Z

pydantic_ai_slim/pydantic_ai/_agent_graph.py

    user_parts_by_index: dict[int, _messages.UserPromptPart] = {}
    deferred_calls_by_index: dict[int, Literal['external', 'unapproved']] = {}
+    # Lock to prevent race conditions when incrementing usage.tool_calls from concurrent tool executions
+    usage_lock = asyncio.Lock()


I think this could be a cached_property on ToolManager, and we wouldn't need to touch agent_graph at all

DouweM · 2025-10-13T09:01:41Z

pydantic_ai_slim/pydantic_ai/usage.py

            self.tool_calls += incr_usage.tool_calls
+        else:
+            # RequestUsage: requests is a property that returns 1
+            self.requests += incr_usage.requests


Duplicated with if branch

tests/models/test_google.py

DouweM · 2025-10-13T09:02:17Z

tests/models/test_openai_responses.py

                    assert response_stream.usage() == snapshot(
                        RunUsage(input_tokens=53, output_tokens=469, details={'reasoning_tokens': 448}, requests=1)
                    )
-                    assert run.usage() == snapshot(RunUsage(requests=1))


This looks like a breaking change we shouldn't make

As mentioned above, this would be a breaking change, so I'd rather ensure at the call site where we call incr that RunUsage.requests == 0.

DouweM · 2025-10-13T09:04:22Z

When you have multiple tools running in parallel and they hit an await point (like when actually calling the tool), the event loop can switch between tasks mid-operation. So imagine Task 1 reads tool_calls = 0, then the event loop switches to Task 2 which also reads 0, and both write back 1 instead of 2. Even though we're in a single-threaded event loop, that task switching at await points creates the exact same race condition you'd see with threads. It caught me by surprise too when I first dug into it 😄

@certainly-param Are you sure? I thought that switches only happen at await points, so not in the middle of the += operation.

Either way, if we have a test that used to fail and now succeeds, we're good.

- Remove unnecessary comments about request counting - Move usage_lock to ToolManager as cached_property for better encapsulation - Simplify RunUsage.incr() to avoid code duplication - Clean up _agent_graph.py by removing context var management This makes the lock management more localized to ToolManager where parallel execution actually happens, improving code organization and maintainability.

certainly-param · 2025-10-13T18:29:31Z

Thanks for the detailed review! I've made all the changes you suggested.

Removed the comments about request counting, moved the lock into ToolManager as a cached property (much cleaner this way - no more context var juggling), and simplified the RunUsage.incr() logic.

About the test_google.py diffs - those _identifier fields got added from inline-snapshot updates in an earlier commit. Happy to do an interactive rebase to clean them up if you want, or we can just squash them away when merging.

For test_openai_responses.py - yeah, there's a behavior change here. Before, requests was incremented right when streaming started, now it happens when the response finishes (in _finish_handling). This was needed to avoid double-counting since response.usage also includes requests. But it does mean run.usage() shows requests=0 during the stream now instead of requests=1.

Not sure if that's a problem for anyone? If we need to keep the old behavior, I could increment requests at the start but then skip that field when calling incr(response.usage).

On the race condition - you're right that asyncio doesn't have traditional threading issues. The problem is that += isn't atomic. It's roughly temp = x; temp = temp + 1; x = temp, and when two tasks do this concurrently they can interleave:

Task A reads tool_calls=0
Task A awaits (yields control)
Task B reads tool_calls=0  <- still 0!
Task B increments, writes tool_calls=1
Task A resumes, increments, writes tool_calls=1  <- lost update

The test I added (test_race_condition_parallel_tool_calls) fails pretty reliably without the lock, which is how I caught this.

Anyway, all tests are passing now. Let me know what you think about the test_google.py cleanup and the usage timing question!

DouweM · 2025-10-14T07:07:37Z

pydantic_ai_slim/pydantic_ai/models/openai.py

                truncation=model_settings.get('openai_truncation', NOT_GIVEN),
                timeout=model_settings.get('timeout', NOT_GIVEN),
                service_tier=model_settings.get('openai_service_tier', NOT_GIVEN),
-                previous_response_id=previous_response_id or NOT_GIVEN,


This looks like a broken merge conflict resolution! Please remove it from the diff to make sure we don't accidentally merge this into main.

pydantic_ai_slim/pydantic_ai/_tool_manager.py

pydantic_ai_slim/pydantic_ai/usage.py

tests/models/test_google.py

DouweM · 2025-10-14T07:10:32Z

tests/models/test_openai_responses.py

                    assert response_stream.usage() == snapshot(
                        RunUsage(input_tokens=53, output_tokens=469, details={'reasoning_tokens': 448}, requests=1)
                    )
-                    assert run.usage() == snapshot(RunUsage(requests=1))


As mentioned above, this would be a breaking change, so I'd rather ensure at the call site where we call incr that RunUsage.requests == 0.

DouweM · 2025-10-14T07:10:43Z

tests/test_usage_limits.py


    @controller_agent.tool
-    def delegate_to_other_agent(ctx: RunContext[None], sentence: str) -> int:
+    async def delegate_to_other_agent(ctx: RunContext[None], sentence: str) -> int:


Is this needed?

During rebase/merge, accidentally reverted PR pydantic#3134's change. Restoring: previous_response_id or NOT_GIVEN (instead of just previous_response_id)

- Reverted usage.py to main - the breaking change isn't needed for the fix - Reverted test_openai_responses.py snapshot change that was tied to above - Removed test_run_usage_with_request_usage() that was testing the wrong behavior - Fixed test_multi_agent_usage_sync() - removed unnecessary async keyword - Put back ctx.state.usage.requests += 1 lines in _agent_graph.py (needed for request counting) - Put back comment about stream consumption - Reverted formatting changes in type annotations The race condition fix itself is unchanged - just the lock in _tool_manager.py protecting tool_calls increment.

DouweM · 2025-10-15T09:39:05Z

@certainly-param Thank you!

…en running tools in parallel (pydantic#3133)

phemmer · 2025-10-15T12:25:22Z

I thought that switches only happen at await points, so not in the middle of the += operation.

This is true

The problem is that += isn't atomic. It's roughly temp = x; temp = temp + 1; x = temp, and when two tasks do this concurrently they can interleave:

This is not possible. There must be something within that flow to suspend the current task. Without doing so, the active task will continue to run. The asyncio design is extremely explicit on this. If tasks could be interrupted at any time, it could cause havoc with async applications.

I tried running the test case without the fix applied. I could not get any failures. I even bumped for iteration in range(20): up to 2000 and for i in range(10) to 1000. The issue could not be reproduced. I suspect there's something else going on in your setup, or you have a bug in your python or something.

DouweM · 2025-10-15T12:32:11Z

@phemmer Yeah I'm confused as well; I couldn't and can't reproduce the issue with the old code. @certainly-param wrote "I tested with 3 tools and sometimes get tool_calls=1, sometimes 2, rarely 3." and that it's fixed now, so I merged this mostly off the back of that, but I'm thinking I'll revert this until we have a failure multiple people can reproduce.

certainly-param · 2025-10-15T16:04:12Z

Looking deeper at the code, I think I found the actual issue that was causing those results. In _tool_manager.py, the usage.tool_calls += 1 increment is inside the try block:

try:
tool_result = await self._call_tool(call, allow_partial, wrap_validation_errors)
usage.tool_calls += 1 # Only happens on success
except ToolRetryError as e:
# Tool fails and retries, but increment never happened
raise e

This means when a tool fails with ToolRetryError and gets retried, it only gets counted once even though it was called twice. This could explain the intermittent undercounting I was seeing in my tests - it would depend on which tools fail and retry.
So the lock I added doesn't actually solve this retry counting issue. Should I update the fix to move the increment outside the try block so retries are properly counted? That seems like it would address the real root cause of the undercounting behavior I observed.

DouweM · 2025-10-16T08:27:40Z

@certainly-param Only counting successful calls was a conscious decision in #2633 and #2978, and changing it now could be considered a breaking change if people were relying on the existing behavior. I suppose at the time our thinking was that if your goal is to give your models a "budget" in terms of tool usage, an unsuccessful call (possibly because of arg validation) does not use of the budget. But of course there are cases where a call, even if unsuccessful did use up resources...

@tradeqvest What do you think?

certainly-param force-pushed the fix-race-condition-usage-incr branch 2 times, most recently from 9884c7d to 306626c Compare October 10, 2025 07:22

DouweM self-assigned this Oct 10, 2025

DouweM added the awaiting author revision label Oct 10, 2025

certainly-param force-pushed the fix-race-condition-usage-incr branch 5 times, most recently from 69d2272 to 4c1402b Compare October 11, 2025 17:03

certainly-param force-pushed the fix-race-condition-usage-incr branch from 4c1402b to bea4e20 Compare October 11, 2025 17:13

DouweM requested changes Oct 13, 2025

View reviewed changes

DouweM requested changes Oct 14, 2025

View reviewed changes

certainly-param added 2 commits October 14, 2025 13:47

Fix accidental reversion of previous_response_id change

12b102c

During rebase/merge, accidentally reverted PR pydantic#3134's change. Restoring: previous_response_id or NOT_GIVEN (instead of just previous_response_id)

DouweM changed the title ~~Fix race condition in RunUsage.incr() when running tools in parallel~~ Fix RunUsage.tool_calls being undercounted due to race condition when running tools in parallel Oct 15, 2025

DouweM enabled auto-merge (squash) October 15, 2025 09:38

Merge branch 'main' into fix-race-condition-usage-incr

969353f

DouweM merged commit afccc1b into pydantic:main Oct 15, 2025
29 checks passed

Artui pushed a commit to Artui/pydantic-ai that referenced this pull request Oct 15, 2025

Fix RunUsage.tool_calls being undercounted due to race condition wh…

c36e482

…en running tools in parallel (pydantic#3133)

DouweM mentioned this pull request Oct 15, 2025

Revert "Fix RunUsage.tool_calls being undercounted due to race condition when running tools in parallel" #3174

Merged

Fix RunUsage.tool_calls being undercounted due to race condition when running tools in parallel #3133

Fix RunUsage.tool_calls being undercounted due to race condition when running tools in parallel #3133

Uh oh!

Conversation

certainly-param commented Oct 10, 2025

Uh oh!

DouweM commented Oct 10, 2025

Uh oh!

certainly-param commented Oct 11, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DouweM commented Oct 13, 2025

Uh oh!

certainly-param commented Oct 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DouweM commented Oct 15, 2025

Uh oh!

Uh oh!

phemmer commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DouweM commented Oct 15, 2025

Uh oh!

certainly-param commented Oct 15, 2025

Uh oh!

DouweM commented Oct 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix `RunUsage.tool_calls` being undercounted due to race condition when running tools in parallel #3133

Fix `RunUsage.tool_calls` being undercounted due to race condition when running tools in parallel #3133

certainly-param commented Oct 13, 2025 •

edited

Loading

phemmer commented Oct 15, 2025 •

edited

Loading