Tool-Integrated Reasoning (TIR) in real tasks unfolds as a dynamic, opportunistic process rather than a fixed linear chain~\cite{paranjape2023artautomaticmultistepreasoning}. As shown in \autoref{fig:agent_tool_compare} (left), the model-to-tool flow concentrates heavily on retrieval actions: \texttt{web\_search} is the most frequently invoked tool with 539 calls (33.98\% of all calls), followed by \texttt{visit\_webpage} (385, 24.27\%), \texttt{final\_answer} (358, 22.57\%), \texttt{python\_interpreter} (200, 12.61\%), and \texttt{wikipedia\_search} (104, 6.56\%). This distribution indicates that an external ``retrieve-then-browse'' loop remains the dominant path for contemporary agentic systems, reflecting persistent gaps in the time-sensitive and domain-specific knowledge available to base LLMs. Importantly, models differ in how efficiently they traverse this loop: GPT-4.1, for example, issues large volumes of \texttt{web\_search} (168) and \texttt{visit\_webpage} (110) calls that frequently land in slow latency tiers, whereas Qwen3-Max achieves comparable coverage with far fewer retrieval and browsing steps (61 and 59, respectively). Practically, this pattern implies that reducing redundant retrieval iterations, via better query formulation and higher-quality extraction on the first pass, has immediate leverage on end-to-end latency, often exceeding the gains from marginal improvements to raw model inference speed.
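The reported percentages can be cross-checked directly from the raw call counts. A minimal sketch, taking the five tool counts from the text and inferring the total as their sum (the five tools account for all reported calls):

```python
# Tool-call counts as reported in the text (fig:agent_tool_compare, left).
counts = {
    "web_search": 539,
    "visit_webpage": 385,
    "final_answer": 358,
    "python_interpreter": 200,
    "wikipedia_search": 104,
}

# Total inferred as the sum of the five reported tools: 1586 calls.
total = sum(counts.values())

# Per-tool share of all calls, rounded to two decimals as in the text.
shares = {tool: round(100 * n / total, 2) for tool, n in counts.items()}

print(total)   # 1586
print(shares)  # matches the reported 33.98 / 24.27 / 22.57 / 12.61 / 6.56
```

That the five shares sum to 100\% confirms the reported figures cover the full call log, with retrieval-oriented tools (\texttt{web\_search}, \texttt{visit\_webpage}, \texttt{wikipedia\_search}) alone accounting for roughly 65\% of all invocations.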