Date: 2024-01-12 Topic: Preventing LLMs from "hallucinating" tool call results
Today I solved an interesting problem: how to ensure LLMs actually execute tool calls rather than generating fake results. The LLM would sometimes produce convincing output without actually calling the tool.
The expected flow:
- LLM analyzes task requirements
- LLM calls the logo detection tool
- LLM makes judgment based on actual results
The actual behavior:
# LLM might generate fake "Observation" without calling the tool
response = """
Action: check_logo_presence
Action Input: {"image": "..."}
Observation: {"status": "success", "has_logo": true} # Fabricated!
Final Answer: Logo is present.
"""LLMs are pattern predictors. They've seen many examples of tool call sequences and can convincingly generate the entire pattern - including the observation - without actually executing anything.
class ToolCallController:
def __init__(self):
self.current_tool_result = None # Store actual result
self.max_steps = 10
self.early_stop_threshold = 2
async def process(self, prompt: str, image_file: UploadFile):
conversation = prompt
step_count = 0
no_tool_call_count = 0
while step_count < self.max_steps:
response = await self._get_llm_response(conversation)
# Parse tool call
tool_name, tool_args, text = self._parse_latest_plugin_call(response)
if tool_name:
no_tool_call_count = 0
# Execute actual tool call
tool_result = await self._execute_tool(tool_name, tool_args)
# Store for validation
self.current_tool_result = tool_result
conversation += f"\nObservation: {tool_result}\nThought:"
else:
no_tool_call_count += 1
if "Final Answer:" in response:
return self._validate_result(response)
if no_tool_call_count >= self.early_stop_threshold:
break
step_count += 1
raise ValueError("Max steps reached")The key is comparing the final answer against actual tool results:
def _validate_result(self, text: str) -> Dict:
# Extract Final Answer
final_idx = text.rfind('Final Answer:')
answer_text = text[final_idx + len('Final Answer:'):].strip()
result = json.loads(answer_text)
# Verify against actual tool results
if self.current_tool_result:
if "reason" in result:
reason = result["reason"]
# Check if claimed results match actual tool returns
claimed_has_logo = "has_logo=true" in reason.lower()
if claimed_has_logo != self.current_tool_result["has_logo"]:
logger.error("Tool result mismatch detected!")
return {
"is_compliant": False,
"reason": f"Actual tool result: {self.current_tool_result}",
"reference": "Validation failed"
}
return resultForce the LLM to follow the tool calling pattern:
def build_prompt(self) -> str:
return """Analyze whether the advertisement contains a health food logo.
Important rules:
1. You MUST use the check_logo_presence tool for detection
2. Do NOT make judgments or guess results on your own
3. You MUST wait for the tool to return actual results
4. You MUST accurately reference tool return values in your final answer
Format:
Thought: Think about the next action
Action: Tool name
Action Input: Parameters as JSON
Observation: (wait for actual tool result)
...
Final Answer: JSON with result
"""This problem taught me that LLMs are unreliable by default. They're optimized to produce plausible-sounding output, not correct output. When tools are involved, you need:
- State tracking: Store actual tool results separately
- Validation: Compare claims against actual results
- Explicit prompts: Make the rules absolutely clear
- Early stopping: Don't let the loop run forever
The key insight: building reliable systems with LLMs isn't about trusting them - it's about building guardrails around them.
- Function calling in OpenAI/Anthropic APIs
- Tool-use frameworks (LangChain tools, OpenAI functions)
- LLM output validation patterns
- ReAct implementation best practices