Building a Reliable LLM Tool Calling System

Date: 2024-01-12 Topic: Preventing LLMs from "hallucinating" tool call results

Background

Today I solved an interesting problem: how to ensure LLMs actually execute tool calls rather than generating fake results. The LLM would sometimes produce convincing output without actually calling the tool.

The Problem

The expected flow:

LLM analyzes task requirements
LLM calls the logo detection tool
LLM makes judgment based on actual results

The actual behavior:

# LLM might generate fake "Observation" without calling the tool
response = """
Action: check_logo_presence
Action Input: {"image": "..."}
Observation: {"status": "success", "has_logo": true}  # Fabricated!
Final Answer: Logo is present.
"""

Root Cause

LLMs are pattern predictors. They've seen many examples of tool call sequences and can convincingly generate the entire pattern - including the observation - without actually executing anything.

Solution: Strict Validation

class ToolCallController:
    def __init__(self):
        self.current_tool_result = None  # Store actual result
        self.max_steps = 10
        self.early_stop_threshold = 2

    async def process(self, prompt: str, image_file: UploadFile):
        conversation = prompt
        step_count = 0
        no_tool_call_count = 0

        while step_count < self.max_steps:
            response = await self._get_llm_response(conversation)

            # Parse tool call
            tool_name, tool_args, text = self._parse_latest_plugin_call(response)

            if tool_name:
                no_tool_call_count = 0
                # Execute actual tool call
                tool_result = await self._execute_tool(tool_name, tool_args)
                # Store for validation
                self.current_tool_result = tool_result
                conversation += f"\nObservation: {tool_result}\nThought:"
            else:
                no_tool_call_count += 1
                if "Final Answer:" in response:
                    return self._validate_result(response)

            if no_tool_call_count >= self.early_stop_threshold:
                break

            step_count += 1

        raise ValueError("Max steps reached")

Result Validation

The key is comparing the final answer against actual tool results:

def _validate_result(self, text: str) -> Dict:
    # Extract Final Answer
    final_idx = text.rfind('Final Answer:')
    answer_text = text[final_idx + len('Final Answer:'):].strip()
    result = json.loads(answer_text)

    # Verify against actual tool results
    if self.current_tool_result:
        if "reason" in result:
            reason = result["reason"]
            # Check if claimed results match actual tool returns
            claimed_has_logo = "has_logo=true" in reason.lower()
            if claimed_has_logo != self.current_tool_result["has_logo"]:
                logger.error("Tool result mismatch detected!")
                return {
                    "is_compliant": False,
                    "reason": f"Actual tool result: {self.current_tool_result}",
                    "reference": "Validation failed"
                }

    return result

Prompt Engineering

Force the LLM to follow the tool calling pattern:

def build_prompt(self) -> str:
    return """Analyze whether the advertisement contains a health food logo.

Important rules:
1. You MUST use the check_logo_presence tool for detection
2. Do NOT make judgments or guess results on your own
3. You MUST wait for the tool to return actual results
4. You MUST accurately reference tool return values in your final answer

Format:
Thought: Think about the next action
Action: Tool name
Action Input: Parameters as JSON
Observation: (wait for actual tool result)
...
Final Answer: JSON with result
"""

Today's Reflection

This problem taught me that LLMs are unreliable by default. They're optimized to produce plausible-sounding output, not correct output. When tools are involved, you need:

State tracking: Store actual tool results separately
Validation: Compare claims against actual results
Explicit prompts: Make the rules absolutely clear
Early stopping: Don't let the loop run forever

The key insight: building reliable systems with LLMs isn't about trusting them - it's about building guardrails around them.

Further Learning

Function calling in OpenAI/Anthropic APIs
Tool-use frameworks (LangChain tools, OpenAI functions)
LLM output validation patterns
ReAct implementation best practices

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Building a Reliable LLM Tool Calling System

Background

The Problem

Root Cause

Solution: Strict Validation

Result Validation

Prompt Engineering

Today's Reflection

Further Learning

FilesExpand file tree

41.md

Latest commit

History

41.md

File metadata and controls

Building a Reliable LLM Tool Calling System

Background

The Problem

Root Cause

Solution: Strict Validation

Result Validation

Prompt Engineering

Today's Reflection

Further Learning