
Building a Reliable LLM Tool Calling System

Date: 2024-01-12
Topic: Preventing LLMs from "hallucinating" tool call results


Background

Today I solved an interesting problem: how to ensure an LLM actually executes tool calls instead of generating fake results. The model would sometimes produce a convincing observation without ever invoking the tool.


The Problem

The expected flow:

  1. LLM analyzes task requirements
  2. LLM calls the logo detection tool
  3. LLM makes judgment based on actual results

The actual behavior:

# LLM might generate fake "Observation" without calling the tool
response = """
Action: check_logo_presence
Action Input: {"image": "..."}
Observation: {"status": "success", "has_logo": true}  # Fabricated!
Final Answer: Logo is present.
"""

Root Cause

LLMs are pattern predictors. They've seen many examples of tool call sequences and can convincingly generate the entire pattern, including the observation, without actually executing anything.
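A first line of defense, common in ReAct-style implementations, is to stop generation at the one string the model must never produce itself. If your LLM API supports stop sequences, pass something like `stop=["Observation:"]`; as a fallback, truncate anything the model generated past that marker before executing the tool. A minimal sketch (the helper name is mine):

```python
def truncate_fabricated_observation(response: str) -> str:
    """Cut the response at the first 'Observation:' the model wrote itself.

    Anything after that marker is fabricated: the real observation is only
    appended by the controller after the tool has actually run.
    """
    idx = response.find("Observation:")
    return response if idx == -1 else response[:idx].rstrip()
```

With stop sequences the marker never appears in the output at all; the truncation covers providers that don't support them.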


Solution: Strict Validation

from fastapi import UploadFile

class ToolCallController:
    def __init__(self):
        self.current_tool_result = None  # Store the actual result of the last tool call
        self.max_steps = 10
        self.early_stop_threshold = 2

    async def process(self, prompt: str, image_file: UploadFile):
        conversation = prompt
        step_count = 0
        no_tool_call_count = 0

        while step_count < self.max_steps:
            response = await self._get_llm_response(conversation)

            # Parse the latest tool call from the response
            tool_name, tool_args, text = self._parse_latest_plugin_call(response)

            if tool_name:
                no_tool_call_count = 0
                # Execute the tool for real, and keep the result for validation
                tool_result = await self._execute_tool(tool_name, tool_args)
                self.current_tool_result = tool_result
                conversation += f"\nObservation: {tool_result}\nThought:"
            else:
                no_tool_call_count += 1
                if "Final Answer:" in response:
                    return self._validate_result(response)

            if no_tool_call_count >= self.early_stop_threshold:
                raise ValueError("LLM stopped calling tools without a Final Answer")

            step_count += 1

        raise ValueError("Max steps reached without a Final Answer")
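The class above calls `_parse_latest_plugin_call` without showing it. One plausible implementation (this version is my sketch, not necessarily the original) locates the last Action / Action Input pair and deliberately discards anything after a model-written `Observation:` marker:

```python
import json

def parse_latest_plugin_call(text: str):
    """Return (tool_name, tool_args, preceding_text) for the last Action in text.

    Returns (None, None, text) when no well-formed tool call is present.
    """
    i = text.rfind("\nAction:")
    j = text.rfind("\nAction Input:")
    k = text.rfind("\nObservation:")
    if 0 <= i < j:
        # Anything after a model-written "Observation:" is fabricated,
        # so the argument text ends there if the marker is present.
        end = k if k > j else len(text)
        tool_name = text[i + len("\nAction:"):j].strip()
        args_text = text[j + len("\nAction Input:"):end].strip()
        try:
            tool_args = json.loads(args_text)
        except json.JSONDecodeError:
            return None, None, text
        return tool_name, tool_args, text[:i]
    return None, None, text
```

Parsing from the end (`rfind`) matters: in a multi-step conversation only the most recent call should be executed.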

Result Validation

The key is comparing the final answer against actual tool results:

# Assumes module-level: import json; from typing import Dict;
# logger = logging.getLogger(__name__)
def _validate_result(self, text: str) -> Dict:
    # Extract everything after the last "Final Answer:" marker
    final_idx = text.rfind('Final Answer:')
    answer_text = text[final_idx + len('Final Answer:'):].strip()
    result = json.loads(answer_text)

    # Verify the claimed outcome against the actual tool result
    if self.current_tool_result:
        if "reason" in result:
            reason = result["reason"]
            # Brittle but simple: look for the claimed flag in the free-text reason
            claimed_has_logo = "has_logo=true" in reason.lower()
            if claimed_has_logo != self.current_tool_result["has_logo"]:
                logger.error("Tool result mismatch detected!")
                return {
                    "is_compliant": False,
                    "reason": f"Actual tool result: {self.current_tool_result}",
                    "reference": "Validation failed"
                }

    return result
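To see the guard trip in isolation, here is a standalone rendition of the same check (function name and inputs are mine, for experimentation only):

```python
import json

def validate_final_answer(text: str, actual_tool_result: dict) -> dict:
    """Standalone version of _validate_result: reject answers that
    contradict what the tool actually returned."""
    answer_text = text[text.rfind("Final Answer:") + len("Final Answer:"):].strip()
    result = json.loads(answer_text)
    reason = result.get("reason", "")
    # The claim in the free-text reason must match the recorded tool result
    if ("has_logo=true" in reason.lower()) != actual_tool_result["has_logo"]:
        return {"is_compliant": False,
                "reason": f"Actual tool result: {actual_tool_result}",
                "reference": "Validation failed"}
    return result
```

Feeding it an answer that claims `has_logo=true` while the tool returned `False` yields the `is_compliant: False` rejection; an honest answer passes through unchanged.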

Prompt Engineering

Force the LLM to follow the tool calling pattern:

def build_prompt(self) -> str:
    return """Analyze whether the advertisement contains a health food logo.

Important rules:
1. You MUST use the check_logo_presence tool for detection
2. Do NOT make judgments or guess results on your own
3. You MUST wait for the tool to return actual results
4. You MUST accurately reference tool return values in your final answer

Format:
Thought: Think about the next action
Action: Tool name
Action Input: Parameters as JSON
Observation: (wait for actual tool result)
...
Final Answer: JSON with result
"""

Today's Reflection

This problem taught me that LLMs are unreliable by default. They're optimized to produce plausible-sounding output, not correct output. When tools are involved, you need:

  1. State tracking: Store actual tool results separately
  2. Validation: Compare claims against actual results
  3. Explicit prompts: Make the rules absolutely clear
  4. Early stopping: Don't let the loop run forever

The key insight: building reliable systems with LLMs isn't about trusting them; it's about building guardrails around them.


Further Learning

  • Function calling in OpenAI/Anthropic APIs
  • Tool-use frameworks (LangChain tools, OpenAI functions)
  • LLM output validation patterns
  • ReAct implementation best practices