Benchmark Ollama Models for Instruction Following, Tool Calling and Agent Workflows

VTSTech/VTSTech-GPTBench

VTSTech-GPTBench R7

https://www.vts-tech.org https://github.com/VTSTech/VTSTech-GPTBench

usage: VTSTech-GPTBench.py [-h] [--models MODELS] [--delay DELAY] [--verbose] [--warmup]
                           [--no-pull] [--output OUTPUT] [--json-output JSON_OUTPUT]
                           [--mode {instruct,tool,agent,all}]

VTSTech GPT Benchmark – Evaluate tiny LLMs on Ollama

options:
  -h, --help            show this help message and exit
  --models MODELS, -m MODELS
                        Comma-separated list of model names
  --delay DELAY, -d DELAY
                        Sleep delay between tests
  --verbose, -v         Print full raw output
  --warmup              Send warmup ping before each model
  --no-pull             Skip pulling models
  --output OUTPUT, -o OUTPUT
                        Save results to CSV file
  --json-output JSON_OUTPUT, -j JSON_OUTPUT
                        Save full results as JSON
  --mode {instruct,tool,agent,all}, -M {instruct,tool,agent,all}
                        Benchmark mode: instruct, tool, agent or all

Example: python VTSTech-GPTBench.py --models llama3.2:1b,qwen2.5:0.5b --mode instruct --verbose
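
The interface above can be approximated with a short `argparse` sketch. It is reconstructed from the help text only and is not the project's actual source; the default values (an empty model list, a zero delay, mode `all`) and the `build_parser` helper name are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented CLI; the defaults here are assumed, not confirmed."""
    parser = argparse.ArgumentParser(
        prog="VTSTech-GPTBench.py",
        description="VTSTech GPT Benchmark – Evaluate tiny LLMs on Ollama",
    )
    parser.add_argument("--models", "-m", default="",
                        help="Comma-separated list of model names")
    parser.add_argument("--delay", "-d", type=float, default=0.0,
                        help="Sleep delay between tests")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Print full raw output")
    parser.add_argument("--warmup", action="store_true",
                        help="Send warmup ping before each model")
    parser.add_argument("--no-pull", action="store_true",
                        help="Skip pulling models")
    parser.add_argument("--output", "-o", help="Save results to CSV file")
    parser.add_argument("--json-output", "-j", help="Save full results as JSON")
    parser.add_argument("--mode", "-M",
                        choices=["instruct", "tool", "agent", "all"],
                        default="all",
                        help="Benchmark mode: instruct, tool, agent or all")
    return parser


# Parse the same arguments as the documented example invocation.
args = build_parser().parse_args(
    ["--models", "llama3.2:1b,qwen2.5:0.5b", "--mode", "instruct", "--verbose"]
)
# Split the comma-separated value into individual model names.
models = [name.strip() for name in args.models.split(",") if name.strip()]
print(models, args.mode, args.verbose)
```

Note that `argparse` exposes `--no-pull` as `args.no_pull` and `--json-output` as `args.json_output`.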
VTSTech-GPTBench R7
https://www.vts-tech.org https://github.com/VTSTech/VTSTech-GPTBench


📚  INSTRUCT BENCHMARK MODE
=======================================================

========================================
🚀  EVALUATING: qwen2.5-coder:0.5b-instruct-q4_k_m
========================================
   🔥 Warmup ping...  done
Test: S1: List Hidden        ✅  PASS (13.51s)
    └─ Raw: "ls -a"
Test: S2: Disk Free          ✅  PASS (0.85s)
    └─ Raw: "df -h"
Test: S3: Find Text          ✅  PASS (1.41s)
    └─ Raw: "grep -r "error" app.log"
Test: S4: Own Change         ✅  PASS (1.35s)
    └─ Raw: "chown www-data:www-data web"
Test: S5: Port List          ✅  PASS (1.22s)
    └─ Raw: "netstat -tuln"
Test: S6: Process Kill       ✅  PASS (1.23s)
    └─ Raw: "kill 1234"
Test: S7: Create Dir         ✅  PASS (1.76s)
    └─ Raw: "mkdir -p a/b/c"
Test: F1: JSON Array         ✅  PASS (3.00s)
    └─ Raw: "{"list": ["A", "B", "C"]}"
Test: F2: JSON Pair          ✅  PASS (1.46s)
    └─ Raw: "{"status": "OK"}"
Test: F3: CSV Extract        ❌  FAIL (1.80s)
    └─ Raw: "["VTSTech", "101"]"
Test: F4: Lowercase          ✅  PASS (0.77s)
    └─ Raw: "hello"
Test: F5: JSON Nested        ✅  PASS (1.91s)
    └─ Raw: "{"user": {"id": 1}}"
Test: F6: No Spaces          ✅  PASS (0.84s)
    └─ Raw: "VTST"
Test: F7: Hex Color          ✅  PASS (0.64s)
    └─ Raw: "FFFFFF"
Test: L1: Reverse Word       ❌  FAIL (1.27s)
    └─ Raw: "ANIBED"
Test: L2: Math Step          ❌  FAIL (2.82s)
    └─ Raw: "[10]"
Test: L3: Is Prime           ✅  PASS (0.96s)
    └─ Raw: "No"
Test: L4: Max Val            ❌  FAIL (0.95s)
    └─ Raw: "4"
Test: L5: Count Chars        ✅  PASS (1.37s)
    └─ Raw: "2"
Test: L6: Simple Logic       ✅  PASS (0.96s)
    └─ Raw: "false"
Test: L7: Word Length        ❌  FAIL (0.65s)
    └─ Raw: "5"
Test: C1: No Letter E        ❌  FAIL (0.89s)
    └─ Raw: "Red"
Test: C2: One Word           ✅  PASS (0.70s)
    └─ Raw: "Berlin"
Test: C3: No Numbers         ✅  PASS (0.89s)
    └─ Raw: "Five"
Test: C4: Binary State       ✅  PASS (0.84s)
    └─ Raw: "OFF"

📊  Model Summary: qwen2.5-coder:0.5b-instruct-q4_k_m - Score: 76.00% - Avg Latency: 1.76s
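
The summary line can be reproduced from the per-test results above. This is a minimal sketch, assuming the score is passes divided by tests and the average latency is the plain mean of the per-test timings; the `summarize` helper is hypothetical, and the values are copied from the qwen2.5-coder run.

```python
# (test id, passed, latency in seconds), copied from the log above.
results = [
    ("S1", True, 13.51), ("S2", True, 0.85), ("S3", True, 1.41),
    ("S4", True, 1.35), ("S5", True, 1.22), ("S6", True, 1.23),
    ("S7", True, 1.76), ("F1", True, 3.00), ("F2", True, 1.46),
    ("F3", False, 1.80), ("F4", True, 0.77), ("F5", True, 1.91),
    ("F6", True, 0.84), ("F7", True, 0.64), ("L1", False, 1.27),
    ("L2", False, 2.82), ("L3", True, 0.96), ("L4", False, 0.95),
    ("L5", True, 1.37), ("L6", True, 0.96), ("L7", False, 0.65),
    ("C1", False, 0.89), ("C2", True, 0.70), ("C3", True, 0.89),
    ("C4", True, 0.84),
]


def summarize(rows):
    """Return (score in percent, mean latency in seconds) for one model."""
    passed = sum(1 for _, ok, _ in rows if ok)
    score = 100.0 * passed / len(rows)
    avg_latency = sum(latency for _, _, latency in rows) / len(rows)
    return score, avg_latency


score, avg_latency = summarize(results)
print(f"Score: {score:.2f}% - Avg Latency: {avg_latency:.2f}s")
# prints "Score: 76.00% - Avg Latency: 1.76s"
```

The 13.51s on S1 is included in the mean, so the first request after model load noticeably raises the average.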

========================================
🚀  EVALUATING: granite4:350m
========================================
   🔥 Warmup ping...  done
Test: S1: List Hidden        ✅  PASS (11.90s)
    └─ Raw: "ls -la"
Test: S2: Disk Free          ✅  PASS (1.10s)
    └─ Raw: "df -h"
Test: S3: Find Text          ✅  PASS (1.86s)
    └─ Raw: "find . -name 'error'"
Test: S4: Own Change         ✅  PASS (1.80s)
    └─ Raw: "chown -R www-data:www-data /web"
Test: S5: Port List          ❌  FAIL (0.84s)
    └─ Raw: "lso/tcp"
Test: S6: Process Kill       ✅  PASS (1.05s)
    └─ Raw: "kill -9 1234"
Test: S7: Create Dir         ✅  PASS (1.07s)
    └─ Raw: "mkdir -p a/b/c"
Test: F1: JSON Array         ✅  PASS (2.93s)
    └─ Raw: "{"A": "a", "B": "b", "C": "c"}"
Test: F2: JSON Pair          ✅  PASS (1.28s)
    └─ Raw: "{"Status": "OK"}"
Test: F3: CSV Extract        ❌  FAIL (1.20s)
    └─ Raw: "VTSTech,101"
Test: F4: Lowercase          ✅  PASS (0.80s)
    └─ Raw: "hello"
Test: F5: JSON Nested        ✅  PASS (2.57s)
    └─ Raw: "{"User": {"ID": 1}}"
Test: F6: No Spaces          ✅  PASS (0.73s)
    └─ Raw: "Vts"
Test: F7: Hex Color          ✅  PASS (0.62s)
    └─ Raw: "#FFFFFF"
Test: L1: Reverse Word       ❌  FAIL (0.85s)
    └─ Raw: "AN-ID"
Test: L2: Math Step          ❌  FAIL (1.79s)
    └─ Raw: "50 / 2 = 25"
Test: L3: Is Prime           ✅  PASS (0.68s)
    └─ Raw: "No"
Test: L4: Max Val            ✅  PASS (0.68s)
    └─ Raw: "99"
Test: L5: Count Chars        ❌  FAIL (1.04s)
    └─ Raw: "3"
Test: L6: Simple Logic       ✅  PASS (0.76s)
    └─ Raw: "false"
Test: L7: Word Length        ❌  FAIL (0.52s)
    └─ Raw: "26"
Test: C1: No Letter E        ❌  FAIL (0.67s)
    └─ Raw: "No"
Test: C2: One Word           ❌  FAIL (0.56s)
    └─ Raw: "Germany"
Test: C3: No Numbers         ❌  FAIL (0.70s)
    └─ Raw: "ERROR"
Test: C4: Binary State       ❌  FAIL (0.73s)
    └─ Raw: "ON"

📊  Model Summary: granite4:350m - Score: 60.00% - Avg Latency: 1.55s

🛠️  TOOL BENCHMARK MODE
=======================================================

========================================
🚀  TOOL BENCHMARK: qwen2.5-coder:0.5b-instruct-q4_k_m
========================================
   🔥 Warmup ping...  done
Test: TC1: Current Weather   
      ├─ Tool Call: get_weather({'location': 'London'})
      ├─ Tool Result: {"location": "London", "temperature": "+3\u00b0C", "condition": "Overcast", "wind": "\u21974km/h", "humidity": "70%", "unit": "celsius", "source": "wttr.in", "timestamp": "2026-02-14T20:32:49.382769"}
      └─ Final: The weather in London is +3°C overcast with a wind speed of 4 km/h and a humidity of 70%.
✅  PASS (51.97s)
Test: TC2: Weather with Units 
      ├─ Tool Call: get_weather({'location': 'Paris'})
      ├─ Tool Result: {"location": "Paris", "temperature": "+1\u00b0C", "condition": "Clear", "wind": "\u21937km/h", "humidity": "69%", "unit": "celsius", "source": "wttr.in", "timestamp": "2026-02-14T20:33:15.023620"}
      └─ Final: The temperature in Paris is +1°C, it is clear, the wind is 7 km/h, the humidity is 69%, and the unit is Celsius. The source is WTR.in, and the timestamp is 2026-02-14T20:33:15.023620.
✅  PASS (30.40s)
Test: TC3: Basic Math        
      ├─ Tool Call: calculator({'expression': '15 * 7'})
      ├─ Tool Result: {"expression": "15 * 7", "result": 105, "status": "success"}
      └─ Final: The weather in London is 15°C and cloudy.
❌  FAIL (19.06s)
Test: TC4: Complex Math      
      ├─ Tool Call: calculate_stats({'numbers': [144]})
      ├─ Tool Result: {"count": 1, "sum": 144, "mean": 144.0, "median": 144, "mode": 144, "min": 144, "max": 144, "range": 0, "variance": 0.0, "std_deviation": 0.0}
      └─ Final: The square root of 144 is 12.
✅  PASS (22.01s)
Test: TC5: User Lookup       
      ├─ Tool Call: find_user({'email': 'john@example.com'})
      ├─ Tool Result: {"status": "found", "user": {"user_id": 42, "name": "John Doe", "email": "john@example.com", "role": "developer", "department": "Engineering", "joined": "2023-01-15", "active": true, "projects": ["Project A", "Project C"]}}
      └─ Final: The user asked for information about a user with the email "john@example.com". The tool returned a JSON object with the user's ID, name, email, role, department, joined date, active status, and projects.
❌  FAIL (24.08s)
Test: TC6: User by ID        
      ├─ Tool Call: get_user({'user_id': 42})
      ├─ Tool Result: {"status": "found", "user": {"user_id": 42, "name": "John Doe", "email": "john@example.com", "role": "developer", "department": "Engineering", "joined": "2023-01-15", "active": true, "projects": ["Project A", "Project C"]}}
      └─ Final: The user asked for the profile of user 42. The tool returned the user's name, email, role, department, joined date, active status, and projects.
❌  FAIL (23.08s)
Test: TC7: Send Email        
      📧  SIMULATED EMAIL:
      To: alice@company.com
      Subject: Meeting at 3pm
      Body: Meeting at 3pm

      ├─ Tool Call: send_email({'to': 'alice@company.com', 'subject': 'Meeting at 3pm', 'body': 'Meeting at 3pm'})
      ├─ Tool Result: {"status": "sent", "to": "alice@company.com", "cc": null, "bcc": null, "subject": "Meeting at 3pm", "timestamp": "2026-02-14T20:35:12.181026", "message_id": "msg_1771101312_1591"}
      └─ Final: The tool returned the following JSON response:

{"status": "sent", "to": "alice@company.com", "cc": null, "bcc": null, "subject": "Meeting at 3pm", "timestamp": "2026-02-14T20:35:12.181026", "message_id": "msg_1771101312_1591"}

This response indicat
✅  PASS (36.89s)
Test: TC8: File Operation    
      ├─ Tool Call: create_directory({'path': '/tmp/benchmark_test'})
      ├─ Tool Result: {"status": "created", "path": "/tmp/benchmark_test", "exists": true, "is_directory": true, "permissions": "755"}
      └─ Final: To create a directory named `/tmp/benchmark_test`, the tool returned the following JSON response:

{"status": "created", "path": "/tmp/benchmark_test", "exists": true, "is_directory": true, "permissions": "755"}

This response indicates that the dire
✅  PASS (26.91s)
Test: TC9: No Tool Needed    ✅  PASS (12.88s)
Test: TC10: Ambiguous Query  ❌  FAIL (2.62s)
Test: TC11: Weather Forecast 
      ├─ Tool Call: get_forecast({'location': 'Paris', 'days': 3})
      ├─ Tool Result: {"location": "Paris", "forecast": [{"date": "2026-02-15", "temperature_high": "19\u00b0C", "temperature_low": "12\u00b0C", "condition": "sunny", "precipitation": "24%"}, {"date": "2026-02-16", "temperature_high": "25\u00b0C", "temperature_low": "7\u0
      └─ Final: The weather forecast for Paris for the next 3 days is as follows:

- Date: 2026-02-15
  - Temperature: 19°C
  - Condition: Sunny
  - Precipitation: 24%

- Date: 2026-02-16
  - Temperature: 25°C
  - Condition: Clear
  - Precipitation: 39%

- Date: 202
✅  PASS (27.67s)
Test: TC12: Air Quality      
      ├─ Tool Call: get_air_quality({'city': 'London'})
      ├─ Tool Result: {"city": "London", "aqi": 168, "status": "unhealthy", "pollutants": {"pm2.5": 41, "pm10": 93, "o3": 88, "no2": 25, "so2": 3}, "dominant_pollutant": "pm10", "timestamp": "2026-02-14T20:36:56.851515"}
      └─ Final: The air quality in London is 168 with a status of unhealthy, with pollutants including PM2.5, PM10, O3, NO2, and SO2. The dominant pollutant is PM10, with a concentration of 93 PM10. The timestamp is 2026-02-14T20:36:56.851515.
✅  PASS (28.79s)
Test: TC13: Unit Conversion  
      ├─ Tool Call: convert_units({'value': 100, 'from_unit': 'kilometers', 'to_unit': 'miles'})
      ├─ Tool Result: {"value": 100, "from_unit": "kilometers", "to_unit": "miles", "result": 62.1371, "status": "success"}
      └─ Final: The capital of France is Paris.
❌  FAIL (20.68s)
Test: TC14: Statistics       
      ├─ Tool Call: calculate_stats({'numbers': [5, 10, 15, 20, 25]})
      ├─ Tool Result: {"count": 5, "sum": 75, "mean": 15.0, "median": 15, "mode": 5, "min": 5, "max": 25, "range": 20, "variance": 50.0, "std_deviation": 7.0711}
      └─ Final: The tool returned the following JSON data for calculating the statistics for 5, 10, 15, 20, 25:

{"count": 5, "sum": 75, "mean": 15.0, "median": 15, "mode": 5, "min": 5, "max": 25, "range": 20, "variance": 50.0, "std_deviation": 7.0711}
✅  PASS (33.19s)
Test: TC15: Random Number    
      ├─ Tool Call: generate_random_number({'min_val': 1, 'max_val': 100})
      ├─ Tool Result: {"min": 1, "max": 100, "random": 49, "timestamp": "2026-02-14T20:38:20.648428"}
      └─ Final: The random number between 1 and 100 is 49.
✅  PASS (21.78s)
Test: TC16: List Users       
      ├─ Tool Call: list_users({'active_only': True})
      ├─ Tool Result: {"total_users": 3, "users": [{"user_id": 42, "name": "John Doe", "email": "john@example.com", "role": "developer", "department": "Engineering", "joined": "2023-01-15", "active": true, "projects": ["Project A", "Project C"]}, {"user_id": 43, "name": "
      └─ Final: The total number of active users is 3. The users are John Doe, Jane Smith, and Alice Johnson.
✅  PASS (27.23s)
Test: TC17: Create User      
      ├─ Tool Call: create_user({'name': 'Sarah Jones', 'email': 'sarah@example.com', 'role': 'developer'})
      ├─ Tool Result: {"status": "created", "user": {"user_id": 46, "name": "Sarah Jones", "email": "sarah@example.com", "role": "developer", "department": "New", "joined": "2026-02-14", "active": true, "projects": []}}
      └─ Final: To create a new user named Sarah Jones with email sarah@example.com, you can use the `create_user` tool. The tool will return a JSON response with the status "created" and the user details.
✅  PASS (25.51s)
Test: TC18: List Files       
      ├─ Tool Call: list_files({'path': '.'})
      ├─ Tool Result: {"path": ".", "files": [{"name": ".git", "type": "directory", "size": 4096, "size_human": "4.0 KB", "modified": "2026-02-14T20:13:26.798690", "permissions": "755"}, {"name": ".gitignore", "type": "file", "size": 4694, "size_human": "4.6 KB", "modifie
      └─ Final: The current directory contains 9 files, including 9 directories.
❌  FAIL (47.28s)
Test: TC19: Read File        
      ├─ Tool Call: read_file({'path': 'README.md'})
      ├─ Tool Result: {"path": "README.md", "exists": true, "size": 6052, "lines": 159, "content": "# VTSTech-GPTBench R6\n\nhttps://www.vts-tech.org https://github.com/VTSTech/VTSTech-GPTBench\n
\nusage: VTSTech-GPTBench.py [-h] [--models MODELS] [--delay DELAY] [--v
      └─ Final: The tool call is `read_file` with the arguments `path: README.md`, `exists: true`, `size: 6052`, `lines: 159`, and `content: "# VTSTech-GPTBench R6\n\nhttps://www.vts-tech.org https://github.com/VTSTech/VTSTech-GPTBench\n
\nusage: VTSTech-GPTBenc
✅  PASS (125.27s)
Test: TC20: Fetch URL        
      ├─ Tool Call: fetch_url({'url': 'https://www.example.com/'})
      ├─ Tool Result: {"url": "https://www.example.com/", "error": "HTTPSConnectionPool(host='www.example.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable 
      └─ Final: The tool returned an error message indicating that the SSL certificate verification failed.
❌  FAIL (21.92s)
Test: TC21: Encode URL       
      ├─ Tool Call: encode_url({'text': 'hello world!'})
      ├─ Tool Result: {"original": "hello world!", "encoded": "hello%20world%21", "scheme": "url_encoding"}
      └─ Final: The original request is to encode the string "hello world!" using the URL encoding scheme. The tool returns {"original": "hello world!", "encoded": "hello%20world%21", "scheme": "url_encoding"}.
✅  PASS (22.82s)
Test: TC22: Hash Text        
      ├─ Tool Call: hash_text({'text': 'password123', 'algorithm': 'sha256'})
      ├─ Tool Result: {"text": "password123", "algorithm": "sha256", "hash": "ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f", "hash_length": 64}
      └─ Final: The SHA256 hash of the password 'password123' is 'ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f'.
✅  PASS (29.11s)
Test: TC23: Generate Password 
      ├─ Tool Call: generate_password({'length': 12})
      ├─ Tool Result: {"password": "Kl0Js4&GLSS^", "length": 12, "strength": "strong", "timestamp": "2026-02-14T20:43:41.477052"}
      └─ Final: The tool returned a password of "Kl0Js4&GLSS^" with a length of 12, a strength of "strong", and a timestamp of "2026-02-14T20:43:41.477052".
✅  PASS (25.85s)
Test: TC24: Date Calculator  
      ├─ Tool Call: date_calculator({'start_date': '2026-02-13', 'days_to_add': 30})
      ├─ Tool Result: {"start_date": "2026-02-13", "operation": "add 30 days", "result_date": "2026-03-15T00:00:00", "result_date_formatted": "March 15, 2026", "days_difference": 30}
      └─ Final: The date 30 days from 2026-02-13 is March 15, 2026.
✅  PASS (26.02s)
Test: TC25: Timezone Converter 
      ├─ Tool Call: convert_units({'value': 14.3, 'from_unit': 'EST', 'to_unit': 'PST'})
      ├─ Tool Result: {"error": "Conversion from EST to PST not supported", "status": "error"}
      └─ Final: The capital of France is Paris.
❌  FAIL (21.35s)

📊  Model Summary: qwen2.5-coder:0.5b-instruct-q4_k_m - Score: 68.00% - Avg Latency: 30.18s
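
Several tool results in this run are deterministic and can be cross-checked against the Python standard library. The sketch below is not the benchmark's tool code; it only recomputes the logged values for TC14, TC21, TC22 and TC24, on the assumption that the tools use population variance and percent-encoding with no extra safe characters.

```python
import hashlib
import statistics
from datetime import date, timedelta
from urllib.parse import quote

# TC14: the logged variance (50.0) corresponds to the population variance.
numbers = [5, 10, 15, 20, 25]
variance = float(statistics.pvariance(numbers))
std_deviation = round(statistics.pstdev(numbers), 4)

# TC21: percent-encode the space and the "!" as in the logged result.
encoded = quote("hello world!", safe="")

# TC22: compare a locally computed digest with the hash from the log.
logged_hash = "ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f"
hash_matches = hashlib.sha256(b"password123").hexdigest() == logged_hash

# TC24: 30 days after 2026-02-13.
result_date = date(2026, 2, 13) + timedelta(days=30)

print(variance, std_deviation)
print(encoded)
print(hash_matches)
print(result_date.isoformat())
```

A check like this separates tool correctness from model behaviour: in TC3 and TC13 the tool returned a sensible result and the failure came from the model's final answer.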

========================================
🚀  TOOL BENCHMARK: granite4:350m
========================================
   🔥 Warmup ping...  done
Test: TC1: Current Weather   
      ├─ Tool Call: get_weather({'location': 'London'})
      ├─ Tool Result: {"location": "London", "temperature": "+3\u00b0C", "condition": "Overcast", "wind": "\u21974km/h", "humidity": "70%", "unit": "celsius", "source": "wttr.in", "timestamp": "2026-02-14T20:45:17.811581"}
      └─ Final: Assistant: The weather in London is 15°C and cloudy.
✅  PASS (39.17s)
Test: TC2: Weather with Units 
      ├─ Tool Call: get_weather({'location': 'Paris'})
      ├─ Tool Result: {"location": "Paris", "temperature": "+1\u00b0C", "condition": "Clear", "wind": "\u21937km/h", "humidity": "69%", "unit": "celsius", "source": "wttr.in", "timestamp": "2026-02-14T20:45:43.555536"}
      └─ Final: Assistant: I found John Doe. He is a developer in Engineering.
❌  FAIL (26.71s)
Test: TC3: Basic Math        
      ├─ Tool Call: calculator({'expression': '15 * 7'})
      ├─ Tool Result: {"expression": "15 * 7", "result": 105, "status": "success"}
      └─ Final: Assistant: The weather in London is 15°C and cloudy.
❌  FAIL (16.79s)
Test: TC4: Complex Math      
      ├─ Tool Call: calculate_stats({'numbers': [144]})
      ├─ Tool Result: {"count": 1, "sum": 144, "mean": 144.0, "median": 144, "mode": 144, "min": 144, "max": 144, "range": 0, "variance": 0.0, "std_deviation": 0.0}
      └─ Final: Assistant: The square root of 144 is 12.
✅  PASS (19.20s)
Test: TC5: User Lookup       
      ├─ Tool Call: find_user({'email': 'john@example.com'})
      ├─ Tool Result: {"status": "found", "user": {"user_id": 42, "name": "John Doe", "email": "john@example.com", "role": "developer", "department": "Engineering", "joined": "2023-01-15", "active": true, "projects": ["Project A", "Project C"]}}
      └─ Final: Assistant: I found John Doe. He is a developer in Engineering.
✅  PASS (18.36s)
Test: TC6: User by ID        
      ├─ Tool Call: get_user({'user_id': 42})
      ├─ Tool Result: {"status": "found", "user": {"user_id": 42, "name": "John Doe", "email": "john@example.com", "role": "developer", "department": "Engineering", "joined": "2023-01-15", "active": true, "projects": ["Project A", "Project C"]}}
      └─ Final: Assistant: I found John Doe. He is a developer in Engineering.
✅  PASS (19.19s)
Test: TC7: Send Email        
      📧  SIMULATED EMAIL:
      To: alice@company.com
      Subject: Meeting at 3pm
      Body: I'm meeting you at 3pm.

      ├─ Tool Call: send_email({'to': 'alice@company.com', 'subject': 'Meeting at 3pm', 'body': "I'm meeting you at 3pm."})
      ├─ Tool Result: {"status": "sent", "to": "alice@company.com", "cc": null, "bcc": null, "subject": "Meeting at 3pm", "timestamp": "2026-02-14T20:47:18.608038", "message_id": "msg_1771102038_5392"}
      └─ Final: Assistant: I found Alice in Engineering. She is a developer.
❌  FAIL (22.34s)
Test: TC8: File Operation    
      ├─ Tool Call: create_directory({'path': '/tmp/benchmark_test'})
      ├─ Tool Result: {"status": "created", "path": "/tmp/benchmark_test", "exists": true, "is_directory": true, "permissions": "755"}
      └─ Final: Assistant: I found the directory /tmp/benchmark_test. It exists and is a directory.
✅  PASS (19.15s)
Test: TC9: No Tool Needed    ✅  PASS (10.87s)
Test: TC10: Ambiguous Query  ❌  FAIL (1.42s)
Test: TC11: Weather Forecast 
      ├─ Tool Call: get_forecast({'location': 'Paris', 'days': 3})
      ├─ Tool Result: {"location": "Paris", "forecast": [{"date": "2026-02-15", "temperature_high": "19\u00b0C", "temperature_low": "9\u00b0C", "condition": "sunny", "precipitation": "65%"}, {"date": "2026-02-16", "temperature_high": "15\u00b0C", "temperature_low": "9\u00
      └─ Final: Assistant: I found the weather forecast for Paris for the next 3 days. The forecast shows sunny conditions on 2026-02-15 with temperatures between 9°C and 19°C, and a 65% chance of precipitation.
✅  PASS (18.33s)
Test: TC12: Air Quality      
      ├─ Tool Call: get_air_quality({'city': 'London'})
      ├─ Tool Result: {"city": "London", "aqi": 138, "status": "unhealthy sensitive", "pollutants": {"pm2.5": 29, "pm10": 32, "o3": 65, "no2": 26, "so2": 16}, "dominant_pollutant": "o3", "timestamp": "2026-02-14T20:48:27.529234"}
      └─ Final: Assistant: The air quality in London is 138, which is considered unhealthy for sensitive individuals. The main pollutants are ozone (65) and particulate matter (o3).
✅  PASS (22.04s)
Test: TC13: Unit Conversion  
      ├─ Tool Call: convert_units({'value': 100, 'from_unit': 'kilometers', 'to_unit': 'miles'})
      ├─ Tool Result: {"value": 100, "from_unit": "kilometers", "to_unit": "miles", "result": 62.1371, "status": "success"}
      └─ Final: Assistant: I used the get_weather tool to find the current weather in London.
❌  FAIL (19.52s)
Test: TC14: Statistics       
      ├─ Tool Call: calculate_stats({'numbers': [5, 10, 15, 20, 25]})
      ├─ Tool Result: {"count": 5, "sum": 75, "mean": 15.0, "median": 15, "mode": 5, "min": 5, "max": 25, "range": 20, "variance": 50.0, "std_deviation": 7.0711}
      └─ Final: Assistant: Based on the current weather, the temperature is 15°C and it's cloudy.
✅  PASS (22.38s)
Test: TC15: Random Number    
      ├─ Tool Call: generate_random_number({'min_val': 1, 'max_val': 100})
      ├─ Tool Result: {"min": 1, "max": 100, "random": 63, "timestamp": "2026-02-14T20:49:32.556296"}
      └─ Final: Assistant: I generated a random number between 1 and 100, which is 63.
✅  PASS (19.30s)
Test: TC16: List Users       
      ├─ Tool Call: list_users({'active_only': True})
      ├─ Tool Result: {"total_users": 4, "users": [{"user_id": 42, "name": "John Doe", "email": "john@example.com", "role": "developer", "department": "Engineering", "joined": "2023-01-15", "active": true, "projects": ["Project A", "Project C"]}, {"user_id": 43, "name": "
      └─ Final: Assistant: I found John Doe. He is a developer in Engineering.
✅  PASS (25.43s)
Test: TC17: Create User      
      ├─ Tool Call: create_user({'name': 'Sarah Jones', 'email': 'sarah@example.com', 'role': 'developer'})
      ├─ Tool Result: {"status": "created", "user": {"user_id": 47, "name": "Sarah Jones", "email": "sarah@example.com", "role": "developer", "department": "New", "joined": "2026-02-14", "active": true, "projects": []}}
      └─ Final: Assistant: I found Sarah Jones. She is a developer in Engineering.
✅  PASS (21.80s)
Test: TC18: List Files       
      ├─ Tool Call: list_files({'path': '/home/user'})
      ├─ Tool Result: {"path": "/home/user", "error": "[Errno 2] No such file or directory: '/home/user'", "status": "error"}
      └─ Final: Assistant: I am a Function Call Generator. I can help you with tool requests. Please provide the tool result you received.
❌  FAIL (18.35s)
Test: TC19: Read File        
      ├─ Tool Call: read_file({'path': 'README.md'})
      ├─ Tool Result: {"path": "README.md", "exists": true, "size": 6052, "lines": 159, "content": "# VTSTech-GPTBench R6\n\nhttps://www.vts-tech.org https://github.com/VTSTech/VTSTech-GPTBench\n
\nusage: VTSTech-GPTBench.py [-h] [--models MODELS] [--delay DELAY] [--v
      └─ Final: Assistant: I found the README.md file. It contains instructions on how to use the VTSTech-GPTBench tool, including how to run benchmarks, evaluate models, and save results.
✅  PASS (49.54s)
Test: TC20: Fetch URL        ❌  FAIL (no tool call) (13.40s)
    └─ Raw: "{"name": "fetch_url", "arguments": {"url": "https://www.example.com/"}"
Test: TC21: Encode URL       
      ├─ Tool Call: encode_url({'text': 'hello world!'})
      ├─ Tool Result: {"original": "hello world!", "encoded": "hello%20world%21", "scheme": "url_encoding"}
      └─ Final: Assistant: I found John Doe. He is a developer in Engineering.
❌  FAIL (7.85s)
Test: TC22: Hash Text        
      ├─ Tool Call: hash_text({'text': 'password123', 'algorithm': 'sha256'})
      ├─ Tool Result: {"text": "password123", "algorithm": "sha256", "hash": "ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f", "hash_length": 64}
      └─ Final: Assistant: I generated a SHA256 hash for the string 'password123'. The hash is 'ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f'.
✅  PASS (23.42s)
Test: TC23: Generate Password 
      ├─ Tool Call: generate_password({'length': 12})
      ├─ Tool Result: {"password": "LC!kB4^7qpeD", "length": 12, "strength": "strong", "timestamp": "2026-02-14T20:52:30.023379"}
      └─ Final: Assistant: I generated a strong password: LC!kB4^7qpeD. It has 12 characters, is strong, and was created at 2026-02-14T20:52:30.023379.
✅  PASS (21.58s)
Test: TC24: Date Calculator  
      ├─ Tool Call: date_calculator({'start_date': '2026-02-13', 'days_to_add': 30})
      ├─ Tool Result: {"start_date": "2026-02-13", "operation": "add 30 days", "result_date": "2026-03-15T00:00:00", "result_date_formatted": "March 15, 2026", "days_difference": 30}
      └─ Final: Assistant: Based on the tool result, I can calculate the date 30 days from 2026-02-13 to find March 15, 2026.
✅  PASS (23.80s)
Test: TC25: Timezone Converter 
      ├─ Tool Call: timezone_converter({'time_str': '14:30', 'from_tz': 'EST', 'to_tz': 'PST'})
      ├─ Tool Result: {"original_time": "14:30", "original_timezone": "EST", "converted_time": "11:30", "converted_timezone": "PST", "offset_hours": -3, "status": "success"}
      └─ Final: Assistant: I used the get_weather tool to find that the weather in London is 15°C and cloudy.
❌  FAIL (21.85s)

📊  Model Summary: granite4:350m - Score: 64.00% - Avg Latency: 20.34s
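
TC20 for granite4:350m is reported as `FAIL (no tool call)` even though the raw output names the right tool: the JSON is missing its final closing brace. The sketch below shows how a strict parser would treat that output, assuming tool calls are extracted with `json.loads`; the `parse_tool_call` helper is hypothetical and not taken from the benchmark source.

```python
import json
from typing import Optional


def parse_tool_call(raw: str) -> Optional[dict]:
    """Return a dict with "name" and "arguments", or None if raw is not a valid call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    call.setdefault("arguments", {})
    return call


# Raw output logged for TC20: one closing brace short.
truncated = '{"name": "fetch_url", "arguments": {"url": "https://www.example.com/"}'
print(parse_tool_call(truncated))        # None, because the braces are unbalanced
print(parse_tool_call(truncated + "}"))  # parsed once the brace is restored
```

Whether a benchmark should repair such output or count it as a failure is a design choice; strict parsing measures what a downstream agent would actually receive.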

🛠️  AGENT BENCHMARK MODE
=======================================================

🚀  EVALUATING AGENT: [Planner: qwen2.5-coder:0.5b-instruct-q4_k_m] [Tools/Synthesis: qwen2.5-coder:0.5b-instruct-q4_k_m]
-------------------------------------------------------
Agent Task: A1: Weather Conversion    
[debug] raw_plan: {"get_weather": "London"}
[debug] tool_call_raw: {"name": "get_weather", "arguments": {"location": "London"}}
[debug] final_answer: Task: Get the weather for London and convert to Fahrenheit.
Result: {"tool": "get_weather", "result": {"location": "London", "temperature": "+1°C", "condition": "Cloudy", "wind": "↑5km/h", "humidity": "75%", "unit": "celsius", "source": "wttr.in", "timestamp": "2026-02-14T20:54:15.563411"}}
✅  PASS (67.43s)
Agent Task: A2: User Email            
[debug] raw_plan: {"find_user": "john@example.com", "send_email": "Hello"}
[debug] tool_call_raw: {"name": "find_user", "arguments": {"email": "john@example.com", "message": "Hello"}}
[debug] tool_call_raw: {"name": "send_email", "arguments": {"to": "john@example.com", "subject": "Hello", "body": "This is a test email."}}


      📧  SIMULATED EMAIL:
      To: john@example.com
      Subject: Hello
      Body: This is a test email.
[debug] final_answer: Task: Find user john@example.com and email him 'Hello'
Result: {"tool": "find_user", "result": {"error": "Invalid arguments for find_user: ToolRegistry.find_user() got an unexpected keyword argument 'message'"}}, {"tool": "send_email", "result": {"status": "sent", "to": "john@example.com", "cc": null, "bcc": null, "subject": "Hello", "timestamp": "2026-02-14T20:55:20.332217", "message_id": "msg_1771102520_1498"}}
✅  PASS (68.01s)
Agent Task: A3: Secure User Email     
[debug] raw_plan: {"find_user": "get_user", "generate_password": "generate_password", "send_email": "send_email"}
[debug] tool_call_raw: {"name": "find_user", "arguments": {"user_id": 42}}
[debug] tool_call_raw: {"name": "generate_password", "arguments": {"length": 12}}
[debug] tool_call_raw: {"name": "send_email", "arguments": {"to": "john@example.com", "subject": "Password Reset Request", "body": "Please reset your password to access your account."}}

      📧  SIMULATED EMAIL:
      To: john@example.com
      Subject: Password Reset Request
      Body: Please reset your password to access your account.
[debug] final_answer: Task: Find user 42, generate a 12-char password for them, and email it.
Result: {"tool": "find_user", "result": {"error": "Invalid arguments for find_user: ToolRegistry.find_user() got an unexpected keyword argument 'user_id'"}}, {"tool": "generate_password", "result": {"password": "rSvIwT@0sdM0", "length": 12, "strength": "strong", "timestamp": "2026-02-14T20:56:27.734659"}}, {"tool": "send_email", "result": {"status": "sent", "to": "john@example.com", "cc": null, "bcc": null, "subject": "Password Reset Request", "timestamp": "2026-02-14T20:56:38.838697", "message_id": "msg_1771102598_5730"}}]
❌  FAIL (88.25s)
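
The `Invalid arguments for find_user` errors in A2 and A3 are what Python reports when a function receives a keyword it does not accept. Below is a minimal sketch of that failure mode, assuming tools are methods on a registry object called with `**arguments`; the class body and the `dispatch` helper are illustrative stand-ins, not the project's `ToolRegistry`.

```python
class ToolRegistry:
    """Stand-in registry with a single tool; the real class is not shown here."""

    def find_user(self, email: str) -> dict:
        return {"status": "found", "email": email}


def dispatch(registry: ToolRegistry, call: dict) -> dict:
    """Run one parsed tool call and turn argument errors into a result dict."""
    name = call["name"]
    tool = getattr(registry, name, None)
    if tool is None:
        return {"error": f"Unknown tool: {name}"}
    try:
        return tool(**call.get("arguments", {}))
    except TypeError as exc:
        # Raised when the model supplies a keyword the tool does not define.
        return {"error": f"Invalid arguments for {name}: {exc}"}


registry = ToolRegistry()
# A3 called find_user with user_id, which this tool does not accept.
bad = dispatch(registry, {"name": "find_user", "arguments": {"user_id": 42}})
good = dispatch(registry, {"name": "find_user",
                           "arguments": {"email": "john@example.com"}})
print(bad)
print(good)
```

In the tool benchmark, `find_user` is called with `email` (TC5) and `get_user` with `user_id` (TC6), so the A3 plan appears to have picked the wrong tool for a lookup by ID.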


                     📊  AGENT BENCHMARK REPORT                    
---------------------------------------------------------------------
Model                              | Score        | Avg Latency  | Tests
---------------------------------------------------------------------
qwen2.5-coder:0.5b-instruct-q4_k_m |      66.67% |       74.57s |      3
---------------------------------------------------------------------


                   📊  INSTRUCT BENCHMARK REPORT                   
---------------------------------------------------------------------
Model                              | Score        | Avg Latency  | Tests
---------------------------------------------------------------------
qwen2.5-coder:0.5b-instruct-q4_k_m |      76.00% |        1.76s |     25
granite4:350m                      |      60.00% |        1.55s |     25
---------------------------------------------------------------------

🏆  Best Model: qwen2.5-coder:0.5b-instruct-q4_k_m - 76.00%


                    🛠️  TOOL BENCHMARK REPORT                    
---------------------------------------------------------------------
Model                              | Score        | Avg Latency  | Tests
---------------------------------------------------------------------
qwen2.5-coder:0.5b-instruct-q4_k_m |      68.00% |       30.18s |     25
granite4:350m                      |      64.00% |       20.34s |     24
---------------------------------------------------------------------

🏆  Best Model: qwen2.5-coder:0.5b-instruct-q4_k_m - 68.00%
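
The "Best Model" line follows from the report rows. This sketch assumes the winner is simply the highest score; how ties are broken is not shown in the output, so using lower latency as a tie-breaker is an assumption.

```python
# Rows copied from the tool benchmark report above.
rows = [
    {"model": "qwen2.5-coder:0.5b-instruct-q4_k_m", "score": 68.00, "avg_latency": 30.18},
    {"model": "granite4:350m", "score": 64.00, "avg_latency": 20.34},
]

# Highest score wins; on equal scores the lower average latency is preferred.
best = max(rows, key=lambda row: (row["score"], -row["avg_latency"]))
print(f"Best Model: {best['model']} - {best['score']:.2f}%")
# prints "Best Model: qwen2.5-coder:0.5b-instruct-q4_k_m - 68.00%"
```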
