Lettertest - LLM Letter Counting Benchmark

How many of the letter "R" are in "Strawberry", and the like

Overview

This benchmark tests how accurately Large Language Models count letters in words. The hypothesis is that, because of tokenization, LLMs struggle to count a letter correctly when it appears three or more times in a word.

The benchmark:

  • Uses a comprehensive English word list (275,000+ words)
  • Filters for words with 3+ occurrences of a specific letter
  • Discourages chain-of-thought by requesting short responses and limiting response tokens
  • Tests various LLM providers (OpenAI, Anthropic)
  • Tracks accuracy and generates detailed reports

Installation

  1. Clone the repository:
git clone https://github.com/Pokebrouserkat/Lettertest.git
cd Lettertest
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up API keys:
cp .env.example .env
# Edit .env and add your API keys

Usage

Basic Usage

Run with OpenAI (default):

python benchmark.py

Run with Anthropic Claude:

python benchmark.py --provider anthropic

Advanced Options

python benchmark.py \
  --provider openai \
  --model gpt-4 \
  --num-samples 50 \
  --min-letter-count 3 \
  --min-word-length 5 \
  --max-word-length 15

Options

  • --provider: LLM provider (openai or anthropic)
  • --model: Specific model to test (e.g., gpt-4, gpt-3.5-turbo, claude-3-haiku-20240307)
  • --num-samples: Number of test cases to run (default: 20)
  • --min-letter-count: Minimum occurrences of a letter in test words (default: 3)
  • --min-word-length: Minimum word length (default: 5)
  • --max-word-length: Maximum word length (default: 15)
  • --no-api: Dry run to see test cases without calling APIs
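The options above map naturally onto an argparse parser. The following is a minimal sketch of how such a command line could be declared, not the actual contents of benchmark.py. The flag names and defaults come from the list above; the function name build_parser and the None default for --model are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (hypothetical helper, not from benchmark.py)."""
    parser = argparse.ArgumentParser(description="LLM letter counting benchmark")
    parser.add_argument("--provider", choices=["openai", "anthropic"], default="openai",
                        help="LLM provider")
    # Assumption: when --model is omitted, the provider's default model is used.
    parser.add_argument("--model", default=None, help="Specific model to test")
    parser.add_argument("--num-samples", type=int, default=20,
                        help="Number of test cases to run")
    parser.add_argument("--min-letter-count", type=int, default=3,
                        help="Minimum occurrences of a letter in test words")
    parser.add_argument("--min-word-length", type=int, default=5)
    parser.add_argument("--max-word-length", type=int, default=15)
    parser.add_argument("--no-api", action="store_true",
                        help="Dry run: show test cases without calling APIs")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--provider", "anthropic", "--num-samples", "50"])
print(args.provider, args.num_samples, args.min_letter_count, args.no_api)
```

argparse converts the dashes in flag names to underscores, so --num-samples is read back as args.num_samples.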

Dry Run (No API Calls)

To preview test cases without using API credits:

python benchmark.py --no-api

Example Output

Running benchmark with openai (gpt-3.5-turbo)
Test cases: 20
============================================================

Test 1/20: sassafras - letter 'S' (expected: 4)
  ✓ CORRECT - LLM answered: 4
  Response: 4

Test 2/20: bookkeeper - letter 'E' (expected: 3)
  ✗ WRONG - LLM answered: 2 (expected: 3)
  Response: 2

...

============================================================
BENCHMARK RESULTS
============================================================
Model: gpt-3.5-turbo
Total tests: 20
Correct: 12
Incorrect: 8
Accuracy: 60.00%
============================================================

Results

Results are automatically saved to results/benchmark_results_<timestamp>.json with detailed information about each test case.
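As a rough illustration of how a run could be summarized and written to a timestamped file, here is a minimal sketch. The JSON field names, the timestamp format, and the helpers summarize and save_results are all assumptions; the real file layout produced by benchmark.py may differ.

```python
import json
from datetime import datetime
from pathlib import Path


def summarize(model, results):
    """Compute the totals shown in the results banner (hypothetical field names)."""
    total = len(results)
    correct = sum(1 for r in results if r["answer"] == r["expected"])
    return {
        "model": model,
        "total": total,
        "correct": correct,
        "incorrect": total - correct,
        "accuracy": 100.0 * correct / total if total else 0.0,
        "cases": results,
    }


def save_results(summary, out_dir="results"):
    """Write the summary to <out_dir>/benchmark_results_<timestamp>.json."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    # Assumption: the timestamp format is illustrative only.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = out_path / f"benchmark_results_{stamp}.json"
    target.write_text(json.dumps(summary, indent=2))
    return target


# Twelve correct and eight wrong answers, matching the example run above.
results = [{"expected": 3, "answer": 3}] * 12 + [{"expected": 3, "answer": 2}] * 8
summary = summarize("gpt-3.5-turbo", results)
print(f"Accuracy: {summary['accuracy']:.2f}%")
```

Calling save_results(summary) would then create the results directory if needed and return the path of the file it wrote.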

How It Works

  1. Word List: Downloads a comprehensive English word list from dwyl/english-words (cached locally)

  2. Filtering: Identifies words where a specific letter appears 3+ times (configurable)

  3. Prompt Generation: Creates prompts like: How many of the letter "R" are in "strawberry"?

  4. LLM Testing:

    • Uses temperature=0 to make responses as repeatable as possible
    • Limits tokens to force short answers
    • System prompt requests concise numerical answers

  5. Evaluation: Extracts numbers from responses and compares to expected counts
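The filtering, prompt generation, and evaluation steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code in benchmark.py: the function names are hypothetical, the small word list stands in for the downloaded one, and taking the first integer in the response is only one possible extraction rule.

```python
import re
from collections import Counter


def find_test_cases(words, min_letter_count=3, min_word_length=5, max_word_length=15):
    """Return (word, letter, count) for every letter that repeats often enough."""
    cases = []
    for word in words:
        w = word.lower()
        if not w.isalpha() or not (min_word_length <= len(w) <= max_word_length):
            continue
        for letter, count in Counter(w).items():
            if count >= min_letter_count:
                cases.append((w, letter.upper(), count))
    return cases


def make_prompt(word, letter):
    """Build the question in the format shown in step 3."""
    return f'How many of the letter "{letter}" are in "{word}"?'


def extract_count(response):
    """Assumption: treat the first integer in the response as the model's answer."""
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None


# A tiny stand-in for the real word list.
cases = find_test_cases(["strawberry", "sassafras", "bookkeeper", "cat"])
for word, letter, expected in cases:
    print(make_prompt(word, letter), "expected:", expected)
```

Note that one word can yield several test cases: "sassafras" qualifies for both "S" (4) and "A" (3).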

Why This Test Is Challenging

LLMs process text as tokens, not individual characters. Words are often split into subword tokens (e.g., "strawberry" might be tokenized as ["straw", "berry"]). The model receives token IDs rather than letters, so it has no direct view of how a word is spelled. Counting a letter accurately would require breaking each token back into characters and analyzing them one by one, which is not how transformer models naturally operate.
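To make the point concrete, the toy sketch below contrasts what a character-level count sees with what a model is given. The ["straw", "berry"] split is the illustrative one from the paragraph above, not the output of any real tokenizer, and the vocabulary IDs are made up.

```python
# Illustrative split only; a real tokenizer may divide the word differently.
tokens = ["straw", "berry"]

# Hypothetical vocabulary: the model is fed these integer IDs, not the letters.
vocab = {"straw": 0, "berry": 1}
token_ids = [vocab[t] for t in tokens]

# Counting is trivial once the characters are visible again.
word = "".join(tokens)
per_token = [t.count("r") for t in tokens]

print("model input:", token_ids)
print("r per token:", per_token)
print("total r in", word, "=", sum(per_token))
```

The IDs carry no spelling information by themselves, so any letter count the model produces has to come from what it learned about each token during training.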

Environment Variables

Set these in your .env file:

  • OPENAI_API_KEY: Your OpenAI API key
  • ANTHROPIC_API_KEY: Your Anthropic API key
  • MAX_TOKENS: Maximum tokens in response (default: 50)
  • TEMPERATURE: Temperature for generation (default: 0)
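A minimal sketch of reading these settings with their documented defaults follows. It uses only the standard library and assumes the variables from .env have already been loaded into the process environment (for example by a dotenv loader); the function name load_settings and the dictionary keys are hypothetical.

```python
import os


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "openai_api_key": env.get("OPENAI_API_KEY"),
        "anthropic_api_key": env.get("ANTHROPIC_API_KEY"),
        "max_tokens": int(env.get("MAX_TOKENS", "50")),
        "temperature": float(env.get("TEMPERATURE", "0")),
    }


# Pass an explicit mapping here so the example does not depend on the real environment.
settings = load_settings({"MAX_TOKENS": "10"})
print(settings["max_tokens"], settings["temperature"])
```

Environment values are always strings, so the numeric settings are converted explicitly.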

Contributing

Contributions are welcome! Feel free to:

  • Add support for more LLM providers
  • Improve prompt engineering
  • Add additional test types
  • Enhance reporting and visualization

License

MIT License

This was made by Copilot
