How many of the letter "R" are in "Strawberry", and the like
This benchmark tests Large Language Models' ability to count letters in words. The hypothesis is that due to tokenization, LLMs struggle to accurately count letters when a word contains more than 2 instances of a specific letter.
The benchmark:
- Uses a comprehensive English word list (275,000+ words)
- Filters for words with 3+ occurrences of a specific letter (see the sketch after this list)
- Disables chain-of-thought by requesting short responses
- Tests various LLM providers (OpenAI, Anthropic)
- Tracks accuracy and generates detailed reports
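As a rough illustration of the word filtering described above, the sketch below selects words in which some letter appears at least three times. It is a hypothetical stand-in, not the actual logic in benchmark.py:

```python
from collections import Counter

def find_test_words(words, min_letter_count=3, min_len=5, max_len=15):
    """Yield (word, letter, count) for words whose most frequent letter repeats enough.

    Hypothetical filter; the real benchmark.py may pick letters differently.
    """
    for word in words:
        if not word.isalpha() or not (min_len <= len(word) <= max_len):
            continue
        letter, count = Counter(word.lower()).most_common(1)[0]
        if count >= min_letter_count:
            yield word, letter.upper(), count

# "sassafras" qualifies (S appears 4 times); "apple" does not.
print(list(find_test_words(["sassafras", "apple", "bookkeeper"])))
```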
To get started:

- Clone the repository:

```bash
git clone https://github.com/Pokebrouserkat/Lettertest.git
cd Lettertest
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up API keys:

```bash
cp .env.example .env
# Edit .env and add your API keys
```

Run with OpenAI (default):

```bash
python benchmark.py
```

Run with Anthropic Claude:

```bash
python benchmark.py --provider anthropic
```

Run with custom options:

```bash
python benchmark.py \
  --provider openai \
  --model gpt-4 \
  --num-samples 50 \
  --min-letter-count 3 \
  --min-word-length 5 \
  --max-word-length 15
```

Available options:

- `--provider`: LLM provider (`openai` or `anthropic`)
- `--model`: Specific model to test (e.g., `gpt-4`, `gpt-3.5-turbo`, `claude-3-haiku-20240307`)
- `--num-samples`: Number of test cases to run (default: 20)
- `--min-letter-count`: Minimum occurrences of a letter in test words (default: 3)
- `--min-word-length`: Minimum word length (default: 5)
- `--max-word-length`: Maximum word length (default: 15)
- `--no-api`: Dry run to see test cases without calling APIs
To preview test cases without using API credits:

```bash
python benchmark.py --no-api
```

Example output:

```text
Running benchmark with openai (gpt-3.5-turbo)
Test cases: 20
============================================================
Test 1/20: sassafras - letter 'S' (expected: 4)
✓ CORRECT - LLM answered: 4
Response: 4

Test 2/20: bookkeeper - letter 'E' (expected: 3)
✗ WRONG - LLM answered: 2 (expected: 3)
Response: 2

...

============================================================
BENCHMARK RESULTS
============================================================
Model: gpt-3.5-turbo
Total tests: 20
Correct: 12
Incorrect: 8
Accuracy: 60.00%
============================================================
```
Results are automatically saved to `results/benchmark_results_<timestamp>.json` with detailed information about each test case.
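The exact schema of the results file is not documented here, so the snippet below only loads the most recent file and prints it for inspection; the path pattern is the one mentioned above:

```python
import glob
import json

# Pick the newest results file written by the benchmark.
latest = sorted(glob.glob("results/benchmark_results_*.json"))[-1]
with open(latest) as f:
    data = json.load(f)

# Pretty-print the first part of the structure to see what was recorded.
print(json.dumps(data, indent=2)[:2000])
```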
The benchmark works as follows:

- Word List: Downloads a comprehensive English word list from dwyl/english-words (cached locally)
- Filtering: Identifies words where a specific letter appears 3+ times (configurable)
- Prompt Generation: Creates prompts like: `How many of the letter "R" are in "strawberry"?`
- LLM Testing:
  - Uses temperature=0 for deterministic responses
  - Limits tokens to force short answers
  - System prompt requests concise numerical answers
- Evaluation: Extracts numbers from responses and compares them to the expected counts (see the sketch after this list)
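To make the testing and evaluation steps concrete, here is a minimal sketch using the official OpenAI Python client. The prompt wording, system message, and number parsing are assumptions for illustration, not the exact code in benchmark.py:

```python
import re
from openai import OpenAI  # official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def count_letter_via_llm(word: str, letter: str, model: str = "gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model,
        temperature=0,   # deterministic responses
        max_tokens=50,   # force a short answer, discouraging chain-of-thought
        messages=[
            {"role": "system", "content": "Answer with a single number only."},
            {"role": "user", "content": f'How many of the letter "{letter}" are in "{word}"?'},
        ],
    )
    text = response.choices[0].message.content or ""
    match = re.search(r"\d+", text)  # evaluation: extract the first number in the reply
    return int(match.group()) if match else None

expected = "strawberry".count("r")  # ground truth: 3
answer = count_letter_via_llm("strawberry", "R")
print("correct" if answer == expected else f"wrong (got {answer}, expected {expected})")
```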
LLMs process text as tokens, not individual characters. Words are often split into subword tokens (e.g., "strawberry" might be tokenized as ["straw", "berry"]). This makes it difficult for the model to accurately count individual letters without explicitly detokenizing and analyzing character by character, which is not how transformer models naturally operate.
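To see this splitting directly, the `tiktoken` library (not a dependency of this benchmark, used purely for illustration) can show which subword pieces a model actually receives:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["strawberry", "bookkeeper"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([tid]) for tid in token_ids]
    # The model "sees" these pieces, not individual characters.
    print(word, "->", pieces)
```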
Set these in your `.env` file:

- `OPENAI_API_KEY`: Your OpenAI API key
- `ANTHROPIC_API_KEY`: Your Anthropic API key
- `MAX_TOKENS`: Maximum tokens in response (default: 50)
- `TEMPERATURE`: Temperature for generation (default: 0)
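A minimal sketch of how these values could be read at startup, assuming the `python-dotenv` package (the actual loading code in benchmark.py may differ):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # copies key=value pairs from .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "50"))      # default: 50
TEMPERATURE = float(os.getenv("TEMPERATURE", "0"))   # default: 0
```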
Contributions are welcome! Feel free to:
- Add support for more LLM providers
- Improve prompt engineering
- Add additional test types
- Enhance reporting and visualization
MIT License