How many of the letter "R" are in "Strawberry", and the like
This benchmark tests Large Language Models' ability to count letters in words. The hypothesis is that due to tokenization, LLMs struggle to accurately count letters when a word contains more than 2 instances of a specific letter.
The benchmark:
- Uses a comprehensive English word list (275,000+ words)
- Filters for words with 3+ occurrences of a specific letter (see the sketch after this list)
- Disables chain-of-thought by requesting short responses
- Tests various LLM providers (OpenAI, Anthropic)
- Tracks accuracy and generates detailed reports
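As a rough illustration of the word filtering described above, the sketch below selects words in which some letter appears at least three times. It is a hypothetical stand-in, not the actual logic in benchmark.py:

```python
from collections import Counter

def find_test_words(words, min_letter_count=3, min_len=5, max_len=15):
    """Yield (word, letter, count) for words whose most frequent letter repeats enough.

    Hypothetical filter; the real benchmark.py may pick letters differently.
    """
    for word in words:
        if not word.isalpha() or not (min_len <= len(word) <= max_len):
            continue
        letter, count = Counter(word.lower()).most_common(1)[0]
        if count >= min_letter_count:
            yield word, letter.upper(), count

# "sassafras" qualifies (S appears 4 times); "apple" does not.
print(list(find_test_words(["sassafras", "apple", "bookkeeper"])))
```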
To get started:

- Clone the repository:

```bash
git clone https://github.com/Pokebrouserkat/Lettertest.git
cd Lettertest
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up API keys:

```bash
cp .env.example .env
# Edit .env and add your API keys
```

Run with OpenAI (default):

```bash
python benchmark.py
```

Run with Anthropic Claude:

```bash
python benchmark.py --provider anthropic
```

Run with custom options:

```bash
python benchmark.py \
  --provider openai \
  --model gpt-4 \
  --num-samples 50 \
  --min-letter-count 3 \
  --min-word-length 5 \
  --max-word-length 15
```

Available options:

- `--provider`: LLM provider (`openai` or `anthropic`)
- `--model`: Specific model to test (e.g., `gpt-4`, `gpt-3.5-turbo`, `claude-3-haiku-20240307`)
- `--num-samples`: Number of test cases to run (default: 20)
- `--min-letter-count`: Minimum occurrences of a letter in test words (default: 3)
- `--min-word-length`: Minimum word length (default: 5)
- `--max-word-length`: Maximum word length (default: 15)
- `--no-api`: Dry run to see test cases without calling APIs
To preview test cases without using API credits:

```bash
python benchmark.py --no-api
```

Example output:

```text
Running benchmark with openai (gpt-3.5-turbo)
Test cases: 20
============================================================
Test 1/20: sassafras - letter 'S' (expected: 4)
✓ CORRECT - LLM answered: 4
Response: 4

Test 2/20: bookkeeper - letter 'E' (expected: 3)
✗ WRONG - LLM answered: 2 (expected: 3)
Response: 2

...

============================================================
BENCHMARK RESULTS
============================================================
Model: gpt-3.5-turbo
Total tests: 20
Correct: 12
Incorrect: 8
Accuracy: 60.00%
============================================================
```
Results are automatically saved to `results/benchmark_results_<timestamp>.json` with detailed information about each test case.
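The exact schema of the results file is not documented here, so the snippet below only loads the most recent file and prints it for inspection; the path pattern is the one mentioned above:

```python
import glob
import json

# Pick the newest results file written by the benchmark.
latest = sorted(glob.glob("results/benchmark_results_*.json"))[-1]
with open(latest) as f:
    data = json.load(f)

# Pretty-print the first part of the structure to see what was recorded.
print(json.dumps(data, indent=2)[:2000])
```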
The benchmark works as follows:

- Word List: Downloads a comprehensive English word list from dwyl/english-words (cached locally)
- Filtering: Identifies words where a specific letter appears 3+ times (configurable)
- Prompt Generation: Creates prompts like: `How many of the letter "R" are in "strawberry"?`
- LLM Testing:
  - Uses temperature=0 for deterministic responses
  - Limits tokens to force short answers
  - System prompt requests concise numerical answers
- Evaluation: Extracts numbers from responses and compares them to the expected counts (see the sketch after this list)
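To make the testing and evaluation steps concrete, here is a minimal sketch using the official OpenAI Python client. The prompt wording, system message, and number parsing are assumptions for illustration, not the exact code in benchmark.py:

```python
import re
from openai import OpenAI  # official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def count_letter_via_llm(word: str, letter: str, model: str = "gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model,
        temperature=0,   # deterministic responses
        max_tokens=50,   # force a short answer, discouraging chain-of-thought
        messages=[
            {"role": "system", "content": "Answer with a single number only."},
            {"role": "user", "content": f'How many of the letter "{letter}" are in "{word}"?'},
        ],
    )
    text = response.choices[0].message.content or ""
    match = re.search(r"\d+", text)  # evaluation: extract the first number in the reply
    return int(match.group()) if match else None

expected = "strawberry".count("r")  # ground truth: 3
answer = count_letter_via_llm("strawberry", "R")
print("correct" if answer == expected else f"wrong (got {answer}, expected {expected})")
```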
LLMs process text as tokens, not individual characters. Words are often split into subword tokens (e.g., "strawberry" might be tokenized as ["straw", "berry"]). This makes it difficult for the model to accurately count individual letters without explicitly detokenizing and analyzing character by character, which is not how transformer models naturally operate.
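To see this splitting directly, the `tiktoken` library (not a dependency of this benchmark, used purely for illustration) can show which subword pieces a model actually receives:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["strawberry", "bookkeeper"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([tid]) for tid in token_ids]
    # The model "sees" these pieces, not individual characters.
    print(word, "->", pieces)
```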
Set these in your `.env` file:

- `OPENAI_API_KEY`: Your OpenAI API key
- `ANTHROPIC_API_KEY`: Your Anthropic API key
- `MAX_TOKENS`: Maximum tokens in response (default: 50)
- `TEMPERATURE`: Temperature for generation (default: 0)
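A minimal sketch of how these values could be read at startup, assuming the `python-dotenv` package (the actual loading code in benchmark.py may differ):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # copies key=value pairs from .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "50"))      # default: 50
TEMPERATURE = float(os.getenv("TEMPERATURE", "0"))   # default: 0
```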
Contributions are welcome! Feel free to:
- Add support for more LLM providers
- Improve prompt engineering
- Add additional test types
- Enhance reporting and visualization
MIT License