This project is a simulation environment for evaluating the capabilities of Large Language Models (LLMs) in the context of deception and multi-level strategy games. The game of Coup is used as the testing ground for these evaluations. To see the full breakdown, including background & results, check out the report:
👉🏾 Blink Twice: Measuring Strategic Deception Amongst LLMs 👈🏾
The primary motivation for this experiment is to create a controlled environment to study how LLMs perform in scenarios that require:
- Deception: The ability to bluff and mislead other players.
- Strategic Thinking: The ability to make multi-level decisions based on incomplete information and to anticipate the actions of other players.
- Adaptation: The ability to adjust strategies based on the actions of other players and the changing game state.
Read more about the project's motivation.
Well, being able to measure LLMs' tendency toward deception and longer-horizon strategy is useful, especially as they're released into the wild in environments that reward those behaviors. Coup gives us a toy environment where we can measure LLM capabilities for deception & strategy: first establishing a baseline for how well an individual model is able to compete, and then seeing how models play against one another.
| Pros | Cons |
|---|---|
| Rewards deception | Large element of chance based on initial conditions / "luck of the draw" |
| Requires some multi-stage thinking | Doesn't reward deception enough; it's somewhat easy to get far in the game just playing honestly |
Read more about why Coup was chosen.
We see that large reasoning models are especially adept at handling multi-level strategy and deception, but all of the LLMs tested performed well at Coup. Check out the full report.
You can find the output data, including logs and charts, here.
Hell yeah, we love reproducible experiments! Replicate all of our findings with models of your choice by running the game engine yourself.
- Clone the repository:

  git clone https://github.com/your-username/llm-coup.git
  cd llm-coup

- Install the dependencies:

  yarn install

- Set up your API keys:

  Create a `.env` file in the root of the repo and add your API keys for the LLM providers you want to use. For example:

  GEMINI_API_KEY="your-gemini-api-key"
  ANTHROPIC_API_KEY="your-anthropic-api-key"
  OPENAI_API_KEY="your-openai-api-key"
  GROK_API_KEY="your-grok-api-key"
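As a rough sketch of how those keys are consumed (this assumes the engine loads `.env` via the `dotenv` package and reads keys off `process.env`; the `requireKey` helper is hypothetical, not the project's actual code):

```ts
// Minimal sketch, assuming dotenv-based loading of the .env file above.
// `requireKey` is an illustrative helper, not part of the repo's API.
import "dotenv/config";

function requireKey(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing ${name}; add it to your .env file`);
  }
  return value;
}

// Only the providers you actually play with need a key, e.g.:
const geminiKey = requireKey("GEMINI_API_KEY");
console.log(`Loaded Gemini key of length ${geminiKey.length}`);
```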
To start a game, run the following command:
yarn play
This will start a new game with 4 players using the default model. You can customize the game by using the following options:
- `-p, --players <number>`: Number of players (2-6). Defaults to 4.
- `--player-model <provider:model_name>`: Specify a player model. This option can be used multiple times to have players with different models. For example: `--player-model gemini:gemini-pro --player-model anthropic:claude-3-sonnet-20240229`. If fewer models are provided than the number of players, models will be assigned to players in a round-robin fashion (see the sketch after this list).
- `--discussion`: Enable player discussions.
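The round-robin assignment works roughly as sketched below. This is illustrative TypeScript only; the type, function name, and default model value are assumptions rather than the engine's actual code.

```ts
// Sketch of round-robin model assignment: if fewer --player-model values
// are supplied than --players, models wrap around.
interface PlayerConfig {
  id: number;
  model: string; // e.g. "gemini:gemini-pro"
}

function assignModels(
  numPlayers: number,
  models: string[],
  defaultModel = "gemini:gemini-pro" // assumed default, not confirmed by the repo
): PlayerConfig[] {
  const pool = models.length > 0 ? models : [defaultModel];
  return Array.from({ length: numPlayers }, (_, i) => ({
    id: i + 1,
    model: pool[i % pool.length], // wrap around when models run out
  }));
}

// 3 players, 2 models -> player 3 reuses the first model.
console.log(assignModels(3, ["gemini:gemini-pro", "anthropic:claude-3-sonnet-20240229"]));
```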
Run a game with 3 players, two using Gemini and one using Anthropic:
yarn play --players 3 --player-model gemini:gemini-pro --player-model gemini:gemini-pro --player-model anthropic:claude-3-sonnet-20240229
Run a game with 5 players, all using Gemini Pro with different personalities:
yarn play --players 5 --player-model gemini:gemini-pro --personalities
- Multi-LLM Support: The simulation supports multiple LLM providers, including Gemini, Anthropic, and OpenAI.
- Self-Play: The simulation can be run with a single LLM playing against itself, allowing for the study of emergent strategies and behaviors.
- Personality Manipulation: Players can be assigned different personalities via prompts, allowing for the study of how different behavioral archetypes affect game dynamics.
- Public Discussion: Players can engage in public discussion to share information, negotiate, or deceive.
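To give a sense of the personality-manipulation feature, here is a hypothetical sketch of how a personality blurb might be layered onto a player's system prompt. The personality strings and the `buildSystemPrompt` helper are illustrative assumptions; see the repo's prompt files for the actual wording.

```ts
// Hypothetical sketch: a per-player personality blurb is prepended to the
// system prompt before each LLM call. Not the project's actual prompts.
const PERSONALITIES: Record<string, string> = {
  aggressive: "You bluff often and challenge other players readily.",
  cautious: "You avoid risky bluffs and only challenge when you are confident.",
};

function buildSystemPrompt(personality?: string): string {
  const base =
    "You are playing the card game Coup. Win by eliminating the other players' influence.";
  const blurb = personality ? PERSONALITIES[personality] : undefined;
  return blurb ? `${blurb}\n\n${base}` : base;
}

console.log(buildSystemPrompt("aggressive"));
```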