A FairEval-style CLI tool for pairwise evaluation of LLM responses that mitigates positional bias using calibrated evaluation procedures.
This project is a practical implementation inspired by the research paper *Large Language Models are not Fair Evaluators*, which demonstrates that naïvely using LLMs as judges leads to severe and exploitable ordering bias, and proposes simple but effective calibration strategies to fix it.
Most “LLM-as-a-judge” setups do something like:
- Show a prompt
- Show Response A then Response B
- Ask an LLM which is better
The paper shows this is fundamentally flawed:
- Merely swapping the order of responses can flip the evaluation result.
- Strong models (GPT-4) and weaker ones (ChatGPT) both exhibit positional bias.
- Single-pass judgments are unstable and self-contradictory.
FairEval-CLI treats evaluation as a statistical procedure, not a single API call.
FairEval-CLI performs pairwise response evaluation using two key calibration strategies from the paper:

**Multiple Evidence Calibration (MEC).** The evaluator is forced to:

- Generate evaluation evidence first
- Then assign numeric scores

This process is repeated `k` times to reduce variance.
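As a rough sketch (the prompt wording, the `callModel` helper, and the JSON response shape here are illustrative assumptions, not the tool's exact internals), an evidence-first sampling loop can look like this:

```typescript
// Hypothetical sketch: sample the judge k times, asking for evidence
// *before* scores so a verdict cannot be emitted first.
interface EvalSample {
  evidence: string; // free-text justification, generated first
  scoreA: number;   // rating for response A
  scoreB: number;   // rating for response B
}

async function sampleEvaluations(
  prompt: string,
  a: string,
  b: string,
  k: number,
  callModel: (msg: string) => Promise<string>, // assumed LLM client
): Promise<EvalSample[]> {
  const instruction =
    `Question: ${prompt}\n\nResponse A:\n${a}\n\nResponse B:\n${b}\n\n` +
    `First write your evaluation evidence, then output JSON: ` +
    `{"evidence": "...", "scoreA": 1-10, "scoreB": 1-10}`;

  const samples: EvalSample[] = [];
  for (let i = 0; i < k; i++) {
    // Temperature > 0 on the client side keeps the k samples diverse.
    const raw = await callModel(instruction);
    samples.push(JSON.parse(raw) as EvalSample); // real code parses defensively
  }
  return samples;
}
```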
**Balanced Position Calibration (BPC).** Each comparison is evaluated twice:

- (A, B)
- (B, A)

Scores are mapped back and averaged across both positions, canceling positional bias.
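BPC then reduces to averaging over both orderings. A minimal sketch, reusing the hypothetical `sampleEvaluations` helper above:

```typescript
// Hypothetical sketch: evaluate both orderings and map scores back so
// each response is judged equally often in the first and second slot.
async function balancedScores(
  prompt: string,
  a: string,
  b: string,
  k: number,
  callModel: (msg: string) => Promise<string>,
): Promise<{ scoreA: number; scoreB: number }> {
  const forward = await sampleEvaluations(prompt, a, b, k, callModel);
  const swapped = await sampleEvaluations(prompt, b, a, k, callModel);

  // In the swapped runs, "scoreA" actually rated response B, so swap back.
  const aScores = [...forward.map(s => s.scoreA), ...swapped.map(s => s.scoreB)];
  const bScores = [...forward.map(s => s.scoreB), ...swapped.map(s => s.scoreA)];

  const mean = (xs: number[]) => xs.reduce((acc, x) => acc + x, 0) / xs.length;
  return { scoreA: mean(aScores), scoreB: mean(bScores) };
}
```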
The final output includes:
- Calibrated scores for A and B
- Win / lose / tie decision
- Confidence estimate based on disagreement across samples
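For illustration only, the `--json` result might be shaped roughly like this (field names are assumptions, not the tool's documented schema):

```typescript
// Illustrative shape only; consult the actual --json output for real field names.
interface PairwiseResult {
  winner: "A" | "B" | "tie";
  scoreA: number; // calibrated mean score for response A
  scoreB: number; // calibrated mean score for response B
  confidence: "high" | "medium" | "low"; // from cross-sample disagreement
}
```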
Key features:

- Evidence-first evaluation prompts (no verdict-first bias)
- Multiple sampling per position (`--k`)
- Automatic position swapping and aggregation
- Human-readable or JSON output
- File, inline, or stdin inputs
- OpenAI-compatible (supports gateways via `OPENAI_BASE_URL`)
Requirements:

- Node.js 18+ (uses built-in `fetch`)
- An OpenAI-compatible API key
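Since Node 18+ ships a global `fetch`, an OpenAI-compatible client needs no SDK. A minimal sketch of such a call against the standard chat-completions endpoint (error handling and retries elided):

```typescript
// Minimal sketch of an OpenAI-compatible request using Node 18+ built-in fetch.
// OPENAI_BASE_URL lets the same code talk to proxies and gateways.
async function chat(messages: { role: string; content: string }[]): Promise<string> {
  const base = process.env.OPENAI_BASE_URL ?? "https://api.openai.com/v1";
  const res = await fetch(`${base}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4", messages, temperature: 1.0 }),
  });
  if (!res.ok) throw new Error(`LLM call failed: ${res.status}`);
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```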
Install:

```bash
git clone https://github.com/pertrai1/ai-projects.git
cd ai-projects/research-paper-projects/fair-eval-cli
npm install
npm run build
```

Create a `.env` file:

```bash
OPENAI_API_KEY="your_api_key_here"
# Optional (for proxies / gateways):
OPENAI_BASE_URL="https://api.openai.com/v1"
```

Basic usage:

```bash
node dist/index.js eval \
  --prompt "Explain closures in JavaScript" \
  --a-file response_a.txt \
  --b-file response_b.txt
```

With full options:

```bash
node dist/index.js eval \
  --prompt-file prompt.txt \
  --a-file response_a.txt \
  --b-file response_b.txt \
  --criteria helpfulness,accuracy,clarity \
  --k 3 \
  --temperature 1.0 \
  --model gpt-4 \
  --verbose
```

Reading Response A from stdin:

```bash
cat response_a.txt | node dist/index.js eval \
  --prompt "What is dependency injection?" \
  --a-file - \
  --b-file response_b.txt
```

JSON output:

```bash
node dist/index.js eval \
  --prompt "Write a SQL query to find duplicates" \
  --a-file a.txt \
  --b-file b.txt \
  --json
```

Sample output:
```text
Result
  Winner: A wins
  Score (A): 8.12
  Score (B): 7.01
  Confidence: high

Run settings
  Model: gpt-4
  k: 3
  Temperature: 1
  Criteria: helpfulness, accuracy, clarity
```
Confidence is derived from the variance of score differences across all calibrated samples:
- High: very stable across runs
- Medium: some disagreement
- Low: unstable → human review recommended
This is a lightweight analogue of the paper’s BPDE metric.
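As a sketch, such a bucketing might look like the following (the variance thresholds are illustrative assumptions, not the tool's actual cutoffs):

```typescript
// Hypothetical sketch: bucket confidence by the variance of per-sample
// score differences (scoreA - scoreB) across all calibrated samples.
function confidenceFromSamples(diffs: number[]): "high" | "medium" | "low" {
  const mean = diffs.reduce((acc, d) => acc + d, 0) / diffs.length;
  const variance =
    diffs.reduce((acc, d) => acc + (d - mean) ** 2, 0) / diffs.length;

  if (variance < 0.5) return "high";   // samples agree closely
  if (variance < 2.0) return "medium"; // some disagreement
  return "low";                        // unstable: recommend human review
}
```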
```text
src/
├── index.ts               # CLI entry point
├── core/
│   ├── prompts.ts         # Evidence-first prompt templates
│   ├── evaluator.ts       # OpenAI-compatible LLM client
│   ├── parser.ts          # Strict JSON parsing + validation
│   └── runPairwiseEval.ts # MEC + BPC implementation
└── utils/
    ├── readInput.ts       # File / stdin handling
    └── math.ts            # Mean / variance helpers
```
Limitations:

- No human-in-the-loop calibration (HITLC) yet
- Cost scales with 2 × k evaluator calls
- Assumes the evaluator follows JSON constraints (defensive parsing mitigates this; see the sketch below)
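A minimal sketch of the kind of defensive parsing this implies (extracting the outermost JSON object from a possibly chatty reply and validating its fields; the actual behavior of `parser.ts` may differ):

```typescript
// Hypothetical sketch: tolerate models that wrap JSON in prose or code fences.
function parseEvalJson(raw: string): { scoreA: number; scoreB: number } {
  // Grab the outermost {...} span in case the model added surrounding text.
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("No JSON object found in evaluator reply");

  const parsed = JSON.parse(match[0]) as Record<string, unknown>;
  const scoreA = Number(parsed.scoreA);
  const scoreB = Number(parsed.scoreB);

  // Validate so a malformed reply fails loudly instead of skewing results.
  if (!Number.isFinite(scoreA) || !Number.isFinite(scoreB)) {
    throw new Error("Evaluator reply missing numeric scores");
  }
  return { scoreA, scoreB };
}
```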
This tool is directly inspired by:

Peiyi Wang et al., "Large Language Models are not Fair Evaluators," arXiv:2305.17926.

If you use this project in research or tooling, please cite the paper.
LLMs are useful but noisy annotators.
If you treat them like a single, deterministic oracle, you will get misleading results. If you treat them like a biased, stochastic process—and calibrate accordingly—you get something genuinely useful.