llm-bias-auditor

Bias auditing service for large language models using controlled prompt variation and disparity metrics.

Why This Exists

LLMs can exhibit demographic bias in high-stakes scenarios like hiring, lending, and recommendations. This tool provides a systematic way to measure and document potential disparities in model behavior across demographic groups before deployment. It's designed for ML engineers, researchers, and organizations who need quantitative fairness evaluation as part of responsible AI workflows.

Quick Start (No LLM Required)

The fastest way to explore this tool is with the built-in mock backend—no LLM installation or API keys needed:

# Install dependencies
pip install -r requirements.txt

# Run with mock backend
LLM_BACKEND=mock uvicorn app.main:app --reload

# Try the API
curl http://localhost:8000/health
# {"status":"healthy"}

curl http://localhost:8000/backend
# {"backend":"mock","model":"mock-model","base_url":null}

curl -X POST "http://localhost:8000/audit" \
  -H "Content-Type: application/json" \
  -d '{"scenario": "hiring"}'
# Returns full audit report with metrics (see below)

Why mock mode exists: It enables instant experimentation, CI/CD testing, and demonstrations without external dependencies. The mock backend generates deterministic, realistic responses for all scenarios, allowing you to understand the audit workflow and API structure before connecting to a real LLM.

The full test suite also runs without an LLM:

pytest  # Uses mock backend automatically
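For scripted use, the Quick Start calls can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_audit_request` and `run_audit` are illustrative and not part of this repository, and the base URL assumes the default uvicorn address shown above.

```python
import json
import urllib.request

# Assumed: the service is running locally on uvicorn's default port.
BASE_URL = "http://localhost:8000"


def build_audit_request(scenario: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST /audit request that the curl example sends."""
    body = json.dumps({"scenario": scenario}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/audit",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_audit(scenario: str, base_url: str = BASE_URL, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON report (needs a running service)."""
    request = build_audit_request(scenario, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the mock backend running, `run_audit("hiring")["summary"]` would give the summary block of the report.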

Example Response

{
  "audit_id": "a1b2c3d4...",
  "timestamp": "2026-01-13T10:30:00Z",
  "backend": "mock",
  "model": "mock-model",
  "scenario": "hiring",
  "num_prompts": 24,
  "metrics": {
    "length_disparity": {
      "disparity_score": 0.0,
      "interpretation": "Low disparity - responses are relatively consistent in length",
      "group_means": {...},
      "overall_mean": 89.0
    },
    "refusal_disparity": {
      "disparity_score": 0.0,
      "interpretation": "Low disparity - refusal rates are similar across groups",
      "group_rates": {...}
    },
    "sentiment_disparity": {
      "disparity_score": 0.0,
      "interpretation": "Low disparity - sentiment is consistent across groups",
      "group_means": {...}
    }
  },
  "summary": {
    "overall_assessment": "Low disparity observed",
    "concerns": [],
    "recommendation": "Disparity metrics are within acceptable ranges for tested scenarios"
  },
  "responses": [...]
}
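A report shaped like the one above is easy to post-process, for example in a CI gate. The sketch below walks the `metrics` block and lists every metric whose `disparity_score` exceeds a cutoff. The cutoff of 0.1 and the function name `flag_metrics` are assumptions made for illustration; this README does not state which thresholds the service itself uses.

```python
from typing import Any

# Assumed cutoff for illustration only; not a threshold documented by the service.
FLAG_THRESHOLD = 0.1


def flag_metrics(report: dict[str, Any], threshold: float = FLAG_THRESHOLD) -> list[str]:
    """Return the names of metrics whose disparity_score exceeds the threshold."""
    flagged = []
    for name, metric in report.get("metrics", {}).items():
        if metric.get("disparity_score", 0.0) > threshold:
            flagged.append(name)
    return sorted(flagged)


# Trimmed report with one made-up non-zero score to show a flagged metric.
sample_report = {
    "scenario": "hiring",
    "metrics": {
        "length_disparity": {"disparity_score": 0.0},
        "refusal_disparity": {"disparity_score": 0.25},
        "sentiment_disparity": {"disparity_score": 0.0},
    },
}

print(flag_metrics(sample_report))  # ['refusal_disparity']
```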

Technical Stack

FastAPI: REST API framework
Mock Backend: Built-in deterministic responses (no dependencies)
Ollama: Local LLM backend (optional, no API key required)
OpenAI-compatible endpoints: Optional cloud LLM integration
Python standard libraries: Core logic with simple NLP metrics

Using Real LLMs

Option 1: Local with Ollama (Recommended)

# Install Ollama: https://ollama.ai

# Pull a model (examples for 8GB Macs)
ollama pull llama3.2:3b    # or qwen2.5:1.5b, phi3:mini

# Run the service
uvicorn app.main:app --reload

# Run an audit
curl -X POST "http://localhost:8000/audit" \
  -H "Content-Type: application/json" \
  -d '{"scenario": "hiring", "model": "llama3.2:3b"}'

Option 2: Cloud with OpenAI

# Set up environment
cp .env.example .env
# Edit .env:
#   LLM_BACKEND=openai
#   OPENAI_API_KEY=your-key-here
#   OPENAI_MODEL=gpt-4o-mini

# Run the service
uvicorn app.main:app --reload

Option 3: Remote Ollama

# On remote machine with GPU: ollama serve
# Local SSH port forward:
ssh -L 11434:localhost:11434 user@remote-host

# In .env:
#   OLLAMA_BASE_URL=http://localhost:11434
#   OLLAMA_MODEL=llama3.2:3b
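The three options differ only in environment variables, so backend selection can be pictured as a small lookup. This is a hedged sketch, not the service's actual configuration code: `BackendConfig` and `load_backend_config` are hypothetical names, the fallback to Ollama is inferred from Option 1 starting the service without `LLM_BACKEND`, and the default model names are simply the ones used in the examples above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BackendConfig:
    backend: str
    model: str
    base_url: Optional[str] = None


def load_backend_config(env: Mapping[str, str] = os.environ) -> BackendConfig:
    """Resolve backend settings from the variables named in the options above."""
    # Assumption: Ollama is the fallback when LLM_BACKEND is unset.
    backend = env.get("LLM_BACKEND", "ollama").lower()
    if backend == "mock":
        return BackendConfig("mock", "mock-model")
    if backend == "openai":
        if not env.get("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY must be set when LLM_BACKEND=openai")
        # Assumed default model, taken from the .env example.
        return BackendConfig("openai", env.get("OPENAI_MODEL", "gpt-4o-mini"))
    if backend == "ollama":
        return BackendConfig(
            "ollama",
            env.get("OLLAMA_MODEL", "llama3.2:3b"),  # assumed default
            env.get("OLLAMA_BASE_URL", "http://localhost:11434"),  # Ollama's default port
        )
    raise ValueError(f"Unknown LLM_BACKEND: {backend!r}")
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the real environment.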

API Endpoints

`GET /health`

Health check returning service status.

`GET /backend`

Returns the currently configured LLM backend and model:

{
  "backend": "mock",
  "model": "mock-model",
  "base_url": null
}

`GET /scenarios`

Lists available audit scenarios with descriptions.

`POST /audit`

Runs a bias audit for the specified scenario.

Request:

{
  "scenario": "hiring",
  "model": "llama3.2:3b",
  "temperature": 0.7,
  "max_tokens": 300
}

Parameters:

scenario (required): "hiring", "recommendation", or "credit"
model (optional): Override default model
temperature (optional): Sampling temperature (0-2, default: 0.7)
max_tokens (optional): Max response tokens (50-2000, default: 300)
attributes (optional): Demographic groups to test (default: all)
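The documented ranges can be checked on the client side before a request is sent. The dataclass below is an illustrative sketch that mirrors the parameter list above; `AuditRequest` and `to_payload` are hypothetical names, and the service's own validation may differ.

```python
from dataclasses import dataclass
from typing import Optional

SCENARIOS = {"hiring", "recommendation", "credit"}


@dataclass
class AuditRequest:
    scenario: str
    model: Optional[str] = None
    temperature: float = 0.7
    max_tokens: int = 300
    attributes: Optional[list[str]] = None  # None means "test all groups"

    def __post_init__(self) -> None:
        # Ranges copied from the parameter list above.
        if self.scenario not in SCENARIOS:
            raise ValueError(f"scenario must be one of {sorted(SCENARIOS)}")
        if not 0 <= self.temperature <= 2:
            raise ValueError("temperature must be between 0 and 2")
        if not 50 <= self.max_tokens <= 2000:
            raise ValueError("max_tokens must be between 50 and 2000")

    def to_payload(self) -> dict:
        """Build the JSON body, omitting unset optional fields."""
        payload = {
            "scenario": self.scenario,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
        }
        if self.model is not None:
            payload["model"] = self.model
        if self.attributes is not None:
            payload["attributes"] = self.attributes
        return payload
```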

How It Works

Controlled Variation: Generates prompts varying only demographic attributes (name signals for gender/ethnicity)
Consistent Evaluation: Queries the same LLM with identical contexts and qualifications
Disparity Metrics: Compares response characteristics across demographic groups
Clear Reporting: Outputs interpretable JSON reports with actionable summaries
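The controlled-variation step can be sketched as filling a single template with different names while every other field stays fixed. The template text, group labels, and placeholder names below are all hypothetical and chosen only for illustration; they are not the prompts or name lists this tool ships with.

```python
# Hypothetical template: only {name} changes between prompts.
TEMPLATE = (
    "Evaluate this applicant for a {role} position. "
    "Name: {name}. Experience: {years} years. Provide a short assessment."
)

# Placeholder names standing in for name-based demographic signals.
NAMES_BY_GROUP = {
    "group_a": ["NAME_A1", "NAME_A2"],
    "group_b": ["NAME_B1", "NAME_B2"],
}


def build_prompts(role: str, years: int) -> list[dict]:
    """Create one prompt per name, holding qualifications constant."""
    prompts = []
    for group, names in NAMES_BY_GROUP.items():
        for name in names:
            prompts.append(
                {
                    "group": group,
                    "name": name,
                    "prompt": TEMPLATE.format(role=role, name=name, years=years),
                }
            )
    return prompts


prompts = build_prompts(role="data analyst", years=5)
# Masking the name leaves a single distinct prompt, i.e. nothing else varies.
masked = {p["prompt"].replace(p["name"], "<NAME>") for p in prompts}
print(len(prompts), len(masked))
```

Because the name is the only thing that differs, any systematic difference in responses between groups can be attributed to the name signal rather than to the qualifications.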

Metrics

Length Disparity

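Measures how much response length varies across demographic groups. The disparity score is a coefficient of variation (CV), that is, the standard deviation divided by the mean, so it expresses spread relative to the typical response length. Higher scores suggest differential treatment.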
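As a rough illustration of a CV-based length metric, the sketch below computes the mean response length per group and then the CV of those group means. Several details are assumptions rather than documented behavior: length is counted in characters, the CV is taken over group means using the population standard deviation, and the function name `length_disparity` is hypothetical.

```python
from statistics import mean, pstdev


def length_disparity(responses_by_group: dict[str, list[str]]) -> dict:
    """Coefficient of variation of per-group mean response length."""
    # Assumption: length is measured in characters.
    group_means = {
        group: mean(len(text) for text in texts)
        for group, texts in responses_by_group.items()
        if texts
    }
    values = list(group_means.values())
    overall = mean(values) if values else 0.0
    # CV = standard deviation / mean; defined as 0.0 when the mean is zero.
    score = pstdev(values) / overall if overall else 0.0
    return {
        "disparity_score": score,
        "group_means": group_means,
        "overall_mean": overall,
    }


example = length_disparity(
    {
        "group_a": ["aaaa", "aa"],      # mean length 3
        "group_b": ["aaaaaa", "aaaa"],  # mean length 5
    }
)
print(example["disparity_score"])  # 0.25
```

Here the group means are 3 and 5, so the overall mean is 4, the population standard deviation is 1, and the CV is 1 / 4 = 0.25.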

Refusal Rate Disparity

Tracks how often the model declines to respond for different groups. Computed as the difference between the highest and lowest group refusal rates.
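The range computation can be sketched as follows. How the tool decides that a response is a refusal is not described here, so the phrase list in `REFUSAL_MARKERS` is purely an assumption, as are the function names.

```python
# Assumed refusal phrases; the tool's real detection rule is not documented here.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def is_refusal(response: str) -> bool:
    """Treat a response as a refusal if it contains any marker phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_disparity(responses_by_group: dict[str, list[str]]) -> dict:
    """Range (max minus min) of per-group refusal rates."""
    group_rates = {
        group: sum(is_refusal(text) for text in texts) / len(texts)
        for group, texts in responses_by_group.items()
        if texts
    }
    rates = list(group_rates.values())
    score = max(rates) - min(rates) if rates else 0.0
    return {"disparity_score": score, "group_rates": group_rates}


result = refusal_disparity(
    {
        "group_a": ["Sure, here is my assessment.", "I cannot help with that."],
        "group_b": ["Strong candidate.", "Recommended for interview."],
    }
)
print(result)
```

In this toy input one of two responses for `group_a` is a refusal and none for `group_b`, so the rates are 0.5 and 0.0 and the score is 0.5.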

Sentiment Disparity

A keyword-based proxy: each response is scored from its counts of positive and negative words, and the scores are compared across groups to surface differences in tone.
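A keyword-count sentiment proxy might look like the sketch below. The word lists, the normalization `(pos - neg) / (pos + neg)`, and the use of the range of group means as the disparity score are all assumptions made for illustration; the tool's actual lexicon and formula are not specified in this README.

```python
import re
from statistics import mean

# Assumed keyword lists for illustration only.
POSITIVE = {"excellent", "strong", "qualified", "recommend", "impressive"}
NEGATIVE = {"weak", "concern", "unqualified", "risk", "lacking"}


def sentiment_score(response: str) -> float:
    """Score in [-1, 1]: (pos - neg) / (pos + neg), or 0.0 with no keyword hits."""
    words = re.findall(r"[a-z']+", response.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def sentiment_disparity(responses_by_group: dict[str, list[str]]) -> dict:
    """Range of per-group mean sentiment scores (assumed definition)."""
    group_means = {
        group: mean(sentiment_score(text) for text in texts)
        for group, texts in responses_by_group.items()
        if texts
    }
    values = list(group_means.values())
    score = max(values) - min(values) if values else 0.0
    return {"disparity_score": score, "group_means": group_means}


demo = sentiment_disparity(
    {
        "group_a": ["Strong and qualified candidate."],
        "group_b": ["Strong candidate, but one concern."],
    }
)
print(demo)
```

A proxy like this only sees listed keywords, so it misses negation and context, which is part of why the Limitations section treats these metrics as statistical proxies.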

Running on 8GB Macs

Use lightweight models: llama3.2:3b, qwen2.5:1.5b, or phi3:mini
Reduce max_tokens to 200 in requests
Prompts run sequentially by default (low memory pressure)
For larger models, use remote Ollama or cloud APIs

Testing

All tests run without external LLM dependencies:

# Run all tests (uses mock backend)
pytest

# Run only mock backend tests
pytest tests/test_mock_backend.py -v

# Verify no-LLM execution
LLM_BACKEND=mock pytest tests/test_mock_backend.py -v

The test suite explicitly verifies:

Mock backend initialization without dependencies
Deterministic response generation
Complete audit workflows for all scenarios
Proper metric calculations
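To make "deterministic response generation" concrete, here is one way a mock backend could behave. This is a guess at the idea, not the repository's implementation: the class shape, the `generate` signature, and the canned texts are hypothetical. Returning the same text for every prompt within a scenario would be consistent with the 0.0 disparity scores in the example response, but that link is an inference.

```python
class MockBackend:
    """Hypothetical mock: fixed text per scenario, no network and no model."""

    model = "mock-model"

    # Invented canned responses, one per documented scenario.
    _RESPONSES = {
        "hiring": "The applicant meets the stated requirements and merits an interview.",
        "recommendation": "Based on the stated preferences, this option is a reasonable fit.",
        "credit": "The application satisfies the listed criteria for further review.",
    }

    def generate(self, prompt: str, scenario: str) -> str:
        # The prompt is deliberately ignored so every demographic variant
        # receives identical text, which makes test outcomes reproducible.
        return self._RESPONSES[scenario]


backend = MockBackend()
first = backend.generate("Evaluate NAME_A1 ...", scenario="hiring")
second = backend.generate("Evaluate NAME_B1 ...", scenario="hiring")
print(first == second)  # True
```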

Responsible AI Considerations

What This Tool Does

Surfaces potential disparities in model behavior across demographic groups
Provides quantitative metrics for evaluation
Supports informed deployment decisions with documented evidence

What This Tool Does NOT Do

Prove causation or intent, or establish bias on its own
Replace human judgment or domain expertise
Guarantee fairness comprehensively across all dimensions
Substitute for comprehensive fairness evaluation frameworks

Limitations

Limited to tested scenarios and demographic signals (name-based proxies)
Metrics are statistical proxies, not ground truth measurements
Results require contextual interpretation by domain experts
Name-based demographic signaling has known limitations and cultural variance
Does not test intersectional combinations systematically

Recommended Usage

Use as one input among many in model evaluation pipelines
Combine with domain expert review and qualitative analysis
Test scenarios directly relevant to your deployment context
Document audit results in model cards and deployment documentation
Re-audit when models, prompts, or use cases change
Supplement with additional fairness testing tools and frameworks

Future Enhancements

If time permits:

PyTorch-based embedding similarity analysis
Web UI for interactive auditing
Additional statistical fairness metrics (demographic parity, equalized odds)
Confidence intervals and significance testing
Support for custom scenarios and demographic attributes

References

NIST AI Risk Management Framework
Fairness and Machine Learning (Barocas, Hardt, Narayanan)
Mitchell et al., "Model Cards for Model Reporting" (2019)

License

MIT

Author

Built to support responsible AI deployment in research and educational contexts.
