“When in doubt, consult the Council.”
This project provides a clean, extensible framework for interacting with multiple LLMs (OpenAI, Anthropic, Mistral, Gemini) through a unified interface. It supports structured responses, token usage tracking, cost estimation, retry logic, and easy extensibility for adding more providers like LLaMA or Cohere.
- ✅ Unified interface for calling different LLMs (GPT, Claude, Mistral, Gemini)
- ✅ Automatic retry with exponential backoff (see the sketch after this list)
- ✅ Cost estimation based on provider pricing
- ✅ Structured logging + latency tracking
- ✅ Extensible to support more LLM providers
- 🔜 CLI runner, streaming, parallelism (coming soon)
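
The retry behavior follows the standard exponential-backoff pattern. Here is a minimal illustrative sketch of that pattern, not the project's actual implementation (retry counts and delays are assumed):

```python
import random
import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a flaky API call with exponential backoff and jitter (illustrative only)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error
            # Wait 1s, 2s, 4s, ... plus jitter before the next attempt
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```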
```bash
git clone https://github.com/yourusername/jedi-council.git
cd jedi-council
pip install -e .
```

This uses the `pyproject.toml` to install dependencies and enables editable (live) development.
- Set up your API keys

Create a `.env` file in the root folder:

```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
MISTRAL_API_KEY=sk-mistral-...
GOOGLE_API_KEY=AIza...
```
Make sure you have a `.env` loader if you're running scripts directly; otherwise, export the environment variables before execution.
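
If you use `python-dotenv` (an assumption; any loader works), loading the file looks like this:

```python
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into os.environ
```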
This tool evolved from a research pipeline designed to automate UI tasks using LLMs. To benchmark multiple models across tasks with traceable cost, latency, and output format, a centralized interface was essential.
- Import the wrapper:

```python
from jedi_council.core import TheJediCouncil
```

- Initialize a model:

```python
council = TheJediCouncil(model="gpt-4o")
```
✅ Supported models:

- OpenAI: `gpt-4o`, `gpt-4`, `gpt-3.5-turbo`
- Anthropic: `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`
- Mistral: `mistral-small-latest`, `mistral-7b`, `mistral-large-latest`
- Gemini: `gemini-1.5-pro-latest`, `gemini-1.0-pro`
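
Any of these names drops into the same constructor, so switching providers is a one-line change:

```python
# Same unified interface regardless of the underlying provider
claude_council = TheJediCouncil(model="claude-3-haiku-20240307")
gemini_council = TheJediCouncil(model="gemini-1.5-pro-latest")
```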
- Send your first message:

```python
response = council.get_wisdom("What is the capital of Naboo?")
print(response.text)
```
- See structured metadata and cost:

```python
print("Wisdom:", response.text)
print("Usage:", f"{response.usage.input_tokens} in, {response.usage.output_tokens} out")
print("Cost: $", f"{response.usage.cost:.4f}")
print("Latency:", f"{int(response.latency_ms)}ms")
```
You can pass system prompts, temperature, and more:

```python
council = TheJediCouncil(
    model="gpt-4o",
    system_prompt="You are a wise Jedi master.",
    temperature=0.2,
)
```
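
The example output below was produced by a call along these lines (the prompt here is assumed for illustration):

```python
response = council.get_wisdom("What is the philosophy behind the Jedi Code?")
print("Wisdom:", response.text)
print("Usage:", f"{response.usage.input_tokens} in, {response.usage.output_tokens} out")
print("Cost: $", f"{response.usage.cost:.4f}")
print("Latency:", f"{int(response.latency_ms)}ms")
```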
```text
Wisdom: The philosophy behind the Jedi Code is rooted in principles of peace, self-discipline, and harmony with the Force...
Usage: 16 in, 323 out
Cost: $ 0.0049
Latency: 9981ms
```
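
The cost figure is simply token counts multiplied by per-token pricing. A simplified sketch of that arithmetic (the pricing table here is hypothetical, loosely based on published gpt-4o rates, not the project's actual numbers):

```python
# Hypothetical USD prices per 1K tokens; illustrative only
PRICING = {"gpt-4o": {"input": 0.005, "output": 0.015}}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICING[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

print(f"${estimate_cost('gpt-4o', 16, 323):.4f}")  # -> $0.0049, matching the example above
```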
Try running the following to test all configured models:

```bash
python example.py
```
Try this to query a model from the terminal:

```bash
python run_prompt.py --model gpt-4o --prompt "Tell me a Yoda quote"
```
To compare model performance across real tasks, you can run:

```bash
python benchmark/benchmarking_suite.py
```
This will:

- Run a suite of predefined tasks across all available LLMs
- Log model outputs, token usage, latency, and cost
- Save detailed results in `logs/benchmark_results.csv`
You can analyze the results using pandas or any visualization tool:

```python
import pandas as pd

df = pd.read_csv("logs/benchmark_results.csv")
print(df.groupby("model")["cost"].mean())
```

You can also track performance by task category or sort by latency:

```python
import matplotlib.pyplot as plt

df.groupby(["model", "task_name"])["latency_ms"].mean().unstack().plot(kind="bar")
plt.show()
```
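A pivot over the same columns (column names taken from the snippets above) reproduces the kind of per-task latency summary shown below:

```python
summary = df.pivot_table(index="task_name", columns="model", values="latency_ms", aggfunc="mean")
print(summary.round(0))
```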
Here's a sample summary of average latency (in ms) for different models across task categories:
| Task | GPT-4o | Claude-3 | Gemini | Mistral |
|---|---|---|---|---|
| Ambiguity Resolution | 13528 | 4175 | 4390 | 5308 |
| Code Generation | 17870 | 4091 | 3741 | 10309 |
| Logical Reasoning | 13463 | 1630 | 1890 | 4002 |
| Summarization | 10132 | 814 | 608 | 799 |
| Time Zone Reasoning | 11085 | 1019 | 777 | 1712 |
This table highlights latency performance trends across various LLMs for core reasoning and generation tasks.
- Dry-run support (simulate requests without actual calls)
- CLI prompt runner: `python run_prompt.py --model gpt-4o --prompt "Say hello"`
- Streaming & parallel inference
- LLaMA and Cohere provider integration
- Output visualizations (token bar, cost heatmap, model latency ranking)
- ✅ CSV logging added
We welcome contributions from Jedi and Padawans alike! If you have ideas for new features, improvements, or new LLM integrations, feel free to open an issue or a pull request.
- ⭐ Star this repository to support the project.
- Fork the repo and create a new branch: `git checkout -b feature/your-feature-name`
- Write clear, tested code and include helpful commit messages.
- Make sure all tests pass (see Running Tests below).
- Submit a pull request and describe your changes in detail.

🧪 Running Tests

Run unit tests locally:

```bash
pytest --cov=jedi_council --cov-report=term-missing
```

Make sure new features include test coverage.
May the source be with you.
- Built by Avdhoot Patil as part of advanced LLM experimentation and research.
- Inspired by research needs in UI automation, prompt optimization, and agent evaluation.
- Big thanks to contributors and the open-source community for tools like `openai`, `anthropic`, `google-generativeai`, and `mistralai`.
Join the Jedi Dev Discord: [coming soon]
This project is licensed under the MIT License — use freely, contribute kindly.