llm-judge

Here are 13 public repositories matching this topic...

A Streamlit web app that uses a Groq-powered LLM (Llama 3) as an impartial judge to evaluate and compare two model outputs. Supports custom criteria and presets such as creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, the Groq API, and Streamlit.

  • Updated Nov 24, 2025
  • Python

Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.

  • Updated Feb 24, 2026
  • TypeScript

Agent QA Mentor: an agentic QA pipeline that evaluates tool-using AI agent trajectories (scores, issue codes, safety/hallucination detection), rewrites prompts with targeted fixes, and stores long-term memory for continuous improvement. Also includes a CI-style eval gate and a demo notebook.

  • Updated Dec 2, 2025
  • Jupyter Notebook
