llm-judge

Here are 13 public repositories matching this topic...

haizelabs / verdict

Inference-time scaling for LLMs-as-a-judge.

reward-shaping llm llm-as-a-judge test-time-compute inference-time-compute llm-judge test-time-scaling

Updated Nov 5, 2025
Jupyter Notebook

vtdinh13 / habit-builder-ai-agent

An end-to-end AI agent project that transcribes audio files, embeds user queries, and searches both a Qdrant archive and the web (via the Brave API). A Streamlit interface powered by OpenAI GPT models delivers actionable health insights from both the archive and the latest research.

ai-agents qdrant pydantic-ai llm-judge ai-agent-evaluation

Updated Dec 12, 2025
Python

mennamohammedkh / Simple-Chatbot-Llama-3-8B-via-HuggingFace-API-TrustGuard-with-LLM-Judge

🤖 A conversational chatbot powered by Meta-Llama-3-8B via HuggingFace API, with TrustGuard safety validation using an LLM-as-Judge.

python chatbot ai-safety uv huggingface llm llama3 llm-judge trustguard

Updated Feb 27, 2026

Anmolian / Prompt_Eval_LLM_Judge

Prompt Design & LLM Judge

prompt-engineering llms few-shot-prompting one-shot-prompting zero-shot-prompting contrastive-cot-prompting cot-prompting llm-judge trec-rag-2024 self-consistency-prompting role-playing-prompting

Updated Feb 10, 2025
Python

syed-waleed-ahmed / LLM-as-Judge

A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.

python code-evaluation a-b-testing text-evaluation groq streamlit model-benchmarking ai-automation ai-evaluation llm prompt-evaluation llama3 llm-judge output-evaluation scoring-framework

Updated Nov 24, 2025
Python

youdotcom-oss / web-search-agent-evals

Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.

benchmark mcp gemini headless-testing droid codex ai-agents web-search coding-agents model-context-protocol llm-judge claude-code agent-evaluation evaluation-suite

Updated Feb 24, 2026
TypeScript

black-yt / structai

StructAI offers a robust toolkit for LLM interaction, including structured outputs, context management, and parallel execution.

Updated Jan 28, 2026
Python

PabloCabaleiro / pondera

Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.

python ai agents model-agnostic ai-evaluation llms llm-evaluation llm-evaluation-framework llm-judge agent-evaluation ai-evaluation-framework rubric-based-evaluation yaml-first

Updated Oct 23, 2025
Python

liuxiaotong / agent-reward

Process-level rubric-based reward engine for Code Agent trajectories. CLI + MCP ready.

python cli mcp preference-learning rubric dpo rlhf reward-model llm-judge code-agent ai-data-pipeline process-reward

Updated Feb 9, 2026
Python

Padraigobrien08 / Agentic-GenAI-Capstone-Google-Kaggle

Agent QA Mentor: an agentic QA pipeline that evaluates tool-using AI agent trajectories (scores, issue codes, safety/hallucination detection), rewrites prompts with targeted fixes, and stores long-term memory for continuous improvement—plus a CI-style eval gate and demo notebook.