Project Website • Documentation • Video demo
EvalAssist is an LLM-as-a-Judge framework built on top of the Unitxt open-source evaluation library for large language models. The EvalAssist application provides users with a convenient way to iteratively test and refine LLM-as-a-judge criteria, and supports both direct (rubric-based) and pairwise (relation-based) assessment, the two most prevalent paradigms of LLM-as-a-judge evaluation. EvalAssist is model-agnostic, i.e. the content to be evaluated can come from any model. It supports a rich set of off-the-shelf judge models that can easily be extended; an API key is required to use the pre-defined judge models. Once users are satisfied with their criteria, they can auto-generate a Notebook with Unitxt code to run bulk evaluations on larger data sets based on their criteria definition. EvalAssist also includes a catalog of example test cases demonstrating the use of LLM-as-a-judge across a variety of scenarios, and users can save their own test cases.
EvalAssist can be installed using various package managers. Before proceeding, ensure you are using Python >= 3.10 and < 3.14 to avoid compatibility issues. Make sure to set DATA_DIR to avoid data loss (e.g. export DATA_DIR="~/.eval_assist").
Using pip and a virtual environment:

python3 -m venv venv
source venv/bin/activate  # or venv\Scripts\activate.bat on Windows
pip install 'evalassist[webapp]'
eval-assist serve

Using uvx:

uvx --python 3.11 --from 'evalassist[webapp]' eval-assist serve

Using conda:

conda create -n evalassist python=3.11
conda activate evalassist
pip install 'evalassist[webapp]'
eval-assist serve
In all cases, after running the command, you can access the EvalAssist server at http://localhost:8000.
EvalAssist can be configured through environment variables and command parameters. Take a look at the configuration documentation.
Check out the tutorials to see how to run evaluations and generate synthetic data.
You can also run LLM-as-a-Judge evaluations using Python only, without the web application. For example:
from evalassist.judges import DirectJudge
from evalassist.judges.const import DEFAULT_JUDGE_INFERENCE_PARAMS
from unitxt.inference import CrossProviderInferenceEngine

# Instantiate a direct (rubric-based) judge backed by a watsonx-hosted model.
judge = DirectJudge(
    inference_engine=CrossProviderInferenceEngine(
        model="llama-3-3-70b-instruct",
        provider="watsonx",
        **DEFAULT_JUDGE_INFERENCE_PARAMS,
    ),
)

# Evaluate a single instance against a yes/no direct assessment criterion.
results = judge(
    instances=[
        "Use the API client to fetch data from the server and the cache to store frequently accessed results for faster performance."
    ],
    criteria="Is the text self-explanatory and self-contained?",
)
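The judge returns one result per input instance. As a minimal, illustrative way to inspect the output (the exact structure of the returned results is described in the judges sub-package documentation referenced below):

# Print the raw results to see the judge's verdict for each instance;
# consult the judges sub-package documentation for the full result schema.
print(results)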
For more details, see the documentation of the judges sub-package.
You can contribute to EvalAssist or to Unitxt. Look at the Contribution Guidelines for more details.
Look at the Local Development Guide for instructions on setting up a local development environment.
You can find extensive documentation of the system on the Documentation page.