LLM Evaluation Framework for Java
Documentation • Getting Started • Examples • Issues
Dokimos is an evaluation framework for LLM applications in Java. It helps you evaluate responses, track quality over time, and catch regressions before they reach production.
It integrates with JUnit, LangChain4j, and Spring AI so you can run evaluations as part of your existing test suite and CI/CD pipeline.
- JUnit integration: Run evaluations as parameterized tests in your existing test suite
- Framework agnostic: Works with LangChain4j, Spring AI, or any LLM client; any LLM can serve as the judge
- Built-in evaluators: Hallucination detection, faithfulness, contextual relevance, LLM-as-a-judge, and more
- Custom evaluators: Build your own metrics by extending `BaseEvaluator` or using `LLMJudgeEvaluator`
- Dataset support: Load test cases from JSON, CSV, or define them programmatically
- CI/CD ready: Runs locally or in any CI/CD environment. Fail builds when quality drops (see the sketch after this list).
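As a sketch of that CI/CD gate, a test can assert on an experiment's aggregate pass rate and fail the build when it drops below an agreed bar. The `ExperimentResult` API is the one shown in the experiment example further down; the 0.90 threshold and the `runNightlyQaExperiment()` helper are illustrative assumptions, not framework defaults:

```java
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertTrue;

class QualityGateTest {

    @Test
    void qualityGate() {
        // Build and run an experiment as shown in the experiment example below;
        // runNightlyQaExperiment() is a hypothetical helper standing in for that setup.
        ExperimentResult result = runNightlyQaExperiment();

        // Fail the CI build when the aggregate pass rate drops below 90%
        // (the threshold is an illustrative choice, not a Dokimos default).
        assertTrue(result.passRate() >= 0.90,
                "Pass rate dropped below the agreed quality bar: " + result.passRate());
    }

    private ExperimentResult runNightlyQaExperiment() {
        // Hypothetical placeholder: construct the Dataset, task, and evaluators here
        // exactly as in the experiment example below.
        throw new UnsupportedOperationException("wire up your experiment here");
    }
}
```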
Add the dependency to your pom.xml (check Maven Central for the latest version):
```xml
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>
```

Evaluate a single response directly:
```java
Evaluator evaluator = ExactMatchEvaluator.builder()
        .name("Exact Match")
        .threshold(1.0)
        .build();

EvalTestCase testCase = EvalTestCase.of("What is 2+2?", "4", "4");
EvalResult result = evaluator.evaluate(testCase);

System.out.println("Passed: " + result.success()); // true
System.out.println("Score: " + result.score());    // 1.0
```

Use `@DatasetSource` to run evaluations as parameterized tests:
```java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
        .name("Correctness")
        .criteria("Is the answer correct and complete?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judgeLM)
        .build();

@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void testQAResponses(Example example) {
    String response = assistant.chat(example.input());
    EvalTestCase testCase = example.toTestCase(response);
    Assertions.assertEval(testCase, correctnessEvaluator);
}
```

Run experiments across entire datasets with aggregated metrics:
```java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
        .name("Correctness")
        .criteria("Is the answer correct?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judgeLM)
        .build();

Dataset dataset = Dataset.builder()
        .name("QA Dataset")
        .addExample(Example.of("What is 2+2?", "4"))
        .addExample(Example.of("Capital of France?", "Paris"))
        .build();

ExperimentResult result = Experiment.builder()
        .name("QA Evaluation")
        .dataset(dataset)
        .task(example -> Map.of("output", yourLLM.generate(example.input())))
        .evaluators(List.of(correctnessEvaluator))
        .build()
        .run();

// Check results
System.out.println("Pass rate: " + result.passRate());
System.out.println("Correctness avg: " + result.averageScore("Correctness"));

// Export to multiple formats
result.exportHtml(Path.of("report.html"));
result.exportJson(Path.of("results.json"));
```

See more patterns in the dokimos-examples module.
- Dataset-driven evaluation: Load test cases from JSON, CSV, or build them programmatically. Version your datasets alongside your code.
- Built-in evaluators: Ready-to-use evaluators for hallucination detection, faithfulness, contextual relevance, and LLM-as-a-judge patterns.
- Experiment tracking: Aggregate results across runs, calculate pass rates, and export to JSON, HTML, Markdown, or CSV.
- Extensible: Build custom evaluators by extending `BaseEvaluator`, or use `LLMJudgeEvaluator` with your own criteria for quick semantic checks (see the sketch below).
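As a rough illustration of that extension point, a custom evaluator might look like the sketch below. The exact `BaseEvaluator` constructor, the `evaluate` override, the `testCase.actualOutput()` accessor, and the `EvalResult.of(...)` factory are assumptions made for illustration, not the documented API; check the dokimos-core Javadoc for the real signatures:

```java
// Hypothetical sketch: the BaseEvaluator constructor, the evaluate(...) hook,
// testCase.actualOutput(), and EvalResult.of(...) are assumed names, not the
// documented API. Verify against the dokimos-core Javadoc before copying.
public class KeywordEvaluator extends BaseEvaluator {

    private final String requiredKeyword;

    public KeywordEvaluator(String requiredKeyword) {
        super("Keyword Presence", 1.0); // assumed (name, threshold) constructor
        this.requiredKeyword = requiredKeyword;
    }

    @Override
    public EvalResult evaluate(EvalTestCase testCase) {
        // Score 1.0 when the model output mentions the required keyword, else 0.0.
        boolean present = testCase.actualOutput().contains(requiredKeyword); // assumed accessor
        double score = present ? 1.0 : 0.0;
        return EvalResult.of(score, present); // assumed factory: (score, success)
    }
}
```

Once built, a custom evaluator plugs in exactly like the built-in ones, via `Assertions.assertEval(...)` or `Experiment.builder().evaluators(...)`.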
| Module | Description |
|---|---|
| `dokimos-core` | Core framework with datasets, evaluators, and experiments (required) |
| `dokimos-junit` | JUnit integration with `@DatasetSource` for parameterized tests |
| `dokimos-langchain4j` | LangChain4j support for evaluating RAG systems and agents |
| `dokimos-spring-ai` | Spring AI integration using `ChatClient` and `ChatModel` as judges |
| `dokimos-server` | Optional API and web UI for tracking experiments over time |
| `dokimos-server-client` | Client library for reporting to the Dokimos server |
Add the modules you need (check Maven Central for the latest version):
```xml
<dependencies>
    <!-- Core framework (required) -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-core</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- JUnit integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-junit</artifactId>
        <version>${dokimos.version}</version>
        <scope>test</scope>
    </dependency>

    <!-- LangChain4j integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-langchain4j</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- Spring AI integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-spring-ai</artifactId>
        <version>${dokimos.version}</version>
    </dependency>
</dependencies>
```

Gradle
```groovy
dependencies {
    implementation "dev.dokimos:dokimos-core:$dokimosVersion"
    testImplementation "dev.dokimos:dokimos-junit:$dokimosVersion"
    implementation "dev.dokimos:dokimos-langchain4j:$dokimosVersion"
    implementation "dev.dokimos:dokimos-spring-ai:$dokimosVersion"
}
```

No additional repository configuration needed.
Use `@DatasetSource` to load test cases and `LLMJudgeEvaluator` with custom criteria:
```java
// Create a judge from any LLM client
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

@ParameterizedTest
@DatasetSource("classpath:support-tickets.json")
void testSupportResponses(Example example) {
    String response = supportBot.answer(example.input());
    EvalTestCase testCase = example.toTestCase(response);

    Evaluator evaluator = LLMJudgeEvaluator.builder()
            .name("Helpfulness")
            .criteria("Is the response helpful and does it address the customer's issue?")
            .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
            .judge(judgeLM)
            .threshold(0.7)
            .build();

    Assertions.assertEval(testCase, evaluator);
}
```

Evaluate RAG pipelines and AI assistants built with LangChain4j:
```java
// Create a judge from any LLM client
JudgeLM judgeLM = prompt -> chatLanguageModel.generate(prompt);

Evaluator faithfulness = FaithfulnessEvaluator.builder()
        .judge(judgeLM)
        .contextKey("retrievedContext")
        .threshold(0.8)
        .build();

Experiment.builder()
        .dataset(dataset)
        .task(example -> {
            Result<String> result = assistant.chat(example.input());
            return Map.of(
                    "output", result.content(),
                    "retrievedContext", result.sources()
            );
        })
        .evaluators(List.of(faithfulness))
        .build()
        .run();
```

Use Spring AI's `ChatModel` as an evaluation judge:
```java
JudgeLM judge = SpringAiSupport.asJudge(chatModel);

Evaluator evaluator = LLMJudgeEvaluator.builder()
        .name("Accuracy")
        .criteria("Is the response factually accurate?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judge)
        .threshold(0.8)
        .build();
```

The Dokimos server is an optional component for tracking experiment results over time. It provides a web UI for viewing runs, comparing results, and debugging failures.
```bash
curl -O https://raw.githubusercontent.com/dokimos-dev/dokimos/master/docker-compose.yml
docker compose up -d
```

Open http://localhost:8080 to view the dashboard.
See the server documentation for deployment options.
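Reporting results from a build to the server goes through `dokimos-server-client`. The snippet below is only a hypothetical sketch of that flow: the `DokimosServerClient` class, its builder, and the `report(...)` call are assumed names, not the documented client API, so consult the server documentation for the real usage.

```java
// Hypothetical sketch only: DokimosServerClient, baseUrl(...), and report(...)
// are assumed names used to show where reporting fits, not the documented
// dokimos-server-client API.
DokimosServerClient client = DokimosServerClient.builder()
        .baseUrl("http://localhost:8080") // the server started via docker compose above
        .build();

ExperimentResult result = experiment.run(); // Experiment API from the examples above
client.report(result);                      // push the run so it appears in the web UI
```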
- More built-in evaluators: multi-turn conversations, agent tool use, misuse detection
- CLI for running evaluations outside of tests
- Server-side dataset versioning and management
See the full roadmap on the docs site.
- Questions: GitHub Discussions
- Bugs: GitHub Issues
- Contributing: See CONTRIBUTING.md
MIT License. See LICENSE for details.
