
Dokimos

LLM Evaluation Framework for Java

Documentation · Getting Started · Examples · Issues

Maven Central · Build Status · License · Java 17+


Dokimos is an evaluation framework for LLM applications in Java. It helps you evaluate responses, track quality over time, and catch regressions before they reach production.

It integrates with JUnit, LangChain4j, and Spring AI so you can run evaluations as part of your existing test suite and CI/CD pipeline.

Why Dokimos?

  • JUnit integration: Run evaluations as parameterized tests in your existing test suite
  • Framework agnostic: Works with LangChain4j, Spring AI, or any other LLM client, and any LLM can serve as the judge
  • Built-in evaluators: Hallucination detection, faithfulness, contextual relevance, LLM-as-a-judge, and more
  • Custom evaluators: Build your own metrics by extending BaseEvaluator or using LLMJudgeEvaluator
  • Dataset support: Load test cases from JSON, CSV, or define them programmatically
  • CI/CD ready: Runs locally or in any CI/CD environment. Fail builds when quality drops.

Quick Start

Add the dependency to your pom.xml (check Maven Central for the latest version):

<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>

Run a standalone evaluator

Evaluate a single response directly:

Evaluator evaluator = ExactMatchEvaluator.builder()
    .name("Exact Match")
    .threshold(1.0)
    .build();

EvalTestCase testCase = EvalTestCase.of("What is 2+2?", "4", "4");
EvalResult result = evaluator.evaluate(testCase);

System.out.println("Passed: " + result.success());  // true
System.out.println("Score: " + result.score());     // 1.0

Write a JUnit test

Use @DatasetSource to run evaluations as parameterized tests:

JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
    .name("Correctness")
    .criteria("Is the answer correct and complete?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judgeLM)
    .build();

@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void testQAResponses(Example example) {
    String response = assistant.chat(example.input());
    EvalTestCase testCase = example.toTestCase(response);

    Assertions.assertEval(testCase, correctnessEvaluator);
}

Evaluate a dataset in bulk

Run experiments across entire datasets with aggregated metrics:

JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
    .name("Correctness")
    .criteria("Is the answer correct?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judgeLM)
    .build();

Dataset dataset = Dataset.builder()
    .name("QA Dataset")
    .addExample(Example.of("What is 2+2?", "4"))
    .addExample(Example.of("Capital of France?", "Paris"))
    .build();

ExperimentResult result = Experiment.builder()
    .name("QA Evaluation")
    .dataset(dataset)
    .task(example -> Map.of("output", yourLLM.generate(example.input())))
    .evaluators(List.of(correctnessEvaluator))
    .build()
    .run();

// Check results
System.out.println("Pass rate: " + result.passRate());
System.out.println("Correctness avg: " + result.averageScore("Correctness"));

// Export to multiple formats
result.exportHtml(Path.of("report.html"));
result.exportJson(Path.of("results.json"));
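
The same accessors can serve as a quality gate in CI. A minimal sketch continuing the example above; the 0.90 and 0.75 thresholds are arbitrary examples, not framework defaults:

// Fail the build when aggregate quality drops below a chosen bar.
// 0.90 and 0.75 are arbitrary example thresholds.
if (result.passRate() < 0.90 || result.averageScore("Correctness") < 0.75) {
    throw new AssertionError("Quality gate failed: passRate=" + result.passRate()
            + ", correctness=" + result.averageScore("Correctness"));
}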

See more patterns in the dokimos-examples module.

Features

Dataset-driven evaluation: Load test cases from JSON, CSV, or build them programmatically. Version your datasets alongside your code.

Built-in evaluators: Ready-to-use evaluators for hallucination detection, faithfulness, contextual relevance, and LLM-as-a-judge patterns.

Experiment tracking: Aggregate results across runs, calculate pass rates, and export to JSON, HTML, Markdown, or CSV.

Extensible: Build custom evaluators by extending BaseEvaluator, or use LLMJudgeEvaluator with your own criteria for quick semantic checks.
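
As a rough illustration of the BaseEvaluator route, here is a minimal deterministic evaluator. It assumes BaseEvaluator exposes an evaluate(EvalTestCase) hook and that EvalTestCase and EvalResult offer accessors and a factory along the lines of the Quick Start example; check the documentation for the exact extension points.

// Illustrative sketch only: the exact BaseEvaluator hooks may differ.
// Assumes evaluate(EvalTestCase) -> EvalResult as in the Quick Start, plus
// hypothetical EvalTestCase.actualOutput() and EvalResult.of(score, passed).
public class ContainsKeywordEvaluator extends BaseEvaluator {

    private final String keyword;

    public ContainsKeywordEvaluator(String keyword) {
        this.keyword = keyword;
    }

    @Override
    public EvalResult evaluate(EvalTestCase testCase) {
        // Deterministic check: does the actual output mention the keyword?
        String output = testCase.actualOutput();
        boolean found = output != null && output.toLowerCase().contains(keyword.toLowerCase());
        return EvalResult.of(found ? 1.0 : 0.0, found);
    }
}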

Modules

  • dokimos-core: Core framework with datasets, evaluators, and experiments (required)
  • dokimos-junit: JUnit integration with @DatasetSource for parameterized tests
  • dokimos-langchain4j: LangChain4j support for evaluating RAG systems and agents
  • dokimos-spring-ai: Spring AI integration using ChatClient and ChatModel as judges
  • dokimos-server: Optional API and web UI for tracking experiments over time
  • dokimos-server-client: Client library for reporting to the Dokimos server

Installation

Maven

Add the modules you need (check Maven Central for the latest version):

<dependencies>
    <!-- Core framework (required) -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-core</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- JUnit integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-junit</artifactId>
        <version>${dokimos.version}</version>
        <scope>test</scope>
    </dependency>

    <!-- LangChain4j integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-langchain4j</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- Spring AI integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-spring-ai</artifactId>
        <version>${dokimos.version}</version>
    </dependency>
</dependencies>

Gradle

dependencies {
    // dokimosVersion can be defined in gradle.properties or an ext block
    implementation "dev.dokimos:dokimos-core:${dokimosVersion}"
    testImplementation "dev.dokimos:dokimos-junit:${dokimosVersion}"
    implementation "dev.dokimos:dokimos-langchain4j:${dokimosVersion}"
    implementation "dev.dokimos:dokimos-spring-ai:${dokimosVersion}"
}

All modules are published to Maven Central, so no additional repository configuration is needed.

Integrations

JUnit

Use @DatasetSource to load test cases and LLMJudgeEvaluator to score them against custom criteria:

// Create a judge from any LLM client
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

@ParameterizedTest
@DatasetSource("classpath:support-tickets.json")
void testSupportResponses(Example example) {
    String response = supportBot.answer(example.input());
    EvalTestCase testCase = example.toTestCase(response);

    Evaluator evaluator = LLMJudgeEvaluator.builder()
        .name("Helpfulness")
        .criteria("Is the response helpful and addresses the customer's issue?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judgeLM)
        .threshold(0.7)
        .build();

    Assertions.assertEval(testCase, evaluator);
}

LangChain4j

Evaluate RAG pipelines and AI assistants built with LangChain4j:

// Create a judge from any LLM client
JudgeLM judgeLM = prompt -> chatLanguageModel.generate(prompt);

Evaluator faithfulness = FaithfulnessEvaluator.builder()
    .judge(judgeLM)
    .contextKey("retrievedContext")
    .threshold(0.8)
    .build();

Experiment.builder()
    .dataset(dataset)
    .task(example -> {
        Result<String> result = assistant.chat(example.input());
        return Map.of(
            "output", result.content(),
            "retrievedContext", result.sources()
        );
    })
    .evaluators(List.of(faithfulness))
    .build()
    .run();

Spring AI

Use Spring AI's ChatModel as an evaluation judge:

JudgeLM judge = SpringAiSupport.asJudge(chatModel);

Evaluator evaluator = LLMJudgeEvaluator.builder()
    .name("Accuracy")
    .criteria("Is the response factually accurate?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judge)
    .threshold(0.8)
    .build();
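
The resulting evaluator plugs into the same JUnit flow shown earlier, for example inside a @DatasetSource-driven test. Here `response` is a placeholder for the string your ChatClient or assistant produced for example.input():

// Use the Spring AI-backed evaluator like any other evaluator.
// `response` is a placeholder for your application's output.
EvalTestCase testCase = example.toTestCase(response);
Assertions.assertEval(testCase, evaluator);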

Experiment Server

The Dokimos server is an optional component for tracking experiment results over time. It provides a web UI for viewing runs, comparing results, and debugging failures.

curl -O https://raw.githubusercontent.com/dokimos-dev/dokimos/master/docker-compose.yml
docker compose up -d

Open http://localhost:8080 to view the dashboard.

See the server documentation for deployment options.

Roadmap

  • More built-in evaluators: multi-turn conversations, agent tool use, misuse detection
  • A CLI for running evaluations outside of tests
  • Server-side dataset versioning and management

See the full roadmap on the docs site.

Get Help

Questions, bug reports, or feature requests? Open an issue on GitHub.

License

MIT License. See LICENSE for details.

