feat: LLM evaluation metrics and prompt tracking #71

@noahgift

Description

Summary

MLflow 3.x added GenAI/LLM evaluation features. Entrenar already has CITL support but lacks dedicated LLM evaluation metrics and prompt tracking.

MLflow Equivalent

  • mlflow.evaluate() with LLM judges
  • mlflow.tracing for token-level capture
  • Prompt Registry for version control
  • Built-in metrics: latency, tokens/sec, cost

Proposed Implementation

use chrono::{DateTime, Utc};

/// Per-call token, latency, and cost metrics for a single LLM invocation.
pub struct LLMMetrics {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
    pub latency_ms: f64,
    pub tokens_per_second: f64,
    /// Estimated cost; `None` for local/self-hosted models.
    pub cost_usd: Option<f64>,
    pub model_name: String,
}
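
A minimal sketch of how the derived fields could be filled in from raw counts and measured wall-clock latency; the `from_counts` constructor is illustrative, not part of the proposal:

impl LLMMetrics {
    /// Illustrative constructor: derives total_tokens and tokens_per_second
    /// from raw counts and the measured latency.
    pub fn from_counts(
        model_name: &str,
        prompt_tokens: u32,
        completion_tokens: u32,
        latency_ms: f64,
        cost_usd: Option<f64>,
    ) -> Self {
        let total_tokens = prompt_tokens + completion_tokens;
        let tokens_per_second = if latency_ms > 0.0 {
            completion_tokens as f64 / (latency_ms / 1000.0)
        } else {
            0.0
        };
        Self {
            prompt_tokens,
            completion_tokens,
            total_tokens,
            latency_ms,
            tokens_per_second,
            cost_usd,
            model_name: model_name.to_string(),
        }
    }
}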

/// A versioned prompt template, content-addressed by its SHA-256 hash.
pub struct PromptVersion {
    pub id: String,
    pub template: String,
    pub variables: Vec<String>,
    pub version: u32,
    pub created_at: DateTime<Utc>,
    /// Hex-encoded SHA-256 of `template`, used for deduplication.
    pub sha256: String,
}
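
A sketch of the content-hashing step for prompt versioning, assuming the sha2 and chrono crates; the `new_version` helper is hypothetical:

use sha2::{Digest, Sha256};

impl PromptVersion {
    /// Hypothetical helper: the version's identity is the SHA-256 of the
    /// template text, so identical templates deduplicate across runs.
    pub fn new_version(id: &str, template: &str, variables: Vec<String>, version: u32) -> Self {
        let mut hasher = Sha256::new();
        hasher.update(template.as_bytes());
        let sha256 = format!("{:x}", hasher.finalize());
        Self {
            id: id.to_string(),
            template: template.to_string(),
            variables,
            version,
            created_at: Utc::now(),
            sha256,
        }
    }
}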

// `Result` is assumed to be an existing crate-level result alias.
pub trait LLMEvaluator {
    /// Score a response against the prompt and an optional reference answer.
    fn evaluate_response(&self, prompt: &str, response: &str, reference: Option<&str>) -> EvalResult;
    /// Attach per-call metrics to an existing run.
    fn log_llm_call(&mut self, run_id: &str, metrics: LLMMetrics) -> Result<()>;
    /// Record the exact prompt version used by a run.
    fn track_prompt(&mut self, run_id: &str, prompt: &PromptVersion) -> Result<()>;
}

/// Quality scores produced by an evaluator (heuristic or LLM-as-judge).
pub struct EvalResult {
    pub relevance: f64,
    pub coherence: f64,
    pub groundedness: f64,
    pub harmfulness: f64,
}
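
A consumer-side sketch of how the trait could be driven from evaluation code; `record_llm_call` and the groundedness threshold are illustrative, not part of the proposal:

/// Illustrative helper: evaluate a response, then log metrics to the run.
fn record_llm_call<E: LLMEvaluator>(
    evaluator: &mut E,
    run_id: &str,
    prompt: &str,
    response: &str,
    metrics: LLMMetrics,
) -> Result<()> {
    let scores = evaluator.evaluate_response(prompt, response, None);
    if scores.groundedness < 0.5 {
        eprintln!("warning: low groundedness score for run {run_id}");
    }
    evaluator.log_llm_call(run_id, metrics)?;
    Ok(())
}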

Metrics to Track

  • Token counts (prompt, completion, total)
  • Latency (time to first token, tokens/sec, total end-to-end; see the timing sketch after this list)
  • Cost per call
  • Quality scores (relevance, coherence, groundedness)
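
A minimal timing sketch for TTFT and tokens/sec using std::time; `StreamTimer` is an illustrative name, not an existing Entrenar type:

use std::time::Instant;

/// Illustrative timing capture around a streaming LLM call.
struct StreamTimer {
    start: Instant,
    first_token_at: Option<Instant>,
    completion_tokens: u32,
}

impl StreamTimer {
    fn new() -> Self {
        Self { start: Instant::now(), first_token_at: None, completion_tokens: 0 }
    }

    /// Call once per streamed token.
    fn on_token(&mut self) {
        if self.first_token_at.is_none() {
            self.first_token_at = Some(Instant::now());
        }
        self.completion_tokens += 1;
    }

    /// Returns (ttft_ms, tokens_per_second, total_latency_ms).
    fn finish(&self) -> (f64, f64, f64) {
        let total_ms = self.start.elapsed().as_secs_f64() * 1000.0;
        let ttft_ms = self
            .first_token_at
            .map(|t| (t - self.start).as_secs_f64() * 1000.0)
            .unwrap_or(total_ms);
        let tps = if total_ms > 0.0 {
            self.completion_tokens as f64 / (total_ms / 1000.0)
        } else {
            0.0
        };
        (ttft_ms, tps, total_ms)
    }
}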

Acceptance Criteria

  • LLMMetrics struct with token/latency/cost tracking
  • Prompt versioning with content hashing
  • Built-in evaluation metrics (ROUGE, BLEU, semantic similarity; see the ROUGE-1 sketch after this list)
  • LLM-as-judge evaluation option
  • entrenar eval llm CLI command
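
For the built-in metrics item above, a minimal ROUGE-1 recall sketch (unigram overlap against a reference); full ROUGE/BLEU support would likely use an existing crate, so this is illustrative only:

use std::collections::HashMap;

/// Illustrative ROUGE-1 recall: fraction of reference unigrams that also
/// appear in the candidate, with counts clipped to the reference counts.
fn rouge1_recall(candidate: &str, reference: &str) -> f64 {
    let counts = |s: &str| {
        let mut m: HashMap<String, usize> = HashMap::new();
        for tok in s.split_whitespace() {
            *m.entry(tok.to_lowercase()).or_insert(0) += 1;
        }
        m
    };
    let cand = counts(candidate);
    let refc = counts(reference);
    let ref_total: usize = refc.values().sum();
    if ref_total == 0 {
        return 0.0;
    }
    let overlap: usize = refc
        .iter()
        .map(|(tok, &n)| n.min(*cand.get(tok).unwrap_or(&0)))
        .sum();
    overlap as f64 / ref_total as f64
}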
