Summary
MLflow 3.x added GenAI/LLM evaluation features. Entrenar has CITL but lacks dedicated LLM evaluation metrics.
MLflow Equivalent
- `mlflow.evaluate()` with LLM judges
- `mlflow.tracing` for token-level capture
- Prompt Registry for version control
- Built-in metrics: latency, tokens/sec, cost
Proposed Implementation

```rust
use chrono::{DateTime, Utc};
// Assumes anyhow::Result; swap for the crate's own error alias if one exists.
use anyhow::Result;

/// Per-call metrics for a single LLM invocation.
pub struct LLMMetrics {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
    pub latency_ms: f64,
    pub tokens_per_second: f64,
    /// Optional because pricing is not known for every model.
    pub cost_usd: Option<f64>,
    pub model_name: String,
}

/// A versioned prompt template, content-addressed via SHA-256.
pub struct PromptVersion {
    pub id: String,
    pub template: String,
    pub variables: Vec<String>,
    pub version: u32,
    pub created_at: DateTime<Utc>,
    pub sha256: String,
}

pub trait LLMEvaluator {
    /// Score a response against the prompt and an optional reference answer.
    fn evaluate_response(&self, prompt: &str, response: &str, reference: Option<&str>) -> EvalResult;
    /// Attach per-call metrics to an experiment run.
    fn log_llm_call(&mut self, run_id: &str, metrics: LLMMetrics) -> Result<()>;
    /// Record which prompt version a run used.
    fn track_prompt(&mut self, run_id: &str, prompt: &PromptVersion) -> Result<()>;
}

/// Quality scores, each in [0.0, 1.0].
pub struct EvalResult {
    pub relevance: f64,
    pub coherence: f64,
    pub groundedness: f64,
    pub harmfulness: f64,
}
```
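As a sketch of the content-hashing side (assuming the `sha2` crate; `hash_template` and `new_prompt_version` are illustrative names, not an existing Entrenar API), a `PromptVersion` could be minted like this:

```rust
use chrono::Utc;
use sha2::{Digest, Sha256};

/// Hex-encode the SHA-256 of the raw template text, so two identical
/// templates always share a hash regardless of id or version number.
fn hash_template(template: &str) -> String {
    Sha256::digest(template.as_bytes())
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect()
}

/// Illustrative constructor: stamps the creation time and content hash.
fn new_prompt_version(id: &str, template: &str, variables: Vec<String>, version: u32) -> PromptVersion {
    PromptVersion {
        id: id.to_string(),
        template: template.to_string(),
        variables,
        version,
        created_at: Utc::now(),
        sha256: hash_template(template),
    }
}
```

Content addressing means a re-registered but unchanged template keeps its hash, which makes prompt drift between runs easy to detect.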
Metrics to Track
- Token counts (prompt, completion, total)
- Latency: time to first token (TTFT), tokens per second (TPS), and total wall time; see the sketch after this list
- Cost per call
- Quality scores (relevance, coherence, groundedness)
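Given those fields, the derived metrics are straightforward to fill in. A minimal sketch, where `call_model` is a stub standing in for whatever client Entrenar integrates with:

```rust
use std::time::Instant;

/// Stub standing in for a real client; assume it returns the
/// completion token count for the generated text.
fn call_model(_model: &str, _prompt: &str) -> u32 {
    128
}

/// Illustrative wrapper: times one model call and derives throughput.
fn timed_call(model_name: &str, prompt: &str, prompt_tokens: u32) -> LLMMetrics {
    let start = Instant::now();
    let completion_tokens = call_model(model_name, prompt);
    let latency_ms = start.elapsed().as_secs_f64() * 1000.0;
    LLMMetrics {
        prompt_tokens,
        completion_tokens,
        total_tokens: prompt_tokens + completion_tokens,
        latency_ms,
        // Completion tokens over total wall time; guard against a 0 ms clock.
        tokens_per_second: completion_tokens as f64 / (latency_ms / 1000.0).max(f64::EPSILON),
        cost_usd: None, // populate from a per-model price table when known
        model_name: model_name.to_string(),
    }
}
```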
Acceptance Criteria
- `LLMMetrics` struct with token/latency/cost tracking
- Prompt versioning with content hashing
- Built-in evaluation metrics (ROUGE, BLEU, semantic similarity); a rough sketch follows below
- LLM-as-judge evaluation option
- `entrenar eval llm` CLI command
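For the built-in metrics item, a crude unigram-recall score (a rough stand-in for ROUGE-1, shown only to illustrate the shape of a metric behind `evaluate_response`, not a production implementation):

```rust
use std::collections::HashSet;

/// Fraction of reference words that also appear in the response: a rough
/// ROUGE-1-recall stand-in. Real support would use a proper ROUGE/BLEU
/// implementation or embedding-based semantic similarity.
fn unigram_recall(response: &str, reference: &str) -> f64 {
    let resp: HashSet<&str> = response.split_whitespace().collect();
    let refr: Vec<&str> = reference.split_whitespace().collect();
    if refr.is_empty() {
        return 0.0;
    }
    let hits = refr.iter().filter(|w| resp.contains(*w)).count();
    hits as f64 / refr.len() as f64
}
```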