feat: LLM evaluation metrics and prompt tracking #71

@noahgift

Description

Summary

MLflow 3.x added GenAI/LLM evaluation features. Entrenar already has CITL support but lacks dedicated LLM evaluation metrics and prompt tracking.

MLflow Equivalent

  • mlflow.evaluate() with LLM judges
  • mlflow.tracing for token-level capture
  • Prompt Registry for version control
  • Built-in metrics: latency, tokens/sec, cost

Proposed Implementation

use chrono::{DateTime, Utc};

/// Per-call token, latency, and cost metrics for a single LLM invocation.
pub struct LLMMetrics {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
    pub latency_ms: f64,
    pub tokens_per_second: f64,
    /// Estimated cost; `None` for local/self-hosted models.
    pub cost_usd: Option<f64>,
    pub model_name: String,
}
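
A minimal sketch of how the derived fields could be filled in from raw counts and measured wall-clock latency; the `from_counts` constructor is illustrative, not part of the proposal:

impl LLMMetrics {
    /// Illustrative constructor: derives total_tokens and tokens_per_second
    /// from raw counts and the measured latency.
    pub fn from_counts(
        model_name: &str,
        prompt_tokens: u32,
        completion_tokens: u32,
        latency_ms: f64,
        cost_usd: Option<f64>,
    ) -> Self {
        let total_tokens = prompt_tokens + completion_tokens;
        let tokens_per_second = if latency_ms > 0.0 {
            completion_tokens as f64 / (latency_ms / 1000.0)
        } else {
            0.0
        };
        Self {
            prompt_tokens,
            completion_tokens,
            total_tokens,
            latency_ms,
            tokens_per_second,
            cost_usd,
            model_name: model_name.to_string(),
        }
    }
}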

/// A versioned prompt template, content-addressed by its SHA-256 hash.
pub struct PromptVersion {
    pub id: String,
    pub template: String,
    pub variables: Vec<String>,
    pub version: u32,
    pub created_at: DateTime<Utc>,
    /// Hex-encoded SHA-256 of `template`, used for deduplication.
    pub sha256: String,
}
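
A sketch of the content-hashing step for prompt versioning, assuming the sha2 and chrono crates; the `new_version` helper is hypothetical:

use sha2::{Digest, Sha256};

impl PromptVersion {
    /// Hypothetical helper: the version's identity is the SHA-256 of the
    /// template text, so identical templates deduplicate across runs.
    pub fn new_version(id: &str, template: &str, variables: Vec<String>, version: u32) -> Self {
        let mut hasher = Sha256::new();
        hasher.update(template.as_bytes());
        let sha256 = format!("{:x}", hasher.finalize());
        Self {
            id: id.to_string(),
            template: template.to_string(),
            variables,
            version,
            created_at: Utc::now(),
            sha256,
        }
    }
}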

// `Result` is assumed to be an existing crate-level result alias.
pub trait LLMEvaluator {
    /// Score a response against the prompt and an optional reference answer.
    fn evaluate_response(&self, prompt: &str, response: &str, reference: Option<&str>) -> EvalResult;
    /// Attach per-call metrics to an existing run.
    fn log_llm_call(&mut self, run_id: &str, metrics: LLMMetrics) -> Result<()>;
    /// Record the exact prompt version used by a run.
    fn track_prompt(&mut self, run_id: &str, prompt: &PromptVersion) -> Result<()>;
}

/// Quality scores produced by an evaluator (heuristic or LLM-as-judge).
pub struct EvalResult {
    pub relevance: f64,
    pub coherence: f64,
    pub groundedness: f64,
    pub harmfulness: f64,
}
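
A consumer-side sketch of how the trait could be driven from evaluation code; `record_llm_call` and the groundedness threshold are illustrative, not part of the proposal:

/// Illustrative helper: evaluate a response, then log metrics to the run.
fn record_llm_call<E: LLMEvaluator>(
    evaluator: &mut E,
    run_id: &str,
    prompt: &str,
    response: &str,
    metrics: LLMMetrics,
) -> Result<()> {
    let scores = evaluator.evaluate_response(prompt, response, None);
    if scores.groundedness < 0.5 {
        eprintln!("warning: low groundedness score for run {run_id}");
    }
    evaluator.log_llm_call(run_id, metrics)?;
    Ok(())
}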

Metrics to Track

  • Token counts (prompt, completion, total)
  • Latency (time to first token, tokens/sec, total end-to-end; see the timing sketch after this list)
  • Cost per call
  • Quality scores (relevance, coherence, groundedness)
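
A minimal timing sketch for TTFT and tokens/sec using std::time; `StreamTimer` is an illustrative name, not an existing Entrenar type:

use std::time::Instant;

/// Illustrative timing capture around a streaming LLM call.
struct StreamTimer {
    start: Instant,
    first_token_at: Option<Instant>,
    completion_tokens: u32,
}

impl StreamTimer {
    fn new() -> Self {
        Self { start: Instant::now(), first_token_at: None, completion_tokens: 0 }
    }

    /// Call once per streamed token.
    fn on_token(&mut self) {
        if self.first_token_at.is_none() {
            self.first_token_at = Some(Instant::now());
        }
        self.completion_tokens += 1;
    }

    /// Returns (ttft_ms, tokens_per_second, total_latency_ms).
    fn finish(&self) -> (f64, f64, f64) {
        let total_ms = self.start.elapsed().as_secs_f64() * 1000.0;
        let ttft_ms = self
            .first_token_at
            .map(|t| (t - self.start).as_secs_f64() * 1000.0)
            .unwrap_or(total_ms);
        let tps = if total_ms > 0.0 {
            self.completion_tokens as f64 / (total_ms / 1000.0)
        } else {
            0.0
        };
        (ttft_ms, tps, total_ms)
    }
}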

Acceptance Criteria

  • LLMMetrics struct with token/latency/cost tracking
  • Prompt versioning with content hashing
  • Built-in evaluation metrics (ROUGE, BLEU, semantic similarity; see the ROUGE-1 sketch after this list)
  • LLM-as-judge evaluation option
  • entrenar eval llm CLI command
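
For the built-in metrics item above, a minimal ROUGE-1 recall sketch (unigram overlap against a reference); full ROUGE/BLEU support would likely use an existing crate, so this is illustrative only:

use std::collections::HashMap;

/// Illustrative ROUGE-1 recall: fraction of reference unigrams that also
/// appear in the candidate, with counts clipped to the reference counts.
fn rouge1_recall(candidate: &str, reference: &str) -> f64 {
    let counts = |s: &str| {
        let mut m: HashMap<String, usize> = HashMap::new();
        for tok in s.split_whitespace() {
            *m.entry(tok.to_lowercase()).or_insert(0) += 1;
        }
        m
    };
    let cand = counts(candidate);
    let refc = counts(reference);
    let ref_total: usize = refc.values().sum();
    if ref_total == 0 {
        return 0.0;
    }
    let overlap: usize = refc
        .iter()
        .map(|(tok, &n)| n.min(*cand.get(tok).unwrap_or(&0)))
        .sum();
    overlap as f64 / ref_total as f64
}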
