
Commit 5aae37a

feat(evaluation): introduce new VLM metrics and integration tests
- Added new metrics: AlignmentScoreMetric, ImageEditScoreMetric, QAAccuracyMetric, TextScoreMetric, VieScoreMetric, and VQAMetric for comprehensive evaluation of image-text alignment and quality.
- Implemented an integration test script for VLM metrics, allowing testing against both Litellm and Transformers backends.
- Updated pyproject.toml to reflect new dependencies and changes in optional dependencies.
- Added documentation for prompt comparisons between the Pruna and InferBench implementations.
1 parent d3d659b commit 5aae37a

File tree

13 files changed: +1349 −761 lines

Lines changed: 158 additions & 0 deletions
# VLM Metrics: Prompt Comparison (Pruna vs InferBench)

An overview of the prompt differences between Pruna's VLM metrics and InferBench's implementation.

---

## Summary Table

| Metric | Pruna | InferBench | Key Differences |
|--------|-------|------------|-----------------|
| **Alignment Score** | Single generic question | Multi-question with dependencies | Pruna: 1 prompt; InferBench: N questions from OneIG JSON |
| **VQA** | Same as Alignment (reused) | Dedicated template | Both use "Does this show X? Yes/No" |
| **Text Score** | Short OCR prompt | Detailed OCR prompt | InferBench: longer, explicit format rules |
| **Img Edit Score** | Simple 0–10 rating | Full judge prompts from ImgEdit repo | InferBench: 5-point multi-criteria per edit type |
| **VieScore** | Two short prompts | Long SC + PQ prompts | InferBench: detailed rules, JSON output |
| **QA Accuracy** | Generic "What is in this image?" | Benchmark-specific questions | Different use cases |
| **VLM Base (score)** | Litellm: "Answer Yes or No" / Transformers: "Question: X Answer:" | Generation + logprobs fallback | Response format differs |

---

## 1. Alignment Score

### Pruna
- **Question**: `Does this image show "{prompt}"? Answer Yes or No.`
- **Expected answer**: `Yes`
- **Scope**: Single prompt–image alignment per sample
- **Source**: `metric_alignment_score.py`, `metric_vqa.py` (same logic)

### InferBench
- **Questions**: From OneIG JSON (e.g. `anime.json`, `human.json`, `object.json`)
- **Template**: `{question}. Only answer 'Yes' or 'No'. Do not answer anything else.`
- **Examples**: "Are there boys?", "Are there four boys?", "Is there a nun?", etc.
- **Dependencies**: Parent–child question graph; child scores are set to 0 if the parent answer is No
- **Scope**: 9–20 questions per image, dependency-aware aggregation
- **Source**: `alignment_score.py`, `oneig.py` (benchmark)
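
The dependency-aware aggregation can be sketched as follows (a minimal sketch; the function name and the parent-index representation are assumptions, not the actual `oneig.py` code): a question's score is forced to 0 when any ancestor question was answered No.

```python
def aggregate_with_dependencies(scores, parents):
    """Aggregate per-question Yes/No scores with parent dependencies.

    scores  : per-question scores (1.0 = Yes, 0.0 = No)
    parents : parents[i] is the index of question i's parent, or None
    """
    final = list(scores)
    for i in range(len(scores)):
        # Walk up the ancestor chain; a "No" anywhere zeroes this question.
        j = parents[i]
        while j is not None:
            if scores[j] == 0.0:
                final[i] = 0.0
                break
            j = parents[j]
    return sum(final) / len(final)
```

With questions like "Are there boys?" (parent) and "Are there four boys?" (child), a No on the parent zeroes the child regardless of the child's own answer.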

---

## 2. VQA (Visual Question Answering)

### Pruna
- Same as Alignment Score: `Does this image show "{prompt}"? Answer Yes or No.`
- Used for both `alignment_score` and `vqa` metrics

### InferBench
- **Template**: `Does this figure show "{prompt}"? Please answer yes or no.`
- **Expected answer**: `Yes`
- **Difference**: "figure" vs "image"; "Please answer yes or no" vs "Answer Yes or No"
- **Source**: `vqa.py`

---

## 3. Text Score (OCR)

### Pruna
- **Prompt**: `Extract all text from this image. If no text, say 'No text'.`
- **Output use**: Binary check (no text → score 10.0, else 0.0) — *Note: Pruna's text_score appears to use edit-distance logic elsewhere; this prompt is for OCR extraction*
- **Source**: `metric_text_score.py`

### InferBench
- **Prompt**:

  ```
  Extract all text visible in this image. Include logos, stylized fonts, handwritten text, and non-standard typography.
  Return only the extracted text, exactly as it appears—no preamble, explanation, or markdown.
  Preserve words, numbers, punctuation, and spacing. If no text is recognized, reply with exactly: No text recognized
  ```

- **Post-processing**: Hallucination removal ("addCriterion", "No text recognized"), Levenshtein distance vs ground truth, word accuracy
- **Source**: `text_score.py`
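
The edit-distance comparison against the ground truth can be sketched like this (hypothetical helper names; the real `text_score.py` may normalize case, whitespace, or hallucinated tokens differently before comparing):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def ocr_scores(extracted: str, ground_truth: str):
    """Character-level similarity plus word-level accuracy."""
    dist = levenshtein(extracted, ground_truth)
    char_sim = 1.0 - dist / max(len(extracted), len(ground_truth), 1)
    gt_words = ground_truth.split()
    extracted_words = extracted.split()
    word_acc = sum(w in extracted_words for w in gt_words) / max(len(gt_words), 1)
    return char_sim, word_acc
```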

---

## 4. Image Edit Score

### Pruna
- **Question**: `Rate 0-10: Does this image show "{prompt}"? Reply with a number.`
- **Input**: Single edited image + prompt
- **Output**: 0–10 score, normalized to [0, 1]
- **Source**: `metric_img_edit_score.py`

### InferBench
- **Input**: Original image + edited image + edit instruction
- **Judge prompts**: Fetched from the ImgEdit repo (`prompts.json`) per edit type (replace, add, remove, adjust, style, extract, background, compose)
- **Format**: Long multi-criteria prompts (5-point scale):
  - Prompt Compliance (1–5)
  - Visual Naturalness / Seamlessness (1–5)
  - Physical & Detail Integrity (1–5)
- **Output**: Average of the 3 scores, parsed from the `"Prompt Compliance: N\nVisual Naturalness: N\n..."` format
- **Source**: `img_edit_score.py`, `img_edit.py` (benchmark), external `prompts.json`
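
Parsing the `"Criterion: N"` judge output and averaging the three criteria might look like the sketch below (`parse_judge_scores` is a hypothetical name; the real parser in `img_edit_score.py` may be more lenient):

```python
import re


def parse_judge_scores(response: str) -> float:
    """Extract the three 1-5 criterion scores from a judge response and average them."""
    scores = [int(m) for m in re.findall(r":\s*([1-5])\b", response)]
    if len(scores) != 3:
        raise ValueError(f"expected 3 scores, got {len(scores)}: {response!r}")
    return sum(scores) / 3
```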

---

## 5. VieScore

### Pruna
- **Semantic**: `Rate 0-10: Does this image show "{prompt}"?`
- **Quality**: `Rate 0-10: How natural is this image? Any artifacts?`
- **Aggregation**: `sqrt(semantic * quality) / 10`
- **Source**: `metric_viescore.py`

### InferBench
- **SC (Semantic/Compliance)**: Long prompt with rules for editing success + overediting
  - Two images (original + edited)
  - `score1` = editing success (0–10), `score2` = overediting (0–10)
  - Output: `[score1, score2]`
- **PQ (Perceptual Quality)**: Long prompt for naturalness + artifacts
  - Single image
  - `naturalness` (0–10), `artifacts` (0–10)
  - Output: `[naturalness, artifacts]`
- **Aggregation**: `min(SC_scores)`, `min(PQ_scores)`, `overall = sqrt(SC * PQ)`
- **Context**: "You are a professional digital artist..." + JSON output format
- **Source**: `viescore.py`
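
The aggregation above (minimum over each score pair, then a geometric mean) is compact enough to write out as a sketch (function name assumed; divide by 10 for a [0, 1] overall, as in Pruna's variant):

```python
import math


def viescore_overall(sc_scores, pq_scores):
    """Take the worst of each score pair, then combine with a geometric mean."""
    sc = min(sc_scores)  # [editing success, overediting], each 0-10
    pq = min(pq_scores)  # [naturalness, artifacts], each 0-10
    return math.sqrt(sc * pq)
```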

---

## 6. QA Accuracy

### Pruna
- **Question**: `What is in this image? Answer:`
- **Scoring**: 1.0 if non-empty response, else 0.0
- **Use**: Generic image-understanding check
- **Source**: `metric_qa_accuracy.py`

### InferBench
- **Questions**: From GenEval metadata (e.g. "Does the image show at least one red apple?", "Does the image show exactly 3 cats?")
- **Template**: `{question} Please answer yes or no.`
- **Expected answers**: `Yes` for all (benchmark-specific)
- **Scoring**: Accuracy over N questions, plus `n_correct` and `n_incorrect` counts
- **Source**: `qa_accuracy.py`, `geneval.py` (benchmark)
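
The accuracy aggregation can be sketched as follows (a hypothetical helper; the actual `qa_accuracy.py` matching logic may differ):

```python
def qa_accuracy(responses, expected="yes"):
    """Return (accuracy, n_correct, n_incorrect) over a list of yes/no responses."""
    n_correct = sum(expected in r.lower() for r in responses)
    n_incorrect = len(responses) - n_correct
    accuracy = n_correct / max(len(responses), 1)
    return accuracy, n_correct, n_incorrect
```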

---

## 7. VLM Base Layer (Score Method)

### Pruna – LitellmVLM & TransformersVLM
- **Prompt**: `{question} Please answer yes or no.`
- **Scoring**: `1.0 if answer.lower() in response else 0.0` (the same substring check for both backends)
- **Source**: `vlm_base.py` line 371

### InferBench – OpenAIAPIVLM
- **Scoring**: Prefers logprobs (Yes/No token probabilities) when available
- **Fallback**: Generation + substring check ("yes"/"no" in response)
- **No prompt suffix**: Question passed as-is; metrics add their own suffix
- **Source**: `api_vlm_base.py`
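
The logprobs-preferred scoring with a generation fallback could be sketched as below (an illustration only; the real `api_vlm_base.py` token handling, e.g. casing or tokenizer variants of "Yes"/"No", is surely more involved):

```python
import math


def score_yes_no(logprobs=None, generation=None, answer="yes"):
    """Score the 'Yes' probability from token logprobs, else substring-check the generation.

    logprobs   : optional dict mapping candidate tokens to log-probabilities
    generation : optional generated text, used when logprobs are unavailable
    """
    if logprobs is not None:
        p_yes = math.exp(logprobs.get("Yes", float("-inf")))
        p_no = math.exp(logprobs.get("No", float("-inf")))
        total = p_yes + p_no
        p = p_yes / total if total > 0 else 0.0
        return p if answer == "yes" else 1.0 - p
    # Fallback: plain substring check on the generated text.
    return 1.0 if answer in (generation or "").lower() else 0.0
```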

---

## Recommendations

1. **Alignment / VQA**: InferBench's multi-question + dependency setup is more detailed; Pruna's single-question version is simpler. For OneIG-style benchmarks, InferBench's approach is required.

2. **Text Score**: InferBench's OCR prompt is more explicit and robust; Pruna now uses an InferBench-style OCR prompt and supports ground-truth edit distance when the ground truth contains `text_content`.

3. **Img Edit Score**: InferBench uses the full ImgEdit judge prompts; Pruna uses an improved single 0–10 rating with explicit scale instructions. For ImgEdit benchmarks, InferBench's prompts are necessary.

4. **VieScore**: InferBench's SC + PQ prompts match the original VieScore design; Pruna uses improved prompts with an explicit 0–10 scale.

5. **VLM Base**: Pruna now uses a unified "Please answer yes or no." suffix for both Litellm and Transformers.

pyproject.toml

Lines changed: 1 addition & 3 deletions
```diff
@@ -142,10 +142,8 @@ dependencies = [

 [project.optional-dependencies]
 evaluation = [
-    "pydantic>=2.0.0",
+    "outlines>1.2.0,<2.0.0",
     "litellm>=1.0.0",
-    "transformers>=4.40.0",
-    "accelerate>=0.20.0",
 ]

 stable-fast = [
```

src/pruna/evaluation/metrics/__init__.py

Lines changed: 11 additions & 8 deletions
```diff
@@ -15,23 +15,22 @@
 from pruna.evaluation.metrics.registry import MetricRegistry # isort:skip

 from pruna.evaluation.metrics.aesthetic_laion import AestheticLAION
+from pruna.evaluation.metrics.metric_alignment_score import AlignmentScoreMetric
 from pruna.evaluation.metrics.metric_cmmd import CMMD
 from pruna.evaluation.metrics.metric_dino_score import DinoScore
 from pruna.evaluation.metrics.metric_elapsed_time import LatencyMetric, ThroughputMetric, TotalTimeMetric
 from pruna.evaluation.metrics.metric_energy import CO2EmissionsMetric, EnergyConsumedMetric
+from pruna.evaluation.metrics.metric_img_edit_score import ImageEditScoreMetric
 from pruna.evaluation.metrics.metric_memory import DiskMemoryMetric, InferenceMemoryMetric, TrainingMemoryMetric
 from pruna.evaluation.metrics.metric_model_architecture import TotalMACsMetric, TotalParamsMetric
 from pruna.evaluation.metrics.metric_pairwise_clip import PairwiseClipScore
+from pruna.evaluation.metrics.metric_qa_accuracy import QAAccuracyMetric
 from pruna.evaluation.metrics.metric_sharpness import SharpnessMetric
+from pruna.evaluation.metrics.metric_text_score import TextScoreMetric
 from pruna.evaluation.metrics.metric_torch import TorchMetricWrapper
-from pruna.evaluation.metrics.metrics_vlm import (
-    AlignmentScoreMetric,
-    ImageEditScoreMetric,
-    QAAccuracyMetric,
-    TextScoreMetric,
-    VieScoreMetric,
-    VQAMetric,
-)
+from pruna.evaluation.metrics.metric_viescore import VieScoreMetric
+from pruna.evaluation.metrics.metric_vqa import VQAMetric
+from pruna.evaluation.metrics.vlm_base import BaseVLM, LitellmVLM, TransformersVLM, get_vlm

 __all__ = [
     "MetricRegistry",
@@ -57,4 +56,8 @@
     "QAAccuracyMetric",
     "TextScoreMetric",
     "VieScoreMetric",
+    "BaseVLM",
+    "LitellmVLM",
+    "TransformersVLM",
+    "get_vlm",
 ]
```
Lines changed: 120 additions & 0 deletions
```python
# Copyright 2025 - Pruna AI GmbH. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Alignment Score metric using VLM for image-text alignment evaluation."""

from __future__ import annotations

from typing import Any, List, Literal, Optional

import numpy as np
import torch

from pruna.engine.utils import set_to_best_available_device
from pruna.evaluation.metrics.metric_stateful import StatefulMetric
from pruna.evaluation.metrics.metric_vlm_utils import YesNoAnswer, _process_images
from pruna.evaluation.metrics.registry import MetricRegistry
from pruna.evaluation.metrics.result import MetricResult
from pruna.evaluation.metrics.utils import SINGLE, get_call_type_for_single_metric, metric_data_processor
from pruna.evaluation.metrics.vlm_base import BaseVLM, get_vlm


@MetricRegistry.register("alignment_score")
class AlignmentScoreMetric(StatefulMetric):
    """
    Alignment Score metric using VLM.

    Assesses how well generated images match text prompts through structured questioning.
    Higher scores indicate better alignment.

    Parameters
    ----------
    *args : Any
        Additional positional arguments.
    vlm : BaseVLM | None, optional
        Custom VLM instance. If provided, vlm_type and model_name are ignored.
    vlm_type : {"litellm", "transformers"}, optional
        VLM backend. Default is "litellm".
    model_name : str, optional
        Model name. Default is "gpt-4o".
    vlm_kwargs : dict, optional
        Extra kwargs for VLM init (e.g. model_load_kwargs for transformers).
    structured_output : bool, optional
        Use structured generation. Default is True.
    use_outlines : bool, optional
        Use outlines for transformers. Default is False.
    device : str | torch.device | None, optional
        Device for transformers VLM.
    api_key : str | None, optional
        API key for litellm.
    call_type : str, optional
        Call type for the metric.
    **kwargs : Any
        Additional arguments.
    """

    scores: List[float]
    default_call_type: str = "y"
    higher_is_better: bool = True
    metric_name: str = "alignment_score"
    runs_on: List[str] = ["cpu"]

    def __init__(
        self,
        *args,
        vlm: Optional[BaseVLM] = None,
        vlm_type: Literal["litellm", "transformers"] = "litellm",
        model_name: str = "gpt-4o",
        vlm_kwargs: Optional[dict] = None,
        structured_output: bool = True,
        use_outlines: bool = False,
        device=None,
        api_key: Optional[str] = None,
        call_type: str = SINGLE,
        **kwargs,
    ):
        super().__init__(device=device)
        self.device = set_to_best_available_device(device)

        self.vlm = get_vlm(
            vlm=vlm,
            vlm_type=vlm_type,
            model_name=model_name,
            device=device,
            api_key=api_key,
            use_outlines=use_outlines,
            **(vlm_kwargs or {}),
        )
        self.response_format = (
            YesNoAnswer if structured_output and vlm_type == "litellm" else
            ("yes_no" if structured_output and vlm_type == "transformers" else None)
        )

        self.call_type = get_call_type_for_single_metric(call_type, self.default_call_type)
        self.add_state("scores", [])

    def update(self, x: List[Any] | torch.Tensor, gt: torch.Tensor, outputs: torch.Tensor) -> None:
        inputs = metric_data_processor(x, gt, outputs, self.call_type)
        images = _process_images(inputs[0])
        prompts = x if isinstance(x, list) else [""] * len(images)
        for i, image in enumerate(images):
            prompt = prompts[i] if i < len(prompts) else ""
            question = f'Does this image show "{prompt}"?'
            score = self.vlm.score([image], [question], ["Yes"], response_format=self.response_format)[0]
            self.scores.append(score)

    def compute(self) -> MetricResult:
        if not self.scores:
            return MetricResult(self.metric_name, self.__dict__, 0.0)
        return MetricResult(self.metric_name, self.__dict__, float(np.mean(self.scores)))
```
