
feat: Metric Future Pack: Introduce metric types #391

Merged
nachiket-galileo merged 3 commits into main from dev/nachiket/metrics-pack-2
Oct 31, 2025

Conversation

@nachiket-galileo (Contributor) commented Oct 30, 2025

User description

Shortcut:

Description:

Overview

Refactored the metric system to have a clean 4-type class hierarchy with a common base class, as requested.

New Class Hierarchy

Base Class: Metric (Abstract)

Common attributes shared by all metric types:

  • id: str | None - Unique identifier
  • name: str - Metric name
  • scorer_type: ScorerTypes | None - Type of scorer
  • description: str - Metric description
  • tags: list[str] - Associated tags
  • created_at: datetime | None - Creation timestamp
  • updated_at: datetime | None - Update timestamp
  • version: int | None - Version number

Common methods:

  • get(id=..., name=...) - Retrieve existing metric (returns appropriate subclass)
  • list(...) - List metrics (returns appropriate subclasses)
  • delete_by_name(name) - Delete metric by name
  • delete() - Delete this metric
  • refresh() - Refresh from API
  • update() - Update metric (not implemented)
  • to_legacy_metric() - Convert to legacy format

1. LlmMetric - Custom LLM-based Metrics

For creating custom metrics evaluated by an LLM judge.

Additional Attributes:

  • prompt: str - Prompt template for the LLM scorer
  • model: str - Model name (e.g., "gpt-4o-mini")
  • judges: int - Number of judges for scoring
  • cot_enabled: bool - Chain-of-thought enabled
  • node_level: StepType - Node level (e.g., StepType.llm)
  • output_type: OutputTypeEnum - Output type (percentage, boolean, etc.)

Additional Methods:

  • create() - Persist to API

Example:

from galileo.__future__ import LlmMetric

metric = LlmMetric(
    name="response_quality",
    prompt="Rate the quality of this response...",
    model="gpt-4o-mini",
    judges=3,
    output_type="percentage",
    cot_enabled=True,
).create()

Backward Compatibility:

  • Supports deprecated parameter names: user_prompt, model_name, num_judges
  • New parameter names take precedence over old ones

2. LocalMetric - Function-based Local Metrics

For metrics that use Python functions to score locally without API calls.

Additional Attributes:

  • scorer_fn: Callable - Scoring function
  • scorable_types: list[StepType] - Types that can be scored
  • aggregatable_types: list[StepType] - Types for aggregation

Additional Methods:

  • to_local_metric_config() - Convert to LocalMetricConfig format

Example:

from galileo.__future__ import LocalMetric
from galileo_core.schemas.logging.step import StepType

def response_length_scorer(trace_or_span):
    if hasattr(trace_or_span, "output") and trace_or_span.output:
        return min(len(trace_or_span.output) / 100.0, 1.0)
    return 0.0

metric = LocalMetric(
    name="response_length",
    scorer_fn=response_length_scorer,
    scorable_types=[StepType.llm],
    aggregatable_types=[StepType.trace],
)
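
Because the scorer is a plain Python function, it can be exercised without any API calls; here a `SimpleNamespace` stands in for the trace/span object the SDK would pass:

```python
from types import SimpleNamespace

def response_length_scorer(trace_or_span):
    if hasattr(trace_or_span, "output") and trace_or_span.output:
        return min(len(trace_or_span.output) / 100.0, 1.0)
    return 0.0

short_span = SimpleNamespace(output="hi")      # 2 chars -> 0.02
long_span = SimpleNamespace(output="x" * 250)  # capped at 1.0
empty_span = SimpleNamespace(output=None)      # no output -> 0.0
```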

3. CodeMetric - Code-based Metrics

For code-based scorers (limited support).

Notes:

  • create() method raises NotImplementedError
  • Can be retrieved via Metric.get() if they exist

Example:

from galileo.__future__ import CodeMetric, Metric

# Get existing code metric
metric = Metric.get(name="my-code-metric")
assert isinstance(metric, CodeMetric)

4. GalileoMetric - Built-in Galileo Scorers

For Galileo's built-in scorers (correctness, completeness, toxicity, etc.).

Access via:

  • Metric.scorers.<scorer_name> - e.g., Metric.scorers.correctness
  • Metric.get(name="scorer_name") - Returns GalileoMetric instance

Example:

from galileo.__future__ import GalileoMetric, Metric

# Access built-in scorers
correctness = Metric.scorers.correctness
completeness = Metric.scorers.completeness

# Or get by name
metric = Metric.get(name="correctness")
assert isinstance(metric, GalileoMetric)

Key Features

Type-aware Factory Methods

The Metric.get() and Metric.list() methods automatically return the appropriate subclass based on scorer_type:

  • ScorerTypes.LLM → LlmMetric
  • ScorerTypes.CODE → CodeMetric
  • Other types → GalileoMetric

Proper Type Annotations

All methods are properly annotated with type hints and pass mypy strict type checking.

Clean Separation of Concerns

Each metric type has only the attributes and methods relevant to its purpose:

  • LlmMetric has prompt, model, judges
  • LocalMetric has scorer_fn, scorable_types
  • CodeMetric and GalileoMetric are minimal

Files Modified

1. /src/galileo/__future__/metric.py

  • Converted Metric to abstract base class (ABC)
  • Created 4 concrete subclasses: LlmMetric, LocalMetric, CodeMetric, GalileoMetric
  • Updated get() and list() to return appropriate subclass instances
  • Moved LLM-specific logic to LlmMetric
  • Moved local metric logic to LocalMetric
  • Updated _populate_from_scorer_response() to handle different types

2. /src/galileo/__future__/__init__.py

  • Added exports for all 4 metric types
  • Updated __all__ list

3. /tests/future/test_metric.py

  • Updated all tests to use appropriate metric classes (LlmMetric, LocalMetric, etc.)
  • Fixed type expectations in assertions

4. /tests/future/test_metric_types.py (NEW)

  • Comprehensive test suite for all 4 metric types
  • 37 tests covering initialization, validation, methods, inheritance, edge cases
  • Tests for backward compatibility with deprecated parameters

Breaking Changes

For Users Creating Metrics Directly

Before:

from galileo.__future__ import Metric

# LLM metric
metric = Metric(name="test", prompt="Rate this").create()

# Local metric
def scorer(t): return 0.5
metric = Metric(name="test", scorer_fn=scorer)

After:

from galileo.__future__ import LlmMetric, LocalMetric

# LLM metric - use LlmMetric
metric = LlmMetric(name="test", prompt="Rate this").create()

# Local metric - use LocalMetric
def scorer(t): return 0.5
metric = LocalMetric(name="test", scorer_fn=scorer)

For Users Retrieving Metrics

No Breaking Changes - Metric.get() and Metric.list() still work the same, but now return properly typed subclass instances.

# Still works, but returns LlmMetric, CodeMetric, or GalileoMetric
metric = Metric.get(name="my-metric")

Migration Guide

Simple Migration

  1. For LLM metrics: Replace Metric(...) with LlmMetric(...)
  2. For local metrics: Replace Metric(..., scorer_fn=...) with LocalMetric(..., scorer_fn=...)
  3. For retrieving metrics: No changes needed, but can use isinstance() to check types

Type Checking

from galileo.__future__ import GalileoMetric, LlmMetric, LocalMetric, Metric

metric = Metric.get(name="my-metric")

if isinstance(metric, LlmMetric):
    print(f"LLM metric with prompt: {metric.prompt}")
elif isinstance(metric, LocalMetric):
    print(f"Local metric with function: {metric.scorer_fn}")
elif isinstance(metric, GalileoMetric):
    print("Built-in Galileo scorer")

Testing

Test Coverage

  • 37 new tests in test_metric_types.py
  • All existing tests updated in test_metric.py
  • Type checking: Passes mypy strict mode
  • Linting: No errors (only pytest import warnings in test context)

Running Tests

# Run new metric types tests
pytest tests/future/test_metric_types.py -v

# Run all metric tests
pytest tests/future/test_metric.py tests/future/test_metric_types.py -v

# Type check
mypy src/galileo/__future__/metric.py

Benefits

  1. Clearer API: Users know exactly which class to use for their use case
  2. Better Type Safety: IDE autocomplete and type checkers work correctly
  3. Maintainability: Each class has only relevant attributes and methods
  4. Extensibility: Easy to add new metric types in the future
  5. Documentation: Each class can have focused, relevant documentation

Tests:

  • Unit Tests Added
  • E2E Test Added (if it's a user-facing feature, or fixing a bug)

Generated description

Below is a concise technical summary of the changes proposed in this PR:

graph LR
Metric_get_("Metric.get"):::modified
Metric_create_metric_from_type_("Metric._create_metric_from_type"):::added
Metric_list_("Metric.list"):::modified
LlmMetric_create_("LlmMetric.create"):::added
Metric_refresh_("Metric.refresh"):::modified
Metric_get_ -- "Instantiates correct Metric subclass based on scorer_type." --> Metric_create_metric_from_type_
Metric_list_ -- "Creates correct Metric subclass instances per scorer_type." --> Metric_create_metric_from_type_
LlmMetric_create_ -- "Refreshes metric instance to sync full scorer details." --> Metric_refresh_
classDef added stroke:#15AA7A
classDef removed stroke:#CD5270
classDef modified stroke:#EDAC4C
linkStyle default stroke:#CBD5E1,font-size:13px

Refactor the Metric system by introducing an abstract base class Metric and four concrete subclasses: LlmMetric, LocalMetric, CodeMetric, and GalileoMetric, providing a type-safe and extensible API for defining and managing various metric types. Update Metric.get() and Metric.list() to return appropriate subclass instances, enhancing clarity and maintainability.

Topic: Metric Type Unit Tests
Add comprehensive unit tests for the new metric class hierarchy, covering initialization, validation, methods, inheritance, and backward compatibility.
Modified files (2):
  • tests/future/test_metric.py
  • tests/future/test_metric_types.py

Topic: Metric Type Hierarchy
Introduce an abstract Metric base class and four concrete subclasses (LlmMetric, LocalMetric, CodeMetric, GalileoMetric) to provide a type-safe and extensible API for defining and managing different metric types.
Modified files (2):
  • src/galileo/__future__/metric.py
  • src/galileo/__future__/__init__.py

@codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 92.45283% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.32%. Comparing base (e48cc16) to head (c1fda9e).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/galileo/__future__/metric.py 92.38% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #391      +/-   ##
==========================================
+ Coverage   86.16%   86.32%   +0.15%     
==========================================
  Files          75       75              
  Lines        5864     5880      +16     
==========================================
+ Hits         5053     5076      +23     
+ Misses        811      804       -7     


scorer_fn=response_length_scorer,
scorable_types=[StepType.llm],
aggregatable_types=[StepType.trace],
what about the scorable & aggregatable types here?


Commit ba74cf8 addressed this comment by adding comprehensive handling of scorable and aggregatable types. The diff introduces a factory method _create_metric_from_type that properly handles different scorer types (LLM, CODE, Galileo), and improves the extraction and handling of scoreable node types from API responses throughout the codebase.

Attributes
----------
prompt (str | None): Prompt template for the LLM scorer.
model (str | None): Model name to use for scoring.

should we have something here that indicates that the model must align with a model name available in Galileo? Will we have an Integration or Model top-level class in the Python SDK that we could reference here instead of just a string?


yep, we are moving in that direction.

@dmcwhorter (Contributor) left a comment:
looks great

@vamaq (Contributor) left a comment:

The only pending action would be to solve some of the comments and we are ready to :shipit:

@nachiket-galileo nachiket-galileo enabled auto-merge (squash) October 31, 2025 16:16
@nachiket-galileo nachiket-galileo merged commit 08ad4d6 into main Oct 31, 2025
21 checks passed
@nachiket-galileo nachiket-galileo deleted the dev/nachiket/metrics-pack-2 branch October 31, 2025 16:47