feat: Metric future package #378

Merged
nachiket-galileo merged 10 commits into main from dev/nachiket/metric-futures-pack on Oct 29, 2025
Conversation

@nachiket-galileo (Contributor) commented Oct 22, 2025

User description

Overview

The new galileo.__future__.Metric class provides a unified, object-oriented interface for working with all types of Galileo metrics. It's fully backward compatible with existing code while offering a much more intuitive API.

Key Features

Three Ways to Use Metrics

  1. Built-in Galileo Scorers - Access via Metric.scorers.correctness
  2. Custom LLM Metrics - Create with prompt templates and judge models
  3. Local Function Metrics - Define custom scoring functions

Complete Backward Compatibility

  • All existing functions (create_custom_llm_metric, delete_metric, get_metrics) still work
  • Existing LogStream.enable_metrics() and enable_metrics() functions still work
  • Can convert to legacy types: to_legacy_metric(), to_local_metric_config()
  • Accepts both old parameter names (user_prompt, model_name, num_judges) and new cleaner ones (prompt, model, judges)
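The conversion methods and parameter aliases listed above can be sketched in a self-contained way. The class bodies below are illustrative stand-ins, not galileo's actual implementations; only the method names (to_legacy_metric, to_local_metric_config) come from the PR description.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LegacyMetric:
    """Stand-in for galileo.schema.metrics.Metric."""
    name: str
    version: Optional[int] = None


@dataclass
class LocalMetricConfig:
    """Stand-in for galileo.schema.metrics.LocalMetricConfig."""
    name: str
    scorer_fn: Optional[Callable] = None


class Metric:
    def __init__(self, name, scorer_fn=None, version=None):
        self.name, self.scorer_fn, self.version = name, scorer_fn, version

    def to_legacy_metric(self) -> LegacyMetric:
        # Any Metric can be referenced by name/version in the legacy API.
        return LegacyMetric(name=self.name, version=self.version)

    def to_local_metric_config(self) -> LocalMetricConfig:
        # Only local metrics carry a scorer_fn to convert.
        if self.scorer_fn is None:
            raise ValueError("only local metrics convert to LocalMetricConfig")
        return LocalMetricConfig(name=self.name, scorer_fn=self.scorer_fn)
```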

Enhanced Capabilities

  • Retrieve metrics by ID or name (Metric.get())
  • List and filter metrics (Metric.list())
  • State management - Know if metric is synced, local-only, or failed
  • Refresh from API - metric.refresh()
  • Cleaner API - prompt instead of user_prompt, model instead of model_name

API Comparison

Built-in Scorers

NEW API (Preferred):

from galileo.__future__ import Metric, LogStream

log_stream = LogStream.get(name="my-stream", project_name="my-project")
log_stream.set_metrics([
    Metric.scorers.correctness,  # ✨ Most intuitive!
    Metric.scorers.completeness,
    Metric.scorers.toxicity,
])

EXISTING APIs (Still work!):

from galileo.schema.metrics import GalileoScorers

log_stream.set_metrics([
    GalileoScorers.correctness,  # Still works
    "completeness",               # String names still work
])

Custom LLM Metrics

NEW API (Improved):

from galileo.__future__ import Metric, StepType

metric = Metric(
    name="quality_checker",
    prompt="Rate the quality: {output}",  # Cleaner than 'user_prompt'
    model="gpt-4o-mini",                   # Cleaner than 'model_name'
    judges=3,                               # Cleaner than 'num_judges'
    node_level=StepType.llm,
    output_type="percentage",
    cot_enabled=True,
).create()

# Use it
log_stream.set_metrics([metric])

EXISTING API (Still works!):

from galileo.metrics import create_custom_llm_metric

version = create_custom_llm_metric(
    name="quality_checker",
    user_prompt="Rate the quality: {output}",  # Still works
    model_name="gpt-4o-mini",                   # Still works
    num_judges=3,                                # Still works
    node_level=StepType.llm,
)

Local Function Metrics

NEW API:

from galileo.__future__ import Metric, StepType

def my_scorer(trace_or_span):
    if hasattr(trace_or_span, "output"):
        return len(trace_or_span.output) / 100.0
    return 0.0

metric = Metric(
    name="response_length",
    scorer_fn=my_scorer,
    scorable_types=[StepType.llm],
    aggregatable_types=[StepType.trace],
)

# Use it
log_stream.set_metrics([metric])

EXISTING API (Still works!):

from galileo.schema.metrics import LocalMetricConfig

config = LocalMetricConfig(
    name="response_length",
    scorer_fn=my_scorer,
    scorable_types=[StepType.llm],
    aggregatable_types=[StepType.trace],
)

log_stream.set_metrics([config])  # Still works!

New Capabilities

1. Retrieve Metrics

# Get by name
metric = Metric.get(name="quality_checker")

# Get by ID
metric = Metric.get(id="abc-123-def")

# Returns None if not found
if metric is None:
    print("Metric not found")

2. List and Filter Metrics

# List all metrics
all_metrics = Metric.list()

# Filter by name
filtered = Metric.list(name_filter="quality")

# Filter by type
from galileo.resources.models import ScorerTypes
llm_metrics = Metric.list(scorer_types=[ScorerTypes.LLM])

3. State Management

# Create locally
metric = Metric(name="test", prompt="Test prompt", model="gpt-4o-mini")

# Check state
print(metric.is_local_only())  # True
print(metric.is_synced())       # False

# Persist to API
metric.create()

print(metric.is_local_only())  # False
print(metric.is_synced())       # True
print(metric.id)                # UUID assigned by API

# Refresh from API
metric.refresh()

# Delete
metric.delete()
print(metric.is_deleted())      # True

4. Cleaner Parameter Names

# ✅ New (preferred)
Metric(
    prompt="...",
    model="gpt-4o-mini",
    judges=3,
)

# ✅ Old (still works)
Metric(
    user_prompt="...",
    model_name="gpt-4o-mini",
    num_judges=3,
)

# ✅ Mix and match (new overrides old)
Metric(
    prompt="...",              # Takes precedence
    user_prompt="ignored",     # Ignored
    model="gpt-4o-mini",       # Takes precedence
    model_name="ignored",      # Ignored
)
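The "new overrides old" precedence shown above can be captured by a tiny resolver. This is an illustrative sketch of the rule, not the package's actual code:

```python
def resolve_aliases(prompt=None, user_prompt=None,
                    model=None, model_name=None,
                    judges=None, num_judges=None):
    """Return canonical values; the new-style name wins when both are given."""
    return {
        "prompt": prompt if prompt is not None else user_prompt,
        "model": model if model is not None else model_name,
        "judges": judges if judges is not None else num_judges,
    }
```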

Migration Examples

Example 1: Simple Migration

Before:

from galileo.schema.metrics import GalileoScorers
from galileo.log_streams import enable_metrics

enable_metrics(
    log_stream_name="my-stream",
    project_name="my-project",
    metrics=[GalileoScorers.correctness, "completeness"]
)

After (Optional):

from galileo.__future__ import Metric, LogStream

log_stream = LogStream.get(name="my-stream", project_name="my-project")
log_stream.set_metrics([
    Metric.scorers.correctness,
    Metric.scorers.completeness,
])

Result: ✅ Both work! No need to migrate unless you prefer the new API.

Example 2: Custom LLM Metric

Before:

from galileo.metrics import create_custom_llm_metric

version = create_custom_llm_metric(
    name="quality",
    user_prompt="Rate this: {output}",
    model_name="gpt-4.1-mini",
    num_judges=3,
)
scorer_id = version.scorer_id

After (Optional):

from galileo.__future__ import Metric

metric = Metric(
    name="quality",
    prompt="Rate this: {output}",  # Cleaner
    model="gpt-4.1-mini",            # Cleaner
    judges=3,                         # Cleaner
).create()

scorer_id = metric.id

Benefits of migrating:

  • ✅ Cleaner parameter names
  • ✅ Can retrieve later: Metric.get(name="quality")
  • ✅ State management: metric.is_synced()
  • ✅ Object-oriented workflow

Example 3: All Three Metric Types

Before:

from galileo.schema.metrics import GalileoScorers, LocalMetricConfig, Metric

def my_scorer(span):
    return 0.5

metrics = [
    GalileoScorers.correctness,
    Metric(name="custom_metric", version=2),
    LocalMetricConfig(name="local", scorer_fn=my_scorer),
]

enable_metrics(log_stream_name="stream", project_name="proj", metrics=metrics)

After (Simpler):

from galileo.__future__ import Metric

def my_scorer(span):
    return 0.5

log_stream.set_metrics([
    Metric.scorers.correctness,           # Built-in
    Metric.get(name="custom_metric"),     # Existing custom
    Metric(name="local", scorer_fn=my_scorer),  # Local
])

Implementation Details

Architecture

The new Metric class:

  1. Extends BusinessObjectMixin - Provides state management
  2. Wraps existing services - Uses Metrics(), Scorers() under the hood
  3. Provides conversion methods - to_legacy_metric(), to_local_metric_config()
  4. Works with existing infrastructure - Compatible with enable_metrics(), set_metrics()

State Diagram

LOCAL_ONLY ──.create()──> SYNCED ──.delete()──> DELETED
                 (.refresh() re-syncs a SYNCED metric from the API)

SYNCED ──API error──> FAILED_SYNC ──.refresh()──> SYNCED
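The transitions in the diagram can be modeled as a small lookup table. State and event names follow the diagram; the code itself is an illustrative sketch, not galileo's implementation:

```python
from enum import Enum, auto


class SyncState(Enum):
    LOCAL_ONLY = auto()
    SYNCED = auto()
    FAILED_SYNC = auto()
    DELETED = auto()


# Allowed (event, from-state) -> to-state transitions from the diagram.
TRANSITIONS = {
    ("create", SyncState.LOCAL_ONLY): SyncState.SYNCED,
    ("delete", SyncState.SYNCED): SyncState.DELETED,
    ("refresh", SyncState.SYNCED): SyncState.SYNCED,
    ("api_error", SyncState.SYNCED): SyncState.FAILED_SYNC,
    ("refresh", SyncState.FAILED_SYNC): SyncState.SYNCED,
}


def step(state: SyncState, event: str) -> SyncState:
    """Apply one event; reject transitions the diagram does not allow."""
    try:
        return TRANSITIONS[(event, state)]
    except KeyError:
        raise ValueError(f"{event!r} not allowed from {state.name}")
```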

Type System

# Three metric creation patterns:

# 1. LLM Metric
Metric(
    name="...",
    prompt="...",
    scorer_type=ScorerTypes.LLM  # Auto-detected
)

# 2. Local Metric  
Metric(
    name="...",
    scorer_fn=callable,
    scorer_type=None  # Local metrics don't have scorer_type
)

# 3. Reference to Existing
Metric(
    name="...",
    version=2,
    scorer_type=None  # Just a reference
)
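The "auto-detected" note in pattern 1 suggests detection logic along these lines. The ordering here is an assumption for illustration, not the SDK's actual code:

```python
def detect_scorer_type(prompt=None, scorer_fn=None, version=None):
    """Sketch of scorer_type auto-detection implied by the three patterns above.

    Returns "llm" for prompt-based metrics, and None both for local metrics
    (which run client-side) and for bare references to existing metrics.
    """
    if scorer_fn is not None:
        return None      # 2. local metric: no scorer_type
    if prompt is not None:
        return "llm"     # 1. custom LLM metric
    return None          # 3. name/version reference to an existing metric
```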

Common Use Cases

Use Case 1: Quick Setup with Built-ins

from galileo.__future__ import Metric, LogStream

log_stream = LogStream.get(name="prod", project_name="my-app")
log_stream.set_metrics([
    Metric.scorers.correctness,
    Metric.scorers.toxicity,
    Metric.scorers.prompt_injection,
])

Use Case 2: Custom Quality Scoring

from galileo.__future__ import Metric

quality_metric = Metric(
    name="response_quality_v2",
    prompt="""
    Rate response quality 1-10:
    - Accuracy
    - Completeness  
    - Clarity
    
    Input: {input}
    Output: {output}
    
    Score: """,
    model="gpt-4o",
    judges=5,
    tags=["quality", "v2"],
).create()

# Use across multiple streams
stream1.set_metrics([quality_metric])
stream2.set_metrics([quality_metric])

Use Case 3: Domain-Specific Local Metrics

from galileo.__future__ import Metric, StepType

def medical_terminology_scorer(span):
    """Check if medical terms are used correctly"""
    output = getattr(span, "output", "")
    medical_terms = ["diagnosis", "treatment", "symptoms"]
    return sum(1 for term in medical_terms if term in output.lower()) / len(medical_terms)

medical_metric = Metric(
    name="medical_terminology_usage",
    scorer_fn=medical_terminology_scorer,
    scorable_types=[StepType.llm],
)

log_stream.set_metrics([
    Metric.scorers.correctness,
    medical_metric,
])

Use Case 4: Metric Management Dashboard

from galileo.__future__ import Metric

# List all metrics
all_metrics = Metric.list()
print(f"Total: {len(all_metrics)}")

# Group by type
from collections import defaultdict
by_type = defaultdict(list)
for m in all_metrics:
    if m.scorer_type:
        by_type[m.scorer_type.value].append(m.name)

for type_name, names in by_type.items():
    print(f"{type_name}: {len(names)} metrics")

# Find unused metrics
active_metrics = {"correctness", "toxicity"}
for m in all_metrics:
    if m.name not in active_metrics:
        print(f"Unused: {m.name} (created {m.created_at})")
        # Optionally delete
        # m.delete()


Generated description

Below is a concise technical summary of the changes proposed in this PR:

graph LR
LogStream_set_metrics_("LogStream.set_metrics"):::modified
Metric_("Metric"):::added
LocalMetricConfig_("LocalMetricConfig"):::added
Metric_create_("Metric.create"):::added
Metrics_("Metrics"):::added
Metric_get_("Metric.get"):::added
Scorers_("Scorers"):::added
Metric_list_("Metric.list"):::added
Metric_delete_by_name_("Metric.delete_by_name"):::added
Metric_refresh_("Metric.refresh"):::added
Metric_to_legacy_metric_("Metric.to_legacy_metric"):::added
LogStream_set_metrics_ -- "Supports Metric objects for richer log stream metric configuration" --> Metric_
LogStream_set_metrics_ -- "Adds LocalMetricConfig support for local function-based metrics" --> LocalMetricConfig_
Metric_create_ -- "Creates custom LLM metric via Metrics service API call" --> Metrics_
Metric_get_ -- "Retrieves metric details using Scorers service API" --> Scorers_
Metric_list_ -- "Lists metrics with filters via Scorers service API" --> Scorers_
Metric_delete_by_name_ -- "Deletes metric by name using Metrics service API" --> Metrics_
Metric_refresh_ -- "Refreshes metric state from API via Scorers service" --> Scorers_
Metric_to_legacy_metric_ -- "Converts new Metric to legacy class for backward compatibility" --> Metric_
classDef added stroke:#15AA7A
classDef removed stroke:#CD5270
classDef modified stroke:#EDAC4C
linkStyle default stroke:#CBD5E1,font-size:13px

Introduce a new galileo.__future__.Metric class, providing a unified object-oriented interface for managing all types of Galileo metrics, including built-in, custom LLM, and local function metrics. Enhance the LogStream component to integrate seamlessly with this new Metric class, ensuring backward compatibility and offering improved metric retrieval and setting capabilities.

Topic: Metric API & Mgmt
Introduces the new galileo.__future__.Metric class, providing a unified object-oriented interface for creating, retrieving, listing, and managing built-in, custom LLM, and local function metrics.
Modified files (4)
  • tests/future/test_metric.py
  • src/galileo/__future__/types.py
  • src/galileo/__future__/metric.py
  • src/galileo/__future__/__init__.py
Latest contributors (2)
  • vamaq@users.noreply.gi... | feat-Add-declarative-f... | October 29, 2025
  • jimbobbennett@mac.com | fix-Fixing-docstrings-369 | October 15, 2025

Topic: LogStream Integration
Updates the LogStream to support the new Metric class, including a new get_metrics method and an enhanced set_metrics method, while also adding configuration options for default scorer models.
Modified files (3)
  • tests/future/test_configuration.py
  • src/galileo/__future__/log_stream.py
  • src/galileo/__future__/configuration.py
Latest contributors (2)
  • vamaq@users.noreply.gi... | feat-Add-declarative-f... | October 29, 2025
  • jimbobbennett@mac.com | fix-Fixing-docstrings-369 | October 15, 2025
This pull request is reviewed by Baz.

codecov bot commented Oct 22, 2025

Codecov Report

❌ Patch coverage is 81.42292% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.11%. Comparing base (6e1b5c4) to head (4fc0aef).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/galileo/__future__/metric.py 86.69% 31 Missing ⚠️
src/galileo/__future__/log_stream.py 15.38% 11 Missing ⚠️
src/galileo/__future__/types.py 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #378      +/-   ##
==========================================
- Coverage   86.33%   86.11%   -0.22%     
==========================================
  Files          73       75       +2     
  Lines        5612     5864     +252     
==========================================
+ Hits         4845     5050     +205     
- Misses        767      814      +47     


Args:
metrics: List of metrics to add. Supports:
- GalileoScorers enum values (e.g., GalileoScorers.correctness)
Contributor: here too

Contributor: Commit c187e90 addressed this comment. The documentation for the metrics parameter was updated to recommend the newer Metric.scorers approach while maintaining GalileoScorers for backward compatibility, which addresses the consistency issue flagged with "here too".

metrics = Metric.list()

# Delete a metric
metric.delete()
Contributor: Can I delete a metric by name without retrieving it first?

Contributor: I believe we just added this.

Contributor: Good point. We should probably add a class method for that too.
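A class method along the lines discussed might look like this. This is a hypothetical sketch; the in-memory dict stands in for the Metrics/Scorers backend API, and only the method name delete_by_name appears in the PR's change graph:

```python
class Metric:
    _store: dict = {}  # stand-in for the backend metric registry

    def __init__(self, name: str):
        self.name = name

    def create(self) -> "Metric":
        Metric._store[self.name] = self
        return self

    def delete(self) -> None:
        # Instance-level delete, as shown in the quoted docs.
        Metric._store.pop(self.name, None)

    @classmethod
    def delete_by_name(cls, name: str) -> bool:
        """Delete without retrieving first; True if something was deleted."""
        return cls._store.pop(name, None) is not None
```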

"""
Persist this metric to the API.

Only works for LLM metrics. Local metrics (with scorer_fn) don't need
Contributor: What about Galileo-hosted code-based metrics?

Contributor: Just a note that I don't think we support these from the client today, so we can leave these for a follow-on.

Contributor: We don't have an SDK function for this yet.

current_metrics = log_stream.get_metrics()
print(f"Currently enabled: {current_metrics}")
"""
from galileo.config import GalileoConfig
Contributor: Let's move all these imports to the top of the file.

logger.info(f"LogStream.add_metrics: setting {len(combined_metrics)} total metrics")
return self.set_metrics(combined_metrics)

def enable_metrics(
@vamaq (Contributor) commented Oct 22, 2025: As much as possible, we won't be addressing backward compatibility in the __future__ package. That will be part of the later process when we move these objects to the base module.

from galileo.__future__ import Metric

# Access built-in scorers
Metric.scorers.correctness
Contributor: I wonder if this should just be Metric.correctness?


@classmethod
def list(
cls, *, name_filter: str | None = None, scorer_types: list[ScorerTypes] | None = None
Contributor: I don't know if we want to expose ScorerTypes publicly.


return result

def _populate_from_scorer_response(self, scorer_response: Any) -> None:
Contributor: Shouldn't we use typing here? Why Any? Additionally, couldn't we use Pydantic serialization for this and add field_validators for any of the fields that need custom population?

else:
self.node_level = None

def update(self, **kwargs: Any) -> None:
Contributor: Why even add this to the SDK? I don't think we have plans to support this?

logger.error(f"Metric.delete: id='{self.id}' - failed: {e}")
raise

def refresh(self) -> None:
Contributor: Cool.

return f"Metric(name='{self.name}', type='local', scorer_fn={self.scorer_fn.__name__})"
if self.scorer_type:
return (
f"Metric(name='{self.name}', id='{self.id}', type='{self.scorer_type.value}', "
Contributor: Not all metrics will have judges, etc.

@john-weiler (Contributor) commented Oct 22, 2025

High level, I think we should have 4 types:

CodeMetric, LlmMetric, GalileoMetric, LocalMetric

They can inherit the common params from a base Metric class.
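A minimal sketch of the proposed hierarchy. The four subclass names come from the comment above; the fields are borrowed from the PR description, and everything else is illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class BaseMetric:
    """Common params shared by all metric kinds."""
    name: str
    description: str = ""
    tags: list = field(default_factory=list)


@dataclass
class GalileoMetric(BaseMetric):
    """Built-in Galileo scorer, referenced by name."""


@dataclass
class LlmMetric(BaseMetric):
    """Custom LLM-judged metric."""
    prompt: str = ""
    model: str = "gpt-4o-mini"
    judges: int = 1


@dataclass
class CodeMetric(BaseMetric):
    """Galileo-hosted code-based metric (not yet supported client-side)."""
    code: str = ""


@dataclass
class LocalMetric(BaseMetric):
    """Client-side function-based metric."""
    scorer_fn: object = None
```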

scorer_type (ScorerTypes | None): The type of scorer (LLM, CODE, LOCAL, etc.).
description (str): Description of the metric.
tags (list[str]): Tags associated with the metric.
prompt (str | None): Prompt template for LLM-based scorers (alias for user_prompt).
Contributor: We should probably rename the properties to reflect what they are. E.g., a prompt, according to our definition, should be a prompt object, so the parameter name here should be prompt_name. Same for all the other properties.

@nachiket-galileo force-pushed the dev/nachiket/metric-futures-pack branch from 8ded234 to c187e90 on October 29, 2025 15:11
from galileo.search import RecordType
=======
from galileo.schema.metrics import GalileoScorers, LocalMetricConfig
>>>>>>> 37c9012 (add/update)
Contributor: ❌ Failed check: Test / test (ubuntu-latest, 3.12)
I’ve attached the relevant part of the log for your convenience:
Invalid decimal literal [syntax] - merge conflict marker detected (>>>>>>> 37c9012 (add/update))


Finding type: Log Error

Contributor: Commit e11a8d3 addressed this comment by removing the merge conflict marker ">>>>>>> 37c9012 (add/update)" that was causing the syntax error. The diff shows clean code without any conflict markers, resolving the test failure.

Comment on lines +371 to +376
logger.info(f"LogStream.get_metrics: id='{self.id}' - started")
config = GalileoConfig.get()

settings = get_settings_projects_project_id_runs_run_id_scorer_settings_get.sync(
project_id=self.project_id, run_id=self.id, client=config.api_client
)
Contributor: Could we restore the guard that raises ValueError when self.id or self.project_id is missing before calling the scorer-settings API? As written, a locally constructed log stream calls get_settings(..., project_id=None, run_id=None, ...), so instead of the documented error we'll send an invalid request to the backend.

Suggested change (add the guard before the API call):

if self.id is None or self.project_id is None:
    raise ValueError("LogStream must have both id and project_id to get metrics")
logger.info(f"LogStream.get_metrics: id='{self.id}' - started")
config = GalileoConfig.get()
settings = get_settings_projects_project_id_runs_run_id_scorer_settings_get.sync(
    project_id=self.project_id, run_id=self.id, client=config.api_client
)

Finding type: Logical Bugs

Comment on lines +415 to 424
from galileo.__future__ import Metric, LogStream

project = Project.get(name="My AI Project")
log_stream = project.create_log_stream(name="Production Logs")
log_stream = LogStream.get(name="Production Logs", project_name="My Project")

# Enable built-in metrics
local_metrics = log_stream.enable_metrics([
GalileoScorers.correctness,
GalileoScorers.completeness,
"context_relevance"
# Set metrics (replaces existing)
log_stream.set_metrics([
Metric.scorers.correctness,
Metric.scorers.completeness,
Metric.get(id="metric-from-console-uuid"), # From console
])
Contributor: The new docstring advertises support for galileo.__future__.Metric (e.g. using Metric.scorers.correctness), but this module still imports Metric from galileo.schema.metrics. That legacy class has no scorers attribute or get helper, so following the example now raises AttributeError and set_metrics never receives the new Metric objects. Please import Metric from galileo.__future__.metric (and adjust the union) so the promised type actually works.


Finding type: Type Inconsistency

Comment on lines +465 to +469
instance = cls.__new__(cls)
StateManagementMixin.__init__(instance)
instance._populate_from_scorer_response(retrieved_scorer)
instance._set_state(SyncState.SYNCED)
result.append(instance)
Contributor: Would it make sense to extract the repeated logic of creating and initializing a Metric instance from a scorer response into a helper method, since the same 3+ lines appear here and at line 427? For example:

@classmethod
def _from_scorer_response(cls, scorer):
    instance = cls.__new__(cls)
    StateManagementMixin.__init__(instance)
    instance._populate_from_scorer_response(scorer)
    instance._set_state(SyncState.SYNCED)
    return instance

Then you could call this helper in both places to avoid duplication.

Prompt for AI Agents:

In `src/galileo/__future__/metric.py` around lines 465-469 and line 427, there is
repeated code for creating and initializing Metric instances from scorer responses.
Refactor by extracting a class method like `_from_scorer_response` that encapsulates the
common instance creation logic. This method should create a new instance, initialize it
with StateManagementMixin, populate from the scorer response, set the sync state, and
return the instance. Replace the duplicate code blocks with calls to this new helper
method to eliminate code duplication and improve maintainability.

Finding type: Code Dedup and Conventions

@nachiket-galileo force-pushed the dev/nachiket/metric-futures-pack branch from e11a8d3 to 0624b9d on October 29, 2025 17:48
@nachiket-galileo enabled auto-merge (squash) on October 29, 2025 17:58
@nachiket-galileo merged commit 16f176d into main on Oct 29, 2025
35 of 36 checks passed
@nachiket-galileo deleted the dev/nachiket/metric-futures-pack branch on October 29, 2025 18:56