
Commit 4271c36

[Enhancement] Adding aggregator metrics to platform for generic or task-specific usage (#66)

* Adding unit_metric_result class, since it is required in aggregator_metrics for downstream tasks
* Fix linting errors for unit metric result
* Adding support for metric registry, base class for aggregator metrics, accuracy
* Modified existing base classes to make them more generic and extensible
* Adding support for precision and recall metrics
* Adding support for F1 score computation using existing logic for precision and recall
* Fixing division bug in base aggregator metric, and adding relevant unit tests
* Changes to avoid default instantiation of class-level metrics
* Adding documentation for aggregator metrics
* Modifying documentation for end-user ease of use and dev understanding
* Fixing linting errors
* Fixed unit test error causing `make test` to fail
* Moved metric to eval folder, changed decorator name, review comment fixes
* Moved unit tests and docs to eval parent folder for consistency
* Fixed sys path for sygra import in unit test cases
* Review comment fixes: added config class for common init, class name decoupling for generalization, pydantic metadata base class for easier re-usability, extensibility, and separation of concerns
* Adding exact-match unit metric and corresponding unit tests
* Review fixes: class naming consistency, method naming per convention, in-class pydantic config validation
* Refactoring as per GitHub code suggestions

---------

Co-authored-by: Vipul Mittal <[email protected]>
1 parent 592c9af commit 4271c36

File tree

19 files changed (+4033 −0 lines changed)


docs/eval/metrics/README.md

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
# Metrics Documentation

This folder contains documentation for Sygra's metrics system, which provides tools for evaluating and measuring model performance.

## Overview

The metrics system in Sygra is designed with a **three-layer architecture**:

### Layer 1: Unit Metrics (Validators)

Individual validation operations that produce binary pass/fail results:

- Compare predicted output vs. expected output for a single step
- Return `UnitMetricResult` objects containing:
  - `correct` (bool): Was the prediction correct?
  - `golden` (dict): Expected/ground-truth data
  - `predicted` (dict): Model's predicted data
  - `metadata` (dict): Additional context

**Example**: Validate that the predicted tool matches the expected event.
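For orientation, a minimal sketch of that result shape. The field names follow the list above; modelling it as a pydantic `BaseModel` is an assumption for illustration, not the shipped class definition.

```python
# Illustrative sketch of the UnitMetricResult shape; the real class in
# sygra.core.eval may differ (the pydantic base is an assumption).
from pydantic import BaseModel

class UnitMetricResult(BaseModel):
    correct: bool        # was the prediction correct?
    golden: dict         # expected / ground-truth data
    predicted: dict      # model's predicted data
    metadata: dict = {}  # additional context
```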
### Layer 2: Aggregator Metrics (Statistical Primitives)

Statistical measures that calculate metrics for a **single class** from multiple unit results:

- **AccuracyMetric**: Overall correctness measurement
- **PrecisionMetric**: Quality of positive predictions for a specific class
- **RecallMetric**: Coverage of actual positives for a specific class
- **F1ScoreMetric**: Balanced precision-recall measure for a specific class

These are **building blocks** that consume `UnitMetricResult` lists.
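To make the single-class framing concrete, here is a hedged sketch of how such a building block can derive precision for one class from a list of unit results. The function is illustrative only; Sygra's `PrecisionMetric` exposes equivalent behavior through `calculate()`.

```python
# Illustrative single-class precision, assuming the UnitMetricResult
# shape sketched above; not Sygra's actual implementation.
def precision_for_class(results, predicted_key: str, positive_class: str) -> float:
    predicted_positive = [
        r for r in results if r.predicted.get(predicted_key) == positive_class
    ]
    if not predicted_positive:
        return 0.0  # guard against division by zero when the class is never predicted
    true_positives = sum(1 for r in predicted_positive if r.correct)
    return true_positives / len(predicted_positive)
```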
### Layer 3: Platform Orchestration (High-Level)

Platform code that:

- Reads the user's metric list from `graph_config.yaml`
- Collects `UnitMetricResult` objects from validators
- Discovers all classes from the validation results
- Iterates over the classes automatically
- Calls aggregator metrics with the appropriate parameters
- Aggregates results across all classes

**The user never specifies classes or keys; the platform handles it.**
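A hypothetical sketch of that per-class loop, purely to show its shape. The function signature, registry mapping, and report layout are placeholders, not the platform's actual identifiers.

```python
# Hypothetical orchestration flow; `registry` stands in for whatever
# lookup the platform actually performs.
def aggregate_all(unit_results, metric_names, registry, predicted_key):
    # registry: maps a metric name (as listed in graph_config.yaml) to a
    # metric class accepting (predicted_key, positive_class)
    classes = {r.golden.get(predicted_key) for r in unit_results}
    classes |= {r.predicted.get(predicted_key) for r in unit_results}

    report = {}
    for name in metric_names:  # e.g. ["precision", "recall"]
        for cls in sorted(c for c in classes if c is not None):
            metric = registry[name](predicted_key=predicted_key, positive_class=cls)
            report[f"{name}/{cls}"] = metric.calculate(unit_results)
    return report
```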
## Available Documentation

### [Aggregator Metrics Reference](aggregator_metrics_summary.md)

Technical reference for metric developers and platform code:

- What each metric calculates
- Required parameters (handled by platform code)
- How to instantiate via registry
- Understanding `UnitMetricResult`
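The reference above documents the actual registry API. As background, registries of this kind commonly follow a decorator pattern along these lines; `register_metric` and `METRIC_REGISTRY` are placeholder names, not Sygra's identifiers.

```python
# Generic decorator-registry pattern; names are placeholders and this is
# not Sygra's actual registry implementation.
METRIC_REGISTRY: dict = {}

def register_metric(name: str):
    def wrap(cls):
        METRIC_REGISTRY[name] = cls  # map the config-facing name to the class
        return cls
    return wrap

@register_metric("accuracy")
class ToyAccuracyMetric:
    def calculate(self, results):
        if not results:
            return {"accuracy": 0.0}
        return {"accuracy": sum(r.correct for r in results) / len(results)}

# Platform code can then instantiate by the name found in graph_config.yaml:
metric = METRIC_REGISTRY["accuracy"]()
```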
## Quick Start: End User Perspective

### User Configuration (Simple!)

```yaml
# graph_config.yaml
graph_properties:
  metrics:
    - accuracy
    - precision
    - recall
    - f1_score
```

The user just lists which metrics they want.
### Basic Usage Example

```python
# 1. Unit Metrics - Validate individual predictions
from sygra.core.eval.metrics.unit_metrics.exact_match import ExactMatchMetric

# Initialize unit metric
validator = ExactMatchMetric(case_sensitive=False, normalize_whitespace=True)

# Evaluate predictions
results = validator.evaluate(
    golden=[{"text": "Hello World"}, {"text": "Foo"}],
    predicted=[{"text": "hello world"}, {"text": "bar"}]
)
# Returns: [UnitMetricResult(correct=True, ...), UnitMetricResult(correct=False, ...)]

# 2. Aggregator Metrics - Calculate statistics from unit results
from sygra.core.eval.metrics.aggregator_metrics.accuracy import AccuracyMetric
from sygra.core.eval.metrics.aggregator_metrics.precision import PrecisionMetric

# Accuracy (no config needed)
accuracy = AccuracyMetric()
accuracy_score = accuracy.calculate(results)
# Returns: {'accuracy': 0.5}

# Precision (requires config)
precision = PrecisionMetric(predicted_key="tool", positive_class="click")
precision_score = precision.calculate(results)
# Returns: {'precision': 0.75}
```
### How Unit and Aggregator Metrics Work Together

```python
from sygra.core.eval.metrics.unit_metrics.exact_match import ExactMatchMetric
from sygra.core.eval.metrics.aggregator_metrics.accuracy import AccuracyMetric
from sygra.core.eval.metrics.aggregator_metrics.precision import PrecisionMetric

# Step 1: Unit metric validates each prediction
validator = ExactMatchMetric(key="tool")
unit_results = validator.evaluate(
    golden=[{"tool": "click"}, {"tool": "type"}, {"tool": "click"}],
    predicted=[{"tool": "click"}, {"tool": "scroll"}, {"tool": "type"}]
)
# unit_results = [
#     UnitMetricResult(correct=True, golden={...}, predicted={...}),
#     UnitMetricResult(correct=False, golden={...}, predicted={...}),
#     UnitMetricResult(correct=False, golden={...}, predicted={...})
# ]

# Step 2: Aggregator metrics compute statistics
accuracy = AccuracyMetric()
print(accuracy.calculate(unit_results))
# Output: {'accuracy': 0.33} (1 out of 3 correct)

precision = PrecisionMetric(predicted_key="tool", positive_class="click")
print(precision.calculate(unit_results))
# Output: {'precision': 1.0} ("click" was predicted once, and that prediction was correct)
```

**Key Point**: Unit metrics produce `UnitMetricResult` objects; aggregator metrics consume them to calculate statistics.
## Design Philosophy

The metrics system follows these principles:

1. **Fail Fast**: Required parameters must be provided at initialization to catch errors early
2. **Explicit Configuration**: No default values for keys or classes, to prevent silent bugs
3. **Task Agnostic**: Works with any task through the flexible `UnitMetricResult` structure
4. **Composability**: Complex metrics reuse simpler ones for consistency (see the sketch after this list, which also illustrates the fail-fast rule)
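An illustrative sketch of principles 1 and 4 together: validation happens at construction time, and F1 is composed from the existing precision and recall metrics. The class body, the `RecallMetric` import path, and the `"recall"` result key are assumptions made by analogy with the examples above, not Sygra's implementation.

```python
# Illustrative only - not Sygra's actual F1ScoreMetric.
from sygra.core.eval.metrics.aggregator_metrics.precision import PrecisionMetric
# Import path assumed by analogy with PrecisionMetric:
from sygra.core.eval.metrics.aggregator_metrics.recall import RecallMetric

class SketchF1Metric:
    def __init__(self, predicted_key: str, positive_class: str):
        if not predicted_key or not positive_class:
            # Fail fast: refuse construction with incomplete configuration
            raise ValueError("predicted_key and positive_class are required")
        # Composability: reuse the simpler metrics rather than re-deriving counts
        self._precision = PrecisionMetric(predicted_key=predicted_key,
                                          positive_class=positive_class)
        self._recall = RecallMetric(predicted_key=predicted_key,
                                    positive_class=positive_class)

    def calculate(self, results):
        p = self._precision.calculate(results)["precision"]
        r = self._recall.calculate(results)["recall"]
        f1 = 0.0 if (p + r) == 0 else 2 * p * r / (p + r)
        return {"f1_score": f1}
```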
## Contributing

When adding new metrics documentation:

1. Follow the existing structure (What, Parameters, Usage, Examples)
2. Include complete, runnable code examples
3. Explain the "why" behind design decisions
4. Cover edge cases and common pitfalls
5. Provide real-world use case scenarios
