
Commit c64cb7e

jwilger and claude authored
feat: implement F-score tracking and analytics (issue #91) (#140)
## Summary

Implements comprehensive F-score tracking and analytics system as requested in issue #91. This PR introduces dedicated F-score calculation and tracking infrastructure for precision/recall metrics targeting manager/executive use cases in MVP Phase 4.

### Key Features

- **Type-Safe F-score Domain Types**: Created validated newtypes using `nutype` for `FScore`, `Precision`, `Recall`, `Beta`, and `ConfidenceLevel` with proper validation ranges
- **F-score Calculation Engine**: Implements harmonic mean calculation with support for configurable beta values (F1, F2, F-beta scores)
- **EventCore Integration**: Multi-stream event sourcing with `RecordModelFScore` and `RecordApplicationFScore` commands
- **Time-Series Tracking**: `FScoreDataPoint` infrastructure for tracking F-score trends over time with confidence intervals
- **Demo Data Generation**: Comprehensive demo data generator for MVP visualization with realistic time-series data
- **Comprehensive Test Coverage**: 14 new passing tests covering domain logic, EventCore integration, and edge cases

### Technical Implementation

- **Domain Types**: All F-score types include validation ensuring values are finite and within valid ranges (0.0-1.0)
- **Event Sourcing**: Commands emit `FScoreCalculated` and `ApplicationFScoreCalculated` events to appropriate streams
- **State Management**: `MetricsState` tracks F-score history for both model versions and applications
- **Demo Data**: Includes provider comparison data, application-specific trends, and performance categorization

### Recent Refactoring (per PR review)

Based on comprehensive type-driven development review:

1. **Eliminated Primitive Obsession**:
   - Extracted stream ID prefix constants to remove string literals
   - Created `PerformanceCategory` domain type replacing raw strings in demo data
   - Replaced magic numbers in tests with test_data constants
2. **Reduced Code Duplication**:
   - Introduced `FScoreCommand` trait to share common logic between `RecordModelFScore` and `RecordApplicationFScore`
   - Consolidated F-score calculation logic
3. **Improved Type Safety**:
   - Fixed string `contains()` usage on `ApplicationId` with proper equality checks
   - Enhanced readability and maintainability throughout metrics module

### Testing

All 289 tests pass including:

- F-score calculation accuracy with harmonic mean formula
- EventCore command execution and event emission
- Domain type validation and edge cases
- Demo data generation with realistic patterns

Closes #91

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
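The calculation engine itself lives in `metrics_commands.rs`, whose diff is not rendered on this page, so the following is only a minimal sketch of the standard F-beta formula, F_β = (1 + β²)·P·R / (β²·P + R), using plain `f64` values instead of the commit's validated newtypes. The function name `f_beta` and the choice to return `None` for invalid input or a zero denominator are assumptions made for illustration, not the repository's API.

```rust
/// Weighted harmonic mean of precision and recall (F-beta score).
///
/// Hypothetical helper for illustration only. Returns `None` when an input
/// is non-finite or out of range, or when precision and recall are both
/// zero (the denominator would be zero).
pub fn f_beta(precision: f64, recall: f64, beta: f64) -> Option<f64> {
    // Precision and recall must be finite and lie in [0.0, 1.0].
    let in_unit = |v: f64| v.is_finite() && (0.0..=1.0).contains(&v);
    if !in_unit(precision) || !in_unit(recall) || !beta.is_finite() || beta <= 0.0 {
        return None;
    }
    let b2 = beta * beta;
    let denom = b2 * precision + recall;
    if denom == 0.0 {
        return None;
    }
    Some((1.0 + b2) * precision * recall / denom)
}
```

With `beta = 1.0` this reduces to the plain harmonic mean (F1); a larger beta such as `2.0` weights recall more heavily than precision.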
1 parent 36e46ec commit c64cb7e

23 files changed, 5288 additions and 22 deletions

dhat-heap.json

Lines changed: 1137 additions & 0 deletions
Large diffs are not rendered by default.

src/domain/commands/metrics_commands.rs

Lines changed: 564 additions & 0 deletions
Large diffs are not rendered by default.

src/domain/commands/mod.rs

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 //! EventCore commands for the Union Square domain

+pub mod metrics_commands;
 pub mod version_commands;

+pub use metrics_commands::{RecordApplicationFScore, RecordModelFScore};
 pub use version_commands::{DeactivateVersion, RecordVersionChange, RecordVersionUsage};

src/domain/commands/version_commands.rs

Lines changed: 5 additions & 4 deletions
@@ -15,6 +15,7 @@ use std::collections::HashMap;
 use crate::domain::{
     events::DomainEvent,
     llm::ModelVersion,
+    metrics::Timestamp,
     session::SessionId,
     types::ChangeReason,
     version::{TrackedVersion, VersionChangeId},
@@ -106,7 +107,7 @@ impl CommandLogic for RecordVersionUsage {
         DomainEvent::VersionFirstSeen {
             model_version: self.model_version.clone(),
             session_id: self.session_id.clone(),
-            first_seen_at: chrono::Utc::now(),
+            first_seen_at: Timestamp::now(),
         }
     );
 }
@@ -119,7 +120,7 @@ impl CommandLogic for RecordVersionUsage {
         DomainEvent::VersionUsageRecorded {
             model_version: self.model_version.clone(),
             session_id: self.session_id.clone(),
-            recorded_at: chrono::Utc::now(),
+            recorded_at: Timestamp::now(),
         }
     );

@@ -191,7 +192,7 @@ impl CommandLogic for RecordVersionChange {
     to_version: self.to_version.clone(),
     change_type,
     reason: self.reason.clone(),
-    changed_at: chrono::Utc::now(),
+    changed_at: Timestamp::now(),
 };

 // Write to both streams
@@ -265,7 +266,7 @@ impl CommandLogic for DeactivateVersion {
         DomainEvent::VersionDeactivated {
             model_version: self.model_version.clone(),
             reason: self.reason.clone(),
-            deactivated_at: chrono::Utc::now(),
+            deactivated_at: Timestamp::now(),
         }
     );
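The hunks above swap direct `chrono::Utc::now()` calls for a domain `Timestamp` type, whose definition (`src/domain/metrics/timestamp.rs`) is not shown on this page. The sketch below illustrates the general newtype pattern using only the standard library: the `SystemTime` backing, the `from_system_time` constructor, and the `age_since` helper are assumptions for illustration, and the real type may well wrap `chrono::DateTime<Utc>` instead.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for the domain `Timestamp` newtype.
/// It is `Copy` so that call sites can dereference it cheaply,
/// as `occurred_at` does with `*started_at`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Timestamp(SystemTime);

impl Timestamp {
    /// Capture the current wall-clock time.
    pub fn now() -> Self {
        Self(SystemTime::now())
    }

    /// Build a timestamp from a known instant (handy in tests).
    pub fn from_system_time(t: SystemTime) -> Self {
        Self(t)
    }

    /// Elapsed time since `earlier`, or `None` if `earlier` is later than `self`.
    pub fn age_since(self, earlier: Timestamp) -> Option<Duration> {
        self.0.duration_since(earlier.0).ok()
    }
}
```

Routing every call site through one constructor keeps the clock source in a single place, which is what makes a change like this mostly mechanical.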

src/domain/events.rs

Lines changed: 41 additions & 18 deletions
@@ -3,11 +3,11 @@
 //! This module defines all domain events that are stored
 //! in the event store using EventCore.

-use chrono::{DateTime, Utc};
 use serde::{Deserialize, Serialize};

 use crate::domain::{
     llm::{ModelVersion, RequestId, ResponseMetadata},
+    metrics::{SampleCount, Timestamp},
     session::{ApplicationId, SessionId, SessionStatus},
     types::{ChangeReason, ErrorMessage, LlmParameters, Prompt, ResponseText, Tag},
     user::{DisplayName, EmailAddress, UserId},
@@ -23,17 +23,17 @@ pub enum DomainEvent {
         session_id: SessionId,
         user_id: UserId,
         application_id: ApplicationId,
-        started_at: DateTime<Utc>,
+        started_at: Timestamp,
     },
     SessionEnded {
         session_id: SessionId,
-        ended_at: DateTime<Utc>,
+        ended_at: Timestamp,
         final_status: SessionStatus,
     },
     SessionTagged {
         session_id: SessionId,
         tag: Tag,
-        tagged_at: DateTime<Utc>,
+        tagged_at: Timestamp,
     },

     // LLM Request Events
@@ -43,33 +43,33 @@ pub enum DomainEvent {
         model_version: ModelVersion,
         prompt: Prompt,
         parameters: LlmParameters,
-        received_at: DateTime<Utc>,
+        received_at: Timestamp,
     },
     LlmRequestStarted {
         request_id: RequestId,
-        started_at: DateTime<Utc>,
+        started_at: Timestamp,
     },
     LlmResponseReceived {
         request_id: RequestId,
         response_text: ResponseText,
         metadata: ResponseMetadata,
-        received_at: DateTime<Utc>,
+        received_at: Timestamp,
     },
     LlmRequestFailed {
         request_id: RequestId,
         error_message: ErrorMessage,
-        failed_at: DateTime<Utc>,
+        failed_at: Timestamp,
     },
     LlmRequestCancelled {
         request_id: RequestId,
-        cancelled_at: DateTime<Utc>,
+        cancelled_at: Timestamp,
     },

     // Version Tracking Events
     VersionFirstSeen {
         model_version: ModelVersion,
         session_id: SessionId,
-        first_seen_at: DateTime<Utc>,
+        first_seen_at: Timestamp,
     },
     VersionChanged {
         change_id: VersionChangeId,
@@ -78,40 +78,61 @@ pub enum DomainEvent {
         to_version: ModelVersion,
         change_type: VersionComparison,
         reason: Option<ChangeReason>,
-        changed_at: DateTime<Utc>,
+        changed_at: Timestamp,
     },
     VersionUsageRecorded {
         model_version: ModelVersion,
         session_id: SessionId,
-        recorded_at: DateTime<Utc>,
+        recorded_at: Timestamp,
     },
     VersionDeactivated {
         model_version: ModelVersion,
         reason: Option<ChangeReason>,
-        deactivated_at: DateTime<Utc>,
+        deactivated_at: Timestamp,
+    },
+
+    // F-score and Metrics Events
+    FScoreCalculated {
+        session_id: SessionId,
+        model_version: ModelVersion,
+        f_score: crate::domain::metrics::FScore,
+        precision: Option<crate::domain::metrics::Precision>,
+        recall: Option<crate::domain::metrics::Recall>,
+        sample_count: SampleCount,
+        calculated_at: Timestamp,
+    },
+    ApplicationFScoreCalculated {
+        session_id: SessionId,
+        application_id: ApplicationId,
+        model_version: ModelVersion,
+        f_score: crate::domain::metrics::FScore,
+        precision: Option<crate::domain::metrics::Precision>,
+        recall: Option<crate::domain::metrics::Recall>,
+        sample_count: SampleCount,
+        calculated_at: Timestamp,
     },

     // User Events
     UserCreated {
         user_id: UserId,
         email: EmailAddress,
         display_name: Option<DisplayName>,
-        created_at: DateTime<Utc>,
+        created_at: Timestamp,
     },
     UserActivated {
         user_id: UserId,
-        activated_at: DateTime<Utc>,
+        activated_at: Timestamp,
     },
     UserDeactivated {
         user_id: UserId,
         reason: Option<ChangeReason>,
-        deactivated_at: DateTime<Utc>,
+        deactivated_at: Timestamp,
     },
 }

 impl DomainEvent {
     /// Get the timestamp of when this event occurred
-    pub fn occurred_at(&self) -> DateTime<Utc> {
+    pub fn occurred_at(&self) -> Timestamp {
         match self {
             DomainEvent::SessionStarted { started_at, .. } => *started_at,
             DomainEvent::SessionEnded { ended_at, .. } => *ended_at,
@@ -125,6 +146,8 @@ impl DomainEvent {
             DomainEvent::VersionChanged { changed_at, .. } => *changed_at,
             DomainEvent::VersionUsageRecorded { recorded_at, .. } => *recorded_at,
             DomainEvent::VersionDeactivated { deactivated_at, .. } => *deactivated_at,
+            DomainEvent::FScoreCalculated { calculated_at, .. } => *calculated_at,
+            DomainEvent::ApplicationFScoreCalculated { calculated_at, .. } => *calculated_at,
             DomainEvent::UserCreated { created_at, .. } => *created_at,
             DomainEvent::UserActivated { activated_at, .. } => *activated_at,
             DomainEvent::UserDeactivated { deactivated_at, .. } => *deactivated_at,
@@ -146,7 +169,7 @@ mod tests {

     #[test]
     fn test_event_timestamp_extraction() {
-        let now = Utc::now();
+        let now = Timestamp::now();
         let session_id = SessionId::generate();
         let user_id = UserId::generate();
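The two new `occurred_at` arms follow from Rust's exhaustive `match`: a `match` without a wildcard arm stops compiling as soon as a variant is added to the enum, so each new event has to say which field carries its timestamp. A reduced sketch of that pattern follows, with a hypothetical two-variant `Event` enum and a `u64`-backed `Timestamp` standing in for the real domain types.

```rust
/// Hypothetical stand-in: seconds since some epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Timestamp(pub u64);

/// Reduced, illustrative version of an event enum.
pub enum Event {
    VersionDeactivated { deactivated_at: Timestamp },
    FScoreCalculated { f_score: f64, calculated_at: Timestamp },
}

impl Event {
    /// Every variant names its own timestamp field. Because the match has
    /// no wildcard arm, adding a variant without a matching arm here is a
    /// compile error rather than a silent default.
    pub fn occurred_at(&self) -> Timestamp {
        match self {
            Event::VersionDeactivated { deactivated_at } => *deactivated_at,
            Event::FScoreCalculated { calculated_at, .. } => *calculated_at,
        }
    }
}
```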

src/domain/metrics.rs

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+//! F-score tracking and analytics domain types
+//!
+//! This module provides type-safe F-score calculation and tracking functionality
+//! following type-driven development principles for precision/recall metrics.
+
+pub mod constants;
+pub mod core_metrics;
+pub mod counts;
+pub mod data_point;
+pub mod demo_data;
+pub mod demo_types;
+pub mod durations;
+pub mod errors;
+pub mod performance;
+pub mod sample_count;
+pub mod time_period;
+pub mod timestamp;
+pub mod trend;
+pub mod ui_types;
+pub mod values;
+
+// Re-export commonly used types
+pub use core_metrics::{Beta, ConfidenceLevel, FScore, Precision, Recall};
+pub use counts::{ApplicationCount, DataPointCount, ModelCount};
+pub use data_point::FScoreDataPoint;
+pub use errors::MetricsError;
+pub use performance::{PerformanceAssessment, PerformanceLevel, QualityRating};
+pub use sample_count::{SampleConfidence, SampleCount};
+pub use time_period::{DaysBack, PointsPerDay, TimePeriod};
+pub use timestamp::{Timestamp, TimestampAge};
+pub use trend::{TrendAnalysis, TrendDirection, TrendMagnitude};
+pub use values::{MetricValue, PercentageChange, StabilityThreshold};
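The commit message says the re-exported `FScore`, `Precision`, and `Recall` types are `nutype`-validated newtypes that accept only finite values in the 0.0-1.0 range; their definitions in `core_metrics.rs` are not shown on this page. As a rough idea of what such a type enforces, here is a hand-written equivalent that does not use `nutype`. The error enum, its variant names, and the `try_new`/`into_inner` method names are all assumptions chosen for illustration.

```rust
/// Hypothetical validation error for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FScoreError {
    NotFinite,
    OutOfRange,
}

/// Hand-rolled stand-in for a validated F-score newtype.
/// The inner field is private, so the only way to obtain a value
/// from outside the module is through the checked constructor.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct FScore(f64);

impl FScore {
    /// Accept only finite values in the closed range [0.0, 1.0].
    pub fn try_new(value: f64) -> Result<Self, FScoreError> {
        if !value.is_finite() {
            return Err(FScoreError::NotFinite);
        }
        if !(0.0..=1.0).contains(&value) {
            return Err(FScoreError::OutOfRange);
        }
        Ok(Self(value))
    }

    /// Unwrap to the raw value, e.g. for arithmetic or serialization.
    pub fn into_inner(self) -> f64 {
        self.0
    }
}
```

Because invalid values are rejected at construction, code that receives an `FScore` (such as the `FScoreCalculated` event) does not need to re-check the range.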
