Gen AI Evaluation Result #2563
Conversation
Are there publicly available prototypes of the code emitting evaluation results? Please link them in the PR description.
One issue I have with creating a separate span for tracking each evaluation score is that it makes it harder (at least with the way we index spans) to write at least a couple of classes of queries for specific cases:

It would work much better for us if there were a way that we could, while complying with the semantic conventions, put all of this information as attributes of a single span so that we can query it at once. I guess we can do that in addition to complying with the semantic convention; this just sticks out to me as an unfortunate aspect of this design.
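For illustration only, here is a minimal Python sketch (using the opentelemetry-api package) of the two shapes being contrasted above: one span per evaluation score versus a single span carrying every score as attributes. The span and attribute names are placeholders chosen for this example, not names taken from the semantic conventions.

```python
from opentelemetry import trace

tracer = trace.get_tracer("gen-ai-eval-demo")

scores = {"relevance": 0.82, "coherence": 0.91, "groundedness": 0.77}

# Shape 1: a separate span for each evaluation score.
for name, score in scores.items():
    with tracer.start_as_current_span("gen_ai.evaluation") as span:
        span.set_attribute("gen_ai.evaluation.name", name)    # placeholder keys
        span.set_attribute("gen_ai.evaluation.score", score)

# Shape 2: a single span carrying every score as attributes,
# so all results for one evaluation can be queried at once.
with tracer.start_as_current_span("gen_ai.evaluation") as span:
    for name, score in scores.items():
        span.set_attribute(f"gen_ai.evaluation.{name}.score", score)
```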
    
          
Thank you for your detailed feedback. I agree that having all relevant metrics as attributes on a single span would simplify querying and analysis; however, I am not fully sure this approach is flexible enough to cover different scenarios, especially asynchronous evaluations. To clarify with a concrete scenario:

If I need to retroactively compute 2 new metrics on existing traces that already contain the original 3 metrics, would the recommended approach be to generate new spans for these additional metrics? Or is there a preferred way to update the original span with the new evaluation results while still adhering to the semantic conventions?

A similar situation could arise if some evaluation metrics are computed asynchronously, or by a downstream service, at different times. In such cases, would each metric (or set of metrics computed together) necessarily require a separate span, or is there flexibility to consolidate them as attributes on the original span?

Thanks again for your insights. I'm keen to understand the best practices for handling the evolving needs of asynchronous evaluation workflows in line with the semantic conventions.
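One possible way to handle the asynchronous case without mutating an already-exported span is sketched below: the late-arriving metric is recorded on a new span that carries a span link back to the original trace. This is only an illustration; the trace/span IDs and attribute keys are made-up placeholders, and the semantic conventions may prescribe something different.

```python
from opentelemetry import trace
from opentelemetry.trace import Link, SpanContext, TraceFlags

tracer = trace.get_tracer("async-eval-demo")

# Rebuild the original span's context from IDs stored alongside the earlier
# evaluation results (the IDs below are placeholders).
original = SpanContext(
    trace_id=0x4BF92F3577B34DA6A3CE929D0E0E4736,
    span_id=0x00F067AA0BA902B7,
    is_remote=True,
    trace_flags=TraceFlags(TraceFlags.SAMPLED),
)

# Record the late-arriving metric on a new span linked to the original trace,
# rather than updating any span that has already been exported.
with tracer.start_as_current_span("gen_ai.evaluation", links=[Link(original)]) as span:
    span.set_attribute("gen_ai.evaluation.name", "fluency")   # placeholder keys
    span.set_attribute("gen_ai.evaluation.score", 0.68)
```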
    
          
For the most part, I typically think of traces as static/immutable after their root span closes (I know that's not technically the case, and there are reasonable scenarios where that is explicitly avoided, but still). I think we should stick to that mental model here: if you want to "mutate" an evaluation run/experiment/whatever-you-want-to-call-it, my personal feeling is that it should happen in some application-layer logic, and OTel should just keep a static record of what happened when it happened.

For example, you could create a new evaluation run where you copy the old run's outputs for the metrics you've already computed and compute new ones where you want. I would personally have no problem generating a new trace for the updated evaluation results (even if that meant copying execution data from an older trace or otherwise referencing it via span links or something); I don't need to extend an old trace. Maybe others feel differently, but just sharing my opinion.

I'll also note that the idea of adding additional metrics feels somewhat awkward because, while I can always add new metrics to an old trace, what about redefining an existing metric? Presumably that opens more of a can of worms about what it means to "overwrite" old spans. I personally feel it's better to avoid the whole issue by not encouraging this pattern. Just my 2c.
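As a rough sketch of the "new evaluation run" idea described above (copying forward previously computed scores and referencing the old run via a span link), something like the following could work. All span names, attribute keys, and IDs here are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.trace import Link, SpanContext, TraceFlags

tracer = trace.get_tracer("eval-run-demo")

# Context of the previous run's root span, reconstructed from stored IDs
# (placeholder values).
previous_run = SpanContext(
    trace_id=0x0AF7651916CD43DD8448EB211C80319C,
    span_id=0xB7AD6B7169203331,
    is_remote=True,
    trace_flags=TraceFlags(TraceFlags.SAMPLED),
)

copied = {"relevance": 0.82, "coherence": 0.91}   # results carried over from the old run
new = {"groundedness": 0.77, "fluency": 0.68}     # newly computed metrics

# A fresh, self-contained trace for the updated run; the old run is referenced
# via a span link instead of being extended or overwritten.
with tracer.start_as_current_span("evaluation_run", links=[Link(previous_run)]):
    for name, score in {**copied, **new}.items():
        with tracer.start_as_current_span("gen_ai.evaluation") as span:
            span.set_attribute("gen_ai.evaluation.name", name)
            span.set_attribute("gen_ai.evaluation.score", score)
```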
    
          
The proposal in this PR is consistent with your feedback on traces as static/immutable. The proposal says to keep the

Appreciate you bringing up these scenarios and use cases. It will help shape this work. :)
    
| 
Looks good to me overall.
    
Thank you all for the valuable feedback and thoughtful discussion that helped bring this PR to a merge-ready state.
    
Fixes #
Changes
This PR proposes a way to capture Evaluation Results for GenAI Applications.
Prototype: https://github.com/singankit/evaluation_results
Note: if the PR touches an area that is not listed in the existing areas, or the area does not have sufficient domain expert coverage, the PR might be tagged as experts needed and move slowly until experts are identified.
Merge requirement checklist
[chore]