
@Themba-Gqaza (Contributor)

Description

This PR introduces a modular Jupyter notebook workflow designed to evaluate production chatbot logs. It automates the assessment of model performance against human labels and reference scripts, providing granular insights into Extraction Accuracy, Question Consistency, and User Response Appropriateness.

Key Features

  • Automated Evaluation: Implements LLM-as-a-Judge (using GPT-4o) to score semantic consistency and response validity alongside deterministic extraction accuracy (a rough sketch of the judge call follows this list).

  • Demographic Enrichment: Integrates user strata data (Age, Gestation) to enable bias detection and detailed performance segmentation.

  • Visual Reporting: Generates executive summary tables, multi-metric bar charts, and error distribution plots (using Matplotlib) for global and flow-specific insights.

  • Sequence Analysis: Adds a dedicated verification step to check that question ordering in the Onboarding flow is actually randomized (see the ordering-check sketch after this list).
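
As a rough illustration of the LLM-as-a-Judge step: the notebook's actual prompts and parsing live in `evaluation_pipeline.ipynb`, so the prompt wording, binary score scale, and `judge_consistency` helper name below are assumptions, not the real implementation.

```python
# Hypothetical sketch of the LLM-as-a-Judge scoring call; prompt text, score
# scale, and function name are illustrative, not the notebook's actual code.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_consistency(reference_question: str, bot_question: str) -> dict:
    """Ask GPT-4o whether the bot's question is semantically consistent
    with the reference script; return a score and a short rationale."""
    prompt = (
        "You are evaluating a chatbot against its reference script.\n"
        f"Reference question: {reference_question}\n"
        f"Bot question: {bot_question}\n"
        'Reply as JSON: {"score": 0 or 1, "reason": "<one sentence>"}, '
        "where 1 means both questions ask for the same information."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```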
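
For the ordering check, one simple approach (among several possible) is a chi-squared test on which question each session asks first; the DataFrame columns below are assumptions rather than the notebook's actual schema.

```python
# Hypothetical sketch of the Onboarding ordering check; the columns
# "session_id", "turn_index", and "question_id" are assumed names.
import pandas as pd
from scipy.stats import chisquare

def first_question_uniformity(onboarding_logs: pd.DataFrame) -> float:
    """Return the chi-squared p-value for the hypothesis that every
    Onboarding question is equally likely to be asked first."""
    first_questions = (
        onboarding_logs.sort_values("turn_index")
        .groupby("session_id")
        .first()["question_id"]
    )
    observed = first_questions.value_counts().sort_index()
    # A high p-value is consistent with randomized ordering; a low one
    # suggests a fixed or biased sequence.
    return chisquare(observed).pvalue
```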

Artifacts

  • evaluation_pipeline.ipynb: The main driver notebook containing the 4-step analysis pipeline.

  • detailed_metrics_export.csv: A generated granular report for external BI tools.

Next Steps

  • Run the notebook once with RERUN_EVALUATION = True to generate the baseline report (a sketch of how such a flag is commonly wired up follows below).
  • Review the "Fail Log" section to prioritize fixes for the lowest-performing flows.
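
For context, a flag like RERUN_EVALUATION typically gates whether the expensive judge calls are repeated or a cached export is reused. This is only a guess at the pattern: the cache path and the `run_full_evaluation` placeholder are assumptions, not the notebook's actual identifiers.

```python
# Illustrative only: one common way a RERUN_EVALUATION flag is wired up.
from pathlib import Path
import pandas as pd

RERUN_EVALUATION = True  # set to True once to (re)generate the baseline report
CACHE_PATH = Path("detailed_metrics_export.csv")  # reusing the exported report as a cache is an assumption

def run_full_evaluation() -> pd.DataFrame:
    # Placeholder: in the notebook this would score every log row
    # (deterministic extraction checks plus GPT-4o judge calls).
    return pd.DataFrame({"flow": [], "metric": [], "score": []})

if RERUN_EVALUATION or not CACHE_PATH.exists():
    results = run_full_evaluation()           # expensive: calls the judge model
    results.to_csv(CACHE_PATH, index=False)   # persist for BI tools and later runs
else:
    results = pd.read_csv(CACHE_PATH)         # cheap: reuse the last export
```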

@PaulEloffPraekelt (Collaborator) left a comment

Just noting the concern raised before, that we shouldn't have any users' MSISDNs in our GitHub. Other than that, I think things look good!
