Feat: Add comprehensive LLM evaluation & reporting pipeline #122

Themba-Gqaza · 2025-12-04T09:27:04Z

Description

This PR introduces a modular Jupyter notebook workflow designed to evaluate production chatbot logs. It automates the assessment of model performance against human labels and reference scripts, providing granular insights into Extraction Accuracy, Question Consistency, and User Response Appropriateness.

Key Features

Automated Evaluation: Implements LLM-as-a-Judge (using GPT-4o) to score semantic consistency and response validity alongside deterministic extraction accuracy.
Demographic Enrichment: Integrates user strata data (Age, Gestation) to enable bias detection and detailed performance segmentation.
Visual Reporting: Generates executive summary tables, multi-metric bar charts, and error distribution plots (using Matplotlib) for global and flow-specific insights.
Sequence Analysis: Adds a dedicated verification step that checks whether question ordering in the Onboarding flow is actually randomized.
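
To make the deterministic parts of the features above concrete, here is a minimal sketch of two of them: exact-match extraction accuracy segmented by a demographic stratum, and a first-position count for the ordering check. The field names (`age_group`, `extracted`, `human_label`), the sample rows, and both helper functions are hypothetical and are not taken from the notebook.

```python
from collections import Counter, defaultdict


def accuracy_by_stratum(records, stratum_key):
    """Exact-match extraction accuracy, grouped by a demographic field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = rec[stratum_key]
        totals[group] += 1
        # Deterministic check: the extracted value must equal the human label.
        if rec["extracted"] == rec["human_label"]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


def first_question_counts(sessions):
    """Count how often each question appears first across sessions."""
    return Counter(order[0] for order in sessions if order)


# Hypothetical sample rows; the schema is an assumption for illustration.
records = [
    {"age_group": "18-24", "extracted": "yes", "human_label": "yes"},
    {"age_group": "18-24", "extracted": "no", "human_label": "yes"},
    {"age_group": "25-34", "extracted": "3", "human_label": "3"},
]
per_group = accuracy_by_stratum(records, "age_group")

# Each inner list is the order in which one session saw the questions.
sessions = [["q1", "q2", "q3"], ["q2", "q1", "q3"], ["q1", "q3", "q2"]]
first_counts = first_question_counts(sessions)
```

A heavily skewed first-position count would suggest the ordering is not randomized. A roughly even count is consistent with randomization but does not prove it, so on real data a statistical test over a larger sample would be the safer check.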

Artifacts

evaluation_pipeline.ipynb: The main driver notebook containing the 4-step analysis pipeline.
detailed_metrics_export.csv: A generated granular report for external BI tools.

Next Steps

Run the notebook with RERUN_EVALUATION = True once to generate the baseline report.
Review the "Fail Log" section to prioritize fixes for the lowest-performing flows.
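
For readers unfamiliar with this kind of flag, the sketch below shows one common way a `RERUN_EVALUATION` switch can gate an expensive evaluation behind a cached CSV. Only the flag name comes from this PR; its exact semantics in the notebook, the `load_or_run` helper, the cache path, and the stand-in evaluation function are all assumptions.

```python
import csv
import os
import tempfile

RERUN_EVALUATION = True  # flag name from the PR; behaviour here is assumed


def load_or_run(cache_path, run_evaluation, rerun):
    """Return cached rows unless a rerun is requested or no cache exists."""
    if rerun or not os.path.exists(cache_path):
        rows = run_evaluation()  # assumed to return at least one dict row
        with open(cache_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        return rows
    with open(cache_path, newline="") as fh:
        # Values read back from CSV are strings.
        return list(csv.DictReader(fh))


def fake_evaluation():
    """Stand-in for the real (expensive) LLM-as-a-Judge run."""
    return [{"flow": "Onboarding", "score": "0.9"}]


cache = os.path.join(tempfile.mkdtemp(), "baseline.csv")
first = load_or_run(cache, fake_evaluation, rerun=RERUN_EVALUATION)
second = load_or_run(cache, fake_evaluation, rerun=False)
```

With this pattern, the first run writes the baseline and later runs with the flag set to `False` reuse it instead of calling the judge model again.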

PaulEloffPraekelt

Just noting the concern raised before, that we shouldn't have any users' MSISDNs in our GitHub. Other than that, I think things look good!
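
One possible way to act on the MSISDN concern, sketched under assumptions: replace each MSISDN with a keyed hash before any data or notebook output is committed. The `pseudonymize_msisdn` helper, the key handling, and the sample number are hypothetical; nothing here describes what the notebook currently does.

```python
import hashlib
import hmac


def pseudonymize_msisdn(msisdn, secret_key):
    """Map an MSISDN to a keyed SHA-256 digest so rows stay joinable."""
    # Normalize formatting so "+27 82 ..." and "2782..." hash identically.
    digits = "".join(ch for ch in msisdn if ch.isdigit())
    return hmac.new(secret_key, digits.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustrative key only; a real key would live outside the repository.
key = b"example-key-do-not-commit"
a = pseudonymize_msisdn("+27 82 000 0000", key)
b = pseudonymize_msisdn("27820000000", key)
```

Cell outputs stored in the notebook would also need clearing before commit, since they can contain raw identifiers even when the source data is kept out of the repository.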

added eval scripts

8a83904

Themba-Gqaza requested a review from PaulEloffPraekelt December 4, 2025 09:27

Themba Gqaza added 4 commits December 4, 2025 15:44

updated sequencing

578200c

updated evals

0c52761

improved visualisation quality

dc9cf49

updated notebook

63e24f6

PaulEloffPraekelt reviewed Dec 18, 2025


2 participants
