AI Response Evaluation Automation Benchmark Website

AI Chatbot Evaluation Tool with Automated Data Collection

Evaluate chatbot responses for ethical alignment, inclusivity, complexity, and sentiment - with automatic CSV database tracking.


🚀 Quick Start

python main.py

Then open: http://localhost:8080 🎯

That's it! One command gives you:

  • ✅ Interactive web interface
  • ✅ REST API endpoints
  • ✅ Automatic CSV database
  • ✅ Real-time statistics
  • ✅ Data export capability

📁 Project Structure

web_automation_CSSW/
├── main.py                 🚀 Main entry point (start here!)
├── app.py                  🌐 Flask application (web + API + database)
├── api/                    🔌 Evaluation engine
│   ├── api_server.py       → NLP evaluation functions
│   └── requirements.txt    → Python dependencies
├── data/                   💾 Database (auto-created)
│   └── evaluations.csv     → All evaluation records
├── venv/                   🐍 Virtual environment
├── logs/                   📝 Application logs
└── README.md               📚 This file

Clean & Simple: Just 7 top-level items, no complex folder hierarchies!


🎯 Purpose & Overview

What Does This Tool Do?

This application evaluates AI chatbot responses across four critical dimensions:

  1. Ethical Alignment (0-1) - Professional appropriateness and ethical considerations
  2. Inclusivity (0-1) - LGBTQ+ support, cultural sensitivity, and inclusive language
  3. Complexity (0-100) - Text readability using the Flesch-Kincaid reading-ease score (higher = simpler text)
  4. Sentiment (0-1) - Emotional alignment between human input and chatbot response

Why Is This Important?

  • 🏥 Mental Health Apps - Ensure responses are appropriate and supportive
  • 🤖 AI Development - Quality assurance for chatbot systems
  • 📊 Research - Analyze and compare AI model performance
  • 🎓 Education - Teach responsible AI development

Key Features

  • 100% Pure Python - No PHP, Drupal, or Composer complexity
  • Automatic Data Collection - Every evaluation saved to CSV
  • Web Interface + API - Use it interactively or programmatically
  • Real-time Statistics - Track averages and usage patterns
  • Export Capability - Download your data anytime
  • Simple Deployment - One command to start everything

🔄 Workflow Diagram

┌──────────────────────────────────────────────────────────────┐
│                    USER INTERACTION                           │
│                                                                │
│  Browser (http://localhost:8080)  OR  API Client (curl/code) │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                   FLASK APPLICATION (app.py)                  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │               WEB INTERFACE (Routes)                    │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │  GET  /           → Home page with form          │  │  │
│  │  │  GET  /health     → Health check                 │  │  │
│  │  │  POST /api/evaluate → Process evaluation         │  │  │
│  │  │  GET  /api/history  → Get all records           │  │  │
│  │  │  GET  /api/stats    → Get statistics            │  │  │
│  │  │  GET  /api/download → Download CSV              │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │            DATA COLLECTION (save_to_csv)               │  │
│  │  • Captures every evaluation                           │  │
│  │  • Timestamps each entry                               │  │
│  │  • Stores all metrics                                  │  │
│  │  • Appends to data/evaluations.csv                     │  │
│  └────────────────────────────────────────────────────────┘  │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│              EVALUATION ENGINE (api/api_server.py)            │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  evaluate_ethical_alignment()                          │  │
│  │  → Checks professional appropriateness                 │  │
│  │  → Uses keyword matching for ethical concerns          │  │
│  │  → Returns: 0.0 (problematic) to 1.0 (appropriate)    │  │
│  └────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  evaluate_inclusivity_score()                          │  │
│  │  → Detects LGBTQ+ terminology                          │  │
│  │  → Checks cultural sensitivity                         │  │
│  │  → Returns: 0.0 (exclusive) to 1.0 (inclusive)        │  │
│  └────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  evaluate_complexity_score()                           │  │
│  │  → Flesch-Kincaid readability analysis                 │  │
│  │  → Sentence structure analysis                         │  │
│  │  → Returns: 0 (very complex) to 100 (simple)          │  │
│  └────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  evaluate_sentiment_distribution()                     │  │
│  │  → Compares human and chatbot text                     │  │
│  │  → Analyzes emotional alignment                        │  │
│  │  → Returns: 0.0 (mismatched) to 1.0 (aligned)         │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                                │
│  Technologies: NLTK, NumPy, scikit-learn, TF-IDF             │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                    RESULTS & DATABASE                         │
│                                                                │
│  ┌────────────────────┐         ┌───────────────────────┐    │
│  │  JSON Response     │         │  CSV Database         │    │
│  │  to User/API       │         │  (data/evaluations.csv│    │
│  │                    │         │                       │    │
│  │  {                 │         │  timestamp,chatbot... │    │
│  │   "ethical": 0.8,  │         │  2025-10-22T19:00:..  │    │
│  │   "inclusivity":.. │         │  2025-10-22T19:00:..  │    │
│  │  }                 │         │  ...                  │    │
│  └────────────────────┘         └───────────────────────┘    │
└──────────────────────────────────────────────────────────────┘

💾 Database Features

Automatic Data Collection

Every evaluation is automatically saved to data/evaluations.csv with:

  • Timestamp - Exact date/time of evaluation
  • 💬 Chatbot Text - The response being evaluated
  • 👤 Human Text - Optional user input (for sentiment analysis)
  • 🔧 Formula Used - Which metric(s) were calculated
  • 📊 All Scores - Ethical, inclusivity, complexity, sentiment values

CSV Structure

timestamp,chatbot_text,human_text,formula,ethical_alignment,inclusivity,complexity,sentiment
2025-10-22T19:00:11,I will help you.,,ethical_alignment,0.61,,,
2025-10-22T19:00:27,We welcome all backgrounds.,,inclusivity,,0.0,,
2025-10-22T19:01:45,I understand.,I need help.,all,1.0,0.0,80.31,0.03
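Rows in this layout could be written with Python's standard csv module. The following is a minimal sketch, not the project's actual code: the name save_to_csv comes from the workflow diagram, but the signature, the path parameter, and the header-on-first-write behaviour are assumptions made for illustration.

```python
import csv
import os
from datetime import datetime

# Column order taken from the CSV structure shown above.
FIELDS = ["timestamp", "chatbot_text", "human_text", "formula",
          "ethical_alignment", "inclusivity", "complexity", "sentiment"]


def save_to_csv(path, chatbot_text, formula, scores, human_text=""):
    """Append one evaluation row (hypothetical signature, for illustration)."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Write the header only when the file is new or empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    row = {
        "timestamp": datetime.now().isoformat(),
        "chatbot_text": chatbot_text,
        "human_text": human_text,
        "formula": formula,
    }
    # Metrics that were not computed stay blank, as in the sample rows.
    for metric in FIELDS[4:]:
        row[metric] = scores.get(metric, "")
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Using csv.DictWriter rather than manual string joining means chatbot text that contains commas or quotes is escaped correctly.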

Data Management

  1. View History - See your last 10 evaluations in the web UI
  2. Statistics - Get averages, totals, and formula usage
  3. Export - Download the complete CSV anytime
  4. No Cleanup Needed - Data persists automatically

🌐 Using the Application

Method 1: Web Interface (Easiest)

  1. Start the server:

    python main.py

  2. Open your browser:

    http://localhost:8080

  3. Evaluate chatbot responses:

    • Select evaluation type (All Metrics, Ethical Alignment, etc.)
    • Enter chatbot text
    • Optionally add human text (for sentiment)
    • Click "Evaluate"
    • View results instantly!

  4. Access your data:

    • Click "View History" - See recent evaluations
    • Click "View Statistics" - See averages and totals
    • Click "Download CSV" - Export all data

Method 2: API Usage (Programmatic)

Health Check

curl http://localhost:8080/health

Response:

{"status": "healthy"}

Evaluate Single Metric

curl -X POST http://localhost:8080/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "formula": "ethical_alignment",
    "chatbot_text": "I understand and support you."
  }'

Response:

{"ethical_alignment": 1.0}

Evaluate All Metrics

curl -X POST http://localhost:8080/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "formula": "all",
    "chatbot_text": "I understand your concerns.",
    "human_text": "I am feeling anxious."
  }'

Response:

{
  "ethical_alignment": 1.0,
  "inclusivity": 0.0,
  "complexity": 81.86,
  "sentiment": 0.03
}
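The same call can be made from Python with only the standard library. This is a sketch under assumptions: the helper names build_evaluate_request and evaluate are made up for this example, and the server is assumed to be running at the Quick Start address.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # address from the Quick Start section


def build_evaluate_request(chatbot_text, formula="all", human_text=None):
    """Build (but do not send) a POST request for /api/evaluate."""
    payload = {"formula": formula, "chatbot_text": chatbot_text}
    if human_text is not None:
        payload["human_text"] = human_text
    return urllib.request.Request(
        BASE_URL + "/api/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(chatbot_text, formula="all", human_text=None, timeout=10):
    """Send the request and decode the JSON scores; needs a running server."""
    request = build_evaluate_request(chatbot_text, formula, human_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, evaluate("I understand your concerns.", human_text="I am feeling anxious.") should return a dictionary shaped like the JSON response above.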

Get Evaluation History

curl http://localhost:8080/api/history

Response:

{
  "count": 15,
  "data": [
    {
      "timestamp": "2025-10-22T19:00:11.986236",
      "chatbot_text": "I will help you with that.",
      "human_text": "",
      "formula": "ethical_alignment",
      "ethical_alignment": "0.61",
      "inclusivity": "",
      "complexity": "",
      "sentiment": ""
    }
  ]
}

Get Statistics

curl http://localhost:8080/api/stats

Response:

{
  "total_evaluations": 15,
  "formulas_used": {
    "all": 5,
    "ethical_alignment": 7,
    "inclusivity": 2,
    "sentiment": 1
  },
  "averages": {
    "ethical_alignment": 0.82,
    "inclusivity": 0.15,
    "complexity": 78.45,
    "sentiment": 0.05
  }
}
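A response of this shape can be derived directly from the CSV file. The sketch below is one possible approach, not the project's implementation; compute_stats is a hypothetical name, and rounding averages to two decimals is an assumption based on the sample output.

```python
import csv
from collections import Counter

METRICS = ["ethical_alignment", "inclusivity", "complexity", "sentiment"]


def compute_stats(path):
    """Summarise a CSV file laid out like data/evaluations.csv."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    averages = {}
    for metric in METRICS:
        # Blank cells mean the metric was not computed for that row.
        values = [float(row[metric]) for row in rows if row.get(metric)]
        if values:
            averages[metric] = round(sum(values) / len(values), 2)
    return {
        "total_evaluations": len(rows),
        "formulas_used": dict(Counter(row["formula"] for row in rows)),
        "averages": averages,
    }
```

Skipping blank cells matters here: a single-metric evaluation leaves the other three columns empty, and counting those as zero would drag every average down.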

Download CSV Database

curl http://localhost:8080/api/download -o my_evaluations.csv

📊 API Reference

Endpoints

| Method | Endpoint | Description | Auth Required |
|--------|----------|-------------|---------------|
| GET | / | Web interface home page | No |
| GET | /health | Health check | No |
| POST | /api/evaluate | Evaluate chatbot text | No |
| GET | /api/history | Get all evaluation records | No |
| GET | /api/stats | Get database statistics | No |
| GET | /api/download | Download CSV database | No |

POST /api/evaluate

Request Body:

{
  "formula": "all",
  "chatbot_text": "Your chatbot response here",
  "human_text": "Optional (required only for sentiment)"
}

Formula Options:

  • ethical_alignment - Professional ethics score only
  • inclusivity - Inclusivity score only
  • complexity - Readability score only
  • sentiment - Emotional match only (requires human_text)
  • all - All metrics at once
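One way to route a formula name to the matching evaluator is a small dispatch function. This is only a sketch: run_formula is a hypothetical name, the evaluators are passed in as a dictionary rather than imported from api/api_server.py, the (human_text, chatbot_text) argument order for sentiment is assumed, and skipping sentiment under "all" when no human text is given is also an assumption.

```python
def run_formula(formula, chatbot_text, human_text, evaluators):
    """Return {metric: score} for one formula name or for "all".

    `evaluators` maps metric names to callables. Sentiment is assumed to
    take (human_text, chatbot_text); the others take chatbot_text only.
    """
    names = list(evaluators) if formula == "all" else [formula]
    results = {}
    for name in names:
        if name not in evaluators:
            raise ValueError(f"unknown formula: {name}")
        if name == "sentiment":
            if not human_text:
                if formula == "all":
                    continue  # assumption: skip sentiment when no human text
                raise ValueError("sentiment requires human_text")
            results[name] = evaluators[name](human_text, chatbot_text)
        else:
            results[name] = evaluators[name](chatbot_text)
    return results
```

Passing the evaluators in as a dictionary keeps the dispatch logic testable with stand-in functions, independent of the NLP dependencies.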

Response Codes:

  • 200 - Success
  • 400 - Bad request (missing required fields)
  • 500 - Server error
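The 400 case can be illustrated with a small validation helper. This is a hedged sketch: validate_payload and its error messages are hypothetical, and treating a missing or unknown formula as a bad request is an assumption, since the server's exact rules are not documented here.

```python
VALID_FORMULAS = {"ethical_alignment", "inclusivity", "complexity", "sentiment", "all"}


def validate_payload(payload):
    """Return (status_code, error_message); error_message is None on success."""
    if not isinstance(payload, dict):
        return 400, "request body must be a JSON object"
    if not payload.get("chatbot_text"):
        return 400, "chatbot_text is required"
    formula = payload.get("formula")
    if formula not in VALID_FORMULAS:
        return 400, "formula must be one of: " + ", ".join(sorted(VALID_FORMULAS))
    if formula == "sentiment" and not payload.get("human_text"):
        return 400, "human_text is required for sentiment"
    return 200, None
```

A client can run the same check before sending a request to catch missing fields without a round trip.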

⚙️ Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)

First-Time Setup

# 1. Clone or download the project
cd web_automation_CSSW

# 2. Create virtual environment
python3 -m venv venv

# 3. Activate virtual environment
source venv/bin/activate  # Mac/Linux
# OR
venv\Scripts\activate     # Windows

# 4. Install dependencies
pip install -r api/requirements.txt

# 5. Run the application
python main.py

Subsequent Runs

# Just run (virtual environment auto-activates if needed)
python main.py

Dependencies (Installed via requirements.txt)

From api/requirements.txt:

  • Flask - Web framework
  • NLTK - Natural language processing
  • NumPy - Numerical computing
  • scikit-learn - Machine learning utilities
  • transformers (optional) - Advanced NLP
  • torch (optional) - Deep learning

📈 Evaluation Metrics Explained

1. Ethical Alignment (0-1 scale)

Purpose: Ensures chatbot responses are professionally appropriate and ethically sound.

How it works:

  • Scans for problematic keywords and phrases
  • Checks for harmful advice or inappropriate content
  • Returns a score between 0.0 (problematic) and 1.0 (appropriate); intermediate values such as 0.61 are possible
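Keyword matching of this kind can be sketched in a few lines. The phrase list, the penalty weight, and the function name keyword_ethics_score below are all hypothetical; the real evaluate_ethical_alignment() is evidently more nuanced, since it produces intermediate scores such as 0.61.

```python
import re

# Hypothetical phrase list for illustration; the project's real list is not shown here.
FLAGGED_PHRASES = ["harm yourself", "hurt yourself", "you deserve it"]


def keyword_ethics_score(text, flagged=FLAGGED_PHRASES, penalty=1.0):
    """Start at 1.0 and subtract `penalty` for each flagged phrase found."""
    # Lowercase and collapse whitespace so matching ignores case and spacing.
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = sum(1 for phrase in flagged if phrase in normalized)
    return max(0.0, 1.0 - penalty * hits)
```

A smaller penalty value (for example 0.5) would make the score degrade gradually instead of dropping to 0.0 on the first match.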

Example:

✅ "I understand your concerns." → 1.0
❌ "You should harm yourself." → 0.0

Use case: Mental health chatbots, customer service bots


2. Inclusivity (0-1 scale)

Purpose: Measures LGBTQ+ support and cultural sensitivity.

How it works:

  • Detects inclusive terminology (LGBTQ+, pronouns, diversity terms)
  • Scores based on presence and frequency of inclusive language
  • Returns 0.0 for no inclusive language, higher scores for more inclusivity
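Frequency-based scoring could look like the sketch below. The term list and the per_hit weight are made up for illustration, so the numbers it returns will not match the tool's actual scores; inclusivity_score is a hypothetical stand-in for evaluate_inclusivity_score().

```python
import re

# Hypothetical term list; the terms the project actually checks are not listed in this README.
INCLUSIVE_TERMS = {"lgbtq", "pronouns", "diversity", "inclusive", "nonbinary"}


def inclusivity_score(text, terms=INCLUSIVE_TERMS, per_hit=0.4):
    """Score rises with each inclusive term found, capped at 1.0."""
    # Split on non-letters so "LGBTQ+" is matched as "lgbtq".
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for token in tokens if token in terms)
    return min(1.0, per_hit * hits)
```

Capping at 1.0 keeps the result on the documented 0-1 scale no matter how many terms appear.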

Example:

✅ "We support LGBTQ+ individuals." → 0.8
⚪ "We help everyone." → 0.0

Use case: Diversity initiatives, inclusive app development


3. Complexity (0-100 scale)

Purpose: Measures text readability using Flesch-Kincaid scoring.

How it works:

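  • Analyzes sentence length and syllable count
  • Calculates the Flesch reading ease score
  • Nominal range: 0 = very complex, 100 = very simple; the raw formula can exceed 100 for very short sentences, as in the first example below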
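The standard Flesch reading ease formula is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below applies it with a rough vowel-group syllable counter, which is an assumption; the project may count syllables differently (for example via NLTK), so scores for longer words can differ.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(word) for word in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

For "I can help." (3 words, 1 sentence, 3 syllables) this gives about 119.19, which shows why very short sentences land above the nominal 100 ceiling unless the result is clamped.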

Example:

✅ "I can help." → 120 (very simple)
⚪ "I shall endeavor to facilitate assistance." → 40 (complex)

Use case: Ensuring accessible communication, education apps


4. Sentiment (0-1 scale)

Purpose: Measures emotional alignment between human input and chatbot response.

How it works:

  • Uses TF-IDF vectorization to compare texts
  • Calculates cosine similarity between human and chatbot text
  • Returns 0.0 for complete mismatch, 1.0 for perfect alignment
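To make the TF-IDF and cosine steps concrete, here is a standard-library sketch. It is not the project's code, which lists scikit-learn among its technologies; the tokenizer and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 are assumptions chosen for illustration, and tfidf_cosine is a hypothetical name.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_cosine(human_text, chatbot_text):
    """Cosine similarity of TF-IDF vectors built from just these two texts."""
    docs = [Counter(tokenize(human_text)), Counter(tokenize(chatbot_text))]
    if not docs[0] or not docs[1]:
        return 0.0
    vocabulary = set(docs[0]) | set(docs[1])
    # Smoothed IDF so that words shared by both texts keep a non-zero weight.
    idf = {
        term: math.log((1 + len(docs)) / (1 + sum(term in d for d in docs))) + 1
        for term in vocabulary
    }
    vectors = [{term: count * idf[term] for term, count in d.items()} for d in docs]
    dot = sum(w * vectors[1].get(term, 0.0) for term, w in vectors[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]
    return dot / (norms[0] * norms[1])
```

Because the comparison is lexical, two texts with no words in common score 0.0 under this sketch, which is in line with the low sentiment values (0.03, 0.05) in the sample API output.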

Example:

Human: "I'm feeling great!"
Chatbot: "That's wonderful to hear!" → 0.8 (good alignment)