AI Chatbot Evaluation Tool with Automated Data Collection
Evaluate chatbot responses for ethical alignment, inclusivity, complexity, and sentiment - with automatic CSV database tracking.
python main.py
Then open: http://localhost:8080 🎯
That's it! One command gives you:
- ✅ Interactive web interface
- ✅ REST API endpoints
- ✅ Automatic CSV database
- ✅ Real-time statistics
- ✅ Data export capability
web_automation_CSSW/
├── main.py 🚀 Main entry point (start here!)
├── app.py 🌐 Flask application (web + API + database)
├── api/ 🔌 Evaluation engine
│ ├── api_server.py → NLP evaluation functions
│ └── requirements.txt → Python dependencies
├── data/ 💾 Database (auto-created)
│ └── evaluations.csv → All evaluation records
├── venv/ 🐍 Virtual environment
├── logs/ 📝 Application logs
└── README.md 📚 This file
Clean & Simple: Just seven top-level items, no complex folder hierarchies!
This application evaluates AI chatbot responses across four critical dimensions:
- Ethical Alignment (0-1) - Professional appropriateness and ethical considerations
- Inclusivity (0-1) - LGBTQ+ support, cultural sensitivity, and inclusive language
- Complexity (0-100) - Text readability using Flesch-Kincaid scoring
- Sentiment (0-1) - Emotional alignment between human input and chatbot response
- 🏥 Mental Health Apps - Ensure responses are appropriate and supportive
- 🤖 AI Development - Quality assurance for chatbot systems
- 📊 Research - Analyze and compare AI model performance
- 🎓 Education - Teach responsible AI development
- ✅ 100% Pure Python - No PHP, Drupal, or Composer complexity
- ✅ Automatic Data Collection - Every evaluation saved to CSV
- ✅ Web Interface + API - Use it interactively or programmatically
- ✅ Real-time Statistics - Track averages and usage patterns
- ✅ Export Capability - Download your data anytime
- ✅ Simple Deployment - One command to start everything
┌──────────────────────────────────────────────────────────────┐
│ USER INTERACTION │
│ │
│ Browser (http://localhost:8080) OR API Client (curl/code) │
└────────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ FLASK APPLICATION (app.py) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ WEB INTERFACE (Routes) │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ GET / → Home page with form │ │ │
│ │ │ GET /health → Health check │ │ │
│ │ │ POST /api/evaluate → Process evaluation │ │ │
│ │ │ GET /api/history → Get all records │ │ │
│ │ │ GET /api/stats → Get statistics │ │ │
│ │ │ GET /api/download → Download CSV │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ DATA COLLECTION (save_to_csv) │ │
│ │ • Captures every evaluation │ │
│ │ • Timestamps each entry │ │
│ │ • Stores all metrics │ │
│ │ • Appends to data/evaluations.csv │ │
│ └────────────────────────────────────────────────────────┘ │
└────────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ EVALUATION ENGINE (api/api_server.py) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ evaluate_ethical_alignment() │ │
│ │ → Checks professional appropriateness │ │
│ │ → Uses keyword matching for ethical concerns │ │
│ │ → Returns: 0.0 (problematic) to 1.0 (appropriate) │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ evaluate_inclusivity_score() │ │
│ │ → Detects LGBTQ+ terminology │ │
│ │ → Checks cultural sensitivity │ │
│ │ → Returns: 0.0 (exclusive) to 1.0 (inclusive) │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ evaluate_complexity_score() │ │
│ │ → Flesch-Kincaid readability analysis │ │
│ │ → Sentence structure analysis │ │
│ │ → Returns: 0 (very complex) to 100 (simple) │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ evaluate_sentiment_distribution() │ │
│ │ → Compares human and chatbot text │ │
│ │ → Analyzes emotional alignment │ │
│ │ → Returns: 0.0 (mismatched) to 1.0 (aligned) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Technologies: NLTK, NumPy, scikit-learn, TF-IDF │
└────────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ RESULTS & DATABASE │
│ │
│ ┌────────────────────┐ ┌───────────────────────┐ │
│ │ JSON Response │ │ CSV Database │ │
│ │ to User/API │ │ (data/evaluations.csv│ │
│ │ │ │ │ │
│ │ { │ │ timestamp,chatbot... │ │
│ │ "ethical": 0.8, │ │ 2025-10-22T19:00:.. │ │
│ │ "inclusivity":.. │ │ 2025-10-22T19:00:.. │ │
│ │ } │ │ ... │ │
│ └────────────────────┘ └───────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Every evaluation is automatically saved to data/evaluations.csv with:
- ⏰ Timestamp - Exact date/time of evaluation
- 💬 Chatbot Text - The response being evaluated
- 👤 Human Text - Optional user input (for sentiment analysis)
- 🔧 Formula Used - Which metric(s) were calculated
- 📊 All Scores - Ethical, inclusivity, complexity, sentiment values
timestamp,chatbot_text,human_text,formula,ethical_alignment,inclusivity,complexity,sentiment
2025-10-22T19:00:11,I will help you.,,ethical_alignment,0.61,,,
2025-10-22T19:00:27,We welcome all backgrounds.,,inclusivity,,0.0,,
2025-10-22T19:01:45,I understand.,I need help.,all,1.0,0.0,80.31,0.03
- View History - See your last 10 evaluations in the web UI
- Statistics - Get averages, totals, and formula usage
- Export - Download the complete CSV anytime
- No Cleanup Needed - Data persists automatically
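The data collection and statistics described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the project's actual code: the column order is copied from the sample CSV, while the `save_to_csv` signature and the `compute_stats` helper are assumptions made for this example.

```python
import csv
import os
from collections import Counter
from datetime import datetime

# Column order taken from the sample CSV in this README.
FIELDS = ["timestamp", "chatbot_text", "human_text", "formula",
          "ethical_alignment", "inclusivity", "complexity", "sentiment"]
SCORE_FIELDS = FIELDS[4:]


def save_to_csv(path, chatbot_text, formula, scores, human_text=""):
    """Append one evaluation row; write the header if the file is new or empty.

    Hypothetical signature: the real save_to_csv in app.py may differ.
    """
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    row = {name: "" for name in FIELDS}  # metrics that were not run stay blank
    row.update(timestamp=datetime.now().isoformat(), chatbot_text=chatbot_text,
               human_text=human_text, formula=formula)
    row.update(scores)
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerow(row)


def compute_stats(path):
    """Summarise the CSV: row count, formula usage, and per-metric averages."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    averages = {}
    for name in SCORE_FIELDS:
        values = [float(r[name]) for r in rows if r[name] != ""]
        if values:  # skip metrics that have never been calculated
            averages[name] = round(sum(values) / len(values), 2)
    return {
        "total_evaluations": len(rows),
        "formulas_used": dict(Counter(r["formula"] for r in rows)),
        "averages": averages,
    }
```

Blank cells are skipped when averaging, so a metric that was only calculated for some rows is averaged over those rows alone.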
- Start the server:
  python main.py
- Open your browser:
  http://localhost:8080
- Evaluate chatbot responses:
  - Select evaluation type (All Metrics, Ethical Alignment, etc.)
  - Enter chatbot text
  - Optionally add human text (for sentiment)
  - Click "Evaluate"
  - View results instantly!
- Access your data:
  - Click "View History" - See recent evaluations
  - Click "View Statistics" - See averages and totals
  - Click "Download CSV" - Export all data
curl http://localhost:8080/health
Response:
{"status": "healthy"}
curl -X POST http://localhost:8080/api/evaluate \
-H "Content-Type: application/json" \
-d '{
"formula": "ethical_alignment",
"chatbot_text": "I understand and support you."
}'
Response:
{"ethical_alignment": 1.0}
curl -X POST http://localhost:8080/api/evaluate \
-H "Content-Type: application/json" \
-d '{
"formula": "all",
"chatbot_text": "I understand your concerns.",
"human_text": "I am feeling anxious."
}'
Response:
{
"ethical_alignment": 1.0,
"inclusivity": 0.0,
"complexity": 81.86,
"sentiment": 0.03
}
curl http://localhost:8080/api/history
Response:
{
"count": 15,
"data": [
{
"timestamp": "2025-10-22T19:00:11.986236",
"chatbot_text": "I will help you with that.",
"human_text": "",
"formula": "ethical_alignment",
"ethical_alignment": "0.61",
"inclusivity": "",
"complexity": "",
"sentiment": ""
}
]
}
curl http://localhost:8080/api/stats
Response:
{
"total_evaluations": 15,
"formulas_used": {
"all": 5,
"ethical_alignment": 7,
"inclusivity": 2,
"sentiment": 1
},
"averages": {
"ethical_alignment": 0.82,
"inclusivity": 0.15,
"complexity": 78.45,
"sentiment": 0.05
}
}
curl http://localhost:8080/api/download -o my_evaluations.csv

| Method | Endpoint | Description | Auth Required |
|---|---|---|---|
| GET | / | Web interface home page | No |
| GET | /health | Health check | No |
| POST | /api/evaluate | Evaluate chatbot text | No |
| GET | /api/history | Get all evaluation records | No |
| GET | /api/stats | Get database statistics | No |
| GET | /api/download | Download CSV database | No |
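The same endpoints can be called from Python instead of curl. The sketch below uses only the standard library; the helper names `build_evaluate_request` and `evaluate` are invented for this example, and the base URL assumes the default address used throughout this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default address from this README


def build_evaluate_request(formula, chatbot_text, human_text=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/evaluate."""
    payload = {"formula": formula, "chatbot_text": chatbot_text}
    if human_text is not None:
        payload["human_text"] = human_text
    return urllib.request.Request(
        base_url + "/api/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(formula, chatbot_text, human_text=None, base_url=BASE_URL):
    """Send the request and return the decoded JSON scores.

    Needs a running server (python main.py); raises urllib.error.URLError otherwise.
    """
    request = build_evaluate_request(formula, chatbot_text, human_text, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `evaluate("all", "I understand your concerns.", "I am feeling anxious.")` should return a dictionary with the same keys as the curl response for the `all` formula.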
Request Body:
{
"formula": "all",
"chatbot_text": "Your chatbot response here",
"human_text": "Optional, required for sentiment"
}
Formula Options:
- ethical_alignment - Professional ethics score only
- inclusivity - Inclusivity score only
- complexity - Readability score only
- sentiment - Emotional match only (requires human_text)
- all - All metrics at once
Response Codes:
- 200 - Success
- 400 - Bad request (missing required fields)
- 500 - Server error
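A request check consistent with the formula options and response codes above might look like the following. This is a hedged sketch: `validate_payload` and its error messages are hypothetical, and it assumes that only the `sentiment` formula strictly requires `human_text` (the README does not say how `all` behaves without it).

```python
# Formula names as listed in this README.
VALID_FORMULAS = {"ethical_alignment", "inclusivity", "complexity", "sentiment", "all"}


def validate_payload(payload):
    """Return (status_code, error_message); error_message is None when valid."""
    if not isinstance(payload, dict):
        return 400, "request body must be a JSON object"
    formula = payload.get("formula")
    if formula not in VALID_FORMULAS:
        return 400, "formula must be one of: " + ", ".join(sorted(VALID_FORMULAS))
    if not str(payload.get("chatbot_text") or "").strip():
        return 400, "chatbot_text is required"
    # Assumption: human_text is mandatory only for the sentiment formula.
    if formula == "sentiment" and not str(payload.get("human_text") or "").strip():
        return 400, "human_text is required for the sentiment formula"
    return 200, None
```

Returning a status code together with a message keeps the check independent of any web framework, so it can be unit-tested without starting the server.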
- Python 3.8 or higher
- pip (Python package manager)
# 1. Clone or download the project
cd web_automation_CSSW
# 2. Create virtual environment
python3 -m venv venv
# 3. Activate virtual environment
source venv/bin/activate # Mac/Linux
# OR
venv\Scripts\activate # Windows
# 4. Install dependencies
pip install -r api/requirements.txt
# 5. Run the application
python main.py
# Just run (virtual environment auto-activates if needed)
python main.py
From api/requirements.txt:
- Flask - Web framework
- NLTK - Natural language processing
- NumPy - Numerical computing
- scikit-learn - Machine learning utilities
- transformers (optional) - Advanced NLP
- torch (optional) - Deep learning
Purpose: Ensures chatbot responses are professionally appropriate and ethically sound.
How it works:
- Scans for problematic keywords and phrases
- Checks for harmful advice or inappropriate content
- Returns a score from 0.0 (problematic) to 1.0 (appropriate); intermediate values such as 0.61 are possible
Example:
✅ "I understand your concerns." → 1.0
❌ "You should harm yourself." → 0.0
Use case: Mental health chatbots, customer service bots
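In its simplest binary form, the keyword scan described above could look like this. The phrase list and the function name `keyword_ethics_score` are hypothetical, chosen only to illustrate the technique; the project's real list and its graded scoring (for example the 0.61 in the sample data) are not documented here.

```python
import re

# Hypothetical phrase list for illustration only; the real list is not shown in this README.
FLAGGED_PHRASES = ["harm yourself", "hurt yourself"]


def keyword_ethics_score(text, flagged=FLAGGED_PHRASES):
    """Return 0.0 if any flagged phrase occurs as whole words, else 1.0."""
    normalized = re.sub(r"\s+", " ", text.lower())  # collapse whitespace, ignore case
    for phrase in flagged:
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            return 0.0
    return 1.0
```

Matching on word boundaries avoids false positives from substrings, but plain keyword matching still cannot see context or negation, which is one reason a real scorer would likely need more than a flag list.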
Purpose: Measures LGBTQ+ support and cultural sensitivity.
How it works:
- Detects inclusive terminology (LGBTQ+, pronouns, diversity terms)
- Scores based on presence and frequency of inclusive language
- Returns 0.0 for no inclusive language, higher scores for more inclusivity
Example:
✅ "We support LGBTQ+ individuals." → 0.8
⚪ "We help everyone." → 0.0
Use case: Diversity initiatives, inclusive app development
Purpose: Measures text readability using Flesch-Kincaid scoring.
How it works:
- Analyzes sentence length and syllable count
- Calculates reading ease score
- Nominally 0 = very complex, 100 = very simple; very short, simple sentences can score above 100
Example:
✅ "I can help." → 120 (very simple)
⚪ "I shall endeavor to facilitate assistance." → 40 (complex)
Use case: Ensuring accessible communication, education apps
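The reading ease calculation can be sketched with the standard Flesch Reading Ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The syllable counter below is a rough vowel-group heuristic assumed for this example; the project's own counting method (possibly NLTK-based) may give different numbers.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a silent trailing 'e'."""
    word = word.lower()
    if len(word) > 2 and word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # nothing to score
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

For "I can help." (three words, three syllables, one sentence) this gives about 119.2, which shows how a very short sentence lands above the nominal 100 ceiling.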
Purpose: Measures emotional alignment between human input and chatbot response.
How it works:
- Uses TF-IDF vectorization to compare texts
- Calculates cosine similarity between human and chatbot text
- Returns 0.0 for complete mismatch, 1.0 for perfect alignment
Example:
Human: "I'm feeling great!"
Chatbot: "That's wonderful to hear!" → 0.8 (good alignment)
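The TF-IDF comparison can be illustrated without any dependencies. The sketch below is a standard-library re-implementation written for this README, not the project's scikit-learn code; the tokenizer (words of two or more characters) and the smoothed IDF are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_cosine(text_a, text_b):
    """Cosine similarity of smoothed TF-IDF vectors built from just these two texts."""
    counts = [Counter(tokenize(text_a)), Counter(tokenize(text_b))]
    vocabulary = set(counts[0]) | set(counts[1])
    n_docs = len(counts)

    def weight(term, doc):
        doc_freq = sum(1 for c in counts if term in c)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
        return doc[term] * idf  # Counter returns 0 for absent terms

    vec_a = {t: weight(t, counts[0]) for t in vocabulary}
    vec_b = {t: weight(t, counts[1]) for t in vocabulary}
    dot = sum(vec_a[t] * vec_b[t] for t in vocabulary)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text had no usable tokens
    return dot / (norm_a * norm_b)
```

Because this measures shared vocabulary, two texts with a similar mood but no words in common score 0.0 in this sketch, which may help explain low sentiment values such as the 0.03 in the sample responses.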