A comprehensive framework for validating AI models against human-coded gold standard scores across 5 dimensions of peace journalism.
This repository contains tools for:
- Extracting and cleaning human-coded scores from Excel gold standard files
- Running 6 different AI models (4 LLM + 2 RoBERTa) on video transcripts
- Comparing model outputs against human gold standard with statistical analysis
- Generating publication-ready reports for team meetings and papers
- News vs. Opinion - Fact-based reporting (4-5) vs. Subjective analysis (1-2)
- Nuance vs. Oversimplification - Multi-perspective (4-5) vs. Binary/simplistic (1-2)
- Creativity vs. Order - Innovation/human-centered (4-5) vs. Control/authority (1-2)
- Prevention vs. Promotion - Growth/aspiration (4-5) vs. Safety/security (1-2)
- Compassion vs. Contempt - Inclusive/respectful (4-5) vs. Dehumanizing/divisive (1-2)
- OpenAI GPT-4o (No Context) - Pure LLM reasoning
- OpenAI GPT-4o (With RoBERTa) - LLM enhanced with emotion scores
- Google Gemini 2.5 Flash (No Context) - Pure LLM reasoning
- Google Gemini 2.5 Flash (With RoBERTa) - LLM enhanced with emotion scores
- RoBERTa Plain - Emotion-based scoring (respect vs. contempt)
- RoBERTa Valence - Weighted valence scoring (1-5 scale)
```bash
# Install dependencies
pip install -r requirements.txt
```

```bash
# Set up environment variables (.env file)
OPENAI_API_KEY=your_key_here
GEMINI_API_KEY=your_key_here
```

```bash
python validate_against_human.py
```

This will:
- Load `gold_standard.xlsx`
- Extract video IDs from hyperlinks
- Clean and aggregate human scores
- Generate validation reports in `validation_results/`
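The video-ID extraction in this step reads hyperlink targets out of the Excel cells (e.g. via openpyxl's `cell.hyperlink.target`) and parses the YouTube ID from each URL. A minimal sketch of the URL-parsing half, using only the standard library (`video_id_from_hyperlink` is an illustrative name, not necessarily the function used in `validate_against_human.py`):

```python
from urllib.parse import urlparse, parse_qs

def video_id_from_hyperlink(url):
    """Extract the YouTube video ID from a cell's hyperlink target."""
    parsed = urlparse(url)
    if parsed.hostname and "youtu.be" in parsed.hostname:
        return parsed.path.lstrip("/")                     # https://youtu.be/<id>
    return parse_qs(parsed.query).get("v", [None])[0]      # .../watch?v=<id>
```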
```bash
python run_models_on_gold_standard.py
```

This will:
- Load videos from `validation_results/human_scores_cleaned.csv`
- Fetch transcripts (from `transcripts/Transcripts.docx` or via yt-dlp)
- Run all 6 models on each video
- Save results to `model_scores_gold_standard/`
Note: This may take a while due to API rate limiting (1.5s delay between calls).
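The rate limiting amounts to a fixed delay between API calls; a minimal sketch, where `score_fn` is a hypothetical stand-in for whatever per-video scoring call the scripts actually make:

```python
import time

def run_with_rate_limit(videos, score_fn, delay=1.5):
    """Score each video in turn, sleeping `delay` seconds between API calls."""
    results = {}
    for i, video_id in enumerate(videos):
        results[video_id] = score_fn(video_id)
        if i < len(videos) - 1:  # no need to sleep after the last call
            time.sleep(delay)
    return results
```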
```bash
python compare_models_to_human.py
```

This will:
- Calculate correlations (Pearson, Spearman)
- Compute error metrics (MAE, RMSE)
- Generate scatter plots and heatmaps
- Save comparison results to `model_comparison_results/`
```bash
python generate_team_report.py
```

This will:
- Compile all results into a comprehensive report
- Create publication-ready tables
- Generate HTML and Markdown outputs
- Save to `team_report/`
```
.
├── gold_standard.xlsx                  # Human-coded scores (input)
├── transcripts/
│   └── Transcripts.docx                # Video transcripts (optional)
├── validation_results/                 # Human score extraction outputs
│   ├── human_scores_cleaned.csv
│   ├── human_metrics_summary.csv
│   └── ...
├── model_scores_gold_standard/         # Model scoring outputs
│   ├── model_scores_YYYYMMDD.csv
│   └── model_scores_detailed_YYYYMMDD.json
├── model_comparison_results/           # Comparison analysis
│   ├── model_vs_human_metrics_YYYYMMDD.csv
│   ├── comparison_summary_YYYYMMDD.json
│   └── plots/
└── team_report/                        # Final reports
    ├── team_report_YYYYMMDD.md
    └── team_report_YYYYMMDD.html
```
- `human_scores_cleaned.csv` - Aggregated human scores (gold standard, 52 videos)
- `human_metrics_summary.csv` - Summary statistics per dimension
- `inter_rater_reliability.csv` - Agreement between evaluators (excellent: r = 0.727-0.947)
- `human_dimensions_correlation.csv` - Correlation matrix between dimensions
- `missing_data_report.csv` - Missing data analysis (9.6%-38.5% per dimension)
- `model_scores_gold_standard/run_1/model_scores_YYYYMMDD.csv` - All model scores (49 videos, all 5 dimensions)
- `model_scores_gold_standard/run_1/model_scores_detailed_YYYYMMDD.json` - Detailed results with rationales
- `model_comparison_results/model_vs_human_metrics_YYYYMMDD.csv` - Statistical metrics (22 comparisons)
- `model_comparison_results/comparison_summary_YYYYMMDD.json` - Best methods per dimension
- `model_comparison_results/plots/` - Scatter plots for all method-dimension combinations
- `model_comparison_results/plots/heatmaps/` - Correlation heatmaps per dimension
- `team_report/team_report_20251117_204008.md` - ⭐ Complete report (use this one)
- `team_report/summary_stats_20251117_204008.json` - Summary statistics
- `COMPREHENSIVE_RESULTS_REPORT.md` - ⭐ Publication-ready comprehensive report
- `PROJECT_SUMMARY.md` - Project overview and status
- `IMPLEMENTATION_NOTES.md` - Lessons learned and improvements
| Script | Purpose |
|---|---|
| `validate_against_human.py` | Extract and clean human scores from Excel |
| `run_models_on_gold_standard.py` | Run all 6 models on gold standard videos |
| `compare_models_to_human.py` | Statistical comparison and visualization |
| `generate_team_report.py` | Generate publication-ready reports |
| `llm_analyzer.py` | LLM API integration (OpenAI + Gemini) |
| `parse_transcripts_docx.py` | Extract transcripts from Word document |
- Pearson Correlation (r): Linear relationship strength
- Spearman Correlation (ρ): Monotonic relationship strength
- Mean Absolute Error (MAE): Average prediction error
- Root Mean Squared Error (RMSE): Penalizes larger errors
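The four metrics above can be computed with NumPy and SciPy; a minimal sketch over one method-dimension pair of score arrays (the actual scripts may organize this differently):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def compare_scores(human, model):
    """Return the four validation metrics for paired human/model scores."""
    human, model = np.asarray(human, float), np.asarray(model, float)
    r, _ = pearsonr(human, model)                   # linear association
    rho, _ = spearmanr(human, model)                # rank (monotonic) association
    mae = np.mean(np.abs(human - model))            # average absolute error
    rmse = np.sqrt(np.mean((human - model) ** 2))   # penalizes larger errors
    return {"pearson_r": r, "spearman_rho": rho, "mae": mae, "rmse": rmse}
```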
Best Overall Model: Google Gemini 2.5 Flash (No Context)
- Average Correlation: r = 0.682 across all 5 dimensions
- Strong Correlations (r > 0.7): 4 out of 5 dimensions
- Best Dimension: Prevention/Promotion (r = 0.773)
Human Inter-Rater Reliability: Excellent agreement (r = 0.727-0.947)
Total Analysis:
- 52 videos in gold standard
- 49 videos with complete model scores
- 22 model comparisons (4 LLM × 5 dimensions + 2 RoBERTa × 1 dimension)
For detailed results, see:
- `COMPREHENSIVE_RESULTS_REPORT.md` - Complete detailed analysis
- `RESULTS.md` - Quick results summary
- Model Validation: Compare AI models against human coders
- Method Selection: Choose optimal model based on accuracy/cost tradeoffs
- Inter-Rater Reliability: Analyze agreement between human evaluators
- Publication: Generate tables and figures for papers
- All scores are normalized to 1-5 scale for comparison
- Rate limiting (1.5s delay) prevents API throttling
- Transcripts are fetched from cached `.docx` files under `transcripts/history/` when available, skipping yt-dlp
- Missing data handled with pairwise deletion (N varies by dimension: 29-47)
- MLflow tracking enabled for experiment reproducibility
- Run-specific folders (`run_1/`, `run_2/`) for easy comparison
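The 1-5 normalization and pairwise deletion mentioned in the notes can be sketched as follows. This assumes min-max rescaling and NaN-marked missing values; the actual transforms in the scripts may differ:

```python
import numpy as np

def normalize_1_5(scores, lo, hi):
    """Min-max rescale raw scores from [lo, hi] onto the 1-5 scale."""
    scores = np.asarray(scores, float)
    return 1 + 4 * (scores - lo) / (hi - lo)

def pairwise_complete(human, model):
    """Keep only videos where both the human and model scores are present."""
    human, model = np.asarray(human, float), np.asarray(model, float)
    mask = ~np.isnan(human) & ~np.isnan(model)
    return human[mask], model[mask]
```

Pairwise deletion is why N varies by dimension: each method-dimension comparison keeps whichever videos happen to have both scores for that dimension, rather than dropping a video from every comparison because one dimension is missing.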
- `COMPREHENSIVE_RESULTS_REPORT.md` - ⭐ Complete detailed results and analysis
- `RESULTS.md` - ⭐ Quick results summary and key findings
- `DATASET_ANALYSIS_COMPARISON.md` - ⭐ Explains the difference between transcript corpus analysis and gold standard validation
- `IMPLEMENTATION_NOTES.md` - Technical details and lessons learned
- `MLFLOW_TRACKING_GUIDE.md` - MLflow experiment tracking guide
- `ANALYSIS_MECHANISMS_EXPLAINED.md` - How each model works
- `MODEL_COMPARISON_SUMMARY.md` - Historical summary (previous analysis)
- See individual script docstrings for detailed function documentation
- Check `validation_results/data_quality_notes.json` for data quality issues
- Review `model_comparison_results/` for detailed analysis outputs
For questions or issues, please refer to the research team.
Last Updated: November 2024