This small-scale experimental research project provides tools for extracting subjective language from academic peer reviews and compares model performance on binary sentence classification (objective vs. subjective) using OpenReview-sourced data.
Opinion_Analysis/
├── data/
│ ├── eval_dataset.csv
│ └── test_dataset.csv
├── results/
│ ├── mdebertav3_results.csv
│ ├── qwen3_1_7b_results.csv
│ ├── qwen3_8b_results.csv
│ ├── qwen3_14b_results.csv
│ ├── disagreements.csv
│ └── unanimous_wrong.csv
├── templates/
│ ├── index.html
│ └── reviews.html
├── app.py
├── main.py
├── build_eval_dataset.py
├── mdeberta_subjectivity_eval.ipynb
└── qwen_subjectivity_eval.ipynb
eval_dataset.csv is the evaluation dataset, collected from ICLR 2026 peer reviews via the OpenReview API. Each sentence is labeled as either subjective or objective. A sentence is subjective if it contains at least one of a predefined set of linguistic markers (e.g. hedges, reviewer beliefs, evaluative language) and objective if it matches none of these keyword patterns. Columns: field, sentence, matched_markers, and label.
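The marker-based labeling rule described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the pattern names and regular expressions in `MARKER_PATTERNS` are hypothetical stand-ins, and the real marker list may differ and may record matched keywords rather than category names.

```python
import re

# Hypothetical marker patterns, for illustration only.
# The project's real marker list is not reproduced here.
MARKER_PATTERNS = {
    "hedge": r"\b(?:might|perhaps|seems?|somewhat)\b",
    "belief": r"\bI (?:think|believe|feel)\b",
    "evaluative": r"\b(?:impressive|weak|unconvincing|novel)\b",
}


def label_sentence(sentence):
    """Return (matched_markers, label) for a single sentence.

    A sentence is 'subjective' if at least one pattern matches,
    otherwise 'objective'.
    """
    matched = [
        name
        for name, pattern in MARKER_PATTERNS.items()
        if re.search(pattern, sentence, flags=re.IGNORECASE)
    ]
    label = "subjective" if matched else "objective"
    return matched, label


if __name__ == "__main__":
    print(label_sentence("I think the ablation seems weak."))
    print(label_sentence("The model has 12 layers."))
```

Because the label follows directly from the keyword match, the dataset labels inherit any blind spots of the marker list, which is worth keeping in mind when reading the model comparisons.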
test_dataset.csv is a curated subset of 100 sentences drawn from eval_dataset.csv, used as the held-out test set for all model evaluations in this project.
mdeberta_subjectivity_eval.ipynb evaluates the GroNLP/mdebertav3-subjectivity-english model, a fine-tuned mDeBERTa v3 classifier for subjectivity detection in English.
qwen_subjectivity_eval.ipynb evaluates three sizes of the Qwen3 instruction-tuned LLM family: 1.7B, 8B, and 14B. Each model is prompted to classify each sentence as subjective or objective and to return a reasoning chain and a natural-language explanation alongside its prediction.
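As a rough sketch of the prompt-and-parse step, the helpers below build a classification prompt and extract a structured answer from a model reply. The prompt wording, the JSON reply format, and the names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_response` are all assumptions made for illustration. The notebook's actual prompt and parsing logic are not shown here, and the model call itself is omitted.

```python
import json

# Hypothetical prompt; the notebook's real wording may differ.
PROMPT_TEMPLATE = (
    "Classify the following peer-review sentence as 'subjective' or 'objective'.\n"
    "Reply with a JSON object with the keys: label, explanation, confidence.\n\n"
    "Sentence: {sentence}"
)

VALID_LABELS = {"subjective", "objective"}


def build_prompt(sentence):
    """Fill the sentence into the prompt template."""
    return PROMPT_TEMPLATE.format(sentence=sentence)


def parse_response(raw):
    """Parse the outermost {...} span of a reply; return None if unusable."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    label = str(parsed.get("label", "")).strip().lower()
    if label not in VALID_LABELS:
        return None
    return {
        "label": label,
        "explanation": parsed.get("explanation", ""),
        "confidence": parsed.get("confidence"),
    }
```

Returning `None` for malformed replies keeps parsing failures visible, so they can be counted separately instead of being silently scored as wrong predictions.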
The results/ folder contains one CSV per model evaluated on test_dataset.csv.
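mdebertav3_results.csv holds the output from the mDeBERTa classifier. Columns: sentence, ground_truth, label (predicted), confidence, correct.
qwen3_1_7b_results.csv, qwen3_8b_results.csv, and qwen3_14b_results.csv hold the output from each Qwen3 model. Columns: sentence, ground_truth, label (predicted), reasoning, explanation, confidence.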
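Since every results file lists sentence, ground_truth, and label columns, a per-model accuracy can be computed the same way for all of them. The sketch below assumes the two label columns hold comparable strings such as "subjective" and "objective". That encoding is an assumption, and the rows in the usage example are toy data, not project results.

```python
import csv
import io


def accuracy_from_csv(handle):
    """Fraction of rows whose predicted label equals ground_truth."""
    rows = list(csv.DictReader(handle))
    if not rows:
        return 0.0
    hits = sum(
        row["label"].strip().lower() == row["ground_truth"].strip().lower()
        for row in rows
    )
    return hits / len(rows)


if __name__ == "__main__":
    # Toy rows for illustration only; not taken from the project's results.
    toy_csv = "\n".join([
        "sentence,ground_truth,label",
        '"I think the results are weak.",subjective,subjective',
        '"The model has 12 layers.",objective,objective',
        '"Table 2 reports accuracy.",objective,subjective',
        '"The writing seems rushed.",subjective,subjective',
    ])
    print(accuracy_from_csv(io.StringIO(toy_csv)))  # 0.75
```

To score a real file, pass an open file handle instead, for example `open("results/qwen3_8b_results.csv", newline="", encoding="utf-8")`.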
disagreements.csv lists sentences where the three Qwen models did not all predict the same label.
unanimous_wrong.csv lists sentences where all three Qwen models agreed on a prediction that contradicts the gold label.
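The two derived files can be thought of as a partition of the test sentences by model agreement. The sketch below shows one way to compute both sets from aligned prediction lists. The function name `split_by_agreement` and the input layout are hypothetical, and the project may build these files differently.

```python
def split_by_agreement(gold, predictions):
    """Split sentence indices by agreement among models.

    gold: list of gold labels.
    predictions: dict mapping a model name to a list of predicted
        labels aligned with gold.
    Returns (disagreements, unanimous_wrong) as lists of indices.
    """
    disagreements, unanimous_wrong = [], []
    for i, truth in enumerate(gold):
        distinct = {labels[i] for labels in predictions.values()}
        if len(distinct) > 1:
            # At least two models predicted different labels.
            disagreements.append(i)
        elif distinct != {truth}:
            # All models agree, but on a label that differs from gold.
            unanimous_wrong.append(i)
    return disagreements, unanimous_wrong


if __name__ == "__main__":
    # Toy labels for illustration only.
    gold = ["subjective", "objective", "subjective"]
    predictions = {
        "qwen3_1_7b": ["subjective", "subjective", "objective"],
        "qwen3_8b": ["subjective", "subjective", "subjective"],
        "qwen3_14b": ["subjective", "subjective", "subjective"],
    }
    print(split_by_agreement(gold, predictions))  # ([2], [1])
```

Sentences in the second list are useful for auditing, since a unanimous miss can point to a questionable gold label as easily as to a shared model error.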
app.py is a Flask web application providing an interactive interface for subjectivity analysis.
Features:
- Text analysis: paste any review text and receive sentence-level subjectivity labels, with matched keyword markers highlighted.
- Live OpenReview streaming: enter a conference venue and stream peer reviews directly from the OpenReview API, with subjective sentences extracted and displayed.
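To illustrate the text-analysis feature, here is a minimal sketch of sentence-level labeling with highlighted markers. It is not taken from app.py: the `MARKERS` pattern, the naive sentence splitter, and the use of `<mark>` tags for highlighting are all assumptions, and the Flask routing around it is left out.

```python
import html
import re

# Hypothetical marker pattern; the app's real keyword list is not shown here.
MARKERS = re.compile(
    r"\b(?:seems?|perhaps|I (?:think|believe)|unconvincing)\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def highlight(sentence):
    """HTML-escape the sentence and wrap each marker match in <mark> tags."""
    parts, last = [], 0
    for match in MARKERS.finditer(sentence):
        parts.append(html.escape(sentence[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(sentence[last:]))
    return "".join(parts)


def analyze(text):
    """Label every sentence and attach highlighted HTML for display."""
    results = []
    for sentence in split_sentences(text):
        label = "subjective" if MARKERS.search(sentence) else "objective"
        results.append(
            {"sentence": sentence, "label": label, "html": highlight(sentence)}
        )
    return results
```

Escaping the text piece by piece, before adding the tags, keeps pasted review text from being interpreted as markup when the result is rendered in a template.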
Run locally with:
python app.py
The app is then available at http://localhost:5000.