This project builds a machine learning pipeline to classify book reviews as positive or negative, simulating a sentiment detection engine for book recommendation platforms.
Originally developed in 2023 and refined for public release in June 2025, this project leverages natural language processing (NLP) and logistic regression to evaluate 1,973 real book reviews. The final model achieved AUC scores up to 0.89.
How can we automatically detect reader sentiment in text reviews to improve book recommendation systems and content moderation tools?
- Applied TF-IDF vectorization with n-gram feature engineering to extract signals from review text
- Trained a logistic regression model using Scikit-learn
- Evaluated model performance using ROC-AUC and runtime analysis
- Tuned document frequency thresholds (`min_df`) to optimize the tradeoff between feature space size and classification performance
- AUC Score: Up to 0.92 on test data
- Data: 1,973 labeled book reviews (binary: positive/negative)
- Vectorizer: TF-IDF with n-gram range (1,2)
- Classifier: Logistic Regression with `max_iter=200`
- Python, Pandas, NumPy
- Scikit-learn (TF-IDF, LogisticRegression, GridSearchCV, roc_auc_score)
- Matplotlib, Seaborn
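
The approach and settings listed above (TF-IDF with an n-gram range of (1, 2), logistic regression with `max_iter=200`, and a sweep over `min_df`) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy `reviews` and `labels`, the helper name `evaluate_min_df`, the 75/25 split, and the candidate `min_df` values are all assumptions made for the example. The real project reads its data from `bookReviewsData.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical); 1 = positive review, 0 = negative review.
reviews = [
    "a wonderful book with great characters",
    "great story and wonderful writing",
    "i loved this book great pacing",
    "wonderful characters and a great plot",
    "loved the writing great book",
    "a great read i loved the story",
    "wonderful book loved every chapter",
    "great plot and wonderful ending",
    "a boring book with terrible characters",
    "terrible story and boring writing",
    "i hated this book boring pacing",
    "boring characters and a terrible plot",
    "hated the writing terrible book",
    "a terrible read i hated the story",
    "boring book hated every chapter",
    "terrible plot and boring ending",
]
labels = [1] * 8 + [0] * 8


def evaluate_min_df(texts, y, min_df, seed=0):
    """Fit TF-IDF + logistic regression for one min_df value.

    Returns (vocabulary size, test ROC-AUC).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Fit the vectorizer on training text only to avoid leaking test vocabulary.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=min_df)
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    model = LogisticRegression(max_iter=200)
    model.fit(X_train_tfidf, y_train)

    # ROC-AUC needs scores for the positive class, not hard predictions.
    positive_scores = model.predict_proba(X_test_tfidf)[:, 1]
    return len(vectorizer.vocabulary_), roc_auc_score(y_test, positive_scores)


# Sweep a couple of candidate thresholds (values chosen for illustration).
results = {m: evaluate_min_df(reviews, labels, m) for m in (1, 2)}
for min_df, (n_features, auc_score) in results.items():
    print(f"min_df={min_df}: {n_features} features, AUC={auc_score:.2f}")
```

Raising `min_df` drops terms that occur in fewer documents, so the feature space shrinks; whether AUC holds up as it shrinks is the tradeoff the tuning step examines.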
Model performance was visualized using ROC curves. 
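
A ROC curve of this kind could be produced along the following lines. The labels and scores here are made up for illustration, and the output file name `roc_curve.png` is hypothetical; in the project itself the scores would come from the fitted model's predicted probabilities on the test set.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.55]

# roc_curve sweeps the decision threshold and returns the resulting
# false positive and true positive rates.
fpr, tpr, _thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # area under the curve via the trapezoidal rule

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"Logistic regression (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve")
ax.legend(loc="lower right")
fig.savefig("roc_curve.png")
```

In a notebook, `plt.show()` can replace the `savefig` call and the `Agg` backend line can be dropped.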
- `ImplementMLProjectPlan.ipynb`: Jupyter notebook
- `bookReviewsData.csv`: Input dataset (text + sentiment labels)
- `README.md`: Project overview
- Deploy as a Streamlit app to allow live sentiment prediction
- Compare performance with transformer models like DistilBERT
- Use SHAP or LIME for interpretability and trust
Noah Dufresne
Brooklyn, NY | Fordham University
LinkedIn
This project was originally created in 2023 and publicly released in June 2025.