Project Gutenberg NLP

Sentiment analysis and topic modeling across 57,000 books from the Project Gutenberg library.

This repository is source-available, not open source. Viewing for educational and reference purposes is permitted. All other use is prohibited. See LICENSE for full terms. Violations will be pursued.

Overview

Large-scale NLP pipeline that processes the entire Project Gutenberg digital library to extract narrative structure, sentiment dynamics, and thematic fingerprints from classic literature. The system models how stories work at a structural level, tracking emotional arcs and topic shifts from exposition through resolution.

Technical Architecture

Data Pipeline

57,000 books ingested via a Project Gutenberg module for R (see 01_gutenberg_import_using_R_ces.ipynb)
Text cleaning and concatenation pipeline for consistent corpus preparation
Paragraph-level segmentation for granular narrative analysis
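
The cleaning and segmentation steps above might look roughly like the following. This is a minimal sketch, not the code in the notebooks: the function names `strip_boilerplate` and `split_paragraphs` and the `min_words` threshold are hypothetical, and it assumes the plain-text files carry the usual `*** START OF ...` and `*** END OF ...` marker lines.

```python
import re
from typing import List

# Marker lines that bracket the body text in most Project Gutenberg plain-text files.
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_boilerplate(raw: str) -> str:
    """Keep only the text between the START and END marker lines, if present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


def split_paragraphs(text: str, min_words: int = 5) -> List[str]:
    """Split on blank lines, collapse whitespace, drop very short fragments.

    min_words is an assumed threshold for filtering out chapter headings
    and similar fragments; the real pipeline may filter differently.
    """
    paragraphs = []
    for chunk in re.split(r"\n\s*\n", text):
        cleaned = " ".join(chunk.split())
        if len(cleaned.split()) >= min_words:
            paragraphs.append(cleaned)
    return paragraphs
```

Calling `split_paragraphs(strip_boilerplate(raw_text))` yields one cleaned string per paragraph, which is the unit the later topic and sentiment steps operate on.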

Topic Modeling

NMF (Non-negative Matrix Factorization) for subgenre decomposition
Feature extraction via CountVectorizer and TF-IDF
Per-book topic evolution tracking across narrative segments
Topic change distance metric for narrative "complexity" scoring
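
One way to read the topic change distance metric is as the total amount a book's topic mix moves between consecutive narrative segments. The sketch below assumes that reading. Both the Euclidean distance and the sum-to-one normalization are illustrative choices, not the definition confirmed by the notebooks, and `topic_change_distance` is a hypothetical name. Each input row stands for the NMF topic weights of one segment.

```python
import math
from typing import List, Sequence


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative topic weights to sum to 1 (all-zero rows stay zero)."""
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else [0.0 for _ in weights]


def topic_change_distance(segments: Sequence[Sequence[float]]) -> float:
    """Sum of Euclidean distances between consecutive normalized segment vectors.

    Assumed definition: a book that stays on one topic scores 0, and a book
    that keeps switching topics scores higher.
    """
    normed = [normalize(s) for s in segments]
    return sum(math.dist(a, b) for a, b in zip(normed, normed[1:]))
```

Normalizing first means a segment that is simply longer, and so has larger raw weights, does not register as a topic shift.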

Sentiment Analysis

TextBlob polarity analysis across narrative arc
Binned sentiment trajectories (Exposition -> Rising -> Climax -> Falling -> Resolution)
Cross-book sentiment pattern comparison
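
The binned trajectory can be illustrated with a short sketch. It assumes each paragraph already has a polarity score in [-1, 1], for example from TextBlob's `TextBlob(text).sentiment.polarity`, and that the five stages are equal-width slices of the book by paragraph position. The equal-width split and the name `bin_sentiment` are assumptions made for illustration; the notebooks may bin differently.

```python
from statistics import mean
from typing import Dict, List, Sequence

STAGES = ["Exposition", "Rising", "Climax", "Falling", "Resolution"]


def bin_sentiment(polarities: Sequence[float]) -> Dict[str, float]:
    """Average paragraph polarities within five equal-width position bins."""
    n = len(polarities)
    bins: List[List[float]] = [[] for _ in STAGES]
    for i, score in enumerate(polarities):
        # Map paragraph index i to a bin by its relative position in the book.
        bins[i * len(STAGES) // n].append(score)
    # An empty bin (books with fewer than five paragraphs) is reported as 0.0.
    return {stage: (mean(b) if b else 0.0) for stage, b in zip(STAGES, bins)}
```

Because every book is reduced to the same five values regardless of length, trajectories from different books can be compared directly.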

Clustering & Visualization

K-Means clustering for genre grouping
t-SNE dimensionality reduction for visual exploration
Interactive Dash/Plotly web application for result exploration

Recommendation Engine

Recommendations derived from topic similarity scores (NMF vectors)
Genre-aware clustering for "books like this" suggestions
LLM-powered recommendation generation (see recommend.py)
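
Topic-similarity recommendations can be sketched as a nearest-neighbor lookup over per-book topic vectors. The example assumes cosine similarity as the score and a plain dictionary mapping titles to NMF weights. Both are illustrative choices, the `recommend` function here is hypothetical, and it is not the implementation in recommend.py.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(
    title: str, topic_vectors: Dict[str, Sequence[float]], n: int = 5
) -> List[Tuple[str, float]]:
    """Return the n titles whose topic vectors are most similar to `title`."""
    query = topic_vectors[title]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in topic_vectors.items()
        if other != title  # never recommend the query book itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up three-topic vectors, for illustration only (not real model output).
example_vectors = {
    "Moby Dick": [0.9, 0.1, 0.0],
    "Typee": [0.8, 0.2, 0.0],
    "Emma": [0.0, 0.1, 0.9],
    "Dracula": [0.1, 0.9, 0.1],
}
top_two = recommend("Moby Dick", example_vectors, n=2)
```

Restricting the candidates to books in the same K-Means cluster before scoring would be one way to make the lookup genre-aware.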

Quick Start

# LLM-powered recommendations (requires vLLM, Ollama, or OpenAI API key)
python recommend.py "The Yellow Wallpaper"
python recommend.py "Moby Dick" --n 5

Technology Stack

R, TextBlob, NLTK, CountVectorizer, TF-IDF, NMF, K-Means, t-SNE, Flask, Dash, Plotly, Python

Related Work

This research directly informs the literary analysis and recommendation engine in Readify, an AI-powered interactive reading platform for schools and institutions.

Legal Notice

This software is provided under a Source Available License. It is not open source. You may view this code for educational and reference purposes only. Commercial use, redistribution, modification, derivative works, and incorporation into other products or services are strictly prohibited without prior written authorization.

Unauthorized use will result in legal action, including DMCA takedowns, injunctive relief, and claims for damages. See LICENSE for complete terms.

For licensing inquiries or institutional partnerships: clarence@ireadifybooks.com

Files (5 commits)

.gitignore
01_gutenberg_import_using_R_ces.ipynb
02_gutenberg_cleaning_concat_ces.ipynb
03_gutenberg_topic_modeling_visualization_ces.ipynb
04_NMF_Individual_books_ces.ipynb
LICENSE
README.md
application_ces.py
globbin.ipynb
lk_nlp.py
project_gutenberg_ces.pptx
reccomender_ces.ipynb
recommend.py
sentiment_analysis_ces.ipynb
