A machine learning-based book recommendation system that provides personalized audiobook suggestions using content-based filtering, clustering, and genre-based approaches.
🔗 Live Demo on Streamlit Cloud
This project analyzes 5,484 audiobooks from Audible and implements multiple recommendation algorithms to help users discover books based on their preferences.
Key Features:
- Content-based recommendations using TF-IDF and cosine similarity
- Cluster-based recommendations using K-Means (15 clusters)
- Genre-based filtering with customizable ratings
- Interactive web interface built with Streamlit
- Source: Audible audiobook catalog
- Total Books: 5,484 (after cleaning)
- Features: Book name, author, rating, reviews, price, description, genre
- Original Files:
  - Audible_Catlog.csv (6,368 records)
  - Audible_Catlog_Advanced_Features.csv (4,464 records)
- Python 3.11.9
- pandas, numpy - Data processing
- scikit-learn - Machine learning
- NLTK - Text processing
- Streamlit - Web interface
- matplotlib, seaborn - Visualizations
- Clone the repository:
git clone https://github.com/priyanka7411/Audible_Book_Recommendation.git
cd Audible_Book_Recommendation
- Create and activate virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Download NLTK data:
python3 -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"
Run the Streamlit app:
streamlit run streamlit_app.py
The application will open in your browser at http://localhost:8501
├── data/ # Dataset files
├── notebooks/ # Jupyter notebooks (EDA, cleaning, modeling)
├── models/ # Saved ML models (.pkl files)
├── outputs/ # Visualizations and charts
├── streamlit_app.py # Web application
├── requirements.txt # Python dependencies
└── README.md
- Removed 1,052 books with invalid ratings
- Handled missing values in reviews, prices, and descriptions
- Removed 1,265 duplicate records
- Extracted genre information from text
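The cleaning steps above (invalid ratings, missing values, duplicates) might look roughly like the following pandas sketch. The column names, the toy rows, the 0-5 validity rule, and the median fill are all assumptions for illustration; the project notebooks may apply different rules.

```python
import pandas as pd

# Toy rows standing in for the raw catalog; column names are assumed.
raw = pd.DataFrame({
    "Book Name": ["A", "A", "B", "C", "D"],
    "Author": ["X", "X", "Y", "Z", "W"],
    "Rating": [4.5, 4.5, -1.0, 4.0, 3.8],
    "Price": [500.0, 500.0, 300.0, None, 650.0],
    "Description": ["d1", "d1", "d2", None, "d4"],
})


def clean_catalog(df: pd.DataFrame) -> pd.DataFrame:
    """Drop invalid ratings and duplicates, then fill missing values."""
    # Keep only ratings on a 0-5 scale (assumed definition of "invalid").
    out = df[df["Rating"].between(0, 5)]
    # Drop repeated title/author pairs, keeping the first occurrence.
    out = out.drop_duplicates(subset=["Book Name", "Author"]).copy()
    # Fill gaps: median price, empty string for missing descriptions.
    out["Price"] = out["Price"].fillna(out["Price"].median())
    out["Description"] = out["Description"].fillna("")
    return out.reset_index(drop=True)


cleaned = clean_catalog(raw)
print(cleaned)
```

Deduplicating on title plus author rather than on the full row is one possible choice; it treats re-listings of the same book at different prices as duplicates.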
- Analyzed rating distributions, price trends, and genre popularity
- Created 7 visualizations including word clouds and correlation plots
- Identified key patterns: average rating 4.46/5, weak price-rating correlation
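Summary figures such as the average rating and the price-rating correlation can be computed in a few lines of pandas. This is a sketch on toy data, assuming columns named `Rating` and `Price`; it does not reproduce the project's actual numbers.

```python
import pandas as pd

# Toy data; the column names "Rating" and "Price" are assumptions.
books = pd.DataFrame({
    "Rating": [4.8, 4.5, 4.2, 3.9, 4.7],
    "Price": [900.0, 450.0, 1200.0, 300.0, 750.0],
})

# Mean rating across all books.
avg_rating = books["Rating"].mean()
# Percentage of books rated 4.5 or higher.
share_high = (books["Rating"] >= 4.5).mean() * 100
# Pearson correlation between rating and price (pandas default).
price_rating_corr = books["Rating"].corr(books["Price"])

print(f"Average rating: {avg_rating:.2f}")
print(f"Rated 4.5+: {share_high:.1f}%")
print(f"Price-rating correlation: {price_rating_corr:.2f}")
```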
Content-Based Filtering:
- TF-IDF vectorization (3,000 features)
- Cosine similarity between book features
- Combines: title, author, genre, description
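A minimal sketch of this approach with scikit-learn is shown below. The toy catalog, the lowercase column names, the `combined` column, and the `recommend` helper are illustrative assumptions rather than the project's actual code; only the 3,000-feature cap and the four combined fields come from the description above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalog; column names are assumed for illustration.
books = pd.DataFrame({
    "title": ["Dune", "Foundation", "Atomic Habits", "Deep Work"],
    "author": ["Frank Herbert", "Isaac Asimov", "James Clear", "Cal Newport"],
    "genre": ["Science Fiction", "Science Fiction", "Self Help", "Self Help"],
    "description": [
        "A desert planet, spice, and an interstellar empire at war",
        "A galactic empire falls and scientists plan its rebirth",
        "Small daily habits that compound into lasting change",
        "Focused habits for distraction-free productivity",
    ],
})

# Combine title, author, genre, and description into one text per book.
books["combined"] = books[["title", "author", "genre", "description"]].agg(
    " ".join, axis=1
)

# TF-IDF vectors capped at 3,000 features, then pairwise cosine similarity.
vectorizer = TfidfVectorizer(stop_words="english", max_features=3000)
tfidf = vectorizer.fit_transform(books["combined"])
similarity = cosine_similarity(tfidf)


def recommend(title: str, top_n: int = 2) -> list[str]:
    """Return the top_n titles most similar to the given title."""
    idx = books.index[books["title"] == title][0]
    scores = similarity[idx].copy()
    scores[idx] = -1.0  # exclude the query book itself
    best = scores.argsort()[::-1][:top_n]
    return books.loc[best, "title"].tolist()


print(recommend("Dune", top_n=1))
```

On this toy data, "Dune" shares the terms "science", "fiction", and "empire" only with "Foundation", so that title ranks first.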
Cluster-Based:
- K-Means clustering (15 clusters)
- SVD dimensionality reduction (3000→50 features)
- Groups books by thematic similarity
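The pipeline described above (TF-IDF, then SVD, then K-Means) can be sketched as follows. The toy texts and the `recommend_from_cluster` helper are assumptions for illustration, and the sketch uses 2 components and 2 clusters because the toy data is tiny; the project itself reports 50 components and 15 clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy catalog; "text" stands in for the combined book features (assumed).
books = pd.DataFrame({
    "title": ["Dune", "Foundation", "Hyperion",
              "Atomic Habits", "Deep Work", "Essentialism"],
    "text": [
        "science fiction desert planet galactic empire",
        "science fiction galactic empire scientists",
        "science fiction pilgrims galactic war",
        "self help daily habits productivity",
        "self help focused work productivity",
        "self help priorities habits focus",
    ],
})

tfidf = TfidfVectorizer(stop_words="english", max_features=3000).fit_transform(
    books["text"]
)

# Reduce the sparse TF-IDF matrix to a few dense components.
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf)

# Cluster the reduced vectors; each book gets one cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
books["cluster"] = kmeans.fit_predict(reduced)


def recommend_from_cluster(title: str, top_n: int = 5) -> list[str]:
    """Return other titles that fall in the same cluster as the given title."""
    cluster_id = books.loc[books["title"] == title, "cluster"].iloc[0]
    same = books[(books["cluster"] == cluster_id) & (books["title"] != title)]
    return same["title"].head(top_n).tolist()


print(books[["title", "cluster"]])
print(recommend_from_cluster("Dune"))
```

Running K-Means on the SVD output rather than on the raw TF-IDF matrix keeps the distance computations cheap and tends to reduce noise from rare terms.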
Genre-Based:
- Filter books by genre
- Sort by rating and popularity
- Adjustable minimum rating threshold
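Genre-based filtering comes down to a mask and a sort. The sketch below assumes columns named `genre`, `rating`, and `reviews`, and treats review count as the measure of popularity; both are assumptions, as is the `recommend_by_genre` helper name.

```python
import pandas as pd

# Toy catalog; column names are assumed for illustration.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Fantasy", "Fantasy", "Fantasy", "History"],
    "rating": [4.8, 4.8, 4.1, 4.9],
    "reviews": [120, 950, 3000, 40],
})


def recommend_by_genre(
    df: pd.DataFrame, genre: str, min_rating: float = 4.0, top_n: int = 10
) -> pd.DataFrame:
    """Filter by genre and minimum rating, then rank by rating and reviews."""
    mask = (df["genre"].str.lower() == genre.lower()) & (df["rating"] >= min_rating)
    # Highest rating first; ties are broken by review count (popularity).
    ranked = df[mask].sort_values(["rating", "reviews"], ascending=False)
    return ranked.head(top_n)


top = recommend_by_genre(books, "fantasy", min_rating=4.5)
print(top["title"].tolist())
```

Here "C" is dropped by the 4.5 threshold, and "B" ranks ahead of "A" because the two share a rating and "B" has more reviews.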
- Successfully processed and cleaned 5,484 books
- Built 3 different recommendation approaches
- Created interactive web application
- Generated 7 analytical visualizations
- Saved 6 trained models for deployment
- 60.7% of books rated 4.5 or higher
- Most popular genre: Audible Audiobooks & Originals (530 books)
- Price has minimal correlation with rating (0.06)
- Average book price: ₹952
- Add collaborative filtering based on user ratings
- Deploy to cloud platform (AWS/Streamlit Cloud)
- Implement user accounts and reading history
- Add more advanced NLP techniques (BERT, transformers)
- Create REST API for integration
Note: The cosine_similarity.pkl file (229 MB) is excluded from GitHub because of file size limits. The app regenerates it automatically from the TF-IDF matrix the first time you run it, which may take 10-20 seconds; after that, the result is cached.
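The 229 MB figure is consistent with a dense 5,484 × 5,484 matrix of 64-bit floats (5,484² × 8 bytes ≈ 229 MiB). A compute-once-then-reuse pattern for such a matrix could be sketched as below. The `load_or_build_similarity` helper, the pickle-on-disk cache, and the small dense stand-in for the TF-IDF matrix are assumptions for illustration; the app may instead rely on Streamlit's own caching.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_or_build_similarity(tfidf_matrix: np.ndarray, cache_path: Path) -> np.ndarray:
    """Load the cached similarity matrix, or compute and save it if missing."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    # Cosine similarity: normalize each row, then take pairwise dot products.
    norms = np.linalg.norm(tfidf_matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty rows
    unit = tfidf_matrix / norms
    similarity = unit @ unit.T
    with cache_path.open("wb") as fh:
        pickle.dump(similarity, fh)
    return similarity


# Small dense stand-in for the TF-IDF matrix (hypothetical values).
toy = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 0.0, 4.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cosine_similarity.pkl"
    first = load_or_build_similarity(toy, path)   # computed and written to disk
    second = load_or_build_similarity(toy, path)  # read back from the cache
```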
Priyanka Malavade
- GitHub: @priyanka7411
- LinkedIn: Priyanka Malavade
- Dataset: Audible audiobook catalog
- Project guidance: Guvi Geek Networks
- Libraries: scikit-learn, pandas, Streamlit, NLTK
This project is for educational purposes.