Audible Book Recommendation System

A machine learning-based book recommendation system that provides personalized audiobook suggestions using content-based filtering, clustering, and genre-based approaches.

🔗 Live Demo on Streamlit Cloud

Project Overview

This project analyzes 5,484 audiobooks from Audible and implements multiple recommendation algorithms to help users discover books based on their preferences.

Key Features:

Content-based recommendations using TF-IDF and cosine similarity
Cluster-based recommendations using K-Means (15 clusters)
Genre-based filtering with customizable ratings
Interactive web interface built with Streamlit

Dataset

Source: Audible audiobook catalog
Total Books: 5,484 (after cleaning)
Features: Book name, author, rating, reviews, price, description, genre
Original Files:
- Audible_Catlog.csv (6,368 records)
- Audible_Catlog_Advanced_Features.csv (4,464 records)

Technologies Used

Python 3.11.9
pandas, numpy - Data processing
scikit-learn - Machine learning
NLTK - Text processing
Streamlit - Web interface
matplotlib, seaborn - Visualizations

Installation

Clone the repository:

git clone https://github.com/priyanka7411/Audible_Book_Recommendation.git
cd Audible_Book_Recommendation

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Download NLTK data:

python3 -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"

Usage

Run the Streamlit app:

streamlit run streamlit_app.py

The application opens in your browser at http://localhost:8501.

Project Structure

├── data/                   # Dataset files
├── notebooks/              # Jupyter notebooks (EDA, cleaning, modeling)
├── models/                 # Saved ML models (.pkl files)
├── outputs/                # Visualizations and charts
├── streamlit_app.py        # Web application
├── requirements.txt        # Python dependencies
└── README.md

Methodology

1. Data Cleaning

Removed 1,052 books with invalid ratings
Handled missing values in reviews, prices, and descriptions
Removed 1,265 duplicate records
Extracted genre information from text
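
The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook code: the column names (book_name, author, rating, reviews, price, description), the toy rows, and the reading of "invalid rating" as missing or outside 1-5 are all assumptions.

```python
import pandas as pd

# Toy stand-in for the raw catalog; column names and values are hypothetical.
raw = pd.DataFrame({
    "book_name": ["Book A", "Book A", "Book B", "Book C", "Book D"],
    "author": ["x", "x", "y", "z", "w"],
    "rating": [4.5, 4.5, -1.0, 3.9, None],
    "reviews": [10, 10, 5, None, 2],
    "price": [500.0, 500.0, 300.0, None, 250.0],
    "description": ["d1", "d1", None, "d3", "d4"],
})


def clean_catalog(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Coerce ratings to numbers; anything unparseable becomes NaN.
    out["rating"] = pd.to_numeric(out["rating"], errors="coerce")
    # Assumption: a rating is valid only if it lies in [1, 5]; NaN fails the test.
    out = out[out["rating"].between(1, 5)]
    # Treat the same title by the same author as a duplicate record.
    out = out.drop_duplicates(subset=["book_name", "author"]).copy()
    # Fill remaining gaps with neutral defaults.
    out["reviews"] = out["reviews"].fillna(0).astype(int)
    out["price"] = out["price"].fillna(out["price"].median())
    out["description"] = out["description"].fillna("")
    return out.reset_index(drop=True)


cleaned = clean_catalog(raw)
print(cleaned)
```

Filling missing prices with the median is one possible choice; the notebooks may handle them differently.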

2. Exploratory Data Analysis

Analyzed rating distributions, price trends, and genre popularity
Created 7 visualizations including word clouds and correlation plots
Identified key patterns: average rating 4.46/5, weak price-rating correlation
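
Summary figures like these can be computed in a few lines of pandas. The sketch below uses made-up rows and assumed column names (price, rating), so its numbers will not match the project's.

```python
import pandas as pd

# Hypothetical sample; the real analysis runs on the cleaned catalog.
books = pd.DataFrame({
    "price": [899, 1200, 650, 952, 400],
    "rating": [4.6, 4.2, 4.8, 4.5, 3.9],
})

# Mean rating across all books.
avg_rating = books["rating"].mean()

# Pearson correlation between price and rating (pandas default method).
price_rating_corr = books["price"].corr(books["rating"])

# Share of books rated 4.5 or higher.
share_high = (books["rating"] >= 4.5).mean()

print(f"average rating: {avg_rating:.2f}")
print(f"price-rating correlation: {price_rating_corr:.2f}")
print(f"share rated >= 4.5: {share_high:.1%}")
```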

3. Recommendation Models

Content-Based Filtering:

TF-IDF vectorization (3,000 features)
Cosine similarity between book features
Combines: title, author, genre, description
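
A minimal sketch of this approach with scikit-learn is shown below. The toy books, the column names, and the recommend helper are hypothetical; only the TF-IDF size (3,000 features) and the combined fields come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-catalog standing in for the 5,484 cleaned books.
books = pd.DataFrame({
    "book_name": ["Space Wars", "Galaxy Quest Stories",
                  "Baking Bread", "Sourdough Basics"],
    "author": ["Ian Cole", "Mia Hart", "Ann Lee", "Tom Ray"],
    "genre": ["Science Fiction", "Science Fiction", "Cooking", "Cooking"],
    "description": [
        "A fleet of starships battles across the galaxy",
        "Short stories about explorers crossing the galaxy",
        "A guide to baking bread and pastry at home",
        "Learn sourdough bread baking from starter to loaf",
    ],
})

# Combine title, author, genre, and description into one text field per book.
fields = ["book_name", "author", "genre", "description"]
books["combined"] = books[fields].fillna("").agg(" ".join, axis=1)

# TF-IDF vectors capped at 3,000 features, then pairwise cosine similarity.
vectorizer = TfidfVectorizer(max_features=3000, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(books["combined"])
similarity = cosine_similarity(tfidf_matrix)


def recommend(title: str, top_n: int = 2) -> list:
    """Return the top_n titles most similar to the given title."""
    idx = books.index[books["book_name"] == title][0]
    ranked = np.argsort(similarity[idx])[::-1]        # most similar first
    ranked = [i for i in ranked if i != idx][:top_n]  # skip the book itself
    return books.loc[ranked, "book_name"].tolist()


print(recommend("Baking Bread", top_n=1))
```

The full similarity matrix grows with the square of the catalog size, which is why the saved file for 5,484 books is large.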

Cluster-Based:

K-Means clustering (15 clusters)
SVD dimensionality reduction (3,000 → 50 features)
Groups books by thematic similarity
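
The pipeline can be sketched as TF-IDF, then TruncatedSVD, then K-Means. The example below is an illustration under assumptions: the texts are invented, and the sizes are shrunk to 2 components and 2 clusters so it runs on six documents, whereas the project reduces 3,000 features to 50 and uses 15 clusters. The same_cluster helper is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical book texts: three about space, three about cooking.
texts = [
    "starship fleet battles across the galaxy",
    "explorers travel the galaxy in a starship",
    "a galaxy war between starship fleets",
    "baking bread and pastry at home",
    "sourdough bread baking for beginners",
    "home baking recipes for bread and cake",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Reduce the sparse TF-IDF matrix to a small dense representation.
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf)

# Cluster the reduced vectors; each label is a thematic group.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(reduced)


def same_cluster(idx: int) -> list:
    """Indices of the other books that share a cluster with book idx."""
    return [i for i, lab in enumerate(labels) if lab == labels[idx] and i != idx]


print(labels)
print(same_cluster(0))
```

Recommending from within a cluster avoids comparing a book against the whole catalog at query time.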

Genre-Based:

Filter books by genre
Sort by rating and popularity
Adjustable minimum rating threshold
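
A genre filter of this kind fits in one small pandas function. The sketch below assumes column names (genre, rating, reviews) and treats the review count as the popularity signal; the function name and toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; ratings and review counts are made up.
books = pd.DataFrame({
    "book_name": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["Science Fiction", "Science Fiction", "Science Fiction", "Cooking"],
    "rating": [4.7, 4.7, 3.8, 4.8],
    "reviews": [900, 1200, 400, 700],
})


def recommend_by_genre(df: pd.DataFrame, genre: str,
                       min_rating: float = 4.0, top_n: int = 5) -> pd.DataFrame:
    """Filter by genre and minimum rating, then sort by rating and reviews."""
    mask = (df["genre"].str.lower() == genre.lower()) & (df["rating"] >= min_rating)
    ranked = df[mask].sort_values(["rating", "reviews"], ascending=False)
    return ranked.head(top_n).reset_index(drop=True)


top = recommend_by_genre(books, "science fiction", min_rating=4.0)
print(top)
```

Sorting on rating first and reviews second means the review count only breaks ties between equally rated books.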

Results

Successfully processed and cleaned 5,484 books
Built 3 different recommendation approaches
Created interactive web application
Generated 7 analytical visualizations
Saved 6 trained models for deployment

Key Insights

60.7% of books rated 4.5 or higher
Most popular genre: Audible Audiobooks & Originals (530 books)
Price has minimal correlation with rating (0.06)
Average book price: ₹952

Future Improvements

Add collaborative filtering based on user ratings
Deploy to cloud platform (AWS/Streamlit Cloud)
Implement user accounts and reading history
Add more advanced NLP techniques (BERT, transformers)
Create REST API for integration

Important Note

The cosine_similarity.pkl file (229 MB) is excluded from the repository because it exceeds GitHub's 100 MB file size limit. The app regenerates the matrix from the TF-IDF matrix the first time it runs, which may take 10-20 seconds. After that, the result is cached.
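
One way to implement a "compute on first run, then reuse" step is sketched below. It assumes a file-based pickle cache and a small dense NumPy feature matrix; the app itself may rely on Streamlit's caching instead, and the function names and cache path here are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np


def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a dense matrix."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing all-zero rows by zero
    unit = features / norms
    return unit @ unit.T


def load_or_build_similarity(features: np.ndarray, cache_path: str) -> np.ndarray:
    """Load the matrix from cache_path if present; otherwise build and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    similarity = cosine_similarity_matrix(features)
    with open(cache_path, "wb") as fh:
        pickle.dump(similarity, fh)
    return similarity


# Tiny demo: the first call computes and saves, the second call reads the file.
demo_features = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 0.0, 4.0]])
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "cosine_similarity.pkl")
    first = load_or_build_similarity(demo_features, demo_path)
    second = load_or_build_similarity(demo_features, demo_path)
```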


Author

Priyanka Malavade

GitHub: @priyanka7411
LinkedIn: Priyanka Malavade

Acknowledgments

Dataset: Audible audiobook catalog
Project guidance: Guvi Geek Networks
Libraries: scikit-learn, pandas, Streamlit, NLTK

License

This project is for educational purposes.
