priyanka7411/Audible_Book_Recommendation

Audible Book Recommendation System

A machine learning-based book recommendation system that provides personalized audiobook suggestions using content-based filtering, clustering, and genre-based approaches.

🔗 Live Demo on Streamlit Cloud

Project Overview

This project analyzes 5,484 audiobooks from Audible and implements multiple recommendation algorithms to help users discover books based on their preferences.

Key Features:

  • Content-based recommendations using TF-IDF and cosine similarity
  • Cluster-based recommendations using K-Means (15 clusters)
  • Genre-based filtering with customizable ratings
  • Interactive web interface built with Streamlit

Dataset

  • Source: Audible audiobook catalog
  • Total Books: 5,484 (after cleaning)
  • Features: Book name, author, rating, reviews, price, description, genre
  • Original Files:
    • Audible_Catlog.csv (6,368 records)
    • Audible_Catlog_Advanced_Features.csv (4,464 records)

Technologies Used

  • Python 3.11.9
  • pandas, numpy - Data processing
  • scikit-learn - Machine learning
  • NLTK - Text processing
  • Streamlit - Web interface
  • matplotlib, seaborn - Visualizations

Installation

  1. Clone the repository:
git clone https://github.com/priyanka7411/Audible_Book_Recommendation.git
cd Audible_Book_Recommendation
  2. Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Download NLTK data:
python3 -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"

Usage

Run the Streamlit app:

streamlit run streamlit_app.py

The application will open in your browser at http://localhost:8501

Project Structure

├── data/                   # Dataset files
├── notebooks/              # Jupyter notebooks (EDA, cleaning, modeling)
├── models/                 # Saved ML models (.pkl files)
├── outputs/                # Visualizations and charts
├── streamlit_app.py        # Web application
├── requirements.txt        # Python dependencies
└── README.md

Methodology

1. Data Cleaning

  • Removed 1,052 books with invalid ratings
  • Handled missing values in reviews, prices, and descriptions
  • Removed 1,265 duplicate records
  • Extracted genre information from text
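The cleaning steps above can be sketched with pandas. This is a minimal illustration on toy data, not the notebook code: the column names, the definition of an "invalid" rating (missing or outside 0-5), and the fill strategies are all assumptions.

```python
import pandas as pd

# Toy stand-in for the raw catalog; column names are assumed, not taken from the CSVs.
raw = pd.DataFrame({
    "book_name": ["A", "A", "B", "C", "D"],
    "author": ["X", "X", "Y", "Z", "W"],
    "rating": [4.5, 4.5, -1.0, 4.0, None],
    "reviews": [10, 10, 5, None, 3],
    "price": [500.0, 500.0, 300.0, None, 250.0],
})


def clean_catalog(df):
    """Drop invalid ratings and duplicates, then fill missing numeric fields."""
    out = df.copy()
    # Assumption: a rating is valid only if it lies in [0, 5]; NaN fails the check.
    out = out[out["rating"].between(0, 5)]
    # Assumption: a duplicate is a repeated (book_name, author) pair.
    out = out.drop_duplicates(subset=["book_name", "author"])
    # Assumption: missing review counts become 0, missing prices the median price.
    out["reviews"] = out["reviews"].fillna(0)
    out["price"] = out["price"].fillna(out["price"].median())
    return out.reset_index(drop=True)


cleaned = clean_catalog(raw)
print(cleaned)
```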

2. Exploratory Data Analysis

  • Analyzed rating distributions, price trends, and genre popularity
  • Created 7 visualizations including word clouds and correlation plots
  • Identified key patterns: average rating 4.46/5, weak price-rating correlation
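Summary figures like these can be computed in a few lines of pandas. The sketch below uses made-up values and assumed column names (`rating`, `price`), so its numbers are unrelated to the project's results.

```python
import pandas as pd

# Made-up sample; the real analysis runs on the cleaned catalog.
sample = pd.DataFrame({
    "rating": [4.0, 4.5, 5.0, 4.5],
    "price": [100.0, 200.0, 300.0, 400.0],
})

average_rating = sample["rating"].mean()
# Fraction of books rated 4.5 or higher
share_high = (sample["rating"] >= 4.5).mean()
# Pearson correlation between price and rating
price_rating_corr = sample["price"].corr(sample["rating"])

print(f"average rating: {average_rating:.2f}")
print(f"share rated >= 4.5: {share_high:.1%}")
print(f"price-rating correlation: {price_rating_corr:.2f}")
```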

3. Recommendation Models

Content-Based Filtering:

  • TF-IDF vectorization (3,000 features)
  • Cosine similarity between book features
  • Combines: title, author, genre, description
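A minimal sketch of this approach with scikit-learn, on four invented books. The column names, the `recommend` helper, and the use of the built-in English stop-word list are assumptions for illustration; the project's own preprocessing (for example with NLTK) may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy catalog; column names are assumed.
books = pd.DataFrame({
    "book_name": ["Star Voyage", "Galaxy Run", "Bread Basics", "Garden Days"],
    "author": ["Ann Lee", "Bo Kim", "Cy Ray", "Di Poe"],
    "genre": ["Science Fiction", "Science Fiction", "Cooking", "Gardening"],
    "description": [
        "A crew travels through space to a distant planet",
        "Pilots race through space between distant planets",
        "Simple recipes for baking bread at home",
        "Growing vegetables in a small backyard",
    ],
})

# Combine title, author, genre, and description into one text field per book.
books["combined"] = (
    books["book_name"] + " " + books["author"] + " "
    + books["genre"] + " " + books["description"]
)

# Cap the vocabulary at 3,000 terms, as described above.
vectorizer = TfidfVectorizer(max_features=3000, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(books["combined"])

# Pairwise cosine similarity between all books (n_books x n_books).
similarity = cosine_similarity(tfidf_matrix)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    idx = books.index[books["book_name"] == title][0]
    ranked = similarity[idx].argsort()[::-1]  # most similar first
    ranked = [i for i in ranked if i != idx][:top_n]  # skip the book itself
    return books.loc[ranked, "book_name"].tolist()


print(recommend("Star Voyage", top_n=1))
```

On the full catalog the similarity matrix has one row and one column per book, which is why it is large enough to need caching.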

Cluster-Based:

  • K-Means clustering (15 clusters)
  • SVD dimensionality reduction (3000→50 features)
  • Groups books by thematic similarity
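The pipeline above (TF-IDF, then SVD, then K-Means) can be sketched as follows. The toy corpus is far too small for the project's real settings, so this example uses 2 SVD components and 2 clusters where the project uses 50 and 15. `TruncatedSVD` is assumed as the SVD implementation, and `same_cluster` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: three space-themed and three cooking-themed books.
titles = ["Star Voyage", "Galaxy Run", "Moon Base",
          "Bread Basics", "Pasta Nights", "Soup Season"]
texts = [
    "space crew travels to a distant planet",
    "pilots race through space between planets",
    "engineers build a base in space on the moon",
    "simple recipes for baking bread in your kitchen",
    "easy pasta recipes for a busy kitchen",
    "warm soup recipes from a family kitchen",
]

tfidf_matrix = TfidfVectorizer(max_features=3000, stop_words="english").fit_transform(texts)

# Reduce the sparse TF-IDF vectors to a few dense components
# (the project reports 3000 -> 50; 2 is used here only because the toy data is tiny).
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf_matrix)

# Cluster the reduced vectors (the project reports 15 clusters).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(reduced)


def same_cluster(title):
    """Return the other titles that share a cluster with `title` (hypothetical helper)."""
    idx = titles.index(title)
    return [t for i, t in enumerate(titles) if labels[i] == labels[idx] and i != idx]


print(dict(zip(titles, labels)))
print(same_cluster("Star Voyage"))
```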

Genre-Based:

  • Filter books by genre
  • Sort by rating and popularity
  • Adjustable minimum rating threshold
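A minimal sketch of the genre filter with pandas. It assumes "popularity" means review count and uses assumed column names; `recommend_by_genre` is a hypothetical helper, not the app's function.

```python
import pandas as pd

# Invented toy catalog; column names are assumed.
books = pd.DataFrame({
    "book_name": ["A", "B", "C", "D"],
    "genre": ["Fantasy", "Fantasy", "Fantasy", "History"],
    "rating": [4.8, 4.8, 3.9, 4.9],
    "reviews": [120, 900, 50, 300],
})


def recommend_by_genre(df, genre, min_rating=4.0, top_n=5):
    """Filter by genre and minimum rating, then rank by rating and review count."""
    mask = (df["genre"] == genre) & (df["rating"] >= min_rating)
    # Ties in rating are broken by review count (assumed popularity measure).
    ranked = df[mask].sort_values(["rating", "reviews"], ascending=False)
    return ranked.head(top_n)["book_name"].tolist()


print(recommend_by_genre(books, "Fantasy"))  # ['B', 'A']
```

Book C is dropped by the default threshold of 4.0, and B ranks ahead of A because both have the same rating but B has more reviews.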

Results

  • Successfully processed and cleaned 5,484 books
  • Built 3 different recommendation approaches
  • Created interactive web application
  • Generated 7 analytical visualizations
  • Saved 6 trained models for deployment

Key Insights

  • 60.7% of books rated 4.5 or higher
  • Most popular genre: Audible Audiobooks & Originals (530 books)
  • Price has minimal correlation with rating (0.06)
  • Average book price: ₹952

Future Improvements

  • Add collaborative filtering based on user ratings
  • Deploy to an additional cloud platform such as AWS (a Streamlit Cloud demo is already live)
  • Implement user accounts and reading history
  • Add more advanced NLP techniques (BERT, transformers)
  • Create REST API for integration

Important Note

The cosine_similarity.pkl file (229 MB) is excluded from the repository because it exceeds GitHub's 100 MB per-file limit. The app regenerates it from the TF-IDF matrix the first time it runs, which takes roughly 10-20 seconds; after that the result is cached.
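The file size is consistent with a dense 5,484 × 5,484 matrix of 64-bit floats: 5,484² × 8 bytes ≈ 240.6 million bytes, or about 229 MiB. The load-or-rebuild behaviour can be sketched as below. How the app actually caches the matrix is not stated here, so the pickle-file cache and the `load_or_build_similarity` helper are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def load_or_build_similarity(tfidf_matrix, cache_path):
    """Load the similarity matrix from disk, or compute and save it (hypothetical helper)."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    # Cache miss: compute all pairwise similarities and store them for next time.
    similarity = cosine_similarity(tfidf_matrix)
    with open(cache_path, "wb") as fh:
        pickle.dump(similarity, fh)
    return similarity


# Tiny demo matrix standing in for the real TF-IDF matrix.
demo_matrix = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])
cache_file = os.path.join(tempfile.mkdtemp(), "cosine_similarity.pkl")

first = load_or_build_similarity(demo_matrix, cache_file)   # computed and saved
second = load_or_build_similarity(demo_matrix, cache_file)  # loaded from disk
print(first.shape, os.path.getsize(cache_file))
```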

Author

Priyanka Malavade

Acknowledgments

  • Dataset: Audible audiobook catalog
  • Project guidance: Guvi Geek Networks
  • Libraries: scikit-learn, pandas, Streamlit, NLTK

License

This project is for educational purposes.
