# E-commerce Product Categorization

## Goal
This project implements an automated product categorization system for e-commerce platforms using Natural Language Processing (NLP) and Machine Learning (ML) techniques. The system analyzes product titles, descriptions, and metadata to automatically assign products to the most relevant categories. The solution scales to large datasets and improves product discoverability and categorization consistency.

## Introduction
Product categorization is the task of classifying products into one or more categories from a given taxonomy. It helps customers navigate an e-commerce store with ease: organizing products into categories, tags, and attributes builds a hierarchy of similar products that gets customers to the exact item they are looking for more quickly.

## Dataset
The dataset used in this project is sourced from [Kaggle](https://www.kaggle.com/datasets/sumedhdataaspirant/e-commerce-text-dataset). It consists of over 50,000 records across four categories - "Electronics", "Household", "Books", and "Clothing & Accessories" - which together cover roughly 80% of a typical e-commerce catalog.

## Methodology
The basic NLP pipeline for categorizing the e-commerce dataset consists of the following steps:

**1. Importing Libraries**

 - Libraries such as NumPy, Pandas, and Matplotlib are imported for data manipulation and visualization, NLTK for NLP processing, and scikit-learn for model building and performance metrics, as in the sketch below.
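A minimal import block consistent with the libraries listed above; the exact set depends on the notebook, so treat this as a sketch:

```python
import numpy as np                    # numerical arrays
import pandas as pd                   # tabular data handling
import matplotlib.pyplot as plt      # plotting

import nltk                           # NLP preprocessing
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
```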

**2. Data Preprocessing**

 - **Tokenization:** Tokenization is the process of splitting text into smaller units, typically words or phrases. Here, product titles and descriptions are tokenized.
 - **Stopword Removal:** Removes common stopwords that provide no categorization value.
 - **Stemming:** Reduces words to their root form by removing suffixes like "-ing", "-ed", and "-ly", simplifying words to a base form.
 - **Lemmatization:** Similar to stemming but more sophisticated: instead of just chopping off word endings, it transforms words into their dictionary base form (lemma) based on their context.
 - **Vectorization:** Once text is preprocessed (tokenized, lowercased, and lemmatized), it is transformed into numerical vectors that can be fed into a machine learning model. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or Word2Vec convert textual data into a format a model can understand.
 - **Removing Special Characters:** Before or during vectorization, unnecessary characters like punctuation marks, symbols, and numbers (unless relevant to the product, as in technical specifications) are removed from the text. A combined sketch of these steps follows this list.
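A minimal sketch of the preprocessing pipeline, assuming NLTK for tokenization, stopword removal, and lemmatization, and scikit-learn's TfidfVectorizer for vectorization; the sample product texts are invented for illustration:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads of the NLTK resources used below
# (recent NLTK releases may also require "punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    """Lowercase, strip special characters, tokenize, drop stopwords, lemmatize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep letters, digits, whitespace
    tokens = word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)

# Invented sample descriptions, cleaned and vectorized with TF-IDF.
docs = [
    "Wireless Bluetooth headphones with noise cancellation!",
    "Hardcover fantasy novel, 450 pages.",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocess(d) for d in docs)
print(X.shape)  # (2, number_of_distinct_terms)
```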

**3. Model Overview**

**a. Multinomial Naive Bayes (MultinomialNB)**

Multinomial Naive Bayes is a popular algorithm for text classification tasks. It is based on Bayes' Theorem.
- How it works: MultinomialNB assumes that features (words) are conditionally independent given the class and calculates the probability of a product belonging to a specific category.
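A minimal sketch of fitting MultinomialNB on TF-IDF features with scikit-learn; the sample texts and labels are invented placeholders for the cleaned dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data; the project uses the preprocessed Kaggle records.
texts = ["wireless bluetooth headphones", "cotton crew neck t-shirt",
         "stainless steel cooking pot", "hardcover fantasy novel"]
labels = ["Electronics", "Clothing & Accessories", "Household", "Books"]

# TF-IDF features are non-negative, which is what MultinomialNB expects.
nb_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_clf.fit(texts, labels)
print(nb_clf.predict(["noise cancelling headphones"]))  # likely ['Electronics']
```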

**b. Support Vector Machine (SVM)**

Support Vector Machine (SVM) is a supervised learning algorithm used for classification tasks. It aims to find the best hyperplane that separates different classes in the feature space.
 - How it works: SVM tries to maximize the margin between different classes by finding the hyperplane that best separates the data points. In the case of text, the features are usually TF-IDF vectors or word embeddings.
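A sketch using scikit-learn's LinearSVC, a common linear-kernel SVM choice for high-dimensional TF-IDF features; data placeholders as in the previous sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["wireless bluetooth headphones", "cotton crew neck t-shirt",
         "stainless steel cooking pot", "hardcover fantasy novel"]
labels = ["Electronics", "Clothing & Accessories", "Household", "Books"]

# A linear kernel is standard for sparse, high-dimensional text features.
svm_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_clf.fit(texts, labels)
print(svm_clf.predict(["ceramic dinner plates"]))
```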

**c. Random Forest Classifier**

Random Forest is an ensemble learning method that builds multiple decision trees during training and outputs the mode of the classes for classification. It is a bagging method that helps improve accuracy and reduce overfitting.

 - How it works: Random Forest creates multiple decision trees from different subsets of the training data and features. The final prediction is made by aggregating the results from all the trees (majority voting for classification).
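A sketch with scikit-learn's RandomForestClassifier on TF-IDF features; placeholders as before:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = ["wireless bluetooth headphones", "cotton crew neck t-shirt",
         "stainless steel cooking pot", "hardcover fantasy novel"]
labels = ["Electronics", "Clothing & Accessories", "Household", "Books"]

# 100 trees is scikit-learn's default; random_state makes runs reproducible.
rf_clf = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
rf_clf.fit(texts, labels)
print(rf_clf.predict(["paperback science fiction"]))
```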

**d. Logistic Regression**

Logistic Regression is a linear classification algorithm that models the probability of a product belonging to a particular category.
- How it works: Logistic Regression calculates the probability of a class from the input features using a logistic (sigmoid) function, fitting a linear decision boundary (a hyperplane) between categories.
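A sketch of Logistic Regression on TF-IDF features; predict_proba exposes the per-category probabilities described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["wireless bluetooth headphones", "cotton crew neck t-shirt",
         "stainless steel cooking pot", "hardcover fantasy novel"]
labels = ["Electronics", "Clothing & Accessories", "Household", "Books"]

# max_iter raised so the solver converges on sparse TF-IDF features.
lr_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
lr_clf.fit(texts, labels)
print(lr_clf.predict_proba(["usb charging cable"]))  # one probability per category
```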

**4. Model Training**

Before training, the dataset is split into two parts (see the sketch after this list):
 - Training Set: Used to train the model (typically 70-80% of the data).
 - Test Set: Used to evaluate the model's performance on unseen data (typically 20-30%).
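A sketch of the split using scikit-learn's train_test_split; the DataFrame `df` and its "text"/"category" columns are hypothetical names standing in for the loaded dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the loaded Kaggle dataset.
df = pd.DataFrame({
    "text": ["wireless bluetooth headphones", "cotton crew neck t-shirt",
             "stainless steel cooking pot", "hardcover fantasy novel"] * 10,
    "category": ["Electronics", "Clothing & Accessories",
                 "Household", "Books"] * 10,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"],
    test_size=0.2,            # 80/20 split, within the range described above
    random_state=42,          # reproducible shuffling
    stratify=df["category"],  # keep category proportions equal in both splits
)
print(len(X_train), len(X_test))  # 32 8
```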

**5. Model Evaluation**

Once trained, the model is evaluated on the test set to ensure it generalizes well. Key evaluation metrics include (see the sketch after this list):

 - Accuracy: Percentage of correct predictions.
 - Precision: Of the products assigned to a category, the fraction that truly belong to it.
 - Recall: Of the products that truly belong to a category, the fraction that were assigned to it.
 - F1 Score: Harmonic mean of precision and recall, useful for imbalanced datasets.
 - Confusion Matrix: Provides insight into the number of true positives, true negatives, false positives, and false negatives per category.
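A sketch of computing these metrics with scikit-learn, assuming `model` is any of the fitted pipelines above and `X_test`/`y_test` come from the earlier split:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_pred = model.predict(X_test)  # `model`: any fitted pipeline from above

print(accuracy_score(y_test, y_pred))          # overall accuracy
print(classification_report(y_test, y_pred))   # per-category precision/recall/F1
print(confusion_matrix(y_test, y_pred))        # rows: true, columns: predicted
```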

## Results
The accuracy of each model on the test data is compared below. SVM performs best, closely followed by Logistic Regression and Random Forest:

| Model | Test Accuracy |
| --- | --- |
| MultinomialNB | 92% |
| SVM | 96% |
| RandomForestClassifier | 93.058% |
| LogisticRegression | 95% |