Commit c29a63d: Update README.md
1 parent f20561a commit c29a63d

1 file changed: +11 -11 lines changed

NLP projects/README.md
Basic NLP steps for categorizing the E-commerce dataset include:

**1. Importing Libraries**
- Libraries such as NumPy, Pandas, and Matplotlib are imported for data manipulation and visualization, NLTK for NLP processing, and scikit-learn for model building and performance metrics.
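The import block for this step might look like the following sketch; the module choices are assumed from the libraries named above, and the actual notebook may differ:

```python
# Illustrative imports (assumed; the actual notebook may differ).
import numpy as np                   # numerical arrays
import pandas as pd                  # tabular data manipulation
import matplotlib.pyplot as plt      # visualization

from nltk.corpus import stopwords            # NLP preprocessing
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
```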
**2. Data Preprocessing**

- **Tokenization:** Tokenization is the process of splitting text into smaller units, typically words or phrases. Here, product titles and descriptions are tokenized.

- **Stopword Removal:** Removes common stopwords that do not provide categorization value.

- **Stemming:** Involves reducing words to their root form. It removes suffixes like "-ing", "-ed", and "-ly", simplifying words to their base form.

- **Lemmatization:** Similar to stemming but more sophisticated. Instead of just chopping off word endings, it transforms words into their dictionary base form (or lemma) based on their context.

- **Vectorization:** Once text is preprocessed (tokenized, lowercased, and lemmatized), it is transformed into numerical vectors that can be fed into a machine learning model. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or Word2Vec convert textual data into a format a model can understand.

- **Removing Special Characters:** Before or during vectorization, unnecessary characters like punctuation marks, symbols, and numbers (unless relevant to the product, as in technical specifications) are removed from the text.
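The steps above can be sketched in plain Python. This is a simplified stand-in for the NLTK-based pipeline: regex cleanup, whitespace tokenization, a small illustrative stopword list, and naive suffix stripping instead of a real stemmer (a real project would use `nltk.stem.PorterStemmer` or `WordNetLemmatizer`):

```python
import re

# Small illustrative stopword set (NLTK's list is much larger).
STOPWORDS = {"the", "a", "an", "is", "for", "and", "of", "with", "to"}

def preprocess(text):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())    # remove special characters
    tokens = text.split()                               # tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    stemmed = []
    for t in tokens:
        # crude stemming: strip a few common suffixes
        for suffix in ("ing", "ed", "ly"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("A lightly used charging cable for the phone"))
```

The cleaned tokens would then be joined back into strings and handed to a vectorizer such as `TfidfVectorizer`.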

**3. Model Overview**

**a. Multinomial Naive Bayes (MultinomialNB)**
Multinomial Naive Bayes is a popular algorithm for text classification tasks. It’s based on Bayes' Theorem.
- How it works: MultinomialNB assumes that features (words) are conditionally independent given the class and calculates the probability of a product belonging to a specific category.
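A rough sketch of wiring MultinomialNB to TF-IDF features in scikit-learn; the product titles and categories below are invented for illustration, not taken from the project's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy product titles and categories, invented for illustration only.
titles = [
    "wireless bluetooth headphones",
    "usb charging cable",
    "cotton crew neck t-shirt",
    "denim slim fit jeans",
]
labels = ["Electronics", "Electronics", "Clothing", "Clothing"]

# TF-IDF vectors supply the per-word weights the multinomial model needs.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(titles, labels)

print(model.predict(["braided usb cable"])[0])  # classify an unseen title
```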

**b. Support Vector Machine (SVM)**

Support Vector Machine (SVM) is a supervised learning algorithm used for classification tasks. It aims to find the best hyperplane that separates different classes in the feature space.
- How it works: SVM tries to maximize the margin between different classes by finding the hyperplane that best separates the data points. In the case of text, the features are usually word embeddings or TF-IDF vectors.
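The SVM step can be sketched the same way, assuming scikit-learn's `SVC` on TF-IDF features (a linear kernel is the common choice for sparse text vectors); the toy data is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy product titles and categories, invented for illustration only.
titles = [
    "wireless bluetooth headphones",
    "usb charging cable",
    "cotton crew neck t-shirt",
    "denim slim fit jeans",
]
labels = ["Electronics", "Electronics", "Clothing", "Clothing"]

# A linear kernel finds a separating hyperplane in TF-IDF space.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(titles, labels)

print(model.predict(["slim fit chino trousers"])[0])
```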

**c. Random Forest Classifier**

Once trained, the model is evaluated on the test set to ensure it generalizes well.
MultinomialNB - 92%

SVM - 96%

RandomForestClassifier - 93.058%
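Accuracy figures like these typically come from `sklearn.metrics.accuracy_score` on the held-out test set. A minimal sketch with made-up labels (not the project's actual predictions):

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth vs. predicted categories for five test items.
y_true = ["Electronics", "Clothing", "Electronics", "Books", "Clothing"]
y_pred = ["Electronics", "Clothing", "Books", "Books", "Clothing"]

acc = accuracy_score(y_true, y_pred)  # fraction of exact matches
print(f"{acc:.0%}")  # 4 of 5 correct -> 80%
```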
