NLP projects/README.md
Basic NLP steps for categorizing the E-commerce dataset include:
**1. Importing Libraries**
- Libraries such as NumPy, Pandas, and Matplotlib are imported for data manipulation and visualization; NLTK for NLP processing; and scikit-learn for model building and performance metrics.
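A typical import cell for this stack might look like the following (an illustrative selection, not the project's exact list):

```python
# Common imports for an NLP classification workflow (illustrative)
import numpy as np                # numerical arrays
import pandas as pd               # loading and manipulating the dataset
import matplotlib.pyplot as plt   # plotting class distributions and metrics
import nltk                       # tokenization, stopwords, stemming/lemmatization
from sklearn.feature_extraction.text import TfidfVectorizer  # text -> vectors
from sklearn.metrics import accuracy_score, classification_report
```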
**2. Data Preprocessing**
- **Tokenization:** Splits text into smaller units, typically words or phrases; here, product titles and descriptions are tokenized.
- **Stopword Removal:** Removes common stopwords that carry no categorization value.
- **Stemming:** Reduces words to their root form by stripping suffixes such as "-ing", "-ed", and "-ly".
- **Lemmatization:** Similar to stemming but more sophisticated: rather than chopping off word endings, it maps each word to its dictionary base form (lemma) based on context.
- **Vectorization:** Once text is preprocessed (tokenized, lowercased, and lemmatized), it is transformed into numerical vectors that can be fed into a machine learning model. Techniques such as TF-IDF (Term Frequency–Inverse Document Frequency) or Word2Vec convert the text into a format a model can understand.
- **Removing Special Characters:** Before or during vectorization, unnecessary characters such as punctuation marks, symbols, and numbers (unless relevant to the product, as in technical specifications) are removed from the text.
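To make these steps concrete, here is a minimal pure-Python sketch (the toy stopword list and suffix rules are stand-ins for NLTK's stopword corpus and Porter stemmer, and the TF-IDF function is a simplified version of what scikit-learn's TfidfVectorizer does):

```python
import math
import re

STOPWORDS = {"the", "a", "an", "and", "for", "with", "of"}  # toy subset

def preprocess(text):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # strip special characters
    tokens = text.split()                             # naive tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    # crude suffix-stripping "stemmer" (real code would use NLTK's PorterStemmer)
    return [re.sub(r"(ing|ed|ly)$", "", t) if len(t) > 4 else t for t in tokens]

def tf_idf(corpus_tokens):
    # corpus_tokens: list of token lists; returns one dict of weights per document
    n = len(corpus_tokens)
    df = {}                                           # document frequency per term
    for tokens in corpus_tokens:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for tokens in corpus_tokens:
        tf = {t: tokens.count(t) / len(tokens) for t in set(tokens)}
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

For example, `preprocess("Red running shoes for men!")` yields `["red", "runn", "shoes", "men"]` — the over-aggressive "runn" shows why lemmatization is often preferred over naive stemming.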
**3. Model Overview**
**a. Multinomial Naive Bayes**
Multinomial Naive Bayes is a popular algorithm for text classification tasks. It’s based on Bayes' Theorem.
- How it works: MultinomialNB assumes that features (words) are conditionally independent given the class and calculates the probability of a product belonging to a specific category.
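As an illustration of the idea, here is a from-scratch sketch with Laplace smoothing (not the scikit-learn MultinomialNB the project uses; the toy product data is invented):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: parallel list of category names
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)      # class -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, tokens):
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        # log prior + sum of log likelihoods (Laplace-smoothed)
        score = math.log(class_counts[c] / total_docs)
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        for w in tokens:
            score += math.log((word_counts[c][w] + alpha) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["cotton", "shirt"], ["usb", "cable"], ["silk", "scarf"], ["hdmi", "cable"]]
labels = ["clothing", "electronics", "clothing", "electronics"]
model = train_nb(docs, labels)
predict_nb(model, ["usb", "hdmi"])  # → "electronics"
```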
**b. Support Vector Machine (SVM)**
Support Vector Machine (SVM) is a supervised learning algorithm used for classification tasks. It aims to find the best hyperplane that separates different classes in the feature space.
- How it works: SVM tries to maximize the margin between different classes by finding the hyperplane that best separates the data points. In the case of text, the features are usually word embeddings or TF-IDF vectors.
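The margin-maximization idea can be sketched as a tiny sub-gradient descent on the hinge loss (a Pegasos-style toy implementation over dense vectors, not the scikit-learn SVC used in the project):

```python
def train_linear_svm(X, y, lam=0.01, epochs=500):
    # Pegasos-style sub-gradient descent for a linear SVM.
    # X: list of feature vectors (e.g. TF-IDF values); y: labels in {-1, +1}.
    # Minimizes lam/2 * ||w||^2 + average hinge loss max(0, 1 - y*(w.x + b)).
    w = [0.0] * len(X[0])
    b = 0.0
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)                       # decreasing step size
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]    # shrink w (regularization)
            if margin < 1.0:                            # hinge active: pull toward point
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0 else -1
```

On a linearly separable toy set such as `X = [[2, 2], [3, 3], [-2, -2], [-3, -3]]` with `y = [1, 1, -1, -1]`, the learned hyperplane separates the two classes.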
**c. Random Forest Classifier**
Once trained, the model is evaluated on the test set to ensure it generalizes well.
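Generalization is typically measured by comparing predictions against held-out labels, for instance with a simple accuracy function (a sketch; scikit-learn's accuracy_score computes the same quantity):

```python
def accuracy(y_true, y_pred):
    # fraction of test examples whose predicted category matches the true one
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

accuracy(["toys", "books", "toys"], ["toys", "toys", "toys"])  # → 2/3
```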