
Starter code to solve real-world text data problems related to job advertisements. Includes: Word2Vec, phrase embeddings, Text Classification with Logistic Regression, simple text preprocessing, pre-trained embeddings and more.


YULINHEEE/NLP-text-preprocessing-and-classification


NLP Job Ads Classification Project

The data used in this project was provided as part of coursework for an advanced programming course. It has been anonymized, stripped of sensitive information, and is intended for practice use only.

Starter code to solve real-world text data problems related to job advertisements. This project includes:

-- Basic Text Pre-processing: Tokenization, removing stopwords, and filtering out infrequent words.

-- Feature Representations: Using Count Vector and Word Embeddings (Word2Vec) with both weighted (TF-IDF) and unweighted vectors.

-- Text Classification: Implementing machine learning models, including Logistic Regression, to classify job advertisements into categories.

The project aims to build an automated job ads classification system to help predict the categories of new job advertisements, reducing human error and improving user experience on job hunting websites.

About the Text data

There are 4 folders corresponding to 4 categories of job advertisements: 'Accounting_Finance', 'Engineering', 'Healthcare_Nursing', and 'Sales'. Together they contain 776 job advertisements.

Format of the txt files: each txt file contains 'Title', 'Webindex', 'Company', and 'Description' fields. Some files may not include the 'Company' field.
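A minimal sketch of reading these fields from one file. The exact layout inside the txt files is an assumption here (each field on its own `Field: value` line); the helper name `parse_job_ad` and the sample text are illustrative, not from the repository.

```python
import re

def parse_job_ad(text):
    """Parse one job-ad txt file into a dict of its fields.

    Assumes each field appears on its own line as 'Field: value'
    (an assumption about the file layout; adjust to the real files).
    """
    fields = {}
    for key in ("Title", "Webindex", "Company", "Description"):
        match = re.search(rf"^{key}:\s*(.+)$", text, flags=re.MULTILINE)
        if match:  # 'Company' may be absent in some files
            fields[key] = match.group(1).strip()
    return fields

# Illustrative sample text, not a real advertisement from the data
sample = ("Title: Junior Accountant\n"
          "Webindex: 12345678\n"
          "Description: Prepare monthly reports.")
ad = parse_job_ad(sample)
```

Because 'Company' is optional, downstream code should look fields up with `ad.get("Company")` rather than assume the key exists.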

1. Introduction of preprocessing.ipynb

In this notebook, I focus on pre-processing the description information only. The steps include:

-- Tokenization

-- Removing words with length less than 2

-- Removing stop words and the most/least frequent words

-- Constructing the vocabulary

Finally, all job advertisement text and information are saved in txt files.
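The steps above can be sketched as a small pipeline. The tokenizer regex and the frequency thresholds (`min_count`, `max_doc_frac`) are illustrative assumptions, not the notebook's actual choices.

```python
import re
from collections import Counter

def preprocess(descriptions, stopwords, min_count=2, max_doc_frac=0.95):
    """Sketch of the preprocessing steps: tokenize, drop tokens shorter
    than 2 characters, drop stopwords, then remove the rarest and the
    most frequent words, and construct the vocabulary."""
    # Tokenize into lowercase word tokens (the notebook's tokenizer may differ)
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in descriptions]
    # Remove words with length less than 2 and stop words
    tokenized = [[t for t in doc if len(t) >= 2 and t not in stopwords]
                 for doc in tokenized]
    # Corpus-wide term frequency and per-document frequency
    tf = Counter(t for doc in tokenized for t in doc)
    df = Counter(t for doc in tokenized for t in set(doc))
    n_docs = len(tokenized)
    # Keep words that are neither too rare nor in almost every document
    keep = {t for t in tf
            if tf[t] >= min_count and df[t] / n_docs <= max_doc_frac}
    tokenized = [[t for t in doc if t in keep] for doc in tokenized]
    vocab = sorted(keep)  # the constructed vocabulary
    return tokenized, vocab

# Tiny illustrative corpus, not the project data
docs = ["a great nursing job in a great hospital",
        "sales job with great pay",
        "engineering job"]
tokens, vocab = preprocess(docs, stopwords={"in", "with", "a"})
```

Note that "job" is filtered out here because it appears in every document, while one-off words like "hospital" fall below `min_count`.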

2. Introduction of Text_Classification.ipynb

In this notebook, I generated three different types of feature representations:

-- Count vector representation

-- TF-IDF weighted vector representation

-- Unweighted vector representation

The count vector representation was generated with the Bag-of-Words model, and the weighted and unweighted representations with a pretrained Word2Vec model. The count vectors are saved in a text file in the format word_integer_index:word_freq, with pairs separated by commas.
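The word_integer_index:word_freq format can be produced as follows. This is a pure-Python sketch (the notebook may use sklearn's CountVectorizer to the same effect), and the function name and sample tokens are illustrative.

```python
from collections import Counter

def count_vector_lines(tokenized_docs):
    """Build count vectors and format each document as comma-separated
    word_integer_index:word_freq pairs, one line per document."""
    # Assign each vocabulary word an integer index (sorted order)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    index = {w: i for i, w in enumerate(vocab)}
    lines = []
    for doc in tokenized_docs:
        freqs = Counter(doc)  # word frequencies within this document
        pairs = sorted((index[w], c) for w, c in freqs.items())
        lines.append(",".join(f"{i}:{c}" for i, c in pairs))
    return vocab, lines

# Illustrative token lists, not the project data
vocab, lines = count_vector_lines([["nurse", "senior", "nurse"],
                                   ["sales", "manager"]])
# The lines could then be written out, e.g. to count_vectors.txt
```

Only words that occur in a document appear on its line, so the file is effectively a sparse representation of the count matrix.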

After obtaining the required feature representations, I built machine learning models to classify job advertisements by category.

I used a Logistic Regression model from sklearn to compare the model's performance across the three feature vector representations.
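A minimal sketch of that comparison, assuming sklearn is available. The toy documents and labels below are stand-ins for the 776 preprocessed advertisements, and only two of the three representations (count and TF-IDF) are shown; the unweighted Word2Vec document vectors would be scored the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the job-ad descriptions and their category labels
docs = ["ledger audit tax", "audit tax accounting",
        "tax ledger finance", "accounting audit ledger",
        "nurse ward patient", "patient care nurse",
        "ward nurse care", "care patient ward"]
labels = ["Accounting_Finance"] * 4 + ["Healthcare_Nursing"] * 4

# Fit Logistic Regression on each representation and compare
# mean cross-validated accuracy
results = {}
for name, vectorizer in [("count", CountVectorizer()),
                         ("tfidf", TfidfVectorizer())]:
    X = vectorizer.fit_transform(docs)
    model = LogisticRegression(max_iter=1000)
    results[name] = cross_val_score(model, X, labels, cv=2).mean()
```

Using the same classifier and the same cross-validation splits for every representation keeps the comparison about the features rather than the model.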
