
Starter code to solve real-world text data problems related to job advertisements. Includes: Word2Vec, phrase embeddings, Text Classification with Logistic Regression, simple text preprocessing, pre-trained embeddings and more.


YULINHEEE/NLP-text-preprocessing-and-classification


NLP Job Ads Classification Project

The data used in this project was provided as part of coursework for an advanced programming course. It has been anonymized, stripped of sensitive information, and is intended for practice use only.

Starter code to solve real-world text data problems related to job advertisements. This project includes:

-- Basic Text Pre-processing: Tokenization, removing stopwords, and filtering out infrequent words.

-- Feature Representations: Using Count Vector and Word Embeddings (Word2Vec) with both weighted (TF-IDF) and unweighted vectors.

-- Text Classification: Implementing machine learning models, including Logistic Regression, to classify job advertisements into categories.

The project aims to build an automated job ads classification system to help predict the categories of new job advertisements, reducing human error and improving user experience on job hunting websites.

About the Text data

There are 4 folders corresponding to 4 categories of job advertisements: 'Accounting_Finance', 'Engineering', 'Healthcare_Nursing', and 'Sales'. Together they contain 776 job advertisements.

Format of the txt files: each txt file contains 'Title', 'Webindex', 'Company', and 'Description' fields. Some files may not include the 'Company' field.
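A minimal sketch of reading these fields from one file. The exact layout inside the txt files is an assumption here (each field on its own `Field: value` line); the helper name `parse_job_ad` and the sample text are illustrative, not from the repository.

```python
import re

def parse_job_ad(text):
    """Parse one job-ad txt file into a dict of its fields.

    Assumes each field appears on its own line as 'Field: value'
    (an assumption about the file layout; adjust to the real files).
    """
    fields = {}
    for key in ("Title", "Webindex", "Company", "Description"):
        match = re.search(rf"^{key}:\s*(.+)$", text, flags=re.MULTILINE)
        if match:  # 'Company' may be absent in some files
            fields[key] = match.group(1).strip()
    return fields

# Illustrative sample text, not a real advertisement from the data
sample = ("Title: Junior Accountant\n"
          "Webindex: 12345678\n"
          "Description: Prepare monthly reports.")
ad = parse_job_ad(sample)
```

Because 'Company' is optional, downstream code should look fields up with `ad.get("Company")` rather than assume the key exists.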

1. Introduction of preprocessing.ipynb

In this notebook, I focus on pre-processing the description information only. The steps include:

-- Tokenization

-- Removing words with length less than 2

-- Removing stop words and the most/least frequent words

-- Constructing the vocabulary

Finally, all job advertisement text and information are saved in txt files.
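The steps above can be sketched as a small pipeline. The tokenizer regex and the frequency thresholds (`min_count`, `max_doc_frac`) are illustrative assumptions, not the notebook's actual choices.

```python
import re
from collections import Counter

def preprocess(descriptions, stopwords, min_count=2, max_doc_frac=0.95):
    """Sketch of the preprocessing steps: tokenize, drop tokens shorter
    than 2 characters, drop stopwords, then remove the rarest and the
    most frequent words, and construct the vocabulary."""
    # Tokenize into lowercase word tokens (the notebook's tokenizer may differ)
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in descriptions]
    # Remove words with length less than 2 and stop words
    tokenized = [[t for t in doc if len(t) >= 2 and t not in stopwords]
                 for doc in tokenized]
    # Corpus-wide term frequency and per-document frequency
    tf = Counter(t for doc in tokenized for t in doc)
    df = Counter(t for doc in tokenized for t in set(doc))
    n_docs = len(tokenized)
    # Keep words that are neither too rare nor in almost every document
    keep = {t for t in tf
            if tf[t] >= min_count and df[t] / n_docs <= max_doc_frac}
    tokenized = [[t for t in doc if t in keep] for doc in tokenized]
    vocab = sorted(keep)  # the constructed vocabulary
    return tokenized, vocab

# Tiny illustrative corpus, not the project data
docs = ["a great nursing job in a great hospital",
        "sales job with great pay",
        "engineering job"]
tokens, vocab = preprocess(docs, stopwords={"in", "with", "a"})
```

Note that "job" is filtered out here because it appears in every document, while one-off words like "hospital" fall below `min_count`.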

2. Introduction of Text_Classification.ipynb

In this notebook, I generated three different types of feature representations:

-- Count vector representation

-- TF-IDF weighted vector representation

-- Unweighted vector representation

The count vector representation was generated with the Bag-of-Words model, and the weighted and unweighted representations with a pretrained Word2Vec model. The count vectors are saved in a text file in the format word_integer_index:word_freq, with pairs separated by commas.
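The word_integer_index:word_freq format can be produced as follows. This is a pure-Python sketch (the notebook may use sklearn's CountVectorizer to the same effect), and the function name and sample tokens are illustrative.

```python
from collections import Counter

def count_vector_lines(tokenized_docs):
    """Build count vectors and format each document as comma-separated
    word_integer_index:word_freq pairs, one line per document."""
    # Assign each vocabulary word an integer index (sorted order)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    index = {w: i for i, w in enumerate(vocab)}
    lines = []
    for doc in tokenized_docs:
        freqs = Counter(doc)  # word frequencies within this document
        pairs = sorted((index[w], c) for w, c in freqs.items())
        lines.append(",".join(f"{i}:{c}" for i, c in pairs))
    return vocab, lines

# Illustrative token lists, not the project data
vocab, lines = count_vector_lines([["nurse", "senior", "nurse"],
                                   ["sales", "manager"]])
# The lines could then be written out, e.g. to count_vectors.txt
```

Only words that occur in a document appear on its line, so the file is effectively a sparse representation of the count matrix.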

After obtaining the required feature representations, I built machine learning models to classify job advertisements by category.

I used a Logistic Regression model from sklearn to compare the model's performance across the three feature vector representations.
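A minimal sketch of that comparison, assuming sklearn is available. The toy documents and labels below are stand-ins for the 776 preprocessed advertisements, and only two of the three representations (count and TF-IDF) are shown; the unweighted Word2Vec document vectors would be scored the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the job-ad descriptions and their category labels
docs = ["ledger audit tax", "audit tax accounting",
        "tax ledger finance", "accounting audit ledger",
        "nurse ward patient", "patient care nurse",
        "ward nurse care", "care patient ward"]
labels = ["Accounting_Finance"] * 4 + ["Healthcare_Nursing"] * 4

# Fit Logistic Regression on each representation and compare
# mean cross-validated accuracy
results = {}
for name, vectorizer in [("count", CountVectorizer()),
                         ("tfidf", TfidfVectorizer())]:
    X = vectorizer.fit_transform(docs)
    model = LogisticRegression(max_iter=1000)
    results[name] = cross_val_score(model, X, labels, cv=2).mean()
```

Using the same classifier and the same cross-validation splits for every representation keeps the comparison about the features rather than the model.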
