The data used in this project was provided as part of the coursework for an advanced programming course. The data has been anonymized and stripped of sensitive information, and is intended for practice use only.
This project provides starter code for solving real-world text data problems related to job advertisements. It includes:
-- Basic Text Pre-processing: Tokenization, removing stopwords, and filtering less frequent words.
-- Feature Representations: Count vectors and Word2Vec word embeddings, the latter in both TF-IDF weighted and unweighted form.
-- Text Classification: Implementing machine learning models, including Logistic Regression, to classify job advertisements into categories.
The project aims to build an automated job ads classification system to help predict the categories of new job advertisements, reducing human error and improving user experience on job hunting websites.
There are four folders, one for each job advertisement category: 'Accounting_Finance', 'Engineering', 'Healthcare_Nursing', and 'Sales'. Together they contain 776 job advertisements.
File format: each txt file contains 'Title', 'Webindex', 'Company', and 'Description' fields. Some txt files have no 'Company' field.
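As an illustration of reading one advertisement, here is a minimal sketch. It assumes each field sits on its own line as "Field: value", which the description above does not confirm, so the layout may need adjusting for the real files. The function name parse_job_ad and the sample text are hypothetical.

```python
FIELDS = ("Title", "Webindex", "Company", "Description")


def parse_job_ad(text):
    """Parse one advertisement, ASSUMING one 'Field: value' pair per line."""
    # Every field starts as None, so a missing 'Company' simply stays None.
    record = {field: None for field in FIELDS}
    for line in text.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() in record:
            record[key.strip()] = value.strip()
    return record


# Hypothetical sample without a 'Company' line.
sample = "Title: Staff Nurse\nWebindex: 12345678\nDescription: Join our ward team."
ad = parse_job_ad(sample)
```

Returning None for an absent field keeps every record the same shape, which makes later steps simpler than handling missing keys.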
In this notebook, I focus on pre-processing the description information only. The steps include:
-- Tokenization
-- Removing words with length less than 2
-- Removing stop words and the most and least frequent words
-- Constructing the vocabulary
Finally, the pre-processed job advertisement text and related information are saved to txt files.
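The pre-processing steps listed above can be sketched roughly as follows. This is not the notebook's actual code: the token pattern, the small stop word set, and the thresholds (dropping words that occur only once in the whole collection and the top_k words by document frequency) are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed token pattern: alphabetic words, optionally joined by one hyphen or apostrophe.
TOKEN_RE = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")
# Placeholder stop word set; a real run would load a full list.
STOPWORDS = {"the", "a", "and", "to", "of", "in", "for", "is"}


def preprocess(descriptions, top_k=1):
    """Tokenize, filter, and build a vocabulary (thresholds are assumptions)."""
    # Tokenize in lower case, drop words shorter than 2 characters and stop words.
    docs = [
        [t for t in TOKEN_RE.findall(d.lower()) if len(t) >= 2 and t not in STOPWORDS]
        for d in descriptions
    ]
    term_freq = Counter(t for doc in docs for t in doc)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    # Least frequent: words that appear only once across all descriptions.
    rare = {t for t, n in term_freq.items() if n == 1}
    # Most frequent: the top_k words ranked by document frequency.
    common = {t for t, _ in doc_freq.most_common(top_k)}
    docs = [[t for t in doc if t not in rare and t not in common] for doc in docs]
    # The vocabulary is the sorted set of surviving words.
    vocab = sorted({t for doc in docs for t in doc})
    return docs, vocab


descriptions = [
    "Sales manager needed for key accounts in a busy team",
    "Nurse manager to lead a busy ward team",
    "Sales engineer for key accounts team",
]
docs, vocab = preprocess(descriptions)
```

In this toy input "team" is the only word present in all three descriptions, so it is removed as the most frequent word, while words such as "nurse" and "ward" are removed because they occur only once.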
In this notebook, I generated three different types of feature representations:
-- Count vector representation
-- TF-IDF weighted vector representation
-- Unweighted vector representation
The count vectors were generated from the descriptions using the Bag-of-Words model, and the weighted and unweighted vectors were generated using a pretrained Word2Vec model. The count vectors are saved in a text file in the format word_integer_index:word_freq, with entries separated by commas.
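To make the representations concrete, here is a small sketch under stated assumptions. count_vector_line follows the word_integer_index:word_freq format described above, taking the index to be a word's position in the vocabulary. document_vector assumes a document embedding is a (weighted) average of its word vectors, which is one common choice and may differ from what the notebook does. Both function names and the toy embeddings are hypothetical, and the TF-IDF weights are passed in rather than computed.

```python
from collections import Counter


def count_vector_line(tokens, vocab):
    """Format one description as 'index:freq' pairs separated by commas."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = Counter(index[t] for t in tokens if t in index)
    return ",".join(f"{i}:{n}" for i, n in sorted(counts.items()))


def document_vector(tokens, embeddings, weights=None):
    """Average word vectors; pass per-word TF-IDF weights for the weighted form."""
    known = [t for t in tokens if t in embeddings]
    if not known:
        return None  # no token has an embedding
    w = [1.0 if weights is None else weights.get(t, 0.0) for t in known]
    total = sum(w)
    if total == 0:
        return None  # all weights are zero, so the average is undefined
    dim = len(embeddings[known[0]])
    return [sum(wi * embeddings[t][k] for wi, t in zip(w, known)) / total
            for k in range(dim)]


vocab = ["accounts", "busy", "key", "manager", "sales"]
line = count_vector_line(["sales", "manager", "sales", "key"], vocab)

# Toy 2-dimensional vectors standing in for pretrained Word2Vec embeddings.
toy = {"sales": [1.0, 0.0], "manager": [0.0, 1.0]}
unweighted = document_vector(["sales", "manager"], toy)
weighted = document_vector(["sales", "manager"], toy, {"sales": 3.0, "manager": 1.0})
```

With these inputs, line is "2:1,3:1,4:2", the unweighted vector is [0.5, 0.5], and the weighted vector is [0.75, 0.25], showing how a higher TF-IDF weight pulls the document vector toward that word.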
After obtaining the required feature representations, I build machine learning models to classify the category of job advertisements.
I train a Logistic Regression model from sklearn on each of the three feature vector representations and compare its performance across them.
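One possible way to run this comparison is sketched below. It assumes 5-fold cross-validated accuracy as the metric, which the text above does not specify, and it uses synthetic data from make_classification as a stand-in for the real feature matrices. The helper compare_representations and the dictionary keys are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_representations(feature_sets, labels, folds=5):
    """Return mean cross-validated accuracy for each named feature matrix."""
    scores = {}
    for name, X in feature_sets.items():
        # A fresh model per representation; max_iter raised to help convergence.
        model = LogisticRegression(max_iter=1000)
        fold_scores = cross_val_score(model, X, labels, cv=folds, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores


# Synthetic stand-in data with 4 classes; NOT the job advertisement features.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
feature_sets = {
    "count": X,
    "unweighted": X[:, :10],
    "tfidf_weighted": X[:, 10:],
}
scores = compare_representations(feature_sets, y)
```

Using the same folds and the same model settings for every representation keeps the comparison about the features rather than the classifier.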