This project involves detecting fraudulent job postings using machine learning models. The dataset used is fake_job_postings.csv, which contains various attributes related to job postings, such as job title, location, company profile, description, requirements, benefits, and labels indicating whether a job posting is fraudulent or not. The goal is to analyze the data and build models to classify job postings as real or fake.
- Importing Data
  - Necessary libraries such as pandas, numpy, seaborn, matplotlib, and sklearn are imported.
  - The dataset is loaded using pandas.read_csv().
  - Missing values are analyzed using missingno.
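The loading and missing-value steps can be sketched as follows. This is a minimal illustration, not the project's script: it reads a tiny in-memory CSV standing in for fake_job_postings.csv, the column names and the helper functions `load_postings` and `missing_summary` are assumptions made for the example, and it uses plain pandas for the missing-value summary so it runs without a plotting backend.

```python
import io

import pandas as pd

# Tiny stand-in for fake_job_postings.csv; the column names are assumed
# for illustration and may differ from the real file.
SAMPLE_CSV = """title,location,company_profile,fraudulent
Data Analyst,"US, NY",We build analytics tools,0
Home Typist,,,1
Backend Engineer,"DE, BE",,0
Payment Clerk,"US, TX",,1
"""


def load_postings(source):
    """Read job postings from a path or file-like object."""
    return pd.read_csv(source)


def missing_summary(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


df = load_postings(io.StringIO(SAMPLE_CSV))
summary = missing_summary(df)
print(summary)

# For a visual check, missingno offers msno.matrix(df) and msno.bar(df)
# after `import missingno as msno`.
```

With the real file, `load_postings('fake_job_postings.csv')` would replace the in-memory sample.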
- Data Visualization
  - Distribution of fraudulent vs. non-fraudulent job postings.
  - Distribution of job postings by employment type, required education, and experience.
  - Correlation heatmaps of numerical features.
  - Word cloud visualization of job descriptions.
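The numbers behind the first two plots can be computed directly with pandas. The sketch below uses a handful of toy rows, and the column names `employment_type` and `fraudulent` are assumptions about the dataset's schema. A plotting call such as seaborn's `countplot` could then be used to draw the same counts.

```python
import pandas as pd

# Toy rows standing in for the real dataset; column names are assumed.
df = pd.DataFrame({
    "employment_type": ["Full-time", "Part-time", "Full-time", None,
                        "Full-time", "Part-time"],
    "fraudulent": [0, 1, 0, 1, 0, 0],
})

# Share of real (0) vs. fraudulent (1) postings.
class_share = df["fraudulent"].value_counts(normalize=True).sort_index()

# Fraud rate per employment type; missing types get their own bucket.
fraud_rate = (
    df.assign(employment_type=df["employment_type"].fillna("Unknown"))
      .groupby("employment_type")["fraudulent"]
      .mean()
)

print(class_share)
print(fraud_rate)
```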
- Data Preprocessing
  - Categorical and numerical feature separation.
  - Handling missing values using SimpleImputer.
  - One-hot encoding categorical features.
  - Splitting data into training and test sets.
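One way to wire these preprocessing steps together is a scikit-learn `ColumnTransformer`. The sketch below is an assumption about how the pieces might fit, not a copy of the project's code: the toy columns, the imputation strategies, and the split parameters are all chosen for illustration. It also splits before fitting the imputer and encoder so that test-set statistics do not leak into training, which may differ from the order used in the script.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data; column names and values are assumed for illustration.
df = pd.DataFrame({
    "employment_type": pd.Series(
        ["Full-time", np.nan, "Part-time", "Full-time",
         "Contract", "Full-time", np.nan, "Part-time"], dtype="object"),
    "has_company_logo": [1, 0, np.nan, 1, 1, 0, 0, 1],
    "fraudulent": [0, 1, 0, 0, 0, 1, 1, 0],
})

X = df.drop(columns="fraudulent")
y = df["fraudulent"]

# Separate numerical columns from everything else (treated as categorical).
numerical_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numerical_cols]

preprocess = ColumnTransformer(
    [
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
        ("num", SimpleImputer(strategy="median"), numerical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit on the training split only, then reuse the fitted transform on the test split.
X_train_enc = preprocess.fit_transform(X_train)
X_test_enc = preprocess.transform(X_test)
print(X_train_enc.shape, X_test_enc.shape)
```

Setting `handle_unknown="ignore"` keeps the encoder from failing when a category appears only in the test split.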
- Model Training and Evaluation
  - Random Forest Classifier:
    - Trained with different numbers of trees.
    - Achieved an accuracy of 99.48%.
    - Top keywords influencing fraudulent job postings were extracted.
  - Naive Bayes Classifier:
    - Achieved an accuracy of 98.43%.
    - Precision: 0.93, Recall: 0.50, F1 Score: 0.65.
  - Decision Tree Classifier:
    - Achieved an accuracy of 99.42%.
    - Precision: 0.99, Recall: 0.93, F1 Score: 0.96.
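The comparison above can be reproduced in outline with scikit-learn. The following sketch trains the three model families on synthetic, imbalanced data from `make_classification`, so its scores will not match the figures reported here. The tree counts (50 and 200), the choice of `GaussianNB` as the Naive Bayes variant, and the helper `f1_from` are assumptions for illustration. The last lines check that the reported F1 scores are consistent with the reported precision and recall.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the encoded job-posting features
# (roughly 5% positives); this is not the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical model settings; the tree counts are arbitrary examples.
models = {
    "random_forest_50": RandomForestClassifier(n_estimators=50, random_state=0),
    "random_forest_200": RandomForestClassifier(n_estimators=200, random_state=0),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, pred, average="binary", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision, "recall": recall, "f1": f1,
    }
    print(name, {k: round(float(v), 3) for k, v in results[name].items()})


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# The reported F1 scores follow from the reported precision and recall.
print(round(f1_from(0.93, 0.50), 2))  # Naive Bayes: 0.65
print(round(f1_from(0.99, 0.93), 2))  # Decision Tree: 0.96
```

Because fraudulent postings are a small minority, accuracy alone can look strong while many of them are missed: the Naive Bayes recall of 0.50 means half of the fraudulent postings went undetected despite 98.43% accuracy. Precision, recall, and F1 are the more informative numbers for comparing models.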
- Key Findings
  - Fraudulent job postings often contain certain keywords in company profiles and descriptions.
  - Specific industries and job functions are more prone to fraudulent postings.
  - Fraud patterns differed between STEM and non-STEM job postings.
- Install dependencies:
pip install pandas numpy seaborn matplotlib scikit-learn nltk wordcloud missingno
- Run the script:
python fake_job_detection.py
- Modify dataset file path if needed:
df = pd.read_csv('fake_job_postings.csv')
- Incorporate NLP techniques like TF-IDF and word embeddings.
- Experiment with deep learning models for classification.
- Deploy the model as a web service for real-time job posting verification.
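As a possible starting point for the TF-IDF idea, the sketch below turns free-text descriptions into TF-IDF features and fits a simple classifier on top. Everything in it is an assumption for illustration: the example descriptions and labels are made up, and `LogisticRegression` is chosen here only as a lightweight baseline, not because the project uses it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up descriptions for illustration; label 1 marks a fraudulent posting.
descriptions = [
    "Earn money fast from home, no experience needed, wire transfer fee",
    "Senior backend engineer to build payment APIs with our platform team",
    "Work from home typing job, send bank details to receive payment",
    "Data analyst role supporting the finance team with SQL reporting",
    "Quick cash, no interview, pay a small registration fee to start",
    "Registered nurse for a regional hospital, full benefits and training",
]
labels = [1, 0, 1, 0, 1, 0]

# Unigrams and bigrams, with English stop words removed.
text_clf = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
text_clf.fit(descriptions, labels)

tfidf = text_clf.named_steps["tfidfvectorizer"]
print(len(tfidf.vocabulary_), "TF-IDF features")

prediction = text_clf.predict(
    ["Easy money from home, just pay a registration fee"]
)
print(prediction)
```

On the real dataset, the description, requirements, and company profile text would replace the toy list, and the TF-IDF features could be combined with the one-hot encoded columns.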
Vedant - GitHub