# 🔍 Fake Job Post Prediction
Industry-ready ML system for detecting fraudulent job postings using classical ML and Transformer models.
Built with Python, scikit-learn, XGBoost, LightGBM, BERT (Transformers), and FastAPI.
## 📊 Model Performance Results

All 6 models were trained and evaluated on the HuggingFace Fake Job Posting dataset (17,880 records). Evaluation was performed on a held-out 15% stratified test set.
| Model | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|---|---|---|---|---|---|
| Baseline (DummyClassifier) | 95.2% | — | — | — | 0.50 |
| Logistic Regression | 96.5% | 0.59 | 0.89 | 0.71 | 0.99 |
| Linear SVM | 98.2% | 0.80 | 0.83 | 0.82 | 0.98 |
| Random Forest | 97.7% | 0.99 | 0.53 | 0.69 | 0.98 |
| XGBoost | 97.8% | 0.78 | 0.77 | 0.77 | 0.98 |
| LightGBM | 98.1% | 0.86 | 0.74 | 0.79 | 0.98 |
- **Best overall (F1):** Linear SVM — 0.82 F1 with 98.2% accuracy
- **Best recall (catch fraud):** Logistic Regression — 0.89 recall (misses the fewest fake posts)
- **Best precision (fewest false alarms):** Random Forest — 0.99 precision
- **Priority metric:** F1 Score and Recall — minimizing missed fraud is critical
```
Fake-Job-Post-Prediction/
│
├── data/
│   ├── raw/
│   │   └── huggingface_dataset/   # Cached raw dataset from HF
│   ├── processed/
│   │   ├── train.csv              # 70% stratified train split
│   │   ├── val.csv                # 15% validation split
│   │   └── test.csv               # 15% test split
│   └── external/                  # Optional augmentation data
│
├── notebooks/
│   ├── 01_eda.ipynb               # Exploratory Data Analysis
│   ├── 02_preprocessing.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_baseline_models.ipynb
│
├── src/
│   ├── __init__.py
│   ├── config.py                  # Centralized hyperparameters & paths
│   │
│   ├── data/
│   │   ├── dataset.py             # HuggingFace dataset loader + local cache
│   │   ├── preprocess.py          # HTML/emoji/URL removal, stopwords, fraud indicators
│   │   ├── split.py               # Stratified train/val/test splitting
│   │   └── augment.py             # SMOTE oversampling for class imbalance
│   │
│   ├── features/
│   │   ├── featurize.py           # TF-IDF + metadata ColumnTransformer
│   │   └── utils.py               # Feature utility functions
│   │
│   ├── models/
│   │   ├── baseline.py            # DummyClassifier (majority class)
│   │   ├── ml_models.py           # Model registry: LR, SVM, RF, XGBoost, LightGBM
│   │   └── transformer.py         # BERT fine-tuning wrapper (train/predict/save/load)
│   │
│   ├── training/
│   │   ├── train.py               # Main training script (--all, --smote, --full-features)
│   │   ├── evaluate.py            # Evaluation metrics (Accuracy, F1, ROC-AUC, PR-AUC)
│   │   └── callbacks.py           # Early stopping callback
│   │
│   ├── inference/
│   │   ├── predict.py             # Single + batch prediction with saved models
│   │   └── explain.py             # SHAP & LIME explainability
│   │
│   ├── api/
│   │   ├── app.py                 # FastAPI app (4 endpoints)
│   │   └── schemas.py             # Pydantic request/response schemas
│   │
│   ├── utils/
│   │   ├── helpers.py             # Text combination, pattern matching
│   │   ├── metrics.py             # Comprehensive metric computation
│   │   └── logger.py              # Centralized logging
│   │
│   └── visualization/
│       └── plots.py               # Confusion matrix, ROC, PR curves, model comparison
│
├── models/                        # Saved model artifacts (.joblib)
│   ├── baseline.joblib
│   ├── logistic_regression.joblib
│   ├── svm.joblib
│   ├── random_forest.joblib
│   ├── xgboost.joblib
│   ├── lightgbm.joblib
│   └── comparison.csv             # Model comparison results
│
├── requirements.txt
├── Dockerfile
├── .gitignore
├── README.md
└── LICENSE
```
## 1. Clone & Set Up the Virtual Environment

```bash
git clone https://github.com/ByteNinjaSmit/Fake-Job-Post-Prediction.git
cd Fake-Job-Post-Prediction

# Create virtual environment
python -m venv venv

# Activate (Windows)
.\venv\Scripts\activate

# Activate (Linux/Mac)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```
```bash
# Train a single model
.\venv\Scripts\python src/training/train.py --model logistic_regression

# Train ALL models and generate comparison table
.\venv\Scripts\python src/training/train.py --all

# Train with SMOTE oversampling (handles class imbalance)
.\venv\Scripts\python src/training/train.py --model xgboost --smote

# Train with full features (TF-IDF + metadata + engineered features)
.\venv\Scripts\python src/training/train.py --model xgboost --full-features
```

Available models: `baseline`, `logistic_regression`, `svm`, `random_forest`, `xgboost`, `lightgbm`
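For orientation, `src/models/ml_models.py` exposes these names through a registry. The sketch below shows one common way such a registry is built with scikit-learn; the `MODEL_REGISTRY` and `build_model` names are illustrative assumptions (the XGBoost/LightGBM entries would follow the same pattern):

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Hypothetical registry mapping CLI names to estimator factories.
# Hyperparameters mirror the strategies listed in the model tables below.
MODEL_REGISTRY = {
    "baseline": lambda: DummyClassifier(strategy="most_frequent"),
    "logistic_regression": lambda: LogisticRegression(
        class_weight="balanced", max_iter=1000
    ),
    "svm": lambda: LinearSVC(class_weight="balanced"),
    "random_forest": lambda: RandomForestClassifier(
        n_estimators=200, class_weight="balanced"
    ),
}


def build_model(name: str):
    """Instantiate a fresh, unfitted estimator by registry name."""
    return MODEL_REGISTRY[name]()


print(type(build_model("svm")).__name__)
```

Using factories (lambdas) rather than shared instances means every `--model` run starts from an unfitted estimator.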
```python
from src.inference.predict import Predictor

predictor = Predictor("logistic_regression")
result = predictor.predict_single(
    "Earn $5000/week from home! No experience needed. Contact us on WhatsApp."
)
print(result)
# {'prediction': 'Fraudulent', 'label': 1, 'probability_fraudulent': 0.92, ...}
```
```bash
uvicorn src.api.app:app --reload
```

Then visit http://localhost:8000/docs for the interactive Swagger documentation.
| Route | Method | Description |
|---|---|---|
| `/health` | GET | Health check — model status |
| `/predict` | POST | Classify a single job posting |
| `/batch` | POST | Classify multiple job postings |
| `/explain` | POST | Classify + LIME feature explanation |
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Marketing Intern",
    "description": "Earn money fast from home!",
    "company_profile": "",
    "requirements": "No experience needed"
  }'
```

```json
{
  "prediction": "Fraudulent",
  "confidence": 0.92,
  "fraudulent_score": 0.92
}
```
### Tier 1 — Classical ML (TF-IDF Features)

| Model | Library | Strategy |
|---|---|---|
| Logistic Regression | scikit-learn | `class_weight='balanced'`, `max_iter=1000` |
| Linear SVM | scikit-learn | `class_weight='balanced'` |
### Tier 2 — Ensemble Models (TF-IDF + Metadata)

| Model | Library | Strategy |
|---|---|---|
| Random Forest | scikit-learn | 200 estimators, `class_weight='balanced'` |
| XGBoost | XGBoost | 200 estimators, `scale_pos_weight=10` |
| LightGBM | LightGBM | 200 estimators, `class_weight='balanced'` |
### Tier 3 — Transformer

| Model | Library | Strategy |
|---|---|---|
| BERT | Hugging Face Transformers | `bert-base-uncased`, lr=2e-5, 4 epochs, AdamW |
The dataset is highly imbalanced (~5% fraud). We address this through:

- **Class weights** — balanced weighting in all classical models
- **Scale pos weight** — XGBoost positive-class weighting
- **SMOTE** — synthetic minority oversampling (optional via the `--smote` flag)
- **TF-IDF vectors** — up to 5,000 features, bigrams, sublinear TF
- **Combined text** from `title`, `company_profile`, `description`, `requirements`, and `benefits`
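A minimal vectorizer with the stated settings (5,000 features max, unigrams + bigrams, sublinear TF scaling); the project's actual configuration lives in `src/features/featurize.py` and may differ in detail:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,   # cap vocabulary size
    ngram_range=(1, 2),  # unigrams + bigrams
    sublinear_tf=True,   # 1 + log(tf) dampens very frequent terms
)

docs = [
    "Earn money fast from home! No experience needed.",
    "Senior software engineer with 5 years of Python experience.",
]
X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, n_terms), n_terms <= 5000
```

The result is a sparse matrix, which the linear models in Tier 1 consume directly.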
### Engineered Fraud Indicators

| Feature | Rationale |
|---|---|
| `email_count` | Fake posts often include personal emails |
| `url_count` | External link redirection |
| `exclamation_count` | Emotional manipulation ("Earn $$$!!!") |
| `upper_ratio` | ALL CAPS usage |
| `word_count` | Unusually short or long descriptions |
| `company_profile_len` | Fake companies have short/empty profiles |
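These indicators are cheap to compute from raw text. A simplified stand-alone sketch (the regexes are illustrative and `src/data/preprocess.py` may differ; `company_profile_len` is omitted since it is just the length of a separate field):

```python
import re


def fraud_indicators(text: str) -> dict:
    """Compute simple fraud-indicator features for one posting."""
    words = text.split()
    return {
        "email_count": len(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)),
        "url_count": len(re.findall(r"https?://\S+", text)),
        "exclamation_count": text.count("!"),
        # Fraction of whitespace-separated tokens that are ALL CAPS.
        "upper_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        "word_count": len(words),
    }


print(fraud_indicators("EARN $$$ NOW!!! Email me at quickcash@gmail.com"))
```

Each indicator becomes one numeric column that is concatenated with the TF-IDF features downstream.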
### Metadata Features (One-Hot Encoded)

- `employment_type`, `required_experience`, `required_education`, `industry`, `function`
- `telecommuting`, `has_company_logo`, `has_questions`
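A hedged sketch of how a `ColumnTransformer` could combine TF-IDF text with one-hot metadata, as `src/features/featurize.py` is described to do; column names follow the dataset schema, but the exact wiring is an assumption:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Two toy postings: combined text, one categorical field, one binary flag.
df = pd.DataFrame({
    "text": ["Earn money fast from home!", "Backend engineer, Python/Go."],
    "employment_type": ["Other", "Full-time"],
    "telecommuting": [1, 0],
})

features = ColumnTransformer([
    # TfidfVectorizer takes a single column, hence the string selector.
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2)), "text"),
    # handle_unknown="ignore" keeps inference from crashing on unseen categories.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["employment_type"]),
    ("binary", "passthrough", ["telecommuting"]),
])

X = features.fit_transform(df)
print(X.shape)  # (2, n_text_terms + n_categories + 1)
```

Fitting the transformer only on the training split keeps the vocabulary and category sets free of test-set leakage.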
| Metric | Description | Priority |
|---|---|---|
| F1 Score | Harmonic mean of precision & recall | ⭐ Primary |
| Recall | Fraction of actual fraud detected | ⭐ Primary |
| Precision | Fraction of predicted fraud that is real | Secondary |
| ROC-AUC | Overall discrimination ability | Secondary |
| PR-AUC | Area under the precision-recall curve | Secondary |
| Accuracy | Overall correctness | Baseline |

**Priority: F1 and Recall** — in fraud detection, missing a fake job post (a false negative) is worse than a false alarm.
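A quick worked example of the primary metrics with scikit-learn, using toy labels (1 = fraudulent) containing one missed fraud and one false alarm:

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # one FP, one FN

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.8
print("precision:", precision_score(y_true, y_pred))  # 3 TP / 4 predicted = 0.75
print("recall   :", recall_score(y_true, y_pred))     # 3 TP / 4 actual = 0.75
print("f1       :", f1_score(y_true, y_pred))         # 0.75
```

At the real ~5% fraud prevalence the gap is starker: predicting "real" for everything already scores ~95% accuracy with zero recall, which is why the baseline row in the results table sits at 95.2%.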
```
HuggingFace Dataset (17,880 records)
        ↓
Text Cleaning (HTML, emoji, URL, stopword removal)
        ↓
Fraud Indicator Feature Engineering
        ↓
Stratified Split (70% train / 15% val / 15% test)
        ↓
TF-IDF Vectorization + Metadata Encoding
        ↓
Model Training & Evaluation
        ↓
Model Comparison Table (models/comparison.csv)
```
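The stratified 70/15/15 split is typically done in two passes with scikit-learn; a sketch with toy labels standing in for the real `fraudulent` column (`src/data/split.py` may implement it differently):

```python
from sklearn.model_selection import train_test_split

ids = list(range(1000))
labels = [0] * 950 + [1] * 50  # ~5% fraud, mirroring the dataset

# Pass 1: carve off 30%, preserving the class ratio.
train, rest, y_train, y_rest = train_test_split(
    ids, labels, test_size=0.30, stratify=labels, random_state=42
)
# Pass 2: split that 30% evenly into validation and test, again stratified.
val, test, y_val, y_test = train_test_split(
    rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42
)
print(len(train), len(val), len(test))  # 700 150 150
```

Stratifying both passes guarantees every split contains roughly the same ~5% fraud rate, so validation metrics are comparable to test metrics.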
## 🐳 Docker & Docker Compose

### Quick Start with Docker Compose

```bash
# Start the API server
docker compose up api

# Access API
curl http://localhost:8000/health
```
### Train Models Inside Docker

```bash
# Run all model training inside a container
docker compose --profile train up trainer
```

Models and data are mounted as volumes — trained models persist on your host machine.
### Enable Monitoring (Prometheus + Grafana)

```bash
# Start API + Prometheus + Grafana
docker compose --profile monitoring up
```
### Standalone Docker (without Compose)

```bash
# Build image
docker build -t fake-job-api .

# Run container
docker run -p 8000:8000 -v ./models:/app/models fake-job-api

# Access API
curl http://localhost:8000/health
```
| Service | Port | Profile | Description |
|---|---|---|---|
| `api` | 8000 | default | FastAPI prediction server |
| `trainer` | — | train | One-off model training |
| `prometheus` | 9090 | monitoring | Metrics collection |
| `grafana` | 3000 | monitoring | Dashboards |
**LIME (Local Interpretable Model-agnostic Explanations)**
- Explains individual predictions by highlighting contributing words
- Integrated into the `/explain` API endpoint

**SHAP (SHapley Additive exPlanations)**
- Global feature importance for ML models
- Available via `src/inference/explain.py`
| Category | Libraries |
|---|---|
| Data | pandas, numpy, datasets (HuggingFace) |
| ML | scikit-learn, XGBoost, LightGBM, imbalanced-learn |
| Deep Learning | PyTorch, Transformers (HuggingFace) |
| NLP | NLTK, BeautifulSoup4 |
| API | FastAPI, Uvicorn, Pydantic |
| Explainability | SHAP, LIME |
| Visualization | Matplotlib, Seaborn |
| Testing | pytest, httpx |
Source: `victor/real-or-fake-fake-jobposting-prediction`
| Field | Type | Description |
|---|---|---|
| `title` | text | Job title |
| `company_profile` | text | Company description |
| `description` | text | Job description |
| `requirements` | text | Job requirements |
| `benefits` | text | Job benefits |
| `telecommuting` | binary | Remote-work flag |
| `has_company_logo` | binary | Logo presence |
| `has_questions` | binary | Screening questions |
| `employment_type` | categorical | Full-time, Part-time, etc. |
| `required_experience` | categorical | Entry, Mid, Senior, etc. |
| `required_education` | categorical | Bachelor's, Master's, etc. |
| `industry` | categorical | Industry sector |
| `fraudulent` | binary | Target — 0 (Real) / 1 (Fake) |
- ✅ Clean, documented codebase (20+ source files)
- ✅ Reproducible training scripts with CLI arguments
- ✅ 6 trained models with comparison table
- ✅ Production-ready FastAPI with 4 endpoints
- ✅ SHAP & LIME explainability
- ✅ Dockerized deployment
- ✅ Comprehensive README with results
This project is licensed under the MIT License — see LICENSE for details.