
πŸ” Fake Job Post Prediction

Industry-ready ML system for detecting fraudulent job postings using classical ML and Transformer models.
Built with Python, scikit-learn, XGBoost, LightGBM, BERT (Transformers), and FastAPI.

Python 3.10+ · License: MIT


📊 Model Performance Results

All 6 models were trained and evaluated on the HuggingFace Fake Job Posting dataset (17,880 records).
Evaluation was performed on a held-out 15% stratified test set.

| Model                      | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|----------------------------|----------|-----------|--------|----------|---------|
| Baseline (DummyClassifier) | 95.2%    | —         | —      | —        | 0.50    |
| Logistic Regression        | 96.5%    | 0.59      | 0.89   | 0.71     | 0.99    |
| Linear SVM                 | 98.2%    | 0.80      | 0.83   | 0.82     | 0.98    |
| Random Forest              | 97.7%    | 0.99      | 0.53   | 0.69     | 0.98    |
| XGBoost                    | 97.8%    | 0.78      | 0.77   | 0.77     | 0.98    |
| LightGBM                   | 98.1%    | 0.86      | 0.74   | 0.79     | 0.98    |

Key Takeaways

  • Best overall (F1): Linear SVM — 0.82 F1 at 98.2% accuracy
  • Best recall (catches the most fraud): Logistic Regression — 0.89 recall (misses the fewest fake posts)
  • Best precision (fewest false alarms): Random Forest — 0.99 precision
  • Priority metrics: F1 score and recall — minimizing missed fraud is critical

πŸ“ Project Structure

Fake-Job-Post-Prediction/
│
├── data/
│   ├── raw/
│   │   └── huggingface_dataset/           # Cached raw dataset from HF
│   ├── processed/
│   │   ├── train.csv                      # 70% stratified train split
│   │   ├── val.csv                        # 15% validation split
│   │   └── test.csv                       # 15% test split
│   └── external/                          # Optional augmentation data
│
├── notebooks/
│   ├── 01_eda.ipynb                       # Exploratory Data Analysis
│   ├── 02_preprocessing.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_baseline_models.ipynb
│
├── src/
│   ├── __init__.py
│   ├── config.py                          # Centralized hyperparameters & paths
│   │
│   ├── data/
│   │   ├── dataset.py                     # HuggingFace dataset loader + local cache
│   │   ├── preprocess.py                  # HTML/emoji/URL removal, stopwords, fraud indicators
│   │   ├── split.py                       # Stratified train/val/test splitting
│   │   └── augment.py                     # SMOTE oversampling for class imbalance
│   │
│   ├── features/
│   │   ├── featurize.py                   # TF-IDF + metadata ColumnTransformer
│   │   └── utils.py                       # Feature utility functions
│   │
│   ├── models/
│   │   ├── baseline.py                    # DummyClassifier (majority class)
│   │   ├── ml_models.py                   # Model registry: LR, SVM, RF, XGBoost, LightGBM
│   │   └── transformer.py                 # BERT fine-tuning wrapper (train/predict/save/load)
│   │
│   ├── training/
│   │   ├── train.py                       # Main training script (--all, --smote, --full-features)
│   │   ├── evaluate.py                    # Evaluation metrics (Accuracy, F1, ROC-AUC, PR-AUC)
│   │   └── callbacks.py                   # Early stopping callback
│   │
│   ├── inference/
│   │   ├── predict.py                     # Single + batch prediction with saved models
│   │   └── explain.py                     # SHAP & LIME explainability
│   │
│   ├── api/
│   │   ├── app.py                         # FastAPI app (4 endpoints)
│   │   └── schemas.py                     # Pydantic request/response schemas
│   │
│   ├── utils/
│   │   ├── helpers.py                     # Text combination, pattern matching
│   │   ├── metrics.py                     # Comprehensive metric computation
│   │   └── logger.py                      # Centralized logging
│   │
│   └── visualization/
│       └── plots.py                       # Confusion matrix, ROC, PR curves, model comparison
│
├── models/                                # Saved model artifacts (.joblib)
│   ├── baseline.joblib
│   ├── logistic_regression.joblib
│   ├── svm.joblib
│   ├── random_forest.joblib
│   ├── xgboost.joblib
│   ├── lightgbm.joblib
│   └── comparison.csv                     # Model comparison results
│
├── requirements.txt
├── Dockerfile
├── .gitignore
├── README.md
└── LICENSE

🚀 Quick Start

1. Clone & Setup Virtual Environment

git clone https://github.com/ByteNinjaSmit/Fake-Job-Post-Prediction.git
cd Fake-Job-Post-Prediction

# Create virtual environment
python -m venv venv

# Activate (Windows)
.\venv\Scripts\activate

# Activate (Linux/Mac)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Train Models

# Train a single model (with the virtual environment activated)
python src/training/train.py --model logistic_regression

# Train ALL models and generate the comparison table
python src/training/train.py --all

# Train with SMOTE oversampling (handles class imbalance)
python src/training/train.py --model xgboost --smote

# Train with full features (TF-IDF + metadata + engineered features)
python src/training/train.py --model xgboost --full-features

Available models: baseline, logistic_regression, svm, random_forest, xgboost, lightgbm

3. Run Inference

from src.inference.predict import Predictor

predictor = Predictor("logistic_regression")

result = predictor.predict_single(
    "Earn $5000/week from home! No experience needed. Contact us on WhatsApp."
)
print(result)
# {'prediction': 'Fraudulent', 'label': 1, 'probability_fraudulent': 0.92, ...}

4. Start the API

uvicorn src.api.app:app --reload

Then visit: http://localhost:8000/docs for interactive Swagger documentation.


🌐 API Endpoints

| Route    | Method | Description                         |
|----------|--------|-------------------------------------|
| /health  | GET    | Health check — model status         |
| /predict | POST   | Classify a single job posting       |
| /batch   | POST   | Classify multiple job postings      |
| /explain | POST   | Classify + LIME feature explanation |

Example Request

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Marketing Intern",
    "description": "Earn money fast from home!",
    "company_profile": "",
    "requirements": "No experience needed"
  }'

Example Response

{
  "prediction": "Fraudulent",
  "confidence": 0.92,
  "fraudulent_score": 0.92
}
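The request body maps to a Pydantic model in src/api/schemas.py. A guess at what that schema could look like, based only on the fields in the curl example above (the class and field names here are assumptions, not the repository's actual code):

```python
from pydantic import BaseModel

class JobPostingRequest(BaseModel):
    """Hypothetical request schema mirroring the /predict payload."""
    title: str
    description: str
    company_profile: str = ""   # optional fields default to empty strings
    requirements: str = ""

# Validate the same payload as the curl example
req = JobPostingRequest(
    title="Marketing Intern",
    description="Earn money fast from home!",
    requirements="No experience needed",
)
print(req.title)
```

FastAPI uses such a model to validate incoming JSON automatically and to generate the Swagger docs at /docs.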

🧠 Models & Methodology

Tier 1 — Classical ML (TF-IDF Features)

| Model               | Library      | Strategy                                |
|---------------------|--------------|-----------------------------------------|
| Logistic Regression | scikit-learn | class_weight='balanced', max_iter=1000  |
| Linear SVM          | scikit-learn | class_weight='balanced'                 |

Tier 2 — Ensemble Models (TF-IDF + Metadata)

| Model         | Library      | Strategy                                 |
|---------------|--------------|------------------------------------------|
| Random Forest | scikit-learn | 200 estimators, class_weight='balanced'  |
| XGBoost       | XGBoost      | 200 estimators, scale_pos_weight=10      |
| LightGBM      | LightGBM     | 200 estimators, class_weight='balanced'  |

Tier 3 — Deep Learning

| Model | Library                   | Strategy                                   |
|-------|---------------------------|--------------------------------------------|
| BERT  | Hugging Face Transformers | bert-base-uncased, lr=2e-5, 4 epochs, AdamW |

Class Imbalance Handling

The dataset is highly imbalanced (~5% fraud). We address this through:

  • Class weights — balanced weighting in all classical models
  • scale_pos_weight — XGBoost positive-class weighting
  • SMOTE — synthetic minority oversampling (optional via the --smote flag)
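To illustrate the first of these strategies: scikit-learn's class_weight='balanced' sets each class weight to n_samples / (n_classes * count_c), so the rare fraud class is up-weighted in the loss. A minimal sketch on toy labels mirroring the ~5% fraud rate (the label vector here is made up for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 5% positive (fraudulent), like the real dataset
y = np.array([0] * 95 + [1] * 5)

# 'balanced' weight for class c: n_samples / (n_classes * count_c)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class weighted ~19x the majority
```

Each misclassified fraud example then contributes roughly 19 times as much to the loss as a misclassified genuine posting, which counteracts the imbalance without resampling.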

🔧 Feature Engineering

Text Features

  • TF-IDF vectors — up to 5,000 features, bigrams, sublinear TF
  • Combined text from: title + company_profile + description + requirements + benefits
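A sketch of this TF-IDF setup (the exact parameters live in src/config.py; the values below are taken from the description above, and stop-word removal is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,     # cap the vocabulary at 5,000 features
    ngram_range=(1, 2),    # unigrams + bigrams
    sublinear_tf=True,     # use 1 + log(tf) damping
    stop_words="english",
)

# Toy corpus: one scam-like and one legitimate-looking posting
docs = [
    "Earn money fast from home! No experience needed",
    "Senior software engineer, 5+ years of Python experience required",
]
X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, n_features) with n_features <= 5000
```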

Engineered Fraud Indicators

| Feature             | Rationale                                |
|---------------------|------------------------------------------|
| email_count         | Fake posts often include personal emails |
| url_count           | External link redirection                |
| exclamation_count   | Emotional manipulation ("Earn $$$!!!")   |
| upper_ratio         | ALL-CAPS usage                           |
| word_count          | Unusually short or long descriptions     |
| company_profile_len | Fake companies have short/empty profiles |
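A minimal sketch of how such indicators can be computed; the real implementations live in src/data/preprocess.py, and the regexes below are illustrative assumptions:

```python
import re

def fraud_indicators(text: str) -> dict:
    """Compute simple text-level fraud signals for one posting."""
    return {
        "email_count": len(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)),
        "url_count": len(re.findall(r"https?://\S+", text)),
        "exclamation_count": text.count("!"),
        "upper_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        "word_count": len(text.split()),
    }

print(fraud_indicators("EARN $$$ NOW!!! Email me at win@scam.biz"))
```

These numeric columns are then concatenated with the TF-IDF and metadata features in the ColumnTransformer.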

Metadata Features (One-Hot Encoded)

  • employment_type, required_experience, required_education, industry, function

Boolean Features

  • telecommuting, has_company_logo, has_questions

📈 Evaluation Metrics

| Metric    | Description                              | Priority  |
|-----------|------------------------------------------|-----------|
| F1 Score  | Harmonic mean of precision & recall      | ⭐ Primary |
| Recall    | Fraction of actual fraud detected        | ⭐ Primary |
| Precision | Fraction of predicted fraud that is real | Secondary |
| ROC-AUC   | Overall discrimination ability           | Secondary |
| PR-AUC    | Area under the precision-recall curve    | Secondary |
| Accuracy  | Overall correctness                      | Baseline  |

Priority: F1 and recall — in fraud detection, missing a fake job post (a false negative) is worse than a false alarm.
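As a sanity check on the results table: F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported columns. Using the Linear SVM row (precision 0.80, recall 0.83):

```python
# F1 = harmonic mean of precision and recall
precision, recall = 0.80, 0.83
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # ~0.81, matching the reported 0.82 up to rounding of the inputs
```

The small discrepancy comes from the table showing precision and recall rounded to two decimals; the reported F1 was computed from the unrounded values.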


🧪 Data Pipeline

HuggingFace Dataset (17,880 records)
        ↓
   Text Cleaning (HTML, emoji, URL, stopword removal)
        ↓
   Fraud Indicator Feature Engineering
        ↓
   Stratified Split (70% train / 15% val / 15% test)
        ↓
   TF-IDF Vectorization + Metadata Encoding
        ↓
   Model Training & Evaluation
        ↓
   Model Comparison Table (models/comparison.csv)
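The 70/15/15 stratified split in the pipeline above is typically done in two passes: hold out 30%, then halve it into validation and test. A sketch under that assumption (src/data/split.py presumably does something similar; the toy arrays here just mirror the ~5% fraud rate):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)          # 1,000 toy samples
y = np.array([0] * 950 + [1] * 50)          # ~5% positive, as in the dataset

# Pass 1: 70% train, 30% held out, preserving the class ratio
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
# Pass 2: split the held-out 30% evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

Stratification guarantees each split keeps the same ~5% fraud rate, so validation and test metrics are comparable.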

🐳 Docker & Docker Compose

Quick Start with Docker Compose

# Start the API server
docker compose up api

# Access API
curl http://localhost:8000/health

Train Models Inside Docker

# Run all model training inside a container
docker compose --profile train up trainer

Models and data are mounted as volumes — trained models persist on your host machine.

Enable Monitoring (Prometheus + Grafana)

# Start API + Prometheus + Grafana
docker compose --profile monitoring up

Standalone Docker (without Compose)

# Build image
docker build -t fake-job-api .

# Run container
docker run -p 8000:8000 -v ./models:/app/models fake-job-api

# Access API
curl http://localhost:8000/health

Docker Compose Services

| Service    | Port | Profile    | Description               |
|------------|------|------------|---------------------------|
| api        | 8000 | default    | FastAPI prediction server |
| trainer    | —    | train      | One-off model training    |
| prometheus | 9090 | monitoring | Metrics collection        |
| grafana    | 3000 | monitoring | Dashboards                |

🔬 Explainability

LIME (Local Interpretable Model-agnostic Explanations)

  • Explains individual predictions by highlighting contributing words
  • Integrated into the /explain API endpoint

SHAP (SHapley Additive exPlanations)

  • Global feature importance for ML models
  • Available via src/inference/explain.py

📚 Tech Stack

| Category       | Libraries                                        |
|----------------|--------------------------------------------------|
| Data           | pandas, numpy, datasets (Hugging Face)           |
| ML             | scikit-learn, XGBoost, LightGBM, imbalanced-learn |
| Deep Learning  | PyTorch, Transformers (Hugging Face)             |
| NLP            | NLTK, BeautifulSoup4                             |
| API            | FastAPI, Uvicorn, Pydantic                       |
| Explainability | SHAP, LIME                                       |
| Visualization  | Matplotlib, Seaborn                              |
| Testing        | pytest, httpx                                    |

πŸ“ Dataset

Source: victor/real-or-fake-fake-jobposting-prediction

| Field               | Type        | Description                  |
|---------------------|-------------|------------------------------|
| title               | text        | Job title                    |
| company_profile     | text        | Company description          |
| description         | text        | Job description              |
| requirements        | text        | Job requirements             |
| benefits            | text        | Job benefits                 |
| telecommuting       | binary      | Remote work flag             |
| has_company_logo    | binary      | Logo presence                |
| has_questions       | binary      | Screening questions          |
| employment_type     | categorical | Full-time, Part-time, etc.   |
| required_experience | categorical | Entry, Mid, Senior, etc.     |
| required_education  | categorical | Bachelor's, Master's, etc.   |
| industry            | categorical | Industry sector              |
| fraudulent          | binary      | Target — 0 (Real) / 1 (Fake) |

🗂 Deliverables

  • ✅ Clean, documented codebase (20+ source files)
  • ✅ Reproducible training scripts with CLI arguments
  • ✅ 6 trained models with comparison table
  • ✅ Production-ready FastAPI with 4 endpoints
  • ✅ SHAP & LIME explainability
  • ✅ Dockerized deployment
  • ✅ Comprehensive README with results

📄 License

This project is licensed under the MIT License — see LICENSE for details.
