A machine learning-driven job matching engine that connects applicants with the most suitable job opportunities using natural language processing, text embeddings, and structured data features.
## Features

- 🔄 Data Ingestion: Load applicant profiles, job descriptions, and labeled pairs
- 🧠 Feature Engineering: Combine text-based embeddings and structured metadata (location, experience, skills)
- 🎯 Model Training: Train predictive models using logistic regression, XGBoost, and LightGBM
- 📈 Prediction: Rank jobs for applicants or find best-fit candidates for positions
- 🌐 API Integration: Serve predictions via FastAPI (optional)
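As a rough sketch of the approach those bullets describe (synthetic data and illustrative names only; the repo's real feature builders live in `src/features/`), a text-similarity feature derived from embeddings can be combined with structured metadata and fed to a logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for real data: text embeddings plus one structured
# feature (years of experience). All names here are illustrative.
n = 200
applicant_emb = rng.normal(size=(n, 16))
job_emb = rng.normal(size=(n, 16))
experience = rng.uniform(0, 10, size=(n, 1))

def cosine(a, b):
    """Row-wise cosine similarity between two matrices."""
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Combine the text signal (similarity) with structured metadata.
similarity = cosine(applicant_emb, job_emb).reshape(-1, 1)
X = np.hstack([similarity, experience])
y = (similarity.ravel() + 0.05 * (experience.ravel() - 5) > 0).astype(int)  # toy labels

model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]  # one match probability per applicant-job pair
```

Ranking jobs for an applicant then reduces to sorting candidate pairs by `scores`; the same scheme extends to XGBoost or LightGBM by swapping the estimator.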
## Installation

```bash
# Clone the repository
git clone https://github.com/theflyfoxX/flyfox-job-matching.git
cd flyfox-job-matching

# Create virtual environment
python -m venv wrangler-env

# Activate on Windows
./wrangler-env/Scripts/activate

# Activate on macOS/Linux
source wrangler-env/bin/activate

# Install dependencies
pip install -r requirements.txt
```

## Project Structure

```
flyfox/
├── config.yaml                # Central configuration
├── predict.py                 # Main prediction script
├── test.py                    # Test runner
├── requirements.txt           # Python dependencies
├── pyproject.toml             # Project metadata
│
├── data/
│   ├── raw/                   # Raw CSV files
│   │   ├── Combined_Jobs_Final.csv
│   │   ├── Experience.csv
│   │   ├── Positions_Of_Interest.csv
│   │   └── labeled_applicant_job_pairs.csv
│   ├── interim/               # Processed intermediate data
│   └── features/              # Final feature matrices
│
├── embeddings/
│   ├── jobs/                  # Job embeddings (.npy)
│   └── applicants/            # Applicant embeddings (.npy)
│
├── features/
│   └── build_features.py      # Feature engineering scripts
│
├── src/
│   ├── features/              # Feature builders
│   ├── io/                    # File I/O utilities
│   ├── models/                # Model training & evaluation
│   ├── prep/                  # Data preparation helpers
│   ├── preprocessing/         # Text/vector preprocessing
│   ├── utils/                 # Shared utilities
│   └── api/                   # FastAPI application
│
└── docker/                    # Docker configurations
```
## Usage

Run the main prediction script:

```bash
python predict.py
```

Execute the test suite:

```bash
python test.py
```

## Data Requirements

Place the following files in `data/raw/`:
- `Combined_Jobs_Final.csv` - Job postings with descriptions and metadata
- `Experience.csv` - Applicant work experience records
- `Positions_Of_Interest.csv` - Applicant job preferences
- `labeled_applicant_job_pairs.csv` - Training data with labeled applicant-job matches
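For illustration, the labeled pairs file could be loaded with pandas as below. The column names shown are hypothetical, since the actual CSV schema is not documented here:

```python
import io
import pandas as pd

# Hypothetical two-row sample mirroring labeled_applicant_job_pairs.csv;
# the real column names may differ.
sample = io.StringIO("applicant_id,job_id,label\nA1,J9,1\nA1,J4,0\n")
pairs = pd.read_csv(sample)

positive_rate = pairs["label"].mean()  # fraction of positive (matched) pairs
```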
Pre-generated embeddings must be stored as `.npy` dictionary files:

- `embeddings/jobs/embeddings_dict.npy` - Job description embeddings
- `embeddings/applicants/embeddings_dict.npy` - Applicant profile embeddings
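A sketch of writing and reading such a dictionary file with NumPy (the ID keys and 384-dimensional vectors are illustrative; the real embeddings would come from a sentence encoder such as `sentence-transformers`):

```python
import os
import numpy as np

os.makedirs("embeddings/jobs", exist_ok=True)

# Hypothetical dict mapping job IDs to embedding vectors.
embeddings = {
    "job_001": np.random.rand(384).astype(np.float32),
    "job_002": np.random.rand(384).astype(np.float32),
}
np.save("embeddings/jobs/embeddings_dict.npy", embeddings)

# A .npy file holding a dict must be loaded with allow_pickle=True;
# .item() unwraps the 0-d object array back into a dict.
loaded = np.load("embeddings/jobs/embeddings_dict.npy", allow_pickle=True).item()
print(sorted(loaded.keys()))  # ['job_001', 'job_002']
```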
## Dependencies

- Data Processing: `pandas`, `numpy`, `pyarrow`, `fastparquet`
- Machine Learning: `scikit-learn`, `lightgbm`, `xgboost`
- NLP & Embeddings: `sentence-transformers`, `transformers`, `torch`, `gensim`
- API: `fastapi`, `uvicorn`
- Database: `psycopg2-binary` (PostgreSQL support)

See `requirements.txt` for the complete list with versions.
## Testing

The project includes comprehensive testing:

```bash
# Run all tests
python test.py

# Run specific test modules
pytest tests/test_features.py
pytest tests/test_models.py
```

## Configuration

Edit `config.yaml` to customize:
- Model parameters
- Feature engineering settings
- API configuration
- File paths and data sources
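The actual `config.yaml` schema is not reproduced here; as an illustration of the categories above, settings like these could be read with PyYAML (all key names are hypothetical):

```python
import yaml  # PyYAML, commonly pulled in alongside the listed dependencies

# Hypothetical configuration mirroring the categories above;
# the real config.yaml keys may differ.
raw = """
model:
  type: lightgbm
  params:
    num_leaves: 31
paths:
  raw_data: data/raw
  embeddings: embeddings
api:
  host: 0.0.0.0
  port: 8000
"""
config = yaml.safe_load(raw)
model_type = config["model"]["type"]  # "lightgbm"
api_port = config["api"]["port"]      # 8000
```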
## Notes

- Embeddings must be generated before running predictions
- Ensure all required data files are present in `data/raw/`
- The virtual environment (`wrangler-env/`) is excluded from version control
- GPU acceleration is recommended for embedding generation and model training
## Author

Ali Rassas

- 🔗 GitHub: [@theflyfoxX](https://github.com/theflyfoxX)