Problem statement
This project builds a machine learning pipeline to predict salaries based on variables such as number of certifications, residence, age, education, and job title. Each run uses auto-generated synthetic data. The pipeline handles data preprocessing, trains a linear regression model using scikit-learn, and evaluates model performance with standard regression metrics. Development was done locally using VS Code, with all data and model artifacts logged and versioned via GitHub.
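To make the core modeling step concrete, here is a minimal, self-contained sketch of that flow with scikit-learn. The feature columns and salary coefficients are invented for illustration; they are not the project's actual data-generation code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Hypothetical numeric features: years of experience, certifications, age
X = np.column_stack([
    rng.uniform(0, 20, n),
    rng.integers(0, 6, n).astype(float),
    rng.uniform(22, 60, n),
])
# Salary as a noisy linear combination (assumed coefficients)
y = (30_000 + 2_500 * X[:, 0] + 1_200 * X[:, 1] + 150 * X[:, 2]
     + rng.normal(0, 1_000, n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.0f}")
```

In the actual pipeline this logic is split across ZenML steps rather than run as one script.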
Pipelines: ZenML pipelines for data preprocessing, model training, and evaluation.
Monitoring: Evidently AI dashboards for data drift and model performance monitoring.
Experiment Tracking: MLflow for logging experiments, parameters, metrics, and models.
Orchestration: Apache Airflow DAGs for scheduling and managing pipeline execution (not completed)
CI/CD: GitHub Actions workflows for automated testing, building, and deployment.
API: FastAPI application serving the model predictions.
Deployment: Model hosted on Hugging Face for inference. (not completed)
I used MLflow as both an experiment tracker and a model registry to streamline my machine learning workflow. During experimentation, I log key parameters, metrics, artifacts, and model versions using MLflow Tracking, which helps me compare different runs and identify the most effective models. Once a model meets the desired performance criteria, I register it in the MLflow Model Registry, where I can manage model stages such as staging, production, and archived. This setup provides a structured, auditable process for tracking model lifecycle events, promoting reproducibility and smooth collaboration between development and deployment teams.
I used ZenML for workflow orchestration by designing modular, reproducible machine learning pipelines that integrate data ingestion, preprocessing, model training, evaluation, and deployment. ZenML lets me structure these workflows as version-controlled pipelines, leveraging step caching and artifact tracking for efficiency and traceability. Its integration with orchestrators such as Kubeflow or Airflow helps scale and automate pipeline runs across environments, ensuring consistent execution from local development to production deployment. I logged the train and test data as artifacts in ZenML, where they can be viewed in the dashboard, while the model, the Evidently report, and the model metrics were logged to MLflow.
Data: Fictitious dataset containing features relevant to IT professionals' salaries (e.g., years of experience, job role, education).
├── data/ # Synthetic test datasets
├── pipelines/ # ZenML pipeline definitions
├── airflow/ # Apache Airflow DAGs and configs
├── tests/ # pytest scripts
├── steps/ # ZenML pipeline steps
├── notebooks/ # EDA notebook
├── mlruns/ # MLflow data
├── logged_datasets/ # Datasets logged during pipeline runs
├── README.md # Project documentation
├── run_pipeline.py # Main pipeline entry point
└── requirements.txt # Python dependencies
Setup and Installation
Clone the repository:
git clone <repository-url>
cd <repository-directory>
Install dependencies:
pip install -r requirements.txt
Set up environment variables: Create a .env file and configure necessary keys (e.g., Hugging Face API token, MLflow tracking server).
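A minimal .env might look like the following; the variable names are assumptions based on the services mentioned above, not the project's actual keys:

```
HF_TOKEN=<your-hugging-face-api-token>
MLFLOW_TRACKING_URI=http://localhost:5000
```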
Initialize ZenML:
zenml init
Start Airflow:
airflow webserver -p 8080
airflow scheduler
Run GitHub Actions: Workflows are configured in .github/workflows/ for CI/CD. I have included steps for installing Python dependencies, unit testing, linting, and uploading the model to Hugging Face Spaces for inference.
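A trimmed sketch of what such a workflow can look like; the job name, Python version, and exact commands are assumptions, not the project's actual workflow file:

```yaml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: flake8 .          # linting
      - run: pytest tests/     # unit tests
```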
Running the Project
Execute the ZenML pipeline:
python run_pipeline.py
Track experiments with MLflow: Access the MLflow UI to view experiments:
mlflow ui
Deploy the model: Start the FastAPI server:
uvicorn app.main:app --host 0.0.0.0 --port 8000
Deployment of the model to Hugging Face for inference is planned but not yet completed.
Usage
Prediction API: Send a POST request to the FastAPI endpoint (e.g., /predict) with input features in JSON format to get salary predictions.
Monitoring: Check Evidently AI dashboards for data drift and model performance metrics.
Airflow: Monitor pipeline execution via the Airflow web interface.
Contributing
Fork the repository.
Create a feature branch (git checkout -b feature-name).
Commit your changes (git commit -m 'Add feature').
Push to the branch (git push origin feature-name).
Create a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.