This repository provides a comprehensive framework for managing the entire Machine Learning lifecycle, specifically tailored for Time Series data. By adhering to MLOps best practices, this project streamlines workflows from data ingestion and preprocessing to hyperparameter optimization, model training, and deployment.
While the primary focus is on Time Series data, the scripts and methodologies in this repository are generalizable and can be adapted for any Machine Learning project involving tabular data.
- Complete ML Lifecycle: Covers all phases from data preparation to model deployment.
- Modular Codebase: Easily transferable to other ML projects.
- MLOps Best Practices: Ensures reproducibility, scalability, and maintainability.
The project leverages the following technologies:
- Python 3.10+: The core programming language used throughout the project.
- Pandas: For data manipulation and preprocessing.
- AWS: Cloud provider for scalable infrastructure and deployment.
- Scikit-Learn API: Provides a consistent interface for Machine Learning estimators.
- Optuna: For advanced Hyperparameter Optimization (HPO).
- MLflow: For experiment tracking, model registry, and model deployment.
### Data Ingestion

Scripts to automate the collection and validation of raw data.
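A validation step of this kind can be sketched as follows; the `timestamp`/`value` column names and the specific checks are illustrative assumptions, not the repository's actual schema:

```python
import pandas as pd

def validate_raw_data(df: pd.DataFrame, required_columns=("timestamp", "value")) -> pd.DataFrame:
    """Basic sanity checks on freshly ingested time series data."""
    # All expected columns must be present.
    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Timestamps must parse and be unique, or downstream resampling breaks.
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    if df["timestamp"].duplicated().any():
        raise ValueError("Duplicate timestamps found")
    return df.sort_values("timestamp").reset_index(drop=True)
```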
### Data Preprocessing

Tools to clean, transform, and prepare data for Time Series analysis.
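As a sketch of typical time series preparation — the column names, daily frequency, and interpolation strategy are assumptions for illustration, not the pipeline's fixed choices:

```python
import pandas as pd

def preprocess(df: pd.DataFrame, freq: str = "D") -> pd.DataFrame:
    """Resample to a regular frequency and fill small gaps."""
    df = df.set_index(pd.to_datetime(df["timestamp"])).sort_index()
    # Resampling exposes missing periods as NaN rows...
    out = df[["value"]].resample(freq).mean()
    # ...which are filled by linear interpolation between neighbours.
    out["value"] = out["value"].interpolate(method="linear")
    return out
```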
### Feature Engineering

Create and select relevant features for model training.
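For time series, typical features are lags and rolling statistics. A minimal sketch — the lag and window sizes here are illustrative (e.g. 7 for weekly seasonality in daily data):

```python
import pandas as pd

def make_lag_features(series: pd.Series, lags=(1, 7), windows=(7,)) -> pd.DataFrame:
    """Build lag and rolling-mean features from a univariate series."""
    feats = pd.DataFrame(index=series.index)
    for lag in lags:
        feats[f"lag_{lag}"] = series.shift(lag)
    for w in windows:
        # Shift by 1 so the window never includes the current value (no leakage).
        feats[f"rolling_mean_{w}"] = series.shift(1).rolling(w).mean()
    return feats
```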
### Hyperparameter Optimization

Integration with Optuna for efficient and scalable parameter tuning.
### Model Training

Leverages Scikit-Learn's API to train models.
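Training with the Scikit-Learn API, using a time-ordered hold-out split rather than a random one; the model choice and synthetic data are placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

X, y = make_regression(n_samples=300, n_features=5, random_state=0)

# For time series, hold out the *last* fold instead of a random split,
# so the model is always evaluated on data that comes after its training data.
cv = TimeSeriesSplit(n_splits=5)
train_idx, test_idx = list(cv.split(X))[-1]

model = GradientBoostingRegressor(random_state=0)
model.fit(X[train_idx], y[train_idx])
mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
print(f"hold-out MAE: {mae:.2f}")
```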
### Model Tracking and Experimentation

Utilizes MLflow to log metrics, parameters, and artifacts for reproducibility.
### Model Deployment

Flask-based deployment for serving models as RESTful APIs.
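A minimal sketch of such a service; the route, payload shape, and dummy model are illustrative, not the repository's actual API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for a real model (e.g. one loaded from the MLflow registry);
# it just sums the features so the example stays self-contained.
class DummyModel:
    def predict(self, rows):
        return [sum(row) for row in rows]

model = DummyModel()

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like: {"instances": [[f1, f2, ...], ...]}
    rows = request.get_json()["instances"]
    return jsonify({"predictions": model.predict(rows)})

# To serve locally, uncomment:
# app.run(host="0.0.0.0", port=5000)
```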
### Clone the Repository

```bash
git clone <repository-url>
cd <repository-folder>
```
### Install Dependencies

Ensure you have Python 3.10+ installed, then run:

```bash
pip install -r requirements.txt
```
### Configure the Kaggle API

The `kaggle` package is needed to download the data; it is installed along with the other requirements. Before accessing the Kaggle API, you need to authenticate with an API token:

1. Go to the 'Account' tab of your Kaggle profile.
2. Click 'Create New Token'. This downloads a file named `kaggle.json` containing your API credentials.
3. Move this file to the appropriate location:
   - Linux/macOS: `~/.kaggle/kaggle.json`
   - Windows: `C:\Users\<Windows-username>\.kaggle\kaggle.json`

Make sure the file's permissions are set so only your user can read it (e.g. `chmod 600 ~/.kaggle/kaggle.json` on Linux/macOS) to keep it secure.
### Configure AWS

Set up your AWS credentials to use cloud services for storage or deployment.
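For example, via the AWS CLI's interactive wizard or environment variables; the `<...>` values are placeholders for your own credentials:

```bash
# Either run the interactive wizard (writes ~/.aws/credentials):
aws configure

# ...or export credentials for the current shell session:
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_DEFAULT_REGION=<your-region>
```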
### Run the Pipeline

TODO