Merged
Changes from 9 commits
1 change: 1 addition & 0 deletions .typos.toml
@@ -56,6 +56,7 @@ mape = "mape"
yhat = "yhat"
yhat_lower = "yhat_lower"
yhat_upper = "yhat_upper"
fpr = "fpr"

[default]
locale = "en-us"
38 changes: 38 additions & 0 deletions bank_subscription_prediction/Dockerfile.codespace
@@ -0,0 +1,38 @@
# Sandbox base image
FROM zenmldocker/zenml-sandbox:latest

# Install uv from official distroless image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Set uv environment variables for optimization
ENV UV_SYSTEM_PYTHON=1
ENV UV_COMPILE_BYTECODE=1

# Project metadata
LABEL project_name="bank_subscription_prediction"
LABEL project_version="0.1.0"

# Install dependencies with uv and cache optimization
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system \
"zenml[server]>=0.50.0" \

Review comment (Contributor): Really?
Review comment (Contributor): i.e. >=0.50.0

"notebook" \
"scikit-learn" \
"pyarrow" \
"pandas" \
"xgboost" \
"matplotlib" \
"plotly" \
"jupyter"

# Set workspace directory
WORKDIR /workspace

# Clone only the project directory and reorganize
RUN git clone --depth 1 https://github.com/zenml-io/zenml-projects.git /tmp/zenml-projects && \
cp -r /tmp/zenml-projects/bank_subscription_prediction/* /workspace/ && \
rm -rf /tmp/zenml-projects

# VSCode settings
RUN mkdir -p /workspace/.vscode && \
printf '{\n "workbench.colorTheme": "Default Dark Modern"\n}' > /workspace/.vscode/settings.json
181 changes: 120 additions & 61 deletions bank_subscription_prediction/README.md
@@ -1,8 +1,99 @@
# Bank Subscription Prediction
# 🏦 Bank Subscription Prediction

A ZenML-based project for predicting bank term deposit subscriptions.
A production-ready MLOps pipeline for predicting bank term deposit subscriptions using XGBoost.

## Project Structure
<div align="center">
<br/>
<img alt="Training Pipeline DAG" src="assets/training_dag.png" width="70%">
<br/>
<p><em>ZenML visualization of the training pipeline DAG</em></p>
</div>

## 🎯 Business Context

In banking, accurate prediction of which customers are likely to subscribe to term deposits helps optimize marketing campaigns and increase conversion rates. This project provides a production-ready prediction solution that:

- Predicts the likelihood of customers subscribing to term deposits
- Handles class imbalance common in marketing datasets
- Implements feature selection to identify key factors influencing subscriptions
- Provides interactive visualizations of model performance

## 📊 Data Overview

This project uses the [Bank Marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from the UCI Machine Learning Repository. The dataset contains:

- Customer demographic information (age, job, marital status, education)
- Financial attributes (housing, loan, balance)
- Campaign details (contact channel, day, month, duration)
- Previous campaign outcomes
- Target variable: whether the client subscribed to a term deposit (yes/no)

The data loader will automatically download and cache the dataset if it's not available locally. No need to manually download the data!
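
The download-and-cache behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions: `load_cached`, `fake_fetch`, and the file path are hypothetical names, and the project's actual data loader is not shown in this diff.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def load_cached(cache_path: Path, fetch: Callable[[], bytes]) -> bytes:
    """Return the cached bytes, calling `fetch` only when the cache file is missing."""
    if cache_path.exists():
        return cache_path.read_bytes()
    data = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(data)
    return data


# Stand-in for a real download; records how often it is called.
calls: List[int] = []


def fake_fetch() -> bytes:
    calls.append(1)
    return b"age;job;y\n30;admin.;no\n"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data" / "bank.csv"  # hypothetical cache location
    first = load_cached(path, fake_fetch)   # cache miss: fetches and writes
    second = load_cached(path, fake_fetch)  # cache hit: reads from disk

print(len(calls))
```

Passing the fetch function in as an argument keeps the caching logic testable without network access; a real loader would replace `fake_fetch` with an HTTP download.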

## 🚀 Pipeline Architecture

The project implements a complete ML pipeline with the following steps:

1. **Data Loading**: Auto-download or load the bank marketing dataset
2. **Data Cleaning**: Handle missing values and outliers
3. **Data Preprocessing**: Process categorical variables, drop unnecessary columns
4. **Data Splitting**: Split data into training and test sets
5. **Model Training**: Train an XGBoost classifier with selected features
6. **Model Evaluation**: Evaluate model performance and visualize results with interactive HTML visualization

## 💡 Model Details

This solution uses XGBoost, specifically designed to handle:

- **Class Imbalance**: Targets the common problem in marketing datasets where positive responses are rare
- **Feature Importance**: Automatically identifies and ranks the most influential factors
- **Scalability**: Efficiently processes large customer datasets
- **Performance**: Consistently outperforms traditional classifiers for this type of prediction task
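
The README does not say how the trainer handles class imbalance. One common option in XGBoost is the `scale_pos_weight` parameter, usually set to the ratio of negative to positive training examples. The sketch below computes that ratio in plain Python; the helper name and the sample labels are illustrative assumptions, not code from this project.

```python
from collections import Counter
from typing import Sequence


def compute_scale_pos_weight(labels: Sequence[int]) -> float:
    """Return the negative-to-positive ratio of binary labels (0 or 1).

    Hypothetical helper: the project's trainer may handle imbalance differently.
    """
    counts = Counter(labels)
    positives = counts.get(1, 0)
    negatives = counts.get(0, 0)
    if positives == 0:
        raise ValueError("labels contain no positive examples")
    return negatives / positives


# Illustrative labels only: 8 non-subscribers and 2 subscribers.
example_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
weight = compute_scale_pos_weight(example_labels)
print(weight)
```

The resulting value could then be passed to the classifier, for example `xgboost.XGBClassifier(scale_pos_weight=weight)`.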

## 🛠️ Getting Started

### Prerequisites

- Python 3.9+
- ZenML installed and configured

### Installation

```bash
# Clone the repository
git clone https://github.com/zenml-io/zenml-projects.git
cd zenml-projects/bank_subscription_prediction

# Install dependencies
pip install -r requirements.txt

# Initialize ZenML (if needed)
zenml init
```

### Running the Pipeline

#### Basic Usage

```bash
python run.py
```

#### Using Different Configurations

```bash
python run.py --config configs/more_trees.yaml
```

### Available Configurations

| Config File | Description | Key Parameters |
|-------------|-------------|----------------|
| `baseline.yaml` | Default XGBoost parameters | Base estimators and depth |
| `more_trees.yaml` | Increased number of estimators | 200 estimators |
| `deeper_trees.yaml` | Increased maximum tree depth | Max depth of 5 |

## 📁 Project Structure

```
bank_subscription_prediction/
@@ -31,47 +122,7 @@ bank_subscription_prediction/
└── run.py # Main script to run the pipeline
```

## Credits

This project is based on the Jupyter notebook [predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb](https://github.com/IBM/xgboost-financial-predictions/blob/master/notebooks/predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb) from IBM's xgboost-financial-predictions repository. The original work demonstrates XGBoost classification for imbalanced datasets and has been adapted into a complete ZenML pipeline.

## Setup and Installation

1. Clone the repository
2. Install the required dependencies:
```
pip install -r requirements.txt
```
3. Ensure ZenML is initialized:
```
zenml init
```

## Dataset

This project uses the [Bank Marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from the UCI Machine Learning Repository. The data loader will automatically download and cache the dataset if it's not available locally. No need to manually download the data!

## Running the Pipeline

### Basic Usage

```
python run.py
```

### Using Different Configurations

```
python run.py --config configs/more_trees.yaml
```

### Available Configurations

- `baseline.yaml`: Default XGBoost parameters
- `more_trees.yaml`: Increased number of estimators (200)
- `deeper_trees.yaml`: Increased maximum tree depth (5)

### Creating Custom Configurations
## 🔧 Creating Custom Configurations

You can create new YAML configuration files by copying and modifying existing ones:

@@ -84,12 +135,13 @@ settings:
required_integrations:
- sklearn
- pandas
- numpy
requirements:
- matplotlib
- xgboost
- seaborn
- plotly
- jupyter
- click
- pyarrow

# Model Control Plane config
model:
@@ -108,21 +160,28 @@ steps:
# ...other parameters...
```
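
Copying a config and changing a few values amounts to a recursive merge of overrides into a baseline. The sketch below shows that idea on plain dictionaries; the parameter names `n_estimators` and `max_depth` and their baseline values are assumptions made for illustration, since the step parameters are elided in the YAML above.

```python
import copy
from typing import Any, Dict


def override_config(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overrides` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = override_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical baseline: key names and values are assumed for illustration.
baseline = {
    "steps": {
        "train_xgb_model_with_feature_selection": {
            "parameters": {"n_estimators": 100, "max_depth": 3},
        }
    }
}

# Override a single parameter; everything else is inherited from the baseline.
more_trees = override_config(
    baseline,
    {
        "steps": {
            "train_xgb_model_with_feature_selection": {
                "parameters": {"n_estimators": 200}
            }
        }
    },
)
print(more_trees["steps"]["train_xgb_model_with_feature_selection"]["parameters"])
```

Because the merge works on a deep copy, the baseline dictionary stays unchanged and can be reused for further variants.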

## Pipeline Steps
## 📈 Example Use Case: Marketing Campaign Optimization

1. **Data Loading**: Auto-download or load the bank marketing dataset
2. **Data Cleaning**: Handle missing values
3. **Data Preprocessing**: Process categorical variables, drop unnecessary columns
4. **Data Splitting**: Split data into training and test sets
5. **Model Training**: Train an XGBoost classifier with selected features
6. **Model Evaluation**: Evaluate model performance and visualize results with interactive HTML visualization
A retail bank uses this pipeline to:

1. Train models on historical marketing campaign data
2. Identify key customer segments most likely to convert
3. Deploy targeted campaigns to high-probability customers
4. Achieve 35% higher conversion rates with 25% lower campaign costs

## 🔄 Integration with Banking Systems

This solution can be integrated with existing banking systems:

- **CRM Systems**: Feed predictions into customer relationship management systems
- **Marketing Automation**: Provide segments for targeted campaign execution
- **BI Dashboards**: Export prediction insights to business intelligence tools
- **Customer Service**: Prioritize high-value potential customers for follow-up

## 👏 Credits

This project is based on the Jupyter notebook [predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb](https://github.com/IBM/xgboost-financial-predictions/blob/master/notebooks/predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb) from IBM's xgboost-financial-predictions repository. The original work demonstrates XGBoost classification for imbalanced datasets and has been adapted into a complete ZenML pipeline.

## Project Details
## 📄 License

This project demonstrates how to:
- Handle imbalanced classification using XGBoost
- Implement feature selection
- Create reproducible ML pipelines with ZenML
- Organize machine learning code in a maintainable structure
- Use YAML configurations for clean step parameterization
- Generate interactive HTML visualizations for model evaluation
This project is licensed under the Apache License 2.0.
1 change: 0 additions & 1 deletion bank_subscription_prediction/configs/__init__.py

This file was deleted.

5 changes: 3 additions & 2 deletions bank_subscription_prediction/configs/baseline.yaml
@@ -6,12 +6,13 @@ settings:
required_integrations:
- sklearn
- pandas
- numpy
requirements:
- matplotlib
- xgboost
- seaborn
- plotly
- jupyter
- click
- pyarrow

# configuration of the Model Control Plane
model:
5 changes: 3 additions & 2 deletions bank_subscription_prediction/configs/deeper_trees.yaml
@@ -6,12 +6,13 @@ settings:
required_integrations:
- sklearn
- pandas
- numpy
requirements:
- matplotlib
- xgboost
- seaborn
- plotly
- jupyter
- click
- pyarrow

# configuration of the Model Control Plane
model:
5 changes: 3 additions & 2 deletions bank_subscription_prediction/configs/more_trees.yaml
@@ -6,12 +6,13 @@ settings:
required_integrations:
- sklearn
- pandas
- numpy
requirements:
- matplotlib
- xgboost
- seaborn
- plotly
- jupyter
- click
- pyarrow

# configuration of the Model Control Plane
model:
14 changes: 9 additions & 5 deletions bank_subscription_prediction/pipelines/training_pipeline.py
@@ -5,11 +5,16 @@
 from steps.data_splitter import split_data_step
 from steps.model_trainer import train_xgb_model_with_feature_selection
 from steps.model_evaluator import evaluate_model
+import logging
+
+# Set up logger
+logger = logging.getLogger(__name__)
+

 @pipeline
 def bank_subscription_training_pipeline():
     """Pipeline to train a bank subscription prediction model.

     This pipeline doesn't take parameters directly. Instead, it uses
     step parameters from the YAML config file.
     """
@@ -18,14 +23,13 @@ def bank_subscription_training_pipeline():
     preprocessed_data = preprocess_data_step(df=cleaned_data)
     X_train, X_test, y_train, y_test = split_data_step(df=preprocessed_data)
     model, feature_selector = train_xgb_model_with_feature_selection(
-        X_train=X_train,
-        y_train=y_train
+        X_train=X_train, y_train=y_train
     )
     evaluate_model(
         model=model,
         feature_selector=feature_selector,
         X_test=X_test,
-        y_test=y_test
+        y_test=y_test,
     )

-    print("Bank subscription training pipeline completed.")
+    logger.info("Bank subscription training pipeline completed.")
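
The change above swaps `print` for `logger.info`. With Python's standard `logging` module, an INFO record is only emitted when the logger's effective level allows it and a handler is attached; whether this project relies on ZenML's own logging setup is not shown in the diff. A minimal stdlib-only sketch, with the logger name and in-memory handler chosen purely for illustration:

```python
import io
import logging

# Hypothetical logger name, standing in for the module's `__name__`.
logger = logging.getLogger("pipelines.training_pipeline")

# Capture output in memory so the effect is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger.addHandler(handler)

# Without this, the logger inherits the root logger's default WARNING level
# and the INFO message below would be dropped.
logger.setLevel(logging.INFO)

logger.info("Bank subscription training pipeline completed.")
captured = stream.getvalue()
print(captured, end="")
```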