Commit 58b08f2

Merge pull request #206 from zenml-io/project/predict_financial_timeseries
New project: Bank Subscription Prediction
2 parents fb4ea5a + 0c6efc9 commit 58b08f2

28 files changed, 45877 additions and 23 deletions

.typos.toml

Lines changed: 1 addition & 0 deletions
@@ -56,6 +56,7 @@ mape = "mape"
 yhat = "yhat"
 yhat_lower = "yhat_lower"
 yhat_upper = "yhat_upper"
+fpr = "fpr"
 
 [default]
 locale = "en-us"
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+# Sandbox base image
+FROM zenmldocker/zenml-sandbox:latest
+
+# Install uv from official distroless image
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+
+# Set uv environment variables for optimization
+ENV UV_SYSTEM_PYTHON=1
+ENV UV_COMPILE_BYTECODE=1
+
+# Project metadata
+LABEL project_name="bank_subscription_prediction"
+LABEL project_version="0.1.0"
+
+# Install dependencies with uv and cache optimization
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system \
+    "zenml[server]>=0.80.0" \
+    "notebook" \
+    "scikit-learn" \
+    "pyarrow" \
+    "pandas" \
+    "xgboost" \
+    "matplotlib" \
+    "plotly" \
+    "jupyter"
+
+# Set workspace directory
+WORKDIR /workspace
+
+# Clone only the project directory and reorganize
+RUN git clone --depth 1 https://github.com/zenml-io/zenml-projects.git /tmp/zenml-projects && \
+    cp -r /tmp/zenml-projects/bank_subscription_prediction/* /workspace/ && \
+    rm -rf /tmp/zenml-projects
+
+# VSCode settings
+RUN mkdir -p /workspace/.vscode && \
+    printf '{\n "workbench.colorTheme": "Default Dark Modern"\n}' > /workspace/.vscode/settings.json
Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
+# 🏦 Bank Subscription Prediction
+
+A production-ready MLOps pipeline for predicting bank term deposit subscriptions using XGBoost.
+
+<div align="center">
+<br/>
+<img alt="Training Pipeline DAG" src="assets/training_dag.png" width="70%">
+<br/>
+<p><em>ZenML visualization of the training pipeline DAG</em></p>
+</div>
+
+## 🎯 Business Context
+
+In banking, accurate prediction of which customers are likely to subscribe to term deposits helps optimize marketing campaigns and increase conversion rates. This project provides a production-ready prediction solution that:
+
+- Predicts the likelihood of customers subscribing to term deposits
+- Handles class imbalance common in marketing datasets
+- Implements feature selection to identify key factors influencing subscriptions
+- Provides interactive visualizations of model performance
+
+## 📊 Data Overview
+
+This project uses the [Bank Marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from the UCI Machine Learning Repository. The dataset contains:
+
+- Customer demographic information (age, job, marital status, education)
+- Financial attributes (housing, loan, balance)
+- Campaign details (contact channel, day, month, duration)
+- Previous campaign outcomes
+- Target variable: whether the client subscribed to a term deposit (yes/no)
+
+The data loader will automatically download and cache the dataset if it's not available locally. No need to manually download the data!
+
+## 🚀 Pipeline Architecture
+
+The project implements a complete ML pipeline with the following steps:
+
+1. **Data Loading**: Auto-download or load the bank marketing dataset
+2. **Data Cleaning**: Handle missing values and outliers
+3. **Data Preprocessing**: Process categorical variables, drop unnecessary columns
+4. **Data Splitting**: Split data into training and test sets
+5. **Model Training**: Train an XGBoost classifier with selected features
+6. **Model Evaluation**: Evaluate model performance and visualize results with interactive HTML visualization
+
+<div align="center">
+<br/>
+<img alt="Evaluation visualization" src="assets/eval_vis.png" width="70%">
+<br/>
+<p><em>ZenML visualization of the evals</em></p>
+</div>
+
+## 💡 Model Details
+
+This solution uses XGBoost, specifically designed to handle:
+
+- **Class Imbalance**: Targets the common problem in marketing datasets where positive responses are rare
+- **Feature Importance**: Automatically identifies and ranks the most influential factors
+- **Scalability**: Efficiently processes large customer datasets
+- **Performance**: Consistently outperforms traditional classifiers for this type of prediction task
+
+## 🛠️ Getting Started
+
+### Prerequisites
+
+- Python 3.9+
+- ZenML installed and configured
+
+### Installation
+
+```bash
+# Clone the repository
+git clone https://github.com/zenml-io/zenml-projects.git
+cd zenml-projects/bank_subscription_prediction
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Initialize ZenML (if needed)
+zenml init
+```
+
+### Running the Pipeline
+
+#### Basic Usage
+
+```bash
+python run.py
+```
+
+#### Using Different Configurations
+
+```bash
+python run.py --config configs/more_trees.yaml
+```
+
+### Available Configurations
+
+| Config File | Description | Key Parameters |
+|-------------|-------------|----------------|
+| `baseline.yaml` | Default XGBoost parameters | Base estimators and depth |
+| `more_trees.yaml` | Increased number of estimators | 200 estimators |
+| `deeper_trees.yaml` | Increased maximum tree depth | Max depth of 5 |
+
+## 📁 Project Structure
+
+```
+bank_subscription_prediction/
+├── configs/                  # YAML Configuration files
+│   ├── __init__.py
+│   ├── baseline.yaml         # Baseline experiment config
+│   ├── more_trees.yaml       # Config with more trees
+│   └── deeper_trees.yaml     # Config with deeper trees
+├── pipelines/                # ZenML pipeline definitions
+│   ├── __init__.py
+│   └── training_pipeline.py
+├── steps/                    # ZenML pipeline steps
+│   ├── __init__.py
+│   ├── data_loader.py
+│   ├── data_cleaner.py
+│   ├── data_preprocessor.py
+│   ├── data_splitter.py
+│   ├── model_trainer.py
+│   └── model_evaluator.py
+├── utils/                    # Utility functions and helpers
+│   ├── __init__.py
+│   └── model_utils.py
+├── __init__.py
+├── requirements.txt          # Project dependencies
+├── README.md                 # Project documentation
+└── run.py                    # Main script to run the pipeline
+```
+
+## 🔧 Creating Custom Configurations
+
+You can create new YAML configuration files by copying and modifying existing ones:
+
+```yaml
+# my_custom_config.yaml
+# Start with copying an existing config and modify the values
+# environment configuration
+settings:
+  docker:
+    required_integrations:
+      - sklearn
+      - pandas
+      - numpy
+    requirements:
+      - matplotlib
+      - xgboost
+      - plotly
+      - click
+      - pyarrow
+
+# Model Control Plane config
+model:
+  name: bank_subscription_classifier
+  version: 0.1.0
+  license: MIT
+  description: A bank term deposit subscription classifier
+  tags: ["bank_marketing", "classifier", "xgboost"]
+
+# Custom step parameters
+steps:
+  # ...other step params...
+  train_xgb_model_with_feature_selection:
+    n_estimators: 300
+    max_depth: 4
+    # ...other parameters...
+```
+
+## 📈 Example Use Case: Marketing Campaign Optimization
+
+A retail bank uses this pipeline to:
+
+1. Train models on historical marketing campaign data
+2. Identify key customer segments most likely to convert
+3. Deploy targeted campaigns to high-probability customers
+4. Achieve 35% higher conversion rates with 25% lower campaign costs
+
+## 🔄 Integration with Banking Systems
+
+This solution can be integrated with existing banking systems:
+
+- **CRM Systems**: Feed predictions into customer relationship management systems
+- **Marketing Automation**: Provide segments for targeted campaign execution
+- **BI Dashboards**: Export prediction insights to business intelligence tools
+- **Customer Service**: Prioritize high-value potential customers for follow-up
+
+## 👏 Credits
+
+This project is based on the Jupyter notebook [predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb](https://github.com/IBM/xgboost-financial-predictions/blob/master/notebooks/predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb) from IBM's xgboost-financial-predictions repository. The original work demonstrates XGBoost classification for imbalanced datasets and has been adapted into a complete ZenML pipeline.
+
+## 📄 License
+
+This project is licensed under the Apache License 2.0.
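
Reviewer note on the README's "Creating Custom Configurations" section: the copy-and-modify workflow amounts to applying a small set of overrides on top of an existing config. The sketch below illustrates that idea with plain dictionaries. It is not part of this commit; the helper `apply_overrides` is hypothetical, and only the parameter values (100/3 in the baseline, 300/4 in the README example) come from the diff.

```python
from copy import deepcopy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` merged in recursively.

    Hypothetical helper for illustration; the project itself simply
    uses separate YAML files.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge one level deeper.
            merged[key] = apply_overrides(merged[key], value)
        else:
            # Scalars and lists replace the base value outright.
            merged[key] = deepcopy(value)
    return merged


# Subset of the baseline training parameters shown in this commit.
baseline = {
    "steps": {
        "train_xgb_model_with_feature_selection": {
            "learning_rate": 0.1,
            "n_estimators": 100,
            "max_depth": 3,
        }
    }
}

# Override values taken from the README's custom-config example.
custom = apply_overrides(
    baseline,
    {
        "steps": {
            "train_xgb_model_with_feature_selection": {
                "n_estimators": 300,
                "max_depth": 4,
            }
        }
    },
)
```

Because the merge works on a deep copy, `baseline` is left unchanged, which mirrors keeping `baseline.yaml` intact while experimenting in a new file.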
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"""Bank Subscription Prediction Project using ZenML."""
Binary file not shown (329 KB)
Binary file not shown (256 KB)
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# Baseline experiment configuration
+
+# environment configuration
+settings:
+  docker:
+    required_integrations:
+      - sklearn
+      - pandas
+      - numpy
+    requirements:
+      - matplotlib
+      - xgboost
+      - plotly
+      - click
+      - pyarrow
+
+# configuration of the Model Control Plane
+model:
+  name: bank_subscription_classifier
+  version: 0.1.0
+  license: MIT
+  description: A bank term deposit subscription classifier
+  tags: ["bank_marketing", "classifier", "xgboost"]
+
+# Step-specific parameters
+steps:
+  # Data loading parameters
+  load_data:
+    csv_file_path: "bank.csv"
+
+  # Data splitting parameters
+  split_data_step:
+    test_size: 0.2
+    random_state: 42
+    stratify_col: "y"
+
+  # Model training parameters
+  train_xgb_model_with_feature_selection:
+    learning_rate: 0.1
+    n_estimators: 100
+    max_depth: 3
+    min_child_weight: 1
+    gamma: 0
+    subsample: 0.8
+    colsample_bytree: 0.8
+    objective: "binary:logistic"
+    scale_pos_weight: 1 # Will be calculated dynamically if not overridden
+    random_state: 42
+    feature_selection_threshold: "median"
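
Reviewer note on `scale_pos_weight`: the inline comment says the value "will be calculated dynamically", but the code that does so is not visible in this part of the diff. A common heuristic, suggested in the XGBoost parameter documentation, is the ratio of negative to positive training examples. The sketch below assumes that heuristic and assumes labels encoded as 0/1; the function name and the sample label counts are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def compute_scale_pos_weight(labels: Iterable[int]) -> float:
    """Ratio of negative (0) to positive (1) labels.

    Hypothetical helper: the project's own calculation may differ.
    """
    counts = Counter(labels)
    negatives, positives = counts[0], counts[1]
    if positives == 0:
        raise ValueError("no positive examples; the ratio is undefined")
    return negatives / positives


# Illustrative imbalanced sample: 88 "no" and 12 "yes" responses.
labels = [0] * 88 + [1] * 12
weight = compute_scale_pos_weight(labels)
```

A weight above 1 tells the booster to penalize errors on the rare positive class more heavily, which is the usual motivation for overriding the default of 1 on imbalanced marketing data.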
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# Deeper trees experiment configuration
+
+# environment configuration
+settings:
+  docker:
+    required_integrations:
+      - sklearn
+      - pandas
+      - numpy
+    requirements:
+      - matplotlib
+      - xgboost
+      - plotly
+      - click
+      - pyarrow
+
+# configuration of the Model Control Plane
+model:
+  name: bank_subscription_classifier
+  version: 0.1.0
+  license: MIT
+  description: A bank term deposit subscription classifier
+  tags: ["bank_marketing", "classifier", "xgboost"]
+
+# Step-specific parameters
+steps:
+  # Data loading parameters
+  load_data:
+    csv_file_path: "bank.csv"
+
+  # Data splitting parameters
+  split_data_step:
+    test_size: 0.2
+    random_state: 42
+    stratify_col: "y"
+
+  # Model training parameters with deeper trees
+  train_xgb_model_with_feature_selection:
+    learning_rate: 0.1
+    n_estimators: 100
+    max_depth: 5 # Deeper trees than baseline
+    min_child_weight: 1
+    gamma: 0
+    subsample: 0.8
+    colsample_bytree: 0.8
+    objective: "binary:logistic"
+    scale_pos_weight: 1
+    random_state: 42
+    feature_selection_threshold: "median"
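
Reviewer note: the experiment configs in this commit are near-identical, so it can help to see only what differs. The sketch below compares a subset of the training parameters from the baseline and deeper-trees configs. The values are copied from the diff; the helper `changed_params` is hypothetical and not part of the project.

```python
def changed_params(base: dict, variant: dict) -> dict:
    """Map each differing key to its (base, variant) value pair."""
    keys = sorted(set(base) | set(variant))
    return {
        key: (base.get(key), variant.get(key))
        for key in keys
        if base.get(key) != variant.get(key)
    }


# Training parameters copied from the two configs (subset).
baseline = {"learning_rate": 0.1, "n_estimators": 100, "max_depth": 3, "subsample": 0.8}
deeper_trees = {"learning_rate": 0.1, "n_estimators": 100, "max_depth": 5, "subsample": 0.8}

print(changed_params(baseline, deeper_trees))  # {'max_depth': (3, 5)}
```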
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# More trees experiment configuration
+
+# environment configuration
+settings:
+  docker:
+    required_integrations:
+      - sklearn
+      - pandas
+      - numpy
+    requirements:
+      - matplotlib
+      - xgboost
+      - plotly
+      - click
+      - pyarrow
+
+# configuration of the Model Control Plane
+model:
+  name: bank_subscription_classifier
+  version: 0.1.0
+  license: MIT
+  description: A bank term deposit subscription classifier
+  tags: ["bank_marketing", "classifier", "xgboost"]
+
+# Step-specific parameters
+steps:
+  # Data loading parameters
+  load_data:
+    csv_file_path: "bank.csv"
+
+  # Data splitting parameters
+  split_data_step:
+    test_size: 0.2
+    random_state: 42
+    stratify_col: "y"
+
+  # Model training parameters with more trees
+  train_xgb_model_with_feature_selection:
+    learning_rate: 0.1
+    n_estimators: 200 # More trees than baseline
+    max_depth: 3
+    min_child_weight: 1
+    gamma: 0
+    subsample: 0.8
+    colsample_bytree: 0.8
+    objective: "binary:logistic"
+    scale_pos_weight: 1
+    random_state: 42
+    feature_selection_threshold: "median"
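
Reviewer note on `feature_selection_threshold: "median"`: this part of the diff does not show how the training step interprets the value. scikit-learn's `SelectFromModel` accepts `threshold="median"` and keeps features whose importance is greater than or equal to the median importance. Assuming the step behaves similarly, the rule can be sketched in plain Python as follows; the function name and the importance scores are invented for illustration, and the feature names are borrowed from the README's data overview.

```python
from statistics import median


def select_by_median(importances: dict[str, float]) -> list[str]:
    """Keep features whose importance is at or above the median.

    Hypothetical helper mirroring the "median" threshold rule.
    """
    cutoff = median(importances.values())
    return [name for name, score in importances.items() if score >= cutoff]


# Made-up importance scores, not results from the pipeline.
importances = {
    "duration": 0.40,
    "balance": 0.25,
    "age": 0.20,
    "day": 0.10,
    "month": 0.05,
}
selected = select_by_median(importances)
```

With an odd number of features the median feature itself is kept, so roughly half of the features (rounded up) survive the selection.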
