128 changes: 128 additions & 0 deletions bank_subscription_prediction/README.md
@@ -0,0 +1,128 @@
# Bank Subscription Prediction

A ZenML-based project for predicting bank term deposit subscriptions.

## Project Structure

```
bank_subscription_prediction/
├── configs/                 # YAML Configuration files
│   ├── __init__.py
│   ├── baseline.yaml        # Baseline experiment config
│   ├── more_trees.yaml      # Config with more trees
│   └── deeper_trees.yaml    # Config with deeper trees
├── pipelines/               # ZenML pipeline definitions
│   ├── __init__.py
│   └── training_pipeline.py
├── steps/                   # ZenML pipeline steps
│   ├── __init__.py
│   ├── data_loader.py
│   ├── data_cleaner.py
│   ├── data_preprocessor.py
│   ├── data_splitter.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── utils/                   # Utility functions and helpers
│   ├── __init__.py
│   └── model_utils.py
├── __init__.py
├── requirements.txt         # Project dependencies
├── README.md                # Project documentation
└── run.py                   # Main script to run the pipeline
```

## Credits

This project is based on the Jupyter notebook [predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb](https://github.com/IBM/xgboost-financial-predictions/blob/master/notebooks/predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb) from IBM's xgboost-financial-predictions repository. The original work demonstrates XGBoost classification for imbalanced datasets and has been adapted into a complete ZenML pipeline.

## Setup and Installation

1. Clone the repository
2. Install the required dependencies:
```
pip install -r requirements.txt
```
3. Ensure ZenML is initialized:
```
zenml init
```

## Dataset

This project uses the [Bank Marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from the UCI Machine Learning Repository. The data loader will automatically download and cache the dataset if it's not available locally. No need to manually download the data!
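
A minimal sketch of what such a loader step could look like (the UCI archive URL, cache behavior, and semicolon separator are assumptions, not necessarily what `data_loader.py` implements):

```python
import os
import urllib.request
import zipfile

import pandas as pd
from zenml import step


@step
def load_data(csv_file_path: str = "bank.csv") -> pd.DataFrame:
    """Load the bank marketing CSV, downloading it from UCI if it is missing."""
    if not os.path.exists(csv_file_path):
        # Assumed archive location; the real loader may use a different URL or cache dir.
        url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
        urllib.request.urlretrieve(url, "bank.zip")
        with zipfile.ZipFile("bank.zip") as archive:
            archive.extract("bank.csv")
    # The UCI files are semicolon-separated.
    return pd.read_csv(csv_file_path, sep=";")
```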

## Running the Pipeline

### Basic Usage

```
python run.py
```

### Using Different Configurations

```
python run.py --config configs/more_trees.yaml
```
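
Internally, `run.py` presumably just forwards the chosen YAML to the pipeline; a minimal sketch of that wiring, assuming the pipeline object is called `training_pipeline` and ZenML's `with_options(config_path=...)` is used:

```python
import argparse

from pipelines.training_pipeline import training_pipeline  # assumed pipeline name

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="configs/baseline.yaml")
    args = parser.parse_args()

    # ZenML picks up settings, model metadata and step parameters from the YAML.
    training_pipeline.with_options(config_path=args.config)()
```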

### Available Configurations

- `baseline.yaml`: Default XGBoost parameters
- `more_trees.yaml`: Increased number of estimators (200)
- `deeper_trees.yaml`: Increased maximum tree depth (5)

### Creating Custom Configurations

You can create new YAML configuration files by copying and modifying existing ones:

```yaml
# my_custom_config.yaml
# Start by copying an existing config and modifying the values

# environment configuration
settings:
  docker:
    required_integrations:
      - sklearn
      - pandas
    requirements:
      - matplotlib
      - xgboost
      - seaborn
      - plotly
      - jupyter

# Model Control Plane config
model:
  name: bank_subscription_classifier
  version: 0.1.0
  license: MIT
  description: A bank term deposit subscription classifier
  tags: ["bank_marketing", "classifier", "xgboost"]

# Custom step parameters
steps:
  # ...other step params...
  train_xgb_model_with_feature_selection:
    n_estimators: 300
    max_depth: 4
    # ...other parameters...
```
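
Then run the pipeline with the new configuration:

```
python run.py --config configs/my_custom_config.yaml
```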

## Pipeline Steps

1. **Data Loading**: Auto-download or load the bank marketing dataset
2. **Data Cleaning**: Handle missing values
3. **Data Preprocessing**: Process categorical variables, drop unnecessary columns
4. **Data Splitting**: Split data into training and test sets
5. **Model Training**: Train an XGBoost classifier with selected features
6. **Model Evaluation**: Evaluate model performance and visualize results with interactive HTML visualization
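
A minimal sketch of how these six steps might be wired together in `pipelines/training_pipeline.py` (the `load_data`, `split_data_step`, and `train_xgb_model_with_feature_selection` names come from the configs; the remaining step names and signatures are assumptions):

```python
from zenml import pipeline

from steps.data_loader import load_data
from steps.data_cleaner import clean_data            # assumed step name
from steps.data_preprocessor import preprocess_data  # assumed step name
from steps.data_splitter import split_data_step
from steps.model_evaluator import evaluate_model     # assumed step name
from steps.model_trainer import train_xgb_model_with_feature_selection


@pipeline
def training_pipeline():
    """Chain the steps; ZenML infers the artifact dependencies between them."""
    raw = load_data()
    cleaned = clean_data(raw)
    processed = preprocess_data(cleaned)
    X_train, X_test, y_train, y_test = split_data_step(processed)
    model = train_xgb_model_with_feature_selection(X_train, y_train)
    evaluate_model(model, X_test, y_test)
```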

## Project Details

This project demonstrates how to:
- Handle imbalanced classification using XGBoost
- Implement feature selection
- Create reproducible ML pipelines with ZenML
- Organize machine learning code in a maintainable structure
- Use YAML configurations for clean step parameterization
- Generate interactive HTML visualizations for model evaluation
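
For the last point, the evaluator step could return an `HTMLString` (ZenML's HTML artifact type) so the dashboard renders an interactive Plotly figure inline; the metric choice and step signature in this sketch are illustrative assumptions rather than the project's actual evaluator:

```python
import pandas as pd
import plotly.express as px
from sklearn.metrics import confusion_matrix
from zenml import step
from zenml.types import HTMLString


@step
def evaluate_model(y_test: pd.Series, y_pred: pd.Series) -> HTMLString:
    """Render the confusion matrix as an interactive Plotly figure embedded in HTML."""
    cm = confusion_matrix(y_test, y_pred)
    fig = px.imshow(
        cm,
        text_auto=True,
        labels=dict(x="Predicted", y="Actual"),
        title="Confusion matrix",
    )
    # HTMLString artifacts are rendered inline in the ZenML dashboard.
    return HTMLString(fig.to_html())
```
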
1 change: 1 addition & 0 deletions bank_subscription_prediction/__init__.py
@@ -0,0 +1 @@
"""Bank Subscription Prediction Project using ZenML."""
1 change: 1 addition & 0 deletions bank_subscription_prediction/configs/__init__.py
@@ -0,0 +1 @@
"""Configuration settings for the Bank Subscription Prediction project."""
48 changes: 48 additions & 0 deletions bank_subscription_prediction/configs/baseline.yaml
@@ -0,0 +1,48 @@
# Baseline experiment configuration

# environment configuration
settings:
  docker:
    required_integrations:
      - sklearn
      - pandas
    requirements:
      - matplotlib
      - xgboost
      - seaborn
      - plotly
      - jupyter

# configuration of the Model Control Plane
model:
  name: bank_subscription_classifier
  version: 0.1.0
  license: MIT
  description: A bank term deposit subscription classifier
  tags: ["bank_marketing", "classifier", "xgboost"]

# Step-specific parameters
steps:
  # Data loading parameters
  load_data:
    csv_file_path: "bank.csv"

  # Data splitting parameters
  split_data_step:
    test_size: 0.2
    random_state: 42
    stratify_col: "y"

  # Model training parameters
  train_xgb_model_with_feature_selection:
    learning_rate: 0.1
    n_estimators: 100
    max_depth: 3
    min_child_weight: 1
    gamma: 0
    subsample: 0.8
    colsample_bytree: 0.8
    objective: "binary:logistic"
    scale_pos_weight: 1  # Will be calculated dynamically if not overridden
    random_state: 42
    feature_selection_threshold: "median"
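
These training parameters are consumed by the trainer step; a minimal sketch of how `scale_pos_weight` could be computed dynamically and how `feature_selection_threshold` could drive scikit-learn's `SelectFromModel` (the step signature, class-ratio formula, and refit-on-selected-features flow are assumptions, and only a subset of the configured parameters is shown):

```python
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
from zenml import step


@step
def train_xgb_model_with_feature_selection(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    learning_rate: float = 0.1,
    n_estimators: int = 100,
    max_depth: int = 3,
    scale_pos_weight: float = 1.0,
    feature_selection_threshold: str = "median",
) -> XGBClassifier:
    """Fit XGBoost, keep features above the importance threshold, then refit."""
    # Assumes a 0/1 encoded target: if the config leaves the default of 1,
    # rebalance using the negative/positive class ratio of the training labels.
    if scale_pos_weight == 1.0:
        scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

    model = XGBClassifier(
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        max_depth=max_depth,
        scale_pos_weight=scale_pos_weight,
        objective="binary:logistic",
    )
    model.fit(X_train, y_train)

    # Drop features whose importance falls below the median, then refit on the rest.
    selector = SelectFromModel(model, threshold=feature_selection_threshold, prefit=True)
    X_selected = selector.transform(X_train)
    model.fit(X_selected, y_train)
    return model
```
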
48 changes: 48 additions & 0 deletions bank_subscription_prediction/configs/deeper_trees.yaml
@@ -0,0 +1,48 @@
# Deeper trees experiment configuration

# environment configuration
settings:
  docker:
    required_integrations:
      - sklearn
      - pandas
    requirements:
      - matplotlib
      - xgboost
      - seaborn
      - plotly
      - jupyter

# configuration of the Model Control Plane
model:
  name: bank_subscription_classifier
  version: 0.1.0
  license: MIT
  description: A bank term deposit subscription classifier
  tags: ["bank_marketing", "classifier", "xgboost"]

# Step-specific parameters
steps:
  # Data loading parameters
  load_data:
    csv_file_path: "bank.csv"

  # Data splitting parameters
  split_data_step:
    test_size: 0.2
    random_state: 42
    stratify_col: "y"

  # Model training parameters with deeper trees
  train_xgb_model_with_feature_selection:
    learning_rate: 0.1
    n_estimators: 100
    max_depth: 5  # Deeper trees than baseline
    min_child_weight: 1
    gamma: 0
    subsample: 0.8
    colsample_bytree: 0.8
    objective: "binary:logistic"
    scale_pos_weight: 1
    random_state: 42
    feature_selection_threshold: "median"
48 changes: 48 additions & 0 deletions bank_subscription_prediction/configs/more_trees.yaml
@@ -0,0 +1,48 @@
# More trees experiment configuration

# environment configuration
settings:
  docker:
    required_integrations:
      - sklearn
      - pandas
    requirements:
      - matplotlib
      - xgboost
      - seaborn
      - plotly
      - jupyter

# configuration of the Model Control Plane
model:
  name: bank_subscription_classifier
  version: 0.1.0
  license: MIT
  description: A bank term deposit subscription classifier
  tags: ["bank_marketing", "classifier", "xgboost"]

# Step-specific parameters
steps:
  # Data loading parameters
  load_data:
    csv_file_path: "bank.csv"

  # Data splitting parameters
  split_data_step:
    test_size: 0.2
    random_state: 42
    stratify_col: "y"

  # Model training parameters with more trees
  train_xgb_model_with_feature_selection:
    learning_rate: 0.1
    n_estimators: 200  # More trees than baseline
    max_depth: 3
    min_child_weight: 1
    gamma: 0
    subsample: 0.8
    colsample_bytree: 0.8
    objective: "binary:logistic"
    scale_pos_weight: 1
    random_state: 42
    feature_selection_threshold: "median"