Commit ea37386

Create ensemble.py
1 parent b7db874 commit ea37386

File tree

8 files changed
+1950 -303 lines changed
Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2654445f",
   "metadata": {},
   "source": [
    "## Heart Disease Prediction with Ensemble Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3844e01a",
   "metadata": {},
   "source": [
    "### 1. Introduction\n",
    "\n",
    "This Jupyter Notebook implements an ensemble learning approach to predict the presence of heart disease from a tabular dataset. The primary goal is to train an `EnsembleClassifier` from the `likelihood` library, evaluate its performance on test data, and generate a submission file for a prediction task. The notebook demonstrates how to build, train, and use an ensemble model for classification problems.\n",
    "\n",
    "### 2. Methodology\n",
    "\n",
    "The methodology consists of several key steps:\n",
    "\n",
    "1. **Data Loading & Preprocessing:**\n",
    "    * Loads training data (`train.csv`) and test data (`test.csv`) using pandas.\n",
    "    * Preprocesses the data, including:\n",
    "        * Converting the 'Sex' column to a categorical type.\n",
    "        * Replacing string values in the 'Heart Disease' column with numerical representations (1 for presence, 0 for absence).\n",
    "2. **Pipeline Creation:** A `Pipeline` object is created from an `ensemble_config.json` file, defining the sequence of preprocessing transformations applied to the data before model fitting.\n",
    "3. **Model Training & Fitting:** The `EnsembleClassifier` is initialized and trained on the preprocessed training data. A validation split (20%) is used to monitor performance during training.\n",
    "4. **Test Data Transformation:** The test data is transformed with the same pipeline used for training, ensuring consistent feature engineering.\n",
    "5. **Prediction Generation:** Predictions and class probabilities are generated on the transformed test data using the trained `EnsembleClassifier`.\n",
    "6. **Model Evaluation:** Individual models within the ensemble are evaluated by printing their F1-score and validation loss, giving insight into the performance of each component.\n",
    "7. **Submission File Generation:** A submission file (`sample_submission.csv`) is created containing the predicted probabilities for the 'Heart Disease' target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c30aa43e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "import sys\n",
    "\n",
    "# Add the parent directory to the search path so modules can be imported from there\n",
    "sys.path.insert(0, \"..\")\n",
    "\n",
    "# Disable warnings to avoid unnecessary messages during execution\n",
    "import warnings\n",
    "\n",
    "import math\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from likelihood.models.ensemble import EnsembleClassifier\n",
    "from likelihood import Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "209c6957",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"train.csv\")\n",
    "df[\"Heart Disease\"] = df[\"Heart Disease\"].replace({\"Presence\": 1, \"Absence\": 0})\n",
    "df[\"Sex\"] = df[\"Sex\"].astype(\"category\")\n",
    "etl_pipe = Pipeline(\"ensemble_config.json\")\n",
    "x_train, y_train, importances = etl_pipe.fit(df.copy().drop(columns=[\"id\"]))\n",
    "X_train = np.asarray(x_train.to_numpy()).astype(np.float32)\n",
    "y_train = y_train.reshape((y_train.size, 1))\n",
    "_train = (np.eye(y_train.max() + 1)[y_train]).reshape((-1, 2))\n",
    "y_train = np.asarray(_train).astype(np.float32)\n",
    "\n",
    "df_test = pd.read_csv(\"test.csv\")\n",
    "df_test[\"Sex\"] = df_test[\"Sex\"].astype(\"category\")\n",
    "X_test = etl_pipe.transform(df_test.copy().drop(columns=[\"id\"]))\n",
    "X_test.insert(0, \"id\", df_test[\"id\"])\n",
    "X_test = np.asarray(X_test.drop(columns=[\"id\"]).to_numpy()).astype(np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "85771f15",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training model 1/2...\n",
      "Training model 2/2...\n",
      "Ensemble trained with 2 models.\n",
      "Model 1: F1=0.845, Val Loss=0.3362\n",
      "Model 2: F1=0.842, Val Loss=0.3604\n"
     ]
    }
   ],
   "source": [
    "# Define parameter ranges for variation\n",
    "param_ranges = {\n",
    "    \"units\": (10, 20),\n",
    "    \"activation\": [\"selu\", \"relu\"],\n",
    "    \"num_layers\": (1, 5),\n",
    "    \"dropout\": (0.0, 0.5),\n",
    "}\n",
    "\n",
    "# Create and train the ensemble\n",
    "ensemble = EnsembleClassifier(\n",
    "    n_models=2, param_ranges=param_ranges, seed_range=(0, 100), voting_method=\"soft\", verbose=1\n",
    ")\n",
    "\n",
    "ensemble.fit(X_train, y_train, epochs=1, validation_split=0.2)\n",
    "ensemble.save(\"./ensemble\")\n",
    "ensemble = EnsembleClassifier.load(\"./ensemble\")\n",
    "\n",
    "# Predictions\n",
    "predictions = ensemble.predict(X_test)\n",
    "probabilities = ensemble.predict_proba(X_test)\n",
    "\n",
    "# Evaluate individual models\n",
    "scores = ensemble.get_model_scores()\n",
    "for score in scores:\n",
    "    print(\n",
    "        f\"Model {score['model_id']}: F1={score['f1_score']:.3f}, Val Loss={score['val_loss']:.4f}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "79174eb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = ensemble.predict_proba(X_test)\n",
    "\n",
    "df = pd.DataFrame(columns=[\"id\", \"Heart Disease\"])\n",
    "df[\"id\"] = df_test[\"id\"]\n",
    "df[\"Heart Disease\"] = pred[:, 1]\n",
    "# Truncate to 1 decimal place\n",
    "df[\"Heart Disease\"] = df[\"Heart Disease\"].apply(lambda x: float(math.floor(x * 10) / 10))\n",
    "\n",
    "df.to_csv(\"sample_submission.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6c29c58d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training model 1/2...\n",
      "Training model 2/2...\n",
      "Ensemble trained with 2 models.\n",
      "Model 1: F1=0.855, Val Loss=0.3694\n",
      "Model 2: F1=0.797, Val Loss=0.3046\n"
     ]
    }
   ],
   "source": [
    "ensemble.fit(X_train, y_train, epochs=1, validation_split=0.2)\n",
    "\n",
    "# Evaluate individual models\n",
    "scores = ensemble.get_model_scores()\n",
    "for score in scores:\n",
    "    print(\n",
    "        f\"Model {score['model_id']}: F1={score['f1_score']:.3f}, Val Loss={score['val_loss']:.4f}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3e4ba74",
   "metadata": {},
   "source": [
    "### 3. Analysis and Results\n",
    "\n",
    "The notebook uses an `EnsembleClassifier` to improve prediction accuracy over a single model. The following table summarizes the key results obtained during evaluation:\n",
    "\n",
    "| Model ID | F1-Score | Val Loss |\n",
    "| :------- | :--------- | :----------- |\n",
    "| *See Output* | *See Output* | *See Output* |\n",
    "\n",
    "**Note:** The actual F1-score and validation loss values are printed to the console during execution. They represent the performance of each individual model within the ensemble, as reported by `get_model_scores()`. The final prediction probabilities are then used to generate the submission file.\n",
    "\n",
    "### 4. Conclusions\n",
    "\n",
    "The ensemble learning approach implemented with the `EnsembleClassifier` is a viable strategy for predicting the presence of heart disease from tabular data, as evidenced by the F1-scores and validation losses reported during evaluation. Further improvements could be explored by increasing the number of training epochs, tuning the `param_ranges` passed to the ensemble (e.g., different activation functions or dropout rates), adjusting the preprocessing steps in `ensemble_config.json`, or incorporating more sophisticated voting methods. The generated submission file provides predictions ready for evaluation against the ground truth."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43e54234",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base (3.11.9)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
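The two label-handling tricks in the notebook above, the `np.eye` one-hot encoding and the floor-based probability truncation, can be sketched in isolation with made-up values (only `numpy` and `math` are used; the data is illustrative):

```python
import math

import numpy as np

# Binary labels, as produced by mapping "Presence"/"Absence" to 1/0
y = np.array([1, 0, 1, 1])

# One-hot encode with the np.eye trick from the notebook: row i of the
# identity matrix is exactly the one-hot vector for class i.
y_col = y.reshape((y.size, 1))
y_onehot = np.eye(y.max() + 1)[y_col].reshape((-1, 2)).astype(np.float32)
# y_onehot -> [[0, 1], [1, 0], [0, 1], [0, 1]]

# Truncate (not round) predicted probabilities to one decimal place,
# mirroring the submission cell's math.floor(x * 10) / 10
probs = [0.87, 0.19, 0.555]
truncated = [math.floor(p * 10) / 10 for p in probs]
# truncated -> [0.8, 0.1, 0.5]
```

Note that flooring biases every probability downward by up to 0.1, which is a deliberate coarsening for the submission rather than standard rounding.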

examples/ensemble_config.json

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
{
  "target_column": "Heart Disease",
  "compute_feature_importance": true,
  "preprocessing_steps": [
    {
      "name": "TransformRange",
      "params": {
        "columns_bin_sizes": {"Age": 10}
      }
    },
    {
      "name": "DataScaler",
      "params": {
        "n": 0
      }
    },
    {
      "name": "remove_collinearity",
      "params": {
        "threshold": 1.0
      }
    },
    {
      "name": "OneHotEncoder",
      "params": {
        "columns": ["Sex"]
      }
    },
    {
      "name": "OneHotEncoder",
      "params": {
        "columns": ["Age_range"]
      }
    }
  ]
}
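The `TransformRange` step above bins `Age` with a bin size of 10, and the later `OneHotEncoder` entry suggests it emits an `Age_range` column. The `likelihood` implementation is not shown in this diff, so the following is only a sketch of what width-10 binning might look like; the `to_age_range` helper and its label format are illustrative assumptions, not the library's API:

```python
import numpy as np

def to_age_range(ages, bin_size=10):
    """Map each age to a label for its width-`bin_size` bin (illustrative only)."""
    lower = (np.asarray(ages) // bin_size) * bin_size
    return [f"{lo}-{lo + bin_size - 1}" for lo in lower]

print(to_age_range([47, 52, 39]))  # ['40-49', '50-59', '30-39']
```

One-hot encoding the resulting `Age_range` column then yields one indicator feature per decade, which is why `Age_range` appears in the second `OneHotEncoder` step.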

likelihood/models/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+from .ensemble import *
 from .environments import *
 from .regression import *
 from .simulation import *
