Commit ea37386

Create ensemble.py
1 parent b7db874 commit ea37386

File tree

8 files changed
+1950 -303 lines changed
Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2654445f",
   "metadata": {},
   "source": [
    "## Heart Disease Prediction with Ensemble Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3844e01a",
   "metadata": {},
   "source": [
    "### 1. Introduction\n",
    "\n",
    "This Jupyter Notebook implements an ensemble learning approach to predict the presence of heart disease from a tabular dataset. The primary goal is to train an `EnsembleClassifier` from the `likelihood` library, evaluate its performance on test data, and generate a submission file for a prediction task. The notebook demonstrates how to build, train, and use an ensemble model for classification problems.\n",
    "\n",
    "### 2. Methodology\n",
    "\n",
    "The methodology consists of several key steps:\n",
    "\n",
    "1. **Data Loading & Preprocessing:**\n",
    "    * Loads training data (`train.csv`) and test data (`test.csv`) using pandas.\n",
    "    * Preprocesses the data, including:\n",
    "        * Converting the 'Sex' column to a categorical type.\n",
    "        * Replacing string values in the 'Heart Disease' column with numerical representations (1 for presence, 0 for absence).\n",
    "2. **Pipeline Creation:** A `Pipeline` object is created from an `ensemble_config.json` file, defining the sequence of preprocessing transformations applied to the data before model fitting.\n",
    "3. **Model Training & Fitting:** The `EnsembleClassifier` is initialized and trained on the preprocessed training data. A validation split (20%) is used to monitor performance during training.\n",
    "4. **Test Data Transformation:** The test data is transformed with the same pipeline used for training, ensuring consistent feature engineering.\n",
    "5. **Prediction Generation:** Predictions and class probabilities are generated on the transformed test data using the trained `EnsembleClassifier`.\n",
    "6. **Model Evaluation:** Individual models within the ensemble are evaluated by printing their F1-score and validation loss, giving insight into the performance of each component.\n",
    "7. **Submission File Generation:** A submission file (`sample_submission.csv`) is created containing the predicted probabilities for the 'Heart Disease' target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c30aa43e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "import sys\n",
    "\n",
    "# Add the parent directory to the search path so modules can be imported from there\n",
    "sys.path.insert(0, \"..\")\n",
    "\n",
    "# Disable warnings to avoid unnecessary messages during execution\n",
    "import warnings\n",
    "\n",
    "import math\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from likelihood.models.ensemble import EnsembleClassifier\n",
    "from likelihood import Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "209c6957",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"train.csv\")\n",
    "df[\"Heart Disease\"] = df[\"Heart Disease\"].replace({\"Presence\": 1, \"Absence\": 0})\n",
    "df[\"Sex\"] = df[\"Sex\"].astype(\"category\")\n",
    "etl_pipe = Pipeline(\"ensemble_config.json\")\n",
    "x_train, y_train, importances = etl_pipe.fit(df.copy().drop(columns=[\"id\"]))\n",
    "X_train = np.asarray(x_train.to_numpy()).astype(np.float32)\n",
    "y_train = y_train.reshape((y_train.size, 1))\n",
    "_train = (np.eye(y_train.max() + 1)[y_train]).reshape((-1, 2))\n",
    "y_train = np.asarray(_train).astype(np.float32)\n",
    "\n",
    "df_test = pd.read_csv(\"test.csv\")\n",
    "df_test[\"Sex\"] = df_test[\"Sex\"].astype(\"category\")\n",
    "X_test = etl_pipe.transform(df_test.copy().drop(columns=[\"id\"]))\n",
    "X_test.insert(0, \"id\", df_test[\"id\"])\n",
    "X_test = np.asarray(X_test.drop(columns=[\"id\"]).to_numpy()).astype(np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "85771f15",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training model 1/2...\n",
      "Training model 2/2...\n",
      "Ensemble trained with 2 models.\n",
      "Model 1: F1=0.845, Val Loss=0.3362\n",
      "Model 2: F1=0.842, Val Loss=0.3604\n"
     ]
    }
   ],
   "source": [
    "# Define parameter ranges for variation\n",
    "param_ranges = {\n",
    "    \"units\": (10, 20),\n",
    "    \"activation\": [\"selu\", \"relu\"],\n",
    "    \"num_layers\": (1, 5),\n",
    "    \"dropout\": (0.0, 0.5),\n",
    "}\n",
    "\n",
    "# Create and train the ensemble\n",
    "ensemble = EnsembleClassifier(\n",
    "    n_models=2, param_ranges=param_ranges, seed_range=(0, 100), voting_method=\"soft\", verbose=1\n",
    ")\n",
    "\n",
    "ensemble.fit(X_train, y_train, epochs=1, validation_split=0.2)\n",
    "ensemble.save(\"./ensemble\")\n",
    "ensemble = EnsembleClassifier.load(\"./ensemble\")\n",
    "\n",
    "# Predictions\n",
    "predictions = ensemble.predict(X_test)\n",
    "probabilities = ensemble.predict_proba(X_test)\n",
    "\n",
    "# Evaluate individual models\n",
    "scores = ensemble.get_model_scores()\n",
    "for score in scores:\n",
    "    print(\n",
    "        f\"Model {score['model_id']}: F1={score['f1_score']:.3f}, Val Loss={score['val_loss']:.4f}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "79174eb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = ensemble.predict_proba(X_test)\n",
    "\n",
    "df = pd.DataFrame(columns=[\"id\", \"Heart Disease\"])\n",
    "df[\"id\"] = df_test[\"id\"]\n",
    "df[\"Heart Disease\"] = pred[:, 1]\n",
    "# Truncate to 1 decimal place\n",
    "df[\"Heart Disease\"] = df[\"Heart Disease\"].apply(lambda x: float(math.floor(x * 10) / 10))\n",
    "\n",
    "df.to_csv(\"sample_submission.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6c29c58d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training model 1/2...\n",
      "Training model 2/2...\n",
      "Ensemble trained with 2 models.\n",
      "Model 1: F1=0.855, Val Loss=0.3694\n",
      "Model 2: F1=0.797, Val Loss=0.3046\n"
     ]
    }
   ],
   "source": [
    "ensemble.fit(X_train, y_train, epochs=1, validation_split=0.2)\n",
    "\n",
    "# Evaluate individual models\n",
    "scores = ensemble.get_model_scores()\n",
    "for score in scores:\n",
    "    print(\n",
    "        f\"Model {score['model_id']}: F1={score['f1_score']:.3f}, Val Loss={score['val_loss']:.4f}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3e4ba74",
   "metadata": {},
   "source": [
    "### 3. Analysis and Results\n",
    "\n",
    "The notebook uses an `EnsembleClassifier` to improve prediction accuracy over a single model. The following table summarizes the key results obtained during evaluation:\n",
    "\n",
    "| Model ID | F1-Score | Val Loss |\n",
    "| :------- | :--------- | :----------- |\n",
    "| *See Output* | *See Output* | *See Output* |\n",
    "\n",
    "**Note:** The actual F1-score and validation loss values are printed to the console during execution. They represent the performance of each individual model within the ensemble, as reported by `get_model_scores()`. The final prediction probabilities are then used to generate the submission file.\n",
    "\n",
    "### 4. Conclusions\n",
    "\n",
    "The ensemble learning approach implemented with the `EnsembleClassifier` is a viable strategy for predicting the presence of heart disease from tabular data, as evidenced by the F1-scores and validation losses reported during evaluation. Further improvements could be explored by increasing the number of training epochs, tuning the `param_ranges` passed to the ensemble (e.g., different activation functions or dropout rates), adjusting the preprocessing steps in `ensemble_config.json`, or incorporating more sophisticated voting methods. The generated submission file provides predictions ready for evaluation against the ground truth."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43e54234",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base (3.11.9)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
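The two label-handling tricks in the notebook above, the `np.eye` one-hot encoding and the floor-based probability truncation, can be sketched in isolation with made-up values (only `numpy` and `math` are used; the data is illustrative):

```python
import math

import numpy as np

# Binary labels, as produced by mapping "Presence"/"Absence" to 1/0
y = np.array([1, 0, 1, 1])

# One-hot encode with the np.eye trick from the notebook: row i of the
# identity matrix is exactly the one-hot vector for class i.
y_col = y.reshape((y.size, 1))
y_onehot = np.eye(y.max() + 1)[y_col].reshape((-1, 2)).astype(np.float32)
# y_onehot -> [[0, 1], [1, 0], [0, 1], [0, 1]]

# Truncate (not round) predicted probabilities to one decimal place,
# mirroring the submission cell's math.floor(x * 10) / 10
probs = [0.87, 0.19, 0.555]
truncated = [math.floor(p * 10) / 10 for p in probs]
# truncated -> [0.8, 0.1, 0.5]
```

Note that flooring biases every probability downward by up to 0.1, which is a deliberate coarsening for the submission rather than standard rounding.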

examples/ensemble_config.json

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
{
  "target_column": "Heart Disease",
  "compute_feature_importance": true,
  "preprocessing_steps": [
    {
      "name": "TransformRange",
      "params": {
        "columns_bin_sizes": {"Age": 10}
      }
    },
    {
      "name": "DataScaler",
      "params": {
        "n": 0
      }
    },
    {
      "name": "remove_collinearity",
      "params": {
        "threshold": 1.0
      }
    },
    {
      "name": "OneHotEncoder",
      "params": {
        "columns": ["Sex"]
      }
    },
    {
      "name": "OneHotEncoder",
      "params": {
        "columns": ["Age_range"]
      }
    }
  ]
}
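The `TransformRange` step above bins `Age` with a bin size of 10, and the later `OneHotEncoder` entry suggests it emits an `Age_range` column. The `likelihood` implementation is not shown in this diff, so the following is only a sketch of what width-10 binning might look like; the `to_age_range` helper and its label format are illustrative assumptions, not the library's API:

```python
import numpy as np

def to_age_range(ages, bin_size=10):
    """Map each age to a label for its width-`bin_size` bin (illustrative only)."""
    lower = (np.asarray(ages) // bin_size) * bin_size
    return [f"{lo}-{lo + bin_size - 1}" for lo in lower]

print(to_age_range([47, 52, 39]))  # ['40-49', '50-59', '30-39']
```

One-hot encoding the resulting `Age_range` column then yields one indicator feature per decade, which is why `Age_range` appears in the second `OneHotEncoder` step.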

likelihood/models/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+from .ensemble import *
 from .environments import *
 from .regression import *
 from .simulation import *
