zenml-io
diff --git a/‎native-experiment-tracking/.dockerignore‎
Lines changed: 2 additions & 0 deletions b/‎native-experiment-tracking/.dockerignore‎
diff --git a/‎native-experiment-tracking/README.md‎
Lines changed: 130 additions & 0 deletions b/‎native-experiment-tracking/README.md‎
diff --git a/‎native-experiment-tracking/analyze.py‎
Lines changed: 184 additions & 0 deletions b/‎native-experiment-tracking/analyze.py‎
diff --git a/‎native-experiment-tracking/assets/2d_plot.png‎
54.1 KB b/‎native-experiment-tracking/assets/2d_plot.png‎
diff --git a/‎native-experiment-tracking/assets/3d_plot.png‎
106 KB b/‎native-experiment-tracking/assets/3d_plot.png‎
diff --git a/‎native-experiment-tracking/assets/cm_visualization.png‎
90.8 KB b/‎native-experiment-tracking/assets/cm_visualization.png‎
diff --git a/‎native-experiment-tracking/assets/model_versions.png‎
126 KB b/‎native-experiment-tracking/assets/model_versions.png‎
diff --git a/‎native-experiment-tracking/assets/pipeline_dag_caching.png‎
33.9 KB b/‎native-experiment-tracking/assets/pipeline_dag_caching.png‎
diff --git a/‎native-experiment-tracking/configs/feature_engineering.yaml‎
Lines changed: 11 additions & 0 deletions b/‎native-experiment-tracking/configs/feature_engineering.yaml‎
diff --git a/‎native-experiment-tracking/configs/training.yaml‎
Lines changed: 11 additions & 0 deletions b/‎native-experiment-tracking/configs/training.yaml‎
@@ -0,0 +1,2 @@
+.venv*
+.requirements*
@@ -0,0 +1,130 @@
+# Track experiments in ZenML natively
+
+Although ZenML plugs into many [experiment trackers](https://www.zenml.io/vs/zenml-vs-experiment-trackers), a lot of 
+the functionality of experiment trackers is already covered by ZenML's native metadata and artifact tracking.
+This project aims to show these capabilities.
+
+## 🎯 Project Overview
+We're tackling a simple classification task using the breast cancer dataset. Our goal is to showcase how ZenML can effortlessly track experiments, hyperparameters, and results throughout the machine learning workflow.
+
+### 🔍 What We're Doing
+
+In this project, we begin by preparing the breast cancer dataset for our model through data preprocessing. For our machine learning task, we've chosen to use an SGDClassifier. Rather than relying on sklearn's GridSearchCV, we implement our own hyperparameter tuning process to showcase ZenML's robust tracking capabilities. Finally, we conduct a thorough analysis of the results, visualizing how various hyperparameters influence the model's accuracy. This approach allows us to demonstrate the power of ZenML in tracking and managing the machine learning workflow.
+
+We are by no means claiming that our solution outperforms GridSearchCV (spoiler alert: this demo won't). Rather, this project demonstrates how you would do hyperparameter tuning and experiment tracking with ZenML on large deep learning problems.
+
+### 🛠 The Pipeline
+
+Our ZenML pipeline consists of the following steps:
+
+The feature_engineering pipeline:
+* Data Loading: Load the breast cancer dataset.
+* Data Splitting: Split the data into training and testing sets.
+* Data Preprocessing: Preprocess the dataset.
+
+The model training pipeline:
+* Model Training: Train multiple SGDClassifiers with different hyperparameters.
+* Model Evaluation: Evaluate each model's performance.
+
+Running the training pipeline iteratively, once per hyperparameter combination, produces one tracked model version per run.
+
+## :running: Run locally
+
+```bash
+# Pip install all requirements
+pip install -r requirements.txt
+
+# Install required zenml integrations
+zenml integration install sklearn pandas -y
+
+# Initialize ZenML
+zenml init
+
+# Connect to your ZenML server
+zenml connect --url ...
+
+python run.py --parallel
+```
+
+This will run a grid search across the following parameter space:
+
+```python
+alpha_values = [0.0001, 0.001, 0.01]
+penalties = ["l2", "l1", "elasticnet"]
+losses = ["hinge", "squared_hinge", "modified_huber"]
+```
+
+If you choose to include the `--parallel` flag, this should all run in parallel. 
+As ZenML smartly caches across pipelines, and because the feature pipeline has run 
+ahead of the parallel training runs, all training pipelines should start on the
+`model_trainer` step.
+![Pipeline DAG with cached steps](./assets/pipeline_dag_caching.png)
+
+After running, you should now have 27 model training runs and 27 resulting
+model versions. If you are running with [ZenML Pro](https://docs.zenml.io/getting-started/zenml-pro),
+you'll be able to inspect these models in the dashboard:
+![Model Versions Page](./assets/model_versions.png)
+
+Additionally, if you ran with a remote [artifact store](https://docs.zenml.io/stack-components/artifact-stores),
+you'll be able to inspect the confusion matrix for any specific training directly in the
+frontend.
+![Confusion Matrix Visualization](./assets/cm_visualization.png)
+
+In case you want to create your own visualization, check out the implementation
+at `native-experiment-tracking/steps/model_trainer.py:generate_cm`. Basically, just create a 
+matplotlib plot, convert it into a `PIL.Image` and return it from your
+step. Don't forget to annotate your [step output accordingly](https://docs.zenml.io/how-to/build-pipelines/step-output-typing-and-annotation).
+
+```python
+from typing import Tuple
+from typing_extensions import Annotated
+from PIL import Image
+from zenml import ArtifactConfig, step
+
+@step
+def func(...) -> Tuple[
+    Annotated[
+        ...
+    ],
+    Annotated[
+        Image.Image, "confusion_matrix"
+    ]
+]:
+```
+
+## 📈 Explore your experiments
+
+Once all pipelines have run, it is time to analyze our experiment.
+For this, we have written an `analyze.py` script.
+```commandline
+python analyze.py
+```
+This will generate 2 plots for you:
+
+**3D Plot**
+![3D Plot](./assets/3d_plot.png)
+
+**2D Plot**
+![2D Plot](./assets/2d_plot.png)
+
+Feel free to use this file as a starting point to write your very own
+analysis. 
+
+## The moral of the story
+
+So what's the point? We at ZenML believe that any good experiment should be set up in a
+repeatable, scalable way while storing all the relevant metadata in order to analyze the experiment 
+after the fact. This project shows how you could do this with ZenML. 
+
+Once you have accomplished this on a toy dataset with a tiny SGDClassifier, you can start
+scaling up in all dimensions: data, parameters, model, etc. And all of this while staying infrastructure
+agnostic. So when your experiment outgrows your local machine, you can simply move
+to the stack of your choice.
+
+## 🤝 Contributing
+
+Contributions to improve the pipeline are welcome! Please feel free to submit a Pull Request.
+
+## 📄 License
+
+This project is licensed under the Apache License 2.0. See the LICENSE file for details.
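The README's "Run locally" section describes a grid search over three values each of alpha, penalty, and loss, yielding 27 training runs. A minimal sketch of how that parameter space could be enumerated is shown below. How `run.py` actually iterates is not shown in this diff, so the loop structure and the dictionary layout are assumptions; the key names simply mirror the metadata keys read in `analyze.py`.

```python
from itertools import product

# Parameter space as listed in the README.
alpha_values = [0.0001, 0.001, 0.01]
penalties = ["l2", "l1", "elasticnet"]
losses = ["hinge", "squared_hinge", "modified_huber"]

# Assumption: one training run per combination (Cartesian product).
grid = [
    {"alpha_value": alpha, "penalty": penalty, "loss": loss}
    for alpha, penalty, loss in product(alpha_values, penalties, losses)
]

print(len(grid))  # 3 * 3 * 3 = 27 combinations
```

Each dictionary in `grid` would correspond to one run of the training pipeline, which matches the 27 model versions the README mentions.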
@@ -0,0 +1,184 @@
+# Apache Software License 2.0
+#
+# Copyright (c) ZenML GmbH 2024. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+import seaborn as sns
+from zenml.client import Client
+
+
+def main():
+    client = Client()
+
+    model_versions = client.list_model_versions(
+        model_name_or_id="breast_cancer_classifier", size=27, hydrate=True
+    )
+
+    alpha_values = []
+    losses = []
+    penalties = []
+    test_accuracies = []
+    train_accuracies = []
+
+    for model_version in model_versions:
+        mv_metadata = model_version.run_metadata
+
+        alpha_values.append(mv_metadata["alpha_value"].value)
+        losses.append(mv_metadata["loss"].value)
+        penalties.append(mv_metadata["penalty"].value)
+        test_accuracies.append(mv_metadata["test_accuracy"].value)
+        train_accuracies.append(mv_metadata["train_accuracy"].value)
+
+    generate_3d_plot(alpha_values, losses, penalties, test_accuracies)
+    generate_2d_plots(alpha_values, losses, penalties, test_accuracies)
+
+
+def generate_2d_plots(alpha_values, losses, penalties, test_accuracies):
+    # Convert the data into a DataFrame
+    df = pd.DataFrame(
+        {
+            "Alpha": alpha_values,
+            "Loss": losses,
+            "Penalty": penalties,
+            "Accuracy": test_accuracies,
+        }
+    )
+
+    # Get unique values
+    unique_penalties = df["Penalty"].unique()
+
+    # Create a figure with subplots for each penalty
+    fig, axes = plt.subplots(
+        1, len(unique_penalties), figsize=(20, 6), sharey=True
+    )
+    fig.suptitle("Accuracy Heatmap for Different Penalties", fontsize=16)
+
+    for i, penalty in enumerate(unique_penalties):
+        # Filter data for the current penalty
+        df_penalty = df[df["Penalty"] == penalty]
+
+        # Create a pivot table
+        pivot = df_penalty.pivot(
+            index="Loss", columns="Alpha", values="Accuracy"
+        )
+
+        # Create heatmap
+        sns.heatmap(
+            pivot,
+            ax=axes[i],
+            cmap="viridis",
+            annot=True,
+            fmt=".3f",
+            cbar=False,
+        )
+
+        axes[i].set_title(f"Penalty: {penalty}")
+        axes[i].set_xlabel("Alpha")
+
+        if i == 0:
+            axes[i].set_ylabel("Loss")
+
+    # Add a colorbar to the right of the subplots
+    cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
+    fig.colorbar(axes[0].collections[0], cax=cbar_ax, label="Accuracy")
+
+    plt.tight_layout(rect=[0, 0, 0.9, 1])
+    plt.show()
+
+
+def generate_3d_plot(alpha_values, losses, penalties, test_accuracies):
+    # Convert losses and penalties to numerical indices
+    unique_losses = sorted(set(losses))
+    unique_penalties = sorted(set(penalties))
+
+    loss_indices = [unique_losses.index(loss) for loss in losses]
+    penalty_indices = [
+        unique_penalties.index(penalty) for penalty in penalties
+    ]
+
+    # Create a figure and a 3D axis
+    fig = plt.figure(figsize=(12, 8))
+    ax = fig.add_subplot(111, projection="3d")
+
+    # Create a scatter plot
+    scatter = ax.scatter(
+        alpha_values,
+        loss_indices,
+        penalty_indices,
+        c=test_accuracies,
+        cmap="viridis",
+    )
+    # Find the point with the highest accuracy
+    max_accuracy_index = np.argmax(test_accuracies)
+    max_accuracy = test_accuracies[max_accuracy_index]
+    max_alpha = alpha_values[max_accuracy_index]
+    max_loss = losses[max_accuracy_index]
+    max_penalty = penalties[max_accuracy_index]
+
+    # Highlight the point with the highest accuracy
+    ax.scatter(
+        [max_alpha],
+        [loss_indices[max_accuracy_index]],
+        [penalty_indices[max_accuracy_index]],
+        c="red",
+        s=100,
+        edgecolors="black",
+        linewidths=2,
+        zorder=10,
+    )
+
+    # Set labels for each axis
+    ax.set_xlabel("Alpha")
+    ax.set_ylabel("Loss")
+    ax.set_zlabel("Penalty")
+
+    # Set custom ticks for loss and penalty axes
+    ax.set_yticks(range(len(unique_losses)))
+    ax.set_yticklabels(unique_losses)
+    ax.set_zticks(range(len(unique_penalties)))
+    ax.set_zticklabels(unique_penalties)
+
+    # Add a color bar
+    cbar = plt.colorbar(scatter)
+    cbar.set_label("Accuracy")
+
+    # Set a title
+    plt.title("Accuracy vs. Alpha, Loss, and Penalty")
+
+    # Adjust the viewing angle
+    ax.view_init(elev=20, azim=45)
+
+    # Add legend with highest accuracy point description
+    legend_text = f"Highest Accuracy:\nAccuracy: {max_accuracy:.4f}\nAlpha: {max_alpha}\nLoss: {max_loss}\nPenalty: {max_penalty}"
+    ax.text2D(
+        0.05,
+        0.95,
+        legend_text,
+        transform=ax.transAxes,
+        fontsize=10,
+        verticalalignment="top",
+        bbox=dict(boxstyle="round", facecolor="white", alpha=0.8),
+    )
+
+    # Show the plot
+    plt.tight_layout()
+    plt.show()
+    return
+
+
+if __name__ == "__main__":
+    main()
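The 3D plot in `analyze.py` places the categorical losses and penalties on numeric axes by mapping each string to an integer index. A minimal sketch of that encoding with a deterministic label order follows. The helper name `encode_categories` is hypothetical and the sample values are made up for illustration; they are not results from the project.

```python
from typing import Dict, List, Tuple


def encode_categories(values: List[str]) -> Tuple[List[int], List[str]]:
    """Map each string to its index in the sorted list of unique labels."""
    labels = sorted(set(values))  # sorted, so tick order is stable across runs
    index_of: Dict[str, int] = {label: i for i, label in enumerate(labels)}
    return [index_of[v] for v in values], labels


# Made-up sample values, not output from the pipeline.
losses = ["hinge", "squared_hinge", "modified_huber", "hinge"]
loss_indices, loss_labels = encode_categories(losses)

print(loss_labels)   # labels in alphabetical order
print(loss_indices)  # one integer axis position per input value
```

Sorting the unique labels keeps the axis tick order the same between interpreter runs, which an unordered set of strings does not guarantee.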
@@ -0,0 +1,11 @@
+# environment configuration
+settings:
+  docker:
+    required_integrations:
+      - sklearn
+      - pandas
+    requirements:
+      - pyarrow
+
+# pipeline configuration
+test_size: 0.35
@@ -0,0 +1,11 @@
+# environment configuration
+settings:
+  docker:
+    required_integrations:
+      - sklearn
+      - pandas
+    requirements:
+      - pyarrow
+      - matplotlib
+      - pillow
+      - numpy