Snowflake-Labs
diff --git a/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/assets/configure-service.png‎
188 KB b/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/assets/configure-service.png‎
diff --git a/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/assets/create-service.png‎
141 KB b/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/assets/create-service.png‎
diff --git a/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/assets/upload_notebook_file.png‎
62.7 KB b/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/assets/upload_notebook_file.png‎
diff --git a/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/healthcare-ml-breast-cancer-classification.md‎
Lines changed: 498 additions & 0 deletions b/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/healthcare-ml-breast-cancer-classification.md‎
diff --git a/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/notebooks/0_start_here.ipynb‎
Lines changed: 755 additions & 0 deletions b/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/notebooks/0_start_here.ipynb‎
diff --git a/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/notebooks/1_snowflake_deployment.ipynb‎
Lines changed: 321 additions & 0 deletions b/‎site/sfguides/src/healthcare-ml-breast-cancer-classification/notebooks/1_snowflake_deployment.ipynb‎
@@ -0,0 +1,321 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "intro",
+   "metadata": {},
+   "source": [
+    "# Part 2: Snowflake Model Registry Deployment\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "This notebook demonstrates **deploying an XGBoost model to Snowflake Model Registry** for production inference. You'll register the model trained in Part 1, run inference against the registered version, inspect its logged metadata, and optionally save the training data to a Snowflake table.\n",
+    "\n",
+    "### Prerequisites\n",
+    "\n",
+    "⚠️ **IMPORTANT**: Run `setup.sql` as ACCOUNTADMIN before starting this notebook.\n",
+    "\n",
+    "The setup script creates:\n",
+    "- Role: `HEALTHCARE_ML_ROLE`\n",
+    "- Database: `HEALTHCARE_ML`\n",
+    "- Schema: `HEALTHCARE_ML.DIAGNOSTICS`\n",
+    "- Warehouse: `HEALTHCARE_ML_WH`\n",
+    "- Compute Pool: `HEALTHCARE_ML_CPU_POOL`\n",
+    "\n",
+    "### What You'll Learn\n",
+    "\n",
+    "1. **Register models** in Snowflake Model Registry\n",
+    "2. **Run inference** using registered models\n",
+    "3. **Track metadata** (metrics, versions, comments)\n",
+    "4. **Persist data** to Snowflake tables (optional)\n",
+    "\n",
+    "> **Note**: This notebook requires Container Runtime and must be run from **Snowsight**."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "load_intro",
+   "metadata": {},
+   "source": [
+    "## Step 1: Load Artifacts from Part 1\n",
+    "\n",
+    "Load the trained model, the train/test splits, and the evaluation metrics that Part 1 saved to `/tmp/breast_cancer_artifacts.pkl`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "load_artifacts",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "import pandas as pd\n",
+    "from snowflake.snowpark.context import get_active_session\n",
+    "\n",
+    "# Load artifacts from Part 1\n",
+    "with open('/tmp/breast_cancer_artifacts.pkl', 'rb') as f:\n",
+    "    artifacts = pickle.load(f)\n",
+    "\n",
+    "best_model = artifacts['best_model']\n",
+    "X_train = artifacts['X_train']\n",
+    "X_test = artifacts['X_test']\n",
+    "y_train = artifacts['y_train']\n",
+    "y_test = artifacts['y_test']\n",
+    "test_accuracy = artifacts['test_accuracy']\n",
+    "test_f1 = artifacts['test_f1']\n",
+    "roc_auc = artifacts['roc_auc']\n",
+    "pr_auc = artifacts['pr_auc']\n",
+    "cv_results = artifacts['cv_results']\n",
+    "feature_names = artifacts['feature_names']\n",
+    "\n",
+    "print(\"=\" * 60)\n",
+    "print(\"✅ ARTIFACTS LOADED FROM /tmp\")\n",
+    "print(\"=\" * 60)\n",
+    "print(f\"Model: XGBoost ({best_model.n_estimators} estimators)\")\n",
+    "print(f\"Training data: {X_train.shape[0]} samples × {X_train.shape[1]} features\")\n",
+    "print(f\"Test data: {X_test.shape[0]} samples\")\n",
+    "print(f\"Test Accuracy: {test_accuracy:.4f}\")\n",
+    "print(f\"ROC AUC: {roc_auc:.4f}\")\n",
+    "\n",
+    "# Connect to Snowflake\n",
+    "session = get_active_session()\n",
+    "session.sql(\"\"\"\n",
+    "    ALTER SESSION SET query_tag = '{\"origin\":\"sf_sit-is\",\"name\":\"healthcare_ml_classification\",\"version\":{\"major\":1,\"minor\":0},\"attributes\":{\"is_quickstart\":1,\"source\":\"notebook\"}}'\n",
+    "\"\"\").collect()\n",
+    "print(f\"\\n✅ Connected to Snowflake: {session.get_current_account()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a18fec30",
+   "metadata": {},
+   "source": [
+    "## Step 2: Environment Setup and Model Registration\n",
+    "\n",
+    "### Import Libraries\n",
+    "\n",
+    "This guide uses a combination of data science and Snowflake-specific libraries. The cell below imports the Model Registry classes and logs the model loaded in Step 1:\n",
+    "\n",
+    "| Library | Purpose |\n",
+    "|---------|---------|\n",
+    "| `snowflake.snowpark` | Snowflake session management |\n",
+    "| `snowflake.ml` | Model Registry access and task type declaration |\n",
+    "| `pandas`, `numpy` | Data manipulation and numerical operations |\n",
+    "| `matplotlib`, `seaborn` | Statistical visualizations |\n",
+    "| `sklearn` | ML utilities, metrics, and baseline models |\n",
+    "| `xgboost` | Gradient boosting implementation |\n",
+    "\n",
+    "> **Note**: All libraries are pre-installed in Container Runtime - no `!pip install` or External Access Integrations (EAIs) needed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9ad41959",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from snowflake.ml.registry import Registry\n",
+    "from snowflake.ml.model import task\n",
+    "\n",
+    "DATABASE = \"HEALTHCARE_ML\"\n",
+    "SCHEMA = \"DIAGNOSTICS\"\n",
+    "\n",
+    "session.use_database(DATABASE)\n",
+    "session.use_schema(SCHEMA)\n",
+    "\n",
+    "registry = Registry(session=session)\n",
+    "\n",
+    "MODEL_NAME = \"BREAST_CANCER_CLASSIFIER\"\n",
+    "\n",
+    "print(\"Logging model to Snowflake Model Registry...\")\n",
+    "mv = registry.log_model(\n",
+    "    best_model,\n",
+    "    model_name=MODEL_NAME,\n",
+    "    sample_input_data=X_train.head(),\n",
+    "    target_platforms=[\"WAREHOUSE\"],\n",
+    "    task=task.Task.TABULAR_BINARY_CLASSIFICATION,\n",
+    "    options={'relax_version': False},\n",
+    "    metrics={\n",
+    "        \"test_accuracy\": float(test_accuracy),\n",
+    "        \"test_f1_score\": float(test_f1),\n",
+    "        \"roc_auc\": float(roc_auc),\n",
+    "        \"cv_accuracy_mean\": float(cv_results['XGBoost'].mean()),\n",
+    "        \"cv_accuracy_std\": float(cv_results['XGBoost'].std()),\n",
+    "        # Hyperparameters are hardcoded here; keep them in sync with the model trained in Part 1\n",
+    "        \"n_estimators\": 100,\n",
+    "        \"max_depth\": 6,\n",
+    "        \"learning_rate\": 0.1\n",
+    "    },\n",
+    "    comment=\"XGBoost classifier for breast cancer diagnosis. Trained on Wisconsin Diagnostic dataset (569 samples, 30 features). Cross-validated.\"\n",
+    ")\n",
+    "\n",
+    "print(\"=\" * 60)\n",
+    "print(\"MODEL REGISTRY - SUCCESS\")\n",
+    "print(\"=\" * 60)\n",
+    "print(f\"Model Name:    {MODEL_NAME}\")\n",
+    "print(f\"Version:       {mv.version_name}\")\n",
+    "print(f\"Test Accuracy: {test_accuracy:.4f}\")\n",
+    "print(f\"ROC AUC:       {roc_auc:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e428a2e",
+   "metadata": {},
+   "source": [
+    "## Step 3: Model Inference\n",
+    "\n",
+    "### Running Predictions with the Registered Model\n",
+    "\n",
+    "Once deployed to the Model Registry, inference can be performed via:\n",
+    "\n",
+    "| Method | Use Case | Scalability |\n",
+    "|--------|----------|-------------|\n",
+    "| `mv.run()` (Python) | Notebooks, scripts | Batch processing |\n",
+    "| `MODEL!PREDICT()` (SQL) | Dashboards, ETL pipelines | Warehouse-scale |\n",
+    "\n",
+    "The model executes **within Snowflake** - no data leaves the platform, maintaining security and governance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "904c5d8e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(f\"Running inference using model: {mv.model_name} (version: {mv.version_name})\")\n",
+    "predictions = mv.run(X_test, function_name=\"predict\")\n",
+    "print(f\"Prediction columns: {predictions.columns.tolist()}\")\n",
+    "pred_col = predictions.columns[-1]\n",
+    "predictions[[pred_col]].rename(columns={pred_col: \"PREDICTION\"}).head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe14c8d1",
+   "metadata": {},
+   "source": [
+    "## Step 4: Explore Registered Model\n",
+    "\n",
+    "The Model Registry stores model artifacts along with metadata. Let's inspect:\n",
+    "- **Available methods**: predict, predict_proba\n",
+    "- **Logged metrics**: accuracy, AUC, hyperparameters\n",
+    "\n",
+    "> **Tip**: View your model in Snowsight under **AI & ML > Models** for a visual interface."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "28216849",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"Available methods:\")\n",
+    "for func in mv.show_functions():\n",
+    "    print(f\"  - {func['name']}\")\n",
+    "\n",
+    "print(\"\\nModel metrics:\")\n",
+    "mv.show_metrics()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1483b6cb",
+   "metadata": {},
+   "source": [
+    "## Step 5: (Optional) Persist Data to Snowflake\n",
+    "\n",
+    "**Data Persistence Options:**\n",
+    "\n",
+    "| Method | Use Case | Durability |\n",
+    "|--------|----------|------------|\n",
+    "| Snowflake Table | Structured data, SQL queries | Permanent |\n",
+    "| Snowflake Stage | Files, artifacts | Permanent |\n",
+    "| Notebook CWD | Temporary files | Session only ⚠️ |\n",
+    "\n",
+    "> **Warning**: The notebook working directory (`/home/udf/`) does not persist between sessions. Always save important data to tables or stages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7cf4e1ff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# OPTIONAL: Save training data to Snowflake\n",
+    "# Uncomment and update the database/schema names to match your environment\n",
+    "\n",
+    "# train_df = X_train.copy()\n",
+    "# train_df[\"DIAGNOSIS\"] = y_train.values\n",
+    "# \n",
+    "# snowpark_df = session.create_dataframe(train_df)\n",
+    "# snowpark_df.write.mode(\"overwrite\").save_as_table(\"HEALTHCARE_ML.DIAGNOSTICS.BREAST_CANCER_TRAINING_DATA\")\n",
+    "# \n",
+    "# print(\"Training data saved to Snowflake table\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7550b87a",
+   "metadata": {},
+   "source": [
+    "## Summary and Key Takeaways\n",
+    "\n",
+    "### What We Accomplished\n",
+    "\n",
+    "| Step | Technique | Outcome |\n",
+    "|------|-----------|---------|\n",
+    "| Data Exploration | Statistical analysis + visualizations | Understood feature distributions and class balance |\n",
+    "| Feature Engineering | StandardScaler | Normalized features for fair model comparison |\n",
+    "| Model Selection | 5-Fold Stratified CV | Compared 3 algorithms, selected XGBoost |\n",
+    "| Evaluation | Multiple metrics + visualizations | Validated model with ~97% accuracy, 0.99 AUC |\n",
+    "| Deployment | Snowflake Model Registry | Production-ready model with versioning |\n",
+    "\n",
+    "### Performance Summary\n",
+    "\n",
+    "| Metric | Value | Interpretation |\n",
+    "|--------|-------|----------------|\n",
+    "| Test Accuracy | ~97% | Correct predictions overall |\n",
+    "| ROC AUC | ~0.99 | Excellent discrimination |\n",
+    "| Malignant Recall | ~95%+ | Catches most cancers |\n",
+    "| Benign Precision | ~98%+ | Few malignant cases labeled benign |\n",
+    "\n",
+    "### Production Usage\n",
+    "\n",
+    "```sql\n",
+    "-- SQL Inference\n",
+    "SELECT BREAST_CANCER_CLASSIFIER!PREDICT(*) FROM your_patient_data;\n",
+    "```\n",
+    "\n",
+    "```python\n",
+    "# Python Inference (uses the model's default version, since log_model was called without a version_name)\n",
+    "model_version = registry.get_model(\"BREAST_CANCER_CLASSIFIER\").default\n",
+    "predictions = model_version.run(new_data, function_name=\"predict\")\n",
+    "```\n",
+    "\n",
+    "### Next Steps\n",
+    "\n",
+    "1. **Hyperparameter Tuning**: Use GridSearchCV or Optuna for optimization\n",
+    "2. **Feature Selection**: Reduce to top 10-15 features for efficiency\n",
+    "3. **Model Monitoring**: Track prediction drift in production\n",
+    "4. **A/B Testing**: Compare model versions on live data\n",
+    "\n",
+    "> **Resources**: [Snowflake ML Documentation](https://docs.snowflake.com/en/developer-guide/snowflake-ml/overview) | [XGBoost Documentation](https://xgboost.readthedocs.io/)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
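
The Step 1 cell in `1_snowflake_deployment.ipynb` unpacks eleven keys from `/tmp/breast_cancer_artifacts.pkl`. If Part 1 has not been run, or saved a different set of keys, the cell fails with a bare `FileNotFoundError` or `KeyError`. A minimal sketch of an early check follows. The helper name `load_artifacts` and the `REQUIRED_KEYS` tuple are hypothetical and not part of the notebook, and the sketch assumes the pickle holds a plain `dict`.

```python
import pickle
from pathlib import Path

# Keys that the Step 1 cell reads from the Part 1 pickle.
REQUIRED_KEYS = (
    "best_model", "X_train", "X_test", "y_train", "y_test",
    "test_accuracy", "test_f1", "roc_auc", "pr_auc",
    "cv_results", "feature_names",
)


def load_artifacts(path, required_keys=REQUIRED_KEYS):
    """Load the Part 1 pickle and fail early if it is missing or incomplete.

    Hypothetical helper: assumes the pickle contains a dict of artifacts.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found. Run Part 1 first.")
    with path.open("rb") as f:
        artifacts = pickle.load(f)
    # Report every missing key at once rather than failing on the first lookup.
    missing = [key for key in required_keys if key not in artifacts]
    if missing:
        raise KeyError(f"Artifacts file is missing keys: {missing}")
    return artifacts
```

In the notebook this could replace the bare `pickle.load` call, for example `artifacts = load_artifacts('/tmp/breast_cancer_artifacts.pkl')`, leaving the subsequent key lookups unchanged. As with any pickle, the file should only be loaded when it comes from a trusted source.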