| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "630e3e17", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# 🔐 NeMo Safe Synthesizer: Advanced Privacy (Differential Privacy)\n", |
| 9 | + "\n", |
| 10 | + "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n", |
| 11 | + "\n", |
| 12 | + "<br>\n", |
| 13 | + "\n", |
| 14 | + "In this notebook, we create synthetic tabular data using the NeMo Microservices Python SDK with differential privacy enabled. The notebook should take about 1.5 hours to run.\n", |
| 15 | + "\n", |
| 16 | + "After completing this notebook, you'll be able to:\n", |
| 17 | + "- **Use the NeMo Microservices SDK** to interact with Safe Synthesizer\n", |
| 18 | + "- **Enable differential privacy** to provide additional privacy protection\n", |
| 19 | + "- **Access an evaluation report** on the quality and privacy of the synthetic data" |
| 20 | + ] |
| 21 | + }, |
| 22 | + { |
| 23 | + "cell_type": "code", |
| 24 | + "execution_count": null, |
| 25 | + "id": "a538526a", |
| 26 | + "metadata": {}, |
| 27 | + "outputs": [], |
| 28 | + "source": [] |
| 29 | + }, |
| 30 | + { |
| 31 | + "cell_type": "markdown", |
| 32 | + "id": "8be84f5d", |
| 33 | + "metadata": {}, |
| 34 | + "source": [ |
| 35 | + "## 💾 Install dependencies\n", |
| 36 | + "\n", |
| 37 | + "Ensure you have a NeMo Microservices Platform deployment available. If you're using a managed or remote deployment, have the correct base URLs and tokens ready." |
| 38 | + ] |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "code", |
| 42 | + "execution_count": null, |
| 43 | + "id": "9f5d6f5a", |
| 44 | + "metadata": {}, |
| 45 | + "outputs": [], |
| 46 | + "source": [ |
| 47 | + "import logging\n", |
| 48 | + "\n", |
| 49 | + "import pandas as pd\n", |
| 50 | + "from nemo_microservices import NeMoMicroservices\n", |
| 51 | + "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n", |
| 52 | + "\n", |
| 53 | + "logging.basicConfig(level=logging.WARNING)\n", |
| 54 | + "logging.getLogger(\"httpx\").setLevel(logging.WARNING)" |
| 55 | + ] |
| 56 | + }, |
| 57 | + { |
| 58 | + "cell_type": "markdown", |
| 59 | + "id": "7395f0c8", |
| 60 | + "metadata": {}, |
| 61 | + "source": [ |
| 62 | + "## ⚙️ Initialize the NeMo Safe Synthesizer Client\n", |
| 63 | + "\n", |
| 64 | + "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n", |
| 65 | + "- `http://localhost:8080` is the default `base_url` for the quickstart deployment.\n", |
| 66 | + "- If using a managed or remote deployment, ensure you use the correct base URLs and tokens." |
| 67 | + ] |
| 68 | + }, |
| 69 | + { |
| 70 | + "cell_type": "code", |
| 71 | + "execution_count": null, |
| 72 | + "id": "8c15ab93", |
| 73 | + "metadata": {}, |
| 74 | + "outputs": [], |
| 75 | + "source": [ |
| 76 | + "client = NeMoMicroservices(\n", |
| 77 | + " base_url=\"http://localhost:8080\",\n", |
| 78 | + ")" |
| 79 | + ] |
| 80 | + }, |
| 81 | + { |
| 82 | + "cell_type": "markdown", |
| 83 | + "id": "8f1cfb12", |
| 84 | + "metadata": {}, |
| 85 | + "source": [ |
| 86 | + "NeMo DataStore is launched as one of the platform services. The job uses it to manage storage, so set its endpoint and token:" |
| 87 | + ] |
| 88 | + }, |
| 89 | + { |
| 90 | + "cell_type": "code", |
| 91 | + "execution_count": null, |
| 92 | + "id": "426186a3", |
| 93 | + "metadata": {}, |
| 94 | + "outputs": [], |
| 95 | + "source": [ |
| 96 | + "datastore_config = {\n", |
| 97 | + " \"endpoint\": \"http://localhost:3000/v1/hf\",\n", |
| 98 | + " \"token\": \"\",\n", |
| 99 | + "}" |
| 100 | + ] |
| 101 | + }, |
| 102 | + { |
| 103 | + "cell_type": "markdown", |
| 104 | + "id": "2d66c819", |
| 105 | + "metadata": {}, |
| 106 | + "source": [ |
| 107 | + "## 📥 Load input data\n", |
| 108 | + "\n", |
| 109 | + "Safe Synthesizer learns the patterns and correlations of an input dataset in order to produce synthetic data with similar properties. Use the sample dataset provided, or change the following cell to try it with your own data.\n", |
| 110 | + "\n", |
| 111 | + "The sample dataset describes customer default payments. It includes Personally Identifiable Information (PII) columns such as sex, education level, marital status, and age. It also contains several billing and payment columns and a binary indicator of whether the customer defaults on the next month's payment." |
| 112 | + ] |
| 113 | + }, |
| 114 | + { |
| 115 | + "cell_type": "code", |
| 116 | + "execution_count": null, |
| 117 | + "id": "9c989a42", |
| 118 | + "metadata": {}, |
| 119 | + "outputs": [], |
| 120 | + "source": [ |
| 121 | + "%pip install ucimlrepo || uv pip install ucimlrepo" |
| 122 | + ] |
| 123 | + }, |
| 124 | + { |
| 125 | + "cell_type": "code", |
| 126 | + "execution_count": null, |
| 127 | + "id": "7204f213", |
| 128 | + "metadata": {}, |
| 129 | + "outputs": [], |
| 130 | + "source": [ |
| 131 | + "from ucimlrepo import fetch_ucirepo\n", |
| 132 | + "\n", |
| 133 | + "# Fetch the UCI \"Default of Credit Card Clients\" dataset (id=350).\n", |
| 134 | + "default_of_credit_card_clients = fetch_ucirepo(id=350)\n", |
| 135 | + "# `data.original` holds the full table, including ID, feature, and target columns.\n", |
| 136 | + "df = default_of_credit_card_clients.data.original\n", |
| 137 | + "\n", |
| 138 | + "# Display the first few rows of the DataFrame\n", |
| 139 | + "print(df.head())" |
| 140 | + ] |
| 141 | + }, |
| 142 | + { |
| 143 | + "cell_type": "code", |
| 144 | + "execution_count": null, |
| 145 | + "id": "d8ca3a11", |
| 146 | + "metadata": {}, |
| 147 | + "outputs": [], |
| 148 | + "source": [ |
| 149 | + "df" |
| 150 | + ] |
| 151 | + }, |
| 152 | + { |
| 153 | + "cell_type": "markdown", |
| 154 | + "id": "87d72c68", |
| 155 | + "metadata": {}, |
| 156 | + "source": [ |
| 157 | + "## 🏗️ Create a Safe Synthesizer job\n", |
| 158 | + "\n", |
| 159 | + "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n", |
| 160 | + "\n", |
| 161 | + "The builder chain in the next cell will:\n", |
| 162 | + "- Initialize the builder with the NeMo Microservices client.\n", |
| 163 | + "- Use the loaded DataFrame as the input data source.\n", |
| 164 | + "- Configure the job to use the specified datastore for model storage.\n", |
| 165 | + "- Enable automatic replacement of personally identifiable information (PII).\n", |
| 166 | + "- Enable differential privacy (DP) with a configurable epsilon.\n", |
| 167 | + "- Use structured generation to enforce the schema during data generation.\n", |
| 168 | + "- Submit the job to the microservices platform." |
| 169 | + ] |
| 170 | + }, |
| 171 | + { |
| 172 | + "cell_type": "code", |
| 173 | + "execution_count": null, |
| 174 | + "id": "85d9de56", |
| 175 | + "metadata": {}, |
| 176 | + "outputs": [], |
| 177 | + "source": [ |
| 178 | + "job = (\n", |
| 179 | + " SafeSynthesizerBuilder(client)\n", |
| 180 | + " .from_data_source(df)\n", |
| 181 | + " .with_datastore(datastore_config)\n", |
| 182 | + " .with_replace_pii()\n", |
| 183 | + " .with_differential_privacy(dp_enabled=True, epsilon=8.0)\n", |
| 184 | + " .with_generate(use_structured_generation=True)\n", |
| 185 | + " .create_job()\n", |
| 186 | + ")\n", |
| 187 | + "\n", |
| 188 | + "print(f\"job_id = {job.job_id}\")\n", |
| 189 | + "job.wait_for_completion()\n", |
| 190 | + "\n", |
| 191 | + "print(f\"Job finished with status {job.fetch_status()}\")" |
| 192 | + ] |
| 193 | + }, |
| 194 | + { |
| 195 | + "cell_type": "code", |
| 196 | + "execution_count": null, |
| 197 | + "id": "fa2eacb2", |
| 198 | + "metadata": {}, |
| 199 | + "outputs": [], |
| 200 | + "source": [ |
| 201 | + "# If your notebook shuts down, the job keeps running on the microservices platform.\n", |
| 202 | + "# To reconnect to it, uncomment the following snippet and replace <job id> with the\n", |
| 203 | + "# job ID printed by the previous cell.\n", |
| 204 | + "\n", |
| 205 | + "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n", |
| 206 | + "# job = SafeSynthesizerJob(job_id=\"<job id>\", client=client)" |
| 207 | + ] |
| 208 | + }, |
| 209 | + { |
| 210 | + "cell_type": "markdown", |
| 211 | + "id": "285d4a9d", |
| 212 | + "metadata": {}, |
| 213 | + "source": [ |
| 214 | + "## 👀 View synthetic data\n", |
| 215 | + "\n", |
| 216 | + "After the job completes, fetch the generated synthetic dataset." |
| 217 | + ] |
| 218 | + }, |
| 219 | + { |
| 220 | + "cell_type": "code", |
| 221 | + "execution_count": null, |
| 222 | + "id": "7f25574a", |
| 223 | + "metadata": {}, |
| 224 | + "outputs": [], |
| 225 | + "source": [ |
| 226 | + "# Fetch the synthetic data created by the job\n", |
| 227 | + "synthetic_df = job.fetch_data()\n", |
| 228 | + "synthetic_df\n" |
| 229 | + ] |
| 230 | + }, |
| 231 | + { |
| 232 | + "cell_type": "markdown", |
| 233 | + "id": "472b4f38", |
| 234 | + "metadata": {}, |
| 235 | + "source": [ |
| 236 | + "## 📊 View evaluation report\n", |
| 237 | + "\n", |
| 238 | + "The job automatically runs an evaluation that compares the synthetic data to the input data. The following cells let you:\n", |
| 239 | + "\n", |
| 240 | + "- Programmatically access key scores (quality and privacy).\n", |
| 241 | + "- Download the full HTML report with charts and detailed metrics.\n", |
| 242 | + "- Display the report inline below." |
| 243 | + ] |
| 244 | + }, |
| 245 | + { |
| 246 | + "cell_type": "code", |
| 247 | + "execution_count": null, |
| 248 | + "id": "7b691127", |
| 249 | + "metadata": {}, |
| 250 | + "outputs": [], |
| 251 | + "source": [ |
| 252 | + "# Print selected information from the job summary\n", |
| 253 | + "summary = job.fetch_summary()\n", |
| 254 | + "print(\n", |
| 255 | + " f\"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}\"\n", |
| 256 | + ")\n", |
| 257 | + "print(f\"Data privacy score (0-10, higher is better): {summary.data_privacy_score}\")\n" |
| 258 | + ] |
| 259 | + }, |
| 260 | + { |
| 261 | + "cell_type": "code", |
| 262 | + "execution_count": null, |
| 263 | + "id": "d5b1030a", |
| 264 | + "metadata": {}, |
| 265 | + "outputs": [], |
| 266 | + "source": [ |
| 267 | + "# Download the full evaluation report to your local machine\n", |
| 268 | + "job.save_report(\"evaluation_report.html\")" |
| 269 | + ] |
| 270 | + }, |
| 271 | + { |
| 272 | + "cell_type": "code", |
| 273 | + "execution_count": null, |
| 274 | + "id": "45f7e22b", |
| 275 | + "metadata": {}, |
| 276 | + "outputs": [], |
| 277 | + "source": [ |
| 278 | + "# Fetch and display the full evaluation report inline\n", |
| 279 | + "job.display_report_in_notebook()" |
| 280 | + ] |
| 281 | + } |
| 282 | + ], |
| 283 | + "metadata": { |
| 284 | + "kernelspec": { |
| 285 | + "display_name": "kendrickb-notebooks", |
| 286 | + "language": "python", |
| 287 | + "name": "python3" |
| 288 | + }, |
| 289 | + "language_info": { |
| 290 | + "codemirror_mode": { |
| 291 | + "name": "ipython", |
| 292 | + "version": 3 |
| 293 | + }, |
| 294 | + "file_extension": ".py", |
| 295 | + "mimetype": "text/x-python", |
| 296 | + "name": "python", |
| 297 | + "nbconvert_exporter": "python", |
| 298 | + "pygments_lexer": "ipython3", |
| 299 | + "version": "3.11.13" |
| 300 | + } |
| 301 | + }, |
| 302 | + "nbformat": 4, |
| 303 | + "nbformat_minor": 5 |
| 304 | +} |
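
The notebook passes `epsilon=8.0` to `with_differential_privacy` but does not show what the parameter controls. As a rough intuition aid, the sketch below uses the textbook Laplace mechanism on a simple count. This is a swapped-in illustration, not the mechanism Safe Synthesizer uses during training, and the helper names (`laplace_noise`, `private_count`, `mean_abs_error`) are hypothetical. It only demonstrates the general trade-off: a smaller epsilon means more noise and stronger privacy, while a larger epsilon means less noise and weaker privacy.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count under epsilon-DP with the Laplace mechanism.

    A count changes by at most 1 when one record is added or removed
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 20_000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over many independent releases."""
    rng = random.Random(seed)
    true_count = 1000  # illustrative value, not taken from the notebook's dataset
    total = sum(abs(private_count(true_count, epsilon, rng) - true_count) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # The error shrinks as epsilon grows: less noise, weaker privacy guarantee.
    for eps in (0.5, 1.0, 8.0):
        print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(eps):.3f}")
```

For this mechanism the expected absolute error equals `1 / epsilon`, so the printed values should fall as epsilon rises. When tuning `epsilon` in the notebook, the analogous check is to compare the quality and privacy scores from `job.fetch_summary()` across runs.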