Merge pull request #47 from biosustain/add-preprocessing-notebook

sambra95 · web-flow · commit 8ad358eb6f1c · 2026-02-26T13:31:39.000+01:00
Move preprocessing function demonstrations to a new notebook
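
For reviewers, the arithmetic behind the path-length and blank-subtraction examples that this PR relocates can be sketched in plain NumPy. The helper names below (`path_correct_sketch`, `blank_subtract_sketch`) are hypothetical stand-ins, not the `growthcurves` API. The division by path length is inferred from the printed output of the removed cell, where 0.25 OD at 0.5 cm becomes 0.5 OD at 1 cm.

```python
import numpy as np


def path_correct_sketch(od, path_length_cm):
    """Scale OD readings to a 1 cm path length (hypothetical helper)."""
    if path_length_cm <= 0:
        raise ValueError("path_length_cm must be positive")
    return np.asarray(od, dtype=float) / path_length_cm


def blank_subtract_sketch(od, blank):
    """Subtract blank readings element-wise (hypothetical helper)."""
    return np.asarray(od, dtype=float) - np.asarray(blank, dtype=float)


# Same numbers as the combined pipeline in the removed analysis.ipynb cell
raw = np.array([0.125, 0.150, 0.175, 0.200])
blank = np.full_like(raw, 0.025)

od_1cm = path_correct_sketch(raw, 0.5)         # sample at 1 cm
blank_1cm = path_correct_sketch(blank, 0.5)    # blank at 1 cm
final_od = blank_subtract_sketch(od_1cm, blank_1cm)

print(final_od)
```

The blank is path-corrected before subtraction so that both arrays are on the same 1 cm scale, which matches the order used in the notebook cells.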
diff --git a/docs/index.md b/docs/index.md
@@ -12,6 +12,7 @@
 :hidden: true
 
 tutorial/analysis
+tutorial/preprocessing
 tutorial/plotting
 ```
 
diff --git a/docs/tutorial/analysis.ipynb b/docs/tutorial/analysis.ipynb
@@ -3,23 +3,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "# Fit growth models and extract growth statistics\n",
-    "\n",
-    "This tutorial demonstrates how to fit growth models and extract growth statistics\n",
-    "using the growthcurves package.\n",
-    "\n",
-    "The analysis workflow includes:\n",
-    "1. Generating or loading growth data\n",
-    "2. Fitting **mechanistic** models (ODE-based, parametric)\n",
-    "3. Fitting **phenomenological** models (parametric and non-parametric)\n",
-    "4. Extracting growth statistics from all fits\n",
-    "5. Saving results for visualization\n",
-    "\n",
-    "For visualization of the results, see the companion notebook:\n",
-    "[`plotting.ipynb`](plotting.ipynb) (Visualize fitted growth curves, derivatives,\n",
-    " and growth statistics)"
-   ]
+   "source": "# Fit growth models and extract growth statistics\n\nThis tutorial demonstrates how to fit growth models and extract growth statistics\nusing the growthcurves package.\n\nThe analysis workflow includes:\n1. Generating or loading growth data\n2. Fitting **mechanistic** models (ODE-based, parametric)\n3. Fitting **phenomenological** models (parametric and non-parametric)\n4. Extracting growth statistics from all fits\n5. Saving results for visualization\n\nFor preprocessing examples (path length correction, blank subtraction, outlier detection), see the companion notebook:\n[`preprocessing.ipynb`](preprocessing.ipynb).\n\nFor visualization of the results, see the companion notebook:\n[`plotting.ipynb`](plotting.ipynb) (visualizes fitted growth curves, derivatives,\n and growth statistics)."
   },
   {
    "cell_type": "code",
@@ -35,99 +19,6 @@
     "import growthcurves as gc"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Data Preprocessing Functions\n",
-    "\n",
-    "The growthcurves package provides preprocessing utilities for common data corrections:\n",
-    "\n",
-    "- **`path_correct(N, path_length_cm)`**: Normalize OD measurements to 1 cm path length\n",
-    "- **`blank_subtraction(N, blank)`**: Subtract blank/background measurements from data"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {
-    "tags": [
-     "hide-input"
-    ]
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Path Length Correction Example:\n",
-      "  Raw OD (0.5 cm path): [0.25 0.3  0.35 0.4 ]\n",
-      "  Corrected OD (1 cm path): [0.5 0.6 0.7 0.8]\n",
-      "\n",
-      "Blank Subtraction Example:\n",
-      "  Sample OD: [0.5 0.6 0.7 0.8]\n",
-      "  Blank OD:  [0.05  0.052 0.048 0.051]\n",
-      "  Corrected: [0.45  0.548 0.652 0.749]\n",
-      "\n",
-      "Combined Preprocessing Pipeline:\n",
-      "  Raw measurements (0.5 cm):     [0.125 0.15  0.175 0.2  ]\n",
-      "  After path correction (1 cm):  [0.25 0.3  0.35 0.4 ]\n",
-      "  Blank (corrected to 1 cm):[0.05 0.05 0.05 0.05]\n",
-      "  Final corrected OD:            [0.2  0.25 0.3  0.35]\n",
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "# Example 1: Path length correction\n",
-    "# Measurements taken with a 0.5 cm path length, normalized to 1 cm\n",
-    "raw_od_measurements = np.array([0.25, 0.30, 0.35, 0.40])\n",
-    "path_length = 0.5  # cm\n",
-    "\n",
-    "od_corrected = gc.path_correct(raw_od_measurements, path_length)\n",
-    "\n",
-    "print(\"Path Length Correction Example:\")\n",
-    "print(f\"  Raw OD (0.5 cm path): {raw_od_measurements}\")\n",
-    "print(f\"  Corrected OD (1 cm path): {od_corrected}\")\n",
-    "print()\n",
-    "\n",
-    "# Example 2: Blank subtraction\n",
-    "# Typical workflow: subtract blank measurements from sample data\n",
-    "sample_data = np.array([0.500, 0.600, 0.700, 0.800])\n",
-    "blank_data = np.array([0.050, 0.052, 0.048, 0.051])\n",
-    "\n",
-    "corrected_data = gc.blank_subtraction(sample_data, blank_data)\n",
-    "\n",
-    "print(\"Blank Subtraction Example:\")\n",
-    "print(f\"  Sample OD: {sample_data}\")\n",
-    "print(f\"  Blank OD:  {blank_data}\")\n",
-    "print(f\"  Corrected: {corrected_data}\")\n",
-    "print()\n",
-    "\n",
-    "# Example 3: Combined preprocessing workflow\n",
-    "# Simulate a typical preprocessing pipeline\n",
-    "raw_measurements = np.array([0.125, 0.150, 0.175, 0.200])\n",
-    "blank_measurements = np.array([0.025, 0.025, 0.025, 0.025])\n",
-    "path_length_cm = 0.5\n",
-    "\n",
-    "# Step 1: Path correction\n",
-    "od_1cm = gc.path_correct(raw_measurements, path_length_cm)\n",
-    "\n",
-    "# Step 2: Blank subtraction\n",
-    "od_corrected = gc.blank_subtraction(\n",
-    "    od_1cm, gc.path_correct(blank_measurements, path_length_cm)\n",
-    ")\n",
-    "\n",
-    "print(\"Combined Preprocessing Pipeline:\")\n",
-    "print(f\"  Raw measurements (0.5 cm):     {raw_measurements}\")\n",
-    "print(f\"  After path correction (1 cm):  {od_1cm}\")\n",
-    "print(\n",
-    "    f\"  Blank (corrected to 1 cm):{gc.path_correct(blank_measurements, path_length_cm)}\"\n",
-    ")\n",
-    "print(f\"  Final corrected OD:            {od_corrected}\")\n",
-    "print()"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -137,7 +28,7 @@
     "This cell generates synthetic growth data from a clean logistic function.\n",
     "- time is modeled in hours, with measurements every 12 minutes (0.2 hours) for\n",
     "  a total of 440 points (88 hours).\n",
-    "- We assume a lag of 30 hours, an intrinsic growth rate of 0.15 hour⁻¹,\n",
+    "- We assume a lag of 30 hours, an intrinsic growth rate of 0.15 hour\u207b\u00b9,\n",
     "  and a carrying capacity of 0.45 OD."
    ]
   },
@@ -297,11 +188,11 @@
     "| Output key | Meaning | How it is calculated |\n",
     "|---|---|---|\n",
     "| `max_od` | Maximum observed/fitted OD | Maximum OD over the valid data range |\n",
-    "| `mu_max` | Maximum specific growth rate (μ_max) | Maximum of `d(ln N)/dt` from the fitted model (or local fit for non-parametric) |\n",
-    "| `intrinsic_growth_rate` | Intrinsic model rate parameter | For mechanistic models: fitted intrinsic `μ`; for phenomenological/non-parametric: `None` |\n",
+    "| `mu_max` | Maximum specific growth rate (\u03bc_max) | Maximum of `d(ln N)/dt` from the fitted model (or local fit for non-parametric) |\n",
+    "| `intrinsic_growth_rate` | Intrinsic model rate parameter | For mechanistic models: fitted intrinsic `\u03bc`; for phenomenological/non-parametric: `None` |\n",
     "| `doubling_time` | Doubling time in hours | `ln(2) / mu_max` |\n",
     "| `time_at_umax` | Time at maximum specific growth | Time where `mu_max` reaches its maximum |\n",
-    "| `od_at_umax` | OD at time of μ_max | Model-predicted OD at `time_at_umax` |\n",
+    "| `od_at_umax` | OD at time of \u03bc_max | Model-predicted OD at `time_at_umax` |\n",
     "| `exp_phase_start`, `exp_phase_end` | Exponential phase boundaries | From threshold or tangent phase-boundary method in `extract_stats()` |\n",
     "| `model_rmse` | Fit error | RMSE between observed OD and model-predicted OD over the model fit window |\n",
     "\n",
@@ -319,15 +210,15 @@
     "The `extract_stats_from_fit()` function calculates these key metrics:\n",
     "\n",
     "- `max_od`: Maximum OD value within the fitted window\n",
-    "- `mu_max`: **Observed** maximum specific growth rate μ_max (hour⁻¹) - calculated\n",
+    "- `mu_max`: **Observed** maximum specific growth rate \u03bc_max (hour\u207b\u00b9) - calculated\n",
     "  from the fitted curve\n",
     "- `intrinsic_growth_rate`: **Model parameter** for intrinsic growth rate\n",
     "  (parametric models only, `None` for non-parametric)\n",
     "- `doubling_time`: Time to double the population at peak growth (hours)\n",
     "- `exp_phase_start`: When exponential phase begins (hours)\n",
     "- `exp_phase_end`: When exponential phase ends (hours)\n",
-    "- `time_at_umax`: Time when μ reaches its maximum (hours)\n",
-    "- `od_at_umax`: OD value at time of maximum μ\n",
+    "- `time_at_umax`: Time when \u03bc reaches its maximum (hours)\n",
+    "- `od_at_umax`: OD value at time of maximum \u03bc\n",
     "- `fit_t_min`: Start of fitting window (hours)\n",
     "- `fit_t_max`: End of fitting window (hours)\n",
     "- `fit_method`: Identifier for the method used\n",
@@ -339,29 +230,29 @@
     "\n",
     "### MECHANISTIC MODELS\n",
     "\n",
-    "| Name | Model | Equation | Exp Start | Exp End | Intrinsic μ | μ max | Carrying Capacity | Fit |\n",
+    "| Name | Model | Equation | Exp Start | Exp End | Intrinsic \u03bc | \u03bc max | Carrying Capacity | Fit |\n",
     "|------|-------|----------|-----------|---------|-------------|-------|-------------------|-----|\n",
-    "| Logistic | parametric | `dN/dt = μ * (1 - N(t) / K) * N(t)` | threshold/<br>tangent | threshold/<br>tangent | μ | max dln(N)/dt | K | entire curve |\n",
-    "| Gompertz | parametric | `dN/dt = μ * math.log(K / N(t)) * N(t)` | threshold/<br>tangent | threshold/<br>tangent | μ | max dln(N)/dt | K | entire curve |\n",
-    "| Richards | parametric | `dN/dt = μ * (1 - (N(t) / K)**beta) * N(t)` | threshold/<br>tangent | threshold/<br>tangent | μ | max dln(N)/dt | A | entire curve |\n",
-    "| Baranyi | parametric | `dN/dt= μ * math.exp(μ * t) / (math.exp(h0) - 1 + math.exp(μ * t)) * (1 - N(t) / K) * N(t)` | threshold/<br>tangent | threshold/<br>tangent | μ | max dln(N)/dt | K | entire curve |\n",
+    "| Logistic | parametric | `dN/dt = \u03bc * (1 - N(t) / K) * N(t)` | threshold/<br>tangent | threshold/<br>tangent | \u03bc | max dln(N)/dt | K | entire curve |\n",
+    "| Gompertz | parametric | `dN/dt = \u03bc * math.log(K / N(t)) * N(t)` | threshold/<br>tangent | threshold/<br>tangent | \u03bc | max dln(N)/dt | K | entire curve |\n",
+    "| Richards | parametric | `dN/dt = \u03bc * (1 - (N(t) / K)**beta) * N(t)` | threshold/<br>tangent | threshold/<br>tangent | \u03bc | max dln(N)/dt | K | entire curve |\n",
+    "| Baranyi | parametric | `dN/dt= \u03bc * math.exp(\u03bc * t) / (math.exp(h0) - 1 + math.exp(\u03bc * t)) * (1 - N(t) / K) * N(t)` | threshold/<br>tangent | threshold/<br>tangent | \u03bc | max dln(N)/dt | K | entire curve |\n",
     "\n",
     "### PHENOMENOLOGICAL MODELS\n",
     "\n",
-    "| Name | Model | Equation | Exp Start | Exp End | Intrinsic μ | μ max | Max OD | Fit |\n",
+    "| Name | Model | Equation | Exp Start | Exp End | Intrinsic \u03bc | \u03bc max | Max OD | Fit |\n",
     "|------|-------|----------|-----------|---------|-------------|-------|--------|-----|\n",
     "| Linear | non-parametric | `ln(N(t)) = N0 + b * t` | threshold/<br>tangent | threshold/<br>tangent | n.a. | b | max OD raw | only window |\n",
     "| Spline | non-parametric | `ln(N(t)) = spline(t)` | threshold/<br>tangent | threshold/<br>tangent | n.a. | max of derivative of spline | max OD raw | only log phase |\n",
-    "| Logistic (phenom) | parametric | `ln(N(t)/N0) = A / (1 + exp(4 * μ_max * (λ - t) / A + 2))` | λ | threshold/<br>tangent | n.a. | μ_max | K | entire curve |\n",
-    "| Gompertz (phenom) | parametric | `ln(N(t)/N0) = A * exp(-exp(μ_max * exp(1) * (λ - t) / A + 1))` | λ | threshold/<br>tangent | n.a. | μ_max | K | entire curve |\n",
-    "| Gompertz (modified) | parametric | `ln(N(t)/N0) = A * exp(-exp(μ_max * exp(1) * (λ - t) / A + 1)) + A * exp(α * (t - t_shift))` | λ | threshold/<br>tangent | n.a. | μ_max | K | entire curve |\n",
-    "| Richards (phenom) | parametric | `ln(N(t)/N0) = A * (1 + ν * exp(1 + ν + μ_max * (1 + ν)**(1/ν) * (λ - t) / A))**(-1/ν)` | λ | threshold/<br>tangent | n.a. | μ_max | K | entire curve |\n",
+    "| Logistic (phenom) | parametric | `ln(N(t)/N0) = A / (1 + exp(4 * \u03bc_max * (\u03bb - t) / A + 2))` | \u03bb | threshold/<br>tangent | n.a. | \u03bc_max | K | entire curve |\n",
+    "| Gompertz (phenom) | parametric | `ln(N(t)/N0) = A * exp(-exp(\u03bc_max * exp(1) * (\u03bb - t) / A + 1))` | \u03bb | threshold/<br>tangent | n.a. | \u03bc_max | K | entire curve |\n",
+    "| Gompertz (modified) | parametric | `ln(N(t)/N0) = A * exp(-exp(\u03bc_max * exp(1) * (\u03bb - t) / A + 1)) + A * exp(\u03b1 * (t - t_shift))` | \u03bb | threshold/<br>tangent | n.a. | \u03bc_max | K | entire curve |\n",
+    "| Richards (phenom) | parametric | `ln(N(t)/N0) = A * (1 + \u03bd * exp(1 + \u03bd + \u03bc_max * (1 + \u03bd)**(1/\u03bd) * (\u03bb - t) / A))**(-1/\u03bd)` | \u03bb | threshold/<br>tangent | n.a. | \u03bc_max | K | entire curve |\n",
     "\n",
     "### Understanding Growth Rates: Intrinsic vs. Observed\n",
     "\n",
     "**Important distinction:**\n",
     "\n",
-    "- **`mu_max`** (μ_max): The **observed** maximum specific growth rate calculated\n",
+    "- **`mu_max`** (\u03bc_max): The **observed** maximum specific growth rate calculated\n",
     "  from the fitted curve as max(d(ln N)/dt). This is what you measure from the data.\n",
     "\n",
     "- **`intrinsic_growth_rate`**: The **model parameter** representing intrinsic growth\n",
@@ -1014,9 +905,9 @@
     "Two methods are available for determining exponential phase boundaries:\n",
     "\n",
     "### 1. **Threshold Method**\n",
-    "- Tracks the instantaneous specific growth rate μ(t)\n",
-    "- `exp_phase_start`: First time when μ exceeds a fraction of μ_max (default: 15%)\n",
-    "- `exp_phase_end`: First time after peak when μ drops below the threshold\n",
+    "- Tracks the instantaneous specific growth rate \u03bc(t)\n",
+    "- `exp_phase_start`: First time when \u03bc exceeds a fraction of \u03bc_max (default: 15%)\n",
+    "- `exp_phase_end`: First time after peak when \u03bc drops below the threshold\n",
     "\n",
     "### 2. **Tangent Method**\n",
     "- Constructs a tangent line in log space at the point of maximum growth rate\n",
diff --git a/docs/tutorial/preprocessing.ipynb b/docs/tutorial/preprocessing.ipynb
@@ -0,0 +1,97 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
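+   "source": "# Preprocess growth data\n\nThis tutorial demonstrates the preprocessing functions in `growthcurves.preprocessing`:\n\n- **`path_correct(N, path_length_cm)`**: normalize OD measurements to a 1 cm path length\n- **`blank_subtraction(N, blank)`**: subtract blank/background measurements from the data\n- **`out_of_iqr_window(values, factor, position)`**: test whether the value at a given position (`'center'`, `'first'`, or `'last'`) of a single window is an IQR outlier\n- **`out_of_iqr(N, window_size, factor)`**: flag IQR outliers across a full time series, windowed by `window_size`\n\nApply these steps before model fitting when measurements need optical corrections or outlier screening."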
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import numpy as np\n\nimport growthcurves as gc\nfrom growthcurves import preprocessing as prep"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Path length correction"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# Measurements taken at 0.5 cm path length\nraw_od = np.array([0.25, 0.30, 0.35, 0.40])\nod_1cm = gc.path_correct(raw_od, path_length_cm=0.5)\n\nprint(f'Raw OD (0.5 cm): {raw_od}')\nprint(f'Corrected OD (1.0 cm): {od_1cm}')"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Blank subtraction"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "sample_od = np.array([0.50, 0.60, 0.70, 0.80])\nblank_od = np.array([0.05, 0.052, 0.048, 0.051])\ncorrected_od = gc.blank_subtraction(sample_od, blank_od)\n\nprint(f'Sample OD:    {sample_od}')\nprint(f'Blank OD:     {blank_od}')\nprint(f'Corrected OD: {corrected_od}')"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Outlier detection in a single window"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "window = np.array([0.10, 0.12, 0.65, 0.11, 0.13])\ncenter_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position='center')\nfirst_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position='first')\nlast_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position='last')\n\nprint(f'Window: {window}')\nprint(f'Center value outlier? {center_is_outlier}')\nprint(f'First value outlier?  {first_is_outlier}')\nprint(f'Last value outlier?   {last_is_outlier}')"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Outlier detection across a full time series"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "od_series = np.array([0.08, 0.11, 0.14, 0.19, 0.23, 0.95, 0.31, 0.36, 0.41])\nmask = prep.out_of_iqr(od_series, window_size=5, factor=1.5)\n\nprint(f'OD series: {od_series}')\nprint(f'Outlier mask: {mask}')\nprint(f'Outlier indices: {np.where(mask)[0]}')\nprint(f'Outlier values: {od_series[mask]}')"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Combined preprocessing pipeline"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "raw = np.array([0.10, 0.12, 0.14, 0.16, 0.48, 0.20, 0.22])\nblank = np.full_like(raw, 0.02)\npath_length_cm = 0.5\n\n# Step 1: normalize sample and blank readings to a 1 cm path length\nraw_1cm = gc.path_correct(raw, path_length_cm=path_length_cm)\nblank_1cm = gc.path_correct(blank, path_length_cm=path_length_cm)\n\n# Step 2: subtract the path-corrected blank\nbaseline_corrected = gc.blank_subtraction(raw_1cm, blank_1cm)\n\n# Step 3: flag outliers in the corrected series\noutlier_mask = prep.out_of_iqr(baseline_corrected, window_size=5, factor=1.5)\n\nprint(f'Raw OD ({path_length_cm} cm): {raw}')\nprint(f'Path-corrected OD (1 cm): {raw_1cm}')\nprint(f'Blank-subtracted OD: {baseline_corrected}')\nprint(f'Outlier mask: {outlier_mask}')"
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "growthcurves_env",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
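
Reviewer note on the outlier cells in `preprocessing.ipynb`: a windowed IQR screen of the kind those cells demonstrate could look like the sketch below. It applies Tukey's fences (Q1 - factor * IQR, Q3 + factor * IQR) to a centered window that is truncated at the series edges. The function names, the edge handling, and the use of `np.percentile` with its default linear interpolation are assumptions for illustration; the actual `out_of_iqr` and `out_of_iqr_window` implementations may differ.

```python
import numpy as np


def iqr_outlier_in_window(values, idx, factor=1.5):
    """Return True if values[idx] lies outside Tukey's fences of the window.

    Hypothetical helper, not the growthcurves implementation.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return bool(values[idx] < lower or values[idx] > upper)


def rolling_iqr_mask(series, window_size=5, factor=1.5):
    """Boolean outlier mask from a centered window, truncated at the edges.

    Hypothetical helper; the edge policy is an assumption.
    """
    series = np.asarray(series, dtype=float)
    half = window_size // 2
    mask = np.zeros(series.size, dtype=bool)
    for i in range(series.size):
        lo = max(0, i - half)
        hi = min(series.size, i + half + 1)
        # i - lo is the position of the tested value inside the window
        mask[i] = iqr_outlier_in_window(series[lo:hi], i - lo, factor)
    return mask


# Same series as the notebook's full time-series example
od_series = np.array([0.08, 0.11, 0.14, 0.19, 0.23, 0.95, 0.31, 0.36, 0.41])
mask = rolling_iqr_mask(od_series, window_size=5, factor=1.5)

print(np.where(mask)[0])
print(od_series[mask])
```

Under these assumptions only the 0.95 reading (index 5) is flagged. The library's result for the same series may differ if it treats edges or quartiles differently.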