164 changes: 152 additions & 12 deletions docs/tutorial/preprocessing.ipynb
@@ -3,74 +3,214 @@
{
"cell_type": "markdown",
"metadata": {},
"source": "# Preprocess growth data\n\nThis tutorial demonstrates the preprocessing functions in `growthcurves.preprocessing`:\n\n- **`path_correct(N, path_length_cm)`**\n- **`blank_subtraction(N, blank)`**\n- **`out_of_iqr_window(values, factor, position)`**\n- **`out_of_iqr(N, window_size, factor)`**\n\nUse this workflow before model fitting when measurements require optical corrections or outlier screening."
"source": [
"# Preprocess growth data\n",
"\n",
"This tutorial demonstrates the preprocessing functions in\n",
"`growthcurves.preprocessing`:\n",
"\n",
"- **`path_correct(N, path_length_cm)`**\n",
"- **`blank_subtraction(N, blank)`**\n",
"- **`out_of_iqr_window(values, factor, position)`** — single-window helper\n",
"- **`detect_outliers(N, method, **kwargs)`** — main outlier detection entry point\n",
" - `method=\"iqr\"` — sliding-window IQR (kwargs: `window_size`, `factor`)\n",
" - `method=\"ecod\"` — ECOD anomaly detection (kwargs: `factor`)\n",
"\n",
"Use this workflow before model fitting when measurements require optical corrections\n",
"or outlier screening."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import numpy as np\n\nimport growthcurves as gc\nfrom growthcurves import preprocessing as prep"
"source": [
"import numpy as np\n",
"\n",
"import growthcurves as gc\n",
"from growthcurves import preprocessing as prep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Path length correction"
"source": [
"## Path length correction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Measurements taken at 0.5 cm path length\nraw_od = np.array([0.25, 0.30, 0.35, 0.40])\nod_1cm = gc.path_correct(raw_od, path_length_cm=0.5)\n\nprint(f'Raw OD (0.5 cm): {raw_od}')\nprint(f'Corrected OD (1.0 cm): {od_1cm}')"
"source": [
"# Measurements taken at 0.5 cm path length\n",
"raw_od = np.array([0.25, 0.30, 0.35, 0.40])\n",
"od_1cm = gc.path_correct(raw_od, path_length_cm=0.5)\n",
"\n",
"print(f\"Raw OD (0.5 cm): {raw_od}\")\n",
"print(f\"Corrected OD (1.0 cm): {od_1cm}\")"
]
},
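What a path-length correction typically does can be sketched in plain NumPy. This assumes the correction is a linear Beer-Lambert rescaling to a 1 cm path; `scale_to_1cm` is a hypothetical stand-in for illustration, not the library's `path_correct`, whose implementation is not shown in this diff.

```python
import numpy as np


def scale_to_1cm(od, path_length_cm):
    """Hypothetical helper: linearly rescale OD readings to a 1 cm path.

    Assumes absorbance is proportional to path length (Beer-Lambert).
    """
    od = np.asarray(od, dtype=float)
    if path_length_cm <= 0:
        raise ValueError("path_length_cm must be positive")
    return od / path_length_cm


# Same readings as in the tutorial cell, taken at a 0.5 cm path length
raw_od = np.array([0.25, 0.30, 0.35, 0.40])
od_1cm = scale_to_1cm(raw_od, path_length_cm=0.5)
print(od_1cm)
```

Under this assumption, halving the path length halves the reading, so dividing by 0.5 doubles each value.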
{
"cell_type": "markdown",
"metadata": {},
"source": "## Blank subtraction"
"source": [
"## Blank subtraction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "sample_od = np.array([0.50, 0.60, 0.70, 0.80])\nblank_od = np.array([0.05, 0.052, 0.048, 0.051])\ncorrected_od = gc.blank_subtraction(sample_od, blank_od)\n\nprint(f'Sample OD: {sample_od}')\nprint(f'Blank OD: {blank_od}')\nprint(f'Corrected OD:{corrected_od}')"
"source": [
"sample_od = np.array([0.50, 0.60, 0.70, 0.80])\n",
"blank_od = np.array([0.05, 0.052, 0.048, 0.051])\n",
"corrected_od = gc.blank_subtraction(sample_od, blank_od)\n",
"\n",
"print(f\"Sample OD: {sample_od}\")\n",
"print(f\"Blank OD: {blank_od}\")\n",
"print(f\"Corrected OD: {corrected_od}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Outlier detection in a single window"
"source": [
"## Outlier detection in a single window"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "window = np.array([0.10, 0.12, 0.65, 0.11, 0.13])\ncenter_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position='center')\nfirst_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position='first')\nlast_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position='last')\n\nprint(f'Window: {window}')\nprint(f'Center value outlier? {center_is_outlier}')\nprint(f'First value outlier? {first_is_outlier}')\nprint(f'Last value outlier? {last_is_outlier}')"
"source": [
"window = np.array([0.10, 0.12, 0.65, 0.11, 0.13])\n",
"center_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position=\"center\")\n",
"first_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position=\"first\")\n",
"last_is_outlier = prep.out_of_iqr_window(window, factor=1.5, position=\"last\")\n",
"\n",
"print(f\"Window: {window}\")\n",
"print(f\"Center value outlier? {center_is_outlier}\")\n",
"print(f\"First value outlier? {first_is_outlier}\")\n",
"print(f\"Last value outlier? {last_is_outlier}\")"
]
},
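The single-window check can be approximated with a standard 1.5 x IQR rule. This is a sketch under stated assumptions: `is_outlier_in_window` is a hypothetical function, the tested value is included when the quartiles are computed, and "center" is taken to mean index `len(values) // 2`. The behaviour of `out_of_iqr_window` itself may differ.

```python
import numpy as np


def is_outlier_in_window(values, factor=1.5, position="center"):
    """Hypothetical helper: test one value of a window against IQR bounds."""
    values = np.asarray(values, dtype=float)
    # Map the position name to an index inside the window
    index = {
        "first": 0,
        "center": len(values) // 2,
        "last": len(values) - 1,
    }[position]
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return bool(values[index] < lower or values[index] > upper)


window = np.array([0.10, 0.12, 0.65, 0.11, 0.13])
center_flag = is_outlier_in_window(window, position="center")
first_flag = is_outlier_in_window(window, position="first")
last_flag = is_outlier_in_window(window, position="last")
print(center_flag, first_flag, last_flag)
```

For this window the quartiles are 0.11 and 0.13, so the bounds are 0.08 and 0.16. Only the centre value 0.65 falls outside them.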
{
"cell_type": "markdown",
"metadata": {},
"source": "## Outlier detection across a full time series"
"source": [
"## Outlier detection across a full time series with `detect_outliers`\n",
"\n",
"`detect_outliers(N, method=..., **kwargs)` is the main entry point. Pass\n",
"`method=\"iqr\"` for the sliding-window IQR approach:\n",
"\n",
"- A value that can sit at the centre of a window is tested against the IQR bounds of that window.\n",
"- The first and last values cannot be centred in a window, so they are tested at the\n",
" first and last positions of the windows at the start and end of the series.\n",
" This is especially useful for catching outliers at the start of a series.\n",
"\n",
"Example with a centre outlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "od_series = np.array([0.08, 0.11, 0.14, 0.19, 0.23, 0.95, 0.31, 0.36, 0.41])\nmask = prep.out_of_iqr(od_series, window_size=5, factor=1.5)\n\nprint(f'OD series: {od_series}')\nprint(f'Outlier mask: {mask}')\nprint(f'Outlier indices: {np.where(mask)[0]}')\nprint(f'Outlier values: {od_series[mask]}')"
"source": [
"od_series = np.array([0.08, 0.11, 0.14, 0.19, 0.23, 0.25, 0.95, 0.31, 0.36, 0.41])\n",
"mask = prep.detect_outliers(od_series, method=\"iqr\", window_size=5, factor=1.5)\n",
"\n",
"print(f\"OD series: {od_series}\")\n",
"print(f\"Outlier mask: {mask}\")\n",
"print(f\"Outlier indices: {np.where(mask)[0]}\")\n",
"print(f\"Outlier values: {od_series[mask]}\")"
]
},
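The sliding-window idea described above can be sketched as follows. This is an illustrative reimplementation, not the code behind `detect_outliers`: `sliding_iqr_mask` is a hypothetical name, the window is assumed to be odd-sized, and values near either end are tested at their off-centre positions within the first or last full window.

```python
import numpy as np


def sliding_iqr_mask(values, window_size=5, factor=1.5):
    """Hypothetical helper: flag values outside the IQR bounds of their window."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    if window_size % 2 == 0 or n < window_size:
        raise ValueError("window_size must be odd and no longer than the series")
    half = window_size // 2
    mask = np.zeros(n, dtype=bool)
    for i in range(n):
        # Centre the window on i where possible; clamp it at the series edges
        start = min(max(i - half, 0), n - window_size)
        window = values[start:start + window_size]
        q1, q3 = np.percentile(window, [25, 75])
        iqr = q3 - q1
        value = window[i - start]
        mask[i] = value < q1 - factor * iqr or value > q3 + factor * iqr
    return mask


od_series = np.array([0.08, 0.11, 0.14, 0.19, 0.23, 0.25, 0.95, 0.31, 0.36, 0.41])
mask = sliding_iqr_mask(od_series, window_size=5, factor=1.5)
print(np.where(mask)[0])
```

With these assumptions the sketch flags only index 6 (the value 0.95). The library function may treat edge positions differently.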
{
"cell_type": "markdown",
"id": "93a3bc8a",
"metadata": {},
"source": "## Combined preprocessing pipeline"
"source": [
"Example with a centre outlier, and an outlier at the beginning of the series:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42b10ae5",
"metadata": {},
"outputs": [],
"source": "raw = np.array([0.10, 0.12, 0.14, 0.16, 0.48, 0.20, 0.22])\nblank = np.full_like(raw, 0.02)\npath_length_cm = 0.5\n\nraw_1cm = gc.path_correct(raw, path_length_cm=path_length_cm)\nblank_1cm = gc.path_correct(blank, path_length_cm=path_length_cm)\nbaseline_corrected = gc.blank_subtraction(raw_1cm, blank_1cm)\noutlier_mask = prep.out_of_iqr(baseline_corrected, window_size=5, factor=1.5)\n\nprint(f'Raw OD ({path_length_cm} cm): {raw}')\nprint(f'Path-corrected OD (1 cm): {raw_1cm}')\nprint(f'Blank-subtracted OD: {baseline_corrected}')\nprint(f'Outlier mask: {outlier_mask}')"
"source": [
"od_series = np.array([0.08, 0.99, 0.14, 0.19, 0.23, 0.25, 0.95, 0.31, 0.36, 0.41])\n",
"mask = prep.detect_outliers(od_series, method=\"iqr\", window_size=5, factor=1.5)\n",
"\n",
"print(f\"OD series: {od_series}\")\n",
"print(f\"Outlier mask: {mask}\")\n",
"print(f\"Outlier indices: {np.where(mask)[0]}\")\n",
"print(f\"Outlier values: {od_series[mask]}\")"
]
},
{
"cell_type": "markdown",
"id": "e4a168ed",
"metadata": {},
"source": [
"If several outliers occur at the start of a time series, the IQR must be calculated\n",
"over a sufficiently large window, and possibly iteratively, to detect all of them.\n",
"Here the first value is not detected as an outlier because the second value is\n",
"included in the window and widens the IQR."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e416f91",
"metadata": {},
"outputs": [],
"source": [
"od_series = np.array([0.99, 0.99, 0.14, 0.19, 0.23, 0.25, 0.95, 0.31, 0.36, 0.41])\n",
"mask = prep.detect_outliers(od_series, method=\"iqr\", window_size=5, factor=1.5)\n",
"\n",
"print(f\"OD series: {od_series}\")\n",
"print(f\"Outlier mask: {mask}\")\n",
"print(f\"Outlier indices: {np.where(mask)[0]}\")\n",
"print(f\"Outlier values: {od_series[mask]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combined preprocessing pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw = np.array([0.10, 0.12, 0.14, 0.16, 0.48, 0.20, 0.22])\n",
"blank = np.full_like(raw, 0.02)\n",
"path_length_cm = 0.5\n",
"\n",
"raw_1cm = gc.path_correct(raw, path_length_cm=path_length_cm)\n",
"blank_1cm = gc.path_correct(blank, path_length_cm=path_length_cm)\n",
"baseline_corrected = gc.blank_subtraction(raw_1cm, blank_1cm)\n",
"outlier_mask = prep.detect_outliers(\n",
" baseline_corrected, method=\"iqr\", window_size=5, factor=1.5\n",
")\n",
"\n",
"print(f\"Raw OD ({path_length_cm} cm): {raw}\")\n",
"print(f\"Path-corrected OD (1 cm): {raw_1cm}\")\n",
"print(f\"Blank-subtracted OD: {baseline_corrected}\")\n",
"print(f\"Outlier mask: {outlier_mask}\")"
]
}
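Once a boolean outlier mask is available, it can be applied before model fitting in two common ways. The sketch below uses plain NumPy with illustrative values: `time_h` and the hand-written `outlier_mask` are assumptions made for this example and are not produced by the library.

```python
import numpy as np

# Illustrative inputs: a hypothetical time axis, OD values, and a mask
time_h = np.arange(7) * 0.5
od = np.array([0.16, 0.20, 0.24, 0.28, 0.92, 0.36, 0.40])
outlier_mask = np.array([False, False, False, False, True, False, False])

# Option 1: drop flagged points, keeping time and OD aligned
time_clean = time_h[~outlier_mask]
od_clean = od[~outlier_mask]

# Option 2: keep the array length and mark flagged points as missing
od_nan = np.where(outlier_mask, np.nan, od)

print(time_clean, od_clean)
print(od_nan)
```

Dropping points suits fitting routines that accept irregular time grids. Marking with NaN preserves the sampling grid for routines that handle missing values.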
],
"metadata": {