diff --git a/CHANGELOG.md b/CHANGELOG.md index 24c934fcf..358c3d7b4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -58,6 +58,8 @@ - **Clearer validation errors in adjustment helpers** - `trim_weights()` now accepts list/tuple inputs and reports invalid types explicitly. - `apply_transformations()` raises clearer errors for invalid inputs and empty transformations. +- **Cleaned up Balance CLI tutorial flow** + - Removed the unused synthetic dataset section, aligned diagnostics exploration by metric/variable pairs, and shortened the CLI help snippet. ## Code Quality & Refactoring diff --git a/tutorials/balance_cli_tutorial.ipynb b/tutorials/balance_cli_tutorial.ipynb index ed2eac11f..40a4c5c0b 100644 --- a/tutorials/balance_cli_tutorial.ipynb +++ b/tutorials/balance_cli_tutorial.ipynb @@ -1,709 +1,673 @@ { "cells": [ - { - "cell_type": "markdown", - "id": "3fcf91b9", - "metadata": {}, - "source": [ - "# CLI tutorial\n", - "\n", - "This tutorial walks through using the `balance` command-line interface (CLI) to adjust a sample dataset to a target. We will build a small synthetic dataset, run the CLI, and inspect the outputs.\n", - "\n", - "The real power of a CLI lies in how seamlessly it integrates into the broader ecosystem of automation and data workflows. A CLI command can be invoked directly from shell scripts, scheduled via cron jobs, embedded in CI/CD pipelines, or orchestrated through tools like Airflow - all with minimal overhead. This composability means you can chain balance operations with other command-line tools using pipes, process batches of files in a loop, or trigger analyses based on events, all while maintaining a clear audit trail since the command itself documents exactly what was run. The non-zero exit codes that CLIs return on failure integrate naturally with automated systems that need to halt pipelines or send alerts when something goes wrong. 
In short, a CLI transforms balance from something you use interactively into a building block for production-grade, reproducible workflows." - ] - }, - { - "cell_type": "markdown", - "id": "3ab860ae", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "\n", - "Make sure `balance` is installed and the `balance` CLI is on your PATH. You can also run the CLI via `python -m balance.cli` from a checkout of the repository.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e911bd3d", - "metadata": { - "execution": { - "iopub.execute_input": "2026-01-18T17:42:04.803717Z", - "iopub.status.busy": "2026-01-18T17:42:04.803432Z", - "iopub.status.idle": "2026-01-18T17:42:08.325883Z", - "shell.execute_reply": "2026-01-18T17:42:08.324430Z" - } - }, - "outputs": [], - "source": [ - "import os\n", - "import subprocess\n", - "import tempfile\n", - "\n", - "import numpy as np\n", - "import pandas as pd\n", - "\n", - "from balance import load_data\n", - "from IPython.display import display\n" - ] - }, - { - "cell_type": "markdown", - "id": "744fc4f3", - "metadata": {}, - "source": [ - "## Create a sample + target dataset\n", - "\n", - "We'll create a CSV with two groups: respondents (sample) and non-respondents (target). 
The CLI expects a binary sample indicator column (`is_respondent` by default), an `id`, a `weight`, and covariates.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "369a8fdb", - "metadata": { - "execution": { - "iopub.execute_input": "2026-01-18T17:12:06.965029Z", - "iopub.status.busy": "2026-01-18T17:12:06.964395Z", - "iopub.status.idle": "2026-01-18T17:12:06.985554Z", - "shell.execute_reply": "2026-01-18T17:12:06.984593Z" - } - }, - "outputs": [], - "source": [ - "rng = np.random.default_rng(2021)\n", - "n_sample = 1000\n", - "n_target = 2000\n", - "\n", - "sample_df = pd.DataFrame(\n", - " {\n", - " \"age\": rng.uniform(18, 80, n_sample),\n", - " \"gender\": rng.choice([1, 2, 3, 4], n_sample),\n", - " \"id\": range(n_sample),\n", - " \"weight\": 1.0,\n", - " \"is_respondent\": 1,\n", - " }\n", - ")\n", - "target_df = pd.DataFrame(\n", - " {\n", - " \"age\": rng.uniform(18, 80, n_target),\n", - " \"gender\": rng.choice([1, 2, 3, 4], n_target),\n", - " \"id\": range(n_sample, n_sample + n_target),\n", - " \"weight\": 1.0,\n", - " \"is_respondent\": 0,\n", - " }\n", - ")\n", - "\n", - "input_df = pd.concat([sample_df, target_df], ignore_index=True)\n", - "input_df.head()\n" - ] - }, - { - "cell_type": "markdown", - "id": "b9a75d8a", - "metadata": {}, - "source": [ - "## Alternative: use the bundled demo data\n", - "\n", - "Balance ships with a small demo dataset via `load_data()`. 
You can build the CLI input by\n", - "adding a sample indicator and weight columns, then concatenate sample and target frames.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c062fe77", - "metadata": { - "execution": { - "iopub.execute_input": "2026-01-18T17:12:06.988076Z", - "iopub.status.busy": "2026-01-18T17:12:06.987796Z", - "iopub.status.idle": "2026-01-18T17:12:07.027332Z", - "shell.execute_reply": "2026-01-18T17:12:07.026311Z" - } - }, - "outputs": [], - "source": [ - "target_df, sample_df = load_data()\n", - "\n", - "sample_df = sample_df.copy()\n", - "target_df = target_df.copy()\n", - "sample_df[\"is_respondent\"] = 1\n", - "target_df[\"is_respondent\"] = 0\n", - "sample_df[\"weight\"] = 1.0\n", - "target_df[\"weight\"] = 1.0\n", - "\n", - "load_data_input_df = pd.concat([sample_df, target_df], ignore_index=True)\n", - "load_data_input_df.head()\n" - ] - }, - { - "cell_type": "markdown", - "id": "c9e61940", - "metadata": {}, - "source": [ - "## Run the CLI\n", - "\n", - "We'll write the input dataset to disk, then call the CLI to compute weights and diagnostics.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b8cd2ef2", - "metadata": { - "execution": { - "iopub.execute_input": "2026-01-18T17:12:07.029932Z", - "iopub.status.busy": "2026-01-18T17:12:07.029701Z", - "iopub.status.idle": "2026-01-18T17:12:12.807649Z", - "shell.execute_reply": "2026-01-18T17:12:12.806571Z" - } - }, - "outputs": [], - "source": [ - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " input_path = os.path.join(tmpdir, \"input.csv\")\n", - " output_path = os.path.join(tmpdir, \"weights_out.csv\")\n", - " diagnostics_path = os.path.join(tmpdir, \"diagnostics_out.csv\")\n", - "\n", - " input_df.to_csv(input_path, index=False)\n", - "\n", - " cmd = [\n", - " \"python\",\n", - " \"-m\",\n", - " \"balance.cli\",\n", - " \"--input_file\",\n", - " input_path,\n", - " \"--output_file\",\n", - " output_path,\n", - " 
\"--diagnostics_output_file\",\n", - " diagnostics_path,\n", - " \"--covariate_columns\",\n", - " \"age,gender\",\n", - " \"--method\",\n", - " \"ipw\",\n", - " ]\n", - "\n", - " print(\"CLI command:\", \" \".join(cmd))\n", - " subprocess.check_call(cmd)\n", - "\n", - " adjusted_df = pd.read_csv(output_path)\n", - " diagnostics_df = pd.read_csv(diagnostics_path)\n", - "\n", - "adjusted_df.head()\n" - ] - }, - { - "cell_type": "markdown", - "id": "299f5bc4", - "metadata": {}, - "source": [ - "## Run the CLI on the bundled demo data\n", - "\n", - "Here is the same CLI flow using the data returned by `load_data()`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bf6a4804", - "metadata": { - "execution": { - "iopub.execute_input": "2026-01-18T17:42:15.586144Z", - "iopub.status.busy": "2026-01-18T17:42:15.585822Z", - "iopub.status.idle": "2026-01-18T17:42:37.497372Z", - "shell.execute_reply": "2026-01-18T17:42:37.495530Z" - } - }, - "outputs": [], - "source": [ - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " input_path = os.path.join(tmpdir, \"input_load_data.csv\")\n", - " output_path = os.path.join(tmpdir, \"weights_load_data.csv\")\n", - " diagnostics_path = os.path.join(tmpdir, \"diagnostics_load_data.csv\")\n", - "\n", - " load_data_input_df.to_csv(input_path, index=False)\n", - "\n", - " cmd = [\n", - " \"python\",\n", - " \"-m\",\n", - " \"balance.cli\",\n", - " \"--input_file\",\n", - " input_path,\n", - " \"--output_file\",\n", - " output_path,\n", - " \"--diagnostics_output_file\",\n", - " diagnostics_path,\n", - " \"--covariate_columns\",\n", - " \"gender,age_group,income\",\n", - " \"--outcome_columns\",\n", - " \"happiness\",\n", - " \"--method\",\n", - " \"ipw\",\n", - " ]\n", - "\n", - " print(\"CLI command:\", \" \".join(cmd))\n", - " subprocess.check_call(cmd)\n", - "\n", - " load_data_adjusted_df = pd.read_csv(output_path)\n", - " load_data_diagnostics_df = pd.read_csv(diagnostics_path)\n", - "\n", - 
"display(load_data_adjusted_df.head())\n", - "display(load_data_diagnostics_df.head())\n" - ] - }, - { - "cell_type": "markdown", - "id": "5398a983", - "metadata": {}, - "source": [ - "## Inspect diagnostics\n", - "\n", - "The diagnostics output is a flat table that includes adjustment metadata and balance\n", - "metrics. The `metric` column identifies the type of diagnostic, while `var` indicates the\n", - "variable (or `NaN` for overall summaries). The cells below use the diagnostics from the\n", - "`load_data()` run (`load_data_diagnostics_df`).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d20c0ce3", - "metadata": { - "execution": { - "iopub.execute_input": "2026-01-18T17:42:37.500850Z", - "iopub.status.busy": "2026-01-18T17:42:37.500461Z", - "iopub.status.idle": "2026-01-18T17:42:37.511102Z", - "shell.execute_reply": "2026-01-18T17:42:37.509100Z" - } - }, - "outputs": [], - "source": [ - "sorted(load_data_diagnostics_df[\"metric\"].unique())\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7c44bb6b", - "metadata": { - "execution": { - "iopub.execute_input": "2026-01-18T17:42:37.515721Z", - "iopub.status.busy": "2026-01-18T17:42:37.515347Z", - "iopub.status.idle": "2026-01-18T17:42:37.525848Z", - "shell.execute_reply": "2026-01-18T17:42:37.524458Z" - } - }, - "outputs": [], - "source": [ - "sorted(load_data_diagnostics_df[\"var\"].dropna().unique())\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "390ae698", - "metadata": { - "execution": { - "iopub.execute_input": "2026-01-18T17:12:28.807173Z", - "iopub.status.busy": "2026-01-18T17:12:28.806973Z", - "iopub.status.idle": "2026-01-18T17:12:28.812989Z", - "shell.execute_reply": "2026-01-18T17:12:28.812303Z" - } - }, - "outputs": [], - "source": [ - "load_data_diagnostics_df.query(\"metric == 'adjustment_method'\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "b562fb6d", - "metadata": {}, - "source": [ - "## CLI Help and 
Arguments\n", - "\n", - "You can view all available CLI arguments using `--help`:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Print all CLI arguments\n", - "subprocess.run(\"python -m balance.cli --help\", shell=True)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Key CLI Arguments Summary\n", - "\n", - "Here are the most commonly used arguments:\n", - "\n", - "| Argument | Default | Description |\n", - "|----------|---------|-------------|\n", - "| `--method` | `ipw` | Adjustment method: `ipw`, `cbps`, or `rake` |\n", - "| `--max_de` | `1.5` | Maximum design effect. Set to `None` to use `lambda_1se` instead |\n", - "| `--lambda_min` | `1e-05` | Lower bound for L1 penalty (IPW only) |\n", - "| `--lambda_max` | `10` | Upper bound for L1 penalty (IPW only) |\n", - "| `--num_lambdas` | `250` | Number of lambda values to search (IPW only) |\n", - "| `--weight_trimming_mean_ratio` | `20.0` | Trim weights above `mean(weights) * ratio` |\n", - "| `--transformations` | `default` | Covariate transformations. 
Use `None` to disable |\n", - "| `--formula` | `None` | Custom model formula (e.g., `\"age + gender\"`) |\n", - "| `--one_hot_encoding` | `True` | One-hot encode categorical features |\n", - "| `--batch_columns` | `None` | Columns to group by for batch processing |\n", - "| `--keep_columns` | `None` | Subset of columns to include in output |\n", - "| `--outcome_columns` | `None` | Columns treated as outcomes (not covariates) |\n", - "| `--ipw_logistic_regression_kwargs` | `None` | JSON string of kwargs for sklearn LogisticRegression |\n", - "| `--succeed_on_weighting_failure` | `False` | Return null weights instead of failing on errors |\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Example: Tuning IPW parameters\n", - "\n", - "Below we run the CLI with custom regularization settings and a custom logistic regression solver:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " input_path = os.path.join(tmpdir, \"input.csv\")\n", - " output_path = os.path.join(tmpdir, \"weights_tuned.csv\")\n", - " diagnostics_path = os.path.join(tmpdir, \"diagnostics_tuned.csv\")\n", - "\n", - " load_data_input_df.to_csv(input_path, index=False)\n", - "\n", - " cmd = [\n", - " \"python\",\n", - " \"-m\",\n", - " \"balance.cli\",\n", - " \"--input_file\", input_path,\n", - " \"--output_file\", output_path,\n", - " \"--diagnostics_output_file\", diagnostics_path,\n", - " \"--covariate_columns\", \"gender,age_group,income\",\n", - " \"--method\", \"ipw\",\n", - " # Tuning parameters\n", - " \"--max_de\", \"2.0\",\n", - " \"--lambda_min\", \"1e-06\",\n", - " \"--lambda_max\", \"100\",\n", - " \"--num_lambdas\", \"500\",\n", - " \"--weight_trimming_mean_ratio\", \"10.0\",\n", - " # Custom logistic regression settings\n", - " \"--ipw_logistic_regression_kwargs\", '{\"solver\": \"liblinear\", \"max_iter\": 500}',\n", - " ]\n", - "\n", - 
" print(\"CLI command:\")\n", - " print(\" \".join(cmd))\n", - " subprocess.check_call(cmd)\n", - "\n", - " tuned_adjusted_df = pd.read_csv(output_path)\n", - "\n", - "tuned_adjusted_df.head()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Example: Using a Custom Formula\n", - "\n", - "The `--formula` argument allows you to specify a custom model formula, including interaction\n", - "terms. When using `--formula`, you should typically also set `--transformations=None` to\n", - "prevent automatic transformations from interfering with your custom formula.\n", - "\n", - "The formula uses patsy/R-style syntax:\n", - "- `age + gender`: additive terms (no interaction)\n", - "- `age * gender`: equivalent to `age + gender + age:gender` (main effects + interaction)\n", - "- `age:gender`: only the interaction term" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " input_path = os.path.join(tmpdir, \"input.csv\")\n", - " output_path = os.path.join(tmpdir, \"weights_formula.csv\")\n", - " diagnostics_path = os.path.join(tmpdir, \"diagnostics_formula.csv\")\n", - "\n", - " # Use the synthetic data with numeric covariates for formula example\n", - " input_df.to_csv(input_path, index=False)\n", - "\n", - " cmd = [\n", - " \"python\",\n", - " \"-m\",\n", - " \"balance.cli\",\n", - " \"--input_file\", input_path,\n", - " \"--output_file\", output_path,\n", - " \"--diagnostics_output_file\", diagnostics_path,\n", - " \"--covariate_columns\", \"age,gender\",\n", - " \"--method\", \"ipw\",\n", - " # Disable transformations to use raw covariates in formula\n", - " \"--transformations\", \"None\",\n", - " # Use a formula with interaction term\n", - " \"--formula\", \"age*gender\",\n", - " ]\n", - "\n", - " print(\"CLI command with custom formula:\")\n", - " print(\" \".join(cmd))\n", - " subprocess.check_call(cmd)\n", - "\n", - " 
formula_diagnostics_df = pd.read_csv(diagnostics_path)\n", - "\n", - "# Check model coefficients to verify formula was applied\n", - "print(\"\\nModel coefficients (showing interaction term):\")\n", - "print(formula_diagnostics_df.query(\"metric == 'model_coef'\")[[\"var\", \"val\"]])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Batch Processing Example\n", - "\n", - "The `--batch_columns` argument allows you to run separate adjustments for each unique\n", - "combination of values in the specified columns. This is useful when you want to compute\n", - "weights independently for different subgroups (e.g., by gender or region)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create a dataset with a batch column for gender\n", - "batch_input_df = load_data_input_df.copy()\n", - "\n", - "# The 'gender' column has values like 'Female', 'Male', and possibly NA\n", - "# Filter to only rows with non-null gender for this example\n", - "batch_input_df = batch_input_df[batch_input_df[\"gender\"].notna()].copy()\n", - "print(f\"Rows after filtering: {len(batch_input_df)}\")\n", - "print(f\"Gender distribution:\\n{batch_input_df['gender'].value_counts()}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " input_path = os.path.join(tmpdir, \"input_batch.csv\")\n", - " output_path = os.path.join(tmpdir, \"weights_batch.csv\")\n", - " diagnostics_path = os.path.join(tmpdir, \"diagnostics_batch.csv\")\n", - "\n", - " batch_input_df.to_csv(input_path, index=False)\n", - "\n", - " cmd = [\n", - " \"python\",\n", - " \"-m\",\n", - " \"balance.cli\",\n", - " \"--input_file\", input_path,\n", - " \"--output_file\", output_path,\n", - " \"--diagnostics_output_file\", diagnostics_path,\n", - " \"--covariate_columns\", \"age_group,income\", # Note: gender is now used 
as batch column\n", - " \"--outcome_columns\", \"happiness\",\n", - " \"--batch_columns\", \"gender\", # Process each gender separately\n", - " \"--method\", \"ipw\",\n", - " ]\n", - "\n", - " print(\"CLI command with batch processing:\")\n", - " print(\" \".join(cmd))\n", - " subprocess.check_call(cmd)\n", - "\n", - " batch_adjusted_df = pd.read_csv(output_path)\n", - " batch_diagnostics_df = pd.read_csv(diagnostics_path)\n", - "\n", - "print(f\"\\nOutput rows: {len(batch_adjusted_df)}\")\n", - "batch_adjusted_df.head()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Inspect weights by gender - each group was adjusted independently\n", - "print(\"Weight statistics by gender (sample only):\")\n", - "sample_only = batch_adjusted_df[batch_adjusted_df[\"is_respondent\"] == 1]\n", - "print(sample_only.groupby(\"gender\")[\"weight\"].describe().round(3))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Alternative Weighting Methods\n", - "\n", - "The CLI supports three adjustment methods:\n", - "- **IPW (Inverse Probability Weighting)**: The default method, uses logistic regression to estimate propensity scores\n", - "- **CBPS (Covariate Balancing Propensity Score)**: Balances covariates while estimating propensity scores\n", - "- **Rake (Raking/Iterative Proportional Fitting)**: Adjusts weights iteratively to match marginal distributions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Example: CBPS Method\n", - "\n", - "CBPS simultaneously optimizes covariate balance and propensity score estimation:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " input_path = os.path.join(tmpdir, \"input.csv\")\n", - " output_path = os.path.join(tmpdir, \"weights_cbps.csv\")\n", - " diagnostics_path = os.path.join(tmpdir, 
\"diagnostics_cbps.csv\")\n", - "\n", - " input_df.to_csv(input_path, index=False)\n", - "\n", - " cmd = [\n", - " \"python\",\n", - " \"-m\",\n", - " \"balance.cli\",\n", - " \"--input_file\", input_path,\n", - " \"--output_file\", output_path,\n", - " \"--diagnostics_output_file\", diagnostics_path,\n", - " \"--covariate_columns\", \"age,gender\",\n", - " \"--method\", \"cbps\",\n", - " ]\n", - "\n", - " print(\"CLI command with CBPS method:\")\n", - " print(\" \".join(cmd))\n", - " subprocess.check_call(cmd)\n", - "\n", - " cbps_diagnostics_df = pd.read_csv(diagnostics_path)\n", - "\n", - "# Verify the method used\n", - "print(\"\\nAdjustment method used:\")\n", - "print(cbps_diagnostics_df.query(\"metric == 'adjustment_method'\")[[\"var\", \"val\"]])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Example: Rake Method\n", - "\n", - "Raking iteratively adjusts weights to match target marginal distributions:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " input_path = os.path.join(tmpdir, \"input.csv\")\n", - " output_path = os.path.join(tmpdir, \"weights_rake.csv\")\n", - " diagnostics_path = os.path.join(tmpdir, \"diagnostics_rake.csv\")\n", - "\n", - " input_df.to_csv(input_path, index=False)\n", - "\n", - " cmd = [\n", - " \"python\",\n", - " \"-m\",\n", - " \"balance.cli\",\n", - " \"--input_file\", input_path,\n", - " \"--output_file\", output_path,\n", - " \"--diagnostics_output_file\", diagnostics_path,\n", - " \"--covariate_columns\", \"age,gender\",\n", - " \"--method\", \"rake\",\n", - " ]\n", - "\n", - " print(\"CLI command with rake method:\")\n", - " print(\" \".join(cmd))\n", - " subprocess.check_call(cmd)\n", - "\n", - " rake_diagnostics_df = pd.read_csv(diagnostics_path)\n", - "\n", - "# Verify the method used\n", - "print(\"\\nAdjustment method used:\")\n", - 
"print(rake_diagnostics_df.query(\"metric == 'adjustment_method'\")[[\"var\", \"val\"]])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Next steps\n", - "\n", - "- Try `--method cbps` or `--method rake` for alternative weighting approaches.\n", - "- Use `--outcome_columns` to control which columns are treated as outcomes.\n", - "- Supply `--ipw_logistic_regression_kwargs` to tune the IPW model.\n", - "- Use `--succeed_on_weighting_failure` for pipelines where you want null weights instead of errors.\n", - "- Explore `--covariate_columns_for_diagnostics` and `--rows_to_keep_for_diagnostics` to customize diagnostic output.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Session info\n", - "\n", - "For reproducibility, here is the session information:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import session_info\n", - "session_info.show(html=False, dependencies=True)\n" - ] + { + "cell_type": "markdown", + "id": "3fcf91b9", + "metadata": {}, + "source": [ + "# CLI tutorial\n", + "\n", + "This tutorial walks through using the `balance` command-line interface (CLI) to adjust a sample dataset to a target. We will build a small demo dataset, run the CLI, and inspect the outputs.\n", + "\n", + "The real power of a CLI lies in how seamlessly it integrates into the broader ecosystem of automation and data workflows. A CLI command can be invoked directly from shell scripts, scheduled via cron jobs, embedded in CI/CD pipelines, or orchestrated through tools like Airflow - all with minimal overhead. This composability means you can chain balance operations with other command-line tools using pipes, process batches of files in a loop, or trigger analyses based on events, all while maintaining a clear audit trail since the command itself documents exactly what was run. 
The non-zero exit codes that CLIs return on failure integrate naturally with automated systems that need to halt pipelines or send alerts when something goes wrong. In short, a CLI transforms balance from something you use interactively into a building block for production-grade, reproducible workflows.\n" + ] + }, + { + "cell_type": "markdown", + "id": "3ab860ae", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "Make sure `balance` is installed and the `balance` CLI is on your PATH. You can also run the CLI via `python -m balance.cli` from a checkout of the repository.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e911bd3d", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:24:45.475332Z", + "iopub.status.busy": "2026-01-19T05:24:45.475086Z", + "iopub.status.idle": "2026-01-19T05:24:49.753010Z", + "shell.execute_reply": "2026-01-19T05:24:49.751511Z" + } + }, + "outputs": [], + "source": [ + "import os\n", + "import subprocess\n", + "import tempfile\n", + "\n", + "import pandas as pd\n", + "\n", + "from balance import load_data\n" + ] + }, + { + "cell_type": "markdown", + "id": "b9a75d8a", + "metadata": {}, + "source": [ + "## Use the bundled demo data\n", + "\n", + "Balance ships with a small demo dataset via `load_data()`. 
You can build the CLI input by\n", + "adding a sample indicator and weight columns, then concatenating the sample and target frames.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c062fe77", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:24:49.757535Z", + "iopub.status.busy": "2026-01-19T05:24:49.756959Z", + "iopub.status.idle": "2026-01-19T05:24:49.824122Z", + "shell.execute_reply": "2026-01-19T05:24:49.823490Z" + } + }, + "outputs": [], + "source": [ + "target_df, sample_df = load_data()\n", + "\n", + "sample_df = sample_df.copy()\n", + "target_df = target_df.copy()\n", + "sample_df[\"is_respondent\"] = 1\n", + "target_df[\"is_respondent\"] = 0\n", + "sample_df[\"weight\"] = 1.0\n", + "target_df[\"weight\"] = 1.0\n", + "\n", + "load_data_input_df = pd.concat([sample_df, target_df], ignore_index=True)\n", + "load_data_input_df.head()\n" + ] + }, + { + "cell_type": "markdown", + "id": "c9e61940", + "metadata": {}, + "source": [ + "## Run the CLI\n", + "\n", + "We'll write the input dataset to disk, then call the CLI to compute weights and diagnostics.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8cd2ef2", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:24:49.827666Z", + "iopub.status.busy": "2026-01-19T05:24:49.827216Z", + "iopub.status.idle": "2026-01-19T05:25:13.925881Z", + "shell.execute_reply": "2026-01-19T05:25:13.924800Z" + } + }, + "outputs": [], + "source": [ + "with tempfile.TemporaryDirectory() as tmpdir:\n", + " input_path = os.path.join(tmpdir, \"input.csv\")\n", + " output_path = os.path.join(tmpdir, \"weights_out.csv\")\n", + " diagnostics_path = os.path.join(tmpdir, \"diagnostics_out.csv\")\n", + "\n", + " load_data_input_df.to_csv(input_path, index=False)\n", + "\n", + " cmd = [\n", + " \"python\",\n", + " \"-m\",\n", + " \"balance.cli\",\n", + " \"--input_file\",\n", + " input_path,\n", + " \"--output_file\",\n", + " output_path,\n", + " 
\"--diagnostics_output_file\",\n", + " diagnostics_path,\n", + " \"--covariate_columns\",\n", + " \"gender,age_group,income\",\n", + " \"--method\",\n", + " \"ipw\",\n", + " ]\n", + "\n", + " print(\"CLI command:\", \" \".join(cmd))\n", + " subprocess.check_call(cmd)\n", + "\n", + " load_data_adjusted_df = pd.read_csv(output_path)\n", + " load_data_diagnostics_df = pd.read_csv(diagnostics_path)\n", + "\n", + "load_data_adjusted_df.head()\n" + ] + }, + { + "cell_type": "markdown", + "id": "5398a983", + "metadata": {}, + "source": [ + "## Inspect diagnostics\n", + "\n", + "The diagnostics output is a flat table that includes adjustment metadata and balance\n", + "metrics. The `metric` column identifies the type of diagnostic, while `var` indicates the\n", + "variable (or `NaN` for overall summaries). It is most useful to inspect `var` in the\n", + "context of the metric it belongs to. The cells below use the diagnostics from the\n", + "previous CLI run (`load_data_diagnostics_df`).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d20c0ce3", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:25:13.929237Z", + "iopub.status.busy": "2026-01-19T05:25:13.928928Z", + "iopub.status.idle": "2026-01-19T05:25:13.947012Z", + "shell.execute_reply": "2026-01-19T05:25:13.945233Z" } - ], - "metadata": { - "captumWidgetMessage": {}, - "dataExplorerConfig": {}, - "fileHeader": "", - "fileUid": "a25fa00c-a941-4613-983d-cbcdd38ce995", - "isAdHoc": false, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.12" - }, - "last_kernel_id": "c3b1455f-e38f-456b-93d0-30bb986a13c1", - "last_msg_id": "9c18e41f-6fecac803e00a64bdc0e960e_175", - 
"last_server_session_id": "e9565ab9-c26f-47a6-84c2-db16db62b238", - "outputWidgetContext": {} + }, + "outputs": [], + "source": [ + "(\n", + " load_data_diagnostics_df.groupby(\"metric\")[\"var\"]\n", + " .apply(lambda col: sorted(col.dropna().unique()))\n", + " .sort_index()\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "390ae698", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:25:13.949608Z", + "iopub.status.busy": "2026-01-19T05:25:13.949303Z", + "iopub.status.idle": "2026-01-19T05:25:13.966000Z", + "shell.execute_reply": "2026-01-19T05:25:13.963520Z" + } + }, + "outputs": [], + "source": [ + "load_data_diagnostics_df.query(\"metric == 'adjustment_method'\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "b562fb6d", + "metadata": {}, + "source": [ + "## CLI Help and Arguments\n", + "\n", + "You can view all available CLI arguments using `--help`. Because the full output is long,\n", + "the snippet below prints the first section only.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec65a74b", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:25:13.968706Z", + "iopub.status.busy": "2026-01-19T05:25:13.968391Z", + "iopub.status.idle": "2026-01-19T05:25:19.040294Z", + "shell.execute_reply": "2026-01-19T05:25:19.037852Z" + } + }, + "outputs": [], + "source": [ + "# Print a shorter CLI help snippet\n", + "help_output = subprocess.run(\n", + " [\"python\", \"-m\", \"balance.cli\", \"--help\"],\n", + " check=False,\n", + " capture_output=True,\n", + " text=True,\n", + ").stdout\n", + "print(\"\\n\".join(help_output.splitlines()[:40]))\n" + ] + }, + { + "cell_type": "markdown", + "id": "a38887e9", + "metadata": {}, + "source": [ + "### Key CLI Arguments Summary\n", + "\n", + "Here are the most commonly used arguments:\n", + "\n", + "| Argument | Default | Description |\n", + "|----------|---------|-------------|\n", + "| `--method` | `ipw` | Adjustment 
method: `ipw`, `cbps`, or `rake` |\n", + "| `--max_de` | `1.5` | Maximum design effect. Set to `None` to use `lambda_1se` instead |\n", + "| `--lambda_min` | `1e-05` | Lower bound for L1 penalty (IPW only) |\n", + "| `--lambda_max` | `10` | Upper bound for L1 penalty (IPW only) |\n", + "| `--num_lambdas` | `250` | Number of lambda values to search (IPW only) |\n", + "| `--weight_trimming_mean_ratio` | `20.0` | Trim weights above `mean(weights) * ratio` |\n", + "| `--transformations` | `default` | Covariate transformations. Use `None` to disable |\n", + "| `--formula` | `None` | Custom model formula (e.g., `\"gender + income\"`) |\n", + "| `--one_hot_encoding` | `True` | One-hot encode categorical features |\n", + "| `--batch_columns` | `None` | Columns to group by for batch processing |\n", + "| `--keep_columns` | `None` | Subset of columns to include in output |\n", + "| `--outcome_columns` | `None` | Columns treated as outcomes (not covariates) |\n", + "| `--ipw_logistic_regression_kwargs` | `None` | JSON string of kwargs for sklearn LogisticRegression |\n", + "| `--succeed_on_weighting_failure` | `False` | Return null weights instead of failing on errors |\n" + ] + }, + { + "cell_type": "markdown", + "id": "2611ee70", + "metadata": {}, + "source": [ + "### Example: Tuning IPW parameters\n", + "\n", + "Below we run the CLI with custom regularization settings and a custom logistic regression solver:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c02a594", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:25:19.043890Z", + "iopub.status.busy": "2026-01-19T05:25:19.043509Z", + "iopub.status.idle": "2026-01-19T05:25:35.671005Z", + "shell.execute_reply": "2026-01-19T05:25:35.669802Z" + } + }, + "outputs": [], + "source": [ + "with tempfile.TemporaryDirectory() as tmpdir:\n", + " input_path = os.path.join(tmpdir, \"input.csv\")\n", + " output_path = os.path.join(tmpdir, \"weights_tuned.csv\")\n", + " diagnostics_path 
= os.path.join(tmpdir, \"diagnostics_tuned.csv\")\n", + "\n", + " load_data_input_df.to_csv(input_path, index=False)\n", + "\n", + " cmd = [\n", + " \"python\",\n", + " \"-m\",\n", + " \"balance.cli\",\n", + " \"--input_file\", input_path,\n", + " \"--output_file\", output_path,\n", + " \"--diagnostics_output_file\", diagnostics_path,\n", + " \"--covariate_columns\", \"gender,age_group,income\",\n", + " \"--method\", \"ipw\",\n", + " # Tuning parameters\n", + " \"--max_de\", \"2.0\",\n", + " \"--lambda_min\", \"1e-06\",\n", + " \"--lambda_max\", \"100\",\n", + " \"--num_lambdas\", \"500\",\n", + " \"--weight_trimming_mean_ratio\", \"10.0\",\n", + " # Custom logistic regression settings\n", + " \"--ipw_logistic_regression_kwargs\", '{\"solver\": \"liblinear\", \"max_iter\": 500}',\n", + " ]\n", + "\n", + " print(\"CLI command:\")\n", + " print(\" \".join(cmd))\n", + " subprocess.check_call(cmd)\n", + "\n", + " tuned_adjusted_df = pd.read_csv(output_path)\n", + "\n", + "tuned_adjusted_df.head()\n" + ] + }, + { + "cell_type": "markdown", + "id": "6321dab0", + "metadata": {}, + "source": [ + "### Example: Using a Custom Formula\n", + "\n", + "The `--formula` argument allows you to specify a custom model formula, including interaction\n", + "terms. 
When using `--formula`, you should typically also set `--transformations=None` to\n", + "prevent automatic transformations from interfering with your custom formula.\n", + "\n", + "The formula uses patsy/R-style syntax:\n", + "- `gender + income`: additive terms (no interaction)\n", + "- `gender * income`: equivalent to `gender + income + gender:income` (main effects + interaction)\n", + "- `gender:income`: only the interaction term\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f0970bf", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:25:35.674431Z", + "iopub.status.busy": "2026-01-19T05:25:35.674140Z", + "iopub.status.idle": "2026-01-19T05:25:57.984301Z", + "shell.execute_reply": "2026-01-19T05:25:57.981965Z" + } + }, + "outputs": [], + "source": [ + "with tempfile.TemporaryDirectory() as tmpdir:\n", + " input_path = os.path.join(tmpdir, \"input.csv\")\n", + " output_path = os.path.join(tmpdir, \"weights_formula.csv\")\n", + " diagnostics_path = os.path.join(tmpdir, \"diagnostics_formula.csv\")\n", + "\n", + " # Use the demo data for the formula example\n", + " load_data_input_df.to_csv(input_path, index=False)\n", + "\n", + " cmd = [\n", + " \"python\",\n", + " \"-m\",\n", + " \"balance.cli\",\n", + " \"--input_file\", input_path,\n", + " \"--output_file\", output_path,\n", + " \"--diagnostics_output_file\", diagnostics_path,\n", + " \"--covariate_columns\", \"gender,age_group,income\",\n", + " \"--method\", \"ipw\",\n", + " # Disable transformations to use raw covariates in formula\n", + " \"--transformations\", \"None\",\n", + " # Use a formula with interaction term\n", + " \"--formula\", \"gender*income\",\n", + " ]\n", + "\n", + " print(\"CLI command with custom formula:\")\n", + " print(\" \".join(cmd))\n", + " subprocess.check_call(cmd)\n", + "\n", + " formula_diagnostics_df = pd.read_csv(diagnostics_path)\n", + "\n", + "# Check model coefficients to verify formula was applied\n", + "print(\"\\nModel 
coefficients (showing interaction term):\")\n", + "print(formula_diagnostics_df.query(\"metric == 'model_coef'\")[[\"var\", \"val\"]])" + ] + }, + { + "cell_type": "markdown", + "id": "1154e51f", + "metadata": {}, + "source": [ + "## Batch Processing Example\n", + "\n", + "The `--batch_columns` argument allows you to run separate adjustments for each unique\n", + "combination of values in the specified columns. This is useful when you want to compute\n", + "weights independently for different subgroups (e.g., by gender or region)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9fb85b0f", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:25:57.987212Z", + "iopub.status.busy": "2026-01-19T05:25:57.986950Z", + "iopub.status.idle": "2026-01-19T05:25:58.046867Z", + "shell.execute_reply": "2026-01-19T05:25:58.043535Z" + } + }, + "outputs": [], + "source": [ + "# Create a dataset with a batch column for gender\n", + "batch_input_df = load_data_input_df.copy()\n", + "\n", + "# The 'gender' column has values like 'Female', 'Male', and possibly NA\n", + "# Filter to only rows with non-null gender for this example\n", + "batch_input_df = batch_input_df[batch_input_df[\"gender\"].notna()].copy()\n", + "print(f\"Rows after filtering: {len(batch_input_df)}\")\n", + "print(f\"Gender distribution:\\n{batch_input_df['gender'].value_counts()}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6fccd981", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:25:58.050383Z", + "iopub.status.busy": "2026-01-19T05:25:58.050024Z", + "iopub.status.idle": "2026-01-19T05:26:21.723905Z", + "shell.execute_reply": "2026-01-19T05:26:21.721310Z" + } + }, + "outputs": [], + "source": [ + "with tempfile.TemporaryDirectory() as tmpdir:\n", + " input_path = os.path.join(tmpdir, \"input_batch.csv\")\n", + " output_path = os.path.join(tmpdir, \"weights_batch.csv\")\n", + " diagnostics_path = 
os.path.join(tmpdir, \"diagnostics_batch.csv\")\n", + "\n", + " batch_input_df.to_csv(input_path, index=False)\n", + "\n", + " cmd = [\n", + " \"python\",\n", + " \"-m\",\n", + " \"balance.cli\",\n", + " \"--input_file\", input_path,\n", + " \"--output_file\", output_path,\n", + " \"--diagnostics_output_file\", diagnostics_path,\n", + " \"--covariate_columns\", \"age_group,income\", # Note: gender is now used as batch column\n", + " \"--outcome_columns\", \"happiness\",\n", + " \"--batch_columns\", \"gender\", # Process each gender separately\n", + " \"--method\", \"ipw\",\n", + " ]\n", + "\n", + " print(\"CLI command with batch processing:\")\n", + " print(\" \".join(cmd))\n", + " subprocess.check_call(cmd)\n", + "\n", + " batch_adjusted_df = pd.read_csv(output_path)\n", + " batch_diagnostics_df = pd.read_csv(diagnostics_path)\n", + "\n", + "print(f\"\\nOutput rows: {len(batch_adjusted_df)}\")\n", + "batch_adjusted_df.head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f73b66bc", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:26:21.726853Z", + "iopub.status.busy": "2026-01-19T05:26:21.726550Z", + "iopub.status.idle": "2026-01-19T05:26:21.754957Z", + "shell.execute_reply": "2026-01-19T05:26:21.752658Z" + } + }, + "outputs": [], + "source": [ + "# Inspect weights by gender - each group was adjusted independently\n", + "print(\"Weight statistics by gender (sample only):\")\n", + "sample_only = batch_adjusted_df[batch_adjusted_df[\"is_respondent\"] == 1]\n", + "print(sample_only.groupby(\"gender\")[\"weight\"].describe().round(3))\n" + ] + }, + { + "cell_type": "markdown", + "id": "09002e25", + "metadata": {}, + "source": [ + "## Alternative Weighting Methods\n", + "\n", + "The CLI supports three adjustment methods:\n", + "- **IPW (Inverse Probability Weighting)**: The default method; it uses logistic regression to estimate propensity scores\n", + "- **CBPS (Covariate Balancing Propensity Score)**: Balances
covariates while estimating propensity scores\n", + "- **Rake (Raking/Iterative Proportional Fitting)**: Adjusts weights iteratively to match marginal distributions" + ] + }, + { + "cell_type": "markdown", + "id": "c4b8b807", + "metadata": {}, + "source": [ + "### Example: CBPS Method\n", + "\n", + "CBPS simultaneously optimizes covariate balance and propensity score estimation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d06ac9b1", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:26:21.758102Z", + "iopub.status.busy": "2026-01-19T05:26:21.757685Z", + "iopub.status.idle": "2026-01-19T05:26:37.588944Z", + "shell.execute_reply": "2026-01-19T05:26:37.587223Z" + } + }, + "outputs": [], + "source": [ + "with tempfile.TemporaryDirectory() as tmpdir:\n", + " input_path = os.path.join(tmpdir, \"input.csv\")\n", + " output_path = os.path.join(tmpdir, \"weights_cbps.csv\")\n", + " diagnostics_path = os.path.join(tmpdir, \"diagnostics_cbps.csv\")\n", + "\n", + " load_data_input_df.to_csv(input_path, index=False)\n", + "\n", + " cmd = [\n", + " \"python\",\n", + " \"-m\",\n", + " \"balance.cli\",\n", + " \"--input_file\", input_path,\n", + " \"--output_file\", output_path,\n", + " \"--diagnostics_output_file\", diagnostics_path,\n", + " \"--covariate_columns\", \"gender,age_group,income\",\n", + " \"--method\", \"cbps\",\n", + " ]\n", + "\n", + " print(\"CLI command with CBPS method:\")\n", + " print(\" \".join(cmd))\n", + " subprocess.check_call(cmd)\n", + "\n", + " cbps_diagnostics_df = pd.read_csv(diagnostics_path)\n", + "\n", + "# Verify the method used\n", + "print(\"\\nAdjustment method used:\")\n", + "print(cbps_diagnostics_df.query(\"metric == 'adjustment_method'\")[[\"var\", \"val\"]])" + ] + }, + { + "cell_type": "markdown", + "id": "8ca0f5ec", + "metadata": {}, + "source": [ + "### Example: Rake Method\n", + "\n", + "Raking iteratively adjusts weights to match target marginal distributions:" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "id": "166ba565", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:26:37.595334Z", + "iopub.status.busy": "2026-01-19T05:26:37.594882Z", + "iopub.status.idle": "2026-01-19T05:26:44.603149Z", + "shell.execute_reply": "2026-01-19T05:26:44.601464Z" + } + }, + "outputs": [], + "source": [ + "with tempfile.TemporaryDirectory() as tmpdir:\n", + " input_path = os.path.join(tmpdir, \"input.csv\")\n", + " output_path = os.path.join(tmpdir, \"weights_rake.csv\")\n", + " diagnostics_path = os.path.join(tmpdir, \"diagnostics_rake.csv\")\n", + "\n", + " load_data_input_df.to_csv(input_path, index=False)\n", + "\n", + " cmd = [\n", + " \"python\",\n", + " \"-m\",\n", + " \"balance.cli\",\n", + " \"--input_file\", input_path,\n", + " \"--output_file\", output_path,\n", + " \"--diagnostics_output_file\", diagnostics_path,\n", + " \"--covariate_columns\", \"gender,age_group,income\",\n", + " \"--method\", \"rake\",\n", + " ]\n", + "\n", + " print(\"CLI command with rake method:\")\n", + " print(\" \".join(cmd))\n", + " subprocess.check_call(cmd)\n", + "\n", + " rake_diagnostics_df = pd.read_csv(diagnostics_path)\n", + "\n", + "# Verify the method used\n", + "print(\"\\nAdjustment method used:\")\n", + "print(rake_diagnostics_df.query(\"metric == 'adjustment_method'\")[[\"var\", \"val\"]])" + ] + }, + { + "cell_type": "markdown", + "id": "3952ec2a", + "metadata": {}, + "source": [ + "## Next steps\n", + "\n", + "- Try `--method cbps` or `--method rake` for alternative weighting approaches.\n", + "- Use `--outcome_columns` to control which columns are treated as outcomes.\n", + "- Supply `--ipw_logistic_regression_kwargs` to tune the IPW model.\n", + "- Use `--succeed_on_weighting_failure` for pipelines where you want null weights instead of errors.\n", + "- Explore `--covariate_columns_for_diagnostics` and `--rows_to_keep_for_diagnostics` to customize diagnostic output.\n" + ] + }, + { + "cell_type": 
"markdown", + "id": "12dfbd73", + "metadata": {}, + "source": [ + "## Session info\n", + "\n", + "For reproducibility, here is the session information:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd0fa8eb", + "metadata": { + "execution": { + "iopub.execute_input": "2026-01-19T05:26:44.606186Z", + "iopub.status.busy": "2026-01-19T05:26:44.605863Z", + "iopub.status.idle": "2026-01-19T05:26:58.348131Z", + "shell.execute_reply": "2026-01-19T05:26:58.346914Z" + } + }, + "outputs": [], + "source": [ + "import session_info\n", + "session_info.show(html=False, dependencies=True)\n" + ] + } + ], + "metadata": { + "captumWidgetMessage": {}, + "dataExplorerConfig": {}, + "fileHeader": "", + "fileUid": "a25fa00c-a941-4613-983d-cbcdd38ce995", + "isAdHoc": false, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" }, - "nbformat": 4, - "nbformat_minor": 2 + "last_kernel_id": "c3b1455f-e38f-456b-93d0-30bb986a13c1", + "last_msg_id": "9c18e41f-6fecac803e00a64bdc0e960e_175", + "last_server_session_id": "e9565ab9-c26f-47a6-84c2-db16db62b238", + "outputWidgetContext": {} + }, + "nbformat": 4, + "nbformat_minor": 5 }