Commit 9810a9f

Copilot authored and meta-codesync[bot] committed
Clean up quickstart tutorial: simplify code examples and remove unsupported API usage (#325)
Summary:
The quickstart tutorial contained outdated verbose patterns and incorrect API usage. This updates the tutorial to use current best practices with only supported API calls.

## Changes

- **Diagnostics section**: Replace 20+ lines of manual `model_matrix()` construction with simple `sample_with_target.covars().kld().T` (removed unsupported `use_model_matrix` parameter that was initially included but is not actually supported by the `covars().kld()` API)
- **Interaction terms section**: Removed entirely - the verbose manual string concatenation example has been deleted to simplify the tutorial
- **Evaluation section**: Removed incorrect examples using unsupported `use_model_matrix=True` parameter and removed duplicate `kld().T` calls. Now shows only the single correct `adjusted.covars().kld().T` call.
- **Structure**: Add "Visualizing the unadjusted comparison" sub-header before the first `plot()` call

Before:
```python
model_matrix_output = model_matrix(
    sample_covars.df,
    target_covars.df,
    sample_covars.df.columns.tolist(),
    return_type="two",
    return_var_type="dataframe",
)
sample_mm = model_matrix_output["sample"]
target_mm = model_matrix_output["target"]
kld_mm = weighted_comparisons_stats.kld(...)
```

After:
```python
print(sample_with_target.covars().kld().T)
```

## Note

The `use_model_matrix` parameter exists in lower-level Balance functions but is not exposed through the `covars()` interface, so all examples now use only the supported default API.
<details>
<summary>Original prompt</summary>

> ----
>
> *This section details on the original issue you should resolve*
>
> <issue_title>[BUG] clean up quickstart</issue_title>
> <issue_description>From
> https://import-balance.org/docs/tutorials/quickstart/
>
> The section
> 'Diagnostics with and without a model matrix'
> Includes a lot of needless details:
> ```
>
> # Model-matrix diagnostics (explicit, aligned across sample/target)
> model_matrix_output = model_matrix(
>     sample_covars.df,
>     target_covars.df,
>     sample_covars.df.columns.tolist(),
>     return_type="two",
>     return_var_type="dataframe",
> )
> sample_mm = model_matrix_output["sample"]
> target_mm = model_matrix_output["target"]
> kld_mm = weighted_comparisons_stats.kld(
>     sample_mm,
>     target_mm,
>     sample_weights,
>     target_weights,
> )
> print(kld_mm)
> ```
>
> It should only include an option of
>
> print(sample_with_target.covars().kld(use_model_matrix=True).T)
>
> Also, the section:
> 'Example: interaction term without using a model matrix'
> uses the verbose:
> ```
> sample_df = sample_covars.df.copy()
> target_df = target_covars.df.copy()
>
> sample_df['age_gender'] = sample_df['age_group'].astype(str) + ':' + sample_df['gender'].astype(str)
> target_df['age_gender'] = target_df['age_group'].astype(str) + ':' + target_df['gender'].astype(str)
>
> interaction_kld = weighted_comparisons_stats.kld(
>     sample_df[['age_gender']],
>     target_df[['age_gender']],
>     sample_weights,
>     target_weights,
> )
> print(interaction_kld)
> ```
>
> It should just use the formula argument when defining the sample_covars
>
> Also, the section:
> 'Evaluation of the Results'
> This option:
> `print(adjusted.covars().kld(aggregate_by_main_covar=True).T)`
> Should be replaced by this:
> `print(adjusted.covars().kld(use_model_matrix=True, aggregate_by_main_covar=True).T)`
> And it should be indicated that just using:
> `print(adjusted.covars().kld().T)`
> Gives the more correct results.
>
> Lastly,
> Before 'sample_with_target.covars().plot()'
> There should be a sub-header for this section.</issue_description>
>
> ## Comments on the Issue (you are copilot in this section)
>
> <comments>
> </comments>

</details>

- Fixes #324

Pull Request resolved: #325
Differential Revision: D92654423
Pulled By: talgalili
fbshipit-source-id: b154c082ef1ce2ab608d46aee1a8e4d440ad6561
1 parent 6cb8a56 commit 9810a9f
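
Reviewer note: the commit message leans on `covars().kld()` comparing raw (not one-hot encoded) covariates. As a rough illustration of what a weighted KL divergence on a single categorical covariate measures, here is a minimal pure-Python sketch. It is not balance's implementation: the helper names `weighted_proportions` and `kl_divergence`, the toy data, the `eps` smoothing, and the direction of the divergence (sample relative to target) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def weighted_proportions(values, weights):
    """Return {category: weighted share} for one categorical covariate."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {category: total / grand_total for category, total in totals.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats over the union of categories.

    eps guards against log(0) when a category is missing from q
    (an illustrative choice, not necessarily what balance does).
    """
    categories = set(p) | set(q)
    return sum(
        p.get(c, 0.0) * math.log(p.get(c, 0.0) / max(q.get(c, 0.0), eps))
        for c in categories
        if p.get(c, 0.0) > 0.0
    )


# Hypothetical toy data: the sample over-represents "F" relative to the target.
sample_gender = ["F", "F", "F", "M"]
target_gender = ["F", "M", "F", "M"]

unweighted = weighted_proportions(sample_gender, [1, 1, 1, 1])
reweighted = weighted_proportions(sample_gender, [1, 1, 1, 3])  # up-weight "M"
target = weighted_proportions(target_gender, [1, 1, 1, 1])

kld_before = kl_divergence(unweighted, target)
kld_after = kl_divergence(reweighted, target)
print(kld_before, kld_after)
```

In this toy case the divergence falls to zero once the weights bring the sample shares in line with the target, which is the kind of before/after comparison the tutorial's `sample_with_target.covars().kld()` and `adjusted.covars().kld()` calls are meant to support.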

1 file changed: +9 -77 lines changed
tutorials/balance_quickstart.ipynb

Lines changed: 9 additions & 77 deletions
@@ -426,9 +426,9 @@
 "\n",
 "- **EMD (Earth Mover's Distance)** measures the minimum \"cost\" to transform one distribution into another.\n",
 " (See: https://en.wikipedia.org/wiki/Earth_mover%27s_distance)\n",
-"- **CVMD (Cramér–von Mises distance)** measures the integrated squared difference between the empirical CDFs.\n",
+"- **CVMD (Cram\u00e9r\u2013von Mises distance)** measures the integrated squared difference between the empirical CDFs.\n",
 " (See: https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion)\n",
-"- **KS (Kolmogorov–Smirnov distance)** measures the maximum absolute difference between the empirical CDFs.\n",
+"- **KS (Kolmogorov\u2013Smirnov distance)** measures the maximum absolute difference between the empirical CDFs.\n",
 " (See: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)\n",
 "\n",
 "These diagnostics complement **ASMD**, which only compares means. Use EMD/CVMD/KS when you want to check whether weighting aligns the *shape* of covariate distributions (not just their means).\n"
@@ -449,12 +449,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Diagnostics with and without a model matrix\n",
+"## Diagnostics for comparing distributions\n",
 "\n",
-"Distribution diagnostics use raw covariates by default (so categorical variables stay intact).\n",
-"If you want to compare the **model-matrix** representation instead (e.g., to inspect\n",
-"how one-hot encoded columns behave), you can build model matrices explicitly and\n",
-"call the diagnostics on those matrices.\n"
+"Distribution diagnostics operate on the raw covariates (with NA indicators), rather than the model matrix, so categorical variables stay intact.\n",
+"We can use KLD (Kullback-Leibler divergence) to compare distributions:\n"
 ]
 },
 {
@@ -463,66 +461,16 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from balance.stats_and_plots import weighted_comparisons_stats\n",
-"from balance.utils.model_matrix import model_matrix\n",
-"\n",
-"sample_covars = sample_with_target.covars()\n",
-"target_covars = target.covars()\n",
-"sample_weights = sample_with_target.weights().df.iloc[:, 0]\n",
-"target_weights = target.weights().df.iloc[:, 0]\n",
-"\n",
-"# Raw-covariate diagnostics (default)\n",
-"print(sample_covars.kld().T)\n",
-"\n",
-"# Model-matrix diagnostics (explicit, aligned across sample/target)\n",
-"model_matrix_output = model_matrix(\n",
-" sample_covars.df,\n",
-" target_covars.df,\n",
-" sample_covars.df.columns.tolist(),\n",
-" return_type=\"two\",\n",
-" return_var_type=\"dataframe\",\n",
-")\n",
-"sample_mm = model_matrix_output[\"sample\"]\n",
-"target_mm = model_matrix_output[\"target\"]\n",
-"kld_mm = weighted_comparisons_stats.kld(\n",
-" sample_mm,\n",
-" target_mm,\n",
-" sample_weights,\n",
-" target_weights,\n",
-")\n",
-"print(kld_mm)\n"
+"print(sample_with_target.covars().kld().T)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Example: interaction term without using a model matrix\n",
-"\n",
-"If you want an interaction between two categorical variables without building\n",
-"a full model matrix, you can create a combined category manually and use it\n",
-"in diagnostics. For example, combine an age bucket with gender:\n"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"sample_df = sample_covars.df.copy()\n",
-"target_df = target_covars.df.copy()\n",
-"\n",
-"sample_df['age_gender'] = sample_df['age_group'].astype(str) + ':' + sample_df['gender'].astype(str)\n",
-"target_df['age_gender'] = target_df['age_group'].astype(str) + ':' + target_df['gender'].astype(str)\n",
+"## Visualizing the unadjusted comparison\n",
 "\n",
-"interaction_kld = weighted_comparisons_stats.kld(\n",
-" sample_df[['age_gender']],\n",
-" target_df[['age_gender']],\n",
-" sample_weights,\n",
-" target_weights,\n",
-")\n",
-"print(interaction_kld)\n"
+"Before we adjust the sample, let's visualize how the sample compares to the target:\n"
 ]
 },
 {
@@ -690,22 +638,6 @@
 "print(adjusted.covars().kld().T)"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"We can also use KL divergence to summarize how far the sample covariates are from the target distribution across both numeric and categorical variables. The helper below aggregates over one-hot encoded categories and compares the adjusted sample to the original unadjusted sample."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"print(adjusted.covars().kld(aggregate_by_main_covar=True).T)"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1158,4 +1090,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
+}
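
Reviewer note: the first hunk describes EMD, CVMD and KS in terms of empirical CDFs. The definitions can be sketched in pure Python for two unweighted 1-D samples. This is only an illustration of the formulas as worded in the notebook text, not the code balance runs; the function names are hypothetical, weights are ignored, and integrating the squared CDF gap over the pooled observations is one assumed reading of "integrated squared difference".

```python
from bisect import bisect_right


def ecdf(sorted_values, x):
    """Empirical CDF of an already-sorted list, evaluated at x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def distribution_distances(a, b):
    """Return (ks, cvmd, emd) for two unweighted 1-D samples."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    diffs = [ecdf(a, x) - ecdf(b, x) for x in grid]

    # KS: largest absolute gap between the two empirical CDFs.
    ks = max(abs(d) for d in diffs)

    # CVMD: squared CDF gap averaged over the pooled observations
    # (assumed integration measure; other conventions exist).
    pooled = a + b
    cvmd = sum((ecdf(a, x) - ecdf(b, x)) ** 2 for x in pooled) / len(pooled)

    # EMD in one dimension: area between the two CDFs. The gap at grid[i]
    # stays constant until the next grid point, so sum gap * interval width.
    emd = sum(abs(d) * (hi - lo) for d, lo, hi in zip(diffs, grid, grid[1:]))
    return ks, cvmd, emd


# Hypothetical toy data: b is a shifted by exactly one unit.
ks, cvmd, emd = distribution_distances([1, 2, 3, 4], [2, 3, 4, 5])
print(ks, cvmd, emd)
```

Shifting a sample by one unit gives an EMD of one unit, while KS and CVMD stay bounded by 1 because they are built from CDF gaps. A comparison of means alone (as in ASMD) would not distinguish a shift from a change in shape with the same mean.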
