Update evaluation_of_results.md with current balance package outputs (#313)

Copilot · meta-codesync[bot] · commit ca6de35a0463 · 2026-02-05T03:33:29.000-08:00
Summary:
## Plan to update evaluation_of_results.md based on quickstart tutorial

- [x] Update the print(adjusted) output to match current format (includes adjustment details section)
- [x] Update the summary() output to match current format (structured sections, includes KLD metrics, outcome weighted means, and confidence intervals)
- [x] Update the covars().mean().T output to match current column values
- [x] Update the covars().asmd().T output to match current values
- [x] Update the outcomes().summary() output to match current format (includes confidence intervals, weights impact, and more detailed response rates)
- [x] Update the design_effect() output value
- [x] Review all changes to ensure minimal modifications
- [x] Address PR review feedback:
  - Fixed typo: `adjust` → `adjusted` in outcomes example
  - Clarified that `.mean()` shows covariate means, not ASMD
  - Replaced tab characters with spaces for consistent formatting
  - Added note about when target outcome columns appear in output
  - Fixed grammar: "didn't get improved" → "didn't improve"

<details>
<summary>Original prompt</summary>

> <issue_title>[FEATURE] Update website/docs/docs/general_framework/evaluation_of_results.md</issue_title>
> <issue_description>Update text in:
> https://github.com/facebookresearch/balance/blob/main/website/docs/docs/general_framework/evaluation_of_results.md
> Based on updated output from here:
> https://import-balance.org/docs/tutorials/quickstart/</issue_description>

</details>

- Fixes #312

Pull Request resolved: #313

Reviewed By: omriharosh

Differential Revision: D92278385

Pulled By: talgalili

fbshipit-source-id: fd396ce4b6291ba4e830f487226a6e60109a3024
diff --git a/website/docs/docs/general_framework/evaluation_of_results.md b/website/docs/docs/general_framework/evaluation_of_results.md
@@ -24,19 +24,28 @@ print(adjusted)
 Output:
 
 ```
-Adjusted balance Sample object with target set using ipw
-1000 observations x 3 variables: gender,age_group,income
-id_column: id, weight_column: weight,
-outcome_columns: happiness
 
-    target:
+        Adjusted balance Sample object with target set using ipw
+        1000 observations x 3 variables: gender,age_group,income
+        id_column: id, weight_column: weight,
+        outcome_columns: happiness
 
-    balance Sample object
-    10000 observations x 3 variables: gender,age_group,income
-    id_column: id, weight_column: weight,
-    outcome_columns: None
+        adjustment details:
+            method: ipw
+            weight trimming mean ratio: 20
+            design effect (Deff): 1.880
+            effective sample size proportion (ESSP): 0.532
+            effective sample size (ESS): 531.9
+
+            target:
+
+            balance Sample object
+            10000 observations x 3 variables: gender,age_group,income
+            id_column: id, weight_column: weight,
+            outcome_columns: happiness
+
+            3 common variables: gender,age_group,income
 
-    3 common variables: income,age_group,gender
 ```
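The three weight figures under `adjustment details` are tied together. A minimal sketch, assuming ESSP is the reciprocal of the design effect and ESS is ESSP times the number of observations (the printed values are consistent with this reading; the variable names are illustrative and not part of the package):

```python
# Values copied from the printed output above.
n_obs = 1000   # observations in the adjusted sample
deff = 1.880   # design effect (Deff), as printed (already rounded)

# Assumption: ESSP = 1 / Deff and ESS = n_obs * ESSP.
essp = 1 / deff
ess = n_obs * essp

print(round(essp, 3))  # 0.532
print(round(ess, 1))   # 531.9
```

Read this way, a design effect of 1.88 means the 1000 weighted respondents carry roughly the information of 532 equally weighted ones.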
 
 
@@ -47,17 +56,34 @@ print(adjusted.summary())
 ```
 
 This will return several results:
-- Covariate mean ASMD improvement: ASMD is "Absolute Standardized Mean Difference". For continuous variables, this measure is the same as taking the absolute value of [Cohen's d statistic](https://en.wikipedia.org/wiki/Effect_size#Cohen's_d) (also related to [SSMD](https://en.wikipedia.org/wiki/Strictly_standardized_mean_difference)), when using the (weighted) standard deviation of the target population. For categorical variables it uses [one-hot encoding](https://en.wikipedia.org/wiki/One-hot).
-- [Design effect](https://en.wikipedia.org/wiki/Design_effect)
-- Covariate mean Adjusted Standardized Mean Deviation (ASMD) versus Unadjusted covariate mean ASMD
-- Model proportion deviance explained (if inverse propensity weighting method was used)
+- Adjustment details: method used and weight trimming parameters
+- Covariate diagnostics: ASMD and KLD (Kullback-Leibler divergence) metrics. ASMD is the "Absolute Standardized Mean Difference". For continuous variables, this measure is the same as taking the absolute value of [Cohen's d statistic](https://en.wikipedia.org/wiki/Effect_size#Cohen's_d) (also related to [SSMD](https://en.wikipedia.org/wiki/Strictly_standardized_mean_difference)), when using the (weighted) standard deviation of the target population. For categorical variables, it uses [one-hot encoding](https://en.wikipedia.org/wiki/One-hot).
+- Weight diagnostics: [Design effect](https://en.wikipedia.org/wiki/Design_effect), effective sample size proportion (ESSP), and effective sample size (ESS)
+- Outcome weighted means: means for each outcome variable across self (adjusted), target, and unadjusted samples
+- Model performance: Model proportion deviance explained (if inverse propensity weighting method was used)
 
 Output:
 
 ```
-Covar ASMD reduction: 62.3%, design effect: 2.249
-Covar ASMD (7 variables): 0.335 -> 0.126
-Model performance: Model proportion deviance explained: 0.174
+Adjustment details:
+    method: ipw
+    weight trimming mean ratio: 20
+Covariate diagnostics:
+    Covar ASMD reduction: 63.4%
+    Covar ASMD (7 variables): 0.327 -> 0.120
+    Covar mean KLD reduction: 95.3%
+    Covar mean KLD (3 variables): 0.071 -> 0.003
+Weight diagnostics:
+    design effect (Deff): 1.880
+    effective sample size proportion (ESSP): 0.532
+    effective sample size (ESS): 531.9
+Outcome weighted means:
+            happiness
+source
+self           53.295
+target         56.278
+unadjusted     48.559
+Model performance: Model proportion deviance explained: 0.173
 ```
 
 Note that although we had 3 variables in our original data (age_group, gender, income), the asmd counts each level of the categorical variables as separate variable, and thus it considered 7 variables for the covar ASMD improvement.
@@ -74,18 +100,18 @@ adjusted.covars().mean().T
 To get:
 
 ```
-source                     self     target  unadjusted
-_is_na_gender[T.True]  0.103449   0.089800     0.08800
-age_group[T.25-34]     0.279072   0.297400     0.30900
-age_group[T.35-44]     0.290137   0.299200     0.17200
-age_group[T.45+]       0.150714   0.206300     0.04600
-gender[Female]         0.410664   0.455100     0.26800
-gender[Male]           0.485887   0.455100     0.64400
-gender[_NA]            0.103449   0.089800     0.08800
-income                 9.519935  12.737608     5.99102
+source                      self     target  unadjusted
+_is_na_gender[T.True]   0.086776   0.089800    0.088000
+age_group[T.25-34]      0.307355   0.297400    0.300000
+age_group[T.35-44]      0.273609   0.299200    0.156000
+age_group[T.45+]        0.137581   0.206300    0.053000
+gender[Female]          0.406337   0.455100    0.268000
+gender[Male]            0.506887   0.455100    0.644000
+gender[_NA]             0.086776   0.089800    0.088000
+income                 10.060068  12.737608    6.297302
 ```
 
-The `self` is the adjusted ASMD, while `unadjusted` is the unadjusted ASMD.
+Here, `self` is the adjusted (weighted) covariate mean, `target` is the target mean, and `unadjusted` is the unadjusted sample mean.
 
 
 And `.asmd()` to get ASMD:
@@ -98,18 +124,18 @@ To get:
 
 ```
 source                  self  unadjusted  unadjusted - self
-age_group[T.25-34]  0.040094    0.025375          -0.014719
-age_group[T.35-44]  0.019792    0.277771           0.257980
-age_group[T.45+]    0.137361    0.396127           0.258765
-gender[Female]      0.089228    0.375699           0.286472
-gender[Male]        0.061820    0.379314           0.317494
-gender[_NA]         0.047739    0.006296          -0.041444
-income              0.246918    0.517721           0.270802
-mean(asmd)          0.126310    0.334860           0.208551
+age_group[T.25-34]  0.021777    0.005688          -0.016090
+age_group[T.35-44]  0.055884    0.312711           0.256827
+age_group[T.45+]    0.169816    0.378828           0.209013
+gender[Female]      0.097916    0.375699           0.277783
+gender[Male]        0.103989    0.379314           0.275324
+gender[_NA]         0.010578    0.006296          -0.004282
+income              0.205469    0.494217           0.288748
+mean(asmd)          0.119597    0.326799           0.207202
 ```
 
 We can see that on average the ASMD improved from 0.33 to 0.12 thanks to the weights. We got improvements in income, gender, and age_group.
-Although we can see that `age_group[T.25-34]` didn't get improved.
+However, `age_group[T.25-34]` and `gender[_NA]` didn't improve: their ASMD was already small before weighting and rose slightly.
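Note that the `mean(asmd)` row is not the plain average of the seven rows above it. The printed values are consistent with first averaging the levels within each categorical variable and then averaging across the three variables. The sketch below checks that reading against the table; it is an inference from the printed numbers, not a statement about the package internals, and the helper name is hypothetical:

```python
# ASMD values copied from the `self` and `unadjusted` columns above.
asmd = {
    "self": {
        "age_group": [0.021777, 0.055884, 0.169816],
        "gender": [0.097916, 0.103989, 0.010578],
        "income": [0.205469],
    },
    "unadjusted": {
        "age_group": [0.005688, 0.312711, 0.378828],
        "gender": [0.375699, 0.379314, 0.006296],
        "income": [0.494217],
    },
}

def mean_asmd(per_variable):
    # Average the levels within each variable, then average across variables.
    variable_means = [sum(v) / len(v) for v in per_variable.values()]
    return sum(variable_means) / len(variable_means)

adjusted_mean = mean_asmd(asmd["self"])
unadjusted_mean = mean_asmd(asmd["unadjusted"])
reduction = 1 - adjusted_mean / unadjusted_mean

print(round(adjusted_mean, 3), round(unadjusted_mean, 3))
print(f"{reduction:.1%}")  # 63.4%
```

The same two means also reproduce the "Covar ASMD reduction" percentage reported by `summary()`.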
 
 
 ## Understanding the model
@@ -166,7 +192,7 @@ Or calculate the design effect using:
 
 ```python
 adjusted.weights().design_effect()
-# 2.24937
+# 1.88
 ```
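For intuition about what this number measures, here is a minimal sketch of Kish's design effect computed directly from a weight vector. It assumes `design_effect()` follows Kish's formula, n * sum(w^2) / (sum(w))^2. The function name and the weights below are made up for illustration and are not the tutorial's weights:

```python
import numpy as np

def kish_design_effect(weights):
    # Kish's formula: n * sum(w^2) / (sum(w))^2,
    # equivalently mean(w^2) / mean(w)^2.
    w = np.asarray(weights, dtype=float)
    return len(w) * np.sum(w ** 2) / np.sum(w) ** 2

equal_weights = np.ones(4)
uneven_weights = np.array([1.0, 1.0, 1.0, 5.0])  # hypothetical weights

print(kish_design_effect(equal_weights))   # 1.0
print(kish_design_effect(uneven_weights))  # 1.75
```

Equal weights give a design effect of 1, and the value grows as the weights become more uneven.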
 
 ## Analyzing the outcome
@@ -179,21 +205,37 @@ print(adjusted.outcomes().summary())
 
 To get:
 ```
-
 1 outcomes: ['happiness']
-Mean outcomes:
-            happiness
-source
-self        54.221388
-unadjusted  48.392784
+Mean outcomes (with 95% confidence intervals):
+source       self  target  unadjusted           self_ci         target_ci     unadjusted_ci
+happiness  53.295  56.278      48.559  (52.096, 54.495)  (55.961, 56.595)  (47.669, 49.449)
+```
+
+Note: The `target` column and target-based response rates appear only when the target `Sample` has outcome data. If your target has no outcomes, you will only see `self` and `unadjusted` columns.
+
+```
+
+Weights impact on outcomes (t_test):
+           mean_yw0  mean_yw1  mean_diff  diff_ci_lower  diff_ci_upper  t_stat  p_value       n
+outcome
+happiness    48.559    53.295      4.736          1.312          8.161   2.714    0.007  1000.0
 
 Response rates (relative to number of respondents in sample):
    happiness
 n     1000.0
 %      100.0
+Response rates (relative to notnull rows in the target):
+    happiness
+n     1000.0
+%       10.0
+Response rates (in the target):
+    happiness
+n    10000.0
+%      100.0
+
 ```
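The columns of the `t_test` row can be cross-checked against each other. The sketch below assumes that `mean_diff` is `mean_yw1 - mean_yw0`, that `t_stat` is `mean_diff` divided by its standard error, and that the confidence interval is symmetric around `mean_diff`. These are inferences from the printed values, not documented behavior:

```python
# Values copied from the `happiness` row of the t_test table above.
mean_yw0, mean_yw1 = 48.559, 53.295   # unweighted and weighted means
ci_lower, ci_upper = 1.312, 8.161
t_stat = 2.714

mean_diff = mean_yw1 - mean_yw0
std_err = mean_diff / t_stat                  # implied standard error
half_width = (ci_upper - ci_lower) / 2        # half-width of the 95% interval
implied_critical_value = half_width / std_err

print(round(mean_diff, 3))               # 4.736
print(round(implied_critical_value, 2))  # 1.96
```

The implied critical value of about 1.96 matches what a 95% interval would use at this sample size.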
 
-For example, we see that the estimated mean happiness according to our sample is 48 without any adjustment and 54 with adjustment.  The following shows the distribution of happiness before and after applying the weights:
+For example, we see that the estimated mean happiness according to our sample is 48.6 without any adjustment and 53.3 with adjustment (compared to the target mean of 56.3). The following shows the distribution of happiness before and after applying the weights:
 
 ```python
 adjusted.outcomes().plot()