[FEATURE] Change Sample and BalanceDF roles #182

@talgalili

The current design saves data in Sample, e.g.:
https://github.com/facebookresearch/balance/blob/main/balance/sample_class.py#L188
And then set_target links to another sample:
https://github.com/facebookresearch/balance/blob/main/balance/sample_class.py#L554
Then BalanceDF extracts information from Sample and operates on it (plot, summary, etc.).

This is an odd design.

A better design should have something like:
BalanceDF (what Sample is currently), including X, W (many weights), Y (outcomes), and Y_hat (estimated Y). Each is a df sharing the same index, but we want to know which is which for the different diagnostics and model fitting.
For each X, we also want to keep which transformation should be applied to it (bucketing, NA filling, formula, etc.).
The transformation output might be memoized.
We WANT the original X to be kept as well, so that we can see the impact of the weights on the original scale, not just the transformed one.
For each Y_hat, we want to know how we estimated it.
For each W, we want to know how we estimated it.
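
To make the list above concrete, here is a minimal sketch of such a container. The class name `RoleFrames`, its field names, and the `provenance` dictionary are hypothetical and are not part of balance; the sketch only assumes pandas and illustrates one way to keep the original X, a per-column transformation, and a memoized transformed copy side by side.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

import pandas as pd


@dataclass
class RoleFrames:
    """Hypothetical container: one DataFrame per role, all sharing one index."""

    X: pd.DataFrame  # original covariates, never modified
    W: pd.DataFrame  # one column per set of weights
    Y: Optional[pd.DataFrame] = None  # outcomes
    Y_hat: Optional[pd.DataFrame] = None  # estimated outcomes
    # Transformation to apply to each covariate, keyed by column name.
    transformations: Dict[str, Callable[[pd.Series], pd.Series]] = field(
        default_factory=dict
    )
    # Free-text note on how each W / Y_hat column was estimated.
    provenance: Dict[str, str] = field(default_factory=dict)
    # Cache for the transformed covariates (filled on first request).
    _X_transformed: Optional[pd.DataFrame] = field(
        default=None, init=False, repr=False
    )

    def __post_init__(self) -> None:
        # Every role frame must be aligned to the index of X.
        for name in ("W", "Y", "Y_hat"):
            frame = getattr(self, name)
            if frame is not None and not frame.index.equals(self.X.index):
                raise ValueError(f"{name} must share the index of X")

    def X_transformed(self) -> pd.DataFrame:
        """Return the transformed covariates, computing them only once."""
        if self._X_transformed is None:
            out = self.X.copy()  # the original X stays untouched
            for col, fn in self.transformations.items():
                out[col] = fn(out[col])
            self._X_transformed = out
        return self._X_transformed


X = pd.DataFrame({"age": [25.0, None, 61.0]}, index=["a", "b", "c"])
W = pd.DataFrame({"design": [1.0, 1.0, 1.0]}, index=X.index)
rf = RoleFrames(
    X=X,
    W=W,
    transformations={"age": lambda s: s.fillna(s.mean())},  # NA filling
    provenance={"design": "design weights supplied by the user"},
)
transformed = rf.X_transformed()
```

Because the original X is kept alongside the memoized result, diagnostics can be computed on either scale without re-running the transformations.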

Use BalanceDF for both the sample and the target population. Combine them with a 'Sample' object.
PROBLEM: the current implementation stores both sample and target as 'Sample', so you have one 'Sample' holding another one as its 'Target' (which is a bad design). Instead, we should make sure that when we combine sample and target, both are saved as BalanceDF.
We want Sample to apply the same transformations to both sample and target.
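
As an illustration of the last point, the sketch below applies one shared set of per-column transformations to both frames. The helper `apply_shared_transformations`, the bin edges, and the bucket labels are made up for this example and are not part of balance. The point it tries to show is that the bucket definitions are fixed once, so sample and target end up with identical categories.

```python
from typing import Callable, Dict, Tuple

import pandas as pd

Transformations = Dict[str, Callable[[pd.Series], pd.Series]]


def apply_shared_transformations(
    sample_X: pd.DataFrame,
    target_X: pd.DataFrame,
    transformations: Transformations,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Apply the same per-column functions to both frames (hypothetical helper)."""

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()  # leave the input frame unchanged
        for col, fn in transformations.items():
            out[col] = fn(out[col])
        return out

    return transform(sample_X), transform(target_X)


sample_X = pd.DataFrame({"age": [22, 37, 58]})
target_X = pd.DataFrame({"age": [19, 45, 71, 33]})

# The bin edges are decided up front, not per frame, so the buckets match.
age_edges = [0, 30, 50, 120]
age_labels = ["young", "mid", "old"]
transformations = {
    "age": lambda s: pd.cut(s, bins=age_edges, labels=age_labels),
}

sample_T, target_T = apply_shared_transformations(
    sample_X, target_X, transformations
)
```

If each frame derived its own buckets (for example, quantiles computed separately), the resulting categories would not be comparable between sample and target.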

Going with something like this will give us a more consistent architecture, which will let the package develop in more directions and use cases.

This is closely related to:
#51

The names and structure are all a bit mixed.

Alternative object structure

Potential alternative names and hierarchy (all tentative):

  1. SampleDF (maybe SampleDFs, or SampleFrame) = a class with 5 DataFrames: X, W (including the baseline/unadjusted/design weights and the adjusted ones, with at least one adjusted option), Y, Y_hat, and Misc (columns that are kept but not really needed for the core analysis). The name 'sample' is good, since we have responders and target, and it's fine for each of them to be a subset of another population. This could also be a single DataFrame, with attributes recording which column is what.
  2. BalancedDF (or BalancedDFs, or BalancedFrame, or BalancedSample) = a class that holds two SampleDF objects: one for the responders (which will have Y; it could also be called something like BiasedDF) and one for the target. It implements core capabilities such as getting a single df of everything (responders and target with all columns together, with both the original index and an overall index), as well as download, plot, summary, etc. From it we can get a set of classes, BalancedDFCovars, BalancedDFWeights, and BalancedDFOutcomes, each with its own set of methods (plot, df, etc.).

Then it would be something like:

from balance import SampleFrame, BalancedFrame

sf_responders = SampleFrame.from_frame(df1, covars_columns, weights_columns, outcome_columns)
sf_target = SampleFrame.from_frame(df2, covars_columns)
bf = BalancedFrame.set_sample(sf_responders).set_target(sf_target)

# Then some form of:
bf.adjust()
bf.summary()  # etc.
bf.covars().plot()  # etc.
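
For illustration only, here is a rough sketch of how the two proposed classes could fit together so that the `from_frame` / `set_sample` / `set_target` chain above works. None of this exists in balance: the internals are assumptions, `SampleFrame` is modelled as a single DataFrame plus column-role lists (one of the options mentioned above), `df()` is a guess at the "single df of everything" capability, and `adjust`, `summary`, and `plot` are left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional, Sequence

import pandas as pd


@dataclass
class SampleFrame:
    """Hypothetical: one DataFrame plus a record of which column plays which role."""

    df: pd.DataFrame
    covars_columns: List[str]
    weights_columns: List[str] = field(default_factory=list)
    outcome_columns: List[str] = field(default_factory=list)

    @classmethod
    def from_frame(
        cls,
        df: pd.DataFrame,
        covars_columns: Sequence[str],
        weights_columns: Optional[Sequence[str]] = None,
        outcome_columns: Optional[Sequence[str]] = None,
    ) -> SampleFrame:
        return cls(
            df,
            list(covars_columns),
            list(weights_columns or []),
            list(outcome_columns or []),
        )


class BalancedFrame:
    """Hypothetical: pairs a responders SampleFrame with a target SampleFrame."""

    def __init__(
        self, responders: SampleFrame, target: Optional[SampleFrame] = None
    ) -> None:
        self.responders = responders
        self.target = target

    @classmethod
    def set_sample(cls, responders: SampleFrame) -> BalancedFrame:
        # Called on the class, as in the snippet above; returns a new instance.
        return cls(responders)

    def set_target(self, target: SampleFrame) -> BalancedFrame:
        self.target = target
        return self  # returning self allows method chaining

    def df(self) -> pd.DataFrame:
        """Stack responders and target, keeping the original and an overall index."""
        if self.target is None:
            raise ValueError("set_target() must be called first")
        parts = []
        for source, sf in (("responders", self.responders), ("target", self.target)):
            part = sf.df.rename_axis("original_index").reset_index()
            part.insert(0, "source", source)
            parts.append(part)
        # ignore_index=True produces the overall 0..n-1 index.
        return pd.concat(parts, ignore_index=True)


df1 = pd.DataFrame(
    {"age": [25, 40], "w": [1.0, 2.0], "y": [0, 1]}, index=["r1", "r2"]
)
df2 = pd.DataFrame({"age": [30, 50, 70]}, index=["t1", "t2", "t3"])

sf_responders = SampleFrame.from_frame(df1, ["age"], ["w"], ["y"])
sf_target = SampleFrame.from_frame(df2, ["age"])
bf = BalancedFrame.set_sample(sf_responders).set_target(sf_target)
combined = bf.df()
```

In this sketch, columns that exist only for the responders (weights, outcomes) come out as missing values in the target rows of the combined frame.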

The challenge with the naming is that 'sample' implies taking a sample from 'something', and that could just as well be a sample from the target population. So set_sample could equally have been used for the target, which is not ideal. But set_responders is too long. set_target could also be set_population or set_target_population, but these are too long as well.

NOTE: Doing this transition before balance 1.0.0 is critical (since this will be the 'official' design).

Labels

enhancement: New feature or request