Skip to content

Route unmentioned columns to ignore_columns#333

Closed
talgalili wants to merge 1 commit intofacebookresearch:mainfrom
talgalili:export-D92749122
Closed

Route unmentioned columns to ignore_columns#333
talgalili wants to merge 1 commit intofacebookresearch:mainfrom
talgalili:export-D92749122

Conversation

@talgalili
Copy link
Contributor

Summary:
Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns.

Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns.

Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior.

Differential Revision: D92749122

Copilot AI review requested due to automatic review settings February 9, 2026 21:24
@meta-cla meta-cla bot added the cla signed label Feb 9, 2026
@meta-codesync
Copy link

meta-codesync bot commented Feb 9, 2026

@talgalili has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92749122.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the balance CLI’s default column classification so that when --outcome_columns is not provided, previously-unclassified columns are routed into ignore_columns (instead of being auto-treated as outcomes), avoiding surprising behavior (e.g., --keep_columns silently becoming outcomes). It also updates related CLI docs and records the breaking change in the changelog.

Changes:

  • Update BalanceCLI.process_batch() to stop inferring outcome columns by default and instead populate ignore_columns with unmentioned columns.
  • Update/clarify keep_columns docstrings and CLI --keep_columns help text.
  • Adjust CLI tests to assert the new ignore_columns behavior; add changelog entry for the breaking change.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
balance/cli.py Changes default column routing in process_batch() and updates keep_columns docs / CLI arg help.
tests/test_cli.py Updates tests to validate that unmentioned columns now go to ignore_columns and outcomes default to None.
balance/__init__.py Updates the package __version__ string.
CHANGELOG.md Adds a 0.17.0 (TBD) breaking-change entry describing the new CLI behavior.

Comment on lines 1259 to +1265
"--keep_columns",
type=str,
required=False,
help="Set of columns we include in the output csv file",
help=(
"Comma-separated columns to include in the output csv file. "
"The output will be subsetted to only these columns."
),
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI help text for --outcome_columns earlier in this file still describes the old default inference (“if not supplied… treated as outcomes”), but process_batch() no longer infers outcomes by default. Please update that help text to match the new behavior so users aren’t misled.

Copilot uses AI. Check for mistakes.
Comment on lines 150 to 168
if outcome_columns is None:
raise AssertionError("Expected outcome_columns to be provided.")
cls.calls.append(outcome_columns)
cls.ignore_calls.append(ignore_columns)
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RecordingSample.calls is annotated as list[tuple[str, ...]], but from_frame() now appends outcome_columns which can be None. Under # pyre-strict this will be a type error; update the annotation (and/or the from_frame signature) so the list element type includes None.

Copilot uses AI. Check for mistakes.
balance/cli.py Outdated

Keep columns control which columns appear in the final output CSV.
After adjustment, the output DataFrame is subsetted to contain
**only** these columns (see :meth:`filter_output`).
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring references :meth:filter_output``, but BalanceCLI doesn’t define a `filter_output` method (the behavior appears to live in `adapt_output`). Update the reference so the docs don’t point to a nonexistent method.

Suggested change
**only** these columns (see :meth:`filter_output`).
**only** these columns (see :meth:`adapt_output`).

Copilot uses AI. Check for mistakes.
talgalili added a commit to talgalili/balance that referenced this pull request Feb 9, 2026
Summary:

Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns.

Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns.

Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior.

Differential Revision: D92749122
@talgalili talgalili requested a review from Copilot February 9, 2026 21:47
talgalili added a commit to talgalili/balance that referenced this pull request Feb 9, 2026
Summary:

Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns.

Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns.

Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior.

Differential Revision: D92749122
talgalili added a commit to talgalili/balance that referenced this pull request Feb 9, 2026
Summary:

Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns.

Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns.

Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior.

Differential Revision: D92749122
talgalili added a commit to talgalili/balance that referenced this pull request Feb 9, 2026
Summary:
Pull Request resolved: facebookresearch#333

Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns.

Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns.

Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior.

Differential Revision: D92749122
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines +1335 to +1338
with (
tempfile.TemporaryDirectory() as temp_dir,
tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as in_file,
):
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test creates a NamedTemporaryFile(..., delete=False) but never removes it after the test finishes. Please either set delete=True (closing the handle before it’s read) or explicitly os.unlink(in_file.name) in a finally block to avoid leaving temp files behind.

Copilot uses AI. Check for mistakes.
Summary:
Pull Request resolved: facebookresearch#333

Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns.

Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns.

Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior.

Differential Revision: D92749122
@meta-codesync
Copy link

meta-codesync bot commented Feb 9, 2026

This pull request has been merged in a83f409.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants