Route unmentioned columns to ignore_columns#333
Route unmentioned columns to ignore_columns#333talgalili wants to merge 1 commit intofacebookresearch:mainfrom
Conversation
|
@talgalili has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92749122. |
There was a problem hiding this comment.
Pull request overview
This PR changes the balance CLI’s default column classification so that when --outcome_columns is not provided, previously-unclassified columns are routed into ignore_columns (instead of being auto-treated as outcomes), avoiding surprising behavior (e.g., --keep_columns silently becoming outcomes). It also updates related CLI docs and records the breaking change in the changelog.
Changes:
- Update
BalanceCLI.process_batch()to stop inferring outcome columns by default and instead populateignore_columnswith unmentioned columns. - Update/clarify
keep_columnsdocstrings and CLI--keep_columnshelp text. - Adjust CLI tests to assert the new
ignore_columnsbehavior; add changelog entry for the breaking change.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
balance/cli.py |
Changes default column routing in process_batch() and updates keep_columns docs / CLI arg help. |
tests/test_cli.py |
Updates tests to validate that unmentioned columns now go to ignore_columns and outcomes default to None. |
balance/__init__.py |
Updates the package __version__ string. |
CHANGELOG.md |
Adds a 0.17.0 (TBD) breaking-change entry describing the new CLI behavior. |
| "--keep_columns", | ||
| type=str, | ||
| required=False, | ||
| help="Set of columns we include in the output csv file", | ||
| help=( | ||
| "Comma-separated columns to include in the output csv file. " | ||
| "The output will be subsetted to only these columns." | ||
| ), |
There was a problem hiding this comment.
The CLI help text for --outcome_columns earlier in this file still describes the old default inference (“if not supplied… treated as outcomes”), but process_batch() no longer infers outcomes by default. Please update that help text to match the new behavior so users aren’t misled.
tests/test_cli.py
Outdated
| if outcome_columns is None: | ||
| raise AssertionError("Expected outcome_columns to be provided.") | ||
| cls.calls.append(outcome_columns) | ||
| cls.ignore_calls.append(ignore_columns) |
There was a problem hiding this comment.
RecordingSample.calls is annotated as list[tuple[str, ...]], but from_frame() now appends outcome_columns which can be None. Under # pyre-strict this will be a type error; update the annotation (and/or the from_frame signature) so the list element type includes None.
balance/cli.py
Outdated
|
|
||
| Keep columns control which columns appear in the final output CSV. | ||
| After adjustment, the output DataFrame is subsetted to contain | ||
| **only** these columns (see :meth:`filter_output`). |
There was a problem hiding this comment.
The docstring references :meth:filter_output``, but BalanceCLI doesn’t define a `filter_output` method (the behavior appears to live in `adapt_output`). Update the reference so the docs don’t point to a nonexistent method.
| **only** these columns (see :meth:`filter_output`). | |
| **only** these columns (see :meth:`adapt_output`). |
Summary: Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns. Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns. Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior. Differential Revision: D92749122
48114d5 to
ad5faae
Compare
Summary: Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns. Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns. Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior. Differential Revision: D92749122
ad5faae to
18148fc
Compare
Summary: Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns. Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns. Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior. Differential Revision: D92749122
18148fc to
412bcdb
Compare
Summary: Pull Request resolved: facebookresearch#333 Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns. Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns. Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior. Differential Revision: D92749122
412bcdb to
b2a39f8
Compare
| with ( | ||
| tempfile.TemporaryDirectory() as temp_dir, | ||
| tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as in_file, | ||
| ): |
There was a problem hiding this comment.
This test creates a NamedTemporaryFile(..., delete=False) but never removes it after the test finishes. Please either set delete=True (closing the handle before it’s read) or explicitly os.unlink(in_file.name) in a finally block to avoid leaving temp files behind.
Summary: Pull Request resolved: facebookresearch#333 Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns. Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns. Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior. Differential Revision: D92749122
b2a39f8 to
23ad03b
Compare
|
This pull request has been merged in a83f409. |
Summary:
Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome_columns. This was surprising — e.g. columns listed in --keep_columns would silently become outcome columns.
Now, unmentioned columns are placed into ignore_columns instead. Columns that are explicitly mentioned (id, weight, covariates, and any columns specified via --outcome_columns, --keep_columns, --batch_columns, or --keep_row_column) are not ignored. Pipelines that relied on the automatic outcome classification must now pass those column names explicitly via --outcome_columns.
Also improved docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify their behavior.
Differential Revision: D92749122