skalwaghe-56
Contributor

@skalwaghe-56 skalwaghe-56 commented Sep 8, 2025


This PR fixes a regression in the CSV parsers when using on_bad_lines as a callable.

Thanks!

@skalwaghe-56 skalwaghe-56 force-pushed the fix-issue-61837 branch 3 times, most recently from f579800 to d77afef Compare September 10, 2025 10:43
@skalwaghe-56
Contributor Author

@jbrockmendel @rhshadrach Could you please guide me further?

@simonjayhawkins simonjayhawkins added Bug IO CSV read_csv, to_csv labels Sep 10, 2025
@skalwaghe-56 skalwaghe-56 force-pushed the fix-issue-61837 branch 4 times, most recently from 7009e84 to 0729267 Compare September 12, 2025 12:14
@skalwaghe-56
Contributor Author

@rhshadrach @jorisvandenbossche When I ran the tests locally with these changes, one test XPASSed. I think it is related to #10153.
It's this test:

@pytest.mark.parametrize("dtype", [{"b": "category"}, {1: "category"}])
def test_categorical_dtype_single(all_parsers, dtype, request):
    # see gh-10153
    parser = all_parsers
    data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""
    expected = DataFrame(
        {"a": [1, 1, 2], "b": Categorical(["a", "a", "b"]), "c": [3.4, 3.4, 4.5]}
    )
    if parser.engine == "pyarrow":
        mark = pytest.mark.xfail(
            strict=False,
            reason="Flaky test sometimes gives object dtype instead of Categorical",
        )
        request.applymarker(mark)

    actual = parser.read_csv(StringIO(data), dtype=dtype)
    tm.assert_frame_equal(actual, expected)

I would like you guys to check this out and check the PR too!

Thanks!

@skalwaghe-56 skalwaghe-56 force-pushed the fix-issue-61837 branch 4 times, most recently from 02e9bd2 to 7f303f7 Compare September 16, 2025 16:40
Contributor Author

@skalwaghe-56 skalwaghe-56 left a comment

I have now fixed the tests as well, so the CI should pass.

@skalwaghe-56 skalwaghe-56 force-pushed the fix-issue-61837 branch 2 times, most recently from e1f405e to 2fa7f70 Compare September 20, 2025 09:00
Member

@rhshadrach rhshadrach left a comment

Looking good!

@rhshadrach
Member

Looks good - I suspect merging main will fix the CI.

@skalwaghe-56
Contributor Author

> Looks good - I suspect merging main will fix the CI.

Yep you're right!

@skalwaghe-56
Contributor Author

I guess the errors are still present, will rebase and push once these are all fixed :)

- Always emit ParserWarning and drop extra fields when an on_bad_lines
  callable returns more elements than expected, regardless of index_col,
  in PythonParser._rows_to_cols. [GH#61837]

- Ensure non-bad rows are appended in the outer else branch so good lines
  are preserved.

- Add regression test
  pandas/tests/io/parser/test_python_parser_only.py::test_on_bad_lines_callable_warns_and_truncates_with_index_col
  covering index_col in [None, 0].

Closes pandas-dev#61837.
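
The behavior described in the commit message above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `PythonParser._rows_to_cols`: the function name `normalize_rows`, its parameters, and the local `ParserWarning` class are hypothetical stand-ins chosen so the example runs without pandas. It treats a row as bad when it has more fields than expected, passes it to the callable, and warns and truncates if the callable's result is still too long. Good rows are always kept.

```python
import warnings


class ParserWarning(Warning):
    """Hypothetical stand-in for pandas.errors.ParserWarning."""


def normalize_rows(rows, expected_len, on_bad_lines):
    """Simplified sketch: keep good rows, repair or skip bad ones.

    rows          -- list of lists of strings (already split fields)
    expected_len  -- number of fields a well-formed row may have at most
    on_bad_lines  -- callable taking a bad row, returning a new row or None
    """
    kept = []
    for row in rows:
        if len(row) <= expected_len:
            # Good line: preserved unchanged.
            kept.append(row)
            continue

        new_row = on_bad_lines(row)
        if new_row is None:
            # The callable asked for this line to be skipped.
            continue

        if len(new_row) > expected_len:
            # Warn and drop the extra fields, whatever index settings
            # the caller used to arrive at expected_len.
            warnings.warn(
                f"Length of returned row ({len(new_row)}) exceeds "
                f"expected length ({expected_len}); extra fields dropped.",
                ParserWarning,
                stacklevel=2,
            )
            new_row = new_row[:expected_len]
        kept.append(new_row)
    return kept
```

In this sketch the caller decides what `expected_len` is, so the warning path does not depend on whether an index column is configured, which is the property the regression test in this PR checks for `index_col` in `[None, 0]`.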
Member

@rhshadrach rhshadrach left a comment

lgtm

@rhshadrach rhshadrach added this to the 3.0 milestone Sep 29, 2025
@rhshadrach rhshadrach merged commit 986b4e5 into pandas-dev:main Sep 29, 2025
42 checks passed
@rhshadrach
Member

Thanks @skalwaghe-56!

@skalwaghe-56 skalwaghe-56 deleted the fix-issue-61837 branch September 30, 2025 10:56
@skalwaghe-56
Contributor Author

Thanks!

jzwick pushed a commit to jzwick/pandas that referenced this pull request Oct 1, 2025
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_csv() on_bad_lines callable does not raise ParserWarning when index_col is set
3 participants