@snitish snitish commented Dec 2, 2024

From the original thread:

import io

import pandas as pd

s = '<table><tr><th rowspan="2">A</th><th>B</th></tr><tr><td>1</td></tr><tr><td>C</td><td>2</td></tr></table>'
buf = io.StringIO(s)
print(pd.read_html(buf)[0])
# Actual (before this fix):
#    A                  B
#    A Unnamed: 1_level_1
# 0  1                NaN

# Expected:
#    A  B
# 0  A  1
# 1  C  2

The bug occurs when a header cell has rowspan > 1, so the cell extends past the header into the body rows. The current logic does not handle this case. This PR fixes it by carrying the unfinished rowspans over from the header into the body (and likewise from the body into the footer, if there is one).
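
To illustrate the carry-over idea, here is a minimal sketch. It is not the code merged into pandas: the function name `expand_rowspans`, the `(text, rowspan)` cell representation, and the `carry` parameter are hypothetical and chosen only for this example, and colspan handling is left out entirely.

```python
def expand_rowspans(rows, carry=None):
    """Expand rowspans within one table section (hypothetical helper).

    rows:  list of rows; each row is a list of (text, rowspan) cells.
    carry: spans left unfinished by the previous section, given as
           (column_index, text, rows_left) tuples.
    Returns (expanded_rows, leftover), where leftover has the same shape
    as carry and can be passed on to the next section.
    """
    carry = list(carry or [])
    expanded = []
    for row in rows:
        texts = []
        next_carry = []
        pending = sorted(carry)  # spans from earlier rows, left to right
        index = 0
        for text, rowspan in row:
            # Fill columns that are still occupied by an earlier rowspan.
            while pending and pending[0][0] <= index:
                col, prev_text, left = pending.pop(0)
                texts.append(prev_text)
                if left > 1:
                    next_carry.append((col, prev_text, left - 1))
                index += 1
            texts.append(text)
            if rowspan > 1:
                next_carry.append((index, text, rowspan - 1))
            index += 1
        # Spans that sit to the right of the last cell in this row.
        for col, prev_text, left in pending:
            texts.append(prev_text)
            if left > 1:
                next_carry.append((col, prev_text, left - 1))
        expanded.append(texts)
        carry = next_carry
    return expanded, carry


# The table from the example above, split into header and body sections.
header = [[("A", 2), ("B", 1)]]
body = [[("1", 1)], [("C", 1), ("2", 1)]]

header_rows, leftover = expand_rowspans(header)
body_rows, _ = expand_rowspans(body, carry=leftover)
body_rows_without_carry, _ = expand_rowspans(body)
```

With the leftover span passed along, the first body row is completed with the header cell's text; without it, that row is one cell short, which is the kind of misalignment the report shows.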

@snitish snitish mentioned this pull request Dec 2, 2024
3 tasks

@mroeschke mroeschke left a comment

Looks good. Could you add a whatsnew note in v3.0.0.rst under the I/O section?

@mroeschke mroeschke added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Dec 2, 2024
@snitish snitish requested a review from mroeschke December 2, 2024 18:47
@rhshadrach rhshadrach added the Bug label Dec 2, 2024
@mroeschke mroeschke added this to the 3.0 milestone Dec 3, 2024
@mroeschke mroeschke merged commit d9dfaa9 into pandas-dev:main Dec 3, 2024
51 of 55 checks passed
@mroeschke

Thanks @snitish

@snitish snitish deleted the 60210 branch February 6, 2025 19:46
KevsterAmp pushed a commit to KevsterAmp/pandas that referenced this pull request Mar 12, 2025
…#60464)

* BUG: Fix pd.read_html handling of rowspan in table header

* BUG: Fix docstring error in _expand_colspan_rowspan

* BUG: Update return type for _expand_colspan_rowspan

* BUG: Address review and add note to whatsnew

Labels

Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap

Development

Successfully merging this pull request may close these issues.

BUG: rowspan in read_html failed

3 participants