-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
Description
Is your feature request related to a problem?
I wish DataFrame.read_excel could skip automatically the rows that are hidden in excel. I want to read data from an excel sheet but I want to select within excel the data that will go to the dataFrame. I am using excel filter to select data of interest so some lines are hidden. I can use this workaround but it implies reading twice the excel sheet which in my case is quite big.
Describe the solution you'd like
DataFrame.read_excel
should get a new parameter skip_hidden
that is default Fasle but when set to True will DataFrame.read_excel
will skip the hidden rows
API breaking implications
Additionnal parameter to DataFrame.read_excel
with default value so I guess no API breaking risk.
Describe alternatives you've considered
Going for a workaround with openpyxl
which implies opening twice the excel sheet which is expensive. stackoverflow example
Additional context
looking at pandas/io/excel/_openpyxl.py
function get_sheet_data
, considering new parameter to read_excel
named skip_hidden
.
modification looks quite simple:
def get_sheet_data(self, sheet, convert_float: bool, skip_hidden: bool) -> list[list[Scalar]]:
if self.book.read_only:
sheet.reset_dimensions()
rowDim=sheet.row_dimensions # added
data: list[list[Scalar]] = []
last_row_with_data = -1
for row_number, row in enumerate(sheet.rows):
converted_row = [self._convert_cell(cell, convert_float) for cell in row]
while converted_row and converted_row[-1] == "":
# trim trailing empty elements
converted_row.pop()
if converted_row:
last_row_with_data = row_number
if not (skip_hidden and rowDim[row_number].hidden) : # added
data.append(converted_row)
# Trim trailing empty rows
data = data[: last_row_with_data + 1]
if len(data) > 0:
# extend rows to max width
max_width = max(len(data_row) for data_row in data)
if min(len(data_row) for data_row in data) < max_width:
empty_cell: list[Scalar] = [""]
data = [
data_row + (max_width - len(data_row)) * empty_cell
for data_row in data
]
return data