-
Notifications
You must be signed in to change notification settings - Fork 32
feat(file-based): Add Calamine-first with Openpyxl fallback for Excel parser #850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 16 commits
5be5bee
65e8adc
a7664eb
34fe892
adfe576
88084ad
546bd46
67fa697
e431f9d
a82a2fa
6a38d55
fef9ac2
fd1939e
d2f691a
63d24a6
44f7df1
49f3e19
fffe027
9d6428c
0831b04
463be27
95fc5e3
38a1a1c
3277b70
7d73cc6
d2b0255
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -3,9 +3,10 @@ | |||||||||||||
| # | ||||||||||||||
|
|
||||||||||||||
| import logging | ||||||||||||||
| import warnings | ||||||||||||||
| from io import IOBase | ||||||||||||||
| from pathlib import Path | ||||||||||||||
| from typing import Any, Dict, Iterable, Mapping, Optional, Tuple, Union | ||||||||||||||
| from typing import Any, Dict, Iterable, Mapping, Optional, Tuple, Type, Union, cast | ||||||||||||||
|
|
||||||||||||||
| import orjson | ||||||||||||||
| import pandas as pd | ||||||||||||||
|
|
@@ -17,6 +18,7 @@ | |||||||||||||
| ) | ||||||||||||||
| from airbyte_cdk.sources.file_based.exceptions import ( | ||||||||||||||
| ConfigValidationError, | ||||||||||||||
| ExcelCalamineParsingError, | ||||||||||||||
| FileBasedSourceError, | ||||||||||||||
| RecordParseError, | ||||||||||||||
| ) | ||||||||||||||
|
|
@@ -64,7 +66,7 @@ async def infer_schema( | |||||||||||||
| fields: Dict[str, str] = {} | ||||||||||||||
|
|
||||||||||||||
| with stream_reader.open_file(file, self.file_read_mode, self.ENCODING, logger) as fp: | ||||||||||||||
| df = self.open_and_parse_file(fp) | ||||||||||||||
| df = self.open_and_parse_file(fp, logger, file) | ||||||||||||||
| for column, df_type in df.dtypes.items(): | ||||||||||||||
| # Choose the broadest data type if the column's data type differs in dataframes | ||||||||||||||
| prev_frame_column_type = fields.get(column) # type: ignore [call-overload] | ||||||||||||||
|
|
@@ -92,7 +94,7 @@ def parse_records( | |||||||||||||
| discovered_schema: Optional[Mapping[str, SchemaType]] = None, | ||||||||||||||
| ) -> Iterable[Dict[str, Any]]: | ||||||||||||||
| """ | ||||||||||||||
| Parses records from an Excel file based on the provided configuration. | ||||||||||||||
| Parses records from an Excel file with fallback error handling. | ||||||||||||||
|
|
||||||||||||||
| Args: | ||||||||||||||
| config (FileBasedStreamConfig): Configuration for the file-based stream. | ||||||||||||||
|
|
@@ -111,7 +113,7 @@ def parse_records( | |||||||||||||
| try: | ||||||||||||||
| # Open and parse the file using the stream reader | ||||||||||||||
| with stream_reader.open_file(file, self.file_read_mode, self.ENCODING, logger) as fp: | ||||||||||||||
| df = self.open_and_parse_file(fp) | ||||||||||||||
| df = self.open_and_parse_file(fp, logger, file) | ||||||||||||||
| # Yield records as dictionaries | ||||||||||||||
| # DataFrame.to_dict() method returns datetime values in pandas.Timestamp values, which are not serializable by orjson | ||||||||||||||
| # DataFrame.to_json() returns string with datetime values serialized to iso8601 with microseconds to align with pydantic behavior | ||||||||||||||
|
|
@@ -180,15 +182,95 @@ def validate_format(excel_format: BaseModel, logger: logging.Logger) -> None: | |||||||||||||
| logger.info(f"Expected ExcelFormat, got {excel_format}") | ||||||||||||||
| raise ConfigValidationError(FileBasedSourceError.CONFIG_VALIDATION_ERROR) | ||||||||||||||
|
|
||||||||||||||
| @staticmethod | ||||||||||||||
| def open_and_parse_file(fp: Union[IOBase, str, Path]) -> pd.DataFrame: | ||||||||||||||
| def _open_and_parse_file_with_calamine( | ||||||||||||||
| self, | ||||||||||||||
| fp: Union[IOBase, str, Path], | ||||||||||||||
| logger: logging.Logger, | ||||||||||||||
| file_info: RemoteFile, | ||||||||||||||
| ) -> pd.DataFrame: | ||||||||||||||
| """Opens and parses Excel file using Calamine engine. | ||||||||||||||
|
|
||||||||||||||
| Args: | ||||||||||||||
| fp: File pointer to the Excel file. | ||||||||||||||
| logger: Logger for logging information and errors. | ||||||||||||||
| file_info: Remote file information for logging context. | ||||||||||||||
|
|
||||||||||||||
|
||||||||||||||
| Args: | |
| fp (Union[IOBase, str, Path]): The file pointer, file path, or file-like object to parse. | |
| logger (Optional[logging.Logger]): Optional logger for warning and error messages. | |
| file_info (Optional[Union[str, RemoteFile]]): Optional file information (RemoteFile or string) used for logging context. |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we have if hasattr(fp, "seek"): do we really need to catch AttributeError here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right! With the hasattr(fp, "seek") guard in place, catching AttributeError is redundant.
After the hasattr check passes, the only standard failure for seeking on non-seekable streams is io.UnsupportedOperation, which is a subclass of OSError. The AttributeError would only occur during attribute lookup, which hasattr already handles.
I'll update the code to:
- Narrow the except clause to only
OSError - Update the parametrized test to only test the
OSErrorcase (removing theAttributeErrorcase)
This makes the code cleaner and more precise. Let me push that change now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! I've narrowed the except clause to only OSError and updated the test accordingly.
Changes Made:
- excel_parser.py: Changed
except (AttributeError, OSError)toexcept OSErroron line 240 - test_excel_parser.py: Removed the
AttributeErrorcase from the parametrized test, keeping only theOSErrorcase
Rationale:
With the hasattr(fp, "seek") guard in place, AttributeError can only occur during attribute lookup, which hasattr already handles. After the guard passes, the only standard failure for seeking on non-seekable streams is io.UnsupportedOperation, which is a subclass of OSError.
Local Verification:
- ✅ Test passes with only OSError case
- ✅ All Excel parser tests pass (5 passed, 1 skipped)
- ✅ Ruff format and lint pass
Commit: 49f3e19 - "refactor: Narrow exception handling to OSError only in seek logic"
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use file as naming instead of file_info because we know this is an instance of RemoteFile?
here and in _open_and_parse_file_with_calamine and in _open_and_parse_file_with_openpyxl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! I've renamed file_info to file across all three methods.
Changes Made:
_open_and_parse_file_with_calamine: Renamed parameter and all references_open_and_parse_file_with_openpyxl: Renamed parameter and all referencesopen_and_parse_file: Renamed parameter and all references
Rationale:
Since the type is RemoteFile, the shorter name file is more appropriate and makes the code more concise while maintaining clarity.
Local Verification:
- ✅ All Excel parser tests pass (5 passed, 1 skipped)
- ✅ Ruff format and lint pass
Commit: fffe027 - "refactor: Rename file_info parameter to file in Excel parsing methods"
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Uh oh!
There was an error while loading. Please reload this page.