-
Notifications
You must be signed in to change notification settings - Fork 32
feat(file-based): Add Calamine-first with Openpyxl fallback for Excel parser #850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 8 commits
5be5bee
65e8adc
a7664eb
34fe892
adfe576
88084ad
546bd46
67fa697
e431f9d
a82a2fa
6a38d55
fef9ac2
fd1939e
d2f691a
63d24a6
44f7df1
49f3e19
fffe027
9d6428c
0831b04
463be27
95fc5e3
38a1a1c
3277b70
7d73cc6
d2b0255
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -3,6 +3,7 @@ | |||||||||||||||||||||
| # | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| import logging | ||||||||||||||||||||||
| import warnings | ||||||||||||||||||||||
| from io import IOBase | ||||||||||||||||||||||
| from pathlib import Path | ||||||||||||||||||||||
| from typing import Any, Dict, Iterable, Mapping, Optional, Tuple, Union | ||||||||||||||||||||||
|
|
@@ -64,7 +65,7 @@ async def infer_schema( | |||||||||||||||||||||
| fields: Dict[str, str] = {} | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| with stream_reader.open_file(file, self.file_read_mode, self.ENCODING, logger) as fp: | ||||||||||||||||||||||
| df = self.open_and_parse_file(fp) | ||||||||||||||||||||||
| df = self.open_and_parse_file(fp, logger, file) | ||||||||||||||||||||||
| for column, df_type in df.dtypes.items(): | ||||||||||||||||||||||
| # Choose the broadest data type if the column's data type differs in dataframes | ||||||||||||||||||||||
| prev_frame_column_type = fields.get(column) # type: ignore [call-overload] | ||||||||||||||||||||||
|
|
@@ -92,7 +93,7 @@ def parse_records( | |||||||||||||||||||||
| discovered_schema: Optional[Mapping[str, SchemaType]] = None, | ||||||||||||||||||||||
| ) -> Iterable[Dict[str, Any]]: | ||||||||||||||||||||||
| """ | ||||||||||||||||||||||
| Parses records from an Excel file based on the provided configuration. | ||||||||||||||||||||||
| Parses records from an Excel file with fallback error handling. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Args: | ||||||||||||||||||||||
| config (FileBasedStreamConfig): Configuration for the file-based stream. | ||||||||||||||||||||||
|
|
@@ -111,7 +112,7 @@ def parse_records( | |||||||||||||||||||||
| try: | ||||||||||||||||||||||
| # Open and parse the file using the stream reader | ||||||||||||||||||||||
| with stream_reader.open_file(file, self.file_read_mode, self.ENCODING, logger) as fp: | ||||||||||||||||||||||
| df = self.open_and_parse_file(fp) | ||||||||||||||||||||||
| df = self.open_and_parse_file(fp, logger, file) | ||||||||||||||||||||||
| # Yield records as dictionaries | ||||||||||||||||||||||
| # DataFrame.to_dict() method returns datetime values in pandas.Timestamp values, which are not serializable by orjson | ||||||||||||||||||||||
| # DataFrame.to_json() returns string with datetime values serialized to iso8601 with microseconds to align with pydantic behavior | ||||||||||||||||||||||
|
|
@@ -181,14 +182,84 @@ def validate_format(excel_format: BaseModel, logger: logging.Logger) -> None: | |||||||||||||||||||||
| raise ConfigValidationError(FileBasedSourceError.CONFIG_VALIDATION_ERROR) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| @staticmethod | ||||||||||||||||||||||
| def open_and_parse_file(fp: Union[IOBase, str, Path]) -> pd.DataFrame: | ||||||||||||||||||||||
| def open_and_parse_file( | ||||||||||||||||||||||
| fp: Union[IOBase, str, Path], | ||||||||||||||||||||||
| logger: Optional[logging.Logger] = None, | ||||||||||||||||||||||
| file_info: Optional[Union[str, RemoteFile]] = None, | ||||||||||||||||||||||
| ) -> pd.DataFrame: | ||||||||||||||||||||||
| """ | ||||||||||||||||||||||
| Opens and parses the Excel file. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Args: | ||||||||||||||||||||||
| fp: File pointer to the Excel file. | ||||||||||||||||||||||
| Opens and parses the Excel file with Calamine-first and Openpyxl fallback. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
||||||||||||||||||||||
| Args: | |
| fp (Union[IOBase, str, Path]): The file pointer, file path, or file-like object to parse. | |
| logger (Optional[logging.Logger]): Optional logger for warning and error messages. | |
| file_info (Optional[Union[str, RemoteFile]]): Optional file information (RemoteFile or string) used for logging context. |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to separate the logic of this method:
_open_and_parse_file_with_calamine: add a new method to class to read file withcalamineengine, catchPanicException, log an error and raiseExcelCalamineParsingError(BaseFileBasedSourceError)_open_and_parse_file_with_openpyxl: add a new method to class to read withopenpyxlengineopen_and_parse_file: try call_open_and_parse_file_with_calamineexceptExcelCalamineParsingErrorlog an error and call_open_and_parse_file_with_openpyxl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion for better separation of concerns! I'll refactor the code to extract the parsing logic into separate methods.
Proposed approach:
- Create custom exception:
ExcelCalamineParsingError(BaseFileBasedSourceError)to signal Calamine failures - Extract Calamine logic:
_open_and_parse_file_with_calamine(fp, logger, file_info)- catches all exceptions (including PyO3 panics), logs warning, raisesExcelCalamineParsingError - Extract Openpyxl logic:
_open_and_parse_file_with_openpyxl(fp, logger, file_info)- handles openpyxl parsing with warning capture - Simplify orchestration:
open_and_parse_file()tries Calamine, catchesExcelCalamineParsingError, falls back to Openpyxl
Questions before I proceed:
- Should
_open_and_parse_file_with_calaminehandle the file pointer seek/reset logic, or should that remain in the orchestration method? - For the custom exception, should I store the original exception as a cause (e.g.,
raise ExcelCalamineParsingError(...) from original_exc)? - Should the two-tier exception handling (Exception + BaseException) remain in the Calamine method to catch PyO3 panics?
I'll proceed with the refactoring once you confirm the approach looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add ExcelCalamineParsingError(BaseFileBasedSourceError) into airbyte-python-cdk/airbyte_cdk/sources/file_based/exceptions.py.
Extract Calamine logic: _open_and_parse_file_with_calamine(fp, logger, file_info) - should catch only PanicException which happens when date is out of range and raise ExcelCalamineParsingError.
- It should only call
pd.ExcelFile(fp, engine="calamine").parse()as in current implementation in the main branch and raise an error as described above - yes, raise ExcelCalamineParsingError(...) from original_exc
- no
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Refactoring complete! I've implemented the three-method separation as requested:
-
_open_and_parse_file_with_calamine()- Callspd.ExcelFile(fp, engine="calamine").parse()like the original implementation, catches all exceptions (including PyO3PanicExceptionwhich inherits fromBaseException), logs warning, and raisesExcelCalamineParsingErrorwith exception chaining (from exc) -
_open_and_parse_file_with_openpyxl()- Handles Openpyxl parsing with warning capture and logging -
open_and_parse_file()- Orchestrates the fallback: tries Calamine first, catchesExcelCalamineParsingError, handles file pointer seeking, then calls Openpyxl
I've also:
- Added
ExcelCalamineParsingError(BaseFileBasedSourceError)toexceptions.py - Removed
Optionalfromloggerandfile_infoparameters (Comment 14) - Changed
file_infotype fromUnion[str, RemoteFile]to justRemoteFile(Comment 14) - Simplified file_url logic to use only
file_info.file_uri_for_logging(Comment 17) - Updated tests to match new signatures
All tests pass locally (MyPy, Ruff, pytest). Ready for review!
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the change in remote_file.py and with providing the remote file only when calling this method I think it's safe to remove the check for file_url value below and use only file_info.file_uri_for_logging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Simplified! I've removed the file_url variable and getattr(file_info, "url", None) logic. Now the code uses only file_info.file_uri_for_logging for all logging messages.
The refactored code no longer needs the _format_message_with_link() helper method since we're not using separate file URLs anymore - just the URI from file_info.file_uri_for_logging.
Outdated
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using getattr(file_info, "url", None) assumes RemoteFile may have a url attribute, but RemoteFile only has uri, last_modified, and mime_type attributes according to the class definition. If you intended to use uri for the URL, use file_info.uri or getattr(file_info, "uri", None) instead. If a separate url attribute is expected from a subclass, please document this.
| file_url = getattr(file_info, "url", None) | |
| file_url = file_info.uri |
github-code-quality[bot] marked this conversation as resolved.
Fixed
Show fixed
Hide fixed
github-code-quality[bot] marked this conversation as resolved.
Fixed
Show fixed
Hide fixed
Outdated
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two exception handlers (lines 207-215 and 216-227) are redundant. Since BaseException catches everything including Exception, the first except Exception block will handle standard exceptions, and the second except BaseException will only catch non-Exception BaseExceptions (like PyO3 panics). Both handlers log identical warning messages and set calamine_exc. Consider combining them or documenting why they're separate, or just use a single except BaseException block with the KeyboardInterrupt/SystemExit check.
github-code-quality[bot] marked this conversation as resolved.
Fixed
Show fixed
Hide fixed
Outdated
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the file pointer is not seekable and Calamine has partially consumed the stream before failing, the fallback to openpyxl will receive a partially consumed stream and likely fail. Consider logging a warning when seek fails, or document this limitation. The silent pass here could mask a critical failure mode where both parsers fail due to stream state rather than file content issues.
| pass | |
| warning_msg = ( | |
| f"File-like object for {file_label} is not seekable. " | |
| "Calamine may have partially consumed the stream, so the fallback to openpyxl may fail due to stream state, not file content. " | |
| "Consider providing a seekable file-like object for best results." | |
| ) | |
| if logger: | |
| logger.warning(ExcelParser._format_message_with_link(warning_msg, file_url)) | |
| else: | |
| warnings.warn(warning_msg) |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have to
file_infonot a type ofRemoteFile?Also I think we can remove
Optionalfrom typing forloggerandfile_info, because in both times where we call it we've already added these arguments.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I checked the call sites and you're absolutely right:
file_infois alwaysRemoteFile: Both call sites (lines 68 and 115 inexcel_parser.py) passfilewhich is aRemoteFileobject. TheUnion[str, RemoteFile]type was overly broad.Both parameters are always provided: All call sites provide both
loggerandfile_info, soOptionalcan be removed.I'll update the signature to:
This will make the API clearer and remove unnecessary None checks. I'll address this along with your other refactoring suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Addressed! I've investigated all call sites and confirmed:
file_infois alwaysRemoteFile- Both call sites (lines 68 and 115 inexcel_parser.py) pass aRemoteFileobject, never a stringloggerandfile_infoare always provided - All call sites provide both parametersI've updated the signature to:
Removed
Optionalfrom both parameters and changedfile_infofromUnion[str, RemoteFile]to justRemoteFile.