SNOW-2082332: Support mode for dealing corrupt XML records #3337
sfc-gh-jdu merged 5 commits into main
try:
    record_str = record_bytes.decode("utf-8")
    record_str = re.sub(r"&(\w+);", replace_entity, record_str)
except UnicodeDecodeError as e:
We don't actually need to handle UnicodeDecodeError, because we can simply replace any character that the charset does not support. A follow-up PR will add support for charsets other than utf-8.
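To illustrate the point about replacing unsupported characters, here is a minimal sketch of how Python's `errors="replace"` option behaves when decoding bytes. The sample bytes are made up for this example and do not come from the PR.

```python
# Hypothetical sample: valid UTF-8 with one stray byte (0xFF) in the middle.
raw = b"<name>caf\xc3\xa9</name>\xff<id>1</id>"

# Strict decoding (the default) raises UnicodeDecodeError on the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD (the replacement character) for the
# undecodable byte instead of raising, so no except clause is needed.
lenient = raw.decode("utf-8", errors="replace")
```

With the lenient form, the rest of the record is still decoded normally; only the invalid byte is lost.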
except EOFError as e:
    if mode == "PERMISSIVE":
        # read until the end of file or until the variant column size limit
        record_bytes = f.read(VARIANT_COLUMN_SIZE_LIMIT)
        record_str = record_bytes.decode("utf-8", errors="replace")
        record_str = re.sub(r"&(\w+);", replace_entity, record_str)
        yield {COLUMN_NAME_OF_CORRUPT_RECORD: record_str}
    elif mode == "FAILFAST":
        raise EOFError(
            f"Malformed XML record at bytes {record_start}-EOF: {e}"
        ) from e
It appears the DROPMALFORMED mode handling is missing in this error handling block. When mode == "DROPMALFORMED", the code should simply break or continue without yielding anything to properly skip the malformed record. This would align with the behavior described in the documentation where "DROPMALFORMED: Ignores the whole record that cannot be parsed correctly."
Suggested change:
except EOFError as e:
    if mode == "PERMISSIVE":
        # read until the end of file or until the variant column size limit
        record_bytes = f.read(VARIANT_COLUMN_SIZE_LIMIT)
        record_str = record_bytes.decode("utf-8", errors="replace")
        record_str = re.sub(r"&(\w+);", replace_entity, record_str)
        yield {COLUMN_NAME_OF_CORRUPT_RECORD: record_str}
    elif mode == "DROPMALFORMED":
        # Skip the malformed record
        continue
    elif mode == "FAILFAST":
        raise EOFError(
            f"Malformed XML record at bytes {record_start}-EOF: {e}"
        ) from e
Spotted by Diamond
Not needed: in DROPMALFORMED mode we simply ignore the malformed record here.
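The three modes discussed in this thread can be sketched as a small standalone generator. This is an illustration under stated assumptions, not the PR's implementation: `parse_records`, the `CORRUPT_COLUMN` value, and the use of `xml.etree.ElementTree` on pre-split record strings are all hypothetical stand-ins for the reader's byte-level logic.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator

# Assumed value for illustration; the PR refers to a constant named
# COLUMN_NAME_OF_CORRUPT_RECORD whose value is not shown in the diff.
CORRUPT_COLUMN = "_corrupt_record"


def parse_records(
    records: Iterable[str], mode: str = "PERMISSIVE"
) -> Iterator[Dict[str, str]]:
    """Yield one dict per record, handling malformed XML according to mode."""
    for index, text in enumerate(records):
        try:
            element = ET.fromstring(text)
        except ET.ParseError as e:
            if mode == "PERMISSIVE":
                # Keep the raw text so the caller can inspect the bad record.
                yield {CORRUPT_COLUMN: text}
            elif mode == "FAILFAST":
                raise ValueError(
                    f"Malformed XML record at index {index}: {e}"
                ) from e
            # DROPMALFORMED needs no branch of its own: nothing is yielded
            # and the loop moves on to the next record.
            continue
        # Flatten the direct children into a column -> text mapping.
        yield {child.tag: (child.text or "") for child in element}


# Hypothetical input: the second record is missing its closing tag.
sample = ["<r><a>1</a></r>", "<r><a>2</a>", "<r><a>3</a></r>"]
permissive_rows = list(parse_records(sample, "PERMISSIVE"))
dropped_rows = list(parse_records(sample, "DROPMALFORMED"))
```

Note that in this sketch DROPMALFORMED produces the right behavior without an explicit branch, which matches the author's reply that no extra handling is required.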
Branch updated: 9a5b1c1 to 95f2792
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2082332
Please describe how your code solves the related issue.
Introducing the different modes helps improve the error messages. This PR also adds a more detailed docstring for the xml() method.