SNOW-2082332: Support mode for dealing corrupt XML records #3337

Merged
sfc-gh-jdu merged 5 commits into main from jdu-SNOW-2082332-xml-error-msg
May 8, 2025

Conversation

@sfc-gh-jdu
Collaborator

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-2082332

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

    Introducing different modes helps improve the error message for corrupt XML records. This PR also adds a more detailed docstring for the xml() method.

@sfc-gh-jdu sfc-gh-jdu requested review from a team as code owners May 6, 2025 23:43
try:
    record_str = record_bytes.decode("utf-8")
    record_str = re.sub(r"&(\w+);", replace_entity, record_str)
except UnicodeDecodeError as e:

Collaborator Author

We don't actually need to handle UnicodeDecodeError, because we can simply replace any character that the charset doesn't support. A separate PR will add support for charsets other than utf-8.
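
For context, here is a minimal sketch of the standard-library behavior this comment relies on. The sample bytes are made up for illustration and are not taken from the PR: with `errors="replace"`, `bytes.decode` substitutes U+FFFD for each undecodable byte instead of raising.

```python
# Illustrative input only: \xe9 is 'é' in Latin-1 but is not valid UTF-8 here.
raw = b"<name>caf\xe9</name>"

strict_error = None
try:
    raw.decode("utf-8")  # strict mode (the default) raises on the bad byte
except UnicodeDecodeError as exc:
    strict_error = exc

# With errors="replace" the bad byte becomes U+FFFD and decoding continues.
text = raw.decode("utf-8", errors="replace")
print(repr(text))
```

Because the replacement never raises, a `try`/`except UnicodeDecodeError` around such a call would be dead code.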

Comment on lines +360 to +370
except EOFError as e:
    if mode == "PERMISSIVE":
        # read util the end of file or util variant column size limit
        record_bytes = f.read(VARIANT_COLUMN_SIZE_LIMIT)
        record_str = record_bytes.decode("utf-8", errors="replace")
        record_str = re.sub(r"&(\w+);", replace_entity, record_str)
        yield {COLUMN_NAME_OF_CORRUPT_RECORD: record_str}
    elif mode == "FAILFAST":
        raise EOFError(
            f"Malformed XML record at bytes {record_start}-EOF: {e}"
        ) from e

Contributor

It appears the DROPMALFORMED mode handling is missing in this error handling block. When mode == "DROPMALFORMED", the code should simply break or continue without yielding anything to properly skip the malformed record. This would align with the behavior described in the documentation where "DROPMALFORMED: Ignores the whole record that cannot be parsed correctly."

Suggested change
- except EOFError as e:
-     if mode == "PERMISSIVE":
-         # read util the end of file or util variant column size limit
-         record_bytes = f.read(VARIANT_COLUMN_SIZE_LIMIT)
-         record_str = record_bytes.decode("utf-8", errors="replace")
-         record_str = re.sub(r"&(\w+);", replace_entity, record_str)
-         yield {COLUMN_NAME_OF_CORRUPT_RECORD: record_str}
-     elif mode == "FAILFAST":
-         raise EOFError(
-             f"Malformed XML record at bytes {record_start}-EOF: {e}"
-         ) from e
+ except EOFError as e:
+     if mode == "PERMISSIVE":
+         # read util the end of file or util variant column size limit
+         record_bytes = f.read(VARIANT_COLUMN_SIZE_LIMIT)
+         record_str = record_bytes.decode("utf-8", errors="replace")
+         record_str = re.sub(r"&(\w+);", replace_entity, record_str)
+         yield {COLUMN_NAME_OF_CORRUPT_RECORD: record_str}
+     elif mode == "DROPMALFORMED":
+         # Skip the malformed record
+         continue
+     elif mode == "FAILFAST":
+         raise EOFError(
+             f"Malformed XML record at bytes {record_start}-EOF: {e}"
+         ) from e

Spotted by Diamond

Collaborator Author

Not needed: in DROPMALFORMED mode we simply ignore the malformed record here.
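
To illustrate why the extra branch is redundant, here is a rough, self-contained sketch, not the PR's actual code. The helper name `handle_truncated_record`, the values of the two constants, the `f.seek` call, and the stand-in `raise` are all assumptions made for the example, and entity replacement is omitted. When neither the `PERMISSIVE` nor the `FAILFAST` branch matches, the generator yields nothing, which is exactly the DROPMALFORMED behavior.

```python
import io

# Assumed values for illustration; the real constants live in the PR's module.
COLUMN_NAME_OF_CORRUPT_RECORD = "_corrupt_record"
VARIANT_COLUMN_SIZE_LIMIT = 16 * 1024 * 1024


def handle_truncated_record(f, record_start, mode):
    """Hypothetical helper: yield zero or one row for a record cut off by EOF."""
    try:
        # Stand-in for the real scanner failing to find the closing tag.
        raise EOFError("closing tag not found")
    except EOFError as e:
        if mode == "PERMISSIVE":
            f.seek(record_start)  # assumption: re-read from the record start
            record_bytes = f.read(VARIANT_COLUMN_SIZE_LIMIT)
            record_str = record_bytes.decode("utf-8", errors="replace")
            yield {COLUMN_NAME_OF_CORRUPT_RECORD: record_str}
        elif mode == "FAILFAST":
            raise EOFError(
                f"Malformed XML record at bytes {record_start}-EOF: {e}"
            ) from e
        # DROPMALFORMED: no branch matches, so nothing is yielded.


data = b"<row><a>1</a>"  # record with no closing </row>
permissive_rows = list(handle_truncated_record(io.BytesIO(data), 0, "PERMISSIVE"))
dropped_rows = list(handle_truncated_record(io.BytesIO(data), 0, "DROPMALFORMED"))
```

An explicit `DROPMALFORMED` branch would only document the intent; it would not change what the generator produces.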

@sfc-gh-snowflakedb-snyk-sa

sfc-gh-snowflakedb-snyk-sa commented May 7, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@sfc-gh-jdu sfc-gh-jdu force-pushed the jdu-SNOW-2082332-xml-error-msg branch from 9a5b1c1 to 95f2792 Compare May 7, 2025 18:30
@sfc-gh-jdu sfc-gh-jdu merged commit 7413be9 into main May 8, 2025
37 of 40 checks passed
@sfc-gh-jdu sfc-gh-jdu deleted the jdu-SNOW-2082332-xml-error-msg branch May 8, 2025 01:58
@github-actions github-actions bot locked and limited conversation to collaborators May 8, 2025

4 participants