SNOW-2752334: Fix overlap handling when parsing XML file #4008

Merged
sfc-gh-aalam merged 7 commits into main from aalam-SNOW-2752334-xml-fix
Nov 22, 2025

Conversation

@sfc-gh-aalam
Contributor

@sfc-gh-aalam sfc-gh-aalam commented Nov 21, 2025

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-2752334

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

    When scanning the XML file chunk by chunk to find the start position of the opening row tag, the code kept the overlap in memory but also seeked the file backward by the overlap length. The overlap bytes were effectively handled twice, which caused invalid file parsing. This change removes the file seek line.
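
The chunked scan described above can be sketched roughly as follows. This is a minimal illustration, not the Snowpark implementation: `find_tag_start`, its parameters, and the single `tag` argument are hypothetical names (the real code checks two tag variants, `tag_start_1` and `tag_start_2`). It shows why no backward seek is needed once the tail of the previous read is carried in memory.

```python
import io


def find_tag_start(file_obj, tag: bytes, chunk_size: int = 8) -> int:
    """Return the absolute offset of the first `tag` in `file_obj` (hypothetical helper)."""
    overlap_size = len(tag) - 1  # enough bytes to catch a tag split across two reads
    overlap = b""
    while True:
        pos_before_read = file_obj.tell()
        chunk = file_obj.read(chunk_size)
        if not chunk:
            raise EOFError("Reached end of file before finding opening tag")
        data = overlap + chunk
        idx = data.find(tag)
        if idx != -1:
            # `data` begins len(overlap) bytes before the position where this read started.
            return pos_before_read - len(overlap) + idx
        # Carry the tail forward in memory. Seeking back here as well would
        # make the next read return these same bytes a second time.
        overlap = data[-overlap_size:] if overlap_size else b""


xml = b"<root><row>a</row></root>"
offset = find_tag_start(io.BytesIO(xml), b"<row", chunk_size=8)  # tag spans the first boundary
```

With an 8-byte chunk, the tag `<row` straddles the first chunk boundary, and the returned offset still matches `xml.find(b"<row")`.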

@sfc-gh-aalam sfc-gh-aalam marked this pull request as ready for review November 21, 2025 21:37
@sfc-gh-aalam sfc-gh-aalam requested review from a team as code owners November 21, 2025 21:37
Comment on lines -208 to -211
# If the chunk is smaller than expected, we are near the end.
if len(chunk) < current_chunk_size:
    if chunk.find(tag_start_1) == -1 and chunk.find(tag_start_2) == -1:
        raise EOFError("Reached end of file before finding opening tag")

@sfc-gh-aalam sfc-gh-aalam Nov 21, 2025

This code looks for the start tag only in the last chunk and raises an EOFError if the tag is not found in that chunk. That is not correct: the code below, starting at line 210, looks for the tag in data = overlap + chunk. If, at the last chunk, the row start tag is part of overlap, this check misses it and raises a false error.
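
A small sketch of the failure mode, using made-up byte strings and a hypothetical helper `old_check_would_raise` that mirrors the removed condition for a single tag:

```python
def old_check_would_raise(chunk: bytes, tag: bytes, expected_size: int) -> bool:
    """Mirror of the removed check: a short final chunk that lacks the tag."""
    return len(chunk) < expected_size and chunk.find(tag) == -1


tag = b"<row"
overlap = b"t><ro"     # tail carried over from the previous read (illustrative)
chunk = b"w>x</row>"   # short final chunk: contains "</row>" but no "<row"
data = overlap + chunk

false_error = old_check_would_raise(chunk, tag, expected_size=16)
tag_index_in_data = data.find(tag)
```

Here the removed check would raise even though the tag is present once the overlap is prepended, which is the case the comment describes.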

Comment on lines -236 to -238
# Otherwise, rewind by the length of the overlap so that a tag spanning the boundary isn't missed.
file_obj.seek(-len(overlap), 1)

Line 230 already reads the overlapping data into overlap, so we don't need to seek back. The absolute_pos calculation on line 222 also accounts for this.

.select("medical_license", "radio_license", "count")
)
- .sort(col("count"))
+ .sort(col("count"), col("radio_license"))

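These tests are flaky due to sort order: sorting by count alone leaves rows with equal counts in no guaranteed order. Adding radio_license as a second sort key makes them deterministic.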
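
The effect of the extra sort key can be illustrated in plain Python (not Snowpark). The rows below are invented, and the sketch assumes `radio_license` is unique among rows that share a count:

```python
# Each row is (medical_license, radio_license, count); values are made up.
rows_a = [("m1", "r2", 1), ("m2", "r1", 1), ("m3", "r3", 2)]
rows_b = list(reversed(rows_a))  # same rows, different arrival order


def by_count(row):
    return row[2]


def by_count_then_radio(row):
    return (row[2], row[1])


# Sorting by count alone: tied rows keep whatever order they arrived in.
count_only_a = sorted(rows_a, key=by_count)
count_only_b = sorted(rows_b, key=by_count)

# Sorting by (count, radio_license): ties are broken, so arrival order no longer matters.
stable_a = sorted(rows_a, key=by_count_then_radio)
stable_b = sorted(rows_b, key=by_count_then_radio)
```

In a distributed engine the arrival order of tied rows is not fixed between runs, so a test that compares against an exact row list needs the tie-breaking key.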

@sfc-gh-aalam sfc-gh-aalam merged commit 48e715d into main Nov 22, 2025
27 of 29 checks passed
@sfc-gh-aalam sfc-gh-aalam deleted the aalam-SNOW-2752334-xml-fix branch November 22, 2025 01:19
@github-actions github-actions bot locked and limited conversation to collaborators Nov 22, 2025