SNOW-2752334: Fix overlap handling when parsing XML file #4008
Merged
sfc-gh-aalam merged 7 commits into main on Nov 22, 2025
sfc-gh-aalam
commented
Nov 21, 2025
Comment on lines -208 to -211
    - # If the chunk is smaller than expected, we are near the end.
    - if len(chunk) < current_chunk_size:
    -     if chunk.find(tag_start_1) == -1 and chunk.find(tag_start_2) == -1:
    -         raise EOFError("Reached end of file before finding opening tag")
Contributor
Author
This code looks for the start tag in the last chunk and raises an EOFError if the tag is not found in chunk. That is incorrect: the code below, starting at line 210, searches for the tag in data = overlap + chunk. If the start row tag falls within overlap when the last chunk is read, this check misses it and raises a false error.
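A minimal sketch of the false error described above. The names tag_start_1, tag_start_2, overlap, chunk and data come from the diff; the way the two tag patterns are built, the row tag name and the byte values are assumptions made for illustration only.

```python
# Hypothetical row tag; the two patterns are assumed to cover "<row>" and "<row attr=...>".
row_tag = "row"
tag_start_1 = f"<{row_tag}>".encode()
tag_start_2 = f"<{row_tag} ".encode()


def contains_start_tag(buf: bytes) -> bool:
    """Return True if either opening-tag pattern occurs in buf."""
    return buf.find(tag_start_1) != -1 or buf.find(tag_start_2) != -1


# Last read of the file: the opening tag began in the previous read,
# so its first bytes are only present in the carried-over overlap.
overlap = b"<ro"
chunk = b"w>1</row>"
data = overlap + chunk

found_in_chunk_only = contains_start_tag(chunk)  # what the removed check inspected
found_in_data = contains_start_tag(data)         # what the later search inspects
```

Here the removed check would have raised EOFError because chunk alone holds no opening tag, even though the search over data finds one.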
sfc-gh-aalam
commented
Nov 21, 2025
Comment on lines -236 to -238
    - # Otherwise, rewind by the length of the overlap so that a tag spanning the boundary isn't missed.
    - file_obj.seek(-len(overlap), 1)
Contributor
Author
Line 230 already reads the overlapped data into overlap, so we don't need to seek back. The absolute_pos calculation on line 222 also accounts for this.
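The idea can be sketched as a small chunked search that carries the tail of each read forward in memory and never rewinds the file. This is an illustration under stated assumptions, not the code from the PR: find_tag_start, its parameters and the overlap size of len(tag) - 1 are hypothetical.

```python
import io


def find_tag_start(file_obj, tag: bytes, chunk_size: int = 8) -> int:
    """Return the absolute offset of the first occurrence of tag.

    Hypothetical helper: the tail of each read is kept in `overlap`,
    so the file position is never moved backwards.
    """
    overlap_size = len(tag) - 1  # enough to cover a tag split across two reads
    overlap = b""
    while True:
        pos_before_read = file_obj.tell()
        chunk = file_obj.read(chunk_size)
        if not chunk:
            raise EOFError("Reached end of file before finding opening tag")
        data = overlap + chunk
        idx = data.find(tag)
        if idx != -1:
            # data begins len(overlap) bytes before the position of this read.
            return pos_before_read - len(overlap) + idx
        # Carry the tail forward in memory; no file_obj.seek(-len(overlap), 1).
        overlap = data[-overlap_size:] if overlap_size else b""


# The tag "<row>" starts at offset 5 and straddles the 8-byte read boundary.
offset = find_tag_start(io.BytesIO(b"<a>xx<row>1</row>"), b"<row>")
```

Because the overlap is already held in memory, seeking back as well would make the next read return bytes that are also in overlap, so the offset arithmetic above would no longer match the file.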
sfc-gh-jdu
approved these changes
Nov 21, 2025
sfc-gh-aalam
commented
Nov 21, 2025
      .select("medical_license", "radio_license", "count")
      )
    - .sort(col("count"))
    + .sort(col("count"), col("radio_license"))
Contributor
Author
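These tests are flaky because sorting by count alone leaves the order of rows with equal counts undefined. Adding radio_license as a secondary sort key makes them deterministic.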
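A plain-Python sketch of why the extra sort key helps. The rows and values are made up; only the column names count and radio_license come from the diff. Two arrival orders of the same rows stand in for the nondeterministic order in which a query engine may return ties.

```python
# Hypothetical rows: two of them tie on "count".
rows_a = [
    {"radio_license": "B", "count": 1},
    {"radio_license": "A", "count": 1},
    {"radio_license": "C", "count": 2},
]
# Same rows, different arrival order.
rows_b = [rows_a[1], rows_a[0], rows_a[2]]


def by_count(row):
    return row["count"]


def by_count_then_license(row):
    return (row["count"], row["radio_license"])


# Sorting on count alone keeps tied rows in arrival order, so results differ.
count_only_a = sorted(rows_a, key=by_count)
count_only_b = sorted(rows_b, key=by_count)

# The secondary key breaks the tie, so both inputs give the same result.
tie_broken_a = sorted(rows_a, key=by_count_then_license)
tie_broken_b = sorted(rows_b, key=by_count_then_license)
```

With the tie broken, an assertion on the full row order no longer depends on how the engine happened to return equal-count rows.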
sfc-gh-mayliu
approved these changes
Nov 21, 2025
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2752334
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
When parsing the XML file chunk by chunk to find the start position of the opening row tag, the code loaded the overlap into memory but also performed a file seek to undo the overlap, which caused invalid file parsing. This change removes the file seek line.