[DEV-12528] Remove Hadoop copy merge step from Spark downloads #4531

aguest-kc · 2025-10-31T18:35:30Z

Description:

Remove the Hadoop copy merge step from Spark downloads and use the correct number of partitions instead of 1.

Technical Details:

Updated the number of partitions to be total records / max records per file instead of 1. Skipped the Hadoop copy merge step, since the partitioning change already results in the correct number of records in each file. Added a method for renaming the part files to the expected format.
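As a rough illustration of the two pieces described above, here is a minimal sketch and not the code from this PR. The names `compute_num_partitions` and `rename_part_files_sketch` are hypothetical, as are the rounding-up behavior, the `<base_name>_NN.csv` naming pattern, and the use of a local directory (the actual helper may operate on S3 and use a different format).

```python
import math
from pathlib import Path


def compute_num_partitions(total_records: int, max_records_per_file: int) -> int:
    """Hypothetical helper: partitions needed so no file exceeds the per-file limit."""
    if max_records_per_file <= 0:
        raise ValueError("max_records_per_file must be positive")
    # Assumption: round up so a partial final chunk gets its own file,
    # and always keep at least one partition, even for an empty result.
    return max(1, math.ceil(total_records / max_records_per_file))


def rename_part_files_sketch(output_dir: Path, base_name: str, extension: str = "csv") -> list[Path]:
    """Hypothetical helper: rename Spark-style part files to <base_name>_NN.<extension>.

    The target naming pattern is an assumption made for illustration only.
    """
    renamed = []
    # Sort so numbering follows the part index embedded in the file name.
    part_files = sorted(output_dir.glob(f"part-*.{extension}"))
    for index, part_file in enumerate(part_files, start=1):
        target = output_dir / f"{base_name}_{index:02d}.{extension}"
        part_file.rename(target)
        renamed.append(target)
    return renamed
```

With a partition count computed this way, a DataFrame could be passed through `df.repartition(num_partitions)` before writing, so each part file is already close to its final size and no merge step is needed afterward. Note that `repartition` distributes rows only approximately evenly, so an exact per-file cap would need an additional safeguard.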

Requirements for PR Merge:

Unit & integration tests updated
Data validation completed (examples listed below)
1. Does this work well with the current frontend? Or is the frontend aware of a needed change?
2. Is performance impacted in the changes (e.g., API, pipeline, downloads, etc.)?
3. Is the expected data returned with the expected format?
Jira Ticket(s)
1. DEV-12528

Explain N/A in above checklist:

API documentation updated (examples listed below)
No API contracts need to be updated for this change.
Appropriate Operations ticket(s) created
No operation tickets are needed for this change.

sethstoudenmier

Approving, but left some comments on possible cleanup / improvements.

usaspending_api/common/etl/spark.py

usaspending_api/common/helpers/s3_helpers.py

aguest-kc added 4 commits October 29, 2025 13:26

[DEV-12528] Remove Hadoop copy merge usage from Spark downloads

6562bc6

[DEV-12528] Update Spark tests

b906e13

[DEV-12528] Update formatting of EXCEL_ROW_LIMIT

5816753

Merge branch 'qat' into ftr/dev-12528-spark-download-zipping

7505aa9

github-actions bot assigned aguest-kc Oct 31, 2025

aguest-kc and others added 4 commits November 3, 2025 08:46

[DEV-12528] Remove num_partitions from test

029ed59

[DEV-12528] Make rename_part_files a helper function

febb250

[DEV-12528] Update tests to remove Hadoop copy merge

50ba76b

Merge branch 'qat' into ftr/dev-12528-spark-download-zipping

136313a

sethstoudenmier previously approved these changes Nov 4, 2025

usaspending_api/common/etl/spark.py (outdated, resolved)

usaspending_api/common/helpers/s3_helpers.py (resolved)

github-actions bot assigned sethstoudenmier Nov 4, 2025

aguest-kc added 2 commits November 5, 2025 09:45

Merge branch 'qat' into ftr/dev-12528-spark-download-zipping

c508221

[DEV-12528] Remove unused hadoop_copy_merge function

c0c95b8

aguest-kc dismissed sethstoudenmier’s stale review via c0c95b8 November 5, 2025 15:48

Merge branch 'qat' into ftr/dev-12528-spark-download-zipping

b5e23fb

sethstoudenmier approved these changes Nov 5, 2025

aguest-kc merged commit 495bd5e into qat Nov 5, 2025
36 of 37 checks passed

aguest-kc deleted the ftr/dev-12528-spark-download-zipping branch December 9, 2025 14:51
