Feature/alto mortgage performance process #8980
Conversation
Alto poses some challenges to the mortgage-performance routine:

- Local operations can no longer post files to public s3 buckets.
- Staging operations can't post directly to prod buckets.
- Processing could be run in production, where exports would work, but that exposes the mortgage pages to at least 15 minutes of interruptions as tables are wiped and rebuilt and exports are generated.

To avoid that, we run processing in staging, where the results can be verified by stakeholders before promoting the changes to prod. The most efficient way to update the prod db is to load a table dump, which takes a few seconds and saves having to re-process source data.

This patch offers options for test-processing new data locally and dumping the public CSV files locally so they can be checked or manually pushed to s3 if needed. It also adds configuration so that the process will run in Alto staging, where table dumps and public CSVs can be generated without affecting production. Once a new data load is approved, we can push the table dump and public CSVs to prod s3 for the final step.

Additions

Two env vars are added to facilitate downloading and reading the source CSV, which lives in an internal GitHub repo. The new vars are:

- MORTGAGE_PERFORMANCE_SOURCE, which allows a management command to fetch source data from the internal repo without exposing its URL. This is now a required var for mortgage processing and has been added to the common secret vars helm produces to deploy cfgov.
- LOCAL_MORTGAGE_FILEPATH, an optional var that will trigger a local download of public CSVs to a path, instead of to s3. If the var is missing, an s3 push will be attempted. If an s3 push fails, it is skipped and noted in the logs, rather than disrupting the run.

The patch also adds a function that derives the through-date from the source file's name (a sketch follows below). These changes remove the need to figure out what the through_date should be and avoid the churn of moving the source file to s3 and then accessing it again for processing.

Internal GHE documentation at mortgage-performance/wiki/Pipeline-steps has been updated.
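For illustration, here is a minimal sketch of deriving a through-date from the source file's name. The actual filename convention and return value aren't shown in this PR, so the `YYYY-MM` stamp and the function name are assumptions:

```python
import re
from datetime import date

# Hypothetical pattern: assumes the source CSV name ends in a year-month
# stamp, e.g. "delinquency_county_2024-09.csv". The real naming convention
# may differ.
FILENAME_DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})\.csv$")


def through_date_from_filename(filename):
    """Derive the data's through-date from the source file's name.

    Raises ValueError if the name doesn't match the expected pattern,
    so a badly named source file fails fast instead of producing
    mislabeled output data.
    """
    match = FILENAME_DATE_PATTERN.search(filename)
    if not match:
        raise ValueError(f"Can't parse a through-date from {filename!r}")
    year, month = (int(group) for group in match.groups())
    # Whether the through-date should be the first or last day of the
    # month is also an assumption here.
    return date(year, month, 1)
```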
If processing proves this fast in Alto environments, we could switch to processing the files in prod as a final step, instead of the table dump-and-load across environments. Or that could be our routine method, with table dumps left in place as backups.
@higs4281 by "local" do you mean local PG17 running on your laptop, or running as a container? When I run with the DB in the container, the aggregation step takes significant time, more than 15 minutes on its own. (I'm also on an Intel-based Mac, if that impacts things.)
I ran it in a container. Maybe my laptop's M3 Pro chip is the difference.
I ran processing in a local py3.13.6 env with a local PG18.1 db, and it took 13:46.
@chosak I look forward to your ideas on live aggregations. But for now, this PR establishes an improved and (I hope) functional processing routine that updates MSA handling.
FYI, I had to add an apt update to stop CI tests from randomly failing during package installs. ref: https://github.com/orgs/community/discussions/158504#discussioncomment-13054281
chosak
left a comment
* Simplify MPT through date logic

  We don't need or use the "through_date" mortgage data constant outside of pipeline processing; we determine it from the input filename and persist it in the output data, so there's no need to persist it as its own data constant. This commit also simplifies the through date computation a bit by relying on regex to enforce input format.

* Optimize datetime parsing

* Simplify process_mortgage_data usage

  Pass a single argument, the URL of the source file, instead of using an environment variable to set the repo and a separate argument for the filename. Also deprecate the ability to pass export-csvs-only; we can just call the export_public_csvs script directly instead.

* Optimize CountyMortgageData object creation

  Create objects in batches, and cache county primary keys to avoid doing a database lookup per row (a rough sketch follows after this list).

* MPT loading optimizations

  Optimize the loop that creates CountyMortgageData objects, and simplify the logic used to pull down the source file.

* Remove unnecessary MORTGAGE_PERFORMANCE_SOURCE

* Fix coverage comparison on non-main PRs
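A rough sketch of the batched-creation pattern described above, assuming Django ORM usage; the import path, batch size, and field names (`fips`, `date`, `total`, `current`) are illustrative, not the repo's actual schema:

```python
# Assumed import path and model fields, for illustration only.
from data_research.models import County, CountyMortgageData

BATCH_SIZE = 5000  # assumed batch size


def load_rows(rows):
    """Create CountyMortgageData records in batches.

    County primary keys are cached up front so each CSV row doesn't
    trigger its own database lookup.
    """
    # One query for all county PKs instead of one lookup per row.
    county_pk_by_fips = dict(County.objects.values_list("fips", "pk"))

    batch = []
    for row in rows:  # rows: dicts parsed from the source CSV
        batch.append(
            CountyMortgageData(
                county_id=county_pk_by_fips[row["fips"]],
                date=row["date"],
                total=row["total"],
                current=row["current"],
            )
        )
        if len(batch) >= BATCH_SIZE:
            CountyMortgageData.objects.bulk_create(batch)
            batch = []
    if batch:
        CountyMortgageData.objects.bulk_create(batch)
```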
chosak
left a comment
🚀 🚀 🚀 🚀 🚀
This adjusts the mortgage-data update routines to match the needs of Jenkins automations in Alto.
One goal was to make this repo the source of as much of the code as possible for processing, for moving files between environments, and for updating s3 assets and databases.
Since we lost dev environments, the process now is to run an update in Alto staging, get approval there, then deploy the data to prod.
Additions
- A MORTGAGE_PERFORMANCE_SOURCE env var to deliver a GHE repo URL for fetching new data (ditched; see the commit list above)

Testing
Best run locally in a container:
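The exact command isn't captured in this excerpt. Based on the commit notes above (process_mortgage_data now takes the source-file URL as its single argument), an invocation might look like the following; the compose service name and URL are placeholders, and the real source URL is deliberately kept out of the repo:

```sh
# Hypothetical invocation; service name, manage.py path, and URL are placeholders.
docker-compose exec python ./cfgov/manage.py process_mortgage_data \
    "https://<internal-ghe-host>/<org>/mortgage-performance/raw/main/<source-file>.csv"
```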
Notes
Still working on test coverage (update: data_research coverage is now 100%).