
Conversation


@higs4281 commented Dec 30, 2025

This adjusts the mortgage-data update routines to match the needs of Jenkins automations in Alto.

One goal was to make this repo the source of as much of the code as possible for processing data, moving files between environments, and updating s3 assets and databases.

Since we lost dev environments, the process now is to run an update in Alto staging, get approval there, then deploy the data to prod.

Additions

  • A thrudate script that validates the date values derived from a source file and creates a new through_date (sketched just after this list)
  • An SQL file that truncates the 10 mortgage-data tables before loading new data.
  • An ALTO_ENV var that should be available during Jenkins processing, but will default to prodpub.
  • A MORTGAGE_PERFORMANCE_SOURCE env var to deliver a GHE repo URL for fetching new data (ditched during review; see the commit notes below)
  • MSA data had become woefully out of date, so I updated the crosswalk and added a step to load MSAs before every processing run, as @wpears had done for states and counties.
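
For reference, the through-date derivation boils down to pulling a date out of the source filename and sanity-checking it. Here is a minimal sketch, assuming a hypothetical filename pattern; the real pattern and validation rules live in the thrudate script itself:

```python
import datetime
import re

# Hypothetical pattern; the real NMDB source filenames may be formatted differently.
FILENAME_DATE = re.compile(r"(\d{4})[-_]?(\d{2})[-_]?(\d{2})\.csv$")


def derive_through_date(filename):
    """Derive a through_date from a source filename, or raise ValueError."""
    match = FILENAME_DATE.search(filename)
    if not match:
        raise ValueError(f"No date found in {filename!r}")
    year, month, day = (int(part) for part in match.groups())
    through_date = datetime.date(year, month, day)  # rejects impossible dates
    if through_date > datetime.date.today():
        raise ValueError(f"Through date {through_date} is in the future")
    return through_date
```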

Testing

Best run locally in a container:

  • `export MORTGAGE_PERFORMANCE_SOURCE=https://raw.GHE_REPO/refs/heads/main/data_files/nmdb`
  • `./cfgov/manage.py runscript process_mortgage_data --script-args`

Notes

  • This patch updates the `responses` testing package to 0.25.8, since we're now on Python 3.13 and 3.13 support was officially added in 0.25.7.
  • Still working on test coverage [data_research coverage is now 100%].
  • Don't know whether Postgres 17 or my M3 Pro chip is stepping up, but local processing time drops from about 15 minutes to 5 when run from a container.

Alto poses some challenges to the mortgage-performance routine:
- Local operations can no longer post files to public s3 buckets
- Staging operations can't post directly to prod buckets
- Processing could be run in production, where exports would work. But
that exposes the mortgage pages to at least 15 minutes of interruptions
as tables are wiped and rebuilt and exports are generated.

To avoid that, we run processing in staging, where the results
can be verified by stakeholders before promoting the changes to prod.
The most efficient way to update the prod db is to load a table dump,
which takes a few seconds and saves having to re-process source data.
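
For illustration, the dump-and-load step could be as simple as the following sketch (not the actual pipeline commands; the data_research_* table pattern and the cfgov database name are assumptions here, and standard PG* connection vars are presumed to be set):

```python
import subprocess

DUMP_FILE = "mortgage_tables.dump"

# In staging: dump only the mortgage data tables, data only.
subprocess.run(
    ["pg_dump", "--format=custom", "--data-only",
     "--table=data_research_*", "--file", DUMP_FILE, "cfgov"],
    check=True,
)

# In prod, after the same tables have been truncated: restore the data.
subprocess.run(
    ["pg_restore", "--data-only", "--dbname", "cfgov", DUMP_FILE],
    check=True,
)
```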

This patch offers options for test-processing new data locally and
dumping the public CSV files locally so they can be checked
or manually pushed to s3 if needed.

It also adds configuration so that the process will run in Alto staging,
where table dumps and public CSVs can be generated without affecting
production. Once a new data load is approved, we can push the table
dump and public CSVs to prod s3 for the final step.
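
If the public CSVs ever need to be pushed to s3 by hand, it is a short boto3 loop; a sketch, with the bucket name, key prefix, and local directory as placeholders rather than the real values:

```python
from pathlib import Path

import boto3

# Placeholders; the real bucket and key prefix come from cfgov settings.
BUCKET = "example-public-bucket"
PREFIX = "data/mortgage-performance/downloads"

s3 = boto3.client("s3")
for csv_path in Path("mortgage_csvs").glob("*.csv"):
    s3.upload_file(
        str(csv_path),
        BUCKET,
        f"{PREFIX}/{csv_path.name}",
        ExtraArgs={"ContentType": "text/csv"},
    )
```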

Additions
Two env vars and a helper function are added to facilitate downloading and
reading the source CSV, which lives in an internal GitHub repo. The new pieces are:
- MORTGAGE_PERFORMANCE_SOURCE, which allows a management command to
fetch source data from the internal repo without exposing its URL.
This is now a required var for mortgage processing and has been added
to the common secret vars helm produces to deploy cfgov.
- LOCAL_MORTGAGE_FILEPATH, an optional var that will trigger a
local download of public CSVs to a path, instead of to s3.
If the var is missing, an s3 push will be attempted. If an s3 push fails,
it is skipped and noted in the logs, rather than disrupting the run.
- A function that derives the through-date from the source file's name.

These changes remove the need to figure out what the through_date should be,
and avoid the churn of moving the source file to s3 and then accessing
it again for processing.
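
The export fallback described above amounts to a small branch plus a guarded s3 push. A sketch, using illustrative names rather than the actual helpers in the management command:

```python
import logging
import os
from pathlib import Path

import boto3
from botocore.exceptions import BotoCoreError, ClientError

logger = logging.getLogger(__name__)


def deliver_public_csv(filename, content, bucket="example-public-bucket"):
    """Write a public CSV locally if LOCAL_MORTGAGE_FILEPATH is set, else push to s3."""
    local_dir = os.getenv("LOCAL_MORTGAGE_FILEPATH")
    if local_dir:
        Path(local_dir, filename).write_text(content)
        return
    try:
        boto3.client("s3").put_object(
            Bucket=bucket, Key=filename, Body=content.encode("utf-8")
        )
    except (BotoCoreError, ClientError):
        # As described above: note the failure and keep going
        # rather than disrupting the run.
        logger.exception("s3 push failed for %s; skipping", filename)
```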

Internal GHE documentation at mortgage-performance/wiki/Pipeline-steps
has been updated.
@higs4281 changed the title from "Feature/alto mortgage performance script" to "Feature/alto mortgage performance process" on Jan 2, 2026

higs4281 commented Jan 2, 2026

If processing proves this fast in Alto environments, we could switch to processing the files in prod as a final step, instead of the table dump-and-load across environments. Or that could be our routine method, with table dumps left in place as backups.


chosak commented Jan 2, 2026

> Don't know if Postgres17 is playing a role, but local processing time drops from about 15 min to 5.

@higs4281 by "local" do you mean local PG17 running on your laptop, or running as a container? When I run with the DB in the container, the aggregation step takes significant time, more than 15 minutes on its own. (I'm also on an Intel-based Mac, if that impacts things.)


higs4281 commented Jan 5, 2026

> @higs4281 by "local" do you mean local PG17 running on your laptop, or running as a container? When I run with the DB in the container, the aggregation step takes significant time, more than 15 minutes on its own. (I'm also on an Intel-based Mac, if that impacts things.)

I ran it in a container. Maybe my laptop's M3 Pro chip is the difference.
We should time it in staging.


higs4281 commented Jan 5, 2026

I ran processing in a local py3.13.6 env with a local PG18.1 db, and it took 13:46.


higs4281 commented Jan 9, 2026

@chosak I look forward to your ideas on live aggregations. But for now, this PR establishes an improved and (I hope) functional processing routine that updates MSA handling.


higs4281 commented Jan 9, 2026

FYI, had to add an `apt update` to stop CI tests from randomly failing during package installs.

ref: https://github.com/orgs/community/discussions/158504#discussioncomment-13054281

@higs4281 requested a review from wpears January 9, 2026 14:57

@chosak left a comment


@higs4281 this looks great, thanks for the thoroughness and for your patience with my review. My suggestions are in #8986.

chosak and others added 2 commits January 15, 2026 11:25
* Simplify MPT through date logic

We don't need or use the "through_date" mortgage data constant outside
of pipeline processing; we determine it from the input filename, and
persist it in the output data. So there's no need to persist it as its
own data constant.

This commit also simplifies the through date computation a bit by
relying on regex to enforce input format.

* Optimize datetime parsing

* Simplify process_mortgage_data usage

Pass a single argument, the URL of the source file, instead of using an
environment variable to set the repo and a separate argument for the
filename.

Also deprecate the ability to pass export-csvs-only; we can just call
the export_public_csvs script directly instead of doing this.

* Optimize CountyMortgageData object creation

Create objects in batches, and cache county primary keys to avoid doing
a database lookup per row (a sketch follows this commit list).

* MPT loading optimizations

Optimize loop that creates CountyMortgageData objects, and simplify the
logic used to pull down the source file.

* Remove unnecessary MORTGAGE_PERFORMANCE_SOURCE

* Fix coverage comparison on non-main PRs
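
Roughly, the batching-and-cache optimization from the CountyMortgageData commit above looks like this (a sketch only; the model fields and the County lookup are assumptions, not the actual code):

```python
from data_research.models import County, CountyMortgageData


def load_county_rows(rows, batch_size=5000):
    """Bulk-create CountyMortgageData records from parsed source rows."""
    # One query up front instead of a county lookup per row.
    county_pk_by_fips = dict(County.objects.values_list("fips", "pk"))

    batch = []
    for row in rows:
        batch.append(
            CountyMortgageData(
                county_id=county_pk_by_fips[row["fips"]],
                date=row["date"],
            )
        )
        if len(batch) >= batch_size:
            CountyMortgageData.objects.bulk_create(batch)
            batch = []
    if batch:
        CountyMortgageData.objects.bulk_create(batch)
```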
@higs4281 requested a review from chosak January 20, 2026 14:08

@chosak left a comment


🚀 🚀 🚀 🚀 🚀

@higs4281 added this pull request to the merge queue Jan 20, 2026
Merged via the queue into main with commit 617d1b4 Jan 20, 2026
17 of 18 checks passed
@higs4281 deleted the feature/alto-mortgage-performance-script branch January 20, 2026 14:20
