Feature/alto mortgage performance process #8980
Conversation
Alto poses some challenges to the mortgage-performance routine:

- Local operations can no longer post files to public s3 buckets.
- Staging operations can't post directly to prod buckets.
- Processing could be run in production, where exports would work, but that exposes the mortgage pages to at least 15 minutes of interruptions as tables are wiped and rebuilt and exports are generated.

To avoid that, we run processing in staging, where the results can be verified by stakeholders before promoting the changes to prod. The most efficient way to update the prod db is to load a table dump, which takes a few seconds and saves having to re-process source data.

This patch offers options for test-processing new data locally and dumping the public CSV files locally so they can be checked or manually pushed to s3 if needed. It also adds configuration so that the process will run in Alto staging, where table dumps and public CSVs can be generated without affecting production. Once a new data load is approved, we can push the table dump and public CSVs to prod s3 for the final step.

Additions

Two env vars are added to facilitate downloading and reading the source CSV, which lives in an internal GitHub repo. The new vars are:

- MORTGAGE_PERFORMANCE_SOURCE, which allows a management command to fetch source data from the internal repo without exposing its URL. This is now a required var for mortgage processing and has been added to the common secret vars helm produces to deploy cfgov.
- LOCAL_MORTGAGE_FILEPATH, an optional var that will trigger a local download of public CSVs to a path, instead of to s3. If the var is missing, an s3 push will be attempted. If an s3 push fails, it is skipped and noted in the logs, rather than disrupting the run.

The patch also adds a function that derives the through-date from the source file's name (a sketch follows below). These changes remove the need to figure out what the through_date should be and avoid the churn of moving the source file to s3 and then accessing it again for processing.

Internal GHE documentation at mortgage-performance/wiki/Pipeline-steps has been updated.
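For illustration, here is a minimal sketch of deriving a through-date from the source file's name. The actual filename convention and return value aren't shown in this PR, so the `YYYY-MM` stamp and the function name are assumptions:

```python
import re
from datetime import date

# Hypothetical pattern: assumes the source CSV name ends in a year-month
# stamp, e.g. "delinquency_county_2024-09.csv". The real naming convention
# may differ.
FILENAME_DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})\.csv$")


def through_date_from_filename(filename):
    """Derive the data's through-date from the source file's name.

    Raises ValueError if the name doesn't match the expected pattern,
    so a badly named source file fails fast instead of producing
    mislabeled output data.
    """
    match = FILENAME_DATE_PATTERN.search(filename)
    if not match:
        raise ValueError(f"Can't parse a through-date from {filename!r}")
    year, month = (int(group) for group in match.groups())
    # Whether the through-date should be the first or last day of the
    # month is also an assumption here.
    return date(year, month, 1)
```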
If processing proves this fast in Alto environments, we could switch to processing the files in prod as a final step, instead of the table dump-and-load across environments. Or that could be our routine method, with table dumps left in place as backups.
@higs4281 by "local" do you mean local PG17 running on your laptop, or running as a container? When I run with the DB in the container, the aggregation step takes significant time, more than 15 minutes on its own. (I'm also on an Intel-based Mac, if that impacts things.)
I ran it in a container. Maybe my laptop's M3 Pro chip is the difference.
I ran processing in a local py3.13.6 env with a local PG18.1 db, and it took 13:46.
@chosak I look forward to your ideas on live aggregations. But for now, this PR establishes an improved and (I hope) functional processing routine that updates MSA handling.
FYI, I had to add an apt update to stop CI tests from randomly failing during package installs. ref: https://github.com/orgs/community/discussions/158504#discussioncomment-13054281
chosak
left a comment
* Simplify MPT through date logic

  We don't need or use the "through_date" mortgage data constant outside of pipeline processing; we determine it from the input filename and persist it in the output data, so there's no need to persist it as its own data constant. This commit also simplifies the through date computation a bit by relying on regex to enforce input format.

* Optimize datetime parsing

* Simplify process_mortgage_data usage

  Pass a single argument, the URL of the source file, instead of using an environment variable to set the repo and a separate argument for the filename. Also deprecate the ability to pass export-csvs-only; we can just call the export_public_csvs script directly instead.

* Optimize CountyMortgageData object creation

  Create objects in batches, and cache county primary keys to avoid doing a database lookup per row (a rough sketch follows after this list).

* MPT loading optimizations

  Optimize the loop that creates CountyMortgageData objects, and simplify the logic used to pull down the source file.

* Remove unnecessary MORTGAGE_PERFORMANCE_SOURCE

* Fix coverage comparison on non-main PRs
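A rough sketch of the batched-creation pattern described above, assuming Django ORM usage; the import path, batch size, and field names (`fips`, `date`, `total`, `current`) are illustrative, not the repo's actual schema:

```python
# Assumed import path and model fields, for illustration only.
from data_research.models import County, CountyMortgageData

BATCH_SIZE = 5000  # assumed batch size


def load_rows(rows):
    """Create CountyMortgageData records in batches.

    County primary keys are cached up front so each CSV row doesn't
    trigger its own database lookup.
    """
    # One query for all county PKs instead of one lookup per row.
    county_pk_by_fips = dict(County.objects.values_list("fips", "pk"))

    batch = []
    for row in rows:  # rows: dicts parsed from the source CSV
        batch.append(
            CountyMortgageData(
                county_id=county_pk_by_fips[row["fips"]],
                date=row["date"],
                total=row["total"],
                current=row["current"],
            )
        )
        if len(batch) >= BATCH_SIZE:
            CountyMortgageData.objects.bulk_create(batch)
            batch = []
    if batch:
        CountyMortgageData.objects.bulk_create(batch)
```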
chosak
left a comment
🚀 🚀 🚀 🚀 🚀
This adjusts the mortgage-data update routines to match the needs of Jenkins automations in Alto.
One goal was to make this repo the source of as much of the code as possible for processing, for moving files between environments, and for updating s3 assets and databases.
Since we lost dev environments, the process now is to run an update in Alto staging, get approval there, then deploy the data to prod.
Additions
- A MORTGAGE_PERFORMANCE_SOURCE env var to deliver a GHE repo URL for fetching new data (ditched; see the commit list above)

Testing
Best run locally in a container:
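The exact command isn't captured in this excerpt. Based on the commit notes above (process_mortgage_data now takes the source-file URL as its single argument), an invocation might look like the following; the compose service name and URL are placeholders, and the real source URL is deliberately kept out of the repo:

```sh
# Hypothetical invocation; service name, manage.py path, and URL are placeholders.
docker-compose exec python ./cfgov/manage.py process_mortgage_data \
    "https://<internal-ghe-host>/<org>/mortgage-performance/raw/main/<source-file>.csv"
```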
Notes
Still working on test coverage (update: data_research coverage is now 100%).