Bug Report
Description
The `remove_all_old_versions` Celery task in `tdrs-backend/tdpservice/scheduling/datafile_retention_tasks.py` has severe performance issues in production. The time between the initial log message ("Beginning deletion of old datafile versions.") and the first actual deletion of records approaches 6 hours. This is because the function uses `itertools.product` to generate the full Cartesian product of all years, quarters, program types, sections, and STTs, then queries the database individually for each combination.
With current values this produces approximately 50,000+ individual DB queries (8 years x 4 quarters x 4 program types x 7 sections x ~56 STTs), the vast majority of which return zero results and do no useful work. Each empty query still incurs network round-trip and database overhead, which compounds into hours of wasted time before any deletions occur.
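For scale, the combinatorics above can be reproduced directly. The dimension values below are placeholders chosen to match the counts stated in this report; the real sets live in the task's own constants:

```python
import itertools

# Placeholder dimensions matching the counts in this report (not the
# task's actual constants).
years = range(2017, 2025)                        # 8 fiscal years
quarters = ("Q1", "Q2", "Q3", "Q4")              # 4 quarters
program_types = ("PT1", "PT2", "PT3", "PT4")     # 4 program types (names assumed)
sections = range(1, 8)                           # 7 sections
stts = range(56)                                 # ~56 STTs

# The task issues one DB query per element of this product.
combos = list(itertools.product(years, quarters, program_types, sections, stts))
print(len(combos))  # 8 * 4 * 4 * 7 * 56 = 50176 speculative queries
```

Even at a modest 0.4 s of round-trip and query overhead per combination, 50,176 queries is roughly 5.5 hours, which matches the observed delay.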
Action Taken
Observed production logs for the scheduled `remove_all_old_versions` task. The "Beginning deletion of old datafile versions." log entry appears, followed by ~6 hours of empty query iterations before the first `delete_records` call is made.
What I expected to see
The task should begin deleting old file versions within seconds or minutes of starting — not hours.
What I did see
Approximately 6 hours of elapsed time between the start log message and the first actual record deletion, caused by tens of thousands of unnecessary database queries for non-existent year/quarter/program_type/section/stt combinations.
Other Helpful Information
- Affected file: `tdrs-backend/tdpservice/scheduling/datafile_retention_tasks.py`, function `remove_all_old_versions()`
- Root cause: The Cartesian product loop (line 52) generates ~50,000+ combinations. For each one, a `DataFile.objects.filter(...)` query is issued (lines 56-62). Most combinations have no matching files, but the DB round-trip cost accumulates to hours.
- Environment: Production
- Is the issue repeatable?: Yes — every scheduled execution
- Potential fix direction: Instead of iterating every possible combination, query for the distinct `(year, quarter, program_type, section, stt)` tuples that actually exist in the database, then process only those. This would reduce the query count from ~50,000 to the number of unique file groupings that exist (likely a few hundred at most).
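A minimal sketch of that direction, using plain-Python stand-ins for the ORM calls. In Django the distinct-tuple step would be roughly `DataFile.objects.values_list("year", "quarter", "program_type", "section", "stt").distinct()`; the field names here are assumed from the query described above, and the sample rows are hypothetical:

```python
# Hypothetical stand-in for rows in the DataFile table.
rows = [
    {"year": 2023, "quarter": "Q1", "program_type": "PT1", "section": 1, "stt": 6},
    {"year": 2023, "quarter": "Q1", "program_type": "PT1", "section": 1, "stt": 6},
    {"year": 2024, "quarter": "Q3", "program_type": "PT2", "section": 2, "stt": 48},
]

keys = ("year", "quarter", "program_type", "section", "stt")

# One cheap pass (one DISTINCT query in the ORM) yields only the
# groupings that actually have files...
existing = {tuple(row[k] for k in keys) for row in rows}

# ...so the deletion loop runs once per real grouping, not once per
# possible combination: a few hundred queries instead of ~50,000.
for grouping in sorted(existing, key=str):
    print(grouping)  # the per-grouping cleanup would run here
```

The design choice is simply to let the database enumerate the populated groupings rather than guessing them in Python; every query the loop then issues is guaranteed to have matching rows.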
Testing
Usability Testing Criteria
PM/UX/OFA Only: Create a list of expected user behaviors that should be confirmed when UX and/or PM is testing this ticket
- The `remove_all_old_versions` task begins deleting records within minutes of starting (not hours)
- All old file versions are still correctly identified and deleted
- No regressions in the cleanup behavior: the newest version of each file grouping is still preserved