Bug: remove_all_old_versions takes ~6 hours before first deletion in production #5686

@elipe17

Thank you for taking the time to let us know about the issue you found. The basic rule for bug reporting is that something isn't working the way one would expect it to work. Please provide us with the information requested below and we will look at it as soon as we are able.

Bug Report

Description

The remove_all_old_versions Celery task in tdrs-backend/tdpservice/scheduling/datafile_retention_tasks.py has severe performance issues in production. The time between the initial log message ("Beginning deletion of old datafile versions.") and the very first actual deletion of records is approaching 6 hours. This is because the function uses itertools.product to generate the full Cartesian product of all years, quarters, program types, sections, and STTs — then queries the database individually for each combination.

With current values this produces roughly 50,000 individual DB queries (8 years × 4 quarters × 4 program types × 7 sections × ~56 STTs ≈ 50,176 combinations), the vast majority of which return zero results and do no useful work. Each query still incurs network round-trip and DB overhead, which compounds into hours of wasted time before any deletions occur.
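The scale of the problem can be reproduced with a few lines of plain Python. The ranges below are illustrative stand-ins for the values the task derives from settings and models (the labels are hypothetical, only the counts match the report):

```python
import itertools

# Illustrative ranges only -- the real task builds these from settings/models.
years = range(2017, 2025)                          # 8 years
quarters = ["Q1", "Q2", "Q3", "Q4"]                # 4 quarters
program_types = ["TAN", "SSP", "TRB", "FRA"]       # 4 program types (hypothetical labels)
sections = range(7)                                # 7 sections
stts = range(56)                                   # ~56 STTs

# One DB query is issued per combination, whether or not any files exist.
combos = list(itertools.product(years, quarters, program_types, sections, stts))
print(len(combos))  # 50176
```

At even a modest ~0.4 s per round-trip, 50,176 queries is over 5.5 hours before the loop reaches the first combination that actually has files to delete.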

Action Taken

Observed production logs for the scheduled remove_all_old_versions task. The "Beginning deletion of old datafile versions." log entry appears, followed by ~6 hours of empty query iterations before the first delete_records call is made.

What I expected to see

The task should begin deleting old file versions within seconds or minutes of starting — not hours.

What I did see

Approximately 6 hours of elapsed time between the start log message and the first actual record deletion, caused by tens of thousands of unnecessary database queries for non-existent year/quarter/program_type/section/stt combinations.

Other Helpful Information

  • Affected file: tdrs-backend/tdpservice/scheduling/datafile_retention_tasks.py, function remove_all_old_versions()
  • Root cause: The Cartesian product loop (line 52) generates ~50,000+ combinations. For each one, a DataFile.objects.filter(...) query is issued (line 56-62). Most combinations have no matching files, but the DB round-trip cost accumulates to hours.
  • Environment: Production
  • Is the issue repeatable?: Yes — every scheduled execution
  • Potential fix direction: Instead of iterating every possible combination, query for distinct (year, quarter, program_type, section, stt) tuples that actually exist in the database, then only process those. This would reduce the query count from ~50,000 to only the number of unique file groupings that exist (likely a few hundred at most).
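A minimal sketch of the fix direction above, in plain Python so it runs standalone. The Django ORM analogue would be something like `DataFile.objects.values("year", "quarter", "program_type", "section", "stt").distinct()` (field names assumed from the bullet above, not verified against the model); the sample data here is invented purely to show how the iteration space collapses:

```python
# Hypothetical existing file rows as (year, quarter, program_type, section, stt)
# tuples -- only a handful of groupings actually exist in the database.
existing_files = [
    (2023, "Q1", "TAN", 1, 6),
    (2023, "Q1", "TAN", 1, 6),   # an older version of the same grouping
    (2023, "Q2", "TAN", 1, 6),
    (2024, "Q1", "SSP", 2, 48),
]

# Instead of iterating every possible combination (~50,000 queries), ask the
# DB once for the distinct groupings that exist and process only those.
distinct_groupings = sorted(set(existing_files))
print(len(distinct_groupings))  # 3
```

With this approach the per-grouping deletion queries are issued only for groupings that are guaranteed to have at least one file, so the first deletion happens almost immediately after the task starts.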

Testing

Usability Testing Criteria

PM/UX/OFA Only: Create a list of expected user behaviors that should be confirmed when UX and/or PM is testing this ticket

  • The remove_all_old_versions task begins deleting records within minutes of starting (not hours)
  • All old file versions are still correctly identified and deleted
  • No regressions in the cleanup behavior — the newest version of each file grouping is still preserved
