Bug: remove_all_old_versions takes ~6 hours before first deletion in production #5686

@elipe17

Thank you for taking the time to let us know about the issue you found. The basic rule for bug reporting is that something isn't working the way one would expect it to work. Please provide us with the information requested below and we will look at it as soon as we are able.

Bug Report

Description

The remove_all_old_versions Celery task in tdrs-backend/tdpservice/scheduling/datafile_retention_tasks.py has severe performance issues in production. The time between the initial log message ("Beginning deletion of old datafile versions.") and the very first actual deletion of records is approaching 6 hours. This is because the function uses itertools.product to generate the full Cartesian product of all years, quarters, program types, sections, and STTs — then queries the database individually for each combination.

With current values this produces roughly 50,000 individual DB queries (8 years × 4 quarters × 4 program types × 7 sections × ~56 STTs ≈ 50,176 combinations), the vast majority of which return zero results and do no useful work. Each query still incurs network round-trip and DB overhead, which compounds into hours of wasted time before any deletions occur.
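The scale of the problem can be reproduced with a few lines of plain Python. The ranges below are illustrative stand-ins for the values the task derives from settings and models (the labels are hypothetical, only the counts match the report):

```python
import itertools

# Illustrative ranges only -- the real task builds these from settings/models.
years = range(2017, 2025)                          # 8 years
quarters = ["Q1", "Q2", "Q3", "Q4"]                # 4 quarters
program_types = ["TAN", "SSP", "TRB", "FRA"]       # 4 program types (hypothetical labels)
sections = range(7)                                # 7 sections
stts = range(56)                                   # ~56 STTs

# One DB query is issued per combination, whether or not any files exist.
combos = list(itertools.product(years, quarters, program_types, sections, stts))
print(len(combos))  # 50176
```

At even a modest ~0.4 s per round-trip, 50,176 queries is over 5.5 hours before the loop reaches the first combination that actually has files to delete.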

Action Taken

Observed production logs for the scheduled remove_all_old_versions task. The "Beginning deletion of old datafile versions." log entry appears, followed by ~6 hours of empty query iterations before the first delete_records call is made.

What I expected to see

The task should begin deleting old file versions within seconds or minutes of starting — not hours.

What I did see

Approximately 6 hours of elapsed time between the start log message and the first actual record deletion, caused by tens of thousands of unnecessary database queries for non-existent year/quarter/program_type/section/stt combinations.

Other Helpful Information

  • Affected file: tdrs-backend/tdpservice/scheduling/datafile_retention_tasks.py, function remove_all_old_versions()
  • Root cause: The Cartesian product loop (line 52) generates ~50,000+ combinations. For each one, a DataFile.objects.filter(...) query is issued (line 56-62). Most combinations have no matching files, but the DB round-trip cost accumulates to hours.
  • Environment: Production
  • Is the issue repeatable?: Yes — every scheduled execution
  • Potential fix direction: Instead of iterating every possible combination, query for distinct (year, quarter, program_type, section, stt) tuples that actually exist in the database, then only process those. This would reduce the query count from ~50,000 to only the number of unique file groupings that exist (likely a few hundred at most).
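A minimal sketch of the fix direction above, in plain Python so it runs standalone. The Django ORM analogue would be something like `DataFile.objects.values("year", "quarter", "program_type", "section", "stt").distinct()` (field names assumed from the bullet above, not verified against the model); the sample data here is invented purely to show how the iteration space collapses:

```python
# Hypothetical existing file rows as (year, quarter, program_type, section, stt)
# tuples -- only a handful of groupings actually exist in the database.
existing_files = [
    (2023, "Q1", "TAN", 1, 6),
    (2023, "Q1", "TAN", 1, 6),   # an older version of the same grouping
    (2023, "Q2", "TAN", 1, 6),
    (2024, "Q1", "SSP", 2, 48),
]

# Instead of iterating every possible combination (~50,000 queries), ask the
# DB once for the distinct groupings that exist and process only those.
distinct_groupings = sorted(set(existing_files))
print(len(distinct_groupings))  # 3
```

With this approach the per-grouping deletion queries are issued only for groupings that are guaranteed to have at least one file, so the first deletion happens almost immediately after the task starts.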

Testing

Usability Testing Criteria

PM/UX/OFA Only: Create a list of expected user behaviors that should be confirmed when UX and/or PM is testing this ticket

  • The remove_all_old_versions task begins deleting records within minutes of starting (not hours)
  • All old file versions are still correctly identified and deleted
  • No regressions in the cleanup behavior — the newest version of each file grouping is still preserved
