add ability to set expiration time period for messages in the crawler queue #627

@elrayle

Description

Update the CD crawler so that timeToLive for queue messages is configurable. It currently uses the default of 7 days, which is too short: messages that are not processed within that arbitrary window expire and are lost. This impacts our internal harvester and the missing license backfill process. If this can't be fixed, the DAG will have to reduce the number of packages it sends to the harvester, which will likely slow down processing. Processing currently averages only 125k per day, but reaches closer to 500k on some days. The primary driver of this is the number of files being scanned by scancode. Some thought is needed on how best to keep the process running without missing packages that drop off the queue after expiring.

Rationale

The backfill DAG puts messages on the queue faster than the GH CD harvester can process them. If these messages drop off the queue unprocessed, the packages will appear to be missing their license, which may be incorrect.
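
As a rough illustration of the problem, the sketch below estimates how many messages can sit in the queue before the oldest ones start to expire. It uses the 125k per day average and the 7-day default from the description. It assumes first-in, first-out processing and a single burst of enqueued messages, which is a simplification; the function names are hypothetical.

```python
def max_safe_backlog(throughput_per_day: int, ttl_days: int) -> int:
    """Largest backlog that can be drained before the oldest message expires.

    Assumes FIFO processing at a constant daily throughput.
    """
    return throughput_per_day * ttl_days


def expired_messages(enqueued: int, throughput_per_day: int, ttl_days: int) -> int:
    """Messages from a single burst that would expire before being processed."""
    return max(0, enqueued - max_safe_backlog(throughput_per_day, ttl_days))


# 125k/day average and the 7-day default TTL, both taken from the description.
print(max_safe_backlog(125_000, 7))             # 875000
# Hypothetical burst of 1,000,000 messages from the backfill DAG.
print(expired_messages(1_000_000, 125_000, 7))  # 125000
```

Under these assumptions, any backlog beyond roughly 875k messages is at risk of being dropped at the average rate, so either the TTL has to grow or the DAG has to send fewer packages.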

Definition of Done

  • There is a new config setting for the message expiration, and messages placed in the queue carry the configured expiration.
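
A minimal sketch of what such a setting could look like, written in Python purely for illustration rather than as the crawler's actual implementation. The setting name `CRAWLER_QUEUE_MESSAGE_TTL_SECONDS` and the function `resolve_message_ttl` are hypothetical. The fallback matches the 7-day default mentioned in the description. The resolved value would then be passed to whatever call enqueues the message, assuming that call accepts a per-message time to live.

```python
import os
from typing import Mapping, Optional

# The 7-day default mentioned in the description, expressed in seconds.
DEFAULT_TTL_SECONDS = 7 * 24 * 60 * 60

# Hypothetical setting name, used here only for illustration.
TTL_ENV_VAR = "CRAWLER_QUEUE_MESSAGE_TTL_SECONDS"


def resolve_message_ttl(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the message TTL in seconds, falling back to the default if unset."""
    env = os.environ if env is None else env
    raw = env.get(TTL_ENV_VAR)
    if raw is None or raw.strip() == "":
        return DEFAULT_TTL_SECONDS
    ttl = int(raw)  # raises ValueError for non-numeric input
    if ttl <= 0:
        raise ValueError(f"{TTL_ENV_VAR} must be a positive number of seconds")
    return ttl


# Example: a 30-day TTL supplied through the hypothetical setting.
print(resolve_message_ttl({TTL_ENV_VAR: str(30 * 24 * 60 * 60)}))
```

Rejecting non-positive values is a design choice for this sketch; some queue services use a special value to mean "never expire", and whether to allow that is left open here.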