
Retries for failed alleles #57

@chrisammon3000

Description


Some alleles (at least one) are timing out during the build process when sending requests to the Feature Service API. The simplest approach is to implement retry logic in the main build script for when this occurs; however, there are some potential problems with this approach.

Example

For context, a typical build for ~30,000 alleles takes about 20 minutes (~20-25 alleles per second). Adding retry logic with exponential backoff (delays of 0, 2, 4, and finally 8 seconds) means that one failed allele slows the process down by at most 14 seconds (see the sketch after this list). For multiple failed alleles:

  • 10 failed alleles would slow it down by 2.33 minutes
  • 20 failed alleles would slow it down by 4.67 minutes
  • 30 failed alleles would slow it down by 7 minutes
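
As a sketch, the in-script approach might look like the following. The Feature Service endpoint path and the `build_allele_with_retries` name are hypothetical; the delays match the 0/2/4/8-second schedule above:

```python
import logging
import time
from typing import Optional

import requests

logger = logging.getLogger(__name__)

# Delays from the example above: worst case adds 0 + 2 + 4 + 8 = 14 s per allele.
BACKOFF_DELAYS = [0, 2, 4, 8]

def build_allele_with_retries(allele_id: str, feature_service_url: str) -> Optional[dict]:
    """Request features for one allele, retrying with exponential backoff."""
    for attempt, delay in enumerate(BACKOFF_DELAYS, start=1):
        time.sleep(delay)
        try:
            # Hypothetical endpoint path; the real Feature Service API may differ.
            resp = requests.get(f"{feature_service_url}/features/{allele_id}", timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            logger.warning("Attempt %d failed for allele %s: %s", attempt, allele_id, exc)
    logger.error("Giving up on allele %s after %d attempts", allele_id, len(BACKOFF_DELAYS))
    return None
```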

Considerations

  1. Applying exponential backoff to all 30,000+ alleles could significantly increase build time and cost if many alleles fail (see the example above).
  2. Failed alleles are recorded in logs, and aside from the added time/cost, there is no reason they cannot be retried immediately.
  3. The system should be able to handle a moderate number of failures without losing any data or wasting time/cost on resources.
  4. Currently, the build script has no logic to monitor the number of failures and abort the build if a threshold is reached.

Assumptions

  • Feature Service timeouts are the only cause for allele build errors.
  • Retrying has a high enough probability of succeeding to make it worthwhile.
  • New IMGT/HLA data will be consistent and not cause a high number of errors.

Proposed Solution

Decouple the retry logic from the main script by implementing a queue and consumer(s) for re-processing failed alleles. This would allow any number of alleles to be retried without driving up cost on the high-powered build server during the main build. On AWS this would require, at minimum, an SQS queue with a Lambda function consumer.
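
A minimal sketch of the producer side, assuming boto3 and a dedicated retry queue (the queue URL, `enqueue_failed_allele` name, and message shape are illustrative, not an existing part of the build):

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URL; in practice this would come from build configuration.
FAILED_ALLELE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/failed-alleles"

def enqueue_failed_allele(allele_id: str, build_config: dict) -> None:
    """Send a failed allele to the retry queue instead of retrying in-process.

    The build configuration rides along in the message so the consumer can
    reproduce the exact build parameters (alignments, kir, etc.).
    """
    sqs.send_message(
        QueueUrl=FAILED_ALLELE_QUEUE_URL,
        MessageBody=json.dumps({"allele_id": allele_id, "config": build_config}),
    )
```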

Advantages

  • More control over cost: avoids runaway costs in the event of many failed alleles.
  • More control over the maximum number of retries before declaring failure.
  • Ability to process retries serially or in parallel by adjusting the Lambda function's concurrency.
  • Ability to create an alert if a failure threshold is reached, so that problems with new data can be quickly identified and resolved. The alert can trigger logic to abort the build so that resources aren't wasted and data isn't lost.
  • A consumer Lambda can be used independently of the retry mechanism to build any allele as long as it is given the correct parameters, for example, behind a REST API (see the sketch after this list).
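
A minimal sketch of such a consumer, assuming the queue above is wired to the function through an SQS event source mapping (`build_allele` stands in for the hypothetical single-allele build routine):

```python
import json

def build_allele(allele_id: str, config: dict) -> None:
    """Placeholder for the hypothetical single-allele build routine."""
    ...

def handler(event, context):
    """Entry point for an SQS-triggered Lambda.

    Serial vs. parallel retries are controlled by the function's reserved
    concurrency (a concurrency of 1 processes retries strictly serially).
    """
    for record in event["Records"]:
        message = json.loads(record["body"])
        build_allele(message["allele_id"], message["config"])
```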

Disadvantages

  • Data from successful retries needs to be merged into the data set produced by the main build, or loaded independently.
  • Potential for lost data if retries are not coordinated correctly.
  • The retries need to use the same configuration as the main build or the data will be corrupted (alignments, kir, etc.).

Priorities

  • Complete the main build without losing any data or wasting resources (i.e., time on the build server)

Recommended Next Steps

  • Add an SQS queue for alleles that fail during the main build
  • Add logic to the main build script to send failed alleles to the queue
  • Add a failed-count threshold to the main build script and abort if the threshold is reached (see the sketch after this list)
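
As a sketch of the threshold logic, with the cutoff value and the callable names left as assumptions:

```python
from typing import Callable, Iterable

def run_build(
    alleles: Iterable[str],
    build_allele: Callable[[str], None],
    enqueue_failed_allele: Callable[[str], None],
    failure_threshold: int = 100,  # hypothetical cutoff; tune to expected failure rates
) -> None:
    """Main build loop: queue failures for retry, abort once the threshold is hit."""
    failed = 0
    for allele in alleles:
        try:
            build_allele(allele)
        except Exception:
            failed += 1
            enqueue_failed_allele(allele)
            if failed >= failure_threshold:
                raise RuntimeError(
                    f"Aborting build: {failed} failures reached the threshold"
                )
```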
