Improve error handling for sdlf-stage-lambda/sdlf-stage-glue #536

cnfait · 2025-05-20T13:29:05Z

Issue #, if available:
would close #514 by providing a clearer indication of failure/partial failure

Description of changes:
Improve error handling for sdlf-stage-lambda and sdlf-stage-glue:

- send processing failures to a SQS dead-letter queue (DLQ)
- visually show processing failures with the post lambda & avoid confusion about where errors happened (due to the parallel state wrapper)

To that end:

- add sqs dead-letter queue url to sdlf-pipeline stack outputs
- align dlq and queue content dedup configuration
- send failures to sqs dlq in the distributed map runs
  - collecting failures and sending them to the dlq in the post-processing lambda is not scalable due to sfn output size limits (and other reasons)
- add a message attribute containing the state machine execution id for easier debugging
- align peh_id and sfn state machine execution name
  - both are uuid4 anyway and this helps debugging and updating the peh dynamodb table
- remove the error lambda, subsumed entirely by the state machine itself and the post lambda
- remove the parallel state wrapper
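
For readers who want a picture of the message shape, here is a minimal Python sketch of two of the points above: one uuid4 shared by the peh_id and the state machine execution name, and a DLQ message carrying the execution id as a message attribute. It is an illustration only, not the code in this PR (the PR sends failures from the distributed map runs, which may well be done in the state machine definition rather than in Python). The function names and the `ExecutionId` attribute name are hypothetical.

```python
import json
import uuid


def new_execution_name() -> str:
    """Return a uuid4 string meant to serve as both the peh_id and the
    Step Functions execution name (assumption: the same value is reused)."""
    return str(uuid.uuid4())


def build_dlq_message(failed_item: dict, execution_id: str) -> dict:
    """Build keyword arguments for an SQS SendMessage call for one failed item.

    The attribute name "ExecutionId" is hypothetical; the PR only states that
    a message attribute contains the state machine execution id.
    """
    return {
        "MessageBody": json.dumps(failed_item),
        "MessageAttributes": {
            "ExecutionId": {
                "DataType": "String",
                "StringValue": execution_id,
            },
        },
    }
```

With boto3 this could be sent as `boto3.client("sqs").send_message(QueueUrl=dlq_url, **build_dlq_message(item, execution_id))`, where `dlq_url` would come from the new sdlf-pipeline stack output. If the DLQ is a FIFO queue, a `MessageGroupId` would also be needed.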

The lambda and glue stages are the only ones updated in this PR. The others work a bit differently, and I plan to align all of them in a future PR. This PR is also not meant to be a perfect solution: I'm keeping the changes and the workload manageable.

A couple example screenshots to show what it may look like/the testing done (I've also rerun the workshop to be sure):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


cnfait added 2 commits May 20, 2025 15:08

[sdlf-stage-lambda] add jitter to processing lambda

9fabf0d
