Skip to content

CCM-10048: e2e test reliability#536

Merged
alexnuttall merged 25 commits intomainfrom
feature/CCM-10048-flakiness
Jun 30, 2025
Merged

CCM-10048: e2e test reliability#536
alexnuttall merged 25 commits intomainfrom
feature/CCM-10048-flakiness

Conversation

@alexnuttall
Copy link
Contributor

@alexnuttall alexnuttall commented Jun 27, 2025

Description

Most of the intermittent failures in the e2e test suite were related to missing events (copies, deletions, status updates) in the file validation pipeline. The failures affected freshly created environments such as branch sandboxes.

However, the relevant event rules were being triggered reliably by Guardduty tagging events. This is why previous work to publish synthetic guard duty events during testing didn't fix overall reliability.

The actual problem was shown up in 'FailedInvocations' metrics on the event rules. These mean that the rule was unable to invoke its target(s). Temporarily adding DLQs to the targets showed that the failed invocations were all permissions problems. (example error message: The security token included in the request is invalid. (Service: AWSLambdaInternal; Status Code: 403; Error Code: UnrecognizedClientException; Request ID: c1c2bd69-2405-4c88-bbca-b1a14b1ad781; Proxy: null)).

It seems like the problem is eventually consistent creation of IAM resources (i.e. the roles which the eventbridge rules used to invoke lambda or SQS). Switching the permissions from being role-based to resource-based (lamdba policies, queue policies) seems to have resolved the timing issue.

Elsewhere, fixed a problem where the sftp-poll lambda was not re-invoked while polling for incoming proof events.

template-mgmt-sftp-send-proof.e2e.spec.ts had a very intermittent problem where an SFTP fetch failed. I didn't get to the bottom of this, but have deleted the test file since it's mostly unnecessarily duplicating unit tests (on manifest and test batch creation). Other e2e tests (template-mgmt-letter-full.e2e.spec.ts) show that we're sending files to the right location. Created https://github.com/NHSDigital/comms-mgr/pull/765 to add some validation to the mock itself to improve coverage provided by other tests

Context

Type of changes

  • Refactoring (non-breaking change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would change existing functionality)
  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I am familiar with the contributing guidelines
  • I have followed the code style of the project
  • I have added tests to cover my changes
  • I have updated the documentation accordingly
  • This PR is a result of pair or mob programming

Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others privacy, we kindly ask you to NOT including PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that do contain any sensitive information. We really appreciate your cooperation in this matter.

  • I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

@alexnuttall alexnuttall requested review from a team as code owners June 27, 2025 14:57
@alexnuttall alexnuttall marked this pull request as draft June 27, 2025 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants