chris-elliott-nhsd previously approved these changes (Jun 30, 2025)
ClareJonesBJSS approved these changes (Jun 30, 2025)
sidnhs approved these changes (Jun 30, 2025)
Description
Most of the intermittent failures in the e2e test suite were related to missing events (copies, deletions, status updates) in the file validation pipeline. The failures affected freshly created environments such as branch sandboxes.
However, the relevant event rules were being triggered reliably by GuardDuty tagging events. This is why previous work to publish synthetic GuardDuty events during testing didn't fix overall reliability.
The actual problem showed up in the 'FailedInvocations' metrics on the event rules. A non-zero value means that the rule was unable to invoke its target(s). Temporarily adding DLQs to the targets showed that the failed invocations were all permissions problems. Example error message:
The security token included in the request is invalid. (Service: AWSLambdaInternal; Status Code: 403; Error Code: UnrecognizedClientException; Request ID: c1c2bd69-2405-4c88-bbca-b1a14b1ad781; Proxy: null)
It seems like the problem is eventually consistent creation of IAM resources (i.e. the roles which the EventBridge rules used to invoke Lambda or SQS), so in a freshly created environment a rule can presumably fire before its role is usable. Switching the permissions from being role-based to resource-based (Lambda policies, queue policies) seems to have resolved the timing issue.
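For reviewers less familiar with the distinction, here is a minimal sketch of what resource-based permissions for EventBridge targets look like as policy documents. It is illustrative only: the helper names and ARNs are hypothetical, and it does not reflect how this PR's infrastructure code actually declares the policies. The common shape is a statement attached to the target itself (the Lambda function or the SQS queue) that allows the `events.amazonaws.com` service principal, scoped to one rule via `aws:SourceArn`, so no separate invocation role has to exist before the rule fires.

```typescript
// Illustrative sketch: helper names and ARNs are hypothetical placeholders,
// not taken from this repository.

type PolicyStatement = {
  Sid: string;
  Effect: "Allow";
  Principal: { Service: string };
  Action: string;
  Resource: string;
  Condition: { ArnEquals: { "aws:SourceArn": string } };
};

type PolicyDocument = {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
};

// Builds a resource-based policy that lets one EventBridge rule act on one target.
function eventBridgeTargetPolicy(
  sid: string,
  action: string,
  targetArn: string,
  ruleArn: string
): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: sid,
        Effect: "Allow",
        // The EventBridge service principal, rather than an IAM role.
        Principal: { Service: "events.amazonaws.com" },
        Action: action,
        Resource: targetArn,
        // Restrict the grant to the specific rule.
        Condition: { ArnEquals: { "aws:SourceArn": ruleArn } },
      },
    ],
  };
}

// Lambda target: the rule needs lambda:InvokeFunction on the function.
function lambdaInvokePolicy(functionArn: string, ruleArn: string): PolicyDocument {
  return eventBridgeTargetPolicy(
    "AllowEventBridgeInvoke",
    "lambda:InvokeFunction",
    functionArn,
    ruleArn
  );
}

// SQS target: the rule needs sqs:SendMessage on the queue.
function queueSendPolicy(queueArn: string, ruleArn: string): PolicyDocument {
  return eventBridgeTargetPolicy(
    "AllowEventBridgeSend",
    "sqs:SendMessage",
    queueArn,
    ruleArn
  );
}

// Placeholder ARNs for demonstration.
const exampleRuleArn = "arn:aws:events:eu-west-2:111122223333:rule/example-rule";
const exampleFunctionArn = "arn:aws:lambda:eu-west-2:111122223333:function:example-fn";

console.log(JSON.stringify(lambdaInvokePolicy(exampleFunctionArn, exampleRuleArn), null, 2));
```

Because the grant lives on the target, there is no separate role for the rule to assume, which is the dependency the failed invocations appeared to be tripping over.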
Elsewhere, fixed a problem where the sftp-poll lambda was not re-invoked while polling for incoming proof events.
template-mgmt-sftp-send-proof.e2e.spec.ts had a very intermittent problem where an SFTP fetch failed. I didn't get to the bottom of this, but have deleted the test file since it mostly duplicates unit tests (on manifest and test batch creation). Other e2e tests (template-mgmt-letter-full.e2e.spec.ts) show that we're sending files to the right location. Created https://github.com/NHSDigital/comms-mgr/pull/765 to add some validation to the mock itself to improve the coverage provided by other tests.
Context
Type of changes
Checklist
Sensitive Information Declaration
To ensure the utmost confidentiality and protect your and others' privacy, we kindly ask you NOT to include PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that contains sensitive information. We really appreciate your cooperation in this matter.