@@ -105,83 +105,6 @@ Alarm thresholds are configured in [`infrastructure/stacks/dos_search/variables.
105105| --------| ----------| -----------| ------------| --------| --------|
106106| Errors | CRITICAL | > 0 | 1/1 period | 60s | ✅ Active |
107107
108- ## Alarm Details
109-
110- ### Errors (WARNING - DISABLED)
111-
112- **Status**: Placeholder alarm - actions disabled until baseline established.
113-
114- **Placeholder threshold**: 5
115- **Time to trigger**: N/A (disabled)
116- **Variable**: `search_lambda_errors_warning_threshold`
117- **Enable**: Set `enable_warning_alarms = true` in variables.tf
118-
119- ### Errors (CRITICAL - ACTIVE)
120-
121- Triggers by invoking Lambda with invalid payloads that cause errors.
122-
123- **Requirements**: None
124- **Time to trigger**: ~3 minutes (2 out of 3 periods must breach)
125- **Variable**: `search_lambda_errors_critical_threshold`
126-
127- ### Duration p95 (WARNING - DISABLED)
128-
129- **Status**: Placeholder alarm - actions disabled until baseline established.
130-
131- **Placeholder threshold**: 600ms
132- **Time to trigger**: N/A (disabled)
133- **Variable**: `search_lambda_duration_p95_warning_ms`
134- **Enable**: Set `enable_warning_alarms = true` in variables.tf
135-
136- ### Duration p99 (CRITICAL - ACTIVE)
137-
138- Triggers when Lambda p99 execution time exceeds 800ms.
139-
140- **Requirements**: Actual execution time must exceed 800ms
141- **Time to trigger**: ~3 minutes (2 out of 3 periods must breach)
142- **Variable**: `search_lambda_duration_p99_critical_ms`
143-
144- ### Concurrent Executions (WARNING - DISABLED)
145-
146- **Status**: Placeholder alarm - actions disabled until baseline established.
147-
148- **Placeholder threshold**: 80
149- **Time to trigger**: N/A (disabled)
150- **Variable**: `search_lambda_concurrent_executions_warning`
151- **Enable**: Set `enable_warning_alarms = true` in variables.tf
152-
153- ### Concurrent Executions (CRITICAL - ACTIVE)
154-
155- Triggers when too many Lambda instances run simultaneously.
156-
157- **Requirements**: Sufficient concurrent invocations (>= 100)
158- **Time to trigger**: ~3 minutes (2 out of 3 periods must breach)
159- **Variable**: `search_lambda_concurrent_executions_critical` (100)
160-
161- ### Throttles (CRITICAL - ACTIVE)
162-
163- Triggers when Lambda is throttled due to concurrency limits.
164-
165- **Requirements**: Reserved concurrency must be set on the Lambda function
166- **Time to trigger**: Immediate when throttling occurs (1 minute evaluation)
167- **Variable**: `search_lambda_throttles_critical_threshold` (0)
168-
169- ### Invocations Spike (CRITICAL - ACTIVE)
170-
171- Triggers when Lambda invocations exceed 2x baseline (600/hour).
172-
173- **Requirements**: > 600 invocations in 1 hour
174- **Time to trigger**: ~3 minutes (2 out of 3 periods must breach)
175- **Variables**: `search_lambda_invocations_baseline_per_hour` (300), `invocations_critical_spike_multiplier` (2)
176-
177- ### Health Check Errors (CRITICAL - ACTIVE)
178-
179- Triggers when health check Lambda has any errors.
180-
181- **Requirements**: None
182- **Time to trigger**: ~1 minute after error
183- **Variable**: `health_check_errors_critical_threshold` (0)
184-
185108## Direct Script Usage
186109
187110You can also use the script directly for more control:
@@ -227,16 +150,179 @@ Or check Slack notifications in the configured alerts channel (#ftrs-dos-search-
227150- Wait 1-2 minutes for CloudWatch to evaluate metrics
228151- Check alarm thresholds in Terraform configuration
229152- Verify alarm evaluation periods and thresholds are appropriate for testing
230- - For duration alarms, ensure Lambda execution time actually exceeds thresholds
153+ - Use the troubleshooting commands below for specific alarm types
154+
155+ ### Troubleshooting Errors Alarms
156+
157+ Check if errors are being recorded:
158+
159+ ```bash
160+ # Set Lambda name (adjust for workspace if needed)
161+ LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
162+ [ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
163+
164+ # View recent errors in CloudWatch Logs
165+ aws logs tail /aws/lambda/${LAMBDA_NAME} \
166+ --follow \
167+ --filter-pattern "ERROR" \
168+ --profile ${AWS_PROFILE}
169+
170+ # Check error metrics (date -v is BSD/macOS syntax; with GNU date use: date -u -d '10 minutes ago')
171+ aws cloudwatch get-metric-statistics \
172+ --namespace AWS/Lambda \
173+ --metric-name Errors \
174+ --dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
175+ --start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
176+ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
177+ --period 60 \
178+ --statistics Sum \
179+ --profile ${AWS_PROFILE}
180+ ```
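Review note: the `date -u -v-10M` form used for `--start-time` is BSD/macOS syntax and fails with GNU `date` on Linux. A minimal sketch of a portable alternative follows; `utc_minutes_ago` is a hypothetical helper name, not something defined in this repository.

```shell
# utc_minutes_ago: hypothetical helper, not part of the repository's scripts.
# Prints a UTC timestamp N minutes in the past, in the format passed to --start-time.
utc_minutes_ago() {
  minutes="$1"
  # Try GNU date (-d) first, then fall back to BSD/macOS date (-v).
  date -u -d "${minutes} minutes ago" +%Y-%m-%dT%H:%M:%S 2>/dev/null \
    || date -u -v-"${minutes}"M +%Y-%m-%dT%H:%M:%S
}

START_TIME="$(utc_minutes_ago 10)"
END_TIME="$(utc_minutes_ago 0)"
echo "start=${START_TIME} end=${END_TIME}"
```

The two variables could then be passed as `--start-time "${START_TIME}"` and `--end-time "${END_TIME}"` in the `get-metric-statistics` calls.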
181+
182+ ### Troubleshooting Duration Alarms
183+
184+ Verify actual execution time:
185+
186+ ```bash
187+ # Set Lambda name (adjust for workspace if needed)
188+ LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
189+ [ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
190+
191+ # Check execution time for a single invocation
192+ aws lambda invoke \
193+ --function-name ${LAMBDA_NAME} \
194+ --payload '{"odsCode": "TEST123"}' \
195+ --cli-binary-format raw-in-base64-out \
196+ --log-type Tail \
197+ --profile ${AWS_PROFILE} \
198+ response.json | jq -r '.LogResult' | base64 -d | grep "Duration"
199+
200+ # Check p95 and p99 duration metrics (percentiles require --extended-statistics)
201+ aws cloudwatch get-metric-statistics \
202+ --namespace AWS/Lambda \
203+ --metric-name Duration \
204+ --dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
205+ --start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
206+ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
207+ --period 60 \
208+ --extended-statistics p95 p99 \
209+ --profile ${AWS_PROFILE}
210+ ```
211+
212+ If execution time is below the threshold (600ms for p95, 800ms for p99), the alarm won't trigger.
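Review note: the threshold comparison can be sketched in shell by pulling the duration out of the `REPORT` line that `--log-type Tail` returns. The helper names `duration_ms` and `exceeds_threshold` and the sample log line are illustrative assumptions. A single slow invocation does not by itself mean the p99 alarm will fire, because CloudWatch evaluates the percentile over each period.

```shell
# duration_ms and exceeds_threshold are hypothetical helpers for illustration.

# Read log text on stdin and print the Duration value (ms) from the REPORT line,
# skipping the separate "Billed Duration" field.
duration_ms() {
  awk '/^REPORT/ {
    for (i = 1; i < NF; i++)
      if ($i == "Duration:" && $(i - 1) != "Billed") { print $(i + 1); exit }
  }'
}

# Exit status 0 when duration ($1, ms) is strictly greater than threshold ($2, ms).
exceeds_threshold() {
  awk -v d="$1" -v t="$2" 'BEGIN { exit !(d > t) }'
}

# Sample REPORT line (made-up values) in the tab-separated layout Lambda emits.
sample="$(printf 'REPORT RequestId: example\tDuration: 812.34 ms\tBilled Duration: 813 ms')"
ms="$(printf '%s\n' "$sample" | duration_ms)"

if exceeds_threshold "$ms" 800; then
  echo "${ms} ms is above the 800 ms p99 threshold"
else
  echo "${ms} ms is below the 800 ms p99 threshold"
fi
```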
231213
232- **Lambda Logging**:
214+ ### Troubleshooting Concurrent Executions Alarms
215+
216+ Check current concurrency levels:
217+
218+ ```bash
219+ # Set Lambda name (adjust for workspace if needed)
220+ LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
221+ [ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
222+
223+ # Check concurrent executions metric
224+ aws cloudwatch get-metric-statistics \
225+ --namespace AWS/Lambda \
226+ --metric-name ConcurrentExecutions \
227+ --dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
228+ --start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
229+ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
230+ --period 60 \
231+ --statistics Maximum \
232+ --profile ${AWS_PROFILE}
233+
234+ # Check account-level concurrency
235+ aws lambda get-account-settings --profile ${AWS_PROFILE}
236+ ```
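Review note: the two name-building lines are repeated in every troubleshooting block. One option would be to factor them into a small function. The sketch below mirrors the same logic; `build_lambda_name` is a hypothetical helper, not an existing script in the repository.

```shell
# build_lambda_name: hypothetical helper mirroring the LAMBDA_NAME logic above.
# $1 = environment (e.g. dev), $2 = optional Terraform workspace.
build_lambda_name() {
  environment="$1"
  workspace="${2:-}"
  name="ftrs-dos-${environment}-dos-search-ods-code-lambda"
  # Only non-default workspaces add a suffix to the function name.
  if [ -n "${workspace}" ] && [ "${workspace}" != "default" ]; then
    name="${name}-${workspace}"
  fi
  printf '%s\n' "${name}"
}

build_lambda_name dev            # no workspace: base name only
build_lambda_name dev default    # default workspace: base name only
build_lambda_name dev ftrs-765   # named workspace: suffix appended
```

Usage would then be `LAMBDA_NAME="$(build_lambda_name "${ENVIRONMENT}" "${WORKSPACE}")"` at the top of each block.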
237+
238+ **Lowering the threshold for easier testing:**
239+
240+ The default threshold of 100 concurrent executions can be difficult to trigger. To make testing easier, temporarily lower it:
241+
242+ 1. Edit `infrastructure/stacks/dos_search/variables.tf` lines 178-182:
243+ ```terraform
244+ variable "search_lambda_concurrent_executions_critical" {
245+ description = "Search Lambda concurrency critical threshold (ConcurrentExecutions)"
246+ type = number
247+ default = 10 # Lowered from 100 for testing
248+ }
249+ ```
233250
234- - Check CloudWatch Logs for the Lambda function to see invocation details and errors using:
251+ 2. Apply the change by running the workspace deployment via the pipeline.
235252
236- ```shell
237- aws lambda get-function --function-name ftrs-dos-dev-dos-search-slack-notification-ftrs-765 --profile dos-search-dev
253+ 3. Update the test to match: `./scripts/trigger-alarms.sh concurrent search 15`
254+
255+ 4. After testing, revert the threshold to 100.
256+
257+ ### Troubleshooting Throttles Alarms
258+
259+ Check if throttling is occurring:
260+
261+ ```bash
262+ # Set Lambda name (adjust for workspace if needed)
263+ LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
264+ [ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
265+
266+ # Check if reserved concurrency is set
267+ aws lambda get-function-concurrency \
268+ --function-name ${LAMBDA_NAME} \
269+ --profile ${AWS_PROFILE}
270+
271+ # Check throttle metrics
272+ aws cloudwatch get-metric-statistics \
273+ --namespace AWS/Lambda \
274+ --metric-name Throttles \
275+ --dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
276+ --start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
277+ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
278+ --period 60 \
279+ --statistics Sum \
280+ --profile ${AWS_PROFILE}
281+ ```
282+
283+ **Setting up reserved concurrency for testing:**
284+
285+ The Throttles alarm requires reserved concurrency to be set on the Lambda function. To enable testing:
286+
287+ ```bash
288+ # Set Lambda name (adjust for workspace if needed)
289+ LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
290+ [ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
291+
292+ # Set reserved concurrency to 5 (limits Lambda to 5 concurrent executions)
293+ aws lambda put-function-concurrency \
294+ --function-name ${LAMBDA_NAME} \
295+ --reserved-concurrent-executions 5 \
296+ --profile ${AWS_PROFILE}
297+
298+ # Run the test (20 concurrent invocations will cause throttling)
299+ make test-lambda-alarm-throttles-critical
300+
301+ # Remove reserved concurrency after testing
302+ aws lambda delete-function-concurrency \
303+ --function-name ${LAMBDA_NAME} \
304+ --profile ${AWS_PROFILE}
305+ ```
306+
307+ ### Troubleshooting Invocations Spike Alarms
308+
309+ Check invocation rate:
310+
311+ ```bash
312+ # Set Lambda name (adjust for workspace if needed)
313+ LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
314+ [ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
315+
316+ # Check invocations over the last hour
317+ aws cloudwatch get-metric-statistics \
318+ --namespace AWS/Lambda \
319+ --metric-name Invocations \
320+ --dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
321+ --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
322+ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
323+ --period 3600 \
324+ --statistics Sum \
325+ --profile ${AWS_PROFILE}
238326```
239327
240- `ftrs-765` is your workspace and will need to adjusted according to yours.
241- `--profile <your-profile name>` is your AWS CLI name that you used when you configured AWS CLI using `aws configure sso`.
242- Test comment
328+ The alarm triggers when invocations exceed 600/hour (2x the baseline of 300/hour).
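Review note: the 600/hour figure is the baseline multiplied by the spike multiplier. A minimal sketch of that arithmetic follows, using the default values from the removed Alarm Details section (300 and 2). The shell variable names and `is_spike` are illustrative assumptions. The real evaluation happens in CloudWatch, not in a script.

```shell
# Illustrative values, taken from the documented defaults:
# search_lambda_invocations_baseline_per_hour = 300
# invocations_critical_spike_multiplier       = 2
BASELINE_PER_HOUR=300
SPIKE_MULTIPLIER=2
SPIKE_THRESHOLD=$((BASELINE_PER_HOUR * SPIKE_MULTIPLIER))

# is_spike: hypothetical helper; succeeds when an hourly Sum ($1) is strictly above the threshold.
is_spike() {
  [ "$1" -gt "${SPIKE_THRESHOLD}" ]
}

echo "spike threshold: ${SPIKE_THRESHOLD} invocations/hour"
if is_spike 650; then
  echo "650 invocations/hour would breach the threshold"
fi
```

The hourly `Sum` returned by the `get-metric-statistics` call can be compared against this threshold when checking why the alarm did or did not fire.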