Commit 0f2d8f4

feat(dos-search): FTRS-765 updating readme
1 parent 74c4ce3 commit 0f2d8f4

1 file changed

+171
-85
lines changed

tests/alarms/README.md

Lines changed: 171 additions & 85 deletions
@@ -105,83 +105,6 @@ Alarm thresholds are configured in [`infrastructure/stacks/dos_search/variables.
105105
|--------|----------|-----------|------------|--------|--------|
106106
| Errors | CRITICAL | > 0 | 1/1 period | 60s | ✅ Active |
107107

108-
## Alarm Details
109-
110-
### Errors (WARNING - DISABLED)
111-
112-
**Status**: Placeholder alarm - actions disabled until baseline established.
113-
114-
**Placeholder threshold**: 5
115-
**Time to trigger**: N/A (disabled)
116-
**Variable**: `search_lambda_errors_warning_threshold`
117-
**Enable**: Set `enable_warning_alarms = true` in variables.tf
118-
119-
### Errors (CRITICAL - ACTIVE)
120-
121-
Triggers by invoking Lambda with invalid payloads that cause errors.
122-
123-
**Requirements**: None
124-
**Time to trigger**: ~3 minutes (2 out of 3 periods must breach)
125-
**Variable**: `search_lambda_errors_critical_threshold`
126-
127-
### Duration p95 (WARNING - DISABLED)
128-
129-
**Status**: Placeholder alarm - actions disabled until baseline established.
130-
131-
**Placeholder threshold**: 600ms
132-
**Time to trigger**: N/A (disabled)
133-
**Variable**: `search_lambda_duration_p95_warning_ms`
134-
**Enable**: Set `enable_warning_alarms = true` in variables.tf
135-
136-
### Duration p99 (CRITICAL - ACTIVE)
137-
138-
Triggers when Lambda p99 execution time exceeds 800ms.
139-
140-
**Requirements**: Actual execution time must exceed 800ms
141-
**Time to trigger**: ~3 minutes (2 out of 3 periods must breach)
142-
**Variable**: `search_lambda_duration_p99_critical_ms`
143-
144-
### Concurrent Executions (WARNING - DISABLED)
145-
146-
**Status**: Placeholder alarm - actions disabled until baseline established.
147-
148-
**Placeholder threshold**: 80
149-
**Time to trigger**: N/A (disabled)
150-
**Variable**: `search_lambda_concurrent_executions_warning`
151-
**Enable**: Set `enable_warning_alarms = true` in variables.tf
152-
153-
### Concurrent Executions (CRITICAL - ACTIVE)
154-
155-
Triggers when too many Lambda instances run simultaneously.
156-
157-
**Requirements**: Sufficient concurrent invocations (>= 100)
158-
**Time to trigger**: ~3 minutes (2 out of 3 periods must breach)
159-
**Variable**: `search_lambda_concurrent_executions_critical` (100)
160-
161-
### Throttles (CRITICAL - ACTIVE)
162-
163-
Triggers when Lambda is throttled due to concurrency limits.
164-
165-
**Requirements**: Reserved concurrency must be set on the Lambda function
166-
**Time to trigger**: Immediate when throttling occurs (1 minute evaluation)
167-
**Variable**: `search_lambda_throttles_critical_threshold` (0)
168-
169-
### Invocations Spike (CRITICAL - ACTIVE)
170-
171-
Triggers when Lambda invocations exceed 2x baseline (600/hour).
172-
173-
**Requirements**: > 600 invocations in 1 hour
174-
**Time to trigger**: ~3 minutes (2 out of 3 periods must breach)
175-
**Variables**: `search_lambda_invocations_baseline_per_hour` (300), `invocations_critical_spike_multiplier` (2)
176-
177-
### Health Check Errors (CRITICAL - ACTIVE)
178-
179-
Triggers when health check Lambda has any errors.
180-
181-
**Requirements**: None
182-
**Time to trigger**: ~1 minute after error
183-
**Variable**: `health_check_errors_critical_threshold` (0)
184-
185108
## Direct Script Usage
186109

187110
You can also use the script directly for more control:
@@ -227,16 +150,179 @@ Or check Slack notifications in the configured alerts channel (#ftrs-dos-search-
227150
- Wait 1-2 minutes for CloudWatch to evaluate metrics
228151
- Check alarm thresholds in Terraform configuration
229152
- Verify alarm evaluation periods and thresholds are appropriate for testing
230-
- For duration alarms, ensure Lambda execution time actually exceeds thresholds
153+
- Use the troubleshooting commands below for specific alarm types
154+
155+
### Troubleshooting Errors Alarms
156+
157+
Check if errors are being recorded:
158+
159+
```bash
160+
# Set Lambda name (adjust for workspace if needed)
161+
LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
162+
[ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
163+
164+
# View recent errors in CloudWatch Logs
165+
aws logs tail /aws/lambda/${LAMBDA_NAME} \
166+
--follow \
167+
--filter-pattern "ERROR" \
168+
--profile ${AWS_PROFILE}
169+
170+
# Check error metrics
171+
aws cloudwatch get-metric-statistics \
172+
--namespace AWS/Lambda \
173+
--metric-name Errors \
174+
--dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
175+
--start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
176+
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
177+
--period 60 \
178+
--statistics Sum \
179+
--profile ${AWS_PROFILE}
180+
```
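The `date -u -v-10M` form used in the commands in this section is BSD/macOS syntax and fails with GNU `date` on most Linux hosts. A minimal sketch of a portable alternative follows; `minutes_ago` is a hypothetical helper name, not something defined in this repository, and the detection via `date --version` assumes only GNU and BSD variants need to be supported.

```bash
# Print a UTC timestamp N minutes in the past, in the same format
# (%Y-%m-%dT%H:%M:%S) as the get-metric-statistics commands in this README.
# minutes_ago is a hypothetical helper, not part of the repository scripts.
minutes_ago() {
  local minutes="$1"
  if date --version >/dev/null 2>&1; then
    # GNU date (most Linux distributions) accepts relative dates via -d
    date -u -d "${minutes} minutes ago" +%Y-%m-%dT%H:%M:%S
  else
    # BSD/macOS date adjusts the current time via -v
    date -u -v-"${minutes}"M +%Y-%m-%dT%H:%M:%S
  fi
}

START_TIME="$(minutes_ago 10)"
END_TIME="$(minutes_ago 0)"
echo "Querying from ${START_TIME} to ${END_TIME}"
```

With this in place, `--start-time "${START_TIME}"` and `--end-time "${END_TIME}"` could stand in for the inline `$(date ...)` substitutions.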
181+
182+
### Troubleshooting Duration Alarms
183+
184+
Verify actual execution time:
185+
186+
```bash
187+
# Set Lambda name (adjust for workspace if needed)
188+
LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
189+
[ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
190+
191+
# Check execution time for a single invocation
192+
aws lambda invoke \
193+
--function-name ${LAMBDA_NAME} \
194+
--payload '{"odsCode": "TEST123"}' \
195+
--cli-binary-format raw-in-base64-out \
196+
--log-type Tail \
197+
--profile ${AWS_PROFILE} \
198+
response.json | jq -r '.LogResult' | base64 -d | grep "Duration"
199+
200+
# Check p99 duration metrics
201+
aws cloudwatch get-metric-statistics \
202+
--namespace AWS/Lambda \
203+
--metric-name Duration \
204+
--dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
205+
--start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
206+
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
207+
--period 60 \
208+
--extended-statistics p95 p99 \
209+
--profile ${AWS_PROFILE}
210+
```
211+
212+
If execution time is below the threshold (600ms for p95, 800ms for p99), the alarm won't trigger.
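To compare a single invocation against a threshold, the `Duration` value can be pulled out of the `REPORT` log line that the `aws lambda invoke ... --log-type Tail` command prints. The following is a rough sketch: `duration_exceeds` is a hypothetical helper name and the sample line uses made-up values. Note that the alarm evaluates a percentile over the whole period, so one slow invocation is only an indication, not proof that the alarm will fire.

```bash
# Succeed (exit 0) when the Duration in a Lambda REPORT log line is above
# the given threshold in milliseconds; fail (exit 1) otherwise.
# duration_exceeds is a hypothetical helper, not part of the repository scripts.
duration_exceeds() {
  local report_line="$1" threshold_ms="$2"
  printf '%s\n' "${report_line}" \
    | awk -v limit="${threshold_ms}" '
        {
          for (i = 1; i < NF; i++) {
            # The first "Duration:" field is the execution time;
            # "Billed Duration:" and "Init Duration:" appear later in the line.
            if ($i == "Duration:") {
              exit !(($(i + 1) + 0) > (limit + 0))
            }
          }
          exit 1
        }'
}

# Sample REPORT line with illustrative values
SAMPLE='REPORT RequestId: 00000000-0000-0000-0000-000000000000 Duration: 912.34 ms Billed Duration: 913 ms Memory Size: 128 MB Max Memory Used: 70 MB'

if duration_exceeds "${SAMPLE}" 800; then
  echo "Above the 800ms p99 threshold"
else
  echo "Below the 800ms p99 threshold"
fi
```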
231213

232-
**Lambda Logging**:
214+
### Troubleshooting Concurrent Executions Alarms
215+
216+
Check current concurrency levels:
217+
218+
```bash
219+
# Set Lambda name (adjust for workspace if needed)
220+
LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
221+
[ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
222+
223+
# Check concurrent executions metric
224+
aws cloudwatch get-metric-statistics \
225+
--namespace AWS/Lambda \
226+
--metric-name ConcurrentExecutions \
227+
--dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
228+
--start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
229+
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
230+
--period 60 \
231+
--statistics Maximum \
232+
--profile ${AWS_PROFILE}
233+
234+
# Check account-level concurrency
235+
aws lambda get-account-settings --profile ${AWS_PROFILE}
236+
```
237+
238+
**Lowering the threshold for easier testing:**
239+
240+
The default threshold of 100 concurrent executions can be difficult to trigger. To make testing easier, temporarily lower it:
241+
242+
1. Edit `infrastructure/stacks/dos_search/variables.tf` lines 178-182:
243+
```terraform
244+
variable "search_lambda_concurrent_executions_critical" {
245+
description = "Search Lambda concurrency critical threshold (ConcurrentExecutions)"
246+
type = number
247+
default = 10 # Lowered from 100 for testing
248+
}
249+
```
233250

234-
- Check CloudWatch Logs for the Lambda function to see invocation details and errors using:
251+
2. Apply the change by running the pipeline for your workspace.
235252

236-
```shell
237-
aws lambda get-function --function-name ftrs-dos-dev-dos-search-slack-notification-ftrs-765 --profile dos-search-dev
253+
3. Update the test to match: `./scripts/trigger-alarms.sh concurrent search 15`
254+
255+
4. After testing, revert the threshold back to 100
256+
257+
### Troubleshooting Throttles Alarms
258+
259+
Check if throttling is occurring:
260+
261+
```bash
262+
# Set Lambda name (adjust for workspace if needed)
263+
LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
264+
[ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
265+
266+
# Check if reserved concurrency is set
267+
aws lambda get-function-concurrency \
268+
--function-name ${LAMBDA_NAME} \
269+
--profile ${AWS_PROFILE}
270+
271+
# Check throttle metrics
272+
aws cloudwatch get-metric-statistics \
273+
--namespace AWS/Lambda \
274+
--metric-name Throttles \
275+
--dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
276+
--start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
277+
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
278+
--period 60 \
279+
--statistics Sum \
280+
--profile ${AWS_PROFILE}
281+
```
282+
283+
**Setting up reserved concurrency for testing:**
284+
285+
The Throttles alarm requires reserved concurrency to be set on the Lambda function. To enable testing:
286+
287+
```bash
288+
# Set Lambda name (adjust for workspace if needed)
289+
LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
290+
[ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
291+
292+
# Set reserved concurrency to 5 (limits Lambda to 5 concurrent executions)
293+
aws lambda put-function-concurrency \
294+
--function-name ${LAMBDA_NAME} \
295+
--reserved-concurrent-executions 5 \
296+
--profile ${AWS_PROFILE}
297+
298+
# Run the test (20 concurrent invocations will cause throttling)
299+
make test-lambda-alarm-throttles-critical
300+
301+
# Remove reserved concurrency after testing
302+
aws lambda delete-function-concurrency \
303+
--function-name ${LAMBDA_NAME} \
304+
--profile ${AWS_PROFILE}
305+
```
306+
307+
### Troubleshooting Invocations Spike Alarms
308+
309+
Check invocation rate:
310+
311+
```bash
312+
# Set Lambda name (adjust for workspace if needed)
313+
LAMBDA_NAME="ftrs-dos-${ENVIRONMENT}-dos-search-ods-code-lambda"
314+
[ -n "${WORKSPACE}" ] && [ "${WORKSPACE}" != "default" ] && LAMBDA_NAME="${LAMBDA_NAME}-${WORKSPACE}"
315+
316+
# Check invocations over the last hour
317+
aws cloudwatch get-metric-statistics \
318+
--namespace AWS/Lambda \
319+
--metric-name Invocations \
320+
--dimensions Name=FunctionName,Value=${LAMBDA_NAME} \
321+
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
322+
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
323+
--period 3600 \
324+
--statistics Sum \
325+
--profile ${AWS_PROFILE}
238326
```
239327

240-
`ftrs-765` is the workspace suffix and will need to be adjusted to match yours.
241-
`--profile <your-profile-name>` is the AWS CLI profile name you chose when you configured the AWS CLI using `aws configure sso`.
242-
Test comment
328+
The alarm triggers when invocations exceed 600/hour (2x the baseline of 300/hour).
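The 600/hour figure is the baseline multiplied by the spike multiplier. As a small sketch of that arithmetic, assuming the values 300 and 2 are still the defaults for `search_lambda_invocations_baseline_per_hour` and `invocations_critical_spike_multiplier` (`invocations_spike` is a hypothetical helper name):

```bash
# Assumed defaults; check variables.tf for the current values
BASELINE_PER_HOUR=300
SPIKE_MULTIPLIER=2
THRESHOLD=$((BASELINE_PER_HOUR * SPIKE_MULTIPLIER))

# Succeed when an hourly Sum from get-metric-statistics is above the threshold.
# CloudWatch returns the Sum as a float (e.g. 650.0), so drop the fraction first.
invocations_spike() {
  local hourly_sum="${1%.*}"
  [ "${hourly_sum}" -gt "${THRESHOLD}" ]
}

if invocations_spike "650.0"; then
  echo "650 invocations/hour is above the threshold of ${THRESHOLD}"
fi
```

The comparison is strictly greater-than, so exactly 600 invocations in the hour would not count as a breach.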
