This project implements a comprehensive alert normalization pipeline that processes CloudWatch and Grafana alerts, normalizes them into a canonical format, and creates GitHub issues for incident management.
- `infra/` - Terraform infrastructure definitions (SNS, SQS, DynamoDB, Lambda, IAM)
- `lambdas/` - TypeScript Lambda functions
  - `lambdas/collector/` - Main alert processing Lambda with transformers
  - `lambdas/external-alerts-webhook/` - Webhook Lambda for Grafana integration
- `ReferenceData/` - Reference documentation and schemas
- `bootstrap/` - Infrastructure bootstrapping utilities
- `scratch/` - Development workspace
- Terraform >= 1.6
- AWS CLI configured (SSO or profile)
- Node.js 18+ and Yarn
- Optional: LocalStack + `tflocal`/`awslocal` for local testing
The following commands are pre-approved for this project:
- `make build` - Build all Lambda functions (collector + webhook)
- `make clean` - Clean all build artifacts
- `cd lambdas/collector && yarn install && yarn build` - Build collector Lambda
- `cd lambdas/collector && yarn test` - Run collector tests
- `cd lambdas/collector && yarn test:watch` - Run tests in watch mode
- `cd lambdas/collector && yarn test:coverage` - Run tests with coverage
- `cd lambdas/collector && yarn lint` - TypeScript type checking

- `make aws-init-dev` - Initialize Terraform backend for dev
- `make aws-init-prod` - Initialize Terraform backend for prod
- `make aws-apply-dev` - Deploy to dev environment
- `make aws-apply-prod` - Deploy to prod environment
- `make aws-destroy-dev` - Destroy dev environment
- `make aws-destroy-prod` - Destroy prod environment

- `make aws-logs-dev` / `make logs-dev` - Tail dev Lambda logs
- `make aws-logs-prod` / `make logs-prod` - Tail prod Lambda logs
- `make aws-publish-dev` - Send test message to dev SNS
- `make aws-publish-prod` - Send test message to prod SNS

- `make ls-apply` - Deploy to LocalStack
- `make ls-destroy` - Destroy LocalStack deployment
- `make ls-publish` - Send test message to LocalStack SNS
- `make ls-logs` - Tail LocalStack Lambda logs
- Use `/read-alerting` to analyze the current project state
- Make changes to Lambda source code in `lambdas/*/src/`
- Run `make build` to compile TypeScript
- Test locally with `cd lambdas/collector && yarn test`
- Deploy to dev with `make aws-apply-dev` for integration testing
- Monitor with `make logs-dev` to verify functionality
Alert Flow: Grafana/CloudWatch → SNS → SQS → Lambda → DynamoDB + GitHub Issues
- SNS Topic: Receives alerts from multiple sources
- SQS Queue: Buffers alerts with DLQ for failed processing
- Collector Lambda: Normalizes alerts, manages GitHub issues, tracks state
- Webhook Lambda: Secure endpoint for Grafana alerts
- DynamoDB: Stores alert state with fingerprint-based deduplication
- GitHub App Integration: Creates, updates, and closes issues automatically
- Ingestion: Alerts arrive via SNS (Grafana webhook or CloudWatch)
- Queuing: SNS fans out to SQS with dead letter queue for failures
- Processing: Collector Lambda processes batches with partial failure support
- Transformation: Provider-specific transformers normalize to canonical schema
- Fingerprinting: SHA-256 hash of stable alert identifiers for deduplication
- State Management: DynamoDB tracks alert lifecycle with TTL
- GitHub Integration: Automated issue creation/updates/closure via GitHub App
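The fingerprinting step can be illustrated with a minimal sketch. It assumes the fingerprint is a SHA-256 hash over the source, the team, and a map of stable provider identifiers; the `FingerprintInput` shape and the `canonicalize` helper are hypothetical, since the actual fields used in `fingerprint.ts` are not listed here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape: the real AlertIdentity fields are not shown in this document.
interface FingerprintInput {
  source: "grafana" | "cloudwatch";
  team: string;
  identity: Record<string, string>; // stable provider identifiers only, no timestamps or values
}

// Serialize with sorted keys so that key order never changes the hash.
function canonicalize(input: FingerprintInput): string {
  const sortedIdentity = Object.keys(input.identity)
    .sort()
    .map((key) => [key, input.identity[key]]);
  return JSON.stringify([input.source, input.team, sortedIdentity]);
}

// Deterministic SHA-256 fingerprint, hex encoded (64 characters).
function fingerprint(input: FingerprintInput): string {
  return createHash("sha256").update(canonicalize(input), "utf8").digest("hex");
}
```

Only identifiers that stay constant across firings belong in the hash input; including a timestamp or a metric value would give every occurrence a new fingerprint and defeat deduplication.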
- Main Handler (`index.ts`): SQS batch processing with error handling
- Processor (`processor.ts`): Alert normalization pipeline orchestration
- Transformers (`transformers/`): Provider-specific alert parsing
  - `grafana.ts` - Grafana webhook payload transformation
  - `cloudwatch.ts` - CloudWatch SNS message transformation
  - `base.ts` - Common validation and utility functions
- Fingerprinting (`fingerprint.ts`): Deterministic alert identification
- Database (`database.ts`): DynamoDB alert state management
- GitHub Client (`github/githubClient.ts`): Issue lifecycle management
- Utilities (`utils/`): Rate limiter, circuit breaker, common functions
- Handler (`index.ts`): Secure webhook endpoint for multiple alert sources
- Authentication: Flexible header/token validation with timing-safe comparison
- SNS Publishing: Forwards validated payloads to alert processing
The webhook endpoint supports flexible authentication using header/token pairs stored in AWS Secrets Manager:
- Secret Format: Secrets are stored as JSON with header names as keys and tokens as values, for example `{ "x-grafana-token": "your-grafana-secret-token", "x-custom-webhook": "your-custom-webhook-token" }`
- Request Validation: Incoming requests are authenticated if any configured header is present with the correct token value
- Security: Uses timing-safe comparison to prevent timing attacks
- Caching: Secrets are cached for 5 minutes to reduce Secrets Manager API calls
- Multiple Sources: Supports different webhook sources (Grafana, PagerDuty, etc.), each with its own header/token pair
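The header/token check can be sketched as follows. This is an illustration under assumptions, not the webhook Lambda's actual code: `isAuthenticated` and `safeEqual` are hypothetical names, and the secret map is passed in directly instead of being loaded from Secrets Manager and cached.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

type HeaderTokens = Record<string, string>; // header name -> expected token

// Hash both values first so timingSafeEqual always receives equal-length buffers
// (it throws when the lengths differ).
function safeEqual(a: string, b: string): boolean {
  const hashA = createHash("sha256").update(a, "utf8").digest();
  const hashB = createHash("sha256").update(b, "utf8").digest();
  return timingSafeEqual(hashA, hashB);
}

// Returns true if any configured header is present with its expected token.
function isAuthenticated(
  headers: Record<string, string | undefined>,
  secrets: HeaderTokens,
): boolean {
  // HTTP header names are case-insensitive, so normalize before lookup.
  const normalized = new Map<string, string>();
  for (const [name, value] of Object.entries(headers)) {
    if (value !== undefined) normalized.set(name.toLowerCase(), value);
  }
  let authenticated = false;
  for (const [name, expected] of Object.entries(secrets)) {
    const presented = normalized.get(name.toLowerCase());
    // Check every configured pair instead of returning on the first match.
    if (presented !== undefined && safeEqual(presented, expected)) {
      authenticated = true;
    }
  }
  return authenticated;
}
```

A request carrying `X-Grafana-Token` with the stored value passes regardless of header casing, while a request with no configured header, or with a wrong token, is rejected.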
interface AlertEvent {
  schema_version: number; // Version for future migrations
  source: "grafana" | "cloudwatch";
  state: "FIRING" | "RESOLVED";
  title: string; // Normalized alert title
  description?: string; // Optional alert description
  reason?: string; // Provider-specific reason
  priority: "P0" | "P1" | "P2" | "P3"; // Standardized priority
  occurred_at: string; // ISO8601 timestamp
  team: string; // Owning team identifier
  resource: AlertResource; // Resource information
  identity: AlertIdentity; // Provider identity for fingerprinting
  links: AlertLinks; // Navigation and runbook links
  raw_provider: any; // Original payload for debugging
}
- CREATE: New alert, create GitHub issue
- COMMENT: Recurring alert, add comment to existing issue
- CLOSE: Resolved alert, close GitHub issue
- SKIP_STALE: Out-of-order or duplicate alert
- SKIP_MANUAL_CLOSE: Alert manually closed, skip automation
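One way the action could be chosen from stored state and an incoming alert is sketched below. The precedence of the checks (staleness first, then manual closure) and the handling of a RESOLVED alert with no stored state are assumptions made for illustration; `decideAction`, `StoredState`, and `IncomingAlert` are hypothetical names.

```typescript
type Action = "CREATE" | "COMMENT" | "CLOSE" | "SKIP_STALE" | "SKIP_MANUAL_CLOSE";

// Subset of the stored alert state that the decision needs.
interface StoredState {
  status: "OPEN" | "CLOSED";
  last_provider_state_at: string; // ISO8601
  manually_closed: boolean;
}

interface IncomingAlert {
  state: "FIRING" | "RESOLVED";
  occurred_at: string; // ISO8601
}

function decideAction(existing: StoredState | undefined, alert: IncomingAlert): Action {
  if (existing === undefined) {
    // Nothing tracked yet: open an issue for FIRING, ignore a stray RESOLVED (assumption).
    return alert.state === "FIRING" ? "CREATE" : "SKIP_STALE";
  }
  // Out-of-order or duplicate delivery: the event is not newer than what was already applied.
  if (Date.parse(alert.occurred_at) <= Date.parse(existing.last_provider_state_at)) {
    return "SKIP_STALE";
  }
  // A human closed the issue, so automation stays out of the way.
  if (existing.manually_closed) {
    return "SKIP_MANUAL_CLOSE";
  }
  if (alert.state === "RESOLVED") {
    return existing.status === "OPEN" ? "CLOSE" : "SKIP_STALE";
  }
  // FIRING: comment on the open issue, or start a new one if the previous one was closed.
  return existing.status === "OPEN" ? "COMMENT" : "CREATE";
}
```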
- Primary Key: `fingerprint` (SHA-256 hash)
- Attributes:
  - `status`: "OPEN" | "CLOSED"
  - `team`, `priority`, `title`: Core alert metadata
  - `issue_number`: GitHub issue number
  - `last_provider_state_at`: Timestamp for out-of-order detection
  - `manually_closed`: Boolean flag for manual intervention
  - `ttl_expires_at`: 3-year TTL for automatic cleanup
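DynamoDB's TTL feature expects the expiry attribute to be a number holding Unix epoch seconds. A minimal sketch of computing `ttl_expires_at`, assuming "3 years" means 3 × 365 days (the exact duration used by the project is not stated); `ttlExpiresAt` is a hypothetical helper name.

```typescript
// Assumption: three years approximated as 3 * 365 days, ignoring leap days.
const THREE_YEARS_SECONDS = 3 * 365 * 24 * 60 * 60;

// Returns the expiry as Unix epoch seconds, the format DynamoDB TTL requires.
function ttlExpiresAt(now: Date): number {
  return Math.floor(now.getTime() / 1000) + THREE_YEARS_SECONDS;
}
```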
- Purpose: Prevent GitHub API cascading failures
- Configuration: Failure threshold, timeout, recovery period
- Fallback: Log for manual processing when circuit open
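A circuit breaker with a failure threshold and a recovery period can be sketched as below. This is an illustrative state machine, not the implementation in `utils/`; the class name, method names, and the injectable clock are assumptions chosen to keep the example testable.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly recoveryMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, recoveryMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.recoveryMs = recoveryMs;
    this.now = now;
  }

  state(): CircuitState {
    if (this.openedAt === null) return "CLOSED";
    // After the recovery period, allow trial requests through.
    return this.now() - this.openedAt >= this.recoveryMs ? "HALF_OPEN" : "OPEN";
  }

  // Callers check this before touching the GitHub API; when false, log for manual processing.
  allowRequest(): boolean {
    return this.state() !== "OPEN";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    if (this.state() === "HALF_OPEN") {
      // The trial request failed, so reopen and restart the recovery period.
      this.openedAt = this.now();
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

Keeping the breaker as plain state with explicit `recordSuccess`/`recordFailure` calls leaves the fallback decision (for example, logging the alert for manual processing) to the caller.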
- GitHub API: 10 requests/second default with backoff
- Implementation: Token bucket algorithm with jitter
- Scope: Global rate limiting across all GitHub operations
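The token bucket approach can be sketched as follows. The 10 requests/second figure comes from the default above; the class shape, the injectable clock, and the 25% jitter range are assumptions for illustration only.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = now();
  }

  // Add tokens for the time elapsed since the last refill, capped at capacity.
  private refill(): void {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
  }

  // Consume one token if available; returns false when the caller should back off.
  tryAcquire(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  // Milliseconds until one token is available, stretched by up to 25% random jitter.
  retryDelayMs(random: () => number = Math.random): number {
    this.refill();
    const missing = Math.max(0, 1 - this.tokens);
    const baseMs = (missing / this.refillPerSecond) * 1000;
    return Math.ceil(baseMs * (1 + random() * 0.25));
  }
}
```

Sharing a single bucket instance across all GitHub operations is what makes the limit global; jitter spreads out retries from concurrent invocations so they do not all wake at the same instant.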
- SQS Integration: Report individual message failures
- Benefits: Failed messages retry without affecting successful ones
- DLQ: Poison messages route to dead letter queue
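Partial batch failure reporting relies on the Lambda response shape `{ batchItemFailures: [{ itemIdentifier }] }`, which takes effect when `ReportBatchItemFailures` is enabled on the SQS event source mapping. A minimal sketch with locally declared types (the real handler would use the project's own types and processing function; `handleBatch` and `processRecord` are hypothetical names):

```typescript
// Minimal local types mirroring the parts of the SQS event used here.
interface SqsRecord {
  messageId: string;
  body: string;
}
interface SqsEvent {
  Records: SqsRecord[];
}
interface SqsBatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

async function handleBatch(
  event: SqsEvent,
  processRecord: (record: SqsRecord) => Promise<void>,
): Promise<SqsBatchResponse> {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of event.Records) {
    try {
      await processRecord(record);
    } catch (error) {
      // Report only this message as failed; successfully processed ones are deleted by SQS.
      console.error(JSON.stringify({ event: "RECORD_FAILED", messageId: record.messageId, error: String(error) }));
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

Messages listed in `batchItemFailures` become visible again for retry, and after the queue's maximum receive count they move to the DLQ.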
- SNS Topic (`aws_sns_topic.alerts`): Multi-source alert ingestion
- SQS Queue (`aws_sqs_queue.alerts`): Alert buffering with visibility timeout
- DLQ (`aws_sqs_queue.dlq`): Failed message handling
- Lambda Functions: Collector and webhook with proper IAM roles
- DynamoDB Table: Alert state with TTL and on-demand billing
- CloudWatch: Logs, metrics, and alarms
- IAM Roles: Least-privilege access with explicit resource ARNs
- Secrets Manager: GitHub App credentials with rotation support
- VPC: Optional network isolation (not required for current setup)
- Encryption: At-rest and in-transit for all data
- Installation Tokens: Short-lived, scoped access tokens
- JWT Generation: RS256 algorithm with private key
- Token Caching: 5-minute cache with expiration handling
- Permissions: Issues (read/write), metadata (read)
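The RS256 JWT used to request an installation token can be sketched with Node's built-in `crypto` module. The claim values follow GitHub's documented guidance (`iat` set 60 seconds in the past to tolerate clock drift, `exp` at most 10 minutes ahead); `createAppJwt` is a hypothetical name, and the throwaway key pair and app ID `12345` exist only to make the example runnable.

```typescript
import { createSign, generateKeyPairSync } from "node:crypto";

// Builds a signed JWT for a GitHub App. The private key is a PEM string,
// which in this project would come from Secrets Manager.
function createAppJwt(
  appId: string,
  privateKeyPem: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): string {
  const encode = (value: object): string =>
    Buffer.from(JSON.stringify(value), "utf8").toString("base64url");
  const header = encode({ alg: "RS256", typ: "JWT" });
  const payload = encode({
    iat: nowSeconds - 60, // backdated to tolerate clock drift
    exp: nowSeconds + 9 * 60, // stays under the 10-minute maximum
    iss: appId,
  });
  const signingInput = `${header}.${payload}`;
  const signature = createSign("RSA-SHA256").update(signingInput).sign(privateKeyPem, "base64url");
  return `${signingInput}.${signature}`;
}

// Usage with a throwaway key pair (illustration only).
const demoKeys = generateKeyPairSync("rsa", {
  modulusLength: 2048,
  publicKeyEncoding: { type: "spki", format: "pem" },
  privateKeyEncoding: { type: "pkcs8", format: "pem" },
});
const demoJwt = createAppJwt("12345", demoKeys.privateKey, 1700000000);
```

The resulting JWT is exchanged for a short-lived installation token, and that installation token is the value worth caching.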
- Creation: Rich issue body with alert details and debug info
- Labels: Auto-created labels (Pri: P1, Team: dev-infra, etc.)
- Comments: Recurring alert updates with timestamps
- Closure: Automatic closure on resolution (unless manually closed)
- Transformers: Payload parsing and validation
- Fingerprinting: Consistency and collision resistance
- State Management: DynamoDB operations and edge cases
- GitHub Client: API interactions with mocked responses
- LocalStack: Full AWS service simulation
- End-to-End: SNS → SQS → Lambda → DynamoDB flow
- GitHub API: Mocked with realistic rate limits and errors
- Realistic Payloads: `test-data/` with actual Grafana/CloudWatch formats
- Edge Cases: Missing fields, malformed data, network failures
- Fixtures: Reusable test objects for consistent testing
- Alert Processing: Success/failure rates by source and team
- GitHub API: Success rates, rate limit hits, circuit breaker state
- DLQ Depth: Failed message accumulation
- Processing Latency: P50, P95, P99 latencies
- JSON Format: Consistent log structure for parsing
- Correlation IDs: Message ID tracking through pipeline
- Debug Context: Comprehensive error context with alert details
- Security: No secrets or credentials in logs
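A structured log line with a correlation ID and secret redaction might look like the sketch below. The key pattern used for redaction and the `logEvent`/`redact` names are assumptions; the project's actual logger is not shown here.

```typescript
// Hypothetical pattern for keys whose values must never reach the logs.
const SENSITIVE_KEY = /token|secret|password|authorization|private[_-]?key/i;

// Recursively replaces values under sensitive-looking keys.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

// Emits one JSON object per line, carrying the SQS message ID as correlation ID.
function logEvent(event: string, messageId: string, context: Record<string, unknown>): string {
  const line = JSON.stringify({
    timestamp: new Date().toISOString(),
    event,
    messageId,
    context: redact(context),
  });
  console.log(line);
  return line;
}
```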
- DLQ Alarms: High message count indicates processing issues
- Error Rate Alarms: High failure rates in transformation/GitHub
- Latency Alarms: Processing time exceeding thresholds
- `STATUS_TABLE_NAME`: DynamoDB table name
- `GITHUB_REPO`: Target repository (org/repo format)
- `GITHUB_APP_SECRET_ID`: Secrets Manager secret name
- `ENABLE_GITHUB_ISSUES`: Feature flag for GitHub integration

- `aws_region`: Deployment region
- `name_prefix`: Resource naming prefix
- `github_repo`: Target repository
- `enable_github_issues`: Boolean feature flag
- `webhook_grafana_token`: Shared secret for webhook auth
- Create transformer in `lambdas/collector/src/transformers/`
- Extend source detection in `detectAlertSource()`
- Add fingerprint logic in `fingerprint.ts`
- Create test fixtures and unit tests
- Update infrastructure for new SNS sources
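As an illustration of the source detection step, the sketch below distinguishes payloads by fields each provider is known to send: CloudWatch alarm notifications carry `AlarmName` and `NewStateValue`, and Grafana webhook payloads carry a `status` string and an `alerts` array. The signature and return type are assumptions; the real `detectAlertSource()` may differ. A new source would add its own branch here.

```typescript
type AlertSource = "grafana" | "cloudwatch";

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical signature: returns undefined for payloads no transformer recognizes.
function detectAlertSource(payload: unknown): AlertSource | undefined {
  if (!isRecord(payload)) return undefined;
  // CloudWatch alarm notification delivered through SNS.
  if (typeof payload["AlarmName"] === "string" && typeof payload["NewStateValue"] === "string") {
    return "cloudwatch";
  }
  // Grafana alerting webhook payload.
  if (typeof payload["status"] === "string" && Array.isArray(payload["alerts"])) {
    return "grafana";
  }
  return undefined;
}
```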
- Check CloudWatch logs for structured JSON events
- Look for `NORMALIZED_ALERT` log entries with full context
- Verify DynamoDB state in `alerts_state` table
- Check GitHub issue creation/updates
- Monitor DLQ for failed messages
- Batch Size: SQS batch size vs processing time tradeoff
- Memory: Lambda memory allocation based on payload size
- Concurrency: Reserved concurrency to prevent resource exhaustion
- Caching: Secret and token caching with appropriate TTLs
All changes should maintain security best practices including:
- Size Limits: Prevent DoS with reasonable payload limits (4KB descriptions)
- Field Validation: Strict type checking with allowlists
- HTML Encoding: Sanitize user-provided content for GitHub
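- Schema Validation: Runtime validation of incoming payloads against the canonical schema (TypeScript interfaces are erased at compile time, so the checks must be explicit)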
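The input validation points above can be sketched with three small helpers: an allowlist check for priority, a byte-based size limit for descriptions, and HTML escaping before content is placed in a GitHub issue. The helper names are hypothetical, and truncation (as opposed to rejection) of oversized descriptions is an assumption.

```typescript
const MAX_DESCRIPTION_BYTES = 4 * 1024;
const PRIORITIES = ["P0", "P1", "P2", "P3"] as const;
type Priority = (typeof PRIORITIES)[number];

// Allowlist check: anything outside the known priorities is rejected.
function isPriority(value: unknown): value is Priority {
  return typeof value === "string" && (PRIORITIES as readonly string[]).includes(value);
}

// Truncates by UTF-8 byte length without splitting a code point.
function truncateToBytes(text: string, maxBytes: number = MAX_DESCRIPTION_BYTES): string {
  let used = 0;
  let out = "";
  for (const ch of text) {
    const size = Buffer.byteLength(ch, "utf8");
    if (used + size > maxBytes) break;
    used += size;
    out += ch;
  }
  return out;
}

// Escapes the characters that could open HTML tags or entities in an issue body.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}
```

Escaping `&` first matters: doing it last would double-escape the entities produced for `<` and `>`.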
- Flexible Webhook Authentication: Support multiple webhook sources with header/token pairs stored in AWS Secrets Manager
- Timing-Safe Comparisons: Prevent timing attack vulnerabilities across all authentication methods
- GitHub App Tokens: Use installation tokens, not personal access tokens
- Secret Rotation: Support for GitHub App key rotation
- Least Privilege: Minimal IAM permissions with explicit resources
- Circuit Breakers: Prevent cascading failures
- Rate Limiting: Respect external API limits
- Graceful Degradation: Continue processing when GitHub unavailable
- Comprehensive Logging: Full context for debugging without secrets
- Encryption: At-rest (DynamoDB, S3) and in-transit (HTTPS, TLS)
- Network Isolation: VPC endpoints when required
- Secret Management: AWS Secrets Manager with rotation
- IAM Policies: Resource-specific permissions with conditions