@nogueiraanderson nogueiraanderson commented Nov 28, 2025

Jenkins pipelines for PMM HA testing on EKS.

Jobs

pmm3-ha-eks - Create EKS cluster with PMM HA

  • Deploys PMM HA stack using Helm charts from theTibi/percona-helm-charts (PMM-14420 branch)
  • Configures ALB Ingress with ACM certificate for HTTPS
  • Creates Route53 record: https://pmm-ha-test-{BUILD_NUMBER}.cd.percona.com
  • Outputs PMM credentials in build description and artifacts
  • Max 5 concurrent clusters, auto-cleanup on failure

pmm3-ha-eks-cleanup - Delete EKS clusters

  • Actions: LIST_ONLY, DELETE_CLUSTER, DELETE_ALL
  • Cron (twice daily): auto-deletes clusters older than 24h
  • Cleans up Route53 records and ALB before cluster deletion

Access

Users in pmm-eks-admins IAM group get kubectl access via EKS Access Entries:

aws eks update-kubeconfig --name pmm-ha-test-XX --region us-east-2

Testing

Validated with temporary jobs (to be deleted after merge):

Jira: PMM-14346

@nogueiraanderson nogueiraanderson force-pushed the fix/pmm-ha-eks-access-entries branch 2 times, most recently from 0f410b8 to 4d0bdaa Compare November 28, 2025 11:14
@nogueiraanderson nogueiraanderson changed the title fix(pmm): use EKS Access Entries API for cluster access PMM-14346: PMM HA EKS testing pipeline with ALB, Route53, and Access Entries Nov 28, 2025
@nogueiraanderson nogueiraanderson force-pushed the fix/pmm-ha-eks-access-entries branch 3 times, most recently from aa147f6 to 0ff5e20 Compare November 29, 2025 23:28
…Entries

- Add AWS Load Balancer Controller with IRSA for ALB ingress
- Add ALB Ingress with ACM certificate (*.cd.percona.com wildcard)
- Add Route53 alias records for friendly URLs (pmm-ha-test-N.cd.percona.com)
- Replace ConfigMap-based auth with EKS Access Entries API
- Add pmm-eks-admins IAM group for kubectl access
- Add SSO AdministratorAccess role support
- Add cleanup job with Route53/ALB cleanup before cluster deletion
- Extract shared library vars/pmmHaEks.groovy for reusable functions

Jira: PMM-14346
@nogueiraanderson nogueiraanderson force-pushed the fix/pmm-ha-eks-access-entries branch from 0ff5e20 to b38da5e Compare November 29, 2025 23:31
- Remove hardcoded account ID (119175775298), use aws sts get-caller-identity
- Remove hardcoded SSO role suffix, discover via aws iam list-roles
- Skip SSO role gracefully if not found in account
- Revert library branch to feature branch for testing

The JMESPath query returns trailing None values when using --output text,
causing the access entry creation to fail with an embedded newline.
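The failure mode described above can be guarded against in shell before the value is passed on. This is a minimal sketch, not the change made in this PR: `first_real_value` is a hypothetical helper name, the ARN is made-up sample data, and the input is assumed to be tab- or newline-separated tokens as produced by `--output text`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read AWS CLI `--output text` results on stdin and
# print the first token that is neither empty nor the literal "None".
first_real_value() {
  tr '\t' '\n' | grep -v -e '^None$' -e '^$' | head -n 1
}

# Simulated CLI output: one real value followed by a trailing None line.
printf 'arn:aws:iam::123456789012:role/ExampleRole\nNone\n' | first_real_value
```

Taking only the first non-None token also removes the embedded newline, since the result is always a single line.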
Discover availability zones from AWS at runtime instead of hardcoding.
Improves spot instance resilience - if one AZ has interruptions,
pods can reschedule to nodes in other AZs.

Remove comments that merely restate what the code does:
- 'Get AWS account ID dynamically'
- 'Install PMM HA'
- 'Wait for components'
- 'Wait for ALB'
- 'Create Route53 record'
- 'Delete the EKS cluster'

Cleanup requires kubectl and eksctl, which may not be available on CLI agents.

CLI agents have kubectl, eksctl, helm, and AWS CLI - same as cleanup.

Removes stale kubeconfig entries from previous builds that could
persist in the Jenkins workspace, ensuring the artifact contains
only the current cluster configuration.

ClickHouse merge operations were failing with MEMORY_LIMIT_EXCEEDED on
*.large instances (8GB RAM). Upgrade to *.xlarge (16GB RAM) to provide
sufficient memory headroom for the full PMM HA stack.

Date.parse() is not allowed in the Jenkins sandbox, causing DELETE_OLD
cron jobs to fail with RejectedAccessException. Use shell date -d
to convert ISO 8601 timestamps to epoch milliseconds instead.

Fixes cron builds #21, #22, #23 failing with:
"No such static method found: staticMethod java.util.Date parse"
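The shell-based conversion mentioned above might look roughly like the following. This is a sketch assuming GNU `date` (the `-d` flag is not available in BSD/macOS `date`); `iso_to_epoch_ms` is a hypothetical helper name, not necessarily what the library uses.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert an ISO 8601 timestamp to epoch milliseconds
# with GNU date. Sub-second precision is dropped by the %s format.
iso_to_epoch_ms() {
  local seconds
  seconds=$(date -u -d "$1" +%s) || return 1
  echo $(( seconds * 1000 ))
}

iso_to_epoch_ms "2000-01-01T00:00:00Z"   # prints 946684800000
```

Doing the parsing in a shell step keeps the Groovy side to plain string and number handling, which does not require sandbox approval for `java.util.Date` static methods.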
Default 4Gi memory limit causes MEMORY_LIMIT_EXCEEDED errors during
merge operations. Increase to 10Gi with 4Gi requests to allow proper
merge execution on xlarge nodes.

Move cluster management functions from cleanup pipeline to pmmHaEks.groovy:
- listClusters(): returns clusters sorted newest first (CPS-safe)
- deleteAllClusters(): parallel deletion with SKIP_NEWEST and age filter
- cleanupOrphans(): removes orphaned VPCs and failed CF stacks

Simplify pmm3-ha-eks-cleanup.groovy to high-level orchestration only.
Add SKIP_NEWEST parameter and CLEANUP_ORPHANS action.

Replace inline shell cluster discovery with pmmHaEks.listClusters():
- pmm3-ha-eks.groovy: Check Existing Clusters stage
- pmm3-ha-eks-cleanup.groovy: List Clusters stage

Reduces code duplication and ensures consistent behavior.

Replace readJSON (unavailable DSL method) with shell-based jq parsing.
Sorting by createdAt is done in shell using sort -r for CPS safety.
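The sorting step can be illustrated on its own. The sketch below assumes the cluster list has already been flattened to `createdAt<TAB>name` lines (for example by a jq filter such as `.[] | [.createdAt, .name] | @tsv`, where both field names are assumptions about the JSON shape); `newest_first` and the sample cluster names and timestamps are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "createdAt<TAB>name" lines on stdin and print
# the names newest first. ISO 8601 UTC timestamps in one consistent format
# sort chronologically as plain strings, so no date parsing is needed.
newest_first() {
  LC_ALL=C sort -r | cut -f 2
}

printf '%s\t%s\n' \
  '2025-11-28T09:00:00Z' 'pmm-ha-test-12' \
  '2025-11-29T23:31:00Z' 'pmm-ha-test-14' \
  '2025-11-28T11:14:00Z' 'pmm-ha-test-13' | newest_first
```

Because the ordering happens entirely inside the shell pipeline, the Groovy code receives an already sorted list and needs no comparator closure, which is the usual source of CPS trouble.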
eksctl cannot delete stacks with TerminationProtection enabled.
Add step to disable protection on all cluster-related CF stacks
before calling eksctl delete.
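A sketch of that extra step, assuming the stack names have already been discovered: `disable_termination_protection` is a hypothetical helper, and the stack names in the example call are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn off termination protection on every
# CloudFormation stack named in the arguments, so that a later
# `eksctl delete cluster` is able to remove the stacks.
disable_termination_protection() {
  local region="$1"; shift
  local stack
  for stack in "$@"; do
    aws cloudformation update-termination-protection \
      --region "$region" \
      --stack-name "$stack" \
      --no-enable-termination-protection
  done
}

# Example call (stack names are illustrative):
# disable_termination_protection us-east-2 \
#   eksctl-pmm-ha-test-XX-cluster eksctl-pmm-ha-test-XX-nodegroup-example
```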
Add configurable cluster retention (1-7 days) and optional custom PMM
admin password per PMM-14613.

Changes:
- RETENTION_DAYS: cluster survives cron cleanup for N days (default: 1)
- PMM_ADMIN_PASSWORD: user-provided or auto-generated 16-char password
- delete-after tag stored in cluster metadata for cleanup job
- deleteAllClusters() checks tag before deleting, falls back to 24h for legacy

Jira: PMM-14613
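The retention check described in this commit can be sketched as a small decision function. The tag format is not stated here, so this example assumes `delete-after` holds epoch seconds; `is_expired` and `LEGACY_MAX_AGE_SECONDS` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide whether a cluster is due for deletion.
# Arguments: now, created_at, delete_after (all epoch seconds). The third
# argument may be empty for legacy clusters that carry no delete-after tag.
LEGACY_MAX_AGE_SECONDS=$(( 24 * 3600 ))

is_expired() {
  local now="$1" created_at="$2" delete_after="${3:-}"
  if [ -n "$delete_after" ]; then
    # Tagged cluster: delete only once the tagged time has passed.
    [ "$now" -ge "$delete_after" ]
  else
    # Legacy cluster: fall back to the fixed 24-hour age limit.
    [ $(( now - created_at )) -gt "$LEGACY_MAX_AGE_SECONDS" ]
  fi
}

now=1000000
is_expired "$now" $(( now - 90000 )) ""                && echo "legacy, 25h old: delete"
is_expired "$now" $(( now - 90000 )) $(( now + 3600 )) || echo "tagged, not yet due: keep"
```

Both example lines print their message: the untagged cluster exceeds the 24-hour fallback, while the tagged one is kept even though it is older, because its tag lies in the future.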
Add PostgreSQL, ClickHouse, and VictoriaMetrics credentials to
pmm-credentials/access-info.txt artifact for convenience.
Add getCredentials() and writeAccessInfo() to pmmHaEks.groovy.
Simplifies main pipeline from ~450 to 387 lines.

…to library

- validateHelmChart(): validates branch exists and contains pmm-ha charts
- resolveR53ZoneId(): resolves Route53 zone ID from zone name (DRY)
- Main pipeline reduced from 387 to 352 lines
- Add named constants: MAX_CLUSTERS, DEFAULT_RETENTION_HOURS, etc.
- Add validateRetentionDays() for Groovy-level input validation
- Move retention validation from shell to Groovy

Reorganize pmmHaEks.groovy into 6 clearly labeled sections:
1. Constants
2. Validation Helpers
3. Credential Management
4. EKS Cluster Setup
5. PMM Installation
6. Cluster Lifecycle (list, delete, cleanup)

Update file header to list sections instead of individual functions.
Add visual section dividers for better navigation.
No functional changes.