Skip to content

Lifecycle clean routes#11

Draft
daxelrod-rh wants to merge 3 commits intolifecycle-add-routesfrom
lifecycle-clean-routes
Draft

Lifecycle clean routes#11
daxelrod-rh wants to merge 3 commits intolifecycle-add-routesfrom
lifecycle-clean-routes

Conversation

@daxelrod-rh
Copy link
Copy Markdown
Collaborator

@daxelrod-rh daxelrod-rh commented Apr 2, 2026

Please merge #8 before this

done

echo ""
echo "Cleanup complete: deleted=$DELETED_COUNT skipped=$SKIPPED_COUNT"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker: No reason for the script to be duplicated in the repo. Either have just the YAML or just this and template it into the YAML.

- name: HOME
value: /tmp
command:
- /bin/bash
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker: Would this be cleaner in Python? aws-cli containers should have Python in them.

# 2. Get all daemonset-managed peers (exclude already deleted)
echo "Querying Route Server for managed BGP peers..."
ALL_PEERS=$(aws ec2 describe-route-server-peers \
--query "RouteServerPeers[?Tags[?Key=='managed-by' && Value=='daemonset'] && State!='deleted'].{Id:RouteServerPeerId,Ip:PeerAddress,State:State,Tags:Tags}" \
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker: Without a specific cluster tag, this has the limitation that a particular Route Server can only peer with one ROSA cluster.

AGE_MINUTES=$(( (NOW_EPOCH - CREATED_EPOCH) / 60 ))
else
# No created-at tag (legacy peer) - treat as very old to allow deletion
AGE_MINUTES=999999
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker: might be safer to not clean up peers without created-at tag.

@daxelrod-rh daxelrod-rh force-pushed the lifecycle-add-routes branch 2 times, most recently from 556c5cd to 242e7a2 Compare April 2, 2026 19:25
Implements a Kubernetes CronJob that periodically identifies and deletes
stale Route Server BGP peers from deleted or replaced ROSA worker nodes.

Changes:
- Add cleanup-cronjob.yaml: CronJob running every 15 minutes to clean up
  stale peers by comparing EC2 instance IPs with Route Server peer IPs
- Add cleanup-script.sh: Standalone version for manual testing
- Update eni-srcdst-iam.tf: Add ec2:DeleteRouteServerPeer and
  ec2:DescribeInstances permissions
- Update deploy.sh: Deploy CronJob with environment variable substitution
- Update README.md: Document cleanup functionality, troubleshooting, and
  manual testing procedures

Safety mechanisms:
- Only processes peers tagged managed-by=daemonset
- Queries EC2 for active router node IPs (source of truth)
- Dry-run mode available for testing (DRY_RUN=true)
- Concurrency control prevents overlapping runs
- Skips peers in 'deleted' state

Note: AWS Route Server Peer objects lack CreationTime field, so the
planned 30-minute age threshold cannot be reliably implemented. For now,
age calculation defaults to 0 minutes. Future enhancement: add created-at
timestamp tag when creating peers.

Tested: Successfully deleted 6 stale peers from old non-router nodes while
preserving 6 active peers on current router nodes (10.0.1.49, 10.0.2.33,
10.0.3.56).

Assisted-by: Claude Code (Claude Sonnet 4.5)
Implements proper age-based safety threshold for BGP peer cleanup by
adding a created-at timestamp tag when peers are created.

Changes:
- Update daemonset.yaml: Add created-at tag with ISO8601 timestamp when
  creating BGP peers
- Update cleanup-script.sh: Read created-at from tag instead of
  non-existent CreationTime field; treat legacy peers without tag as
  very old (eligible for cleanup if IP is stale)
- Update cleanup-cronjob.yaml: Same age calculation logic as script

How it works:
- DaemonSet adds created-at tag: "2026-04-02T13:06:03Z"
- Cleanup reads tag and calculates age in minutes
- Only deletes stale peers older than MIN_AGE_MINUTES (default: 30)
- Legacy peers without tag get age=999999 (always eligible if stale)

Tested:
- Created 3 new peers with created-at tags
- Cleanup script correctly reads timestamps and calculates ages
- CronJob runs successfully with 30-minute threshold
- Legacy peers (without tag) handled gracefully

This addresses the limitation where AWS Route Server Peer objects
lack a CreationTime field, enabling the 30-minute safety threshold
to prevent race conditions during peer creation.

Assisted-by: Claude Code (Claude Sonnet 4.5)
Use the OpenShift mechanism for assuming pod identity, rather than the
incorrect EKS one.

This was again somehow uncommitted before.
@daxelrod-rh daxelrod-rh force-pushed the lifecycle-clean-routes branch from a09678c to 49e5537 Compare April 2, 2026 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant