
Commit 9ae42cd

afgambin and mvlassis authored
[KF-7803] Adding Github action for deploying on EKS (#55)
* Github action to deploy on EKS
* Addressing Mano's review - bootstrapping
* Updating AWS credentials config step
* Adding PR testing
* Updating the .yaml file
* Adding a second testing .yaml file
* Adding dependencies files
* Updates to the workflow
* Update eksctl runner
* Updating the action
* Adding pre-creation cluster steps
* Updated CloudFormation run
* Testing with a new cluster
* Updating dependency versions
* Debugging
* Updating tox dependencies
* pytest missing
* Clean up namespace
* Tweak to the namespace clean up
* Remove model creation from CLI
* Passing AWS credentials to tox env
* Juju version
* Pinning Juju version to 3.6/stable
* Pinning Juju version
* Adding deleting AWS volumes workflow
* Fixing dependencies
* Fixing typo with AWS volumes section
* Setting regions as output for AWS delete volumes
* Fixing duplicated code
* Removing testing sections
* Bug fixing juju controller step
* Fixing region pass to reusable workflow
* Updating AWS credentials config
* Testing a version 2 of the action
* Testing without pinning Python version
* Removing testing yaml file
* Apply suggestions from code review

  Co-authored-by: Manos Vlassis <57320708+mvlassis@users.noreply.github.com>
  Signed-off-by: Angel Fernandez <103958447+afgambin@users.noreply.github.com>
* Testing without pinning Python version
* Updating action: no Python version pinning needed
* Removing labels from cluster.yaml file
* K8s version updated in cluster config file
* Removing triggering action with PR

---------

Signed-off-by: Angel Fernandez <103958447+afgambin@users.noreply.github.com>
Co-authored-by: Manos Vlassis <57320708+mvlassis@users.noreply.github.com>
1 parent 68b2a8f commit 9ae42cd

File tree

6 files changed: +393 −0 lines changed


.github/cluster.yaml

Lines changed: 35 additions & 0 deletions
```yaml
apiVersion: eksctl.io/v1alpha5
availabilityZones:
- eu-central-1a
- eu-central-1b
cloudWatch:
  clusterLogging: {}
iam:
  vpcResourceControllerPolicy: true
  withOIDC: false
addons:
- name: aws-ebs-csi-driver
  serviceAccountRoleARN: "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
kind: ClusterConfig
kubernetesNetworkConfig:
  ipFamily: IPv4
managedNodeGroups:
- amiFamily: Ubuntu2204
  iam:
    withAddonPolicies:
      ebs: true
  instanceType: t2.2xlarge
  maxSize: 2
  minSize: 2
  name: ng-d06bd84e
  releaseVersion: ""
  ssh:
    allow: true
  tags:
    alpha.eksctl.io/nodegroup-name: ng-d06bd84e
    alpha.eksctl.io/nodegroup-type: managed
  volumeSize: 100
metadata:
  name: kubeflow-test
  region: eu-central-1
  version: "1.32"
```
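The deploy workflow below rewrites `metadata.name` and `metadata.version` in this file with yq before creating the cluster. A minimal Python sketch of that same in-place patch (the function name `patch_cluster_config` is illustrative, and it assumes the file keeps the layout shown above, with the metadata name `kubeflow-test` and a quoted version string):

```python
import re


def patch_cluster_config(text: str, name: str, version: str) -> str:
    """Rewrite metadata.name and metadata.version in an eksctl cluster.yaml string."""
    # Anchor on the default metadata name so the nodegroup's own `name:` is untouched.
    text = re.sub(r"^(\s*name: )kubeflow-test$", rf"\g<1>{name}", text, flags=re.M)
    # The version is stored as a quoted string (e.g. "1.32"); keep it quoted.
    text = re.sub(r'^(\s*version: )"[^"]*"$', rf'\g<1>"{version}"', text, flags=re.M)
    return text
```

This mirrors `yq e '.metadata.name |= "..."' -i .github/cluster.yaml` without requiring yq on the runner; a real implementation would parse the YAML rather than pattern-match lines.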
.github/workflows/delete-aws-volumes.yaml

Lines changed: 58 additions & 0 deletions

```yaml
name: Delete unattached (available) EBS volumes

on:
  workflow_dispatch:
    inputs:
      region:
        description: "AWS region to clean. Leave empty to clean ALL regions."
        required: false
        default: ""

  workflow_call:
    inputs:
      region:
        description: "AWS region to clean. Leave empty to clean ALL regions."
        required: false
        default: ""
        type: string
    secrets:
      AWS_ACCESS_KEY_ID:
        required: true
      AWS_SECRET_ACCESS_KEY:
        required: true

jobs:
  delete-volumes:
    runs-on: ubuntu-24.04

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        # Use your repo/org secrets: AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          # Always needs *some* region; if input is empty we still iterate all inside the script
          aws-region: ${{ inputs.region || 'eu-central-1' }}

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install requirements
        run: |
          python -m pip install --upgrade pip
          pip install boto3 tenacity

      - name: Run delete volumes script
        run: |
          if [ -n "${{ inputs.region }}" ]; then
            python scripts/delete_volumes.py "${{ inputs.region }}"
          else
            python scripts/delete_volumes.py
          fi
```
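The cleanup rule this workflow delegates to `scripts/delete_volumes.py` is simple: an EBS volume is safe to delete only when its state is `available`, i.e. it is not attached to any instance. A small sketch of that filter, using plain dicts in place of boto3 `Volume` resources (the helper name `select_unattached` is illustrative):

```python
def select_unattached(volumes: list[dict]) -> list[str]:
    """Return the IDs of volumes in the 'available' (unattached) state.

    Each volume is modeled as a dict with 'VolumeId' and 'State' keys,
    matching the shape of EC2 DescribeVolumes records; the real script
    iterates boto3 Volume resources instead.
    """
    return [v["VolumeId"] for v in volumes if v["State"] == "available"]
```

Volumes that are `in-use`, `creating`, or `deleting` are left alone, which is what keeps the cleanup safe to run against a live account.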
Lines changed: 197 additions & 0 deletions

```yaml
name: Create EKS cluster, deploy kubeflow-mlflow Terraform solution and run UATs

on:
  workflow_dispatch:
    inputs:
      k8s_version:
        description: 'Kubernetes version to use for the EKS cluster (e.g. 1.27)'
        required: false
      uats_branch:
        description: 'Branch to run the UATs from, e.g., main or track/1.10'
        required: false
  schedule:
    - cron: "17 02 * * 1"

env:
  CLUSTER_NAME: kubeflow-eks-test

jobs:
  deploy-solution-to-eks:
    name: Deploy CKF + MLFlow solution to EKS
    runs-on: ubuntu-24.04
    outputs:
      aws_region: ${{ steps.extract_region.outputs.region }}

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set envvars from dependencies.yaml
        run: |
          yq eval 'to_entries | .[] | "\(.key)=\(.value)"' ".github/dependencies.yaml" | while IFS= read -r line; do
            echo "$line" >> "$GITHUB_ENV"
          done

      - name: Update ENV variables from inputs if available
        run: |
          K8S_VERSION=${{ inputs.k8s_version || env.K8S_VERSION }}
          echo "K8S_VERSION=${K8S_VERSION}" >> $GITHUB_ENV
          UATS_BRANCH=${{ inputs.uats_branch || env.UATS_BRANCH }}
          echo "UATS_BRANCH=${UATS_BRANCH}" >> $GITHUB_ENV

      - name: Extract AWS region from cluster.yaml
        id: extract_region
        run: |
          REGION=$(yq e '.metadata.region' .github/cluster.yaml)
          echo "AWS_REGION=$REGION" >> $GITHUB_ENV
          echo "region=$REGION" >> $GITHUB_OUTPUT

      - name: Install CLI tools & dependencies
        run: |
          pip install tox
          sudo snap install juju --channel=${{ env.JUJU_VERSION }}/stable
          sudo snap install charmcraft --channel latest/stable --classic
          sudo snap install terraform --channel=latest/stable --classic
          juju version
          terraform --version
          charmcraft version

      - name: Configure AWS credentials
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          mkdir -p ~/.aws
          aws configure set aws_access_key_id "${{ secrets.AWS_ACCESS_KEY_ID }}"
          aws configure set aws_secret_access_key "${{ secrets.AWS_SECRET_ACCESS_KEY }}"
          aws configure set default.region "${{ env.AWS_REGION }}"
          echo "AWS_SDK_LOAD_CONFIG=1" >> "$GITHUB_ENV"

      - name: Install kubectl
        run: |
          sudo snap install kubectl --classic --channel=${{ env.K8S_VERSION }}/stable
          mkdir ~/.kube
          kubectl version --client

      - name: Install eksctl
        run: |
          PLATFORM=$(uname -s)_amd64
          curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_${PLATFORM}.tar.gz" | tar xz -C /tmp
          sudo mv /tmp/eksctl /usr/local/bin
          eksctl version

      # Once working, do we want to keep these two pre-deletion steps?
      - name: Pre-delete EKS cluster (if exists)
        run: |
          echo "Attempting to delete EKS cluster '${{ env.CLUSTER_NAME }}' (if it exists)..."
          eksctl delete cluster --region ${{ env.AWS_REGION }} --name ${{ env.CLUSTER_NAME }} || echo "Cluster not found or already deleted."

          echo "Confirming deletion..."
          aws eks describe-cluster --region ${{ env.AWS_REGION }} --name ${{ env.CLUSTER_NAME }} || echo "Cluster no longer exists."

      - name: Pre-delete CloudFormation stack (if exists)
        run: |
          STACK_NAME="eksctl-${{ env.CLUSTER_NAME }}-cluster"
          echo "Deleting CloudFormation stack '$STACK_NAME' (if it exists)..."
          aws cloudformation delete-stack --region ${{ env.AWS_REGION }} --stack-name "$STACK_NAME" || echo "Stack not found."

          echo "Waiting (max 10 minutes) for stack deletion to complete..."
          timeout 600s aws cloudformation wait stack-delete-complete --region ${{ env.AWS_REGION }} --stack-name "$STACK_NAME" \
            && echo "Stack deleted." \
            || echo "Stack deletion timed out or failed (continuing)."

          echo "Verifying stack is gone..."
          aws cloudformation describe-stacks --region ${{ env.AWS_REGION }} --stack-name "$STACK_NAME" 2>/dev/null \
            || echo "Stack no longer exists."

      - name: Create EKS cluster
        run: |
          yq e ".metadata.name |= \"${{ env.CLUSTER_NAME }}\"" -i .github/cluster.yaml
          yq e ".metadata.version |= \"${{ env.K8S_VERSION }}\"" -i .github/cluster.yaml

          ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa <<<y >/dev/null 2>&1
          eksctl create cluster -f .github/cluster.yaml
          kubectl get nodes

      - name: Configure EKS nodes
        run: |
          echo "Configuring sysctl on EKS workers"
          source ./scripts/gh-actions/set_eks_sysctl_config.sh

      - name: Setup Juju controller
        run: |
          /snap/juju/current/bin/juju add-k8s eks --client
          juju bootstrap eks eks-controller

      - name: Deploy and assert kubeflow-mlflow solution
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: ${{ env.AWS_REGION }}
        run: |
          tox -c ./modules/kubeflow-mlflow -vve test_deployment -- -vv -s

      - name: Run UATs
        run: |
          git clone https://github.com/canonical/charmed-kubeflow-uats.git ~/charmed-kubeflow-uats
          cd ~/charmed-kubeflow-uats
          git checkout ${{ env.UATS_BRANCH }}
          tox -e uats-remote -- --filter "not feast"

      # On failure, capture debugging resources
      - name: Select model (for debug)
        if: failure() || cancelled()
        run: juju switch eks-controller:kubeflow

      - name: Save debug artifacts
        if: failure() || cancelled()
        uses: canonical/kubeflow-ci/actions/dump-charm-debug-artifacts@main

      - name: Get juju status
        if: failure() || cancelled()
        run: juju status

      - name: Get juju debug logs
        if: failure() || cancelled()
        run: juju debug-log --replay --no-tail

      - name: Get all Kubernetes resources
        if: failure() || cancelled()
        run: kubectl get all -A

      - name: Describe all pods
        if: failure() || cancelled()
        run: kubectl describe pods --all-namespaces

      - name: Logs from Pending pods
        if: failure() || cancelled()
        run: |
          kubectl -n kubeflow get pods | tail -n +2 | grep Pending | awk '{print $1}' | xargs -r -n1 kubectl -n kubeflow logs --all-containers=true --tail 100

      - name: Logs from Failed pods
        if: failure() || cancelled()
        run: |
          kubectl -n kubeflow get pods | tail -n +2 | grep Failed | awk '{print $1}' | xargs -r -n1 kubectl -n kubeflow logs --all-containers=true --tail 100

      - name: Logs from CrashLoopBackOff pods
        if: failure() || cancelled()
        run: |
          kubectl -n kubeflow get pods | tail -n +2 | grep CrashLoopBackOff | awk '{print $1}' | xargs -r -n1 kubectl -n kubeflow logs --all-containers=true --tail 100

      # Clean up resources
      - name: Delete EKS cluster
        if: always()
        run: eksctl delete cluster --region ${{ env.AWS_REGION }} --name ${{ env.CLUSTER_NAME }}

      - name: Delete CloudFormation stack
        if: always()
        run: aws cloudformation delete-stack --region ${{ env.AWS_REGION }} --stack-name eksctl-${{ env.CLUSTER_NAME }}-cluster

  delete-unattached-volumes:
    name: Clean unattached EBS volumes
    if: always()
    needs: [deploy-solution-to-eks]
    uses: ./.github/workflows/delete-aws-volumes.yaml
    with:
      region: ${{ needs.deploy-solution-to-eks.outputs.aws_region }}
    secrets: inherit
```
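The "Set envvars from dependencies.yaml" step uses a yq one-liner to turn a flat `key: value` mapping into `KEY=value` lines for `$GITHUB_ENV`. A Python sketch of that conversion, assuming `dependencies.yaml` is a flat mapping of scalars (the keys `K8S_VERSION` and `JUJU_VERSION` are implied by later steps; the helper name is illustrative):

```python
def yaml_to_env_lines(text: str) -> list[str]:
    """Convert a flat 'key: value' YAML mapping into KEY=value lines.

    Mirrors: yq eval 'to_entries | .[] | "\\(.key)=\\(.value)"' dependencies.yaml
    Only handles flat scalar mappings; nested YAML would need a real parser.
    """
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        value = value.strip().strip('"')  # unquote values like "1.32"
        lines.append(f"{key.strip()}={value}")
    return lines
```

In the workflow each emitted line is appended to `$GITHUB_ENV`, which is why later steps can read `env.K8S_VERSION` and `env.JUJU_VERSION` directly.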

modules/kubeflow-mlflow/tox.ini

Lines changed: 1 addition & 0 deletions

```diff
@@ -33,5 +33,6 @@ deps =
     tenacity
     ops>=2.3.0
     juju<4.0.0
+    pytest
     pytest-dependency
 description = Test bundle deployment
```

scripts/delete_volumes.py

Lines changed: 49 additions & 0 deletions

```python
# Delete unattached EBS volumes (state=available) in all AWS regions
# source: https://towardsthecloud.com/amazon-ec2-delete-unattached-ebs-volumes
import sys

import boto3
from tenacity import retry, stop_after_attempt, wait_fixed


@retry(stop=stop_after_attempt(3), wait=wait_fixed(2), reraise=True)
def delete_volumes_in_region(region_name: str, count: int) -> int:
    try:
        ec2conn = boto3.resource("ec2", region_name=region_name)
        unattached_volumes = [
            volume for volume in ec2conn.volumes.all() if volume.state == "available"
        ]
        for volume in unattached_volumes:
            volume.delete()
            print(f"Deleted unattached volume {volume.id} in region {region_name}.")
            count += 1
        return count
    except Exception as e:
        print(f"Error: {e}")
        raise


def validate_region(region_name: str) -> bool:
    ec2 = boto3.client("ec2")
    regions = ec2.describe_regions()["Regions"]
    region_names = [region["RegionName"] for region in regions]
    return region_name in region_names


def delete_volumes() -> None:
    count = 0
    if len(sys.argv) > 1:
        region_name = sys.argv[1]
        if validate_region(region_name):
            count = delete_volumes_in_region(region_name, count)
        else:
            raise ValueError("Region from input isn't enabled in this AWS account.")
    else:
        ec2 = boto3.client("ec2")
        for region in ec2.describe_regions()["Regions"]:
            count = delete_volumes_in_region(region["RegionName"], count)

    if count > 0:
        print(f"Deleted {count} unattached volumes.")
    else:
        print("No unattached volumes found for deletion.")


if __name__ == "__main__":
    delete_volumes()
```
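The `@retry` decorator above makes each per-region deletion attempt up to three times with a fixed 2-second wait, re-raising the last error. A stdlib-only sketch of the behavior tenacity provides here (purely illustrative; the real script should keep using tenacity):

```python
import time
from functools import wraps


def retry_fixed(attempts: int = 3, wait_seconds: float = 2.0):
    """Minimal stand-in for tenacity's
    retry(stop=stop_after_attempt(n), wait=wait_fixed(s), reraise=True)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # reraise=True: surface the final failure
                    time.sleep(wait_seconds)
        return wrapper
    return decorator
```

This matters for the volume cleanup because EC2 API calls can fail transiently; three attempts with a short pause smooths over most throttling without masking a real error.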
