[KF-7803] Adding Github action for deploying on EKS #55
Merged
45 commits
dfbbc4c Github action to deploy on EKS (afgambin)
a9298e5 Addressing Mano's review - bootstrapping (afgambin)
d37aea1 Updating AWS credentials config step (afgambin)
11492a4 Adding PR testing (afgambin)
4ff9065 Updating the .yaml file (afgambin)
c2edbea Adding a second testing .yaml file (afgambin)
76a8cd5 Adding dependencies files (afgambin)
434c5d4 Updates to the workflow (afgambin)
2fc8824 Update eksctl runner (afgambin)
cbde8ce Updating the action (afgambin)
2672f0f Adding pre-creation cluster steps (afgambin)
1b4f8e5 Updated CloudFormation run (afgambin)
3e12482 Merge remote-tracking branch 'origin/track/1.10' into kf-7803-gh-acti… (afgambin)
658c59e Testing with a new cluster (afgambin)
dcf3128 Updating dependency versions (afgambin)
11306fb Debugging (afgambin)
add2842 Updating tox dependencies (afgambin)
226f35c pytest missing (afgambin)
0c880e1 Clean up namespace (afgambin)
d98ccd8 Teak to the namespace clean up (afgambin)
c0bc628 Remove model creation from CLI (afgambin)
c4a6014 Passing AWS credentials to tox env (afgambin)
c2d4df4 Juju version (afgambin)
46964cb Pinning Juju version to 3.6/stable (afgambin)
128b0d9 Pinning Juju version (afgambin)
6b80f14 Adding deleting AWS volumes workflow (afgambin)
c0bc3cf Merge remote-tracking branch 'origin/track/1.10' into kf-7803-gh-acti… (afgambin)
a794de2 Fixing dependencies (afgambin)
b8b5586 Fixing typo with AWS volumes section (afgambin)
b7d445a Setting regions as output for AWS delete volumes (afgambin)
1346a0f Fixing duplicated code (afgambin)
00949df Removing testing sections (afgambin)
a1a3400 Bug fixing juju controller step (afgambin)
87f6189 Fixing region pass to reusable workflow (afgambin)
11cb7dd Updating AWS credentials config (afgambin)
9997908 Testing a version 2 of the action (afgambin)
ada5477 Testing without pinning Python version (afgambin)
787a1eb Removing testing yaml file (afgambin)
a2cdbbf Apply suggestions from code review (afgambin)
281edc2 Testing without pinning Python version (afgambin)
43bf954 Updating action: no Python version pinning needed (afgambin)
8ed1137 Removing labels from cluster.yaml file (afgambin)
6a9c3f0 K8s version updated in cluster config file (afgambin)
98292d2 Merge remote-tracking branch 'origin/track/1.10' into kf-7803-gh-acti… (afgambin)
68ecd8b Removing triggering action with PR (afgambin)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| apiVersion: eksctl.io/v1alpha5 | ||
| availabilityZones: | ||
| - eu-central-1a | ||
| - eu-central-1b | ||
| cloudWatch: | ||
|   clusterLogging: {} | ||
| iam: | ||
|   vpcResourceControllerPolicy: true | ||
|   withOIDC: false | ||
| addons: | ||
| - name: aws-ebs-csi-driver | ||
|   serviceAccountRoleARN: "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy" | ||
| kind: ClusterConfig | ||
| kubernetesNetworkConfig: | ||
|   ipFamily: IPv4 | ||
| managedNodeGroups: | ||
| - amiFamily: Ubuntu2204 | ||
|   iam: | ||
|     withAddonPolicies: | ||
|       ebs: true | ||
|   instanceType: t2.2xlarge | ||
|   maxSize: 2 | ||
|   minSize: 2 | ||
|   name: ng-d06bd84e | ||
|   releaseVersion: "" | ||
|   ssh: | ||
|     allow: true | ||
|   tags: | ||
|     alpha.eksctl.io/nodegroup-name: ng-d06bd84e | ||
|     alpha.eksctl.io/nodegroup-type: managed | ||
|   volumeSize: 100 | ||
| metadata: | ||
|   name: kubeflow-test | ||
|   region: eu-central-1 | ||
|   version: "1.32" | ||
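A quick consistency check on a config like this one can catch mismatches before `eksctl create cluster` spends minutes on CloudFormation. The sketch below is only an illustration: the dictionary mirrors a subset of the fields in the cluster config above, and `check_cluster_config` is a hypothetical helper, not part of eksctl or of this PR.

```python
"""Sanity checks for an eksctl-style ClusterConfig (illustrative sketch only)."""

# The dict mirrors a subset of the cluster config shown above; in practice it
# would be loaded from the YAML file with a YAML parser of your choice.
CLUSTER_CONFIG = {
    "availabilityZones": ["eu-central-1a", "eu-central-1b"],
    "managedNodeGroups": [
        {"name": "ng-d06bd84e", "minSize": 2, "maxSize": 2, "volumeSize": 100},
    ],
    "metadata": {"name": "kubeflow-test", "region": "eu-central-1", "version": "1.32"},
}


def check_cluster_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    region = config["metadata"]["region"]

    # Availability zone names start with the name of the region they belong to.
    for zone in config.get("availabilityZones", []):
        if not zone.startswith(region):
            problems.append(f"zone {zone} is not in region {region}")

    for group in config.get("managedNodeGroups", []):
        if group["minSize"] > group["maxSize"]:
            problems.append(f"node group {group['name']}: minSize exceeds maxSize")

    # An unquoted value such as 1.30 is parsed by YAML as the float 1.3,
    # so the Kubernetes version should stay a quoted string.
    if not isinstance(config["metadata"]["version"], str):
        problems.append("metadata.version should be a quoted string")

    return problems
```

Calling `check_cluster_config(CLUSTER_CONFIG)` on the values above returns an empty list; a zone from another region or an unquoted version would each add one entry.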
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| name: Delete unattached (available) EBS volumes | ||
| | ||
| on: | ||
|   workflow_dispatch: | ||
|     inputs: | ||
|       region: | ||
|         description: "AWS region to clean. Leave empty to clean ALL regions." | ||
|         required: false | ||
|         default: "" | ||
| | ||
|   workflow_call: | ||
|     inputs: | ||
|       region: | ||
|         description: "AWS region to clean. Leave empty to clean ALL regions." | ||
|         required: false | ||
|         default: "" | ||
|         type: string | ||
|     secrets: | ||
|       AWS_ACCESS_KEY_ID: | ||
|         required: true | ||
|       AWS_SECRET_ACCESS_KEY: | ||
|         required: true | ||
| | ||
| jobs: | ||
|   delete-volumes: | ||
|     runs-on: ubuntu-24.04 | ||
| | ||
|     steps: | ||
|       - name: Checkout repository | ||
|         uses: actions/checkout@v4 | ||
| | ||
|       - name: Configure AWS credentials | ||
|         # Use your repo/org secrets: AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY | ||
|         uses: aws-actions/configure-aws-credentials@v2 | ||
|         with: | ||
|           aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
|           aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
|           # Always needs *some* region; if input empty we'll still iterate all inside the script | ||
|           aws-region: ${{ inputs.region || 'eu-central-1' }} | ||
| | ||
|       - name: Set up Python | ||
|         uses: actions/setup-python@v5 | ||
|         with: | ||
|           python-version: '3.x' | ||
| | ||
|       - name: Install requirements | ||
|         run: | | ||
|           python -m pip install --upgrade pip | ||
|           pip install boto3 tenacity | ||
| | ||
|       - name: Run delete volumes script | ||
|         run: | | ||
|           if [ -n "${{ inputs.region }}" ]; then | ||
|             python scripts/delete_volumes.py "${{ inputs.region }}" | ||
|           else | ||
|             python scripts/delete_volumes.py | ||
|           fi | ||
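The `region` input is interpreted twice in this workflow: once by the `||` expression in the credentials step and once by the shell test in the last step. The following sketch restates both decisions in Python to make the empty-input behavior explicit. The function names are hypothetical and exist only for illustration.

```python
"""How the workflow's region input is interpreted (illustrative sketch only)."""

DEFAULT_CREDENTIALS_REGION = "eu-central-1"  # fallback used by the credentials step


def credentials_region(region_input: str) -> str:
    """Mirror `inputs.region || 'eu-central-1'`: an empty input falls back to the default."""
    return region_input or DEFAULT_CREDENTIALS_REGION


def script_command(region_input: str) -> list[str]:
    """Mirror the shell `if [ -n ... ]` branch that decides the script arguments."""
    command = ["python", "scripts/delete_volumes.py"]
    if region_input:
        # A non-empty input restricts the cleanup to that single region.
        command.append(region_input)
    return command
```

With an empty input the credentials are still configured for `eu-central-1`, but the script receives no argument and therefore iterates over every region itself.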
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,197 @@ | ||
| name: Create EKS cluster, deploy kubeflow-mlflow Terraform solution and run UATs | ||
| | ||
| on: | ||
|   workflow_dispatch: | ||
|     inputs: | ||
|       k8s_version: | ||
|         description: 'Kubernetes version to use for the EKS cluster (e.g. 1.27)' | ||
|         required: false | ||
|       uats_branch: | ||
|         description: 'Branch to run the UATs from, e.g., main or track/1.10' | ||
|         required: false | ||
|   schedule: | ||
|     - cron: "17 02 * * 1" | ||
| | ||
| env: | ||
|   CLUSTER_NAME: kubeflow-eks-test | ||
| | ||
| jobs: | ||
|   deploy-solution-to-eks: | ||
|     name: Deploy CKF + MLFlow solution to EKS | ||
|     runs-on: ubuntu-24.04 | ||
|     outputs: | ||
|       aws_region: ${{ steps.extract_region.outputs.region }} | ||
| | ||
|     steps: | ||
|       - name: Checkout repository | ||
|         uses: actions/checkout@v4 | ||
| | ||
|       - name: Set envvars from dependencies.yaml | ||
|         run: | | ||
|           yq eval 'to_entries | .[] | "\(.key)=\(.value)"' ".github/dependencies.yaml" | while IFS= read -r line; do | ||
|             echo "$line" >> "$GITHUB_ENV" | ||
|           done | ||
| | ||
|       - name: Update ENV variables from inputs if available | ||
|         run: | | ||
|           K8S_VERSION=${{ inputs.k8s_version || env.K8S_VERSION }} | ||
|           echo "K8S_VERSION=${K8S_VERSION}" >> $GITHUB_ENV | ||
|           UATS_BRANCH=${{ inputs.uats_branch || env.UATS_BRANCH }} | ||
|           echo "UATS_BRANCH=${UATS_BRANCH}" >> $GITHUB_ENV | ||
| | ||
|       - name: Extract AWS region from cluster.yaml | ||
|         id: extract_region | ||
|         run: | | ||
|           REGION=$(yq e '.metadata.region' .github/cluster.yaml) | ||
|           echo "AWS_REGION=$REGION" >> $GITHUB_ENV | ||
|           echo "region=$REGION" >> $GITHUB_OUTPUT | ||
| | ||
|       - name: Install CLI tools & dependencies | ||
|         run: | | ||
|           pip install tox | ||
|           sudo snap install juju --channel=${{ env.JUJU_VERSION }}/stable | ||
|           sudo snap install charmcraft --channel latest/stable --classic | ||
|           sudo snap install terraform --channel=latest/stable --classic | ||
|           juju version | ||
|           terraform --version | ||
|           charmcraft version | ||
| | ||
|       - name: Configure AWS credentials | ||
|         env: | ||
|           AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
|           AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
|         run: | | ||
|           mkdir -p ~/.aws | ||
|           # Read the secrets from the step's env instead of interpolating them into the script | ||
|           aws configure set aws_access_key_id "$AWS_ACCESS_KEY_ID" | ||
|           aws configure set aws_secret_access_key "$AWS_SECRET_ACCESS_KEY" | ||
|           aws configure set default.region "${{ env.AWS_REGION }}" | ||
|           echo "AWS_SDK_LOAD_CONFIG=1" >> "$GITHUB_ENV" | ||
| | ||
|       - name: Install kubectl | ||
|         run: | | ||
|           sudo snap install kubectl --classic --channel=${{ env.K8S_VERSION }}/stable | ||
|           mkdir -p ~/.kube | ||
|           kubectl version --client | ||
| | ||
|       - name: Install eksctl | ||
|         run: | | ||
|           PLATFORM=$(uname -s)_amd64 | ||
|           curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_${PLATFORM}.tar.gz" | tar xz -C /tmp | ||
|           sudo mv /tmp/eksctl /usr/local/bin | ||
|           eksctl version | ||
| | ||
|       # Once working, do we want to keep these two pre-deletion steps? | ||
|       - name: Pre-delete EKS cluster (if exists) | ||
|         run: | | ||
|           echo "Attempting to delete EKS cluster '${{ env.CLUSTER_NAME }}' (if it exists)..." | ||
|           eksctl delete cluster --region ${{ env.AWS_REGION }} --name ${{ env.CLUSTER_NAME }} || echo "Cluster not found or already deleted." | ||
| | ||
|           echo "Confirming deletion..." | ||
|           aws eks describe-cluster --region ${{ env.AWS_REGION }} --name ${{ env.CLUSTER_NAME }} || echo "Cluster no longer exists." | ||
| | ||
|       - name: Pre-delete CloudFormation stack (if exists) | ||
|         run: | | ||
|           STACK_NAME="eksctl-${{ env.CLUSTER_NAME }}-cluster" | ||
|           echo "Deleting CloudFormation stack '$STACK_NAME' (if it exists)..." | ||
|           aws cloudformation delete-stack --region ${{ env.AWS_REGION }} --stack-name "$STACK_NAME" || echo "Stack not found." | ||
| | ||
|           echo "Waiting (max 10 minutes) for stack deletion to complete..." | ||
|           timeout 600s aws cloudformation wait stack-delete-complete --region ${{ env.AWS_REGION }} --stack-name "$STACK_NAME" \ | ||
|             && echo "Stack deleted." \ | ||
|             || echo "Stack deletion timed out or failed (continuing)." | ||
| | ||
|           echo "Verifying stack is gone..." | ||
|           aws cloudformation describe-stacks --region ${{ env.AWS_REGION }} --stack-name "$STACK_NAME" 2>/dev/null \ | ||
|             || echo "Stack no longer exists." | ||
| | ||
|       - name: Create EKS cluster | ||
|         run: | | ||
|           yq e ".metadata.name |= \"${{ env.CLUSTER_NAME }}\"" -i .github/cluster.yaml | ||
|           yq e ".metadata.version |= \"${{ env.K8S_VERSION }}\"" -i .github/cluster.yaml | ||
| | ||
|           ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa <<<y >/dev/null 2>&1 | ||
|           eksctl create cluster -f .github/cluster.yaml | ||
|           kubectl get nodes | ||
| | ||
|       - name: Configure EKS nodes | ||
|         run: | | ||
|           echo "Configuring sysctl on EKS workers" | ||
|           source ./scripts/gh-actions/set_eks_sysctl_config.sh | ||
| | ||
|       - name: Setup Juju controller | ||
|         run: | | ||
|           /snap/juju/current/bin/juju add-k8s eks --client | ||
|           juju bootstrap eks eks-controller | ||
| | ||
|       - name: Deploy and assert kubeflow-mlflow solution | ||
|         env: | ||
|           AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
|           AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
|           AWS_REGION: ${{ env.AWS_REGION }} | ||
|         run: | | ||
|           tox -c ./modules/kubeflow-mlflow -vve test_deployment -- -vv -s | ||
| | ||
|       - name: Run UATs | ||
|         run: | | ||
|           git clone https://github.com/canonical/charmed-kubeflow-uats.git ~/charmed-kubeflow-uats | ||
|           cd ~/charmed-kubeflow-uats | ||
|           git checkout ${{ env.UATS_BRANCH }} | ||
|           tox -e uats-remote -- --filter "not feast" | ||
| | ||
|       # On failure, capture debugging resources | ||
|       - name: Select model (for debug) | ||
|         if: failure() || cancelled() | ||
|         run: juju switch eks-controller:kubeflow | ||
| | ||
|       - name: Save debug artifacts | ||
|         if: failure() || cancelled() | ||
|         uses: canonical/kubeflow-ci/actions/dump-charm-debug-artifacts@main | ||
| | ||
|       - name: Get juju status | ||
|         if: failure() || cancelled() | ||
|         run: juju status | ||
| | ||
|       - name: Get juju debug logs | ||
|         if: failure() || cancelled() | ||
|         run: juju debug-log --replay --no-tail | ||
| | ||
|       - name: Get all Kubernetes resources | ||
|         if: failure() || cancelled() | ||
|         run: kubectl get all -A | ||
| | ||
|       - name: Describe all pods | ||
|         if: failure() || cancelled() | ||
|         run: kubectl describe pods --all-namespaces | ||
| | ||
|       - name: Logs from Pending pods | ||
|         if: failure() || cancelled() | ||
|         run: | | ||
|           kubectl -n kubeflow get pods | tail -n +2 | grep Pending | awk '{print $1}' | xargs -r -n1 kubectl -n kubeflow logs --all-containers=true --tail 100 | ||
| | ||
|       - name: Logs from Failed pods | ||
|         if: failure() || cancelled() | ||
|         run: | | ||
|           kubectl -n kubeflow get pods | tail -n +2 | grep Failed | awk '{print $1}' | xargs -r -n1 kubectl -n kubeflow logs --all-containers=true --tail 100 | ||
| | ||
|       - name: Logs from CrashLoopBackOff pods | ||
|         if: failure() || cancelled() | ||
|         run: | | ||
|           kubectl -n kubeflow get pods | tail -n +2 | grep CrashLoopBackOff | awk '{print $1}' | xargs -r -n1 kubectl -n kubeflow logs --all-containers=true --tail 100 | ||
| | ||
|       # Clean up resources | ||
|       - name: Delete EKS cluster | ||
|         if: always() | ||
|         run: eksctl delete cluster --region ${{ env.AWS_REGION }} --name ${{ env.CLUSTER_NAME }} | ||
| | ||
|       - name: Delete CloudFormation stack | ||
|         if: always() | ||
|         run: aws cloudformation delete-stack --region ${{ env.AWS_REGION }} --stack-name eksctl-${{ env.CLUSTER_NAME }}-cluster | ||
| | ||
|   delete-unattached-volumes: | ||
|     name: Clean unattached EBS volumes | ||
|     if: always() | ||
|     needs: [deploy-solution-to-eks] | ||
|     uses: ./.github/workflows/delete-aws-volumes.yaml | ||
|     with: | ||
|       region: ${{ needs.deploy-solution-to-eks.outputs.aws_region }} | ||
|     secrets: inherit | ||
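The first steps of this workflow resolve their settings in three small moves: every top-level key of `.github/dependencies.yaml` becomes a `KEY=value` line in `$GITHUB_ENV`, a non-empty dispatch input overrides the matching env value, and the CloudFormation stack name is derived from the cluster name. The sketch below restates that logic in Python. The helper names are hypothetical, and the `dependencies` values are placeholders, since the contents of `dependencies.yaml` are not shown here.

```python
"""Settings resolution performed by the first workflow steps (illustrative sketch only)."""


def to_env_lines(dependencies: dict) -> list[str]:
    """Mirror the yq expression that turns each top-level key into a KEY=value line."""
    return [f"{key}={value}" for key, value in dependencies.items()]


def resolve(inputs: dict, env: dict, name: str) -> str:
    """Mirror `inputs.<x> || env.<X>`: a missing or empty input falls back to the env value."""
    return inputs.get(name.lower()) or env[name]


def cloudformation_stack_name(cluster_name: str) -> str:
    """Stack name the workflow passes to `aws cloudformation delete-stack`."""
    return f"eksctl-{cluster_name}-cluster"


# Placeholder values: the real .github/dependencies.yaml is not part of this view.
dependencies = {"K8S_VERSION": "1.32", "JUJU_VERSION": "3.6", "UATS_BRANCH": "main"}

if __name__ == "__main__":
    print("\n".join(to_env_lines(dependencies)))
    # A scheduled run has no inputs, so both settings come from dependencies.yaml.
    print(resolve({}, dependencies, "K8S_VERSION"))
    print(cloudformation_stack_name("kubeflow-eks-test"))
```

Because an empty string is falsy in both GitHub expressions and Python, a manual run that leaves `k8s_version` blank behaves the same as the scheduled run.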
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -33,5 +33,6 @@ deps = | ||
|     tenacity | ||
|     ops>=2.3.0 | ||
|     juju<4.0.0 | ||
|     pytest | ||
|     pytest-dependency | ||
| description = Test bundle deployment | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| # Delete unattached EBS volumes (state=available) in all AWS regions | ||
| # source: https://towardsthecloud.com/amazon-ec2-delete-unattached-ebs-volumes | ||
| import sys | ||
| | ||
| import boto3 | ||
| from tenacity import retry, stop_after_attempt, wait_fixed | ||
| | ||
| | ||
| @retry(stop=stop_after_attempt(3), wait=wait_fixed(2), reraise=True) | ||
| def delete_volumes_in_region(region_name: str, count: int) -> int: | ||
|     try: | ||
|         ec2conn = boto3.resource("ec2", region_name=region_name) | ||
|         unattached_volumes = [ | ||
|             volume for volume in ec2conn.volumes.all() if volume.state == "available" | ||
|         ] | ||
|         for volume in unattached_volumes: | ||
|             volume.delete() | ||
|             print(f"Deleted unattached volume {volume.id} in region {region_name}.") | ||
|             count = count + 1 | ||
|         return count | ||
|     except Exception as e: | ||
|         print(f"Error: {e}") | ||
|         # Bare raise keeps the original traceback so tenacity can retry on it | ||
|         raise | ||
| | ||
| | ||
| def validate_region(region_name: str) -> bool: | ||
|     ec2 = boto3.client("ec2") | ||
|     regions = ec2.describe_regions()["Regions"] | ||
|     regions_names = [region["RegionName"] for region in regions] | ||
|     return region_name in regions_names | ||
| | ||
| | ||
| def delete_volumes() -> None: | ||
|     count = 0 | ||
|     if len(sys.argv) > 1: | ||
|         region_name = sys.argv[1] | ||
|         if validate_region(region_name): | ||
|             count = delete_volumes_in_region(region_name, count) | ||
|         else: | ||
|             raise ValueError(f"Region {region_name!r} isn't being used in this AWS account.") | ||
|     else: | ||
|         ec2 = boto3.client("ec2") | ||
|         for region in ec2.describe_regions()["Regions"]: | ||
|             region_name = region["RegionName"] | ||
|             count = delete_volumes_in_region(region_name, count) | ||
| | ||
|     if count > 0: | ||
|         print(f"Deleted {count} unattached volumes.") | ||
|     else: | ||
|         print("No unattached volumes found for deletion.") | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     delete_volumes() | ||
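The script lists every volume in a region and filters on `state` client-side. One possible variation, sketched below and not part of this PR, asks EC2 to filter server-side through the `status` filter of `describe_volumes` and adds a report-only mode. The function names and the `dry_run` flag are hypothetical. The client is passed in as a parameter so that the logic can be exercised with a stub instead of real AWS credentials.

```python
"""Listing unattached volumes with a server-side filter (illustrative sketch only)."""


def find_unattached_volume_ids(ec2_client) -> list[str]:
    """Return the IDs of volumes in the `available` state, following pagination."""
    paginator = ec2_client.get_paginator("describe_volumes")
    pages = paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}])
    return [volume["VolumeId"] for page in pages for volume in page["Volumes"]]


def delete_unattached_volumes(ec2_client, dry_run: bool = True) -> list[str]:
    """Delete (or, in dry-run mode, only report) every unattached volume."""
    volume_ids = find_unattached_volume_ids(ec2_client)
    for volume_id in volume_ids:
        if dry_run:
            # Hypothetical safety switch: nothing is deleted unless dry_run=False.
            print(f"Would delete {volume_id}")
        else:
            ec2_client.delete_volume(VolumeId=volume_id)
            print(f"Deleted {volume_id}")
    return volume_ids
```

With boto3 installed, a call could look like `delete_unattached_volumes(boto3.client("ec2", region_name="eu-central-1"))`, which only prints the candidates until `dry_run=False` is passed.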