Caio Trevisan - Cloud Engineer at Contino
-
Multi-AZ Kubernetes control plane deployment managed by AWS
-
auto-healing
-
on-demand patching and upgrades
-
~US$150/month
-
need workers -- paid for separately
- On-Demand: pay by the hour or second
- Reserved Instances: up to 75% discount, one- to three-year commitment
- Spot Instances: bid for spare EC2 capacity, up to 90% discount
- predictable pricing
- up to 90% savings
- termination notice
- ~2 minutes -- metadata warning
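The two-minute warning surfaces through the instance metadata service; a minimal watcher sketch (IMDSv1 path per the AWS docs, the drain command is illustrative):

```shell
# Poll the spot instance-action metadata path (IMDSv1); a 200 response means
# the instance will be reclaimed in roughly 2 minutes.
watch_for_termination() {
  while true; do
    if curl -sf --max-time 2 \
        http://169.254.169.254/latest/meta-data/spot/instance-action >/dev/null; then
      echo "termination-notice"
      return 0
    fi
    sleep 5
  done
}

# On a spot node you would run something like:
# watch_for_termination && kubectl drain "$(hostname)" --ignore-daemonsets
```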
Good use for:
- Flexible start/end times
- Applications that handle failure well
- Large computing needs for jobs like data processing
AWS EC2 Spot Instances Documentation
Pricing of an m5.large instance from Jul/19 to Oct/19 in the Sydney, AU region (ap-southeast-2).
Instance Type | Price (USD/hr)
---|---
On-demand | 0.12
Spot | 0.0362
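Plugging the table's numbers in, the spot discount for this instance works out to about 70%:

```shell
# Spot saving vs on-demand from the table above (USD/hour, ap-southeast-2).
ondemand=0.12
spot=0.0362
awk -v od="$ondemand" -v sp="$spot" \
  'BEGIN { printf "saving: %.0f%%\n", (od - sp) / od * 100 }'
# -> saving: 70%
```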
- collection/group/fleet of spot instances
- the request runs until target capacity is reached or the spot price exceeds your maximum price
- one-time / maintain
- launch specifications: instance types / AZs (up to 50)
- target capacity
- on-demand portion
- defined price vs on-demand price
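A fleet matching the bullets above can be requested with the AWS CLI; a sketch with illustrative values (the role ARN, AMI ID, and capacity numbers are placeholders):

```shell
# Write an illustrative spot fleet config: a target capacity, an on-demand
# portion, "maintain" mode, and multiple launch specifications.
cat > fleet-config.json <<'EOF'
{
  "TargetCapacity": 4,
  "OnDemandTargetCapacity": 1,
  "Type": "maintain",
  "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",
  "LaunchSpecifications": [
    { "ImageId": "ami-12345678", "InstanceType": "m5.large" },
    { "ImageId": "ami-12345678", "InstanceType": "m4.large" }
  ]
}
EOF

# Submit it (requires AWS credentials, so commented out here):
# aws ec2 request-spot-fleet --spot-fleet-request-config file://fleet-config.json
```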
- two sets of worker nodes
- spot: scales up above 70% load
- on-demand: scales up above 90% load
PS: think ahead and overprovision before any expected event
- queue requests on SQS or any other queue service
- scale workers based on quantity of jobs queued
- Taint spot workers with PreferNoSchedule so jobs run first on on-demand workers and fall back to Spot only when no other resources are available
- non-critical services that can retry on failure make a good use case for the savings
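The queue-driven scaling described above can be sketched as a small cron job: read the SQS backlog, derive a node count, set the ASG desired capacity. The queue URL, ASG name, and jobs-per-node ratio below are placeholders:

```shell
# One worker node per JOBS_PER_NODE queued jobs, never below 1.
desired_capacity() {  # desired_capacity <backlog> <jobs_per_node>
  local d=$(( ($1 + $2 - 1) / $2 ))
  [ "$d" -lt 1 ] && d=1
  echo "$d"
}

# Against real AWS (credentials required, hence commented out):
# backlog=$(aws sqs get-queue-attributes --queue-url "$QUEUE_URL" \
#   --attribute-names ApproximateNumberOfMessages \
#   --query 'Attributes.ApproximateNumberOfMessages' --output text)
# aws autoscaling set-desired-capacity --auto-scaling-group-name "$ASG_NAME" \
#   --desired-capacity "$(desired_capacity "$backlog" 10)"
```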
- Western Digital ran a simulation of close to 2.5 million tasks in just 8 hours on more than 50k instances (1 million vCPUs!), costing around US$140k -- estimated at half the cost of running even on in-house infrastructure
- S3 was used to save results and checkpoints whenever an instance was scheduled to terminate
- scales NODES up/down when pods cannot be scheduled
- keeps checking for pending pods
- sends an API call to the ASG when scaling is needed
- userdata/scripts join the new node to the cluster
- Kubernetes allocates pods to the newly added nodes
- CA is not based on actual load but on requests/limits
- how much memory/CPU you allocate to a pod
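Because CA reasons about requested capacity rather than observed load, a toy fit check makes the point (numbers in millicores and MiB, purely illustrative):

```shell
# CA-style decision: does the pod's *request* fit the node's free allocatable?
fits_node() {  # fits_node <req_cpu_m> <req_mem_Mi> <free_cpu_m> <free_mem_Mi>
  if [ "$1" -le "$3" ] && [ "$2" -le "$4" ]; then
    echo "fits"
  else
    echo "pending"   # a Pending pod is what triggers a CA scale-up
  fi
}

fits_node 200 128 1500 2048    # small request, roomy node
fits_node 4000 128 1500 2048   # cpu request exceeds free capacity, whatever the real usage is
```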
- Update your ASG name so the service can trigger the scale up/down for you
sed -i '' "s/<INSERT-YOUR-SPOT-INSTANCES-ASG-NAME-HERE>/test/g" "k8s-tools/cluster-autoscaler/cluster_autoscaler.yml"
- Run the CA deployment
kubectl apply -f k8s-tools/cluster-autoscaler
- Watch logs
kubectl logs -f deployment/cluster-autoscaler -n kube-system
- runs as a DaemonSet
- keeps polling instance metadata for the termination notice
- drains the node -- taints it as NoSchedule
- node can be gracefully removed
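The handler's reaction to the notice boils down to a taint plus a drain; a dry-run sketch (node name and taint key are illustrative, not the handler's actual code):

```shell
# What the handler effectively runs once the 2-minute notice appears.
# DRY_RUN=1 prints the commands instead of executing them.
drain_spot_node() {  # drain_spot_node <node-name>
  local node="$1" run="eval"
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  $run "kubectl taint nodes $node spotTerminating=true:NoSchedule"
  $run "kubectl drain $node --ignore-daemonsets --delete-local-data --force"
}

DRY_RUN=1 drain_spot_node ip-10-0-1-23.ap-southeast-2.compute.internal
```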
- Run the Spot Interrupt Handler DaemonSet
kubectl apply -f k8s-tools/spot-interrupt-handler
- Auto-scales at the pod level based on CPU utilisation
- queries utilisation every 15 seconds
- Install metrics server for pod load monitoring
helm install stable/metrics-server --name metrics-server --version 2.0.2 --namespace metrics
- Create a test deployment and expose it
kubectl run php-apache --image=k8s.gcr.io/hpa-example --requests=cpu=200m --limits=cpu=500m
- Create deployment autoscaler
kubectl autoscale deployment php-apache --cpu-percent=30 --min=1 --max=10
- Expose the service
kubectl expose deploy php-apache --target-port=80 --port=80 --type=LoadBalancer
- Increase load
kubectl run -i --tty load-generator --image=busybox /bin/sh
Hit enter for command prompt
while true; do wget -q -O- http://php-apache.default.svc.cluster.local; done
- Monitor HPA and deployment
kubectl get hpa -w
kubectl get deployment php-apache -w
- affinity attracts pods to a set of nodes
- create rules based on labels to:
- hard: need to match label
- soft: preference to match but not required
- taints allow nodes to repel pods
- tolerations are applied to pods to allow (not require) schedule with matching taints
- Node tainted with NoSchedule
kubectl taint nodes node1 key=value:NoSchedule
- A pod needs a toleration to be able to run on that node
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"
- Hard and soft affinities:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: jobType
          operator: In
          values:
          - batch
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: instanceType
          operator: In
          values:
          - Ec2Spot
- Visual overview of the cluster at work
- good for learning and dashboards
Kubernetes Operational View Documentation
- Install via helm
helm repo update
helm install stable/kube-ops-view --name kube-ops-view --set service.type=LoadBalancer --set rbac.create=True
- Get service url
kubectl get svc kube-ops-view | tail -n 1 | awk '{ print "Kube-ops-view URL = http://"$4 }'
- cluster created with eksctl using all default settings
eksctl create cluster caio-eks-test
- deploy CA, Spot Interrupt Handler, metrics server and Kubernetes Operational View to your cluster
- run the Monte Carlo job so it fills the existing nodes with workloads
kubectl apply -f k8s-tools/monte-carlo.yaml
- spot instance workers via Spot Fleet Request using Terraform
- go to the tf-spot-workers folder and update the variables according to your recently created EKS cluster before applying
- wait until the new instance joins the cluster
- run an nginx pod and expose it
kubectl run nginx --image=nginx
kubectl expose deployment nginx --port=80 --target-port=80 --type=LoadBalancer
- make sure the nginx replica is running on the spot instance node
- run test_url.sh <service-url> for constant polling of the URL
- go to the AWS console Spot Fleet Requests page and modify the fleet target capacity to 0
- this will trigger the termination notice
- observe:
- the pod needs to be reallocated to a healthy node before the node is removed
- the service will have little to no interruption in the polling
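The notes reference a test_url.sh helper that isn't reproduced here; a minimal poller along those lines might look like this (function name and output format are my own):

```shell
# Poll a URL at 1s intervals, printing a timestamp and HTTP status each time,
# so any interruption during the spot drain is visible in the output.
poll_url() {  # poll_url <url> <iterations>
  for _ in $(seq "$2"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$1")
    echo "$(date '+%H:%M:%S') HTTP $code"
    sleep 1
  done
}

# e.g. poll_url http://<service-url> 300
```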
- run the HPA load test steps to create a php-apache pod and generate some load
- behaviour to expect/monitor through the dashboard:
- HPA scale deployment replicas based on CPU load
- once no nodes are available to schedule pods, CA should scale up the cluster
- use Lambda with CloudWatch Events or a built-in application function for:
- re-assigning Elastic IPs
- load balancer handling
- updating DNS entries
- any environment changes
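For the DNS case, the change a Lambda (or script) would apply is a Route 53 UPSERT; a sketch with placeholder zone and record names:

```shell
# Build an illustrative Route 53 change batch pointing a record at a new IP.
NEW_IP="203.0.113.10"
cat > change-batch.json <<EOF
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com.",
      "Type": "A",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "$NEW_IP" }]
    }
  }]
}
EOF

# Apply it (needs credentials and a real hosted zone id, so commented out):
# aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE \
#   --change-batch file://change-batch.json
```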
- EBS volumes cannot span multiple AWS availability zones
- use affinity rules and/or taints/tolerations to pin pods to nodes in the right AZ
- EFS is AZ-agnostic and works across a multi-AZ cluster
- multi-AZ ASGs are currently not supported
- one ASG per AZ, with the --balance-similar-node-groups feature enabled
- requires /etc/ssl/certs/ca-bundle.crt to exist in your cluster -- tools like kops need customization
- by default CA won't move pods in the kube-system namespace -- you can change this behaviour
- you can overprovision with pause pods
- keep pods' requests/limits close to real needs
- avoid local storage
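The pause-pod overprovisioning trick from the CA FAQ can be sketched with kubectl alone; the class name, resource sizes, and image are illustrative (set KUBECTL=echo to dry-run the sketch):

```shell
KUBECTL=${KUBECTL:-kubectl}   # override with KUBECTL=echo to only print commands

overprovision() {
  # Negative priority: these pods are evicted first when real workloads arrive,
  # which is exactly the spare-headroom behaviour we want.
  $KUBECTL create priorityclass overprovisioning --value=-1 --description="headroom pods"
  $KUBECTL create deployment overprovisioning --image=k8s.gcr.io/pause
  $KUBECTL set resources deployment overprovisioning --requests=cpu=500m,memory=512Mi
  $KUBECTL patch deployment overprovisioning -p \
    '{"spec":{"template":{"spec":{"priorityClassName":"overprovisioning"}}}}'
}

# overprovision   # run against a real cluster
```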
Contino Ultimate Guide to Passing the CKA Exam
- best content around -- Linux Academy CKA course
- keep track of questions and their weight on the notepad as you go
- skip if it's too hard and worth less than 5%
- you only need 74% (CKA) or 66% (CKAD) to pass
- bookmarks for Kubernetes documentation
- no need for auto-completion as the terminal comes pre-configured
- you can split view your browser with k8s documentation and the exam (only these two tabs open)
- book the exam in the morning so you are 100% for a 3-hour exam
- set up a basic set of aliases in .bash_profile first thing once the test starts
alias k='kubectl'
alias kgp='k get po'
alias kgd='k get deploy'
alias kgs='k get svc'
alias kcc='k config current-context'
alias kuc='k config use-context'
alias ka='k apply'
alias kc='k create'
alias kd='k delete'
alias kg='k get'
- Spot Instances termination notices: https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/
- Running EKS workloads on Spot Instances: https://aws.amazon.com/blogs/compute/run-your-kubernetes-workloads-on-amazon-ec2-spot-instances-with-amazon-eks/
- AWS Spot Instances Pricing Advisor: https://aws.amazon.com/ec2/spot/instance-advisor/
- Using Spot Instances for cost optimizations: https://d1.awsstatic.com/whitepapers/cost-optimization-leveraging-ec2-spot-instances.pdf
- Purchase options types on ASG: https://docs.aws.amazon.com/en_pv/autoscaling/ec2/userguide/asg-purchase-options.html#asg-allocation-strategies
- Using EKSCTL with existing IAM and VPC: https://eksctl.io/examples/reusing-iam-and-vpc/
- Spot Instances termination notice handler: https://github.com/kube-aws/kube-spot-termination-notice-handler
- Overprovisioning with Cluster Autoscaler (CA): https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler
- Gotchas when using Cluster Autoscaler (CA): https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#common-notes-and-gotchas
- Re:Invent 2018 Spot Instances with EKS: https://www.slideshare.net/AmazonWebServices/amazon-ec2-spot-with-amazon-eks-con406r1-aws-reinvent-2018
- AWS Getting Started with EKS: https://aws.amazon.com/getting-started/projects/deploy-kubernetes-app-amazon-eks/
- AWS Quickstart EKS: https://s3.amazonaws.com/aws-quickstart/quickstart-amazon-eks/doc/amazon-eks-architecture.pdf
- Kubernetes Docs: https://kubernetes.io/docs/
- kubectl drain: https://kubernetes.io/images/docs/kubectl_drain.svg