Commit b3bc685

scottilee authored and facebook-github-bot committed
Improve torchx/resources README (#505)
Summary:
Pull Request resolved: #505

#470

* Added steps to generate the "torchx-dev-eks.yml"
* Updated "torchx-dev-eks-template.yml"
* Used https://www.kubeflow.org/docs/components/pipelines/installation/standalone-deployment/#deploying-kubeflow-pipelines to update the KFP creation steps
* Fixed typos and removed stale link

Pull Request resolved: #494

Reviewed By: kurman

Differential Revision: D36416809

Pulled By: scottilee

fbshipit-source-id: d7c33d4c672cda1d6e2f23a18564139af15e2e05
1 parent 93548fa commit b3bc685

2 files changed: +24 -23 lines changed


resources/README.md

Lines changed: 22 additions & 20 deletions
@@ -1,15 +1,28 @@
-The readme describes how to create and delete eks cluster and kfp services.
+The readme describes how to create and delete an EKS cluster and KFP services.

 #### Creating EKS cluster

+export CLUSTER_NAME="torchx-dev"
+export EKS_VERSION="1.21"
+envsubst < torchx-dev-eks-template.yml > torchx-dev-eks.yml
 eksctl create cluster -f torchx-dev-eks.yml

+See https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html for the latest EKS version
+
 #### Creating KFP

-kfctl apply -V -f torchx-dev-kfp.yml
+Source doc: https://www.kubeflow.org/docs/components/pipelines/installation/standalone-deployment/#deploying-kubeflow-pipelines
+
+export PIPELINE_VERSION=1.8.1
+kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
+kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
+kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"
+
+See https://github.com/kubeflow/pipelines/releases for the latest KFP version

-#### Applying kfp role binding
+#### Applying KFP role binding

+kubectl create namespace torchx-dev
 kubectl apply -f kfp_volcano_role_binding.yaml

 #### Creating torchserve
@@ -22,16 +35,6 @@ The readme describes how to create and delete eks cluster and kfp services.

 Install `vcctl`

-
-#### Installing kfp from source code
-
-Source doc: https://www.kubeflow.org/docs/components/pipelines/installation/standalone-deployment/
-
-kubectl apply -k manifests/kustomize/cluster-scoped-resources
-
-kubectl apply -k manifests/kustomize/env/dev
-
-
 #### Starting etcd service

 kubectl apply -f etcd.yaml
@@ -44,21 +47,20 @@ The readme describes how to create and delete eks cluster and kfp services.

 eksctl delete -f torch-dev-eks.yml

-This command most likely will fail. EKS user cloudformation to create many resources, that
-are hard to remove. If the command fails there needs to be done manual cleanup:
+This command most likely will fail. EKS uses CloudFormation to create many resources that
+are hard to remove. If the command fails there needs to be manual cleanup:
 * Clean up the associated VPC. Go to AWS Console -> VPC -> Press `Delete`. This will
 point you the ENI and NAT that needs to be deleted manually.
-* Clean up the cloudformation temalte. Go to AWS Console -> CNF -> delete corresponding templates.
+* Clean up the CloudFormation template. Go to AWS Console -> CNF -> delete corresponding templates.

 ### Gotchas:

-* The directory where `torchx-dev-kfp.yml` is located should be the same name
-as eks cluster
+* The directory where `torchx-dev-kfp.yml` is located should be the same name as eks cluster

-* The node groups in eks cluster HAVE to be spread more than a single AZ, otherwise there
+* The node groups in the EKS cluster HAVE to be spread to more than a single AZ, otherwise there
 will be problems with `istio-ingress`

-* KFP troubleshooting: https://www.kubeflow.org/docs/distributions/aws/troubleshooting-aws/
+* KFP troubleshooting: https://www.kubeflow.org/docs/components/pipelines/troubleshooting/

 * Enable Kubernetes nodes to access AWS account resources: https://stackoverflow.com/a/64617080/1446208
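The KFP creation steps that this commit adds to the README can be sketched as a small shell helper. This is a minimal sketch and not part of the commit: `kfp_manifest_url` and `deploy_kfp` are hypothetical names, and the sketch assumes `kubectl` is already configured for the new cluster. Building both kustomize URLs through one function keeps the two `kubectl apply -k` calls pinned to the same `PIPELINE_VERSION`.

```shell
#!/usr/bin/env sh
# Sketch only: kfp_manifest_url and deploy_kfp are hypothetical helper names,
# not part of the torchx repository.

# Build the kustomize URL for a manifest path at a pinned KFP release.
# $1: path under manifests/kustomize, $2: release tag (for example 1.8.1)
kfp_manifest_url() {
  printf 'github.com/kubeflow/pipelines/manifests/kustomize/%s?ref=%s\n' "$1" "$2"
}

# Run the three README steps in order, stopping at the first failure.
# Aborts with a message if PIPELINE_VERSION is unset or empty.
deploy_kfp() {
  version="${PIPELINE_VERSION:?set PIPELINE_VERSION, for example 1.8.1}"
  kubectl apply -k "$(kfp_manifest_url cluster-scoped-resources "$version")" &&
    kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io &&
    kubectl apply -k "$(kfp_manifest_url env/dev "$version")"
}
```

With `PIPELINE_VERSION=1.8.1` exported, calling `deploy_kfp` would issue the same three commands as the updated README.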

resources/torchx-dev-eks-template.yml

Lines changed: 2 additions & 3 deletions
@@ -3,7 +3,7 @@

 # Running script:
 # export CLUSTER_NAME="torchx-dev"
-# export KFP_VERSION="1.18"
+# export EKS_VERSION="1.21" # See https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html for latest EKS versions
 # envsubst < torchx-dev-eks-template.yml > torchx-dev-eks.yml
 # eksctl create cluster -f torchx-dev-eks.yml

@@ -20,8 +20,7 @@ kind: ClusterConfig
 metadata:
   name: ${CLUSTER_NAME}
   region: us-west-2
-  # https://www.kubeflow.org/docs/distributions/aws/deploy/install-kubeflow/
-  version: '${KFP_VERSION}'
+  version: '${EKS_VERSION}'
   tags:
     environment: dev
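Because `envsubst` replaces references to unset variables with an empty string, a forgotten `export` would render `version: ''` without any error. The sketch below guards the "Running script" steps from the template header. It is an illustration under assumptions, not part of the commit: `require_vars` and `render_cluster_config` are hypothetical helpers, and passing a variable list to `envsubst` assumes the GNU gettext implementation.

```shell
#!/usr/bin/env sh
# Sketch only: require_vars and render_cluster_config are hypothetical helpers,
# not part of the torchx repository.

# Return non-zero if any named variable is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Render the eksctl config only when both template variables are present.
render_cluster_config() {
  require_vars CLUSTER_NAME EKS_VERSION || return 1
  # envsubst is a separate process, so it only sees exported variables.
  export CLUSTER_NAME EKS_VERSION
  # Listing the two names leaves any other $ text in the template untouched.
  envsubst '${CLUSTER_NAME} ${EKS_VERSION}' < torchx-dev-eks-template.yml > torchx-dev-eks.yml
}
```

After setting `CLUSTER_NAME` and `EKS_VERSION`, `render_cluster_config && eksctl create cluster -f torchx-dev-eks.yml` would follow the same flow as the comments in the template.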
