# Deploy NVIDIA NeMo microservices on Oracle Kubernetes Engine (OKE)

**Summary:** This tutorial walks you through the steps required to deploy and configure [NVIDIA NeMo Microservices](https://www.nvidia.com/en-us/ai-data-science/products/nemo/) on OCI. The deployment uses OKE (managed Kubernetes) and Oracle Database 23ai for both structured data and the vector store.

## Requirements

* An [NVIDIA NGC account](https://org.ngc.nvidia.com/setup/personal-keys) where you can provision an API key.
* An Oracle Cloud Infrastructure (OCI) paid account with access to GPU shapes. NVIDIA A10 GPUs are sufficient.
* A general understanding of Python and Jupyter notebooks.

## Task 1: Collect and configure prerequisites

1. Generate an NGC API Key via the NVIDIA portal.

    ![Generate API Key](images/generate-ngc-api-key.png)

2. Log in to your [Oracle Cloud](https://cloud.oracle.com) account.

3. Using the menu in the top left corner, navigate to **`Developer Services`** -> **`Kubernetes Clusters (OKE)`**.

4. Click **`[Create cluster]`** and choose the **Quick create** option. Click **`[Submit]`**.

    ![Quick create a new cluster](images/create-oke-cluster.png)

5. Provide the following configuration details for your cluster:

    * Name
    * Kubernetes Endpoint: Public endpoint
    * Node type: Managed
    * Kubernetes worker nodes: Private workers
    * Shape: VM.Standard.E3.Flex (or E4 | E5, depending on your available capacity)
    * Select the number of OCPUs: 2 or more
    * Node count: 1

    >Note: After the cluster is online, we'll provision a second node pool with GPU shapes. The *E#* flex shapes will be used for cluster operations and the Oracle Database 23ai deployment.

6. Click **`[Next]`**, validate the settings, then click **`[Create cluster]`**.

    >Note: The cluster creation process will take around 15 minutes.

7. Once the cluster is **Active**, click the cluster name to view its details. Use the navigation menu in the left pane to locate and click **Node pools**.

8. You should see **pool1**, which was automatically provisioned with the cluster. Click **`[Add node pool]`**.

9. Provide the following configuration parameters:

    * Name
    * Node Placement Configuration:
        * Availability domain: select at least 1
        * Worker node subnet: select the *node* subnet
    * Node shape: An NVIDIA GPU shape. VM.GPU.A10.1 will work.
    * Node count: 3
    * Click **Specify a custom boot volume size** and change the value to 250.
    * Click the very last **Show advanced options**, found just above the **`[Add]`** button. Under **Initialization script**, choose **Paste Cloud-Init Script** and enter the following:

    ```bash
    <copy>
    #!/bin/bash
    # Run the standard OKE worker-node bootstrap script, then grow the root
    # filesystem to use the full custom boot volume before restarting kubelet.
    curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
    bash /var/run/oke-init.sh
    bash /usr/libexec/oci-growfs -y
    systemctl restart kubelet.service
    </copy>
    ```

    >Note: This deployment requires 3 GPUs to function properly. You can either deploy 3 separate single-GPU nodes, or a single node with 4+ GPUs. Once the nodes are up, you can confirm the GPUs are schedulable with the check below.
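
    The following check is not part of the original steps; assuming the NVIDIA device plugin that ships on OKE GPU images is running, it lists each node's allocatable `nvidia.com/gpu` count:

    ```bash
    <copy>
    # GPUS will read <none> until the device plugin registers the GPUs
    kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
    </copy>
    ```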

10. Click **`[Add]`** to create the new node pool.

11. While that is creating, return to the **Cluster details** page and click **`[Access Cluster]`** at the top of the page.

12. In the dialog that opens, click the button to **`[Launch Cloud Shell]`**, then copy the command shown in step 2 of the dialog. When Cloud Shell becomes available, paste and run the command.

    ![Access Your Cluster](images/access-cluster.png)
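
    The copied command will look roughly like the following; the cluster OCID and region below are placeholders, so use the values from your own dialog:

    ```bash
    # Placeholders: <cluster-ocid> and <region> come from the Access Cluster dialog
    oci ce cluster create-kubeconfig --cluster-id <cluster-ocid> --file $HOME/.kube/config --region <region> --token-version 2.0.0 --kube-endpoint PUBLIC_ENDPOINT
    ```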

13. The command you just executed creates your kubeconfig file. To test it, run the following:

    ```bash
    <copy>
    kubectl cluster-info
    kubectl get nodes -o wide
    </copy>
    ```

    >Note: The GPU nodes may still be provisioning and might not show up just yet. Each node's name is its private IP address.
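
    Optionally (this is an addition to the original steps), you can block until every currently registered node reports **Ready**; GPU nodes that have not yet joined the cluster will still appear later:

    ```bash
    <copy>
    # Waits for all current nodes to become Ready, up to 20 minutes
    kubectl wait --for=condition=Ready node --all --timeout=20m
    </copy>
    ```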

14. Finally, on the Cluster details page, locate the **Add-ons** link and click it. Click **`[Manage add-ons]`** and enable the following:

    * Certificate Manager
    * Database Operator
    * Metrics Server

    >Note: Enable them one at a time by clicking the box, checking the **Enable** option, and saving the changes.
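
    As a quick sanity check (an addition to the original steps, and only meaningful once the Metrics Server add-on has finished deploying), node metrics should start flowing:

    ```bash
    <copy>
    # Returns CPU/memory usage per node once the Metrics Server is serving data
    kubectl top nodes
    </copy>
    ```
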
## Task 2: Install JupyterHub

1. Return to Cloud Shell. Create a new file called **jh-values.yaml** and paste the following:

    ```yaml
    <copy>
    # default configuration
    singleuser:
      cloudMetadata:
        blockWithIptables: false
      # optional: to spawn GPU-based user notebooks, uncomment the following lines.
      # profileList:
      #   - display_name: "GPU Server"
      #     description: "Spawns a notebook server with access to a GPU"
      #     kubespawner_override:
      #       extra_resource_limits:
      #         nvidia.com/gpu: "1"
    </copy>
    ```

    >Note: In this tutorial we use Jupyter notebooks to interact with the GPU-driven NVIDIA microservices. You will not need to enable GPU-based user notebooks to complete the tasks herein.

2. Add the Helm repo.

    ```bash
    <copy>
    helm repo add jupyterhub https://hub.jupyter.org/helm-chart/ && helm repo update
    </copy>
    ```
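
    To confirm the repo was added and the chart is visible (an optional check, not in the original steps):

    ```bash
    <copy>
    # Should list the jupyterhub chart with its latest version
    helm search repo jupyterhub/jupyterhub
    </copy>
    ```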

3. Perform the install using Helm, and reference the values file created in step 1.

    ```bash
    <copy>
    helm upgrade --cleanup-on-fail --install jupyter-hub jupyterhub/jupyterhub --namespace k8s-jupyter --create-namespace --values jh-values.yaml
    </copy>
    ```
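
    The release can take a few minutes to settle. One way to watch it come up (the hub and proxy pods should eventually reach **Running**):

    ```bash
    <copy>
    # Streams pod status changes in the JupyterHub namespace; Ctrl+C to stop
    kubectl get pods -n k8s-jupyter --watch
    </copy>
    ```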

4. Once the deployment is complete, the **proxy-public** Kubernetes service will provision an OCI Load Balancer for public access. Locate the load balancer's public IP address and store it for later.

    ```bash
    <copy>
    kubectl get svc -n k8s-jupyter
    </copy>
    ```

    Output:

    ```bash
    NAMESPACE     NAME           TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)
    k8s-jupyter   proxy-public   LoadBalancer   10.96.177.9   129.213.1.77   80:30141/TCP
    ```
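
    If you just want the address itself (a convenience not in the original steps), a jsonpath query will print the external IP once the load balancer is provisioned:

    ```bash
    <copy>
    # Prints only the load balancer's public IP for the proxy-public service
    kubectl get svc proxy-public -n k8s-jupyter -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    </copy>
    ```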

5. When you access the JupyterHub UI for the first time, you will be prompted for a username and password. Specify values of your choosing, but make sure you save them for future use. After logging in, you'll need to click the button to start the server. The startup process will take 5-7 minutes.

## Task 3: Deploy the Oracle Database 23ai pod

1.