# Deploying embedding models with Text Embeddings Inference on OCI Model Deployment

Hugging Face's [text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference) (TEI) allows us to serve embedding models and supports the OpenAI spec for inference.
TEI supports Nomic, BERT, CamemBERT, and XLM-RoBERTa models with absolute positions, JinaBERT models with ALiBi positions, and
Mistral, Alibaba GTE, and Qwen2 models with RoPE positions. TEI also supports sequence classification and re-ranking models,
but the scope of this example is limited to deploying embedding models.

TEI can serve more than 10k embedding models available on the Hugging Face Hub,
with support for the most popular models. Currently, users can deploy any TEI-supported embedding model on the OCI Data Science platform.
While TEI offers the `/embed` endpoint as the default method to get embeddings, the following example
uses the OpenAI-compatible route, i.e. `/v1/embeddings`. For more details, check the list of available endpoints
[here](https://huggingface.github.io/text-embeddings-inference/#/).
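
To illustrate the difference between the two routes, here is a minimal sketch. It assumes a TEI container is already running locally on port 8080 for a quick test; the payload field names follow the TEI and OpenAI request schemas.

```
import requests

base_url = "http://localhost:8080"  # assumption: a TEI container running locally for a quick test

# TEI's native route uses the "inputs" key and returns a bare list of vectors
native = requests.post(f"{base_url}/embed", json={"inputs": ["What is deep learning?"]}).json()

# the OpenAI-compatible route uses the "input" key and wraps the vectors in a "data" list
openai_style = requests.post(f"{base_url}/v1/embeddings", json={"input": ["What is deep learning?"]}).json()

print(len(native[0]), len(openai_style["data"][0]["embedding"]))  # both routes return the same embedding size
```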

## Overview
This guide demonstrates how to deploy and run inference with embedding models on the Oracle Cloud Infrastructure (OCI) Data Science service
using a Bring Your Own Container (BYOC) approach. In this example, we use a model downloaded from
Hugging Face, specifically `BAAI/bge-base-en-v1.5`, and the container is powered by Text Embeddings Inference (TEI).


## Pre-Requisites
To be able to run the example on this page, ensure you have access to an OCI Data Science notebook session in your tenancy.

### Required IAM Policies

Add these [policies](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/model-deployment/containers/llama2#required-iam-policies)
to grant access to OCI services.

### Install Desktop Container Management

This example requires a desktop tool to build, run, and push containers. We support:

* [Docker Desktop](https://docs.docker.com/get-docker)
* [Rancher Desktop](https://rancherdesktop.io/)


## Prepare Inference Container
TEI ships with multiple Docker images that we can use to deploy an embedding model on the OCI Data Science platform.
For more details on the images, visit the official GitHub repository section
[here](https://github.com/huggingface/text-embeddings-inference/tree/main?tab=readme-ov-file#docker-images).
Here, we show an example with one of the images. Run the following on your desktop once Docker is installed.

```
docker pull ghcr.io/huggingface/text-embeddings-inference:1.5.0
```

Currently, OCI Data Science Model Deployment only supports container images residing in the OCI Registry.
Before you can push the pulled TEI container, make sure you have created a repository in your tenancy.

* Go to your tenancy Container Registry
* Click on the Create repository button
* Select Private under Access types
* Set a name for the repository. We are using "text-embeddings-inference" in this example.
* Click on the Create button

You may need to `docker login` to the Oracle Cloud Container Registry (OCIR) first, if you haven't done so before, in
order to push the image. To log in, you have to use your API Auth Token, which can be created under your
Oracle Cloud Account->Auth Token. You only need to log in once. Replace `<region>` with the OCI region you are using.

```
docker login -u '<tenant-namespace>/<username>' <region>.ocir.io
```

If your tenancy is federated with Oracle Identity Cloud Service, use the format `<tenant-namespace>/oracleidentitycloudservice/<username>`.
You can then push the container image to the OCI Registry.

```
docker tag ghcr.io/huggingface/text-embeddings-inference:1.5.0 <region>.ocir.io/<tenancy>/text-embeddings-inference:1.5.0
docker push <region>.ocir.io/<tenancy>/text-embeddings-inference:1.5.0
```

## Setup

Install the dependencies in the notebook session. These are needed to prepare the artifacts, create a model,
and deploy it in OCI Data Science.

Run this in the terminal in a notebook session:
```
# install and/or update the required python packages
pip install oracle-ads oci huggingface_hub -U
```

## Prepare the model artifacts

To prepare the model artifacts for deployment:

* Download the model files from the Hugging Face Hub to a local directory, using a valid Hugging Face token if the model is gated. If you don't have a Hugging Face token, refer to [this page](https://huggingface.co/docs/hub/en/security-tokens) to generate one.
* Upload the model folder to a [versioned bucket](https://docs.oracle.com/en-us/iaas/Content/Object/Tasks/usingversioning.htm) in Oracle Object Storage. If you don't have an Object Storage bucket, create one using the OCI SDK or the Console. Make a note of the namespace, compartment, and bucket name. Configure the policies to allow the Data Science service to read and write the model artifact to the Object Storage bucket in your tenancy. An administrator must configure the policies in IAM in the Console.
* Create a model catalog entry for the model using the Object Storage path.

## Downloading the model from the Hugging Face Hub

You can refer to the [Hugging Face Hub documentation](https://huggingface.co/docs/hub/en/index) for details. Here, we'll deploy one of the most popular embedding models on the Hub that is supported by TEI.

Run this in the terminal in a notebook session:
```
# Log in to Hugging Face (only needed for gated models)
huggingface-cli login --token "<your-huggingface-token>"

# Download the model to a local folder. Here, BAAI/bge-base-en-v1.5 can be replaced by other models available on the Hugging Face Hub.
huggingface-cli download BAAI/bge-base-en-v1.5 --local-dir BAAI/bge-base-en-v1.5
```
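
If you prefer to stay in Python, the same download can be done with `snapshot_download` from the `huggingface_hub` library. This is a minimal sketch equivalent to the CLI commands above:

```
from huggingface_hub import snapshot_download

# a token is only required for gated models; BAAI/bge-base-en-v1.5 is public
snapshot_download(
    repo_id="BAAI/bge-base-en-v1.5",
    local_dir="BAAI/bge-base-en-v1.5",
)
```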

## Upload Model to OCI Object Storage

Once the model is downloaded, use the terminal to upload the artifacts to Object Storage. Make sure that the bucket is versioned, as mentioned above.

Run this in the terminal in a notebook session:
```
oci os object bulk-upload -bn <bucket> -ns <namespace> --auth resource_principal --prefix BAAI/bge-base-en-v1.5/ --src-dir BAAI/bge-base-en-v1.5/ --no-overwrite
```
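
Alternatively, the upload can be scripted with the OCI Python SDK. The following is a minimal sketch of that approach, assuming resource principal auth is available in the notebook session; `<bucket>` and `<namespace>` are placeholders you must replace.

```
from pathlib import Path

import oci

# resource principal auth works inside a Data Science notebook session
signer = oci.auth.signers.get_resource_principals_signer()
client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
upload_manager = oci.object_storage.UploadManager(client)

namespace = "<namespace>"
bucket = "<bucket>"
local_dir = Path("BAAI/bge-base-en-v1.5")

# mirror the local folder layout under the BAAI/bge-base-en-v1.5/ prefix in the bucket
for path in local_dir.rglob("*"):
    if path.is_file():
        object_name = f"BAAI/bge-base-en-v1.5/{path.relative_to(local_dir)}"
        upload_manager.upload_file(namespace, bucket, object_name, str(path))
```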

## Create Model by reference using ADS

Create a notebook using the default Python kernel, with the Python packages installed in the Setup section.

We first set up the variables needed for creating and deploying the model.
```
import ads
ads.set_auth("resource_principal")

# Extract region information from the Notebook environment variables and signer.
region = ads.common.utils.extract_region()

hf_model_name = "BAAI/bge-base-en-v1.5"
artifact_path = f"oci://<bucket>@<namespace>/{hf_model_name}"
project_id = "<project_ocid>"
compartment_id = "<compartment_ocid>"

log_group_id = "ocid1.loggroup.oc1.xxx.xxxxx"
log_id = "ocid1.log.oc1.xxx.xxxxx"

instance_shape = "VM.GPU.A10.1"
container_image = "<region>.ocir.io/<tenancy>/text-embeddings-inference:1.5.0"
```

Next, create a model catalog entry with the artifact in the Object Storage bucket where it was uploaded.
```
from ads.model.datascience_model import DataScienceModel

model = (
    DataScienceModel()
    .with_compartment_id(compartment_id)
    .with_project_id(project_id)
    .with_display_name(hf_model_name)
    .with_artifact(artifact_path)
).create(model_by_reference=True)
```

## Deploy embedding model

In order to deploy the model we just created, we first set up the infrastructure and the container runtime.

### Import Model Deployment Modules

```
from ads.model.deployment import (
    ModelDeployment,
    ModelDeploymentContainerRuntime,
    ModelDeploymentInfrastructure,
    ModelDeploymentMode,
)
```

### Setup Model Deployment Infrastructure

```
infrastructure = (
    ModelDeploymentInfrastructure()
    .with_project_id(project_id)
    .with_compartment_id(compartment_id)
    .with_shape_name(instance_shape)
    .with_bandwidth_mbps(10)
    .with_replica(1)
    .with_web_concurrency(10)
    .with_access_log(
        log_group_id=log_group_id,
        log_id=log_id,
    )
    .with_predict_log(
        log_group_id=log_group_id,
        log_id=log_id,
    )
)
```

### Configure Model Deployment Runtime

We set the `MODEL_DEPLOY_PREDICT_ENDPOINT` environment variable to `/v1/embeddings` so that we can
access the corresponding endpoint from the TEI container. One additional configuration we need to add is `cmd_var`, which
specifies the location of the artifacts that will be downloaded inside the model deployment. For models created by reference, the
default artifact location is `/opt/ds/model/deployed_model/`, and we need to append the Object Storage bucket prefix to this path.

```
env_var = {
    'MODEL_DEPLOY_PREDICT_ENDPOINT': '/v1/embeddings'
}
# note that the model path inside the container will have the format /opt/ds/model/deployed_model/{artifact_path_prefix}
cmd_var = ["--model-id", "/opt/ds/model/deployed_model/BAAI/bge-base-en-v1.5/", "--port", "8080", "--hostname", "0.0.0.0"]

container_runtime = (
    ModelDeploymentContainerRuntime()
    .with_image(container_image)
    .with_server_port(8080)
    .with_health_check_port(8080)
    .with_env(env_var)
    .with_cmd(cmd_var)
    .with_deployment_mode(ModelDeploymentMode.HTTPS)
    .with_model_uri(model.id)
    .with_region(region)
)
```

### Deploy Model Using Container Runtime

Once the infrastructure and runtime are configured, we can deploy the model.
```
deployment = (
    ModelDeployment()
    .with_display_name(f"{hf_model_name} with TEI docker container")
    .with_description(f"Deployment of {hf_model_name} MD with text-embeddings-inference:1.5.0 container")
    .with_infrastructure(infrastructure)
    .with_runtime(container_runtime)
).deploy(wait_for_completion=False)
```
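
Because `wait_for_completion=False` returns immediately, you may want to poll the deployment state before moving on to inference. Below is a minimal sketch using the OCI Python SDK, assuming resource principal auth is available in the notebook session:

```
import time

import oci

signer = oci.auth.signers.get_resource_principals_signer()
ds_client = oci.data_science.DataScienceClient(config={}, signer=signer)

# poll until the deployment leaves the CREATING state (it should end up ACTIVE)
while True:
    lifecycle_state = ds_client.get_model_deployment(deployment.model_deployment_id).data.lifecycle_state
    print(lifecycle_state)
    if lifecycle_state != "CREATING":
        break
    time.sleep(60)
```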

## Inference

Once the model deployment has reached the Active state, we can invoke the model deployment endpoint to interact with the embedding model.
More details on the different ways of accessing MD endpoints are documented [here](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/ai-quick-actions/model-deployment-tips.md#inferencing-model).

```
import requests

sentences = [
    "The car sped down the highway at an incredible speed.",
    "A vehicle raced along the freeway, moving very fast.",
    "The child was playing in the park with a ball.",
]
endpoint = f"https://modeldeployment.{region}.oci.customer-oci.com/{deployment.model_deployment_id}/predict"

response = requests.post(
    endpoint,
    json={
        "input": sentences
    },
    auth=ads.common.auth.default_signer()["signer"],
    headers={},
).json()
```

The raw output (`response`) contains an array of three embedding vectors, one for each of the three sentences above.

```
{'object': 'list',
 'data': [{'object': 'embedding',
   'embedding': [-0.00735207,
    -0.045759525,
    0.061242294,
    0.013910536,
    0.048454784,
    -0.0059445454,
    0.007921069,
    0.029093834,
    0.04836494,
    ...
    ...
    ...
    -0.005862751,
    0.055649005],
   'index': 2}],
 'model': '/opt/ds/model/deployed_model/BAAI/bge-base-en-v1.5/',
 'usage': {'prompt_tokens': 39, 'total_tokens': 39}}
```
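
As a quick sanity check on the response, you can stack the vectors into a matrix and confirm their dimensionality. This is a small sketch; `BAAI/bge-base-en-v1.5` produces 768-dimensional embeddings.

```
import numpy as np

# one row per input sentence, one column per embedding dimension
vectors = np.array([item["embedding"] for item in response["data"]])
print(vectors.shape)  # expected: (3, 768) for the three sentences above
```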

## Testing Embeddings generated by the model

Here, we have three sentences: two of them have a similar meaning, and the third one is distinct. We'll run a simple test to
see how similar or dissimilar these sentences are, using cosine similarity as the comparison metric.

```
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

embeddings = [sentence['embedding'] for sentence in response['data']]
similarity_matrix = cosine_similarity(embeddings)

labels = [f"Sentence {i+1}" for i in range(len(sentences))]

# visualize the similarity matrix using a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(similarity_matrix, annot=True, cmap='coolwarm', xticklabels=labels, yticklabels=labels)

# add title and labels for better clarity
plt.title('Cosine Similarity Heatmap of Sentences')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
```

The above heatmap shows that the embedding model captures the semantic similarity between the first two sentences while distinguishing the third as different.
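
If you prefer a quick numeric check over a plot, the pairwise scores can also be printed directly. This is a small sketch reusing the objects defined above:

```
# print each unique pair of sentences with its cosine similarity score
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"{labels[i]} vs {labels[j]}: {similarity_matrix[i][j]:.3f}")
```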