
Commit 92d7eaf

Merge pull request #257814 from mrbullwinkle/mrb_11_06_2023_embeddings_tutorial
[Azure OpenAI] embeddings
2 parents 7cdeb83 + ea8cd1b commit 92d7eaf

File tree

1 file changed: +82 −2 lines changed


articles/ai-services/openai/tutorials/embeddings.md

Lines changed: 82 additions & 2 deletions
@@ -6,7 +6,7 @@ services: cognitive-services
manager: nitinme
ms.service: azure-ai-openai
ms.topic: tutorial
-ms.date: 09/12/2023
+ms.date: 11/06/2023
author: mrbullwinkle #noabenefraim
ms.author: mbullwin
recommendations: false
@@ -46,10 +46,20 @@ In this tutorial, you learn how to:

If you haven't already, you need to install the following libraries:

+# [OpenAI Python 0.28.1](#tab/python)
+
```cmd
pip install "openai==0.28.1" num2words matplotlib plotly scipy scikit-learn pandas tiktoken
```

+# [OpenAI Python 1.x](#tab/python-new)
+
+```console
+pip install openai num2words matplotlib plotly scipy scikit-learn pandas tiktoken
+```
+
+---
+
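Since the two tabs above target different major versions of the `openai` package, it helps to confirm which version your environment actually has before picking a tab. A minimal check (not part of this commit), assuming the package is already installed:

```python
# Print the installed openai package version so you know which tab of the
# tutorial applies: 0.28.x -> "OpenAI Python 0.28.1", 1.x -> "OpenAI Python 1.x".
from importlib.metadata import version

print(version("openai"))
```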
<!--Alternatively, you can use our [requirements.txt file](https://github.com/Azure-Samples/Azure-OpenAI-Docs-Samples/blob/main/Samples/Tutorials/Embeddings/requirements.txt).-->

### Download the BillSum dataset
@@ -105,7 +115,9 @@ Run the following code in your preferred Python IDE:

<!--If you wish to view the Jupyter notebook that corresponds to this tutorial you can download the tutorial from our [samples repo](https://github.com/Azure-Samples/Azure-OpenAI-Docs-Samples/blob/main/Samples/Tutorials/Embeddings/embedding_billsum.ipynb).-->

-## Import libraries and list models
+## Import libraries
+
+# [OpenAI Python 0.28.1](#tab/python)

```python
import openai
@@ -193,6 +205,23 @@ print(r.text)

The output of this command will vary based on the number and type of models you've deployed. In this case, we need to confirm that we have an entry for **text-embedding-ada-002**. If you find that you're missing this model, you'll need to [deploy the model](../how-to/create-resource.md#deploy-a-model) to your resource before proceeding.

+# [OpenAI Python 1.x](#tab/python-new)
+
+```python
+import os
+import re
+import requests
+import sys
+from num2words import num2words
+import os
+import pandas as pd
+import numpy as np
+import tiktoken
+from openai import AzureOpenAI
+```
+
+---
+
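The 1.x tab above only imports libraries, while the preceding paragraph still asks you to confirm that a **text-embedding-ada-002** deployment exists. One way to do that (not part of this commit) is the same REST call the 0.28.1 path uses; the sketch below assumes the `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` environment variables used throughout the tutorial, and the `api-version` value is an assumption. Checking your deployments in the Azure portal or Azure OpenAI Studio works just as well.

```python
import os
import requests

# List the deployments on the Azure OpenAI resource and confirm that
# text-embedding-ada-002 appears in the response.
url = os.getenv("AZURE_OPENAI_ENDPOINT").rstrip("/") + "/openai/deployments?api-version=2022-12-01"
r = requests.get(url, headers={"api-key": os.getenv("AZURE_OPENAI_API_KEY")})
print(r.text)
```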
Now we need to read our csv file and create a pandas DataFrame. After the initial DataFrame is created, we can view the contents of the table by running `df`.

```python
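# (The diff truncates here, so the body of this code block isn't shown. As a hedged
# sketch of the step described above: read the downloaded BillSum CSV into a pandas
# DataFrame. The file name below is an assumption; use whatever name you saved the
# dataset under.)
df = pd.read_csv(os.path.join(os.getcwd(), "bill_sum_data.csv"))
df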
@@ -334,10 +363,29 @@ len(decode)

Now that we understand more about how tokenization works, we can move on to embedding. It's important to note that we haven't actually tokenized the documents yet. The `n_tokens` column is simply a way of making sure none of the data we pass to the model for tokenization and embedding exceeds the input token limit of 8,192. When we pass the documents to the embeddings model, it will break the documents into tokens similar (though not necessarily identical) to the examples above and then convert the tokens to a series of floating point numbers that will be accessible via vector search. These embeddings can be stored locally or in an [Azure Database to support Vector Search](../../../cosmos-db/mongodb/vcore/vector-search.md). As a result, each bill will have its own corresponding embedding vector in the new `ada_v2` column on the right side of the DataFrame.

+# [OpenAI Python 0.28.1](#tab/python)
+
```python
df_bills['ada_v2'] = df_bills["text"].apply(lambda x : get_embedding(x, engine = 'text-embedding-ada-002')) # engine should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
```

+# [OpenAI Python 1.x](#tab/python-new)
+
+```python
+client = AzureOpenAI(
+    api_key = os.getenv("AZURE_OPENAI_API_KEY"),
+    api_version = "2023-05-15",
+    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
+)
+
+def generate_embeddings(text, model="text-embedding-ada-002"): # model = "deployment_name"
+    return client.embeddings.create(input = [text], model=model).data[0].embedding
+
+df_bills['ada_v2'] = df_bills["text"].apply(lambda x : generate_embeddings(x, model = 'text-embedding-ada-002')) # model should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
+```
+
+---
+
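The paragraph above leans on the `n_tokens` column to guarantee that nothing sent to the model exceeds the 8,192-token input limit. A quick sanity check (not part of this commit), assuming `df_bills` and its `n_tokens` column exist as in the tutorial:

```python
# Verify every document is under the embedding model's 8,192-token input limit
# before requesting embeddings; fail loudly if any document is too long.
assert df_bills["n_tokens"].max() <= 8192, "At least one document exceeds the 8,192-token limit"
print(df_bills["n_tokens"].describe())
```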
```python
df_bills
```
@@ -348,6 +396,8 @@ df_bills

As we run the search code block below, we'll embed the search query *"Can I get information on cable company tax revenue?"* with the same **text-embedding-ada-002 (Version 2)** model. Next we'll find the closest bill embedding to the newly embedded text from our query ranked by [cosine similarity](../concepts/understand-embeddings.md).

+# [OpenAI Python 0.28.1](#tab/python)
+
```python
# search through the reviews for a specific product
def search_docs(df, user_query, top_n=3, to_print=True):
@@ -369,6 +419,36 @@ def search_docs(df, user_query, top_n=3, to_print=True):
res = search_docs(df_bills, "Can I get information on cable company tax revenue?", top_n=4)
```

+# [OpenAI Python 1.x](#tab/python-new)
+
+```python
+def cosine_similarity(a, b):
+    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
+
+def get_embedding(text, model="text-embedding-ada-002"): # model = "deployment_name"
+    return client.embeddings.create(input = [text], model=model).data[0].embedding
+
+def search_docs(df, user_query, top_n=4, to_print=True):
+    embedding = get_embedding(
+        user_query,
+        model="text-embedding-ada-002" # model should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
+    )
+    df["similarities"] = df.ada_v2.apply(lambda x: cosine_similarity(x, embedding))
+
+    res = (
+        df.sort_values("similarities", ascending=False)
+        .head(top_n)
+    )
+    if to_print:
+        display(res)
+    return res
+
+res = search_docs(df_bills, "Can I get information on cable company tax revenue?", top_n=4)
+```
+
+---
+

**Output**:

:::image type="content" source="../media/tutorials/query-result.png" alt-text="Screenshot of the formatted results of res once the search query has been run." lightbox="../media/tutorials/query-result.png":::
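The `cosine_similarity` helper in the 1.x tab is what produces the ranking shown in the output above. A tiny worked example (not part of this commit) on hypothetical two-dimensional vectors makes the score concrete; real **text-embedding-ada-002** vectors have 1,536 dimensions, but the formula is identical.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalized by the vectors' lengths, exactly as in the tab above.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [1, 1]))  # ~0.707: partially aligned vectors
print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors, nothing shared
```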
