<!--Alternatively, you can use our [requirements.txt file](https://github.com/Azure-Samples/Azure-OpenAI-Docs-Samples/blob/main/Samples/Tutorials/Embeddings/requirements.txt).-->
### Download the BillSum dataset

Run the following code in your preferred Python IDE:

<!--If you wish to view the Jupyter notebook that corresponds to this tutorial you can download the tutorial from our [samples repo](https://github.com/Azure-Samples/Azure-OpenAI-Docs-Samples/blob/main/Samples/Tutorials/Embeddings/embedding_billsum.ipynb).-->

## Import libraries

# [OpenAI Python 0.28.1](#tab/python)
```python
import openai
import os
import re
import requests
import sys
from num2words import num2words
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity
import tiktoken

API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
RESOURCE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01"

# List the model deployments on the resource so we can confirm that
# text-embedding-ada-002 is available.
url = openai.api_base + "/openai/deployments?api-version=2022-12-01"
r = requests.get(url, headers={"api-key": API_KEY})

print(r.text)
```
The output of this command will vary based on the number and type of models you've deployed. In this case, we need to confirm that we have an entry for **text-embedding-ada-002**. If you find that you're missing this model, you'll need to [deploy the model](../how-to/create-resource.md#deploy-a-model) to your resource before proceeding.
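
As a quick programmatic check, you can test whether the embedding model shows up anywhere in the deployments response before moving on (a minimal illustration, assuming `r` still holds the response from the listing call above):

```python
# Confirm the deployments listing mentions the embedding model used in the rest of this tutorial.
if "text-embedding-ada-002" not in r.text:
    print("No text-embedding-ada-002 deployment found - deploy the model before continuing.")
```
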
# [OpenAI Python 1.x](#tab/python-new)
```python
import os
import re
import requests
import sys
from num2words import num2words
import pandas as pd
import numpy as np
import tiktoken
from openai import AzureOpenAI
```
---
Now we need to read our csv file and create a pandas DataFrame. After the initial DataFrame is created, we can view the contents of the table by running `df`.
```python
df = pd.read_csv(os.path.join(os.getcwd(), 'bill_sum_data.csv')) # This assumes that you have placed bill_sum_data.csv in the same directory you are running the code from
df
```
Now that we understand more about how tokenization works, we can move on to embedding. It's important to note that we haven't actually tokenized the documents yet. The `n_tokens` column is simply a way of making sure none of the data we pass to the model for tokenization and embedding exceeds the input token limit of 8,192. When we pass the documents to the embeddings model, it will break the documents into tokens similar (though not necessarily identical) to the examples above and then convert the tokens to a series of floating point numbers that will be accessible via vector search. These embeddings can be stored locally or in an [Azure Database to support Vector Search](../../../cosmos-db/mongodb/vcore/vector-search.md). As a result, each bill will have its own corresponding embedding vector in the new `ada_v2` column on the right side of the DataFrame.
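
As a quick illustration of that guard, a one-line check against the `n_tokens` column confirms nothing we're about to embed exceeds the limit (illustrative only; the column values come from the earlier tokenization step):

```python
# Verify every document fits under the 8,192-token input limit before requesting embeddings.
assert (df_bills["n_tokens"] <= 8192).all(), "At least one document exceeds the embedding input limit"
```
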
# [OpenAI Python 0.28.1](#tab/python)

```python
df_bills['ada_v2'] = df_bills["text"].apply(lambda x : get_embedding(x, engine='text-embedding-ada-002')) # engine should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
```

# [OpenAI Python 1.x](#tab/python-new)

```python
df_bills['ada_v2'] = df_bills["text"].apply(lambda x : generate_embeddings(x, model='text-embedding-ada-002')) # model should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
```

---
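
The OpenAI Python 1.x call relies on a `generate_embeddings` helper built on the `AzureOpenAI` client. A minimal sketch of that helper, assuming the standard `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` environment variables and a generally available API version, looks like this:

```python
# Minimal sketch of the client setup and helper assumed by the OpenAI Python 1.x call above.
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2023-05-15",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

def generate_embeddings(text, model="text-embedding-ada-002"): # model should match your deployment name
    # Return the embedding vector (a list of floats) for a single piece of text.
    return client.embeddings.create(input=[text], model=model).data[0].embedding
```
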
```python
df_bills
```
As we run the search code block below, we'll embed the search query *"Can I get information on cable company tax revenue?"* with the same **text-embedding-ada-002 (Version 2)** model. Next, we'll find the bill embeddings closest to the newly embedded query text, ranked by [cosine similarity](../concepts/understand-embeddings.md).
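
The code below calls a `search_docs` helper. A minimal sketch of that helper, assuming the `generate_embeddings` function sketched earlier (with OpenAI Python 0.28.1, `get_embedding(..., engine=...)` plays the same role) and the `ada_v2` column created above, might look like this:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search_docs(df, user_query, top_n=4):
    # Embed the query with the same deployment used for the documents, score every
    # row by cosine similarity against the query, and return the top matches.
    query_embedding = generate_embeddings(user_query, model='text-embedding-ada-002')
    df["similarities"] = df["ada_v2"].apply(lambda x: cosine_similarity(x, query_embedding))
    return df.sort_values("similarities", ascending=False).head(top_n)
```
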
# [OpenAI Python 0.28.1](#tab/python)
```python
# search through the bills for a specific query
res = search_docs(df_bills, "Can I get information on cable company tax revenue?", top_n=4)
```
---
**Output**:
:::image type="content" source="../media/tutorials/query-result.png" alt-text="Screenshot of the formatted results of res once the search query has been run." lightbox="../media/tutorials/query-result.png":::