Alternatively, you can use our requirements.txt file. `TODO:(mbullwin): Create publicly accessible sample repo with requirements.txt file for this tutorial`
+
Alternatively, you can use our [requirements.txt file](https://github.com/Azure-Samples/Azure-OpenAI-Docs-Samples/blob/main/Samples/Tutorials/Embeddings/requirements.txt).
### Download the BillSum dataset
BillSum is a dataset of United States Congressional and California state bills. For illustration purposes, we'll look only at the US bills. The corpus consists of bills from the 103rd-115th (1993-2018) sessions of Congress. The data was split into 18,949 train bills and 3,269 test bills. The BillSum corpus focuses on mid-length legislation from 5,000 to 20,000 characters in length. More information on the project and the original academic paper from which this dataset is derived can be found on the [BillSum project's GitHub repository](https://github.com/FiscalNote/BillSum).
-
This tutorial uses the `bill_sum_data.csv` file that can be downloaded from our [GitHub sample data](TODO-mbullwin-add-link-to-sample-file).
+
This tutorial uses the `bill_sum_data.csv` file that can be downloaded from our [GitHub sample data](https://github.com/Azure-Samples/Azure-OpenAI-Docs-Samples/blob/main/Samples/Tutorials/Embeddings/data/bill_sum_data.csv).
You can also download the sample data by running the following on your local machine:
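As one way to fetch the file from Python, here is a minimal sketch that turns the GitHub "blob" page link above into a raw-file URL and saves it locally. The helper names `blob_to_raw_url` and `download_sample` are hypothetical, and the sketch assumes the file stays at the linked location on the `main` branch.

```python
from urllib.parse import urlsplit
from urllib.request import urlretrieve

# The sample-data link used in this tutorial (a github.com "blob" page).
SAMPLE_URL = (
    "https://github.com/Azure-Samples/Azure-OpenAI-Docs-Samples/blob/main/"
    "Samples/Tutorials/Embeddings/data/bill_sum_data.csv"
)


def blob_to_raw_url(blob_url: str) -> str:
    """Convert a github.com blob URL into its raw.githubusercontent.com form."""
    parts = urlsplit(blob_url)
    # Path layout: /<owner>/<repo>/blob/<branch>/<path to file>
    owner, repo, marker, rest = parts.path.lstrip("/").split("/", 3)
    if parts.netloc != "github.com" or marker != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"


def download_sample(dest: str = "bill_sum_data.csv") -> str:
    """Download the sample CSV to dest and return the local path (needs network access)."""
    local_path, _headers = urlretrieve(blob_to_raw_url(SAMPLE_URL), dest)
    return local_path
```

Calling `download_sample()` writes `bill_sum_data.csv` to the current working directory, and that path can then be passed to `pd.read_csv`.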
The output of this command will vary based on the number and type of models you've deployed. In this case, we need to confirm that we have entries for both **text-search-curie-doc-001** and **text-search-curie-query-001**. If you find that you're missing one of these models, you'll need to [deploy the models](../how-to/create-resource.md#deploy-a-model) to your resource before proceeding.
> [!IMPORTANT]
-
> You will likely receive warnings even when successfully running the code above and retrieving the expected output. The warning messages can be ignored.
-
-
**TODO(mbullwin): Confirm with Noa if the below warning is expected behavior***`TqdmWarning: IProgress not found. Please update jupyter and ipywidgets.None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.`*
+
> You may receive warnings even when the code above runs successfully and returns the expected output. These warning messages can be ignored: `TqdmWarning: IProgress not found. Please update jupyter and ipywidgets.` and `None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.`
```python
df = pd.read_csv("INSERT LOCAL PATH TO BILL_SUM_DATA.CSV")
@@ -315,10 +313,7 @@ df_bills['text'] = df_bills["text"].apply(lambda x : normalize_text(x))
```
> [!Note]
-
> If you receive a warning stating *"A value is trying to be set on a copy of a slice from a DataFrame.
-
Try using .loc[row_indexer,col_indexer] = value instead"* you can safely ignore this message.
-
-
**TODO(mbullwin): Confirm with Noa if the above warning is expected behavior**
+
> If you receive a warning stating `A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead`, you can safely ignore this message.
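If you would rather avoid the warning than ignore it, one common approach is to take an explicit copy before assigning to a column. The following is a minimal sketch under the assumption that `df_bills` was created by selecting columns from another DataFrame. The toy data and the `collapse_whitespace` helper are hypothetical stand-ins for the tutorial's CSV and its `normalize_text` function.

```python
import pandas as pd

# Toy stand-in for the tutorial's data (assumption: the real frame comes from bill_sum_data.csv).
df = pd.DataFrame({
    "text": ["An  Act .", "A bill\n to amend"],
    "summary": ["s1", "s2"],
    "title": ["t1", "t2"],
    "extra": [1, 2],
})


def collapse_whitespace(s: str) -> str:
    """Hypothetical, simplified stand-in for the tutorial's normalize_text."""
    return " ".join(s.split())


# .copy() gives df_bills its own data, so the assignment below
# does not write into a slice of df and df itself stays unchanged.
df_bills = df[["text", "summary", "title"]].copy()
df_bills["text"] = df_bills["text"].apply(collapse_whitespace)
```

The trade-off is a little extra memory for the copy, which is negligible for a dataset of this size.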
Let's once again print `df_bills` so we can visualize the cleanup we just completed:
@@ -411,7 +406,8 @@ len(df_bills)
12
```
-
**TODO(mbullwin): Confirm with Noa if the following warning is expected behavior and customers should be ignoring it or if the code requires further modification.***Token indices sequence length is longer than the specified maximum sequence length for this model (1480 > 1024). Running this sequence through the model will result in indexing errors. A value is trying to be set on a copy of a slice from a DataFrame.Try using .loc[row_indexer,col_indexer] = value instead.*
+
> [!Note]
+
> You can ignore the messages: `Token indices sequence length is longer than the specified maximum sequence length for this model (1480 > 1024). Running this sequence through the model will result in indexing errors.` and `A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead.`
We'll print **df_bills** once more. As expected, only 12 results are now returned, though they retain their original index values in the first column.
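The retained index values are standard pandas behavior: filtering rows keeps each surviving row's original label. The sketch below illustrates this on toy data. The `n_tokens` column name, the token counts, and the `8192` cutoff are all assumptions made for the example, not values taken from this tutorial.

```python
import pandas as pd

# Toy data (assumption): four bills with made-up token counts.
df_bills = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "n_tokens": [1466, 9000, 937, 8100],
})

# Boolean filtering drops row 1 but keeps the other rows' original labels.
filtered = df_bills[df_bills["n_tokens"] < 8192]
original_labels = filtered.index.tolist()  # labels 0, 2, 3 survive

# Optional: renumber the rows from 0 if contiguous labels are preferred.
renumbered = filtered.reset_index(drop=True)
```

Keeping the original labels makes it easy to trace a row back to the source file, while `reset_index(drop=True)` is convenient if later steps rely on positional numbering.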