Commit 9c8cad2

WuhanMonkey authored and subramen committed
Update README.md
1 parent 3d914de commit 9c8cad2

File tree

1 file changed: +7 −2 lines

  • recipes/benchmarks/inference_throughput/cloud-api


recipes/benchmarks/inference_throughput/cloud-api/README.md

Lines changed: 7 additions & 2 deletions
@@ -13,13 +13,18 @@ To get started, there are certain steps we need to take to deploy the models:

 * Take a quick look at what [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-studio/what-is-ai-studio?tabs=home) is and navigate to the website from the link in the article
 * Follow the demos in the article to create a project and [resource group](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal), or follow the guide [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio)
 * Select Llama models from the Model catalog
-* Deploy with "Pay-as-you-go"
+* Click the "Deploy" button
+* Select Serverless API with Azure AI Content Safety. Note that this API service is currently offered for the Llama 2 pretrained and chat models and the Llama 3 instruct model
+* Select the project you created in the previous step
+* Choose a deployment name, then go to the deployment

 Once deployed successfully, you should be assigned an API endpoint and a security key for inference.
 For more information, consult Azure's official documentation [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio) on model deployment and inference.

 Now, replace the endpoint URL and API key in ```azure/parameters.json```. For the parameter `MODEL_ENDPOINTS`, the suffix should be `v1/chat/completions` for chat models and `v1/completions` for pretrained models.
-Note that the API endpoint might implemented a rate limit for token generation in certain amount of time. If you encountered the error, you can try reduce `MAX_NEW_TOKEN` or start with smaller `CONCURRENT_LEVELs`.
+Note that the API endpoint may enforce a rate limit on token generation within a certain window of time. If you hit a rate-limit error, try reducing `MAX_NEW_TOKEN` or starting with smaller `CONCURRENT_LEVELS`.
+
+For `MODEL_PATH`, copy the model path from Huggingface under the meta-llama organization. For Llama 2, make sure you copy the path of the model in hf format. This model path is used to retrieve the corresponding tokenizer for your model of choice; Llama 3 uses a different tokenizer than Llama 2.

 Once everything is configured, run the chat model benchmark:
 ```python chat_azure_api_benchmark.py```
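The configuration step in the diff above can be sketched as a small Python check of the expected `azure/parameters.json` fields. Only `MODEL_ENDPOINTS`, `MODEL_PATH`, `MAX_NEW_TOKEN`, and `CONCURRENT_LEVELS` are named in the README; the endpoint URL shape, the `API_KEY` field name, and all sample values here are assumptions for illustration:

```python
import json

# Hypothetical sketch of azure/parameters.json. Field names other than
# MODEL_ENDPOINTS, MODEL_PATH, MAX_NEW_TOKEN, and CONCURRENT_LEVELS are
# assumptions, as are all of the placeholder values.
params = {
    "MODEL_ENDPOINTS": "https://example-deployment.example-region.inference.ai.azure.com/v1/chat/completions",
    "API_KEY": "replace-with-your-security-key",
    "MODEL_PATH": "meta-llama/Meta-Llama-3-8B-Instruct",
    "MAX_NEW_TOKEN": 256,
    "CONCURRENT_LEVELS": [1, 2, 4],
}

def endpoint_suffix(is_chat_model: bool) -> str:
    """Pick the endpoint suffix per the README: chat deployments use
    v1/chat/completions, pretrained deployments use v1/completions."""
    return "v1/chat/completions" if is_chat_model else "v1/completions"

# Sanity-check that the configured endpoint matches the chat-model suffix.
assert params["MODEL_ENDPOINTS"].endswith(endpoint_suffix(is_chat_model=True))
print(json.dumps(params, indent=2))
```

Swapping `is_chat_model=False` gives the `v1/completions` suffix expected for pretrained-model deployments.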

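The rate-limit note in the diff can also be handled client-side with retries instead of only lowering `MAX_NEW_TOKEN` or `CONCURRENT_LEVELS`. A minimal exponential-backoff sketch, assuming the rate-limited call surfaces as a Python exception (the actual benchmark script's error handling may differ):

```python
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=0.01):
    """Retry a rate-limited API call with exponential backoff.
    Hypothetical helper; RuntimeError stands in for an HTTP 429 error."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(delay)
            delay *= 2

# Simulate an endpoint that rejects the first two calls, then succeeds.
calls = {"n": 0}
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429: rate limited")
    return "ok"

print(call_with_backoff(flaky_request))  # → ok
```

Backoff smooths over transient throttling, but if every concurrency level is throttled, reducing the configured load is still the right fix.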