---
title: Inference for PROs
thumbnail: /blog/assets/inference_pro/thumbnail.png
authors:
- user: osanseviero
- user: pcuenq
- user: victor
---

# Inference for PROs

<!-- {blog_metadata} -->
<!-- {authors} -->

Today, we're introducing Inference for PRO users - a community offering that gives you API access to curated endpoints for some of the most exciting models available, as well as improved rate limits for the free Inference API. Use the following page to [subscribe to PRO](https://huggingface.co/subscribe/pro).

Hugging Face PRO users now have access to exclusive API endpoints for a curated list of powerful models that benefit from ultra-fast inference powered by [text-generation-inference](https://github.com/huggingface/text-generation-inference). This is a benefit on top of the free Inference API, which is available to all Hugging Face users to facilitate testing and prototyping on 200,000+ models. PRO users enjoy higher rate limits on these models, as well as exclusive access to some of the best models available today.

## Contents

- [Supported Models](#supported-models)
- [Getting started with Inference for PROs](#getting-started-with-inference-for-pros)
- [Applications](#applications)
  - [Chat with Llama 2 and Code Llama](#chat-with-llama-2-and-code-llama)
  - [Code infilling with Code Llama](#code-infilling-with-code-llama)
  - [Stable Diffusion XL](#stable-diffusion-xl)
- [Generation Parameters](#generation-parameters)
  - [Controlling Text Generation](#controlling-text-generation)
  - [Controlling Image Generation](#controlling-image-generation)
  - [Caching](#caching)
  - [Streaming](#streaming)
- [Subscribe to PRO](#subscribe-to-pro)
- [FAQ](#faq)

## Supported Models

In addition to thousands of public models available in the Hub, PRO users get free access to the following state-of-the-art models:

| Model               | Size | Context Length | Use |
| ------------------- | ---- | -------------- | --- |
| Llama 2 Chat        | [7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), and [70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | 4k tokens | One of the best conversational models |
| Code Llama Base     | [7B](https://huggingface.co/codellama/CodeLlama-7b-hf) and [13B](https://huggingface.co/codellama/CodeLlama-13b-hf) | 4k tokens | Autocomplete and infill code |
| Code Llama Instruct | [34B](https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf) | 16k tokens | Conversational code assistant |
| Stable Diffusion XL | [3B UNet](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) | - | Generate images |

Inference for PROs makes it easy to experiment and prototype with new models without having to deploy them on your own infrastructure. It gives PRO users access to ready-to-use HTTP endpoints for all the models listed above. It's not meant to be used for heavy production applications - for that, we recommend using [Inference Endpoints](https://ui.endpoints.huggingface.co/catalog). Inference for PROs also lets you use applications that depend on an LLM endpoint, such as the [VS Code extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode) for code completion, or running your own version of [Hugging Chat](http://hf.co/chat).

## Getting started with Inference for PROs

Using Inference for PROs is as simple as sending a POST request to the API endpoint for the model you want to run. You'll also need to get a PRO account authentication token from [your token settings page](https://huggingface.co/settings/tokens) and use it in the request. For example, to generate text using [Llama 2 70B Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) in a terminal session, you'd do something like:

```bash
curl https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, "}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"
```

Which would print something like this:

```json
[
    {
        "generated_text": "In a surprising turn of events, 20th Century Fox has released a new trailer for Ridley Scott's Alien"
    }
]
```

You can also use many of the familiar transformers generation parameters, like `temperature` or `max_new_tokens`:

```bash
curl https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, ", "parameters": {"temperature": 0.7, "max_new_tokens": 100}}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"
```

```json
[
    {
        "generated_text": "In a surprising turn of events, 2K has announced that it will be releasing a new free-to-play game called NBA 2K23 Arcade Edition. This game will be available on Apple iOS devices and will allow players to compete against each other in quick, 3-on-3 basketball matches.\n\nThe game promises to deliver fast-paced, action-packed gameplay, with players able to choose from a variety of NBA teams and players, including some of the biggest"
    }
]
```

For more details on the generation parameters, please take a look at [_Controlling Text Generation_](#controlling-text-generation) below.

To send your requests in Python, you can take advantage of `InferenceClient`, a convenient utility available in the `huggingface_hub` Python library:

```bash
pip install huggingface_hub
```

`InferenceClient` is a helpful wrapper that allows you to make calls to the Inference API and Inference Endpoints easily:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-2-70b-chat-hf", token=YOUR_TOKEN)

output = client.text_generation("Can you please let us know more details about your ")
print(output)
```

If you don't want to pass the token explicitly every time you instantiate the client, you can use `notebook_login()` (in Jupyter notebooks), `huggingface-cli login` (in the terminal), or `login(token=YOUR_TOKEN)` (everywhere else) to log in a single time. Your token will then be used automatically for all subsequent calls.
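
For example, here's a minimal sketch of logging in once with `login()` and then creating a client without passing the token explicitly (the prompt and `max_new_tokens` value are just illustrative):

```python
from huggingface_hub import InferenceClient, login

login(token=YOUR_TOKEN)  # one-time login; stores the token locally

# No `token` argument needed anymore: the stored credentials are picked up automatically.
client = InferenceClient(model="meta-llama/Llama-2-70b-chat-hf")
print(client.text_generation("In a surprising turn of events, ", max_new_tokens=30))
```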

In addition to Python, you can also use JavaScript to integrate inference calls inside your JS or Node apps. Take a look at [huggingface.js](https://huggingface.co/docs/huggingface.js/index) to get started!

## Applications

### Chat with Llama 2 and Code Llama

Models fine-tuned to follow chat conversations are trained with very specific chat templates that depend on the model used. You need to be careful about the format the model expects and replicate it in your queries.

The following example was taken from [our Llama 2 blog post](https://huggingface.co/blog/llama2#how-to-prompt-llama-2), which describes in full detail how to query the model for conversation:

```python
prompt = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

There's a llama in my garden 😱 What should I do? [/INST]
"""

response = client.text_generation(prompt, max_new_tokens=200)
print(response)
```

This example shows the structure of the first message in a multi-turn conversation. Note how the `<<SYS>>` delimiter is used to provide the _system prompt_, which tells the model how we expect it to behave. Then our query is inserted between `[INST]` delimiters.

If we wish to continue the conversation, we have to append the model response to the sequence and issue a new follow-up instruction afterwards. This is the general structure of the prompt template we need to use for Llama 2:

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]
```

This same format can be used with Code Llama Instruct to engage in technical conversations with a code-savvy assistant!
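
To make this concrete, here is a minimal sketch of how you could assemble that template programmatically and query Code Llama Instruct with it. The `build_llama2_prompt` helper is just an illustration written for this example, not part of `huggingface_hub`:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="codellama/CodeLlama-34b-Instruct-hf", token=YOUR_TOKEN)

def build_llama2_prompt(system_prompt, turns):
    """Assemble a Llama 2 / Code Llama chat prompt following the template above.

    `turns` is a list of (user_msg, model_answer) pairs; use None as the answer
    for the final message the model should respond to.
    """
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for i, (user_msg, model_answer) in enumerate(turns):
        prefix = "<s>[INST] " if i > 0 else ""
        prompt += f"{prefix}{user_msg} [/INST]"
        if model_answer is not None:
            prompt += f" {model_answer} </s>"
    return prompt

prompt = build_llama2_prompt(
    "You are a helpful and concise coding assistant.",
    [
        ("How do I reverse a list in Python?", "You can use `my_list[::-1]` or `reversed(my_list)`."),
        ("Which of the two is faster for large lists?", None),
    ],
)

response = client.text_generation(prompt, max_new_tokens=200)
print(response)
```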

Please refer to [our Llama 2 blog post](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) for more details.

### Code infilling with Code Llama

Code models like Code Llama can be used for code completion using the same generation strategy we used in the previous examples: you provide a starting string that may contain code or comments, and the model will try to continue the sequence with plausible content. Code models can also be used for _infilling_, a more specialized task where you provide prefix and suffix sequences, and the model will predict what should go in between. This is great for applications such as IDE extensions. Let's see an example using Code Llama:

```python
client = InferenceClient(model="codellama/CodeLlama-13b-hf", token=YOUR_TOKEN)

prompt_prefix = 'def remove_non_ascii(s: str) -> str:\n    """ '
prompt_suffix = "\n    return result"

prompt = f"<PRE> {prompt_prefix} <SUF>{prompt_suffix} <MID>"

infilled = client.text_generation(prompt, max_new_tokens=150)
infilled = infilled.removesuffix(" <EOT>")  # strip the trailing end-of-infill marker
print(f"{prompt_prefix}{infilled}{prompt_suffix}")
```

```
def remove_non_ascii(s: str) -> str:
    """ Remove non-ASCII characters from a string.

    Args:
        s (str): The string to remove non-ASCII characters from.

    Returns:
        str: The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c
    return result
```

As you can see, the format used for infilling follows this pattern:

```
prompt = f"<PRE> {prompt_prefix} <SUF>{prompt_suffix} <MID>"
```

For more details on how this task works, please take a look at https://huggingface.co/blog/codellama#code-completion.

### Stable Diffusion XL

SDXL is also available for PRO users. The response returned by the endpoint consists of a byte stream representing the generated image. If you use `InferenceClient`, it will automatically decode it to a `PIL` image for you:

```python
sdxl = InferenceClient(model="stabilityai/stable-diffusion-xl-base-1.0", token=YOUR_TOKEN)
image = sdxl.text_to_image(
    "Dark gothic city in a misty night, lit by street lamps. A man in a cape is walking away from us",
    guidance_scale=9,
)
```

For more details on how to control generation, please take a look at [this section](#controlling-image-generation).

## Generation Parameters

### Controlling Text Generation

Text generation is a rich topic, and there exist several generation strategies for different purposes. We recommend [this excellent overview](https://huggingface.co/blog/how-to-generate) on the subject. Many generation algorithms are supported by the text generation endpoints, and they can be configured using the following parameters:

- `do_sample`: If set to `False` (the default), the generation method will be _greedy search_, which selects the most probable continuation sequence after the prompt you provide. Greedy search is deterministic, so the same results will always be returned from the same input. When `do_sample` is `True`, tokens will be sampled from a probability distribution and will therefore vary across invocations.
- `temperature`: Controls the amount of variation we desire from the generation. A temperature of `0` is equivalent to greedy search. If we set a value for `temperature`, then `do_sample` will automatically be enabled. The same thing happens for `top_k` and `top_p`. When doing code-related tasks, we want less variability and hence recommend a low `temperature`. For other tasks, such as open-ended text generation, we recommend a higher one.
- `top_k`: Enables "Top-K" sampling: the model will choose from the `K` most probable tokens that may occur after the input sequence. Typical values are between 10 and 50.
- `top_p`: Enables "nucleus sampling": the model will choose from as many tokens as necessary to cover a particular probability mass. If `top_p` is 0.9, the 90% most probable tokens will be considered for sampling, and the trailing 10% will be ignored.
- `repetition_penalty`: Tries to avoid repeated words in the generated sequence.
- `seed`: Random seed that you can use in combination with sampling, for reproducibility purposes.

In addition to the sampling parameters above, you can also control general aspects of the generation with the following (a combined example follows this list):

- `max_new_tokens`: maximum number of new tokens to generate. The default is `20`; feel free to increase it if you want longer sequences.
- `return_full_text`: whether to include the input sequence in the output returned by the endpoint. The default used by `InferenceClient` is `False`, but the endpoint itself uses `True` by default.
- `stop_sequences`: a list of sequences that will cause generation to stop when encountered in the output.

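As a quick illustration, here is a sketch that combines several of these parameters, reusing the `client` from the [Getting started](#getting-started-with-inference-for-pros) section. The specific values are illustrative only, not tuned recommendations:

```python
output = client.text_generation(
    "In a surprising turn of events, ",
    do_sample=True,           # enable sampling (also implied by setting temperature/top_k/top_p)
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
    seed=42,                  # makes the sampled output reproducible
    max_new_tokens=100,
    stop_sequences=["\n\n"],  # stop at the first empty line
)
print(output)
```
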
### Controlling Image Generation

If you want finer-grained control over images generated with the SDXL endpoint, you can use the following parameters (a usage example follows the list):

- `negative_prompt`: A text describing content that you want the model to steer _away_ from.
- `guidance_scale`: How closely you want the model to match the prompt. Lower numbers are less accurate; very high numbers might decrease image quality or generate artifacts.
- `width` and `height`: The desired image dimensions. SDXL works best for sizes between 768 and 1024 pixels.
- `num_inference_steps`: The number of denoising steps to run. Larger numbers may produce better quality but will be slower. Typical values are between 20 and 50 steps.

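For example, here is a sketch that reuses the `sdxl` client from the [Stable Diffusion XL](#stable-diffusion-xl) section; the prompt, negative prompt, and parameter values are illustrative only:

```python
image = sdxl.text_to_image(
    "Dark gothic city in a misty night, lit by street lamps. A man in a cape is walking away from us",
    negative_prompt="blurry, low resolution, deformed, watermark",
    guidance_scale=9,
    width=1024,
    height=768,
    num_inference_steps=30,
)
image.save("gothic_city.png")  # `text_to_image` returns a PIL image
```
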
For additional details on text-to-image generation, we recommend you check the [diffusers library documentation](https://huggingface.co/docs/diffusers/using-diffusers/sdxl).

### Caching

If you run the same generation multiple times, you’ll see that the result returned by the API is the same (even if you are using sampling instead of greedy decoding). This is because recent results are cached. To force a different response each time, we can use an HTTP header to tell the server to run a new generation each time: `x-use-cache: 0`.

If you are using `InferenceClient`, you can simply add it to the `headers` client property:

```python
client = InferenceClient(model="meta-llama/Llama-2-70b-chat-hf", token=YOUR_TOKEN)
client.headers["x-use-cache"] = "0"

output = client.text_generation("In a surprising turn of events, ", do_sample=True)
print(output)
```

### Streaming

Token streaming is the mode in which the server returns the tokens one by one as the model generates them. This enables showing progressive generations to the user rather than waiting for the whole generation to finish. Streaming is an essential part of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.

<div class="flex justify-center">
    <img
        class="block dark:hidden"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual_360.gif"
    />
    <img
        class="hidden dark:block"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual-dark_360.gif"
    />
</div>

To stream tokens with `InferenceClient`, simply pass `stream=True` and iterate over the response.

```python
for token in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
    print(token)

# To
# make
# cheese
#,
# you
# need
# to
# start
# with
# milk
```
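
If your application is asynchronous, recent versions of `huggingface_hub` also provide an `AsyncInferenceClient` with the same interface. Streaming with it looks roughly like the following sketch (assuming a library version that includes the async client):

```python
import asyncio

from huggingface_hub import AsyncInferenceClient

async def main():
    client = AsyncInferenceClient(model="meta-llama/Llama-2-70b-chat-hf", token=YOUR_TOKEN)
    # With stream=True, awaiting the call yields an async iterator of tokens.
    async for token in await client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
        print(token)

asyncio.run(main())
```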

To use the `generate_stream` endpoint with curl, you can add the `-N`/`--no-buffer` flag, which disables curl's default buffering and shows data as it arrives from the server.

```bash
curl -N https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, ", "parameters": {"temperature": 0.7, "max_new_tokens": 100}}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"
```

## Subscribe to PRO

You can sign up today for a PRO subscription [here](https://huggingface.co/subscribe/pro). Benefit from higher rate limits, custom accelerated endpoints for the latest models, and early access to features. If you've built some exciting projects with the Inference API or are looking for a model not available in Inference for PROs, please [use this discussion](https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/13). [Enterprise users](https://huggingface.co/enterprise) also benefit from the PRO Inference API on top of other features, such as SSO.

## FAQ

**Does this affect the free Inference API?**

No. We still expose thousands of models through free APIs that allow people to prototype and explore model capabilities quickly.

**Does this affect Enterprise users?**

Users with an Enterprise subscription also benefit from the accelerated Inference API for curated models.

**Can I use my own models with the PRO Inference API?**

The free Inference API already supports a wide range of small and medium models from a variety of libraries (such as diffusers, transformers, and sentence-transformers). If you have a custom model or custom inference logic, we recommend using [Inference Endpoints](https://ui.endpoints.huggingface.co/catalog).