diff --git a/packages/inference/README.md b/packages/inference/README.md
index 55cff9429c..ad4fcb879d 100644
--- a/packages/inference/README.md
+++ b/packages/inference/README.md
@@ -1,11 +1,11 @@
 # 🤗 Hugging Face Inference
 
-A Typescript powered wrapper for Inference Providers (serverless) and Inference Endpoints (dedicated).
-It works with [Inference Providers (serverless)](https://huggingface.co/docs/api-inference/index) – including all supported third-party Inference Providers – and [Inference Endpoints (dedicated)](https://huggingface.co/docs/inference-endpoints/index), and even with .
+A TypeScript-powered wrapper that provides a unified interface to run inference across multiple services for models hosted on the Hugging Face Hub:
 
-Check out the [full documentation](https://huggingface.co/docs/huggingface.js/inference/README).
+1. [Inference Providers](https://huggingface.co/docs/inference-providers/index): streamlined, unified access to hundreds of machine learning models, powered by our serverless inference partners. This new approach builds on our previous Serverless Inference API, offering more models, improved performance, and greater reliability thanks to world-class providers. Refer to the [documentation](https://huggingface.co/docs/inference-providers/index#partners) for a list of supported providers.
+2. [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index): a product to easily deploy models to production. Inference is run by Hugging Face in a dedicated, fully managed infrastructure on a cloud provider of your choice.
+3. Local endpoints: you can also run inference with local inference servers like [llama.cpp](https://github.com/ggerganov/llama.cpp), [Ollama](https://ollama.com/), [vLLM](https://github.com/vllm-project/vllm), [LiteLLM](https://docs.litellm.ai/docs/simple_proxy), or [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) by connecting the client to these local endpoints.
 
-You can also try out a live [interactive notebook](https://observablehq.com/@huggingface/hello-huggingface-js-inference), see some demos on [hf.co/huggingfacejs](https://huggingface.co/huggingfacejs), or watch a [Scrimba tutorial that explains how Inference Endpoints works](https://scrimba.com/scrim/cod8248f5adfd6e129582c523).
 
 ## Getting Started
 
@@ -42,7 +42,7 @@ const hf = new InferenceClient('your access token');
 
 Your access token should be kept private. If you need to protect it in front-end applications, we suggest setting up a proxy server that stores the access token.
 
-### All supported inference providers
+## Using Inference Providers
 
 You can send inference requests to third-party providers with the inference client.
 
@@ -50,6 +50,7 @@ Currently, we support the following providers:
 - [Fal.ai](https://fal.ai)
 - [Featherless AI](https://featherless.ai)
 - [Fireworks AI](https://fireworks.ai)
+- [HF Inference](https://huggingface.co/docs/inference-providers/providers/hf-inference)
 - [Hyperbolic](https://hyperbolic.xyz)
 - [Nebius](https://studio.nebius.ai)
 - [Novita](https://novita.ai/?utm_source=github_huggingface&utm_medium=github_readme&utm_campaign=link)
@@ -63,7 +64,8 @@ Currently, we support the following providers:
 - [Cerebras](https://cerebras.ai/)
 - [Groq](https://groq.com)
 
-To send requests to a third-party provider, you have to pass the `provider` parameter to the inference function. Make sure your request is authenticated with an access token.
+To send requests to a third-party provider, you have to pass the `provider` parameter to the inference function. The default value is `"auto"`, which selects the first provider available for the model, sorted by your preferred order in https://hf.co/settings/inference-providers.
+
 ```ts
 const accessToken = "hf_..."; // Either a HF access token, or an API key from the third-party provider (Replicate in this example)
 
@@ -75,6 +77,7 @@ await client.textToImage({
 })
 ```
 
+You also have to make sure your request is authenticated with an access token.
 When authenticated with a Hugging Face access token, the request is routed through https://huggingface.co.
 When authenticated with a third-party provider key, the request is made directly against that provider's inference API.
 
@@ -82,6 +85,7 @@ Only a subset of models are supported when requesting third-party providers. You
 - [Fal.ai supported models](https://huggingface.co/api/partners/fal-ai/models)
 - [Featherless AI supported models](https://huggingface.co/api/partners/featherless-ai/models)
 - [Fireworks AI supported models](https://huggingface.co/api/partners/fireworks-ai/models)
+- [HF Inference supported models](https://huggingface.co/api/partners/hf-inference/models)
 - [Hyperbolic supported models](https://huggingface.co/api/partners/hyperbolic/models)
 - [Nebius supported models](https://huggingface.co/api/partners/nebius/models)
 - [Nscale supported models](https://huggingface.co/api/partners/nscale/models)
@@ -92,7 +96,6 @@ Only a subset of models are supported when requesting third-party providers. You
 - [Cohere supported models](https://huggingface.co/api/partners/cohere/models)
 - [Cerebras supported models](https://huggingface.co/api/partners/cerebras/models)
 - [Groq supported models](https://console.groq.com/docs/models)
-- [HF Inference API (serverless)](https://huggingface.co/models?inference=warm&sort=trending)
 
 ❗**Important note:** To be compatible, the third-party API must adhere to the "standard" shape API we expect on HF model pages for each pipeline task type. This is not an issue for LLMs as everyone converged on the OpenAI API anyways, but can be more tricky for other tasks like "text-to-image" or "automatic-speech-recognition" where there exists no standard API. Let us know if any help is needed or if we can make things easier for you!
 
@@ -116,22 +119,22 @@ await textGeneration({
 
 This will enable tree-shaking by your bundler.
 
-## Natural Language Processing
+### Natural Language Processing
 
-### Text Generation
+#### Text Generation
 
 Generates text from an input prompt.
 
-[Demo](https://huggingface.co/spaces/huggingfacejs/streaming-text-generation)
-
 ```typescript
 await hf.textGeneration({
-  model: 'gpt2',
+  model: 'mistralai/Mixtral-8x7B-v0.1',
+  provider: "together",
   inputs: 'The answer to the universe is'
 })
 
 for await (const output of hf.textGenerationStream({
-  model: "google/flan-t5-xxl",
+  model: "mistralai/Mixtral-8x7B-v0.1",
+  provider: "together",
   inputs: 'repeat "one two three four"',
   parameters: { max_new_tokens: 250 }
 })) {
@@ -139,16 +142,15 @@ for await (const output of hf.textGenerationStream({
 }
 ```
 
-### Text Generation (Chat Completion API Compatible)
-
-Using the `chatCompletion` method, you can generate text with models compatible with the OpenAI Chat Completion API. All models served by [TGI](https://huggingface.co/docs/text-generation-inference/) on Hugging Face support Messages API.
+#### Chat Completion
 
-[Demo](https://huggingface.co/spaces/huggingfacejs/streaming-chat-completion)
+Generate a model response from a list of messages comprising a conversation.
 
 ```typescript
 // Non-streaming API
 const out = await hf.chatCompletion({
-  model: "meta-llama/Llama-3.1-8B-Instruct",
+  model: "Qwen/Qwen3-32B",
+  provider: "cerebras",
   messages: [{ role: "user", content: "Hello, nice to meet you!" }],
   max_tokens: 512,
   temperature: 0.1,
@@ -157,7 +159,8 @@ const out = await hf.chatCompletion({
 // Streaming API
 let out = "";
 for await (const chunk of hf.chatCompletionStream({
-  model: "meta-llama/Llama-3.1-8B-Instruct",
+  model: "Qwen/Qwen3-32B",
+  provider: "cerebras",
   messages: [
     { role: "user", content: "Can you help me solve an equation?" },
   ],
@@ -169,33 +172,18 @@ for await (const chunk of hf.chatCompletionStream({
   }
 }
 ```
+#### Feature Extraction
 
-It's also possible to call Mistral or OpenAI endpoints directly:
+This task reads some text and outputs raw float values that are usually consumed as part of a semantic database/semantic search.
 
 ```typescript
-const openai = new InferenceClient(OPENAI_TOKEN).endpoint("https://api.openai.com");
-
-let out = "";
-for await (const chunk of openai.chatCompletionStream({
-  model: "gpt-3.5-turbo",
-  messages: [
-    { role: "user", content: "Complete the equation 1+1= ,just the answer" },
-  ],
-  max_tokens: 500,
-  temperature: 0.1,
-  seed: 0,
-})) {
-  if (chunk.choices && chunk.choices.length > 0) {
-    out += chunk.choices[0].delta.content;
-  }
-}
-
-// For mistral AI:
-// endpointUrl: "https://api.mistral.ai"
-// model: "mistral-tiny"
+await hf.featureExtraction({
+  model: "sentence-transformers/distilbert-base-nli-mean-tokens",
+  inputs: "That is a happy person",
+});
 ```
 
-### Fill Mask
+#### Fill Mask
 
 Tries to fill in a hole with a missing word (token to be precise).
 
@@ -206,7 +194,7 @@ await hf.fillMask({
 })
 ```
 
-### Summarization
+#### Summarization
 
 Summarizes longer text into shorter text. Be careful, some models have a maximum length of input.
 
@@ -221,7 +209,7 @@ await hf.summarization({
 })
 ```
 
-### Question Answering
+#### Question Answering
 
 Answers questions based on the context you provide.
 
@@ -235,7 +223,7 @@ await hf.questionAnswering({
 })
 ```
 
-### Table Question Answering
+#### Table Question Answering
 
 ```typescript
 await hf.tableQuestionAnswering({
@@ -252,7 +240,7 @@ await hf.tableQuestionAnswering({
 })
 ```
 
-### Text Classification
+#### Text Classification
 
 Often used for sentiment analysis, this method will assign labels to the given text along with a probability score of that label.
 
@@ -263,7 +251,7 @@ await hf.textClassification({
 })
 ```
 
-### Token Classification
+#### Token Classification
 
 Used for sentence parsing, either grammatical, or Named Entity Recognition (NER) to understand keywords contained within text.
 
@@ -274,7 +262,7 @@ await hf.tokenClassification({
 })
 ```
 
-### Translation
+#### Translation
 
 Converts text from one language to another.
 
@@ -294,7 +282,7 @@ await hf.translation({
 })
 ```
 
-### Zero-Shot Classification
+#### Zero-Shot Classification
 
 Checks how well an input text fits into a set of labels you provide.
 
@@ -308,22 +296,7 @@ await hf.zeroShotClassification({
 })
 ```
 
-### Conversational
-
-This task corresponds to any chatbot-like structure. Models tend to have shorter max_length, so please check with caution when using a given model if you need long-range dependency or not.
-
-```typescript
-await hf.conversational({
-  model: 'microsoft/DialoGPT-large',
-  inputs: {
-    past_user_inputs: ['Which movie is the best ?'],
-    generated_responses: ['It is Die Hard for sure.'],
-    text: 'Can you explain why ?'
-  }
-})
-```
-
-### Sentence Similarity
+#### Sentence Similarity
 
 Calculate the semantic similarity between one text and a list of other sentences.
 
@@ -341,9 +314,9 @@ await hf.sentenceSimilarity({
 })
 ```
 
-## Audio
+### Audio
 
-### Automatic Speech Recognition
+#### Automatic Speech Recognition
 
 Transcribes speech from an audio file.
 
@@ -356,7 +329,7 @@ await hf.automaticSpeechRecognition({
 })
 ```
 
-### Audio Classification
+#### Audio Classification
 
 Assigns labels to the given audio along with a probability score of that label.
 
@@ -369,7 +342,7 @@ await hf.audioClassification({
 })
 ```
 
-### Text To Speech
+#### Text To Speech
 
 Generates natural-sounding speech from text input.
 
@@ -382,7 +355,7 @@ await hf.textToSpeech({
 })
 ```
 
-### Audio To Audio
+#### Audio To Audio
 
 Outputs one or multiple generated audios from an input audio, commonly used for speech enhancement and source separation.
 
@@ -393,9 +366,9 @@ await hf.audioToAudio({
 })
 ```
 
-## Computer Vision
+### Computer Vision
 
-### Image Classification
+#### Image Classification
 
 Assigns labels to a given image along with a probability score of that label.
 
@@ -408,7 +381,7 @@ await hf.imageClassification({
 })
 ```
 
-### Object Detection
+#### Object Detection
 
 Detects objects within an image and returns labels with corresponding bounding boxes and probability scores.
 
@@ -421,7 +394,7 @@ await hf.objectDetection({
 })
 ```
 
-### Image Segmentation
+#### Image Segmentation
 
 Detects segments within an image and returns labels with corresponding bounding boxes and probability scores.
 
@@ -432,7 +405,7 @@ await hf.imageSegmentation({
 })
 ```
 
-### Image To Text
+#### Image To Text
 
 Outputs text from a given image, commonly used for captioning or optical character recognition.
 
@@ -443,7 +416,7 @@ await hf.imageToText({
 })
 ```
 
-### Text To Image
+#### Text To Image
 
 Creates an image from a text prompt.
 
@@ -456,7 +429,7 @@ await hf.textToImage({
 })
 ```
 
-### Image To Image
+#### Image To Image
 
 Image-to-image is the task of transforming a source image to match the characteristics of a target image or a target image domain.
 
@@ -472,7 +445,7 @@ await hf.imageToImage({
 });
 ```
 
-### Zero Shot Image Classification
+#### Zero Shot Image Classification
 
 Checks how well an input image fits into a set of labels you provide.
 
@@ -488,20 +461,10 @@ await hf.zeroShotImageClassification({
 })
 ```
 
-## Multimodal
-
-### Feature Extraction
-
-This task reads some text and outputs raw float values, that are usually consumed as part of a semantic database/semantic search.
+### Multimodal
 
-```typescript
-await hf.featureExtraction({
-  model: "sentence-transformers/distilbert-base-nli-mean-tokens",
-  inputs: "That is a happy person",
-});
-```
-### Visual Question Answering
+#### Visual Question Answering
 
 Visual Question Answering is the task of answering open-ended questions based on an image. They output natural language responses to natural language questions.
 
@@ -517,7 +480,7 @@ await hf.visualQuestionAnswering({
 })
 ```
 
-### Document Question Answering
+#### Document Question Answering
 
 Document question answering models take a (document, question) pair as input and return an answer in natural language.
@@ -533,9 +496,9 @@ await hf.documentQuestionAnswering({
 })
 ```
 
-## Tabular
+### Tabular
 
-### Tabular Regression
+#### Tabular Regression
 
 Tabular regression is the task of predicting a numerical value given a set of attributes.
 
@@ -555,7 +518,7 @@ await hf.tabularRegression({
 })
 ```
 
-### Tabular Classification
+#### Tabular Classification
 
 Tabular classification is the task of classifying a target category (a group) based on set of attributes.
 
@@ -600,48 +563,80 @@ for await (const chunk of stream) {
 }
 ```
 
-## Custom Inference Endpoints
+## Using Inference Endpoints
 
-Learn more about using your own inference endpoints [here](https://hf.co/docs/inference-endpoints/)
+The examples we saw above use Inference Providers, which are very useful for prototyping
+and testing things quickly. Once you're ready to deploy your model to production, you'll need dedicated infrastructure. That's where [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) comes into play. It allows you to deploy any model and expose it as a private API. Once deployed, you'll get a URL that you can connect to:
 
 ```typescript
-const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/gpt2');
-const { generated_text } = await gpt2.textGeneration({inputs: 'The answer to the universe is'});
+import { InferenceClient } from '@huggingface/inference';
 
-// Chat Completion Example
-const ep = hf.endpoint(
-  "https://router.huggingface.co/hf-inference/models/meta-llama/Llama-3.1-8B-Instruct"
-);
-const stream = ep.chatCompletionStream({
-  model: "tgi",
-  messages: [{ role: "user", content: "Complete the equation 1+1= ,just the answer" }],
-  max_tokens: 500,
-  temperature: 0.1,
-  seed: 0,
+const hf = new InferenceClient("hf_xxxxxxxxxxxxxx", {
+  endpointUrl: "https://j3z5luu0ooo76jnl.us-east-1.aws.endpoints.huggingface.cloud/v1/",
 });
-let out = "";
-for await (const chunk of stream) {
-  if (chunk.choices && chunk.choices.length > 0) {
-    out += chunk.choices[0].delta.content;
-    console.log(out);
-  }
-}
+
+const response = await hf.chatCompletion({
+  messages: [
+    {
+      role: "user",
+      content: "What is the capital of France?",
+    },
+  ],
+});
+
+console.log(response.choices[0].message.content);
 ```
 
-By default, all calls to the inference endpoint will wait until the model is
-loaded. When [scaling to
-0](https://huggingface.co/docs/inference-endpoints/en/autoscaling#scaling-to-0)
-is enabled on the endpoint, this can result in non-trivial waiting time. If
-you'd rather disable this behavior and handle the endpoint's returned 500 HTTP
-errors yourself, you can do so like so:
+By default, all calls to the inference endpoint will wait until the model is loaded. When [scaling to 0](https://huggingface.co/docs/inference-endpoints/en/autoscaling#scaling-to-0)
+is enabled on the endpoint, this can result in non-trivial waiting time. If you'd rather disable this behavior and handle the endpoint's returned 500 HTTP errors yourself, you can do so as follows:
 
 ```typescript
-const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/gpt2');
-const { generated_text } = await gpt2.textGeneration(
-  {inputs: 'The answer to the universe is'},
-  {retry_on_error: false},
+const hf = new InferenceClient("hf_xxxxxxxxxxxxxx", {
+  endpointUrl: "https://j3z5luu0ooo76jnl.us-east-1.aws.endpoints.huggingface.cloud/v1/",
+});
+
+const response = await hf.chatCompletion(
+  {
+    messages: [
+      {
+        role: "user",
+        content: "What is the capital of France?",
+      },
+    ],
+  },
+  {
+    retry_on_error: false,
+  }
 );
 ```
 
+## Using local endpoints
+
+You can use `InferenceClient` to run chat completion with local inference servers (llama.cpp, vllm, litellm server, TGI, mlx, etc.) running on your own machine. The local server should expose an OpenAI-compatible API.
+
+```typescript
+import { InferenceClient } from '@huggingface/inference';
+
+const hf = new InferenceClient(undefined, {
+  endpointUrl: "http://localhost:8080",
+});
+
+const response = await hf.chatCompletion({
+  messages: [
+    {
+      role: "user",
+      content: "What is the capital of France?",
+    },
+  ],
+});
+
+console.log(response.choices[0].message.content);
+```
+
+Similarly to the OpenAI JS client, `InferenceClient` can be used to run Chat Completion inference with any OpenAI REST API-compatible endpoint.
+
 ## Running tests