docs/source/en/quick_tour.md
+83 −19 (83 additions, 19 deletions)
@@ -1,4 +1,4 @@
- <!--Copyright 2023 The HuggingFace Team. All rights reserved.
+ <!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -16,18 +16,18 @@ rendered properly in your Markdown viewer.
# Quick Tour

- ## Text Embeddings
+ ## Set up

The easiest way to get started with TEI is to use one of the official Docker containers
(see [Supported models and hardware](supported_models) to choose the right container).

- After making sure that your hardware is supported, install the
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) if you
- plan on utilizing GPUs. NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.
+ Hence, you first need to install Docker by following the [installation instructions](https://docs.docker.com/get-docker/).

- Next, install Docker following their [installation instructions](https://docs.docker.com/get-docker/).
+ TEI supports inference on both GPU and CPU. If you plan on using a GPU, make sure your hardware is supported, then install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). The NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.

- Finally, deploy your model. Let's say you want to use `BAAI/bge-large-en-v1.5`. Here's how you can do this:
+ ## Deploy
+
+ Next, it's time to deploy your model. Let's say you want to use [`BAAI/bge-large-en-v1.5`](https://huggingface.co/BAAI/bge-large-en-v1.5). Here's how you can do this:

```shell
model=BAAI/bge-large-en-v1.5
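The rest of this deployment snippet falls outside the hunk above. For reference, a complete deployment typically mirrors the re-ranker example later in this diff; a sketch (the exact flags in the elided lines may differ):

```shell
# sketch of the elided lines; mirrors the docker run command shown for the re-ranker below
model=BAAI/bge-large-en-v1.5
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
```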
@@ -42,7 +42,13 @@ We also recommend sharing a volume with the Docker container (`volume=$PWD/data`
</Tip>

- Once you have deployed a model, you can use the `embed` endpoint by sending requests:
+ ## Inference
+
+ Inference can be performed in three ways: using cURL, or via the `InferenceClient` or `OpenAI` Python SDKs.
+
+ #### cURL
+
+ To send a POST request to the TEI endpoint using cURL, you can run the following command:

```bash
curl 127.0.0.1:8080/embed \
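The request body sits between this hunk and the next one. Based on the payload shape TEI's `/embed` route expects, the complete command is presumably along these lines (the example input string is a placeholder):

```bash
# sketch of the elided request body
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```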
@@ -51,16 +57,53 @@ curl 127.0.0.1:8080/embed \
    -H 'Content-Type: application/json'
```

- ## Re-rankers
- Re-rankers models are Sequence Classification cross-encoders models with a single class that scores the similarity
- between a query and a text.
- See [this blogpost](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) by
+ #### Python
+
+ To run inference using Python, you can either use the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/en/index) Python SDK (recommended) or the `openai` Python SDK.
+
+ ##### huggingface_hub
+
+ You can install it via pip as `pip install --upgrade --quiet huggingface_hub`, and then run:
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ client = InferenceClient()
+
+ embedding = client.feature_extraction("What is deep learning?",
+                                       model="http://localhost:8080/embed")
+ print(len(embedding[0]))
+ ```
+
+ ##### OpenAI
+
+ You can install it via pip as `pip install --upgrade openai`, and then run:
+
+ ```python
+ from openai import OpenAI
+
+ # NOTE: these set-up lines are not visible in the hunk; the reconstruction assumes TEI's OpenAI-compatible /v1 route
+ client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
+
+ response = client.embeddings.create(
+     model="tei",
+     input="What is deep learning?"
+ )
+
+ print(response)
+ ```
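If you prefer to stay with cURL, the same OpenAI-style request can presumably be sent directly over HTTP, assuming TEI exposes the OpenAI-compatible `/v1/embeddings` route (not shown in this diff):

```bash
# hypothetical equivalent of the OpenAI SDK call above
curl 127.0.0.1:8080/v1/embeddings \
    -X POST \
    -d '{"model": "tei", "input": "What is deep learning?"}' \
    -H 'Content-Type: application/json'
```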
+
+ ## Re-rankers and sequence classification
+
+ TEI also supports re-ranker and classic sequence classification models.
+
+ ### Re-rankers
+
+ Rerankers, also called cross-encoders, are sequence classification models with a single class that scores the similarity between a query and a text. See [this blogpost](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) by
the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve
downstream performance.

- Let's say you want to use `BAAI/bge-reranker-large`:
+ Let's say you want to use [`BAAI/bge-reranker-large`](https://huggingface.co/BAAI/bge-reranker-large). First, you can deploy it like so:

```shell
model=BAAI/bge-reranker-large
@@ -69,8 +112,7 @@ volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
```

- Once you have deployed a model, you can use the `rerank` endpoint to rank the similarity between a query and a list
- of texts:
+ Once you have deployed a model, you can use the `rerank` endpoint to rank the similarity between a query and a list of texts. With `cURL` this can be done like so:

```bash
curl 127.0.0.1:8080/rerank \
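The request body again falls between hunks. The `/rerank` route takes a query plus a list of texts to score against it, so the full call presumably looks like this sketch (the example strings are placeholders):

```bash
# sketch of the elided request body
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep learning is a subfield of machine learning.", "cheese is made from milk"]}' \
    -H 'Content-Type: application/json'
```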
@@ -79,9 +121,20 @@ curl 127.0.0.1:8080/rerank \
    -H 'Content-Type: application/json'
```

- ## Sequence Classification
+ Alternatively, one can perform inference using the `huggingface_hub` Python SDK. You can install it via pip as `pip install --upgrade --quiet huggingface_hub`, and then run:
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ client = InferenceClient()
+ embedding = client.feature_extraction("What is deep learning?",
+                                       model="http://localhost:8080/rerank")
+ print(len(embedding[0]))
+ ```
+
+ ### Sequence classification models

- You can also use classic Sequence Classification models like `SamLowe/roberta-base-go_emotions`:
+ You can also use classic Sequence Classification models like [`SamLowe/roberta-base-go_emotions`](https://huggingface.co/SamLowe/roberta-base-go_emotions):

```shell
model=SamLowe/roberta-base-go_emotions
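The deployment lines elided here presumably mirror the earlier `docker run` command, and the `/predict` request referenced by the next hunk takes a single `inputs` field; a sketch with a placeholder sentence:

```bash
# sketch — the actual elided example may differ
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs": "I like you. I love you."}' \
    -H 'Content-Type: application/json'
```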
@@ -99,9 +152,20 @@ curl 127.0.0.1:8080/predict \
    -H 'Content-Type: application/json'
```

+ Alternatively, one can perform inference using the `huggingface_hub` Python SDK. You can install it via pip as `pip install --upgrade --quiet huggingface_hub`, and then run:
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ client = InferenceClient()
+ embedding = client.feature_extraction("What is deep learning?",
+                                       model="http://localhost:8080/predict")
+ print(len(embedding[0]))
+ ```

## Batching

- You can send multiple inputs in a batch. For example, for embeddings
+ You can send multiple inputs in a batch. For example, for embeddings:

```bash
curl 127.0.0.1:8080/embed \
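The batched payload falls outside this hunk; `/embed` also accepts a list of strings under `inputs`, so the complete command is presumably something like this (placeholder sentences):

```bash
# sketch of the elided batched request
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": ["Today is a nice day", "I like you"]}' \
    -H 'Content-Type: application/json'
```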
@@ -140,4 +204,4 @@ volume=$PWD
# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id /data/gte-base-en-v1.5
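This last hunk runs TEI against a model directory that already exists under `$PWD`. A hypothetical way to fetch those weights beforehand on a machine with network access (the `Alibaba-NLP/gte-base-en-v1.5` repo id is an assumption, not stated in this diff):

```shell
# hypothetical: download the weights locally so that ./gte-base-en-v1.5 exists under $PWD
git lfs install
git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5
```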