@@ -24,59 +24,49 @@ This guide explains how to compile, load, and use [Sentence Transformers (SBERT)

### Convert Sentence Transformers model to AWS Inferentia2

-First, you need to convert your Sentence Transformers model to a format compatible with AWS Inferentia2. You can compile Sentence Transformers models with Optimum Neuron using either the `optimum-cli` or the `NeuronModelForSentenceTransformers` class. Below you will find examples for both approaches. Make sure `sentence-transformers` is installed; it is only needed for exporting the model.
+First, you need to convert your Sentence Transformers model to a format compatible with AWS Inferentia2. You can compile Sentence Transformers models with Optimum Neuron using either the `optimum-cli` or the `NeuronSentenceTransformers` class. Below you will find examples for both approaches. Make sure `sentence-transformers` is installed; it is only needed for exporting the model.

```bash
pip install sentence-transformers
```

-Here we will use the `NeuronModelForSentenceTransformers` class, which can convert any Sentence Transformers model to a format compatible with AWS Inferentia2 or load already converted models. When exporting models with `NeuronModelForSentenceTransformers` you need to set `export=True` and define the input shape and batch size. The input shape is defined by `sequence_length` and the batch size by `batch_size`.
+Here we will use the `NeuronSentenceTransformers` class, which can convert any Sentence Transformers model to a format compatible with AWS Inferentia2 or load already converted models. When exporting models with `NeuronSentenceTransformers` you need to set `export=True` and define the input shape and batch size. The input shape is defined by `sequence_length` and the batch size by `batch_size`.

```python
-from optimum.neuron import NeuronModelForSentenceTransformers
+from optimum.neuron import NeuronSentenceTransformers

# Sentence Transformers model from HuggingFace
model_id = "BAAI/bge-small-en-v1.5"
input_shapes = {"batch_size": 1, "sequence_length": 384}  # mandatory shapes

# Load Transformers model and export it to AWS Inferentia2
-model = NeuronModelForSentenceTransformers.from_pretrained(model_id, export=True, **input_shapes)
+model = NeuronSentenceTransformers.from_pretrained(model_id, export=True, **input_shapes)

# Save model to disk
model.save_pretrained("bge_emb_inf2/")
```

-Here we will use the `optimum-cli` to convert the model. Similar to `NeuronModelForSentenceTransformers`, we need to define our input shape and batch size. The input shape is defined by `sequence_length` and the batch size by `batch_size`. The `optimum-cli` will automatically convert the model to a format compatible with AWS Inferentia2 and save it to the specified output directory.
+Here we will use the `optimum-cli` to convert the model. Similar to `NeuronSentenceTransformers`, we need to define our input shape and batch size. The input shape is defined by `sequence_length` and the batch size by `batch_size`. The `optimum-cli` will automatically convert the model to a format compatible with AWS Inferentia2 and save it to the specified output directory.

```bash
optimum-cli export neuron -m BAAI/bge-small-en-v1.5 --sequence_length 384 --batch_size 1 --task feature-extraction bge_emb_inf2/
```
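
If you want the compiled model to be loadable straight from the Hugging Face Hub later on (as mentioned in the next section), you could push the exported folder to a model repository. This step is entirely optional; a minimal sketch using `huggingface_hub`, where the repository name is a placeholder:

```python
from huggingface_hub import HfApi

api = HfApi()

# Hypothetical repository name -- replace with your own namespace/repo
repo_id = "my-user/bge-small-en-v1.5-neuron"
api.create_repo(repo_id, exist_ok=True)

# Upload the folder produced by the export (compiled model plus config and tokenizer files)
api.upload_folder(repo_id=repo_id, folder_path="bge_emb_inf2/")
```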

### Load compiled Sentence Transformers model and run inference

-Once we have a compiled Sentence Transformers model, which we either exported ourselves or which is available on the Hugging Face Hub, we can load it and run inference. To load the model we can use the `NeuronModelForSentenceTransformers` class, which is an abstraction layer for the `SentenceTransformer` class. The `NeuronModelForSentenceTransformers` class will automatically pad the input to the specified `sequence_length` and run inference on AWS Inferentia2.
+Once we have a compiled Sentence Transformers model, which we either exported ourselves or which is available on the Hugging Face Hub, we can load it and run inference. To load the model we can use the `NeuronSentenceTransformers` class, which is an abstraction layer for the `SentenceTransformer` class. The `NeuronSentenceTransformers` class will automatically pad the input to the specified `sequence_length` and run inference on AWS Inferentia2.

```python
-from optimum.neuron import NeuronModelForSentenceTransformers
-from transformers import AutoTokenizer
+from optimum.neuron import NeuronSentenceTransformers

model_id_or_path = "bge_emb_inf2/"
-tokenizer_id = "BAAI/bge-small-en-v1.5"

-# Load model and tokenizer
-model = NeuronModelForSentenceTransformers.from_pretrained(model_id_or_path)
-tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
+# Load model
+model = NeuronSentenceTransformers.from_pretrained(model_id_or_path)

# Run inference
-prompt = "I like to eat apples"
-encoded_input = tokenizer(prompt, return_tensors='pt')
-outputs = model(**encoded_input)
-
-token_embeddings = outputs.token_embeddings
-sentence_embedding = outputs.sentence_embedding
-
-print(f"token embeddings: {token_embeddings.shape}")  # torch.Size([1, 7, 384])
-print(f"sentence_embedding: {sentence_embedding.shape}")  # torch.Size([1, 384])
+token_embeddings = model.encode("I like to eat apples", output_value="token_embeddings")
+sentence_embedding = model.encode("I like to eat apples", output_value="sentence_embedding")
```
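
As a quick sanity check of the exported model, you could compare the embeddings of two sentences with a cosine similarity score. A minimal sketch, assuming the compiled model lives in `bge_emb_inf2/` and that `encode` accepts the input text as its first argument, as in the usual `sentence-transformers` API; the example sentences are made up:

```python
from sentence_transformers import util

from optimum.neuron import NeuronSentenceTransformers

model = NeuronSentenceTransformers.from_pretrained("bge_emb_inf2/")

# Encode two sentences; inputs are padded to the compiled sequence_length
emb_1 = model.encode("I like to eat apples", output_value="sentence_embedding")
emb_2 = model.encode("Apples are my favorite fruit", output_value="sentence_embedding")

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(emb_1, emb_2))
```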

### Production Usage
@@ -89,18 +79,18 @@ For deploying these models in a production environment, refer to the [Amazon Sag

### Compile CLIP for AWS Inferentia2

-You can compile CLIP models with Optimum Neuron using either the `optimum-cli` or the `NeuronModelForSentenceTransformers` class. Adopt whichever approach you prefer:
+You can compile CLIP models with Optimum Neuron using either the `optimum-cli` or the `NeuronSentenceTransformers` class. Adopt whichever approach you prefer:

* With the Optimum CLI

```bash
optimum-cli export neuron -m sentence-transformers/clip-ViT-B-32 --sequence_length 64 --text_batch_size 3 --image_batch_size 1 --num_channels 3 --height 224 --width 224 --task feature-extraction --subfolder 0_CLIPModel clip_emb/
```

-* With the `NeuronModelForSentenceTransformers` class
+* With the `NeuronSentenceTransformers` class

```python
-from optimum.neuron import NeuronModelForSentenceTransformers
+from optimum.neuron import NeuronSentenceTransformers

model_id = "sentence-transformers/clip-ViT-B-32"

@@ -114,7 +104,7 @@ input_shapes = {
    "sequence_length": 64,
}

-emb_model = NeuronModelForSentenceTransformers.from_pretrained(
+emb_model = NeuronSentenceTransformers.from_pretrained(
    model_id, subfolder="0_CLIPModel", export=True, library_name="sentence_transformers", dynamic_batch_size=False, **input_shapes
)

@@ -130,10 +120,10 @@ from PIL import Image
from sentence_transformers import util
from transformers import CLIPProcessor

-from optimum.neuron import NeuronModelForSentenceTransformers
+from optimum.neuron import NeuronSentenceTransformers

save_directory = "clip_emb"
-emb_model = NeuronModelForSentenceTransformers.from_pretrained(save_directory)
+emb_model = NeuronSentenceTransformers.from_pretrained(save_directory)

processor = CLIPProcessor.from_pretrained(save_directory)
inputs = processor(
@@ -154,7 +144,7 @@ print(cos_scores)

**Caveat**

-Since compiled models with dynamic batching enabled only accept input tensors with the same batch size, we cannot set `dynamic_batch_size=True` if the input texts and images have different batch sizes. And since the `NeuronModelForSentenceTransformers` class pads the inputs to the batch sizes (`text_batch_size` and `image_batch_size`) used during compilation, you could use relatively larger batch sizes during compilation for flexibility, at the cost of extra compute.
+Since compiled models with dynamic batching enabled only accept input tensors with the same batch size, we cannot set `dynamic_batch_size=True` if the input texts and images have different batch sizes. And since the `NeuronSentenceTransformers` class pads the inputs to the batch sizes (`text_batch_size` and `image_batch_size`) used during compilation, you could use relatively larger batch sizes during compilation for flexibility, at the cost of extra compute.

E.g., if you want to encode 3, 4, or 5 texts and 1 image, you could set `text_batch_size = 5 = max(3, 4, 5)` and `image_batch_size = 1` during the compilation.
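
To make that concrete, the compilation call from the earlier snippet would simply use the larger text batch size. A minimal sketch with the same model and shapes as above, only the batch sizes changed:

```python
from optimum.neuron import NeuronSentenceTransformers

# Compile with the largest text batch size you expect (here 5) and a single image
input_shapes = {
    "num_channels": 3,
    "height": 224,
    "width": 224,
    "text_batch_size": 5,   # max(3, 4, 5)
    "image_batch_size": 1,
    "sequence_length": 64,
}

emb_model = NeuronSentenceTransformers.from_pretrained(
    "sentence-transformers/clip-ViT-B-32",
    subfolder="0_CLIPModel",
    export=True,
    library_name="sentence_transformers",
    dynamic_batch_size=False,
    **input_shapes,
)
```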