
Commit c6ec84e

Add audio endpoints to benchmarking script (#3804) (#3828)
### 🛠 Summary CVS-174708
1 parent b88523c · commit c6ec84e

6 files changed: +205 −21 lines


demos/audio/README.md

Lines changed: 39 additions & 0 deletions
````diff
@@ -102,6 +102,25 @@ print("Generation finished")
 
 Play the speech.wav file to check the generated speech.
 
+## Benchmarking speech generation
+An asynchronous benchmarking client can be used to assess the model server performance under various load conditions. Below are execution examples captured on Intel(R) Core(TM) Ultra 7 258V.
+
+```console
+git clone https://github.com/openvinotoolkit/model_server
+cd model_server/demos/benchmark/v3/
+pip install -r requirements.txt
+python benchmark.py --api_url http://localhost:8122/v3/audio/speech --model microsoft/speecht5_tts --batch_size 1 --limit 100 --request_rate inf --backend text2speech --dataset edinburghcstr/ami --hf-subset 'ihm' --tokenizer openai/whisper-large-v3-turbo --trust-remote-code True
+Number of documents: 100
+100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [01:58<00:00,  1.19s/it]
+Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
+Tokens: 1802
+Success rate: 100.0%. (100/100)
+Throughput - Tokens per second: 15.2
+Mean latency: 63653.98 ms
+Median latency: 66736.83 ms
+Average document length: 18.02 tokens
+```
+
 ## Transcription
 ### Model preparation
 Many variants of Whisper models can be deployed in a single command by using pre-configured models from [OpenVINO HuggingFace organization](https://huggingface.co/collections/OpenVINO/speech-to-text) and used for both the translations and transcriptions endpoints.
````
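As a quick sanity check before running the benchmark above, a single request can be sent to the same endpoint. A minimal sketch using the `requests` library (not part of this commit), assuming the server is listening on port 8122 as in the example and returns raw WAV audio bytes:

```python
import requests

# One text-to-speech request; the JSON payload mirrors what benchmark.py
# sends per request: a model name and a single input string.
resp = requests.post(
    "http://localhost:8122/v3/audio/speech",
    json={"model": "microsoft/speecht5_tts", "input": "Benchmark warm-up sentence."},
    timeout=300,
)
resp.raise_for_status()

# Save the returned audio for playback (WAV output is assumed here).
with open("sample.wav", "wb") as f:
    f.write(resp.content)
```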
````diff
@@ -208,6 +227,26 @@ print(transcript.text)
 The quick brown fox jumped over the lazy dog.
 ```
 :::
+
+## Benchmarking transcription
+An asynchronous benchmarking client can be used to assess the model server performance under various load conditions. Below are execution examples captured on Intel(R) Core(TM) Ultra 7 258V.
+
+```console
+git clone https://github.com/openvinotoolkit/model_server
+cd model_server/demos/benchmark/v3/
+pip install -r requirements.txt
+python benchmark.py --api_url http://localhost:8000/v3/audio/transcriptions --model openai/whisper-large-v3-turbo --batch_size 1 --limit 1000 --request_rate inf --dataset edinburghcstr/ami --hf-subset ihm --backend speech2text --trust-remote-code True
+Number of documents: 1000
+100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [04:44<00:00,  3.51it/s]
+Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
+Tokens: 10948
+Success rate: 100.0%. (1000/1000)
+Throughput - Tokens per second: 38.5
+Mean latency: 26670.64 ms
+Median latency: 20772.09 ms
+Average document length: 10.948 tokens
+```
+
 ## Translation
 To test the translations endpoint we first need to prepare an audio file with speech in a language other than English, e.g. Spanish. To generate such a sample we will use a fine-tuned version of the microsoft/speecht5_tts model.
 
````
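The transcription benchmark posts each audio sample as a multipart form. A single request can be sketched the same way (again with `requests`, not part of this commit), assuming the speech.wav file generated earlier in this demo and a JSON response with a `text` field, which is what benchmark.py parses:

```python
import requests

# One transcription request; the form fields mirror benchmark.py:
# 'file' carries the WAV bytes, 'model' carries the model name.
with open("speech.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v3/audio/transcriptions",
        files={"file": ("speech.wav", f, "audio/wav")},
        data={"model": "openai/whisper-large-v3-turbo"},
        timeout=300,
    )
resp.raise_for_status()
print(resp.json()["text"])
```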
demos/benchmark/embeddings/requirements.txt

Lines changed: 0 additions & 5 deletions
This file was deleted.

demos/benchmark/embeddings/benchmark_embeddings.py renamed to demos/benchmark/v3/benchmark.py

Lines changed: 150 additions & 8 deletions
````diff
@@ -31,14 +31,19 @@
 from transformers import AutoTokenizer
 import argparse
 import aiohttp
+import io
+import soundfile
 
 AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)
 
 default_url_description = "Default value depends on the backend: \
 ovms-embeddings: http://localhost:8000/v3/embeddings ;\
 ovms_rerank: http://localhost:8000/v3/rerank ;\
 tei_embed: http://localhost:8080/embed ;\
-infinity-embeddings: http://localhost:7997/embeddings"
+infinity-embeddings: http://localhost:7997/embeddings ;\
+text2speech: http://localhost:8000/v3/audio/speech ;\
+speech2text: http://localhost:8000/v3/audio/transcriptions ;\
+translations: http://localhost:8000/v3/audio/translations"
 
 parser = argparse.ArgumentParser(description='Run benchmark for embeddings endpoints', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 parser.add_argument('--dataset', required=False, default='Cohere/wikipedia-22-12-simple-embeddings', help='Dataset for load generation from HF or a keyword "synthetic"', dest='dataset')
@@ -47,8 +52,12 @@
 parser.add_argument('--model', required=False, default='Alibaba-NLP/gte-large-en-v1.5', help='HF model name', dest='model')
 parser.add_argument('--request_rate', required=False, default='inf', help='Average amount of requests per seconds in random distribution', dest='request_rate')
 parser.add_argument('--batch_size', required=False, type=int, default=16, help='Number of strings in every requests', dest='batch_size')
-parser.add_argument('--backend', required=False, default='ovms-embeddings', choices=['ovms-embeddings','tei-embed','infinity-embeddings','ovms_rerank'], help='Backend serving API type', dest='backend')
+parser.add_argument('--backend', required=False, default='ovms-embeddings', choices=['ovms-embeddings','tei-embed','infinity-embeddings','ovms_rerank','text2speech','speech2text', 'translations'], help='Backend serving API type', dest='backend')
 parser.add_argument('--limit', required=False, type=int, default=1000, help='Number of documents to use in testing', dest='limit')
+parser.add_argument('--split', required=False, default='train', help='Dataset split', dest='split')
+parser.add_argument('--hf-subset', required=False, help='Hf dataset subset', dest='subset')
+parser.add_argument('--trust-remote-code', required=False, type=bool, default=False, help='Trust remote code from huggingface', dest='trust_remote_code')
+parser.add_argument('--tokenizer', required=False, help='HF tokenizer, provide if different than model', dest='tokenizer')
 
 args = vars(parser.parse_args())
 
````
````diff
@@ -61,15 +70,22 @@
     for i in range(args["limit"]):
         docs = docs.add_item({"text":dummy_text})
 else:
-    filter = f"train[:{args['limit']}]"
-    docs = load_dataset(args["dataset"], split=filter)
+    filter = f"{args['split']}[:{args['limit']}]"
+    if args["subset"] == None:
+        docs = load_dataset(args["dataset"], trust_remote_code=args['trust_remote_code'], split=filter)
+    else:
+        docs = load_dataset(args["dataset"], args["subset"], trust_remote_code=args['trust_remote_code'], split=filter)
 
 print("Number of documents:",len(docs))
 
 batch_size = args['batch_size']
 
 def count_tokens(docs, model):
-    tokenizer = AutoTokenizer.from_pretrained(model)
+    if args["tokenizer"] == None:
+        hf_tokenizer = model
+    else:
+        hf_tokenizer = args["tokenizer"]
+    tokenizer = AutoTokenizer.from_pretrained(hf_tokenizer)
     documents = docs.iter(batch_size=1)
     num_tokens = 0
     for request in documents:
````
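For illustration, the new dataset flags map onto the datasets API as follows; a standalone sketch reproducing the AMI configuration from the README examples, where `--hf-subset` becomes the second positional argument of `load_dataset` and `--limit` is applied through split slicing:

```python
from datasets import load_dataset

# Equivalent of:
#   --dataset edinburghcstr/ami --hf-subset ihm --split train
#   --limit 100 --trust-remote-code True
docs = load_dataset(
    "edinburghcstr/ami",
    "ihm",
    split="train[:100]",
    trust_remote_code=True,
)
print("Number of documents:", len(docs))
```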
````diff
@@ -89,12 +105,104 @@ class RequestFuncOutput:
     latency: float = 0.0
     tokens_len: int = 0
     error: str = ""
+    text: str = ""
 
 application_json_headers = {
     "Content-Type": "application/json",
     "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
 }
 
+application_multipart_headers = {
+    "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
+}
+
+
+async def async_request_text2speech(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    api_url = request_func_input.api_url
+
+    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT, read_bufsize=100000) as session:
+        payload = {
+            "model": request_func_input.model,
+            "input": request_func_input.documents[0],
+        }
+        headers = application_json_headers
+
+        output = RequestFuncOutput()
+        st = time.perf_counter()
+        try:
+            async with session.post(url=api_url, json=payload,
+                                    headers=headers) as response:
+                if response.status == 200:
+                    async for chunk_bytes in response.content:
+                        if not chunk_bytes:
+                            continue
+                        # uncomment for response debugging
+                        # chunk_bytes = chunk_bytes.decode("utf-8")
+                        # data = json.loads(chunk_bytes)
+                        # TBD: saving response to file
+                        timestamp = time.perf_counter()
+                        output.success = True
+                        output.latency = timestamp - st
+                else:
+                    output.error = response.reason or ""
+                    output.success = False
+                    print("ERROR", response.reason)
+
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+
+    if pbar:
+        pbar.update(1)
+    return output
+
+async def async_request_speech2text(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    api_url = request_func_input.api_url
+
+    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT, read_bufsize=100000) as session:
+        headers = application_multipart_headers
+
+        y, sr = request_func_input.documents[0]["array"], request_func_input.documents[0]["sampling_rate"]
+        buffer = io.BytesIO()
+        soundfile.write(buffer, y, sr, format="WAV")
+        buffer.seek(0)
+
+        form = aiohttp.FormData()
+        form.add_field('file', buffer, content_type='audio/wav')
+        form.add_field('model', request_func_input.model)
+        output = RequestFuncOutput()
+        st = time.perf_counter()
+        try:
+            async with session.post(url=api_url, data=form,
+                                    headers=headers) as response:
+                if response.status == 200:
+                    async for chunk_bytes in response.content:
+                        if not chunk_bytes:
+                            continue
+                        timestamp = time.perf_counter()
+                        output.success = True
+                        output.latency = timestamp - st
+                        output.text = chunk_bytes.decode("utf-8")
+                else:
+                    output.error = response.reason or ""
+                    output.success = False
+
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+
+    if pbar:
+        pbar.update(1)
+    return output
+
 async def async_request_embeddings(
     request_func_input: RequestFuncInput,
     pbar: Optional[tqdm] = None,
````
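In the script these coroutines are scheduled by the benchmark loop further down. In isolation, one could be driven roughly as below; treat this as a hypothetical sketch, since `RequestFuncInput` is defined elsewhere in benchmark.py and its field names are only inferred from the attribute accesses above:

```python
import asyncio

async def main():
    # Hypothetical construction; api_url, model and documents are the
    # attributes the request functions actually read.
    request_input = RequestFuncInput(
        api_url="http://localhost:8000/v3/audio/speech",
        model="microsoft/speecht5_tts",
        documents=["Hello from the benchmark."],  # batch_size=1 is enforced later
    )
    output = await async_request_text2speech(request_input)
    print("success:", output.success, "latency [s]:", output.latency)

asyncio.run(main())
```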
````diff
@@ -231,7 +339,10 @@ async def get_request(
 ) -> AsyncGenerator[List[str], None]:
     documents = documents_all.iter(batch_size=batch_size)
     for request in documents:
-        yield request["text"]
+        if args["backend"] == "speech2text" or args["backend"] == "translations":
+            yield request["audio"]
+        else:
+            yield request["text"]
         if request_rate == float("inf"):
             # If the request rate is infinity, then we don't need to wait.
             continue
````
````diff
@@ -260,11 +371,22 @@ async def limited_request_func(request_func_input, pbar):
     outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks)
     benchmark_duration = time.perf_counter() - benchmark_start_time
     pbar.close()
+    if args["backend"] == "speech2text" or args["backend"] == "translations":
+        if args["tokenizer"] == None:
+            hf_tokenizer = model
+        else:
+            hf_tokenizer = args["tokenizer"]
+        tokenizer = AutoTokenizer.from_pretrained(hf_tokenizer)
+        for output in outputs:
+            data = json.loads(output.text)
+            output.tokens_len = len(tokenizer(data['text'],add_special_tokens=False, truncation=True)["input_ids"])
+
     result = {
         "duration": benchmark_duration,
         "errors": [output.error for output in outputs],
         "latencies": [output.latency for output in outputs],
         "successes": [output.success for output in outputs],
+        "token_count": [output.tokens_len for output in outputs],
     }
     return result
 
````
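The `token_count` entries produced here feed the throughput figure. The counting logic can be reproduced standalone; a small sketch assuming the Whisper tokenizer from the README examples:

```python
from transformers import AutoTokenizer

# Same counting scheme as above: no special tokens, truncation enabled.
tokenizer = AutoTokenizer.from_pretrained("openai/whisper-large-v3-turbo")
transcript = "The quick brown fox jumped over the lazy dog."
num_tokens = len(tokenizer(transcript, add_special_tokens=False, truncation=True)["input_ids"])
print(num_tokens)
```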
````diff
@@ -280,6 +402,24 @@ async def limited_request_func(request_func_input, pbar):
 elif args["backend"] == "infinity-embeddings":
     backend_function = async_request_embeddings
     default_api_url = "http://localhost:7997/embeddings"
+elif args["backend"] == "text2speech":
+    if(batch_size != 1):
+        print("ERROR: Only batch_size=1 supported in audio/speech endpoint")
+        exit()
+    backend_function = async_request_text2speech
+    default_api_url = "http://localhost:8000/v3/audio/speech"
+elif args["backend"] == "speech2text":
+    if(batch_size != 1):
+        print("ERROR: Only batch_size=1 supported in audio/transcriptions endpoint")
+        exit()
+    backend_function = async_request_speech2text
+    default_api_url = "http://localhost:8000/v3/audio/transcriptions"
+elif args["backend"] == "translations":
+    if(batch_size != 1):
+        print("ERROR: Only batch_size=1 supported in audio/translations endpoint")
+        exit()
+    backend_function = async_request_speech2text
+    default_api_url = "http://localhost:8000/v3/audio/translations"
 else:
     print("invalid backend")
     exit()
````
````diff
@@ -288,8 +428,10 @@ async def limited_request_func(request_func_input, pbar):
     args["api_url"] = default_api_url
 
 benchmark_results = asyncio.run(benchmark(docs=docs, model=args["model"], api_url=args["api_url"], request_rate=float(args["request_rate"]), backend_function=backend_function))
-
-num_tokens = count_tokens(docs=docs,model=args["model"])
+if args["backend"] == "speech2text" or args["backend"] == "translations":
+    num_tokens = sum(benchmark_results['token_count'])
+else:
+    num_tokens = count_tokens(docs=docs,model=args["model"])
 #print(benchmark_results)
 print("Tokens:",num_tokens)
 print(f"Success rate: {sum(benchmark_results['successes'])/len(benchmark_results['successes'])*100}%. ({sum(benchmark_results['successes'])}/{len(benchmark_results['successes'])})")
````
demos/benchmark/v3/requirements.txt

Lines changed: 8 additions & 0 deletions
````diff
@@ -0,0 +1,8 @@
+datasets==3.6.0
+dataclasses==0.6
+transformers==4.57.3
+numpy==2.3.5
+tqdm==4.67.1
+sentencepiece==0.2.1
+soundfile==0.13.1
+librosa==0.11.0
````

demos/embeddings/README.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -371,9 +371,9 @@ An asynchronous benchmarking client can be used to access the model server perfo
 ```console
 git clone https://github.com/openvinotoolkit/model_server
 pushd .
-cd model_server/demos/benchmark/embeddings/
+cd model_server/demos/benchmark/v3/
 pip install -r requirements.txt
-python benchmark_embeddings.py --api_url http://localhost:8000/v3/embeddings --dataset synthetic --synthetic_length 5 --request_rate 10 --batch_size 1 --model BAAI/bge-large-en-v1.5
+python benchmark.py --api_url http://localhost:8000/v3/embeddings --dataset synthetic --synthetic_length 5 --request_rate 10 --batch_size 1 --model BAAI/bge-large-en-v1.5
 Number of documents: 1000
 100%|████████████████████████████████████████████████████████████████| 1000/1000 [01:44<00:00,  9.56it/s]
 Tokens: 5000
@@ -384,7 +384,7 @@ Median latency: 13.97 ms
 Average document length: 5.0 tokens
 
 
-python benchmark_embeddings.py --api_url http://localhost:8000/v3/embeddings --request_rate inf --batch_size 32 --dataset synthetic --synthetic_length 510 --model BAAI/bge-large-en-v1.5
+python benchmark.py --api_url http://localhost:8000/v3/embeddings --request_rate inf --batch_size 32 --dataset synthetic --synthetic_length 510 --model BAAI/bge-large-en-v1.5
 Number of documents: 1000
 100%|████████████████████████████████████████████████████████████████| 32/32 [00:17<00:00,  1.82it/s]
 Tokens: 510000
@@ -395,7 +395,7 @@ Median latency: 9905.79 ms
 Average document length: 510.0 tokens
 
 
-python benchmark_embeddings.py --api_url http://localhost:8000/v3/embeddings --request_rate inf --batch_size 1 --dataset Cohere/wikipedia-22-12-simple-embeddings --model BAAI/bge-large-en-v1.5
+python benchmark.py --api_url http://localhost:8000/v3/embeddings --request_rate inf --batch_size 1 --dataset Cohere/wikipedia-22-12-simple-embeddings --model BAAI/bge-large-en-v1.5
 Number of documents: 1000
 100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:15<00:00, 64.02it/s]
 Tokens: 83208
````

demos/rerank/README.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -212,9 +212,9 @@ OVMS reranking: [0.9968273 0.0913821]
 
 An asynchronous benchmarking client can be used to assess the model server performance under various load conditions. Below are execution examples captured on dual Intel(R) Xeon(R) CPU Max 9480.
 ```bash
-cd model_server/demos/benchmark/embeddings/
+cd model_server/demos/benchmark/v3/
 pip install -r requirements.txt
-python benchmark_embeddings.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
+python benchmark.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
 Number of documents: 1000
 100%|██████████████████████████████████████| 50/50 [00:19<00:00,  2.53it/s]
 Tokens: 501000
@@ -224,7 +224,7 @@ Mean latency: 10268 ms
 Median latency: 10249 ms
 Average document length: 501.0 tokens
 
-python benchmark_embeddings.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
+python benchmark.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
 Number of documents: 1000
 100%|██████████████████████████████████████| 50/50 [00:19<00:00,  2.53it/s]
 Tokens: 501000
@@ -234,7 +234,7 @@ Mean latency: 10268 ms
 Median latency: 10249 ms
 Average document length: 501.0 tokens
 
-python benchmark_embeddings.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset Cohere/wikipedia-22-12-simple-embeddings --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
+python benchmark.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset Cohere/wikipedia-22-12-simple-embeddings --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
 Number of documents: 1000
 100%|██████████████████████████████████████| 50/50 [00:09<00:00,  5.55it/s]
 Tokens: 92248
````
