Commit 448f3d3

tests : add script to benchmark whisper.cpp on LibriSpeech corpus (#2999)
* tests : add script to benchmark whisper.cpp on LibriSpeech corpus

  LibriSpeech is a widely-used benchmark dataset for training and
  testing speech recognition models.

  This adds a set of scripts to measure the recognition accuracy of
  whisper.cpp models, following the common benchmark standards.

  Signed-off-by: Fujimoto Seiji <[email protected]>

* Document how to prepare `whisper-cli` and model files

  Feedback from Daniel Bevenius.

  This adds a short code example how to prepare the `whisper-cli`
  command, to make the initial setup step a little bit clearer.

  Signed-off-by: Fujimoto Seiji <[email protected]>

* tests : Simplify how to set up Python environment

  Based on a feedback from Georgi Gerganov.

  Instead of setting up a virtual environment in Makefile, let users
  set up the Python environment. This is better since users may have
  their own preferred workflow/toolkit.

  Signed-off-by: Fujimoto Seiji <[email protected]>

---------

Signed-off-by: Fujimoto Seiji <[email protected]>
1 parent e6234cd commit 448f3d3

11 files changed: +2571 -0 lines changed

tests/librispeech/.gitignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
__pycache__
*.tar.gz
*.txt
eval.conf
venv
LibriSpeech

tests/librispeech/Makefile

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
TAR_URL = https://www.openslr.org/resources/12/test-clean.tar.gz
2+
3+
all: eval
4+
5+
eval:
6+
$(MAKE) -f eval.mk
7+
8+
clean:
9+
$(MAKE) -f eval.mk clean
10+
11+
get-audio:
12+
wget -c $(TAR_URL)
13+
tar -xf test-clean.tar.gz
14+
15+
.PHONY: all eval clean setup-venv clean-venv get-audio

tests/librispeech/README.md

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
# whisper.cpp/tests/librispeech
2+
3+
[LibriSpeech](https://www.openslr.org/12) is a standard dataset for
4+
training and evaluating automatic speech recognition systems.
5+
6+
This directory contains a set of tools to evaluate the recognition
7+
performance of whisper.cpp on LibriSpeech corpus.
8+
9+
## Quick Start
10+
11+
1. (Pre-requirement) Compile `whisper-cli` and prepare the Whisper
12+
model in `ggml` format.
13+
14+
```
15+
$ # Execute the commands below in the project root dir.
16+
$ cmake -B build
17+
$ cmake --build build --config Release
18+
$ ./models/download-ggml-model.sh tiny
19+
```
20+
21+
Consult [whisper.cpp/README.md](../../README.md) for more details.
22+
23+
2. Download the audio files from LibriSpeech project.
24+
25+
```
26+
$ make get-audio
27+
```
28+
29+
3. Set up the environment to compute WER score.
30+
31+
```
32+
$ pip install -r requirements.txt
33+
```
34+
35+
For example, if you use `virtualenv`, you can set up it as follows:
36+
37+
```
38+
$ python3 -m venv venv
39+
$ . venv/bin/activate
40+
$ pip install -r requirements.txt
41+
```
42+
43+
4. Run the benchmark test.
44+
45+
```
46+
$ make
47+
```
48+
49+
## How-to guides
50+
51+
### How to change the inferece parameters
52+
53+
Create `eval.conf` and override variables.
54+
55+
```
56+
WHISPER_MODEL = large-v3-turbo
57+
WHISPER_FLAGS = --no-prints --threads 8 --language en --output-txt
58+
```
59+
60+
Check out `eval.mk` for more details.

tests/librispeech/eval.mk

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
PYTHON = python
2+
3+
WHISPER_PREFIX = ../../
4+
WHISPER_MODEL = tiny
5+
6+
WHISPER_CLI = $(WHISPER_PREFIX)build/bin/whisper-cli
7+
WHISPER_FLAGS = --no-prints --language en --output-txt
8+
9+
# You can create eval.conf to override the WHISPER_* variables
10+
# defined above.
11+
-include eval.conf
12+
13+
# This follows the file structure of the LibriSpeech project.
14+
AUDIO_SRCS = $(sort $(wildcard LibriSpeech/*/*/*/*.flac))
15+
TRANS_TXTS = $(addsuffix .txt, $(AUDIO_SRCS))
16+
17+
# We output the evaluation result to this file.
18+
DONE = $(WHISPER_MODEL).txt
19+
20+
all: $(DONE)
21+
22+
$(DONE): $(TRANS_TXTS)
23+
$(PYTHON) eval.py > $@.tmp
24+
mv $@.tmp $@
25+
26+
# Note: This task writes to a temporary file first to
27+
# create the target file atomically.
28+
%.flac.txt: %.flac
29+
$(WHISPER_CLI) $(WHISPER_FLAGS) --model $(WHISPER_PREFIX)models/ggml-$(WHISPER_MODEL).bin --file $^ --output-file $^.tmp
30+
mv $^.tmp.txt $^.txt
31+
32+
archive:
33+
tar -czf $(WHISPER_MODEL).tar.gz --exclude="*.flac" LibriSpeech $(DONE)
34+
35+
clean:
36+
@rm -f $(TRANS_TXTS)
37+
@rm -f $(DONE)
38+
39+
.PHONY: all clean
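
The write-to-a-temporary-file-then-rename pattern used by the two rules above can be sketched in Python as follows. This is an illustration only and not part of the commit: `write_atomically` and the file names are hypothetical. It assumes a POSIX system, where `os.replace` performs the rename atomically when both paths are on the same filesystem.

```python
import os
import tempfile


def write_atomically(path: str, text: str) -> None:
    """Write text to path so that the target never exists half-written."""
    tmp_path = path + ".tmp"  # hypothetical suffix, mirroring the .tmp files in eval.mk
    with open(tmp_path, "w", encoding="utf-8") as fp:
        fp.write(text)
    # The rename happens only after the write succeeded, so an interrupted
    # run leaves at most a stray .tmp file, never a truncated target.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "example.txt")
        write_atomically(target, "example\n")
        with open(target, encoding="utf-8") as fp:
            print(fp.read(), end="")
```

The benefit for a Makefile is that `make` decides what to rebuild from whether the target file exists and how old it is. If the transcription is interrupted, the target is simply missing and the next run redoes that file instead of evaluating a partial transcript.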

tests/librispeech/eval.py

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
import os
2+
import glob
3+
import jiwer
4+
from normalizers import EnglishTextNormalizer
5+
6+
def get_reference():
7+
ref = {}
8+
for path in glob.glob('LibriSpeech/*/*/*/*.trans.txt'):
9+
with open(path) as fp:
10+
for line in fp:
11+
code, text = line.strip().split(" ", maxsplit=1)
12+
ref [code] = text
13+
return ref
14+
15+
def get_hypothesis():
16+
hyp = {}
17+
for path in glob.glob('LibriSpeech/*/*/*/*.flac.txt'):
18+
with open(path) as fp:
19+
text = fp.read().strip()
20+
code = os.path.basename(path).replace('.flac.txt', '')
21+
hyp[code] = text
22+
return hyp
23+
24+
def get_codes():
25+
codes = []
26+
for path in glob.glob('LibriSpeech/*/*/*/*.flac'):
27+
codes.append(os.path.basename(path).replace('.flac', ''))
28+
return sorted(codes)
29+
30+
def main():
31+
normalizer = EnglishTextNormalizer()
32+
33+
ref_orig = get_reference()
34+
hyp_orig = get_hypothesis()
35+
36+
ref_clean = []
37+
hyp_clean = []
38+
39+
for code in get_codes():
40+
ref_clean.append(normalizer(ref_orig[code]))
41+
hyp_clean.append(normalizer(hyp_orig[code]))
42+
43+
wer = jiwer.wer(ref_clean, hyp_clean)
44+
print(f"WER: {wer * 100:.2f}%")
45+
46+
if __name__ == '__main__':
47+
main()
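
For readers unfamiliar with the metric reported by `eval.py`: WER (word error rate) is the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The sketch below computes it with a plain dynamic-programming loop. It is an illustration only, not jiwer's implementation. `edit_distance` and `word_error_rate` are hypothetical helper names, the sentences are made up, and pooling errors over all utterances is an assumption about how a corpus-level score is usually formed.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(refs: list[str], hyps: list[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(refs, hyps):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    return errors / words


if __name__ == "__main__":
    # Made-up sentences: one deleted word and one substituted word.
    refs = ["the cat sat on the mat", "hello world"]
    hyps = ["the cat sat on mat", "hello word"]
    print(f"WER: {word_error_rate(refs, hyps) * 100:.2f}%")  # prints "WER: 25.00%"
```

Two errors over eight reference words give 25%. Because the score counts exact word matches, both sides are passed through the same text normalizer first, so that differences in casing or punctuation are not counted as recognition errors.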

tests/librispeech/normalizers/LICENSE

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
Code in this directory is adapted from OpenAI Whisper project
2+
(https://github.com/openai/whisper) and carries the following
3+
copyright and license.
4+
5+
MIT License
6+
7+
Copyright (c) 2022 OpenAI
8+
9+
Permission is hereby granted, free of charge, to any person obtaining a copy
10+
of this software and associated documentation files (the "Software"), to deal
11+
in the Software without restriction, including without limitation the rights
12+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
13+
copies of the Software, and to permit persons to whom the Software is
14+
furnished to do so, subject to the following conditions:
15+
16+
The above copyright notice and this permission notice shall be included in all
17+
copies or substantial portions of the Software.
18+
19+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
20+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
21+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
22+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
23+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
24+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
25+
SOFTWARE.

tests/librispeech/normalizers/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
from .basic import BasicTextNormalizer as BasicTextNormalizer
from .english import EnglishTextNormalizer as EnglishTextNormalizer

tests/librispeech/normalizers/basic.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
import re
import unicodedata

import regex

# non-ASCII letters that are not separated by "NFKD" normalization
ADDITIONAL_DIACRITICS = {
    "œ": "oe",
    "Œ": "OE",
    "ø": "o",
    "Ø": "O",
    "æ": "ae",
    "Æ": "AE",
    "ß": "ss",
    "ẞ": "SS",
    "đ": "d",
    "Đ": "D",
    "ð": "d",
    "Ð": "D",
    "þ": "th",
    "Þ": "th",
    "ł": "l",
    "Ł": "L",
}


def remove_symbols_and_diacritics(s: str, keep=""):
    """
    Replace any other markers, symbols, and punctuations with a space,
    and drop any diacritics (category 'Mn' and some manual mappings)
    """
    return "".join(
        (
            c
            if c in keep
            else (
                ADDITIONAL_DIACRITICS[c]
                if c in ADDITIONAL_DIACRITICS
                else (
                    ""
                    if unicodedata.category(c) == "Mn"
                    else " " if unicodedata.category(c)[0] in "MSP" else c
                )
            )
        )
        for c in unicodedata.normalize("NFKD", s)
    )


def remove_symbols(s: str):
    """
    Replace any other markers, symbols, punctuations with a space, keeping diacritics
    """
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


class BasicTextNormalizer:
    def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
        self.clean = (
            remove_symbols_and_diacritics if remove_diacritics else remove_symbols
        )
        self.split_letters = split_letters

    def __call__(self, s: str):
        s = s.lower()
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parentheses
        s = self.clean(s).lower()

        if self.split_letters:
            s = " ".join(regex.findall(r"\X", s, regex.U))

        s = re.sub(
            r"\s+", " ", s
        )  # replace any successive whitespace characters with a space

        return s
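
To make the effect of the default path concrete (`remove_diacritics=False`, `split_letters=False`), here is a condensed, standard-library-only restatement of the steps in `BasicTextNormalizer.__call__`. It is a sketch for illustration, not a replacement for the class: `basic_normalize` is a hypothetical name, the `split_letters` branch (which needs the third-party `regex` module) is left out, and the input string is made up.

```python
import re
import unicodedata


def basic_normalize(s: str) -> str:
    """Condensed restatement of BasicTextNormalizer's default path."""
    s = s.lower()
    s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # drop <...> and [...] spans
    s = re.sub(r"\(([^)]+?)\)", "", s)       # drop (...) spans
    # Characters in the Unicode categories M* (marks), S* (symbols)
    # and P* (punctuation) are replaced with a space.
    s = "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )
    return re.sub(r"\s+", " ", s)  # collapse runs of whitespace


if __name__ == "__main__":
    # prints 'hello world ' (note the trailing space)
    print(repr(basic_normalize("Hello, World! [laughs] (aside)")))
```

Nothing in this path strips the string, so the result can keep a single leading or trailing space where punctuation or a bracketed span used to be.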
