Commit 9289df2

[readme] update for rtfx (#36)
* update readme
* fix
* fix fix
* fix nemo note
* same hps
1 parent a189bb9 commit 9289df2

README.md

Lines changed: 199 additions & 17 deletions

# Open ASR Leaderboard

This repository contains the code for the Open ASR Leaderboard. The leaderboard is a Gradio Space that allows users to compare the accuracy of ASR models on a variety of datasets. The leaderboard is hosted at [hf-audio/open_asr_leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

# Requirements

Each library has its own set of requirements. We recommend using a clean conda environment.

1) Clone this repository.
2) Install PyTorch by following the instructions here: https://pytorch.org/get-started/locally/
3) Install the common requirements for all libraries by running `pip install -r requirements/requirements.txt`.
4) Install the requirements for each library you wish to evaluate by running `pip install -r requirements/requirements_<library_name>.txt`.
5) Connect your Hugging Face account by running `huggingface-cli login`.

**Note:** If you wish to run NeMo, the benchmark currently needs CUDA 12.6 to fix a problem in previous drivers for RNN-T inference with cooperative kernels inside conditional nodes (see here: https://github.com/NVIDIA/NeMo/pull/9869). Running `nvidia-smi` should output "CUDA Version: 12.6" or higher.
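Putting the steps together, a setup for evaluating the `transformers` library might look like the sketch below. The clone URL and the `transformers` requirements file are illustrative examples; install PyTorch first following the official instructions linked above.

```bash
# 1) + 3) Clone the repository and install the shared requirements
git clone https://github.com/huggingface/open_asr_leaderboard.git   # assumed repository URL
cd open_asr_leaderboard
pip install -r requirements/requirements.txt

# 4) Install the per-library requirements (here: Transformers, as an example)
pip install -r requirements/requirements_transformers.txt

# 5) Authenticate with the Hugging Face Hub
huggingface-cli login
```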

# Evaluate a model

Each library has a script `run_eval.py` that acts as the entry point for evaluating a model. The script is run by the corresponding bash script for each model being evaluated. After completion, it outputs a JSONL file containing the model's predictions on each dataset and summarizes the Word Error Rate (WER) and Inverse Real-Time Factor (RTFx) of the model on each dataset.

To reproduce existing results:

1) Change directory into the library you wish to evaluate. For example, `cd transformers`.
2) Run the bash script for the model you wish to evaluate. For example, `bash run_wav2vec2.sh`.

**Note**: All evaluations were run using an NVIDIA A100-SXM4-80GB GPU, with NVIDIA driver 560.28.03, CUDA 12.6, and PyTorch 2.4.0. You should ensure you use the same configuration when submitting results. If you are unable to create an equivalent machine, please ask one of the maintainers to run your scripts for evaluation!
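For example, reproducing the Transformers Wav2Vec2 results on the reference hardware reduces to:

```bash
cd transformers          # enter the library directory
bash run_wav2vec2.sh     # runs run_eval.py and prints WER / RTFx per dataset
```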

# Add a new library

To add a new library for evaluation in this benchmark, please follow the steps below:

1) Fork this repository and create a new branch.
2) Create a new directory for your library. For example, `mkdir transformers`.
3) Copy the template `run_eval.py` script below into your new directory. The script should be updated for the new library by making two modifications; otherwise, please try to keep the structure of the script the same as in the template. In particular, the data loading, evaluation and manifest writing must be done in the same way as in the other libraries for consistency.
    1) Update the model loading logic in the `main` function.
    2) Update the inference logic in the `benchmark` function.

<details>

<summary> Template script for Transformers: </summary>

```python
import argparse
import os
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import evaluate
from normalizer import data_utils
import time
from tqdm import tqdm

wer_metric = evaluate.load("wer")

def main(args):
    # Load model (FILL ME!)
    model = WhisperForConditionalGeneration.from_pretrained(args.model_id, torch_dtype=torch.bfloat16).to(args.device)
    processor = WhisperProcessor.from_pretrained(args.model_id)

    def benchmark(batch):
        # Load audio inputs
        audios = [audio["array"] for audio in batch["audio"]]
        batch["audio_length_s"] = [len(audio) / batch["audio"][0]["sampling_rate"] for audio in audios]
        minibatch_size = len(audios)

        # Start timing
        start_time = time.time()

        # INFERENCE (FILL ME! Replacing 1-3 with steps from your library)
        # 1. Pre-processing
        inputs = processor(audios, sampling_rate=16_000, return_tensors="pt").to(args.device)
        inputs["input_features"] = inputs["input_features"].to(torch.bfloat16)
        # 2. Generation
        pred_ids = model.generate(**inputs)
        # 3. Post-processing
        pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)

        # End timing
        runtime = time.time() - start_time

        # normalize by minibatch size since we want the per-sample time
        batch["transcription_time_s"] = minibatch_size * [runtime / minibatch_size]

        # normalize transcriptions with English normalizer
        batch["predictions"] = [data_utils.normalizer(pred) for pred in pred_text]
        batch["references"] = batch["norm_text"]
        return batch

    if args.warmup_steps is not None:
        warmup_dataset = data_utils.load_data(args)
        warmup_dataset = data_utils.prepare_data(warmup_dataset)

        num_warmup_samples = args.warmup_steps * args.batch_size
        if args.streaming:
            warmup_dataset = warmup_dataset.take(num_warmup_samples)
        else:
            warmup_dataset = warmup_dataset.select(range(min(num_warmup_samples, len(warmup_dataset))))
        warmup_dataset = iter(warmup_dataset.map(benchmark, batch_size=args.batch_size, batched=True))

        for _ in tqdm(warmup_dataset, desc="Warming up..."):
            continue

    dataset = data_utils.load_data(args)
    dataset = data_utils.prepare_data(dataset)

    if args.max_eval_samples is not None and args.max_eval_samples > 0:
        print(f"Subsampling dataset to first {args.max_eval_samples} samples!")
        if args.streaming:
            dataset = dataset.take(args.max_eval_samples)
        else:
            dataset = dataset.select(range(min(args.max_eval_samples, len(dataset))))

    dataset = dataset.map(
        benchmark, batch_size=args.batch_size, batched=True, remove_columns=["audio"],
    )

    all_results = {
        "audio_length_s": [],
        "transcription_time_s": [],
        "predictions": [],
        "references": [],
    }
    result_iter = iter(dataset)
    for result in tqdm(result_iter, desc="Samples..."):
        for key in all_results:
            all_results[key].append(result[key])

    # Write manifest results (WER and RTFX)
    manifest_path = data_utils.write_manifest(
        all_results["references"],
        all_results["predictions"],
        args.model_id,
        args.dataset_path,
        args.dataset,
        args.split,
        audio_length=all_results["audio_length_s"],
        transcription_time=all_results["transcription_time_s"],
    )
    print("Results saved at path:", os.path.abspath(manifest_path))

    wer = wer_metric.compute(
        references=all_results["references"], predictions=all_results["predictions"]
    )
    wer = round(100 * wer, 2)
    rtfx = round(sum(all_results["audio_length_s"]) / sum(all_results["transcription_time_s"]), 2)
    print("WER:", wer, "%", "RTFx:", rtfx)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--model_id",
        type=str,
        required=True,
        help="Model identifier. Should be loadable with 🤗 Transformers",
    )
    parser.add_argument(
        "--dataset_path",
        type=str,
        default="esb/datasets",
        help="Dataset path. By default, it is `esb/datasets`",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        required=True,
        help="Dataset name. *E.g.* `'librispeech_asr'` for the LibriSpeech ASR dataset, or `'common_voice'` for Common Voice. The full list of dataset names "
        "can be found at `https://huggingface.co/datasets/esb/datasets`",
    )
    parser.add_argument(
        "--split",
        type=str,
        default="test",
        help="Split of the dataset. *E.g.* `'validation'` for the dev split, or `'test'` for the test split.",
    )
    parser.add_argument(
        "--device",
        type=int,
        default=-1,
        help="The device to run the pipeline on. -1 for CPU (default), 0 for the first GPU and so on.",
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        default=1,
        help="Number of samples to go through each streamed batch.",
    )
    parser.add_argument(
        "--max_eval_samples",
        type=int,
        default=None,
        help="Number of samples to be evaluated. Put a lower number e.g. 64 for testing this script.",
    )
    parser.add_argument(
        "--no-streaming",
        dest="streaming",
        action="store_false",
        help="Choose whether you'd like to download the entire dataset or stream it during the evaluation.",
    )
    parser.add_argument(
        "--warmup_steps",
        type=int,
        default=10,
        help="Number of warm-up steps to run before launching the timed runs.",
    )
    args = parser.parse_args()
    parser.set_defaults(streaming=False)

    main(args)

```

</details>

4) Create one bash file per model type following the convention `run_<model_type>.sh` (a sketch is provided after this list).
    - The bash script should follow the same steps as the other libraries. You can copy the example [run_whisper.sh](./transformers/run_whisper.sh) and adapt it to your library.
    - Different model sizes of the same type should share the script. For example, `Wav2Vec` and `Wav2Vec2` would be two separate scripts, but different sizes of `Wav2Vec2` would be part of the same script.
    - **Important:** for a given model, you can tune decoding hyper-parameters to maximize benchmark performance (e.g. batch size, beam size, etc.). However, you must use the **same decoding hyper-parameters** for each dataset in the benchmark. For more details, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
5) Submit a PR for your changes.
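As a rough illustration (not the actual contents of [run_whisper.sh](./transformers/run_whisper.sh)), a `run_<model_type>.sh` script can loop over checkpoints and datasets and call `run_eval.py` with the arguments defined in the template above. The model identifiers, dataset names, and batch size below are placeholders:

```bash
#!/usr/bin/env bash
# Illustrative sketch of a run_<model_type>.sh script. Adapt the model IDs,
# datasets, and batch size to your library, and use the same decoding
# hyper-parameters for every dataset.

MODEL_IDs=("openai/whisper-tiny.en" "openai/whisper-small.en")   # placeholder checkpoints
BATCH_SIZE=16                                                    # placeholder value

for MODEL_ID in "${MODEL_IDs[@]}"; do
    for DATASET in "librispeech_asr" "common_voice"; do          # placeholder datasets
        python run_eval.py \
            --model_id "${MODEL_ID}" \
            --dataset_path "esb/datasets" \
            --dataset "${DATASET}" \
            --split "test" \
            --device 0 \
            --batch_size "${BATCH_SIZE}"
    done
done
```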

# Add a new model

To add a model from a new library for evaluation in this benchmark, you can follow the steps noted above.

To add a model from an existing library, the steps simplify to:

1) If the model is already supported, but of a different size, simply add the new model size to the list of models run by the corresponding bash script (see the snippet after this list).
2) If the model is entirely new, create a new bash script based on the others for that library and add the new model and its sizes to that script.
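For instance, if the bash script keeps its checkpoints in an array as in the sketch above, step 1 is typically a one-line change (the added checkpoint name is a placeholder):

```bash
# Append the new model size to the existing list; the rest of the script is unchanged.
MODEL_IDs=("openai/whisper-tiny.en" "openai/whisper-small.en" "openai/whisper-medium.en")
```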
