The ASR module is designed to work with Conformer family models.
We provide a Jupyter notebook for ONNX export with the required signatures here: examples/conformer_export_onnx.ipynb.
General ASR model signature: `(1, T_frames, V)`, where:
- `T_frames` is the number of temporal frames after Conformer downsampling
- `V` is the token vocabulary size (`V = 129` for NeMo Conformer-large)
We reference an open-source version of Conformer fine-tuned on the TORGO dataset of dysarthric speech: https://huggingface.co/miosipov/conformer-ctc-DASR-en-TORGO.
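Conformer-CTC models with this output shape emit per-frame log-probabilities, which can be collapsed into a token sequence with a standard CTC greedy decode. A minimal sketch, assuming the blank token sits at the last vocabulary index (the NeMo convention); the function name is ours:

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank_id=None):
    """Collapse a (1, T_frames, V) CTC log-probability matrix to token IDs."""
    probs = log_probs[0]                 # (T_frames, V)
    if blank_id is None:
        blank_id = probs.shape[-1] - 1   # assumption: blank is the last index
    ids = np.argmax(probs, axis=-1)      # best token per frame
    out, prev = [], None
    for i in ids:
        # CTC rule: drop repeats, then drop blanks
        if i != prev and i != blank_id:
            out.append(int(i))
        prev = i
    return out
```

The resulting IDs can then be detokenized with the checkpoint's own tokenizer.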
Phonemic ASR model signature: `(1, T_frames, P)`, where:
- `T_frames` is the number of temporal frames after Conformer downsampling
- `P` is the phoneme vocabulary size (`P = 50` in our experiments)
We reference an open-source version of Conformer fine-tuned on the TEDLIUM dataset for phonemic ASR: https://huggingface.co/miosipov/conformer-ctc-phonemic-en-tedlium.
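The phonemic model's output decodes the same way as the general model's, except the resulting IDs index a phoneme inventory rather than a subword vocabulary. A small sketch of that final ID-to-symbol step; the inventory below is a made-up fragment, and the real `P = 50` inventory ships with the checkpoint:

```python
# Hypothetical subset of a phoneme inventory; the actual ordering is
# defined by the fine-tuned checkpoint, not by this sketch.
PHONEMES = ["AA", "AE", "AH", "B", "D", "IY", "K", "T"]

def ids_to_phonemes(ids, inventory=PHONEMES):
    """Map CTC-decoded phoneme IDs to phoneme symbols."""
    return [inventory[i] for i in ids]
```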
The dysfluency detection module expects an ONNX model with the signature `(1, C, T)`, where:
- `T` is the (variable) number of time frames after log-Mel spectrogram calculation
- `C = 4` is the number of classes, in the following order: `fluent`, `clonic`, `tonic`, `silence`
The fluency module combines the silence class predicted by the detector with an on-the-fly, energy-based silence estimate.
We do not provide pre-trained weights for this module.
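One plausible way to combine the two silence sources above is to take the detector's per-frame argmax over the four classes and override low-energy frames as silence. A minimal sketch, not the module's actual implementation; the dB threshold, frame layout, and function names are assumptions:

```python
import numpy as np

# Class order expected by the detector, as documented above.
CLASSES = ["fluent", "clonic", "tonic", "silence"]

def frame_energy_db(frames):
    """Per-frame log energy in dB. frames: (T, samples_per_frame)."""
    energy = np.sum(frames ** 2, axis=1)
    return 10.0 * np.log10(energy + 1e-10)  # epsilon avoids log(0)

def decode_dysfluency(logits, frames, silence_db=-40.0):
    """logits: (1, 4, T) detector output; frames: (T, n) raw audio frames."""
    labels = np.argmax(logits[0], axis=0)            # per-frame class index
    low_energy = frame_energy_db(frames) < silence_db
    labels[low_energy] = CLASSES.index("silence")    # energy-based override
    return [CLASSES[i] for i in labels]
```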
Our framework supports phrase completion triggered by dysfluency detection. In our experiments, we used DistilGPT2: https://huggingface.co/distilbert/distilgpt2.
We provide a Jupyter notebook with instructions for converting this model to ONNX format and adding the auxiliary dysfluency tokens `<CLONIC>` and `<TONIC>`.
We do not provide pre-trained weights for this module.
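As an illustration of how the auxiliary tokens might enter the LM prompt, the sketch below splices `<CLONIC>`/`<TONIC>` markers into the recognized word stream before requesting a completion. The function and event format are our assumptions, not the framework's API:

```python
# Hypothetical prompt builder: insert a dysfluency marker token before the
# word at which the detector fired, so the LM can condition on the event.
TOKEN_MAP = {"clonic": "<CLONIC>", "tonic": "<TONIC>"}

def build_prompt(words, events):
    """words: recognized words; events: list of (word_index, label) pairs."""
    out = []
    for i, word in enumerate(words):
        for idx, label in events:
            if idx == i:
                out.append(TOKEN_MAP[label])
        out.append(word)
    return " ".join(out)
```

The resulting string would then be tokenized (with the two added tokens) and fed to the exported DistilGPT2 for phrase completion.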