@galv galv commented Jun 12, 2024

Setting length_penalty to a negative score is helpful for CTC models, since they are often biased towards shorter paths through the WFST graph (shorter paths generally have smaller costs).

However, a side effect of applying the length penalty this way is that a phrase like "no one cares" can come out as "no one caress", because "caress" has a longer WFST path than "cares".

Applying the penalty only when olabel != 0 (epsilon) can help work around this issue, while still preserving some of the benefits from length_penalty.

Note that this word_length_penalty is applied in both the emitting and non-emitting ExpandArcs, while length_penalty is applied only in the emitting ExpandArcs. I believe this is the proper way to do things.
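To illustrate why the word-level variant avoids the "caress" failure mode, here is a Python sketch of the cost update (the real decoder is CUDA; the function name, base costs, and word id below are illustrative assumptions, not the actual implementation):

```python
EPSILON = 0  # output label 0 denotes epsilon (no word emitted)

def expanded_arc_cost(base_cost, olabel, emitting,
                      length_penalty=0.0, word_length_penalty=0.0):
    """Cost of one expanded arc under both penalties (illustrative).

    length_penalty accrues once per emitting arc, so it scales with the
    number of frames consumed (token-level path length).
    word_length_penalty accrues only when a word is output
    (olabel != epsilon), in both emitting and non-emitting expansion,
    so it scales with the number of output words instead.
    """
    cost = base_cost
    if emitting:
        cost += length_penalty
    if olabel != EPSILON:
        cost += word_length_penalty
    return cost

WORD = 42  # made-up word id; any non-epsilon output label

# "caress" consumes one more emitting, epsilon-output arc than "cares".
cares = sum(expanded_arc_cost(1.0, ol, True, length_penalty=-5.0)
            for ol in [EPSILON] * 4 + [WORD])
caress = sum(expanded_arc_cost(1.0, ol, True, length_penalty=-5.0)
             for ol in [EPSILON] * 5 + [WORD])
assert caress < cares  # token-level penalty rewards the longer spelling

# With word_length_penalty both paths emit exactly one word, so the
# extra arc only costs its base weight and the spelling bias disappears.
cares_w = sum(expanded_arc_cost(1.0, ol, True, word_length_penalty=-10.0)
              for ol in [EPSILON] * 4 + [WORD])
caress_w = sum(expanded_arc_cost(1.0, ol, True, word_length_penalty=-10.0)
               for ol in [EPSILON] * 5 + [WORD])
assert caress_w > cares_w
```

Under the token-level penalty every extra emitting arc accumulates more negative cost, while the word-level penalty charges each hypothesis once per output word.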

Here are some experiments from running the test test_sub_ins_del:

Model: stt_en_conformer_ctc_small
Dataset: LibriSpeech test-clean

| Topology | Penalty | Best value | WER | ins | sub | del |
|---|---|---|---|---|---|---|
| vanilla "ctc" | length_penalty | -5.0 | 0.04530584297017651 | 369 | 1650 | 363 |
| vanilla "ctc" | word_length_penalty | -10.0 | 0.045058581862446746 | 375 | 1608 | 386 |
| compact "ctc" | length_penalty | -9.5 | 0.045058581862446746 | 375 | 1608 | 386 |
| compact "ctc" | word_length_penalty | -10.0 | 0.04309951308581862 | 302 | 1572 | 392 |

The best result comes from using the compact CTC topology with word_length_penalty=-10.0.

It makes sense that a more negative length penalty is required to minimize WER for the compact CTC topology; it has fewer self-loops.
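As a sanity check, the reported ratios are internally consistent with WER = (ins + sub + del) / reference words; every distinct run implies the same reference word count (about 52,576, derived here from the reported tuples rather than measured independently):

```python
# Consistency check on the reported numbers: WER = (ins + sub + del) / N,
# so each distinct (wer, ins, sub, del) tuple implies the same N.
runs = [
    (0.04530584297017651, 369, 1650, 363),
    (0.045058581862446746, 375, 1608, 386),
    (0.04309951308581862, 302, 1572, 392),
]
for wer_ratio, ins, sub, dels in runs:
    n_ref = (ins + sub + dels) / wer_ratio
    assert abs(n_ref - 52576) < 0.5  # implied test-clean reference word count
```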

Insertion, Substitution, and Deletion statistics were obtained by applying this diff:

```diff
modified   src/riva/asrlib/decoder/test_graph_construction.py
@@ -963,6 +963,8 @@ class TestGraphConstruction:
         references = [s.lower() for s in references]
         # Might want to try a different WER implementation, for sanity.
         my_wer = wer(references, predictions)
+        wer_ratio, ins, sub, deletions = my_wer
+        print(f"GALVEZ:wer={wer_ratio}, ins={ins}, sub={sub}, del={deletions}")
         other_wer = word_error_rate(references, predictions)
         print("beam search WER:", my_wer)
         print("other beam search WER:", other_wer)
```
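For context, the diff unpacks my_wer as a (ratio, ins, sub, del) tuple. A minimal edit-distance sketch with that return shape (hypothetical; not the wer implementation actually used in the repo) could look like:

```python
def wer(references, predictions):
    """Corpus WER as a (ratio, insertions, substitutions, deletions) tuple.

    Minimal Levenshtein-alignment sketch matching the tuple shape the
    diff above unpacks; this is NOT the implementation used in the repo.
    """
    total_ref = ins = sub = dels = 0
    for ref, hyp in zip(references, predictions):
        r, h = ref.split(), hyp.split()
        total_ref += len(r)
        # dp[i][j] = (distance, ins, sub, del) aligning r[:i] with h[:j]
        dp = [[(j, j, 0, 0) for j in range(len(h) + 1)]]
        for i in range(1, len(r) + 1):
            row = [(i, 0, 0, i)]
            for j in range(1, len(h) + 1):
                if r[i - 1] == h[j - 1]:
                    row.append(dp[i - 1][j - 1])
                else:
                    d_sub, d_ins, d_del = dp[i - 1][j - 1], row[j - 1], dp[i - 1][j]
                    row.append(min(
                        (d_sub[0] + 1, d_sub[1], d_sub[2] + 1, d_sub[3]),
                        (d_ins[0] + 1, d_ins[1] + 1, d_ins[2], d_ins[3]),
                        (d_del[0] + 1, d_del[1], d_del[2], d_del[3] + 1),
                    ))
            dp.append(row)
        _, i_err, s_err, d_err = dp[len(r)][len(h)]
        ins, sub, dels = ins + i_err, sub + s_err, dels + d_err
    return (ins + sub + dels) / total_ref, ins, sub, dels

print(wer(["no one cares"], ["no one caress"]))
# -> (0.3333333333333333, 0, 1, 0): "caress" counts as one substitution
```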
