-
Notifications
You must be signed in to change notification settings - Fork 16
Score sentence-end transitions with LabelScorer in new search #152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
src/Search/LexiconfreeTimesyncBeamSearch/LexiconfreeTimesyncBeamSearch.cc
Outdated
Show resolved
Hide resolved
| // Add optional blank after the sentence-end lemma | ||
| if (allowBlankAfterSentenceEnd_) { | ||
| StateId blankAfter = extendState(sentenceEndSink_, blankDesc_); | ||
| addExit(blankAfter, sentenceEndSink_, lexicon_.specialLemma("blank")->id()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to add loop for this blank state?
addTransition(blankAfter, blankAfter);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so because if you take the exit at this state, you are transferred to the root again (tr=2), which only has this one state as successor, so basically you already have this loop as blank-state -> root -> blank-state -> root -> ...
The only aspect we might need to think about is the fact that this blank always counts as a word-end hypothesis. However I think this is fine because actually we are at a word-end. Blanks between words also count as word-end hypotheses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that is a loop already.
Now that you mention this blank should be word end hyp and I agree, it seems it's not the case now.
reachedSentenceEnd will be set to true, only when the next token is SENTENCE_END.
There is no SENTENCE_END to BLANK transition type where we'd also want to set that to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to set reachedSentenceEnd to true once we have a SENTENCE_END transition and afterwards it can't be set to false again. But you're right, I also see a problem there now. In the LabelHypothesis of the Lexiconfree Search, reachedSentenceEnd is also set to true for many other transition types. I think this is indeed a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, this was a bug. Fixed now.
| hypIndex}; | ||
|
|
||
| auto const sts = lemma->syntacticTokenSequence(); | ||
| if (sts.size() != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have seen more than one syntacticTokenSequence in a lemma. Do we need this assertion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced that in PR #145 . First of all, we need to make sure that we have at least one syntacticTokenSequence, otherwise the lemma should not be scored by the LM. We are currently not supporting multiple syntacticTokenSequences in general, therefore we require that we have exactly one. Alternatively, we would have to pick one, probably the first one, anyway. I would say this is an aspect for future work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we will implement support for multi-token-sequences when we actually need it. I haven't seen this case so far and it would complicate the logic a bit since we currently delay the history update until after pruning which we can't do if multiple tokens in a row need to be scored.
|
In the graph above, |
|
@hannah220 With
|
Currently, in
LexiconfreeTimesyncBeamSearchandTreeTimesyncBeamSearchsentence-end is not handled by the LabelScorer; it is only scored by the word-level LM. This PR adds logic to properly handle sentence-end transitions in the new search. The changes consist of the following points:SENTENCE_ENDis added as a new TransitionType for the LabelScorer.LexiconfreeTimesyncBeamSearchsentence-end-indexcan be specified as a parameterinferTransitionTypefunction is adjusted accordingly to assign it the newSENTENCE_ENDtransition typeTreeTimesyncBeamSearchCtcTreeBuilderis modified to include thesentenceEndLemmain the tree if it exists and has pronunciations.finalStatesis added to thePersistentStateTree. This is used in the search to determine which states are considered valid at segment end. If sentence-end is included in the tree, only the sentence-end sink state is added as final state.sentenceEndLabelIndex_is added as a member to the search algorithm and inferred from the lexiconinferTransitionTypefunction is also adjusted to produce theSENTENCE_ENDtransition typedecodeStep. This is because when thesentenceEndLemmahas an empty pronunciation (i.e., should only be scored by the LM and not the LabelScorer), the hypotheses may need to take a normal word-end exit and then the sentence-end exit back-to-back in the same decode step.Depends on changes to the transition types from #138.
Still requires testing.
Here are some plots of the new tree structure including sentence-end:
Tree without

sentence-endlemma in lexicon:Tree with

sentence-endlemma with empty pronunciation in lexicon:Tree with

sentence-endlemma and non-empty pronunciation in lexicon: