-
Notifications
You must be signed in to change notification settings - Fork 24
Add new functionalities to BPE lexicon creation #569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 3 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,6 +1,7 @@ | ||||||||||||||||||||||||||||||||||||||||||||||||||
| __all__ = ["CreateBPELexiconJob"] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||
| import subprocess as sp | ||||||||||||||||||||||||||||||||||||||||||||||||||
| import logging | ||||||||||||||||||||||||||||||||||||||||||||||||||
| import os | ||||||||||||||||||||||||||||||||||||||||||||||||||
| import sys | ||||||||||||||||||||||||||||||||||||||||||||||||||
| from typing import List, Optional, Set, Union | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
@@ -19,6 +20,8 @@ class CreateBPELexiconJob(Job): | |||||||||||||||||||||||||||||||||||||||||||||||||
| This job is still in experimental state, and only tested with Flashlight BPE decoding | ||||||||||||||||||||||||||||||||||||||||||||||||||
| """ | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||
| __sis_hash_exclude__ = {"skip_unk_lemmas": False, "add_all_bpe_phonemes": True, "additional_words": None} | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||
| def __init__( | ||||||||||||||||||||||||||||||||||||||||||||||||||
| self, | ||||||||||||||||||||||||||||||||||||||||||||||||||
| base_lexicon_path: tk.Path, | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
@@ -28,6 +31,9 @@ def __init__( | |||||||||||||||||||||||||||||||||||||||||||||||||
| unk_label: str = "UNK", | ||||||||||||||||||||||||||||||||||||||||||||||||||
| vocab_blacklist: Optional[Union[List[str], Set[str]]] = None, | ||||||||||||||||||||||||||||||||||||||||||||||||||
| keep_special_lemmas: bool = True, | ||||||||||||||||||||||||||||||||||||||||||||||||||
| skip_unk_lemmas: bool = False, | ||||||||||||||||||||||||||||||||||||||||||||||||||
| add_all_bpe_phonemes: bool = True, | ||||||||||||||||||||||||||||||||||||||||||||||||||
| additional_words: Optional[tk.Path] = None, | ||||||||||||||||||||||||||||||||||||||||||||||||||
| ): | ||||||||||||||||||||||||||||||||||||||||||||||||||
| """ | ||||||||||||||||||||||||||||||||||||||||||||||||||
| :param base_lexicon_path: base lexicon (can be phoneme based) to take the lemmas from | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
@@ -41,6 +47,12 @@ def __init__( | |||||||||||||||||||||||||||||||||||||||||||||||||
| usually yes for RASR search and no for Flashlight search. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| The phonemes of the special lemmas will also be kept, therefore | ||||||||||||||||||||||||||||||||||||||||||||||||||
| make sure there is no overlap with the BPE vocab. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| :param skip_unk_lemmas: whether simply skip lemmas out of the BPE vocab | ||||||||||||||||||||||||||||||||||||||||||||||||||
| useful if you set vocab_blacklist | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||
| :param skip_unk_lemmas: whether simply skip lemmas out of the BPE vocab | |
| useful if you set vocab_blacklist | |
| :param skip_unk_lemmas: Whether to simply skip lemmas that are not part of the BPE vocabulary. | |
| Useful if you set vocab_blacklist. |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| :param add_all_bpe_phonemes: If set to True, all BPE vocab will be added to lexicon phonemes, | |
| otherwise, only phonemes appear in lexicon lemma will be added to the lexicon. | |
| :param add_all_bpe_phonemes: If set to True, all BPE tokens will be added to lexicon as phonemes, | |
| otherwise, only tokens that appear in the base lexicon will be added to the output lexicon. |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| :param additional_words: Aside from vocab specified in base_lexicon, we might want to convert some other words, | |
| e.g. untranslatable words by a g2p model in case of g2p-augmented lexicon | |
| :param additional_words: Aside from the vocabulary specified in base_lexicon, we might want to convert some other words, | |
| e.g. untranslatable words by a g2p model in case of g2p-augmented lexicon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
symbol is already replaced at line 101 symbol = symbol.replace(".", "_") , so you could just lexicon.add_phoneme(symbol)
michelwi marked this conversation as resolved.
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| additional_words_list = set() | |
| if self.additional_words is not None: | |
| with util.uopen(self.additional_words.get_path(), "rt") as f: | |
| for line in f: | |
| line = line.strip() | |
| additional_words_list.add(line) | |
| return sorted(additional_words_list) | |
| if self.additional_words is not None: | |
| with util.uopen(self.additional_words.get_path(), "rt") as f: | |
| res = {line.strip() for line in f} | |
| else: | |
| res = set() | |
| return sorted(res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| additional_words_list = set() | |
| if self.additional_words is not None: | |
| with util.uopen(self.additional_words.get_path(), "rt") as f: | |
| for line in f: | |
| line = line.strip() | |
| additional_words_list.add(line) | |
| return sorted(additional_words_list) | |
| if self.additional_words is not None: | |
| with util.uopen(self.additional_words.get_path(), "rt") as f: | |
| return sorted({line.strip() for line in f}) | |
| return [] |
even more simpler
michelwi marked this conversation as resolved.
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.") | |
| logging.info(f"Lemma {orth} is skipped due to use of the BPE token for <unknown>.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.unk_label in bpe_pron.split() means there is some string in the word that cannot be represented with the current BPE vocab (e.g. greek letters with an all latin vocab), right?
| logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.") | |
| logging.info(f"Lemma {orth} is skipped because it cannot be represented with the BPE vocab.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we normally filter out non-word(like [noise]) from the text corpora to train BPE, and when we try to convert non-word lemmata in the base phoneme lexicon, they would be unknown to the BPE vocab due to the existence of "[" and "]".
Btw, this PR is now NOT used in Apptek BPE pipeline and will only be a feature for i6 people. I will still try to get it merged, but don't worry that it would slow down our process in Apptek.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why if not? Shouldn't we add when phoneme when self.add_all_bpe_phonemes is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self.add_all_bpe_phonemes is True there is a separate block below that does the adding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do it in this job, but my feeling is that adding a bunch of words with empty pronunciation is a separate task and I would rather have a
CreateEmptyPronunciationLexiconJoborCreateOrthOnlyLexiconJoband then use theMergeLexiconJobto combine those.