Skip to content

Conversation

@Stefanwuu
Copy link
Contributor

In order to use this job in more complicated scenarios, I added some functionalities to CreateBPELexiconJob. The original functionalities and hashes should stay unchanged.

Co-authored-by: Albert Zeyer <[email protected]>
lexicon/bpe.py Outdated
Comment on lines 50 to 51
:param skip_unk_lemmas: whether simply skip lemmas out of the BPE vocab
useful if you set vocab_blacklist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param skip_unk_lemmas: whether simply skip lemmas out of the BPE vocab
useful if you set vocab_blacklist
:param skip_unk_lemmas: Whether to simply skip lemmas that are not part of the BPE vocabulary.
Useful if you set vocab_blacklist.

lexicon/bpe.py Outdated
Comment on lines 52 to 53
:param add_all_bpe_phonemes: If set to True, all BPE vocab will be added to lexicon phonemes,
otherwise, only phonemes appear in lexicon lemma will be added to the lexicon.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param add_all_bpe_phonemes: If set to True, all BPE vocab will be added to lexicon phonemes,
otherwise, only phonemes appear in lexicon lemma will be added to the lexicon.
:param add_all_bpe_phonemes: If set to True, all BPE tokens will be added to lexicon as phonemes,
otherwise, only tokens that appear in the base lexicon will be added to the output lexicon.

lexicon/bpe.py Outdated
Comment on lines 54 to 55
:param additional_words: Aside from vocab specified in base_lexicon, we might want to convert some other words,
e.g. untranslatable words by a g2p model in case of g2p-augmented lexicon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param additional_words: Aside from vocab specified in base_lexicon, we might want to convert some other words,
e.g. untranslatable words by a g2p model in case of g2p-augmented lexicon
:param additional_words: Aside from the vocabulary specified in base_lexicon, we might want to convert some other words,
e.g. untranslatable words by a g2p model in case of g2p-augmented lexicon

lexicon/bpe.py Outdated
for orth in lemma.orth:
bpe_pron = " ".join([token if token in vocab else self.unk_label for token in w2b[orth].split()])
if self.skip_unk_lemmas and self.unk_label in bpe_pron.split():
logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")
logging.info(f"Lemma {orth} is skipped due to use of the BPE token for <unknown>.")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.unk_label in bpe_pron.split() means there is some string in the word that cannot be represented with the current BPE vocab (e.g. greek letters with an all latin vocab), right?

Suggested change
logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")
logging.info(f"Lemma {orth} is skipped because it cannot be represented with the BPE vocab.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we normally filter out non-word(like [noise]) from the text corpora to train BPE, and when we try to convert non-word lemmata in the base phoneme lexicon, they would be unknown to the BPE vocab due to the existence of "[" and "]".
Btw, this PR is now NOT used in Apptek BPE pipeline and will only be a feature for i6 people. I will still try to get it merged, but don't worry that it would slow down our process in Apptek.

lexicon/bpe.py Outdated
Comment on lines 109 to 115
additional_words_list = set()
if self.additional_words is not None:
with util.uopen(self.additional_words.get_path(), "rt") as f:
for line in f:
line = line.strip()
additional_words_list.add(line)
return sorted(additional_words_list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
additional_words_list = set()
if self.additional_words is not None:
with util.uopen(self.additional_words.get_path(), "rt") as f:
for line in f:
line = line.strip()
additional_words_list.add(line)
return sorted(additional_words_list)
if self.additional_words is not None:
with util.uopen(self.additional_words.get_path(), "rt") as f:
res = {line.strip() for line in f}
else:
res = set()
return sorted(res)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
additional_words_list = set()
if self.additional_words is not None:
with util.uopen(self.additional_words.get_path(), "rt") as f:
for line in f:
line = line.strip()
additional_words_list.add(line)
return sorted(additional_words_list)
if self.additional_words is not None:
with util.uopen(self.additional_words.get_path(), "rt") as f:
return sorted({line.strip() for line in f})
return []

even more simpler

lexicon/bpe.py Outdated
Comment on lines 109 to 115
additional_words_list = set()
if self.additional_words is not None:
with util.uopen(self.additional_words.get_path(), "rt") as f:
for line in f:
line = line.strip()
additional_words_list.add(line)
return sorted(additional_words_list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
additional_words_list = set()
if self.additional_words is not None:
with util.uopen(self.additional_words.get_path(), "rt") as f:
for line in f:
line = line.strip()
additional_words_list.add(line)
return sorted(additional_words_list)
if self.additional_words is not None:
with util.uopen(self.additional_words.get_path(), "rt") as f:
return sorted({line.strip() for line in f})
return []

even more simpler

lexicon/bpe.py Outdated
for orth in lemma.orth:
bpe_pron = " ".join([token if token in vocab else self.unk_label for token in w2b[orth].split()])
if self.skip_unk_lemmas and self.unk_label in bpe_pron.split():
logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.unk_label in bpe_pron.split() means there is some string in the word that cannot be represented with the current BPE vocab (e.g. greek letters with an all latin vocab), right?

Suggested change
logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")
logging.info(f"Lemma {orth} is skipped because it cannot be represented with the BPE vocab.")

keep_special_lemmas: bool = True,
skip_unk_lemmas: bool = False,
add_all_bpe_phonemes: bool = True,
additional_words: Optional[tk.Path] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do it in this job, but my feeling is that adding a bunch of words with empty pronunciation is a separate task and I would rather have a CreateEmptyPronunciationLexiconJob or CreateOrthOnlyLexiconJob and then use the MergeLexiconJob to combine those.

vocab.add(symbol)
lexicon.add_phoneme(symbol.replace(".", "_"))
if self.add_all_bpe_phonemes:
lexicon.add_phoneme(symbol.replace(".", "_"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

symbol is already replaced at line 101 symbol = symbol.replace(".", "_") , so you could just lexicon.add_phoneme(symbol)

continue
used_vocab.update(set(bpe_pron.split()))
lexicon.add_lemma(Lemma([orth], [bpe_pron.replace(".", "_")], lemma.synt, lemma.eval))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why if not? Shouldn't we add when phoneme when self.add_all_bpe_phonemes is true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if self.add_all_bpe_phonemes is True there is a separate block below that does the adding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants