Add new functionalities to BPE lexicon creation #569
base: main
Conversation
Co-authored-by: Albert Zeyer <[email protected]>
lexicon/bpe.py (Outdated)

Suggested change:

    -:param skip_unk_lemmas: whether simply skip lemmas out of the BPE vocab
    -    useful if you set vocab_blacklist
    +:param skip_unk_lemmas: Whether to simply skip lemmas that are not part of the BPE vocabulary.
    +    Useful if you set vocab_blacklist.
lexicon/bpe.py (Outdated)

Suggested change:

    -:param add_all_bpe_phonemes: If set to True, all BPE vocab will be added to lexicon phonemes,
    -    otherwise, only phonemes appear in lexicon lemma will be added to the lexicon.
    +:param add_all_bpe_phonemes: If set to True, all BPE tokens will be added to the lexicon as phonemes,
    +    otherwise, only tokens that appear in the base lexicon will be added to the output lexicon.
lexicon/bpe.py (Outdated)

Suggested change:

    -:param additional_words: Aside from vocab specified in base_lexicon, we might want to convert some other words,
    -    e.g. untranslatable words by a g2p model in case of g2p-augmented lexicon
    +:param additional_words: Aside from the vocabulary specified in base_lexicon, we might want to convert some other words,
    +    e.g. words untranslatable by a g2p model in the case of a g2p-augmented lexicon
lexicon/bpe.py (Outdated)

    for orth in lemma.orth:
        bpe_pron = " ".join([token if token in vocab else self.unk_label for token in w2b[orth].split()])
        if self.skip_unk_lemmas and self.unk_label in bpe_pron.split():
            logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")

Suggested change:

    -logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")
    +logging.info(f"Lemma {orth} is skipped due to use of the BPE token for <unknown>.")

`self.unk_label in bpe_pron.split()` means there is some string in the word that cannot be represented with the current BPE vocab (e.g. Greek letters with an all-Latin vocab), right?

Suggested change:

    -logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")
    +logging.info(f"Lemma {orth} is skipped because it cannot be represented with the BPE vocab.")

Yes, we normally filter non-words (like [noise]) out of the text corpora used to train BPE, so when we try to convert non-word lemmata from the base phoneme lexicon, they are unknown to the BPE vocab due to the presence of "[" and "]".

Btw, this PR is now NOT used in the Apptek BPE pipeline and will only be a feature for i6 people. I will still try to get it merged, but don't worry that it would slow down our process at Apptek.
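For readers following along, the skip logic under discussion can be sketched in isolation. The names `w2b`, `vocab`, and the unknown label mirror the PR; the toy data below is purely illustrative:

```python
# Minimal sketch of the skip_unk_lemmas logic (illustrative data only).
UNK = "<unk>"
vocab = {"hel@@", "lo", "wor@@", "ld"}
# word -> its BPE segmentation, as produced by applying BPE to each orth
w2b = {"hello": "hel@@ lo", "world": "wor@@ ld", "[noise]": "[ no@@ ise ]"}

def bpe_pron(orth: str) -> str:
    """Map each BPE token to itself if it is in the vocab, else to the unknown label."""
    return " ".join(tok if tok in vocab else UNK for tok in w2b[orth].split())

def skip(orth: str) -> bool:
    """A lemma is skipped when any of its tokens fell back to the unknown label."""
    return UNK in bpe_pron(orth).split()

print(skip("hello"))    # False: fully covered by the vocab
print(skip("[noise]"))  # True: "[" and "]" are not in the BPE vocab
```

This also illustrates the author's point: non-word lemmata like `[noise]` are skipped because their brackets never occur in the BPE training data.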
lexicon/bpe.py (Outdated)

    additional_words_list = set()
    if self.additional_words is not None:
        with util.uopen(self.additional_words.get_path(), "rt") as f:
            for line in f:
                line = line.strip()
                additional_words_list.add(line)
    return sorted(additional_words_list)

Suggested change:

    -additional_words_list = set()
    -if self.additional_words is not None:
    -    with util.uopen(self.additional_words.get_path(), "rt") as f:
    -        for line in f:
    -            line = line.strip()
    -            additional_words_list.add(line)
    -return sorted(additional_words_list)
    +if self.additional_words is not None:
    +    with util.uopen(self.additional_words.get_path(), "rt") as f:
    +        res = {line.strip() for line in f}
    +else:
    +    res = set()
    +return sorted(res)

Even simpler, replacing the same lines:

    +if self.additional_words is not None:
    +    with util.uopen(self.additional_words.get_path(), "rt") as f:
    +        return sorted({line.strip() for line in f})
    +return []
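As a sanity check, the simplified reading suggested above strips, deduplicates, and sorts, so the output is deterministic regardless of file order. This sketch uses an in-memory file in place of `util.uopen` (an assumption for the demo):

```python
import io

def read_additional_words(f) -> list:
    """Deduplicate stripped lines and return them in sorted order."""
    return sorted({line.strip() for line in f})

# In-memory stand-in for the additional_words file.
f = io.StringIO("zebra\napple\nzebra\n  apple \n")
print(read_additional_words(f))  # ['apple', 'zebra']
```

Deterministic, sorted output matters here because the word list feeds into a Sisyphus job output that should be stable across runs.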
    keep_special_lemmas: bool = True,
    skip_unk_lemmas: bool = False,
    add_all_bpe_phonemes: bool = True,
    additional_words: Optional[tk.Path] = None,

We could do it in this job, but my feeling is that adding a bunch of words with empty pronunciations is a separate task, and I would rather have a CreateEmptyPronunciationLexiconJob or CreateOrthOnlyLexiconJob and then use the MergeLexiconJob to combine those.
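The separate job proposed here could look roughly like the following. Note that CreateEmptyPronunciationLexiconJob / CreateOrthOnlyLexiconJob do not exist yet; the shape below is a hypothetical sketch, and stand-in classes are used so it runs self-contained (a real implementation would use `Lexicon`/`Lemma` from i6_core and write an XML lexicon):

```python
from dataclasses import dataclass, field

# Stand-ins for the i6_core lexicon classes, just for this sketch.
@dataclass
class Lemma:
    orth: list
    phon: list = field(default_factory=list)  # empty: orthography only

@dataclass
class Lexicon:
    lemmata: list = field(default_factory=list)

    def add_lemma(self, lemma):
        self.lemmata.append(lemma)

def create_orth_only_lexicon(words):
    """Build a lexicon whose lemmata carry only orthographies, no pronunciations."""
    lexicon = Lexicon()
    for word in sorted(set(words)):
        lexicon.add_lemma(Lemma(orth=[word]))
    return lexicon

lex = create_orth_only_lexicon(["foo", "bar", "foo"])
print([l.orth for l in lex.lemmata])  # [['bar'], ['foo']]
```

The resulting lexicon would then be combined with the BPE lexicon via MergeLexiconJob, keeping CreateBPELexiconJob focused on the BPE conversion itself.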
     vocab.add(symbol)
    -lexicon.add_phoneme(symbol.replace(".", "_"))
    +if self.add_all_bpe_phonemes:
    +    lexicon.add_phoneme(symbol.replace(".", "_"))

symbol is already replaced at line 101 (symbol = symbol.replace(".", "_")), so you could just call lexicon.add_phoneme(symbol).
        continue
    used_vocab.update(set(bpe_pron.split()))
    lexicon.add_lemma(Lemma([orth], [bpe_pron.replace(".", "_")], lemma.synt, lemma.eval))

Why "if not"? Shouldn't we add the phoneme when self.add_all_bpe_phonemes is true?

If self.add_all_bpe_phonemes is True, there is a separate block below that does the adding.
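The two code paths being clarified in this thread can be sketched side by side. Variable names follow the PR; the data is illustrative:

```python
# Sketch of the add_all_bpe_phonemes control flow (illustrative data).
add_all_bpe_phonemes = False
vocab = ["a@@", "b@@", "c", "d"]
lemma_prons = ["a@@ c", "c"]  # BPE pronunciations actually used by lemmata

if add_all_bpe_phonemes:
    # Path 1 (separate block): every BPE token becomes a phoneme,
    # regardless of whether any lemma uses it.
    phonemes = list(vocab)
else:
    # Path 2 (this thread): only tokens that occur in some lemma
    # pronunciation are added as phonemes.
    used_vocab = set()
    for pron in lemma_prons:
        used_vocab.update(pron.split())
    phonemes = sorted(used_vocab)

print(phonemes)  # ['a@@', 'c']
```

With `add_all_bpe_phonemes=False`, unused tokens like `b@@` and `d` never appear as phonemes in the output lexicon.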
In order to use this job in more complicated scenarios, I added some functionalities to CreateBPELexiconJob. The original functionalities and hashes should stay unchanged.