@larissakl
Contributor

Adds an AedTreeBuilder which can later be used for a TreeLabelsyncBeamSearch. The search tree only includes labels, without blanks, self-loops and skip-transitions.
The sentence-end token is retrieved from a special lemma in the lexicon and added as a label reachable from the root.
As in the CtcTreeBuilder, a word-boundary root will be added if a word-boundary token is present in the lexicon.
I moved some helper functions from CtcTreeBuilder to AbstractTreeBuilder so that I can easily reuse them.
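For readers unfamiliar with the structure, here is a minimal Python sketch of a label-only prefix tree as described above. It is not the AedTreeBuilder code from this PR: `Node`, `build_label_tree`, and `count_nodes` are hypothetical names, and the special lemmas other than sentence-end (silence, unknown, word-boundary) are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Children are keyed by label; exits hold the orthographies of words ending here.
    children: dict = field(default_factory=dict)
    exits: list = field(default_factory=list)


def build_label_tree(pronunciations, sentence_end=None):
    """Build a prefix tree over label sequences (no blanks, self-loops, or skips)."""
    root = Node()
    for orth, labels in pronunciations:
        node = root
        for label in labels:
            # Reuse an existing child so that shared prefixes share nodes.
            node = node.children.setdefault(label, Node())
        node.exits.append(orth)
    if sentence_end is not None:
        # Assumption: sentence-end is a single label hanging directly off the root.
        root.children.setdefault(sentence_end, Node()).exits.append(sentence_end)
    return root


def count_nodes(node):
    """Count all nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Regular lemmas from the example lexicon, as (orthography, label sequence) pairs.
lexicon = [
    ("AA", ["A", "A"]),
    ("AB", ["A", "B"]),
    ("AAA", ["A", "A", "A"]),
    ("AAB", ["A", "A", "B"]),
    ("ABA", ["A", "B", "A"]),
    ("ACA", ["A", "C", "A"]),
    ("BA", ["B", "A"]),
    ("BAC", ["B", "A", "C"]),
]
tree = build_label_tree(lexicon, sentence_end="</s>")
```

In this sketch, words with a common prefix (for example AA, AAA, and AAB) share the nodes of that prefix, and a node can carry a word exit while still having children.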

@larissakl larissakl requested review from SimBe195 and curufinwe May 17, 2025 11:47
@larissakl
Contributor Author

larissakl commented May 28, 2025

The helper functions used by CtcTreeBuilder and AedTreeBuilder are now in the shared base class SharedBaseClassTreeBuilder instead of AbstractTreeBuilder.

@SimBe195
Collaborator

@larissakl Do you have any plots that show the generated tree structure for a simple example lexicon? If so, it would be nice to include one in this PR.

@larissakl
Contributor Author

[Image: aed_tree]

Sure, here is an example tree and this is the corresponding example lexicon:

<?xml version="1.0" ?>
<lexicon>
  <phoneme-inventory>
    <phoneme>
      <symbol>_</symbol>
      <variation>none</variation>
    </phoneme>
    <phoneme>
      <symbol>A</symbol>
      <variation>none</variation>
    </phoneme>
    <phoneme>
      <symbol>B</symbol>
      <variation>none</variation>
    </phoneme>
    <phoneme>
      <symbol>C</symbol>
      <variation>none</variation>
    </phoneme>
    <phoneme>
      <symbol>[SILENCE]</symbol>
      <variation>none</variation>
    </phoneme>
    <phoneme>
      <symbol>[UNKNOWN]</symbol>
      <variation>none</variation>
    </phoneme>
    <phoneme>
      <symbol>&lt;/s&gt;</symbol>
      <variation>none</variation>
    </phoneme>
    <phoneme>
      <symbol>@</symbol>
      <variation>none</variation>
    </phoneme>
  </phoneme-inventory>
  <lemma special="silence">
    <orth>[SILENCE]</orth>
    <orth/>
    <phon>[SILENCE]</phon>
    <synt/>
    <eval/>
  </lemma>
  <lemma special="unknown">
    <orth>[UNKNOWN]</orth>
    <phon>[UNKNOWN]</phon>
    <synt>
      <tok>&lt;UNK&gt;</tok>
    </synt>
  </lemma>
  <lemma special="sentence-end">
    <orth>&lt;/s&gt;</orth>
    <phon>&lt;/s&gt;</phon>
  </lemma>
  <lemma special="word-boundary">
    <orth>@</orth>
    <phon>@</phon>
  </lemma>
  <lemma>
    <orth>AA</orth>
    <phon>A A</phon>
  </lemma>
  <lemma>
    <orth>AB</orth>
    <phon>A B</phon>
  </lemma>
  <lemma>
    <orth>AAA</orth>
    <phon>A A A</phon>
  </lemma>
  <lemma>
    <orth>AAB</orth>
    <phon>A A B</phon>
  </lemma>
  <lemma>
    <orth>ABA</orth>
    <phon>A B A</phon>
  </lemma>
  <lemma>
    <orth>ACA</orth>
    <phon>A C A</phon>
  </lemma>
  <lemma>
    <orth>BA</orth>
    <phon>B A</phon>
  </lemma>
  <lemma>
    <orth>BAC</orth>
    <phon>B A C</phon>
  </lemma>
</lexicon>

In the plot, m=... is the AM index: 1 for A, 2 for B, 3 for C, 4 for [SILENCE], 5 for [UNKNOWN], and 6 for sentence-end, following the order of the phonemes in the inventory. The blank symbol _ is still part of the lexicon, but it is no longer relevant for this tree.
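As a rough illustration of that numbering, the sketch below reads the phoneme inventory and assigns each symbol its zero-based position. This assumes the AM index is simply the position in the inventory, which matches the values listed above but is not taken from the PR code; `am_indices` is a hypothetical helper, and the `<variation>` elements are omitted to keep the XML short.

```python
import xml.etree.ElementTree as ET

# Phoneme inventory from the example lexicon, without the <variation> elements.
LEXICON_XML = """<lexicon>
  <phoneme-inventory>
    <phoneme><symbol>_</symbol></phoneme>
    <phoneme><symbol>A</symbol></phoneme>
    <phoneme><symbol>B</symbol></phoneme>
    <phoneme><symbol>C</symbol></phoneme>
    <phoneme><symbol>[SILENCE]</symbol></phoneme>
    <phoneme><symbol>[UNKNOWN]</symbol></phoneme>
    <phoneme><symbol>&lt;/s&gt;</symbol></phoneme>
    <phoneme><symbol>@</symbol></phoneme>
  </phoneme-inventory>
</lexicon>"""


def am_indices(xml_text):
    """Map each phoneme symbol to its zero-based position in the inventory."""
    root = ET.fromstring(xml_text)
    symbols = [phoneme.findtext("symbol") for phoneme in root.iter("phoneme")]
    # Assumption: the AM index equals the position in the inventory.
    return {symbol: index for index, symbol in enumerate(symbols)}


indices = am_indices(LEXICON_XML)
for symbol, index in indices.items():
    print(f"m={index}: {symbol}")
```

Under this assumption the blank symbol _ takes position 0, which is why A starts at 1.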

@SimBe195
Collaborator

It looks like "word-boundary" is part of the lexicon but not in the tree. From the code it looks like word-boundary should be integrated though. Is this really the right picture given the lexicon?

@larissakl
Contributor Author

Oh yes, you're right. That was the tree without the word-boundary token. With word-boundary in the lexicon, it looks like this:

[Image: tree_word-boundary]

@larissakl larissakl mentioned this pull request Oct 20, 2025