<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Why Do Speech Language Models Fail? | Appendix Companion</title>
<link rel="stylesheet" href="styles.css" />
</head>
<body>
<div class="page">
<header>
<div class="label">Appendix Companion</div>
<h1>Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective</h1>
<div class="meta">
<div class="pill">Hankun Wang · Haoran Wang · Yiwei Guo · Zhihan Li · Chenpeng Du · Xie Chen · Kai Yu</div>
<div class="pill">Shanghai Jiao Tong University · X-LANCE Lab</div>
<div class="pill">Additional materials for ICASSP 2026 submission</div>
</div>
</header>
<main>
<section id="abstract">
<h2>Abstract</h2>
<p>Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. We study three factors by evolving the modality from text to speech: (A) speech tokens provide phonetic rather than semantic information, (B) speech sequences are far longer than text, and (C) paralinguistic information adds variability. Factor A has minor impact, factor B noticeably affects syntactic and semantic modeling, and factor C is the most disruptive, especially for lexical modeling. These findings highlight the unique challenges of training end-to-end SLMs and suggest pathways toward stronger speech generation.</p>
</section>
<section id="free-gen">
<h2>Free Generation Setup</h2>
<p class="muted">Generation settings per modality are listed below. For Phone-Repeat and Speech-HuBERT, higher temperatures mitigate repetitive loops; if more than eight consecutive identical tokens appear, generation stops early. If the transcribed text (excluding the prompt) has fewer than 50 characters, we regenerate with a different random seed. Other modalities always generate up to <code>max_length</code> and drop the last word since it may be incomplete.</p>
<table>
<thead>
<tr><th>Modality</th><th>Max len</th><th>Top-K</th><th>Top-P</th><th>Temp</th></tr>
</thead>
<tbody>
<tr><td>Text-BPE</td><td>45</td><td>1000</td><td>0.9</td><td>1.00</td></tr>
<tr><td>Text-Raw</td><td>135</td><td>–</td><td>0.9</td><td>1.05</td></tr>
<tr><td>Phone-BPE</td><td>45</td><td>1000</td><td>0.9</td><td>1.00</td></tr>
<tr><td>Phone-Raw</td><td>96</td><td>–</td><td>0.9</td><td>1.05</td></tr>
<tr><td>Phone-Repeat</td><td>500</td><td>–</td><td>0.9</td><td>1.15</td></tr>
<tr><td>Speech-HuBERT</td><td>500</td><td>1000</td><td>0.9</td><td>1.20</td></tr>
</tbody>
</table>
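The early-stop and regeneration rules described above can be sketched as a small sampling loop. This is a minimal illustration, not the authors' implementation: `sample_next` (draws one token given the sequence so far) and `transcribe` (maps generated tokens to text) are hypothetical stand-ins for the model's sampler and the transcription pipeline.

```python
import random

MAX_REPEAT = 8   # stop early if more than eight consecutive identical tokens
MIN_CHARS = 50   # regenerate with a new seed if the transcript is shorter

def generate(sample_next, transcribe, prompt, max_length, max_seeds=10):
    """Sketch of the free-generation procedure for Phone-Repeat and
    Speech-HuBERT: sample up to `max_length` tokens, break on long
    repetition loops, and retry with a fresh seed on short outputs."""
    for seed in range(max_seeds):
        random.seed(seed)
        tokens = list(prompt)
        repeat = 1
        while len(tokens) < max_length:
            tok = sample_next(tokens)
            repeat = repeat + 1 if tokens and tok == tokens[-1] else 1
            tokens.append(tok)
            if repeat > MAX_REPEAT:  # repetitive loop detected
                break
        text = transcribe(tokens[len(prompt):])  # exclude the prompt
        if len(text) >= MIN_CHARS:
            return text
    return text  # fall back to the last attempt
```

With a sampler that never repeats, the loop simply runs to `max_length`; with a degenerate sampler that emits the same token, it stops after nine consecutive copies and retries.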
</section>
<section id="transcription">
<h2>Transcription and Evaluation Pipelines</h2>
<div class="grid">
<div class="card">
<h3>Phone → Text (T5-PTT)</h3>
<p class="muted">We fine-tune FLAN-T5 on LibriHeavy-50k with phone and duration labels from Kaldi alignments.</p>
<ul>
<li>Two versions: T5-PTT-Original and T5-PTT-Deduped (for Phone-Repeat with deduplicated runs).</li>
<li>WER on test set: 2.64% (Original), 1.97% (Deduped).</li>
<li>Deduped inputs preserve accuracy while matching duration-collapsed phone sequences.</li>
</ul>
</div>
<div class="card">
<h3>Speech → Text</h3>
<ul>
<li>HuBERT tokens → CTX-vec2wav [1] synthesis (speaker prompt: LibriTTS “1089_134686_000001_000001”), using the contextual vocoder from UniCATS.</li>
<li>Whisper-Large-V3 performs ASR with punctuation and case preserved.</li>
<li>Provides normalized text for downstream automatic evaluation (perplexity via Llama-3.1-8B).</li>
</ul>
</div>
</div>
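The perplexity used for automatic evaluation is the exponentiated mean negative log-likelihood of the normalized transcript under the scoring model. A model-agnostic sketch, assuming per-token natural-log probabilities have already been obtained from Llama-3.1-8B (or any scorer):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood.
    `token_logprobs` holds one natural-log probability per token."""
    assert token_logprobs, "need at least one token"
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

For example, a sequence whose tokens each have probability 0.5 has perplexity exactly 2.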
</section>
<section id="prompts">
<h2>Prompt Sets</h2>
<p class="muted">Prompts are grouped by whether they appear in the training data. For out-of-training prompts, speech prompts are synthesized with Hierspeech++ and aligned to obtain phones, durations, and HuBERT tokens.</p>
<table>
<thead>
<tr><th>In Training Data</th><th>Not in Training Data</th></tr>
</thead>
<tbody>
<tr><td>This</td><td>Alice is a nice</td></tr>
<tr><td>I will</td><td>How much water do you</td></tr>
<tr><td>How do</td><td>We decide to go to the</td></tr>
<tr><td>When I</td><td>In the morning, I like to</td></tr>
<tr><td>She said</td><td>A little bird told me that</td></tr>
<tr><td>These are</td><td>Mary went to the market to</td></tr>
<tr><td>The boy is</td><td>In the morning, I like to eat</td></tr>
<tr><td>The moon is</td><td>Bob is a tennis player, and he</td></tr>
<tr><td>What a lovely</td><td>He looked up to the sky and saw</td></tr>
<tr><td>He looked up to the sky and said</td><td>A little girl is playing with her</td></tr>
</tbody>
</table>
</section>
<section id="word-boundary">
<h2>Word Boundary Ablation</h2>
      <p class="muted">Adding explicit word-boundary tokens to non-text modalities yields slight gains on syntactic and semantic tasks for Phone-Raw, Phone-Repeat, and Speech-HuBERT, while lexical scores stay similar. Phone-BPE drops slightly because the boundary tokens lengthen its sequences.</p>
<table>
<thead>
<tr><th>Modality</th><th>sWUGGY</th><th>sBLIMP</th><th>Topic-SC</th></tr>
</thead>
<tbody>
<tr><td>Phone-Raw</td><td>85.8</td><td>74.5</td><td>66.6</td></tr>
<tr><td class="muted">+word boundary</td><td>85.6</td><td>75.7</td><td>66.8</td></tr>
<tr><td>Phone-BPE</td><td>85.0</td><td>75.0</td><td>70.9</td></tr>
<tr><td class="muted">+word boundary</td><td>84.1</td><td>75.4</td><td>69.6</td></tr>
<tr><td>Phone-Repeat</td><td>85.5</td><td>66.2</td><td>58.3</td></tr>
<tr><td class="muted">+word boundary</td><td>85.2</td><td>66.9</td><td>59.0</td></tr>
<tr><td>Speech-HuBERT</td><td>50.8</td><td>57.3</td><td>52.9</td></tr>
<tr><td class="muted">+word boundary</td><td>50.3</td><td>57.7</td><td>53.6</td></tr>
</tbody>
</table>
</section>
<section id="references">
<h2>References</h2>
<p class="muted">[1] Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu. UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding. <a href="https://arxiv.org/abs/2306.07547" target="_blank" rel="noopener">arXiv:2306.07547</a>.</p>
</section>
</main>
</div>
<footer>
Built for GitHub Pages.
</footer>
</body>
</html>