26 changes: 10 additions & 16 deletions README.md
@@ -28,20 +28,21 @@ MOSS-TTSD supports voice cloning and long single-session speech generation, maki

## Highlights

- **Highly Expressive Dialogue Speech**: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, millions of hours of TTS data, and 400k hours of synthetic and real conversational speech, MOSS-TTSD generates highly expressive, human-like dialogue speech with natural conversational prosody.
- **Highly Expressive Dialogue Speech**: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, and millions of hours of TTS data and conversational speech, MOSS-TTSD generates highly expressive, human-like dialogue speech with natural conversational prosody.
- **Two-Speaker Voice Cloning**: MOSS-TTSD supports zero-shot two-speaker voice cloning and can generate conversational speech with accurate speaker switching based on dialogue scripts. Only 10 to 20 seconds of reference audio is needed.
- **Chinese-English Bilingual Support**: MOSS-TTSD enables highly expressive speech generation in both Chinese and English.
- **Long-Form Speech Generation**: Thanks to a low-bitrate codec and training framework optimizations, MOSS-TTSD has been trained for long speech generation (maximum training length is 960s).
- **Long-Form Speech Generation**: Thanks to a low-bitrate codec and training framework optimizations, MOSS-TTSD has been trained for long speech generation (maximum training length is 1700s).
- **Fully Open Source & Commercial-Ready**: MOSS-TTSD and its future updates will be fully open-source and support free commercial use.

## News 🚀

- **[2025-11-01]** MOSS-TTSD v0.7 is released! v0.7 significantly improves audio quality, voice cloning capability, and stability, adds support for 32 kHz high‑quality output, greatly extends single‑pass generation length (960s→1700s), and more reliably generates speech events following speaker tags. We recommend using the v0.7 model by default.
- **[2025-09-09]** We added support for the SGLang inference engine, accelerating model inference by up to **16x**.
- **[2025-08-25]** We released the 32 kHz version of XY-Tokenizer.
- **[2025-08-12]** We added support for streaming inference in MOSS-TTSD v0.5.
- **[2025-07-29]** We provided the SiliconFlow API interface and usage examples for MOSS-TTSD v0.5.
- **[2025-07-16]** We open-sourced the fine-tuning code for MOSS-TTSD v0.5, supporting full-parameter fine-tuning, LoRA fine-tuning, and multi-node training.
- **[2025-07-04]** MOSS-TTSD v0.5 is released! v0.5 enhances timbre-switching accuracy, voice cloning capability, and model stability. We recommend using the v0.5 model by default.
- **[2025-07-04]** MOSS-TTSD v0.5 is released! v0.5 enhances timbre-switching accuracy, voice cloning capability, and model stability.
- **[2025-06-20]** MOSS-TTSD v0 is released! Moreover, we provide a podcast generation pipeline named Podever, which can automatically convert PDFs, URLs, or long text files into high-quality podcasts.

## Installation
@@ -62,7 +63,7 @@ You also need to download the XY Tokenizer model weights. You can find the weigh

```bash
mkdir -p XY_Tokenizer/weights
huggingface-cli download fnlp/XY_Tokenizer_TTSD_V0_32k xy_tokenizer.ckpt --local-dir ./XY_Tokenizer/weights/
huggingface-cli download fnlp/MOSS_TTSD_tokenizer MOSS_TTSD_tokenizer --local-dir ./XY_Tokenizer/weights/
```
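If you prefer to fetch the weights from Python rather than the CLI, a minimal sketch using `huggingface_hub` (the library behind `huggingface-cli`) is shown below; the `allow_patterns` filter mirrors the CLI argument above and is an assumption about the repo layout:

```python
# Minimal sketch: download the tokenizer weights with huggingface_hub.
# The pattern filter assumes the repo exposes a MOSS_TTSD_tokenizer file or folder,
# as suggested by the CLI command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fnlp/MOSS_TTSD_tokenizer",
    local_dir="./XY_Tokenizer/weights/",
    allow_patterns=["MOSS_TTSD_tokenizer*"],
)
```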

## Usage
@@ -89,16 +90,9 @@ Parameters:

#### JSONL Input Format

The input JSONL file should contain one JSON object per line. MOSS-TTSD supports multiple input formats:
The input JSONL file should contain one JSON object per line. MOSS-TTSD supports two input formats:

**Format 1: Text-only input (No voice cloning, using the model's random timbre)**
```json
{
"text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content[S1]..."
}
```

**Format 2: Separate speaker audio references**
**Format 1: Separate speaker audio references**
```json
{
"base_path": "/path/to/audio/files",
@@ -110,7 +104,7 @@ The input JSONL file should contain one JSON object per line. MOSS-TTSD supports
}
```

**Format 3: Shared audio reference**
**Format 2: Shared audio reference**
```json
{
"base_path": "/path/to/audio/files",
@@ -126,11 +120,11 @@ The input JSONL file should contain one JSON object per line. MOSS-TTSD supports
- `text`: Dialogue script with speaker tags `[S1]` and `[S2]` indicating speaker turns (required)
- `base_path`: Base directory path for relative file paths (optional)

**For voice cloning (Format 2):**
**For voice cloning (Format 1):**
- `prompt_audio_speaker1/2`: Path to reference audio files for voice cloning (relative to `base_path`)
- `prompt_text_speaker1/2`: Reference text corresponding to the audio prompts for better voice matching

**For shared reference (Format 3):**
**For shared reference (Format 2):**
- `prompt_audio`: Path to shared reference audio file containing both speakers' voices (relative to `base_path`)
- `prompt_text`: Reference text corresponding to the audio, also using `[S1]` and `[S2]` tags to distinguish speakers
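As a quick illustration of the two formats, here is a minimal sketch that writes a JSONL input file; the field names come from the descriptions above, while the file names and texts are placeholders:

```python
# Minimal sketch: build a JSONL input file for inference.py.
# Field names follow the formats described above; paths and texts are placeholders.
import json

examples = [
    {  # Format 1: separate reference audio per speaker
        "base_path": "/path/to/audio/files",
        "text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content",
        "prompt_audio_speaker1": "speaker1_ref.wav",
        "prompt_text_speaker1": "Reference text spoken by speaker 1",
        "prompt_audio_speaker2": "speaker2_ref.wav",
        "prompt_text_speaker2": "Reference text spoken by speaker 2",
    },
    {  # Format 2: one shared reference audio containing both speakers
        "base_path": "/path/to/audio/files",
        "text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content",
        "prompt_audio": "shared_ref.wav",
        "prompt_text": "[S1]Reference line from speaker 1[S2]Reference line from speaker 2",
    },
]

with open("my_dialogues.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

The resulting file can then be passed to inference, e.g. `python inference.py --jsonl my_dialogues.jsonl --output_dir outputs`.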

25 changes: 9 additions & 16 deletions README_zh.md
@@ -26,14 +26,15 @@ MOSS-TTSD (text to spoken dialogue) is an open-source Chinese-English bilingual spoken dialogue

## Highlights

- **Highly Expressive Dialogue Speech**: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, millions of hours of TTS data, and roughly 400k hours of real/synthetic conversational speech, MOSS-TTSD generates highly expressive, highly natural, human-like dialogue speech with natural conversational prosody.
- **Highly Expressive Dialogue Speech**: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, and millions of hours of TTS data and conversational speech, MOSS-TTSD generates highly expressive, highly natural, human-like dialogue speech with natural conversational prosody.
- **Two-Speaker Zero-Shot Voice Cloning**: MOSS-TTSD supports zero-shot two-speaker cloning and switches roles/voices accurately according to the script. Only a 10 to 20 second reference audio clip is needed.
- **Chinese-English Bilingual**: MOSS-TTSD supports highly expressive speech generation in both Chinese and English.
- **Long-Form Audio Generation**: Thanks to a low-bitrate codec and training framework optimizations, MOSS-TTSD has been extensively trained for long-audio generation (maximum training length of 960s) and can generate very long audio in a single pass.
- **Long-Form Audio Generation**: Thanks to a low-bitrate codec and training framework optimizations, MOSS-TTSD has been extensively trained for long-audio generation (maximum training length of 1700s) and can generate very long audio in a single pass.
- **Open Source & Commercial-Ready**: Current and future versions will remain open source and support free commercial use.

## News 🚀

- **[2025-11-01]** We released MOSS-TTSD v0.7: it significantly improves audio quality, voice cloning capability, and stability, supports 32 kHz high-quality output, greatly extends single-pass generation length (960s->1700s), and generates speech events according to speaker tags more reliably.
- **[2025-09-09]** We added support for the SGLang inference engine, accelerating model inference by up to **16x**.
- **[2025-08-25]** We released the 32 kHz XY-Tokenizer.
- **[2025-08-12]** We added streaming inference support for MOSS-TTSD v0.5.
@@ -60,7 +61,7 @@ pip install flash-attn

```bash
mkdir -p XY_Tokenizer/weights
huggingface-cli download fnlp/XY_Tokenizer_TTSD_V0_32k xy_tokenizer.ckpt --local-dir ./XY_Tokenizer/weights/
huggingface-cli download fnlp/MOSS_TTSD_tokenizer MOSS_TTSD_tokenizer --local-dir ./XY_Tokenizer/weights/
```

## Usage
@@ -87,17 +88,9 @@ python inference.py --jsonl examples/examples.jsonl --output_dir outputs --seed

#### JSONL Input Format

MOSS-TTSD supports multiple input formats:
MOSS-TTSD supports two input formats:

**Format 1: Text only (no voice cloning; uses the model's random timbre)**

```json
{
"text": "[S1]说话人1的内容[S2]说话人2的内容[S1]..."
}
```

**Format 2: Separate reference audio for each speaker**
**Format 1: Separate reference audio for each speaker**

```json
{
@@ -110,7 +103,7 @@ MOSS-TTSD supports multiple input formats:
}
```

**Format 3: Shared reference audio (one reference audio containing both speakers' content)**
**Format 2: Shared reference audio (one reference audio containing both speakers' content)**

```json
{
@@ -128,12 +121,12 @@ MOSS-TTSD supports multiple input formats:
- `text`: Dialogue script with `[S1]` and `[S2]` speaker tags (required)
- `base_path`: Base directory for relative paths (optional)

**For voice cloning (Format 2):**
**For voice cloning (Format 1):**

- `prompt_audio_speaker1/2`: Reference audio for the two speakers (may be relative to `base_path`)
- `prompt_text_speaker1/2`: Text corresponding to the reference audio, which helps match the timbre better

**For shared reference (Format 3):**
**For shared reference (Format 2):**

- `prompt_audio`: Shared reference audio containing both speakers (may be relative to `base_path`)
- `prompt_text`: Corresponding reference text, also using `[S1]` and `[S2]` to distinguish speakers
115 changes: 115 additions & 0 deletions XY_Tokenizer/config/MOSS_TTSD_tokenizer.yaml
@@ -0,0 +1,115 @@
generator_params:
input_sample_rate: 16000
output_sample_rate: 32000
encoder_downsample_rate: 1280
decoder_upsample_rate: 2560

feature_extractor_kwargs:
chunk_length: 30
feature_size: 80
hop_length: 160
n_fft: 400
n_samples: 480000
nb_max_frames: 3000
padding_side: right
padding_value: 0.0
return_attention_mask: false
sampling_rate: 16000

# Codec / model architecture (inference required)
semantic_encoder_kwargs: # 100hz -> 50hz
num_mel_bins: 80
sampling_rate: 16000
hop_length: 160
stride_size: 2
kernel_size: 3
d_model: 768
scale_embedding: false
max_audio_seconds: 30
encoder_layers: 12
encoder_attention_heads: 12
encoder_ffn_dim: 3072
activation_function: "gelu"

semantic_encoder_adapter_kwargs: # 50hz
input_dim: 768
output_dim: 768
d_model: 768
max_source_positions: 1500
encoder_layers: 4
encoder_attention_heads: 12
encoder_ffn_dim: 3072

acoustic_encoder_kwargs: # 100hz -> 50hz
num_mel_bins: 80
sampling_rate: 16000
hop_length: 160
stride_size: 2
kernel_size: 3
d_model: 768
scale_embedding: false
max_audio_seconds: 30
encoder_layers: 12
encoder_attention_heads: 12
encoder_ffn_dim: 3072
activation_function: "gelu"

pre_rvq_adapter_kwargs: # 50hz
input_dim: 1536
output_dim: 768
d_model: 768
max_source_positions: 1500
encoder_layers: 4
encoder_attention_heads: 12
encoder_ffn_dim: 3072

downsample_kwargs: # 50hz -> 12.5hz
d_model: 768
avg_pooler: 4

quantizer_kwargs: # 12.5hz
input_dim: 3072
rvq_dim: 512
output_dim: 3072
num_quantizers: 8
codebook_size: 1024
codebook_dim: 512
quantizer_dropout: 0.0
commitment: 1

post_rvq_adapter_kwargs: # 12.5hz
input_dim: 3072
output_dim: 3072
d_model: 768
max_source_positions: 375
encoder_layers: 4
encoder_attention_heads: 12
encoder_ffn_dim: 3072

upsample_kwargs: # 12.5hz -> 50hz
d_model: 768
stride: 4

acoustic_decoder_kwargs: # 50hz -> 100hz
num_mel_bins: 80
sampling_rate: 16000
hop_length: 160
stride_size: 2
kernel_size: 3
d_model: 768
scale_embedding: false
max_audio_seconds: 30
decoder_layers: 12
decoder_attention_heads: 12
decoder_ffn_dim: 3072
activation_function: "gelu"

vocos_kwargs: # 100hz -> 32khz
input_channels: 80
dim: 512
intermediate_dim: 4096
num_layers: 30
n_fft: 1280
hop_size: 320
padding: "same"
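
As a rough sanity check on the rates in this config, a minimal sketch (assuming PyYAML is installed and the file lives at the path shown in this PR) can load it and derive the codec frame rate:

```python
# Minimal sketch: read MOSS_TTSD_tokenizer.yaml and check the implied token rates.
import yaml

with open("XY_Tokenizer/config/MOSS_TTSD_tokenizer.yaml") as f:
    gp = yaml.safe_load(f)["generator_params"]

# 16 kHz input audio is encoded at 12.5 tokens per second (16000 / 1280),
# and the decoder turns 12.5 Hz tokens back into 32 kHz audio (32000 / 2560).
token_rate_in = gp["input_sample_rate"] / gp["encoder_downsample_rate"]   # 12.5
token_rate_out = gp["output_sample_rate"] / gp["decoder_upsample_rate"]   # 12.5
assert token_rate_in == token_rate_out == 12.5

# With 8 RVQ codebooks at 12.5 Hz, the codec emits 100 codes per second of audio.
codes_per_second = gp["quantizer_kwargs"]["num_quantizers"] * token_rate_in
print(token_rate_in, codes_per_second)  # 12.5 100.0
```

The class that consumes these `generator_params` appears in `XY_Tokenizer/xy_tokenizer/model.py`, shown in the next file.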

3 changes: 2 additions & 1 deletion XY_Tokenizer/xy_tokenizer/model.py
@@ -49,7 +49,8 @@ def __init__(self, generator_params):
self.enhanced_vocos = Vocos(**generator_params['vocos_kwargs'])

## Feature extractor
self.feature_extractor = MelFeatureExtractor(**generator_params['feature_extractor_kwargs'])
fe_kwargs = generator_params.get('feature_extractor_kwargs', {})
self.feature_extractor = MelFeatureExtractor(**fe_kwargs)
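        # The empty-dict fallback keeps configs without `feature_extractor_kwargs` loadable
        # (assumption: MelFeatureExtractor provides sensible defaults for its parameters).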

@torch.inference_mode()
def inference_tokenize(self, x, input_lengths):