26 changes: 10 additions & 16 deletions README.md
@@ -28,20 +28,21 @@ MOSS-TTSD supports voice cloning and long single-session speech generation, maki

## Highlights

- **Highly Expressive Dialogue Speech**: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, millions of hours of TTS data, and 400k hours of synthetic and real conversational speech, MOSS-TTSD generates highly expressive, human-like dialogue speech with natural conversational prosody.
- **Highly Expressive Dialogue Speech**: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, and millions of hours of TTS data and conversational speech, MOSS-TTSD generates highly expressive, human-like dialogue speech with natural conversational prosody.
- **Two-Speaker Voice Cloning**: MOSS-TTSD supports zero-shot two-speaker voice cloning and can generate conversational speech with accurate speaker switching based on dialogue scripts. Only 10 to 20 seconds of reference audio is needed.
- **Chinese-English Bilingual Support**: MOSS-TTSD enables highly expressive speech generation in both Chinese and English.
- **Long-Form Speech Generation**: Thanks to a low-bitrate codec and training framework optimizations, MOSS-TTSD has been trained for long speech generation (maximum training length is 960s).
- **Long-Form Speech Generation**: Thanks to a low-bitrate codec and training framework optimizations, MOSS-TTSD has been trained for long speech generation (maximum training length is 1700s).
- **Fully Open Source & Commercial-Ready**: MOSS-TTSD and its future updates will be fully open-source and support free commercial use.

## News 🚀

- **[2025-11-01]** MOSS-TTSD v0.7 is released! v0.7 significantly improves audio quality, voice cloning capability, and stability, adds support for 32 kHz high‑quality output, greatly extends single‑pass generation length (960s→1700s), and more reliably generates speech events following speaker tags. We recommend using the v0.7 model by default.
- **[2025-09-09]** We added support for the SGLang inference engine, accelerating model inference by up to **16x**.
- **[2025-08-25]** We released the 32 kHz version of XY-Tokenizer.
- **[2025-08-12]** We added support for streaming inference in MOSS-TTSD v0.5.
- **[2025-07-29]** We provided the SiliconFlow API interface and usage examples for MOSS-TTSD v0.5.
- **[2025-07-16]** We open-sourced the fine-tuning code for MOSS-TTSD v0.5, supporting full-parameter fine-tuning, LoRA fine-tuning, and multi-node training.
- **[2025-07-04]** MOSS-TTSD v0.5 is released! v0.5 enhances timbre-switching accuracy, voice cloning capability, and model stability. We recommend using the v0.5 model by default.
- **[2025-07-04]** MOSS-TTSD v0.5 is released! v0.5 enhances timbre-switching accuracy, voice cloning capability, and model stability.
- **[2025-06-20]** MOSS-TTSD v0 is released! Moreover, we provide a podcast generation pipeline named Podever, which can automatically convert PDFs, URLs, or long text files into high-quality podcasts.

## Installation
@@ -62,7 +63,7 @@ You also need to download the XY Tokenizer model weights. You can find the weigh

```bash
mkdir -p XY_Tokenizer/weights
huggingface-cli download fnlp/XY_Tokenizer_TTSD_V0_32k xy_tokenizer.ckpt --local-dir ./XY_Tokenizer/weights/
huggingface-cli download fnlp/MOSS_TTSD_tokenizer MOSS_TTSD_tokenizer --local-dir ./XY_Tokenizer/weights/
```
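If you prefer to fetch the weights from Python rather than the CLI, a minimal sketch using `huggingface_hub` (the library behind `huggingface-cli`) is shown below; the `allow_patterns` filter mirrors the CLI argument above and is an assumption about the repo layout:

```python
# Minimal sketch: download the tokenizer weights with huggingface_hub.
# The pattern filter assumes the repo exposes a MOSS_TTSD_tokenizer file or folder,
# as suggested by the CLI command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fnlp/MOSS_TTSD_tokenizer",
    local_dir="./XY_Tokenizer/weights/",
    allow_patterns=["MOSS_TTSD_tokenizer*"],
)
```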

## Usage
@@ -89,16 +90,9 @@ Parameters:

#### JSONL Input Format

The input JSONL file should contain one JSON object per line. MOSS-TTSD supports multiple input formats:
The input JSONL file should contain one JSON object per line. MOSS-TTSD supports two input formats:

**Format 1: Text-only input (No voice cloning, using the model's random timbre)**
```json
{
"text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content[S1]..."
}
```

**Format 2: Separate speaker audio references**
**Format 1: Separate speaker audio references**
```json
{
"base_path": "/path/to/audio/files",
@@ -110,7 +104,7 @@ The input JSONL file should contain one JSON object per line. MOSS-TTSD supports
}
```

**Format 3: Shared audio reference**
**Format 2: Shared audio reference**
```json
{
"base_path": "/path/to/audio/files",
@@ -126,11 +120,11 @@ The input JSONL file should contain one JSON object per line. MOSS-TTSD supports
- `text`: Dialogue script with speaker tags `[S1]` and `[S2]` indicating speaker turns (required)
- `base_path`: Base directory path for relative file paths (optional)

**For voice cloning (Format 2):**
**For voice cloning (Format 1):**
- `prompt_audio_speaker1/2`: Path to reference audio files for voice cloning (relative to `base_path`)
- `prompt_text_speaker1/2`: Reference text corresponding to the audio prompts for better voice matching

**For shared reference (Format 3):**
**For shared reference (Format 2):**
- `prompt_audio`: Path to shared reference audio file containing both speakers' voices (relative to `base_path`)
- `prompt_text`: Reference text corresponding to the audio, also using `[S1]` and `[S2]` tags to distinguish speakers
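As a quick illustration of the two formats, here is a minimal sketch that writes a JSONL input file; the field names come from the descriptions above, while the file names and texts are placeholders:

```python
# Minimal sketch: build a JSONL input file for inference.py.
# Field names follow the formats described above; paths and texts are placeholders.
import json

examples = [
    {  # Format 1: separate reference audio per speaker
        "base_path": "/path/to/audio/files",
        "text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content",
        "prompt_audio_speaker1": "speaker1_ref.wav",
        "prompt_text_speaker1": "Reference text spoken by speaker 1",
        "prompt_audio_speaker2": "speaker2_ref.wav",
        "prompt_text_speaker2": "Reference text spoken by speaker 2",
    },
    {  # Format 2: one shared reference audio containing both speakers
        "base_path": "/path/to/audio/files",
        "text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content",
        "prompt_audio": "shared_ref.wav",
        "prompt_text": "[S1]Reference line from speaker 1[S2]Reference line from speaker 2",
    },
]

with open("my_dialogues.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

The resulting file can then be passed to inference, e.g. `python inference.py --jsonl my_dialogues.jsonl --output_dir outputs`.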

25 changes: 9 additions & 16 deletions README_zh.md
@@ -26,14 +26,15 @@ MOSS-TTSD (text to spoken dialogue) is an open-source Chinese-English bilingual spoken dialogue

## Highlights

- **Highly Expressive Dialogue Speech**: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, millions of hours of TTS data, and roughly 400k hours of real/synthetic conversational speech, MOSS-TTSD generates highly expressive, highly natural, human-like dialogue speech with natural conversational prosody.
- **Highly Expressive Dialogue Speech**: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, and millions of hours of TTS data and conversational speech, MOSS-TTSD generates highly expressive, highly natural, human-like dialogue speech with natural conversational prosody.
- **Two-Speaker Zero-Shot Voice Cloning**: MOSS-TTSD supports zero-shot two-speaker cloning and switches roles/voices accurately according to the script. Only a 10 to 20 second reference audio clip is needed.
- **Chinese-English Bilingual**: MOSS-TTSD supports highly expressive speech generation in both Chinese and English.
- **Long-Form Audio Generation**: Thanks to a low-bitrate codec and training framework optimizations, MOSS-TTSD has been extensively trained for long-audio generation (maximum training length of 960s) and can generate very long audio in a single pass.
- **Long-Form Audio Generation**: Thanks to a low-bitrate codec and training framework optimizations, MOSS-TTSD has been extensively trained for long-audio generation (maximum training length of 1700s) and can generate very long audio in a single pass.
- **Open Source & Commercial-Ready**: Current and future versions will remain open source and support free commercial use.

## News 🚀

- **[2025-11-01]** We released MOSS-TTSD v0.7: it significantly improves audio quality, voice cloning capability, and stability, supports 32 kHz high-quality output, greatly extends single-pass generation length (960s->1700s), and generates speech events according to speaker tags more reliably.
- **[2025-09-09]** We added support for the SGLang inference engine, accelerating model inference by up to **16x**.
- **[2025-08-25]** We released the 32 kHz XY-Tokenizer.
- **[2025-08-12]** We added streaming inference support for MOSS-TTSD v0.5.
@@ -60,7 +61,7 @@ pip install flash-attn

```bash
mkdir -p XY_Tokenizer/weights
huggingface-cli download fnlp/XY_Tokenizer_TTSD_V0_32k xy_tokenizer.ckpt --local-dir ./XY_Tokenizer/weights/
huggingface-cli download fnlp/MOSS_TTSD_tokenizer MOSS_TTSD_tokenizer --local-dir ./XY_Tokenizer/weights/
```

## Usage
@@ -87,17 +88,9 @@ python inference.py --jsonl examples/examples.jsonl --output_dir outputs --seed

#### JSONL Input Format

MOSS-TTSD supports multiple input formats:
MOSS-TTSD supports two input formats:

**Format 1: Text only (no voice cloning; uses the model's random timbre)**

```json
{
"text": "[S1]说话人1的内容[S2]说话人2的内容[S1]..."
}
```

**Format 2: Separate reference audio for each speaker**
**Format 1: Separate reference audio for each speaker**

```json
{
@@ -110,7 +103,7 @@ MOSS-TTSD supports multiple input formats:
}
```

**Format 3: Shared reference audio (one reference audio containing both speakers' content)**
**Format 2: Shared reference audio (one reference audio containing both speakers' content)**

```json
{
@@ -128,12 +121,12 @@ MOSS-TTSD supports multiple input formats:
- `text`: Dialogue script with `[S1]` and `[S2]` speaker tags (required)
- `base_path`: Base directory for relative paths (optional)

**For voice cloning (Format 2):**
**For voice cloning (Format 1):**

- `prompt_audio_speaker1/2`: Reference audio for the two speakers (may be relative to `base_path`)
- `prompt_text_speaker1/2`: Text corresponding to the reference audio, which helps match the timbre better

**For shared reference (Format 3):**
**For shared reference (Format 2):**

- `prompt_audio`: Shared reference audio containing both speakers (may be relative to `base_path`)
- `prompt_text`: Corresponding reference text, also using `[S1]` and `[S2]` to distinguish speakers
115 changes: 115 additions & 0 deletions XY_Tokenizer/config/MOSS_TTSD_tokenizer.yaml
@@ -0,0 +1,115 @@
generator_params:
input_sample_rate: 16000
output_sample_rate: 32000
encoder_downsample_rate: 1280
decoder_upsample_rate: 2560

feature_extractor_kwargs:
chunk_length: 30
feature_size: 80
hop_length: 160
n_fft: 400
n_samples: 480000
nb_max_frames: 3000
padding_side: right
padding_value: 0.0
return_attention_mask: false
sampling_rate: 16000

# Codec / model architecture (inference required)
semantic_encoder_kwargs: # 100hz -> 50hz
num_mel_bins: 80
sampling_rate: 16000
hop_length: 160
stride_size: 2
kernel_size: 3
d_model: 768
scale_embedding: false
max_audio_seconds: 30
encoder_layers: 12
encoder_attention_heads: 12
encoder_ffn_dim: 3072
activation_function: "gelu"

semantic_encoder_adapter_kwargs: # 50hz
input_dim: 768
output_dim: 768
d_model: 768
max_source_positions: 1500
encoder_layers: 4
encoder_attention_heads: 12
encoder_ffn_dim: 3072

acoustic_encoder_kwargs: # 100hz -> 50hz
num_mel_bins: 80
sampling_rate: 16000
hop_length: 160
stride_size: 2
kernel_size: 3
d_model: 768
scale_embedding: false
max_audio_seconds: 30
encoder_layers: 12
encoder_attention_heads: 12
encoder_ffn_dim: 3072
activation_function: "gelu"

pre_rvq_adapter_kwargs: # 50hz
input_dim: 1536
output_dim: 768
d_model: 768
max_source_positions: 1500
encoder_layers: 4
encoder_attention_heads: 12
encoder_ffn_dim: 3072

downsample_kwargs: # 50hz -> 12.5hz
d_model: 768
avg_pooler: 4

quantizer_kwargs: # 12.5hz
input_dim: 3072
rvq_dim: 512
output_dim: 3072
num_quantizers: 8
codebook_size: 1024
codebook_dim: 512
quantizer_dropout: 0.0
commitment: 1

post_rvq_adapter_kwargs: # 12.5hz
input_dim: 3072
output_dim: 3072
d_model: 768
max_source_positions: 375
encoder_layers: 4
encoder_attention_heads: 12
encoder_ffn_dim: 3072

upsample_kwargs: # 12.5hz -> 50hz
d_model: 768
stride: 4

acoustic_decoder_kwargs: # 50hz -> 100hz
num_mel_bins: 80
sampling_rate: 16000
hop_length: 160
stride_size: 2
kernel_size: 3
d_model: 768
scale_embedding: false
max_audio_seconds: 30
decoder_layers: 12
decoder_attention_heads: 12
decoder_ffn_dim: 3072
activation_function: "gelu"

vocos_kwargs: # 100hz -> 32khz
input_channels: 80
dim: 512
intermediate_dim: 4096
num_layers: 30
n_fft: 1280
hop_size: 320
padding: "same"
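
As a rough sanity check on the rates in this config, a minimal sketch (assuming PyYAML is installed and the file lives at the path shown in this PR) can load it and derive the codec frame rate:

```python
# Minimal sketch: read MOSS_TTSD_tokenizer.yaml and check the implied token rates.
import yaml

with open("XY_Tokenizer/config/MOSS_TTSD_tokenizer.yaml") as f:
    gp = yaml.safe_load(f)["generator_params"]

# 16 kHz input audio is encoded at 12.5 tokens per second (16000 / 1280),
# and the decoder turns 12.5 Hz tokens back into 32 kHz audio (32000 / 2560).
token_rate_in = gp["input_sample_rate"] / gp["encoder_downsample_rate"]   # 12.5
token_rate_out = gp["output_sample_rate"] / gp["decoder_upsample_rate"]   # 12.5
assert token_rate_in == token_rate_out == 12.5

# With 8 RVQ codebooks at 12.5 Hz, the codec emits 100 codes per second of audio.
codes_per_second = gp["quantizer_kwargs"]["num_quantizers"] * token_rate_in
print(token_rate_in, codes_per_second)  # 12.5 100.0
```

The class that consumes these `generator_params` appears in `XY_Tokenizer/xy_tokenizer/model.py`, shown in the next file.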

3 changes: 2 additions & 1 deletion XY_Tokenizer/xy_tokenizer/model.py
@@ -49,7 +49,8 @@ def __init__(self, generator_params):
self.enhanced_vocos = Vocos(**generator_params['vocos_kwargs'])

## Feature extractor
self.feature_extractor = MelFeatureExtractor(**generator_params['feature_extractor_kwargs'])
fe_kwargs = generator_params.get('feature_extractor_kwargs', {})
self.feature_extractor = MelFeatureExtractor(**fe_kwargs)
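        # The empty-dict fallback keeps configs without `feature_extractor_kwargs` loadable
        # (assumption: MelFeatureExtractor provides sensible defaults for its parameters).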

@torch.inference_mode()
def inference_tokenize(self, x, input_lengths):