137 changes: 94 additions & 43 deletions README.md
@@ -1,83 +1,134 @@
# [SongBloom]: *Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement*

We propose **SongBloom**, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models.
Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process.
Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms.
# SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

![img](docs/architecture.png)
<div align="center">

Demo page: [https://cypress-yang.github.io/SongBloom_demo](https://cypress-yang.github.io/SongBloom_demo)
[![Paper](https://img.shields.io/badge/arXiv-2506.07634-b31b1b.svg)](https://arxiv.org/abs/2506.07634)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/CypressYang/SongBloom)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)

ArXiv: [https://arxiv.org/abs/2506.07634](https://arxiv.org/abs/2506.07634)
</div>

## Prepare Environments
We propose **SongBloom**, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. By combining a high-fidelity diffusion model with a scalable language model, SongBloom gradually extends a musical sketch from short to long and refines details from coarse to fine-grained.

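This interleaved paradigm effectively integrates prior semantic and acoustic context to guide the generation process. In the paper's experiments, SongBloom outperforms existing methods on both subjective and objective metrics and performs comparably to state-of-the-art commercial music generation platforms.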

### ▶️ [**Check out the Demos**](https://cypress-yang.github.io/SongBloom_demo/)

![SongBloom Architecture](docs/architecture.png)

## 🚀 Getting Started

Follow these three simple steps to generate your first song with SongBloom.

### Step 1: Set Up Your Environment

First, clone the repository and set up the Conda environment.

```bash
# Clone the repository
git clone https://github.com/Cypress-Yang/SongBloom.git
cd SongBloom

# Create and activate the Conda environment
conda create -n SongBloom python==3.8.12
conda activate SongBloom

# yum install libsndfile
# pip install torch==2.2.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118 # For different CUDA version
# For Linux, you may need to install libsndfile first
# sudo apt-get install libsndfile1 or sudo yum install libsndfile

# Install all required Python packages
pip install -r requirements.txt
```
> **Note:** The `requirements.txt` file pins `torch==2.2.0` and `torchaudio==2.2.0`. If the default build does not match your CUDA version, please install the appropriate PyTorch and Torchaudio binaries from the [official site](https://pytorch.org/get-started/previous-versions/).

## Data Preparation

A .jsonl file, where each line is a json object:
### Step 2: Prepare Your Songs (`.songbloom` file)

```json
{
"idx": "The index of each sample",
"lyrics": "The lyrics to be generated",
"prompt_wav": "The path of the style prompt audio"
}
```
Instead of complex command-line arguments, you define your songs in a simple `.songbloom` file using the human-readable TOML format. Create a file like `my_songs.songbloom`:

For an example, see [example/test.jsonl](example/test.jsonl).
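A minimal sketch of how such a `.jsonl` file could be read and checked, assuming only the three keys listed above. `parse_jsonl` and `REQUIRED_KEYS` are hypothetical names for illustration, not part of the repository:

```python
import json
from typing import Dict, Iterable, List

# Keys taken from the JSON object described above.
REQUIRED_KEYS = ("idx", "lyrics", "prompt_wav")


def parse_jsonl(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse one JSON object per line and check that the expected keys exist."""
    samples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        samples.append(record)
    return samples


# In-memory demo; with a real file, pass the open file object instead.
demo = ['{"idx": "demo_0", "lyrics": "la la la", "prompt_wav": "prompts/demo.wav"}']
for sample in parse_jsonl(demo):
    print(sample["idx"], sample["prompt_wav"])
```

Validating each line up front reports a malformed entry with its line number before any generation work starts.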
```toml
# File: my_songs.songbloom
# Define one or more songs to generate.

The prompt wav should be a 10-second, 48kHz audio clip.
[sunset_lullaby]
lyrics = "the sun is setting low and the stars are starting to glow"
prompt_wav = "prompts/my_awesome_prompt.wav" # 10s, 48kHz audio clip
n_samples = 2 # Optional: Number of variations to generate (default is 1)
output_name = "sunset_song_final" # Optional: Filename for the output

The details about lyric format can be found in [docs/lyric_format.md](docs/lyric_format.md).
[city_rhythm]
lyrics = "walking through the city streets with a rhythm in my feet"
prompt_wav = "inputs/city_beat.mp3"
```
* The prompt audio should ideally be a **10-second, 48kHz** audio clip.
* For details on lyric formatting, see [`docs/lyric_format.md`](docs/lyric_format.md).
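As a rough illustration of how a `.songbloom` file maps to generation jobs, the sketch below parses the TOML text into a list of dictionaries. It is not the actual `songbloom.py` logic. `parse_songbloom` is a hypothetical helper, and falling back to the table name for `output_name` is an assumption. Only the default `n_samples = 1` comes from the example above.

```python
try:
    import tomllib as toml_parser  # standard library on Python 3.11+
except ModuleNotFoundError:
    import toml as toml_parser  # third-party `toml` package on older Pythons


def parse_songbloom(text):
    """Turn the text of a .songbloom file into a list of job dicts."""
    jobs = []
    for name, entry in toml_parser.loads(text).items():
        if not isinstance(entry, dict):
            raise ValueError(f"top-level key {name!r} is not a [table]")
        for key in ("lyrics", "prompt_wav"):
            if key not in entry:
                raise ValueError(f"[{name}] is missing required key {key!r}")
        jobs.append({
            "name": name,
            "lyrics": entry["lyrics"],
            "prompt_wav": entry["prompt_wav"],
            # Documented default: one variation per song.
            "n_samples": int(entry.get("n_samples", 1)),
            # Assumption: fall back to the table name when output_name is absent.
            "output_name": entry.get("output_name", name),
        })
    return jobs


example = '''
[city_rhythm]
lyrics = "walking through the city streets with a rhythm in my feet"
prompt_wav = "inputs/city_beat.mp3"
'''
for job in parse_songbloom(example):
    print(job["name"], job["n_samples"], job["output_name"])
```

Each `[table]` in the file becomes one job, so adding a song only requires adding another table.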

## Inference
### Step 3: Generate Music!

```bash
source set_env.sh
Now, run the main script, pointing it to your configuration file. The model and necessary assets will be downloaded automatically on the first run.

python3 infer.py --input-jsonl example/test.jsonl
```bash
# Basic usage
python3 songbloom.py my_songs.songbloom

# For GPUs with limited VRAM (e.g., RTX 4090), set the dtype to bfloat16
python3 infer.py --input-jsonl example/test.jsonl --dtype bfloat16
# Specify a different output directory
python3 songbloom.py my_songs.songbloom --output-dir "path/to/my/music"

# SongBloom also supports flash-attn (optional). To enable it, please install flash-attn (v2.6.3 is used during training) manually and set os.environ['DISABLE_FLASH_ATTN'] = "0" in infer.py:8
# For GPUs with lower VRAM (e.g., RTX 4090), use bfloat16 to reduce memory usage
python3 songbloom.py my_songs.songbloom --dtype bfloat16
```
> **Flash Attention**: To enable flash-attn for a potential speed-up, install the library manually and change `DISABLE_FLASH_ATTN` from `"1"` to `"0"` at the top of `songbloom.py`.
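The toggle described above can be sketched as a small helper that sets the `DISABLE_FLASH_ATTN` environment variable. It assumes the variable is read when the model code is imported, so it must be set before that import. `configure_flash_attn` is a hypothetical name, and the fallback to `"1"` when the `flash_attn` package is missing is an added safeguard, not documented behavior.

```python
import importlib.util
import os


def configure_flash_attn(enable: bool) -> str:
    """Set DISABLE_FLASH_ATTN and return the value that was written.

    "0" enables flash-attn and "1" disables it, as in the note above.
    """
    # Check for the package without importing it.
    available = importlib.util.find_spec("flash_attn") is not None
    value = "0" if (enable and available) else "1"
    os.environ["DISABLE_FLASH_ATTN"] = value
    return value


# Call this before importing any module that reads the variable.
print("DISABLE_FLASH_ATTN =", configure_flash_attn(enable=True))
```

Checking for the package first avoids requesting flash-attn on a machine where it was never installed.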


## Models
## 📦 Models

| Name | Size | Max Length | Prompt type | 🤗 |
All models are available on the [Hugging Face Hub](https://huggingface.co/CypressYang/SongBloom).

| Name | Size | Max Length | Prompt type | Link |
| -------------------- | ---- | ---------- | ----------- | -------------------------------------------- |
| songbloom_full_150s | 2B | 2m30s | 10s wav | [link](https://huggingface.co/CypressYang/SongBloom) |
| songbloom_mulan_150s | 2B | 2m30s | 10s wav / text description | coming soon |
| ... | | | | |
| `songbloom_full_150s` | 2B | 2m 30s | 10s wav | [🤗 HF Repo](https://huggingface.co/CypressYang/SongBloom) |
| `songbloom_mulan_150s` | 2B | 2m 30s | 10s wav / text | *Coming Soon* |

## 📝 TODO List

- [ ] Support Text Description Prompts
- [ ] Release full-length model version


## 📈 Star History

<div align="center">

[![Star History Chart](https://api.star-history.com/svg?repos=Cypress-Yang/SongBloom&type=Date)](https://star-history.com/#Cypress-Yang/SongBloom&Date)

## TODO List
</div>

- [ ] Support Text Description
- [ ] Full version
## ✨ Contributors

A huge thank you to all the amazing people who have contributed to this project!

<div align="center">

<a href="https://github.com/Cypress-Yang/SongBloom/graphs/contributors">
<img src="https://contrib.rocks/image?repo=Cypress-Yang/SongBloom" />
</a>

</div>

## Citation

```
If you find SongBloom useful in your research, please cite our paper:

```bibtex
@article{yang2025songbloom,
title={SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement},
author={Yang, Chenyu and Wang, Shuai and Chen, Hangting and Tan, Wei and Yu, Jianwei and Li, Haizhou},
journal={arXiv preprint arXiv:2506.07634},
year={2025}
title={SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement},
author={Yang, Chenyu and Wang, Shuai and Chen, Hangting and Tan, Wei and Yu, Jianwei and Li, Haizhou},
journal={arXiv preprint arXiv:2506.07634},
year={2025}
}
```

## License

SongBloom (codes and weights) is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
The code and model weights for SongBloom are released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
88 changes: 0 additions & 88 deletions infer.py

This file was deleted.

3 changes: 0 additions & 3 deletions infer.sh

This file was deleted.

24 changes: 18 additions & 6 deletions requirements.txt
@@ -1,17 +1,29 @@
# Core dependencies for PyTorch and audio processing
torch==2.2.0
torchaudio==2.2.0

# Model and experiment management
lightning==2.2.1
huggingface-hub==0.24.6
transformers==4.44.1
omegaconf==2.2.0

# Configuration file parsing for the user-friendly interface
toml

# NLP and text processing libraries
# Chinese language support
jieba-fast==0.53
pypinyin==0.51.0
cn2an==0.5.22
# English language support
wordsegment==1.3.1
g2p-en==2.1.0
lightning==2.2.1
nltk==3.8.1
omegaconf==2.2.0
torch==2.2.0
torchaudio==2.2.0
transformers==4.44.1
einops==0.8.0
spacy==3.7.4
num2words==0.5.13

# Tensor manipulation and audio codecs
einops==0.8.0
descript-audio-codec==1.0.0
vector_quantize_pytorch==1.14.8