tencent-ailab · BlackTechX011 · Aug 15, 2025
diff --git a/README.md b/README.md
@@ -1,83 +1,134 @@
-# [SongBloom]: *Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement*
 
-We propose **SongBloom**, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models.
-Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process.
-Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms.
+# SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
 
-![img](docs/architecture.png)
+<div align="center">
 
-Demo page:  [https://cypress-yang.github.io/SongBloom_demo](https://cypress-yang.github.io/SongBloom_demo)
+[![Paper](https://img.shields.io/badge/arXiv-2506.07634-b31b1b.svg)](https://arxiv.org/abs/2506.07634)
+[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/CypressYang/SongBloom)
+[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
 
-ArXiv: [https://arxiv.org/abs/2506.07634](https://arxiv.org/abs/2506.07634)
+</div>
 
-## Prepare Environments
+We propose **SongBloom**, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models: it gradually extends a musical sketch from short to long and refines details from coarse to fine-grained.
+
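+This interleaved paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results show that SongBloom outperforms existing methods on both subjective and objective metrics and performs comparably to state-of-the-art commercial music generation platforms.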
+
+### ▶️ [**Check out the Demos**](https://cypress-yang.github.io/SongBloom_demo/)
+
+![SongBloom Architecture](docs/architecture.png)
+
+## 🚀 Getting Started
+
+Follow these three simple steps to generate your first song with SongBloom.
+
+### Step 1: Set Up Your Environment
+
+First, clone the repository and set up the Conda environment.
 
 ```bash
+# Clone the repository
+git clone https://github.com/Cypress-Yang/SongBloom.git
+cd SongBloom
+
+# Create and activate the Conda environment
 conda create -n SongBloom python==3.8.12
 conda activate SongBloom
 
-# yum install libsndfile
-# pip install torch==2.2.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118 # For different CUDA version
+# For Linux, you may need to install libsndfile first
+# Debian/Ubuntu: sudo apt-get install libsndfile1 | RHEL/CentOS: sudo yum install libsndfile
+
+# Install all required Python packages
 pip install -r requirements.txt
 ```
+> **Note:** `requirements.txt` pins `torch==2.2.0` and `torchaudio==2.2.0` but does not select a CUDA build. If the default wheels do not match your CUDA version, install the matching PyTorch and Torchaudio binaries from the [official site](https://pytorch.org/get-started/previous-versions/) first, for example with `--index-url https://download.pytorch.org/whl/cu118` for CUDA 11.8.
 
-## Data Preparation
 
-A  .jsonl file, where each line is a json object:
+### Step 2: Prepare Your Songs (`.songbloom` file)
 
-```json
-{
-	"idx": "The index of each sample", 
-	"lyrics": "The lyrics to be generated",
-	"prompt_wav": "The path of the style prompt audio",
-}
-```
+Instead of complex command-line arguments, you define your songs in a simple `.songbloom` file using the human-readable TOML format. Create a file like `my_songs.songbloom`:
 
-One example can be refered to as: [example/test.jsonl](example/test.jsonl)
+```toml
+# File: my_songs.songbloom
+# Define one or more songs to generate.
 
-The prompt wav should be a 10-second, 48kHz audio clip.
+[sunset_lullaby]
+lyrics = "the sun is setting low and the stars are starting to glow"
+prompt_wav = "prompts/my_awesome_prompt.wav" # 10s, 48kHz audio clip
+n_samples = 2 # Optional: Number of variations to generate (default is 1)
+output_name = "sunset_song_final" # Optional: Filename for the output
 
-The details about lyric format can be found in [docs/lyric_format.md](docs/lyric_format.md).
+[city_rhythm]
+lyrics = "walking through the city streets with a rhythm in my feet"
+prompt_wav = "inputs/city_beat.mp3"
+```
+*   The prompt audio should ideally be a **10-second, 48kHz** audio clip.
+*   For details on lyric formatting, see [`docs/lyric_format.md`](docs/lyric_format.md).
 
-## Inference
+### Step 3: Generate Music!
 
-```bash
-source set_env.sh
+Now, run the main script, pointing it to your configuration file. The model and necessary assets will be downloaded automatically on the first run.
 
-python3 infer.py --input-jsonl example/test.jsonl
+```bash
+# Basic usage
+python3 songbloom.py my_songs.songbloom
 
-# For GPUs with low VRAM like RTX4090, you should set the dtype as bfloat16
-python3 infer.py --input-jsonl example/test.jsonl --dtype bfloat16
+# Specify a different output directory
+python3 songbloom.py my_songs.songbloom --output-dir "path/to/my/music"
 
-# SongBloom also supports flash-attn (optional). To enable it, please install flash-attn (v2.6.3 is used during training) manually and set os.environ['DISABLE_FLASH_ATTN'] = "0" in infer.py:8
+# For GPUs with lower VRAM (e.g., RTX 4090), use bfloat16 to reduce memory usage
+python3 songbloom.py my_songs.songbloom --dtype bfloat16
 ```
+> **Flash Attention**: To enable flash-attn (optional) for a potential speed-up, install the library manually (v2.6.3 was used during training) and change `DISABLE_FLASH_ATTN` from `"1"` to `"0"` at the top of `songbloom.py`.
+
 
-## Models
+## 📦 Models
 
-| Name                 | Size | Max Length | Prompt type | 🤗                                            |
+All models are available on the [Hugging Face Hub](https://huggingface.co/CypressYang/SongBloom).
+
+| Name                 | Size | Max Length | Prompt type | Link                                         |
 | -------------------- | ---- | ---------- | ----------- | -------------------------------------------- |
-| songbloom_full_150s  | 2B   | 2m30s      | 10s wav     | [link](https://huggingface.co/CypressYang/SongBloom) |
-| songbloom_mulan_150s | 2B   | 2m30s      | 10s wav / text description |           coming soon                           |
-| ... |      |            |             |                                              |
+| `songbloom_full_150s`  | 2B   | 2m 30s     | 10s wav     | [🤗 HF Repo](https://huggingface.co/CypressYang/SongBloom) |
+| `songbloom_mulan_150s` | 2B   | 2m 30s     | 10s wav / text | *Coming Soon*                                |
+
+## 📝 TODO List
+
+- [ ] Support Text Description Prompts
+- [ ] Release full-length model version
+
+
+## 📈 Star History
 
+<div align="center">
 
+[![Star History Chart](https://api.star-history.com/svg?repos=Cypress-Yang/SongBloom&type=Date)](https://star-history.com/#Cypress-Yang/SongBloom&Date)
 
-## TODO List
+</div>
 
-- [ ] Support Text Description
-- [ ] Full version
+## ✨ Contributors
+
+A huge thank you to all the amazing people who have contributed to this project!
+
+<div align="center">
+
+<a href="https://github.com/Cypress-Yang/SongBloom/graphs/contributors">
+  <img src="https://contrib.rocks/image?repo=Cypress-Yang/SongBloom" />
+</a>
+
+</div>
 
 ## Citation
 
-```
+If you find SongBloom useful in your research, please cite our paper:
+
+```bibtex
 @article{yang2025songbloom,
-title={SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement},
-author={Yang, Chenyu and Wang, Shuai and Chen, Hangting and Tan, Wei and Yu, Jianwei and Li, Haizhou},
-journal={arXiv preprint arXiv:2506.07634},
-year={2025}
+  title={SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement},
+  author={Yang, Chenyu and Wang, Shuai and Chen, Hangting and Tan, Wei and Yu, Jianwei and Li, Haizhou},
+  journal={arXiv preprint arXiv:2506.07634},
+  year={2025}
 }
 ```
 
 ## License
 
-SongBloom (codes and weights) is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
+The code and model weights for SongBloom are released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
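
Review note: Step 2 of the new README describes a `.songbloom` TOML file with one table per song, required `lyrics` and `prompt_wav` keys, and optional `n_samples` (default 1) and `output_name`. The diff does not show how `songbloom.py` reads this file, so the following is only a minimal sketch of one way such a file could be parsed. The function name `parse_songs`, the returned dictionary layout, and the fallback of `output_name` to the table name are assumptions made for illustration, not the project's actual behavior.

```python
from typing import Any, Dict, List

try:
    import tomllib as toml_parser  # standard library on Python 3.11+
except ModuleNotFoundError:
    import toml as toml_parser  # the `toml` package added to requirements.txt


def parse_songs(text: str) -> List[Dict[str, Any]]:
    """Turn the contents of a .songbloom file into a list of generation jobs.

    Hypothetical helper: the real songbloom.py may structure this differently.
    """
    jobs = []
    for name, table in toml_parser.loads(text).items():
        if not isinstance(table, dict):
            raise ValueError(f"top-level key {name!r} is not a song table")
        missing = [key for key in ("lyrics", "prompt_wav") if key not in table]
        if missing:
            raise ValueError(f"song {name!r} is missing required keys: {missing}")
        jobs.append({
            "name": name,
            "lyrics": table["lyrics"],
            "prompt_wav": table["prompt_wav"],
            # README: n_samples is optional and defaults to 1.
            "n_samples": int(table.get("n_samples", 1)),
            # Assumption: fall back to the table name when output_name is absent.
            "output_name": table.get("output_name", name),
        })
    return jobs
```

Reading a file would then be `parse_songs(Path("my_songs.songbloom").read_text(encoding="utf-8"))`. Parsing from a string keeps the sketch identical under both `tomllib` and `toml`, whose file-based `load` functions differ in the kind of file object they accept.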
diff --git a/infer.py b/infer.py
diff --git a/infer.sh b/infer.sh
diff --git a/requirements.txt b/requirements.txt
@@ -1,17 +1,29 @@
+# Core dependencies for PyTorch and audio processing
+torch==2.2.0
+torchaudio==2.2.0
+
+# Model and experiment management
+lightning==2.2.1
 huggingface-hub==0.24.6
+transformers==4.44.1
+omegaconf==2.2.0
+
+# Configuration file parsing for the user-friendly interface
+toml
+
+# NLP and text processing libraries
+# Chinese language support
 jieba-fast==0.53
 pypinyin==0.51.0
 cn2an==0.5.22
+# English language support
 wordsegment==1.3.1
 g2p-en==2.1.0
-lightning==2.2.1
 nltk==3.8.1
-omegaconf==2.2.0
-torch==2.2.0
-torchaudio==2.2.0
-transformers==4.44.1
-einops==0.8.0
 spacy==3.7.4
 num2words==0.5.13
+
+# Tensor manipulation and audio codecs
+einops==0.8.0
 descript-audio-codec==1.0.0
 vector_quantize_pytorch==1.14.8
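
Review note: because the new `requirements.txt` pins `torch` and `torchaudio` directly while the README asks users on other CUDA versions to install their own builds, it can be useful to check which installed packages deviate from the pins. The following is a minimal sketch using only the standard library; the function names `parse_pins` and `find_mismatches` are hypothetical and not part of this PR. Local build suffixes such as `+cu118` are ignored when comparing versions, and unpinned entries such as `toml` are skipped.

```python
from importlib import metadata
from typing import Dict, List, Optional, Tuple


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Collect `name==version` pins, ignoring comments and unpinned entries."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue  # blank line, comment, or unpinned package
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> List[Tuple[str, str, Optional[str]]]:
    """Return (name, pinned, installed) for packages that differ from their pin.

    `installed` is None when the package is not installed at all.
    """
    mismatches = []
    for name, wanted in pins.items():
        try:
            # Strip a local build tag, e.g. "2.2.0+cu118" -> "2.2.0".
            installed = metadata.version(name).split("+", 1)[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches.append((name, wanted, installed))
    return mismatches
```

A typical use would be `find_mismatches(parse_pins(Path("requirements.txt").read_text()))` inside the activated Conda environment; an empty list means every pinned package is installed at the pinned version.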