A comprehensive CLI toolkit for preparing audio datasets for the ACE-STEP music generation model. This toolkit automates the process of audio captioning, lyric downloading, and dataset configuration.
- Automated Audio Captioning: Uses ace-step-captioner to generate descriptions for audio files
- Lyric Downloading: Integrates genius-api to automatically fetch lyrics for your tracks
- Metadata Integration: Seamlessly combines key and BPM data from Mixxx DJ software
- Smart Config Generation: Automatically detects captions, lyrics, and BPM/key metadata, generating a training-ready configuration file
- Windows Optimized: Built with Windows support and streamlined setup
- Windows OS
- uv - Fast Python package manager and installer
- Mixxx - For BPM and key detection (manual export)
- ace-step-captioner - Audio captioning
- genius-api - Lyrics API access
- Clone the repository:
  `git clone https://github.com/dopf-26/ace-step-dataset-toolkit`
  `cd ace-step-dataset-toolkit`
- Run the installation script:
  `install.bat`
  This will create a uv virtual environment and install all necessary Python packages.
- Launch the toolkit:
  `run_toolkit.bat`
- Settings: When running the individual steps, you will be asked to provide a Genius API token for lyric downloads and to configure settings such as CUDA device selection and ace-step-captioner quantization. To reset these settings, delete `cli_config.json` in the project folder.
Organize your audio files in the following directory structure for the toolkit to work optimally:
your_dataset/
├── metadata.csv (Exported from Mixxx)
└── audio/
├── track1.wav
├── track2.wav
└── ...
`metadata.csv` should be exported directly from Mixxx and contain BPM and key information for your tracks. Import your audio files into Mixxx, then right-click and choose Analyze to detect BPM and key. Add the tracks to a new playlist and export that playlist as `metadata.csv` into your base folder.
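To get a feel for what the toolkit reads from this file, here is a minimal sketch of parsing a Mixxx playlist export with the standard library. The column names (`Title`, `BPM`, `Key`) and the two-track sample are assumptions for illustration; check them against the header row of your own `metadata.csv`.

```python
import csv
import io

def load_mixxx_metadata(csv_text):
    """Map each track title to its BPM and key from a Mixxx playlist export.

    Column names ("Title", "BPM", "Key") are assumptions about the export
    format; adjust them to match your metadata.csv header row.
    """
    tracks = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        tracks[row["Title"]] = {"bpm": row["BPM"], "key": row["Key"]}
    return tracks

# Hypothetical two-track export for illustration:
sample = "Title,Artist,BPM,Key\ntrack1,Someone,128,Am\ntrack2,Other,95,F"
print(load_mixxx_metadata(sample)["track1"]["bpm"])  # 128
```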
After running the toolkit, the following files will be created:
your_dataset/
├── metadata.csv
├── triggerword.json (Training-ready configuration)
└── audio/
├── track1.wav
├── track1_caption.txt
├── track1_lyrics.txt
└── ...
The toolkit operates in three main steps:
- Export your Mixxx project and ensure `metadata.csv` is in your dataset's base folder
- Place all audio files in the `audio/` subfolder
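A quick pre-flight check for this layout can be done in a few lines of Python. This helper is not part of the toolkit itself, just a sketch of the expected structure described above:

```python
from pathlib import Path

def check_dataset(base):
    """Verify the expected layout: metadata.csv in the base folder and
    at least one file in the audio/ subfolder. Returns a list of problems
    (empty list means the layout looks correct)."""
    base = Path(base)
    problems = []
    if not (base / "metadata.csv").is_file():
        problems.append("missing metadata.csv in base folder")
    audio = base / "audio"
    if not audio.is_dir() or not any(audio.iterdir()):
        problems.append("audio/ subfolder is missing or empty")
    return problems

print(check_dataset("your_dataset"))
```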
- Captioning: Audio files are automatically captioned using ace-step-captioner
- Lyrics: Track information is used to download lyrics via genius-api
- Both captions and lyrics are saved in the dataset audio subfolder
- I can't stress this enough: MANUALLY EDIT your lyrics and make sure they match the audio 100%!
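The caption and lyrics files follow the sidecar naming shown in the output listing above (`track1_caption.txt`, `track1_lyrics.txt` next to `track1.wav`). A small sketch of that naming convention; the helper function is hypothetical, as the toolkit handles this internally:

```python
from pathlib import Path

def sidecar_paths(audio_file):
    """Derive the caption and lyrics file paths that sit next to an audio
    file, following the <name>_caption.txt / <name>_lyrics.txt convention.
    (Helper name is hypothetical; the toolkit does this internally.)"""
    stem = Path(audio_file).with_suffix("")  # drop the .wav extension
    return Path(f"{stem}_caption.txt"), Path(f"{stem}_lyrics.txt")

caption, lyrics = sidecar_paths("your_dataset/audio/track1.wav")
print(caption.name)  # track1_caption.txt
print(lyrics.name)   # track1_lyrics.txt
```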
- The toolkit automatically detects all captions, lyrics, and metadata
- Combines BPM and key information from `metadata.csv`
- Generates `triggerword.json`, a complete, training-ready configuration file
- This file is placed in your dataset's base folder, ready for model training
The generated `triggerword.json` includes:
- Audio file paths and metadata
- Track captions (from ace-step-captioner)
- Lyrics (from genius-api)
- Musical metadata (BPM, key from Mixxx)
- All information formatted for direct use with ACE-STEP training
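Conceptually, the config generation step combines the per-track pieces listed above into one JSON structure. The field names below are illustrative, not the actual ACE-STEP schema; the toolkit produces the real `triggerword.json` for you.

```python
import json

def build_entry(audio_path, caption, lyrics, bpm, key):
    """Combine all per-track pieces into one config entry.
    Field names here are illustrative placeholders; the real
    triggerword.json schema is produced by the toolkit and may differ."""
    return {
        "audio": str(audio_path),
        "caption": caption,
        "lyrics": lyrics,
        "bpm": bpm,
        "key": key,
    }

entry = build_entry("audio/track1.wav", "an energetic techno track",
                    "placeholder lyrics", 128, "Am")
config_json = json.dumps([entry], indent=2)
print(config_json)
```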
Run `run_toolkit.bat` and follow the on-screen prompts to:
- Select your dataset folder
- Generate captions and lyrics
- Create the final training configuration
- Genius API Issues: Ensure you have a valid Genius API token configured
- Mixxx Metadata: Verify that `metadata.csv` contains the correct track information and that the language is set to either English or German
- Audio Files: Confirm all audio files are in the `audio/` subfolder
- Missing Dependencies: Run `install.bat` again to ensure all packages are installed
Contributions are welcome! Please feel free to submit issues and pull requests.
This project is part of the ACE-STEP ecosystem. See the LICENSE file for details.
- ACE-STEP
- ace-step-captioner
- Genius API
- Mixxx Documentation
- uv Documentation
- Thanks to mmoalem for further improving the code and adding Mixxx to the mix!