Generate detailed descriptions of your music files using AI. This tool uses the Qwen2-Audio, Qwen2-Audio 4-bit, Qwen-Omni, and Ace-Step Captioner models to analyze songs and create captions describing genre, instruments, mood, tempo, and overall sound characteristics.
- Single File Processing: Upload and analyze individual audio files
- Batch Processing: Process entire folders of music files at once
- Custom Prompts: Customize what information you want extracted
- Auto-Save: Automatically saves captions as `.txt` files alongside your audio files
- GPU Acceleration: Supports CUDA for faster processing
- Progress Tracking: Real-time progress bars for batch operations
- Python: 3.9 or higher
- RAM: 8GB minimum, 16GB recommended
- Storage: ~20GB for model and dependencies
- GPU: Optional but recommended (NVIDIA GPU with CUDA support for faster processing)
The installer automatically detects your system configuration and installs the correct version of PyTorch.
**Windows**

1. Install Python (if not already installed):
   - Download from python.org
   - During installation, check "Add Python to PATH"
2. Run the installer:
   - Double-click `install.bat`
   - The installer will:
     - Detect your CUDA version (if you have an NVIDIA GPU)
     - Install the correct PyTorch build automatically
     - Install all other dependencies
   - Wait for installation to complete (may take 10-30 minutes)
3. Run the application:
   - Double-click `run.bat`
   - Your browser will open at `http://localhost:7865`
   - First run: the model downloads automatically (~14GB, takes 10-30 minutes)
   - Future runs: loads instantly from cache
**Linux / macOS**

1. Install Python 3.9+ (if not already installed):

   ```bash
   # Ubuntu/Debian
   sudo apt install python3 python3-venv python3-pip

   # macOS (using Homebrew)
   brew install python@3.11
   ```

2. Run the installer:

   ```bash
   chmod +x install.sh
   ./install.sh
   ```

   The installer will automatically detect your CUDA version and install the appropriate PyTorch build.

3. Run the application:

   ```bash
   ./run.sh
   ```

   Open your browser at `http://localhost:7865`.
   - First run: the model downloads automatically (~14GB, takes 10-30 minutes)
   - Future runs: loads instantly from cache
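As an illustration of the detection step the installers perform, the PyTorch wheel index could be chosen roughly like this. This is a hypothetical sketch, not the actual contents of `install.bat`/`install.sh`; the function names are assumptions:

```python
import re
import subprocess
from typing import Optional

def detect_cuda_version() -> Optional[str]:
    """Parse `nvidia-smi` output for a CUDA version; None when no NVIDIA GPU is found."""
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", out)
    return match.group(1) if match else None

def torch_index_url(cuda_version: Optional[str]) -> str:
    """Map a detected CUDA version to the matching PyTorch wheel index URL."""
    if cuda_version is None:
        return "https://download.pytorch.org/whl/cpu"  # CPU-only build
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"
```

For example, CUDA 12.1 maps to the `cu121` index, while a machine without `nvidia-smi` falls back to the CPU build.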
- Click the "Single File" tab
- Upload an audio file (MP3, WAV, FLAC, M4A, etc.)
- Optionally customize the prompt
- Check "Save caption to .txt file" if you want to save the result
- Click "Generate Caption"
- The caption will appear in the result box
- Click the "Batch Processing" tab
- Enter the full path to your music folder:
  - Windows example: `C:\Users\YourName\Music\MyAlbum`
  - Linux/Mac example: `/home/yourname/Music/MyAlbum`
- Customize the prompt (it will be used for all files)
- Specify file extensions (e.g., `.mp3, .wav, .flac`)
- Click "Process Folder"
- Captions will be saved as `.txt` files next to each audio file
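Conceptually, the batch step just walks the folder for matching extensions and writes one `.txt` next to each audio file. A minimal sketch of that file handling (function names are assumptions, not the app's actual code):

```python
from pathlib import Path
from typing import Iterable, List

def find_audio_files(folder: str, exts: Iterable[str] = (".mp3", ".wav", ".flac")) -> List[Path]:
    """Collect matching audio files in the folder, recursively and case-insensitively."""
    wanted = {e.lower() for e in exts}
    return sorted(p for p in Path(folder).rglob("*") if p.suffix.lower() in wanted)

def caption_path(audio_file: Path) -> Path:
    """song.mp3 -> song.txt, placed next to the original audio file."""
    return audio_file.with_suffix(".txt")
```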
For a file named `song.mp3`, a caption file `song.txt` will be created:
```
This is an upbeat electronic dance track featuring synthesizers, drum machines,
and a pulsing bassline. The genre is progressive house with elements of trance.
The tempo is approximately 128 BPM with a driving four-on-the-floor beat.
The mood is energetic and euphoric with atmospheric pads and melodic leads.
```
You can customize what information is extracted by modifying the prompt. Here are some examples:
```
Describe this song including its genre, subgenre, instruments, mood, tempo,
and overall sound characteristics.
```

```
Provide a brief 2-sentence description of the song's genre and vibe.
```

```
What instruments can you hear in this song? Also describe the vocal style if present.
```

```
Analyze this track from a production perspective: describe the mix, effects used,
and production style.
```
This application runs on port 7865 by default (instead of Gradio's default 7860) to avoid conflicts with other Gradio applications.
To change the port, edit `music_captioner.py` and modify this line:

```python
app.launch(server_port=7865, share=False)
```

First run takes forever
- The model (~14GB) downloads automatically on first run
- Can take 10-30 minutes depending on internet speed
- See `MODEL_LOADING.md` for detailed info
Model download fails
- Check internet connection and disk space (~20GB needed)
- Try pre-downloading: run `download_model.bat` (Windows) or `./download_model.sh` (Linux/Mac)
- Downloads can resume if interrupted - just run again
- For China users: set the `HF_ENDPOINT=https://hf-mirror.com` environment variable
Where is the model stored?
- Windows: `C:\Users\YourName\.cache\huggingface\hub`
- Linux/Mac: `~/.cache/huggingface/hub`
- See `MODEL_LOADING.md` for how to change the cache location
"Python is not installed or not in PATH"
- Reinstall Python and ensure "Add Python to PATH" is checked
- Or manually add Python to your system PATH
"Failed to install dependencies"
- Make sure you have a stable internet connection
- Try running the installer again
- On Linux, you may need to install build tools: `sudo apt install python3-dev build-essential`
"Out of memory" errors
- Close other applications to free up RAM
- Try processing fewer files at once
- Use CPU mode if GPU memory is insufficient
Model download fails
- Ensure you have ~20GB of free disk space (the model itself is ~14GB)
- Check your internet connection
- The model downloads automatically on first run from Hugging Face
"Audio file not supported"
- Ensure your audio file is a common format (MP3, WAV, FLAC, M4A)
- Try converting to WAV using a tool like FFmpeg (e.g., `ffmpeg -i song.m4a song.wav`)
- GPU is much faster: If you have an NVIDIA GPU, make sure CUDA is installed
- Batch processing: More efficient than processing files one by one
- File formats: WAV files process slightly faster than compressed formats
- MP3 (.mp3)
- WAV (.wav)
- FLAC (.flac)
- M4A (.m4a)
- OGG (.ogg)
- And most other common audio formats
If you have limited GPU memory (VRAM), you can enable quantization to reduce memory usage:
- Go to the "Info" tab in the web interface
- Check "Enable Quantization"
- Choose 4-bit (~4GB VRAM) or 8-bit (~7GB VRAM)
- Click "Apply Settings"
- Restart the application
Note: Quantization requires an NVIDIA GPU with CUDA. Quality impact is minimal - most users won't notice a difference.
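Internally, flags like these typically translate into quantization keyword arguments passed when loading the model. A hypothetical mapping (the function name and exact wiring are assumptions, not the app's actual code):

```python
from typing import Dict

def quantization_kwargs(use_quantization: bool, quantization_type: str = "4bit") -> Dict[str, bool]:
    """Translate the config flags into model-loading keyword arguments."""
    if not use_quantization:
        return {}
    if quantization_type == "4bit":
        return {"load_in_4bit": True}  # ~4GB VRAM
    if quantization_type == "8bit":
        return {"load_in_8bit": True}  # ~7GB VRAM
    raise ValueError(f"Unknown quantization type: {quantization_type!r}")
```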
Alternatively, you can edit `config.py` directly:

```python
USE_QUANTIZATION = True
QUANTIZATION_TYPE = "4bit"  # or "8bit"
```

- Model: Qwen2-Audio-7B-Instruct
- Model Size: ~14GB
- Framework: PyTorch + Transformers
- Interface: Gradio
- Processing: Supports both CPU and GPU (CUDA)
To update to the latest version:
- Download the new release
- Copy your `venv` folder to the new directory (optional, to save time)
- Run the installer again
- Delete the entire `music-captioner` folder
- The AI model cache is stored in:
  - Windows: `C:\Users\YourName\.cache\huggingface`
  - Linux/Mac: `~/.cache/huggingface`
- Delete this folder to remove the downloaded model
Happy Music Captioning! 🎵