Generate detailed descriptions of your music files using AI. This tool uses the Qwen2-Audio, Qwen2-Audio 4-bit, Qwen-Omni, and Ace-Step Captioner models to analyze songs and create captions describing genre, instruments, mood, tempo, and overall sound characteristics.
- Single File Processing: Upload and analyze individual audio files
- Batch Processing: Process entire folders of music files at once
- Custom Prompts: Customize what information you want extracted
- Auto-Save: Automatically saves captions as `.txt` files alongside your audio files
- GPU Acceleration: Supports CUDA for faster processing
- Progress Tracking: Real-time progress bars for batch operations
- Python: 3.9 or higher
- RAM: 8GB minimum, 16GB recommended
- Storage: ~20GB for model and dependencies
- GPU: Optional but recommended (NVIDIA GPU with CUDA support for faster processing)
The installer automatically detects your system configuration and installs the correct version of PyTorch.
**Windows**

1. Install Python (if not already installed):
   - Download from python.org
   - During installation, check "Add Python to PATH"
2. Run the installer:
   - Double-click `install.bat`
   - The installer will:
     - Detect your CUDA version (if you have an NVIDIA GPU)
     - Install the correct PyTorch build automatically
     - Install all other dependencies
   - Wait for installation to complete (may take 10-30 minutes)
3. Run the application:
   - Double-click `run.bat`
   - Your browser will open at `http://localhost:7865`
   - First run: the model downloads automatically (~14GB, takes 10-30 minutes)
   - Future runs: loads instantly from cache
**Linux / macOS**

1. Install Python 3.9+ (if not already installed):

   ```bash
   # Ubuntu/Debian
   sudo apt install python3 python3-venv python3-pip

   # macOS (using Homebrew)
   brew install python@3.11
   ```

2. Run the installer:

   ```bash
   chmod +x install.sh
   ./install.sh
   ```

   The installer will automatically detect your CUDA version and install the appropriate PyTorch build.

3. Run the application:

   ```bash
   ./run.sh
   ```

   Open your browser at `http://localhost:7865`.
   - First run: the model downloads automatically (~14GB, takes 10-30 minutes)
   - Future runs: loads instantly from cache
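As an illustration of the detection step the installers perform, the PyTorch wheel index could be chosen roughly like this. This is a hypothetical sketch, not the actual contents of `install.bat`/`install.sh`; the function names are assumptions:

```python
import re
import subprocess
from typing import Optional

def detect_cuda_version() -> Optional[str]:
    """Parse `nvidia-smi` output for a CUDA version; None when no NVIDIA GPU is found."""
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", out)
    return match.group(1) if match else None

def torch_index_url(cuda_version: Optional[str]) -> str:
    """Map a detected CUDA version to the matching PyTorch wheel index URL."""
    if cuda_version is None:
        return "https://download.pytorch.org/whl/cpu"  # CPU-only build
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"
```

For example, CUDA 12.1 maps to the `cu121` index, while a machine without `nvidia-smi` falls back to the CPU build.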
- Click the "Single File" tab
- Upload an audio file (MP3, WAV, FLAC, M4A, etc.)
- Optionally customize the prompt
- Check "Save caption to .txt file" if you want to save the result
- Click "Generate Caption"
- The caption will appear in the result box
- Click the "Batch Processing" tab
- Enter the full path to your music folder:
  - Windows example: `C:\Users\YourName\Music\MyAlbum`
  - Linux/Mac example: `/home/yourname/Music/MyAlbum`
- Customize the prompt (it will be used for all files)
- Specify file extensions (e.g., `.mp3, .wav, .flac`)
- Click "Process Folder"
- Captions will be saved as `.txt` files next to each audio file
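Conceptually, the batch step just walks the folder for matching extensions and writes one `.txt` next to each audio file. A minimal sketch of that file handling (function names are assumptions, not the app's actual code):

```python
from pathlib import Path
from typing import Iterable, List

def find_audio_files(folder: str, exts: Iterable[str] = (".mp3", ".wav", ".flac")) -> List[Path]:
    """Collect matching audio files in the folder, recursively and case-insensitively."""
    wanted = {e.lower() for e in exts}
    return sorted(p for p in Path(folder).rglob("*") if p.suffix.lower() in wanted)

def caption_path(audio_file: Path) -> Path:
    """song.mp3 -> song.txt, placed next to the original audio file."""
    return audio_file.with_suffix(".txt")
```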
For a file named `song.mp3`, a caption file `song.txt` will be created:
```
This is an upbeat electronic dance track featuring synthesizers, drum machines,
and a pulsing bassline. The genre is progressive house with elements of trance.
The tempo is approximately 128 BPM with a driving four-on-the-floor beat.
The mood is energetic and euphoric with atmospheric pads and melodic leads.
```
You can customize what information is extracted by modifying the prompt. Here are some examples:
```
Describe this song including its genre, subgenre, instruments, mood, tempo,
and overall sound characteristics.
```

```
Provide a brief 2-sentence description of the song's genre and vibe.
```

```
What instruments can you hear in this song? Also describe the vocal style if present.
```

```
Analyze this track from a production perspective: describe the mix, effects used,
and production style.
```
This application runs on port 7865 by default (instead of Gradio's default 7860) to avoid conflicts with other Gradio applications.
To change the port, edit `music_captioner.py` and modify this line:

```python
app.launch(server_port=7865, share=False)
```

First run takes forever
- The model (~14GB) downloads automatically on first run
- Can take 10-30 minutes depending on internet speed
- See `MODEL_LOADING.md` for detailed info
Model download fails
- Check internet connection and disk space (~20GB needed)
- Try pre-downloading: run `download_model.bat` (Windows) or `./download_model.sh` (Linux/Mac)
- Downloads can resume if interrupted - just run again
- For China users: set the `HF_ENDPOINT=https://hf-mirror.com` environment variable
Where is the model stored?
- Windows: `C:\Users\YourName\.cache\huggingface\hub`
- Linux/Mac: `~/.cache/huggingface/hub`
- See `MODEL_LOADING.md` for how to change the cache location
"Python is not installed or not in PATH"
- Reinstall Python and ensure "Add Python to PATH" is checked
- Or manually add Python to your system PATH
"Failed to install dependencies"
- Make sure you have a stable internet connection
- Try running the installer again
- On Linux, you may need to install build tools: `sudo apt install python3-dev build-essential`
"Out of memory" errors
- Close other applications to free up RAM
- Try processing fewer files at once
- Use CPU mode if GPU memory is insufficient
Model download fails
- Ensure you have ~20GB of free disk space (the model itself is ~14GB)
- Check your internet connection
- The model downloads automatically on first run from Hugging Face
"Audio file not supported"
- Ensure your audio file is a common format (MP3, WAV, FLAC, M4A)
- Try converting to WAV using a tool like FFmpeg (e.g., `ffmpeg -i song.m4a song.wav`)
- GPU is much faster: If you have an NVIDIA GPU, make sure CUDA is installed
- Batch processing: More efficient than processing files one by one
- File formats: WAV files process slightly faster than compressed formats
- MP3 (.mp3)
- WAV (.wav)
- FLAC (.flac)
- M4A (.m4a)
- OGG (.ogg)
- And most other common audio formats
If you have limited GPU memory (VRAM), you can enable quantization to reduce memory usage:
- Go to the "Info" tab in the web interface
- Check "Enable Quantization"
- Choose 4-bit (~4GB VRAM) or 8-bit (~7GB VRAM)
- Click "Apply Settings"
- Restart the application
Note: Quantization requires an NVIDIA GPU with CUDA. Quality impact is minimal - most users won't notice a difference.
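Internally, flags like these typically translate into quantization keyword arguments passed when loading the model. A hypothetical mapping (the function name and exact wiring are assumptions, not the app's actual code):

```python
from typing import Dict

def quantization_kwargs(use_quantization: bool, quantization_type: str = "4bit") -> Dict[str, bool]:
    """Translate the config flags into model-loading keyword arguments."""
    if not use_quantization:
        return {}
    if quantization_type == "4bit":
        return {"load_in_4bit": True}  # ~4GB VRAM
    if quantization_type == "8bit":
        return {"load_in_8bit": True}  # ~7GB VRAM
    raise ValueError(f"Unknown quantization type: {quantization_type!r}")
```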
Alternatively, you can edit `config.py` directly:

```python
USE_QUANTIZATION = True
QUANTIZATION_TYPE = "4bit"  # or "8bit"
```

- Model: Qwen2-Audio-7B-Instruct
- Model Size: ~14GB
- Framework: PyTorch + Transformers
- Interface: Gradio
- Processing: Supports both CPU and GPU (CUDA)
To update to the latest version:
- Download the new release
- Copy your `venv` folder to the new directory (optional, to save time)
- Run the installer again
- Delete the entire `music-captioner` folder
- The AI model cache is stored in:
  - Windows: `C:\Users\YourName\.cache\huggingface`
  - Linux/Mac: `~/.cache/huggingface`
- Delete this folder to remove the downloaded model
Happy Music Captioning! 🎵