
MusicMon 🎵

A personal exploration into training music generation models from scratch! This project documents my journey through different approaches to AI music generation, from diffusion models to transformers.

What I Built

🎯 Project Goals

  • Train my own music generation model from scratch (first time!)
  • Experiment with different architectures and approaches
  • Generate Pokemon-style music using MIDI data

🔢 Approaches Tried

1. Diffusion Model (Raw Waveforms)

  • Started with MNIST to understand diffusion
  • Attempted raw waveform diffusion for audio
  • Result: Pure static noise output 😅
  • Challenge: Raw audio diffusion did not cooperate; I'm still not sure why it produced pure static. I considered trying spectrograms instead, but decided a transformer would likely be better suited to sequential data.
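For context, the forward (noising) process the diffusion model is trained against can be sketched in a few lines. This is the standard DDPM closed-form noising step; the schedule values and names here are illustrative, not the notebook's actual code:

```python
import torch

# Standard DDPM linear noise schedule (illustrative hyperparameters).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def noise_waveform(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(a)*x0 + sqrt(1-a)*eps."""
    a = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps

x0 = torch.randn(1, 16000)        # stand-in for 1 second of 16 kHz audio
xt = noise_waveform(x0, t=T - 1)  # near t=T the signal is almost pure noise
```

By the last timestep almost none of the original signal survives, which is also what a model that never learns the reverse process will emit: static.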

2. Transformer Model

  • Implemented transformer decoder (with sinusoidal positional encodings & multi-headed attention) from scratch
  • Used REMI tokenizer (from MidiTok) to tokenize MIDI songs
  • Trained on a subset of Pokemon MIDI songs
  • Result: Some musical patterns emerging, but quality varies
  • Challenge: Extremely slow training with low GPU utilization (10% memory, 30% duty cycle)
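The sinusoidal positional encodings mentioned above can be sketched as follows. This is the standard Vaswani et al. formulation; the notebook's implementation may differ in its details:

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Build the (seq_len, d_model) table of sin/cos positional encodings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)      # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-torch.log(torch.tensor(10000.0)) / d_model))   # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions get sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions get cosine
    return pe

pe = sinusoidal_positions(seq_len=512, d_model=256)  # added to token embeddings
```

Because the encodings are fixed functions of position, they add no parameters and extrapolate to any sequence length up to `seq_len`.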

Both models were trained on a single RTX 4070, so top-tier output quality was never a realistic expectation; the point was learning the techniques end to end.
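On the low-GPU-utilization challenge: the usual first suspect is the input pipeline. Here's a minimal sketch of a DataLoader configured to keep the GPU fed; the fake token dataset and all numbers are placeholders, not this repo's actual settings:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1024 sequences of 256 token ids from a 512-token vocab.
ds = TensorDataset(torch.randint(0, 512, (1024, 256)))

loader = DataLoader(
    ds,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # overlap data prep with GPU compute
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # avoid re-spawning workers every epoch
)
```

If utilization is still low after this, the bottleneck is usually batch size (too small to saturate the GPU) or per-step Python overhead in the training loop.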

🎼 Audio Samples

📚 Resources

🎮 Data Sources

The MIDI songs were sourced from here

🚧 Future Improvements

I'm flying out to Boston tomorrow to join Suno and work on music generation, so I'm not sure how much more I'll do in this repo. That said, here's what I'd like to tackle given time:

  • Improve training pipeline for better GPU utilization
  • Experiment with different model architectures
  • Implement better sampling strategies
  • Add conditioning for style control
  • Try hybrid approaches combining diffusion and transformers
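On better sampling strategies: a common baseline is temperature plus top-k sampling, which trades diversity against coherence when decoding music tokens. A hedged sketch (the function name and defaults are my own, not from this repo):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 50) -> torch.Tensor:
    """Sample one token id from (batch, vocab) logits with temperature + top-k."""
    logits = logits / temperature                       # sharpen or flatten
    kth = torch.topk(logits, top_k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))  # drop the tail
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)      # (batch, 1)

# Toy example: with top_k=2 only the two highest-logit tokens can be drawn.
logits = torch.tensor([[10.0, 0.0, -10.0, 5.0]])
token = sample_next_token(logits, temperature=0.8, top_k=2)
```

Lower temperatures make outputs more repetitive but safer; higher top-k admits rarer tokens at the cost of occasional off-key choices.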

📁 Project Structure

music-gen/
├── midi_transformer.ipynb      # Main transformer implementation
├── exploration/                # Early experiments
│   ├── music_generator.ipynb   # Diffusion model attempts
│   └── audio_utils.py          # Audio processing utilities
├── midi_to_wav.ipynb           # Convert MIDI samples to wav
├── midis/                      # Pokemon MIDI dataset
├── outputs/                    # Generated audio samples
└── midi_tokenizer.json         # Trained REMI tokenizer

This project represents my first attempt at training music generation models from scratch, but I had fun because I love music :} 🎵✨