A personal exploration into training music generation models from scratch! This project documents my journey through different approaches to AI music generation, from diffusion models to transformers.
- Train my own music generation model from scratch (first time!)
- Experiment with different architectures and approaches
- Generate Pokemon-style music using MIDI data
- Started with MNIST to understand diffusion
- Attempted raw waveform diffusion for audio
- Result: Pure static noise output 😅
- Challenge: Raw audio diffusion does not like me. I'm still not sure why it produced pure static noise. I considered trying spectrograms too, but decided a transformer would likely be better at dealing with sequential data.
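For context, a minimal sketch of the DDPM forward (noising) process applied to a raw 1-D waveform, which is the setup described above. All names here (`linear_beta_schedule`, `add_noise`) are illustrative, not taken from this repo's code:

```python
import numpy as np

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule from the DDPM paper (Ho et al., 2020)."""
    return np.linspace(beta_start, beta_end, timesteps)

def add_noise(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    # A denoising model would be trained to predict `noise` given (xt, t).
    return xt, noise

T = 1000
betas = linear_beta_schedule(T)
alphas_cumprod = np.cumprod(1.0 - betas)  # a_bar_t shrinks toward 0 as t grows

rng = np.random.default_rng(0)
x0 = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # 1 s of a 440 Hz tone
xt, eps = add_noise(x0, t=T - 1, alphas_cumprod=alphas_cumprod, rng=rng)
```

By t = T-1 the signal term is almost gone and `xt` is essentially Gaussian noise, which is also roughly what a model that fails to learn the reverse process hands back to you.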
- Implemented a transformer decoder (with sinusoidal positional encodings and multi-head attention) from scratch
- Used REMI tokenizer (from MidiTok) to tokenize MIDI songs
- Trained on a subset of Pokemon MIDI songs
- Result: Some musical patterns emerging, but quality varies
- Challenge: Extremely slow training with low GPU utilization (10% memory, 30% duty cycle)
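Since the decoder above uses sinusoidal positional encodings, here is a short sketch of that component as defined in "Attention Is All You Need" (the function name is mine, not from this repo):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)).
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims get sine
    pe[:, 1::2] = np.cos(angles)               # odd dims get cosine
    return pe

# Added to token embeddings before the first decoder layer:
pe = sinusoidal_positional_encoding(seq_len=512, d_model=256)
```

Each position gets a unique pattern of phases across frequencies, which lets attention layers distinguish token order without any learned parameters.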
These were all trained on a single RTX 4070, so it's a given that the quality was never going to be amazing.
- diffusion (WARNING: LOWER VOLUME) - This was with my raw audio waveform 1D-diffusion model :,)
- single sample, transformer - I only had 1 data sample for this (PMDRRT_Sky_Tower.mid), so the output sounds awfully similar to it
- more samples, transformer - I added a lot more songs, scaled down the vocab size, used only 1 instrument, and got this
- same model as above

While it doesn't sound the best, I can still occasionally hear some patterns in the final 2 outputs
- DDPM: Denoising Diffusion Probabilistic Models
- Attention: Attention Is All You Need
- GPT-1: Improving Language Understanding by Generative Pre-Training
- GPT-2: Language Models are Unsupervised Multitask Learners
- MidiTok: MidiTok GitHub Repo
The MIDI songs were sourced from here
I'm flying out to Boston tomorrow to join Suno and work on music generation, so I'm not sure how much I'll keep working on this repo, but here are things I'd like to explore given time:
- Improve training pipeline for better GPU utilization
- Experiment with different model architectures
- Implement better sampling strategies
- Add conditioning for style control
- Try hybrid approaches combining diffusion and transformers
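As an example of what "better sampling strategies" could look like, here is a hedged sketch of temperature plus top-k sampling over a decoder's output logits (the function and its parameters are hypothetical, not part of this repo):

```python
import numpy as np

def sample_top_k(logits, k=10, temperature=1.0, rng=None):
    """Sample a token id from the k highest logits after temperature scaling.

    Lower temperature sharpens the distribution (more repetitive, safer music);
    smaller k cuts off low-probability tokens that often sound like mistakes.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    top = np.argsort(scaled)[-k:]                   # indices of the k largest logits
    probs = np.exp(scaled[top] - scaled[top].max()) # stable softmax over the top-k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

# Usage: pick the next REMI token given the model's logits for one step.
logits = np.array([0.0, 0.0, 10.0, 0.0])
next_token = sample_top_k(logits, k=2, temperature=0.1)
```

With k=1 this degenerates to greedy decoding; in practice a moderate k and temperature tend to trade off repetition against wrong-note randomness.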
music-gen/
├── midi_transformer.ipynb # Main transformer implementation
├── exploration/ # Early experiments
│ ├── music_generator.ipynb # Diffusion model attempts
│ └── audio_utils.py # Audio processing utilities
├── midi_to_wav.ipynb # Convert MIDI samples to wav
├── midis/ # Pokemon MIDI dataset
├── outputs/ # Generated audio samples
└── midi_tokenizer.json # Trained REMI tokenizer
This project represents my first attempt at training music generation models from scratch, but I had fun because I love music :} 🎵✨