A small decoder-only transformer model that I wrote in PyTorch for an Extended Essay research project. I later came back to it because the original version was not performing well.
Based on Google's Magenta.
This is the fourth version of the model; here is what I have tried:
- An encoder-decoder model from this tutorial.
- Decoder-only except with regular absolute attention.
- Added the special skewing procedure found in this paper.
- Current revisit: I corrected some issues with dropout, added learning-rate scheduling, and updated the hyperparameters now that I have access to lab machines. The sequence length went from 200 to 1024, exactly as in the aforementioned paper.
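The learning-rate scheduling mentioned above can be sketched with the inverse-square-root warmup schedule from "Attention Is All You Need"; this is an illustrative assumption about which schedule the project uses, and the warmup length is a placeholder:

```python
import torch

d_model, warmup = 512, 4000  # warmup length is an assumed value

def noam_lr(step: int) -> float:
    # Linear warmup for `warmup` steps, then inverse-square-root decay.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(512, 512)  # stand-in for the transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)  # base lr scaled by the lambda
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
```

Calling `scheduler.step()` once per optimizer step then ramps the rate up to its peak at `warmup` steps and decays it afterwards.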
- Sequence length (`seq_len`): 1024
- Embedding dimensionality (`d_model`): 512
- Depth: 6
- Input: a MIDI file converted to tokens, padded to length 1024 if necessary and truncated if too long. This is the "seed" song that the model will continue.
- Convert to embeddings of shape (seq_len, d_model) and scale by sqrt(d_model)
- Decoder block (run `depth` times):
  - Relative self-attention using the efficient skewing procedure
  - Dropout
  - Normalize
  - Fully connected layer
  - Normalize
- Apply a final fully connected layer to output probabilities for each token
- Output: choice of top-k sampling, top-p sampling, or top-p with a section of the seed appended to the decode. See `showcase.ipynb` to try each of them!
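The input-preparation step above can be sketched as follows; the helper name and `PAD_ID` are illustrative assumptions, since the actual token vocabulary and padding id live in the repository's preprocessing code:

```python
import torch

SEQ_LEN = 1024
PAD_ID = 0  # assumed padding token id

def pad_or_truncate(tokens: list[int]) -> torch.Tensor:
    """Fix a token sequence to SEQ_LEN: truncate long seeds, pad short ones."""
    tokens = tokens[:SEQ_LEN]
    tokens = tokens + [PAD_ID] * (SEQ_LEN - len(tokens))
    return torch.tensor(tokens, dtype=torch.long)
```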
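The efficient skewing procedure used in the decoder block can be sketched like this. It is a minimal reconstruction of the memory-efficient trick from the Music Transformer paper, which rearranges `Q @ E_r^T` so that entry (i, j) lines up with relative distance j − i without materializing an (L, L, d) tensor; the tensor names and the commented attention formula are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn.functional as F

def skew(qe: torch.Tensor) -> torch.Tensor:
    """Skew Q @ E_r^T into relative-position logits.

    qe: (..., L, L), where column r holds relative distance r - (L - 1),
    so the last column corresponds to distance 0 (attending to itself).
    """
    *batch, L, _ = qe.shape
    padded = F.pad(qe, (1, 0))                  # (..., L, L + 1): dummy column on the left
    reshaped = padded.reshape(*batch, L + 1, L) # row-major reshape shifts each row
    return reshaped[..., 1:, :]                 # (..., L, L): drop the first row

# Illustrative use inside causal attention (names are assumptions):
# logits = (q @ k.transpose(-2, -1) + skew(q @ e_r.transpose(-2, -1))) / d_head ** 0.5
```

Positions above the diagonal come out as leftover values, but they are removed anyway by the causal mask.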
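The top-k and top-p output choices can be sketched as below; this is a minimal illustration rather than the notebook's actual implementation, and the function names and default `k`/`p` values are assumptions:

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 40) -> int:
    # Keep only the k most likely tokens, renormalize, then sample.
    values, indices = torch.topk(logits, k)
    probs = torch.softmax(values, dim=-1)
    return indices[torch.multinomial(probs, 1)].item()

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    # Nucleus sampling: keep the smallest set of tokens whose
    # cumulative probability reaches p, renormalize, then sample.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1  # first index reaching p, inclusive
    keep = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return sorted_idx[torch.multinomial(keep, 1)].item()
```

Both return a single next-token id; generation repeats this step, feeding each sampled token back into the model.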
`showcase.ipynb` currently contains everything required to train the model on one song, Reverie by Claude Debussy, mostly as a proof of concept. I am re-training on the full Maestro dataset and will commit the model once it's finished.