Conformer with multi-scale local attention for symbolic music generation. See coma for a similar architecture used for composer classification.
Model Architecture (see src/transformer.py):

- Embedding: REMI token embedding + learned positional embedding
- Decoder: stack of conformer-like blocks[^1] (1/2 * FeedForward → Multi-Scale Local Attention → Conformer Conv Module → 1/2 * FeedForward) with hyper-connections and residual streams:
  - Local Attention: multi-scale local self-attention with multiple window sizes (e.g., [32, 64])
    - Each scale uses windowed attention with optional rotary position embeddings (xPos) or dynamic position bias
    - Scales are aggregated via a learnable weighted sum
    - Query-Key RMSNorm with learnable scales for improved training stability
  - Conformer Conv Module: LayerNorm → pointwise conv (1D, expansion factor 2) → GLU activation → depthwise conv (causal) → Swish → channel LayerNorm → pointwise conv → Dropout
  - Global Attention: optional global attention layers can be inserted at specified positions (disabled by default)
  - Hyper-connections: each component is wrapped with residual stream expansion/reduction functions
- Output: LayerNorm → linear projection to vocabulary size
- KV caching
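To illustrate the attention component, here is a minimal NumPy sketch of multi-scale causal local attention: each scale restricts every query to a trailing window of keys, and the per-scale outputs are combined with a learnable weighted sum. This is a simplification of what src/transformer.py does (no multi-head split, no xPos or dynamic position bias, no QK-RMSNorm), and all function names here are illustrative, not the repo's API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_causal_attention(q, k, v, window):
    """Each position attends to itself and the previous `window - 1` positions."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # (n, n) attention logits
    idx = np.arange(n)
    # causal + local mask: key j is visible from query i iff i - window < j <= i
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def multi_scale_local_attention(q, k, v, windows=(32, 64), scale_logits=None):
    """Run local attention at several window sizes, then combine the outputs
    with a learnable weighted sum (softmax over per-scale logits)."""
    if scale_logits is None:
        scale_logits = np.zeros(len(windows))  # uniform weights by default
    weights = softmax(scale_logits)
    outs = [local_causal_attention(q, k, v, w) for w in windows]
    return sum(w * o for w, o in zip(weights, outs))

rng = np.random.default_rng(0)
n, d = 128, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = multi_scale_local_attention(q, k, v, windows=(32, 64))
print(out.shape)  # (128, 16)
```

Note that with `window >= n` a scale degenerates to ordinary causal self-attention, which is a handy sanity check; the real module materializes only the windowed key/value blocks instead of an (n, n) score matrix.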
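The Conformer convolution module can be sketched the same way. This follows the shapes used in the lucidrains conformer implementation that the repo adapts (the first pointwise conv expands to twice the inner width so that GLU halves it back; the depthwise conv is left-padded so it stays causal) — the parameter names and initialization here are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def swish(x):
    return x / (1.0 + np.exp(-x))

def causal_depthwise_conv(x, kernels):
    """x: (n, c), kernels: (k, c). Left-pad k-1 steps so output[t]
    depends only on inputs at positions <= t (causal)."""
    n, c = x.shape
    k = kernels.shape[0]
    padded = np.vstack([np.zeros((k - 1, c)), x])
    return np.stack([(padded[t:t + k] * kernels).sum(axis=0) for t in range(n)])

def conv_module(x, w_in, dw_kernels, w_out):
    """LayerNorm → pointwise conv (expansion) → GLU → causal depthwise conv
    → Swish → channel LayerNorm → pointwise conv back to model dim."""
    h = layer_norm(x) @ w_in                    # (n, 2 * inner)
    a, b = np.split(h, 2, axis=-1)
    h = a * (1.0 / (1.0 + np.exp(-b)))          # GLU = a * sigmoid(b)
    h = causal_depthwise_conv(h, dw_kernels)    # (n, inner)
    h = layer_norm(swish(h))                    # channel-wise LayerNorm
    return h @ w_out                            # (n, d); residual is added outside

rng = np.random.default_rng(0)
n, d, expansion, kernel = 64, 32, 2, 5
inner = d * expansion
w_in = rng.standard_normal((d, 2 * inner)) * 0.02
dw = rng.standard_normal((kernel, inner)) * 0.02
w_out = rng.standard_normal((inner, d)) * 0.02
x = rng.standard_normal((n, d))
y = conv_module(x, w_in, dw, w_out)
print(y.shape)  # (64, 32)
```

Because every step is either per-position or left-padded, output position t never sees inputs after t, which is what makes the module usable in an autoregressive decoder.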
Create a conda environment with Python 3.11:

```shell
conda create -n coma-gen python=3.11
conda activate coma-gen
```

Install requirements:

```shell
pip install -r requirements.txt
```

Download the Maestro 3.0 dataset[^2]:

```shell
wget https://storage.googleapis.com/magentadata/datasets/maestro/v3.0.0/maestro-v3.0.0-midi.zip
unzip 'maestro-v3.0.0-midi.zip'
rm 'maestro-v3.0.0-midi.zip'
mv 'maestro-v3.0.0' 'data/maestro-v3.0.0'
```

Adjust the training params in config.py and begin training the transformer with:

```shell
python3 train.py
```

TensorBoard logs will be saved in the LOG_DIR directory.
This repo is largely adapted from the following:

- local attention: https://github.com/lucidrains/local-attention
- conformer: https://github.com/jreremy/conformer, https://github.com/lucidrains/conformer
- miditok: https://github.com/Natooz/MidiTok
Footnotes
[^1]: Gulati, A., Qin, J., Chiu, C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., & Pang, R. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. ArXiv, abs/2005.08100.

[^2]: Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C.A., Dieleman, S., Elsen, E., Engel, J., & Eck, D. (2018). Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. ArXiv, abs/1810.12247.
