
Generating MIDI Music with GPT-2, Hacker News

In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work which used a char-RNN trained on a 'The Session' dataset. A GPT-2 can hypothetically improve on an RNN through better global coherence & copying of patterns, without problems with the hidden-state bottleneck.

I encountered problems with the standard GPT-2 model's encoding of text which damaged results, but after fixing that, I successfully trained it on ABC music pieces taken from The Session & ABCnotation.com. The resulting music samples are, in my opinion, quite pleasant.

The ABC folk model & dataset are available for download, and I provide for listening selected music samples as well as medleys of random samples from throughout training.

We followed the ABC folk model with an ABC-MIDI model: a dataset of ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs. The MIDI pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples, but when sampling succeeds, it can generate even better musical samples.

Background: folk-RNN

Bob L. Sturm previously experimented with generating Irish folk music using a char-RNN trained on a corpus of folk music from The Session, written in a high-level musical format called "ABC notation".

Compact text, perfect for NNs. While ABC notation is written in ASCII, it supports many complex features, and it has been adopted widely by folk musicians, with hundreds of thousands of pieces written/transcribed in it.

Prior success with char-RNNs. Sturm et al scraped ABC files from The Session and trained a Theano char-RNN called "folk-RNN", putting the code & data online and providing a web interface for generation. In addition to the various research publications, Sturm has also written many blog posts evaluating folk-RNN pieces, such as how well they're played by human musicians.

Transformers?

Google Magenta's Music Transformer (Huang et al), with style control, & OpenAI's Sparse Transformer-based (Child et al 2019) MuseNet have both demonstrated excellent results in music composition at various timescales/formats, and interesting features like mixing genres.

While messing around with GPT-2-based preference learning in late October 2019, I became curious if folk-RNN could be improved by simply throwing one of the GPT-2 models at it.

GPT-2: a perfect match. (Not the large ones, of course, which would overfit far too easily or hadn't been released yet, but GPT-2-117M.)

GPT-2 is unable to model raw WAV audio, or MIDI, because a meaningful musical piece is a WAV sequence hundreds of thousands to millions of symbols long, and a MIDI piece is tens of thousands of symbols long. Both far exceed GPT-2's small context window, which is why OpenAI used Sparse Transformers for its MIDI generation, as Sparse Transformers can scale to text with tens of thousands of characters. However, the high-level notation of ABC pieces means they fit just fine into the GPT-2 window.
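As a rough back-of-the-envelope check of this argument, here is a minimal Python sketch. The 1,024-token figure is GPT-2's standard context window; the 44.1 kHz mono sample rate and the 3-minute duration are illustrative assumptions, not figures from the text, and one audio sample is treated as one symbol purely for the sake of comparison.

```python
GPT2_CONTEXT_TOKENS = 1024  # GPT-2's standard context window, in BPE tokens


def wav_symbols(seconds: float, sample_rate_hz: int = 44_100) -> int:
    """Raw sample count of a mono WAV clip (the sample rate is an assumption)."""
    return int(seconds * sample_rate_hz)


def fits_in_context(n_symbols: int, window: int = GPT2_CONTEXT_TOKENS) -> bool:
    """True if a sequence of n_symbols could fit in a single context window."""
    return n_symbols <= window


if __name__ == "__main__":
    three_minutes = wav_symbols(180)
    # Millions of samples versus a window of about a thousand tokens.
    print(three_minutes, fits_in_context(three_minutes))
```

A short ABC tune, by contrast, is typically only a few hundred characters, so it fits in one window with room to spare.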


I had avoided doing anything music with GPT-2, focusing on my poetry generation instead, because I assumed OpenAI would be doing a MuseNet followup, but months later, they'd done nothing further, and when I inquired, I got the impression that their music projects were over. So why not?

As for why repeat Sturm's project: there were two possible advantages to using GPT-2-117M.

The Session

So I did apt-get install abcmidi timidity to get the CLI tools to do ABC→MIDI→WAV (respectively) and downloaded the folk-RNN repo with its data files.
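The ABC→MIDI→WAV conversion can be scripted. The following is a minimal Python sketch, not the author's own workflow: the helper names are hypothetical, it assumes abc2midi and timidity are on the PATH, and it assumes one tune per .abc file so that a single output filename is enough.

```python
import subprocess
from pathlib import Path


def conversion_commands(abc_path: str) -> list[list[str]]:
    """Build the two command lines: ABC -> MIDI (abc2midi), then MIDI -> WAV (timidity)."""
    abc = Path(abc_path)
    midi = abc.with_suffix(".mid")
    wav = abc.with_suffix(".wav")
    return [
        ["abc2midi", str(abc), "-o", str(midi)],         # -o: output MIDI filename
        ["timidity", str(midi), "-Ow", "-o", str(wav)],  # -Ow: write RIFF WAVE output
    ]


def convert(abc_path: str) -> None:
    """Run both steps, raising CalledProcessError if either tool exits non-zero."""
    for cmd in conversion_commands(abc_path):
        subprocess.run(cmd, check=True)
```

Calling convert("tune.abc") would then leave tune.mid and tune.wav next to the input file.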

The data comes in several formats, for their experiments in changing the notation. I used the original format.

The data needed processing for GPT-2 as follows:


abc2midi requires every song to have an integer identifier (the X: field) to be a valid ABC file which it can compile to MIDI. I used an Emacs macro (which can increment an integer for each song) to insert a X: $N before each T: title line, but in retrospect, I could have simply used another search-and-replace to insert X: 1 in front of each piece. It's not like the ID has to be unique; we're just satisfying abc2midi, which is a bit picky.

GPT-2 also needs end-of-text markers inserted where relevant, so it understands how to generate separate pieces and avoids 'run on'. I used search-and-replace.
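The two preprocessing steps above could also be done in a few lines of Python rather than with an Emacs macro and search-and-replace. This is only a sketch under stated assumptions: that pieces are separated by blank lines, and that the marker is GPT-2's standard `<|endoftext|>` delimiter. The function name is hypothetical.

```python
import re

END_OF_TEXT = "<|endoftext|>"  # GPT-2's standard document delimiter (assumed to be the marker meant)


def prepare_corpus(raw: str) -> str:
    """Give every piece an X: identifier and separate pieces with end-of-text markers.

    Assumes pieces are separated by one or more blank lines.
    """
    pieces = [p.strip() for p in re.split(r"\n\s*\n", raw) if p.strip()]
    out = []
    for i, piece in enumerate(pieces, start=1):
        if not piece.startswith("X:"):
            # abc2midi only needs some integer here; it does not have to be unique.
            piece = f"X: {i}\n{piece}"
        out.append(piece)
    return "".join(f"{piece}\n{END_OF_TEXT}\n" for piece in out)
```

Splitting on blank lines rather than on T: lines avoids giving a second X: field to pieces that carry more than one title line.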

Training (First Model)

Because the Session corpus was so small, I used the smallest available GPT-2, GPT-2-117M, to train on, with standard settings, on one of my Nvidia GPUs:

    PYTHONPATH=src ./train.py --dataset thesessions-irishabc.txt.npz --batch_size … \
        --model_name irish --save_every … --sample_every … --learning_rate 0.18 \
        --run_name irish --memory_saving_gradients --noise 0.21 --val_every 768

Training was fairly easy, taking just a few days at most to train down to a loss of 0.85 (380488 steps at minibatch n = 5), and I killed it in October and looked at the random samples.

Straightforward success. They struck me as pretty good, aside from generated pieces often having the same title repeatedly, which apparently was due to The Session posting multiple transcriptions of the same piece, so the model picked up on that and would generate variants on the same theme. Sturm highlighted a few and did some more in-depth commentary on them, with a mixed evaluation, concluding: "So of the five transcriptions above, two are plausible. The polka is actually pretty good! All titles by GPT-2 are plagiarized, but I haven't found much plagiarism in the tunes themselves."

I was worried about plagiarism and had thought stopping at roughly that loss would be safe, but it seemed the music itself was still far from being copied, so I considered further training.

Some datasets are invalid ABC. The additional processed versions of The Session that Sturm et al had made seemed like a good target, but caused problems when I simply concatenated them in, and I soon discovered why abc2midi now thought all the samples were broken:

allabcwrepeats_parsed_wot: This is version 3 of the dataset. In this version, we transpose all tunes to have the root C, transpose them all to have the root C#, remove the titles, and make new mode tokens, K:maj, K:min, K:dor, and K:mix.

This turns out to be a problem: K:maj, K:min, K:dor, and K:mix completely break abc2midi! So I did additional search-and-replace to transform them into valid key signatures like K:Cmaj, K:Cmin, K:Cdor, and K:Cmix.
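The same fix can be written as a single regular-expression substitution. This is a minimal Python sketch of that search-and-replace, with a hypothetical function name; it assumes each mode token sits alone on its own K: line and leaves already-valid key signatures untouched.

```python
import re

# A bare mode token (no root note) on a K: line, e.g. "K:maj" or "K: dor".
MODE_TOKEN = re.compile(r"^K:[ \t]*(maj|min|dor|mix)[ \t]*$", re.MULTILINE)


def fix_mode_tokens(abc_text: str, root: str = "C") -> str:
    """Rewrite bare mode tokens into key signatures with an explicit root, e.g. K:Cmaj."""
    return MODE_TOKEN.sub(lambda m: f"K:{root}{m.group(1)}", abc_text)
```

For example, fix_mode_tokens("K:dor\n") gives "K:Cdor\n", while a line such as K:Gmaj is left as it is.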

Retraining, I discovered the model was far from converged, and with another 28k steps, the loss could go lower still. However, checking random samples by hand, the textual overlap with The Session became particularly large once the loss reached ~0.27 (note that it was not 'overfitting' in the standard sense, since the loss was still decreasing on the validation set), so I backed off to an earlier model with a higher loss. This seems to be high-quality without gross plagiarism.
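The overlap check described here was done by hand; a crude automated stand-in could look like the following Python sketch. The function name, the character n-gram approach, and the window length of 20 are all assumptions for illustration, not the method actually used.

```python
def ngram_overlap(sample: str, corpus: str, n: int = 20) -> float:
    """Fraction of the sample's character n-grams that occur verbatim in the corpus.

    Values near 1.0 suggest the sample is largely copied; values near 0.0
    suggest little verbatim overlap. The window length n is an arbitrary choice.
    """
    if len(sample) < n:
        return 0.0
    corpus_ngrams = {corpus[i:i + n] for i in range(len(corpus) - n + 1)}
    sample_ngrams = [sample[i:i + n] for i in range(len(sample) - n + 1)]
    hits = sum(gram in corpus_ngrams for gram in sample_ngrams)
    return hits / len(sample_ngrams)
```

Tracking such a score across checkpoints would give a rough signal of when samples start to copy the training data, though it is no substitute for listening and reading.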

Spaceless Model

I began using that model for the preference learning work, where I found that preference learning seemed to improve music more than the poetry, so I began focusing on the music.

Anomaly: permanent space-related syntax errors. Puzzlingly, no matter how many ratings I added, and despite the low loss, the generated samples would persistently have basic, blatant syntax errors involving spaces; abc2midi would often warn or even error out on a piece which could be easily fixed by hand by simply removing a few spaces.

This was wasting my time during rating, since I couldn't pick samples with syntax problems (even if they'd otherwise sound good) because I didn't want to reinforce generation of invalid samples, and also while generating music.

Discussing it with Shawn Presser, who I was working with simultaneously to train GPT-2-1.5b on poetry, he pointed out that some people, like Nostalgebraist, had some frustrating problems with the standard GPT-2 BPE encoding.

GPT-2 generates space-delimited word fragments. To explain what BPE is and why it might be a bad thing for ABC notation: GPT-2 does not just feed in raw characters like a char-RNN does, because that makes every input extremely long. Instead, it tries to 'chunk' them into something in-between character-sized and word-sized, to get the best of both worlds: a way of writing text where common words are a single symbol but rare words can still be expressed as a couple of symbols rather than deleted entirely, as word-based encodings must. However, since the default model is trained on English text, chunking is done assuming normal English whitespace, like spaces between words.

Nostalgebraist notes that the actual BPE implementation used is weird and does not act as you'd expect, especially when spaces are involved. So Presser wondered if GPT-2's BPE encoding was behind the space errors.

Workaround: spaces optional! ABC does not require spaces. They are only there for the convenience of humans reading & writing ABC. Aside from the metadata fields, if you delete all spaces, the music should be the same. I was surprised, but this seemed to be true. (Presser did some experiments with creating a brand-new BPE tailored to ABC, which would have reduced the BPE size of ABC pieces.)

Combined Model: The Session + ABCnotation.com

More dakka (data). Presser was interested in expanding the repertoire beyond The Session and began looking at ABC databases. The biggest by far appeared to be ABCnotation.com. He scraped a random half of its pieces, and I combined them with the duplicated Session dataset (129 MB in total).

Simplifying to match The Session ABC. ABCnotation.com pieces are much more diverse in formatting & metadata than The Session. To homogenize them, I ran all the pieces through abc2abc, and then I deleted some metadata fields that struck me as excessive (commentary, discussions about how to play a piece, sources, authors of the transcription, that sort of thing), which greatly inflated the loss of the combined dataset compared to the spaceless model. (In total, I filtered out abc2abc-generated warnings starting with %, and B: / D: / F: / N: / O: / S: / Z: / w: metadata.) It would have been nice if the metadata had included genre tags for greater control of conditional pieces, akin to my author-based control for GPT-2 or char-RNN poetry, a technique demonstrated at scale by CTRL (Keskar et al) using explicit Reddit metadata, and by Choi et al using autoencoders to do unsupervised learning of musical features which implicitly covers genre, but alas! We'll have to stick with the basics like title / key / meter.
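The metadata filtering described above amounts to dropping lines by prefix. Here is a minimal Python sketch of that step, assuming it runs on text already normalized by abc2abc; the function name is hypothetical, and this is not necessarily how the filtering was actually done.

```python
# Field prefixes named in the text as filtered out: book, discography, file URL,
# notes, origin, source, transcription credit, and lyrics lines.
DROPPED_FIELDS = ("B:", "D:", "F:", "N:", "O:", "S:", "Z:", "w:")


def filter_metadata(abc_text: str) -> str:
    """Drop %-prefixed lines (comments, abc2abc warnings) and the listed metadata fields.

    Caveat: a music line that happens to begin with, say, "B:|" would also be
    dropped; this sketch does not try to tell header lines from body lines.
    """
    kept = []
    for line in abc_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("%") or stripped.startswith(DROPPED_FIELDS):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Title, key, and meter lines (T:, K:, M:) pass through unchanged, which matches the remark about sticking with the basics.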

This required a full week of training (1-7 Dec), down to a higher loss (as expected) but still on the edge of plagiarism:

final GPT-2-117M model trained on the combined dataset

dataset (spaceless The Session + ABCnotation.com scrape)

Examples of the GPT-2-formatted spaceless combined dataset:

X: 1 T: AlongtheMourneShore M: 3/4 L: 1/8 K: G DGA | "G" B3AG2 | "Bm" (dB3) AB | "Em" (BG) (AG) GD | "C" E3G (3 (EDC) | "G" D3GB2 | "Bm" D3GB2 | "Em" (dc3) (BG) | "D7" B3 (DGA) |! “G” B3AG2 | “Bm” (dB3) AB | “Em” (BG) (AG) GD | “C” E3G (ED / C /) | “G” D3GB2 | “Bm” D3GB2 | “C” c3 / d / “D7” BA (GF) | “G” G3: |!
X: 1 T: ElChestrinDimmelsack-PolkaCF (pk
) T: HannyChristen.Bd.VBaselII, Jura.S. 50 (pk ) orig.C, F R: Polka M: 2/4 L: 1 / 43 K: C "C" GEcGecGE | "G7" FAdAf2d2 | BdgdfdBG | "C" ceGecAGE | GEcGecGE | "Dm" FAcfAdAF | "C" EGce "G7" gfdf | "C" e2c2c2z2: | "G7" GBdfedcB | "C" cegc'a2ge | "G7" f2dBG2ef | "C" gecGE2cc | "G7" GBdfedcB | "C" cegc'a2ge | "G7" f2dBG2ed | "C" c2e2c2z2: | K: F |: "F" aca2fac2 | "C7" Bgg2G3F | EGcegbeg | "F" facfAcFA | caA2caG2 | FAcfabc'a | "C7" bgec=Bcde | 1 "F" f2agf2z2: | 2 "F"
X: 6 T: PolkaEbBbAb (5letras) cf.CGF5-Parts P: ABAC M: 2/4 L: 1 / 43 K: EbmFisch [P:A] "Eb" B2BGe2eB | "Ab" c2cAF3E | "Bb7" DB=A ^ CD2D2 | "Eb" EedcB4 | B2BGe2eB | "Ab" c2cAF3E | "Bb7" DB=A ^ CD2D2 | "Eb" E4z2 "_fine" z2 || [P:B]) "Bb" d4f4 | f2edc2B2 | "F7" cdcBA2A2 | "Bb" cdBAG4 | "Eb" d4f4 | f2edcdcB | "F7" cdcBA4 | "Bb" B2B2B2B2 || [P:C] [K:Abmaj] "Ab" EACAE2c2 | c3BA3B | "Eb7" GeGeG2d2 | d3c=BGFE | "Ab" ACEAE2c2 | c3BA2B2 | "Eb7" GeGfGgBa | "Ab" aaa2a2z2 || [P:D] [K:Ebmaj] "Bb7" AGFdc3=B | "Eb" BeGBe4 | "Bb7" B2BF "Eb" G2E2 | "Bb7" FGABc4 "Bb7" AGFdc3=B | "Eb" BeGBe4 | "Bb7" B2B2e3d | "Eb" e3cBcBA | "Ab" AAAAA2e2 |]
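The spaceless formatting seen in examples like these can be approximated with a few lines of Python. This is a sketch, not the procedure actually used: the function name and the keep_metadata_spaces option are hypothetical, and it treats any line starting with a letter and a colon as a metadata field.

```python
import re

FIELD_LINE = re.compile(r"^[A-Za-z]:")  # header/metadata lines such as "T:", "K:", "M:"


def strip_spaces(abc_text: str, keep_metadata_spaces: bool = False) -> str:
    """Delete every space from an ABC piece, optionally leaving metadata lines alone."""
    out = []
    for line in abc_text.splitlines():
        if keep_metadata_spaces and FIELD_LINE.match(line):
            out.append(line)  # e.g. keep "T: Along the Mourne Shore" readable
        else:
            out.append(line.replace(" ", ""))
    return "\n".join(out) + "\n"
```

With the default setting, titles are squashed as well (for example, AlongtheMourneShore), which is how the titles in the examples read.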

Last example rendered as a score:
