In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work which used a char-RNN trained on a ‘The Session’ dataset. A GPT-2 hypothetically can improve on an RNN by better global coherence & copying of patterns, without problems with the hidden-state bottleneck. The resulting music samples are in my opinion quite pleasant. The ABC folk model & dataset are available for download, and I provide for listening selected music samples as well as medleys of random samples from throughout training.

We followed the ABC folk model with an ABC-MIDI model: a large dataset of ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs. The MIDI-derived pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples, but when sampling succeeds, it can generate even better musical samples.
In 2016, Bob L. Sturm experimented with generating Irish folk music using a char-RNN trained on a corpus of folk music transcribed by The Session community in a high-level musical format called ‘ABC notation’. (The Session is a community site dedicated to Irish traditional music, where one can find tunes to play, find sessions to play them in, join discussions about the music, find events like concerts and festivals, or explore the track listings of recordings.) ABC notation is compact text — perfect for NNs.

Sturm et al scraped tens of thousands of ABC files from The Session and trained a Theano char-RNN called ‘folk-RNN’. In addition to the various research publications, Sturm has also written many blog posts evaluating folk-RNN pieces.

That was a long time ago, however, and DL has since seen a paradigm shift in sequence modeling away from char-RNNs to CNNs and attention-based Transformer models, most famously GPT-2. Google Magenta’s Music Transformer (Huang et al 2018), an attention-based NN which generates music with improved long-term coherence (in contrast to an LSTM-based model like Performance RNN, which compresses earlier events into a fixed-size hidden state, the Transformer has direct access to all earlier events), and OpenAI’s MuseNet, based on the Sparse Transformer (“Generating Long Sequences with Sparse Transformers”, Child et al 2019), have both demonstrated excellent results in music composition at various timescales/formats, and interesting features like mixing genres.
In late October 2019, while messing around with GPT-2 preference learning (training a reward model on human ratings of samples, then using it as the ‘environment’ for a blackbox DRL algorithm like PPO, to encourage global properties like coherence & lack of repetition; see my “GPT-2 Preference Learning for Music and Poetry Generation”), I became curious whether folk-RNN could be improved by simply throwing one of the GPT-2 models at it.
GPT-2: a perfect match. Not the large models, of course, which would overfit far too easily or hadn’t been released yet, but GPT-2-117M. GPT-2 is unable to model raw audio or MIDI, because a meaningful musical piece as a WAV is a sequence of hundreds of thousands to millions of symbols long, and as a MIDI piece is tens of thousands of symbols long, which far exceeds GPT-2’s small context window (this is why OpenAI used Sparse Transformers for its MuseNet MIDI generation, as Sparse Transformers can scale to text with tens of thousands of characters). However, the high-level notation of ABC pieces means they fit just fine into the GPT-2 context window.
I had avoided doing anything musical with GPT-2, focusing on my poetry generation instead, because I assumed OpenAI would be doing a MuseNet followup; but months later, they’d done nothing further, and when I inquired, I got the impression that their music projects were over. So why not?
As for why repeat Sturm’s project — there were two possible advantages to using GPT-2-117M:

1. Improved global coherency: I thought the Transformer might work particularly well on the ABC format, because RNNs suffer from persistent ‘forgetting’ issues, where it is difficult for the RNN to persist its memory of past generated sequences, making it hard for an RNN to repeat a theme with variants, while a GPT-2 Transformer has a context window of 1024 BPEs — much longer than almost every ABC piece.
2. English metadata understanding: the English pretraining could potentially help by providing semantic understanding of, e.g., the ABC metadata fields such as the titles.
Data: The Session
So I did `apt-get install abcmidi timidity` to get the CLI tools to do the ABC → MIDI → WAV conversions respectively, and downloaded the folk-RNN repo with its data files.

The data comes in several formats, for their experiments in changing the notation. I used the original format, with tens of thousands of songs.
There was stray HTML in the data which had to be removed; I used search-and-replace, and reported the issue. A subtler problem is that each piece needs an `X:` index field to be a valid ABC file which `abc2midi` can compile to MIDI. I used an Emacs macro (which can insert an incrementing integer) to insert an `X: $N` before each `T:` title line, but in retrospect, I could have simply used another search-and-replace to insert `X: 1` in front of each piece — it’s not like the ID has to be unique; we’re just satisfying `abc2midi`, which is a bit picky. And, as usual for any neural model like a char-RNN or GPT-2, it is important to insert `<|endoftext|>` markers where relevant, so it understands how to generate separate pieces and avoids ‘run on’ pieces. I used search-and-replace.
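For the indexing step, a single GNU sed pass would also have worked; a hypothetical sketch (not the command actually used), taking the lazy constant-ID route:

```shell
# Prefix every T: title line with a constant "X: 1" index field;
# abc2midi only requires that an X: field exists, not that IDs be unique.
# (Caveat: a piece with multiple T: lines would get multiple X: fields.)
printf 'T:The Maid Behind the Bar\nM:4/4\n' |
  sed 's/^T:/X: 1\nT:/'
```

GNU sed interprets `\n` in the replacement as a newline; a strictly POSIX sed would need an embedded literal newline instead.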
Because the Session corpus was so small (just a few megabytes), I used the smallest available model, GPT-2-117M, with standard settings, training on one of my Nvidia 1080tis: the usual `train.py` finetuning invocation on the `thesessions-irishabc.txt.npz` dataset with `--run_name irish`, `--memory_saving_gradients`, a small amount of gradient `--noise`, and periodic validation & sampling.

Training was fairly easy, taking just a few days at most to reach a low loss, and I killed it in October and looked at the random samples. Straightforward success: they struck me as pretty good, aside from generated pieces often having the same title repeatedly, which apparently was due to The Session posting multiple transcriptions of the same piece, so the model picked up on that and would generate variants on the same theme. Sturm highlighted a few and did some more in-depth commentary on them, with a mixed evaluation, concluding: “So of the five transcriptions above, two are plausible. The polka is actually pretty good! All titles by GPT-2 are plagiarized, but I haven't found much plagiarism in the tunes themselves.”
I was worried about plagiarism and had thought that loss level would be safe, but the music itself seemed still far from being copied, so I considered further training.

Some datasets are invalid ABC. The additional processed versions of The Session that Sturm et al had made seemed like a target, but they caused problems when I simply concatenated them in, and I soon discovered why `abc2midi` now thought all the samples were broken:
“allabcwrepeats_parsed_wot”: version 3 of the dataset from thesession.org, in which all tunes are transposed to have the root C, transposed again to have the root C♯ (doubling the number of transcriptions), the titles are removed, and new mode tokens are introduced: K:maj, K:min, K:dor, and K:mix.
This turns out to be a problem: K:maj, K:min, K:dor, and K:mix completely break `abc2midi`! So I did additional search-and-replaces to transform them into valid key signatures like K:Cmaj, K:Cmin, K:Cdor, and K:Cmix.
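A sketch of an equivalent sed pass (my reconstruction; the actual fix was interactive search-and-replace):

```shell
# Map the bare mode tokens onto key signatures rooted on C, which
# abc2midi accepts: "K:maj" becomes "K:Cmaj", "K:dor" becomes "K:Cdor", etc.
printf 'K:maj\nK:dor\nABC|def\n' |
  sed -e 's/^K: *maj$/K:Cmaj/' -e 's/^K: *min$/K:Cmin/' \
      -e 's/^K: *dor$/K:Cdor/' -e 's/^K: *mix$/K:Cmix/'
```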
Retraining, I discovered that the model had been far from converged, and with additional training the loss could go considerably lower. However, checking random samples by hand, the textual overlap with The Session became particularly large as the loss dropped (note that this was not ‘overfitting’ in the standard sense, since the loss was still decreasing on the validation set), so I backed off to an earlier checkpoint. This seems to be high-quality without gross plagiarism.
I began using that model for the preference learning work, where I found that preference learning seemed to improve music more than the poetry, so I began focusing on the music.
Anomaly: permanent space-related syntax errors. Puzzlingly, no matter how many ratings I added, and despite the low loss, the generated samples would persistently have basic, blatant syntax errors involving spaces; `abc2midi` would often warn or even error out on a piece which could easily be fixed by hand simply by removing a few spaces. This wasted my time during rating, since I couldn’t pick samples with syntax problems (even if they’d otherwise sound good) because I didn’t want to reinforce generation of invalid samples, and it also wasted time while generating music.
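Since (as discussed below) ABC treats spaces outside the metadata fields as optional, such errors can be fixed mechanically. A hypothetical cleanup pass in sed, where the ‘single letter plus colon’ metadata-line heuristic is my own assumption:

```shell
# Delete all spaces on every line that is NOT a metadata field
# (a single letter followed by a colon); headers like "T:" stay readable.
printf 'T:Foo Bar\n"G" B3AG2 | "Bm" dB3 AB\n' |
  sed '/^[A-Za-z]:/!s/ //g'
```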
Discussing it with Shawn Presser, with whom I was simultaneously working to train GPT-2-1.5b on poetry, he pointed out that some people, like Nostalgebraist, had had frustrating problems with GPT-2’s standard BPE encoding.

To explain what BPE is and why it might be a bad thing for ABC notation: GPT-2 does not just feed in raw characters like a char-RNN does, because that makes every input extremely long. Instead, it tries to ‘chunk’ characters into something in-between character-sized and word-sized, to get the best of both worlds: a way of writing text where common words are a single symbol but rare words can still be expressed as a couple of symbols rather than deleted entirely as word-based encodings must. However, since the default model is trained on English text, the chunking assumes normal English whitespace, like spaces between words: GPT-2 generates space-delimited word fragments. And Nostalgebraist notes that the actual BPE implementation used is weird and does not act as you’d expect, especially when spaces are involved. So Presser wondered if the BPE encoding was sabotaging GPT-2’s ABC music.

Workaround — spaces optional! It turns out that ABC does not require spaces: they are only there for the convenience of humans reading & writing ABC. Aside from the metadata fields, if you delete all spaces, the music should be the same. I was surprised, but this seemed to be true. (Presser did some experiments with creating a brand-new BPE encoding tailored to ABC, which would have substantially reduced the BPE size of ABC pieces.)

Combined Model: The Session + ABCnotation.com
More dakka (data). Presser was interested in expanding the repertoire beyond The Session and began looking at other ABC databases. The biggest by far appeared to be ABCnotation.com. He scraped a random half of it, and I combined those pieces with the duplicated Session dataset into a much larger combined corpus.
Simplifying to match The Session ABC. ABCnotation.com pieces are much more diverse in formatting & metadata than The Session. To homogenize them, I ran all the pieces through `abc2abc`, and then I deleted some metadata fields that struck me as excessive: commentary, discussions about how to play a piece, sources, authors of the transcription, that sort of thing, which greatly inflated the loss of the combined dataset compared to the spaceless model. (In total, I filtered out `abc2abc`-generated warnings starting with `%`, and `B:`/`D:`/`F:`/`N:`/`O:`/`S:`/`Z:`/`w:` metadata fields.)

It would have been nice if the metadata had included genre tags, for greater control of conditional pieces, akin to my author-based control for GPT-2 poetry (a technique demonstrated at scale by CTRL, Keskar et al 2019, using explicit Reddit metadata) or Choi et al’s use of autoencoders for unsupervised learning of musical features which implicitly cover genre, but alas! We’ll have to stick with the basics like title/key/meter.
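The field-stripping can be sketched as a single grep invocation (a reconstruction rather than the exact command; the field glosses follow the ABC standard):

```shell
# Drop abc2abc warning lines (starting with "%") and the noisier metadata
# fields: B: book, D: discography, F: file URL, N: notes, O: origin,
# S: source, Z: transcriber, w: lyrics. Title/key/meter fields survive.
printf 'T:Foo\n%% warning from abc2abc\nZ:transcribed by A. Nonymous\nK:G\nABC|\n' |
  grep -v -E '^(%|B:|D:|F:|N:|O:|S:|Z:|w:)'
```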
This required a full week of training (1–7 December 2019), down to a higher loss (as expected) but still on the edge of plagiarism. Downloads:

- final GPT-2-117M model trained on the combined dataset
- GPT-2-formatted spaceless The Session dataset
- ABCnotation.com scrape
- GPT-2-formatted spaceless combined dataset
Examples of generated ABC (note the lack of spaces):
```
X:1
T:AlongtheMourneShore
M:3/4
L:1/8
K:G
DGA|"G"B3AG2|"Bm"(dB3)AB|"Em"(BG)(AG)GD|"C"E3G(3(EDC)|"G"D3GB2|"Bm"D3GB2|"Em"(dc3)(BG)|"D7"B3(DGA)|!
"G"B3AG2|"Bm"(dB3)AB|"Em"(BG)(AG)GD|"C"E3G(ED/C/)|"G"D3GB2|"Bm"D3GB2|"C"c3/d/"D7"BA(GF)|"G"G3:|!

X:1
T:ElChestrinDimmelsack-PolkaCF(pk)
T:HannyChristen.Bd.VBaselII,Jura.S.50(pk)orig.C,F
R:Polka
M:2/4
L:1/43
K:C
"C"GEcGecGE|"G7"FAdAf2d2|BdgdfdBG|"C"ceGecAGE|GEcGecGE|"Dm"FAcfAdAF|"C"EGce"G7"gfdf|"C"e2c2c2z2:|
"G7"GBdfedcB|"C"cegc'a2ge|"G7"f2dBG2ef|"C"gecGE2cc|"G7"GBdfedcB|"C"cegc'a2ge|"G7"f2dBG2ed|"C"c2e2c2z2:|
K:F
|:"F"aca2fac2|"C7"Bgg2G3F|EGcegbeg|"F"facfAcFA|caA2caG2|FAcfabc'a|"C7"bgec=Bcde|1"F"f2agf2z2:|2"F"

X:6
T:PolkaEbBbAb(5letras)cf.CGF5-Parts
P:ABAC
M:2/4
L:1/43
K:EbmFisch
[P:A]"Eb"B2BGe2eB|"Ab"c2cAF3E|"Bb7"DB=A^CD2D2|"Eb"EedcB4|B2BGe2eB|"Ab"c2cAF3E|"Bb7"DB=A^CD2D2|"Eb"E4z2"_fine"z2||
[P:B]"Bb"d4f4|f2edc2B2|"F7"cdcBA2A2|"Bb"cdBAG4|"Eb"d4f4|f2edcdcB|"F7"cdcBA4|"Bb"B2B2B2B2||
[P:C][K:Abmaj]"Ab"EACAE2c2|c3BA3B|"Eb7"GeGeG2d2|d3c=BGFE|"Ab"ACEAE2c2|c3BA2B2|"Eb7"GeGfGgBa|"Ab"aaa2a2z2||
[P:D][K:Ebmaj]"Bb7"AGFdc3=B|"Eb"BeGBe4|"Bb7"B2BF"Eb"G2E2|"Bb7"FGABc4|"Bb7"AGFdc3=B|"Eb"BeGBe4|"Bb7"B2B2e3d|"Eb"e3cBcBA|"Ab"AAAAA2e2|]
```
Last example rendered as a score:

[Score for “PolkaEbBbAb(5letras)cf.CGF5-Parts”, an ABC music sample generated by GPT-2-117M trained on the combined ABC dataset]

Samples

An ABC sample is not playable on its own; it must be converted to MIDI, and then the MIDI can be played. If one is looking at individual samples being generated by the model, a quick CLI way to play a sample, and then to dump it to an OGG Vorbis file, might be:

```
abc2midi sample.abc -o /dev/stdout | timidity -
abc2midi sample.abc -o /dev/stdout | timidity - -Ov -o sample.ogg
```
First Model Samples

Extracting multiple ABC samples, converting, and merging them into a single long piece of music is somewhat more challenging, and I reused parts of my preference-learning rating script for that.

GPT-2-117M random samples, first model trained on The Session:

- “Paddywhack”: generated title & sample
- “The Bank of Turf”: sample
- “Hickey’s Tune”: sample

Spaceless Model Samples

- “The Loon and his Quine”: sample
- “The Atlantic Roar”: sample
- “The Lonely Fireside”: sample
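The extract-convert-merge step can be sketched roughly as follows; this is my own reconstruction (not the rating-script code), with hypothetical filenames, and the commented conversion loop assumes `abcmidi`/`timidity` are installed:

```shell
# Demo dump: two generated pieces separated by GPT-2's <|endoftext|> marker.
printf 'X: 1\nT:FirstTune\n<|endoftext|>\nX: 1\nT:SecondTune\n' > samples.txt

# Split the dump on separator lines into one .abc file per piece.
awk '$0 == "<|endoftext|>" { n++; next } { print > ("piece_" n ".abc") }' n=0 samples.txt

# Each piece can then be converted and the audio concatenated into a medley:
# for f in piece_*.abc; do abc2midi "$f" -o /dev/stdout | timidity - -Ow -o "${f%.abc}.wav"; done
```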
I enjoyed the model’s renditions of the “Yn Bollan Bane” jig when I came across it, and so I used conditional generation to generate variations on it: “Variants on ‘Yn Bollan Bane’”.

Combined Model Samples

Random samples from the combined GPT-2-117M model: