
Generating MIDI Music with GPT-2

In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work which used a char-RNN trained on a 'The Session' dataset. A GPT-2 can hypothetically improve on a char-RNN through better global coherence & copying of patterns, without the problems of the hidden-state bottleneck.

I encountered problems with the standard GPT-2 model's encoding of text which damaged results, but after fixing that, I successfully trained it on a corpus of ABC music pieces taken from The Session & ABCnotation.com. The resulting music samples are, in my opinion, quite pleasant.

The ABC folk model & dataset are available for download, and I provide selected music samples for listening, as well as medleys of random samples from throughout training.

We followed the ABC folk model with an ABC-MIDI model: a dataset of ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs. The MIDI pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples, but when sampling succeeds, it can generate even better musical samples.

Background: Folk-RNN

Back in 2015–2016, Bob L. Sturm experimented with generating Irish folk music, using a char-RNN trained on a corpus of folk music transcriptions from The Session (https://thesession.org/), written in a high-level musical format called "ABC notation".

Compact text, perfect for NNs. While ABC notation is written in ASCII, it supports many complex features, and it has been adopted widely by folk musicians, with hundreds of thousands of pieces written or transcribed in it.

Prior success with char-RNNs. Sturm et al scraped tens of thousands of ABC files from The Session and trained a Theano char-RNN called "folk-RNN", putting the code & data online and providing a web interface for generation. In addition to the various research publications, Sturm has also written many blog posts evaluating folk-RNN pieces, such as how well they are played by human musicians.

Transformers?

Google Magenta's Music Transformer (Huang et al), with style control, & OpenAI's MuseNet (Christine Payne), based on Sparse Transformers ("Generating Long Sequences with Sparse Transformers", Child et al 2019), have both demonstrated excellent results in music composition at various timescales/formats, and interesting features like mixing genres.

While messing around with GPT-2-based preference learning in late October 2019, I became curious if folk-RNN could be improved by simply throwing one of the GPT-2 models at it.

GPT-2: a perfect match. Not the large GPT-2 models, of course, which would overfit far too easily or hadn't been released yet, but GPT-2-117M.

GPT-2 is unable to model raw WAV audio, or MIDI, because a meaningful musical piece is a WAV sequence of hundreds of thousands to millions of symbols long, and a MIDI piece is tens of thousands of symbols long, which far exceeds GPT-2's small context window; this is why OpenAI used Sparse Transformers for its MIDI generation, as Sparse Transformers can scale to text with tens of thousands of characters. However, the high-level notation of ABC pieces means they fit just fine into the GPT-2 window.

I had avoided doing anything music-related with GPT-2, focusing instead on my poetry generation, because I assumed OpenAI would be doing a MuseNet followup; but months later, they'd done nothing further, and when I inquired, I got the impression that their music projects were over. So why not?

As for why repeat Sturm's project: there were possible advantages to using GPT-2-117M. The English pretraining could potentially help by providing semantic understanding of, for example, the ABC metadata, such as the difference between two pieces titled a 'jig' versus a 'waltz', or the pseudo-natural-language-ness of the ABC format as a whole.

The Session

So I did apt-get install abcmidi timidity to get the CLI tools to do the ABC → MIDI and MIDI → WAV conversions respectively, and downloaded the folk-RNN repo with its data files.

Pipeline: ABC → MIDI → WAV.
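As a concrete sketch of that pipeline (with a hypothetical sample.abc holding one generated piece): abc2midi, from the abcmidi package, compiles the ABC, and timidity synthesizes the result to WAV.

    # ABC → MIDI (abc2midi is provided by the 'abcmidi' package)
    abc2midi sample.abc -o sample.mid

    # MIDI → WAV (-Ow selects RIFF WAVE output, -o names the output file)
    timidity sample.mid -Ow -o sample.wav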

The data comes in several formats, reflecting their experiments in changing the notation. I used the original format.

The data needed processing for GPT-2 as follows:

1. abc2midi requires every song to have an integer identifier, e.g. 'X: 1', to be a valid ABC file which it can compile to MIDI. I used an Emacs macro to insert an incrementing 'X: $N' before each 'T:' title line, but in retrospect, I could have simply used another search-and-replace to insert 'X: 1' in front of each piece; it's not like the ID has to be unique, we're just satisfying abc2midi (which is a bit picky).

2. Adding <|endoftext|> markers where relevant, so the model understands how to generate separate pieces and avoids 'run-ons'. I used search-and-replace (see the sed sketch after this list).
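A minimal sed sketch of both steps, assuming a hypothetical sessions.abc corpus in which every piece begins with its 'T:' title line (GNU sed, which allows \n in replacements):

    # 1. Insert a dummy 'X: 1' reference number before every title line;
    #    abc2midi only needs *an* integer ID, not a unique one.
    sed -i 's/^T:/X: 1\nT:/' sessions.abc

    # 2. Prefix each piece with GPT-2's document separator so the model
    #    learns where one tune ends and the next begins.
    sed -i 's/^X: 1$/<|endoftext|>\nX: 1/' sessions.abc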

Training (First Model)

Because The Session corpus was so small, I used the smallest available GPT-2 model, GPT-2-117M, and standard settings to train on one of my Nvidia 1080tis:

    CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src ./train.py --dataset thesessions-irishabc.txt.npz \
        --batch_size 5 --model_name irish --save_every --sample_every --learning_rate 0.18 \
        --run_name irish --memory_saving_gradients --noise 0.21 --val_every 768
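After training, random samples can be drawn with the sampling script from the same repo; a rough sketch, assuming the finetuned checkpoint has been copied into models/irish so the stock script can load it:

    # Draw unconditional samples from the finetuned model
    PYTHONPATH=src python src/generate_unconditional_samples.py \
        --model_name irish --nsamples 10 --temperature 1.0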

Training was fairly easy, taking just a few days at most to train down to a loss of 0.85 (380,488 steps at minibatch n = 5), and I killed it in October and looked at the random samples. Straightforward success. They struck me as pretty good, aside from generated pieces often having the same title repeatedly, which apparently was due to The Session posting multiple transcriptions of the same piece, so the model picked up on that and would generate variants on the same theme. Sturm highlighted a few and did some more in-depth commentary on them, with a mixed evaluation, concluding: "So of the five transcriptions above, two are plausible. The polka is actually pretty good! All titles by GPT-2 are plagiarized, but I haven't found much plagiarism in the tunes themselves."

I had worried about plagiarism at that loss level, but since the music itself seemed still far from being copied, I considered further training.

Some datasets are invalid ABC. The additional processed versions of The Session that Sturm et al had made seemed like a good target, but they caused problems when I simply concatenated them in, and I soon discovered why abc2midi now thought all the samples were broken:

    allabcwrepeats_parsed_wot: This is version 3 of the dataset from thesession.org. In this version, we transpose all tunes to have the root C, transpose them all to have the root C#, remove the titles, and make new mode tokens: K:maj, K:min, K:dor, and K:mix.

This turns out to be a problem: K:maj, K:min, K:dor, and K:mix completely break abc2midi! So I did an additional search-and-replace to transform them into valid key signatures like K:Cmaj, K:Cmin, K:Cdor, and K:Cmix.
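That fix is a one-liner per mode; a sed sketch over a hypothetical dataset file, covering only the four mode tokens named above:

    # Turn the bare mode tokens into key signatures abc2midi accepts;
    # the tunes are already transposed to C/C#, so pin the root to C.
    sed -i -e 's/^K:maj/K:Cmaj/' -e 's/^K:min/K:Cmin/' \
           -e 's/^K:dor/K:Cdor/' -e 's/^K:mix/K:Cmix/' allabcwrepeats_parsed_wot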

Retraining, I discovered the earlier loss was far from converged, and with another 28k steps, it could go down substantially further. However, checking random samples by hand, the textual overlap with The Session became particularly large once the loss reached ~0.27 (note that it was not 'overfitting' in the standard sense, since the loss was still decreasing on the validation set), so I backed off to a model with a somewhat higher loss. This seemed to be high-quality without gross plagiarism.

Spaceless Model

I began using that model for the preference-learning work, where preference learning seemed to improve the music more than the poetry, so I shifted my focus to the music.

Anomaly: permanent space-related syntax errors. Puzzlingly, no matter how many ratings I added, and despite the low loss, the generated samples would persistently have basic, blatant syntax errors involving spaces; abc2midi would often warn or even error out on a piece which could be easily fixed by hand by simply removing a few spaces.

This wasted my time during rating, since I couldn't pick samples with syntax problems (even if they'd otherwise sound good) because I didn't want to reinforce generation of invalid samples, and it wasted time while generating music as well.

Discussing it with Shawn Presser, who I was working with simultaneously to train GPT-2-1.5b on poetry, he pointed out that some people, like Nostalgebraist, had had some frustrating problems with the standard GPT-2 BPE encoding.

To explain what BPE is and why it might be a bad thing for ABC notation: GPT-2 does not just feed in raw characters like a char-RNN does, because that makes every input extremely long. (GPT-2 generates space-delimited word fragments.) Instead, it tries to 'chunk' characters into something in between character-sized and word-sized, to get the best of both worlds: a way of writing text where common words are a single symbol but rare words can still be expressed as a couple of symbols rather than dropped entirely, as word-based encodings must do. However, since the default model is trained on English text, the chunking is done assuming normal English whitespace, like spaces between words.

Nostalgebraist notes that the actual BPE implementation used is weird and does not act as you'd expect, especially when spaces are involved. So Presser wondered: does ABC actually require spaces at all?

Workaround: spaces optional! Spaces are only there for the convenience of humans reading & writing ABC. Aside from the metadata fields, if you delete all the spaces, the music should be the same. I was surprised, but this seemed to be true.
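A crude sketch of that space-stripping with sed, assuming a hypothetical corpus.abc in which metadata/header lines start with a letter and a colon (T:, K:, M:, w:, and so on) and every other line is tune body:

    # Delete spaces on tune-body lines only; leave header/metadata lines untouched.
    sed -i '/^[A-Za-z]:/!s/ //g' corpus.abc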

(Presser also did some experiments with creating a brand-new BPE tailored to ABC, which would have reduced the BPE-encoded size of ABC pieces even further.)

Combined Model: The Session + ABCnotation.com

More dakka (data). Presser was interested in expanding the repertoire beyond The Session and began looking at other ABC databases. The biggest by far appeared to be ABCnotation.com. He scraped a random half of its pieces, and I combined them with The Session dataset, for a combined corpus of 129MB.

Simplifying to match The Session ABC. ABCnotation.com pieces are much more diverse in formatting & metadata than The Session, so to homogenize them, I ran all the pieces through abc2abc, and then deleted some metadata fields that struck me as excessive (commentary, discussions about how to play a piece, sources, authors of the transcription, that sort of thing), all of which greatly inflated the loss of the combined dataset compared to the spaceless model. In total, I filtered out abc2abc-generated warnings (lines starting with %) and the B:/D:/F:/N:/O:/S:/Z:/w: metadata fields.
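A sketch of that cleanup pass, assuming the scraped pieces live one-per-file under a hypothetical scrape/ directory:

    # Normalize each piece with abc2abc, then drop abc2abc warnings/comments
    # (lines starting with %) and the excess B:/D:/F:/N:/O:/S:/Z:/w: fields.
    for f in scrape/*.abc; do
        abc2abc "$f" | grep -v '^%' | grep -vE '^[BDFNOSZw]:'
    done > combined-clean.abc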

It would have been nice if the metadata had included genre tags for greater control of conditional pieces, akin to my author-based metadata control for char-RNN & GPT-2 poetry, a technique demonstrated at scale by CTRL ("CTRL: A Conditional Transformer Language Model for Controllable Generation", Keskar et al 2019), which used explicit Reddit metadata, and by Choi et al, who used autoencoders to do unsupervised learning of musical features which implicitly cover genre. But alas! We'll have to stick with the basics like title/key/meter.

This required a full week of training (1-7 December), down to a higher loss (as expected) but still on the edge of plagiarism. Available for download:

- the final GPT-2-117M model trained on the combined dataset
- the GPT-2-formatted spaceless The Session dataset
- the ABCnotation.com scrape
- the GPT-2-formatted spaceless combined dataset

Some samples from the combined model:

    X: 1 T: AlongtheMourneShore M: 3/4 L: 1/8 K: G DGA | "G" B3AG2 | "Bm" (dB3) AB | "Em" (BG) (AG) GD | "C" E3G (3 (EDC) | "G" D3GB2 | "Bm" D3GB2 | "Em" (dc3) (BG) | "D7" B3 (DGA) |! "G" B3AG2 | "Bm" (dB3) AB | "Em" (BG) (AG) GD | "C" E3G (ED / C /) | "G" D3GB2 | "Bm" D3GB2 | "C" c3 / d / "D7" BA (GF) | "G" G3: |!

    X: 1 T: ElChestrinDimmelsack-PolkaCF (pk) T: HannyChristen.Bd.VBaselII, Jura.S. 50 (pk) orig.C, F R: Polka M: 2/4 L: 1/43 K: C "C" GEcGecGE | "G7" FAdAf2d2 | BdgdfdBG | "C" ceGecAGE | GEcGecGE | "Dm" FAcfAdAF | "C" EGce "G7" gfdf | "C" e2c2c2z2: | "G7" GBdfedcB | "C" cegc'a2ge | "G7" f2dBG2ef | "C" gecGE2cc | "G7" GBdfedcB | "C" cegc'a2ge | "G7" f2dBG2ed | "C" c2e2c2z2: | K: F |: "F" aca2fac2 | "C7" Bgg2G3F | EGcegbeg | "F" facfAcFA | caA2caG2 | FAcfabc'a | "C7" bgec=Bcde | 1 "F" f2agf2z2: | 2 "F"

    X: 6 T: PolkaEbBbAb (5letras) cf.CGF5-Parts P: ABAC M: 2/4 L: 1/43 K: EbmFisch [P:A] "Eb" B2BGe2eB | "Ab" c2cAF3E | "Bb7" DB=A^CD2D2 | "Eb" EedcB4 | B2BGe2eB | "Ab" c2cAF3E | "Bb7" DB=A^CD2D2 | "Eb" E4z2 "_fine" z2 || [P:B] "Bb" d4f4 | f2edc2B2 | "F7" cdcBA2A2 | "Bb" cdBAG4 | "Eb" d4f4 | f2edcdcB | "F7" cdcBA4 | "Bb" B2B2B2B2 || [P:C] [K:Abmaj] "Ab" EACAE2c2 | c3BA3B | "Eb7" GeGeG2d2 | d3c=BGFE | "Ab" ACEAE2c2 | c3BA2B2 | "Eb7" GeGfGgBa | "Ab" aaa2a2z2 || [P:D] [K:Ebmaj] "Bb7" AGFdc3=B | "Eb" BeGBe4 | "Bb7" B2BF "Eb" G2E2 | "Bb7" FGABc4 "Bb7" AGFdc3=B | "Eb" BeGBe4 | "Bb7" B2B2e3d | "Eb" e3cBcBA | "Ab" AAAAA2e2 |]

Last example rendered as a score:
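Rendering an ABC piece as sheet music can be done with abcm2ps; a minimal sketch, assuming the last sample above is saved to a hypothetical polka.abc:

    # Typeset the ABC into PostScript sheet music
    abcm2ps polka.abc -O score.ps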
