Some days ago, driven by boredom, I implemented my own Markov chains in Haskell by following this greattutorial. Markov chains can be a way of implementing really fun “dumb” group chat bots, that can generate new random messages that sound realistic based on the previous history of the chat. From Wikipedia:
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
Here’s agreat articleintroducing the concept of Markov chains. In the case of a group chat bot, each state (or node of the graph) is one of the words that was previously sent in the messages, and each probability of transition towards another state (word) is based on the frequency of the first word (transition’s source state) being followed by the second one (the transition’s destination state).
After coding a simple Markov chain on words in Haskell, I’ve noticed that it was really slow and resource intensive even on a few thousand of messages. This was because the model was calculated by summing the frequency of word pairs and it was kept in memory inside of aData.Map
structure.
Although the Haskell implementation can get much faster if optimized, my friendFrancescoshowed me his amazing implementation of Markov chains on words madein plain sh and awk, in ~ 20 lines of code.
markov.sh
His project is split in 2 programs. The first one,mrkfeed.awk
is a really simpleawk
program that separates words on a line into pairs of words, separated line by line.
#! / usr / bin / awk -f { for (i=1; i
For example, let’s take this simple chat log (referred to aschatlog.txt
):
hello everybody hi people hello people how are you how are things going
The first step is creating a model for the Markov chain. Here’s what will be into the model file when we run./ mrkfeed.awk
, (piped intosort
for readability)
are things are you everybody going hello everybody hello people hi people how are how are people people things going you
The next step is generating a new random message with./ mrkwords.sh model.mrkdb 50 tr ' n' '' && echo
At first,mrkwords.sh
will pick a random line from the model and pick the first word of the pair as the first word of our output message. After this, it will filter the model to find what word pairs start with the first word it picked. Let’s say it picked the wordhello
as the first word of the message. It will then randomly choose the second word of the message from the second element of a pair in the model that starts with the first word it chose. In this case, since it pickedhello
as the first word, it may pick one betweeneverybody
andpeople
as the next word. It then repeats this process by passing the last word it chose as the word to choose in the next iteration. It may be even easier to understand in terms of code than in plain words.
The presence of duplicate lines in the model is what gives it the power of weighed probability, perfectly modeling the process of random extraction, making the generated models real Markov chains. Doing this with plain unix tools makesmarkov.sh
incredibly fast, because it is sacrificing disk space by storing duplicates in exchange for a huge improvement in the complexity of the computations. Although Markov chains are often considered complex, this small shell program shines in showing the real underlying simplicity.
Here’s a commented version of./ mrkwords.sh
#! / bin / sh # choose the first argument as the model file or try to open '.mrkdb' file="$ {1: - ~ / .mrkdb}" # $ n is the maximum number of remaining words (iterations) # it is the 2nd argument of this program n="$ {2: -1}" # if present, use $ key (3rd argument) to find pairs starting # with it in the model, you may use this to force a # word as the first word of the message key="$ 3" # if $ key is set print it [ ! -z "$key" ] && echo "$ key" # the max remaining number of words cannot be equal or less to 0 [ "$n" -le 0 ] && exit # if key is not set, set the chosen word to the first element # of a random pair in the model if [ -z "$key" ]; then word=$ (shuf -n 1
In this example it generates a random fortune, modeled from thegoedel
fortunes contained in the famousfortune-mod
package.
# remove the% delimiter in the fortune file and save it as a .txt sed -e '/ ^% $ / d'/share/fortunes/goedel>>goedel.txt # generate the model ./mrkfeed.awk>goedelmodel.mrkdb # get a random fortune! ./mrkwords.sh goedelmodel.mrkdb 50 | tr ' n' '' the dimensionality of computerized fortune-tellers! # again! ./mrkwords.sh goedelmodel.mrkdb 50 | tr ' n' '' the thoughts of metalanguage are still free. # again! ./mrkwords.sh goedelmodel.mrkdb 50 | tr ' n' '' # let's change the model by using all fortunes this time ./mrkwords.sh fortunesmodel.mrkdb 50 | tr ' n' '' low taste, and goin 'insane,
markov.sh
is extremely fast, even on relatively large data sets (millions of lines).
GIPHY App Key not set. Please check settings