Beating C with Dyalog APL: wc, Hacker News

Background

A recent blog post that bubbled up onr / programmingentitledBeating C With 80 Lines Of Haskell : Wcinspired me to see how effective APL would be in solving this problem.

But Why APL?

Why not? I was told by older students during my university years to avoid APL and any internship or co-op opportunities that used it. Turns out that was bad advice, at least for me – I can see how it might not be other folks ’cup of tea, but it is definitely something I like. A few years back I had the opportunity to finally play with an APL derivative K (I got a job at a small K shop). I liked K but had to move on job wise. I recently downloaded a free personal copy ofDyalogand have been playing around with it while reading throughMastering Dyalog APLby Bernard Legrand (only a few chapters in so far). I find APL friendlier to use than K and enjoy it a great deal, so this seemed like a good excuse to figure out how to do file I / O while searching through the book’s PDF and flexing google-fu.

On with the code

Just like Chris Penner’s original article, I’m comparing against the OSX version ofWCthat shipped with my machine. Just like in the original article, I admit that there are likely faster versions ofwc– I’m just comparing what I got.

Counting Words

Counting characters is trivial. Counting lines is almost as trivial, or is as trivial if we countCRLFas two lines – I chose the more trivial solution out of laziness.CRLFcould be counted as one end of line with an additional three lines (warning: guesstimate).

Just as in the original post, the meat of the problem is in accurately counting words. The first step is counting the number of words in a single string. First let’s make two definitions:

nl ← ⎕UCS¨ 10 11 12 13 ⍝ Newline characters sp ← nl, ⎕UCS¨9 32 ⍝ Space characters

The above doesn’t look like “normal” programming languages because APL doesn’t use the standard ASCII character set for its symbols / terms which means none of its names are stolen from you – if you name a variableneworthisorfunctionit won’t clash with anything. Note that it does use ASCIIsymbolslikeor=but there are no keywords using ASCII numbers / letters.

An explanation of the above:

⍝ (denotes a) //style comment, it’s calledlampand itilluminates.

NLandspare variables withnlbeing newline characters andspwhitespace characters. In APL,→is assignment. Here we assign everything to the right ofnlorspto the variablesnlandsp.

¨will create aderived functionwhich will map the functionimmediatelyto the left of¨across the value to the right. Byimmediatelyto the left I mean you read as little as possible to the left to figure out what the lefthand argument is, but the righthand argument is the result of evaluating the (entire) expression to the right, and this is how APL works in general – take as little as possible from the left but everything from the right.

⎕UCShappens to be a function in Dyalog for getting unicode characters from numeric values.

The ravel function,will catenate its lefthand and righthand arguments, sospcontains the newlines inNLalong with tab and space.

Now that we know what constitutes a newline or a space we can figure out how to getwcresults. Assume the input is stored in the variables, we can even give it a concrete value for now:s ← 'I am the model of a modern major general '. To find where the spaces are we use the membership function∊withs∊sp, which yields

0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0

telling us the exact position of any whitespace characters. We can negate the vector (turn1s into0s and vice versa) and use the partition function⊆to collect the words into an array of strings with(~ s∊sp) ⊆s(~is the negation function) and we can count the result using≢ (~ s∊sp) ⊆s. The way the partition function works is it uses the boolean vector to strip out parts of its righthand argument based on runs of1in its lefthand argument (thanks to (u / olzdfor pointing out⊆when I posted this article originally onR / Programming).

We only have one piece of input in the above calculation and so it is easy to turn it into a function. In APL we can make adirect functionby using braces (I won’t go into the differences in the two function types here but know that the book references above goes into some details). So we can store this calculation in a variable with:

In APL⍺(alpha) and⍵(omega) are the names of the left and right arguments to a direct function. We apply a function just like we do the built in primitive functions, we give it parameters:words s. One thing we can do is makewordsa bit more general and have it take (any) boolean vector and count the partition of that vector (we don’t need the characters, after all) , and this then gives us:

And so when we apply it to our string we will do it withwords s∊sp. Just like with the built in parameters,wordswill be evaluated on the entire expression to the right which iss∊sp. This yields9. One thing to keep in mind is testing edge cases and, for the empty string we get0.

WC via Partition (⊆)

Now that we can count words in a string we can count words in a file. We could just do this by reading in the entire file all at once and applying thewordsfunction and then counting the number of characters and the number of newlines. But that would be a bad idea – what if the file is 123 GB in size ?! Clearly we don’t want to read that in all at once, instead we want to read it in a few blocks at a time. We’re going to need a loop, and for that we need another type of function, aprocedural function(again see the referenced book for details).

One computation issue, as mentioned in the original Haskell based blog post, is what happens if we split up the input in the middle of a word? Well, we just use the same technique used in that blog post. No, not monoids (though that is cool), just keeping track if the last character we saw was a non-space or not. If it was and the first new character isalsoa non-space then we subtract one word from the count as we accumulate.

With that in mind, here’s the first attempt

res ← wc fn; c; w; l; fd; data; blk; nl; sp; lnsp; words words ← {≢ (~ ⍵) ⊆⍵} Blk ← 256 × 1024 nl ← ⎕UCS¨ 10 11 12 13 sp ← nl, ⎕UCS¨9 32 c → w → l → 0 fd ← fn ⎕NTIE 0 lnsp ← 0 Repeat     data → ⎕NREAD fd 80 blk     : If 0=≢data         : Leave     : EndIf     c ← c   (≢data)     l ← l    / data∊nl     data → data∊sp     w ← (w   words data) - (lnsp∧ ~ 1 ↑ data)     lnsp ← ~ (¯1) ↑ data : EndRepeat ⎕NUNTIE fd res ← l w c

We define the functionwcwith inputfn(filename) and outputres(result). The names afterfnare the names of variables we want to belocalwhich happens to be every variable we are using. If we left a variable out, it would be assumed global and be able to change global state. An explanation:

The first line, as just described, declares the functionWCwith inputfn, outputres, and a handful of local variables.

wordsis the word counting function we derived above.

blkis the block size we will use while we read from the file (fn) .

NLandspare the newline and whitespace vectors we defined originally.

C,w, andlare character, word, and line counters initially set to (0) .
fdis the file descriptor forfn(next available since we passed (0) ).

lnspis1if the last character read during the previous iteration was not a space, (0) otherwise .

Now we enter a loop with: Repeat, the body of which is everything between: Repeatand: EndRepeat.: Leave (is like) ********************** (break) in other languages.

datais assigned the next block of data read fromFD.80tells the⎕NREADfunction that we want to read characters (intuitive, right?).atadatacounts the number of items indataand if this is zero (iedatais empty) we exit the loop. Otherwise we add that count toc.

We then usedata∊nlto find the newlines indataand sum these up with /and add this sum tol.

We no longer need the contents ofdatawe only need to know where whitespace is, so we store a boolean vector describing this indata. Passing this towordscounts the number of whitespace terminated words in (data) , which we add tow [5] . But then, to account for word splitting, we use1 ↑ datato get whether the first character we readthis iterationwas a space and then negate that with~; we use∧toandthat against whether the last character in theprevious iterationwasnota space, and subtract that from the count. As in C,0is false and so (1∧1) is1and we only subtract1if the previous iteration's data ended with a nonspaceandthis iteration's data started with one.

We then store whether this iteration’s data terminated with a nonspace character.

Finally we close the file descriptor after the loop and return the line, word, and character counts inres.

Performance Measurements

Usinga user defined timing operatorwe can time theWCfunction. On a 1. 661 GB file (which is just thebig.txtfile from the original Haskell post’s repository repeated multiple times), I get (on my (inch early 2k) MacBook Air (initially a run of 32. 97 seconds and about the same on immediate subsequent runs. Using thetimeterminal utility to runwcagainst the same file I getuser timesranging from 5. (s to 5.) s, so this first attempt is definitely slow compared to OSX’s (wc) utility. If I comment out the line wherewis calculated my initial running time on that same file is 4. 32 seconds with subsequent runs hovering around 1. 85 seconds, so clearly our word calculation is the culprit here (as was to be expected).

Counting Words with Array Arithmetic

Let’s call the vectors∊sp (vector) ********************** (f) forfound:f ← s∊sp. We don’t want to just count the ones in this vector as then consecutive whitespace characters would each count as a word; what we want to count is when we switch from non-whitespace to whitespace (or vice-versa), and we can accomplish this with the difference between adjacent values. This is easy in APL, it's just(1 ↓ f) - (¯1) (f:

¯is “high minus” and is used to negate literals.

↓is the drop function which drops a number of items from the value to the right . If the number on the left is positive, it drops that many from thefrontof the value, if it is negative then it drops the absolute value of that many from the (back) of the value. So1 fisfwithout the first item and(¯1) ↓ fisfwithout the last item.

-is subtraction, so we are subtracting values inffrom the values to their left. For this particularfwe get the following:
```
1 ¯1 0 1 ¯1 0 0 1 ¯1 0 0 0 0 1 ¯1 0 1 ¯1 1 ¯1 0 0 0 0 0 1 ¯1 0 0 0 0 1 ¯1 0 0 0 0 0 0
```

This vector is one elementshorter (than) ********************** (f) – iffwere a vector of five items with values 0 1 0 0 1then we would have calculated(1 0 0 1) - (0 1 0 0)after the drops, which would be1 ¯1 0 1, a vector of four items. Back to the above result, we can find all the places where1appears simply with1=(1 f) - (¯ 1) (fwhich compares every element in the vector to the value1, yielding a vector of the same size as the shorter “Difference” vector, and this vector has a1where a difference of1was found and(0) elsewhere:

1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0

This represents the positions where a space followed a non-space (the first1corresponds to the space betweenIandam). Summing this up gives us the number of places a wordterminatedby being followed by a space, and we can get that sum with the function /. Just as¨created aderived function (so does) /- both are calledoperators. As¨was associated with the commonly named functionmapso is/associated with the commonly named functionreduce/fold, and so /is a sum function.

For / 1=(1 f) - ( ¯1) ↓ fwe get8, but'I am the model of a modern major general'has (9) *********************** (words, not8. The problem is that we were only counting words terminated by a space, not by the end of the string. To count that, we must check if the last character is a space or not and if it’s not then we can count the end of the string as terminating a word. We can get the whether the last character was a space or not with(¯1) ↑ f, which will be (1) *********************** (if) ********************** (s) ended with a space or0if not. Since we want to add exactly one word to the count when a stringdoes notend with a space we negate this with~and add that. Thus the word count expression is:

(~ (¯1)) f)    / 1=(1) f) - (¯1) ↓

Turning this into a function yields:

words ← {(~ (¯1) ↑ ⍵)    / 1=(1 ↓ ⍵) - (¯1) ↓ ⍵}

One problem is that the calculation is wrong for the empty string: (words ” ∊spgives1not0. This won’t be an issue for us because we won’t even be invoking it on empty strings.

WC via Array Arithmetic

Here’s what ourwcfunction looks like with this change (only the second line changed):

res ← wc fn; c; w; l; fd; data; blk; nl; sp; lnsp; words words ← {(~ (¯1) ↑ ⍵)    / 1=(1 ↓ ⍵) - (¯1) ↓ ⍵} Blk ← 256 × 1024 nl ← ⎕UCS¨ 10 11 12 13 sp ← nl, ⎕UCS¨9 32 c → w → l → 0 fd ← fn ⎕NTIE 0 lnsp ← 0 Repeat     data → ⎕NREAD fd 80 blk     : If 0=≢data         : Leave     : EndIf     c ← c   (≢data)     l ← l    / data∊nl     data → data∊sp     w ← (w   words data) - (lnsp∧ ~ 1 ↑ data)     lnsp ← ~ (¯1) ↑ data : EndRepeat ⎕NUNTIE fd res ← l w c

Performance Measurements

On the same 1. 661 GB file as before I initially get a run of 3. 36 seconds, but then 2. 75 s, 2. 34 s, 2. 36 s on immediate subsequent runs now that the computer is pumped and primed. Using thetimeterminal utility to runwcagainst the same file I (again) getuser timesranging from 5. 345 s to 5. 549 s, so this attempt is faster and we are done. I get similarly scaled differences for smaller files.

Splitting It All Up

We can split our procedural function up into a few direct functions and an operator which might make it easier to understand and maintain (or maybe not). We start with a function computing thewcstats on a string together with whether the string’s first and last character is not a space:

blockCount ← {     l ←   / ⍵∊⎕UCS¨ 10 11 12 13 ⍝ #lines     s ← ⍵∊⎕UCS¨9 10 11 12 13 32 ⍝ is space     w ← (~ (¯1) s)    / 1=(1 s) - (¯1) s ↓ words     ⍝ first-non-space, chars, words, lines, last-non-space     (~ 1 s) (≢⍵) w l (~ (¯1) s) }

We directly use the vector of newlines, passing***UCS¨ 10 11 12 13as the righthand argument to the membership function∊. The lefthand argument is the righthand argument passed toblockCount.

sis the boolean vector telling us where spaces ( (⎕UCS¨9) 11 12 13 32) occur in the righthand argument to (blockCount.

Fromswe calculate the words inblockCount‘s righthand argument directly, just as we did before.

Finally we return whether the first character is not a space, the number of characters , the number of words, the number of lines, and whether the last character is not a space.

We can applyblockCountto multiple strings. Just like the original Haskell version we need some way to combine the results of applyingblockCountto sequential strings from a file, which we do with the following:

blockCollapse → {     adjustment ← 0 0 (-   / (1 ↓ ⍵ [;1]) ∧ (¯1) ↓ ⍵ [;5]) 0 0     collapsed ← (⍵ [1;1]), (  ⌿⍵) [2 3 4], (⍵ [≢⍵;5])     adjustment   collapsed }

The assumption here is that the results fromblockCountare stored in a matrix where each row is a result, with the values as described as above (whether the first character is a nonspace, etc…). When we collapse a sequence ofblockCountresults we want a result of the same form asblockCount. This will be the following:

Whether the first result in the matrix represented a stringstartingwith a nonspace – that value is⍵ [1;1].

The sum of all the characters from all the results.

The sum of all the words from the results minus the number of times a word was split.

The sum of all the lines from all the results.

Whether the last result in the matrix represented a string (ending) with a nonspace – that value is⍵ [≢⍵;5].

The sums in the middle can be calculated if we reduce additiondownthe input matrix.( ⌿⍵)will calculate that sum, and( ⌿⍵) [2 3 4]will be the middle three values of that sum (we don’t need the sums of whether the strings started or ended with nonspace characters).

The value we have to subtract – the number of times a word was split – is calculated by taking the sum of whether a string started with a nonspaceandedwith whether the string before ended with one. That is the value / (1 ↓ ⍵ [;1]) ∧ (¯1) ↓ ⍵ [;5]).(1 ↓ ⍵ [;1])is the first column of our results matrix (which represents whether strings started with a nonspace) without the value from the first result;(¯1) ↓ ⍵ [;5]is the last column (representing whether strings ended with a nonspace) without the value from the last result. We and these together with∧and sum that with /to get the number of times we split a word; Prepending-negates the value and placing it in the middle of a vector of four zeroes lets us just add it to our calculated result, adjusting the word count appropriately.

Now we add a user defined operator to handle looping over reading a file. The below does this:

v ← (f rc) fd; data v ← ⍬ Repeat     data → ⎕NREAD fd 80 (1024 × 1024)     : If 0

vis the output of the operator, we need this for technical reasons but the value won’t be used. Our operator accepts a computation to use on each string on its left and a file descriptor on its right and loops until it has consumed all the data in the file.

Finally, we can bring this all together in a (possibly) simpler implementation ofwcwhich relies on the computations we’ve defined. Since we no longer need to use looping or conditionals inwc(all that work is done inrc), we can makewca direct function:

wc ← {     res ← 1 5⍴0     update ← {         val ← res, [1] blockCount ⍵         res∘ ← val         val     }     fd ← ⍵ ⎕NTIE 0     _ ← update rc fd     _ ← ⎕NUNTIE fd     (blockCollapse res) ******* }

We start by makingresa zero-filled matrix of one row by five column. Then we define a helper functionupdatewhich will updateresusingblockCount., [1]lets us use ravel along its lefthand argument’s first axis so thatval (will be) ********************** (res) with the result fromblockCount ***appended as the last row.∘ ← (lets us update) ********************** (res) from withinupdateand we returnval.

Just like the built in operators we put our operated function on the left ofrcwhich creates a derived function that accepts the file descriptorFD.RCwill iterate over the contents offdand successively apply update. We have to assign the result (or just use it somehow) as the first unused computation of a direct function is its return value, and we don’t want an early return.

Finally, we usedblockCollapseto compress the result matrix down into the desired answer and pick out the values we want with[4 3 2].

Space Utilization

The above split up implementation has one drawback – we accumulate the results into a matrix whose size grows linearly as a function of the input file. Granted, it has agreatconstant factor since we collapse each megabyte down to only five numbers, but still if we run this on a terabyte file we will likely use at least 20 MB just for storing the results. That doesn’t sound bad, but our earlier version ran in constant space and we can reclaim that here, too.

The main change comes in thercoperator which now looks like this:

res ← fd (m rc r) init; data res → init Repeat     data → ⎕NREAD fd 80 (1024 × 1024)     : If 0

We are now operating on two functions –mto map the file data andrto fold new data into an accumulated result – ie we are doing a reduce / fold across a transformation of the file’s contents. We also now accept the file descriptorfdon the left so that we can easily accept an initial accumulation value on the right.

One thing we won’t have to change is (blockCount) , it is the same as the above. We do have to changeblockCollapsethough and since we no longer need to collapse a whole matrix of results but instead need to add two results together, we can have a much simpler functionblockAdd ← {(⍺ [1], (⍺ ⍵) [2 3 4], ⍵ [5])) - (0 0 (⍺ [5] ∧⍵ [1]) 0 0)}which does just that.

Sincercis now accumulating results we don’t really need to be updating a result inwcwhich now has the following sleek form:

wc ← {     blockAdd ← {(⍺ [1], (⍺   ⍵) [2 3 4], ⍵ [5]) - - 0 0 (⍺ [5] 0 [1]) 0 0)}     fd ← ⍵ ⎕NTIE 0     res ← fd blockCount rc blockAdd 5⍴0     _ ← ⎕NUNTIE fd     res [4 3 2] }

We useblockCountto map data blocks in the file to statistics for that block then useblockAddto combine them, all handled by our operatorRC. We now have regained constant memory space (the size of which depends on how much data we read at once, here I chose one megabyte), and it’s starting to have the feel of the monoid solution in the original Haskell article.