Background
A recent blog post that bubbled up onr / programmingentitledBeating C With 80 Lines Of Haskell : Wcinspired me to see how effective APL would be in solving this problem.
But Why APL?
Why not? I was told by older students during my university years to avoid APL and any internship or co-op opportunities that used it. Turns out that was bad advice, at least for me – I can see how it might not be other folks ’cup of tea, but it is definitely something I like. A few years back I had the opportunity to finally play with an APL derivative K (I got a job at a small K shop). I liked K but had to move on job wise. I recently downloaded a free personal copy ofDyalogand have been playing around with it while reading throughMastering Dyalog APLby Bernard Legrand (only a few chapters in so far). I find APL friendlier to use than K and enjoy it a great deal, so this seemed like a good excuse to figure out how to do file I / O while searching through the book’s PDF and flexing google-fu.
On with the code
Just like Chris Penner’s original article, I’m comparing against the OSX version ofWC
that shipped with my machine. Just like in the original article, I admit that there are likely faster versions ofwc
– I’m just comparing what I got.
Counting Words
Counting characters is trivial. Counting lines is almost as trivial, or is as trivial if we countCRLF
as two lines – I chose the more trivial solution out of laziness.CRLF
could be counted as one end of line with an additional three lines (warning: guesstimate).
Just as in the original post, the meat of the problem is in accurately counting words. The first step is counting the number of words in a single string. First let’s make two definitions:
nl ← ⎕UCS¨ 10 11 12 13 ⍝ Newline characters sp ← nl, ⎕UCS¨9 32 ⍝ Space characters
The above doesn’t look like “normal” programming languages because APL doesn’t use the standard ASCII character set for its symbols / terms which means none of its names are stolen from you – if you name a variablenew
orthis
orfunction
it won’t clash with anything. Note that it does use ASCIIsymbolslikeor
=
but there are no keywords using ASCII numbers / letters.
An explanation of the above:
⍝ (denotes a) //
style comment, it’s calledlampand itilluminates.NL
andsp
are variables withnl
being newline characters andsp
whitespace characters. In APL,→
is assignment. Here we assign everything to the right ofnl
orsp
to the variablesnl
andsp
.¨
will create aderived functionwhich will map the functionimmediatelyto the left of¨
across the value to the right. Byimmediatelyto the left I mean you read as little as possible to the left to figure out what the lefthand argument is, but the righthand argument is the result of evaluating the (entire) expression to the right, and this is how APL works in general – take as little as possible from the left but everything from the right.⎕UCS
happens to be a function in Dyalog for getting unicode characters from numeric values.- The ravel function
,
will catenate its lefthand and righthand arguments, sosp
contains the newlines inNL
along with tab and space.
Now that we know what constitutes a newline or a space we can figure out how to getwc
results. Assume the input is stored in the variables
, we can even give it a concrete value for now:s ← 'I am the model of a modern major general '
. To find where the spaces are we use the membership function∊
withs∊sp
, which yields
0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
telling us the exact position of any whitespace characters. We can negate the vector (turn1
s into0
s and vice versa) and use the partition function⊆
to collect the words into an array of strings with(~ s∊sp) ⊆s
(~
is the negation function) and we can count the result using≢ (~ s∊sp) ⊆s
. The way the partition function works is it uses the boolean vector to strip out parts of its righthand argument based on runs of1
in its lefthand argument (thanks to (u / olzdfor pointing out⊆
when I posted this article originally onR / Programming).
We only have one piece of input in the above calculation and so it is easy to turn it into a function. In APL we can make adirect functionby using braces (I won’t go into the differences in the two function types here but know that the book references above goes into some details). So we can store this calculation in a variable with:
In APL⍺
(alpha) and⍵
(omega) are the names of the left and right arguments to a direct function. We apply a function just like we do the built in primitive functions, we give it parameters:words s
. One thing we can do is makewords
a bit more general and have it take (any) boolean vector and count the partition of that vector (we don’t need the characters, after all) , and this then gives us:
And so when we apply it to our string we will do it withwords s∊sp
. Just like with the built in parameters,words
will be evaluated on the entire expression to the right which iss∊sp
. This yields9
. One thing to keep in mind is testing edge cases and, for the empty string we get0
.
WC via Partition (⊆)
Now that we can count words in a string we can count words in a file. We could just do this by reading in the entire file all at once and applying thewords
function and then counting the number of characters and the number of newlines. But that would be a bad idea – what if the file is 123 GB in size ?! Clearly we don’t want to read that in all at once, instead we want to read it in a few blocks at a time. We’re going to need a loop, and for that we need another type of function, aprocedural function(again see the referenced book for details).
One computation issue, as mentioned in the original Haskell based blog post, is what happens if we split up the input in the middle of a word? Well, we just use the same technique used in that blog post. No, not monoids (though that is cool), just keeping track if the last character we saw was a non-space or not. If it was and the first new character isalsoa non-space then we subtract one word from the count as we accumulate.
With that in mind, here’s the first attempt
res ← wc fn; c; w; l; fd; data; blk; nl; sp; lnsp; words words ← {≢ (~ ⍵) ⊆⍵} Blk ← 256 × 1024 nl ← ⎕UCS¨ 10 11 12 13 sp ← nl, ⎕UCS¨9 32 c → w → l → 0 fd ← fn ⎕NTIE 0 lnsp ← 0 Repeat data → ⎕NREAD fd 80 blk : If 0=≢data : Leave : EndIf c ← c (≢data) l ← l / data∊nl data → data∊sp w ← (w words data) - (lnsp∧ ~ 1 ↑ data) lnsp ← ~ (¯1) ↑ data : EndRepeat ⎕NUNTIE fd res ← l w c
We define the functionwc
with inputfn
(filename) and outputres
(result). The names afterfn
are the names of variables we want to belocalwhich happens to be every variable we are using. If we left a variable out, it would be assumed global and be able to change global state. An explanation:
- The first line, as just described, declares the function
WC
with inputfn
, outputres
, and a handful of local variables. words
is the word counting function we derived above.blk
is the block size we will use while we read from the file (fn) .NL
andsp
are the newline and whitespace vectors we defined originally.C
,w
, andl
are character, word, and line counters initially set to (0) .fd
is the file descriptor forfn
(next available since we passed (0) ).lnsp
is1
if the last character read during the previous iteration was not a space, (0) otherwise .- Now we enter a loop with
: Repeat
, the body of which is everything between: Repeat
and: EndRepeat
.: Leave (is like) ********************** (break) in other languages.
data
is assigned the next block of data read fromFD
.80
tells the⎕NREAD
function that we want to read characters (intuitive, right?).atadata
counts the number of items indata
and if this is zero (iedata
is empty) we exit the loop. Otherwise we add that count toc
.- We then use
data∊nl
to find the newlines indata
and sum these up with/
and add this sum tol
. - We no longer need the contents of
data
we only need to know where whitespace is, so we store a boolean vector describing this indata
. Passing this towords
counts the number of whitespace terminated words in (data) , which we add tow [5] . But then, to account for word splitting, we use
1 ↑ data
to get whether the first character we readthis iterationwas a space and then negate that with~
; we use∧
toandthat against whether the last character in theprevious iterationwasnota space, and subtract that from the count. As in C,0
is false and so (1∧1) is1
and we only subtract1
if the previous iteration's data ended with a nonspaceandthis iteration's data started with one. - We then store whether this iteration’s data terminated with a nonspace character.
- Finally we close the file descriptor after the loop and return the line, word, and character counts in
res
.
Performance Measurements
Usinga user defined timing operatorwe can time theWC
function. On a 1. 661 GB file (which is just thebig.txt
file from the original Haskell post’s repository repeated multiple times), I get (on my (inch early 2k) MacBook Air (initially a run of 32. 97 seconds and about the same on immediate subsequent runs. Using thetime
terminal utility to runwc
against the same file I getuser timesranging from 5. (s to 5.) s, so this first attempt is definitely slow compared to OSX’s (wc) utility. If I comment out the line wherew
is calculated my initial running time on that same file is 4. 32 seconds with subsequent runs hovering around 1. 85 seconds, so clearly our word calculation is the culprit here (as was to be expected).
Counting Words with Array Arithmetic
Let’s call the vectors∊sp (vector) ********************** (f) forfound:
f ← s∊sp
. We don’t want to just count the ones in this vector as then consecutive whitespace characters would each count as a word; what we want to count is when we switch from non-whitespace to whitespace (or vice-versa), and we can accomplish this with the difference between adjacent values. This is easy in APL, it's just(1 ↓ f) - (¯1) (f
:
¯
is “high minus” and is used to negate literals.↓
is the drop function which drops a number of items from the value to the right . If the number on the left is positive, it drops that many from thefrontof the value, if it is negative then it drops the absolute value of that many from the (back) of the value. So1 f
isf
without the first item and(¯1) ↓ f
isf
without the last item.-
is subtraction, so we are subtracting values inf
from the values to their left. For this particularf
we get the following:1 ¯1 0 1 ¯1 0 0 1 ¯1 0 0 0 0 1 ¯1 0 1 ¯1 1 ¯1 0 0 0 0 0 1 ¯1 0 0 0 0 1 ¯1 0 0 0 0 0 0
This vector is one elementshorter (than) ********************** (f) – iff
were a vector of five items with values 0 1 0 0 1
then we would have calculated(1 0 0 1) - (0 1 0 0)
after the drops, which would be1 ¯1 0 1
, a vector of four items. Back to the above result, we can find all the places where1
appears simply with1=(1 f) - (¯ 1) (f
which compares every element in the vector to the value1
, yielding a vector of the same size as the shorter “Difference” vector, and this vector has a1
where a difference of1
was found and(0) elsewhere:
1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
This represents the positions where a space followed a non-space (the first1
corresponds to the space betweenI
andam
). Summing this up gives us the number of places a wordterminatedby being followed by a space, and we can get that sum with the function /
. Just as¨
created aderived function (so does) /- both are calledoperators. As¨
was associated with the commonly named functionmap
so is/
associated with the commonly named functionreduce
/fold
, and so /
is a sum function.
For / 1=(1 f) - ( ¯1) ↓ f
we get8
, but'I am the model of a modern major general'
has (9) *********************** (words, not8
. The problem is that we were only counting words terminated by a space, not by the end of the string. To count that, we must check if the last character is a space or not and if it’s not then we can count the end of the string as terminating a word. We can get the whether the last character was a space or not with(¯1) ↑ f
, which will be (1) *********************** (if) ********************** (s) ended with a space or0
if not. Since we want to add exactly one word to the count when a stringdoes notend with a space we negate this with~
and add that. Thus the word count expression is:
(~ (¯1)) f) / 1=(1) f) - (¯1) ↓
Turning this into a function yields:
words ← {(~ (¯1) ↑ ⍵) / 1=(1 ↓ ⍵) - (¯1) ↓ ⍵}
One problem is that the calculation is wrong for the empty string: (words ” ∊spgives1
not0
. This won’t be an issue for us because we won’t even be invoking it on empty strings.
WC via Array Arithmetic
Here’s what ourwc
function looks like with this change (only the second line changed):
res ← wc fn; c; w; l; fd; data; blk; nl; sp; lnsp; words words ← {(~ (¯1) ↑ ⍵) / 1=(1 ↓ ⍵) - (¯1) ↓ ⍵} Blk ← 256 × 1024 nl ← ⎕UCS¨ 10 11 12 13 sp ← nl, ⎕UCS¨9 32 c → w → l → 0 fd ← fn ⎕NTIE 0 lnsp ← 0 Repeat data → ⎕NREAD fd 80 blk : If 0=≢data : Leave : EndIf c ← c (≢data) l ← l / data∊nl data → data∊sp w ← (w words data) - (lnsp∧ ~ 1 ↑ data) lnsp ← ~ (¯1) ↑ data : EndRepeat ⎕NUNTIE fd res ← l w c
Performance Measurements
On the same 1. 661 GB file as before I initially get a run of 3. 36 seconds, but then 2. 75 s, 2. 34 s, 2. 36 s on immediate subsequent runs now that the computer is pumped and primed. Using thetime
terminal utility to runwc
against the same file I (again) getuser timesranging from 5. 345 s to 5. 549 s, so this attempt is faster and we are done. I get similarly scaled differences for smaller files.
Splitting It All Up
We can split our procedural function up into a few direct functions and an operator which might make it easier to understand and maintain (or maybe not). We start with a function computing thewc
stats on a string together with whether the string’s first and last character is not a space:
blockCount ← { l ← / ⍵∊⎕UCS¨ 10 11 12 13 ⍝ #lines s ← ⍵∊⎕UCS¨9 10 11 12 13 32 ⍝ is space w ← (~ (¯1) s) / 1=(1 s) - (¯1) s ↓ words ⍝ first-non-space, chars, words, lines, last-non-space (~ 1 s) (≢⍵) w l (~ (¯1) s) }
- We directly use the vector of newlines, passing
***UCS¨ 10 11 12 13
as the righthand argument to the membership function∊
. The lefthand argument is the righthand argument passed toblockCount
. s
is the boolean vector telling us where spaces ( (⎕UCS¨9) 11 12 13 32) occur in the righthand argument to (blockCount.- From
s
we calculate the words inblockCount
‘s righthand argument directly, just as we did before. - Finally we return whether the first character is not a space, the number of characters , the number of words, the number of lines, and whether the last character is not a space.
We can applyblockCount
to multiple strings. Just like the original Haskell version we need some way to combine the results of applyingblockCount
to sequential strings from a file, which we do with the following:
blockCollapse → { adjustment ← 0 0 (- / (1 ↓ ⍵ [;1]) ∧ (¯1) ↓ ⍵ [;5]) 0 0 collapsed ← (⍵ [1;1]), ( ⌿⍵) [2 3 4], (⍵ [≢⍵;5]) adjustment collapsed }
The assumption here is that the results fromblockCount
are stored in a matrix where each row is a result, with the values as described as above (whether the first character is a nonspace, etc…). When we collapse a sequence ofblockCount
results we want a result of the same form asblockCount
. This will be the following:
- Whether the first result in the matrix represented a stringstartingwith a nonspace – that value is
⍵ [1;1]
. - The sum of all the characters from all the results.
- The sum of all the words from the results minus the number of times a word was split.
- The sum of all the lines from all the results.
- Whether the last result in the matrix represented a string (ending) with a nonspace – that value is
⍵ [≢⍵;5]
.
The sums in the middle can be calculated if we reduce additiondownthe input matrix.( ⌿⍵)
will calculate that sum, and( ⌿⍵) [2 3 4]
will be the middle three values of that sum (we don’t need the sums of whether the strings started or ended with nonspace characters).
The value we have to subtract – the number of times a word was split – is calculated by taking the sum of whether a string started with a nonspaceandedwith whether the string before ended with one. That is the value / (1 ↓ ⍵ [;1]) ∧ (¯1) ↓ ⍵ [;5])
.(1 ↓ ⍵ [;1])
is the first column of our results matrix (which represents whether strings started with a nonspace) without the value from the first result;(¯1) ↓ ⍵ [;5]
is the last column (representing whether strings ended with a nonspace) without the value from the last result. We and these together with∧
and sum that with /
to get the number of times we split a word; Prepending-
negates the value and placing it in the middle of a vector of four zeroes lets us just add it to our calculated result, adjusting the word count appropriately.
Now we add a user defined operator to handle looping over reading a file. The below does this:
v ← (f rc) fd; data v ← ⍬ Repeat data → ⎕NREAD fd 80 (1024 × 1024) : If 0
v
is the output of the operator, we need this for technical reasons but the value won’t be used. Our operator accepts a computation to use on each string on its left and a file descriptor on its right and loops until it has consumed all the data in the file.
Finally, we can bring this all together in a (possibly) simpler implementation ofwc
which relies on the computations we’ve defined. Since we no longer need to use looping or conditionals inwc
(all that work is done inrc
), we can makewc
a direct function:
wc ← { res ← 1 5⍴0 update ← { val ← res, [1] blockCount ⍵ res∘ ← val val } fd ← ⍵ ⎕NTIE 0 _ ← update rc fd _ ← ⎕NUNTIE fd (blockCollapse res) ******* }
We start by makingres
a zero-filled matrix of one row by five column. Then we define a helper functionupdate
which will updateres
usingblockCount
., [1]
lets us use ravel along its lefthand argument’s first axis so thatval (will be) ********************** (res) with the result from
blockCount ***
appended as the last row.∘ ← (lets us update) ********************** (res) from within
update
and we returnval
.
Just like the built in operators we put our operated function on the left ofrc
which creates a derived function that accepts the file descriptorFD
.RC
will iterate over the contents offd
and successively apply update. We have to assign the result (or just use it somehow) as the first unused computation of a direct function is its return value, and we don’t want an early return.
Finally, we usedblockCollapse
to compress the result matrix down into the desired answer and pick out the values we want with[4 3 2]
.
Space Utilization
The above split up implementation has one drawback – we accumulate the results into a matrix whose size grows linearly as a function of the input file. Granted, it has agreatconstant factor since we collapse each megabyte down to only five numbers, but still if we run this on a terabyte file we will likely use at least 20 MB just for storing the results. That doesn’t sound bad, but our earlier version ran in constant space and we can reclaim that here, too.
The main change comes in therc
operator which now looks like this:
res ← fd (m rc r) init; data res → init Repeat data → ⎕NREAD fd 80 (1024 × 1024) : If 0
We are now operating on two functions –m
to map the file data andr
to fold new data into an accumulated result – ie we are doing a reduce / fold across a transformation of the file’s contents. We also now accept the file descriptorfd
on the left so that we can easily accept an initial accumulation value on the right.
One thing we won’t have to change is (blockCount) , it is the same as the above. We do have to changeblockCollapse
though and since we no longer need to collapse a whole matrix of results but instead need to add two results together, we can have a much simpler functionblockAdd ← {(⍺ [1], (⍺ ⍵) [2 3 4], ⍵ [5])) - (0 0 (⍺ [5] ∧⍵ [1]) 0 0)}
which does just that.
Sincerc
is now accumulating results we don’t really need to be updating a result inwc
which now has the following sleek form:
wc ← { blockAdd ← {(⍺ [1], (⍺ ⍵) [2 3 4], ⍵ [5]) - - 0 0 (⍺ [5] 0 [1]) 0 0)} fd ← ⍵ ⎕NTIE 0 res ← fd blockCount rc blockAdd 5⍴0 _ ← ⎕NUNTIE fd res [4 3 2] }
We useblockCount
to map data blocks in the file to statistics for that block then useblockAdd
to combine them, all handled by our operatorRC
. We now have regained constant memory space (the size of which depends on how much data we read at once, here I chose one megabyte), and it’s starting to have the feel of the monoid solution in the original Haskell article.
GIPHY App Key not set. Please check settings