Beating Decades of Optimized C with 80 Lines of Haskell, Hacker News

Despite the click-bait title I hope you’ll find this post generally illuminating, or at the very least a bit of fun!

You can find source code for this posthere.

The challenge is to build afasterclone of the hand-optimized C implementation of thewcutility in our favorite high-level garbage-collected runtime-based language: Haskell! Sounds simple enough right?

For reference, I’m using the Mac’s version ofwc; you can findreference source code here.

Here’s the criteria we’ll be considering as we go along:

Correctness: Should return identical character, word, and line counts aswcon the test files.
Speed (wall-clock-time): How do we compare to the execution time ofwc?
Max Resident Memory: What’s the peak of our memory usage? Is our memory usage constant, linear, or otherwise?

Those are the main things we need to worry about.

Let’s dive in.

The dumbest thing that could possibly work

As always, we should start by just trying the dumbest possible thing and see how it goes. We can build up from there. What’s the dumbest way to count characters, lines, and words in Haskell? Well, we could read the file, then run the functionslength,length. words, andlength. linesto get our counts!

stupid ::FilePath->IO(Int,Int,Int)stupid fp=docontentsreadFileFPreturn(lengths,length(wordss),length( (liness))

Amazingly enough, this actually DOES work, and gets us the same answers aswc, IF you’re willing to wait for it … I got sick of waiting for it to finish on my large test file (it was taking more than a few minutes), but on a smaller test file (90 MB) got the following results:

90 MB test file:

(time)

0. (s)	17. 07 s
max mem.	1. (MB)	2403 MB

Yikes … Needless to say there’s some room for improvement …

Being slightly less dumb

Let’s think about why this is doing so poorly; the first thing that comes to mind is that we’re iterating through the contents of the file 3 separate times! This also means GHC can’t garbage collect our list as we iterate through it since we’re still using it in other places. The fact that we’re keeping every character of the file in a linked list helps explain the 2.4 GB of memory on a file that’s only 90 MB! Ouch!

Okay, so that’s REALLY not great. Let’s see if we can get this down to a SINGLE pass over the structure. We’re accumulating 3 simple things, so maybe we can process all three parts at once? When iterating through a structure to get one final result I reach for folds!

It’s pretty easy to use a fold to count characters or lines; the character count always adds one to the total, the line count adds one when the current character is a newline; but what about the word count? We can’t add one on every space character because consecutive spaces doesn’t count as a new word! We’ll need to keep track of whether the previous character was a space, and only add one to the counter whenever we start a completely (new) word. That’s not too tough to do; we’ll usefoldl 'fromData.Listfor our implementation.

importData.ListimportData.CharsimpleFold ::FilePath->IO(Int,Int,Int)simpleFold fp=docountFilereadFilefpcountFile ::String->(Int,Int,Int)countFile s=let(cs, ws, ls, _)=(foldl 'go)  (0) ,0,0,False) sin(cs, ws, ls)wherego ::(Int,Int,Int,Bool)->Char->(Int,Int,Int,Bool)go (cs, ws, ls, wasSpace) c=letaddLine|  (c)==' n'=1|Otherwise=0addWord|wasSpace=(0)|isSpacec=1|Otherwise=0in(cs1, WSaddWord, lsaddLine,isSpacec)

Running this version we run into an even worse problem! The program takes more than a few minutes and quickly spikes up to more than 3 GB of memory! What’s gone wrong? Well, we used the strict version offoldl(indicated by the trailing tick'); BUT it’s only strict up to “Weak Head Normal Form” (WHNF), which means it’ll be strict in thestructureof the tuple accumulator, but not the actual values! That’s annoying, because it means we’re building up a HUGE thunk of additions that we never fully evaluate until we’ve finished iterating through the whole file! Sometimes laziness sneaks in and bites us like this. This is the sort of memory leak that can easily take down web-servers if you aren’t careful!

90 MB test file:

(time)

0. (s)	longer than I want to wait
max mem.	1. (MB)	>3 GB

We can fix it by telling GHC to strictly evaluate the contents of the tuple on ever iteration. An easy way to do that is with theBangPatternsextension; it lets us use!in our argument list to force evaluation on each run of the function. Here’s the new version ofgo:

{- # LANGUAGE BangPatterns # -}...go ::  (Int,Int,Int,Bool)->Char->(Int,Int,Int,Bool)go (!cs,!WS,!LS,!wasSpace (c)=letaddLine|c==' n'=1|Otherwise=0addWord|wasSpace=0|isSpace  (c)=|Otherwise=0in(cs1, WSaddWord, lsaddLine,isSpacec)

That simple change speeds things up likeCRAZY; here’s our new performance breakdown:

90 MB test file:

(time)

0. (s)	8. (s)
max mem.	1. (MB)	3.7 MB

Okay; so we’re doing WAY better on memory now, a few MBs of memory on a 90 MB file means we must finally be streaming the file contents properly! Even though laziness has already bitten us on this problem, now that we’ve localized the laziness into the right places it provides us with streaming for free! The streaming happens naturally becausereadFileactually doeslazy IO; which can be a real nuisance sometimes for things like web servers since you’re never quite sure when the IO is happening, but in our case it gives us much better memory residency.

(Better with ByteStrings)

We can probably stop worrying about memory for now, so we’re back to crunching for performance! One thing I can think to try there to switch to using a ByteString rather than a String. Using a String means we’re implicitly decoding the file as we read it, which takes time, AND we have the overhead of using a linked list for the whole thing, we can’t easily take advantage of batching or buffering our data as we read it.

This change is actually laughably easy, thebytestringpackage provides the module:Data.ByteString.Lazy.Char8, which provides operations for working with Lazy ByteStrings as though they were strings of characters, but with all the performance benefits of ByteStrings. Note that it DOESN’T actually verify that each byte is a valid Character, or do any decoding, so it’s on us to make sure we’re passing it valid data. By defaultwcassumes its input is ASCII, so I think we’re safe to do the same. If our input is ASCII then the functions in this module will behave sensibly.

Literally the only changes I need to make are to switch theData.Listimport toData.ByteString.Lazy.Char8and then switch thereadFileandfoldl 'functions to their ByteString versions:

importData.CharimportqualifiedData.ByteString.Lazy.Char8as(BS)simpleFold ::FilePath->IO(Int,Int,Int)simpleFold fp=dosimpleFoldCountFileBS.readFile fpsimpleFoldCountFile ::BS.ByteString->(Int,Int,Int)simpleFoldCountFile s=let(cs, ws, ls, _)=BS.foldl 'go (0,  (0) ,0,  (False) ) sin(cs, ws, ls)wherego ::(Int,Int,Int,Bool)->Char->(Int,Int,  (Int) ,Bool)go (!cs,!WS,!ls,!wasSpace) c=letaddLine|c==' n'=(1)|(otherwise) ***********************=(0)(addWord)  |wasSpace=(0)|isSpacec=1|Otherwise=0in(cs1, WSaddWord, lsaddLine,isSpace  (c)

This little change chops our time down by nearly half!

90 MB test file:

(time)

0. (s)	8. (s)	3. (s)
max mem.	1. (MB)	3.7 MB	5. 48 MB

So we’re clearly still making some progress. Our memory usage has increased slightly, but it still seems to be a constant overhead. We’re still orders of magnitude away fromwcunfortunately; let’s see if there’s anything else we could do.

Moving to Monoids

At this point I feel like experimenting a little. Modern PC’s tend to have multiple cores, and it seems as though newer machines scale up the number of cores moreso than their processor speed, so it would be beneficial to take advantage of that.

Splitting up a computation like this isn’t exactly trivial. In order to use multiple cores we’ll need to split up the job into pieces. In theory this is easy, just split the file into chunks and give one chunk to each core! As you think a bit deeper about it the problems start to appear; combining character counts is pretty easy, we can just sum the totals from each chunk. The same with line-counts, but word counts pose a problem! What happens if we split in the middle of a word, or in the middle of several consecutive spaces? In order to combine the word counts we’d need to keep track of the starting and end state of each chunk and be intelligent when we combine them together. That sounds like a lot of book-keeping that I don’t really want to do.

Monoids to the rescue! The associative laws of a Monoid mean that so long as we can develop a lawful monoid it WILL work properly in spite of this type of parallelism. That really just passes the buck down the line though, is it possible to write a Monoid that can handle the complexities of word-counting like this?

It sure is! It may not be immediately apparent how a monoid like this works, but there’s a class of counting problems that all fall into the same category like this, and luckily for me I’ve worked on these before. Basically we need to count the number of times a given invariant haschangedfrom the start to the end of a sequence. I’ve generalized this class of monoid before, naming themfluxmonoids. What we need to do is count the number of times we change from characters which are spaces to those which AREN’T spaces. We could probably express this using theFluxmonoid itself, but since we need to be so careful about strictness and performance I’m going to define a bespoke version of the Flux monoid for our purposes. Check this out:

dataCharType=IsSpace|NotSpace(deriving)   (Show)dataFlux=Flux!CharType{- # UNPACK # -}(Int(CharType)|UnknownDerivingShow

We need these types only for the word-counting part of our solution.

TheCharTypesays whether a given character is considered a space or not; then theFluxtype represents a chunk of text, storing fields for whether the left-most character is a space, how many words are in the full block of text, and whether the right-most character is a space. We don’t actually keep the text in the structure since we don’t need it for this problem. I’veUNPACKed theIntand made all the fields strict to ensure we won’t run into the same problems we did with lazy tuples earlier. Using a strict data type means I don’t need to use BangPatterns in my computations.

Next we need a semigroup and Monoid instance for this type!

instanceSemigroupFluxwhereUnknown(x)=(x)xUnknown=xFlux  (ln) ********************** (NotSpace)FluxNotSpacen 'r=(Flux) *********************** (l) nn '-(1) ) r(Flux) *********************** (ln _)Flux_ n 'r=Fluxl (n)  n ') rinstanceMonoidFluxwheremempty=(Unknown)

TheUnknownconstru ctor is there to represent a Monoidal identity, we could actually leave it out and useMaybeto promote our Semigroup into a Monoid, butMaybeintroduces unwanted laziness into our semigroup append! I just define it as part of the type for simplicity.

The()operation we define checks whether the join-point of our two text blocks happens in the middle of a word, if it does then we must have counted the start and end of the same word separately, so we subtract one when we add the word totals to make it all balance out.

Lastly we need a way to build aFluxobject from individual characters.

flux ::Char->Fluxflux C|isSpace  (c)=FluxIsSpace(0)IsSpace|Otherwise=FluxNotSpace(1)NotSpace

This is simple enough, we count non-space characters as ‘words’ which start and end with non-space charactes and for spaces have an empty word count surrounded on both sides with space chars.

It may not be immediately clear, but that’s all we need to count words monoidally!

>>>foldMapflux"testing one two three"FluxNotSpace4NotSpace>>>foldMapflux"testing on "foldMapflux"e two three"FluxNotSpace4NotSpace>>>foldMapflux"testing one"foldMapflux"two three"FluxNotSpace(4)NotSpace

Looks like it’s working fine!

We’ve got the word count part covered, now we need the Monoidal version of the char count and line count. This is a snap to build:

dataCounts=(Counts) *********************** ({)  charCount ::{- # UNPACK # -}!Int,wordCount ::!Flux,lineCount ::{- # UNPACK # -}!Int}Deriving(Show)instanceSemigroupCountswhere(Countsabc)(Countsa 'b' c ')=(Counts)  (aa ') (bb ') (cc ')instanceMonoidCountswheremempty=Counts(0)  mempty(0)

No problem! Similarly we’ll need a way to turn a single char into aCountsobject:

countChar ::Char->Counts(countChar c)=Counts{charCount=1, wordCount=Flux C, lineCount=if(c==' n')thenelse(0)}

Let’s try that out too:

>>>foldMapcountChar"one two  nthree"(Counts)  {charCount=13, wordCount=FluxNotSpace(3)NotSpace, lineCount=(1) }

Looks good to me! Experiment to your heart’s content to convince yourself it’s a lawful Monoid.

With a lawful Monoid we no longer need to worry about how we split our file up!

Before going any further, let’s try using our monoid with our existing code and make sure it gets the same answers.

moduleMonoidBSFoldwhereimportData.Charimport(qualified)Data.ByteString.Lazy.Char8as  (BS  ***********************)monoidBSFold ::FilePath->IOCountsmonoidBSFold paths=monoidFoldFileBS.readFile fpmonoidFoldFile ::BS.ByteString->CountsmonoidFoldFile=(BS.foldl ')  ab->acountChar b)mempty

We’ve moved some complexity into ourCountstype, which allows us to really simplify our implementation here. This is nice in general because it’s much easier to test a single data-type rather than testing EVERYWHERE that we do this fold.

As a side benefit, this change hassomehowsped things up even more!

We’re in the ballpark now!

90 MB test file:

(time)

0. (s)	3. (s)	1. 94 s
max mem.	1. (MB)	5. 48 MB	3. (MB)

We’ve knocked off a good chunk of time AND memory with this change … I ‘ll admit I have no idea WHY, but I won’t look a gift-horse in the mouth. It’s possible that by using a fully strict data structure we’ve strictified some laziness that snuck in somewhere; but I’m really not sure. If you can see what happened please let me know!

Inlining away!

Next in our quest, I think I’ll inline some definitions! Why? Because that’s just what you do when you want performance! We can use theINLINEpragma to tell GHC that our function is performance critical and it’ll inline it for us; possibly triggering further optimizations down the line.

monoidBSFold ::FilePath->IOCountsmonoidBSFold paths=monoidBSFoldFileBS.readFile fp{- # INLINE monoidBSFold # -}monoidBSFoldFile ::BS.ByteString->(Counts)monoidBSFoldFile=BS. foldl '( ab->acountChar b)mempty{- # INLINE monoidBSFoldFile # -}

I also went ahead and added INLINE’s to ourcountCharandfluxfunctions. Let’s see if it made any difference:

90 MB test file:

(time)

1. 94 s	0. (s)
max mem.	3. (MB)	4. 35 MB

Interestingly it seems to have slashed our time down by 75%! I’m really not sure if this is a fluke, or if we stumbled upon something lucky here; but I’ll take it! It’s bumped up our memory usage by a smidge; but not enough for me to worry.

Here’s how we compare to the C version now:

90 MB test file:

(time)

0. (s)	0. (s)
max mem.	1. (MB)	4. 35 MB

At this point we’re pretty close to parity withWC; but we’re looking at sub-second times, so I’m going to bump up the size of our test file and run a few times to see if we can learn anything new.

I bumped up to a 543 MB plaintext file and ran it a few times in a row to get the caches warmed up. This is clearly important because my times dropped a full 33% after a few runs. I understand my testing method isn’t exactly “scientific”, but it gives us a good estimate of how we’re doing. Anyways, on the much larger file here’s how we perform:

543 MB test file:

(time)

2. 06 s	2. 73 s
max mem.	1. (MB)	3. (MB)

From here we can see that we’re actually getting pretty close! Considering we’ve clonedwcin a high-level garbage collected language in around 80 lines of code I’d say we’re doing alright!

Using our Cores

One may not expect parallelizing to multiple cores to do a whole lot since presumably this whole operation is IO bounded, but I’m going to do it anyways because I’m stubborn and bored.

We’ve already expressed our problem as a Monoid, which means it should be pretty trivial to split up the computation! The trick here is actually in reading in our data. If we try to read in all the data and THEN split it into chunks we’ll have to load the whole file into memory at once, which is going to be REALLY bad for our memory residency, and will probably hurt our performance too! We could trystreamingit in and splitting it that way, but then we have toprocessthe first chunk before we get to the second split; and hopefully you can see the problem there. Instead I’m going to spin up a separate thread for each core we have then open a separate file handle in each of those threads. Then I’ll seek each Handle to disjoint offsets and perform our operation on each non-overlapping piece of the file that way before combining the counts together.

Here’s the whole thing, did I mention how much Ilovewriting concurrent code in Haskell?

importTypesimportControl.MonadimportData.TraversableimportData.BitsimportGHC .Conc(numCapabilities)importControl.Concurrent.AsyncimportData.FoldableimportSystem.IOimportSystem.Posix.Filesimport(qualified)Data.ByteString.Lazy.Char8as  (BL  ***********************)importData.ByteString.Internal(C2W)importGHC.IO.HandlemultiCoreCount ::FilePath->IOCountsmultiCoreCount fp=doputStrLn("Using available cores:"shownumCapabilities)sizefromIntegral.fileSizegetFileStatus fp(let)  chunkSize=fromIntegral(size`div`numCapabilities)fold(forConcurrently [0..numCapabilities-1]$ n->DO- Take all remaining bytes on the last capability due to integer division anomolies(let)  limiter=(if) *********************** (n)==numCapabilities-1thenidelseBL .takefromIntegralchunkSize)letoffset=fromIntegral(n*chunkSize)fileHandleopenBinaryFile fpReadModehSeek fileHandleAbsoluteSeekoffsetcountBytes.limiterBL.hGetContents fileHandle){- # INLINE handleSplitUTF # - }countBytes ::BL.ByteString->CountscountBytes=BL.foldl '( ab->acountChar b)  (mempty){- # INLINE countBytes # -}

There’s a lot going on here, so I’ll break it down as best I can.

We can import the number of “capabilities” available to our program (ie the number of cores we have access to) fromGHC.Conc. From there, we run a fileStat on the file we want to count to get the number of bytes in the file. From there, we use integer division to determine how many bytes should be handled by each individual core. The integer division rounds the result down, so we’ll have to be careful to pick up the bytes that were possibly left out later on. We then useforConcurrentlyfromControl.Concurrent.Asyncto run a separate thread for each of our capabilities.

Inside each thread we check whether we’re inside the thread which handls the LAST chunk of the file, if we are we should read until the EOF to pick up the leftover bytes from the earlier rounding error, otherwise we want to limit ourselves to processing onlychunkSizebytes. Then we can calculate our offset into the file by multiplying the chunk size by our thread number. We open a binary file handle and usehSeekto move our handle to the starting offset for our thread. From this point we can simply read our allocated number of bytes and fold them down using the same logic as before. After we’ve handled each of the threads, we’ll use a simplefoldto combine the counts of each chunk into a total count.

We usein a few spots to add additional strictness since we want to ensure that the folding operations happen within each thread, instead of after the threads have been joined. I might go a little overboard on strictness annotations, but it’s easier to add too many than it is to track down the places we’ve ac cidentally missed them.

Let’s take this puppy out for a spin!

After warming up the caches I ran each of them a few times on my 4 core 2013 Macbook Pro with an SSD, and averaged the results together:

543 MB test file:

(time)

2. 07 s	1. 23 s
max mem.	1. (MB)	7. 06 MB

It seems to make a pretty big difference! We’re actually going FASTER than some C code that’s been hand optimized for a few decades. These results are best taken with a hefty grain of salt; it’s really hard to tell what sort of caching is going on here. There are probably mutliple layers of disk caching happening. Maybe the multithreading only helps when reading files from a cache?

I did a bit of skimming and it seems that SOME storage devices might experience a speed-up from doing file reads in parallel, some may actually slow down. Your mileage may vary. If anyone’s an expert on SSDs I’d love to hear from you on this one. Regardless I’m still pretty happy with the results.

In case you’re wondering, the actualUsertime for our program comes in at4. 22 s(which is split across the 4 cores), meaning our parallel program is less efficient than the simple version in terms of actual processor cycles, but the ability to use multiple cores gets the “real” wall-clock time down.

Handling Unicode

There’s something we’ve been avoiding so far, we’ve assumed every file is simple ASCII! That’s really not the way the world works. A lot of documents are encoded in UTF-8 these days; which turns out to be identical to an ASCII file IFF the file only contains valid ASCII characters, however if those crazy pre-teens put some Emoji in there then it’s going to screw everything up.

The problem is two-fold; firstly we currently count BYTES not CHARACTERS, because in ASCII-land they’re semantically the same. With our current code, if we come across a UTF-8 encoded frowny face we’re going to count it as at least 2 characters when it should only count as one. Okay, so maybe we should actually be decoding these things, but that’s much easier said than done because we’re splitting the file up into chunks at arbitrary byte-counts; meaning we might end up splitting that frowny face into two different chunks, leading to an invalid decoding! What a nightmare.

This is another reason why doing a multi-threadedwcis probably a bad idea, but I’m not so easily deterred. In order to proceed I’m going to make a few assumptions:

Our input will be encoded using either ASCII or UTF-8 family of encodings. There are of course other popular encodings out there; but in my limited experience most modern text files prefer one of these encodings. In fact there are (entire sites) dedicated to makingUTF-8the one format to rule them all.
We count only ASCII spaces and newlines as spaces and newlines; sorryMONGOLIAN VOWEL SEPARATOR, but you’re cut from the team.

By making these two assumptions we can exploit a few details of the UTF-8 encoding scheme to solve our problem. Firstly, we know from the UTF-8 spec that it’s completely back-compatible with ASCII. What this means is that every ASCII byte is encoded in UTF-8 as exactly that same byte. Secondly, we know that NO other bytes in the file will conflict in encoding with a valid ASCII byte; you can see why in a chart on theUTF-8 wikipedia page. Continuation bytes start with a leading ‘1’, and no ASCII bytes start with a ‘1’.

These two facts mean we can safely leave our current ‘space’ detection logic the same! It’s impossible for us to ‘split’ a space or newline because they’re all encoded in a single byte, and we know we won’t accidentally count some byte that’s part of a different codepoint because there’s no overlap in the encoding for the ASCII bytes . We do however need to change our character-counting logic.

One last fact about UTF-8 is that every UTF-8 encoded codepoint contains exactly one byte from the set:0xxxxxxx, 110 xxxxx, 1110 xxxx, 11110 xxx. Continuation bytes ALL start with10, so if we count all bytes OTHER than those starting with10then we’ll count each code-point exactly once, even if we split a codepoint across different chunks!

All of these facts combined means we can write a per-byte monoid for counting UTF-8 OR ASCII characters all in one!

Actually, our currentCountsmonoid is fine, we’ll just need to adapt ourcountCharfunction:

import(Data.Bits)importData.ByteString.Internal(C2W)countByte ::Char->CountscountByte c=Counts{- Only count bytes at the START of a codepoint, not continuation bytescharCount=if  (bitAt) ********************** (7)&&notbitAt(6) ))then0else1, wordCount=flux c, lineCount=if(c==' n')then1else0}wherebitAt=testBit (c2w c){- # INLINE countByte # -}

And that’s it! Now we can handle UTF-8 or ASCII; we don’t even need to know which encoding we’re handling, we’ll always give the right answer.

wc, at least the version on my Macbook, has a-mflag for handling multi-byte characters when counting. A few quick experiments shows that tellingwcto handle multi-byte chars slows down the process pretty significantly (it now decodes every byte) ; let’s see how our version does in comparison. (I’ve confirmed they get the same results when running on a large UTF-8 encoded document with many non-ASCII characters)

543 MB test file:

(time)

5. s	3. 07 s
max mem.	1. (MB)	7. (MB)

Just as we suspect, we come out pretty far ahead! Our new version is a bit slower than when we just counted every byte (we’re now doing a few extra bit-checks), so it’s probably a good idea to add autfflag to our program so we can always run as fast as possible for a given input.

(Conclusions)

So; how does our high-level garbage-collected runtime-based language Haskell stack up? Pretty dang well I’d say! We ended up really quite close with our single-core lazy-bytestringwc. Switching to a multi-core approach ultimately allowed us to pull ahead! Whether ourwcclone is faster in practice without a warmed up disk-cache is something that should be considered, but in terms of raw perfo rmance we’re still winning this one.

Haskell as a language isn’t perfect, but if I can get ball-park comparable performance to a C program while writing much higher-level fully type-checked code then I’ll call that a win any day.

Hopefully you learned something 🤞! Did you know I’m currently writing a book? It’s all about Lenses and Optics! It takes you all the way from beginner to optics-wizard and it’s currently in early access! Consider supporting it, and more posts like this one by pledging on myPatreon page! It takes quite a bit of work to put these things together, if I managed to teach your something or even just entertain you for a minute or two maybe send a few bucks my way for a coffee? Cheers! 🍻 ****