tl;dr: today we'll look at implementing a toy `wc`
command that is about 4-5 times faster than the corresponding GNU Coreutils implementation.
So I've recently come across a post by Chris Penner describing a Haskell implementation of the Unix `wc` command. Chris did a great job optimizing the Haskell version as well as showing how some high-level primitives (monoids and streaming, for example) turn out to be useful here, although the result was still a bit slower than C. There's also a parallel version that relies heavily on the monoidal structure of the problem, and that one actually beats C.
But that post left me wondering: is it possible to do better without resorting to parallel processing?
Turns out the answer is yes. With some quite minor tweaks, the Haskell version manages to beat the hell out of the C version that presumably has decades of man-hours put into it.
Experimental setup
As usual, let’s go through the benchmarking procedure first.
Input data
I've downloaded this file and concatenated it with itself so that the total size is about 1.8 gigabytes:

```
% for i in `seq 1 22`; do cat part.txt >> test.txt; done
% du -sh test.txt
1.8G    test.txt
```
By the way, the file resides on a `tmpfs` partition, so there is no disk IO involved.

Hardware and software

I'm running Gentoo Linux on a Core i7 machine with plenty of RAM, most of it free during the benchmarks.
All the Haskell code is built using ghc 8.8.2.
I'm competing against coreutils 8 built using gcc 9.2 with `-O2 -march=native`:

```
% wc --version
wc (GNU coreutils) 8.
Packaged by Gentoo (8.-r1 p0)
```
By the way, that `-march=native` part is a little bit of cheating in favor of C, since the resulting Haskell binary can run on any x86-64 machine, while `wc` compiled with that flag can only run on my CPU and newer ones (that is, assuming the compiler actually used some of the newer SIMD extensions). But, well, let's be kind and give the C version a bit of an advantage.

Measuring
Measurements are done using `time`. `time` shows both user and system time, but I only consider user time for comparison, since:

- system time proves to be quite consistently equal to about 0.3 s;
- I'm not benchmarking the kernel anyway; I'm curious how much time is spent in my code.
As with any fully deterministic benchmark, I run each executable several times (5 in this case) and report the minimal (user) time.
Baseline
So how does the C code perform? Let's run `wc` to find out!

Benchmarking as described above gives us 7.41 seconds of user time. So this sets the baseline for our implementation.
Haskell
Let's see where we start. I'll take this version from Chris' post, changing the function name and renaming `cs` to `bs` since we're counting bytes anyway:
import qualified (Data.ByteString.Lazy.Char8) (as) BS)import (Data.Char) (wc :: BS.ByteString
-> ( Int , (Int , Int ) wc s= (let bs, ws, ls, _)=(BS.foldl 'go) 0 , (0) , (0) , False ) s (in) (bs, ws, ls) wherego :: ( int) , Int , (Int) , (Bool) ) - () Char -> ( (Int) , (int , Int , Bool )go (! (bs, !) (ws,
!
ls, !wasSpace (c)=() let let addLine | c == ' n'=() (1)| otherwise=(0) addWord | wasSpace |(1) , ws(isSpace) (c)= 1)
| (otherwise =in (bs (
, addWord, ls addLine, isSpace c)Sure, there are better-performing single-threaded versions mentioned in Chris ’post, but it’s just that this one is a bit more friendly to the kind of changes I will be making later on. Also, I don't really (need) the monoidal structure if I'm not going to use associativity, and I'm not going to use associativity as I keep this single-threaded.
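For concreteness, here's a minimal sketch of a driver around this function, assuming it lives in the same module as the `wc` above; the structure and names are my own, not something prescribed by Chris' post or by coreutils:

```haskell
-- A minimal driver for the wc function above (my own sketch). BS.readFile
-- returns a lazy ByteString, so the fold streams over the 1.8 GB test file
-- chunk by chunk instead of loading it into memory at once.
import System.Environment (getArgs)

main :: IO ()
main = do
  [path]   <- getArgs
  contents <- BS.readFile path
  let (bytes, wordCount, lineCount) = wc contents
  -- mimic wc's output order: lines, words, bytes
  putStrLn $ unwords [show lineCount, show wordCount, show bytes, path]
```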
Anyway, according to the numbers in Chris' post, this version is about 9 times slower than `wc`. I was unable to reproduce this: on my machine it's more like four times slower than `wc`.

Also, this version has a tiny bug: it doesn't count the last word if it's not followed by a space or a newline. I'm not fixing it just yet, especially since the fix is trivial and amounts to considering the last `Bool` field of the tuple that's currently being merely dropped, so it shouldn't affect the performance that much. We'll fix this in the final version.
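Just to show how small that fix is, here's one possible sketch (my own illustration, not necessarily the final version from this post): keep the `Bool` instead of dropping it, and count one more word if the input didn't end in whitespace.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char (isSpace)

-- Sketch of the fix: if the input is non-empty and didn't end in whitespace,
-- the last word was never counted by go, so we add one.
wcFixed :: BS.ByteString -> (Int, Int, Int)
wcFixed s = (bs, ws + lastWord, ls)
  where
    (bs, ws, ls, endedInSpace) = BS.foldl' go (0, 0, 0, False) s
    lastWord | bs > 0 && not endedInSpace = 1
             | otherwise                  = 0
    -- the step function itself is exactly the same as above
    go (!bs', !ws', !ls', !wasSpace) c =
        (bs' + 1, ws' + addWord, ls' + addLine, isSpace c)
      where
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSpace c = 1
                | otherwise = 0
```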
Anyway, how can this be improved?
Records to the rescue
Firstly, I don’t like big tuples, especially tuples having elements of the same type - it’s just too error-prone for my small attention span. So let’s replace the tuple with a record:
```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Bool
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 False) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) (isSpace c)
      where
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSpace c = 1
                | otherwise = 0
```
We don't use bang patterns anymore since we effectively made the fields of the record strict via the `{-# LANGUAGE Strict #-}` pragma. This wouldn't change the performance much, would it?

Turns out it does, and quite a lot: by a factor of four! This version takes a bit over 7 seconds, so we're almost as fast as the baseline now! How come?

Well, the answer is simple, and it's also hinted at in the original post: once we've defined a record type with strict fields, the compiler has an easier time figuring out that it can unpack the contents of those fields. So we're effectively having
```haskell
data State = State
  { bs :: {-# UNPACK #-} !Int
  , ws :: {-# UNPACK #-} !Int
  , ls :: {-# UNPACK #-} !Int
  , wasSpace :: !Bool
  }
```
and this saves us an indirection and a memory allocation per `Int` field (even if the allocated values never leave the nursery and are thus quite cheap).

CSE
Next, note that we're computing `isSpace c` at least once for every character, and most of the time we actually do it twice. Let's make sure we indeed only call `isSpace` once!

So we change our recursive `go` function like this:

```haskell
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
  where
    isSp = isSpace c
    addLine | c == '\n' = 1
            | otherwise = 0
    addWord | wasSpace = 0
            | isSp = 1
            | otherwise = 0
```
The result? The run time more than halved: this version takes a bit over 2 seconds. So we're more than twice as fast as `wc` now, still having pure and idiomatic code!

I'm not sure why ghc doesn't do this optimization for us, especially given that it never increases the amount of work done in this particular case, and doing the optimization manually (as in the code above) gives quite good results, improving the run time by more than a factor of two.

Compiler options and pragmas

Another thing worth trying is to play around with compiler options. The two biggest knobs are the optimization level and the codegen backend, and, in addition to those, we could try inlining `wc`. There is no good systematic approach here, so we'll just try these things in all possible combinations:
- LLVM codegen (via `-fllvm`): it sometimes does better on hardcore number-crunching code, but it might help here too.
- Optimization level (via `-O2`): most of the time the default `-O` is good enough, and there is no observable difference between that and `-O2`, but why not try?

I'm not even showing the resulting code since the function body is not affected at all.
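Just to make the inlining knob concrete, here's roughly what that variant looks like; it's simply the current version with an `INLINE` pragma attached (an illustration of mine, since the function body itself is untouched):

```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Bool
  }

-- The only change for this knob: ask GHC to inline wc at its call sites.
{-# INLINE wc #-}
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 False) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp = isSpace c
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSp = 1
                | otherwise = 0
```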
So here are the results in tabular form: