
Destroying C with 20 lines of Haskell: wc


tl;dr: today we'll look at implementing a toy wc command that is about 4-5 times faster than the corresponding GNU Coreutils implementation.

So I've recently come across a post by Chris Penner describing a Haskell implementation of the Unix wc command. Chris did a great job optimizing the Haskell version as well as showing how some high-level primitives (monoids and streaming, for instance) turn out to be useful here, although the result was still a bit slower than C. There's also a parallel version that relies heavily on the monoidal structure of the problem, and that one actually beats C.

But that post left me wondering: is it possible to do better without resorting to parallel processing?

Turns out the answer is yes. With some quite minor tweaks, the Haskell version manages to beat the hell out of the C version that presumably has decades of man-hours put into it.

Experimental setup

As usual, let’s go through the benchmarking procedure first.

Input data

I've downloaded this file and concatenated it with itself so that the total size is about 1.8 gigabytes:

  % for i in `seq 1 22`; cat part.txt >> test.txt
  % du -sh test.txt
  1.8G    test.txt

By the way, the file is residing on a tmpfs partition, so there is no disk IO involved.

Hardware and software

I'm running Gentoo Linux on a Core i7 machine, with most of the RAM free during the benchmarks.

All the Haskell code is built using ghc 8.8.2.

I'm competing against coreutils 8.x built using gcc 9.2 with -O2 -march=native:

  % wc --version
  wc (GNU coreutils) 8.x
  Packaged by Gentoo (8.x-r1 (p0))

By the way, that -march=native part is a little bit of cheating in favor of C, since the resulting Haskell binary can run on any x86-64 machine, while wc compiled with that flag can only run on my CPU and newer ones (that is, assuming the compiler actually used some of the newer SIMD extensions). But, well, let's be kind and give the C version a bit of an advantage.

Measuring

Measurements are done using time. time shows both user and system time, but I only consider user time for comparison, since:

1. System time proves to be quite consistently equal to about 0.3 s.
2. I'm not benchmarking the kernel anyway, so I'm mostly curious how much time is spent in my code.

As with any fully deterministic benchmark, I run each executable several times (5 in this case) and report the minimal user time.
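Just for illustration, the same procedure could even be scripted from Haskell itself. Here's a minimal sketch (my own, not part of the original setup; the ./wc-hs binary name is made up) that uses the POSIX child-process times to report the minimal user time across five runs:

  import Control.Monad (replicateM)
  import System.Posix.Process (getProcessTimes, childUserTime)
  import System.Process (callProcess)

  main :: IO ()
  main = do
      -- Run the executable 5 times, measuring the user time accumulated
      -- by child processes around each run (in clock ticks).
      times <- replicateM 5 runOnce
      print (minimum times)
    where
      runOnce = do
        before <- childUserTime <$> getProcessTimes
        callProcess "./wc-hs" ["test.txt"]  -- hypothetical binary under test
        after <- childUserTime <$> getProcessTimes
        pure (after - before)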

Baseline

So how does the C code perform? Let's run wc to find out!

Benchmarking as described above gives us 7.41 seconds of user time. So this sets the baseline for our implementation.

Haskell

Let's see where we start. I'll take this version from Chris' post, changing the function name and renaming cs to bs since we're counting bytes anyway:

  {-# LANGUAGE BangPatterns #-}

  import qualified Data.ByteString.Lazy.Char8 as BS
  import Data.Char

  wc :: BS.ByteString -> (Int, Int, Int)
  wc s =
      let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
       in (bs, ws, ls)
    where
      go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
      go (!bs, !ws, !ls, !wasSpace) c =
          let addLine | c == '\n' = 1
                      | otherwise = 0
              addWord | wasSpace = 0
                      | isSpace c = 1
                      | otherwise = 0
           in (bs + 1, ws + addWord, ls + addLine, isSpace c)

Sure, there are better-performing single-threaded versions mentioned in Chris' post, but it's just that this one is a bit more friendly to the kind of changes I will be making later on. Also, I don't really need the monoidal structure if I'm not going to use associativity, and I'm not going to use associativity as long as I keep this single-threaded.

Anyway, according to the numbers in Chris' post, this version is about 9 times slower than wc. I was unable to reproduce this: on my machine it's only about 4 times slower than wc, taking 28.6 seconds to run.

Also, this version has a tiny bug (it doesn't count the last word if it isn't followed by a space or a newline), but I'm not fixing it just yet, especially since the fix is trivial: it amounts to considering the last Bool field of the tuple that's currently being merely dropped, so it shouldn't affect the performance that much. We'll fix this in the final version.
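For instance, here's what the bug looks like in a GHCi session against the version above (my own illustration, not from the post):

  ghci> wc (BS.pack "hello world")
  (11,1,0)

That is, 11 bytes and zero newlines, but only one word: the final "world" is never counted.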

Anyway, how can this be improved?

Records to the rescue

Firstly, I don’t like big tuples, especially tuples having elements of the same type - it’s just too error-prone for my small attention span. So let’s replace the tuple with a record:

  {-# LANGUAGE Strict #-}
  {-# LANGUAGE RecordWildCards #-}

  import qualified Data.ByteString.Lazy.Char8 as BS
  import Data.Char

  data State = State
    { bs :: Int
    , ws :: Int
    , ls :: Int
    , wasSpace :: Bool
    }

  wc :: BS.ByteString -> (Int, Int, Int)
  wc s = (bs, ws, ls)
    where
      State { .. } = BS.foldl' go (State 0 0 0 False) s

      go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) (isSpace c)
        where
          addLine | c == '\n' = 1
                  | otherwise = 0
          addWord | wasSpace = 0
                  | isSpace c = 1
                  | otherwise = 0
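As a side note, Strict (which implies StrictData) makes every constructor field strict by default, so the record above behaves as if each field carried a bang. A minimal illustration of the equivalence (the primed names are mine, to avoid clashing with the real State):

  -- With {-# LANGUAGE Strict #-} enabled, the State declaration above
  -- is treated the same as this explicitly banged one without the pragma:
  data State' = State'
    { bs' :: !Int
    , ws' :: !Int
    , ls' :: !Int
    , wasSpace' :: !Bool
    }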

We don't use bang patterns anymore since we effectively made the fields of the record strict via the {-# LANGUAGE Strict #-} pragma. This wouldn't change the performance much, would it?

Turns out it does, and quite a lot - by a factor of four! This version takes just over 7 seconds, so we're almost as fast as the baseline now! How come?

Well, the answer is simple, and it's also hinted at in the original post: once we've defined a record type with strict fields, the compiler has a much easier time figuring out that it can unpack the contents of those fields. So we're effectively having

   

  data State = State
    { bs :: {-# UNPACK #-} !Int
    , ws :: {-# UNPACK #-} !Int
    , ls :: {-# UNPACK #-} !Int
    , wasSpace :: !Bool
    }

and this saves us an indirection and a memory allocation per Int field (even if the allocated values never leave the nursery and are thus quite cheap).

CSE

Next, note that we're computing isSpace c at least once for every character, and most of the time we do it twice. Let's make sure we indeed only call isSpace once!

So we change our recursive go function like this:

  go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
    where
      isSp = isSpace c

      addLine | c == '\n' = 1
              | otherwise = 0
      addWord | wasSpace = 0
              | isSp = 1
              | otherwise = 0

The result? The run time more than halved, now sitting at just over 2 seconds. So we're more than twice as fast as wc now, still having pure and idiomatic code!

I'm not sure why ghc doesn't do this optimization for us, especially given that it never increases the amount of work done in this particular case, while doing the optimization manually (as in the code above) gives quite an improvement in the run time.

Compiler options and pragmas

Another thing worth trying is to play around with compiler options. The two biggest knobs are the optimization level and the codegen backend, and, in addition to that, we could try inlining wc. There is no good systematic approach here, so we'll just try these things in all possible combinations (a small sketch of how these knobs are spelled follows the list):

1. LLVM codegen (via -fllvm) - it usually does better on hardcore number-crunching code, but it might help here too.
2. Optimization level (via -O2) - most of the time the default -O is good enough, and there is no observable difference between that and -O2, but why not try?
3. Inlining the function via the {-# INLINE wc #-} pragma. I don't really think it would help, since this function is called only once, so the corresponding overhead is negligible, and I don't expect any further optimization opportunities to kick in after inlining. But, again, let's try and see. I'm not even showing the resulting code, since the function body is not affected at all.
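For reference, the first two knobs don't have to live in the build command: they can also be pinned per-module. A sketch of how that might look (my assumption about the wiring; the post doesn't show its build setup):

  -- Per-module equivalent of passing the flags to ghc on the command line:
  {-# OPTIONS_GHC -O2 -fllvm #-}
  {-# LANGUAGE Strict #-}
  {-# LANGUAGE RecordWildCards #-}

  module Data.WordCount where

  -- ... the wc definition from above ...

  {-# INLINE wc #-}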

So here are the results in tabular form:

  -O2    -fllvm    INLINE    time, s
                             ~2
          ✓                  ~3
                   ✓         ~2
          ✓        ✓         ~2
   ✓                         ~2
   ✓      ✓                  ~2
   ✓               ✓         ~2
   ✓      ✓        ✓         ~2

Not sure why inlining helps, since this function is called only once (so the function call overhead hardly matters), and I don't really see any opportunity for other optimizations to kick in after inlining, but I won't argue with the numbers.

On the other hand, -O2 really helps, while LLVM doesn't give any improvement and, in fact, makes things worse when used alone.

Anyway, we could just as well stop here and push this version to prod. This is still quite idiomatic Haskell, and C already looks pretty beaten up! If I were to present the results somewhere while showing off the elegance of functional programming, this is the version I'd present.

But let's keep going (and let's keep -fllvm in our flags, 'cause why not).

More unpacking

The speedup we got after replacing the tuple with a custom record suggests another possible optimization.

Note that the Bool we have in the record doesn't get unpacked, since it has more than one constructor. What if we replace it with an Int, where 1 denotes True and 0 denotes False? This is a bit ugly but, you know, we're competing with C, so let's do it!

  data State = State
    { bs :: Int
    , ws :: Int
    , ls :: Int
    , wasSpace :: Int
    }

  wc :: BS.ByteString -> (Int, Int, Int)
  wc s = (bs, ws, ls)
    where
      State { .. } = BS.foldl' go (State 0 0 0 0) s

      go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
        where
          isSp | isSpace c = 1
               | otherwise = 0

          addLine | c == '\n' = 1
                  | otherwise = 0
          addWord | wasSpace == 1 = 0
                  | isSp == 1 = 1
                  | otherwise = 0

This by itself doesn't change much (I'm still getting the same results), but it opens up some further opportunities. Namely, notice that the expression for addWord can be translated from guards (which are essentially nested ifs) into a single arithmetic expression:

             

  addWord = (1 - wasSpace) * isSp

Proving the equivalence is left as an exercise to the reader.
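Or, for the less trusting, here's a quick exhaustive check (my own, not from the post) over the only four combinations of values the two flags can take:

  -- The guard-based definition, extracted into a standalone function.
  addWordGuards :: Int -> Int -> Int
  addWordGuards wasSpace isSp
    | wasSpace == 1 = 0
    | isSp == 1     = 1
    | otherwise     = 0

  -- The arithmetic version.
  addWordArith :: Int -> Int -> Int
  addWordArith wasSpace isSp = (1 - wasSpace) * isSp

  -- Prints True: the two definitions agree on all four cases.
  main :: IO ()
  main = print $ and [ addWordGuards w s == addWordArith w s
                     | w <- [0, 1], s <- [0, 1] ]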

Anyway, this change further reduces the run time to under 2 seconds.

Less Unicode

What if we wanted to push just a little further?

A reasonable compromise seems to be ignoring the presence of Unicode spaces outside of ASCII, avoiding the isSpace function (which did not do its Unicode job anyway, since we are looking at the input a byte at a time). It's crucial to note that this does not prevent us from correctly counting characters in later developments of this little wc substitute, but I'll elaborate on that in subsequent posts.

While we're at it, we'll also import Data.ByteString.Lazy instead of Data.ByteString.Lazy.Char8 to avoid unpacking and repacking bytes into Chars.

            
  {-# LANGUAGE Strict #-}
  {-# LANGUAGE RecordWildCards #-}

  module Data.WordCount where

  import qualified Data.ByteString.Lazy as BS

  data State = State
    { bs :: Int
    , ws :: Int
    , ls :: Int
    , wasSpace :: Int
    }

  wc :: BS.ByteString -> (Int, Int, Int)
  wc s = (bs, ws, ls)
    where
      State { .. } = BS.foldl' go (State 0 0 0 0) s

      -- c is now a Word8: 32 is the ASCII space, 10 is '\n', and
      -- c - 9 <= 4 catches \t \n \v \f \r (unsigned wraparound takes
      -- care of c < 9).
      go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
        where
          isSp | c == 32 || c - 9 <= 4 = 1
               | otherwise = 0

          addLine | c == 10 = 1
                  | otherwise = 0
          addWord = (1 - wasSpace) * isSp
  {-# INLINE wc #-}
This change results in quite a dramatic improvement, reducing the run time to somewhere around a fourth or a fifth of wc's time, even with the bit of an advantage given to wc in the form of -march=native.
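As a quick sanity check (mine, not from the post) that the byte-level predicate agrees with Data.Char.isSpace on the ASCII range we now care about:

  import Data.Char (chr, isSpace)
  import Data.Word (Word8)

  -- The predicate from isSp above, as a standalone function.
  isSpaceW8 :: Word8 -> Bool
  isSpaceW8 c = c == 32 || c - 9 <= 4  -- wraps around for c < 9, yielding False

  -- Prints True: both classifiers agree on every ASCII code point.
  main :: IO ()
  main = print $ all (\c -> isSpaceW8 c == isSpace (chr (fromIntegral c))) [0 .. 127]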

And, I think, that's about as good as it can get with reasonable effort, so let's stop here.

Fixing the off-by-one error

The problem is that we don't count the last word as a word if it isn't followed by a newline or a space character. Luckily, the fix is easy: we just take the wasSpace field into account and set its initial value to 1, resulting in code like

  wc s = (bs, ws + 1 - wasSpace, ls)
    where
      State { .. } = BS.foldl' go (State 0 0 0 1) s
      ...

Obviously, this does not affect performance at all.

Bonus: reducing system time

Remember that roughly 0.3 s of system time that we agreed not to consider? How about trying to cut it down as well?

I haven't shown yet how I actually use the wc function, so here's the main:

  import System.Environment (getArgs)

  import qualified Data.ByteString.Lazy as BS
  import Data.WordCount (wc)

  main :: IO ()
  main = do
    [path] <- getArgs
    contents <- BS.readFile path
    print $ wc contents

readFile wastes some cycles unnecessarily copying the data from the kernel to userspace. We could eliminate that overhead by using something like mmap. Let's take, for instance, the bytestring-mmap package, which provides ByteStrings over mmaped files with zero copying. We'll change main to go along the lines of

  import System.Environment (getArgs)
  import System.IO.Posix.MMap (unsafeMMapFile)

  import Data.WordCount (wc)

  main :: IO ()
  main = do
    [path] <- getArgs
    contents <- unsafeMMapFile path
    print $ wc contents

and, since unsafeMMapFile returns a strict ByteString, we'll also change our word counting module to import Data.ByteString instead of Data.ByteString.Lazy.

The result? We've significantly reduced the time spent in the kernel: the system time is now a small fraction of what we had previously. This does not have any statistically significant effect on our user time, though.

One might say the downside is that we've lost referential transparency: if the file changes, our ByteString will change as well. But I'd say it doesn't matter in our case, since we only observe the ByteString once.

Conclusion

This section will be quite short.

So we've managed to smash a C program that has been looked at by thousands of eyes of quite hardcore low-level Unix hackers over a few decades. We did this with a handful of lines of pure, mutation-free, idiomatic Haskell, achieving about 4-5 times the throughput of the C version and spending less than an hour on all the optimizations.

The resulting code is available on github.

Stay tuned for the second part, where we'll investigate shaping all this into an actual wc substitute, with different statistics that can be turned on or off, all while not computing what the user hasn't requested.
