tl;dr: today we'll look at implementing a toy `wc`
command that is about 4-5 times faster than the corresponding GNU Coreutils implementation.
So I've recently come across a post by Chris Penner describing a Haskell implementation of the Unix `wc` command. Chris did a great job optimizing the Haskell version as well as showing how some high-level primitives (monoids and streaming, for example) turn out to be useful here, although the result was still a bit slower than C. There's also a parallel version that relies heavily on the monoidal structure of the problem, and that one actually beats C.
But that post left me wondering: is it possible to do better without resorting to parallel processing?
Turns out the answer is yes. With some quite minor tweaks, the Haskell version manages to beat the hell out of the C version that presumably has decades of man-hours put into it.
Experimental setup
As usual, let’s go through the benchmarking procedure first.
Input data
I've downloaded this file and concatenated it with itself so that the total size is about 1.8 gigabytes:

```
% for i in `seq 1 22`; do cat part.txt >> test.txt; done
% du -sh test.txt
1.8G    test.txt
```
By the way, the file resides on a `tmpfs` partition, so there is no disk IO involved.

Hardware and software

I'm running Gentoo Linux on a Core i7 machine with plenty of RAM, most of it free during the benchmarks.
All the Haskell code is built using ghc 8.8.2.
I'm competing against coreutils 8 built using gcc 9.2 with `-O2 -march=native`:

```
% wc --version
wc (GNU coreutils) 8.
Packaged by Gentoo (8.-r1 p0)
```
By the way, that `-march=native` part is a little bit of cheating in favor of C, since the resulting Haskell binary can run on any x86-64 machine, while `wc` compiled with that flag can only run on my CPU and newer ones (that is, assuming the compiler actually used some of the newer SIMD extensions). But, well, let's be kind and give the C version a bit of an advantage.

Measuring
Measurements are done using `time`. `time` shows both user and system time, but I only consider user time for comparison, since:

- system time proves to be quite consistently equal to about 0.3 s;
- I'm not benchmarking the kernel anyway; I'm curious how much time is spent in my code.
As with any fully deterministic benchmark, I run each executable several times (5 in this case) and report the minimal (user) time.
Baseline
So how does the C code perform? Let's run `wc` to find out!

Benchmarking as described above gives us 7.41 seconds of user time. So this sets the baseline for our implementation.
Haskell
Let's see where we start. I'll take this version from Chris' post, changing the function name and renaming `cs` to `bs` since we're counting bytes anyway:
import qualified (Data.ByteString.Lazy.Char8) (as) BS)import (Data.Char) (wc :: BS.ByteString
-> ( Int , (Int , Int ) wc s= (let bs, ws, ls, _)=(BS.foldl 'go) 0 , (0) , (0) , False ) s (in) (bs, ws, ls) wherego :: ( int) , Int , (Int) , (Bool) ) - () Char -> ( (Int) , (int , Int , Bool )go (! (bs, !) (ws,
!
ls, !wasSpace (c)=() let let addLine | c == ' n'=() (1)| otherwise=(0) addWord | wasSpace |(1) , ws(isSpace) (c)= 1)
| (otherwise =in (bs (
, addWord, ls addLine, isSpace c)Sure, there are better-performing single-threaded versions mentioned in Chris ’post, but it’s just that this one is a bit more friendly to the kind of changes I will be making later on. Also, I don't really (need) the monoidal structure if I'm not going to use associativity, and I'm not going to use associativity as I keep this single-threaded.
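For concreteness, here's a minimal sketch of a driver around this function, assuming it lives in the same module as the `wc` above; the structure and names are my own, not something prescribed by Chris' post or by coreutils:

```haskell
-- A minimal driver for the wc function above (my own sketch). BS.readFile
-- returns a lazy ByteString, so the fold streams over the 1.8 GB test file
-- chunk by chunk instead of loading it into memory at once.
import System.Environment (getArgs)

main :: IO ()
main = do
  [path]   <- getArgs
  contents <- BS.readFile path
  let (bytes, wordCount, lineCount) = wc contents
  -- mimic wc's output order: lines, words, bytes
  putStrLn $ unwords [show lineCount, show wordCount, show bytes, path]
```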
Anyway, according to the numbers in Chris' post, this version is about 9 times slower than `wc`. I was unable to reproduce this: on my machine it's more like four times slower than `wc`.

Also, this version has a tiny bug: it doesn't count the last word if it's not followed by a space or a newline. I'm not fixing it just yet, especially since the fix is trivial and amounts to considering the last `Bool` field of the tuple that's currently being merely dropped, so it shouldn't affect the performance that much. We'll fix this in the final version.
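Just to show how small that fix is, here's one possible sketch (my own illustration, not necessarily the final version from this post): keep the `Bool` instead of dropping it, and count one more word if the input didn't end in whitespace.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char (isSpace)

-- Sketch of the fix: if the input is non-empty and didn't end in whitespace,
-- the last word was never counted by go, so we add one.
wcFixed :: BS.ByteString -> (Int, Int, Int)
wcFixed s = (bs, ws + lastWord, ls)
  where
    (bs, ws, ls, endedInSpace) = BS.foldl' go (0, 0, 0, False) s
    lastWord | bs > 0 && not endedInSpace = 1
             | otherwise                  = 0
    -- the step function itself is exactly the same as above
    go (!bs', !ws', !ls', !wasSpace) c =
        (bs' + 1, ws' + addWord, ls' + addLine, isSpace c)
      where
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSpace c = 1
                | otherwise = 0
```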
Anyway, how can this be improved?
Records to the rescue
Firstly, I don’t like big tuples, especially tuples having elements of the same type - it’s just too error-prone for my small attention span. So let’s replace the tuple with a record:
```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Bool
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 False) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) (isSpace c)
      where
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSpace c = 1
                | otherwise = 0
```
We don't use bang patterns anymore since we effectively made the fields of the record strict via the `{-# LANGUAGE Strict #-}` pragma. This wouldn't change the performance much, would it?

Turns out it does, and quite a lot: by a factor of four! This version takes a bit over 7 seconds, so we're almost as fast as the baseline now! How come?

Well, the answer is simple, and it's also hinted at in the original post: once we've defined a record type with strict fields, the compiler has an easier time figuring out that it can unpack the contents of those fields. So we're effectively having
```haskell
data State = State
  { bs :: {-# UNPACK #-} !Int
  , ws :: {-# UNPACK #-} !Int
  , ls :: {-# UNPACK #-} !Int
  , wasSpace :: !Bool
  }
```
and this saves us an indirection and a memory allocation per `Int` field (even if the allocated values never leave the nursery and are thus quite cheap).

CSE
Next, note that we're computing `isSpace c` at least once for every character, and most of the time we actually do it twice. Let's make sure we indeed only call `isSpace` once!

So we change our recursive `go` function like this:

```haskell
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
  where
    isSp = isSpace c
    addLine | c == '\n' = 1
            | otherwise = 0
    addWord | wasSpace = 0
            | isSp = 1
            | otherwise = 0
```
The result? The run time more than halved: this version takes a bit over 2 seconds. So we're more than twice as fast as `wc` now, still having pure and idiomatic code!

I'm not sure why ghc doesn't do this optimization for us, especially given that it never increases the amount of work done in this particular case, and doing the optimization manually (as in the code above) gives quite good results, improving the run time by more than a factor of two.

Compiler options and pragmas

Another thing worth trying is to play around with compiler options. The two biggest knobs are the optimization level and the codegen backend, and, in addition to those, we could try inlining `wc`. There is no good systematic approach here, so we'll just try these things in all possible combinations:
- LLVM codegen (via `-fllvm`): it sometimes does better on hardcore number-crunching code, but it might help here too.
- Optimization level (via `-O2`): most of the time the default `-O` is good enough, and there is no observable difference between that and `-O2`, but why not try?

I'm not even showing the resulting code since the function body is not affected at all.
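Just to make the inlining knob concrete, here's roughly what that variant looks like; it's simply the current version with an `INLINE` pragma attached (an illustration of mine, since the function body itself is untouched):

```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Bool
  }

-- The only change for this knob: ask GHC to inline wc at its call sites.
{-# INLINE wc #-}
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 False) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp = isSpace c
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSp = 1
                | otherwise = 0
```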
So here are the results in tabular form: