Beating C with Futhark Running on GPU, Hacker News

Posted on October 25, 2019 by Troels Henriksen

Chris Penner recently wrote a blog post titledBeating C with 80 lines of Haskell, where he showed how to optimize and parallelize a Haskell implementation ofwcto outperform the implementation bundled with macOS. This made a lot of people on the Internet angry, as they felt thatwcis already fast enough (it is), is often applied on streams whereas Chris Penner’s implementation requires an input file (probably), and that the optimized Haskell code is unreasonably complicated (arguably). However, I really enjoyed the explanation of how to parallelize a seemingly-sequential problem via a clever monoid.

Not long after, Matthew Maycock wroteBeating C with Dyalog APL, where he implementedwcin APL. In a degenerate way, Futhark can be seen as the bastard offspring of Haskell and APL, so I obviously have to get involved as well. In particular, I want to show that it is straightforward to write highly efficient parallel code using the same high-level principles as we would in Haskell.The full source code is available on GitHub.

Now, to pre-appease angry Internet people, let me make things clear from the start: Futhark is a decent language for counting words, but a bad language for implementingwcspecifically. In particular, Futhark is apurelanguage, and not in the way that Haskell is “pure” but still seems to find room for highly sophisticated and efficient IO mechanisms. No, in Futhark, youcannot read a file or print to the screen, which makes it rather tricky to implement a program likewcthat isentirelyabout reading from a file and printing to the screen! What I will show is how to implement the core word-counting logic in Futhark (using the same approach as for Haskell), and how to call it from a wrapper program written in C that takes care of the IO.

`libwc.fut`

The core of the word counting logic is going to be taken directly from Chris Penner’s Haskell implementation. I won’t go through it in detail, but highly recommend readinghis original article. Like Haskell, Futhark is a functional language, so the translation is mostly straightforward.

To start out, we define two sum types that encode a kind of state machine for the word counting:

We then define an associative function for combining twofluxvalues, along with a neutral element. This means thatfluxis amonoid. In Haskell, this is done by implementing theMonoidtype class. Futhark does not have type classes, so we just write the definitions like any other:

Now we need a function for mapping characters tofluxvalues. Futhark is generally not a good language for string processing (in fact, it has neither a string nor a character type), so we are going to restrict ourselves to ASCII, and represent characters as bytes. First, a function for determining whether something is a whitespace character:

This function recognizes newlines, tabs, and spaces as whitespace. We can now define thefluxfunction:

Next we define thecountstype, which is a record tracking the number of characters, words, and lines we have seen so far.

And we also define a combining function and neutral element forcounts:

And given a single character on its own, itscount:

Finally we can put together the pieces and create a function for counting the number of characters, words, and lines in a “string” (modeled as an array of bytes), with amap–reducecomposition just as in Haskell:

Performance depends crucially on the compiler performing fusion to avoid constructing a large intermediate array as the result of themap. Fortunately, the Futhark compiler is very good at fusion.

Thewcfunction is defined withentryrather thanletbecause we want it to be callable from the outside world. When we compile this program, onlyentryfunctions will be visible in the generated API. The final lambda simply transforms thecountsrecord, which for technical reasons would be opaque to the outside world, into a simple triple of integers.

`wc.c`

We compilelibwc.futinto a C library containing (for now) ordinary sequential C code:

$ futhark c --library libwc.fut

This produces two files:libwc.handlibwc.c, with the former defining the interface to the latter. Futhark’s C API is a bit verbose, but fundamentally simple. First, we initialize our generated library by creating a context:

libwc.hdeclares a functionfuthark_entry_wc ()that corresponds to ourwcfunction. It has the following type:

So, the argument cannot just be any old C array, it has to be a specificfuthark_u8_1d. There is a functionfuthark_new_u8_1d ()for creating these:

Creating such an array always involves a copy, because Futhark wants to manage its own memory. To avoid copying the input file contents more than once, we usemmap ()on the open file and pass the resulting pointer to Futhark. The entire procedure looks like this:

FILE* fp=fopen (filename,"r");fseek (fp,0, SEEK_END);size_tn=ftell (fp);rewind (fp);void* data=mmap (NULL, n, PROT_READ, MAP_SHARED, fileno (fp),  (0) );structfuthark_u8_1d * arr=futhark_new_u8_1d (ctx, data, n);

We can then call the Futhark entry point:

Print the result:

Putting all of this (plus some boilerplate and cleanup) inwc.c, we can compile with:

$ gcc wc.c libwc.c -o wc -O3 -lm

And that’s it! So, how fast is it? I’ll be testing on a 100 MiB filehuge.txtthat is merelybig.txtfrom the original post repeated some times. First, let us check out GNUwc8. 22 (using the C locale sowccan also assume ASCII):

Now for our Futharkwc:

Not bad! It actually runs faster than GNUwc. I ran both programs a few times and took the fastest runtime for each. And note that this iswithout any parallelismor low-level optimizations! I am of course hugely biased, but I think the Futhark program is an easier read than the optimized Haskell program.

But really, Futhark is a parallel language, and generating sequential C code is not what it’s for. So how do we make this program run in parallel on my employers’ $ (RTX) Ti GPU? We simply recompile usingfuthark openclinstead offuthark c:

$ futhark opencl --library libwc.fut $ gcc wc.c libwc.c -o wc -O3 -lm -lOpenCL

Alright, let’s check out the performance:

Well, it’s better, but not really by that much. There are two possible reasons:

Word counting is primarily IO-bound, and it is much too expensive to ferry the file contents all the way to the GPU over the (relatively) slow PCI Express bus just to do a relatively meagre amount of computation.

GPU initialisation (hidden inside thefuthark_context_new ()call) takes a nontrivial amount of time, as it may involve JIT compilation of GPU kernels and other bookkeeping operations by the GPU driver.

On this machine, for this problem, reason (2) is the most significant. If we augmentwc.cwith a- toption that makes it perform its own internal timing, excluding the context initialisation (but including copying the entire file to the GPU), we get this:
./ wc-opencl -t huge.txt   2055312 17531120 103818656 huge.txt Runtime: 0. 070 s
Much faster! Apparently context initialisation has a fixed cost of about 230 milliseconds on this machine. This is relatively fast – I have seen multiple seconds on other systems. This is the main reason why Futhark is a bad choice forwc, or other kinds of very short-running processes – you really do not want to pay this startup cost unless it can be amortised by a significant amount of subsequent computation.

Is the Futhark code as efficiently written as possible? I think it’s close, but I know that thechar_typevalues will be stored as an entire byte each, despite only encoding a single bit of information. This does not matter on the CPU, but on the GPU, this storage comes out of the fairly sparse on-chip scratchpad memory. I have not measured the impact, but a more compact encoding might improve performance slighly. However, I generally believe that such representation-level optimizations are the job of the compiler.

In conclusion, I’m actually surprised that Futhark manages to out-compete GNUwcat all – I would have thought that the overhead of copying the file to the GPU would offset the faster computation. Most likely, GNUwcdoes not have any special optimizations for the case of word-counting large ASCII files, as it is already more than fast enough.

Since I still don’t believe Futhark is a good choice for implementingwc, I think the main takeaway here is that data-parallelization techniques developed for other languages (eg Haskell) can be transfered to Futhark with good results.