Branch prediction minutiae in LZ decoders

Say we have an LZSS inner decode loop like this (not good, just an example):

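    u8 ctrl = read_u8();

    // match when MSB of ctrl is set, literal otherwise
    if (ctrl >= 128) {
            // low 7 bits hold the match length, biased by the minimum match length
            u32 len = ctrl - 128 + MML;
            if (len == 127 + MML)   // 7-bit field saturated: more length bytes follow
                    len += read_extra_length();
            u16 offset = read_u16();
            memcpy(op, op - offset, len);
            op += len;
    } else {
            u32 len = ctrl;
            if (len == 127)         // 7-bit field saturated: more length bytes follow
                    len += read_extra_length();
            memcpy(op, ip, len);
            op += len;
            ip += len;
    }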

We have an unpredictable branch to decide between literals and matches, and the misprediction penalties can eat a lot of time if you’re hitting lots of short copies, which you do the majority of the time. There’s also a branch to read spilled length bytes, but we hit that less than 1% of the time, and when we do hit it the misprediction isn’t such a big deal because we get lots of data out of it, so we are going to ignore it in this post.

LZ4’s solution to this is to always alternate between literals and matches, and send a 0-len literal when you need to send two matches in a row. That might look like this:

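    u8 ctrl = read_u8();

    // low nibble: literal length
    u32 literal_len = ctrl & 0xF;
    if (literal_len == 15)          // 4-bit field saturated: more length bytes follow
            literal_len += read_extra_length();
    memcpy(op, ip, literal_len);
    op += literal_len;
    ip += literal_len;

    // high nibble: match length, biased by the minimum match length
    u32 match_len = (ctrl >> 4) + MML;
    u16 match_offset = read_u16();
    if (match_len == 15 + MML)      // 4-bit field saturated: more length bytes follow
            match_len += read_extra_length();
    memcpy(op, op - match_offset, match_len);
    op += match_len;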
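For anyone who wants to run the fragment above, here is a minimal self-contained sketch of it. Several things are assumptions that the post does not specify: `MML` is 4, offsets are 16-bit little-endian, extra lengths use the LZ4-style "add bytes until one is not 255" encoding, and the stream ends after a literal run. The name `toy_decode` is made up, and the match copy is a byte loop rather than `memcpy` so that overlapping matches behave. There is no bounds checking, so treat it as an illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MML = 4 };  /* assumed minimum match length */

/* Assumed LZ4-style extra length: keep adding bytes until one is not 255. */
static size_t read_extra_length(const uint8_t **ip)
{
    size_t total = 0;
    uint8_t b;
    do {
        b = *(*ip)++;
        total += b;
    } while (b == 255);
    return total;
}

/* Decodes `in` into `out` and returns the number of bytes written.
 * Sketch only: trusts its input and does no bounds checking. */
static size_t toy_decode(const uint8_t *in, size_t in_len, uint8_t *out)
{
    const uint8_t *ip = in;
    const uint8_t *end = in + in_len;
    uint8_t *op = out;

    while (ip < end) {
        uint8_t ctrl = *ip++;

        /* Literal run: always executed, possibly with length 0. */
        size_t literal_len = ctrl & 0xF;
        if (literal_len == 15)
            literal_len += read_extra_length(&ip);
        memcpy(op, ip, literal_len);
        op += literal_len;
        ip += literal_len;

        if (ip >= end)
            break;  /* assumed: the stream ends after a literal run */

        /* Match: always follows the literal run. */
        size_t match_len = (size_t)(ctrl >> 4) + MML;
        size_t match_offset = (size_t)ip[0] | ((size_t)ip[1] << 8);
        ip += 2;
        if (match_len == (size_t)(15 + MML))
            match_len += read_extra_length(&ip);
        assert(match_offset > 0 && match_offset <= (size_t)(op - out));

        /* Byte-by-byte so overlapping matches (offset < len) replicate. */
        const uint8_t *match = op - match_offset;
        for (size_t i = 0; i < match_len; i++)
            op[i] = match[i];
        op += match_len;
    }
    return (size_t)(op - out);
}
```

Feeding it the bytes `03 'a' 'b' 'c' 03 00 00 07 00 01 'z'` exercises the case discussed above: the second control byte is `00`, a 0-len literal whose only purpose is to let a second match follow the first.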

So we got rid of the branch, but not really! memcpy has a loop, and now we’re polluting that branch’s statistics with 0-len literals. This does end up being an improvement on modern CPUs though. From the Haswell section in Agner Fog’s uarch PDF:

3.8 Pattern recognition for conditional jumps

The processor is able to predict very long repetitive jump patterns with few or no mispredictions. I found no specific limit to the length of jump patterns that could be predicted. One study found that it stores a history of at least 29 branches. Loops are successfully predicted up to a count of 32 or a little more. Nested loops and branches inside loops are predicted reasonably well.

Modern CPUs are able to identify loops and perfectly predict the exit condition. A good memcpy copies 16 or 32 bytes at a time, so we don’t pay any misprediction penalties until at least 512 bytes, at which point we don’t care because we got so much data out of it.
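As a rough illustration of why chunked copies are friendly to the loop predictor, here is a sketch of a copy loop that moves 16 bytes per iteration. `wild_copy16` is a hypothetical helper, not any particular library's memcpy. It deliberately rounds the copy up to a multiple of 16, so the caller has to guarantee slack at the end of both buffers and that the two regions do not overlap. It returns the iteration count purely so the behaviour is easy to inspect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies at least `len` bytes in 16-byte chunks.
 * May read and write up to 16 bytes past `len`, so both buffers need slack.
 * Returns how many times the loop body ran. */
static size_t wild_copy16(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    size_t done = 0;
    size_t iterations = 0;
    do {
        /* Fixed-size memcpy: compilers typically turn this into one
         * unaligned 16-byte load and one store. */
        memcpy(dst + done, src + done, 16);
        done += 16;
        iterations++;
    } while (done < len);  /* the only branch: the loop back-edge */
    return iterations;
}
```

With this shape, any copy of 16 bytes or fewer, including a 0-len literal, runs the body exactly once and never takes the back-edge. The iteration count only grows as ceil(len / 16), so a 17-byte copy takes two iterations and a 512-byte copy takes 32.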