in ,

A fast alternative to the modulo reduction, Hacker News

A fast alternative to the modulo reduction, Hacker News


Suppose you want to pick an integer at random in a set ofNelements. Your computer has functions to generate random 32 – bit integers, how do you transform such numbers into indexes no larger than (N? Suppose you have a hash table with a capacityN. Again, you need to transform your hash values ​​(typically 32 – bit or 64 – bit integers) down to an index no larger thanN. Programmers often get around this problem by making sure thatNis a power of two, but that is not always ideal.

We want a map that as fair as possible for an arbitrary integerN. That is, ideally, we would want that there are exactly 232/Nvalues ​​mapped to each value in the range {0, 1,…,N– 1} when starting from all 23232 – bit integers.

Sadly, we cannot have a perfectly fair map if 232is not divisible byN. But we can have the next best thing: we can require that there be either floor (2/N) or ceil (232/N) values ​​mapped to each value in the range.

IfNis small compared to 232, then this map could be considered as good as perfect.

The common solution is to do a modulo reduction:xmodN. (Since we are computer scientists, we define the modulo reduction to be the remainder of the division, unless otherwise stated.)

uint 32 _ t reduce( (uint)  _ t x,Uint  _T N){  returnx%N;}

How can I tell that it is fair? Well. Let us just run through the values ​​ofxstarting with 0. You should be able to see that the modulo reduction takes on the values ​​0, 1,…,N– 1, 0, 1,… as you incrementx. Eventually,xarrives at its last value (232– 1), at which point the cycle stops, leaving the values ​​0, 1,…, (232– 1) modNwith ceil (2) 32/N) occurrences, and the remaining values ​​with floor (232/N) occurrences. It is a fair map with a bias for smaller values.

It works, but a modulo reduction involves a division, and divisions are expensive. Much more expensive than multiplications. A single 32 – bit division on a recent x 64 processor has a throughput of one instruction every six cycles with a latency of 26 cycles. In contrast, a multiplication has a throughput of one instruction every cycle and a latency of 3 cycles.

There are fancy tricks to “precompute” a modulo reduction so that it can be transformed into a couple of multiplications as well as a few other operations, as long asNis known ahead of time. Your compiler will make use of them ifNis known at compile time. Otherwise, you can use a software library or work out your own formula.

But it turns out that you can do even better! That is, there is an approach that is easy to implement, and provides just as good a map, without the same performance concerns.

Assume thatxandNare 32 – bit integers, consider the 64 – bit productx* (N). You have that (x*N) div 232is in the range, and it is a fair map.

uint 32 _ t reduce( (uint)  _ t x,Uint  _T N){  return()(uint  (_T) )x*(uint 64 _T)N)>>32;}

Computing (x*N) div 232is very fast on a 64 -bit processor. It is a multiplication followed by a shift. On a recent Intel processor, I expect that it has a latency of about 4 cycles and a throughput of at least on call every 2 cycles.

So how fast is our map compared to a 32 – bit modulo reduction?

To test it out, I have implemented a benchmark where you repeatedly access random indexes in an array of sizeN. The indexes are obtained either with a modulo reduction or our approach. On a recent Intel processor (Skylake), I get the following number of CPU cycles per accesses:

modulo reduction Fast range
8.1 2.2

So it is four times faster! No bad.

As usual,my code is freely available. *****

What can this be good for? Well… if you have been forcing your arrays and hash tables to have power-of-two capacities to avoid expensive divisions, you may be able to use the fast range map to support arbitrary capacities without too much of a performance penalty. You can also generate random numbers in a range faster, which matters if you have a very fast random number generator.

So how can I tell that the map is fair?

By multiplying byN, we take integer values ​​in the range [0, 232) and map them to multiples ofNin [0,N* 232). By dividing by 232, we map all multiples ofNin [0, 232) to 0, all multiples ofNin [232, 2 * 232) to one, and so forth. To check that this is fair, we just need to count the number of multiples ofNin intervals of length 232. This count must be either ceil (232/N) or floor (232/N).

Suppose that the first value in the interval is a multiple ofN: that is clearly the scenario that maximizes the number of multiples in the interval. How many will we find? Exactly ceil (2) / (N). Indeed, if you draw sub-intervals of lengthN, then every complete interval begins with a multiple ofNand if there is any remainder , then there will be one extra multiple ofN. In the worst case scenario, the first multiple ofNappears at positionN– 1 in the interval. In that case, we get floor (232/N) multiples. To see why, again, draw sub-intervals of lengthN. Every complete sub-interval ends with a multiple ofN.

This completes the proof that the map is fair.

For fun, we can be slightly more precise. We have argued that the number of multiples was maximized when a multiple ofNappears at the very beginning of the interval of length 232. At the end, we get an incomplete interval of length 232modN. If instead of having the first multiple ofNappear at the very beginning of the interval, it appeared at index 232modN, then there would not be room for the incomplete subinterval at the end. This means that whenever a multiple ofNoccurs before 232modN, then we shall have ceil (232/N) multiples, and otherwise we shall have floor (2) 32/N) multiples.

Can we tell which outcomes occur with frequency floor (232/N) and which occurs with frequency ceil (232/N)? Yes. Suppose we have an output valuek. We need to find the location of the first multiple ofNno smaller thank232. This location is ceil (k232/N)Nk2) 32which we just need to compare with 232modN. If it is smaller, then we have a count of ceil (232/N), otherwise we have a count of floor (232/N).

You can correct the bias with a rejection, see my post onfast shuffle functions.

Useful code: I published aC / C header on GitHubthat you can use in your projects.

Further reading:

(Update: I have made the proof more intuitive following a comment by Kendall Willets.) *****

Brave Browser
Read More
Payeer

What do you think?

Leave a Reply

Your email address will not be published. Required fields are marked *

GIPHY App Key not set. Please check settings

How Bash completion works, Hacker News

How Bash completion works, Hacker News

Neo Cab is the dystopian, gig-economy Crazy Taxi we’ve always wanted, Ars Technica

Neo Cab is the dystopian, gig-economy Crazy Taxi we’ve always wanted, Ars Technica