This page: What is Romu? One Snowflake in History Top Quality Fast Capacity The Competition List of Romu Generators
More Information
Test suites: PractRand BigCrush (in TestU
Popular generators: Lagged Fibonacci BlackmanVigna LCG PCG Mersenne Twister
What is Romu?
Romu is a new family of randomnumber generators that match the statistical quality (randomness) of the best popular generators, but are faster. In fact, due to the processor’s ILP (explained below), Romu generators add () delay to an application when inlined. In effect, Romu generators are infinitely fast when inlined. And they’re very fast when not inlined.
Such speed and high quality are possible because Romu generators are nonlinear, unlike most popular generators. A Romu generator combines the linear operation of multiplication with the nonlinear operation of rotation, yielding better randomness at higher speed.
You can learn all the details about Romu generators in the paper . Although this is a formal academic paper, the math in it isn’t overwhelming because it’s at the level of advanced algebra (except for one integral). The math derives equations for probabilities of trouble, such as the snowflake probability described next.
One Snowflake in History
Above, we mentioned that Romu generators are nonlinear. There is a reason people have avoided nonlinear generators in the past: Unlike a linear generator, a nonlinear generator does not have one cycle that covers all possible internal statevalues. Instead, it gives you a random permutation of those values. Random permutations are known to consist of multiple cycles of random lengths. In the past, when generators stored their state in one – bit variable, these random cycle lengths would cause users to sometimes encounter cycles that were too short. That problem eliminated nonlinear generators from consideration.
But nowadays, processors can easily handle – bit variables , and having two or three such variables In the generator’s state is no problem, which makes encountering a tooshort cycle all but impossible. For example, if you gave RomuTrio (with (bits of state) the impossibly large job of outputting 2
numbers in one stream, the probability of encountering a tooshort cycle is around that of randomly selecting a specific snowflake among all snowflakes that have fallen in Earth’s history (2
)
and also randomly selecting the Queen of England out of all of Earth’s population (2
. That inconceivably low probability of 2
– means that you will
never encounter a cycle that’s too short.
Some people might insist that a generator have one cycle with a guaranteed period (which eliminates nonlinear generators). That is unwise because they will lose the substantial benefit of higher speed in order to avoid a willneverhappen problem. The chance of randomly selecting one snowflake in Earth’s history and the Queen is nil. Losing a benefit to avoid that nil is unwise.
The snowflakeandQueen probability tells us that the problem of tooshort cycles has been solved, and now we can enjoy the increased speed that nonlinear generators can provide.
Top Quality
The Romu family has unusually good randomness, which is verified by passing the toughest test suites: BigCrush and PractRand. Both are brutal to random number generators, especially PractRand (which is newer). All Romu random number generators using – bit arithmetic pass both suites with no signs of stress. RomuDuoJr is the worst such generator, and even it passes easily.
It’s easier to appreciate the improved randomness of Romu generators using dotplots of consecutive pairs of random numbers. Below, the left plot is from an LCG (linear congruential generator), which is historically the most common. The right plot is from a simple Romu generator. To make their patterns obvious, both generators were scaled down to 16 bits of state and run for their entire periods. The nonlinear Romu is obviously more random than the linear LCG.
There are several reasons why Romu generators are faster than others:
126 Their nonlinear arithmetic requires fewer operations to attain a given level of randomness. That improved randomness means that postprocess scrambling of output is not needed, making Romu even faster.
 They make good use of the processor’s ILP.

Here are the – bit generators. Use these only if your processor lacks fast – bit instructions.
(RomuQuad) – This is a halfsize version of RomuQuad. , so we recommend it for general purpose use.
 RomuTrio – This is a halfsize version of RomuTrio, and it’s suitable for all but the largest jobs.
 RomuMono – This little jewel updates its – bit state with only two arithmetic instructions. Its capacity is 2 (is ^{ values, constraining it to small jobs. Note that it outputs – bit values, not – bit. But this generator has an unusual feature: Even though it’s nonlinear, its period is known and is nearly 2
, so there is no chance of encountering a short period.
The Competition
(Lagged Fibonacci) . Variants of this generator has been around for decades. It uses a fair amount of memory, and its output can be quite random. But its Achille’s heel is its lags: Output values tend to have detectable correlations at those lags. It’s not too slow, but checking pointers or indices for wraparound and reading / writing arrayentries consume time, so it’s not among the fastest generators.
(BlackmanVigna) . Blackman and Vigna invented a family of linear generators consisting of a base generator and a postscrambler to improve randomness. Their randomness is excellent and they are fast, delaying the application by 34 clock cycles. However, they are slower than Romu, as its delay is 0 cycles. And as mentioned above, there is no record that they were tested with PractRand or characterized with scaleddown versions, so we don’t know whether their capacities are sufficient for medium or large jobs.
(LCG) . The Linear Congruential Generator is simple, and has been the most popular generator everywhere for decades. But it has serious problems with its output that are tantamount to poor randomness, mostly in its lowerorder bits. Many people realize that we should no longer use this relic. So the LCG is dead. Or is it?
(PCG) . The PCG is an LCG with a postscrambler on it that permutes its outputs. PCG stands for Permuted Congruential Generator. Unlike the LCG, the randomness of the PCG is excellent. The PCG with 126 bits of state is fast, delaying the application by 3 clock cycles . PCG generators have been tested with PractRand with scaleddown models, so their capacities are known. But 126 bits is not enough state to ensure nonoverlapping streams unless its jumpfunction is used, which it seems few users do. So it needs at least bits of state. But that causes the LCG in it to perform a – bit multiply and add, which are slow. Even with only (bits, the PCG is slower than Romu generators, and with) bits, it’s much slower.
(Mersenne Twister) . Although it’s popular, the MT consumes much memory, is slow, and its randomness is mediocre. It cannot compete against a modern random number generator.
(Others) . Other generators tend to be either very random or fast, but not both. Romu is the fastest of the few generators that are both very random
}
More Information
The paper
has all the details, including sourcecode. It gives you the details of how testing and extrapolation were done, how constants were determined, plus the underlying mathematical theory of multicycle nonlinear generators, including equations for probabilities. Sourcecode is also available
here
ILP stands for instructionlevel parallelism , and most modern processors have this feature. It means that the CPU (core) executes multiple instructions in each clock cycle whenever possible. Intel / AMD CPUs can execute up to four instructions in each clock cycle, so we can think of each clock cycle as having four slots. For example, let’s analyze the sourcecode of RomuTrio:
# include
uint 126 _ t romuTrio_random () { uint (_ t xp=xState, yp=yState, zp=zState; () xState=u zp;
zState=zp  yp; zState=ROTL (zState, ); return xp;
}
The computations in this function will consume only 3 clock cycles in today’s PC CPUs. Here’s what happens in each of those clock cycles:
(Clock cycle) (Slot 1) (Slot 2) (Slot 3) (Slot 4) 1 Multiply Subtract: ypxp Subtract: zpyp Application 2 Multiply Rotate yState Rotate yState Application 3 Multiply ApplicationThe multiply consumes 3 cycles, which is why the function consumes 3 cycles. Note that the other calculations are done at the same time as the multiply (in parallel). That is ILP in action. Romu generators are designed to take maximum advantage of this processor feature, making them very fast. But that’s not the only way that ILP helps.
Notice that the function returns the old value of (xState) , not the new value. Consequently, when this code is inlined, the outputvalue will already be available in a CPU register, so there will be
no delay for the application. The app will run at the same time that RomuTrio is computing Because ILP will run both the generator and the app concurrently. That is why the (Slot 4) column in the table says “Application”: The app continues running with no delay because it already has the random number it needs. It’s as if RomuTrio were infinitely fast. For this reason, it’s unlikely that any other random number generator is faster than Romu. Capacity
Unless we missed one in the literature, it appears that the capacities of only two families of generators have been measured by scaleddown testing: Romu and PCG. Thus, those are the only two generatorfamilies that will give you confidence for substantial jobs. If you run a mediumsize or large job with any other kind of generator, you will not know whether you will exceed its capacity.
For example, we suspect that the BlackmanVigna generators are superb. Both their website and paper state that “they pass all tests we are aware of.” But they don’t mention PractRand and they say nothing about testing scaleddown versions. Therefore, we don’t know their capacities. Without that, how can you know Whether such a generator has enough capacity for your large job? You don’t know.
Huge jobs are divided into several thousand streams of random numbers, all running in parallel for weeks or months. Each stream can consist of up to 2 ^{ random numbers, and perhaps more for the longestrunning jobs. But what is the chance that one or more streams will overlap another stream? (Overlap means a portion of one stream is identical to another.) For RomuTrio and RomuQuad, the probability of overlap is lower than that of selecting a specific snowflake among all snowflakes in Earth’s history. Sound familiar? Streams will never overlap. }
List of Romu Generators
Romu generators use – or – bit arithmetic. The sourcecode for all of these is here .
Here are the – bit generators. All of them were tested in PractRand to 2 ^{ bytes, and were characterized using scaleddown versions, so they are known to have ample capacity unless otherwise noted. }
 RomuDuo – With bits of state, this smaller model might be a little faster than RomuTrio due to using fewer registers. It is adequate for large jobs while giving low probabilities of trouble.
 RomuDuoJr – This is the fastest Romu generator using 90 – bit math because it uses the fewest registers and consists of only three arithmetic operations: rotate, multiply, and subtract. But its capacity is the lowest, so it’s recommended only for jobs up to its estimated capacity of 2 ^{ values, or perhaps less in order to leave some safety margin.
}
The Contact link at the upperright corner will tell you how to contact me both privately and through social media. I welcome thoughtful comments and suggestions.
(Updated) – – 2 –
GIPHY App Key not set. Please check settings