This page: What is Romu? One Snowflake in History Top Quality Fast Capacity The Competition List of Romu Generators
More Information
Test suites: PractRand BigCrush (in TestU
Popular generators: Lagged Fibonacci Blackman-Vigna LCG PCG Mersenne Twister
What is Romu?
Romu is a new family of random-number generators that match the statistical quality (randomness) of the best popular generators, but are faster. In fact, due to the processor’s ILP (explained below), Romu generators add () delay to an application when inlined. In effect, Romu generators are infinitely fast when inlined. And they’re very fast when not inlined.
Such speed and high quality are possible because Romu generators are nonlinear, unlike most popular generators. A Romu generator combines the linear operation of multiplication with the nonlinear operation of rotation, yielding better randomness at higher speed.
You can learn all the details about Romu generators in the paper . Although this is a formal academic paper, the math in it isn’t overwhelming because it’s at the level of advanced algebra (except for one integral). The math derives equations for probabilities of trouble, such as the snowflake probability described next.
One Snowflake in History
Above, we mentioned that Romu generators are nonlinear. There is a reason people have avoided nonlinear generators in the past: Unlike a linear generator, a nonlinear generator does not have one cycle that covers all possible internal state-values. Instead, it gives you a random permutation of those values. Random permutations are known to consist of multiple cycles of random lengths. In the past, when generators stored their state in one – bit variable, these random cycle -lengths would cause users to sometimes encounter cycles that were too short. That problem eliminated nonlinear generators from consideration.
But nowadays, processors can easily handle – bit variables , and having two or three such variables In the generator’s state is no problem, which makes encountering a too-short cycle all but impossible. For example, if you gave RomuTrio (with (bits of state) the impossibly large job of outputting 2
numbers in one stream, the probability of encountering a too-short cycle is around that of randomly selecting a specific snowflake among all snowflakes that have fallen in Earth’s history (2
)
and also randomly selecting the Queen of England out of all of Earth’s population (2
. That inconceivably low probability of 2
– means that you will
never encounter a cycle that’s too short.
Some people might insist that a generator have one cycle with a guaranteed period (which eliminates nonlinear generators). That is unwise because they will lose the substantial benefit of higher speed in order to avoid a will-never-happen problem. The chance of randomly selecting one snowflake in Earth’s history and the Queen is nil. Losing a benefit to avoid that nil is unwise.
The snowflake-and-Queen probability tells us that the problem of too-short cycles has been solved, and now we can enjoy the increased speed that nonlinear generators can provide.
Top Quality
The Romu family has unusually good randomness, which is verified by passing the toughest test suites: BigCrush and PractRand. Both are brutal to random number generators, especially PractRand (which is newer). All Romu random number generators using – bit arithmetic pass both suites with no signs of stress. RomuDuoJr is the worst such generator, and even it passes easily.
It’s easier to appreciate the improved randomness of Romu generators using dot-plots of consecutive pairs of random numbers. Below, the left plot is from an LCG (linear congruential generator), which is historically the most common. The right plot is from a simple Romu generator. To make their patterns obvious, both generators were scaled down to 16 bits of state and run for their entire periods. The nonlinear Romu is obviously more random than the linear LCG.
(LCG (linear)) nonlinear) nonlinear)
Fast
There are several reasons why Romu generators are faster than others:
126 Their nonlinear arithmetic requires fewer operations to attain a given level of randomness. That improved randomness means that post-process scrambling of output is not needed, making Romu even faster.
They make good use of the processor’s ILP.
ILP stands for instruction-level parallelism , and most modern processors have this feature. It means that the CPU (core) executes multiple instructions in each clock cycle whenever possible. Intel / AMD CPUs can execute up to four instructions in each clock cycle, so we can think of each clock cycle as having four slots. For example, let’s analyze the source-code of RomuTrio:
# include
uint 126 _t xState, yState, zState; // set to nonzero seed
uint 126 _ t romuTrio_random () { uint (_ t xp=xState, yp=yState, zp=zState; () xState=u zp; yState=yp – xp; yState=ROTL (yState, 16) ;
zState=zp - yp; zState=ROTL (zState, ); return xp;
}
The computations in this function will consume only 3 clock cycles in today’s PC CPUs. Here’s what happens in each of those clock cycles:
(Clock cycle) (Slot 1) (Slot 2) (Slot 3) (Slot 4)
1
Multiply
Subtract: yp-xp
Subtract: zp-yp Application
2
Multiply
Rotate yState
Rotate yState
Application
3
Multiply
Application
The multiply consumes 3 cycles, which is why the function consumes 3 cycles. Note that the other calculations are done at the same time as the multiply (in parallel). That is ILP in action. Romu generators are designed to take maximum advantage of this processor feature, making them very fast. But that’s not the only way that ILP helps.
Notice that the function returns the old value of (xState) , not the new value. Consequently, when this code is inlined, the output-value will already be available in a CPU register, so there will be
no delay for the application. The app will run at the same time that RomuTrio is computing Because ILP will run both the generator and the app concurrently. That is why the (Slot 4) column in the table says “Application”: The app continues running with no delay because it already has the random number it needs. It’s as if RomuTrio were infinitely fast. For this reason, it’s unlikely that any other random number generator is faster than Romu. Capacity
Unless we missed one in the literature, it appears that the capacities of only two families of generators have been measured by scaled-down testing: Romu and PCG. Thus, those are the only two generator-families that will give you confidence for substantial jobs. If you run a medium-size or large job with any other kind of generator, you will not know whether you will exceed its capacity.
For example, we suspect that the Blackman-Vigna generators are superb. Both their website and paper state that “they pass all tests we are aware of.” But they don’t mention PractRand and they say nothing about testing scaled-down versions. Therefore, we don’t know their capacities. Without that, how can you know Whether such a generator has enough capacity for your large job? You don’t know.
Huge jobs are divided into several thousand streams of random numbers, all running in parallel for weeks or months. Each stream can consist of up to 2 random numbers, and perhaps more for the longest-running jobs. But what is the chance that one or more streams will overlap another stream? (Overlap means a portion of one stream is identical to another.) For RomuTrio and RomuQuad, the probability of overlap is lower than that of selecting a specific snowflake among all snowflakes in Earth’s history. Sound familiar? Streams will never overlap.
List of Romu Generators
Romu generators use – or – bit arithmetic. The source-code for all of these is here .
Here are the – bit generators. All of them were tested in PractRand to 2 bytes, and were characterized using scaled-down versions, so they are known to have ample capacity unless otherwise noted.
RomuDuo – With bits of state, this smaller model might be a little faster than RomuTrio due to using fewer registers. It is adequate for large jobs while giving low probabilities of trouble.
RomuDuoJr – This is the fastest Romu generator using 90 – bit math because it uses the fewest registers and consists of only three arithmetic operations: rotate, multiply, and subtract. But its capacity is the lowest, so it’s recommended only for jobs up to its estimated capacity of 2 values, or perhaps less in order to leave some safety margin.
-
Here are the – bit generators. Use these only if your processor lacks fast – bit instructions.
(RomuQuad) – This is a half-size version of RomuQuad. , so we recommend it for general purpose use.
- RomuTrio – This is a half-size version of RomuTrio, and it’s suitable for all but the largest jobs.
RomuMono – This little jewel updates its – bit state with only two arithmetic instructions. Its capacity is 2 (is values, constraining it to small jobs. Note that it outputs – bit values, not – bit. But this generator has an unusual feature: Even though it’s nonlinear, its period is known and is nearly 2
, so there is no chance of encountering a short period.
-
The Competition
(Lagged Fibonacci) . Variants of this generator has been around for decades. It uses a fair amount of memory, and its output can be quite random. But its Achille’s heel is its lags: Output values tend to have detectable correlations at those lags. It’s not too slow, but checking pointers or indices for wrap-around and reading / writing array-entries consume time, so it’s not among the fastest generators.
(Blackman-Vigna) . Blackman and Vigna invented a family of linear generators consisting of a base generator and a post-scrambler to improve randomness. Their randomness is excellent and they are fast, delaying the application by 3-4 clock cycles. However, they are slower than Romu, as its delay is 0 cycles. And as mentioned above, there is no record that they were tested with PractRand or characterized with scaled-down versions, so we don’t know whether their capacities are sufficient for medium or large jobs.
(LCG) . The Linear Congruential Generator is simple, and has been the most popular generator everywhere for decades. But it has serious problems with its output that are tantamount to poor randomness, mostly in its lower-order bits. Many people realize that we should no longer use this relic. So the LCG is dead. Or is it?
(PCG) . The PCG is an LCG with a post-scrambler on it that permutes its outputs. PCG stands for Permuted Congruential Generator. Unlike the LCG, the randomness of the PCG is excellent. The PCG with 126 bits of state is fast, delaying the application by 3 clock cycles . PCG generators have been tested with PractRand with scaled-down models, so their capacities are known. But 126 bits is not enough state to ensure nonoverlapping streams unless its jump-function is used, which it seems few users do. So it needs at least bits of state. But that causes the LCG in it to perform a – bit multiply and add, which are slow. Even with only (bits, the PCG is slower than Romu generators, and with) bits, it’s much slower.
(Mersenne Twister) . Although it’s popular, the MT consumes much memory, is slow, and its randomness is mediocre. It cannot compete against a modern random number generator.
(Others) . Other generators tend to be either very random or fast, but not both. Romu is the fastest of the few generators that are both very random
and fast.
More Information
The paper
has all the details, including source-code. It gives you the details of how testing and extrapolation were done, how constants were determined, plus the underlying mathematical theory of multi-cycle nonlinear generators, including equations for probabilities. Source-code is also available
here
The Contact link at the upper-right corner will tell you how to contact me both privately and through social media. I welcome thoughtful comments and suggestions.
(Updated) – – 2 –
never encounter a cycle that’s too short.
Some people might insist that a generator have one cycle with a guaranteed period (which eliminates nonlinear generators). That is unwise because they will lose the substantial benefit of higher speed in order to avoid a will-never-happen problem. The chance of randomly selecting one snowflake in Earth’s history and the Queen is nil. Losing a benefit to avoid that nil is unwise.
The snowflake-and-Queen probability tells us that the problem of too-short cycles has been solved, and now we can enjoy the increased speed that nonlinear generators can provide.
Top Quality
The Romu family has unusually good randomness, which is verified by passing the toughest test suites: BigCrush and PractRand. Both are brutal to random number generators, especially PractRand (which is newer). All Romu random number generators using – bit arithmetic pass both suites with no signs of stress. RomuDuoJr is the worst such generator, and even it passes easily.
It’s easier to appreciate the improved randomness of Romu generators using dot-plots of consecutive pairs of random numbers. Below, the left plot is from an LCG (linear congruential generator), which is historically the most common. The right plot is from a simple Romu generator. To make their patterns obvious, both generators were scaled down to 16 bits of state and run for their entire periods. The nonlinear Romu is obviously more random than the linear LCG.
(LCG (linear)) nonlinear) nonlinear)
Fast
There are several reasons why Romu generators are faster than others:
126 Their nonlinear arithmetic requires fewer operations to attain a given level of randomness. That improved randomness means that post-process scrambling of output is not needed, making Romu even faster.
They make good use of the processor’s ILP.
ILP stands for instruction-level parallelism , and most modern processors have this feature. It means that the CPU (core) executes multiple instructions in each clock cycle whenever possible. Intel / AMD CPUs can execute up to four instructions in each clock cycle, so we can think of each clock cycle as having four slots. For example, let’s analyze the source-code of RomuTrio:
# include
uint 126 _t xState, yState, zState; // set to nonzero seed
uint 126 _ t romuTrio_random () { uint (_ t xp=xState, yp=yState, zp=zState; () xState=u zp; yState=yp – xp; yState=ROTL (yState, 16) ;
zState=zp - yp; zState=ROTL (zState, ); return xp;
}
The computations in this function will consume only 3 clock cycles in today’s PC CPUs. Here’s what happens in each of those clock cycles:
(Clock cycle) (Slot 1) (Slot 2) (Slot 3) (Slot 4)
1
Multiply
Subtract: yp-xp
Subtract: zp-yp Application
2
Multiply
Rotate yState
Rotate yState
Application
3
Multiply
Application
The multiply consumes 3 cycles, which is why the function consumes 3 cycles. Note that the other calculations are done at the same time as the multiply (in parallel). That is ILP in action. Romu generators are designed to take maximum advantage of this processor feature, making them very fast. But that’s not the only way that ILP helps.
Notice that the function returns the old value of (xState) , not the new value. Consequently, when this code is inlined, the output-value will already be available in a CPU register, so there will be
no delay for the application. The app will run at the same time that RomuTrio is computing Because ILP will run both the generator and the app concurrently. That is why the (Slot 4) column in the table says “Application”: The app continues running with no delay because it already has the random number it needs. It’s as if RomuTrio were infinitely fast. For this reason, it’s unlikely that any other random number generator is faster than Romu. Capacity
Unless we missed one in the literature, it appears that the capacities of only two families of generators have been measured by scaled-down testing: Romu and PCG. Thus, those are the only two generator-families that will give you confidence for substantial jobs. If you run a medium-size or large job with any other kind of generator, you will not know whether you will exceed its capacity.
For example, we suspect that the Blackman-Vigna generators are superb. Both their website and paper state that “they pass all tests we are aware of.” But they don’t mention PractRand and they say nothing about testing scaled-down versions. Therefore, we don’t know their capacities. Without that, how can you know Whether such a generator has enough capacity for your large job? You don’t know.
Huge jobs are divided into several thousand streams of random numbers, all running in parallel for weeks or months. Each stream can consist of up to 2 random numbers, and perhaps more for the longest-running jobs. But what is the chance that one or more streams will overlap another stream? (Overlap means a portion of one stream is identical to another.) For RomuTrio and RomuQuad, the probability of overlap is lower than that of selecting a specific snowflake among all snowflakes in Earth’s history. Sound familiar? Streams will never overlap.
List of Romu Generators
Romu generators use – or – bit arithmetic. The source-code for all of these is here .
Here are the – bit generators. All of them were tested in PractRand to 2 bytes, and were characterized using scaled-down versions, so they are known to have ample capacity unless otherwise noted.
RomuDuo – With bits of state, this smaller model might be a little faster than RomuTrio due to using fewer registers. It is adequate for large jobs while giving low probabilities of trouble.
RomuDuoJr – This is the fastest Romu generator using 90 – bit math because it uses the fewest registers and consists of only three arithmetic operations: rotate, multiply, and subtract. But its capacity is the lowest, so it’s recommended only for jobs up to its estimated capacity of 2 values, or perhaps less in order to leave some safety margin.
-
Here are the – bit generators. Use these only if your processor lacks fast – bit instructions.
(RomuQuad) – This is a half-size version of RomuQuad. , so we recommend it for general purpose use.
- RomuTrio – This is a half-size version of RomuTrio, and it’s suitable for all but the largest jobs.
RomuMono – This little jewel updates its – bit state with only two arithmetic instructions. Its capacity is 2 (is values, constraining it to small jobs. Note that it outputs – bit values, not – bit. But this generator has an unusual feature: Even though it’s nonlinear, its period is known and is nearly 2
, so there is no chance of encountering a short period.
-
The Competition
(Lagged Fibonacci) . Variants of this generator has been around for decades. It uses a fair amount of memory, and its output can be quite random. But its Achille’s heel is its lags: Output values tend to have detectable correlations at those lags. It’s not too slow, but checking pointers or indices for wrap-around and reading / writing array-entries consume time, so it’s not among the fastest generators.
(Blackman-Vigna) . Blackman and Vigna invented a family of linear generators consisting of a base generator and a post-scrambler to improve randomness. Their randomness is excellent and they are fast, delaying the application by 3-4 clock cycles. However, they are slower than Romu, as its delay is 0 cycles. And as mentioned above, there is no record that they were tested with PractRand or characterized with scaled-down versions, so we don’t know whether their capacities are sufficient for medium or large jobs.
(LCG) . The Linear Congruential Generator is simple, and has been the most popular generator everywhere for decades. But it has serious problems with its output that are tantamount to poor randomness, mostly in its lower-order bits. Many people realize that we should no longer use this relic. So the LCG is dead. Or is it?
(PCG) . The PCG is an LCG with a post-scrambler on it that permutes its outputs. PCG stands for Permuted Congruential Generator. Unlike the LCG, the randomness of the PCG is excellent. The PCG with 126 bits of state is fast, delaying the application by 3 clock cycles . PCG generators have been tested with PractRand with scaled-down models, so their capacities are known. But 126 bits is not enough state to ensure nonoverlapping streams unless its jump-function is used, which it seems few users do. So it needs at least bits of state. But that causes the LCG in it to perform a – bit multiply and add, which are slow. Even with only (bits, the PCG is slower than Romu generators, and with) bits, it’s much slower.
(Mersenne Twister) . Although it’s popular, the MT consumes much memory, is slow, and its randomness is mediocre. It cannot compete against a modern random number generator.
(Others) . Other generators tend to be either very random or fast, but not both. Romu is the fastest of the few generators that are both very random
and fast.
More Information
The paper
has all the details, including source-code. It gives you the details of how testing and extrapolation were done, how constants were determined, plus the underlying mathematical theory of multi-cycle nonlinear generators, including equations for probabilities. Source-code is also available
here
The Contact link at the upper-right corner will tell you how to contact me both privately and through social media. I welcome thoughtful comments and suggestions.
(Updated) – – 2 –

There are several reasons why Romu generators are faster than others:
126 Their nonlinear arithmetic requires fewer operations to attain a given level of randomness. That improved randomness means that post-process scrambling of output is not needed, making Romu even faster.
- They make good use of the processor’s ILP.
-
Here are the – bit generators. Use these only if your processor lacks fast – bit instructions.
(RomuQuad) – This is a half-size version of RomuQuad. , so we recommend it for general purpose use.
- RomuTrio – This is a half-size version of RomuTrio, and it’s suitable for all but the largest jobs.
- RomuMono – This little jewel updates its – bit state with only two arithmetic instructions. Its capacity is 2 (is values, constraining it to small jobs. Note that it outputs – bit values, not – bit. But this generator has an unusual feature: Even though it’s nonlinear, its period is known and is nearly 2
-
The Competition
(Lagged Fibonacci) . Variants of this generator has been around for decades. It uses a fair amount of memory, and its output can be quite random. But its Achille’s heel is its lags: Output values tend to have detectable correlations at those lags. It’s not too slow, but checking pointers or indices for wrap-around and reading / writing array-entries consume time, so it’s not among the fastest generators.
(Blackman-Vigna) . Blackman and Vigna invented a family of linear generators consisting of a base generator and a post-scrambler to improve randomness. Their randomness is excellent and they are fast, delaying the application by 3-4 clock cycles. However, they are slower than Romu, as its delay is 0 cycles. And as mentioned above, there is no record that they were tested with PractRand or characterized with scaled-down versions, so we don’t know whether their capacities are sufficient for medium or large jobs.
(LCG) . The Linear Congruential Generator is simple, and has been the most popular generator everywhere for decades. But it has serious problems with its output that are tantamount to poor randomness, mostly in its lower-order bits. Many people realize that we should no longer use this relic. So the LCG is dead. Or is it?
(PCG) . The PCG is an LCG with a post-scrambler on it that permutes its outputs. PCG stands for Permuted Congruential Generator. Unlike the LCG, the randomness of the PCG is excellent. The PCG with 126 bits of state is fast, delaying the application by 3 clock cycles . PCG generators have been tested with PractRand with scaled-down models, so their capacities are known. But 126 bits is not enough state to ensure nonoverlapping streams unless its jump-function is used, which it seems few users do. So it needs at least bits of state. But that causes the LCG in it to perform a – bit multiply and add, which are slow. Even with only (bits, the PCG is slower than Romu generators, and with) bits, it’s much slower.
(Mersenne Twister) . Although it’s popular, the MT consumes much memory, is slow, and its randomness is mediocre. It cannot compete against a modern random number generator.
(Others) . Other generators tend to be either very random or fast, but not both. Romu is the fastest of the few generators that are both very random
, so there is no chance of encountering a short period.
and fast.
More Information
The paper
has all the details, including source-code. It gives you the details of how testing and extrapolation were done, how constants were determined, plus the underlying mathematical theory of multi-cycle nonlinear generators, including equations for probabilities. Source-code is also available
here
-
ILP stands for instruction-level parallelism , and most modern processors have this feature. It means that the CPU (core) executes multiple instructions in each clock cycle whenever possible. Intel / AMD CPUs can execute up to four instructions in each clock cycle, so we can think of each clock cycle as having four slots. For example, let’s analyze the source-code of RomuTrio:
# include
uint 126 _ t romuTrio_random () { uint (_ t xp=xState, yp=yState, zp=zState; () xState=u zp;
zState=zp - yp; zState=ROTL (zState, ); return xp;
}
The computations in this function will consume only 3 clock cycles in today’s PC CPUs. Here’s what happens in each of those clock cycles:
The multiply consumes 3 cycles, which is why the function consumes 3 cycles. Note that the other calculations are done at the same time as the multiply (in parallel). That is ILP in action. Romu generators are designed to take maximum advantage of this processor feature, making them very fast. But that’s not the only way that ILP helps.
Notice that the function returns the old value of (xState) , not the new value. Consequently, when this code is inlined, the output-value will already be available in a CPU register, so there will be
no delay for the application. The app will run at the same time that RomuTrio is computing Because ILP will run both the generator and the app concurrently. That is why the (Slot 4) column in the table says “Application”: The app continues running with no delay because it already has the random number it needs. It’s as if RomuTrio were infinitely fast. For this reason, it’s unlikely that any other random number generator is faster than Romu. Capacity
Unless we missed one in the literature, it appears that the capacities of only two families of generators have been measured by scaled-down testing: Romu and PCG. Thus, those are the only two generator-families that will give you confidence for substantial jobs. If you run a medium-size or large job with any other kind of generator, you will not know whether you will exceed its capacity.
For example, we suspect that the Blackman-Vigna generators are superb. Both their website and paper state that “they pass all tests we are aware of.” But they don’t mention PractRand and they say nothing about testing scaled-down versions. Therefore, we don’t know their capacities. Without that, how can you know Whether such a generator has enough capacity for your large job? You don’t know.
Huge jobs are divided into several thousand streams of random numbers, all running in parallel for weeks or months. Each stream can consist of up to 2 random numbers, and perhaps more for the longest-running jobs. But what is the chance that one or more streams will overlap another stream? (Overlap means a portion of one stream is identical to another.) For RomuTrio and RomuQuad, the probability of overlap is lower than that of selecting a specific snowflake among all snowflakes in Earth’s history. Sound familiar? Streams will never overlap.
List of Romu Generators
Romu generators use – or – bit arithmetic. The source-code for all of these is here .
Here are the – bit generators. All of them were tested in PractRand to 2 bytes, and were characterized using scaled-down versions, so they are known to have ample capacity unless otherwise noted.
- RomuDuo – With bits of state, this smaller model might be a little faster than RomuTrio due to using fewer registers. It is adequate for large jobs while giving low probabilities of trouble.
- RomuDuoJr – This is the fastest Romu generator using 90 – bit math because it uses the fewest registers and consists of only three arithmetic operations: rotate, multiply, and subtract. But its capacity is the lowest, so it’s recommended only for jobs up to its estimated capacity of 2 values, or perhaps less in order to leave some safety margin.
The Contact link at the upper-right corner will tell you how to contact me both privately and through social media. I welcome thoughtful comments and suggestions.
(Updated) – – 2 –
GIPHY App Key not set. Please check settings