Efficiently generating random numbers without bias

Introduction
Poor algorithms and sources of bias
1. Generating integers in a range
2. Sources of bias
  1. Modulus bias
  2. Range bias
Improved algorithms

Introduction

This article describes typical ways of generating uniform random numbers, identifies common flaws, and provides improved alternatives. It does not attempt to describe the underlying algorithms for generating random bits (e.g. Mersenne Twister, etc.) but rather ways to use those underlying algorithms in practical programs. Nonetheless, fast and high-quality underlying algorithms are crucial, so I'll recommend looking at the modern algorithms developed by George Marsaglia, particularly the KISS variants and the Xorshift variants. The code given here will assume the existence of the following function:

uint NextUInt32() { /* generate a random 32-bit unsigned integer */ }

Poor algorithms and sources of bias

Here are some of the typical algorithms in widespread use and the sources of bias in those algorithms.

Generating integers in a range

There are two common functions for generating numbers in a range, with signatures like the following:

// Returns a random number from 0 to exclusiveMax-1. exclusiveMax must be > 0.
uint Next(uint exclusiveMax);

// Returns a random number from min to max (inclusive). min must be less than or equal to max.
uint Next(uint min, uint max);

We will focus on the former function, since the latter can be implemented in terms of the former as follows:

uint Next(uint min, uint max)
{
  if(min > max) throw new ArgumentException("The minimum must be less than or equal to the maximum.");
  uint diff = max - min + 1;
  return diff != 0 ? Next(diff) + min : NextUInt32(); // if diff == 0 then it's the full 32-bit range
}

Similarly, we will focus on unsigned integers since signed integers can be implemented in terms of unsigned:

int Next(int exclusiveMax)
{
  if(exclusiveMax <= 0) throw new ArgumentOutOfRangeException();
  return (int)Next((uint)exclusiveMax);
}

int Next(int min, int max)
{
  if(min > max) throw new ArgumentException("The minimum must be less than or equal to the maximum.");
  uint diff = (uint)(max - min + 1);
  return diff != 0 ? (int)Next(diff) + min : (int)NextUInt32(); // if diff == 0 then it's the full 32-bit range
}

A typical poor implementation of Next(uint) looks like this:

uint Next(uint exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  return NextUInt32() % exclusiveMax;
}

Sources of bias

There are two main sources of bias in code like the above: modulus bias and range bias.

Modulus bias

The code above is a prime example of modulus bias. The modulus operation maps values from the range of NextUInt32 (0 to 2³²–1, i.e. 2³² different values) onto the output range (0 to exclusiveMax–1, i.e. exclusiveMax different values). If the range of NextUInt32 is not evenly divisible by exclusiveMax, then some values will be more likely to be returned than others. (Imagine the underlying generator returned values from 0 to 9, and exclusiveMax is 4. Then 0, 4, and 8, taken modulo 4, map to 0; 1, 5, and 9 map to 1; 2 and 6 map to 2; and 3 and 7 map to 3. Out of the 10 possibilities, three map onto each of 0 and 1, and only two map onto each of 2 and 3. Thus, 0 or 1 are 50% more likely to be chosen than 2 or 3.) Although the function may be no less biased than other, "better" methods of generating integers, the distribution of the bias is important. The simple modulus approach concentrates the bias in the lower portion of the output range. This is important since a very common use of the random numbers produced by such a function is to compare them against a threshold. For example:

if(Next(100) < 30) { /* do this 30% of the time */ }
if(Next(5) == 0) { /* do this 20% of the time */ }

The concentration of bias in the lower numbers makes these kinds of comparisons particularly biased. The bias may be small in this case, but it exists.

Range bias

Range bias is less commonly discussed than modulus bias, but it plagues nearly all random number generators. To "eliminate bias", people generally suggest replacing code like the Next method above with something like this:

uint Next(uint exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  return (uint)(NextDouble()*exclusiveMax); // look ma, no bias! or is there?
}

But in a sense this is no less biased than the modulus approach if NextDouble is implemented as follows (as it all too often is):

// Returns a random floating point number in the range [0,1).
double NextDouble()
{
  return NextUInt32() * (1.0 / 4294967296); // divide by 2^32
}

In both cases, 2^32 values are being mapped directly onto exclusiveMax output values. The only difference is the distribution of the bias. While the modulus approach concentrates the bias in the low numbers, the "multiply by a floating point number" spreads the bias out. This is an improvement, since threshold comparisons will be much less affected, but for some other uses of random numbers it's no improvement at all.

In fact, the bias of the floating-point method is often worse than that, even in widely used software. The implementation of System.Random.NextDouble in the Microsoft .NET framework does this:

double NextDouble()
{
  return NextInt() * (1.0 / int.MaxValue); // divide by 2^31-1
}

We can do much better when it comes to generating floating-point numbers.

Another example of range bias occurs when the output range is larger than the input range. Consider the following 64-bit integer generator.

ulong Next(ulong exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  return (ulong)(NextDouble()*exclusiveMax);
}

Even if NextDouble is implemented properly, the double will have at most 52 bits of randomness, whereas exclusiveMax is a 64-bit value. If exclusiveMax is larger than 2⁵², some values can never be generated. In that case, you might be forced to choose: should you use the modulus approach with a 64-bit generator and suffer the modulus bias, or use the floating-point approach to eliminate modulus bias and suffer range bias that's about 2^64–52 = 4096 times worse than the alternative? Perhaps we don't have to choose.

Improved algorithms

Here are some ways to improve upon the typical algorithms to eliminate modulus bias, reduce and bound range bias, and finally eliminate range bias, plus some other tips as well.

Smearing the bias

An example of how to smear the bias across the output space (and thus avoid modulus bias) was given above — multiply by a random floating point number in [0,1) — and it is the standard approach. While a good implementation of NextDouble (i.e. one that has the full 52 bits of randomness rather than a mere 31 or 32) will provide acceptably low bias — about one part in a million — when used to generate 32-bit integers, it's slow. There's just no good reason to use floating-point math when you can use much faster fixed-point math instead.

Fixed-point math is where you treat part of an integer as a whole number and the other part as a fractional number. For example, a 32-bit unsigned integer is normally considered to hold integer values up to 2³² (exclusive), but if you split it into a 20-bit whole number portion and a 12-bit fractional portion, say, it can instead store values up to 2²⁰ = 1048576 (exclusive) with each whole number being divided into 2¹² = 4096 fractional parts. (We'll call this a 20:12 fixed-point scheme.) For example, to compute 2.25 + 3.125 (i.e. 2 1/4 + 3 1/8) and 2.25 * 3.125 with such a fixed-point scheme, you could do:

uint a = (2<<12) | (4096/4); // 2 and 1/4th. note that 4096 == 2^12
uint b = (3<<12) | (4096/8); // 3 and 1/8th
uint c = a + b; // c == 5.375. addition of fixed-point numbers proceeds normally
uint wholePart = c >> 12; // wholePart == 5
uint fraction = c & 4095; // fraction == 1536 and 1536/4096 == 0.375
// multiplication is similar
c = (a * b) >> 12; // when multiplying, you have to shift the result back into place
wholePart = c >> 12; // wholePart == 7
fraction = c & 4095; // fraction == 128 and 128/4096 == 0.03125. 2.25 * 3.125 == 7.03125, so it's correct

This is not a very practical example, but it demonstrates one way to store non-integer values in integer variables. For our purposes, we want to duplicate the approach of multiplying a fractional value from [0,1) by exclusiveMax. We already have NextUInt32 which returns a number from 0 to 2³²–1. If interpreted as a 64-bit fixed-point number with a 32-bit whole portion and a 32-bit fractional portion, then NextUInt32 produces a value that fits snugly in the fractional portion, resulting in a number from [0,1). We can multiply that number by exclusiveMax. Multiplying a fixed-point number by a "regular" number doesn't need any shift, so (ulong)NextUInt32()*exclusiveMax is exactly equivalent to NextDouble()*exclusiveMax. Then we just need to truncate the result back to an integer. We just extract the whole portion as above, by right-shifting. Putting it all together, we get the following code. Note that this example makes no attempt to reduce the amount of bias, only to smear it. Reducing the bias is discussed in the next section.

uint Next(uint exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  // NextUInt32 is a fractional number in [0,1) when interpreted as 32:32 fixed-point. multiply it by exclusiveMax and
  // truncate the fixed-point result to an integer by right-shifting
  return (uint)(((ulong)exclusiveMax * NextUInt32()) >> 32); // this is about 5x as fast as floating-point
}

For generating 64-bit numbers, we can use a similar approach, but it's much more complicated because we'd want an integer type larger than 64 bits to hold the fixed-point result.

ulong Next(ulong exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  // NextUInt64 is a fractional number in [0,1) when interpreted as 64:64 fixed-point, but...
  return (ulong)(((ulong128)exclusiveMax * NextUInt64()) >> 64); // ulong128 doesn't exist! what can we do?
}

By the way, NextUInt64 can be implemented as follows (although some underlying generators natively produce 64-bit values):

ulong NextUInt64()
{
  // it's slightly faster to shift and OR than to write the individual halves with 'unsafe' code, at least in C#
  return ((ulong)NextUInt32() << 32) | NextUInt32(); 
}

Now to actually perform that 128-bit multiplication. Some languages (like many implementations of C++) provide access to hardware instructions for natively multiplying 64-bit integers and obtaining the 128-bit result. C# does not, and doing that in C++ is not portable anyway, but we can emulate it. To multiply two arbitrary-precision integers a and b, we first split them into 32-bit chunks a₀ (containing the low 32 bits of a), a₁ (containing the next 32 bits), etc. and the same for b. Then a = (a₀*s⁰ + a₁*s¹ + ... + a_N*s^N) where s is two to the power of the word size, or 2³² in our case. The result of the multiplication needs a number of bits equal to the sum of the numbers of bits needed to represent a and b.

In the specific case of multiplying two 64-bit numbers, we have a₀, a₁, b₀, and b₁, and the result could be up to 128 bits, which we'll place in the result values r₀ through r₃. We have r = a*b = (a₀*s⁰ + a₁*s¹) * (b₀*s⁰ + b₁*s¹) = (a₀*s⁰ * b₀*s⁰) + (a₀*s⁰ * b₁*s¹) + (a₁*s¹ * b₀*s⁰) + (a₁*s¹ * b₁*s¹) = (a₀ * b₀) * s⁰ + (a₀ * b₁) * s¹ + (a₁ * b₀) * s¹ + (a₁ * b₁) * s². Essentially, each part of a is multiplied by each part of b and the results are combined. The result of each multiplication of parts of a and b is a 32-bit multiplication producing a 64-bit product (call it t₀ and t₁). Now, since s represents our word size, and the values are split into words, multiplying by s simply shifts a value higher in the result. That is to say, (a₀ * b₀) * s⁰ produces t₀ and t₁ which get added into the result at r₀ and r₁ respectively, whereas (a₀ * b₁) * s¹ produces t₀ and t₁ which get added into the result at r₁ and r₂. (The r subscripts are increased by one because the result is multiplied by s¹.) The only complication is carry between parts of r. Putting it all together into code looks something like this:

ulong t = (ulong)a0*b0; // multiply a0
r0 = (uint)t;
t = (t >> 32) + (ulong)a0*b1;
r1 = (uint)t;
r2 = (uint)(t >> 32);

t = r1 + (ulong)a1*b0; // multiply a1
r1 = (uint)t;
t = r2 + (t >> 32) + (ulong)a1*b1;
r2 = (uint)t;
r3 = (uint)(t >> 32);

(For truly arbitrary precision, you could use arrays and put code like this in a loop, although there are fancier algorithms.) Now in our case, r is a fixed-point value that we're going to right-shift by 64, which effectively means throwing away r0 and r1. So we can eliminate those variables and simplify it to this:

ulong t = (ulong)a0*b0;
t = (t >> 32) + (ulong)a0*b1;
r2 = (uint)(t >> 32);

t = (uint)t + (ulong)a1*b0;
t = r2 + (t >> 32) + (ulong)a1*b1;
r2 = (uint)t;
r3 = (uint)(t >> 32);

Furthermore, what we're going to do with r2 and r3 is just return them as a 64-bit number. That means there's no point in unpacking the 64-bit t into r2 and r3 when we could just return the 64-bit t directly. So we can eliminate r2 and r3 as well, and the final function looks like this:

ulong Next(ulong exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  uint a0 = NextUInt32(), a1 = NextUInt32(), b0 = (uint)exclusiveMax, b1 = (uint)(exclusiveMax >> 32);
  ulong t = ((ulong)a0*b0 >> 32) + (ulong)a0*b1;
  return (t >> 32) + (((uint)t + (ulong)a1*b0) >> 32) + (ulong)a1*b1;
}

Even this complicated code is twice as fast as using NextDouble, while providing 64 bits of randomness rather than at 52 (at most) from the double. But as I mentioned before, this method still has too much range bias to be acceptable, and reducing the bias will be addressed in the next section.

Bounding the bias

So far, we've seen how to avoid modulus bias while continuing to use fast integer math by means of fixed-point calculations. But those methods still have significant range bias. There are various ways to compute bias, but a simple estimation is to take the range of the underlying generator, r and divide by the range of the output, n: w = r / n. Then take bias = max(abs(floor(w) / w – 1), ceiling(w) / w – 1). This is based on the intuition that the average/ideal weight given to each result is r / n = w, but since our weights can only be integers, some values will have a weight of floor(w) and others will have a weight of ceiling(w). Dividing those by w and subtracting 1 gives the proportional difference of the actual weights from the ideal weights, and taking the one with the largest magnitude gives the maximum single-point bias of the function. Another estimation is bias = ceiling(w) / floor(w) – 1, which describes how much more likely the most likely values are compared to the least likely values. In all cases, the bias is inversely proportional to w, so to reduce bias we must increase w. Since w = r / n and we can't change n, we have to increase r. In short, to reduce the bias we must map a wider range of values onto the output (assuming r isn't divisible by n, in which case there's no bias).

It's useful for a random number generator to have a guaranteed upper bound on bias. For example, let's say we want our generator to have a bias of at most one in a million. We could simply ensure that r / n is at least 2²⁰ (which is about 1.05 million), or in other words that we generate at least 20 more random bits than the number of bits needed to represent n. This leads to the idea of an adaptive generator function that uses more random bits as n gets larger. Let's return to our previous 32-bit generator function, which used fixed-point math to smear the bias across the output space.

uint Next(uint exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  return (uint)(((ulong)exclusiveMax * NextUInt32()) >> 32);
}

The problem with this function is that it generates 32 random bits and the output space can also be up to 32 bits. This can lead to significant bias as exclusiveMax gets large. So lets say we want to apply our bias guarantee. That means the 32 random bits can only be used for values of exclusiveMax up to 32 – 20 = 12 bits. Then we need a strategy for larger values. For example, our fixed-point approach can be used for products up to 64 bits, i.e. up to the point where the number of random bits plus the bit size of exclusiveMax equals 64. Solving 2x + 20 = 64, we get x = 22. So for values of exclusiveMax up to 2²², we can generate 42 random bits and use nearly the same fixed-point formula.

But then what do we do with the case of even larger values of exclusiveMax? Remember our strategy for computing the 128-bit product above. In the case of 32-bit values of exclusiveMax, we'd need 52 random bits, resulting in an 84-bit product. Let's start with the same 128-bit product code from above:

ulong t = (ulong)a0*b0;
r0 = (uint)t;
t = (t >> 32) + (ulong)a0*b1;
r1 = (uint)t;
r2 = (uint)(t >> 32);

t = r1 + (ulong)a1*b0;
r1 = (uint)t;
t = r2 + (t >> 32) + (ulong)a1*b1;
r2 = (uint)t;
r3 = (uint)(t >> 32);

But in this case, we know that b1 = 0 (because b is exclusiveMax, which fits in a single 32-bit word). So we can simplify the code with that assumption.

ulong t = (ulong)a0*b0;
r0 = (uint)t;
r1 = (uint)(t >> 32);

t = r1 + (ulong)a1*b0;
r1 = (uint)t;
r2 = (uint)(t >> 32);

And, since we're generating 52 random bits, we're going to right-shift the product by 52, which means throwing away all of r0 and 20 bits of r1. Also as before, we don't need to unpack t into r1 and r2 at the end, since we're just going to return it (albeit shifted 20 bits). So, we can simplify it into a single expression:

(((ulong)a0*b0 >> 32) + (ulong)a1*b0) >> 20

You'd plug in NextUInt32() for a0, NextBits(20) for a1, and exclusiveMax for b0. That said, with most underlying generators, it's faster to consume 64 random bits and compute a 96-bit product than to try to consume only 52 random bits (which requires saving the extra bits in a bit buffer). So in most cases you'll want to use the code below and use NextUInt32() for a1 as well, but if your underlying generator is slow, the code above may be better.

(((ulong)a0*b0 >> 32) + (ulong)a1*b0) >> 32

The resulting function is:

uint Next(uint exclusiveMax)
{
  if(exclusiveMax <= 1u<<12) // if the range is no more than 2^12 and we need up to 32 random bits...
  {
    if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
    // NextUInt32 has the range [0,1) considered as 32:32 fixed-point
    return (uint)(((ulong)NextUInt32() * exclusiveMax) >> 32);
  }
  else if(exclusiveMax <= 1u<<22) // otherwise, if we need up to 42 random bits
  {
    // use a 22:42 fixed-point scheme, whose product still fits in 64 bits
    return (uint)(((((ulong)NextBits(10)<<32) | NextUInt32()) * exclusiveMax) >> 42);
  }
  else // otherwise, we need up to 52 random bits
  {
    // we could use NextDouble() * exclusiveMax since it provides 52 bits of randomness, but fixed-point math is
    // still 30-100% faster. compute the multiplication of 64 random bits and the 32-bit range, producing a 96-bit
    // fixed-point result. we don't compute the bottom 32 bits because we'd right-shift it away anyway
    return (uint)((((ulong)NextUInt32()*exclusiveMax >> 32) + (ulong)NextUInt32()*exclusiveMax) >> 32);
  }
}

So there we have an adaptive generation function that handles ranges up to 2¹² using 32 random bits, up to 2²² using 42 random bits, and up to 2³² using 64 random bits, in each case ensuring that we use at least 20 more random bits than the size of exclusiveMax and thereby keeping the bias below one part in a million. For simplicity you can drop the first and/or second cases if you like, although it may be a bit slower. In any case, this is not the fastest approach. Later we'll see a method that is both faster than complex fixed-point math and is completely unbiased.

A similar approach can be taken for the 64-bit generator function, with the only difference being that larger fixed-point products must be handled. The code would look something like this:

ulong Next(ulong exclusiveMax)
{
  if(exclusiveMax <= uint.MaxValue) // if the range is less than 2^32...
  {
    return Next((uint)exclusiveMax); // just use the 32-bit method
  }
  else if(exclusiveMax <= 1UL<<38) // otherwise, if we need up to 58 random bits...
  {
    // compute the multiplication of 58 random bits and the 38-bit range, producing a 96-bit fixed-point result.
    // we actually consume 64 random bits because it's faster than going through the bit buffer
    return ((NextUInt32()*exclusiveMax >> 32) + (NextUInt32()&((1u<<26)-1))*exclusiveMax) >> 26;
  }
  else // otherwise, if we need up to 84 random bits...
  {
    // compute the multiplication of 96 random bits and the 64-bit range, producing a 160-bit fixed-point result. this is complicated but
    // still 4x faster than creating a random 84-bit FP107 value and doing the floating-point multiplication
    uint a0 = NextUInt32(), a1 = NextUInt32(), a2 = NextUInt32(), b0 = (uint)exclusiveMax, b1 = (uint)(exclusiveMax >> 32);
    ulong t = (((ulong)a0*b0) >> 32) + (ulong)a0*b1;
    t = (uint)(t >> 32) + (((uint)t + (ulong)a1*b0) >> 32) + (ulong)a1*b1;
    return (uint)(t >> 32) + (((uint)t + (ulong)a2*b0) >> 32) + (ulong)a2*b1;
  }
}

You can also try to optimize the number generators to generate exactly log2(exclusiveMax) bits when exclusiveMax is a power of two, rather than blindly generating an extra 20 bits, because there's no bias in that case. It's often not worth it, but more details will be given below.

By the way, the function NextBits, which consumes exactly the given number of bits from the underlying generator (on average), works by saving the extra bits in a bit buffer to be used next time. It can be implemented like this:

uint bitBuffer;
int bitsInBuffer; // could be a byte but that may be slightly slower

// Returns a random number from 0 to 2^bits-1.
uint NextBits(int bits)
{
  uint value;
  if((uint)bits <= (uint)bitsInBuffer) // if we have enough bits in the buffer...
  {
    value = bitBuffer & ((1u<<bits)-1); // extract them
    bitBuffer >>= bits;
    bitsInBuffer -= bits;
  }
  else if(bits < 32) // otherwise, if we need more bits (but not a full word)...
  {
    bits -= bitsInBuffer; // get as many bits as we can from the buffer
    value = bitBuffer << bits; // leave space for the remaining bits
    bitBuffer = NextUInt32(); // then refill the buffer
    value |= bitBuffer & ((1u<<bits)-1); // and grab the remaining bits from it
    bitBuffer >>= bits;
    bitsInBuffer = 32 - bits;
  }
  else if(bits == 32) // otherwise, if we need a full word, just call NextUInt32()
  {
    value = NextUInt32();
  }
  else // bits < 0 || bits > 32
  {
    throw new ArgumentOutOfRangeException();
  }
  return value;
}

ulong NextBits64(int bits)
{
  if(bits <= 32) return NextBits(bits);
  else return ((ulong)NextBits(bits-32)<<32) | NextUInt32();
}

In practice and roughly speaking, most underlying generators are so fast that the overhead of managing the bit buffer outweighs the time saved by avoiding extra calls to the underlying generator unless you're saving at least half the calls (e.g. unless you're calling NextBits with 16 or less for a 32-bit underlying generator). But the bit buffer still has great utility, especially for generating booleans and numbers in ranges that are powers of two. For example, NextBits(8) is equivalent to Next(256), but faster.

bool NextBoolean()
{
  if(--bitsInBuffer < 0) // if the bit buffer is empty...
  {
    bitsInBuffer = 31; // refill it (minus the one bit we're about to consume)
    bitBuffer = NextUInt32();
  }
  bool result = (bitBuffer & 1) != 0;
  bitBuffer >>= 1;
  return result;
}

So now we have random number generators that are free from modulus bias and also have guaranteed upper bounds on range bias, while being faster and less biased than the standard technique of multiplying by a random floating-point number. In the next section we'll see how we can generate numbers even faster and with no bias at all.

P.S. You may want to special-case powers of two. You can detect powers of two easily; n is a power of two (or zero) if n & (n–1) == 0. Computing log2(exclusiveMax) and related values is another story. If you're using a language that gives access to hardware instructions for counting leading zero bits (CLZ), the base-2 logarithm of a 32-bit power of two is 31 – CLZ(n). But if you must write portable code, the expense of computing the logarithm is usually not worth it. Nonetheless, if you want to attempt it, you can use code like the following.

uint Next(uint exclusiveMax)
{
  if((exclusiveMax & (exclusiveMax-1)) == 0) // if exclusiveMax is a power of two or zero...
  {
    return NextBits(Log2(exclusiveMax)); // Log2 will check for exclusiveMax == 0
  }
  ...
}

// Returns the number of bits needed to represent the value. If the value is zero, zero is returned.
static int BitsNeeded(uint value)
{
  int count = 0;
  if(value > 0xFFFF) { count = 16; value >>= 16; }
  if(value > 0x00FF) { count += 8; value >>= 8; }
  if(value > 0x000F) { count += 4; value >>= 4; }
  if(value > 0x0003) { count += 2; value >>= 2; }
  if(value > 0x0001) { count += 1; value >>= 1; }
  return count + (int)value;
}

// Returns the base-2 logarithm of the value, rounded down.
static int Log2(uint value)
{
  if(value == 0) throw new ArgumentOutOfRangeException();
  return BitsNeeded(value) - 1; // for faster code, BitsNeeded should be inlined
}

static int BitsNeeded(ulong value)
{
  int count = 0;
  uint lo = (uint)(value >> 32);
  if(lo != 0) count = 32;
  else lo = (uint)value;
  if(lo > 0xFFFF) { count += 16; lo >>= 16; }
  if(lo > 0x00FF) { count +=  8; lo >>=  8; }
  if(lo > 0x000F) { count +=  4; lo >>=  4; }
  if(lo > 0x0003) { count +=  2; lo >>=  2; }
  if(lo > 0x0001) { count +=  1; lo >>=  1; }
  return count + (int)lo;
}

static int Log2(ulong value)
{
  if(value == 0) throw new ArgumentOutOfRangeException();
  return BitsNeeded(value) - 1; // for faster code, BitsNeeded should be inlined
}

The following optimization using a look-up table is slightly faster but still rarely fast enough to be worth it.

struct ByteTable
{
  public unsafe fixed byte Entries[256];
}

unsafe static int BitsNeeded(uint value)
{
  int count = 0;
  if(value > 0xFFFF) { count = 16; value >>= 16; }
  if(value > 0x00FF) { count += 8; value >>=  8; }
  fixed(byte* p = bitsTable.Entries) return count + p[value];
}

unsafe static int BitsNeeded(ulong value)
{
  int count = 0;
  uint lo = (uint)(value >> 32);
  if(lo != 0) count = 32;
  else lo = (uint)value;
  if(lo > 0xFFFF) { count += 16; lo >>= 16; }
  if(lo > 0x00FF) { count +=  8; lo >>=  8; }
  fixed(byte* p = bitsTable.Entries) return count + p[lo];
}

static unsafe ByteTable CreateBitsNeededTable()
{
  var table = new ByteTable();
  for(int i = 1, n = 0; i < 256; i++)
  {
    if((i & (i-1)) == 0) n++; // when we see a power of two, increase the bits needed
    table.Entries[i] = (byte)n;
  }
  return table;
}

static ByteTable bitsTable = CreateBitsNeededTable();

Eliminating bias

Perhaps low bias isn't good enough. How can we eliminate bias entirely? In general, to eliminate bias we may have to reject some values from the underlying generator, looping until we find one that's suitable. The number of bits we consume is therefore unbounded, but a good method should terminate quickly. There are several methods but they tend to fall into two general groups — various modulo methods and the power-of-two method — and I'm only going to recommend two of them.

The first method is the modulo method. There are actually many modulo methods, but let's start with a simple one. I don't recommend this method, but it gets the idea across. The basic idea is that the range of the random numbers must be divisible by the range of the output. For example, if we want a random number from 0 to N–1 (i.e. N possible values), we would like an underlying generator that generates numbers within a range that is a multiple of N. In that case, there would be no bias of any kind if we just did Next() % N. But underlying generators almost always return numbers in a range that's a power of two. So we would find the largest multiple of N within the underlying generator's range — call it M — and accept values from the underlying generator if they're less than M, and reject them otherwise. This yields a random number from 0 to M–1. Having reduced the range to a multiple of N, we can take that value modulo N to get a result without bias.

uint Unbiased(uint exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  const ulong Max = 1UL<<32; // 2^32
  // take N = exclusiveMax and compute the largest multiple of N (call it M) less than or equal to 2^32. then
  // limit = M - 1 (so it fits in a uint). there are better ways to compute the limit. (see below.)
  uint limit = (uint)(Max - Max % exclusiveMax) - 1, r;
  do r = NextUInt32(); while(r > limit); // r is drawn from a random range that's a multiple of exclusiveMax
  return r % exclusiveMax; // therefore r % exclusiveMax has no bias
}

The method above works by rejecting the last 2³² % exclusiveMax values from the range of the underlying generator. A slightly simpler method is to reject the first values from the range instead of the last.

uint Unbiased(uint exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  // compute the first value we can accept, which is the same as the number of values we must reject. we would like to
  // take N = exclusiveMax and find 2^32 % N, but to avoid 64-bit division, we'll instead take (2^32 - N) % N. the
  // result is the same, since adding or subtracting N doesn't change a value taken modulo N. that is, x % N =
  // (x + N) % N = (x - N) % N (assuming no overflow or underflow). we also use the trick that (uint)-x = 2^32 - x
  uint limit = (uint)-(int)exclusiveMax % exclusiveMax, r; // limit == 2^32 % N
  do r = NextUInt32(); while(r < limit); // r is drawn from a random range that's a multiple of exclusiveMax
  return r % exclusiveMax; // therefore r % exclusiveMax has no bias
}

This also works, but in both cases we need to perform at least two slow modulo operations. There exists a method to get away with a minimum of one modulo operation, and this is the method I recommend.

uint Unbiased(uint exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  // let's say N = exclusiveMax = 10. then NextUInt32() % N would be biased because 2^32 is not divisible by 10. we need
  // to reject 2^32 % 10 = 6 values, leaving a usable range of 2^32 - 6 which IS divisible by 10. two common ways to do
  // this are to reject random values less than 6 or greater than 2^32-1 - 6 and then return value % N if it wasn't
  // rejected. the problem with this approach is that it requires performing at least two (slow) modulus operations. so
  // we use a trick, which is to instead take limit = 2^32 - 10 (not -6) and for each random value r compare r - r%N to
  // that limit. for values 0 to N-1, r - r%N is zero. (0-0%10 = 0, 1-1%10 = 0, ..., 9-9%10 = 0.) then for N to 2N-1,
  // r - r%N == N. (10 - 10%10 = 10 - 0 = 10, 11 - 11%10 = 11 - 1 = 10, etc.) thus r - r%N is always a multiple of N.
  // since limit = 2^32 - N, limit is the largest legal value of r - r%N. (beyond that limit, there isn't enough space
  // in the range for another full set of N values.) in our example, limit = 2^32 - 10 but it's not until r = 2^32 - 6
  // that r - r%N will exceed the limit. when r = 2^32 - 7, r - r%N = 2^32 - 16 and when r = 2^32 - 6, r - r%N =
  // 2^32 - 6. thus, this approach also excludes six values from the range but requires a minimum of one modulus
  // operation instead of two
  uint r, v, limit = (uint)-(int)exclusiveMax; // limit = 2^32 - exclusiveMax
  do
  {
    r = NextUInt32();
    v = r % exclusiveMax;
  } while(r-v > limit);
  return v;
}

How many times might these methods need to loop? You might be afraid that if exclusiveMax is near 2³² that we'd reject almost the entire range, but that doesn't happen. Let's say it's 2³²–1. Then 2³² % exclusiveMax is 1, so we'd only exclude a single value from the range. The worst case occurs when exclusiveMax is 2³¹±1, when we exclude 2³¹–1 values from the range, rejecting them nearly half the time. That means there's still a 1/2 chance of returning after one iteration, 1/4 chance after two, 1/8 after three, etc. Since 1/2 + 2/4 + 3/8 + 4/16 + ... = 2, the average number of iterations is two. In general, the method is so fast that it beats the more complex fixed-point multiplication method. Thus I'd recommend the biased methods look like this:

uint Next(uint exclusiveMax)
{
  if(exclusiveMax <= 1u<<12) // if the range is no more than 2^12 and we need up to 32 random bits...
  {
    if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
    // NextUInt32 has the range [0,1) considered as 32:32 fixed-point
    return (uint)(((ulong)NextUInt32() * exclusiveMax) >> 32);
  }
  else // otherwise, the unbiased method is actually faster than larger fixed-point products
  {
    uint r, v, limit = (uint)-(int)exclusiveMax;
    do
    {
      r = NextUInt32();
      v = r % exclusiveMax;
    } while(r-v > limit);
    return v;
  }
}

ulong Next(ulong exclusiveMax)
{
  if(exclusiveMax <= uint.MaxValue) // if the range is no more than 2^32...
  {
    return Next((uint)exclusiveMax); // just use the 32-bit method
  }
  else if(exclusiveMax <= 1UL<<38) // otherwise, if we need up to 58 random bits...
  {
    // compute the multiplication of 58 random bits and the 38-bit range, producing a 96-bit fixed-point result.
    // we actually consume 64 random bits because it's faster than going through the bit buffer
    return ((NextUInt32()*exclusiveMax >> 32) + (NextUInt32()&((1u<<26)-1))*exclusiveMax) >> 26;
  }
  else // otherwise, if we need up to 84 random bits, use the unbiased method, which is faster than a large fixed-point product
  {
    ulong r, v, limit = (ulong)-(long)exclusiveMax;
    do
    {
      r = NextUInt64();
      v = r % exclusiveMax;
    } while(r-v > limit);
    return v;
  }
}

Those are in fact the methods I use in my own code. But there's another way to generate unbiased random numbers, and it's even faster in some cases. The idea is to generate ceiling(log2(exclusiveMax)) random bits and loop until the result is less than exclusiveMax. The worst cases occur when exclusiveMax is one greater than a power of two, when the method takes slightly less than two iterations on average — the exact same worst-case behavior as the modulo technique. Bad cases are somewhat more common with this method, but it makes up for it by consuming fewer bits from the underlying generator and using cheaper operations, except for the calculation of log2(exclusiveMax). The expense of calculating the logarithm slows the method down enough that I can't recommend it unless you have access to a hardware instruction to count leading zero bits. That said, for 64-bit values on a 32-bit architecture, this method is faster than the modulo method for small to medium values, even without a fast way to compute the logarithm. Here's an implementation.

uint Unbiased(uint exclusiveMax)
{
  if(exclusiveMax == 0) throw new ArgumentOutOfRangeException();
  int bits = BitsNeeded(exclusiveMax-1);
  uint v;
  do v = NextBits(bits); while(v >= exclusiveMax);
  return v;
}

In the case of unbiased number generation as well, you might consider special-casing powers of two, although I personally haven't found it to be worthwhile. (Normally when I need a power-of-two range I know it and just call NextBits directly.)

Generating better floating-point numbers

This section is short and simple. Above I gave a simple and poor method of generating random floating-point values.

// Returns a random floating point number in the range [0,1).
double NextDouble()
{
  return NextUInt32() * (1.0 / 4294967296); // divide by 2^32
}

The main problem with this method is that a double has a 52-bit mantissa but the method only generates 2³² different values. This could be fixed by using more random bits.

// Returns a random floating point number in the range [0,1).
double NextDouble()
{
  return NextBits64(52) * (1.0 / (1UL<<52)); // divide by 2^52
}

The second problem is that the method must divide the number to get it in the right range, and division is slow. Although the division has been optimized to a multiplication by the reciprocal, it's still a bit slower than it has to be. Instead, we'd like to be able to simply fill the mantissa with 52 random bits using integer operations. To do that, we must choose an appropriate exponent. Our goal is to return a value within [0,1), so we need an exponent that makes all the values of the mantissa lie within a range of size 1 (exclusive). Remember that floating-point numbers are basically numbers of the form sign * mantissa * 2^exponent, where the mantissa is a number within [1,2) (except for denormalized values which I won't consider here). That is, a mantissa of zero represents 1.0 * 2^exponent, while the maximum mantissa represents 1.999... * 2^exponent. (We'll ignore sign since we only generate non-negative values.) The mantissa, then, is already in a range of size 1 (exclusive), exactly what we need. So we want to choose an exponent of zero so the value becomes [1,2) * 2⁰ = [1,2). Then we just need to subtract 1 to shift it into the range [0,1). Exponents in IEEE754 floating point are "biased", which just means offset by a fixed value and is their way to allow negative exponents to be represented. So instead of putting zero in the exponent field, we have to put 0 + bias. The bias is 1023 for 64-bit floating-point numbers and 127 for 32-bit ones. Putting it all together, we get this.

// Returns a random floating point number in the range [0,1).
unsafe double NextDouble()
{
  // generate 52 random mantissa bits and an (unbiased) exponent of 0. it turns out to be significantly faster to use
  // an integer variable and reinterpret as a double than to use a double variable and reinterpret as an integer
  ulong n = ((ulong)(NextBits(20) | (1023u<<20)) << 32) | NextUInt32();
  return *(double*)&n - 1; // 1.xxxxx * 2^0 lies within [1,2). subtract 1 to get it in [0,1)
}

unsafe float NextFloat()
{
  uint n = NextBits(23) | (127u<<23); // generate 23 random mantissa bits and an (unbiased) exponent of 0
  return *(float*)&n - 1; // 1.xxxxx * 2^0 lies within [1,2). subtract 1 to get it in [0,1)
}

Gaussian, exponential, and other non-uniform distributions

While this article has primarily focused on uniform distributions, it is also of interest to efficiently generate values from nonuniform distributions. The best algorithm for this that I know of is the Ziggarat algorithm, which works for many unimodal distributions. While I haven't implemented every possible optimization, I do have a decent implementation, which I use for generating numbers from the normal and exponential distributions. See Random.cs in my mathematics library.

Comments

No comments yet.

Web Rarely