8215441: Increase uniformity of the distribution of BigIntegers constructed by BigInteger(int, Random)
Peter Levart
peter.levart at gmail.com
Thu Dec 20 15:21:45 UTC 2018
Hi Brian,
I don't quite understand this reasoning. I need some explanation...
You are talking about the distribution of BigInteger values (not the bit
lengths of the values), obtained with constructor BigInteger(int,
java.util.Random) for which the javadoc says:
 * Constructs a randomly generated BigInteger, uniformly distributed over
 * the range 0 to (2<sup>{@code numBits}</sup> - 1), inclusive.
 * The uniformity of the distribution assumes that a fair source of random
 * bits is provided in {@code rnd}. Note that this constructor always
 * constructs a non-negative BigInteger.
I think that the javadoc meant to say that the *values* of the
BigIntegers constructed are uniformly distributed...
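To make this concrete: if the values are uniform over the range 0 to
2^numBits - 1, the sample mean should approach (2^numBits - 1) / 2. A
quick sanity check along these lines (just a rough sketch, numBits and
the sample count are arbitrary):

import java.math.BigInteger;
import java.util.Random;

public class UniformValueCheck {
    public static void main(String[] args) {
        int numBits = 8;
        long samples = 1_000_000;
        Random rnd = new Random();
        double sum = 0;
        for (long i = 0; i < samples; i++) {
            // value uniformly distributed over 0 .. 2^numBits - 1 (per the javadoc)
            sum += new BigInteger(numBits, rnd).doubleValue();
        }
        // for a uniform distribution the mean should approach (2^numBits - 1) / 2
        System.out.println("mean = " + (sum / samples)
                + ", expected ~" + ((Math.pow(2, numBits) - 1) / 2));
    }
}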
On 12/20/18 5:01 AM, Brian Burkhalter wrote:
> https://bugs.openjdk.java.net/browse/JDK-8215441
>
> This issue was filed to cover improving the uniformity of randomly generated BigIntegers. It is not intended to resolve [1] which is deliberately left open. The proposed patch implements a modified version of the “workaround” suggested in [1].
>
> The problem is that the magnitude of the random BigInteger is created from a sequence of bytes generated by Random.nextBytes() [2]. The likelihood that any of these bytes is zero is small
It should be 1/256, right? If the bytes returned by Random.nextBytes()
are good random bytes, they are independent and their distribution is
uniform. So the likelihood of an individual byte being 0 is
approximately the same as the likelihood of it being any particular
other byte value. As there are 256 different values a byte can have,
this amounts to 1/256...
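For example, a check along these lines should count roughly trials/256
zero bytes (just a sketch, the number of trials is arbitrary):

import java.util.Random;

public class ZeroByteFrequency {
    public static void main(String[] args) {
        int trials = 1_000_000;
        Random rnd = new Random();
        byte[] buf = new byte[1];
        int zeros = 0;
        for (int i = 0; i < trials; i++) {
            rnd.nextBytes(buf);
            if (buf[0] == 0) {
                zeros++;
            }
        }
        // each byte value should appear with probability ~1/256
        System.out.println("zero bytes: " + zeros
                + ", expected ~" + trials / 256);
    }
}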
> so the distribution of the resulting random BigIntegers is skewed towards values close to the maximum bit size “numBits” specified to the constructor.
I think that's a normal consequence of a uniform distribution of values.
The "bit length" of a BigInteger value is the index of its highest 1 bit
plus one, counting from the lowest bit (bit 0). Say you produce a
uniformly distributed random BigInteger in the range 0 ... 2^8 - 1 (the
numBits constructor parameter being 8 in this case).
The probability that you get an integer with "bit length" of 8 is 1/2
(that's the probability of a uniformly distributed 8 bit integer having
the highest bit 1)... 1XXXXXXX
The probability that you get an integer with "bit length" of 7 is 1/4
(that's the probability of a uniformly distributed 8 bit integer having
the highest bit 0 and the next to highest bit 1)... 01XXXXXX
Probability you get bit length 6 is 1/8   ... 001XXXXX
Probability you get bit length 5 is 1/16  ... 0001XXXX
Probability you get bit length 4 is 1/32  ... 00001XXX
Probability you get bit length 3 is 1/64  ... 000001XX
Probability you get bit length 2 is 1/128 ... 0000001X
Probability you get bit length 1 is 1/256 (the probability of getting
value 1) ... 00000001
Probability you get bit length 0 is 1/256 (the probability of getting
value 0) ... 00000000
In general, the probability that you get an integer with bit length N is
equal to the probability that you get an integer with bit length < N.
The probability that you get 0 from the BigInteger random constructor
when you pass numBits to it is 1/(2^numBits). When you choose 256 for
numBits, that's a very slim probability, practically zero, let alone
when you choose 4096.
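You can see those halving probabilities empirically with something like
this (a rough sketch, numBits and the sample count chosen arbitrarily):

import java.math.BigInteger;
import java.util.Random;

public class BitLengthHistogram {
    public static void main(String[] args) {
        int numBits = 8;
        int samples = 1_000_000;
        Random rnd = new Random();
        int[] counts = new int[numBits + 1]; // bit lengths 0 .. numBits
        for (int i = 0; i < samples; i++) {
            counts[new BigInteger(numBits, rnd).bitLength()]++;
        }
        // expected fractions: 1/2 for bit length 8, 1/4 for 7, ...
        // and 1/256 for both bit length 1 and bit length 0
        for (int len = numBits; len >= 0; len--) {
            System.out.printf("bit length %d: %.4f%n",
                    len, counts[len] / (double) samples);
        }
    }
}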
>
> The workaround suggested in [1] is to randomly change numBits to the value numBits = Random.nextInt(numBits + 1) [3]. (Actually the suggested workaround is nextInt(numBits) which is incorrect as the parameter is an exclusive upper bound.) This greatly improves the uniformity of the distribution. A remaining problem however is that now the very largest numbers in the interval [0,2^numBits) are underrepresented. A modification of this approach is to increment the new value of numBits as numBits = Random.nextInt(numBits + 1) + 1 [4]. This was empirically observed to improve the underrepresentation of the largest values.
>
> The distribution of the random BigIntegers was estimated using [5]. For a given maximum bit length, bin size, and number of random values to generate, this creates a histogram and calculates the coefficient of variation of the numbers generated. The histogram bin at index zero represents the largest valued bin in the result. The count in a given histogram bin is the number of values for which that bin is the leftmost (largest valued) with at least one non-zero bit. The bin of maximum index represents zero.
What exactly are you trying to show with this code? I'm struggling to
understand it...
If you wanted to show the distribution of BigInteger values into equally
sized bins by value (each bin representing a sub-interval of the
interval 0 ... 2^numBits - 1), then the code would be much simpler. But
you are trying to show something else... What is that?
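By binning by value I mean something roughly like this (only a sketch;
numBits, the bin count and the sample count are arbitrary):

import java.math.BigInteger;
import java.util.Random;

public class ValueHistogram {
    public static void main(String[] args) {
        int numBits = 8;      // values range over 0 .. 2^numBits - 1
        int bins = 16;        // equally sized sub-intervals by value
        int samples = 1_000_000;
        Random rnd = new Random();
        BigInteger binWidth = BigInteger.ONE.shiftLeft(numBits)
                .divide(BigInteger.valueOf(bins));
        int[] counts = new int[bins];
        for (int i = 0; i < samples; i++) {
            int bin = new BigInteger(numBits, rnd).divide(binWidth).intValue();
            counts[Math.min(bin, bins - 1)]++; // guard the top edge
        }
        // for a uniform distribution each bin should hold ~samples / bins values
        for (int b = 0; b < bins; b++) {
            System.out.printf("bin %2d: %d%n", b, counts[b]);
        }
    }
}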
Regards, Peter