<i18n dev> [15] RFR: 8244459: Optimize the hash map size in LocaleProviderAdapters
Peter Levart
peter.levart at gmail.com
Tue May 5 20:26:52 UTC 2020
There's more...
Guava (and JDK in copy constructor) formula:
(int) ((float) expectedSize / 0.75f + 1.0f)
is not the same as Naoto's formula:
(int) (expectedSize / 0.75f) + 1
Notice that Naoto does addition of 1 in integer arithmetic after
conversion to int, while Guava/JDK does in floating point before
conversion to int. If I put Naoto's formula into my test program, no
undercalculations are detected.
So while Naoto's formula is sub-optimal for expectedSizes that are
multiples of 3, the Guava/JDK also has a bug.
Regards, Peter
On 5/5/20 10:01 PM, Peter Levart wrote:
>
>
> On 5/5/20 9:41 PM, Martin Buchholz wrote:
>> Hi Peter,
>>
>> Are you saying guava has a tiny bug?
>
>
> If it was just 1 too much when expected size is a multiple of 3 then
> that would not be a bug, just sub-optimal calculation. And the same
> calculation is performed also in JDK when a copy constructor is called
> for example.
>
>
> But I investigated further and what I found could be considered a bug.
> Sometimes, the following expression:
>
>
> (int) ((float) expectedSize / 0.75f + 1.0f)
>
>
> ...calculates a value that is not enough (due to floating point
> arithmetic and conversion to int) to store the expectedSize elements
> into the HashMap without re-hashing.
>
>
> What HashMap does with initialCapacity parameter is to round it up to
> nearest power of 2:
>
> static int tableSizeFor(int cap) {
> int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
> return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ?
> MAXIMUM_CAPACITY : n + 1;
> }
>
> then it uses this as the initial backing table size. From that table
> size it calculates the threshold value:
>
> static int threshold(int cap) {
> float ft = (float) cap * 0.75f;
> return (cap < MAXIMUM_CAPACITY && ft < (float) MAXIMUM_CAPACITY ?
> (int) ft : Integer.MAX_VALUE);
> }
>
> ... and uses it as the max. number of elements that a HashMap can hold
> before it is re-hashed. So I did the following test (comparing the
> effectiveness of above formula with alternative (expectedSize*4+2)/3
> formula):
>
>
> public class HMTest {
> static final int MAXIMUM_CAPACITY = 1 << 30;
>
> static int tableSizeFor(int cap) {
> int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
> return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ?
> MAXIMUM_CAPACITY : n + 1;
> }
>
> static int threshold(int cap) {
> float ft = (float) cap * 0.75f;
> return (cap < MAXIMUM_CAPACITY && ft < (float) MAXIMUM_CAPACITY ?
> (int) ft : Integer.MAX_VALUE);
> }
>
> public static void main(String[] args) {
> for (int expectedSize = 0; expectedSize < (Integer.MAX_VALUE -
> 2) / 4; expectedSize++) {
> int cap1 = (int) ((float) expectedSize / 0.75f + 1.0f);
> int cap2 = (expectedSize * 4 + 2) / 3;
> int ts1 = tableSizeFor(cap1);
> int ts2 = tableSizeFor(cap2);
> int th1 = threshold(ts1);
> int th2 = threshold(ts2);
>
> if (th1 < expectedSize || th2 < expectedSize) {
> System.out.printf("%d: (%d, %d, %d)%s (%d, %d, %d)%s\n",
> expectedSize,
> cap1, ts1, th1, (th1 < expectedSize) ? "!" : " ",
> cap2, ts2, th2, (th2 < expectedSize) ? "!" : " "
> );
> }
> }
> }
> }
>
>
> And what this prints is the following:
>
>
> 25165825: (33554432, 33554432, 25165824)! (33554434, 67108864, 50331648)
> 50331649: (67108864, 67108864, 50331648)! (67108866, 134217728,
> 100663296)
> 50331650: (67108864, 67108864, 50331648)! (67108867, 134217728,
> 100663296)
> 100663297: (134217728, 134217728, 100663296)! (134217730, 268435456,
> 201326592)
> 100663298: (134217728, 134217728, 100663296)! (134217731, 268435456,
> 201326592)
> 100663299: (134217728, 134217728, 100663296)! (134217732, 268435456,
> 201326592)
> 100663300: (134217728, 134217728, 100663296)! (134217734, 268435456,
> 201326592)
> 201326593: (268435456, 268435456, 201326592)! (268435458, 536870912,
> 402653184)
> 201326594: (268435456, 268435456, 201326592)! (268435459, 536870912,
> 402653184)
> 201326595: (268435456, 268435456, 201326592)! (268435460, 536870912,
> 402653184)
> 201326596: (268435456, 268435456, 201326592)! (268435462, 536870912,
> 402653184)
> 201326597: (268435456, 268435456, 201326592)! (268435463, 536870912,
> 402653184)
> 201326598: (268435456, 268435456, 201326592)! (268435464, 536870912,
> 402653184)
> 201326599: (268435456, 268435456, 201326592)! (268435466, 536870912,
> 402653184)
> 201326600: (268435456, 268435456, 201326592)! (268435467, 536870912,
> 402653184)
> 402653185: (536870912, 536870912, 402653184)! (536870914, 1073741824,
> 2147483647)
> 402653186: (536870912, 536870912, 402653184)! (536870915, 1073741824,
> 2147483647)
> 402653187: (536870912, 536870912, 402653184)! (536870916, 1073741824,
> 2147483647)
> 402653188: (536870912, 536870912, 402653184)! (536870918, 1073741824,
> 2147483647)
> 402653189: (536870912, 536870912, 402653184)! (536870919, 1073741824,
> 2147483647)
> 402653190: (536870912, 536870912, 402653184)! (536870920, 1073741824,
> 2147483647)
> 402653191: (536870912, 536870912, 402653184)! (536870922, 1073741824,
> 2147483647)
> 402653192: (536870912, 536870912, 402653184)! (536870923, 1073741824,
> 2147483647)
> 402653193: (536870912, 536870912, 402653184)! (536870924, 1073741824,
> 2147483647)
> 402653194: (536870912, 536870912, 402653184)! (536870926, 1073741824,
> 2147483647)
> 402653195: (536870912, 536870912, 402653184)! (536870927, 1073741824,
> 2147483647)
> 402653196: (536870912, 536870912, 402653184)! (536870928, 1073741824,
> 2147483647)
> 402653197: (536870912, 536870912, 402653184)! (536870930, 1073741824,
> 2147483647)
> 402653198: (536870912, 536870912, 402653184)! (536870931, 1073741824,
> 2147483647)
> 402653199: (536870912, 536870912, 402653184)! (536870932, 1073741824,
> 2147483647)
> 402653200: (536870912, 536870912, 402653184)! (536870934, 1073741824,
> 2147483647)
>
>
> So as you see, for expectedSize < (Integer.MAX_VALUE - 2) / 4 (where
> the alternative formula does not experience overflow and is enough for
> Naoto's case) all miscalculations are due to the JDK/Guava formula
> which in those cases calculates a value that is less than alternative
> formula's value and too small to adequately pre-size the HashMap table.
>
>
> Voila, we have some bugs to fix or I am doing something wrong here.
>
>
> Regards, Peter
>
>
>>
>> On Tue, May 5, 2020 at 12:12 PM Peter Levart <peter.levart at gmail.com
>> <mailto:peter.levart at gmail.com>> wrote:
>>
>> Hi Martin,
>>
>> On 5/5/20 8:26 PM, Martin Buchholz wrote:
>>> See related:
>>> https://guava.dev/releases/23.0/api/docs/com/google/common/collect/Maps.html#newHashMapWithExpectedSize-int-
>>
>>
>> This is basically the same calculation (or at least gives same
>> result) as Naoto did (without the max part):
>>
>> Naoto: (int)(expectedSize / 0.75f) + 1
>>
>> Guava: (int) ((float) expectedSize / 0.75F + 1.0F)
>>
>> but in case expectedSize is a multiple of 3, it gives the result
>> which is 1 more than needed. If what is needed is also a power of
>> 2, then twice the needed space is allocated in the HashMap
>> backing table.
>>
>>
>> Regards, Peter
>>
>>
>>>
>>> On Tue, May 5, 2020 at 11:03 AM <naoto.sato at oracle.com
>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>
>>> And here is the fix. Please review.
>>>
>>> http://cr.openjdk.java.net/~naoto/8244459/webrev.00/
>>>
>>> Naoto
>>>
>>> On 5/5/20 10:25 AM, naoto.sato at oracle.com
>>> <mailto:naoto.sato at oracle.com> wrote:
>>> > Hi Peter,
>>> >
>>> > You are correct. Thanks. I'll remove that initial value of 16.
>>> >
>>> > Naoto
>>> >
>>> > On 5/5/20 9:37 AM, Peter Levart wrote:
>>> >> Hi Naoto,
>>> >>
>>> >> On 4/30/20 12:18 AM, naoto.sato at oracle.com
>>> <mailto:naoto.sato at oracle.com> wrote:
>>> >>> Hello,
>>> >>>
>>> >>> Please review this small fix to the following issue:
>>> >>>
>>> >>> https://bugs.openjdk.java.net/browse/JDK-8244152
>>> >>>
>>> >>> The proposed changeset is located at:
>>> >>>
>>> >>> https://cr.openjdk.java.net/~naoto/8244152/webrev.00/
>>> >>>
>>> >>> The hash map used there didn't have initial capacity,
>>> even though the
>>> >>> exact numbers are known.
>>> >>
>>> >>
>>> >> Well, it has to be calculated 1st (countTokens), but I
>>> guess this pays
>>> >> off when HashSet (the backing HashMap) does not have to
>>> be rehashed then.
>>> >>
>>> >> The expression you use:
>>> >>
>>> >> Math.max((int)(tokens.countTokens() / 0.75f) + 1, 16)
>>> >>
>>> >> ...has a minimum value of 16. Why is that? 16 is just
>>> HashMap's
>>> >> default initialCapacity if not specified explicitly. But
>>> if you only
>>> >> want to store say 1 entry in the map, you can specify 2 as
>>> >> initialCapacity and HashMap will happily work for such
>>> case without
>>> >> resizing.
>>> >>
>>> >>
>>> >> So you could just use:
>>> >>
>>> >> (int)(tokens.countTokens() / 0.75f) + 1
>>> >>
>>> >> And even this expression is sometimes overshooting the
>>> minimal
>>> >> required value by 1 (when # of tokens is "exact" multiple
>>> of 0.75f,
>>> >> say 6). I think the following could be used to optimally
>>> pre-size the
>>> >> HashMap with default load factor 0.75:
>>> >>
>>> >> (tokens.countTokens() * 4 + 2) / 3
>>> >>
>>> >>
>>> >> Regards, Peter
>>> >>
>>> >>>
>>> >>> Naoto
>>>
More information about the i18n-dev
mailing list