<i18n dev> [15] RFR: 8244459: Optimize the hash map size in LocaleProviderAdapters

naoto.sato at oracle.com naoto.sato at oracle.com
Wed May 6 00:04:40 UTC 2020


Hi Stuart,

On 5/5/20 4:29 PM, Stuart Marks wrote:
> As for Naoto's changeset, I don't think it should be held up while we 
> bikeshed this. :-) I'm ok with whatever he decides.

I don't think the number of language tags exceed 2^29, unless languages 
in the whole universe are counted :-) So I would go with the current 
version.

Naoto

> 
> s'marks
> 
> 
> 
> 
> On 5/5/20 1:26 PM, Peter Levart wrote:
>> There's more...
>>
>>
>> Guava (and JDK in copy constructor) formula:
>>
>>      (int) ((float) expectedSize / 0.75f + 1.0f)
>>
>>
>> is not the same as Naoto's formula:
>>
>>      (int) (expectedSize / 0.75f) + 1
>>
>>
>> Notice that Naoto does addition of 1 in integer arithmetic after 
>> conversion to int, while Guava/JDK does in floating point before 
>> conversion to int. If I put Naoto's formula into my test program, no 
>> undercalculations are detected.
>>
>>
>> So while Naoto's formula is sub-optimal for expectedSizes that are 
>> multiples of 3, the Guava/JDK also has a bug.
>>
>>
>> Regards, Peter
>>
>>
>> On 5/5/20 10:01 PM, Peter Levart wrote:
>>>
>>>
>>> On 5/5/20 9:41 PM, Martin Buchholz wrote:
>>>> Hi Peter,
>>>>
>>>> Are you saying guava has a tiny bug?
>>>
>>>
>>> If it was just 1 too much when expected size is a multiple of 3 then 
>>> that would not be a bug, just sub-optimal calculation. And the same 
>>> calculation is performed also in JDK when a copy constructor is 
>>> called for example.
>>>
>>>
>>> But I investigated further and what I found could be considered a 
>>> bug. Sometimes, the following expression:
>>>
>>>
>>> (int) ((float) expectedSize / 0.75f + 1.0f)
>>>
>>>
>>> ...calculates a value that is not enough (due to floating point 
>>> arithmetic and conversion to int) to store the expectedSize elements 
>>> into the HashMap without re-hashing.
>>>
>>>
>>> What HashMap does with initialCapacity parameter is to round it up to 
>>> nearest power of 2:
>>>
>>>     static int tableSizeFor(int cap) {
>>>         int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
>>>         return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? 
>>> MAXIMUM_CAPACITY : n + 1;
>>>     }
>>>
>>> then it uses this as the initial backing table size. From that table 
>>> size it calculates the threshold value:
>>>
>>>     static int threshold(int cap) {
>>>         float ft = (float) cap * 0.75f;
>>>         return (cap < MAXIMUM_CAPACITY && ft < (float) 
>>> MAXIMUM_CAPACITY ?
>>>                 (int) ft : Integer.MAX_VALUE);
>>>     }
>>>
>>> ... and uses it as the max. number of elements that a HashMap can 
>>> hold before it is re-hashed. So I did the following test (comparing 
>>> the effectiveness of above formula with alternative 
>>> (expectedSize*4+2)/3 formula):
>>>
>>>
>>> public class HMTest {
>>>     static final int MAXIMUM_CAPACITY = 1 << 30;
>>>
>>>     static int tableSizeFor(int cap) {
>>>         int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
>>>         return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? 
>>> MAXIMUM_CAPACITY : n + 1;
>>>     }
>>>
>>>     static int threshold(int cap) {
>>>         float ft = (float) cap * 0.75f;
>>>         return (cap < MAXIMUM_CAPACITY && ft < (float) 
>>> MAXIMUM_CAPACITY ?
>>>                 (int) ft : Integer.MAX_VALUE);
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         for (int expectedSize = 0; expectedSize < (Integer.MAX_VALUE 
>>> - 2) / 4; expectedSize++) {
>>>             int cap1 = (int) ((float) expectedSize / 0.75f + 1.0f);
>>>             int cap2 = (expectedSize * 4 + 2) / 3;
>>>             int ts1 = tableSizeFor(cap1);
>>>             int ts2 = tableSizeFor(cap2);
>>>             int th1 = threshold(ts1);
>>>             int th2 = threshold(ts2);
>>>
>>>             if (th1 < expectedSize || th2 < expectedSize) {
>>>                 System.out.printf("%d: (%d, %d, %d)%s (%d, %d, %d)%s\n",
>>>                         expectedSize,
>>>                         cap1, ts1, th1, (th1 < expectedSize) ? "!" : 
>>> " ",
>>>                         cap2, ts2, th2, (th2 < expectedSize) ? "!" : " "
>>>                 );
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>>
>>> And what this prints is the following:
>>>
>>>
>>> 25165825: (33554432, 33554432, 25165824)! (33554434, 67108864, 50331648)
>>> 50331649: (67108864, 67108864, 50331648)! (67108866, 134217728, 
>>> 100663296)
>>> 50331650: (67108864, 67108864, 50331648)! (67108867, 134217728, 
>>> 100663296)
>>> 100663297: (134217728, 134217728, 100663296)! (134217730, 268435456, 
>>> 201326592)
>>> 100663298: (134217728, 134217728, 100663296)! (134217731, 268435456, 
>>> 201326592)
>>> 100663299: (134217728, 134217728, 100663296)! (134217732, 268435456, 
>>> 201326592)
>>> 100663300: (134217728, 134217728, 100663296)! (134217734, 268435456, 
>>> 201326592)
>>> 201326593: (268435456, 268435456, 201326592)! (268435458, 536870912, 
>>> 402653184)
>>> 201326594: (268435456, 268435456, 201326592)! (268435459, 536870912, 
>>> 402653184)
>>> 201326595: (268435456, 268435456, 201326592)! (268435460, 536870912, 
>>> 402653184)
>>> 201326596: (268435456, 268435456, 201326592)! (268435462, 536870912, 
>>> 402653184)
>>> 201326597: (268435456, 268435456, 201326592)! (268435463, 536870912, 
>>> 402653184)
>>> 201326598: (268435456, 268435456, 201326592)! (268435464, 536870912, 
>>> 402653184)
>>> 201326599: (268435456, 268435456, 201326592)! (268435466, 536870912, 
>>> 402653184)
>>> 201326600: (268435456, 268435456, 201326592)! (268435467, 536870912, 
>>> 402653184)
>>> 402653185: (536870912, 536870912, 402653184)! (536870914, 1073741824, 
>>> 2147483647)
>>> 402653186: (536870912, 536870912, 402653184)! (536870915, 1073741824, 
>>> 2147483647)
>>> 402653187: (536870912, 536870912, 402653184)! (536870916, 1073741824, 
>>> 2147483647)
>>> 402653188: (536870912, 536870912, 402653184)! (536870918, 1073741824, 
>>> 2147483647)
>>> 402653189: (536870912, 536870912, 402653184)! (536870919, 1073741824, 
>>> 2147483647)
>>> 402653190: (536870912, 536870912, 402653184)! (536870920, 1073741824, 
>>> 2147483647)
>>> 402653191: (536870912, 536870912, 402653184)! (536870922, 1073741824, 
>>> 2147483647)
>>> 402653192: (536870912, 536870912, 402653184)! (536870923, 1073741824, 
>>> 2147483647)
>>> 402653193: (536870912, 536870912, 402653184)! (536870924, 1073741824, 
>>> 2147483647)
>>> 402653194: (536870912, 536870912, 402653184)! (536870926, 1073741824, 
>>> 2147483647)
>>> 402653195: (536870912, 536870912, 402653184)! (536870927, 1073741824, 
>>> 2147483647)
>>> 402653196: (536870912, 536870912, 402653184)! (536870928, 1073741824, 
>>> 2147483647)
>>> 402653197: (536870912, 536870912, 402653184)! (536870930, 1073741824, 
>>> 2147483647)
>>> 402653198: (536870912, 536870912, 402653184)! (536870931, 1073741824, 
>>> 2147483647)
>>> 402653199: (536870912, 536870912, 402653184)! (536870932, 1073741824, 
>>> 2147483647)
>>> 402653200: (536870912, 536870912, 402653184)! (536870934, 1073741824, 
>>> 2147483647)
>>>
>>>
>>> So as you see, for expectedSize < (Integer.MAX_VALUE - 2) / 4 (where 
>>> the alternative formula does not experience overflow and is enough 
>>> for Naoto's case) all miscalculations are due to the JDK/Guava 
>>> formula which in those cases calculates a value that is less than 
>>> alternative formula's value and too small to adequately pre-size the 
>>> HashMap table.
>>>
>>>
>>> Voila, we have some bugs to fix or I am doing something wrong here.
>>>
>>>
>>> Regards, Peter
>>>
>>>
>>>>
>>>> On Tue, May 5, 2020 at 12:12 PM Peter Levart <peter.levart at gmail.com 
>>>> <mailto:peter.levart at gmail.com>> wrote:
>>>>
>>>>     Hi Martin,
>>>>
>>>>     On 5/5/20 8:26 PM, Martin Buchholz wrote:
>>>>>     See related:
>>>>> https://guava.dev/releases/23.0/api/docs/com/google/common/collect/Maps.html#newHashMapWithExpectedSize-int- 
>>>>>
>>>>
>>>>
>>>>     This is basically the same calculation (or at least gives same
>>>>     result) as Naoto did (without the max part):
>>>>
>>>>     Naoto: (int)(expectedSize / 0.75f) + 1
>>>>
>>>>     Guava: (int) ((float) expectedSize / 0.75F + 1.0F)
>>>>
>>>>     but in case expectedSize is a multiple of 3, it gives the result
>>>>     which is 1 more than needed. If what is needed is also a power of
>>>>     2, then twice the needed space is allocated in the HashMap
>>>>     backing table.
>>>>
>>>>
>>>>     Regards, Peter
>>>>
>>>>
>>>>>
>>>>>     On Tue, May 5, 2020 at 11:03 AM <naoto.sato at oracle.com
>>>>>     <mailto:naoto.sato at oracle.com>> wrote:
>>>>>
>>>>>         And here is the fix. Please review.
>>>>>
>>>>>         http://cr.openjdk.java.net/~naoto/8244459/webrev.00/
>>>>>
>>>>>         Naoto
>>>>>
>>>>>         On 5/5/20 10:25 AM, naoto.sato at oracle.com
>>>>>         <mailto:naoto.sato at oracle.com> wrote:
>>>>>         > Hi Peter,
>>>>>         >
>>>>>         > You are correct. Thanks. I'll remove that initial value 
>>>>> of 16.
>>>>>         >
>>>>>         > Naoto
>>>>>         >
>>>>>         > On 5/5/20 9:37 AM, Peter Levart wrote:
>>>>>         >> Hi Naoto,
>>>>>         >>
>>>>>         >> On 4/30/20 12:18 AM, naoto.sato at oracle.com
>>>>>         <mailto:naoto.sato at oracle.com> wrote:
>>>>>         >>> Hello,
>>>>>         >>>
>>>>>         >>> Please review this small fix to the following issue:
>>>>>         >>>
>>>>>         >>> https://bugs.openjdk.java.net/browse/JDK-8244152
>>>>>         >>>
>>>>>         >>> The proposed changeset is located at:
>>>>>         >>>
>>>>>         >>> https://cr.openjdk.java.net/~naoto/8244152/webrev.00/
>>>>>         >>>
>>>>>         >>> The hash map used there didn't have initial capacity,
>>>>>         even though the
>>>>>         >>> exact numbers are known.
>>>>>         >>
>>>>>         >>
>>>>>         >> Well, it has to be calculated 1st (countTokens), but I
>>>>>         guess this pays
>>>>>         >> off when HashSet (the backing HashMap) does not have to
>>>>>         be rehashed then.
>>>>>         >>
>>>>>         >> The expression you use:
>>>>>         >>
>>>>>         >>      Math.max((int)(tokens.countTokens() / 0.75f) + 1, 16)
>>>>>         >>
>>>>>         >> ...has a minimum value of 16. Why is that? 16 is just
>>>>>         HashMap's
>>>>>         >> default initialCapacity if not specified explicitly. But
>>>>>         if you only
>>>>>         >> want to store say 1 entry in the map, you can specify 2 as
>>>>>         >> initialCapacity and HashMap will happily work for such
>>>>>         case without
>>>>>         >> resizing.
>>>>>         >>
>>>>>         >>
>>>>>         >> So you could just use:
>>>>>         >>
>>>>>         >>      (int)(tokens.countTokens() / 0.75f) + 1
>>>>>         >>
>>>>>         >> And even this expression is sometimes overshooting the
>>>>>         minimal
>>>>>         >> required value by 1 (when # of tokens is "exact" multiple
>>>>>         of 0.75f,
>>>>>         >> say 6). I think the following could be used to optimally
>>>>>         pre-size the
>>>>>         >> HashMap with default load factor 0.75:
>>>>>         >>
>>>>>         >>      (tokens.countTokens() * 4 + 2) / 3
>>>>>         >>
>>>>>         >>
>>>>>         >> Regards, Peter
>>>>>         >>
>>>>>         >>>
>>>>>         >>> Naoto
>>>>>


More information about the core-libs-dev mailing list