<i18n dev> [15] RFR: 8244459: Optimize the hash map size in LocaleProviderAdapters
Stuart Marks
stuart.marks at oracle.com
Tue May 5 23:29:21 UTC 2020
Hm, interesting, good catch Peter! Very subtle. The time-honored
(int) (expected / 0.75f) + 1
appears in several places around the JDK. I think most people (including me)
just copy it, because it's "good enough" for most cases.
I'm a little concerned about
(expectedSize * 4 + 2) / 3
because that wraps around to negative starting at about 2^29. Might I suggest
the following instead?
(int) Math.ceil(expectedSize / 0.75)
This expresses the actual intent more clearly, I think, and its results match
Peter's expression for the range less than 2^29. It also saturates at
Integer.MAX_VALUE instead of going negative. It does use double precision,
though, as there's no float version of ceil(). At this point I think this is a
small disadvantage.
**
As for Naoto's changeset, I don't think it should be held up while we bikeshed
this. :-) I'm ok with whatever he decides.
s'marks
On 5/5/20 1:26 PM, Peter Levart wrote:
> There's more...
>
>
> Guava (and JDK in copy constructor) formula:
>
> (int) ((float) expectedSize / 0.75f + 1.0f)
>
>
> is not the same as Naoto's formula:
>
> (int) (expectedSize / 0.75f) + 1
>
>
> Notice that Naoto does addition of 1 in integer arithmetic after conversion to
> int, while Guava/JDK does in floating point before conversion to int. If I put
> Naoto's formula into my test program, no undercalculations are detected.
>
>
> So while Naoto's formula is sub-optimal for expectedSizes that are multiples of
> 3, the Guava/JDK also has a bug.
>
>
> Regards, Peter
>
>
> On 5/5/20 10:01 PM, Peter Levart wrote:
>>
>>
>> On 5/5/20 9:41 PM, Martin Buchholz wrote:
>>> Hi Peter,
>>>
>>> Are you saying guava has a tiny bug?
>>
>>
>> If it was just 1 too much when expected size is a multiple of 3 then that
>> would not be a bug, just sub-optimal calculation. And the same calculation is
>> performed also in JDK when a copy constructor is called for example.
>>
>>
>> But I investigated further and what I found could be considered a bug.
>> Sometimes, the following expression:
>>
>>
>> (int) ((float) expectedSize / 0.75f + 1.0f)
>>
>>
>> ...calculates a value that is not enough (due to floating point arithmetic and
>> conversion to int) to store the expectedSize elements into the HashMap without
>> re-hashing.
>>
>>
>> What HashMap does with initialCapacity parameter is to round it up to nearest
>> power of 2:
>>
>> static int tableSizeFor(int cap) {
>> int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
>> return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
>> }
>>
>> then it uses this as the initial backing table size. From that table size it
>> calculates the threshold value:
>>
>> static int threshold(int cap) {
>> float ft = (float) cap * 0.75f;
>> return (cap < MAXIMUM_CAPACITY && ft < (float) MAXIMUM_CAPACITY ?
>> (int) ft : Integer.MAX_VALUE);
>> }
>>
>> ... and uses it as the max. number of elements that a HashMap can hold before
>> it is re-hashed. So I did the following test (comparing the effectiveness of
>> above formula with alternative (expectedSize*4+2)/3 formula):
>>
>>
>> public class HMTest {
>> static final int MAXIMUM_CAPACITY = 1 << 30;
>>
>> static int tableSizeFor(int cap) {
>> int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
>> return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
>> }
>>
>> static int threshold(int cap) {
>> float ft = (float) cap * 0.75f;
>> return (cap < MAXIMUM_CAPACITY && ft < (float) MAXIMUM_CAPACITY ?
>> (int) ft : Integer.MAX_VALUE);
>> }
>>
>> public static void main(String[] args) {
>> for (int expectedSize = 0; expectedSize < (Integer.MAX_VALUE - 2) / 4;
>> expectedSize++) {
>> int cap1 = (int) ((float) expectedSize / 0.75f + 1.0f);
>> int cap2 = (expectedSize * 4 + 2) / 3;
>> int ts1 = tableSizeFor(cap1);
>> int ts2 = tableSizeFor(cap2);
>> int th1 = threshold(ts1);
>> int th2 = threshold(ts2);
>>
>> if (th1 < expectedSize || th2 < expectedSize) {
>> System.out.printf("%d: (%d, %d, %d)%s (%d, %d, %d)%s\n",
>> expectedSize,
>> cap1, ts1, th1, (th1 < expectedSize) ? "!" : " ",
>> cap2, ts2, th2, (th2 < expectedSize) ? "!" : " "
>> );
>> }
>> }
>> }
>> }
>>
>>
>> And what this prints is the following:
>>
>>
>> 25165825: (33554432, 33554432, 25165824)! (33554434, 67108864, 50331648)
>> 50331649: (67108864, 67108864, 50331648)! (67108866, 134217728, 100663296)
>> 50331650: (67108864, 67108864, 50331648)! (67108867, 134217728, 100663296)
>> 100663297: (134217728, 134217728, 100663296)! (134217730, 268435456, 201326592)
>> 100663298: (134217728, 134217728, 100663296)! (134217731, 268435456, 201326592)
>> 100663299: (134217728, 134217728, 100663296)! (134217732, 268435456, 201326592)
>> 100663300: (134217728, 134217728, 100663296)! (134217734, 268435456, 201326592)
>> 201326593: (268435456, 268435456, 201326592)! (268435458, 536870912, 402653184)
>> 201326594: (268435456, 268435456, 201326592)! (268435459, 536870912, 402653184)
>> 201326595: (268435456, 268435456, 201326592)! (268435460, 536870912, 402653184)
>> 201326596: (268435456, 268435456, 201326592)! (268435462, 536870912, 402653184)
>> 201326597: (268435456, 268435456, 201326592)! (268435463, 536870912, 402653184)
>> 201326598: (268435456, 268435456, 201326592)! (268435464, 536870912, 402653184)
>> 201326599: (268435456, 268435456, 201326592)! (268435466, 536870912, 402653184)
>> 201326600: (268435456, 268435456, 201326592)! (268435467, 536870912, 402653184)
>> 402653185: (536870912, 536870912, 402653184)! (536870914, 1073741824, 2147483647)
>> 402653186: (536870912, 536870912, 402653184)! (536870915, 1073741824, 2147483647)
>> 402653187: (536870912, 536870912, 402653184)! (536870916, 1073741824, 2147483647)
>> 402653188: (536870912, 536870912, 402653184)! (536870918, 1073741824, 2147483647)
>> 402653189: (536870912, 536870912, 402653184)! (536870919, 1073741824, 2147483647)
>> 402653190: (536870912, 536870912, 402653184)! (536870920, 1073741824, 2147483647)
>> 402653191: (536870912, 536870912, 402653184)! (536870922, 1073741824, 2147483647)
>> 402653192: (536870912, 536870912, 402653184)! (536870923, 1073741824, 2147483647)
>> 402653193: (536870912, 536870912, 402653184)! (536870924, 1073741824, 2147483647)
>> 402653194: (536870912, 536870912, 402653184)! (536870926, 1073741824, 2147483647)
>> 402653195: (536870912, 536870912, 402653184)! (536870927, 1073741824, 2147483647)
>> 402653196: (536870912, 536870912, 402653184)! (536870928, 1073741824, 2147483647)
>> 402653197: (536870912, 536870912, 402653184)! (536870930, 1073741824, 2147483647)
>> 402653198: (536870912, 536870912, 402653184)! (536870931, 1073741824, 2147483647)
>> 402653199: (536870912, 536870912, 402653184)! (536870932, 1073741824, 2147483647)
>> 402653200: (536870912, 536870912, 402653184)! (536870934, 1073741824, 2147483647)
>>
>>
>> So as you see, for expectedSize < (Integer.MAX_VALUE - 2) / 4 (where the
>> alternative formula does not experience overflow and is enough for Naoto's
>> case) all miscalculations are due to the JDK/Guava formula which in those
>> cases calculates a value that is less than alternative formula's value and too
>> small to adequately pre-size the HashMap table.
>>
>>
>> Voila, we have some bugs to fix or I am doing something wrong here.
>>
>>
>> Regards, Peter
>>
>>
>>>
>>> On Tue, May 5, 2020 at 12:12 PM Peter Levart <peter.levart at gmail.com
>>> <mailto:peter.levart at gmail.com>> wrote:
>>>
>>> Hi Martin,
>>>
>>> On 5/5/20 8:26 PM, Martin Buchholz wrote:
>>>> See related:
>>>>
>>>> https://guava.dev/releases/23.0/api/docs/com/google/common/collect/Maps.html#newHashMapWithExpectedSize-int-
>>>>
>>>
>>>
>>> This is basically the same calculation (or at least gives same
>>> result) as Naoto did (without the max part):
>>>
>>> Naoto: (int)(expectedSize / 0.75f) + 1
>>>
>>> Guava: (int) ((float) expectedSize / 0.75F + 1.0F)
>>>
>>> but in case expectedSize is a multiple of 3, it gives the result
>>> which is 1 more than needed. If what is needed is also a power of
>>> 2, then twice the needed space is allocated in the HashMap
>>> backing table.
>>>
>>>
>>> Regards, Peter
>>>
>>>
>>>>
>>>> On Tue, May 5, 2020 at 11:03 AM <naoto.sato at oracle.com
>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>
>>>> And here is the fix. Please review.
>>>>
>>>> http://cr.openjdk.java.net/~naoto/8244459/webrev.00/
>>>>
>>>> Naoto
>>>>
>>>> On 5/5/20 10:25 AM, naoto.sato at oracle.com
>>>> <mailto:naoto.sato at oracle.com> wrote:
>>>> > Hi Peter,
>>>> >
>>>> > You are correct. Thanks. I'll remove that initial value of 16.
>>>> >
>>>> > Naoto
>>>> >
>>>> > On 5/5/20 9:37 AM, Peter Levart wrote:
>>>> >> Hi Naoto,
>>>> >>
>>>> >> On 4/30/20 12:18 AM, naoto.sato at oracle.com
>>>> <mailto:naoto.sato at oracle.com> wrote:
>>>> >>> Hello,
>>>> >>>
>>>> >>> Please review this small fix to the following issue:
>>>> >>>
>>>> >>> https://bugs.openjdk.java.net/browse/JDK-8244152
>>>> >>>
>>>> >>> The proposed changeset is located at:
>>>> >>>
>>>> >>> https://cr.openjdk.java.net/~naoto/8244152/webrev.00/
>>>> >>>
>>>> >>> The hash map used there didn't have initial capacity,
>>>> even though the
>>>> >>> exact numbers are known.
>>>> >>
>>>> >>
>>>> >> Well, it has to be calculated 1st (countTokens), but I
>>>> guess this pays
>>>> >> off when HashSet (the backing HashMap) does not have to
>>>> be rehashed then.
>>>> >>
>>>> >> The expression you use:
>>>> >>
>>>> >> Math.max((int)(tokens.countTokens() / 0.75f) + 1, 16)
>>>> >>
>>>> >> ...has a minimum value of 16. Why is that? 16 is just
>>>> HashMap's
>>>> >> default initialCapacity if not specified explicitly. But
>>>> if you only
>>>> >> want to store say 1 entry in the map, you can specify 2 as
>>>> >> initialCapacity and HashMap will happily work for such
>>>> case without
>>>> >> resizing.
>>>> >>
>>>> >>
>>>> >> So you could just use:
>>>> >>
>>>> >> (int)(tokens.countTokens() / 0.75f) + 1
>>>> >>
>>>> >> And even this expression is sometimes overshooting the
>>>> minimal
>>>> >> required value by 1 (when # of tokens is "exact" multiple
>>>> of 0.75f,
>>>> >> say 6). I think the following could be used to optimally
>>>> pre-size the
>>>> >> HashMap with default load factor 0.75:
>>>> >>
>>>> >> (tokens.countTokens() * 4 + 2) / 3
>>>> >>
>>>> >>
>>>> >> Regards, Peter
>>>> >>
>>>> >>>
>>>> >>> Naoto
>>>>
More information about the core-libs-dev
mailing list