RFR(S): 8087143: Reduce Symbol::_identity_hash to 2 bytes

Ioi Lam ioi.lam at oracle.com
Sat Jun 20 01:39:21 UTC 2015


Yumin,

The new version of Symbol::size over-calculates the required size by 1 
byte. That's why your average Symbol size is only reduced from 44.198 to 
43.317 bytes (a reduction of 0.88 bytes).

The code:

    jbyte _body[1];
    static int size(int length) {
        size_t sz = heap_word_size(sizeof(Symbol) + (length > 1 ? length
    - 1 : 0));
         return align_object_size(sz);
    }

should be changed to

    jbyte _body[2];
    static int size(int length) {
        size_t sz = heap_word_size(sizeof(Symbol) + (length > 2 ? length
    - 2 : 0));
        return align_object_size(sz);
    }

When you have this

    class Symbol : public MetaspaceObj {
        ATOMIC_SHORT_PAIR(
          volatile short _refcount,  // needs atomic operation
          unsigned short _length     // number of UTF8 characters in the
    symbol (does not need atomic op)
        );
        short _identity_hash;
        jbyte _body[1];
    }

Even though you have declared only 7 bytes of fields, sizeof(Symbol) is 
8 bytes, not 7. That's because the C++ compiler implicitly adds 1 byte 
of padding. So if your string is "ab", Symbol::size() would use

      sizeof(Symbol) + (length > 1 ? length - 1 : 0)  ==>  8 + 2 - 1 ==> 
9 bytes

even though 8 bytes are enough.

Please see more comments below:

On 6/19/15 5:20 PM, Yumin Qi wrote:
> Hi, Coleen and Ioi
>
>   The revised webrev:
> http://cr.openjdk.java.net/~minqi/8087143/webrev02/
>
>   In this version, I have tested with Specjbb2005:
>   1) change _body to one byte, contribute _body[0] to the calculation 
> of identity_hash. No rehash request found during the test. Like the 
> Warnings mentioned by Coleen in her email;
>   2)  Following data is the output, no. of literals is same as no. of 
> entries which refers to no. of symbol.
>
> before:
> SymbolTable statistics:
> Number of buckets       :     20011 =    320176 bytes, avg  16.000
> Number of entries       :      2016 =     64512 bytes, avg  32.000
> Number of literals      :      2016 =     89104 bytes, avg  44.198
> Total footprint         :           =    473792 bytes
> Average bucket size     :     0.101
> Variance of bucket size :     0.100
> Std. dev. of bucket size:     0.316
> Maximum bucket size     :         3
>
> after:
> SymbolTable statistics:
> Number of buckets       :     20011 =    320176 bytes, avg  16.000
> Number of entries       :      2021 =     64672 bytes, avg  32.000
> Number of literals      :      2021 =     87544 bytes, avg  43.317
> Total footprint         :           =    472392 bytes
> Average bucket size     :     0.101
> Variance of bucket size :     0.100
> Std. dev. of bucket size:     0.316
> Maximum bucket size     :         3
>
>   About no. of literals, the calculation of the size is:
>
> template <class T, MEMFLAGS F> int RehashableHashtable<T, 
> F>::literal_size(Symbol *symbol) {
>   return symbol->size() * HeapWordSize;
> }
>
>   And the size is just the _length of the symbol which I think is the 
> length of the characters, not including  Symbol itself. Cannot use 
> this data to display how many bytes we saved.

This is not true.  If it were true, you wouldn't have seen any changes 
in the "average literal" size printed above.

Symbol::size() is defined to be this:

   int size()                { return size(utf8_length()); }

which calls the static Symbol::size(int) function, which includes the 
size of the header.

>   3) correct a inaccurate calculation of size for Symbol in old version.
>       size_t sz = heap_word_size(sizeof(SymbolBase) + (length > 0 ? 
> length : 0));
>       It will allocate one extra byte when length > 0 since we already 
> reserve one byte for _body.
Your description is incorrect. In the old version, SymbolBase did not 
include _body. Please see comments in JDK-8009575.

So why did we need SymbolBase for 8009575 (the previous version) but 
don't need it for 8087143 (the current RFE)?

After the fix for 8009575, we needed 8 bytes of header for Symbol, so if 
Symbol was declared as this:

    class Symbol : public MetaspaceObj {
        ATOMIC_SHORT_PAIR(
          volatile short _refcount,  // needs atomic operation
          unsigned short _length     // number of UTF8 characters in the
    symbol (does not need atomic op)
        );
        int _identity_hash;
        jbyte _body[1];
    }

It would occupy 9 bytes. Most compilers would automatically pad it up to 
12 bytes.

I could have written Symbol as the following to consider the 
pad-to-12-bytes case:

    class Symbol : public MetaspaceObj {
        ATOMIC_SHORT_PAIR(
          volatile short _refcount,  // needs atomic operation
          unsigned short _length     // number of UTF8 characters in the
    symbol (does not need atomic op)
        );
        int _identity_hash;
        jbyte _body[4];    // align up to 12 bytes

        static int size(int length) {
          size_t sz = heap_word_size(sizeof(Symbol) + (length > 4 ?
    length - 4 : 0));
          return align_object_size(sz);
        }
    }

However, the amount of structure padding by the C compiler is not 
well-defined. Although most of our compiler would pad to 12 bytes, some 
compiler may decide to pad to 16 byte. While this is highly unlikely, I 
didn't want to take the chance. Explicitly splitting out a SymbolBase 
class, which is 8 bytes, would make our size calculation always correct.

Anyway, for the current fix, we reduced the Symbol header to only 6 
bytes. Adding a body[2] will make sizeof(Symbol) to 8 bytes. Virtually 
no compiler would add any padding to a 8-byte structure, so using 
sizeof(Symbol) is good enough.

Please trace the Symbol::size(int) function inside gdb to verify that it 
performs the calculation correctly.

> Now we save 3 bytes for every symbol (for symbol whose length is 
> greater than 1).
>   4) remove java.cpp change, which is not needed, as Coleen pointed 
> out, we can use -XX:+PrintStringTableStatistics to get this data, 
> which also is a product flag.
>   5) for .jsa files:
>        specjbb2005-org.jsa:   4026368
>        specjbb2005.jsa:          4018176
>        Saved about 8K.
>
Here is the output extracted from "java  -Xshare:dump 
-XX:+PrintSharedSpaces" on 64-bit Linux.

Before any changes:

Symbol              :    50913    1925736  27.4 | 0          0   0.0 
|    50913    1925736  10.5

After your webrev02/

Symbol              :    50913    1889520  27.0 | 0          0   0.0 
|    50913    1889520  10.3

After changing _body[1] => _body[2]

Symbol              :    50913    1856248  26.6 | 0          0   0.0 
|    50913    1856248  10.1

The average Symbol size changed from

        37.82 bytes to 37.11 (-0.71) to 36.45 (-1.37)

As I mentioned in the bug report, the reduction (1.37) is smaller than 
the theoretical value (2 bytes) because we have many small symbols, and 
metaspace.cpp enforces a 24-byte minimum allocation size for each 
MetaspaceObj on 64-bit platforms.

Thanks
- Ioi


> Thanks
> Yumin
>
>
> On 6/19/2015 11:21 AM, Coleen Phillimore wrote:
>>
>> Hi,  This is okay, one comment below.
>>
>> On 6/19/15 12:21 PM, Yumin Qi wrote:
>>> Coleen,
>>>
>>>   Thanks.
>>>
>>> On 6/19/2015 5:14 AM, Coleen Phillimore wrote:
>>>>
>>>> Yumin,
>>>>
>>>> If the identity hash isn't unique enough you get a system 
>>>> dictionary warning like this, which doesn't stop the JVM. Could you 
>>>> check your test logs to see if this warning is output in any of 
>>>> them (jtreg keeps logs around, just checking the jtreg test results 
>>>> should be sufficient, I think).
>>>>
>>>> Log : HotSpot(TM) Server VM warning: Performance bug: 
>>>> SystemDictionary lookup_count=6918 lookup_length=3907 
>>>> average=0.564759 load=0.266601
>>>>
>>>> I think with including _body[0] in the function you may make the 
>>>> hash random enough.   So this was one of my concerns with this change.
>>>>
>>> I will check that.
>>>> In 
>>>> http://cr.openjdk.java.net/~minqi/8087143/webrev01/src/share/vm/runtime/java.cpp.udiff.html
>>>>
>>>> Are you sure you want all of the SymbolTable dump?  Maybe it should 
>>>> be under PrintSymbolTableSizeStatistics instead?
>>>>
>>> I will check and make change based on the output of statistics, 
>>> maybe it is too much to print out all SymbolTable.
>>
>> It's way more than anyone needs normally.   Maybe only print it if 
>> one adds +Verbose to PrintStringTableSizeStatistics and when we do 
>> logging, it'll be -Xlog:symboltable=trace or a lower logging level 
>> than info.
>>
>> Thanks,
>> Coleen
>>>> In 
>>>> http://cr.openjdk.java.net/~minqi/8087143/webrev01/src/share/vm/oops/symbol.hpp.udiff.html
>>>>
>>>> -// We separate the fields in SymbolBase from Symbol::_body so that
>>>> -// Symbol::size(int) can correctly calculate the space needed.
>>>>
>>>> I'm not sure what that comment meant.  With SymbolBase gone (which 
>>>> I think looks nicer), can you still correctly calculate the space 
>>>> needed?
>>>>
>>> Yes, now the _body changed from 1 byte array to 2 bytes array.
>>> The original calculation is not correct I think:  if symbol length > 
>>> 0, we allocate *_ONE_* extra byte for it:
>>> -    size_t sz = heap_word_size(sizeof(SymbolBase) + (length > 0 ? length : 0));
>>> +    size_t sz = heap_word_size(sizeof(Symbol) + (length > 2 ? length - 2 : 0));
>>>
>>>> Other than these things, this change looks good.  Nice to find some 
>>>> space savings with something so frequent as symbols.
>>>>
>>>> Thanks,
>>>> Coleen
>>>>
>>>> On 6/18/15 8:40 PM, Yumin Qi wrote:
>>>>> Ioi,
>>>>>
>>>>>   Thanks for review, I will add that (for print out purpose, not 
>>>>> in the final change?)
>>>>>
>>>>> Yumin
>>>>>
>>>>> On 6/18/2015 3:34 PM, ioi.lam at oracle.com wrote:
>>>>>> Hi Yumin,
>>>>>>
>>>>>> I wonder why the average symbol size didn't change. They are both 
>>>>>> about 34.25 bytes long.
>>>>>>
>>>>>> When I tested the change, -Xshare:dump -XX:+PrintSharedSpaces 
>>>>>> show about 1.3 bytes reduction per symbol.
>>>>>>
>>>>>> Also, could you add some code to print out the statistics of the 
>>>>>> system dictionary to show if there's any change in the hashtable 
>>>>>> balance?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> On Jun 18, 2015, at 1:46 PM, Yumin Qi <yumin.qi at oracle.com> wrote:
>>>>>>>
>>>>>>> Please review the small change for
>>>>>>>
>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8087143
>>>>>>> webrev: http://cr.openjdk.java.net/~minqi/8087143/webrev01/
>>>>>>>
>>>>>>> Summary: _identity_hash is an integer in Symbol (SymbolBase), it 
>>>>>>> is used to compute hash bucket index by modulus division of 
>>>>>>> table size. Currently in hotspot, no table size is more than 
>>>>>>> 65535 so we can use short instead.  For case with table size 
>>>>>>> over 65535 we can use the first two bytes of symbol data to be 
>>>>>>> as the upper 16 bits for the calculation but rare cases.
>>>>>>>
>>>>>>> This is a minor change, also eliminates SymbolBase which seems 
>>>>>>> redundant.  Also add print out SymbolTable statistics under 
>>>>>>> debug flag PrintSystemDictionaryAtExit.
>>>>>>>
>>>>>>> Tests: JPRT, Specjbb2005.
>>>>>>>
>>>>>>> The SymbolTable statistics for specjbb2005:
>>>>>>>
>>>>>>> Before Change:
>>>>>>> SymbolTable statistics:
>>>>>>> Number of buckets       :     20011 =    320176 bytes, avg 16.000
>>>>>>> Number of entries       :     49914 =   1597248 bytes, avg 32.000
>>>>>>> Number of literals      :     49914 =   1709688 bytes, avg 34.253
>>>>>>> Total footprint         :           =   3627112 bytes
>>>>>>> Average bucket size     :     2.494
>>>>>>> Variance of bucket size :     2.483
>>>>>>> Std. dev. of bucket size:     1.576
>>>>>>> Maximum bucket size     :        10
>>>>>>>
>>>>>>> After Change:
>>>>>>>
>>>>>>> SymbolTable statistics:
>>>>>>> Number of buckets       :     20011 =    320176 bytes, avg 16.000
>>>>>>> Number of entries       :     49911 =   1597152 bytes, avg 32.000
>>>>>>> Number of literals      :     49911 =   1709544 bytes, avg 34.252
>>>>>>> Total footprint         :           =   3626872 bytes
>>>>>>> Average bucket size     :     2.494
>>>>>>> Variance of bucket size :     2.483
>>>>>>> Std. dev. of bucket size:     1.576
>>>>>>> Maximum bucket size     :        10
>>>>>>>
>>>>>>> There is no big change for the hashtable balance.
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Yumin
>>>>>
>>>>
>>>
>>
>



More information about the hotspot-runtime-dev mailing list