RFR (S) JDK-8009575 - Reduce Symbol::_refcount from 4 bytes to 2 bytes

Tue Mar 26 22:05:51 PDT 2013

On 03/26/2013 09:24 PM, David Holmes wrote:
> Hi Ioi,
>
> First your frames view in the webrev continues to be broken (missing 
> file) :(
>
I am not sure what's wrong. I'll try to update to a new webrev version. 
I always use "sdiffs" so I wasn't too bothered by it :-)

> On 27/03/2013 7:03 AM, Ioi Lam wrote:
>> Please review:
>>
>> http://cr.openjdk.java.net/~iklam/8009575/symbol_refcount_001/
>> <http://cr.openjdk.java.net/%7Eiklam/8009575/symbol_refcount_001/>
>
> Looking at generateJvmOffsets.cpp:
>
> I don't understand this part of the code so can't comment.
>
> Looking at vmStructs.cpp:
>
> I don't see any SA changes so I assume these values never actually get 
> accessed ?? How do you verify this part of the change?
>
The SA .java code does not expose the _refcount field. So the only 
change visible to SA is the offset of _length has changed (but size of 
_length remains 2 bytes). I will try to do some SA sanity tests, but it 
will probably work (famous last words).

>> Bug: Reduce Symbol::_refcount from 4 bytes to 2 bytes
>>
>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8009575
>> https://jbs.oracle.com/bugs/browse/JDK-8009575
>>
>>
>> Summary of fix:
>>
>> (1) Symbol::_refcount has been changed to a short.
>>
>>      Note that _refcount needs to be incremented atomically using
>>      Atom::add, which requires a 32-bit word in memory. So I glue the
>>      _refcount and _length fields together using union and ENDIAN
>>      tricks, which is kind of ugly but faster.
>>
>>      The prettier alternative would be to use shift and masking to
>>      access the _refcount and _length components, but performance
>>      would be slightly slower.
>>
>>      If anyone has a more elegant solution, please let me know.
>
> Not sure why shift and mask would be slower as the compiler will have 
> to generate shifts and masks anyway ??

Many (all?) architectures have 2-byte loads, so loading a two byte 
u.s._length would be a single LOAD instruction. In contrast, loading a 
32-bit word and then masking off the upper 16 bits would take 2 ~ 3 
instructions. Some compilers (such as gcc) would be smart enough to 
generate a 2-byte LOAD instead, but I don't want to count on that.

There's also the fact that _refcount is volatile but _length is not. So 
always combining them in a 32-bit volatile may make reading of _length 
more expensive.
>
>> #------------------------------------------------------
>>        union {
>>          volatile int   _refcount_and_length;
>>          struct {
>>      #ifdef VM_LITTLE_ENDIAN
>>            unsigned short _length; // number of UTF8 characters in the
>> symbol
>>            volatile short _refcount;
>>      #else
>>            volatile short _refcount;
>>            unsigned short _length; // number of UTF8 characters in the
>> symbol
>>      #endif
>>          } s;
>>        } u;
>>
>>      void Symbol::increment_refcount() {
>>        if (u.s._refcount >= 0) {
>>          // Adding 0x10000 to _refcount_and_length is equivalent to
>> adding 0x1 to
>>          // _refcount (w.r.t overflow/underflow).
>>          (void)Atomic::add(0x10000, &u._refcount_and_length);
>>        ....
>>      }
>> #------------------------------------------------------
>
> As per side email a key enabler here is that _length is immutable once 
> set.
>
> I was going to suggest that the fields be swapped so that we just 
> add/subtract 1, but I see now that the order is necessary to ensure 
> the overflow detection works without modifying _length.
>
This is correct. Underflowing (from 0 to -1) shouldn't happen, unless 
there is a coding error. However, today such coding error would just 
leave the Symbol pinned forever (_refcount == -1). However, with the new 
code (if _refcount were placed at lower 16 bit), then underflowing would 
cause _length to be corrupt, and may cause more severe problems. So I 
decided to put refcount at the upper 16 bit just for bug-for-bug 
compatibility (if any) as today's code.

>> (2) Symbol::size(int) was wasting one byte per Symbol instance. I fixed
>> it by
>>      splitting part of the fields from Symbol into SymbolBase.
>
> I don't understand this part. What were we wasting? And given 
> alignment constraints did we in fact waste it?
>
The old code was

class Symbol : public MetaspaceObj {
   int   _refcount;
   int    _identity_hash;
   unsigned short _length;
   jbyte _body[1];

   static int size(int length) {
     size_t sz = heap_word_size(sizeof(Symbol) + (length > 0 ? length - 
1 : 0));
     return align_object_size(sz);
   }

It assumes that sizeof(Symbol) is 11 bytes (2 ints + 1 short + 1 byte), 
hence the "length - 1". I.e., if length == 2, size() => 11 + 2 - 1 = 12.

However, on most platforms, sizeof(Symbol) == 12, so we get (12 + 2 - 1) 
= 13 bytes.

Hence, while a 2-byte symbol could be stored in 12 bytes, we would 
allocate 13 bytes (rounded up to 16) for the Symbol.

On average, we would waste one byte per symbol -- we have a more-or-less 
even distribution of (length % 4).

>> Footprint result:
>>
>>      The theoretical reduction is 3 bytes per symbol -- 2 bytes
>>      from _refcount, and 1 byte from the fixed Symbol::size(int)
>>      function.
>>      In actual measurement, CDS image (about 3000 classes, with about
>>      50000 symbols) is reduced by about 140KB, or about 2.8 bytes per
>>      Symbol, or about 0.6% of the whole CDS image.
>
> I'm not sure that makes sense. I would expect you to save 0 bytes or 4 
> bytes depending on alignment (assuming the Symbol is 32-bit aligned to 
> start with).

You're correct that sometimes you save 4 bytes, sometimes none. Due to 
the even distribution of (length % 4), on average you save ~ 3 bytes per 
Symbol.

  Thanks
- Ioi

>
>
> Thanks,
> David
>
>>
>> Testing:
>>
>>      * UTE (vm.runtime.testlist, vm.quick.testlist
>> vm.parallel_class_loading.testlist)
>>
>> Thanks,
>> Ioi
>>