compressed oops and 64-bit header words

Clemens Eisserer linuxhippy at gmail.com
Fri May 9 03:54:59 PDT 2008


> Thanks Vladimir - I didn't realize that the extra 32 bits were being
> used for a field. This is work that we're considering doing - mostly,
> I wanted to hear feedback, and find out whether you were already doing
> this.
>
> So the real question from my standpoint is what we're missing when we
> think about this, and whether it's viable at all.

Wouldn't it be great if Google would dedicate resources to HotSpot's
development?
Thank God HotSpot does not have a public interface; who knows whether
it would lead to another Android ;)

Regards, Clemens

>
> Dan
>
> On Thu, May 8, 2008 at 8:12 AM, Vladimir Kozlov <Vladimir.Kozlov at sun.com> wrote:
>> Dan,
>>
>> It is not 2 64-bit words, it is one and a half :)
>> since the klass pointer is 32 bits and we use the other 32 bits for a field.
>> So the overhead is only 4 bytes. Also don't forget that
>> all objects are aligned to 8 bytes in the heap even
>> in the 32-bit VM, so the average overhead will be less.
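A minimal sketch of the arithmetic behind that point, assuming a 64-bit mark
word, a 4-byte compressed klass pointer, and 8-byte object alignment; the
header constants and example field sizes are illustrative, not HotSpot's
actual definitions:

    #include <cstdio>
    #include <cstddef>

    // Round a size up to the heap's 8-byte object alignment.
    static size_t align_up(size_t size) { return (size + 7) & ~size_t(7); }

    int main() {
      const size_t hdr_32bit_vm   = 4 + 4;  // 32-bit mark word + 32-bit klass
      const size_t hdr_compressed = 8 + 4;  // 64-bit mark word + 32-bit klass

      // Compare total aligned object sizes for a few example field payloads.
      const size_t payloads[] = {4, 12, 20, 24};
      for (size_t fields : payloads) {
        size_t small_obj = align_up(hdr_32bit_vm + fields);
        size_t big_obj   = align_up(hdr_compressed + fields);
        printf("fields=%2zu  32-bit VM: %2zu  64-bit compressed: %2zu  "
               "overhead: %zu bytes\n",
               fields, small_obj, big_obj, big_obj - small_obj);
      }
      return 0;
    }

The raw header overhead over a 32-bit VM is 4 bytes, but the 8-byte alignment
often absorbs it, which is why the average cost per object comes out below
4 bytes.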
>>
>> I want to be clear that it is not that we are totally against
>> your suggestion; it is the resources needed to implement it
>> that we don't have currently.
>> On the other hand, the VM is open source now, so you or your colleagues
>> can do it and help us all.
>>
>> Thanks,
>> Vladimir
>>
>> Dan Grove wrote:
>>>
>>> Thanks Vladimir. I'm still worried about the memory bloat from having
>>> (effectively) 2 64-bit words in the object header, rather than 2 32-bit
>>> words. If we consider an average (non-array) object size around 30-40 bytes,
>>> this is a significant overhead. It seems that if users were willing to
>>> declare that they were running inside a 4GB virtual address space (and in my
>>> case, users would be willing to do so in order to avoid memory bloat), we
>>> should be able to do this.
>>>
>>> On Linux, I believe that if the process were running under a "ulimit -v
>>> XXXX" shell, we could guarantee that all addresses would fit in 32 bits,
>>> even for a 64-bit VM. Do you agree that this would make sense?
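A rough sketch of how such a guarantee could be checked at VM startup,
assuming the POSIX getrlimit interface that "ulimit -v" feeds into; the
function name and the way the result would be used are illustrative only:

    #include <sys/resource.h>
    #include <cstdint>
    #include <cstdio>

    // True if the virtual address space limit (what "ulimit -v" sets) is
    // small enough that, following the reasoning above, addresses could be
    // represented in 32 bits.
    static bool address_space_fits_in_32_bits() {
      struct rlimit rl;
      if (getrlimit(RLIMIT_AS, &rl) != 0) {
        return false;  // no limit information available
      }
      return rl.rlim_cur != RLIM_INFINITY &&
             rl.rlim_cur <= (UINT64_C(1) << 32);
    }

    int main() {
      printf("32-bit-safe address space: %s\n",
             address_space_fits_in_32_bits() ? "yes" : "no");
      return 0;
    }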
>>>
>>> Dan
>>>
>>> 2008/5/5 Vladimir Kozlov <Vladimir.Kozlov at sun.com>:
>>>  > Dan,
>>>  >
>>>  > Thank you for the paper.
>>>  > I think the benefit they see from the compressed header comes
>>>  > mostly from a compressed vtable pointer, which in our VM corresponds
>>>  > to the klass pointer, which is also compressed.
>>>  > So in this sense we also have a compressed header.
>>>  >
>>>  > I cannot say what performance benefit we get now from
>>>  > compressed oops, since the generated code for klass pointer
>>>  > loads/stores is currently not what we would like it to be
>>>  > (and we are working to improve it).
>>>  >
>>>  > I doubt that a compressed mark word would make a big difference.
>>>  > But I may be wrong.
>>>  >
>>>  >
>>>  >
>>>  > Thanks,
>>>  > Vladimir
>>>  >
>>>  > Dan Grove wrote:
>>>  >
>>>  > > Hi Coleen-
>>>  > >
>>>  > > I'm not worried about the shift instruction - I agree that it's
>>>  > > unlikely to matter. What I am worried about is the standard
>>>  > > object header having 2 64-bit words (well, 1 64-bit word, 1 32-bit
>>>  > > word, and 32 bits of padding).
>>>  > >
>>>  > > Specifically, I'm worried about the increase in memory footprint and
>>>  > > its impact on performance. I was pointed to
>>>  > > http://ieeexplore.ieee.org/iel5/9012/28612/01281667.pdf?arnumber=1281667
>>>  > > which (conveniently) breaks out the performance impact of
>>>  > > compressing the header versus compressing references versus both.
>>>  > >
>>>  > > So what I would really be interested in is a way to have both the
>>>  > > pointers/words in the header and the oops be 32 bits. I think this
>>>  > > would be a good win, coupled with the extra registers available under
>>>  > > the 64-bit ABI.
>>>  > >
>>>  > > Dan
>>>  > >
>>>  > > On Mon, May 5, 2008 at 3:47 PM, Coleen Phillimore
>>>  > > <Coleen.Phillimore at sun.com> wrote:
>>>  > >
>>>  > > > Hi,
>>>  > > > It made sense when I first read it, but in order to have 32-bit
>>>  > > > pointers in #3, I can't imagine not having to encode and decode them
>>>  > > > by some heap base in order to dereference them, so the only
>>>  > > > difference between #2 and #3 is the shift instruction needed to reach
>>>  > > > 32GB. We didn't believe that the shift causes much of a performance
>>>  > > > penalty, so we didn't implement it this way. We would like to measure
>>>  > > > this at some point though, and if it is faster we could add this mode
>>>  > > > fairly easily.
>>>  > > >
>>>  > > > thanks!
>>>  > > > Coleen
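A minimal sketch of the distinction being weighed here, assuming the
base-plus-shift encoding described later in the thread; the function names
are illustrative, not HotSpot's:

    #include <cstdint>

    // Mode #2: compressed oops with a heap base and a 3-bit shift.
    // Decoding costs a shift and an add on every dereference.
    static inline void* decode_shifted(uint32_t narrow, uintptr_t heap_base) {
      return reinterpret_cast<void*>(heap_base + (uintptr_t(narrow) << 3));
    }

    // Mode #3 as proposed: the whole heap fits below 4GB, so the stored
    // 32-bit value is (or is close to) the address itself and no shift is
    // needed.
    static inline void* decode_raw(uint32_t narrow) {
      return reinterpret_cast<void*>(uintptr_t(narrow));
    }

Coleen's point is that if a non-zero heap base is still required in mode #3,
the add survives in both forms and only the shift differs.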
>>>  > > >
>>>  > > >
>>>  > > >
>>>  > > > Dan Grove wrote:
>>>  > > >
>>>  > > >
>>>  > > > > Thanks Coleen and Vladimir-
>>>  > > > >
>>>  > > > > What I'm wondering is whether there could be a third mode:
>>>  > > > >
>>>  > > > > 1. > 32GB - uses uncompressed pointers
>>>  > > > > 2. (something less than 4GB) < Xmx < 32GB - uses compressed pointers
>>>  > > > >    (along with 64-bit mark word), 64-bit ABI
>>>  > > > > 3. whole app fits in 4GB - uses 32-bit pointers in heap, but 64-bit ABI.
>>>  > > > >
>>>  > > > > The idea here is that I'd prefer to pay no penalty over 32-bit when my
>>>  > > > > app runs in 64-bit mode and the app fits in 4GB of memory (my reason
>>>  > > > > for this is that I want to support our JNI libraries only in 64-bit
>>>  > > > > mode, and deprecate the 32-bit JNI libraries).
>>>  > > > >
>>>  > > > > Does this make any sense to you?
>>>  > > > >
>>>  > > > > Dan
>>>  > > > >
>>>  > > > > On Mon, May 5, 2008 at 12:20 PM, Coleen Phillimore - Sun Microsystems
>>>  > > > > <Coleen.Phillimore at sun.com> wrote:
>>>  > > > >
>>>  > > > >
>>>  > > > >
>>>  > > > > > Actually, we are using the gap for a field and array length in the code
>>>  > > > > > now, but the code Vladimir showed me makes the allocation code a lot
>>>  > > > > > cleaner for the instance field case.
>>>  > > > > >
>>>  > > > > > In the array case in 64 bits, compressing the _klass pointer into 32
>>>  > > > > > bits allows us to move the _length field into the other 32 bits, which
>>>  > > > > > because of alignment saves 64 bits. There was a 32-bit alignment gap
>>>  > > > > > after the _length field if it was not compressed with the klass pointer.
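A rough sketch of the two array header layouts described above, assuming
8-byte object alignment; the struct and field names are illustrative, not
HotSpot's actual types:

    #include <cstdint>

    // Uncompressed 64-bit layout: _length leaves a 4-byte alignment gap
    // before the array elements start.
    struct WideArrayHeader {
      uint64_t mark;     // 8 bytes
      uint64_t klass;    // 8 bytes
      uint32_t length;   // 4 bytes
      uint32_t padding;  // 4-byte alignment gap -> 24-byte header
    };

    // Compressed-klass layout: _length moves into the 32 bits next to the
    // narrow klass pointer, shrinking the header by a full 64 bits.
    struct NarrowArrayHeader {
      uint64_t mark;     // 8 bytes
      uint32_t klass;    // 4 bytes (compressed)
      uint32_t length;   // 4 bytes -> 16-byte header, no gap
    };

    static_assert(sizeof(WideArrayHeader) == 24, "layout check");
    static_assert(sizeof(NarrowArrayHeader) == 16, "layout check");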
>>>  > > > > >
>>>  > > > > > The mark word can also contain a forwarding pointer used during GC,
>>>  > > > > > so it can't be 32 bits.
>>>  > > > > >
>>>  > > > > > The compression that we use allows for 32GB because we shift into the
>>>  > > > > > least significant bits - the algorithm is (ptr-heap_base)>>3.
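A minimal sketch of that encoding and the matching decode, assuming
8-byte-aligned object addresses (so the low 3 bits are always zero, which is
what lets a 32-bit value cover 4G x 8 = 32GB of heap); names are illustrative:

    #include <cstdint>
    #include <cassert>

    // Encode: (ptr - heap_base) >> 3, as described above.  Requires the
    // object to be 8-byte aligned and within 32GB of heap_base.
    static inline uint32_t encode_oop(uintptr_t ptr, uintptr_t heap_base) {
      assert(ptr % 8 == 0 && ptr >= heap_base);
      uintptr_t offset = (ptr - heap_base) >> 3;
      assert(offset <= UINT32_MAX);  // i.e. ptr is within 32GB of the base
      return static_cast<uint32_t>(offset);
    }

    // Decode: the "add and shift" performed on every dereference.
    static inline uintptr_t decode_oop(uint32_t narrow, uintptr_t heap_base) {
      return heap_base + (uintptr_t(narrow) << 3);
    }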
>>>  > > > > >
>>>  > > > > > Coleen
>>>  > > > > >
>>>  > > > > >
>>>  > > > > >
>>>  > > > > > Vladimir Kozlov wrote:
>>>  > > > > >
>>>  > > > > >
>>>  > > > > >
>>>  > > > > >
>>>  > > > > > > Dan,
>>>  > > > > > >
>>>  > > > > > > Only the mark word is 64 bits. The klass pointer is 32 bits, but
>>>  > > > > > > in the current implementation the gap after the klass is not used.
>>>  > > > > > >
>>>  > > > > > > I am working on using the gap for a field or an array's length.
>>>  > > > > > >
>>>  > > > > > > The mark word may contain a 64-bit thread pointer (for Biased Locking).
>>>  > > > > > > Thanks,
>>>  > > > > > > Vladimir
>>>  > > > > > >
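A rough sketch of the instance header layout described above, with the 32-bit
gap after the compressed klass pointer available for the first field; the
struct and field names are illustrative, not HotSpot's:

    #include <cstdint>

    // 64-bit VM with a compressed klass pointer: the mark word stays a full
    // 64 bits (it may hold a thread pointer for biased locking, or a
    // forwarding pointer during GC), while the klass pointer shrinks to
    // 32 bits, leaving a 32-bit gap.
    struct InstanceHeader {
      uint64_t mark;          // 64-bit mark word
      uint32_t narrow_klass;  // compressed klass pointer
      uint32_t gap;           // unused today; could hold the first 32-bit
                              // instance field or an array length
    };

    static_assert(sizeof(InstanceHeader) == 16, "header occupies two 64-bit words");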
>>>  > > > > > > Dan Grove wrote:
>>>  > > > > > >
>>>  > > > > > >
>>>  > > > > > >
>>>  > > > > > >
>>>  > > > > > > > Hi-
>>>  > > > > > > >
>>>  > > > > > > > I talked some with Nikolay Igotti about compressed oops in
>>>  > > > > > > > OpenJDK7. He tells me that the mark word and class pointer remain 64
>>>  > > > > > > > bits when compressed oops are being used. It seems that this leaves a
>>>  > > > > > > > fair amount of the bloat in place when moving from 32->64 bits.
>>>  > > > > > > >
>>>  > > > > > > > I'm interested in deprecating 32-bit VMs at my employer at some
>>>  > > > > > > > point. Doing this is going to require that 64-bit VMs have as little
>>>  > > > > > > > bloat as possible. Has there been any consideration of making the
>>>  > > > > > > > mark word and class pointer 32 bits in cases where the VM fits within
>>>  > > > > > > > 4GB? It seems like this would be a major win. A second benefit here is
>>>  > > > > > > > that the "add and shift" currently required on dereference of compressed
>>>  > > > > > > > oops could be eliminated in cases where the VM fits inside 4GB.
>>>  > > > > > > >
>>>  > > > > > > > Dan



More information about the hotspot-runtime-dev mailing list