compressed oops and 64-bit header words
Coleen Phillimore
Coleen.Phillimore at Sun.COM
Fri May 9 06:57:44 PDT 2008
Dan,
I think what you're envisioning is a bit different from what we did with
compressed oops. What we did, for Java heap sizes of 32G or less, was to
keep the base of the Java heap in order to compress pointers within the
Java heap into 32 bits. The VM executable still occupies a 64-bit address
space, so all other pointers in the process are 64 bits wide. We only
compress when storing into the Java heap and decompress when loading out
of the Java heap, in order to use the pointers.
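To make that concrete, here is a minimal sketch of the encode/decode steps (illustrative only - the names heap_base, object_align_shift and narrowOop are assumptions, not the actual HotSpot declarations), using the (ptr - heap_base) >> 3 scheme described further down the thread:

  #include <cstdint>

  static uint64_t heap_base;                 // base address of the Java heap
  static const int object_align_shift = 3;   // objects are 8-byte aligned

  typedef uint32_t narrowOop;                // compressed oop as stored in the heap

  // Compress a 64-bit heap pointer to 32 bits when storing into the Java heap.
  inline narrowOop encode(uint64_t oop) {
      // subtract the base and shift by 3; this covers heaps up to 32G
      return (narrowOop)((oop - heap_base) >> object_align_shift);
  }

  // Decompress when loading out of the Java heap, before dereferencing.
  inline uint64_t decode(narrowOop v) {
      return heap_base + ((uint64_t)v << object_align_shift);
  }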
The mark word, which is the first word of the header, may contain
pointers to the stack in the case of locks, or to the C heap in the case of
biased locking (I believe it's the Thread* pointer). The mark word also
contains forwarding pointers used during GC. These could be
encoded and decoded to fit into 32 bits since they're heap pointers, but
I think that would make GC really, really slow. I'm working on some code
that encodes the FreeChunk size into the 64 bits of the header mark word
for the concurrent mark sweep GC. So we're using that mark word for a lot
of things. In 32 bits, we've pretty much tapped that word out for bits.
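As an illustration of the forwarding-pointer use mentioned above, here is a rough sketch (assumed names and tag value, not the actual HotSpot code) of how a copying collector stashes a full destination address in the mark word:

  #include <cstdint>

  struct oopDesc {
      volatile uintptr_t mark;   // header mark word
      // klass pointer and instance fields follow
  };

  static const uintptr_t marked_value = 0x3;  // assumed low-bit tag meaning "forwarded"

  // During a copying/compacting phase, the new location of a moved object is
  // written straight into the mark word, so it has to hold a full-width pointer
  // unless GC paid to encode/decode it on every relocation.
  inline void forward_to(oopDesc* obj, oopDesc* new_location) {
      obj->mark = (uintptr_t)new_location | marked_value;
  }

  inline oopDesc* forwardee(const oopDesc* obj) {
      return (oopDesc*)(obj->mark & ~marked_value);
  }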
If you want to use 32 bit pointers within a 64 bit ABI, you could hack
the linker to load the process only into the 32 bit address space and then
extend the pointers to 64 bits when crossing over to the 64 bit ABI.
I've worked on such a system and (I hope nobody yells at me) my opinion
was that it was a god-awful mess. google: "xtaso".
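For comparison, on Linux/x86-64 one crude way to keep a region at 32-bit-representable addresses inside a 64-bit process is MAP_32BIT (which actually restricts the mapping to the low 2GB). This is just a sketch of the idea, not something HotSpot does:

  #include <sys/mman.h>
  #include <cstddef>

  // Reserve memory guaranteed to sit at an address that fits in 32 bits,
  // even in a 64-bit process; returns nullptr on failure.
  static void* reserve_low_memory(size_t bytes) {
      void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
      return (p == MAP_FAILED) ? nullptr : p;
  }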
Thanks,
Coleen
Dan Grove wrote:
> Thanks Vladimir - I didn't realize that the extra 32 bits were being
> used for a field. This is work that we're considering doing - mostly,
> I wanted to hear feedback, and find out whether you were already doing
> this.
>
> So the real question from my standpoint is what we're missing when we
> think about this, and whether it's viable at all.
>
> Dan
>
> On Thu, May 8, 2008 at 8:12 AM, Vladimir Kozlov <Vladimir.Kozlov at sun.com> wrote:
>
>> Dan,
>>
>> It is not 2 64-bit words, it is 1 and a half :)
>> since the klass is 32 bits and we use the other 32 bits for a field.
>> So the overhead is only 4 bytes. Also don't forget that
>> all objects are aligned to 8 bytes in the heap, even
>> in the 32-bit VM. So the average overhead will be less.
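(An illustrative example of the alignment effect: an object with a single int field takes 4 + 4 + 4 = 12 bytes in a 32-bit VM, rounded up to 16 by the 8-byte alignment; with a 64-bit mark word and a compressed 32-bit klass pointer it is 8 + 4 + 4 = 16 bytes, so in that case the 64-bit VM adds no space at all.)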
>>
>> I want to be clear that it is not that we are totally against
>> your suggestion. It is the resources needed to implement it
>> that we don't currently have.
>> On the other hand, the VM is open source now, so you or your colleagues
>> can do it and help us all.
>>
>> Thanks,
>> Vladimir
>>
>> Dan Grove wrote:
>>
>>> Thanks Vladimir. I'm still worried about the memory bloat from having
>>> (effectively) 2 64-bit words in the object header, rather than 2 32-bit
>>> words. If we consider an average (non-array) object size of around 30-40 bytes,
>>> this is a significant overhead. It seems that if users were willing to
>>> declare that they were running inside a 4GB virtual address space (and in my
>>> case, users would be willing to do so in order to avoid memory bloat), we
>>> should be able to do this.
>>>
>>> On Linux, I believe that if the process were running under a "ulimit -v
>>> XXXX" shell, we could guarantee that all addresses would fit in 32 bits,
>>> even for a 64-bit VM. Do you agree that this would make sense?
>>>
>>> Dan
>>>
>>> 2008/5/5 Vladimir Kozlov <Vladimir.Kozlov at sun.com>:
>>> > Dan,
>>> >
>>> > Thank you for the paper.
>>> > I think the benefit they have with the compressed header comes
>>> > mostly from a compressed vtable pointer, which in our VM corresponds
>>> > to the klass pointer, which is also compressed.
>>> > So in this sense we also have a compressed header.
>>> >
>>> > I cannot say what performance benefit we get now from
>>> > compressed oops, since the generated code for klass pointer
>>> > loads/stores is currently not what we would like it to be
>>> > (and we are working to improve it).
>>> >
>>> > I doubt that a compressed mark word will make a big difference.
>>> > But I may be wrong.
>>> >
>>> > Thanks,
>>> > Vladimir
>>> >
>>> > Dan Grove wrote:
>>> >
>>> > > Hi Coleen-
>>> > >
>>> > > I'm not worried about the shift instruction - I agree that it's
>>> > > unlikely to matter. What I am worried about is having the standard
>>> > > object header contain 2 64-bit words (well, 1 64-bit word, 1 32-bit
>>> > > word, and 32 bits of pad).
>>> > >
>>> > > What I'm worried about is the increase in memory footprint and its
>>> > > impact on performance. I was pointed to
>>> > > http://ieeexplore.ieee.org/iel5/9012/28612/01281667.pdf?arnumber=1281667
>>> > > , which (conveniently) breaks out the performance impact of
>>> > > compressing the header versus compressing references versus both.
>>> > >
>>> > > So what I would really be interested in is a way to have both the
>>> > > pointers/words in the header and the oops be 32 bits. I think this
>>> > > would be a good win, when coupled with the extra registers when using
>>> > > the 64-bit ABI.
>>> > >
>>> > > Dan
>>> > >
>>> > > On Mon, May 5, 2008 at 3:47 PM, Coleen Phillimore
>>> > > <Coleen.Phillimore at sun.com> wrote:
>>> > >
>>> > > > Hi,
>>> > > > It made sense when I first read it, but in order to have 32 bit pointers
>>> > > > in #3 I can't imagine not having to encode and decode them by some heap
>>> > > > base in order to dereference these pointers, so the only difference
>>> > > > between #2 and #3 is the shift instruction to get to 32G. We didn't
>>> > > > believe that the shift causes much of a performance penalty, so we
>>> > > > didn't implement it this way. We would like to measure this at some
>>> > > > point though, and if it is faster we could add this mode fairly easily.
>>> > > >
>>> > > > thanks!
>>> > > > Coleen
>>> > > >
>>> > > > Dan Grove wrote:
>>> > > >
>>> > > >
>>> > > > > Thanks Coleen and Vladimir-
>>> > > > >
>>> > > > > What I'm wondering is whether there could be a third mode:
>>> > > > >
>>> > > > > 1. > 32GB - uses uncompressed pointers
>>> > > > > 2. (something less than 4GB) < Xmx < 32GB - uses compressed pointers
>>> > > > >    (along with 64-bit mark word), 64-bit ABI
>>> > > > > 3. whole app fits in 4GB - uses 32-bit pointers in heap, but 64-bit ABI.
>>> > > > >
>>> > > > > The idea here is that I'd prefer to pay no penalty over 32-bit when my
>>> > > > > app runs in 64-bit mode and the app fits in 4GB of memory (my reason
>>> > > > > for this is that I want to support our JNI libraries only in 64-bit
>>> > > > > mode, and deprecate the 32-bit JNI libraries).
>>> > > > >
>>> > > > > Does this make any sense to you?
>>> > > > >
>>> > > > > Dan
>>> > > > >
>>> > > > > On Mon, May 5, 2008 at 12:20 PM, Coleen Phillimore - Sun Microsystems
>>> > > > > <Coleen.Phillimore at sun.com> wrote:
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > > Actually, we are using the gap for a field and array length in the code
>>> > > > > > now, but the code Vladimir showed me makes the allocation code a lot
>>> > > > > > cleaner for the instance field case.
>>> > > > > >
>>> > > > > > In the array case in 64 bits, compressing the _klass pointer into 32
>>> > > > > > bits allows us to move the _length field into the other 32 bits, which
>>> > > > > > because of alignment saves 64 bits. There was a 32 bit alignment gap
>>> > > > > > after the _length field, if not compressed with the klass pointer.
>>> > > > > >
>>> > > > > > The mark word can also contain a forwarding pointer used during GC, so
>>> > > > > > can't be 32 bits.
>>> > > > > >
>>> > > > > > The compression that we use allows for 32G because we shift into the
>>> > > > > > least significant bits - the algorithm is (ptr-heap_base)>>3.
>>> > > > > >
>>> > > > > > Coleen
>>> > > > > >
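To visualize the array-header saving described above, here is a rough sketch of the two layouts (field names assumed for illustration, not the actual HotSpot declarations):

  #include <cstdint>

  // 64-bit VM, uncompressed klass pointer: 4 bytes of padding follow _length
  // before the 8-byte-aligned element data, so the array header is 24 bytes.
  struct ArrayHeaderUncompressed {
      uint64_t _mark;     // mark word
      uint64_t _klass;    // full 64-bit klass pointer
      uint32_t _length;   // array length
      // + 32-bit alignment gap before the elements
  };

  // With a compressed klass pointer, _length slides into the gap next to the
  // klass and the header shrinks to 16 bytes - the 64-bit saving per array.
  struct ArrayHeaderCompressed {
      uint64_t _mark;     // mark word
      uint32_t _klass;    // compressed klass pointer
      uint32_t _length;   // array length fills the former gap
  };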
>>> > > > > > Vladimir Kozlov wrote:
>>> > > > > >
>>> > > > > > > Dan,
>>> > > > > > >
>>> > > > > > > Only the mark word is 64 bits. The klass pointer is 32 bits, but
>>> > > > > > > in the current implementation the gap after klass is not used.
>>> > > > > > >
>>> > > > > > > I am working on using the gap for a field or an array's length.
>>> > > > > > >
>>> > > > > > > The mark word may contain a 64-bit thread pointer (for Biased Locking).
>>> > > > > > >
>>> > > > > > > Thanks,
>>> > > > > > > Vladimir
>>> > > > > > >
>>> > > > > > > Dan Grove wrote:
>>> > > > > > >
>>> > > > > > > > Hi-
>>> > > > > > > >
>>> > > > > > > > I talked some with Nikolay Igotti about compressed oops in
>>> > > > > > > > OpenJDK7. He tells me that the mark word and class pointer remain 64
>>> > > > > > > > bits when compressed oops are being used. It seems that this leaves a
>>> > > > > > > > fair amount of the bloat in place when moving from 32->64 bits.
>>> > > > > > > >
>>> > > > > > > > I'm interested in deprecating 32-bit VM's at my employer at some
>>> > > > > > > > point. Doing this is going to require that 64-bit VM's have as little
>>> > > > > > > > bloat as possible. Has there been any consideration of making the mark
>>> > > > > > > > word and class pointer 32 bits in cases where the VM fits within 4GB?
>>> > > > > > > > It seems like this would be a major win. A second benefit here is that
>>> > > > > > > > the "add and shift" currently required on dereference of compressed
>>> > > > > > > > oops could be eliminated in cases where the VM fits inside 4GB.
>>> > > > > > > >
>>> > > > > > > > Dan