Parallel GC and array object layout: way off the base and laid out in reverse?

Wed Sep 4 18:19:37 UTC 2013

Thomas,

See in-line.

On 9/4/2013 6:49 AM, Thomas Schatzl wrote:
> Hi,
>
> On Wed, 2013-09-04 at 15:28 +0400, Aleksey Shipilev wrote:
>> On 09/04/2013 02:56 PM, Thomas Schatzl wrote:
>>> Also the results are wrong:
>>>
>>>>> $ java -XX:+UseParallelGC ArrayLayoutTest
>>>>> Before the GC:
>>>>> array is at 4120951026 (0 units off base)
>>>>>    object is at 4120951033, 7 units off base, toString = 0
>>>                               ^^^ the first array element is typically on
>>> word offset 3 or so... (iirc in the simplest case: 1 word header, 1 word
>>> klass pointer, 1 int element size and possibly some padding).
>> That is not &arr[i], i.e. not the (arr + i*sizeof(oop)). This is the
>> location of object referred by the arr[i], i.e. (long)arr[i]. The issue
>> is about GC laying out the referenced objects in the reverse order.
> Okay, now I understand the purpose of the test. Thanks for the
> clarification.
>
>> I have updated the test one more time to make it clearer. It does not
>> use either Integers or Strings, to dodge any sort of magic the VM can do
>> otherwise:
>>   http://cr.openjdk.java.net/~shade/scratch/ArrayLayoutTest.java
>>
>> This is the sample output:
>>
>> $ ~/Install/jdk8b104/bin/java  -XX:-UseCompressedOops ArrayLayoutTest
>> Before the GC:
>> array is at 140679387257752 (0 units off base)
>>    object is at 140679387257856, 104 units off base, toString = 0
>>    object is at 140679387257880, 128 units off base, toString = 1
>>    object is at 140679387257904, 152 units off base, toString = 2
>>    object is at 140679387257928, 176 units off base, toString = 3
>>    object is at 140679387257952, 200 units off base, toString = 4
>>    object is at 140679387257976, 224 units off base, toString = 5
>>    object is at 140679387258000, 248 units off base, toString = 6
>>    object is at 140679387258024, 272 units off base, toString = 7
>>    object is at 140679387258048, 296 units off base, toString = 8
>>    object is at 140679387258072, 320 units off base, toString = 9
>>
>> Without the compressed oops, 1 unit = 1 byte.
>>
>> Which means that for a freshly allocated object, the array of 10 references
>> takes 80 bytes, plus 24 bytes for the header (8+8 headers, 4 array size,
>> 4 for alignment), totalling 104 bytes. Right after the array, we start
>> to lay out the referenced objects, which take 24 bytes each (16 bytes for
>> the header + 4 bytes int + 4 bytes alignment up to 8 bytes). Notice how
>> densely they are packed.
>>
>> After the GC:
>> array is at 140676601743672 (0 units off base)
>>    object is at 140676601743992, 320 units off base, toString = 0
>>    object is at 140676601743968, 296 units off base, toString = 1
>>    object is at 140676601743944, 272 units off base, toString = 2
>>    object is at 140676601743920, 248 units off base, toString = 3
>>    object is at 140676601743896, 224 units off base, toString = 4
>>    object is at 140676601743872, 200 units off base, toString = 5
>>    object is at 140676601743848, 176 units off base, toString = 6
>>    object is at 140676601743824, 152 units off base, toString = 7
>>    object is at 140676601743800, 128 units off base, toString = 8
>>    object is at 140676601743776, 104 units off base, toString = 9
>>
>> Now it's different. We know the array and all the referenced values got
>> promoted, because their addresses changed. But now, we see the
>> referenced objects are laid out in reverse! What gives?
> There is no preservation of the placement order of objects during
> evacuation.
>
> The collectors do not particularly try very hard to keep objects
> together, except maybe trying a rough depth-first traversal within a
> single collector thread (I may be completely wrong, I would need to have
> a look at the particular implementation; maybe others can chime in).

I haven't followed this thread carefully enough, but the ParallelGC
collector uses a depth-first traversal while the other collectors use a
breadth-first one.  Would that explain the difference?
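
To illustrate why the traversal order could matter here, below is a toy
sketch. It is not HotSpot code; the class and method names are made up for
illustration, and it assumes only that the array's slots are scanned in
index order and that pending references are taken either from the top of a
stack (depth-first style) or from the head of a queue (breadth-first style).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: hypothetical names, not the ParallelGC implementation.
public class TraversalOrderSketch {
    // Returns the order in which element indices 0..n-1 would be copied
    // once the array itself has been copied and its slots scanned in order.
    static List<Integer> copyOrder(int n, boolean lifo) {
        Deque<Integer> pending = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            pending.addLast(i); // slots are discovered in index order
        }
        List<Integer> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            // LIFO takes the most recently discovered reference first,
            // FIFO takes the oldest one first.
            order.add(lifo ? pending.pollLast() : pending.pollFirst());
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println("LIFO (depth-first style):   " + copyOrder(10, true));
        System.out.println("FIFO (breadth-first style): " + copyOrder(10, false));
    }
}
```

With the LIFO discipline the last element is copied first and so ends up
right after the array, which matches the reversed "After the GC" output
quoted above. Whether this is what actually happens inside the collector
would need checking against the implementation.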

Jon

>
> In addition to that, local allocation buffers, threading, work stealing
> and (large) object array handling make it unlikely that the allocation
> order is preserved. Since the collectors use different implementations,
> the actual allocation order is also not the same across collectors.
>
>> Yes, and user code should be oblivious to this. However, I am asking a
>> different question: whether we should lay out the referenced elements in
>> their indexed order, not in reverse.
> Imo it's not clear whether there is a big difference, as future access
> order would be important here.
> Preferential access may go in either direction or be completely
> independent of the array (if the program accesses lots of unrelated
> objects for each array element anyway).
>
> In this particular case, modern hw prefetchers also work well in the
> reverse direction.
>
> At the moment, access information is not gathered anywhere in the VM afaik.
> Even if the information were available and somehow used, it is not clear
> whether the effort spent on gathering and applying this information
> pays for itself later.
>
> Maybe there are good studies on current hardware on realistic loads
> about that somewhere?
>
> Hth,
> Thomas
>
>
>
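
For reference, the offsets in the quoted output can be reproduced from the
sizes Aleksey gives (24-byte array header, 8-byte references, 24-byte
element objects, no compressed oops). The sketch below only redoes that
arithmetic; the class and method names are hypothetical, and the sizes are
taken from the quoted explanation rather than measured.

```java
// Arithmetic sketch only: sizes come from the quoted explanation above.
public class LayoutArithmeticSketch {
    static final int ARRAY_HEADER = 8 + 8 + 4 + 4; // two header words, length, padding
    static final int REF_SIZE = 8;                 // uncompressed reference
    static final int OBJECT_SIZE = 16 + 4 + 4;     // header, int field, padding

    // Size of the reference array itself, in bytes.
    static long arraySize(int length) {
        return ARRAY_HEADER + (long) REF_SIZE * length;
    }

    // Offset of element i from the array base when laid out in index order.
    static long offsetInOrder(int i, int length) {
        return arraySize(length) + (long) OBJECT_SIZE * i;
    }

    // Offset of element i from the array base when laid out in reverse.
    static long offsetReversed(int i, int length) {
        return arraySize(length) + (long) OBJECT_SIZE * (length - 1 - i);
    }

    public static void main(String[] args) {
        int length = 10;
        for (int i = 0; i < length; i++) {
            System.out.println("element " + i
                    + ": in order = " + offsetInOrder(i, length)
                    + ", reversed = " + offsetReversed(i, length));
        }
    }
}
```

For 10 elements this gives 104 through 320 in index order and 320 down to
104 in reverse, the same figures as in the "Before the GC" and "After the
GC" listings.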