Parallel GC and array object layout: way off the base and laid out in reverse?
Thomas Schatzl
thomas.schatzl at oracle.com
Thu Sep 5 08:00:28 UTC 2013
Hi,
On Thu, 2013-09-05 at 00:44 +0400, Aleksey Shipilev wrote:
> Thanks Tony!
>
> On 05.09.2013, at 0:33, Tony Printezis <tprintezis at twitter.com> wrote:
>
> > Performance of forward array iteration might or might not be
> > important. For hash tables, it's all about look-ups, so the order
> > should not matter. It should only matter if you do a lot of whole
> > array traversals. It might be important for something like ArrayLists.
>
> Still have to do this on non-x86; I would suspect the behavior on ARM
> is quite different for forward and backward traversal.
When talking about particular hardware, could you specify potential
targets instead of just "embedded" and "non-x86"? Also, imo parallel GC
only makes sense above a certain machine size - for anything smaller
there is still the serial GC.
Given that, I would at least expect something multi-core (for parallel
GC) with a live data set of a few hundred MB (not sure, maybe just
100 MB). You mentioned ARM.
There do not seem to be many currently available potential targets
here: Cortex-A5, A7 and A9 (there is also the ARM11 MPCore afaik, but I
do not know of implementations with enough RAM to need parallel GC).
Searching the web for a few minutes reveals that all current Cortex-A5,
A7 and A9 processors implement data prefetching (on paper).
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15338.html,
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0464d/CHDIGCEB.html,
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388i/CBBIAAAA.html - these may not be the most definitive sources, but they are statements from the ARM site)
They all seem to implement both backward and forward data prefetching,
as that is the simplest form of prediction - as soon as you have a data
prefetcher at all, you most likely get both directions.
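If someone wants to sanity-check this on a concrete target, the
simplest starting point is a trivial traversal loop in both directions.
This is only an illustrative sketch (class and method names are made
up); a real measurement should use a proper benchmark harness such as
JMH, since a naive loop like this mostly measures JIT and warmup
effects:

```java
// Minimal forward-vs-backward traversal sketch (illustrative only;
// a proper measurement needs a benchmark harness such as JMH).
class TraversalSketch {
    static long sumForward(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    static long sumBackward(int[] a) {
        long s = 0;
        for (int i = a.length - 1; i >= 0; i--) s += a[i];
        return s;
    }
}
```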
The A15 and A5x do not really look embedded any more (personally I
would not count the Ax series either, but that depends on your
definition of embedded), as their performance is likely somewhere above
minimum usable x86 performance (Atom; from a desktop user's POV :). I
expect their prefetching capabilities will only improve anyway.
From the performance POV: I would expect that you get a few (one?)
additional cache misses before the backward prefetching starts working
(but you will also get those until forward prefetching kicks in).
So, on these targets, you save maybe one additional miss out of how
many?
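As a back-of-envelope estimate of "out of how many" (assuming 64-byte
cache lines and 4-byte references; the numbers and the helper are
purely illustrative):

```java
// Rough estimate of how many cache line fetches a linear traversal of
// a reference array costs (assumed: 64-byte lines, 4-byte references).
class MissEstimate {
    static int lineMisses(int arrayLength) {
        int lineSize = 64, refSize = 4;
        int elemsPerLine = lineSize / refSize;              // 16 per line
        return (arrayLength + elemsPerLine - 1) / elemsPerLine;
    }
}
```

For a 1024-element array that is about 64 line fetches, so one extra
miss before the prefetcher engages is on the order of 1-2% of them.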
You further mentioned that it would be advantageous if the first object
were on the same cache line as the array, to save that fetch. Both the
array and the object (at least its first few fields) must then fit on
that cache line, which is maybe 64 bytes. That is a small array: at
least 20 bytes are taken by the two headers, and assuming an object
with two fields, there is only space left for an array of maybe 10
elements (on 32-bit VMs; a lot fewer on 64-bit ones, somewhat more with
compressed oops).
That seems fairly small.
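To make the arithmetic explicit (the header and field sizes below are
assumed, roughly matching a 32-bit VM; exact layouts differ between VM
configurations):

```java
// Back-of-envelope: how many array slots fit on one cache line next to
// the array header and the first referenced object (assumed sizes).
class CacheLineFit {
    static int maxSlots() {
        int cacheLine    = 64;      // assumed cache line size
        int arrayHeader  = 12;      // mark word + klass pointer + length (32-bit)
        int objectHeader = 8;       // mark word + klass pointer (32-bit)
        int twoFields    = 2 * 4;   // an object with two 4-byte fields
        int refSize      = 4;       // 32-bit references
        int left = cacheLine - arrayHeader - (objectHeader + twoFields);
        return left / refSize;      // around 9-10 slots
    }
}
```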
Also, this problem affects only a very small minority of objects, i.e.
the ones that actually get copied by the GC. Most objects never survive
a collection and are never affected anyway.
Another thing to note is that this is the behavior of the parallel
young gen GC: iirc the old gen compacting full GCs are all
order-preserving.
> > Regarding "I don't like surprises": Thomas' reply was spot on. The
> > expectation that all GCs should always copy objects in the same way
> > and lay out object graphs in the same order is very misguided, IMHO.
> > Different GCs will behave differently (due to different allocation
> > strategies, parallelism, PLABs, work stealing, array chunking, etc.).
> > It might be possible to make the GCs behave a bit more similarly
> > than they do now, but that's about it. :-)
>
> Well, d'uh! The second part of that "no surprises" is "where possible".
> Overly generalizing (the absence of common sense in object layout) is
> also misguided. This quirk does look surprising and seems relatively
> simple to fix ;) Of course, users are oblivious of the exact layouts,
> as they should be.
To me it is fairly natural that when something is defined as
implementation-dependent, it actually behaves that way; the change has
probably been made intentionally, after testing.
As there is so much possible variation due to the other settings
mentioned, there does not seem to be a point in trying to set it up (or
even test for it) in a particular way. The next variation in platform
tested, default settings, ... would break this assumption anyway.
That said, I think patches are always welcome. In this particular case,
though, they should not cause performance regressions.
Maybe a re-assessment on current hardware will show that the initial
measurements are outdated by now anyway.
Thanks,
Thomas