hotspot heap and L1 and L2 cache misses
Srinivas Ramakrishna
ysr1729 at gmail.com
Thu Sep 27 00:36:00 PDT 2012
Hi Andy --
What John said, but I think asking about TLABs may be the wrong question if
the automaton construction occurs over a period of time and the size of the
automaton is large. In that case, minor GCs will definitely intervene, and
by the time you've finished constructing the entire automaton and start
executing it, the state and transition objects are no longer in TLABs; they
are probably by now in the old generation. So, what matters is what kind of
co-location you are looking for (hence John's note about "breadth first" vs
"depth first"). Most GCs in HotSpot will relocate objects via a
depth-first evacuation into tenured space, and use either sliding compaction or
depth-first evacuation within tenured space (or leave the objects alone). For
almost any automaton with a reasonably large state space and a
reasonably large input alphabet (i.e. unless these automata are thin and
linear, and have very regular and local transitions), I'd expect that with
HotSpot's GCs any hope of co-locating "adjacent states" is
remote. But with modern cache architectures and large caches, perhaps cache
misses aren't quite as bad as you might think. It's best to measure cache
misses to see how bad they are compared to your custom allocator, which knew
where to place the states so as to co-locate them.
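One way to make that measurement on Linux is `perf stat` (the event names
below are illustrative and vary by CPU; `AutomatonBench` and the input file
are placeholders for your own benchmark driver):

```shell
# Count cache behavior while the Java automaton runs against data.
# "AutomatonBench" and "input.dat" are placeholders for your own harness.
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
    java -cp . AutomatonBench input.dat

# Same events for the C++ version with the custom chunk allocator,
# so the miss rates can be compared directly.
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
    ./automaton_bench input.dat
```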
You might compare that against a run with a huge Eden, so as to gauge
the benefits of "TLAB co-location".
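For the huge-Eden experiment, something along these lines (the heap sizes
are arbitrary examples; size the young generation to your automaton's
footprint so no minor GC occurs during construction):

```shell
# Large young gen so construction finishes before any minor GC;
# -XX:+PrintTLAB shows per-thread TLAB retirement, so you can see
# whether allocation actually stayed in TLABs.
java -Xms8g -Xmx8g -Xmn4g \
     -XX:+PrintTLAB \
     -verbose:gc \
     -cp . AutomatonBench input.dat
```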
-- ramki
On Wed, Sep 26, 2012 at 10:05 AM, John Cuthbertson <
john.cuthbertson at oracle.com> wrote:
> Hi Andy,
>
> TLAB is short for Thread Local Allocation Buffer. Each Java thread has a
> TLAB and satisfies most allocations from there. When the TLAB is full (or
> doesn't have enough space to satisfy the current allocation request), it is
> retired and a new TLAB is allocated (from the shared eden) for the thread.
> Threads that allocate more frequently (or allocate larger objects) will
> be filling up their TLABs, retiring them, and getting new TLABs faster than
> other threads. Since (IIRC) your app performs most of its allocations in a
> small number of threads, you'll mostly get the co-location you're looking
> for. I say mostly because other threads will fill up their TLABs, retire
> them, and allocate new TLABs while your main allocating thread(s) is/are
> doing the same. So your eden may contain TLABs from your main allocating
> thread(s) periodically interspersed with retired TLABs from the
> other threads.
>
> I also said that most objects will be TLAB allocated. Obviously if the
> object is larger than the TLAB size it will be allocated in shared eden.
> Also (IIRC) arrays larger than a certain length are allocated in shared
> eden.
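The TLAB mechanism described here amounts to bump-pointer allocation out of
a thread-private buffer, much like Andy's C++ chunk allocator. A rough
sketch of the idea (a simplified illustrative model, not HotSpot's actual
implementation):

```java
// Simplified model of a Thread Local Allocation Buffer: allocation is a
// pointer bump, so consecutive allocations from one thread are adjacent
// (co-located). Illustrative sketch only, not HotSpot's real code.
final class TlabSketch {
    private final byte[] buffer;  // stands in for a chunk of eden
    private int top;              // the "bump pointer"

    TlabSketch(int sizeBytes) {
        buffer = new byte[sizeBytes];
    }

    /**
     * Returns the offset of the new "object" within the buffer, or -1
     * when the TLAB is exhausted (the real VM would then retire this
     * TLAB and carve a new one out of shared eden).
     */
    int allocate(int bytes) {
        int aligned = (bytes + 7) & ~7;    // HotSpot aligns to 8 bytes
        if (top + aligned > buffer.length) {
            return -1;                     // TLAB full: retire + refill
        }
        int offset = top;
        top += aligned;                    // the whole "allocation"
        return offset;
    }
}
```

Two back-to-back allocations from the same thread land at adjacent offsets,
which is the co-location under discussion; once a GC evacuates the objects,
that adjacency is no longer guaranteed.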
>
> Since you're interested in object co-location you should also research
> the meaning of the terms "depth-first" and "breadth-first".
>
> HTHs
>
> JohnC
>
>
> On 09/26/12 09:39, Andy Nuss wrote:
>
> I tested TLAB allocations in single threaded microbenchmark, and when no
> GC was involved, it seems like it was about 5 nanos overhead to create a
> small object. That is plenty fast enough.
>
> However, now I'm wondering about my chained objects. My long-running
> execution function unlinks and relinks many types of chains. The question
> is, how strong is the guarantee of co-location per thread when
> many Java threads are calling this execution function, each iteratively
> creating small objects. (NOTE: simultaneous calls of the
> execution function do not share objects in any way.) That is, is a TLAB a
> thread-local approach that uses a reasonably sized block of known free
> memory for each thread?
>
> ------------------------------
> *From:* Christian Thalinger <christian.thalinger at oracle.com>
> *To:* Andy Nuss <andrew_nuss at yahoo.com>
> *Cc:* hotspot <hotspot-compiler-dev at openjdk.java.net>
> *Sent:* Monday, September 17, 2012 11:39 AM
> *Subject:* Re: hotspot heap and L1 and L2 cache misses
>
>
> On Sep 15, 2012, at 12:03 PM, Andy Nuss <andrew_nuss at yahoo.com> wrote:
>
> > Hi,
> >
> > Let's say I have a function which mutates a finite automaton. It creates
> lots of small objects (my own link and double-link structures). It also
> does a lot of puts into my own maps. The objects and maps in turn have
> references to arrays and some immutable objects.
> >
> > My question is: all these arrays and objects are created in one function
> that has to do a ton of construction; is there anything to watch out for
> so that HotSpot will try to create all the objects in this one
> function/thread co-located on the heap, so that L1/L2 cache misses are
> reduced when the finite automaton is executed against data?
> >
> > Ideally, someone could tell me that when my class constructors in turn
> create new instances of various other objects and arrays of various sizes,
> they are all co-located on the heap.
> >
> > Ideally, someone could tell me that when I have a looping function that
> creates a lot of very small linked-list objects in succession, again they
> are co-located.
> >
> > In general, what does HotSpot do when creating new objects to help the
> L1/L2 caches?
> >
> > By the way, I did a test port of my automaton to C++ where, for objects
> like the above, I had big memory chunks that my in-place constructors just
> subdivided, so that all the subobjects were
> absolutely as co-located as possible.
> >
> > This C++-ported automaton outperformed my Java version by 5x in
> execution against data. And in the cases where I compared the
> construction-time cost of the automaton (HotSpot's new versus my simple
> in-place C++ member functions, which basically just return the current
> chunk cursor after calculating the object's size and advancing the cursor
> past it), I saw 25x performance differences (5 yrs ago).
>
> TLAB allocations do the same pointer-bump in HotSpot. Do the 5x really
> come from co-located data? Did you measure it? And maybe you should redo
> your 25x experiment. 5 years is a long time...
>
> -- Chris
>
> >
> > Andy
>
>
>
>
>