hotspot heap and L1 and L2 cache misses
John Cuthbertson
john.cuthbertson at oracle.com
Wed Sep 26 10:05:13 PDT 2012
Hi Andy,
TLAB is short for Thread Local Allocation Buffer. Each Java thread has a
TLAB and satisfies most of its allocations from there. When the TLAB is full
(or doesn't have enough space to satisfy the current allocation
request), it is retired and a new TLAB is allocated (from the shared
eden) for the thread. Threads that allocate more frequently (or allocate
larger objects) will fill up their TLABs, retire them, and get new TLABs
faster than other threads. Since (IIRC) your app performs most of its
allocations in a small number of threads, you'll mostly get the
co-location you're looking for. I say mostly because other threads will
fill up their TLABs, retire them, and allocate new TLABs while your main
allocating thread(s) is/are doing the same. So your eden may contain
TLABs from your main allocating thread(s) periodically interspersed with
retired TLABs from the other threads.
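To make the lifecycle above concrete, here is a simplified model of a TLAB's fast allocation path. This is illustrative only: the class and field names are invented for this sketch and do not correspond to HotSpot's actual internals, which operate on raw heap addresses rather than a Java array.

```java
// Simplified model of a TLAB's fast allocation path (illustrative only;
// these names are invented and do not match HotSpot internals).
final class TlabSketch {
    private final byte[] chunk; // thread-private region carved from shared eden
    private int top;            // bump pointer: offset of the next free byte

    TlabSketch(int sizeBytes) {
        this.chunk = new byte[sizeBytes];
        this.top = 0;
    }

    // Returns the offset of the new "object", or -1 when it doesn't fit
    // (the real JVM would then retire this TLAB and request a new one).
    int allocate(int objectSize) {
        if (top + objectSize > chunk.length) {
            return -1;
        }
        int offset = top;
        top += objectSize;      // the entire fast path is one pointer bump
        return offset;
    }
}
```

Note that successive allocations from one thread land at adjacent offsets, which is exactly the co-location property discussed above.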
I also said that most objects will be TLAB-allocated. Obviously, if an
object is larger than the TLAB size, it will be allocated in the shared
eden. Also (IIRC), arrays larger than a certain length are allocated in
the shared eden.
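If you want to observe this in your own app, HotSpot has flags for inspecting and tuning TLAB behavior. (Flag names from memory; `YourApp` is a placeholder, and you should verify availability on your JDK with -XX:+PrintFlagsFinal.)

```shell
# Print per-thread TLAB statistics at each GC
java -XX:+PrintTLAB -XX:+PrintGC YourApp

# Force a fixed TLAB size (in bytes) instead of adaptive sizing
java -XX:TLABSize=262144 -XX:-ResizeTLAB YourApp

# Disable TLABs entirely (to compare against shared-eden allocation cost)
java -XX:-UseTLAB YourApp
```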
Since you're interested in object co-location, you should also research
the terms "depth-first" and "breadth-first" as they apply to copying
collectors: the order in which a copying GC traverses live objects
determines their layout after the copy.
HTHs
JohnC
On 09/26/12 09:39, Andy Nuss wrote:
> I tested TLAB allocations in a single-threaded microbenchmark, and when
> no GC was involved, it seems like it was about 5 nanos overhead to
> create a small object. That is plenty fast enough.
>
> However, now I'm wondering about my chained objects. My long running
> execution function unlinks and relinks many types of chains. The
> question is, how strong is the guarantee of co-location with a thread,
> i.e. when many Java threads are calling this execution function that
> iteratively creates small objects per thread. (NOTE: simultaneous
> calls of the execution function do not share objects in any way).
> I.e. is TLAB a thread-local approach that uses a reasonably sized block
> of known free memory for each thread?
>
> ------------------------------------------------------------------------
> *From:* Christian Thalinger <christian.thalinger at oracle.com>
> *To:* Andy Nuss <andrew_nuss at yahoo.com>
> *Cc:* hotspot <hotspot-compiler-dev at openjdk.java.net>
> *Sent:* Monday, September 17, 2012 11:39 AM
> *Subject:* Re: hotspot heap and L1 and L2 cache misses
>
>
> On Sep 15, 2012, at 12:03 PM, Andy Nuss <andrew_nuss at yahoo.com
> <mailto:andrew_nuss at yahoo.com>> wrote:
>
> > Hi,
> >
> > Let's say I have a function which mutates a finite automaton. It
> > creates lots of small objects (my own link and double-link
> > structures). It also does a lot of puts into my own maps. The objects
> > and maps in turn have references to arrays and some immutable objects.
> >
> > My question is: all these arrays and objects are created in one
> > function that has to do a ton of construction; is there anything to
> > watch out for so that HotSpot will try to create all the objects in
> > this one function/thread co-located on the heap, so that L1/L2 cache
> > misses are reduced when the finite automaton is executed against data?
> >
> > Ideally, someone could tell me that when my class constructors in
> > turn create new instances of various other objects and arrays of
> > different sizes, they are all co-located on the heap.
> >
> > Ideally, someone could tell me that when I have a looping function
> > that creates a lot of very small linked-list objects in succession,
> > again they are co-located.
> >
> > In general, what does HotSpot do when creating new objects to help
> > the L1/L2 caches?
> >
> > By the way, I did a test port of my automaton to C++ where, for
> > objects like the above, I had big memory chunks, and my in-place
> > constructors just subdivided the memory chunk they owned so that
> > all the subobjects were as co-located as possible.
> >
> > This C++-ported automaton outperformed my Java version by 5x in
> > execution against data. And when I compared the construction-time
> > cost of the automaton (HotSpot's new versus my simple in-place C++
> > member functions, which just return the current chunk cursor after
> > computing the object's size and advancing the cursor past it), I saw
> > 25x performance differences (5 yrs ago).
>
> TLAB allocations do the same pointer-bump in HotSpot. Do the 5x
> really come from co-located data? Did you measure it? And maybe you
> should redo your 25x experiment. 5 years is a long time...
>
> -- Chris
>
> >
> > Andy
>
>
>