hotspot heap and L1 and L2 cache misses

Andy Nuss andrew_nuss at yahoo.com
Thu Sep 27 04:13:18 PDT 2012


Hi Srinivas,


My current game plan is that for the finite automata data structure, once finally constructed, I will have no choice but to represent it primarily as an array of ints (split into int[][] when big), possibly with a supplemental array of Objects.  This will give me enough colocation for the automaton, even though array access measures as about 4x slower in Java than class member access for me, most likely because of the bounds checks needed to make crashes impossible even with bad indices.
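A minimal sketch of the kind of flattened layout I mean (the class and field names here are hypothetical, not my actual code): all transitions for a state sit contiguously in one int[], so stepping the automaton is a single array load rather than a pointer chase.

```java
// Hypothetical sketch: a transition table flattened into one int[] so
// that all transitions for a state are contiguous in memory.
// Layout: transitions[state * alphabetSize + symbol] = next state (-1 = none).
final class FlatAutomaton {
    private final int[] transitions;
    private final int alphabetSize;

    FlatAutomaton(int numStates, int alphabetSize) {
        this.alphabetSize = alphabetSize;
        this.transitions = new int[numStates * alphabetSize];
        java.util.Arrays.fill(transitions, -1);
    }

    void addTransition(int from, int symbol, int to) {
        transitions[from * alphabetSize + symbol] = to;
    }

    int next(int state, int symbol) {
        // One bounds-checked array load per step instead of a field chase.
        return transitions[state * alphabetSize + symbol];
    }
}
```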

What I am really concerned with is the linked lists that are created while executing the automaton against a string.  My automata are really like NFAs, in that multiple states can be live at once even after construction, so to execute them one character at a time I have to maintain 3 types of chains whose elements are created, added to a chain, and unlinked depending on the transitions.  If you are matching a whole document with the automaton, you don't want to hit eden space hard for each new chain element, because that causes GC slowdown.  But if you avoid that by keeping free lists for the chains, then after even 1000 characters have been matched, you are extending the chains with elements that are completely shuffled in memory.  That is the tradeoff.
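The tradeoff can be sketched like this (a toy free-list pool, not my real chain code): recycled nodes avoid eden pressure, but a recycled node comes back at whatever address it had, whereas a fresh `new` is likely TLAB-contiguous with its neighbors.

```java
// Hypothetical sketch of the free-list tradeoff: recycling nodes avoids
// eden churn, but recycled nodes are scattered in memory rather than
// allocated contiguously with the rest of the chain.
final class NodePool {
    static final class Node {
        int value;
        Node next;
    }

    private Node free; // head of the free list

    Node acquire(int value) {
        Node n = free;
        if (n != null) {
            free = n.next;   // reuse a recycled node (arbitrary address)
        } else {
            n = new Node();  // fresh node: likely TLAB-contiguous
        }
        n.value = value;
        n.next = null;
        return n;
    }

    void release(Node n) {
        n.next = free;
        free = n;
    }
}
```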

Andy


________________________________
 From: Srinivas Ramakrishna <ysr1729 at gmail.com>
To: John Cuthbertson <john.cuthbertson at oracle.com> 
Cc: Andy Nuss <andrew_nuss at yahoo.com>; hotspot <hotspot-compiler-dev at openjdk.java.net>; Christian Thalinger <christian.thalinger at oracle.com> 
Sent: Thursday, September 27, 2012 12:36 AM
Subject: Re: hotspot heap and L1 and L2 cache misses
 

Hi Andy --

What John said, but I think asking about TLABs may be the wrong question if the automaton is constructed over a period of time and is large. In that case, minor GCs will definitely intervene, and by the time you've finished constructing the entire automaton and start executing it, the state and transition objects are no longer in TLABs; they are probably in the old generation by then. So what matters is what kind of co-location you are looking for (hence John's note about "breadth first" vs. "depth first"). Most GCs in HotSpot relocate objects via a depth-first evacuation into tenured space, and use either sliding compaction or depth-first evacuation within tenured space (or leave the objects alone). For almost any automaton with a reasonably large state space and a reasonably large input alphabet (i.e. unless these automata are thin and linear, with very regular and local transitions), I'd expect any hope of co-locating "adjacent states" under HotSpot's GCs to be remote. But with modern cache architectures and large caches, perhaps cache misses aren't as bad as you might think. It's best to measure cache misses to see how bad they are compared to your custom allocator, which knew where to place the states so as to co-locate them.
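One rough way to get a feel for the co-location effect from pure Java (a hypothetical micro-experiment, not a substitute for hardware counters): build the same set of nodes, link them once in allocation order and once in a shuffled order, and time a walk over each. The gap in walk time is dominated by cache misses on the pointer chase, though results vary with heap and GC state.

```java
// Hypothetical sketch: the same nodes linked in allocation order vs. a
// shuffled order.  Timing walk() over each linkage gives a rough sense
// of how much the pointer-chase order costs in cache misses.
final class ColocationBench {
    static final class Node {
        long payload;
        Node next;
    }

    // Link 'nodes' in the sequence given by 'order'; return the head.
    static Node link(Node[] nodes, int[] order) {
        for (int i = 0; i < order.length - 1; i++) {
            nodes[order[i]].next = nodes[order[i + 1]];
        }
        nodes[order[order.length - 1]].next = null;
        return nodes[order[0]];
    }

    // Chase the chain; the sum keeps the loop from being optimized away.
    static long walk(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) {
            sum += n.payload;
        }
        return sum;
    }
}
```

To compare, allocate a large Node[] in one loop, then time `walk()` over a chain linked with `order = {0,1,2,...}` versus a shuffled permutation of the same indices.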

You might compare that with running with a huge Eden so as to compare with the benefits of "TLAB colocation".

-- ramki


On Wed, Sep 26, 2012 at 10:05 AM, John Cuthbertson <john.cuthbertson at oracle.com> wrote:

 
>Hi Andy,
>
>TLAB is short for Thread Local Allocation Buffer. Each Java thread has
a TLAB and satisfies most allocations from there. When the TLAB is full
(or doesn't have enough space to satisfy the current allocation
request), it is retired and a new TLAB is allocated (from the shared
eden) for the thread. Threads that are allocating more frequently (or
allocating larger objects) will fill up their TLABs, retire them, and
get new TLABs faster than other threads. Since (IIRC) your app performs
most of its allocations in a small number of threads, you'll mostly get
the co-location you're looking for. I say mostly
because other threads will fill up their TLAB, retire it, and allocate
a new TLAB while your main allocating thread(s) is/are doing the same.
So your eden may contain TLABs from your main allocating thread(s)
periodically interspersed with retired TLABs from the other
threads.
>
>I also said that most objects will be TLAB allocated. Obviously if the
object is larger than the TLAB size it will be allocated in shared
eden. Also (IIRC) arrays larger than a certain length are allocated in
shared eden.
>
>Since you're interested in object co-location you should also research 
the meaning of the terms "depth-first" and "breadth-first".
>
>HTHs
>
>JohnC
>
>
>On 09/26/12 09:39, Andy Nuss wrote: 
>I tested TLAB allocations in a single-threaded
microbenchmark, and when no GC was involved, it came to about
5 nanos of overhead to create a small object.  That is plenty fast enough.
>>
>>
>>However,
now I'm wondering about my chained objects.  My long-running execution
function unlinks and relinks many types of chains.  The question is,
how strong is the guarantee of co-location within a thread, i.e. when
many Java threads are calling this execution function that iteratively
creates small objects per thread?  (NOTE: simultaneous calls of the
execution function do not share objects in any way.)  I.e., is TLAB a
thread-local approach that uses a reasonably sized block of known free
memory for each thread?
>>
>>
>>
>>
>>________________________________
>> From: Christian Thalinger <christian.thalinger at oracle.com>
>>To: Andy Nuss <andrew_nuss at yahoo.com> 
>>Cc: hotspot <hotspot-compiler-dev at openjdk.java.net> 
>>Sent: Monday,
September 17, 2012 11:39 AM
>>Subject: Re: hotspot
heap and L1 and L2 cache misses
>> 
>>
>>On Sep 15, 2012, at 12:03 PM, Andy Nuss <andrew_nuss at yahoo.com>
wrote:
>>
>>> Hi,
>>> 
>>> Let's say I have a function which mutates a finite automaton.  It
creates lots of small objects (my own link and double-link
structures).  It also does a lot of puts into my own maps.  The objects
and maps in turn have references to arrays and some immutable objects.
>>> 
>>> My question is: for all these arrays and objects created in one
function that has to do a ton of construction, are there any things to
watch out for so that HotSpot will try to create all the objects in this
one function/thread co-located on the heap, so that L1/L2 cache misses
are reduced when the finite automaton is executed against data?
>>> 
>>> Ideally, someone could tell me that when my class constructors in
turn create new instances of other objects and arrays of various
sizes, they are all co-located on the heap.
>>> 
>>> Ideally, someone could tell me that when I have a looping function
that creates a lot of very small linked-list objects in succession,
again they are co-located.
>>> 
>>> In general, what does HotSpot do when creating new objects to help
the L1/L2 caches?
>>> 
>>> By the way, I did a test port of my automata to C++ where, for
objects like the above, I had big memory chunks that my in-place
constructors simply subdivided, so that all
the subobjects were absolutely as co-located as possible.
>>> 
>>> This C++ ported automaton outperformed my Java version by 5x in
execution against data.  I also compared the construction-time cost of
the automaton: HotSpot's new versus my simple in-place C++ member
functions, which basically just return the current chunk cursor after
calculating the size of the object and updating the cursor to point
beyond the new size.  In those cases I saw 25x performance differences
(5 yrs ago).
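The chunk-cursor idea described above can be rendered in Java as an int[] arena (a hypothetical sketch, not the original C++): "allocation" is just advancing a cursor, which is the same pointer-bump HotSpot's TLABs perform.

```java
// Hypothetical Java rendering of the C++ bump-pointer allocator: carve
// small fixed-size records out of one int[] chunk by advancing a
// cursor, analogous to HotSpot's TLAB pointer-bump.
final class IntArena {
    private final int[] chunk;
    private int cursor;

    IntArena(int capacity) {
        chunk = new int[capacity];
    }

    // Returns the offset of a freshly "allocated" record of 'size' ints.
    // No bounds handling here; a real arena would grow or chain chunks.
    int alloc(int size) {
        int at = cursor;
        cursor += size;
        return at;
    }

    void set(int offset, int field, int value) {
        chunk[offset + field] = value;
    }

    int get(int offset, int field) {
        return chunk[offset + field];
    }
}
```

Records allocated back-to-back this way are adjacent by construction, which is the co-location the C++ version got for free.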
>>
>>TLAB allocations do the same pointer-bump in HotSpot.  Do the 5x really
come from co-located data?  Did you measure it?  And maybe you should
redo your 25x experiment.  5 years is a long time...
>>
>>-- Chris
>>
>>> 
>>> Andy
>>
>>
>>
>>
>


More information about the hotspot-compiler-dev mailing list