NUMA-Aware Java Heaps for in-memory databases
Jon Masamitsu
jon.masamitsu at oracle.com
Tue Feb 19 14:17:05 PST 2013
On 02/13/13 05:42, Antoine Chambille wrote:
> We are developing a Java in-memory analytical database (it's called
> "ActivePivot") that our customers deploy on ever larger datasets. Some
> ActivePivot instances are deployed on Java heaps close to 1TB, on NUMA
> servers (typically 4 Xeon processors and 4 NUMA nodes). This is becoming a
> trend, and we are researching solutions to improve our performance on NUMA
> configurations.
>
>
> We understand that in the current state of things (and including JDK8) the
> support for NUMA in hotspot is the following:
> * The young generation heap layout can be NUMA-aware (partitioned per NUMA
> node, with objects allocated on the same node as the running thread)
> * The old generation heap layout is not optimized for NUMA (at best the old
> generation is interleaved among nodes, which at least makes memory accesses
> somewhat uniform)
> * The parallel garbage collector is NUMA optimized, the GC threads focusing
> on objects in their node.
This last part is not true. GC threads do not focus on objects on
their node.
> Yet activating the -XX:+UseNUMA option has almost no impact on the performance
> of our in-memory database. This is not surprising: the pattern for a database
> is to load the data into memory and then run queries against it. The data
> goes into and stays in the old generation, and it is read from there by
> queries. Most memory accesses hit the old gen, and most of those are not local.
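One sanity check, for what it's worth: as far as I know the NUMA-aware
young generation only kicks in with the parallel collector, so it can be
worth confirming from inside the VM that both -XX:+UseParallelGC and
-XX:+UseNUMA are actually in effect. Something along these lines (uses the
JDK-specific com.sun.management API) prints the flag values as the running
VM sees them:

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class CheckNumaFlags {
        public static void main(String[] args) {
            // Ask the running VM for the values of the NUMA-related flags.
            HotSpotDiagnosticMXBean hs =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            for (String flag : new String[] { "UseParallelGC", "UseNUMA" }) {
                System.out.println(flag + " = " + hs.getVMOption(flag).getValue());
            }
        }
    }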
>
> I guess there is a reason hotspot does not yet optimize the old generation
> for NUMA. It must be very difficult to do in the general case, when you
> have no idea what thread from what node will read the data, and interleaving
> is a reasonable compromise. But for an in-memory database this is frustrating,
> because we know very well which threads will access which piece of data. At
> least in ActivePivot, data structures are partitioned and each partition is
> assigned a thread pool, so the threads that allocated the data in a partition
> are also the threads that perform sub-queries on that partition. We are a few
> lines of code away from binding thread pools to NUMA nodes, and if the
> garbage collector left objects promoted to the old generation on their
> original NUMA node, memory accesses would be close to optimal.
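For what it's worth, those "few lines of code" would presumably look
something like the sketch below: one executor per partition, with each
worker thread pinned to that partition's node when it starts. The
NumaBinding.runOnNode() call is only a placeholder for whatever native
binding is used for the pinning (e.g. libnuma's numa_run_on_node exposed
through JNI or JNA); the JDK itself does not provide one.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;

    // Sketch only: one pool per partition, every worker pinned to that
    // partition's NUMA node before it starts taking work.
    public class NodeBoundPools {

        /** Placeholder for a native binding (e.g. libnuma's numa_run_on_node
         *  via JNI or JNA); the JDK itself has no such API. */
        public interface NumaBinding {
            void runOnNode(int node);
        }

        public static ExecutorService newPoolForPartition(final int node,
                                                          final int threads,
                                                          final NumaBinding numa) {
            ThreadFactory factory = new ThreadFactory() {
                public Thread newThread(final Runnable worker) {
                    return new Thread(new Runnable() {
                        public void run() {
                            numa.runOnNode(node); // pin this worker to its node
                            worker.run();         // then enter the pool's work loop
                        }
                    }, "partition-" + node + "-worker");
                }
            };
            return Executors.newFixedThreadPool(threads, factory);
        }
    }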
>
> We have not been able to do that. That being said, I read an inspiring
> 2005 article by Mustafa M. Tikir and Jeffrey K. Hollingsworth that
> experimented with NUMA layouts for the old generation ("NUMA-aware Java heaps
> for server applications",
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.6587&rep=rep1&type=pdf).
> That motivated me to ask a few questions on this mailing list:
>
>
> * Are there hidden or experimental hotspot options that allow NUMA-Aware
> partitioning of the old generation?
> * Do you know why there isn't much (visible, generally available) progress
> on NUMA optimizations for the old gen? Is the Java in-memory database use
> case considered a rare one?
Development does not make decisions about which features/enhancements
we implement. We have a product management team that talks to
customers and proposes projects to development. I'll forward your mail to
them if you like.
> * Maybe we at Quartet FS should experiment and even contribute new heap
> layouts to the open-jdk project. Can you comment on the difficulty of that?
So for your case you would want the data allocated to
a region of the young generation on node XX
to be promoted to a region of the old generation
on XX.
I think doing this would require
1) Partition the old gen into regions OXX that
would have the OXX's memory on a particular
node (easy)
2) A strategy for moving the right
objects into the OXX's. The young gen
GC's do the copying of objects from the
young gen to the old gen. You know that
you want the objects in NXX (the region of
the young gen on node XX) to end up in OXX,
but our young gen GC's do not simply copy
live objects from NXX to OXX. The young gen
GC's start from the roots (references to
objects) held by each thread (e.g., a
reference to an object on the thread's
stack) and copy every object reachable
from the roots (i.e., referenced from the
roots, so still usable by the application
thread and therefore live) to the old gen.
I can think of ways to do this but don't
know how effective they would be. It would
need some experimentation, so I would say
hard. (A rough sketch of the routing idea
follows this list.)
3) Changing the old GC to understand that the
old gen is divided into regions OXX and to
keep the objects in an OXX in the same OXX.
I think we know how to do this but there
would have to be lots of code changes so
not easy.
4) Maybe a strategy for dynamically sizing
the OXX's in case some OXX's have more live
data than others, plus a strategy for
handling overflow of an OXX. The simplest
thing would be to do a full GC, but that
might happen too often.
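Very roughly, the routing in 2) would amount to something like the
following (plain Java pseudocode, not actual HotSpot code; OldRegion and
the per-node region table are made up purely for illustration):

    // Illustrative pseudocode only -- not HotSpot code.  The idea is that
    // the copying loop of a young GC would look up the object's NUMA node
    // and promote it into the old-gen region (OXX) on the same node.
    final class NumaAwarePromotion {

        /** One old-gen region per NUMA node (the "OXX" above) -- hypothetical. */
        interface OldRegion {
            boolean hasRoom(long size);
            long copyInto(long obj, long size);   // returns the new address
        }

        private final OldRegion[] oldRegionByNode;  // indexed by NUMA node id

        NumaAwarePromotion(OldRegion[] oldRegionByNode) {
            this.oldRegionByNode = oldRegionByNode;
        }

        /** Promote a surviving young-gen object, keeping it on its node. */
        long promote(long obj, long size, int nodeOfObject) {
            OldRegion target = oldRegionByNode[nodeOfObject];
            if (!target.hasRoom(size)) {
                // Overflow handling from 4): spill to another node's region
                // (or resize the regions, or fall back to a full GC).
                target = anyRegionWithRoom(size);
            }
            return target.copyInto(obj, size);
        }

        private OldRegion anyRegionWithRoom(long size) {
            for (OldRegion r : oldRegionByNode) {
                if (r.hasRoom(size)) {
                    return r;
                }
            }
            throw new IllegalStateException("old gen exhausted: full GC needed");
        }
    }

The hard part, as noted in 2), is not this lookup but getting the young
gen GC to know, while it traces from the roots, which node each surviving
object came from.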
Jon
> Thanks for reading, and Best Regards,
>
> --
> Antoine CHAMBILLE
> Director Research & Development
> Quartet FS