NUMA-Aware Java Heaps for in-memory databases
Antoine Chambille
ach at quartetfs.com
Wed Feb 20 01:12:01 PST 2013
Thank you very much, Jon, for taking the time to understand the use case
and describe the steps toward a NUMA-partitioned old generation.
I now better appreciate the amount of work and skill involved, and I fear
it is out of reach for newcomers to the HotSpot codebase.
I would appreciate being put in touch with product management, if you are
willing to give them my contact details, to understand how they view the
"Java for in-memory database" use case.
-Antoine
On 19 February 2013 23:17, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>
>
> On 02/13/13 05:42, Antoine Chambille wrote:
>
>> We are developing a Java in-memory analytical database (it's called
>> "ActivePivot") that our customers deploy on ever-larger datasets. Some
>> ActivePivot instances run on Java heaps close to 1 TB, on NUMA servers
>> (typically 4 Xeon processors and 4 NUMA nodes). This is becoming a
>> trend, and we are researching solutions to improve our performance on
>> NUMA configurations.
>>
>>
>> We understand that in the current state of things (and including JDK 8)
>> the support for NUMA in HotSpot is the following:
>> * The young generation heap layout can be NUMA-aware (partitioned per
>> NUMA node, with objects allocated on the same node as the running
>> thread).
>> * The old generation heap layout is not optimized for NUMA (at best the
>> old generation is interleaved among nodes, which at least makes memory
>> accesses somewhat uniform).
>> * The parallel garbage collector is NUMA-optimized, with the GC threads
>> focusing on objects in their own node.
>>
>
> This last part is not true. GC threads do not focus on objects on
> their node.
>
>> Yet activating the -XX:+UseNUMA option has almost no impact on the
>> performance of our in-memory database. That is not surprising: the
>> pattern for a database is to load the data into memory and then run
>> queries against it. The data goes into, and stays in, the old
>> generation, and queries read it from there. Most memory accesses hit
>> the old gen, and most of those are not local.
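>>
>> For reference, this is roughly how such an instance is launched (the
>> heap size and main class below are placeholders):
>>
>>     java -XX:+UseParallelGC -XX:+UseNUMA -Xms512g -Xmx512g <main-class>
>>
>> This gives a NUMA-aware young generation and, at best, an interleaved
>> old generation, which is why it makes so little difference for us.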
>>
>> I guess there is a reason HotSpot does not yet optimize the old
>> generation for NUMA. It must be very difficult to do in the general
>> case, when you have no idea which thread from which node will read the
>> data, and interleaving is the only safe default.
>> But for an in-memory database this is frustrating, because we know very
>> well which threads will access which piece of data. At least in
>> ActivePivot, data structures are partitioned and each partition is
>> assigned a thread pool, so the threads that allocated the data in a
>> partition are also the threads that run sub-queries on that partition.
>> We are a few lines of code away from binding those thread pools to NUMA
>> nodes, and if the garbage collector left objects promoted to the old
>> generation on their original NUMA node, memory accesses would be close
>> to optimal.
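>>
>> To illustrate what we mean by "a few lines of code away", here is a
>> minimal sketch with one dedicated pool per partition. The class and
>> names are made up for illustration; the actual binding of each pool's
>> threads to a NUMA node is not expressible in plain Java and would need
>> a native call (e.g. libnuma through JNI) or launching the JVM under
>> numactl.
>>
>>     import java.util.ArrayList;
>>     import java.util.List;
>>     import java.util.concurrent.Callable;
>>     import java.util.concurrent.ExecutorService;
>>     import java.util.concurrent.Executors;
>>     import java.util.concurrent.Future;
>>
>>     /** Illustrative only: one thread pool per data partition, so the
>>      *  threads that loaded a partition are also the ones that query it. */
>>     public class PartitionedExecutors {
>>         private final List<ExecutorService> pools = new ArrayList<>();
>>
>>         public PartitionedExecutors(int partitionCount, int threadsPerPartition) {
>>             for (int p = 0; p < partitionCount; p++) {
>>                 // A NUMA-bound variant would use a thread factory that also
>>                 // pins each new thread to the node owning partition p
>>                 // (native call, not shown here).
>>                 pools.add(Executors.newFixedThreadPool(threadsPerPartition));
>>             }
>>         }
>>
>>         /** Run a sub-query on the pool that owns the given partition. */
>>         public <T> Future<T> submit(int partition, Callable<T> subQuery) {
>>             return pools.get(partition).submit(subQuery);
>>         }
>>     }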
>>
>> We have not been able to do that. That being said, I read an inspiring
>> 2005 article by Mustafa M. Tikir and Jeffrey K. Hollingsworth that
>> experimented with NUMA layouts for the old generation ("NUMA-aware Java
>> heaps for server applications",
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.6587&rep=rep1&type=pdf).
>> That motivated me to ask questions on this mailing list:
>>
>>
>> * Are there hidden or experimental HotSpot options that allow NUMA-aware
>> partitioning of the old generation?
>> * Do you know why there hasn't been much (visible, generally available)
>> progress on NUMA optimizations for the old gen? Is the Java in-memory
>> database use case considered a rare one?
>>
>
> Development does not make decisions about what features/enhancements
> we implement. We have a product management team that talks to
> customers and proposes projects to development. I'll forward your mail
> to them if you like.
>
>
>> * Maybe we at Quartet FS should experiment with, and even contribute,
>> new heap layouts to the OpenJDK project. Can you comment on the
>> difficulty of that?
>>
>
> So for your case you would want the data allocated to
> a region of the young generation on node XX
> to be promoted to a region of the old generation
> on XX.
>
> I think doing this would require
>
> 1) Partition the old gen into regions OXX that
> would have the OXX's memory on a particular
> node (easy)
>
> 2) A strategy for moving the right objects into the OXX's. The young
> gen GC's do the copying of objects from the young gen to the old gen.
> You know that you want the objects in NXX (the young gen region on
> node XX) to end up in OXX, but our young gen GC's do not simply copy
> live objects from NXX to OXX. The young gen GC's start from the roots
> (references to objects) held by each thread (e.g., a reference to an
> object on the thread's stack) and copy all objects reachable from the
> roots (i.e., referenced from the roots, so usable by the application
> thread, so live) to the old gen. I can think of ways to do this but
> don't know how effective they would be. It would need some
> experimentation, so I would say hard. (A toy illustration of this
> routing idea follows after the list.)
>
> 3) Changing the old GC to understand that the
> old gen is divided into regions OXX and to
> keep the objects in an OXX in the same OXX.
> I think we know how to do this but there
> would have to be lots of code changes so
> not easy.
>
> 4) Maybe a strategy for dynamically sizing the OXX's in case some
> OXX's have more live data than others. Plus a strategy for handling
> overflow of an OXX. The simplest thing would be to do a full GC, but
> that might happen too often.
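>
> Purely as a toy illustration of the routing idea in 2), and not HotSpot
> code: a Java sketch of "remember which node each object was allocated
> on, and promote survivors into that node's old-gen region". All names
> are made up for illustration.
>
>     import java.util.ArrayList;
>     import java.util.HashMap;
>     import java.util.IdentityHashMap;
>     import java.util.List;
>     import java.util.Map;
>
>     // Toy model only: per-node old-gen "regions" (step 1) and a
>     // promotion routine that keeps each object on the node it was
>     // allocated on (step 2).
>     class NumaPromotionSketch {
>         private final Map<Integer, List<Object>> oldGenRegionByNode = new HashMap<>();
>         private final Map<Object, Integer> allocationNode = new IdentityHashMap<>();
>
>         void recordYoungAllocation(Object o, int numaNode) {
>             allocationNode.put(o, numaNode);
>         }
>
>         // Called for each live object a young collection decides to promote.
>         void promote(Object liveObject) {
>             Integer node = allocationNode.remove(liveObject);
>             int target = (node != null) ? node : 0; // fall back to node 0 if unknown
>             oldGenRegionByNode.computeIfAbsent(target, n -> new ArrayList<>())
>                               .add(liveObject);
>         }
>     }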
>
> Jon
>
>
>
>> Thanks for reading, and Best Regards,
>>
>> --
>> Antoine CHAMBILLE
>> Director Research & Development
>> Quartet FS
>>
>
--
Antoine CHAMBILLE
Director Research & Development
Quartet FS