[PATCH] JDK NUMA Interleaving issue

Thu Nov 22 11:48:11 UTC 2018

Hi Amith,

  welcome to the OpenJDK community! :)

On Fri, 2018-11-09 at 17:53 +0530, amith pawar wrote:
> Hi all,
> 
> The flag UseNUMA (or UseNUMAInterleaving), has to interleave old gen,
> S1 and S2 region (if any other ) memory areas on requested Numa nodes
> and it should not configure itself to access other Numa nodes. This
> issue is observed only when Java is allowed to run on fewer NUMA
> nodes than available on the system with numactl membind and
> interleave options. Running on all the nodes does not have any
> effect. This will cause some applications (objects residing in old
> gen and survivor region) to run slower on system with large Numa
> nodes.
> 
[... long explanation...]

Is it possible to summarize the problem and the changes with the
following few lines:

"NUMA interleaving of memory of old gen and survivor spaces (for
parallel GC) tells the OS to interleave memory across all nodes of a
NUMA system. However the VM process may be configured to be limited to
run only on a few nodes, which means that large parts of the heap will
be located on foreign nodes. This can incur a large performance
penalty.

The proposed solution is to tell the OS to interleave memory only
across available nodes when enabling NUMA."
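
To make "available nodes" in that summary concrete, here is a minimal
sketch of the idea as I understand it. It is not taken from the patch:
the names NodeMask, kMaxNodes, interleave_nodes and node_ids are made up
for illustration, and a real implementation would get both masks from
the OS instead of taking them as parameters.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Hypothetical upper bound on NUMA nodes, for this sketch only.
constexpr std::size_t kMaxNodes = 64;
using NodeMask = std::bitset<kMaxNodes>;

// Nodes to interleave across: those configured on the system AND
// permitted for this process (e.g. restricted via numactl).
NodeMask interleave_nodes(const NodeMask& configured, const NodeMask& allowed) {
  return configured & allowed;
}

// Expand a mask into an ascending list of node ids.
std::vector<int> node_ids(const NodeMask& mask) {
  std::vector<int> ids;
  for (std::size_t i = 0; i < kMaxNodes; ++i) {
    if (mask.test(i)) ids.push_back(static_cast<int>(i));
  }
  return ids;
}
```

With eight configured nodes and a process restricted to nodes 0 and 2,
the resulting interleave set contains only nodes 0 and 2, so no part of
the heap is placed on nodes the process cannot run on.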

We have had trouble understanding the problem statement and purpose of
this patch when triaging (making sure the issue is understandable and
can be worked on) as the text is rather technical. Having an easily
understandable text also helps reviewers a lot.

Assuming my summary is appropriate, I have several other unrelated
questions:

- could you propose a better subject for this work? "JDK NUMA
Interleaving issue" seems very generic. Something like "NUMA heap
allocation does not respect VM membind/interleave settings" maybe?

- there have been other NUMA related patches from AMD recently. In
particular, what is the relation to JDK-8205051? The text there
reads awfully similar to this one, but I *think* JDK-8205051 is
actually about making sure that the parallel gc eden is not put on
inactive nodes.
Can you confirm this (talk to your colleague) so that we can fix the
description too?

- fyi, we are currently working on NUMA aware memory allocation support
for G1 in JEP 345 (JDK-8210473). It may be useful to sync up a bit to
not step on each other's toes (see below).

[... broken patch...]

I tried to apply the patch to the jdk/jdk tree, which failed; I then
started to manually apply it but stopped after not being able to find
the context of some hunks. I do not think this change applies to the
latest source tree.

Please make sure that the patch applies to the latest jdk/jdk tree without
errors. All changes generally must first go into the latest dev tree
before you can apply for backporting.

Could you please send the patch as attachment (not copy&pasted) to this
list and cc me? Then I can create a webrev out of it.

I did look over the patch as far as I could (it is very hard to
review a diff). Some comments:

  - the _numa_interleave_memory_v2 function pointer is never assigned
to in the patch in the CR, so it will not be used. Please make sure the
patch is complete.
Actually it is never defined anywhere, i.e. the patch is unlikely to
compile even if I could apply it somewhere.

Please avoid frustrating reviewers by sending incomplete patches.

  - I am not sure set_numa_interleave, i.e. the interleaving, should be
done without UseNUMAInterleaving enabled. Some collectors may do their
own interleaving in the future (JDK-8210473 / JEP-345) so this would
massively interfere with how they work. (That issue may be because I am
looking at a potentially incomplete diff, so forgive me if the patch
already handles this).

  - if any of the requested actions (interleaving/membind) fails, it
would be nice to print a log message.

Actually it would be nice to print information about e.g. the bitmask
anyway in the log so that users can easily verify that what they
specified on the command line has been properly picked up by the VM,
instead of looking through the proc filesystem.
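
As a rough idea of the kind of output I have in mind, here is a small
sketch that turns a node bitmask into a readable line. The helper names
(format_nodes, numa_log_line), the NodeMask type and the message format
are assumptions for illustration only. In the VM this would of course go
through the regular logging framework instead of returning a string.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Hypothetical upper bound on NUMA nodes, for this sketch only.
constexpr std::size_t kMaxNodes = 64;
using NodeMask = std::bitset<kMaxNodes>;

// Render a node mask as a comma-separated list of node ids,
// or "none" if no bit is set.
std::string format_nodes(const NodeMask& mask) {
  std::string out;
  for (std::size_t i = 0; i < kMaxNodes; ++i) {
    if (!mask.test(i)) continue;
    if (!out.empty()) out += ",";
    out += std::to_string(i);
  }
  return out.empty() ? "none" : out;
}

// Build the message for one policy ("interleave" or "membind").
std::string numa_log_line(const char* policy, const NodeMask& mask) {
  return std::string("NUMA ") + policy + " nodes: " + format_nodes(mask);
}
```

Printing one such line per policy at startup would let the user compare
it directly against the numactl options given on the command line.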

Thanks,
  Thomas