Review Request: UseNUMAInterleaving #3

Fri Aug 19 22:47:00 UTC 2011

Please review this patch which adds a new flag called
UseNUMAInterleaving.  This flag provides a subset of the functionality
provided by UseNUMA.  In Hotspot UseNUMA terminology,
UseNUMAInterleaved makes all memory "numa_global" which is implemented
as interleaved.  This patch's main purpose is to provide that subset
on OSes like Windows which do not support the full UseNUMA
functionality.  However, a simple implementation of UseNUMAInterleaving is
also provided for other OSes

The situations where this shows the biggest benefits would be:
    * Windows platforms with multiple numa nodes (eg, 4)

    * The JVM process is run across all the nodes (not affinitized to one
node).

    * A workload that has enough threads so that it uses the majority
      of the cores in the machine, so that the heap is being accessed
      from many cores, including remote ones.

    * Enough memory per node and a heap size such that the default heap
      placement policy on windows would end up with the heap (or
      nursery) placed on one node.

jbb2005 and SPECPower_ssj2008 are examples of such workloads.  In our
measurements, we have seen some cases where the performance with
UseNUMAInterleaving was 2.7x vs. the performance without. There were
gains of varying sizes across all systems.

The webrev is at
http://cr.openjdk.java.net/~tdeneau/UseNUMAInterleaving/webrev.03/

Summary of changes in webrev.03 from webrev.02:

    * As suggested by Igor Veresov, reverts to using
      UseNUMAInterleaving as the enabling flag.  This will make it
      easier in the future when there are GCs that enable fuller
      UseNUMA on Windows.

    * Adds a simple implementation of UseNUMAInterleaving on Linux and
      Solaris, which just calls numa_make_global after commit_memory
      and reserve_memory_special

    * Adds a flag NUMAInterleaveGranularity which allows setting the
      granularity with which we move to a different node in a memory
      allocation.  The default is 2MB.  This flag only applies to
      Windows for now.

    * Several code cleanups in os_windows.cpp suggested by Igor.

Summary of changes in os_windows.cpp:

    * Some static routines were added to set things up init time.  These
       * check that the required APIs (VirtualAllocExNuma,
         GetNumaHighestNodeNumber, GetNumaNodeProcessorMask) exist in
         the OS

       * build the list of numa nodes on which this process has affinity

    * Changes to os::reserve_memory
       * There was already a routine that reserved pages one page at a
         time (used for Individual Large Page Allocation on WS2003).
         This was abstracted to a separate routine, called
         allocate_pages_individually.  This gets called both for the
         Individual Large Page Allocation thing mentioned above and for
         UseNUMAInterleaving (for both small and large pages)

       * When used for NUMA Interleaving this just goes thru the numa
         node list in a round-robin fashion, using a different one for
         each chunk (with 4K pages, the minimum allocation granularity
         is 64K, with 2M pages it is 1 Page)

       * Whether we do just a reserve or a combined reserve/commit is
         determined by the caller of allocate_pages_individually

          * When used with large pages, we do a Reserve and Commit at
            the same time which is the way it always worked and the way
            it has to work on windows.

          * For small pages, only the reserve is done, the commit will
            come later. (which is the way it worked for
            non-interleaved)

    * os::commit_memory changes
       * If UseNUMAIntereaving is true, os::commit_memory has to check
         whether it was being asked to commit memory that might have
         come from multiple Reserve allocations, if so, the commits
         must also be broken up.  We don't keep any data structure to
         keep track of this, we just use VirtualQuery which queries the
         properties of a VA range and can tell us how much came from
         one VirtualAlloc call.

I do not have a bug id for this.

-- Tom Deneau, AMD