Review Request: UseNUMAInterleaving #4

Tue Aug 23 23:50:50 UTC 2011

Hi Tom -- the perf improvement on windows is impressive.

The changes look good. Just a few very minor nits below:

globals.hpp: In the doc string field for NUMAInterleaveGranularity, you
might state that this is a Windows only option. (although i recognize
that this hasn't been done for some of the other windows options that
i became aware of now as being used exclusively in windows before
your changes, for instance: UseLargePagesIndividualAllocation
and LargePagesIndividualAllocationInjectError).

arguments.cpp: you could get rid of the empty lines 1432-1433, and move the
content of 1428-1430 into the if-scope of 1422-1426.

os_windows.cpp: you can probably get rid of the extra newline
introduced at line 1967.

line 3018, typo: "NUMAInterleavaing"
also at line 3033: "thNUMANodeListHolderat"
The comment at lines 3030-3033 would also benefit
from a few missing punctuation marks.

at lines 3040 and 3043, it might read better to place the returns
on lines of their own.

If you run with +UseNUMAInterleaving and a commit failed,
it would seem that the error message at line 2987 would be
confusing and incorrect. Perhaps you want to suitably modify
it or just suppress the additional text in that case.

os_solaris.cpp: 2780-2784, it might make sense to do the madvise global/many
call only if the mmap_chunk() succeeds, rather than all the time as you
are doing. May be something like:-

2780   char *res = Solaris::mmap_chunk(addr, size, MAP_PRIVATE|MAP_FIXED, prot);
2781   if (res != NULL) {
          if (UseNUMAInterleaving) {
2782       numa_make_global(res, size);
          }
          return true;
2783   }
2784   return false;

At line 3444, would it make sense to use "size" instead of "bytes" (although
size is just a copy of bytes -- i don't understand the reason for making
the copy, so feel free to ignore if this is some recherche style issue; otherwise
it might make sense to get rid of the copy and just use the formal parameter as is
the case for the Linux code; although this is really not code that you introduced,
but just because you happen to be touching code in the vicinity... your choice.)

In the same vein, i'd make the Linux code similar in shape to
the solaris code for the two hunks changed in os_linux.cpp.

rest looks good.
-- ramki

On 08/23/11 12:59, Deneau, Tom wrote:
> OK, http://cr.openjdk.java.net/~tdeneau/UseNUMAInterleaving/webrev.05/
> should address the concerns listed below...
> 
> -- Tom
> 
> 
>> -----Original Message-----
>> From: Igor Veresov [mailto:igor.veresov at oracle.com]
>> Sent: Tuesday, August 23, 2011 1:53 PM
>> To: Deneau, Tom
>> Cc: hotspot-gc-dev at openjdk.java.net
>> Subject: Re: Review Request: UseNUMAInterleaving #4
>>
>>  Tom,
>>
>> This looks good to me, except three minor things:
>>
>> os_windows.cpp:
>>
>> - you should check for null here:
>> 2630 ~NUMANodeListHolder() {
>>> if (_numa_used_node_list != NULL) {
>> 2631 FREE_C_HEAP_ARRAY(int, _numa_used_node_list);
>>> }
>> 2632 }
>>
>> - if NUMANodeListHolder::build() will be called multiple times, you'll
>> leak memory. I guess you should check if _numa_used_node_list is NULL and
>> if not free it first.
>>
>> - you didn't modify os::numa_get_leaf_groups() to handle the situation
>> when the value of argument "size" is bigger than
>> NUMANodeListHolder::get_count(). You can use MIN2 to adjust the value.
>> See my comment in the previous mail.
>>
>>
>> igor
>>
>> On Tuesday, August 23, 2011 at 11:23 AM, Deneau, Tom wrote:
>>
>>> Please review this patch which adds a new flag called
>>> UseNUMAInterleaving. This flag provides a subset of the functionality
>>> provided by UseNUMA. In Hotspot UseNUMA terminology,
>>> UseNUMAInterleaved makes all memory "numa_global" which is implemented
>>> as interleaved. This patch's main purpose is to provide that subset
>>> on OSes like Windows which do not support the full UseNUMA
>>> functionality. However, a simple implementation of UseNUMAInterleaving
>> is
>>> also provided for other OSes
>>>
>>> The situations where this shows the biggest benefits would be:
>>>  * Windows platforms with multiple numa nodes (eg, 4)
>>>
>>>  * The JVM process is run across all the nodes (not affinitized to
>>>  one node).
>>>
>>>  * A workload that has enough threads so that it uses the majority
>>>  of the cores in the machine, so that the heap is being accessed
>>>  from many cores, including remote ones.
>>>
>>>  * Enough memory per node and a heap size such that the default heap
>>>  placement policy on windows would end up with the heap (or
>>>  nursery) placed on one node.
>>>
>>> jbb2005 and SPECPower_ssj2008 are examples of such workloads. In our
>>> measurements, we have seen some cases where the performance with
>>> UseNUMAInterleaving was 2.7x vs. the performance without. There were
>>> gains of varying sizes across all systems.
>>>
>>> The webrev is at
>>> http://cr.openjdk.java.net/~tdeneau/UseNUMAInterleaving/webrev.04/
>>>
>>> Summary of changes in webrev.04 from webrev.03:
>>>
>>>  * As suggested by Igor Veresov, UseNUMA can imply
>>>  UseNUMAInterleaving on all platforms. This is in arguments.cpp
>>>
>>>  * In NUMANodeListHolder in os_windows.cpp, allocates the node_list
>>>  dynamically rather than assuming a length of 64. The method
>>>  NUMANodeListHolder::get_node_list_entry checks returns -1 for
>>>  indexes that are out of bounds.
>>>
>>>  * Several code convention cleanups suggested by Igor.
>>>
>>>  * Merge with the new style system dll function resolutions from
>>>  "7016797: Hotspot: securely/restrictive load dlls and new API for
>>>  loading system dlls" Note: my new NUMA functions are outside the
>> ifdefs.
>>>
>>> Summary of changes in webrev.03 from webrev.02:
>>>
>>>  * As suggested by Igor Veresov, reverts to using
>>>  UseNUMAInterleaving as the enabling flag. This will make it
>>>  easier in the future when there are GCs that enable fuller
>>>  UseNUMA on Windows.
>>>
>>>  * Adds a simple implementation of UseNUMAInterleaving on Linux and
>>>  Solaris, which just calls numa_make_global after commit_memory
>>>  and reserve_memory_special
>>>
>>>  * Adds a flag NUMAInterleaveGranularity which allows setting the
>>>  granularity with which we move to a different node in a memory
>>>  allocation. The default is 2MB. This flag only applies to
>>>  Windows for now.
>>>
>>>  * Several code cleanups in os_windows.cpp suggested by Igor.
>>>
>>>
>>> Summary of overall changes in os_windows.cpp:
>>>
>>>  * Some static routines were added to set things up init time. These
>>>  * check that the required APIs (VirtualAllocExNuma,
>>>  GetNumaHighestNodeNumber, GetNumaNodeProcessorMask) exist in
>>>  the OS
>>>
>>>  * build the list of numa nodes on which this process has affinity
>>>
>>>  * Changes to os::reserve_memory
>>>  * There was already a routine that reserved pages one page at a
>>>  time (used for Individual Large Page Allocation on WS2003).
>>>  This was abstracted to a separate routine, called
>>>  allocate_pages_individually. This gets called both for the
>>>  Individual Large Page Allocation thing mentioned above and for
>>>  UseNUMAInterleaving (for both small and large pages)
>>>
>>>  * When used for NUMA Interleaving this just goes thru the numa
>>>  node list in a round-robin fashion, allocating chunks at the
>>>  NUMAInterleaveGranularity using a different allocation for
>>>  each chunk
>>>
>>>  * Whether we do just a reserve or a combined reserve/commit is
>>>  determined by the caller of allocate_pages_individually
>>>
>>>  * When used with large pages, we do a Reserve and Commit at
>>>  the same time which is the way it always worked and the way
>>>  it has to work on windows.
>>>
>>>  * For small pages, only the reserve is done, the commit will
>>>  come later. (which is the way it worked for
>>>  non-interleaved)
>>>
>>>  * os::commit_memory changes
>>>  * If UseNUMAIntereaving is true, os::commit_memory has to check
>>>  whether it was being asked to commit memory that might have
>>>  come from multiple Reserve allocations, if so, the commits
>>>  must also be broken up. We don't keep any data structure to
>>>  keep track of this, we just use VirtualQuery which queries the
>>>  properties of a VA range and can tell us how much came from
>>>  one VirtualAlloc call.
>>>
>>> I do not have a bug id for this.
>>>
>>> -- Tom Deneau, AMD
>>
>