Review Request: UseNUMAInterleaving #3

Sat Aug 20 05:53:10 UTC 2011

Tom, this looks better, a few minor things:

* os_solaris.cpp, os_linux.cpp:
- There're some problems with indentation in os_linux.cpp and os_solaris.cpp, please follow the style (two spaces).
- I think we can set UseNUMAInterleaving when UseNUMA is set, it wouldn't hurt the numa allocator in PS but will benefit other collectors that are not numa-aware.
 May be put this into arguments.cpp, since that behavior will be universal across platforms.

* os_solaris.cpp
- in os::commit_memory(char* addr, size_t bytes, size_t alignment_hint, bool exec):
2811 if (UseNUMAInterleaving) { 
2812 numa_make_global(addr, bytes); 
2813 }

This is not necessary. This function in calls the other commit_memory(), the one without the alignment hint (unlike on linux) and you already call numa_make_global() there.

- in os::reserve_memory_special()
3439 if (UseNUMAInterleaving) { 
3440 numa_make_global(addr, bytes); 
3441 }
You should use retAddr instead of addr. The addr param is ignored there at all.

* os_windows.cpp
- two spaces indent please.
- sometimes you don't put spaces around infix operators.
- we typically don't prefix class names with "C": CNUMANodeListHolder
- variables are usually lower case with underscore separating words:
2744 } NUMANodeListHolder
3162 size_t BytesRemaining = bytes;
3163 char * NextAllocAddr = addr;
3165 MEMORY_BASIC_INFORMATION allocInfo; 
3167 size_t BytesToRq = MIN2(BytesRemaining, allocInfo.RegionSize); 
- class data members should start with underscore:
2719 int numa_used_node_list[64]; 
2720 int numa_used_node_count;
os_windows.cpp definitely doesn't set a good style example, sorry...

- would it be easier not to have UseNUMAForced and just let it set lgrp_id during thread init (it's currently a no op on windows anyway)? Also, in the future, when 
windows fully support numa allocation it will be a required behavior anyway.

- here, in case size is bigger than the number of nodes in the node list it should be adjusted:
3254 size_t os::numa_get_leaf_groups(int *ids, size_t size) {
>  size = MIN2(size, (size_t)NUMANodeListHolder.get_count()); 
3255 for (int i = 0; i < size; i++) {
3256 ids[i] = NUMANodeListHolder.get_node_list_entry(i); 
3257 } 
3258 return size; 
3259 }

- in CNUMANodeListHolder::build(), is it guaranteed that we don't overflow numa_used_node_list? If not, can we stop adding to it if the index >=64? If yes, can we put a guarantee or assert? Can 64 be a named constant?
- CNUMANodeListHolder probably needs a constructor that would set numa_used_count to 0 (I know in your case it's a global, but it feels safer to have it initialized).
- in numa_interleaving_init() WARN need to be undef'ed.

igor

On Friday, August 19, 2011 at 3:47 PM, Deneau, Tom wrote:

> Please review this patch which adds a new flag called
> UseNUMAInterleaving. This flag provides a subset of the functionality
> provided by UseNUMA. In Hotspot UseNUMA terminology,
> UseNUMAInterleaved makes all memory "numa_global" which is implemented
> as interleaved. This patch's main purpose is to provide that subset
> on OSes like Windows which do not support the full UseNUMA
> functionality. However, a simple implementation of UseNUMAInterleaving is
> also provided for other OSes
> 
> The situations where this shows the biggest benefits would be:
>  * Windows platforms with multiple numa nodes (eg, 4)
> 
>  * The JVM process is run across all the nodes (not affinitized to one
> node).
> 
>  * A workload that has enough threads so that it uses the majority
>  of the cores in the machine, so that the heap is being accessed
>  from many cores, including remote ones.
> 
>  * Enough memory per node and a heap size such that the default heap
>  placement policy on windows would end up with the heap (or
>  nursery) placed on one node.
> 
> jbb2005 and SPECPower_ssj2008 are examples of such workloads. In our
> measurements, we have seen some cases where the performance with
> UseNUMAInterleaving was 2.7x vs. the performance without. There were
> gains of varying sizes across all systems.
> 
> The webrev is at
> http://cr.openjdk.java.net/~tdeneau/UseNUMAInterleaving/webrev.03/
> 
> Summary of changes in webrev.03 from webrev.02:
> 
>  * As suggested by Igor Veresov, reverts to using
>  UseNUMAInterleaving as the enabling flag. This will make it
>  easier in the future when there are GCs that enable fuller
>  UseNUMA on Windows.
> 
>  * Adds a simple implementation of UseNUMAInterleaving on Linux and
>  Solaris, which just calls numa_make_global after commit_memory
>  and reserve_memory_special
> 
>  * Adds a flag NUMAInterleaveGranularity which allows setting the
>  granularity with which we move to a different node in a memory
>  allocation. The default is 2MB. This flag only applies to
>  Windows for now.
> 
>  * Several code cleanups in os_windows.cpp suggested by Igor.
> 
> Summary of changes in os_windows.cpp:
> 
>  * Some static routines were added to set things up init time. These
>  * check that the required APIs (VirtualAllocExNuma,
>  GetNumaHighestNodeNumber, GetNumaNodeProcessorMask) exist in
>  the OS
> 
>  * build the list of numa nodes on which this process has affinity
> 
>  * Changes to os::reserve_memory
>  * There was already a routine that reserved pages one page at a
>  time (used for Individual Large Page Allocation on WS2003).
>  This was abstracted to a separate routine, called
>  allocate_pages_individually. This gets called both for the
>  Individual Large Page Allocation thing mentioned above and for
>  UseNUMAInterleaving (for both small and large pages)
> 
>  * When used for NUMA Interleaving this just goes thru the numa
>  node list in a round-robin fashion, using a different one for
>  each chunk (with 4K pages, the minimum allocation granularity
>  is 64K, with 2M pages it is 1 Page)
> 
>  * Whether we do just a reserve or a combined reserve/commit is
>  determined by the caller of allocate_pages_individually
> 
>  * When used with large pages, we do a Reserve and Commit at
>  the same time which is the way it always worked and the way
>  it has to work on windows.
> 
>  * For small pages, only the reserve is done, the commit will
>  come later. (which is the way it worked for
>  non-interleaved)
> 
>  * os::commit_memory changes
>  * If UseNUMAIntereaving is true, os::commit_memory has to check
>  whether it was being asked to commit memory that might have
>  come from multiple Reserve allocations, if so, the commits
>  must also be broken up. We don't keep any data structure to
>  keep track of this, we just use VirtualQuery which queries the
>  properties of a VA range and can tell us how much came from
>  one VirtualAlloc call.
> 
> I do not have a bug id for this.
> 
> -- Tom Deneau, AMD