Review Request: UseNUMAInterleaving

Wed Aug 17 23:16:52 UTC 2011

On Wednesday, August 17, 2011 at 3:51 PM, Deneau, Tom wrote:
> Igor --
> 
> Regarding your comment #5 below:
> 
> > 5. What is the typical allocation granularity on windows? Wouldn't that
> > be a problem if we tried to allocate a large heap with small interleaved
> > pages? Have you tried using larger interleaving granularity for modern
> > windows versions? Doing a syscall and creating a segment per even a large
> > page seems a bit excessive. If you did try that, was there any difference?
> 
> The allocation granularity for 4K pages on Windows is 64K (and for 2M pages it is 1 page).
> I didn't take any precise measurements of how long it took to allocate an interleaved heap at this granularity, but I didn't perceive startup slowdowns when allocating a 12G heap. I can try to get some actual measurements. I didn't try any different granularities. Do you think it's worth making the granularity a command line parameter?

Hm, I don't know, but in your example with the 12G heap, a 64K allocation granularity means you'll have to make 196608 syscalls during startup (12G / 64K), which seems like a lot. So, yeah, it could make sense to add a parameter that allows the granularity to be increased, and to set it to a saner default value, at least for the case of small pages.
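
To make the arithmetic concrete, here is a minimal sketch of the call count for a given heap size and granularity. The helper name chunk_count and the constants are made up for illustration and are not part of the patch; it only assumes one reservation call per chunk, as described in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Size constants used only for this illustration.
constexpr std::uint64_t K = 1024;
constexpr std::uint64_t M = 1024 * K;
constexpr std::uint64_t G = 1024 * M;

// Hypothetical helper: number of per-chunk reservation calls needed to
// cover heap_bytes, rounding up when the heap is not a whole number of chunks.
constexpr std::uint64_t chunk_count(std::uint64_t heap_bytes,
                                    std::uint64_t chunk_bytes) {
  return (heap_bytes + chunk_bytes - 1) / chunk_bytes;
}

// 12G heap at the 64K small-page allocation granularity.
static_assert(chunk_count(12 * G, 64 * K) == 196608, "64K chunks");
// Same heap with one call per 2M large page.
static_assert(chunk_count(12 * G, 2 * M) == 6144, "2M chunks");
```

So even one call per 2M page is a 32x reduction over 64K chunks for the same heap.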

igor

> 
> -- Tom
> 
> 
> 
> 
> > -----Original Message-----
> > From: Igor Veresov [mailto:igor.veresov at oracle.com]
> > Sent: Monday, August 08, 2011 1:43 PM
> > To: hotspot-gc-dev at openjdk.java.net; Deneau, Tom
> > Subject: Re: Review Request: UseNUMAInterleaving
> > 
> > Hi, Tom!
> > 
> > Sorry it took me so long to get to that.
> > 
> > 1. I don't think the new version of flag usage is prudent. The reason I
> > proposed to introduce a new flag for interleaving is that it would make
> > life easier in the future when proper NUMA-aware implementations of
> > GCs are added (G1 would be the most probable candidate). I would propose
> > to still have a UseNUMAInterleaving flag.
> > 
> > The usage would be as follows:
> > - If UseNUMA is specified on Windows that would turn on UseNUMAInterleaving
> > (for the time being, and that behavior would change in the future).
> > - If UseNUMAInterleaving is specified on the command line, you just do
> > the interleaving. If you don't add this flag now, you'll have to do that
> > anyway as soon as NUMA-aware GCs start supporting windows.
> > 
> > 2. I guess the accepted coding convention in hotspot is that "else"
> > should have the closing and opening braces on the same line.
> > 2846 }
> > 2847 else {
> > And in all other places...
> > 
> > 
> > 3. Did you forget to remove that?
> > 3149 // tty->print("VirtualQuery AllocBase=%p, RegionSize=%Id\n",
> > allocInfo.AllocationBase, allocInfo.RegionSize);
> > 
> > 4. Does it make sense to pass UseLargePages and UseNUMAInterleaving to
> > allocate_pages_individually()? They are global variables anyway.
> > 
> > 5. What is the typical allocation granularity on windows? Wouldn't that
> > be a problem if we tried to allocate a large heap with small interleaved
> > pages? Have you tried using larger interleaving granularity for modern
> > > windows versions? Doing a syscall and creating a segment per even a large
> > > page seems a bit excessive. If you did try that, was there any difference?
> > 
> > 6. The usage of "result" doesn't seem right here, did you mean "if
> > (!result) return false;" ?
> > 3129 bool result = VirtualAlloc(addr, bytes, MEM_COMMIT,
> > PAGE_READWRITE) != 0;
> > 3130 if (result == NULL) return false;
> > 
> > 7. Wouldn't it be nicer instead of the idiom
> >  BOOL ok = SysCall();
> >  if (!ok) return false;
> > just to say
> >  if (!SysCall()) return false;
> > ?
> > 
> > 8. Instead of introducing a global variable numa_used_node_count, could
> > you implement os::numa_get_groups_num() that was intended to return this
> > number?
> > Also build_numa_used_node_list() seems to have the same functionality
> > as os::numa_get_leaf_groups() was intended to have. Could you implement
> > it and use it instead?
> > 
> > Please name function parameters in lower case with words separated with
> > underscores. I know that there are exceptions, especially in
> > os_windows.cpp, but it's better if we stick to the general convention.
> > 
> > 
> > igor
> > 
> > 
> > 
> > On 5/26/11 4:37 PM, Deneau, Tom wrote:
> > > I have incorporated the change suggested by Paul Hohensee to just use
> > > the existing UseNUMA flag rather than introduce a new flag. Please let
> > > me know when you think this will be able to be checked in...
> > > 
> > > The new webrev is at
> > > http://cr.openjdk.java.net/~tdeneau/UseNUMAInterleaving/webrev.02/
> > > 
> > > -- Tom Deneau, AMD
> > > 
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Deneau, Tom
> > > > Sent: Monday, May 16, 2011 12:54 PM
> > > > To: 'hotspot-compiler-dev at openjdk.java.net'
> > > > Subject: Review Request: UseNUMAInterleaving
> > > > 
> > > > Please review this patch which adds a new flag called
> > > > UseNUMAInterleaving. This flag provides a subset of the functionality
> > > > provided by UseNUMA, and its main purpose is to provide that subset on
> > > > OSes like Windows which do not support the full UseNUMA functionality.
> > > > In UseNUMA terminology, UseNUMAInterleaving makes all memory
> > > > "numa_global" which is implemented as interleaved.
> > > > 
> > > > The situations where this shows the biggest benefits would be:
> > > >  * Windows platforms with multiple numa nodes (eg, 4)
> > > > 
> > > >  * The JVM process is run across all the nodes (not affinitized to one
> > > >    node).
> > > > 
> > > >  * A workload that uses the majority of the cores in the machine, so
> > > >    that the heap is being accessed from many cores, including remote
> > > >    ones.
> > > > 
> > > >  * Enough memory per node and a heap size such that the default heap
> > > >    placement policy on windows would end up with the heap (or
> > > >    nursery) placed on one node.
> > > > 
> > > > jbb2005 and SPECPower_ssj2008 are examples of such workloads. In our
> > > > measurements, we have seen some cases where the performance with
> > > > UseNUMAInterleaving was 2.7x vs. the performance without. There were
> > > > gains of varying sizes across all systems.
> > > > 
> > > > As currently implemented this flag is ignored on Linux and Solaris
> > > > since they already support the full UseNUMA flag.
> > > > 
> > > > The webrev is at
> > > > http://cr.openjdk.java.net/~tdeneau/UseNUMAInterleaving/webrev.01/
> > > > 
> > > > Summary of changes:
> > > > 
> > > >  * Other than adding the new UseNUMAInterleaving global flag, all of
> > > >    the changes are in src/os/windows/vm/os_windows.cpp
> > > > 
> > > >  * Some static routines were added to set things up at init time. These
> > > >     * check that the required APIs (VirtualAllocExNuma,
> > > >       GetNumaHighestNodeNumber, GetNumaNodeProcessorMask) exist in
> > > >       the OS
> > > > 
> > > >     * build the list of numa nodes on which this process has affinity
> > > > 
> > > >  * Changes to os::reserve_memory
> > > >     * There was already a routine that reserved pages one page at a
> > > >       time (used for Individual Large Page Allocation on WS2003).
> > > >       This was abstracted to a separate routine, called
> > > >       allocate_pages_individually. This gets called both for the
> > > >       Individual Large Page Allocation thing mentioned above and for
> > > >       UseNUMAInterleaving (for both small and large pages)
> > > > 
> > > >     * When used for NUMA Interleaving this just goes thru the numa
> > > >       node list in a round-robin fashion, using a different one for
> > > >       each chunk (with 4K pages, the minimum allocation granularity
> > > >       is 64K, with 2M pages it is 1 Page)
> > > > 
> > > >     * Whether we do just a reserve or a combined reserve/commit is
> > > >       determined by the caller of allocate_pages_individually
> > > > 
> > > >        * When used with large pages, we do a Reserve and Commit at
> > > >          the same time which is the way it always worked and the way
> > > >          it has to work on windows.
> > > > 
> > > >        * For small pages, only the reserve is done, the commit will
> > > >          come later. (which is the way it worked for
> > > >          non-interleaved)
> > > > 
> > > >  * os::commit_memory changes
> > > >     * If UseNUMAInterleaving is true, os::commit_memory has to check
> > > >       whether it is being asked to commit memory that might have
> > > >       come from multiple Reserve allocations; if so, the commits
> > > >       must also be broken up. We don't keep any data structure to
> > > >       keep track of this, we just use VirtualQuery which queries the
> > > >       properties of a VA range and can tell us how much came from
> > > >       one VirtualAlloc call.
> > > > 
> > > > I do not have a bug id for this.
> > > > 
> > > > -- Tom Deneau, AMD
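
For readers following the thread, the round-robin placement described in the summary of changes can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the webrev: the Chunk struct and plan_interleaved_chunks are hypothetical names, and the sketch only plans which node each chunk would go to. The real patch would issue one reservation per chunk through VirtualAllocExNuma, which is not called here so that the example stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One chunk of the reserved range and the NUMA node chosen for it.
struct Chunk {
  std::size_t offset;  // byte offset from the start of the range
  std::size_t bytes;   // chunk size (the last chunk may be shorter)
  int node;            // NUMA node id taken from the used-node list
};

// Hypothetical helper: split [0, total_bytes) into chunk_bytes-sized pieces
// and assign nodes from used_nodes in round-robin order. With 4K pages the
// thread puts chunk_bytes at 64K; with 2M pages it is one page.
std::vector<Chunk> plan_interleaved_chunks(std::size_t total_bytes,
                                           std::size_t chunk_bytes,
                                           const std::vector<int>& used_nodes) {
  std::vector<Chunk> plan;
  if (chunk_bytes == 0 || used_nodes.empty()) return plan;
  std::size_t next = 0;  // index of the node to use for the next chunk
  for (std::size_t off = 0; off < total_bytes; off += chunk_bytes) {
    const std::size_t remaining = total_bytes - off;
    const std::size_t bytes = remaining < chunk_bytes ? remaining : chunk_bytes;
    plan.push_back({off, bytes, used_nodes[next]});
    next = (next + 1) % used_nodes.size();
  }
  return plan;
}
```

Because each chunk comes from its own reservation call, a later commit that spans several chunks has to be split along the same boundaries, which is the case the os::commit_memory change above handles with VirtualQuery.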