linux os processor optimizations for OpenJDK GC performance enhancement

Tue Apr 25 17:27:33 UTC 2017

Hi Volker,

This proposal is complementary to large page and NUMA support. Below I
describe the typical processor memory access hierarchy, the role each
feature plays in it, and other discussion points.

*Typical Processor Memory Access Hierarchy*
*Translation Lookaside Buffer (TLB)*
-- The TLB caches virtual-to-physical address translations in hardware.
-- Large page support lets each TLB entry cover more memory, so TLB entries
are less likely to be exhausted, which leads to better performance.
-- Note: When the JVM runs on a hardware-virtualized platform such as KVM or
VMware, there are two levels of address translation (guest-virtual to
guest-physical, then guest-physical to host-physical).
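
To make the large page point concrete, here is a rough sketch of "TLB reach", the amount of memory the TLB can map at once. The entry count and page sizes are assumed values for illustration, not figures for any particular CPU.

```python
# Illustrative only: ENTRIES and the page sizes are assumptions,
# not measurements from a specific processor.
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when each TLB entry maps one page."""
    return entries * page_size


ENTRIES = 1536  # assumed TLB size

small = tlb_reach(ENTRIES, 4 * KIB)  # regular pages
large = tlb_reach(ENTRIES, 2 * MIB)  # large pages

print(f"4 KiB pages: {small // MIB} MiB reachable")
print(f"2 MiB pages: {large // MIB} MiB reachable")
```

With the same number of entries, the larger page size covers 512 times more memory, which is why a big Java heap is less likely to exhaust the TLB.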

*Cache Hierarchy -- Scope of this proposal*

*System Memory*
-- Modern multi-socket machines typically have non-uniform memory access
(NUMA): not all memory is equidistant from each socket.
-- NUMA support makes sure that memory local to a socket is used to the
extent possible, which leads to better performance.

*Your concern with NUMA <-> large page interaction*
I can see your concern about the following remark in JEP [1] on NUMA <->
large page interaction:
>> When using large pages, where multiple regions map to the same physical
>> page, things get a bit complicated. For now, we will finesse this by
>> disabling NUMA optimizations as soon as the page size exceeds some small
>> multiple of region size (say 4), and deal with the more general case in a
>> separate later phase.
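
The rule in the quoted remark can be sketched as a simple predicate. This is only my reading of the JEP text, not HotSpot code; the function name and the example page and region sizes are hypothetical.

```python
MIB = 1024 * 1024


def numa_optimizations_enabled(page_size: int, region_size: int,
                               max_multiple: int = 4) -> bool:
    """Return False once the page size exceeds max_multiple * region_size.

    max_multiple=4 mirrors the "(say 4)" in the quoted JEP remark.
    """
    return page_size <= max_multiple * region_size


# Hypothetical configurations, chosen only to show both outcomes.
print(numa_optimizations_enabled(2 * MIB, 1 * MIB))      # 2 MiB page, 1 MiB region
print(numa_optimizations_enabled(1024 * MIB, 32 * MIB))  # 1 GiB page, 32 MiB region
```

The first case stays within the multiple and keeps NUMA optimizations on; the second exceeds it and would fall back to the non-NUMA path.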

*Other ways to handle NUMA*
Use Linux numactl -- https://linux.die.net/man/8/numactl -- "numactl runs
processes with a specific NUMA scheduling or memory placement policy. The
policy is set for command and inherited by all of its children. In addition
it can set persistent policy for shared memory segments or files."

NUMA topology awareness (leveraging Linux numactl) is supported by
orchestration systems such as OpenStack, Kubernetes etc.; see
http://redhatstackblog.redhat.com/2015/05/05/cpu-pinning-and-numa-topology-awareness-in-openstack-compute/
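
As a rough illustration of the numactl approach, the sketch below builds a command line that binds a JVM's CPUs and memory to one NUMA node using numactl's --cpunodebind and --membind options. The helper name, node number and JVM arguments are hypothetical; it only constructs the command and does not run it.

```python
import shlex
from typing import List, Sequence


def numactl_command(node: int, jvm_args: Sequence[str]) -> List[str]:
    """Build an argv that runs `java` with CPUs and memory bound to one node."""
    return [
        "numactl",
        f"--cpunodebind={node}",  # schedule threads only on this node's CPUs
        f"--membind={node}",      # allocate memory only from this node
        "java",
        *jvm_args,
    ]


# Hypothetical JVM arguments, for illustration only.
cmd = numactl_command(0, ["-Xmx8g", "-jar", "app.jar"])
print(shlex.join(cmd))
```

Because the policy is inherited by child processes, binding the launcher in this way covers all JVM threads, including GC threads.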

The caveats are: 1) the JVM should not require more resources than a single
socket can provide in terms of CPU, memory and PCIe I/O; 2) there may be
resource fragmentation depending on the JVM resource request pattern. These
are typically not a problem on modern server-class CPUs.
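
Both caveats can be illustrated with a small placement check. This is a sketch under assumed numbers, with hypothetical type and function names; it considers only CPU and memory and leaves out PCIe I/O.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeFree:
    """Free resources on one NUMA node (socket). Values are assumptions."""
    cpus: int
    memory_gib: int


@dataclass
class JvmRequest:
    cpus: int
    memory_gib: int


def first_fitting_node(nodes: List[NodeFree], req: JvmRequest) -> Optional[int]:
    """Return the index of the first node that can host the whole request."""
    for index, node in enumerate(nodes):
        if node.cpus >= req.cpus and node.memory_gib >= req.memory_gib:
            return index
    return None


# Two sockets with 4 free CPUs each: 8 CPUs are free in total, but a
# 6-CPU JVM fits on neither socket (fragmentation, caveat 2).
nodes = [NodeFree(cpus=4, memory_gib=64), NodeFree(cpus=4, memory_gib=64)]
print(first_fitting_node(nodes, JvmRequest(cpus=6, memory_gib=32)))
print(first_fitting_node(nodes, JvmRequest(cpus=4, memory_gib=32)))
```

The first request finds no single node even though the machine as a whole has enough free CPUs; the second fits on node 0.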

Hope this clarifies.

Thanks,
Ramki

On Tue, Apr 25, 2017 at 1:57 AM, Volker Simonis <volker.simonis at gmail.com>
wrote:

> Hi Ram,
>
> while this sounds interesting, I wonder how this plays together with
> NUMA and Large page support. I understand that these are different
> concepts, but in the end it all boils down to the fact that memory
> access is not uniform and we have different "kinds" of memory. It
> seems to me that this fact is currently not very well handled in
> HotSpot and needs some general redesign. There are for example two
> JEPs [1,2] about improving the NUMA support in general and in G1. One
> of the problems is that NUMA support doesn't play well together with
> Large/Huge page support.
>
> I think your proposal must be evaluated in the broader context of
> enhancing the VM and GC for non-uniform memory architectures.
> Otherwise it would be yet another point fix which doesn't play well
> together with other features like NUMA and LargePages.
>
> Thanks,
> Volker
>
> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable
> NUMA Mode by Default When Appropriate)
> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC:
> NUMA-Aware Allocation)
>
> On Wed, Apr 19, 2017 at 4:04 PM, Ram Krishnan <ramkri123 at gmail.com> wrote:
> > Many thanks David.
> >
> > Thanks,
> > Ramki
> >
> > On Tue, Apr 18, 2017 at 11:08 PM, David Holmes <david.holmes at oracle.com>
> > wrote:
> >
> >> On 19/04/2017 11:38 AM, Ram Krishnan wrote:
> >>
> >>> Hi David,
> >>>
> >>> Many thanks, please find attached text version of document for
> >>> temporary hosting.
> >>>
> >>
> >> Hosted at: http://cr.openjdk.java.net/~dholmes/JEP-cache-partitioning-v1.txt
> >>
> >> David
> >>
> >>
> >>> Thanks,
> >>> Ramki
> >>>
> >>> On Tue, Apr 18, 2017 at 5:42 PM, David Holmes
> >>> <david.holmes at oracle.com> wrote:
> >>>
> >>>     Hi Ramki,
> >>>
> >>>     On 19/04/2017 8:27 AM, Ram Krishnan wrote:
> >>>
> >>>         Hi David,
> >>>
> >>>         Thanks for the clarification.
> >>>
> >>>         I have signed the OCA and mailed it to
> >>>         oracle-ca_us(at)oracle.com. Any help to expedite processing
> >>>         would be much appreciated.
> >>>
> >>>
> >>>     Can't help with that I'm afraid. :)
> >>>
> >>>         We are seeing promising POC results (details in the google doc)
> >>>         for this proposal -- would really appreciate your help in moving
> >>>         this forward.
> >>>
> >>>
> >>>     If you email me a text/html version of the document I can host it
> >>>     on cr.openjdk.java.net temporarily. For this to become a JEP you
> >>>     will need a sponsor with the necessary OpenJDK credentials.
> >>>
> >>>     http://cr.openjdk.java.net/~mr/jep/jep-2.0-02.html
> >>>
> >>>     Cheers,
> >>>     David
> >>>
> >>>
> >>>         Thanks,
> >>>         Ramki
> >>>
> >>>         On Tue, Apr 18, 2017 at 1:55 PM, David Holmes
> >>>         <david.holmes at oracle.com> wrote:
> >>>
> >>>             Hi Ramki,
> >>>
> >>>             On 19/04/2017 12:34 AM, Ram Krishnan wrote:
> >>>
> >>>                 Please find detailed proposal below, looking forward to
> >>>                 your comments.
> >>>
> >>>                 "Minimize application tail latency using
> >>>                 cache-partitioning-aware G1GC" --
> >>>
> >>>                 https://docs.google.com/document/d/1rPMG4XUiE7cUEOogW1z5tBbBZTclOWyg0arhuycXN94/edit
> >>>
> >>>
> >>>             All contributions to OpenJDK need to be hosted on OpenJDK
> >>>             infrastructure not on external systems like the above.
> >>>
> >>>             Also I can not see you listed as an OCA signatory. Are you
> >>>             an OpenJDK contributor?
> >>>
> >>>             Thanks,
> >>>             David
> >>>             -----
> >>>
> >>>                 Thanks,
> >>>                 Ramki
> >>>
> >>>                 On Thu, Apr 13, 2017 at 11:04 PM, Bernd Eckenfels
> >>>                 <ecki at zusammenkunft.net> wrote:
> >>>
> >>>                     Maybe it would be better to concentrate the processor
> >>>                     optimizations on accessors and barriers without
> >>>                     introducing a completely new GC architecture. I can
> >>>                     imagine that especially in the area of NUMA, TLAB,
> >>>                     huge pages, cache consistency and possibly MMX
> >>>                     extensions there is some potential.
> >>>
> >>>                     Abandoning the global STW - while it seems like a
> >>>                     pretty powerful change - is I guess not a good starter
> >>>                     exercise. Especially since it is not only a question
> >>>                     of mutator threads.
> >>>
> >>>                     Gruss
> >>>                     Bernd
> >>>                     --
> >>>                     http://bernd.eckenfels.net
> >>>                     ------------------------------
> >>>                     *From:* hotspot-gc-dev
> >>>                     <hotspot-gc-dev-bounces at openjdk.java.net> on
> >>>                     behalf of Ram Krishnan <ramkri123 at gmail.com>
> >>>                     *Sent:* Friday, April 14, 2017 6:36:27 AM
> >>>                     *To:* Asif Qamar; Andrew Haley;
> >>>                     hotspot-gc-dev at openjdk.java.net
> >>>                     *Subject:* Re: linux os processor optimizations for
> >>>                     OpenJDK GC performance enhancement
> >>>
> >>>                     Thanks Andrew.
> >>>
> >>>                     >>Surely there is: a thread could have its TLAB
> >>>                     >>allocated from a region local to that socket (or
> >>>                     >>core), and the GC thread for that region could run
> >>>                     >>on the same socket.  It only works for young gen,
> >>>                     >>but that's a lot of the problem.
> >>>
> >>>
> >>>                     A clarification -- does the TLAB allocation apply to
> >>>                     tenured space also? If not, the above would work only
> >>>                     for young gen cases where there is no promotion to
> >>>                     tenured, right?
> >>>
> >>>                     Thanks,
> >>>                     Ramki
> >>>
> >>>                     On Thu, Apr 13, 2017 at 12:55 PM, Ram Krishnan
> >>>                     <ramkri123 at gmail.com> wrote:
> >>>
> >>>
> >>>                         ---------- Forwarded message ----------
> >>>                         From: Andrew Haley <aph at redhat.com>
> >>>                         Date: Thu, Apr 13, 2017 at 9:52 AM
> >>>                         Subject: Re: linux os processor optimizations for
> >>>                         OpenJDK GC performance enhancement
> >>>                         To: hotspot-gc-dev at openjdk.java.net
> >>>
> >>>
> >>>                         On 13/04/17 16:33, Kim Barrett wrote:
> >>>
> >>>                             An application thread may touch memory in any
> >>>                             region; there is no notion of a thread being
> >>>                             "scoped" to a specific set of regions. While
> >>>                             it might happen that a thread would only touch
> >>>                             regions not being worked on by the collector,
> >>>                             there is no a priori way to know that.
> >>>
> >>>
> >>>
> >>>                         Surely there is: a thread could have its TLAB
> >>>                         allocated from a region local to that socket (or
> >>>                         core), and the GC thread for that region could run
> >>>                         on the same socket.  It only works for young gen,
> >>>                         but that's a lot of the problem.
> >>>
> >>>                         Andrew.
> >>>
> >>>
> >>>
> >>>
> >>>                         --
> >>>                         Thanks,
> >>>                         Ramki
> >>>
> >>>
> >>>
> >>>
> >>>                     --
> >>>                     Thanks,
> >>>                     Ramki
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>         --
> >>>         Thanks,
> >>>         Ramki
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Ramki
> >>>
> >>
> >
> >
> > --
> > Thanks,
> > Ramki
>

-- 
Thanks,
Ramki