Fwd: Unexpected results when enabling +UseNUMA for G1GC
Tal Goldstein
tgoldstein at outbrain.com
Thu Jan 21 17:39:04 UTC 2021
Hey Sangheon,
Thanks for your suggestions.
I answered your questions in-line.
Regarding your suggestion to increase the heap,
I've increased the heap size to 40GB and the container memory to 50GB,
and ran 2 deployments (NUMA and non-NUMA); each deployment has 1 pod which
runs on a dedicated physical k8s node (the same machines mentioned
previously).
After running it for several days I could see the following pattern:
during the hours of the day when throughput is at its peak,
the local memory access ratio of the NUMA deployment is much better than
that of the non-NUMA deployment (a 5%-6% difference).
This can be seen in the charts below:
1. Throughput Per deployment (Numa deployment vs Non-Numa deployment):
https://drive.google.com/file/d/1tG_Qm9MNHZbtmIiXryL8KGMyUk_vylVG/view?usp=sharing
2. Local memory ratio % (kube3-10769-prod-nydc1.nydc1.outbrain.com is the k8s
node WITH NUMA, kube3-10770 WITHOUT NUMA):
https://drive.google.com/file/d/1WmjBSPiwwMpXDX3MWsjQQN6vR3BLSro1/view?usp=sharing
From this I understand that the NUMA-based deployment behaves better under
a higher workload,
but what's still unclear to me is why the throughput of the non-NUMA
deployment is higher than that of the NUMA deployment?
Thanks,
Tal
On Mon, Jan 11, 2021 at 10:05 PM <sangheon.kim at oracle.com> wrote:
> Hi Tal,
> I added in-line comments.
> On 1/9/21 12:15 PM, Tal Goldstein wrote:
> > Hi Guys,
> > We're exploring the use of the flag -XX:+UseNUMA and its effect on G1 GC
> > in JDK 14.
> > For that, we've created a test that consists of 2 k8s deployments of some
> > service,
> > where deployment A has the UseNUMA flag enabled, and deployment B doesn't
> > have it.
> >
> > In order for NUMA to actually work inside the docker container, we also
> > needed to add numactl lib to the container (apk add numactl),
> > and in order to measure the local/remote memory access we've used
> > pcm-numa (https://github.com/opcm/pcm),
> > the docker is based on an image of Alpine Linux v3.11.
> >
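(For reference, this is roughly how the measurement is wired up inside the
container; the build and invocation details below are a sketch, so the exact
steps/paths may differ:)

apk add numactl      # libnuma in the container, needed for NUMA to work at all
# pcm-numa is built from https://github.com/opcm/pcm; once built, it takes a
# sampling interval in seconds and prints local/remote DRAM accesses per core:
./pcm-numa 1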
> > Each deployment handles around 150 requests per second and all of the
> > deployment's pods are running on the same kube machine.
> > When running the test, we expected to see that the (local memory access) /
> > (total memory access) ratio on the UseNUMA deployment is much higher than
> > on the non-numa deployment,
> > and as a result that the deployment itself handles a higher throughput of
> > requests than the non-numa deployment.
> >
> > Surprisingly this isn't the case:
> > On the kube running deployment A which uses NUMA, we measured 20M/13M/33M
> > (local/remote/total) memory accesses,
> > and for the kube running deployment B which doesn't use NUMA, we measured
> > 23M/10M/33M at the same time.
> Just curious, did you see any performance difference (other than
> pcm-numa) between those two?
> Does it mean you ran 2 pods in parallel (at the same time) on one
> physical machine?
>
I didn't see any other significant difference.
Yes, there were 4 pods in the original experiment:
2 in each deployment (the NUMA deployment and the non-NUMA deployment);
each deployment ran on a separate physical k8s node,
and those nodes didn't run anything else but the 2 k8s pods.
>
> > Can you help us understand if we're doing anything wrong? Or maybe our
> > expectations are wrong?
> >
> > The 2 deployments are identical (except for the UseNUMA flag):
> > Each deployment contains 2 pods running on k8s.
> > Each pod has 10GB memory, 8GB heap, requires 2 CPUs (but not limited to 2).
> > Each deployment runs on a separate but identical kube machine with this
> > spec:
> > Hardware............: Supermicro SYS-2027TR-HTRF+
> > CPU.................: Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
> > CPUs................: 2
> > CPU Cores...........: 12
> > Memory..............: 63627 MB
> >
> >
> > We've also written all NUMA-related logs to a file (using
> > -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags)
> > - the log file can be found here:
> > https://drive.google.com/file/d/1eZqYDtBDWKXaEakh_DoYv0P6V9bcLs6Z/view?usp=sharing
> > so we know that NUMA is indeed working, but again, it doesn't give the
> > desired results we expected to see.
> From the shared log file, I see only 1 GC (GC id 6761), and the numa stat
> shows 53% local memory allocation (gc,heap,numa), which seems okay.
> Could you share your full vm options?
>
These are the updated vm options:
-XX:+PerfDisableSharedMem
-Xmx40g
-Xms40g
-XX:+DisableExplicitGC
-XX:-OmitStackTraceInFastThrow
-XX:+AlwaysPreTouch
-Duser.country=US
-XX:+UnlockDiagnosticVMOptions
-XX:+DebugNonSafepoints
-XX:+ParallelRefProcEnabled
-XX:+UnlockExperimentalVMOptions
-XX:G1MaxNewSizePercent=90
-XX:InitiatingHeapOccupancyPercent=35
-XX:-G1UseAdaptiveIHOP
-XX:ActiveProcessorCount=2
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
-XX:+UseNUMA
-Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags
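(Side note: to double-check that +UseNUMA is really active inside the container,
something like this should work; a sketch, with <pid> being the java process id:)

jcmd <pid> VM.flags | grep -i numa                          # should list -XX:+UseNUMA
grep 'gc,heap,numa' /outbrain/heapdumps/fulllog.log | tail  # per-node stats logged at GC time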
> >
> > Any ideas why?
> > Is it a matter of workload?
> Can you increase your Java heap on the testing machine?
> Your test machine has almost 64GB of memory on 2 NUMA nodes. So I assume
> each NUMA node will have almost 32GB of memory.
> But you are using only 8GB of Java heap, which fits on one node, so I
> wouldn't expect any benefit from enabling NUMA.
>
But when the JVM is started, doesn't it spread the heap evenly across all
NUMA nodes?
And in this case, won't each NUMA node hold half of the heap (around 4GB)?
I've increased the heap to be 40GB, and the container memory to 50GB.
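(One way to see where the 40GB heap actually lands is to look at page placement
from the k8s node itself; a sketch, assuming the container's java process is
visible in the host's pid namespace as <pid>:)

numactl --hardware    # per-node total/free memory on the machine
numastat -p <pid>     # per-NUMA-node RSS of the JVM; with -XX:+AlwaysPreTouch and
                      # -XX:+UseNUMA this should be split roughly evenly at startup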
> As the JVM is running on Kubernetes, there could be another thing that may
> affect the test.
> For example, the topology manager may restrict a pod to allocating from a
> single NUMA node.
> https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
>
That's very interesting; I will read about it and try to understand more,
and to check whether we're even using the topology manager.
Do you think that using k8s with the topology manager might be the problem?
Or would actually enabling the topology manager allow better usage of the
hardware and actually help in our case?
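(For reference, the knobs involved look roughly like this; a sketch based on the
k8s docs linked above, and the exact flag/field names depend on the k8s version
and on whether kubelet is configured via flags or a config file:)

# kubelet settings (sketch)
--cpu-manager-policy=static
--topology-manager-policy=single-numa-node   # other policies: none, best-effort, restricted
# in addition, the pod needs Guaranteed QoS (requests == limits, whole-number CPUs)
# for the static CPU manager to give it exclusive cores on a single NUMA node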
> > Are there any workloads you can suggest that
> > will benefit from G1 NUMA awareness?
> I measured some performance improvements on SpecJBB2015 and SpecJBB2005.
>
> > Do you happen to have a link to code that runs such a workload?
> No, I don't have such a link for the above runs.
>
> Thanks,
> Sangheon
>
> > Thanks,
> > Tal
> >
>