Unexpected results when enabling +UseNUMA for G1GC

sangheon.kim at oracle.com
Mon Jan 11 20:05:10 UTC 2021


Hi Tal,

I added in-line comments.

On 1/9/21 12:15 PM, Tal Goldstein wrote:
> Hi Guys,
> We're exploring the use of the flag -XX:+UseNUMA and its effect on G1 GC in
> JDK 14.
> For that, we've created a test that consists of 2 k8s deployments of some
> service,
> where deployment A has the UseNUMA flag enabled, and deployment B doesn't
> have it.
>
> In order for NUMA to actually work inside the Docker container, we also
> needed to add the numactl library to the container (apk add numactl),
> and in order to measure local/remote memory accesses we used pcm-numa
> (https://github.com/opcm/pcm).
> The container is based on an Alpine Linux v3.11 image.
>
> Each deployment handles around 150 requests per second and all of the
> deployment's pods are running on the same kube machine.
> When running the test, we expected the (local memory accesses) /
> (total memory accesses) ratio of the UseNUMA deployment to be much higher
> than that of the non-NUMA deployment,
> and as a result that it would handle a higher throughput of requests than
> the non-NUMA deployment.
>
> Surprisingly, this isn't the case:
> On the kube running deployment A, which uses NUMA, we measured 20M/13M/33M
> (local/remote/total) memory accesses,
> and on the kube running deployment B, which doesn't use NUMA, we measured
> 23M/10M/33M over the same period.
Just curious, did you see any performance difference (other than the 
pcm-numa counters) between those two?

Does it mean you ran the two pods in parallel (at the same time) on one 
physical machine?


> Can you help us understand if we're doing anything wrong? Or maybe our
> expectations are wrong?
>
> The 2 deployments are identical (except for the UseNUMA flag):
> Each deployment contains 2 pods running on k8s.
> Each pod has 10GB memory, 8GB heap, requires 2 CPUs (but not limited to 2).
> Each deployment runs on a separate but identical kube machine with this
> spec:
>                Hardware............: Supermicro SYS-2027TR-HTRF+
>                CPU.................: Intel(R) Xeon(R) CPU E5-2630L v2 @
> 2.40GHz
>                CPUs................: 2
>                CPU Cores...........: 12
>                Memory..............: 63627 MB
>
>
> We've also written to a file all NUMA related logs (using
> -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags)
> - log file could be found here:
> https://drive.google.com/file/d/1eZqYDtBDWKXaEakh_DoYv0P6V9bcLs6Z/view?usp=sharing
> so we know that NUMA is indeed working, but again, it doesn't give the
> desired results we expected to see.
 From the shared log file, I see only one GC (GC id 6761), and the NUMA 
stat (gc,heap,numa) shows 53% local memory allocation, which seems okay.

Could you share your full VM options?

>
> Any ideas why?
> Is it a matter of workload?
Could you increase the Java heap on the test machine?
Your test machine has almost 64GB of memory across 2 NUMA nodes, so I 
assume each NUMA node has roughly 32GB.
But you are using only an 8GB Java heap, which fits on a single node, so 
I would not expect much benefit from enabling NUMA.
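
As a rough sketch only (the sizes here are hypothetical, and the pod's 
memory request/limit would have to be raised to match), something like 
this would make the heap span both nodes:

    java -XX:+UseG1GC -XX:+UseNUMA -Xms48g -Xmx48g \
         -Xlog:gc*,gc+heap+numa=debug:file=gc.log:hostname,time,level,tags \
         <your-application>

With a heap clearly larger than one node's ~32GB, G1 has to place regions 
on both nodes, so the NUMA-aware placement has a chance to show up in the 
local/remote ratio.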

Since the JVM is running on Kubernetes, there could be another factor 
affecting the test.
For example, the topology manager may pin a pod so that it allocates 
from a single NUMA node.

https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
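
One thing worth checking from inside the container (just a suggestion, 
using the numactl you already installed; the cgroup path assumes cgroup v1 
and may differ in your setup):

    numactl --hardware                        # nodes and memory visible to the process
    cat /sys/fs/cgroup/cpuset/cpuset.mems     # NUMA nodes the container's cpuset allows

If only one node is visible, or cpuset.mems lists a single node, the pod 
is effectively confined to one node and UseNUMA cannot help.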


> Are there any workloads you can suggest that
> will benefit from G1 NUMA awareness?
I measured some performance improvements on SpecJBB2015 and SpecJBB2005.


> Do you happen to have a link to code that runs such a workload?
No, I don't have a link for the runs above.
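
As a very rough illustration only (not related to those runs; the class 
name and sizes below are made up for this sketch), the kind of workload 
where NUMA-aware placement can matter is many threads allocating and 
repeatedly touching objects, with a heap large enough to span both nodes:

    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical allocation-heavy sketch, e.g. run with:
    //   java -XX:+UseG1GC -XX:+UseNUMA -Xms48g -Xmx48g NumaStress
    public class NumaStress {
        static final int WINDOW = 4_000;        // live arrays kept per worker (~256MB)
        static final int CHUNK  = 64 * 1024;    // 64 KB per allocation

        public static void main(String[] args) throws InterruptedException {
            int workers = Runtime.getRuntime().availableProcessors();
            Thread[] threads = new Thread[workers];
            for (int i = 0; i < workers; i++) {
                threads[i] = new Thread(() -> {
                    byte[][] window = new byte[WINDOW][];
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    long sink = 0;
                    while (!Thread.currentThread().isInterrupted()) {
                        byte[] chunk = new byte[CHUNK];      // fresh allocation
                        chunk[rnd.nextInt(CHUNK)] = 1;       // touch the new memory
                        window[rnd.nextInt(WINDOW)] = chunk; // keep it alive, drop an old one
                        byte[] old = window[rnd.nextInt(WINDOW)];
                        if (old != null) {
                            sink += old[rnd.nextInt(CHUNK)]; // read back older memory
                        }
                    }
                    // print so the reads cannot be optimized away entirely
                    System.out.println(Thread.currentThread().getName() + ": " + sink);
                }, "worker-" + i);
                threads[i].start();
            }
            Thread.sleep(10 * 60 * 1000);        // run for ~10 minutes
            for (Thread t : threads) t.interrupt();
            for (Thread t : threads) t.join();
        }
    }

The local/remote split you measure will still depend on where the 
scheduler runs the threads, so CPU placement is worth watching as much as 
the heap layout.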


Thanks,
Sangheon


> Thanks,
> Tal
>



