[External] : Fwd: Unexpected results when enabling +UseNUMA for G1GC
Tal Goldstein
tgoldstein at outbrain.com
Mon Mar 1 23:57:04 UTC 2021
Hi Sangheon,
We ran one more experiment, and I'd be happy if you could take a look at
the results we got.
This time we used a short program we wrote, which can be seen here:
https://gist.github.com/talgol/14a5755db1ac83f4f6c04d71ad57c2e3
The program makes a large number of memory allocations by creating many new
objects that are not expected to survive and be promoted to the old generation.
The program runs a single thread for a duration of 12 hours.
It was configured with a 40GB heap and the UseNUMA flag, on the same hardware
mentioned before.
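For reference, the allocation loop is roughly of the following shape; this is
only an illustrative sketch (the exact code is in the gist above, and the class
name and allocation size here are made up):

    // Illustrative sketch only -- the exact program is in the gist above.
    // A single thread keeps allocating short-lived objects so that almost
    // everything dies young and never gets promoted to the old generation.
    public class AllocLoop {
        public static void main(String[] args) {
            final long durationMillis = 12L * 60 * 60 * 1000; // 12 hours
            final long end = System.currentTimeMillis() + durationMillis;
            long sink = 0; // keeps the JIT from removing the allocations
            while (System.currentTimeMillis() < end) {
                byte[] garbage = new byte[1024]; // short-lived object
                sink += garbage.length;
            }
            System.out.println(sink);
        }
    }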
This time we expected almost 100% of memory allocations to be local,
since the machine itself wasn't under high load, and there was no real
reason for memory to be allocated from the opposite NUMA node (there
are 2).
But that wasn't the case; the graph below shows that only a bit over 50%
of the memory accesses were local:
https://drive.google.com/file/d/1QP5I_SaeNUL6-oEHfc5B9QuiCke0xkiL/view?usp=sharing
We also used JNA to log the CPU IDs the thread was running on,
and then mapped those CPU IDs to the NUMA nodes they belong to.
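The check was roughly along the lines of the sketch below; this is a simplified
illustration, not our exact code (it assumes JNA 5.x and that libc and libnuma
can be loaded inside the container):

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    public class NumaProbe {
        // libc: returns the CPU the calling thread is currently running on
        public interface CLib extends Library {
            CLib INSTANCE = Native.load("c", CLib.class);
            int sched_getcpu();
        }

        // libnuma: maps a CPU id to the NUMA node it belongs to
        public interface NumaLib extends Library {
            NumaLib INSTANCE = Native.load("numa", NumaLib.class);
            int numa_node_of_cpu(int cpu);
        }

        public static void main(String[] args) {
            int cpu = CLib.INSTANCE.sched_getcpu();
            int node = NumaLib.INSTANCE.numa_node_of_cpu(cpu);
            System.out.println("running on cpu " + cpu + ", numa node " + node);
        }
    }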
We expected to see only 1 NUMA node being used, but again, our
results were different.
Do these results make sense to you?
Can you explain why there are so many remote allocations?
Thanks,
Tal
On Mon, Jan 25, 2021 at 8:53 PM Sangheon Kim <sangheon.kim at oracle.com>
wrote:
> Hi Tal,
>
> On 1/21/21 9:39 AM, Tal Goldstein wrote:
>
> Hey Sangheon,
> Thanks for your suggestions.
> I answered your questions in-line.
>
> Regarding your suggestion to increase the heap,
> I've increased the heap size to 40GB and the container memory to 50GB,
> and ran 2 deployments (NUMA and non-NUMA); each deployment has 1 pod which
> runs on a dedicated physical k8s node (the same machines mentioned
> previously).
> After running it for several days I could see the following pattern:
>
> For several days, during the hours of the day when throughput is at its
> peak, the local memory access ratio of the NUMA deployment is much better
> than that of the non-NUMA deployment (a 5%-6% difference).
> This can be seen in the charts below:
>
> 1. Throughput Per deployment (Numa deployment vs Non-Numa deployment):
>
> https://drive.google.com/file/d/1tG_Qm9MNHZbtmIiXryL8KGMyUk_vylVG/view?usp=sharing
>
>
> 2. Local memory ratio % (kube3-10769 is the k8s node WITH NUMA,
> kube3-10770 WITHOUT NUMA)
>
> https://drive.google.com/file/d/1WmjBSPiwwMpXDX3MWsjQQN6vR3BLSro1/view?usp=sharing
>
> From this I understand that the NUMA-based deployment behaves better under
> a higher workload, but what's still unclear to me is why the throughput of
> the non-NUMA deployment is higher than that of the NUMA deployment?
>
> Sorry, I don't have a good answer for that.
> If you want to investigate, you would have to compare the logs of the 2 runs,
> both VM and endpoint (if applicable) logs.
> For the VM logs, you can check average GC pause time, GC frequency, etc.
>
> My answers are in-lined.
>
>
> Thanks,
> Tal
>
> On Mon, Jan 11, 2021 at 10:05 PM <sangheon.kim at oracle.com> wrote:
>> Hi Tal,
>> I added in-line comments.
>> On 1/9/21 12:15 PM, Tal Goldstein wrote:
>> > Hi Guys,
>> > We're exploring the use of the flag -XX:+UseNUMA and its effect on G1 GC
>> > in JDK 14.
>> > For that, we've created a test that consists of 2 k8s deployments of some
>> > service, where deployment A has the UseNUMA flag enabled, and deployment B
>> > doesn't have it.
>> >
>> > In order for NUMA to actually work inside the Docker container, we also
>> > needed to add the numactl lib to the container (apk add numactl),
>> > and in order to measure the local/remote memory accesses we've used
>> > pcm-numa (https://github.com/opcm/pcm);
>> > the Docker image is based on Alpine Linux v3.11.
>> >
>> > Each deployment handles around 150 requests per second and all of the
>> > deployment's pods are running on the same kube machine.
>> > When running the test, we expected to see that the (local memory access) /
>> > (total memory access) ratio of the UseNUMA deployment is much higher than
>> > that of the non-NUMA deployment, and as a result that the deployment itself
>> > handles a higher throughput of requests than the non-NUMA deployment.
>> >
>> > Surprisingly this isn't the case:
>> > On the kube node running deployment A, which uses NUMA, we measured
>> > 20M/13M/33M (local/remote/total) memory accesses,
>> > and on the kube node running deployment B, which doesn't use NUMA, we
>> > measured 23M/10M/33M during the same period.
>> Just curious, did you see any performance difference (other than the
>> pcm-numa numbers) between those two?
>> Does it mean you ran 2 pods in parallel (at the same time) on one
>> physical machine?
>>
>
> I didn't see any other significant difference.
> Yes, there were 4 pods in the original experiment:
> 2 in each deployment (the NUMA deployment and the non-NUMA deployment);
> each deployment ran on a separate physical k8s node,
> and those nodes didn't run anything else but their 2 k8s pods.
>
> Okay.
>
>
>
>>
>> > Can you help us understand if we're doing anything wrong? Or maybe our
>> > expectations are wrong?
>> >
>> > The 2 deployments are identical (except for the UseNUMA flag):
>> > Each deployment contains 2 pods running on k8s.
>> > Each pod has 10GB memory, 8GB heap, requires 2 CPUs (but not limited to 2).
>> > Each deployment runs on a separate but identical kube machine with this
>> > spec:
>> > Hardware............: Supermicro SYS-2027TR-HTRF+
>> > CPU.................: Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
>> > CPUs................: 2
>> > CPU Cores...........: 12
>> > Memory..............: 63627 MB
>> >
>> >
>> > We've also written all NUMA-related logs to a file (using
>> > -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags)
>> > - the log file can be found here:
>> > https://drive.google.com/file/d/1eZqYDtBDWKXaEakh_DoYv0P6V9bcLs6Z/view?usp=sharing
>>
>> > so we know that NUMA is indeed working, but again, it doesn't give the
>> > results we expected to see.
>> From the shared log file, I see only 1 GC (GC id 6761), and the NUMA stat
>> (gc,heap,numa) shows 53% local memory allocation, which seems okay.
>> Could you share your full VM options?
>>
>
> These are the updated vm options:
> -XX:+PerfDisableSharedMem
> -Xmx40g
> -Xms40g
> -XX:+DisableExplicitGC
> -XX:-OmitStackTraceInFastThrow
> -XX:+AlwaysPreTouch
> -Duser.country=US
> -XX:+UnlockDiagnosticVMOptions
> -XX:+DebugNonSafepoints
> -XX:+ParallelRefProcEnabled
> -XX:+UnlockExperimentalVMOptions
> -XX:G1MaxNewSizePercent=90
> -XX:InitiatingHeapOccupancyPercent=35
> -XX:-G1UseAdaptiveIHOP
> -XX:ActiveProcessorCount=2
> -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
> -XX:+UseNUMA
>
> -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags
>
> Thanks
>
>
>
>> >
>> > Any ideas why?
>> > Is it a matter of workload?
>> Can you increase your Java heap on the testing machine?
>> Your test machine has almost 64GB of memory on 2 NUMA nodes, so I assume
>> each NUMA node will have almost 32GB of memory.
>> But you are using only 8GB for the Java heap, which fits on one node, so I
>> wouldn't expect any benefit from enabling NUMA.
>>
>
> But when the JVM is started, doesn't it spread the heap evenly across all
> NUMA nodes?
> And in this case, won't each NUMA node hold half of the heap (around 4GB)?
>
> Your statements above are all right.
> From an 8GB Java heap, each half of the heap (4GB) will be allocated to
> nodes 0 and 1.
>
> Compared to the non-NUMA case, G1 NUMA has the tiny addition of 1) checking
> the caller thread's NUMA id and then 2) allocating memory from the same node.
> If a testing environment uses very little memory and few threads, all of
> them can reside on one node, so the above tiny addition may not help;
> running without it could even work better.
> This is what I wanted to explain in my previous email.
>
>
> I've increased the heap to 40GB, and the container memory to 50GB.
>
>
>> As the JVM is running on Kubernetes, there could be another thing that may
>> affect the test.
>> For example, the Topology Manager may constrain a pod to allocate from a
>> single NUMA node.
>> https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
>>
> That's very interesting; I will read about it and try to understand more,
> and to find out if we're even using the Topology Manager.
> Do you think that using k8s with the Topology Manager might be the problem?
> Or would enabling the Topology Manager actually allow better usage of
> the hardware and help in our case?
>
> Sorry, I don't have enough experience/knowledge of the Topology Manager or
> Kubernetes.
> As I don't know your testing environment fully, I was trying to enumerate
> what could affect your test.
>
> Thanks,
> Sangheon
>
>
>
>
>> > Are there any workloads you can suggest that
>> > will benefit from G1 NUMA awareness?
>> I measured some performance improvements on SpecJBB2015 and SpecJBB2005.
>>
>> > Do you happen to have a link to code that runs such a workload?
>> No, I don't have such a link for the above runs.
>>
>> Thanks,
>> Sangheon
>>
>> > Thanks,
>> > Tal
>> >
>
>
>>
>
--
Tal Goldstein
Software Engineer
tgoldstein at outbrain.com