[External] : Fwd: Unexpected results when enabling +UseNUMA for G1GC

Sangheon Kim sangheon.kim at oracle.com
Tue Mar 2 00:55:59 UTC 2021


Hi Tal,

On 3/1/21 3:57 PM, Tal Goldstein wrote:
> Hi Sangheon,
> We ran one more experiment, and I would be happy if you could take a look
> at the results we got.
> This time we used a short program we wrote, which can be seen here:
> https://gist.github.com/talgol/14a5755db1ac83f4f6c04d71ad57c2e3
>
> The program makes lots of memory allocations by creating many
> new objects that are not supposed to survive and be promoted to the old gen.
>
> The program runs using only 1 thread for a duration of 12 hours.
> It was configured with a 40GB heap with the same hardware mentioned 
> before, and with the UseNUMA flag.
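>
> (As a rough illustration only; the gist above is the authoritative code.
> The allocation pattern described is essentially a tight loop of
> short-lived objects, along these lines:)
>
>     // Hypothetical sketch, not the actual gist code: one thread churns
>     // short-lived objects so almost all of them die young.
>     public class AllocChurn {
>         public static void main(String[] args) {
>             long sink = 0;
>             long end = System.nanoTime() + 12L * 3600 * 1_000_000_000L; // ~12h
>             while (System.nanoTime() < end) {
>                 byte[] garbage = new byte[1024]; // dead before the next young GC
>                 sink += garbage.length;          // keep the allocation observable
>             }
>             System.out.println(sink);
>         }
>     }
>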
> This time we expected to see that almost 100% of memory allocations
> are local,
> since the machine itself wasn't under a high load, and there was
> actually no reason for memory to be allocated from the opposite
> NUMA node (there are 2).
> But that wasn't the case; the graph below shows that
> only a bit over 50% of the accesses were local:
> https://drive.google.com/file/d/1QP5I_SaeNUL6-oEHfc5B9QuiCke0xkiL/view?usp=sharing 
>
> We also used JNA to log the CPU IDs on which the
> thread was running, and we then mapped those CPU IDs to the NUMA
> node they belong to.
> We expected to see that only 1 NUMA node was being used, but again, our
> results were different.
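>
> (For context, a minimal sketch of that technique, assuming glibc's
> sched_getcpu(3) bound via JNA; the class and method names below are ours:)
>
>     import com.sun.jna.Library;
>     import com.sun.jna.Native;
>
>     public class CpuLogger {
>         // Binds libc's sched_getcpu(): returns the CPU the caller runs on.
>         public interface CLib extends Library {
>             CLib INSTANCE = Native.load("c", CLib.class);
>             int sched_getcpu();
>         }
>
>         public static void main(String[] args) {
>             int cpu = CLib.INSTANCE.sched_getcpu();
>             // Map cpu -> node offline, e.g. from
>             // /sys/devices/system/node/node*/cpulist
>             System.out.println("running on cpu " + cpu);
>         }
>     }
>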
>
> Do these results make sense to you?
> Can you explain why there are so many remote allocations?
I should have mentioned the search depth in G1 NUMA earlier!

Short answer:
Allocation from a single node will not always happen, because the search
depth is 3.
So 100% local memory allocation in your example will not happen.

Long answer:
The search depth was introduced to avoid too much imbalance among NUMA
nodes. It also reduces fragmentation and the delay of traversing the
FreeRegionList.
For example, suppose G1 is asked to return memory from node 1, but many
HeapRegions on node 1 have already been taken from the FreeRegionList.
After searching 3 sets of regions (= # of NUMA nodes * 3, e.g. 12 regions
on a 4-node machine), G1 stops searching and returns the first region on
the list.
In your environment, after searching 6 regions (search depth 3 * 2 nodes),
G1 will return the first region on the list, which may be on the remote
node.
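
A minimal sketch of this bounded search in simplified Java (illustrative
only; the real logic lives in HotSpot's C++ free-region-list code, and the
names below are ours):

    import java.util.List;

    public class SearchDepthSketch {
        static final class HeapRegion {
            final int node; // NUMA node the region's memory belongs to
            HeapRegion(int node) { this.node = node; }
        }

        // Scan at most (searchDepth * numNodes) regions from the head of
        // the free list for one on the requested node; if none is found in
        // that window, fall back to the head, which may be remote.
        static HeapRegion selectRegion(List<HeapRegion> freeList,
                                       int requestedNode, int numNodes) {
            final int searchDepth = 3;
            int limit = Math.min(searchDepth * numNodes, freeList.size());
            for (int i = 0; i < limit; i++) {
                if (freeList.get(i).node == requestedNode) {
                    return freeList.remove(i); // local region found
                }
            }
            return freeList.remove(0); // bounded search exhausted: take head
        }
    }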



HTH,
Sangheon


>
>
> Thanks,
> Tal
>
> On Mon, Jan 25, 2021 at 8:53 PM Sangheon Kim <sangheon.kim at oracle.com>
> wrote:
>
>     Hi Tal,
>
>     On 1/21/21 9:39 AM, Tal Goldstein wrote:
>>     Hey Sangheon,
>>     Thanks for your suggestions.
>>     I answered your questions in-line.
>>
>>     Regarding your suggestion to increase the heap,
>>     I've increased the heap size to 40GB and the container memory to
>>     50GB,
>>     and ran 2 deployments (NUMA and non-NUMA); each deployment has 1
>>     pod which runs on a dedicated physical k8s node (the same
>>     machines mentioned previously).
>>     After running it for several days I could see the following pattern:
>>
>>     For several days, during the hours of the day when
>>     throughput is at its peak,
>>     the local memory access ratio of the NUMA deployment is much
>>     better than that of the non-NUMA deployment (a 5%-6% difference).
>>     This can be seen in the charts below:
>>
>>     1. Throughput per deployment (NUMA deployment vs non-NUMA
>>     deployment):
>>     https://drive.google.com/file/d/1tG_Qm9MNHZbtmIiXryL8KGMyUk_vylVG/view?usp=sharing
>>
>>
>>     2. Local memory ratio % (kube3-10769 is the k8s node WITH NUMA,
>>     kube3-10770 WITHOUT NUMA)
>>     https://drive.google.com/file/d/1WmjBSPiwwMpXDX3MWsjQQN6vR3BLSro1/view?usp=sharing
>>
>>     From this I understand that the NUMA-based deployment behaves
>>     better under a higher workload,
>>     but what's still unclear to me is why the throughput of the
>>     non-NUMA deployment is higher than that of the NUMA deployment?
>     Sorry, I don't have a good answer for that.
>     If you want to investigate, you would have to compare the logs of
>     the two runs, both vm and endpoint (if applicable) logs.
>     For the vm logs, you can check average gc pause time, gc frequency, etc.
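>
>     (A hypothetical helper for that comparison; the regex below is a loose
>     match for unified-logging pause lines and may need adjusting:)
>
>         import java.nio.file.Files;
>         import java.nio.file.Paths;
>         import java.util.regex.Matcher;
>         import java.util.regex.Pattern;
>
>         // Prints pause count and average pause time (ms) for one GC log;
>         // run it on each log and compare the two outputs.
>         public class PauseStats {
>             public static void main(String[] args) throws Exception {
>                 // Loosely matches lines like "Pause Young ... 12.345ms"
>                 Pattern p = Pattern.compile("Pause.*?([0-9.]+)ms");
>                 double sum = 0;
>                 long n = 0;
>                 for (String line : Files.readAllLines(Paths.get(args[0]))) {
>                     Matcher m = p.matcher(line);
>                     if (m.find()) {
>                         sum += Double.parseDouble(m.group(1));
>                         n++;
>                     }
>                 }
>                 System.out.println(n + " pauses, avg " + (sum / n) + " ms");
>             }
>         }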
>
>     My answers are inline.
>
>>
>>     Thanks,
>>     Tal
>>
>>         On Mon, Jan 11, 2021 at 10:05 PM <sangheon.kim at oracle.com>
>>         wrote:
>>         Hi Tal,
>>         I added in-line comments.
>>         On 1/9/21 12:15 PM, Tal Goldstein wrote:
>>         > Hi Guys,
>>         > We're exploring the use of the flag -XX:+UseNUMA and its
>>         effect on G1 GC in
>>         > JDK 14.
>>         > For that, we've created a test that consists of 2 k8s
>>         deployments of some
>>         > service,
>>         > where deployment A has the UseNUMA flag enabled, and
>>         deployment B doesn't
>>         > have it.
>>         >
>>         > In order for NUMA to actually work inside the docker
>>         container, we also
>>         > needed to add numactl lib to the container (apk add numactl),
>>         > and in order to measure the local/remote memory access
>>         we've used pcm-numa (
>>         > https://github.com/opcm/pcm),
>>         > the docker is based on an image of Alpine Linux v3.11.
>>         >
>>         > Each deployment handles around 150 requests per second and
>>         all of the
>>         > deployment's pods are running on the same kube machine.
>>         > When running the test, we expected to see that the (local
>>         memory access) /
>>         > (total memory access) ratio on the UseNUMA deployment is
>>         much higher than
>>         > on the non-NUMA deployment,
>>         > and as a result that the deployment itself handles a higher
>>         throughput of
>>         > requests than the non-NUMA deployment.
>>         >
>>         > Surprisingly this isn't the case:
>>         > On the kube running deployment A, which uses NUMA, we
>>         measured 20M/13M/33M
>>         > (local/remote/total) memory accesses,
>>         > and for the kube running deployment B, which doesn't use
>>         NUMA, we measured
>>         > 23M/10M/33M over the same period.
>>         Just curious, did you see any performance difference (other
>>         than the pcm-numa numbers) between those two?
>>         Does it mean you ran 2 pods in parallel (at the same time) on
>>         one physical machine?
>>
>>
>>     I didn't see any other significant difference.
>>     Yes, so there were 4 pods in the original experiment:
>>     2 on each deployment (NUMA deployment and non-NUMA deployment);
>>     each deployment ran on a separate physical k8s node,
>>     and those nodes didn't run anything else but the 2 k8s pods.
>     Okay.
>
>>
>>         > Can you help us understand if we're doing anything wrong,
>>         > or maybe our
>>         > expectations are wrong?
>>         >
>>         > The 2 deployments are identical (except for the UseNUMA flag):
>>         > Each deployment contains 2 pods running on k8s.
>>         > Each pod has 10GB memory, 8GB heap, requires 2 CPUs (but
>>         not limited to 2).
>>         > Each deployment runs on a separate but identical kube
>>         machine with this
>>         > spec:
>>         >                Hardware............: Supermicro
>>         SYS-2027TR-HTRF+
>>         >                CPU.................: Intel(R) Xeon(R) CPU
>>         E5-2630L v2 @
>>         > 2.40GHz
>>         >                CPUs................: 2
>>         >                CPU Cores...........: 12
>>         >                Memory..............: 63627 MB
>>         >
>>         >
>>         > We've also written all NUMA-related logs to a file (using
>>         >
>>         -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags)
>>         > - log file could be found here:
>>         >
>>         https://drive.google.com/file/d/1eZqYDtBDWKXaEakh_DoYv0P6V9bcLs6Z/view?usp=sharing
>>         > so we know that NUMA is indeed working, but again, it
>>         doesn't give the
>>         > desired results we expected to see.
>>         From the shared log file, I see only 1 GC (GC id 6761), and the
>>         numa stats (gc,heap,numa) show 53% local memory allocation,
>>         which seems okay.
>>         Could you share your full vm options?
>>
>>
>>     These are the updated vm options:
>>     -XX:+PerfDisableSharedMem
>>     -Xmx40g
>>     -Xms40g
>>     -XX:+DisableExplicitGC
>>     -XX:-OmitStackTraceInFastThrow
>>     -XX:+AlwaysPreTouch
>>     -Duser.country=US
>>     -XX:+UnlockDiagnosticVMOptions
>>     -XX:+DebugNonSafepoints
>>     -XX:+ParallelRefProcEnabled
>>     -XX:+UnlockExperimentalVMOptions
>>     -XX:G1MaxNewSizePercent=90
>>     -XX:InitiatingHeapOccupancyPercent=35
>>     -XX:-G1UseAdaptiveIHOP
>>     -XX:ActiveProcessorCount=2
>>     -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
>>     -XX:+UseNUMA
>>     -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags
>     Thanks
>
>>         >
>>         > Any ideas why?
>>         > Is it a matter of workload?
>>         Can you increase your Java heap on the testing machine?
>>         Your test machine has almost 64GB of memory on 2 NUMA nodes,
>>         so I assume
>>         each NUMA node will have almost 32GB of memory.
>>         But you are using only 8GB of Java heap, which fits on one
>>         node, so I
>>         don't expect any benefit from enabling NUMA.
>>
>>
>>     But when the JVM is started, doesn't it spread the heap evenly
>>     across all NUMA nodes?
>>     And in this case, won't each NUMA node hold half of the heap
>>     (around 4GB)?
>     Your statements above are all right.
>     Of the 8G Java heap, each half (4G) will be allocated to
>     node 0 and node 1.
>
>     Compared to the non-NUMA case, G1 NUMA adds only a tiny step:
>     1) check the calling thread's NUMA id, and then 2) allocate memory
>     from the same node.
>     If a testing environment uses very little memory and few threads,
>     all of them can reside on one node, so the tiny addition above may
>     not help; running without it could even work better.
>     This is what I wanted to explain in my previous email.
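>
>     (A conceptual sketch only, not HotSpot code: with -XX:+UseNUMA, G1
>     stripes the fixed-size heap regions evenly across the nodes, and an
>     allocating thread then prefers regions on its own node. Assuming
>     round-robin striping:)
>
>         public class NumaStripeSketch {
>             // Region i conceptually lives on node (i % numNodes), so an
>             // 8G heap on 2 nodes ends up with about 4G on each node.
>             static int nodeOfRegion(int regionIndex, int numNodes) {
>                 return regionIndex % numNodes;
>             }
>
>             public static void main(String[] args) {
>                 for (int r = 0; r < 6; r++) {
>                     System.out.println("region " + r + " -> node "
>                                        + nodeOfRegion(r, 2));
>                 }
>             }
>         }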
>
>>
>>     I've increased the heap to be 40GB, and the container memory to 50GB.
>>
>>         As the JVM is running on Kubernetes, there could be another
>>         factor that may
>>         affect the test.
>>         For example, the topology manager may constrain a pod to
>>         allocate from a single
>>         NUMA node.
>>         https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
>>
>>     That's very interesting; I will read about it to
>>     understand more, and to find out whether we're even using the
>>     topology manager.
>>     Do you think that using k8s with the topology manager might be the
>>     problem?
>>     Or would enabling the topology manager actually allow better
>>     usage of the hardware and help in our case?
>     Sorry, I don't have enough experience/knowledge of the topology
>     manager or Kubernetes.
>     As I don't know your testing environment fully, I was trying to
>     enumerate the things that could affect your test.
>
>     Thanks,
>     Sangheon
>
>
>>         > Are there any workloads you can suggest that
>>         > will benefit from G1 NUMA awareness?
>>         I measured some performance improvements on SpecJBB2015 and
>>         SpecJBB2005.
>>
>>         > Do you happen to have a link to code that runs such a workload?
>>         No, I don't have such a link for the above runs.
>>
>>         Thanks,
>>         Sangheon
>>
>>         > Thanks,
>>         > Tal
>>         >
>>
>
>
>
> -- 
> Tal Goldstein
> Software Engineer
> tgoldstein at outbrain.com
>



