[External] : Fwd: Unexpected results when enabling +UseNUMA for G1GC
sangheon.kim at oracle.com
Tue Mar 2 01:10:27 UTC 2021
Sorry, the attached image was not uploaded.
This is the image I used in the explanation:
https://sangheon.github.io/assets/posts/g1numa/SearchDepth_G1GC_NUMA.jpg
Thanks,
Sangheon
On 3/1/21 4:55 PM, Sangheon Kim wrote:
> Hi Tal,
>
> On 3/1/21 3:57 PM, Tal Goldstein wrote:
>> Hi Sangheon,
>> We ran one more experiment, and I would be happy if you could take a
>> look at the results we got.
>> This time we used a short program we wrote, which can be seen here:
>> https://gist.github.com/talgol/14a5755db1ac83f4f6c04d71ad57c2e3
>>
>> The program performs a large number of memory allocations by creating
>> many new objects that are not supposed to survive and be promoted to
>> the old generation.
>> The program runs a single thread for a duration of 12 hours.
>> It was configured with a 40GB heap, on the same hardware mentioned
>> before, and with the UseNUMA flag.
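>> In essence, it is just a tight allocation loop along these lines (a
>> simplified sketch only; the real code is in the gist above):
>>
>>     import java.util.concurrent.ThreadLocalRandom;
>>
>>     // Simplified sketch: one thread allocating many short-lived objects
>>     // that should die young and never be promoted to the old generation.
>>     public class AllocationLoop {
>>         public static void main(String[] args) {
>>             long end = System.currentTimeMillis() + 12L * 60 * 60 * 1000; // ~12 hours
>>             long sink = 0;
>>             while (System.currentTimeMillis() < end) {
>>                 byte[] garbage = new byte[ThreadLocalRandom.current().nextInt(1, 4096)];
>>                 sink += garbage.length; // keep the allocation from being optimized away
>>             }
>>             System.out.println(sink);
>>         }
>>     }
>>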
>> This time we expected almost 100% of memory allocations to be local,
>> since the machine itself wasn't under high load and there was no real
>> reason for memory to be allocated from the other NUMA node (there are
>> two).
>> But that wasn't the case; the graph below shows that just a bit over
>> 50% were local:
>> https://drive.google.com/file/d/1QP5I_SaeNUL6-oEHfc5B9QuiCke0xkiL/view?usp=sharing
>>
>>
>> We also used JNA to log the CPU IDs the thread was running on, and
>> then mapped those CPU IDs to the NUMA node each of them belongs to.
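>> The CPU-id logging was roughly of this shape (a sketch only, assuming a
>> JNA binding to glibc's sched_getcpu(); the real code differs in details):
>>
>>     import com.sun.jna.Library;
>>     import com.sun.jna.Native;
>>
>>     // Sketch: bind glibc's sched_getcpu() through JNA and periodically log
>>     // which CPU the current thread is scheduled on; the CPU id can then be
>>     // mapped to its NUMA node (e.g. via /sys/devices/system/cpu/cpuN/).
>>     public class CpuLogger {
>>         public interface CLib extends Library {
>>             CLib INSTANCE = Native.load("c", CLib.class);
>>             int sched_getcpu(); // CPU number of the calling thread, or -1 on error
>>         }
>>
>>         public static void main(String[] args) throws InterruptedException {
>>             while (true) {
>>                 System.out.println("running on cpu " + CLib.INSTANCE.sched_getcpu());
>>                 Thread.sleep(1000);
>>             }
>>         }
>>     }
>>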
>> We expected to see that only 1 NUMA node is being used, but again,
>> our results were different.
>>
>> Do these results make sense to you?
>> Can you explain why there are so many remote allocations?
> I should have mentioned the search depth in G1 NUMA earlier!
>
> Short answer:
> Allocations will not all come from one node, because the search depth
> is 3; so 100% local memory allocation will not happen in your example.
>
> Long answer:
> The search depth was introduced to avoid too much imbalance among NUMA
> nodes; it also reduces fragmentation and the delay of traversing the
> FreeRegionList.
> In the example below, G1 is asked to return memory from node 1, but
> the FreeRegionList has already handed out many of its HeapRegions on
> node 1.
> So after searching 3 sets of regions (= # of NUMA nodes * 3 = 12
> regions in the 4-node example), G1 stops searching and returns the
> first region on the list.
> In your environment, after searching 6 regions (search depth 3 * 2
> nodes), G1 will return the first region, which will be remote.
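> In rough Java-style pseudocode, the region selection works like this
> (an illustrative sketch only; the real logic is C++ inside HotSpot):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Illustrative sketch of the search-depth behavior described above.
>     public class SearchDepthSketch {
>         record HeapRegion(int node) { } // a free region tagged with its NUMA node
>
>         static HeapRegion claimRegion(List<HeapRegion> freeList, int preferredNode,
>                                       int numNodes, int searchDepth) {
>             int limit = Math.min(numNodes * searchDepth, freeList.size()); // e.g. 2 * 3 = 6
>             for (int i = 0; i < limit; i++) {
>                 if (freeList.get(i).node() == preferredNode) {
>                     return freeList.remove(i); // local region found within the window
>                 }
>             }
>             // give up after the window: take the head of the list, which may be remote
>             return freeList.isEmpty() ? null : freeList.remove(0);
>         }
>
>         public static void main(String[] args) {
>             // 2-node machine: the first 6 free regions all sit on node 0, so a
>             // request for node 1 falls back to the (remote) head of the list.
>             List<HeapRegion> freeList = new ArrayList<>(List.of(
>                     new HeapRegion(0), new HeapRegion(0), new HeapRegion(0),
>                     new HeapRegion(0), new HeapRegion(0), new HeapRegion(0),
>                     new HeapRegion(1)));
>             System.out.println(claimRegion(freeList, 1, 2, 3)); // -> HeapRegion[node=0]
>         }
>     }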
>
>
>
> HTH,
> Sangheon
>
>
>>
>>
>> Thanks,
>> Tal
>>
>> On Mon, Jan 25, 2021 at 8:53 PM Sangheon Kim <sangheon.kim at oracle.com> wrote:
>>
>> Hi Tal,
>>
>> On 1/21/21 9:39 AM, Tal Goldstein wrote:
>>> Hey Sangheon,
>>> Thanks for your suggestions.
>>> I answered your questions in-line.
>>>
>>> Regarding your suggestion to increase the heap,
>>> I've increased the heap size to 40GB and the container memory to
>>> 50GB,
>>> and ran 2 deployments (numa and non-numa), each deployment has 1
>>> pod which runs on a dedicated physical k8s node (the same
>>> machines mentioned previously).
>>> After running it for several days I could see the following
>>> pattern:
>>>
>>> For several days, during the hours of the day when throughput is at
>>> its maximum, the local memory access ratio of the NUMA deployment is
>>> much better than that of the non-NUMA deployment (a 5%-6% difference).
>>> This can be seen in the charts below:
>>>
>>> 1. Throughput Per deployment (Numa deployment vs Non-Numa
>>> deployment):
>>> https://drive.google.com/file/d/1tG_Qm9MNHZbtmIiXryL8KGMyUk_vylVG/view?usp=sharing
>>>
>>>
>>> 2. Local memory ratio % (kube3-10769 is the k8s node WITH NUMA,
>>> kube3-10770 WITHOUT NUMA)
>>> https://drive.google.com/file/d/1WmjBSPiwwMpXDX3MWsjQQN6vR3BLSro1/view?usp=sharing
>>>
>>> From this I understand that the NUMA-based deployment behaves better
>>> under a higher workload, but what is still unclear to me is why the
>>> throughput of the non-NUMA deployment is higher than that of the NUMA
>>> deployment?
>> Sorry, I don't have a good answer for that.
>> If you want to investigate, you would have to compare the logs of the
>> two runs, both the VM and endpoint (if applicable) logs.
>> For the VM logs, you can check the average GC pause time, GC
>> frequency, etc.
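>> For example, the average pause time can be pulled out of a unified GC
>> log with something like this (a rough sketch; it assumes the pause
>> summary lines of -Xlog:gc* end in "... <n>ms"):
>>
>>     import java.io.IOException;
>>     import java.nio.file.Files;
>>     import java.nio.file.Path;
>>     import java.util.regex.Matcher;
>>     import java.util.regex.Pattern;
>>
>>     // Rough sketch: count GC pauses and compute the average pause time from
>>     // a unified logging (-Xlog:gc*) file whose pause lines end in "<n>ms".
>>     public class PauseStats {
>>         private static final Pattern PAUSE = Pattern.compile("Pause.*\\s(\\d+\\.\\d+)ms");
>>
>>         public static void main(String[] args) throws IOException {
>>             double total = 0;
>>             long count = 0;
>>             for (String line : Files.readAllLines(Path.of(args[0]))) {
>>                 Matcher m = PAUSE.matcher(line);
>>                 if (m.find()) {
>>                     total += Double.parseDouble(m.group(1));
>>                     count++;
>>                 }
>>             }
>>             System.out.printf("pauses=%d, avg=%.2f ms%n",
>>                     count, count == 0 ? 0 : total / count);
>>         }
>>     }
>>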
>>
>> My answers are in-lined.
>>
>>>
>>> Thanks,
>>> Tal
>>>
>>> On Mon, Jan 11, 2021 at 10:05 PM <sangheon.kim at oracle.com> wrote:
>>> Hi Tal,
>>> I added in-line comments.
>>> On 1/9/21 12:15 PM, Tal Goldstein wrote:
>>> > Hi Guys,
>>> > We're exploring the use of the flag -XX:+UseNUMA and its
>>> effect on G1 GC in
>>> > JDK 14.
>>> > For that, we've created a test that consists of 2 k8s
>>> deployments of some
>>> > service,
>>> > where deployment A has the UseNUMA flag enabled, and
>>> deployment B doesn't
>>> > have it.
>>> >
>>> > In order for NUMA to actually work inside the docker
>>> container, we also
>>> > needed to add numactl lib to the container (apk add numactl),
>>> > and in order to measure the local/remote memory access
>>> we've used pcm-numa (
>>> > https://github.com/opcm/pcm),
>>> > and the container image is based on Alpine Linux v3.11.
>>> >
>>> > Each deployment handles around 150 requests per second and
>>> all of the
>>> > deployment's pods are running on the same kube machine.
>>> > When running the test, we expected to see that the (local
>>> memory access) /
>>> > (total memory access) ratio on the UseNUMA deployment, is
>>> much higher than
>>> > the non-numa deployment,
>>> > and as a result that the deployment itself handles a higher
>>> throughput of
>>> > requests than the non-numa deployment.
>>> >
>>> > Surprisingly this isn't the case:
>>> > On the kube running deployment A which uses NUMA, we
>>> measured 20M/ 13M/ 33M
>>> > (local/remote/total) memory accesses,
>>> > and for the kube running deployment B, which doesn't use NUMA, we
>>> > measured 23M/10M/33M at the same time.
>>> Just curious, did you see any performance difference (other than the
>>> pcm-numa numbers) between those two?
>>> Does it mean you ran 2 pods in parallel (at the same time) on one
>>> physical machine?
>>>
>>>
>>> I didn't see any other significant difference.
>>> Yes, there were 4 pods in the original experiment:
>>> 2 in each deployment (NUMA deployment and non-NUMA deployment);
>>> each deployment ran on a separate physical k8s node,
>>> and those nodes didn't run anything else but the 2 k8s pods.
>> Okay.
>>
>>>
>>> > Can you help us understand whether we're doing anything wrong, or
>>> > maybe our expectations are wrong?
>>> >
>>> > The 2 deployments are identical (except for the UseNUMA flag):
>>> > Each deployment contains 2 pods running on k8s.
>>> > Each pod has 10GB memory, 8GB heap, requires 2 CPUs (but not limited to 2).
>>> > Each deployment runs on a separate but identical kube machine with this spec:
>>> > Hardware............: Supermicro SYS-2027TR-HTRF+
>>> > CPU.................: Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
>>> > CPUs................: 2
>>> > CPU Cores...........: 12
>>> > Memory..............: 63627 MB
>>> >
>>> >
>>> > We've also written all NUMA-related logs to a file (using
>>> > -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags);
>>> > the log file can be found here:
>>> > https://drive.google.com/file/d/1eZqYDtBDWKXaEakh_DoYv0P6V9bcLs6Z/view?usp=sharing
>>> > So we know that NUMA is indeed working, but again, it doesn't give
>>> > the desired results we expected to see.
>>> From the shared log file, I see only one GC (GC id 6761), and the
>>> NUMA stats (gc,heap,numa) show 53% local memory allocation, which
>>> seems okay.
>>> Could you share your full VM options?
>>>
>>>
>>> These are the updated vm options:
>>> -XX:+PerfDisableSharedMem
>>> -Xmx40g
>>> -Xms40g
>>> -XX:+DisableExplicitGC
>>> -XX:-OmitStackTraceInFastThrow
>>> -XX:+AlwaysPreTouch
>>> -Duser.country=US
>>> -XX:+UnlockDiagnosticVMOptions
>>> -XX:+DebugNonSafepoints
>>> -XX:+ParallelRefProcEnabled
>>> -XX:+UnlockExperimentalVMOptions
>>> -XX:G1MaxNewSizePercent=90
>>> -XX:InitiatingHeapOccupancyPercent=35
>>> -XX:-G1UseAdaptiveIHOP
>>> -XX:ActiveProcessorCount=2
>>> -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
>>> -XX:+UseNUMA
>>> -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags
>> Thanks
>>
>>> >
>>> > Any Ideas why ?
>>> > Is it a matter of workload ?
>>> Can you increase your Java heap on the testing machine?
>>> Your test machine has almost 64GB of memory on 2 NUMA nodes, so I
>>> assume each NUMA node will have almost 32GB of memory.
>>> But you are using only an 8GB Java heap, which fits on one node, so I
>>> don't expect any benefit from enabling NUMA.
>>>
>>>
>>> But when the JVM is started, doesn't it spread the heap evenly
>>> across all NUMA nodes?
>>> And in this case, won't each NUMA node hold half of the heap
>>> (around 4GB)?
>> Your statements above are right.
>> Of an 8GB Java heap, half (4GB) will be allocated on each of node 0
>> and node 1.
>>
>> Compared to the non-NUMA case, G1 NUMA adds only a tiny step: 1) check
>> the caller thread's NUMA id and then 2) allocate memory from that same
>> node.
>> If a testing environment uses very little memory and few threads, all
>> of them can reside on one node, so the tiny addition above may not
>> help; running without it could even work better.
>> This is what I wanted to explain in my previous email.
>>
>>>
>>> I've increased the heap to 40GB, and the container memory to 50GB.
>>>
>>> As the JVM is running on Kubernetes, there could be other things that
>>> affect the test.
>>> For example, the topology manager may cause a pod to allocate from a
>>> single NUMA node.
>>> https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
>>>
>>> That's very interesting; I will read about it and try to understand
>>> more, and to find out whether we're even using the topology manager.
>>> Do you think that using k8s with the topology manager might be the
>>> problem?
>>> Or would enabling the topology manager actually allow better usage of
>>> the hardware and help in our case?
>> Sorry, I don't have enough experience/knowledge of the topology
>> manager or Kubernetes.
>> As I don't know your testing environment fully, I was trying to
>> enumerate what could affect your test.
>>
>> Thanks,
>> Sangheon
>>
>>
>>> > Are there any workloads you can suggest that will benefit from G1
>>> > NUMA awareness?
>>> I measured some performance improvements on SpecJBB2015 and
>>> SpecJBB2005.
>>>
>>> > Do you happen to have a link to code that runs such a
>>> workload?
>>> No, I don't have such a link for the above runs.
>>>
>>> Thanks,
>>> Sangheon
>>>
>>> > Thanks,
>>> > Tal
>>> >
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Tal Goldstein
>> Software Engineer
>> tgoldstein at outbrain.com
>>
>>
>