[External] : Fwd: Unexpected results when enabling +UseNUMA for G1GC
sangheon.kim at oracle.com
Tue Mar 2 01:10:27 UTC 2021
Sorry, the attached image was not uploaded.
This is the image I used in the explanation:
https://sangheon.github.io/assets/posts/g1numa/SearchDepth_G1GC_NUMA.jpg
Thanks,
Sangheon
On 3/1/21 4:55 PM, Sangheon Kim wrote:
> Hi Tal,
>
> On 3/1/21 3:57 PM, Tal Goldstein wrote:
>> Hi Sangheon,
>> We ran one more experiment, and I would be happy if you could take a
>> look at the results we got.
>> This time we used a short program we wrote, which can be seen here:
>> https://gist.github.com/talgol/14a5755db1ac83f4f6c04d71ad57c2e3
>>
>> The program performs a large number of memory allocations by creating
>> many new objects that are not supposed to survive and be promoted to
>> the old generation.
>> The program runs a single thread for a duration of 12 hours.
>> It was configured with a 40GB heap, on the same hardware mentioned
>> before, and with the UseNUMA flag.
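>> In essence, it is just a tight allocation loop along these lines (a
>> simplified sketch only; the real code is in the gist above):
>>
>>     import java.util.concurrent.ThreadLocalRandom;
>>
>>     // Simplified sketch: one thread allocating many short-lived objects
>>     // that should die young and never be promoted to the old generation.
>>     public class AllocationLoop {
>>         public static void main(String[] args) {
>>             long end = System.currentTimeMillis() + 12L * 60 * 60 * 1000; // ~12 hours
>>             long sink = 0;
>>             while (System.currentTimeMillis() < end) {
>>                 byte[] garbage = new byte[ThreadLocalRandom.current().nextInt(1, 4096)];
>>                 sink += garbage.length; // keep the allocation from being optimized away
>>             }
>>             System.out.println(sink);
>>         }
>>     }
>>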
>> This time we expected almost 100% of memory allocations to be local,
>> since the machine itself wasn't under high load and there was no real
>> reason for memory to be allocated from the other NUMA node (there are
>> two).
>> But that wasn't the case; the graph below shows that just a bit over
>> 50% were local:
>> https://drive.google.com/file/d/1QP5I_SaeNUL6-oEHfc5B9QuiCke0xkiL/view?usp=sharing
>>
>>
>> We also used JNA to log the CPU IDs the thread was running on, and
>> then mapped those CPU IDs to the NUMA node each of them belongs to.
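>> The CPU-id logging was roughly of this shape (a sketch only, assuming a
>> JNA binding to glibc's sched_getcpu(); the real code differs in details):
>>
>>     import com.sun.jna.Library;
>>     import com.sun.jna.Native;
>>
>>     // Sketch: bind glibc's sched_getcpu() through JNA and periodically log
>>     // which CPU the current thread is scheduled on; the CPU id can then be
>>     // mapped to its NUMA node (e.g. via /sys/devices/system/cpu/cpuN/).
>>     public class CpuLogger {
>>         public interface CLib extends Library {
>>             CLib INSTANCE = Native.load("c", CLib.class);
>>             int sched_getcpu(); // CPU number of the calling thread, or -1 on error
>>         }
>>
>>         public static void main(String[] args) throws InterruptedException {
>>             while (true) {
>>                 System.out.println("running on cpu " + CLib.INSTANCE.sched_getcpu());
>>                 Thread.sleep(1000);
>>             }
>>         }
>>     }
>>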
>> We expected to see that only 1 NUMA node is being used, but again,
>> our results were different.
>>
>> Do these results make sense to you?
>> Can you explain why there are so many remote allocations?
> I should have mentioned the search depth in G1 NUMA earlier!
>
> Short answer:
> Allocations will not all come from one node, because the search depth
> is 3; so 100% local memory allocation will not happen in your example.
>
> Long answer:
> The search depth was introduced to avoid too much imbalance among NUMA
> nodes; it also reduces fragmentation and the delay of traversing the
> FreeRegionList.
> In the example below, G1 is asked to return memory from node 1, but
> the FreeRegionList has already handed out many of its HeapRegions on
> node 1.
> So after searching 3 sets of regions (= # of NUMA nodes * 3 = 12
> regions in the 4-node example), G1 stops searching and returns the
> first region on the list.
> In your environment, after searching 6 regions (search depth 3 * 2
> nodes), G1 will return the first region, which will be remote.
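> In rough Java-style pseudocode, the region selection works like this
> (an illustrative sketch only; the real logic is C++ inside HotSpot):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Illustrative sketch of the search-depth behavior described above.
>     public class SearchDepthSketch {
>         record HeapRegion(int node) { } // a free region tagged with its NUMA node
>
>         static HeapRegion claimRegion(List<HeapRegion> freeList, int preferredNode,
>                                       int numNodes, int searchDepth) {
>             int limit = Math.min(numNodes * searchDepth, freeList.size()); // e.g. 2 * 3 = 6
>             for (int i = 0; i < limit; i++) {
>                 if (freeList.get(i).node() == preferredNode) {
>                     return freeList.remove(i); // local region found within the window
>                 }
>             }
>             // give up after the window: take the head of the list, which may be remote
>             return freeList.isEmpty() ? null : freeList.remove(0);
>         }
>
>         public static void main(String[] args) {
>             // 2-node machine: the first 6 free regions all sit on node 0, so a
>             // request for node 1 falls back to the (remote) head of the list.
>             List<HeapRegion> freeList = new ArrayList<>(List.of(
>                     new HeapRegion(0), new HeapRegion(0), new HeapRegion(0),
>                     new HeapRegion(0), new HeapRegion(0), new HeapRegion(0),
>                     new HeapRegion(1)));
>             System.out.println(claimRegion(freeList, 1, 2, 3)); // -> HeapRegion[node=0]
>         }
>     }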
>
>
>
> HTH,
> Sangheon
>
>
>>
>>
>> Thanks,
>> Tal
>>
>> On Mon, Jan 25, 2021 at 8:53 PM Sangheon Kim <sangheon.kim at oracle.com> wrote:
>>
>> Hi Tal,
>>
>> On 1/21/21 9:39 AM, Tal Goldstein wrote:
>>> Hey Sangheon,
>>> Thanks for your suggestions.
>>> I answered your questions in-line.
>>>
>>> Regarding your suggestion to increase the heap,
>>> I've increased the heap size to 40GB and the container memory to
>>> 50GB,
>>> and ran 2 deployments (numa and non-numa), each deployment has 1
>>> pod which runs on a dedicated physical k8s node (the same
>>> machines mentioned previously).
>>> After running it for several days I could see the following
>>> pattern:
>>>
>>> For several days, during the hours of the day when throughput is at
>>> its maximum, the local memory access ratio of the NUMA deployment is
>>> much better than that of the non-NUMA deployment (a 5%-6% difference).
>>> This can be seen in the charts below:
>>>
>>> 1. Throughput Per deployment (Numa deployment vs Non-Numa
>>> deployment):
>>> https://drive.google.com/file/d/1tG_Qm9MNHZbtmIiXryL8KGMyUk_vylVG/view?usp=sharing
>>>
>>>
>>> 2. Local memory ratio % (kube3-10769 is the k8s node WITH NUMA,
>>> kube3-10770 WITHOUT NUMA)
>>> https://drive.google.com/file/d/1WmjBSPiwwMpXDX3MWsjQQN6vR3BLSro1/view?usp=sharing
>>>
>>> From this I understand that the NUMA-based deployment behaves better
>>> under a higher workload, but what is still unclear to me is why the
>>> throughput of the non-NUMA deployment is higher than that of the NUMA
>>> deployment?
>> Sorry, I don't have a good answer for that.
>> If you want to investigate, you would have to compare the logs of the
>> two runs, both the VM and endpoint (if applicable) logs.
>> For the VM logs, you can check the average GC pause time, GC
>> frequency, etc.
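>> For example, the average pause time can be pulled out of a unified GC
>> log with something like this (a rough sketch; it assumes the pause
>> summary lines of -Xlog:gc* end in "... <n>ms"):
>>
>>     import java.io.IOException;
>>     import java.nio.file.Files;
>>     import java.nio.file.Path;
>>     import java.util.regex.Matcher;
>>     import java.util.regex.Pattern;
>>
>>     // Rough sketch: count GC pauses and compute the average pause time from
>>     // a unified logging (-Xlog:gc*) file whose pause lines end in "<n>ms".
>>     public class PauseStats {
>>         private static final Pattern PAUSE = Pattern.compile("Pause.*\\s(\\d+\\.\\d+)ms");
>>
>>         public static void main(String[] args) throws IOException {
>>             double total = 0;
>>             long count = 0;
>>             for (String line : Files.readAllLines(Path.of(args[0]))) {
>>                 Matcher m = PAUSE.matcher(line);
>>                 if (m.find()) {
>>                     total += Double.parseDouble(m.group(1));
>>                     count++;
>>                 }
>>             }
>>             System.out.printf("pauses=%d, avg=%.2f ms%n",
>>                     count, count == 0 ? 0 : total / count);
>>         }
>>     }
>>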
>>
>> My answers are in-lined.
>>
>>>
>>> Thanks,
>>> Tal
>>>
>>> On Mon, Jan 11, 2021 at 10:05 PM <sangheon.kim at oracle.com> wrote:
>>> Hi Tal,
>>> I added in-line comments.
>>> On 1/9/21 12:15 PM, Tal Goldstein wrote:
>>> > Hi Guys,
>>> > We're exploring the use of the flag -XX:+UseNUMA and its
>>> effect on G1 GC in
>>> > JDK 14.
>>> > For that, we've created a test that consists of 2 k8s
>>> deployments of some
>>> > service,
>>> > where deployment A has the UseNUMA flag enabled, and
>>> deployment B doesn't
>>> > have it.
>>> >
>>> > In order for NUMA to actually work inside the docker
>>> container, we also
>>> > needed to add numactl lib to the container (apk add numactl),
>>> > and in order to measure the local/remote memory access
>>> we've used pcm-numa (
>>> > https://github.com/opcm/pcm),
>>> > and the container image is based on Alpine Linux v3.11.
>>> >
>>> > Each deployment handles around 150 requests per second and
>>> all of the
>>> > deployment's pods are running on the same kube machine.
>>> > When running the test, we expected to see that the (local
>>> memory access) /
>>> > (total memory access) ratio on the UseNUMA deployment, is
>>> much higher than
>>> > the non-numa deployment,
>>> > and as a result that the deployment itself handles a higher
>>> throughput of
>>> > requests than the non-numa deployment.
>>> >
>>> > Surprisingly this isn't the case:
>>> > On the kube running deployment A which uses NUMA, we
>>> measured 20M/ 13M/ 33M
>>> > (local/remote/total) memory accesses,
>>> > and for the kube running deployment B, which doesn't use NUMA, we
>>> > measured 23M/10M/33M at the same time.
>>> Just curious, did you see any performance difference (other than the
>>> pcm-numa numbers) between those two?
>>> Does it mean you ran 2 pods in parallel (at the same time) on one
>>> physical machine?
>>>
>>>
>>> I didn't see any other significant difference.
>>> Yes, there were 4 pods in the original experiment:
>>> 2 in each deployment (NUMA deployment and non-NUMA deployment);
>>> each deployment ran on a separate physical k8s node,
>>> and those nodes didn't run anything else but the 2 k8s pods.
>> Okay.
>>
>>>
>>> > Can you help us understand whether we're doing anything wrong, or
>>> > maybe our expectations are wrong?
>>> >
>>> > The 2 deployments are identical (except for the UseNUMA flag):
>>> > Each deployment contains 2 pods running on k8s.
>>> > Each pod has 10GB memory, 8GB heap, requires 2 CPUs (but not limited to 2).
>>> > Each deployment runs on a separate but identical kube machine with this spec:
>>> > Hardware............: Supermicro SYS-2027TR-HTRF+
>>> > CPU.................: Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
>>> > CPUs................: 2
>>> > CPU Cores...........: 12
>>> > Memory..............: 63627 MB
>>> >
>>> >
>>> > We've also written all NUMA-related logs to a file (using
>>> > -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags);
>>> > the log file can be found here:
>>> > https://drive.google.com/file/d/1eZqYDtBDWKXaEakh_DoYv0P6V9bcLs6Z/view?usp=sharing
>>> > So we know that NUMA is indeed working, but again, it doesn't give
>>> > the desired results we expected to see.
>>> From the shared log file, I see only one GC (GC id 6761), and the
>>> NUMA stats (gc,heap,numa) show 53% local memory allocation, which
>>> seems okay.
>>> Could you share your full VM options?
>>>
>>>
>>> These are the updated vm options:
>>> -XX:+PerfDisableSharedMem
>>> -Xmx40g
>>> -Xms40g
>>> -XX:+DisableExplicitGC
>>> -XX:-OmitStackTraceInFastThrow
>>> -XX:+AlwaysPreTouch
>>> -Duser.country=US
>>> -XX:+UnlockDiagnosticVMOptions
>>> -XX:+DebugNonSafepoints
>>> -XX:+ParallelRefProcEnabled
>>> -XX:+UnlockExperimentalVMOptions
>>> -XX:G1MaxNewSizePercent=90
>>> -XX:InitiatingHeapOccupancyPercent=35
>>> -XX:-G1UseAdaptiveIHOP
>>> -XX:ActiveProcessorCount=2
>>> -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
>>> -XX:+UseNUMA
>>> -Xlog:os*,gc*=trace:file=/outbrain/heapdumps/fulllog.log:hostname,time,level,tags
>> Thanks
>>
>>> >
>>> > Any Ideas why ?
>>> > Is it a matter of workload ?
>>> Can you increase your Java heap on the testing machine?
>>> Your test machine has almost 64GB of memory on 2 NUMA nodes, so I
>>> assume each NUMA node will have almost 32GB of memory.
>>> But you are using only an 8GB Java heap, which fits on one node, so I
>>> don't expect any benefit from enabling NUMA.
>>>
>>>
>>> But when the JVM is started, doesn't it spread the heap evenly
>>> across all NUMA nodes?
>>> And in this case, won't each NUMA node hold half of the heap
>>> (around 4GB)?
>> Your statements above are right.
>> Of an 8GB Java heap, half (4GB) will be allocated on each of node 0
>> and node 1.
>>
>> Compared to the non-NUMA case, G1 NUMA adds only a tiny step: 1) check
>> the caller thread's NUMA id and then 2) allocate memory from that same
>> node.
>> If a testing environment uses very little memory and few threads, all
>> of them can reside on one node, so the tiny addition above may not
>> help; running without it could even work better.
>> This is what I wanted to explain in my previous email.
>>
>>>
>>> I've increased the heap to 40GB, and the container memory to 50GB.
>>>
>>> As the JVM is running on Kubernetes, there could be other things that
>>> affect the test.
>>> For example, the topology manager may cause a pod to allocate from a
>>> single NUMA node.
>>> https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
>>>
>>> That's very interesting; I will read about it and try to understand
>>> more, and to find out whether we're even using the topology manager.
>>> Do you think that using k8s with the topology manager might be the
>>> problem?
>>> Or would enabling the topology manager actually allow better usage of
>>> the hardware and help in our case?
>> Sorry, I don't have enough experience/knowledge of the topology
>> manager or Kubernetes.
>> As I don't know your testing environment fully, I was trying to
>> enumerate what could affect your test.
>>
>> Thanks,
>> Sangheon
>>
>>
>>> > Are there any workloads you can suggest that will benefit from G1
>>> > NUMA awareness?
>>> I measured some performance improvements on SpecJBB2015 and
>>> SpecJBB2005.
>>>
>>> > Do you happen to have a link to code that runs such a
>>> workload?
>>> No, I don't have such a link for the above runs.
>>>
>>> Thanks,
>>> Sangheon
>>>
>>> > Thanks,
>>> > Tal
>>> >
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Tal Goldstein
>> Software Engineer
>> tgoldstein at outbrain.com
>>
>>
>