[External] : Fwd: Unexpected results when enabling +UseNUMA for G1GC
Thomas Schatzl
thomas.schatzl at oracle.com
Tue Mar 2 17:00:22 UTC 2021
Hi,
On 02.03.21 00:57, Tal Goldstein wrote:
> Hi Sangheon,
> We ran 1 more experiment and I would be happy if you could take a look at
> the results we got.
> This time we used a short program we wrote which could be seen here:
> https://gist.github.com/talgol/14a5755db1ac83f4f6c04d71ad57c2e3
> The program tries to make lots of memory allocations, by creating many new
> objects that are not supposed to survive and move to the oldGen.
>
> The program runs using only 1 thread for a duration of 12 hours.
> It was configured with a 40GB heap with the same hardware mentioned before,
> and with the UseNUMA flag.
> This time we expected to see that almost 100% of memory allocations are
> local,
> since the machine itself wasn't under a high load, and there was actually
> no reason for the memory to be allocated from the opposite numa node (there
> are 2).
> But that wasn't the case; it can be seen from the graph below that just a
> bit over 50% of calls were local:
> https://drive.google.com/file/d/1QP5I_SaeNUL6-oEHfc5B9QuiCke0xkiL/view?usp=sharing
>
> We also used jna to log the cpu IDs that were used for running the thread,
> and we then mapped those CPU IDs to identify to which NUMA node they belong.
> We expected to see that only 1 NUMA node is being used, but again, our
> results were different.
>
> Do these results make sense to you?
> Can you explain why there are so many remote allocations?
>
There is a problem with only using memory from a single node: if the JVM
did that, it could only use half of the memory in the current
implementation (with two nodes).
The current implementation assumes that the application's memory access
and use are fairly balanced overall; young-gen memory access is assumed
to be local, while old-gen (long-lived) memory is global, i.e. best
striped across all nodes.
The current implementation does not rebalance memory (or threads)
according to actual access patterns.
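The placement policy described above can be sketched as a toy model (this is an illustration only, not JVM code; the class name, node count, and region indices are made up for the example): young regions are taken from the allocating thread's node, while old regions are striped round-robin across all nodes.

```java
// Toy model of the NUMA-aware placement policy described above
// (illustration only, not actual HotSpot code; names are hypothetical).
public class NumaPlacementModel {
    static final int NODES = 2; // assuming a two-node machine, as in the test

    // Young-gen allocation: keep the region on the allocating thread's node.
    static int youngRegionNode(int allocatingThreadNode) {
        return allocatingThreadNode;
    }

    // Old-gen (long-lived) regions: stripe across all nodes by region index.
    static int oldRegionNode(int regionIndex) {
        return regionIndex % NODES;
    }

    public static void main(String[] args) {
        // A thread on node 0 gets all of its young regions locally...
        for (int r = 0; r < 4; r++) {
            System.out.println("young region " + r + " -> node " + youngRegionNode(0));
        }
        // ...while long-lived data ends up spread over both nodes.
        for (int r = 0; r < 4; r++) {
            System.out.println("old region " + r + " -> node " + oldRegionNode(r));
        }
    }
}
```

Note that, as the mail says, there is no rebalancing step in this model: once a thread lands on a node, its young allocations stay there regardless of who actually reads that memory.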
This means there can be situations where enabling NUMA support (with G1,
but also with Parallel GC) makes things worse, mainly if the application
is as lopsided as in your test, and apparently also in your real-world
application.
Similar problems will occur if you have distinct producers and consumers
of (short-lived) data working in tandem but ending up on threads on
different nodes. I.e. thread A on node 0 always accesses data from
thread B on node 1, and that data is (due to the huge young gen) never
scattered across nodes in the old gen (which would still be worse than
having everything on a single node).
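That cross-node producer/consumer pattern can be sketched as follows (a hedged illustration; the class name is made up, and plain Java cannot pin threads to NUMA nodes, so assume the OS happens to schedule the two threads on different nodes):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the producer/consumer scenario from the mail (illustration,
// not a benchmark). If the OS schedules these two threads on different
// NUMA nodes, the byte[] payloads are allocated from young-gen memory
// local to the producer's node, so every read by the consumer is a
// remote access - and the short-lived data dies before it could ever be
// striped across nodes in the old gen.
public class CrossNodeProducerConsumer {
    static long run(int items) throws InterruptedException {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
        long[] consumed = new long[1];

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    // short-lived allocation, placed local to the producer's node
                    queue.put(new byte[256]);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    // remote read if the consumer runs on the other node
                    consumed[0] += queue.take().length;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return consumed[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("consumed bytes: " + run(10_000));
    }
}
```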
Operating system heuristics for placing memory may do better in these
cases (but maybe not consistently).
Hth,
Thomas
More information about the hotspot-gc-dev mailing list