RFR: 8205051: UseNUMA memory interleaving vs cpunodebind & localalloc

Wed Nov 27 13:49:41 UTC 2024

On Tue, 26 Nov 2024 17:25:03 GMT, Swati Sharma <duke at openjdk.org> wrote:

> Hi All,
> 
> The PR handles the performance issues related to flag UseNUMA. We disable the UseNUMA flag when the process gets invoked with incorrect node alignment.
> We check the cpunodebind and membind(or interleave for interleave policy) bitmask equality and disable UseNUMA when they are not equal.
> For example on a 4 NUMA node system:
> 0123 Node Number
> 1100  cpunodebind bitmask
> 1111 membind bitmask 
> Disable UseNUMA as CPU and memory bitmask are not equal.
> 
> 0123 Node Number  
> 1100  cpunodebind bitmask
> 1100 membind bitmask 
> Enable UseNUMA as CPU and memory bitmask are equal.
> 
> This covers all the cases with all policies and tested this with below command
> numactl --cpunodebind=0,1 --localalloc java -Xlog:gc*=info -XX:+UseParallelGC -XX:+UseNUMA -version
> 
> For localalloc and preferred policies the membind bitmask returns true for all nodes, hence if cpunodebind is not bound to all nodes then the UseNUMA will be disabled.
> 
> This PR covers disabling the UseNUMA flag for all GC's hence we observed an improvement of ~25% on G1GC , ~20% on ZGC and ~7-8% on PGC in both throughput  and latency on SPECjbb2015 on a 2 NUMA node SRF-SP system with 6Group configuration.
> 
> Please review and provide your valuable comments.
> 
> Thanks,
> Swati Sharma
> Intel

Thanks for looking at this. Getting the NUMA support right when the configuration is "non-optimal" is not always easy and I think this looks like a good approach. 

When testing this and playing around with `numactl` I noticed that when using `--interleave` to favor one node we still leave the NUMA support on. This is not new, but I think we might want to fix this as well:

$ numactl -N 0 -i 0 java -Xlog:os,gc+init -XX:+UseParallelGC -XX:+UseNUMA -version
[0.002s][info][os] UseNUMA is enabled and invoked in 'interleave' mode. Heap will be configured using NUMA memory nodes: 0
...
[0.004s][info][gc     ] Using Parallel
...
[0.006s][info][gc,init] NUMA Support: Enabled
[0.006s][info][gc,init] NUMA Nodes: 2

I know that with interleaving we are not really binding the process to a single node, but as long as there is memory left on that node it will be used, so turning on the NUMA support for the collectors is not really correct. It doesn't seem to matter much, because the GC support still figures out that it should only use the configured node, but it looks a bit strange:

[0,032s][info][gc,heap,exit]  PSYoungGen      total 9728K, used 522K [0x00000000ff580000, 0x0000000100000000, 0x0000000100000000)
[0,032s][info][gc,heap,exit]   eden space 8704K, 6% used [0x00000000ff580000,0x00000000ff602980,0x00000000ffe00000)
[0,032s][info][gc,heap,exit]     lgrp 0 space 8704K, 6% used [0x00000000ff580000,0x00000000ff602980,0x00000000ffe00000)
[0,032s][info][gc,heap,exit]   from space 1024K, 0% used [0x00000000fff00000,0x00000000fff00000,0x0000000100000000)
[0,032s][info][gc,heap,exit]   to   space 1024K, 0% used [0x00000000ffe00000,0x00000000ffe00000,0x00000000fff00000)

This seems to be easy to fix by just making sure to use the correct mem-bitmask in `is_bound_to_single_node()`.
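
To illustrate the idea, here is a minimal standalone sketch. It is not the HotSpot code: `NodeMask`, `count_nodes()` and the three-argument `is_bound_to_single_node()` are hypothetical stand-ins that model the libnuma bitmasks as plain integers, and it assumes that the membind mask reports both nodes when running in interleave mode.

```cpp
#include <cstdint>

// Hypothetical stand-in for a libnuma node mask: bit i set means node i is allowed.
using NodeMask = std::uint64_t;

// Number of nodes set in the mask (clears the lowest set bit on each step).
constexpr int count_nodes(NodeMask m) {
  return m == 0 ? 0 : 1 + count_nodes(m & (m - 1));
}

// Model of the single-node check: in interleave mode look at the interleave
// mask, otherwise at the membind mask.
constexpr bool is_bound_to_single_node(bool interleave_mode,
                                       NodeMask membind,
                                       NodeMask interleave) {
  return count_nodes(interleave_mode ? interleave : membind) == 1;
}

// Models `numactl -N 0 -i 0` on a 2-node machine (assumed masks):
// membind = nodes 0 and 1 (0x3), interleave = node 0 only (0x1).
// Consulting only the membind mask does not detect the single node ...
static_assert(!is_bound_to_single_node(false, 0x3, 0x1), "membind mask alone misses it");
// ... while consulting the interleave mask does.
static_assert(is_bound_to_single_node(true, 0x3, 0x1), "interleave mask shows one node");
```

The only design point is the mask selection. The counting itself stays the same regardless of policy.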

src/hotspot/os/linux/os_linux.cpp line 4502:

> 4500:         (Linux::is_running_in_interleave_mode() && Linux::_numa_interleave_bitmask != nullptr &&
> 4501:         Linux::_numa_cpunodebind_bitmask != nullptr &&
> 4502:         !_numa_bitmask_equal(Linux::_numa_interleave_bitmask, Linux::_numa_cpunodebind_bitmask))) {

Please extract this to a separate helper like the others. Maybe you can come up with a better name but something like: 
Suggestion:

    if (Linux::numa_max_node() < 1 ||
        Linux::is_bound_to_single_node() ||
        Linux::mem_and_cpu_node_mismatch()) {

Here is a suggestion for the helper to make it more readable. Please make sure I got the logic right.
```
static bool mem_and_cpu_node_mismatch() {
  struct bitmask* mem_nodes_bitmask = Linux::_numa_membind_bitmask;
  if (Linux::is_running_in_interleave_mode()) {
    mem_nodes_bitmask = Linux::_numa_interleave_bitmask;
  }

  if (mem_nodes_bitmask == nullptr || Linux::_numa_cpunodebind_bitmask == nullptr) {
    return false;
  }

  return !_numa_bitmask_equal(mem_nodes_bitmask, Linux::_numa_cpunodebind_bitmask);
}
```
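
As a sanity check of the logic, the same decision can be modelled outside of HotSpot and run against the 4-node examples from the PR description. This is only a sketch: `NodeMask`, `MemPolicy`, `NumaConfig` and `masks_known` are hypothetical names that replace the libnuma bitmask pointers and the null checks with plain values.

```cpp
#include <cstdint>

// Hypothetical stand-in for a libnuma node mask: bit i set means node i is allowed.
using NodeMask = std::uint64_t;

enum class MemPolicy { Membind, Interleave };

// Hypothetical snapshot of the process NUMA configuration.
struct NumaConfig {
  MemPolicy policy;
  NodeMask  membind;
  NodeMask  interleave;
  NodeMask  cpunodebind;
  bool      masks_known;  // models the nullptr checks in the helper above
};

// Pick the memory mask that matches the active policy.
constexpr NodeMask mem_nodes(const NumaConfig& c) {
  return c.policy == MemPolicy::Interleave ? c.interleave : c.membind;
}

// Mismatch only if both masks are known and they differ.
constexpr bool mem_and_cpu_node_mismatch(const NumaConfig& c) {
  return c.masks_known && mem_nodes(c) != c.cpunodebind;
}

// cpunodebind = nodes 0,1 (0x3), membind = all four nodes (0xF): mismatch.
static_assert(mem_and_cpu_node_mismatch(
                  NumaConfig{MemPolicy::Membind, 0xF, 0x0, 0x3, true}),
              "CPU and memory masks differ");
// cpunodebind = nodes 0,1 and membind = nodes 0,1: aligned.
static_assert(!mem_and_cpu_node_mismatch(
                  NumaConfig{MemPolicy::Membind, 0x3, 0x0, 0x3, true}),
              "CPU and memory masks are equal");
```

Returning `false` when a mask is unknown keeps the behavior conservative: NUMA support is only disabled when a mismatch can actually be shown.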

src/hotspot/os/linux/os_linux.cpp line 4508:

> 4506:         warning("UseNUMA is disabled as the process bound to a single numa node"
> 4507:                 " or cpu and memory nodes are not aligned");
> 4508:       FLAG_SET_ERGO(UseNUMA, false);

I think we should check both `UseNUMA` and `UseNUMAInterleaving` here and set both accordingly. 
Suggestion:

      // Disable NUMA support if:
      // 1. Only a single NUMA node is available
      // 2. The process is bound to a single NUMA node
      // 3. The process memory and cpu node configuration is misaligned
      if ((UseNUMA && FLAG_IS_CMDLINE(UseNUMA)) ||
          (UseNUMAInterleaving && FLAG_IS_CMDLINE(UseNUMAInterleaving))) {
        // Only issue a warning if the user explicitly asked for NUMA support
        log_warning(os)("NUMA support is disabled as the process bound to a single"
                        " numa node or cpu and memory nodes are not aligned");
      }
      FLAG_SET_ERGO(UseNUMA, false);
      FLAG_SET_ERGO(UseNUMAInterleaving, false);

I also switched to `log_warning(os)` here instead of the old `warning`, but I see that we use the old `warning` below, and we should probably make sure both places use the same mechanism. I would prefer `log_warning(os)`, but this is not a strong request.

-------------

Changes requested by sjohanss (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/22395#pullrequestreview-2464923817
PR Review Comment: https://git.openjdk.org/jdk/pull/22395#discussion_r1860600600
PR Review Comment: https://git.openjdk.org/jdk/pull/22395#discussion_r1860643601