RFR: 8307352: AARCH64: Improve itable_stub

Tue Jun 13 14:36:02 UTC 2023

On Thu, 4 May 2023 07:36:43 GMT, Boris Ulasevich <bulasevich at openjdk.org> wrote:

> This is a change for AArch64 similar to https://github.com/openjdk/jdk/pull/13460
> 
> The change replaces two separate iterations over the itable with a new algorithm consisting of two loops. The first loop looks for a match with resolved_klass, checking for a match with holder_klass along the way. The second loop then continues iterating over the itable from where the first one stopped (not starting over), checking only for a match with holder_klass.
> 
> InterfaceCalls OpenJDK benchmark performance results on Cortex-A53, Cortex-A72, Neoverse N1 and Neoverse V1 micro-architectures (columns: benchmark, baseline, patched, improvement in %):
> 
> 
> Cortex-A53 (Pi 3 Model B Rev 1.2)
> 
> test1stInt2Types    37.5      37.358     0.38
> test1stInt3Types   160.166   148.04      8.19
> test1stInt5Types   158.131   147.955     6.88
> test2ndInt2Types    52.634    53.291    -1.23
> test2ndInt3Types   201.39    181.603    10.90
> test2ndInt5Types   195.722   176.707    10.76
> testIfaceCall      157.453   140.498    12.07
> testIfaceExtCall   175.46    154.351    13.68
> testMonomorphic     32.052    32.039     0.04
>                                  AVG:    6.85
> 
> Cortex-A72 (Pi 4 Model B Rev 1.2)
> 
> test1stInt2Types    27.4796   27.4738    0.02
> test1stInt3Types    66.0085   64.9374    1.65
> test1stInt5Types    67.9812   66.2316    2.64
> test2ndInt2Types    32.0581   32.062    -0.01
> test2ndInt3Types    68.2715   65.6643    3.97
> test2ndInt5Types    68.1012   65.8024    3.49
> testIfaceCall       64.0684   64.1811   -0.18
> testIfaceExtCall    91.6226   81.5867   12.30
> testMonomorphic     26.7161   26.7142    0.01
>                                  AVG:    2.66
> 
> Neoverse N1 (m6g.metal)
> 
> test1stInt2Types     2.9104    2.9086    0.06
> test1stInt3Types    10.9642   10.2909    6.54
> test1stInt5Types    10.9607   10.2856    6.56
> test2ndInt2Types     3.3410    3.3478   -0.20
> test2ndInt3Types    12.3291   11.3089    9.02
> test2ndInt5Types    12.328    11.2704    9.38
> testIfaceCall       11.0598   10.3657    6.70
> testIfaceExtCall    13.0692   11.2826   15.84
> testMonomorphic      2.2354    2.2341    0.06
>                                  AVG:    6.00
> 
> Neoverse V1 (c7g.2xlarge)
> 
> test1stInt2Types    2.2317     2.2320   -0.01
> test1stInt3Types    6.6884     6.1911    8.03
> test1stInt5Types    6.7334     6.2193    8.27
> test2ndInt2Types    2.4002     2.4013   -0.04
> test2ndInt3Types    7.9603     7.0372   13.12
> test2ndInt5Types    7.9532     7.0474   12.85
> testIfaceCall       6.7028     6.3272    5.94
> testIfaceExtCall    8.3253     6.9416   19.93
> testMonomorphic     1.2446     1.2544   -0.79
>                                  AVG:    7.48 
> 
> 
> Testing...

Thanks.

For this to be reviewable, we'll need:

- A benchmark, and some data.
- An explanation of why it's better than the existing implementation.

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 1255:

> 1253:   //     }
> 1254:   //   } while (temp_itbl_klass != 0);
> 1255:   //   goto L_no_such_interface // Not found.

I don't think the pseudocode matches the assembly. This is more like:

  do {
    temp_itbl_klass = *(scan_temp += scan_step);
    if (holder_klass == temp_itbl_klass) ...
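
For comparison, here is a minimal C++ model of the two-loop scan as the PR description words it. It is a sketch, not the stub itself: `Klass`, `ItableEntry`, `kNotFound` and `lookup_holder_offset` are hypothetical names, and the itable is assumed to be an array of (interface, offset) pairs terminated by a null interface. It does not try to mirror the register usage or the pre-increment structure of the assembly.

```cpp
// Hypothetical stand-ins for the HotSpot types; not the real declarations.
struct Klass {};

struct ItableEntry {
    const Klass* interface_klass;  // assumed: nullptr marks the end of the table
    int offset;                    // assumed: offset of that interface's method table
};

const int kNotFound = -1;  // hypothetical "no such interface" result

// Returns holder_klass's offset if the table contains both resolved_klass
// and holder_klass, and kNotFound otherwise.
int lookup_holder_offset(const ItableEntry* itable,
                         const Klass* resolved_klass,
                         const Klass* holder_klass) {
    const ItableEntry* scan = itable;
    const ItableEntry* holder_entry = nullptr;

    // Loop 1: look for resolved_klass, remembering holder_klass if we pass it.
    for (;; ++scan) {
        const Klass* k = scan->interface_klass;
        if (k == nullptr) {
            return kNotFound;  // resolved_klass is not in the table
        }
        if (k == holder_klass) {
            holder_entry = scan;
        }
        if (k == resolved_klass) {
            break;
        }
    }
    if (holder_entry != nullptr) {
        return holder_entry->offset;  // holder was seen at or before resolved
    }

    // Loop 2: continue from the next entry (no restart), holder_klass only.
    for (++scan; scan->interface_klass != nullptr; ++scan) {
        if (scan->interface_klass == holder_klass) {
            return scan->offset;
        }
    }
    return kNotFound;
}
```

In this model each table entry is visited at most once, whichever of the two interfaces comes first.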

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 1283:

> 1281:     bind(L_loop_scan_resolved_entry);
> 1282:     cmp(holder_klass, temp_itbl_klass);
> 1283:     csel(holder_offset, scan_temp, holder_offset, Assembler::EQ);

It's worth being cautious about conditional selects. On out-of-order machines they help when the outcome is genuinely unpredictable and the probability of each outcome is close to 50-50. Benchmarks can fool us because they are often designed to be random, but real-world code rarely is.
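
To make the trade-off concrete, the two shapes can be modeled in C++. `csel Rd, Rn, Rm, cond` writes Rn to Rd when the condition holds and Rm otherwise, so the result always depends on the compare and on both inputs. A correctly predicted branch lets the core speculate past the compare instead. The function names below are hypothetical, and a compiler is free to emit either a branch or a select for either function. This is only an illustration of the two forms, not a measurement.

```cpp
#include <cstdint>

// Branch form: with a well-predicted condition, execution can continue
// speculatively without waiting for the compare result.
std::uintptr_t select_with_branch(bool match,
                                  std::uintptr_t scan_temp,
                                  std::uintptr_t holder_offset) {
    if (match) {
        return scan_temp;
    }
    return holder_offset;
}

// Select form, written with a mask to mimic csel: the result is computed
// from the condition and both inputs, with no control-flow change.
std::uintptr_t select_with_mask(bool match,
                                std::uintptr_t scan_temp,
                                std::uintptr_t holder_offset) {
    // mask is all ones when match is true, zero otherwise
    const std::uintptr_t mask =
        std::uintptr_t(0) - static_cast<std::uintptr_t>(match);
    return (scan_temp & mask) | (holder_offset & ~mask);
}
```

Both functions return the same value for the same inputs. They differ only in whether the new `holder_offset` waits on the compare (select) or on a prediction (branch).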

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13792#issuecomment-1534360182
PR Review Comment: https://git.openjdk.org/jdk/pull/13792#discussion_r1221626714
PR Review Comment: https://git.openjdk.org/jdk/pull/13792#discussion_r1222732706