RFR: 8300669: AArch64: Table based tails processing and wider stores for Arrays.fill() intrinsic [v4]

Dmitry Chuyko dchuyko at openjdk.org
Thu Jan 26 16:59:44 UTC 2023


> This is a new AArch64 implementation of existing (1-4-byte element) stubs that are called in C2-compiled code for array fill patterns and Arrays.fill().
> 
> Main variant of the existing algorithm:
> 
> 
> [Short arrays (< 8 bytes): fill by element and exit];
> // ...
> [align base to 8 bytes];
> // ...
> // fill_words
> head_len = (cnt & 14) / 2;
> switch (head_len) {
>     do {
>         cnt -= 16;
>         stp;
>       case 7:
>         stp;
>       case 6:
>         stp;
>         // ...
>       case 1:
>         stp;
>       case 0:
>         base += 8*16;
>     } while (cnt);
> }
> [(over)write a tail < 8 bytes];
> 
> 
> Even in the best case, only 16-byte GPR (STP) stores are used, and there is a jump for every 8 stores. Misaligned targets always require extra work, which especially affects small to medium lengths.
> 
> The new implementation generates a fill routine for every length up to a certain threshold (160 bytes). These routines form a table that is jumped into once the remaining target length is within the threshold.
> 
> Each table entry (target length) can be branch-free and use as many of the widest possible stores as best fit the detected CPU model. Currently this is SIMD STPQ for Neoverse N2 and GPR STP for the rest. The choice was made after benchmarking and is controlled by the new AArch64 flag UseSIMDForArrayFill.
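
To make the table idea concrete, here is a minimal Java sketch, not the stub generator itself. It assumes a hypothetical set of store widths and a greedy decomposition of each length; the real stubs may pick and overlap stores differently. The names FillTableSketch, WIDTHS, planFor, buildTable and fillTail are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FillTableSketch {
    // Hypothetical store widths in bytes, widest first. 32 stands in for a
    // paired 16-byte SIMD store, 16 for a GPR pair store, the rest for scalar stores.
    static final int[] WIDTHS = {32, 16, 8, 4, 2, 1};

    // Assumed threshold, taken from the 160-byte figure in the description.
    static final int THRESHOLD = 160;

    // Greedy plan: cover len bytes with as many of the widest stores as fit.
    static List<Integer> planFor(int len) {
        List<Integer> plan = new ArrayList<>();
        int remaining = len;
        for (int w : WIDTHS) {
            while (remaining >= w) {
                plan.add(w);
                remaining -= w;
            }
        }
        return plan;
    }

    // One entry per length 0..THRESHOLD, built once up front.
    static List<List<Integer>> buildTable() {
        List<List<Integer>> table = new ArrayList<>();
        for (int len = 0; len <= THRESHOLD; len++) {
            table.add(planFor(len));
        }
        return table;
    }

    // Select the entry for len and run its stores in sequence,
    // with no decisions that depend on len after the lookup.
    static void fillTail(byte[] dst, int off, int len, byte value,
                         List<List<Integer>> table) {
        int pos = off;
        for (int width : table.get(len)) {
            for (int i = 0; i < width; i++) { // models one store of `width` bytes
                dst[pos + i] = value;
            }
            pos += width;
        }
    }

    public static void main(String[] args) {
        System.out.println(planFor(37));
    }
}
```

With these assumed widths, planFor(37) gives the sequence 32, 4, 1. Dropping 32 from WIDTHS would model a GPR-only configuration.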
> 
> Main variant of the new algorithm (see the more detailed description in the comments):
> 
> 
> [align data at 16 bytes];
> while(cnt_bytes > 128) {
>    [store 128 bytes];
>    cnt_bytes -= 128;
> }
> [store tail of 0..127 bytes];
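
The loop structure above can be modelled in Java roughly as follows. This is a sketch under stated assumptions: the alignment step is not modelled because Java arrays do not expose addresses, the store helper is a stand-in for a straight-line run of stores, and the class and method names are hypothetical.

```java
class BulkFillSketch {
    // Bytes stored per loop iteration, as in the pseudocode.
    static final int CHUNK = 128;

    // Stand-in for a straight-line run of stores covering len bytes at dst[off..].
    static void store(byte[] dst, int off, int len, byte value) {
        for (int i = 0; i < len; i++) {
            dst[off + i] = value;
        }
    }

    // Fills cntBytes bytes starting at off; returns the number of bulk iterations.
    static int fill(byte[] dst, int off, int cntBytes, byte value) {
        int iterations = 0;
        while (cntBytes > CHUNK) {
            store(dst, off, CHUNK, value);
            off += CHUNK;
            cntBytes -= CHUNK;
            iterations++;
        }
        // Whatever remains after the loop; in the stub this would be the table dispatch.
        store(dst, off, cntBytes, value);
        return iterations;
    }

    public static void main(String[] args) {
        byte[] a = new byte[300];
        System.out.println(fill(a, 10, 260, (byte) 7));
    }
}
```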
> 
> 
> 
> Both the existing and the proposed implementations handle the zero-fill case specially (see comments about ZVA). The new implementation contains a path for very small arrays, added to avoid regressions; it can be cut to further improve the more generic case.
> 
> The check added in https://bugs.openjdk.org/browse/JDK-8298720 in StubGenerator is removed because this is stub code being generated. For the selected threshold, the increase in code size is within 8 KB.
> 
> A new test, TestArraysFill, is added to the intrinsics jtreg tests. It calls optimized versions of 2-arg and 4-arg Arrays.fill() for different data types, lengths and patterns. The target data is checked to be filled with the required value, and the surrounding data is checked to be intact.
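
The kind of check described here can be sketched as below. This is not the actual TestArraysFill: the class and method names are hypothetical, only byte[] is covered, and nothing here forces C2 compilation, which a real intrinsics test would need to arrange.

```java
import java.util.Arrays;

class FillCheckSketch {
    // Pre-fills the array with a guard value, fills [from, to) with value,
    // then verifies the target range and the untouched surroundings.
    static boolean fillAndCheck(int length, int from, int to, byte value, byte guard) {
        byte[] a = new byte[length];
        Arrays.fill(a, guard);            // 2-arg form: whole array
        Arrays.fill(a, from, to, value);  // 4-arg form: half-open range [from, to)
        for (int i = 0; i < length; i++) {
            byte expected = (i >= from && i < to) ? value : guard;
            if (a[i] != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean ok = true;
        // Sweep target lengths around the thresholds mentioned in the description.
        for (int len = 0; len <= 200; len++) {
            ok &= fillAndCheck(len + 16, 8, 8 + len, (byte) 0x5A, (byte) 0xA5);
        }
        System.out.println(ok ? "all checks passed" : "mismatch found");
    }
}
```

The guard value must differ from the fill value, otherwise an overrun past the target range would go unnoticed.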
> 
> The existing test/micro/org/openjdk/bench/java/util/ArraysFill.java benchmark was used only initially. There are many cases and data lengths to cover. A modified version of the benchmark is attached [1] to the RFE but is not included in the change, as it takes too long to complete all valuable variants.
> 
> Resulting performance data are listed in the spreadsheet [2] attached to the RFE. Target processors were Graviton 3, Graviton 2, TaiShan, A72 and A53. The latest data from Altra are not included, but the picture there was similar to Graviton 2 in all experiments. There is a range of target lengths with various enhancement numbers. Interesting lengths are those within the table implementation threshold and close to it (stepped), small lengths (all) and long lengths (1 point, as they look similar). Over this voluntary selection:
> 
> - No major regressions were found.
> - Geomean improvement: 11-33%
> - Median improvement: 10-48%
> 
> Testing: tier1, tier2 and the new test on fastdebug aarch64 and x86.
> 
> [1] https://bugs.openjdk.org/secure/attachment/102426/ArraysFill.java
> [2] https://bugs.openjdk.org/secure/attachment/102427/arrays-fill.ods

Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision:

  Wording about alignment

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12222/files
  - new: https://git.openjdk.org/jdk/pull/12222/files/680bede8..9b8164e7

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12222&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12222&range=02-03

  Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/12222.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12222/head:pull/12222

PR: https://git.openjdk.org/jdk/pull/12222

