Withdrawn: 8300669: AArch64: Table based tails processing and wider stores for Arrays.fill() intrinsic

duke duke at openjdk.org
Tue May 30 03:44:08 UTC 2023


On Thu, 26 Jan 2023 15:28:32 GMT, Dmitry Chuyko <dchuyko at openjdk.org> wrote:

> This is a new AArch64 implementation of existing (1-4-byte element) stubs that are called in C2-compiled code for array fill patterns and Arrays.fill().
> 
> Main variant of the existing algorithm:
> 
> 
> [Short arrays (< 8 bytes): fill by element and exit];
> // ...
> [align base to 8 bytes];
> // ...
> // fill_words
> head_len = (cnt & 14) / 2;
> switch (head_len) {
>     do {
>         cnt -= 16;
>         stp;
>       case 7:
>         stp;
>       case 6:
>         stp;
>         // ...
>       case 1:
>         stp;
>       case 0:
>         base += 8*16;
>     } while (cnt);
> }
> [(over)write a tail < 8 bytes];
> 
> 
> Even in the best case, only 16-byte GPR (STP) stores are used, and there is a jump for every 8 stores. Misaligned targets always require extra work, which especially affects small to medium lengths.
> 
> The new implementation generates a fill routine for every length up to a certain threshold (160 bytes). These routines form a table, and the code jumps to the matching entry once the remaining target length fits within the table.
> 
> Each table entry (target length) can be branch-free and use as many of the widest possible stores as best fit the detected CPU model. Currently this is SIMD STPQ for Neoverse N2 and GPR STP for the rest. The choice was made after benchmarking and is controlled by the new UseSIMDForArrayFill flag on AArch64.
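
To make the table idea concrete, here is a minimal Java model of how a per-length store plan could be derived. It is only a sketch under stated assumptions: the class and method names are hypothetical, the greedy widest-first choice is an illustration, and the real stub may pick stores differently (for example, by overlapping them). The only facts relied on are that STP of two Q registers writes 32 bytes, STP of two X registers writes 16 bytes, and the 160-byte threshold from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a table of fill plans; not the HotSpot stub code.
class FillTableSketch {
    // Store widths in bytes, widest first. 32 models STP of two Q registers,
    // 16 models STP of two X registers; narrower widths are assumed fallbacks.
    static final int[] SIMD_WIDTHS = {32, 16, 8, 4, 2, 1};
    static final int[] GPR_WIDTHS = {16, 8, 4, 2, 1};
    static final int THRESHOLD = 160; // threshold quoted in the description

    // Greedy plan: use as many of the widest stores as fit, then narrower ones.
    static List<Integer> planFor(int length, int[] widths) {
        List<Integer> plan = new ArrayList<>();
        int remaining = length;
        for (int w : widths) {
            while (remaining >= w) {
                plan.add(w);
                remaining -= w;
            }
        }
        return plan;
    }

    // One branch-free plan per target length 0..THRESHOLD.
    static List<List<Integer>> buildTable(int[] widths) {
        List<List<Integer>> table = new ArrayList<>();
        for (int len = 0; len <= THRESHOLD; len++) {
            table.add(planFor(len, widths));
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println("SIMD plan for 100 bytes: " + planFor(100, SIMD_WIDTHS));
        System.out.println("GPR plan for 100 bytes:  " + planFor(100, GPR_WIDTHS));
    }
}
```

In this model the SIMD variant needs fewer stores per entry than the GPR variant, which is the intuition behind selecting the store type per CPU model.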
> 
> Main variant of the new algorithm (see more detailed description in comments):
> 
> 
> [align data at 16 bytes];
> while(cnt_bytes > 128) {
>    [store 128 bytes];
>    cnt_bytes -= 128;
> }
> [store tail of 0..127 bytes];
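
The control flow above can be modelled in plain Java as follows. This is a hedged sketch, not the stub: the class and method names are hypothetical, the alignment step is omitted because Java code does not see array addresses, and Arrays.fill on a sub-range stands in for both the 128-byte block store and the table-driven tail. With the loop condition as written, this model leaves a tail of at most 128 bytes.

```java
import java.util.Arrays;

// Hypothetical model of the block loop plus tail; not the HotSpot stub code.
class BlockFillSketch {
    static final int BLOCK = 128; // bytes stored per loop iteration

    // Fills a[from, to) with v and returns how many full blocks were stored.
    static int fill(byte[] a, int from, int to, byte v) {
        int pos = from;
        int remaining = to - from;
        int blocks = 0;
        while (remaining > BLOCK) {
            Arrays.fill(a, pos, pos + BLOCK, v); // stands in for [store 128 bytes]
            pos += BLOCK;
            remaining -= BLOCK;
            blocks++;
        }
        // Stands in for the table-driven tail store.
        Arrays.fill(a, pos, pos + remaining, v);
        return blocks;
    }

    public static void main(String[] args) {
        byte[] a = new byte[300];
        int blocks = fill(a, 10, 290, (byte) 7);
        System.out.println("full blocks stored: " + blocks);
    }
}
```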
> 
> 
> 
> Both the existing and the proposed implementations handle the zero-fill case specially (see comments about ZVA). The new implementation contains a path for very small arrays; it was added to avoid regressions and could be cut to further improve the more generic case.
> 
> The check added in https://bugs.openjdk.org/browse/JDK-8298720 in StubGenerator is removed, as this is stub code being generated. For the selected threshold, the increase in code size is within 8 KB.
> 
> A new test, TestArraysFill, is added to the intrinsics jtreg tests. It calls optimized versions of 2-arg and 4-arg Arrays.fill() for different data types, lengths and patterns. The target data is checked to be filled with the required value, and the surrounding data is checked to be intact.
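
The kind of check described above can be sketched like this. It is an assumption-laden illustration rather than the actual TestArraysFill: the class and method names are hypothetical, only byte arrays are covered, and nothing here forces C2 compilation, so the intrinsic stub is not guaranteed to be exercised. Arrays.fill(byte[], byte) and Arrays.fill(byte[], int, int, byte) are the standard 2-arg and 4-arg library methods.

```java
import java.util.Arrays;

// Hypothetical sketch of a fill check with guard bytes; not the real jtreg test.
class FillCheckSketch {
    // Pre-fills the array with a guard value (2-arg fill), fills [from, to)
    // with value (4-arg fill), then verifies both the target and its surroundings.
    static boolean check(int length, int from, int to, byte guard, byte value) {
        byte[] a = new byte[length];
        Arrays.fill(a, guard);
        Arrays.fill(a, from, to, value);
        for (int i = 0; i < length; i++) {
            byte expected = (i >= from && i < to) ? value : guard;
            if (a[i] != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sweep a few lengths and offsets around the 160-byte threshold.
        for (int len = 0; len <= 200; len++) {
            for (int from = 0; from < 4; from++) {
                if (!check(len + 8, from, from + len, (byte) 0x55, (byte) 0xAA)) {
                    throw new AssertionError("mismatch at len=" + len + ", from=" + from);
                }
            }
        }
        System.out.println("all fills verified");
    }
}
```

The guard value must differ from the fill value, otherwise an overrun past either end of the target range would go unnoticed.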
> 
> The existing test/micro/org/openjdk/bench/java/util/ArraysFill.java benchmark was used only initially. There are many cases and data lengths to cover. A modified version of the benchmark ...

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/12222

