RFR: 8312570: [TESTBUG] Jtreg compiler/loopopts/superword/TestDependencyOffsets.java fails on 512-bit SVE

Thu Aug 10 10:51:01 UTC 2023

On Tue, 25 Jul 2023 07:42:59 GMT, Pengfei Li <pli at openjdk.org> wrote:

> The HotSpot jtreg test `compiler/loopopts/superword/TestDependencyOffsets.java` fails on AArch64 CPUs with 512-bit SVE. The reason is that many test loops in the code cannot be vectorized because of data dependences, but the IR tests assume they can.
> 
> On AArch64, these IR tests only check the `asimd` CPU feature and incorrectly assume that AArch64 vectors are at most 256 bits. In fact, `asimd` on AArch64 only represents NEON vectors, which are at most 128 bits. AArch64 CPUs may also have the `sve` feature, which represents scalable vectors of at most 2048 bits. Vectorization won't succeed on 512-bit SVE CPUs if the memory offset between some read and write is less than 512 bits.
> 
> As this jtreg test is auto-generated by a Python script, we have updated the script and regenerated the test. The new version checks auto-vectorization on both NEON-only and NEON+SVE platforms. Below is the diff of the generator script. We have also attached the new script to the JBS page.
> 
> 
> @@ -321,7 +321,8 @@ class Type:
>             p.append(Platform("avx512", ["avx512", "true"], 64))
>          else:
>             assert False, "type not implemented" + self.name
> -        p.append(Platform("asimd", ["asimd", "true"], 32))
> +        p.append(Platform("asimd", ["asimd", "true", "sve", "false"], 16))
> +        p.append(Platform("sve", ["sve", "true"], 256))
>          return p
> 
>  class Test:
> @@ -457,7 +458,7 @@ class Generator:
>          lines.append(" *   and various MaxVectorSize values, and +- AlignVector.")
>          lines.append(" *")
>          lines.append(" * Note: this test is auto-generated. Please modify / generate with script:")
> -        lines.append(" *       https://bugs.openjdk.org/browse/JDK-8308606")
> +        lines.append(" *       https://bugs.openjdk.org/browse/JDK-8312570")
>          lines.append(" *")
>          lines.append(" * Types: " + ", ".join([t.name for t in self.types]))
>          lines.append(" * Offsets: " + ", ".join([str(o) for o in self.offsets]))
> @@ -598,7 +599,8 @@ class Generator:
>              # IR rules
>              for p in test.t.platforms():
>                  elements = p.vector_width // test.t.size
> -                lines.append(f"    // CPU: {p.name} -> vector_width: {p.vector_width} -> elements in vector: {elements}")
> +                max_pre = "max " if p.name == "sve" else ""
> +                lines.append(f"    // CPU: {p.name} -> {max_pre}vector_width: {p.vector_width} -> {max_pre}elements in vector: {elements}")
>                  ###############  -Align...

@pfustc Thanks for the changes and explanations, looks good to me! :)

Ah, just one more idea: since the vectors can now be as wide as 2048 bits, should we not add some cases with even larger dependency offsets? The offsets should go further than `-196, 196`. We could consider adding `255, 256, 511, 512, 1024, 1536` (positive and negative). Of course, the question is whether that increases the runtime too much. What do you think?
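
To make the offset question more concrete, here is a rough Python sketch that checks which of the proposed offsets span fewer bytes than one maximum-width vector. It is not part of the generator script: the names `ELEMENT_BYTES`, `blocks_full_vector` and `offsets_below_width` are hypothetical, and the model is a deliberate simplification of the rule in the PR description (a read/write distance shorter than the vector width prevents full-width vectorization). It ignores the sign of the offset and `AlignVector`, both of which the real IR rules take into account.

```python
# Hypothetical helper, NOT taken from the generator script attached to JBS.
# Simplified assumption: full-width vectorization is blocked when the distance
# between the read and the write is non-zero and shorter than one vector.

# Sizes of the Java primitive types in bytes.
ELEMENT_BYTES = {"byte": 1, "short": 2, "int": 4, "long": 8}


def blocks_full_vector(offset, element_bytes, vector_bytes):
    """Return True if |offset| elements span fewer bytes than one vector."""
    distance_bytes = abs(offset) * element_bytes
    return 0 < distance_bytes < vector_bytes


def offsets_below_width(offsets, vector_bytes):
    """Map each type name to the offsets that stay below the vector width."""
    return {
        name: [o for o in offsets if blocks_full_vector(o, size, vector_bytes)]
        for name, size in ELEMENT_BYTES.items()
    }


if __name__ == "__main__":
    proposed = [255, 256, 511, 512, 1024, 1536]
    # 256 bytes = 2048 bits, the maximum "sve" width used in the diff above.
    for name, hits in offsets_below_width(proposed, 256).items():
        print(f"{name}: offsets shorter than 2048 bits: {hits}")
```

Under this simplified model, an offset of 8 ints (256 bits) is long enough for a 128-bit NEON vector but too short for a 512-bit SVE vector, which matches the failure described in the PR. The same kind of check could help pick a small set of extra offsets that straddle each vector width, which would keep the added runtime in check.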

-------------

Marked as reviewed by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/15010#pullrequestreview-1571589212