PING: RFR: 8245158: C2: Enable SLP for some manually unrolled loops

Tue May 26 09:50:57 UTC 2020

Ping - Any reviews of this?

--
Thanks,
Pengfei

> Can I have a review of this enhancement of C2 SLP?
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8245158
> Webrev: http://cr.openjdk.java.net/~pli/rfr/8245158/webrev.00/
> 
> The Java loop below, with stride = 1, can be vectorized by C2.
>   for (int i = start; i < limit; i++) {
>     c[i] = a[i] + b[i];
>   }
> 
> But if it's manually unrolled once, as in the code below, SLP fails to
> vectorize it.
>   for (int i = start; i < limit; i += 2) {
>     c[i] = a[i] + b[i];
>     c[i + 1] = a[i + 1] + b[i + 1];
>   }
> 
> Notably, if the induction variable's initial value "start" is replaced by a
> compile-time constant, the vectorization works.
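
For anyone who wants to try the two loop shapes side by side, here is a minimal self-contained sketch. The class and method names are made up for illustration and are not part of the webrev. It assumes (limit - start) is even, since the unrolled form would otherwise touch index "limit". It only checks that both shapes compute the same result; whether C2 vectorizes either one has to be inspected separately.

```java
import java.util.Arrays;

public class UnrolledLoopShapes {
    // Stride-1 form from the mail.
    static void addRolled(int[] a, int[] b, int[] c, int start, int limit) {
        for (int i = start; i < limit; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Manually unrolled once (stride 2), with "start" passed in as a variable.
    // Assumption: (limit - start) is even, so i + 1 never reaches "limit".
    static void addUnrolledOnce(int[] a, int[] b, int[] c, int start, int limit) {
        for (int i = start; i < limit; i += 2) {
            c[i] = a[i] + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
        }
    }

    public static void main(String[] args) {
        int n = 1024;
        int[] a = new int[n];
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        int[] c1 = new int[n];
        int[] c2 = new int[n];
        addRolled(a, b, c1, 2, n);
        addUnrolledOnce(a, b, c2, 2, n);
        // Both shapes should fill c[start..limit) identically.
        System.out.println(Arrays.equals(c1, c2)); // prints "true"
    }
}
```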
> 
> The root cause is that in the current C2 SuperWord implementation,
> find_adjacent_refs() calls find_align_to_ref() to select a "best align to"
> memory reference to create packs, and particularly, the reference selected
> must be "pre-loop alignable". In other words, C2 must be able to adjust the
> pre-loop trip count such that the vectorized access of this reference is aligned.
> Hence, in find_align_to_ref(), unalignable memory references are discarded.
> [1] SLP pack creation is then aborted if no memory reference is eligible to be
> the "best align to". [2]
> 
> In the current C2 SLP code, the selected "best align to" reference has two
> uses. One is to compute alignment info in order to find adjacent memory
> references for pack creation. The other is to facilitate the pre-loop trip
> count adjustment that aligns vector memory accesses in the main loop.
> But on some platforms, aligning vector accesses is not a mandatory
> requirement (after Roland's JDK-8215483 [3], this is usually checked by
> "!Matcher::misaligned_vectors_ok() || AlignVector"). So the "best align to"
> memory reference doesn't have to be "pre-loop alignable" on all platforms.
> In this patch, we only discard unalignable references when that platform-
> dependent check returns true.
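
The change in the filtering rule can be modeled as a small truth table. This is only a sketch of the logic described above, written in Java for readability; it is not the HotSpot C++ code. All names here are hypothetical, and the two boolean inputs stand in for Matcher::misaligned_vectors_ok() and the AlignVector flag.

```java
public class AlignToRefFilter {
    // Stand-in for the platform-dependent check quoted in the mail:
    // "!Matcher::misaligned_vectors_ok() || AlignVector".
    static boolean vectorsNeedAlignment(boolean misalignedVectorsOk, boolean alignVector) {
        return !misalignedVectorsOk || alignVector;
    }

    // Behavior described for the current code: unalignable references
    // are always discarded as "best align to" candidates.
    static boolean discardBeforePatch(boolean preLoopAlignable) {
        return !preLoopAlignable;
    }

    // Behavior described for the patch: unalignable references are
    // discarded only when the platform requires aligned vector accesses.
    static boolean discardAfterPatch(boolean preLoopAlignable,
                                     boolean misalignedVectorsOk,
                                     boolean alignVector) {
        return !preLoopAlignable && vectorsNeedAlignment(misalignedVectorsOk, alignVector);
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean alignable : values) {
            for (boolean misalignedOk : values) {
                for (boolean alignVector : values) {
                    System.out.println("alignable=" + alignable
                            + " misalignedOk=" + misalignedOk
                            + " AlignVector=" + alignVector
                            + " discardBefore=" + discardBeforePatch(alignable)
                            + " discardAfter="
                            + discardAfterPatch(alignable, misalignedOk, alignVector));
                }
            }
        }
    }
}
```

The only row where the two rules differ is an unalignable reference on a platform where misaligned vectors are fine and AlignVector is off, which is the case the patch targets.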
> 
> After this patch, some manually unrolled loops can be vectorized on
> platforms with no alignment requirement. As almost all modern x86 CPUs
> support unaligned vector moves, I suspect this can benefit the majority of
> today's CPUs.
> 
> Please note that this patch doesn't try to enable SLP for all manually unrolled
> loops. If the case above is unrolled more times, vectorization may still not
> work. The reason is that SLP currently applies only to main loops produced by
> the iteration split. When the loop is manually unrolled many times, the node
> count may exceed LoopUnrollLimit, resulting in no iteration split at all.
> Although this can be worked around by relaxing the unrolling policy via
> slp_max_unroll_factor, we don't do it this way since splitting a big loop may
> increase code size too much. Anyone who wants to vectorize a heavily manually
> unrolled loop can use -XX:LoopUnrollLimit= with a greater value.
> 
> [Tests]
> 
> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core, langtools::tier1 were tested
> and no new failures were found.
> 
> Below are the results of the JMH test [4] for the case above.
> 
> Before:
>   Benchmark              Mode  Cnt      Score     Error  Units
>   TestUnrolledLoop.bar  thrpt   25  58097.290 ± 128.802  ops/s
> 
> After:
>   Benchmark              Mode  Cnt       Score       Error  Units
>   TestUnrolledLoop.bar  thrpt   25  260110.139 ± 10902.284  ops/s
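
To put the two scores in proportion, the ratio can be computed directly from the numbers above. This sketch uses only the reported mean scores and ignores the error bars; the class name is made up for illustration.

```java
import java.util.Locale;

public class SpeedupFromJmh {
    // Throughput ratio: how many times more ops/s after the patch.
    static double speedup(double beforeOpsPerSec, double afterOpsPerSec) {
        return afterOpsPerSec / beforeOpsPerSec;
    }

    public static void main(String[] args) {
        double before = 58097.290;  // "Before" score from the mail, ops/s
        double after = 260110.139;  // "After" score from the mail, ops/s
        System.out.printf(Locale.ROOT, "%.2fx%n", speedup(before, after));
    }
}
```

That works out to roughly 4.5 times the original throughput for this benchmark.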
> 
> [1] http://hg.openjdk.java.net/jdk/jdk/file/a0a21978f3b9/src/hotspot/share/opto/superword.cpp#l780
> [2] http://hg.openjdk.java.net/jdk/jdk/file/a0a21978f3b9/src/hotspot/share/opto/superword.cpp#l587
> [3] http://hg.openjdk.java.net/jdk/jdk/rev/da7dc9e92d91
> [4] http://cr.openjdk.java.net/~pli/rfr/8245158/TestUnrolledLoop.java
> 
> --
> Thanks,
> Pengfei