RFR: 8245158: C2: Enable SLP for some manually unrolled loops

Wed May 20 09:42:55 UTC 2020

Hi C2 Reviewers,

Can I have a review of this enhancement of C2 SLP?

JBS: https://bugs.openjdk.java.net/browse/JDK-8245158
Webrev: http://cr.openjdk.java.net/~pli/rfr/8245158/webrev.00/

The Java loop below, with stride 1, can be vectorized by C2:
  for (int i = start; i < limit; i++) {
    c[i] = a[i] + b[i];
  }

But if it is manually unrolled once, as in the code below, SLP fails to
vectorize it:
  for (int i = start; i < limit; i += 2) {
    c[i] = a[i] + b[i];
    c[i + 1] = a[i + 1] + b[i + 1];
  }

Notably, if the induction variable's initial value "start" is replaced
by a compile-time constant, the vectorization works.
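
For reference, the three loop shapes discussed above can be put side by
side as in the sketch below. The class and method names are made up for
illustration and are not part of the webrev. The unrolled variants assume
that (limit - start) is even, as the snippet above implicitly does.

```java
import java.util.Arrays;

// Illustrative only: class and method names are not from the patch.
final class UnrolledLoopShapes {
    // Stride-1 loop: the shape C2 already vectorizes.
    static void addSimple(int[] a, int[] b, int[] c, int start, int limit) {
        for (int i = start; i < limit; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Manually unrolled once (stride 2) with a variable initial value.
    // Assumes (limit - start) is even; otherwise c[i + 1] runs past limit.
    static void addUnrolled(int[] a, int[] b, int[] c, int start, int limit) {
        for (int i = start; i < limit; i += 2) {
            c[i] = a[i] + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
        }
    }

    // Same unrolled body, but the initial value is a compile-time constant.
    static void addUnrolledConstStart(int[] a, int[] b, int[] c, int limit) {
        for (int i = 0; i < limit; i += 2) {
            c[i] = a[i] + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
        }
    }

    public static void main(String[] args) {
        int n = 1024;
        int[] a = new int[n];
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        int[] c1 = new int[n];
        int[] c2 = new int[n];
        addSimple(a, b, c1, 2, n);
        addUnrolled(a, b, c2, 2, n);
        // Both shapes compute the same result; only the loop structure differs.
        System.out.println(Arrays.equals(c1, c2)); // prints "true"
    }
}
```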

The root cause is that, in the current C2 SuperWord implementation,
find_adjacent_refs() calls find_align_to_ref() to select a "best align
to" memory reference for creating packs, and the selected reference
must be "pre-loop alignable". In other words, C2 must be able to adjust
the pre-loop trip count such that the vectorized access of this
reference is aligned. Hence, find_align_to_ref() discards unalignable
memory references. [1] SLP pack creation is then aborted if no memory
reference is eligible to be the "best align to". [2]
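
To make "pre-loop alignable" more concrete, here is a simplified model
of the arithmetic. It is not C2's actual check. The method name, the
byte offsets, the 4-byte element size and the 16-byte vector width are
all assumptions chosen for illustration. The model asks whether some
number of pre-loop iterations can move an access onto a vector-aligned
address.

```java
// Simplified model, not C2 code: all names and numbers are hypothetical.
final class PreLoopAlignmentModel {
    /**
     * Returns true if there is a pre-loop iteration count k such that
     * firstOffsetBytes + k * strideElems * elemBytes is a multiple of
     * vectorWidthBytes.
     */
    static boolean canAlign(int firstOffsetBytes, int strideElems,
                            int elemBytes, int vectorWidthBytes) {
        int span = strideElems * elemBytes; // bytes advanced per iteration
        // Offsets modulo the vector width repeat after at most
        // vectorWidthBytes steps, so this search is exhaustive.
        for (int k = 0; k < vectorWidthBytes; k++) {
            int offset = firstOffsetBytes + k * span;
            if (Math.floorMod(offset, vectorWidthBytes) == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stride 1, int elements: any 4-byte-aligned start can be aligned.
        System.out.println(canAlign(4, 1, 4, 16)); // prints "true"
        // Stride 2: each iteration advances 8 bytes, so a start that is
        // 4 bytes off a 16-byte boundary never reaches one.
        System.out.println(canAlign(4, 2, 4, 16)); // prints "false"
        // Stride 2 with a start that is 8 bytes off can be aligned.
        System.out.println(canAlign(8, 2, 4, 16)); // prints "true"
    }
}
```

In this model, whether the stride-2 access can be aligned depends on the
first offset. That may explain why a compile-time constant "start" helps:
the compiler can then evaluate the condition, whereas with a variable
"start" it cannot.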

In the current C2 SLP code, the selected "best align to" reference has
two uses. One is to compute alignment info in order to find adjacent
memory references for pack creation. The other is to facilitate the
pre-loop trip count adjustment that aligns vector memory accesses in the
main-loop.
But on some platforms, aligning vector accesses is not a mandatory
requirement (after Roland's JDK-8215483 [3], this is usually checked by
"!Matcher::misaligned_vectors_ok() || AlignVector"). So the "best align
to" memory reference doesn't have to be "pre-loop alignable" on all
platforms. In this patch, we only discard unalignable references when
that platform-dependent check returns true.
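
The policy change can be modelled roughly as follows. This is a Java
sketch of the idea only, not the C++ code in the webrev. MemRef,
candidates() and the two boolean parameters are hypothetical stand-ins
for the memory references, find_align_to_ref(),
Matcher::misaligned_vectors_ok() and AlignVector respectively.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the selection policy; not HotSpot code.
final class AlignToRefPolicyModel {
    // Stand-in for a memory reference considered as "best align to".
    static final class MemRef {
        final String name;
        final boolean preLoopAlignable;

        MemRef(String name, boolean preLoopAlignable) {
            this.name = name;
            this.preLoopAlignable = preLoopAlignable;
        }
    }

    // Keeps the references that may serve as the "best align to" reference.
    static List<MemRef> candidates(List<MemRef> refs,
                                   boolean misalignedVectorsOk,
                                   boolean alignVector) {
        // Mirrors "!Matcher::misaligned_vectors_ok() || AlignVector".
        boolean alignmentRequired = !misalignedVectorsOk || alignVector;
        List<MemRef> kept = new ArrayList<>();
        for (MemRef ref : refs) {
            // Unalignable references are discarded only when the
            // platform (or the flag) demands aligned vector accesses.
            if (alignmentRequired && !ref.preLoopAlignable) {
                continue;
            }
            kept.add(ref);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<MemRef> refs = new ArrayList<>();
        refs.add(new MemRef("a[i]", false));
        refs.add(new MemRef("b[i]", false));
        refs.add(new MemRef("c[i]", false));
        // Alignment required: no candidate is left, so packing would abort.
        System.out.println(candidates(refs, false, false).size()); // prints "0"
        // Misaligned vectors allowed: all references stay eligible.
        System.out.println(candidates(refs, true, false).size());  // prints "3"
    }
}
```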

After this patch, some manually unrolled loops can be vectorized on
platforms with no alignment requirement. As almost all modern x86 CPUs
support unaligned vector moves, I suspect this can benefit the majority
of today's CPUs.

Please note that this patch doesn't try to enable SLP for all manually
unrolled loops. If the above case is unrolled more times, vectorization
may still not work. The reason is that SLP currently applies only to
main-loops produced by the iteration split. When the loop is manually
unrolled many times, the node count may exceed LoopUnrollLimit,
resulting in no iteration split at all. Although this can be worked
around by relaxing the unrolling policy by slp_max_unroll_factor, we
don't do it this way since splitting a big loop may increase code size
too much. Anyone who wants to vectorize a heavily manually unrolled loop
can use -XX:LoopUnrollLimit= with a greater value.

[Tests]

Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1
were run and no new failures were found.

Below are the results of the JMH test [4] for the above case.

Before:
  Benchmark              Mode  Cnt      Score     Error  Units
  TestUnrolledLoop.bar  thrpt   25  58097.290 ± 128.802  ops/s

After:
  Benchmark              Mode  Cnt       Score       Error  Units
  TestUnrolledLoop.bar  thrpt   25  260110.139 ± 10902.284  ops/s

That is roughly a 4.5x throughput improvement for this benchmark.

[1] http://hg.openjdk.java.net/jdk/jdk/file/a0a21978f3b9/src/hotspot/share/opto/superword.cpp#l780
[2] http://hg.openjdk.java.net/jdk/jdk/file/a0a21978f3b9/src/hotspot/share/opto/superword.cpp#l587
[3] http://hg.openjdk.java.net/jdk/jdk/rev/da7dc9e92d91
[4] http://cr.openjdk.java.net/~pli/rfr/8245158/TestUnrolledLoop.java

--
Thanks,
Pengfei