RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2]

Fri May 6 09:18:27 UTC 2022

> This is another small enhancement for a code shape that showed up in a
> MemorySegment micro benchmark. The shape to optimize is the one from test1:
> 
> 
> for (int i = 0; i < size; i++) {
>     long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE;
> 
>     j = Objects.checkIndex(j, size * 4);
> 
>     if (((base + j) & 3) != 0) {
>         throw new RuntimeException();
>     }
> 
>     v += UNSAFE.getInt(base + j);
> }
> 
> 
> In that code shape, the loop iv is first scaled, result is then casted
> to long, range checked and finally address of memory location is
> computed.
> 
> The alignment check is transformed so the loop body has no check In
> order to eliminate the range check, that loop is transformed into:
> 
> 
> for (int i1 = ..) {
>     for (int i2 = ..) {
>         long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE;
> 
>         j = Objects.checkIndex(j, size * 4);
> 
>         v += UNSAFE.getInt(base + j);
>     }
> }
> 
> 
> The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ...
> 
> In this case, the type of the ConvI2L is [min_jint, max_jint] and type
> of CastLL is [0, max_jint] (the CastLL has a narrower type).
> 
> I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that
> case. The convI2L and CastII types can be set to [0, max_jint]. The
> new address shape is then:
> 
> (AddP base (ConvI2L (CastII (LShiftI (AddI ...
> 
> which optimize well.
> 
> (LShiftI (AddI ...
> is transformed into
> (AddI (LShiftI ...
> because one of the AddI input is loop invariant (i2) and we have:
> 
> (AddP base (ConvI2L (CastII (AddI (LShiftI ...
> 
> Then because the ConvI2L and CastII types are [0, max_jint], the AddI
> is pushed through the ConvI2L and CastII:
> 
> (AddP base (AddL (ConvI2L (CastII (LShiftI ...
> 
> base and one of the inputs of the AddL are loop invariant so this
> transformed into:
> 
> (AddP (AddP ...) (ConvI2L (CastII (LShiftI ...
> 
> The (AddP ...) is loop invariant so computed before entry. The
> (ConvI2L ...) only depends on the loop iv.
> 
> The resulting address is a shift + an add. The address before
> transformation requires 2 adds + a shift. Also after unrolling, the
> adress of the second access in the loop is cheaper to compute as it
> can be derived from the address of the first access.
> 
> For all of this to work:
> 1) I added a CastLL::Ideal transformation:
> (CastLL (ConvI2L into (ConvI2l (CastII
> 
> 2) I also had to prevent split if to transform (LShiftI (Phi for the
> iv Phi of a counted loop.
> 
> 
> test2 and test3 test 1) and 2) separately.

Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:

  review

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8555/files
  - new: https://git.openjdk.java.net/jdk/pull/8555/files/6493429c..a122f0cf

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8555&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8555&range=00-01

  Stats: 12 lines in 2 files changed: 5 ins; 0 del; 7 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8555.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8555/head:pull/8555

PR: https://git.openjdk.java.net/jdk/pull/8555