RFR: 8342692: C2: MemorySegment API slow with short running loops

Tue Oct 22 07:53:22 UTC 2024

To optimize a long counted loop, or long range checks in a long or int
counted loop, the loop is turned into a loop nest. When the loop has
few iterations, the outer loop, whose backedge is never taken, adds a
measurable cost. Furthermore, creating the loop nest usually causes
one iteration of the loop to be peeled so predicates can be set up. If
the loop is short running, that is an extra iteration that runs with
range checks (compared to an int counted loop with int range checks).
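
As a rough source-level picture of what a loop nest is, here is a
minimal Java sketch. It is not what C2 emits: the class name, the
method names and the INNER_CHUNK constant are made up for
illustration, and the sketch assumes stop - start does not overflow.

```java
import java.util.Objects;

public class LoopNestSketch {
    // Hypothetical cap on inner loop iterations, chosen only for this sketch.
    static final int INNER_CHUNK = 1 << 16;

    // Original shape: a long counted loop with a long range check.
    static long plain(long start, long stop, long length) {
        long sum = 0;
        for (long i = start; i < stop; i++) {
            sum += Objects.checkIndex(i, length);
        }
        return sum;
    }

    // Loop nest shape: an outer long loop around an inner int counted loop.
    // Assumes stop - start does not overflow a long.
    static long nested(long start, long stop, long length) {
        long sum = 0;
        for (long outer = start; outer < stop; outer += INNER_CHUNK) {
            int innerLimit = (int) Math.min(INNER_CHUNK, stop - outer);
            for (int j = 0; j < innerLimit; j++) {
                sum += Objects.checkIndex(outer + j, length);
            }
        }
        return sum;
    }
}
```

When stop - start is small, the outer loop in nested() runs exactly
once, so its backedge is pure overhead.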

This change doesn't create a loop nest when:

1- it can be determined statically at loop nest creation time that the
   loop runs for a short enough number of iterations

2- profiling reports that the loop runs for no more than ShortLoopIter
   iterations (1000 by default).

For 2-, a guard is added which is implemented as yet another predicate.
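
The guard for case 2- can be pictured at the source level roughly as
follows. This is only a sketch: in C2 a failing predicate triggers
deoptimization rather than a branch to a second copy of the loop, and
the class, the method names and the lastPath field are hypothetical.
SHORT_LOOP_ITER mirrors the default value of ShortLoopIter quoted
above.

```java
public class ShortLoopGuardSketch {
    // Mirrors the ShortLoopIter default (1000) mentioned in the text.
    static final long SHORT_LOOP_ITER = 1000;

    // Records which path ran last; only here to make the sketch observable.
    static String lastPath = "none";

    static long sum(long start, long stop) {
        if (stop <= start) {
            lastPath = "short";
            return 0;
        }
        // stop - start may wrap around, so compare it as an unsigned value.
        long span = stop - start;
        if (Long.compareUnsigned(span, SHORT_LOOP_ITER) <= 0) {
            lastPath = "short";
            return shortLoop(start, (int) span);
        }
        // Stand-in for the deoptimization taken when the guard fails.
        lastPath = "general";
        return generalLoop(start, stop);
    }

    // Guard passed: a single int counted loop, no loop nest.
    private static long shortLoop(long start, int n) {
        long sum = 0;
        for (int j = 0; j < n; j++) {
            sum += start + j;
        }
        return sum;
    }

    // General long counted loop.
    private static long generalLoop(long start, long stop) {
        long sum = 0;
        for (long i = start; i < stop; i++) {
            sum += i;
        }
        return sum;
    }
}
```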

While this change is in principle simple, I ran into a few
implementation issues:

- While C2 has a way to compute the number of iterations of an int
  counted loop, it doesn't have one for long counted loops. The
  existing logic for int counted loops promotes values to long to
  avoid overflows. I reworked it so it now works for both long and int
  counted loops.

- I added a new deoptimization reason (Reason_short_running_loop) for
  the new predicate. Because the predicate narrows down the number of
  iterations, the limit of the loop after transformation is a cast
  node that's control dependent on the short running loop predicate.
  Once the counted loop is transformed, range check predicates are
  likely to be inserted, and they will depend on the limit, so the
  short running loop predicate has to be the one that's furthest away
  from the loop entry. It is also possible that the limit before
  transformation depends on a predicate
  (TestShortRunningLongCountedLoopPredicatesClone is an example). We
  can then have new predicates, inserted after the transformation,
  that depend on the casted limit, which itself depends on old
  predicates added before the transformation. To solve this circular
  dependency, parse and assert predicates are cloned between the old
  predicates and the loop head. The cloned short running loop parse
  predicate is the one that's used to insert the short running loop
  predicate.

- In the case of a long counted loop, the loop is transformed into a
  regular loop with a new limit and transformed range checks, which is
  later turned into an int counted loop. The int counted loop doesn't
  need loop limit checks because of the way it's constructed. There's
  an assert that catches that we don't attempt to add one. I ran into
  test failures where, by the time the int counted loop is created,
  the fact that the number of iterations of the loop is small enough
  to not need a loop limit check gets lost. I added a cast to make
  sure the narrowed limit's type is not lost (I had to do something
  similar for loop nests). But then, I ran into the same issue again
  because the cast was pushed through a sub or add and the narrowed
  type was lost. I propose that pushing casts through sub/add be only
  done after loop opts are over (same as what's done for range check
  `CastII`).
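
To illustrate the first issue above (computing the number of
iterations without overflow), here is a small Java sketch. It is not
the code in the patch: the class and method names are hypothetical,
and the unsigned arithmetic used for the long case is just one
possible approach, assuming a positive stride and an exit test of the
form i < limit.

```java
public class TripCountSketch {
    // int counted loop: promote to long so limit - init cannot overflow.
    static long intTripCount(int init, int limit, int stride) {
        if (stride <= 0) throw new IllegalArgumentException("stride must be > 0");
        if (init >= limit) return 0;
        return ((long) limit - init + stride - 1) / stride;
    }

    // long counted loop: there is no wider primitive type to promote to,
    // so treat limit - init as an unsigned 64-bit value instead.
    // The result must also be interpreted as unsigned.
    static long longTripCount(long init, long limit, long stride) {
        if (stride <= 0) throw new IllegalArgumentException("stride must be > 0");
        if (init >= limit) return 0;
        long diff = limit - init; // may wrap, but is correct as unsigned
        long q = Long.divideUnsigned(diff, stride);
        long r = Long.remainderUnsigned(diff, stride);
        return r == 0 ? q : q + 1; // round up
    }
}
```

For example, longTripCount(Long.MIN_VALUE, Long.MAX_VALUE, 1) returns
a value that prints as 18446744073709551615 with
Long.toUnsignedString, which a signed long cannot represent.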

On Maurizio's benchmark that's mentioned in the bug, this gives a ~30%
performance increase.

-------------

Commit messages:
 - more
 - more
 - more
 - more
 - more
 - fix & test

Changes: https://git.openjdk.org/jdk/pull/21630/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8342692
  Stats: 1166 lines in 18 files changed: 1104 ins; 16 del; 46 mod
  Patch: https://git.openjdk.org/jdk/pull/21630.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630

PR: https://git.openjdk.org/jdk/pull/21630