[aarch64-port-dev ] aarch64 port "review"

Wed May 4 12:33:40 UTC 2016

Over the past few weeks, I've been inspecting the code generated by the
aarch64 port, looking for opportunities for improvement. Here is a short
report of what I found.

I used the specjvm2008 suite of benchmarks and focused on the benchmarks
that spend significant time in a few compiled methods (and can be run
with jdk 9). I went over the generated code of: compress, crypto,
mpegaudio and scimark and looked at the code of inner loops where most
of the time is spent. In particular, I tried to:

- locate redundant code
- identify suboptimal use of aarch64 features
- double check anything that seemed odd to me
- verify that vectorization triggers
- for complex benchmarks where the hot code is split over several
  methods, check that big differences in profiling data between x86 and
  aarch64 could be explained (by inlining decisions, for instance).

I didn't try to tune anything (such as inlining or the register
allocator). I didn't check whether instruction scheduling could be
improved, given that, as far as I know, c2's scheduling is pretty limited.

I don't claim to be an aarch64 expert, so I'm sure I missed some
things. That said, given that the platform dependent code of c2 is mostly
restricted to instruction selection, I didn't expect to find any major
problems but wanted to verify that this was indeed the case. The only
issues I found are:

- a case where c2's loop alignment code inserts nops in the body of a
  loop (JDK-8154135, pushed)

- the aarch64 port attempts to take advantage of the base + shifted
  offset addressing mode, but it does so only in platform dependent
  code. By exposing that addressing mode at the end of the optimization
  passes, generated code can be improved (JDK-8154826, under review)

- when vector instructions are used, redundant address computation
  instructions are emitted because the compiler keeps redundant integer
  to long conversions to help code generation on x86 (JDK-8154943,
  reviewed, waiting for a sponsor).

- enabling superword loop unroll analysis (contributed by Intel) on ARM
  helps with smaller types (bytes) and triggers more unrolling with all
  types, so it should be beneficial (JDK-8155717, reviewed, waiting for a
  sponsor)
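
To make the last three items concrete, here is a minimal sketch of the
kind of loop they concern. It is not taken from specjvm2008; the class
and method names (HotLoops, sumInts, addBytes) are hypothetical and only
for illustration. An indexed array access computes an address of roughly
base + header + ((long) i << shift), which is where both the shifted
offset addressing mode and the integer to long conversion come in, and a
simple element-wise byte loop is a typical candidate for superword
vectorization. Whether c2 actually vectorizes or unrolls these loops
depends on the VM build and flags, so the comments describe intent, not
guaranteed output.

```java
public class HotLoops {
    // Indexed int[] access: the element address is roughly
    // base + header + ((long) i << 2). The int index i has to be
    // widened to long before it is scaled and added to the base.
    static long sumInts(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Element-wise loop over bytes: a candidate for vectorization.
    // Assumes x and y are at least as long as dst.
    static void addBytes(byte[] dst, byte[] x, byte[] y) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = (byte) (x[i] + y[i]);
        }
    }

    public static void main(String[] args) {
        int[] ints = new int[1024];
        byte[] x = new byte[1024];
        byte[] y = new byte[1024];
        byte[] dst = new byte[1024];
        for (int i = 0; i < ints.length; i++) {
            ints[i] = i;
            x[i] = (byte) i;
            y[i] = (byte) (2 * i);
        }
        // Call the methods often enough that the JIT is likely to
        // compile them; the iteration count is an arbitrary choice.
        long checksum = 0;
        for (int iter = 0; iter < 20_000; iter++) {
            checksum += sumInts(ints);
            addBytes(dst, x, y);
            checksum += dst[iter % dst.length];
        }
        System.out.println("checksum = " + checksum);
    }
}
```

The generated code for such methods can be inspected with
-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, provided the hsdis
disassembler plugin is available to the VM.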

Overall, those are minor issues and I don't see any big aarch64 specific
opportunities to improve code generation (ignoring missing intrinsics or
maybe some tuning).

Roland.