[aarch64-port-dev ] aarch64 port "review"

Wed May 4 22:53:25 UTC 2016

Hi Roland, I'm Ananth Jasty, a developer from the Cavium team. It's nice to hear from you; Ed has spoken of you in somewhat awed tones.

So far the codegen of the aarch64 JIT looks very strong, with no major pitfalls to be seen. However, given the nature of aarch64 cores, we are seeing a greater need for predictive prefetching, both in compiled code and in the runtime itself (GC is particularly sensitive).

Also, Ed has been investigating unaligned accesses on our ThunderX cores, but I'm uncertain if there is any performance impact on A53 cores.

Overall, the codegen and the implementation as a whole seem close to par with x86. However, the different architectural and micro-architectural goals of available ARM designs (low-power/high-concurrency) could be limitations until the larger A100-class cores start showing up 2 years from now.

Just my opinion. If you are interested in testing on one of our 2-socket, 96-core systems, I'd be happy to arrange access.

Thanks,

Ananth
________________________________________
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> on behalf of Roland Westrelin <rwestrel at redhat.com>
Sent: Wednesday, May 4, 2016 5:33:40 AM
To: aarch64-port-dev at openjdk.java.net
Subject: [aarch64-port-dev ] aarch64 port "review"

Over the past weeks, I've been inspecting generated code for the aarch64
port, looking for opportunities for improvement. Here is a small report
of what I found.

I used the specjvm2008 suite of benchmarks and focused on the benchmarks
that spend significant time in a few compiled methods (and can be run
with jdk 9). I went over the generated code of: compress, crypto,
mpegaudio and scimark and looked at the code of inner loops where most
of the time is spent. In particular, I tried to:

- locate redundant code
- identify suboptimal use of aarch64 features
- double check anything that seemed odd to me
- verify that vectorization triggers
- for complex benchmarks where the hot code is split over several
  methods, check that big differences in profiling data between x86 and
  aarch64 could be explained (by inlining decisions, for instance).

I didn't try to tune anything (such as inlining or the register
allocator). I didn't check whether scheduling of instructions could be
improved, given that, as far as I know, c2's scheduling is pretty limited.

I don't claim to be an aarch64 expert so I'm sure I missed some
things. That said, given that the platform-dependent code of c2 is mostly
restricted to instruction selection, I didn't expect to find any major
problem but wanted to verify it was indeed the case. The only issues I
found are:

- a case where c2's loop alignment code inserts nops in the body of a
  loop (JDK-8154135, pushed)

- the aarch64 port attempts to take advantage of the base + shifted
  offset addressing mode but it does that only in platform dependent
  code.  By exposing that addressing mode at the end of the optimization
  passes, generated code can be improved (JDK-8154826, being reviewed)

- when vector instructions are used, redundant address computation
  instructions are emitted because the compiler keeps redundant integer
  to long conversions to help code generation on x86 (JDK-8154943,
  reviewed, waiting for a sponsor).

- Enabling superword loop unroll analysis (contributed by Intel) on ARM
  helps with smaller types (bytes) and triggers more unrolling with all
  types so should be beneficial (JDK-8155717, reviewed, waiting for a
  sponsor)
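
For illustration only, here is a minimal sketch of the kind of loop shapes
the items above concern. It is not taken from specjvm2008 or from any of
the bug reports; the class and method names (LoopShapes, sumLongs,
addBytes) are hypothetical. The first loop indexes a long[] with an int
induction variable, so each element address is the array base plus the
index shifted by 3, with an int to long conversion of the index along the
way. The second is an element-wise byte loop, a plausible candidate for
superword. Whether c2 actually folds the addressing or vectorizes the loop
depends on the build and the hardware, so the code makes no claim about
what gets emitted.

```java
public class LoopShapes {
    // Indexed long[] access: the address of a[i] is base + header + (i << 3),
    // the shape that a base + shifted-offset addressing mode can express.
    // The int index i is widened to long for the address computation.
    static long sumLongs(long[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Element-wise byte loop: a simple candidate for superword
    // (auto-vectorization) on a small element type.
    static void addBytes(byte[] dst, byte[] x, byte[] y) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = (byte) (x[i] + y[i]);
        }
    }

    public static void main(String[] args) {
        long[] a = new long[1024];
        byte[] x = new byte[1024];
        byte[] y = new byte[1024];
        byte[] dst = new byte[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            x[i] = (byte) i;
            y[i] = 1;
        }
        // Repeat the calls enough times that the JIT is likely to compile
        // both methods; the iteration count is an arbitrary assumption.
        long total = 0;
        for (int iter = 0; iter < 20_000; iter++) {
            total += sumLongs(a);
            addBytes(dst, x, y);
        }
        System.out.println(total + " " + dst[0]);
    }
}
```

To look at the code c2 generates for these methods, one option is to run
with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which needs the
hsdis disassembler library to be available to the JVM.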

Overall, those are minor issues and I don't see any big aarch64 specific
opportunities to improve code generation (ignoring missing intrinsics or
maybe some tuning).

Roland.