[aarch64-port-dev ] aarch64 port "review"
Edward Nevill
edward.nevill at gmail.com
Thu May 5 11:05:53 UTC 2016
> Overall, those are minor issues and I don't see any big aarch64 specific
> opportunities to improve code generation (ignoring missing intrinsics or
> maybe some tuning).
Thanks for this. It confirms my own observation that the code
generated by aarch64 C2 is actually pretty good and difficult to
improve upon.
Yes, there are minor optimisations that can be made, but usually these
involve data processing instructions and do not lead to any significant
performance improvement overall, because on out-of-order cores data
processing instructions tend to be folded with loads/stores in any
case. In other words, the optimal scheduling of any piece of code is
restricted by the load/store dependencies, and the presence of a few
extra data processing instructions does not make any difference.
In some cases folding data processing instructions into load store
instructions may make performance worse. So, for example,
ldr Rd, [Rn, Ro, lsl #3]
on some uArches may be worse than
add Rn, Rn, Ro, lsl #3
ldr Rd, [Rn]
because the shift can be scheduled away from the load, whereas folding
it into the load may cause a 1 cycle delay. However, in general the
first form is probably preferable.
One of the issues with aarch64 is that there are many different uArch
implementations with different performance characteristics, whereas
the same is not true to the same extent in the x86 world.
Overall, I have found that the single item dominating Java performance
is memory traffic and this seems to be true to a greater extent than
for equivalent C/C++ applications. Also aarch64 implementations tend
to have smaller L1 caches than x86 because of power/size/cost
constraints.
The sources of memory traffic in a Java application are
- GC
- Application generated
- Compiler generated
- (any more - native code/intrinsics?)
For GC we can do little about the amount of traffic, except choose the
right GC and tune it correctly. The only thing we can do is ensure
that the bus bandwidth is used as efficiently as possible. I have
spent some time optimising the copy routines to be as efficient as
possible and have another patch pending upstreaming to optimise for
uArches where there is an unaligned access penalty.
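For illustration, a bandwidth-efficient copy inner loop on aarch64
typically moves data with load/store pair instructions, something like
the sketch below (register choices and the assumption that the byte
count in x2 is a multiple of 16 are mine, not the actual stub code):

	copy_loop:
		ldp x4, x5, [x1], #16	// load 16 bytes from src, post-increment
		stp x4, x5, [x0], #16	// store 16 bytes to dst, post-increment
		subs x2, x2, #16	// 16 fewer bytes remaining
		b.hi copy_loop		// loop while bytes remain

Each iteration moves 16 bytes with one load and one store, which is
what keeps the bus busy rather than the issue queue.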
Application generated traffic we can also do little about, since this
is presumably the 'useful' work being done by the application. One
thing we could look at is merging adjacent loads/stores into ldp/stp.
Not sure how/where this would be done. The peepholer is one option,
but the peepholer is currently disabled and in any case peepholers are
the last refuge of the desperate IMHO. Also possibly
improving/enhancing vectorisation and merging of smaller loads/stores
(bytes/chars) into word or long ops.
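The ldp/stp merge itself is simple to state: two loads (or stores) of
adjacent addresses through the same base become one pair instruction.
A sketch (register names are illustrative):

	ldr x3, [x2]		// load adjacent doublewords...
	ldr x4, [x2, #8]

	; ...could be merged into:

	ldp x3, x4, [x2]	// one load-pair: same data, one instruction

The hard part is not the rewrite but proving adjacency and ordering
safety at whatever point in the pipeline the merge is attempted.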
The main thing we can do with application traffic is to ensure that it
is as efficient as possible and I have spent some time looking at the
array copy stubs and have another patch pending upstreaming to
optimise for unaligned accesses.
Compiler traffic is more under our control. The main source of
compiler generated traffic is inter-procedural, i.e. spilling of
registers and the frame overhead (128 bits). One thing I have
considered is whether moving to a mixed caller/callee convention as
per C/C++ would gain us anything. However, here be dragons (I think
there are even comments in the source to warn about the dragons). How
could we estimate what the benefit of moving to a mixed caller/callee
convention would be without actually doing it?
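To make the 128-bit frame overhead concrete: every non-leaf frame pays
for storing and reloading the frame pointer and link register pair,
roughly the standard aarch64 prologue/epilogue shape below (a generic
sketch, not C2's exact frame layout):

	stp x29, x30, [sp, #-16]!	// save fp/lr pair (128 bits of traffic)
	mov x29, sp			// establish the frame record
	...				// body, plus any register spills
	ldp x29, x30, [sp], #16		// reload fp/lr on the way out
	ret

Any spilled callee-saved registers add further stp/ldp pairs on top of
this, which is why calling convention choices show up as memory
traffic.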
Look forward to discussing with you at the fireside chat next Thurs!
All the best,
Ed.