[aarch64-port-dev ] [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives
Dmitrij Pochepko
dmitrij.pochepko at bell-sw.com
Thu Jul 20 10:03:45 UTC 2017
Hi everyone,
Please review this small webrev [1], which implements an enhancement [2]
that adds a has_negatives intrinsic to the AARCH64 OpenJDK port. The
intrinsic performs better than C2-compiled code for every array size tried:
ThunderX T88: about 2% faster for array size = 1 and up to 8.5x faster
for large arrays
Cortex A53 (R-Pi): about the same numbers (really large sizes can't be
tested properly there due to the small amount of available memory).
The intrinsified hasNegatives method checks whether the provided byte
array contains any byte with a negative value (highest bit set). In
general, the intrinsic works as follows (with various minor optimizations):
1) Test the low bits of the array length (0x1, 0x2, 0x4, 0x8) and, for
each bit that is set, issue the corresponding load instruction (ldrb,
ldrh, ldrw, ldr), reducing the remaining length accordingly. After this
code the remaining length is 16*N. Proceed to 2).
2) If the remaining length is >= 64, load data in a loop with 4 ldp
instructions (16 bytes each) per iteration, issuing prfm (prefetch hint)
once per iteration if SoftwarePrefetchHintDistance >= 0. This new flag
(SoftwarePrefetchHintDistance) is introduced to provide configurable
software prefetching in dynamically compiled code. It can disable the
software prefetch hint or set the prefetch distance. The default distance
is 3 * dcache_line, which shows the best performance on the armv8 CPUs
we have. The 64-byte loop runs until length < 64; then proceed to 3).
3) A simple 16-byte loading loop runs until the remaining length is 0.
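For readers who want to follow the control flow without reading the
assembly, steps 1) to 3) can be sketched in plain Java. This is only an
illustration and not the code in the webrev: the class and method names
are hypothetical, ByteBuffer reads stand in for the ldrb/ldrh/ldrw/ldr/ldp
loads, OR-ing the loaded words before testing the sign bits is an assumed
way of checking them, and the prfm hint has no Java equivalent, so it
appears only as a comment.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration class; not part of the webrev.
final class HasNegativesSketch {
    // Mask selecting the sign bit of every byte in a 64-bit word.
    private static final long SIGN_BITS = 0x8080808080808080L;

    // Scalar reference: true if any byte in ba[off, off+len) is negative.
    public static boolean hasNegativesScalar(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }

    // Word-at-a-time sketch following steps 1) to 3) of the description.
    public static boolean hasNegativesWide(byte[] ba, int off, int len) {
        // Byte order does not matter here: only the sign bits are tested.
        ByteBuffer buf = ByteBuffer.wrap(ba);
        int pos = off;
        int remaining = len;

        // Step 1: consume 1, 2, 4 and 8 bytes according to the low bits
        // of the length (ldrb, ldrh, ldrw, ldr in the intrinsic).
        if ((remaining & 0x1) != 0) {
            if ((buf.get(pos) & 0x80) != 0) return true;
            pos += 1;
            remaining -= 1;
        }
        if ((remaining & 0x2) != 0) {
            if ((buf.getShort(pos) & 0x8080) != 0) return true;
            pos += 2;
            remaining -= 2;
        }
        if ((remaining & 0x4) != 0) {
            if ((buf.getInt(pos) & 0x80808080) != 0) return true;
            pos += 4;
            remaining -= 4;
        }
        if ((remaining & 0x8) != 0) {
            if ((buf.getLong(pos) & SIGN_BITS) != 0) return true;
            pos += 8;
            remaining -= 8;
        }
        // The remaining length is now a multiple of 16.

        // Step 2: 64 bytes per iteration. The intrinsic uses 4 ldp loads
        // here and, if SoftwarePrefetchHintDistance >= 0, one prfm hint.
        while (remaining >= 64) {
            long acc = 0;
            for (int i = 0; i < 64; i += 8) {
                acc |= buf.getLong(pos + i);
            }
            if ((acc & SIGN_BITS) != 0) return true;
            pos += 64;
            remaining -= 64;
        }

        // Step 3: 16 bytes per iteration until nothing is left.
        while (remaining > 0) {
            long acc = buf.getLong(pos) | buf.getLong(pos + 8);
            if ((acc & SIGN_BITS) != 0) return true;
            pos += 16;
            remaining -= 16;
        }
        return false;
    }
}
```

Both methods should agree for any offset and length within the array, so
the scalar version is a convenient oracle when experimenting with the
wide one.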
Note: The software prefetch hint was observed to improve performance
not only on platforms that do not have hardware prefetching
(ThunderX T88), but also on platforms we have in hand that do have
hardware prefetching (Cortex A53).
Performance testing:
A JMH-based microbenchmark [3] was developed to test the performance of
this enhancement. The results on Cortex A53 [4] and ThunderX T88 [5]
show the intrinsic on par with C2-compiled Java code for very small
strings, with the gain growing with string length, starting from a
string length of 3 and reaching up to 8x for long strings.
Functional testing:
Tested by running the hotspot jtreg tests on Cortex A53 and ThunderX T88
and comparing the results against a vanilla build. No regressions were
observed. Specifically, the test
hotspot/test/compiler/intrinsics/string/TestHasNegatives.java passed on
both Cortex A53 and ThunderX T88.
[1] webrev: http://cr.openjdk.java.net/~dpochepk/8184943/webrev.01/
[2] CR: https://bugs.openjdk.java.net/browse/JDK-8184943
[3] JMH micro benchmark:
http://cr.openjdk.java.net/~dpochepk/8184943/HasNegativesBenchmark/
[4] A53 graph:
http://cr.openjdk.java.net/~dpochepk/8184943/Cortex_A53_comparison.png
[5] T88 graph:
http://cr.openjdk.java.net/~dpochepk/8184943/ThunderX_comparison.png
I'll be happy to incorporate any suggestions for improving this
intrinsic that come up during this review.
Thanks,
Dmitrij