[aarch64-port-dev ] Adv. SIMD/Neon support in intrinsics/stubs for AArch64

Thu Aug 8 11:27:32 UTC 2019

> How would that help anything? string_compare already uses vector ops.

Hmm, well. I had taken it for granted that UMOV should be used instead of FMOV, but shortly afterwards I found this note in the Arm Architecture Reference Manual [1]:
>> Some of the Floating-point move (register) instructions overlap with the functionality provided by the Advanced
>> SIMD instructions DUP, INS, and UMOV. However, ARM recommends using the FMOV instructions when operating on
>> scalar floating-point data to avoid the creation of scalar floating-point code that depends on the availability of the
>> Advanced SIMD instruction set.
Another relevant document, the Cortex-A75 software optimization guide [2], suggests that FMOV can be the better choice in terms of execution latency and throughput.

[1] https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile, p228
[2] https://static.docs.arm.com/101398/0200/arm_cortex_a75_software_optimization_guide_v2.pdf, p11

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com> 
Sent: Thursday, August 8, 2019 4:18 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] Adv. SIMD/Neon support in intrinsics/stubs for AArch64

On 8/8/19 7:15 AM, Patrick Zhang OS wrote:

> Does anyone know the whole picture of the Neon support in intrinsics/stubs?

I'm as likely to know as anyone else.

> I am not sure whether what I observed is correct: 1. Pre-Armv8 
> distinguishes between VFP and Neon floating-point support but the 
> definitions are not explicitly separated (via any special prefix) in 
> assembly_aarch64.hpp.

With AArch64 there's no need: AdvSIMD is part of the spec.

> 2. There are some intrinsics/stubs implemented using VFP for 
> vectorizing, where we still have opportunities to improve by aid of 
> Neon 2x64-bit integer/floating-point operations, such as 
> string_compare [1].

How would that help anything? string_compare already uses vector ops.

> 3. More intrinsics need vectorizing, with reference to other archs.

Maybe. In theory it would help, but in practice meaningful speedups are quite difficult to achieve without compromising performance with small strings. But please feel free to try.
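
As a rough illustration of the small-string point, here is a portable C sketch. It is not the HotSpot string_compare stub: first_mismatch is a hypothetical helper, and the two 64-bit loads per iteration merely stand in for one 128-bit vector compare. Any input shorter than 16 bytes skips the wide loop entirely and pays only for the extra length check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not HotSpot code: returns the index of the first
 * differing byte, or n if the two buffers are equal over n bytes. */
static size_t first_mismatch(const unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;

    /* Wide loop: 16 bytes per iteration, standing in for a 128-bit compare. */
    while (n - i >= 16) {
        uint64_t a0, a1, b0, b1;
        memcpy(&a0, a + i, 8);
        memcpy(&a1, a + i + 8, 8);
        memcpy(&b0, b + i, 8);
        memcpy(&b1, b + i + 8, 8);
        if (((a0 ^ b0) | (a1 ^ b1)) != 0)
            break;              /* the mismatch is somewhere in this chunk */
        i += 16;
    }

    /* Scalar tail: locates the exact byte, and is the only path when n < 16. */
    while (i < n && a[i] == b[i])
        i++;

    return i;
}
```

With short inputs the wide loop contributes nothing but setup cost, which is why a vectorized version has to be measured against the scalar one across a range of lengths.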

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671