[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd

Fri Sep 1 07:51:59 UTC 2017

On 31/08/17 23:46, Dmitrij Pochepko wrote:
> I tried a number of initial versions first. I also tried wider
> multiplication via umulh (and larger load instructions like ldp/ldr),
> but after measuring all versions I found that the version I initially
> sent was the fastest (I was measuring on ThunderX, which I have on
> hand). It might be because of the many additional ror(..., 32)
> operations in the other versions to convert values from the initial
> layout to register form and back. Another reason might be the more
> complex overall logic and larger code, which causes more icache lines
> to be loaded. Or even some umulh specifics on some CPUs. So, after
> measuring, I abandoned these versions in the middle of development and
> polished the fastest one.
> I still have some raw, unpolished development versions of these
> approaches (I'm not sure I saved debugged versions, but they at least
> show the overall idea).
> I attached squares_v2.3.1.diff: an early version which uses mul/umulh
> for just one case. It was surprisingly slower for this case than the
> version I sent for review, so I abandoned this approach.
> I also tried a version with large load instructions (ldp/ldr),
> squares_v1.diff, and it was also slower (it has another, slower,
> mul_add loop implementation, but I was comparing it to the same
> version using only ldrw).
> 
> I'm not sure if I should use 64-bit multiplications and/or 64/128-bit
> loads. I can go back to one of these versions and try to polish it,
> but I'll probably get slower results again on the h/w I have, and it's
> not clear whether it would be faster on any other h/w (which one? It
> takes a lot of time to iteratively improve and measure every version
> on the respective h/w).
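
For readers following the thread, here is a minimal Java sketch of where the ror(..., 32) operations mentioned above can come from. It assumes the BigInteger layout (an int[] magnitude with the most significant word first) and a little-endian 64-bit load; the class and method names are hypothetical, and it only simulates the load, it is not taken from either patch.

```java
public class LimbSwap {
    // Simulates a little-endian 64-bit load of a[i] and a[i + 1]:
    // the word at the lower address (a[i]) lands in the low 32 bits.
    static long loadPair(int[] a, int i) {
        return ((long) a[i + 1] << 32) | (a[i] & 0xFFFFFFFFL);
    }

    // With most-significant-word-first storage, a[i] should be the HIGH half
    // of the 64-bit value, so the two halves must be swapped: a rotate by 32.
    static long toNumeric(long loaded) {
        return Long.rotateRight(loaded, 32);
    }
}
```

The same rotate is needed again before storing a 64-bit result back into the int[] layout, which is why these operations show up on both sides of the multiply.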

I'm using Applied Micro hardware for my testing at the moment.

I did the speed testing for Montgomery multiply on ThunderX.  I
appreciate that it's difficult to get the 64-bit version right and
fast, but you should see about a 3 to 3.5 times speedup over the pure Java
version if you get it right.  That's what I saw when I did the
Montgomery multiply.  You do have to pipeline the loads and the
multiplies to avoid stalls.
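
To make "the 64-bit version" concrete, here is a minimal sketch in Java (not AArch64 assembly, and not the intrinsic under review) of a multiply-accumulate loop over 64-bit limbs, where each step forms a 128-bit product from a low half (mul) and a high half (umulh). The names WideMulAdd, umulh and mulAdd64 are hypothetical, and the little-endian long[] limb layout is an assumption made for clarity; BigInteger itself stores int[] words. Math.multiplyHigh requires Java 9 or later.

```java
public class WideMulAdd {
    // Unsigned high 64 bits of a * b (the result AArch64 umulh produces),
    // derived from the signed Math.multiplyHigh with the standard correction.
    static long umulh(long a, long b) {
        return Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a);
    }

    // out[i] += in[i] * k over little-endian 64-bit limbs (all values unsigned).
    // Returns the carry out of the top limb. Assumes out.length >= in.length.
    static long mulAdd64(long[] out, long[] in, long k) {
        long carry = 0;
        for (int i = 0; i < in.length; i++) {
            long lo = in[i] * k;          // low 64 bits of the product (mul)
            long hi = umulh(in[i], k);    // high 64 bits of the product (umulh)
            long sum = lo + carry;
            if (Long.compareUnsigned(sum, lo) < 0) hi++;   // carry out of lo + carry
            long res = sum + out[i];
            if (Long.compareUnsigned(res, sum) < 0) hi++;  // carry out of the add
            out[i] = res;
            carry = hi;                   // cannot overflow: total is below 2^128
        }
        return carry;
    }
}
```

Each iteration handles twice as many bits as a 32-bit loop, but the load, mul, umulh and carry chain are all dependent within an iteration, which is why an assembly version needs the loads and multiplies for the next limb issued early.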

Be aware that squareToLen is not used at all when running the
RSA benchmark with C2.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671