RFR: 8256431: [PPC64] Implement Base64 encodeBlock() for Power64-LE

Thu Dec 3 20:52:57 UTC 2020

On Thu, 3 Dec 2020 18:14:55 GMT, Martin Doerr <mdoerr at openjdk.org> wrote:

>> Add a vector-based implementation of the Base64 encodeBlock intrinsic for Power9 and Power10, little-endian Linux only.
>> 
>> This implementation is based upon a paper (linked in comments) describing an Intel SSE vector-based implementation of Base64 encoding.  Although the Intel SSE instruction set and the Power VMX/VSX instruction sets are different, the method used in the paper is adaptable to Power.  In addition, there are a few places in the algorithm where it's possible to gain some performance by using instruction sequences better suited to VMX/VSX, and some additional benefit is gained from the ISA 3.1 additions available in Power10.
>> 
>> There is one controversial method I used in this implementation: I defined a macro to emit the instruction sequence for encoding 12 bytes in a vector to 16 bytes, because this sequence is needed in three places.  Turning it into a function would have been possible, but I would have needed to pass quite a few register numbers into the function.  I would have liked to have used a nested function, to give the function visibility to the register numbers declared in the outer scope, but alas nested functions are not possible in C++.
>> 
>> The overall performance advantage on Power9 is about 4.0X, based on the main/java/org/openjdk/micro/bench/java/util/Base64VarLenEncode.java benchmark.  This benchmark covers random buffer lengths from 8 to 20007 bytes.  Short buffers won't perform as well, approaching the performance of the pure Java code (or slightly worse for very short buffers).  Buffers that are consistently long will perform a little better than 4.0X.
>
> src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 4036:
> 
>> 4034:        // 5.4X slower.  So on P9, we replace lxvl with a conditional
>> 4035:        // unaligned load sequence, based on the alignment of the address
>> 4036:        // and the length of the data requested.
> 
> This code looks like it is more than 5.4X slower than fast lxvl and hence slower than slow lxvl.

I spent quite a lot of time benchmarking different variations of replacement code, and arrived at this one.  In about 40% of the cases, lxvl outperforms this replacement by a bit, but in 60% of the cases, the replacement does quite a lot better than lxvl.  I have the spreadsheets that show it.  The 5.4X number is conservative because it includes overhead of the benchmark loop used to test it, so lxvl may in fact be quite a lot more than 5.4X slower.

That said, I don't really like having this code in there, and would be happy to get rid of it.  Since it's not used in the main loop, I'm guessing using just lxvl might not impact overall performance very much.  So I'm a bit on the fence about it, to be honest.
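
For reference, the 12-bytes-to-16-bytes step mentioned in the quoted description can be illustrated with a plain scalar sketch.  This is not the VMX/VSX sequence from the patch, only the standard Base64 mapping that such a vector sequence has to reproduce; the function name encode12to16 is hypothetical and does not appear in the PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical scalar reference (not from the PR): encode exactly 12 input
// bytes into 16 Base64 characters using the standard alphabet.
std::string encode12to16(const std::array<std::uint8_t, 12>& in) {
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(16);
    for (std::size_t i = 0; i < 12; i += 3) {
        // Pack three bytes into one 24-bit group, most significant byte first.
        std::uint32_t group = (static_cast<std::uint32_t>(in[i]) << 16) |
                              (static_cast<std::uint32_t>(in[i + 1]) << 8) |
                              static_cast<std::uint32_t>(in[i + 2]);
        // Split the group into four 6-bit indices and look each one up.
        out.push_back(kAlphabet[(group >> 18) & 0x3F]);
        out.push_back(kAlphabet[(group >> 12) & 0x3F]);
        out.push_back(kAlphabet[(group >> 6) & 0x3F]);
        out.push_back(kAlphabet[group & 0x3F]);
    }
    return out;
}
```

Because 12 is a multiple of 3, no padding characters are needed; a vector version does the same bit regrouping and table lookup on all four 3-byte groups at once.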

-------------

PR: https://git.openjdk.java.net/jdk/pull/1245