PPC64 VSX load/store instructions in stubs

Wed May 11 21:06:41 UTC 2016

Hi Volker, Hi Martin

Sincere apologies for the long delay.

My initial approach to testing the VSX load/store was to extract just the mass
copy loop, "graft" it into an inline asm block, and run isolated tests with the
"perf" tool, focusing only on aligned source and destination (best case).

The extracted code, called "Original" in the plot below (black line), is here:
https://github.com/gromero/arraycopy/blob/2pairs/arraycopy.c#L27-L36

That extracted code, after some experiments, evolved into this version, which
employs VSX load/store, deepest Data Stream pre-fetch, d-cache touch, and a
backbranch target aligned to 32 bytes:
https://github.com/gromero/arraycopy/blob/2pairs/arraycopy_vsx.c#L27-L41

All runs were "pinned" using `numactl --cpunodebind --membind` to avoid any
scheduler decision that could add noise to the measurements.

VSX, deepest data pre-fetch, d-cache touch, and 32-byte alignment proved to be
better in the isolated code (red line) in comparison to the original extracted
code (black line):
http://gromero.github.io/openjdk/original_vsx_non_pf_vsx_pf_deepest.pdf

So I proceeded to implement the VSX loop in OpenJDK based on the best case
result (VSX, pre-fetch deepest, d-cache touch, and backbranch target align -
goetz TODO note).

OpenJDK 8 webrev:
http://81.de.7a9f.ip4.static.sl-reverse.com/8154156/8/

OpenJDK 9 webrev:
http://81.de.7a9f.ip4.static.sl-reverse.com/8154156/9/

I've tested the change on OpenJDK 8 using this script that calls
System.arraycopy() on shorts:
https://goo.gl/8UWtLm

The results for all data alignment cases:
http://gromero.github.io/openjdk/src_0_dst_0.pdf
http://gromero.github.io/openjdk/src_1_dst_0.pdf
http://gromero.github.io/openjdk/src_0_dst_1.pdf
http://gromero.github.io/openjdk/src_1_dst_1.pdf
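
A small C sketch of what the four cases mean for address alignment. It assumes
that the 0/1 in each file name is the starting element offset of source and
destination, and that the array data starts on a 16-byte boundary, which a JVM
does not guarantee. The helper names are hypothetical and are not part of the
test script or the webrev.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if p is a multiple of 16 bytes (one VSX register), else 0. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: prints the four offset combinations and returns
 * how many of them have both addresses 16-byte aligned. */
int report_alignment_cases(const int16_t *src, const int16_t *dst)
{
    int both_aligned = 0;
    for (int s = 0; s <= 1; s++) {
        for (int d = 0; d <= 1; d++) {
            int sa = is_aligned16(src + s);   /* offset in jshort elements */
            int da = is_aligned16(dst + d);
            printf("src_%d_dst_%d: src %s, dst %s\n", s, d,
                   sa ? "aligned" : "unaligned",
                   da ? "aligned" : "unaligned");
            both_aligned += (sa && da);
        }
    }
    return both_aligned;
}
```

With both base addresses 16-byte aligned, only the src_0_dst_0 combination
keeps both accesses aligned; an offset of one short moves the address by 2
bytes.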

Martin, I added the vsx test to the feature-string. Regarding the ABI, I'm just
using two VSRs: vsr0 and vsr1, both volatile.

Volker, as the loop unrolling was removed, the loop now copies 16 elements at a
time, like the non-VSX loop, and not 32 elements. I just verified the change on
little endian. Sorry, I didn't understand your question regarding "instructions
for aligned load/stores". Did you mean instructions for unaligned load/stores? I
think both fixed-point (ld/std) and VSX instructions will perform loads/stores
more slowly in the unaligned case. However, VMX load/store is different and
expects aligned operands. Thank you very much for opening the bug
https://bugs.openjdk.java.net/browse/JDK-8154156
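
For reference, the shape of a loop that copies 16 jshort elements (32 bytes)
per iteration can be sketched in portable C as below. This is only an
illustration under stated assumptions, not the stub code from the webrev: each
16-byte memcpy stands in for one lxvd2x or stxvd2x, the two temporaries stand
in for vsr0 and vsr1, and the function name is made up. Pre-fetch, d-cache
touch, and branch target alignment are not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the OpenJDK stub: copies `count` 16-bit
 * elements between non-overlapping buffers. */
void copy_shorts_16_per_iter(int16_t *dst, const int16_t *src, size_t count)
{
    size_t i = 0;

    /* Main loop: 16 elements (32 bytes) per iteration, moved as two
     * 16-byte blocks. memcpy tolerates unaligned addresses. */
    for (; i + 16 <= count; i += 16) {
        unsigned char v0[16], v1[16];
        memcpy(v0, src + i,     16);   /* stands in for a load into vsr0 */
        memcpy(v1, src + i + 8, 16);   /* stands in for a load into vsr1 */
        memcpy(dst + i,     v0, 16);   /* stands in for a store from vsr0 */
        memcpy(dst + i + 8, v1, 16);   /* stands in for a store from vsr1 */
    }

    /* Tail: the remaining 0..15 elements, one at a time. */
    for (; i < count; i++)
        dst[i] = src[i];
}
```

Arrays shorter than 16 elements never enter the main loop in this sketch and
are handled entirely by the element-wise tail.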

I don't have per-function profiles for each SPEC{jbb,jvm} benchmark, so I
cannot determine which one would stress the proposed change better.
Could I use a better benchmark?

Thank you!

Best regards,
Gustavo

On 05-04-2016 14:23, Volker Simonis wrote:
> Hi Gustavo,
> 
> thanks a lot for your contribution.
> 
> Can you please describe if you've run benchmarks and which performance
> improvements you saw?
> 
> With your change if we're running on Power 8, we will only use the
> fast path for arrays with at least 32 elements. For smaller arrays, we
> will fall back to copying only 2 elements at a time which will be
> slower than the initial version which copied 4 at a time in that case.
> 
> Did you verify your changes on both little and big endian?
> 
> And what about unaligned memory accesses? As far as I read,
> lxvd2x/stxvd2x still work, but may be slower. I saw there also exist
> instructions for aligned load/stores. Would it make sense
> (performance-wise) to use them for the cases where we can be sure that
> we have aligned memory accesses?
> 
> Thank you and best regards,
> Volker
> 
> 
> On Fri, Apr 1, 2016 at 10:36 PM, Gustavo Romero
> <gromero at linux.vnet.ibm.com> wrote:
>> Hi Martin, Hi Volker
>>
>> Currently VSX load/store instructions are not being used in PPC64 stubs,
>> particularly in arraycopy stubs inside generate_arraycopy_stubs() like,
>> but not limited to, generate_disjoint_{byte,short,int,long}_copy.
>>
>> We can speed up mass copy using VSX (Vector-Scalar Extension) load/store
>> instructions in processors >= POWER8, the same way it's already done for
>> libc memcpy().
>>
>> This is an initial patch just for jshort_disjoint_arraycopy() VSX vector
>> load/store:
>>
>> http://81.de.7a9f.ip4.static.sl-reverse.com/202539/webrev
>>
>> What are your thoughts on that? Is there any impediment to using VSX
>> instructions in OpenJDK at the moment?
>>
>> Thank you.
>>
>> Best regards,
>> Gustavo
>>
>