PPC64 VSX load/store instructions in stubs

Gustavo Romero gromero at linux.vnet.ibm.com
Thu May 19 23:46:05 UTC 2016


Hi Martin

Thank you for reviewing the webrev.

> We could use a static variable for the default dscr value. It could be modified in VM_Version::config_dscr() and used by your restore code (load_const_optimized(tmp1, ...) instead of li(tmp1, 0)).

Absolutely, resetting DSCR to zero on the assumption that zero is the
default value is not right.

I did as you suggested and created a static variable that is initialized
and modified in VM_Version::config_dscr(). I then use it to obtain the
current DSCR value, set only the pre-fetch depth to deepest, and later
restore the previous value.
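
In the stub this boils down to something like the following sketch
(illustrative only: I'm assuming the static ends up named
VM_Version::_dscr_val and that tmp2 is a free scratch register; the
low-order DSCR bits hold the default pre-fetch depth, 7 = deepest):

    // Set the DSCR pre-fetch depth to deepest, keeping the other bits
    // at the value captured by VM_Version::config_dscr().
    __ load_const_optimized(tmp2, VM_Version::_dscr_val | 7);
    __ mtdscr(tmp2);

    // ... VSX copy loop ...

    // Restore the captured default value instead of a hard-coded zero.
    __ load_const_optimized(tmp2, VM_Version::_dscr_val);
    __ mtdscr(tmp2);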


> - The PPC-elf64abi-1.9 says: "Functions must ensure that the appropriate bits in the vrsave register are set for any vector registers they use. ...". I think not touching vrsave is the right thing for AIX and ppc64le, but I think we will either have to skip the optimization on ppc64 big endian or handle vrsave. Do you agree?

About the VRSAVE register, you are right, but there is some confusion
here and it's my fault: I'm not actually using the VMX registers.

In my code I used the VSX load/store instructions with operands of
VectorRegister type, i.e. VR0 and VR1. The assembled instructions still
come out right because, in the end, VR0 and VR1 are simply encoded as
target (or source) register numbers 0 and 1. But those are VSX registers
0 and 1 (VSR0 and VSR1), not VMX (aka AltiVec) registers 0 and 1
(VR0 and VR1).

There is indeed a relationship between VSR and VR registers, as
we can see in the following diagram adapted from [1]:

       .---------------------------------.
VSR( 0)|     FPR(0)     |                |
VSR( 1)|     FPR(1)     |                |
  ...  |      ...       |                |
  ...  |      ...       |                |
VSR(30)|     FPR(30)    |                |
VSR(31)|     FPR(31)    |                |
VSR(32)|              VR(0)              |
VSR(33)|              VR(1)              |
  ...  |               ...               |
  ...  |               ...               |
VSR(62)|              VR(30)             |
VSR(63)|              VR(31)             |
       '---------------------------------'
        0                             127

However, the VMX registers VR0-31 are mapped to VSX registers VSR32-63,
so we are free to use VSR0 and VSR1 (although they also overlap the FPRs,
FPR0-13 are volatile, so VSR0 and VSR1 are safe to clobber). So my code
was actually using VSR0 and VSR1, not VR0 and VR1, and since VRSAVE only
covers the VMX/AltiVec registers (VR0-VR31), there is no need to take
care of VRSAVE. I fixed the register names/types in this new webrev.
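
To make the encoding point concrete: lxvd2x/stxvd2x are XX1-form
instructions, where the 6-bit VSR number XT is split into a 5-bit T field
and a separate TX bit, so an operand encoded as plain 0 addresses VSR0,
while VR0 is really VSR32 (TX = 1). A small hypothetical encoder
(opcode 31 and XO 844 are the Power ISA values for lxvd2x):

    #include <stdio.h>

    /* Encode lxvd2x XT,RA,RB (XX1 form): T in bits 6-10, TX in bit 31. */
    static unsigned lxvd2x(unsigned xt, unsigned ra, unsigned rb) {
        unsigned t  = xt & 31;
        unsigned tx = (xt >> 5) & 1;
        return (31u << 26) | (t << 21) | (ra << 16) | (rb << 11)
             | (844u << 1) | tx;
    }

    int main(void) {
        printf("lxvd2x vsr0,0,r3  = 0x%08x\n", lxvd2x( 0, 0, 3)); /* FPR half */
        printf("lxvd2x vsr32,0,r3 = 0x%08x\n", lxvd2x(32, 0, 3)); /* = VR0 */
        return 0;
    }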

I noticed that the VSR registers were not implemented in the assembler,
so I implemented them. The VSX load/store instructions now take operands
of VectorSRegister type, and I'm using the VSR0 and VSR1 registers in
the stub, respecting the ABI.
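
For reference, the core of the copy loop now looks roughly like this
(a sketch along the lines of the webrev, not a verbatim excerpt; the
label name is made up, tmp1 is assumed to hold the offset 16, and the
iteration count is assumed to be in CTR):

    Label l_vsx_loop;
    __ bind(l_vsx_loop);
    __ lxvd2x(VSR0, R3_ARG1);           // load 16 bytes from src
    __ lxvd2x(VSR1, tmp1, R3_ARG1);     // load 16 bytes from src + 16
    __ stxvd2x(VSR0, R4_ARG2);          // store 16 bytes to dst
    __ stxvd2x(VSR1, tmp1, R4_ARG2);    // store 16 bytes to dst + 16
    __ addi(R3_ARG1, R3_ARG1, 32);      // src += 32
    __ addi(R4_ARG2, R4_ARG2, 32);      // dst += 32
    __ bdnz(l_vsx_loop);                // decrement CTR, loop while nonzero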

Webrev:
http://81.de.7a9f.ip4.static.sl-reverse.com./8154156/9/v2/

Best regards,
Gustavo

[1] Power Architecture 64-Bit ELF V2 ABI https://goo.gl/LLXRwN, p. 43-44

> -----Original Message-----
> From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] 
> Sent: Mittwoch, 11. Mai 2016 23:07
> To: Volker Simonis <volker.simonis at gmail.com>
> Cc: Doerr, Martin <martin.doerr at sap.com>; Simonis, Volker <volker.simonis at sap.com>; ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; brenohl at br.ibm.com
> Subject: Re: PPC64 VSX load/store instructions in stubs
> Importance: High
> 
> Hi Volker, Hi Martin
> 
> Sincere apologies for the long delay.
> 
> My initial approach to testing the VSX load/store was to extract just the
> mass copy loop into a snippet "grafted" inside inline asm, and to run
> isolated tests with the "perf" tool focused only on aligned source and
> destination (the best case).
> 
> The extracted code, called "Original" in the plot below (black line), is here:
> https://github.com/gromero/arraycopy/blob/2pairs/arraycopy.c#L27-L36
> 
> After some experiments, that extracted code evolved into this version, which
> employs VSX load/store, deepest Data Stream pre-fetch, d-cache touch, and a
> backbranch target aligned to 32 bytes:
> https://github.com/gromero/arraycopy/blob/2pairs/arraycopy_vsx.c#L27-L41
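> 
> In essence the VSX variant boils down to a loop like this simplified
> sketch (not the linked file verbatim; the DSCR setup and dcbt touches
> are omitted, and on little endian the element permutation done by
> lxvd2x is cancelled by stxvd2x, so the copy is still correct):
> 
>     #include <stddef.h>
> 
>     /* Copy n shorts (n a multiple of 16), 32 bytes per iteration,
>      * using two VSX registers. VSR0/VSR1 overlap FPR0/FPR1, hence
>      * the fr0/fr1 clobbers. Requires VSX (-mvsx). */
>     static void copy_vsx(short *dst, const short *src, size_t n) {
>         for (size_t i = 0; i < n; i += 16) {
>             __asm__ volatile(
>                 "lxvd2x  0, 0, %[s]   \n\t"  /* VSR0 <- 16 bytes at src    */
>                 "lxvd2x  1, %[o], %[s]\n\t"  /* VSR1 <- 16 bytes at src+16 */
>                 "stxvd2x 0, 0, %[d]   \n\t"  /* VSR0 -> 16 bytes at dst    */
>                 "stxvd2x 1, %[o], %[d]\n\t"  /* VSR1 -> 16 bytes at dst+16 */
>                 :
>                 : [s] "b" (src + i), [d] "b" (dst + i), [o] "b" (16)
>                 : "fr0", "fr1", "memory");
>         }
>     }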
> 
> All runs were "pinned" using `numactl --cpunodebind --membind` to avoid any
> scheduler decision that could add noise to the measurements.
> 
> VSX, deepest data pre-fetch, d-cache touch, and 32-byte alignment proved to be
> better in the isolated code (red line) in comparison to the original extracted
> code (black line):
> http://gromero.github.io/openjdk/original_vsx_non_pf_vsx_pf_deepest.pdf
> 
> So I proceeded to implement the VSX loop in OpenJDK based on the best-case
> result (VSX, deepest pre-fetch, d-cache touch, and backbranch target alignment -
> goetz's TODO note).
> 
> OpenJDK 8 webrev:
> http://81.de.7a9f.ip4.static.sl-reverse.com/8154156/8/
> 
> OpenJDK 9 webrev:
> http://81.de.7a9f.ip4.static.sl-reverse.com/8154156/9/
> 
> I've tested the change on OpenJDK 8 using this script that calls
> System.arraycopy() on shorts:
> https://goo.gl/8UWtLm
> 
> The results for all data alignment cases:
> http://gromero.github.io/openjdk/src_0_dst_0.pdf
> http://gromero.github.io/openjdk/src_1_dst_0.pdf
> http://gromero.github.io/openjdk/src_0_dst_1.pdf
> http://gromero.github.io/openjdk/src_1_dst_1.pdf
> 
> Martin, I added the vsx test to the feature string. Regarding the ABI, I'm
> using just two VSRs: VSR0 and VSR1, both volatile.
> 
> Volker, since the loop unrolling was removed, the loop now copies 16 elements
> at a time, like the non-VSX loop, not 32 elements. I have only verified the
> change on little endian. Sorry, I didn't understand your question regarding
> "instructions for aligned load/stores". Did you mean instructions for
> unaligned load/stores? I think both fixed-point (ld/std) and VSX load/stores
> are slower in the unaligned scenario. VMX load/store is different, however,
> and expects aligned operands. Thank you very much for opening the bug
> https://bugs.openjdk.java.net/browse/JDK-8154156
> 
> I don't have per-function profiles for each SPEC{jbb,jvm} benchmark to
> determine which one would stress the proposed change best. Is there a
> better benchmark I could use?
> 
> Thank you!
> 
> Best regards,
> Gustavo
> 
> On 05-04-2016 14:23, Volker Simonis wrote:
>> Hi Gustavo,
>>
>> thanks a lot for your contribution.
>>
>> Can you please describe if you've run benchmarks and which performance
>> improvements you saw?
>>
>> With your change, if we're running on POWER8, we will only use the fast
>> path for arrays with at least 32 elements. For smaller arrays, we will
>> fall back to copying only 2 elements at a time, which will be slower than
>> the initial version, which copied 4 at a time in that case.
>>
>> Did you verify your changes on both little and big endian?
>>
>> And what about unaligned memory accesses? As far as I read,
>> lxvd2x/stxvd2x still work, but may be slower. I saw that there are also
>> instructions for aligned load/stores. Would it make sense
>> (performance-wise) to use them for the cases where we can be sure that
>> we have aligned memory accesses?
>>
>> Thank you and best regards,
>> Volker
>>
>>
>> On Fri, Apr 1, 2016 at 10:36 PM, Gustavo Romero
>> <gromero at linux.vnet.ibm.com> wrote:
>>> Hi Martin, Hi Volker
>>>
>>> Currently VSX load/store instructions are not being used in PPC64 stubs,
>>> particularly in arraycopy stubs inside generate_arraycopy_stubs() like,
>>> but not limited to, generate_disjoint_{byte,short,int,long}_copy.
>>>
>>> We can speed up mass copy using VSX (Vector-Scalar Extension) load/store
>>> instructions on processors >= POWER8, the same way it's already done in
>>> libc memcpy().
>>>
>>> This is an initial patch just for jshort_disjoint_arraycopy() VSX vector
>>> load/store:
>>>
>>> http://81.de.7a9f.ip4.static.sl-reverse.com/202539/webrev
>>>
>>> What are your thoughts on that? Is there any impediment to use VSX
>>> instructions in OpenJDK at the moment?
>>>
>>> Thank you.
>>>
>>> Best regards,
>>> Gustavo
>>>
>>
> 


