PPC64 VSX load/store instructions in stubs

Mon May 16 05:53:48 UTC 2016

Dear Gustavo, Volker, and Martin,

I have also implemented a VSX version of the disjoint long arraycopy.
I would appreciate it if it could be applied to OpenJDK, too.

The performance was better than the original code in almost all cases.
In the chart, VSX(max) is the aligned case and VSX(min) is the unaligned
case. In addition, VMX can be better when the data is unaligned.
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20160417/fb12037e/result-0001.jpg

The benchmark code is here.
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20160417/fb12037e/ArrayCopyTest1-0001.java
Server: 8247-22L (POWER8 (3.3GHz, 12 cores) x2, 512GB memory), Ubuntu
Linux 15.04 ppc64LE (kernel: 3.19.0-18-generic),
OpenJDK (build based on 1.9), JVMARGS: "-Xmx40g -Xms40g -Xmn20g"

The patches were created for Java 9.
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20160417/fb12037e/ppc64le_vsx-0001.diff
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20160417/fb12037e/ppc64le_vmx-0001.diff

I would appreciate your comments. 

Best regards,
Miki

"ppc-aix-port-dev" <ppc-aix-port-dev-bounces at openjdk.java.net> wrote on 
2016/05/12 18:33:03:

> From: "Doerr, Martin" <martin.doerr at sap.com>
> To: Gustavo Romero <gromero at linux.vnet.ibm.com>, Volker Simonis 
> <volker.simonis at gmail.com>
> Cc: "Simonis, Volker" <volker.simonis at sap.com>,
> "ppc-aix-port-dev at openjdk.java.net" <ppc-aix-port-dev at openjdk.java.net>,
> "hotspot-dev at openjdk.java.net" <hotspot-dev at openjdk.java.net>,
> "brenohl at br.ibm.com" <brenohl at br.ibm.com>
> Date: 2016/05/12 18:34
> Subject: RE: PPC64 VSX load/store instructions in stubs
> Sent by: "ppc-aix-port-dev" <ppc-aix-port-dev-bounces at openjdk.java.net>
> 
> Hi Gustavo,
> 
> thanks for providing the webrevs. The change looks basically good.
> 
> I only have the following concerns:
> - We basically support configuring DSCR via various DSCR switches.
> Your code resets the value to the hardware default instead of the
> possibly modified values. We're currently only using default DSCR
> values, but we may want to play with them in the future.
> We could use a static variable for the default dscr value. It could 
> be modified in VM_Version::config_dscr() and used by your restore 
> code (load_const_optimized(tmp1, ...) instead of li(tmp1, 0)).
> 
> - The PPC-elf64abi-1.9 says: "Functions must ensure that the 
> appropriate bits in the vrsave register are set for any vector 
> registers they use. ...". I think not touching vrsave is the right 
> thing for AIX and ppc64le, but I think we will either have to skip 
> the optimization on ppc64 big endian or handle vrsave. Do you agree?
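The first concern above (restoring a configurable default instead of a hard-coded 0) could look roughly like the following sketch. It is plain C, not HotSpot assembler code: the DSCR is modeled as an ordinary variable, and every name in it (`simulated_dscr`, `default_dscr_value`, `config_dscr_default`, `enter_copy_loop`, `leave_copy_loop`, `current_dscr`) is hypothetical and does not come from the webrev.

```c
#include <stdint.h>

/* Assumption: an ordinary variable stands in for the real DSCR register,
 * which would be written with a privileged or problem-state move. */
static uint64_t simulated_dscr = 0;

/* The default value the VM wants to run with. It starts at 0 (hardware
 * default) and may be changed once at configuration time. */
static uint64_t default_dscr_value = 0;

/* Called once during VM configuration (hypothetical counterpart of the
 * place where DSCR switches would be evaluated). */
void config_dscr_default(uint64_t value)
{
    default_dscr_value = value;
    simulated_dscr = value;
}

/* Before the copy loop: add the requested pre-fetch bits on top of the
 * configured default instead of overwriting it. */
void enter_copy_loop(uint64_t prefetch_bits)
{
    simulated_dscr = default_dscr_value | prefetch_bits;
}

/* After the copy loop: restore the configured default, not a literal 0. */
void leave_copy_loop(void)
{
    simulated_dscr = default_dscr_value;
}

/* Read back the modeled register value. */
uint64_t current_dscr(void)
{
    return simulated_dscr;
}
```

For example, after `config_dscr_default(0x10)`, calling `enter_copy_loop(7)` and then `leave_copy_loop()` leaves the modeled register at 0x10 again rather than at 0. The value 7 is only an example bit pattern here.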
> 
> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] 
> Sent: Mittwoch, 11. Mai 2016 23:07
> To: Volker Simonis <volker.simonis at gmail.com>
> Cc: Doerr, Martin <martin.doerr at sap.com>; Simonis, Volker 
> <volker.simonis at sap.com>; ppc-aix-port-dev at openjdk.java.net; 
> hotspot-dev at openjdk.java.net; brenohl at br.ibm.com
> Subject: Re: PPC64 VSX load/store instructions in stubs
> Importance: High
> 
> Hi Volker, Hi Martin
> 
> Sincere apologies for the long delay.
> 
> My initial approach to testing the VSX load/store was to extract just
> the mass copy loop, "graft" it into inline asm, and run isolated tests
> with the "perf" tool, focusing only on aligned source and destination
> (best case).
> 
> The extracted code, called "Original" in the plot below (black line),
> is here:
> https://github.com/gromero/arraycopy/blob/2pairs/arraycopy.c#L27-L36
> 
> After some experiments, that extracted code evolved into this version,
> which employs VSX load/store, the deepest data stream pre-fetch, d-cache
> touch, and a backbranch target aligned to 32 bytes:
> https://github.com/gromero/arraycopy/blob/2pairs/arraycopy_vsx.c#L27-L41
> 
> All runs were "pinned" using `numactl --cpunodebind --membind` to avoid
> any scheduler decision that could add noise to the measurements.
> 
> VSX, deepest data pre-fetch, d-cache touch, and 32-byte alignment proved
> to be better in the isolated code (red line) than the original extracted
> code (black line):
> http://gromero.github.io/openjdk/original_vsx_non_pf_vsx_pf_deepest.pdf
> 
> So I proceeded to implement the VSX loop in OpenJDK based on the best
> case result (VSX, deepest pre-fetch, d-cache touch, and backbranch
> target alignment - goetz TODO note).
> 
> OpenJDK 8 webrev:
> http://81.de.7a9f.ip4.static.sl-reverse.com/8154156/8/
> 
> OpenJDK 9 webrev:
> http://81.de.7a9f.ip4.static.sl-reverse.com/8154156/9/
> 
> I've tested the change on OpenJDK 8 using this script that calls
> System.arraycopy() on shorts:
> https://goo.gl/8UWtLm
> 
> The results for all data alignment cases:
> http://gromero.github.io/openjdk/src_0_dst_0.pdf
> http://gromero.github.io/openjdk/src_1_dst_0.pdf
> http://gromero.github.io/openjdk/src_0_dst_1.pdf
> http://gromero.github.io/openjdk/src_1_dst_1.pdf
> 
> Martin, I added the vsx test to the feature-string. Regarding the ABI,
> I'm just using two VSRs: vsr0 and vsr1, both volatile.
> 
> Volker, as the loop unrolling was removed, the loop now copies 16
> elements at a time, like the non-VSX loop, and not 32 elements. I only
> verified the change on little endian. Sorry, I didn't understand your
> question regarding "instructions for aligned load/stores". Did you mean
> instructions for unaligned load/stores? I think both fixed-point
> (ld/std) and VSX instructions will load/store more slowly in the
> unaligned scenario. However, VMX load/store is different and expects
> aligned operands. Thank you very much for opening the bug:
> https://bugs.openjdk.java.net/browse/JDK-8154156
> 
> I don't have the profiling per function for each SPEC{jbb,jvm} benchmark
> in order to determine which one would stress the proposed change better.
> Could I use a better benchmark?
> 
> Thank you!
> 
> Best regards,
> Gustavo
> 
> On 05-04-2016 14:23, Volker Simonis wrote:
> > Hi Gustavo,
> > 
> > thanks a lot for your contribution.
> > 
> > Can you please describe if you've run benchmarks and which performance
> > improvements you saw?
> > 
> > With your change, if we're running on Power 8, we will only use the
> > fast path for arrays with at least 32 elements. For smaller arrays, we
> > will fall back to copying only 2 elements at a time, which will be
> > slower than the initial version, which copied 4 at a time in that case.
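The concern about small arrays can be illustrated with a tiered copy that keeps an intermediate step between the wide fast path and the scalar tail. This is a minimal sketch in plain C, not the stub's assembly: `memcpy()` of a fixed-size block stands in for the vector load/store pairs, and the function name `copy_shorts_tiered` and the constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    FAST_PATH_ELEMS = 32, /* elements per fast-path iteration (64 bytes) */
    MID_PATH_ELEMS  = 4   /* elements per intermediate iteration (8 bytes) */
};

/* Copies n 16-bit elements between non-overlapping buffers.
 * Returns how many elements were handled by the 32-element fast path,
 * so callers can see which tier a given length ends up in. */
size_t copy_shorts_tiered(int16_t *dst, const int16_t *src, size_t n)
{
    size_t i = 0;
    size_t fast = 0;

    assert(dst != NULL && src != NULL);

    /* Fast path: whole blocks of 32 elements. */
    for (; n - i >= FAST_PATH_ELEMS; i += FAST_PATH_ELEMS) {
        memcpy(dst + i, src + i, FAST_PATH_ELEMS * sizeof *src);
        fast += FAST_PATH_ELEMS;
    }

    /* Intermediate tier: 4 elements at a time, so arrays shorter than
     * 32 elements do not drop straight to the slowest loop. */
    for (; n - i >= MID_PATH_ELEMS; i += MID_PATH_ELEMS) {
        memcpy(dst + i, src + i, MID_PATH_ELEMS * sizeof *src);
    }

    /* Tail: remaining 0 to 3 elements, one at a time. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }

    return fast;
}
```

With n = 70, the sketch sends 64 elements through the 32-element loop, 4 through the 4-element loop, and the last 2 through the scalar tail. With n = 31, nothing goes through the fast path at all.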
> > 
> > Did you verify your changes on both little and big endian?
> > 
> > And what about unaligned memory accesses? As far as I read,
> > lxvd2x/stxvd2x still work, but may be slower. I saw there also exist
> > instructions for aligned load/stores. Would it make sense
> > (performance-wise) to use them for the cases where we can be sure that
> > we have aligned memory accesses?
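Whether the aligned variants can be used depends on knowing that both addresses are aligned. A minimal sketch of such a check in plain C follows. It assumes 16-byte vectors, and the names `is_vector_aligned` and `can_use_aligned_path` are hypothetical. A real stub would make this decision in generated code or from statically known alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { VECTOR_BYTES = 16 }; /* assumed vector register width in bytes */

/* True if the address is a multiple of the vector width. */
static bool is_vector_aligned(const void *p)
{
    return ((uintptr_t)p % VECTOR_BYTES) == 0;
}

/* The aligned path is only safe to pick when BOTH source and destination
 * are aligned; otherwise the caller should stay on the path that
 * tolerates unaligned addresses. */
bool can_use_aligned_path(const void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    return is_vector_aligned(dst) && is_vector_aligned(src);
}
```

A caller would test `can_use_aligned_path(dst, src)` once before entering the copy loop and branch to the aligned or the unaligned loop accordingly.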
> > 
> > Thank you and best regards,
> > Volker
> > 
> > 
> > On Fri, Apr 1, 2016 at 10:36 PM, Gustavo Romero
> > <gromero at linux.vnet.ibm.com> wrote:
> >> Hi Martin, Hi Volker
> >>
> >> Currently VSX load/store instructions are not being used in PPC64
> >> stubs, particularly in the arraycopy stubs inside
> >> generate_arraycopy_stubs() such as, but not limited to,
> >> generate_disjoint_{byte,short,int,long}_copy.
> >>
> >> We can speed up mass copy using VSX (Vector-Scalar Extension)
> >> load/store instructions on processors >= POWER8, the same way it's
> >> already done for libc memcpy().
> >>
> >> This is an initial patch just for jshort_disjoint_arraycopy() VSX
> >> vector load/store:
> >>
> >> http://81.de.7a9f.ip4.static.sl-reverse.com/202539/webrev
> >>
> >> What are your thoughts on that? Is there any impediment to using VSX
> >> instructions in OpenJDK at the moment?
> >>
> >> Thank you.
> >>
> >> Best regards,
> >> Gustavo
> >>
> > 
>