From stewartd.qdt at qualcommdatacenter.com  Fri Sep  1 16:22:27 2017
From: stewartd.qdt at qualcommdatacenter.com (stewartd.qdt)
Date: Fri, 1 Sep 2017 16:22:27 +0000
Subject: [aarch64-port-dev ] [aarch64-port-dev][10] RFR: 8187022 UBFX instructions have wrong format string
Message-ID: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com>

Please see the webrev [1] for fixing the format string of ubfx [2].

[1]: http://cr.openjdk.java.net/~njian/8187022/webrev.00/
[2]: https://bugs.openjdk.java.net/browse/JDK-8187022

Thanks,
Daniel Stewart

From dmitry.chuyko at bell-sw.com  Fri Sep  1 16:26:11 2017
From: dmitry.chuyko at bell-sw.com (Dmitry Chuyko)
Date: Fri, 1 Sep 2017 19:26:11 +0300
Subject: [aarch64-port-dev ] RFR: 8186671: Use `yield` instruction in SpinPause on linux-aarch64
In-Reply-To: <68db10fa-46d1-638f-7f46-a5c862e11b69@bell-sw.com>
References: <68db10fa-46d1-638f-7f46-a5c862e11b69@bell-sw.com>
Message-ID: <33ae3f2a-80b3-ed99-cb23-93de86af480d@bell-sw.com>

On 08/24/2017 06:33 PM, Dmitry Chuyko wrote:
> On 08/23/2017 10:39 PM, White, Derek wrote:
>> Hi Andrew,
>>
>>> -----Original Message-----
>>> From: aarch64-port-dev [mailto:aarch64-port-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley
>>> Sent: Wednesday, August 23, 2017 12:32 PM
>>> To: aarch64-port-dev at openjdk.java.net
>>> Subject: Re: [aarch64-port-dev ] RFR: 8186671: Use `yield` instruction in SpinPause on linux-aarch64
>>>
>>> On 23/08/17 17:07, Dmitry Chuyko wrote:
>>>> Please review a change in SpinPause implementation.
>>>>
>>>> related study: http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
>>>> rfe: https://bugs.openjdk.java.net/browse/JDK-8186671
>>>> webrev: http://cr.openjdk.java.net/~dchuyko/8186671/webrev.00/
>>>>
>>>> The function was moved to the platform .S file and now contains a yield
>>>> instruction.
>>> [...]
>>>
>>> In any case we
>>>> Re the use of yield in SpinPause(): this looks correct to me. OK.
> Good. This part seemed scarier.

There were no objections to this part (extern). I need sponsorship to
push the change.

It would be interesting to discuss the other (intrinsic) part a bit more
at a fireside chat.

-Dmitry

>
> --
> Dmitry
>>>
>>> --
>>> Andrew Haley
>>> Java Platform Lead Engineer
>>> Red Hat UK Ltd.
>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>

From aph at redhat.com  Fri Sep  1 17:40:24 2017
From: aph at redhat.com (Andrew Haley)
Date: Fri, 1 Sep 2017 18:40:24 +0100
Subject: [aarch64-port-dev ] [aarch64-port-dev][10] RFR: 8187022 UBFX instructions have wrong format string
In-Reply-To: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com>
References: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com>
Message-ID: <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>

On 01/09/17 17:22, stewartd.qdt wrote:
> Please see the webrev [1] for fixing the format string of ubfx [2].
>
> [1]: http://cr.openjdk.java.net/~njian/8187022/webrev.00/
> [2]: https://bugs.openjdk.java.net/browse/JDK-8187022

Great, thanks.

P.S. Resending to hotspot-dev. That's where this should go, because
AArch64 is in the main hotspot tree now.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From aph at redhat.com  Sat Sep  2 08:10:00 2017
From: aph at redhat.com (Andrew Haley)
Date: Sat, 2 Sep 2017 09:10:00 +0100
Subject: [aarch64-port-dev ] RFR: 8186671: Use `yield` instruction in SpinPause on linux-aarch64
In-Reply-To: <33ae3f2a-80b3-ed99-cb23-93de86af480d@bell-sw.com>
References: <68db10fa-46d1-638f-7f46-a5c862e11b69@bell-sw.com> <33ae3f2a-80b3-ed99-cb23-93de86af480d@bell-sw.com>
Message-ID:

On 01/09/17 17:26, Dmitry Chuyko wrote:
> There were no objections to this part (extern). I need sponsorship to
> push the change.

I can do it, but it really needs to be sent to hotspot-dev.

> It would be interesting to discuss the other (intrinsic) part a bit more
> at a fireside chat.

OK, but without any actual implementations we can test it'll be a very
short discussion.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
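An illustrative aside on the change above: the webrev implements SpinPause()
directly in the platform .S file, so the sketch below is not the patch itself,
only a rough C++ equivalent using GCC-style inline assembly. The return value
is an assumption, not taken from the webrev:

    // Sketch only; compiles with GCC/Clang when targeting AArch64.
    extern "C" int SpinPause() {
      asm volatile("yield");  // CPU hint: this thread is spin-waiting
      return 1;               // assumed return value, not from the webrev
    }

The yield instruction is a hint that the core is busy-waiting, which can help
on SMT and heavily loaded systems and is essentially free elsewhere.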
From felix.yang at linaro.org  Sun Sep  3 13:16:38 2017
From: felix.yang at linaro.org (Felix Yang)
Date: Sun, 3 Sep 2017 21:16:38 +0800
Subject: [aarch64-port-dev ] [aarch64-port-dev][10] RFR: 8187022 UBFX instructions have wrong format string
In-Reply-To: <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>
References: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com> <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>
Message-ID:

Pushed. Thanks.

On 2 September 2017 at 01:40, Andrew Haley wrote:

> On 01/09/17 17:22, stewartd.qdt wrote:
> > Please see the webrev [1] for fixing the format string of ubfx [2].
> >
> > [1]: http://cr.openjdk.java.net/~njian/8187022/webrev.00/
> > [2]: https://bugs.openjdk.java.net/browse/JDK-8187022
>
> Great, thanks.
>
> P.S. Resending to hotspot-dev. That's where this should go, because
> AArch64 is in the main hotspot tree now.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd.
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>

From felix.yang at linaro.org  Mon Sep  4 15:18:23 2017
From: felix.yang at linaro.org (Felix Yang)
Date: Mon, 4 Sep 2017 23:18:23 +0800
Subject: [aarch64-port-dev ] RFR(S): 8154537: AArch64: some integer rotate instructions are never emitted
In-Reply-To:
References: <57161A23.3050807@redhat.com> <57162A22.2050706@redhat.com> <571722B7.2000404@redhat.com> <1bc50628-5ef9-b24b-d1b3-4762e7ff3b12@redhat.com>
Message-ID:

Hi,

    This issue has been fixed in jdk10.
    Shall I propose a patch for jdk9 dev and maybe the jdk8u aarch64 repo? They
got the same issue.

Thanks,
Felix

On 1 August 2017 at 20:43, Felix Yang wrote:

> LGTM. This also addresses:
> http://cr.openjdk.java.net/~fyang/8157906/webrev.00/src/cpu/aarch64/vm/aarch64.ad.sdiff.html
>
> Thanks,
> Felix
>
> On 29 July 2017 at 01:59, Andrew Haley wrote:
>
>> I'm looking at the webrev in
>> http://cr.openjdk.java.net/~roland/8154537/webrev.00/ and I see that
>> the changes were made to aarch64.ad but not to ad_aarch64.m4. This is
>> problematic because some .m4 files are used to generate the .ad file,
>> and if anyone regenerates the .ad file the bug will regress.
>>
>> I think this is the change we need to make. It won't affect generated
>> code at all, but it is something of a ticking bomb.
>>
>> diff -r 214a94e9366c src/cpu/aarch64/vm/aarch64_ad.m4
>> --- a/src/cpu/aarch64/vm/aarch64_ad.m4  Mon Jul 17 12:11:32 2017 +0000
>> +++ b/src/cpu/aarch64/vm/aarch64_ad.m4  Fri Jul 28 18:57:25 2017 +0100
>> @@ -268,21 +268,21 @@
>>    ins_pipe(ialu_reg_reg_vshift);
>>  %}')dnl
>>  define(ROL_INSN, `
>> -instruct $3$1_rReg_Var_C$2(iRegLNoSp dst, iRegL src, iRegI shift, immI$2 c$2, rFlagsReg cr)
>> +instruct $3$1_rReg_Var_C$2(iReg$1NoSp dst, iReg$1 src, iRegI shift, immI$2 c$2, rFlagsReg cr)
>>  %{
>>    match(Set dst (Or$1 (LShift$1 src shift) (URShift$1 src (SubI c$2 shift))));
>>
>>    expand %{
>> -    $3L_rReg(dst, src, shift, cr);
>> +    $3$1_rReg(dst, src, shift, cr);
>>    %}
>>  %}')dnl
>>  define(ROR_INSN, `
>> -instruct $3$1_rReg_Var_C$2(iRegLNoSp dst, iRegL src, iRegI shift, immI$2 c$2, rFlagsReg cr)
>> +instruct $3$1_rReg_Var_C$2(iReg$1NoSp dst, iReg$1 src, iRegI shift, immI$2 c$2, rFlagsReg cr)
>>  %{
>>    match(Set dst (Or$1 (URShift$1 src shift) (LShift$1 src (SubI c$2 shift))));
>>
>>    expand %{
>> -    $3L_rReg(dst, src, shift, cr);
>> +    $3$1_rReg(dst, src, shift, cr);
>>    %}
>>  %}')dnl
>>  ROL_EXPAND(L, rol, rorv)
>>
>> --
>> Andrew Haley
>> Java Platform Lead Engineer
>> Red Hat UK Ltd.
>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>
>

From aph at redhat.com  Mon Sep  4 16:09:16 2017
From: aph at redhat.com (Andrew Haley)
Date: Mon, 4 Sep 2017 17:09:16 +0100
Subject: [aarch64-port-dev ] RFR(S): 8154537: AArch64: some integer rotate instructions are never emitted
In-Reply-To:
References: <57161A23.3050807@redhat.com> <57162A22.2050706@redhat.com> <571722B7.2000404@redhat.com> <1bc50628-5ef9-b24b-d1b3-4762e7ff3b12@redhat.com>
Message-ID: <67595610-3d30-fbfd-96ab-baa00e12233b@redhat.com>

On 04/09/17 16:18, Felix Yang wrote:
>     This issue has been fixed in jdk10.
>     Shall I propose a patch for jdk9 dev and maybe the jdk8u aarch64 repo? They
> got the same issue.

jdk8u is pre-approved. jdk9 dev is more problematic: I think it's closed for
now. But please register the bugs anyway.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From gnu.andrew at redhat.com  Tue Sep  5 02:15:29 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:15:29 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u: Added tag aarch64-jdk8u144-b02 for changeset 461c9270383a
Message-ID: <201709050215.v852FTR4016107@aojmv0008.oracle.com>

Changeset: 8803133b679b
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/rev/8803133b679b

Added tag aarch64-jdk8u144-b02 for changeset 461c9270383a

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:15:35 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:15:35 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/corba: Added tag aarch64-jdk8u144-b02 for changeset 4b222c433612
Message-ID: <201709050215.v852FZ7F016192@aojmv0008.oracle.com>

Changeset: 2f6bf6972714
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/corba/rev/2f6bf6972714

Added tag aarch64-jdk8u144-b02 for changeset 4b222c433612

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:15:42 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:15:42 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/jaxp: Added tag aarch64-jdk8u144-b02 for changeset 2793510feb8c
Message-ID: <201709050215.v852FgFW016257@aojmv0008.oracle.com>

Changeset: b56f98b75e2a
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/jaxp/rev/b56f98b75e2a

Added tag aarch64-jdk8u144-b02 for changeset 2793510feb8c

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:18:41 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:18:41 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/jaxws: Added tag aarch64-jdk8u144-b02 for changeset 1eb06202a5c9
Message-ID: <201709050218.v852IfiP019096@aojmv0008.oracle.com>

Changeset: 56babd47ee19
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/jaxws/rev/56babd47ee19

Added tag aarch64-jdk8u144-b02 for changeset 1eb06202a5c9

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:18:47 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:18:47 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/langtools: Added tag aarch64-jdk8u144-b02 for changeset eb8e9a1d6c9f
Message-ID: <201709050218.v852Ilj4019169@aojmv0008.oracle.com>

Changeset: 9a5a859f6fda
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/langtools/rev/9a5a859f6fda

Added tag aarch64-jdk8u144-b02 for changeset eb8e9a1d6c9f

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:18:53 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:18:53 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/hotspot: Added tag aarch64-jdk8u144-b02 for changeset 7672149aea2c
Message-ID: <201709050218.v852Irmm019271@aojmv0008.oracle.com>

Changeset: 21011afdacc2
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/hotspot/rev/21011afdacc2

Added tag aarch64-jdk8u144-b02 for changeset 7672149aea2c

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:18:59 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:18:59 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/jdk: Added tag aarch64-jdk8u144-b02 for changeset 9322c39fd0df
Message-ID: <201709050218.v852IxgL019351@aojmv0008.oracle.com>

Changeset: 49cb4b2b45a3
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/jdk/rev/49cb4b2b45a3

Added tag aarch64-jdk8u144-b02 for changeset 9322c39fd0df

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:19:06 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:19:06 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/nashorn: Added tag aarch64-jdk8u144-b02 for changeset 13c40d5bd8cc
Message-ID: <201709050219.v852J6cu019482@aojmv0008.oracle.com>

Changeset: b74e5d373608
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/nashorn/rev/b74e5d373608

Added tag aarch64-jdk8u144-b02 for changeset 13c40d5bd8cc

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:32 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:32 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah: 4 new changesets
Message-ID: <201709050249.v852nWv9004341@aojmv0008.oracle.com>

Changeset: 461c9270383a
Author:    andrew
Date:      2016-07-11 05:02 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/rev/461c9270383a

8151841: Build needs additional flags to compile with GCC 6 [plus parts of 8149647 & 8032045]
Summary: C++ standard needs to be explicitly set and some optimisations turned off to build on GCC 6
Reviewed-by: erikj, dholmes, kbarrett

! common/autoconf/generated-configure.sh
! common/autoconf/hotspot-spec.gmk.in
! common/autoconf/spec.gmk.in
! common/autoconf/toolchain.m4

Changeset: 8803133b679b
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/rev/8803133b679b

Added tag aarch64-jdk8u144-b02 for changeset 461c9270383a

! .hgtags

Changeset: 768646fd5745
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/rev/768646fd5745

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 6cd26459fb2f
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/rev/6cd26459fb2f

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset 768646fd5745

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:37 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:37 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/corba: 3 new changesets
Message-ID: <201709050249.v852nbOB004413@aojmv0008.oracle.com>

Changeset: 2f6bf6972714
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/corba/rev/2f6bf6972714

Added tag aarch64-jdk8u144-b02 for changeset 4b222c433612

! .hgtags

Changeset: ad9497112b0a
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/corba/rev/ad9497112b0a

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 76a6ff94929a
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/corba/rev/76a6ff94929a

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset ad9497112b0a

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:42 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:42 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/jaxp: 3 new changesets
Message-ID: <201709050249.v852ng69004475@aojmv0008.oracle.com>

Changeset: b56f98b75e2a
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxp/rev/b56f98b75e2a

Added tag aarch64-jdk8u144-b02 for changeset 2793510feb8c

! .hgtags

Changeset: 488afc89de7b
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxp/rev/488afc89de7b

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 0e28e142b39e
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxp/rev/0e28e142b39e

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset 488afc89de7b

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:48 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:48 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/jaxws: 3 new changesets
Message-ID: <201709050249.v852nm49004562@aojmv0008.oracle.com>

Changeset: 56babd47ee19
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxws/rev/56babd47ee19

Added tag aarch64-jdk8u144-b02 for changeset 1eb06202a5c9

! .hgtags

Changeset: f1d17bae71a9
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxws/rev/f1d17bae71a9

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 20f47c7395a6
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxws/rev/20f47c7395a6

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset f1d17bae71a9

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:53 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:53 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/langtools: 3 new changesets
Message-ID: <201709050249.v852nr2Z004637@aojmv0008.oracle.com>

Changeset: 9a5a859f6fda
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/langtools/rev/9a5a859f6fda

Added tag aarch64-jdk8u144-b02 for changeset eb8e9a1d6c9f

! .hgtags

Changeset: 7d0753285c49
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/langtools/rev/7d0753285c49

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: e0e8b07dc201
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/langtools/rev/e0e8b07dc201

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset 7d0753285c49

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:59 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:59 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/hotspot: 3 new changesets
Message-ID: <201709050249.v852nxCN004697@aojmv0008.oracle.com>

Changeset: 21011afdacc2
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/21011afdacc2

Added tag aarch64-jdk8u144-b02 for changeset 7672149aea2c

! .hgtags

Changeset: ec1439043fc0
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/ec1439043fc0

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 1bec072d4387
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/1bec072d4387

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset ec1439043fc0

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:50:05 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:50:05 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/jdk: 3 new changesets
Message-ID: <201709050250.v852o5Bf004777@aojmv0008.oracle.com>

Changeset: 49cb4b2b45a3
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jdk/rev/49cb4b2b45a3

Added tag aarch64-jdk8u144-b02 for changeset 9322c39fd0df

! .hgtags

Changeset: b415ef41ac89
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jdk/rev/b415ef41ac89

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 24a174dce25b
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jdk/rev/24a174dce25b

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset b415ef41ac89

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:50:11 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:50:11 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/nashorn: 3 new changesets
Message-ID: <201709050250.v852oBQl004839@aojmv0008.oracle.com>

Changeset: b74e5d373608
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/nashorn/rev/b74e5d373608

Added tag aarch64-jdk8u144-b02 for changeset 13c40d5bd8cc

! .hgtags

Changeset: f41bd292f498
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/nashorn/rev/f41bd292f498

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 84c7ffac6e87
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/nashorn/rev/84c7ffac6e87

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset f41bd292f498

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:43 2017
From: gnu.andrew at redhat.com (Andrew Hughes)
Date: Tue, 5 Sep 2017 03:49:43 +0100
Subject: [aarch64-port-dev ] [RFR] 8151841: Build needs additional flags to compile with GCC 6 [plus parts of 8149647 & 8032045]
In-Reply-To: <850ac4fa-d724-51e3-c46a-65438e530558@redhat.com>
References: <00db57b4-dada-e528-fb87-181ee04ebb45@redhat.com> <850ac4fa-d724-51e3-c46a-65438e530558@redhat.com>
Message-ID:

On 31 August 2017 at 10:04, Andrew Dinn wrote:
> On 30/08/17 16:29, Andrew Hughes wrote:
>> I haven't merged to shenandoah/jdk8u yet as it doesn't seem worth it
>> for just the one fix and there is nothing else in the aarch64/jdk8u repositories
>> after aarch64-jdk8u144-b01. Are there any pending backports to aarch64/jdk8u
>> which would then make a worthwhile batch of changes to merge over to
>> the Shenandoah tree?
>>
>> Of course, in the worst case, it'll be merged with the next security update
>> in October.
>
> I don't know of any other changes that need merging. However, it would
> be very helpful also to have this in the shenandoah/jdk8u repo because it
> ensures that anyone can build the repo on Fedora, not just those who
> know the necessary black magic. I'm particularly thinking of JBoss staff
> like Andy Miller (who recently wanted to build a Shenandoah
> slowdebug JVM and then debug it using gdb). So, unless there is a good
> reason not to merge just this one change, could we please have it?
>
> regards,
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

I wasn't proposing to not merge it, just that we wouldn't do a merge solely for
this change. It would have been nicer to bring in a number of changes rather
than just the one. However, I do agree it's a pain; it's been driving me mad
for the last year.

Tagged as aarch64-jdk8u144-b02 and merged as aarch64-shenandoah-jdk8u144-b02

--
Andrew :)

Senior Free Java Software Engineer
Red Hat, Inc. (http://www.redhat.com)

Web Site: http://fuseyism.com
Twitter: https://twitter.com/gnu_andrew_java
PGP Key: ed25519/0xCFDA0F9B35964222 (hkp://keys.gnupg.net)
Fingerprint = 5132 579D D154 0ED2 3E04 C5A0 CFDA 0F9B 3596 4222

From adinn at redhat.com  Tue Sep  5 09:32:09 2017
From: adinn at redhat.com (Andrew Dinn)
Date: Tue, 5 Sep 2017 10:32:09 +0100
Subject: [aarch64-port-dev ] [RFR] 8151841: Build needs additional flags to compile with GCC 6 [plus parts of 8149647 & 8032045]
In-Reply-To:
References: <00db57b4-dada-e528-fb87-181ee04ebb45@redhat.com> <850ac4fa-d724-51e3-c46a-65438e530558@redhat.com>
Message-ID: <01d2f874-561d-d2ab-8edd-972ff2d0aa06@redhat.com>

On 05/09/17 03:49, Andrew Hughes wrote:
> I wasn't proposing to not merge it, just that we wouldn't do a merge solely for
> this change. It would have been nicer to bring in a number of changes rather
> than just the one. However, I do agree it's a pain; it's been driving me mad
> for the last year.
>
> Tagged as aarch64-jdk8u144-b02 and merged as aarch64-shenandoah-jdk8u144-b02

Excellent. Thanks!

regards,

Andrew Dinn
-----------

From felix.yang at linaro.org  Tue Sep  5 14:14:40 2017
From: felix.yang at linaro.org (felix.yang at linaro.org)
Date: Tue, 05 Sep 2017 14:14:40 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/hotspot: 8187224: aarch64: some inconsistency between aarch64_ad.m4 and aarch64.ad
Message-ID: <201709051414.v85EEeEx028772@aojmv0008.oracle.com>

Changeset: 471de666658d
Author:    fyang
Date:      2017-09-05 19:09 +0800
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/hotspot/rev/471de666658d

8187224: aarch64: some inconsistency between aarch64_ad.m4 and aarch64.ad
Summary: fix ROL_INSN and ROR_INSN definition in aarch64_ad.m4
Reviewed-by: aph

! src/cpu/aarch64/vm/aarch64_ad.m4

From felix.yang at linaro.org  Tue Sep  5 14:26:18 2017
From: felix.yang at linaro.org (Felix Yang)
Date: Tue, 5 Sep 2017 22:26:18 +0800
Subject: [aarch64-port-dev ] RFR(S): 8154537: AArch64: some integer rotate instructions are never emitted
In-Reply-To: <67595610-3d30-fbfd-96ab-baa00e12233b@redhat.com>
References: <57161A23.3050807@redhat.com> <57162A22.2050706@redhat.com> <571722B7.2000404@redhat.com> <1bc50628-5ef9-b24b-d1b3-4762e7ff3b12@redhat.com> <67595610-3d30-fbfd-96ab-baa00e12233b@redhat.com>
Message-ID:

Newly created bug report: https://bugs.openjdk.java.net/browse/JDK-8187224
(which duplicates: https://bugs.openjdk.java.net/browse/JDK-8185656)

Patch has been applied to the aarch64 jdk8u repo:
http://hg.openjdk.java.net/aarch64-port/jdk8u/hotspot/rev/471de666658d

Thanks,
Felix

On 5 September 2017 at 00:09, Andrew Haley wrote:

> On 04/09/17 16:18, Felix Yang wrote:
> >     This issue has been fixed in jdk10.
> >     Shall I propose a patch for jdk9 dev and maybe the jdk8u aarch64 repo? They
> > got the same issue.
>
> jdk8u is pre-approved. jdk9 dev is more problematic: I think it's closed for
> now. But please register the bugs anyway.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd.
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>

From dmitrij.pochepko at bell-sw.com  Tue Sep  5 17:34:11 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Tue, 5 Sep 2017 20:34:11 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
Message-ID: <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>

Hi Andrew,

you can find my attempt to implement the mulAdd intrinsic using wider
multiplication here:

http://cr.openjdk.java.net/~dpochepk/8186915/webrev.02/

but, as expected, I got slower results on the same benchmark compared to
the original webrev.01 with 32-bit multiplication.

I've measured results on ThunderX:

webrev.01 version:

Benchmark                      (size)  Mode  Cnt    Score    Error  Units
BigIntegerBench.mulAddReflect       1  avgt    5  194.809 ±  1.341  ns/op
BigIntegerBench.mulAddReflect       2  avgt    5  198.242 ±  1.348  ns/op
BigIntegerBench.mulAddReflect       3  avgt    5  201.213 ±  0.670  ns/op
BigIntegerBench.mulAddReflect       5  avgt    5  213.426 ±  7.441  ns/op
BigIntegerBench.mulAddReflect      10  avgt    5  236.396 ±  1.663  ns/op
BigIntegerBench.mulAddReflect      50  avgt    5  432.255 ± 24.718  ns/op
BigIntegerBench.mulAddReflect     100  avgt    5  653.961 ± 10.140  ns/op

webrev.02 version with wider multiplication:

Benchmark                      (size)  Mode  Cnt    Score     Error  Units
BigIntegerBench.mulAddReflect       1  avgt    5  196.109 ±   0.663  ns/op
BigIntegerBench.mulAddReflect       2  avgt    5  213.438 ± 124.206  ns/op
BigIntegerBench.mulAddReflect       3  avgt    5  211.683 ±   3.206  ns/op
BigIntegerBench.mulAddReflect       5  avgt    5  217.324 ±   5.827  ns/op
BigIntegerBench.mulAddReflect      10  avgt    5  233.272 ±  21.560  ns/op
BigIntegerBench.mulAddReflect      50  avgt    5  455.337 ± 237.168  ns/op
BigIntegerBench.mulAddReflect     100  avgt    5  826.844 ±   4.972  ns/op

As you can see, it's up to 26% worse throughput with wider multiplication.

The reasons for this are:

1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
can't be changed within the function signature. Thus we can't fully
utilize the potential of 64-bit multiplication.
2. The umulh instruction is more expensive than the mul instruction.

I haven't implemented wider multiplication for the squareToLen intrinsic,
since it would require much more code due to more corner cases. Also, the
squaring algorithm in BigInteger doesn't handle more than 127 integers
in one squareToLen call (large integer arrays are divided into smaller
parts for squaring, so 1..127 integers are squared at once), which
makes all additional off-loop penalties expensive in comparison to loop
execution time.

At this point I ran out of ideas for how we could improve the performance
3x for these intrinsics. I understand one can do better with 64 bits for
the intrinsics you implemented, but squareToLen and mulAdd look different.

Do you have other suggestions, or can we proceed with the initial version
(webrev.01)?

Thanks,
Dmitrij

On 01.09.2017 10:51, Andrew Haley wrote:
> On 31/08/17 23:46, Dmitrij Pochepko wrote:
>> I tried a number of initial versions first. I also tried to use wider
>> multiplication via umulh (and larger load instructions like ldp/ldr),
>> but after measuring all versions I found that the version I initially
>> sent appeared to be the fastest (I was measuring it on ThunderX, which I
>> have in hand). It might be because of lots of additional ror(..., 32)
>> operations in other versions to convert values from the initial layout to
>> register and back. Another reason might be more complex overall logic
>> and larger code, which triggers more icache lines to be loaded. Or even
>> some umulh specifics on some CPUs. So, after measuring, I abandoned
>> these versions in the middle of development and polished the fastest one.
>> I have some raw development unpolished versions of such approaches
>> left (not sure I have debugged versions saved, but they at least give an
>> overall idea).
>> I attached squares_v2.3.1.diff: an early version which uses mul/umulh
>> for just one case. It was surprisingly slower for this case than the
>> version I sent for review, so I abandoned this approach.
>> I've also tried a version with large load instructions (ldp/ldr):
>> squares_v1.diff, and it was also slower (it has another, slower, mul_add
>> loop implementation, but I was comparing to the same version, which is
>> using ldrw only).
>>
>> I'm not sure if I should use 64-bit multiplications and/or 64/128-bit
>> loads. I can try to return to one of such versions and try to
>> polish it, but I'll probably get slower results again on the h/w I have,
>> and it's not clear if it'll be faster on any other h/w (which one? It
>> takes a lot of time to iteratively improve and measure every version on
>> the respective h/w).
>
> I'm using Applied Micro hardware for my testing at the moment.
>
> I did the speed testing for Montgomery multiply on ThunderX.  I
> appreciate that it's difficult to get the 64-bit version right and
> fast, but you should see about a 3 - 3.5x speedup over the pure Java
> version if you get it right.  That's what I saw when I did the
> Montgomery multiply.  You do have to pipeline the loads and the
> multiplies to avoid stalls.
>
> Be aware that squareToLen is not used at all when running the
> RSA benchmark with C2.
>

From aph at redhat.com  Wed Sep  6 09:53:05 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Sep 2017 10:53:05 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
Message-ID:

On 05/09/17 18:34, Dmitrij Pochepko wrote:
> As you can see, it's up to 26% worse throughput with wider multiplication.
>
> The reasons for this are:
> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
> can't be changed within the function signature. Thus we can't fully
> utilize the potential of 64-bit multiplication.
> 2. The umulh instruction is more expensive than the mul instruction.

Ah, my apologies.  I wasn't thinking about mulAdd, but about
squareToLen().  But did you look at the way x86 uses 64-bit
multiplications?

> I haven't implemented wider multiplication for the squareToLen intrinsic,
> since it would require much more code due to more corner cases. Also, the
> squaring algorithm in BigInteger doesn't handle more than 127 integers
> in one squareToLen call (large integer arrays are divided into smaller
> parts for squaring, so 1..127 integers are squared at once), which
> makes all additional off-loop penalties expensive in comparison to loop
> execution time.

Should we intrinsify squareToLen() at all?  It's only used AFAICS by
C1 and the interpreter when doing integer crypto.  One other thing I
haven't checked: is the multiplyToLen() intrinsic called when
squareToLen() is absent?

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com  Wed Sep  6 11:50:24 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij)
Date: Wed, 6 Sep 2017 14:50:24 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To:
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
Message-ID: <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>

On 06.09.2017 12:53, Andrew Haley wrote:
> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>
>> The reasons for this are:
>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
>> can't be changed within the function signature. Thus we can't fully
>> utilize the potential of 64-bit multiplication.
>> 2. The umulh instruction is more expensive than the mul instruction.
> Ah, my apologies.  I wasn't thinking about mulAdd, but about
> squareToLen().  But did you look at the way x86 uses 64-bit
> multiplications?
>
Yes. It uses a single x86 mulq instruction, which performs a 64x64
multiplication and places the 128-bit result in 2 registers. There is no
such single instruction on aarch64, and the most effective aarch64
instruction sequence I've found doesn't seem to be as fast as mulq.
A simpler 32x32-bit multiplication works faster according to my measurements.

>> I haven't implemented wider multiplication for the squareToLen intrinsic,
>> since it would require much more code due to more corner cases. Also, the
>> squaring algorithm in BigInteger doesn't handle more than 127 integers
>> in one squareToLen call (large integer arrays are divided into smaller
>> parts for squaring, so 1..127 integers are squared at once), which
>> makes all additional off-loop penalties expensive in comparison to loop
>> execution time.
> Should we intrinsify squareToLen() at all?

Yes, we should intrinsify it, because we can see a performance boost. Not
as significant as for x86, but still noticeable.

> It's only used AFAICS by
> C1 and the interpreter when doing integer crypto.

This intrinsic is known to C2
(http://hg.openjdk.java.net/jdk10/hs/hotspot/file/tip/src/share/vm/opto/library_call.cpp#l5507).
squareToLen is called in BigInteger multiplication in case it's multiplied
by itself
(http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l1565)
and in the pow(...) method:
http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2305

> One other thing I
> haven't checked: is the multiplyToLen() intrinsic called when
> squareToLen() is absent?
>
It could have been a good alternative, but it's not used instead of
squareToLen when squareToLen is not implemented. A Java implementation
of squareToLen will eventually be compiled and used instead:
http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039

Thanks,
Dmitrij

From aph at redhat.com  Wed Sep  6 12:43:23 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Sep 2017 13:43:23 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
Message-ID: <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>

On 06/09/17 12:50, Dmitrij wrote:
> On 06.09.2017 12:53, Andrew Haley wrote:
>> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>>
>>> The reasons for this are:
>>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
>>> can't be changed within the function signature. Thus we can't fully
>>> utilize the potential of 64-bit multiplication.
>>> 2. The umulh instruction is more expensive than the mul instruction.
>> Ah, my apologies.  I wasn't thinking about mulAdd, but about
>> squareToLen().  But did you look at the way x86 uses 64-bit
>> multiplications?
>>
> Yes. It uses a single x86 mulq instruction, which performs a 64x64
> multiplication and places the 128-bit result in 2 registers. There is no
> such single instruction on aarch64, and the most effective aarch64
> instruction sequence I've found doesn't seem to be as fast as mulq.

I think there is effectively a 64x64 -> 128-bit instruction: it's just
that you have to represent it as a mul and a umulh.  But I take your
point.

>> One other thing I
>> haven't checked: is the multiplyToLen() intrinsic called when
>> squareToLen() is absent?
>>
> It could have been a good alternative, but it's not used instead of
> squareToLen when squareToLen is not implemented. A Java implementation
> of squareToLen will eventually be compiled and used instead:
> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039

Please compare your squareToLen with the
MacroAssembler::multiply_to_len we already have.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
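A short illustration of the mul/umulh pairing Andrew describes, sketched in
C++ rather than assembler (illustrative only; GCC and Clang lower the
__int128 product below to one mul plus one umulh when targeting AArch64,
and the helper name is made up):

    #include <cstdint>
    #include <cstdio>

    // 64x64 -> 128-bit multiply. On AArch64 the compiler emits:
    //   mul   xLo, xA, xB   // low 64 bits of the product
    //   umulh xHi, xA, xB   // high 64 bits of the product
    static void mul_64x64_to_128(uint64_t a, uint64_t b,
                                 uint64_t* lo, uint64_t* hi) {
      unsigned __int128 p = (unsigned __int128)a * b;
      *lo = (uint64_t)p;
      *hi = (uint64_t)(p >> 64);
    }

    int main() {
      uint64_t lo, hi;
      mul_64x64_to_128(~0ULL, ~0ULL, &lo, &hi);
      std::printf("hi=%016llx lo=%016llx\n",
                  (unsigned long long)hi, (unsigned long long)lo);
    }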
From dmitrij.pochepko at bell-sw.com  Wed Sep  6 17:39:13 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij)
Date: Wed, 6 Sep 2017 20:39:13 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
Message-ID: <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>

On 06.09.2017 15:43, Andrew Haley wrote:
> On 06/09/17 12:50, Dmitrij wrote:
>> On 06.09.2017 12:53, Andrew Haley wrote:
>>> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>>>
>>>> The reasons for this are:
>>>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
>>>> can't be changed within the function signature. Thus we can't fully
>>>> utilize the potential of 64-bit multiplication.
>>>> 2. The umulh instruction is more expensive than the mul instruction.
>>> Ah, my apologies.  I wasn't thinking about mulAdd, but about
>>> squareToLen().  But did you look at the way x86 uses 64-bit
>>> multiplications?
>>>
>> Yes. It uses a single x86 mulq instruction, which performs a 64x64
>> multiplication and places the 128-bit result in 2 registers. There is no
>> such single instruction on aarch64, and the most effective aarch64
>> instruction sequence I've found doesn't seem to be as fast as mulq.
> I think there is effectively a 64x64 -> 128-bit instruction: it's just
> that you have to represent it as a mul and a umulh.  But I take your
> point.
>
>>> One other thing I
>>> haven't checked: is the multiplyToLen() intrinsic called when
>>> squareToLen() is absent?
>>>
>> It could have been a good alternative, but it's not used instead of
>> squareToLen when squareToLen is not implemented. A Java implementation
>> of squareToLen will eventually be compiled and used instead:
>> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039
> Please compare your squareToLen with the
> MacroAssembler::multiply_to_len we already have.
>
I've compared them by calling the square and multiply methods and got the
following results (ThunderX):

Benchmark                                 (size, ints)  Mode  Cnt      Score      Error  Units
BigIntegerBench.implMutliplyToLenReflect             1  avgt    5    186.930 ±   14.933  ns/op  (26% slower)
BigIntegerBench.implMutliplyToLenReflect             2  avgt    5    194.095 ±   11.857  ns/op  (12% slower)
BigIntegerBench.implMutliplyToLenReflect             3  avgt    5    233.912 ±    4.229  ns/op  (24% slower)
BigIntegerBench.implMutliplyToLenReflect             5  avgt    5    308.349 ±   20.383  ns/op  (22% slower)
BigIntegerBench.implMutliplyToLenReflect            10  avgt    5    475.839 ±    6.232  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            50  avgt    5   6514.691 ±   76.934  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            90  avgt    5  20347.040 ±  224.290  ns/op  (3% slower)
BigIntegerBench.implMutliplyToLenReflect           127  avgt    5  41929.302 ±  181.053  ns/op  (9% slower)
BigIntegerBench.implSquareToLenReflect               1  avgt    5    147.751 ±   12.760  ns/op
BigIntegerBench.implSquareToLenReflect               2  avgt    5    173.804 ±    4.850  ns/op
BigIntegerBench.implSquareToLenReflect               3  avgt    5    187.822 ±   34.027  ns/op
BigIntegerBench.implSquareToLenReflect               5  avgt    5    251.995 ±   19.711  ns/op
BigIntegerBench.implSquareToLenReflect              10  avgt    5    474.489 ±    1.040  ns/op
BigIntegerBench.implSquareToLenReflect              50  avgt    5   6493.768 ±   33.809  ns/op
BigIntegerBench.implSquareToLenReflect              90  avgt    5  19766.524 ±   88.398  ns/op
BigIntegerBench.implSquareToLenReflect             127  avgt    5  38448.202 ±  180.095  ns/op

As we can see, squareToLen is faster than multiplyToLen.

(I've updated the benchmark code at
http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)

Thanks,
Dmitrij

From ci_notify at linaro.org  Thu Sep  7 20:25:13 2017
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Thu, 7 Sep 2017 20:25:13 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress, SPECjbb2015 and Hadoop/Terasort results for OpenJDK 10 on AArch64
Message-ID: <211725077.1786.1504815915080.JavaMail.jenkins@81294fa8a221>

This is a summary of the JTREG test results
===========================================

The build and test results are cycled every 15 days.

For detailed information on the test output please refer to:

http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/summary/2017/249/summary.html

-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 1,400; fail: 11,561

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 7,432; fail: 714; error: 20

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 3,784

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 1,403; fail: 11,562; error: 1

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 7,469; fail: 675; error: 22

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 3,782; error: 2

Previous results can be found here:

http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/index.html


SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 1.05x
Relative performance: Server critical-jOPS (nc): 0.90x

Details of the test setup and historical results may be found here:

http://openjdk.linaro.org/jdk10/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the client and server compilers
on 2014-04-01.

Relative performance: Zero: 1.0, Client: 71.29, Server: 118.61

Client 71.29 / Client 2014-04-01 (43.00): 1.66x
Server 118.61 / Server 2014-04-01 (71.00): 1.67x

Details of the test setup and historical results may be found here:

http://openjdk.linaro.org/jdk10/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================

The build and test results are cycled every 15 days.

2017-09-07 pass rate: 11556/11559, results:
http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/2017/249/results/

For detailed information on the test output please refer to:

http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/

From stuart.monteith at linaro.org  Fri Sep  8 08:13:51 2017
From: stuart.monteith at linaro.org (Stuart Monteith)
Date: Fri, 8 Sep 2017 09:13:51 +0100
Subject: [aarch64-port-dev ] JTREG, JCStress, SPECjbb2015 and Hadoop/Terasort results for OpenJDK 10 on AArch64
In-Reply-To: <211725077.1786.1504815915080.JavaMail.jenkins@81294fa8a221>
References: <211725077.1786.1504815915080.JavaMail.jenkins@81294fa8a221>
Message-ID:

Hello,

This was a first attempt at a jdk10 build in the automation. The large
number of failures is because of jcstress being integrated into JTReg and
failing, so don't pay too much attention just yet.

BR,
Stuart

On 7 September 2017 at 21:25, wrote:

> This is a summary of the JTREG test results
> ===========================================
>
> The build and test results are cycled every 15 days.
>
> For detailed information on the test output please refer to:
>
> http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/summary/2017/249/summary.html
>
> -------------------------------------------------------------------------------
> client-release/hotspot
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 1,400; fail: 11,561
>
> 2 fatal errors were detected; please follow the link above for more detail.
>
> -------------------------------------------------------------------------------
> client-release/jdk
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 7,432; fail: 714; error: 20
>
> -------------------------------------------------------------------------------
> client-release/langtools
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 3,784
>
> -------------------------------------------------------------------------------
> server-release/hotspot
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 1,403; fail: 11,562; error: 1
>
> 2 fatal errors were detected; please follow the link above for more detail.
>
> -------------------------------------------------------------------------------
> server-release/jdk
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 7,469; fail: 675; error: 22
>
> -------------------------------------------------------------------------------
> server-release/langtools
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 3,782; error: 2
>
> Previous results can be found here:
>
> http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/index.html
>
>
> SPECjbb2015 composite regression test completed
> ===============================================
>
> This test measures the relative performance of the server
> compiler running the SPECjbb2015 composite tests and compares
> the performance against the baseline performance of the server
> compiler taken on 2016-11-21.
>
> In accordance with [1], the SPECjbb2015 tests are run on a system
> which is not production ready and does not meet all the
> requirements for publishing compliant results. The numbers below
> shall be treated as non-compliant (nc) and are for experimental
> purposes only.
>
> Relative performance: Server max-jOPS (nc): 1.05x
> Relative performance: Server critical-jOPS (nc): 0.90x
>
> Details of the test setup and historical results may be found here:
>
> http://openjdk.linaro.org/jdk10/SPECjbb2015-results/
>
> [1] http://www.spec.org/fairuse.html#Academic
>
> Regression test Hadoop-Terasort completed
> =========================================
>
> This test measures the performance of the server and client compilers
> running Hadoop sorting a 1GB file using Terasort and compares
> the performance against the baseline performance of the Zero interpreter
> and against the baseline performance of the client and server compilers
> on 2014-04-01.
>
> Relative performance: Zero: 1.0, Client: 71.29, Server: 118.61
>
> Client 71.29 / Client 2014-04-01 (43.00): 1.66x
> Server 118.61 / Server 2014-04-01 (71.00): 1.67x
>
> Details of the test setup and historical results may be found here:
>
> http://openjdk.linaro.org/jdk10/hadoop-terasort-benchmark-results/
>
> This is a summary of the jcstress test results
> ==============================================
>
> The build and test results are cycled every 15 days.
>
> 2017-09-07 pass rate: 11556/11559, results:
> http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/2017/249/results/
>
> For detailed information on the test output please refer to:
>
> http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/

From zhongwei.yao at linaro.org  Mon Sep 18 09:04:58 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Mon, 18 Sep 2017 17:04:58 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
Message-ID:

Hi, all,

Bug:
https://bugs.openjdk.java.net/browse/JDK-8187601

Webrev:
http://cr.openjdk.java.net/~zyao/8187601/webrev.00

In the current implementation, the loop unroll count is determined
by the vector size and element size when SuperWordLoopUnrollAnalysis is
true (it is true on both X86 and aarch64 for now).

This unrolling policy generates less optimized code when SLP
auto-vectorization fails (as the following example shows).

In this patch, I modify the current unrolling policy to do more
unrolling when SLP auto-vectorization fails. So the loop will be
unrolled until it reaches the unroll count limit.

Here is one example:

  public static void accessArrayConstants(int[] array) {
    for (int j = 0; j < 1024; j++) {
      array[0]++;
      array[1]++;
    }
  }

Before this patch, the loop is unrolled 4 times. 4 is determined by:
AArch64's vector size of 128 bits / array element size of 32 bits = 4.
On X86, the vector size is 256 bits, so the unroll count is 8.

Below is the code generated by C2 on AArch64:

... # omit unrelated code ...
0x0000ffff6caf3180: ldr   w10, [x1,#16]   ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 12 (line 6)
0x0000ffff6caf3184: add   w13, w10, #0x1
0x0000ffff6caf3188: str   w13, [x1,#16]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 15 (line 6)
0x0000ffff6caf318c: ldr   w12, [x1,#20]   ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 19 (line 7)
0x0000ffff6caf3190: add   w13, w10, #0x4
0x0000ffff6caf3194: add   w10, w12, #0x4
0x0000ffff6caf3198: str   w13, [x1,#16]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 15 (line 6)
0x0000ffff6caf319c: add   w11, w11, #0x4  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 23 (line 5)
0x0000ffff6caf31a0: str   w10, [x1,#20]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 22 (line 7)
0x0000ffff6caf31a4: cmp   w11, #0x3fd
0x0000ffff6caf31a8: b.lt  0x0000ffff6caf3180  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 6 (line 5)
... # omit unrelated code ...

After this patch is applied, the loop is unrolled 16 times:

... # omit unrelated code ...
0x0000ffffb0aa6100: ldr   w10, [x1,#16]   ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 12 (line 6)
0x0000ffffb0aa6104: add   w13, w10, #0x1
0x0000ffffb0aa6108: str   w13, [x1,#16]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 15 (line 6)
0x0000ffffb0aa610c: ldr   w12, [x1,#20]   ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 19 (line 7)
0x0000ffffb0aa6110: add   w13, w10, #0x10
0x0000ffffb0aa6114: add   w10, w12, #0x10
0x0000ffffb0aa6118: str   w13, [x1,#16]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 15 (line 6)
0x0000ffffb0aa611c: add   w11, w11, #0x10 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 23 (line 5)
0x0000ffffb0aa6120: str   w10, [x1,#20]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 22 (line 7)
0x0000ffffb0aa6124: cmp   w11, #0x3f1
0x0000ffffb0aa6128: b.lt  0x0000ffffb0aa6100  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 6 (line 5)
... # omit unrelated code ...

This patch passes jtreg tests both on AArch64 and X86.
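A quick cross-check of the unroll-factor arithmetic in the RFR above: the
SLP-driven limit is simply the vector width divided by the element width.
A minimal standalone sketch (hypothetical helper, not the HotSpot source):

    #include <cstdio>

    // Hypothetical helper mirroring the arithmetic described in the RFR:
    // max SLP unroll factor = vector width / element width.
    static int slp_max_unroll(int vector_bits, int element_bits) {
      return vector_bits / element_bits;
    }

    int main() {
      std::printf("AArch64 NEON, 32-bit int: %d\n", slp_max_unroll(128, 32)); // 4
      std::printf("x86 AVX2,     32-bit int: %d\n", slp_max_unroll(256, 32)); // 8
      return 0;
    }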
From zhongwei.yao at linaro.org Mon Sep 18 09:58:11 2017 From: zhongwei.yao at linaro.org (Zhongwei Yao) Date: Mon, 18 Sep 2017 17:58:11 +0800 Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed Message-ID: [Forward from aarch64-port-dev to hotspot-compiler-dev] Hi, all, Bug: https://bugs.openjdk.java.net/browse/JDK-8187601 Webrev: http://cr.openjdk.java.net/~zyao/8187601/webrev.00 In the current implementation, the loop unrolling times are determined by vector size and element size when SuperWordLoopUnrollAnalysis is true (both X86 and aarch64 are true for now). This unrolling policy generates less optimized code when SLP auto-vectorization fails (as following example shows). In this patch, I modify the current unrolling policy to do more unrolling when SLP auto-vectorization fails. So the loop will be unrolled until reaching the unroll times limitation. Here is one example: public static void accessArrayConstants(int[] array) { for (int j = 0; j < 1024; j++) { array[0]++; array[1]++; } } Before this patch, the loop will be unrolled by 4 times. 4 is determined by: AArch64's vector size 128 bits / array element size 32 bits = 4. On X86, vector size is 256 bits. So the unroll times are 8. Below is the generated code by C2 on AArch64: ==== generated code start ==== 0x0000ffff6caf3180: ldr w10, [x1,#16] ; 0x0000ffff6caf3184: add w13, w10, #0x1 0x0000ffff6caf3188: str w13, [x1,#16] ; 0x0000ffff6caf318c: ldr w12, [x1,#20] ; 0x0000ffff6caf3190: add w13, w10, #0x4 0x0000ffff6caf3194: add w10, w12, #0x4 0x0000ffff6caf3198: str w13, [x1,#16] ; 0x0000ffff6caf319c: add w11, w11, #0x4 ; 0x0000ffff6caf31a0: str w10, [x1,#20] ; 0x0000ffff6caf31a4: cmp w11, #0x3fd 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ; ==== generated code end ==== After applied this patch, it is unrolled 16 times: ==== generated code start ==== 0x0000ffffb0aa6100: ldr w10, [x1,#16] ; 0x0000ffffb0aa6104: add w13, w10, #0x1 0x0000ffffb0aa6108: str w13, [x1,#16] ; 0x0000ffffb0aa610c: ldr w12, [x1,#20] ; 0x0000ffffb0aa6110: add w13, w10, #0x10 0x0000ffffb0aa6114: add w10, w12, #0x10 0x0000ffffb0aa6118: str w13, [x1,#16] ; 0x0000ffffb0aa611c: add w11, w11, #0x10 ; 0x0000ffffb0aa6120: str w10, [x1,#20] ; 0x0000ffffb0aa6124: cmp w11, #0x3f1 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ; ==== generated code end ==== This patch passes jtreg tests both on AArch64 and X86. -- Best regards, Zhongwei From vladimir.kozlov at oracle.com Mon Sep 18 16:17:26 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 Sep 2017 09:17:26 -0700 Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed In-Reply-To: References: Message-ID: Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()? Why you need next additional check?: - } else if (cl->is_main_loop()) { + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { sw.transform_loop(lpt, true); Thanks, Vladimir On 9/18/17 2:58 AM, Zhongwei Yao wrote: > [Forward from aarch64-port-dev to hotspot-compiler-dev] > > Hi, all, > > Bug: > https://bugs.openjdk.java.net/browse/JDK-8187601 > > Webrev: > http://cr.openjdk.java.net/~zyao/8187601/webrev.00 > > In the current implementation, the loop unrolling times are determined > by vector size and element size when SuperWordLoopUnrollAnalysis is > true (both X86 and aarch64 are true for now). > > This unrolling policy generates less optimized code when SLP > auto-vectorization fails (as following example shows). 
>
> In this patch, I modify the current unrolling policy to do more
> unrolling when SLP auto-vectorization fails. So the loop will be
> unrolled until it reaches the unroll count limit.
>
> Here is one example:
> public static void accessArrayConstants(int[] array) {
>   for (int j = 0; j < 1024; j++) {
>     array[0]++;
>     array[1]++;
>   }
> }
>
> Before this patch, the loop will be unrolled 4 times. 4 is
> determined by: AArch64's vector size 128 bits / array element size 32
> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>
> Below is the generated code by C2 on AArch64:
>
> ==== generated code start ====
> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
> 0x0000ffff6caf3184: add  w13, w10, #0x1
> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
> 0x0000ffff6caf3190: add  w13, w10, #0x4
> 0x0000ffff6caf3194: add  w10, w12, #0x4
> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
> ==== generated code end ====
>
> After applying this patch, it is unrolled 16 times:
>
> ==== generated code start ====
> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
> 0x0000ffffb0aa6104: add  w13, w10, #0x1
> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
> 0x0000ffffb0aa6110: add  w13, w10, #0x10
> 0x0000ffffb0aa6114: add  w10, w12, #0x10
> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
> ==== generated code end ====
>
> This patch passes jtreg tests both on AArch64 and X86.

From zhongwei.yao at linaro.org Tue Sep 19 05:59:18 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Tue, 19 Sep 2017 13:59:18 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References:
Message-ID:

Hi, Vladimir,

On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
> Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()?

For two reasons, I have not chosen the existing passed_slp flag:

1. If we set_notpassed_slp() when _packset.length() == 0 in
SuperWord::output(), then in the IdealLoopTree::policy_unroll() check:

  if (cl->has_passed_slp()) {
    if (slp_max_unroll_factor >= future_unroll_ct) return true;
    // Normal case: loop too big
    return false;
  }

we will ignore the case "cl->has_passed_slp() &&
slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
as also exposed in my patch:

  if (cl->has_passed_slp()) {
    if (slp_max_unroll_factor >= future_unroll_ct) return true;
-   // Normal case: loop too big
-   return false;
+   // When SLP vectorization failed, we could do more unrolling
+   // optimizations if body size is less than limit size. Otherwise,
+   // return false due to loop is too big.
+   if (!cl->is_slp_vec_failed()) return false;
  }

However, I have not found a case to support this condition yet.

2. As replied below, in:
> -  } else if (cl->is_main_loop()) {
> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>       sw.transform_loop(lpt, true);
I need to check whether cl->is_slp_vec_failed() is true. Such
checking becomes explicit when using the SLPAutoVecFailed flag.
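To make the failure mode concrete, here is an illustrative Java contrast
(a sketch of mine, not code from the webrev) between a loop shape SLP can
vectorize and the shape from the example above that it cannot:

public class SlpShapes {
    // Independent, index-driven accesses: SLP can pack neighbouring
    // iterations into vector lanes, so the SLP-derived unroll factor
    // (vector width / element size) is all the loop needs.
    static void vectorizable(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b[i] + 1;
        }
    }

    // Fixed indices: every iteration rereads the values the previous
    // iteration wrote, which is a loop-carried dependence, so SLP gives
    // up and only plain unrolling can help.
    static void notVectorizable(int[] a) {
        for (int j = 0; j < 1024; j++) {
            a[0]++;
            a[1]++;
        }
    }
}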
> Why do you need the next additional check?:
>
> -  } else if (cl->is_main_loop()) {
> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>       sw.transform_loop(lpt, true);

The additional check prevents the case that when
cl->is_slp_vec_failed() is true, SuperWord::output() will
set_major_progress() at the beginning (because _packset.length() == 0
is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
loop iteration" will not stop until loop_opts_cnt reaches 0, which is
not what we want.

> Thanks,
> Vladimir
>
> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>
>> Hi, all,
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>
>> Webrev:
>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>
>> In the current implementation, the loop unrolling times are determined
>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>> true (both X86 and aarch64 are true for now).
>>
>> This unrolling policy generates less optimized code when SLP
>> auto-vectorization fails (as the following example shows).
>>
>> In this patch, I modify the current unrolling policy to do more
>> unrolling when SLP auto-vectorization fails. So the loop will be
>> unrolled until it reaches the unroll count limit.
>>
>> Here is one example:
>> public static void accessArrayConstants(int[] array) {
>>   for (int j = 0; j < 1024; j++) {
>>     array[0]++;
>>     array[1]++;
>>   }
>> }
>>
>> Before this patch, the loop will be unrolled 4 times. 4 is
>> determined by: AArch64's vector size 128 bits / array element size 32
>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>
>> Below is the generated code by C2 on AArch64:
>>
>> ==== generated code start ====
>> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
>> 0x0000ffff6caf3184: add  w13, w10, #0x1
>> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
>> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
>> 0x0000ffff6caf3190: add  w13, w10, #0x4
>> 0x0000ffff6caf3194: add  w10, w12, #0x4
>> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
>> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
>> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
>> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>> ==== generated code end ====
>>
>> After applying this patch, it is unrolled 16 times:
>>
>> ==== generated code start ====
>> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
>> 0x0000ffffb0aa6104: add  w13, w10, #0x1
>> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
>> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
>> 0x0000ffffb0aa6110: add  w13, w10, #0x10
>> 0x0000ffffb0aa6114: add  w10, w12, #0x10
>> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
>> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
>> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
>> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>> ==== generated code end ====
>>
>> This patch passes jtreg tests both on AArch64 and X86.

--
Best regards,
Zhongwei

From rwestrel at redhat.com Tue Sep 19 13:50:06 2017
From: rwestrel at redhat.com (Roland Westrelin)
Date: Tue, 19 Sep 2017 15:50:06 +0200
Subject: [aarch64-port-dev ] bug fix for aarch64-port/jdk8u-shenandoah when Shenandoah is disabled
Message-ID:

http://cr.openjdk.java.net/~roland/shenandoah/phi_has_only_data_users/webrev.00/

This is a fix for an issue that affects aarch64-port/jdk8u-shenandoah
even when shenandoah is disabled.
That bug causes a debug VM to crash with:

Internal Error at phaseX.cpp:985, pid=26493, tid=0x00007fd8c4199700
assert(false) failed: infinite loop in PhaseIterGVN::optimize

and a product build to consume a lot of memory.

Roland.

From adinn at redhat.com Tue Sep 19 15:01:31 2017
From: adinn at redhat.com (Andrew Dinn)
Date: Tue, 19 Sep 2017 16:01:31 +0100
Subject: [aarch64-port-dev ] bug fix for aarch64-port/jdk8u-shenandoah when Shenandoah is disabled
In-Reply-To:
References:
Message-ID: <1fdb96ff-ebc5-ea80-9fd4-ebc423aa2369@redhat.com>

On 19/09/17 14:50, Roland Westrelin wrote:
>
> http://cr.openjdk.java.net/~roland/shenandoah/phi_has_only_data_users/webrev.00/
>
> This is a fix for an issue that affects aarch64-port/jdk8u-shenandoah
> even when shenandoah is disabled. That bug causes a debug VM to crash
> with:
>
> Internal Error at phaseX.cpp:985, pid=26493, tid=0x00007fd8c4199700
> assert(false) failed: infinite loop in PhaseIterGVN::optimize
>
> and a product build to consume a lot of memory.

Looks good.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

From vladimir.kozlov at oracle.com Tue Sep 19 17:54:49 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Tue, 19 Sep 2017 10:54:49 -0700
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References:
Message-ID: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>

On 9/18/17 10:59 PM, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>> Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()?
>
> For two reasons, I have not chosen the existing passed_slp flag:

My point is that if we don't find vectors in a loop (as in your case) we
should ignore the whole SLP analysis.

In the best-case scenario SuperWord::unrolling_analysis() should determine
if there are vector candidates. For example, check if the array's index
depends on the loop's index variable.

Another way is to call SuperWord::unrolling_analysis() only after we did
vector analysis.

Those are more complicated changes and out of scope of this one. There is
also a side effect I missed before which may prevent using
set_notpassed_slp(): LoopMaxUnroll is changed based on SLP analysis before
the has_passed_slp() check.

Note, set_notpassed_slp() is also used to additionally unroll already
vectorized loops:
http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421

Maybe you should also call mark_do_unroll_only() when you set
set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
problem you pointed out. Can you look at this?

I am not against adding a new is_slp_vec_failed() but I want first to
investigate if we can re-use existing functions.

Thanks,
Vladimir

> 1. If we set_notpassed_slp() when _packset.length() == 0 in
> SuperWord::output(), then in the IdealLoopTree::policy_unroll() check:
>
>   if (cl->has_passed_slp()) {
>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>     // Normal case: loop too big
>     return false;
>   }
>
> we will ignore the case "cl->has_passed_slp() &&
> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
> as also exposed in my patch:
>
>   if (cl->has_passed_slp()) {
>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
> -   // Normal case: loop too big
> -   return false;
> +   // When SLP vectorization failed, we could do more unrolling
> +   // optimizations if body size is less than limit size. Otherwise,
> +   // return false due to loop is too big.
> +   if (!cl->is_slp_vec_failed()) return false;
>   }
>
> However, I have not found a case to support this condition yet.
>
> 2. As replied below, in:
>> -  } else if (cl->is_main_loop()) {
>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>       sw.transform_loop(lpt, true);
> I need to check whether cl->is_slp_vec_failed() is true. Such
> checking becomes explicit when using the SLPAutoVecFailed flag.
>
>> Why do you need the next additional check?:
>>
>> -  } else if (cl->is_main_loop()) {
>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>       sw.transform_loop(lpt, true);
>
> The additional check prevents the case that when
> cl->is_slp_vec_failed() is true, SuperWord::output() will
> set_major_progress() at the beginning (because _packset.length() == 0
> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
> not what we want.
>
>> Thanks,
>> Vladimir
>>
>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>
>>> Hi, all,
>>>
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>
>>> In the current implementation, the loop unrolling times are determined
>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>> true (both X86 and aarch64 are true for now).
>>>
>>> This unrolling policy generates less optimized code when SLP
>>> auto-vectorization fails (as the following example shows).
>>>
>>> In this patch, I modify the current unrolling policy to do more
>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>> unrolled until it reaches the unroll count limit.
>>>
>>> Here is one example:
>>> public static void accessArrayConstants(int[] array) {
>>>   for (int j = 0; j < 1024; j++) {
>>>     array[0]++;
>>>     array[1]++;
>>>   }
>>> }
>>>
>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>> determined by: AArch64's vector size 128 bits / array element size 32
>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>> Below is the generated code by C2 on AArch64:
>>>
>>> ==== generated code start ====
>>> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
>>> 0x0000ffff6caf3184: add  w13, w10, #0x1
>>> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
>>> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
>>> 0x0000ffff6caf3190: add  w13, w10, #0x4
>>> 0x0000ffff6caf3194: add  w10, w12, #0x4
>>> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
>>> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
>>> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
>>> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>> ==== generated code end ====
>>>
>>> After applying this patch, it is unrolled 16 times:
>>>
>>> ==== generated code start ====
>>> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
>>> 0x0000ffffb0aa6104: add  w13, w10, #0x1
>>> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
>>> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
>>> 0x0000ffffb0aa6110: add  w13, w10, #0x10
>>> 0x0000ffffb0aa6114: add  w10, w12, #0x10
>>> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
>>> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
>>> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
>>> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>> ==== generated code end ====
>>>
>>> This patch passes jtreg tests both on AArch64 and X86.

From zhongwei.yao at linaro.org Wed Sep 20 11:07:20 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Wed, 20 Sep 2017 19:07:20 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References:
Message-ID:

Thanks for your suggestions!

I've updated the patch so that it uses the pass_slp and do_unroll_only
flags without adding a new flag. Please take a look:

http://cr.openjdk.java.net/~zyao/8187601/webrev.01/

On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>
> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>> Hi, Vladimir,
>>
>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>> Why not use existing set_notpassed_slp() instead of
>>> mark_slp_vec_failed()?
>>
>> For two reasons, I have not chosen the existing passed_slp flag:
>
> My point is that if we don't find vectors in a loop (as in your case) we
> should ignore the whole SLP analysis.
>
> In the best-case scenario SuperWord::unrolling_analysis() should determine if
> there are vector candidates. For example, check if the array's index depends
> on the loop's index variable.
>
> Another way is to call SuperWord::unrolling_analysis() only after we did
> vector analysis.
>
> Those are more complicated changes and out of scope of this one. There is also
> a side effect I missed before which may prevent using set_notpassed_slp():
> LoopMaxUnroll is changed based on SLP analysis before the has_passed_slp()
> check.
>
> Note, set_notpassed_slp() is also used to additionally unroll already
> vectorized loops:
> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>
> Maybe you should also call mark_do_unroll_only() when you set
> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
> problem you pointed out. Can you look at this?
>
> I am not against adding a new is_slp_vec_failed() but I want first to
> investigate if we can re-use existing functions.
>
> Thanks,
> Vladimir
>
>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>> SuperWord::output(), then in the IdealLoopTree::policy_unroll() check:
>>
>>   if (cl->has_passed_slp()) {
>>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>     // Normal case: loop too big
>>     return false;
>>   }
>>
>> we will ignore the case "cl->has_passed_slp() &&
>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>> as also exposed in my patch:
>>
>>   if (cl->has_passed_slp()) {
>>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>> -   // Normal case: loop too big
>> -   return false;
>> +   // When SLP vectorization failed, we could do more unrolling
>> +   // optimizations if body size is less than limit size. Otherwise,
>> +   // return false due to loop is too big.
>> +   if (!cl->is_slp_vec_failed()) return false;
>>   }
>>
>> However, I have not found a case to support this condition yet.
>>
>> 2. As replied below, in:
>>> -  } else if (cl->is_main_loop()) {
>>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>       sw.transform_loop(lpt, true);
>> I need to check whether cl->is_slp_vec_failed() is true. Such
>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>
>>> Why do you need the next additional check?:
>>>
>>> -  } else if (cl->is_main_loop()) {
>>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>       sw.transform_loop(lpt, true);
>>
>> The additional check prevents the case that when
>> cl->is_slp_vec_failed() is true, SuperWord::output() will
>> set_major_progress() at the beginning (because _packset.length() == 0
>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>> not what we want.
>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>
>>>> Hi, all,
>>>>
>>>> Bug:
>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>
>>>> In the current implementation, the loop unrolling times are determined
>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>>> true (both X86 and aarch64 are true for now).
>>>>
>>>> This unrolling policy generates less optimized code when SLP
>>>> auto-vectorization fails (as the following example shows).
>>>>
>>>> In this patch, I modify the current unrolling policy to do more
>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>> unrolled until it reaches the unroll count limit.
>>>>
>>>> Here is one example:
>>>> public static void accessArrayConstants(int[] array) {
>>>>   for (int j = 0; j < 1024; j++) {
>>>>     array[0]++;
>>>>     array[1]++;
>>>>   }
>>>> }
>>>>
>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>>> Below is the generated code by C2 on AArch64:
>>>>
>>>> ==== generated code start ====
>>>> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
>>>> 0x0000ffff6caf3184: add  w13, w10, #0x1
>>>> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
>>>> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
>>>> 0x0000ffff6caf3190: add  w13, w10, #0x4
>>>> 0x0000ffff6caf3194: add  w10, w12, #0x4
>>>> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
>>>> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
>>>> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
>>>> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>> ==== generated code end ====
>>>>
>>>> After applying this patch, it is unrolled 16 times:
>>>>
>>>> ==== generated code start ====
>>>> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
>>>> 0x0000ffffb0aa6104: add  w13, w10, #0x1
>>>> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
>>>> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
>>>> 0x0000ffffb0aa6110: add  w13, w10, #0x10
>>>> 0x0000ffffb0aa6114: add  w10, w12, #0x10
>>>> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
>>>> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
>>>> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
>>>> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>> ==== generated code end ====
>>>>
>>>> This patch passes jtreg tests both on AArch64 and X86.

--
Best regards,
Zhongwei

From rwestrel at redhat.com Wed Sep 20 12:22:49 2017
From: rwestrel at redhat.com (rwestrel at redhat.com)
Date: Wed, 20 Sep 2017 12:22:49 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/hotspot: [backport] PhiNode::has_only_data_users() needs to apply to shenandoah barrier only
Message-ID: <201709201222.v8KCMnLM011364@aojmv0008.oracle.com>

Changeset: 48b74a7788cd
Author:    roland
Date:      2017-09-19 13:41 +0200
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/48b74a7788cd

[backport] PhiNode::has_only_data_users() needs to apply to shenandoah barrier only

! src/share/vm/opto/cfgnode.cpp

From rwestrel at redhat.com Wed Sep 20 12:20:33 2017
From: rwestrel at redhat.com (Roland Westrelin)
Date: Wed, 20 Sep 2017 14:20:33 +0200
Subject: [aarch64-port-dev ] bug fix for aarch64-port/jdk8u-shenandoah when Shenandoah is disabled
In-Reply-To: <1fdb96ff-ebc5-ea80-9fd4-ebc423aa2369@redhat.com>
References: <1fdb96ff-ebc5-ea80-9fd4-ebc423aa2369@redhat.com>
Message-ID:

Thanks. I pushed it.

Roland.

From dmitrij.pochepko at bell-sw.com Wed Sep 20 14:13:11 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Wed, 20 Sep 2017 17:13:11 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
Message-ID: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>

Hi,

Andrew, do you believe this is ok to push?

Thanks,

Dmitrij

....

On 06.09.2017 20:39, Dmitrij wrote:
>
> I've compared it by calling the square and multiply methods and got the
> following results (ThunderX):
>
> Benchmark                                  (size, ints)  Mode  Cnt      Score      Error  Units
> BigIntegerBench.implMutliplyToLenReflect              1  avgt    5    186.930 ±   14.933  ns/op  (26% slower)
> BigIntegerBench.implMutliplyToLenReflect              2  avgt    5    194.095 ±   11.857  ns/op  (12% slower)
> BigIntegerBench.implMutliplyToLenReflect              3  avgt    5    233.912 ±    4.229  ns/op  (24% slower)
> BigIntegerBench.implMutliplyToLenReflect              5  avgt    5    308.349 ±   20.383  ns/op  (22% slower)
> BigIntegerBench.implMutliplyToLenReflect             10  avgt    5    475.839 ±    6.232  ns/op  (same)
> BigIntegerBench.implMutliplyToLenReflect             50  avgt    5   6514.691 ±   76.934  ns/op  (same)
> BigIntegerBench.implMutliplyToLenReflect             90  avgt    5  20347.040 ±  224.290  ns/op  (3% slower)
> BigIntegerBench.implMutliplyToLenReflect            127  avgt    5  41929.302 ±  181.053  ns/op  (9% slower)
>
> BigIntegerBench.implSquareToLenReflect                1  avgt    5    147.751 ±   12.760  ns/op
> BigIntegerBench.implSquareToLenReflect                2  avgt    5    173.804 ±    4.850  ns/op
> BigIntegerBench.implSquareToLenReflect                3  avgt    5    187.822 ±   34.027  ns/op
> BigIntegerBench.implSquareToLenReflect                5  avgt    5    251.995 ±   19.711  ns/op
> BigIntegerBench.implSquareToLenReflect               10  avgt    5    474.489 ±    1.040  ns/op
> BigIntegerBench.implSquareToLenReflect               50  avgt    5   6493.768 ±   33.809  ns/op
> BigIntegerBench.implSquareToLenReflect               90  avgt    5  19766.524 ±   88.398  ns/op
> BigIntegerBench.implSquareToLenReflect              127  avgt    5  38448.202 ±  180.095  ns/op
>
> As we can see, squareToLen is faster than multiplyToLen.
>
> (I've updated the benchmark code at
> http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)
>
> Thanks,
> Dmitrij

From aph at redhat.com Wed Sep 20 14:40:31 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 20 Sep 2017 15:40:31 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID:

On 20/09/17 15:13, Dmitrij Pochepko wrote:
> Andrew, do you believe this is ok to push?

I'm testing it on some other hardware.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From vladimir.kozlov at oracle.com Wed Sep 20 15:34:10 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Wed, 20 Sep 2017 08:34:10 -0700
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID:

Dmitrij,

You need an Oracle sponsor for the push since you touched shared code
(register.hpp).

Thanks,
Vladimir

On 9/20/17 7:13 AM, Dmitrij Pochepko wrote:
> Hi,
>
> Andrew, do you believe this is ok to push?
>
> Thanks,
>
> Dmitrij
>
> ....
> On 06.09.2017 20:39, Dmitrij wrote:
>> I've compared it by calling the square and multiply methods and got the
>> following results (ThunderX):
>>
>> Benchmark                                  (size, ints)  Mode  Cnt      Score      Error  Units
>> BigIntegerBench.implMutliplyToLenReflect              1  avgt    5    186.930 ±   14.933  ns/op  (26% slower)
>> BigIntegerBench.implMutliplyToLenReflect              2  avgt    5    194.095 ±   11.857  ns/op  (12% slower)
>> BigIntegerBench.implMutliplyToLenReflect              3  avgt    5    233.912 ±    4.229  ns/op  (24% slower)
>> BigIntegerBench.implMutliplyToLenReflect              5  avgt    5    308.349 ±   20.383  ns/op  (22% slower)
>> BigIntegerBench.implMutliplyToLenReflect             10  avgt    5    475.839 ±    6.232  ns/op  (same)
>> BigIntegerBench.implMutliplyToLenReflect             50  avgt    5   6514.691 ±   76.934  ns/op  (same)
>> BigIntegerBench.implMutliplyToLenReflect             90  avgt    5  20347.040 ±  224.290  ns/op  (3% slower)
>> BigIntegerBench.implMutliplyToLenReflect            127  avgt    5  41929.302 ±  181.053  ns/op  (9% slower)
>>
>> BigIntegerBench.implSquareToLenReflect                1  avgt    5    147.751 ±   12.760  ns/op
>> BigIntegerBench.implSquareToLenReflect                2  avgt    5    173.804 ±    4.850  ns/op
>> BigIntegerBench.implSquareToLenReflect                3  avgt    5    187.822 ±   34.027  ns/op
>> BigIntegerBench.implSquareToLenReflect                5  avgt    5    251.995 ±   19.711  ns/op
>> BigIntegerBench.implSquareToLenReflect               10  avgt    5    474.489 ±    1.040  ns/op
>> BigIntegerBench.implSquareToLenReflect               50  avgt    5   6493.768 ±   33.809  ns/op
>> BigIntegerBench.implSquareToLenReflect               90  avgt    5  19766.524 ±   88.398  ns/op
>> BigIntegerBench.implSquareToLenReflect              127  avgt    5  38448.202 ±  180.095  ns/op
>>
>> As we can see, squareToLen is faster than multiplyToLen.
>>
>> (I've updated the benchmark code at
>> http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)
>>
>> Thanks,
>> Dmitrij

From vladimir.kozlov at oracle.com Wed Sep 20 16:18:00 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Wed, 20 Sep 2017 09:18:00 -0700
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References:
Message-ID:

Nice.

Did you verify that it fixed your case?

Would be nice to run specjvm2008 to make sure performance did not regress.

Thanks,
Vladimir

On 9/20/17 4:07 AM, Zhongwei Yao wrote:
> Thanks for your suggestions!
>
> I've updated the patch so that it uses the pass_slp and do_unroll_only
> flags without adding a new flag. Please take a look:
>
> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>
> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>
>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>> Hi, Vladimir,
>>>
>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>> Why not use existing set_notpassed_slp() instead of
>>>> mark_slp_vec_failed()?
>>>
>>> For two reasons, I have not chosen the existing passed_slp flag:
>>
>> My point is that if we don't find vectors in a loop (as in your case) we
>> should ignore the whole SLP analysis.
>>
>> In the best-case scenario SuperWord::unrolling_analysis() should determine if
>> there are vector candidates. For example, check if the array's index depends
>> on the loop's index variable.
>>
>> Another way is to call SuperWord::unrolling_analysis() only after we did
>> vector analysis.
>> Those are more complicated changes and out of scope of this one. There is also
>> a side effect I missed before which may prevent using set_notpassed_slp():
>> LoopMaxUnroll is changed based on SLP analysis before the has_passed_slp()
>> check.
>>
>> Note, set_notpassed_slp() is also used to additionally unroll already
>> vectorized loops:
>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>
>> Maybe you should also call mark_do_unroll_only() when you set
>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>> problem you pointed out. Can you look at this?
>>
>> I am not against adding a new is_slp_vec_failed() but I want first to
>> investigate if we can re-use existing functions.
>>
>> Thanks,
>> Vladimir
>>
>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll() check:
>>>
>>>   if (cl->has_passed_slp()) {
>>>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>     // Normal case: loop too big
>>>     return false;
>>>   }
>>>
>>> we will ignore the case "cl->has_passed_slp() &&
>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>> as also exposed in my patch:
>>>
>>>   if (cl->has_passed_slp()) {
>>>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>> -   // Normal case: loop too big
>>> -   return false;
>>> +   // When SLP vectorization failed, we could do more unrolling
>>> +   // optimizations if body size is less than limit size. Otherwise,
>>> +   // return false due to loop is too big.
>>> +   if (!cl->is_slp_vec_failed()) return false;
>>>   }
>>>
>>> However, I have not found a case to support this condition yet.
>>>
>>> 2. As replied below, in:
>>>> -  } else if (cl->is_main_loop()) {
>>>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>       sw.transform_loop(lpt, true);
>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>
>>>> Why do you need the next additional check?:
>>>>
>>>> -  } else if (cl->is_main_loop()) {
>>>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>       sw.transform_loop(lpt, true);
>>>
>>> The additional check prevents the case that when
>>> cl->is_slp_vec_failed() is true, SuperWord::output() will
>>> set_major_progress() at the beginning (because _packset.length() == 0
>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>> not what we want.
>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>
>>>>> Hi, all,
>>>>>
>>>>> Bug:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>
>>>>> Webrev:
>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>
>>>>> In the current implementation, the loop unrolling times are determined
>>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>> true (both X86 and aarch64 are true for now).
>>>>>
>>>>> This unrolling policy generates less optimized code when SLP
>>>>> auto-vectorization fails (as the following example shows).
>>>>>
>>>>> In this patch, I modify the current unrolling policy to do more
>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>> unrolled until it reaches the unroll count limit.
>>>>> Here is one example:
>>>>> public static void accessArrayConstants(int[] array) {
>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>     array[0]++;
>>>>>     array[1]++;
>>>>>   }
>>>>> }
>>>>>
>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>>>>
>>>>> Below is the generated code by C2 on AArch64:
>>>>>
>>>>> ==== generated code start ====
>>>>> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
>>>>> 0x0000ffff6caf3184: add  w13, w10, #0x1
>>>>> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
>>>>> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
>>>>> 0x0000ffff6caf3190: add  w13, w10, #0x4
>>>>> 0x0000ffff6caf3194: add  w10, w12, #0x4
>>>>> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
>>>>> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
>>>>> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
>>>>> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>> ==== generated code end ====
>>>>>
>>>>> After applying this patch, it is unrolled 16 times:
>>>>>
>>>>> ==== generated code start ====
>>>>> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
>>>>> 0x0000ffffb0aa6104: add  w13, w10, #0x1
>>>>> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
>>>>> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
>>>>> 0x0000ffffb0aa6110: add  w13, w10, #0x10
>>>>> 0x0000ffffb0aa6114: add  w10, w12, #0x10
>>>>> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
>>>>> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
>>>>> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
>>>>> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>> ==== generated code end ====
>>>>>
>>>>> This patch passes jtreg tests both on AArch64 and X86.

From aph at redhat.com Thu Sep 21 13:04:07 2017
From: aph at redhat.com (Andrew Haley)
Date: Thu, 21 Sep 2017 14:04:07 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID: <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>

I reworked your benchmark to run faster and have less overhead, at
http://cr.openjdk.java.net/~aph/8186915/

Run it as

java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen

The test here was run on (rather old) Applied Micro hardware. The
real issue is, I think, that almost all of the time of squareToLen
without an intrinsic is dominated by mulAdd, and that already has an
intrinsic. Asymptotically, an intrinsic squareToLen should take half
the time of multiplyToLen, but we don't see that. Indeed, we barely
see any advantage for UseSquareToLenIntrinsic.

For a larger size, we see this with intrinsics enabled:

BigIntegerBench.implMutliplyToLen   200  avgt    5    50833.555 ±  10.674  ns/op
BigIntegerBench.implSquareToLen     200  avgt    5    57607.460 ±  87.155  ns/op

BigIntegerBench.implMutliplyToLen  1000  avgt    5  1254728.119 ± 527.126  ns/op
BigIntegerBench.implSquareToLen    1000  avgt    5  1369841.961 ± 169.843  ns/op

which makes the problem clear, I believe.
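As a back-of-the-envelope check on that asymptotic expectation, counting
the 32-bit limb products the schoolbook routines must perform shows where
the factor of two should come from. This is an illustrative sketch of
mine, not part of the benchmark above:

public class ProductCount {
    // Schoolbook n-limb multiply computes n * n limb products.
    static long multiplyProducts(int n) { return (long) n * n; }

    // Schoolbook squaring needs only the n*(n+1)/2 distinct products,
    // because x[i]*x[j] == x[j]*x[i] and the cross terms can be doubled.
    static long squareProducts(int n) { return (long) n * (n + 1) / 2; }

    public static void main(String[] args) {
        for (int n : new int[] {10, 50, 90, 127, 1000}) {
            System.out.printf("n=%-5d multiply=%-8d square=%-7d ratio=%.2f%n",
                    n, multiplyProducts(n), squareProducts(n),
                    (double) multiplyProducts(n) / squareProducts(n));
        }
    }
}

The ratio tends to 2 as n grows, which is the gap the numbers below fail
to show.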
No intrinsics:

Benchmark                          (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5     24.176 ±   0.006  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5     41.266 ±   0.008  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5     65.027 ±   0.019  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5    466.440 ±   0.080  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5  10613.512 ±   5.153  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5  34070.328 ±  10.991  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5  67546.985 ±  16.581  ns/op

-XX:+UseMultiplyToLenIntrinsic:

Benchmark                          (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5     25.661 ±   0.062  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5     29.183 ±   0.037  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5     51.690 ±   0.024  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5    193.401 ±   0.032  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5   3419.226 ±   0.312  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5  10638.801 ±   0.970  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5  21274.149 ±   7.188  ns/op

No intrinsics:

Benchmark                        (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     38.933 ±   1.437  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     62.523 ±   0.007  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     82.114 ±   0.012  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    366.986 ±  10.148  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   5534.064 ±  88.895  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  16308.025 ±  29.203  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  31521.335 ±  49.421  ns/op

-XX:+UseMulAddIntrinsic:

Benchmark                        (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     46.268 ±   0.005  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     67.527 ±   0.017  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     97.975 ±   0.179  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    345.126 ±   0.037  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   4327.120 ±   9.942  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  13143.308 ±   1.217  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  25014.420 ±  16.221  ns/op

-XX:+UseSquareToLenIntrinsic:

Benchmark                        (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     27.095 ±   0.012  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     49.185 ±   0.007  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     53.771 ±   0.013  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    238.843 ±   0.080  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   3828.313 ±   1.684  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  11949.819 ±   9.925  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  23613.427 ±  28.164  ns/op

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From ci_notify at linaro.org Thu Sep 21 15:46:56 2017
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Thu, 21 Sep 2017 15:46:56 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress, SPECjbb2015 and Hadoop/Terasort results for OpenJDK 10 on AArch64
Message-ID: <2028460771.4685.1506008817168.JavaMail.jenkins@81294fa8a221>

This is a summary of the JTREG test results
===========================================

The build and test results are cycled every 15 days.
For detailed information on the test output please refer to:

http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/summary/2017/263/summary.html

-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 1,400; fail: 11,561
Build 1: aarch64/2017/sep/20 pass: 1,369; fail: 35; error: 1

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 7,432; fail: 714; error: 20
Build 1: aarch64/2017/sep/20 pass: 7,469; fail: 689; error: 22

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 3,784
Build 1: aarch64/2017/sep/20 pass: 3,783; fail: 1

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 1,403; fail: 11,562; error: 1
Build 1: aarch64/2017/sep/20 pass: 1,373; fail: 37

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 7,469; fail: 675; error: 22
Build 1: aarch64/2017/sep/20 pass: 7,452; fail: 705; error: 23

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 3,782; error: 2
Build 1: aarch64/2017/sep/20 pass: 3,783; fail: 1

Previous results can be found here:

http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/index.html

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server compiler
running the SPECjbb2015 composite tests and compares the performance
against the baseline performance of the server compiler taken on
2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the requirements
for publishing compliant results. The numbers below shall be treated
as non-compliant (nc) and are for experimental purposes only.

Relative performance: Server max-jOPS (nc): 1.05x
Relative performance: Server critical-jOPS (nc): 0.90x

Details of the test setup and historical results may be found here:

http://openjdk.linaro.org/jdk10/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares the
performance against the baseline performance of the Zero interpreter
and against the baseline performance of the client and server
compilers on 2014-04-01.
Relative performance: Zero: 1.0, Client: 70.58, Server: 115.7

Client 70.58 / Client 2014-04-01 (43.00): 1.64x
Server 115.7 / Server 2014-04-01 (71.00): 1.63x

Details of the test setup and historical results may be found here:

http://openjdk.linaro.org/jdk10/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================

The build and test results are cycled every 15 days.

2017-09-07 pass rate: 11556/11559, results:
http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/2017/249/results/
2017-09-21 pass rate: 11556/11559, results:
http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/2017/263/results/

For detailed information on the test output please refer to:

http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/

From dmitrij.pochepko at bell-sw.com Thu Sep 21 18:19:33 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Thu, 21 Sep 2017 21:19:33 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
Message-ID: <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>

Hi,

thank you for looking into this and trying it on APM (I have no access
to this h/w).

I've used the modified benchmark you've sent and run it on ThunderX, and
implSquareToLen still shows better results than implMultiplyToLen in
most cases on ThunderX (up to 10% at size=127; results:
http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt).
However, since the performance difference on APM is larger than on
ThunderX, I think it'll be more logical to return to your idea and
call the multiplyToLen intrinsic inside squareToLen. An alternative
solution is to generate different code for APM and ThunderX, but I
prefer to have a single version given such a relatively small
difference in performance, and it's still much faster than without
the intrinsic at all. What do you think?

fyi: regarding sizes 200 and 1000: it's incorrect to measure these
sizes for squareToLen, because squareToLen is never called for sizes
larger than 127 (I've mentioned it before). An upper-level squaring
algorithm divides larger arrays into smaller parts (fewer than 128
integers) and then squares them separately. In order to compare
squaring vs multiplication at larger sizes, we should compare the
BigInteger::multiply and BigInteger::square methods with the full
logic behind them, because that is what gets called in a real
situation instead of a direct intrinsified method call. I've uploaded
a benchmark with a multiply method measurement here:
http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench2.java
just in case.
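For that kind of comparison through the public API, a rough sketch along
the following lines would work. This is my illustration, not necessarily
what BigIntegerBench2.java contains; note that multiply(this) is only
expected to dispatch to the internal squaring path above a size
threshold, and a JMH harness would give far more trustworthy numbers
than this naive timing loop:

import java.math.BigInteger;
import java.util.Random;

public class SquareVsMultiply {
    public static void main(String[] args) {
        Random r = new Random(42);
        int sink = 0; // accumulated so the JIT cannot discard the results
        for (int ints : new int[] {200, 1000}) {
            BigInteger x = new BigInteger(32 * ints, r);
            BigInteger y = new BigInteger(32 * ints, r);
            long t0 = System.nanoTime();
            for (int i = 0; i < 1000; i++) sink += x.multiply(y).bitLength();
            long t1 = System.nanoTime();
            for (int i = 0; i < 1000; i++) sink += x.multiply(x).bitLength();
            long t2 = System.nanoTime();
            System.out.printf("%d ints: multiply %d ns/op, square %d ns/op%n",
                    ints, (t1 - t0) / 1000, (t2 - t1) / 1000);
        }
        System.out.println(sink);
    }
}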
Thanks,
Dmitrij

On 21.09.2017 16:04, Andrew Haley wrote:
> I reworked your benchmark to run faster and have less overhead, at
> http://cr.openjdk.java.net/~aph/8186915/
>
> Run it as
>
> java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen
>
> The test here was run on (rather old) Applied Micro hardware. The
> real issue is, I think, that almost all of the time of squareToLen
> without an intrinsic is dominated by mulAdd, and that already has an
> intrinsic. Asymptotically, an intrinsic squareToLen should take half
> the time of multiplyToLen, but we don't see that. Indeed, we barely
> see any advantage for UseSquareToLenIntrinsic.
>
> For a larger size, we see this with intrinsics enabled:
>
> BigIntegerBench.implMutliplyToLen   200  avgt    5    50833.555 ±  10.674  ns/op
> BigIntegerBench.implSquareToLen     200  avgt    5    57607.460 ±  87.155  ns/op
>
> BigIntegerBench.implMutliplyToLen  1000  avgt    5  1254728.119 ± 527.126  ns/op
> BigIntegerBench.implSquareToLen    1000  avgt    5  1369841.961 ± 169.843  ns/op
>
> which makes the problem clear, I believe.
>
> No intrinsics:
>
> Benchmark                          (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implMutliplyToLen       1  avgt    5     24.176 ±   0.006  ns/op
> BigIntegerBench.implMutliplyToLen       2  avgt    5     41.266 ±   0.008  ns/op
> BigIntegerBench.implMutliplyToLen       3  avgt    5     65.027 ±   0.019  ns/op
> BigIntegerBench.implMutliplyToLen      10  avgt    5    466.440 ±   0.080  ns/op
> BigIntegerBench.implMutliplyToLen      50  avgt    5  10613.512 ±   5.153  ns/op
> BigIntegerBench.implMutliplyToLen      90  avgt    5  34070.328 ±  10.991  ns/op
> BigIntegerBench.implMutliplyToLen     127  avgt    5  67546.985 ±  16.581  ns/op
>
> -XX:+UseMultiplyToLenIntrinsic:
>
> Benchmark                          (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implMutliplyToLen       1  avgt    5     25.661 ±   0.062  ns/op
> BigIntegerBench.implMutliplyToLen       2  avgt    5     29.183 ±   0.037  ns/op
> BigIntegerBench.implMutliplyToLen       3  avgt    5     51.690 ±   0.024  ns/op
> BigIntegerBench.implMutliplyToLen      10  avgt    5    193.401 ±   0.032  ns/op
> BigIntegerBench.implMutliplyToLen      50  avgt    5   3419.226 ±   0.312  ns/op
> BigIntegerBench.implMutliplyToLen      90  avgt    5  10638.801 ±   0.970  ns/op
> BigIntegerBench.implMutliplyToLen     127  avgt    5  21274.149 ±   7.188  ns/op
>
> No intrinsics:
>
> Benchmark                        (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implSquareToLen       1  avgt    5     38.933 ±   1.437  ns/op
> BigIntegerBench.implSquareToLen       2  avgt    5     62.523 ±   0.007  ns/op
> BigIntegerBench.implSquareToLen       3  avgt    5     82.114 ±   0.012  ns/op
> BigIntegerBench.implSquareToLen      10  avgt    5    366.986 ±  10.148  ns/op
> BigIntegerBench.implSquareToLen      50  avgt    5   5534.064 ±  88.895  ns/op
> BigIntegerBench.implSquareToLen      90  avgt    5  16308.025 ±  29.203  ns/op
> BigIntegerBench.implSquareToLen     127  avgt    5  31521.335 ±  49.421  ns/op
>
> -XX:+UseMulAddIntrinsic:
>
> Benchmark                        (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implSquareToLen       1  avgt    5     46.268 ±   0.005  ns/op
> BigIntegerBench.implSquareToLen       2  avgt    5     67.527 ±   0.017  ns/op
> BigIntegerBench.implSquareToLen       3  avgt    5     97.975 ±   0.179  ns/op
> BigIntegerBench.implSquareToLen      10  avgt    5    345.126 ±   0.037  ns/op
> BigIntegerBench.implSquareToLen      50  avgt    5   4327.120 ±   9.942  ns/op
> BigIntegerBench.implSquareToLen      90  avgt    5  13143.308 ±   1.217  ns/op
> BigIntegerBench.implSquareToLen     127  avgt    5  25014.420 ±  16.221  ns/op
>
> -XX:+UseSquareToLenIntrinsic:
>
> Benchmark                        (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implSquareToLen       1  avgt    5     27.095 ±   0.012  ns/op
> BigIntegerBench.implSquareToLen       2  avgt    5     49.185 ±   0.007  ns/op
> BigIntegerBench.implSquareToLen       3  avgt    5     53.771 ±   0.013  ns/op
> BigIntegerBench.implSquareToLen      10  avgt    5    238.843 ±   0.080  ns/op
> BigIntegerBench.implSquareToLen      50  avgt    5   3828.313 ±   1.684  ns/op
> BigIntegerBench.implSquareToLen      90  avgt    5  11949.819 ±   9.925  ns/op
> BigIntegerBench.implSquareToLen     127  avgt    5  23613.427 ±  28.164  ns/op

From aph at redhat.com Fri Sep 22 08:12:23 2017
From: aph at redhat.com (Andrew Haley)
Date: Fri, 22 Sep 2017 09:12:23 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
	<85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
Message-ID: <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>

On 21/09/17 19:19, Dmitrij Pochepko wrote:
> thank you for looking into this and trying it on APM (I have no access
> to this h/w).
>
> I've used the modified benchmark you've sent and run it on ThunderX, and
> implSquareToLen still shows better results than implMultiplyToLen in
> most cases on ThunderX (up to 10% at size=127; results:
> http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt).

For 10%, it's not worth doing, given the risks and that it's not used
by crypto operations when C2-compiled.

> However, since the performance difference on APM is larger than on
> ThunderX, I think it'll be more logical to return to your idea and
> call the multiplyToLen intrinsic inside squareToLen. An alternative
> solution is to generate different code for APM and ThunderX, but I
> prefer to have a single version given such a relatively small
> difference in performance, and it's still much faster than without
> the intrinsic at all. What do you think?

Yes. Calling multiplyToLen would be fine.

> fyi: regarding sizes 200 and 1000: it's incorrect to measure these
> sizes for squareToLen, because squareToLen is never called for sizes
> larger than 127 (I've mentioned it before).

It's not incorrect: it's a test for asymptotic behaviour.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Mon Sep 25 15:46:43 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 25 Sep 2017 18:46:43 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
	<85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
	<3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
Message-ID:

Hi,

please take a look at v2. I've modified the code to use multiplyToLen in
squareToLen. An additional benefit: no more code in the common part. I've
left mulAdd unchanged.

http://cr.openjdk.java.net/~dpochepk/8186915/webrev.02/

I've also rerun the benchmark on ThunderX and got these results:
http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt

Thanks,
Dmitrij

On 22.09.2017 11:12, Andrew Haley wrote:
> On 21/09/17 19:19, Dmitrij Pochepko wrote:
>> thank you for looking into this and trying it on APM (I have no access
>> to this h/w).
>> I've used the modified benchmark you've sent and run it on ThunderX, and
>> implSquareToLen still shows better results than implMultiplyToLen in
>> most cases on ThunderX (up to 10% at size=127; results:
>> http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt).
> For 10%, it's not worth doing, given the risks and that it's not used
> by crypto operations when C2-compiled.
>
>> However, since the performance difference on APM is larger than on
>> ThunderX, I think it'll be more logical to return to your idea and
>> call the multiplyToLen intrinsic inside squareToLen. An alternative
>> solution is to generate different code for APM and ThunderX, but I
>> prefer to have a single version given such a relatively small
>> difference in performance, and it's still much faster than without
>> the intrinsic at all. What do you think?
> Yes. Calling multiplyToLen would be fine.
>
>> fyi: regarding sizes 200 and 1000: it's incorrect to measure these
>> sizes for squareToLen, because squareToLen is never called for sizes
>> larger than 127 (I've mentioned it before).
> It's not incorrect: it's a test for asymptotic behaviour.

From aph at redhat.com Mon Sep 25 15:57:43 2017
From: aph at redhat.com (Andrew Haley)
Date: Mon, 25 Sep 2017 16:57:43 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To:
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
	<85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
	<3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
Message-ID: <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com>

On 25/09/17 16:46, Dmitrij Pochepko wrote:
> please take a look at v2. I've modified the code to use multiplyToLen in
> squareToLen. An additional benefit: no more code in the common part. I've
> left mulAdd unchanged.

That looks fine. Please commit if you've run the jtreg test suite.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Mon Sep 25 16:33:00 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 25 Sep 2017 19:33:00 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
	<85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
	<3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
	<6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com>
Message-ID: <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com>

Thank you for such an attentive review.

I'll commit it now. I've run the jtreg tests in
jdk/test/java/math/BigInteger/* in both Xmixed and Xcomp modes.
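Beyond the jtreg runs, a quick self-contained cross-check of the two
intrinsified paths can also be useful. This is an illustrative snippet
of mine, not part of the test suite:

import java.math.BigInteger;
import java.util.Random;

public class SquareSmokeTest {
    public static void main(String[] args) {
        Random r = new Random(0);
        for (int i = 0; i < 100_000; i++) { // enough iterations to reach C2
            BigInteger x = new BigInteger(32 * (1 + r.nextInt(127)), r);
            // Same value, different object, so this multiply takes the
            // general multiplyToLen path rather than the squaring path.
            BigInteger y = new BigInteger(x.toByteArray());
            if (!x.multiply(x).equals(x.multiply(y))) {
                throw new AssertionError("square/multiply mismatch for " + x);
            }
        }
        System.out.println("ok");
    }
}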
From dmitrij.pochepko at bell-sw.com  Mon Sep 25 16:36:06 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 25 Sep 2017 19:36:06 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
 <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
 <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
 <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
 <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
 <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
 <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
 <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
 <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
 <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
 <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com>
 <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com>
Message-ID:

Seems like the repo is still closed. I'll have to wait a bit.

On 25.09.2017 19:33, Dmitrij Pochepko wrote:
> Thank you for such an attentive review.
>
> I'll commit it now. I've run the jtreg tests in
> jdk/test/java/math/BigInteger/* in both Xmixed and Xcomp modes.
>
> Thanks,
> Dmitrij
> On 25.09.2017 18:57, Andrew Haley wrote:
>> On 25/09/17 16:46, Dmitrij Pochepko wrote:
>>> please take a look at v2. I've modified the code to use multiplyToLen in
>>> squareToLen. Additional benefit: no more code in the common part. I've
>>> left mulAdd unchanged.
>> That looks fine. Please commit if you've run the jtreg test suite.
>>
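For context on the numbers in this thread: the implSquareToLen and
implMultiplyToLen figures come from microbenchmarks of this path. A minimal
JMH-style sketch of such a benchmark could look like the hypothetical code
below (class and field names are illustrative). In JDK 9-era BigInteger,
x.multiply(x) routes to square() above a small size threshold, and square()
uses the squareToLen intrinsic up to the Karatsuba threshold of 128 words,
matching the "never called for sizes greater than 127" remark above.

import java.math.BigInteger;
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class SquareToLenBench {
    // size is the magnitude length in 32-bit words; 200 exercises the
    // non-intrinsified, asymptotic path.
    @Param({"31", "63", "127", "200"})
    int size;

    BigInteger x;

    @Setup
    public void setup() {
        // Set the top bit so the magnitude has exactly `size` words.
        x = new BigInteger(32 * size, new Random(42)).setBit(32 * size - 1);
    }

    @Benchmark
    public BigInteger implSquareToLen() {
        return x.multiply(x);   // self-multiply dispatches to square()
    }
}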
From zhongwei.yao at linaro.org  Fri Sep 29 08:25:39 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Fri, 29 Sep 2017 16:25:39 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID:

Hi, Vladimir,

Sorry for my late response!

And yes, it solves my case.

But I found specjvm2008 doesn't have a stable result, especially for
benchmark cases like startup.xxx, scimark.xxx.large etc. And I have
found an obvious performance regression in the rest of the benchmark cases. What
do you think?

On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
> Nice.
>
> Did you verify that it fixed your case?
>
> Would be nice to run specjvm2008 to make sure performance did not regress.
>
> Thanks,
> Vladimir
>
> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>> Thanks for your suggestions!
>>
>> I've updated the patch that uses the pass_slp and do_unroll_only flags
>> without adding a new flag. Please take a look:
>>
>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>
>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>> Hi, Vladimir,
>>>>
>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>> Why not use the existing set_notpassed_slp() instead of
>>>>> mark_slp_vec_failed()?
>>>>
>>>> Due to two reasons, I have not chosen the existing passed_slp flag:
>>>
>>> My point is that if we don't find vectors in a loop (as in your case) we
>>> should ignore the whole SLP analysis.
>>>
>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>> if there are vector candidates. For example, check if the array's index
>>> depends on the loop's index variable.
>>>
>>> Another way is to call SuperWord::unrolling_analysis() only after we did
>>> the vector analysis.
>>>
>>> These are more complicated changes and out of scope of this. There is also
>>> a side effect I missed before which may prevent using set_notpassed_slp():
>>> LoopMaxUnroll is changed based on the SLP analysis before the
>>> has_passed_slp() check.
>>>
>>> Note, set_notpassed_slp() is also used to additionally unroll already
>>> vectorized loops:
>>>
>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>
>>> Maybe you should also call mark_do_unroll_only() when you set
>>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>>> problem you pointed out. Can you look at this?
>>>
>>> I am not against adding a new is_slp_vec_failed() but I want first to
>>> investigate if we can re-use existing functions.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>> check:
>>>>
>>>> if (cl->has_passed_slp()) {
>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>   // Normal case: loop too big
>>>>   return false;
>>>> }
>>>>
>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>> as also exposed in my patch:
>>>>
>>>> if (cl->has_passed_slp()) {
>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>> - // Normal case: loop too big
>>>> - return false;
>>>> + // When SLP vectorization failed, we could do more unrolling
>>>> + // optimizations if body size is less than limit size. Otherwise,
>>>> + // return false due to loop is too big.
>>>> + if (!cl->is_slp_vec_failed()) return false;
>>>> }
>>>>
>>>> However, I have not found a case to support this condition yet.
>>>>
>>>> 2. As replied below, in:
>>>>> - } else if (cl->is_main_loop()) {
>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>     sw.transform_loop(lpt, true);
>>>>
>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>
>>>>> Why do you need the next additional check?:
>>>>>
>>>>> - } else if (cl->is_main_loop()) {
>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>     sw.transform_loop(lpt, true);
>>>>
>>>> The additional check prevents the case that when
>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>> not what we want.
>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>
>>>>>> Hi, all,
>>>>>>
>>>>>> Bug:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>
>>>>>> Webrev:
>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>
>>>>>> In the current implementation, the loop unrolling count is determined
>>>>>> by the vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>
>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>
>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>
>>>>>> Here is one example:
>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>     array[0]++;
>>>>>>     array[1]++;
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
>>>>>>
>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>
>>>>>> ==== generated code start ====
>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>> ==== generated code end ====
>>>>>>
>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>
>>>>>> ==== generated code start ====
>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>> ==== generated code end ====
>>>>>>
>>>>>> This patch passes jtreg tests both on AArch64 and X86.

--
Best regards,
Zhongwei
From zhongwei.yao at linaro.org  Fri Sep 29 09:22:24 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Fri, 29 Sep 2017 17:22:24 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID:

I made a typo in the previous reply.

On 29 September 2017 at 16:25, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> Sorry for my late response!
>
> And yes, it solves my case.
>
> But I found specjvm2008 doesn't have a stable result, especially for
> benchmark cases like startup.xxx, scimark.xxx.large etc. And I have
> found an obvious performance regression in the rest of the benchmark cases. What

And I have NOT found an obvious performance regression in the rest of
the benchmark cases.

> do you think?
>
> On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
>> Nice.
>>
>> Did you verify that it fixed your case?
>>
>> Would be nice to run specjvm2008 to make sure performance did not regress.
>>
>> Thanks,
>> Vladimir
>>
>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>> Thanks for your suggestions!
>>>
>>> I've updated the patch that uses the pass_slp and do_unroll_only flags
>>> without adding a new flag. Please take a look:
>>>
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>
>>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>> Hi, Vladimir,
>>>>>
>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>>> Why not use the existing set_notpassed_slp() instead of
>>>>>> mark_slp_vec_failed()?
>>>>>
>>>>> Due to two reasons, I have not chosen the existing passed_slp flag:
>>>>
>>>> My point is that if we don't find vectors in a loop (as in your case) we
>>>> should ignore the whole SLP analysis.
>>>>
>>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>>> if there are vector candidates. For example, check if the array's index
>>>> depends on the loop's index variable.
>>>>
>>>> Another way is to call SuperWord::unrolling_analysis() only after we did
>>>> the vector analysis.
>>>>
>>>> These are more complicated changes and out of scope of this. There is also
>>>> a side effect I missed before which may prevent using set_notpassed_slp():
>>>> LoopMaxUnroll is changed based on the SLP analysis before the
>>>> has_passed_slp() check.
>>>>
>>>> Note, set_notpassed_slp() is also used to additionally unroll already
>>>> vectorized loops:
>>>>
>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>
>>>> Maybe you should also call mark_do_unroll_only() when you set
>>>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>>>> problem you pointed out. Can you look at this?
>>>>
>>>> I am not against adding a new is_slp_vec_failed() but I want first to
>>>> investigate if we can re-use existing functions.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>> check:
>>>>>
>>>>> if (cl->has_passed_slp()) {
>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>   // Normal case: loop too big
>>>>>   return false;
>>>>> }
>>>>>
>>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>>> as also exposed in my patch:
>>>>>
>>>>> if (cl->has_passed_slp()) {
>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>> - // Normal case: loop too big
>>>>> - return false;
>>>>> + // When SLP vectorization failed, we could do more unrolling
>>>>> + // optimizations if body size is less than limit size. Otherwise,
>>>>> + // return false due to loop is too big.
>>>>> + if (!cl->is_slp_vec_failed()) return false;
>>>>> }
>>>>>
>>>>> However, I have not found a case to support this condition yet.
>>>>>
>>>>> 2. As replied below, in:
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>     sw.transform_loop(lpt, true);
>>>>>
>>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>>
>>>>>> Why do you need the next additional check?:
>>>>>>
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>     sw.transform_loop(lpt, true);
>>>>>
>>>>> The additional check prevents the case that when
>>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>>> not what we want.
>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> Bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>
>>>>>>> In the current implementation, the loop unrolling count is determined
>>>>>>> by the vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>>
>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>>
>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>>
>>>>>>> Here is one example:
>>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>>     array[0]++;
>>>>>>>     array[1]++;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
>>>>>>>
>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> This patch passes jtreg tests both on AArch64 and X86.
>
> --
> Best regards,
> Zhongwei
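The 4x and 8x defaults quoted in the RFR fall out of a simple division; a
minimal sketch of that policy arithmetic (the method name is illustrative,
not the actual HotSpot code):

// Default SLP-driven unroll count: how many elements fit in one vector.
static int slpMaxUnrollFactor(int vectorWidthBits, int elementSizeBits) {
    return vectorWidthBits / elementSizeBits;
}

// AArch64 NEON: slpMaxUnrollFactor(128, 32) == 4
// x86 AVX2:     slpMaxUnrollFactor(256, 32) == 8
// The patch lets unrolling continue past this value, roughly up to the
// LoopMaxUnroll limit, when SLP finds no packs (vectorization failed).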
From vladimir.kozlov at oracle.com  Fri Sep 29 18:10:10 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Fri, 29 Sep 2017 11:10:10 -0700
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID: <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com>

On 9/29/17 1:25 AM, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> Sorry for my late response!
>
> And yes, it solves my case.
>
> But I found specjvm2008 doesn't have a stable result, especially for
> benchmark cases like startup.xxx, scimark.xxx.large etc. And I have
> found an obvious performance regression in the rest of the benchmark cases. What
> do you think?

You know that you can change the run parameters for specjvm2008 to avoid
waiting long for it to finish. And you should preferably run on one node.

Variation in startup is not important in this case. But scimark is
important since it shows the quality of loop optimizations.

Is the regression significant? We need more time to investigate it then.

Thanks,
Vladimir

> On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
>> Nice.
>>
>> Did you verify that it fixed your case?
>>
>> Would be nice to run specjvm2008 to make sure performance did not regress.
>>
>> Thanks,
>> Vladimir
>>
>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>> Thanks for your suggestions!
>>>
>>> I've updated the patch that uses the pass_slp and do_unroll_only flags
>>> without adding a new flag. Please take a look:
>>>
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>
>>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>> Hi, Vladimir,
>>>>>
>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>>> Why not use the existing set_notpassed_slp() instead of
>>>>>> mark_slp_vec_failed()?
>>>>>
>>>>> Due to two reasons, I have not chosen the existing passed_slp flag:
>>>>
>>>> My point is that if we don't find vectors in a loop (as in your case) we
>>>> should ignore the whole SLP analysis.
>>>>
>>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>>> if there are vector candidates. For example, check if the array's index
>>>> depends on the loop's index variable.
>>>>
>>>> Another way is to call SuperWord::unrolling_analysis() only after we did
>>>> the vector analysis.
>>>>
>>>> These are more complicated changes and out of scope of this. There is also
>>>> a side effect I missed before which may prevent using set_notpassed_slp():
>>>> LoopMaxUnroll is changed based on the SLP analysis before the
>>>> has_passed_slp() check.
>>>>
>>>> Note, set_notpassed_slp() is also used to additionally unroll already
>>>> vectorized loops:
>>>>
>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>
>>>> Maybe you should also call mark_do_unroll_only() when you set
>>>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>>>> problem you pointed out. Can you look at this?
>>>>
>>>> I am not against adding a new is_slp_vec_failed() but I want first to
>>>> investigate if we can re-use existing functions.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>> check:
>>>>>
>>>>> if (cl->has_passed_slp()) {
>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>   // Normal case: loop too big
>>>>>   return false;
>>>>> }
>>>>>
>>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>>> as also exposed in my patch:
>>>>>
>>>>> if (cl->has_passed_slp()) {
>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>> - // Normal case: loop too big
>>>>> - return false;
>>>>> + // When SLP vectorization failed, we could do more unrolling
>>>>> + // optimizations if body size is less than limit size. Otherwise,
>>>>> + // return false due to loop is too big.
>>>>> + if (!cl->is_slp_vec_failed()) return false;
>>>>> }
>>>>>
>>>>> However, I have not found a case to support this condition yet.
>>>>>
>>>>> 2. As replied below, in:
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>     sw.transform_loop(lpt, true);
>>>>>
>>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>>
>>>>>> Why do you need the next additional check?:
>>>>>>
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>     sw.transform_loop(lpt, true);
>>>>>
>>>>> The additional check prevents the case that when
>>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>>> not what we want.
>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> Bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>
>>>>>>> In the current implementation, the loop unrolling count is determined
>>>>>>> by the vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>>
>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>>
>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>>
>>>>>>> Here is one example:
>>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>>     array[0]++;
>>>>>>>     array[1]++;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
>>>>>>>
>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> This patch passes jtreg tests both on AArch64 and X86.

From zhongwei.yao at linaro.org  Sat Sep 30 06:37:32 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Sat, 30 Sep 2017 14:37:32 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To: <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com>
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
 <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com>
Message-ID:

On 30 September 2017 at 02:10, Vladimir Kozlov wrote:
> On 9/29/17 1:25 AM, Zhongwei Yao wrote:
>> Hi, Vladimir,
>>
>> Sorry for my late response!
>>
>> And yes, it solves my case.
>>
>> But I found specjvm2008 doesn't have a stable result, especially for
>> benchmark cases like startup.xxx, scimark.xxx.large etc. And I have not
>> found an obvious performance regression in the rest of the benchmark cases. What
>> do you think?
>
> You know that you can change the run parameters for specjvm2008 to avoid
> waiting long for it to finish. And you should preferably run on one node.
>
> Variation in startup is not important in this case. But scimark is
> important since it shows the quality of loop optimizations.
>
> Is the regression significant? We need more time to investigate it then.

I see that the performance data fluctuates in specjvm2008. However, I
checked scimark 2.0 (http://math.nist.gov/scimark2/) and see no
performance regression in it on either x86 or arm64.

> Thanks,
> Vladimir
>
>> On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
>>> Nice.
>>>
>>> Did you verify that it fixed your case?
>>>
>>> Would be nice to run specjvm2008 to make sure performance did not
>>> regress.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>>> Thanks for your suggestions!
>>>>
>>>> I've updated the patch that uses the pass_slp and do_unroll_only flags
>>>> without adding a new flag. Please take a look:
>>>>
>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>>
>>>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>>> Hi, Vladimir,
>>>>>>
>>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>>>> Why not use the existing set_notpassed_slp() instead of
>>>>>>> mark_slp_vec_failed()?
>>>>>>
>>>>>> Due to two reasons, I have not chosen the existing passed_slp flag:
>>>>>
>>>>> My point is that if we don't find vectors in a loop (as in your case) we
>>>>> should ignore the whole SLP analysis.
>>>>>
>>>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>>>> if there are vector candidates. For example, check if the array's index
>>>>> depends on the loop's index variable.
>>>>>
>>>>> Another way is to call SuperWord::unrolling_analysis() only after we did
>>>>> the vector analysis.
>>>>>
>>>>> These are more complicated changes and out of scope of this. There is also
>>>>> a side effect I missed before which may prevent using set_notpassed_slp():
>>>>> LoopMaxUnroll is changed based on the SLP analysis before the
>>>>> has_passed_slp() check.
>>>>>
>>>>> Note, set_notpassed_slp() is also used to additionally unroll already
>>>>> vectorized loops:
>>>>>
>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>>
>>>>> Maybe you should also call mark_do_unroll_only() when you set
>>>>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>>>>> problem you pointed out. Can you look at this?
>>>>>
>>>>> I am not against adding a new is_slp_vec_failed() but I want first to
>>>>> investigate if we can re-use existing functions.
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>>> check:
>>>>>>
>>>>>> if (cl->has_passed_slp()) {
>>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>>   // Normal case: loop too big
>>>>>>   return false;
>>>>>> }
>>>>>>
>>>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>>>> as also exposed in my patch:
>>>>>>
>>>>>> if (cl->has_passed_slp()) {
>>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>> - // Normal case: loop too big
>>>>>> - return false;
>>>>>> + // When SLP vectorization failed, we could do more unrolling
>>>>>> + // optimizations if body size is less than limit size. Otherwise,
>>>>>> + // return false due to loop is too big.
>>>>>> + if (!cl->is_slp_vec_failed()) return false;
>>>>>> }
>>>>>>
>>>>>> However, I have not found a case to support this condition yet.
>>>>>>
>>>>>> 2. As replied below, in:
>>>>>>> - } else if (cl->is_main_loop()) {
>>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>>     sw.transform_loop(lpt, true);
>>>>>>
>>>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>>>
>>>>>>> Why do you need the next additional check?:
>>>>>>>
>>>>>>> - } else if (cl->is_main_loop()) {
>>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>>     sw.transform_loop(lpt, true);
>>>>>>
>>>>>> The additional check prevents the case that when
>>>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>>>> not what we want.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Vladimir
>>>>>>>
>>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>>
>>>>>>>> Hi, all,
>>>>>>>>
>>>>>>>> Bug:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>>
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>>
>>>>>>>> In the current implementation, the loop unrolling count is determined
>>>>>>>> by the vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>>>
>>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>>>
>>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>>>
>>>>>>>> Here is one example:
>>>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>>>     array[0]++;
>>>>>>>>     array[1]++;
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
>>>>>>>>
>>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>>
>>>>>>>> ==== generated code start ====
>>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>>> ==== generated code end ====
>>>>>>>>
>>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>>
>>>>>>>> ==== generated code start ====
>>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>>> ==== generated code end ====
>>>>>>>>
>>>>>>>> This patch passes jtreg tests both on AArch64 and X86.

--
Best regards,
Zhongwei
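A note on why SLP fails on this example: every iteration loads and stores
array[0] and array[1], a loop-carried dependence through memory, so SuperWord
finds no independent packs to vectorize. What the deeper unroll buys is
visible in the add #0x10 instructions in the listings above: C2 folds the
sixteen unrolled increments into a single add. A rough source-level sketch of
the effect (illustrative only, not actual compiler output):

// After 16x unrolling, C2 collapses the redundant loads/stores, so the
// steady-state loop behaves roughly like this (1024 is divisible by 16,
// so no pre-/post-loop iterations are left over):
public static void accessArrayConstantsUnrolled(int[] array) {
    for (int j = 0; j < 1024; j += 16) {
        array[0] += 16;   // sixteen ++ operations folded into one add
        array[1] += 16;
    }
}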