From stewartd.qdt at qualcommdatacenter.com  Fri Sep  1 16:22:27 2017
From: stewartd.qdt at qualcommdatacenter.com (stewartd.qdt)
Date: Fri, 1 Sep 2017 16:22:27 +0000
Subject: [aarch64-port-dev ] [aarch64-port-dev][10] RFR: 8187022 UBFX instructions have wrong format string
Message-ID: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com>

Please see the webrev [1] for fixing the format string of ubfx [2].

[1]: http://cr.openjdk.java.net/~njian/8187022/webrev.00/
[2]: https://bugs.openjdk.java.net/browse/JDK-8187022

Thanks,
Daniel Stewart

From dmitry.chuyko at bell-sw.com  Fri Sep  1 16:26:11 2017
From: dmitry.chuyko at bell-sw.com (Dmitry Chuyko)
Date: Fri, 1 Sep 2017 19:26:11 +0300
Subject: [aarch64-port-dev ] RFR: 8186671: Use `yield` instruction in SpinPause on linux-aarch64
In-Reply-To: <68db10fa-46d1-638f-7f46-a5c862e11b69@bell-sw.com>
References: <68db10fa-46d1-638f-7f46-a5c862e11b69@bell-sw.com>
Message-ID: <33ae3f2a-80b3-ed99-cb23-93de86af480d@bell-sw.com>

On 08/24/2017 06:33 PM, Dmitry Chuyko wrote:
> On 08/23/2017 10:39 PM, White, Derek wrote:
>> Hi Andrew,
>>
>>> -----Original Message-----
>>> From: aarch64-port-dev [mailto:aarch64-port-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley
>>> Sent: Wednesday, August 23, 2017 12:32 PM
>>> To: aarch64-port-dev at openjdk.java.net
>>> Subject: Re: [aarch64-port-dev ] RFR: 8186671: Use `yield` instruction in SpinPause on linux-aarch64
>>>
>>> On 23/08/17 17:07, Dmitry Chuyko wrote:
>>>> Please review a change in SpinPause implementation.
>>>>
>>>> related study: http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
>>>> rfe: https://bugs.openjdk.java.net/browse/JDK-8186671
>>>> webrev: http://cr.openjdk.java.net/~dchuyko/8186671/webrev.00/
>>>>
>>>> The function was moved to the platform .S file and now contains a yield
>>>> instruction.
>>> [...]
>>>
>>> In any case we
>>>> Re the use of yield in SpinPause(): this looks correct to me. OK.
> Good. This part seemed scarier.

There were no objections to this part (extern). I need sponsorship to
push the change.

It would be interesting to discuss the other (intrinsic) part a bit more
at a fireside chat.

-Dmitry

>
> --
> Dmitry
>>>
>>> --
>>> Andrew Haley
>>> Java Platform Lead Engineer
>>> Red Hat UK Ltd.
>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>

From aph at redhat.com  Fri Sep  1 17:40:24 2017
From: aph at redhat.com (Andrew Haley)
Date: Fri, 1 Sep 2017 18:40:24 +0100
Subject: [aarch64-port-dev ] [aarch64-port-dev][10] RFR: 8187022 UBFX instructions have wrong format string
In-Reply-To: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com>
References: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com>
Message-ID: <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>

On 01/09/17 17:22, stewartd.qdt wrote:
> Please see the webrev [1] for fixing the format string of ubfx [2].
>
> [1]: http://cr.openjdk.java.net/~njian/8187022/webrev.00/
> [2]: https://bugs.openjdk.java.net/browse/JDK-8187022

Great, thanks.

P.S. Resending to hotspot-dev. That's where this should go, because
AArch64 is in the main hotspot tree now.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From aph at redhat.com  Sat Sep  2 08:10:00 2017
From: aph at redhat.com (Andrew Haley)
Date: Sat, 2 Sep 2017 09:10:00 +0100
Subject: [aarch64-port-dev ] RFR: 8186671: Use `yield` instruction in SpinPause on linux-aarch64
In-Reply-To: <33ae3f2a-80b3-ed99-cb23-93de86af480d@bell-sw.com>
References: <68db10fa-46d1-638f-7f46-a5c862e11b69@bell-sw.com> <33ae3f2a-80b3-ed99-cb23-93de86af480d@bell-sw.com>
Message-ID:

On 01/09/17 17:26, Dmitry Chuyko wrote:
> There were no objections to this part (extern). I need sponsorship to
> push the change.

I can do it, but it really needs to be sent to hotspot-dev.

> It would be interesting to discuss the other (intrinsic) part a bit more
> at a fireside chat.

OK, but without any actual implementations we can test it'll be a very
short discussion.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
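An illustrative aside on the change above: the webrev implements SpinPause()
directly in the platform .S file, so the sketch below is not the patch itself,
only a rough C++ equivalent using GCC-style inline assembly. The return value
is an assumption, not taken from the webrev:

    // Sketch only; compiles with GCC/Clang when targeting AArch64.
    extern "C" int SpinPause() {
      asm volatile("yield");  // CPU hint: this thread is spin-waiting
      return 1;               // assumed return value, not from the webrev
    }

The yield instruction is a hint that the core is busy-waiting, which can help
on SMT and heavily loaded systems and is essentially free elsewhere.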
From felix.yang at linaro.org  Sun Sep  3 13:16:38 2017
From: felix.yang at linaro.org (Felix Yang)
Date: Sun, 3 Sep 2017 21:16:38 +0800
Subject: [aarch64-port-dev ] [aarch64-port-dev][10] RFR: 8187022 UBFX instructions have wrong format string
In-Reply-To: <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>
References: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com> <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>
Message-ID:

Pushed. Thanks.

On 2 September 2017 at 01:40, Andrew Haley wrote:

> On 01/09/17 17:22, stewartd.qdt wrote:
> > Please see the webrev [1] for fixing the format string of ubfx [2].
> >
> > [1]: http://cr.openjdk.java.net/~njian/8187022/webrev.00/
> > [2]: https://bugs.openjdk.java.net/browse/JDK-8187022
>
> Great, thanks.
>
> P.S. Resending to hotspot-dev. That's where this should go, because
> AArch64 is in the main hotspot tree now.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd.
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>

From felix.yang at linaro.org  Mon Sep  4 15:18:23 2017
From: felix.yang at linaro.org (Felix Yang)
Date: Mon, 4 Sep 2017 23:18:23 +0800
Subject: [aarch64-port-dev ] RFR(S): 8154537: AArch64: some integer rotate instructions are never emitted
In-Reply-To:
References: <57161A23.3050807@redhat.com> <57162A22.2050706@redhat.com> <571722B7.2000404@redhat.com> <1bc50628-5ef9-b24b-d1b3-4762e7ff3b12@redhat.com>
Message-ID:

Hi,

    This issue has been fixed in jdk10.
    Shall I propose a patch for jdk9 dev and maybe the jdk8u aarch64 repo? They
got the same issue.

Thanks,
Felix

On 1 August 2017 at 20:43, Felix Yang wrote:

> LGTM. This also addresses:
> http://cr.openjdk.java.net/~fyang/8157906/webrev.00/src/cpu/aarch64/vm/aarch64.ad.sdiff.html
>
> Thanks,
> Felix
>
> On 29 July 2017 at 01:59, Andrew Haley wrote:
>
>> I'm looking at the webrev in
>> http://cr.openjdk.java.net/~roland/8154537/webrev.00/ and I see that
>> the changes were made to aarch64.ad but not to ad_aarch64.m4. This is
>> problematic because some .m4 files are used to generate the .ad file,
>> and if anyone regenerates the .ad file the bug will regress.
>>
>> I think this is the change we need to make. It won't affect generated
>> code at all, but it is something of a ticking bomb.
>>
>> diff -r 214a94e9366c src/cpu/aarch64/vm/aarch64_ad.m4
>> --- a/src/cpu/aarch64/vm/aarch64_ad.m4  Mon Jul 17 12:11:32 2017 +0000
>> +++ b/src/cpu/aarch64/vm/aarch64_ad.m4  Fri Jul 28 18:57:25 2017 +0100
>> @@ -268,21 +268,21 @@
>>    ins_pipe(ialu_reg_reg_vshift);
>>  %}')dnl
>>  define(ROL_INSN, `
>> -instruct $3$1_rReg_Var_C$2(iRegLNoSp dst, iRegL src, iRegI shift, immI$2 c$2, rFlagsReg cr)
>> +instruct $3$1_rReg_Var_C$2(iReg$1NoSp dst, iReg$1 src, iRegI shift, immI$2 c$2, rFlagsReg cr)
>>  %{
>>    match(Set dst (Or$1 (LShift$1 src shift) (URShift$1 src (SubI c$2 shift))));
>>
>>    expand %{
>> -    $3L_rReg(dst, src, shift, cr);
>> +    $3$1_rReg(dst, src, shift, cr);
>>    %}
>>  %}')dnl
>>  define(ROR_INSN, `
>> -instruct $3$1_rReg_Var_C$2(iRegLNoSp dst, iRegL src, iRegI shift, immI$2 c$2, rFlagsReg cr)
>> +instruct $3$1_rReg_Var_C$2(iReg$1NoSp dst, iReg$1 src, iRegI shift, immI$2 c$2, rFlagsReg cr)
>>  %{
>>    match(Set dst (Or$1 (URShift$1 src shift) (LShift$1 src (SubI c$2 shift))));
>>
>>    expand %{
>> -    $3L_rReg(dst, src, shift, cr);
>> +    $3$1_rReg(dst, src, shift, cr);
>>    %}
>>  %}')dnl
>>  ROL_EXPAND(L, rol, rorv)
>>
>> --
>> Andrew Haley
>> Java Platform Lead Engineer
>> Red Hat UK Ltd.
>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>
>

From aph at redhat.com  Mon Sep  4 16:09:16 2017
From: aph at redhat.com (Andrew Haley)
Date: Mon, 4 Sep 2017 17:09:16 +0100
Subject: [aarch64-port-dev ] RFR(S): 8154537: AArch64: some integer rotate instructions are never emitted
In-Reply-To:
References: <57161A23.3050807@redhat.com> <57162A22.2050706@redhat.com> <571722B7.2000404@redhat.com> <1bc50628-5ef9-b24b-d1b3-4762e7ff3b12@redhat.com>
Message-ID: <67595610-3d30-fbfd-96ab-baa00e12233b@redhat.com>

On 04/09/17 16:18, Felix Yang wrote:
>     This issue has been fixed in jdk10.
>     Shall I propose a patch for jdk9 dev and maybe the jdk8u aarch64 repo? They
> got the same issue.

jdk8u is pre-approved. jdk9 dev is more problematic: I think it's closed for
now. But please register the bugs anyway.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From gnu.andrew at redhat.com  Tue Sep  5 02:15:29 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:15:29 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u: Added tag aarch64-jdk8u144-b02 for changeset 461c9270383a
Message-ID: <201709050215.v852FTR4016107@aojmv0008.oracle.com>

Changeset: 8803133b679b
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/rev/8803133b679b

Added tag aarch64-jdk8u144-b02 for changeset 461c9270383a

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:15:35 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:15:35 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/corba: Added tag aarch64-jdk8u144-b02 for changeset 4b222c433612
Message-ID: <201709050215.v852FZ7F016192@aojmv0008.oracle.com>

Changeset: 2f6bf6972714
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/corba/rev/2f6bf6972714

Added tag aarch64-jdk8u144-b02 for changeset 4b222c433612

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:15:42 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:15:42 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/jaxp: Added tag aarch64-jdk8u144-b02 for changeset 2793510feb8c
Message-ID: <201709050215.v852FgFW016257@aojmv0008.oracle.com>

Changeset: b56f98b75e2a
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/jaxp/rev/b56f98b75e2a

Added tag aarch64-jdk8u144-b02 for changeset 2793510feb8c

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:18:41 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:18:41 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/jaxws: Added tag aarch64-jdk8u144-b02 for changeset 1eb06202a5c9
Message-ID: <201709050218.v852IfiP019096@aojmv0008.oracle.com>

Changeset: 56babd47ee19
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/jaxws/rev/56babd47ee19

Added tag aarch64-jdk8u144-b02 for changeset 1eb06202a5c9

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:18:47 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:18:47 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/langtools: Added tag aarch64-jdk8u144-b02 for changeset eb8e9a1d6c9f
Message-ID: <201709050218.v852Ilj4019169@aojmv0008.oracle.com>

Changeset: 9a5a859f6fda
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/langtools/rev/9a5a859f6fda

Added tag aarch64-jdk8u144-b02 for changeset eb8e9a1d6c9f

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:18:53 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:18:53 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/hotspot: Added tag aarch64-jdk8u144-b02 for changeset 7672149aea2c
Message-ID: <201709050218.v852Irmm019271@aojmv0008.oracle.com>

Changeset: 21011afdacc2
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/hotspot/rev/21011afdacc2

Added tag aarch64-jdk8u144-b02 for changeset 7672149aea2c

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:18:59 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:18:59 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/jdk: Added tag aarch64-jdk8u144-b02 for changeset 9322c39fd0df
Message-ID: <201709050218.v852IxgL019351@aojmv0008.oracle.com>

Changeset: 49cb4b2b45a3
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/jdk/rev/49cb4b2b45a3

Added tag aarch64-jdk8u144-b02 for changeset 9322c39fd0df

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:19:06 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:19:06 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/nashorn: Added tag aarch64-jdk8u144-b02 for changeset 13c40d5bd8cc
Message-ID: <201709050219.v852J6cu019482@aojmv0008.oracle.com>

Changeset: b74e5d373608
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/nashorn/rev/b74e5d373608

Added tag aarch64-jdk8u144-b02 for changeset 13c40d5bd8cc

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:32 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:32 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah: 4 new changesets
Message-ID: <201709050249.v852nWv9004341@aojmv0008.oracle.com>

Changeset: 461c9270383a
Author:    andrew
Date:      2016-07-11 05:02 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/rev/461c9270383a

8151841: Build needs additional flags to compile with GCC 6 [plus parts of 8149647 & 8032045]
Summary: C++ standard needs to be explicitly set and some optimisations turned off to build on GCC 6
Reviewed-by: erikj, dholmes, kbarrett

! common/autoconf/generated-configure.sh
! common/autoconf/hotspot-spec.gmk.in
! common/autoconf/spec.gmk.in
! common/autoconf/toolchain.m4

Changeset: 8803133b679b
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/rev/8803133b679b

Added tag aarch64-jdk8u144-b02 for changeset 461c9270383a

! .hgtags

Changeset: 768646fd5745
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/rev/768646fd5745

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 6cd26459fb2f
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/rev/6cd26459fb2f

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset 768646fd5745

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:37 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:37 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/corba: 3 new changesets
Message-ID: <201709050249.v852nbOB004413@aojmv0008.oracle.com>

Changeset: 2f6bf6972714
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/corba/rev/2f6bf6972714

Added tag aarch64-jdk8u144-b02 for changeset 4b222c433612

! .hgtags

Changeset: ad9497112b0a
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/corba/rev/ad9497112b0a

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 76a6ff94929a
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/corba/rev/76a6ff94929a

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset ad9497112b0a

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:42 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:42 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/jaxp: 3 new changesets
Message-ID: <201709050249.v852ng69004475@aojmv0008.oracle.com>

Changeset: b56f98b75e2a
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxp/rev/b56f98b75e2a

Added tag aarch64-jdk8u144-b02 for changeset 2793510feb8c

! .hgtags

Changeset: 488afc89de7b
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxp/rev/488afc89de7b

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 0e28e142b39e
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxp/rev/0e28e142b39e

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset 488afc89de7b

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:48 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:48 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/jaxws: 3 new changesets
Message-ID: <201709050249.v852nm49004562@aojmv0008.oracle.com>

Changeset: 56babd47ee19
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxws/rev/56babd47ee19

Added tag aarch64-jdk8u144-b02 for changeset 1eb06202a5c9

! .hgtags

Changeset: f1d17bae71a9
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxws/rev/f1d17bae71a9

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 20f47c7395a6
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jaxws/rev/20f47c7395a6

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset f1d17bae71a9

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:53 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:53 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/langtools: 3 new changesets
Message-ID: <201709050249.v852nr2Z004637@aojmv0008.oracle.com>

Changeset: 9a5a859f6fda
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/langtools/rev/9a5a859f6fda

Added tag aarch64-jdk8u144-b02 for changeset eb8e9a1d6c9f

! .hgtags

Changeset: 7d0753285c49
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/langtools/rev/7d0753285c49

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: e0e8b07dc201
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/langtools/rev/e0e8b07dc201

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset 7d0753285c49

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:59 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:49:59 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/hotspot: 3 new changesets
Message-ID: <201709050249.v852nxCN004697@aojmv0008.oracle.com>

Changeset: 21011afdacc2
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/21011afdacc2

Added tag aarch64-jdk8u144-b02 for changeset 7672149aea2c

! .hgtags

Changeset: ec1439043fc0
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/ec1439043fc0

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 1bec072d4387
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/1bec072d4387

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset ec1439043fc0

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:50:05 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:50:05 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/jdk: 3 new changesets
Message-ID: <201709050250.v852o5Bf004777@aojmv0008.oracle.com>

Changeset: 49cb4b2b45a3
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jdk/rev/49cb4b2b45a3

Added tag aarch64-jdk8u144-b02 for changeset 9322c39fd0df

! .hgtags

Changeset: b415ef41ac89
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jdk/rev/b415ef41ac89

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 24a174dce25b
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/jdk/rev/24a174dce25b

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset b415ef41ac89

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:50:11 2017
From: gnu.andrew at redhat.com (gnu.andrew at redhat.com)
Date: Tue, 05 Sep 2017 02:50:11 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/nashorn: 3 new changesets
Message-ID: <201709050250.v852oBQl004839@aojmv0008.oracle.com>

Changeset: b74e5d373608
Author:    andrew
Date:      2017-09-05 03:10 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/nashorn/rev/b74e5d373608

Added tag aarch64-jdk8u144-b02 for changeset 13c40d5bd8cc

! .hgtags

Changeset: f41bd292f498
Author:    andrew
Date:      2017-09-05 03:44 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/nashorn/rev/f41bd292f498

Merge aarch64-jdk8u144-b02

! .hgtags

Changeset: 84c7ffac6e87
Author:    andrew
Date:      2017-09-05 03:46 +0100
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/nashorn/rev/84c7ffac6e87

Added tag aarch64-shenandoah-jdk8u144-b02 for changeset f41bd292f498

! .hgtags

From gnu.andrew at redhat.com  Tue Sep  5 02:49:43 2017
From: gnu.andrew at redhat.com (Andrew Hughes)
Date: Tue, 5 Sep 2017 03:49:43 +0100
Subject: [aarch64-port-dev ] [RFR] 8151841: Build needs additional flags to compile with GCC 6 [plus parts of 8149647 & 8032045]
In-Reply-To: <850ac4fa-d724-51e3-c46a-65438e530558@redhat.com>
References: <00db57b4-dada-e528-fb87-181ee04ebb45@redhat.com> <850ac4fa-d724-51e3-c46a-65438e530558@redhat.com>
Message-ID:

On 31 August 2017 at 10:04, Andrew Dinn wrote:
> On 30/08/17 16:29, Andrew Hughes wrote:
>> I haven't merged to shenandoah/jdk8u yet as it doesn't seem worth it
>> for just the one fix and there is nothing else in the aarch64/jdk8u repositories
>> after aarch64-jdk8u144-b01. Are there any pending backports to aarch64/jdk8u
>> which would then make a worthwhile batch of changes to merge over to
>> the Shenandoah tree?
>>
>> Of course, in the worst case, it'll be merged with the next security update
>> in October.
>
> I don't know of any other changes that need merging. However, it would
> be very helpful also to have this in the shenandoah/jdk8u repo because it
> ensures that anyone can build the repo on Fedora, not just those who
> know the necessary black magic. I'm particularly thinking of JBoss staff
> like Andy Miller (who recently wanted to build a Shenandoah
> slowdebug JVM and then debug it using gdb). So, unless there is a good
> reason not to merge just this one change, could we please have it?
>
> regards,
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

I wasn't proposing to not merge it, just that we wouldn't do a merge solely for
this change. It would have been nicer to bring in a number of changes rather
than just the one. However, I do agree it's a pain; it's been driving me mad
for the last year.

Tagged as aarch64-jdk8u144-b02 and merged as aarch64-shenandoah-jdk8u144-b02

--
Andrew :)

Senior Free Java Software Engineer
Red Hat, Inc. (http://www.redhat.com)

Web Site: http://fuseyism.com
Twitter: https://twitter.com/gnu_andrew_java
PGP Key: ed25519/0xCFDA0F9B35964222 (hkp://keys.gnupg.net)
Fingerprint = 5132 579D D154 0ED2 3E04 C5A0 CFDA 0F9B 3596 4222

From adinn at redhat.com  Tue Sep  5 09:32:09 2017
From: adinn at redhat.com (Andrew Dinn)
Date: Tue, 5 Sep 2017 10:32:09 +0100
Subject: [aarch64-port-dev ] [RFR] 8151841: Build needs additional flags to compile with GCC 6 [plus parts of 8149647 & 8032045]
In-Reply-To:
References: <00db57b4-dada-e528-fb87-181ee04ebb45@redhat.com> <850ac4fa-d724-51e3-c46a-65438e530558@redhat.com>
Message-ID: <01d2f874-561d-d2ab-8edd-972ff2d0aa06@redhat.com>

On 05/09/17 03:49, Andrew Hughes wrote:
> I wasn't proposing to not merge it, just that we wouldn't do a merge solely for
> this change. It would have been nicer to bring in a number of changes rather
> than just the one. However, I do agree it's a pain; it's been driving me mad
> for the last year.
>
> Tagged as aarch64-jdk8u144-b02 and merged as aarch64-shenandoah-jdk8u144-b02

Excellent. Thanks!

regards,

Andrew Dinn
-----------

From felix.yang at linaro.org  Tue Sep  5 14:14:40 2017
From: felix.yang at linaro.org (felix.yang at linaro.org)
Date: Tue, 05 Sep 2017 14:14:40 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u/hotspot: 8187224: aarch64: some inconsistency between aarch64_ad.m4 and aarch64.ad
Message-ID: <201709051414.v85EEeEx028772@aojmv0008.oracle.com>

Changeset: 471de666658d
Author:    fyang
Date:      2017-09-05 19:09 +0800
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u/hotspot/rev/471de666658d

8187224: aarch64: some inconsistency between aarch64_ad.m4 and aarch64.ad
Summary: fix ROL_INSN and ROR_INSN definition in aarch64_ad.m4
Reviewed-by: aph

! src/cpu/aarch64/vm/aarch64_ad.m4

From felix.yang at linaro.org  Tue Sep  5 14:26:18 2017
From: felix.yang at linaro.org (Felix Yang)
Date: Tue, 5 Sep 2017 22:26:18 +0800
Subject: [aarch64-port-dev ] RFR(S): 8154537: AArch64: some integer rotate instructions are never emitted
In-Reply-To: <67595610-3d30-fbfd-96ab-baa00e12233b@redhat.com>
References: <57161A23.3050807@redhat.com> <57162A22.2050706@redhat.com> <571722B7.2000404@redhat.com> <1bc50628-5ef9-b24b-d1b3-4762e7ff3b12@redhat.com> <67595610-3d30-fbfd-96ab-baa00e12233b@redhat.com>
Message-ID:

Newly created bug report: https://bugs.openjdk.java.net/browse/JDK-8187224
(which duplicates: https://bugs.openjdk.java.net/browse/JDK-8185656)

Patch has been applied to the aarch64 jdk8u repo:
http://hg.openjdk.java.net/aarch64-port/jdk8u/hotspot/rev/471de666658d

Thanks,
Felix

On 5 September 2017 at 00:09, Andrew Haley wrote:

> On 04/09/17 16:18, Felix Yang wrote:
> >     This issue has been fixed in jdk10.
> >     Shall I propose a patch for jdk9 dev and maybe the jdk8u aarch64 repo? They
> > got the same issue.
>
> jdk8u is pre-approved. jdk9 dev is more problematic: I think it's closed for
> now. But please register the bugs anyway.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd.
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>

From dmitrij.pochepko at bell-sw.com  Tue Sep  5 17:34:11 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Tue, 5 Sep 2017 20:34:11 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
Message-ID: <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>

Hi Andrew,

you can find my attempt to implement the mulAdd intrinsic using wider
multiplication here:

http://cr.openjdk.java.net/~dpochepk/8186915/webrev.02/

but, as expected, I got slower results on the same benchmark compared to
the original webrev.01 with 32-bit multiplication.

I've measured results on ThunderX:

webrev.01 version:

Benchmark                      (size)  Mode  Cnt    Score    Error  Units
BigIntegerBench.mulAddReflect       1  avgt    5  194.809 ±  1.341  ns/op
BigIntegerBench.mulAddReflect       2  avgt    5  198.242 ±  1.348  ns/op
BigIntegerBench.mulAddReflect       3  avgt    5  201.213 ±  0.670  ns/op
BigIntegerBench.mulAddReflect       5  avgt    5  213.426 ±  7.441  ns/op
BigIntegerBench.mulAddReflect      10  avgt    5  236.396 ±  1.663  ns/op
BigIntegerBench.mulAddReflect      50  avgt    5  432.255 ± 24.718  ns/op
BigIntegerBench.mulAddReflect     100  avgt    5  653.961 ± 10.140  ns/op

webrev.02 version with wider multiplication:

Benchmark                      (size)  Mode  Cnt    Score     Error  Units
BigIntegerBench.mulAddReflect       1  avgt    5  196.109 ±   0.663  ns/op
BigIntegerBench.mulAddReflect       2  avgt    5  213.438 ± 124.206  ns/op
BigIntegerBench.mulAddReflect       3  avgt    5  211.683 ±   3.206  ns/op
BigIntegerBench.mulAddReflect       5  avgt    5  217.324 ±   5.827  ns/op
BigIntegerBench.mulAddReflect      10  avgt    5  233.272 ±  21.560  ns/op
BigIntegerBench.mulAddReflect      50  avgt    5  455.337 ± 237.168  ns/op
BigIntegerBench.mulAddReflect     100  avgt    5  826.844 ±   4.972  ns/op

As you can see, it's up to 26% worse throughput with wider multiplication.

The reasons for this are:

1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
can't be changed within the function signature. Thus we can't fully
utilize the potential of 64-bit multiplication.
2. The umulh instruction is more expensive than the mul instruction.

I haven't implemented wider multiplication for the squareToLen intrinsic,
since it would require much more code due to more corner cases. Also, the
squaring algorithm in BigInteger doesn't handle more than 127 integers
in one squareToLen call (large integer arrays are divided into smaller
parts for squaring, so 1..127 integers are squared at once), which
makes all additional off-loop penalties expensive in comparison to loop
execution time.

At this point I ran out of ideas for how we could improve the performance
3x for these intrinsics. I understand one can do better with 64 bits for
the intrinsics you implemented, but squareToLen and mulAdd look different.

Do you have other suggestions, or can we proceed with the initial version
(webrev.01)?

Thanks,
Dmitrij

On 01.09.2017 10:51, Andrew Haley wrote:
> On 31/08/17 23:46, Dmitrij Pochepko wrote:
>> I tried a number of initial versions first. I also tried to use wider
>> multiplication via umulh (and larger load instructions like ldp/ldr),
>> but after measuring all versions I found that the version I initially
>> sent appeared to be the fastest (I was measuring it on ThunderX, which I
>> have in hand). It might be because of lots of additional ror(..., 32)
>> operations in other versions to convert values from the initial layout to
>> register and back. Another reason might be more complex overall logic
>> and larger code, which triggers more icache lines to be loaded. Or even
>> some umulh specifics on some CPUs. So, after measuring, I abandoned
>> these versions in the middle of development and polished the fastest one.
>> I have some raw development unpolished versions of such approaches
>> left (not sure I have debugged versions saved, but they at least give an
>> overall idea).
>> I attached squares_v2.3.1.diff: an early version which uses mul/umulh
>> for just one case. It was surprisingly slower for this case than the
>> version I sent for review, so I abandoned this approach.
>> I've also tried a version with large load instructions (ldp/ldr):
>> squares_v1.diff, and it was also slower (it has another, slower, mul_add
>> loop implementation, but I was comparing to the same version, which is
>> using ldrw only).
>>
>> I'm not sure if I should use 64-bit multiplications and/or 64/128-bit
>> loads. I can try to return to one of such versions and try to
>> polish it, but I'll probably get slower results again on the h/w I have,
>> and it's not clear if it'll be faster on any other h/w (which one? It
>> takes a lot of time to iteratively improve and measure every version on
>> the respective h/w).
>
> I'm using Applied Micro hardware for my testing at the moment.
>
> I did the speed testing for Montgomery multiply on ThunderX.  I
> appreciate that it's difficult to get the 64-bit version right and
> fast, but you should see about a 3 - 3.5x speedup over the pure Java
> version if you get it right.  That's what I saw when I did the
> Montgomery multiply.  You do have to pipeline the loads and the
> multiplies to avoid stalls.
>
> Be aware that squareToLen is not used at all when running the
> RSA benchmark with C2.
>

From aph at redhat.com  Wed Sep  6 09:53:05 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Sep 2017 10:53:05 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
Message-ID:

On 05/09/17 18:34, Dmitrij Pochepko wrote:
> As you can see, it's up to 26% worse throughput with wider multiplication.
>
> The reasons for this are:
> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
> can't be changed within the function signature. Thus we can't fully
> utilize the potential of 64-bit multiplication.
> 2. The umulh instruction is more expensive than the mul instruction.

Ah, my apologies.  I wasn't thinking about mulAdd, but about
squareToLen().  But did you look at the way x86 uses 64-bit
multiplications?

> I haven't implemented wider multiplication for the squareToLen intrinsic,
> since it would require much more code due to more corner cases. Also, the
> squaring algorithm in BigInteger doesn't handle more than 127 integers
> in one squareToLen call (large integer arrays are divided into smaller
> parts for squaring, so 1..127 integers are squared at once), which
> makes all additional off-loop penalties expensive in comparison to loop
> execution time.

Should we intrinsify squareToLen() at all?  It's only used AFAICS by
C1 and the interpreter when doing integer crypto.  One other thing I
haven't checked: is the multiplyToLen() intrinsic called when
squareToLen() is absent?

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com  Wed Sep  6 11:50:24 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij)
Date: Wed, 6 Sep 2017 14:50:24 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To:
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
Message-ID: <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>

On 06.09.2017 12:53, Andrew Haley wrote:
> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>
>> The reasons for this are:
>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
>> can't be changed within the function signature. Thus we can't fully
>> utilize the potential of 64-bit multiplication.
>> 2. The umulh instruction is more expensive than the mul instruction.
> Ah, my apologies.  I wasn't thinking about mulAdd, but about
> squareToLen().  But did you look at the way x86 uses 64-bit
> multiplications?
>
Yes. It uses a single x86 mulq instruction, which performs a 64x64
multiplication and places the 128-bit result in 2 registers. There is no
such single instruction on aarch64, and the most effective aarch64
instruction sequence I've found doesn't seem to be as fast as mulq.
A simpler 32x32-bit multiplication works faster according to my measurements.

>> I haven't implemented wider multiplication for the squareToLen intrinsic,
>> since it would require much more code due to more corner cases. Also, the
>> squaring algorithm in BigInteger doesn't handle more than 127 integers
>> in one squareToLen call (large integer arrays are divided into smaller
>> parts for squaring, so 1..127 integers are squared at once), which
>> makes all additional off-loop penalties expensive in comparison to loop
>> execution time.
> Should we intrinsify squareToLen() at all?

Yes, we should intrinsify it, because we can see a performance boost. Not
as significant as for x86, but still noticeable.

> It's only used AFAICS by
> C1 and the interpreter when doing integer crypto.

This intrinsic is known to C2
(http://hg.openjdk.java.net/jdk10/hs/hotspot/file/tip/src/share/vm/opto/library_call.cpp#l5507).
squareToLen is called in BigInteger multiplication in case it's multiplied
by itself
(http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l1565)
and in the pow(...) method:
http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2305

> One other thing I
> haven't checked: is the multiplyToLen() intrinsic called when
> squareToLen() is absent?
>
It could have been a good alternative, but it's not used instead of
squareToLen when squareToLen is not implemented. A Java implementation
of squareToLen will eventually be compiled and used instead:
http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039

Thanks,
Dmitrij

From aph at redhat.com  Wed Sep  6 12:43:23 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Sep 2017 13:43:23 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
Message-ID: <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>

On 06/09/17 12:50, Dmitrij wrote:
> On 06.09.2017 12:53, Andrew Haley wrote:
>> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>>
>>> The reasons for this are:
>>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
>>> can't be changed within the function signature. Thus we can't fully
>>> utilize the potential of 64-bit multiplication.
>>> 2. The umulh instruction is more expensive than the mul instruction.
>> Ah, my apologies.  I wasn't thinking about mulAdd, but about
>> squareToLen().  But did you look at the way x86 uses 64-bit
>> multiplications?
>>
> Yes. It uses a single x86 mulq instruction, which performs a 64x64
> multiplication and places the 128-bit result in 2 registers. There is no
> such single instruction on aarch64, and the most effective aarch64
> instruction sequence I've found doesn't seem to be as fast as mulq.

I think there is effectively a 64x64 -> 128-bit instruction: it's just
that you have to represent it as a mul and a umulh.  But I take your
point.

>> One other thing I
>> haven't checked: is the multiplyToLen() intrinsic called when
>> squareToLen() is absent?
>>
> It could have been a good alternative, but it's not used instead of
> squareToLen when squareToLen is not implemented. A Java implementation
> of squareToLen will eventually be compiled and used instead:
> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039

Please compare your squareToLen with the
MacroAssembler::multiply_to_len we already have.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
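A short illustration of the mul/umulh pairing Andrew describes, sketched in
C++ rather than assembler (illustrative only; GCC and Clang lower the
__int128 product below to one mul plus one umulh when targeting AArch64,
and the helper name is made up):

    #include <cstdint>
    #include <cstdio>

    // 64x64 -> 128-bit multiply. On AArch64 the compiler emits:
    //   mul   xLo, xA, xB   // low 64 bits of the product
    //   umulh xHi, xA, xB   // high 64 bits of the product
    static void mul_64x64_to_128(uint64_t a, uint64_t b,
                                 uint64_t* lo, uint64_t* hi) {
      unsigned __int128 p = (unsigned __int128)a * b;
      *lo = (uint64_t)p;
      *hi = (uint64_t)(p >> 64);
    }

    int main() {
      uint64_t lo, hi;
      mul_64x64_to_128(~0ULL, ~0ULL, &lo, &hi);
      std::printf("hi=%016llx lo=%016llx\n",
                  (unsigned long long)hi, (unsigned long long)lo);
    }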
From dmitrij.pochepko at bell-sw.com  Wed Sep  6 17:39:13 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij)
Date: Wed, 6 Sep 2017 20:39:13 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
Message-ID: <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>

On 06.09.2017 15:43, Andrew Haley wrote:
> On 06/09/17 12:50, Dmitrij wrote:
>> On 06.09.2017 12:53, Andrew Haley wrote:
>>> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>>>
>>>> The reasons for this are:
>>>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
>>>> can't be changed within the function signature. Thus we can't fully
>>>> utilize the potential of 64-bit multiplication.
>>>> 2. The umulh instruction is more expensive than the mul instruction.
>>> Ah, my apologies.  I wasn't thinking about mulAdd, but about
>>> squareToLen().  But did you look at the way x86 uses 64-bit
>>> multiplications?
>>>
>> Yes. It uses a single x86 mulq instruction, which performs a 64x64
>> multiplication and places the 128-bit result in 2 registers. There is no
>> such single instruction on aarch64, and the most effective aarch64
>> instruction sequence I've found doesn't seem to be as fast as mulq.
> I think there is effectively a 64x64 -> 128-bit instruction: it's just
> that you have to represent it as a mul and a umulh.  But I take your
> point.
>
>>> One other thing I
>>> haven't checked: is the multiplyToLen() intrinsic called when
>>> squareToLen() is absent?
>>>
>> It could have been a good alternative, but it's not used instead of
>> squareToLen when squareToLen is not implemented. A Java implementation
>> of squareToLen will eventually be compiled and used instead:
>> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039
> Please compare your squareToLen with the
> MacroAssembler::multiply_to_len we already have.
>
I've compared them by calling the square and multiply methods and got the
following results (ThunderX):

Benchmark                                 (size, ints)  Mode  Cnt      Score      Error  Units
BigIntegerBench.implMutliplyToLenReflect             1  avgt    5    186.930 ±   14.933  ns/op  (26% slower)
BigIntegerBench.implMutliplyToLenReflect             2  avgt    5    194.095 ±   11.857  ns/op  (12% slower)
BigIntegerBench.implMutliplyToLenReflect             3  avgt    5    233.912 ±    4.229  ns/op  (24% slower)
BigIntegerBench.implMutliplyToLenReflect             5  avgt    5    308.349 ±   20.383  ns/op  (22% slower)
BigIntegerBench.implMutliplyToLenReflect            10  avgt    5    475.839 ±    6.232  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            50  avgt    5   6514.691 ±   76.934  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            90  avgt    5  20347.040 ±  224.290  ns/op  (3% slower)
BigIntegerBench.implMutliplyToLenReflect           127  avgt    5  41929.302 ±  181.053  ns/op  (9% slower)
BigIntegerBench.implSquareToLenReflect               1  avgt    5    147.751 ±   12.760  ns/op
BigIntegerBench.implSquareToLenReflect               2  avgt    5    173.804 ±    4.850  ns/op
BigIntegerBench.implSquareToLenReflect               3  avgt    5    187.822 ±   34.027  ns/op
BigIntegerBench.implSquareToLenReflect               5  avgt    5    251.995 ±   19.711  ns/op
BigIntegerBench.implSquareToLenReflect              10  avgt    5    474.489 ±    1.040  ns/op
BigIntegerBench.implSquareToLenReflect              50  avgt    5   6493.768 ±   33.809  ns/op
BigIntegerBench.implSquareToLenReflect              90  avgt    5  19766.524 ±   88.398  ns/op
BigIntegerBench.implSquareToLenReflect             127  avgt    5  38448.202 ±  180.095  ns/op

As we can see, squareToLen is faster than multiplyToLen.

(I've updated the benchmark code at
http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)

Thanks,
Dmitrij

From ci_notify at linaro.org  Thu Sep  7 20:25:13 2017
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Thu, 7 Sep 2017 20:25:13 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress, SPECjbb2015 and Hadoop/Terasort results for OpenJDK 10 on AArch64
Message-ID: <211725077.1786.1504815915080.JavaMail.jenkins@81294fa8a221>

This is a summary of the JTREG test results
===========================================

The build and test results are cycled every 15 days.

For detailed information on the test output please refer to:

http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/summary/2017/249/summary.html

-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 1,400; fail: 11,561

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 7,432; fail: 714; error: 20

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 3,784

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 1,403; fail: 11,562; error: 1

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 7,469; fail: 675; error: 22

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2017/sep/06 pass: 3,782; error: 2

Previous results can be found here:

http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/index.html


SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 1.05x
Relative performance: Server critical-jOPS (nc): 0.90x

Details of the test setup and historical results may be found here:

http://openjdk.linaro.org/jdk10/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the client and server compilers
on 2014-04-01.

Relative performance: Zero: 1.0, Client: 71.29, Server: 118.61

Client 71.29 / Client 2014-04-01 (43.00): 1.66x
Server 118.61 / Server 2014-04-01 (71.00): 1.67x

Details of the test setup and historical results may be found here:

http://openjdk.linaro.org/jdk10/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================

The build and test results are cycled every 15 days.

2017-09-07 pass rate: 11556/11559, results:
http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/2017/249/results/

For detailed information on the test output please refer to:

http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/

From stuart.monteith at linaro.org  Fri Sep  8 08:13:51 2017
From: stuart.monteith at linaro.org (Stuart Monteith)
Date: Fri, 8 Sep 2017 09:13:51 +0100
Subject: [aarch64-port-dev ] JTREG, JCStress, SPECjbb2015 and Hadoop/Terasort results for OpenJDK 10 on AArch64
In-Reply-To: <211725077.1786.1504815915080.JavaMail.jenkins@81294fa8a221>
References: <211725077.1786.1504815915080.JavaMail.jenkins@81294fa8a221>
Message-ID:

Hello,

This was a first attempt at a jdk10 build in the automation. The large
number of failures is because of jcstress being integrated into JTReg and
failing, so don't pay too much attention just yet.

BR,
Stuart

On 7 September 2017 at 21:25, wrote:

> This is a summary of the JTREG test results
> ===========================================
>
> The build and test results are cycled every 15 days.
>
> For detailed information on the test output please refer to:
>
> http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/summary/2017/249/summary.html
>
> -------------------------------------------------------------------------------
> client-release/hotspot
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 1,400; fail: 11,561
>
> 2 fatal errors were detected; please follow the link above for more detail.
>
> -------------------------------------------------------------------------------
> client-release/jdk
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 7,432; fail: 714; error: 20
>
> -------------------------------------------------------------------------------
> client-release/langtools
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 3,784
>
> -------------------------------------------------------------------------------
> server-release/hotspot
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 1,403; fail: 11,562; error: 1
>
> 2 fatal errors were detected; please follow the link above for more detail.
>
> -------------------------------------------------------------------------------
> server-release/jdk
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 7,469; fail: 675; error: 22
>
> -------------------------------------------------------------------------------
> server-release/langtools
> -------------------------------------------------------------------------------
> Build 0: aarch64/2017/sep/06 pass: 3,782; error: 2
>
> Previous results can be found here:
>
> http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/index.html
>
>
> SPECjbb2015 composite regression test completed
> ===============================================
>
> This test measures the relative performance of the server
> compiler running the SPECjbb2015 composite tests and compares
> the performance against the baseline performance of the server
> compiler taken on 2016-11-21.
>
> In accordance with [1], the SPECjbb2015 tests are run on a system
> which is not production ready and does not meet all the
> requirements for publishing compliant results. The numbers below
> shall be treated as non-compliant (nc) and are for experimental
> purposes only.
>
> Relative performance: Server max-jOPS (nc): 1.05x
> Relative performance: Server critical-jOPS (nc): 0.90x
>
> Details of the test setup and historical results may be found here:
>
> http://openjdk.linaro.org/jdk10/SPECjbb2015-results/
>
> [1] http://www.spec.org/fairuse.html#Academic
>
> Regression test Hadoop-Terasort completed
> =========================================
>
> This test measures the performance of the server and client compilers
> running Hadoop sorting a 1GB file using Terasort and compares
> the performance against the baseline performance of the Zero interpreter
> and against the baseline performance of the client and server compilers
> on 2014-04-01.
>
> Relative performance: Zero: 1.0, Client: 71.29, Server: 118.61
>
> Client 71.29 / Client 2014-04-01 (43.00): 1.66x
> Server 118.61 / Server 2014-04-01 (71.00): 1.67x
>
> Details of the test setup and historical results may be found here:
>
> http://openjdk.linaro.org/jdk10/hadoop-terasort-benchmark-results/
>
> This is a summary of the jcstress test results
> ==============================================
>
> The build and test results are cycled every 15 days.
>
> 2017-09-07 pass rate: 11556/11559, results:
> http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/2017/249/results/
>
> For detailed information on the test output please refer to:
>
> http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/

From zhongwei.yao at linaro.org  Mon Sep 18 09:04:58 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Mon, 18 Sep 2017 17:04:58 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
Message-ID:

Hi, all,

Bug:
https://bugs.openjdk.java.net/browse/JDK-8187601

Webrev:
http://cr.openjdk.java.net/~zyao/8187601/webrev.00

In the current implementation, the loop unroll count is determined
by the vector size and element size when SuperWordLoopUnrollAnalysis is
true (it is true on both X86 and aarch64 for now).

This unrolling policy generates less optimized code when SLP
auto-vectorization fails (as the following example shows).

In this patch, I modify the current unrolling policy to do more
unrolling when SLP auto-vectorization fails. So the loop will be
unrolled until it reaches the unroll count limit.

Here is one example:

  public static void accessArrayConstants(int[] array) {
    for (int j = 0; j < 1024; j++) {
      array[0]++;
      array[1]++;
    }
  }

Before this patch, the loop is unrolled 4 times. 4 is determined by:
AArch64's vector size of 128 bits / array element size of 32 bits = 4.
On X86, the vector size is 256 bits, so the unroll count is 8.

Below is the code generated by C2 on AArch64:

... # omit unrelated code ...
0x0000ffff6caf3180: ldr   w10, [x1,#16]   ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 12 (line 6)
0x0000ffff6caf3184: add   w13, w10, #0x1
0x0000ffff6caf3188: str   w13, [x1,#16]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 15 (line 6)
0x0000ffff6caf318c: ldr   w12, [x1,#20]   ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 19 (line 7)
0x0000ffff6caf3190: add   w13, w10, #0x4
0x0000ffff6caf3194: add   w10, w12, #0x4
0x0000ffff6caf3198: str   w13, [x1,#16]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 15 (line 6)
0x0000ffff6caf319c: add   w11, w11, #0x4  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 23 (line 5)
0x0000ffff6caf31a0: str   w10, [x1,#20]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 22 (line 7)
0x0000ffff6caf31a4: cmp   w11, #0x3fd
0x0000ffff6caf31a8: b.lt  0x0000ffff6caf3180  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 6 (line 5)
... # omit unrelated code ...

After this patch is applied, the loop is unrolled 16 times:

... # omit unrelated code ...
0x0000ffffb0aa6100: ldr   w10, [x1,#16]   ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 12 (line 6)
0x0000ffffb0aa6104: add   w13, w10, #0x1
0x0000ffffb0aa6108: str   w13, [x1,#16]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 15 (line 6)
0x0000ffffb0aa610c: ldr   w12, [x1,#20]   ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 19 (line 7)
0x0000ffffb0aa6110: add   w13, w10, #0x10
0x0000ffffb0aa6114: add   w10, w12, #0x10
0x0000ffffb0aa6118: str   w13, [x1,#16]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 15 (line 6)
0x0000ffffb0aa611c: add   w11, w11, #0x10 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 23 (line 5)
0x0000ffffb0aa6120: str   w10, [x1,#20]   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 22 (line 7)
0x0000ffffb0aa6124: cmp   w11, #0x3f1
0x0000ffffb0aa6128: b.lt  0x0000ffffb0aa6100  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                          ; - ArrayAccess::accessArrayConstants at 6 (line 5)
... # omit unrelated code ...

This patch passes jtreg tests both on AArch64 and X86.
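A quick cross-check of the unroll-factor arithmetic in the RFR above: the
SLP-driven limit is simply the vector width divided by the element width.
A minimal standalone sketch (hypothetical helper, not the HotSpot source):

    #include <cstdio>

    // Hypothetical helper mirroring the arithmetic described in the RFR:
    // max SLP unroll factor = vector width / element width.
    static int slp_max_unroll(int vector_bits, int element_bits) {
      return vector_bits / element_bits;
    }

    int main() {
      std::printf("AArch64 NEON, 32-bit int: %d\n", slp_max_unroll(128, 32)); // 4
      std::printf("x86 AVX2,     32-bit int: %d\n", slp_max_unroll(256, 32)); // 8
      return 0;
    }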
From zhongwei.yao at linaro.org Mon Sep 18 09:58:11 2017 From: zhongwei.yao at linaro.org (Zhongwei Yao) Date: Mon, 18 Sep 2017 17:58:11 +0800 Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed Message-ID: [Forward from aarch64-port-dev to hotspot-compiler-dev] Hi, all, Bug: https://bugs.openjdk.java.net/browse/JDK-8187601 Webrev: http://cr.openjdk.java.net/~zyao/8187601/webrev.00 In the current implementation, the loop unrolling times are determined by vector size and element size when SuperWordLoopUnrollAnalysis is true (both X86 and aarch64 are true for now). This unrolling policy generates less optimized code when SLP auto-vectorization fails (as following example shows). In this patch, I modify the current unrolling policy to do more unrolling when SLP auto-vectorization fails. So the loop will be unrolled until reaching the unroll times limitation. Here is one example: public static void accessArrayConstants(int[] array) { for (int j = 0; j < 1024; j++) { array[0]++; array[1]++; } } Before this patch, the loop will be unrolled by 4 times. 4 is determined by: AArch64's vector size 128 bits / array element size 32 bits = 4. On X86, vector size is 256 bits. So the unroll times are 8. Below is the generated code by C2 on AArch64: ==== generated code start ==== 0x0000ffff6caf3180: ldr w10, [x1,#16] ; 0x0000ffff6caf3184: add w13, w10, #0x1 0x0000ffff6caf3188: str w13, [x1,#16] ; 0x0000ffff6caf318c: ldr w12, [x1,#20] ; 0x0000ffff6caf3190: add w13, w10, #0x4 0x0000ffff6caf3194: add w10, w12, #0x4 0x0000ffff6caf3198: str w13, [x1,#16] ; 0x0000ffff6caf319c: add w11, w11, #0x4 ; 0x0000ffff6caf31a0: str w10, [x1,#20] ; 0x0000ffff6caf31a4: cmp w11, #0x3fd 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ; ==== generated code end ==== After applied this patch, it is unrolled 16 times: ==== generated code start ==== 0x0000ffffb0aa6100: ldr w10, [x1,#16] ; 0x0000ffffb0aa6104: add w13, w10, #0x1 0x0000ffffb0aa6108: str w13, [x1,#16] ; 0x0000ffffb0aa610c: ldr w12, [x1,#20] ; 0x0000ffffb0aa6110: add w13, w10, #0x10 0x0000ffffb0aa6114: add w10, w12, #0x10 0x0000ffffb0aa6118: str w13, [x1,#16] ; 0x0000ffffb0aa611c: add w11, w11, #0x10 ; 0x0000ffffb0aa6120: str w10, [x1,#20] ; 0x0000ffffb0aa6124: cmp w11, #0x3f1 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ; ==== generated code end ==== This patch passes jtreg tests both on AArch64 and X86. -- Best regards, Zhongwei From vladimir.kozlov at oracle.com Mon Sep 18 16:17:26 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 Sep 2017 09:17:26 -0700 Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed In-Reply-To: References: Message-ID: Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()? Why you need next additional check?: - } else if (cl->is_main_loop()) { + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { sw.transform_loop(lpt, true); Thanks, Vladimir On 9/18/17 2:58 AM, Zhongwei Yao wrote: > [Forward from aarch64-port-dev to hotspot-compiler-dev] > > Hi, all, > > Bug: > https://bugs.openjdk.java.net/browse/JDK-8187601 > > Webrev: > http://cr.openjdk.java.net/~zyao/8187601/webrev.00 > > In the current implementation, the loop unrolling times are determined > by vector size and element size when SuperWordLoopUnrollAnalysis is > true (both X86 and aarch64 are true for now). > > This unrolling policy generates less optimized code when SLP > auto-vectorization fails (as following example shows). 
>
> In this patch, I modify the current unrolling policy to do more
> unrolling when SLP auto-vectorization fails. So the loop will be
> unrolled until it reaches the unroll count limit.
>
> Here is one example:
> public static void accessArrayConstants(int[] array) {
>   for (int j = 0; j < 1024; j++) {
>     array[0]++;
>     array[1]++;
>   }
> }
>
> Before this patch, the loop will be unrolled 4 times. 4 is
> determined by: AArch64's vector size 128 bits / array element size 32
> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>
> Below is the generated code by C2 on AArch64:
>
> ==== generated code start ====
> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
> 0x0000ffff6caf3184: add  w13, w10, #0x1
> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
> 0x0000ffff6caf3190: add  w13, w10, #0x4
> 0x0000ffff6caf3194: add  w10, w12, #0x4
> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
> ==== generated code end ====
>
> After applying this patch, it is unrolled 16 times:
>
> ==== generated code start ====
> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
> 0x0000ffffb0aa6104: add  w13, w10, #0x1
> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
> 0x0000ffffb0aa6110: add  w13, w10, #0x10
> 0x0000ffffb0aa6114: add  w10, w12, #0x10
> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
> ==== generated code end ====
>
> This patch passes jtreg tests both on AArch64 and X86.

From zhongwei.yao at linaro.org Tue Sep 19 05:59:18 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Tue, 19 Sep 2017 13:59:18 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References:
Message-ID:

Hi, Vladimir,

On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
> Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()?

For two reasons, I have not chosen the existing passed_slp flag:

1. If we set_notpassed_slp() when _packset.length() == 0 in
SuperWord::output(), then in the IdealLoopTree::policy_unroll() check:

  if (cl->has_passed_slp()) {
    if (slp_max_unroll_factor >= future_unroll_ct) return true;
    // Normal case: loop too big
    return false;
  }

we will ignore the case "cl->has_passed_slp() &&
slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
as also exposed in my patch:

  if (cl->has_passed_slp()) {
    if (slp_max_unroll_factor >= future_unroll_ct) return true;
-   // Normal case: loop too big
-   return false;
+   // When SLP vectorization failed, we could do more unrolling
+   // optimizations if body size is less than limit size. Otherwise,
+   // return false due to loop is too big.
+   if (!cl->is_slp_vec_failed()) return false;
  }

However, I have not found a case to support this condition yet.

2. As replied below, in:
> -  } else if (cl->is_main_loop()) {
> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>       sw.transform_loop(lpt, true);
I need to check whether cl->is_slp_vec_failed() is true. Such
checking becomes explicit when using the SLPAutoVecFailed flag.
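To make the failure mode concrete, here is an illustrative Java contrast
(a sketch of mine, not code from the webrev) between a loop shape SLP can
vectorize and the shape from the example above that it cannot:

public class SlpShapes {
    // Independent, index-driven accesses: SLP can pack neighbouring
    // iterations into vector lanes, so the SLP-derived unroll factor
    // (vector width / element size) is all the loop needs.
    static void vectorizable(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b[i] + 1;
        }
    }

    // Fixed indices: every iteration rereads the values the previous
    // iteration wrote, which is a loop-carried dependence, so SLP gives
    // up and only plain unrolling can help.
    static void notVectorizable(int[] a) {
        for (int j = 0; j < 1024; j++) {
            a[0]++;
            a[1]++;
        }
    }
}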
> Why do you need the next additional check?:
>
> -  } else if (cl->is_main_loop()) {
> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>       sw.transform_loop(lpt, true);

The additional check prevents the case that when
cl->is_slp_vec_failed() is true, SuperWord::output() will
set_major_progress() at the beginning (because _packset.length() == 0
is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
loop iteration" will not stop until loop_opts_cnt reaches 0, which is
not what we want.

> Thanks,
> Vladimir
>
> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>
>> Hi, all,
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>
>> Webrev:
>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>
>> In the current implementation, the loop unrolling times are determined
>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>> true (both X86 and aarch64 are true for now).
>>
>> This unrolling policy generates less optimized code when SLP
>> auto-vectorization fails (as the following example shows).
>>
>> In this patch, I modify the current unrolling policy to do more
>> unrolling when SLP auto-vectorization fails. So the loop will be
>> unrolled until it reaches the unroll count limit.
>>
>> Here is one example:
>> public static void accessArrayConstants(int[] array) {
>>   for (int j = 0; j < 1024; j++) {
>>     array[0]++;
>>     array[1]++;
>>   }
>> }
>>
>> Before this patch, the loop will be unrolled 4 times. 4 is
>> determined by: AArch64's vector size 128 bits / array element size 32
>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>
>> Below is the generated code by C2 on AArch64:
>>
>> ==== generated code start ====
>> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
>> 0x0000ffff6caf3184: add  w13, w10, #0x1
>> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
>> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
>> 0x0000ffff6caf3190: add  w13, w10, #0x4
>> 0x0000ffff6caf3194: add  w10, w12, #0x4
>> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
>> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
>> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
>> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>> ==== generated code end ====
>>
>> After applying this patch, it is unrolled 16 times:
>>
>> ==== generated code start ====
>> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
>> 0x0000ffffb0aa6104: add  w13, w10, #0x1
>> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
>> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
>> 0x0000ffffb0aa6110: add  w13, w10, #0x10
>> 0x0000ffffb0aa6114: add  w10, w12, #0x10
>> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
>> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
>> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
>> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>> ==== generated code end ====
>>
>> This patch passes jtreg tests both on AArch64 and X86.

--
Best regards,
Zhongwei

From rwestrel at redhat.com Tue Sep 19 13:50:06 2017
From: rwestrel at redhat.com (Roland Westrelin)
Date: Tue, 19 Sep 2017 15:50:06 +0200
Subject: [aarch64-port-dev ] bug fix for aarch64-port/jdk8u-shenandoah when Shenandoah is disabled
Message-ID:

http://cr.openjdk.java.net/~roland/shenandoah/phi_has_only_data_users/webrev.00/

This is a fix for an issue that affects aarch64-port/jdk8u-shenandoah
even when shenandoah is disabled.
That bug causes a debug VM to crash with:

Internal Error at phaseX.cpp:985, pid=26493, tid=0x00007fd8c4199700
assert(false) failed: infinite loop in PhaseIterGVN::optimize

and a product build to consume a lot of memory.

Roland.

From adinn at redhat.com Tue Sep 19 15:01:31 2017
From: adinn at redhat.com (Andrew Dinn)
Date: Tue, 19 Sep 2017 16:01:31 +0100
Subject: [aarch64-port-dev ] bug fix for aarch64-port/jdk8u-shenandoah when Shenandoah is disabled
In-Reply-To:
References:
Message-ID: <1fdb96ff-ebc5-ea80-9fd4-ebc423aa2369@redhat.com>

On 19/09/17 14:50, Roland Westrelin wrote:
>
> http://cr.openjdk.java.net/~roland/shenandoah/phi_has_only_data_users/webrev.00/
>
> This is a fix for an issue that affects aarch64-port/jdk8u-shenandoah
> even when shenandoah is disabled. That bug causes a debug VM to crash
> with:
>
> Internal Error at phaseX.cpp:985, pid=26493, tid=0x00007fd8c4199700
> assert(false) failed: infinite loop in PhaseIterGVN::optimize
>
> and a product build to consume a lot of memory.

Looks good.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

From vladimir.kozlov at oracle.com Tue Sep 19 17:54:49 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Tue, 19 Sep 2017 10:54:49 -0700
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References:
Message-ID: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>

On 9/18/17 10:59 PM, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>> Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()?
>
> For two reasons, I have not chosen the existing passed_slp flag:

My point is that if we don't find vectors in a loop (as in your case) we
should ignore the whole SLP analysis.

In the best-case scenario SuperWord::unrolling_analysis() should determine
if there are vector candidates. For example, check if the array's index
depends on the loop's index variable.

Another way is to call SuperWord::unrolling_analysis() only after we did
vector analysis.

Those are more complicated changes and out of scope of this one. There is
also a side effect I missed before which may prevent using
set_notpassed_slp(): LoopMaxUnroll is changed based on SLP analysis before
the has_passed_slp() check.

Note, set_notpassed_slp() is also used to additionally unroll already
vectorized loops:
http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421

Maybe you should also call mark_do_unroll_only() when you set
set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
problem you pointed out. Can you look at this?

I am not against adding a new is_slp_vec_failed() but I want first to
investigate if we can re-use existing functions.

Thanks,
Vladimir

> 1. If we set_notpassed_slp() when _packset.length() == 0 in
> SuperWord::output(), then in the IdealLoopTree::policy_unroll() check:
>
>   if (cl->has_passed_slp()) {
>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>     // Normal case: loop too big
>     return false;
>   }
>
> we will ignore the case "cl->has_passed_slp() &&
> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
> as also exposed in my patch:
>
>   if (cl->has_passed_slp()) {
>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
> -   // Normal case: loop too big
> -   return false;
> +   // When SLP vectorization failed, we could do more unrolling
> +   // optimizations if body size is less than limit size. Otherwise,
> +   // return false due to loop is too big.
> +   if (!cl->is_slp_vec_failed()) return false;
>   }
>
> However, I have not found a case to support this condition yet.
>
> 2. As replied below, in:
>> -  } else if (cl->is_main_loop()) {
>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>       sw.transform_loop(lpt, true);
> I need to check whether cl->is_slp_vec_failed() is true. Such
> checking becomes explicit when using the SLPAutoVecFailed flag.
>
>> Why do you need the next additional check?:
>>
>> -  } else if (cl->is_main_loop()) {
>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>       sw.transform_loop(lpt, true);
>
> The additional check prevents the case that when
> cl->is_slp_vec_failed() is true, SuperWord::output() will
> set_major_progress() at the beginning (because _packset.length() == 0
> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
> not what we want.
>
>> Thanks,
>> Vladimir
>>
>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>
>>> Hi, all,
>>>
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>
>>> In the current implementation, the loop unrolling times are determined
>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>> true (both X86 and aarch64 are true for now).
>>>
>>> This unrolling policy generates less optimized code when SLP
>>> auto-vectorization fails (as the following example shows).
>>>
>>> In this patch, I modify the current unrolling policy to do more
>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>> unrolled until it reaches the unroll count limit.
>>>
>>> Here is one example:
>>> public static void accessArrayConstants(int[] array) {
>>>   for (int j = 0; j < 1024; j++) {
>>>     array[0]++;
>>>     array[1]++;
>>>   }
>>> }
>>>
>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>> determined by: AArch64's vector size 128 bits / array element size 32
>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>> Below is the generated code by C2 on AArch64:
>>>
>>> ==== generated code start ====
>>> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
>>> 0x0000ffff6caf3184: add  w13, w10, #0x1
>>> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
>>> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
>>> 0x0000ffff6caf3190: add  w13, w10, #0x4
>>> 0x0000ffff6caf3194: add  w10, w12, #0x4
>>> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
>>> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
>>> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
>>> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>> ==== generated code end ====
>>>
>>> After applying this patch, it is unrolled 16 times:
>>>
>>> ==== generated code start ====
>>> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
>>> 0x0000ffffb0aa6104: add  w13, w10, #0x1
>>> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
>>> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
>>> 0x0000ffffb0aa6110: add  w13, w10, #0x10
>>> 0x0000ffffb0aa6114: add  w10, w12, #0x10
>>> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
>>> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
>>> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
>>> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>> ==== generated code end ====
>>>
>>> This patch passes jtreg tests both on AArch64 and X86.

From zhongwei.yao at linaro.org Wed Sep 20 11:07:20 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Wed, 20 Sep 2017 19:07:20 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References:
Message-ID:

Thanks for your suggestions!

I've updated the patch so that it uses the pass_slp and do_unroll_only
flags without adding a new flag. Please take a look:

http://cr.openjdk.java.net/~zyao/8187601/webrev.01/

On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>
> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>> Hi, Vladimir,
>>
>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>> Why not use existing set_notpassed_slp() instead of
>>> mark_slp_vec_failed()?
>>
>> For two reasons, I have not chosen the existing passed_slp flag:
>
> My point is that if we don't find vectors in a loop (as in your case) we
> should ignore the whole SLP analysis.
>
> In the best-case scenario SuperWord::unrolling_analysis() should determine if
> there are vector candidates. For example, check if the array's index depends
> on the loop's index variable.
>
> Another way is to call SuperWord::unrolling_analysis() only after we did
> vector analysis.
>
> Those are more complicated changes and out of scope of this one. There is also
> a side effect I missed before which may prevent using set_notpassed_slp():
> LoopMaxUnroll is changed based on SLP analysis before the has_passed_slp()
> check.
>
> Note, set_notpassed_slp() is also used to additionally unroll already
> vectorized loops:
> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>
> Maybe you should also call mark_do_unroll_only() when you set
> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
> problem you pointed out. Can you look at this?
>
> I am not against adding a new is_slp_vec_failed() but I want first to
> investigate if we can re-use existing functions.
>
> Thanks,
> Vladimir
>
>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>> SuperWord::output(), then in the IdealLoopTree::policy_unroll() check:
>>
>>   if (cl->has_passed_slp()) {
>>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>     // Normal case: loop too big
>>     return false;
>>   }
>>
>> we will ignore the case "cl->has_passed_slp() &&
>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>> as also exposed in my patch:
>>
>>   if (cl->has_passed_slp()) {
>>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>> -   // Normal case: loop too big
>> -   return false;
>> +   // When SLP vectorization failed, we could do more unrolling
>> +   // optimizations if body size is less than limit size. Otherwise,
>> +   // return false due to loop is too big.
>> +   if (!cl->is_slp_vec_failed()) return false;
>>   }
>>
>> However, I have not found a case to support this condition yet.
>>
>> 2. As replied below, in:
>>> -  } else if (cl->is_main_loop()) {
>>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>       sw.transform_loop(lpt, true);
>> I need to check whether cl->is_slp_vec_failed() is true. Such
>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>
>>> Why do you need the next additional check?:
>>>
>>> -  } else if (cl->is_main_loop()) {
>>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>       sw.transform_loop(lpt, true);
>>
>> The additional check prevents the case that when
>> cl->is_slp_vec_failed() is true, SuperWord::output() will
>> set_major_progress() at the beginning (because _packset.length() == 0
>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>> not what we want.
>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>
>>>> Hi, all,
>>>>
>>>> Bug:
>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>
>>>> In the current implementation, the loop unrolling times are determined
>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>>> true (both X86 and aarch64 are true for now).
>>>>
>>>> This unrolling policy generates less optimized code when SLP
>>>> auto-vectorization fails (as the following example shows).
>>>>
>>>> In this patch, I modify the current unrolling policy to do more
>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>> unrolled until it reaches the unroll count limit.
>>>>
>>>> Here is one example:
>>>> public static void accessArrayConstants(int[] array) {
>>>>   for (int j = 0; j < 1024; j++) {
>>>>     array[0]++;
>>>>     array[1]++;
>>>>   }
>>>> }
>>>>
>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>>> Below is the generated code by C2 on AArch64:
>>>>
>>>> ==== generated code start ====
>>>> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
>>>> 0x0000ffff6caf3184: add  w13, w10, #0x1
>>>> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
>>>> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
>>>> 0x0000ffff6caf3190: add  w13, w10, #0x4
>>>> 0x0000ffff6caf3194: add  w10, w12, #0x4
>>>> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
>>>> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
>>>> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
>>>> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>> ==== generated code end ====
>>>>
>>>> After applying this patch, it is unrolled 16 times:
>>>>
>>>> ==== generated code start ====
>>>> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
>>>> 0x0000ffffb0aa6104: add  w13, w10, #0x1
>>>> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
>>>> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
>>>> 0x0000ffffb0aa6110: add  w13, w10, #0x10
>>>> 0x0000ffffb0aa6114: add  w10, w12, #0x10
>>>> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
>>>> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
>>>> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
>>>> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>> ==== generated code end ====
>>>>
>>>> This patch passes jtreg tests both on AArch64 and X86.

--
Best regards,
Zhongwei

From rwestrel at redhat.com Wed Sep 20 12:22:49 2017
From: rwestrel at redhat.com (rwestrel at redhat.com)
Date: Wed, 20 Sep 2017 12:22:49 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/hotspot: [backport] PhiNode::has_only_data_users() needs to apply to shenandoah barrier only
Message-ID: <201709201222.v8KCMnLM011364@aojmv0008.oracle.com>

Changeset: 48b74a7788cd
Author:    roland
Date:      2017-09-19 13:41 +0200
URL:       http://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/48b74a7788cd

[backport] PhiNode::has_only_data_users() needs to apply to shenandoah barrier only

! src/share/vm/opto/cfgnode.cpp

From rwestrel at redhat.com Wed Sep 20 12:20:33 2017
From: rwestrel at redhat.com (Roland Westrelin)
Date: Wed, 20 Sep 2017 14:20:33 +0200
Subject: [aarch64-port-dev ] bug fix for aarch64-port/jdk8u-shenandoah when Shenandoah is disabled
In-Reply-To: <1fdb96ff-ebc5-ea80-9fd4-ebc423aa2369@redhat.com>
References: <1fdb96ff-ebc5-ea80-9fd4-ebc423aa2369@redhat.com>
Message-ID:

Thanks. I pushed it.

Roland.

From dmitrij.pochepko at bell-sw.com Wed Sep 20 14:13:11 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Wed, 20 Sep 2017 17:13:11 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
Message-ID: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>

Hi,

Andrew, do you believe this is ok to push?

Thanks,

Dmitrij

....

On 06.09.2017 20:39, Dmitrij wrote:
>
> I've compared it by calling the square and multiply methods and got the
> following results (ThunderX):
>
> Benchmark                                  (size, ints)  Mode  Cnt      Score      Error  Units
> BigIntegerBench.implMutliplyToLenReflect              1  avgt    5    186.930 ±   14.933  ns/op  (26% slower)
> BigIntegerBench.implMutliplyToLenReflect              2  avgt    5    194.095 ±   11.857  ns/op  (12% slower)
> BigIntegerBench.implMutliplyToLenReflect              3  avgt    5    233.912 ±    4.229  ns/op  (24% slower)
> BigIntegerBench.implMutliplyToLenReflect              5  avgt    5    308.349 ±   20.383  ns/op  (22% slower)
> BigIntegerBench.implMutliplyToLenReflect             10  avgt    5    475.839 ±    6.232  ns/op  (same)
> BigIntegerBench.implMutliplyToLenReflect             50  avgt    5   6514.691 ±   76.934  ns/op  (same)
> BigIntegerBench.implMutliplyToLenReflect             90  avgt    5  20347.040 ±  224.290  ns/op  (3% slower)
> BigIntegerBench.implMutliplyToLenReflect            127  avgt    5  41929.302 ±  181.053  ns/op  (9% slower)
>
> BigIntegerBench.implSquareToLenReflect                1  avgt    5    147.751 ±   12.760  ns/op
> BigIntegerBench.implSquareToLenReflect                2  avgt    5    173.804 ±    4.850  ns/op
> BigIntegerBench.implSquareToLenReflect                3  avgt    5    187.822 ±   34.027  ns/op
> BigIntegerBench.implSquareToLenReflect                5  avgt    5    251.995 ±   19.711  ns/op
> BigIntegerBench.implSquareToLenReflect               10  avgt    5    474.489 ±    1.040  ns/op
> BigIntegerBench.implSquareToLenReflect               50  avgt    5   6493.768 ±   33.809  ns/op
> BigIntegerBench.implSquareToLenReflect               90  avgt    5  19766.524 ±   88.398  ns/op
> BigIntegerBench.implSquareToLenReflect              127  avgt    5  38448.202 ±  180.095  ns/op
>
> As we can see, squareToLen is faster than multiplyToLen.
>
> (I've updated the benchmark code at
> http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)
>
> Thanks,
> Dmitrij

From aph at redhat.com Wed Sep 20 14:40:31 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 20 Sep 2017 15:40:31 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID:

On 20/09/17 15:13, Dmitrij Pochepko wrote:
> Andrew, do you believe this is ok to push?

I'm testing it on some other hardware.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From vladimir.kozlov at oracle.com Wed Sep 20 15:34:10 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Wed, 20 Sep 2017 08:34:10 -0700
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID:

Dmitrij,

You need an Oracle sponsor for the push since you touched shared code
(register.hpp).

Thanks,
Vladimir

On 9/20/17 7:13 AM, Dmitrij Pochepko wrote:
> Hi,
>
> Andrew, do you believe this is ok to push?
>
> Thanks,
>
> Dmitrij
>
> ....
> On 06.09.2017 20:39, Dmitrij wrote:
>> I've compared it by calling the square and multiply methods and got the
>> following results (ThunderX):
>>
>> Benchmark                                  (size, ints)  Mode  Cnt      Score      Error  Units
>> BigIntegerBench.implMutliplyToLenReflect              1  avgt    5    186.930 ±   14.933  ns/op  (26% slower)
>> BigIntegerBench.implMutliplyToLenReflect              2  avgt    5    194.095 ±   11.857  ns/op  (12% slower)
>> BigIntegerBench.implMutliplyToLenReflect              3  avgt    5    233.912 ±    4.229  ns/op  (24% slower)
>> BigIntegerBench.implMutliplyToLenReflect              5  avgt    5    308.349 ±   20.383  ns/op  (22% slower)
>> BigIntegerBench.implMutliplyToLenReflect             10  avgt    5    475.839 ±    6.232  ns/op  (same)
>> BigIntegerBench.implMutliplyToLenReflect             50  avgt    5   6514.691 ±   76.934  ns/op  (same)
>> BigIntegerBench.implMutliplyToLenReflect             90  avgt    5  20347.040 ±  224.290  ns/op  (3% slower)
>> BigIntegerBench.implMutliplyToLenReflect            127  avgt    5  41929.302 ±  181.053  ns/op  (9% slower)
>>
>> BigIntegerBench.implSquareToLenReflect                1  avgt    5    147.751 ±   12.760  ns/op
>> BigIntegerBench.implSquareToLenReflect                2  avgt    5    173.804 ±    4.850  ns/op
>> BigIntegerBench.implSquareToLenReflect                3  avgt    5    187.822 ±   34.027  ns/op
>> BigIntegerBench.implSquareToLenReflect                5  avgt    5    251.995 ±   19.711  ns/op
>> BigIntegerBench.implSquareToLenReflect               10  avgt    5    474.489 ±    1.040  ns/op
>> BigIntegerBench.implSquareToLenReflect               50  avgt    5   6493.768 ±   33.809  ns/op
>> BigIntegerBench.implSquareToLenReflect               90  avgt    5  19766.524 ±   88.398  ns/op
>> BigIntegerBench.implSquareToLenReflect              127  avgt    5  38448.202 ±  180.095  ns/op
>>
>> As we can see, squareToLen is faster than multiplyToLen.
>>
>> (I've updated the benchmark code at
>> http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)
>>
>> Thanks,
>> Dmitrij

From vladimir.kozlov at oracle.com Wed Sep 20 16:18:00 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Wed, 20 Sep 2017 09:18:00 -0700
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References:
Message-ID:

Nice.

Did you verify that it fixed your case?

Would be nice to run specjvm2008 to make sure performance did not regress.

Thanks,
Vladimir

On 9/20/17 4:07 AM, Zhongwei Yao wrote:
> Thanks for your suggestions!
>
> I've updated the patch so that it uses the pass_slp and do_unroll_only
> flags without adding a new flag. Please take a look:
>
> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>
> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>
>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>> Hi, Vladimir,
>>>
>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>> Why not use existing set_notpassed_slp() instead of
>>>> mark_slp_vec_failed()?
>>>
>>> For two reasons, I have not chosen the existing passed_slp flag:
>>
>> My point is that if we don't find vectors in a loop (as in your case) we
>> should ignore the whole SLP analysis.
>>
>> In the best-case scenario SuperWord::unrolling_analysis() should determine if
>> there are vector candidates. For example, check if the array's index depends
>> on the loop's index variable.
>>
>> Another way is to call SuperWord::unrolling_analysis() only after we did
>> vector analysis.
>> Those are more complicated changes and out of scope of this one. There is also
>> a side effect I missed before which may prevent using set_notpassed_slp():
>> LoopMaxUnroll is changed based on SLP analysis before the has_passed_slp()
>> check.
>>
>> Note, set_notpassed_slp() is also used to additionally unroll already
>> vectorized loops:
>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>
>> Maybe you should also call mark_do_unroll_only() when you set
>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>> problem you pointed out. Can you look at this?
>>
>> I am not against adding a new is_slp_vec_failed() but I want first to
>> investigate if we can re-use existing functions.
>>
>> Thanks,
>> Vladimir
>>
>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll() check:
>>>
>>>   if (cl->has_passed_slp()) {
>>>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>     // Normal case: loop too big
>>>     return false;
>>>   }
>>>
>>> we will ignore the case "cl->has_passed_slp() &&
>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>> as also exposed in my patch:
>>>
>>>   if (cl->has_passed_slp()) {
>>>     if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>> -   // Normal case: loop too big
>>> -   return false;
>>> +   // When SLP vectorization failed, we could do more unrolling
>>> +   // optimizations if body size is less than limit size. Otherwise,
>>> +   // return false due to loop is too big.
>>> +   if (!cl->is_slp_vec_failed()) return false;
>>>   }
>>>
>>> However, I have not found a case to support this condition yet.
>>>
>>> 2. As replied below, in:
>>>> -  } else if (cl->is_main_loop()) {
>>>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>       sw.transform_loop(lpt, true);
>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>
>>>> Why do you need the next additional check?:
>>>>
>>>> -  } else if (cl->is_main_loop()) {
>>>> +  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>       sw.transform_loop(lpt, true);
>>>
>>> The additional check prevents the case that when
>>> cl->is_slp_vec_failed() is true, SuperWord::output() will
>>> set_major_progress() at the beginning (because _packset.length() == 0
>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>> not what we want.
>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>
>>>>> Hi, all,
>>>>>
>>>>> Bug:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>
>>>>> Webrev:
>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>
>>>>> In the current implementation, the loop unrolling times are determined
>>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>> true (both X86 and aarch64 are true for now).
>>>>>
>>>>> This unrolling policy generates less optimized code when SLP
>>>>> auto-vectorization fails (as the following example shows).
>>>>>
>>>>> In this patch, I modify the current unrolling policy to do more
>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>> unrolled until it reaches the unroll count limit.
>>>>> Here is one example:
>>>>> public static void accessArrayConstants(int[] array) {
>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>     array[0]++;
>>>>>     array[1]++;
>>>>>   }
>>>>> }
>>>>>
>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>>>>
>>>>> Below is the generated code by C2 on AArch64:
>>>>>
>>>>> ==== generated code start ====
>>>>> 0x0000ffff6caf3180: ldr  w10, [x1,#16]  ;
>>>>> 0x0000ffff6caf3184: add  w13, w10, #0x1
>>>>> 0x0000ffff6caf3188: str  w13, [x1,#16]  ;
>>>>> 0x0000ffff6caf318c: ldr  w12, [x1,#20]  ;
>>>>> 0x0000ffff6caf3190: add  w13, w10, #0x4
>>>>> 0x0000ffff6caf3194: add  w10, w12, #0x4
>>>>> 0x0000ffff6caf3198: str  w13, [x1,#16]  ;
>>>>> 0x0000ffff6caf319c: add  w11, w11, #0x4 ;
>>>>> 0x0000ffff6caf31a0: str  w10, [x1,#20]  ;
>>>>> 0x0000ffff6caf31a4: cmp  w11, #0x3fd
>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>> ==== generated code end ====
>>>>>
>>>>> After applying this patch, it is unrolled 16 times:
>>>>>
>>>>> ==== generated code start ====
>>>>> 0x0000ffffb0aa6100: ldr  w10, [x1,#16]  ;
>>>>> 0x0000ffffb0aa6104: add  w13, w10, #0x1
>>>>> 0x0000ffffb0aa6108: str  w13, [x1,#16]  ;
>>>>> 0x0000ffffb0aa610c: ldr  w12, [x1,#20]  ;
>>>>> 0x0000ffffb0aa6110: add  w13, w10, #0x10
>>>>> 0x0000ffffb0aa6114: add  w10, w12, #0x10
>>>>> 0x0000ffffb0aa6118: str  w13, [x1,#16]  ;
>>>>> 0x0000ffffb0aa611c: add  w11, w11, #0x10 ;
>>>>> 0x0000ffffb0aa6120: str  w10, [x1,#20]  ;
>>>>> 0x0000ffffb0aa6124: cmp  w11, #0x3f1
>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>> ==== generated code end ====
>>>>>
>>>>> This patch passes jtreg tests both on AArch64 and X86.

From aph at redhat.com Thu Sep 21 13:04:07 2017
From: aph at redhat.com (Andrew Haley)
Date: Thu, 21 Sep 2017 14:04:07 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID: <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>

I reworked your benchmark to run faster and have less overhead, at
http://cr.openjdk.java.net/~aph/8186915/

Run it as

java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen

The test here was run on (rather old) Applied Micro hardware. The
real issue is, I think, that almost all of the time of squareToLen
without an intrinsic is dominated by mulAdd, and that already has an
intrinsic. Asymptotically, an intrinsic squareToLen should take half
the time of multiplyToLen, but we don't see that. Indeed, we barely
see any advantage for UseSquareToLenIntrinsic.

For a larger size, we see this with intrinsics enabled:

BigIntegerBench.implMutliplyToLen   200  avgt    5    50833.555 ±  10.674  ns/op
BigIntegerBench.implSquareToLen     200  avgt    5    57607.460 ±  87.155  ns/op

BigIntegerBench.implMutliplyToLen  1000  avgt    5  1254728.119 ± 527.126  ns/op
BigIntegerBench.implSquareToLen    1000  avgt    5  1369841.961 ± 169.843  ns/op

which makes the problem clear, I believe.
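As a back-of-the-envelope check on that asymptotic expectation, counting
the 32-bit limb products the schoolbook routines must perform shows where
the factor of two should come from. This is an illustrative sketch of
mine, not part of the benchmark above:

public class ProductCount {
    // Schoolbook n-limb multiply computes n * n limb products.
    static long multiplyProducts(int n) { return (long) n * n; }

    // Schoolbook squaring needs only the n*(n+1)/2 distinct products,
    // because x[i]*x[j] == x[j]*x[i] and the cross terms can be doubled.
    static long squareProducts(int n) { return (long) n * (n + 1) / 2; }

    public static void main(String[] args) {
        for (int n : new int[] {10, 50, 90, 127, 1000}) {
            System.out.printf("n=%-5d multiply=%-8d square=%-7d ratio=%.2f%n",
                    n, multiplyProducts(n), squareProducts(n),
                    (double) multiplyProducts(n) / squareProducts(n));
        }
    }
}

The ratio tends to 2 as n grows, which is the gap the numbers below fail
to show.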
No intrinsics:

Benchmark                          (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5     24.176 ±   0.006  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5     41.266 ±   0.008  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5     65.027 ±   0.019  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5    466.440 ±   0.080  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5  10613.512 ±   5.153  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5  34070.328 ±  10.991  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5  67546.985 ±  16.581  ns/op

-XX:+UseMultiplyToLenIntrinsic:

Benchmark                          (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5     25.661 ±   0.062  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5     29.183 ±   0.037  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5     51.690 ±   0.024  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5    193.401 ±   0.032  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5   3419.226 ±   0.312  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5  10638.801 ±   0.970  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5  21274.149 ±   7.188  ns/op

No intrinsics:

Benchmark                        (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     38.933 ±   1.437  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     62.523 ±   0.007  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     82.114 ±   0.012  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    366.986 ±  10.148  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   5534.064 ±  88.895  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  16308.025 ±  29.203  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  31521.335 ±  49.421  ns/op

-XX:+UseMulAddIntrinsic:

Benchmark                        (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     46.268 ±   0.005  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     67.527 ±   0.017  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     97.975 ±   0.179  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    345.126 ±   0.037  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   4327.120 ±   9.942  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  13143.308 ±   1.217  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  25014.420 ±  16.221  ns/op

-XX:+UseSquareToLenIntrinsic:

Benchmark                        (size)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     27.095 ±   0.012  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     49.185 ±   0.007  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     53.771 ±   0.013  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    238.843 ±   0.080  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   3828.313 ±   1.684  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  11949.819 ±   9.925  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  23613.427 ±  28.164  ns/op

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From ci_notify at linaro.org Thu Sep 21 15:46:56 2017
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Thu, 21 Sep 2017 15:46:56 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress, SPECjbb2015 and Hadoop/Terasort results for OpenJDK 10 on AArch64
Message-ID: <2028460771.4685.1506008817168.JavaMail.jenkins@81294fa8a221>

This is a summary of the JTREG test results
===========================================

The build and test results are cycled every 15 days.
For detailed information on the test output please refer to:

http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/summary/2017/263/summary.html

-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 1,400; fail: 11,561
Build 1: aarch64/2017/sep/20 pass: 1,369; fail: 35; error: 1

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 7,432; fail: 714; error: 20
Build 1: aarch64/2017/sep/20 pass: 7,469; fail: 689; error: 22

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 3,784
Build 1: aarch64/2017/sep/20 pass: 3,783; fail: 1

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 1,403; fail: 11,562; error: 1
Build 1: aarch64/2017/sep/20 pass: 1,373; fail: 37

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 7,469; fail: 675; error: 22
Build 1: aarch64/2017/sep/20 pass: 7,452; fail: 705; error: 23

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------

Build 0: aarch64/2017/sep/06 pass: 3,782; error: 2
Build 1: aarch64/2017/sep/20 pass: 3,783; fail: 1

Previous results can be found here:

http://openjdk.linaro.org/jdk10/openjdk-jtreg-nightly-tests/index.html

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server compiler
running the SPECjbb2015 composite tests and compares the performance
against the baseline performance of the server compiler taken on
2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the requirements
for publishing compliant results. The numbers below shall be treated
as non-compliant (nc) and are for experimental purposes only.

Relative performance: Server max-jOPS (nc): 1.05x
Relative performance: Server critical-jOPS (nc): 0.90x

Details of the test setup and historical results may be found here:

http://openjdk.linaro.org/jdk10/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares the
performance against the baseline performance of the Zero interpreter
and against the baseline performance of the client and server
compilers on 2014-04-01.
Relative performance: Zero: 1.0, Client: 70.58, Server: 115.7

Client 70.58 / Client 2014-04-01 (43.00): 1.64x
Server 115.7 / Server 2014-04-01 (71.00): 1.63x

Details of the test setup and historical results may be found here:

http://openjdk.linaro.org/jdk10/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================

The build and test results are cycled every 15 days.

2017-09-07 pass rate: 11556/11559, results:
http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/2017/249/results/
2017-09-21 pass rate: 11556/11559, results:
http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/2017/263/results/

For detailed information on the test output please refer to:

http://openjdk.linaro.org/jdk10/jcstress-nightly-runs/

From dmitrij.pochepko at bell-sw.com Thu Sep 21 18:19:33 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Thu, 21 Sep 2017 21:19:33 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
Message-ID: <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>

Hi,

thank you for looking into this and trying it on APM (I have no access
to this h/w).

I've used the modified benchmark you've sent and run it on ThunderX, and
implSquareToLen still shows better results than implMultiplyToLen in
most cases on ThunderX (up to 10% at size=127; results:
http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt).
However, since the performance difference on APM is larger than on
ThunderX, I think it'll be more logical to return to your idea and
call the multiplyToLen intrinsic inside squareToLen. An alternative
solution is to generate different code for APM and ThunderX, but I
prefer to have a single version given such a relatively small
difference in performance, and it's still much faster than without
the intrinsic at all. What do you think?

fyi: regarding sizes 200 and 1000: it's incorrect to measure these
sizes for squareToLen, because squareToLen is never called for sizes
larger than 127 (I've mentioned it before). An upper-level squaring
algorithm divides larger arrays into smaller parts (fewer than 128
integers) and then squares them separately. In order to compare
squaring vs multiplication at larger sizes, we should compare the
BigInteger::multiply and BigInteger::square methods with the full
logic behind them, because that is what gets called in a real
situation instead of a direct intrinsified method call. I've uploaded
a benchmark with a multiply method measurement here:
http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench2.java
just in case.
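For that kind of comparison through the public API, a rough sketch along
the following lines would work. This is my illustration, not necessarily
what BigIntegerBench2.java contains; note that multiply(this) is only
expected to dispatch to the internal squaring path above a size
threshold, and a JMH harness would give far more trustworthy numbers
than this naive timing loop:

import java.math.BigInteger;
import java.util.Random;

public class SquareVsMultiply {
    public static void main(String[] args) {
        Random r = new Random(42);
        int sink = 0; // accumulated so the JIT cannot discard the results
        for (int ints : new int[] {200, 1000}) {
            BigInteger x = new BigInteger(32 * ints, r);
            BigInteger y = new BigInteger(32 * ints, r);
            long t0 = System.nanoTime();
            for (int i = 0; i < 1000; i++) sink += x.multiply(y).bitLength();
            long t1 = System.nanoTime();
            for (int i = 0; i < 1000; i++) sink += x.multiply(x).bitLength();
            long t2 = System.nanoTime();
            System.out.printf("%d ints: multiply %d ns/op, square %d ns/op%n",
                    ints, (t1 - t0) / 1000, (t2 - t1) / 1000);
        }
        System.out.println(sink);
    }
}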
Thanks,
Dmitrij

On 21.09.2017 16:04, Andrew Haley wrote:
> I reworked your benchmark to run faster and have less overhead, at
> http://cr.openjdk.java.net/~aph/8186915/
>
> Run it as
>
> java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen
>
> The test here was run on (rather old) Applied Micro hardware. The
> real issue is, I think, that almost all of the time of squareToLen
> without an intrinsic is dominated by mulAdd, and that already has an
> intrinsic. Asymptotically, an intrinsic squareToLen should take half
> the time of multiplyToLen, but we don't see that. Indeed, we barely
> see any advantage for UseSquareToLenIntrinsic.
>
> For a larger size, we see this with intrinsics enabled:
>
> BigIntegerBench.implMutliplyToLen   200  avgt    5    50833.555 ±  10.674  ns/op
> BigIntegerBench.implSquareToLen     200  avgt    5    57607.460 ±  87.155  ns/op
>
> BigIntegerBench.implMutliplyToLen  1000  avgt    5  1254728.119 ± 527.126  ns/op
> BigIntegerBench.implSquareToLen    1000  avgt    5  1369841.961 ± 169.843  ns/op
>
> which makes the problem clear, I believe.
>
> No intrinsics:
>
> Benchmark                          (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implMutliplyToLen       1  avgt    5     24.176 ±   0.006  ns/op
> BigIntegerBench.implMutliplyToLen       2  avgt    5     41.266 ±   0.008  ns/op
> BigIntegerBench.implMutliplyToLen       3  avgt    5     65.027 ±   0.019  ns/op
> BigIntegerBench.implMutliplyToLen      10  avgt    5    466.440 ±   0.080  ns/op
> BigIntegerBench.implMutliplyToLen      50  avgt    5  10613.512 ±   5.153  ns/op
> BigIntegerBench.implMutliplyToLen      90  avgt    5  34070.328 ±  10.991  ns/op
> BigIntegerBench.implMutliplyToLen     127  avgt    5  67546.985 ±  16.581  ns/op
>
> -XX:+UseMultiplyToLenIntrinsic:
>
> Benchmark                          (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implMutliplyToLen       1  avgt    5     25.661 ±   0.062  ns/op
> BigIntegerBench.implMutliplyToLen       2  avgt    5     29.183 ±   0.037  ns/op
> BigIntegerBench.implMutliplyToLen       3  avgt    5     51.690 ±   0.024  ns/op
> BigIntegerBench.implMutliplyToLen      10  avgt    5    193.401 ±   0.032  ns/op
> BigIntegerBench.implMutliplyToLen      50  avgt    5   3419.226 ±   0.312  ns/op
> BigIntegerBench.implMutliplyToLen      90  avgt    5  10638.801 ±   0.970  ns/op
> BigIntegerBench.implMutliplyToLen     127  avgt    5  21274.149 ±   7.188  ns/op
>
> No intrinsics:
>
> Benchmark                        (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implSquareToLen       1  avgt    5     38.933 ±   1.437  ns/op
> BigIntegerBench.implSquareToLen       2  avgt    5     62.523 ±   0.007  ns/op
> BigIntegerBench.implSquareToLen       3  avgt    5     82.114 ±   0.012  ns/op
> BigIntegerBench.implSquareToLen      10  avgt    5    366.986 ±  10.148  ns/op
> BigIntegerBench.implSquareToLen      50  avgt    5   5534.064 ±  88.895  ns/op
> BigIntegerBench.implSquareToLen      90  avgt    5  16308.025 ±  29.203  ns/op
> BigIntegerBench.implSquareToLen     127  avgt    5  31521.335 ±  49.421  ns/op
>
> -XX:+UseMulAddIntrinsic:
>
> Benchmark                        (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implSquareToLen       1  avgt    5     46.268 ±   0.005  ns/op
> BigIntegerBench.implSquareToLen       2  avgt    5     67.527 ±   0.017  ns/op
> BigIntegerBench.implSquareToLen       3  avgt    5     97.975 ±   0.179  ns/op
> BigIntegerBench.implSquareToLen      10  avgt    5    345.126 ±   0.037  ns/op
> BigIntegerBench.implSquareToLen      50  avgt    5   4327.120 ±   9.942  ns/op
> BigIntegerBench.implSquareToLen      90  avgt    5  13143.308 ±   1.217  ns/op
> BigIntegerBench.implSquareToLen     127  avgt    5  25014.420 ±  16.221  ns/op
>
> -XX:+UseSquareToLenIntrinsic:
>
> Benchmark                        (size)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implSquareToLen       1  avgt    5     27.095 ±   0.012  ns/op
> BigIntegerBench.implSquareToLen       2  avgt    5     49.185 ±   0.007  ns/op
> BigIntegerBench.implSquareToLen       3  avgt    5     53.771 ±   0.013  ns/op
> BigIntegerBench.implSquareToLen      10  avgt    5    238.843 ±   0.080  ns/op
> BigIntegerBench.implSquareToLen      50  avgt    5   3828.313 ±   1.684  ns/op
> BigIntegerBench.implSquareToLen      90  avgt    5  11949.819 ±   9.925  ns/op
> BigIntegerBench.implSquareToLen     127  avgt    5  23613.427 ±  28.164  ns/op

From aph at redhat.com Fri Sep 22 08:12:23 2017
From: aph at redhat.com (Andrew Haley)
Date: Fri, 22 Sep 2017 09:12:23 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
	<85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
Message-ID: <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>

On 21/09/17 19:19, Dmitrij Pochepko wrote:
> thank you for looking into this and trying it on APM (I have no access
> to this h/w).
>
> I've used the modified benchmark you've sent and run it on ThunderX, and
> implSquareToLen still shows better results than implMultiplyToLen in
> most cases on ThunderX (up to 10% at size=127; results:
> http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt).

For 10%, it's not worth doing, given the risks and that it's not used
by crypto operations when C2-compiled.

> However, since the performance difference on APM is larger than on
> ThunderX, I think it'll be more logical to return to your idea and
> call the multiplyToLen intrinsic inside squareToLen. An alternative
> solution is to generate different code for APM and ThunderX, but I
> prefer to have a single version given such a relatively small
> difference in performance, and it's still much faster than without
> the intrinsic at all. What do you think?

Yes. Calling multiplyToLen would be fine.

> fyi: regarding sizes 200 and 1000: it's incorrect to measure these
> sizes for squareToLen, because squareToLen is never called for sizes
> larger than 127 (I've mentioned it before).

It's not incorrect: it's a test for asymptotic behaviour.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Mon Sep 25 15:46:43 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 25 Sep 2017 18:46:43 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
	<85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
	<3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
Message-ID:

Hi,

please take a look at v2. I've modified the code to use multiplyToLen in
squareToLen. An additional benefit: no more code in the common part. I've
left mulAdd unchanged.

http://cr.openjdk.java.net/~dpochepk/8186915/webrev.02/

I've also rerun the benchmark on ThunderX and got these results:
http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt

Thanks,
Dmitrij

On 22.09.2017 11:12, Andrew Haley wrote:
> On 21/09/17 19:19, Dmitrij Pochepko wrote:
>> thank you for looking into this and trying it on APM (I have no access
>> to this h/w).
>> I've used the modified benchmark you've sent and run it on ThunderX, and
>> implSquareToLen still shows better results than implMultiplyToLen in
>> most cases on ThunderX (up to 10% at size=127; results:
>> http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt).
> For 10%, it's not worth doing, given the risks and that it's not used
> by crypto operations when C2-compiled.
>
>> However, since the performance difference on APM is larger than on
>> ThunderX, I think it'll be more logical to return to your idea and
>> call the multiplyToLen intrinsic inside squareToLen. An alternative
>> solution is to generate different code for APM and ThunderX, but I
>> prefer to have a single version given such a relatively small
>> difference in performance, and it's still much faster than without
>> the intrinsic at all. What do you think?
> Yes. Calling multiplyToLen would be fine.
>
>> fyi: regarding sizes 200 and 1000: it's incorrect to measure these
>> sizes for squareToLen, because squareToLen is never called for sizes
>> larger than 127 (I've mentioned it before).
> It's not incorrect: it's a test for asymptotic behaviour.

From aph at redhat.com Mon Sep 25 15:57:43 2017
From: aph at redhat.com (Andrew Haley)
Date: Mon, 25 Sep 2017 16:57:43 +0100
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To:
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
	<85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
	<3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
Message-ID: <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com>

On 25/09/17 16:46, Dmitrij Pochepko wrote:
> please take a look at v2. I've modified the code to use multiplyToLen in
> squareToLen. An additional benefit: no more code in the common part. I've
> left mulAdd unchanged.

That looks fine. Please commit if you've run the jtreg test suite.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Mon Sep 25 16:33:00 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 25 Sep 2017 19:33:00 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
	<81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
	<304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
	<0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
	<16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
	<8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
	<18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
	<848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
	<85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
	<3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
	<6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com>
Message-ID: <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com>

Thank you for such an attentive review.

I'll commit it now. I've run the jtreg tests in
jdk/test/java/math/BigInteger/* in both Xmixed and Xcomp modes.
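Beyond the jtreg runs, a quick self-contained cross-check of the two
intrinsified paths can also be useful. This is an illustrative snippet
of mine, not part of the test suite:

import java.math.BigInteger;
import java.util.Random;

public class SquareSmokeTest {
    public static void main(String[] args) {
        Random r = new Random(0);
        for (int i = 0; i < 100_000; i++) { // enough iterations to reach C2
            BigInteger x = new BigInteger(32 * (1 + r.nextInt(127)), r);
            // Same value, different object, so this multiply takes the
            // general multiplyToLen path rather than the squaring path.
            BigInteger y = new BigInteger(x.toByteArray());
            if (!x.multiply(x).equals(x.multiply(y))) {
                throw new AssertionError("square/multiply mismatch for " + x);
            }
        }
        System.out.println("ok");
    }
}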
From dmitrij.pochepko at bell-sw.com  Mon Sep 25 16:36:06 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 25 Sep 2017 19:36:06 +0300
Subject: [aarch64-port-dev ] [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
 <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
 <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
 <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
 <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
 <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
 <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
 <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>
 <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com>
 <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com>
 <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com>
 <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com>
Message-ID:

Seems like the repo is still closed. I'll have to wait a bit.

On 25.09.2017 19:33, Dmitrij Pochepko wrote:
> Thank you for such an attentive review.
>
> I'll commit it now. I've run the jtreg tests in
> jdk/test/java/math/BigInteger/* in both Xmixed and Xcomp modes.
>
> Thanks,
> Dmitrij
> On 25.09.2017 18:57, Andrew Haley wrote:
>> On 25/09/17 16:46, Dmitrij Pochepko wrote:
>>> please take a look at v2. I've modified the code to use multiplyToLen in
>>> squareToLen. Additional benefit: no more code in the common part. I've
>>> left mulAdd unchanged.
>> That looks fine. Please commit if you've run the jtreg test suite.
>>
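For context on the numbers in this thread: the implSquareToLen and
implMultiplyToLen figures come from microbenchmarks of this path. A minimal
JMH-style sketch of such a benchmark could look like the hypothetical code
below (class and field names are illustrative). In JDK 9-era BigInteger,
x.multiply(x) routes to square() above a small size threshold, and square()
uses the squareToLen intrinsic up to the Karatsuba threshold of 128 words,
matching the "never called for sizes greater than 127" remark above.

import java.math.BigInteger;
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class SquareToLenBench {
    // size is the magnitude length in 32-bit words; 200 exercises the
    // non-intrinsified, asymptotic path.
    @Param({"31", "63", "127", "200"})
    int size;

    BigInteger x;

    @Setup
    public void setup() {
        // Set the top bit so the magnitude has exactly `size` words.
        x = new BigInteger(32 * size, new Random(42)).setBit(32 * size - 1);
    }

    @Benchmark
    public BigInteger implSquareToLen() {
        return x.multiply(x);   // self-multiply dispatches to square()
    }
}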
From zhongwei.yao at linaro.org  Fri Sep 29 08:25:39 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Fri, 29 Sep 2017 16:25:39 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID:

Hi, Vladimir,

Sorry for my late response!

And yes, it solves my case.

But I found specjvm2008 doesn't have a stable result, especially for
benchmark cases like startup.xxx, scimark.xxx.large etc. And I have
found an obvious performance regression in the rest of the benchmark cases. What
do you think?

On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
> Nice.
>
> Did you verify that it fixed your case?
>
> Would be nice to run specjvm2008 to make sure performance did not regress.
>
> Thanks,
> Vladimir
>
> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>> Thanks for your suggestions!
>>
>> I've updated the patch that uses the pass_slp and do_unroll_only flags
>> without adding a new flag. Please take a look:
>>
>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>
>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>> Hi, Vladimir,
>>>>
>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>> Why not use the existing set_notpassed_slp() instead of
>>>>> mark_slp_vec_failed()?
>>>>
>>>> Due to two reasons, I have not chosen the existing passed_slp flag:
>>>
>>> My point is that if we don't find vectors in a loop (as in your case) we
>>> should ignore the whole SLP analysis.
>>>
>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>> if there are vector candidates. For example, check if the array's index
>>> depends on the loop's index variable.
>>>
>>> Another way is to call SuperWord::unrolling_analysis() only after we did
>>> the vector analysis.
>>>
>>> These are more complicated changes and out of scope of this. There is also
>>> a side effect I missed before which may prevent using set_notpassed_slp():
>>> LoopMaxUnroll is changed based on the SLP analysis before the
>>> has_passed_slp() check.
>>>
>>> Note, set_notpassed_slp() is also used to additionally unroll already
>>> vectorized loops:
>>>
>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>
>>> Maybe you should also call mark_do_unroll_only() when you set
>>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>>> problem you pointed out. Can you look at this?
>>>
>>> I am not against adding a new is_slp_vec_failed() but I want first to
>>> investigate if we can re-use existing functions.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>> check:
>>>>
>>>> if (cl->has_passed_slp()) {
>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>   // Normal case: loop too big
>>>>   return false;
>>>> }
>>>>
>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>> as also exposed in my patch:
>>>>
>>>> if (cl->has_passed_slp()) {
>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>> - // Normal case: loop too big
>>>> - return false;
>>>> + // When SLP vectorization failed, we could do more unrolling
>>>> + // optimizations if body size is less than limit size. Otherwise,
>>>> + // return false due to loop is too big.
>>>> + if (!cl->is_slp_vec_failed()) return false;
>>>> }
>>>>
>>>> However, I have not found a case to support this condition yet.
>>>>
>>>> 2. As replied below, in:
>>>>> - } else if (cl->is_main_loop()) {
>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>     sw.transform_loop(lpt, true);
>>>>
>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>
>>>>> Why do you need the next additional check?:
>>>>>
>>>>> - } else if (cl->is_main_loop()) {
>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>     sw.transform_loop(lpt, true);
>>>>
>>>> The additional check prevents the case that when
>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>> not what we want.
>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>
>>>>>> Hi, all,
>>>>>>
>>>>>> Bug:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>
>>>>>> Webrev:
>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>
>>>>>> In the current implementation, the loop unrolling count is determined
>>>>>> by the vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>
>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>
>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>
>>>>>> Here is one example:
>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>     array[0]++;
>>>>>>     array[1]++;
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
>>>>>>
>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>
>>>>>> ==== generated code start ====
>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>> ==== generated code end ====
>>>>>>
>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>
>>>>>> ==== generated code start ====
>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>> ==== generated code end ====
>>>>>>
>>>>>> This patch passes jtreg tests both on AArch64 and X86.

--
Best regards,
Zhongwei
From zhongwei.yao at linaro.org  Fri Sep 29 09:22:24 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Fri, 29 Sep 2017 17:22:24 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID:

I made a typo in the previous reply.

On 29 September 2017 at 16:25, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> Sorry for my late response!
>
> And yes, it solves my case.
>
> But I found specjvm2008 doesn't have a stable result, especially for
> benchmark cases like startup.xxx, scimark.xxx.large etc. And I have
> found an obvious performance regression in the rest of the benchmark cases. What

And I have NOT found an obvious performance regression in the rest of
the benchmark cases.

> do you think?
>
> On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
>> Nice.
>>
>> Did you verify that it fixed your case?
>>
>> Would be nice to run specjvm2008 to make sure performance did not regress.
>>
>> Thanks,
>> Vladimir
>>
>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>> Thanks for your suggestions!
>>>
>>> I've updated the patch that uses the pass_slp and do_unroll_only flags
>>> without adding a new flag. Please take a look:
>>>
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>
>>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>> Hi, Vladimir,
>>>>>
>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>>> Why not use the existing set_notpassed_slp() instead of
>>>>>> mark_slp_vec_failed()?
>>>>>
>>>>> Due to two reasons, I have not chosen the existing passed_slp flag:
>>>>
>>>> My point is that if we don't find vectors in a loop (as in your case) we
>>>> should ignore the whole SLP analysis.
>>>>
>>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>>> if there are vector candidates. For example, check if the array's index
>>>> depends on the loop's index variable.
>>>>
>>>> Another way is to call SuperWord::unrolling_analysis() only after we did
>>>> the vector analysis.
>>>>
>>>> These are more complicated changes and out of scope of this. There is also
>>>> a side effect I missed before which may prevent using set_notpassed_slp():
>>>> LoopMaxUnroll is changed based on the SLP analysis before the
>>>> has_passed_slp() check.
>>>>
>>>> Note, set_notpassed_slp() is also used to additionally unroll already
>>>> vectorized loops:
>>>>
>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>
>>>> Maybe you should also call mark_do_unroll_only() when you set
>>>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>>>> problem you pointed out. Can you look at this?
>>>>
>>>> I am not against adding a new is_slp_vec_failed() but I want first to
>>>> investigate if we can re-use existing functions.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>> check:
>>>>>
>>>>> if (cl->has_passed_slp()) {
>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>   // Normal case: loop too big
>>>>>   return false;
>>>>> }
>>>>>
>>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>>> as also exposed in my patch:
>>>>>
>>>>> if (cl->has_passed_slp()) {
>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>> - // Normal case: loop too big
>>>>> - return false;
>>>>> + // When SLP vectorization failed, we could do more unrolling
>>>>> + // optimizations if body size is less than limit size. Otherwise,
>>>>> + // return false due to loop is too big.
>>>>> + if (!cl->is_slp_vec_failed()) return false;
>>>>> }
>>>>>
>>>>> However, I have not found a case to support this condition yet.
>>>>>
>>>>> 2. As replied below, in:
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>     sw.transform_loop(lpt, true);
>>>>>
>>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>>
>>>>>> Why do you need the next additional check?:
>>>>>>
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>     sw.transform_loop(lpt, true);
>>>>>
>>>>> The additional check prevents the case that when
>>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>>> not what we want.
>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> Bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>
>>>>>>> In the current implementation, the loop unrolling count is determined
>>>>>>> by the vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>>
>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>>
>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>>
>>>>>>> Here is one example:
>>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>>     array[0]++;
>>>>>>>     array[1]++;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
>>>>>>>
>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> This patch passes jtreg tests both on AArch64 and X86.
>
> --
> Best regards,
> Zhongwei
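The 4x and 8x defaults quoted in the RFR fall out of a simple division; a
minimal sketch of that policy arithmetic (the method name is illustrative,
not the actual HotSpot code):

// Default SLP-driven unroll count: how many elements fit in one vector.
static int slpMaxUnrollFactor(int vectorWidthBits, int elementSizeBits) {
    return vectorWidthBits / elementSizeBits;
}

// AArch64 NEON: slpMaxUnrollFactor(128, 32) == 4
// x86 AVX2:     slpMaxUnrollFactor(256, 32) == 8
// The patch lets unrolling continue past this value, roughly up to the
// LoopMaxUnroll limit, when SLP finds no packs (vectorization failed).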
From vladimir.kozlov at oracle.com  Fri Sep 29 18:10:10 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Fri, 29 Sep 2017 11:10:10 -0700
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID: <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com>

On 9/29/17 1:25 AM, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> Sorry for my late response!
>
> And yes, it solves my case.
>
> But I found specjvm2008 doesn't have a stable result, especially for
> benchmark cases like startup.xxx, scimark.xxx.large etc. And I have
> found an obvious performance regression in the rest of the benchmark cases. What
> do you think?

You know that you can change the run parameters for specjvm2008 to avoid
waiting long for it to finish. And you should preferably run on one node.

Variation in startup is not important in this case. But scimark is
important since it shows the quality of loop optimizations.

Is the regression significant? We need more time to investigate it then.

Thanks,
Vladimir

> On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
>> Nice.
>>
>> Did you verify that it fixed your case?
>>
>> Would be nice to run specjvm2008 to make sure performance did not regress.
>>
>> Thanks,
>> Vladimir
>>
>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>> Thanks for your suggestions!
>>>
>>> I've updated the patch that uses the pass_slp and do_unroll_only flags
>>> without adding a new flag. Please take a look:
>>>
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>
>>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>> Hi, Vladimir,
>>>>>
>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>>> Why not use the existing set_notpassed_slp() instead of
>>>>>> mark_slp_vec_failed()?
>>>>>
>>>>> Due to two reasons, I have not chosen the existing passed_slp flag:
>>>>
>>>> My point is that if we don't find vectors in a loop (as in your case) we
>>>> should ignore the whole SLP analysis.
>>>>
>>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>>> if there are vector candidates. For example, check if the array's index
>>>> depends on the loop's index variable.
>>>>
>>>> Another way is to call SuperWord::unrolling_analysis() only after we did
>>>> the vector analysis.
>>>>
>>>> These are more complicated changes and out of scope of this. There is also
>>>> a side effect I missed before which may prevent using set_notpassed_slp():
>>>> LoopMaxUnroll is changed based on the SLP analysis before the
>>>> has_passed_slp() check.
>>>>
>>>> Note, set_notpassed_slp() is also used to additionally unroll already
>>>> vectorized loops:
>>>>
>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>
>>>> Maybe you should also call mark_do_unroll_only() when you set
>>>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>>>> problem you pointed out. Can you look at this?
>>>>
>>>> I am not against adding a new is_slp_vec_failed() but I want first to
>>>> investigate if we can re-use existing functions.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>> check:
>>>>>
>>>>> if (cl->has_passed_slp()) {
>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>   // Normal case: loop too big
>>>>>   return false;
>>>>> }
>>>>>
>>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>>> as also exposed in my patch:
>>>>>
>>>>> if (cl->has_passed_slp()) {
>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>> - // Normal case: loop too big
>>>>> - return false;
>>>>> + // When SLP vectorization failed, we could do more unrolling
>>>>> + // optimizations if body size is less than limit size. Otherwise,
>>>>> + // return false due to loop is too big.
>>>>> + if (!cl->is_slp_vec_failed()) return false;
>>>>> }
>>>>>
>>>>> However, I have not found a case to support this condition yet.
>>>>>
>>>>> 2. As replied below, in:
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>     sw.transform_loop(lpt, true);
>>>>>
>>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>>
>>>>>> Why do you need the next additional check?:
>>>>>>
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>     sw.transform_loop(lpt, true);
>>>>>
>>>>> The additional check prevents the case that when
>>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>>> not what we want.
>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> Bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>
>>>>>>> In the current implementation, the loop unrolling count is determined
>>>>>>> by the vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>>
>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>>
>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>>
>>>>>>> Here is one example:
>>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>>     array[0]++;
>>>>>>>     array[1]++;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
>>>>>>>
>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> This patch passes jtreg tests both on AArch64 and X86.

From zhongwei.yao at linaro.org  Sat Sep 30 06:37:32 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Sat, 30 Sep 2017 14:37:32 +0800
Subject: [aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To: <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com>
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
 <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com>
Message-ID:

On 30 September 2017 at 02:10, Vladimir Kozlov wrote:
> On 9/29/17 1:25 AM, Zhongwei Yao wrote:
>> Hi, Vladimir,
>>
>> Sorry for my late response!
>>
>> And yes, it solves my case.
>>
>> But I found specjvm2008 doesn't have a stable result, especially for
>> benchmark cases like startup.xxx, scimark.xxx.large etc. And I have not
>> found an obvious performance regression in the rest of the benchmark cases. What
>> do you think?
>
> You know that you can change the run parameters for specjvm2008 to avoid
> waiting long for it to finish. And you should preferably run on one node.
>
> Variation in startup is not important in this case. But scimark is
> important since it shows the quality of loop optimizations.
>
> Is the regression significant? We need more time to investigate it then.

I see that the performance data fluctuates in specjvm2008. However, I
checked scimark 2.0 (http://math.nist.gov/scimark2/) and see no
performance regression in it on either x86 or arm64.

> Thanks,
> Vladimir
>
>> On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
>>> Nice.
>>>
>>> Did you verify that it fixed your case?
>>>
>>> Would be nice to run specjvm2008 to make sure performance did not
>>> regress.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>>> Thanks for your suggestions!
>>>>
>>>> I've updated the patch that uses the pass_slp and do_unroll_only flags
>>>> without adding a new flag. Please take a look:
>>>>
>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>>
>>>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>>> Hi, Vladimir,
>>>>>>
>>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>>>> Why not use the existing set_notpassed_slp() instead of
>>>>>>> mark_slp_vec_failed()?
>>>>>>
>>>>>> Due to two reasons, I have not chosen the existing passed_slp flag:
>>>>>
>>>>> My point is that if we don't find vectors in a loop (as in your case) we
>>>>> should ignore the whole SLP analysis.
>>>>>
>>>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>>>> if there are vector candidates. For example, check if the array's index
>>>>> depends on the loop's index variable.
>>>>>
>>>>> Another way is to call SuperWord::unrolling_analysis() only after we did
>>>>> the vector analysis.
>>>>>
>>>>> These are more complicated changes and out of scope of this. There is also
>>>>> a side effect I missed before which may prevent using set_notpassed_slp():
>>>>> LoopMaxUnroll is changed based on the SLP analysis before the
>>>>> has_passed_slp() check.
>>>>>
>>>>> Note, set_notpassed_slp() is also used to additionally unroll already
>>>>> vectorized loops:
>>>>>
>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>>
>>>>> Maybe you should also call mark_do_unroll_only() when you set
>>>>> set_major_progress() for _packset.length() == 0 to avoid the loop_opts_cnt
>>>>> problem you pointed out. Can you look at this?
>>>>>
>>>>> I am not against adding a new is_slp_vec_failed() but I want first to
>>>>> investigate if we can re-use existing functions.
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>>> check:
>>>>>>
>>>>>> if (cl->has_passed_slp()) {
>>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>>   // Normal case: loop too big
>>>>>>   return false;
>>>>>> }
>>>>>>
>>>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>>>> as also exposed in my patch:
>>>>>>
>>>>>> if (cl->has_passed_slp()) {
>>>>>>   if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>> - // Normal case: loop too big
>>>>>> - return false;
>>>>>> + // When SLP vectorization failed, we could do more unrolling
>>>>>> + // optimizations if body size is less than limit size. Otherwise,
>>>>>> + // return false due to loop is too big.
>>>>>> + if (!cl->is_slp_vec_failed()) return false;
>>>>>> }
>>>>>>
>>>>>> However, I have not found a case to support this condition yet.
>>>>>>
>>>>>> 2. As replied below, in:
>>>>>>> - } else if (cl->is_main_loop()) {
>>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>>     sw.transform_loop(lpt, true);
>>>>>>
>>>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>>>
>>>>>>> Why do you need the next additional check?:
>>>>>>>
>>>>>>> - } else if (cl->is_main_loop()) {
>>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>>     sw.transform_loop(lpt, true);
>>>>>>
>>>>>> The additional check prevents the case that when
>>>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>>>> not what we want.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Vladimir
>>>>>>>
>>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>>
>>>>>>>> Hi, all,
>>>>>>>>
>>>>>>>> Bug:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>>
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>>
>>>>>>>> In the current implementation, the loop unrolling count is determined
>>>>>>>> by the vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>>>
>>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>>>
>>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>>>
>>>>>>>> Here is one example:
>>>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>>>     array[0]++;
>>>>>>>>     array[1]++;
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Before this patch, the loop will be unrolled 4 times. 4 is
>>>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
>>>>>>>>
>>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>>
>>>>>>>> ==== generated code start ====
>>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>>> ==== generated code end ====
>>>>>>>>
>>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>>
>>>>>>>> ==== generated code start ====
>>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>>> ==== generated code end ====
>>>>>>>>
>>>>>>>> This patch passes jtreg tests both on AArch64 and X86.

--
Best regards,
Zhongwei
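A note on why SLP fails on this example: every iteration loads and stores
array[0] and array[1], a loop-carried dependence through memory, so SuperWord
finds no independent packs to vectorize. What the deeper unroll buys is
visible in the add #0x10 instructions in the listings above: C2 folds the
sixteen unrolled increments into a single add. A rough source-level sketch of
the effect (illustrative only, not actual compiler output):

// After 16x unrolling, C2 collapses the redundant loads/stores, so the
// steady-state loop behaves roughly like this (1024 is divisible by 16,
// so no pre-/post-loop iterations are left over):
public static void accessArrayConstantsUnrolled(int[] array) {
    for (int j = 0; j < 1024; j += 16) {
        array[0] += 16;   // sixteen ++ operations folded into one add
        array[1] += 16;
    }
}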