From christian.thalinger at oracle.com Mon Jun 1 17:51:10 2015
From: christian.thalinger at oracle.com (Christian Thalinger)
Date: Mon, 1 Jun 2015 10:51:10 -0700
Subject: RFR 8076276 support for AVX512
In-Reply-To: <554D09E0.4040904@oracle.com>
References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> <554BF60F.9000401@oracle.com> <554D09E0.4040904@oracle.com>
Message-ID: 

> On May 8, 2015, at 12:09 PM, Vladimir Kozlov wrote:
>
> I applied small "cosmetic" changes sent by Michael and I am pushing this.
>
> Here is updated webrev for the record (I also removed trailing spaces in .ad files):
>
> http://cr.openjdk.java.net/~kvn/8076276/webrev.03

I think I found a bug:

http://cr.openjdk.java.net/~kvn/8076276/webrev.03/src/cpu/x86/vm/sharedRuntime_x86_64.cpp.udiff.html

in RegisterSaver::restore_live_registers:

__ vinsertf64x4h(xmm31, Address(rsp, 992));
__ subptr(rsp, 1024);

Shouldn't that be an add?

>
> Thanks,
> Vladimir
>
> On 5/7/15 4:41 PM, Berg, Michael C wrote:
>> Ok, I will leave that way it is plus the comment.
>>
>> Thanks,
>> -Michael
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Thursday, May 07, 2015 4:33 PM
>> To: Berg, Michael C; Roland Westrelin
>> Cc: hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR 8076276 support for AVX512
>>
>> You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy.
>>
>> To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM.
>>
>> Thanks,
>> Vladimir
>>
>> On 5/7/15 3:43 PM, Berg, Michael C wrote:
>>> Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions.
>>>
>>> -Michael
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>> Sent: Thursday, May 07, 2015 2:28 PM
>>> To: Berg, Michael C; Roland Westrelin
>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR 8076276 support for AVX512
>>>
>>> On 5/7/15 2:07 PM, Berg, Michael C wrote:
>>>> Roland,
>>>>
>>>> VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank.
>>>> 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank.
>>>
>>>
>>> For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit.
>>>
>>> Regards,
>>> Vladimir
>>>
>>>> I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ).
>>>> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. >>>> >>>> In x86.64.ad: >>>> >>>>>> 3463 // Float register operands >>>>>> 3473 // Double register operands >>>> >>>> I will re-add the comments. >>>> >>>> In chaitain.hpp: >>>> >>>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>>> >>>> I will fix the comment location. >>>> >>>> Thanks, >>>> Michael >>>> >>>> -----Original Message----- >>>> From: hotspot-compiler-dev >>>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >>>> Roland Westrelin >>>> Sent: Thursday, May 07, 2015 11:40 AM >>>> To: Vladimir Kozlov >>>> Cc: hotspot-compiler-dev at openjdk.java.net >>>> Subject: Re: RFR 8076276 support for AVX512 >>>> >>>>>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>>>>> >>>>>> This looks good to me. A few minor remarks: >>>>>> >>>>>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >>>>> >>>>> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? >>>> >>>> I'm talking about: >>>> 4101 #ifdef _LP64 >>>> 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, >>>> regF tmp, regF tmp2) %{ >>>> >>>> for instance in x86.ad >>>> Why isn't it in x86_64.ad? >>>> >>>> Roland. >>>>> >>>>>> >>>>>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>>>>> >>>>>> Not sure why you dropped: >>>>>> >>>>>> 3463 // Float register operands >>>>>> 3473 // Double register operands >>>>>> >>>>>> in x86_64.ad >>>>>> >>>>>> In chaitin.hpp: >>>>>> >>>>>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>>>>> >>>>>> comment is not aligned with the one below anymore. >>>>>> >>>>>> Roland. >>>>>> >>>>> >>>>> I also have questions to Michael. >>>>> >>>>> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >>>>> >>>>> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >>>>> >>>>> Thanks, >>>>> Vladimir >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.c.berg at intel.com Mon Jun 1 20:21:36 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Mon, 1 Jun 2015 20:21:36 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> <554BF60F.9000401@oracle.com> <554D09E0.4040904@oracle.com> Message-ID: Yes Christian, in restore, we should add like in the other two locations in that code. I have other changes which address more AVX512 extensions, do you want me to just add this change to the webrev I created for that set? Thanks, Michael From: Christian Thalinger [mailto:christian.thalinger at oracle.com] Sent: Monday, June 01, 2015 10:51 AM To: Berg, Michael C Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 On May 8, 2015, at 12:09 PM, Vladimir Kozlov > wrote: I applied small "cosmetic" changes sent by Michael and I am pushing this. 
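For context on the add-versus-subtract question about RegisterSaver::restore_live_registers raised in this thread: save_live_registers grows the stack with subptr(rsp, 1024) before spilling the upper 256-bit halves of zmm0-zmm31, so the matching epilogue has to release that space with an add rather than subtract again. A rough sketch of the intended symmetry follows, using the 992/1024 offsets quoted from the webrev; the store-side helper name vextractf64x4h is an assumption by analogy with vinsertf64x4h and is not verified against the actual file:

  // save_live_registers: reserve 32 * 32 bytes, then spill upper zmm halves
  __ subptr(rsp, 1024);
  __ vextractf64x4h(Address(rsp, 0), xmm0);      // assumed store-side counterpart of vinsertf64x4h
  // ... xmm1 .. xmm30 at offsets 32 .. 960 ...
  __ vextractf64x4h(Address(rsp, 992), xmm31);

  // restore_live_registers: reload the halves, then release the spill area
  __ vinsertf64x4h(xmm0, Address(rsp, 0));
  // ... xmm1 .. xmm30 ...
  __ vinsertf64x4h(xmm31, Address(rsp, 992));
  __ addptr(rsp, 1024);                          // the webrev emitted subptr here, which is the suspected bug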
Here is updated webrev for the record (I also removed trailing spaces in .ad files): http://cr.openjdk.java.net/~kvn/8076276/webrev.03 I think I found a bug: http://cr.openjdk.java.net/~kvn/8076276/webrev.03/src/cpu/x86/vm/sharedRuntime_x86_64.cpp.udiff.html in RegisterSaver::restore_live_registers: __ vinsertf64x4h(xmm31, Address(rsp, 992)); __ subptr(rsp, 1024); Shouldn't that be an add? Thanks, Vladimir On 5/7/15 4:41 PM, Berg, Michael C wrote: Ok, I will leave that way it is plus the comment. Thanks, -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, May 07, 2015 4:33 PM To: Berg, Michael C; Roland Westrelin Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. Thanks, Vladimir On 5/7/15 3:43 PM, Berg, Michael C wrote: Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, May 07, 2015 2:28 PM To: Berg, Michael C; Roland Westrelin Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 On 5/7/15 2:07 PM, Berg, Michael C wrote: Roland, VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. Regards, Vladimir I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. In x86.64.ad: 3463 // Float register operands 3473 // Double register operands I will re-add the comments. In chaitain.hpp: 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else I will fix the comment location. Thanks, Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin Sent: Thursday, May 07, 2015 11:40 AM To: Vladimir Kozlov Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 http://cr.openjdk.java.net/~kvn/8076276/webrev.02 This looks good to me. A few minor remarks: Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. 
Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? I'm talking about: 4101 #ifdef _LP64 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ for instance in x86.ad Why isn't it in x86_64.ad? Roland. In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. Not sure why you dropped: 3463 // Float register operands 3473 // Double register operands in x86_64.ad In chaitin.hpp: 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else comment is not aligned with the one below anymore. Roland. I also have questions to Michael. Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? Thanks, Vladimir -------------- next part -------------- An HTML attachment was scrubbed... URL: From christian.thalinger at oracle.com Mon Jun 1 21:26:32 2015 From: christian.thalinger at oracle.com (Christian Thalinger) Date: Mon, 1 Jun 2015 14:26:32 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> <554BF60F.9000401@oracle.com> <554D09E0.4040904@oracle.com> Message-ID: > On Jun 1, 2015, at 1:21 PM, Berg, Michael C wrote: > > Yes Christian, in restore, we should add like in the other two locations in that code. > I have other changes which address more AVX512 extensions, do you want me to just add this change to the webrev I created for that set? Yes. I wanted to ask you something else: currently I?m trying to merge with these changes: http://hg.openjdk.java.net/graal/graal/diff/a560c9b81f0f/src/cpu/x86/vm/sharedRuntime_x86_64.cpp and I?m trying to use enum values for offsets instead of hardcoded ones. Unfortunately I don?t have a good solution yet but maybe you have an idea. I would like to have this change upstream. > > Thanks, > Michael > > From: Christian Thalinger [mailto:christian.thalinger at oracle.com] > Sent: Monday, June 01, 2015 10:51 AM > To: Berg, Michael C > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > > On May 8, 2015, at 12:09 PM, Vladimir Kozlov > wrote: > > I applied small "cosmetic" changes sent by Michael and I am pushing this. > > Here is updated webrev for the record (I also removed trailing spaces in .ad files): > > http://cr.openjdk.java.net/~kvn/8076276/webrev.03 > > I think I found a bug: > > http://cr.openjdk.java.net/~kvn/8076276/webrev.03/src/cpu/x86/vm/sharedRuntime_x86_64.cpp.udiff.html > > in RegisterSaver::restore_live_registers: > > __ vinsertf64x4h(xmm31, Address(rsp, 992)); > __ subptr(rsp, 1024); > > Shouldn?t that be an add? > > > > Thanks, > Vladimir > > On 5/7/15 4:41 PM, Berg, Michael C wrote: > > Ok, I will leave that way it is plus the comment. > > Thanks, > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com ] > Sent: Thursday, May 07, 2015 4:33 PM > To: Berg, Michael C; Roland Westrelin > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. 
> > To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. > > Thanks, > Vladimir > > On 5/7/15 3:43 PM, Berg, Michael C wrote: > > Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. > > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com ] > Sent: Thursday, May 07, 2015 2:28 PM > To: Berg, Michael C; Roland Westrelin > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > On 5/7/15 2:07 PM, Berg, Michael C wrote: > > Roland, > > VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. > 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. > > > For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. > > Regards, > Vladimir > > > I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). > If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. > > In x86.64.ad: > > > 3463 // Float register operands > 3473 // Double register operands > > I will re-add the comments. > > In chaitain.hpp: > > 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else > > I will fix the comment location. > > Thanks, > Michael > > -----Original Message----- > From: hotspot-compiler-dev > [mailto:hotspot-compiler-dev-bounces at openjdk.java.net ] On Behalf Of > Roland Westrelin > Sent: Thursday, May 07, 2015 11:40 AM > To: Vladimir Kozlov > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > > http://cr.openjdk.java.net/~kvn/8076276/webrev.02 > > This looks good to me. A few minor remarks: > > Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. > > Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, > regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? > > Roland. > > > > > In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. > > Not sure why you dropped: > > 3463 // Float register operands > 3473 // Double register operands > > in x86_64.ad > > In chaitin.hpp: > > 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else > > comment is not aligned with the one below anymore. > > Roland. > > > I also have questions to Michael. 
> > Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? > > Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? > > Thanks, > Vladimir -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.r.rose at oracle.com Tue Jun 2 00:50:41 2015 From: john.r.rose at oracle.com (John Rose) Date: Mon, 1 Jun 2015 17:50:41 -0700 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <5567A74F.4040105@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> <5567A74F.4040105@oracle.com> Message-ID: The important goal, regarding the checks, is to tightly couple the validity checks to the actual loop, without actually putting the checks into the same method as the loop (which is going to be replaced by assembly code!). There should be one copy of the checks and one copy of of the loop itself. The organization of the source code should clearly co-locate the checks and the loop. If these goals are not met, then future changes to the software could introduce calls to the loop which are not properly guarded by validity checks. To do this, you need at least two methods. One can be a wrapper for the loop, and can contain the check code (single copy). Or, one method can be just checks; then each call of the loop method needs to be preceded by a call to the check method. Either pattern will work. There may be other ways to do it, also. For the sake of clarity, I think the validity checks for the intrinsified loop should be called out clearly, which means not mixing them with other validity checks. In the case of 8073108, I'm not sure whether the checks that precede processBlocks are all necessary to the intrinsified loop, or whether some of them are related to the contract of the update method. Putting them in their own method processBlocksChecks would make that more clear and maintainable. It may be that *all* of the check are relevant to the loop, in which case they should be linked more formally to the loop, using a coding pattern that makes it clear. In the code for 8069539, implSquareToLenChecks clearly provides the preconditions for an assembly-coded loop in implSquareToLen to be safely executed. Having two methods instead of one is almost never a problem. Method call overhead is zero in hot code, since everything inlines. I know I'm being picky, but I get that way when working hand-compiled assembly code. HTH, ? John On May 28, 2015, at 4:39 PM, Anthony Scarpino wrote: > > Personally I think it better to not have implSquareToLenChecks() and implMulAddCheck() as separate methods and to have the range check squareToLen and mulAdd. Given these change are about performance, it seems unnecessary to add an extra call to a method. > > While we are changing BigInteger, should a range check for multiplyToLen be added? Or is there a different bug for that? 
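To make the check/loop pattern John describes concrete, here is a minimal Java sketch of the shape under discussion for BigInteger.squareToLen: a public entry point, a separate method holding every precondition of the intrinsified loop, and the loop method itself, which the VM replaces with assembly. The signatures and check bodies below are illustrative guesses and are not the webrev.01 code itself:

    private static int[] squareToLen(int[] x, int len, int[] z) {
        int zlen = len << 1;
        if (z == null || z.length < zlen)
            z = new int[zlen];
        implSquareToLenChecks(x, len, z, zlen);   // all validity checks in one place
        return implSquareToLen(x, len, z, zlen);  // hot loop, intrinsified on supporting CPUs
    }

    // Preconditions for the intrinsic, co-located with the loop so no caller can bypass them.
    private static void implSquareToLenChecks(int[] x, int len, int[] z, int zlen) {
        if (len < 1)
            throw new IllegalArgumentException("invalid input length: " + len);
        if (len > x.length)
            throw new IllegalArgumentException("input length out of bound: " + len);
        if (len * 2 > z.length)
            throw new IllegalArgumentException("input length out of bound: " + (len * 2));
    }

    // Plain-Java squaring loop, elided here; this is the method the compiler recognizes.
    private static int[] implSquareToLen(int[] x, int len, int[] z, int zlen) {
        // ... word-by-word carry-propagating squaring of x[0..len) into z ...
        return z;
    }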
> > Tony > > On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: >> Hi Tony, >> >> Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: >> >> http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >> >> Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. >> >> Best Regards, >> Sandhya >> >> >> -----Original Message----- >> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley >> Sent: Wednesday, May 27, 2015 10:12 AM >> To: Christian Thalinger >> Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net >> Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] >> >> An update: >> >> I'm still working on this. Following last week's revelations [1] it >> seems to me that a faster implementation of (integer) D-H is even more >> important. >> >> I've spent a couple of days tracking down an extremely odd feature >> (bug?) in MutableBigInteger which was breaking everything, but I'm >> past that now. I'm trying to produce an intrinsic implementation of >> the core modular exponentiation which is as fast as any state-of-the- >> art implementation while disrupting the common code as little as >> possible; this is not easy. >> >> I hope to have something which is faster on all processors, not just >> those for which we have hand-coded assembly-language implementations. >> >> I don't think that my work should be any impediment to Sadya's patch >> for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >> being committed. It'll still be useful. >> >> Andrew. >> >> >> [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice >> https://weakdh.org/imperfect-forward-secrecy.pdf >> > From michael.c.berg at intel.com Tue Jun 2 02:01:36 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Tue, 2 Jun 2015 02:01:36 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> <554BF60F.9000401@oracle.com> <554D09E0.4040904@oracle.com> Message-ID: Christian, I will take a look. Regards, Michael From: Christian Thalinger [mailto:christian.thalinger at oracle.com] Sent: Monday, June 01, 2015 2:27 PM To: Berg, Michael C Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 On Jun 1, 2015, at 1:21 PM, Berg, Michael C > wrote: Yes Christian, in restore, we should add like in the other two locations in that code. I have other changes which address more AVX512 extensions, do you want me to just add this change to the webrev I created for that set? Yes. I wanted to ask you something else: currently I?m trying to merge with these changes: http://hg.openjdk.java.net/graal/graal/diff/a560c9b81f0f/src/cpu/x86/vm/sharedRuntime_x86_64.cpp and I?m trying to use enum values for offsets instead of hardcoded ones. Unfortunately I don?t have a good solution yet but maybe you have an idea. I would like to have this change upstream. 
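One possible reading of the enum-values-for-offsets idea mentioned just above: instead of hard-coding 992/1024-style constants at every save and restore site in sharedRuntime_x86_64.cpp, the zmm spill-area layout could be described once and reused. This is a hypothetical sketch only; the names below are invented for illustration and are not taken from the linked graal change:

  enum zmm_hi_layout {
    zmm_hi_slot_size  = 32,                                   // upper 256 bits of one zmm register
    zmm_hi_slot_count = 32,                                   // xmm0 .. xmm31
    zmm_hi_area_size  = zmm_hi_slot_size * zmm_hi_slot_count  // 1024 bytes
  };

  // ...
  __ subptr(rsp, zmm_hi_area_size);
  // the slot for register n then sits at n * zmm_hi_slot_size, e.g. 31 * 32 = 992 for xmm31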
Thanks, Michael From: Christian Thalinger [mailto:christian.thalinger at oracle.com] Sent: Monday, June 01, 2015 10:51 AM To: Berg, Michael C Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 On May 8, 2015, at 12:09 PM, Vladimir Kozlov > wrote: I applied small "cosmetic" changes sent by Michael and I am pushing this. Here is updated webrev for the record (I also removed trailing spaces in .ad files): http://cr.openjdk.java.net/~kvn/8076276/webrev.03 I think I found a bug: http://cr.openjdk.java.net/~kvn/8076276/webrev.03/src/cpu/x86/vm/sharedRuntime_x86_64.cpp.udiff.html in RegisterSaver::restore_live_registers: __ vinsertf64x4h(xmm31, Address(rsp, 992)); __ subptr(rsp, 1024); Shouldn?t that be an add? Thanks, Vladimir On 5/7/15 4:41 PM, Berg, Michael C wrote: Ok, I will leave that way it is plus the comment. Thanks, -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, May 07, 2015 4:33 PM To: Berg, Michael C; Roland Westrelin Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. Thanks, Vladimir On 5/7/15 3:43 PM, Berg, Michael C wrote: Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, May 07, 2015 2:28 PM To: Berg, Michael C; Roland Westrelin Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 On 5/7/15 2:07 PM, Berg, Michael C wrote: Roland, VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. Regards, Vladimir I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. In x86.64.ad: 3463 // Float register operands 3473 // Double register operands I will re-add the comments. In chaitain.hpp: 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else I will fix the comment location. 
Thanks, Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin Sent: Thursday, May 07, 2015 11:40 AM To: Vladimir Kozlov Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 http://cr.openjdk.java.net/~kvn/8076276/webrev.02 This looks good to me. A few minor remarks: Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? I'm talking about: 4101 #ifdef _LP64 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ for instance in x86.ad Why isn't it in x86_64.ad? Roland. In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. Not sure why you dropped: 3463 // Float register operands 3473 // Double register operands in x86_64.ad In chaitin.hpp: 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else comment is not aligned with the one below anymore. Roland. I also have questions to Michael. Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? Thanks, Vladimir -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandhya.viswanathan at intel.com Tue Jun 2 21:56:06 2015 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Tue, 2 Jun 2015 21:56:06 +0000 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> <5567A74F.4040105@oracle.com> Message-ID: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337CC68@FMSMSX112.amr.corp.intel.com> Hi John/Tony, Thanks a lot for your comments and inputs. Hi Vladimir, The patch for 8069539 has the checks as per John's email. Please advise if the patch looks ok to you for the next steps. Let me know if I need to make any changes. Best Regards, Sandhya -----Original Message----- From: John Rose [mailto:john.r.rose at oracle.com] Sent: Monday, June 01, 2015 5:51 PM To: Anthony Scarpino Cc: Viswanathan, Sandhya; Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net Subject: Re: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] The important goal, regarding the checks, is to tightly couple the validity checks to the actual loop, without actually putting the checks into the same method as the loop (which is going to be replaced by assembly code!). There should be one copy of the checks and one copy of of the loop itself. The organization of the source code should clearly co-locate the checks and the loop. 
If these goals are not met, then future changes to the software could introduce calls to the loop which are not properly guarded by validity checks. To do this, you need at least two methods. One can be a wrapper for the loop, and can contain the check code (single copy). Or, one method can be just checks; then each call of the loop method needs to be preceded by a call to the check method. Either pattern will work. There may be other ways to do it, also. For the sake of clarity, I think the validity checks for the intrinsified loop should be called out clearly, which means not mixing them with other validity checks. In the case of 8073108, I'm not sure whether the checks that precede processBlocks are all necessary to the intrinsified loop, or whether some of them are related to the contract of the update method. Putting them in their own method processBlocksChecks would make that more clear and maintainable. It may be that *all* of the check are relevant to the loop, in which case they should be linked more formally to the loop, using a coding pattern that makes it clear. In the code for 8069539, implSquareToLenChecks clearly provides the preconditions for an assembly-coded loop in implSquareToLen to be safely executed. Having two methods instead of one is almost never a problem. Method call overhead is zero in hot code, since everything inlines. I know I'm being picky, but I get that way when working hand-compiled assembly code. HTH, - John On May 28, 2015, at 4:39 PM, Anthony Scarpino wrote: > > Personally I think it better to not have implSquareToLenChecks() and implMulAddCheck() as separate methods and to have the range check squareToLen and mulAdd. Given these change are about performance, it seems unnecessary to add an extra call to a method. > > While we are changing BigInteger, should a range check for multiplyToLen be added? Or is there a different bug for that? > > Tony > > On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: >> Hi Tony, >> >> Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: >> >> http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >> >> Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. >> >> Best Regards, >> Sandhya >> >> >> -----Original Message----- >> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley >> Sent: Wednesday, May 27, 2015 10:12 AM >> To: Christian Thalinger >> Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net >> Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] >> >> An update: >> >> I'm still working on this. Following last week's revelations [1] it >> seems to me that a faster implementation of (integer) D-H is even more >> important. >> >> I've spent a couple of days tracking down an extremely odd feature >> (bug?) in MutableBigInteger which was breaking everything, but I'm >> past that now. I'm trying to produce an intrinsic implementation of >> the core modular exponentiation which is as fast as any state-of-the- >> art implementation while disrupting the common code as little as >> possible; this is not easy. >> >> I hope to have something which is faster on all processors, not just >> those for which we have hand-coded assembly-language implementations. >> >> I don't think that my work should be any impediment to Sadya's patch >> for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >> being committed. 
It'll still be useful. >> >> Andrew. >> >> >> [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice >> https://weakdh.org/imperfect-forward-secrecy.pdf >> > From anthony.scarpino at oracle.com Tue Jun 2 22:00:03 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Tue, 02 Jun 2015 15:00:03 -0700 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> <5567A74F.4040105@oracle.com> Message-ID: <556E2763.9050101@oracle.com> Ok.. If there is no cost to having two methods then my comment to combined them isn't important.. It's fine the way the patch is. Tony On 06/01/2015 05:50 PM, John Rose wrote: > The important goal, regarding the checks, is to tightly couple the > validity checks to the actual loop, without actually putting the > checks into the same method as the loop (which is going to be > replaced by assembly code!). There should be one copy of the checks > and one copy of of the loop itself. The organization of the source > code should clearly co-locate the checks and the loop. If these > goals are not met, then future changes to the software could > introduce calls to the loop which are not properly guarded by > validity checks. > > To do this, you need at least two methods. One can be a wrapper for > the loop, and can contain the check code (single copy). Or, one > method can be just checks; then each call of the loop method needs to > be preceded by a call to the check method. Either pattern will work. > There may be other ways to do it, also. > > For the sake of clarity, I think the validity checks for the > intrinsified loop should be called out clearly, which means not > mixing them with other validity checks. In the case of 8073108, I'm > not sure whether the checks that precede processBlocks are all > necessary to the intrinsified loop, or whether some of them are > related to the contract of the update method. Putting them in their > own method processBlocksChecks would make that more clear and > maintainable. It may be that *all* of the check are relevant to the > loop, in which case they should be linked more formally to the loop, > using a coding pattern that makes it clear. In the code for 8069539, > implSquareToLenChecks clearly provides the preconditions for an > assembly-coded loop in implSquareToLen to be safely executed. > > Having two methods instead of one is almost never a problem. Method > call overhead is zero in hot code, since everything inlines. > > I know I'm being picky, but I get that way when working hand-compiled > assembly code. > > HTH, ? John > > On May 28, 2015, at 4:39 PM, Anthony Scarpino > wrote: >> >> Personally I think it better to not have implSquareToLenChecks() >> and implMulAddCheck() as separate methods and to have the range >> check squareToLen and mulAdd. 
Given these change are about >> performance, it seems unnecessary to add an extra call to a >> method. >> >> While we are changing BigInteger, should a range check for >> multiplyToLen be added? Or is there a different bug for that? >> >> Tony >> >> On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: >>> Hi Tony, >>> >>> Please let us know if you are ok with the changes in >>> BigInteger.java (range checks) in patch from Intel: >>> >>> http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>> >>> Per Andrew's email below we could go ahead with this patch and it >>> shouldn't affect his work. >>> >>> Best Regards, Sandhya >>> >>> >>> -----Original Message----- From: hotspot-compiler-dev >>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf >>> Of Andrew Haley Sent: Wednesday, May 27, 2015 10:12 AM To: >>> Christian Thalinger Cc: Vladimir Kozlov; >>> hotspot-compiler-dev at openjdk.java.net Subject: RSA and >>> Diffie-Hellman performance [Was: RFR(L): 8069539: RSA >>> acceleration] >>> >>> An update: >>> >>> I'm still working on this. Following last week's revelations [1] >>> it seems to me that a faster implementation of (integer) D-H is >>> even more important. >>> >>> I've spent a couple of days tracking down an extremely odd >>> feature (bug?) in MutableBigInteger which was breaking >>> everything, but I'm past that now. I'm trying to produce an >>> intrinsic implementation of the core modular exponentiation which >>> is as fast as any state-of-the- art implementation while >>> disrupting the common code as little as possible; this is not >>> easy. >>> >>> I hope to have something which is faster on all processors, not >>> just those for which we have hand-coded assembly-language >>> implementations. >>> >>> I don't think that my work should be any impediment to Sadya's >>> patch for squareToLen at >>> http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ being >>> committed. It'll still be useful. >>> >>> Andrew. >>> >>> >>> [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in >>> Practice https://weakdh.org/imperfect-forward-secrecy.pdf >>> >> > From vladimir.kozlov at oracle.com Tue Jun 2 23:25:15 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 02 Jun 2015 16:25:15 -0700 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337CC68@FMSMSX112.amr.corp.intel.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> <5567A74F.4040105@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337CC68@FMSMSX112.amr.corp.intel.com> Message-ID: <556E3B5B.9040505@oracle.com> On 6/2/15 2:56 PM, Viswanathan, Sandhya wrote: > > Hi John/Tony, Thanks a lot for your comments and inputs. > > Hi Vladimir, The patch for 8069539 has the checks as per John's email. Please advise if the patch looks ok to you for the next steps. Let me know if I need to make any changes. I think webrev.01 is good. 
We can push it since "JEP 246: Leverage CPU Instructions for GHASH and RSA" is in "Targeted" state (thanks Tony!). If nobody objects I can push it. Thanks, Vladimir > > Best Regards, > Sandhya > > > -----Original Message----- > From: John Rose [mailto:john.r.rose at oracle.com] > Sent: Monday, June 01, 2015 5:51 PM > To: Anthony Scarpino > Cc: Viswanathan, Sandhya; Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net > Subject: Re: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] > > The important goal, regarding the checks, is to tightly couple the validity checks to the actual loop, without actually putting the checks into the same method as the loop (which is going to be replaced by assembly code!). There should be one copy of the checks and one copy of of the loop itself. The organization of the source code should clearly co-locate the checks and the loop. If these goals are not met, then future changes to the software could introduce calls to the loop which are not properly guarded by validity checks. > > To do this, you need at least two methods. One can be a wrapper for the loop, and can contain the check code (single copy). Or, one method can be just checks; then each call of the loop method needs to be preceded by a call to the check method. Either pattern will work. There may be other ways to do it, also. > > For the sake of clarity, I think the validity checks for the intrinsified loop should be called out clearly, which means not mixing them with other validity checks. In the case of 8073108, I'm not sure whether the checks that precede processBlocks are all necessary to the intrinsified loop, or whether some of them are related to the contract of the update method. Putting them in their own method processBlocksChecks would make that more clear and maintainable. It may be that *all* of the check are relevant to the loop, in which case they should be linked more formally to the loop, using a coding pattern that makes it clear. In the code for 8069539, implSquareToLenChecks clearly provides the preconditions for an assembly-coded loop in implSquareToLen to be safely executed. > > Having two methods instead of one is almost never a problem. Method call overhead is zero in hot code, since everything inlines. > > I know I'm being picky, but I get that way when working hand-compiled assembly code. > > HTH, > - John > > On May 28, 2015, at 4:39 PM, Anthony Scarpino wrote: >> >> Personally I think it better to not have implSquareToLenChecks() and implMulAddCheck() as separate methods and to have the range check squareToLen and mulAdd. Given these change are about performance, it seems unnecessary to add an extra call to a method. >> >> While we are changing BigInteger, should a range check for multiplyToLen be added? Or is there a different bug for that? >> >> Tony >> >> On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: >>> Hi Tony, >>> >>> Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: >>> >>> http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>> >>> Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. 
>>> >>> Best Regards, >>> Sandhya >>> >>> >>> -----Original Message----- >>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley >>> Sent: Wednesday, May 27, 2015 10:12 AM >>> To: Christian Thalinger >>> Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net >>> Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] >>> >>> An update: >>> >>> I'm still working on this. Following last week's revelations [1] it >>> seems to me that a faster implementation of (integer) D-H is even more >>> important. >>> >>> I've spent a couple of days tracking down an extremely odd feature >>> (bug?) in MutableBigInteger which was breaking everything, but I'm >>> past that now. I'm trying to produce an intrinsic implementation of >>> the core modular exponentiation which is as fast as any state-of-the- >>> art implementation while disrupting the common code as little as >>> possible; this is not easy. >>> >>> I hope to have something which is faster on all processors, not just >>> those for which we have hand-coded assembly-language implementations. >>> >>> I don't think that my work should be any impediment to Sadya's patch >>> for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>> being committed. It'll still be useful. >>> >>> Andrew. >>> >>> >>> [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice >>> https://weakdh.org/imperfect-forward-secrecy.pdf >>> >> > From anthony.scarpino at oracle.com Wed Jun 3 00:51:26 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Tue, 2 Jun 2015 17:51:26 -0700 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <556E3B5B.9040505@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> <5567A74F.4040105@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337CC68@FMSMSX112.amr.corp.intel.com> <556E3B5B.9040505@oracle.com> Message-ID: <234223F6-0BAF-4024-B6AA-0286D8C44322@oracle.com> On Jun 2, 2015, at 4:25 PM, Vladimir Kozlov wrote: > On 6/2/15 2:56 PM, Viswanathan, Sandhya wrote: >> >> Hi John/Tony, Thanks a lot for your comments and inputs. >> >> Hi Vladimir, The patch for 8069539 has the checks as per John's email. Please advise if the patch looks ok to you for the next steps. Let me know if I need to make any changes. > > I think webrev.01 is good. > > We can push it since "JEP 246: Leverage CPU Instructions for GHASH and RSA" is in "Targeted" state (thanks Tony!). > > If nobody objects I can push it. That is fine with me. I?m not sure if subtasks can be used to push into the repo, but I have one for the RSA work that you can take ownership of and use, JDK-8069539 which is attached to the JEP. If you need a bug/rfe, file a new one and let me know so I can link it to the JEP. 
thanks Tony > > Thanks, > Vladimir > >> >> Best Regards, >> Sandhya >> >> >> -----Original Message----- >> From: John Rose [mailto:john.r.rose at oracle.com] >> Sent: Monday, June 01, 2015 5:51 PM >> To: Anthony Scarpino >> Cc: Viswanathan, Sandhya; Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] >> >> The important goal, regarding the checks, is to tightly couple the validity checks to the actual loop, without actually putting the checks into the same method as the loop (which is going to be replaced by assembly code!). There should be one copy of the checks and one copy of of the loop itself. The organization of the source code should clearly co-locate the checks and the loop. If these goals are not met, then future changes to the software could introduce calls to the loop which are not properly guarded by validity checks. >> >> To do this, you need at least two methods. One can be a wrapper for the loop, and can contain the check code (single copy). Or, one method can be just checks; then each call of the loop method needs to be preceded by a call to the check method. Either pattern will work. There may be other ways to do it, also. >> >> For the sake of clarity, I think the validity checks for the intrinsified loop should be called out clearly, which means not mixing them with other validity checks. In the case of 8073108, I'm not sure whether the checks that precede processBlocks are all necessary to the intrinsified loop, or whether some of them are related to the contract of the update method. Putting them in their own method processBlocksChecks would make that more clear and maintainable. It may be that *all* of the check are relevant to the loop, in which case they should be linked more formally to the loop, using a coding pattern that makes it clear. In the code for 8069539, implSquareToLenChecks clearly provides the preconditions for an assembly-coded loop in implSquareToLen to be safely executed. >> >> Having two methods instead of one is almost never a problem. Method call overhead is zero in hot code, since everything inlines. >> >> I know I'm being picky, but I get that way when working hand-compiled assembly code. >> >> HTH, >> - John >> >> On May 28, 2015, at 4:39 PM, Anthony Scarpino wrote: >>> >>> Personally I think it better to not have implSquareToLenChecks() and implMulAddCheck() as separate methods and to have the range check squareToLen and mulAdd. Given these change are about performance, it seems unnecessary to add an extra call to a method. >>> >>> While we are changing BigInteger, should a range check for multiplyToLen be added? Or is there a different bug for that? >>> >>> Tony >>> >>> On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: >>>> Hi Tony, >>>> >>>> Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: >>>> >>>> http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>>> >>>> Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. 
>>>> >>>> Best Regards, >>>> Sandhya >>>> >>>> >>>> -----Original Message----- >>>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley >>>> Sent: Wednesday, May 27, 2015 10:12 AM >>>> To: Christian Thalinger >>>> Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net >>>> Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] >>>> >>>> An update: >>>> >>>> I'm still working on this. Following last week's revelations [1] it >>>> seems to me that a faster implementation of (integer) D-H is even more >>>> important. >>>> >>>> I've spent a couple of days tracking down an extremely odd feature >>>> (bug?) in MutableBigInteger which was breaking everything, but I'm >>>> past that now. I'm trying to produce an intrinsic implementation of >>>> the core modular exponentiation which is as fast as any state-of-the- >>>> art implementation while disrupting the common code as little as >>>> possible; this is not easy. >>>> >>>> I hope to have something which is faster on all processors, not just >>>> those for which we have hand-coded assembly-language implementations. >>>> >>>> I don't think that my work should be any impediment to Sadya's patch >>>> for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>>> being committed. It'll still be useful. >>>> >>>> Andrew. >>>> >>>> >>>> [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice >>>> https://weakdh.org/imperfect-forward-secrecy.pdf >>>> >>> >> From vladimir.kozlov at oracle.com Wed Jun 3 02:00:21 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 02 Jun 2015 19:00:21 -0700 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <234223F6-0BAF-4024-B6AA-0286D8C44322@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> <5567A74F.4040105@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337CC68@FMSMSX112.amr.corp.intel.com> <556E3B5B.9040505@oracle.com> <234223F6-0BAF-4024-B6AA-0286D8C44322@oracle.com> Message-ID: <556E5FB5.1070007@oracle.com> I created RFE which is linked to sub-task JDK-8069539: https://bugs.openjdk.java.net/browse/JDK-8081778 Vladimir On 6/2/15 5:51 PM, Anthony Scarpino wrote: > On Jun 2, 2015, at 4:25 PM, Vladimir Kozlov wrote: > >> On 6/2/15 2:56 PM, Viswanathan, Sandhya wrote: >>> >>> Hi John/Tony, Thanks a lot for your comments and inputs. >>> >>> Hi Vladimir, The patch for 8069539 has the checks as per John's email. Please advise if the patch looks ok to you for the next steps. Let me know if I need to make any changes. >> >> I think webrev.01 is good. >> >> We can push it since "JEP 246: Leverage CPU Instructions for GHASH and RSA" is in "Targeted" state (thanks Tony!). >> >> If nobody objects I can push it. > > That is fine with me. 
> > I?m not sure if subtasks can be used to push into the repo, but I have one for the RSA work that you can take ownership of and use, JDK-8069539 which is attached to the JEP. > If you need a bug/rfe, file a new one and let me know so I can link it to the JEP. > > thanks > > Tony > >> >> Thanks, >> Vladimir >> >>> >>> Best Regards, >>> Sandhya >>> >>> >>> -----Original Message----- >>> From: John Rose [mailto:john.r.rose at oracle.com] >>> Sent: Monday, June 01, 2015 5:51 PM >>> To: Anthony Scarpino >>> Cc: Viswanathan, Sandhya; Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net >>> Subject: Re: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] >>> >>> The important goal, regarding the checks, is to tightly couple the validity checks to the actual loop, without actually putting the checks into the same method as the loop (which is going to be replaced by assembly code!). There should be one copy of the checks and one copy of of the loop itself. The organization of the source code should clearly co-locate the checks and the loop. If these goals are not met, then future changes to the software could introduce calls to the loop which are not properly guarded by validity checks. >>> >>> To do this, you need at least two methods. One can be a wrapper for the loop, and can contain the check code (single copy). Or, one method can be just checks; then each call of the loop method needs to be preceded by a call to the check method. Either pattern will work. There may be other ways to do it, also. >>> >>> For the sake of clarity, I think the validity checks for the intrinsified loop should be called out clearly, which means not mixing them with other validity checks. In the case of 8073108, I'm not sure whether the checks that precede processBlocks are all necessary to the intrinsified loop, or whether some of them are related to the contract of the update method. Putting them in their own method processBlocksChecks would make that more clear and maintainable. It may be that *all* of the check are relevant to the loop, in which case they should be linked more formally to the loop, using a coding pattern that makes it clear. In the code for 8069539, implSquareToLenChecks clearly provides the preconditions for an assembly-coded loop in implSquareToLen to be safely executed. >>> >>> Having two methods instead of one is almost never a problem. Method call overhead is zero in hot code, since everything inlines. >>> >>> I know I'm being picky, but I get that way when working hand-compiled assembly code. >>> >>> HTH, >>> - John >>> >>> On May 28, 2015, at 4:39 PM, Anthony Scarpino wrote: >>>> >>>> Personally I think it better to not have implSquareToLenChecks() and implMulAddCheck() as separate methods and to have the range check squareToLen and mulAdd. Given these change are about performance, it seems unnecessary to add an extra call to a method. >>>> >>>> While we are changing BigInteger, should a range check for multiplyToLen be added? Or is there a different bug for that? >>>> >>>> Tony >>>> >>>> On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: >>>>> Hi Tony, >>>>> >>>>> Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: >>>>> >>>>> http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>>>> >>>>> Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. 
>>>>> >>>>> Best Regards, >>>>> Sandhya >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley >>>>> Sent: Wednesday, May 27, 2015 10:12 AM >>>>> To: Christian Thalinger >>>>> Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net >>>>> Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] >>>>> >>>>> An update: >>>>> >>>>> I'm still working on this. Following last week's revelations [1] it >>>>> seems to me that a faster implementation of (integer) D-H is even more >>>>> important. >>>>> >>>>> I've spent a couple of days tracking down an extremely odd feature >>>>> (bug?) in MutableBigInteger which was breaking everything, but I'm >>>>> past that now. I'm trying to produce an intrinsic implementation of >>>>> the core modular exponentiation which is as fast as any state-of-the- >>>>> art implementation while disrupting the common code as little as >>>>> possible; this is not easy. >>>>> >>>>> I hope to have something which is faster on all processors, not just >>>>> those for which we have hand-coded assembly-language implementations. >>>>> >>>>> I don't think that my work should be any impediment to Sadya's patch >>>>> for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>>>> being committed. It'll still be useful. >>>>> >>>>> Andrew. >>>>> >>>>> >>>>> [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice >>>>> https://weakdh.org/imperfect-forward-secrecy.pdf >>>>> >>>> >>> > From michael.c.berg at intel.com Wed Jun 3 04:38:23 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Wed, 3 Jun 2015 04:38:23 +0000 Subject: RFR(L): 8081247 AVX 512 extended support code review request Message-ID: Hi Folks, I would like to contribute more AVX512 enabling to facilitate support for machines which utilize EVEX encoding. I need two reviewers to review this patch and comment as needed: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8081247 webrev: http://cr.openjdk.java.net/~mcberg/8081247/webrev.01/ This patch enables BMI code on EVEX targets, improves replication patterns to be more efficient on both EVEX enabled and legacy targets, adds more CPUID based rules for correct code generation on various EVEX enabled servers, extends more call save/restore functionality and extends the vector space further for SIMD operations. Please expedite this review as there is a near term need for the support. Also, as I am not yet a committer, this code needs a sponsor as well. Thanks, Michael -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed Jun 3 16:02:34 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 3 Jun 2015 12:02:34 -0400 Subject: InlineCacheBuffer + GuaranteedSafepointInterval Message-ID: Hi, Could someone please explain the idea behind GuaranteedSafepointInterval induced safepoints being gated on InlineCacheBuffer::is_empty() returning false? 
This code in safepoint.cpp: bool SafepointSynchronize::is_cleanup_needed() { // Need a safepoint if some inline cache buffers is non-empty if (!InlineCacheBuffer::is_empty()) return true; return false; } Looking at icBuffer.cpp: void InlineCacheBuffer::update_inline_caches() { if (buffer()->number_of_stubs() > 1) { if (TraceICBuffer) { tty->print_cr("[updating inline caches with %d stubs]", buffer()->number_of_stubs()); } buffer()->remove_all(); init_next_stub(); } release_pending_icholders(); } What exactly triggers IC holders to be eligible for deletion? The reason behind this question is I'd like to eliminate "unnecessary" safepoints that I'm seeing, but would like to understand implications of this with respect to compiler infrastructure (C2, specifically). I have a fairly large code cache reserved, and the # of compiled methods isn't too big, so space there shouldn't be an issue. Why is GuaranteedSafepointInterval based safepoint actually gated on this particular check? If I turn off background safepoints (i.e. GuaranteedSafepointInterval=0) or set them very far apart, am I risking stability problems, at least in terms of compiler? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Wed Jun 3 22:10:38 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 03 Jun 2015 15:10:38 -0700 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <556E5FB5.1070007@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> <5567A74F.4040105@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337CC68@FMSMSX112.amr.corp.intel.com> <556E3B5B.9040505@oracle.com> <234223F6-0BAF-4024-B6AA-0286D8C44322@oracle.com> <556E5FB5.1070007@oracle.com> Message-ID: <556F7B5E.70807@oracle.com> There were several problems my testing found. First, almost all files had trailing spaces. Java code had also TABs. Second, I have to increase stub buffer size 'code_size2' since I hit on windows assert: # Internal Error (C:\jprt\T\P1\201627.vkozlov\s\hotspot\src\share\vm\asm/codeBuffer.hpp:176), pid=37668, tid=0x0000000000003514 # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000002b80146ce0 <= 0x0000002b8014c2d1 <= 0x0000002b8014c2d0 V [jvm.dll+0x373471] ?report_vm_error@@YAXPEBDH00 at Z+0x71 V [jvm.dll+0x50d53] ?set_end at CodeSection@@QEAAXPEAE at Z+0x73 V [jvm.dll+0x4526e5] ?square_to_len at MacroAssembler@@QEAAXPEAVRegisterImpl@@0000000000 at Z+0x5f5 V [jvm.dll+0x4818f5] ?generate_squareToLen at StubGenerator@@AEAAPEAEXZ+0x145 I fixed all that and now is pushing. 
Thanks, Vladimir On 6/2/15 7:00 PM, Vladimir Kozlov wrote: > I created RFE which is linked to sub-task JDK-8069539: > > https://bugs.openjdk.java.net/browse/JDK-8081778 > > Vladimir > > On 6/2/15 5:51 PM, Anthony Scarpino wrote: >> On Jun 2, 2015, at 4:25 PM, Vladimir Kozlov >> wrote: >> >>> On 6/2/15 2:56 PM, Viswanathan, Sandhya wrote: >>>> >>>> Hi John/Tony, Thanks a lot for your comments and inputs. >>>> >>>> Hi Vladimir, The patch for 8069539 has the checks as per John's >>>> email. Please advise if the patch looks ok to you for the next >>>> steps. Let me know if I need to make any changes. >>> >>> I think webrev.01 is good. >>> >>> We can push it since "JEP 246: Leverage CPU Instructions for GHASH >>> and RSA" is in "Targeted" state (thanks Tony!). >>> >>> If nobody objects I can push it. >> >> That is fine with me. >> >> I?m not sure if subtasks can be used to push into the repo, but I have >> one for the RSA work that you can take ownership of and use, >> JDK-8069539 which is attached to the JEP. >> If you need a bug/rfe, file a new one and let me know so I can link it >> to the JEP. >> >> thanks >> >> Tony >> >>> >>> Thanks, >>> Vladimir >>> >>>> >>>> Best Regards, >>>> Sandhya >>>> >>>> >>>> -----Original Message----- >>>> From: John Rose [mailto:john.r.rose at oracle.com] >>>> Sent: Monday, June 01, 2015 5:51 PM >>>> To: Anthony Scarpino >>>> Cc: Viswanathan, Sandhya; Vladimir Kozlov; >>>> hotspot-compiler-dev at openjdk.java.net >>>> Subject: Re: RSA and Diffie-Hellman performance [Was: RFR(L): >>>> 8069539: RSA acceleration] >>>> >>>> The important goal, regarding the checks, is to tightly couple the >>>> validity checks to the actual loop, without actually putting the >>>> checks into the same method as the loop (which is going to be >>>> replaced by assembly code!). There should be one copy of the checks >>>> and one copy of of the loop itself. The organization of the source >>>> code should clearly co-locate the checks and the loop. If these >>>> goals are not met, then future changes to the software could >>>> introduce calls to the loop which are not properly guarded by >>>> validity checks. >>>> >>>> To do this, you need at least two methods. One can be a wrapper for >>>> the loop, and can contain the check code (single copy). Or, one >>>> method can be just checks; then each call of the loop method needs >>>> to be preceded by a call to the check method. Either pattern will >>>> work. There may be other ways to do it, also. >>>> >>>> For the sake of clarity, I think the validity checks for the >>>> intrinsified loop should be called out clearly, which means not >>>> mixing them with other validity checks. In the case of 8073108, I'm >>>> not sure whether the checks that precede processBlocks are all >>>> necessary to the intrinsified loop, or whether some of them are >>>> related to the contract of the update method. Putting them in their >>>> own method processBlocksChecks would make that more clear and >>>> maintainable. It may be that *all* of the check are relevant to the >>>> loop, in which case they should be linked more formally to the loop, >>>> using a coding pattern that makes it clear. In the code for >>>> 8069539, implSquareToLenChecks clearly provides the preconditions >>>> for an assembly-coded loop in implSquareToLen to be safely executed. >>>> >>>> Having two methods instead of one is almost never a problem. Method >>>> call overhead is zero in hot code, since everything inlines. 
>>>> >>>> I know I'm being picky, but I get that way when working >>>> hand-compiled assembly code. >>>> >>>> HTH, >>>> - John >>>> >>>> On May 28, 2015, at 4:39 PM, Anthony Scarpino >>>> wrote: >>>>> >>>>> Personally I think it better to not have implSquareToLenChecks() >>>>> and implMulAddCheck() as separate methods and to have the range >>>>> check squareToLen and mulAdd. Given these change are about >>>>> performance, it seems unnecessary to add an extra call to a method. >>>>> >>>>> While we are changing BigInteger, should a range check for >>>>> multiplyToLen be added? Or is there a different bug for that? >>>>> >>>>> Tony >>>>> >>>>> On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: >>>>>> Hi Tony, >>>>>> >>>>>> Please let us know if you are ok with the changes in >>>>>> BigInteger.java (range checks) in patch from Intel: >>>>>> >>>>>> http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>>>>> >>>>>> Per Andrew's email below we could go ahead with this patch and it >>>>>> shouldn't affect his work. >>>>>> >>>>>> Best Regards, >>>>>> Sandhya >>>>>> >>>>>> >>>>>> -----Original Message----- >>>>>> From: hotspot-compiler-dev >>>>>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf >>>>>> Of Andrew Haley >>>>>> Sent: Wednesday, May 27, 2015 10:12 AM >>>>>> To: Christian Thalinger >>>>>> Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net >>>>>> Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: >>>>>> RSA acceleration] >>>>>> >>>>>> An update: >>>>>> >>>>>> I'm still working on this. Following last week's revelations [1] it >>>>>> seems to me that a faster implementation of (integer) D-H is even >>>>>> more >>>>>> important. >>>>>> >>>>>> I've spent a couple of days tracking down an extremely odd feature >>>>>> (bug?) in MutableBigInteger which was breaking everything, but I'm >>>>>> past that now. I'm trying to produce an intrinsic implementation of >>>>>> the core modular exponentiation which is as fast as any state-of-the- >>>>>> art implementation while disrupting the common code as little as >>>>>> possible; this is not easy. >>>>>> >>>>>> I hope to have something which is faster on all processors, not just >>>>>> those for which we have hand-coded assembly-language implementations. >>>>>> >>>>>> I don't think that my work should be any impediment to Sadya's patch >>>>>> for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ >>>>>> being committed. It'll still be useful. >>>>>> >>>>>> Andrew. >>>>>> >>>>>> >>>>>> [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice >>>>>> https://weakdh.org/imperfect-forward-secrecy.pdf >>>>>> >>>>> >>>> >> From andreas.lundblad at oracle.com Wed Jun 3 20:16:07 2015 From: andreas.lundblad at oracle.com (Andreas Lundblad) Date: Wed, 3 Jun 2015 22:16:07 +0200 Subject: Why isn't Object.notify() a synchronized method? In-Reply-To: <556A8E9D.4040407@oracle.com> References: <55673D97.4010505@CoSoCo.de> <5567420A.9020406@redhat.com> <5568985C.6050507@CoSoCo.de> <556A8E9D.4040407@oracle.com> Message-ID: <20150603201607.GB4474@e6430> On Sun, May 31, 2015 at 02:31:25PM +1000, David Holmes wrote: > >As I recently fell into the trap of forgetting the synchronized block > >around a single notifyAll(), I believe, the current situation is just > >errorprone. > > How is it errorprone? You forgot to acquire the lock and you got an > IllegalMonitorStateException when you did notifyAll. That's pointing > out your error. 
The reason for not making wait/notify synchronized is *not* that "it would be unnecessary because you typically already hold the lock anyway". The reason is (as David Holms pointed out earlier) that it would be *meaningless* to make them synchronized. When you say it's errorprone it sounds like you first had notifyAll(); and then "fixed" the IllegalMonitorStateException by doing synchronized (this) { notifyAll(); } (and then wrote your original email asking why notifyAll didn't do this "on the inside"). If this is the case you have not understood the intention of using synchronized here. In a nutshell wait and notify is all about thread communication, and for that to work correctly you need to synchronize your execution. Let me try to explain why wrapping *just* notifyAll() in a synchronized block (or relying on making notifyAll synchronized) is broken: Suppose you have a producer / consumer thing going on. In produce() you have something like enqueue(value); synchronized (this) { notifyAll(); // because consumer may be waiting for a value } and in consume() you have something like synchronized (this) { while (noValueIsAvailable()) wait(); } value = retrieve(); Suppose now that ProducerThread enters produce(), enqueues a value and before reaching notifyAll, ConsumerThread comes along, enters consume(), consumes the value, processes the value, calls consume *again*, sees that noValueIsAvailable and calls wait. ProducerThread now continues it's execution and calls notifyAll(). ConsumerThread is woken up, but at this point no value is available despite there was a call to notify! (In the best case, this doesn't crash the program, in worst case, ProducerThread assumed that the value should be processed after notifyAll, in which case you may run into a deadlock. If there were more variables involved you could also have memory visibility issues involved and accidentally break class invariants by doing this.) I've written an answer to a similar question here: "Why must wait() always be in synchronized block" http://stackoverflow.com/a/2779674/276052 best regards, Andreas From brian.goetz at oracle.com Thu Jun 4 00:14:19 2015 From: brian.goetz at oracle.com (Brian Goetz) Date: Wed, 03 Jun 2015 20:14:19 -0400 Subject: Why isn't Object.notify() a synchronized method? In-Reply-To: <5568985C.6050507@CoSoCo.de> References: <55673D97.4010505@CoSoCo.de> <5567420A.9020406@redhat.com> <5568985C.6050507@CoSoCo.de> Message-ID: <556F985B.5010009@oracle.com> The performance issue here is mostly a red herring. The reason notify() is not synchronized has much more to do with correctness; when you "forget" to wrap your notify call with lock acquisition, it is almost always a bug. (The rest of this explanation is probably clearer if you've read JCiP Ch14.) A thread calls notify/notifyAll if it has effected a state change that could cause a waiting thread to become unblocked. Blocking is generally associated with a state predicate ("queue is nonempty", "light is green"), which refers to some state, and that state needs to be guarded by the lock associated with the condition queue. If you've modified the state that underlies the predicate (i.e., you put something in the queue, or switched the light to green), you needed the lock anyway. (Because spurious wakeup is allowed, even the trivial cases like a one-shot "release all threads when timer expires" are better implemented with a boolean state predicate (or better, CountDownLatch) than simply calling notifyAll.) 
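A minimal, self-contained Java sketch of the shape the last two replies describe, with the state change, the predicate test, and the wait/notify all under the same lock (illustrative code, not taken from the thread):

// Producer/consumer with the state predicate and notifyAll under one lock.
final class Mailbox<T> {
    private final java.util.ArrayDeque<T> queue = new java.util.ArrayDeque<>();

    // Producer: enqueue and notify while holding the lock, so a consumer
    // cannot slip in between the state change and the notification.
    synchronized void put(T value) {
        queue.addLast(value);
        notifyAll();               // "a value is now available"
    }

    // Consumer: re-test the predicate in a loop to cope with spurious
    // wakeups and with other consumers taking the value first.
    synchronized T take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();                // releases the lock while waiting
        }
        return queue.removeFirst();
    }
}

Note that the consumer re-tests the predicate in a loop, so a spurious wakeup or a competing consumer simply sends it back to wait(); there is no window in which the notification can be "lost" between the state change and the test.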
So, while its not out of the question that code that wants to call notify without locking isn't wrong, it probably is wrong, and having the library "conveniently" acquire the lock for you is just brushing the mistake under the rug. On 5/29/2015 12:48 PM, Ulf Zibis wrote: > Thanks for your hint David. That's the only reason I could imagine too. > Can somebody tell something about the cost for recursive lock > acquisition in comparison to the whole call, couldn't it be eliminated > by Hotspot? > > As I recently fell into the trap of forgetting the synchronized block > around a single notifyAll(), I believe, the current situation is just > errorprone. > > Any comments about the Javadoc issue? > > -Ulf > > > Am 28.05.2015 um 18:27 schrieb David M. Lloyd: >> Since most of the time you have to hold the lock anyway for other >> reasons, I think this would generally be an unwelcome change since I >> expect the cost of recursive lock acquisition is nonzero. >> >> On 05/28/2015 11:08 AM, Ulf Zibis wrote: >>> Hi all, >>> >>> in the Javadoc of notify(), notifyAll() and wait(...) I read, that this >>> methods should only be used with synchronisation on it's instance. >>> So I'm wondering, why they don't have the synchronized modifier out of >>> the box in Object class. >>> >>> Also I think, the following note should be moved from wait(long,int) to >>> wait(long): >>> /The current thread must own this object's monitor. The thread releases >>> ownership of this monitor and waits until either of the following two >>> conditions has occurred:// >>> / >>> >>> * /Another thread notifies threads waiting on this object's monitor to >>> wake up either through a >>> call to the notify method or the notifyAll method./ >>> * /The timeout period, specified by timeout milliseconds plus nanos >>> nanoseconds arguments, has >>> elapsed. / >>> >>> >>> >>> Cheers, >>> >>> Ulf >>> >> > From vladimir.kozlov at oracle.com Thu Jun 4 00:25:27 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 03 Jun 2015 17:25:27 -0700 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: Message-ID: <556F9AF7.5050301@oracle.com> Thank you, Michael, for this contribution. First, I am fine with such approach - call superword only to collect data about loop. It could be useful for superword optimization to have a pass over loop's nodes to determine if it could be vectorize as separate phase. And avoid to do that in SLP analysis. Make SuperWordLoopUnrollAnalysis flag's default value to 'false' and set it to true only in vm_version_x86.cpp (#ifdef COMPILER2) so that you don't need to modify setting on all platforms. In flag description say what 'slp' means (Superword Level Parallelism). Code style: if (cl->is_reduction_loop() == false) phase->mark_reductions(this); should be: if (!cl->is_reduction_loop()) { phase->mark_reductions(this); } An other one: cl->has_passed_slp() == false. We use ! for such cases. There are following checks after new code which prevent unrolling. Why you do analysis before them without affecting their decision? And you use the result of analysis only later at the end of method. Why not do analysis there then? (_local_loop_unroll_factor > 4) check should be done inside policy_slp_max_unroll() to have only one check. Actually all next code (lines 668-692) could be done at the end of policy_slp_max_unroll(). 
And you don't need to return bool then (I changed name too): 665 if (LoopMaxUnroll > _local_loop_unroll_factor) { 666 // Once policy_slp_analysis succeeds, mark the loop with the 667 // maximal unroll factor so that we minimize analysis passes 668 policy_unroll_slp_analysis(cl, phase); 693 } 694 } 695 696 // Check for initial stride being a small enough constant 697 if (abs(cl->stride_con()) > (1<<2)*future_unroll_ct) return false; I think slp analysis code should be in superword.cpp - SWPointer should be used only there. Just add an other method to SuperWord class. Instead of a->Afree() and next: 810 Arena *a = Thread::current()->resource_area(); 812 size_t ignored_size = _body.size()*sizeof(int*); 813 int *ignored_loop_nodes = (int*)a->Amalloc_D(ignored_size); 814 Node_Stack nstack((int)ignored_size); use: ResourceMark rm; size_t ignored_size = _body.size(); int *ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size); Node_Stack nstack((int)ignored_size); Node_Stack should take number of nodes and not bytes. I am concern about nstack.clear() because you have poped all nodes on it. Thanks, Vladimir On 5/13/15 6:26 PM, Berg, Michael C wrote: > Hi Folks, > > We (Intel) would like to contribute superword loop unrolling analysis to > facilitate more default unrolling and larger SIMD vector mapping. > Please review this patch and comment as needed: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 > > webrev: > > http://cr.openjdk.java.net/~kvn/8080325/webrev/ > > The design finds the highest common vector supported and implemented on > a given machine and applies that to unrolling, iff it is greater than > the default. If the user gates unrolling we will still obey the user > directive. It?s light weight, when we get to the analysis part, if we > fail, we stop further tries. If we succeed we stop further tries. We > generally always analyze only once. We then gate the unroll factor by > extending the size of the unroll segment so that the iterative tries > will keep unrolling, obtaining larger unrolls of a targeted loop. I see > no negative behavior wrt to performance, and a modest positive swing in > default behavior up to 1.5x for some micros. > > Vladimir Koslov has offered to sponsor this patch. > From vladimir.kozlov at oracle.com Thu Jun 4 01:09:34 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 03 Jun 2015 18:09:34 -0700 Subject: RFR(L): 8081247 AVX 512 extended support code review request In-Reply-To: References: Message-ID: <556FA54E.8050001@oracle.com> Hi, Michael assembler_x86.cpp: I don't like that you replaced prefix method with few parameters with method which has a lot of them: - int encode = vex_prefix_0F38_and_encode_q(dst, src1, src2); + int encode = vex_prefix_and_encode(dst->encoding(), src1->encoding(), src2->encoding(), + VEX_SIMD_NONE, VEX_OPCODE_0F_38, true, AVX_128bit, + true, false); Why you did that? stubGenerator_x86_64.cpp: Can we set different loop limit based on UseAVX instead of having 2 loops. x86.ad: Instead of long condition expressions like next: UseAVX > 0 && !VM_Version::supports_avx512vl() && !VM_Version::supports_avx512bw() May be have one VM_Version finction which does these checks. Thanks, Vladimir On 6/2/15 9:38 PM, Berg, Michael C wrote: > Hi Folks, > > I would like to contribute more AVX512 enabling to facilitate support > for machines which utilize EVEX encoding. 
I need two reviewers to review > this patch and comment as needed: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8081247 > > webrev: > > http://cr.openjdk.java.net/~mcberg/8081247/webrev.01/ > > This patch enables BMI code on EVEX targets, improves replication > patterns to be more efficient on both EVEX enabled and legacy targets, > adds more CPUID based rules for correct code generation on various EVEX > enabled servers, extends more call save/restore functionality and > extends the vector space further for SIMD operations. Please expedite > this review as there is a near term need for the support. > > Also, as I am not yet a committer, this code needs a sponsor as well. > > Thanks, > > Michael > From roland.westrelin at oracle.com Thu Jun 4 14:28:41 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 4 Jun 2015 16:28:41 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: <5B87D4E4-FA26-411F-BD03-9F9CAA1E0CBF@oracle.com> References: <5553C1E2.3040003@oracle.com> <555CE2B2.8020305@oracle.com> <5B87D4E4-FA26-411F-BD03-9F9CAA1E0CBF@oracle.com> Message-ID: As suggested by Vladimir, here is a new simpler fix that simply bails out of range check elimination if the preloop is not found: http://cr.openjdk.java.net/~roland/8078866/webrev.01/ Removing the main and post loops will be pushed as a separate RFE. Roland. From roland.westrelin at oracle.com Thu Jun 4 14:34:54 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 4 Jun 2015 16:34:54 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: <5553C1E2.3040003@oracle.com> Message-ID: > If you make pre_end->cmp_node() a variable it would remove some overhead too like you did with main_cmp. Thanks for the suggestion, Michael. I will do that when I?ll send this out for review again as: https://bugs.openjdk.java.net/browse/JDK-8085832 Roland. From roland.westrelin at oracle.com Thu Jun 4 14:54:48 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 4 Jun 2015 16:54:48 +0200 Subject: RFR(S): Optimize main and post loop out when pre loop is found empty Message-ID: <4E659695-625A-4454-A89F-1D45F60E3033@oracle.com> http://cr.openjdk.java.net/~roland/8085832/webrev.00/ This is the change I proposed as a fix for 8078866. Roland. From vladimir.kozlov at oracle.com Thu Jun 4 15:50:30 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 04 Jun 2015 08:50:30 -0700 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: <5553C1E2.3040003@oracle.com> <555CE2B2.8020305@oracle.com> <5B87D4E4-FA26-411F-BD03-9F9CAA1E0CBF@oracle.com> Message-ID: <557073C6.1050002@oracle.com> Good. Thanks, Vladimir On 6/4/15 7:28 AM, Roland Westrelin wrote: > As suggested by Vladimir, here is a new simpler fix that simply bails out of range check elimination if the preloop is not found: > > http://cr.openjdk.java.net/~roland/8078866/webrev.01/ > > Removing the main and post loops will be pushed as a separate RFE. > > Roland. 
> From vladimir.kozlov at oracle.com Thu Jun 4 15:53:44 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 04 Jun 2015 08:53:44 -0700 Subject: RFR(S): 8085832: Optimize main and post loop out when pre loop is found empty In-Reply-To: <4E659695-625A-4454-A89F-1D45F60E3033@oracle.com> References: <4E659695-625A-4454-A89F-1D45F60E3033@oracle.com> Message-ID: <55707488.5040600@oracle.com> Looks fine. Thanks, Vladimir On 6/4/15 7:54 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8085832/webrev.00/ > > This is the change I proposed as a fix for 8078866. > > Roland. > From aph at redhat.com Thu Jun 4 17:08:01 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 04 Jun 2015 18:08:01 +0100 Subject: RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <555708BE.8090100@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> Message-ID: <557085F1.9070703@redhat.com> I'm sorry this is a rather long email, and I pray for your patience. I'm getting close to something I can put forward for review. The performance is encouraging. [ Some background: The kernel of RSA and Diffie-Hellman key exchange is Montgomery multiplication. Optimizing RSA basically comes down to optimizing Montgomery multiplication. The core of OpenJDK's RSA is BigInteger.oddModPow(). ] My Montgomery multiply routine (mostly portable C, with a small assembly-language insert) executes a 1024-bit multiply/reduce in about 2000ns; the hand-coded assembly language equivalent in OpenSSL is faster (as you'd expect) but not very much faster: about 1700ns. In other words, compiled C is only about 20% slower. Firstly, this is pretty remarkable performance by GCC (Yay! Go GCC!) and it shows I'm on the right track. It also shows that there isn't a huge amount to be gained by hand-coding Montgomery multiplication, but anybody who fancies their hand can try to improve on GCC. This is also very nice because porting it to non-x86 hardware is fairly straightforward -- certainly far easier than writing a large assembly- language routine. Here are some numbers for comparison. Today's hs-comp: sign verify sign/s verify/s rsa 512 bits 0.000133s 0.000009s 7508.5 112819.1 rsa 1024 bits 0.000667s 0.000028s 1498.6 35497.2 rsa 2048 bits 0.003867s 0.000097s 258.6 10342.7 rsa 4096 bits 0.026383s 0.000357s 37.9 2799.8 After my patch: sign verify sign/s verify/s rsa 512 bits 0.000071s 0.000005s 14127.3 204112.4 rsa 1024 bits 0.000292s 0.000013s 3424.5 74204.1 rsa 2048 bits 0.001628s 0.000045s 614.4 22399.7 rsa 4096 bits 0.010966s 0.000163s 91.2 6117.8 So, it's about twice as fast we have at present. [ Note that this comparison includes the latest "8081778: Use Intel x64 CPU instructions for RSA acceleration" patch. 
] However, even after my patch OpenJDK is still somewhat slower than OpenSSL, which is: sign verify sign/s verify/s rsa 512 bits 0.000048s 0.000004s 20687.1 257399.4 rsa 1024 bits 0.000189s 0.000011s 5288.3 91711.5 rsa 2048 bits 0.001174s 0.000037s 851.7 27367.2 rsa 4096 bits 0.008475s 0.000137s 118.0 7305.4 [ I am assuming that OpenSSL represents something like the "speed of light" for RSA on x86: this is carefully hand-coded assembly language and C, hand tuned. Getting anywhere near OpenSSL is a major win. ] Here's my problem: Some of this slowdown is due to the overhead of using the JCE, but not very much. Quite a lot of it, however, is due to the fact that the scratch memory used in oddModPow() is a big-endian array of jints. I have to convert the big-endian jints into native jlongs to do the multiply on little-endian x86-64. If I do the word reversal during the multiply (i.e. keep all data in memory in little-endian form, swap words when reading and writing to memory) the performance is horrible: a 1024-bit multiply takes 3000ns, 50% longer. This perhaps isn't very surprising: if you do the word-reversal before the multiplication you have O(N) swaps, if you do it during the multiplication you have O(N^2). I have found that the best thing to do is to word reverse all the data in memory into temporary little-endian arrays and do the work on them. It's much faster, but still really is very annoying: for 1024-bit RSA the word reversal is 14% of the total runtime. It would be nice to keep all of the data in an array of jlongs for the duration of oddModPow(). Here's one idea: I could write a version of oddModPow() which is guaranteed never to use the Java version of the Montgomery multiply. This would use a JNI method which calls the native Montgomery multiply routine, guaranteeing that that we always use that native routine, even from the interpreter and C1. Then I could keep all the internal state in native word order, and all this word-swapping would just go away. This would have the additional benefit that it would be faster when using the interpreter and C1. So, we'd have two versions of oddModPow() in BigInteger, and switch between them depending on whether the platform had support for a native Montgomery multiplication. The downside of having two versions of oddModPow() would, of course, be some code duplication. Or am I just making too much fuss about this? Maybe I should be happy with what I've got. Thank you for reading this far, Andrew. From anthony.scarpino at oracle.com Thu Jun 4 18:32:32 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Thu, 04 Jun 2015 11:32:32 -0700 Subject: RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <557085F1.9070703@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <557085F1.9070703@redhat.com> Message-ID: <557099C0.9050008@oracle.com> On 06/04/2015 10:08 AM, Andrew Haley wrote: > I'm sorry this is a rather long email, and I pray for your patience. 
> > I'm getting close to something I can put forward for review. The > performance is encouraging. > > [ Some background: The kernel of RSA and Diffie-Hellman key exchange > is Montgomery multiplication. Optimizing RSA basically comes down to > optimizing Montgomery multiplication. The core of OpenJDK's RSA is > BigInteger.oddModPow(). ] > > My Montgomery multiply routine (mostly portable C, with a small > assembly-language insert) executes a 1024-bit multiply/reduce in about > 2000ns; the hand-coded assembly language equivalent in OpenSSL is > faster (as you'd expect) but not very much faster: about 1700ns. In > other words, compiled C is only about 20% slower. > > Firstly, this is pretty remarkable performance by GCC (Yay! Go GCC!) > and it shows I'm on the right track. It also shows that there isn't a > huge amount to be gained by hand-coding Montgomery multiplication, but > anybody who fancies their hand can try to improve on GCC. This is > also very nice because porting it to non-x86 hardware is fairly > straightforward -- certainly far easier than writing a large assembly- > language routine. I'm sure when I see the code it will become clearer, but I'm guessing you are taking a C version of Montgomery multiply, using GCC to turn it into assembly with some cpu acceleration instructions and putting that into an intrinsic? > > Here are some numbers for comparison. > > Today's hs-comp: > > sign verify sign/s verify/s > rsa 512 bits 0.000133s 0.000009s 7508.5 112819.1 > rsa 1024 bits 0.000667s 0.000028s 1498.6 35497.2 > rsa 2048 bits 0.003867s 0.000097s 258.6 10342.7 > rsa 4096 bits 0.026383s 0.000357s 37.9 2799.8 > > After my patch: > > sign verify sign/s verify/s > rsa 512 bits 0.000071s 0.000005s 14127.3 204112.4 > rsa 1024 bits 0.000292s 0.000013s 3424.5 74204.1 > rsa 2048 bits 0.001628s 0.000045s 614.4 22399.7 > rsa 4096 bits 0.010966s 0.000163s 91.2 6117.8 > > So, it's about twice as fast we have at present. > > [ Note that this comparison includes the latest "8081778: Use Intel > x64 CPU instructions for RSA acceleration" patch. ] > > However, even after my patch OpenJDK is still somewhat slower than > OpenSSL, which is: > > sign verify sign/s verify/s > rsa 512 bits 0.000048s 0.000004s 20687.1 257399.4 > rsa 1024 bits 0.000189s 0.000011s 5288.3 91711.5 > rsa 2048 bits 0.001174s 0.000037s 851.7 27367.2 > rsa 4096 bits 0.008475s 0.000137s 118.0 7305.4 > > [ I am assuming that OpenSSL represents something like the "speed of > light" for RSA on x86: this is carefully hand-coded assembly language > and C, hand tuned. Getting anywhere near OpenSSL is a major win. ] > > Here's my problem: > > Some of this slowdown is due to the overhead of using the JCE, but not > very much. Quite a lot of it, however, is due to the fact that the > scratch memory used in oddModPow() is a big-endian array of jints. I > have to convert the big-endian jints into native jlongs to do the > multiply on little-endian x86-64. Given you have taken the effort to see the overheads caused by JCE, I'd be interested in see that data to see if there is anything we can do about it. > If I do the word reversal during the multiply (i.e. keep all data in > memory in little-endian form, swap words when reading and writing to > memory) the performance is horrible: a 1024-bit multiply takes 3000ns, > 50% longer. This perhaps isn't very surprising: if you do the > word-reversal before the multiplication you have O(N) swaps, if you do > it during the multiplication you have O(N^2). 
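Purely to illustrate the conversion being discussed (this is not code from the patch, and the helper names are hypothetical), packing BigInteger's big-endian int magnitude into little-endian 64-bit limbs once per call, and unpacking afterwards, looks roughly like this; doing it once up front is the O(N) variant:

// Hypothetical helpers: pack a big-endian int[] magnitude (most significant
// word first, as BigInteger stores it) into little-endian long limbs, and back.
final class Limbs {

    static long[] toLittleEndianLongs(int[] mag) {
        int n = mag.length;
        long[] limbs = new long[(n + 1) / 2];
        for (int i = 0; i < n; i++) {
            long word = mag[n - 1 - i] & 0xFFFFFFFFL;    // i-th least significant word
            limbs[i / 2] |= (i & 1) == 0 ? word : word << 32;
        }
        return limbs;
    }

    static int[] toBigEndianInts(long[] limbs, int words) {
        int[] mag = new int[words];
        for (int i = 0; i < words; i++) {
            long limb = limbs[i / 2];
            int word = (int) ((i & 1) == 0 ? limb : limb >>> 32);
            mag[words - 1 - i] = word;                   // most significant word first
        }
        return mag;
    }
}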
> > I have found that the best thing to do is to word reverse all the data > in memory into temporary little-endian arrays and do the work on them. > It's much faster, but still really is very annoying: for 1024-bit RSA > the word reversal is 14% of the total runtime. > > It would be nice to keep all of the data in an array of jlongs for the > duration of oddModPow(). Here's one idea: I could write a version of > oddModPow() which is guaranteed never to use the Java version of the > Montgomery multiply. This would use a JNI method which calls the > native Montgomery multiply routine, guaranteeing that that we always > use that native routine, even from the interpreter and C1. Then I > could keep all the internal state in native word order, and all this > word-swapping would just go away. This would have the additional > benefit that it would be faster when using the interpreter and C1. > So, we'd have two versions of oddModPow() in BigInteger, and switch > between them depending on whether the platform had support for a > native Montgomery multiplication. I'm assuming the JNI method is closer to OpenSSL numbers? > > The downside of having two versions of oddModPow() would, of course, > be some code duplication. > > Or am I just making too much fuss about this? Maybe I should be happy > with what I've got. Looking at the 2k RSA sign numbers 2.3x better after your patch vs hs-comp. I'd be happy with that improvement. The JNI version maybe be better as a separate JCE provider you can make available for situations where appropriate. Unless there turns out to be something very compelling I wouldn't think we'd integrate it as part security providers given our focus is to use intrinsics rather than JNI at this time. > > Thank you for reading this far, > > Andrew. > From aph at redhat.com Thu Jun 4 18:41:53 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 04 Jun 2015 19:41:53 +0100 Subject: RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <557099C0.9050008@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <557085F1.9070703@redhat.com> <557099C0.9050008@oracle.com> Message-ID: <55709BF1.2070704@redhat.com> On 04/06/15 19:32, Anthony Scarpino wrote: > On 06/04/2015 10:08 AM, Andrew Haley wrote: >> I'm sorry this is a rather long email, and I pray for your patience. >> >> I'm getting close to something I can put forward for review. The >> performance is encouraging. >> >> [ Some background: The kernel of RSA and Diffie-Hellman key exchange >> is Montgomery multiplication. Optimizing RSA basically comes down to >> optimizing Montgomery multiplication. The core of OpenJDK's RSA is >> BigInteger.oddModPow(). ] >> >> My Montgomery multiply routine (mostly portable C, with a small >> assembly-language insert) executes a 1024-bit multiply/reduce in about >> 2000ns; the hand-coded assembly language equivalent in OpenSSL is >> faster (as you'd expect) but not very much faster: about 1700ns. In >> other words, compiled C is only about 20% slower. 
>> >> Firstly, this is pretty remarkable performance by GCC (Yay! Go GCC!) >> and it shows I'm on the right track. It also shows that there isn't a >> huge amount to be gained by hand-coding Montgomery multiplication, but >> anybody who fancies their hand can try to improve on GCC. This is >> also very nice because porting it to non-x86 hardware is fairly >> straightforward -- certainly far easier than writing a large assembly- >> language routine. > > I'm sure when I see the code it will become clearer, but I'm guessing > you are taking a C version of Montgomery multiply, using GCC to turn it > into assembly with some cpu acceleration instructions and putting that > into an intrinsic? I have written a Montgomery multiply routine: it is mostly C, with a tiny bit (really, just a few instructions) of inline assembly language. It's called from a HotSpot intrinsic. >> It would be nice to keep all of the data in an array of jlongs for the >> duration of oddModPow(). Here's one idea: I could write a version of >> oddModPow() which is guaranteed never to use the Java version of the >> Montgomery multiply. This would use a JNI method which calls the >> native Montgomery multiply routine, guaranteeing that that we always >> use that native routine, even from the interpreter and C1. Then I >> could keep all the internal state in native word order, and all this >> word-swapping would just go away. This would have the additional >> benefit that it would be faster when using the interpreter and C1. >> So, we'd have two versions of oddModPow() in BigInteger, and switch >> between them depending on whether the platform had support for a >> native Montgomery multiplication. > > I'm assuming the JNI method is closer to OpenSSL numbers? It's the same Montgomery multiplication as the intrinsic, but called from a JNI method instead of a HotSpot intrinsic. >> The downside of having two versions of oddModPow() would, of course, >> be some code duplication. >> >> Or am I just making too much fuss about this? Maybe I should be happy >> with what I've got. > > Looking at the 2k RSA sign numbers 2.3x better after your patch vs > hs-comp. I'd be happy with that improvement. OK. I kinda guessed that would be the response, really. Andrew. From vitalyd at gmail.com Thu Jun 4 23:20:47 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 4 Jun 2015 19:20:47 -0400 Subject: profiling of branches - odd code generation? Message-ID: Hi, Suppose you have a method like this: private static int f(final int x) { if (x == 0) return 1; else if (x == 1) return 2; else if (x == 2) return 3; return 4; } If I then call it with x=2 always, the generated asm is not what I expect (8u40 with C2 compiler) # parm0: rsi = int # [sp+0x30] (sp of caller) 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) 0x00007fcc5970c527: push %rbp 0x00007fcc5970c528: sub $0x20,%rsp ;*synchronization entry 0x00007fcc5970c52c: test %esi,%esi 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne 0x00007fcc5970c530: cmp $0x1,%esi 0x00007fcc5970c533: je 0x00007fcc5970c571 ;*if_icmpne 0x00007fcc5970c535: cmp $0x2,%esi 0x00007fcc5970c538: jne 0x00007fcc5970c54b ;*if_icmpne 0x00007fcc5970c53a: mov $0x3,%eax 0x00007fcc5970c53f: add $0x20,%rsp 0x00007fcc5970c543: pop %rbp 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 # 0x00007fcc5f51a000 ; {poll_return} 0x00007fcc5970c54a: retq It's checking the if conditions in order, and then jumps to some runtime calls (I'm assuming that's for deopt to restore pruned branches? 
Cause I don't see anything that returns 1 or 2 otherwise). Why is this code not favoring x=2? I'd have thought this code would be something like (after epilogue): cmp $0x2, %esi jne mov $0x3, %eax retq Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From forax at univ-mlv.fr Thu Jun 4 23:31:08 2015 From: forax at univ-mlv.fr (Remi Forax) Date: Fri, 05 Jun 2015 01:31:08 +0200 Subject: profiling of branches - odd code generation? In-Reply-To: References: Message-ID: <5570DFBC.3090305@univ-mlv.fr> The problem is that the bytecode instruction used to restart the interpreter is different for each branch and the JIT is neither able to see that the checks do not do side effect nor able to re-balance a tree of test. cheers, R?mi On 06/05/2015 01:20 AM, Vitaly Davidovich wrote: > Hi, > > Suppose you have a method like this: > > private static int f(final int x) { > if (x == 0) > return 1; > else if (x == 1) > return 2; > else if (x == 2) > return 3; > return 4; > > } > > If I then call it with x=2 always, the generated asm is not what I > expect (8u40 with C2 compiler) > > # parm0: rsi = int > # [sp+0x30] (sp of caller) > 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) > 0x00007fcc5970c527: push %rbp > 0x00007fcc5970c528: sub $0x20,%rsp ;*synchronization entry > > 0x00007fcc5970c52c: test %esi,%esi > 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne > > 0x00007fcc5970c530: cmp $0x1,%esi > 0x00007fcc5970c533: je 0x00007fcc5970c571 ;*if_icmpne > > 0x00007fcc5970c535: cmp $0x2,%esi > 0x00007fcc5970c538: jne 0x00007fcc5970c54b ;*if_icmpne > > 0x00007fcc5970c53a: mov $0x3,%eax > 0x00007fcc5970c53f: add $0x20,%rsp > 0x00007fcc5970c543: pop %rbp > 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 # > 0x00007fcc5f51a000 > ; {poll_return} > 0x00007fcc5970c54a: retq > > > It's checking the if conditions in order, and then jumps to some > runtime calls (I'm assuming that's for deopt to restore pruned > branches? Cause I don't see anything that returns 1 or 2 otherwise). > Why is this code not favoring x=2? I'd have thought this code would be > something like (after epilogue): > > cmp $0x2, %esi > jne > mov $0x3, %eax > retq > > Thanks > From vladimir.kozlov at oracle.com Thu Jun 4 23:36:56 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 04 Jun 2015 16:36:56 -0700 Subject: profiling of branches - odd code generation? In-Reply-To: References: Message-ID: <5570E118.8020607@oracle.com> VM does not profiling values. We profiling branches. When C2 construct control flow graph it follows bytecode. And it can't eliminate cmp code based only on branch profiling. Profiling still shows that all cmp bytecodes are always executed - only branches are not taken. We would eliminate tests if they were on non taken branch. We generate uncommon traps for branches which were not taken based on profiling. 
Vladimir On 6/4/15 4:20 PM, Vitaly Davidovich wrote: > Hi, > > Suppose you have a method like this: > > private static int f(final int x) { > if (x == 0) > return 1; > else if (x == 1) > return 2; > else if (x == 2) > return 3; > return 4; > > } > > If I then call it with x=2 always, the generated asm is not what I > expect (8u40 with C2 compiler) > > # parm0: rsi = int > # [sp+0x30] (sp of caller) > 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) > 0x00007fcc5970c527: push %rbp > 0x00007fcc5970c528: sub $0x20,%rsp ;*synchronization entry > > 0x00007fcc5970c52c: test %esi,%esi > 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne > > 0x00007fcc5970c530: cmp $0x1,%esi > 0x00007fcc5970c533: je 0x00007fcc5970c571 ;*if_icmpne > > 0x00007fcc5970c535: cmp $0x2,%esi > 0x00007fcc5970c538: jne 0x00007fcc5970c54b ;*if_icmpne > > 0x00007fcc5970c53a: mov $0x3,%eax > 0x00007fcc5970c53f: add $0x20,%rsp > 0x00007fcc5970c543: pop %rbp > 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 # > 0x00007fcc5f51a000 > ; {poll_return} > 0x00007fcc5970c54a: retq > > > It's checking the if conditions in order, and then jumps to some runtime > calls (I'm assuming that's for deopt to restore pruned branches? Cause I > don't see anything that returns 1 or 2 otherwise). Why is this code not > favoring x=2? I'd have thought this code would be something like (after > epilogue): > > cmp $0x2, %esi > jne > mov $0x3, %eax > retq > > Thanks > From vitalyd at gmail.com Thu Jun 4 23:42:02 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 4 Jun 2015 19:42:02 -0400 Subject: profiling of branches - odd code generation? In-Reply-To: <5570E118.8020607@oracle.com> References: <5570E118.8020607@oracle.com> Message-ID: Thanks for the response Vladimir. In this case though, can the JIT not see that the cmp bytecodes of non-taken branches have no side effects and remove them altogether? Is that just deemed not worth the cost or is there something fundamental I'm missing here? On Thu, Jun 4, 2015 at 7:36 PM, Vladimir Kozlov wrote: > VM does not profiling values. We profiling branches. > When C2 construct control flow graph it follows bytecode. And it can't > eliminate cmp code based only on branch profiling. Profiling still shows > that all cmp bytecodes are always executed - only branches are not taken. > We would eliminate tests if they were on non taken branch. > We generate uncommon traps for branches which were not taken based on > profiling. 
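For a known hot constant, the usual source-level workarounds are to test that constant first or to use a dense int switch, which javac compiles to a single tableswitch rather than a chain of compares. A hypothetical rewrite of the example, not taken from the thread:

// Variant 1: put the known-hot comparison first by hand.
private static int fHotFirst(final int x) {
    if (x == 2) return 3;   // hot case checked first
    if (x == 0) return 1;
    if (x == 1) return 2;
    return 4;
}

// Variant 2: a dense int switch; javac emits one tableswitch bytecode for it,
// so there is a single multi-way branch instead of three separate compares.
private static int fSwitch(final int x) {
    switch (x) {
        case 0:  return 1;
        case 1:  return 2;
        case 2:  return 3;
        default: return 4;
    }
}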
> > Vladimir > > > On 6/4/15 4:20 PM, Vitaly Davidovich wrote: > >> Hi, >> >> Suppose you have a method like this: >> >> private static int f(final int x) { >> if (x == 0) >> return 1; >> else if (x == 1) >> return 2; >> else if (x == 2) >> return 3; >> return 4; >> >> } >> >> If I then call it with x=2 always, the generated asm is not what I >> expect (8u40 with C2 compiler) >> >> # parm0: rsi = int >> # [sp+0x30] (sp of caller) >> 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) >> 0x00007fcc5970c527: push %rbp >> 0x00007fcc5970c528: sub $0x20,%rsp ;*synchronization entry >> >> 0x00007fcc5970c52c: test %esi,%esi >> 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne >> >> 0x00007fcc5970c530: cmp $0x1,%esi >> 0x00007fcc5970c533: je 0x00007fcc5970c571 ;*if_icmpne >> >> 0x00007fcc5970c535: cmp $0x2,%esi >> 0x00007fcc5970c538: jne 0x00007fcc5970c54b ;*if_icmpne >> >> 0x00007fcc5970c53a: mov $0x3,%eax >> 0x00007fcc5970c53f: add $0x20,%rsp >> 0x00007fcc5970c543: pop %rbp >> 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 # >> 0x00007fcc5f51a000 >> ; {poll_return} >> 0x00007fcc5970c54a: retq >> >> >> It's checking the if conditions in order, and then jumps to some runtime >> calls (I'm assuming that's for deopt to restore pruned branches? Cause I >> don't see anything that returns 1 or 2 otherwise). Why is this code not >> favoring x=2? I'd have thought this code would be something like (after >> epilogue): >> >> cmp $0x2, %esi >> jne >> mov $0x3, %eax >> retq >> >> Thanks >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu Jun 4 23:44:48 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 4 Jun 2015 19:44:48 -0400 Subject: profiling of branches - odd code generation? In-Reply-To: References: <5570E118.8020607@oracle.com> Message-ID: By the way, the context for this example is the following. Suppose you have a class with such a method. This class is then used in different java processes such that in each instance only one of those branches is ever taken and the other compares have no side effects. Ideally, the compiled code would favor that fast path, which may not be the first arm of the if/else chain. On Thu, Jun 4, 2015 at 7:42 PM, Vitaly Davidovich wrote: > Thanks for the response Vladimir. In this case though, can the JIT not > see that the cmp bytecodes of non-taken branches have no side effects and > remove them altogether? Is that just deemed not worth the cost or is there > something fundamental I'm missing here? > > On Thu, Jun 4, 2015 at 7:36 PM, Vladimir Kozlov < > vladimir.kozlov at oracle.com> wrote: > >> VM does not profiling values. We profiling branches. >> When C2 construct control flow graph it follows bytecode. And it can't >> eliminate cmp code based only on branch profiling. Profiling still shows >> that all cmp bytecodes are always executed - only branches are not taken. >> We would eliminate tests if they were on non taken branch. >> We generate uncommon traps for branches which were not taken based on >> profiling. 
>> >> Vladimir >> >> >> On 6/4/15 4:20 PM, Vitaly Davidovich wrote: >> >>> Hi, >>> >>> Suppose you have a method like this: >>> >>> private static int f(final int x) { >>> if (x == 0) >>> return 1; >>> else if (x == 1) >>> return 2; >>> else if (x == 2) >>> return 3; >>> return 4; >>> >>> } >>> >>> If I then call it with x=2 always, the generated asm is not what I >>> expect (8u40 with C2 compiler) >>> >>> # parm0: rsi = int >>> # [sp+0x30] (sp of caller) >>> 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) >>> 0x00007fcc5970c527: push %rbp >>> 0x00007fcc5970c528: sub $0x20,%rsp ;*synchronization entry >>> >>> 0x00007fcc5970c52c: test %esi,%esi >>> 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne >>> >>> 0x00007fcc5970c530: cmp $0x1,%esi >>> 0x00007fcc5970c533: je 0x00007fcc5970c571 ;*if_icmpne >>> >>> 0x00007fcc5970c535: cmp $0x2,%esi >>> 0x00007fcc5970c538: jne 0x00007fcc5970c54b ;*if_icmpne >>> >>> 0x00007fcc5970c53a: mov $0x3,%eax >>> 0x00007fcc5970c53f: add $0x20,%rsp >>> 0x00007fcc5970c543: pop %rbp >>> 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 # >>> 0x00007fcc5f51a000 >>> ; {poll_return} >>> 0x00007fcc5970c54a: retq >>> >>> >>> It's checking the if conditions in order, and then jumps to some runtime >>> calls (I'm assuming that's for deopt to restore pruned branches? Cause I >>> don't see anything that returns 1 or 2 otherwise). Why is this code not >>> favoring x=2? I'd have thought this code would be something like (after >>> epilogue): >>> >>> cmp $0x2, %esi >>> jne >>> mov $0x3, %eax >>> retq >>> >>> Thanks >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Fri Jun 5 00:16:41 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 04 Jun 2015 17:16:41 -0700 Subject: profiling of branches - odd code generation? In-Reply-To: References: <5570E118.8020607@oracle.com> Message-ID: <5570EA69.4050201@oracle.com> Uncommon traps are bound to bytecode. If we hit uncommon trap, for example, for (x == 1) test then after deoptimization Interpreter will execute only 'return 2;'. If generated code as you suggested we need to bind uncommon trap to the BCI of the first (x == 0) check so it will be executed in Interpreter after deoptimization. So it is not simple optimization but doable for cases like this (integer checks). Did you tried 'switch' instead? Regards, Vladimir On 6/4/15 4:44 PM, Vitaly Davidovich wrote: > By the way, the context for this example is the following. Suppose you > have a class with such a method. This class is then used in different > java processes such that in each instance only one of those branches is > ever taken and the other compares have no side effects. Ideally, the > compiled code would favor that fast path, which may not be the first arm > of the if/else chain. > > On Thu, Jun 4, 2015 at 7:42 PM, Vitaly Davidovich > wrote: > > Thanks for the response Vladimir. In this case though, can the JIT > not see that the cmp bytecodes of non-taken branches have no side > effects and remove them altogether? Is that just deemed not worth > the cost or is there something fundamental I'm missing here? > > On Thu, Jun 4, 2015 at 7:36 PM, Vladimir Kozlov > > wrote: > > VM does not profiling values. We profiling branches. > When C2 construct control flow graph it follows bytecode. And it > can't eliminate cmp code based only on branch profiling. > Profiling still shows that all cmp bytecodes are always executed > - only branches are not taken. 
We would eliminate tests if they > were on non taken branch. > We generate uncommon traps for branches which were not taken > based on profiling. > > Vladimir > > > On 6/4/15 4:20 PM, Vitaly Davidovich wrote: > > Hi, > > Suppose you have a method like this: > > private static int f(final int x) { > if (x == 0) > return 1; > else if (x == 1) > return 2; > else if (x == 2) > return 3; > return 4; > > } > > If I then call it with x=2 always, the generated asm is not > what I > expect (8u40 with C2 compiler) > > # parm0: rsi = int > # [sp+0x30] (sp of caller) > 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) > 0x00007fcc5970c527: push %rbp > 0x00007fcc5970c528: sub $0x20,%rsp > ;*synchronization entry > > 0x00007fcc5970c52c: test %esi,%esi > 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne > > 0x00007fcc5970c530: cmp $0x1,%esi > 0x00007fcc5970c533: je 0x00007fcc5970c571 ;*if_icmpne > > 0x00007fcc5970c535: cmp $0x2,%esi > 0x00007fcc5970c538: jne 0x00007fcc5970c54b ;*if_icmpne > > 0x00007fcc5970c53a: mov $0x3,%eax > 0x00007fcc5970c53f: add $0x20,%rsp > 0x00007fcc5970c543: pop %rbp > 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 > # > 0x00007fcc5f51a000 > ; > {poll_return} > 0x00007fcc5970c54a: retq > > > It's checking the if conditions in order, and then jumps to > some runtime > calls (I'm assuming that's for deopt to restore pruned > branches? Cause I > don't see anything that returns 1 or 2 otherwise). Why is > this code not > favoring x=2? I'd have thought this code would be something > like (after > epilogue): > > cmp $0x2, %esi > jne > mov $0x3, %eax > retq > > Thanks > > > From vitalyd at gmail.com Fri Jun 5 00:33:26 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 4 Jun 2015 20:33:26 -0400 Subject: profiling of branches - odd code generation? In-Reply-To: <5570EA69.4050201@oracle.com> References: <5570E118.8020607@oracle.com> <5570EA69.4050201@oracle.com> Message-ID: Ok I see the complication with the restart in interpreter (I think that's what Remi was saying as well). I suspect that most checks will tend to not have side effects (bad practice), but of course there may be some and for more complex scenarios sufficient inlining would need to occur. For simple cases however, it's a pity since the VM has enough info and deopt ability to do this safely. The real case i had was checks against enum constants, but the int example is just as good. As for switch, I didn't try it this time but we had a thread on here a few months back where I was complaining about the switch not handling the same type of scenario, and John Rose mentioned it's a known issue with switches (i.e. no profile based optimization) :). In addition, I find some of the switch codegen suboptimal (e.g. same cmp performed back to back with just a different jump after). So I then tried if/else chain given that's supposedly profiled, and it is but apparently the codegen isn't what I thought it would be. I was basically hoping for a single class CHA-like analysis for branches :). sent from my phone On Jun 4, 2015 8:15 PM, "Vladimir Kozlov" wrote: > Uncommon traps are bound to bytecode. If we hit uncommon trap, for > example, for (x == 1) test then after deoptimization Interpreter will > execute only 'return 2;'. If generated code as you suggested we need to > bind uncommon trap to the BCI of the first (x == 0) check so it will be > executed in Interpreter after deoptimization. > > So it is not simple optimization but doable for cases like this (integer > checks). > > Did you tried 'switch' instead? 
> > Regards, > Vladimir > > On 6/4/15 4:44 PM, Vitaly Davidovich wrote: > >> By the way, the context for this example is the following. Suppose you >> have a class with such a method. This class is then used in different >> java processes such that in each instance only one of those branches is >> ever taken and the other compares have no side effects. Ideally, the >> compiled code would favor that fast path, which may not be the first arm >> of the if/else chain. >> >> On Thu, Jun 4, 2015 at 7:42 PM, Vitaly Davidovich > > wrote: >> >> Thanks for the response Vladimir. In this case though, can the JIT >> not see that the cmp bytecodes of non-taken branches have no side >> effects and remove them altogether? Is that just deemed not worth >> the cost or is there something fundamental I'm missing here? >> >> On Thu, Jun 4, 2015 at 7:36 PM, Vladimir Kozlov >> > >> wrote: >> >> VM does not profiling values. We profiling branches. >> When C2 construct control flow graph it follows bytecode. And it >> can't eliminate cmp code based only on branch profiling. >> Profiling still shows that all cmp bytecodes are always executed >> - only branches are not taken. We would eliminate tests if they >> were on non taken branch. >> We generate uncommon traps for branches which were not taken >> based on profiling. >> >> Vladimir >> >> >> On 6/4/15 4:20 PM, Vitaly Davidovich wrote: >> >> Hi, >> >> Suppose you have a method like this: >> >> private static int f(final int x) { >> if (x == 0) >> return 1; >> else if (x == 1) >> return 2; >> else if (x == 2) >> return 3; >> return 4; >> >> } >> >> If I then call it with x=2 always, the generated asm is not >> what I >> expect (8u40 with C2 compiler) >> >> # parm0: rsi = int >> # [sp+0x30] (sp of caller) >> 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) >> 0x00007fcc5970c527: push %rbp >> 0x00007fcc5970c528: sub $0x20,%rsp >> ;*synchronization entry >> >> 0x00007fcc5970c52c: test %esi,%esi >> 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne >> >> 0x00007fcc5970c530: cmp $0x1,%esi >> 0x00007fcc5970c533: je 0x00007fcc5970c571 ;*if_icmpne >> >> 0x00007fcc5970c535: cmp $0x2,%esi >> 0x00007fcc5970c538: jne 0x00007fcc5970c54b ;*if_icmpne >> >> 0x00007fcc5970c53a: mov $0x3,%eax >> 0x00007fcc5970c53f: add $0x20,%rsp >> 0x00007fcc5970c543: pop %rbp >> 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 >> # >> 0x00007fcc5f51a000 >> ; >> {poll_return} >> 0x00007fcc5970c54a: retq >> >> >> It's checking the if conditions in order, and then jumps to >> some runtime >> calls (I'm assuming that's for deopt to restore pruned >> branches? Cause I >> don't see anything that returns 1 or 2 otherwise). Why is >> this code not >> favoring x=2? I'd have thought this code would be something >> like (after >> epilogue): >> >> cmp $0x2, %esi >> jne >> mov $0x3, %eax >> retq >> >> Thanks >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.c.berg at intel.com Fri Jun 5 04:45:10 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Fri, 5 Jun 2015 04:45:10 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: <556F9AF7.5050301@oracle.com> References: <556F9AF7.5050301@oracle.com> Message-ID: Vladimir, please find the following webrev with pretty much the full list of changes made. I made some improvements too. For instance I only allow the analysis to take place when we are trying to unroll beyond the default. 
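The kind of loop this unrolling analysis targets is an ordinary counted loop over arrays, for example (an assumed illustration, not one of the micros mentioned in the RFR):

    static void addArrays(int[] a, int[] b, int[] c) {
        // arrays assumed to be the same length
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

With the analysis enabled, the unroll factor for such a loop can be driven by the widest vector size the machine supports instead of stopping at the default unroll heuristics.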
http://cr.openjdk.java.net/~mcberg/8080325/webrev.01/ Regards, -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Wednesday, June 03, 2015 5:25 PM To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8080325 SuperWord loop unrolling analysis Thank you, Michael, for this contribution. First, I am fine with such approach - call superword only to collect data about loop. It could be useful for superword optimization to have a pass over loop's nodes to determine if it could be vectorize as separate phase. And avoid to do that in SLP analysis. Make SuperWordLoopUnrollAnalysis flag's default value to 'false' and set it to true only in vm_version_x86.cpp (#ifdef COMPILER2) so that you don't need to modify setting on all platforms. In flag description say what 'slp' means (Superword Level Parallelism). Code style: if (cl->is_reduction_loop() == false) phase->mark_reductions(this); should be: if (!cl->is_reduction_loop()) { phase->mark_reductions(this); } An other one: cl->has_passed_slp() == false. We use ! for such cases. There are following checks after new code which prevent unrolling. Why you do analysis before them without affecting their decision? And you use the result of analysis only later at the end of method. Why not do analysis there then? (_local_loop_unroll_factor > 4) check should be done inside policy_slp_max_unroll() to have only one check. Actually all next code (lines 668-692) could be done at the end of policy_slp_max_unroll(). And you don't need to return bool then (I changed name too): 665 if (LoopMaxUnroll > _local_loop_unroll_factor) { 666 // Once policy_slp_analysis succeeds, mark the loop with the 667 // maximal unroll factor so that we minimize analysis passes 668 policy_unroll_slp_analysis(cl, phase); 693 } 694 } 695 696 // Check for initial stride being a small enough constant 697 if (abs(cl->stride_con()) > (1<<2)*future_unroll_ct) return false; I think slp analysis code should be in superword.cpp - SWPointer should be used only there. Just add an other method to SuperWord class. Instead of a->Afree() and next: 810 Arena *a = Thread::current()->resource_area(); 812 size_t ignored_size = _body.size()*sizeof(int*); 813 int *ignored_loop_nodes = (int*)a->Amalloc_D(ignored_size); 814 Node_Stack nstack((int)ignored_size); use: ResourceMark rm; size_t ignored_size = _body.size(); int *ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size); Node_Stack nstack((int)ignored_size); Node_Stack should take number of nodes and not bytes. I am concern about nstack.clear() because you have poped all nodes on it. Thanks, Vladimir On 5/13/15 6:26 PM, Berg, Michael C wrote: > Hi Folks, > > We (Intel) would like to contribute superword loop unrolling analysis > to facilitate more default unrolling and larger SIMD vector mapping. > Please review this patch and comment as needed: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 > > webrev: > > http://cr.openjdk.java.net/~kvn/8080325/webrev/ > > The design finds the highest common vector supported and implemented > on a given machine and applies that to unrolling, iff it is greater > than the default. If the user gates unrolling we will still obey the > user directive. It's light weight, when we get to the analysis part, > if we fail, we stop further tries. If we succeed we stop further > tries. We generally always analyze only once. 
We then gate the unroll > factor by extending the size of the unroll segment so that the > iterative tries will keep unrolling, obtaining larger unrolls of a > targeted loop. I see no negative behavior wrt to performance, and a > modest positive swing in default behavior up to 1.5x for some micros. > > Vladimir Koslov has offered to sponsor this patch. > From michael.c.berg at intel.com Fri Jun 5 04:46:07 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Fri, 5 Jun 2015 04:46:07 +0000 Subject: RFR(L): 8081247 AVX 512 extended support code review request In-Reply-To: <556FA54E.8050001@oracle.com> References: <556FA54E.8050001@oracle.com> Message-ID: Vladimir please find the following webrev with the suggested changes, I have added small signature functions which look like the old versions in the assembler but manage the problem I need to handle, which is additional state for legacy only instructions. There is a new vm_version function which handles the cpuid checks with a conglomerate approach for the one scenario which had it. The loop in the stub generator is now formed to alter the upper bound and execute in one path. http://cr.openjdk.java.net/~mcberg/8081247/webrev.03/ Regards, Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Wednesday, June 03, 2015 6:10 PM To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' Subject: Re: RFR(L): 8081247 AVX 512 extended support code review request Hi, Michael assembler_x86.cpp: I don't like that you replaced prefix method with few parameters with method which has a lot of them: - int encode = vex_prefix_0F38_and_encode_q(dst, src1, src2); + int encode = vex_prefix_and_encode(dst->encoding(), src1->encoding(), src2->encoding(), + VEX_SIMD_NONE, VEX_OPCODE_0F_38, true, AVX_128bit, + true, false); Why you did that? stubGenerator_x86_64.cpp: Can we set different loop limit based on UseAVX instead of having 2 loops. x86.ad: Instead of long condition expressions like next: UseAVX > 0 && !VM_Version::supports_avx512vl() && !VM_Version::supports_avx512bw() May be have one VM_Version finction which does these checks. Thanks, Vladimir On 6/2/15 9:38 PM, Berg, Michael C wrote: > Hi Folks, > > I would like to contribute more AVX512 enabling to facilitate support > for machines which utilize EVEX encoding. I need two reviewers to > review this patch and comment as needed: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8081247 > > webrev: > > http://cr.openjdk.java.net/~mcberg/8081247/webrev.01/ > > This patch enables BMI code on EVEX targets, improves replication > patterns to be more efficient on both EVEX enabled and legacy targets, > adds more CPUID based rules for correct code generation on various > EVEX enabled servers, extends more call save/restore functionality and > extends the vector space further for SIMD operations. Please expedite > this review as there is a near term need for the support. > > Also, as I am not yet a committer, this code needs a sponsor as well. 
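A 'replication pattern' in this context refers to broadcasting a scalar into every lane of a vector register (the Replicate* ideal nodes in C2); a Java loop that produces such a pattern is, for example (an assumed illustration):

    static void fill(int[] a, int v) {
        for (int i = 0; i < a.length; i++) {
            a[i] = v;   // v is broadcast into a vector register and stored with wide SIMD moves
        }
    }

On an EVEX-capable machine the broadcast and the stores can use the 512-bit zmm registers; on legacy targets the same pattern falls back to xmm/ymm encodings.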
> > Thanks, > > Michael > From roland.westrelin at oracle.com Fri Jun 5 07:33:30 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 5 Jun 2015 09:33:30 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: <557073C6.1050002@oracle.com> References: <5553C1E2.3040003@oracle.com> <555CE2B2.8020305@oracle.com> <5B87D4E4-FA26-411F-BD03-9F9CAA1E0CBF@oracle.com> <557073C6.1050002@oracle.com> Message-ID: <01E73384-C419-400E-9712-474474E2D3C9@oracle.com> Thanks for the review, Vladimir. Roland. > On Jun 4, 2015, at 5:50 PM, Vladimir Kozlov wrote: > > Good. > > Thanks, > Vladimir > > On 6/4/15 7:28 AM, Roland Westrelin wrote: >> As suggested by Vladimir, here is a new simpler fix that simply bails out of range check elimination if the preloop is not found: >> >> http://cr.openjdk.java.net/~roland/8078866/webrev.01/ >> >> Removing the main and post loops will be pushed as a separate RFE. >> >> Roland. >> From duboscq at ssw.jku.at Fri Jun 5 07:34:36 2015 From: duboscq at ssw.jku.at (Gilles Duboscq) Date: Fri, 05 Jun 2015 07:34:36 +0000 Subject: profiling of branches - odd code generation? In-Reply-To: References: <5570E118.8020607@oracle.com> <5570EA69.4050201@oracle.com> Message-ID: Regarding interpreter restart/uncommon_trap, for this kind of problem, the compiler has a better chance when it does not bind them to a precise bytecode restart location right from the start of the compilation. The branches can then be reordered and eliminated to achieve what you want. For example with Graal, on your code i get: # {method} {0x00007f442685e318} 'f' '(I)I' in 'Test' # parm0: rsi = int # [sp+0x20] (sp of caller) 0x00007f44288d2ba0: mov qword ptr [rsp-0x14000],rax 0x00007f44288d2ba8: sub rsp,0x18 0x00007f44288d2bac: mov qword ptr [rsp+0x10],rbp 0x00007f44288d2bb1: cmp esi,0x2 0x00007f44288d2bb4: jne 0x00007f44288d2bca 0x00007f44288d2bba: mov eax,0x3 0x00007f44288d2bbf: add rsp,0x18 0x00007f44288d2bc3: test dword ptr [rip+0x1a8d743d],eax ; {poll_return} 0x00007f44288d2bc9: ret 0x00007f44288d2bca: mov dword ptr [r15+0x8],0xffffffed 0x00007f44288d2bd2: mov qword ptr [r15+0x10],r12 0x00007f44288d2bd6: call 0x00007f4428047341 ; OopMap{off=59} ;*iload_0 {reexecute=1 rethrow=0 return_oop=0} ; - Test::f at 0 (line 10) ; {runtime_call} Which is more or less what you wanted modulo frame setup/tear down and poll_return. In this deopt case, execution restarts at bci 0 (it re-executes the method from the start). In general it would re-execute from the last side-effect. Gilles On Fri, Jun 5, 2015 at 2:35 AM Vitaly Davidovich wrote: > Ok I see the complication with the restart in interpreter (I think that's > what Remi was saying as well). I suspect that most checks will tend to not > have side effects (bad practice), but of course there may be some and for > more complex scenarios sufficient inlining would need to occur. For simple > cases however, it's a pity since the VM has enough info and deopt ability > to do this safely. The real case i had was checks against enum constants, > but the int example is just as good. > > As for switch, I didn't try it this time but we had a thread on here a few > months back where I was complaining about the switch not handling the same > type of scenario, and John Rose mentioned it's a known issue with switches > (i.e. no profile based optimization) :). In addition, I find some of the > switch codegen suboptimal (e.g. 
same cmp performed back to back with just a > different jump after). So I then tried if/else chain given that's > supposedly profiled, and it is but apparently the codegen isn't what I > thought it would be. I was basically hoping for a single class CHA-like > analysis for branches :). > > sent from my phone > On Jun 4, 2015 8:15 PM, "Vladimir Kozlov" > wrote: > >> Uncommon traps are bound to bytecode. If we hit uncommon trap, for >> example, for (x == 1) test then after deoptimization Interpreter will >> execute only 'return 2;'. If generated code as you suggested we need to >> bind uncommon trap to the BCI of the first (x == 0) check so it will be >> executed in Interpreter after deoptimization. >> >> So it is not simple optimization but doable for cases like this (integer >> checks). >> >> Did you tried 'switch' instead? >> >> Regards, >> Vladimir >> >> On 6/4/15 4:44 PM, Vitaly Davidovich wrote: >> >>> By the way, the context for this example is the following. Suppose you >>> have a class with such a method. This class is then used in different >>> java processes such that in each instance only one of those branches is >>> ever taken and the other compares have no side effects. Ideally, the >>> compiled code would favor that fast path, which may not be the first arm >>> of the if/else chain. >>> >>> On Thu, Jun 4, 2015 at 7:42 PM, Vitaly Davidovich >> > wrote: >>> >>> Thanks for the response Vladimir. In this case though, can the JIT >>> not see that the cmp bytecodes of non-taken branches have no side >>> effects and remove them altogether? Is that just deemed not worth >>> the cost or is there something fundamental I'm missing here? >>> >>> On Thu, Jun 4, 2015 at 7:36 PM, Vladimir Kozlov >>> > >>> wrote: >>> >>> VM does not profiling values. We profiling branches. >>> When C2 construct control flow graph it follows bytecode. And it >>> can't eliminate cmp code based only on branch profiling. >>> Profiling still shows that all cmp bytecodes are always executed >>> - only branches are not taken. We would eliminate tests if they >>> were on non taken branch. >>> We generate uncommon traps for branches which were not taken >>> based on profiling. >>> >>> Vladimir >>> >>> >>> On 6/4/15 4:20 PM, Vitaly Davidovich wrote: >>> >>> Hi, >>> >>> Suppose you have a method like this: >>> >>> private static int f(final int x) { >>> if (x == 0) >>> return 1; >>> else if (x == 1) >>> return 2; >>> else if (x == 2) >>> return 3; >>> return 4; >>> >>> } >>> >>> If I then call it with x=2 always, the generated asm is not >>> what I >>> expect (8u40 with C2 compiler) >>> >>> # parm0: rsi = int >>> # [sp+0x30] (sp of caller) >>> 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) >>> 0x00007fcc5970c527: push %rbp >>> 0x00007fcc5970c528: sub $0x20,%rsp >>> ;*synchronization entry >>> >>> 0x00007fcc5970c52c: test %esi,%esi >>> 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne >>> >>> 0x00007fcc5970c530: cmp $0x1,%esi >>> 0x00007fcc5970c533: je 0x00007fcc5970c571 >>> ;*if_icmpne >>> >>> 0x00007fcc5970c535: cmp $0x2,%esi >>> 0x00007fcc5970c538: jne 0x00007fcc5970c54b >>> ;*if_icmpne >>> >>> 0x00007fcc5970c53a: mov $0x3,%eax >>> 0x00007fcc5970c53f: add $0x20,%rsp >>> 0x00007fcc5970c543: pop %rbp >>> 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 >>> # >>> 0x00007fcc5f51a000 >>> ; >>> {poll_return} >>> 0x00007fcc5970c54a: retq >>> >>> >>> It's checking the if conditions in order, and then jumps to >>> some runtime >>> calls (I'm assuming that's for deopt to restore pruned >>> branches? 
Cause I >>> don't see anything that returns 1 or 2 otherwise). Why is >>> this code not >>> favoring x=2? I'd have thought this code would be something >>> like (after >>> epilogue): >>> >>> cmp $0x2, %esi >>> jne >>> mov $0x3, %eax >>> retq >>> >>> Thanks >>> >>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Fri Jun 5 12:59:54 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 5 Jun 2015 08:59:54 -0400 Subject: profiling of branches - odd code generation? In-Reply-To: References: <5570E118.8020607@oracle.com> <5570EA69.4050201@oracle.com> Message-ID: Hi Gilles, Yes, that's exactly what I was expecting/hoping C2 emits. Nice to see Graal does this. Does Graal also use profile info for switch statements? Would it peel out a check for 2 as the first thing in this case? Thanks sent from my phone On Jun 5, 2015 3:35 AM, "Gilles Duboscq" wrote: > Regarding interpreter restart/uncommon_trap, for this kind of problem, the > compiler has a better chance when it does not bind them to a precise > bytecode restart location right from the start of the compilation. The > branches can then be reordered and eliminated to achieve what you want. > > For example with Graal, on your code i get: > > # {method} {0x00007f442685e318} 'f' '(I)I' in 'Test' > # parm0: rsi = int > # [sp+0x20] (sp of caller) > 0x00007f44288d2ba0: mov qword ptr [rsp-0x14000],rax > 0x00007f44288d2ba8: sub rsp,0x18 > 0x00007f44288d2bac: mov qword ptr [rsp+0x10],rbp > 0x00007f44288d2bb1: cmp esi,0x2 > 0x00007f44288d2bb4: jne 0x00007f44288d2bca > 0x00007f44288d2bba: mov eax,0x3 > 0x00007f44288d2bbf: add rsp,0x18 > 0x00007f44288d2bc3: test dword ptr [rip+0x1a8d743d],eax ; > {poll_return} > 0x00007f44288d2bc9: ret > 0x00007f44288d2bca: mov dword ptr [r15+0x8],0xffffffed > 0x00007f44288d2bd2: mov qword ptr [r15+0x10],r12 > 0x00007f44288d2bd6: call 0x00007f4428047341 ; OopMap{off=59} > ;*iload_0 {reexecute=1 > rethrow=0 return_oop=0} > ; - Test::f at 0 (line 10) > ; {runtime_call} > > Which is more or less what you wanted modulo frame setup/tear down and > poll_return. In this deopt case, execution restarts at bci 0 (it > re-executes the method from the start). In general it would re-execute from > the last side-effect. > > Gilles > > On Fri, Jun 5, 2015 at 2:35 AM Vitaly Davidovich > wrote: > >> Ok I see the complication with the restart in interpreter (I think that's >> what Remi was saying as well). I suspect that most checks will tend to not >> have side effects (bad practice), but of course there may be some and for >> more complex scenarios sufficient inlining would need to occur. For simple >> cases however, it's a pity since the VM has enough info and deopt ability >> to do this safely. The real case i had was checks against enum constants, >> but the int example is just as good. >> >> As for switch, I didn't try it this time but we had a thread on here a >> few months back where I was complaining about the switch not handling the >> same type of scenario, and John Rose mentioned it's a known issue with >> switches (i.e. no profile based optimization) :). In addition, I find some >> of the switch codegen suboptimal (e.g. same cmp performed back to back with >> just a different jump after). So I then tried if/else chain given that's >> supposedly profiled, and it is but apparently the codegen isn't what I >> thought it would be. I was basically hoping for a single class CHA-like >> analysis for branches :). 
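The enum case mentioned above would look something like the following sketch (hypothetical, the real class is not shown in the thread); it compiles to a chain of reference comparisons, so the same profiling and code generation question applies:

    enum Mode { A, B, C }

    private static int g(final Mode m) {
        if (m == Mode.A)
            return 1;
        else if (m == Mode.B)
            return 2;
        else if (m == Mode.C)
            return 3;
        return 4;
    }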
>> >> sent from my phone >> On Jun 4, 2015 8:15 PM, "Vladimir Kozlov" >> wrote: >> >>> Uncommon traps are bound to bytecode. If we hit uncommon trap, for >>> example, for (x == 1) test then after deoptimization Interpreter will >>> execute only 'return 2;'. If generated code as you suggested we need to >>> bind uncommon trap to the BCI of the first (x == 0) check so it will be >>> executed in Interpreter after deoptimization. >>> >>> So it is not simple optimization but doable for cases like this (integer >>> checks). >>> >>> Did you tried 'switch' instead? >>> >>> Regards, >>> Vladimir >>> >>> On 6/4/15 4:44 PM, Vitaly Davidovich wrote: >>> >>>> By the way, the context for this example is the following. Suppose you >>>> have a class with such a method. This class is then used in different >>>> java processes such that in each instance only one of those branches is >>>> ever taken and the other compares have no side effects. Ideally, the >>>> compiled code would favor that fast path, which may not be the first arm >>>> of the if/else chain. >>>> >>>> On Thu, Jun 4, 2015 at 7:42 PM, Vitaly Davidovich >>> > wrote: >>>> >>>> Thanks for the response Vladimir. In this case though, can the JIT >>>> not see that the cmp bytecodes of non-taken branches have no side >>>> effects and remove them altogether? Is that just deemed not worth >>>> the cost or is there something fundamental I'm missing here? >>>> >>>> On Thu, Jun 4, 2015 at 7:36 PM, Vladimir Kozlov >>>> > >>>> wrote: >>>> >>>> VM does not profiling values. We profiling branches. >>>> When C2 construct control flow graph it follows bytecode. And it >>>> can't eliminate cmp code based only on branch profiling. >>>> Profiling still shows that all cmp bytecodes are always executed >>>> - only branches are not taken. We would eliminate tests if they >>>> were on non taken branch. >>>> We generate uncommon traps for branches which were not taken >>>> based on profiling. >>>> >>>> Vladimir >>>> >>>> >>>> On 6/4/15 4:20 PM, Vitaly Davidovich wrote: >>>> >>>> Hi, >>>> >>>> Suppose you have a method like this: >>>> >>>> private static int f(final int x) { >>>> if (x == 0) >>>> return 1; >>>> else if (x == 1) >>>> return 2; >>>> else if (x == 2) >>>> return 3; >>>> return 4; >>>> >>>> } >>>> >>>> If I then call it with x=2 always, the generated asm is not >>>> what I >>>> expect (8u40 with C2 compiler) >>>> >>>> # parm0: rsi = int >>>> # [sp+0x30] (sp of caller) >>>> 0x00007fcc5970c520: mov %eax,-0x14000(%rsp) >>>> 0x00007fcc5970c527: push %rbp >>>> 0x00007fcc5970c528: sub $0x20,%rsp >>>> ;*synchronization entry >>>> >>>> 0x00007fcc5970c52c: test %esi,%esi >>>> 0x00007fcc5970c52e: je 0x00007fcc5970c55d ;*ifne >>>> >>>> 0x00007fcc5970c530: cmp $0x1,%esi >>>> 0x00007fcc5970c533: je 0x00007fcc5970c571 >>>> ;*if_icmpne >>>> >>>> 0x00007fcc5970c535: cmp $0x2,%esi >>>> 0x00007fcc5970c538: jne 0x00007fcc5970c54b >>>> ;*if_icmpne >>>> >>>> 0x00007fcc5970c53a: mov $0x3,%eax >>>> 0x00007fcc5970c53f: add $0x20,%rsp >>>> 0x00007fcc5970c543: pop %rbp >>>> 0x00007fcc5970c544: test %eax,0x5e0dab6(%rip)300000 >>>> # >>>> 0x00007fcc5f51a000 >>>> ; >>>> {poll_return} >>>> 0x00007fcc5970c54a: retq >>>> >>>> >>>> It's checking the if conditions in order, and then jumps to >>>> some runtime >>>> calls (I'm assuming that's for deopt to restore pruned >>>> branches? Cause I >>>> don't see anything that returns 1 or 2 otherwise). Why is >>>> this code not >>>> favoring x=2? 
I'd have thought this code would be something >>>> like (after >>>> epilogue): >>>> >>>> cmp $0x2, %esi >>>> jne >>>> mov $0x3, %eax >>>> retq >>>> >>>> Thanks >>>> >>>> >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.c.berg at intel.com Fri Jun 5 14:49:21 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Fri, 5 Jun 2015 14:49:21 +0000 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: <01E73384-C419-400E-9712-474474E2D3C9@oracle.com> References: <5553C1E2.3040003@oracle.com> <555CE2B2.8020305@oracle.com> <5B87D4E4-FA26-411F-BD03-9F9CAA1E0CBF@oracle.com> <557073C6.1050002@oracle.com> <01E73384-C419-400E-9712-474474E2D3C9@oracle.com> Message-ID: Looks good. -Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin Sent: Friday, June 05, 2015 12:34 AM To: Vladimir Kozlov Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed Thanks for the review, Vladimir. Roland. > On Jun 4, 2015, at 5:50 PM, Vladimir Kozlov wrote: > > Good. > > Thanks, > Vladimir > > On 6/4/15 7:28 AM, Roland Westrelin wrote: >> As suggested by Vladimir, here is a new simpler fix that simply bails out of range check elimination if the preloop is not found: >> >> http://cr.openjdk.java.net/~roland/8078866/webrev.01/ >> >> Removing the main and post loops will be pushed as a separate RFE. >> >> Roland. >> From vitalyd at gmail.com Fri Jun 5 15:17:21 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 5 Jun 2015 11:17:21 -0400 Subject: InlineCacheBuffer + GuaranteedSafepointInterval In-Reply-To: References: Message-ID: Any words of wisdom? Thanks sent from my phone On Jun 3, 2015 12:02 PM, "Vitaly Davidovich" wrote: > Hi, > > Could someone please explain the idea behind GuaranteedSafepointInterval > induced safepoints being gated on InlineCacheBuffer::is_empty() returning > false? This code in safepoint.cpp: > > bool SafepointSynchronize::is_cleanup_needed() { // Need a safepoint if some inline cache buffers is non-empty if (!InlineCacheBuffer::is_empty()) return true; return false; } > > > Looking at icBuffer.cpp: > > void InlineCacheBuffer::update_inline_caches() { if (buffer()->number_of_stubs() > 1) { if (TraceICBuffer) { tty->print_cr("[updating inline caches with %d stubs]", buffer()->number_of_stubs()); } buffer()->remove_all(); init_next_stub(); } release_pending_icholders(); } What exactly triggers IC holders to be eligible for deletion? > > > The reason behind this question is I'd like to eliminate "unnecessary" safepoints that I'm seeing, but would like to understand implications of this with respect to compiler infrastructure (C2, specifically). I have a fairly large code cache reserved, and the # of compiled methods isn't too big, so space there shouldn't be an issue. > > > Why is GuaranteedSafepointInterval based safepoint actually gated on this particular check? If I turn off background safepoints (i.e. GuaranteedSafepointInterval=0) or set them very far apart, am I risking stability problems, at least in terms of compiler? > > > Thanks > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From roland.westrelin at oracle.com Fri Jun 5 15:36:13 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 5 Jun 2015 17:36:13 +0200 Subject: InlineCacheBuffer + GuaranteedSafepointInterval In-Reply-To: References: Message-ID: <716EE683-BE62-485E-89E9-77143E3DB38C@oracle.com> > What exactly triggers IC holders to be eligible for deletion? As I remember, IC stubs are stubs that are sometimes used when an IC call sites change state and updating the call site can?t be done atomically. So we insert an IC stub between the call site and the target method and there?s an extra jump on that path. At a safepoint, we can remove all the IC stubs by updating the call sites that couldn?t be updated atomically because there?s no concurrency issue anymore. > The reason behind this question is I'd like to eliminate "unnecessary" safepoints that I'm seeing, but would like to understand implications of this with respect to compiler infrastructure (C2, specifically). I have a fairly large code cache reserved, and the # of compiled methods isn't too big, so space there shouldn't be an issue. > > Why is GuaranteedSafepointInterval based safepoint actually gated on this particular check? If I turn off background safepoints (i.e. GuaranteedSafepointInterval=0) or set them very far apart, am I risking stability problems, at least in terms of compiler? As far as I can tell and as far as IC stubs are concerned you risk a performance loss (if you have a very hot call site with the extra jump caused by the IC stub) and running out of IC stubs which, it seems from the code, would trigger a safepoint. Roland. From roland.westrelin at oracle.com Fri Jun 5 15:38:58 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 5 Jun 2015 17:38:58 +0200 Subject: Inlining methods with large switch statements In-Reply-To: References: Message-ID: > Is this something that can be handled better? You could try forcing inlining with -XX:CompileCommand I suppose but then it can have cascading effects (other inlining may not happen as a consequence). Not sure if that?s what you meant by "handled better". Roland. From vitalyd at gmail.com Fri Jun 5 15:43:44 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 5 Jun 2015 11:43:44 -0400 Subject: Inlining methods with large switch statements In-Reply-To: References: Message-ID: Hi Roland, By "handled better" I mean for the JIT to not get scared about the bytecode size since machine code is rather compact and quick to execute (especially if the indirect jump via the jump table is predicted well). This is somewhat analogous to the JIT being spooked by methods > MaxInlineSize where the actual bytecode size isn't representative of the real cost (e.g. dead code, asserts, etc), but for FreqInlineSize. Thanks On Fri, Jun 5, 2015 at 11:38 AM, Roland Westrelin < roland.westrelin at oracle.com> wrote: > > Is this something that can be handled better? > > You could try forcing inlining with -XX:CompileCommand I suppose but then > it can have cascading effects (other inlining may not happen as a > consequence). Not sure if that?s what you meant by "handled better". > > Roland. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vitalyd at gmail.com Fri Jun 5 15:48:03 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 5 Jun 2015 11:48:03 -0400 Subject: InlineCacheBuffer + GuaranteedSafepointInterval In-Reply-To: <716EE683-BE62-485E-89E9-77143E3DB38C@oracle.com> References: <716EE683-BE62-485E-89E9-77143E3DB38C@oracle.com> Message-ID: Hi Roland, Thanks for the insight. Do you know offhand the condition(s) that prevent updating IC call sites atomically? Also, I'm assuming this triggers when a new receiver is seen at a callsite that's already using an IC, is that right? So something like lower CompileThreshold may exacerbate this a bit if the shorter profile does not record sufficient # of receiver types. Thanks On Fri, Jun 5, 2015 at 11:36 AM, Roland Westrelin < roland.westrelin at oracle.com> wrote: > > What exactly triggers IC holders to be eligible for deletion? > > As I remember, IC stubs are stubs that are sometimes used when an IC call > sites change state and updating the call site can?t be done atomically. So > we insert an IC stub between the call site and the target method and > there?s an extra jump on that path. At a safepoint, we can remove all the > IC stubs by updating the call sites that couldn?t be updated atomically > because there?s no concurrency issue anymore. > > > The reason behind this question is I'd like to eliminate "unnecessary" > safepoints that I'm seeing, but would like to understand implications of > this with respect to compiler infrastructure (C2, specifically). I have a > fairly large code cache reserved, and the # of compiled methods isn't too > big, so space there shouldn't be an issue. > > > > Why is GuaranteedSafepointInterval based safepoint actually gated on > this particular check? If I turn off background safepoints (i.e. > GuaranteedSafepointInterval=0) or set them very far apart, am I risking > stability problems, at least in terms of compiler? > > As far as I can tell and as far as IC stubs are concerned you risk a > performance loss (if you have a very hot call site with the extra jump > caused by the IC stub) and running out of IC stubs which, it seems from the > code, would trigger a safepoint. > > Roland. -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.westrelin at oracle.com Fri Jun 5 16:03:08 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 5 Jun 2015 18:03:08 +0200 Subject: Inlining methods with large switch statements In-Reply-To: References: Message-ID: <198560A2-8173-4B19-9095-3B1729903690@oracle.com> > By "handled better" I mean for the JIT to not get scared about the bytecode size since machine code is rather compact and quick to execute (especially if the indirect jump via the jump table is predicted well). This is somewhat analogous to the JIT being spooked by methods > MaxInlineSize where the actual bytecode size isn't representative of the real cost (e.g. dead code, asserts, etc), but for FreqInlineSize. John suggested a way to improve our heuristics: https://bugs.openjdk.java.net/browse/JDK-6316156?focusedCommentId=13443564&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13443564 Roland. 
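To make the large-switch scenario concrete, the method in question is shaped roughly like this (a hypothetical sketch, truncated here; the real method is not shown in the thread):

    static int decode(int op) {
        switch (op) {
            case 0:  return 10;
            case 1:  return 11;
            case 2:  return 12;
            case 3:  return 13;
            // ... imagine this continuing densely for a couple of hundred cases ...
            default: return -1;
        }
    }

Each case adds a fixed-size entry to the tableswitch bytecode, so the method's bytecode size quickly exceeds FreqInlineSize even though the dispatch compiles down to little more than a bounds check and an indirect jump through a jump table.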
From vitalyd at gmail.com Fri Jun 5 16:04:28 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 5 Jun 2015 12:04:28 -0400 Subject: Inlining methods with large switch statements In-Reply-To: <198560A2-8173-4B19-9095-3B1729903690@oracle.com> References: <198560A2-8173-4B19-9095-3B1729903690@oracle.com> Message-ID: Right, I didn't see FreqInlineSize on there, and the large jump table scenario seemed like one worth calling out as well? On Fri, Jun 5, 2015 at 12:03 PM, Roland Westrelin < roland.westrelin at oracle.com> wrote: > > By "handled better" I mean for the JIT to not get scared about the > bytecode size since machine code is rather compact and quick to execute > (especially if the indirect jump via the jump table is predicted well). > This is somewhat analogous to the JIT being spooked by methods > > MaxInlineSize where the actual bytecode size isn't representative of the > real cost (e.g. dead code, asserts, etc), but for FreqInlineSize. > > John suggested a way to improve our heuristics: > > > https://bugs.openjdk.java.net/browse/JDK-6316156?focusedCommentId=13443564&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13443564 > > Roland. -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.r.rose at oracle.com Fri Jun 5 17:02:45 2015 From: john.r.rose at oracle.com (John Rose) Date: Fri, 5 Jun 2015 10:02:45 -0700 Subject: Inlining methods with large switch statements In-Reply-To: References: <198560A2-8173-4B19-9095-3B1729903690@oracle.com> Message-ID: Bug updated; thanks! > On Jun 5, 2015, at 9:04 AM, Vitaly Davidovich wrote: > > Right, I didn't see FreqInlineSize on there, and the large jump table scenario seemed like one worth calling out as well? > > On Fri, Jun 5, 2015 at 12:03 PM, Roland Westrelin > wrote: > > By "handled better" I mean for the JIT to not get scared about the bytecode size since machine code is rather compact and quick to execute (especially if the indirect jump via the jump table is predicted well). This is somewhat analogous to the JIT being spooked by methods > MaxInlineSize where the actual bytecode size isn't representative of the real cost (e.g. dead code, asserts, etc), but for FreqInlineSize. > > John suggested a way to improve our heuristics: > > https://bugs.openjdk.java.net/browse/JDK-6316156?focusedCommentId=13443564&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13443564 > > Roland. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Fri Jun 5 17:08:23 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 5 Jun 2015 13:08:23 -0400 Subject: Inlining methods with large switch statements In-Reply-To: References: <198560A2-8173-4B19-9095-3B1729903690@oracle.com> Message-ID: Thanks John. A switch should not be measured by the raw size of the instruction. > If several adjacent keys branch to the same successor, surely that should > count as a single test and branch. Although the adjacent keys branching to same successor is yet another case, it seems like any switch (whether it has adjacent keys or not) that gets lowered to a jump table should be *heavily* discounted by whatever heuristic is used. If the JIT were to incorporate profile info and detect common targets, then even better :). Thanks guys. On Fri, Jun 5, 2015 at 1:02 PM, John Rose wrote: > Bug updated; thanks! 
> > On Jun 5, 2015, at 9:04 AM, Vitaly Davidovich wrote: > > Right, I didn't see FreqInlineSize on there, and the large jump table > scenario seemed like one worth calling out as well? > > On Fri, Jun 5, 2015 at 12:03 PM, Roland Westrelin < > roland.westrelin at oracle.com> wrote: > >> > By "handled better" I mean for the JIT to not get scared about the >> bytecode size since machine code is rather compact and quick to execute >> (especially if the indirect jump via the jump table is predicted well). >> This is somewhat analogous to the JIT being spooked by methods > >> MaxInlineSize where the actual bytecode size isn't representative of the >> real cost (e.g. dead code, asserts, etc), but for FreqInlineSize. >> >> John suggested a way to improve our heuristics: >> >> >> https://bugs.openjdk.java.net/browse/JDK-6316156?focusedCommentId=13443564&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13443564 >> >> Roland. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.r.rose at oracle.com Fri Jun 5 17:08:48 2015 From: john.r.rose at oracle.com (John Rose) Date: Fri, 5 Jun 2015 10:08:48 -0700 Subject: InlineCacheBuffer + GuaranteedSafepointInterval In-Reply-To: References: <716EE683-BE62-485E-89E9-77143E3DB38C@oracle.com> Message-ID: <5D786338-E63B-43A7-A0DE-A53E08B3AB9E@oracle.com> On Jun 5, 2015, at 8:48 AM, Vitaly Davidovich wrote: > > Thanks for the insight. Do you know offhand the condition(s) that prevent updating IC call sites atomically? If you look at the structure of an IC, you'll note it consists of a set-constant instruction and a jump instruction. The jump can be patched atomically, but the set-constant cannot be part of the same transaction. The resulting race conditions are made innocuous with lots of fiddling. IC buffering helps do this. The IC's can be un-buffered (to run at speed) only at safepoints. ? John -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Fri Jun 5 17:17:07 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 5 Jun 2015 13:17:07 -0400 Subject: InlineCacheBuffer + GuaranteedSafepointInterval In-Reply-To: <5D786338-E63B-43A7-A0DE-A53E08B3AB9E@oracle.com> References: <716EE683-BE62-485E-89E9-77143E3DB38C@oracle.com> <5D786338-E63B-43A7-A0DE-A53E08B3AB9E@oracle.com> Message-ID: Understood, thanks John. Am I right in my other question regarding when IC needs patching in first place? If I see safepoint due to this, some IC saw a new receiver? Is there a flag (non-debug) to see which call site and what new receiver appeared? I may reorganize code to avoid this if possible. sent from my phone On Jun 5, 2015 1:08 PM, "John Rose" wrote: > On Jun 5, 2015, at 8:48 AM, Vitaly Davidovich wrote: > > > Thanks for the insight. Do you know offhand the condition(s) that prevent > updating IC call sites atomically? > > > If you look at the structure of an IC, you'll note it consists of a > set-constant instruction and a jump instruction. > The jump can be patched atomically, but the set-constant cannot be part of > the same transaction. > The resulting race conditions are made innocuous with lots of fiddling. > IC buffering helps do this. > The IC's can be un-buffered (to run at speed) only at safepoints. > ? John > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john.r.rose at oracle.com Fri Jun 5 17:19:45 2015 From: john.r.rose at oracle.com (John Rose) Date: Fri, 5 Jun 2015 10:19:45 -0700 Subject: InlineCacheBuffer + GuaranteedSafepointInterval In-Reply-To: References: <716EE683-BE62-485E-89E9-77143E3DB38C@oracle.com> <5D786338-E63B-43A7-A0DE-A53E08B3AB9E@oracle.com> Message-ID: On Jun 5, 2015, at 10:17 AM, Vitaly Davidovich wrote: > > Understood, thanks John. > > Am I right in my other question regarding when IC needs patching in first place? If I see safepoint due to this, some IC saw a new receiver? Is there a flag (non-debug) to see which call site and what new receiver appeared? I may reorganize code to avoid this if possible. > Look in globals.hpp for TraceICs; perhaps you already saw it. It's debug-only so you need a special build. ? John -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Fri Jun 5 17:23:33 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 5 Jun 2015 13:23:33 -0400 Subject: InlineCacheBuffer + GuaranteedSafepointInterval In-Reply-To: References: <716EE683-BE62-485E-89E9-77143E3DB38C@oracle.com> <5D786338-E63B-43A7-A0DE-A53E08B3AB9E@oracle.com> Message-ID: Yeah I saw that one, but was hoping there's a non debug one that I missed. Could you make this one diagnostic? :) sent from my phone On Jun 5, 2015 1:19 PM, "John Rose" wrote: > On Jun 5, 2015, at 10:17 AM, Vitaly Davidovich wrote: > > > Understood, thanks John. > > Am I right in my other question regarding when IC needs patching in first > place? If I see safepoint due to this, some IC saw a new receiver? Is there > a flag (non-debug) to see which call site and what new receiver appeared? I > may reorganize code to avoid this if possible. > > Look in globals.hpp for TraceICs; perhaps you already saw it. It's > debug-only so you need a special build. ? John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Fri Jun 5 19:44:23 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 05 Jun 2015 12:44:23 -0700 Subject: RFR(L): 8081247 AVX 512 extended support code review request In-Reply-To: References: <556FA54E.8050001@oracle.com> Message-ID: <5571FC17.3040108@oracle.com> This looks good to me. Thanks, Vladimir On 6/4/15 9:46 PM, Berg, Michael C wrote: > Vladimir please find the following webrev with the suggested changes, I have added small signature functions which look like the old versions in the assembler but manage the problem I need to handle, which is additional state for legacy only instructions. There is a new vm_version function which handles the cpuid checks with a conglomerate approach for the one scenario which had it. > The loop in the stub generator is now formed to alter the upper bound and execute in one path. 
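The single-path loop described here follows the usual shape of folding two nearly identical loops into one by computing the bound up front, roughly (a language-neutral sketch in Java; the real code is the C++ register save/restore loop in stubGenerator_x86_64.cpp, and the counts below are illustrative assumptions):

    static void saveAll(long[] saveArea, boolean wideVectors) {
        int limit = wideVectors ? 32 : 16;   // assumed register counts for the two cases
        for (int i = 0; i < limit; i++) {
            saveArea[i] = i;                 // stands in for the per-register save work
        }
    }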
> > http://cr.openjdk.java.net/~mcberg/8081247/webrev.03/ > > Regards, > Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, June 03, 2015 6:10 PM > To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' > Subject: Re: RFR(L): 8081247 AVX 512 extended support code review request > > Hi, Michael > > assembler_x86.cpp: > > I don't like that you replaced prefix method with few parameters with method which has a lot of them: > > - int encode = vex_prefix_0F38_and_encode_q(dst, src1, src2); > + int encode = vex_prefix_and_encode(dst->encoding(), src1->encoding(), > src2->encoding(), > + VEX_SIMD_NONE, VEX_OPCODE_0F_38, > true, AVX_128bit, > + true, false); > > Why you did that? > > > stubGenerator_x86_64.cpp: > > Can we set different loop limit based on UseAVX instead of having 2 loops. > > x86.ad: > > Instead of long condition expressions like next: > > UseAVX > 0 && !VM_Version::supports_avx512vl() && > !VM_Version::supports_avx512bw() > > May be have one VM_Version finction which does these checks. > > Thanks, > Vladimir > > On 6/2/15 9:38 PM, Berg, Michael C wrote: >> Hi Folks, >> >> I would like to contribute more AVX512 enabling to facilitate support >> for machines which utilize EVEX encoding. I need two reviewers to >> review this patch and comment as needed: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8081247 >> >> webrev: >> >> http://cr.openjdk.java.net/~mcberg/8081247/webrev.01/ >> >> This patch enables BMI code on EVEX targets, improves replication >> patterns to be more efficient on both EVEX enabled and legacy targets, >> adds more CPUID based rules for correct code generation on various >> EVEX enabled servers, extends more call save/restore functionality and >> extends the vector space further for SIMD operations. Please expedite >> this review as there is a near term need for the support. >> >> Also, as I am not yet a committer, this code needs a sponsor as well. >> >> Thanks, >> >> Michael >> From vladimir.kozlov at oracle.com Fri Jun 5 19:55:07 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 05 Jun 2015 12:55:07 -0700 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: <556F9AF7.5050301@oracle.com> Message-ID: <5571FE9B.3080207@oracle.com> Looks good. I would invert not_slp variable to positive is_slp - if(is_slp) looks more readable. Thanks, Vladimir On 6/4/15 9:45 PM, Berg, Michael C wrote: > Vladimir, please find the following webrev with pretty much the full list of changes made. > I made some improvements too. For instance I only allow the analysis to take place when we are trying to unroll beyond the default. > > http://cr.openjdk.java.net/~mcberg/8080325/webrev.01/ > > Regards, > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, June 03, 2015 5:25 PM > To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8080325 SuperWord loop unrolling analysis > > Thank you, Michael, for this contribution. > > First, I am fine with such approach - call superword only to collect data about loop. > > It could be useful for superword optimization to have a pass over loop's nodes to determine if it could be vectorize as separate phase. And avoid to do that in SLP analysis. 
> > Make SuperWordLoopUnrollAnalysis flag's default value to 'false' and set it to true only in vm_version_x86.cpp (#ifdef COMPILER2) so that you don't need to modify setting on all platforms. > > In flag description say what 'slp' means (Superword Level Parallelism). > > Code style: > > if (cl->is_reduction_loop() == false) phase->mark_reductions(this); > > should be: > > if (!cl->is_reduction_loop()) { > phase->mark_reductions(this); > } > > An other one: cl->has_passed_slp() == false. We use ! for such cases. > > There are following checks after new code which prevent unrolling. Why you do analysis before them without affecting their decision? And you use the result of analysis only later at the end of method. Why not do analysis there then? > > (_local_loop_unroll_factor > 4) check should be done inside > policy_slp_max_unroll() to have only one check. Actually all next code (lines 668-692) could be done at the end of policy_slp_max_unroll(). > And you don't need to return bool then (I changed name too): > > 665 if (LoopMaxUnroll > _local_loop_unroll_factor) { > 666 // Once policy_slp_analysis succeeds, mark the loop with the > 667 // maximal unroll factor so that we minimize analysis passes > 668 policy_unroll_slp_analysis(cl, phase); > 693 } > 694 } > 695 > 696 // Check for initial stride being a small enough constant > 697 if (abs(cl->stride_con()) > (1<<2)*future_unroll_ct) return false; > > I think slp analysis code should be in superword.cpp - SWPointer should be used only there. Just add an other method to SuperWord class. > > Instead of a->Afree() and next: > 810 Arena *a = Thread::current()->resource_area(); > 812 size_t ignored_size = _body.size()*sizeof(int*); > 813 int *ignored_loop_nodes = (int*)a->Amalloc_D(ignored_size); > 814 Node_Stack nstack((int)ignored_size); > > use: > > ResourceMark rm; > size_t ignored_size = _body.size(); > int *ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size); > Node_Stack nstack((int)ignored_size); > > Node_Stack should take number of nodes and not bytes. > > I am concern about nstack.clear() because you have poped all nodes on it. > > Thanks, > Vladimir > > On 5/13/15 6:26 PM, Berg, Michael C wrote: >> Hi Folks, >> >> We (Intel) would like to contribute superword loop unrolling analysis >> to facilitate more default unrolling and larger SIMD vector mapping. >> Please review this patch and comment as needed: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 >> >> webrev: >> >> http://cr.openjdk.java.net/~kvn/8080325/webrev/ >> >> The design finds the highest common vector supported and implemented >> on a given machine and applies that to unrolling, iff it is greater >> than the default. If the user gates unrolling we will still obey the >> user directive. It's light weight, when we get to the analysis part, >> if we fail, we stop further tries. If we succeed we stop further >> tries. We generally always analyze only once. We then gate the unroll >> factor by extending the size of the unroll segment so that the >> iterative tries will keep unrolling, obtaining larger unrolls of a >> targeted loop. I see no negative behavior wrt to performance, and a >> modest positive swing in default behavior up to 1.5x for some micros. >> >> Vladimir Koslov has offered to sponsor this patch. 
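For context, the reduction loops that mark_reductions() deals with (see the is_reduction_loop() code-style comment above) have a cross-iteration dependence on an accumulator, for example (an assumed illustration):

    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];   // the carried dependence on s makes this a reduction
        }
        return s;
    }

Such loops need the reduction marking before SuperWord will vectorize them safely.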
>> From michael.c.berg at intel.com Fri Jun 5 20:03:42 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Fri, 5 Jun 2015 20:03:42 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: <5571FE9B.3080207@oracle.com> References: <556F9AF7.5050301@oracle.com> <5571FE9B.3080207@oracle.com> Message-ID: Ok, will do, I will roll it into any comments I get from Roland. I have already changed my copy and will have in the final webrev. -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Friday, June 05, 2015 12:55 PM To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8080325 SuperWord loop unrolling analysis Looks good. I would invert not_slp variable to positive is_slp - if(is_slp) looks more readable. Thanks, Vladimir On 6/4/15 9:45 PM, Berg, Michael C wrote: > Vladimir, please find the following webrev with pretty much the full list of changes made. > I made some improvements too. For instance I only allow the analysis to take place when we are trying to unroll beyond the default. > > http://cr.openjdk.java.net/~mcberg/8080325/webrev.01/ > > Regards, > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, June 03, 2015 5:25 PM > To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8080325 SuperWord loop unrolling analysis > > Thank you, Michael, for this contribution. > > First, I am fine with such approach - call superword only to collect data about loop. > > It could be useful for superword optimization to have a pass over loop's nodes to determine if it could be vectorize as separate phase. And avoid to do that in SLP analysis. > > Make SuperWordLoopUnrollAnalysis flag's default value to 'false' and set it to true only in vm_version_x86.cpp (#ifdef COMPILER2) so that you don't need to modify setting on all platforms. > > In flag description say what 'slp' means (Superword Level Parallelism). > > Code style: > > if (cl->is_reduction_loop() == false) phase->mark_reductions(this); > > should be: > > if (!cl->is_reduction_loop()) { > phase->mark_reductions(this); > } > > An other one: cl->has_passed_slp() == false. We use ! for such cases. > > There are following checks after new code which prevent unrolling. Why you do analysis before them without affecting their decision? And you use the result of analysis only later at the end of method. Why not do analysis there then? > > (_local_loop_unroll_factor > 4) check should be done inside > policy_slp_max_unroll() to have only one check. Actually all next code (lines 668-692) could be done at the end of policy_slp_max_unroll(). > And you don't need to return bool then (I changed name too): > > 665 if (LoopMaxUnroll > _local_loop_unroll_factor) { > 666 // Once policy_slp_analysis succeeds, mark the loop with the > 667 // maximal unroll factor so that we minimize analysis passes > 668 policy_unroll_slp_analysis(cl, phase); > 693 } > 694 } > 695 > 696 // Check for initial stride being a small enough constant > 697 if (abs(cl->stride_con()) > (1<<2)*future_unroll_ct) return false; > > I think slp analysis code should be in superword.cpp - SWPointer should be used only there. Just add an other method to SuperWord class. 
> > Instead of a->Afree() and next: > 810 Arena *a = Thread::current()->resource_area(); > 812 size_t ignored_size = _body.size()*sizeof(int*); > 813 int *ignored_loop_nodes = (int*)a->Amalloc_D(ignored_size); > 814 Node_Stack nstack((int)ignored_size); > > use: > > ResourceMark rm; > size_t ignored_size = _body.size(); > int *ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size); > Node_Stack nstack((int)ignored_size); > > Node_Stack should take number of nodes and not bytes. > > I am concern about nstack.clear() because you have poped all nodes on it. > > Thanks, > Vladimir > > On 5/13/15 6:26 PM, Berg, Michael C wrote: >> Hi Folks, >> >> We (Intel) would like to contribute superword loop unrolling analysis >> to facilitate more default unrolling and larger SIMD vector mapping. >> Please review this patch and comment as needed: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 >> >> webrev: >> >> http://cr.openjdk.java.net/~kvn/8080325/webrev/ >> >> The design finds the highest common vector supported and implemented >> on a given machine and applies that to unrolling, iff it is greater >> than the default. If the user gates unrolling we will still obey the >> user directive. It's light weight, when we get to the analysis part, >> if we fail, we stop further tries. If we succeed we stop further >> tries. We generally always analyze only once. We then gate the unroll >> factor by extending the size of the unroll segment so that the >> iterative tries will keep unrolling, obtaining larger unrolls of a >> targeted loop. I see no negative behavior wrt to performance, and a >> modest positive swing in default behavior up to 1.5x for some micros. >> >> Vladimir Koslov has offered to sponsor this patch. >> From vladimir.kozlov at oracle.com Sat Jun 6 02:06:57 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 05 Jun 2015 19:06:57 -0700 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test In-Reply-To: <555DAD7C.2090805@redhat.com> References: <5559EFA0.5060800@redhat.com> <555D0460.4080805@oracle.com> <555DAD7C.2090805@redhat.com> Message-ID: <557255C1.6060101@oracle.com> > We either need some code in WhiteBox to check for a deoptimization > event properly or we should just remove this altogether. > So, thoughts? Just delete the check? No, we have to check that uncommon trap was hit. This is the main purpose of that test. May be we should add an other method WHITE_BOX.wasMethodDeopted(method) which checks if method was deoptimized at least once. Vladimir On 5/21/15 3:03 AM, Andrew Haley wrote: > On 05/20/2015 11:02 PM, Vladimir Kozlov wrote: >> testVarClassCast tests deoptimization for javaMirror == null: >> >> void testVarClassCast(String s) { >> Class cl = (s == null) ? null : String.class; >> try { >> ssink = (String)cl.cast(svalue); >> >> Which is done in LibraryCallKit::inline_Class_cast() by: >> >> mirror = null_check(mirror); >> >> which has Deoptimization::Action_make_not_entrant. >> >> Unfortunately currently the test also pass because unstable_if is >> generated for the first line: >> >> (s == null) ? null : String.class; >> >> If you run the test with TraceDoptimization (or LogCompilation) you will >> see: >> >> Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast >> (@0x000000010b0670d8) thread=26883 reason=unstable_if action=reinterpret >> unloaded_class_index=-1 > > Not quite the same. 
I get a reason=null_check: > > Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast (@0x000003ff8d253e54) thread=4396243677696 reason=null_check action=maybe_recompile unloaded_class_index=-1 > > Which comes from a SEGV here: > > 0x000003ff89253ca4: ldr x0, [x10,#72] ; implicit exception: dispatches to 0x000003ff89253e40 > > which is the line > > ssink = (String)cl.cast(svalue); > > I don't get a trap for unstable_if because there isn't one. I just get > > 0x000003ff89253c90: cbz x2, 0x000003ff89253e00 (this is java/lang/String s) > > --> > > 0x000003ff89253e00: mov x10, xzr > 0x000003ff89253e04: b 0x000003ff89253ca0 > > --> > > 0x000003ff89253ca0: lsl x11, x11, #3 ;*getstatic svalue > ; - NullCheckDroppingsTest::testVarClassCast at 13 (line 181) > 0x000003ff89253ca4: ldr x0, [x10,#72] ; implicit exception: dispatches to 0x000003ff89253e40 > > ... and then the trap for the null pointer exception. > > Andrew. > From rednaxelafx at gmail.com Mon Jun 8 05:47:38 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Sun, 7 Jun 2015 22:47:38 -0700 Subject: Question on C1's as_ValueType(ciConstant value) Message-ID: Hi compiler team, I'd like to ask a question about a piece of code in C1. The code snippet below is from the tip version of jdk9/hs-comp. 145 ValueType* as_ValueType(ciConstant value) { 146 switch (value.basic_type()) { 147 case T_BYTE : // fall through 148 case T_CHAR : // fall through 149 case T_SHORT : // fall through 150 case T_BOOLEAN: // fall through 151 case T_INT : return new IntConstant (value.as_int ()); 152 case T_LONG : return new LongConstant (value.as_long ()); 153 case T_FLOAT : return new FloatConstant (value.as_float ()); 154 case T_DOUBLE : return new DoubleConstant(value.as_double()); 155 case T_ARRAY : // fall through (ciConstant doesn't have an array accessor) 156 case T_OBJECT : return new ObjectConstant(value.as_object()); 157 } 158 ShouldNotReachHere(); 159 return illegalType; 160 } On lines 155 and 156, both basic types T_ARRAY and T_OBJECT turns into a ObjectConstant. That's not consistent with the handling in GraphKit::load_constant(), where ArrayConstant, InstanceConstant and ObjectConstant are treated separately. I ran into this inconsistency when I wanted to try out something with array constants. But I was only able to reach the constant from an ObjectConstant, instead of an ArrayConstant like I was expecting. If people agree that this inconsistency should be fixed, I'd be happy to provide a patch and test it. Thanks, Kris -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Mon Jun 8 06:48:25 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Sun, 7 Jun 2015 23:48:25 -0700 Subject: Question on C1's as_ValueType(ciConstant value) In-Reply-To: References: Message-ID: Oops, s/GraphKit/GraphBuilder/ in my last email. Thanks, Kris On Sun, Jun 7, 2015 at 10:47 PM, Krystal Mok wrote: > Hi compiler team, > > I'd like to ask a question about a piece of code in C1. The code snippet > below is from the tip version of jdk9/hs-comp. 
> > 145 ValueType* as_ValueType(ciConstant value) { > 146 switch (value.basic_type()) { > 147 case T_BYTE : // fall through > 148 case T_CHAR : // fall through > 149 case T_SHORT : // fall through > 150 case T_BOOLEAN: // fall through > 151 case T_INT : return new IntConstant (value.as_int ()); > 152 case T_LONG : return new LongConstant (value.as_long ()); > 153 case T_FLOAT : return new FloatConstant (value.as_float ()); > 154 case T_DOUBLE : return new DoubleConstant(value.as_double()); > 155 case T_ARRAY : // fall through (ciConstant doesn't have an array > accessor) > 156 case T_OBJECT : return new ObjectConstant(value.as_object()); > 157 } > 158 ShouldNotReachHere(); > 159 return illegalType; > 160 } > > On lines 155 and 156, both basic types T_ARRAY and T_OBJECT turns into a > ObjectConstant. > That's not consistent with the handling in GraphKit::load_constant(), > where ArrayConstant, InstanceConstant and ObjectConstant are treated > separately. > > I ran into this inconsistency when I wanted to try out something with > array constants. But I was only able to reach the constant from an > ObjectConstant, instead of an ArrayConstant like I was expecting. > > If people agree that this inconsistency should be fixed, I'd be happy to > provide a patch and test it. > > Thanks, > Kris > -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.westrelin at oracle.com Mon Jun 8 15:38:40 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 8 Jun 2015 17:38:40 +0200 Subject: 8081823: C2 performs unsigned comparison against -1 Message-ID: <7FBB5478-2F58-4B94-92AE-C02C904DF030@oracle.com> http://cr.openjdk.java.net/~roland/8081823/webrev.00/ C2 folds: if (i <= a || i > b) { as: if (i - a - 1 >u b - a - 1) { a == b is allowed and the test becomes then if (i-1 >u -1) { which is never true. Same is true with if (i > b || i <= a) { The fix folds it as: if (i - a - 1 >=u b - a) { which is always true for a == b Roland. From roland.westrelin at oracle.com Mon Jun 8 15:40:24 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 8 Jun 2015 17:40:24 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: <5553C1E2.3040003@oracle.com> <555CE2B2.8020305@oracle.com> <5B87D4E4-FA26-411F-BD03-9F9CAA1E0CBF@oracle.com> <557073C6.1050002@oracle.com> <01E73384-C419-400E-9712-474474E2D3C9@oracle.com> Message-ID: <8DBA0178-E8EA-4F35-94CC-6694DB58850C@oracle.com> > Looks good. Thanks for the review (I had pushed already when you sent that comment. Roland. From jan.civlin at intel.com Mon Jun 8 16:15:22 2015 From: jan.civlin at intel.com (Civlin, Jan) Date: Mon, 8 Jun 2015 16:15:22 +0000 Subject: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord Message-ID: <39F83597C33E5F408096702907E6C450E4D6F7@ORSMSX104.amr.corp.intel.com> Hi All, We would like to contribute to Fixing bugs in detecting memory alignments in SuperWord. The contribution Bug ID: 8085932. Please review this patch: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8085932 webrev: http://cr.openjdk.java.net/~kvn/8085932/webrev.00/ Description: Fixing bugs in detecting memory alignments in SuperWord Fixing bugs in detecting memory alignments in SuperWord: SWPointer::scaled_iv_plus_offset (fixing here a bug in detection of "scale"), SWPointer::offset_plus_k (fixing here a bug in detection of "invariant"), Add tracing output to the code that deal with memory alignment. 
The following routines are traceable: SWPointer::scaled_iv_plus_offset SWPointer::offset_plus_k SWPointer::scaled_iv, WPointer::SWPointer, SuperWord::memory_alignment Tracing is done only for NOT_PRODUCT. Currently tracing is controlled by VectorizeDebug: #ifndef PRODUCT if (_phase->C->method() != NULL) { _phase->C->method()->has_option_value("VectorizeDebug", _vector_loop_debug); } #endif And VectorizeDebug may take any combination (bitwise OR) of the following values: bool is_trace_alignment() { return (_vector_loop_debug & 2) > 0; } bool is_trace_mem_slice() { return (_vector_loop_debug & 4) > 0; } bool is_trace_loop() { return (_vector_loop_debug & 8) > 0; } bool is_trace_adjacent() { return (_vector_loop_debug & 16) > 0; } -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Mon Jun 8 16:18:53 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 08 Jun 2015 09:18:53 -0700 Subject: 8081823: C2 performs unsigned comparison against -1 In-Reply-To: <7FBB5478-2F58-4B94-92AE-C02C904DF030@oracle.com> References: <7FBB5478-2F58-4B94-92AE-C02C904DF030@oracle.com> Message-ID: <5575C06D.2070604@oracle.com> Looks good. Thank you for adding explanation comments. Thanks, Vladimir On 6/8/15 8:38 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8081823/webrev.00/ > > C2 folds: > > if (i <= a || i > b) { > > as: > > if (i - a - 1 >u b - a - 1) { > > a == b is allowed and the test becomes then if (i-1 >u -1) { which is never true. > > Same is true with if (i > b || i <= a) { > > The fix folds it as: > > if (i - a - 1 >=u b - a) { > > which is always true for a == b > > Roland. > From vladimir.x.ivanov at oracle.com Mon Jun 8 17:45:20 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 08 Jun 2015 20:45:20 +0300 Subject: 8081823: C2 performs unsigned comparison against -1 In-Reply-To: <7FBB5478-2F58-4B94-92AE-C02C904DF030@oracle.com> References: <7FBB5478-2F58-4B94-92AE-C02C904DF030@oracle.com> Message-ID: <5575D4B0.20105@oracle.com> Looks good. Best regards, Vladimir Ivanov On 6/8/15 6:38 PM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8081823/webrev.00/ > > C2 folds: > > if (i <= a || i > b) { > > as: > > if (i - a - 1 >u b - a - 1) { > > a == b is allowed and the test becomes then if (i-1 >u -1) { which is never true. > > Same is true with if (i > b || i <= a) { > > The fix folds it as: > > if (i - a - 1 >=u b - a) { > > which is always true for a == b > > Roland. > From rcmuir at gmail.com Tue Jun 9 01:24:00 2015 From: rcmuir at gmail.com (Robert Muir) Date: Mon, 8 Jun 2015 21:24:00 -0400 Subject: reproducible compiler issue with latest jdk9 Message-ID: Hello, I think we found something in lucene testing when testing the latest java 9 ea B27. It also fails with latest tip. 
See the following test case: http://pastebin.com/U3TCFGNu It passes with -Xint and will fail otherwise: rmuir at beast:~$ java -version openjdk version "1.9.0-internal" OpenJDK Runtime Environment (build 1.9.0-internal-rmuir_2015_06_08_18_48-b00) OpenJDK 64-Bit Server VM (build 1.9.0-internal-rmuir_2015_06_08_18_48-b00, mixed mode) rmuir at beast:~$ java ShouldWork Exception in thread "main" java.lang.AssertionError: expected=mklefvn, actual=m at ShouldWork.main(ShouldWork.java:20) rmuir at beast:~$ java -Xint ShouldWork rmuir at beast:~$ From rcmuir at gmail.com Tue Jun 9 02:13:24 2015 From: rcmuir at gmail.com (Robert Muir) Date: Mon, 8 Jun 2015 22:13:24 -0400 Subject: reproducible compiler issue with latest jdk9 In-Reply-To: References: Message-ID: If it helps, -XX:-DoEscapeAnalysis is enough to make the test pass (and make lucene tests go green) On Mon, Jun 8, 2015 at 9:24 PM, Robert Muir wrote: > Hello, > > I think we found something in lucene testing when testing the latest > java 9 ea B27. > It also fails with latest tip. > > See the following test case: http://pastebin.com/U3TCFGNu > > It passes with -Xint and will fail otherwise: > > rmuir at beast:~$ java -version > openjdk version "1.9.0-internal" > OpenJDK Runtime Environment (build 1.9.0-internal-rmuir_2015_06_08_18_48-b00) > OpenJDK 64-Bit Server VM (build > 1.9.0-internal-rmuir_2015_06_08_18_48-b00, mixed mode) > rmuir at beast:~$ java ShouldWork > Exception in thread "main" java.lang.AssertionError: expected=mklefvn, actual=m > at ShouldWork.main(ShouldWork.java:20) > rmuir at beast:~$ java -Xint ShouldWork > rmuir at beast:~$ From dawid.weiss at gmail.com Tue Jun 9 07:33:40 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Tue, 9 Jun 2015 09:33:40 +0200 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: <03323516-B787-4D8B-AF1F-C3D54C3BE1F1@oracle.com> References: <5567432D.3040108@oracle.com> <03323516-B787-4D8B-AF1F-C3D54C3BE1F1@oracle.com> Message-ID: Is there any documentation explaining how patches and bug fixes are merged into beta builds? I see the AIOOB issue was fixed in "team", but it apparently didn't make it into jdk9 b67: https://bugs.openjdk.java.net/browse/JDK-8080976 Just curious (Robert Muir has found a compiler issue in b67 and it seems like it's a separate one, which made me look for the fix to my original problem). Dawid On Fri, May 29, 2015 at 4:09 PM, Roland Westrelin wrote: > Michael, > >> Roland I would add a condition so that we save a step, and a comment or two: > > Thanks for the suggestions. I?ll go with your version. > >> I checked the resultant code for the bug, as is the case for the change in the webrev, the reduction is no longer emitted. I also checked the relevant micros and verified the desired behavior is still present. > > Thanks for taking the time to verify the fix is correct. > > Roland. > From rory.odonnell at oracle.com Tue Jun 9 07:39:47 2015 From: rory.odonnell at oracle.com (Rory O'Donnell) Date: Tue, 09 Jun 2015 08:39:47 +0100 Subject: reproducible compiler issue with latest jdk9 In-Reply-To: References: Message-ID: <55769843.2050108@oracle.com> Hi Robert, Could you please log a bug at bugs.java.com, and let us know what issue ID you receive. 
Rgds,Rory On 09/06/2015 03:13, Robert Muir wrote: > If it helps, -XX:-DoEscapeAnalysis is enough to make the test pass > (and make lucene tests go green) > > On Mon, Jun 8, 2015 at 9:24 PM, Robert Muir wrote: >> Hello, >> >> I think we found something in lucene testing when testing the latest >> java 9 ea B27. >> It also fails with latest tip. >> >> See the following test case: http://pastebin.com/U3TCFGNu >> >> It passes with -Xint and will fail otherwise: >> >> rmuir at beast:~$ java -version >> openjdk version "1.9.0-internal" >> OpenJDK Runtime Environment (build 1.9.0-internal-rmuir_2015_06_08_18_48-b00) >> OpenJDK 64-Bit Server VM (build >> 1.9.0-internal-rmuir_2015_06_08_18_48-b00, mixed mode) >> rmuir at beast:~$ java ShouldWork >> Exception in thread "main" java.lang.AssertionError: expected=mklefvn, actual=m >> at ShouldWork.main(ShouldWork.java:20) >> rmuir at beast:~$ java -Xint ShouldWork >> rmuir at beast:~$ -- Rgds,Rory O'Donnell Quality Engineering Manager Oracle EMEA , Dublin, Ireland From rcmuir at gmail.com Tue Jun 9 09:52:19 2015 From: rcmuir at gmail.com (Robert Muir) Date: Tue, 9 Jun 2015 05:52:19 -0400 Subject: reproducible compiler issue with latest jdk9 In-Reply-To: <55769843.2050108@oracle.com> References: <55769843.2050108@oracle.com> Message-ID: Here is the ID: JI-9021603 On Tue, Jun 9, 2015 at 3:39 AM, Rory O'Donnell wrote: > Hi Robert, > > Could you please log a bug at bugs.java.com, and let us know what issue ID > you receive. > > Rgds,Rory > > > On 09/06/2015 03:13, Robert Muir wrote: >> >> If it helps, -XX:-DoEscapeAnalysis is enough to make the test pass >> (and make lucene tests go green) >> >> On Mon, Jun 8, 2015 at 9:24 PM, Robert Muir wrote: >>> >>> Hello, >>> >>> I think we found something in lucene testing when testing the latest >>> java 9 ea B27. >>> It also fails with latest tip. >>> >>> See the following test case: http://pastebin.com/U3TCFGNu >>> >>> It passes with -Xint and will fail otherwise: >>> >>> rmuir at beast:~$ java -version >>> openjdk version "1.9.0-internal" >>> OpenJDK Runtime Environment (build >>> 1.9.0-internal-rmuir_2015_06_08_18_48-b00) >>> OpenJDK 64-Bit Server VM (build >>> 1.9.0-internal-rmuir_2015_06_08_18_48-b00, mixed mode) >>> rmuir at beast:~$ java ShouldWork >>> Exception in thread "main" java.lang.AssertionError: expected=mklefvn, >>> actual=m >>> at ShouldWork.main(ShouldWork.java:20) >>> rmuir at beast:~$ java -Xint ShouldWork >>> rmuir at beast:~$ > > > -- > Rgds,Rory O'Donnell > Quality Engineering Manager > Oracle EMEA , Dublin, Ireland > From mehmet at hazelcast.com Tue Jun 9 10:07:22 2015 From: mehmet at hazelcast.com (Mehmet Dogan) Date: Tue, 09 Jun 2015 10:07:22 +0000 Subject: Array accesses using sun.misc.Unsafe cause data corruption or SIGSEGV Message-ID: Hi all, While I was testing my app using java 8, I encountered the previously reported sun.misc.Unsafe issue. https://bugs.openjdk.java.net/browse/JDK-8076445 http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-April/017685.html Issue status says it's resolved with resolution "Cannot Reproduce". But unfortunately it's still reproducible using "1.8.0_60-ea-b18" and "1.9.0-ea-b67". 
Test is very simple: ``` public static void main(String[] args) throws Exception { Unsafe unsafe = findUnsafe(); // 10000 pass // 100000 jvm crash // 1000000 fail int count = 100000; long size = count * 8L; long baseAddress = unsafe.allocateMemory(size); try { for (int i = 0; i < count; i++) { long address = baseAddress + (i * 8L); long expected = i; unsafe.putLong(address, expected); long actual = unsafe.getLong(address); if (expected != actual) { throw new AssertionError("Expected: " + expected + ", Actual: " + actual); } } } finally { unsafe.freeMemory(baseAddress); } } ``` It's not failing up to version 1.8.0.31, by starting 1.8.0.40 test is failing constantly. - With iteration count 10000, test is passing. - With iteration count 100000, jvm is crashing with SIGSEGV. - With iteration count 1000000, test is failing with AssertionError. When one of compilation (-Xint) or inlining (-XX:-Inline) or on-stack-replacement (-XX:-UseOnStackReplacement) is disabled, test is not failing at all. I tested on platforms: - Centos-7/openjdk-1.8.0.45 - OSX/oraclejdk-1.8.0.40 - OSX/oraclejdk-1.8.0.45 - OSX/oraclejdk-1.8.0_60-ea-b18 - OSX/oraclejdk-1.9.0-ea-b67 Previous issue comment ( https://bugs.openjdk.java.net/browse/JDK-8076445?focusedCommentId=13633043#comment-13633043) says "Cannot reproduce based on the latest version". I hope that latest version is not mentioning to '1.8.0_60-ea-b18' or '1.9.0-ea-b67'. Because both are failing. I'm looking forward to hearing from you. Thanks, -Mehmet Dogan- -- @mmdogan -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.westrelin at oracle.com Tue Jun 9 10:49:22 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 9 Jun 2015 12:49:22 +0200 Subject: Question on C1's as_ValueType(ciConstant value) In-Reply-To: References: Message-ID: <4AE6845E-9B3B-42F1-B02E-A70B8D95F43F@oracle.com> Hi Kris, > 145 ValueType* as_ValueType(ciConstant value) { > 146 switch (value.basic_type()) { > 147 case T_BYTE : // fall through > 148 case T_CHAR : // fall through > 149 case T_SHORT : // fall through > 150 case T_BOOLEAN: // fall through > 151 case T_INT : return new IntConstant (value.as_int ()); > 152 case T_LONG : return new LongConstant (value.as_long ()); > 153 case T_FLOAT : return new FloatConstant (value.as_float ()); > 154 case T_DOUBLE : return new DoubleConstant(value.as_double()); > 155 case T_ARRAY : // fall through (ciConstant doesn't have an array accessor) > 156 case T_OBJECT : return new ObjectConstant(value.as_object()); > 157 } > 158 ShouldNotReachHere(); > 159 return illegalType; > 160 } > > On lines 155 and 156, both basic types T_ARRAY and T_OBJECT turns into a ObjectConstant. > That's not consistent with the handling in GraphKit::load_constant(), where ArrayConstant, InstanceConstant and ObjectConstant are treated separately. > > I ran into this inconsistency when I wanted to try out something with array constants. But I was only able to reach the constant from an ObjectConstant, instead of an ArrayConstant like I was expecting. > > If people agree that this inconsistency should be fixed, I'd be happy to provide a patch and test it. Strange inconsistency indeed. I think it makes sense to fix this. Please go ahead with the patch. I?ll sponsor it. Roland. 
From mehmet at hazelcast.com Tue Jun 9 11:03:54 2015 From: mehmet at hazelcast.com (Mehmet Dogan) Date: Tue, 09 Jun 2015 11:03:54 +0000 Subject: Array accesses using sun.misc.Unsafe cause data corruption or SIGSEGV In-Reply-To: References: Message-ID: Btw, (thanks to one my colleagues), when address calculation in the loop is converted to long address = baseAddress + (i * 8) test passes. Only difference is next long pointer is calculated using integer 8 instead of long 8. ``` for (int i = 0; i < count; i++) { long address = baseAddress + (i * 8); // <--- here, integer 8 instead of long 8 long expected = i; unsafe.putLong(address, expected); long actual = unsafe.getLong(address); if (expected != actual) { throw new AssertionError("Expected: " + expected + ", Actual: " + actual); } } ``` On Tue, Jun 9, 2015 at 1:07 PM Mehmet Dogan wrote: > Hi all, > > While I was testing my app using java 8, I encountered the previously > reported sun.misc.Unsafe issue. > > https://bugs.openjdk.java.net/browse/JDK-8076445 > > http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-April/017685.html > > Issue status says it's resolved with resolution "Cannot Reproduce". But > unfortunately it's still reproducible using "1.8.0_60-ea-b18" and > "1.9.0-ea-b67". > > Test is very simple: > > ``` > public static void main(String[] args) throws Exception { > Unsafe unsafe = findUnsafe(); > // 10000 pass > // 100000 jvm crash > // 1000000 fail > int count = 100000; > long size = count * 8L; > long baseAddress = unsafe.allocateMemory(size); > > try { > for (int i = 0; i < count; i++) { > long address = baseAddress + (i * 8L); > > long expected = i; > unsafe.putLong(address, expected); > > long actual = unsafe.getLong(address); > > if (expected != actual) { > throw new AssertionError("Expected: " + expected + ", > Actual: " + actual); > } > } > } finally { > unsafe.freeMemory(baseAddress); > } > } > ``` > It's not failing up to version 1.8.0.31, by starting 1.8.0.40 test is > failing constantly. > > - With iteration count 10000, test is passing. > - With iteration count 100000, jvm is crashing with SIGSEGV. > - With iteration count 1000000, test is failing with AssertionError. > > When one of compilation (-Xint) or inlining (-XX:-Inline) or > on-stack-replacement (-XX:-UseOnStackReplacement) is disabled, test is not > failing at all. > > I tested on platforms: > - Centos-7/openjdk-1.8.0.45 > - OSX/oraclejdk-1.8.0.40 > - OSX/oraclejdk-1.8.0.45 > - OSX/oraclejdk-1.8.0_60-ea-b18 > - OSX/oraclejdk-1.9.0-ea-b67 > > Previous issue comment ( > https://bugs.openjdk.java.net/browse/JDK-8076445?focusedCommentId=13633043#comment-13633043) > says "Cannot reproduce based on the latest version". I hope that latest > version is not mentioning to '1.8.0_60-ea-b18' or '1.9.0-ea-b67'. Because > both are failing. > > I'm looking forward to hearing from you. > > Thanks, > -Mehmet Dogan- > -- > > @mmdogan > -- @mmdogan -------------- next part -------------- An HTML attachment was scrubbed... URL: From rory.odonnell at oracle.com Tue Jun 9 12:07:29 2015 From: rory.odonnell at oracle.com (Rory O'Donnell) Date: Tue, 09 Jun 2015 13:07:29 +0100 Subject: reproducible compiler issue with latest jdk9 In-Reply-To: <5576C5F1.4020903@oracle.com> References: <55769843.2050108@oracle.com> <5576C5F1.4020903@oracle.com> Message-ID: <5576D701.1070703@oracle.com> Thank Balchandra for moving Incident over. On 09/06/2015 11:54, Balchandra Vaidya wrote: > > Hi Robert, > > Thank you for submitting a bug. 
The JBS id is > https://bugs.openjdk.java.net/browse/JDK-8086046 > > > Regards, > Balchandra > > On 6/9/2015 3:22 PM, Robert Muir wrote: >> Here is the ID: JI-9021603 >> >> On Tue, Jun 9, 2015 at 3:39 AM, Rory O'Donnell >> wrote: >>> Hi Robert, >>> >>> Could you please log a bug at bugs.java.com, and let us know what >>> issue ID >>> you receive. >>> >>> Rgds,Rory >>> >>> >>> On 09/06/2015 03:13, Robert Muir wrote: >>>> If it helps, -XX:-DoEscapeAnalysis is enough to make the test pass >>>> (and make lucene tests go green) >>>> >>>> On Mon, Jun 8, 2015 at 9:24 PM, Robert Muir wrote: >>>>> Hello, >>>>> >>>>> I think we found something in lucene testing when testing the latest >>>>> java 9 ea B27. >>>>> It also fails with latest tip. >>>>> >>>>> See the following test case: http://pastebin.com/U3TCFGNu >>>>> >>>>> It passes with -Xint and will fail otherwise: >>>>> >>>>> rmuir at beast:~$ java -version >>>>> openjdk version "1.9.0-internal" >>>>> OpenJDK Runtime Environment (build >>>>> 1.9.0-internal-rmuir_2015_06_08_18_48-b00) >>>>> OpenJDK 64-Bit Server VM (build >>>>> 1.9.0-internal-rmuir_2015_06_08_18_48-b00, mixed mode) >>>>> rmuir at beast:~$ java ShouldWork >>>>> Exception in thread "main" java.lang.AssertionError: >>>>> expected=mklefvn, >>>>> actual=m >>>>> at ShouldWork.main(ShouldWork.java:20) >>>>> rmuir at beast:~$ java -Xint ShouldWork >>>>> rmuir at beast:~$ >>> >>> -- >>> Rgds,Rory O'Donnell >>> Quality Engineering Manager >>> Oracle EMEA , Dublin, Ireland >>> > -- Rgds,Rory O'Donnell Quality Engineering Manager Oracle EMEA , Dublin, Ireland From roland.westrelin at oracle.com Tue Jun 9 13:02:26 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 9 Jun 2015 15:02:26 +0200 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: <37C8FBB9-1D75-4CE6-B6CE-A83349CEC09C@oracle.com> Message-ID: <546AB289-DAD9-4A26-9F3B-935B53E2F6DB@oracle.com> > http://cr.openjdk.java.net/~mcberg/8080325/webrev.01/ Since you?re touching that code, can you also fix the coding style in: In loopTransform.cpp 644 bool IdealLoopTree::policy_unroll( PhaseIdealLoop *phase ) { In loopnode.hpp 462 bool policy_unroll( PhaseIdealLoop *phase ); (no spaces after/before the parenthesis) In loopnode.hpp: 164 int slp_maximum_unroll_factor; should be: _slp_maximum_unroll_factor 248 int slp_max_unroll() { return slp_maximum_unroll_factor; } could be: int slp_max_unroll() const { 464 // Return TRUE or FALSE if the loop analyzes to map to a maximal 465 // superword unrolling for vectorization. 466 void policy_unroll_slp_analysis(CountedLoopNode *cl, PhaseIdealLoop *phase, int future_unroll_ct); The comment says the function returns something but it doesn?t return anything. I don?t see slp_maximum_unroll_factor being initialized to a default value. Isn?t there a risk it?s not set when we read it? Otherwise, I think it?s good. Roland. From roland.westrelin at oracle.com Tue Jun 9 13:04:11 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 9 Jun 2015 15:04:11 +0200 Subject: 8081823: C2 performs unsigned comparison against -1 In-Reply-To: <5575D4B0.20105@oracle.com> References: <7FBB5478-2F58-4B94-92AE-C02C904DF030@oracle.com> <5575D4B0.20105@oracle.com> Message-ID: Thanks for the reviews, Vladimir & Vladimir. Roland. > On Jun 8, 2015, at 7:45 PM, Vladimir Ivanov wrote: > > Looks good. 
> > Best regards, > Vladimir Ivanov > > On 6/8/15 6:38 PM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8081823/webrev.00/ >> >> C2 folds: >> >> if (i <= a || i > b) { >> >> as: >> >> if (i - a - 1 >u b - a - 1) { >> >> a == b is allowed and the test becomes then if (i-1 >u -1) { which is never true. >> >> Same is true with if (i > b || i <= a) { >> >> The fix folds it as: >> >> if (i - a - 1 >=u b - a) { >> >> which is always true for a == b >> >> Roland. >> From roland.westrelin at oracle.com Tue Jun 9 13:27:25 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 9 Jun 2015 15:27:25 +0200 Subject: RFR(S): 8086016: closed/java/text/Format/NumberFormat/BigDecimalCompatibilityTest.java is crashing Message-ID: http://cr.openjdk.java.net/~roland/8086016/webrev.00/ ArrayCopyNode is being transformed in a subgraph that is dying but not yet entirely destroyed. Roland. From vladimir.kozlov at oracle.com Tue Jun 9 15:55:22 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 09 Jun 2015 08:55:22 -0700 Subject: RFR(S): 8086016: closed/java/text/Format/NumberFormat/BigDecimalCompatibilityTest.java is crashing In-Reply-To: References: Message-ID: <55770C6A.6010905@oracle.com> Looks good. Thanks, Vladimir On 6/9/15 6:27 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8086016/webrev.00/ > > ArrayCopyNode is being transformed in a subgraph that is dying but not yet entirely destroyed. > > Roland. > From igor.ignatyev at oracle.com Tue Jun 9 16:26:32 2015 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Tue, 09 Jun 2015 19:26:32 +0300 Subject: Fwd: [Announcement] Upcoming Online Course: JDK8 MOOC - "Lambdas and Streams" In-Reply-To: <5571FFA2.6050302@oracle.com> References: <5571FFA2.6050302@oracle.com> Message-ID: <557713B8.8080107@oracle.com> tl;dr https://apexapps.oracle.com/pls/apex/f?p=44785:145:0::::P145_EVENT_ID,P145_PREV_PAGE:4887,143 2015-07-14 -- 2015-08-04 -------- Forwarded Message -------- Subject: [Announcement] Upcoming Online Course: JDK8 MOOC - "Lambdas and Streams" Date: Fri, 05 Jun 2015 12:59:30 -0700 From: Sharat Chander To: jdk_confidential_ww_grp Folks, The next Massive Open Online Course (MOOC) covering Javawill launch on July 14th. This MOOC was prepared by Simon Ritter (Java Evangelist extraordinaire) and willfocus on teaching Lambdas and Steams. Please feel free to share this information with your colleagues in the broader Java developer ecosystem. * Clickhere to sign up * Direct URL: https://apexapps.oracle.com/pls/apex/f?p=44785:145:0::::P145_EVENT_ID,P145_PREV_PAGE:4887,143 For more information you can either watch the YouTube video below or read the promotional e-mail I've included below my signature. Warmest Regards, --Sharat ** Online Course Announcement Oracle *JDK8 Massive Open and Online Course: Lambdas and Streams Introduction* *Start Date*: 14-JUL-2015 03:00:00 PM Enroll *Java SE 8 (JDK8) *introduced a fundamentally new way of programming in Java with the introduction of Lambda expressions. Lambda provides a simple way to pass functionality as an argument to another method, such as what action should be taken when someone clicks a button, or how to sort a set of names. Lambda expressions enable you to do this, to treat functionality as a method argument, or code as data. You may have heard about Lambda expressions, and are curious what impact it will have on you as a Java developer. This course is designed to answer your questions and more. 
Have you ever wondered what Lambda expressions are in Java? Have you ever wanted to write parallel code in Java without worrying about threads and locking? Have you ever wanted to process collections of data without using loops? Have you ever wanted to do functional programming in Java? All of these questions will be answered in this practical hands-on *MOOC*. This course introduces two major new changes to the Java platform: Lambda expressions and the Stream API. *Presented by: *Simon Ritter Oracle Corporation ------------------------------------------------------------------------ Send questions or comments to java-mooc-support at beehiveonline.oracle.com Hardware and Software Engineered to Work Together Copyright? 2015, Oracle. All rights reserved. From igor.ignatyev at oracle.com Tue Jun 9 16:27:25 2015 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Tue, 09 Jun 2015 19:27:25 +0300 Subject: Fwd: [Announcement] Upcoming Online Course: JDK8 MOOC - "Lambdas and Streams" In-Reply-To: <557713B8.8080107@oracle.com> References: <5571FFA2.6050302@oracle.com> <557713B8.8080107@oracle.com> Message-ID: <557713ED.6070700@oracle.com> sorry a wrong alias. pls ignore this email. Igor On 06/09/2015 07:26 PM, Igor Ignatyev wrote: > tl;dr > > https://apexapps.oracle.com/pls/apex/f?p=44785:145:0::::P145_EVENT_ID,P145_PREV_PAGE:4887,143 > > > 2015-07-14 -- 2015-08-04 > > -------- Forwarded Message -------- > Subject: [Announcement] Upcoming Online Course: JDK8 MOOC - "Lambdas > and Streams" > Date: Fri, 05 Jun 2015 12:59:30 -0700 > From: Sharat Chander > To: jdk_confidential_ww_grp > > > > Folks, > > The next Massive Open Online Course (MOOC) covering Javawill launch on > July 14th. This MOOC was prepared by Simon Ritter (Java Evangelist > extraordinaire) and willfocus on teaching Lambdas and Steams. > > Please feel free to share this information with your colleagues in the > broader Java developer ecosystem. > > * Clickhere to sign up > > > > * Direct URL: > > https://apexapps.oracle.com/pls/apex/f?p=44785:145:0::::P145_EVENT_ID,P145_PREV_PAGE:4887,143 > > > For more information you can either watch the YouTube video below or > read the promotional e-mail I've included below my signature. > > > > > Warmest Regards, > --Sharat > > ** > > > > > Online Course Announcement > > > > Oracle > > > > *JDK8 Massive Open and Online Course: Lambdas and Streams Introduction* > > > > *Start Date*: 14-JUL-2015 03:00:00 PM > > > > Enroll > > > > *Java SE 8 (JDK8) *introduced a fundamentally new way of programming in > Java with the introduction of Lambda expressions. > > Lambda provides a simple way to pass functionality as an argument to > another method, such as what action should be taken when someone clicks > a button, or how to sort a set of names. Lambda expressions enable you > to do this, to treat functionality as a method argument, or code as data. > > You may have heard about Lambda expressions, and are curious what impact > it will have on you as a Java developer. > > This course is designed to answer your questions and more. > > Have you ever wondered what Lambda expressions are in Java? > Have you ever wanted to write parallel code in Java without worrying > about threads and locking? > Have you ever wanted to process collections of data without using loops? > Have you ever wanted to do functional programming in Java? > > All of these questions will be answered in this practical hands-on > *MOOC*. 
This course introduces two major new changes to the Java > platform: Lambda expressions and the Stream API. > > *Presented by: *Simon Ritter > > Oracle Corporation > > ------------------------------------------------------------------------ > > Send questions or comments to java-mooc-support at beehiveonline.oracle.com > > > Hardware and Software Engineered to Work Together > > > Copyright? 2015, Oracle. All rights reserved. > > > > From ysr1729 at gmail.com Wed Jun 10 07:15:30 2015 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Wed, 10 Jun 2015 00:15:30 -0700 Subject: More visibility into code cache churn Message-ID: I filed https://bugs.openjdk.java.net/browse/JDK-8087107 and attached a patch to the ticket that exposes some useful code cache stats as perf data counters: $ jcmd PerfCounter.print | grep -i "sun\.ci\." sun.ci.codeCacheCapacity=6291456 sun.ci.codeCacheMaxCapacity=6291456 sun.ci.codeCacheMethodsReclaimedNum=1030 sun.ci.codeCacheSweepsTotalNum=93 sun.ci.codeCacheSweepsTotalTimeMillis=63 sun.ci.codeCacheUsed=3386880 ... At Twitter, we've found this useful for easy monitoring of code cache activity, and would like to see this integrated into OpenJDK. thanks! -- ramki -------------- next part -------------- An HTML attachment was scrubbed... URL: From kirk.pepperdine at gmail.com Wed Jun 10 08:53:02 2015 From: kirk.pepperdine at gmail.com (Kirk Pepperdine) Date: Wed, 10 Jun 2015 10:53:02 +0200 Subject: More visibility into code cache churn In-Reply-To: References: Message-ID: Hi Ramki, Anything to improve visibility of the CodeCache would be hugely appreciated! Regards, Kirk On Jun 10, 2015, at 9:15 AM, Srinivas Ramakrishna wrote: > > I filed https://bugs.openjdk.java.net/browse/JDK-8087107 and attached a patch to the ticket that exposes some useful code cache stats as perf data counters: > > $ jcmd PerfCounter.print | grep -i "sun\.ci\." > sun.ci.codeCacheCapacity=6291456 > sun.ci.codeCacheMaxCapacity=6291456 > sun.ci.codeCacheMethodsReclaimedNum=1030 > sun.ci.codeCacheSweepsTotalNum=93 > sun.ci.codeCacheSweepsTotalTimeMillis=63 > sun.ci.codeCacheUsed=3386880 > ... > > At Twitter, we've found this useful for easy monitoring of code cache activity, and would like to see this integrated into OpenJDK. > > thanks! > -- ramki -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.x.ivanov at oracle.com Wed Jun 10 12:34:15 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 10 Jun 2015 15:34:15 +0300 Subject: [9,8u60] RFR (S): 8074551: GWT can be marked non-compilable due to deopt count pollution Message-ID: <55782EC7.30501@oracle.com> http://cr.openjdk.java.net/~vlivanov/8074551/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8074551 JDK-8063137 added profiling machinery for GWT combinator. In order to avoid trap count pollution, trap counts are simply ignored during JIT compilation for methods w/ injected profile. Unfortunately, it not enough to completely eliminate trap count pollution problem. As the regression test case demonstrates, VM marks heavily shared method as non-compilable when too many traps happen there. It causes severe performance regression since the method is neither compiled nor inlineable anymore. With the fix, VM doesn't update trap counts in methods with injected profile when trap reason is either intrinsic or unreached. These are 2 kinds of uncommon traps VM issues based on injected profile. 
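As background for readers who have not met the term: a "GWT combinator" is the method handle returned by MethodHandles.guardWithTest(test, target, fallback), which behaves like test(args) ? target(args) : fallback(args). The minimal, self-contained sketch below only illustrates that API; it is not the regression test from the webrev, and the class and method names (GwtExample, isSmall, small, big) are invented for the example.

```
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class GwtExample {
    // A single guardWithTest handle invoked in a loop. In real workloads many
    // call sites end up sharing the LambdaForm code underneath such handles,
    // which is how deopt counts taken at one site can pollute code shared by
    // all of them.
    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup l = MethodHandles.lookup();
        MethodType mt = MethodType.methodType(int.class, int.class);
        MethodHandle test = l.findStatic(GwtExample.class, "isSmall",
                MethodType.methodType(boolean.class, int.class));
        MethodHandle target = l.findStatic(GwtExample.class, "small", mt);
        MethodHandle fallback = l.findStatic(GwtExample.class, "big", mt);

        // The GWT combinator: behaves like isSmall(x) ? small(x) : big(x).
        MethodHandle gwt = MethodHandles.guardWithTest(test, target, fallback);

        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += (int) gwt.invokeExact(i);
        }
        System.out.println(sum);
    }

    static boolean isSmall(int x) { return x < 1000; }
    static int small(int x)       { return x; }
    static int big(int x)         { return -x; }
}
```

The failure mode described above is exactly that a method shared by many such handles eventually crosses the trap-count threshold and is marked non-compilable, even though no single call site misbehaves.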
I experimented with completely ignoring trap count updates for methods with injected profile, but it causes noticeable regression on a couple of Octane subbenchmarks [2]. I reverted injected profile detection logic and reintroduced an explicit marker for methods which inject profile (@InjectedProfile). All GWTs are marked with it during bytecode translation. Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292, nashorn, octane (no perf regression) Best regards, Vladimir Ivanov [1] https://bugs.openjdk.java.net/browse/JDK-8063137 [2] FTR Richards (340ms->370us) & Regexp(100ms->135ms) From roland.westrelin at oracle.com Wed Jun 10 14:18:21 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 10 Jun 2015 16:18:21 +0200 Subject: RFR(S): Optimize main and post loop out when pre loop is found empty In-Reply-To: References: <4E659695-625A-4454-A89F-1D45F60E3033@oracle.com> Message-ID: <0C460464-9008-4F59-A3F2-0EE4443011AF@oracle.com> Hi Michael, > Roland, in LoopTransform.cpp, in IdealLoopTree::remove_main_post_loops, > Please remove the extra ; on line 2248. > > Also, you might want to make the instances of _next->head and ..->as_CountedLoop() variables in that function to keep the number of dereferences and calls down. May just returning on null and guarding on > Is_CountedLoop(), then assigning a variable for as_Counted Loop within the guard. > > Else it looks good. Thanks for looking at this. I refactored the code following your suggestions. Roland. > > -Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin > Sent: Thursday, June 04, 2015 7:55 AM > To: hotspot compiler > Subject: RFR(S): Optimize main and post loop out when pre loop is found empty > > http://cr.openjdk.java.net/~roland/8085832/webrev.00/ > > This is the change I proposed as a fix for 8078866. > > Roland. From roland.westrelin at oracle.com Wed Jun 10 14:18:36 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 10 Jun 2015 16:18:36 +0200 Subject: RFR(S): 8085832: Optimize main and post loop out when pre loop is found empty In-Reply-To: <55707488.5040600@oracle.com> References: <4E659695-625A-4454-A89F-1D45F60E3033@oracle.com> <55707488.5040600@oracle.com> Message-ID: Thanks for the review, Vladimir. Roland. > On Jun 4, 2015, at 5:53 PM, Vladimir Kozlov wrote: > > Looks fine. > > Thanks, > Vladimir > > On 6/4/15 7:54 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8085832/webrev.00/ >> >> This is the change I proposed as a fix for 8078866. >> >> Roland. >> From roland.westrelin at oracle.com Wed Jun 10 15:03:55 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 10 Jun 2015 17:03:55 +0200 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer Message-ID: http://cr.openjdk.java.net/~roland/8080289/webrev.00/ Sink stores out of loops when possible: for (int i = 0; i < 1000; i++) { // Some stuff that doesn?t prevent the optimization array[idx] = i; } becomes: for (int i = 0; i < 1000; i++) { // Some stuff } array[idx] = 999; Or move stores before the loop when possible: for (int i = 0; i < 1000; i++) { array[idx] = 999; // Some stuff that doesn?t prevent the optimization } becomes: array[idx] = 999; for (int i = 0; i < 1000; i++) { // Some stuff } The change in memnode.cpp is useful to clean up code generated from test_after_5 because the stores are moved out of the loop only after the loop is split and unrolled. 
That code removes duplicate stores. Roland. From vladimir.kozlov at oracle.com Wed Jun 10 17:04:08 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 10 Jun 2015 10:04:08 -0700 Subject: [9,8u60] RFR (S): 8074551: GWT can be marked non-compilable due to deopt count pollution In-Reply-To: <55782EC7.30501@oracle.com> References: <55782EC7.30501@oracle.com> Message-ID: <55786E08.8000603@oracle.com> Looks good to me. Thanks, Vladimir K. On 6/10/15 5:34 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8074551/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8074551 > > JDK-8063137 added profiling machinery for GWT combinator. > In order to avoid trap count pollution, trap counts are simply ignored > during JIT compilation for methods w/ injected profile. > > Unfortunately, it not enough to completely eliminate trap count > pollution problem. As the regression test case demonstrates, VM marks > heavily shared method as non-compilable when too many traps happen > there. It causes severe performance regression since the method is > neither compiled nor inlineable anymore. > > With the fix, VM doesn't update trap counts in methods with injected > profile when trap reason is either intrinsic or unreached. These are 2 > kinds of uncommon traps VM issues based on injected profile. > > I experimented with completely ignoring trap count updates for methods > with injected profile, but it causes noticeable regression on a couple > of Octane subbenchmarks [2]. > > I reverted injected profile detection logic and reintroduced an explicit > marker for methods which inject profile (@InjectedProfile). All GWTs are > marked with it during bytecode translation. > > Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292, > nashorn, octane (no perf regression) > > Best regards, > Vladimir Ivanov > > [1] https://bugs.openjdk.java.net/browse/JDK-8063137 > [2] FTR Richards (340ms->370us) & Regexp(100ms->135ms) From balchandra.vaidya at oracle.com Tue Jun 9 10:54:41 2015 From: balchandra.vaidya at oracle.com (Balchandra Vaidya) Date: Tue, 09 Jun 2015 16:24:41 +0530 Subject: reproducible compiler issue with latest jdk9 In-Reply-To: References: <55769843.2050108@oracle.com> Message-ID: <5576C5F1.4020903@oracle.com> Hi Robert, Thank you for submitting a bug. The JBS id is https://bugs.openjdk.java.net/browse/JDK-8086046 Regards, Balchandra On 6/9/2015 3:22 PM, Robert Muir wrote: > Here is the ID: JI-9021603 > > On Tue, Jun 9, 2015 at 3:39 AM, Rory O'Donnell wrote: >> Hi Robert, >> >> Could you please log a bug at bugs.java.com, and let us know what issue ID >> you receive. >> >> Rgds,Rory >> >> >> On 09/06/2015 03:13, Robert Muir wrote: >>> If it helps, -XX:-DoEscapeAnalysis is enough to make the test pass >>> (and make lucene tests go green) >>> >>> On Mon, Jun 8, 2015 at 9:24 PM, Robert Muir wrote: >>>> Hello, >>>> >>>> I think we found something in lucene testing when testing the latest >>>> java 9 ea B27. >>>> It also fails with latest tip. 
>>>> >>>> See the following test case: http://pastebin.com/U3TCFGNu >>>> >>>> It passes with -Xint and will fail otherwise: >>>> >>>> rmuir at beast:~$ java -version >>>> openjdk version "1.9.0-internal" >>>> OpenJDK Runtime Environment (build >>>> 1.9.0-internal-rmuir_2015_06_08_18_48-b00) >>>> OpenJDK 64-Bit Server VM (build >>>> 1.9.0-internal-rmuir_2015_06_08_18_48-b00, mixed mode) >>>> rmuir at beast:~$ java ShouldWork >>>> Exception in thread "main" java.lang.AssertionError: expected=mklefvn, >>>> actual=m >>>> at ShouldWork.main(ShouldWork.java:20) >>>> rmuir at beast:~$ java -Xint ShouldWork >>>> rmuir at beast:~$ >> >> -- >> Rgds,Rory O'Donnell >> Quality Engineering Manager >> Oracle EMEA , Dublin, Ireland >> From vladimir.kozlov at oracle.com Wed Jun 10 19:31:20 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 10 Jun 2015 12:31:20 -0700 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: Message-ID: <55789088.5050405@oracle.com> We don't call previous node as 'user' - we call them definitions or inputs. Your comment change in memnode.cpp sound strange because of that. The original statement was correct: // If anybody other than 'this' uses 'mem', we cannot fold 'mem' away. in your case it will be: // If anybody other than 'this' uses 'st', we cannot fold 'st' away. Also code does not seems right. The code should go through input memory chain and remove all preceding similar stores - 'this' node remains and we change its memory input which makes previous stores dead. So you can't do 'prev = st'. You need to set improved = true since 'this' will not change. We also use 'make_progress' variable's name in such cases. I try_move_store_before_loop() add check (n_loop->_head == n_ctrl) to make sure it is not other control node inside loop. Then you can check Phi's control as (mem->in(0) == n_ctrl). I don't understand verification code "Check that store's control does post dominate loop entry". In first comment you said "Store has to be first in the loop body" - I understand this as Store's control should be loop's head. You can't allow store to be on one of branches (it will not post dominate) but your check code allow that. Also the check done only in debug VM. If you really want to accept cases when a store is placed after diamond control then you need checks in product to make sure that it is really post dominate head. For that I think you need to go from loopend to loophead through idom links and see if you meet n_ctrl. I don't see assert(n->in(0) in try_move_store_before_loop() but you have it in try_move_store_after_loop(). Why you need next assert?: + assert(n_loop != address_loop, "address is loop varying"); Should you check phi == NULL instead of assert to make sure you have only one Phi node? 
conflict between assert and following check: + assert(new_loop->_child != NULL, ""); + if (new_loop->_child == NULL) new_loop->_body.push(st); Thanks, Vladimir On 6/10/15 8:03 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080289/webrev.00/ > > Sink stores out of loops when possible: > > for (int i = 0; i < 1000; i++) { > // Some stuff that doesn?t prevent the optimization > array[idx] = i; > } > > becomes: > > for (int i = 0; i < 1000; i++) { > // Some stuff > } > array[idx] = 999; > > Or move stores before the loop when possible: > > for (int i = 0; i < 1000; i++) { > array[idx] = 999; > // Some stuff that doesn?t prevent the optimization > } > > becomes: > > array[idx] = 999; > for (int i = 0; i < 1000; i++) { > // Some stuff > } > > The change in memnode.cpp is useful to clean up code generated from test_after_5 because the stores are moved out of the loop only after the loop is split and unrolled. That code removes duplicate stores. > > Roland. > From roland.westrelin at oracle.com Thu Jun 11 10:39:38 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 11 Jun 2015 12:39:38 +0200 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <55789088.5050405@oracle.com> References: <55789088.5050405@oracle.com> Message-ID: <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> Thanks for looking at this, Vladimir. > We don't call previous node as 'user' - we call them definitions or inputs. Your comment change in memnode.cpp sound strange because of that. The original statement was correct: > // If anybody other than 'this' uses 'mem', we cannot fold 'mem' away. > > in your case it will be: > > // If anybody other than 'this' uses 'st', we cannot fold 'st' away. Right. > Also code does not seems right. The code should go through input memory chain and remove all preceding similar stores - 'this' node remains and we change its memory input which makes previous stores dead. So you can't do 'prev = st?. That?s what I think the code does. That is if you have: st1->st2->st3->st4 and st3 is redundant with st1, the chain should become: st1->st2->st4 so we need to change the memory input of st2 when we find st3 can be removed. In the code, at that point, this=st1, st = st3 and prev=st2. > You need to set improved = true since 'this' will not change. We also use 'make_progress' variable's name in such cases. In the example above, if we remove st2, we modify this, right? > I try_move_store_before_loop() add check (n_loop->_head == n_ctrl) to make sure it is not other control node inside loop. Then you can check Phi's control as (mem->in(0) == n_ctrl). > > I don't understand verification code "Check that store's control does post dominate loop entry". In first comment you said "Store has to be first in the loop body" - I understand this as Store's control should be loop's head. You can't allow store to be on one of branches (it will not post dominate) but your check code allow that. Also the check done only in debug VM. > > If you really want to accept cases when a store is placed after diamond control then you need checks in product to make sure that it is really post dominate head. For that I think you need to go from loopend to loophead through idom links and see if you meet n_ctrl. My check code starts from the loop, follows every path and make sure there?s no path that leaves the loop without going through n. 
With example 1 below: for (int i = 0; i < 10; i++) { if (some_condition) { array[idx] = 999; } else { } } We?ll find a path from the head that doesn?t go through the store and that exits the loop. What the comment doesn?t say is that with example 2 below: for (int i = 0; i < 10; i++) { if (some_condition) { uncommon_trap(); } array[idx] = 999; } my verification code would find the early exit as well. It?s verification code only because if we have example 1 above, then we have a memory Phi to merge both branches of the if. So the pattern that we look for in PhaseIdealLoop::try_move_store_before_loop() won?t match: the loop?s memory Phi backedge won?t be the store. If we have example 2 above, then the loop?s memory Phi doesn?t have a single memory use. So I don?t think we need to check that the store post dominate the loop head in product. That?s my reasoning anyway and the verification code is there to verify it. This said, my verification code doesn?t work for infinite loops. It would need to check whether we reach the tail of the loop maybe? > I don't see assert(n->in(0) in try_move_store_before_loop() but you have it in try_move_store_after_loop(). Ok. > > Why you need next assert?: > + assert(n_loop != address_loop, "address is loop varying?); I wonder about that too ;-) I?ll remove it. > Should you check phi == NULL instead of assert to make sure you have only one Phi node? Can there be more than one memory Phi for a particular slice that has in(0) == n_loop->_head? I would have expected that to be impossible. > > conflict between assert and following check: > + assert(new_loop->_child != NULL, ""); > + if (new_loop->_child == NULL) new_loop->_body.push(st); Thanks for spotting that. I?ll remove the assert. Roland. > > Thanks, > Vladimir > > On 6/10/15 8:03 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8080289/webrev.00/ >> >> Sink stores out of loops when possible: >> >> for (int i = 0; i < 1000; i++) { >> // Some stuff that doesn?t prevent the optimization >> array[idx] = i; >> } >> >> becomes: >> >> for (int i = 0; i < 1000; i++) { >> // Some stuff >> } >> array[idx] = 999; >> >> Or move stores before the loop when possible: >> >> for (int i = 0; i < 1000; i++) { >> array[idx] = 999; >> // Some stuff that doesn?t prevent the optimization >> } >> >> becomes: >> >> array[idx] = 999; >> for (int i = 0; i < 1000; i++) { >> // Some stuff >> } >> >> The change in memnode.cpp is useful to clean up code generated from test_after_5 because the stores are moved out of the loop only after the loop is split and unrolled. That code removes duplicate stores. >> >> Roland. >> From vladimir.x.ivanov at oracle.com Thu Jun 11 10:46:42 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 11 Jun 2015 13:46:42 +0300 Subject: [9,8u60] RFR (S): 8074551: GWT can be marked non-compilable due to deopt count pollution In-Reply-To: <55786E08.8000603@oracle.com> References: <55782EC7.30501@oracle.com> <55786E08.8000603@oracle.com> Message-ID: <55796712.5070400@oracle.com> Thanks, Vladimir. Best regards, Vladimir Ivanov On 6/10/15 8:04 PM, Vladimir Kozlov wrote: > Looks good to me. > > Thanks, > Vladimir K. > > On 6/10/15 5:34 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8074551/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8074551 >> >> JDK-8063137 added profiling machinery for GWT combinator. >> In order to avoid trap count pollution, trap counts are simply ignored >> during JIT compilation for methods w/ injected profile. 
>> >> Unfortunately, it not enough to completely eliminate trap count >> pollution problem. As the regression test case demonstrates, VM marks >> heavily shared method as non-compilable when too many traps happen >> there. It causes severe performance regression since the method is >> neither compiled nor inlineable anymore. >> >> With the fix, VM doesn't update trap counts in methods with injected >> profile when trap reason is either intrinsic or unreached. These are 2 >> kinds of uncommon traps VM issues based on injected profile. >> >> I experimented with completely ignoring trap count updates for methods >> with injected profile, but it causes noticeable regression on a couple >> of Octane subbenchmarks [2]. >> >> I reverted injected profile detection logic and reintroduced an explicit >> marker for methods which inject profile (@InjectedProfile). All GWTs are >> marked with it during bytecode translation. >> >> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292, >> nashorn, octane (no perf regression) >> >> Best regards, >> Vladimir Ivanov >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8063137 >> [2] FTR Richards (340ms->370us) & Regexp(100ms->135ms) From sergei.kovalev at oracle.com Thu Jun 11 16:13:12 2015 From: sergei.kovalev at oracle.com (Sergei Kovalev) Date: Thu, 11 Jun 2015 19:13:12 +0300 Subject: RFR(S): 8078145 testlibrary_tests/RandomGeneratorTest.java failed with AssertionError : Unexpected random number sequence for mode: NO_SEED Message-ID: <5579B398.5030001@oracle.com> Hello Team, Please review fix for CR: https://bugs.openjdk.java.net/browse/JDK-8078145 Webrev link: http://cr.openjdk.java.net/~skovalev/8078145/webrev.00/ Problem description: testlibrary_tests/RandomGeneratorTest.java test uses stdout from child java process for analysis. In some cases additional java options could force JVM to produce "system" output that interferes with test output that make output analysis unpredictable. Solution: Forward all test output to a file. In this case JVM system information that printed out to stdout won't interfere with test output. -- With best regards, Sergei From igor.ignatyev at oracle.com Thu Jun 11 16:32:55 2015 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Thu, 11 Jun 2015 19:32:55 +0300 Subject: RFR(S): 8078145 testlibrary_tests/RandomGeneratorTest.java failed with AssertionError : Unexpected random number sequence for mode: NO_SEED In-Reply-To: <5579B398.5030001@oracle.com> References: <5579B398.5030001@oracle.com> Message-ID: <5579B837.5090107@oracle.com> Hi Sergei, Looks good to me. Igor On 06/11/2015 07:13 PM, Sergei Kovalev wrote: > Hello Team, > > Please review fix for CR: https://bugs.openjdk.java.net/browse/JDK-8078145 > Webrev link: http://cr.openjdk.java.net/~skovalev/8078145/webrev.00/ > > Problem description: testlibrary_tests/RandomGeneratorTest.java test > uses stdout from child java process for analysis. In some cases > additional java options could force JVM to produce "system" output that > interferes with test output that make output analysis unpredictable. > > Solution: Forward all test output to a file. In this case JVM system > information that printed out to stdout won't interfere with test output. 
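A minimal sketch of that technique, for illustration only (the real test is built on the jtreg testlibrary, and the class and file names below, such as ChildOutputToFile, are invented): the child JVM writes the value under test into a file named on its command line, so anything the launched VM prints on stdout, version banners, flag dumps and the like, cannot corrupt what the parent parses.

```
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ChildOutputToFile {
    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            // Child mode: write the result into the named file, not to stdout.
            Files.write(Paths.get(args[0]),
                    Long.toString(12345L).getBytes(StandardCharsets.UTF_8));
            return;
        }
        // Parent mode: launch a child JVM (possibly with extra -XX options
        // that make it print to stdout) and read the result from the file.
        Path result = Files.createTempFile("child", ".out");
        Process p = new ProcessBuilder(
                Paths.get(System.getProperty("java.home"), "bin", "java").toString(),
                "-cp", System.getProperty("java.class.path"),
                "ChildOutputToFile", result.toString())
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new AssertionError("child JVM failed");
        }
        String value = new String(Files.readAllBytes(result), StandardCharsets.UTF_8);
        System.out.println("child reported: " + value);
    }
}
```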
> From stefan.johansson at oracle.com Fri Jun 12 08:55:42 2015 From: stefan.johansson at oracle.com (Stefan Johansson) Date: Fri, 12 Jun 2015 10:55:42 +0200 Subject: RFR: 8077279: assert(ic->is_clean()) failed: IC should be clean Message-ID: <557A9E8E.7050903@oracle.com> Hi, Please review this change to fix: https://bugs.openjdk.java.net/browse/JDK-8077279 Webrev: http://cr.openjdk.java.net/~sjohanss/8077279/hotspot.00/ Summary: While doing some extra G1 testing a couple of issues were found. I've been able to reproduce this specific assert but I suspect that both JDK-8077282 and JDK-8077283 are related as well. The root cause of the problem is a patch that was made a few months back that made it possible to have nmethods allocated in a code heap that differs from the compile level of the given nmethod, see [1]. In the NMethodIterator used by G1 class unloading it is assumed that the nmethods have the same compile level as the code heap the are allocated in, which now is wrong. This can, under certain circumstances, lead to not all nmethods being cleaned correctly and then hit this assertion. Testing: * Built and tested in JPRT * RBT run with using these testlists: jdk/test/:jdk,hotspot/test/:hotspot_all,vm.mlvm.testlist. Thanks, Stefan [1] https://bugs.openjdk.java.net/browse/JDK-8072774 From aph at redhat.com Fri Jun 12 11:06:22 2015 From: aph at redhat.com (Andrew Haley) Date: Fri, 12 Jun 2015 12:06:22 +0100 Subject: RFR: 8046943: RSA Acceleration Message-ID: <557ABD2E.7050608@redhat.com> http://cr.openjdk.java.net/~aph/8046943-hs-1/ http://cr.openjdk.java.net/~aph/8046943-jdk-1/ Before: rsa 512 bits 0.000134s 0.000009s 7444.2 111853.3 rsa 1024 bits 0.000674s 0.000029s 1483.9 34456.9 rsa 2048 bits 0.003945s 0.000100s 253.5 9994.7 rsa 4096 bits 0.027015s 0.000373s 37.0 2681.6 After: rsa 512 bits 0.000059s 0.000004s 17022.3 224141.1 rsa 1024 bits 0.000258s 0.000013s 3871.5 78851.0 rsa 2048 bits 0.001506s 0.000042s 664.1 23844.3 rsa 4096 bits 0.010214s 0.000153s 97.9 6516.0 There are some issues we need to discuss. The runtime code is in sharedRuntime_x86_64.cpp even though it's mostly written in C. My thinking here is that porters will want to tweak the code for their back ends, or maybe write it in assembler. It's little-endian and would need some reworking for big-endian machines. But maybe there should be a common version of the code in share/ ? Should it be in optoRuntime instead? It could be called from C1 or even the interpreter, but it's C2-only right now. I've done nothing about 32-bit targets, but I think they would benefit. I had to make some small changes to the Java library. 1. Montgomery multiplication works on whole words. Given that this is a 64-bit implementation I had to force the size of the arrays in oddModPow to be a multiple of 64 bits, i.e. the length of the arrays must be even. Given that RSA and Diffie-Hellman keys are usually a multiple of 64 bits in length I don't think this will be a real performance issue in practice. 2. I needed a 64-bit inverse rather than a 32-bit inverse. This is a single computation, done once for each modular exponentiation, so it makes an immeasurably small difference to the total runtime. 3. I fused squaring and multiplication into a single montgomeryMultiply method. This has two advantages. Firstly, we only need a single intrinsic, and secondly the decision whether to use multiply or squaring can be made in the runtime library. 
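(For readers unfamiliar with the operation being intrinsified: the sketch below shows one whole Montgomery multiplication, a*b*R^-1 mod n, written with BigInteger purely to illustrate the arithmetic. It is not the webrev's code; the class name MontgomeryDemo and the small modulus are invented for the example, and the real runtime code performs the same computation on arrays of 64-bit words, which is why the Java-side array lengths are forced to a multiple of 64 bits.)

```
import java.math.BigInteger;

public class MontgomeryDemo {
    // a and b are in Montgomery form (x*R mod n); the result is a*b*R^-1 mod n,
    // i.e. the Montgomery form of the ordinary product.
    static BigInteger montgomeryMultiply(BigInteger a, BigInteger b,
                                         BigInteger n, BigInteger r,
                                         BigInteger nPrime) {
        BigInteger t = a.multiply(b);
        BigInteger m = t.multiply(nPrime).mod(r);      // m = t * (-n^-1) mod R
        BigInteger u = t.add(m.multiply(n)).divide(r); // exact division by R
        return u.compareTo(n) >= 0 ? u.subtract(n) : u;
    }

    public static void main(String[] args) {
        BigInteger n = new BigInteger("104729");              // small odd modulus
        BigInteger r = BigInteger.ONE.shiftLeft(64);          // R = 2^64 > n
        BigInteger nPrime = n.modInverse(r).negate().mod(r);  // -n^-1 mod R
        BigInteger x = BigInteger.valueOf(12345);
        BigInteger y = BigInteger.valueOf(67890);
        BigInteger xm = x.multiply(r).mod(n);                 // to Montgomery form
        BigInteger ym = y.multiply(r).mod(n);
        BigInteger zm = montgomeryMultiply(xm, ym, n, r, nPrime);
        BigInteger z = montgomeryMultiply(zm, BigInteger.ONE, n, r, nPrime); // back
        System.out.println(z.equals(x.multiply(y).mod(n)));   // prints true
    }
}
```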
If the target does not support the montgomeryMultiply intrinsic there is no slowdown when using C2 because it removes the if (a == b) test in if (a == b) { product = squareToLen(a, len, product); } else { product = multiplyToLen(a, len, b, len, product); } Andrew. From aph at redhat.com Fri Jun 12 11:29:31 2015 From: aph at redhat.com (Andrew Haley) Date: Fri, 12 Jun 2015 12:29:31 +0100 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <557ABD2E.7050608@redhat.com> References: <557ABD2E.7050608@redhat.com> Message-ID: <557AC29B.3030103@redhat.com> On 06/12/2015 12:06 PM, Andrew Haley wrote: Sorry, I forgot the column labels. The are: sign verify sign/s verify/s > rsa 512 bits 0.000134s 0.000009s 7444.2 111853.3 > rsa 1024 bits 0.000674s 0.000029s 1483.9 34456.9 > rsa 2048 bits 0.003945s 0.000100s 253.5 9994.7 > rsa 4096 bits 0.027015s 0.000373s 37.0 2681.6 > > After: > > rsa 512 bits 0.000059s 0.000004s 17022.3 224141.1 > rsa 1024 bits 0.000258s 0.000013s 3871.5 78851.0 > rsa 2048 bits 0.001506s 0.000042s 664.1 23844.3 > rsa 4096 bits 0.010214s 0.000153s 97.9 6516.0 Andrew. From roland.westrelin at oracle.com Fri Jun 12 13:01:41 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 12 Jun 2015 15:01:41 +0200 Subject: RFR(S): 8086046: escape analysis generates incorrect code as of B67 Message-ID: <4ACD7598-A0BE-4AE6-9302-1447551F74D8@oracle.com> http://cr.openjdk.java.net/~roland/8086046/webrev.00/ The membar_for_arraycopy() code returns true if the MemBarNode is a membar following an arraycopy that can modify memory t_oop. After an ArrayCopyNode is expanded, that code must work with a subgraph that merges multiple code path. The membar_for_arraycopy() didn?t correctly match on that graph pattern. To be safer, I also made some change so the membar_for_arraycopy() code not only recognizes calls to fast path stubs but also the CallStaticJavaNode for the ?slow_arraycopy?. Roland. From tobias.hartmann at oracle.com Fri Jun 12 14:34:48 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 12 Jun 2015 16:34:48 +0200 Subject: RFR: 8077279: assert(ic->is_clean()) failed: IC should be clean In-Reply-To: <557A9E8E.7050903@oracle.com> References: <557A9E8E.7050903@oracle.com> Message-ID: <557AEE08.2030605@oracle.com> Hi Stefan, looks good to me (not a reviewer). Thanks for fixing this! Best, Tobias On 12.06.2015 10:55, Stefan Johansson wrote: > Hi, > > Please review this change to fix: > https://bugs.openjdk.java.net/browse/JDK-8077279 > > Webrev: > http://cr.openjdk.java.net/~sjohanss/8077279/hotspot.00/ > > Summary: > While doing some extra G1 testing a couple of issues were found. I've been able to reproduce this specific assert but I suspect that both JDK-8077282 and JDK-8077283 are related as well. > > The root cause of the problem is a patch that was made a few months back that made it possible to have nmethods allocated in a code heap that differs from the compile level of the given nmethod, see [1]. In the NMethodIterator used by G1 class unloading it is assumed that the nmethods have the same compile level as the code heap the are allocated in, which now is wrong. This can, under certain circumstances, lead to not all nmethods being cleaned correctly and then hit this assertion. > > Testing: > * Built and tested in JPRT > * RBT run with using these testlists: jdk/test/:jdk,hotspot/test/:hotspot_all,vm.mlvm.testlist. 
> > Thanks, > Stefan > > [1] https://bugs.openjdk.java.net/browse/JDK-8072774 From vladimir.kozlov at oracle.com Fri Jun 12 16:53:07 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 12 Jun 2015 09:53:07 -0700 Subject: RFR(S): 8086046: escape analysis generates incorrect code as of B67 In-Reply-To: <4ACD7598-A0BE-4AE6-9302-1447551F74D8@oracle.com> References: <4ACD7598-A0BE-4AE6-9302-1447551F74D8@oracle.com> Message-ID: <557B0E73.6050701@oracle.com> This looks good. Thanks, Roland. Vladimir On 6/12/15 6:01 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8086046/webrev.00/ > > The membar_for_arraycopy() code returns true if the MemBarNode is a membar following an arraycopy that can modify memory t_oop. After an ArrayCopyNode is expanded, that code must work with a subgraph that merges multiple code path. The membar_for_arraycopy() didn?t correctly match on that graph pattern. To be safer, I also made some change so the membar_for_arraycopy() code not only recognizes calls to fast path stubs but also the CallStaticJavaNode for the ?slow_arraycopy?. > > Roland. > From rcmuir at gmail.com Fri Jun 12 23:29:52 2015 From: rcmuir at gmail.com (Robert Muir) Date: Fri, 12 Jun 2015 19:29:52 -0400 Subject: RFR(S): 8086046: escape analysis generates incorrect code as of B67 In-Reply-To: <4ACD7598-A0BE-4AE6-9302-1447551F74D8@oracle.com> References: <4ACD7598-A0BE-4AE6-9302-1447551F74D8@oracle.com> Message-ID: On Fri, Jun 12, 2015 at 9:01 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8086046/webrev.00/ > Thank you for fixing this bug. I tried the patch and lucene tests are passing again with it. From serkan at hazelcast.com Sun Jun 14 11:39:05 2015 From: serkan at hazelcast.com (=?UTF-8?B?U2Vya2FuIMOWemFs?=) Date: Sun, 14 Jun 2015 14:39:05 +0300 Subject: Array accesses using sun.misc.Unsafe cause data corruption or SIGSEGV Message-ID: Hi all, I had dived into the issue with JDK-HotSpot commits and the issue arised after this commit: http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/rev/a60a1309a03a Then I added some additional logs to *"vm/c1/c1_Canonicalizer.cpp"*: void Canonicalizer::do_UnsafeGetRaw(UnsafeGetRaw* x) { if (OptimizeUnsafes) do_UnsafeRawOp(x); tty->print_cr("Canonicalizer: do_UnsafeGetRaw id %d: base = id %d, index = id %d, log2_scale = %d", x->id(), x->base()->id(), x->index()->id(), x->log2_scale()); } void Canonicalizer::do_UnsafePutRaw(UnsafePutRaw* x) { if (OptimizeUnsafes) do_UnsafeRawOp(x); tty->print_cr("Canonicalizer: do_UnsafePutRaw id %d: base = id %d, index = id %d, log2_scale = %d", x->id(), x->base()->id(), x->index()->id(), x->log2_scale()); } So I run the test by calculating address as - *"int * long"* (int is index and long is 8l) - *"long * long"* (the first long is index and the second long is 8l) - *"int * int"* (the first int is index and the second int is 8) Here are the logs: *int * long:*Canonicalizer: do_UnsafeGetRaw id 18: base = id 16, index = id 17, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 20: base = id 16, index = id 19, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 22: base = id 16, index = id 21, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 24: base = id 16, index = id 23, log2_scale = 0 Canonicalizer: do_UnsafePutRaw id 33: base = id 13, index = id 27, log2_scale = 3 Canonicalizer: do_UnsafeGetRaw id 36: base = id 13, index = id 27, log2_scale = 3 *long * long:*Canonicalizer: do_UnsafeGetRaw id 18: base = id 16, index = id 17, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 20: base = id 
16, index = id 19, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 22: base = id 16, index = id 21, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 24: base = id 16, index = id 23, log2_scale = 0 Canonicalizer: do_UnsafePutRaw id 35: base = id 13, index = id 14, log2_scale = 3 Canonicalizer: do_UnsafeGetRaw id 37: base = id 13, index = id 14, log2_scale = 3 *int * int:*Canonicalizer: do_UnsafeGetRaw id 18: base = id 16, index = id 17, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 20: base = id 16, index = id 19, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 22: base = id 16, index = id 21, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 24: base = id 16, index = id 23, log2_scale = 0 Canonicalizer: do_UnsafePutRaw id 33: base = id 13, index = id 29, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 36: base = id 13, index = id 29, log2_scale = 0 Canonicalizer: do_UnsafePutRaw id 19: base = id 8, index = id 15, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 22: base = id 8, index = id 15, log2_scale = 0 As you can see, at the problematic runs (*"int * long"* and *"long * long"*) there are two scaling. One for *"Unsafe.put"* and the other one is for* "Unsafe.get"* and these instructions points to same *"base"* and *"index"* instructions. This means that address is scaled one more time because there should be only one scale. When I debugged the non-problematic run (*"int * int"*), I saw that *"instr->as_ArithmeticOp();"* is always returns *"null" *then *"match_index_and_scale"* method returns* "false"* always. So there is no scaling. static bool match_index_and_scale(Instruction* instr, Instruction** index, int* log2_scale) { ... ArithmeticOp* arith = instr->as_ArithmeticOp(); if (arith != NULL) { ... } return false; } Then I have added my fix attempt to prevent multiple scaling for Unsafe instructions points to same index instruction like this: void Canonicalizer::do_UnsafeRawOp(UnsafeRawOp* x) { Instruction* base = NULL; Instruction* index = NULL; int log2_scale; if (match(x, &base, &index, &log2_scale)) { x->set_base(base); x->set_index(index); // The fix attempt here // ///////////////////////////// if (index != NULL) { if (index->is_pinned()) { log2_scale = 0; } else { if (log2_scale != 0) { index->pin(); } } } // ///////////////////////////// x->set_log2_scale(log2_scale); if (PrintUnsafeOptimization) { tty->print_cr("Canonicalizer: UnsafeRawOp id %d: base = id %d, index = id %d, log2_scale = %d", x->id(), x->base()->id(), x->index()->id(), x->log2_scale()); } } } In this fix attempt, if there is a scaling for the Unsafe instruction, I pin index instruction of that instruction and at next calls, if the index instruction is pinned, I assummed that there is already scaling so no need to another scaling. 
After this fix, I rerun the problematic test (*"int * long"*) and it works with these logs: *int * long (after fix):*Canonicalizer: do_UnsafeGetRaw id 18: base = id 16, index = id 17, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 20: base = id 16, index = id 19, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 22: base = id 16, index = id 21, log2_scale = 0 Canonicalizer: do_UnsafeGetRaw id 24: base = id 16, index = id 23, log2_scale = 0 Canonicalizer: do_UnsafePutRaw id 35: base = id 13, index = id 14, log2_scale = 3 Canonicalizer: do_UnsafeGetRaw id 37: base = id 13, index = id 14, log2_scale = 0 Canonicalizer: do_UnsafePutRaw id 21: base = id 8, index = id 11, log2_scale = 3 Canonicalizer: do_UnsafeGetRaw id 23: base = id 8, index = id 11, log2_scale = 0 I am not sure my fix attempt is a really fix or maybe there are better fixes. Regards. -- Serkan ?ZAL > Btw, (thanks to one my colleagues), when address calculation in the loop is > converted to > long address = baseAddress + (i * 8) > test passes. Only difference is next long pointer is calculated using > integer 8 instead of long 8. > ``` > for (int i = 0; i < count; i++) { > long address = baseAddress + (i * 8); // <--- here, integer 8 instead > of long 8 > long expected = i; > unsafe.putLong(address, expected); > long actual = unsafe.getLong(address); > if (expected != actual) { > throw new AssertionError("Expected: " + expected + ", Actual: " + > actual); > } > } > ``` > On Tue, Jun 9, 2015 at 1:07 PM Mehmet Dogan > wrote: > >* Hi all, > *> > >* While I was testing my app using java 8, I encountered the previously > *>* reported sun.misc.Unsafe issue. > *> > >* https://bugs.openjdk.java.net/browse/JDK-8076445 > *> > >* http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-April/017685.html > *> > >* Issue status says it's resolved with resolution "Cannot Reproduce". But > *>* unfortunately it's still reproducible using "1.8.0_60-ea-b18" and > *>* "1.9.0-ea-b67". > *> > >* Test is very simple: > *> > >* ``` > *>* public static void main(String[] args) throws Exception { > *>* Unsafe unsafe = findUnsafe(); > *>* // 10000 pass > *>* // 100000 jvm crash > *>* // 1000000 fail > *>* int count = 100000; > *>* long size = count * 8L; > *>* long baseAddress = unsafe.allocateMemory(size); > *> > >* try { > *>* for (int i = 0; i < count; i++) { > *>* long address = baseAddress + (i * 8L); > *> > >* long expected = i; > *>* unsafe.putLong(address, expected); > *> > >* long actual = unsafe.getLong(address); > *> > >* if (expected != actual) { > *>* throw new AssertionError("Expected: " + expected + ", > *>* Actual: " + actual); > *>* } > *>* } > *>* } finally { > *>* unsafe.freeMemory(baseAddress); > *>* } > *>* } > *>* ``` > *>* It's not failing up to version 1.8.0.31, by starting 1.8.0.40 test is > *>* failing constantly. > *> > >* - With iteration count 10000, test is passing. > *>* - With iteration count 100000, jvm is crashing with SIGSEGV. > *>* - With iteration count 1000000, test is failing with AssertionError. > *> > >* When one of compilation (-Xint) or inlining (-XX:-Inline) or > *>* on-stack-replacement (-XX:-UseOnStackReplacement) is disabled, test is not > *>* failing at all. 
> *> > >* I tested on platforms: > *>* - Centos-7/openjdk-1.8.0.45 > *>* - OSX/oraclejdk-1.8.0.40 > *>* - OSX/oraclejdk-1.8.0.45 > *>* - OSX/oraclejdk-1.8.0_60-ea-b18 > *>* - OSX/oraclejdk-1.9.0-ea-b67 > *> > >* Previous issue comment ( > *>* https://bugs.openjdk.java.net/browse/JDK-8076445?focusedCommentId=13633043#comment-13633043 ) > *>* says "Cannot reproduce based on the latest version". I hope that latest > *>* version is not mentioning to '1.8.0_60-ea-b18' or '1.9.0-ea-b67'. Because > *>* both are failing. > *> > >* I'm looking forward to hearing from you. > *> > >* Thanks, > *>* -Mehmet Dogan- > *>* -- > *> > >* @mmdogan > *> -- Serkan ?ZAL Remotest Software Engineer GSM: +90 542 680 39 18 Twitter: @serkan_ozal -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.westrelin at oracle.com Mon Jun 15 07:43:55 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 15 Jun 2015 09:43:55 +0200 Subject: RFR(S): 8086046: escape analysis generates incorrect code as of B67 In-Reply-To: <557B0E73.6050701@oracle.com> References: <4ACD7598-A0BE-4AE6-9302-1447551F74D8@oracle.com> <557B0E73.6050701@oracle.com> Message-ID: <20AF0765-E746-4E72-8927-16A6E9FA14FC@oracle.com> Thanks for the review, Vladimir. Roland. > On Jun 12, 2015, at 6:53 PM, Vladimir Kozlov wrote: > > This looks good. Thanks, Roland. > > Vladimir > > On 6/12/15 6:01 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8086046/webrev.00/ >> >> The membar_for_arraycopy() code returns true if the MemBarNode is a membar following an arraycopy that can modify memory t_oop. After an ArrayCopyNode is expanded, that code must work with a subgraph that merges multiple code path. The membar_for_arraycopy() didn?t correctly match on that graph pattern. To be safer, I also made some change so the membar_for_arraycopy() code not only recognizes calls to fast path stubs but also the CallStaticJavaNode for the ?slow_arraycopy?. >> >> Roland. >> From roland.westrelin at oracle.com Mon Jun 15 07:45:23 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 15 Jun 2015 09:45:23 +0200 Subject: RFR(S): 8086046: escape analysis generates incorrect code as of B67 In-Reply-To: References: <4ACD7598-A0BE-4AE6-9302-1447551F74D8@oracle.com> Message-ID: > Thank you for fixing this bug. I tried the patch and lucene tests are > passing again with it. Thanks for confirming that that change fixes your problem. Roland. From rickard.backman at oracle.com Mon Jun 15 08:01:32 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Mon, 15 Jun 2015 10:01:32 +0200 Subject: RFR: 8077279: assert(ic->is_clean()) failed: IC should be clean In-Reply-To: <557A9E8E.7050903@oracle.com> References: <557A9E8E.7050903@oracle.com> Message-ID: <20150615080132.GD7260@rbackman> Looks good. On 06/12, Stefan Johansson wrote: > Hi, > > Please review this change to fix: > https://bugs.openjdk.java.net/browse/JDK-8077279 > > Webrev: > http://cr.openjdk.java.net/~sjohanss/8077279/hotspot.00/ > > Summary: > While doing some extra G1 testing a couple of issues were found. > I've been able to reproduce this specific assert but I suspect that > both JDK-8077282 and JDK-8077283 are related as well. > > The root cause of the problem is a patch that was made a few months > back that made it possible to have nmethods allocated in a code heap > that differs from the compile level of the given nmethod, see [1]. 
> In the NMethodIterator used by G1 class unloading it is assumed that > the nmethods have the same compile level as the code heap the are > allocated in, which now is wrong. This can, under certain > circumstances, lead to not all nmethods being cleaned correctly and > then hit this assertion. > > Testing: > * Built and tested in JPRT > * RBT run with using these testlists: > jdk/test/:jdk,hotspot/test/:hotspot_all,vm.mlvm.testlist. > > Thanks, > Stefan > > [1] https://bugs.openjdk.java.net/browse/JDK-8072774 /R -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From stefan.johansson at oracle.com Mon Jun 15 08:35:20 2015 From: stefan.johansson at oracle.com (Stefan Johansson) Date: Mon, 15 Jun 2015 10:35:20 +0200 Subject: RFR: 8077279: assert(ic->is_clean()) failed: IC should be clean In-Reply-To: <20150615080132.GD7260@rbackman> References: <557A9E8E.7050903@oracle.com> <20150615080132.GD7260@rbackman> Message-ID: <557E8E48.2050500@oracle.com> Thanks Tobias and Rickard for the reviews, I will push this through hs-rt to make sure the patch is present if/when the patch for G1 as default is pushed. Thanks, Stefan On 2015-06-15 10:01, Rickard B?ckman wrote: > Looks good. > > On 06/12, Stefan Johansson wrote: >> Hi, >> >> Please review this change to fix: >> https://bugs.openjdk.java.net/browse/JDK-8077279 >> >> Webrev: >> http://cr.openjdk.java.net/~sjohanss/8077279/hotspot.00/ >> >> Summary: >> While doing some extra G1 testing a couple of issues were found. >> I've been able to reproduce this specific assert but I suspect that >> both JDK-8077282 and JDK-8077283 are related as well. >> >> The root cause of the problem is a patch that was made a few months >> back that made it possible to have nmethods allocated in a code heap >> that differs from the compile level of the given nmethod, see [1]. >> In the NMethodIterator used by G1 class unloading it is assumed that >> the nmethods have the same compile level as the code heap the are >> allocated in, which now is wrong. This can, under certain >> circumstances, lead to not all nmethods being cleaned correctly and >> then hit this assertion. >> >> Testing: >> * Built and tested in JPRT >> * RBT run with using these testlists: >> jdk/test/:jdk,hotspot/test/:hotspot_all,vm.mlvm.testlist. >> >> Thanks, >> Stefan >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8072774 > /R From vladimir.x.ivanov at oracle.com Mon Jun 15 13:15:18 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 15 Jun 2015 16:15:18 +0300 Subject: [9] RFR (XS): 8087218: Constant fold loads from final instance fields in VM anonymous classes Message-ID: <557ECFE6.3010808@oracle.com> http://cr.openjdk.java.net/~vlivanov/8087218/webrev.00 https://bugs.openjdk.java.net/browse/JDK-8087218 Right now, VM doesn't constant fold loads final instance fields unless an experimental flag -XX:+TrustFinalNonStaticFields is turned on. The only exception is classes in java.lang.invoke/sun.invoke packages. It can be extended to VM anonymous classes because there is no hacking of finals going on with them: (1) they are part of private API (sun.misc.Unsafe); (2) they can't be serialized. Testing: manual (verified in generated code that constant folding happens) Thanks! 
Best regards, Vladimir Ivanov From rickard.backman at oracle.com Mon Jun 15 13:21:26 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Mon, 15 Jun 2015 15:21:26 +0200 Subject: [9] RFR (XS): 8087218: Constant fold loads from final instance fields in VM anonymous classes In-Reply-To: <557ECFE6.3010808@oracle.com> References: <557ECFE6.3010808@oracle.com> Message-ID: <20150615132126.GF7260@rbackman> Looks good. /R On 06/15, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8087218/webrev.00 > https://bugs.openjdk.java.net/browse/JDK-8087218 > > Right now, VM doesn't constant fold loads final instance fields > unless an experimental flag -XX:+TrustFinalNonStaticFields is turned > on. > The only exception is classes in java.lang.invoke/sun.invoke packages. > It can be extended to VM anonymous classes because there is no > hacking of finals going on with them: > (1) they are part of private API (sun.misc.Unsafe); > (2) they can't be serialized. > > Testing: manual (verified in generated code that constant folding happens) > > Thanks! > > Best regards, > Vladimir Ivanov -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From vladimir.x.ivanov at oracle.com Mon Jun 15 13:24:29 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 15 Jun 2015 16:24:29 +0300 Subject: RFR(S): 8078145 testlibrary_tests/RandomGeneratorTest.java failed with AssertionError : Unexpected random number sequence for mode: NO_SEED In-Reply-To: <5579B398.5030001@oracle.com> References: <5579B398.5030001@oracle.com> Message-ID: <557ED20D.2050300@oracle.com> Looks good. Best regards, Vladimir Ivanov On 6/11/15 7:13 PM, Sergei Kovalev wrote: > Hello Team, > > Please review fix for CR: https://bugs.openjdk.java.net/browse/JDK-8078145 > Webrev link: http://cr.openjdk.java.net/~skovalev/8078145/webrev.00/ > > Problem description: testlibrary_tests/RandomGeneratorTest.java test > uses stdout from child java process for analysis. In some cases > additional java options could force JVM to produce "system" output that > interferes with test output that make output analysis unpredictable. > > Solution: Forward all test output to a file. In this case JVM system > information that printed out to stdout won't interfere with test output. > From vladimir.kozlov at oracle.com Mon Jun 15 16:03:07 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 15 Jun 2015 09:03:07 -0700 Subject: [9] RFR (XS): 8087218: Constant fold loads from final instance fields in VM anonymous classes In-Reply-To: <557ECFE6.3010808@oracle.com> References: <557ECFE6.3010808@oracle.com> Message-ID: <557EF73B.1090807@oracle.com> Good. Thanks, Vladimir On 6/15/15 6:15 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8087218/webrev.00 > https://bugs.openjdk.java.net/browse/JDK-8087218 > > Right now, VM doesn't constant fold loads final instance fields unless an experimental flag > -XX:+TrustFinalNonStaticFields is turned on. > The only exception is classes in java.lang.invoke/sun.invoke packages. > It can be extended to VM anonymous classes because there is no hacking of finals going on with them: > (1) they are part of private API (sun.misc.Unsafe); > (2) they can't be serialized. > > Testing: manual (verified in generated code that constant folding happens) > > Thanks! 
> > Best regards, > Vladimir Ivanov From vladimir.x.ivanov at oracle.com Mon Jun 15 16:05:26 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 15 Jun 2015 19:05:26 +0300 Subject: [9] RFR (XS): 8087218: Constant fold loads from final instance fields in VM anonymous classes In-Reply-To: <557EF73B.1090807@oracle.com> References: <557ECFE6.3010808@oracle.com> <557EF73B.1090807@oracle.com> Message-ID: <557EF7C6.2050006@oracle.com> Rickard, Vladimir, thanks for review. Best regards, Vladimir Ivanov On 6/15/15 7:03 PM, Vladimir Kozlov wrote: > Good. > > Thanks, > Vladimir > > On 6/15/15 6:15 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8087218/webrev.00 >> https://bugs.openjdk.java.net/browse/JDK-8087218 >> >> Right now, VM doesn't constant fold loads final instance fields unless >> an experimental flag >> -XX:+TrustFinalNonStaticFields is turned on. >> The only exception is classes in java.lang.invoke/sun.invoke packages. >> It can be extended to VM anonymous classes because there is no hacking >> of finals going on with them: >> (1) they are part of private API (sun.misc.Unsafe); >> (2) they can't be serialized. >> >> Testing: manual (verified in generated code that constant folding >> happens) >> >> Thanks! >> >> Best regards, >> Vladimir Ivanov From anthony.scarpino at oracle.com Mon Jun 15 16:38:44 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Mon, 15 Jun 2015 09:38:44 -0700 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <557ABD2E.7050608@redhat.com> References: <557ABD2E.7050608@redhat.com> Message-ID: <557EFF94.5000006@oracle.com> On 06/12/2015 04:06 AM, Andrew Haley wrote: > http://cr.openjdk.java.net/~aph/8046943-hs-1/ > http://cr.openjdk.java.net/~aph/8046943-jdk-1/ Please don't use the JEP 246 in the comments when you push. There are a number of changesets related to 246 and I'd rather not have one associated with it. We can link the a separate bug id to the JEP. > > Before: > > rsa 512 bits 0.000134s 0.000009s 7444.2 111853.3 > rsa 1024 bits 0.000674s 0.000029s 1483.9 34456.9 > rsa 2048 bits 0.003945s 0.000100s 253.5 9994.7 > rsa 4096 bits 0.027015s 0.000373s 37.0 2681.6 > > After: > > rsa 512 bits 0.000059s 0.000004s 17022.3 224141.1 > rsa 1024 bits 0.000258s 0.000013s 3871.5 78851.0 > rsa 2048 bits 0.001506s 0.000042s 664.1 23844.3 > rsa 4096 bits 0.010214s 0.000153s 97.9 6516.0 > > There are some issues we need to discuss. > > The runtime code is in sharedRuntime_x86_64.cpp even though it's > mostly written in C. My thinking here is that porters will want to > tweak the code for their back ends, or maybe write it in assembler. > It's little-endian and would need some reworking for big-endian > machines. But maybe there should be a common version of the code in > share/ ? > > Should it be in optoRuntime instead? It could be called from C1 or > even the interpreter, but it's C2-only right now. > > I've done nothing about 32-bit targets, but I think they would > benefit. > > I had to make some small changes to the Java library. > > 1. Montgomery multiplication works on whole words. Given that this > is a 64-bit implementation I had to force the size of the arrays in > oddModPow to be a multiple of 64 bits, i.e. the length of the arrays > must be even. Given that RSA and Diffie-Hellman keys are usually a > multiple of 64 bits in length I don't think this will be a real > performance issue in practice. > > 2. I needed a 64-bit inverse rather than a 32-bit inverse. 
This is a > single computation, done once for each modular exponentiation, so it > makes an immeasurably small difference to the total runtime. > > 3. I fused squaring and multiplication into a single > montgomeryMultiply method. This has two advantages. Firstly, we only > need a single intrinsic, and secondly the decision whether to use > multiply or squaring can be made in the runtime library. If the > target does not support the montgomeryMultiply intrinsic there is no > slowdown when using C2 because it removes the if (a == b) test in > > if (a == b) { > product = squareToLen(a, len, product); > } else { > product = multiplyToLen(a, len, b, len, product); > } I don't agree with fusing them together. I think there should two separate intrinsics. For one, SPARC has a montsqr and montmul instructions. Additionally if someone wants to call montgomerySquare, they should be able to call it directly with it's needed number of arguments and not pass 'a' twice to satisfy an internal if(). Tony From anthony.scarpino at oracle.com Mon Jun 15 16:50:35 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Mon, 15 Jun 2015 09:50:35 -0700 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <557ABD2E.7050608@redhat.com> References: <557ABD2E.7050608@redhat.com> Message-ID: <557F025B.9010203@oracle.com> [Resent: I dropped the security list by mistake] On 06/12/2015 04:06 AM, Andrew Haley wrote: > http://cr.openjdk.java.net/~aph/8046943-hs-1/ > http://cr.openjdk.java.net/~aph/8046943-jdk-1/ Please don't use the JEP 246 in the comments when you push. There are a number of changesets related to 246 and I'd rather not have one associated with it. We can link the a separate bug id to the JEP. > > Before: > > rsa 512 bits 0.000134s 0.000009s 7444.2 111853.3 > rsa 1024 bits 0.000674s 0.000029s 1483.9 34456.9 > rsa 2048 bits 0.003945s 0.000100s 253.5 9994.7 > rsa 4096 bits 0.027015s 0.000373s 37.0 2681.6 > > After: > > rsa 512 bits 0.000059s 0.000004s 17022.3 224141.1 > rsa 1024 bits 0.000258s 0.000013s 3871.5 78851.0 > rsa 2048 bits 0.001506s 0.000042s 664.1 23844.3 > rsa 4096 bits 0.010214s 0.000153s 97.9 6516.0 > > There are some issues we need to discuss. > > The runtime code is in sharedRuntime_x86_64.cpp even though it's > mostly written in C. My thinking here is that porters will want to > tweak the code for their back ends, or maybe write it in assembler. > It's little-endian and would need some reworking for big-endian > machines. But maybe there should be a common version of the code in > share/ ? > > Should it be in optoRuntime instead? It could be called from C1 or > even the interpreter, but it's C2-only right now. > > I've done nothing about 32-bit targets, but I think they would > benefit. > > I had to make some small changes to the Java library. > > 1. Montgomery multiplication works on whole words. Given that this > is a 64-bit implementation I had to force the size of the arrays in > oddModPow to be a multiple of 64 bits, i.e. the length of the arrays > must be even. Given that RSA and Diffie-Hellman keys are usually a > multiple of 64 bits in length I don't think this will be a real > performance issue in practice. > > 2. I needed a 64-bit inverse rather than a 32-bit inverse. This is a > single computation, done once for each modular exponentiation, so it > makes an immeasurably small difference to the total runtime. > > 3. I fused squaring and multiplication into a single > montgomeryMultiply method. This has two advantages. 
Firstly, we only > need a single intrinsic, and secondly the decision whether to use > multiply or squaring can be made in the runtime library. If the > target does not support the montgomeryMultiply intrinsic there is no > slowdown when using C2 because it removes the if (a == b) test in > > if (a == b) { > product = squareToLen(a, len, product); > } else { > product = multiplyToLen(a, len, b, len, product); > } I don't agree with fusing them together. I think there should two separate intrinsics. For one, SPARC has a montsqr and montmul instructions. Additionally if someone wants to call montgomerySquare, they should be able to call it directly with it's needed number of arguments and not pass 'a' twice to satisfy an internal if(). Tony From aph at redhat.com Mon Jun 15 16:57:26 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 15 Jun 2015 17:57:26 +0100 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <557EFF94.5000006@oracle.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> Message-ID: <557F03F6.4080204@redhat.com> On 06/15/2015 05:38 PM, Anthony Scarpino wrote: > On 06/12/2015 04:06 AM, Andrew Haley wrote: >> http://cr.openjdk.java.net/~aph/8046943-hs-1/ >> http://cr.openjdk.java.net/~aph/8046943-jdk-1/ > > Please don't use the JEP 246 in the comments when you push. There are a > number of changesets related to 246 and I'd rather not have one > associated with it. We can link the a separate bug id to the JEP. Right. >> 3. I fused squaring and multiplication into a single >> montgomeryMultiply method. This has two advantages. Firstly, we only >> need a single intrinsic, and secondly the decision whether to use >> multiply or squaring can be made in the runtime library. If the >> target does not support the montgomeryMultiply intrinsic there is no >> slowdown when using C2 because it removes the if (a == b) test in >> >> if (a == b) { >> product = squareToLen(a, len, product); >> } else { >> product = multiplyToLen(a, len, b, len, product); >> } > > I don't agree with fusing them together. I think there should two > separate intrinsics. For one, SPARC has a montsqr and montmul > instructions. Additionally if someone wants to call montgomerySquare, > they should be able to call it directly with it's needed number of > arguments and not pass 'a' twice to satisfy an internal if(). OK, fair enough. I'll think a little more about the best way to do this. Out of curiosity I just had a look at the SPARC instruction specifications, and happily (it certainly surprised me!) they are almost exactly the same as what I've done, so using those instructions should be a trivial change after this patch. The SPARC instruction seems to be limited to 32 words (2048 bits) but I guess you'd just use the software for larger sizes. Andrew. From aph at redhat.com Mon Jun 15 16:58:21 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 15 Jun 2015 17:58:21 +0100 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <557EFF94.5000006@oracle.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> Message-ID: <557F042D.4060707@redhat.com> On 06/15/2015 05:38 PM, Anthony Scarpino wrote: > On 06/12/2015 04:06 AM, Andrew Haley wrote: >> http://cr.openjdk.java.net/~aph/8046943-hs-1/ >> http://cr.openjdk.java.net/~aph/8046943-jdk-1/ > > Please don't use the JEP 246 in the comments when you push. There are a > number of changesets related to 246 and I'd rather not have one > associated with it. We can link the a separate bug id to the JEP. Right. >> 3. 
I fused squaring and multiplication into a single >> montgomeryMultiply method. This has two advantages. Firstly, we only >> need a single intrinsic, and secondly the decision whether to use >> multiply or squaring can be made in the runtime library. If the >> target does not support the montgomeryMultiply intrinsic there is no >> slowdown when using C2 because it removes the if (a == b) test in >> >> if (a == b) { >> product = squareToLen(a, len, product); >> } else { >> product = multiplyToLen(a, len, b, len, product); >> } > > I don't agree with fusing them together. I think there should two > separate intrinsics. For one, SPARC has a montsqr and montmul > instructions. Additionally if someone wants to call montgomerySquare, > they should be able to call it directly with it's needed number of > arguments and not pass 'a' twice to satisfy an internal if(). OK, fair enough. I'll think a little more about the best way to do this. Out of curiosity I just had a look at the SPARC instruction specifications, and happily (it certainly surprised me!) they are almost exactly the same as what I've done, so using those instructions should be a trivial change after this patch. The SPARC instruction seems to be limited to 32 words (2048 bits) but I guess you'd just use the software for larger sizes. Andrew. From anthony.scarpino at oracle.com Mon Jun 15 17:06:47 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Mon, 15 Jun 2015 10:06:47 -0700 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <557F042D.4060707@redhat.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> Message-ID: <557F0627.40400@oracle.com> On 06/15/2015 09:58 AM, Andrew Haley wrote: > On 06/15/2015 05:38 PM, Anthony Scarpino wrote: >> On 06/12/2015 04:06 AM, Andrew Haley wrote: >>> http://cr.openjdk.java.net/~aph/8046943-hs-1/ >>> http://cr.openjdk.java.net/~aph/8046943-jdk-1/ >> >> Please don't use the JEP 246 in the comments when you push. There are a >> number of changesets related to 246 and I'd rather not have one >> associated with it. We can link the a separate bug id to the JEP. > > Right. > >>> 3. I fused squaring and multiplication into a single >>> montgomeryMultiply method. This has two advantages. Firstly, we only >>> need a single intrinsic, and secondly the decision whether to use >>> multiply or squaring can be made in the runtime library. If the >>> target does not support the montgomeryMultiply intrinsic there is no >>> slowdown when using C2 because it removes the if (a == b) test in >>> >>> if (a == b) { >>> product = squareToLen(a, len, product); >>> } else { >>> product = multiplyToLen(a, len, b, len, product); >>> } >> >> I don't agree with fusing them together. I think there should two >> separate intrinsics. For one, SPARC has a montsqr and montmul >> instructions. Additionally if someone wants to call montgomerySquare, >> they should be able to call it directly with it's needed number of >> arguments and not pass 'a' twice to satisfy an internal if(). > > OK, fair enough. I'll think a little more about the best way to do > this. > > Out of curiosity I just had a look at the SPARC instruction > specifications, and happily (it certainly surprised me!) they are > almost exactly the same as what I've done, so using those instructions > should be a trivial change after this patch. The SPARC instruction > seems to be limited to 32 words (2048 bits) but I guess you'd just use > the software for larger sizes. > > Andrew. 
> Correct, I was prototyping a SPARC intrinsic in May and we independently had similar methods in BigInteger. At least I believe you had a montgomerySqr and montgomeryMul method in BigInteger back in April/May. The instruction gets tedious getting the data to the instruction and the limitation hurts. Tony From ahmed.khawaja at oracle.com Mon Jun 15 17:56:10 2015 From: ahmed.khawaja at oracle.com (Ahmed Khawaja) Date: Mon, 15 Jun 2015 10:56:10 -0700 Subject: Instruction Scheduling in the C2 Compiler Message-ID: <557F11BA.5000106@oracle.com> Greetings, I am inquiring as to what would be a good starting point to get familiar with the current instruction scheduling capabilities of the C2 compiler in hotspot. I am relatively new to working with Hotspot and have primarily been working with the macro-assembler and will now be looking at the instruction scheduling abilities. I am going through the source code now but I am not exactly sure where instruction scheduling is done. The OpenJDK wiki also seems to be somewhat empty on this and the compiler passes in general. Thanks in advance, Ahmed From rednaxelafx at gmail.com Mon Jun 15 19:26:34 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Mon, 15 Jun 2015 12:26:34 -0700 Subject: Instruction Scheduling in the C2 Compiler In-Reply-To: <557F11BA.5000106@oracle.com> References: <557F11BA.5000106@oracle.com> Message-ID: Hi Ahmed, Greetings. If you're looking for scheduling logic in C2, that would mostly be in opto/gcm.cpp and opto/lcm.cpp, for Global Code Motion and Local Code Motion, respectively. The place that initiates code scheduling is from Compile::Code_Gen() in opto/compile.cpp: 2320 // Build a proper-looking CFG 2321 PhaseCFG cfg(node_arena(), root(), matcher); 2322 _cfg = &cfg; 2323 { 2324 TracePhase tp("scheduler", &timers[_t_scheduler]); 2325 bool success = cfg.do_global_code_motion(); // <- Scheduling initiated from here 2326 if (!success) { 2327 return; 2328 } 2329 2330 print_method(PHASE_GLOBAL_CODE_MOTION, 2); 2331 NOT_PRODUCT( verify_graph_edges(); ) 2332 debug_only( cfg.verify(); ) 2333 } Follow that path and you'll be able to find all the code you need. Best regards, Kris On Mon, Jun 15, 2015 at 10:56 AM, Ahmed Khawaja wrote: > Greetings, > > I am inquiring as to what would be a good starting point to get > familiar with the current instruction scheduling capabilities of the C2 > compiler in hotspot. I am relatively new to working with Hotspot and have > primarily been working with the macro-assembler and will now be looking at > the instruction scheduling abilities. I am going through the source code > now but I am not exactly sure where instruction scheduling is done. The > OpenJDK wiki also seems to be somewhat empty on this and the compiler > passes in general. > > Thanks in advance, > Ahmed > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.r.rose at oracle.com Mon Jun 15 20:26:43 2015 From: john.r.rose at oracle.com (John Rose) Date: Mon, 15 Jun 2015 13:26:43 -0700 Subject: [9] RFR (XS): 8087218: Constant fold loads from final instance fields in VM anonymous classes In-Reply-To: <557ECFE6.3010808@oracle.com> References: <557ECFE6.3010808@oracle.com> Message-ID: <5F77146D-897A-4148-9E80-AF334238744B@oracle.com> Reviewed. I see the bug is linked properly to the more general bug. Thanks! ? 
John On Jun 15, 2015, at 6:15 AM, Vladimir Ivanov wrote: > > http://cr.openjdk.java.net/~vlivanov/8087218/webrev.00 > https://bugs.openjdk.java.net/browse/JDK-8087218 > > Right now, VM doesn't constant fold loads final instance fields unless an experimental flag -XX:+TrustFinalNonStaticFields is turned on. > The only exception is classes in java.lang.invoke/sun.invoke packages. > It can be extended to VM anonymous classes because there is no hacking of finals going on with them: > (1) they are part of private API (sun.misc.Unsafe); > (2) they can't be serialized. > > Testing: manual (verified in generated code that constant folding happens) > > Thanks! > > Best regards, > Vladimir Ivanov From michael.c.berg at intel.com Tue Jun 16 06:37:13 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Tue, 16 Jun 2015 06:37:13 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: <546AB289-DAD9-4A26-9F3B-935B53E2F6DB@oracle.com> References: <37C8FBB9-1D75-4CE6-B6CE-A83349CEC09C@oracle.com> <546AB289-DAD9-4A26-9F3B-935B53E2F6DB@oracle.com> Message-ID: Roland, thanks for the review. You will find the latest changes in the following webrev: http://cr.openjdk.java.net/~mcberg/8080325/webrev.02/ All comments below are addressed within the changes. Regards, Michael -----Original Message----- From: Roland Westrelin [mailto:roland.westrelin at oracle.com] Sent: Tuesday, June 09, 2015 6:02 AM To: Berg, Michael C Cc: HotSpot Compiler Subject: Re: RFR 8080325 SuperWord loop unrolling analysis > http://cr.openjdk.java.net/~mcberg/8080325/webrev.01/ Since you?re touching that code, can you also fix the coding style in: In loopTransform.cpp 644 bool IdealLoopTree::policy_unroll( PhaseIdealLoop *phase ) { In loopnode.hpp 462 bool policy_unroll( PhaseIdealLoop *phase ); (no spaces after/before the parenthesis) In loopnode.hpp: 164 int slp_maximum_unroll_factor; should be: _slp_maximum_unroll_factor 248 int slp_max_unroll() { return slp_maximum_unroll_factor; } could be: int slp_max_unroll() const { 464 // Return TRUE or FALSE if the loop analyzes to map to a maximal 465 // superword unrolling for vectorization. 466 void policy_unroll_slp_analysis(CountedLoopNode *cl, PhaseIdealLoop *phase, int future_unroll_ct); The comment says the function returns something but it doesn?t return anything. I don?t see slp_maximum_unroll_factor being initialized to a default value. Isn?t there a risk it?s not set when we read it? Otherwise, I think it?s good. Roland. From roland.westrelin at oracle.com Tue Jun 16 07:25:35 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 16 Jun 2015 09:25:35 +0200 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: <37C8FBB9-1D75-4CE6-B6CE-A83349CEC09C@oracle.com> <546AB289-DAD9-4A26-9F3B-935B53E2F6DB@oracle.com> Message-ID: <4D2CE51E-AD90-43CF-A223-EDBF088E54D2@oracle.com> > http://cr.openjdk.java.net/~mcberg/8080325/webrev.02/ That looks good to me. Roland. From vladimir.x.ivanov at oracle.com Tue Jun 16 10:26:10 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 16 Jun 2015 13:26:10 +0300 Subject: [9] RFR (XS): 8087218: Constant fold loads from final instance fields in VM anonymous classes In-Reply-To: <5F77146D-897A-4148-9E80-AF334238744B@oracle.com> References: <557ECFE6.3010808@oracle.com> <5F77146D-897A-4148-9E80-AF334238744B@oracle.com> Message-ID: <557FF9C2.4020602@oracle.com> Thanks for review, John. Best regards, Vladimir Ivanov On 6/15/15 11:26 PM, John Rose wrote: > Reviewed. 
I see the bug is linked properly to the more general bug. Thanks! ? John > > On Jun 15, 2015, at 6:15 AM, Vladimir Ivanov wrote: >> >> http://cr.openjdk.java.net/~vlivanov/8087218/webrev.00 >> https://bugs.openjdk.java.net/browse/JDK-8087218 >> >> Right now, VM doesn't constant fold loads final instance fields unless an experimental flag -XX:+TrustFinalNonStaticFields is turned on. >> The only exception is classes in java.lang.invoke/sun.invoke packages. >> It can be extended to VM anonymous classes because there is no hacking of finals going on with them: >> (1) they are part of private API (sun.misc.Unsafe); >> (2) they can't be serialized. >> >> Testing: manual (verified in generated code that constant folding happens) >> >> Thanks! >> >> Best regards, >> Vladimir Ivanov > From roland.westrelin at oracle.com Tue Jun 16 13:20:46 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 16 Jun 2015 15:20:46 +0200 Subject: RFR(S): 8086016: closed/java/text/Format/NumberFormat/BigDecimalCompatibilityTest.java is crashing In-Reply-To: <55770C6A.6010905@oracle.com> References: <55770C6A.6010905@oracle.com> Message-ID: Thanks for the review, Vladimir. Roland. > On Jun 9, 2015, at 5:55 PM, Vladimir Kozlov wrote: > > Looks good. > > Thanks, > Vladimir > > On 6/9/15 6:27 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8086016/webrev.00/ >> >> ArrayCopyNode is being transformed in a subgraph that is dying but not yet entirely destroyed. >> >> Roland. >> From aph at redhat.com Tue Jun 16 14:33:40 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 16 Jun 2015 15:33:40 +0100 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <557F042D.4060707@redhat.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> Message-ID: <558033C4.8040104@redhat.com> On 06/15/2015 05:58 PM, Andrew Haley wrote: >>> 3. I fused squaring and multiplication into a single >>> >> montgomeryMultiply method. ... >> > >> > I don't agree with fusing them together. I think there should two >> > separate intrinsics. For one, SPARC has a montsqr and montmul >> > instructions. Additionally if someone wants to call montgomerySquare, >> > they should be able to call it directly with it's needed number of >> > arguments and not pass 'a' twice to satisfy an internal if(). > OK, fair enough. I'll think a little more about the best way to do > this. Done thusly. The only thing I had any doubt about was whether to use a single flag for squaring and multiplication. This patch uses separate flags. http://cr.openjdk.java.net/~aph/8046943-hs-2/ http://cr.openjdk.java.net/~aph/8046943-jdk-2/ Andrew. From vladimir.kozlov at oracle.com Tue Jun 16 19:05:49 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 16 Jun 2015 12:05:49 -0700 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> Message-ID: <5580738D.9070900@oracle.com> On 6/11/15 3:39 AM, Roland Westrelin wrote: > Thanks for looking at this, Vladimir. > >> We don't call previous node as 'user' - we call them definitions or inputs. Your comment change in memnode.cpp sound strange because of that. The original statement was correct: >> // If anybody other than 'this' uses 'mem', we cannot fold 'mem' away. 
>> >> in your case it will be: >> >> // If anybody other than 'this' uses 'st', we cannot fold 'st' away. > > Right. > >> Also code does not seems right. The code should go through input memory chain and remove all preceding similar stores - 'this' node remains and we change its memory input which makes previous stores dead. So you can't do 'prev = st?. > > That?s what I think the code does. That is if you have: > > st1->st2->st3->st4 I assume st4 is first store and st1 is last. Right? > > and st3 is redundant with st1, the chain should become: > > st1->st2->st4 I am not sure it is correct optimization. On some machines result of st3 could be visible before result of st2. And you change it. I am suggesting not do that. Do you need that for stores move from loop? > > so we need to change the memory input of st2 when we find st3 can be removed. In the code, at that point, this=st1, st = st3 and prev=st2. In this case the code should be: if (st->in(MemNode::Address)->eqv_uncast(address) && ... } else { prev = st; } to update 'prev' with 'st' only if 'st' is not removed. Also, I think, st->in(MemNode::Memory) could be put in local var since it is used several times in this code. > >> You need to set improved = true since 'this' will not change. We also use 'make_progress' variable's name in such cases. > > In the example above, if we remove st2, we modify this, right? We need to call Ideal() again if store inputs are changed. So if st2 is removed then inputs of st1 are changed so we need to rerun Ideal(). This allow to avoid having your new loop in the Ideal(). > >> I try_move_store_before_loop() add check (n_loop->_head == n_ctrl) to make sure it is not other control node inside loop. Then you can check Phi's control as (mem->in(0) == n_ctrl). >> >> I don't understand verification code "Check that store's control does post dominate loop entry". In first comment you said "Store has to be first in the loop body" - I understand this as Store's control should be loop's head. You can't allow store to be on one of branches (it will not post dominate) but your check code allow that. Also the check done only in debug VM. >> >> If you really want to accept cases when a store is placed after diamond control then you need checks in product to make sure that it is really post dominate head. For that I think you need to go from loopend to loophead through idom links and see if you meet n_ctrl. > > > My check code starts from the loop, follows every path and make sure there?s no path that leaves the loop without going through n. With example 1 below: > > for (int i = 0; i < 10; i++) { > if (some_condition) { > array[idx] = 999; > } else { > } > } > > We?ll find a path from the head that doesn?t go through the store and that exits the loop. What the comment doesn?t say is that with example 2 below: > > for (int i = 0; i < 10; i++) { > if (some_condition) { > uncommon_trap(); > } > array[idx] = 999; > } > > my verification code would find the early exit as well. > > It?s verification code only because if we have example 1 above, then we have a memory Phi to merge both branches of the if. So the pattern that we look for in PhaseIdealLoop::try_move_store_before_loop() won?t match: the loop?s memory Phi backedge won?t be the store. If we have example 2 above, then the loop?s memory Phi doesn?t have a single memory use. So I don?t think we need to check that the store post dominate the loop head in product. That?s my reasoning anyway and the verification code is there to verify it. 
I missed 'mem->in(LoopNode::LoopBackControl) == n' condition. Which reduce cases only to one store to this address in the loop - good. How you check in product VM that there are no other exists from a loop (your example 2)? Is it guarded by mem->outcnt() == 1 check? > > This said, my verification code doesn?t work for infinite loops. It would need to check whether we reach the tail of the loop maybe? > >> I don't see assert(n->in(0) in try_move_store_before_loop() but you have it in try_move_store_after_loop(). > > Ok. > >> >> Why you need next assert?: >> + assert(n_loop != address_loop, "address is loop varying?); > > I wonder about that too ;-) > I?ll remove it. > >> Should you check phi == NULL instead of assert to make sure you have only one Phi node? > > Can there be more than one memory Phi for a particular slice that has in(0) == n_loop->_head? > I would have expected that to be impossible. BOTTOM (all slices) Phi? Thanks, Vladimir > >> >> conflict between assert and following check: >> + assert(new_loop->_child != NULL, ""); >> + if (new_loop->_child == NULL) new_loop->_body.push(st); > > Thanks for spotting that. I?ll remove the assert. > > Roland. > >> >> Thanks, >> Vladimir >> >> On 6/10/15 8:03 AM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~roland/8080289/webrev.00/ >>> >>> Sink stores out of loops when possible: >>> >>> for (int i = 0; i < 1000; i++) { >>> // Some stuff that doesn?t prevent the optimization >>> array[idx] = i; >>> } >>> >>> becomes: >>> >>> for (int i = 0; i < 1000; i++) { >>> // Some stuff >>> } >>> array[idx] = 999; >>> >>> Or move stores before the loop when possible: >>> >>> for (int i = 0; i < 1000; i++) { >>> array[idx] = 999; >>> // Some stuff that doesn?t prevent the optimization >>> } >>> >>> becomes: >>> >>> array[idx] = 999; >>> for (int i = 0; i < 1000; i++) { >>> // Some stuff >>> } >>> >>> The change in memnode.cpp is useful to clean up code generated from test_after_5 because the stores are moved out of the loop only after the loop is split and unrolled. That code removes duplicate stores. >>> >>> Roland. >>> > From vladimir.kozlov at oracle.com Tue Jun 16 22:07:23 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 16 Jun 2015 15:07:23 -0700 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: <37C8FBB9-1D75-4CE6-B6CE-A83349CEC09C@oracle.com> <546AB289-DAD9-4A26-9F3B-935B53E2F6DB@oracle.com> Message-ID: <55809E1B.70400@oracle.com> Good. Michael, please, update changes to latest hs-comp/hotspot sources: applying hotspot_8080325.patch patching file src/share/vm/opto/loopTransform.cpp Hunk #6 FAILED at 1632 1 out of 7 hunks FAILED -- saving rejects to file src/share/vm/opto/loopTransform.cpp.rej Thanks, Vladimir On 6/15/15 11:37 PM, Berg, Michael C wrote: > Roland, thanks for the review. > You will find the latest changes in the following webrev: > > http://cr.openjdk.java.net/~mcberg/8080325/webrev.02/ > > All comments below are addressed within the changes. 
> > Regards, > Michael > > -----Original Message----- > From: Roland Westrelin [mailto:roland.westrelin at oracle.com] > Sent: Tuesday, June 09, 2015 6:02 AM > To: Berg, Michael C > Cc: HotSpot Compiler > Subject: Re: RFR 8080325 SuperWord loop unrolling analysis > >> http://cr.openjdk.java.net/~mcberg/8080325/webrev.01/ > > Since you?re touching that code, can you also fix the coding style in: > > In loopTransform.cpp > > 644 bool IdealLoopTree::policy_unroll( PhaseIdealLoop *phase ) { > > In loopnode.hpp > > 462 bool policy_unroll( PhaseIdealLoop *phase ); > > (no spaces after/before the parenthesis) > > In loopnode.hpp: > > 164 int slp_maximum_unroll_factor; > > should be: _slp_maximum_unroll_factor > > 248 int slp_max_unroll() { return slp_maximum_unroll_factor; } > > could be: int slp_max_unroll() const { > > 464 // Return TRUE or FALSE if the loop analyzes to map to a maximal > 465 // superword unrolling for vectorization. > 466 void policy_unroll_slp_analysis(CountedLoopNode *cl, PhaseIdealLoop *phase, int future_unroll_ct); > > The comment says the function returns something but it doesn?t return anything. > > I don?t see slp_maximum_unroll_factor being initialized to a default value. Isn?t there a risk it?s not set when we read it? > > Otherwise, I think it?s good. > > Roland. > From vitalyd at gmail.com Wed Jun 17 04:46:22 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 17 Jun 2015 00:46:22 -0400 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <5580738D.9070900@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> Message-ID: Why would removing the redundant st3 be an incorrect optimization? sent from my phone On Jun 16, 2015 3:05 PM, "Vladimir Kozlov" wrote: > On 6/11/15 3:39 AM, Roland Westrelin wrote: > >> Thanks for looking at this, Vladimir. >> >> We don't call previous node as 'user' - we call them definitions or >>> inputs. Your comment change in memnode.cpp sound strange because of that. >>> The original statement was correct: >>> // If anybody other than 'this' uses 'mem', we cannot fold 'mem' away. >>> >>> in your case it will be: >>> >>> // If anybody other than 'this' uses 'st', we cannot fold 'st' away. >>> >> >> Right. >> >> Also code does not seems right. The code should go through input memory >>> chain and remove all preceding similar stores - 'this' node remains and we >>> change its memory input which makes previous stores dead. So you can't do >>> 'prev = st?. >>> >> >> That?s what I think the code does. That is if you have: >> >> st1->st2->st3->st4 >> > > I assume st4 is first store and st1 is last. Right? > > >> and st3 is redundant with st1, the chain should become: >> >> st1->st2->st4 >> > > I am not sure it is correct optimization. On some machines result of st3 > could be visible before result of st2. And you change it. > I am suggesting not do that. Do you need that for stores move from loop? > > >> so we need to change the memory input of st2 when we find st3 can be >> removed. In the code, at that point, this=st1, st = st3 and prev=st2. >> > > In this case the code should be: > > if (st->in(MemNode::Address)->eqv_uncast(address) && > ... > } else { > prev = st; > } > > to update 'prev' with 'st' only if 'st' is not removed. > Also, I think, st->in(MemNode::Memory) could be put in local var since it > is used several times in this code. > > >> You need to set improved = true since 'this' will not change. 
We also >>> use 'make_progress' variable's name in such cases. >>> >> >> In the example above, if we remove st2, we modify this, right? >> > > We need to call Ideal() again if store inputs are changed. So if st2 is > removed then inputs of st1 are changed so we need to rerun Ideal(). This > allow to avoid having your new loop in the Ideal(). > > >> I try_move_store_before_loop() add check (n_loop->_head == n_ctrl) to >>> make sure it is not other control node inside loop. Then you can check >>> Phi's control as (mem->in(0) == n_ctrl). >>> >>> I don't understand verification code "Check that store's control does >>> post dominate loop entry". In first comment you said "Store has to be first >>> in the loop body" - I understand this as Store's control should be loop's >>> head. You can't allow store to be on one of branches (it will not post >>> dominate) but your check code allow that. Also the check done only in debug >>> VM. >>> >>> If you really want to accept cases when a store is placed after diamond >>> control then you need checks in product to make sure that it is really post >>> dominate head. For that I think you need to go from loopend to loophead >>> through idom links and see if you meet n_ctrl. >>> >> >> >> My check code starts from the loop, follows every path and make sure >> there?s no path that leaves the loop without going through n. With example >> 1 below: >> >> for (int i = 0; i < 10; i++) { >> if (some_condition) { >> array[idx] = 999; >> } else { >> } >> } >> >> We?ll find a path from the head that doesn?t go through the store and >> that exits the loop. What the comment doesn?t say is that with example 2 >> below: >> >> for (int i = 0; i < 10; i++) { >> if (some_condition) { >> uncommon_trap(); >> } >> array[idx] = 999; >> } >> >> my verification code would find the early exit as well. >> >> It?s verification code only because if we have example 1 above, then we >> have a memory Phi to merge both branches of the if. So the pattern that we >> look for in PhaseIdealLoop::try_move_store_before_loop() won?t match: the >> loop?s memory Phi backedge won?t be the store. If we have example 2 above, >> then the loop?s memory Phi doesn?t have a single memory use. So I don?t >> think we need to check that the store post dominate the loop head in >> product. That?s my reasoning anyway and the verification code is there to >> verify it. >> > > I missed 'mem->in(LoopNode::LoopBackControl) == n' condition. Which reduce > cases only to one store to this address in the loop - good. > > How you check in product VM that there are no other exists from a loop > (your example 2)? Is it guarded by mem->outcnt() == 1 check? > > >> This said, my verification code doesn?t work for infinite loops. It would >> need to check whether we reach the tail of the loop maybe? >> >> I don't see assert(n->in(0) in try_move_store_before_loop() but you have >>> it in try_move_store_after_loop(). >>> >> >> Ok. >> >> >>> Why you need next assert?: >>> + assert(n_loop != address_loop, "address is loop varying?); >>> >> >> I wonder about that too ;-) >> I?ll remove it. >> >> Should you check phi == NULL instead of assert to make sure you have >>> only one Phi node? >>> >> >> Can there be more than one memory Phi for a particular slice that has >> in(0) == n_loop->_head? >> I would have expected that to be impossible. >> > > BOTTOM (all slices) Phi? 
> > Thanks, > Vladimir > > >> >>> conflict between assert and following check: >>> + assert(new_loop->_child != NULL, ""); >>> + if (new_loop->_child == NULL) new_loop->_body.push(st); >>> >> >> Thanks for spotting that. I?ll remove the assert. >> >> Roland. >> >> >>> Thanks, >>> Vladimir >>> >>> On 6/10/15 8:03 AM, Roland Westrelin wrote: >>> >>>> http://cr.openjdk.java.net/~roland/8080289/webrev.00/ >>>> >>>> Sink stores out of loops when possible: >>>> >>>> for (int i = 0; i < 1000; i++) { >>>> // Some stuff that doesn?t prevent the optimization >>>> array[idx] = i; >>>> } >>>> >>>> becomes: >>>> >>>> for (int i = 0; i < 1000; i++) { >>>> // Some stuff >>>> } >>>> array[idx] = 999; >>>> >>>> Or move stores before the loop when possible: >>>> >>>> for (int i = 0; i < 1000; i++) { >>>> array[idx] = 999; >>>> // Some stuff that doesn?t prevent the optimization >>>> } >>>> >>>> becomes: >>>> >>>> array[idx] = 999; >>>> for (int i = 0; i < 1000; i++) { >>>> // Some stuff >>>> } >>>> >>>> The change in memnode.cpp is useful to clean up code generated from >>>> test_after_5 because the stores are moved out of the loop only after the >>>> loop is split and unrolled. That code removes duplicate stores. >>>> >>>> Roland. >>>> >>>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.westrelin at oracle.com Wed Jun 17 08:35:47 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 17 Jun 2015 10:35:47 +0200 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <5580738D.9070900@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> Message-ID: >> That?s what I think the code does. That is if you have: >> >> st1->st2->st3->st4 > > I assume st4 is first store and st1 is last. Right? Program order is: st4 st3 st2 st1 >> and st3 is redundant with st1, the chain should become: >> >> st1->st2->st4 > > I am not sure it is correct optimization. On some machines result of st3 could be visible before result of st2. And you change it. > I am suggesting not do that. Do you need that for stores move from loop? It?s not required. It cleans up the graph in some cases like this: static void test_after_5(int idx) { for (int i = 0; i < 1000; i++) { array[idx] = i; array[idx+1] = i; array[idx+2] = i; array[idx+3] = i; array[idx+4] = i; array[idx+5] = i; } } all stores are sunk out of the loop but that happens after iteration splitting and so there are multiple redundant copies of each store that are not collapsed. This said, we currently reorder the stores even if it?s less aggressive than what I?m proposing. With program: st4 st3 st2 st1 If st1, st3 and st4 are on one slice and st2 is on another and if st1 and st3 store to the same address we optimize st3 out: st4 st2 st1 so st3=st1 may only be visible after st2. Also, the way I read the first table in this: http://gee.cs.oswego.edu/dl/jmm/cookbook.html it?s allowed to reorder normal stores with normal stores >> so we need to change the memory input of st2 when we find st3 can be removed. In the code, at that point, this=st1, st = st3 and prev=st2. > > In this case the code should be: > > if (st->in(MemNode::Address)->eqv_uncast(address) && > ... > } else { > prev = st; > } > > to update 'prev' with 'st' only if 'st' is not removed. You?re right. > Also, I think, st->in(MemNode::Memory) could be put in local var since it is used several times in this code. 
> >> >>> You need to set improved = true since 'this' will not change. We also use 'make_progress' variable's name in such cases. >> >> In the example above, if we remove st2, we modify this, right? > > We need to call Ideal() again if store inputs are changed. So if st2 is removed then inputs of st1 are changed so we need to rerun Ideal(). This allow to avoid having your new loop in the Ideal(). Sorry, I don?t understand this. Are you saying there?s no need for a loop at all? Or are you saying that as soon as there?s a similar store that is found we should return from Ideal that will be called again to maybe find other similar stores? >> We?ll find a path from the head that doesn?t go through the store and that exits the loop. What the comment doesn?t say is that with example 2 below: >> >> for (int i = 0; i < 10; i++) { >> if (some_condition) { >> uncommon_trap(); >> } >> array[idx] = 999; >> } >> >> my verification code would find the early exit as well. >> >> It?s verification code only because if we have example 1 above, then we have a memory Phi to merge both branches of the if. So the pattern that we look for in PhaseIdealLoop::try_move_store_before_loop() won?t match: the loop?s memory Phi backedge won?t be the store. If we have example 2 above, then the loop?s memory Phi doesn?t have a single memory use. So I don?t think we need to check that the store post dominate the loop head in product. That?s my reasoning anyway and the verification code is there to verify it. > > I missed 'mem->in(LoopNode::LoopBackControl) == n' condition. Which reduce cases only to one store to this address in the loop - good. > > How you check in product VM that there are no other exists from a loop (your example 2)? Is it guarded by mem->outcnt() == 1 check? Yes. >>> Should you check phi == NULL instead of assert to make sure you have only one Phi node? >> >> Can there be more than one memory Phi for a particular slice that has in(0) == n_loop->_head? >> I would have expected that to be impossible. > > BOTTOM (all slices) Phi? Wouldn?t there be a MergeMem between the store and the Phi then? For the record, the webrev: http://cr.openjdk.java.net/~roland/8080289/webrev.00/ Roland. From sergei.kovalev at oracle.com Wed Jun 17 10:44:51 2015 From: sergei.kovalev at oracle.com (Sergei Kovalev) Date: Wed, 17 Jun 2015 13:44:51 +0300 Subject: RFR(S): 8067163: Several JT_HS tests fails due to ClassNotFoundException on compacts Message-ID: <55814FA3.4000702@oracle.com> Hello Team, Please review the fix for https://bugs.openjdk.java.net/browse/JDK-8067163 Webrev: http://cr.openjdk.java.net/~skovalev/8067163/webrev.00/ Summary: several regression tests requires WitheBox object. The object is available starting from compact3 profile. To fix the issue all tests added to needs_compact3 group. -- With best regards, Sergei From benedikt.wedenik at theobroma-systems.com Wed Jun 17 11:13:23 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 17 Jun 2015 13:13:23 +0200 Subject: aarch64 DMB - patch Message-ID: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> Hi! I did some experiments on aarch64 with specjBB2005 and other benchmarks and I found out, that not emitting memory barriers (DMB) in c2 (see attached patch) improved the performance by about 7%. The runs were still valid. I wrote a micro-benchmark (attachment), which only deals with synchronization, because the memory barriers (DMB) are emitted in the synchronized methods. 
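A minimal sketch of this kind of synchronization-only micro-benchmark (illustrative only; the original Multi.java / Worker.java attachments were scrubbed from the archive, so class names and iteration counts here are made up):

// Several threads hammer a synchronized counter so that the JIT compiles the
// monitorenter/monitorexit path (where the trailing DMBs are emitted) inside a
// hot loop. Run time with and without the patch is what gets compared.
public class SyncBench {
    private long counter;

    private synchronized void inc() {
        counter++;
    }

    public static void main(String[] args) throws InterruptedException {
        final SyncBench bench = new SyncBench();
        final int nThreads = 4;
        final long itersPerThread = 50_000_000L;
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (long i = 0; i < itersPerThread; i++) {
                        bench.inc();
                    }
                }
            });
        }
        long start = System.nanoTime();
        for (Thread w : workers) {
            w.start();
        }
        for (Thread w : workers) {
            w.join();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("counter = " + bench.counter + ", " + elapsedMs + " ms");
    }
}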
I had a look into the generated assembly and I found out, that all needed variables are loaded and stored by using instructions like ldaxr / stlxr. As far as I knew these functions already guarentee an exclusive, as well as ordered access, which would be forced by the (redundant) DMB instruction. To be sure I did some research in the unified ARMv8 spec and found some interesting statements. In chapter B2.7 Memory ordering, Load-Acquire Store-Release (page 88-89) one can read the following: page 88: There are no additional ordering requirements on loads or stores that appear before the Load-Acquire. There are no additional ordering requirements on loads or stores that appear in program order after the Store-Release. page 89: The Load-Acquire/Store-Release instructions can remove the requirement to use the explicit DMB memory barrier instruction. This tells me, that my patch does not touch the correctness of the generated code but increases the performance. Benedikt -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: aarch64.ad.patch Type: application/octet-stream Size: 575 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Multi.java Type: application/octet-stream Size: 577 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Worker.java Type: application/octet-stream Size: 169 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: From aph at redhat.com Wed Jun 17 11:17:47 2015 From: aph at redhat.com (Andrew Haley) Date: Wed, 17 Jun 2015 12:17:47 +0100 Subject: aarch64 DMB - patch In-Reply-To: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> Message-ID: <5581575B.8050602@redhat.com> On 06/17/2015 12:13 PM, Benedikt Wedenik wrote: > This tells me, that my patch does not touch the correctness of the generated > code but increases the performance. Which version of OpenJDK is this? Andrew Dinn is working on this code. Andrew. From benedikt.wedenik at theobroma-systems.com Wed Jun 17 11:20:14 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 17 Jun 2015 13:20:14 +0200 Subject: aarch64 DMB - patch In-Reply-To: <5581575B.8050602@redhat.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> Message-ID: <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> this is the first line of ?hg log?: changeset: 8158:471988878307 tag: tip user: thartmann date: Mon Mar 23 10:15:53 2015 +0100 summary: 8075136: Unnecessary sign extension for byte array access On 17 Jun 2015, at 13:17, Andrew Haley wrote: > On 06/17/2015 12:13 PM, Benedikt Wedenik wrote: >> This tells me, that my patch does not touch the correctness of the generated >> code but increases the performance. > > Which version of OpenJDK is this? Andrew Dinn is working on this code. > > Andrew. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aph at redhat.com Wed Jun 17 11:25:25 2015 From: aph at redhat.com (Andrew Haley) Date: Wed, 17 Jun 2015 12:25:25 +0100 Subject: aarch64 DMB - patch In-Reply-To: <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> Message-ID: <55815925.3020908@redhat.com> On 06/17/2015 12:20 PM, Benedikt Wedenik wrote: > this is the first line of ?hg log?: > > changeset: 8158:471988878307 > tag: tip > user: thartmann > date: Mon Mar 23 10:15:53 2015 +0100 > summary: 8075136: Unnecessary sign extension for byte array access That's rather old. Please use hg.openjdk.java.net/jdk9/hs-comp. I think that adinn has a patch which fixes this, but it's help up for some other reason. Andrew. From aph at redhat.com Wed Jun 17 11:30:55 2015 From: aph at redhat.com (Andrew Haley) Date: Wed, 17 Jun 2015 12:30:55 +0100 Subject: aarch64 DMB - patch In-Reply-To: <55815925.3020908@redhat.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> <55815925.3020908@redhat.com> Message-ID: <55815A6F.7010505@redhat.com> On 06/17/2015 12:25 PM, Andrew Haley wrote: > On 06/17/2015 12:20 PM, Benedikt Wedenik wrote: >> this is the first line of ?hg log?: >> >> changeset: 8158:471988878307 >> tag: tip >> user: thartmann >> date: Mon Mar 23 10:15:53 2015 +0100 >> summary: 8075136: Unnecessary sign extension for byte array access > > That's rather old. Please use hg.openjdk.java.net/jdk9/hs-comp. > I think that adinn has a patch which fixes this, but it's help up for ^held up > some other reason. > > Andrew. > From benedikt.wedenik at theobroma-systems.com Wed Jun 17 11:42:05 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 17 Jun 2015 13:42:05 +0200 Subject: aarch64 DMB - patch In-Reply-To: <55815A6F.7010505@redhat.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> <55815925.3020908@redhat.com> <55815A6F.7010505@redhat.com> Message-ID: <92ED304B-57A4-4F63-9F4D-6B8259CC807F@theobroma-systems.com> Thanks. But why not the http://hg.openjdk.java.net/aarch64-port/jdk9/ ? Or is hg.openjdk.java.net/jdk9/hs-comp used inside the aarch64-port? Benedikt On 17 Jun 2015, at 13:30, Andrew Haley wrote: > On 06/17/2015 12:25 PM, Andrew Haley wrote: >> On 06/17/2015 12:20 PM, Benedikt Wedenik wrote: >>> this is the first line of ?hg log?: >>> >>> changeset: 8158:471988878307 >>> tag: tip >>> user: thartmann >>> date: Mon Mar 23 10:15:53 2015 +0100 >>> summary: 8075136: Unnecessary sign extension for byte array access >> >> That's rather old. Please use hg.openjdk.java.net/jdk9/hs-comp. >> I think that adinn has a patch which fixes this, but it's help up for > ^held up >> some other reason. >> >> Andrew. 
>> > From aph at redhat.com Wed Jun 17 11:44:57 2015 From: aph at redhat.com (Andrew Haley) Date: Wed, 17 Jun 2015 12:44:57 +0100 Subject: aarch64 DMB - patch In-Reply-To: <92ED304B-57A4-4F63-9F4D-6B8259CC807F@theobroma-systems.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> <55815925.3020908@redhat.com> <55815A6F.7010505@redhat.com> <92ED304B-57A4-4F63-9F4D-6B8259CC807F@theobroma-systems.com> Message-ID: <55815DB9.40409@redhat.com> On 06/17/2015 12:42 PM, Benedikt Wedenik wrote: > But why not the http://hg.openjdk.java.net/aarch64-port/jdk9/ ? This repo is used for HotSpot development. It's newer. I think your patch is correct, but ADinn is working on a rewrite of this code at the moment. Andrew. From david.holmes at oracle.com Wed Jun 17 12:44:21 2015 From: david.holmes at oracle.com (David Holmes) Date: Wed, 17 Jun 2015 22:44:21 +1000 Subject: RFR(S): 8067163: Several JT_HS tests fails due to ClassNotFoundException on compacts In-Reply-To: <55814FA3.4000702@oracle.com> References: <55814FA3.4000702@oracle.com> Message-ID: <55816BA5.9090809@oracle.com> On 17/06/2015 8:44 PM, Sergei Kovalev wrote: > Hello Team, > > Please review the fix for https://bugs.openjdk.java.net/browse/JDK-8067163 > Webrev: http://cr.openjdk.java.net/~skovalev/8067163/webrev.00/ > > Summary: several regression tests requires WitheBox object. The object > is available starting from compact3 profile. To fix the issue all tests > added to needs_compact3 group. Seems okay, but why has this only been discovered now? And are the tests runs that hit this using the groups mechanism? Thanks, David From sergei.kovalev at oracle.com Wed Jun 17 12:51:43 2015 From: sergei.kovalev at oracle.com (Sergei Kovalev) Date: Wed, 17 Jun 2015 15:51:43 +0300 Subject: RFR(S): 8067163: Several JT_HS tests fails due to ClassNotFoundException on compacts In-Reply-To: <55816BA5.9090809@oracle.com> References: <55814FA3.4000702@oracle.com> <55816BA5.9090809@oracle.com> Message-ID: <55816D5F.9080900@oracle.com> Hi David, On 17.06.15 15:44, David Holmes wrote: > On 17/06/2015 8:44 PM, Sergei Kovalev wrote: >> Hello Team, >> >> Please review the fix for >> https://bugs.openjdk.java.net/browse/JDK-8067163 >> Webrev: http://cr.openjdk.java.net/~skovalev/8067163/webrev.00/ >> >> Summary: several regression tests requires WitheBox object. The object >> is available starting from compact3 profile. To fix the issue all tests >> added to needs_compact3 group. > > Seems okay, but why has this only been discovered now? It was discovered on 8u40 when we switched to group mechanism. > And are the tests runs that hit this using the groups mechanism? The link to test run with group usage is in jira comments. Also we observed the issue with 8u60. I can provide a link in separate e-mail if you'd like it. > > > Thanks, > David > > > > > -- With best regards, Sergei From vladimir.x.ivanov at oracle.com Wed Jun 17 13:03:39 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 17 Jun 2015 16:03:39 +0300 Subject: RFR(S): 8067163: Several JT_HS tests fails due to ClassNotFoundException on compacts In-Reply-To: <55814FA3.4000702@oracle.com> References: <55814FA3.4000702@oracle.com> Message-ID: <5581702B.5030005@oracle.com> Looks good. 
Best regards, Vladimir Ivanov On 6/17/15 1:44 PM, Sergei Kovalev wrote: > Hello Team, > > Please review the fix for https://bugs.openjdk.java.net/browse/JDK-8067163 > Webrev: http://cr.openjdk.java.net/~skovalev/8067163/webrev.00/ > > Summary: several regression tests requires WitheBox object. The object > is available starting from compact3 profile. To fix the issue all tests > added to needs_compact3 group. > From benedikt.wedenik at theobroma-systems.com Wed Jun 17 13:26:38 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 17 Jun 2015 15:26:38 +0200 Subject: aarch64 DMB - patch In-Reply-To: <55815DB9.40409@redhat.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> <55815925.3020908@redhat.com> <55815A6F.7010505@redhat.com> <92ED304B-57A4-4F63-9F4D-6B8259CC807F@theobroma-systems.com> <55815DB9.40409@redhat.com> Message-ID: <4554BF7F-F4CE-4FBD-B09A-1EFCDA598F5F@theobroma-systems.com> I checked out both repositories and compared the AD-file. My patch also works in the latest version of hg.openjdk.java.net/jdk9/hs-comp. If ADinn is working on that part of the code right now, do you think I should talk to him directly? Thanks, Benedikt On 17 Jun 2015, at 13:44, Andrew Haley wrote: > On 06/17/2015 12:42 PM, Benedikt Wedenik wrote: >> But why not the http://hg.openjdk.java.net/aarch64-port/jdk9/ ? > > This repo is used for HotSpot development. It's newer. I think your > patch is correct, but ADinn is working on a rewrite of this code at > the moment. > > Andrew. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.x.ivanov at oracle.com Wed Jun 17 16:38:04 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 17 Jun 2015 19:38:04 +0300 Subject: [9] RFR (M): VM should constant fold Unsafe.get*() loads from final fields Message-ID: <5581A26C.6090303@oracle.com> http://cr.openjdk.java.net/~vlivanov/8078629/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8078629 Direct(getfield/getstatic) read operations are faster than unsafe reads from constant Java fields, since VM doesn't constant fold unsafe loads. Though VM tries hard to recover field metadata from its offset, it doesn't optimize unsafe ones even if it has all necessary info in its hands. The fix is to align the behavior and share relevant code between C2 parser and intrinsic expansion logic. For testing purposes, I extended whitebox API to check whether a value is a compile-time constant. The whitebox test enumerates all combinations of a field and ensures that the behavior is consistent between bytecode and unsafe reads. Testing: focused whitebox tests, hotspot/test/compiler, jdk/test/java/lang/invoke, octane (for performance measurements) Thanks! Best regards, Vladimir Ivanov From vladimir.kozlov at oracle.com Wed Jun 17 19:03:01 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 17 Jun 2015 12:03:01 -0700 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> Message-ID: <5581C465.7070803@oracle.com> > http://gee.cs.oswego.edu/dl/jmm/cookbook.html > > it?s allowed to reorder normal stores with normal stores If we can guarantee that all passed stores are normal (I assume we will have barriers otherwise in between) then I agree. 
I am not sure why we didn't do it before, there could be a counterargument for that which I don't remember. To make sure, ask John. >> We need to call Ideal() again if store inputs are changed. So if st2 is removed then inputs of st1 are changed so we need to rerun Ideal(). This allow to avoid having your new loop in the Ideal(). > > Sorry, I don?t understand this. Are you saying there?s no need for a loop at all? Or are you saying that as soon as there?s a similar store that is found we should return from Ideal that will be called again to maybe find other similar stores? Yes, it may simplify the code of Ideal. You may still need a loop to look for previous store which could be eliminated but you don't need to have 'prev'. As soon you remove one node, you exit Ideal returning 'this' and it will be called again so you can search for another previous store. >> BOTTOM (all slices) Phi? > > Wouldn?t there be a MergeMem between the store and the Phi then? Yes. Okay, you can keep the check as assert we will see if Nightly testing hit it it or not. Thanks, Vladimir On 6/17/15 1:35 AM, Roland Westrelin wrote: > >>> That?s what I think the code does. That is if you have: >>> >>> st1->st2->st3->st4 >> >> I assume st4 is first store and st1 is last. Right? > > Program order is: > st4 > st3 > st2 > st1 > >>> and st3 is redundant with st1, the chain should become: >>> >>> st1->st2->st4 >> >> I am not sure it is correct optimization. On some machines result of st3 could be visible before result of st2. And you change it. >> I am suggesting not do that. Do you need that for stores move from loop? > > It?s not required. It cleans up the graph in some cases like this: > > static void test_after_5(int idx) { > for (int i = 0; i < 1000; i++) { > array[idx] = i; > array[idx+1] = i; > array[idx+2] = i; > array[idx+3] = i; > array[idx+4] = i; > array[idx+5] = i; > } > } > > all stores are sunk out of the loop but that happens after iteration splitting and so there are multiple redundant copies of each store that are not collapsed. > > This said, we currently reorder the stores even if it?s less aggressive than what I?m proposing. With program: > > st4 > st3 > st2 > st1 > > If st1, st3 and st4 are on one slice and st2 is on another and if st1 and st3 store to the same address we optimize st3 out: > > st4 > st2 > st1 > > so st3=st1 may only be visible after st2. > > Also, the way I read the first table in this: > > http://gee.cs.oswego.edu/dl/jmm/cookbook.html > > it?s allowed to reorder normal stores with normal stores > >>> so we need to change the memory input of st2 when we find st3 can be removed. In the code, at that point, this=st1, st = st3 and prev=st2. >> >> In this case the code should be: >> >> if (st->in(MemNode::Address)->eqv_uncast(address) && >> ... >> } else { >> prev = st; >> } >> >> to update 'prev' with 'st' only if 'st' is not removed. > > You?re right. > >> Also, I think, st->in(MemNode::Memory) could be put in local var since it is used several times in this code. >> >>> >>>> You need to set improved = true since 'this' will not change. We also use 'make_progress' variable's name in such cases. >>> >>> In the example above, if we remove st2, we modify this, right? >> >> We need to call Ideal() again if store inputs are changed. So if st2 is removed then inputs of st1 are changed so we need to rerun Ideal(). This allow to avoid having your new loop in the Ideal(). > > Sorry, I don?t understand this. Are you saying there?s no need for a loop at all? 
Or are you saying that as soon as there?s a similar store that is found we should return from Ideal that will be called again to maybe find other similar stores? > >>> We?ll find a path from the head that doesn?t go through the store and that exits the loop. What the comment doesn?t say is that with example 2 below: >>> >>> for (int i = 0; i < 10; i++) { >>> if (some_condition) { >>> uncommon_trap(); >>> } >>> array[idx] = 999; >>> } >>> >>> my verification code would find the early exit as well. >>> >>> It?s verification code only because if we have example 1 above, then we have a memory Phi to merge both branches of the if. So the pattern that we look for in PhaseIdealLoop::try_move_store_before_loop() won?t match: the loop?s memory Phi backedge won?t be the store. If we have example 2 above, then the loop?s memory Phi doesn?t have a single memory use. So I don?t think we need to check that the store post dominate the loop head in product. That?s my reasoning anyway and the verification code is there to verify it. >> >> I missed 'mem->in(LoopNode::LoopBackControl) == n' condition. Which reduce cases only to one store to this address in the loop - good. >> >> How you check in product VM that there are no other exists from a loop (your example 2)? Is it guarded by mem->outcnt() == 1 check? > > Yes. > >>>> Should you check phi == NULL instead of assert to make sure you have only one Phi node? >>> >>> Can there be more than one memory Phi for a particular slice that has in(0) == n_loop->_head? >>> I would have expected that to be impossible. >> >> BOTTOM (all slices) Phi? > > Wouldn?t there be a MergeMem between the store and the Phi then? > > For the record, the webrev: > > http://cr.openjdk.java.net/~roland/8080289/webrev.00/ > > Roland. > From vitalyd at gmail.com Wed Jun 17 19:08:19 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 17 Jun 2015 15:08:19 -0400 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <5581C465.7070803@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> Message-ID: > > If we can guarantee that all passed stores are normal (I assume we will > have barriers otherwise in between) then I agree. I am not sure why we > didn't do it before, there could be a counterargument for that which I > don't remember. To make sure, ask John. Just my $.02 in here, but AFAIK, this should be completely legal assuming no barriers in there, as you say. Java doesn't allow introducing new writes, but removing existing ones like this should be legal and a good optimization. On Wed, Jun 17, 2015 at 3:03 PM, Vladimir Kozlov wrote: > > http://gee.cs.oswego.edu/dl/jmm/cookbook.html > > > > it?s allowed to reorder normal stores with normal stores > > If we can guarantee that all passed stores are normal (I assume we will > have barriers otherwise in between) then I agree. I am not sure why we > didn't do it before, there could be a counterargument for that which I > don't remember. To make sure, ask John. > > >> We need to call Ideal() again if store inputs are changed. So if st2 is > removed then inputs of st1 are changed so we need to rerun Ideal(). This > allow to avoid having your new loop in the Ideal(). > > > > Sorry, I don?t understand this. Are you saying there?s no need for a > loop at all? 
Or are you saying that as soon as there?s a similar store that > is found we should return from Ideal that will be called again to maybe > find other similar stores? > > Yes, it may simplify the code of Ideal. You may still need a loop to look > for previous store which could be eliminated but you don't need to have > 'prev'. As soon you remove one node, you exit Ideal returning 'this' and it > will be called again so you can search for another previous store. > > >> BOTTOM (all slices) Phi? > > > > Wouldn?t there be a MergeMem between the store and the Phi then? > > Yes. Okay, you can keep the check as assert we will see if Nightly testing > hit it it or not. > > Thanks, > Vladimir > > > On 6/17/15 1:35 AM, Roland Westrelin wrote: > >> >> That?s what I think the code does. That is if you have: >>>> >>>> st1->st2->st3->st4 >>>> >>> >>> I assume st4 is first store and st1 is last. Right? >>> >> >> Program order is: >> st4 >> st3 >> st2 >> st1 >> >> and st3 is redundant with st1, the chain should become: >>>> >>>> st1->st2->st4 >>>> >>> >>> I am not sure it is correct optimization. On some machines result of st3 >>> could be visible before result of st2. And you change it. >>> I am suggesting not do that. Do you need that for stores move from loop? >>> >> >> It?s not required. It cleans up the graph in some cases like this: >> >> static void test_after_5(int idx) { >> for (int i = 0; i < 1000; i++) { >> array[idx] = i; >> array[idx+1] = i; >> array[idx+2] = i; >> array[idx+3] = i; >> array[idx+4] = i; >> array[idx+5] = i; >> } >> } >> >> all stores are sunk out of the loop but that happens after iteration >> splitting and so there are multiple redundant copies of each store that are >> not collapsed. >> >> This said, we currently reorder the stores even if it?s less aggressive >> than what I?m proposing. With program: >> >> st4 >> st3 >> st2 >> st1 >> >> If st1, st3 and st4 are on one slice and st2 is on another and if st1 and >> st3 store to the same address we optimize st3 out: >> >> st4 >> st2 >> st1 >> >> so st3=st1 may only be visible after st2. >> >> Also, the way I read the first table in this: >> >> http://gee.cs.oswego.edu/dl/jmm/cookbook.html >> >> it?s allowed to reorder normal stores with normal stores >> >> so we need to change the memory input of st2 when we find st3 can be >>>> removed. In the code, at that point, this=st1, st = st3 and prev=st2. >>>> >>> >>> In this case the code should be: >>> >>> if (st->in(MemNode::Address)->eqv_uncast(address) && >>> ... >>> } else { >>> prev = st; >>> } >>> >>> to update 'prev' with 'st' only if 'st' is not removed. >>> >> >> You?re right. >> >> Also, I think, st->in(MemNode::Memory) could be put in local var since >>> it is used several times in this code. >>> >>> >>>> You need to set improved = true since 'this' will not change. We also >>>>> use 'make_progress' variable's name in such cases. >>>>> >>>> >>>> In the example above, if we remove st2, we modify this, right? >>>> >>> >>> We need to call Ideal() again if store inputs are changed. So if st2 is >>> removed then inputs of st1 are changed so we need to rerun Ideal(). This >>> allow to avoid having your new loop in the Ideal(). >>> >> >> Sorry, I don?t understand this. Are you saying there?s no need for a loop >> at all? Or are you saying that as soon as there?s a similar store that is >> found we should return from Ideal that will be called again to maybe find >> other similar stores? 
>> >> We?ll find a path from the head that doesn?t go through the store and >>>> that exits the loop. What the comment doesn?t say is that with example 2 >>>> below: >>>> >>>> for (int i = 0; i < 10; i++) { >>>> if (some_condition) { >>>> uncommon_trap(); >>>> } >>>> array[idx] = 999; >>>> } >>>> >>>> my verification code would find the early exit as well. >>>> >>>> It?s verification code only because if we have example 1 above, then we >>>> have a memory Phi to merge both branches of the if. So the pattern that we >>>> look for in PhaseIdealLoop::try_move_store_before_loop() won?t match: the >>>> loop?s memory Phi backedge won?t be the store. If we have example 2 above, >>>> then the loop?s memory Phi doesn?t have a single memory use. So I don?t >>>> think we need to check that the store post dominate the loop head in >>>> product. That?s my reasoning anyway and the verification code is there to >>>> verify it. >>>> >>> >>> I missed 'mem->in(LoopNode::LoopBackControl) == n' condition. Which >>> reduce cases only to one store to this address in the loop - good. >>> >>> How you check in product VM that there are no other exists from a loop >>> (your example 2)? Is it guarded by mem->outcnt() == 1 check? >>> >> >> Yes. >> >> Should you check phi == NULL instead of assert to make sure you have >>>>> only one Phi node? >>>>> >>>> >>>> Can there be more than one memory Phi for a particular slice that has >>>> in(0) == n_loop->_head? >>>> I would have expected that to be impossible. >>>> >>> >>> BOTTOM (all slices) Phi? >>> >> >> Wouldn?t there be a MergeMem between the store and the Phi then? >> >> For the record, the webrev: >> >> http://cr.openjdk.java.net/~roland/8080289/webrev.00/ >> >> Roland. >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.r.rose at oracle.com Wed Jun 17 19:31:28 2015 From: john.r.rose at oracle.com (John Rose) Date: Wed, 17 Jun 2015 12:31:28 -0700 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <5581C465.7070803@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> Message-ID: On Jun 17, 2015, at 12:03 PM, Vladimir Kozlov wrote: > > If we can guarantee that all passed stores are normal (I assume we will have barriers otherwise in between) then I agree. I am not sure why we didn't do it before, there could be a counterargument for that which I don't remember. To make sure, ask John. Here's my a warning about this: New optimizations sometimes uncover implicit dependencies by badly-written user code on memory effect ordering. The user code is bad, but sometimes you need to be able to suppress the optimization anyway. You might suppress only as long as it takes to diagnose the bad code, but it might be a long time if the user can't or won't fix it. (Think of it as an indefinitely delayed optimization of an indefinitely delayable store.) We have to be careful about fences, both emitting them into IR and respecting them once there. It is likely that we have enough fences to implement the JMM at current optimization levels, but perhaps we need a few more if we reorder stores more aggressively. The important thing to remember is that the JMM is not defined in terms of fences; the fences are a means to an end in enforcing happens-before relations. An *activated* safepoint might want fence-like behavior w.r.t. aggressive optimizations. 
For example, if we delay a store indefinitely because of a loop, and the loop hits an active safepoint, an early store might need to be flushed out. Even if a safepoint doesn't need such "flushing" behavior per the JMM, we might want that anyway to preserve some basic "liveliness" on the JVM's behavior. It feels possibly correct and definitely surprising to delay a store indefinitely. All that said, the JMM gives us permission to perform very aggressive optimizations. So let's do them. ? John -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alexander.Alexeev at caviumnetworks.com Wed Jun 17 19:34:44 2015 From: Alexander.Alexeev at caviumnetworks.com (Alexeev, Alexander) Date: Wed, 17 Jun 2015 19:34:44 +0000 Subject: pipeline class for sequence of instructions Message-ID: Hello Could somebody clarify how pipeline class is applied on sequence of instructions in architecture description file? For instance, class ialu_reg on countLeadingZerosL_bsr (snippet is below) or ialu_reg_mem on loadUB2L_immI (all from x86_64.ad). Stages for arguments read/writes, decoder and execution unit are specified only once. Is it then applied on every instructions that uses that pipeline class arguments or for the whole ins_encode body? BTW countLeadingZerosL_bsr isn't even a "single_instruction". Class pipe_cmplt looks more reasonable, but and_cmpLTMask and cadd_cmpLTMask still don't have 4 instructions how it is defined. Why 4 cycles are allocated to decode? Thanks, Alexander --------------------- // Integer ALU reg operation pipe_class ialu_reg(rRegI dst) %{ single_instruction; dst : S4(write); dst : S3(read); DECODE : S0; // any decoder ALU : S3; // any alu %} instruct countLeadingZerosL_bsr(rRegI dst, rRegL src, rFlagsReg cr) %{ predicate(!UseCountLeadingZerosInstruction); match(Set dst (CountLeadingZerosL src)); effect(KILL cr); format %{ "bsrq $dst, $src\t# count leading zeros (long)\n\t" "jnz skip\n\t" "movl $dst, -1\n" "skip:\n\t" "negl $dst\n\t" "addl $dst, 63" %} ins_encode %{ Register Rdst = $dst$$Register; Register Rsrc = $src$$Register; Label skip; __ bsrq(Rdst, Rsrc); __ jccb(Assembler::notZero, skip); __ movl(Rdst, -1); __ bind(skip); __ negl(Rdst); __ addl(Rdst, BitsPerLong - 1); %} ins_pipe(ialu_reg); %} -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed Jun 17 19:39:13 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 17 Jun 2015 15:39:13 -0400 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> Message-ID: But happens-before is meaningful only for inter-thread communication. If we're talking about plain stores with no fences (or let's say, JMM happens-before inducing points in between them), then as long as intra-thread semantics aren't violated, I'd say anything's on the table :). Will this risk uncovering broken user code? Yes, but that code is a ticking time bomb anyway (and subject to CPU reordering as well). On Wed, Jun 17, 2015 at 3:31 PM, John Rose wrote: > On Jun 17, 2015, at 12:03 PM, Vladimir Kozlov > wrote: > > > If we can guarantee that all passed stores are normal (I assume we will > have barriers otherwise in between) then I agree. I am not sure why we > didn't do it before, there could be a counterargument for that which I > don't remember. To make sure, ask John. 
> > > Here's my a warning about this: New optimizations sometimes uncover > implicit dependencies > by badly-written user code on memory effect ordering. The user code is > bad, but sometimes you > need to be able to suppress the optimization anyway. You might suppress > only as long > as it takes to diagnose the bad code, but it might be a long time if the > user can't or won't fix it. > (Think of it as an indefinitely delayed optimization of an indefinitely > delayable store.) > > We have to be careful about fences, both emitting them into IR and > respecting them once there. > It is likely that we have enough fences to implement the JMM at current > optimization levels, > but perhaps we need a few more if we reorder stores more aggressively. > The important thing to remember is that the JMM is not defined in terms of > fences; > the fences are a means to an end in enforcing happens-before relations. > > An *activated* safepoint might want fence-like behavior w.r.t. aggressive > optimizations. > For example, if we delay a store indefinitely because of a loop, and the > loop hits > an active safepoint, an early store might need to be flushed out. > > Even if a safepoint doesn't need such "flushing" behavior per the JMM, we > might > want that anyway to preserve some basic "liveliness" on the JVM's behavior. > It feels possibly correct and definitely surprising to delay a store > indefinitely. > > All that said, the JMM gives us permission to perform very aggressive > optimizations. > So let's do them. > > ? John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.r.rose at oracle.com Wed Jun 17 19:44:55 2015 From: john.r.rose at oracle.com (John Rose) Date: Wed, 17 Jun 2015 12:44:55 -0700 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> Message-ID: On Jun 17, 2015, at 12:39 PM, Vitaly Davidovich wrote: > > But happens-before is meaningful only for inter-thread communication. If we're talking about plain stores with no fences (or let's say, JMM happens-before inducing points in between them), then as long as intra-thread semantics aren't violated, I'd say anything's on the table :). Nope, that's an oversimplified understanding. One place where the JMM will bite you is with publication of object state via final fields. Normal stores used to initialize a structure which is published via final-field semantics must be ordered to take place before the object is published. We don't (and perhaps can't) track object publication events, nor their relation to stores into newly-reachable subgraphs. Instead, we have fences that gently but firmly ensure that data (from normal stores, even to non-final fields and array elements!) is posted to memory before any store which could be a publishing store for that data. > Will this risk uncovering broken user code? Yes, but that code is a ticking time bomb anyway (and subject to CPU reordering as well). Indeed. We live in a world of ticking time bombs. Some of them are our problem even if we didn't cause them. I don't mind uncovering broken user code; sooner is better. But that user will want (as a matter of QOS) a range of workarounds. ? John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vitalyd at gmail.com Wed Jun 17 20:23:37 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 17 Jun 2015 16:23:37 -0400 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> Message-ID: > > Nope, that's an oversimplified understanding. One place where the JMM > will bite you is with publication of object state via final fields. Normal > stores used to initialize a structure which is published via final-field > semantics must be ordered to take place before the object is published. We > don't (and perhaps can't) track object publication events, nor their > relation to stores into newly-reachable subgraphs. Instead, we have fences > that gently but firmly ensure that data (from normal stores, even to > non-final fields and array elements!) is posted to memory before any store > which could be a publishing store for that data. Not sure what's oversimplified -- you're describing a JMM semantic for final fields, which I'd expect to be modeled as barriers in the IR, just like volatile writes would be modeled as barriers, preventing removal or reordering of them. I appreciate that it can be troublesome to track this information, but that only means compiler will have to play more conservative and there may be some optimization opportunities lost. I'd think the pattern would look like: obj = allocZerodMemory(); // obj has final fields obj.ctor(); // arbitrarily long/complex CFG StoreStore _someRef = obj; I'd expect redundant stores to be removed as part of ctor() CFG without violating the storestore barrier. But, I do understand the complexity/trickiness of getting this right. On Wed, Jun 17, 2015 at 3:44 PM, John Rose wrote: > On Jun 17, 2015, at 12:39 PM, Vitaly Davidovich wrote: > > > But happens-before is meaningful only for inter-thread communication. If > we're talking about plain stores with no fences (or let's say, JMM > happens-before inducing points in between them), then as long as > intra-thread semantics aren't violated, I'd say anything's on the table :). > > > Nope, that's an oversimplified understanding. One place where the JMM > will bite you is with publication of object state via final fields. Normal > stores used to initialize a structure which is published via final-field > semantics must be ordered to take place before the object is published. We > don't (and perhaps can't) track object publication events, nor their > relation to stores into newly-reachable subgraphs. Instead, we have fences > that gently but firmly ensure that data (from normal stores, even to > non-final fields and array elements!) is posted to memory before any store > which could be a publishing store for that data. > > Will this risk uncovering broken user code? Yes, but that code is a > ticking time bomb anyway (and subject to CPU reordering as well). > > > Indeed. We live in a world of ticking time bombs. Some of them are our > problem even if we didn't cause them. > > I don't mind uncovering broken user code; sooner is better. But that user > will want (as a matter of QOS) a range of workarounds. > > ? John > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ahmed.khawaja at oracle.com Wed Jun 17 20:30:43 2015 From: ahmed.khawaja at oracle.com (Ahmed Khawaja) Date: Wed, 17 Jun 2015 13:30:43 -0700 Subject: FP Registers in SPARC Intrinsic Message-ID: <5581D8F3.3090308@oracle.com> I am working on adding some intrinsics that mostly benefit from the lack of a JNI call. One strategy I am using is to free up some registers by moving them into FP registers and restoring them later. One issue I am running into is I believe if the thread gets unscheduled, the FP registers are being trashed and then upon resume my code tries to reload the integer registers with the now garbage FP registers. When invoking the intrinsic, I removed RC_NO_FP, but this does not seem to indict to the JVM that the FP registers need to be saved on a context switch. Has anyone run into something similar? Thank you, Ahmed Khawaja From vladimir.kozlov at oracle.com Wed Jun 17 21:15:25 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 17 Jun 2015 14:15:25 -0700 Subject: FP Registers in SPARC Intrinsic In-Reply-To: <5581D8F3.3090308@oracle.com> References: <5581D8F3.3090308@oracle.com> Message-ID: <5581E36D.8040508@oracle.com> Context switching should preserve FP, otherwise float point arithmetic will not work. Something else may be happened. We need more details. Removing RC_NO_FP is good because register allocator should know that FP regs are used. Vladimir On 6/17/15 1:30 PM, Ahmed Khawaja wrote: > I am working on adding some intrinsics that mostly benefit from the lack > of a JNI call. One strategy I am using is to free up some registers by > moving them into FP registers and restoring them later. One issue I am > running into is I believe if the thread gets unscheduled, the FP > registers are being trashed and then upon resume my code tries to reload > the integer registers with the now garbage FP registers. When invoking > the intrinsic, I removed RC_NO_FP, but this does not seem to indict to > the JVM that the FP registers need to be saved on a context switch. Has > anyone run into something similar? > > Thank you, > Ahmed Khawaja From john.r.rose at oracle.com Wed Jun 17 21:27:36 2015 From: john.r.rose at oracle.com (John Rose) Date: Wed, 17 Jun 2015 14:27:36 -0700 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> Message-ID: <3609D7B9-E4F0-4EA2-A677-8E3BF63AE8B3@oracle.com> > On Jun 17, 2015, at 1:23 PM, Vitaly Davidovich wrote: > > Nope, that's an oversimplified understanding. One place where the JMM will bite you is with publication of object state via final fields. Normal stores used to initialize a structure which is published via final-field semantics must be ordered to take place before the object is published. We don't (and perhaps can't) track object publication events, nor their relation to stores into newly-reachable subgraphs. Instead, we have fences that gently but firmly ensure that data (from normal stores, even to non-final fields and array elements!) is posted to memory before any store which could be a publishing store for that data. > > Not sure what's oversimplified ? I probably misread you, then. > you're describing a JMM semantic for final fields, which I'd expect to be modeled as barriers in the IR, just like volatile writes would be modeled as barriers, preventing removal or reordering of them. 
I appreciate that it can be troublesome to track this information, but that only means compiler will have to play more conservative and there may be some optimization opportunities lost. I'd think the pattern would look like: > > obj = allocZerodMemory(); // obj has final fields > obj.ctor(); // arbitrarily long/complex CFG > StoreStore > _someRef = obj; > > I'd expect redundant stores to be removed as part of ctor() CFG without violating the storestore barrier. But, I do understand the complexity/trickiness of getting this right. You are correct. The StoreStore approximates the point at which the object is first published to other threads. All normal stores above the StoreStore can be issued in any order (as far as this fence is concerned) but must settle before the object is published. Presumably it is published shortly after the StoreStore, and the StoreStore could be sunk until that point, if we wanted to do this, or even eliminated if the object never gets published. Also, stores provably unrelated to (unreachable from) the published object could drop below the StoreStore. We don't attempt to make this distinction. None of these train of thought affects the basic assertion that (if fences are absent) normal stores can be reordered. If we wish to remove that StoreStore (for some reason) we would either need a more precise set of fences (or HB edges), or else we would have to hold back on aggressive store reordering. This is what makes me think we may discover a missing fence, once we start letting those little stores swarm around each other. What makes me more nervous about this is the clear fact that non-TSO platforms (TSO, Itanium) have to tweak their fences in various ad hoc ways to avoid breaking user code. See, for example, Parse::do_exits. If we make our thread-local orderings more non-TSO-ish, we might run into the same subtle issues that the PPC port wrestles with. By "subtle" I partly mean "relating to unstated user expectations even if not supported by the JMM", and I also mean "hard to detect, characterize, and fix". ? John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.r.rose at oracle.com Wed Jun 17 22:57:52 2015 From: john.r.rose at oracle.com (John Rose) Date: Wed, 17 Jun 2015 15:57:52 -0700 Subject: [9] RFR (M): VM should constant fold Unsafe.get*() loads from final fields In-Reply-To: <5581A26C.6090303@oracle.com> References: <5581A26C.6090303@oracle.com> Message-ID: <810DE23B-6616-4465-B91D-4CD9A8FB267D@oracle.com> On Jun 17, 2015, at 9:38 AM, Vladimir Ivanov wrote: > > http://cr.openjdk.java.net/~vlivanov/8078629/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8078629 > > Direct(getfield/getstatic) read operations are faster than unsafe reads from constant Java fields, since VM doesn't constant fold unsafe loads. Though VM tries hard to recover field metadata from its offset, it doesn't optimize unsafe ones even if it has all necessary info in its hands. > > The fix is to align the behavior and share relevant code between C2 parser and intrinsic expansion logic. > > For testing purposes, I extended whitebox API to check whether a value is a compile-time constant. The whitebox test enumerates all combinations of a field and ensures that the behavior is consistent between bytecode and unsafe reads. > > Testing: focused whitebox tests, hotspot/test/compiler, jdk/test/java/lang/invoke, octane (for performance measurements) WB.isCompileConstant is a nice little thing. 
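For readers of the archive, the read pair that the patch aligns looks roughly like this (an illustrative sketch, not the actual regression test; the real test additionally asserts the folding through the new WhiteBox query named above, whose exact signature is in the webrev):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// The same static final field is read via getstatic and via Unsafe from a
// constant base/offset pair; after the fix both reads should be treated as
// compile-time constants by C2. Class and field names are made up.
class UnsafeConstFoldSketch {
    static final int CONST = 42;

    static final Unsafe U;
    static final Object BASE;
    static final long OFFSET;
    static {
        try {
            Field uf = Unsafe.class.getDeclaredField("theUnsafe");
            uf.setAccessible(true);
            U = (Unsafe) uf.get(null);
            Field f = UnsafeConstFoldSketch.class.getDeclaredField("CONST");
            BASE = U.staticFieldBase(f);
            OFFSET = U.staticFieldOffset(f);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int direct()    { return CONST; }                  // getstatic of a constant field
    static int viaUnsafe() { return U.getInt(BASE, OFFSET); } // unsafe read from constant base/offset
}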
We should consider using it in java.lang.invoke to gate aggressive object-folding optimizations. That's one reason to consider putting it somewhere more central that WB. I can't propose a good place yet. (Unsafe is not quite right.) The gating logic in library_call includes this extra term: && alias_type->field()->is_constant() Why not just drop it and let make_constant do the test (which it does)? You have some lines with "/*require_const=*/" in two places; that can't be right. This is the result of functions with too many misc. arguments to keep track of. I don't have the code under my fingers, so I'm just guessing, but here are more suggestions: I wish the is_autobox_cache condition could be more localized. Could we omit the boolean flag (almost always false), and where it is true, post-process the node? Would that make the code simpler? This leads me to notice that make_constant is not related strongly to GraphKit; it is really a call to the Type and CI modules to look for a singleton type, ending with either a NULL or a call to GraphKit::makecon. So you might consider changing Node* GK::make_constant to const Type* Type::make_constant. Now to pick at the argument salad we have in push_constant: The effect of is_autobox_cache could be transferred to a method Type[Ary]::cast_to_autobox_cache(true). And the effect of stable_type on make_constant(ciCon,bool,bool,Type*), could also be factored out, as post-processing step contype=contype->Type::join(stabletype). ? John From vladimir.kozlov at oracle.com Wed Jun 17 23:55:38 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 17 Jun 2015 16:55:38 -0700 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test In-Reply-To: <557255C1.6060101@oracle.com> References: <5559EFA0.5060800@redhat.com> <555D0460.4080805@oracle.com> <555DAD7C.2090805@redhat.com> <557255C1.6060101@oracle.com> Message-ID: <558208FA.3020304@oracle.com> Filed bug before I forgot: https://bugs.openjdk.java.net/browse/JDK-8129092 Vladimir On 6/5/15 7:06 PM, Vladimir Kozlov wrote: > > We either need some code in WhiteBox to check for a deoptimization > > event properly or we should just remove this altogether. > > > So, thoughts? Just delete the check? > > No, we have to check that uncommon trap was hit. This is the main > purpose of that test. > > May be we should add an other method WHITE_BOX.wasMethodDeopted(method) > which checks if method was deoptimized at least once. > > Vladimir > > On 5/21/15 3:03 AM, Andrew Haley wrote: >> On 05/20/2015 11:02 PM, Vladimir Kozlov wrote: >>> testVarClassCast tests deoptimization for javaMirror == null: >>> >>> void testVarClassCast(String s) { >>> Class cl = (s == null) ? null : String.class; >>> try { >>> ssink = (String)cl.cast(svalue); >>> >>> Which is done in LibraryCallKit::inline_Class_cast() by: >>> >>> mirror = null_check(mirror); >>> >>> which has Deoptimization::Action_make_not_entrant. >>> >>> Unfortunately currently the test also pass because unstable_if is >>> generated for the first line: >>> >>> (s == null) ? null : String.class; >>> >>> If you run the test with TraceDoptimization (or LogCompilation) you will >>> see: >>> >>> Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast >>> (@0x000000010b0670d8) thread=26883 reason=unstable_if action=reinterpret >>> unloaded_class_index=-1 >> >> Not quite the same. 
I get a reason=null_check: >> >> Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast >> (@0x000003ff8d253e54) thread=4396243677696 reason=null_check >> action=maybe_recompile unloaded_class_index=-1 >> >> Which comes from a SEGV here: >> >> 0x000003ff89253ca4: ldr x0, [x10,#72] ; implicit exception: >> dispatches to 0x000003ff89253e40 >> >> which is the line >> >> ssink = (String)cl.cast(svalue); >> >> I don't get a trap for unstable_if because there isn't one. I just get >> >> 0x000003ff89253c90: cbz x2, 0x000003ff89253e00 (this is >> java/lang/String s) >> >> --> >> >> 0x000003ff89253e00: mov x10, xzr >> 0x000003ff89253e04: b 0x000003ff89253ca0 >> >> --> >> >> 0x000003ff89253ca0: lsl x11, x11, #3 ;*getstatic svalue >> ; - >> NullCheckDroppingsTest::testVarClassCast at 13 (line 181) >> 0x000003ff89253ca4: ldr x0, [x10,#72] ; implicit exception: >> dispatches to 0x000003ff89253e40 >> >> ... and then the trap for the null pointer exception. >> >> Andrew. >> From vitalyd at gmail.com Thu Jun 18 00:28:25 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 17 Jun 2015 20:28:25 -0400 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <3609D7B9-E4F0-4EA2-A677-8E3BF63AE8B3@oracle.com> Message-ID: So I'm not sure how many cases will arise where scheduling stores is beneficial (on modern cpus) apart from removing redundant ones. The compiler would need some seriously detailed machine model, I think, to reason about this intelligently. Removing redundant ones (or moving loop invariant ones out of loops, like Roland is trying here) seems more tractable and beneficial? Are there cases beyond this where it would be profitable? Perhaps scheduling writes to addresses likely to be on same cacheline maybe ... As for removing StoreStore barriers, it seems like that's practically feasible with java's semantics only when EA kicks in; I'm having a hard time imagining how the JIT can trace unsafe/racy publication reliably and with minimal overhead. Perhaps I'm not thinking hard enough though ... It's almost unfortunate that final fields were granted this right to be published unsafely :) - would've been perhaps better if explicit fencing was required for such specialized case. sent from my phone On Jun 17, 2015 5:27 PM, "John Rose" wrote: On Jun 17, 2015, at 1:23 PM, Vitaly Davidovich wrote: Nope, that's an oversimplified understanding. One place where the JMM will > bite you is with publication of object state via final fields. Normal > stores used to initialize a structure which is published via final-field > semantics must be ordered to take place before the object is published. We > don't (and perhaps can't) track object publication events, nor their > relation to stores into newly-reachable subgraphs. Instead, we have fences > that gently but firmly ensure that data (from normal stores, even to > non-final fields and array elements!) is posted to memory before any store > which could be a publishing store for that data. Not sure what's oversimplified ? I probably misread you, then. you're describing a JMM semantic for final fields, which I'd expect to be modeled as barriers in the IR, just like volatile writes would be modeled as barriers, preventing removal or reordering of them. 
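To make the final-field publication case concrete, here is a minimal sketch of the pattern being discussed; the class names are invented for the example.

  class Config {
      final int size;
      final int[] table;
      Config(int size) {
          this.size = size;            // plain stores inside the constructor...
          this.table = new int[size];  // ...must be visible before 'this' is published
      }
  }

  class Publisher {
      static Config shared;            // deliberately non-volatile: racy publication
      static void publish() {
          // the StoreStore discussed in this thread sits between the constructor's
          // stores and this publishing store, so a reader that sees 'shared' non-null
          // also sees size and table fully initialized
          shared = new Config(16);
      }
  }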
I appreciate that it can be troublesome to track this information, but that only means compiler will have to play more conservative and there may be some optimization opportunities lost. I'd think the pattern would look like: obj = allocZerodMemory(); // obj has final fields obj.ctor(); // arbitrarily long/complex CFG StoreStore _someRef = obj; I'd expect redundant stores to be removed as part of ctor() CFG without violating the storestore barrier. But, I do understand the complexity/trickiness of getting this right. You are correct. The StoreStore approximates the point at which the object is first published to other threads. All normal stores above the StoreStore can be issued in any order (as far as this fence is concerned) but must settle before the object is published. Presumably it is published shortly after the StoreStore, and the StoreStore could be sunk until that point, if we wanted to do this, or even eliminated if the object never gets published. Also, stores provably unrelated to (unreachable from) the published object could drop below the StoreStore. We don't attempt to make this distinction. None of these train of thought affects the basic assertion that (if fences are absent) normal stores can be reordered. If we wish to remove that StoreStore (for some reason) we would either need a more precise set of fences (or HB edges), or else we would have to hold back on aggressive store reordering. This is what makes me think we may discover a missing fence, once we start letting those little stores swarm around each other. What makes me more nervous about this is the clear fact that non-TSO platforms (TSO, Itanium) have to tweak their fences in various ad hoc ways to avoid breaking user code. See, for example, Parse::do_exits. If we make our thread-local orderings more non-TSO-ish, we might run into the same subtle issues that the PPC port wrestles with. By "subtle" I partly mean "relating to unstated user expectations even if not supported by the JMM", and I also mean "hard to detect, characterize, and fix". ? John -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu Jun 18 00:31:51 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 17 Jun 2015 20:31:51 -0400 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <3609D7B9-E4F0-4EA2-A677-8E3BF63AE8B3@oracle.com> Message-ID: Forgot to mention - it'd be nice if EA was a bit smarter in C2, e.g. flow sensitive like graal. Is the plan to leave it alone in C2 and wait for graal to mature? sent from my phone On Jun 17, 2015 8:28 PM, "Vitaly Davidovich" wrote: > So I'm not sure how many cases will arise where scheduling stores is > beneficial (on modern cpus) apart from removing redundant ones. The > compiler would need some seriously detailed machine model, I think, to > reason about this intelligently. Removing redundant ones (or moving loop > invariant ones out of loops, like Roland is trying here) seems more > tractable and beneficial? Are there cases beyond this where it would be > profitable? Perhaps scheduling writes to addresses likely to be on same > cacheline maybe ... 
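For reference, the kind of intermediate write the subject line (8080289) refers to can be sketched like this; the types and names are made up for the example.

  class StoreSinkSketch {
      static class Counter { int value; }

      static void fill(Counter sink, int n) {
          for (int i = 0; i < n; i++) {
              sink.value = i;   // no volatile write or fence between iterations,
          }                     // so only the last store has to survive the loop
      }
      // a legal transformation is roughly: if (n > 0) { sink.value = n - 1; }
  }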
> > As for removing StoreStore barriers, it seems like that's practically > feasible with java's semantics only when EA kicks in; I'm having a hard > time imagining how the JIT can trace unsafe/racy publication reliably and > with minimal overhead. Perhaps I'm not thinking hard enough though ... > > It's almost unfortunate that final fields were granted this right to be > published unsafely :) - would've been perhaps better if explicit fencing > was required for such specialized case. > > sent from my phone > On Jun 17, 2015 5:27 PM, "John Rose" wrote: > > > On Jun 17, 2015, at 1:23 PM, Vitaly Davidovich wrote: > > Nope, that's an oversimplified understanding. One place where the JMM >> will bite you is with publication of object state via final fields. Normal >> stores used to initialize a structure which is published via final-field >> semantics must be ordered to take place before the object is published. We >> don't (and perhaps can't) track object publication events, nor their >> relation to stores into newly-reachable subgraphs. Instead, we have fences >> that gently but firmly ensure that data (from normal stores, even to >> non-final fields and array elements!) is posted to memory before any store >> which could be a publishing store for that data. > > > Not sure what's oversimplified ? > > > I probably misread you, then. > > you're describing a JMM semantic for final fields, which I'd expect to be > modeled as barriers in the IR, just like volatile writes would be modeled > as barriers, preventing removal or reordering of them. I appreciate that > it can be troublesome to track this information, but that only means > compiler will have to play more conservative and there may be some > optimization opportunities lost. I'd think the pattern would look like: > > obj = allocZerodMemory(); // obj has final fields > obj.ctor(); // arbitrarily long/complex CFG > StoreStore > _someRef = obj; > > I'd expect redundant stores to be removed as part of ctor() CFG without > violating the storestore barrier. But, I do understand the > complexity/trickiness of getting this right. > > > You are correct. The StoreStore approximates the point at which the > object is first published to other threads. All normal stores above the > StoreStore can be issued in any order (as far as this fence is concerned) > but must settle before the object is published. Presumably it is published > shortly after the StoreStore, and the StoreStore could be sunk until that > point, if we wanted to do this, or even eliminated if the object never gets > published. Also, stores provably unrelated to (unreachable from) the > published object could drop below the StoreStore. We don't attempt to make > this distinction. None of these train of thought affects the basic > assertion that (if fences are absent) normal stores can be reordered. > > If we wish to remove that StoreStore (for some reason) we would either > need a more precise set of fences (or HB edges), or else we would have to > hold back on aggressive store reordering. This is what makes me think we > may discover a missing fence, once we start letting those little stores > swarm around each other. > > What makes me more nervous about this is the clear fact that non-TSO > platforms (TSO, Itanium) have to tweak their fences in various ad hoc ways > to avoid breaking user code. See, for example, Parse::do_exits. If we > make our thread-local orderings more non-TSO-ish, we might run into the > same subtle issues that the PPC port wrestles with. 
By "subtle" I partly > mean "relating to unstated user expectations even if not supported by the > JMM", and I also mean "hard to detect, characterize, and fix". > > ? John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Thu Jun 18 01:54:15 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 17 Jun 2015 18:54:15 -0700 Subject: [9] RFR(S) 8129094: assert(is_java_primitive(bt)) failed: only primitive type vectors Message-ID: <558224C7.9070504@oracle.com> http://cr.openjdk.java.net/~kvn/8129094/webrev/ https://bugs.openjdk.java.net/browse/JDK-8129094 If memory operation is not of primitive type it will be not vectorized and should be ignored regardless it (or its inputs) control. Tested with failing tests. Thanks, Vladimir From michael.c.berg at intel.com Thu Jun 18 02:05:25 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 18 Jun 2015 02:05:25 +0000 Subject: [9] RFR(S) 8129094: assert(is_java_primitive(bt)) failed: only primitive type vectors In-Reply-To: <558224C7.9070504@oracle.com> References: <558224C7.9070504@oracle.com> Message-ID: Looks ok. -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Wednesday, June 17, 2015 6:54 PM To: hotspot compiler Cc: Berg, Michael C Subject: [9] RFR(S) 8129094: assert(is_java_primitive(bt)) failed: only primitive type vectors http://cr.openjdk.java.net/~kvn/8129094/webrev/ https://bugs.openjdk.java.net/browse/JDK-8129094 If memory operation is not of primitive type it will be not vectorized and should be ignored regardless it (or its inputs) control. Tested with failing tests. Thanks, Vladimir From vladimir.kozlov at oracle.com Thu Jun 18 02:54:29 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 17 Jun 2015 19:54:29 -0700 Subject: [9] RFR(S) 8129094: assert(is_java_primitive(bt)) failed: only primitive type vectors In-Reply-To: References: <558224C7.9070504@oracle.com> Message-ID: <558232E5.8040300@oracle.com> Thank you, Michael Vladimir On 6/17/15 7:05 PM, Berg, Michael C wrote: > Looks ok. > > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, June 17, 2015 6:54 PM > To: hotspot compiler > Cc: Berg, Michael C > Subject: [9] RFR(S) 8129094: assert(is_java_primitive(bt)) failed: only primitive type vectors > > http://cr.openjdk.java.net/~kvn/8129094/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8129094 > > If memory operation is not of primitive type it will be not vectorized and should be ignored regardless it (or its inputs) control. > > Tested with failing tests. > > Thanks, > Vladimir > From roland.westrelin at oracle.com Thu Jun 18 07:25:02 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 18 Jun 2015 09:25:02 +0200 Subject: [9] RFR(S) 8129094: assert(is_java_primitive(bt)) failed: only primitive type vectors In-Reply-To: <558224C7.9070504@oracle.com> References: <558224C7.9070504@oracle.com> Message-ID: <398C2857-1D75-4117-85D4-88C67CF9E0B6@oracle.com> > http://cr.openjdk.java.net/~kvn/8129094/webrev/ Looks good to me. Roland. 
From aph at redhat.com Thu Jun 18 07:52:40 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 18 Jun 2015 08:52:40 +0100 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test In-Reply-To: <557255C1.6060101@oracle.com> References: <5559EFA0.5060800@redhat.com> <555D0460.4080805@oracle.com> <555DAD7C.2090805@redhat.com> <557255C1.6060101@oracle.com> Message-ID: <558278C8.6010104@redhat.com> On 06/06/15 03:06, Vladimir Kozlov wrote: > May be we should add an other method WHITE_BOX.wasMethodDeopted(method) > which checks if method was deoptimized at least once. That sounds right to me. Andrew. From david.holmes at oracle.com Thu Jun 18 10:55:24 2015 From: david.holmes at oracle.com (David Holmes) Date: Thu, 18 Jun 2015 20:55:24 +1000 Subject: RFR(S): 8067163: Several JT_HS tests fails due to ClassNotFoundException on compacts In-Reply-To: <55816D5F.9080900@oracle.com> References: <55814FA3.4000702@oracle.com> <55816BA5.9090809@oracle.com> <55816D5F.9080900@oracle.com> Message-ID: <5582A39C.3000704@oracle.com> On 17/06/2015 10:51 PM, Sergei Kovalev wrote: > Hi David, > > > On 17.06.15 15:44, David Holmes wrote: >> On 17/06/2015 8:44 PM, Sergei Kovalev wrote: >>> Hello Team, >>> >>> Please review the fix for >>> https://bugs.openjdk.java.net/browse/JDK-8067163 >>> Webrev: http://cr.openjdk.java.net/~skovalev/8067163/webrev.00/ >>> >>> Summary: several regression tests requires WitheBox object. The object >>> is available starting from compact3 profile. To fix the issue all tests >>> added to needs_compact3 group. >> >> Seems okay, but why has this only been discovered now? > It was discovered on 8u40 when we switched to group mechanism. >> And are the tests runs that hit this using the groups mechanism? > The link to test run with group usage is in jira comments. Also we > observed the issue with 8u60. I can provide a link in separate e-mail if > you'd like it. No - all good. Thanks, David >> >> >> Thanks, >> David >> >> >> >> >> > From nils.eliasson at oracle.com Thu Jun 18 11:37:24 2015 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Thu, 18 Jun 2015 13:37:24 +0200 Subject: RFR(L): 8081247 AVX 512 extended support code review request In-Reply-To: References: <556FA54E.8050001@oracle.com> Message-ID: <5582AD74.5060704@oracle.com> Hi Michael, The patch looks good. Thanks for contributing, Nils On 2015-06-05 06:46, Berg, Michael C wrote: > Vladimir please find the following webrev with the suggested changes, I have added small signature functions which look like the old versions in the assembler but manage the problem I need to handle, which is additional state for legacy only instructions. There is a new vm_version function which handles the cpuid checks with a conglomerate approach for the one scenario which had it. > The loop in the stub generator is now formed to alter the upper bound and execute in one path. 
> > http://cr.openjdk.java.net/~mcberg/8081247/webrev.03/ > > Regards, > Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, June 03, 2015 6:10 PM > To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' > Subject: Re: RFR(L): 8081247 AVX 512 extended support code review request > > Hi, Michael > > assembler_x86.cpp: > > I don't like that you replaced prefix method with few parameters with method which has a lot of them: > > - int encode = vex_prefix_0F38_and_encode_q(dst, src1, src2); > + int encode = vex_prefix_and_encode(dst->encoding(), src1->encoding(), > src2->encoding(), > + VEX_SIMD_NONE, VEX_OPCODE_0F_38, > true, AVX_128bit, > + true, false); > > Why you did that? > > > stubGenerator_x86_64.cpp: > > Can we set different loop limit based on UseAVX instead of having 2 loops. > > x86.ad: > > Instead of long condition expressions like next: > > UseAVX > 0 && !VM_Version::supports_avx512vl() && > !VM_Version::supports_avx512bw() > > May be have one VM_Version finction which does these checks. > > Thanks, > Vladimir > > On 6/2/15 9:38 PM, Berg, Michael C wrote: >> Hi Folks, >> >> I would like to contribute more AVX512 enabling to facilitate support >> for machines which utilize EVEX encoding. I need two reviewers to >> review this patch and comment as needed: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8081247 >> >> webrev: >> >> http://cr.openjdk.java.net/~mcberg/8081247/webrev.01/ >> >> This patch enables BMI code on EVEX targets, improves replication >> patterns to be more efficient on both EVEX enabled and legacy targets, >> adds more CPUID based rules for correct code generation on various >> EVEX enabled servers, extends more call save/restore functionality and >> extends the vector space further for SIMD operations. Please expedite >> this review as there is a near term need for the support. >> >> Also, as I am not yet a committer, this code needs a sponsor as well. >> >> Thanks, >> >> Michael >> From paul.sandoz at oracle.com Thu Jun 18 12:03:19 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Thu, 18 Jun 2015 14:03:19 +0200 Subject: [9] RFR (M): VM should constant fold Unsafe.get*() loads from final fields In-Reply-To: <5581A26C.6090303@oracle.com> References: <5581A26C.6090303@oracle.com> Message-ID: <172A2B5E-9050-4219-BD07-EB9FA2D671E8@oracle.com> Hi Vladimir, I like the test, you have almost hand rolled your own specializer :-) A minor point. Since you have created a ClassWriter with "ClassWriter.COMPUTE_MAXS | ClassWriter.COMPUTE_FRAMES" can you remove the "mv.visitMax(0, 0)" calls? I was a little confused by the code that checked the expected result against the actual result. I am guessing the white box methods return -1 if the value is not a constant and 1 if it is. (Perhaps that can be documented, if even those methods may eventually reside somewhere else.) Whereas, Generator.expected returns 0 or 1. 118 if (direct != unsafe || // difference between testDirect & testUnsafe 119 (unsafe != -1 && expected != unsafe)) // differs from expected, but ignore "unknown"(-1) result 120 { 121 throw new AssertionError(String.format("%s: e=%d; d=%d; u=%d", t.name(), expected, direct, unsafe)); 122 } I don't quite understand why "unknown"(-1) can be ignored. Can that be changed to the following if Generator.expected returned the same values as the WB methods? if (direct != unsafe || unsafe != expected) { ... } ? Paul. 
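For readers who do not have the ASM API in their head, the construction in question looks roughly like this; it is a standalone sketch, not the test's actual generator.

  import org.objectweb.asm.ClassWriter;
  import org.objectweb.asm.MethodVisitor;
  import org.objectweb.asm.Opcodes;

  class AsmSketch {
      static byte[] generate() {
          ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_MAXS | ClassWriter.COMPUTE_FRAMES);
          cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "Sample", null, "java/lang/Object", null);
          MethodVisitor mv = cw.visitMethod(Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC, "answer", "()I", null, null);
          mv.visitCode();
          mv.visitIntInsn(Opcodes.BIPUSH, 42);
          mv.visitInsn(Opcodes.IRETURN);
          mv.visitMaxs(0, 0);   // still required; the arguments are ignored when COMPUTE_MAXS/COMPUTE_FRAMES is set
          mv.visitEnd();
          cw.visitEnd();
          return cw.toByteArray();
      }
  }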
On Jun 17, 2015, at 6:38 PM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8078629/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8078629 > > Direct(getfield/getstatic) read operations are faster than unsafe reads from constant Java fields, since VM doesn't constant fold unsafe loads. Though VM tries hard to recover field metadata from its offset, it doesn't optimize unsafe ones even if it has all necessary info in its hands. > > The fix is to align the behavior and share relevant code between C2 parser and intrinsic expansion logic. > > For testing purposes, I extended whitebox API to check whether a value is a compile-time constant. The whitebox test enumerates all combinations of a field and ensures that the behavior is consistent between bytecode and unsafe reads. > > Testing: focused whitebox tests, hotspot/test/compiler, jdk/test/java/lang/invoke, octane (for performance measurements) > > Thanks! > > Best regards, > Vladimir Ivanov -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From forax at univ-mlv.fr Thu Jun 18 12:12:32 2015 From: forax at univ-mlv.fr (Remi Forax) Date: Thu, 18 Jun 2015 14:12:32 +0200 Subject: [9] RFR (M): VM should constant fold Unsafe.get*() loads from final fields In-Reply-To: <172A2B5E-9050-4219-BD07-EB9FA2D671E8@oracle.com> References: <5581A26C.6090303@oracle.com> <172A2B5E-9050-4219-BD07-EB9FA2D671E8@oracle.com> Message-ID: <5582B5B0.8010202@univ-mlv.fr> Hi Paul, On 06/18/2015 02:03 PM, Paul Sandoz wrote: > Hi Vladimir, > > I like the test, you have almost hand rolled your own specializer :-) > > A minor point. Since you have created a ClassWriter with "ClassWriter.COMPUTE_MAXS | ClassWriter.COMPUTE_FRAMES" can you remove the "mv.visitMax(0, 0)" calls? no, you can't. even if you ask ASM to compute the maxs, you still need to call visitMaxs() or kitten will die. R?mi > > > I was a little confused by the code that checked the expected result against the actual result. > > I am guessing the white box methods return -1 if the value is not a constant and 1 if it is. (Perhaps that can be documented, if even those methods may eventually reside somewhere else.) Whereas, Generator.expected returns 0 or 1. > > 118 if (direct != unsafe || // difference between testDirect & testUnsafe > 119 (unsafe != -1 && expected != unsafe)) // differs from expected, but ignore "unknown"(-1) result > 120 { > 121 throw new AssertionError(String.format("%s: e=%d; d=%d; u=%d", t.name(), expected, direct, unsafe)); > 122 } > > I don't quite understand why "unknown"(-1) can be ignored. > > Can that be changed to the following if Generator.expected returned the same values as the WB methods? > > if (direct != unsafe || unsafe != expected) { ... } > > ? > > Paul. > > On Jun 17, 2015, at 6:38 PM, Vladimir Ivanov wrote: > >> http://cr.openjdk.java.net/~vlivanov/8078629/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8078629 >> >> Direct(getfield/getstatic) read operations are faster than unsafe reads from constant Java fields, since VM doesn't constant fold unsafe loads. Though VM tries hard to recover field metadata from its offset, it doesn't optimize unsafe ones even if it has all necessary info in its hands. >> >> The fix is to align the behavior and share relevant code between C2 parser and intrinsic expansion logic. 
>> >> For testing purposes, I extended whitebox API to check whether a value is a compile-time constant. The whitebox test enumerates all combinations of a field and ensures that the behavior is consistent between bytecode and unsafe reads. >> >> Testing: focused whitebox tests, hotspot/test/compiler, jdk/test/java/lang/invoke, octane (for performance measurements) >> >> Thanks! >> >> Best regards, >> Vladimir Ivanov From paul.sandoz at oracle.com Thu Jun 18 13:51:22 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Thu, 18 Jun 2015 15:51:22 +0200 Subject: [9] RFR (M): VM should constant fold Unsafe.get*() loads from final fields In-Reply-To: <5582B5B0.8010202@univ-mlv.fr> References: <5581A26C.6090303@oracle.com> <172A2B5E-9050-4219-BD07-EB9FA2D671E8@oracle.com> <5582B5B0.8010202@univ-mlv.fr> Message-ID: <8B26E262-6DF0-4B94-9461-7AE66B7C6F4C@oracle.com> On Jun 18, 2015, at 2:12 PM, Remi Forax wrote: > Hi Paul, > > On 06/18/2015 02:03 PM, Paul Sandoz wrote: >> Hi Vladimir, >> >> I like the test, you have almost hand rolled your own specializer :-) >> >> A minor point. Since you have created a ClassWriter with "ClassWriter.COMPUTE_MAXS | ClassWriter.COMPUTE_FRAMES" can you remove the "mv.visitMax(0, 0)" calls? > > no, you can't. > even if you ask ASM to compute the maxs, you still need to call visitMaxs() or kitten will die. > Ah, i see now, it's the arguments to visitMax that are ignored and not the call itself. Thanks, Paul. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From vladimir.kozlov at oracle.com Thu Jun 18 14:04:50 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 18 Jun 2015 07:04:50 -0700 Subject: [9] RFR(S) 8129094: assert(is_java_primitive(bt)) failed: only primitive type vectors In-Reply-To: <398C2857-1D75-4117-85D4-88C67CF9E0B6@oracle.com> References: <558224C7.9070504@oracle.com> <398C2857-1D75-4117-85D4-88C67CF9E0B6@oracle.com> Message-ID: <5582D002.8050702@oracle.com> Thank you, Roland Vladimir On 6/18/15 12:25 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~kvn/8129094/webrev/ > > Looks good to me. > > Roland. > From anthony.scarpino at oracle.com Thu Jun 18 17:00:38 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Thu, 18 Jun 2015 10:00:38 -0700 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <558033C4.8040104@redhat.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> Message-ID: <5582F936.5020008@oracle.com> On 06/16/2015 07:33 AM, Andrew Haley wrote: > On 06/15/2015 05:58 PM, Andrew Haley wrote: >>>> 3. I fused squaring and multiplication into a single >>>>>> montgomeryMultiply method. ... >>>> >>>> I don't agree with fusing them together. I think there should two >>>> separate intrinsics. For one, SPARC has a montsqr and montmul >>>> instructions. Additionally if someone wants to call montgomerySquare, >>>> they should be able to call it directly with it's needed number of >>>> arguments and not pass 'a' twice to satisfy an internal if(). > >> OK, fair enough. I'll think a little more about the best way to do >> this. > > Done thusly. The only thing I had any doubt about was whether to use a > single flag for squaring and multiplication. This patch uses separate > flags. 
> > http://cr.openjdk.java.net/~aph/8046943-hs-2/ > http://cr.openjdk.java.net/~aph/8046943-jdk-2/ > > Andrew. > I'm happy with the jdk change.. thanks.. Question, on the hotspot side you said in a previous post this was C2-only. Was there a reason you don't have it for all? Personally I'd enable it for all unless there was a performance hit in a particular mode. Tony From aph at redhat.com Thu Jun 18 17:07:22 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 18 Jun 2015 18:07:22 +0100 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <5582F936.5020008@oracle.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> <5582F936.5020008@oracle.com> Message-ID: <5582FACA.4060103@redhat.com> On 06/18/2015 06:00 PM, Anthony Scarpino wrote: > Question, on the hotspot side you said in a previous post this was > C2-only. Was there a reason you don't have it for all? None. It's up for negotiation. What do you want? C1, interp? Andrew. From anthony.scarpino at oracle.com Thu Jun 18 17:20:10 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Thu, 18 Jun 2015 10:20:10 -0700 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <5582FACA.4060103@redhat.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> <5582F936.5020008@oracle.com> <5582FACA.4060103@redhat.com> Message-ID: <5582FDCA.8010507@oracle.com> On 06/18/2015 10:07 AM, Andrew Haley wrote: > On 06/18/2015 06:00 PM, Anthony Scarpino wrote: >> Question, on the hotspot side you said in a previous post this was >> C2-only. Was there a reason you don't have it for all? > > None. It's up for negotiation. What do you want? C1, interp? > > Andrew. > I'd defer to the hotspot guys on what's best. I'm just not aware of any purposeful limitations of AES, SHA, and GHASH intrinsics, none of these are in the c2_globals.hpp file, so I'm assuming that's what controlling the c2-only. Tony From vladimir.kozlov at oracle.com Thu Jun 18 19:28:08 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 18 Jun 2015 12:28:08 -0700 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <5582FDCA.8010507@oracle.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> <5582F936.5020008@oracle.com> <5582FACA.4060103@redhat.com> <5582FDCA.8010507@oracle.com> Message-ID: <55831BC8.9060001@oracle.com> Andrew, We have few new rules regarding intrinsics. You need to add private static java method which does range checks because their are not executed in intrinsic code - see squareToLen() implementation, for example. Note, we will rewrite multiplyToLen() too soon. Also method which will be intrinsified should be private static too and you can move allocation from it (like we did for squareToLen()) to avoid allocation in intrinsic code. Your Hotspot changes are hard to accept. We have to compile for solaris with Sun compilers which does not work with such changes: "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3525: Warning: parameter in inline asm statement unused: %3. "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3525: Warning: parameter in inline asm statement unused: %6. "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3701: Error: The function "__builtin_expect" must have a prototype. 
"hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3707: Error: The function "__builtin_alloca" must have a prototype. May be you can convert the code to stub and add new MacroAssembler functions which you can use in sharedRuntime_x86_64.cpp. Yes, it is a lot of handwriting but we need it to work on all OSs. Also on Solaris you can add asm code similar what we do in solaris_x86_64.il. It may allow you to rewrite to assembler stub. Or don't do this intrinsic on Solaris (only linux and Mac). Regards, Vladimir On 6/18/15 10:20 AM, Anthony Scarpino wrote: > On 06/18/2015 10:07 AM, Andrew Haley wrote: >> On 06/18/2015 06:00 PM, Anthony Scarpino wrote: >>> Question, on the hotspot side you said in a previous post this was >>> C2-only. Was there a reason you don't have it for all? >> >> None. It's up for negotiation. What do you want? C1, interp? >> >> Andrew. >> > > I'd defer to the hotspot guys on what's best. I'm just not aware of any > purposeful limitations of AES, SHA, and GHASH intrinsics, none of these > are in the c2_globals.hpp file, so I'm assuming that's what controlling > the c2-only. > > Tony > From vladimir.kozlov at oracle.com Fri Jun 19 00:10:36 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 18 Jun 2015 17:10:36 -0700 Subject: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord In-Reply-To: <39F83597C33E5F408096702907E6C450E4D6F7@ORSMSX104.amr.corp.intel.com> References: <39F83597C33E5F408096702907E6C450E4D6F7@ORSMSX104.amr.corp.intel.com> Message-ID: <55835DFC.70001@oracle.com> Thank you, Jan Fixes looks good but it would be nice if you replaced some tracing code with functions calls. In some place the execution code is hard to read because of big tracing code. For example, in SuperWord::memory_alignment() and in SWPointer methods. The one way to do that is to declare trace methods with empty body in product build, for example for SWPointer::scaled_iv_plus_offset() you may have new method declaration (not under #ifdef) in superword.hpp: class SWPointer VALUE_OBJ_CLASS_SPEC { void trace_1_scaled_iv_plus_offset(...) PRODUCT_RETURN; and in superword.cpp you will put the method under ifdef: #ifndef PRODUCT void trace_1_scaled_iv_plus_offset(...) { .... } #endif Then you can simply use it without ifdefs in code: bool SWPointer::scaled_iv_plus_offset(Node* n) { + trace_1_scaled_iv_plus_offset(...); + if (scaled_iv(n)) { Note, macro PRODUCT_RETURN is defined as: #ifdef PRODUCT #define PRODUCT_RETURN {} #else #define PRODUCT_RETURN /*next token must be ;*/ #endif Thanks, Vladimir On 6/8/15 9:15 AM, Civlin, Jan wrote: > Hi All, > > > We would like to contribute to Fixing bugs in detecting memory > alignments in SuperWord. > > The contribution Bug ID: 8085932. > > Please review this patch: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8085932 > > webrev: http://cr.openjdk.java.net/~kvn/8085932/webrev.00/ > > > *Description**: *Fixing bugs in detecting memory alignments in > SuperWord > > Fixing bugs in detecting memory alignments in SuperWord: > SWPointer::scaled_iv_plus_offset (fixing here a bug in detection of > "scale"), > SWPointer::offset_plus_k (fixing here a bug in detection of "invariant"), > > Add tracing output to the code that deal with memory alignment. The > following routines are traceable: > > SWPointer::scaled_iv_plus_offset > SWPointer::offset_plus_k > SWPointer::scaled_iv, > WPointer::SWPointer, > SuperWord::memory_alignment > > Tracing is done only for NOT_PRODUCT. 
Currently tracing is controlled by > VectorizeDebug: > > #ifndef PRODUCT > if (_phase->C->method() != NULL) { > _phase->C->method()->has_option_value("VectorizeDebug", > _vector_loop_debug); > } > #endif > > And VectorizeDebug may take any combination (bitwise OR) of the > following values: > bool is_trace_alignment() { return (_vector_loop_debug & 2) > 0; } > bool is_trace_mem_slice() { return (_vector_loop_debug & 4) > 0; } > bool is_trace_loop() { return (_vector_loop_debug & 8) > 0; } > bool is_trace_adjacent() { return (_vector_loop_debug & 16) > 0; } > From vladimir.kozlov at oracle.com Fri Jun 19 00:51:33 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 18 Jun 2015 17:51:33 -0700 Subject: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord In-Reply-To: <55835DFC.70001@oracle.com> References: <39F83597C33E5F408096702907E6C450E4D6F7@ORSMSX104.amr.corp.intel.com> <55835DFC.70001@oracle.com> Message-ID: <55836795.3050501@oracle.com> Jan, Here is why next code return false: if (_scale != 0) { return false; // already found a scale if (_invar != NULL) return false; // already have an invariant SWPointer() method tries to set _scale, _offset, _invar values. But, for example, simple array access address uses 2 AddP nodes and each of them has offsets but different offsets. Usually one have invariant offset and another - scaled index: AddP (base, base, iv*scale + offset) AddP (base, addp, invar) SWPointer() iterates over all AddP: for (int i = 0; i < 3; i++) { if (!scaled_iv_plus_offset(adr->in(AddPNode::Offset))) { assert(!valid(), "too complex"); return; } adr = adr->in(AddPNode::Address); if (base == adr || !adr->is_AddP()) { break; // stop looking at addp's } } And this code assumes only one of AddP can set those fields (_scale, _offset, _invar). If second AddP tries to set a field which is set by previous AddP it is considered complex address expression, for example: AddP (base, base, iv*scale + offset_con + invar1) AddP (base, addp, invar2) and such cases are skipped. Please, show your case for which you want to return 'true'. Thanks, Vladimir On 6/18/15 5:10 PM, Vladimir Kozlov wrote: > Thank you, Jan > > Fixes looks good but it would be nice if you replaced some tracing code > with functions calls. In some place the execution code is hard to read > because of big tracing code. For example, in > SuperWord::memory_alignment() and in SWPointer methods. > > The one way to do that is to declare trace methods with empty body in > product build, for example for SWPointer::scaled_iv_plus_offset() you > may have new method declaration (not under #ifdef) in superword.hpp: > > class SWPointer VALUE_OBJ_CLASS_SPEC { > > void trace_1_scaled_iv_plus_offset(...) PRODUCT_RETURN; > > and in superword.cpp you will put the method under ifdef: > > #ifndef PRODUCT > void trace_1_scaled_iv_plus_offset(...) { > .... > } > #endif > > Then you can simply use it without ifdefs in code: > > bool SWPointer::scaled_iv_plus_offset(Node* n) { > + trace_1_scaled_iv_plus_offset(...); > + > if (scaled_iv(n)) { > > Note, macro PRODUCT_RETURN is defined as: > > #ifdef PRODUCT > #define PRODUCT_RETURN {} > #else > #define PRODUCT_RETURN /*next token must be ;*/ > #endif > > Thanks, > Vladimir > > On 6/8/15 9:15 AM, Civlin, Jan wrote: >> Hi All, >> >> >> We would like to contribute to Fixing bugs in detecting memory >> alignments in SuperWord. >> >> The contribution Bug ID: 8085932. 
>> >> Please review this patch: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8085932 >> >> webrev: http://cr.openjdk.java.net/~kvn/8085932/webrev.00/ >> >> >> *Description**: *Fixing bugs in detecting memory alignments in >> SuperWord >> >> Fixing bugs in detecting memory alignments in SuperWord: >> SWPointer::scaled_iv_plus_offset (fixing here a bug in detection of >> "scale"), >> SWPointer::offset_plus_k (fixing here a bug in detection of "invariant"), >> >> Add tracing output to the code that deal with memory alignment. The >> following routines are traceable: >> >> SWPointer::scaled_iv_plus_offset >> SWPointer::offset_plus_k >> SWPointer::scaled_iv, >> WPointer::SWPointer, >> SuperWord::memory_alignment >> >> Tracing is done only for NOT_PRODUCT. Currently tracing is controlled by >> VectorizeDebug: >> >> #ifndef PRODUCT >> if (_phase->C->method() != NULL) { >> _phase->C->method()->has_option_value("VectorizeDebug", >> _vector_loop_debug); >> } >> #endif >> >> And VectorizeDebug may take any combination (bitwise OR) of the >> following values: >> bool is_trace_alignment() { return (_vector_loop_debug & 2) > 0; } >> bool is_trace_mem_slice() { return (_vector_loop_debug & 4) > 0; } >> bool is_trace_loop() { return (_vector_loop_debug & 8) > 0; } >> bool is_trace_adjacent() { return (_vector_loop_debug & 16) > 0; } >> From jan.civlin at intel.com Fri Jun 19 04:24:48 2015 From: jan.civlin at intel.com (Civlin, Jan) Date: Fri, 19 Jun 2015 04:24:48 +0000 Subject: FW: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord In-Reply-To: <39F83597C33E5F408096702907E6C450E555F6@ORSMSX104.amr.corp.intel.com> References: <39F83597C33E5F408096702907E6C450E4D6F7@ORSMSX104.amr.corp.intel.com> <55835DFC.70001@oracle.com> <55836795.3050501@oracle.com> <39F83597C33E5F408096702907E6C450E555F6@ORSMSX104.amr.corp.intel.com> Message-ID: <39F83597C33E5F408096702907E6C450E55629@ORSMSX104.amr.corp.intel.com> Vladimir. It is a good explanation, thank you. I clearly remember I was debugging this code (attached) and it was hitting lines with _invar != NULL, but now I changed something in java example source and the _invar is just always NULL in the debugger... I'll try to figure out how it was before and if I find the java code on which _invar was not NULL, I'll send it to you. Otherwise I will remove my change. 
Please notice that this java example is good to demonstrate that this change is important for the vectorization (though I probably should remove the comment): if (tmp._invar == NULL ) || _slp->do_vector_loop()) { //I do not know, why tmp._invar == NULL was here at first hand Here is the code where I used to see that JVM is not recognizing the invariants (but as I said I have already modified it and now _invar != NULL does not occurs) Here in method aYb the expressions i*cols and j*cols are actually invariants and set in the caller method multiply_transpose: static double aYb(double[] left, double[] right, int cols, int i, int j) { double sum = 0; for (int k = 0; k < cols; k++) { sum += left[k + i * cols] * right[k + j * cols]; } return sum; } When it was called from static void multiply_transpose(double[] result, double[] left, double[] right, int cols, double[] T) { assert (left.length == right.length); assert (left.length == cols * cols); assert (result.length == cols * cols); transpose(right, T, cols); //IntStream.range(0, cols * cols).parallel().forEach(id -> { // int i = id / cols; // int j = id % cols; // double sum = aYb(left, T, cols, i, j); // result[i * cols + j] = sum; //}); for (int id = 0; id < cols * cols; id++) { int i = id / cols; int j = id % cols; double sum = aYb(left, T, cols, i, j); result[i * cols + j] = sum; } } Thanks, Jan. -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, June 18, 2015 5:52 PM To: hotspot-compiler-dev at openjdk.java.net Cc: Civlin, Jan Subject: Re: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord Jan, Here is why next code return false: if (_scale != 0) { return false; // already found a scale if (_invar != NULL) return false; // already have an invariant SWPointer() method tries to set _scale, _offset, _invar values. But, for example, simple array access address uses 2 AddP nodes and each of them has offsets but different offsets. Usually one have invariant offset and another - scaled index: AddP (base, base, iv*scale + offset) AddP (base, addp, invar) SWPointer() iterates over all AddP: for (int i = 0; i < 3; i++) { if (!scaled_iv_plus_offset(adr->in(AddPNode::Offset))) { assert(!valid(), "too complex"); return; } adr = adr->in(AddPNode::Address); if (base == adr || !adr->is_AddP()) { break; // stop looking at addp's } } And this code assumes only one of AddP can set those fields (_scale, _offset, _invar). If second AddP tries to set a field which is set by previous AddP it is considered complex address expression, for example: AddP (base, base, iv*scale + offset_con + invar1) AddP (base, addp, invar2) and such cases are skipped. Please, show your case for which you want to return 'true'. Thanks, Vladimir On 6/18/15 5:10 PM, Vladimir Kozlov wrote: > Thank you, Jan > > Fixes looks good but it would be nice if you replaced some tracing > code with functions calls. In some place the execution code is hard to > read because of big tracing code. For example, in > SuperWord::memory_alignment() and in SWPointer methods. > > The one way to do that is to declare trace methods with empty body in > product build, for example for SWPointer::scaled_iv_plus_offset() you > may have new method declaration (not under #ifdef) in superword.hpp: > > class SWPointer VALUE_OBJ_CLASS_SPEC { > > void trace_1_scaled_iv_plus_offset(...) PRODUCT_RETURN; > > and in superword.cpp you will put the method under ifdef: > > #ifndef PRODUCT > void trace_1_scaled_iv_plus_offset(...) { > .... 
> } > #endif > > Then you can simply use it without ifdefs in code: > > bool SWPointer::scaled_iv_plus_offset(Node* n) { > + trace_1_scaled_iv_plus_offset(...); > + > if (scaled_iv(n)) { > > Note, macro PRODUCT_RETURN is defined as: > > #ifdef PRODUCT > #define PRODUCT_RETURN {} > #else > #define PRODUCT_RETURN /*next token must be ;*/ #endif > > Thanks, > Vladimir > > On 6/8/15 9:15 AM, Civlin, Jan wrote: >> Hi All, >> >> >> We would like to contribute to Fixing bugs in detecting memory >> alignments in SuperWord. >> >> The contribution Bug ID: 8085932. >> >> Please review this patch: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8085932 >> >> webrev: http://cr.openjdk.java.net/~kvn/8085932/webrev.00/ >> >> >> *Description**: *Fixing bugs in detecting memory alignments in >> SuperWord >> >> Fixing bugs in detecting memory alignments in SuperWord: >> SWPointer::scaled_iv_plus_offset (fixing here a bug in detection of >> "scale"), SWPointer::offset_plus_k (fixing here a bug in detection of >> "invariant"), >> >> Add tracing output to the code that deal with memory alignment. The >> following routines are traceable: >> >> SWPointer::scaled_iv_plus_offset >> SWPointer::offset_plus_k >> SWPointer::scaled_iv, >> WPointer::SWPointer, >> SuperWord::memory_alignment >> >> Tracing is done only for NOT_PRODUCT. Currently tracing is controlled >> by >> VectorizeDebug: >> >> #ifndef PRODUCT >> if (_phase->C->method() != NULL) { >> _phase->C->method()->has_option_value("VectorizeDebug", >> _vector_loop_debug); >> } >> #endif >> >> And VectorizeDebug may take any combination (bitwise OR) of the >> following values: >> bool is_trace_alignment() { return (_vector_loop_debug & 2) > 0; } >> bool is_trace_mem_slice() { return (_vector_loop_debug & 4) > 0; } >> bool is_trace_loop() { return (_vector_loop_debug & 8) > 0; } bool >> is_trace_adjacent() { return (_vector_loop_debug & 16) > 0; } >> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: MatrixMultiply.java Type: application/octet-stream Size: 5269 bytes Desc: MatrixMultiply.java URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT61578.hotspot_compiler Type: application/octet-stream Size: 720 bytes Desc: ATT61578.hotspot_compiler URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: MatrixMultiply.java Type: application/octet-stream Size: 5265 bytes Desc: MatrixMultiply.java URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT28162.hotspot_compiler Type: application/octet-stream Size: 723 bytes Desc: ATT28162.hotspot_compiler URL: From Alexander.Alexeev at caviumnetworks.com Fri Jun 19 06:23:25 2015 From: Alexander.Alexeev at caviumnetworks.com (Alexeev, Alexander) Date: Fri, 19 Jun 2015 06:23:25 +0000 Subject: pipeline class for sequence of instructions In-Reply-To: References: Message-ID: Hi, Sorry to bother. Can somebody answer? I still need this information to update AD for specific version of aarch64. Thanks, Alexander From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Alexeev, Alexander Sent: Wednesday, June 17, 2015 10:35 PM To: hotspot compiler Subject: pipeline class for sequence of instructions Hello Could somebody clarify how pipeline class is applied on sequence of instructions in architecture description file? 
For instance, class ialu_reg on countLeadingZerosL_bsr (snippet is below) or ialu_reg_mem on loadUB2L_immI (all from x86_64.ad). Stages for arguments read/writes, decoder and execution unit are specified only once. Is it then applied on every instructions that uses that pipeline class arguments or for the whole ins_encode body? BTW countLeadingZerosL_bsr isn't even a "single_instruction". Class pipe_cmplt looks more reasonable, but and_cmpLTMask and cadd_cmpLTMask still don't have 4 instructions how it is defined. Why 4 cycles are allocated to decode? Thanks, Alexander --------------------- // Integer ALU reg operation pipe_class ialu_reg(rRegI dst) %{ single_instruction; dst : S4(write); dst : S3(read); DECODE : S0; // any decoder ALU : S3; // any alu %} instruct countLeadingZerosL_bsr(rRegI dst, rRegL src, rFlagsReg cr) %{ predicate(!UseCountLeadingZerosInstruction); match(Set dst (CountLeadingZerosL src)); effect(KILL cr); format %{ "bsrq $dst, $src\t# count leading zeros (long)\n\t" "jnz skip\n\t" "movl $dst, -1\n" "skip:\n\t" "negl $dst\n\t" "addl $dst, 63" %} ins_encode %{ Register Rdst = $dst$$Register; Register Rsrc = $src$$Register; Label skip; __ bsrq(Rdst, Rsrc); __ jccb(Assembler::notZero, skip); __ movl(Rdst, -1); __ bind(skip); __ negl(Rdst); __ addl(Rdst, BitsPerLong - 1); %} ins_pipe(ialu_reg); %} -------------- next part -------------- An HTML attachment was scrubbed... URL: From aph at redhat.com Fri Jun 19 08:34:28 2015 From: aph at redhat.com (Andrew Haley) Date: Fri, 19 Jun 2015 09:34:28 +0100 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <55831BC8.9060001@oracle.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> <5582F936.5020008@oracle.com> <5582FACA.4060103@redhat.com> <5582FDCA.8010507@oracle.com> <55831BC8.9060001@oracle.com> Message-ID: <5583D414.5050502@redhat.com> On 18/06/15 20:28, Vladimir Kozlov wrote: > We have few new rules regarding intrinsics. > You need to add private static java method which does range checks > because their are not executed in intrinsic code - see squareToLen() > implementation, for example. Okay. > Note, we will rewrite multiplyToLen() too soon. Right. > Also method which will be intrinsified should be private static too and > you can move allocation from it (like we did for squareToLen()) to avoid > allocation in intrinsic code. I like that. It's surely a lot less complicated than what I've got right now. > Your Hotspot changes are hard to accept. We have to compile for solaris > with Sun compilers which does not work with such changes: > > "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3525: Warning: > parameter in inline asm statement unused: %3. > "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3525: Warning: > parameter in inline asm statement unused: %6. > "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3701: Error: The > function "__builtin_expect" must have a prototype. > "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3707: Error: The > function "__builtin_alloca" must have a prototype. Sure. I didn't realize that you were compiling that code with non-GCC. __builtin_alloca() can just be replaced by alloca() on on-GCC. > May be you can convert the code to stub and add new MacroAssembler > functions which you can use in sharedRuntime_x86_64.cpp. I don't think so. It's fast because it is truly inline assembler, inserted into the C code. 
If I had some way to access an x86 Solaris machine I'd test it there. But those warnings about unused parameters are just warnings. It may be that the code would work on a Sun compiler. Can you please try to replace __builtin_alloca() with alloca() and then tell me if the code works? > Yes, it is a lot of handwriting but we need it to work on all OSs. Sure, I get that. I knew there would be a few goes around with this, but it's worth the pain for the performance improvement. Andrew. From cnewland at chrisnewland.com Fri Jun 19 10:16:24 2015 From: cnewland at chrisnewland.com (Chris Newland) Date: Fri, 19 Jun 2015 11:16:24 +0100 Subject: Making PrintEscapeAnalysis a diagnostic option on product VM? Message-ID: Hi, hope this is the correct list (perhaps serviceability?) I'm experimenting with some HotSpot changes that log escape analysis decisions so that I can visualise eliminated allocations at the source and bytecode levels in JITWatch[1]. My plan was to build a companion VM for JITWatch based on the product VM that would allow users to inspect some of the deeper workings such as EA and DCE that are not present in the LogCompilation output. I mentioned this to some performance guys at Devoxx and they didn't like the custom VM idea and suggested I put in a request to consider making -XX:+PrintEscapeAnalysis available under -XX:+UnlockDiagnosticVMOptions on the product VM (it's currently a notproduct option). If this is something you would consider than could I also request consideration of -XX:+PrintEliminateAllocations. All I would need is the class, method, and bci of each NoEscape detected. Kind regards, Chris [1] https://github.com/AdoptOpenJDK/jitwatch From vladimir.x.ivanov at oracle.com Fri Jun 19 11:03:10 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 19 Jun 2015 14:03:10 +0300 Subject: Making PrintEscapeAnalysis a diagnostic option on product VM? In-Reply-To: References: Message-ID: <5583F6EE.7070901@oracle.com> Chris, I'd suggest to look into enhancing LogCompilation output instead of parsing VM output. It doesn't require any flag changes and fits nicely into existing LogCompilation functionality, so we can integrate it into the product, relieving you and JITWatch users from building a companion VM. Best regards, Vladimir Ivanov On 6/19/15 1:16 PM, Chris Newland wrote: > Hi, hope this is the correct list (perhaps serviceability?) > > I'm experimenting with some HotSpot changes that log escape analysis > decisions so that I can visualise eliminated allocations at the source and > bytecode levels in JITWatch[1]. > > My plan was to build a companion VM for JITWatch based on the product VM > that would allow users to inspect some of the deeper workings such as EA > and DCE that are not present in the LogCompilation output. > > I mentioned this to some performance guys at Devoxx and they didn't like > the custom VM idea and suggested I put in a request to consider making > -XX:+PrintEscapeAnalysis available under -XX:+UnlockDiagnosticVMOptions on > the product VM (it's currently a notproduct option). > > If this is something you would consider than could I also request > consideration of -XX:+PrintEliminateAllocations. > > All I would need is the class, method, and bci of each NoEscape detected. 
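The kind of NoEscape allocation being talked about is easy to sketch. With escape analysis enabled (-XX:+DoEscapeAnalysis, the default in C2), the allocation below can be scalar-replaced, and that elimination is exactly what the requested output would report. The example code is mine, not from JITWatch.

  class EscapeSketch {
      static final class Point {
          final int x, y;
          Point(int x, int y) { this.x = x; this.y = y; }
      }

      static int distSq(int dx, int dy) {
          Point p = new Point(dx, dy);   // never escapes this method: NoEscape,
          return p.x * p.x + p.y * p.y;  // candidate for scalar replacement
      }
  }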
> > Kind regards, > > Chris > > [1] https://github.com/AdoptOpenJDK/jitwatch > From vitalyd at gmail.com Fri Jun 19 11:09:29 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 19 Jun 2015 07:09:29 -0400 Subject: Making PrintEscapeAnalysis a diagnostic option on product VM? In-Reply-To: <5583F6EE.7070901@oracle.com> References: <5583F6EE.7070901@oracle.com> Message-ID: I think I had made a similar request a couple of years ago or so on this list. Personally, I'd find output to tty useful, akin to PrintInlining. LogCompilation is useful for tools, but basic diagnostic output for human consumption would be great. Also, now that superword is seeing some love, it'd be nice to have diagnostic output for vectorization (i.e. loop xyz vectorized, not vectorized due to ..., etc). Thanks sent from my phone On Jun 19, 2015 7:03 AM, "Vladimir Ivanov" wrote: > Chris, > > I'd suggest to look into enhancing LogCompilation output instead of > parsing VM output. It doesn't require any flag changes and fits nicely into > existing LogCompilation functionality, so we can integrate it into the > product, relieving you and JITWatch users from building a companion VM. > > Best regards, > Vladimir Ivanov > > On 6/19/15 1:16 PM, Chris Newland wrote: > >> Hi, hope this is the correct list (perhaps serviceability?) >> >> I'm experimenting with some HotSpot changes that log escape analysis >> decisions so that I can visualise eliminated allocations at the source and >> bytecode levels in JITWatch[1]. >> >> My plan was to build a companion VM for JITWatch based on the product VM >> that would allow users to inspect some of the deeper workings such as EA >> and DCE that are not present in the LogCompilation output. >> >> I mentioned this to some performance guys at Devoxx and they didn't like >> the custom VM idea and suggested I put in a request to consider making >> -XX:+PrintEscapeAnalysis available under -XX:+UnlockDiagnosticVMOptions on >> the product VM (it's currently a notproduct option). >> >> If this is something you would consider than could I also request >> consideration of -XX:+PrintEliminateAllocations. >> >> All I would need is the class, method, and bci of each NoEscape detected. >> >> Kind regards, >> >> Chris >> >> [1] https://github.com/AdoptOpenJDK/jitwatch >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From cnewland at chrisnewland.com Fri Jun 19 11:57:07 2015 From: cnewland at chrisnewland.com (Chris Newland) Date: Fri, 19 Jun 2015 12:57:07 +0100 Subject: Making PrintEscapeAnalysis a diagnostic option on product VM? In-Reply-To: <5583F6EE.7070901@oracle.com> References: <5583F6EE.7070901@oracle.com> Message-ID: Hi Vladimir, What do you think the next step is? I'm not a committer but I'd be happy to submit a patch/webrev that outputs LogCompilation XML for the kind of EA info I think would be useful. I've just seen Vitaly's post and I agree a tty 1-liner for each elimination would also be nice. Thanks, Chris On Fri, June 19, 2015 12:03, Vladimir Ivanov wrote: > Chris, > > > I'd suggest to look into enhancing LogCompilation output instead of > parsing VM output. It doesn't require any flag changes and fits nicely into > existing LogCompilation functionality, so we can integrate it into the > product, relieving you and JITWatch users from building a companion VM. > > Best regards, > Vladimir Ivanov > > > On 6/19/15 1:16 PM, Chris Newland wrote: > >> Hi, hope this is the correct list (perhaps serviceability?) 
>> >> >> I'm experimenting with some HotSpot changes that log escape analysis >> decisions so that I can visualise eliminated allocations at the source >> and bytecode levels in JITWatch[1]. >> >> My plan was to build a companion VM for JITWatch based on the product >> VM >> that would allow users to inspect some of the deeper workings such as EA >> and DCE that are not present in the LogCompilation output. >> >> I mentioned this to some performance guys at Devoxx and they didn't >> like the custom VM idea and suggested I put in a request to consider >> making -XX:+PrintEscapeAnalysis available under >> -XX:+UnlockDiagnosticVMOptions on >> the product VM (it's currently a notproduct option). >> >> If this is something you would consider than could I also request >> consideration of -XX:+PrintEliminateAllocations. >> >> All I would need is the class, method, and bci of each NoEscape >> detected. >> >> Kind regards, >> >> >> Chris >> >> >> [1] https://github.com/AdoptOpenJDK/jitwatch >> >> > From vladimir.x.ivanov at oracle.com Fri Jun 19 12:10:08 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 19 Jun 2015 15:10:08 +0300 Subject: Making PrintEscapeAnalysis a diagnostic option on product VM? In-Reply-To: References: <5583F6EE.7070901@oracle.com> Message-ID: <558406A0.6060301@oracle.com> > What do you think the next step is? > > I'm not a committer but I'd be happy to submit a patch/webrev that outputs > LogCompilation XML for the kind of EA info I think would be useful. Go for it. If you are a Contributor (signed OCA), we'll review and accept your patch with gratitude. Keep in mind, that when you touch LogCompilation output format, you should update logc tool (src/share/tools/LogCompilation/ [1]) as well. > I've just seen Vitaly's post and I agree a tty 1-liner for each > elimination would also be nice. Feel free to enhance -XX:+PrintEscapeAnalysis output as well, if you find it useful. Best regards, Vladimir Ivanov [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/tip/src/share/tools/LogCompilation > > Thanks, > > Chris > > On Fri, June 19, 2015 12:03, Vladimir Ivanov wrote: >> Chris, >> >> >> I'd suggest to look into enhancing LogCompilation output instead of >> parsing VM output. It doesn't require any flag changes and fits nicely into >> existing LogCompilation functionality, so we can integrate it into the >> product, relieving you and JITWatch users from building a companion VM. >> >> Best regards, >> Vladimir Ivanov >> >> >> On 6/19/15 1:16 PM, Chris Newland wrote: >> >>> Hi, hope this is the correct list (perhaps serviceability?) >>> >>> >>> I'm experimenting with some HotSpot changes that log escape analysis >>> decisions so that I can visualise eliminated allocations at the source >>> and bytecode levels in JITWatch[1]. >>> >>> My plan was to build a companion VM for JITWatch based on the product >>> VM >>> that would allow users to inspect some of the deeper workings such as EA >>> and DCE that are not present in the LogCompilation output. >>> >>> I mentioned this to some performance guys at Devoxx and they didn't >>> like the custom VM idea and suggested I put in a request to consider >>> making -XX:+PrintEscapeAnalysis available under >>> -XX:+UnlockDiagnosticVMOptions on >>> the product VM (it's currently a notproduct option). >>> >>> If this is something you would consider than could I also request >>> consideration of -XX:+PrintEliminateAllocations. >>> >>> All I would need is the class, method, and bci of each NoEscape >>> detected. 
>>> >>> Kind regards, >>> >>> >>> Chris >>> >>> >>> [1] https://github.com/AdoptOpenJDK/jitwatch >>> >>> >> > > From vladimir.kozlov at oracle.com Fri Jun 19 15:09:58 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 19 Jun 2015 08:09:58 -0700 Subject: Making PrintEscapeAnalysis a diagnostic option on product VM? In-Reply-To: <558406A0.6060301@oracle.com> References: <5583F6EE.7070901@oracle.com> <558406A0.6060301@oracle.com> Message-ID: <558430C6.1090102@oracle.com> Agree. Vladimir K On 6/19/15 5:10 AM, Vladimir Ivanov wrote: >> What do you think the next step is? >> >> I'm not a committer but I'd be happy to submit a patch/webrev that outputs >> LogCompilation XML for the kind of EA info I think would be useful. > Go for it. If you are a Contributor (signed OCA), we'll review and accept your patch with gratitude. Keep in mind, that > when you touch LogCompilation output format, you should update logc tool (src/share/tools/LogCompilation/ [1]) as well. > >> I've just seen Vitaly's post and I agree a tty 1-liner for each >> elimination would also be nice. > Feel free to enhance -XX:+PrintEscapeAnalysis output as well, if you find it useful. > > Best regards, > Vladimir Ivanov > > [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/tip/src/share/tools/LogCompilation >> >> Thanks, >> >> Chris >> >> On Fri, June 19, 2015 12:03, Vladimir Ivanov wrote: >>> Chris, >>> >>> >>> I'd suggest to look into enhancing LogCompilation output instead of >>> parsing VM output. It doesn't require any flag changes and fits nicely into >>> existing LogCompilation functionality, so we can integrate it into the >>> product, relieving you and JITWatch users from building a companion VM. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> >>> On 6/19/15 1:16 PM, Chris Newland wrote: >>> >>>> Hi, hope this is the correct list (perhaps serviceability?) >>>> >>>> >>>> I'm experimenting with some HotSpot changes that log escape analysis >>>> decisions so that I can visualise eliminated allocations at the source >>>> and bytecode levels in JITWatch[1]. >>>> >>>> My plan was to build a companion VM for JITWatch based on the product >>>> VM >>>> that would allow users to inspect some of the deeper workings such as EA >>>> and DCE that are not present in the LogCompilation output. >>>> >>>> I mentioned this to some performance guys at Devoxx and they didn't >>>> like the custom VM idea and suggested I put in a request to consider >>>> making -XX:+PrintEscapeAnalysis available under >>>> -XX:+UnlockDiagnosticVMOptions on >>>> the product VM (it's currently a notproduct option). >>>> >>>> If this is something you would consider than could I also request >>>> consideration of -XX:+PrintEliminateAllocations. >>>> >>>> All I would need is the class, method, and bci of each NoEscape >>>> detected. 
>>>> >>>> Kind regards, >>>> >>>> >>>> Chris >>>> >>>> >>>> [1] https://github.com/AdoptOpenJDK/jitwatch >>>> >>>> >>> >> >> From vladimir.kozlov at oracle.com Fri Jun 19 18:58:10 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 19 Jun 2015 11:58:10 -0700 Subject: RFR(L): 8081247 AVX 512 extended support code review request In-Reply-To: References: <556FA54E.8050001@oracle.com> Message-ID: <55846642.7090702@oracle.com> JPRT testing did not pass - a lot of failures (run all hotspot/test/* jtreg tests): Terminal Rampup began Fri Jun 19 20:38:51 CEST 2015 for 0.5 minutes o467 ReplicateI === _ o75 [[o468 ]] #vectorx[4]:{int} --N: o467 ReplicateI === _ o75 [[o468 ]] #vectorx[4]:{int} --N: o75 ConI === o0 [[o465 o469 o322 o220 o223 o467 o323 o292 o293 o294 o321 ]] #int:1 IMMI 10 IMMI IMMI1 0 IMMI1 IMMI2 0 IMMI2 IMMI8 5 IMMI8 IMMI16 10 IMMI16 IMMU31 0 IMMU31 RREGI 100 loadConI RAX_REGI 100 loadConI RBX_REGI 100 loadConI RCX_REGI 100 loadConI RDX_REGI 100 loadConI RDI_REGI 100 loadConI NO_RCX_REGI 100 loadConI NO_RAX_RDX_REGI 100 loadConI STACKSLOTI 200 storeSSI # To suppress the following error report, specify this argument # after -XX: or in .hotspotrc: SuppressErrorAt=/matcher.cpp:1602 # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (hotspot/src/share/vm/opto/matcher.cpp:1602), pid=4714, tid=0x000000004090e950 # assert(false) failed: bad AD file # And # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/opt/jprt/T/P1/181115.vkozlov/s/hotspot/src/cpu/x86/vm/assembler_x86.cpp:3060), pid=6301, tid=0x000000004226e950 # assert((UseAVX > 0)) failed: SSE mode requires address alignment 16 bytes # V [libjvm.so+0x110ae41] VMError::report_and_die()+0x151 V [libjvm.so+0x7f65eb] report_vm_error(char const*, int, char const*, char const*)+0x7b V [libjvm.so+0x4afaa4] Assembler::pshufd(XMMRegisterImpl*, Address, int)+0x234 V [libjvm.so+0x2a003e] Repl2D_memNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x25e V [libjvm.so+0x765923] Compile::scratch_emit_size(Node const*)+0x463 V [libjvm.so+0xe72794] Compile::shorten_branches(unsigned int*, int&, int&, int&)+0x204 V [libjvm.so+0xe73897] Compile::init_buffer(unsigned int*)+0x2d7 V [libjvm.so+0xe7efb7] Compile::Output()+0x8c7 Vladimir On 6/4/15 9:46 PM, Berg, Michael C wrote: > Vladimir please find the following webrev with the suggested changes, I have added small signature functions which look like the old versions in the assembler but manage the problem I need to handle, which is additional state for legacy only instructions. There is a new vm_version function which handles the cpuid checks with a conglomerate approach for the one scenario which had it. > The loop in the stub generator is now formed to alter the upper bound and execute in one path. 
> > http://cr.openjdk.java.net/~mcberg/8081247/webrev.03/ > > Regards, > Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, June 03, 2015 6:10 PM > To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' > Subject: Re: RFR(L): 8081247 AVX 512 extended support code review request > > Hi, Michael > > assembler_x86.cpp: > > I don't like that you replaced prefix method with few parameters with method which has a lot of them: > > - int encode = vex_prefix_0F38_and_encode_q(dst, src1, src2); > + int encode = vex_prefix_and_encode(dst->encoding(), src1->encoding(), > src2->encoding(), > + VEX_SIMD_NONE, VEX_OPCODE_0F_38, > true, AVX_128bit, > + true, false); > > Why you did that? > > > stubGenerator_x86_64.cpp: > > Can we set different loop limit based on UseAVX instead of having 2 loops. > > x86.ad: > > Instead of long condition expressions like next: > > UseAVX > 0 && !VM_Version::supports_avx512vl() && > !VM_Version::supports_avx512bw() > > May be have one VM_Version finction which does these checks. > > Thanks, > Vladimir > > On 6/2/15 9:38 PM, Berg, Michael C wrote: >> Hi Folks, >> >> I would like to contribute more AVX512 enabling to facilitate support >> for machines which utilize EVEX encoding. I need two reviewers to >> review this patch and comment as needed: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8081247 >> >> webrev: >> >> http://cr.openjdk.java.net/~mcberg/8081247/webrev.01/ >> >> This patch enables BMI code on EVEX targets, improves replication >> patterns to be more efficient on both EVEX enabled and legacy targets, >> adds more CPUID based rules for correct code generation on various >> EVEX enabled servers, extends more call save/restore functionality and >> extends the vector space further for SIMD operations. Please expedite >> this review as there is a near term need for the support. >> >> Also, as I am not yet a committer, this code needs a sponsor as well. >> >> Thanks, >> >> Michael >> From vladimir.kozlov at oracle.com Fri Jun 19 22:55:05 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 19 Jun 2015 15:55:05 -0700 Subject: [9] RFR(S) 8080157: assert(allocates2(pc)) failed: not in CodeBuffer memory Message-ID: <55849DC9.8080804@oracle.com> http://cr.openjdk.java.net/~kvn/8080157/webrev/ https://bugs.openjdk.java.net/browse/JDK-8080157 When stubs are generated their code is put into one CodeBuffer (in CodeCache). More stubs we have the bigger that buffer should be. Its size is determined by code_size2 (and code_size1 for an other set of stubs). The latest GHASH intrinsic added code which does not fit into previous size any more so we need to increase it. It failed only on windows because on win64 we have to save some used XMM registers (save-on-entry) so the code is bigger than on other x86 systems. I also added new asserts to have a meaningful message when there are no space left in this code buffer. Tested in JPRT with new asserts before and after size change. Thanks, Vladimir From igor.veresov at oracle.com Sat Jun 20 00:37:52 2015 From: igor.veresov at oracle.com (Igor Veresov) Date: Fri, 19 Jun 2015 17:37:52 -0700 Subject: [9] RFR(S) 8080157: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: <55849DC9.8080804@oracle.com> References: <55849DC9.8080804@oracle.com> Message-ID: Good. 
igor > On Jun 19, 2015, at 3:55 PM, Vladimir Kozlov wrote: > > http://cr.openjdk.java.net/~kvn/8080157/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8080157 > > When stubs are generated their code is put into one CodeBuffer (in CodeCache). More stubs we have the bigger that buffer should be. > Its size is determined by code_size2 (and code_size1 for an other set of stubs). The latest GHASH intrinsic added code which does not fit into previous size any more so we need to increase it. > > It failed only on windows because on win64 we have to save some used XMM registers (save-on-entry) so the code is bigger than on other x86 systems. > > I also added new asserts to have a meaningful message when there are no space left in this code buffer. > > Tested in JPRT with new asserts before and after size change. > > Thanks, > Vladimir From vladimir.kozlov at oracle.com Sat Jun 20 00:39:59 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 19 Jun 2015 17:39:59 -0700 Subject: [9] RFR(S) 8080157: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: <55849DC9.8080804@oracle.com> Message-ID: <5584B65F.9080902@oracle.com> Thank you, Igor Vladimir On 6/19/15 5:37 PM, Igor Veresov wrote: > Good. > > igor > >> On Jun 19, 2015, at 3:55 PM, Vladimir Kozlov wrote: >> >> http://cr.openjdk.java.net/~kvn/8080157/webrev/ >> https://bugs.openjdk.java.net/browse/JDK-8080157 >> >> When stubs are generated their code is put into one CodeBuffer (in CodeCache). More stubs we have the bigger that buffer should be. >> Its size is determined by code_size2 (and code_size1 for an other set of stubs). The latest GHASH intrinsic added code which does not fit into previous size any more so we need to increase it. >> >> It failed only on windows because on win64 we have to save some used XMM registers (save-on-entry) so the code is bigger than on other x86 systems. >> >> I also added new asserts to have a meaningful message when there are no space left in this code buffer. >> >> Tested in JPRT with new asserts before and after size change. >> >> Thanks, >> Vladimir From michael.c.berg at intel.com Sat Jun 20 03:07:57 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Sat, 20 Jun 2015 03:07:57 +0000 Subject: [9] RFR(S) 8080157: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: <55849DC9.8080804@oracle.com> References: <55849DC9.8080804@oracle.com> Message-ID: Looks good. -Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Kozlov Sent: Friday, June 19, 2015 3:55 PM To: hotspot compiler Cc: hotspot-runtime-dev at openjdk.java.net Subject: [9] RFR(S) 8080157: assert(allocates2(pc)) failed: not in CodeBuffer memory http://cr.openjdk.java.net/~kvn/8080157/webrev/ https://bugs.openjdk.java.net/browse/JDK-8080157 When stubs are generated their code is put into one CodeBuffer (in CodeCache). More stubs we have the bigger that buffer should be. Its size is determined by code_size2 (and code_size1 for an other set of stubs). The latest GHASH intrinsic added code which does not fit into previous size any more so we need to increase it. It failed only on windows because on win64 we have to save some used XMM registers (save-on-entry) so the code is bigger than on other x86 systems. I also added new asserts to have a meaningful message when there are no space left in this code buffer. Tested in JPRT with new asserts before and after size change. 
Thanks, Vladimir From vladimir.kozlov at oracle.com Sat Jun 20 04:13:22 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 19 Jun 2015 21:13:22 -0700 Subject: [9] RFR(S) 8080157: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: <55849DC9.8080804@oracle.com> Message-ID: <5584E862.8020707@oracle.com> Thank you, Michael Vladimir On 6/19/15 8:07 PM, Berg, Michael C wrote: > Looks good. > > -Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Kozlov > Sent: Friday, June 19, 2015 3:55 PM > To: hotspot compiler > Cc: hotspot-runtime-dev at openjdk.java.net > Subject: [9] RFR(S) 8080157: assert(allocates2(pc)) failed: not in CodeBuffer memory > > http://cr.openjdk.java.net/~kvn/8080157/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8080157 > > When stubs are generated their code is put into one CodeBuffer (in CodeCache). More stubs we have the bigger that buffer should be. > Its size is determined by code_size2 (and code_size1 for an other set of stubs). The latest GHASH intrinsic added code which does not fit into previous size any more so we need to increase it. > > It failed only on windows because on win64 we have to save some used XMM registers (save-on-entry) so the code is bigger than on other x86 systems. > > I also added new asserts to have a meaningful message when there are no space left in this code buffer. > > Tested in JPRT with new asserts before and after size change. > > Thanks, > Vladimir From michael.c.berg at intel.com Sat Jun 20 09:02:35 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Sat, 20 Jun 2015 09:02:35 +0000 Subject: RFR(L): 8081247 AVX 512 extended support code review request In-Reply-To: <55846642.7090702@oracle.com> References: <556FA54E.8050001@oracle.com> <55846642.7090702@oracle.com> Message-ID: Vladimir, please see the latest upload, revsion4, which has my changes tested on SSE,AVX,EVEX. 
Thanks, -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Friday, June 19, 2015 11:58 AM To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' Subject: Re: RFR(L): 8081247 AVX 512 extended support code review request JPRT testing did not pass - a lot of failures (run all hotspot/test/* jtreg tests): Terminal Rampup began Fri Jun 19 20:38:51 CEST 2015 for 0.5 minutes o467 ReplicateI === _ o75 [[o468 ]] #vectorx[4]:{int} --N: o467 ReplicateI === _ o75 [[o468 ]] #vectorx[4]:{int} --N: o75 ConI === o0 [[o465 o469 o322 o220 o223 o467 o323 o292 o293 o294 o321 ]] #int:1 IMMI 10 IMMI IMMI1 0 IMMI1 IMMI2 0 IMMI2 IMMI8 5 IMMI8 IMMI16 10 IMMI16 IMMU31 0 IMMU31 RREGI 100 loadConI RAX_REGI 100 loadConI RBX_REGI 100 loadConI RCX_REGI 100 loadConI RDX_REGI 100 loadConI RDI_REGI 100 loadConI NO_RCX_REGI 100 loadConI NO_RAX_RDX_REGI 100 loadConI STACKSLOTI 200 storeSSI # To suppress the following error report, specify this argument # after -XX: or in .hotspotrc: SuppressErrorAt=/matcher.cpp:1602 # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (hotspot/src/share/vm/opto/matcher.cpp:1602), pid=4714, tid=0x000000004090e950 # assert(false) failed: bad AD file # And # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/opt/jprt/T/P1/181115.vkozlov/s/hotspot/src/cpu/x86/vm/assembler_x86.cpp:3060), pid=6301, tid=0x000000004226e950 # assert((UseAVX > 0)) failed: SSE mode requires address alignment 16 bytes # V [libjvm.so+0x110ae41] VMError::report_and_die()+0x151 V [libjvm.so+0x7f65eb] report_vm_error(char const*, int, char const*, char const*)+0x7b V [libjvm.so+0x4afaa4] Assembler::pshufd(XMMRegisterImpl*, Address, int)+0x234 V [libjvm.so+0x2a003e] Repl2D_memNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x25e V [libjvm.so+0x765923] Compile::scratch_emit_size(Node const*)+0x463 V [libjvm.so+0xe72794] Compile::shorten_branches(unsigned int*, int&, int&, int&)+0x204 V [libjvm.so+0xe73897] Compile::init_buffer(unsigned int*)+0x2d7 V [libjvm.so+0xe7efb7] Compile::Output()+0x8c7 Vladimir On 6/4/15 9:46 PM, Berg, Michael C wrote: > Vladimir please find the following webrev with the suggested changes, I have added small signature functions which look like the old versions in the assembler but manage the problem I need to handle, which is additional state for legacy only instructions. There is a new vm_version function which handles the cpuid checks with a conglomerate approach for the one scenario which had it. > The loop in the stub generator is now formed to alter the upper bound and execute in one path. > > http://cr.openjdk.java.net/~mcberg/8081247/webrev.03/ > > Regards, > Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, June 03, 2015 6:10 PM > To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' > Subject: Re: RFR(L): 8081247 AVX 512 extended support code review > request > > Hi, Michael > > assembler_x86.cpp: > > I don't like that you replaced prefix method with few parameters with method which has a lot of them: > > - int encode = vex_prefix_0F38_and_encode_q(dst, src1, src2); > + int encode = vex_prefix_and_encode(dst->encoding(), > + src1->encoding(), > src2->encoding(), > + VEX_SIMD_NONE, VEX_OPCODE_0F_38, > true, AVX_128bit, > + true, false); > > Why you did that? > > > stubGenerator_x86_64.cpp: > > Can we set different loop limit based on UseAVX instead of having 2 loops. 
> > x86.ad: > > Instead of long condition expressions like next: > > UseAVX > 0 && !VM_Version::supports_avx512vl() && > !VM_Version::supports_avx512bw() > > May be have one VM_Version finction which does these checks. > > Thanks, > Vladimir > > On 6/2/15 9:38 PM, Berg, Michael C wrote: >> Hi Folks, >> >> I would like to contribute more AVX512 enabling to facilitate support >> for machines which utilize EVEX encoding. I need two reviewers to >> review this patch and comment as needed: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8081247 >> >> webrev: >> >> http://cr.openjdk.java.net/~mcberg/8081247/webrev.01/ >> >> This patch enables BMI code on EVEX targets, improves replication >> patterns to be more efficient on both EVEX enabled and legacy >> targets, adds more CPUID based rules for correct code generation on >> various EVEX enabled servers, extends more call save/restore >> functionality and extends the vector space further for SIMD >> operations. Please expedite this review as there is a near term need for the support. >> >> Also, as I am not yet a committer, this code needs a sponsor as well. >> >> Thanks, >> >> Michael >> From michael.c.berg at intel.com Sat Jun 20 09:06:21 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Sat, 20 Jun 2015 09:06:21 +0000 Subject: RFR(L): 8081247 AVX 512 extended support code review request In-Reply-To: References: <556FA54E.8050001@oracle.com> <55846642.7090702@oracle.com> Message-ID: With this link: http://cr.openjdk.java.net/~mcberg/8081247/webrev.04/ Thanks, Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Berg, Michael C Sent: Saturday, June 20, 2015 2:03 AM To: Vladimir Kozlov; 'hotspot-compiler-dev at openjdk.java.net' Subject: RE: RFR(L): 8081247 AVX 512 extended support code review request Vladimir, please see the latest upload, revsion4, which has my changes tested on SSE,AVX,EVEX. 
Thanks, -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Friday, June 19, 2015 11:58 AM To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' Subject: Re: RFR(L): 8081247 AVX 512 extended support code review request JPRT testing did not pass - a lot of failures (run all hotspot/test/* jtreg tests): Terminal Rampup began Fri Jun 19 20:38:51 CEST 2015 for 0.5 minutes o467 ReplicateI === _ o75 [[o468 ]] #vectorx[4]:{int} --N: o467 ReplicateI === _ o75 [[o468 ]] #vectorx[4]:{int} --N: o75 ConI === o0 [[o465 o469 o322 o220 o223 o467 o323 o292 o293 o294 o321 ]] #int:1 IMMI 10 IMMI IMMI1 0 IMMI1 IMMI2 0 IMMI2 IMMI8 5 IMMI8 IMMI16 10 IMMI16 IMMU31 0 IMMU31 RREGI 100 loadConI RAX_REGI 100 loadConI RBX_REGI 100 loadConI RCX_REGI 100 loadConI RDX_REGI 100 loadConI RDI_REGI 100 loadConI NO_RCX_REGI 100 loadConI NO_RAX_RDX_REGI 100 loadConI STACKSLOTI 200 storeSSI # To suppress the following error report, specify this argument # after -XX: or in .hotspotrc: SuppressErrorAt=/matcher.cpp:1602 # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (hotspot/src/share/vm/opto/matcher.cpp:1602), pid=4714, tid=0x000000004090e950 # assert(false) failed: bad AD file # And # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/opt/jprt/T/P1/181115.vkozlov/s/hotspot/src/cpu/x86/vm/assembler_x86.cpp:3060), pid=6301, tid=0x000000004226e950 # assert((UseAVX > 0)) failed: SSE mode requires address alignment 16 bytes # V [libjvm.so+0x110ae41] VMError::report_and_die()+0x151 V [libjvm.so+0x7f65eb] report_vm_error(char const*, int, char const*, char const*)+0x7b V [libjvm.so+0x4afaa4] Assembler::pshufd(XMMRegisterImpl*, Address, int)+0x234 V [libjvm.so+0x2a003e] Repl2D_memNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x25e V [libjvm.so+0x765923] Compile::scratch_emit_size(Node const*)+0x463 V [libjvm.so+0xe72794] Compile::shorten_branches(unsigned int*, int&, int&, int&)+0x204 V [libjvm.so+0xe73897] Compile::init_buffer(unsigned int*)+0x2d7 V [libjvm.so+0xe7efb7] Compile::Output()+0x8c7 Vladimir On 6/4/15 9:46 PM, Berg, Michael C wrote: > Vladimir please find the following webrev with the suggested changes, I have added small signature functions which look like the old versions in the assembler but manage the problem I need to handle, which is additional state for legacy only instructions. There is a new vm_version function which handles the cpuid checks with a conglomerate approach for the one scenario which had it. > The loop in the stub generator is now formed to alter the upper bound and execute in one path. > > http://cr.openjdk.java.net/~mcberg/8081247/webrev.03/ > > Regards, > Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, June 03, 2015 6:10 PM > To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' > Subject: Re: RFR(L): 8081247 AVX 512 extended support code review > request > > Hi, Michael > > assembler_x86.cpp: > > I don't like that you replaced prefix method with few parameters with method which has a lot of them: > > - int encode = vex_prefix_0F38_and_encode_q(dst, src1, src2); > + int encode = vex_prefix_and_encode(dst->encoding(), > + src1->encoding(), > src2->encoding(), > + VEX_SIMD_NONE, VEX_OPCODE_0F_38, > true, AVX_128bit, > + true, false); > > Why you did that? > > > stubGenerator_x86_64.cpp: > > Can we set different loop limit based on UseAVX instead of having 2 loops. 
> > x86.ad: > > Instead of long condition expressions like next: > > UseAVX > 0 && !VM_Version::supports_avx512vl() && > !VM_Version::supports_avx512bw() > > May be have one VM_Version finction which does these checks. > > Thanks, > Vladimir > > On 6/2/15 9:38 PM, Berg, Michael C wrote: >> Hi Folks, >> >> I would like to contribute more AVX512 enabling to facilitate support >> for machines which utilize EVEX encoding. I need two reviewers to >> review this patch and comment as needed: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8081247 >> >> webrev: >> >> http://cr.openjdk.java.net/~mcberg/8081247/webrev.01/ >> >> This patch enables BMI code on EVEX targets, improves replication >> patterns to be more efficient on both EVEX enabled and legacy >> targets, adds more CPUID based rules for correct code generation on >> various EVEX enabled servers, extends more call save/restore >> functionality and extends the vector space further for SIMD >> operations. Please expedite this review as there is a near term need for the support. >> >> Also, as I am not yet a committer, this code needs a sponsor as well. >> >> Thanks, >> >> Michael >> From edward.nevill at gmail.com Mon Jun 22 12:47:06 2015 From: edward.nevill at gmail.com (Edward Nevill) Date: Mon, 22 Jun 2015 13:47:06 +0100 Subject: pipeline class for sequence of instructions In-Reply-To: References: Message-ID: <1434977226.21282.14.camel@mint> On Wed, 2015-06-17 at 19:34 +0000, Alexeev, Alexander wrote: > Stages for arguments read/writes, decoder and execution unit are specified only once. Is it then applied on every instructions that uses that pipeline class arguments or for the whole ins_encode body? AFAIUI It is the whole ins_encode body. The pipeline scheduler does not go inside an AD instruct %{ ... %} I would use the read parameters for the first instruction, and the write parameters for the last instruction. If a resource is used just list it as being used in the normal place you would expect such a resource to be used. I think the pipeline scheduler is not really designed for multi instruction sequences so you have to just do the best you can to model the multi instruction sequence as a single fictitious instruction. Interestingly, in the example you chose below, I think it should be ins_pipe(ialu_reg_reg) rather than ins_pipe(ialu_reg) because countLeadingZerosL_bsr has both src and dst registers. All the best, Ed. 
> // Integer ALU reg operation > pipe_class ialu_reg(rRegI dst) > %{ > single_instruction; > dst : S4(write); > dst : S3(read); > DECODE : S0; // any decoder > ALU : S3; // any alu > %} > > > > instruct countLeadingZerosL_bsr(rRegI dst, rRegL src, rFlagsReg cr) %{ > predicate(!UseCountLeadingZerosInstruction); > match(Set dst (CountLeadingZerosL src)); > effect(KILL cr); > > format %{ "bsrq $dst, $src\t# count leading zeros (long)\n\t" > "jnz skip\n\t" > "movl $dst, -1\n" > "skip:\n\t" > "negl $dst\n\t" > "addl $dst, 63" %} > ins_encode %{ > Register Rdst = $dst$$Register; > Register Rsrc = $src$$Register; > Label skip; > __ bsrq(Rdst, Rsrc); > __ jccb(Assembler::notZero, skip); > __ movl(Rdst, -1); > __ bind(skip); > __ negl(Rdst); > __ addl(Rdst, BitsPerLong - 1); > %} > ins_pipe(ialu_reg); > %} From roland.westrelin at oracle.com Tue Jun 23 08:11:43 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 23 Jun 2015 10:11:43 +0200 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <5581C465.7070803@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> Message-ID: <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> > > http://gee.cs.oswego.edu/dl/jmm/cookbook.html > > > > it?s allowed to reorder normal stores with normal stores > > If we can guarantee that all passed stores are normal (I assume we will have barriers otherwise in between) then I agree. If I read the code correctly when support_IRIW_for_not_multiple_copy_atomic_cpu is true we don?t add a membar after a volatile store so folding a store with previous ones would not be a correct optimization because it could remove volatile stores. This said the current code in StoreNode::Ideal() that folds back to back stores has the effect or reordering stores on different slices. So isn?t there a bug already? Roland. From aleksey.shipilev at oracle.com Tue Jun 23 08:43:54 2015 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Tue, 23 Jun 2015 11:43:54 +0300 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> Message-ID: <55891C4A.6020002@oracle.com> On 06/23/2015 11:11 AM, Roland Westrelin wrote: >>> http://gee.cs.oswego.edu/dl/jmm/cookbook.html >>> >>> it?s allowed to reorder normal stores with normal stores >> >> If we can guarantee that all passed stores are normal (I assume we >> will have barriers otherwise in between) then I agree. > > If I read the code correctly when > support_IRIW_for_not_multiple_copy_atomic_cpu is true we don?t add a > membar after a volatile store so folding a store with previous ones > would not be a correct optimization because it could remove volatile > stores. This said the current code in StoreNode::Ideal() that folds > back to back stores has the effect or reordering stores on different > slices. So isn?t there a bug already? Let's back up a bit. Short story. Given the program: volatile int y; int x; x = 1; y = 1; // volatile store x = 2; y = 2; // volatile store This transformation is correct: x = 1; x = 2; y = 2; // volatile store ...since it does not introduce non-SC behaviors. Removing volatile ops (e.g. 
as the result of coalescing) is fine, as long as you don't introduce new behaviors. See also, JSR133 Cookbook, "Removing Barriers", StoreLoad -> no volatile loads -> StoreLoad example. Long story. Analyze the results of this program: volatile int y; int x; ---------------------------- x = 1; | r1 = y; y = 1; | r2 = x; x = 2; | y = 2; | Possible outcomes: (0, 0) -- allowed, initial values (0, 1) -- allowed, racy read of "x" (0, 2) -- allowed, racy read of "x" (1, 0) -- forbidden, violates HB consistency (1, 1) -- allowed, last read of "x" in HB (1, 2) -- allowed, racy read of "x" (2, 0) -- forbidden, violates HB consistency (2, 1) -- forbidden, violates HB consistency (2, 2) -- allowed, last read of "x" in HB Modified program: volatile int y; int x; ---------------------------- x = 1; | r1 = y; x = 2; | r2 = x; y = 2; | Possible outcomes: (0, 0) -- allowed, initial values (0, 1) -- allowed, racy read of "x" (0, 2) -- allowed, racy read of "x" (2, 0) -- forbidden, violates HB consistency (2, 1) -- forbidden, violates HB consistency (2, 2) -- allowed, last read of "x" in HB Here, modified program dropped the outcomes (1, *), but it does not start to allow other outcomes. The trick is that, in memory model sense, "allowed" does not mean "should be present". BTW, jcstress has the sequential consistency tests for volatiles [1], it makes sense to run it if you touch read/write coalescing code. (Or not, it would be running in VM testing anyway). Thanks, -Aleksey [1] $ hg clone http://hg.openjdk.java.net/code-tools/jcstress/ jcstress $ cd jcstress/ $ mvn clean install -pl tests-generated -am $ java -jar tests-generated/target/jcstress.jar -t seqcst -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From roland.westrelin at oracle.com Tue Jun 23 08:57:47 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 23 Jun 2015 10:57:47 +0200 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <55891C4A.6020002@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> <55891C4A.6020002@oracle.com> Message-ID: <7EBACFD5-8D27-4EB3-A6F6-851BFAAF6737@oracle.com> >> If I read the code correctly when >> support_IRIW_for_not_multiple_copy_atomic_cpu is true we don?t add a >> membar after a volatile store so folding a store with previous ones >> would not be a correct optimization because it could remove volatile >> stores. This said the current code in StoreNode::Ideal() that folds >> back to back stores has the effect or reordering stores on different >> slices. So isn?t there a bug already? > > Let's back up a bit. > > Short story. Given the program: > > volatile int y; > int x; > > x = 1; > y = 1; // volatile store > x = 2; > y = 2; // volatile store > > This transformation is correct: > > x = 1; > x = 2; > y = 2; // volatile store What about volatile int y; volatile int x; y=1 x=1 y=2 transformed to: x=1 y=2 ? Roland. > > ...since it does not introduce non-SC behaviors. Removing volatile ops > (e.g. as the result of coalescing) is fine, as long as you don't > introduce new behaviors. See also, JSR133 Cookbook, "Removing Barriers", > StoreLoad -> no volatile loads -> StoreLoad example. > > Long story. 
Analyze the results of this program: > > volatile int y; > int x; > ---------------------------- > x = 1; | r1 = y; > y = 1; | r2 = x; > x = 2; | > y = 2; | > > Possible outcomes: > (0, 0) -- allowed, initial values > (0, 1) -- allowed, racy read of "x" > (0, 2) -- allowed, racy read of "x" > (1, 0) -- forbidden, violates HB consistency > (1, 1) -- allowed, last read of "x" in HB > (1, 2) -- allowed, racy read of "x" > (2, 0) -- forbidden, violates HB consistency > (2, 1) -- forbidden, violates HB consistency > (2, 2) -- allowed, last read of "x" in HB > > Modified program: > > volatile int y; > int x; > ---------------------------- > x = 1; | r1 = y; > x = 2; | r2 = x; > y = 2; | > > Possible outcomes: > (0, 0) -- allowed, initial values > (0, 1) -- allowed, racy read of "x" > (0, 2) -- allowed, racy read of "x" > (2, 0) -- forbidden, violates HB consistency > (2, 1) -- forbidden, violates HB consistency > (2, 2) -- allowed, last read of "x" in HB > > Here, modified program dropped the outcomes (1, *), but it does not > start to allow other outcomes. The trick is that, in memory model sense, > "allowed" does not mean "should be present". > > BTW, jcstress has the sequential consistency tests for volatiles [1], it > makes sense to run it if you touch read/write coalescing code. (Or not, > it would be running in VM testing anyway). > > Thanks, > -Aleksey > > [1] > $ hg clone http://hg.openjdk.java.net/code-tools/jcstress/ jcstress > $ cd jcstress/ > $ mvn clean install -pl tests-generated -am > $ java -jar tests-generated/target/jcstress.jar -t seqcst > From aleksey.shipilev at oracle.com Tue Jun 23 09:13:15 2015 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Tue, 23 Jun 2015 12:13:15 +0300 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <7EBACFD5-8D27-4EB3-A6F6-851BFAAF6737@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> <55891C4A.6020002@oracle.com> <7EBACFD5-8D27-4EB3-A6F6-851BFAAF6737@oracle.com> Message-ID: <5589232B.7080705@oracle.com> On 06/23/2015 11:57 AM, Roland Westrelin wrote: >>> If I read the code correctly when >>> support_IRIW_for_not_multiple_copy_atomic_cpu is true we don?t add a >>> membar after a volatile store so folding a store with previous ones >>> would not be a correct optimization because it could remove volatile >>> stores. This said the current code in StoreNode::Ideal() that folds >>> back to back stores has the effect or reordering stores on different >>> slices. So isn?t there a bug already? >> >> Let's back up a bit. >> >> Short story. Given the program: >> >> volatile int y; >> int x; >> >> x = 1; >> y = 1; // volatile store >> x = 2; >> y = 2; // volatile store >> >> This transformation is correct: >> >> x = 1; >> x = 2; >> y = 2; // volatile store > > What about > > volatile int y; > volatile int x; > > y=1 > x=1 > y=2 > > transformed to: > > x=1 > y=2 > > ? I think this is not allowed, since operations over "x" get tied up in the synchronization order. Here is the simplest counter-example: volatile int y; volatile int x; -------------------------- y = 1; | r1 = x; x = 1; | r2 = y; y = 2; | (1, 0) -- forbidden, violates SO-PO consistency (seeing x=1 implies seeing y=1 in program order before it) volatile int y; volatile int x; -------------------------- ... 
| r1 = x; x = 1; | r2 = y; y = 2; | (1, 0) -- allowed by transformed program, oops. (These cases are what jcstress sequential consistency tests are about) Thanks, -Aleksey -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From adinn at redhat.com Tue Jun 23 10:12:44 2015 From: adinn at redhat.com (Andrew Dinn) Date: Tue, 23 Jun 2015 11:12:44 +0100 Subject: aarch64 DMB - patch In-Reply-To: <4554BF7F-F4CE-4FBD-B09A-1EFCDA598F5F@theobroma-systems.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> <55815925.3020908@redhat.com> <55815A6F.7010505@redhat.com> <92ED304B-57A4-4F63-9F4D-6B8259CC807F@theobroma-systems.com> <55815DB9.40409@redhat.com> <4554BF7F-F4CE-4FBD-B09A-1EFCDA598F5F@theobroma-systems.com> Message-ID: <5589311C.4010706@redhat.com> Hi Benedikt, On 17/06/15 14:26, Benedikt Wedenik wrote: > I checked out both repositories and compared the AD-file. > My patch also works in the latest version > of hg.openjdk.java.net/jdk9/hs-comp > . > > If ADinn is working on that part of the code right now, do you think I > should talk to him directly? You have been talking to him directly -- it's just that I have not been responding because I have been away on holiday for a few weeks. Firstly, here is a summary of what is currently being done to replace memory barriers with ldar/stlr instructions. I have already made one change in jdk9 to ensure that dmb instructions are elided for volatile gets and non-object field volatile puts. You can track progress for that patch via the associated JIRA issue: https://bugs.openjdk.java.net/browse/JDK-8078263 That fix required modifying the ad file rules which match MemBarAcquire, MemBarRelease and MemBarVolatile nodes to employ predicates which filter out the cases where generation of a dmb can safely be omitted. It also required changing the rules for put and get to use corresponding predicates to generate stlr and ldar in precisely the same cases. The predictes need to detect /exactly/ the same cases for elision and generation of synchronizing loads/stores in order for the optimization to be correct. You should look at the prior jdk9 aarch64 code to see why these predicates are defined as is -- the jdk7 and jdk8 aarch64 rules differ and are not a good starting point. This first fix fails to optimize volatile object stores. That's because the current predicates do not recognize the GC card mark nodes inserted by the compiler. I am about to post a fix for this case to aarch64-dev and hotspot-dev. The JIRA is https://bugs.openjdk.java.net/browse/JDK-8078743 A follow-up fix will also optimize CAS operations to drop dmbs in favour of ldar/stlr. This 3rd fix depends on the second fix as it requires use of a common function to test for the presence of GC card mark nodes. The JIRA issue is https://bugs.openjdk.java.net/browse/JDK-8080293 Now, as regards your proposed patch -- it appears to be addressing the unrelated case (unrelated to my changes above, that is) of memory barriers associated with fast lock and fast unlock operations i.e. locks associated with synchronized methods or synchronizations on objects via the synchronized keyword. I am not sure your patch is valid wrt to the jdk9 code base or even relative to jdk7/8. 
Your attachment includes a change to elide the dmb instructions planted when a MemBarAcquireLock or MemBarReleaseLock node is matched. These are generated, respectively, before and after a FastLock and FastUnlock node. The encodings for these latter two operations, aarch64_enc_fast_lock and aarch64_enc_fast_unlock currently employ ldxr and stlxr at the points where the object markOop field is being tested and updated (this is true in jdk7/8/9). Note that /ldxr/ is not an acquiring load. So, if your contention is that the barriers can be dropped because the markOop load-exclusive + store-exclusive pair provides sufficiently strong memory syncrhonization then at the very least your patch would need to modify the encoding to use ldaxr in place of ldxr. However, I am not convinced that these barriers can be removed even granted that change. There are various other memory operations encoded in both the fast_lock and fast_unlock cases both before and after the load-exclusive + store-exclusive pair. I believe the point of separating out the MemBarAcquireLock and MemBarReleaseLock from FastLock and FastUnlock is to ensure that those related memory operations are correctly synchronized wrt to memory operations performed by other threads which may be trying to synchronize on the same oop. If you think I am wrong and your optimization is valid then you really need to provide a detailed, convincing argument as to why -- n.b. that's not a requirement to convince me but rather to convince the many experts on this list who understand lock synchronization. Expect a lively and lengthy debate if you want to pursue this. regards, Andrew Dinn ----------- From aph at redhat.com Tue Jun 23 10:28:58 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 23 Jun 2015 11:28:58 +0100 Subject: aarch64 DMB - patch In-Reply-To: <5589311C.4010706@redhat.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> <55815925.3020908@redhat.com> <55815A6F.7010505@redhat.com> <92ED304B-57A4-4F63-9F4D-6B8259CC807F@theobroma-systems.com> <55815DB9.40409@redhat.com> <4554BF7F-F4CE-4FBD-B09A-1EFCDA598F5F@theobroma-systems.com> <5589311C.4010706@redhat.com> Message-ID: <558934EA.7070609@redhat.com> On 23/06/15 11:12, Andrew Dinn wrote: > However, I am not convinced that these barriers can be removed even > granted that change. There are various other memory operations encoded > in both the fast_lock and fast_unlock cases both before and after the > load-exclusive + store-exclusive pair. I believe the point of separating > out the MemBarAcquireLock and MemBarReleaseLock from FastLock and > FastUnlock is to ensure that those related memory operations are > correctly synchronized wrt to memory operations performed by other > threads which may be trying to synchronize on the same oop. It's delicate code, for sure. For a while we weren't using stlxr when acquiring a lock because I decided we didn't need to. This was wrong because when you copy a header word to the displaced header you *must* ensure that the store to the displaced header itself happens-before the store of the pointer to the displaced header. OTOH, I'm not at all sure we need the separate locks as well as LDAXR/STLXR. Andrew. 
From adinn at redhat.com Tue Jun 23 10:37:36 2015 From: adinn at redhat.com (Andrew Dinn) Date: Tue, 23 Jun 2015 11:37:36 +0100 Subject: aarch64 DMB - patch In-Reply-To: <558934EA.7070609@redhat.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> <55815925.3020908@redhat.com> <55815A6F.7010505@redhat.com> <92ED304B-57A4-4F63-9F4D-6B8259CC807F@theobroma-systems.com> <55815DB9.40409@redhat.com> <4554BF7F-F4CE-4FBD-B09A-1EFCDA598F5F@theobroma-systems.com> <5589311C.4010706@redhat.com> <558934EA.7070609@redhat.com> Message-ID: <558936F0.3020801@redhat.com> On 23/06/15 11:28, Andrew Haley wrote: > On 23/06/15 11:12, Andrew Dinn wrote: >> However, I am not convinced that these barriers can be removed even >> granted that change. There are various other memory operations encoded >> in both the fast_lock and fast_unlock cases both before and after the >> load-exclusive + store-exclusive pair. I believe the point of separating >> out the MemBarAcquireLock and MemBarReleaseLock from FastLock and >> FastUnlock is to ensure that those related memory operations are >> correctly synchronized wrt to memory operations performed by other >> threads which may be trying to synchronize on the same oop. > > It's delicate code, for sure. For a while we weren't using stlxr when > acquiring a lock because I decided we didn't need to. This was wrong > because when you copy a header word to the displaced header you *must* > ensure that the store to the displaced header itself happens-before > the store of the pointer to the displaced header. > > OTOH, I'm not at all sure we need the separate locks as well as > LDAXR/STLXR. Sorry, not sure I fully grokked that. Do you mean "I'm not at all sure we need the separate dmbs as well as the LDAXR/STLXR"? Or are you talking about something else? regards, Andrew Dinn ----------- From roland.westrelin at oracle.com Tue Jun 23 10:50:12 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 23 Jun 2015 12:50:12 +0200 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: <5589232B.7080705@oracle.com> References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> <55891C4A.6020002@oracle.com> <7EBACFD5-8D27-4EB3-A6F6-851BFAAF6737@oracle.com> <5589232B.7080705@oracle.com> Message-ID: >> What about >> >> volatile int y; >> volatile int x; >> >> y=1 >> x=1 >> y=2 >> >> transformed to: >> >> x=1 >> y=2 >> >> ? > > I think this is not allowed, since operations over "x" get tied up in > the synchronization order. Thanks. Then for support_IRIW_for_not_multiple_copy_atomic_cpu true, I don?t see how incorrect reordering is prevented. Roland. From vitalyd at gmail.com Tue Jun 23 14:29:55 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 23 Jun 2015 10:29:55 -0400 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> <55891C4A.6020002@oracle.com> <7EBACFD5-8D27-4EB3-A6F6-851BFAAF6737@oracle.com> <5589232B.7080705@oracle.com> Message-ID: Hi Roland, So what does the graph look like for this example? 
sent from my phone On Jun 23, 2015 6:50 AM, "Roland Westrelin" wrote: > > >> What about > >> > >> volatile int y; > >> volatile int x; > >> > >> y=1 > >> x=1 > >> y=2 > >> > >> transformed to: > >> > >> x=1 > >> y=2 > >> > >> ? > > > > I think this is not allowed, since operations over "x" get tied up in > > the synchronization order. > > Thanks. Then for support_IRIW_for_not_multiple_copy_atomic_cpu true, I > don?t see how incorrect reordering is prevented. > > Roland. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aph at redhat.com Tue Jun 23 14:32:09 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 23 Jun 2015 15:32:09 +0100 Subject: aarch64 DMB - patch In-Reply-To: <558936F0.3020801@redhat.com> References: <7A2AC9D2-F791-4473-9557-37B2C1218A4B@theobroma-systems.com> <5581575B.8050602@redhat.com> <74928C0A-3CB6-4811-A48D-F03CB65557D6@theobroma-systems.com> <55815925.3020908@redhat.com> <55815A6F.7010505@redhat.com> <92ED304B-57A4-4F63-9F4D-6B8259CC807F@theobroma-systems.com> <55815DB9.40409@redhat.com> <4554BF7F-F4CE-4FBD-B09A-1EFCDA598F5F@theobroma-systems.com> <5589311C.4010706@redhat.com> <558934EA.7070609@redhat.com> <558936F0.3020801@redhat.com> Message-ID: <55896DE9.2070409@redhat.com> On 06/23/2015 11:37 AM, Andrew Dinn wrote: > On 23/06/15 11:28, Andrew Haley wrote: >> On 23/06/15 11:12, Andrew Dinn wrote: >>> However, I am not convinced that these barriers can be removed even >>> granted that change. There are various other memory operations encoded >>> in both the fast_lock and fast_unlock cases both before and after the >>> load-exclusive + store-exclusive pair. I believe the point of separating >>> out the MemBarAcquireLock and MemBarReleaseLock from FastLock and >>> FastUnlock is to ensure that those related memory operations are >>> correctly synchronized wrt to memory operations performed by other >>> threads which may be trying to synchronize on the same oop. >> >> It's delicate code, for sure. For a while we weren't using stlxr when >> acquiring a lock because I decided we didn't need to. This was wrong >> because when you copy a header word to the displaced header you *must* >> ensure that the store to the displaced header itself happens-before >> the store of the pointer to the displaced header. >> >> OTOH, I'm not at all sure we need the separate locks as well as >> LDAXR/STLXR. > > Sorry, not sure I fully grokked that. Do you mean "I'm not at all sure > we need the separate dmbs as well as the LDAXR/STLXR"? Or are you > talking about something else? Very sorry. Yes, I mean to say that I'm not at all sure we need the separate dmbs as well as the LDAXR/STLXR. Andrew. From roland.westrelin at oracle.com Tue Jun 23 16:29:54 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 23 Jun 2015 18:29:54 +0200 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> <55891C4A.6020002@oracle.com> <7EBACFD5-8D27-4EB3-A6F6-851BFAAF6737@oracle.com> <5589232B.7080705@oracle.com> Message-ID: > So what does the graph look like for this example? I haven?t checked but I would expect: ...->MB->->ST(y=2)->MB->ST(x=1)->MB->ST(y=1)->... if support_IRIW_for_not_multiple_copy_atomic_cpu is false ...->ST(y=2)->ST(y=1)->... ...->ST(x=1)->... otherwise Roland. 
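For reference, the example being discussed, written out as compilable Java (a sketch; the comments only restate the barrier placement and the forbidden outcome described earlier in the thread, they are not verified compiler output):

class TwoVolatiles {
    volatile int x;
    volatile int y;

    void writer() {
        y = 1;   // volatile store; on TSO platforms C2 follows it with a StoreLoad barrier
        x = 1;   // volatile store
        y = 2;   // volatile store
        // Coalescing y = 1 into y = 2 would allow a reader to observe x == 1
        // while y is still 0, an outcome the original program forbids.
    }

    void reader(int[] r) {
        r[0] = x;   // r1: volatile load
        r[1] = y;   // r2: if r1 == 1, the original program guarantees r2 != 0
    }
}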
> > sent from my phone > > On Jun 23, 2015 6:50 AM, "Roland Westrelin" wrote: > > >> What about > >> > >> volatile int y; > >> volatile int x; > >> > >> y=1 > >> x=1 > >> y=2 > >> > >> transformed to: > >> > >> x=1 > >> y=2 > >> > >> ? > > > > I think this is not allowed, since operations over "x" get tied up in > > the synchronization order. > > Thanks. Then for support_IRIW_for_not_multiple_copy_atomic_cpu true, I don?t see how incorrect reordering is prevented. > > Roland. From vitalyd at gmail.com Tue Jun 23 16:48:13 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 23 Jun 2015 12:48:13 -0400 Subject: RFR(M): 8080289: Intermediate writes in a loop not eliminated by optimizer In-Reply-To: References: <55789088.5050405@oracle.com> <9DE849F6-9E4A-4649-A174-793726D67FD5@oracle.com> <5580738D.9070900@oracle.com> <5581C465.7070803@oracle.com> <05B8872C-194E-4596-9291-C62262AAA930@oracle.com> <55891C4A.6020002@oracle.com> <7EBACFD5-8D27-4EB3-A6F6-851BFAAF6737@oracle.com> <5589232B.7080705@oracle.com> Message-ID: Hmm, so no barriers *at all* if IRIW is true, as-if these are plain stores? That does seem wrong. On Tue, Jun 23, 2015 at 12:29 PM, Roland Westrelin < roland.westrelin at oracle.com> wrote: > > So what does the graph look like for this example? > > I haven?t checked but I would expect: > > ...->MB->->ST(y=2)->MB->ST(x=1)->MB->ST(y=1)->... > > if support_IRIW_for_not_multiple_copy_atomic_cpu is false > > ...->ST(y=2)->ST(y=1)->... > ...->ST(x=1)->... > > otherwise > > Roland. > > > > > sent from my phone > > > > On Jun 23, 2015 6:50 AM, "Roland Westrelin" > wrote: > > > > >> What about > > >> > > >> volatile int y; > > >> volatile int x; > > >> > > >> y=1 > > >> x=1 > > >> y=2 > > >> > > >> transformed to: > > >> > > >> x=1 > > >> y=2 > > >> > > >> ? > > > > > > I think this is not allowed, since operations over "x" get tied up in > > > the synchronization order. > > > > Thanks. Then for support_IRIW_for_not_multiple_copy_atomic_cpu true, I > don?t see how incorrect reordering is prevented. > > > > Roland. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.x.ivanov at oracle.com Wed Jun 24 12:59:13 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 24 Jun 2015 15:59:13 +0300 Subject: [9] RFR (M): VM should constant fold Unsafe.get*() loads from final fields In-Reply-To: <810DE23B-6616-4465-B91D-4CD9A8FB267D@oracle.com> References: <5581A26C.6090303@oracle.com> <810DE23B-6616-4465-B91D-4CD9A8FB267D@oracle.com> Message-ID: <558AA9A1.8070603@oracle.com> John, Paul, thanks for review! Updated webrev: http://cr.openjdk.java.net/~vlivanov/8078629/webrev.01/ I spotted a bug when field and accessor types mismatch, but the JIT still constant-folds the load. The fix made expected result detection even more complex, so I decided to get rid of it & WhiteBox hooks altogether. The test exercises different code paths and compares returned values now. > WB.isCompileConstant is a nice little thing. We should consider using it in java.lang.invoke > to gate aggressive object-folding optimizations. That's one reason to consider putting it > somewhere more central that WB. I can't propose a good place yet. (Unsafe is not quite right.) Actually, there's already j.l.i.MethodHandleImpl.isCompileConstant. Probably, compiler-specific interface is the right place for such things. But, as I wrote before, I decided to avoid WB hooks. 
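As an illustration of the two kinds of read being aligned here, a small sketch (names are made up and this is not the test from the webrev, just the pattern: a getstatic of a static final field next to the equivalent Unsafe read of the same field):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class UnsafeFoldDemo {
    static final int CONST;            // assigned in the static block so javac does not inline it
    static final Unsafe U;
    static final Object BASE;
    static final long OFFSET;

    static {
        CONST = 42;
        try {
            Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            U = (Unsafe) theUnsafe.get(null);
            Field constField = UnsafeFoldDemo.class.getDeclaredField("CONST");
            BASE = U.staticFieldBase(constField);
            OFFSET = U.staticFieldOffset(constField);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int direct()    { return CONST; }                    // getstatic of a static final: folded today
    static int viaUnsafe() { return U.getInt(BASE, OFFSET); }   // folded as well once the fix under review is in
}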
> The gating logic in library_call includes this extra term: && alias_type->field()->is_constant() > Why not just drop it and let make_constant do the test (which it does)? I wanted to stress that make_constant depends on whether the field is constant or not. I failed to come up with a better method name (try_make_constant? make_constant_attempt), so I decided to keep the extra condition. > You have some lines with "/*require_const=*/" in two places; that can't be right. > This is the result of functions with too many misc. arguments to keep track of. > I don't have the code under my fingers, so I'm just guessing, but here are more suggestions: Thanks! I tried to address all your suggestions in updated version. Best regards, Vladimir Ivanov > > I wish the is_autobox_cache condition could be more localized. Could we omit the boolean > flag (almost always false), and where it is true, post-process the node? Would that make > the code simpler? > > This leads me to notice that make_constant is not related strongly to GraphKit; it is really > a call to the Type and CI modules to look for a singleton type, ending with either a NULL > or a call to GraphKit::makecon. So you might consider changing Node* GK::make_constant > to const Type* Type::make_constant. > > Now to pick at the argument salad we have in push_constant: The effect of is_autobox_cache > could be transferred to a method Type[Ary]::cast_to_autobox_cache(true). > And the effect of stable_type on make_constant(ciCon,bool,bool,Type*), > could also be factored out, as post-processing step contype=contype->Type::join(stabletype). From vladimir.x.ivanov at oracle.com Wed Jun 24 12:59:17 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 24 Jun 2015 15:59:17 +0300 Subject: [9] RFR (M): VM should constant fold Unsafe.get*() loads from final fields In-Reply-To: <172A2B5E-9050-4219-BD07-EB9FA2D671E8@oracle.com> References: <5581A26C.6090303@oracle.com> <172A2B5E-9050-4219-BD07-EB9FA2D671E8@oracle.com> Message-ID: <558AA9A5.2060802@oracle.com> Thanks, Paul! > I like the test, you have almost hand rolled your own specializer :-) I tried hard to avoid bytecode generation, but having 8 almost identical copies of the same logic is too much :-) > A minor point. Since you have created a ClassWriter with "ClassWriter.COMPUTE_MAXS | ClassWriter.COMPUTE_FRAMES" can you remove the "mv.visitMax(0, 0)" calls? > > > I was a little confused by the code that checked the expected result against the actual result. > > I am guessing the white box methods return -1 if the value is not a constant and 1 if it is. (Perhaps that can be documented, if even those methods may eventually reside somewhere else.) Whereas, Generator.expected returns 0 or 1. > > 118 if (direct != unsafe || // difference between testDirect & testUnsafe > 119 (unsafe != -1 && expected != unsafe)) // differs from expected, but ignore "unknown"(-1) result > 120 { > 121 throw new AssertionError(String.format("%s: e=%d; d=%d; u=%d", t.name(), expected, direct, unsafe)); > 122 } > > I don't quite understand why "unknown"(-1) can be ignored. > > Can that be changed to the following if Generator.expected returned the same values as the WB methods? > > if (direct != unsafe || unsafe != expected) { ... } > > ? Though I've removed WB.isCompileConstat() in the updated version, I'll elaborate on that. 
WB.isCompileConstat() returns 3 values: * 0: the argument is not a compile-time constant; * 1: the argument is a compile-time constant; * -1: it's not known whether it is constant or not "-1" signals that there's no data from JIT. WB.isCompileConstat() is intrinsified by a JIT-compiler. I implemented the intrinsics only in C2, so if you run the test in -Xint mode or with Client VM, WB.isCompileConstant() will always return -1. In order to be resilient, the test ignores "-1" when it checks the results. Best regards, Vladimir Ivanov > > Paul. > > On Jun 17, 2015, at 6:38 PM, Vladimir Ivanov wrote: > >> http://cr.openjdk.java.net/~vlivanov/8078629/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8078629 >> >> Direct(getfield/getstatic) read operations are faster than unsafe reads from constant Java fields, since VM doesn't constant fold unsafe loads. Though VM tries hard to recover field metadata from its offset, it doesn't optimize unsafe ones even if it has all necessary info in its hands. >> >> The fix is to align the behavior and share relevant code between C2 parser and intrinsic expansion logic. >> >> For testing purposes, I extended whitebox API to check whether a value is a compile-time constant. The whitebox test enumerates all combinations of a field and ensures that the behavior is consistent between bytecode and unsafe reads. >> >> Testing: focused whitebox tests, hotspot/test/compiler, jdk/test/java/lang/invoke, octane (for performance measurements) >> >> Thanks! >> >> Best regards, >> Vladimir Ivanov > From paul.sandoz at oracle.com Wed Jun 24 13:37:53 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Wed, 24 Jun 2015 15:37:53 +0200 Subject: [9] RFR (M): VM should constant fold Unsafe.get*() loads from final fields In-Reply-To: <558AA9A5.2060802@oracle.com> References: <5581A26C.6090303@oracle.com> <172A2B5E-9050-4219-BD07-EB9FA2D671E8@oracle.com> <558AA9A5.2060802@oracle.com> Message-ID: On Jun 24, 2015, at 2:59 PM, Vladimir Ivanov wrote: > Thanks, Paul! > >> I like the test, you have almost hand rolled your own specializer :-) > I tried hard to avoid bytecode generation, but having 8 almost identical copies of the same logic is too much :-) > I agree, i like it and will likely use a similar approach in some tests i need to write. >> A minor point. Since you have created a ClassWriter with "ClassWriter.COMPUTE_MAXS | ClassWriter.COMPUTE_FRAMES" can you remove the "mv.visitMax(0, 0)" calls? >> >> >> I was a little confused by the code that checked the expected result against the actual result. >> >> I am guessing the white box methods return -1 if the value is not a constant and 1 if it is. (Perhaps that can be documented, if even those methods may eventually reside somewhere else.) Whereas, Generator.expected returns 0 or 1. >> >> 118 if (direct != unsafe || // difference between testDirect & testUnsafe >> 119 (unsafe != -1 && expected != unsafe)) // differs from expected, but ignore "unknown"(-1) result >> 120 { >> 121 throw new AssertionError(String.format("%s: e=%d; d=%d; u=%d", t.name(), expected, direct, unsafe)); >> 122 } >> >> I don't quite understand why "unknown"(-1) can be ignored. >> >> Can that be changed to the following if Generator.expected returned the same values as the WB methods? >> >> if (direct != unsafe || unsafe != expected) { ... } >> >> ? > Though I've removed WB.isCompileConstat() in the updated version, Ok, i definitely understand the updated version :-) > I'll elaborate on that. 
WB.isCompileConstat() returns 3 values: > * 0: the argument is not a compile-time constant; > * 1: the argument is a compile-time constant; > * -1: it's not known whether it is constant or not > > "-1" signals that there's no data from JIT. WB.isCompileConstat() is intrinsified by a JIT-compiler. I implemented the intrinsics only in C2, so if you run the test in -Xint mode or with Client VM, WB.isCompileConstant() will always return -1. > > In order to be resilient, the test ignores "-1" when it checks the results. > Ah, thanks, that makes sense now, Paul. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From vladimir.kozlov at oracle.com Wed Jun 24 16:57:19 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 24 Jun 2015 09:57:19 -0700 Subject: RFR: 8086087: aarch64: add support for 64 bit vectors In-Reply-To: <1435159652.13459.6.camel@mint> References: <1435159652.13459.6.camel@mint> Message-ID: <558AE16F.9080704@oracle.com> Hi, Ed I am worried about 32 bit vectors. There could be conflict somewhere in RA since min_vector_size will not match minimum vector register VecD size. Can you split these changes to have separate changesets? One is support VecD (64 bit) and an other 32bit vectors. If some testing will show problems we can check which changes caused it more precisely. And this should be reviewed on compiler mailing list instead of runtime. Thanks, Vladimir On 6/24/15 8:27 AM, Edward Nevill wrote: > Hi, > > The following webrev based on the hs-rt repo > > http://cr.openjdk.java.net/~enevill/8086087/webrev.01 > > Adds support for 64 bit vectors on aarch64. Previously the vector code only supported 128 bit vectors. > > 32 bit vectors are not supported directly as aarch64 has no support for 32 bit vectors, however the above webrev will permit 32 bit vectors but just place them in a 64 bit vector. > > I have tested this with JTreg hotspot and get the same results before and after the change, viz, > > Test results: passed: 845; failed: 12; error: 6 > > I have also benchmarked the Test*Vect tests from 6340864 in the hotspot test suite. The following are the average results I get on one of our partners HW (lower number is better). > > TestByteVect: 128-bit (11.77), 64-bit (4.36) > TestShortVect: 128-bit (5.02), 64-bit (5.22) > TestIntVect: 128-bit (7.81), 64-bit (7.70) > TestLongVect: 128-bit (11.67), 64-bit (11.71) > TestFloatVect: 128-bit (16.75), 64-bit (17.29) > TestDoubleVect:128-bit (32.37), 64-bit (32.43) > > So the only test which shows an improvement is TestByteVect which shows a 2.7x speedup. The other tests are the same within the bounds of experimental error. > > The reason TestByteVect shows such an improvement is that with 128 bit vectors it is not being vectorized at all because the loop is not unrolled sufficiently to allow it to be vectorized, wheras with 64 bit vectors it is. > > Please review and let me know if this is OK to push? > > Ed. 
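To make the unrolling point above concrete, here is a hand-written sketch (not the actual TestByteVect source) of the kind of byte loop involved:

    static void addByteArrays(byte[] a, byte[] b, byte[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (byte) (a[i] + b[i]);   // one byte per array per iteration
        }
    }

Each iteration touches a single byte per array, so SuperWord can only pack the accesses once the loop has been unrolled at least as many times as the vector has byte lanes: 16 iterations for a 128-bit vector versus 8 for a 64-bit one, which matches Ed's observation that the 128-bit case was simply not unrolled far enough to vectorize.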
> > From edward.nevill at gmail.com Wed Jun 24 18:53:12 2015 From: edward.nevill at gmail.com (Edward Nevill) Date: Wed, 24 Jun 2015 19:53:12 +0100 Subject: RFR: 8086087: aarch64: add support for 64 bit vectors In-Reply-To: <558AE16F.9080704@oracle.com> References: <1435159652.13459.6.camel@mint> <558AE16F.9080704@oracle.com> Message-ID: <1435171992.13459.22.camel@mint> On Wed, 2015-06-24 at 09:57 -0700, Vladimir Kozlov wrote: > Hi, Ed > > I am worried about 32 bit vectors. There could be conflict somewhere in RA since min_vector_size will not match minimum > vector register VecD size. > > Can you split these changes to have separate changesets? One is support VecD (64 bit) and an other 32bit vectors. > If some testing will show problems we can check which changes caused it more precisely. Hi Vladimir, Thanks for the review. I am generally happy that putting 32 bit values in 64 bit registers is OK. I initially did the 64 bit registers by putting them in 128 bit registers. That worked OK, but there were 2 problems. First when a register was spilled I had to spill 128 bits since I did not know the size at the point of the spill. The second problem was with scalar reduction when doing an add across the vector, rather than a parallel vector operation. In this case it would get the wrong result if the top 64 bits were non zero. This is why I generated a separate 64 bit vectorisation. With 32 bit, spilling 64 bits instead of 32 bits does not matter, and scalar reduction operations do not exist for 32 bit (the minimum is 2I). I will do as you suggest, and split it into two webrevs. > > And this should be reviewed on compiler mailing list instead of runtime. And should the changeset then be based on hs-comp and pushed to hs-comp? All the best, Ed. From vladimir.kozlov at oracle.com Wed Jun 24 18:58:04 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 24 Jun 2015 11:58:04 -0700 Subject: RFR: 8086087: aarch64: add support for 64 bit vectors In-Reply-To: <1435171992.13459.22.camel@mint> References: <1435159652.13459.6.camel@mint> <558AE16F.9080704@oracle.com> <1435171992.13459.22.camel@mint> Message-ID: <558AFDBC.1080705@oracle.com> > And should the changeset then be based on hs-comp and pushed to hs-comp? Yes On 6/24/15 11:53 AM, Edward Nevill wrote: > On Wed, 2015-06-24 at 09:57 -0700, Vladimir Kozlov wrote: >> Hi, Ed >> >> I am worried about 32 bit vectors. There could be conflict somewhere in RA since min_vector_size will not match minimum >> vector register VecD size. >> >> Can you split these changes to have separate changesets? One is support VecD (64 bit) and an other 32bit vectors. >> If some testing will show problems we can check which changes caused it more precisely. > > Hi Vladimir, > > Thanks for the review. I am generally happy that putting 32 bit values in 64 bit registers is OK. I initially did the 64 bit registers by putting them in 128 bit registers. > > That worked OK, but there were 2 problems. > > First when a register was spilled I had to spill 128 bits since I did not know the size at the point of the spill. > > The second problem was with scalar reduction when doing an add across the vector, rather than a parallel vector operation. In this case it would get the wrong result if the top 64 bits were non zero. > > This is why I generated a separate 64 bit vectorisation. > > With 32 bit, spilling 64 bits instead of 32 bits does not matter, and scalar reduction operations do not exist for 32 bit (the minimum is 2I). 
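For readers, a minimal Java sketch of the "add across the vector" (scalar reduction) case being discussed; the method is illustrative and not taken from the jtreg tests:

    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];    // vectorizes into lane-wise adds plus a final
                          // horizontal add across the vector
        }
        return s;
    }

The horizontal add is where register width matters: if the packed values occupy only the low 64 bits of a 128-bit register, an add across all lanes would also sum whatever sits in the upper half, which is exactly the wrong-result hazard described above.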
> > I will do as you suggest, and split it into two webrevs. > >> >> And this should be reviewed on compiler mailing list instead of runtime. > > And should the changeset then be based on hs-comp and pushed to hs-comp? > > All the best, > Ed. > > From edward.nevill at gmail.com Thu Jun 25 10:40:49 2015 From: edward.nevill at gmail.com (Edward Nevill) Date: Thu, 25 Jun 2015 11:40:49 +0100 Subject: RFR: 8086087: aarch64: add support for 64 bit vectors Message-ID: <1435228849.11204.17.camel@mylittlepony.linaroharston> Hi, The following webrev adds support for 64 bit vectors (only) on aarch64 http://cr.openjdk.java.net/~enevill/8086087/webrev.02 Previously the vector code only supported 128 bit vectors. 32 bit vectors are not supported in this changeset but will be supported in a future changeset. I have tested this with JTreg hotspot with the following results Original: Test results: passed: 858; failed: 4; error: 6 Revised: Test results: passed: 857; failed: 5; error: 6 The additional test failure is compiler/intrinsics/muladd/TestMulAdd.java which fails intermittently with both original and revised versions (I'll take a look at that next:-). I have also benchmarked the Test*Vect tests from 6340864 in the hotspot test suite. The following are the average results I get on one of our partners HW (lower number is better). TestByteVect: 128-bit (11.77), 64-bit (4.36) TestShortVect: 128-bit (5.02), 64-bit (5.22) TestIntVect: 128-bit (7.81), 64-bit (7.70) TestLongVect: 128-bit (11.67), 64-bit (11.71) TestFloatVect: 128-bit (16.75), 64-bit (17.29) TestDoubleVect:128-bit (32.37), 64-bit (32.43) So the only test which shows an improvement is TestByteVect which shows a 2.7x speedup. The other tests are the same within the bounds of experimental error. The reason TestByteVect shows such an improvement is that with 128 bit vectors it is not being vectorized at all because the loop is not unrolled sufficiently to allow it to be vectorized, wheras with 64 bit vectors it is. Please review and let me know if this is OK to push? Ed. PS: For pushing an aarch64 specific change to hs-comp do I need 1 or 2 reviewers? From vladimir.kozlov at oracle.com Thu Jun 25 14:20:41 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 25 Jun 2015 07:20:41 -0700 Subject: RFR: 8086087: aarch64: add support for 64 bit vectors In-Reply-To: <1435228849.11204.17.camel@mylittlepony.linaroharston> References: <1435228849.11204.17.camel@mylittlepony.linaroharston> Message-ID: <558C0E39.2060000@oracle.com> This looks good. Thank you, Ed. Since changes a big you need 2 reviewers. One official reviewer (me in this case) and one who is familiar with this code and at least committer (Andrew, for example). Thanks, Vladimir On 6/25/15 3:40 AM, Edward Nevill wrote: > Hi, > > The following webrev adds support for 64 bit vectors (only) on aarch64 > > http://cr.openjdk.java.net/~enevill/8086087/webrev.02 > > Previously the vector code only supported 128 bit vectors. > > 32 bit vectors are not supported in this changeset but will be supported in a future changeset. > > I have tested this with JTreg hotspot with the following results > > Original: Test results: passed: 858; failed: 4; error: 6 > Revised: Test results: passed: 857; failed: 5; error: 6 > > The additional test failure is compiler/intrinsics/muladd/TestMulAdd.java which fails intermittently with both original and revised versions (I'll take a look at that next:-). > > I have also benchmarked the Test*Vect tests from 6340864 in the hotspot test suite. 
The following are the average results I get on one of our partners HW (lower number is better). > > TestByteVect: 128-bit (11.77), 64-bit (4.36) > TestShortVect: 128-bit (5.02), 64-bit (5.22) > TestIntVect: 128-bit (7.81), 64-bit (7.70) > TestLongVect: 128-bit (11.67), 64-bit (11.71) > TestFloatVect: 128-bit (16.75), 64-bit (17.29) > TestDoubleVect:128-bit (32.37), 64-bit (32.43) > > So the only test which shows an improvement is TestByteVect which shows a 2.7x speedup. The other tests are the same within the bounds of experimental error. > > The reason TestByteVect shows such an improvement is that with 128 bit vectors it is not being vectorized at all because the loop is not unrolled sufficiently to allow it to be vectorized, wheras with 64 bit vectors it is. > > Please review and let me know if this is OK to push? > > Ed. > > PS: For pushing an aarch64 specific change to hs-comp do I need 1 or 2 reviewers? > > From edward.nevill at gmail.com Thu Jun 25 14:44:04 2015 From: edward.nevill at gmail.com (Edward Nevill) Date: Thu, 25 Jun 2015 15:44:04 +0100 Subject: RFR: 8129426: aarch64: add support for PopCount in C2 Message-ID: <1435243444.29000.6.camel@mylittlepony.linaroharston> Hi, Aarch64 currently does not support the PopCountI and PopCountL nodes in aarch64.ad The following webrev adds support for these using the SIMD instructions 'cnt' and 'addv' http://cr.openjdk.java.net/~enevill/8129426/webrev.04 This patch was contributed by alexander.alexeev at caviumnetworks.com The patch only modifies aarch64 specific files. I have merged the patch in and tested it with JTreg / hotspot with the following results for both original and revised Test results: passed: 858; failed: 4; error: 6 I have benchmarked the patch on four different partner platforms. The average improvement was 2.6X for PopCountI and 2.5X for PopCountL. Please review, Thanks, Ed. From vladimir.kozlov at oracle.com Thu Jun 25 19:29:17 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 25 Jun 2015 12:29:17 -0700 Subject: RFR: 8129426: aarch64: add support for PopCount in C2 In-Reply-To: <1435243444.29000.6.camel@mylittlepony.linaroharston> References: <1435243444.29000.6.camel@mylittlepony.linaroharston> Message-ID: <558C568D.5080506@oracle.com> Looks good. Thanks, Vladimir On 6/25/15 7:44 AM, Edward Nevill wrote: > Hi, > > Aarch64 currently does not support the PopCountI and PopCountL nodes in aarch64.ad > > The following webrev adds support for these using the SIMD instructions 'cnt' and 'addv' > > http://cr.openjdk.java.net/~enevill/8129426/webrev.04 > > This patch was contributed by alexander.alexeev at caviumnetworks.com > > The patch only modifies aarch64 specific files. > > I have merged the patch in and tested it with JTreg / hotspot with the following results for both original and revised > > Test results: passed: 858; failed: 4; error: 6 > > I have benchmarked the patch on four different partner platforms. The average improvement was 2.6X for PopCountI and 2.5X for PopCountL. > > Please review, > > Thanks, > Ed. > > From vladimir.kozlov at oracle.com Thu Jun 25 23:24:23 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 25 Jun 2015 16:24:23 -0700 Subject: [9] RFR(S) 8129893: 8129094 fix is incomplete Message-ID: <558C8DA7.8040600@oracle.com> http://cr.openjdk.java.net/~kvn/8129893/webrev/ https://bugs.openjdk.java.net/browse/JDK-8129893 Current checks missed EncodeP node. Move primitive type check from under n->is_Mem() check. 
Add second primitive type to guarantee that we don't execute following code for non-primitive type node. Tested with failed CTW test. thanks, Vladimir From vladimir.kozlov at oracle.com Thu Jun 25 23:55:19 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 25 Jun 2015 16:55:19 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <5D45CF93-9A2E-4E00-A25A-9A101A3CDB26@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> <5548158A.2010103@oracle.com> <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> <245E1236-F423-45B8-93C0-9A9BB737371C@oracle.com> <5D45CF93-9A2E-4E00-A25A-9A101A3CDB26@oracle.com> Message-ID: <558C94E7.3040306@oracle.com> Here is latest webrev which was already reviewed by me and John. We think it is good to be integrated. http://cr.openjdk.java.net/~kvn/8073583/webrev.02/ https://bugs.openjdk.java.net/browse/JDK-8073583 Contributed by: James Cheng Thanks, Vladimir On 5/4/15 8:49 PM, John Rose wrote: > On May 4, 2015, at 8:04 PM, James Cheng wrote: >> >> Hi John, >> >>> On May 4, 2015, at 6:21 PM, John Rose wrote: >>> >>> One more comment, which is at a higher level: Could we recode the loop control in Java and use unsafe to handle word and byte loads? Then we would only need single instruction intrinsics. >> >> We could, I guess, but that means we?d need to rewrite the pure Java CRC32C in JDK. >> More difficult is how we implement the CRC32C methods so that they are not favoring >> one platform while hindering others. I am afraid that the CRC32C instructions on different >> platforms are too different to compromise. > > That may be; the vector size is CPU-dependent, for example. But a 64-bit vector is (currently) the sweet spot for writing vectorized code in Java, since 'long' is the biggest bit container in Java. (Note also that HotSpot JVM objects are aligned up to 8 byte boundaries, even after GC.) Another platform with larger vectors would have to use assembly language anyway (which Intel does), but Java code can express 64-bit vectorized loops. > > For CRC, the desirable number of distinct streams, and the prefetch mode and distance, are also CPU-dependent. For those variations injecting machine-specific parts into the Java-coded algorithm would get messy. > > The benefit of coding low-level vectorized loops in Java would be not having to code the loop logic in assembly code. If we could use byte buffers to manage the indexing, and/or had better array notations, it would probably be worth while moving from assembly to Java. At present it seems OK to code in assembly, *if* the assembly can be made more readable. > > We have a chicken and egg problem here: Nobody is going to experiment with Java-coded vector loops until we get single-vector CRC32[C] and XMULX instructions surfaced as C2-supported intrinsics. (We already have bit and byte reverse intrinsics, so that part is OK.) > > ? John > From igor.veresov at oracle.com Fri Jun 26 00:19:26 2015 From: igor.veresov at oracle.com (Igor Veresov) Date: Thu, 25 Jun 2015 17:19:26 -0700 Subject: [9] RFR(S) 8129893: 8129094 fix is incomplete In-Reply-To: <558C8DA7.8040600@oracle.com> References: <558C8DA7.8040600@oracle.com> Message-ID: <8849DC00-FB04-4833-BB10-EA87FEA152AA@oracle.com> Looks fine. igor > On Jun 25, 2015, at 4:24 PM, Vladimir Kozlov wrote: > > http://cr.openjdk.java.net/~kvn/8129893/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8129893 > > Current checks missed EncodeP node. 
> Move primitive type check from under n->is_Mem() check. > Add second primitive type to guarantee that we don't execute following code for non-primitive type node. > > Tested with failed CTW test. > > thanks, > Vladimir From vladimir.kozlov at oracle.com Fri Jun 26 00:20:59 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 25 Jun 2015 17:20:59 -0700 Subject: [9] RFR(S) 8129893: 8129094 fix is incomplete In-Reply-To: <8849DC00-FB04-4833-BB10-EA87FEA152AA@oracle.com> References: <558C8DA7.8040600@oracle.com> <8849DC00-FB04-4833-BB10-EA87FEA152AA@oracle.com> Message-ID: <558C9AEB.6060600@oracle.com> Thank you, Igor Vladimir On 6/25/15 5:19 PM, Igor Veresov wrote: > Looks fine. > > igor > >> On Jun 25, 2015, at 4:24 PM, Vladimir Kozlov wrote: >> >> http://cr.openjdk.java.net/~kvn/8129893/webrev/ >> https://bugs.openjdk.java.net/browse/JDK-8129893 >> >> Current checks missed EncodeP node. >> Move primitive type check from under n->is_Mem() check. >> Add second primitive type to guarantee that we don't execute following code for non-primitive type node. >> >> Tested with failed CTW test. >> >> thanks, >> Vladimir > From vladimir.kozlov at oracle.com Fri Jun 26 00:57:31 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 25 Jun 2015 17:57:31 -0700 Subject: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord In-Reply-To: <39F83597C33E5F408096702907E6C450E56020@ORSMSX104.amr.corp.intel.com> References: <39F83597C33E5F408096702907E6C450E4D6F7@ORSMSX104.amr.corp.intel.com> <55835DFC.70001@oracle.com> <39F83597C33E5F408096702907E6C450E56020@ORSMSX104.amr.corp.intel.com> Message-ID: <558CA37B.2040701@oracle.com> Okay, this is better. Thanks, Vladimir On 6/25/15 2:51 PM, Civlin, Jan wrote: > > Vladimir, > > Here is the updated patch with trace hidden in a new nested class Trace, that contains all the messages. The Trace class is compiled only in NOT_PRODUCT. > Looks much simple now (of course more lines but all outside of the algorithmic part). > > Thank you, > > Jan. > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Thursday, June 18, 2015 5:11 PM > To: Civlin, Jan; hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord > > Thank you, Jan > > Fixes looks good but it would be nice if you replaced some tracing code > with functions calls. In some place the execution code is hard to read > because of big tracing code. For example, in > SuperWord::memory_alignment() and in SWPointer methods. > > The one way to do that is to declare trace methods with empty body in > product build, for example for SWPointer::scaled_iv_plus_offset() you > may have new method declaration (not under #ifdef) in superword.hpp: > > class SWPointer VALUE_OBJ_CLASS_SPEC { > > void trace_1_scaled_iv_plus_offset(...) PRODUCT_RETURN; > > and in superword.cpp you will put the method under ifdef: > > #ifndef PRODUCT > void trace_1_scaled_iv_plus_offset(...) { > .... 
> } > #endif > > Then you can simply use it without ifdefs in code: > > bool SWPointer::scaled_iv_plus_offset(Node* n) { > + trace_1_scaled_iv_plus_offset(...); > + > if (scaled_iv(n)) { > > Note, macro PRODUCT_RETURN is defined as: > > #ifdef PRODUCT > #define PRODUCT_RETURN {} > #else > #define PRODUCT_RETURN /*next token must be ;*/ > #endif > > Thanks, > Vladimir > > On 6/8/15 9:15 AM, Civlin, Jan wrote: >> Hi All, >> >> >> We would like to contribute to Fixing bugs in detecting memory >> alignments in SuperWord. >> >> The contribution Bug ID: 8085932. >> >> Please review this patch: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8085932 >> >> webrev: http://cr.openjdk.java.net/~kvn/8085932/webrev.00/ >> >> >> *Description**: *Fixing bugs in detecting memory alignments in >> SuperWord >> >> Fixing bugs in detecting memory alignments in SuperWord: >> SWPointer::scaled_iv_plus_offset (fixing here a bug in detection of >> "scale"), >> SWPointer::offset_plus_k (fixing here a bug in detection of "invariant"), >> >> Add tracing output to the code that deal with memory alignment. The >> following routines are traceable: >> >> SWPointer::scaled_iv_plus_offset >> SWPointer::offset_plus_k >> SWPointer::scaled_iv, >> WPointer::SWPointer, >> SuperWord::memory_alignment >> >> Tracing is done only for NOT_PRODUCT. Currently tracing is controlled by >> VectorizeDebug: >> >> #ifndef PRODUCT >> if (_phase->C->method() != NULL) { >> _phase->C->method()->has_option_value("VectorizeDebug", >> _vector_loop_debug); >> } >> #endif >> >> And VectorizeDebug may take any combination (bitwise OR) of the >> following values: >> bool is_trace_alignment() { return (_vector_loop_debug & 2) > 0; } >> bool is_trace_mem_slice() { return (_vector_loop_debug & 4) > 0; } >> bool is_trace_loop() { return (_vector_loop_debug & 8) > 0; } >> bool is_trace_adjacent() { return (_vector_loop_debug & 16) > 0; } >> > From james.cheng at oracle.com Fri Jun 26 01:04:01 2015 From: james.cheng at oracle.com (james cheng) Date: Thu, 25 Jun 2015 18:04:01 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <558C94E7.3040306@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> <5548158A.2010103@oracle.com> <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> <245E1236-F423-45B8-93C0-9A9BB737371C@oracle.com> <5D45CF93-9A2E-4E00-A25A-9A101A3CDB26@oracle.com> <558C94E7.3040306@oracle.com> Message-ID: <558CA501.1000503@oracle.com> Hi Vladimir, Thank you and John for the reviews and guidance. Regards, -James On 6/25/2015 4:55 PM, Vladimir Kozlov wrote: > Here is latest webrev which was already reviewed by me and John. We think it is good to be integrated. 
> > http://cr.openjdk.java.net/~kvn/8073583/webrev.02/ > https://bugs.openjdk.java.net/browse/JDK-8073583 > > Contributed by: James Cheng > > Thanks, > Vladimir From aph at redhat.com Fri Jun 26 16:25:38 2015 From: aph at redhat.com (Andrew Haley) Date: Fri, 26 Jun 2015 17:25:38 +0100 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <5583D414.5050502@redhat.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> <5582F936.5020008@oracle.com> <5582FACA.4060103@redhat.com> <5582FDCA.8010507@oracle.com> <55831BC8.9060001@oracle.com> <5583D414.5050502@redhat.com> Message-ID: <558D7D02.6070303@redhat.com> On 06/19/2015 09:34 AM, Andrew Haley wrote: > On 18/06/15 20:28, Vladimir Kozlov wrote: > >> Yes, it is a lot of handwriting but we need it to work on all OSs. > > Sure, I get that. I knew there would be a few goes around with this, > but it's worth the pain for the performance improvement. I made some changes, as requested. Everything is now private static final. The libcall now only calls the runtime code: all allocation is done in Java code. I tested on Solaris using Solaris Studio 12.3 tools, and it's fine. There's one thing I'm not sure about. I now longer allocate scratch memory on the heap. That was only needed for extremely large integers, larger than anyone needs for crypto. Now, if the size of an integer exceeds 16384 bits I do not use the intrinsic, and this allows it to use stack-allocated memory for its scratch space. The main thing I was worried about is that the time spent in Montgomery multiplication. The runtime of the algorithm is O(N^2); if you don't limit the size, the time is unbounded, with no safepoint delay. This would mean that anyone who passed an absurdly large integer to BigInteger.modPow() would see the virtual machine apparently lock up and garbage collection would not run. I note that the multiplyToLen() intrinsic has the same problem. http://cr.openjdk.java.net/~aph/8046943-hs-3/ http://cr.openjdk.java.net/~aph/8046943-jdk-3/ Andrew. From michael.c.berg at intel.com Fri Jun 26 19:43:56 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Fri, 26 Jun 2015 19:43:56 +0000 Subject: RFR: 8129920 - Vectorized loop unrolling Message-ID: Hi Folks, I would like to contribute Vectorized loop unrolling. I need two reviewers to review this patch and comment as needed: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8129920 webrev: http://cr.openjdk.java.net/~mcberg/8129920/webrev.01/ With this change we leverage superword unrolling queries and superword to stage re-entrance to ideal loop optimization. We do this when superword succeeds on vectorizing a loop which was unroll query mapped. When we re-enter ideal loop optimization, we have already done all major optimizations such as peeling, splitting, rce and superword on the vector map candidate loop. Thus we only unroll the loop. We utilize the standard loop unrolling environment to accomplish this with default and any applicable user settings. In this way we leverage unroll factors from the baseline loop which are much larger to obtain optimum throughput on x86 architectures. The uplift range on SpecJvm2008 is seen on scimark.lu.{small,large} with uplift noted at 3% and 8% respectively. We see as much as 1.5x uplift on vector centric micros like reductions on default optimizations. Thanks, Michael -------------- next part -------------- An HTML attachment was scrubbed... 
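As an illustration of the loop shape the vectorized-unrolling change targets, a hedged sketch (illustrative only, not taken from the patch or from SPECjvm2008):

    static void axpy(float[] y, float a, float[] x) {
        for (int i = 0; i < y.length; i++) {
            y[i] += a * x[i];   // element-wise multiply-add over two arrays
        }
    }

SuperWord first turns the body into vector multiply-adds; the staged re-entry into ideal loop optimization described above then unrolls the already-vectorized loop further, so each trip processes several vectors and the per-element loop overhead drops.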
URL: From vladimir.kozlov at oracle.com Fri Jun 26 22:50:30 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 26 Jun 2015 15:50:30 -0700 Subject: [9] RFR(XS) 8130008: compiler/codecache/jmx/UsageThresholdIncreasedTest.java should be quarantined Message-ID: <558DD736.1040106@oracle.com> http://cr.openjdk.java.net/~kvn/8130008/webrev/ Quarantine UsageThresholdIncreasedTest.java test until 8129937 is fixed. The failure prevents taking snapshot of jdk9 this week. https://bugs.openjdk.java.net/browse/JDK-8130008 https://bugs.openjdk.java.net/browse/JDK-8129937 Thanks, Vladimir From igor.veresov at oracle.com Fri Jun 26 22:52:05 2015 From: igor.veresov at oracle.com (Igor Veresov) Date: Fri, 26 Jun 2015 15:52:05 -0700 Subject: [9] RFR(XS) 8130008: compiler/codecache/jmx/UsageThresholdIncreasedTest.java should be quarantined In-Reply-To: <558DD736.1040106@oracle.com> References: <558DD736.1040106@oracle.com> Message-ID: Good. igor > On Jun 26, 2015, at 3:50 PM, Vladimir Kozlov wrote: > > http://cr.openjdk.java.net/~kvn/8130008/webrev/ > > Quarantine UsageThresholdIncreasedTest.java test until 8129937 is fixed. > The failure prevents taking snapshot of jdk9 this week. > > https://bugs.openjdk.java.net/browse/JDK-8130008 > https://bugs.openjdk.java.net/browse/JDK-8129937 > > Thanks, > Vladimir From vladimir.x.ivanov at oracle.com Sat Jun 27 01:27:08 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Sat, 27 Jun 2015 04:27:08 +0300 Subject: On constant folding of final field loads Message-ID: <558DFBEC.6040700@oracle.com> Hi there, Recently I started looking at constant folding of loads from instance final fields: https://bugs.openjdk.java.net/browse/JDK-8058164 I made some progress and wanted to share my findings and initiate a discussion about the problems I spotted. Current prototype: http://cr.openjdk.java.net/~vlivanov/8058164/webrev.00/hotspot http://cr.openjdk.java.net/~vlivanov/8058164/webrev.00/jdk The idea is simple: JIT tracks final field changes and throws away nmethods which are affected. There are 2 parts of the problem: - how to track changes of final field - how to manage relation between final fields and nmethods; I. Tracking changes of final fields There are 4 ways to circumvent runtime limitations and change a final field value: - Reflection API (Field.setAccessible()) - Unsafe - JNI - java.lang.invoke (MethodHandles) (It's also possible to write to a final field in a constructor, but I consider it as a corner case and haven't addressed yet. VM can ignore ) Since Reflection & java.lang.invoke APIs use Unsafe, it ends up with only 2 cases: JNI & Unsafe. For JNI it's possible to encode field "finality" in jfieldID and check corresponding bit in Set*Field JNI methods before updating a field. There are already some data encoded in the field ID, so extending it to record final bit as well went pretty smooth. For Unsafe it's much more complex. I started with a similar approach (mostly implemented in the current prototype) - record "finality" bit in offset cookie and check it when performing a write. Though Unsafe.objectFieldOffset/staticFieldOffset javadoc explicitly states that returned value is not guaranteed to be a byte offset [1], after following that road I don't see how offset encoding scheme can be changed. First of all, there are Unsafe.get*Unaligned methods (added in 9), which require byte offset (Unsafe.getLong()): "Fetches a value at some byte offset into a given Java object ... 
The specification of this method is the same as {@link #getLong(Object, long)} except that the offset does not need to have been obtained from {@link #objectFieldOffset} on the {@link java.lang.reflect.Field} of some Java field." Unsafe.getInt supports 3 addressing modes: (1) NULL + address (2) oop + offset (3) base + index * scale Since there are no methods in Unsafe to get byte offsets, there's no way to make (3) work with non-byte offsets for Unaligned versions. Both base and scale should be either byte offsets or offset cookies to make things work. You can get a sense of the problems looking into Unsafe & java.nio hacks I did to make things somewhat function after switching offset encoding strategy. Also, Unsafe.copyMemory() doesn't work well with offset cookies (see java.nio.Bits changes I did). Though source and destination addressing shares the same mode with Unsage.getInt() et al., the size of the copied region is defined in bytes. So, in order to perform bulk copies of consecutive memory blocks, the user should be able to convert offset cookie to byte offset and vice versa. There's no way to solve that with current API right now. I don't want to touch compatibility concerns of switching from byte offsets to encoded offsets, but it looks like Unsafe API needs some overhaul in 9 to make offset encoding viable. More realistically, since there are external dependencies on Unsafe API, I'd prefer to leave sun.misc.Unsafe as is and switch to VarHandles (when they are available in 9) all over JDK. Or temporarily make a private copy (finally :-)) of field accessors from Unsafe, switch it to encoded offsets, and use it in Reflection & java.lang.invoke API. Regarding alternative approaches to track the finality, an offset bitmap on per-class basis can be used (containing locations of final fields). Possible downsides are: (1) memory footprint (1/8th of instance size per class); and (2) more complex checking logic (load a relevant piece of a bitmap from a klass, instead of checking locally available offset cookie). The advantage is that it is completely transparent to a user: it doesn't change offset translation scheme. II. Managing relations between final fields and nmethods Nmethods dependencies suits that purpose pretty well, but some enhancements are needed. I envision 2 types of dependencies: (1) per-class (field holder); and (2) per-instance (value holder). Field holder is used as a context. Unless a final field is changed, there's no need to track per-instance dependency. VM optimistically starts in per-class mode and switch to per-instance mode when it sees a field change. The policy is chosen on per-class basis. VM should be pretty conservative, since false positives are expensive - a change of unrelated field causes recompilation of all nmethods where the same field was inlined (even if the value was taken from a different instance). Aliasing also causes false positives (same instance, but different final field), so fields in the same class should be differentiated as well. Unilke methods, fields don't have any dedicated metadata associated with them. All data is confined in holder klass. To be able to identify a field in a dependency, byte offset can be used. Right now, dependency management machinery supports only oops and metadata. So, it should be extended to support primitive values in dependencies (haven't done yet). Byte offset + per-instance modes completely eliminates false positives. Another aspect is how expensive dependency checking becomes. 
I took a benchmark from Nashorn/Octane (Box2D), since MethodHandle inlining heavily relies on constant folding of instance final fields. Before After checks (#) 420 12,5K nmethods checked(#) 3K 1,5M total time: 60ms 2s deps total 19K 26K Though total number of dependencies in VM didn't change much (+37% = 19K->26K), total number of checked dependencies (500x: 3K -> 1,5M) and time spent on dependency checking (30x: 60ms -> 2s) dramatically increased. The reason is that constant field value dependencies created heavily populated contextes which are regularly checked: #1 #2 #3/#4 Before KlassDep 254 47/2,632 CallSiteDep 167 46/ 358 After ConstantFieldDep 11,790 0/1,494,112 KlassDep 286 41/ 2,769 CallSiteDep 249 58/ 393 (#1 - dependency kind; #2 - total number of unique dependencies; #3/#4 - invalidated nmethods/checked dependencies) I have 3 ideas how to improve performance of dependency checking: (1) split dependency context list (nmethodBucket) into 3 independent lists (Klass, CallSite & ConstantValue); (IMPLEMENTED) It trades size for speed - duplicate nmethods are possible, but the lists should be shorter on average. I already implemented it, but it didn't improve the benchmark I'm playing with, since the fraction of CallSite/Klass deps is very small compared to ConstantField. (2) group nmethodBucket entries into chunks of k-nmethods; (TODO) It should improve nmethod iteration speed in heavily populated contexts. (3) iterate only dependencies of appropriate kind; (TODO) There are 3 kinds of changes which require dependency checking: changes in CHA (KlassDepChange), call site target change (CallSiteDepChange), and constant field value change (ConstantFieldDepChange). Different types of changes affect disjoint sets of dependencies. So, instead of enumerating all dependencies in a nmethod, a more focused approach can be used (e.g. check only call_site_target_value deps for CallSiteDepChange). Since dependencies are sorted by type when serialized in a nmethod, it's possible to compute offsets for 3 disjoint sets and use them in DepStream to iterate only relevant dependencies. I hope it'll significantly reduce dependency checking costs I'm seeing. That's all for now. Thanks! Best regards, Vladimir Ivanov [1] "Do not expect to perform any sort of arithmetic on this offset; it is just a cookie which is passed to the unsafe heap memory accessors." From jan.civlin at intel.com Thu Jun 25 21:51:59 2015 From: jan.civlin at intel.com (Civlin, Jan) Date: Thu, 25 Jun 2015 21:51:59 +0000 Subject: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord In-Reply-To: <55835DFC.70001@oracle.com> References: <39F83597C33E5F408096702907E6C450E4D6F7@ORSMSX104.amr.corp.intel.com> <55835DFC.70001@oracle.com> Message-ID: <39F83597C33E5F408096702907E6C450E56020@ORSMSX104.amr.corp.intel.com> Vladimir, Here is the updated patch with trace hidden in a new nested class Trace, that contains all the messages. The Trace class is compiled only in NOT_PRODUCT. Looks much simple now (of course more lines but all outside of the algorithmic part). Thank you, Jan. -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, June 18, 2015 5:11 PM To: Civlin, Jan; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord Thank you, Jan Fixes looks good but it would be nice if you replaced some tracing code with functions calls. 
In some place the execution code is hard to read because of big tracing code. For example, in SuperWord::memory_alignment() and in SWPointer methods. The one way to do that is to declare trace methods with empty body in product build, for example for SWPointer::scaled_iv_plus_offset() you may have new method declaration (not under #ifdef) in superword.hpp: class SWPointer VALUE_OBJ_CLASS_SPEC { void trace_1_scaled_iv_plus_offset(...) PRODUCT_RETURN; and in superword.cpp you will put the method under ifdef: #ifndef PRODUCT void trace_1_scaled_iv_plus_offset(...) { .... } #endif Then you can simply use it without ifdefs in code: bool SWPointer::scaled_iv_plus_offset(Node* n) { + trace_1_scaled_iv_plus_offset(...); + if (scaled_iv(n)) { Note, macro PRODUCT_RETURN is defined as: #ifdef PRODUCT #define PRODUCT_RETURN {} #else #define PRODUCT_RETURN /*next token must be ;*/ #endif Thanks, Vladimir On 6/8/15 9:15 AM, Civlin, Jan wrote: > Hi All, > > > We would like to contribute to Fixing bugs in detecting memory > alignments in SuperWord. > > The contribution Bug ID: 8085932. > > Please review this patch: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8085932 > > webrev: http://cr.openjdk.java.net/~kvn/8085932/webrev.00/ > > > *Description**: *Fixing bugs in detecting memory alignments in > SuperWord > > Fixing bugs in detecting memory alignments in SuperWord: > SWPointer::scaled_iv_plus_offset (fixing here a bug in detection of > "scale"), > SWPointer::offset_plus_k (fixing here a bug in detection of "invariant"), > > Add tracing output to the code that deal with memory alignment. The > following routines are traceable: > > SWPointer::scaled_iv_plus_offset > SWPointer::offset_plus_k > SWPointer::scaled_iv, > WPointer::SWPointer, > SuperWord::memory_alignment > > Tracing is done only for NOT_PRODUCT. Currently tracing is controlled by > VectorizeDebug: > > #ifndef PRODUCT > if (_phase->C->method() != NULL) { > _phase->C->method()->has_option_value("VectorizeDebug", > _vector_loop_debug); > } > #endif > > And VectorizeDebug may take any combination (bitwise OR) of the > following values: > bool is_trace_alignment() { return (_vector_loop_debug & 2) > 0; } > bool is_trace_mem_slice() { return (_vector_loop_debug & 4) > 0; } > bool is_trace_loop() { return (_vector_loop_debug & 8) > 0; } > bool is_trace_adjacent() { return (_vector_loop_debug & 16) > 0; } > -------------- next part -------------- A non-text attachment was scrubbed... Name: webrev.062515.tar.bz2 Type: application/octet-stream Size: 1043695 bytes Desc: webrev.062515.tar.bz2 URL: From weatry at gmail.com Sun Jun 28 04:15:00 2015 From: weatry at gmail.com (weatry at gmail.com) Date: Sun, 28 Jun 2015 12:15:00 +0800 Subject: why doesn't trigger compile when loop 10700 times? Message-ID: <2015062812145835427919@gmail.com> hi, everyone! I have some question about the jdk 1.7 compiler. According to the source code of "hotspot\src\share\vm\interpreter\invocationCounter.cpp", the "InterpreterBackwardBranchLimit" is calculated by the following rules: if (ProfileInterpreter) { InterpreterBackwardBranchLimit = (CompileThreshold * (OnStackReplacePercentage - InterpreterProfilePercentage)) / 100; } else { InterpreterBackwardBranchLimit = ((CompileThreshold * OnStackReplacePercentage) / 100) << number_of_noncount_bits; } So if I run a piece of code on a server edition jvm, the InterpreterBackwardBranchLimit should be 10700 (CompileThreshold is 10000, OnStackReplacePercentage is 140, and InterpreterProfilePercentage is 33). 
But when I added -XX:+PrintCompilation, a loop with 10700 times would not print anything. When the loop growed up to 14564 times, the compiler began to work. Could anybody give me some advice? I use jdk1.7.0_67, and the test code as following: public class OSRDemo { public static void main(String[] args) { int result = 1; for (int i = 1; i < 10700; i++) {//14564 result+=i; } System.out.println(result); } } Thank you very much! Thomas -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Sun Jun 28 04:36:40 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Sat, 27 Jun 2015 21:36:40 -0700 Subject: why doesn't trigger compile when loop 10700 times? In-Reply-To: <2015062812145835427919@gmail.com> References: <2015062812145835427919@gmail.com> Message-ID: <558F79D8.7040907@oracle.com> Add -Xbatch (or -XX:-BackgroundCompilation) otherwise code continue execution after requesting compilation and may finish before compilation starts. Also counters are not precise so you may not see exact 10700. Regards, Vladimir On 6/27/15 9:15 PM, weatry at gmail.com wrote: > hi, everyone! > > I have some question about the jdk 1.7 compiler. > > According to the source code of > "hotspot\src\share\vm\interpreter\invocationCounter.cpp", the > "InterpreterBackwardBranchLimit" is calculated by the following rules: > > if (ProfileInterpreter) { > InterpreterBackwardBranchLimit = (CompileThreshold * (OnStackReplacePercentage - InterpreterProfilePercentage)) / 100; > } else { > InterpreterBackwardBranchLimit = ((CompileThreshold * OnStackReplacePercentage) / 100) << number_of_noncount_bits; > } > > So if I run a piece of code on a server edition jvm, the > InterpreterBackwardBranchLimit should be 10700 (CompileThreshold is > 10000, OnStackReplacePercentage is 140, and InterpreterProfilePercentage > is 33). But when I added -XX:+PrintCompilation, a loop with 10700 times > would not print anything. When the loop growed up to 14564 times, the > compiler began to work. > > Could anybody give me some advice? > > I use jdk1.7.0_67, and the test code as following: > > public class OSRDemo { > public static void main(String[] args) { > int result = 1; > for (int i = 1; i < 10700; i++) {//14564 > result+=i; > } > > System.out.println(result); > } > } > > Thank you very much! > > ------------------------------------------------------------------------ > Thomas From kirk.pepperdine at gmail.com Sun Jun 28 08:30:06 2015 From: kirk.pepperdine at gmail.com (Kirk Pepperdine) Date: Sun, 28 Jun 2015 10:30:06 +0200 Subject: why doesn't trigger compile when loop 10700 times? In-Reply-To: <558F79D8.7040907@oracle.com> References: <2015062812145835427919@gmail.com> <558F79D8.7040907@oracle.com> Message-ID: Hi, I believe with loops you should see an OSR but only after a count of 14000. I don?t believe this will be precise though. Kind regards, Kirk Pepperdine On Jun 28, 2015, at 6:36 AM, Vladimir Kozlov wrote: > Add -Xbatch (or -XX:-BackgroundCompilation) otherwise code continue execution after requesting compilation and may finish before compilation starts. > Also counters are not precise so you may not see exact 10700. > > Regards, > Vladimir > > On 6/27/15 9:15 PM, weatry at gmail.com wrote: >> hi, everyone! >> >> I have some question about the jdk 1.7 compiler. 
>> >> According to the source code of >> "hotspot\src\share\vm\interpreter\invocationCounter.cpp", the >> "InterpreterBackwardBranchLimit" is calculated by the following rules: >> >> if (ProfileInterpreter) { >> InterpreterBackwardBranchLimit = (CompileThreshold * (OnStackReplacePercentage - InterpreterProfilePercentage)) / 100; >> } else { >> InterpreterBackwardBranchLimit = ((CompileThreshold * OnStackReplacePercentage) / 100) << number_of_noncount_bits; >> } >> >> So if I run a piece of code on a server edition jvm, the >> InterpreterBackwardBranchLimit should be 10700 (CompileThreshold is >> 10000, OnStackReplacePercentage is 140, and InterpreterProfilePercentage >> is 33). But when I added -XX:+PrintCompilation, a loop with 10700 times >> would not print anything. When the loop growed up to 14564 times, the >> compiler began to work. >> >> Could anybody give me some advice? >> >> I use jdk1.7.0_67, and the test code as following: >> >> public class OSRDemo { >> public static void main(String[] args) { >> int result = 1; >> for (int i = 1; i < 10700; i++) {//14564 >> result+=i; >> } >> >> System.out.println(result); >> } >> } >> >> Thank you very much! >> >> ------------------------------------------------------------------------ >> Thomas From paul.sandoz at oracle.com Mon Jun 29 08:34:37 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Mon, 29 Jun 2015 10:34:37 +0200 Subject: On constant folding of final field loads In-Reply-To: <558DFBEC.6040700@oracle.com> References: <558DFBEC.6040700@oracle.com> Message-ID: <67DBC2C3-1AB2-464A-A691-3942278A4787@oracle.com> Hi Vladimir, Looks like there is some really good investigatory work here. On Jun 27, 2015, at 3:27 AM, Vladimir Ivanov wrote: > There are 4 ways to circumvent runtime limitations and change a final field value: > - Reflection API (Field.setAccessible()) > - Unsafe > - JNI > - java.lang.invoke (MethodHandles) > For MHs it's not possible to lookup a MH (via MH.L.findSetter/unreflectSetter) to a final field. http://hg.openjdk.java.net/jdk9/dev/jdk/file/93ced310c728/src/java.base/share/classes/java/lang/invoke/MethodHandles.java#l1516 Although i cannot find any such explicit mention in JavaDoc, so i guess it can be considered under the umbrella of "access checks". Paul. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From vladimir.kozlov at oracle.com Mon Jun 29 08:37:44 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 29 Jun 2015 01:37:44 -0700 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <558D7D02.6070303@redhat.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> <5582F936.5020008@oracle.com> <5582FACA.4060103@redhat.com> <5582FDCA.8010507@oracle.com> <55831BC8.9060001@oracle.com> <5583D414.5050502@redhat.com> <558D7D02.6070303@redhat.com> Message-ID: <559103D8.1010302@oracle.com> Hi, Andrew Did you file RFE for this change? 8046943 is JEP. typo? "less" -> "more". + * number of ints in the number is less than this value we do not + * use the intrinsic. + */ + private static final int MONTGOMERY_INTRINSIC_THRESHOLD = 512; trailing spaces: src/java.base/share/classes/java/math/BigInteger.java:273: Trailing whitespace src/java.base/share/classes/java/math/BigInteger.java:2770: Trailing whitespace I ran changes through JPRT and linux/solaris passed - thanks. 
Next step - Windows: C:\jprt\T\P1\s\hotspot\src\cpu\x86\vm\sharedRuntime_x86_64.cpp(26) : fatal error C1083: Cannot open include file: 'alloca.h': No such file or directory I am fine with JDK changes. Would be nice to have a test for this change. Do existing tests cover this code? I agree that we should limit size when to invoke multiplyToLen intrinsic too. File bug I will assign it. Thanks, Vladimir On 6/26/15 9:25 AM, Andrew Haley wrote: > On 06/19/2015 09:34 AM, Andrew Haley wrote: >> On 18/06/15 20:28, Vladimir Kozlov wrote: >> >>> Yes, it is a lot of handwriting but we need it to work on all OSs. >> >> Sure, I get that. I knew there would be a few goes around with this, >> but it's worth the pain for the performance improvement. > > I made some changes, as requested. > > Everything is now private static final. > > The libcall now only calls the runtime code: all allocation is done > in Java code. > > I tested on Solaris using Solaris Studio 12.3 tools, and it's fine. > > There's one thing I'm not sure about. I now longer allocate scratch > memory on the heap. That was only needed for extremely large > integers, larger than anyone needs for crypto. Now, if the size of an > integer exceeds 16384 bits I do not use the intrinsic, and this allows > it to use stack-allocated memory for its scratch space. > > The main thing I was worried about is that the time spent in > Montgomery multiplication. The runtime of the algorithm is O(N^2); if > you don't limit the size, the time is unbounded, with no safepoint > delay. This would mean that anyone who passed an absurdly large > integer to BigInteger.modPow() would see the virtual machine > apparently lock up and garbage collection would not run. I note that > the multiplyToLen() intrinsic has the same problem. > > http://cr.openjdk.java.net/~aph/8046943-hs-3/ > http://cr.openjdk.java.net/~aph/8046943-jdk-3/ > > Andrew. > From aph at redhat.com Mon Jun 29 09:07:10 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 29 Jun 2015 10:07:10 +0100 Subject: [aarch64-port-dev ] RFR: 8086087: aarch64: add support for 64 bit vectors In-Reply-To: <558C0E39.2060000@oracle.com> References: <1435228849.11204.17.camel@mylittlepony.linaroharston> <558C0E39.2060000@oracle.com> Message-ID: <55910ABE.8050809@redhat.com> On 25/06/15 15:20, Vladimir Kozlov wrote: > Since changes a big you need 2 reviewers. One official reviewer (me in this case) and one who is familiar with this code > and at least committer (Andrew, for example). Looks good. Thanks, Andrew. From aph at redhat.com Mon Jun 29 09:08:18 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 29 Jun 2015 10:08:18 +0100 Subject: [aarch64-port-dev ] RFR: 8129426: aarch64: add support for PopCount in C2 In-Reply-To: <1435243444.29000.6.camel@mylittlepony.linaroharston> References: <1435243444.29000.6.camel@mylittlepony.linaroharston> Message-ID: <55910B02.50604@redhat.com> On 25/06/15 15:44, Edward Nevill wrote: > Please review, This is fine. Thanks, Andrew. 
From aph at redhat.com Mon Jun 29 09:32:47 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 29 Jun 2015 10:32:47 +0100 Subject: RFR: 8046943: RSA Acceleration In-Reply-To: <559103D8.1010302@oracle.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> <5582F936.5020008@oracle.com> <5582FACA.4060103@redhat.com> <5582FDCA.8010507@oracle.com> <55831BC8.9060001@oracle.com> <5583D414.5050502@redhat.com> <558D7D02.6070303@redhat.com> <559103D8.1010302@oracle.com> Message-ID: <559110BF.4090804@redhat.com> On 29/06/15 09:37, Vladimir Kozlov wrote: > Hi, Andrew > > Did you file RFE for this change? 8046943 is JEP. No; I will do so. > typo? "less" -> "more". > > + * number of ints in the number is less than this value we do not > + * use the intrinsic. > + */ > + private static final int MONTGOMERY_INTRINSIC_THRESHOLD = 512; > > trailing spaces: > src/java.base/share/classes/java/math/BigInteger.java:273: Trailing whitespace > src/java.base/share/classes/java/math/BigInteger.java:2770: Trailing whitespace > > I ran changes through JPRT and linux/solaris passed - thanks. > Next step - Windows: > > C:\jprt\T\P1\s\hotspot\src\cpu\x86\vm\sharedRuntime_x86_64.cpp(26) : fatal error C1083: Cannot open include file: > 'alloca.h': No such file or directory Hmm, okay. This is going to be fun. :-) AFAIK OpenJDK builds with Visual Studio. The VS equivalent of alloca() is called _alloca() and its header file is . I'm going to try to do this untested. I think that autoconf will #include malloc.h on Windows automagically, so all that I have to do is create a #define for alloca() on Windows. > I am fine with JDK changes. > > Would be nice to have a test for this change. Do existing tests > cover this code? They do. jdk/test/com/oracle/security/ucrypto/TestRSA looks like a pretty thorough test. If you make any mistake in the arithmetic RSA decryption simply will not work: the result is corrupt. The risks, then, are mistakes such as accidental side-effects. I don't know any way to test for that. The other possible think I could test for is unusual key sizes. I'll have a look. > I agree that we should limit size when to invoke multiplyToLen > intrinsic too. File bug I will assign it. OK. Andrew. From aleksey.shipilev at oracle.com Mon Jun 29 10:35:20 2015 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Mon, 29 Jun 2015 13:35:20 +0300 Subject: On constant folding of final field loads In-Reply-To: <558DFBEC.6040700@oracle.com> References: <558DFBEC.6040700@oracle.com> Message-ID: <55911F68.9050809@oracle.com> Hi, On 06/27/2015 04:27 AM, Vladimir Ivanov wrote: > Current prototype: > http://cr.openjdk.java.net/~vlivanov/8058164/webrev.00/hotspot > http://cr.openjdk.java.net/~vlivanov/8058164/webrev.00/jdk > > The idea is simple: JIT tracks final field changes and throws away > nmethods which are affected. Big picture question: do we actually care about propagating final field values once the object escaped (and in this sense, available to be introspected by the compiler)? Java memory model does not guarantee the final field visibility when the object had escaped. The very reason why deserialization works is because the deserialized object had not yet been published. That is, are we in line with the spec and general expectations by folding the final values, *and* not deoptimizing on the store? 
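To spell out the deserialization case: the final store there is unproblematic only because it happens before the object is published. Roughly the sketch below, with plain reflection standing in for the Unsafe plumbing that serialization actually uses (class and field names are mine):

import java.lang.reflect.Field;

// Sketch: a deserialization-style write to a final field. It is safe under the JMM
// only because no other thread can see the object yet.
public class PrePublicationWriteSketch {
    static class Box {
        final int value;
        Box() { this.value = 0; }
    }

    public static void main(String[] args) throws Exception {
        Box b = new Box();                               // not yet published anywhere
        Field f = Box.class.getDeclaredField("value");
        f.setAccessible(true);
        f.setInt(b, 42);                                 // "deserialize" the final field
        // ...only now publish b; readers are then guaranteed to see 42
        System.out.println(b.value);
    }
}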
> Though Unsafe.objectFieldOffset/staticFieldOffset javadoc explicitly > states that returned value is not guaranteed to be a byte offset [1], > after following that road I don't see how offset encoding scheme can be > changed. Yes. Lots and lots of users rely on *fieldOffset to return the actual byte offset, even though it is not specified as such. This understanding is so prevalent, that it leaks into Unsafe.get*Unaligned, etc. > More realistically, since there are external dependencies on Unsafe API, > I'd prefer to leave sun.misc.Unsafe as is and switch to VarHandles (when > they are available in 9) all over JDK. Or temporarily make a private > copy (finally :-)) of field accessors from Unsafe, switch it to encoded > offsets, and use it in Reflection & java.lang.invoke API. Or, introduce Unsafe.invalidateFinalDep(Field/offset/etc), and add the call to it to Reflection accessors, MethodHandles invoke, VarHandle handles, etc. When/if Unsafe goes away, so do the unsafe (non-dependency-firing) final field stores. Raw memory access via Unsafe already escapes whatever traps you are setting in (oop + offset) path, so it would be nice to have the option to fire the dependency check for an arbitrary (?) offset. > Regarding alternative approaches to track the finality, an offset bitmap > on per-class basis can be used (containing locations of final fields). > Possible downsides are: (1) memory footprint (1/8th of instance size per > class); and (2) more complex checking logic (load a relevant piece of a > bitmap from a klass, instead of checking locally available offset > cookie). The advantage is that it is completely transparent to a user: > it doesn't change offset translation scheme. I like this one. Paying with slightly larger memory footprint for API compatibility sounds reasonable to me. > II. Managing relations between final fields and nmethods > Another aspect is how expensive dependency checking becomes. > > I took a benchmark from Nashorn/Octane (Box2D), since MethodHandle > inlining heavily relies on constant folding of instance final fields. > > Before After > checks (#) 420 12,5K > nmethods checked(#) 3K 1,5M > total time: 60ms 2s > deps total 19K 26K > > Though total number of dependencies in VM didn't change much (+37% = > 19K->26K), total number of checked dependencies (500x: 3K -> 1,5M) and > time spent on dependency checking (30x: 60ms -> 2s) dramatically increased. > > The reason is that constant field value dependencies created heavily > populated contextes which are regularly checked: > > #1 #2 #3/#4 > Before > KlassDep 254 47/2,632 > CallSiteDep 167 46/ 358 > > After > ConstantFieldDep 11,790 0/1,494,112 > KlassDep 286 41/ 2,769 > CallSiteDep 249 58/ 393 > > (#1 - dependency kind; #2 - total number of unique dependencies; > #3/#4 - invalidated nmethods/checked dependencies) Isn't the underlying problem being the dependencies are searched linearly? At least in ConstantFieldDep, can we compartmentalize the dependencies by holder class in some sort of hash table? Thanks, -Aleksey -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From vladimir.x.ivanov at oracle.com Mon Jun 29 12:00:57 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 29 Jun 2015 15:00:57 +0300 Subject: On constant folding of final field loads In-Reply-To: <67DBC2C3-1AB2-464A-A691-3942278A4787@oracle.com> References: <558DFBEC.6040700@oracle.com> <67DBC2C3-1AB2-464A-A691-3942278A4787@oracle.com> Message-ID: <55913379.2040000@oracle.com> Paul, > For MHs it's not possible to lookup a MH (via MH.L.findSetter/unreflectSetter) to a final field. > > http://hg.openjdk.java.net/jdk9/dev/jdk/file/93ced310c728/src/java.base/share/classes/java/lang/invoke/MethodHandles.java#l1516 > > Although i cannot find any such explicit mention in JavaDoc, so i guess it can be considered under the umbrella of "access checks". Though you can't look up a setter for a final field directly, you can use Lookup.unreflectSetter: Field f = T.class.getDeclaredField("t"); f.setAccessible(true); MethodHandle setter = MethodHandles.lookup().unreflectSetter(f); But it doesn't matter much, since MH field getters/setters are based on Unsafe. So, if final field value tracking works for Unsafe, it works for j.l.i as well. Best regards, Vladimir Ivanov From paul.sandoz at oracle.com Mon Jun 29 12:13:39 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Mon, 29 Jun 2015 14:13:39 +0200 Subject: On constant folding of final field loads In-Reply-To: <55913379.2040000@oracle.com> References: <558DFBEC.6040700@oracle.com> <67DBC2C3-1AB2-464A-A691-3942278A4787@oracle.com> <55913379.2040000@oracle.com> Message-ID: On Jun 29, 2015, at 2:00 PM, Vladimir Ivanov wrote: > Paul, > >> For MHs it's not possible to lookup a MH (via MH.L.findSetter/unreflectSetter) to a final field. >> >> http://hg.openjdk.java.net/jdk9/dev/jdk/file/93ced310c728/src/java.base/share/classes/java/lang/invoke/MethodHandles.java#l1516 >> >> Although i cannot find any such explicit mention in JavaDoc, so i guess it can be considered under the umbrella of "access checks". > Though you can't look up a setter for a final field directly, you can use Lookup.unreflectSetter: > Field f = T.class.getDeclaredField("t"); > f.setAccessible(true); > MethodHandle setter = MethodHandles.lookup().unreflectSetter(f); > Oh, yes, I see now, I missed the switch to use IMPL_LOOKUP if the field is set to accessible. > But it doesn't matter much, since MH field getters/setters are based on Unsafe. So, if final field value tracking works for Unsafe, it works for j.l.i as well. > Yes. Paul. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From vladimir.x.ivanov at oracle.com Mon Jun 29 13:10:42 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 29 Jun 2015 16:10:42 +0300 Subject: On constant folding of final field loads In-Reply-To: <55911F68.9050809@oracle.com> References: <558DFBEC.6040700@oracle.com> <55911F68.9050809@oracle.com> Message-ID: <559143D2.2020201@oracle.com> Aleksey, Thanks a lot for the feedback! See my answers inline.
On 6/29/15 1:35 PM, Aleksey Shipilev wrote: > Hi, > > On 06/27/2015 04:27 AM, Vladimir Ivanov wrote: >> Current prototype: >> http://cr.openjdk.java.net/~vlivanov/8058164/webrev.00/hotspot >> http://cr.openjdk.java.net/~vlivanov/8058164/webrev.00/jdk >> >> The idea is simple: JIT tracks final field changes and throws away >> nmethods which are affected. > > Big picture question: do we actually care about propagating final field > values once the object escaped (and in this sense, available to be > introspected by the compiler)? > > Java memory model does not guarantee the final field visibility when the > object had escaped. The very reason why deserialization works is because > the deserialized object had not yet been published. > > That is, are we in line with the spec and general expectations by > folding the final values, *and* not deoptimizing on the store? Can you elaborate on your point and interaction with JMM a bit? Are you talking about not tracking constant folded final field values at all, since there are no guarantees by JMM such updates are visible? >> Though Unsafe.objectFieldOffset/staticFieldOffset javadoc explicitly >> states that returned value is not guaranteed to be a byte offset [1], >> after following that road I don't see how offset encoding scheme can be >> changed. > > Yes. Lots and lots of users rely on *fieldOffset to return the actual > byte offset, even though it is not specified as such. This understanding > is so prevalent, that it leaks into Unsafe.get*Unaligned, etc. > > >> More realistically, since there are external dependencies on Unsafe API, >> I'd prefer to leave sun.misc.Unsafe as is and switch to VarHandles (when >> they are available in 9) all over JDK. Or temporarily make a private >> copy (finally :-)) of field accessors from Unsafe, switch it to encoded >> offsets, and use it in Reflection & java.lang.invoke API. > > Or, introduce Unsafe.invalidateFinalDep(Field/offset/etc), and add the > call to it to Reflection accessors, MethodHandles invoke, VarHandle > handles, etc. When/if Unsafe goes away, so do the unsafe > (non-dependency-firing) final field stores. Raw memory access via Unsafe > already escapes whatever traps you are setting in (oop + offset) path, > so it would be nice to have the option to fire the dependency check for > an arbitrary (?) offset. > > >> Regarding alternative approaches to track the finality, an offset bitmap >> on per-class basis can be used (containing locations of final fields). >> Possible downsides are: (1) memory footprint (1/8th of instance size per >> class); and (2) more complex checking logic (load a relevant piece of a >> bitmap from a klass, instead of checking locally available offset >> cookie). The advantage is that it is completely transparent to a user: >> it doesn't change offset translation scheme. > > I like this one. Paying with slightly larger memory footprint for API > compatibility sounds reasonable to me. I don't care about cases when Unsafe API is abused (e.g. raw memory writes on absolute address or arbitrary offset in an object). In the end, it's unsafe API, right? :-) What I want to cover is proper usages of Unsafe API to access instance/static fields. That's the part which is used in Reflection & java.lang.invoke API. Unsafe is used there to bypass access checks. It doesn't mean I'm fine with breaking existing user code. But since Unsafe is not a supported API, I admit some limited changes in major release (e.g. 9) are allowed. 
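Concretely, the pattern that has to keep working is offset-cookie-based field access, which is what Reflection and j.l.i effectively generate under the hood. A stripped-down sketch (grabbing theUnsafe via reflection is only for illustration):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch of the "proper" Unsafe usage I care about: a final instance field accessed
// through the cookie returned by objectFieldOffset(), bypassing access checks.
public class UnsafeFieldAccessSketch {
    static class Holder {
        final int x;
        Holder(int x) { this.x = x; }
    }

    public static void main(String[] args) throws Exception {
        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        Unsafe u = (Unsafe) theUnsafe.get(null);

        long cookie = u.objectFieldOffset(Holder.class.getDeclaredField("x"));
        Holder h = new Holder(1);
        u.putInt(h, cookie, 2);                   // final field store through the cookie
        System.out.println(u.getInt(h, cookie));  // prints 2
    }
}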
What I'm trying to understand is to what extent it can be changed. My experiments show that simply changing offset encoding strategy doesn't work. There are cases when absolute offsets are needed. So, my next question is how to proceed. Does changing API and providing 2 set of functions working with absolute and encoded offsets solve the problem? Or leaving Unsafe as is (but clarifying the API) and migrating Reflection/j.l.i to VarHandles solve the problem? That's what I'm trying to understand. > >> II. Managing relations between final fields and nmethods >> Another aspect is how expensive dependency checking becomes. >> >> I took a benchmark from Nashorn/Octane (Box2D), since MethodHandle >> inlining heavily relies on constant folding of instance final fields. >> >> Before After >> checks (#) 420 12,5K >> nmethods checked(#) 3K 1,5M >> total time: 60ms 2s >> deps total 19K 26K >> >> Though total number of dependencies in VM didn't change much (+37% = >> 19K->26K), total number of checked dependencies (500x: 3K -> 1,5M) and >> time spent on dependency checking (30x: 60ms -> 2s) dramatically increased. >> >> The reason is that constant field value dependencies created heavily >> populated contextes which are regularly checked: >> >> #1 #2 #3/#4 >> Before >> KlassDep 254 47/2,632 >> CallSiteDep 167 46/ 358 >> >> After >> ConstantFieldDep 11,790 0/1,494,112 >> KlassDep 286 41/ 2,769 >> CallSiteDep 249 58/ 393 >> >> (#1 - dependency kind; #2 - total number of unique dependencies; >> #3/#4 - invalidated nmethods/checked dependencies) > > Isn't the underlying problem being the dependencies are searched > linearly? At least in ConstantFieldDep, can we compartmentalize the > dependencies by holder class in some sort of hash table? In some cases (when coarse-grained (per-class) tracking is used), linear traversal is fine, since all nmethods will be invalidated. In order to construct a more efficient data structure, you need a way to order or hash oops. The problem with that is oops aren't stable - they can change at any GC. So, either some stable value should be associated with them (System.identityHashCode()?) or dependency tables should be updated on every GC. Unless existing machinery can be sped up to appropriate level, I wouldn't consider complicating things so much. The 3 optimizations I initially proposed allow to isolate ConstantFieldDep from other kinds of dependencies, so dependency traversal speed will affect only final field writes. Which is acceptable IMO. Best regards, Vladimir Ivanov From vladimir.kozlov at oracle.com Mon Jun 29 20:59:48 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 29 Jun 2015 13:59:48 -0700 Subject: RFR: 8129920 - Vectorized loop unrolling In-Reply-To: References: Message-ID: <5591B1C4.302@oracle.com> ignore_slp() and NoMoreSlp whould be fine names if they guard only superword optimization. You use it to skipp all loop optimizations except unrolling. It should be named differently. allow_unroll_only ? why you need set_notpassed_slp()?: + // For atomic unrolled loops which are vector mapped, instigate more unrolling. + cl->set_notpassed_slp(); Thanks, Vladimir On 6/26/15 12:43 PM, Berg, Michael C wrote: > Hi Folks, > > I would like to contribute Vectorized loop unrolling. 
I need two > reviewers to review this patch and comment as needed: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8129920 > > webrev: > > http://cr.openjdk.java.net/~mcberg/8129920/webrev.01/ > > With this change we leverage superword unrolling queries and superword > to stage re-entrance to ideal loop optimization. We do this when > superword succeeds on vectorizing a loop which was unroll query mapped. > When we re-enter ideal loop optimization, we have already done all major > optimizations such as peeling, splitting, rce and superword on the > vector map candidate loop. Thus we only unroll the loop. We utilize the > standard loop unrolling environment to accomplish this with default and > any applicable user settings. In this way we leverage unroll factors > from the baseline loop which are much larger to obtain optimum > throughput on x86 architectures. The uplift range on SpecJvm2008 is seen > on scimark.lu.{small,large} with uplift noted at 3% and 8% respectively. > We see as much as 1.5x uplift on vector centric micros like reductions > on default optimizations. > > Thanks, > > Michael > From michael.c.berg at intel.com Tue Jun 30 00:46:21 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Tue, 30 Jun 2015 00:46:21 +0000 Subject: RFR: 8129920 - Vectorized loop unrolling In-Reply-To: <5591B1C4.302@oracle.com> References: <5591B1C4.302@oracle.com> Message-ID: Vladimir, sure I will change to reflect we are only allowing unrolling. For the unroll only case, we are allowing all the standard logic for unrolling to apply without unroll queries and its cases. We would need (cl->has_passed_slp() && !cl->unroll_only()) to make both the guarded cases equivalent. We have less code the way I have it. I could word it differently, but it would work out about the same in new code. Thanks, Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Monday, June 29, 2015 2:00 PM To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' Subject: Re: RFR: 8129920 - Vectorized loop unrolling ignore_slp() and NoMoreSlp whould be fine names if they guard only superword optimization. You use it to skipp all loop optimizations except unrolling. It should be named differently. allow_unroll_only ? why you need set_notpassed_slp()?: + // For atomic unrolled loops which are vector mapped, instigate more unrolling. + cl->set_notpassed_slp(); Thanks, Vladimir On 6/26/15 12:43 PM, Berg, Michael C wrote: > Hi Folks, > > I would like to contribute Vectorized loop unrolling. I need two > reviewers to review this patch and comment as needed: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8129920 > > webrev: > > http://cr.openjdk.java.net/~mcberg/8129920/webrev.01/ > > With this change we leverage superword unrolling queries and superword > to stage re-entrance to ideal loop optimization. We do this when > superword succeeds on vectorizing a loop which was unroll query mapped. > When we re-enter ideal loop optimization, we have already done all > major optimizations such as peeling, splitting, rce and superword on > the vector map candidate loop. Thus we only unroll the loop. We > utilize the standard loop unrolling environment to accomplish this > with default and any applicable user settings. In this way we leverage > unroll factors from the baseline loop which are much larger to obtain > optimum throughput on x86 architectures. The uplift range on > SpecJvm2008 is seen on scimark.lu.{small,large} with uplift noted at 3% and 8% respectively. 
> We see as much as 1.5x uplift on vector centric micros like reductions > on default optimizations. > > Thanks, > > Michael > From aleksey.shipilev at oracle.com Tue Jun 30 09:49:32 2015 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Tue, 30 Jun 2015 12:49:32 +0300 Subject: On constant folding of final field loads In-Reply-To: <559143D2.2020201@oracle.com> References: <558DFBEC.6040700@oracle.com> <55911F68.9050809@oracle.com> <559143D2.2020201@oracle.com> Message-ID: <5592662C.40400@oracle.com> Hi, On 06/29/2015 04:10 PM, Vladimir Ivanov wrote: > On 6/29/15 1:35 PM, Aleksey Shipilev wrote: >> On 06/27/2015 04:27 AM, Vladimir Ivanov wrote: >> Big picture question: do we actually care about propagating final field >> values once the object escaped (and in this sense, available to be >> introspected by the compiler)? >> >> Java memory model does not guarantee the final field visibility when the >> object had escaped. The very reason why deserialization works is because >> the deserialized object had not yet been published. >> >> That is, are we in line with the spec and general expectations by >> folding the final values, *and* not deoptimizing on the store? > Can you elaborate on your point and interaction with JMM a bit? > > Are you talking about not tracking constant folded final field values at > all, since there are no guarantees by JMM such updates are visible? Yup. AFAIU the JMM, there is no guarantees you would see the updated value for final field after the object had leaked. So, spec-wise you may just use the final field values as constants. I think the only reason you have to do the dependency tracking is when constant folding depends on instance identity. So, my question is, do we knowingly make a goodwill call to deopt on final field store, even though it is not required by spec? I am not opposing the change, but I'd like us to understand the implications better. For example, I can see the change gives rise to some interesting low-level coding idioms, like: final boolean running = true; Field runningField = resolve(...); // reflective // run stuff for minutes void m() { while (running) { // compiler hoists, turns into while(true) // do stuff } } void hammerTime() { runningField.set(this, false); // deopt, break the loop! } Once we allow users to go crazy like that, it would be cruel to retract/break/change this behavior. But I speculate those cases are not pervasive. By and large, people care about final ops to jump through the barriers. For example, the final load can be commonned through the acquires / control flow. See e.g.: http://psy-lob-saw.blogspot.ru/2014/02/when-i-say-final-i-mean-final.html >>> Regarding alternative approaches to track the finality, an offset bitmap >>> on per-class basis can be used (containing locations of final fields). >>> Possible downsides are: (1) memory footprint (1/8th of instance size per >>> class); and (2) more complex checking logic (load a relevant piece of a >>> bitmap from a klass, instead of checking locally available offset >>> cookie). The advantage is that it is completely transparent to a user: >>> it doesn't change offset translation scheme. >> >> I like this one. Paying with slightly larger memory footprint for API >> compatibility sounds reasonable to me. > > I don't care about cases when Unsafe API is abused (e.g. raw memory > writes on absolute address or arbitrary offset in an object). In the > end, it's unsafe API, right? 
:-) Yeah, but with millions of users, we are in a bit of a (implicit) compatibility bind here ;) > So, my next question is how to proceed. Does changing API and providing > 2 set of functions working with absolute and encoded offsets solve the > problem? Or leaving Unsafe as is (but clarifying the API) and migrating > Reflection/j.l.i to VarHandles solve the problem? That's what I'm trying > to understand. I would think Reflection/j.l.i would eventually migrate to VarHandles anyway. Paul? The interim solution for encoding final field flags shouldn't leak into (even Unsafe) API, or at least should not break the existing APIs. I further think that an interim solution makes auxiliary single Unsafe.fireDepChange(Field f / long addr) or something, and uses it along with the Unsafe calls in Reflection/j.l.i, when wrappers know they are dealing with final fields. In other words, should we try to reuse the knowledge those wrappers already have, instead of trying to encode the same knowledge into offset cookies? >>> II. Managing relations between final fields and nmethods >>> Another aspect is how expensive dependency checking becomes. >> Isn't the underlying problem being the dependencies are searched >> linearly? At least in ConstantFieldDep, can we compartmentalize the >> dependencies by holder class in some sort of hash table? > In some cases (when coarse-grained (per-class) tracking is used), linear > traversal is fine, since all nmethods will be invalidated. > > In order to construct a more efficient data structure, you need a way to > order or hash oops. The problem with that is oops aren't stable - they > can change at any GC. So, either some stable value should be associated > with them (System.identityHashCode()?) or dependency tables should be > updated on every GC. Yeah, like Symbol::_identity_hash. > Unless existing machinery can be sped up to appropriate level, I > wouldn't consider complicating things so much. Okay. I just can't escape the feeling we keep band-aiding the linear searches everywhere in VM on case-to-case basis, instead of providing the asymptotic guarantees with better data structures. > The 3 optimizations I initially proposed allow to isolate > ConstantFieldDep from other kinds of dependencies, so dependency > traversal speed will affect only final field writes. Which is acceptable > IMO. Except for an overwhelming number of cases where the final field stores happen in the course of deserialization. What's particularly bad about this scenario is that you wouldn't see the time burned in the VM unless you employ the native profiler, as we discovered in Nashorn perf work. Recapping the discussion in this thread, I think we would need to have a more thorough performance work for this change, since it touches the very core of the platform. I think many people outside the hotspot-compiler-dev understand some corner intricacies of the problem that we miss. JEP and outcry for public comments, maybe? Thanks, -Aleksey. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From tobias.hartmann at oracle.com Tue Jun 30 09:54:34 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 30 Jun 2015 11:54:34 +0200 Subject: [9] RFR(S): 8129937: compiler/codecache/jmx/UsageThresholdIncreasedTest.java fails with "Usage threshold was hit" Message-ID: <5592675A.3070607@oracle.com> Hi, please review the following patch. 
https://bugs.openjdk.java.net/browse/JDK-8129937 http://cr.openjdk.java.net/~thartmann/8129937/webrev.00/ Problem: The jmx test disables compilation, sets a usage threshold for the non-profiled code heap and verifies that the threshold is not hit while allocating code with an overall size < threshold. The test fails because even if we disable compilation with -XX:CompileCommand=compileonly,null::*, we still generate compiled versions of MH intrinsics on method resolution and since those are stored in the non-profiled code heap we may hit the threshold. This problem potentially affects the other jmx tests as well. Of course, with -Xint we would not generate any compiled MH intrinsics because we can guarantee that we have no compiled code. However, for testing purposes we want to make sure that we have all code heaps available and therefore need to use -XX:CompileCommand=compileonly,null::*. Solution: I changed the jmx tests to not assume that the usage of the non-profiled code heap is predictable (see CodeCacheUtils::isCodeHeapPredictable). I added the method CodeCacheUtils::assertEQorGTE to verify that two values are equal if the corresponding code heap is predictable or fall back to the weaker condition of checking that the values are greater or equal if the code heap is not predictable. I changed the tests to use this method. Testing: JPRT with failing test Thanks, Tobias From vladimir.kozlov at oracle.com Tue Jun 30 15:25:20 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 30 Jun 2015 08:25:20 -0700 Subject: [9] RFR(S): 8129937: compiler/codecache/jmx/UsageThresholdIncreasedTest.java fails with "Usage threshold was hit" In-Reply-To: <5592675A.3070607@oracle.com> References: <5592675A.3070607@oracle.com> Message-ID: <5592B4E0.9030305@oracle.com> Looks fine to me. Let author (Igor Ignatyev?) to review it too. Thanks, Vladimir On 6/30/15 2:54 AM, Tobias Hartmann wrote: > Hi, > > please review the following patch. > > https://bugs.openjdk.java.net/browse/JDK-8129937 > http://cr.openjdk.java.net/~thartmann/8129937/webrev.00/ > > Problem: > The jmx test disables compilation, sets a usage threshold for the non-profiled code heap and verifies that the threshold is not hit while allocating code with an overall size < threshold. The test fails because even if we disable compilation with -XX:CompileCommand=compileonly,null::*, we still generate compiled versions of MH intrinsics on method resolution and since those are stored in the non-profiled code heap we may hit the threshold. This problem potentially affects the other jmx tests as well. > Of course, with -Xint we would not generate any compiled MH intrinsics because we can guarantee that we have no compiled code. However, for testing purposes we want to make sure that we have all code heaps available and therefore need to use -XX:CompileCommand=compileonly,null::*. > > Solution: > I changed the jmx tests to not assume that the usage of the non-profiled code heap is predictable (see CodeCacheUtils::isCodeHeapPredictable). I added the method CodeCacheUtils::assertEQorGTE to verify that two values are equal if the corresponding code heap is predictable or fall back to the weaker condition of checking that the values are greater or equal if the code heap is not predictable. I changed the tests to use this method. 
> > Testing: > JPRT with failing test > > Thanks, > Tobias > From tobias.hartmann at oracle.com Tue Jun 30 15:38:20 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 30 Jun 2015 17:38:20 +0200 Subject: [9] RFR(S): 8129937: compiler/codecache/jmx/UsageThresholdIncreasedTest.java fails with "Usage threshold was hit" In-Reply-To: <5592B4E0.9030305@oracle.com> References: <5592675A.3070607@oracle.com> <5592B4E0.9030305@oracle.com> Message-ID: <5592B7EC.90801@oracle.com> Thanks, Vladimir. The author of the original change [1] is Dmitrij Pochepko (CC'ed). Dmitrij, could you have a look? Best, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8059613 On 30.06.2015 17:25, Vladimir Kozlov wrote: > Looks fine to me. Let author (Igor Ignatyev?) to review it too. > > Thanks, > Vladimir > > On 6/30/15 2:54 AM, Tobias Hartmann wrote: >> Hi, >> >> please review the following patch. >> >> https://bugs.openjdk.java.net/browse/JDK-8129937 >> http://cr.openjdk.java.net/~thartmann/8129937/webrev.00/ >> >> Problem: >> The jmx test disables compilation, sets a usage threshold for the non-profiled code heap and verifies that the threshold is not hit while allocating code with an overall size < threshold. The test fails because even if we disable compilation with -XX:CompileCommand=compileonly,null::*, we still generate compiled versions of MH intrinsics on method resolution and since those are stored in the non-profiled code heap we may hit the threshold. This problem potentially affects the other jmx tests as well. >> Of course, with -Xint we would not generate any compiled MH intrinsics because we can guarantee that we have no compiled code. However, for testing purposes we want to make sure that we have all code heaps available and therefore need to use -XX:CompileCommand=compileonly,null::*. >> >> Solution: >> I changed the jmx tests to not assume that the usage of the non-profiled code heap is predictable (see CodeCacheUtils::isCodeHeapPredictable). I added the method CodeCacheUtils::assertEQorGTE to verify that two values are equal if the corresponding code heap is predictable or fall back to the weaker condition of checking that the values are greater or equal if the code heap is not predictable. I changed the tests to use this method. >> >> Testing: >> JPRT with failing test >> >> Thanks, >> Tobias >> From dmitrij.pochepko at oracle.com Tue Jun 30 16:09:12 2015 From: dmitrij.pochepko at oracle.com (Dmitrij Pochepko) Date: Tue, 30 Jun 2015 19:09:12 +0300 Subject: [9] RFR(S): 8129937: compiler/codecache/jmx/UsageThresholdIncreasedTest.java fails with "Usage threshold was hit" In-Reply-To: <5592B7EC.90801@oracle.com> References: <5592675A.3070607@oracle.com> <5592B4E0.9030305@oracle.com> <5592B7EC.90801@oracle.com> Message-ID: <5592BF28.4030504@oracle.com> Hi, looks good (i'm not a reviewer) Thanks, Dmitrij > Thanks, Vladimir. > > The author of the original change [1] is Dmitrij Pochepko (CC'ed). > > Dmitrij, could you have a look? > > Best, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8059613 > > On 30.06.2015 17:25, Vladimir Kozlov wrote: >> Looks fine to me. Let author (Igor Ignatyev?) to review it too. >> >> Thanks, >> Vladimir >> >> On 6/30/15 2:54 AM, Tobias Hartmann wrote: >>> Hi, >>> >>> please review the following patch. 
>>> >>> https://bugs.openjdk.java.net/browse/JDK-8129937 >>> http://cr.openjdk.java.net/~thartmann/8129937/webrev.00/ >>> >>> Problem: >>> The jmx test disables compilation, sets a usage threshold for the non-profiled code heap and verifies that the threshold is not hit while allocating code with an overall size< threshold. The test fails because even if we disable compilation with -XX:CompileCommand=compileonly,null::*, we still generate compiled versions of MH intrinsics on method resolution and since those are stored in the non-profiled code heap we may hit the threshold. This problem potentially affects the other jmx tests as well. >>> Of course, with -Xint we would not generate any compiled MH intrinsics because we can guarantee that we have no compiled code. However, for testing purposes we want to make sure that we have all code heaps available and therefore need to use -XX:CompileCommand=compileonly,null::*. >>> >>> Solution: >>> I changed the jmx tests to not assume that the usage of the non-profiled code heap is predictable (see CodeCacheUtils::isCodeHeapPredictable). I added the method CodeCacheUtils::assertEQorGTE to verify that two values are equal if the corresponding code heap is predictable or fall back to the weaker condition of checking that the values are greater or equal if the code heap is not predictable. I changed the tests to use this method. >>> >>> Testing: >>> JPRT with failing test >>> >>> Thanks, >>> Tobias >>> From tobias.hartmann at oracle.com Tue Jun 30 16:24:26 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 30 Jun 2015 18:24:26 +0200 Subject: [9] RFR(S): 8129937: compiler/codecache/jmx/UsageThresholdIncreasedTest.java fails with "Usage threshold was hit" In-Reply-To: <5592BF28.4030504@oracle.com> References: <5592675A.3070607@oracle.com> <5592B4E0.9030305@oracle.com> <5592B7EC.90801@oracle.com> <5592BF28.4030504@oracle.com> Message-ID: <5592C2BA.1090702@oracle.com> Thanks, Dmitrij! Best, Tobias On 30.06.2015 18:09, Dmitrij Pochepko wrote: > Hi, > > looks good (i'm not a reviewer) > > Thanks, > Dmitrij >> Thanks, Vladimir. >> >> The author of the original change [1] is Dmitrij Pochepko (CC'ed). >> >> Dmitrij, could you have a look? >> >> Best, >> Tobias >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8059613 >> >> On 30.06.2015 17:25, Vladimir Kozlov wrote: >>> Looks fine to me. Let author (Igor Ignatyev?) to review it too. >>> >>> Thanks, >>> Vladimir >>> >>> On 6/30/15 2:54 AM, Tobias Hartmann wrote: >>>> Hi, >>>> >>>> please review the following patch. >>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8129937 >>>> http://cr.openjdk.java.net/~thartmann/8129937/webrev.00/ >>>> >>>> Problem: >>>> The jmx test disables compilation, sets a usage threshold for the non-profiled code heap and verifies that the threshold is not hit while allocating code with an overall size< threshold. The test fails because even if we disable compilation with -XX:CompileCommand=compileonly,null::*, we still generate compiled versions of MH intrinsics on method resolution and since those are stored in the non-profiled code heap we may hit the threshold. This problem potentially affects the other jmx tests as well. >>>> Of course, with -Xint we would not generate any compiled MH intrinsics because we can guarantee that we have no compiled code. However, for testing purposes we want to make sure that we have all code heaps available and therefore need to use -XX:CompileCommand=compileonly,null::*. 
>>>> >>>> Solution: >>>> I changed the jmx tests to not assume that the usage of the non-profiled code heap is predictable (see CodeCacheUtils::isCodeHeapPredictable). I added the method CodeCacheUtils::assertEQorGTE to verify that two values are equal if the corresponding code heap is predictable or fall back to the weaker condition of checking that the values are greater or equal if the code heap is not predictable. I changed the tests to use this method. >>>> >>>> Testing: >>>> JPRT with failing test >>>> >>>> Thanks, >>>> Tobias >>>> > From aph at redhat.com Tue Jun 30 17:49:30 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 30 Jun 2015 18:49:30 +0100 Subject: RFR: 8130150: RSA Acceleration In-Reply-To: <559110BF.4090804@redhat.com> References: <557ABD2E.7050608@redhat.com> <557EFF94.5000006@oracle.com> <557F042D.4060707@redhat.com> <558033C4.8040104@redhat.com> <5582F936.5020008@oracle.com> <5582FACA.4060103@redhat.com> <5582FDCA.8010507@oracle.com> <55831BC8.9060001@oracle.com> <5583D414.5050502@redhat.com> <558D7D02.6070303@redhat.com> <559103D8.1010302@oracle.com> <559110BF.4090804@redhat.com> Message-ID: <5592D6AA.8020509@redhat.com> On 06/29/2015 10:32 AM, Andrew Haley wrote: > On 29/06/15 09:37, Vladimir Kozlov wrote: >> Hi, Andrew >> >> Did you file RFE for this change? 8046943 is JEP. > > No; I will do so. Done. >> typo? "less" -> "more". >> >> + * number of ints in the number is less than this value we do not >> + * use the intrinsic. >> + */ >> + private static final int MONTGOMERY_INTRINSIC_THRESHOLD = 512; >> >> trailing spaces: >> src/java.base/share/classes/java/math/BigInteger.java:273: Trailing whitespace >> src/java.base/share/classes/java/math/BigInteger.java:2770: Trailing whitespace >> >> I ran changes through JPRT and linux/solaris passed - thanks. >> Next step - Windows: In the end I had to punt on Windows. It's ifdef'd out for someone who works on Windows to fix. >> I am fine with JDK changes. >> >> Would be nice to have a test for this change. Do existing tests >> cover this code? > > They do. They don't. The RSA tests don't run for long enough to test the intrinsics. I wrote a new test for Montgomery multiplication. >> I agree that we should limit size when to invoke multiplyToLen >> intrinsic too. File bug I will assign it. > > OK. JDK-8130154 New webrevs: http://cr.openjdk.java.net/~aph/8130150-jdk http://cr.openjdk.java.net/~aph/8130150-hs/ Andrew. From vladimir.x.ivanov at oracle.com Tue Jun 30 19:00:42 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 30 Jun 2015 22:00:42 +0300 Subject: On constant folding of final field loads In-Reply-To: <5592662C.40400@oracle.com> References: <558DFBEC.6040700@oracle.com> <55911F68.9050809@oracle.com> <559143D2.2020201@oracle.com> <5592662C.40400@oracle.com> Message-ID: <5592E75A.1000500@oracle.com> Aleksey, >>> Big picture question: do we actually care about propagating final field >>> values once the object escaped (and in this sense, available to be >>> introspected by the compiler)? >>> >>> Java memory model does not guarantee the final field visibility when the >>> object had escaped. The very reason why deserialization works is because >>> the deserialized object had not yet been published. >>> >>> That is, are we in line with the spec and general expectations by >>> folding the final values, *and* not deoptimizing on the store? >> Can you elaborate on your point and interaction with JMM a bit? 
>> >> Are you talking about not tracking constant folded final field values at >> all, since there are no guarantees by JMM such updates are visible? > > Yup. AFAIU the JMM, there is no guarantees you would see the updated > value for final field after the object had leaked. So, spec-wise you may > just use the final field values as constants. I think the only reason > you have to do the dependency tracking is when constant folding depends > on instance identity. > > So, my question is, do we knowingly make a goodwill call to deopt on > final field store, even though it is not required by spec? I am not > opposing the change, but I'd like us to understand the implications better. That's a good question. I consider it more like a quality of implementation aspect. Neither Reflection nor Unsafe APIs are part of JVM/JLS spec, so I don't think possibility of final field updates should be taken into account there. In order to avoid surprises and inconsistencies (old value vs new value depending on execution path) which are *very* hard to track down, VM should either completely forbid final field changes or keep track of them and adapt accordingly. > For example, I can see the change gives rise to some interesting > low-level coding idioms, like: > > final boolean running = true; > Field runningField = resolve(...); // reflective > > // run stuff for minutes > void m() { > while (running) { // compiler hoists, turns into while(true) > // do stuff > } > } > > void hammerTime() { > runningField.set(this, false); // deopt, break the loop! > } > > Once we allow users to go crazy like that, it would be cruel to > retract/break/change this behavior. > > But I speculate those cases are not pervasive. By and large, people care > about final ops to jump through the barriers. For example, the final > load can be commonned through the acquires / control flow. See e.g.: > http://psy-lob-saw.blogspot.ru/2014/02/when-i-say-final-i-mean-final.html > >>>> Regarding alternative approaches to track the finality, an offset bitmap >>>> on per-class basis can be used (containing locations of final fields). >>>> Possible downsides are: (1) memory footprint (1/8th of instance size per >>>> class); and (2) more complex checking logic (load a relevant piece of a >>>> bitmap from a klass, instead of checking locally available offset >>>> cookie). The advantage is that it is completely transparent to a user: >>>> it doesn't change offset translation scheme. >>> >>> I like this one. Paying with slightly larger memory footprint for API >>> compatibility sounds reasonable to me. >> >> I don't care about cases when Unsafe API is abused (e.g. raw memory >> writes on absolute address or arbitrary offset in an object). In the >> end, it's unsafe API, right? :-) > > Yeah, but with millions of users, we are in a bit of a (implicit) > compatibility bind here ;) That's why I deliberately tried to omit compatibility aspect discussion for now :-) Unsafe is unique: it's not a supported API, but nonetheless many people rely on it. It means we can't throw it away (even in a major release), but still we are not as limited as with official public API. As part of Project Jigsaw there's already an attempt to do an incompatible change for Unsafe API. Depending on how it goes, we can get some insights how to address compatibility concerns (e.g. preserve original behavior in Java 8 compatibility mode). 
What I'm trying to understand right now, before diving into compatibility details, is whether Unsafe API allows offset encoding scheme change itself and what can be done to make it happen. Though offset value is explicitly described in API as an opaque offset cookie, I spotted 2 inconsistencies in the API itself: * Unsafe.get/set*Unaligned() require absolute offsets; These methods were added in 9, so haven't leaked into public yet. Andrew, can you comment on why you decided to stick with absolute offsets and not preserving Unsafe.getInt() addressing scheme? * Unsafe.copyMemory() Source and destination addressing operate on offset cookies, but amount of copied data is expressed in bytes. In order to do bulk copies of consecutive memory blocks, the user should be able to convert offset cookies to byte offset and vice versa. There's no way to do that with current API. Are you aware of any other use cases when people rely on absolute offsets? I thought about VarHandles a bit and it seems they aren't a silver bullet - they should be based on Unsafe (or stripped Unsafe equivalent) anyway. Unsafe.fireDepChange is a viable option for Reflection and MethodHandles. I'll consider it during further explorations. The downside is that it puts responsibility of tracking final field changes on a user, which is error-prone. There are places in JDK where Unsafe is used directly and they should be analyzed whether a final field is updated or not on a case-by-case basis. It's basically opt-in vs opt-out approaches. I'd prefer a cleaner approach, if there's a solution for compatibility issues. >> So, my next question is how to proceed. Does changing API and providing >> 2 set of functions working with absolute and encoded offsets solve the >> problem? Or leaving Unsafe as is (but clarifying the API) and migrating >> Reflection/j.l.i to VarHandles solve the problem? That's what I'm trying >> to understand. > > I would think Reflection/j.l.i would eventually migrate to VarHandles > anyway. Paul? The interim solution for encoding final field flags > shouldn't leak into (even Unsafe) API, or at least should not break the > existing APIs. > > I further think that an interim solution makes auxiliary single > Unsafe.fireDepChange(Field f / long addr) or something, and uses it > along with the Unsafe calls in Reflection/j.l.i, when wrappers know they > are dealing with final fields. In other words, should we try to reuse > the knowledge those wrappers already have, instead of trying to encode > the same knowledge into offset cookies? > >>>> II. Managing relations between final fields and nmethods >>>> Another aspect is how expensive dependency checking becomes. > >>> Isn't the underlying problem being the dependencies are searched >>> linearly? At least in ConstantFieldDep, can we compartmentalize the >>> dependencies by holder class in some sort of hash table? >> In some cases (when coarse-grained (per-class) tracking is used), linear >> traversal is fine, since all nmethods will be invalidated. >> >> In order to construct a more efficient data structure, you need a way to >> order or hash oops. The problem with that is oops aren't stable - they >> can change at any GC. So, either some stable value should be associated >> with them (System.identityHashCode()?) or dependency tables should be >> updated on every GC. > > Yeah, like Symbol::_identity_hash. Symbol is an internal VM entity. Oops are different. They are just pointers to Java object (OOP = Ordinary Object Pointer). 
The only doable way is piggyback on object hash code. I won't dive into details here, but there are many intricate consequences. >> Unless existing machinery can be sped up to appropriate level, I >> wouldn't consider complicating things so much. > > Okay. I just can't escape the feeling we keep band-aiding the linear > searches everywhere in VM on case-to-case basis, instead of providing > the asymptotic guarantees with better data structures. Well, class-based dependency contexts have been working pretty well for KlassDeps. They worked pretty well for CallSiteDeps as well, once a more specific context was used (I introduced a specialized CallSite instance-based implementation because it is simpler to maintain). It's hard to come up with a narrow enough class context for ConstantFieldDeps, so, probably, it's a good time to consider a different approach to index nmethod dependencies. But assuming final field updates are rare (with the exception of deserialization), it can be not that important. >> The 3 optimizations I initially proposed allow to isolate >> ConstantFieldDep from other kinds of dependencies, so dependency >> traversal speed will affect only final field writes. Which is acceptable >> IMO. > > Except for an overwhelming number of cases where the final field stores > happen in the course of deserialization. What's particularly bad about > this scenario is that you wouldn't see the time burned in the VM unless > you employ the native profiler, as we discovered in Nashorn perf work. Yes, deserialization is a good example. It's special because it operates on freshly created objects, which, as you noted, haven't escaped yet. It'd be nice if VM can skip dependency checking in such case (either automatically or with explicit hints). In order to diagnose performance problems with excessive dependency checking, VM can monitor it closely (UsePerfData counters + JFR events + tracing should provide enough information to spot issues). > Recapping the discussion in this thread, I think we would need to have a > more thorough performance work for this change, since it touches the > very core of the platform. I think many people outside the > hotspot-compiler-dev understand some corner intricacies of the problem > that we miss. JEP and outcry for public comments, maybe? Yes, I planned to get quick feedback on the list and then file a JEP as a followup. Thanks again for the feedback, Aleksey! Best regards, Vladimir Ivanov From michael.c.berg at intel.com Tue Jun 30 19:06:39 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Tue, 30 Jun 2015 19:06:39 +0000 Subject: RFR: 8129920 - Vectorized loop unrolling In-Reply-To: References: <5591B1C4.302@oracle.com> Message-ID: Vladimir, please have a look at http://cr.openjdk.java.net/~mcberg/8129920/webrev.02 It addresses the issues below. Regards, Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Berg, Michael C Sent: Monday, June 29, 2015 5:46 PM To: Vladimir Kozlov; 'hotspot-compiler-dev at openjdk.java.net' Subject: RE: RFR: 8129920 - Vectorized loop unrolling Vladimir, sure I will change to reflect we are only allowing unrolling. For the unroll only case, we are allowing all the standard logic for unrolling to apply without unroll queries and its cases. We would need (cl->has_passed_slp() && !cl->unroll_only()) to make both the guarded cases equivalent. We have less code the way I have it. 
I could word it differently, but it would work out about the same in new code. Thanks, Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Monday, June 29, 2015 2:00 PM To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' Subject: Re: RFR: 8129920 - Vectorized loop unrolling ignore_slp() and NoMoreSlp whould be fine names if they guard only superword optimization. You use it to skipp all loop optimizations except unrolling. It should be named differently. allow_unroll_only ? why you need set_notpassed_slp()?: + // For atomic unrolled loops which are vector mapped, instigate more unrolling. + cl->set_notpassed_slp(); Thanks, Vladimir On 6/26/15 12:43 PM, Berg, Michael C wrote: > Hi Folks, > > I would like to contribute Vectorized loop unrolling. I need two > reviewers to review this patch and comment as needed: > > Bug-id: https://bugs.openjdk.java.net/browse/JDK-8129920 > > webrev: > > http://cr.openjdk.java.net/~mcberg/8129920/webrev.01/ > > With this change we leverage superword unrolling queries and superword > to stage re-entrance to ideal loop optimization. We do this when > superword succeeds on vectorizing a loop which was unroll query mapped. > When we re-enter ideal loop optimization, we have already done all > major optimizations such as peeling, splitting, rce and superword on > the vector map candidate loop. Thus we only unroll the loop. We > utilize the standard loop unrolling environment to accomplish this > with default and any applicable user settings. In this way we leverage > unroll factors from the baseline loop which are much larger to obtain > optimum throughput on x86 architectures. The uplift range on > SpecJvm2008 is seen on scimark.lu.{small,large} with uplift noted at 3% and 8% respectively. > We see as much as 1.5x uplift on vector centric micros like reductions > on default optimizations. > > Thanks, > > Michael > From vladimir.kozlov at oracle.com Tue Jun 30 19:18:14 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 30 Jun 2015 12:18:14 -0700 Subject: RFR: 8129920 - Vectorized loop unrolling In-Reply-To: References: <5591B1C4.302@oracle.com> Message-ID: <5592EB76.1070903@oracle.com> This looks good. Thanks, Vladimir On 6/30/15 12:06 PM, Berg, Michael C wrote: > Vladimir, please have a look at http://cr.openjdk.java.net/~mcberg/8129920/webrev.02 > > It addresses the issues below. > > Regards, > Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Berg, Michael C > Sent: Monday, June 29, 2015 5:46 PM > To: Vladimir Kozlov; 'hotspot-compiler-dev at openjdk.java.net' > Subject: RE: RFR: 8129920 - Vectorized loop unrolling > > Vladimir, sure I will change to reflect we are only allowing unrolling. > For the unroll only case, we are allowing all the standard logic for unrolling to apply without unroll queries and its cases. We would need (cl->has_passed_slp() && !cl->unroll_only()) to make both the guarded cases equivalent. We have less code the way I have it. I could word it differently, but it would work out about the same in new code. > > Thanks, > Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Monday, June 29, 2015 2:00 PM > To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net' > Subject: Re: RFR: 8129920 - Vectorized loop unrolling > > ignore_slp() and NoMoreSlp whould be fine names if they guard only superword optimization. 
You use it to skipp all loop optimizations except unrolling. It should be named differently. allow_unroll_only ? > > why you need set_notpassed_slp()?: > > + // For atomic unrolled loops which are vector mapped, instigate > more unrolling. > + cl->set_notpassed_slp(); > > Thanks, > Vladimir > > On 6/26/15 12:43 PM, Berg, Michael C wrote: >> Hi Folks, >> >> I would like to contribute Vectorized loop unrolling. I need two >> reviewers to review this patch and comment as needed: >> >> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8129920 >> >> webrev: >> >> http://cr.openjdk.java.net/~mcberg/8129920/webrev.01/ >> >> With this change we leverage superword unrolling queries and superword >> to stage re-entrance to ideal loop optimization. We do this when >> superword succeeds on vectorizing a loop which was unroll query mapped. >> When we re-enter ideal loop optimization, we have already done all >> major optimizations such as peeling, splitting, rce and superword on >> the vector map candidate loop. Thus we only unroll the loop. We >> utilize the standard loop unrolling environment to accomplish this >> with default and any applicable user settings. In this way we leverage >> unroll factors from the baseline loop which are much larger to obtain >> optimum throughput on x86 architectures. The uplift range on >> SpecJvm2008 is seen on scimark.lu.{small,large} with uplift noted at 3% and 8% respectively. >> We see as much as 1.5x uplift on vector centric micros like reductions >> on default optimizations. >> >> Thanks, >> >> Michael >> From igor.veresov at oracle.com Tue Jun 30 20:23:32 2015 From: igor.veresov at oracle.com (Igor Veresov) Date: Tue, 30 Jun 2015 13:23:32 -0700 Subject: RFR(S): 8079062, 8079775: Stack walking compilation policy issues Message-ID: <842772E3-C515-40CF-87A1-9C39CA48EE85@oracle.com> The stack walking compilation hasn?t been tested for a while and has bit-rotted a bit. The following change fixes: JDK-8079775 Java 9-fastdebug ia32 Error: Unimplemented with "-XX:CompilationPolicyChoice=1 -XX:-TieredCompilation" options This is just an error-reporting problem. Stack walk policy is not supported for client. JDK-8079062 Java 9-fastdebug crash(hit assertion) with "-XX:CompilationPolicyChoice=1 -XX:-TieredCompilation" options Here, after the permgen removal method handles can be only created on stack and RFrames were creating them in the resource area. In fact, since the permgen is gone, and methods cannot be removed while they are on stack, handles are unnecessary in RFrames, so I made those just Method*. http://cr.openjdk.java.net/~iveresov/8079062/webrev.00/ Thanks, igor From vladimir.kozlov at oracle.com Tue Jun 30 20:35:00 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 30 Jun 2015 13:35:00 -0700 Subject: RFR(S): 8079062, 8079775: Stack walking compilation policy issues In-Reply-To: <842772E3-C515-40CF-87A1-9C39CA48EE85@oracle.com> References: <842772E3-C515-40CF-87A1-9C39CA48EE85@oracle.com> Message-ID: <5592FD74.80509@oracle.com> Good. Thanks, Vladimir On 6/30/15 1:23 PM, Igor Veresov wrote: > The stack walking compilation hasn?t been tested for a while and has bit-rotted a bit. > The following change fixes: > > JDK-8079775 Java 9-fastdebug ia32 Error: Unimplemented with "-XX:CompilationPolicyChoice=1 -XX:-TieredCompilation" options > This is just an error-reporting problem. Stack walk policy is not supported for client. 
>
> JDK-8079062 Java 9-fastdebug crash(hit assertion) with "-XX:CompilationPolicyChoice=1 -XX:-TieredCompilation" options
> Here, after the permgen removal, method handles can only be created on the stack, and RFrames were creating them in the resource area.
> In fact, since the permgen is gone, and methods cannot be removed while they are on stack, handles are unnecessary in RFrames, so I made those just Method*.
>
> http://cr.openjdk.java.net/~iveresov/8079062/webrev.00/
>
> Thanks,
> igor
>

From igor.veresov at oracle.com Tue Jun 30 21:44:44 2015
From: igor.veresov at oracle.com (Igor Veresov)
Date: Tue, 30 Jun 2015 14:44:44 -0700
Subject: RFR(S): 8079062, 8079775: Stack walking compilation policy issues
In-Reply-To: <5592FD74.80509@oracle.com>
References: <842772E3-C515-40CF-87A1-9C39CA48EE85@oracle.com> <5592FD74.80509@oracle.com>
Message-ID: <5067CD11-F23C-4101-AD6C-B3B28D44116E@oracle.com>

Thanks, Vladimir!

igor

> On Jun 30, 2015, at 1:35 PM, Vladimir Kozlov wrote:
>
> Good.
>
> Thanks,
> Vladimir
>
> On 6/30/15 1:23 PM, Igor Veresov wrote:
>> The stack walking compilation policy hasn't been tested for a while and has bit-rotted a bit.
>> The following change fixes:
>>
>> JDK-8079775 Java 9-fastdebug ia32 Error: Unimplemented with "-XX:CompilationPolicyChoice=1 -XX:-TieredCompilation" options
>> This is just an error-reporting problem. Stack walk policy is not supported for client.
>>
>> JDK-8079062 Java 9-fastdebug crash(hit assertion) with "-XX:CompilationPolicyChoice=1 -XX:-TieredCompilation" options
>> Here, after the permgen removal, method handles can only be created on the stack, and RFrames were creating them in the resource area.
>> In fact, since the permgen is gone, and methods cannot be removed while they are on stack, handles are unnecessary in RFrames, so I made those just Method*.
>>
>> http://cr.openjdk.java.net/~iveresov/8079062/webrev.00/
>>
>> Thanks,
>> igor
>>

From vladimir.kozlov at oracle.com Tue Jun 30 22:34:22 2015
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Tue, 30 Jun 2015 15:34:22 -0700
Subject: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord
In-Reply-To: <558CA37B.2040701@oracle.com>
References: <39F83597C33E5F408096702907E6C450E4D6F7@ORSMSX104.amr.corp.intel.com> <55835DFC.70001@oracle.com> <39F83597C33E5F408096702907E6C450E56020@ORSMSX104.amr.corp.intel.com> <558CA37B.2040701@oracle.com>
Message-ID: <5593196E.4090407@oracle.com>

I forgot to publish the updated webrev:

http://cr.openjdk.java.net/~kvn/8085932/webrev.01/

I will push it after Michael's "Vectorized loop unrolling" is reviewed and pushed.

Vladimir

On 6/25/15 5:57 PM, Vladimir Kozlov wrote:
> Okay, this is better.
>
> Thanks,
> Vladimir
>
> On 6/25/15 2:51 PM, Civlin, Jan wrote:
>>
>> Vladimir,
>>
>> Here is the updated patch with the trace code hidden in a new nested class Trace, which contains all the messages. The Trace
>> class is compiled only in NOT_PRODUCT.
>> Looks much simpler now (of course more lines, but all outside of the algorithmic part).
>>
>> Thank you,
>>
>> Jan.
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Thursday, June 18, 2015 5:11 PM
>> To: Civlin, Jan; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR(S): 8085932: Fixing bugs in detecting memory alignments in SuperWord
>>
>> Thank you, Jan
>>
>> Fixes look good but it would be nice if you replaced some tracing code
>> with function calls. In some places the execution code is hard to read
>> because of big tracing code.
>> For example, in SuperWord::memory_alignment() and in SWPointer methods.
>>
>> One way to do that is to declare trace methods with an empty body in the
>> product build; for example, for SWPointer::scaled_iv_plus_offset() you
>> may have a new method declaration (not under #ifdef) in superword.hpp:
>>
>> class SWPointer VALUE_OBJ_CLASS_SPEC {
>>
>> void trace_1_scaled_iv_plus_offset(...) PRODUCT_RETURN;
>>
>> and in superword.cpp you will put the method under ifdef:
>>
>> #ifndef PRODUCT
>> void trace_1_scaled_iv_plus_offset(...) {
>>   ....
>> }
>> #endif
>>
>> Then you can simply use it without ifdefs in code:
>>
>> bool SWPointer::scaled_iv_plus_offset(Node* n) {
>> + trace_1_scaled_iv_plus_offset(...);
>> +
>>   if (scaled_iv(n)) {
>>
>> Note, macro PRODUCT_RETURN is defined as:
>>
>> #ifdef PRODUCT
>> #define PRODUCT_RETURN {}
>> #else
>> #define PRODUCT_RETURN /*next token must be ;*/
>> #endif
>>
>> Thanks,
>> Vladimir
>>
>> On 6/8/15 9:15 AM, Civlin, Jan wrote:
>>> Hi All,
>>>
>>>
>>> We would like to contribute to Fixing bugs in detecting memory
>>> alignments in SuperWord.
>>>
>>> The contribution Bug ID: 8085932.
>>>
>>> Please review this patch:
>>>
>>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8085932
>>>
>>> webrev: http://cr.openjdk.java.net/~kvn/8085932/webrev.00/
>>>
>>>
>>> *Description*: Fixing bugs in detecting memory alignments in
>>> SuperWord
>>>
>>> Fixing bugs in detecting memory alignments in SuperWord:
>>> SWPointer::scaled_iv_plus_offset (fixing here a bug in detection of
>>> "scale"),
>>> SWPointer::offset_plus_k (fixing here a bug in detection of "invariant"),
>>>
>>> Add tracing output to the code that deals with memory alignment. The
>>> following routines are traceable:
>>>
>>> SWPointer::scaled_iv_plus_offset
>>> SWPointer::offset_plus_k
>>> SWPointer::scaled_iv,
>>> SWPointer::SWPointer,
>>> SuperWord::memory_alignment
>>>
>>> Tracing is done only for NOT_PRODUCT. Currently tracing is controlled by
>>> VectorizeDebug:
>>>
>>> #ifndef PRODUCT
>>> if (_phase->C->method() != NULL) {
>>>   _phase->C->method()->has_option_value("VectorizeDebug", _vector_loop_debug);
>>> }
>>> #endif
>>>
>>> And VectorizeDebug may take any combination (bitwise OR) of the
>>> following values:
>>> bool is_trace_alignment() { return (_vector_loop_debug & 2) > 0; }
>>> bool is_trace_mem_slice() { return (_vector_loop_debug & 4) > 0; }
>>> bool is_trace_loop() { return (_vector_loop_debug & 8) > 0; }
>>> bool is_trace_adjacent() { return (_vector_loop_debug & 16) > 0; }
>>
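The PRODUCT_RETURN pattern discussed above can be shown as a small self-contained sketch. The class name SWPointerSketch, the int opcode parameter, printf and main() below are illustrative stand-ins only, not the actual superword.hpp/superword.cpp declarations; the point is simply that the trace method gets an empty inline body in product builds and a real out-of-line body otherwise, so the call site needs no #ifdef.

#include <cstdio>

// PRODUCT_RETURN as quoted in the message above: in a product build the
// declaration itself carries an empty body, so calls compile to nothing;
// in a non-product build the real body lives in the .cpp file.
#ifdef PRODUCT
#define PRODUCT_RETURN {}
#else
#define PRODUCT_RETURN /* next token must be ; */
#endif

// --- would sit in the .hpp (sketch only) ---
class SWPointerSketch {
 public:
  bool scaled_iv_plus_offset(int opcode);
  void trace_scaled_iv_plus_offset(int opcode) PRODUCT_RETURN;
};

// --- would sit in the .cpp (sketch only) ---
#ifndef PRODUCT
void SWPointerSketch::trace_scaled_iv_plus_offset(int opcode) {
  printf("scaled_iv_plus_offset: testing node with opcode %d\n", opcode);
}
#endif

bool SWPointerSketch::scaled_iv_plus_offset(int opcode) {
  trace_scaled_iv_plus_offset(opcode);  // no #ifdef needed at the call site
  // ... the real pattern-matching logic would go here ...
  return opcode != 0;
}

int main() {
  SWPointerSketch p;
  return p.scaled_iv_plus_offset(42) ? 0 : 1;
}

Built with -DPRODUCT the trace call still compiles but does nothing; without it, the out-of-line body prints the message.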
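Likewise, a minimal standalone model of the VectorizeDebug bit mask quoted above; the struct and main() are assumptions made for the sketch, since in HotSpot the _vector_loop_debug value comes from the per-method VectorizeDebug option rather than a constructor argument.

#include <cstdio>

// Each trace category is one bit; any bitwise-OR combination can be enabled.
struct VectorLoopDebugSketch {
  unsigned _vector_loop_debug;

  bool is_trace_alignment() const { return (_vector_loop_debug & 2)  > 0; }
  bool is_trace_mem_slice() const { return (_vector_loop_debug & 4)  > 0; }
  bool is_trace_loop()      const { return (_vector_loop_debug & 8)  > 0; }
  bool is_trace_adjacent()  const { return (_vector_loop_debug & 16) > 0; }
};

int main() {
  // 2 | 8 = 10 enables the alignment and loop traces but not the other two.
  VectorLoopDebugSketch dbg{2u | 8u};
  printf("alignment=%d mem_slice=%d loop=%d adjacent=%d\n",
         dbg.is_trace_alignment(), dbg.is_trace_mem_slice(),
         dbg.is_trace_loop(), dbg.is_trace_adjacent());
  return 0;
}

With the value 10 this prints alignment=1 mem_slice=0 loop=1 adjacent=0, i.e. only the alignment and loop traces are turned on.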