From vladimir.kozlov at oracle.com Fri Sep 1 02:19:27 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Thu, 31 Aug 2017 19:19:27 -0700
Subject: [10] RFR(L) 8187047: [AOT] jaotc --ignore-errors flag is ignored when class not found
Message-ID:

http://cr.openjdk.java.net/~kvn/8187047/webrev/
https://bugs.openjdk.java.net/browse/JDK-8187047

Currently jaotc exits with java.lang.ClassNotFoundException when a referenced class is not found, regardless of the --ignore-errors flag.
I fixed it by consolidating the code that processes exceptions while collecting the classes and methods to compile. Now Main.handleClassLoadError(InternalError err) should be called. For that I have to pass a Main parameter to the methods which collect classes.

Errors during class loading and compilation will now be reported only with --verbose or if --exit-on-error is set. The stack trace will be printed with --debug only.
No errors are reported with --info; only phases and times will be displayed.

I also added checks to exit early if no classes or methods were found.

And I had to fix the AOT junit tests accordingly.

Passed all AOT tests in JPRT.

Thanks,
Vladimir

From igor.veresov at oracle.com Fri Sep 1 04:52:41 2017
From: igor.veresov at oracle.com (Igor Veresov)
Date: Thu, 31 Aug 2017 21:52:41 -0700
Subject: [10] RFR(L) 8187047: [AOT] jaotc --ignore-errors flag is ignored when class not found
In-Reply-To:
References:
Message-ID: <57DA4265-E540-4848-A40F-61034F151E0D@oracle.com>

Maybe there should be an interface to abstract the error-handling functionality? Like make a ClassLoadingErrorHandler interface or something, and make Main implement it? Otherwise passing the concrete Main everywhere ties everything up too much.

igor

> On Aug 31, 2017, at 7:19 PM, Vladimir Kozlov wrote:
>
> http://cr.openjdk.java.net/~kvn/8187047/webrev/
> https://bugs.openjdk.java.net/browse/JDK-8187047
>
> Currently jaotc will exit with java.lang.ClassNotFoundException when referenced class not found regardless --ignore-errors flag.
> I fixed it by consolidating code how exceptions are processed during collecting classes and methods to compile. Now Main.handleClassLoadError(InternalError err) should be called. For that I have to path Main parameter to methods which collect classes. > > Error during class loading and compilation now will be reported only with --verbose or if --exit-on-error is set. Stack trace will be printed with --debug only. > No errors reporting with --info, only phases and times will be displayed. > > Also added checks to exit early if no classes or methods were found. > > And have to fix AOT junit tests accordingly. > > Passed all AOT tests in JPRT. > > Thanks, > Vladimir From aph at redhat.com Fri Sep 1 07:51:59 2017 From: aph at redhat.com (Andrew Haley) Date: Fri, 1 Sep 2017 08:51:59 +0100 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> Message-ID: <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> On 31/08/17 23:46, Dmitrij Pochepko wrote: > I tried a number of initial versions first. I also tried to use wider > multiplication via umulh (and larger load instructions like ldp/ldr), > but after measuring all versions I've found that version I've initially > sent appeared to be the fastest (I was measuring it on ThunderX which I > have in hand). It might be because of lots of additional ror(..., 32) > operations in other versions to convert values from initial layout to > register and back. Another reason might be more complex overall logic > and larger code, which triggers more icache lines to be loaded. Or even > some umulh specifics on some CPUs. So, after measuring, I've abandoned > these versions in a middle of development and polished the fastest one. > I have some raw development unpolished versions of such approaches > left(not sure I have debugged versions saved, but at least has an > overall idea). 
> I attached squares_v2.3.1.diff: early version which is using mul/umulh > for just one case. It was surprisingly slower for this case than version > I've sent to review, so, I've abandoned this approach. > I've also tried version with large load instructions(ldp/ldr): > squares_v1.diff and it was also slower(it has another, slower, mul_add > loop implementation, but I was comparing to the same version, which is > using ldrw-only). > > I'm not sure if I should use 64-bit multiplications and/or 64/128 bit > loads. I can try to return back to one of such versions and try to > polish it, but I'll probably get slower results again on h/w I have and > it's not clear if it'll be faster on any other h/w(which one? It takes a > lot of time to iteratively improve and measure every version on > respective h/w). I'm using Applied Micro hardware for my testing at the moment. I did the speed testing for Montgomery multiply on ThunderX. I appreciate that it's difficult to get the 64-bit version right and fast, but you should see about 3 - 3.5* speedup over the pure Java version if you get it right. That's what I saw when I did the Montgomery multiply. You do have to pipeline the loads and the multiplies to avoid stalls. Be aware that squareToLen is not used at all when running the RSA benchmark with C2. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. 
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From martin.doerr at sap.com Fri Sep 1 15:39:32 2017 From: martin.doerr at sap.com (Doerr, Martin) Date: Fri, 1 Sep 2017 15:39:32 +0000 Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic In-Reply-To: <34e6550d426440bab3b8a54a82e25190@sap.com> References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> Message-ID: Hi Gustavos, I have managed to upload a version which seems to work on both endianness implementations. At least some quick tests have passed on AIX and Big Endian linux in addition to Little Endian linux. Webrev is here: http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.02/ I'll be out next week, but the change looks ok for me. Please let me know if the changed version still looks ok for you, too. Feel free to overwork or improve it. It'd also be good to know, if relying on vrsave=-1 is safe. Is the copyright information ok? Did you get source code which requires to be mentioned in the comments? The code looks similar to a reference implementation, so the authors of it may want to be mentioned? Or did you just use the paper for implementing it? In this case, I'd mention the paper. After we got a second review and ran more tests, we can ask somebody from Oracle to push it. Thanks for contributing and your support, Martin -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Doerr, Martin Sent: Donnerstag, 31. August 2017 18:21 To: Gustavo Romero Cc: 'hotspot-compiler-dev at openjdk.java.net' ; ppc-aix-port-dev at openjdk.java.net Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic Hi Gustavo R, I guess you're right. vrsave is already set to -1, so all Vector Registers get saved. 
It'd be good to know where it is set (OS, Flag in ELF header, ???) and if this is guaranteed. I don't want to risk getting sporadic errors on some OS versions. I'd like to enable SHA intrinsics on linux BE as well. I already managed to get the 256 bit version working (was quite some work!). Thanks and best regards, Martin -----Original Message----- From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] Sent: Freitag, 25. August 2017 22:35 To: Doerr, Martin Cc: Gustavo Serra Scalet ; 'hotspot-compiler-dev at openjdk.java.net' ; ppc-aix-port-dev at openjdk.java.net Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic Hi Martin, On 25-08-2017 13:18, Doerr, Martin wrote: > I think you didn't get my point about AIX. > Your current version doesn't break AIX, but it lacks SHA2 acceleration for AIX on Power 8 and newer, which is still relevant. > So I'd like to ask you kindly to take a look if Big Endian support for the stub could be added without high effort. AIX doesn't need VRSAVE handling (like Little Endian linux, unlike Big Endian linux), so a few lines in the stub could possibly be enough. I can assist with testing. I don't think that VRSAVE is handled on Linux, even on BE. Although BE ABI [1] says: "Functions must ensure that the appropriate bits in the vrsave register are set for any vector registers they use" and LE ABI does not say that, even on Linux BE VRSAVE is not in effect used to determine which vector registers (VMX/Altivec) should be saved/restored. No application uses it on Linux, so I would say that VRSAVE is ignored on Linux completely both on BE and LE. save/restore library interfaces don't pay attention to it in glibc: VRSAVE is just saved/restored completely in mechanisms of swap/get/setcontext(), set/longjump(), and dl-trampoline() and that's all. I checked that with toolchain folks and they agree. 
We've already discussed that a long time ago but at that time I was just using the vector-scalar registers [2] and at that time I agreed that if VMX/Altivec was in use instead of the VSX so VRSAVE should be handled accordingly. But I have a different opinion now... I'm wondering if something would really break on Linux BE if we forget about VRSAVE at all in the JVM. If not, we could forget about VRSAVE forever on Linux. Looks like VRSAVE was sort of born to the oblivion... ? Kind regards, Gustavo [1] http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html [2] http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002508.html From gromero at linux.vnet.ibm.com Fri Sep 1 16:04:27 2017 From: gromero at linux.vnet.ibm.com (Gustavo Romero) Date: Fri, 1 Sep 2017 13:04:27 -0300 Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic In-Reply-To: References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> Message-ID: <59A9850B.7030302@linux.vnet.ibm.com> Hi Martin! On 01-09-2017 12:39, Doerr, Martin wrote: > Hi Gustavos, > > I have managed to upload a version which seems to work on both endianness implementations. > At least some quick tests have passed on AIX and Big Endian linux in addition to Little Endian linux. Great! :-) > I'll be out next week, but the change looks ok for me. Please let me know if the changed version still looks ok for you, too. Feel free to overwork or improve it. > It'd also be good to know, if relying on vrsave=-1 is safe. Sure, Martin. I'm chasing what's exactly setting vrsave=-1 and the full history log (looks like it's not in the kernel, but I'm checking yet). > Is the copyright information ok? Did you get source code which requires to be mentioned in the comments? 
> The code looks similar to a reference implementation, so the authors of it may want to be mentioned? > Or did you just use the paper for implementing it? In this case, I'd mention the paper. Gustavo S: the information on the paper must be updated accordingly as Martin noted in the new webrev. There is none currently. > After we got a second review and ran more tests, we can ask somebody from Oracle to push it. > > Thanks for contributing and your support, > Martin Thanks a lot for reviewing and for all the help. Regards, Gustavo R > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Doerr, Martin > Sent: Donnerstag, 31. August 2017 18:21 > To: Gustavo Romero > Cc: 'hotspot-compiler-dev at openjdk.java.net' ; ppc-aix-port-dev at openjdk.java.net > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Gustavo R, > > I guess you're right. vrsave is already set to -1, so all Vector Registers get saved. > It'd be good to know where it is set (OS, Flag in ELF header, ???) and if this is guaranteed. > I don't want to risk getting sporadic errors on some OS versions. > > I'd like to enable SHA intrinsics on linux BE as well. I already managed to get the 256 bit version working (was quite some work!). > > Thanks and best regards, > Martin > > > -----Original Message----- > From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] > Sent: Freitag, 25. August 2017 22:35 > To: Doerr, Martin > Cc: Gustavo Serra Scalet ; 'hotspot-compiler-dev at openjdk.java.net' ; ppc-aix-port-dev at openjdk.java.net > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Martin, > > On 25-08-2017 13:18, Doerr, Martin wrote: >> I think you didn't get my point about AIX. >> Your current version doesn't break AIX, but it lacks SHA2 acceleration for AIX on Power 8 and newer, which is still relevant. 
>> So I'd like to ask you kindly to take a look if Big Endian support for the stub could be added without high effort. AIX doesn't need VRSAVE handling (like Little Endian linux, unlike Big Endian linux), so a few lines in the stub could possibly be enough. I can assist with testing. > > I don't think that VRSAVE is handled on Linux, even on BE. Although BE ABI [1] > says: > > "Functions must ensure that the appropriate bits in the vrsave register are set for any vector registers they use" > > and LE ABI does not say that, even on Linux BE VRSAVE is not in effect > used to determine which vector registers (VMX/Altivec) should be saved/restored. > No application uses it on Linux, so I would say that VRSAVE is ignored on Linux > completely both on BE and LE. save/restore library interfaces don't pay > attention to it in glibc: VRSAVE is just saved/restored completely in mechanisms > of swap/get/setcontext(), set/longjump(), and dl-trampoline() and that's all. I > checked that with toolchain folks and they agree. We've already discussed that a > long time ago but at that time I was just using the vector-scalar registers [2] > and at that time I agreed that if VMX/Altivec was in use instead of the VSX so > VRSAVE should be handled accordingly. But I have a different opinion now... > > I'm wondering if something would really break on Linux BE if we forget about > VRSAVE at all in the JVM. If not, we could forget about VRSAVE forever on Linux. > Looks like VRSAVE was sort of born to the oblivion... ? 
>
>
> Kind regards,
> Gustavo
>
> [1] http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
> [2] http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002508.html
>

From gustavo.scalet at eldorado.org.br Fri Sep 1 16:22:27 2017
From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet)
Date: Fri, 1 Sep 2017 16:22:27 +0000
Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
In-Reply-To:
References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com>
Message-ID:

Hi Martin,

Thanks for your changes. I'm gladly helping you on reviewing it.

A few nits I found:
1) I guess you may remove the #else on vm_version_ppc.cpp, as the internal checks are sufficient now.
2) On macroAssembler_ppc_sha.cpp there are a bunch of instructions (mostly vperm) that follow a simple pattern depending on endianness (swapping 2nd and 3rd parameters). I would avoid these many changes by overloading that with something else. E.g:
   a) a macro wrapper on that file
   b) a new interface to this kind of instruction on Assembler (if you see it as being used elsewhere too)

And now, about your questions:

> -----Original Message-----
> From: Doerr, Martin
> Is the copyright information ok?

Yes, I verified with my company legal department and we can put it down as copyright owner.
> Did you get source code which requires to be mentioned in the comments? We had some intermediate POC programs, but they are all original work, so I don't think we need to relate to anything else if we don't want to. > The code looks similar to a reference implementation, so the authors of > it may want to be mentioned? > Or did you just use the paper for implementing it? In this case, I'd > mention the paper. Which reference implementation? If you mean the paper I already told you, you may add it too: http://www.iwar.org.uk/comsec/resources/cipher/sha256-384-512.pdf We used that paper to understand SHA-2 and his tests (at the beginning) to verify if the implementation was working as expected. > After we got a second review and ran more tests, we can ask somebody > from Oracle to push it. Great. I see that it may take some weeks as JDK10 repo is being frozen for 2 weeks. No problem. Have a nice week ahead. > > Thanks for contributing and your support, Martin > > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > bounces at openjdk.java.net] On Behalf Of Doerr, Martin > Sent: Donnerstag, 31. August 2017 18:21 > To: Gustavo Romero > Cc: 'hotspot-compiler-dev at openjdk.java.net' dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Gustavo R, > > I guess you're right. vrsave is already set to -1, so all Vector > Registers get saved. > It'd be good to know where it is set (OS, Flag in ELF header, ???) and > if this is guaranteed. > I don't want to risk getting sporadic errors on some OS versions. > > I'd like to enable SHA intrinsics on linux BE as well. I already managed > to get the 256 bit version working (was quite some work!). > > Thanks and best regards, > Martin > > > -----Original Message----- > From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] > Sent: Freitag, 25. 
August 2017 22:35 > To: Doerr, Martin > Cc: Gustavo Serra Scalet ; 'hotspot- > compiler-dev at openjdk.java.net' ; > ppc-aix-port-dev at openjdk.java.net > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Martin, > > On 25-08-2017 13:18, Doerr, Martin wrote: > > I think you didn't get my point about AIX. > > Your current version doesn't break AIX, but it lacks SHA2 acceleration > for AIX on Power 8 and newer, which is still relevant. > > So I'd like to ask you kindly to take a look if Big Endian support for > the stub could be added without high effort. AIX doesn't need VRSAVE > handling (like Little Endian linux, unlike Big Endian linux), so a few > lines in the stub could possibly be enough. I can assist with testing. > > I don't think that VRSAVE is handled on Linux, even on BE. Although BE > ABI [1] > says: > > "Functions must ensure that the appropriate bits in the vrsave register > are set for any vector registers they use" > > and LE ABI does not say that, even on Linux BE VRSAVE is not in effect > used to determine which vector registers (VMX/Altivec) should be > saved/restored. > No application uses it on Linux, so I would say that VRSAVE is ignored > on Linux completely both on BE and LE. save/restore library interfaces > don't pay attention to it in glibc: VRSAVE is just saved/restored > completely in mechanisms of swap/get/setcontext(), set/longjump(), and > dl-trampoline() and that's all. I checked that with toolchain folks and > they agree. We've already discussed that a long time ago but at that > time I was just using the vector-scalar registers [2] and at that time I > agreed that if VMX/Altivec was in use instead of the VSX so VRSAVE > should be handled accordingly. But I have a different opinion now... > > I'm wondering if something would really break on Linux BE if we forget > about VRSAVE at all in the JVM. If not, we could forget about VRSAVE > forever on Linux. > Looks like VRSAVE was sort of born to the oblivion... ? 
>
>
> Kind regards,
> Gustavo
>
> [1] http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
> [2] http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002508.html

From martin.doerr at sap.com Fri Sep 1 16:28:56 2017
From: martin.doerr at sap.com (Doerr, Martin)
Date: Fri, 1 Sep 2017 16:28:56 +0000
Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
In-Reply-To: <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br>
References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br> <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com> <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br> <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com> <0ef23b5fcbc54996aea876d4c60e4097@sap.com> <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br>
Message-ID: <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com>

Hi Gustavo,

your first webrev already works on Big Endian. So the only required change is to fix your new code by this trivial patch:

--- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:47:45 2017 +0200
+++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:55:08 2017 +0200
@@ -3426,7 +3426,9 @@
     __ srdi (product, product, 1);
     // join them to the same register and store it as Little Endian
     __ orr (product, lplw_s, product);
+#ifdef VM_LITTLE_ENDIAN
     __ rldicl (product, product, 32, 0);
+#endif
     __ stdu (product, 8, out_aux);
     __ bdnz (LOOP_SQUARE);

So please enable it again for Big Endian in vm_version_ppc. Besides that, it looks good to me. We also need a 2nd review.

Best regards,
Martin

-----Original Message-----
From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
Sent: Mittwoch, 30.
August 2017 19:03 To: Doerr, Martin ; 'hotspot-compiler-dev at openjdk.java.net' Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics Hi Martin, (webrev at the end) > -----Original Message----- > From: Doerr, Martin > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to > > need further changes as it's being cleared with clrldi, which is the > > same as rldic with no shift. Therefore it's treated appropriately as > > requested for "offset" parameter. Do you agree? > > No, I didn't find clrldi for len in generate_mulAdd(). Only for k. I'm sorry. I was thinking about "offset" and "k", which are both cleaned on generate_mulAdd(). "len" was not cleaned and it was being used on muladd() directly with cmpdi, which could lead to problems. That is being changed. > Where are in_len and out_len fixed up in generate_squareToLen()? They are not. According to your suggestions, I agree it also needs to be done for the same reason. > > You are right. The way I'm building the 64 bits of the register > > depends on which kind of endianness it is run. For now it works only on > > little endian so I'm adding a switch (just like I did for SHA) to make > > it available only on little endian systems. > > It shouldn't be that hard to get it working on big endian ;-) Btw., my > point was not to replace the 2 4-byte store instructions by an 8-byte > one (though I'm also ok with that). It was that 2 stwu which update the > same pointer doesn't make sense from performance point of view. Please > keep something which works on big endian, too. I see. 
The 2x stwu was being used like that because it was the trivial approach when considering the original java update: z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); z[i++] = (int)(product >>> 1); As you pointed out, that might cause some stall on the pipeline so I made it with 1s stdu (and could improve code by reducing 1 instruction) Now about having a big endian version: I'm not confident in doing so as I don't have access to such a machine at the moment. You were kind on offering test support but I don't know if it'd work like that. I may support you in checking out which places are endianness-related but I'm not comfortable in sending you untested code. Would you be interested in doing such a changes for making it work on Big Endian? For this patch, I provided an interesting test that might help you to verify if it worked. > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31 > > 2017). The reported performance speedup was calculated by running the > > following test (TestSquareToLen.java): > > Seems like JDK-8145913 has not been backported, yet. Sorry for not > checking this earlier. So if you want to make RSA really fast, it should > be so much better to backport that one. But I can still sponsor this > change as it may be used elsewhere. No problem. It's nice to know that I may not need to request a backport of this patch for performance reasons. And at last, but not least, the new webrev with these clrldi changes: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.03/index.html Thank you once again, Gustavo Serra Scalet > Best regards, > Martin > > > -----Original Message----- > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > Sent: Dienstag, 29. 
August 2017 22:37 > To: Doerr, Martin ; 'hotspot-compiler- > dev at openjdk.java.net' > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > SquareToLen intrinsics > > Hi Martin, > > New changes: > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > Check comments below, please. > > > -----Original Message----- > > From: Doerr, Martin > > > > 1. Sign extending offset and len > > Right, sign and zero extending is equivalent for offset and len > > because they are guaranteed to be >=0 (by checks in Java). But you can > > only rely on bit 32 (IBM notation) to be 0. Bit 0-31 may contain > garbage. > > rldicl was incorrect. My mistake, sorry for that. Correct would be > > rldic which also clears the least significant bits. > > len should also get fixed e.g. by replacing cmpdi by extsw_ in muladd. > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to > need further changes as it's being cleared with clrldi, which is the > same as rldic with no shift. Therefore it's treated appropriately as > requested for "offset" parameter. Do you agree? > > > 2. Using 8 byte instructions for int > > The code which feeds stdu is endianess specific. Doesn't work on all > > PPC64 platforms. > > You are right. The way I'm building the 64 bits of the register depends > on which kind of endianness it is run. For now it works only on little > endian so I'm adding a switch (just like I did for SHA) to make it > available only on little endian systems. > > > 3.Regarding Andrew's point: Superseded by Montgomery? > > The Montgomery change got backported to jdk8u (JDK-8150152 in 8u102). > > I'd expect the performance improvement of these intrinsics to be > > irrelevant for crypto.rsa. Did you measure with an older jdk8 release? > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31 > 2017). 
> The reported performance speedup was calculated by running the
> following test (TestSquareToLen.java):
> import java.math.BigInteger;
>
> public class TestSquareToLen {
>
>   public static void main(String args[]) throws Exception {
>
>     int n = 10000000;
>     if (args.length >= 1) {
>       n = Integer.parseInt(args[0]);
>     }
>
>     BigInteger b1 = new BigInteger(
>         "348939809235573590863505149820825039200022983118773208599936" +
>         "739559418380102146884307139175604920787313701663155983793121475492609222" +
>         "378029211020760922327218480828933663005773596942372680852064103011811651" +
>         "644018048833823482390819947896524207635857984552089977996313113154016668" +
>         "718795349783157384006672542605760392289645528307");
>     BigInteger b2 = BigInteger.valueOf(0);
>     BigInteger check = BigInteger.valueOf(1);
>     for (int i = 0; i < n; i++) {
>       b2 = b1.multiply(b1);
>       if (i == 0)
>         // Didn't JIT yet. Comparing against interpreted mode
>         check = b2;
>     }
>     if (b2.compareTo(check) == 0)
>       System.out.println("Check ok!");
>     else
>       System.out.println("Check failed!");
>   }
> }
>
>
> I got these results on JDK8 on my POWER8 machine:
> $ ./javac TestSquareToLen.java
> $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen
> Check ok!
> Check ok!
> Check ok!
> Check ok!
> Check ok!
>
>  Performance counter stats for './java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
>
>     15148.009557  task-clock (msec)        #    1.053 CPUs utilized            ( +-  0.48% )
>            2,425  context-switches         #    0.160 K/sec                    ( +-  5.84% )
>              356  cpu-migrations           #    0.023 K/sec                    ( +-  3.01% )
>            5,153  page-faults              #    0.340 K/sec                    ( +-  5.22% )
>   54,536,889,909  cycles                   #    3.600 GHz                      ( +-  0.56% )  (66.68%)
>      239,554,105  stalled-cycles-frontend  #    0.44% frontend cycles idle     ( +-  4.87% )  (49.90%)
>   27,683,316,001  stalled-cycles-backend   #   50.76% backend cycles idle      ( +-  0.56% )  (50.17%)
>  102,020,229,733  instructions             #    1.87  insn per cycle
>                                            #    0.27  stalled cycles per insn  ( +-  0.14% )  (66.94%)
>    7,706,072,218  branches                 #  508.718 M/sec                    ( +-  0.23% )  (50.20%)
>      456,051,162  branch-misses            #    5.92% of all branches          ( +-  0.09% )  (50.07%)
>
>     14.390840733 seconds time elapsed                                          ( +-  0.09% )
>
> $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen
> Check ok!
> Check ok!
> Check ok!
> Check ok!
> Check ok!
>
>  Performance counter stats for './java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
>
>     11368.141410  task-clock (msec)        #    1.045 CPUs utilized            ( +-  0.64% )
>            1,964  context-switches         #    0.173 K/sec                    ( +-  8.93% )
>              338  cpu-migrations           #    0.030 K/sec                    ( +-  7.65% )
>            5,627  page-faults              #    0.495 K/sec                    ( +-  6.15% )
>   41,100,168,967  cycles                   #    3.615 GHz                      ( +-  0.50% )  (66.36%)
>      309,052,316  stalled-cycles-frontend  #    0.75% frontend cycles idle     ( +-  2.84% )  (49.89%)
>   14,188,581,685  stalled-cycles-backend   #   34.52% backend cycles idle      ( +-  0.99% )  (50.34%)
>   77,846,029,829  instructions             #    1.89  insn per cycle
>                                            #    0.18  stalled cycles per insn  ( +-  0.29% )  (66.96%)
>    8,435,216,989  branches                 #  742.005 M/sec                    ( +-  0.28% )  (50.17%)
>      339,903,936  branch-misses            #    4.03% of all branches          ( +-  0.27% )  (49.90%)
>
>     10.882357546 seconds time elapsed                                          ( +-  0.24% )
>
>
> (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s (+- 0.53%) on JDK10)
>
> I may run for SpecJVM2008's crypto.rsa if you are interested.
>
> Thank you once again for reviewing this.
>
> Best regards,
> Gustavo
>
> > (I think the change is still acceptable as the intrinsics could be
> > used elsewhere and the implementation also exists on other platforms.)
> >
> > Best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> > Sent: Mittwoch, 16. August 2017 18:50
> > To: Doerr, Martin ; 'hotspot-compiler-dev at openjdk.java.net'
> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
> >
> > Hi Martin,
> >
> > Thanks for dedicated review. It took me a while to be able to work on
> > this but I hope to have your points solved.
Please check below the > > review as well as my comments quoting your email: > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/ > > > > > -----Original Message----- > > > First of all, C2 does not perform sign extend when calling stubs. > > > The int parms need to get zero/sign extended. (Could even be done > > > without extra instructions by replacing sldi -> rldicl, cmpdi -> > > > extsw_ in some > > > cases.) > > > > Does it make a difference on my case? > > > > I guess you are talking about mulAdd preparation code. The only aspect > > I found about him is to force the cast from 32 bits -> 64 bits by > > cleaning higher bits. Offset is a signed integer but it can't be > negative anyway. > > > > So I changed from: > > sldi (R5_ARG3, R5_ARG3, 2); > > > > to: > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > macroAssembler_ppc.cpp: > > > - Indentation should be 2 spaces. > > > > Done > > > > > > > stubGenerator_ppc:cpp: > > > - or_, addi_ should get replaced by orr, addi when CR0 result is not > > > needed. > > > > Done > > > > > - Where is lplw initialized? > > > > It should be initialized with 0, I missed that... > > > > > - I believe that the updating load/store instructions e.g. lwzu > > > don't perform well on some processors. At least using stwu 2 times > > > in the loop doesn't make sense. > > > > You are right. I could manipulate the bits differently and ended up > > with a single stdu in the loop. Neat! Although I could not reduce the > > total number of instructions. > > > > > - Note: It should be possible to use 8 byte instead of 4 byte > > > instructions: MacroAssembler::multiply64, addc, adde. But I'm not > > > requesting to change that because I guess it would make the code > > > very complicated, especially when supporting both endianess > versions. > > > > Yes, that would require a new analysis on this code. May we consider > > it next? 
As you said, I prefer having an initial version that looks as > > simple as the original java code. > > > > > - The squareToLen stub implementation is very close the Java > > > implementation. So it'd be interesting to understand what C2 doesn't > > > do as well as the hand written assembly code. Do you know that? (Not > > > absolutely necessary for accepting this change as long as the stub > > > is measurably faster.) > > > > I don't know either. Basically I chose doing it because I noticed some > > performance gain on SpecJVM2008 when analyzing X64. Then, taking a > > closer look, I didn't notice any AVX or some special instructions on > > X64 so I decided to try it on ppc64 by using some basic assembly. > > > > Thanks > > > > > > > > Best regards, > > > Martin > > > > > > > > > -----Original Message----- > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > Sent: Donnerstag, 10. August 2017 19:22 > > > To: 'hotspot-compiler-dev at openjdk.java.net' > > dev at openjdk.java.net> > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > > > > > > > -----Original Message----- > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev- > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > Sent: ter?a-feira, 8 de agosto de 2017 17:19 > > > To: ppc-aix-port-dev at openjdk.java.net > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > Hi, > > > > > > Could you please review this specific PPC64 change to hotspot? By > > > implementing these intrinsics I noticed a small improvement with > > > microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, > > > only when backporting to JDK8 an improvement was noticed. 
> > >
> > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> > > Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> > >
> > > Motivation for this implementation:
> > > https://twitter.com/ijuma/status/698309312498835457
> > >
> > > Best regards,
> > > Gustavo Serra Scalet

From gustavo.scalet at eldorado.org.br Fri Sep 1 17:12:19 2017
From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet)
Date: Fri, 1 Sep 2017 17:12:19 +0000
Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
In-Reply-To: <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com>
References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br>
 <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com>
 <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br>
 <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com>
 <0ef23b5fcbc54996aea876d4c60e4097@sap.com>
 <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br>
 <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com>
Message-ID: <6362e4c1e3ab4871b12232580f2971aa@serv031.corp.eldorado.org.br>

Hi Martin,

> -----Original Message-----
> From: Doerr, Martin
> your first webrev already works on Big Endian. So the only required
> change is to fix your new code by this trivial patch:
> --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:47:45 2017 +0200
> +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:55:08 2017 +0200
> @@ -3426,7 +3426,9 @@
>      __ srdi (product, product, 1);
>      // join them to the same register and store it as Little Endian
>      __ orr (product, lplw_s, product);
> +#ifdef VM_LITTLE_ENDIAN
>      __ rldicl (product, product, 32, 0);
> +#endif
>      __ stdu (product, 8, out_aux);
>      __ bdnz (LOOP_SQUARE);
>
> So please enable it again for Big Endian in vm_version_ppc. Besides
> that, it looks good to me. We also need a 2nd review.

Great! Thanks for checking it and suggesting the diff. I changed these things.
You can find it below:
https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/

I wonder who could be a 2nd reviewer... Anybody in mind that we may ping?
Maybe Goetz Lindenmaier?

Best Regards,
Gustavo Serra Scalet

>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> Sent: Wednesday, 30 August 2017 19:03
> To: Doerr, Martin ; 'hotspot-compiler-dev at openjdk.java.net'
> Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
>
> Hi Martin,
>
> (webrev at the end)
>
> > -----Original Message-----
> > From: Doerr, Martin
> >
> > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem
> > > to need further changes as it's being cleared with clrldi, which is
> > > the same as rldic with no shift. Therefore it's treated
> > > appropriately as requested for "offset" parameter. Do you agree?
> >
> > No, I didn't find clrldi for len in generate_mulAdd(). Only for k.
>
> I'm sorry. I was thinking about "offset" and "k", which are both cleaned
> on generate_mulAdd(). "len" was not cleaned and it was being used on
> muladd() directly with cmpdi, which could lead to problems.
>
> That is being changed.
>
> > Where are in_len and out_len fixed up in generate_squareToLen()?
>
> They are not. According to your suggestions, I agree it also needs to be
> done for the same reason.
>
> > > You are right. The way I'm building the 64 bits of the register
> > > depends on which kind of endianness it is run. For now it works only
> > > on little endian so I'm adding a switch (just like I did for SHA) to
> > > make it available only on little endian systems.
> >
> > It shouldn't be that hard to get it working on big endian ;-) Btw., my
> > point was not to replace the 2 4-byte store instructions by an 8-byte
> > one (though I'm also ok with that). It was that 2 stwu which update
> > the same pointer doesn't make sense from performance point of view.
> > Please keep something which works on big endian, too. > > I see. The 2x stwu was being used like that because it was the trivial > approach when considering the original java update: > z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); z[i++] = > (int)(product >>> 1); > > As you pointed out, that might cause some stall on the pipeline so I > made it with 1s stdu (and could improve code by reducing 1 instruction) > > Now about having a big endian version: I'm not confident in doing so as > I don't have access to such a machine at the moment. You were kind on > offering test support but I don't know if it'd work like that. I may > support you in checking out which places are endianness-related but I'm > not comfortable in sending you untested code. > > Would you be interested in doing such a changes for making it work on > Big Endian? For this patch, I provided an interesting test that might > help you to verify if it worked. > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 > > > 14:15:31 2017). The reported performance speedup was calculated by > > > running the following test (TestSquareToLen.java): > > > > Seems like JDK-8145913 has not been backported, yet. Sorry for not > > checking this earlier. So if you want to make RSA really fast, it > > should be so much better to backport that one. But I can still sponsor > > this change as it may be used elsewhere. > > No problem. It's nice to know that I may not need to request a backport > of this patch for performance reasons. > > And at last, but not least, the new webrev with these clrldi changes: > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.03/index.html > > Thank you once again, > Gustavo Serra Scalet > > > Best regards, > > Martin > > > > > > -----Original Message----- > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > Sent: Dienstag, 29. 
August 2017 22:37 > > To: Doerr, Martin ; 'hotspot-compiler- > > dev at openjdk.java.net' > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > SquareToLen intrinsics > > > > Hi Martin, > > > > New changes: > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > > > Check comments below, please. > > > > > -----Original Message----- > > > From: Doerr, Martin > > > > > > 1. Sign extending offset and len > > > Right, sign and zero extending is equivalent for offset and len > > > because they are guaranteed to be >=0 (by checks in Java). But you > > > can only rely on bit 32 (IBM notation) to be 0. Bit 0-31 may contain > > garbage. > > > rldicl was incorrect. My mistake, sorry for that. Correct would be > > > rldic which also clears the least significant bits. > > > len should also get fixed e.g. by replacing cmpdi by extsw_ in > muladd. > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to > > need further changes as it's being cleared with clrldi, which is the > > same as rldic with no shift. Therefore it's treated appropriately as > > requested for "offset" parameter. Do you agree? > > > > > 2. Using 8 byte instructions for int The code which feeds stdu is > > > endianess specific. Doesn't work on all > > > PPC64 platforms. > > > > You are right. The way I'm building the 64 bits of the register > > depends on which kind of endianness it is run. For now it works only > > on little endian so I'm adding a switch (just like I did for SHA) to > > make it available only on little endian systems. > > > > > 3.Regarding Andrew's point: Superseded by Montgomery? > > > The Montgomery change got backported to jdk8u (JDK-8150152 in > 8u102). > > > I'd expect the performance improvement of these intrinsics to be > > > irrelevant for crypto.rsa. Did you measure with an older jdk8 > release? > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31 > > 2017). 
> > The reported performance speedup was calculated by running the
> > following test (TestSquareToLen.java):
> > import java.math.BigInteger;
> >
> > public class TestSquareToLen {
> >
> >   public static void main(String args[]) throws Exception {
> >
> >     int n = 10000000;
> >     if (args.length >= 1) {
> >       n = Integer.parseInt(args[0]);
> >     }
> >
> >     // One decimal literal, split into pieces only to survive mail wrapping
> >     BigInteger b1 = new BigInteger(
> >         "348939809235573590863505149820825039200022983118773208599936"
> >       + "739559418380102146884307139175604920787313701663155983793121475492609222"
> >       + "378029211020760922327218480828933663005773596942372680852064103011811651"
> >       + "644018048833823482390819947896524207635857984552089977996313113154016668"
> >       + "718795349783157384006672542605760392289645528307");
> >     BigInteger b2 = BigInteger.valueOf(0);
> >     BigInteger check = BigInteger.valueOf(1);
> >     for (int i = 0; i < n; i++) {
> >       b2 = b1.multiply(b1);
> >       if (i == 0)
> >         // Didn't JIT yet. Comparing against interpreted mode
> >         check = b2;
> >     }
> >     if (b2.compareTo(check) == 0)
> >       System.out.println("Check ok!");
> >     else
> >       System.out.println("Check failed!");
> >   }
> > }
> >
> >
> > I got these results on JDK8 on my POWER8 machine:
> > $ ./javac TestSquareToLen.java
> > $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen
> > Check ok!
> > Check ok!
> > Check ok!
> > Check ok!
> > Check ok!
> >
> >  Performance counter stats for './java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> >
> >     15148.009557      task-clock (msec)         #    1.053 CPUs utilized            ( +-  0.48% )
> >            2,425      context-switches          #    0.160 K/sec                    ( +-  5.84% )
> >              356      cpu-migrations            #    0.023 K/sec                    ( +-  3.01% )
> >            5,153      page-faults               #    0.340 K/sec                    ( +-  5.22% )
> >   54,536,889,909      cycles                    #    3.600 GHz                      ( +-  0.56% )  (66.68%)
> >      239,554,105      stalled-cycles-frontend   #    0.44% frontend cycles idle     ( +-  4.87% )  (49.90%)
> >   27,683,316,001      stalled-cycles-backend    #   50.76% backend cycles idle      ( +-  0.56% )  (50.17%)
> >  102,020,229,733      instructions              #    1.87  insn per cycle
> >                                                 #    0.27  stalled cycles per insn  ( +-  0.14% )  (66.94%)
> >    7,706,072,218      branches                  #  508.718 M/sec                    ( +-  0.23% )  (50.20%)
> >      456,051,162      branch-misses             #    5.92% of all branches          ( +-  0.09% )  (50.07%)
> >
> >     14.390840733 seconds time elapsed                                               ( +-  0.09% )
> >
> > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen
> > Check ok!
> > Check ok!
> > Check ok!
> > Check ok!
> > Check ok!
> >
> >  Performance counter stats for './java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> >
> >     11368.141410      task-clock (msec)         #    1.045 CPUs utilized            ( +-  0.64% )
> >            1,964      context-switches          #    0.173 K/sec                    ( +-  8.93% )
> >              338      cpu-migrations            #    0.030 K/sec                    ( +-  7.65% )
> >            5,627      page-faults               #    0.495 K/sec                    ( +-  6.15% )
> >   41,100,168,967      cycles                    #    3.615 GHz                      ( +-  0.50% )  (66.36%)
> >      309,052,316      stalled-cycles-frontend   #    0.75% frontend cycles idle     ( +-  2.84% )  (49.89%)
> >   14,188,581,685      stalled-cycles-backend    #   34.52% backend cycles idle      ( +-  0.99% )  (50.34%)
> >   77,846,029,829      instructions              #    1.89  insn per cycle
> >                                                 #    0.18  stalled cycles per insn  ( +-  0.29% )  (66.96%)
> >    8,435,216,989      branches                  #  742.005 M/sec                    ( +-  0.28% )  (50.17%)
> >      339,903,936      branch-misses             #    4.03% of all branches          ( +-  0.27% )  (49.90%)
> >
> >     10.882357546 seconds time elapsed                                               ( +-  0.24% )
> >
> >
> > (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s (+-
> > 0.53%) on JDK10)
> >
> > I may run for SpecJVM2008's crypto.rsa if you are interested.
> >
> > Thank you once again for reviewing this.
> >
> > Best regards,
> > Gustavo
> >
> > > (I think the change is still acceptable as the intrinsics could be
> > > used elsewhere and the implementation also exists on other
> > > platforms.)
> > >
> > > Best regards,
> > > Martin
> > >
> > >
> > > -----Original Message-----
> > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> > > Sent: Wednesday, 16 August 2017 18:50
> > > To: Doerr, Martin ; 'hotspot-compiler-dev at openjdk.java.net'
> > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > SquareToLen intrinsics
> > >
> > > Hi Martin,
> > >
> > > Thanks for dedicated review. It took me a while to be able to work
> > > on this but I hope to have your points solved.
Please check below > > > the review as well as my comments quoting your email: > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/ > > > > > > > -----Original Message----- > > > > First of all, C2 does not perform sign extend when calling stubs. > > > > The int parms need to get zero/sign extended. (Could even be done > > > > without extra instructions by replacing sldi -> rldicl, cmpdi -> > > > > extsw_ in some > > > > cases.) > > > > > > Does it make a difference on my case? > > > > > > I guess you are talking about mulAdd preparation code. The only > > > aspect I found about him is to force the cast from 32 bits -> 64 > > > bits by cleaning higher bits. Offset is a signed integer but it > > > can't be > > negative anyway. > > > > > > So I changed from: > > > sldi (R5_ARG3, R5_ARG3, 2); > > > > > > to: > > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > > > > macroAssembler_ppc.cpp: > > > > - Indentation should be 2 spaces. > > > > > > Done > > > > > > > > > > stubGenerator_ppc:cpp: > > > > - or_, addi_ should get replaced by orr, addi when CR0 result is > > > > not needed. > > > > > > Done > > > > > > > - Where is lplw initialized? > > > > > > It should be initialized with 0, I missed that... > > > > > > > - I believe that the updating load/store instructions e.g. lwzu > > > > don't perform well on some processors. At least using stwu 2 times > > > > in the loop doesn't make sense. > > > > > > You are right. I could manipulate the bits differently and ended up > > > with a single stdu in the loop. Neat! Although I could not reduce > > > the total number of instructions. > > > > > > > - Note: It should be possible to use 8 byte instead of 4 byte > > > > instructions: MacroAssembler::multiply64, addc, adde. But I'm not > > > > requesting to change that because I guess it would make the code > > > > very complicated, especially when supporting both endianess > > versions. 
> > > > > > Yes, that would require a new analysis on this code. May we consider > > > it next? As you said, I prefer having an initial version that looks > > > as simple as the original java code. > > > > > > > - The squareToLen stub implementation is very close the Java > > > > implementation. So it'd be interesting to understand what C2 > > > > doesn't do as well as the hand written assembly code. Do you know > > > > that? (Not absolutely necessary for accepting this change as long > > > > as the stub is measurably faster.) > > > > > > I don't know either. Basically I chose doing it because I noticed > > > some performance gain on SpecJVM2008 when analyzing X64. Then, > > > taking a closer look, I didn't notice any AVX or some special > > > instructions on > > > X64 so I decided to try it on ppc64 by using some basic assembly. > > > > > > Thanks > > > > > > > > > > > Best regards, > > > > Martin > > > > > > > > > > > > -----Original Message----- > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > Sent: Donnerstag, 10. August 2017 19:22 > > > > To: 'hotspot-compiler-dev at openjdk.java.net' > > > dev at openjdk.java.net> > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > SquareToLen intrinsics > > > > > > > > > > > > > > > > -----Original Message----- > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev- > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > Sent: ter?a-feira, 8 de agosto de 2017 17:19 > > > > To: ppc-aix-port-dev at openjdk.java.net > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > SquareToLen intrinsics > > > > > > > > Hi, > > > > > > > > Could you please review this specific PPC64 change to hotspot? By > > > > implementing these intrinsics I noticed a small improvement with > > > > microbenchmarks analysis. 
> > > > On SpecJVM2008's crypto.rsa benchmark,
> > > > only when backporting to JDK8 an improvement was noticed.
> > > >
> > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> > > > Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> > > >
> > > > Motivation for this implementation:
> > > > https://twitter.com/ijuma/status/698309312498835457
> > > >
> > > > Best regards,
> > > > Gustavo Serra Scalet

From aph at redhat.com Fri Sep 1 17:40:24 2017
From: aph at redhat.com (Andrew Haley)
Date: Fri, 1 Sep 2017 18:40:24 +0100
Subject: [aarch64-port-dev ] [aarch64-port-dev][10] RFR: 8187022 UBFX instructions have wrong format string
In-Reply-To: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com>
References: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com>
Message-ID: <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>

On 01/09/17 17:22, stewartd.qdt wrote:
> Please see the webrev [1] for fixing the format string of ubfx [2].
>
> [1]: http://cr.openjdk.java.net/~njian/8187022/webrev.00/
> [2]: https://bugs.openjdk.java.net/browse/JDK-8187022

Great, thanks.

P.S. Resending to hotspot-dev. That's where this should go, because
Aarch64 is in the main hotspot tree now.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From igor.ignatyev at oracle.com Fri Sep 1 18:07:22 2017
From: igor.ignatyev at oracle.com (Igor Ignatyev)
Date: Fri, 1 Sep 2017 11:07:22 -0700
Subject: RFR(XS) : 8187020 : AOT tests should not fail if devkit dependency isn't resolved
In-Reply-To:
References: <5827C35C-76F1-437D-844A-1543F7C80C41@oracle.com>
Message-ID:

Hi Vladimir,

I was planning to add execution of ld -v by a separate patch, but if you want I can merge it into this one.
-- Igor > On Aug 31, 2017, at 2:12 PM, Vladimir Kozlov wrote: > > On MacOS we also have problem when 'ld' is present on path but Xcode is missing: > > % /usr/bin/ld -v > xcode-select: error: no developer tools were found at '/Applications/Xcode.app', and no install could be requested (perhaps no UI is present), please install manually from 'developer.apple.com'. > > AOT tests will fail in such case too. May be AotCompiler test code also have to execute ld -v (or ld -V on Solaris) to check that we can use linker similar to JAOTC tool: > > http://hg.openjdk.java.net/jdk10/hs/hotspot/file/a8ec32aa385e/src/jdk.aot/share/classes/jdk.tools.jaotc/src/jdk/tools/jaotc/Linker.java#l104 > > Vladimir > > On 8/31/17 11:02 AM, Igor Ignatyev wrote: >> http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >>> 11 lines changed: 5 ins; 0 del; 6 mod; >> Hi all, >> could you please review this tiny fix or AOT tests? Prior this fix, the tests would fail w/ FileNotFoundException when devkit dependency can't be resolved even if jaotc is able to find a linker, e.g. in one of VS*COMNTOOLS. after this fix, if tests aren't able to download devkit, they will allow jaotc to try to find a linker. This patch also remove search for 'ld' in PATH on windows b/c jaotc doesn't use it on windows. >> webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >> jbs: https://bugs.openjdk.java.net/browse/JDK-8187020 >> testing: compiler/aot tests >> Thanks, >> -- Igor From vladimir.kozlov at oracle.com Fri Sep 1 18:21:02 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 1 Sep 2017 11:21:02 -0700 Subject: RFR(XS) : 8187020 : AOT tests should not fail if devkit dependency isn't resolved In-Reply-To: References: <5827C35C-76F1-437D-844A-1543F7C80C41@oracle.com> Message-ID: On 9/1/17 11:07 AM, Igor Ignatyev wrote: > H Vladimir, > > I was planning to add execution of ld -v by a separate patch, but if you want I can merge it into this one. Yes, please. 
Thanks, Vladimir > > -- Igor > >> On Aug 31, 2017, at 2:12 PM, Vladimir Kozlov wrote: >> >> On MacOS we also have problem when 'ld' is present on path but Xcode is missing: >> >> % /usr/bin/ld -v >> xcode-select: error: no developer tools were found at '/Applications/Xcode.app', and no install could be requested (perhaps no UI is present), please install manually from 'developer.apple.com'. >> >> AOT tests will fail in such case too. May be AotCompiler test code also have to execute ld -v (or ld -V on Solaris) to check that we can use linker similar to JAOTC tool: >> >> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/a8ec32aa385e/src/jdk.aot/share/classes/jdk.tools.jaotc/src/jdk/tools/jaotc/Linker.java#l104 >> >> Vladimir >> >> On 8/31/17 11:02 AM, Igor Ignatyev wrote: >>> http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >>>> 11 lines changed: 5 ins; 0 del; 6 mod; >>> Hi all, >>> could you please review this tiny fix or AOT tests? Prior this fix, the tests would fail w/ FileNotFoundException when devkit dependency can't be resolved even if jaotc is able to find a linker, e.g. in one of VS*COMNTOOLS. after this fix, if tests aren't able to download devkit, they will allow jaotc to try to find a linker. This patch also remove search for 'ld' in PATH on windows b/c jaotc doesn't use it on windows. 
>>> webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >>> jbs: https://bugs.openjdk.java.net/browse/JDK-8187020 >>> testing: compiler/aot tests >>> Thanks, >>> -- Igor > From igor.ignatyev at oracle.com Sat Sep 2 07:03:04 2017 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Sat, 2 Sep 2017 00:03:04 -0700 Subject: RFR(XS) : 8187020 : AOT tests should not fail if devkit dependency isn't resolved In-Reply-To: References: <5827C35C-76F1-437D-844A-1543F7C80C41@oracle.com> Message-ID: <8C09E09C-AB38-4670-8EB2-E1EF2AA32BE3@oracle.com> Hi Vladimir, I've added a check that ld -v/-V exits gracefully, compiler/aot tests have been rerun on major platforms. webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.01/index.html incremental: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00-01/index.html Thanks, -- Igor > On Sep 1, 2017, at 11:21 AM, Vladimir Kozlov wrote: > > On 9/1/17 11:07 AM, Igor Ignatyev wrote: >> H Vladimir, >> I was planning to add execution of ld -v by a separate patch, but if you want I can merge it into this one. > > Yes, please. > > Thanks, > Vladimir > >> -- Igor >>> On Aug 31, 2017, at 2:12 PM, Vladimir Kozlov wrote: >>> >>> On MacOS we also have problem when 'ld' is present on path but Xcode is missing: >>> >>> % /usr/bin/ld -v >>> xcode-select: error: no developer tools were found at '/Applications/Xcode.app', and no install could be requested (perhaps no UI is present), please install manually from 'developer.apple.com'. >>> >>> AOT tests will fail in such case too. 
>>> May be AotCompiler test code also have to execute ld -v (or ld -V on Solaris) to check that we can use linker similar to JAOTC tool:
>>>
>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/a8ec32aa385e/src/jdk.aot/share/classes/jdk.tools.jaotc/src/jdk/tools/jaotc/Linker.java#l104
>>>
>>> Vladimir
>>>
>>> On 8/31/17 11:02 AM, Igor Ignatyev wrote:
>>>> http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html
>>>>> 11 lines changed: 5 ins; 0 del; 6 mod;
>>>> Hi all,
>>>> could you please review this tiny fix or AOT tests? Prior this fix, the tests would fail w/ FileNotFoundException when devkit dependency can't be resolved even if jaotc is able to find a linker, e.g. in one of VS*COMNTOOLS. after this fix, if tests aren't able to download devkit, they will allow jaotc to try to find a linker. This patch also remove search for 'ld' in PATH on windows b/c jaotc doesn't use it on windows.
>>>> webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html
>>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8187020
>>>> testing: compiler/aot tests
>>>> Thanks,
>>>> -- Igor

From vladimir.kozlov at oracle.com Sat Sep 2 17:39:29 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Sat, 2 Sep 2017 10:39:29 -0700
Subject: RFR(XS) : 8187020 : AOT tests should not fail if devkit dependency isn't resolved
In-Reply-To: <8C09E09C-AB38-4670-8EB2-E1EF2AA32BE3@oracle.com>
References: <5827C35C-76F1-437D-844A-1543F7C80C41@oracle.com> <8C09E09C-AB38-4670-8EB2-E1EF2AA32BE3@oracle.com>
Message-ID:

Indent in new code is wrong. Otherwise looks good.

So we still don't exit tests gracefully if the linker is not found. You still allow JAOTC to try to find a linker. But since this test code did not find one, JAOTC (which has almost the same code) will not find it either. Do you have a next change in mind?
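
For readers following the thread, here is a minimal sketch of the kind of "does the linker start and exit gracefully" probe being discussed. It is not the code from the webrev or from jaotc's Linker.java: the class name `LinkerProbe`, the method `exitsGracefully` and the 10-second timeout are hypothetical, and the sketch assumes Java 9 or later (for `List.of` and `ProcessBuilder.Redirect.DISCARD`).

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not part of the AOT test library.
final class LinkerProbe {

    /** Returns true only if the command starts, finishes in time, and exits with status 0. */
    static boolean exitsGracefully(List<String> command, long timeoutSeconds) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // output is not needed
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();                                  // hung, treat as unusable
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            return false;                                             // not on PATH or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // The thread mentions "ld -v", and "ld -V" on Solaris.
        boolean solaris = System.getProperty("os.name", "").startsWith("SunOS");
        List<String> cmd = List.of("ld", solaris ? "-V" : "-v");
        System.out.println("linker usable: " + exitsGracefully(cmd, 10));
    }
}
```

A probe like this would also cover the macOS case quoted earlier in the thread, where `/usr/bin/ld` is present but reports an xcode-select error, provided that failure shows up as a non-zero exit status (an assumption; the thread does not state the exit code).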
Thanks, Vladimir On 9/2/17 12:03 AM, Igor Ignatyev wrote: > Hi Vladimir, > > I've added a check that ld -v/-V exits gracefully, compiler/aot tests > have been rerun on major platforms. > > webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.01/index.html > incremental: > http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00-01/index.html > > Thanks, > -- Igor >> On Sep 1, 2017, at 11:21 AM, Vladimir Kozlov >> > wrote: >> >> On 9/1/17 11:07 AM, Igor Ignatyev wrote: >>> H Vladimir, >>> I was planning to add execution of ld -v by a separate patch, but if >>> you want I can merge it into this one. >> >> Yes, please. >> >> Thanks, >> Vladimir >> >>> -- Igor >>>> On Aug 31, 2017, at 2:12 PM, Vladimir Kozlov >>>> > wrote: >>>> >>>> On MacOS we also have problem when 'ld' is present on path but Xcode >>>> is missing: >>>> >>>> % /usr/bin/ld -v >>>> xcode-select: error: no developer tools were found at >>>> '/Applications/Xcode.app', and no install could be requested >>>> (perhaps no UI is present), please install manually from >>>> 'developer.apple.com '. >>>> >>>> AOT tests will fail in such case too. May be AotCompiler test code >>>> also have to execute ld -v (or ld -V on Solaris) to check that we >>>> can use linker similar to JAOTC tool: >>>> >>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/a8ec32aa385e/src/jdk.aot/share/classes/jdk.tools.jaotc/src/jdk/tools/jaotc/Linker.java#l104 >>>> >>>> Vladimir >>>> >>>> On 8/31/17 11:02 AM, Igor Ignatyev wrote: >>>>> http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >>>>>> 11 lines changed: 5 ins; 0 del; 6 mod; >>>>> Hi all, >>>>> could you please review this tiny fix or AOT tests? Prior this fix, >>>>> the tests would fail w/ FileNotFoundException when devkit >>>>> dependency can't be resolved even if jaotc is able to find a >>>>> linker, e.g. in one of VS*COMNTOOLS. after this fix, if tests >>>>> aren't able to download devkit, they will allow jaotc to try to >>>>> find a linker. 
This patch also remove search for 'ld' in PATH on >>>>> windows b/c jaotc doesn't use it on windows. >>>>> webrev: >>>>> http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >>>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8187020 >>>>> testing: compiler/aot tests >>>>> Thanks, >>>>> -- Igor > From igor.ignatyev at oracle.com Sat Sep 2 17:54:37 2017 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Sat, 2 Sep 2017 10:54:37 -0700 Subject: RFR(XS) : 8187020 : AOT tests should not fail if devkit dependency isn't resolved In-Reply-To: References: <5827C35C-76F1-437D-844A-1543F7C80C41@oracle.com> <8C09E09C-AB38-4670-8EB2-E1EF2AA32BE3@oracle.com> Message-ID: > On Sep 2, 2017, at 10:39 AM, Vladimir Kozlov wrote: > > Indent in new code is wrong. Otherwise looks good. thank you for review Vladimir. > > So we still don't exit tests gracefully if linker is not found. You still allow JAOTC to try to find linker. But since this test code did not find one JAOTC (which have almost the same code) will not find it either. You have next change in mind? yes. should it still be necessary, I'll add @requires property which shows if artifactory resolver can be used, i.e. there is non-default artifactory resolver or at least one java property which are used by default resolver. -- Igor > > Thanks, > Vladimir > > On 9/2/17 12:03 AM, Igor Ignatyev wrote: >> Hi Vladimir, >> I've added a check that ld -v/-V exits gracefully, compiler/aot tests have been rerun on major platforms. >> webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.01/index.html >> incremental: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00-01/index.html >> Thanks, >> -- Igor >>> On Sep 1, 2017, at 11:21 AM, Vladimir Kozlov > wrote: >>> >>> On 9/1/17 11:07 AM, Igor Ignatyev wrote: >>>> H Vladimir, >>>> I was planning to add execution of ld -v by a separate patch, but if you want I can merge it into this one. >>> >>> Yes, please. 
>>> >>> Thanks, >>> Vladimir >>> >>>> -- Igor >>>>> On Aug 31, 2017, at 2:12 PM, Vladimir Kozlov > wrote: >>>>> >>>>> On MacOS we also have problem when 'ld' is present on path but Xcode is missing: >>>>> >>>>> % /usr/bin/ld -v >>>>> xcode-select: error: no developer tools were found at '/Applications/Xcode.app', and no install could be requested (perhaps no UI is present), please install manually from 'developer.apple.com '. >>>>> >>>>> AOT tests will fail in such case too. May be AotCompiler test code also have to execute ld -v (or ld -V on Solaris) to check that we can use linker similar to JAOTC tool: >>>>> >>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/a8ec32aa385e/src/jdk.aot/share/classes/jdk.tools.jaotc/src/jdk/tools/jaotc/Linker.java#l104 >>>>> >>>>> Vladimir >>>>> >>>>> On 8/31/17 11:02 AM, Igor Ignatyev wrote: >>>>>> http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >>>>>>> 11 lines changed: 5 ins; 0 del; 6 mod; >>>>>> Hi all, >>>>>> could you please review this tiny fix or AOT tests? Prior this fix, the tests would fail w/ FileNotFoundException when devkit dependency can't be resolved even if jaotc is able to find a linker, e.g. in one of VS*COMNTOOLS. after this fix, if tests aren't able to download devkit, they will allow jaotc to try to find a linker. This patch also remove search for 'ld' in PATH on windows b/c jaotc doesn't use it on windows. 
>>>>>> webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >>>>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8187020 >>>>>> testing: compiler/aot tests >>>>>> Thanks, >>>>>> -- Igor From vladimir.kozlov at oracle.com Sat Sep 2 17:56:17 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Sat, 2 Sep 2017 10:56:17 -0700 Subject: RFR(XS) : 8187020 : AOT tests should not fail if devkit dependency isn't resolved In-Reply-To: References: <5827C35C-76F1-437D-844A-1543F7C80C41@oracle.com> <8C09E09C-AB38-4670-8EB2-E1EF2AA32BE3@oracle.com> Message-ID: <7215d637-09f6-981a-eb46-f2de03e074df@oracle.com> On 9/2/17 10:54 AM, Igor Ignatyev wrote: > >> On Sep 2, 2017, at 10:39 AM, Vladimir Kozlov wrote: >> >> Indent in new code is wrong. Otherwise looks good. > thank you for review Vladimir. >> >> So we still don't exit tests gracefully if linker is not found. You still allow JAOTC to try to find linker. But since this test code did not find one JAOTC (which have almost the same code) will not find it either. You have next change in mind? > yes. should it still be necessary, I'll add @requires property which shows if artifactory resolver can be used, i.e. there is non-default artifactory resolver or at least one java property which are used by default resolver. Good. Thanks, Vladimir > > -- Igor >> >> Thanks, >> Vladimir >> >> On 9/2/17 12:03 AM, Igor Ignatyev wrote: >>> Hi Vladimir, >>> I've added a check that ld -v/-V exits gracefully, compiler/aot tests have been rerun on major platforms. >>> webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.01/index.html >>> incremental: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00-01/index.html >>> Thanks, >>> -- Igor >>>> On Sep 1, 2017, at 11:21 AM, Vladimir Kozlov > wrote: >>>> >>>> On 9/1/17 11:07 AM, Igor Ignatyev wrote: >>>>> H Vladimir, >>>>> I was planning to add execution of ld -v by a separate patch, but if you want I can merge it into this one. >>>> >>>> Yes, please. 
>>>> >>>> Thanks, >>>> Vladimir >>>> >>>>> -- Igor >>>>>> On Aug 31, 2017, at 2:12 PM, Vladimir Kozlov > wrote: >>>>>> >>>>>> On MacOS we also have problem when 'ld' is present on path but Xcode is missing: >>>>>> >>>>>> % /usr/bin/ld -v >>>>>> xcode-select: error: no developer tools were found at '/Applications/Xcode.app', and no install could be requested (perhaps no UI is present), please install manually from 'developer.apple.com '. >>>>>> >>>>>> AOT tests will fail in such case too. May be AotCompiler test code also have to execute ld -v (or ld -V on Solaris) to check that we can use linker similar to JAOTC tool: >>>>>> >>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/a8ec32aa385e/src/jdk.aot/share/classes/jdk.tools.jaotc/src/jdk/tools/jaotc/Linker.java#l104 >>>>>> >>>>>> Vladimir >>>>>> >>>>>> On 8/31/17 11:02 AM, Igor Ignatyev wrote: >>>>>>> http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html >>>>>>>> 11 lines changed: 5 ins; 0 del; 6 mod; >>>>>>> Hi all, >>>>>>> could you please review this tiny fix or AOT tests? Prior this fix, the tests would fail w/ FileNotFoundException when devkit dependency can't be resolved even if jaotc is able to find a linker, e.g. in one of VS*COMNTOOLS. after this fix, if tests aren't able to download devkit, they will allow jaotc to try to find a linker. This patch also remove search for 'ld' in PATH on windows b/c jaotc doesn't use it on windows. 
>>>>>>> webrev: http://cr.openjdk.java.net/~iignatyev//8187020/webrev.00/index.html
>>>>>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8187020
>>>>>>> testing: compiler/aot tests
>>>>>>> Thanks,
>>>>>>> -- Igor
>

From felix.yang at linaro.org Sun Sep 3 13:16:38 2017
From: felix.yang at linaro.org (Felix Yang)
Date: Sun, 3 Sep 2017 21:16:38 +0800
Subject: [aarch64-port-dev ] [aarch64-port-dev][10] RFR: 8187022 UBFX instructions have wrong format string
In-Reply-To: <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>
References: <8cde07840db64072af6e68393ef4c704@NASANEXM01B.na.qualcomm.com> <879a0d91-03c8-6b7a-4abb-1e6a74bd70f7@redhat.com>
Message-ID:

Pushed. Thanks.

On 2 September 2017 at 01:40, Andrew Haley wrote:

> On 01/09/17 17:22, stewartd.qdt wrote:
> > Please see the webrev [1] for fixing the format string of ubfx [2].
> >
> > [1]: http://cr.openjdk.java.net/~njian/8187022/webrev.00/
> > [2]: https://bugs.openjdk.java.net/browse/JDK-8187022
>
> Great, thanks.
>
> P.S. Resending to hotspot-dev. That's where this should go, because
> Aarch64 is in the main hotspot tree now.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd.
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>

From andrew_nuss at yahoo.com Tue Sep 5 12:35:53 2017
From: andrew_nuss at yahoo.com (Andy Nuss)
Date: Tue, 5 Sep 2017 12:35:53 +0000 (UTC)
Subject: escape analysis friendly very small objects
References: <660539323.2919871.1504614953928.ref@mail.yahoo.com>
Message-ID: <660539323.2919871.1504614953928@mail.yahoo.com>

I have a variety of classes which just contain a couple scalar primitives and possibly a reference to an object that is clearly on the heap, all of these small number of data members are private and all of them mutable. Much like the Point class in Brian Goetz's article on escape analysis on whether the vm uses the stack or heap for class instances.
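
A minimal sketch of the kind of class described above, assuming a hypothetical two-field Point and a hypothetical method walk that keeps the object purely local (neither name comes from the original message). In this shape the object never leaves the method, so C2's escape analysis may replace it with its scalar fields. Whether that actually happens depends on inlining and can only be confirmed by looking at the generated code.

```java
// Hypothetical illustration; class and method names are assumptions.
public class EscapeSketch {
    // A very small mutable object: two private primitives,
    // no hashCode/equals/clone overrides.
    static final class Point {
        private int x;
        private int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        void move(int dx, int dy) {
            x += dx;
            y += dy;
        }

        int getX() { return x; }
        int getY() { return y; }
    }

    // The Point is created, mutated, and read only inside this method.
    // It is never stored in a field, passed to unknown code, or returned,
    // so it is a candidate for scalar replacement by the JIT.
    static int walk(int steps) {
        Point p = new Point(0, 0);
        for (int i = 1; i <= steps; i++) {
            p.move(i, 2 * i);
        }
        return p.getX() + p.getY();
    }

    public static void main(String[] args) {
        System.out.println(walk(4)); // prints 30
    }
}
```

Returning the Point itself from walk would make it escape to the caller unless walk is inlined there, which is one reason results vary between call sites.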
There are two use cases for these objects: (1) create such a small object with new as a local variable and use it throughout the function, mutating it, etc. (2) create a class with lots of these small objects as private or possibly protected members.

The hope in the first case is to learn under what conditions hotspot will put the object's 2 or 3 fields onto the stack, ideally without the hidden headers needed to make it a heap object. hashCode, equals, and clone are not implemented or used, if that is important. I.e. will hotspot ever make it a simple C++-like object on the stack? Does it help if I make a defensive copy when I return one of these small objects from the function?

In the second case, my hope is that for a class that contains 5 such small objects, and would definitely be on the heap, hotspot would be smart enough not to create 5 small objects on the heap and then 5 references to them in the containing class. Instead, it would explode into the containing class, a la C++, just the 2 or 3 primitive data members of the object, and again, ideally without headers.

From dmitrij.pochepko at bell-sw.com Tue Sep 5 17:34:11 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Tue, 5 Sep 2017 20:34:11 +0300
Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
Message-ID: <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>

Hi Andrew,

you can find my attempt to implement the mulAdd intrinsic using wider multiplication here:
http://cr.openjdk.java.net/~dpochepk/8186915/webrev.02/
but, as expected, I got slower results on the same benchmark compared to the original webrev.01 with 32-bit multiplication.
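
For reference, the scalar operation that a mulAdd intrinsic stands in for can be sketched roughly as below. This is an approximation modeled on the pure-Java fallback in java.math.BigInteger, not a copy of the JDK source; the class name MulAddSketch is an assumption. The point it shows is that the multiplier k is a single 32-bit int, so every loop step is a 32x32 multiplication producing a 64-bit result.

```java
// Hypothetical illustration; modeled on, but not copied from, BigInteger.
public class MulAddSketch {
    private static final long LONG_MASK = 0xffffffffL;

    // Multiplies the len words of in by the unsigned 32-bit value k and adds
    // the products into out, starting from the least significant word.
    // offset counts words from the end of out. Returns the final carry.
    static int mulAdd(int[] out, int[] in, int offset, int len, int k) {
        long kLong = k & LONG_MASK;
        long carry = 0;
        offset = out.length - offset - 1;
        for (int j = len - 1; j >= 0; j--) {
            // 32x32 -> 64-bit product plus the existing word plus the carry;
            // the sum always fits in 64 unsigned bits.
            long product = (in[j] & LONG_MASK) * kLong
                         + (out[offset] & LONG_MASK) + carry;
            out[offset--] = (int) product;
            carry = product >>> 32;
        }
        return (int) carry;
    }

    public static void main(String[] args) {
        int[] out = {0, 0};
        int carry = mulAdd(out, new int[] {-1}, 0, 1, 2);
        // 0xFFFFFFFF * 2 = 0x1_FFFFFFFE: low word fffffffe, carry 1
        System.out.println(Integer.toHexString(out[1]) + " carry=" + carry);
    }
}
```

Because the signature fixes k at 32 bits, a 64-bit multiply can only be used by pairing up words of in, which adds shuffling work around each multiplication.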
I've measured results on ThunderX:

webrev.01 version:

Benchmark                      (size)  Mode  Cnt    Score     Error  Units
BigIntegerBench.mulAddReflect       1  avgt    5  194.809 ±   1.341  ns/op
BigIntegerBench.mulAddReflect       2  avgt    5  198.242 ±   1.348  ns/op
BigIntegerBench.mulAddReflect       3  avgt    5  201.213 ±   0.670  ns/op
BigIntegerBench.mulAddReflect       5  avgt    5  213.426 ±   7.441  ns/op
BigIntegerBench.mulAddReflect      10  avgt    5  236.396 ±   1.663  ns/op
BigIntegerBench.mulAddReflect      50  avgt    5  432.255 ±  24.718  ns/op
BigIntegerBench.mulAddReflect     100  avgt    5  653.961 ±  10.140  ns/op

webrev.02 version with wider multiplication:

Benchmark                      (size)  Mode  Cnt    Score     Error  Units
BigIntegerBench.mulAddReflect       1  avgt    5  196.109 ±   0.663  ns/op
BigIntegerBench.mulAddReflect       2  avgt    5  213.438 ± 124.206  ns/op
BigIntegerBench.mulAddReflect       3  avgt    5  211.683 ±   3.206  ns/op
BigIntegerBench.mulAddReflect       5  avgt    5  217.324 ±   5.827  ns/op
BigIntegerBench.mulAddReflect      10  avgt    5  233.272 ±  21.560  ns/op
BigIntegerBench.mulAddReflect      50  avgt    5  455.337 ± 237.168  ns/op
BigIntegerBench.mulAddReflect     100  avgt    5  826.844 ±   4.972  ns/op

As you can see, it's up to 26% worse throughput with wider multiplication.

The reasons for this are:
1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it can't be changed within the function signature. Thus we can't fully utilize the potential of 64-bit multiplication.
2. The umulh instruction is more expensive than the mul instruction.

I haven't implemented wider multiplication for the squareToLen intrinsic, since it'll require much more code due to more corner cases. Also, the squaring algorithm in BigInteger doesn't handle more than 127 integers in one squareToLen call (large integer arrays are divided into smaller parts for squaring, so 1..127 integers are squared at once), which makes all additional off-loop penalties expensive in comparison to loop execution time.

At this point I ran out of ideas how we could improve the performance 3x for these intrinsics.
I understand one can do better with 64bit for the intrinsics you implemented, but squareToLen and mulAdd looks different. Do you have other suggestions, or can we proceed with initial version (webrev.01)? Thanks, Dmitrij On 01.09.2017 10:51, Andrew Haley wrote: > On 31/08/17 23:46, Dmitrij Pochepko wrote: >> I tried a number of initial versions first. I also tried to use wider >> multiplication via umulh (and larger load instructions like ldp/ldr), >> but after measuring all versions I've found that version I've initially >> sent appeared to be the fastest (I was measuring it on ThunderX which I >> have in hand). It might be because of lots of additional ror(..., 32) >> operations in other versions to convert values from initial layout to >> register and back. Another reason might be more complex overall logic >> and larger code, which triggers more icache lines to be loaded. Or even >> some umulh specifics on some CPUs. So, after measuring, I've abandoned >> these versions in a middle of development and polished the fastest one. >> I have some raw development unpolished versions of such approaches >> left(not sure I have debugged versions saved, but at least has an >> overall idea). >> I attached squares_v2.3.1.diff: early version which is using mul/umulh >> for just one case. It was surprisingly slower for this case than version >> I've sent to review, so, I've abandoned this approach. >> I've also tried version with large load instructions(ldp/ldr): >> squares_v1.diff and it was also slower(it has another, slower, mul_add >> loop implementation, but I was comparing to the same version, which is >> using ldrw-only). >> >> I'm not sure if I should use 64-bit multiplications and/or 64/128 bit >> loads. I can try to return back to one of such versions and try to >> polish it, but I'll probably get slower results again on h/w I have and >> it's not clear if it'll be faster on any other h/w(which one? 
It takes a >> lot of time to iteratively improve and measure every version on >> respective h/w). > I'm using Applied Micro hardware for my testing at the moment. > > I did the speed testing for Montgomery multiply on ThunderX. I > appreciate that it's difficult to get the 64-bit version right and > fast, but you should see about 3 - 3.5* speedup over the pure Java > version if you get it right. That's what I saw when I did the > Montgomery multiply. You do have to pipeline the loads and the > multiplies to avoid stalls. > > Be aware that squareToLen is not used at all when running the > RSA benchmark with C2. > From goetz.lindenmaier at sap.com Wed Sep 6 06:29:41 2017 From: goetz.lindenmaier at sap.com (Lindenmaier, Goetz) Date: Wed, 6 Sep 2017 06:29:41 +0000 Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics In-Reply-To: <6362e4c1e3ab4871b12232580f2971aa@serv031.corp.eldorado.org.br> References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br> <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com> <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br> <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com> <0ef23b5fcbc54996aea876d4c60e4097@sap.com> <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br> <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com> <6362e4c1e3ab4871b12232580f2971aa@serv031.corp.eldorado.org.br> Message-ID: Hi, I had a look at this change and tested it. Reviewed. Best regards, Goetz. > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > Sent: Freitag, 1. September 2017 19:12 > To: Doerr, Martin ; 'hotspot-compiler- > dev at openjdk.java.net' > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > SquareToLen intrinsics > > Hi Martin, > > > -----Original Message----- > > From: Doerr, Martin > > your first webrev already works on Big Endian. 
So the only required > > change is to fix your new code by this trivial patch: > > --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:47:45 2017 > > +0200 > > +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:55:08 2017 > > +0200 > > @@ -3426,7 +3426,9 @@ > > __ srdi (product, product, 1); > > // join them to the same register and store it as Little Endian > > __ orr (product, lplw_s, product); > > +#ifdef VM_LITTLE_ENDIAN > > __ rldicl (product, product, 32, 0); > > +#endif > > __ stdu (product, 8, out_aux); > > __ bdnz (LOOP_SQUARE); > > > > So please enable it again for Big Endian in vm_version_ppc. Besides > > that, it looks good to me. We also need a 2nd review. > > Great! Thanks for checking it and suggesting the diff. > > I changed these things. You can find it below: > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/ > > I wonder who could be a 2nd reviewer... Anybody in mind that we may ping? > Maybe Goetz Lindenmaier? > > Best Regards, > Gustavo Serra Scalet > > > > > Best regards, > > Martin > > > > > > -----Original Message----- > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > Sent: Mittwoch, 30. August 2017 19:03 > > To: Doerr, Martin ; 'hotspot-compiler- > > dev at openjdk.java.net' > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > SquareToLen intrinsics > > > > Hi Martin, > > > > (webrev at the end) > > > > > -----Original Message----- > > > From: Doerr, Martin > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem > > > > to need further changes as it's being cleared with clrldi, which is > > > > the same as rldic with no shift. Therefore it's treated > > > > appropriately as requested for "offset" parameter. Do you agree? > > > > > > No, I didn't find clrldi for len in generate_mulAdd(). Only for k. > > > > I'm sorry. I was thinking about "offset" and "k", which are both cleaned > > on generate_mulAdd(). 
"len" was not cleaned and it was being used on > > muladd() directly with cmpdi, which could lead to problems. > > > > That is being changed. > > > > > Where are in_len and out_len fixed up in generate_squareToLen()? > > > > They are not. According to your suggestions, I agree it also needs to be > > done for the same reason. > > > > > > You are right. The way I'm building the 64 bits of the register > > > > depends on which kind of endianness it is run. For now it works only > > > > on little endian so I'm adding a switch (just like I did for SHA) to > > > > make it available only on little endian systems. > > > > > > It shouldn't be that hard to get it working on big endian ;-) Btw., my > > > point was not to replace the 2 4-byte store instructions by an 8-byte > > > one (though I'm also ok with that). It was that 2 stwu which update > > > the same pointer doesn't make sense from performance point of view. > > > Please keep something which works on big endian, too. > > > > I see. The 2x stwu was being used like that because it was the trivial > > approach when considering the original java update: > > z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); z[i++] = > > (int)(product >>> 1); > > > > As you pointed out, that might cause some stall on the pipeline so I > > made it with 1s stdu (and could improve code by reducing 1 instruction) > > > > Now about having a big endian version: I'm not confident in doing so as > > I don't have access to such a machine at the moment. You were kind on > > offering test support but I don't know if it'd work like that. I may > > support you in checking out which places are endianness-related but I'm > > not comfortable in sending you untested code. > > > > Would you be interested in doing such a changes for making it work on > > Big Endian? For this patch, I provided an interesting test that might > > help you to verify if it worked. 
> > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 > > > > 14:15:31 2017). The reported performance speedup was calculated by > > > > running the following test (TestSquareToLen.java): > > > > > > Seems like JDK-8145913 has not been backported, yet. Sorry for not > > > checking this earlier. So if you want to make RSA really fast, it > > > should be so much better to backport that one. But I can still sponsor > > > this change as it may be used elsewhere. > > > > No problem. It's nice to know that I may not need to request a backport > > of this patch for performance reasons. > > > > And at last, but not least, the new webrev with these clrldi changes: > > https://gut.github.io/openjdk/webrev/JDK- > 8185976/webrev.03/index.html > > > > Thank you once again, > > Gustavo Serra Scalet > > > > > Best regards, > > > Martin > > > > > > > > > -----Original Message----- > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > > Sent: Dienstag, 29. August 2017 22:37 > > > To: Doerr, Martin ; 'hotspot-compiler- > > > dev at openjdk.java.net' > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > Hi Martin, > > > > > > New changes: > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > > > > > Check comments below, please. > > > > > > > -----Original Message----- > > > > From: Doerr, Martin > > > > > > > > 1. Sign extending offset and len > > > > Right, sign and zero extending is equivalent for offset and len > > > > because they are guaranteed to be >=0 (by checks in Java). But you > > > > can only rely on bit 32 (IBM notation) to be 0. Bit 0-31 may contain > > > garbage. > > > > rldicl was incorrect. My mistake, sorry for that. Correct would be > > > > rldic which also clears the least significant bits. > > > > len should also get fixed e.g. by replacing cmpdi by extsw_ in > > muladd. 
> > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to > > > need further changes as it's being cleared with clrldi, which is the > > > same as rldic with no shift. Therefore it's treated appropriately as > > > requested for "offset" parameter. Do you agree? > > > > > > > 2. Using 8 byte instructions for int The code which feeds stdu is > > > > endianess specific. Doesn't work on all > > > > PPC64 platforms. > > > > > > You are right. The way I'm building the 64 bits of the register > > > depends on which kind of endianness it is run. For now it works only > > > on little endian so I'm adding a switch (just like I did for SHA) to > > > make it available only on little endian systems. > > > > > > > 3.Regarding Andrew's point: Superseded by Montgomery? > > > > The Montgomery change got backported to jdk8u (JDK-8150152 in > > 8u102). > > > > I'd expect the performance improvement of these intrinsics to be > > > > irrelevant for crypto.rsa. Did you measure with an older jdk8 > > release? > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31 > > > 2017). 
The reported performance speedup was calculated by running the > > > following test (TestSquareToLen.java): > > > import java.math.BigInteger; > > > > > > public class TestSquareToLen { > > > > > > public static void main(String args[]) throws Exception { > > > > > > int n = 10000000; > > > if (args.length >=1) { > > > n = Integer.parseInt(args[0]); > > > } > > > > > > BigInteger b1 = new > > > > BigInteger("34893980923557359086350514982082503920002298311877320859 > 99 > > > 36 > > > > 7395594183801021468843071391756049207873137016631559837931214754926 > 092 > > > 22 > > > > 3780292110207609223272184808289336630057735969423726808520641030118 > 116 > > > 51 > > > > 6440180488338234823908199478965242076358579845520899779963131131540 > 166 > > > 68 718795349783157384006672542605760392289645528307"); > > > BigInteger b2 = BigInteger.valueOf(0); > > > BigInteger check = BigInteger.valueOf(1); > > > for (int i = 0; i < n; i++) { > > > b2 = b1.multiply(b1); > > > if (i == 0) > > > // Didn't JIT yet. Comparing against interpreted mode > > > check = b2; > > > } > > > if (b2.compareTo(check) == 0) > > > System.out.println("Check ok!"); > > > else > > > System.out.println("Check failed!"); > > > } > > > } > > > > > > > > > I got these results on JDK8 on my POWER8 machine: > > > $ ./javac TestSquareToLen.java > > > $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:- > > > UseSquareToLenIntrinsic TestSquareToLen Check ok! > > > Check ok! > > > Check ok! > > > Check ok! > > > Check ok! 
> > > > > > Performance counter stats for './java -XX:-UseMulAddIntrinsic -XX:- > > > UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > 15148.009557 task-clock (msec) # 1.053 CPUs > > > utilized ( +- 0.48% ) > > > 2,425 context-switches # 0.160 K/sec > > > ( +- 5.84% ) > > > 356 cpu-migrations # 0.023 K/sec > > > ( +- 3.01% ) > > > 5,153 page-faults # 0.340 K/sec > > > ( +- 5.22% ) > > > 54,536,889,909 cycles # 3.600 GHz > > > ( +- 0.56% ) (66.68%) > > > 239,554,105 stalled-cycles-frontend # 0.44% frontend > > > cycles idle ( +- 4.87% ) (49.90%) > > > 27,683,316,001 stalled-cycles-backend # 50.76% backend > > > cycles idle ( +- 0.56% ) (50.17%) > > > 102,020,229,733 instructions # 1.87 insn per > > > cycle > > > # 0.27 stalled > > > cycles per insn ( +- 0.14% ) (66.94%) > > > 7,706,072,218 branches # 508.718 M/sec > > > ( +- 0.23% ) (50.20%) > > > 456,051,162 branch-misses # 5.92% of all > > > branches ( +- 0.09% ) (50.07%) > > > > > > 14.390840733 seconds time elapsed ( +- 0.09% ) > > > > > > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic - > > > XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok! > > > Check ok! > > > Check ok! > > > Check ok! > > > Check ok! 
> > > > > > Performance counter stats for './java -XX:+UseMulAddIntrinsic - > > > XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > 11368.141410 task-clock (msec) # 1.045 CPUs > > > utilized ( +- 0.64% ) > > > 1,964 context-switches # 0.173 K/sec > > > ( +- 8.93% ) > > > 338 cpu-migrations # 0.030 K/sec > > > ( +- 7.65% ) > > > 5,627 page-faults # 0.495 K/sec > > > ( +- 6.15% ) > > > 41,100,168,967 cycles # 3.615 GHz > > > ( +- 0.50% ) (66.36%) > > > 309,052,316 stalled-cycles-frontend # 0.75% frontend > > > cycles idle ( +- 2.84% ) (49.89%) > > > 14,188,581,685 stalled-cycles-backend # 34.52% backend > > > cycles idle ( +- 0.99% ) (50.34%) > > > 77,846,029,829 instructions # 1.89 insn per > > > cycle > > > # 0.18 stalled > > > cycles per insn ( +- 0.29% ) (66.96%) > > > 8,435,216,989 branches # 742.005 M/sec > > > ( +- 0.28% ) (50.17%) > > > 339,903,936 branch-misses # 4.03% of all > > > branches ( +- 0.27% ) (49.90%) > > > > > > 10.882357546 seconds time elapsed ( +- 0.24% ) > > > > > > > > > (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s (+- > > > 0.53%) on JDK10) > > > > > > I may run for SpecJVM2008's crypto.rsa if you are interested. > > > > > > Thank you once again for reviewing this. > > > > > > Best regards, > > > Gustavo > > > > > > > (I think the change is still acceptable as the intrinsics could be > > > > used elsewhere and the implementation also exists on other > > > > platforms.) > > > > > > > > Best regards, > > > > Martin > > > > > > > > > > > > -----Original Message----- > > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > > > Sent: Mittwoch, 16. August 2017 18:50 > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > dev at openjdk.java.net' > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > SquareToLen intrinsics > > > > > > > > Hi Martin, > > > > > > > > Thanks for dedicated review. 
It took me a while to be able to work > > > > on this but I hope to have your points solved. Please check below > > > > the review as well as my comments quoting your email: > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/ > > > > > > > > > -----Original Message----- > > > > > First of all, C2 does not perform sign extend when calling stubs. > > > > > The int parms need to get zero/sign extended. (Could even be done > > > > > without extra instructions by replacing sldi -> rldicl, cmpdi -> > > > > > extsw_ in some > > > > > cases.) > > > > > > > > Does it make a difference on my case? > > > > > > > > I guess you are talking about mulAdd preparation code. The only > > > > aspect I found about him is to force the cast from 32 bits -> 64 > > > > bits by cleaning higher bits. Offset is a signed integer but it > > > > can't be > > > negative anyway. > > > > > > > > So I changed from: > > > > sldi (R5_ARG3, R5_ARG3, 2); > > > > > > > > to: > > > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > > > > > > > macroAssembler_ppc.cpp: > > > > > - Indentation should be 2 spaces. > > > > > > > > Done > > > > > > > > > > > > > stubGenerator_ppc:cpp: > > > > > - or_, addi_ should get replaced by orr, addi when CR0 result is > > > > > not needed. > > > > > > > > Done > > > > > > > > > - Where is lplw initialized? > > > > > > > > It should be initialized with 0, I missed that... > > > > > > > > > - I believe that the updating load/store instructions e.g. lwzu > > > > > don't perform well on some processors. At least using stwu 2 times > > > > > in the loop doesn't make sense. > > > > > > > > You are right. I could manipulate the bits differently and ended up > > > > with a single stdu in the loop. Neat! Although I could not reduce > > > > the total number of instructions. > > > > > > > > > - Note: It should be possible to use 8 byte instead of 4 byte > > > > > instructions: MacroAssembler::multiply64, addc, adde. 
But I'm not > > > > > requesting to change that because I guess it would make the code > > > > > very complicated, especially when supporting both endianess > > > versions. > > > > > > > > Yes, that would require a new analysis on this code. May we consider > > > > it next? As you said, I prefer having an initial version that looks > > > > as simple as the original java code. > > > > > > > > > - The squareToLen stub implementation is very close the Java > > > > > implementation. So it'd be interesting to understand what C2 > > > > > doesn't do as well as the hand written assembly code. Do you know > > > > > that? (Not absolutely necessary for accepting this change as long > > > > > as the stub is measurably faster.) > > > > > > > > I don't know either. Basically I chose doing it because I noticed > > > > some performance gain on SpecJVM2008 when analyzing X64. Then, > > > > taking a closer look, I didn't notice any AVX or some special > > > > instructions on > > > > X64 so I decided to try it on ppc64 by using some basic assembly. > > > > > > > > Thanks > > > > > > > > > > > > > > Best regards, > > > > > Martin > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > > Sent: Donnerstag, 10. 
August 2017 19:22 > > > > > To: 'hotspot-compiler-dev at openjdk.java.net' > > > > dev at openjdk.java.net> > > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > SquareToLen intrinsics > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev- > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > > Sent: ter?a-feira, 8 de agosto de 2017 17:19 > > > > > To: ppc-aix-port-dev at openjdk.java.net > > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > SquareToLen intrinsics > > > > > > > > > > Hi, > > > > > > > > > > Could you please review this specific PPC64 change to hotspot? By > > > > > implementing these intrinsics I noticed a small improvement with > > > > > microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, > > > > > only when backporting to JDK8 an improvement was noticed. > > > > > > > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976 > > > > > Webrev: https://gut.github.io/openjdk/webrev/JDK- > 8185976/webrev/ > > > > > > > > > > Motivation for this implementation: > > > > > https://twitter.com/ijuma/status/698309312498835457 > > > > > > > > > > Best regards, > > > > > Gustavo Serra Scalet From aph at redhat.com Wed Sep 6 09:53:05 2017 From: aph at redhat.com (Andrew Haley) Date: Wed, 6 Sep 2017 10:53:05 +0100 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> Message-ID: On 05/09/17 18:34, Dmitrij Pochepko wrote: > As you can see, it's up to 26% worse throughput with wider multiplication. > > The reasons for this is: > 1. mulAdd uses 32-bit multiplier (unlike multiplyToLen intrinsic) and it > can?t be changed within the function signature. 
Thus we can't fully
> utilize the potential of 64-bit multiplication.
> 2. umulh instruction is more expensive than mul instruction.

Ah, my apologies. I wasn't thinking about mulAdd, but about squareToLen(). But did you look at the way x86 uses 64-bit multiplications?

> I haven't implemented wider multiplication for squareToLen intrinsic,
> since it'll require much more code due to more corner cases. Also,
> squaring algorithm in BigInteger doesn't handle more than 127 integers
> in one squareToLen call(large integer arrays are divided to smaller
> parts for squaring, so, 1..127 integers are squared at once), which
> makes all additional off-loop penalties expensive in comparison to loop
> execution time.

Should we intrinsify squareToLen() at all? It's only used AFAICS by C1 and interpreter when doing integer crypto. One other thing I haven't checked: is the multiplyToLen() intrinsic called when squareToLen() is absent?

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From aph at redhat.com Wed Sep 6 09:59:39 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Sep 2017 10:59:39 +0100
Subject: escape analysis friendly very small objects
In-Reply-To: <660539323.2919871.1504614953928@mail.yahoo.com>
References: <660539323.2919871.1504614953928.ref@mail.yahoo.com> <660539323.2919871.1504614953928@mail.yahoo.com>
Message-ID: <25e885c7-9545-966e-3b42-530ac7692eb4@redhat.com>

On 05/09/17 13:35, Andy Nuss wrote:
> I have a variety of classes which just contain a couple scalar
> primitives and possibly a reference to an object that is clearly on
> the heap, all of these small number of data members are private and
> all of them mutable. Much like the Point class in Brian Goetz's
> article on escape analysis on whether the vm uses the stack or heap
> for class instances.
>
> There are two use cases for these objects: (1) create with new such
> a small object as a local variable and use it throughout the
> function, mutating it, etc. (2) create a class with lots of these
> small objects as private or possibly protected members.
> The hope in the first case is under what conditions will hotspot put
> the object's 2 or 3 fields onto the stack and ideally without the
> hidden headers needed to make it a heap object. hashCode, equals,
> and clone are not implemented or used if that is important.
> I.e. will hotspot ever make it a simple C++ like object on the
> stack. Does it help if I do defensive copy if I return one of these
> small objects from the function?

Possibly. It'd help to have an example that we could talk about. Then we could try it and look at the generated code, and look at why it does (or does not) eliminate constructors. HotSpot is remarkably good at this, but occasionally fails because of unhandled corner cases or because an object seems not to escape but really does.

> In the second case, my hope is that for a class that contains 5 such
> small objects, and would definitely be on the heap, hotspot would be
> smart enough not to create 5 small objects on the heap and then 5
> references to them in the containing class. Instead, it would
> explode into the containing class ala C++ just the 2 or 3 primitive
> datamembers of the object, and again, ideally without headers.

HotSpot doesn't do that.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Wed Sep 6 11:50:24 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij)
Date: Wed, 6 Sep 2017 14:50:24 +0300
Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To:
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
Message-ID: <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>

On 06.09.2017 12:53, Andrew Haley wrote:
> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>
>> The reasons for this is:
>> 1. mulAdd uses 32-bit multiplier (unlike multiplyToLen intrinsic) and it
>> can't be changed within the function signature. Thus we can't fully
>> utilize the potential of 64-bit multiplication.
>> 2. umulh instruction is more expensive than mul instruction.
> Ah, my apologies. I wasn't thinking about mulAdd, but about
> squareToLen(). But did you look at the way x86 uses 64-bit
> multiplications?
>

Yes. It uses a single x86 mulq instruction, which performs a 64x64 multiplication and places the 128-bit result in 2 registers. There is no such single instruction on aarch64, and the most effective aarch64 instruction sequence I've found doesn't seem to be as fast as mulq. Simpler 32x32-bit multiplication works faster according to my measurements.

>> I haven't implemented wider multiplication for squareToLen intrinsic,
>> since it'll require much more code due to more corner cases. Also,
>> squaring algorithm in BigInteger doesn't handle more than 127 integers
>> in one squareToLen call(large integer arrays are divided to smaller
>> parts for squaring, so, 1..127 integers are squared at once), which
>> makes all additional off-loop penalties expensive in comparison to loop
>> execution time.
> Should we intrinsify squareToLen() at all?
Yes, we should intrinsify it, because we can see a performance boost. It is not as significant as on x86, but still noticeable.

> It's only used AFAICS by
> C1 and interpreter when doing integer crypto.
This intrinsic is known to C2 (http://hg.openjdk.java.net/jdk10/hs/hotspot/file/tip/src/share/vm/opto/library_call.cpp#l5507). squareToLen is called in BigInteger multiplication when a number is multiplied by itself (http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l1565) and in the pow(...) method: http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2305

> One other thing I
> haven't checked: is the multiplyToLen() intrinsic called when
> squareToLen() is absent?
>
It could have been a good alternative, but multiplyToLen is not used in place of squareToLen when squareToLen is not implemented. The Java implementation of squareToLen will eventually be compiled and used instead: http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039

Thanks,
Dmitrij

From gustavo.scalet at eldorado.org.br Wed Sep 6 12:32:20 2017
From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet)
Date: Wed, 6 Sep 2017 12:32:20 +0000
Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
In-Reply-To:
References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br> <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com> <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br> <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com> <0ef23b5fcbc54996aea876d4c60e4097@sap.com> <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br> <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com> <6362e4c1e3ab4871b12232580f2971aa@serv031.corp.eldorado.org.br>
Message-ID: <1a54e77a4b5e45e2b848da3fccf423dd@serv030.corp.eldorado.org.br>

Thanks Goetz.

Could somebody sponsor this change?
Thanks

> -----Original Message-----
> From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com]
> Sent: quarta-feira, 6 de setembro de 2017 03:30
> To: Gustavo Serra Scalet; Doerr, Martin; 'hotspot-compiler-dev at openjdk.java.net'
> Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
>
> Hi,
>
> I had a look at this change and tested it. Reviewed.
>
> Best regards,
> Goetz.
>
> > -----Original Message-----
> > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net]
> > On Behalf Of Gustavo Serra Scalet
> > Sent: Freitag, 1. September 2017 19:12
> > To: Doerr, Martin; 'hotspot-compiler-dev at openjdk.java.net'
> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > SquareToLen intrinsics
> >
> > Hi Martin,
> >
> > > -----Original Message-----
> > > From: Doerr, Martin
> > > your first webrev already works on Big Endian. So the only required
> > > change is to fix your new code by this trivial patch:
> > > --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:47:45 2017 +0200
> > > +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:55:08 2017 +0200
> > > @@ -3426,7 +3426,9 @@
> > >  __ srdi (product, product, 1);
> > >  // join them to the same register and store it as Little Endian
> > >  __ orr (product, lplw_s, product);
> > > +#ifdef VM_LITTLE_ENDIAN
> > >  __ rldicl (product, product, 32, 0);
> > > +#endif
> > >  __ stdu (product, 8, out_aux);
> > >  __ bdnz (LOOP_SQUARE);
> > >
> > > So please enable it again for Big Endian in vm_version_ppc. Besides
> > > that, it looks good to me. We also need a 2nd review.
> >
> > Great! Thanks for checking it and suggesting the diff.
> >
> > I changed these things. You can find it below:
> > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/
> >
> > I wonder who could be a 2nd reviewer... Anybody in mind that we may ping?
> > Maybe Goetz Lindenmaier?
> > > > Best Regards, > > Gustavo Serra Scalet > > > > > > > > Best regards, > > > Martin > > > > > > > > > -----Original Message----- > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > > Sent: Mittwoch, 30. August 2017 19:03 > > > To: Doerr, Martin ; 'hotspot-compiler- > > > dev at openjdk.java.net' > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > Hi Martin, > > > > > > (webrev at the end) > > > > > > > -----Original Message----- > > > > From: Doerr, Martin > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't > > > > > seem to need further changes as it's being cleared with clrldi, > > > > > which is the same as rldic with no shift. Therefore it's treated > > > > > appropriately as requested for "offset" parameter. Do you agree? > > > > > > > > No, I didn't find clrldi for len in generate_mulAdd(). Only for k. > > > > > > I'm sorry. I was thinking about "offset" and "k", which are both > > > cleaned on generate_mulAdd(). "len" was not cleaned and it was being > > > used on > > > muladd() directly with cmpdi, which could lead to problems. > > > > > > That is being changed. > > > > > > > Where are in_len and out_len fixed up in generate_squareToLen()? > > > > > > They are not. According to your suggestions, I agree it also needs > > > to be done for the same reason. > > > > > > > > You are right. The way I'm building the 64 bits of the register > > > > > depends on which kind of endianness it is run. For now it works > > > > > only on little endian so I'm adding a switch (just like I did > > > > > for SHA) to make it available only on little endian systems. > > > > > > > > It shouldn't be that hard to get it working on big endian ;-) > > > > Btw., my point was not to replace the 2 4-byte store instructions > > > > by an 8-byte one (though I'm also ok with that). 
It was that 2 > > > > stwu which update the same pointer doesn't make sense from > performance point of view. > > > > Please keep something which works on big endian, too. > > > > > > I see. The 2x stwu was being used like that because it was the > > > trivial approach when considering the original java update: > > > z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); z[i++] > > > = (int)(product >>> 1); > > > > > > As you pointed out, that might cause some stall on the pipeline so I > > > made it with 1s stdu (and could improve code by reducing 1 > > > instruction) > > > > > > Now about having a big endian version: I'm not confident in doing so > > > as I don't have access to such a machine at the moment. You were > > > kind on offering test support but I don't know if it'd work like > > > that. I may support you in checking out which places are > > > endianness-related but I'm not comfortable in sending you untested > code. > > > > > > Would you be interested in doing such a changes for making it work > > > on Big Endian? For this patch, I provided an interesting test that > > > might help you to verify if it worked. > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 > > > > > 14:15:31 2017). The reported performance speedup was calculated > > > > > by running the following test (TestSquareToLen.java): > > > > > > > > Seems like JDK-8145913 has not been backported, yet. Sorry for not > > > > checking this earlier. So if you want to make RSA really fast, it > > > > should be so much better to backport that one. But I can still > > > > sponsor this change as it may be used elsewhere. > > > > > > No problem. It's nice to know that I may not need to request a > > > backport of this patch for performance reasons. 
> > > > > > And at last, but not least, the new webrev with these clrldi > changes: > > > https://gut.github.io/openjdk/webrev/JDK- > > 8185976/webrev.03/index.html > > > > > > Thank you once again, > > > Gustavo Serra Scalet > > > > > > > Best regards, > > > > Martin > > > > > > > > > > > > -----Original Message----- > > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > > > Sent: Dienstag, 29. August 2017 22:37 > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > dev at openjdk.java.net' > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > SquareToLen intrinsics > > > > > > > > Hi Martin, > > > > > > > > New changes: > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > > > > > > > Check comments below, please. > > > > > > > > > -----Original Message----- > > > > > From: Doerr, Martin > > > > > > > > > > 1. Sign extending offset and len Right, sign and zero extending > > > > > is equivalent for offset and len because they are guaranteed to > > > > > be >=0 (by checks in Java). But you can only rely on bit 32 (IBM > > > > > notation) to be 0. Bit 0-31 may contain > > > > garbage. > > > > > rldicl was incorrect. My mistake, sorry for that. Correct would > > > > > be rldic which also clears the least significant bits. > > > > > len should also get fixed e.g. by replacing cmpdi by extsw_ in > > > muladd. > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem > > > > to need further changes as it's being cleared with clrldi, which > > > > is the same as rldic with no shift. Therefore it's treated > > > > appropriately as requested for "offset" parameter. Do you agree? > > > > > > > > > 2. Using 8 byte instructions for int The code which feeds stdu > > > > > is endianess specific. Doesn't work on all > > > > > PPC64 platforms. > > > > > > > > You are right. The way I'm building the 64 bits of the register > > > > depends on which kind of endianness it is run. 
> > > > For now it works only on little endian, so I'm adding a switch (just
> > > > like I did for SHA) to make it available only on little endian systems.
> > > >
> > > > > 3. Regarding Andrew's point: Superseded by Montgomery?
> > > > > The Montgomery change got backported to jdk8u (JDK-8150152 in 8u102).
> > > > > I'd expect the performance improvement of these intrinsics to be
> > > > > irrelevant for crypto.rsa. Did you measure with an older jdk8 release?
> > > >
> > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6
> > > > 14:15:31 2017). The reported performance speedup was calculated by
> > > > running the following test (TestSquareToLen.java):
> > > > import java.math.BigInteger;
> > > >
> > > > public class TestSquareToLen {
> > > >
> > > >     public static void main(String args[]) throws Exception {
> > > >
> > > >         int n = 10000000;
> > > >         if (args.length >= 1) {
> > > >             n = Integer.parseInt(args[0]);
> > > >         }
> > > >
> > > >         // One long decimal literal, split here only to keep lines short.
> > > >         BigInteger b1 = new BigInteger(
> > > >             "34893980923557359086350514982082503920002298311877320859" + "9936"
> > > >             + "7395594183801021468843071391756049207873137016631559837931214754926" + "09222"
> > > >             + "3780292110207609223272184808289336630057735969423726808520641030118" + "11651"
> > > >             + "6440180488338234823908199478965242076358579845520899779963131131540" + "16668"
> > > >             + "718795349783157384006672542605760392289645528307");
> > > >         BigInteger b2 = BigInteger.valueOf(0);
> > > >         BigInteger check = BigInteger.valueOf(1);
> > > >         for (int i = 0; i < n; i++) {
> > > >             b2 = b1.multiply(b1);
> > > >             if (i == 0) {
> > > >                 // Didn't JIT yet. Comparing against interpreted mode
> > > >                 check = b2;
> > > >             }
> > > >         }
> > > >         if (b2.compareTo(check) == 0)
> > > >             System.out.println("Check ok!");
> > > >         else
> > > >             System.out.println("Check failed!");
> > > >     }
> > > > }
> > > >
> > > >
> > > > I got these results on JDK8 on my POWER8 machine:
> > > > $ ./javac TestSquareToLen.java
> > > > $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > >
> > > >  Performance counter stats for './java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> > > >
> > > >       15148.009557      task-clock (msec)         #    1.053 CPUs utilized            ( +-  0.48% )
> > > >              2,425      context-switches          #    0.160 K/sec                    ( +-  5.84% )
> > > >                356      cpu-migrations            #    0.023 K/sec                    ( +-  3.01% )
> > > >              5,153      page-faults               #    0.340 K/sec                    ( +-  5.22% )
> > > >     54,536,889,909      cycles                    #    3.600 GHz                      ( +-  0.56% ) (66.68%)
> > > >        239,554,105      stalled-cycles-frontend   #    0.44% frontend cycles idle     ( +-  4.87% ) (49.90%)
> > > >     27,683,316,001      stalled-cycles-backend    #   50.76% backend cycles idle      ( +-  0.56% ) (50.17%)
> > > >    102,020,229,733      instructions              #    1.87  insn per cycle
> > > >                                                   #    0.27  stalled cycles per insn  ( +-  0.14% ) (66.94%)
> > > >      7,706,072,218      branches                  #  508.718 M/sec                    ( +-  0.23% ) (50.20%)
> > > >        456,051,162      branch-misses             #    5.92% of all branches          ( +-  0.09% ) (50.07%)
> > > >
> > > >       14.390840733 seconds time elapsed                                          ( +-  0.09% )
> > > >
> > > > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > >
> > > >  Performance counter stats for './java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> > > >
> > > >       11368.141410      task-clock (msec)         #    1.045 CPUs utilized            ( +-  0.64% )
> > > >              1,964      context-switches          #    0.173 K/sec                    ( +-  8.93% )
> > > >                338      cpu-migrations            #    0.030 K/sec                    ( +-  7.65% )
> > > >              5,627      page-faults               #    0.495 K/sec                    ( +-  6.15% )
> > > >     41,100,168,967      cycles                    #    3.615 GHz                      ( +-  0.50% ) (66.36%)
> > > >        309,052,316      stalled-cycles-frontend   #    0.75% frontend cycles idle     ( +-  2.84% ) (49.89%)
> > > >     14,188,581,685      stalled-cycles-backend    #   34.52% backend cycles idle      ( +-  0.99% ) (50.34%)
> > > >     77,846,029,829      instructions              #    1.89  insn per cycle
> > > >                                                   #    0.18  stalled cycles per insn  ( +-  0.29% ) (66.96%)
> > > >      8,435,216,989      branches                  #  742.005 M/sec                    ( +-  0.28% ) (50.17%)
> > > >        339,903,936      branch-misses             #    4.03% of all branches          ( +-  0.27% ) (49.90%)
> > > >
> > > >       10.882357546 seconds time elapsed                                          ( +-  0.24% )
> > > >
> > > >
> > > > (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s
> > > > (+- 0.53%) on JDK10)
> > > >
> > > > I can run SpecJVM2008's crypto.rsa if you are interested.
> > > >
> > > > Thank you once again for reviewing this.
> > > >
> > > > Best regards,
> > > > Gustavo
> > > >
> > > > > (I think the change is still acceptable as the intrinsics could
> > > > > be used elsewhere and the implementation also exists on other
> > > > > platforms.)
> > > > >
> > > > > Best regards,
> > > > > Martin
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Gustavo Serra Scalet
> > > > > [mailto:gustavo.scalet at eldorado.org.br]
> > > > > Sent: Mittwoch, 16.
August 2017 18:50 > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > dev at openjdk.java.net' > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > SquareToLen intrinsics > > > > > > > > > > Hi Martin, > > > > > > > > > > Thanks for dedicated review. It took me a while to be able to > > > > > work on this but I hope to have your points solved. Please check > > > > > below the review as well as my comments quoting your email: > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/ > > > > > > > > > > > -----Original Message----- > > > > > > First of all, C2 does not perform sign extend when calling > stubs. > > > > > > The int parms need to get zero/sign extended. (Could even be > > > > > > done without extra instructions by replacing sldi -> rldicl, > > > > > > cmpdi -> extsw_ in some > > > > > > cases.) > > > > > > > > > > Does it make a difference on my case? > > > > > > > > > > I guess you are talking about mulAdd preparation code. The only > > > > > aspect I found about him is to force the cast from 32 bits -> 64 > > > > > bits by cleaning higher bits. Offset is a signed integer but it > > > > > can't be > > > > negative anyway. > > > > > > > > > > So I changed from: > > > > > sldi (R5_ARG3, R5_ARG3, 2); > > > > > > > > > > to: > > > > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > > > > > > > > > > macroAssembler_ppc.cpp: > > > > > > - Indentation should be 2 spaces. > > > > > > > > > > Done > > > > > > > > > > > > > > > > stubGenerator_ppc:cpp: > > > > > > - or_, addi_ should get replaced by orr, addi when CR0 result > > > > > > is not needed. > > > > > > > > > > Done > > > > > > > > > > > - Where is lplw initialized? > > > > > > > > > > It should be initialized with 0, I missed that... > > > > > > > > > > > - I believe that the updating load/store instructions e.g. > > > > > > lwzu don't perform well on some processors. 
At least using > > > > > > stwu 2 times in the loop doesn't make sense. > > > > > > > > > > You are right. I could manipulate the bits differently and ended > > > > > up with a single stdu in the loop. Neat! Although I could not > > > > > reduce the total number of instructions. > > > > > > > > > > > - Note: It should be possible to use 8 byte instead of 4 byte > > > > > > instructions: MacroAssembler::multiply64, addc, adde. But I'm > > > > > > not requesting to change that because I guess it would make > > > > > > the code very complicated, especially when supporting both > > > > > > endianess > > > > versions. > > > > > > > > > > Yes, that would require a new analysis on this code. May we > > > > > consider it next? As you said, I prefer having an initial > > > > > version that looks as simple as the original java code. > > > > > > > > > > > - The squareToLen stub implementation is very close the Java > > > > > > implementation. So it'd be interesting to understand what C2 > > > > > > doesn't do as well as the hand written assembly code. Do you > > > > > > know that? (Not absolutely necessary for accepting this change > > > > > > as long as the stub is measurably faster.) > > > > > > > > > > I don't know either. Basically I chose doing it because I > > > > > noticed some performance gain on SpecJVM2008 when analyzing X64. > > > > > Then, taking a closer look, I didn't notice any AVX or some > > > > > special instructions on > > > > > X64 so I decided to try it on ppc64 by using some basic > assembly. > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > Best regards, > > > > > > Martin > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > > > Sent: Donnerstag, 10. 
August 2017 19:22
> > > > > > To: 'hotspot-compiler-dev at openjdk.java.net'
> > > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > > > SquareToLen intrinsics
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces at openjdk.java.net]
> > > > > > On Behalf Of Gustavo Serra Scalet
> > > > > > Sent: terça-feira, 8 de agosto de 2017 17:19
> > > > > > To: ppc-aix-port-dev at openjdk.java.net
> > > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > > > SquareToLen intrinsics
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Could you please review this specific PPC64 change to hotspot?
> > > > > > By implementing these intrinsics I noticed a small improvement
> > > > > > in microbenchmark analysis. On SpecJVM2008's crypto.rsa
> > > > > > benchmark, an improvement was noticed only when backporting to JDK8.
> > > > > >
> > > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> > > > > > Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> > > > > >
> > > > > > Motivation for this implementation:
> > > > > > https://twitter.com/ijuma/status/698309312498835457
> > > > > >
> > > > > > Best regards,
> > > > > > Gustavo Serra Scalet

From aph at redhat.com Wed Sep 6 12:43:23 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Sep 2017 13:43:23 +0100
Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
Message-ID: <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>

On 06/09/17 12:50, Dmitrij wrote:
>
>
> On 06.09.2017 12:53, Andrew Haley wrote:
>> On 05/09/17 18:34, Dmitrij
Pochepko wrote:
>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>>
>>> The reasons for this are:
>>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it
>>> can't be changed within the function signature. Thus we can't fully
>>> utilize the potential of 64-bit multiplication.
>>> 2. The umulh instruction is more expensive than the mul instruction.
>> Ah, my apologies. I wasn't thinking about mulAdd, but about
>> squareToLen(). But did you look at the way x86 uses 64-bit
>> multiplications?
>>
> Yes. It uses a single x86 mulq instruction, which performs a 64x64
> multiplication and places the 128-bit result in 2 registers. There is no
> such single instruction on aarch64, and the most effective aarch64
> instruction sequence I've found doesn't seem to be as fast as mulq.

I think there is effectively a 64x64 -> 128-bit instruction: it's just
that you have to represent it as a mul and a umulh. But I take your
point.

>> One other thing I
>> haven't checked: is the multiplyToLen() intrinsic called when
>> squareToLen() is absent?
>>
> It could have been a good alternative, but it's not used instead of
> squareToLen when squareToLen is not implemented. A java implementation
> of squareToLen will be eventually compiled and used instead:
> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039

Please compare your squareToLen with the MacroAssembler::multiply_to_len
we already have.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
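To make the mul/umulh point concrete, here is a minimal Java sketch of building a full 64x64 -> 128-bit unsigned product from a low half and a high half, which is the split that AArch64 expresses as mul plus umulh and that x86 mulq returns in a register pair. The class and method names are hypothetical, and this is not the stub code under review. It assumes Java 9 or later for `Math.multiplyHigh`, which is signed, so the unsigned high half is derived with the usual sign correction.

```java
import java.math.BigInteger;

public class WideMultiplyDemo {

    // Low 64 bits of the product. These bits are identical for signed and
    // unsigned operands, and correspond to what a plain `mul` produces.
    static long mulLow(long a, long b) {
        return a * b;
    }

    // High 64 bits of the unsigned 128-bit product, i.e. the value a
    // `umulh`-style instruction yields. Math.multiplyHigh is signed, so add
    // the other operand whenever one operand has its top bit set.
    static long umulHigh(long a, long b) {
        long hi = Math.multiplyHigh(a, b);
        hi += (a >> 63) & b;
        hi += (b >> 63) & a;
        return hi;
    }

    // Interpret a long as an unsigned 64-bit value.
    static BigInteger unsigned(long v) {
        BigInteger b = BigInteger.valueOf(v);
        return v >= 0 ? b : b.add(BigInteger.ONE.shiftLeft(64));
    }

    // Cross-check the two halves against a BigInteger reference product.
    static boolean matchesReference(long a, long b) {
        BigInteger expected = unsigned(a).multiply(unsigned(b));
        BigInteger actual = unsigned(umulHigh(a, b)).shiftLeft(64)
                .add(unsigned(mulLow(a, b)));
        return expected.equals(actual);
    }

    public static void main(String[] args) {
        long a = 0xFFFFFFFFFFFFFFFFL;
        long b = 0xFFFFFFFFFFFFFFFFL;
        System.out.println("hi=" + Long.toHexString(umulHigh(a, b))
                + " lo=" + Long.toHexString(mulLow(a, b))
                + " ok=" + matchesReference(a, b));
    }
}
```

The sketch only shows that the two halves together carry the whole 128-bit result. It says nothing about the relative cost of the two-instruction sequence versus 32x32-bit multiplication, which is the measurement question discussed above.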
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From goetz.lindenmaier at sap.com Wed Sep 6 12:44:06 2017
From: goetz.lindenmaier at sap.com (Lindenmaier, Goetz)
Date: Wed, 6 Sep 2017 12:44:06 +0000
Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
In-Reply-To: <1a54e77a4b5e45e2b848da3fccf423dd@serv030.corp.eldorado.org.br>
References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br> <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com> <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br> <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com> <0ef23b5fcbc54996aea876d4c60e4097@sap.com> <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br> <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com> <6362e4c1e3ab4871b12232580f2971aa@serv031.corp.eldorado.org.br> <1a54e77a4b5e45e2b848da3fccf423dd@serv030.corp.eldorado.org.br>
Message-ID: <170eb84ecbdb4c9b9fe8d4481e4c319f@sap.com>

Hi Gustavo,

the repos are all closed. Once they are opened again, you will have to merge your change into the new repo structure and post a new webrev; only then can it be sponsored. Martin or I will sponsor it then.

Best regards,
Goetz.

> -----Original Message-----
> From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> Sent: Mittwoch, 6. September 2017 14:32
> To: Lindenmaier, Goetz; Doerr, Martin; 'hotspot-compiler-dev at openjdk.java.net'
> Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
>
> Thanks Goetz.
>
> Could somebody sponsor this change?
>
> Thanks
>
> > -----Original Message-----
> > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com]
> > Sent: quarta-feira, 6 de setembro de 2017 03:30
> > To: Gustavo Serra Scalet; Doerr, Martin; 'hotspot-compiler-dev at openjdk.java.net'
> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > SquareToLen intrinsics
> >
> > Hi,
> >
> > I had a look at this change and tested it. Reviewed.
> >
> > Best regards,
> > Goetz.
> > > > > -----Original Message----- > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > Sent: Freitag, 1. September 2017 19:12 > > > To: Doerr, Martin ; 'hotspot-compiler- > > > dev at openjdk.java.net' > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > Hi Martin, > > > > > > > -----Original Message----- > > > > From: Doerr, Martin > > > > your first webrev already works on Big Endian. So the only required > > > > change is to fix your new code by this trivial patch: > > > > --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:47:45 > > 2017 > > > > +0200 > > > > +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:55:08 > > 2017 > > > > +0200 > > > > @@ -3426,7 +3426,9 @@ > > > > __ srdi (product, product, 1); > > > > // join them to the same register and store it as Little Endian > > > > __ orr (product, lplw_s, product); > > > > +#ifdef VM_LITTLE_ENDIAN > > > > __ rldicl (product, product, 32, 0); > > > > +#endif > > > > __ stdu (product, 8, out_aux); > > > > __ bdnz (LOOP_SQUARE); > > > > > > > > So please enable it again for Big Endian in vm_version_ppc. Besides > > > > that, it looks good to me. We also need a 2nd review. > > > > > > Great! Thanks for checking it and suggesting the diff. > > > > > > I changed these things. You can find it below: > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/ > > > > > > I wonder who could be a 2nd reviewer... Anybody in mind that we may > > ping? > > > Maybe Goetz Lindenmaier? > > > > > > Best Regards, > > > Gustavo Serra Scalet > > > > > > > > > > > Best regards, > > > > Martin > > > > > > > > > > > > -----Original Message----- > > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > > > Sent: Mittwoch, 30. 
August 2017 19:03 > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > dev at openjdk.java.net' > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > SquareToLen intrinsics > > > > > > > > Hi Martin, > > > > > > > > (webrev at the end) > > > > > > > > > -----Original Message----- > > > > > From: Doerr, Martin > > > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't > > > > > > seem to need further changes as it's being cleared with clrldi, > > > > > > which is the same as rldic with no shift. Therefore it's treated > > > > > > appropriately as requested for "offset" parameter. Do you agree? > > > > > > > > > > No, I didn't find clrldi for len in generate_mulAdd(). Only for k. > > > > > > > > I'm sorry. I was thinking about "offset" and "k", which are both > > > > cleaned on generate_mulAdd(). "len" was not cleaned and it was being > > > > used on > > > > muladd() directly with cmpdi, which could lead to problems. > > > > > > > > That is being changed. > > > > > > > > > Where are in_len and out_len fixed up in generate_squareToLen()? > > > > > > > > They are not. According to your suggestions, I agree it also needs > > > > to be done for the same reason. > > > > > > > > > > You are right. The way I'm building the 64 bits of the register > > > > > > depends on which kind of endianness it is run. For now it works > > > > > > only on little endian so I'm adding a switch (just like I did > > > > > > for SHA) to make it available only on little endian systems. > > > > > > > > > > It shouldn't be that hard to get it working on big endian ;-) > > > > > Btw., my point was not to replace the 2 4-byte store instructions > > > > > by an 8-byte one (though I'm also ok with that). It was that 2 > > > > > stwu which update the same pointer doesn't make sense from > > performance point of view. > > > > > Please keep something which works on big endian, too. > > > > > > > > I see. 
The 2x stwu was being used like that because it was the > > > > trivial approach when considering the original java update: > > > > z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); z[i++] > > > > = (int)(product >>> 1); > > > > > > > > As you pointed out, that might cause some stall on the pipeline so I > > > > made it with 1s stdu (and could improve code by reducing 1 > > > > instruction) > > > > > > > > Now about having a big endian version: I'm not confident in doing so > > > > as I don't have access to such a machine at the moment. You were > > > > kind on offering test support but I don't know if it'd work like > > > > that. I may support you in checking out which places are > > > > endianness-related but I'm not comfortable in sending you untested > > code. > > > > > > > > Would you be interested in doing such a changes for making it work > > > > on Big Endian? For this patch, I provided an interesting test that > > > > might help you to verify if it worked. > > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 > > > > > > 14:15:31 2017). The reported performance speedup was calculated > > > > > > by running the following test (TestSquareToLen.java): > > > > > > > > > > Seems like JDK-8145913 has not been backported, yet. Sorry for not > > > > > checking this earlier. So if you want to make RSA really fast, it > > > > > should be so much better to backport that one. But I can still > > > > > sponsor this change as it may be used elsewhere. > > > > > > > > No problem. It's nice to know that I may not need to request a > > > > backport of this patch for performance reasons. 
> > > > > > > > And at last, but not least, the new webrev with these clrldi > > changes: > > > > https://gut.github.io/openjdk/webrev/JDK- > > > 8185976/webrev.03/index.html > > > > > > > > Thank you once again, > > > > Gustavo Serra Scalet > > > > > > > > > Best regards, > > > > > Martin > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > > > > Sent: Dienstag, 29. August 2017 22:37 > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > dev at openjdk.java.net' > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > SquareToLen intrinsics > > > > > > > > > > Hi Martin, > > > > > > > > > > New changes: > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > > > > > > > > > Check comments below, please. > > > > > > > > > > > -----Original Message----- > > > > > > From: Doerr, Martin > > > > > > > > > > > > 1. Sign extending offset and len Right, sign and zero extending > > > > > > is equivalent for offset and len because they are guaranteed to > > > > > > be >=0 (by checks in Java). But you can only rely on bit 32 (IBM > > > > > > notation) to be 0. Bit 0-31 may contain > > > > > garbage. > > > > > > rldicl was incorrect. My mistake, sorry for that. Correct would > > > > > > be rldic which also clears the least significant bits. > > > > > > len should also get fixed e.g. by replacing cmpdi by extsw_ in > > > > muladd. > > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem > > > > > to need further changes as it's being cleared with clrldi, which > > > > > is the same as rldic with no shift. Therefore it's treated > > > > > appropriately as requested for "offset" parameter. Do you agree? > > > > > > > > > > > 2. Using 8 byte instructions for int The code which feeds stdu > > > > > > is endianess specific. Doesn't work on all > > > > > > PPC64 platforms. > > > > > > > > > > You are right. 
The way I'm building the 64 bits of the register > > > > > depends on which kind of endianness it is run. For now it works > > > > > only on little endian so I'm adding a switch (just like I did for > > > > > SHA) to make it available only on little endian systems. > > > > > > > > > > > 3.Regarding Andrew's point: Superseded by Montgomery? > > > > > > The Montgomery change got backported to jdk8u (JDK-8150152 in > > > > 8u102). > > > > > > I'd expect the performance improvement of these intrinsics to be > > > > > > irrelevant for crypto.rsa. Did you measure with an older jdk8 > > > > release? > > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 > > > > > 14:15:31 2017). The reported performance speedup was calculated by > > > > > running the following test (TestSquareToLen.java): > > > > > import java.math.BigInteger; > > > > > > > > > > public class TestSquareToLen { > > > > > > > > > > public static void main(String args[]) throws Exception { > > > > > > > > > > int n = 10000000; > > > > > if (args.length >=1) { > > > > > n = Integer.parseInt(args[0]); > > > > > } > > > > > > > > > > BigInteger b1 = new > > > > > > > > > BigInteger("34893980923557359086350514982082503920002298311877320859 > > > 99 > > > > > 36 > > > > > > > > > 7395594183801021468843071391756049207873137016631559837931214754926 > > > 092 > > > > > 22 > > > > > > > > > 3780292110207609223272184808289336630057735969423726808520641030118 > > > 116 > > > > > 51 > > > > > > > > > 6440180488338234823908199478965242076358579845520899779963131131540 > > > 166 > > > > > 68 718795349783157384006672542605760392289645528307"); > > > > > BigInteger b2 = BigInteger.valueOf(0); > > > > > BigInteger check = BigInteger.valueOf(1); > > > > > for (int i = 0; i < n; i++) { > > > > > b2 = b1.multiply(b1); > > > > > if (i == 0) > > > > > // Didn't JIT yet. 
Comparing against interpreted mode > > > > > check = b2; > > > > > } > > > > > if (b2.compareTo(check) == 0) > > > > > System.out.println("Check ok!"); > > > > > else > > > > > System.out.println("Check failed!"); > > > > > } > > > > > } > > > > > > > > > > > > > > > I got these results on JDK8 on my POWER8 machine: > > > > > $ ./javac TestSquareToLen.java > > > > > $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:- > > > > > UseSquareToLenIntrinsic TestSquareToLen Check ok! > > > > > Check ok! > > > > > Check ok! > > > > > Check ok! > > > > > Check ok! > > > > > > > > > > Performance counter stats for './java -XX:-UseMulAddIntrinsic > > > > > -XX:- UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > > > > > 15148.009557 task-clock (msec) # 1.053 CPUs > > > > > utilized ( +- 0.48% ) > > > > > 2,425 context-switches # 0.160 K/sec > > > > > ( +- 5.84% ) > > > > > 356 cpu-migrations # 0.023 K/sec > > > > > ( +- 3.01% ) > > > > > 5,153 page-faults # 0.340 K/sec > > > > > ( +- 5.22% ) > > > > > 54,536,889,909 cycles # 3.600 GHz > > > > > ( +- 0.56% ) (66.68%) > > > > > 239,554,105 stalled-cycles-frontend # 0.44% > > frontend > > > > > cycles idle ( +- 4.87% ) (49.90%) > > > > > 27,683,316,001 stalled-cycles-backend # 50.76% > > backend > > > > > cycles idle ( +- 0.56% ) (50.17%) > > > > > 102,020,229,733 instructions # 1.87 insn > > per > > > > > cycle > > > > > # 0.27 > > stalled > > > > > cycles per insn ( +- 0.14% ) (66.94%) > > > > > 7,706,072,218 branches # 508.718 M/sec > > > > > ( +- 0.23% ) (50.20%) > > > > > 456,051,162 branch-misses # 5.92% of > > all > > > > > branches ( +- 0.09% ) (50.07%) > > > > > > > > > > 14.390840733 seconds time elapsed ( +- 0.09% ) > > > > > > > > > > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic - > > > > > XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok! > > > > > Check ok! > > > > > Check ok! > > > > > Check ok! > > > > > Check ok! 
> > > > > > > > > > Performance counter stats for './java -XX:+UseMulAddIntrinsic - > > > > > XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > > > > > 11368.141410 task-clock (msec) # 1.045 CPUs > > > > > utilized ( +- 0.64% ) > > > > > 1,964 context-switches # 0.173 K/sec > > > > > ( +- 8.93% ) > > > > > 338 cpu-migrations # 0.030 K/sec > > > > > ( +- 7.65% ) > > > > > 5,627 page-faults # 0.495 K/sec > > > > > ( +- 6.15% ) > > > > > 41,100,168,967 cycles # 3.615 GHz > > > > > ( +- 0.50% ) (66.36%) > > > > > 309,052,316 stalled-cycles-frontend # 0.75% > > frontend > > > > > cycles idle ( +- 2.84% ) (49.89%) > > > > > 14,188,581,685 stalled-cycles-backend # 34.52% > > backend > > > > > cycles idle ( +- 0.99% ) (50.34%) > > > > > 77,846,029,829 instructions # 1.89 insn > > per > > > > > cycle > > > > > # 0.18 > > stalled > > > > > cycles per insn ( +- 0.29% ) (66.96%) > > > > > 8,435,216,989 branches # 742.005 M/sec > > > > > ( +- 0.28% ) (50.17%) > > > > > 339,903,936 branch-misses # 4.03% of > > all > > > > > branches ( +- 0.27% ) (49.90%) > > > > > > > > > > 10.882357546 seconds time elapsed ( +- 0.24% ) > > > > > > > > > > > > > > > (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s > > > > > (+- > > > > > 0.53%) on JDK10) > > > > > > > > > > I may run for SpecJVM2008's crypto.rsa if you are interested. > > > > > > > > > > Thank you once again for reviewing this. > > > > > > > > > > Best regards, > > > > > Gustavo > > > > > > > > > > > (I think the change is still acceptable as the intrinsics could > > > > > > be used elsewhere and the implementation also exists on other > > > > > > platforms.) > > > > > > > > > > > > Best regards, > > > > > > Martin > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: Gustavo Serra Scalet > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > Sent: Mittwoch, 16. 
August 2017 18:50 > > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > > dev at openjdk.java.net' > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > > SquareToLen intrinsics > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > Thanks for dedicated review. It took me a while to be able to > > > > > > work on this but I hope to have your points solved. Please check > > > > > > below the review as well as my comments quoting your email: > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/ > > > > > > > > > > > > > -----Original Message----- > > > > > > > First of all, C2 does not perform sign extend when calling > > stubs. > > > > > > > The int parms need to get zero/sign extended. (Could even be > > > > > > > done without extra instructions by replacing sldi -> rldicl, > > > > > > > cmpdi -> extsw_ in some > > > > > > > cases.) > > > > > > > > > > > > Does it make a difference on my case? > > > > > > > > > > > > I guess you are talking about mulAdd preparation code. The only > > > > > > aspect I found about him is to force the cast from 32 bits -> 64 > > > > > > bits by cleaning higher bits. Offset is a signed integer but it > > > > > > can't be > > > > > negative anyway. > > > > > > > > > > > > So I changed from: > > > > > > sldi (R5_ARG3, R5_ARG3, 2); > > > > > > > > > > > > to: > > > > > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > > > > > > > > > > > > > macroAssembler_ppc.cpp: > > > > > > > - Indentation should be 2 spaces. > > > > > > > > > > > > Done > > > > > > > > > > > > > > > > > > > stubGenerator_ppc:cpp: > > > > > > > - or_, addi_ should get replaced by orr, addi when CR0 result > > > > > > > is not needed. > > > > > > > > > > > > Done > > > > > > > > > > > > > - Where is lplw initialized? > > > > > > > > > > > > It should be initialized with 0, I missed that... > > > > > > > > > > > > > - I believe that the updating load/store instructions e.g. 
> > > > > > > lwzu don't perform well on some processors. At least using > > > > > > > stwu 2 times in the loop doesn't make sense. > > > > > > > > > > > > You are right. I could manipulate the bits differently and ended > > > > > > up with a single stdu in the loop. Neat! Although I could not > > > > > > reduce the total number of instructions. > > > > > > > > > > > > > - Note: It should be possible to use 8 byte instead of 4 byte > > > > > > > instructions: MacroAssembler::multiply64, addc, adde. But I'm > > > > > > > not requesting to change that because I guess it would make > > > > > > > the code very complicated, especially when supporting both > > > > > > > endianess > > > > > versions. > > > > > > > > > > > > Yes, that would require a new analysis on this code. May we > > > > > > consider it next? As you said, I prefer having an initial > > > > > > version that looks as simple as the original java code. > > > > > > > > > > > > > - The squareToLen stub implementation is very close the Java > > > > > > > implementation. So it'd be interesting to understand what C2 > > > > > > > doesn't do as well as the hand written assembly code. Do you > > > > > > > know that? (Not absolutely necessary for accepting this change > > > > > > > as long as the stub is measurably faster.) > > > > > > > > > > > > I don't know either. Basically I chose doing it because I > > > > > > noticed some performance gain on SpecJVM2008 when analyzing > X64. > > > > > > Then, taking a closer look, I didn't notice any AVX or some > > > > > > special instructions on > > > > > > X64 so I decided to try it on ppc64 by using some basic > > assembly. 
> > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > > > > Sent: Donnerstag, 10. August 2017 19:22 > > > > > > > To: 'hotspot-compiler-dev at openjdk.java.net' > > > > > > dev at openjdk.java.net> > > > > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd > and > > > > > > > SquareToLen intrinsics > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev- > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > > > > Sent: ter?a-feira, 8 de agosto de 2017 17:19 > > > > > > > To: ppc-aix-port-dev at openjdk.java.net > > > > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > > > SquareToLen intrinsics > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > Could you please review this specific PPC64 change to hotspot? > > > > > > > By implementing these intrinsics I noticed a small improvement > > > > > > > with microbenchmarks analysis. On SpecJVM2008's crypto.rsa > > > > > > > benchmark, only when backporting to JDK8 an improvement was > > noticed. 
> > > > > > > > > > > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976 > > > > > > > Webrev: https://gut.github.io/openjdk/webrev/JDK- > > > 8185976/webrev/ > > > > > > > > > > > > > > Motivation for this implementation: > > > > > > > https://twitter.com/ijuma/status/698309312498835457 > > > > > > > > > > > > > > Best regards, > > > > > > > Gustavo Serra Scalet From gustavo.scalet at eldorado.org.br Wed Sep 6 12:45:18 2017 From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet) Date: Wed, 6 Sep 2017 12:45:18 +0000 Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics In-Reply-To: <170eb84ecbdb4c9b9fe8d4481e4c319f@sap.com> References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br> <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com> <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br> <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com> <0ef23b5fcbc54996aea876d4c60e4097@sap.com> <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br> <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com> <6362e4c1e3ab4871b12232580f2971aa@serv031.corp.eldorado.org.br> <1a54e77a4b5e45e2b848da3fccf423dd@serv030.corp.eldorado.org.br> <170eb84ecbdb4c9b9fe8d4481e4c319f@sap.com> Message-ID: <0aaf319e25934903a468542d02f6a734@serv030.corp.eldorado.org.br> Alright, thanks for the instructions. I'll keep that in mind. > -----Original Message----- > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com] > Sent: quarta-feira, 6 de setembro de 2017 09:44 > To: Gustavo Serra Scalet ; Doerr, Martin > ; 'hotspot-compiler-dev at openjdk.java.net' > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > SquareToLen intrinsics > > Hi Gustavo, > > the repos are all closed. Once they are opened again, you will have to > merge your change into the new repo structure, post a new webrev and > only then it can be sponsored. Me or Martin will sponsor it then. > > Best regards, > Goetz. 
> > > -----Original Message----- > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > Sent: Mittwoch, 6. September 2017 14:32 > > To: Lindenmaier, Goetz ; Doerr, Martin > > ; 'hotspot-compiler-dev at openjdk.java.net' > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > SquareToLen intrinsics > > > > Thanks Goetz. > > > > Could somebody sponsor this change? > > > > THanks > > > > > -----Original Message----- > > > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com] > > > Sent: quarta-feira, 6 de setembro de 2017 03:30 > > > To: Gustavo Serra Scalet ; Doerr, > > > Martin ; 'hotspot-compiler- > dev at openjdk.java.net' > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > Hi, > > > > > > I had a look at this change and tested it. Reviewed. > > > > > > Best regards, > > > Goetz. > > > > > > > -----Original Message----- > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > Sent: Freitag, 1. September 2017 19:12 > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > dev at openjdk.java.net' > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > SquareToLen intrinsics > > > > > > > > Hi Martin, > > > > > > > > > -----Original Message----- > > > > > From: Doerr, Martin > > > > > your first webrev already works on Big Endian. 
So the only > > > > > required change is to fix your new code by this trivial patch: > > > > > --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 > 17:47:45 > > > 2017 > > > > > +0200 > > > > > +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 > 17:55:08 > > > 2017 > > > > > +0200 > > > > > @@ -3426,7 +3426,9 @@ > > > > > __ srdi (product, product, 1); > > > > > // join them to the same register and store it as Little > Endian > > > > > __ orr (product, lplw_s, product); > > > > > +#ifdef VM_LITTLE_ENDIAN > > > > > __ rldicl (product, product, 32, 0); > > > > > +#endif > > > > > __ stdu (product, 8, out_aux); > > > > > __ bdnz (LOOP_SQUARE); > > > > > > > > > > So please enable it again for Big Endian in vm_version_ppc. > > > > > Besides that, it looks good to me. We also need a 2nd review. > > > > > > > > Great! Thanks for checking it and suggesting the diff. > > > > > > > > I changed these things. You can find it below: > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/ > > > > > > > > I wonder who could be a 2nd reviewer... Anybody in mind that we > > > > may > > > ping? > > > > Maybe Goetz Lindenmaier? > > > > > > > > Best Regards, > > > > Gustavo Serra Scalet > > > > > > > > > > > > > > Best regards, > > > > > Martin > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Gustavo Serra Scalet > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > Sent: Mittwoch, 30. 
August 2017 19:03 > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > dev at openjdk.java.net' > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > SquareToLen intrinsics > > > > > > > > > > Hi Martin, > > > > > > > > > > (webrev at the end) > > > > > > > > > > > -----Original Message----- > > > > > > From: Doerr, Martin > > > > > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" > > > > > > > doesn't seem to need further changes as it's being cleared > > > > > > > with clrldi, which is the same as rldic with no shift. > > > > > > > Therefore it's treated appropriately as requested for > "offset" parameter. Do you agree? > > > > > > > > > > > > No, I didn't find clrldi for len in generate_mulAdd(). Only > for k. > > > > > > > > > > I'm sorry. I was thinking about "offset" and "k", which are both > > > > > cleaned on generate_mulAdd(). "len" was not cleaned and it was > > > > > being used on > > > > > muladd() directly with cmpdi, which could lead to problems. > > > > > > > > > > That is being changed. > > > > > > > > > > > Where are in_len and out_len fixed up in > generate_squareToLen()? > > > > > > > > > > They are not. According to your suggestions, I agree it also > > > > > needs to be done for the same reason. > > > > > > > > > > > > You are right. The way I'm building the 64 bits of the > > > > > > > register depends on which kind of endianness it is run. For > > > > > > > now it works only on little endian so I'm adding a switch > > > > > > > (just like I did for SHA) to make it available only on > little endian systems. > > > > > > > > > > > > It shouldn't be that hard to get it working on big endian ;-) > > > > > > Btw., my point was not to replace the 2 4-byte store > > > > > > instructions by an 8-byte one (though I'm also ok with that). > > > > > > It was that 2 stwu which update the same pointer doesn't make > > > > > > sense from > > > performance point of view. 
> > > > > > Please keep something which works on big endian, too. > > > > > > > > > > I see. The 2x stwu was being used like that because it was the > > > > > trivial approach when considering the original java update: > > > > > z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); > > > > > z[i++] = (int)(product >>> 1); > > > > > > > > > > As you pointed out, that might cause some stall on the pipeline > > > > > so I made it with 1s stdu (and could improve code by reducing 1 > > > > > instruction) > > > > > > > > > > Now about having a big endian version: I'm not confident in > > > > > doing so as I don't have access to such a machine at the moment. > > > > > You were kind on offering test support but I don't know if it'd > > > > > work like that. I may support you in checking out which places > > > > > are endianness-related but I'm not comfortable in sending you > > > > > untested > > > code. > > > > > > > > > > Would you be interested in doing such a changes for making it > > > > > work on Big Endian? For this patch, I provided an interesting > > > > > test that might help you to verify if it worked. > > > > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr > > > > > > > 6 > > > > > > > 14:15:31 2017). The reported performance speedup was > > > > > > > calculated by running the following test > (TestSquareToLen.java): > > > > > > > > > > > > Seems like JDK-8145913 has not been backported, yet. Sorry for > > > > > > not checking this earlier. So if you want to make RSA really > > > > > > fast, it should be so much better to backport that one. But I > > > > > > can still sponsor this change as it may be used elsewhere. > > > > > > > > > > No problem. It's nice to know that I may not need to request a > > > > > backport of this patch for performance reasons. 
> > > > > > > > > > And at last, but not least, the new webrev with these clrldi > > > changes: > > > > > https://gut.github.io/openjdk/webrev/JDK- > > > > 8185976/webrev.03/index.html > > > > > > > > > > Thank you once again, > > > > > Gustavo Serra Scalet > > > > > > > > > > > Best regards, > > > > > > Martin > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: Gustavo Serra Scalet > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > Sent: Dienstag, 29. August 2017 22:37 > > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > > dev at openjdk.java.net' > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > > SquareToLen intrinsics > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > New changes: > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > > > > > > > > > > > Check comments below, please. > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Doerr, Martin > > > > > > > > > > > > > > 1. Sign extending offset and len Right, sign and zero > > > > > > > extending is equivalent for offset and len because they are > > > > > > > guaranteed to be >=0 (by checks in Java). But you can only > > > > > > > rely on bit 32 (IBM > > > > > > > notation) to be 0. Bit 0-31 may contain > > > > > > garbage. > > > > > > > rldicl was incorrect. My mistake, sorry for that. Correct > > > > > > > would be rldic which also clears the least significant bits. > > > > > > > len should also get fixed e.g. by replacing cmpdi by extsw_ > > > > > > > in > > > > > muladd. > > > > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't > > > > > > seem to need further changes as it's being cleared with > > > > > > clrldi, which is the same as rldic with no shift. Therefore > > > > > > it's treated appropriately as requested for "offset" > parameter. Do you agree? > > > > > > > > > > > > > 2. 
Using 8 byte instructions for int The code which feeds > > > > > > > stdu is endianess specific. Doesn't work on all > > > > > > > PPC64 platforms. > > > > > > > > > > > > You are right. The way I'm building the 64 bits of the > > > > > > register depends on which kind of endianness it is run. For > > > > > > now it works only on little endian so I'm adding a switch > > > > > > (just like I did for > > > > > > SHA) to make it available only on little endian systems. > > > > > > > > > > > > > 3.Regarding Andrew's point: Superseded by Montgomery? > > > > > > > The Montgomery change got backported to jdk8u (JDK-8150152 > > > > > > > in > > > > > 8u102). > > > > > > > I'd expect the performance improvement of these intrinsics > > > > > > > to be irrelevant for crypto.rsa. Did you measure with an > > > > > > > older jdk8 > > > > > release? > > > > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 > > > > > > 14:15:31 2017). The reported performance speedup was > > > > > > calculated by running the following test > (TestSquareToLen.java): > > > > > > import java.math.BigInteger; > > > > > > > > > > > > public class TestSquareToLen { > > > > > > > > > > > > public static void main(String args[]) throws Exception { > > > > > > > > > > > > int n = 10000000; > > > > > > if (args.length >=1) { > > > > > > n = Integer.parseInt(args[0]); > > > > > > } > > > > > > > > > > > > BigInteger b1 = new > > > > > > > > > > > > BigInteger("34893980923557359086350514982082503920002298311877320859 > > > > 99 > > > > > > 36 > > > > > > > > > > > > 7395594183801021468843071391756049207873137016631559837931214754926 > > > > 092 > > > > > > 22 > > > > > > > > > > > > 3780292110207609223272184808289336630057735969423726808520641030118 > > > > 116 > > > > > > 51 > > > > > > > > > > > > 6440180488338234823908199478965242076358579845520899779963131131540 > > > > 166 > > > > > > 68 718795349783157384006672542605760392289645528307"); > > > > > > BigInteger b2 = 
BigInteger.valueOf(0); > > > > > > BigInteger check = BigInteger.valueOf(1); > > > > > > for (int i = 0; i < n; i++) { > > > > > > b2 = b1.multiply(b1); > > > > > > if (i == 0) > > > > > > // Didn't JIT yet. Comparing against interpreted > mode > > > > > > check = b2; > > > > > > } > > > > > > if (b2.compareTo(check) == 0) > > > > > > System.out.println("Check ok!"); > > > > > > else > > > > > > System.out.println("Check failed!"); > > > > > > } > > > > > > } > > > > > > > > > > > > > > > > > > I got these results on JDK8 on my POWER8 machine: > > > > > > $ ./javac TestSquareToLen.java $ sudo perf stat -r 5 ./java > > > > > > -XX:-UseMulAddIntrinsic -XX:- UseSquareToLenIntrinsic > > > > > > TestSquareToLen Check ok! > > > > > > Check ok! > > > > > > Check ok! > > > > > > Check ok! > > > > > > Check ok! > > > > > > > > > > > > Performance counter stats for './java -XX:-UseMulAddIntrinsic > > > > > > -XX:- UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > > > > > > > 15148.009557 task-clock (msec) # 1.053 > CPUs > > > > > > utilized ( +- 0.48% ) > > > > > > 2,425 context-switches # 0.160 > K/sec > > > > > > ( +- 5.84% ) > > > > > > 356 cpu-migrations # 0.023 > K/sec > > > > > > ( +- 3.01% ) > > > > > > 5,153 page-faults # 0.340 > K/sec > > > > > > ( +- 5.22% ) > > > > > > 54,536,889,909 cycles # 3.600 > GHz > > > > > > ( +- 0.56% ) (66.68%) > > > > > > 239,554,105 stalled-cycles-frontend # 0.44% > > > frontend > > > > > > cycles idle ( +- 4.87% ) (49.90%) > > > > > > 27,683,316,001 stalled-cycles-backend # 50.76% > > > backend > > > > > > cycles idle ( +- 0.56% ) (50.17%) > > > > > > 102,020,229,733 instructions # 1.87 > insn > > > per > > > > > > cycle > > > > > > # 0.27 > > > stalled > > > > > > cycles per insn ( +- 0.14% ) (66.94%) > > > > > > 7,706,072,218 branches # 508.718 > M/sec > > > > > > ( +- 0.23% ) (50.20%) > > > > > > 456,051,162 branch-misses # 5.92% > of > > > all > > > > > > branches ( +- 0.09% ) (50.07%) > > > > > > > > > > > > 
14.390840733 seconds time elapsed ( +- 0.09% ) > > > > > > > > > > > > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic - > > > > > > XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok! > > > > > > Check ok! > > > > > > Check ok! > > > > > > Check ok! > > > > > > Check ok! > > > > > > > > > > > > Performance counter stats for './java -XX:+UseMulAddIntrinsic > > > > > > - XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > > > > > > > 11368.141410 task-clock (msec) # 1.045 > CPUs > > > > > > utilized ( +- 0.64% ) > > > > > > 1,964 context-switches # 0.173 > K/sec > > > > > > ( +- 8.93% ) > > > > > > 338 cpu-migrations # 0.030 > K/sec > > > > > > ( +- 7.65% ) > > > > > > 5,627 page-faults # 0.495 > K/sec > > > > > > ( +- 6.15% ) > > > > > > 41,100,168,967 cycles # 3.615 > GHz > > > > > > ( +- 0.50% ) (66.36%) > > > > > > 309,052,316 stalled-cycles-frontend # 0.75% > > > frontend > > > > > > cycles idle ( +- 2.84% ) (49.89%) > > > > > > 14,188,581,685 stalled-cycles-backend # 34.52% > > > backend > > > > > > cycles idle ( +- 0.99% ) (50.34%) > > > > > > 77,846,029,829 instructions # 1.89 > insn > > > per > > > > > > cycle > > > > > > # 0.18 > > > stalled > > > > > > cycles per insn ( +- 0.29% ) (66.96%) > > > > > > 8,435,216,989 branches # 742.005 > M/sec > > > > > > ( +- 0.28% ) (50.17%) > > > > > > 339,903,936 branch-misses # 4.03% > of > > > all > > > > > > branches ( +- 0.27% ) (49.90%) > > > > > > > > > > > > 10.882357546 seconds time elapsed ( +- 0.24% ) > > > > > > > > > > > > > > > > > > (out of curiosity, these numbers are 15.19s (+- 0.32%) and > > > > > > 13.42s > > > > > > (+- > > > > > > 0.53%) on JDK10) > > > > > > > > > > > > I may run for SpecJVM2008's crypto.rsa if you are interested. > > > > > > > > > > > > Thank you once again for reviewing this. 
> > > > > > > > > > > > Best regards, > > > > > > Gustavo > > > > > > > > > > > > > (I think the change is still acceptable as the intrinsics > > > > > > > could be used elsewhere and the implementation also exists > > > > > > > on other > > > > > > > platforms.) > > > > > > > > > > > > > > Best regards, > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Gustavo Serra Scalet > > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > > Sent: Mittwoch, 16. August 2017 18:50 > > > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > > > dev at openjdk.java.net' > > > > > > > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd > > > > > > > and SquareToLen intrinsics > > > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > > > Thanks for dedicated review. It took me a while to be able > > > > > > > to work on this but I hope to have your points solved. > > > > > > > Please check below the review as well as my comments quoting > your email: > > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/ > > > > > > > > > > > > > > > -----Original Message----- First of all, C2 does not > > > > > > > > perform sign extend when calling > > > stubs. > > > > > > > > The int parms need to get zero/sign extended. (Could even > > > > > > > > be done without extra instructions by replacing sldi -> > > > > > > > > rldicl, cmpdi -> extsw_ in some > > > > > > > > cases.) > > > > > > > > > > > > > > Does it make a difference on my case? > > > > > > > > > > > > > > I guess you are talking about mulAdd preparation code. The > > > > > > > only aspect I found about him is to force the cast from 32 > > > > > > > bits -> 64 bits by cleaning higher bits. Offset is a signed > > > > > > > integer but it can't be > > > > > > negative anyway. 
> > > > > > > > > > > > > > So I changed from: > > > > > > > sldi (R5_ARG3, R5_ARG3, 2); > > > > > > > > > > > > > > to: > > > > > > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > > > > > > > > > > > > > > > > macroAssembler_ppc.cpp: > > > > > > > > - Indentation should be 2 spaces. > > > > > > > > > > > > > > Done > > > > > > > > > > > > > > > > > > > > > > stubGenerator_ppc:cpp: > > > > > > > > - or_, addi_ should get replaced by orr, addi when CR0 > > > > > > > > result is not needed. > > > > > > > > > > > > > > Done > > > > > > > > > > > > > > > - Where is lplw initialized? > > > > > > > > > > > > > > It should be initialized with 0, I missed that... > > > > > > > > > > > > > > > - I believe that the updating load/store instructions e.g. > > > > > > > > lwzu don't perform well on some processors. At least using > > > > > > > > stwu 2 times in the loop doesn't make sense. > > > > > > > > > > > > > > You are right. I could manipulate the bits differently and > > > > > > > ended up with a single stdu in the loop. Neat! Although I > > > > > > > could not reduce the total number of instructions. > > > > > > > > > > > > > > > - Note: It should be possible to use 8 byte instead of 4 > > > > > > > > byte > > > > > > > > instructions: MacroAssembler::multiply64, addc, adde. But > > > > > > > > I'm not requesting to change that because I guess it would > > > > > > > > make the code very complicated, especially when supporting > > > > > > > > both endianess > > > > > > versions. > > > > > > > > > > > > > > Yes, that would require a new analysis on this code. May we > > > > > > > consider it next? As you said, I prefer having an initial > > > > > > > version that looks as simple as the original java code. > > > > > > > > > > > > > > > - The squareToLen stub implementation is very close the > > > > > > > > Java implementation. 
So it'd be interesting to understand > > > > > > > > what C2 doesn't do as well as the hand written assembly > > > > > > > > code. Do you know that? (Not absolutely necessary for > > > > > > > > accepting this change as long as the stub is measurably > > > > > > > > faster.) > > > > > > > > > > > > > > I don't know either. Basically I chose doing it because I > > > > > > > noticed some performance gain on SpecJVM2008 when analyzing > > X64. > > > > > > > Then, taking a closer look, I didn't notice any AVX or some > > > > > > > special instructions on > > > > > > > X64 so I decided to try it on ppc64 by using some basic > > > assembly. > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra > > > > > > > > Scalet > > > > > > > > Sent: Donnerstag, 10. August 2017 19:22 > > > > > > > > To: 'hotspot-compiler-dev at openjdk.java.net' > > > > > > > > > > > > > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd > > and > > > > > > > > SquareToLen intrinsics > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev- > > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra > > > > > > > > Scalet > > > > > > > > Sent: ter?a-feira, 8 de agosto de 2017 17:19 > > > > > > > > To: ppc-aix-port-dev at openjdk.java.net > > > > > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > > > > SquareToLen intrinsics > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > Could you please review this specific PPC64 change to > hotspot? 
> > > > > > > > By implementing these intrinsics I noticed a small > > > > > > > > improvement with microbenchmarks analysis. On > > > > > > > > SpecJVM2008's crypto.rsa benchmark, only when backporting > > > > > > > > to JDK8 an improvement was > > > noticed. > > > > > > > > > > > > > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976 > > > > > > > > Webrev: https://gut.github.io/openjdk/webrev/JDK- > > > > 8185976/webrev/ > > > > > > > > > > > > > > > > Motivation for this implementation: > > > > > > > > https://twitter.com/ijuma/status/698309312498835457 > > > > > > > > > > > > > > > > Best regards, > > > > > > > > Gustavo Serra Scalet From dmitrij.pochepko at bell-sw.com Wed Sep 6 17:39:13 2017 From: dmitrij.pochepko at bell-sw.com (Dmitrij) Date: Wed, 6 Sep 2017 20:39:13 +0300 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> Message-ID: <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> On 06.09.2017 15:43, Andrew Haley wrote: > On 06/09/17 12:50, Dmitrij wrote: >> >> On 06.09.2017 12:53, Andrew Haley wrote: >>> On 05/09/17 18:34, Dmitrij Pochepko wrote: >>>> As you can see, it's up to 26% worse throughput with wider multiplication. >>>> >>>> The reasons for this is: >>>> 1. mulAdd uses 32-bit multiplier (unlike multiplyToLen intrinsic) and it >>>> can?t be changed within the function signature. Thus we can?t fully >>>> utilize the potential of 64-bit multiplication. >>>> 2. umulh instruction is more expensive than mul instruction. >>> Ah, my apologies. I wasn't thinking about mulAdd, but about >>> squareToLen(). But did you look at the way x86 uses 64-bit >>> multiplications? >>> >> Yes. 
>> It uses a single x86 mulq instruction which performs a 64x64
>> multiplication and places the 128-bit result in 2 registers. There is no
>> such single instruction on aarch64 and the most effective aarch64
>> instruction sequence I've found doesn't seem to be as fast as mulq.
> I think there is effectively a 64x64 -> 128-bit instruction: it's just
> that you have to represent it as a mul and a umulh. But I take your
> point.
>
>>> One other thing I
>>> haven't checked: is the multiplyToLen() intrinsic called when
>>> squareToLen() is absent?
>>>
>> It could have been a good alternative, but it's not used instead of
>> squareToLen when squareToLen is not implemented. A java implementation
>> of squareToLen will be eventually compiled and used instead:
>> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039
> Please compare your squareToLen with the
> MacroAssembler::multiply_to_len we already have.
>

I've compared it by calling the square and multiply methods and got the following results (ThunderX):

Benchmark                                 (size, ints)  Mode  Cnt      Score      Error  Units
BigIntegerBench.implMutliplyToLenReflect             1  avgt    5    186.930 ±   14.933  ns/op  (26% slower)
BigIntegerBench.implMutliplyToLenReflect             2  avgt    5    194.095 ±   11.857  ns/op  (12% slower)
BigIntegerBench.implMutliplyToLenReflect             3  avgt    5    233.912 ±    4.229  ns/op  (24% slower)
BigIntegerBench.implMutliplyToLenReflect             5  avgt    5    308.349 ±   20.383  ns/op  (22% slower)
BigIntegerBench.implMutliplyToLenReflect            10  avgt    5    475.839 ±    6.232  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            50  avgt    5   6514.691 ±   76.934  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            90  avgt    5  20347.040 ±  224.290  ns/op  (3% slower)
BigIntegerBench.implMutliplyToLenReflect           127  avgt    5  41929.302 ±  181.053  ns/op  (9% slower)
BigIntegerBench.implSquareToLenReflect               1  avgt    5    147.751 ±   12.760  ns/op
BigIntegerBench.implSquareToLenReflect               2  avgt    5    173.804 ±    4.850  ns/op
BigIntegerBench.implSquareToLenReflect               3  avgt    5    187.822 ±   34.027  ns/op
BigIntegerBench.implSquareToLenReflect               5  avgt    5    251.995 ±   19.711  ns/op
BigIntegerBench.implSquareToLenReflect              10  avgt    5    474.489 ±    1.040  ns/op
BigIntegerBench.implSquareToLenReflect              50  avgt    5   6493.768 ±   33.809  ns/op
BigIntegerBench.implSquareToLenReflect              90  avgt    5  19766.524 ±   88.398  ns/op
BigIntegerBench.implSquareToLenReflect             127  avgt    5  38448.202 ±  180.095  ns/op

As we can see, squareToLen is faster than multiplyToLen.

(I've updated the benchmark code at http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)

Thanks,
Dmitrij

From gromero at linux.vnet.ibm.com Thu Sep 7 00:28:34 2017
From: gromero at linux.vnet.ibm.com (Gustavo Romero)
Date: Wed, 6 Sep 2017 21:28:34 -0300
Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
In-Reply-To:
References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com>
Message-ID: <59B092B2.2040704@linux.vnet.ibm.com>

Hi Martin,

On 01-09-2017 12:39, Doerr, Martin wrote:
> It'd also be good to know, if relying on vrsave=-1 is safe.

VRSAVE is set to -1 in kernelspace on a VEC or a VSX unavailable exception, in
load_up_altivec(), arch/powerpc/kernel/vector.S [1]:

 51 /*
 52  * While userspace in general ignores VRSAVE, glibc uses it as a boolean
 53  * to optimise userspace context save/restore. Whenever we take an
 54  * altivec unavailable exception we must set VRSAVE to something non
 55  * zero. Set it to all 1s. See also the programming note in the ISA.
 56  */
 57         mfspr   r4,SPRN_VRSAVE
 58         cmpwi   0,r4,0
 59         bne+    1f
 60         li      r4,-1
 61         mtspr   SPRN_VRSAVE,r4

All program images are created with MSR_VEC and MSR_VSX disabled (set to zero) and VRSAVE set to zero as well.
However, on the first execution of a vector (VMX/Altivec) or a VSX (Vector-Scalar) instruction an exception is raised and the exception code path calls load_up_altivec() that will set VRSAVE=-1 if it's equal to zero (load_up_vsx() calls load_up_altivec()). The check on lines 58 and 59 guarantees that if a userspace program desires to set VRSAVE it can freely set the VRSAVE and on a new VEC /VSX exception VRSAVE value won't be clobbed (set again to -1) and will stay as user set it (a new exception can occur if a sufficient amount of context switches happen and MSR_VEC and MSR_VSX bits get disabled as part of kernel's mechanism to avoid the burden of saving/restoring the vec, fp, and vsx registers on context switches if they are not in use by the a program). For instance, a simple program like that compiled with 'gcc -static -O0 vrsave_.c -o vrsave_': vrsave_.c: int main() {} will trigger VRSAVE=-1 in kernelspace once it executes a 'stvx' VSX instruction: in __vmx_sigsetjmp(): $ gdb -q ./vrsave_ (gdb) x/i 0x1000db20 0x1000db20 <__vmx__sigsetjmp+424>: stvx v20,0,r5 The address 0x1000db20 can be determined by a crafted systemtap probe [2], for instance: start_thread.return : vrsave=0x0 start=0x10000840 load_up_altivec.call : trap=0xf21 nip=0x1000db20 vrsave=0x0 0xc00000000000c7c0 : load_up_altivec+0x0/0x164 [kernel] 0x0 (inexact) load_up_altivec.return: trap=0xf21 nip=0x1000db20 vrsave=0xffffffff Returning from: 0xc00000000000c7c0 : load_up_altivec+0x0/0x164 [kernel] Returning to : 0xc000000000009c78 : altivec_unavailable_common+0xf8/0x150 [kernel] 0x0 (inexact) __switch_to().call : vrsave=0xffffffff, nip=0x1002c4bc trap=0xf21 means a "Vector Unavailable" [3] exception was taken. 
-- For a non-static linkage we get almost the same, but a VSX unavailable exception (trap=0xf41), caused by a 'xxspltd' in memset() (0x7fff8745d428): start_thread.return : vrsave=0x0 start=0x7fff874312e0 load_up_altivec.call : trap=0xf41 nip=0x7fff8745d428 vrsave=0x0 0xc00000000000c7c0 : load_up_altivec+0x0/0x164 [kernel] 0x0 (inexact) load_up_altivec.return: trap=0xf41 nip=0x7fff8745d428 vrsave=0xffffffff Returning from: 0xc00000000000c7c0 : load_up_altivec+0x0/0x164 [kernel] Returning to : 0xc00000000000ca5c : load_up_vsx+0x10/0x2c [kernel] 0x0 (inexact) __switch_to.call : vrsave=0xffffffff, nip=0x100005ac (gdb) x/i 0x7fff8745d428 0x7fff8745d428 : xxspltd vs0,vs12,0 (gdb) bt #0 memset (dstpp=0x7fffffffeec0, c=0, len=640) at ../string/memset.c:29 #1 0x00007ffff7fa24d4 in _dl_start (arg=0x7ffffffff410) at rtld.c:373 #2 0x00007ffff7fa12f8 in _start () from /lib64/ld64.so.2 -- Setting VRSAVE=-1 in load_up_altivec() dates back to kernel v2.6.31-rc1 at least accordingly to 'git tag --contains e821ea70f', hence it's pretty old. Martin, please let me know if you need any additional information to check if we can rely on vrsave=-1 and so forget about taking care of vrsave in the JVM also for Linux BE. Best regards, Gustavo [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/kernel/vector.S#L52 [2] http://cr.openjdk.java.net/~gromero/script.d [3] ISA 3.0. 
Figure 68, "Effective address of interrupt vector by interrupt type" From tobias.hartmann at oracle.com Thu Sep 7 07:06:20 2017 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 7 Sep 2017 09:06:20 +0200 Subject: escape analysis friendly very small objects In-Reply-To: <660539323.2919871.1504614953928@mail.yahoo.com> References: <660539323.2919871.1504614953928.ref@mail.yahoo.com> <660539323.2919871.1504614953928@mail.yahoo.com> Message-ID: <9a5d9085-c84d-38b5-059c-3bffc49d8e07@oracle.com> Hi Andy, I think the problem you are explaining is really a use case for value types: http://cr.openjdk.java.net/~jrose/values/shady-values.html We are currently working on Minimal Value Types and have an early prototype available that you can try out: http://mail.openjdk.java.net/pipermail/valhalla-dev/2017-August/003091.html If you declare your classes as value types, C2 will keep the fields in registers or on the stack whenever possible (especially in case (1) that you described). We don't need to rely on escape analysis which, as Andrew mentioned, is very limited. Case (2) is also solved by value types with flattening of value type fields into the holder class. Best regards, Tobias On 05.09.2017 14:35, Andy Nuss wrote: > I have a variety of classes which just contain a couple scalar primitives and possibly a reference to an object that is > clearly on the heap, all of these small number of data members are private and all of them mutable. Much like the Point > class in Brian Goetz's article on escape analysis on whether the vm uses the stack or heap for class instances. > > There are two use cases for these objects: (1) create with new such a small object as a local variable and use it > thriughout the function, mutating it, etc. (2) create a class with lots of these small objects as private or possibly > protected members. 
> > The hope in the first case is under what conditions will hotspot put the object's 2 or 3 fields onto the stack and > ideally without the hidden headers needed to make it a heap object. hashCode, equals, and clone are not implemented or > used if that is important. I.e. will hotspot ever make it a simple C++ like object on the stack. Does it help if I do > defensive copy if I return one of these small objects from the function? > > In the second case, my hope is that for a class that contains 5 such small objects, and would definitely be on the heap, > hotspot would be smart enough not to create 5 small objects on the heap and then 5 references to them in the containing > class. Instead, it would explode into the containing class ala C++ just the 2 or 3 primitive datamembers of the object, > and again, ideally without headers. From goetz.lindenmaier at sap.com Thu Sep 7 07:54:41 2017 From: goetz.lindenmaier at sap.com (Lindenmaier, Goetz) Date: Thu, 7 Sep 2017 07:54:41 +0000 Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic In-Reply-To: <59A9850B.7030302@linux.vnet.ibm.com> References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59A9850B.7030302@linux.vnet.ibm.com> Message-ID: <3e64d11f046f44379f9658dffc766a45@sap.com> Hi, I had a look at this change. Martin, you missed a ')' in IntrinsicPredicates.java. Combined with the multiplyToLen change, stub codebuffer space runs out. Please increase code_size2 = 20000 to 22000 in stubRoutines_ppc.hpp. I see TestSHA.java failing on linuxppc64le. Also, other tests are failing with SHA-256 digest error ... Also, on aix, some of our internal tests are failing. These didn't run on linuxppc64 on a Power8 machine, so it might fail there, too. 
But on the big endian platforms, the jtreg tests don't fail. @Gustavo, maybe you can have a look at the issues on linuxppc64le and post a new webrev. Then Martin can fix the remaining issue on big endian. Best regards, Goetz. > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > bounces at openjdk.java.net] On Behalf Of Gustavo Romero > Sent: Freitag, 1. September 2017 18:04 > To: Doerr, Martin ; Gustavo Serra Scalet > > Cc: 'hotspot-compiler-dev at openjdk.java.net' dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Martin! > > On 01-09-2017 12:39, Doerr, Martin wrote: > > Hi Gustavos, > > > > I have managed to upload a version which seems to work on both > endianness implementations. > > At least some quick tests have passed on AIX and Big Endian linux in > addition to Little Endian linux. > > Great! :-) > > > > I'll be out next week, but the change looks ok for me. Please let me know if > the changed version still looks ok for you, too. Feel free to overwork or > improve it. > > It'd also be good to know, if relying on vrsave=-1 is safe. > > Sure, Martin. I'm chasing what's exactly setting vrsave=-1 and the full history > log (looks like it's not in the kernel, but I'm checking yet). > > > > Is the copyright information ok? Did you get source code which requires to > be mentioned in the comments? > > The code looks similar to a reference implementation, so the authors of it > may want to be mentioned? > > Or did you just use the paper for implementing it? In this case, I'd mention > the paper. > > Gustavo S: the information on the paper must be updated accordingly as > Martin noted in the new webrev. There is none currently. > > > > After we got a second review and ran more tests, we can ask somebody > from Oracle to push it. > > > > Thanks for contributing and your support, > > Martin > > Thanks a lot for reviewing and for all the help. 
> > Regards, > Gustavo R > > > > > -----Original Message----- > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > bounces at openjdk.java.net] On Behalf Of Doerr, Martin > > Sent: Donnerstag, 31. August 2017 18:21 > > To: Gustavo Romero > > Cc: 'hotspot-compiler-dev at openjdk.java.net' dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > Hi Gustavo R, > > > > I guess you're right. vrsave is already set to -1, so all Vector Registers get > saved. > > It'd be good to know where it is set (OS, Flag in ELF header, ???) and if this > is guaranteed. > > I don't want to risk getting sporadic errors on some OS versions. > > > > I'd like to enable SHA intrinsics on linux BE as well. I already managed to get > the 256 bit version working (was quite some work!). > > > > Thanks and best regards, > > Martin > > > > > > -----Original Message----- > > From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] > > Sent: Freitag, 25. August 2017 22:35 > > To: Doerr, Martin > > Cc: Gustavo Serra Scalet ; 'hotspot- > compiler-dev at openjdk.java.net' dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > Hi Martin, > > > > On 25-08-2017 13:18, Doerr, Martin wrote: > >> I think you didn't get my point about AIX. > >> Your current version doesn't break AIX, but it lacks SHA2 acceleration for > AIX on Power 8 and newer, which is still relevant. > >> So I'd like to ask you kindly to take a look if Big Endian support for the stub > could be added without high effort. AIX doesn't need VRSAVE handling (like > Little Endian linux, unlike Big Endian linux), so a few lines in the stub could > possibly be enough. I can assist with testing. > > > > I don't think that VRSAVE is handled on Linux, even on BE. 
Although BE ABI > [1] > > says: > > > > "Functions must ensure that the appropriate bits in the vrsave register are > set for any vector registers they use" > > > > and LE ABI does not say that, even on Linux BE VRSAVE is not in effect > > used to determine which vector registers (VMX/Altivec) should be > saved/restored. > > No application uses it on Linux, so I would say that VRSAVE is ignored on > Linux > > completely both on BE and LE. save/restore library interfaces don't pay > > attention to it in glibc: VRSAVE is just saved/restored completely in > mechanisms > > of swap/get/setcontext(), set/longjump(), and dl-trampoline() and that's > all. I > > checked that with toolchain folks and they agree. We've already discussed > that a > > long time ago but at that time I was just using the vector-scalar registers [2] > > and at that time I agreed that if VMX/Altivec was in use instead of the VSX > so > > VRSAVE should be handled accordingly. But I have a different opinion > now... > > > > I'm wondering if something would really break on Linux BE if we forget > about > > VRSAVE at all in the JVM. If not, we could forget about VRSAVE forever on > Linux. > > Looks like VRSAVE was sort of born to the oblivion... ? 
> >
> >
> > Kind regards,
> > Gustavo
> >
> > [1] http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
> > [2] http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002508.html
> >

From aph at redhat.com Fri Sep 8 14:38:11 2017
From: aph at redhat.com (Andrew Haley)
Date: Fri, 8 Sep 2017 15:38:11 +0100
Subject: multiplyHigh?
Message-ID:

I notice that Math.multiplyHigh(long, long) doesn't use a C2 intrinsic,
even on machines with appropriate C2 patterns. So it's rather slow.

JDK-5100935, No way to access the 64-bit integer multiplication of
64-bit CPUs efficiently, is closed, even though there is still no
efficient way to do this. Is writing an intrinsic for multiplyHigh on
someone's to-do list? I see no bug for it.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Fri Sep 8 15:17:46 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij)
Date: Fri, 8 Sep 2017 18:17:46 +0300
Subject: multiplyHigh?
In-Reply-To:
References:
Message-ID: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com>

Hi Andrew,

I support the idea of intrinsifying the Math.multiplyHigh method: it seems like it will be very effective to implement, at least on aarch64 and x86_64, using umulh and mulq instructions.

Vladimir, what do you think?

If there are no other volunteers I'd be happy to do this.

Enhancing java.lang.Math intrinsics situation for aarch64 port is on my todo list.
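
For readers following along, here is a rough sketch of what such an intrinsic would have to replace. It is not HotSpot code and not taken from any webrev: the class and method names (MultiplyHighSketch, portable, reference) are made up for illustration. It computes the signed high 64 bits of a 64x64 product from 32-bit halves, which is the kind of work a pure-Java fallback has to do, and checks it against a BigInteger reference. A single hardware multiply-high instruction would stand in for the whole body of portable().

```java
import java.math.BigInteger;

public class MultiplyHighSketch {

    // Hypothetical helper: signed high 64 bits of the 128-bit product x * y,
    // built from four 32x32 partial products.
    public static long portable(long x, long y) {
        long x1 = x >> 32;           // signed high half of x
        long x2 = x & 0xFFFFFFFFL;   // unsigned low half of x
        long y1 = y >> 32;           // signed high half of y
        long y2 = y & 0xFFFFFFFFL;   // unsigned low half of y

        long z2 = x2 * y2;                       // low x low (treated as unsigned)
        long t  = x1 * y2 + (z2 >>> 32);         // high x low plus carry from z2
        long z1 = (t & 0xFFFFFFFFL) + x2 * y1;   // middle 32 bits plus low x high
        long z0 = t >> 32;                       // signed carry into the high word
        return x1 * y1 + z0 + (z1 >> 32);
    }

    // Slow but obviously correct reference: do the full multiplication in
    // BigInteger and keep the bits above position 63 (arithmetic shift).
    public static long reference(long x, long y) {
        return BigInteger.valueOf(x)
                         .multiply(BigInteger.valueOf(y))
                         .shiftRight(64)
                         .longValue();
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, 123456789L, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long x : samples) {
            for (long y : samples) {
                if (portable(x, y) != reference(x, y)) {
                    throw new AssertionError("mismatch for " + x + " * " + y);
                }
            }
        }
        System.out.println("all samples match");
    }
}
```

Note that Math.multiplyHigh is a signed operation, so which machine instruction an intrinsic would actually pick is a separate question from this sketch.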
Thanks, Dmitrij On 08.09.2017 17:38, Andrew Haley wrote: > I notice that Math.multiplyHigh(long, long) doesn't use a C2 intrinsic, > even on machines with appropriate C2 patterns. So it's rather slow. > > JDK-5100935, No way to access the 64-bit integer multiplication of > 64-bit CPUs efficiently, is closed, even though there is still no > efficient way to do this. Is writing an intrinsic for multiplyHigh on > someone's to-do list? I see no bug for it. > From stuart.monteith at linaro.org Fri Sep 8 15:53:40 2017 From: stuart.monteith at linaro.org (Stuart Monteith) Date: Fri, 8 Sep 2017 16:53:40 +0100 Subject: RFR(XL/M) : 8178788: wrap JCStress test suite as jtreg tests In-Reply-To: <342c3748-616f-8a7d-74f4-3ce929b1e0dc@redhat.com> References: <9A2C94EA-89A3-4C75-9D3C-51E058BD8A1D@oracle.com> <342c3748-616f-8a7d-74f4-3ce929b1e0dc@redhat.com> Message-ID: Hello, I've spent some time on this, and I have to admit that I'm stumped. I get exactly the same errors on x86 on jdk10/hs and jdk10/jdk10 with arecent build of JTReg and JT_HOME set appropriately. Are there any pointers on how this is supposed to be run? Thanks, Stuart On 25 April 2017 at 11:47, Aleksey Shipilev wrote: > On 04/19/2017 12:12 AM, Igor Ignatyev wrote: > > http://cr.openjdk.java.net/~iignatyev//8178788/webrev.00/index.html > >> 69903 lines changed: 69903 ins; 0 del; 0 mod; > > (69524 lines are generated) > > > > Hi all, > > > > could you please review this patch which adds a jtreg test wrapper for > > jcstress test suite and jtreg tests which run jsctress tests thru this > > wrapper? > > > > webrev: http://cr.openjdk.java.net/~iignatyev//8178788/webrev.00/ > index.html > > JBS: https://bugs.openjdk.java.net/browse/JDK-8178788 testing: > > TL;DR: This patch introduces more problems than it solves. Just run the > jcstress > tests-all JAR against the tested runtime. 
>
> Wrapping jcstress tests with jtreg defies the purpose of jcstress harness --
> that is, running lots of tests as fast as it possibly could without affecting
> testing quality. For example, by cleverly reusing VMs between the tests, using
> Whitebox to deoptimize without restarting the VMs, etc. It really wastes CPU
> time to run each test in isolation.
>
> Also, it does not "automatically" work, which defies "easy to run" goal:
>
> Caused by: java.io.FileNotFoundException: Couldn't automatically resolve
> dependency for jcstress-tests-all , revision 0.3
> Please specify the location using jdk.test.lib.artifacts.jcstress-tests-all
>     at jdk.test.lib.artifacts.DefaultArtifactManager.resolve(DefaultArtifactManager.java:37)
>     at jdk.test.lib.artifacts.ArtifactResolver.resolve(ArtifactResolver.java:54)
>     at applications.jcstress.JcstressRunner.pathToArtifact(JcstressRunner.java:53)
>     ... 8 more
>
> Okay, brilliant! How do I configure this, if I run "make test"?
>
> CONF=linux-x86_64-normal-server-release LOG=info make test TEST="hotspot_all"
>
>
> -Aleksey
>
>

From vladimir.kozlov at oracle.com Sat Sep 9 03:22:04 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Fri, 8 Sep 2017 20:22:04 -0700
Subject: multiplyHigh?
In-Reply-To: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com>
References: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com>
Message-ID: <70ccb5fe-8c18-21fc-e6ad-ca64571eff88@oracle.com>

On 9/8/17 8:17 AM, Dmitrij wrote:
> Hi Andrew,
>
> I support the idea of intrinsifying Math.multiplyHigh method: it seems like it will be very effective to implement at
> least on aarch64 and x86_64 using umulh and mulq instructions.
>
> Vladimir, what do you think?

Nobody asked to intrinsify these methods before because they are new in JDK 9 I think.

> If there are no other volunteers I'd be happy to do this.
Yes, please do that for both: aarch64 and x64. Add a jtreg hotspot test too, please.

Thanks,
Vladimir

>
> Enhancing java.lang.Math intrinsics situation for aarch64 port is on my todo list.
>
> Thanks,
> Dmitrij
>
> On 08.09.2017 17:38, Andrew Haley wrote:
>> I notice that Math.multiplyHigh(long, long) doesn't use a C2 intrinsic,
>> even on machines with appropriate C2 patterns. So it's rather slow.
>>
>> JDK-5100935, No way to access the 64-bit integer multiplication of
>> 64-bit CPUs efficiently, is closed, even though there is still no
>> efficient way to do this. Is writing an intrinsic for multiplyHigh on
>> someone's to-do list? I see no bug for it.
>>
>

From martin.doerr at sap.com Mon Sep 11 14:42:02 2017
From: martin.doerr at sap.com (Doerr, Martin)
Date: Mon, 11 Sep 2017 14:42:02 +0000
Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
In-Reply-To: <59B092B2.2040704@linux.vnet.ibm.com>
References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59B092B2.2040704@linux.vnet.ibm.com>
Message-ID: <8e96fdb821244f798a40c4155d9c40a6@sap.com>

Hi Gustavo,

thank you very much for checking. It's good to know that all supported linux versions set VRSAVE to -1 so I'm convinced that it's safe to use Altivec on Linux BE, too. I'll update the webrev.

Best regards,
Martin

-----Original Message-----
From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com]
Sent: Thursday, 7 September 2017 02:29
To: Doerr, Martin
Cc: Gustavo Serra Scalet ; 'hotspot-compiler-dev at openjdk.java.net' ; ppc-aix-port-dev at openjdk.java.net
Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic

Hi Martin,

On 01-09-2017 12:39, Doerr, Martin wrote:
> It'd also be good to know, if relying on vrsave=-1 is safe.

[snip]

Martin, please let me know if you need any additional information to check if we can rely on vrsave=-1 and so forget about taking care of vrsave in the JVM also for Linux BE. Best regards, Gustavo [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/kernel/vector.S#L52 [2] http://cr.openjdk.java.net/~gromero/script.d [3] ISA 3.0. Figure 68, "Effective address of interrupt vector by interrupt type" From gromero at linux.vnet.ibm.com Mon Sep 11 15:03:29 2017 From: gromero at linux.vnet.ibm.com (Gustavo Romero) Date: Mon, 11 Sep 2017 12:03:29 -0300 Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic In-Reply-To: <8e96fdb821244f798a40c4155d9c40a6@sap.com> References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59B092B2.2040704@linux.vnet.ibm.com> <8e96fdb821244f798a40c4155d9c40a6@sap.com> Message-ID: <59B6A5C1.3020200@linux.vnet.ibm.com> Hi Martin, On 11-09-2017 11:42, Doerr, Martin wrote: > thank you very much for checking. It's good to know that all supported linux versions set VRSAVE to -1 so I'm convinced that it's safe to use Altivec on Linux BE, too. I'll update the webrev. You're welcome. I hope that simplifies things a little bit. Best regards, Gustavo > -----Original Message----- > From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] > Sent: Donnerstag, 7. September 2017 02:29 > To: Doerr, Martin > Cc: Gustavo Serra Scalet ; 'hotspot-compiler-dev at openjdk.java.net' ; ppc-aix-port-dev at openjdk.java.net > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Martin, > > On 01-09-2017 12:39, Doerr, Martin wrote: >> It'd also be good to know, if relying on vrsave=-1 is safe. 
> > VRSAVE is set to -1 in kernelspace on a VEC or a VSX unavailable exception, in > load_up_altivec(), arch/powerpc/kernel/vector.S [1]: > > 51 /* > 52 * While userspace in general ignores VRSAVE, glibc uses it as a boolean > 53 * to optimise userspace context save/restore. Whenever we take an > 54 * altivec unavailable exception we must set VRSAVE to something non > 55 * zero. Set it to all 1s. See also the programming note in the ISA. > 56 */ > 57 mfspr r4,SPRN_VRSAVE > 58 cmpwi 0,r4,0 > 59 bne+ 1f > 60 li r4,-1 > 61 mtspr SPRN_VRSAVE,r4 > > All program images are created with MSR_VEC and MSR_VSX disabled (set to zero) > and VRSAVE set to zero as well. However, on the first execution of a vector > (VMX/Altivec) or a VSX (Vector-Scalar) instruction an exception is raised and > the exception code path calls load_up_altivec() that will set VRSAVE=-1 if it's > equal to zero (load_up_vsx() calls load_up_altivec()). > > The check on lines 58 and 59 guarantees that if a userspace program desires to > set VRSAVE it can freely set the VRSAVE and on a new VEC /VSX exception VRSAVE > value won't be clobbed (set again to -1) and will stay as user set it (a new > exception can occur if a sufficient amount of context switches happen and > MSR_VEC and MSR_VSX bits get disabled as part of kernel's mechanism to avoid the > burden of saving/restoring the vec, fp, and vsx registers on context switches if > they are not in use by the a program). 
>
> For instance, a simple program like that compiled with
> 'gcc -static -O0 vrsave_.c -o vrsave_':
>
> vrsave_.c:
>
> int main() {}
>
> will trigger VRSAVE=-1 in kernelspace once it executes a 'stvx' VSX instruction:
> in __vmx_sigsetjmp():
>
> $ gdb -q ./vrsave_
> (gdb) x/i 0x1000db20
>    0x1000db20 <__vmx__sigsetjmp+424>:   stvx    v20,0,r5
>
> The address 0x1000db20 can be determined by a crafted systemtap probe [2], for
> instance:
>
> start_thread.return   : vrsave=0x0 start=0x10000840
> load_up_altivec.call  :
>  trap=0xf21 nip=0x1000db20 vrsave=0x0
>  0xc00000000000c7c0 : load_up_altivec+0x0/0x164 [kernel]
>  0x0 (inexact)
> load_up_altivec.return:
>  trap=0xf21 nip=0x1000db20 vrsave=0xffffffff
>  Returning from: 0xc00000000000c7c0 : load_up_altivec+0x0/0x164 [kernel]
>  Returning to  : 0xc000000000009c78 : altivec_unavailable_common+0xf8/0x150 [kernel]
>  0x0 (inexact)
> __switch_to().call    : vrsave=0xffffffff, nip=0x1002c4bc
>
> trap=0xf21 means a "Vector Unavailable" [3] exception was taken.
> [snip]

From martin.doerr at sap.com Mon Sep 11 17:05:43 2017
From: martin.doerr at sap.com (Doerr, Martin)
Date: Mon, 11 Sep 2017 17:05:43 +0000
Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
In-Reply-To: <3e64d11f046f44379f9658dffc766a45@sap.com>
References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59A9850B.7030302@linux.vnet.ibm.com> <3e64d11f046f44379f9658dffc766a45@sap.com>
Message-ID: <670dd284fe77479986abe75aca42b20a@sap.com>

Hi Götz and Gustavo,

I had just posted the version I had before leaving. Thanks for your feedback.
New webrev is here:
http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.03/

Changes to webrev.02:
- Referenced paper
- Factored out endianness specific vector permute instructions (vec_perm with only 3 parms to reduce risk of mixing them up)
- Removed code for PPC64 platforms which didn't support it
- code_size2 = 22000
- Added missing ')' in IntrinsicPredicates.java

My changes shouldn't change the behavior of the little endian implementation.
We have to check whether any tests still fail and, if so, which ones. Are there any updates on this?

Best regards,
Martin


-----Original Message-----
From: Lindenmaier, Goetz
Sent: Thursday, 7 September 2017 09:55
To: Gustavo Romero; Doerr, Martin; Gustavo Serra Scalet
Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic

Hi,

I had a look at this change.

Martin, you missed a ')' in IntrinsicPredicates.java.

Combined with the multiplyToLen change, stub codebuffer space runs out.
Please increase
  code_size2 = 20000
to 22000 in stubRoutines_ppc.hpp.

I see TestSHA.java failing on linuxppc64le.
Also, other tests are failing with SHA-256 digest error ...

Also, on aix, some of our internal tests are failing. These didn't run on linuxppc64 on a Power8 machine, so it might fail there, too. But on the big endian platforms, the jtreg tests don't fail.

@Gustavo, maybe you can have a look at the issues on linuxppc64le and post a new webrev. Then Martin can fix the remaining issue on big endian.

Best regards,
Goetz.

> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Gustavo Romero
> Sent: Friday, 1 September 2017 18:04
> To: Doerr, Martin; Gustavo Serra Scalet
> Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
>
> Hi Martin!
>
> On 01-09-2017 12:39, Doerr, Martin wrote:
> > Hi Gustavos,
> >
> > I have managed to upload a version which seems to work on both endianness implementations.
> > At least some quick tests have passed on AIX and Big Endian linux in addition to Little Endian linux.
>
> Great! :-)
>
> > I'll be out next week, but the change looks ok for me. Please let me know if the changed version still looks ok for you, too. Feel free to rework or improve it.
> > It'd also be good to know, if relying on vrsave=-1 is safe.
>
> Sure, Martin. I'm chasing what's exactly setting vrsave=-1 and the full history
> log (looks like it's not in the kernel, but I'm checking yet).
>
> > Is the copyright information ok? Did you get source code which requires to be mentioned in the comments?
> > The code looks similar to a reference implementation, so the authors of it may want to be mentioned?
> > Or did you just use the paper for implementing it? In this case, I'd mention the paper.
>
> Gustavo S: the information on the paper must be updated accordingly as
> Martin noted in the new webrev. There is none currently.
>
> > After we got a second review and ran more tests, we can ask somebody from Oracle to push it.
> >
> > Thanks for contributing and your support,
> > Martin
>
> Thanks a lot for reviewing and for all the help.
>
> Regards,
> Gustavo R
>
> > -----Original Message-----
> > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Doerr, Martin
> > Sent: Thursday, 31 August 2017 18:21
> > To: Gustavo Romero
> > Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
> >
> > Hi Gustavo R,
> >
> > I guess you're right. vrsave is already set to -1, so all Vector Registers get saved.
> > It'd be good to know where it is set (OS, Flag in ELF header, ???) and if this is guaranteed.
> > I don't want to risk getting sporadic errors on some OS versions.
> >
> > I'd like to enable SHA intrinsics on linux BE as well. I already managed to get the 256 bit version working (was quite some work!).
> >
> > Thanks and best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com]
> > Sent: Friday, 25 August 2017 22:35
> > To: Doerr, Martin
> > Cc: Gustavo Serra Scalet; 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
> >
> > Hi Martin,
> >
> > On 25-08-2017 13:18, Doerr, Martin wrote:
> >> I think you didn't get my point about AIX.
> >> Your current version doesn't break AIX, but it lacks SHA2 acceleration for AIX on Power 8 and newer, which is still relevant.
> >> So I'd like to ask you kindly to take a look if Big Endian support for the stub could be added without high effort. AIX doesn't need VRSAVE handling (like Little Endian linux, unlike Big Endian linux), so a few lines in the stub could possibly be enough. I can assist with testing.
> >
> > I don't think that VRSAVE is handled on Linux, even on BE. Although BE ABI [1]
> > says:
> >
> > "Functions must ensure that the appropriate bits in the vrsave register are set for any vector registers they use"
> >
> > and LE ABI does not say that, even on Linux BE VRSAVE is not in effect
> > used to determine which vector registers (VMX/Altivec) should be saved/restored.
> > No application uses it on Linux, so I would say that VRSAVE is ignored on Linux
> > completely both on BE and LE. save/restore library interfaces don't pay
> > attention to it in glibc: VRSAVE is just saved/restored completely in mechanisms
> > of swap/get/setcontext(), set/longjmp(), and dl-trampoline() and that's all. I
> > checked that with toolchain folks and they agree.
> > We've already discussed that a
> > long time ago but at that time I was just using the vector-scalar registers [2]
> > and at that time I agreed that if VMX/Altivec was in use instead of VSX,
> > VRSAVE should be handled accordingly. But I have a different opinion now...
> >
> > I'm wondering if something would really break on Linux BE if we forget about
> > VRSAVE at all in the JVM. If not, we could forget about VRSAVE forever on Linux.
> > Looks like VRSAVE was sort of born to the oblivion...
> >
> >
> > Kind regards,
> > Gustavo
> >
> > [1] http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
> > [2] http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002508.html

From jamsheed.c.m at oracle.com Mon Sep 11 18:13:47 2017
From: jamsheed.c.m at oracle.com (jamsheed)
Date: Mon, 11 Sep 2017 23:43:47 +0530
Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL
Message-ID:

Hi,

request for review the fix made for the bug

JBS: https://bugs.openjdk.java.net/browse/JDK-8168712

webrev: http://cr.openjdk.java.net/~jcm/8168712/webrev.00/

brief desc: special handling of Object.<init> in TemplateInterpreter::deopt_reexecute_entry

required last_sp to be reset explicitly in the normal return path

address TemplateInterpreter::deopt_reexecute_entry(Method* method, address bcp) {
  assert(method->contains(bcp), "just checkin'");
  Bytecodes::Code code = Bytecodes::java_code_at(method, bcp);
  if (code == Bytecodes::_return) {
    // This is used for deopt during registration of finalizers
    // during Object.<init>. We simply need to resume execution at
    // the standard return vtos bytecode to pop the frame normally.
    // reexecuting the real bytecode would cause double registration
    // of the finalizable object.
    return _normal_table.entry(Bytecodes::_return).entry(vtos);

test: jprt

Best Regards,
Jamsheed

From cthalinger at twitter.com Mon Sep 11 18:17:09 2017
From: cthalinger at twitter.com (Christian Thalinger)
Date: Mon, 11 Sep 2017 08:17:09 -1000
Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL
In-Reply-To:
References:
Message-ID:

> On Sep 11, 2017, at 8:13 AM, jamsheed wrote:
>
> Hi,
>
> request for review the fix made for the bug
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8168712

Any chance to make that bug open?

>
> webrev: http://cr.openjdk.java.net/~jcm/8168712/webrev.00/
>
> brief desc: special handling of Object.<init> in TemplateInterpreter::deopt_reexecute_entry
>
> required last_sp to be reset explicitly in normal return path
>
> address TemplateInterpreter::deopt_reexecute_entry(Method* method, address bcp) {
>   assert(method->contains(bcp), "just checkin'");
>   Bytecodes::Code code = Bytecodes::java_code_at(method, bcp);
>   if (code == Bytecodes::_return) {
>     // This is used for deopt during registration of finalizers
>     // during Object.<init>. We simply need to resume execution at
>     // the standard return vtos bytecode to pop the frame normally.
>     // reexecuting the real bytecode would cause double registration
>     // of the finalizable object.
>     return _normal_table.entry(Bytecodes::_return).entry(vtos);
>
> test: jprt
>
> Best Regards,
>
> Jamsheed

From vladimir.kozlov at oracle.com Mon Sep 11 18:33:08 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Mon, 11 Sep 2017 11:33:08 -0700
Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL
In-Reply-To:
References:
Message-ID: <552034e1-e108-835f-aab7-8aee0988b235@oracle.com>

Jamsheed,

You said you have a small test case. Can you add it as new jtreg test?

Can it be reproduced with CompileCommand=dontinline,Object.<init> ?

I don't see RBT testing results.

Here is Dean's evaluation:

"This crash happens because of the code in TemplateInterpreter::deopt_reexecute_entry() that uses _normal_table.entry() instead of AbstractInterpreter::deopt_reexecute_entry() if we are at a return bytecode. Normally C1 and C2 inline and use an intrinsic for Object.<init>, so there is little chance to deoptimize on a return bytecode. But with AOT-compiled Object.<init> it's possible, and we assert because we skipped the deopt entry that is supposed to clear last_sp."

Thanks,
Vladimir

From cthalinger at twitter.com Mon Sep 11 18:59:26 2017
From: cthalinger at twitter.com (Christian Thalinger)
Date: Mon, 11 Sep 2017 08:59:26 -1000
Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL
In-Reply-To: <552034e1-e108-835f-aab7-8aee0988b235@oracle.com>
References: <552034e1-e108-835f-aab7-8aee0988b235@oracle.com>
Message-ID: <3864A5E3-297A-4249-8DA6-C9BBC65C9CE1@twitter.com>

> On Sep 11, 2017, at 8:33 AM, Vladimir Kozlov wrote:
>
> Here is Dean's evaluation:
>
> "This crash happens because of the code in TemplateInterpreter::deopt_reexecute_entry() that uses _normal_table.entry() instead of AbstractInterpreter::deopt_reexecute_entry() if we are at a return bytecode. Normally C1 and C2 inline and use an intrinsic for Object.<init>, so there is little chance to deoptimize on a return bytecode. But with AOT-compiled Object.<init> it's possible, and we assert because we skipped the deopt entry that is supposed to clear last_sp."

Thanks.

From dean.long at oracle.com Mon Sep 11 21:22:46 2017
From: dean.long at oracle.com (Dean Long)
Date: Mon, 11 Sep 2017 14:22:46 -0700
Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL
In-Reply-To:
References:
Message-ID:

Unfortunately, this fix slows down normal returns, even for non-debug builds. Isn't what we really want something like deopt_no_reexecute_entry, which would use Interpreter::deopt_entry(state, Bytecodes::length_for(code)) to skip the current return bytecode?

dl

From dean.long at oracle.com Tue Sep 12 02:21:36 2017
From: dean.long at oracle.com (Dean Long)
Date: Mon, 11 Sep 2017 19:21:36 -0700
Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions
Message-ID: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com>

https://bugs.openjdk.java.net/browse/JDK-8132547

http://cr.openjdk.java.net/~dlong/8132547/

This enhancement is a first step in supporting invokedynamic instructions in AOT. Previously, when we saw an invokedynamic instruction, or any anonymous class, we would generate code to bail out and deoptimize. With this changeset we go a little further and call into the runtime to resolve the dynamic constant pool entry, running the bootstrap method, and returning the adapter method and appendix object. Like class initialization in AOT, we only do this the first time through. Because AOT double-checks classes using fingerprints and symbolic names, special care was required to handle anonymous class names. The solution I chose was to name anonymous types with aliases based on their constant pool location ("adapter" and appendix").

Future work is needed to AOT-compile the anonymous classes and/or inline through them, so this change is not expected to affect AOT performance. In my tests I was not able to measure any difference.

Upstream Graal changes have already been pushed. I broke the JVMCI and hotspot changes into separate webrevs.
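The "only the first time through" behaviour described in the invokedynamic RFR can be illustrated with a small resolve-once call site. This is a minimal sketch, not the jaotc or HotSpot implementation: ResolvedCallSite, LazyIndyCallSite and the resolver callback are hypothetical names, and the resolver merely stands in for the runtime call that runs the bootstrap method and returns the adapter method and appendix object.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-ins for the resolved pieces of an invokedynamic call
// site; these are illustrative types, not HotSpot or Graal classes.
struct ResolvedCallSite {
    std::function<int(int, const std::string&)> adapter;  // adapter method
    std::string appendix;                                  // appendix object
};

class LazyIndyCallSite {
public:
    using Resolver = std::function<ResolvedCallSite()>;

    explicit LazyIndyCallSite(Resolver resolver)
        : resolver_(std::move(resolver)) {}

    // First call: run the resolver, which models the runtime call that
    // executes the bootstrap method. Later calls reuse the cached result.
    int invoke(int arg) {
        if (!resolved_) {
            site_ = resolver_();
            assert(site_.adapter && "resolver must supply an adapter");
            resolved_ = true;
            ++resolutions_;
        }
        return site_.adapter(arg, site_.appendix);
    }

    // Number of times the resolver has run (0 or 1 in this model).
    int resolutions() const { return resolutions_; }

private:
    Resolver resolver_;
    ResolvedCallSite site_;
    bool resolved_ = false;
    int resolutions_ = 0;
};
```

The sketch is single-threaded on purpose; a real runtime would also have to deal with concurrent first calls, which is outside what the message describes.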

dl

From forax at univ-mlv.fr Tue Sep 12 06:54:29 2017
From: forax at univ-mlv.fr (Remi Forax)
Date: Tue, 12 Sep 2017 08:54:29 +0200 (CEST)
Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions
In-Reply-To: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com>
References: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com>
Message-ID: <841398712.806020.1505199269517.JavaMail.zimbra@u-pem.fr>

Hi Dean,
Java currently uses invokedynamic in two places: one is lambda creation, the other is string concatenation.
Have you tested string concatenation?
This patch should help right now because StringConcatFactory does not use any anonymous classes.

Also, Java (the language) is not the only user of invokedynamic: how do things work if the bootstrap method requires data that is only available at runtime, for example data that comes from a dynamic language runtime?

cheers,
Rémi

From jamsheed.c.m at oracle.com Tue Sep 12 09:16:08 2017
From: jamsheed.c.m at oracle.com (jamsheed)
Date: Tue, 12 Sep 2017 14:46:08 +0530
Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL
In-Reply-To:
References:
Message-ID: <6ba88494-b64f-1b6a-0161-d291d4422016@oracle.com>

Hi Chris,

it is made open now.

Best Regards,
Jamsheed

On Monday 11 September 2017 11:47 PM, Christian Thalinger wrote:
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8168712
>
> Any chance to make that bug open?

From jamsheed.c.m at oracle.com Tue Sep 12 10:20:49 2017
From: jamsheed.c.m at oracle.com (jamsheed)
Date: Tue, 12 Sep 2017 15:50:49 +0530
Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL
In-Reply-To: <552034e1-e108-835f-aab7-8aee0988b235@oracle.com>
References: <552034e1-e108-835f-aab7-8aee0988b235@oracle.com>
Message-ID: <6fe2ecd4-1fc8-c5f3-b762-1559a61def92@oracle.com>

Hi Vladimir,

On Tuesday 12 September 2017 12:03 AM, Vladimir Kozlov wrote:
> Jamsheed,
>
> You said you have a small test case. Can you add it as new jtreg test?
ok, i have added the test case in JBS now, it uses -XX:+DTraceMethodProbes.
>
> Can it be reproduced with CompileCommand=dontinline,Object.<init> ?
it reproduces in inline case too.
>
> I don't see RBT testing results.
i will update it.
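The entry selection discussed in this thread can be pictured with a toy model. The sketch below is not HotSpot code: Bytecode, Entry, Frame and the three functions are hypothetical names, and the reset_on_normal_return flag only stands in for the explicit reset that the webrev is described as adding to the normal return path.

```cpp
#include <cassert>

// Toy model, not HotSpot code: all names below are illustrative.
enum class Bytecode { Return, Other };
enum class Entry { NormalReturn, DeoptReexecute };

struct Frame {
    const void* last_sp = nullptr;  // non-null while a call is outstanding
};

// Mirrors the shape of the special case under discussion: a return bytecode
// resumes at the normal return entry, everything else goes through the
// deopt re-execute entry.
inline Entry select_reexecute_entry(Bytecode code) {
    return code == Bytecode::Return ? Entry::NormalReturn
                                    : Entry::DeoptReexecute;
}

// In this model only the deopt entry clears last_sp; the normal return entry
// leaves it alone unless an explicit reset is requested.
inline void enter(Frame& f, Entry e, bool reset_on_normal_return) {
    if (e == Entry::DeoptReexecute || reset_on_normal_return) {
        f.last_sp = nullptr;
    }
}

// The condition checked before a VM call: last_sp must be null, otherwise
// the "last_sp != NULL" debug message fires.
inline bool call_vm_precondition_holds(const Frame& f) {
    return f.last_sp == nullptr;
}
```

With reset_on_normal_return set to false, a frame that deoptimizes on a return bytecode keeps a stale last_sp and the precondition fails, which is the situation Dean's evaluation describes; the alternative he raises, a separate non-reexecute deopt entry, is not modelled here.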

Best Regards,
Jamsheed

From gustavo.scalet at eldorado.org.br Tue Sep 12 15:47:49 2017
From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet)
Date: Tue, 12 Sep 2017 15:47:49 +0000
Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
In-Reply-To: <670dd284fe77479986abe75aca42b20a@sap.com>
References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59A9850B.7030302@linux.vnet.ibm.com> <3e64d11f046f44379f9658dffc766a45@sap.com> <670dd284fe77479986abe75aca42b20a@sap.com>
Message-ID: <875858bb7bda421b97352d0fb2972a73@serv031.corp.eldorado.org.br>

Hi Martin and Götz,

I was taking a closer look at the hotspot's tests/compiler and I see indeed one test failing for sha:

Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnSupportedCPU.java
FAILED: compiler/intrinsics/sha/TestSHA.java
Passed: compiler/intrinsics/sha/sanity/TestSHA1Intrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA256Intrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA1MultiBlockIntrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA512Intrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA512MultiBlockIntrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA256MultiBlockIntrinsics.java

That one was failing due to a bug on unaligned memory load. I took a closer look and fixed it. It should also work on Big Endian:
https://gut.github.io/openjdk/webrev/JDK-8185979/webrev.02/index.html

This new webrev was updated on top of Martin's webrev.03. I also took this chance to add all the contributors to this patch, as you suggested before.

Thanks

From martin.doerr at sap.com Tue Sep 12 16:32:27 2017
From: martin.doerr at sap.com (Doerr, Martin)
Date: Tue, 12 Sep 2017 16:32:27 +0000
Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
In-Reply-To: <875858bb7bda421b97352d0fb2972a73@serv031.corp.eldorado.org.br>
References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59A9850B.7030302@linux.vnet.ibm.com> <3e64d11f046f44379f9658dffc766a45@sap.com> <670dd284fe77479986abe75aca42b20a@sap.com> <875858bb7bda421b97352d0fb2972a73@serv031.corp.eldorado.org.br>
Message-ID: <2726388e5b224cdcb75a9c0931ec1e44@sap.com>

Hi Gustavo,

thanks for debugging. It should fix the problem for LE, but the previous version was correct for BE.
Version which works for both:
http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.04/

Changes:
- factored out lvsr/lvsl
- fixed ofs <= limit comparison (treat as positive ints to ignore garbage in high half and be protected against integer overflow, use <= see DigestBase.java)
- removed unused labels
- added the contributors you mentioned to Contributed-by list (not comments, I think it's better there)

Please let us know if this is ok for you. We'll do some more testing on all platforms.

Best regards,
Martin


-----Original Message-----
From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
Sent: Tuesday, 12 September 2017 17:48
To: Doerr, Martin; Lindenmaier, Goetz; Gustavo Romero
Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic

Hi Martin and Götz,

I was taking a closer look at the hotspot's tests/compiler and I see indeed one test failing for sha:
Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnSupportedCPU.java
FAILED: compiler/intrinsics/sha/TestSHA.java
Passed: compiler/intrinsics/sha/sanity/TestSHA1Intrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA256Intrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA1MultiBlockIntrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA512Intrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA512MultiBlockIntrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA256MultiBlockIntrinsics.java

That one was failing due to a bug on unaligned memory load. I took a closer look and fixed it. It should also work on Big Endian:
https://gut.github.io/openjdk/webrev/JDK-8185979/webrev.02/index.html

This new webrev was updated on top of Martin's webrev.03.

I also took this chance to add all the contributors to this patch, as you suggested before.

Thanks

> -----Original Message-----
> From: Doerr, Martin [mailto:martin.doerr at sap.com]
> Sent: Monday, 11 September 2017 14:06
> To: Lindenmaier, Goetz; Gustavo Romero; Gustavo Serra Scalet
> Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
>
> Hi Götz and Gustavo,
>
> I had just posted the version I had before leaving. Thanks for your feedback.
>
> New webrev is here:
> http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.03/
>
> Changes to webrev.02:
> - Referenced paper
> - Factored out endianness specific vector permute instructions (vec_perm with only 3 parms to reduce risk of mixing them up)
> - Removed code for PPC64 platforms which didn't support it
> - code_size2 = 22000
> - Added missing ')' in IntrinsicPredicates.java
>
> My changes shouldn't change the behavior of the little endian implementation.
> We have to check if and if yes which tests still fail. Are there any updates on this?
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Lindenmaier, Goetz
> Sent: Thursday, 7 September 2017 09:55
> To: Gustavo Romero; Doerr, Martin; Gustavo Serra Scalet
> Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
>
> Hi,
>
> I had a look at this change.
>
> Martin, you missed a ')' in IntrinsicPredicates.java.
>
> Combined with the multiplyToLen change, stub codebuffer space runs out.
> Please increase
> code_size2 = 20000
> to 22000 in stubRoutines_ppc.hpp.
>
> I see TestSHA.java failing on linuxppc64le.
> Also, other tests are failing with SHA-256 digest error ...
>
> Also, on aix, some of our internal tests are failing. These didn't run on linuxppc64 on a Power8 machine, so it might fail there, too. But on the big endian platforms, the jtreg tests don't fail.
>
> @Gustavo, maybe you can have a look at the issues on linuxppc64le and post a new webrev. Then Martin can fix the remaining issue on big endian.
>
> Best regards,
> Goetz.
>
>
> > -----Original Message-----
> > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Gustavo Romero
> > Sent: Friday, 1 September 2017 18:04
> > To: Doerr, Martin; Gustavo Serra Scalet
> > Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
> >
> > Hi Martin!
> >
> > On 01-09-2017 12:39, Doerr, Martin wrote:
> > > Hi Gustavos,
> > >
> > > I have managed to upload a version which seems to work on both endianness implementations.
> > > At least some quick tests have passed on AIX and Big Endian linux in addition to Little Endian linux.
> >
> > Great! :-)
> >
> >
> > > I'll be out next week, but the change looks ok for me. Please let me know if the changed version still looks ok for you, too. Feel free to overwork or improve it.
> > > It'd also be good to know, if relying on vrsave=-1 is safe.
> >
> > Sure, Martin. I'm chasing what's exactly setting vrsave=-1 and the full history log (looks like it's not in the kernel, but I'm checking yet).
> >
> >
> > > Is the copyright information ok?
> > > Did you get source code which requires to be mentioned in the comments?
> > > The code looks similar to a reference implementation, so the authors of it may want to be mentioned?
> > > Or did you just use the paper for implementing it? In this case, I'd mention the paper.
> >
> > Gustavo S: the information on the paper must be updated accordingly as Martin noted in the new webrev. There is none currently.
> >
> >
> > > After we got a second review and ran more tests, we can ask somebody from Oracle to push it.
> > >
> > > Thanks for contributing and your support, Martin
> >
> > Thanks a lot for reviewing and for all the help.
> >
> > Regards,
> > Gustavo R
> >
> >
> > > -----Original Message-----
> > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Doerr, Martin
> > > Sent: Thursday, 31 August 2017 18:21
> > > To: Gustavo Romero
> > > Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> > > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
> > >
> > > Hi Gustavo R,
> > >
> > > I guess you're right. vrsave is already set to -1, so all Vector Registers get saved.
> > > It'd be good to know where it is set (OS, Flag in ELF header, ???) and if this is guaranteed.
> > > I don't want to risk getting sporadic errors on some OS versions.
> > >
> > > I'd like to enable SHA intrinsics on linux BE as well. I already managed to get the 256 bit version working (was quite some work!).
> > >
> > > Thanks and best regards,
> > > Martin

From gustavo.scalet at eldorado.org.br Tue Sep 12 17:23:00 2017
From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet)
Date: Tue, 12 Sep 2017 17:23:00 +0000
Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
In-Reply-To: <2726388e5b224cdcb75a9c0931ec1e44@sap.com>
References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59A9850B.7030302@linux.vnet.ibm.com> <3e64d11f046f44379f9658dffc766a45@sap.com> <670dd284fe77479986abe75aca42b20a@sap.com> <875858bb7bda421b97352d0fb2972a73@serv031.corp.eldorado.org.br> <2726388e5b224cdcb75a9c0931ec1e44@sap.com>
Message-ID: <1fc44e1522d046deb3b810ee217b95f8@serv031.corp.eldorado.org.br>

Hi Martin,

Thanks for the improvements and testing on BE. I'm ok with your suggestions.
Just in case, I tested for LE and it worked as expected.

Thanks.

From jaroslav.tulach at oracle.com Tue Sep 12 17:44:07 2017
From: jaroslav.tulach at oracle.com (Jaroslav Tulach)
Date: Tue, 12 Sep 2017 19:44:07 +0200
Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean
In-Reply-To:
References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <1906851.CDFazpJ8ns@pracovni>
Message-ID: <1595796.xe8q7Anfpj@pracovni>

Dear reviewers,
after several reconsiderations I have webrev #4 ready for your review. Can you please take a look at

http://cr.openjdk.java.net/~jtulach/8182701/webrev.04/

and let me know if it is in a reasonable shape? Thanks a lot.
-jt

From mandy.chung at oracle.com Tue Sep 12 18:19:45 2017
From: mandy.chung at oracle.com (mandy chung)
Date: Tue, 12 Sep 2017 11:19:45 -0700
Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean
In-Reply-To: <1595796.xe8q7Anfpj@pracovni>
References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <1906851.CDFazpJ8ns@pracovni> <1595796.xe8q7Anfpj@pracovni>
Message-ID:

On 9/12/17 10:44 AM, Jaroslav Tulach wrote:
> Dear reviewers,
> after several reconsiderations I have webrev #4 ready for your review.
> Can you please take a look at
>
> http://cr.openjdk.java.net/~jtulach/8182701/webrev.04/
>
> and let me know if it is in a reasonable shape? Thanks a lot.
> -jt

Yes, defining a new provider module for the platform mbean registration is a reasonable solution. Generally the patch looks right. I have a question on the build and a comment on the new mbean method.

./make/common/Modules.gmk
    Nit: can you move jdk.internal.vm.compiler.management to keep the list in alphabetical order?

    # Filter out Graal specific modules if Graal build is disabled

    ifeq ($(INCLUDE_GRAAL), false)
      MODULES_FILTER += jdk.internal.vm.compiler
    endif

When will INCLUDE_GRAAL be set to false? I think jdk.internal.vm.compiler.management should also be filtered if jdk.internal.vm.compiler is disabled.

Are jdk.internal.vm.compiler and jdk.internal.vm.compiler.management built for all platforms in JDK 10? If not, jdk/src/java.management/share/classes/module-info.java may fail to compile when jdk.internal.vm.compiler.management is not present. We can consult with the build team once you find out in which configuration jdk.internal.vm.compiler is not built.

hotspot/src/jdk.internal.vm.compiler/share/classes/module-info.java

    requires transitive jdk.internal.vm.ci;

Do you get any error without this requires transitive? jdk.internal.vm.compiler.management already requires jdk.internal.vm.ci. I would think this requires transitive is not necessary.

Is the HotSpotGraalCompiler::mbean method necessary? In GraalMBeans.java:

    public static Object findGraalRuntimeBean() {
        JVMCIRuntime r = JVMCI.getRuntime();
        JVMCICompiler c = r.getCompiler();
        if (c instanceof HotSpotGraalCompiler) {
            return ((HotSpotGraalCompiler) c).mbean();
        }
        return null;
    }

It seems that you can call HotSpotGraalRuntime::mbean directly.

As we discussed offline, we agree that HotSpotRuntimeMBean should belong to this new module but it requires some refactoring which may take some amount of work. Such clean up will be followed up in a separate JBS issue.

Mandy

From dean.long at oracle.com Tue Sep 12 19:07:04 2017
From: dean.long at oracle.com (Dean Long)
Date: Tue, 12 Sep 2017 12:07:04 -0700
Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions
In-Reply-To: <841398712.806020.1505199269517.JavaMail.zimbra@u-pem.fr>
References: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com> <841398712.806020.1505199269517.JavaMail.zimbra@u-pem.fr>
Message-ID: <9dd53949-fdb1-e4b0-ca5c-7d06da7061c1@oracle.com>

Hi Remi,

On 9/11/2017 11:54 PM, Remi Forax wrote:
> Hi Dean,
> Java currently uses invokedynamic in two places, one is for lambda creation, the other is for string concatenation.
> Do you have tested string concatenation ? This patch should help right now because the StringConcatFactory do not uses any anonymous class.

Yes, what gets generated for AOT will be a call to MethodHandle.invokeBasic(). If StringConcatFactory was AOT-compiled, then we can take advantage of that as pre-compiled code, but not as inlined code. To see an inlined string concatenation more work is required.

> and Java (the language) is not the only one to use invokedynamic, how thing works if the boostrap method requires data that are only available at runtime, data that comes from a dynamic language runtime by example ?

So the bootstrap method is using data other than constantpool constants? Could you give an example?

What we do at compile time is to resolve the constant pool entry, which gives us the actual adapter and appendix objects, allowing some folding and inlining. At runtime, we resolve again, and do some sanity checking to make sure the compile-time types are compatible with the runtime types.
If a dynamic language runtime can break those assumptions, then we would probably need to turn off the compile-time speculation and compile the code as if the adapter and appendix types were unknown.

dl

> cheers,
> Rémi
>
> ----- Original Message -----
>> From: "Dean Long"
>> To: "hotspot compiler"
>> Sent: Tuesday, 12 September 2017 04:21:36
>> Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions
>
>> https://bugs.openjdk.java.net/browse/JDK-8132547
>>
>> http://cr.openjdk.java.net/~dlong/8132547/
>>
>> This enhancement is a first step in supporting invokedynamic instructions in AOT. Previously, when we saw an invokedynamic instruction, or any anonymous class, we would generate code to bail out and deoptimize. With this changeset we go a little further and call into the runtime to resolve the dynamic constant pool entry, running the bootstrap method, and returning the adapter method and appendix object. Like class initialization in AOT, we only do this the first time through. Because AOT double-checks classes using fingerprints and symbolic names, special care was required to handle anonymous class names. The solution I chose was to name anonymous types with aliases based on their constant pool location ("adapter" and appendix").
>>
>> Future work is needed to AOT-compile the anonymous classes and/or inline through them, so this change is not expected to affect AOT performance. In my tests I was not able to measure any difference.
>>
>> Upstream Graal changes have already been pushed. I broke the JVMCI and hotspot changes into separate webrevs.
>>
>> dl

From forax at univ-mlv.fr Tue Sep 12 20:23:24 2017
From: forax at univ-mlv.fr (forax at univ-mlv.fr)
Date: Tue, 12 Sep 2017 22:23:24 +0200 (CEST)
Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions
In-Reply-To: <9dd53949-fdb1-e4b0-ca5c-7d06da7061c1@oracle.com>
References: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com> <841398712.806020.1505199269517.JavaMail.zimbra@u-pem.fr> <9dd53949-fdb1-e4b0-ca5c-7d06da7061c1@oracle.com>
Message-ID: <301465071.1345030.1505247804769.JavaMail.zimbra@u-pem.fr>

----- Original Message -----
> From: "Dean Long"
> To: "Remi Forax"
> Cc: "hotspot compiler"
> Sent: Tuesday, 12 September 2017 21:07:04
> Subject: Re: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions

> Hi Remi,

Hi Dean,

>
> On 9/11/2017 11:54 PM, Remi Forax wrote:
>> Hi Dean,
>> Java currently uses invokedynamic in two places, one is for lambda creation, the other is for string concatenation.
>> Do you have tested string concatenation ? This patch should help right now because the StringConcatFactory do not uses any anonymous class.
>
> Yes, what gets generated for AOT will be a call to MethodHandle.invokeBasic(). If StringConcatFactory was AOT-compiled, then we can take advantage of that as pre-compiled code, but not as inlined code. To see an inlined string concatenation more work is required.
>
>> and Java (the language) is not the only one to use invokedynamic, how thing works if the boostrap method requires data that are only available at runtime, data that comes from a dynamic language runtime by example ?
>
> So the bootstrap method is using data other than constantpool constants? Could you give an example?

A bootstrap method can call any function that has side effects; a BSM is not required to be transparent [1].
For example, a BSM can load some property files from disk to see the value of a property as a constant (instead of compiling the property file into a .class).
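
The non-transparent case described above can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the webrev or from the JDK: the class name `PropertyBootstrapDemo`, the helper `link`, and the property key `demo.mode` are all hypothetical, and a system property stands in for the property file on disk. Plain Java source cannot emit an invokedynamic instruction, so the bootstrap method is called by hand to simulate linkage.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class PropertyBootstrapDemo {

    // Hypothetical bootstrap method with the standard (Lookup, String, MethodType)
    // parameters. It is NOT referentially transparent: the constant it bakes
    // into the call site depends on a system property read at link time.
    public static CallSite bootstrap(MethodHandles.Lookup lookup,
                                     String name,
                                     MethodType type) {
        String value = System.getProperty(name, "<unset>");
        MethodHandle constant = MethodHandles.constant(String.class, value);
        return new ConstantCallSite(constant.asType(type));
    }

    // Simulates linking one call site and invoking it once. With a real
    // invokedynamic instruction the JVM would call bootstrap() itself.
    public static String link(String key) {
        try {
            MethodType type = MethodType.methodType(String.class);
            CallSite site = bootstrap(MethodHandles.lookup(), key, type);
            return (String) site.dynamicInvoker().invokeExact();
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        // Environment as an ahead-of-time compiler might see it.
        System.setProperty("demo.mode", "build-time");
        String atBuildTime = link("demo.mode");

        // Environment of the running program.
        System.setProperty("demo.mode", "run-time");
        String atRunTime = link("demo.mode");

        // prints: build-time vs run-time
        System.out.println(atBuildTime + " vs " + atRunTime);
    }
}
```

Linking the same call site twice gives two different constants, which is why a result computed by such a bootstrap method ahead of time may not match what the same method would produce in the running program.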
There are other examples where you store the callsites in some runtime data structure to reuse them, control them, etc. > > What we do at compile time is to resolve the constant pool entry, which > gives us the actual adapter and appendix objects, allowing some folding > and inlining. At runtime, we resolve again, and do some sanity checking > to make sure the compile-time types are compatible with the runtime > types. If a dynamic language runtime can break those assumptions, then > we would probably need to turn off the compile-time speculation and > compile the code as if the adapter and appendix types were unknown. The LambdaMetaFactory and the StringConcatFactory BSMs are transparent. Perhaps for the other BSMs, we need either some command line flags for jaotc (to opt out) or a way to annotate BSMs to say they are NOT transparent or perhaps both. Rémi [1] https://en.wikipedia.org/wiki/Referential_transparency_(computer_science) > > dl > >> cheers, >> Rémi >> >> ----- Original Message ----- >>> From: "Dean Long" >>> To: "hotspot compiler" >>> Sent: Tuesday, 12 September 2017 04:21:36 >>> Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions >>> https://bugs.openjdk.java.net/browse/JDK-8132547 >>> >>> http://cr.openjdk.java.net/~dlong/8132547/ >>> >>> This enhancement is a first step in supporting invokedynamic >>> instructions in AOT. Previously, when we saw an invokedynamic >>> instruction, or any anonymous class, we would generate code to bail out >>> and deoptimize. With this changeset we go a little further and call >>> into the runtime to resolve the dynamic constant pool entry, running the >>> bootstrap method, and returning the adapter method and appendix object. >>> Like class initialization in AOT, we only do this the first time >>> through. Because AOT double-checks classes using fingerprints and >>> symbolic names, special care was required to handle anonymous class >>> names.
The solution I chose was to name anonymous types with aliases >>> based on their constant pool location ("adapter" and >>> appendix"). >>> >>> Future work is needed to AOT-compile the anonymous classes and/or inline >>> through them, so this change is not expected to affect AOT performance. >>> In my tests I was not able to measure any difference. >>> >>> Upstream Graal changes have already been pushed. I broke the JVMCI and >>> hotspot changes into separate webrevs. >>> > >> dl From dean.long at oracle.com Tue Sep 12 23:52:39 2017 From: dean.long at oracle.com (Dean Long) Date: Tue, 12 Sep 2017 16:52:39 -0700 Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions In-Reply-To: <301465071.1345030.1505247804769.JavaMail.zimbra@u-pem.fr> References: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com> <841398712.806020.1505199269517.JavaMail.zimbra@u-pem.fr> <9dd53949-fdb1-e4b0-ca5c-7d06da7061c1@oracle.com> <301465071.1345030.1505247804769.JavaMail.zimbra@u-pem.fr> Message-ID: Inline comments below... On 9/12/2017 1:23 PM, forax at univ-mlv.fr wrote: > ----- Original Message ----- >> From: "Dean Long" >> To: "Remi Forax" >> Cc: "hotspot compiler" >> Sent: Tuesday, 12 September 2017 21:07:04 >> Subject: Re: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions >> Hi Remi, > Hi Dean, > >> On 9/11/2017 11:54 PM, Remi Forax wrote: >>> Hi Dean, >>> Java currently uses invokedynamic in two places, one is for lambda creation, the >>> other is for string concatenation. >>> Do you have tested string concatenation ? This patch should help right now >>> because the StringConcatFactory do not uses any anonymous class. >> Yes, what gets generated for AOT will be a call to >> MethodHandle.invokeBasic(). If StringConcatFactory was AOT-compiled, >> then we can take advantage of that as pre-compiled code, but not as >> inlined code. To see an inlined string concatenation more work is required.
>> >>> and Java (the language) is not the only one to use invokedynamic, how thing >>> works if the boostrap method requires data that are only available at runtime, >>> data that comes from a dynamic language runtime by example ? >> So the bootstrap method is using data other than constantpool >> constants? Could you give an example? > A bootstrap method can call any function that has side effects; a BSM is not required to be transparent [1]. > For example, a BSM can load a property file from disk and treat the value of a property as a constant (instead of compiling the property file into a .class). > There are other examples where you store the callsites in some runtime data structure to reuse them, control them, etc. I did try to anticipate a situation like that. At runtime, we run the bootstrap method and check the types of the appendix and adapter method using a "fingerprint", which is an MD5 hash of the bytecodes. The AOT library contains the fingerprints that were found at compile time. If a mismatch is detected then we deoptimize the method. >> What we do at compile time is to resolve the constant pool entry, which >> gives us the actual adapter and appendix objects, allowing some folding >> and inlining. At runtime, we resolve again, and do some sanity checking >> to make sure the compile-time types are compatible with the runtime >> types. If a dynamic language runtime can break those assumptions, then >> we would probably need to turn off the compile-time speculation and >> compile the code as if the adapter and appendix types were unknown. > The LambdaMetaFactory and the StringConcatFactory BSMs are transparent. > Perhaps for the other BSMs, we need either some command line flags for jaotc (to opt out) or a way to annotate BSMs to say they are NOT transparent or perhaps both. That would be a great test case, to see if a non-transparent BSM can get past the fingerprint checks already in place.
If so, then I agree we need to file a bug and address it. Perhaps once the code is pushed, you could help write such a test case :-) dl > Rémi > > [1] https://en.wikipedia.org/wiki/Referential_transparency_(computer_science) > >> dl >> >>> cheers, >>> Rémi >>> >>> ----- Original Message ----- >>>> From: "Dean Long" >>>> To: "hotspot compiler" >>>> Sent: Tuesday, 12 September 2017 04:21:36 >>>> Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions >>>> https://bugs.openjdk.java.net/browse/JDK-8132547 >>>> >>>> http://cr.openjdk.java.net/~dlong/8132547/ >>>> >>>> This enhancement is a first step in supporting invokedynamic >>>> instructions in AOT. Previously, when we saw an invokedynamic >>>> instruction, or any anonymous class, we would generate code to bail out >>>> and deoptimize. With this changeset we go a little further and call >>>> into the runtime to resolve the dynamic constant pool entry, running the >>>> bootstrap method, and returning the adapter method and appendix object. >>>> Like class initialization in AOT, we only do this the first time >>>> through. Because AOT double-checks classes using fingerprints and >>>> symbolic names, special care was required to handle anonymous class >>>> names. The solution I chose was to name anonymous types with aliases >>>> based on their constant pool location ("adapter" and >>>> appendix"). >>>> >>>> Future work is needed to AOT-compile the anonymous classes and/or inline >>>> through them, so this change is not expected to affect AOT performance. >>>> In my tests I was not able to measure any difference. >>>> >>>> Upstream Graal changes have already been pushed. I broke the JVMCI and >>>> hotspot changes into separate webrevs.
>>>> >>>> dl From jamsheed.c.m at oracle.com Wed Sep 13 01:29:03 2017 From: jamsheed.c.m at oracle.com (jamsheed) Date: Wed, 13 Sep 2017 06:59:03 +0530 Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL In-Reply-To: References: Message-ID: <8d91428c-061c-1601-2b99-99ed6544b13e@oracle.com> Thank you for review and feedback. i presumed speed was not issue for interpreter, so went with this approach, all others need a bit more platform coding or investigation. Best Regards, Jamsheed On Tuesday 12 September 2017 02:52 AM, Dean Long wrote: > Unfortunately, this fix slows down normal returns, even for non-debug > builds. Isn't what we really want something like > deopt_no_reexecute_entry, which would use > Interpreter::deopt_entry(state, Bytecodes::length_for(code)) to skip > the current return bytecode? skip current return bytecode with removal of top frame, right ? Best regards, Jamsheed > > dl > > > On 9/11/2017 11:13 AM, jamsheed wrote: >> Hi, >> >> request for review the fix made for the bug >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8168712 >> >> webrev: http://cr.openjdk.java.net/~jcm/8168712/webrev.00/ >> >> brief desc: special handling of Object.<init> in >> TemplateInterpreter::deopt_reexecute_entry >> >> required last_sp to be reset explicitly in normal return path >> >> address TemplateInterpreter::deopt_reexecute_entry(Method* method, >> address bcp) { >> assert(method->contains(bcp), "just checkin'"); >> Bytecodes::Code code = Bytecodes::java_code_at(method, bcp); >> if (code == Bytecodes::_return) { >> // This is used for deopt during registration of finalizers >> // during Object.<init>. We simply need to resume execution at >> // the standard return vtos bytecode to pop the frame normally. >> // reexecuting the real bytecode would cause double registration >> // of the finalizable object.
>> return _normal_table.entry(Bytecodes::_return).entry(vtos); >> >> test: jprt >> >> Best Regards, >> >> Jamsheed >> >> > From dean.long at oracle.com Wed Sep 13 04:33:03 2017 From: dean.long at oracle.com (Dean Long) Date: Tue, 12 Sep 2017 21:33:03 -0700 Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL In-Reply-To: <8d91428c-061c-1601-2b99-99ed6544b13e@oracle.com> References: <8d91428c-061c-1601-2b99-99ed6544b13e@oracle.com> Message-ID: On 9/12/2017 6:29 PM, jamsheed wrote: > Thank you for review and feedback. > > i presumed speed was not issue for interpreter, so went with this > approach, all others need a bit more platform coding or investigation. > It may not matter if you #ifdef it so that it only slows down debug builds. > Best Regards, > > Jamsheed > > > On Tuesday 12 September 2017 02:52 AM, Dean Long wrote: >> Unfortunately, this fix slows down normal returns, even for non-debug >> builds. Isn't what we really want something like >> deopt_no_reexecute_entry, which would use >> Interpreter::deopt_entry(state, Bytecodes::length_for(code)) to skip >> the current return bytecode? > skip current return bytecode with removal of top frame, right ? Yes, sorry, you're right, it wouldn't make sense to continue executing *after* the return bytecode in the same frame. dl > Best regards, > Jamsheed > >> >> dl >> >> >> On 9/11/2017 11:13 AM, jamsheed wrote: >>> Hi, >>> >>> request for review the fix made for the bug >>> >>> JBS: https://bugs.openjdk.java.net/browse/JDK-8168712 >>> >>> webrev: http://cr.openjdk.java.net/~jcm/8168712/webrev.00/ >>> >>> brief desc: special handling of Object.<init> in >>> TemplateInterpreter::deopt_reexecute_entry >>> >>> required last_sp to be reset explicitly in normal return path >>> >>> address TemplateInterpreter::deopt_reexecute_entry(Method* method, >>> address bcp) { >>> assert(method->contains(bcp), "just checkin'"); >>> Bytecodes::Code code
= Bytecodes::java_code_at(method, bcp); >>> if (code == Bytecodes::_return) { >>> // This is used for deopt during registration of finalizers >>> // during Object.<init>. We simply need to resume execution at >>> // the standard return vtos bytecode to pop the frame normally. >>> // reexecuting the real bytecode would cause double registration >>> // of the finalizable object. >>> return _normal_table.entry(Bytecodes::_return).entry(vtos); >>> >>> test: jprt >>> >>> Best Regards, >>> >>> Jamsheed >>> >>> >> > From dean.long at oracle.com Wed Sep 13 06:31:01 2017 From: dean.long at oracle.com (Dean Long) Date: Tue, 12 Sep 2017 23:31:01 -0700 Subject: RFR(XL) 8187438: Update Graal Message-ID: <99a6f631-6e6a-847d-02d5-9e2e336b9206@oracle.com> This is a Graal update. Please see the JBS entry for the complete list of upstream changes included. http://cr.openjdk.java.net/~dlong/8187438/webrev/ https://bugs.openjdk.java.net/browse/JDK-8187438 dl From jamsheed.c.m at oracle.com Wed Sep 13 13:22:11 2017 From: jamsheed.c.m at oracle.com (jamsheed) Date: Wed, 13 Sep 2017 18:52:11 +0530 Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL In-Reply-To: References: <8d91428c-061c-1601-2b99-99ed6544b13e@oracle.com> Message-ID: Thanks for reply, revised webrev: http://cr.openjdk.java.net/~jcm/8168712/webrev.01/ last_sp != NULL is not an issue for this specific case, so I skip the assert in debug build Best Regards, Jamsheed On Wednesday 13 September 2017 10:03 AM, Dean Long wrote: > It may not matter if you #ifdef it so that it only slows down debug > builds.
From mandy.chung at oracle.com Wed Sep 13 15:23:22 2017 From: mandy.chung at oracle.com (mandy chung) Date: Wed, 13 Sep 2017 08:23:22 -0700 Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean In-Reply-To: References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <1906851.CDFazpJ8ns@pracovni> <1595796.xe8q7Anfpj@pracovni> Message-ID: <14ccc931-d861-1672-9d26-a95e51e2f440@oracle.com> On 9/13/17 2:28 AM, Daniel Fuchs wrote: > Hi Jaroslav, > > GraalMBeans.java: > > 77 @Override > 78 public Set mbeanInterfaceNames() { > 79 return Collections.singleton(name); > 80 } Good catch, Daniel. This should return an empty set, as mbeanInterfaces() does. mbeanInterfaceNames returns the class names of the MBean interfaces. Mandy > > This is not correct. The return set should be a set of > MXBean interface names, as in Class.getName(), not a set > of MXBean ObjectName strings. > > The interface in question must be implemented by the > concrete MBean instance and must be a subclass of > PlatformManagedObject. > > It is not required for an MBean to implement such > an interface - if it doesn't then it simply won't > be obtainable from ManagementFactory::getPlatformMXBean > or ManagementFactory::getPlatformMXBeans. > > So I suspect that in your case, since mbeanInterfaces() > returns an empty set then mbeanInterfaceNames() should > also return an empty set. > > IIRC mbeanInterfaceNames() was introduced so that > you could query for a particular MBean implementing > a given interface without necessarily triggering the > loading and initialization of all interfaces implemented > by all MBeans. > > best regards, > > -- daniel > > > > On 12/09/2017 18:44, Jaroslav Tulach wrote: >> Dear reviewers, >> after several reconsiderations I have webrev #4 ready for your >> review.
Can you >> please take a look at >> >> http://cr.openjdk.java.net/~jtulach/8182701/webrev.04/ >> >> and let me know if it is in a reasonable shape? Thanks a lot. >> -jt >> > From vladimir.kozlov at oracle.com Wed Sep 13 16:30:22 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 Sep 2017 09:30:22 -0700 Subject: RFR(XL) 8187438: Update Graal In-Reply-To: <99a6f631-6e6a-847d-02d5-9e2e336b9206@oracle.com> References: <99a6f631-6e6a-847d-02d5-9e2e336b9206@oracle.com> Message-ID: <3277d8be-b63f-d661-b8ad-29d62f9471e5@oracle.com> Good. But you would need to wait when repo is open. Thanks, Vladimir On 9/12/17 11:31 PM, Dean Long wrote: > This is a Graal update.? Please see the JBS entry for the complete list > of upstream changes included. > > http://cr.openjdk.java.net/~dlong/8187438/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8187438 > > dl From dean.long at oracle.com Wed Sep 13 17:52:44 2017 From: dean.long at oracle.com (Dean Long) Date: Wed, 13 Sep 2017 10:52:44 -0700 Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL In-Reply-To: References: <8d91428c-061c-1601-2b99-99ed6544b13e@oracle.com> Message-ID: <3a500359-8838-a426-0cc6-4d9c886379b1@oracle.com> Don't you want to use #ifdef ASSERT instead? dl On 9/13/2017 6:22 AM, jamsheed wrote: > Thanks for reply, > > revised webrev: http://cr.openjdk.java.net/~jcm/8168712/webrev.01/ > > last_sp ! = null not an issue for this specific case, so i skip the > assert in debug build > > Best Regards, > > Jamsheed > > > On Wednesday 13 September 2017 10:03 AM, Dean Long wrote: >> It may not matter if you #ifdef it so that it only slows down debug >> builds. 
> From dean.long at oracle.com Wed Sep 13 17:57:30 2017 From: dean.long at oracle.com (Dean Long) Date: Wed, 13 Sep 2017 10:57:30 -0700 Subject: RFR(XL) 8187438: Update Graal In-Reply-To: <3277d8be-b63f-d661-b8ad-29d62f9471e5@oracle.com> References: <99a6f631-6e6a-847d-02d5-9e2e336b9206@oracle.com> <3277d8be-b63f-d661-b8ad-29d62f9471e5@oracle.com> Message-ID: <2b29dd34-7b35-81d3-e142-3b742bd2bf15@oracle.com> Thanks Vladimir. I plan to push it together with 8132547. dl On 9/13/2017 9:30 AM, Vladimir Kozlov wrote: > Good. But you would need to wait when repo is open. > > Thanks, > Vladimir > > On 9/12/17 11:31 PM, Dean Long wrote: >> This is a Graal update. Please see the JBS entry for the complete >> list of upstream changes included. >> >> http://cr.openjdk.java.net/~dlong/8187438/webrev/ >> https://bugs.openjdk.java.net/browse/JDK-8187438 >> >> dl From stuart.monteith at linaro.org Wed Sep 13 18:36:29 2017 From: stuart.monteith at linaro.org (Stuart Monteith) Date: Wed, 13 Sep 2017 19:36:29 +0100 Subject: RFR(XL/M) : 8178788: wrap JCStress test suite as jtreg tests In-Reply-To: References: <9A2C94EA-89A3-4C75-9D3C-51E058BD8A1D@oracle.com> <342c3748-616f-8a7d-74f4-3ce929b1e0dc@redhat.com> Message-ID: For the record, if I put my build of jcstress.jar in $HOME, the following allows the jcstress tests to run: make test TEST="hotspot_all" EXTRA_JTREG_OPTIONS="-Djdk.test.lib.artifacts.jcstress-tests-all=$HOME/jcstress.jar" From "make help" and the makefiles themselves, I had expected the following to work: make TEST="hotspot_all" JTREG="VM_OPTIONS=-Djdk.test.lib.artifacts.jcstress-tests-all=$HOME/jcstress.jar" test but it does not - the JTREG parameter is apparently ignored. This is unfortunate as there is a warning as it is a non-control variable. Am I wrong in thinking that this was written for testing internally within Oracle? I cannot find an instance of "com.oracle.jib.api.JibServiceFactory" in the OpenJDK project or elsewhere.
BR, Stuart On 8 September 2017 at 16:53, Stuart Monteith wrote: > Hello, > I've spent some time on this, and I have to admit that I'm stumped. I > get exactly the same errors on x86 on jdk10/hs and jdk10/jdk10 with arecent > build of JTReg and JT_HOME set appropriately. > > Are there any pointers on how this is supposed to be run? > > Thanks, > Stuart > > On 25 April 2017 at 11:47, Aleksey Shipilev wrote: > >> On 04/19/2017 12:12 AM, Igor Ignatyev wrote: >> > http://cr.openjdk.java.net/~iignatyev//8178788/webrev.00/index.html >> >> 69903 lines changed: 69903 ins; 0 del; 0 mod; >> > (69524 lines are generated) >> > >> > Hi all, >> > >> > could you please review this patch which adds a jtreg test wrapper for >> > jcstress test suite and jtreg tests which run jsctress tests thru this >> > wrapper? >> > >> > webrev: http://cr.openjdk.java.net/~iignatyev//8178788/webrev.00/ind >> ex.html >> > JBS: https://bugs.openjdk.java.net/browse/JDK-8178788 testing: >> >> TL;DR: This patch introduces more problems than it solves. Just run the >> jcstress >> tests-all JAR against the tested runtime. >> >> Wrapping jcstress tests with jtreg defies the purpose of jcstress harness >> -- >> that is, running lots of tests as fast as it possibly could without >> affecting >> testing quality. For example, by cleverly reusing VMs between the tests, >> using >> Whitebox to deoptimize without restarting the VMs, etc. It really wastes >> CPU >> time to run each test in isolation. 
>> >> Also, it does not "automatically" work, which defies "easy to run" goal: >> >> Caused by: java.io.FileNotFoundException: Couldn't automatically resolve >> dependency for jcstress-tests-all , revision 0.3 >> Please specify the location using jdk.test.lib.artifacts.jcstres >> s-tests-all >> at >> jdk.test.lib.artifacts.DefaultArtifactManager.resolve(Defaul >> tArtifactManager.java:37) >> at jdk.test.lib.artifacts.ArtifactResolver.resolve(ArtifactReso >> lver.java:54) >> at applications.jcstress.JcstressRunner.pathToArtifact(Jcstress >> Runner.java:53) >> ... 8 more >> >> Okay, brilliant! How do I configure this, if I run "make test"? >> >> CONF=linux-x86_64-normal-server-release LOG=info make test >> TEST="hotspot_all" >> >> >> -Aleksey >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.fuchs at oracle.com Wed Sep 13 09:28:39 2017 From: daniel.fuchs at oracle.com (Daniel Fuchs) Date: Wed, 13 Sep 2017 10:28:39 +0100 Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean In-Reply-To: <1595796.xe8q7Anfpj@pracovni> References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <1906851.CDFazpJ8ns@pracovni> <1595796.xe8q7Anfpj@pracovni> Message-ID: Hi Jaroslav, GraalMBeans.java: 77 @Override 78 public Set mbeanInterfaceNames() { 79 return Collections.singleton(name); 80 } This is not correct. The return set should be a set of MXBean interface names, as in Class.getName(), not a set of MXBean ObjectName strings. The interface in question must be implemented by the concrete MBean instance and must be a subclass of PlatformManagedObject. It is not required for an MBean to implement such an interface - if it doesn't then it simply won't be obtainable from ManagementFactory::getPlatformMXBean or ManagementFactory::getPlatformMXBeans. So I suspect that in your case, since mbeanInterfaces() returns an empty set then mbeanInterfaceNames() should also return an empty set. 
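The relationship between the two methods can be sketched in a few lines. This is an illustration under stated assumptions, not the JVMCI or Graal code: `Component` is a hypothetical stand-in that mirrors only the two methods discussed here, and the ObjectName string `example:type=Compiler` is invented. The check expresses the rule that the names are the `Class.getName()` values of the interfaces, so an empty interface set goes with an empty name set.

```java
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch; Component is a stand-in, not the real provider type.
final class MBeanInterfaceNamesSketch {

    interface Component {
        Set<Class<?>> mbeanInterfaces();
        Set<String> mbeanInterfaceNames();
    }

    // True when the reported names are exactly the class names of the
    // reported interfaces.
    static boolean namesMatchInterfaces(Component c) {
        Set<String> expected = c.mbeanInterfaces().stream()
                .map(Class::getName)
                .collect(Collectors.toSet());
        return expected.equals(c.mbeanInterfaceNames());
    }

    static Component component(Set<Class<?>> interfaces, Set<String> names) {
        return new Component() {
            @Override public Set<Class<?>> mbeanInterfaces() { return interfaces; }
            @Override public Set<String> mbeanInterfaceNames() { return names; }
        };
    }

    public static void main(String[] args) {
        // No interfaces and no names: consistent.
        Component fixed = component(Collections.emptySet(), Collections.emptySet());
        // No interfaces but an ObjectName-like string as a "name": inconsistent.
        Component buggy = component(Collections.emptySet(),
                Collections.singleton("example:type=Compiler"));
        System.out.println(namesMatchInterfaces(fixed));
        System.out.println(namesMatchInterfaces(buggy));
    }
}
```

A check of this shape could sit in a unit test for a provider, assuming the provider exposes both sets.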
IIRC mbeanInterfaceNames() was introduced so that you could query for a particular MBean implementing a given interface without necessarily triggering the loading and initialization of all interfaces implemented by all MBeans. best regards, -- daniel On 12/09/2017 18:44, Jaroslav Tulach wrote: > Dear reviewers, > after several reconsiderations I have webrev #4 ready for your review. Can you > please take a look at > > http://cr.openjdk.java.net/~jtulach/8182701/webrev.04/ > > and let me know if it is in a reasonable shape? Thanks a lot. > -jt > From jamsheed.c.m at oracle.com Thu Sep 14 06:11:36 2017 From: jamsheed.c.m at oracle.com (jamsheed) Date: Thu, 14 Sep 2017 11:41:36 +0530 Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL In-Reply-To: <3a500359-8838-a426-0cc6-4d9c886379b1@oracle.com> References: <8d91428c-061c-1601-2b99-99ed6544b13e@oracle.com> <3a500359-8838-a426-0cc6-4d9c886379b1@oracle.com> Message-ID: <19e7b493-b199-97f7-a8bf-619da902f9bc@oracle.com> Thank you for the review, agree, updated in place. Best Regards, Jamsheed On Wednesday 13 September 2017 11:22 PM, Dean Long wrote: > Don't you want to use #ifdef ASSERT instead? > > dl From dean.long at oracle.com Thu Sep 14 06:54:20 2017 From: dean.long at oracle.com (Dean Long) Date: Wed, 13 Sep 2017 23:54:20 -0700 Subject: [10] RFR: [AOT] assert(false) failed: DEBUG MESSAGE: InterpreterMacroAssembler::call_VM_base: last_sp != NULL In-Reply-To: <8eb949bd-fa14-2722-5eaa-21a0a0c95b26@oracle.com> References: <8eb949bd-fa14-2722-5eaa-21a0a0c95b26@oracle.com> Message-ID: <069663cb-4b48-483e-7d1b-8619dafe616d@oracle.com> It looks like you accidentally dropped hotspot-compiler-dev at openjdk.java.net when you added runtime. dl On 9/13/2017 11:21 PM, jamsheed wrote: > (adding runtime list for inputs) > > On Monday 11 September 2017 11:43 PM, jamsheed wrote: >> brief desc: special handling of Object. 
in >> TemplateInterpreter::deopt_reexecute_entry >> >> required last_sp to be reset explicitly in normal return path >> >> address TemplateInterpreter::deopt_reexecute_entry(Method* method, >> address bcp) { >> assert(method->contains(bcp), "just checkin'"); >> Bytecodes::Code code = Bytecodes::java_code_at(method, bcp); >> if (code == Bytecodes::_return) { >> // This is used for deopt during registration of finalizers >> // during Object.<init>. We simply need to resume execution at >> // the standard return vtos bytecode to pop the frame normally. >> // reexecuting the real bytecode would cause double registration >> // of the finalizable object. >> return _normal_table.entry(Bytecodes::_return).entry(vtos); > > last_sp != NULL is not an issue for this case, so I skip the assert in > debug build > > http://cr.openjdk.java.net/~jcm/8168712/webrev.01/ > > Please review. > > Best Regards, > Jamsheed > > > > > From magnus.ihse.bursie at oracle.com Thu Sep 14 07:40:53 2017 From: magnus.ihse.bursie at oracle.com (Magnus Ihse Bursie) Date: Thu, 14 Sep 2017 09:40:53 +0200 Subject: RFR(XL/M) : 8178788: wrap JCStress test suite as jtreg tests In-Reply-To: References: <9A2C94EA-89A3-4C75-9D3C-51E058BD8A1D@oracle.com> <342c3748-616f-8a7d-74f4-3ce929b1e0dc@redhat.com> Message-ID: <613877ee-1c56-3f85-138b-5e6ee320a08a@oracle.com> Stuart, On 2017-09-13 20:36, Stuart Monteith wrote: > For the record, if I put my build of jcstress.jar in $HOME, the > following allows the jcstress tests to run: > > make test TEST="hotspot_all" > EXTRA_JTREG_OPTIONS="-Djdk.test.lib.artifacts.jcstress-tests-all=$HOME/jcstress.jar" > > From "make help" and the makefiles themselves, I had expected the > following to work: > make TEST="hotspot_all" > JTREG="VM_OPTIONS=-Djdk.test.lib.artifacts.jcstress-tests-all=$HOME/jcstress.jar" > test > > but it does not - the JTREG parameter is apparently ignored. This is > unfortunate as there is a warning as it is a non-control variable.
You are using the "test" target, but the JTREG option is only available for the new "run-test" target. Using "test" will invoke the old testing framework, which is about to be replaced by a more modern and integrated one. In the long term, "test" will invoke the new framework, but during a transition period, "run-test" needs to be used. For what it's worth, I tried your command line (but without the patch applied) and verified using LOG=cmdlines that the -D option was indeed passed to jtreg. /Magnus > > Am I wrong in thinking that this was written for testing internall > within Oracle? I can not find an instance of > "com.oracle.jib.api.JibServiceFactory" in the OpenJDK project or > elsewhere. > > BR, > Stuart > > On 8 September 2017 at 16:53, Stuart Monteith > > wrote: > > Hello, > I've spent some time on this, and I have to admit that I'm > stumped. I get exactly the same errors on x86 on jdk10/hs and > jdk10/jdk10 with arecent build of JTReg and JT_HOME set appropriately. > > Are there any pointers on how this is supposed to be run? > > Thanks, > Stuart > > On 25 April 2017 at 11:47, Aleksey Shipilev > wrote: > > On 04/19/2017 12:12 AM, Igor Ignatyev wrote: > > > http://cr.openjdk.java.net/~iignatyev//8178788/webrev.00/index.html > > >> 69903 lines changed: 69903 ins; 0 del; 0 mod; > > (69524 lines are generated) > > > > Hi all, > > > > could you please review this patch which adds a jtreg test > wrapper for > > jcstress test suite and jtreg tests which run jsctress tests > thru this > > wrapper? > > > > webrev: > http://cr.openjdk.java.net/~iignatyev//8178788/webrev.00/index.html > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8178788 > testing: > > TL;DR: This patch introduces more problems than it solves. > Just run the jcstress > tests-all JAR against the tested runtime. > > Wrapping jcstress tests with jtreg defies the purpose of > jcstress harness -- > that is, running lots of tests as fast as it possibly could > without affecting > testing quality. 
For example, by cleverly reusing VMs between > the tests, using > Whitebox to deoptimize without restarting the VMs, etc. It > really wastes CPU > time to run each test in isolation. > > Also, it does not "automatically" work, which defies "easy to > run" goal: > > Caused by: java.io.FileNotFoundException: Couldn't > automatically resolve > dependency for jcstress-tests-all , revision 0.3 > Please specify the location using > jdk.test.lib.artifacts.jcstress-tests-all > at > jdk.test.lib.artifacts.DefaultArtifactManager.resolve(DefaultArtifactManager.java:37) > at > jdk.test.lib.artifacts.ArtifactResolver.resolve(ArtifactResolver.java:54) > at > applications.jcstress.JcstressRunner.pathToArtifact(JcstressRunner.java:53) > ... 8 more > > Okay, brilliant! How do I configure this, if I run "make test"? > > CONF=linux-x86_64-normal-server-release LOG=info make test > TEST="hotspot_all" > > > -Aleksey > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart.monteith at linaro.org Thu Sep 14 10:26:53 2017 From: stuart.monteith at linaro.org (Stuart Monteith) Date: Thu, 14 Sep 2017 11:26:53 +0100 Subject: RFR(XL/M) : 8178788: wrap JCStress test suite as jtreg tests In-Reply-To: <613877ee-1c56-3f85-138b-5e6ee320a08a@oracle.com> References: <9A2C94EA-89A3-4C75-9D3C-51E058BD8A1D@oracle.com> <342c3748-616f-8a7d-74f4-3ce929b1e0dc@redhat.com> <613877ee-1c56-3f85-138b-5e6ee320a08a@oracle.com> Message-ID: Thank you Magnus, that's useful. My working command line is: make run-test TEST="hotspot_all" JTREG="VM_OPTIONS=-Djdk.test.lib.artifacts.jcstress-tests-all=$HOME/jcstress.jar" but it takes an excessively long time to run. Are there plans for a means to cleanly disable the JCStress JTreg tests so we can efficiently run them with the JCStress harness? 
Thanks, Stuart On 14 September 2017 at 08:40, Magnus Ihse Bursie < magnus.ihse.bursie at oracle.com> wrote: > Stuart, > > On 2017-09-13 20:36, Stuart Monteith wrote: > > For the record, if I put my build of jcstress.jar in $HOME, the following > allows the jcstress tests to run: > > make test TEST="hotspot_all" EXTRA_JTREG_OPTIONS="-Djdk. > test.lib.artifacts.jcstress-tests-all=$HOME/jcstress.jar" > > From "make help" and the makefiles themselves, I had expected the follow > to work: > make TEST="hotspot_all" JTREG="VM_OPTIONS=-Djdk.test. > lib.artifacts.jcstress-tests-all=$HOME/jcstress.jar" test > > but it does not - the JTREG parameter is apparently ignored. This is > unfortunate as there is a warning as it is a non-control variable. > > > You are using the "test" target, but the JTREG option is only available > for the new "run-test" target. Using "test" will invoke the old testing > framework, which is about to be replaced by a more modern and integrated > one. In the long term, "test" will invoke the new framework, but during a > transition period, "run-test" needs to be used. > > For what it's worth, I tried your command line (but without the patch > applied) and verified using LOG=cmdlines that the -D option was indeed > passed to jtreg. > > /Magnus > > > > Am I wrong in thinking that this was written for testing internall within > Oracle? I can not find an instance of "com.oracle.jib.api.JibServiceFactory" > in the OpenJDK project or elsewhere. > > BR, > Stuart > > On 8 September 2017 at 16:53, Stuart Monteith > wrote: > >> Hello, >> I've spent some time on this, and I have to admit that I'm stumped. I >> get exactly the same errors on x86 on jdk10/hs and jdk10/jdk10 with arecent >> build of JTReg and JT_HOME set appropriately. >> >> Are there any pointers on how this is supposed to be run? 
>> >> Thanks, >> Stuart >> >> On 25 April 2017 at 11:47, Aleksey Shipilev wrote: >> >>> On 04/19/2017 12:12 AM, Igor Ignatyev wrote: >>> > http://cr.openjdk.java.net/~iignatyev//8178788/webrev.00/index.html >>> >> 69903 lines changed: 69903 ins; 0 del; 0 mod; >>> > (69524 lines are generated) >>> > >>> > Hi all, >>> > >>> > could you please review this patch which adds a jtreg test wrapper for >>> > jcstress test suite and jtreg tests which run jsctress tests thru this >>> > wrapper? >>> > >>> > webrev: http://cr.openjdk.java.net/~iignatyev//8178788/webrev.00/ind >>> ex.html >>> > JBS: https://bugs.openjdk.java.net/browse/JDK-8178788 testing: >>> >>> TL;DR: This patch introduces more problems than it solves. Just run the >>> jcstress >>> tests-all JAR against the tested runtime. >>> >>> Wrapping jcstress tests with jtreg defies the purpose of jcstress >>> harness -- >>> that is, running lots of tests as fast as it possibly could without >>> affecting >>> testing quality. For example, by cleverly reusing VMs between the tests, >>> using >>> Whitebox to deoptimize without restarting the VMs, etc. It really wastes >>> CPU >>> time to run each test in isolation. >>> >>> Also, it does not "automatically" work, which defies "easy to run" goal: >>> >>> Caused by: java.io.FileNotFoundException: Couldn't automatically resolve >>> dependency for jcstress-tests-all , revision 0.3 >>> Please specify the location using jdk.test.lib.artifacts.jcstres >>> s-tests-all >>> at >>> jdk.test.lib.artifacts.DefaultArtifactManager.resolve(Defaul >>> tArtifactManager.java:37) >>> at jdk.test.lib.artifacts.ArtifactResolver.resolve(ArtifactReso >>> lver.java:54) >>> at applications.jcstress.JcstressRunner.pathToArtifact(Jcstress >>> Runner.java:53) >>> ... 8 more >>> >>> Okay, brilliant! How do I configure this, if I run "make test"? 
>>>
>>> CONF=linux-x86_64-normal-server-release LOG=info make test TEST="hotspot_all"
>>>
>>>
>>> -Aleksey
>>>
>>>
>>
>
>

From gustavo.scalet at eldorado.org.br Thu Sep 14 18:59:57 2017
From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet)
Date: Thu, 14 Sep 2017 18:59:57 +0000
Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
In-Reply-To: <0ef23b5fcbc54996aea876d4c60e4097@sap.com>
References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br> <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com> <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br> <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com> <0ef23b5fcbc54996aea876d4c60e4097@sap.com>
Message-ID:

Hi Martin,

Short question about the Montgomery backport to JDK8:

> > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31
> > 2017). The reported performance speedup was calculated by running the
> > following test (TestSquareToLen.java):
>
> Seems like JDK-8145913 has not been backported, yet. Sorry for not
> checking this earlier. So if you want to make RSA really fast, it should
> be so much better to backport that one. But I can still sponsor this
> change as it may be used elsewhere.

Do we have a bug ID for the Montgomery backport to JDK8 so I can track it? If not, should I request that backport or request this intrinsic to be backported (once it's inside of JDK10)? I'm interested in this kind of performance gain.

Thanks

>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> Sent: Dienstag, 29.
August 2017 22:37 > To: Doerr, Martin ; 'hotspot-compiler- > dev at openjdk.java.net' > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > SquareToLen intrinsics > > Hi Martin, > > New changes: > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > Check comments below, please. > > > -----Original Message----- > > From: Doerr, Martin > > > > 1. Sign extending offset and len > > Right, sign and zero extending is equivalent for offset and len > > because they are guaranteed to be >=0 (by checks in Java). But you can > > only rely on bit 32 (IBM notation) to be 0. Bit 0-31 may contain > garbage. > > rldicl was incorrect. My mistake, sorry for that. Correct would be > > rldic which also clears the least significant bits. > > len should also get fixed e.g. by replacing cmpdi by extsw_ in muladd. > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to > need further changes as it's being cleared with clrldi, which is the > same as rldic with no shift. Therefore it's treated appropriately as > requested for "offset" parameter. Do you agree? > > > 2. Using 8 byte instructions for int > > The code which feeds stdu is endianess specific. Doesn't work on all > > PPC64 platforms. > > You are right. The way I'm building the 64 bits of the register depends > on which kind of endianness it is run. For now it works only on little > endian so I'm adding a switch (just like I did for SHA) to make it > available only on little endian systems. > > > 3.Regarding Andrew's point: Superseded by Montgomery? > > The Montgomery change got backported to jdk8u (JDK-8150152 in 8u102). > > I'd expect the performance improvement of these intrinsics to be > > irrelevant for crypto.rsa. Did you measure with an older jdk8 release? > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31 > 2017). 
The reported performance speedup was calculated by running the
> following test (TestSquareToLen.java):
> import java.math.BigInteger;
>
> public class TestSquareToLen {
>
>     public static void main(String args[]) throws Exception {
>
>         // Number of squarings; can be overridden by the first argument.
>         int n = 10000000;
>         if (args.length >= 1) {
>             n = Integer.parseInt(args[0]);
>         }
>
>         // The literal is concatenated only because the mail wrapped it.
>         BigInteger b1 = new BigInteger(
>             "348939809235573590863505149820825039200022983118773208599936"
>             + "739559418380102146884307139175604920787313701663155983793121475492609222"
>             + "378029211020760922327218480828933663005773596942372680852064103011811651"
>             + "644018048833823482390819947896524207635857984552089977996313113154016668"
>             + "718795349783157384006672542605760392289645528307");
>         BigInteger b2 = BigInteger.valueOf(0);
>         BigInteger check = BigInteger.valueOf(1);
>         for (int i = 0; i < n; i++) {
>             b2 = b1.multiply(b1);
>             if (i == 0)
>                 // Didn't JIT yet. Comparing against interpreted mode
>                 check = b2;
>         }
>         if (b2.compareTo(check) == 0)
>             System.out.println("Check ok!");
>         else
>             System.out.println("Check failed!");
>     }
> }
>
>
> I got these results on JDK8 on my POWER8 machine:
> $ ./javac TestSquareToLen.java
> $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen
> Check ok!
> Check ok!
> Check ok!
> Check ok!
> Check ok!
>
>  Performance counter stats for './java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
>
>       15148.009557      task-clock (msec)         #    1.053 CPUs utilized            ( +-  0.48% )
>              2,425      context-switches          #    0.160 K/sec                    ( +-  5.84% )
>                356      cpu-migrations            #    0.023 K/sec                    ( +-  3.01% )
>              5,153      page-faults               #    0.340 K/sec                    ( +-  5.22% )
>     54,536,889,909      cycles                    #    3.600 GHz                      ( +-  0.56% )  (66.68%)
>        239,554,105      stalled-cycles-frontend   #    0.44% frontend cycles idle     ( +-  4.87% )  (49.90%)
>     27,683,316,001      stalled-cycles-backend    #   50.76% backend cycles idle      ( +-  0.56% )  (50.17%)
>    102,020,229,733      instructions              #    1.87  insn per cycle
>                                                   #    0.27  stalled cycles per insn  ( +-  0.14% )  (66.94%)
>      7,706,072,218      branches                  #  508.718 M/sec                    ( +-  0.23% )  (50.20%)
>        456,051,162      branch-misses             #    5.92% of all branches          ( +-  0.09% )  (50.07%)
>
>       14.390840733 seconds time elapsed                                          ( +-  0.09% )
>
> $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen
> Check ok!
> Check ok!
> Check ok!
> Check ok!
> Check ok!
>
>  Performance counter stats for './java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
>
>       11368.141410      task-clock (msec)         #    1.045 CPUs utilized            ( +-  0.64% )
>              1,964      context-switches          #    0.173 K/sec                    ( +-  8.93% )
>                338      cpu-migrations            #    0.030 K/sec                    ( +-  7.65% )
>              5,627      page-faults               #    0.495 K/sec                    ( +-  6.15% )
>     41,100,168,967      cycles                    #    3.615 GHz                      ( +-  0.50% )  (66.36%)
>        309,052,316      stalled-cycles-frontend   #    0.75% frontend cycles idle     ( +-  2.84% )  (49.89%)
>     14,188,581,685      stalled-cycles-backend    #   34.52% backend cycles idle      ( +-  0.99% )  (50.34%)
>     77,846,029,829      instructions              #    1.89  insn per cycle
>                                                   #    0.18  stalled cycles per insn  ( +-  0.29% )  (66.96%)
>      8,435,216,989      branches                  #  742.005 M/sec                    ( +-  0.28% )  (50.17%)
>        339,903,936      branch-misses             #    4.03% of all branches          ( +-  0.27% )  (49.90%)
>
>       10.882357546 seconds time elapsed                                          ( +-  0.24% )
>
>
> (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s (+- 0.53%) on JDK10)
>
> I may run for SpecJVM2008's crypto.rsa if you are interested.
>
> Thank you once again for reviewing this.
>
> Best regards,
> Gustavo
>
> > (I think the change is still acceptable as the intrinsics could be
> > used elsewhere and the implementation also exists on other platforms.)
> >
> > Best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> > Sent: Wednesday, 16 August 2017 18:50
> > To: Doerr, Martin; 'hotspot-compiler-dev at openjdk.java.net'
> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
> >
> > Hi Martin,
> >
> > Thanks for dedicated review. It took me a while to be able to work on
> > this but I hope to have your points solved.
Please check below the > > review as well as my comments quoting your email: > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/ > > > > > -----Original Message----- > > > First of all, C2 does not perform sign extend when calling stubs. > > > The int parms need to get zero/sign extended. (Could even be done > > > without extra instructions by replacing sldi -> rldicl, cmpdi -> > > > extsw_ in some > > > cases.) > > > > Does it make a difference on my case? > > > > I guess you are talking about mulAdd preparation code. The only aspect > > I found about him is to force the cast from 32 bits -> 64 bits by > > cleaning higher bits. Offset is a signed integer but it can't be > negative anyway. > > > > So I changed from: > > sldi (R5_ARG3, R5_ARG3, 2); > > > > to: > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > macroAssembler_ppc.cpp: > > > - Indentation should be 2 spaces. > > > > Done > > > > > > > stubGenerator_ppc:cpp: > > > - or_, addi_ should get replaced by orr, addi when CR0 result is not > > > needed. > > > > Done > > > > > - Where is lplw initialized? > > > > It should be initialized with 0, I missed that... > > > > > - I believe that the updating load/store instructions e.g. lwzu > > > don't perform well on some processors. At least using stwu 2 times > > > in the loop doesn't make sense. > > > > You are right. I could manipulate the bits differently and ended up > > with a single stdu in the loop. Neat! Although I could not reduce the > > total number of instructions. > > > > > - Note: It should be possible to use 8 byte instead of 4 byte > > > instructions: MacroAssembler::multiply64, addc, adde. But I'm not > > > requesting to change that because I guess it would make the code > > > very complicated, especially when supporting both endianess > versions. > > > > Yes, that would require a new analysis on this code. May we consider > > it next? 
As you said, I prefer having an initial version that looks as > > simple as the original java code. > > > > > - The squareToLen stub implementation is very close the Java > > > implementation. So it'd be interesting to understand what C2 doesn't > > > do as well as the hand written assembly code. Do you know that? (Not > > > absolutely necessary for accepting this change as long as the stub > > > is measurably faster.) > > > > I don't know either. Basically I chose doing it because I noticed some > > performance gain on SpecJVM2008 when analyzing X64. Then, taking a > > closer look, I didn't notice any AVX or some special instructions on > > X64 so I decided to try it on ppc64 by using some basic assembly. > > > > Thanks > > > > > > > > Best regards, > > > Martin > > > > > > > > > -----Original Message----- > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > Sent: Donnerstag, 10. August 2017 19:22 > > > To: 'hotspot-compiler-dev at openjdk.java.net' > > dev at openjdk.java.net> > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > > > > > > > -----Original Message----- > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev- > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > Sent: ter?a-feira, 8 de agosto de 2017 17:19 > > > To: ppc-aix-port-dev at openjdk.java.net > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > Hi, > > > > > > Could you please review this specific PPC64 change to hotspot? By > > > implementing these intrinsics I noticed a small improvement with > > > microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, > > > only when backporting to JDK8 an improvement was noticed. 
> > >
> > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> > > Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> > >
> > > Motivation for this implementation:
> > > https://twitter.com/ijuma/status/698309312498835457
> > >
> > > Best regards,
> > > Gustavo Serra Scalet

From jaroslav.tulach at oracle.com Fri Sep 15 12:32:04 2017
From: jaroslav.tulach at oracle.com (Jaroslav Tulach)
Date: Fri, 15 Sep 2017 14:32:04 +0200
Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean
In-Reply-To:
References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <1595796.xe8q7Anfpj@pracovni>
Message-ID: <2523912.QeZCepiec8@pracovni>

Thanks for the review Mandy, Daniel. Now that the consolidated JDK10 repository is available, I have updated my webrev to its structure. In addition to that I addressed your comments:

On Tuesday, 12 September 2017 11:19:45 CEST mandy chung wrote:
> ./make/common/Modules.gmk
> Nit: can you move jdk.internal.vm.compiler.management to keep the
> list in alphabetical order

Inserted at appropriate place.

> 199 # Filter out Graal specific modules if Graal build is disabled
> 200
> 201 ifeq ($(INCLUDE_GRAAL), false)
> 202   MODULES_FILTER += jdk.internal.vm.compiler
> 203 endif
>
> When will INCLUDE_GRAAL be set to false? I think
> jdk.internal.vm.compiler.management should also be filtered if
> jdk.internal.vm.compiler is disabled.

That is probably true. Fixed.

>
> Is jdk.internal.vm.compiler and jdk.internal.vm.compiler.management
> built for all platforms in JDK 10? If not,
> jdk/src/java.management/share/classes/module-info.java may fail to
> compile when jdk.internal.vm.compiler.management is not present. We
> can consult with the build team when you find out what configuration
> that jdk.internal.vm.compiler is not built.

I haven't found configuration where jdk.internal.vm.compiler wouldn't be built. However I wasn't looking very extensively...
> hotspot/src/jdk.internal.vm.compiler/share/classes/module-info.java
> 29 requires transitive jdk.internal.vm.ci;
> do you get any error without this requires transitive?
> jdk.internal.vm.compiler.management already requires
> jdk.internal.vm.ci. I would think this requires transitive is not
> necessary.

Looks like this change isn't necessary. I am not sure what was the problem before, when I introduced it.

> Is HotSpotGraalCompiler::mbean method necessary? In GraalMBeans.java
>
> 53 public static Object findGraalRuntimeBean() {
> 54     JVMCIRuntime r = JVMCI.getRuntime();
> 55     JVMCICompiler c = r.getCompiler();
> 56     if (c instanceof HotSpotGraalCompiler) {
> 57         return ((HotSpotGraalCompiler) c).mbean();
> 58     }
> 59     return null;
> 60 }
>
> It seems that you can call HotspotGraalRuntime::mbean directly.

I don't think I can. There is no way to get to HotspotGraalRuntime except asking the HotSpotGraalCompiler. The HotspotGraalRuntime isn't JVMCIRuntime... At least I think so; there are slightly too many runtimes and providers in the codebase for my taste. However, that isn't something I can change as part of JDK-8182701.

> As we
> discussed offline, we agree that HotSpotRuntimeMBean should belong to
> this new module but it requires some refactoring which may take some
> amount of work. Such clean up will be followed up in a separate JBS issue.

Right.

> GraalMBeans.java:
>
> 77 @Override
> 78 public Set mbeanInterfaceNames() {
> 79     return Collections.singleton(name);
> 80 }
>
> This is not correct. The return set should be a set of
> MXBean interface names, as in Class.getName(), not a set
> of MXBean ObjectName strings.

I see. Thanks.

On Wednesday, 13 September 2017 8:23:22 CEST mandy chung wrote:
> On 9/13/17 2:28 AM, Daniel Fuchs wrote:
> Good catch, Daniel. This should return empty set as mbeanInterfaces()
> returns. mbeanInterfaceNames returns the class name of the mbean
> interfaces.

OK, returning empty set.
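
The distinction raised in the review (interface names as returned by Class.getName() versus ObjectName strings) can be sketched roughly as follows. This is only an illustration under stated assumptions: `DemoRuntimeMXBean` and the `org.example:type=DemoRuntime` name are hypothetical and do not come from the webrev.

```java
import java.util.Collections;
import java.util.Set;

public class MBeanInterfaceNamesSketch {

    // Hypothetical MXBean interface, standing in for the real Graal runtime bean interface.
    public interface DemoRuntimeMXBean {
        String getName();
    }

    // Hypothetical ObjectName string under which such a bean might be registered.
    static final String OBJECT_NAME = "org.example:type=DemoRuntime";

    // The shape the review flagged as wrong: a set of ObjectName strings.
    public static Set<String> objectNameStrings() {
        return Collections.singleton(OBJECT_NAME);
    }

    // The shape the review asks for: interface names as given by Class.getName().
    public static Set<String> interfaceNames() {
        return Collections.singleton(DemoRuntimeMXBean.class.getName());
    }

    public static void main(String[] args) {
        System.out.println("ObjectName strings: " + objectNameStrings());
        System.out.println("Interface names:    " + interfaceNames());
    }
}
```

An ObjectName identifies one registered bean instance, while the interface name identifies a Java type, so the two sets are not interchangeable.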
The webrev #5 is available at http://cr.openjdk.java.net/~jtulach/8182701/webrev.05/

-jt

From doug.simon at oracle.com Fri Sep 15 12:47:00 2017
From: doug.simon at oracle.com (Doug Simon)
Date: Fri, 15 Sep 2017 14:47:00 +0200
Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean
In-Reply-To: <2523912.QeZCepiec8@pracovni>
References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <1595796.xe8q7Anfpj@pracovni> <2523912.QeZCepiec8@pracovni>
Message-ID:

> On 15 Sep 2017, at 14:32, Jaroslav Tulach wrote:
>
> Thanks for the review Mandy, Daniel. Now, that the consolidated JDK10
> repository is available, I have updated my webrev to its structure. In
> addition to that I addressed your comments:
>
> On Tuesday, 12 September 2017 11:19:45 CEST mandy chung wrote:
>> ./make/common/Modules.gmk
>> Nit: can you move jdk.internal.vm.compiler.management to keep the
>> list in alphabetical order
>
> Inserted at appropriate place.
>
>> 199 # Filter out Graal specific modules if Graal build is disabled
>> 200
>> 201 ifeq ($(INCLUDE_GRAAL), false)
>> 202   MODULES_FILTER += jdk.internal.vm.compiler
>> 203 endif
>>
>> When will INCLUDE_GRAAL be set to false? I think
>> jdk.internal.vm.compiler.management should also be filtered if
>> jdk.internal.vm.compiler is disabled.
>
> That is probably true. Fixed.
>
>>
>> Is jdk.internal.vm.compiler and jdk.internal.vm.compiler.management
>> built for all platforms in JDK 10? If not,
>> jdk/src/java.management/share/classes/module-info.java may fail to
>> compile when jdk.internal.vm.compiler.management is not present. We
>> can consult with the build team when you find out what configuration
>> that jdk.internal.vm.compiler is not built.
>
> I haven't found configuration where jdk.internal.vm.compiler wouldn't be built.
> However I wasn't looking very extensively...
>
>> hotspot/src/jdk.internal.vm.compiler/share/classes/module-info.java
>> 29 requires transitive jdk.internal.vm.ci;
>> do you get any error without this requires transitive?
>> jdk.internal.vm.compiler.management already requires
>> jdk.internal.vm.ci. I would think this requires transitive is not
>> necessary.
>
> Looks like this change isn't necessary. I am not sure what was the problem
> before, when I introduced it.
>
>> Is HotSpotGraalCompiler::mbean method necessary? In GraalMBeans.java
>>
>> 53 public static Object findGraalRuntimeBean() {
>> 54     JVMCIRuntime r = JVMCI.getRuntime();
>> 55     JVMCICompiler c = r.getCompiler();
>> 56     if (c instanceof HotSpotGraalCompiler) {
>> 57         return ((HotSpotGraalCompiler) c).mbean();
>> 58     }
>> 59     return null;
>> 60 }
>>
>> It seems that you can call HotspotGraalRuntime::mbean directly.
>
> I don't think I can. There is no way to get to HotspotGraalRuntime except
> asking the HotSpotGraalCompiler. The HotspotGraalRuntime isn't JVMCIRuntime...
> At least I think so

That is correct - you have to obtain a HotspotGraalRuntime from a HotSpotGraalCompiler. There can be other HotspotGraalRuntime instances (e.g., Truffle has its own HotSpotGraalCompiler/HotSpotGraalRuntime).
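
The lookup path discussed above (the runtime is only reachable through its compiler, so the compiler forwards the mbean() call) might be sketched like this. All types here are hypothetical stand-ins, not the real JVMCI or Graal classes, and the returned bean is just a placeholder string.

```java
public class RuntimeLookupSketch {

    // Stand-in for JVMCICompiler (hypothetical, not the real interface).
    public interface Compiler {
    }

    // Stand-in for HotSpotGraalRuntime: owns the bean, but is not reachable directly.
    public static final class GraalRuntime {
        public Object mbean() {
            return "runtime-bean"; // placeholder for the real MBean object
        }
    }

    // Stand-in for HotSpotGraalCompiler: the only holder of its runtime.
    public static final class GraalCompiler implements Compiler {
        private final GraalRuntime runtime = new GraalRuntime();

        // Forwarding method; this is why two mbean() methods exist in this sketch.
        public Object mbean() {
            return runtime.mbean();
        }
    }

    // Some other compiler that exposes no bean.
    public static final class OtherCompiler implements Compiler {
    }

    // Mirrors the shape of the quoted findGraalRuntimeBean(): check the type, then delegate.
    public static Object findRuntimeBean(Compiler c) {
        if (c instanceof GraalCompiler) {
            return ((GraalCompiler) c).mbean();
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findRuntimeBean(new GraalCompiler()));
        System.out.println(findRuntimeBean(new OtherCompiler()));
    }
}
```

Because each compiler instance holds its own runtime, asking the active compiler is what selects the right runtime when several exist.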
-Doug From martin.doerr at sap.com Fri Sep 15 16:10:55 2017 From: martin.doerr at sap.com (Doerr, Martin) Date: Fri, 15 Sep 2017 16:10:55 +0000 Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic In-Reply-To: <2726388e5b224cdcb75a9c0931ec1e44@sap.com> References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59A9850B.7030302@linux.vnet.ibm.com> <3e64d11f046f44379f9658dffc766a45@sap.com> <670dd284fe77479986abe75aca42b20a@sap.com> <875858bb7bda421b97352d0fb2972a73@serv031.corp.eldorado.org.br> <2726388e5b224cdcb75a9c0931ec1e44@sap.com> Message-ID: <9bf856aaba4049b0940f5e2f36bf91f4@sap.com> Hi Gustavo, I had to make some more changes to get it working on linux BE and AIX. New webrev: http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.05/ Changes: - Fixed remaining lvsr/lvsl issues for BE - 16 byte alignment doesn't work with xlC on AIX. Workaround: use malloc on AIX - Changed address computations to provide more freedom for out-of-order execution - Removed some more unused stuff Please take a look. Maybe you find some further improvements. You may want to rerun tests and benchmarks. Tests have passed on linux BE and LE as well as on AIX. We'll run them again over the weekend. Best regards, Martin -----Original Message----- From: Doerr, Martin Sent: Dienstag, 12. September 2017 18:32 To: 'Gustavo Serra Scalet' ; Lindenmaier, Goetz ; Gustavo Romero Cc: 'hotspot-compiler-dev at openjdk.java.net' ; ppc-aix-port-dev at openjdk.java.net Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic Hi Gustavo, thanks for debugging. It should fix the problem for LE, but the previous version was correct for BE. 
Version which works for both:
http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.04/

Changes:
- factored out lvsr/lvsl
- fixed ofs <= limit comparison (treat as positive ints to ignore garbage in high half and be protected against integer overflow, use <= see DigestBase.java)
- removed unused labels
- added the contributors you mentioned to Contributed-by list (not comments, I think it's better there)

Please let us know if this is ok for you. We'll do some more testing on all platforms.

Best regards,
Martin


-----Original Message-----
From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
Sent: Tuesday, 12 September 2017 17:48
To: Doerr, Martin; Lindenmaier, Goetz; Gustavo Romero
Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic

Hi Martin and Götz,

I was taking a closer look at the hotspot's tests/compiler and I see indeed one test failing for sha:

Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnSupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnUnsupportedCPU.java
Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnSupportedCPU.java
FAILED: compiler/intrinsics/sha/TestSHA.java
Passed: compiler/intrinsics/sha/sanity/TestSHA1Intrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA256Intrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA1MultiBlockIntrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA512Intrinsics.java
Passed: 
compiler/intrinsics/sha/sanity/TestSHA512MultiBlockIntrinsics.java
Passed: compiler/intrinsics/sha/sanity/TestSHA256MultiBlockIntrinsics.java

That one was failing due to a bug on unaligned memory load. I took a closer look and fixed it. It should also work on Big Endian:
https://gut.github.io/openjdk/webrev/JDK-8185979/webrev.02/index.html

This new webrev was updated on top of Martin's webrev.03. I also took this chance to add all the contributors to this patch, as you suggested before.

Thanks

> -----Original Message-----
> From: Doerr, Martin [mailto:martin.doerr at sap.com]
> Sent: Monday, 11 September 2017 14:06
> To: Lindenmaier, Goetz; Gustavo Romero; Gustavo Serra Scalet
> Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
>
> Hi Götz and Gustavo,
>
> I had just posted the version I had before leaving. Thanks for your
> feedback.
>
> New webrev is here:
> http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.03/
>
> Changes to webrev.02:
> - Referenced paper
> - Factored out endianness specific vector permute instructions (vec_perm
>   with only 3 parms to reduce risk of mixing them up)
> - Removed code for PPC64 platforms which didn't support it
> - code_size2 = 22000
> - Added missing ')' in IntrinsicPredicates.java
>
> My changes shouldn't change the behavior of the little endian
> implementation.
> We have to check if and if yes which tests still fail. Are there any
> updates on this?
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Lindenmaier, Goetz
> Sent: Thursday, 7 September 2017 09:55
> To: Gustavo Romero; Doerr, Martin; Gustavo Serra Scalet
> Cc: 'hotspot-compiler-dev at openjdk.java.net'; ppc-aix-port-dev at openjdk.java.net
> Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
>
> Hi,
>
> I had a look at this change.
> > Martin, you missed a ')' in IntrinsicPredicates.java. > > Combined with the multiplyToLen change, stub codebuffer space runs out. > Please increase > code_size2 = 20000 > to 22000 in stubRoutines_ppc.hpp. > > I see TestSHA.java failing on linuxppc64le. > Also, other tests are failing with SHA-256 digest error ... > > Also, on aix, some of our internal tests are failing. These didn't run > on linuxppc64 on a Power8 machine, so it might fail there, too. But on > the big endian platforms, the jtreg tests don't fail. > > @Gustavo, maybe you can have a look at the issues on linuxppc64le and > post a new webrev. Then Martin can fix the remaining issue on big > endian. > > Best regards, > Goetz. > > > > > > -----Original Message----- > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > bounces at openjdk.java.net] On Behalf Of Gustavo Romero > > Sent: Freitag, 1. September 2017 18:04 > > To: Doerr, Martin ; Gustavo Serra Scalet > > > > Cc: 'hotspot-compiler-dev at openjdk.java.net' > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > Hi Martin! > > > > On 01-09-2017 12:39, Doerr, Martin wrote: > > > Hi Gustavos, > > > > > > I have managed to upload a version which seems to work on both > > endianness implementations. > > > At least some quick tests have passed on AIX and Big Endian linux in > > addition to Little Endian linux. > > > > Great! :-) > > > > > > > I'll be out next week, but the change looks ok for me. Please let me > > > know if > > the changed version still looks ok for you, too. Feel free to overwork > > or improve it. > > > It'd also be good to know, if relying on vrsave=-1 is safe. > > > > Sure, Martin. I'm chasing what's exactly setting vrsave=-1 and the > > full history log (looks like it's not in the kernel, but I'm checking > yet). > > > > > > > Is the copyright information ok? 
Did you get source code which > > > requires to > > be mentioned in the comments? > > > The code looks similar to a reference implementation, so the authors > > > of it > > may want to be mentioned? > > > Or did you just use the paper for implementing it? In this case, I'd > > > mention > > the paper. > > > > Gustavo S: the information on the paper must be updated accordingly as > > Martin noted in the new webrev. There is none currently. > > > > > > > After we got a second review and ran more tests, we can ask somebody > > from Oracle to push it. > > > > > > Thanks for contributing and your support, Martin > > > > Thanks a lot for reviewing and for all the help. > > > > Regards, > > Gustavo R > > > > > > > > -----Original Message----- > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > bounces at openjdk.java.net] On Behalf Of Doerr, Martin > > > Sent: Donnerstag, 31. August 2017 18:21 > > > To: Gustavo Romero > > > Cc: 'hotspot-compiler-dev at openjdk.java.net' > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > > > Hi Gustavo R, > > > > > > I guess you're right. vrsave is already set to -1, so all Vector > > > Registers get > > saved. > > > It'd be good to know where it is set (OS, Flag in ELF header, ???) > > > and if this > > is guaranteed. > > > I don't want to risk getting sporadic errors on some OS versions. > > > > > > I'd like to enable SHA intrinsics on linux BE as well. I already > > > managed to get > > the 256 bit version working (was quite some work!). > > > > > > Thanks and best regards, > > > Martin > > > > > > > > > -----Original Message----- > > > From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] > > > Sent: Freitag, 25. 
August 2017 22:35 > > > To: Doerr, Martin > > > Cc: Gustavo Serra Scalet ; 'hotspot- > > compiler-dev at openjdk.java.net' > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > > > Hi Martin, > > > > > > On 25-08-2017 13:18, Doerr, Martin wrote: > > >> I think you didn't get my point about AIX. > > >> Your current version doesn't break AIX, but it lacks SHA2 > > >> acceleration for > > AIX on Power 8 and newer, which is still relevant. > > >> So I'd like to ask you kindly to take a look if Big Endian support > > >> for the stub > > could be added without high effort. AIX doesn't need VRSAVE handling > > (like Little Endian linux, unlike Big Endian linux), so a few lines in > > the stub could possibly be enough. I can assist with testing. > > > > > > I don't think that VRSAVE is handled on Linux, even on BE. Although > > > BE ABI > > [1] > > > says: > > > > > > "Functions must ensure that the appropriate bits in the vrsave > > > register are > > set for any vector registers they use" > > > > > > and LE ABI does not say that, even on Linux BE VRSAVE is not in > > > effect used to determine which vector registers (VMX/Altivec) should > > > be > > saved/restored. > > > No application uses it on Linux, so I would say that VRSAVE is > > > ignored on > > Linux > > > completely both on BE and LE. save/restore library interfaces don't > > > pay attention to it in glibc: VRSAVE is just saved/restored > > > completely in > > mechanisms > > > of swap/get/setcontext(), set/longjump(), and dl-trampoline() and > > > that's > > all. I > > > checked that with toolchain folks and they agree. We've already > > > discussed > > that a > > > long time ago but at that time I was just using the vector-scalar > > > registers [2] and at that time I agreed that if VMX/Altivec was in > > > use instead of the VSX > > so > > > VRSAVE should be handled accordingly. 
But I have a different opinion now...
> > >
> > > I'm wondering if something would really break on Linux BE if we forget about
> > > VRSAVE at all in the JVM. If not, we could forget about VRSAVE forever on Linux.
> > > Looks like VRSAVE was sort of born to the oblivion... ?
> > >
> > >
> > > Kind regards,
> > > Gustavo
> > >
> > > [1] http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
> > > [2] http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002508.html
> > >

From mandy.chung at oracle.com Fri Sep 15 17:53:45 2017
From: mandy.chung at oracle.com (mandy chung)
Date: Fri, 15 Sep 2017 10:53:45 -0700
Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean
In-Reply-To:
References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <1595796.xe8q7Anfpj@pracovni> <2523912.QeZCepiec8@pracovni>
Message-ID: <48a815cd-2943-d328-d955-f301d39d7b86@oracle.com>

On 9/15/17 5:47 AM, Doug Simon wrote:
>> I don't think I can. There is no way to get to HotspotGraalRuntime except
>> asking the HotSpotGraalCompiler. The HotspotGraalRuntime isn't JVMCIRuntime...
>> At least I think so
> That is correct - you have to obtain a HotspotGraalRuntime from a HotSpotGraalCompiler. There can be other HotspotGraalRuntime instances (e.g., Truffle has its own HotSpotGraalCompiler/HotSpotGraalRuntime).

Ah... that explains why you need two mbean() methods in this version.
> http://cr.openjdk.java.net/~jtulach/8182701/webrev.05/index.html

Looks good.

Mandy

From chris.plummer at oracle.com Sat Sep 16 18:58:56 2017
From: chris.plummer at oracle.com (Chris Plummer)
Date: Sat, 16 Sep 2017 11:58:56 -0700
Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837
In-Reply-To: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com>
References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com>
Message-ID:

Hi Yasumasa,

Is this on a 32-bit system? I don't see how you could otherwise call getCIntegerField() on a long type. jlong is always 64-bit and long is (generally) 32-bit on 32-bit systems, and 64-bit on 64-bit systems, at least that seems to be the case with linux.

From what I can see, _stack_traversal_mark is now the only long type in vmStructs.cpp. I don't know that we have a mechanism to safely fetch it on both 32-bit and 64-bit systems.

_stack_traversal_mark seems to be a long because _traversals is also a long.

    static long      _traversals;                   // Stack scan count, also sweep ID.

This too might be considered a bug. I'm not sure why you would want the size of this field to vary between 32-bit and 64-bit systems (adding compiler-dev to help answer that).

So, while I would agree that your fix is generally in the right direction, I think we first need to revisit the use of long for these fields. If they can be changed to an int, then your fix is correct (pending the changes to int). If not, then maybe we need getCLongField() support.

And lastly, we really should have a test to detect this bug. Maybe we already do, and it is failing but is going unnoticed for some reason. I'll try to look into that some more on Monday.

thanks,

Chris

On 9/16/17 5:20 AM, Yasumasa Suenaga wrote:
> Hi all,
>
> I tried to get thread dump via jstack command on CLHSDB.
> But it was failed as below: > > ``` > Caused by: sun.jvm.hotspot.types.WrongTypeException: field "_stack_traversal_mark" in type nmethod is not of type jlong, but instead of type long > at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) > at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) > at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) > at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) > at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) > at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) > at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) > at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) > ... 23 more > ``` > > I think this exception is caused by JDK-8186837. > This changeset has changed the type of `nmethod::_stack_traversal_mark` to `long` from `jlong`. > > SA should follow this change. > > I uploaded a webrev for this issue. This webrev is generated from consolidated repo (jdk10/master). > Could you review it? > > http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ > > > I cannot access JPRT. So I need reviewer. > > > Thanks, > > Yasumasa > From yasuenag at gmail.com Sun Sep 17 08:13:19 2017 From: yasuenag at gmail.com (Yasumasa Suenaga) Date: Sun, 17 Sep 2017 17:13:19 +0900 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> Message-ID: <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> Hi Chris, I've tested this issue on Fedora 26 x86_64. I think we can use CIntegerField at this point because CIntegerField is not specialized for various int sizes [1].
In fact, CIntegerField had been used at this point [2], and HSDB worked fine. Thanks, Yasumasa [1] http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29 [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 On 2017/09/17 3:58, Chris Plummer wrote: > Hi Yasumasa, > > Is this on a 32-bit system? I don't see how you could otherwise call getCIntegerField() on a long type. jlong is always 64-bit and long is (generally) 32-bit on 32-bit systems, and 64-bit on 64-bit systems, at least that seems to be the case with linux. > > From what I can see, _stack_traversal_mark is now the only long type in vmStructs.cpp. I don't know that we have a mechanism to safely fetch it on both 32-bit and 64-bit systems. > > _stack_traversal_mark seems to be a long because _traversals is also a long. > > static long _traversals; // Stack scan count, also sweep ID. > > This too might be considered a bug. I'm not sure why you would want the size of this field to vary between 32-bit and 64-bit systems (adding compiler-dev to help answer that). > > So, while I would agree that your fix is generally in the right direction, I think we first need to revisit the use of long for these fields. If they can be changed to an int, then your fix is correct (pending the changes to int). If not, then maybe we need getCLongField() support. > > And lastly, we really should have a test to detect this bug. Maybe we already do, and it is failing but is going unnoticed for some reason. I'll try to look into that some more on Monday. > > thanks, > > Chris > > On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: >> Hi all, >> >> I tried to get thread dump via jstack command on CLHSDB. But it was failed as below: >> >> ``` >> Caused by: sun.jvm.hotspot.types.WrongTypeException: field "_stack_traversal_mark" in type nmethod is not of type jlong, but instead of type long
>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) >> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) >> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) >> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) >> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) >> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) >> at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) >> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) >> ... 23 more >> ``` >> >> I think this exception is caused by JDK-8186837. >> This changeset has changed the type of `nmethod::_stack_traversal_mark` to `long` from `jlong`. >> >> SA should follow this change. >> >> I uploaded a webrev for this issue. This webrev is generated from consolidated repo (jdk10/master). >> Could you review it? >> >> http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ >> >> >> I cannot access JPRT. So I need reviewer. >> >> >> Thanks, >> >> Yasumasa >> > > From zhongwei.yao at linaro.org Mon Sep 18 09:58:11 2017 From: zhongwei.yao at linaro.org (Zhongwei Yao) Date: Mon, 18 Sep 2017 17:58:11 +0800 Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed Message-ID: [Forward from aarch64-port-dev to hotspot-compiler-dev] Hi, all, Bug: https://bugs.openjdk.java.net/browse/JDK-8187601 Webrev: http://cr.openjdk.java.net/~zyao/8187601/webrev.00 In the current implementation, the loop unroll count is determined by the vector size and the element size when SuperWordLoopUnrollAnalysis is true (it is currently true on both X86 and AArch64). This unrolling policy generates less optimized code when SLP auto-vectorization fails (as the following example shows).
In this patch, I modify the current unrolling policy to do more unrolling when SLP auto-vectorization fails, so the loop is unrolled until it reaches the unroll limit. Here is one example:

public static void accessArrayConstants(int[] array) {
    for (int j = 0; j < 1024; j++) {
        array[0]++;
        array[1]++;
    }
}

Before this patch, the loop is unrolled 4 times. The 4 is determined by: AArch64's vector size of 128 bits / array element size of 32 bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8. Below is the code generated by C2 on AArch64:

==== generated code start ====
0x0000ffff6caf3180: ldr w10, [x1,#16] ;
0x0000ffff6caf3184: add w13, w10, #0x1
0x0000ffff6caf3188: str w13, [x1,#16] ;
0x0000ffff6caf318c: ldr w12, [x1,#20] ;
0x0000ffff6caf3190: add w13, w10, #0x4
0x0000ffff6caf3194: add w10, w12, #0x4
0x0000ffff6caf3198: str w13, [x1,#16] ;
0x0000ffff6caf319c: add w11, w11, #0x4 ;
0x0000ffff6caf31a0: str w10, [x1,#20] ;
0x0000ffff6caf31a4: cmp w11, #0x3fd
0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
==== generated code end ====

After applying this patch, it is unrolled 16 times:

==== generated code start ====
0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
0x0000ffffb0aa6104: add w13, w10, #0x1
0x0000ffffb0aa6108: str w13, [x1,#16] ;
0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
0x0000ffffb0aa6110: add w13, w10, #0x10
0x0000ffffb0aa6114: add w10, w12, #0x10
0x0000ffffb0aa6118: str w13, [x1,#16] ;
0x0000ffffb0aa611c: add w11, w11, #0x10 ;
0x0000ffffb0aa6120: str w10, [x1,#20] ;
0x0000ffffb0aa6124: cmp w11, #0x3f1
0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
==== generated code end ====

This patch passes jtreg tests on both AArch64 and X86.
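
The unroll-count arithmetic in this message can be sketched in plain Java. This is only an illustration under stated assumptions: the class UnrollSketch and the helpers slpUnrollFactor and accessArrayConstantsUnrolled are hypothetical names that do not exist in HotSpot, and the hand-unrolled loop merely mimics the effect of unrolling on this particular loop body (the increments fold into one add per iteration, as in the listings above). It is not how C2 implements unrolling.

```java
import java.util.Arrays;

final class UnrollSketch {
    // Trip count of the example loop in this message.
    static final int TRIP_COUNT = 1024;

    // Unroll count as described in the message: vector size / element size.
    // (Hypothetical helper, not a HotSpot API.)
    static int slpUnrollFactor(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    // The original loop from the message.
    static void accessArrayConstants(int[] array) {
        for (int j = 0; j < TRIP_COUNT; j++) {
            array[0]++;
            array[1]++;
        }
    }

    // Hand-written equivalent of unrolling by `factor`: the main loop advances
    // by `factor` and folds the increments into a single add, and a remainder
    // loop handles trip counts that are not a multiple of `factor`.
    static void accessArrayConstantsUnrolled(int[] array, int factor) {
        int j = 0;
        for (; j + factor <= TRIP_COUNT; j += factor) {
            array[0] += factor;
            array[1] += factor;
        }
        for (; j < TRIP_COUNT; j++) {
            array[0]++;
            array[1]++;
        }
    }

    public static void main(String[] args) {
        int aarch64 = slpUnrollFactor(128, Integer.SIZE); // 128-bit vectors, int elements
        int x86 = slpUnrollFactor(256, Integer.SIZE);     // 256-bit vectors, int elements
        System.out.println("AArch64 factor: " + aarch64 + ", X86 factor: " + x86);

        int[] reference = new int[2];
        accessArrayConstants(reference);
        for (int factor : new int[] {aarch64, x86, 16}) {
            int[] unrolled = new int[2];
            accessArrayConstantsUnrolled(unrolled, factor);
            System.out.println("factor " + factor + " matches: "
                    + Arrays.equals(reference, unrolled));
        }
    }
}
```

The larger factor does not change the result. It only reduces how often the loop counter update, compare and branch are executed, which is the gain visible when comparing the two listings above.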
-- Best regards, Zhongwei From gustavo.scalet at eldorado.org.br Mon Sep 18 13:02:04 2017 From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet) Date: Mon, 18 Sep 2017 13:02:04 +0000 Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic In-Reply-To: <9bf856aaba4049b0940f5e2f36bf91f4@sap.com> References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59A9850B.7030302@linux.vnet.ibm.com> <3e64d11f046f44379f9658dffc766a45@sap.com> <670dd284fe77479986abe75aca42b20a@sap.com> <875858bb7bda421b97352d0fb2972a73@serv031.corp.eldorado.org.br> <2726388e5b224cdcb75a9c0931ec1e44@sap.com> <9bf856aaba4049b0940f5e2f36bf91f4@sap.com> Message-ID: <2bc62b4e121b4422af3ae142508e8977@serv031.corp.eldorado.org.br> Hi Martin, I agree with your changes. Performance and correctness are OK on little endian. Thanks > -----Original Message----- > From: Doerr, Martin [mailto:martin.doerr at sap.com] > Sent: sexta-feira, 15 de setembro de 2017 13:11 > To: Gustavo Serra Scalet; Lindenmaier, Goetz; Gustavo Romero > Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Gustavo, > > I had to make some more changes to get it working on linux BE and AIX. > > New webrev: > http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.05/ > > Changes: > - Fixed remaining lvsr/lvsl issues for BE > - 16 byte alignment doesn't work with xlC on AIX. Workaround: use malloc on AIX > - Changed address computations to provide more freedom for out-of-order execution > - Removed some more unused stuff > > Please take a look. Maybe you find some further improvements. You may want to rerun tests and benchmarks.
> Tests have passed on linux BE and LE as well as on AIX. We'll run them again over the weekend. > > Best regards, > Martin > > > -----Original Message----- > From: Doerr, Martin > Sent: Dienstag, 12. September 2017 18:32 > To: 'Gustavo Serra Scalet'; Lindenmaier, Goetz; Gustavo Romero > Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Gustavo, > > thanks for debugging. It should fix the problem for LE, but the previous version was correct for BE. > > Version which works for both: > http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.04/ > > Changes: > - factored out lvsr/lvsl > - fixed ofs <= limit comparison (treat as positive ints to ignore garbage in high half and be protected against integer overflow, use <= see DigestBase.java) > - removed unused labels > - added the contributors you mentioned to Contributed-by list (not comments, I think it's better there) > > Please let us know if this is ok for you. We'll do some more testing on all platforms. > > Best regards, > Martin > > > -----Original Message----- > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > Sent: Dienstag, 12. September 2017 17:48 > To: Doerr, Martin; Lindenmaier, Goetz; Gustavo Romero > Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > Hi Martin and Götz, > > I was taking a closer look at the hotspot's tests/compiler and I see indeed one test failing for sha:
> Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnSupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnSupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnUnsupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnUnsupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnUnsupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnSupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnUnsupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnSupportedCPU.java
> FAILED: compiler/intrinsics/sha/TestSHA.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA1Intrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA256Intrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA1MultiBlockIntrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA512Intrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA512MultiBlockIntrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA256MultiBlockIntrinsics.java
> > That one was failing due to a bug on unaligned memory load. I took a closer look and fixed it. It should also work on Big Endian: > https://gut.github.io/openjdk/webrev/JDK-8185979/webrev.02/index.html > > This new webrev was updated on top of Martin's webrev.03. > > I also took this chance to add all the contributors to this patch, as you suggested before. > > Thanks > > > -----Original Message----- > > From: Doerr, Martin [mailto:martin.doerr at sap.com] > > Sent: segunda-feira, 11 de setembro de 2017 14:06 > > To: Lindenmaier, Goetz; Gustavo Romero; Gustavo Serra Scalet > > Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net > > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > Hi Götz and Gustavo, > > > > I had just posted the version I had before leaving. Thanks for your feedback.
> > > > New webrev is here: > > http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.03/ > > > > Changes to webrev.02: > > - Referenced paper > > - Factored out endianness specific vector permute instructions > > (vec_perm with only 3 parms to reduce risk of mixing them up) > > - Removed code for PPC64 platforms which didn't support it > > - code_size2 = 22000 > > - Added missing ')' in IntrinsicPredicates.java > > > > My changes shouldn't change the behavior of the little endian > > implementation. > > We have to check if and if yes which tests still fail. Are there any > > updates on this? > > > > Best regards, > > Martin > > > > > > -----Original Message----- > > From: Lindenmaier, Goetz > > Sent: Donnerstag, 7. September 2017 09:55 > > To: Gustavo Romero ; Doerr, Martin > > ; Gustavo Serra Scalet > > > > Cc: 'hotspot-compiler-dev at openjdk.java.net' > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > Hi, > > > > I had a look at this change. > > > > Martin, you missed a ')' in IntrinsicPredicates.java. > > > > Combined with the multiplyToLen change, stub codebuffer space runs > out. > > Please increase > > code_size2 = 20000 > > to 22000 in stubRoutines_ppc.hpp. > > > > I see TestSHA.java failing on linuxppc64le. > > Also, other tests are failing with SHA-256 digest error ... > > > > Also, on aix, some of our internal tests are failing. These didn't run > > on linuxppc64 on a Power8 machine, so it might fail there, too. But > > on the big endian platforms, the jtreg tests don't fail. > > > > @Gustavo, maybe you can have a look at the issues on linuxppc64le and > > post a new webrev. Then Martin can fix the remaining issue on big > > endian. > > > > Best regards, > > Goetz. 
> > > > > > > > > > > -----Original Message----- > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > bounces at openjdk.java.net] On Behalf Of Gustavo Romero > > > Sent: Freitag, 1. September 2017 18:04 > > > To: Doerr, Martin ; Gustavo Serra Scalet > > > > > > Cc: 'hotspot-compiler-dev at openjdk.java.net' > > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > > > Hi Martin! > > > > > > On 01-09-2017 12:39, Doerr, Martin wrote: > > > > Hi Gustavos, > > > > > > > > I have managed to upload a version which seems to work on both > > > endianness implementations. > > > > At least some quick tests have passed on AIX and Big Endian linux > > > > in > > > addition to Little Endian linux. > > > > > > Great! :-) > > > > > > > > > > I'll be out next week, but the change looks ok for me. Please let > > > > me know if > > > the changed version still looks ok for you, too. Feel free to > > > overwork or improve it. > > > > It'd also be good to know, if relying on vrsave=-1 is safe. > > > > > > Sure, Martin. I'm chasing what's exactly setting vrsave=-1 and the > > > full history log (looks like it's not in the kernel, but I'm > > > checking > > yet). > > > > > > > > > > Is the copyright information ok? Did you get source code which > > > > requires to > > > be mentioned in the comments? > > > > The code looks similar to a reference implementation, so the > > > > authors of it > > > may want to be mentioned? > > > > Or did you just use the paper for implementing it? In this case, > > > > I'd mention > > > the paper. > > > > > > Gustavo S: the information on the paper must be updated accordingly > > > as Martin noted in the new webrev. There is none currently. > > > > > > > > > > After we got a second review and ran more tests, we can ask > > > > somebody > > > from Oracle to push it. 
> > > > > > > > Thanks for contributing and your support, Martin > > > > > > Thanks a lot for reviewing and for all the help. > > > > > > Regards, > > > Gustavo R > > > > > > > > > > > -----Original Message----- > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > bounces at openjdk.java.net] On Behalf Of Doerr, Martin > > > > Sent: Donnerstag, 31. August 2017 18:21 > > > > To: Gustavo Romero > > > > Cc: 'hotspot-compiler-dev at openjdk.java.net' > > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > > > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > > > > > Hi Gustavo R, > > > > > > > > I guess you're right. vrsave is already set to -1, so all Vector > > > > Registers get > > > saved. > > > > It'd be good to know where it is set (OS, Flag in ELF header, ???) > > > > and if this > > > is guaranteed. > > > > I don't want to risk getting sporadic errors on some OS versions. > > > > > > > > I'd like to enable SHA intrinsics on linux BE as well. I already > > > > managed to get > > > the 256 bit version working (was quite some work!). > > > > > > > > Thanks and best regards, > > > > Martin > > > > > > > > > > > > -----Original Message----- > > > > From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] > > > > Sent: Freitag, 25. August 2017 22:35 > > > > To: Doerr, Martin > > > > Cc: Gustavo Serra Scalet ; > > > > 'hotspot- > > > compiler-dev at openjdk.java.net' > > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > > > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > > > > > Hi Martin, > > > > > > > > On 25-08-2017 13:18, Doerr, Martin wrote: > > > >> I think you didn't get my point about AIX. > > > >> Your current version doesn't break AIX, but it lacks SHA2 > > > >> acceleration for > > > AIX on Power 8 and newer, which is still relevant. 
> > > >> So I'd like to ask you kindly to take a look if Big Endian > > > >> support for the stub > > > could be added without high effort. AIX doesn't need VRSAVE handling > > > (like Little Endian linux, unlike Big Endian linux), so a few lines > > > in the stub could possibly be enough. I can assist with testing. > > > > > > > > I don't think that VRSAVE is handled on Linux, even on BE. > > > > Although BE ABI > > > [1] > > > > says: > > > > > > > > "Functions must ensure that the appropriate bits in the vrsave > > > > register are > > > set for any vector registers they use" > > > > > > > > and LE ABI does not say that, even on Linux BE VRSAVE is not in > > > > effect used to determine which vector registers (VMX/Altivec) > > > > should be > > > saved/restored. > > > > No application uses it on Linux, so I would say that VRSAVE is > > > > ignored on > > > Linux > > > > completely both on BE and LE. save/restore library interfaces > > > > don't pay attention to it in glibc: VRSAVE is just saved/restored > > > > completely in > > > mechanisms > > > > of swap/get/setcontext(), set/longjump(), and dl-trampoline() and > > > > that's > > > all. I > > > > checked that with toolchain folks and they agree. We've already > > > > discussed > > > that a > > > > long time ago but at that time I was just using the vector-scalar > > > > registers [2] and at that time I agreed that if VMX/Altivec was in > > > > use instead of the VSX > > > so > > > > VRSAVE should be handled accordingly. But I have a different > > > > opinion > > > now... > > > > > > > > I'm wondering if something would really break on Linux BE if we > > > > forget > > > about > > > > VRSAVE at all in the JVM. If not, we could forget about VRSAVE > > > > forever on > > > Linux. > > > > Looks like VRSAVE was sort of born to the oblivion... ? 
> > > > > > > > > > > > Kind regards, > > > > Gustavo > > > > > > > > [1] https://urldefense.proofpoint.com/v2/url?u=http-3A__refspecs.linuxfoundation.org_ELF_ppc64_PPC-2Delf64abi-2D1.9.html&d=DwIFAg&c=jf_iaSHvJObTbx-siA1ZOg&r=kYUIMs9GlX4qSpgorcaHtHrNQxh38_XLwoS4XhaXum8&m=ih0Z-esrs-Hl9wipN392okVsz6z70Rsr9rgJinnzArY&s=arAjOio5NNoRIZLdczhgF5BDoAF3HUvq-xCtSufn_kA&e= > > > > [2] https://urldefense.proofpoint.com/v2/url?u=http-3A__mail.openjdk.java.net_pipermail_ppc-2Daix-2Dport-2Ddev_2016-2DMay_002508.html&d=DwIFAg&c=jf_iaSHvJObTbx-siA1ZOg&r=kYUIMs9GlX4qSpgorcaHtHrNQxh38_XLwoS4XhaXum8&m=ih0Z-esrs-Hl9wipN392okVsz6z70Rsr9rgJinnzArY&s=p0xb08lxayJHBXZREL-7c5ipKc-waZMMZpTiQWfU-S4&e= > > > > From robbin.ehn at oracle.com Mon Sep 18 13:14:17 2017 From: robbin.ehn at oracle.com (Robbin Ehn) Date: Mon, 18 Sep 2017 15:14:17 +0200 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> Message-ID: Hi, FYI: It's been long since duke: 0: nonstatic_field(nmethod, _stack_traversal_mark, long) \ It's now also volatile. The correct type should/could probably be something like (u)int_fast32_t/intx or whatever. If there is an issue with variable length types in vmStructs we have had that issue for very long AFAICT. But looking at error message below: field "_stack_traversal_mark" in type nmethod is not of type jlong, but instead of type long It was jlong for about a month until I reverted that change (back to long but volatile). So it looks like you are running tools from an older build with a newer vm or something like that ? /Robbin On 09/17/2017 10:13 AM, Yasumasa Suenaga wrote: > Hi Chris, > > I've tested this issue on Fedora 26 x86_64.
> I think we can use CIntegerField at this point because CIntegerField is not specialized for various int sizes [1]. > In fact, CIntegerField had been used at this point [2], and HSDB worked fine. > > > Thanks, > > Yasumasa > > > [1] http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29 > [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 > > > On 2017/09/17 3:58, Chris Plummer wrote: >> Hi Yasumasa, >> >> Is this on a 32-bit system? I don't see how you could otherwise call getCIntegerField() on a long type. jlong is always 64-bit and long is (generally) 32-bit on 32-bit systems, and 64-bit on 64-bit systems, at least that seems to be the case with linux. >> >> From what I can see, _stack_traversal_mark is now the only long type in vmStructs.cpp. I don't know that we have a mechanism to safely fetch it on both 32-bit and 64-bit systems. >> >> _stack_traversal_mark seems to be a long because _traversals is also a long. >> >> static long _traversals; // Stack scan count, also sweep ID. >> >> This too might be considered a bug. I'm not sure why you would want the size of this field to vary between 32-bit and 64-bit systems (adding compiler-dev to help answer that). >> >> So, while I would agree that your fix is generally in the right direction, I think we first need to revisit the use of long for these fields. If they can be changed to an int, then your fix is correct (pending the changes to int). If not, then maybe we need getCLongField() support. >> >> And lastly, we really should have a test to detect this bug. Maybe we already do, and it is failing but is going unnoticed for some reason. I'll try to look into that some more on Monday. >> >> thanks, >> >> Chris >> >> On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: >>> Hi all, >>> >>> I tried to get thread dump via jstack command on CLHSDB.
>>> But it was failed as below: >>> >>> ``` >>> Caused by: sun.jvm.hotspot.types.WrongTypeException: field "_stack_traversal_mark" in type nmethod is not of type jlong, but instead of type long >>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) >>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) >>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) >>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) >>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) >>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) >>> at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) >>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) >>> ... 23 more >>> ``` >>> >>> I think this exception is caused by JDK-8186837. >>> This changeset has changed the type of `nmethod::_stack_traversal_mark` to `long` from `jlong`. >>> >>> SA should follow this change. >>> >>> I uploaded a webrev for this issue. This webrev is generated from consolidated repo (jdk10/master). >>> Could you review it? >>> >>> http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ >>> >>> >>> I cannot access JPRT. So I need reviewer. >>> >>> >>> Thanks, >>> >>> Yasumasa >>> >> >> From yasuenag at gmail.com Mon Sep 18 13:54:44 2017 From: yasuenag at gmail.com (Yasumasa Suenaga) Date: Mon, 18 Sep 2017 22:54:44 +0900 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> Message-ID: > So it looks like you are running tools from an older build with a newer vm or something like that ? No. I've applied this change to changeset 47219.
---------------- changeset: 47219:fd36993f7bf5 tag: tip user: ihse date: Fri Sep 15 09:18:00 2017 -0700 summary: 8187542: Remove superfluous *_TOPDIR variables ---------------- I've tested jshell and HSDB from this source. Yasumasa On 2017/09/18 22:14, Robbin Ehn wrote: > Hi, > > FYI: It's been long since duke: > > 0: nonstatic_field(nmethod, _stack_traversal_mark, long) \ > > It's now also volatile. > > The correct type should/could probably be something like (u)int_fast32_t/intx or whatever. > > If there is an issue with variable length types in vmStructs we have had that issue for very long AFAICT. > > But looking at error message below: > field "_stack_traversal_mark" in type nmethod is not of type jlong, but instead of type long > > It was jlong for about a month until I reverted that change (back to long but volatile). > So it looks like you are running tools from an older build with a newer vm or something like that ? > > /Robbin > > On 09/17/2017 10:13 AM, Yasumasa Suenaga wrote: >> Hi Chris, >> >> I've tested this issue on Fedora 26 x86_64. >> I think we can use CIntegerField at this point because CIntegerField is not specialized for various int sizes [1]. >> In fact, CIntegerField had been used at this point [2], and HSDB worked fine. >> >> >> Thanks, >> >> Yasumasa >> >> >> [1] http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29 >> [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 >> >> >> On 2017/09/17 3:58, Chris Plummer wrote: >>> Hi Yasumasa, >>> >>> Is this on a 32-bit system? I don't see how you could otherwise call getCIntegerField() on a long type. jlong is always 64-bit and long is (generally) 32-bit on 32-bit systems, and 64-bit on 64-bit systems, at least that seems to be the case with linux.
>>> >>> From what I can see, _stack_traversal_mark is now the only long type in vmStructs.cpp. I don't know that we have a mechanism to safely fetch it on both 32-bit and 64-bit systems. >>> >>> _stack_traversal_mark seems to be a long because _traversals is also a long. >>> >>> static long _traversals; // Stack scan count, also sweep ID. >>> >>> This too might be considered a bug. I'm not sure why you would want the size of this field to vary between 32-bit and 64-bit systems (adding compiler-dev to help answer that). >>> >>> So, while I would agree that your fix is generally in the right direction, I think we first need to revisit the use of long for these fields. If they can be changed to an int, then your fix is correct (pending the changes to int). If not, then maybe we need getCLongField() support. >>> >>> And lastly, we really should have a test to detect this bug. Maybe we already do, and it is failing but is going unnoticed for some reason. I'll try to look into that some more on Monday. >>> >>> thanks, >>> >>> Chris >>> >>> On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: >>>> Hi all, >>>> >>>> I tried to get thread dump via jstack command on CLHSDB. But it was failed as below: >>>> >>>> ``` >>>> Caused by: sun.jvm.hotspot.types.WrongTypeException: field "_stack_traversal_mark" in type nmethod is not of type jlong, but instead of type long >>>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) >>>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) >>>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) >>>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) >>>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) >>>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81)
>>>> at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) >>>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) >>>> ... 23 more >>>> ``` >>>> >>>> I think this exception is caused by JDK-8186837. >>>> This changeset has changed the type of `nmethod::_stack_traversal_mark` to `long` from `jlong`. >>>> >>>> SA should follow this change. >>>> >>>> I uploaded a webrev for this issue. This webrev is generated from consolidated repo (jdk10/master). >>>> Could you review it? >>>> >>>> http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ >>>> >>>> >>>> I cannot access JPRT. So I need reviewer. >>>> >>>> >>>> Thanks, >>>> >>>> Yasumasa >>>> >>> >>> From robbin.ehn at oracle.com Mon Sep 18 14:14:14 2017 From: robbin.ehn at oracle.com (Robbin Ehn) Date: Mon, 18 Sep 2017 16:14:14 +0200 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> Message-ID: Hi, I missed that there was another commit to revert. (Obviously, if we have jlong here and it is long you will get that exception, but there might be an old bug regarding long 32/64, e.g. CInt.) Your fix looks correct, thanks for fixing! /Robbin

[rehn at rehn-ws hotspot]$ hg log -r13303 -p
changeset: 13303:777b211c54ba
parent: 13301:681389dce7a6
user: rkennke
date: Mon Jul 24 17:14:32 2017 +0200
summary: 8185102: TestSAServer.java fails due to "sun.jvm.hotspot.types.WrongTypeException: field "_stack_traversal_mark"

diff -r 681389dce7a6 -r 777b211c54ba src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/NMethod.java
--- a/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/NMethod.java Mon Jul 24 09:32:35 2017 -0400
+++ b/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/NMethod.java Mon Jul 24 17:14:32 2017 +0200
@@ -71,7 +71,7 @@
       stack. An not_entrant method can be removed when there is no
       more activations, i.e., when the _stack_traversal_mark is less than
       current sweep traversal index. */
-  private static CIntegerField stackTraversalMarkField;
+  private static JLongField stackTraversalMarkField;
 
   private static CIntegerField compLevelField;
 
@@ -105,7 +105,7 @@
     verifiedEntryPointField = type.getAddressField("_verified_entry_point");
     osrEntryPointField = type.getAddressField("_osr_entry_point");
     lockCountField = type.getJIntField("_lock_count");
-    stackTraversalMarkField = type.getCIntegerField("_stack_traversal_mark");
+    stackTraversalMarkField = type.getJLongField("_stack_traversal_mark");
     compLevelField = type.getCIntegerField("_comp_level");
     pcDescSize = db.lookupType("PcDesc").getSize();
   }

On 09/18/2017 03:54 PM, Yasumasa Suenaga wrote: >> So it looks like you are running tools from an older build with a newer vm or something like that ? > > No. I've applied this change to changeset 47219. > ---------------- > changeset: 47219:fd36993f7bf5 > tag: tip > user: ihse > date: Fri Sep 15 09:18:00 2017 -0700 > summary: 8187542: Remove superfluous *_TOPDIR variables > ---------------- > > I've tested jshell and HSDB from this source. > > > Yasumasa > > > On 2017/09/18 22:14, Robbin Ehn wrote: >> Hi, >> >> FYI: It's been long since duke: >> >> 0: nonstatic_field(nmethod, _stack_traversal_mark, long) \ >> >> It's now also volatile. >> >> The correct type should/could probably be something like (u)int_fast32_t/intx or whatever. >> >> If there is an issue with variable length types in vmStructs we have had that issue for very long AFAICT. >> >> But looking at error message below: >> field "_stack_traversal_mark" in type nmethod is not of type jlong, but instead of type long >> >> It was jlong for about a month until I reverted that change (back to long but volatile).
>> So it looks like you are running tools from an older build with a newer vm or something like that ? >> >> /Robbin >> >> On 09/17/2017 10:13 AM, Yasumasa Suenaga wrote: >>> Hi Chris, >>> >>> I've tested this issue on Fedora 26 x86_64. >>> I think we can sue CIntegerField at this point because CIntegerField is not specialized for various int size [1]. >>> In fact, CIntegerField had been used at this point [2], and HSDB worked fine. >>> >>> >>> Thanks, >>> >>> Yasumasa >>> >>> >>> [1] http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29 >>> [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 >>> >>> >>> On 2017/09/17 3:58, Chris Plummer wrote: >>>> Hi Yasumasa, >>>> >>>> Is this on a 32-bit system? I don't see how you could otherwise call getCIntegerField() on a long type. jlong is always 64-bit and long is (generally) 32-bit on 32-bit >>>> systems, and 64-bit on 64-bit systems, at least that seems to be the case with linux. >>>> >>>> ?From what I can see, _stack_traversal_mark is now the only long type in vmStructs.cpp. I don't know that we have a mechanism to safely fetch it on both 32-bit and >>>> 64-bit systems. >>>> >>>> _stack_traversal_mark seems to be a long because _traversals is also a long. >>>> >>>> ?? ? static long????? _traversals;?????????????????? // Stack scan count, also sweep ID. >>>> >>>> This too might be considered a bug. I'm not sure why you would want the size of this field to vary between 32-bit and 64-bit systems (adding compiler-dev to help answer >>>> that). >>>> >>>> So, while I would agree that your fix is generally in the right direction, I think we first need to revisit the use of long for these fields. If they can be changed to >>>> an int, then your fix is correct (pending the changes to int). If not, then maybe we need getCLongField() support. >>>> >>>> And lastly, we really should have a test to detect this bug. 
Maybe we already do, and it is failing but is going unnoticed for some reason. I'll try to look into that >>>> some more on Monday. >>>> >>>> thanks, >>>> >>>> Chris >>>> >>>> On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: >>>>> Hi all, >>>>> >>>>> I tried to get thread dump via jstack command on CLHSDB. But it was failed as below: >>>>> >>>>> ``` >>>>> Caused by: sun.jvm.hotspot.types.WrongTypeException: field "_stack_traversal_mark" in type nmethod is not of type jlong, but instead of type long >>>>> ??????? at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) >>>>> ??????? at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) >>>>> ??????? at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) >>>>> ??????? at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) >>>>> ??????? at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) >>>>> ??????? at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) >>>>> ??????? at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) >>>>> ??????? at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) >>>>> ??????? ... 23 more >>>>> ``` >>>>> >>>>> I think this exception is caused by JDK-8186837. >>>>> This changeset has changed the type of `nmethod::_stack_traversal_mark` to `long` from `jlong`. >>>>> >>>>> SA should follow this change. >>>>> >>>>> I uploaded a webrev for this issue. This webrev is generated from consolidated repo (jdk10/master). >>>>> Could you review it? >>>>> >>>>> ? http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ >>>>> >>>>> >>>>> I cannot access JPRT. So I need reviewer. 
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Yasumasa
>>>>>
>>>>
>>>>
From vladimir.kozlov at oracle.com Mon Sep 18 16:17:26 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Mon, 18 Sep 2017 09:17:26 -0700
Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To: 
References: 
Message-ID: 

Why not use the existing set_notpassed_slp() instead of mark_slp_vec_failed()?

Why do you need the following additional check?

-  } else if (cl->is_main_loop()) {
+  } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
     sw.transform_loop(lpt, true);

Thanks,
Vladimir

On 9/18/17 2:58 AM, Zhongwei Yao wrote:
> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>
> Hi, all,
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8187601
>
> Webrev:
> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>
> In the current implementation, the loop unrolling times are determined
> by vector size and element size when SuperWordLoopUnrollAnalysis is
> true (both X86 and aarch64 are true for now).
>
> This unrolling policy generates less optimized code when SLP
> auto-vectorization fails (as the following example shows).
>
> In this patch, I modify the current unrolling policy to do more
> unrolling when SLP auto-vectorization fails. So the loop will be
> unrolled until reaching the unroll times limitation.
>
> Here is one example:
> public static void accessArrayConstants(int[] array) {
>     for (int j = 0; j < 1024; j++) {
>         array[0]++;
>         array[1]++;
>     }
> }
>
> Before this patch, the loop will be unrolled by 4 times. 4 is
> determined by: AArch64's vector size 128 bits / array element size 32
> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>
> Below is the generated code by C2 on AArch64:
>
> ==== generated code start ====
> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
> 0x0000ffff6caf3184: add w13, w10, #0x1
> 0x0000ffff6caf3188: str w13, [x1,#16] ;
> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
> 0x0000ffff6caf3190: add w13, w10, #0x4
> 0x0000ffff6caf3194: add w10, w12, #0x4
> 0x0000ffff6caf3198: str w13, [x1,#16] ;
> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
> 0x0000ffff6caf31a4: cmp w11, #0x3fd
> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
> ==== generated code end ====
>
> After applying this patch, it is unrolled 16 times:
>
> ==== generated code start ====
> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
> 0x0000ffffb0aa6104: add w13, w10, #0x1
> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
> 0x0000ffffb0aa6110: add w13, w10, #0x10
> 0x0000ffffb0aa6114: add w10, w12, #0x10
> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
> 0x0000ffffb0aa6124: cmp w11, #0x3f1
> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
> ==== generated code end ====
>
> This patch passes jtreg tests both on AArch64 and X86.
>
From lutz.schmidt at sap.com Mon Sep 18 16:28:37 2017
From: lutz.schmidt at sap.com (Schmidt, Lutz)
Date: Mon, 18 Sep 2017 16:28:37 +0000
Subject: RFR(M): 8187573: [s390] z/Architecture Vector Facility Support
Message-ID: <1C19B275-B5BC-4F2D-9319-F9080B365BD6@sap.com>

Dear all,

I would like to request reviews for this s390-only enhancement:

Bug: https://bugs.openjdk.java.net/browse/JDK-8187573
Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8187573.00/

This change is all about providing the instruction definitions and related low-level code emitters for the vector instructions introduced with z13. The change covers support instructions and integer vector instructions only. It only facilitates code generation. No code is generated by the change itself.

Thank you!
Lutz
From chris.plummer at oracle.com Mon Sep 18 22:14:43 2017
From: chris.plummer at oracle.com (Chris Plummer)
Date: Mon, 18 Sep 2017 15:14:43 -0700
Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837
In-Reply-To: <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com>
References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com>
Message-ID: <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com>

Hi Yasumasa,

Ok, I see now that CIntegerField is just an interface, so it's up to a class to implement getValue() to fetch the field. I'm a bit unclear on how that part works, but from responses by others, it seems this is ok.

I've run all the tests I can find that use jstack or jhsdb, and the assert was not triggered. Probably need to have a NMethod on the stack to trigger the code you are fixing.

thanks,

Chris

On 9/17/17 1:13 AM, Yasumasa Suenaga wrote:
> Hi Chris,
>
> I've tested this issue on Fedora 26 x86_64.
> I think we can use CIntegerField at this point because CIntegerField
> is not specialized for various int size [1].
> In fact, CIntegerField had been used at this point [2], and HSDB
> worked fine.
>
>
> Thanks,
>
> Yasumasa
>
>
> [1]
> http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29
> [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3
>
>
> On 2017/09/17 3:58, Chris Plummer wrote:
>> Hi Yasumasa,
>>
>> Is this on a 32-bit system? I don't see how you could otherwise call
>> getCIntegerField() on a long type. jlong is always 64-bit and long is
>> (generally) 32-bit on 32-bit systems, and 64-bit on 64-bit systems,
>> at least that seems to be the case with linux.
>>
>> From what I can see, _stack_traversal_mark is now the only long type
>> in vmStructs.cpp.
I don't know that we have a mechanism to safely >> fetch it on both 32-bit and 64-bit systems. >> >> _stack_traversal_mark seems to be a long because _traversals is also >> a long. >> >> ?? ? static long????? _traversals;?????????????????? // Stack scan >> count, also sweep ID. >> >> This too might be considered a bug. I'm not sure why you would want >> the size of this field to vary between 32-bit and 64-bit systems >> (adding compiler-dev to help answer that). >> >> So, while I would agree that your fix is generally in the right >> direction, I think we first need to revisit the use of long for these >> fields. If they can be changed to an int, then your fix is correct >> (pending the changes to int). If not, then maybe we need >> getCLongField() support. >> >> And lastly, we really should have a test to detect this bug. Maybe we >> already do, and it is failing but is going unnoticed for some reason. >> I'll try to look into that some more on Monday. >> >> thanks, >> >> Chris >> >> On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: >>> Hi all, >>> >>> I tried to get thread dump via jstack command on CLHSDB. But it was >>> failed as below: >>> >>> ``` >>> Caused by: sun.jvm.hotspot.types.WrongTypeException: field >>> "_stack_traversal_mark" in type nmethod is not of type jlong, but >>> instead of type long >>> ??????? at >>> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) >>> ??????? at >>> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) >>> ??????? at >>> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) >>> ??????? at >>> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) >>> ??????? at >>> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) >>> ??????? at >>> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) >>> >>> ??????? 
at
>>> jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451)
>>>         at
>>> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79)
>>>         ... 23 more
>>> ```
>>>
>>> I think this exception is caused by JDK-8186837.
>>> This changeset has changed the type of
>>> `nmethod::_stack_traversal_mark` to `long` from `jlong`.
>>>
>>> SA should follow this change.
>>>
>>> I uploaded a webrev for this issue. This webrev is generated from
>>> consolidated repo (jdk10/master).
>>> Could you review it?
>>>
>>>   http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/
>>>
>>>
>>> I cannot access JPRT. So I need reviewer.
>>>
>>>
>>> Thanks,
>>>
>>> Yasumasa
>>>
>>
>>
From chris.plummer at oracle.com Mon Sep 18 22:17:56 2017
From: chris.plummer at oracle.com (Chris Plummer)
Date: Mon, 18 Sep 2017 15:17:56 -0700
Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837
In-Reply-To: 
References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com>
Message-ID: <00e64ebc-bf8a-31fa-6378-667a9005c797@oracle.com>

On 9/18/17 6:14 AM, Robbin Ehn wrote:
> Hi,
>
> FYI: It's been long since duke:
>
>    0:   nonstatic_field(nmethod, _stack_traversal_mark,
> long) \
>
> It's now also volatile.
>
> The correct type should/could probably be something like
> (u)int_fast32_t/intx or whatever.
>
> If there is an issue with variable length types in vmStructs we have
> had that issue for very long AFAICT.

Hi Robbin,

I was more concerned with it being variable in general between 32-bit and 64-bit platforms, not that it was an issue for vmStructs. This does not seem like the type of value that would need to be a different size depending on the bitness, so it should be made the smallest size that is big enough.
As you point out, something like intx, although the data structure it is in is full of ints, so probably int would be sufficient. thanks, Chris > > But looking at error message below: > field "_stack_traversal_mark" in type nmethod is not of type jlong, > but instead of type long > > It was jlong for about a month until I reverted that change (back to > long but volatile). > So it looks like you are running tools from an older build with a > newer vm or something like that ? > > /Robbin > > On 09/17/2017 10:13 AM, Yasumasa Suenaga wrote: >> Hi Chris, >> >> I've tested this issue on Fedora 26 x86_64. >> I think we can sue CIntegerField at this point because CIntegerField >> is not specialized for various int size [1]. >> In fact, CIntegerField had been used at this point [2], and HSDB >> worked fine. >> >> >> Thanks, >> >> Yasumasa >> >> >> [1] >> http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29 >> [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 >> >> >> On 2017/09/17 3:58, Chris Plummer wrote: >>> Hi Yasumasa, >>> >>> Is this on a 32-bit system? I don't see how you could otherwise call >>> getCIntegerField() on a long type. jlong is always 64-bit and long >>> is (generally) 32-bit on 32-bit systems, and 64-bit on 64-bit >>> systems, at least that seems to be the case with linux. >>> >>> ?From what I can see, _stack_traversal_mark is now the only long >>> type in vmStructs.cpp. I don't know that we have a mechanism to >>> safely fetch it on both 32-bit and 64-bit systems. >>> >>> _stack_traversal_mark seems to be a long because _traversals is also >>> a long. >>> >>> ?? ? static long????? _traversals;?????????????????? // Stack scan >>> count, also sweep ID. >>> >>> This too might be considered a bug. I'm not sure why you would want >>> the size of this field to vary between 32-bit and 64-bit systems >>> (adding compiler-dev to help answer that). 
>>> >>> So, while I would agree that your fix is generally in the right >>> direction, I think we first need to revisit the use of long for >>> these fields. If they can be changed to an int, then your fix is >>> correct (pending the changes to int). If not, then maybe we need >>> getCLongField() support. >>> >>> And lastly, we really should have a test to detect this bug. Maybe >>> we already do, and it is failing but is going unnoticed for some >>> reason. I'll try to look into that some more on Monday. >>> >>> thanks, >>> >>> Chris >>> >>> On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: >>>> Hi all, >>>> >>>> I tried to get thread dump via jstack command on CLHSDB. But it was >>>> failed as below: >>>> >>>> ``` >>>> Caused by: sun.jvm.hotspot.types.WrongTypeException: field >>>> "_stack_traversal_mark" in type nmethod is not of type jlong, but >>>> instead of type long >>>> ??????? at >>>> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) >>>> ??????? at >>>> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) >>>> ??????? at >>>> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) >>>> ??????? at >>>> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) >>>> ??????? at >>>> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) >>>> ??????? at >>>> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) >>>> >>>> ??????? at >>>> jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) >>>> ??????? at >>>> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) >>>> ??????? ... 23 more >>>> ``` >>>> >>>> I think this exception is caused by JDK-8186837. >>>> This changeset has changed the type of >>>> `nmethod::_stack_traversal_mark` to `long` from `jlong`. >>>> >>>> SA should follow this change. >>>> >>>> I uploaded a webrev for this issue. 
This webrev is generated from >>>> consolidated repo (jdk10/master). >>>> Could you review it? >>>> >>>> http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ >>>> >>>> >>>> I cannot access JPRT. So I need reviewer. >>>> >>>> >>>> Thanks, >>>> >>>> Yasumasa >>>> >>> >>> From yasuenag at gmail.com Tue Sep 19 02:55:34 2017 From: yasuenag at gmail.com (Yasumasa Suenaga) Date: Tue, 19 Sep 2017 11:55:34 +0900 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com> Message-ID: Thanks Chris, Robbin, I'm waiting reviewer(s) for this change. Yasumasa 2017/09/19 ??7:14 "Chris Plummer" : Hi Yasumasa, Ok, I see now that CIntegerField is just an interface, so it's up to a class to implement getValue() to fetch the field. I'm a bit unclear on how that part works, but from responses by others, it seems this is ok. I've run all the tests I can find that use jstack or jhsdb, and the assert was not triggered. Probably need to have a NMethod on the stack to trigger the code you are fixing. thanks, Chris On 9/17/17 1:13 AM, Yasumasa Suenaga wrote: > Hi Chris, > > I've tested this issue on Fedora 26 x86_64. > I think we can sue CIntegerField at this point because CIntegerField is > not specialized for various int size [1]. > In fact, CIntegerField had been used at this point [2], and HSDB worked > fine. > > > Thanks, > > Yasumasa > > > [1] http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/ > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/ > CIntegerField.java#l29 > [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 > > > On 2017/09/17 3:58, Chris Plummer wrote: > >> Hi Yasumasa, >> >> Is this on a 32-bit system? I don't see how you could otherwise call >> getCIntegerField() on a long type. 
jlong is always 64-bit and long is >> (generally) 32-bit on 32-bit systems, and 64-bit on 64-bit systems, at >> least that seems to be the case with linux. >> >> From what I can see, _stack_traversal_mark is now the only long type in >> vmStructs.cpp. I don't know that we have a mechanism to safely fetch it on >> both 32-bit and 64-bit systems. >> >> _stack_traversal_mark seems to be a long because _traversals is also a >> long. >> >> static long _traversals; // Stack scan count, >> also sweep ID. >> >> This too might be considered a bug. I'm not sure why you would want the >> size of this field to vary between 32-bit and 64-bit systems (adding >> compiler-dev to help answer that). >> >> So, while I would agree that your fix is generally in the right >> direction, I think we first need to revisit the use of long for these >> fields. If they can be changed to an int, then your fix is correct (pending >> the changes to int). If not, then maybe we need getCLongField() support. >> >> And lastly, we really should have a test to detect this bug. Maybe we >> already do, and it is failing but is going unnoticed for some reason. I'll >> try to look into that some more on Monday. >> >> thanks, >> >> Chris >> >> On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: >> >>> Hi all, >>> >>> I tried to get thread dump via jstack command on CLHSDB. 
But it was >>> failed as below: >>> >>> ``` >>> Caused by: sun.jvm.hotspot.types.WrongTypeException: field >>> "_stack_traversal_mark" in type nmethod is not of type jlong, but instead >>> of type long >>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getF >>> ield(BasicType.java:206) >>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getF >>> ield(BasicType.java:212) >>> at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJ >>> LongField(BasicType.java:249) >>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize( >>> NMethod.java:108) >>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000( >>> NMethod.java:35) >>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) >>> >>> at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMIniti >>> alizedObserver(VM.java:451) >>> at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMet >>> hod.java:79) >>> ... 23 more >>> ``` >>> >>> I think this exception is caused by JDK-8186837. >>> This changeset has changed the type of `nmethod::_stack_traversal_mark` >>> to `long` from `jlong`. >>> >>> SA should follow this change. >>> >>> I uploaded a webrev for this issue. This webrev is generated from >>> consolidated repo (jdk10/master). >>> Could you review it? >>> >>> http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ >>> >>> >>> I cannot access JPRT. So I need reviewer. >>> >>> >>> Thanks, >>> >>> Yasumasa >>> >>> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
From david.holmes at oracle.com Tue Sep 19 03:08:17 2017
From: david.holmes at oracle.com (David Holmes)
Date: Tue, 19 Sep 2017 13:08:17 +1000
Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837
In-Reply-To: 
References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com>
Message-ID: <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com>

Hi Yasumasa,

On 19/09/2017 12:55 PM, Yasumasa Suenaga wrote:
> Thanks Chris, Robbin,
>
> I'm waiting for reviewer(s) for this change.

Reviewed.

This simply reverts the change of 8185102.

Thanks,
David
-----

>
> Yasumasa
>
>
> 2017/09/19 7:14 AM "Chris Plummer":
>
>     Hi Yasumasa,
>
>     Ok, I see now that CIntegerField is just an interface, so it's up to
>     a class to implement getValue() to fetch the field. I'm a bit
>     unclear on how that part works, but from responses by others, it
>     seems this is ok.
>
>     I've run all the tests I can find that use jstack or jhsdb, and the
>     assert was not triggered. Probably need to have a NMethod on the
>     stack to trigger the code you are fixing.
>
>     thanks,
>
>     Chris
>
>
>     On 9/17/17 1:13 AM, Yasumasa Suenaga wrote:
>
>         Hi Chris,
>
>         I've tested this issue on Fedora 26 x86_64.
>         I think we can use CIntegerField at this point because
>         CIntegerField is not specialized for various int size [1].
>         In fact, CIntegerField had been used at this point [2], and HSDB
>         worked fine.
>
>
>         Thanks,
>
>         Yasumasa
>
>
>         [1]
>         http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29
>         [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3
>
>
>         On 2017/09/17 3:58, Chris Plummer wrote:
>
>             Hi Yasumasa,
>
>             Is this on a 32-bit system? I don't see how you could
>             otherwise call getCIntegerField() on a long type.
jlong is > always 64-bit and long is (generally) 32-bit on 32-bit > systems, and 64-bit on 64-bit systems, at least that seems > to be the case with linux. > > ?From what I can see, _stack_traversal_mark is now the only > long type in vmStructs.cpp. I don't know that we have a > mechanism to safely fetch it on both 32-bit and 64-bit systems. > > _stack_traversal_mark seems to be a long because _traversals > is also a long. > > ?? ? static long????? _traversals;?????????????????? // > Stack scan count, also sweep ID. > > This too might be considered a bug. I'm not sure why you > would want the size of this field to vary between 32-bit and > 64-bit systems (adding compiler-dev to help answer that). > > So, while I would agree that your fix is generally in the > right direction, I think we first need to revisit the use of > long for these fields. If they can be changed to an int, > then your fix is correct (pending the changes to int). If > not, then maybe we need getCLongField() support. > > And lastly, we really should have a test to detect this bug. > Maybe we already do, and it is failing but is going > unnoticed for some reason. I'll try to look into that some > more on Monday. > > thanks, > > Chris > > On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: > > Hi all, > > I tried to get thread dump via jstack command on CLHSDB. > But it was failed as below: > > ``` > Caused by: sun.jvm.hotspot.types.WrongTypeException: > field "_stack_traversal_mark" in type nmethod is not of > type jlong, but instead of type long > ??????? at > jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) > ??????? at > jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) > ??????? at > jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) > ??????? at > jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) > ??????? 
at > jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) > ??????? at > jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) > > ??????? at > jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) > ??????? at > jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) > ??????? ... 23 more > ``` > > I think this exception is caused by JDK-8186837. > This changeset has changed the type of > `nmethod::_stack_traversal_mark` to `long` from `jlong`. > > SA should follow this change. > > I uploaded a webrev for this issue. This webrev is > generated from consolidated repo (jdk10/master). > Could you review it? > > http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ > > > > I cannot access JPRT. So I need reviewer. > > > Thanks, > > Yasumasa > > > > > > From yasuenag at gmail.com Tue Sep 19 03:19:40 2017 From: yasuenag at gmail.com (Yasumasa Suenaga) Date: Tue, 19 Sep 2017 12:19:40 +0900 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com> References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com> <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com> Message-ID: Thanks David, BTW, can I push this change after jdk10/master is opened? I cannot access JPRT. Yasumasa 2017/09/19 ??0:08 "David Holmes" : > Hi Yasumasa, > > On 19/09/2017 12:55 PM, Yasumasa Suenaga wrote: > >> Thanks Chris, Robbin, >> >> I'm waiting reviewer(s) for this change. >> > > Reviewed. > > This simply reverts the change of 8185102. > > Thanks, > David > ----- > > >> Yasumasa >> >> >> 2017/09/19 ??7:14 "Chris Plummer" > chris.plummer at oracle.com>>: >> >> Hi Yasumasa, >> >> Ok, I see now that CIntegerField is just an interface, so it's up to >> a class to implement getValue() to fetch the field. 
I'm a bit >> unclear on how that part works, but from responses by others, it >> seems this is ok. >> >> I've run all the tests I can find that use jstack or jhsdb, and the >> assert was not triggered. Probably need to have a NMethod on the >> stack to trigger the code you are fixing. >> >> thanks, >> >> Chris >> >> >> On 9/17/17 1:13 AM, Yasumasa Suenaga wrote: >> >> Hi Chris, >> >> I've tested this issue on Fedora 26 x86_64. >> I think we can sue CIntegerField at this point because >> CIntegerField is not specialized for various int size [1]. >> In fact, CIntegerField had been used at this point [2], and HSDB >> worked fine. >> >> >> Thanks, >> >> Yasumasa >> >> >> [1] >> http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/ >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/ >> CIntegerField.java#l29 >> > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/ >> CIntegerField.java#l29> >> [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 >> >> >> >> On 2017/09/17 3:58, Chris Plummer wrote: >> >> Hi Yasumasa, >> >> Is this on a 32-bit system? I don't see how you could >> otherwise call getCIntegerField() on a long type. jlong is >> always 64-bit and long is (generally) 32-bit on 32-bit >> systems, and 64-bit on 64-bit systems, at least that seems >> to be the case with linux. >> >> From what I can see, _stack_traversal_mark is now the only >> long type in vmStructs.cpp. I don't know that we have a >> mechanism to safely fetch it on both 32-bit and 64-bit >> systems. >> >> _stack_traversal_mark seems to be a long because _traversals >> is also a long. >> >> static long _traversals; // >> Stack scan count, also sweep ID. >> >> This too might be considered a bug. I'm not sure why you >> would want the size of this field to vary between 32-bit and >> 64-bit systems (adding compiler-dev to help answer that). 
>> >> So, while I would agree that your fix is generally in the >> right direction, I think we first need to revisit the use of >> long for these fields. If they can be changed to an int, >> then your fix is correct (pending the changes to int). If >> not, then maybe we need getCLongField() support. >> >> And lastly, we really should have a test to detect this bug. >> Maybe we already do, and it is failing but is going >> unnoticed for some reason. I'll try to look into that some >> more on Monday. >> >> thanks, >> >> Chris >> >> On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: >> >> Hi all, >> >> I tried to get thread dump via jstack command on CLHSDB. >> But it was failed as below: >> >> ``` >> Caused by: sun.jvm.hotspot.types.WrongTypeException: >> field "_stack_traversal_mark" in type nmethod is not of >> type jlong, but instead of type long >> at >> jdk.hotspot.agent/sun.jvm.hots >> pot.types.basic.BasicType.getField(BasicType.java:206) >> at >> jdk.hotspot.agent/sun.jvm.hots >> pot.types.basic.BasicType.getField(BasicType.java:212) >> at >> jdk.hotspot.agent/sun.jvm.hots >> pot.types.basic.BasicType.getJLongField(BasicType.java:249) >> at >> jdk.hotspot.agent/sun.jvm.hots >> pot.code.NMethod.initialize(NMethod.java:108) >> at >> jdk.hotspot.agent/sun.jvm.hots >> pot.code.NMethod.access$000(NMethod.java:35) >> at >> jdk.hotspot.agent/sun.jvm.hots >> pot.code.NMethod$1.update(NMethod.java:81) >> >> at >> jdk.hotspot.agent/sun.jvm.hots >> pot.runtime.VM.registerVMInitializedObserver(VM.java:451) >> at >> jdk.hotspot.agent/sun.jvm.hots >> pot.code.NMethod.(NMethod.java:79) >> ... 23 more >> ``` >> >> I think this exception is caused by JDK-8186837. >> This changeset has changed the type of >> `nmethod::_stack_traversal_mark` to `long` from `jlong`. >> >> SA should follow this change. >> >> I uploaded a webrev for this issue. This webrev is >> generated from consolidated repo (jdk10/master). >> Could you review it? 
>> >> http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ >> >> >> I cannot access JPRT. So I need reviewer. >> >> >> Thanks, >> >> Yasumasa >> >> From david.holmes at oracle.com Tue Sep 19 03:31:22 2017 From: david.holmes at oracle.com (David Holmes) Date: Tue, 19 Sep 2017 13:31:22 +1000 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com> <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com> Message-ID: <13b177d1-e04c-dd86-7a86-b55b165d5830@oracle.com> On 19/09/2017 1:19 PM, Yasumasa Suenaga wrote: > Thanks David, > > BTW, can I push this change after jdk10/master is opened? > I cannot access JPRT. I think we'd probably prefer this to go into jdk10/hs - once it is open - and for that you need a sponsor. Thanks, David > > Yasumasa > > > 2017/09/19 ??0:08 "David Holmes": > > Hi Yasumasa, > > On 19/09/2017 12:55 PM, Yasumasa Suenaga wrote: > > Thanks Chris, Robbin, > > I'm waiting reviewer(s) for this change. > > > Reviewed. > > This simply reverts the change of 8185102. > > Thanks, > David > ----- > > > Yasumasa > > > 2017/09/19 ??7:14 "Chris Plummer": > > Hi Yasumasa, > > Ok, I see now that CIntegerField is just an interface, so > it's up to > a class to implement getValue() to fetch the field. I'm a bit > unclear on how that part works, but from responses by > others, it > seems this is ok. > > I've run all the tests I can find that use jstack or jhsdb, > and the > assert was not triggered. Probably need to have a NMethod > on the > stack to trigger the code you are fixing. > > thanks, > > Chris > > > On 9/17/17 1:13 AM, Yasumasa Suenaga wrote: > >
Hi Chris, > > I've tested this issue on Fedora 26 x86_64. > I think we can use CIntegerField at this point because > CIntegerField is not specialized for various int size [1]. > In fact, CIntegerField had been used at this point [2], > and HSDB > worked fine. > > > Thanks, > > Yasumasa > > > [1] > http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29 > > [2] > http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 > > On 2017/09/17 3:58, Chris Plummer wrote: > > Hi Yasumasa, > > Is this on a 32-bit system? I don't see how you could > otherwise call getCIntegerField() on a long type. > jlong is > always 64-bit and long is (generally) 32-bit on 32-bit > systems, and 64-bit on 64-bit systems, at least > that seems > to be the case with linux. > > From what I can see, _stack_traversal_mark is now > the only > long type in vmStructs.cpp. I don't know that we have a > mechanism to safely fetch it on both 32-bit and > 64-bit systems. > > _stack_traversal_mark seems to be a long because > _traversals > is also a long. > > static long _traversals; // Stack scan count, also sweep ID. > > This too might be considered a bug. I'm not sure > why you > would want the size of this field to vary between > 32-bit and > 64-bit systems (adding compiler-dev to help answer > that). > > So, while I would agree that your fix is generally > in the > right direction, I think we first need to revisit > the use of > long for these fields. If they can be changed to an > int, > then your fix is correct (pending the changes to > int). If >
not, then maybe we need getCLongField() support. > > And lastly, we really should have a test to detect > this bug. > Maybe we already do, and it is failing but is going > unnoticed for some reason. I'll try to look into > that some > more on Monday. > > thanks, > > Chris > > On 9/16/17 5:20 AM, Yasumasa Suenaga wrote: > > Hi all, > > I tried to get thread dump via jstack command > on CLHSDB. > But it failed as below: > > ``` > Caused by: > sun.jvm.hotspot.types.WrongTypeException: > field "_stack_traversal_mark" in type nmethod > is not of > type jlong, but instead of type long > at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) > at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) > at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) > at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) > at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) > at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) > > at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) > at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) > ... 23 more > ``` > > I think this exception is caused by JDK-8186837. > This changeset has changed the type of > `nmethod::_stack_traversal_mark` to `long` from > `jlong`. > >
SA should follow this change. > > I uploaded a webrev for this issue. This webrev is > generated from consolidated repo (jdk10/master). > Could you review it? > > http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ > > I cannot access JPRT. So I need reviewer. > > > Thanks, > > Yasumasa > > From zhongwei.yao at linaro.org Tue Sep 19 05:59:18 2017 From: zhongwei.yao at linaro.org (Zhongwei Yao) Date: Tue, 19 Sep 2017 13:59:18 +0800 Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed In-Reply-To: References: Message-ID: Hi, Vladimir, On 19 September 2017 at 00:17, Vladimir Kozlov wrote: > Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()? Due to 2 reasons, I have not chosen existing passed_slp flag: 1. If we set_notpassed_slp() when _packset.length() == 0 in SuperWord::output(), then in the IdealLoopTree::policy_unroll() checking: if (cl->has_passed_slp()) { if (slp_max_unroll_factor >= future_unroll_ct) return true; // Normal case: loop too big return false; } we will ignore the case: "cl->has_passed_slp() && slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()" as also exposed in my patch: if (cl->has_passed_slp()) { if (slp_max_unroll_factor >= future_unroll_ct) return true; - // Normal case: loop too big - return false; + // When SLP vectorization failed, we could do more unrolling + // optimizations if body size is less than limit size. Otherwise, + // return false due to loop is too big. + if (!cl->is_slp_vec_failed()) return false; } However, I have not found a case to support this condition yet. 2. As replied below, in: > - } else if (cl->is_main_loop()) { > + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { > sw.transform_loop(lpt, true); I need to check whether cl->is_slp_vec_failed() is true. Such checking becomes explicit when using SLPAutoVecFailed flag.
> > Why you need next additional check?: > > - } else if (cl->is_main_loop()) { > + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { > sw.transform_loop(lpt, true); > The additional check prevents the case that when cl->is_slp_vec_failed() is true, then SuperWord::output() will set_major_progress() at the beginning (because _packset.length() == 0 is true when cl->is_slp_vec_failed() is true). Then the "phase ideal loop iteration" will not stop untill loop_opts_cnt reachs 0, which is not we want. > > Thanks, > Vladimir > > > On 9/18/17 2:58 AM, Zhongwei Yao wrote: >> >> [Forward from aarch64-port-dev to hotspot-compiler-dev] >> >> Hi, all, >> >> Bug: >> https://bugs.openjdk.java.net/browse/JDK-8187601 >> >> Webrev: >> http://cr.openjdk.java.net/~zyao/8187601/webrev.00 >> >> In the current implementation, the loop unrolling times are determined >> by vector size and element size when SuperWordLoopUnrollAnalysis is >> true (both X86 and aarch64 are true for now). >> >> This unrolling policy generates less optimized code when SLP >> auto-vectorization fails (as following example shows). >> >> In this patch, I modify the current unrolling policy to do more >> unrolling when SLP auto-vectorization fails. So the loop will be >> unrolled until reaching the unroll times limitation. >> >> Here is one example: >> public static void accessArrayConstants(int[] array) { >> for (int j = 0; j < 1024; j++) { >> array[0]++; >> array[1]++; >> } >> } >> >> Before this patch, the loop will be unrolled by 4 times. 4 is >> determined by: AArch64's vector size 128 bits / array element size 32 >> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8. 
>> >> Below is the generated code by C2 on AArch64: >> >> ==== generated code start ==== >> 0x0000ffff6caf3180: ldr w10, [x1,#16] ; >> 0x0000ffff6caf3184: add w13, w10, #0x1 >> 0x0000ffff6caf3188: str w13, [x1,#16] ; >> 0x0000ffff6caf318c: ldr w12, [x1,#20] ; >> 0x0000ffff6caf3190: add w13, w10, #0x4 >> 0x0000ffff6caf3194: add w10, w12, #0x4 >> 0x0000ffff6caf3198: str w13, [x1,#16] ; >> 0x0000ffff6caf319c: add w11, w11, #0x4 ; >> 0x0000ffff6caf31a0: str w10, [x1,#20] ; >> 0x0000ffff6caf31a4: cmp w11, #0x3fd >> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ; >> ==== generated code end ==== >> >> After applied this patch, it is unrolled 16 times: >> >> ==== generated code start ==== >> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ; >> 0x0000ffffb0aa6104: add w13, w10, #0x1 >> 0x0000ffffb0aa6108: str w13, [x1,#16] ; >> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ; >> 0x0000ffffb0aa6110: add w13, w10, #0x10 >> 0x0000ffffb0aa6114: add w10, w12, #0x10 >> 0x0000ffffb0aa6118: str w13, [x1,#16] ; >> 0x0000ffffb0aa611c: add w11, w11, #0x10 ; >> 0x0000ffffb0aa6120: str w10, [x1,#20] ; >> 0x0000ffffb0aa6124: cmp w11, #0x3f1 >> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ; >> ==== generated code end ==== >> >> This patch passes jtreg tests both on AArch64 and X86. >> > -- Best regards, Zhongwei From aph at redhat.com Tue Sep 19 08:15:29 2017 From: aph at redhat.com (Andrew Haley) Date: Tue, 19 Sep 2017 09:15:29 +0100 Subject: multiplyHigh? In-Reply-To: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com> References: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com> Message-ID: <754125d0-28a0-831f-7989-0b28f8fbb478@redhat.com> Hi, On 08/09/17 16:17, Dmitrij wrote: > I support the idea of intrinsifying Math.multiplyHigh method: it seems > like it will be very effective to implement at least on aarch64 and > x86_64 using umulh and mulq instructions. > > Vladimir, what do you think? > > > If there are no other volunteer I'd be happy to do this. 
> > Enhancing java.lang.Math intrinsics situation for aarch64 port is on my > todo list. Can I have an ETA for this, please? -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From nils.eliasson at oracle.com Tue Sep 19 10:54:19 2017 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Tue, 19 Sep 2017 12:54:19 +0200 Subject: RFR (XXS): 8160303: parse_method_pattern only scans 254 chars Message-ID: <7555f1ba-383f-3652-a701-765eda0417ac@oracle.com> Hi, This patch fixes the wrong (too short) scan length in the signature parsing in methodMatcher.cpp. Bug: https://bugs.openjdk.java.net/browse/JDK-8160303 Webrev: http://cr.openjdk.java.net/~neliasso/8160303/webrev.01/ Please review, Nils Eliasson From claes.redestad at oracle.com Tue Sep 19 11:01:07 2017 From: claes.redestad at oracle.com (Claes Redestad) Date: Tue, 19 Sep 2017 13:01:07 +0200 Subject: RFR (XXS): 8160303: parse_method_pattern only scans 254 chars In-Reply-To: <7555f1ba-383f-3652-a701-765eda0417ac@oracle.com> References: <7555f1ba-383f-3652-a701-765eda0417ac@oracle.com> Message-ID: Looks good to me! /Claes On 2017-09-19 12:54, Nils Eliasson wrote: > Hi, > > This patch fixes the wrong (too short) scan length in the signature > parsing in methodMatcher.cpp. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8160303 > > Webrev: http://cr.openjdk.java.net/~neliasso/8160303/webrev.01/ > > > Please review, > > Nils Eliasson > > From jaroslav.tulach at oracle.com Tue Sep 19 12:56:45 2017 From: jaroslav.tulach at oracle.com (Jaroslav Tulach) Date: Tue, 19 Sep 2017 14:56:45 +0200 Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean In-Reply-To: <48a815cd-2943-d328-d955-f301d39d7b86@oracle.com> References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <48a815cd-2943-d328-d955-f301d39d7b86@oracle.com> Message-ID: <2954772.sIDvmWlhCn@pracovni> On Friday, 15 September 2017 10:53:45 CEST mandy chung wrote:
> > http://cr.openjdk.java.net/~jtulach/8182701/webrev.05/index.html > > Looks good. Great. Unless there are further comments, then we can plan integration of my changes. Vladimir, are you "sponsor" of my change? Will you integrate it somewhere? This is my first change that has a chance to get into JDK, so I don't think I have push rights. But I am thrilled. -jt From nils.eliasson at oracle.com Tue Sep 19 13:09:56 2017 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Tue, 19 Sep 2017 15:09:56 +0200 Subject: RFR (XXS): 8160303: parse_method_pattern only scans 254 chars In-Reply-To: References: <7555f1ba-383f-3652-a701-765eda0417ac@oracle.com> Message-ID: <14a35ed8-de26-8664-34cd-ade32bc5c8fd@oracle.com> Thank you Claes! // Nils On 2017-09-19 13:01, Claes Redestad wrote: > Looks good to me! > > /Claes > > > On 2017-09-19 12:54, Nils Eliasson wrote: >> Hi, >> >> This patch fixes the wrong (too short) scan length in the signature >> parsing in methodMatcher.cpp. >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8160303 >> >> Webrev: http://cr.openjdk.java.net/~neliasso/8160303/webrev.01/ >> >> >> Please review, >> >> Nils Eliasson >> >> > From lutz.schmidt at sap.com Tue Sep 19 14:26:48 2017 From: lutz.schmidt at sap.com (Schmidt, Lutz) Date: Tue, 19 Sep 2017 14:26:48 +0000 Subject: multiplyHigh? In-Reply-To: <70ccb5fe-8c18-21fc-e6ad-ca64571eff88@oracle.com> References: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com> <70ccb5fe-8c18-21fc-e6ad-ca64571eff88@oracle.com> Message-ID: Hi, there is interest from these guys at SAP as well. I would volunteer to cover ppc and s390, once the initial implementation is available. Any ETA guess on that? I'm not pushing, just to get an idea.
Thanks, Lutz On 09.09.2017, 05:22, "hotspot-compiler-dev on behalf of Vladimir Kozlov" wrote: On 9/8/17 8:17 AM, Dmitrij wrote: > Hi Andrew, > > I support the idea of intrinsifying Math.multiplyHigh method: it seems like it will be very effective to implement at > least on aarch64 and x86_64 using umulh and mulq instructions. > > Vladimir, what do you think? Nobody asked to intrinsify these methods before because they are new in JDK 9 I think. > If there are no other volunteer I'd be happy to do this. Yes, please do that for both: aarch64 and x64. Add jtreg hotspot test too, please. Thanks, Vladimir > > Enhancing java.lang.Math intrinsics situation for aarch64 port is on my todo list. > > Thanks, > Dmitrij > > On 08.09.2017 17:38, Andrew Haley wrote: >> I notice that Math.multiplyHigh(long, long) doesn't use a C2 intrinsic, >> even on machines with appropriate C2 patterns. So it's rather slow. >> >> JDK-5100935, No way to access the 64-bit integer multiplication of >> 64-bit CPUs efficiently, is closed, even though there is still no >> efficient way to do this. Is writing an intrinsic for multiplyHigh on >> someone's to-do list? I see no bug for it. >> > From rahul.v.raghavan at oracle.com Tue Sep 19 14:59:50 2017 From: rahul.v.raghavan at oracle.com (Rahul Raghavan) Date: Tue, 19 Sep 2017 20:29:50 +0530 Subject: RFR: 8187676: Disable harmless uninitialized warnings for two files In-Reply-To: <2031be3e-2623-dde1-fff2-2d6cd6e41de9@oracle.com> References: <2031be3e-2623-dde1-fff2-2d6cd6e41de9@oracle.com> Message-ID: <7512e87d-4e28-27a1-5e10-5cdfa794cdf4@oracle.com> Hi Erik, Please note that this 8187676 seems to be related to 8160404. https://bugs.openjdk.java.net/browse/JDK-8160404 (RelocationHolder constructors have bugs) As per the latest notes comments added for 8160404-jbs, I will submit webrev/RFR soon and will request help confirm similar issues with latest gcc7 gets solved. 
Thanks, Rahul On Tuesday 19 September 2017 07:07 PM, Erik Helin wrote: > Hi all, > > with gcc 7.1.1 from Fedora 26 on x86-64 there are warnings about the > potential usage of maybe uninitialized memory in > src/hotspot/cpu/x86/assembler_x86.cpp and in > src/hotspot/cpu/x86/interp_masm_x86.cpp. > > The problems arises from the class RelocationHolder in > src/hotspot/share/code/relocInfo.hpp which has the private fields: > enum { _relocbuf_size = 5 }; > void* _relocbuf[ _relocbuf_size ]; > > and the default constructor for RelocationHolder does not initialize the > elements of _relocbuf. I _think_ this is an optimization, > RelocationHolder is used *a lot* and setting the elements of > RelocationHolder::_relocbuf to NULL (or some other value) in the default > constructor might result in a performance penalty. Have a look in > build/linux-x86_64-normal-server-fastdebug/hotspot/variant-server/gensrc/adfiles > and you will see that RelocationHolder is used all over the place :) > > AFAICS all users of RelocationHolder::_relocbuf take care to not use > uninitialized memory, which means that this warning is wrong, so I > suggest we disable the warning -Wmaybe-uninitialized for > src/hotspot/cpu/x86/assembler_x86.cpp. > > The problem continues because the class Address in > src/hotspot/cpu/x86/assembler_x86.hpp has a private field, > `RelocationHolder _rspec;` and the default constructor for Address does > not initialize _rspec._relocbuf (most likely for performance reasons). > The class Address also has a default copy constructor, which will copy > all the elements of _rspec._relocbuf, which will result in a read of > uninitialized memory. However, this is a benign usage of uninitialized > memory, since we take no action based on the content of the > uninitialized memory (it is just copied byte for byte). > > So, in this case too, I suggest we disable the warning -Wuninitialized > for src/hotspot/cpu/x86/assembler_x86.hpp. > > What do you think? 
> > Patch: > http://cr.openjdk.java.net/~ehelin/8187676/00/ > > --- old/make/hotspot/lib/JvmOverrideFiles.gmk 2017-09-19 > 15:11:45.036108983 +0200 > +++ new/make/hotspot/lib/JvmOverrideFiles.gmk 2017-09-19 > 15:11:44.692107277 +0200 > @@ -32,6 +32,8 @@ > ifeq ($(TOOLCHAIN_TYPE), gcc) > BUILD_LIBJVM_vmStructs.cpp_CXXFLAGS := -fno-var-tracking-assignments > -O0 > BUILD_LIBJVM_jvmciCompilerToVM.cpp_CXXFLAGS := > -fno-var-tracking-assignments > + BUILD_LIBJVM_assembler_x86.cpp_CXXFLAGS := -Wno-maybe-uninitialized > + BUILD_LIBJVM_interp_masm_x86.cpp_CXXFLAGS := -Wno-uninitialized > endif > > ifeq ($(OPENJDK_TARGET_OS), linux) > > Issue: > https://bugs.openjdk.java.net/browse/JDK-8187676 > > Testing: > - Compiles with: > - gcc 7.1.1 and glibc 2.25 on Fedora 26 > - gcc 4.9.2 and glibc 2.12 on OEL 6.4 > - JPRT > > Thanks, > Erik From aph at redhat.com Tue Sep 19 16:04:33 2017 From: aph at redhat.com (Andrew Haley) Date: Tue, 19 Sep 2017 17:04:33 +0100 Subject: multiplyHigh? In-Reply-To: References: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com> <70ccb5fe-8c18-21fc-e6ad-ca64571eff88@oracle.com> Message-ID: <6a2137ab-bf1e-ba87-cdb1-e13c86f00521@redhat.com> On 19/09/17 15:26, Schmidt, Lutz wrote: > there is interest from these guys at SAP as well. I would volunteer to cover ppc and s390, once the initial implementation is available. Any ETA guess on that? I?m not pushing, just to get an idea. If you have a definition for MulHiL, you can use the same code for all targets. You don't have to wait for the intrinsic to be written. For reference, Aarch64 is: instruct mulHiL_rReg(iRegLNoSp dst, iRegL src1, iRegL src2, rFlagsReg cr) %{ match(Set dst (MulHiL src1 src2)); ins_cost(INSN_COST * 7); format %{ "smulh $dst, $src1, $src2, \t# mulhi" %} ins_encode %{ __ smulh(as_Register($dst$$reg), as_Register($src1$$reg), as_Register($src2$$reg)); %} ins_pipe(lmul_reg_reg); %} -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. 
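A minimal sketch for readers of this thread, assuming only what the messages above state: the MulHiL pattern matched by the `smulh` rule yields the upper 64 bits of the signed 128-bit product, which is what `Math.multiplyHigh(long, long)` returns. The class `MultiplyHighSketch` and its two helpers are hypothetical names used for illustration; the 32-bit decomposition is the textbook technique and is not claimed to be the code an intrinsic would replace.

```java
import java.math.BigInteger;

class MultiplyHighSketch {
    // Illustrative helper (not a JDK method): high 64 bits of the signed
    // 128-bit product, computed from 32-bit halves.
    static long multiplyHighPortable(long x, long y) {
        long x1 = x >> 32;          // signed upper half
        long x0 = x & 0xFFFFFFFFL;  // unsigned lower half
        long y1 = y >> 32;
        long y0 = y & 0xFFFFFFFFL;

        long low = x0 * y0;                      // exact when read as unsigned 64-bit
        long t = x1 * y0 + (low >>> 32);         // first cross term plus carry from low
        long mid = (t & 0xFFFFFFFFL) + x0 * y1;  // second cross term
        return x1 * y1 + (t >> 32) + (mid >> 32);
    }

    // Reference value via arbitrary-precision arithmetic.
    static long multiplyHighReference(long x, long y) {
        return BigInteger.valueOf(x)
                         .multiply(BigInteger.valueOf(y))
                         .shiftRight(64)
                         .longValue();
    }

    public static void main(String[] args) {
        long a = 0x123456789ABCDEFL;
        long b = -987654321987654321L;
        // Both comparisons check the sketch against independent computations.
        System.out.println(multiplyHighPortable(a, b) == Math.multiplyHigh(a, b));
        System.out.println(multiplyHighPortable(a, b) == multiplyHighReference(a, b));
    }
}
```

The helper does four multiplications plus shifts and masks, which gives a rough idea of the work a single `smulh` (or the x86_64 multiply mentioned earlier in the thread) would replace once the method is intrinsified.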
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From vladimir.kozlov at oracle.com Tue Sep 19 16:25:20 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 19 Sep 2017 09:25:20 -0700 Subject: RFR: 8187676: Disable harmless uninitialized warnings for two files In-Reply-To: <7512e87d-4e28-27a1-5e10-5cdfa794cdf4@oracle.com> References: <2031be3e-2623-dde1-fff2-2d6cd6e41de9@oracle.com> <7512e87d-4e28-27a1-5e10-5cdfa794cdf4@oracle.com> Message-ID: <4f5b0427-54bf-2b85-0a94-bb41049d2676@oracle.com> I would prefer to have general solution Rahul is working on because code is general - not only x86 is affected. Thanks, Vladimir On 9/19/17 7:59 AM, Rahul Raghavan wrote: > Hi Erik, > > Please note that this 8187676 seems to be related to 8160404. > https://bugs.openjdk.java.net/browse/JDK-8160404 > (RelocationHolder constructors have bugs) > > As per the latest notes comments added for 8160404-jbs, I will submit > webrev/RFR soon and will request help confirm similar issues with latest > gcc7 gets solved. > > Thanks, > Rahul > > On Tuesday 19 September 2017 07:07 PM, Erik Helin wrote: >> Hi all, >> >> with gcc 7.1.1 from Fedora 26 on x86-64 there are warnings about the >> potential usage of maybe uninitialized memory in >> src/hotspot/cpu/x86/assembler_x86.cpp and in >> src/hotspot/cpu/x86/interp_masm_x86.cpp. >> >> The problems arises from the class RelocationHolder in >> src/hotspot/share/code/relocInfo.hpp which has the private fields: >> enum { _relocbuf_size = 5 }; >> void* _relocbuf[ _relocbuf_size ]; >> >> and the default constructor for RelocationHolder does not initialize >> the elements of _relocbuf. I _think_ this is an optimization, >> RelocationHolder is used *a lot* and setting the elements of >> RelocationHolder::_relocbuf to NULL (or some other value) in the >> default constructor might result in a performance penalty.
Have a look >> in >> build/linux-x86_64-normal-server-fastdebug/hotspot/variant-server/gensrc/adfiles >> and you will see that RelocationHolder is used all over the place :) >> >> AFAICS all users of RelocationHolder::_relocbuf take care to not use >> uninitialized memory, which means that this warning is wrong, so I >> suggest we disable the warning -Wmaybe-uninitialized for >> src/hotspot/cpu/x86/assembler_x86.cpp. >> >> The problem continues because the class Address in >> src/hotspot/cpu/x86/assembler_x86.hpp has a private field, >> `RelocationHolder _rspec;` and the default constructor for Address >> does not initialize _rspec._relocbuf (most likely for performance >> reasons). The class Address also has a default copy constructor, which >> will copy all the elements of _rspec._relocbuf, which will result in a >> read of uninitialized memory. However, this is a benign usage of >> uninitialized memory, since we take no action based on the content of >> the uninitialized memory (it is just copied byte for byte). >> >> So, in this case too, I suggest we disable the warning -Wuninitialized >> for src/hotspot/cpu/x86/assembler_x86.hpp. >> >> What do you think? >> >> Patch: >> http://cr.openjdk.java.net/~ehelin/8187676/00/ >> >> --- old/make/hotspot/lib/JvmOverrideFiles.gmk 2017-09-19 >> 15:11:45.036108983 +0200 >> +++ new/make/hotspot/lib/JvmOverrideFiles.gmk 2017-09-19 >> 15:11:44.692107277 +0200 >> @@ -32,6 +32,8 @@ >> ifeq ($(TOOLCHAIN_TYPE), gcc) >> BUILD_LIBJVM_vmStructs.cpp_CXXFLAGS := >> -fno-var-tracking-assignments -O0 >> BUILD_LIBJVM_jvmciCompilerToVM.cpp_CXXFLAGS := >> -fno-var-tracking-assignments >> + BUILD_LIBJVM_assembler_x86.cpp_CXXFLAGS := -Wno-maybe-uninitialized >> + BUILD_LIBJVM_interp_masm_x86.cpp_CXXFLAGS := -Wno-uninitialized >> endif >> >> ifeq ($(OPENJDK_TARGET_OS), linux) >> >> Issue: >> https://bugs.openjdk.java.net/browse/JDK-8187676 >> >> Testing: >> - Compiles with: >>
- gcc 7.1.1 and glibc 2.25 on Fedora 26 >> - gcc 4.9.2 and glibc 2.12 on OEL 6.4 >> - JPRT >> >> Thanks, >> Erik From dmitrij.pochepko at bell-sw.com Tue Sep 19 17:09:44 2017 From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko) Date: Tue, 19 Sep 2017 20:09:44 +0300 Subject: multiplyHigh? In-Reply-To: <754125d0-28a0-831f-7989-0b28f8fbb478@redhat.com> References: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com> <754125d0-28a0-831f-7989-0b28f8fbb478@redhat.com> Message-ID: <810516e8-1175-1c8b-2feb-8d3b9bc66acb@bell-sw.com> Hi, I'm working on that. Expecting it to be completed in a few days. Thanks, Dmitrij On 19.09.2017 11:15, Andrew Haley wrote: > Hi, > > On 08/09/17 16:17, Dmitrij wrote: >> I support the idea of intrinsifying Math.multiplyHigh method: it seems >> like it will be very effective to implement at least on aarch64 and >> x86_64 using umulh and mulq instructions. >> >> Vladimir, what do you think? >> >> >> If there are no other volunteer I'd be happy to do this. >> >> Enhancing java.lang.Math intrinsics situation for aarch64 port is on my >> todo list. > Can I have an ETA for this, please? > From vladimir.kozlov at oracle.com Tue Sep 19 17:54:49 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 19 Sep 2017 10:54:49 -0700 Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed In-Reply-To: References: Message-ID: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com> On 9/18/17 10:59 PM, Zhongwei Yao wrote: > Hi, Vladimir, > > On 19 September 2017 at 00:17, Vladimir Kozlov > wrote: >> Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()? > > Due to 2 reasons, I have not chosen existing passed_slp flag: My point is that if we don't find vectors in a loop (as in your case) we should ignore whole SLP analysis. In best case scenario SuperWord::unrolling_analysis() should determine if there are vectors candidates. For example, check if array's index is depend on loop's index variable.
An other way is to call SuperWord::unrolling_analysis() only after we did vector analysis. It is more complicated changes and out of scope of this. There is also side effect I missed before which may prevent using set_notpassed_slp(): LoopMaxUnroll is changed based on SLP analysis before has_passed_slp() check. Note, set_notpassed_slp() is also used to additional unroll already vectorized loops: http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421 May be you should also call mark_do_unroll_only() when you set set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt problem you pointed. Can you look on this? I am not against adding new is_slp_vec_failed() but I want first to investigate if we can re-use existing functions. Thanks, Vladimir > 1. If we set_notpassed_slp() when _packset.length() == 0 in > SuperWord::output(), then in the IdealLoopTree::policy_unroll() > checking: > > if (cl->has_passed_slp()) { > if (slp_max_unroll_factor >= future_unroll_ct) return true; > // Normal case: loop too big > return false; > } > > we will ignore the case: "cl->has_passed_slp() && > slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()" > as alos exposed in my patch: > > if (cl->has_passed_slp()) { > if (slp_max_unroll_factor >= future_unroll_ct) return true; > - // Normal case: loop too big > - return false; > + // When SLP vectorization failed, we could do more unrolling > + // optimizations if body size is less than limit size. Otherwise, > + // return false due to loop is too big. > + if (!cl->is_slp_vec_failed()) return false; > } > > However, I have not found a case to support this condition yet. > > 2. As replied below, in: >> - } else if (cl->is_main_loop()) { >> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >> sw.transform_loop(lpt, true); > I need to check whether cl->is_slp_vec_failed() is true.Such > checking becomes explicit when using SLPAutoVecFailed flag. 
> >> >> Why you need next additional check?: >> >> - } else if (cl->is_main_loop()) { >> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >> sw.transform_loop(lpt, true); >> > > The additional check prevents the case that when > cl->is_slp_vec_failed() is true, then SuperWord::output() will > set_major_progress() at the beginning (because _packset.length() == 0 > is true when cl->is_slp_vec_failed() is true). Then the "phase ideal > loop iteration" will not stop untill loop_opts_cnt reachs 0, which is > not we want. > >> >> Thanks, >> Vladimir >> >> >> On 9/18/17 2:58 AM, Zhongwei Yao wrote: >>> >>> [Forward from aarch64-port-dev to hotspot-compiler-dev] >>> >>> Hi, all, >>> >>> Bug: >>> https://bugs.openjdk.java.net/browse/JDK-8187601 >>> >>> Webrev: >>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00 >>> >>> In the current implementation, the loop unrolling times are determined >>> by vector size and element size when SuperWordLoopUnrollAnalysis is >>> true (both X86 and aarch64 are true for now). >>> >>> This unrolling policy generates less optimized code when SLP >>> auto-vectorization fails (as following example shows). >>> >>> In this patch, I modify the current unrolling policy to do more >>> unrolling when SLP auto-vectorization fails. So the loop will be >>> unrolled until reaching the unroll times limitation. >>> >>> Here is one example: >>> public static void accessArrayConstants(int[] array) { >>> for (int j = 0; j < 1024; j++) { >>> array[0]++; >>> array[1]++; >>> } >>> } >>> >>> Before this patch, the loop will be unrolled by 4 times. 4 is >>> determined by: AArch64's vector size 128 bits / array element size 32 >>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8. 
>>> >>> Below is the generated code by C2 on AArch64: >>> >>> ==== generated code start ==== >>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ; >>> 0x0000ffff6caf3184: add w13, w10, #0x1 >>> 0x0000ffff6caf3188: str w13, [x1,#16] ; >>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ; >>> 0x0000ffff6caf3190: add w13, w10, #0x4 >>> 0x0000ffff6caf3194: add w10, w12, #0x4 >>> 0x0000ffff6caf3198: str w13, [x1,#16] ; >>> 0x0000ffff6caf319c: add w11, w11, #0x4 ; >>> 0x0000ffff6caf31a0: str w10, [x1,#20] ; >>> 0x0000ffff6caf31a4: cmp w11, #0x3fd >>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ; >>> ==== generated code end ==== >>> >>> After applied this patch, it is unrolled 16 times: >>> >>> ==== generated code start ==== >>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ; >>> 0x0000ffffb0aa6104: add w13, w10, #0x1 >>> 0x0000ffffb0aa6108: str w13, [x1,#16] ; >>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ; >>> 0x0000ffffb0aa6110: add w13, w10, #0x10 >>> 0x0000ffffb0aa6114: add w10, w12, #0x10 >>> 0x0000ffffb0aa6118: str w13, [x1,#16] ; >>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ; >>> 0x0000ffffb0aa6120: str w10, [x1,#20] ; >>> 0x0000ffffb0aa6124: cmp w11, #0x3f1 >>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ; >>> ==== generated code end ==== >>> >>> This patch passes jtreg tests both on AArch64 and X86. >>> >> > > > From vladimir.kozlov at oracle.com Tue Sep 19 18:32:52 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 19 Sep 2017 11:32:52 -0700 Subject: [10] RFR(M) 8182701: Modify JVMCI to allow Graal Compiler to expose platform MBean In-Reply-To: <2954772.sIDvmWlhCn@pracovni> References: <73a34e68-19cf-e949-0057-a2e16cfca6da@oracle.com> <48a815cd-2943-d328-d955-f301d39d7b86@oracle.com> <2954772.sIDvmWlhCn@pracovni> Message-ID: Finally ;) On 9/19/17 5:56 AM, Jaroslav Tulach wrote: > On Friday, 15 September 2017 10:53:45 CEST mandy chung wrote: >> > http://cr.openjdk.java.net/~jtulach/8182701/webrev.05/index.html >> >> Looks good. > > Great.
Unless there are further comments, then we can plan integration of my > changes. Vladimir, are you "sponsor" of my change? Will you integrate it > somewhere? Yes, I will integrate them when jdk10 repos are open again after consolidation. > > This is my first change that has a chance to get into JDK, so I don't think I > have push rights. But I am thrilled. Yes, you have to propose about 8 "significant" changes before becoming Committer and allowed to push: http://openjdk.java.net/projects/#project-committer Thanks, Vladimir > > -jt > From vladimir.kozlov at oracle.com Tue Sep 19 18:45:00 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 19 Sep 2017 11:45:00 -0700 Subject: RFR (XXS): 8160303: parse_method_pattern only scans 254 chars In-Reply-To: <7555f1ba-383f-3652-a701-765eda0417ac@oracle.com> References: <7555f1ba-383f-3652-a701-765eda0417ac@oracle.com> Message-ID: <2923877c-af26-398c-658a-2bace3b34fd3@oracle.com> It should be 1022: one for '(' + one for \0 at the end. Vladimir On 9/19/17 3:54 AM, Nils Eliasson wrote: > Hi, > > This patch fixes the wrong (too short) scan length in the signature > parsing in methodMatcher.cpp. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8160303 > > Webrev: http://cr.openjdk.java.net/~neliasso/8160303/webrev.01/ > > > Please review, > > Nils Eliasson > > From ekaterina.pavlova at oracle.com Wed Sep 20 05:04:44 2017 From: ekaterina.pavlova at oracle.com (Ekaterina Pavlova) Date: Tue, 19 Sep 2017 22:04:44 -0700 Subject: RFR(S) 8185134: Introduce vm.graal predicate Message-ID: Hi all, could you please review this small change which introduces vm.graal.enabled predicate and marks compiler tests which fail with Graal due to c2 specific checks by '@requires !vm.graal.enabled'. 
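As a sketch of how such a marking might look in a test source: only the `@requires !vm.graal.enabled` line comes from the change described above; the class name, summary, and test body are hypothetical and stand in for a real C2-specific test.

```java
/*
 * @test
 * @summary Hypothetical C2-specific test that is skipped when Graal is the JIT
 * @requires !vm.graal.enabled
 * @run main C2OnlySketch
 */
public class C2OnlySketch {
    // Placeholder workload; a real test would check C2-specific behaviour here.
    static int sumTo(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        if (sumTo(10) != 45) {
            throw new RuntimeException("unexpected sum: " + sumTo(10));
        }
    }
}
```

With the predicate in place, jtreg would be expected to select this test only when `vm.graal.enabled` evaluates to false.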
bug: https://bugs.openjdk.java.net/browse/JDK-8185134 webrev: http://cr.openjdk.java.net/~epavlova//8185134_test/ http://cr.openjdk.java.net/~epavlova//8185134_hs/ Tested by running failing tests in Graal as JIT compiler mode (-XX:+TieredCompilation -XX:+UseJVMCICompiler -Djvmci.Compiler=graal). thanks, -katya p.s. Igor Ignatyev volunteered to sponsor this change. From doug.simon at oracle.com Wed Sep 20 07:13:34 2017 From: doug.simon at oracle.com (Doug Simon) Date: Wed, 20 Sep 2017 09:13:34 +0200 Subject: RFR(S) 8185134: Introduce vm.graal predicate In-Reply-To: References: Message-ID: <89CD9484-2357-46BC-B18A-75AD6820B922@oracle.com> Looks good to me. > On 20 Sep 2017, at 07:04, Ekaterina Pavlova wrote: > > Hi all, > > could you please review this small change which introduces vm.graal.enabled predicate > and marks compiler tests which fail with Graal due to c2 specific checks by '@requires !vm.graal.enabled'. > > bug: https://bugs.openjdk.java.net/browse/JDK-8185134 > webrev: http://cr.openjdk.java.net/~epavlova//8185134_test/ > http://cr.openjdk.java.net/~epavlova//8185134_hs/ > > > Tested by running failing tests in Graal as JIT compiler mode > (-XX:+TieredCompilation -XX:+UseJVMCICompiler -Djvmci.Compiler=graal). > > thanks, > -katya > > p.s. > Igor Ignatyev volunteered to sponsor this change. From zhongwei.yao at linaro.org Wed Sep 20 11:07:20 2017 From: zhongwei.yao at linaro.org (Zhongwei Yao) Date: Wed, 20 Sep 2017 19:07:20 +0800 Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed In-Reply-To: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com> References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com> Message-ID: Thanks for your suggestions! I've updated the patch that uses pass_slp and do_unroll_only flags without adding a new flag. 
Please take a look: http://cr.openjdk.java.net/~zyao/8187601/webrev.01/ On 20 September 2017 at 01:54, Vladimir Kozlov wrote: > > > On 9/18/17 10:59 PM, Zhongwei Yao wrote: >> >> Hi, Vladimir, >> >> On 19 September 2017 at 00:17, Vladimir Kozlov >> wrote: >>> >>> Why not use existing set_notpassed_slp() instead of >>> mark_slp_vec_failed()? >> >> >> Due to 2 reasons, I have not chosen existing passed_slp flag: > > > My point is that if we don't find vectors in a loop (as in your case) we > should ignore whole SLP analysis. > > In best case scenario SuperWord::unrolling_analysis() should determine if > there are vectors candidates. For example, check if array's index is depend > on loop's index variable. > > An other way is to call SuperWord::unrolling_analysis() only after we did > vector analysis. > > It is more complicated changes and out of scope of this. There is also side > effect I missed before which may prevent using set_notpassed_slp(): > LoopMaxUnroll is changed based on SLP analysis before has_passed_slp() > check. > > Note, set_notpassed_slp() is also used to additional unroll already > vectorized loops: > > http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421 > > May be you should also call mark_do_unroll_only() when you set > set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt > problem you pointed. Can you look on this? > > I am not against adding new is_slp_vec_failed() but I want first to > investigate if we can re-use existing functions. > > Thanks, > Vladimir > > >> 1. 
If we set_notpassed_slp() when _packset.length() == 0 in >> SuperWord::output(), then in the IdealLoopTree::policy_unroll() >> checking: >> >> if (cl->has_passed_slp()) { >> if (slp_max_unroll_factor >= future_unroll_ct) return true; >> // Normal case: loop too big >> return false; >> } >> >> we will ignore the case: "cl->has_passed_slp() && >> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()" >> as alos exposed in my patch: >> >> if (cl->has_passed_slp()) { >> if (slp_max_unroll_factor >= future_unroll_ct) return true; >> - // Normal case: loop too big >> - return false; >> + // When SLP vectorization failed, we could do more unrolling >> + // optimizations if body size is less than limit size. Otherwise, >> + // return false due to loop is too big. >> + if (!cl->is_slp_vec_failed()) return false; >> } >> >> However, I have not found a case to support this condition yet. >> >> 2. As replied below, in: >>> >>> - } else if (cl->is_main_loop()) { >>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >>> sw.transform_loop(lpt, true); >> >> I need to check whether cl->is_slp_vec_failed() is true.Such >> checking becomes explicit when using SLPAutoVecFailed flag. >> >>> >>> Why you need next additional check?: >>> >>> - } else if (cl->is_main_loop()) { >>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >>> sw.transform_loop(lpt, true); >>> >> >> The additional check prevents the case that when >> cl->is_slp_vec_failed() is true, then SuperWord::output() will >> set_major_progress() at the beginning (because _packset.length() == 0 >> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal >> loop iteration" will not stop untill loop_opts_cnt reachs 0, which is >> not we want. 
> > > >> >>> >>> Thanks, >>> Vladimir >>> >>> >>> On 9/18/17 2:58 AM, Zhongwei Yao wrote: >>>> >>>> >>>> [Forward from aarch64-port-dev to hotspot-compiler-dev] >>>> >>>> Hi, all, >>>> >>>> Bug: >>>> https://bugs.openjdk.java.net/browse/JDK-8187601 >>>> >>>> Webrev: >>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00 >>>> >>>> In the current implementation, the loop unrolling times are determined >>>> by vector size and element size when SuperWordLoopUnrollAnalysis is >>>> true (both X86 and aarch64 are true for now). >>>> >>>> This unrolling policy generates less optimized code when SLP >>>> auto-vectorization fails (as following example shows). >>>> >>>> In this patch, I modify the current unrolling policy to do more >>>> unrolling when SLP auto-vectorization fails. So the loop will be >>>> unrolled until reaching the unroll times limitation. >>>> >>>> Here is one example: >>>> public static void accessArrayConstants(int[] array) { >>>> for (int j = 0; j < 1024; j++) { >>>> array[0]++; >>>> array[1]++; >>>> } >>>> } >>>> >>>> Before this patch, the loop will be unrolled by 4 times. 4 is >>>> determined by: AArch64's vector size 128 bits / array element size 32 >>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8. 
>>>> >>>> Below is the generated code by C2 on AArch64: >>>> >>>> ==== generated code start ==== >>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ; >>>> 0x0000ffff6caf3184: add w13, w10, #0x1 >>>> 0x0000ffff6caf3188: str w13, [x1,#16] ; >>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ; >>>> 0x0000ffff6caf3190: add w13, w10, #0x4 >>>> 0x0000ffff6caf3194: add w10, w12, #0x4 >>>> 0x0000ffff6caf3198: str w13, [x1,#16] ; >>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ; >>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ; >>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd >>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ; >>>> ==== generated code end ==== >>>> >>>> After applied this patch, it is unrolled 16 times: >>>> >>>> ==== generated code start ==== >>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ; >>>> 0x0000ffffb0aa6104: add w13, w10, #0x1 >>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ; >>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ; >>>> 0x0000ffffb0aa6110: add w13, w10, #0x10 >>>> 0x0000ffffb0aa6114: add w10, w12, #0x10 >>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ; >>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ; >>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ; >>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1 >>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ; >>>> ==== generated code end ==== >>>> >>>> This patch passes jtreg tests both on AArch64 and X86. >>>> >>> >> >> >> > -- Best regards, Zhongwei From dmitrij.pochepko at bell-sw.com Wed Sep 20 13:08:27 2017 From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko) Date: Wed, 20 Sep 2017 16:08:27 +0300 Subject: [10] RFR(S): 8187684 - Intrinsify Math.multiplyHigh(long, long) Message-ID: Hi, please review small patch for enhancement: 8187684 - Intrinsify Math.multiplyHigh(long, long) Method Math.multiplyHigh was introduced in jdk9 and is not intrinsified. This patch adds such intrinsic by using existing MulHiLNode. For aarch64 and x86_64 it uses respective cpu instruction (smulh/mulq), which is faster. 
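For readers unfamiliar with the method, a minimal Java sketch of what `Math.multiplyHigh(long, long)` computes: the upper 64 bits of the signed 128-bit product. The `BigInteger` reference and the class name `MultiplyHighSketch` are illustrative assumptions and are not part of the patch or of the benchmark.

```java
import java.math.BigInteger;

public class MultiplyHighSketch {
    // Reference computation: form the full product with BigInteger,
    // then keep the bits above the low 64 (arithmetic shift preserves the sign).
    static long highViaBigInteger(long a, long b) {
        return BigInteger.valueOf(a)
                .multiply(BigInteger.valueOf(b))
                .shiftRight(64)
                .longValue();
    }

    public static void main(String[] args) {
        long a = Long.MAX_VALUE;
        long b = 3L;
        // Both lines should print the same value.
        System.out.println(Math.multiplyHigh(a, b));
        System.out.println(highViaBigInteger(a, b));
    }
}
```

The `BigInteger` route allocates objects on every call, which is one reason a single-instruction intrinsic is attractive for this method.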
I've created a small JMH benchmark:
http://cr.openjdk.java.net/~dpochepk/8187684/MultiplyHighBench.java
to test the improved performance, and measured it on aarch64 (t88, R-Pi) and x86_64 (i7-4770K). The benchmark shows about a 2.5x improvement on aarch64 and about 2x on x86_64.

Detailed benchmark results are here:
http://cr.openjdk.java.net/~dpochepk/8187684/results.txt

webrev hotspot: http://cr.openjdk.java.net/~dpochepk/8187684/webrev.hotspot.01/
webrev jdk: http://cr.openjdk.java.net/~dpochepk/8187684/webrev.jdk.01/

I've found an existing jtreg test for multiplyHigh in jdk/test/java/lang/Math which covers this method. I've run it in -Xmixed and -Xcomp modes and found no regressions.

Thanks,
Dmitrij

From aph at redhat.com Wed Sep 20 13:29:25 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 20 Sep 2017 14:29:25 +0100
Subject: [10] RFR(S): 8187684 - Intrinsify Math.multiplyHigh(long, long)
In-Reply-To:
References:
Message-ID: <5cfb314b-9785-fe47-8797-a899f38643ef@redhat.com>

On 20/09/17 14:08, Dmitrij Pochepko wrote:
> please review small patch for enhancement: 8187684 - Intrinsify
> Math.multiplyHigh(long, long)

OK, thanks.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Wed Sep 20 14:13:11 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Wed, 20 Sep 2017 17:13:11 +0300
Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
Message-ID: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>

Hi,

Andrew, do you believe this is ok to push?

Thanks,
Dmitrij

....

On 06.09.2017 20:39, Dmitrij wrote:
>
> I've compared it by calling square and multiply methods and got
> following results (ThunderX):
>
>
> Benchmark                                 (size, ints)  Mode  Cnt      Score     Error  Units
> BigIntegerBench.implMutliplyToLenReflect             1  avgt    5    186.930 ±  14.933  ns/op  (26% slower)
> BigIntegerBench.implMutliplyToLenReflect             2  avgt    5    194.095 ±  11.857  ns/op  (12% slower)
> BigIntegerBench.implMutliplyToLenReflect             3  avgt    5    233.912 ±   4.229  ns/op  (24% slower)
> BigIntegerBench.implMutliplyToLenReflect             5  avgt    5    308.349 ±  20.383  ns/op  (22% slower)
> BigIntegerBench.implMutliplyToLenReflect            10  avgt    5    475.839 ±   6.232  ns/op  (same)
> BigIntegerBench.implMutliplyToLenReflect            50  avgt    5   6514.691 ±  76.934  ns/op  (same)
> BigIntegerBench.implMutliplyToLenReflect            90  avgt    5  20347.040 ± 224.290  ns/op  (3% slower)
> BigIntegerBench.implMutliplyToLenReflect           127  avgt    5  41929.302 ± 181.053  ns/op  (9% slower)
>
> BigIntegerBench.implSquareToLenReflect               1  avgt    5    147.751 ±  12.760  ns/op
> BigIntegerBench.implSquareToLenReflect               2  avgt    5    173.804 ±   4.850  ns/op
> BigIntegerBench.implSquareToLenReflect               3  avgt    5    187.822 ±  34.027  ns/op
> BigIntegerBench.implSquareToLenReflect               5  avgt    5    251.995 ±  19.711  ns/op
> BigIntegerBench.implSquareToLenReflect              10  avgt    5    474.489 ±   1.040  ns/op
> BigIntegerBench.implSquareToLenReflect              50  avgt    5   6493.768 ±  33.809  ns/op
> BigIntegerBench.implSquareToLenReflect              90  avgt    5  19766.524 ±  88.398  ns/op
> BigIntegerBench.implSquareToLenReflect             127  avgt    5  38448.202 ± 180.095  ns/op
>
>
> As we can see, squareToLen is faster than multiplyToLen.
>
> (I've updated benchmark code at
> http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)
>
> Thanks,
> Dmitrij

From aph at redhat.com Wed Sep 20 14:40:31 2017
From: aph at redhat.com (Andrew Haley)
Date: Wed, 20 Sep 2017 15:40:31 +0100
Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID:

On 20/09/17 15:13, Dmitrij Pochepko wrote:
> Andrew, do you believe this is ok to push?

I'm testing it on some other hardware.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From vladimir.kozlov at oracle.com Wed Sep 20 15:15:29 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Wed, 20 Sep 2017 08:15:29 -0700
Subject: RFR(S) 8185134: Introduce vm.graal predicate
In-Reply-To:
References:
Message-ID:

Good.
Thanks,
Vladimir

On 9/19/17 10:04 PM, Ekaterina Pavlova wrote:
> Hi all,
>
> could you please review this small change which introduces
> vm.graal.enabled predicate
> and marks compiler tests which fail with Graal due to c2 specific checks
> by '@requires !vm.graal.enabled'.
>
>    bug: https://bugs.openjdk.java.net/browse/JDK-8185134
> webrev: http://cr.openjdk.java.net/~epavlova//8185134_test/
>         http://cr.openjdk.java.net/~epavlova//8185134_hs/
>
>
> Tested by running failing tests in Graal as JIT compiler mode
> (-XX:+TieredCompilation -XX:+UseJVMCICompiler -Djvmci.Compiler=graal).
>
> thanks,
> -katya
>
> p.s.
> Igor Ignatyev volunteered to sponsor this change.

From vladimir.kozlov at oracle.com Wed Sep 20 15:34:10 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Wed, 20 Sep 2017 08:34:10 -0700
Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID:

Dmitrij,

You need an Oracle sponsor for the push since you touched shared code (register.hpp).

Thanks,
Vladimir

On 9/20/17 7:13 AM, Dmitrij Pochepko wrote:
> Hi,
>
> Andrew, do you believe this is ok to push?
>
> Thanks,
>
> Dmitrij
>
>
> ....
>
> On 06.09.2017 20:39, Dmitrij wrote:
>>
>> I've compared it by calling square and multiply methods and got
>> following results (ThunderX):
>>
>>
>> Benchmark                                 (size, ints)  Mode  Cnt      Score     Error  Units
>> BigIntegerBench.implMutliplyToLenReflect             1  avgt    5    186.930 ±  14.933  ns/op  (26% slower)
>> BigIntegerBench.implMutliplyToLenReflect             2  avgt    5    194.095 ±  11.857  ns/op  (12% slower)
>> BigIntegerBench.implMutliplyToLenReflect             3  avgt    5    233.912 ±   4.229  ns/op  (24% slower)
>> BigIntegerBench.implMutliplyToLenReflect             5  avgt    5    308.349 ±  20.383  ns/op  (22% slower)
>> BigIntegerBench.implMutliplyToLenReflect            10  avgt    5    475.839 ±   6.232  ns/op  (same)
>> BigIntegerBench.implMutliplyToLenReflect            50  avgt    5   6514.691 ±  76.934  ns/op  (same)
>> BigIntegerBench.implMutliplyToLenReflect            90  avgt    5  20347.040 ± 224.290  ns/op  (3% slower)
>> BigIntegerBench.implMutliplyToLenReflect           127  avgt    5  41929.302 ± 181.053  ns/op  (9% slower)
>>
>> BigIntegerBench.implSquareToLenReflect               1  avgt    5    147.751 ±  12.760  ns/op
>> BigIntegerBench.implSquareToLenReflect               2  avgt    5    173.804 ±   4.850  ns/op
>> BigIntegerBench.implSquareToLenReflect               3  avgt    5    187.822 ±  34.027  ns/op
>> BigIntegerBench.implSquareToLenReflect               5  avgt    5    251.995 ±  19.711  ns/op
>> BigIntegerBench.implSquareToLenReflect              10  avgt    5    474.489 ±   1.040  ns/op
>> BigIntegerBench.implSquareToLenReflect              50  avgt    5   6493.768 ±  33.809  ns/op
>> BigIntegerBench.implSquareToLenReflect              90  avgt    5  19766.524 ±  88.398  ns/op
>> BigIntegerBench.implSquareToLenReflect             127  avgt    5  38448.202 ± 180.095  ns/op
>>
>>
>> As we can see, squareToLen is faster than multiplyToLen.
>>
>> (I've updated benchmark code at
>> http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)
>>
>> Thanks,
>> Dmitrij
>

From vladimir.kozlov at oracle.com Wed Sep 20 16:18:00 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Wed, 20 Sep 2017 09:18:00 -0700
Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID:

Nice. Did you verify that it fixed your case?
Would be nice to run specjvm2008 to make sure performance did not regress. Thanks, Vladimir On 9/20/17 4:07 AM, Zhongwei Yao wrote: > Thanks for your suggestions! > > I've updated the patch that uses pass_slp and do_unroll_only flags > without adding a new flag. Please take a look: > > http://cr.openjdk.java.net/~zyao/8187601/webrev.01/ > > > > On 20 September 2017 at 01:54, Vladimir Kozlov > wrote: >> >> >> On 9/18/17 10:59 PM, Zhongwei Yao wrote: >>> >>> Hi, Vladimir, >>> >>> On 19 September 2017 at 00:17, Vladimir Kozlov >>> wrote: >>>> >>>> Why not use existing set_notpassed_slp() instead of >>>> mark_slp_vec_failed()? >>> >>> >>> Due to 2 reasons, I have not chosen existing passed_slp flag: >> >> >> My point is that if we don't find vectors in a loop (as in your case) we >> should ignore whole SLP analysis. >> >> In best case scenario SuperWord::unrolling_analysis() should determine if >> there are vectors candidates. For example, check if array's index is depend >> on loop's index variable. >> >> An other way is to call SuperWord::unrolling_analysis() only after we did >> vector analysis. >> >> It is more complicated changes and out of scope of this. There is also side >> effect I missed before which may prevent using set_notpassed_slp(): >> LoopMaxUnroll is changed based on SLP analysis before has_passed_slp() >> check. >> >> Note, set_notpassed_slp() is also used to additional unroll already >> vectorized loops: >> >> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421 >> >> May be you should also call mark_do_unroll_only() when you set >> set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt >> problem you pointed. Can you look on this? >> >> I am not against adding new is_slp_vec_failed() but I want first to >> investigate if we can re-use existing functions. >> >> Thanks, >> Vladimir >> >> >>> 1. 
If we set_notpassed_slp() when _packset.length() == 0 in >>> SuperWord::output(), then in the IdealLoopTree::policy_unroll() >>> checking: >>> >>> if (cl->has_passed_slp()) { >>> if (slp_max_unroll_factor >= future_unroll_ct) return true; >>> // Normal case: loop too big >>> return false; >>> } >>> >>> we will ignore the case: "cl->has_passed_slp() && >>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()" >>> as alos exposed in my patch: >>> >>> if (cl->has_passed_slp()) { >>> if (slp_max_unroll_factor >= future_unroll_ct) return true; >>> - // Normal case: loop too big >>> - return false; >>> + // When SLP vectorization failed, we could do more unrolling >>> + // optimizations if body size is less than limit size. Otherwise, >>> + // return false due to loop is too big. >>> + if (!cl->is_slp_vec_failed()) return false; >>> } >>> >>> However, I have not found a case to support this condition yet. >>> >>> 2. As replied below, in: >>>> >>>> - } else if (cl->is_main_loop()) { >>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >>>> sw.transform_loop(lpt, true); >>> >>> I need to check whether cl->is_slp_vec_failed() is true.Such >>> checking becomes explicit when using SLPAutoVecFailed flag. >>> >>>> >>>> Why you need next additional check?: >>>> >>>> - } else if (cl->is_main_loop()) { >>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >>>> sw.transform_loop(lpt, true); >>>> >>> >>> The additional check prevents the case that when >>> cl->is_slp_vec_failed() is true, then SuperWord::output() will >>> set_major_progress() at the beginning (because _packset.length() == 0 >>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal >>> loop iteration" will not stop untill loop_opts_cnt reachs 0, which is >>> not we want. 
>> >> >> >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>> >>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote: >>>>> >>>>> >>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev] >>>>> >>>>> Hi, all, >>>>> >>>>> Bug: >>>>> https://bugs.openjdk.java.net/browse/JDK-8187601 >>>>> >>>>> Webrev: >>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00 >>>>> >>>>> In the current implementation, the loop unrolling times are determined >>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is >>>>> true (both X86 and aarch64 are true for now). >>>>> >>>>> This unrolling policy generates less optimized code when SLP >>>>> auto-vectorization fails (as following example shows). >>>>> >>>>> In this patch, I modify the current unrolling policy to do more >>>>> unrolling when SLP auto-vectorization fails. So the loop will be >>>>> unrolled until reaching the unroll times limitation. >>>>> >>>>> Here is one example: >>>>> public static void accessArrayConstants(int[] array) { >>>>> for (int j = 0; j < 1024; j++) { >>>>> array[0]++; >>>>> array[1]++; >>>>> } >>>>> } >>>>> >>>>> Before this patch, the loop will be unrolled by 4 times. 4 is >>>>> determined by: AArch64's vector size 128 bits / array element size 32 >>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8. 
>>>>> >>>>> Below is the generated code by C2 on AArch64: >>>>> >>>>> ==== generated code start ==== >>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ; >>>>> 0x0000ffff6caf3184: add w13, w10, #0x1 >>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ; >>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ; >>>>> 0x0000ffff6caf3190: add w13, w10, #0x4 >>>>> 0x0000ffff6caf3194: add w10, w12, #0x4 >>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ; >>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ; >>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ; >>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd >>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ; >>>>> ==== generated code end ==== >>>>> >>>>> After applied this patch, it is unrolled 16 times: >>>>> >>>>> ==== generated code start ==== >>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ; >>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1 >>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ; >>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ; >>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10 >>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10 >>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ; >>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ; >>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ; >>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1 >>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ; >>>>> ==== generated code end ==== >>>>> >>>>> This patch passes jtreg tests both on AArch64 and X86. >>>>> >>>> >>> >>> >>> >> > > > From lutz.schmidt at sap.com Wed Sep 20 19:36:47 2017 From: lutz.schmidt at sap.com (Schmidt, Lutz) Date: Wed, 20 Sep 2017 19:36:47 +0000 Subject: multiplyHigh? In-Reply-To: <6a2137ab-bf1e-ba87-cdb1-e13c86f00521@redhat.com> References: <9e627b27-54cd-d031-748f-c8f7c4c032b7@bell-sw.com> <70ccb5fe-8c18-21fc-e6ad-ca64571eff88@oracle.com> <6a2137ab-bf1e-ba87-cdb1-e13c86f00521@redhat.com> Message-ID: Thanks Andrew for this hint. I assumed we would be talking about a ?full-blown? intrinsic like those for the crypto stuff. Defining just an instruct in the .ad file is easy. 
S390 and ppc both have suitable instructions. Expect my suggestion shortly.

Regards,
Lutz

On 19.09.2017, 18:04, "Andrew Haley" wrote:

    On 19/09/17 15:26, Schmidt, Lutz wrote:
    > there is interest from these guys at SAP as well. I would volunteer to cover ppc and s390, once the initial implementation is available. Any ETA guess on that? I'm not pushing, just to get an idea.

    If you have a definition for MulHiL, you can use the same code for all
    targets. You don't have to wait for the intrinsic to be written.

    For reference, Aarch64 is:

    instruct mulHiL_rReg(iRegLNoSp dst, iRegL src1, iRegL src2, rFlagsReg cr)
    %{
      match(Set dst (MulHiL src1 src2));

      ins_cost(INSN_COST * 7);
      format %{ "smulh $dst, $src1, $src2, \t# mulhi" %}

      ins_encode %{
        __ smulh(as_Register($dst$$reg),
                 as_Register($src1$$reg),
                 as_Register($src2$$reg));
      %}

      ins_pipe(lmul_reg_reg);
    %}

    -- 
    Andrew Haley
    Java Platform Lead Engineer
    Red Hat UK Ltd.
    EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From yasuenag at gmail.com Wed Sep 20 22:44:46 2017
From: yasuenag at gmail.com (Yasumasa Suenaga)
Date: Thu, 21 Sep 2017 07:44:46 +0900
Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837
In-Reply-To: <13b177d1-e04c-dd86-7a86-b55b165d5830@oracle.com>
References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com> <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com> <13b177d1-e04c-dd86-7a86-b55b165d5830@oracle.com>
Message-ID: <2c955c17-43b9-906f-408d-f5349d57ca13@gmail.com>

Hi David,

jdk10/hs has been opened [1].
Could you push this change?

Thanks,

Yasumasa

[1] http://mail.openjdk.java.net/pipermail/jdk10-dev/2017-September/000499.html

On 2017/09/19 12:31, David Holmes wrote:
> On 19/09/2017 1:19 PM, Yasumasa Suenaga wrote:
>> Thanks David,
>>
>> BTW, can I push this change after jdk10/master is opened?
>> I cannot access JPRT.
>
> I think we'd probably prefer this to go into jdk10/hs - once it is open - and for that you need a sponsor.
>
> Thanks,
> David
>
>>
>> Yasumasa
>>
>>
>> 2017/09/19 12:08 PM "David Holmes":
>>
>>     Hi Yasumasa,
>>
>>     On 19/09/2017 12:55 PM, Yasumasa Suenaga wrote:
>>
>>         Thanks Chris, Robbin,
>>
>>         I'm waiting for reviewer(s) for this change.
>>
>>
>>     Reviewed.
>>
>>     This simply reverts the change of 8185102.
>>
>>     Thanks,
>>     David
>>     -----
>>
>>
>>         Yasumasa
>>
>>
>>         2017/09/19 7:14 AM "Chris Plummer":
>>
>>             Hi Yasumasa,
>>
>>             Ok, I see now that CIntegerField is just an interface, so it's up to
>>             a class to implement getValue() to fetch the field. I'm a bit
>>             unclear on how that part works, but from responses by others, it
>>             seems this is ok.
>>
>>             I've run all the tests I can find that use jstack or jhsdb, and the
>>             assert was not triggered. Probably need to have a NMethod on the
>>             stack to trigger the code you are fixing.
>>
>>             thanks,
>>
>>             Chris
>>
>>
>>             On 9/17/17 1:13 AM, Yasumasa Suenaga wrote:
>>
>>                 Hi Chris,
>>
>>                 I've tested this issue on Fedora 26 x86_64.
>>                 I think we can use CIntegerField at this point because
>>                 CIntegerField is not specialized for various int size [1].
>>                 In fact, CIntegerField had been used at this point [2], and HSDB
>>                 worked fine.
>>
>>
>>                 Thanks,
>>
>>                 Yasumasa
>>
>>
>>                 [1] http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29
>>                 [2] http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3
>>
>>
>>                 On 2017/09/17 3:58, Chris Plummer wrote:
>>
>>                     Hi Yasumasa,
>>
>>                     Is this on a 32-bit system? I don't see how you could
>>                     otherwise call getCIntegerField() on a long type. jlong is
>>                     always 64-bit and long is (generally) 32-bit on 32-bit
>>                     systems, and 64-bit on 64-bit systems, at least that seems
>>                     to be the case with linux.
>>
>>                     From what I can see, _stack_traversal_mark is now the only
>>                     long type in vmStructs.cpp. I don't know that we have a
>>                     mechanism to safely fetch it on both 32-bit and 64-bit systems.
>>
>>                     _stack_traversal_mark seems to be a long because _traversals
>>                     is also a long.
>>
>>                         static long _traversals;   // Stack scan count, also sweep ID.
>>
>>                     This too might be considered a bug. I'm not sure why you
>>                     would want the size of this field to vary between 32-bit and
>>                     64-bit systems (adding compiler-dev to help answer that).
>>
>>                     So, while I would agree that your fix is generally in the
>>                     right direction, I think we first need to revisit the use of
>>                     long for these fields. If they can be changed to an int,
>>                     then your fix is correct (pending the changes to int). If
>>                     not, then maybe we need getCLongField() support.
>>
>>                     And lastly, we really should have a test to detect this bug.
>>                     Maybe we already do, and it is failing but is going
>>                     unnoticed for some reason. I'll try to look into that some
>>                     more on Monday.
>>
>>                     thanks,
>>
>>                     Chris
>>
>>                     On 9/16/17 5:20 AM, Yasumasa Suenaga wrote:
>>
>>                         Hi all,
>>
>>                         I tried to get a thread dump via the jstack command on CLHSDB.
>>                         But it failed as below:
>>
>>                         ```
>>                         Caused by: sun.jvm.hotspot.types.WrongTypeException:
>>                         field "_stack_traversal_mark" in type nmethod is not of
>>                         type jlong, but instead of type long
>>                                 at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206)
>>                                 at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212)
>>                                 at jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249)
>>                                 at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108)
>>                                 at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35)
>>                                 at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81)
>>                                 at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451)
>>                                 at jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79)
>>                                 ... 23 more
>>                         ```
>>
>>                         I think this exception is caused by JDK-8186837.
>>                         This changeset has changed the type of
>>                         `nmethod::_stack_traversal_mark` to `long` from `jlong`.
>>
>>                         SA should follow this change.
>>
>>                         I uploaded a webrev for this issue. This webrev is
>>                         generated from consolidated repo (jdk10/master).
>>                         Could you review it?
>>
>>                         http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/
>>
>>
>>                         I cannot access JPRT. So I need reviewer.
>>
>>
>>                         Thanks,
>>
>>                         Yasumasa
>>
>>

From david.holmes at oracle.com Wed Sep 20 23:35:05 2017
From: david.holmes at oracle.com (David Holmes)
Date: Thu, 21 Sep 2017 09:35:05 +1000
Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837
In-Reply-To: <2c955c17-43b9-906f-408d-f5349d57ca13@gmail.com>
References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com> <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com> <13b177d1-e04c-dd86-7a86-b55b165d5830@oracle.com> <2c955c17-43b9-906f-408d-f5349d57ca13@gmail.com>
Message-ID:

The opening announcement was somewhat premature. They created jdk10/hs but we're not quite ready to start accepting changes yet.

David

On 21/09/2017 8:44 AM, Yasumasa Suenaga wrote:
> Hi David,
>
> jdk10/hs has been opened [1].
> Could you push this change?
>
>
> Thanks,
>
> Yasumasa
>
>
> [1] http://mail.openjdk.java.net/pipermail/jdk10-dev/2017-September/000499.html
>
>
> On 2017/09/19 12:31, David Holmes wrote:
>> On 19/09/2017 1:19 PM, Yasumasa Suenaga wrote:
>>> Thanks David,
>>>
>>> BTW, can I push this change after jdk10/master is opened?
>>> I cannot access JPRT.
>>
>> I think we'd probably prefer this to go into jdk10/hs - once it is
>> open - and for that you need a sponsor.
>>
>> Thanks,
>> David

From yasuenag at gmail.com Wed Sep 20 23:57:57 2017
From: yasuenag at gmail.com (Yasumasa Suenaga)
Date: Thu, 21 Sep 2017 08:57:57 +0900
Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837
In-Reply-To:
References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com>
 <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com>
 <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com>
 <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com>
 <13b177d1-e04c-dd86-7a86-b55b165d5830@oracle.com>
 <2c955c17-43b9-906f-408d-f5349d57ca13@gmail.com>
Message-ID:

2017/09/21 8:35 "David Holmes":

> The opening announcement was somewhat premature. They created jdk10/hs
> but we're not quite ready to start accepting changes yet.

Where can I get the opening announcement for jdk10/hs?
I will send review request after that.

Thanks,

Yasumasa

From david.holmes at oracle.com Thu Sep 21 00:31:47 2017
From: david.holmes at oracle.com (David Holmes)
Date: Thu, 21 Sep 2017 10:31:47 +1000
Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837
In-Reply-To:
References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com>
 <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com>
 <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com>
 <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com>
 <13b177d1-e04c-dd86-7a86-b55b165d5830@oracle.com>
 <2c955c17-43b9-906f-408d-f5349d57ca13@gmail.com>
Message-ID:

On 21/09/2017 9:57 AM, Yasumasa Suenaga wrote:
> 2017/09/21 8:35 "David Holmes":
>
>     The opening announcement was somewhat premature. They created
>     jdk10/hs but we're not quite ready to start accepting changes yet.
>
>
> Where can I get the opening announcement for jdk10/hs?

hotspot-dev

> I will send review request after that.

You will need to rebase all your patches before they can be sponsored.

Thanks,
David

>
> Thanks,
>
> Yasumasa

From erik.helin at oracle.com Thu Sep 21 09:02:06 2017
From: erik.helin at oracle.com (Erik Helin)
Date: Thu, 21 Sep 2017 11:02:06 +0200
Subject: RFR: 8187676: Disable harmless uninitialized warnings for two files
In-Reply-To: <4f5b0427-54bf-2b85-0a94-bb41049d2676@oracle.com>
References: <2031be3e-2623-dde1-fff2-2d6cd6e41de9@oracle.com>
 <7512e87d-4e28-27a1-5e10-5cdfa794cdf4@oracle.com>
 <4f5b0427-54bf-2b85-0a94-bb41049d2676@oracle.com>
Message-ID: <1a8dd6cc-8cf2-bb0e-af09-ea53324c85e3@oracle.com>

Ok, let's wait for Rahul's patches. Rahul, when you post your patches,
CC me and I can check if gcc 7.1.1 still complains :)

Thanks,
Erik

On 09/19/2017 06:25 PM, Vladimir Kozlov wrote:
> I would prefer to have general solution Rahul is working on because code
> is general - not only x86 is affected.
>
> Thanks,
> Vladimir
>
> On 9/19/17 7:59 AM, Rahul Raghavan wrote:
>> Hi Erik,
>>
>> Please note that this 8187676 seems to be related to 8160404.
>>     https://bugs.openjdk.java.net/browse/JDK-8160404
>>     (RelocationHolder constructors have bugs)
>>
>> As per the latest notes comments added for 8160404-jbs, I will submit
>> webrev/RFR soon and will request help confirm similar issues with
>> latest gcc7 gets solved.
>>
>> Thanks,
>> Rahul
>>
>> On Tuesday 19 September 2017 07:07 PM, Erik Helin wrote:
>>> Hi all,
>>>
>>> with gcc 7.1.1 from Fedora 26 on x86-64 there are warnings about the
>>> potential usage of maybe uninitialized memory in
>>> src/hotspot/cpu/x86/assembler_x86.cpp and in
>>> src/hotspot/cpu/x86/interp_masm_x86.cpp.
>>>
>>> The problem arises from the class RelocationHolder in
>>> src/hotspot/share/code/relocInfo.hpp which has the private fields:
>>>    enum { _relocbuf_size = 5 };
>>>    void* _relocbuf[ _relocbuf_size ];
>>>
>>> and the default constructor for RelocationHolder does not initialize
>>> the elements of _relocbuf.
>>> I _think_ this is an optimization,
>>> RelocationHolder is used *a lot* and setting the elements of
>>> RelocationHolder::_relocbuf to NULL (or some other value) in the
>>> default constructor might result in a performance penalty. Have a
>>> look in
>>> build/linux-x86_64-normal-server-fastdebug/hotspot/variant-server/gensrc/adfiles
>>> and you will see that RelocationHolder is used all over the place :)
>>>
>>> AFAICS all users of RelocationHolder::_relocbuf take care to not use
>>> uninitialized memory, which means that this warning is wrong, so I
>>> suggest we disable the warning -Wmaybe-uninitialized for
>>> src/hotspot/cpu/x86/assembler_x86.cpp.
>>>
>>> The problem continues because the class Address in
>>> src/hotspot/cpu/x86/assembler_x86.hpp has a private field,
>>> `RelocationHolder _rspec;` and the default constructor for Address
>>> does not initialize _rspec._relocbuf (most likely for performance
>>> reasons). The class Address also has a default copy constructor,
>>> which will copy all the elements of _rspec._relocbuf, which will
>>> result in a read of uninitialized memory. However, this is a benign
>>> usage of uninitialized memory, since we take no action based on the
>>> content of the uninitialized memory (it is just copied byte for byte).
>>>
>>> So, in this case too, I suggest we disable the warning
>>> -Wuninitialized for src/hotspot/cpu/x86/assembler_x86.hpp.
>>>
>>> What do you think?
>>>
>>> Patch:
>>> http://cr.openjdk.java.net/~ehelin/8187676/00/
>>>
>>> --- old/make/hotspot/lib/JvmOverrideFiles.gmk    2017-09-19 15:11:45.036108983 +0200
>>> +++ new/make/hotspot/lib/JvmOverrideFiles.gmk    2017-09-19 15:11:44.692107277 +0200
>>> @@ -32,6 +32,8 @@
>>>   ifeq ($(TOOLCHAIN_TYPE), gcc)
>>>     BUILD_LIBJVM_vmStructs.cpp_CXXFLAGS := -fno-var-tracking-assignments -O0
>>>     BUILD_LIBJVM_jvmciCompilerToVM.cpp_CXXFLAGS := -fno-var-tracking-assignments
>>> +  BUILD_LIBJVM_assembler_x86.cpp_CXXFLAGS := -Wno-maybe-uninitialized
>>> +  BUILD_LIBJVM_interp_masm_x86.cpp_CXXFLAGS := -Wno-uninitialized
>>>   endif
>>>
>>>   ifeq ($(OPENJDK_TARGET_OS), linux)
>>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8187676
>>>
>>> Testing:
>>> - Compiles with:
>>>    - gcc 7.1.1 and glibc 2.25 on Fedora 26
>>>    - gcc 4.9.2 and glibc 2.12 on OEL 6.4
>>> - JPRT
>>>
>>> Thanks,
>>> Erik

From tobias.hartmann at oracle.com Thu Sep 21 12:37:09 2017
From: tobias.hartmann at oracle.com (Tobias Hartmann)
Date: Thu, 21 Sep 2017 14:37:09 +0200
Subject: [10] RFR(S): 8187780: VM crashes while generating replay compilation file
Message-ID:

Hi,

please review the following patch:
https://bugs.openjdk.java.net/browse/JDK-8187780
http://cr.openjdk.java.net/~thartmann/8187780/webrev.00/

Creation of a replay compilation file crashes the VM due to the following problems:
(1) ciInstanceKlass.cpp:
java_lang_String::as_quoted_ascii(value) may return NULL if the String is empty (see 'emptyString' in
TestDumpReplay.java). I added a NULL check.

(2) bytecodeInfo.cpp:
The liveness of the InlineTree object is limited by the scope of a ResourceMark. With incremental inlining, the
InlineTree is created within the scope of the ResourceMark in Compile::Optimize(), whereas the replay compilation
file is created in the scope of the caller method Compile::Compile() -> ciEnv::dump_replay_data(). We crash in
InlineTree::dump_replay_data() because the object was released. I changed the implementation to allocate the object
in the comp_arena.

TestDumpReplay.java triggers both crashes.

Tested with the hotspot testset on JPRT.
Thanks,
Tobias

From aph at redhat.com Thu Sep 21 13:04:07 2017
From: aph at redhat.com (Andrew Haley)
Date: Thu, 21 Sep 2017 14:04:07 +0100
Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
In-Reply-To: <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com>
 <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com>
 <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com>
 <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com>
 <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com>
 <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com>
 <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com>
Message-ID: <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com>

I reworked your benchmark to run faster and have less overhead, at
http://cr.openjdk.java.net/~aph/8186915/

Run it as

  java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen

The test here was run on (rather old) Applied Micro hardware. The
real issue is, I think, that almost all of the time of squareToLen
without an intrinsic is dominated by mulAdd, and that already has an
intrinsic. Asymptotically, an intrinsic squareToLen should take half
the time of multiplyToLen, but we don't see that. Indeed, we barely
see any advantage for UseSquareToLenIntrinsic.

For a larger size, we see this with intrinsics enabled:

BigIntegerBench.implMutliplyToLen     200  avgt    5    50833.555 ±  10.674  ns/op
BigIntegerBench.implSquareToLen       200  avgt    5    57607.460 ±  87.155  ns/op

BigIntegerBench.implMutliplyToLen    1000  avgt    5  1254728.119 ± 527.126  ns/op
BigIntegerBench.implSquareToLen      1000  avgt    5  1369841.961 ± 169.843  ns/op

which makes the problem clear, I believe.


No intrinsics:

Benchmark                          (size)  Mode  Cnt        Score     Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5       24.176 ±   0.006  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5       41.266 ±   0.008  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5       65.027 ±   0.019  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5      466.440 ±   0.080  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5    10613.512 ±   5.153  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5    34070.328 ±  10.991  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5    67546.985 ±  16.581  ns/op

-XX:+UseMultiplyToLenIntrinsic:

Benchmark                          (size)  Mode  Cnt        Score     Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5       25.661 ±   0.062  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5       29.183 ±   0.037  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5       51.690 ±   0.024  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5      193.401 ±   0.032  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5     3419.226 ±   0.312  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5    10638.801 ±   0.970  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5    21274.149 ±   7.188  ns/op


No Intrinsics:

Benchmark                          (size)  Mode  Cnt        Score     Error  Units
BigIntegerBench.implSquareToLen         1  avgt    5       38.933 ±   1.437  ns/op
BigIntegerBench.implSquareToLen         2  avgt    5       62.523 ±   0.007  ns/op
BigIntegerBench.implSquareToLen         3  avgt    5       82.114 ±   0.012  ns/op
BigIntegerBench.implSquareToLen        10  avgt    5      366.986 ±  10.148  ns/op
BigIntegerBench.implSquareToLen        50  avgt    5     5534.064 ±  88.895  ns/op
BigIntegerBench.implSquareToLen        90  avgt    5    16308.025 ±  29.203  ns/op
BigIntegerBench.implSquareToLen       127  avgt    5    31521.335 ±  49.421  ns/op

-XX:+UseMulAddIntrinsic:

Benchmark                          (size)  Mode  Cnt        Score     Error  Units
BigIntegerBench.implSquareToLen         1  avgt    5       46.268 ±   0.005  ns/op
BigIntegerBench.implSquareToLen         2  avgt    5       67.527 ±   0.017  ns/op
BigIntegerBench.implSquareToLen         3  avgt    5       97.975 ±   0.179  ns/op
BigIntegerBench.implSquareToLen        10  avgt    5      345.126 ±   0.037  ns/op
BigIntegerBench.implSquareToLen        50  avgt    5     4327.120 ±   9.942  ns/op
BigIntegerBench.implSquareToLen        90  avgt    5    13143.308 ±   1.217  ns/op
BigIntegerBench.implSquareToLen       127  avgt    5    25014.420 ±  16.221  ns/op

-XX:+UseSquareToLenIntrinsic

Benchmark                          (size)  Mode  Cnt        Score     Error  Units
BigIntegerBench.implSquareToLen         1  avgt    5       27.095 ±   0.012  ns/op
BigIntegerBench.implSquareToLen         2  avgt    5       49.185 ±   0.007  ns/op
BigIntegerBench.implSquareToLen         3  avgt    5       53.771 ±   0.013  ns/op
BigIntegerBench.implSquareToLen        10  avgt    5      238.843 ±   0.080  ns/op
BigIntegerBench.implSquareToLen        50  avgt    5     3828.313 ±   1.684  ns/op
BigIntegerBench.implSquareToLen        90  avgt    5    11949.819 ±   9.925  ns/op
BigIntegerBench.implSquareToLen       127  avgt    5    23613.427 ±  28.164  ns/op

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From vladimir.kozlov at oracle.com Thu Sep 21 15:42:29 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Thu, 21 Sep 2017 08:42:29 -0700
Subject: [10] RFR(S): 8187780: VM crashes while generating replay compilation file
In-Reply-To:
References:
Message-ID: <822b3c1b-d8a6-ddb0-2fa7-98de2f954161@oracle.com>

Looks good.

Thanks,
Vladimir

On 9/21/17 5:37 AM, Tobias Hartmann wrote:
> Hi,
>
> please review the following patch:
> https://bugs.openjdk.java.net/browse/JDK-8187780
> http://cr.openjdk.java.net/~thartmann/8187780/webrev.00/
>
> Creation of a replay compilation file crashes the VM due to the
> following problems:
> (1) ciInstanceKlass.cpp:
> java_lang_String::as_quoted_ascii(value) may return NULL if the String
> is empty (see 'emptyString' in TestDumpReplay.java). I added a NULL check.
>
> (2) bytecodeInfo.cpp:
> The liveness of the InlineTree object is limited by the scope of a
> ResourceMark. With incremental inlining, the InlineTree is created
> within the scope of the ResourceMark is in Compile::Optimize() whereas
> the replay compilation file is created in the scope of the caller method
> Compile::Compile() -> ciEnv::dump_replay_data(). We crash in
> InlineTree::dump_replay_data() because the object was released. I
> changed the implementation to allocate the object in the comp_arena.
>
> TestDumpReplay.java triggers both crashes.
>
> Tested with the hotspot testset on JPRT.
> > Thanks, > Tobias From dmitrij.pochepko at bell-sw.com Thu Sep 21 18:19:33 2017 From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko) Date: Thu, 21 Sep 2017 21:19:33 +0300 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com> References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com> <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com> Message-ID: <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com> Hi, thank you for looking into this and trying on APM(I have no access to this h/w). I've used modified benchmark you've sent and run it on ThunderX and implSquareToLen still shows better results than implMultiplyToLen in most cases on ThunderX (up to 10% on size=127. results: http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt). However, since performance difference for APM is more than on ThunderX, I think it'll be more logical to return back to your idea and call multiplyToLen intrinsic inside squareToLen. Alternative solution is to generate different code for APM and ThunderX, but I prefer to have single version in case of such relatively small difference in performance and it's still much faster than without intrinsic at all. What do you think? fyi: regarding size 200 and 1000 - it's incorrect to measure these sizes for squareToLen, because squareToLen is never called for size more than 127(I've mentioned it before). An upper level squaring algorithm divides larger arrays into few parts(smaller than 128 integers) and then squaring it separately. 
In order to compare squaring vs multiplication on longer sizes, we should compare BigInteger::multiply vs BigInteger::square methods with full logic behind it, because this is what's called in real situation instead of direct intrinsified method call. I've uploaded benchmark with multiply method measurement here: http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench2.java just in case.

Thanks,
Dmitrij

On 21.09.2017 16:04, Andrew Haley wrote:
> I reworked your benchmark to run faster and have less overhead, at
> http://cr.openjdk.java.net/~aph/8186915/
>
> Run it as
>
>   java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen
>
> The test here was run on (rather old) Applied Micro hardware. The
> real issue is, I think, that almost all of the time of squareToLen
> without an intrinsic is dominated by mulAdd, and that already has an
> intrinsic. Asymptotically, an intrinsic squareToLen should take half
> the time of multiplyToLen, but we don't see that. Indeed, we barely
> see any advantage for UseSquareToLenIntrinsic.
>
> For a larger size, we see this with intrinsics enabled:
>
> BigIntegerBench.implMutliplyToLen     200  avgt    5    50833.555 ±  10.674  ns/op
> BigIntegerBench.implSquareToLen       200  avgt    5    57607.460 ±  87.155  ns/op
>
> BigIntegerBench.implMutliplyToLen    1000  avgt    5  1254728.119 ± 527.126  ns/op
> BigIntegerBench.implSquareToLen      1000  avgt    5  1369841.961 ± 169.843  ns/op
>
> which makes the problem clear, I believe.
>
>
> No intrinsics:
>
> Benchmark                          (size)  Mode  Cnt        Score     Error  Units
> BigIntegerBench.implMutliplyToLen       1  avgt    5       24.176 ±   0.006  ns/op
> BigIntegerBench.implMutliplyToLen       2  avgt    5       41.266 ±   0.008  ns/op
> BigIntegerBench.implMutliplyToLen       3  avgt    5       65.027 ±   0.019  ns/op
> BigIntegerBench.implMutliplyToLen      10  avgt    5      466.440 ±   0.080  ns/op
> BigIntegerBench.implMutliplyToLen      50  avgt    5    10613.512 ±   5.153  ns/op
> BigIntegerBench.implMutliplyToLen      90  avgt    5    34070.328 ±  10.991  ns/op
> BigIntegerBench.implMutliplyToLen     127  avgt    5    67546.985 ±  16.581  ns/op
>
> -XX:+UseMultiplyToLenIntrinsic:
>
> Benchmark                          (size)  Mode  Cnt        Score     Error  Units
> BigIntegerBench.implMutliplyToLen       1  avgt    5       25.661 ±   0.062  ns/op
> BigIntegerBench.implMutliplyToLen       2  avgt    5       29.183 ±   0.037  ns/op
> BigIntegerBench.implMutliplyToLen       3  avgt    5       51.690 ±   0.024  ns/op
> BigIntegerBench.implMutliplyToLen      10  avgt    5      193.401 ±   0.032  ns/op
> BigIntegerBench.implMutliplyToLen      50  avgt    5     3419.226 ±   0.312  ns/op
> BigIntegerBench.implMutliplyToLen      90  avgt    5    10638.801 ±   0.970  ns/op
> BigIntegerBench.implMutliplyToLen     127  avgt    5    21274.149 ±   7.188  ns/op
>
>
> No Intrinsics:
>
> Benchmark                          (size)  Mode  Cnt        Score     Error  Units
> BigIntegerBench.implSquareToLen         1  avgt    5       38.933 ±   1.437  ns/op
> BigIntegerBench.implSquareToLen         2  avgt    5       62.523 ±   0.007  ns/op
> BigIntegerBench.implSquareToLen         3  avgt    5       82.114 ±   0.012  ns/op
> BigIntegerBench.implSquareToLen        10  avgt    5      366.986 ±  10.148  ns/op
> BigIntegerBench.implSquareToLen        50  avgt    5     5534.064 ±  88.895  ns/op
> BigIntegerBench.implSquareToLen        90  avgt    5    16308.025 ±  29.203  ns/op
> BigIntegerBench.implSquareToLen       127  avgt    5    31521.335 ±  49.421  ns/op
>
> -XX:+UseMulAddIntrinsic:
>
> Benchmark                          (size)  Mode  Cnt        Score     Error  Units
> BigIntegerBench.implSquareToLen         1  avgt    5       46.268 ±   0.005  ns/op
> BigIntegerBench.implSquareToLen         2  avgt    5       67.527 ±   0.017  ns/op
> BigIntegerBench.implSquareToLen         3  avgt    5       97.975 ±   0.179  ns/op
> BigIntegerBench.implSquareToLen        10  avgt    5      345.126 ±   0.037  ns/op
> BigIntegerBench.implSquareToLen        50  avgt    5     4327.120 ±   9.942  ns/op
> BigIntegerBench.implSquareToLen        90  avgt    5    13143.308 ±   1.217  ns/op
> BigIntegerBench.implSquareToLen       127  avgt    5    25014.420 ±  16.221  ns/op
>
> -XX:+UseSquareToLenIntrinsic
>
> Benchmark                          (size)  Mode  Cnt        Score     Error  Units
> BigIntegerBench.implSquareToLen         1  avgt    5       27.095 ±   0.012  ns/op
> BigIntegerBench.implSquareToLen         2  avgt    5       49.185 ±   0.007  ns/op
> BigIntegerBench.implSquareToLen         3  avgt    5       53.771 ±   0.013  ns/op
> BigIntegerBench.implSquareToLen        10  avgt    5      238.843 ±   0.080  ns/op
> BigIntegerBench.implSquareToLen        50  avgt    5     3828.313 ±   1.684  ns/op
> BigIntegerBench.implSquareToLen        90  avgt    5    11949.819 ±   9.925  ns/op
> BigIntegerBench.implSquareToLen       127  avgt    5    23613.427 ±  28.164  ns/op
>
>

From tobias.hartmann at oracle.com Fri Sep 22 04:53:52 2017
From: tobias.hartmann at oracle.com (Tobias Hartmann)
Date: Fri, 22 Sep 2017 06:53:52 +0200
Subject: [10] RFR(S): 8187780: VM crashes while generating replay compilation file
In-Reply-To: <822b3c1b-d8a6-ddb0-2fa7-98de2f954161@oracle.com>
References: <822b3c1b-d8a6-ddb0-2fa7-98de2f954161@oracle.com>
Message-ID: <79c24379-0d8b-4a85-3096-3bbb3e41bc58@oracle.com>

Hi Vladimir,

thanks for the review!

Best regards,
Tobias

On 21.09.2017 17:42, Vladimir Kozlov wrote:
> Looks good.
>
> Thanks,
> Vladimir
>
> On 9/21/17 5:37 AM, Tobias Hartmann wrote:
>> Hi,
>>
>> please review the following patch:
>> https://bugs.openjdk.java.net/browse/JDK-8187780
>> http://cr.openjdk.java.net/~thartmann/8187780/webrev.00/
>>
>> Creation of a replay compilation file crashes the VM due to the following problems:
>> (1) ciInstanceKlass.cpp:
>> java_lang_String::as_quoted_ascii(value) may return NULL if the String is empty (see 'emptyString' in
>> TestDumpReplay.java). I added a NULL check.
>>
>> (2) bytecodeInfo.cpp:
>> The liveness of the InlineTree object is limited by the scope of a ResourceMark. With incremental inlining, the
>> InlineTree is created within the scope of the ResourceMark is in Compile::Optimize() whereas the replay compilation
>> file is created in the scope of the caller method Compile::Compile() -> ciEnv::dump_replay_data(). We crash in
>> InlineTree::dump_replay_data() because the object was released. I changed the implementation to allocate the object in
>> the comp_arena.
>>
>> TestDumpReplay.java triggers both crashes.
>>
>> Tested with the hotspot testset on JPRT.
>> >> Thanks, >> Tobias From aph at redhat.com Fri Sep 22 08:12:23 2017 From: aph at redhat.com (Andrew Haley) Date: Fri, 22 Sep 2017 09:12:23 +0100 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com> References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com> <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com> <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com> Message-ID: <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com> On 21/09/17 19:19, Dmitrij Pochepko wrote: > thank you for looking into this and trying on APM(I have no access to > this h/w). > > > I've used modified benchmark you've sent and run it on ThunderX and > implSquareToLen still shows better results than implMultiplyToLen in > most cases on ThunderX (up to 10% on size=127. results: > http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt). For 10%, it's not worth doing, given the risks and that it's not used by crypto operations when C2-compiled. > However, since performance difference for APM is more than on > ThunderX, I think it'll be more logical to return back to your idea > and call multiplyToLen intrinsic inside squareToLen. Alternative > solution is to generate different code for APM and ThunderX, but I > prefer to have single version in case of such relatively small > difference in performance and it's still much faster than without > intrinsic at all. What do you think? Yes. Calling multiplyToLen would be fine. > fyi: regarding size 200 and 1000 - it's incorrect to measure these > sizes for squareToLen, because squareToLen is never called for size > more than 127(I've mentioned it before). 
It's not incorrect: it's a test for asymptotic behaviour. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From igor.veresov at oracle.com Fri Sep 22 21:12:57 2017 From: igor.veresov at oracle.com (Igor Veresov) Date: Fri, 22 Sep 2017 14:12:57 -0700 Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions In-Reply-To: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com> References: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com> Message-ID: <2B0CE295-504C-4D33-8938-82FB10CB88EF@oracle.com> Looks good to me. igor > On Sep 11, 2017, at 7:21 PM, Dean Long wrote: > > https://bugs.openjdk.java.net/browse/JDK-8132547 > > http://cr.openjdk.java.net/~dlong/8132547/ > > This enhancement is a first step in supporting invokedynamic instructions in AOT. Previously, when we saw an invokedynamic instruction, or any anonymous class, we would generate code to bail out and deoptimize. With this changeset we go a little further and call into the runtime to resolve the dynamic constant pool entry, running the bootstrap method, and returning the adapter method and appendix object. Like class initialization in AOT, we only do this the first time through. Because AOT double-checks classes using fingerprints and symbolic names, special care was required to handle anonymous class names. The solution I chose was to name anonymous types with aliases based on their constant pool location ("adapter" and appendix"). > > Future work is needed to AOT-compile the anonymous classes and/or inline through them, so this change is not expected to affect AOT performance. In my tests I was not able to measure any difference. > > Upstream Graal changes have already been pushed. I broke the JVMCI and hotspot changes into separate webrevs. 
> > dl > From calvin.cheung at oracle.com Fri Sep 22 23:24:55 2017 From: calvin.cheung at oracle.com (Calvin Cheung) Date: Fri, 22 Sep 2017 16:24:55 -0700 Subject: RFR(XS): 8187884: [TESTBUG] compiler/classUnloading/anonymousClass/TestAnonymousClassUnloading failed with ClassNotFoundException Message-ID: <59C59BC7.5040401@oracle.com> This test failure started showing up after repo consolidation when running via JPRT -testset hotspot. The problem went away if I change the file header similar to classUnloading/methodUnloading/TestMethodUnloading.java. bug: https://bugs.openjdk.java.net/browse/JDK-8187884 webrev: http://cr.openjdk.java.net/~ccheung/8187884/webrev.00/ It passed JPRT -testset hotspot. thanks, Calvin From dean.long at oracle.com Mon Sep 25 08:17:24 2017 From: dean.long at oracle.com (dean.long at oracle.com) Date: Mon, 25 Sep 2017 01:17:24 -0700 Subject: [10] RFR(L): 8132547: [AOT] support invokedynamic instructions In-Reply-To: <2B0CE295-504C-4D33-8938-82FB10CB88EF@oracle.com> References: <5bd196a3-93f8-b59d-86d8-a87e0cfc8179@oracle.com> <2B0CE295-504C-4D33-8938-82FB10CB88EF@oracle.com> Message-ID: <124bff79-e167-c21c-2bc4-7f558bd63bb7@oracle.com> Thanks Igor. dl On 9/22/17 2:12 PM, Igor Veresov wrote: > Looks good to me. > > igor > >> On Sep 11, 2017, at 7:21 PM, Dean Long wrote: >> >> https://bugs.openjdk.java.net/browse/JDK-8132547 >> >> http://cr.openjdk.java.net/~dlong/8132547/ >> >> This enhancement is a first step in supporting invokedynamic instructions in AOT. Previously, when we saw an invokedynamic instruction, or any anonymous class, we would generate code to bail out and deoptimize. With this changeset we go a little further and call into the runtime to resolve the dynamic constant pool entry, running the bootstrap method, and returning the adapter method and appendix object. Like class initialization in AOT, we only do this the first time through. 
Because AOT double-checks classes using fingerprints and symbolic names, special care was required to handle anonymous class names. The solution I chose was to name anonymous types with aliases based on their constant pool location ("adapter" and appendix"). >> >> Future work is needed to AOT-compile the anonymous classes and/or inline through them, so this change is not expected to affect AOT performance. In my tests I was not able to measure any difference. >> >> Upstream Graal changes have already been pushed. I broke the JVMCI and hotspot changes into separate webrevs. >> >> dl >> From aph at redhat.com Mon Sep 25 11:04:31 2017 From: aph at redhat.com (Andrew Haley) Date: Mon, 25 Sep 2017 12:04:31 +0100 Subject: [10] RFR(S): 8187684 - Intrinsify Math.multiplyHigh(long, long) In-Reply-To: <5cfb314b-9785-fe47-8797-a899f38643ef@redhat.com> References: <5cfb314b-9785-fe47-8797-a899f38643ef@redhat.com> Message-ID: <32a35ee7-9593-bbc4-a540-37c590ec3ba6@redhat.com> On 20/09/17 14:29, Andrew Haley wrote: > On 20/09/17 14:08, Dmitrij Pochepko wrote: >> please review small patch for enhancement: 8187684 - Intrinsify >> Math.multiplyHigh(long, long) > > OK, thanks. Dmitrij, do you have a sponsor for this? I'm sure Vladimir would be happy to help. :-) -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. 
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Mon Sep 25 12:04:47 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 25 Sep 2017 15:04:47 +0300
Subject: [10] RFR(S): 8187684 - Intrinsify Math.multiplyHigh(long, long)
In-Reply-To: <32a35ee7-9593-bbc4-a540-37c590ec3ba6@redhat.com>
References: <5cfb314b-9785-fe47-8797-a899f38643ef@redhat.com>
 <32a35ee7-9593-bbc4-a540-37c590ec3ba6@redhat.com>
Message-ID: <5fa7e391-bde0-9310-7bcd-98cf96b3f158@bell-sw.com>

On 25.09.2017 14:04, Andrew Haley wrote:
> On 20/09/17 14:29, Andrew Haley wrote:
>> On 20/09/17 14:08, Dmitrij Pochepko wrote:
>>> please review small patch for enhancement: 8187684 - Intrinsify
>>> Math.multiplyHigh(long, long)
>> OK, thanks.
> Dmitrij, do you have a sponsor for this? I'm sure Vladimir would
> be happy to help. :-)
>
Hi,

Vladimir, can you sponsor it?

Thanks,
Dmitrij

From rwestrel at redhat.com Mon Sep 25 14:19:08 2017
From: rwestrel at redhat.com (Roland Westrelin)
Date: Mon, 25 Sep 2017 16:19:08 +0200
Subject: RFR(S): 8187822: C2 conditonal move optimization might create broken graph
Message-ID:

http://cr.openjdk.java.net/~roland/8187822/webrev.00/

PhaseIdealLoop::conditional_move() replaces an if diamond with a CMoveX node, sets the control of the CMoveX node right above the if node and adjusts the control of the inputs of the CMoveX node so it dominates the CMoveX node. The bug here is that one input of a CMoveX node can depend on another data node whose control is one of the branches of the if. PhaseIdealLoop::conditional_move() doesn't adjust the control of that node and so loop opts proceed with a node whose control doesn't dominate the control of its uses.

The test case is an example of this. The if (flag2) test is converted to a CMoveI. One of its inputs is f + v1: an AddI that has a LoadI input. Control of the AddI is set to be right above the if but the control of the LoadI is not adjusted and in this case is the if branch.
In this round of loop opts, the AddI is considered for a split thru phi. C2 shouldn't proceed with the split thru phi because its LoadI input is not a phi and doesn't strictly dominate the region of the phi. The strict domination test is implemented in PhaseIdealLoop::has_local_phi_input() as:

  get_ctrl(m) == n_ctrl

which fails for the LoadI node because its control is below the AddI node.

The fix I suggest is to set the control of the new CMoveX node below the if diamond. Then there's no need to fix the control of the inputs of the CMoveX.

Roland.

From aph at redhat.com Mon Sep 25 15:09:45 2017
From: aph at redhat.com (Andrew Haley)
Date: Mon, 25 Sep 2017 16:09:45 +0100
Subject: [10] RFR(S): 8187684 - Intrinsify Math.multiplyHigh(long, long)
In-Reply-To:
References:
Message-ID:

On 20/09/17 14:08, Dmitrij Pochepko wrote:
> I've created a small JMH benchmark:
> http://cr.openjdk.java.net/~dpochepk/8187684/MultiplyHighBench.java to
> test the improved performance and measured it on aarch64(t88, R-Pi) and
> x86_64(i7-4770K). Benchmark shows about x2.5 improvement on aarch64 and
> about x2 on x86_64

By the way, this benchmark:

    for (int i = 0; i < 100; i++) {
        op1 = Math.multiplyHigh(op1, op2++);
    }
    return Math.multiplyHigh(op1, op2);

measures the latency of the multiplyHigh, not the throughput, because each iteration depends on the previous one. I don't know if that was your intent, but I would imagine we're more interested in throughput. Fast processors can issue a mulh every few clock cycles, but their latency may be considerably longer.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd.
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From dmitrij.pochepko at bell-sw.com Mon Sep 25 15:44:06 2017
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 25 Sep 2017 18:44:06 +0300
Subject: [10] RFR(S): 8187684 - Intrinsify Math.multiplyHigh(long, long)
In-Reply-To:
References:
Message-ID: <6f8f8c9d-ef6c-eaa1-e169-f2d87c0f4419@bell-sw.com>

On 25.09.2017 18:09, Andrew Haley wrote:
> On 20/09/17 14:08, Dmitrij Pochepko wrote:
>> I've created a small JMH benchmark:
>> http://cr.openjdk.java.net/~dpochepk/8187684/MultiplyHighBench.java to
>> test the improved performance and measured it on aarch64(t88, R-Pi) and
>> x86_64(i7-4770K). Benchmark shows about x2.5 improvement on aarch64 and
>> about x2 on x86_64
> By the way, this benchmark:
>
>     for (int i = 0; i < 100; i++) {
>         op1 = Math.multiplyHigh(op1, op2++);
>     }
>     return Math.multiplyHigh(op1, op2);
>
> measures the latency of the multiplyHigh, not the throughput, because
> each iteration depends on the previous one. I don't know if that was
> your intent, but I would imagine we're more interested in throughput.
> Fast processors can issue a mulh every few clock cycles, but their
> latency may considerably longer.
>
You're right. I've changed the benchmark to:

        long op = System.currentTimeMillis();
        long accum = 0;
        for (int i = 0; i < 10000; i++) {
            accum += Math.multiplyHigh(op + i, op + i);
        }
        return accum;

and it shows even more improvement: about x3.5 on aarch64.

Thank you for noticing.
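
The two loop shapes discussed above can be put side by side in plain Java. This is only a minimal sketch, assuming Java 9 or later (for Math.multiplyHigh); the class name MultiplyHighLoops and the method names dependentChain and independentSum are hypothetical and are not taken from MultiplyHighBench.java, and the code is not a JMH benchmark, so it shows the data dependencies rather than producing trustworthy timings.

```java
// Hypothetical illustration class; not part of the posted benchmark.
final class MultiplyHighLoops {

    // Latency-bound shape: every multiplyHigh consumes the result of the
    // previous one, so the multiplies cannot overlap in the pipeline.
    static long dependentChain(long op1, long op2, int iterations) {
        for (int i = 0; i < iterations; i++) {
            op1 = Math.multiplyHigh(op1, op2++);
        }
        return Math.multiplyHigh(op1, op2);
    }

    // Throughput-bound shape: each multiplyHigh depends only on the loop
    // counter, so successive multiplies are independent; only the cheap
    // additions into accum form a dependency chain.
    static long independentSum(long op, int iterations) {
        long accum = 0;
        for (int i = 0; i < iterations; i++) {
            accum += Math.multiplyHigh(op + i, op + i);
        }
        return accum;
    }

    public static void main(String[] args) {
        long seed = System.currentTimeMillis();
        System.out.println(dependentChain(seed, seed + 1, 100));
        System.out.println(independentSum(seed, 10_000));
    }
}
```

In a real measurement both methods would sit inside a JMH @Benchmark method with their results returned or consumed, as in the loops quoted in the messages above.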
From dmitrij.pochepko at bell-sw.com Mon Sep 25 15:46:43 2017 From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko) Date: Mon, 25 Sep 2017 18:46:43 +0300 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com> References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com> <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com> <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com> <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com> Message-ID: Hi, please take a look at v2. I've modified code to use multiplyToLen in squareToLen. Additional benefit: no more code in common part. I've left mulAdd unchanged. http://cr.openjdk.java.net/~dpochepk/8186915/webrev.02/ I've also rerun benchmark on ThunderX and got these results: http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt Thanks, Dmitrij On 22.09.2017 11:12, Andrew Haley wrote: > On 21/09/17 19:19, Dmitrij Pochepko wrote: > >> thank you for looking into this and trying on APM(I have no access to >> this h/w). >> >> >> I've used modified benchmark you've sent and run it on ThunderX and >> implSquareToLen still shows better results than implMultiplyToLen in >> most cases on ThunderX (up to 10% on size=127. results: >> http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt). > For 10%, it's not worth doing, given the risks and that it's not used > by crypto operations when C2-compiled. > >> However, since performance difference for APM is more than on >> ThunderX, I think it'll be more logical to return back to your idea >> and call multiplyToLen intrinsic inside squareToLen. 
Alternative >> solution is to generate different code for APM and ThunderX, but I >> prefer to have single version in case of such relatively small >> difference in performance and it's still much faster than without >> intrinsic at all. What do you think? > Yes. Calling multiplyToLen would be fine. > >> fyi: regarding size 200 and 1000 - it's incorrect to measure these >> sizes for squareToLen, because squareToLen is never called for size >> more than 127(I've mentioned it before). > It's not incorrect: it's a test for asymptotic behaviour. From aph at redhat.com Mon Sep 25 15:57:43 2017 From: aph at redhat.com (Andrew Haley) Date: Mon, 25 Sep 2017 16:57:43 +0100 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com> <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com> <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com> <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com> Message-ID: <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com> On 25/09/17 16:46, Dmitrij Pochepko wrote: > please take a look at v2. I've modified code to use multiplyToLen in > squareToLen. Additional benefit: no more code in common part. I've left > mulAdd unchanged. That looks fine. Please commit if you've run the jtreg test suite. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. 
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From martin.doerr at sap.com Mon Sep 25 16:14:49 2017 From: martin.doerr at sap.com (Doerr, Martin) Date: Mon, 25 Sep 2017 16:14:49 +0000 Subject: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic In-Reply-To: <2bc62b4e121b4422af3ae142508e8977@serv031.corp.eldorado.org.br> References: <4999bc2a3f0640dfb6dd75d23b4f30ea@sap.com> <0089f9f653a6442aa672af2e15b2b864@serv030.corp.eldorado.org.br> <59397a3749024e91b56be6e990a3250d@sap.com> <363c2378f23e4be2bf60b622594c60fe@sap.com> <59A089F4.6010504@linux.vnet.ibm.com> <34e6550d426440bab3b8a54a82e25190@sap.com> <59A9850B.7030302@linux.vnet.ibm.com> <3e64d11f046f44379f9658dffc766a45@sap.com> <670dd284fe77479986abe75aca42b20a@sap.com> <875858bb7bda421b97352d0fb2972a73@serv031.corp.eldorado.org.br> <2726388e5b224cdcb75a9c0931ec1e44@sap.com> <9bf856aaba4049b0940f5e2f36bf91f4@sap.com> <2bc62b4e121b4422af3ae142508e8977@serv031.corp.eldorado.org.br> Message-ID: <33cc203cca0340cebff8c0b07106922e@sap.com> Hi, here's a version which applies to the new jdk10/hs: http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.06/ The change contains small changes in the jtreg test code, so we need a sponsor from Oracle. The PPC64 code is already reviewed. May I ask for a volunteer? Best regards, Martin -----Original Message----- From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] Sent: Montag, 18. September 2017 15:02 To: Doerr, Martin ; Lindenmaier, Goetz ; Gustavo Romero Cc: 'hotspot-compiler-dev at openjdk.java.net' ; ppc-aix-port-dev at openjdk.java.net Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic Hi Martin, I agree with your changes. Performance and correctness is Ok on little endian. 
Thanks

> -----Original Message-----
> From: Doerr, Martin [mailto:martin.doerr at sap.com]
> Sent: sexta-feira, 15 de setembro de 2017 13:11
> To: Gustavo Serra Scalet ; Lindenmaier,
> Goetz ; Gustavo Romero
>
> Cc: 'hotspot-compiler-dev at openjdk.java.net' dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net
> Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
>
> Hi Gustavo,
>
> I had to make some more changes to get it working on linux BE and AIX.
>
> New webrev:
> http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.05/
>
> Changes:
> - Fixed remaining lvsr/lvsl issues for BE
> - 16 byte alignment doesn't work with xlC on AIX. Workaround: use malloc
> on AIX
> - Changed address computations to provide more freedom for out-of-order
> execution
> - Removed some more unused stuff
>
> Please take a look. Maybe you find some further improvements. You may
> want to rerun tests and benchmarks.
> Tests have passed on linux BE and LE as well as on AIX. We'll run them
> again over the weekend.
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Doerr, Martin
> Sent: Dienstag, 12. September 2017 18:32
> To: 'Gustavo Serra Scalet' ;
> Lindenmaier, Goetz ; Gustavo Romero
>
> Cc: 'hotspot-compiler-dev at openjdk.java.net' dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net
> Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
>
> Hi Gustavo,
>
> thanks for debugging. It should fix the problem for LE, but the previous
> version was correct for BE.
>
> Version which works for both:
> http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.04/
>
> Changes:
> - factored out lvsr/lvsl
> - fixed ofs <= limit comparison (treat as positive ints to ignore
> garbage in high half and be protected against integer overflow, use <=
> see DigestBase.java)
> - removed unused labels
> - added the contributors you mentioned to Contributed-by list (not
> comments, I think it's better there)
>
> Please let us know if this is ok for you. We'll do some more testing on
> all platforms.
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> Sent: Dienstag, 12. September 2017 17:48
> To: Doerr, Martin ; Lindenmaier, Goetz
> ; Gustavo Romero
> Cc: 'hotspot-compiler-dev at openjdk.java.net' dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net
> Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
>
> Hi Martin and Götz,
>
> I was taking a closer look at the hotspot's tests/compiler and I see
> indeed one test failing for sha:
> Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnSupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnSupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnUnsupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnUnsupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnUnsupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA1IntrinsicsOptionOnSupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnUnsupportedCPU.java
> Passed: compiler/intrinsics/sha/cli/TestUseSHAOptionOnSupportedCPU.java
> FAILED: compiler/intrinsics/sha/TestSHA.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA1Intrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA256Intrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA1MultiBlockIntrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA512Intrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA512MultiBlockIntrinsics.java
> Passed: compiler/intrinsics/sha/sanity/TestSHA256MultiBlockIntrinsics.java
>
> That one was failing due to a bug on unaligned memory load. I took a
> closer look and fixed it. It should also work on Big Endian:
> https://gut.github.io/openjdk/webrev/JDK-8185979/webrev.02/index.html
>
> This new webrev was updated on top of Martin's webrev.03.
>
> I also took this chance to add all the contributors to this patch, as
> you suggested before.
>
> Thanks
>
> > -----Original Message-----
> > From: Doerr, Martin [mailto:martin.doerr at sap.com]
> > Sent: segunda-feira, 11 de setembro de 2017 14:06
> > To: Lindenmaier, Goetz ; Gustavo Romero
> > ; Gustavo Serra Scalet
> >
> > Cc: 'hotspot-compiler-dev at openjdk.java.net'
> dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net
> > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic
> >
> > Hi Götz and Gustavo,
> >
> > I had just posted the version I had before leaving. Thanks for your
> > feedback.
> > > > New webrev is here: > > http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.03/ > > > > Changes to webrev.02: > > - Referenced paper > > - Factored out endianness specific vector permute instructions > > (vec_perm with only 3 parms to reduce risk of mixing them up) > > - Removed code for PPC64 platforms which didn't support it > > - code_size2 = 22000 > > - Added missing ')' in IntrinsicPredicates.java > > > > My changes shouldn't change the behavior of the little endian > > implementation. > > We have to check if and if yes which tests still fail. Are there any > > updates on this? > > > > Best regards, > > Martin > > > > > > -----Original Message----- > > From: Lindenmaier, Goetz > > Sent: Donnerstag, 7. September 2017 09:55 > > To: Gustavo Romero ; Doerr, Martin > > ; Gustavo Serra Scalet > > > > Cc: 'hotspot-compiler-dev at openjdk.java.net' > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > Hi, > > > > I had a look at this change. > > > > Martin, you missed a ')' in IntrinsicPredicates.java. > > > > Combined with the multiplyToLen change, stub codebuffer space runs > out. > > Please increase > > code_size2 = 20000 > > to 22000 in stubRoutines_ppc.hpp. > > > > I see TestSHA.java failing on linuxppc64le. > > Also, other tests are failing with SHA-256 digest error ... > > > > Also, on aix, some of our internal tests are failing. These didn't run > > on linuxppc64 on a Power8 machine, so it might fail there, too. But > > on the big endian platforms, the jtreg tests don't fail. > > > > @Gustavo, maybe you can have a look at the issues on linuxppc64le and > > post a new webrev. Then Martin can fix the remaining issue on big > > endian. > > > > Best regards, > > Goetz. 
> > > > > > > > > > > -----Original Message----- > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > bounces at openjdk.java.net] On Behalf Of Gustavo Romero > > > Sent: Freitag, 1. September 2017 18:04 > > > To: Doerr, Martin ; Gustavo Serra Scalet > > > > > > Cc: 'hotspot-compiler-dev at openjdk.java.net' > > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > > > Hi Martin! > > > > > > On 01-09-2017 12:39, Doerr, Martin wrote: > > > > Hi Gustavos, > > > > > > > > I have managed to upload a version which seems to work on both > > > endianness implementations. > > > > At least some quick tests have passed on AIX and Big Endian linux > > > > in > > > addition to Little Endian linux. > > > > > > Great! :-) > > > > > > > > > > I'll be out next week, but the change looks ok for me. Please let > > > > me know if > > > the changed version still looks ok for you, too. Feel free to > > > overwork or improve it. > > > > It'd also be good to know, if relying on vrsave=-1 is safe. > > > > > > Sure, Martin. I'm chasing what's exactly setting vrsave=-1 and the > > > full history log (looks like it's not in the kernel, but I'm > > > checking > > yet). > > > > > > > > > > Is the copyright information ok? Did you get source code which > > > > requires to > > > be mentioned in the comments? > > > > The code looks similar to a reference implementation, so the > > > > authors of it > > > may want to be mentioned? > > > > Or did you just use the paper for implementing it? In this case, > > > > I'd mention > > > the paper. > > > > > > Gustavo S: the information on the paper must be updated accordingly > > > as Martin noted in the new webrev. There is none currently. > > > > > > > > > > After we got a second review and ran more tests, we can ask > > > > somebody > > > from Oracle to push it. 
> > > > > > > > Thanks for contributing and your support, Martin > > > > > > Thanks a lot for reviewing and for all the help. > > > > > > Regards, > > > Gustavo R > > > > > > > > > > > -----Original Message----- > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > bounces at openjdk.java.net] On Behalf Of Doerr, Martin > > > > Sent: Donnerstag, 31. August 2017 18:21 > > > > To: Gustavo Romero > > > > Cc: 'hotspot-compiler-dev at openjdk.java.net' > > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > > > Subject: RE: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > > > > > Hi Gustavo R, > > > > > > > > I guess you're right. vrsave is already set to -1, so all Vector > > > > Registers get > > > saved. > > > > It'd be good to know where it is set (OS, Flag in ELF header, ???) > > > > and if this > > > is guaranteed. > > > > I don't want to risk getting sporadic errors on some OS versions. > > > > > > > > I'd like to enable SHA intrinsics on linux BE as well. I already > > > > managed to get > > > the 256 bit version working (was quite some work!). > > > > > > > > Thanks and best regards, > > > > Martin > > > > > > > > > > > > -----Original Message----- > > > > From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] > > > > Sent: Freitag, 25. August 2017 22:35 > > > > To: Doerr, Martin > > > > Cc: Gustavo Serra Scalet ; > > > > 'hotspot- > > > compiler-dev at openjdk.java.net' > > dev at openjdk.java.net>; ppc-aix-port-dev at openjdk.java.net > > > > Subject: Re: [10] RFR(L): 8185979: PPC64: Implement SHA2 intrinsic > > > > > > > > Hi Martin, > > > > > > > > On 25-08-2017 13:18, Doerr, Martin wrote: > > > >> I think you didn't get my point about AIX. > > > >> Your current version doesn't break AIX, but it lacks SHA2 > > > >> acceleration for > > > AIX on Power 8 and newer, which is still relevant. 
> > > >> So I'd like to ask you kindly to take a look if Big Endian > > > >> support for the stub > > > could be added without high effort. AIX doesn't need VRSAVE handling > > > (like Little Endian linux, unlike Big Endian linux), so a few lines > > > in the stub could possibly be enough. I can assist with testing. > > > > > > > > I don't think that VRSAVE is handled on Linux, even on BE. > > > > Although BE ABI > > > [1] > > > > says: > > > > > > > > "Functions must ensure that the appropriate bits in the vrsave > > > > register are > > > set for any vector registers they use" > > > > > > > > and LE ABI does not say that, even on Linux BE VRSAVE is not in > > > > effect used to determine which vector registers (VMX/Altivec) > > > > should be > > > saved/restored. > > > > No application uses it on Linux, so I would say that VRSAVE is > > > > ignored on > > > Linux > > > > completely both on BE and LE. save/restore library interfaces > > > > don't pay attention to it in glibc: VRSAVE is just saved/restored > > > > completely in > > > mechanisms > > > > of swap/get/setcontext(), set/longjump(), and dl-trampoline() and > > > > that's > > > all. I > > > > checked that with toolchain folks and they agree. We've already > > > > discussed > > > that a > > > > long time ago but at that time I was just using the vector-scalar > > > > registers [2] and at that time I agreed that if VMX/Altivec was in > > > > use instead of the VSX > > > so > > > > VRSAVE should be handled accordingly. But I have a different > > > > opinion > > > now... > > > > > > > > I'm wondering if something would really break on Linux BE if we > > > > forget > > > about > > > > VRSAVE at all in the JVM. If not, we could forget about VRSAVE > > > > forever on > > > Linux. > > > > Looks like VRSAVE was sort of born to the oblivion... ? 
> > > > > > > > > > > > Kind regards, > > > > Gustavo > > > > > > > > [1] http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html > > > > [2] http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002508.html > > > > From dmitrij.pochepko at bell-sw.com Mon Sep 25 16:33:00 2017 From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko) Date: Mon, 25 Sep 2017 19:33:00 +0300 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com> References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com> <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com> <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com> <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com> <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com> Message-ID: <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com> Thank you for such attentive review. I'll commit it now. I've run jtreg tests in jdk/test/java/math/BigInteger/* in both Xmixed and Xcomp modes. Thanks, Dmitrij On 25.09.2017 18:57, Andrew Haley wrote: > On 25/09/17 16:46, Dmitrij Pochepko wrote: >> please take a look at v2.
I've modified code to use multiplyToLen in >> squareToLen. Additional benefit: no more code in common part. I've left >> mulAdd unchanged. > That looks fine. Please commit if you've run the jtreg test suite. > From dmitrij.pochepko at bell-sw.com Mon Sep 25 16:36:06 2017 From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko) Date: Mon, 25 Sep 2017 19:36:06 +0300 Subject: [10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd In-Reply-To: <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com> References: <8ba3ab4d-6d71-b0f5-352f-463ca71ba2a5@redhat.com> <81d23371-f77d-7e93-d4ac-bfddb909b22c@redhat.com> <304db30c-550e-5f3b-8cc5-295dad2d4b21@bell-sw.com> <0f272cd6-066b-af29-e01f-00f77af95e4b@bell-sw.com> <16e7e940-a9ae-c4e5-d37b-6ffa4c447a61@redhat.com> <8dc28b52-fa54-9984-8b4f-58933b069300@bell-sw.com> <18e7ddfa-1c9e-da40-77a1-80d6f434899b@bell-sw.com> <848eae58-37af-922b-fc28-19aaef2ab2ab@redhat.com> <85a13dcf-385c-f02e-72b8-9cb835b12fff@bell-sw.com> <3670d9aa-33a6-dc39-8df7-26a5393863f6@redhat.com> <6ba73c2b-33fa-bdb7-af84-e0b6a2e3b730@redhat.com> <40a89126-e855-c681-d6ff-21ed611fbd89@bell-sw.com> Message-ID: Seems like repo is still closed. Have to wait a bit. On 25.09.2017 19:33, Dmitrij Pochepko wrote: > Thank you for such attentive review. > > I'll commit it now. I've run jtreg tests in > jdk/test/java/math/BigInteger/* in both Xmixed and Xcomp modes. > > > Thanks, > Dmitrij > On 25.09.2017 18:57, Andrew Haley wrote: >> On 25/09/17 16:46, Dmitrij Pochepko wrote: >>> please take a look at v2. I've modified code to use multiplyToLen in >>> squareToLen. Additional benefit: no more code in common part. I've left >>> mulAdd unchanged. >> That looks fine. Please commit if you've run the jtreg test suite.
>> > From vladimir.kozlov at oracle.com Mon Sep 25 16:42:06 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 25 Sep 2017 09:42:06 -0700 Subject: [10] RFR(S): 8187684 - Intrinsify Math.multiplyHigh(long, long) In-Reply-To: <5fa7e391-bde0-9310-7bcd-98cf96b3f158@bell-sw.com> References: <5cfb314b-9785-fe47-8797-a899f38643ef@redhat.com> <32a35ee7-9593-bbc4-a540-37c590ec3ba6@redhat.com> <5fa7e391-bde0-9310-7bcd-98cf96b3f158@bell-sw.com> Message-ID: <3d76ff3c-17d7-a37c-2959-8f5e6dacd854@oracle.com> Yes, when repo will be opened. Please, send patch and add latest webrev link to the RFE. Thanks, Vladimir On 9/25/17 5:04 AM, Dmitrij Pochepko wrote: > > On 25.09.2017 14:04, Andrew Haley wrote: >> On 20/09/17 14:29, Andrew Haley wrote: >>> On 20/09/17 14:08, Dmitrij Pochepko wrote: >>>> please review small patch for enhancement: 8187684 - Intrinsify >>>> Math.multiplyHigh(long, long) >>> OK, thanks. >> Dmitrij, do you have a sponsor for this? I'm sure Vladimir would >> be happy to help. :-) >> > Hi, > > Vladimir, can you sponsor it? > > Thanks, > Dmitrij From vladimir.kozlov at oracle.com Mon Sep 25 17:32:29 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 25 Sep 2017 10:32:29 -0700 Subject: RFR(XS): 8187884: [TESTBUG] compiler/classUnloading/anonymousClass/TestAnonymousClassUnloading failed with ClassNotFoundException In-Reply-To: <59C59BC7.5040401@oracle.com> References: <59C59BC7.5040401@oracle.com> Message-ID: <594b4abe-5581-f54a-5ba3-01e65b03cf02@oracle.com> Looks good. Why it worked before? Thanks, Vladimir On 9/22/17 4:24 PM, Calvin Cheung wrote: > This test failure started showing up after repo consolidation when > running via JPRT -testset hotspot. > The problem went away if I change the file header similar to > classUnloading/methodUnloading/TestMethodUnloading.java.
> > bug: https://bugs.openjdk.java.net/browse/JDK-8187884 > > webrev: http://cr.openjdk.java.net/~ccheung/8187884/webrev.00/ > > It passed JPRT -testset hotspot. > > thanks, > Calvin From calvin.cheung at oracle.com Mon Sep 25 18:27:28 2017 From: calvin.cheung at oracle.com (Calvin Cheung) Date: Mon, 25 Sep 2017 11:27:28 -0700 Subject: RFR(XS): 8187884: [TESTBUG] compiler/classUnloading/anonymousClass/TestAnonymousClassUnloading failed with ClassNotFoundException In-Reply-To: <594b4abe-5581-f54a-5ba3-01e65b03cf02@oracle.com> References: <59C59BC7.5040401@oracle.com> <594b4abe-5581-f54a-5ba3-01e65b03cf02@oracle.com> Message-ID: <59C94A90.8050903@oracle.com> On 9/25/17, 10:32 AM, Vladimir Kozlov wrote: > Looks good. Thanks for your review. > > Why it worked before? It isn't clear why it worked before. From the JPRT log file, one difference is the working dir. Before the repo consolidation: Working dir: C:\jprt\T\P1\163225.\s\test After: Working dir: C:\jprt\T\P1\161245.\s\closed\test Note the "closed" dir before the "test" dir. But this may not be the cause of the failure since the classpath seems correct and the class exists. Given that not everyone is encountering this problem, I'll hold off on pushing this fix until after more investigation. thanks, Calvin > > Thanks, > Vladimir > > On 9/22/17 4:24 PM, Calvin Cheung wrote: >> This test failure started showing up after repo consolidation when >> running via JPRT -testset hotspot. >> The problem went away if I change the file header similar to >> classUnloading/methodUnloading/TestMethodUnloading.java. >> >> bug: https://bugs.openjdk.java.net/browse/JDK-8187884 >> >> webrev: http://cr.openjdk.java.net/~ccheung/8187884/webrev.00/ >> >> It passed JPRT -testset hotspot.
>> >> thanks, >> Calvin From vladimir.kozlov at oracle.com Mon Sep 25 21:21:14 2017 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 25 Sep 2017 14:21:14 -0700 Subject: RFR(S): 8187822: C2 conditonal move optimization might create broken graph In-Reply-To: References: Message-ID: <0d0b226d-418f-2344-2ff9-a7682747a0e2@oracle.com> Thanks, Roland I agree with your analysis. But as I understand we can't replace such diamond code with cmove because If node will not be eliminated if you not adjust control of LoadI node. What graph you have as result of your changes? Thanks, Vladimir On 9/25/17 7:19 AM, Roland Westrelin wrote: > > http://cr.openjdk.java.net/~roland/8187822/webrev.00/ > > PhaseIdealLoop::conditional_move() replaces a if diamond with a CMoveX > node, sets the control of the CMoveX node right above the if node and > adjust the control of the inputs of the CMoveX node so it dominates the > CMoveX node. The bug here is that one input of a CMoveX node can depend > on another data node whose control is one of the branch of the > if. PhaseIdealLoop::conditional_move() doesn't adjust the control of > that node and so loop opts proceed with a node whose control doesn't > dominate the control of its uses. > > The test case is an example of this. The if (flag2) test is converted to > a CMoveI. One of its inputs is f + v1: a AddI that has a LoadI > input. Control of the AddI is set to be right above the if but the > control of the LoadI is not adjusted and in this case is the if > branch. In this round of loop opts, the AddI is considered for a split > thru phi. C2 shouldn't proceed with the split thru phi because its LoadI > input is not a phi and doesn't strictly dominate the region of the > phi. The strict domination test is implemented in > PhaseIdealLoop::has_local_phi_input() as: > > get_ctrl(m) == n_ctrl > > which fails for the LoadI node because its control is below the AddI > node. 
> > The fix I suggest is to set the control of the new CMoveX node below the > if diamond. Then there's no need to fix the control of the inputs of the > CMoveX. > > Roland. > From jcbeyler at google.com Mon Sep 25 22:01:45 2017 From: jcbeyler at google.com (JC Beyler) Date: Mon, 25 Sep 2017 15:01:45 -0700 Subject: Low-Overhead Heap Profiling In-Reply-To: References: <2af975e6-3827-bd57-0c3d-fadd54867a67@oracle.com> <365499b6-3f4d-a4df-9e7e-e72a739fb26b@oracle.com> <102c59b8-25b6-8c21-8eef-1de7d0bbf629@oracle.com> <1497366226.2829.109.camel@oracle.com> <1498215147.2741.34.camel@oracle.com> <044f8c75-72f3-79fd-af47-7ee875c071fd@oracle.com> <23f4e6f5-c94e-01f7-ef1d-5e328d4823c8@oracle.com> Message-ID: Hi all, After a bit of a break, I am back working on this :). As before, here are two webrevs: - Full change set: http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/ - Compared to version 8: http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/ (This version is compared to version 8 I last showed but ported to the new folder hierarchy) In this version I have: - Handled Thomas' comments from his email of 07/03: - Merged the logging to be standard - Fixed up the code a bit where asked - Added some notes about the code not being thread-safe yet - Removed additional dead code from the version that modifies interpreter/c1/c2 - Fixed compiler issues so that it compiles with --disable-precompiled-header - Tested with ./configure --with-boot-jdk= --with-debug-level=slowdebug --disable-precompiled-headers Additionally, I added a test to check the sanity of the sampler: HeapMonitorStatCorrectnessTest ( http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch ) - This allocates a number of arrays and checks that we obtain the number of samples we want with an accepted error of 5%. 
I tested it 100 times and it passed everytime, I can test more if wanted - Not in the test are the actual numbers I got for the various array sizes, I ran the program 30 times and parsed the output; here are the averages and standard deviation: 1000: 1.28% average; 1.13% standard deviation 10000: 1.59% average; 1.25% standard deviation 100000: 1.26% average; 1.26% standard deviation What this means is that we were always at about 1~2% of the number of samples the test expected. Let me know what you think, Jc On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler wrote: > Hi all, > > I apologize, I have not yet handled your remarks but thought this new > webrev would also be useful to see and comment on perhaps. > > Here is the latest webrev, it is generated slightly different than the > others since now I'm using webrev.ksh without the -N option: > http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/ > > And the webrev.07 to webrev.08 diff is here: > http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/ > > (Let me know if it works well) > > It's a small change between versions but it: > - provides a fix that makes the average sample rate correct (more on > that below). > - fixes the code to actually have it play nicely with the fast tlab > refill > - cleaned up a bit the JVMTI text and now use jvmtiFrameInfo > - moved the capability to be onload solo > > With this webrev, I've done a small study of the random number generator > we use here for the sampling rate. I took a small program and it can be > simplified to: > > for (outer loop) > for (inner loop) > int[] tmp = new int[arraySize]; > > - I've fixed the outer and inner loops to being 800 for this experiment, > meaning we allocate 640000 times an array of a given array size. > > - Each program provides the average sample size used for the whole > execution > > - Then, I ran each variation 30 times and then calculated the average of > the average sample size used for various array sizes. 
I selected the array > size to be one of the following: 1, 10, 100, 1000, 10000. > > - When compared to 512kb, the average sample size of 30 runs: > 1: 4.62% of error > 10: 3.09% of error > 100: 0.36% of error > 1000: 0.1% of error > 10000: 0.03% of error > > What it shows is that, depending on the number of samples, the average > does become better. This is because with an allocation of 1 element per > array, it will take longer to hit one of the thresholds. This is seen by > looking at the sample count statistic I put in. For the same number of > iterations (800 * 800), the different array sizes provoke: > 1: 62 samples > 10: 125 samples > 100: 788 samples > 1000: 6166 samples > 10000: 57721 samples > > And of course, the more samples you have, the more sample rates you pick, > which means that your average gets closer using that math. > > Thanks, > Jc > > On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler wrote: > >> Thanks Robbin, >> >> This seems to have worked. When I have the next webrev ready, we will >> find out but I'm fairly confident it will work! >> >> Thanks again! >> Jc >> >> On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn >> wrote: >> >>> Hi JC, >>> >>> On 06/29/2017 12:15 AM, JC Beyler wrote: >>> >>>> B) Incremental changes >>>> >>> >>> I guess the most common workflow here is using mq: >>> hg qnew fix_v1 >>> edit files >>> hg qrefresh >>> hg qnew fix_v2 >>> edit files >>> hg qrefresh >>> >>> if you do hg log you will see 2 commits >>> >>> webrev.ksh -r -2 -o my_inc_v1_v2 >>> webrev.ksh -o my_full_v2 >>> >>> >>> In your .hgrc you might need: >>> [extensions] >>> mq = >>> >>> /Robbin >>> >>> >>>> Again another newbie question here... >>>> >>>> For showing the incremental changes, is there a link that explains how >>>> to do that? I apologize for my newbie questions all the time :) >>>> >>>> Right now, I do: >>>> >>>> ksh ../webrev.ksh -m -N >>>> >>>> That generates a webrev.zip and I send it to Chuck Rasbold. He then >>>> uploads it to a new webrev.
>>>> >>>> I tried committing my change and adding a small change. Then if I just >>>> do ksh ../webrev.ksh without any options, it seems to produce a similar >>>> page but now with only the changes I had (so the 06-07 comparison you were >>>> talking about) and a changeset that has it all. I imagine that is what you >>>> meant. >>>> >>>> Which means that my workflow would become: >>>> >>>> 1) Make changes >>>> 2) Make a webrev without any options to show just the differences with >>>> the tip >>>> 3) Amend my changes to my local commit so that I have it done with >>>> 4) Go to 1 >>>> >>>> Does that seem correct to you? >>>> >>>> Note that when I do this, I only see the full change of a file in the >>>> full change set (Side note here: now the page says change set and not >>>> patch, which is maybe why Serguei was having issues?). >>>> >>>> Thanks! >>>> Jc >>>> >>>> >>>> >>>> On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn wrote: >>>> >>>> Hi, >>>> >>>> On 06/28/2017 12:04 AM, JC Beyler wrote: >>>> >>>> Dear Thomas et al, >>>> >>>> Here is the newest webrev: >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/ >>>> >>>> >>>> >>>> You have some more bits to in there but generally this looks good >>>> and really nice with more tests. >>>> I'll do a deep dive and re-test this when I get back from my long >>>> vacation with whatever patch version you have then. >>>> >>>> Also I think it's time you provide incremental (v06->07 changes) as >>>> well as complete change-sets. >>>> >>>> Thanks, Robbin >>>> >>>> >>>> >>>> >>>> Thomas, I "think" I have answered all your remarks.
The summary >>>> is: >>>> >>>> - The statistic system is up and provides insight on what the >>>> heap sampler is doing >>>> - I've noticed that, though the sampling rate is at the >>>> right mean, we are missing some samples, I have not yet tracked out why >>>> (details below) >>>> >>>> - I've run a tiny benchmark that is the worse case: it is a >>>> very tight loop and allocated a small array >>>> - In this case, I see no overhead when the system is off >>>> so that is a good start :) >>>> - I see right now a high overhead in this case when >>>> sampling is on. This is not a really too surprising but I'm going to see if >>>> this is consistent with our >>>> internal implementation. The benchmark is really allocation >>>> stressful so I'm not too surprised but I want to do the due diligence. >>>> >>>> - The statistic system up is up and I have a new test >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/s >>>> erviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTes >>>> t.java.patch >>>> >>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTe >>>> st.java.patch> >>>> - I did a bit of a study about the random generator here, >>>> more details are below but basically it seems to work well >>>> >>>> - I added a capability but since this is the first time >>>> doing this, I was not sure I did it right >>>> - I did add a test though for it and the test seems to do >>>> what I expect (all methods are failing with the >>>> JVMTI_ERROR_MUST_POSSESS_CAPABILITY error). 
>>>> - http://cr.openjdk.java.net/~ra >>>> sbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonito >>>> r/MyPackage/HeapMonitorNoCapabilityTest.java.patch >>>> >>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapa >>>> bilityTest.java.patch> >>>> >>>> - I still need to figure out what to do about the >>>> multi-agent vs single-agent issue >>>> >>>> - As far as measurements, it seems I still need to look at: >>>> - Why we do the 20 random calls first, are they necessary? >>>> - Look at the mean of the sampling rate that the random >>>> generator does and also what is actually sampled >>>> - What is the overhead in terms of memory/performance when >>>> on? >>>> >>>> I have inlined my answers, I think I got them all in the new >>>> webrev, let me know your thoughts. >>>> >>>> Thanks again! >>>> Jc >>>> >>>> >>>> On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl < >>>> thomas.schatzl at oracle.com >>> thomas.schatzl at oracle.com >>>> >>>> >> wrote: >>>> >>>> Hi, >>>> >>>> On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote: >>>> > Hi all, >>>> > >>>> > First off: Thanks again to Robbin and Thomas for their >>>> reviews :) >>>> > >>>> > Next, I've uploaded a new webrev: >>>> > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/ < >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/> >>>> >>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/>> >>>> >>>> > >>>> > Here is an update: >>>> > >>>> > - @Robbin, I forgot to say that yes I need to look at >>>> implementing >>>> > this for the other architectures and testing it before >>>> it is all >>>> > ready to go. Is it common to have it working on all >>>> possible >>>> > combinations or is there a subset that I should be doing >>>> first and we >>>> > can do the others later? 
>>>> > - I've tested slowdebug, built and ran the JTreg tests I >>>> wrote with >>>> > slowdebug and fixed a few more issues >>>> > - I've refactored a bit of the code following Thomas' >>>> comments >>>> > - I think I've handled all the comments from Thomas >>>> (I put >>>> > comments inline below for the specifics) >>>> >>>> Thanks for handling all those. >>>> >>>> > - Following Thomas' comments on statistics, I want to >>>> add some >>>> > quality assurance tests and find that the easiest way >>>> would be to >>>> > have a few counters of what is happening in the sampler >>>> and expose >>>> > that to the user. >>>> > - I'll be adding that in the next version if no one >>>> sees any >>>> > objections to that. >>>> > - This will allow me to add a sanity test in JTreg >>>> about number of >>>> > samples and average of sampling rate >>>> > >>>> > @Thomas: I had a few questions that I inlined below but >>>> I will >>>> > summarize the "bigger ones" here: >>>> > - You mentioned constants are not using the right >>>> conventions, I >>>> > looked around and didn't see any convention except >>>> normal naming then >>>> > for static constants. Is that right? >>>> >>>> I looked through https://wiki.openjdk.java.net/ >>>> display/HotSpot/StyleGui >>> /display/HotSpot/StyleGui> >>>> >>> https://wiki.openjdk.java.net/display/HotSpot/StyleGui>> >>>> de and the rule is to "follow an existing pattern and must >>>> have a >>>> distinct appearance from other names". Which does not help >>>> a lot I >>>> guess :/ The GC team started using upper camel case, e.g. >>>> SomeOtherConstant, but very likely this is probably not >>>> applied >>>> consistently throughout. So I am fine with not adding >>>> another style >>>> (like kMaxStackDepth with the "k" in front with some >>>> unknown meaning) >>>> is fine. >>>> >>>> (Chances are you will find that style somewhere used >>>> anyway too, >>>> apologies if so :/) >>>> >>>> >>>> Thanks for that link, now I know where to look. 
I used the >>>> upper camel case in my code as well then :) I should have gotten them all. >>>> >>>> >>>> > PS: I've also inlined my answers to Thomas below: >>>> > >>>> > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl >>>> >>> > e.com > wrote: >>>> > > Hi all, >>>> > > >>>> > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote: >>>> > > > Dear all, >>>> > > > >>>> > > > I've continued working on this and have done the >>>> following >>>> > > webrev: >>>> > > > http://cr.openjdk.java.net/~ra >>>> sbold/8171119/webrev.05/ >>> asbold/8171119/webrev.05/> >>>> >>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/>> >>>> >>>> > > >>>> > > [...] >>>> > > > Things I still need to do: >>>> > > > - Have to fix that TLAB case for the >>>> FastTLABRefill >>>> > > > - Have to start looking at the data to see that >>>> it is >>>> > > consistent and does gather the right samples, right >>>> frequency, etc. >>>> > > > - Have to check the GC elements and what that >>>> produces >>>> > > > - Run a slowdebug run and ensure I fixed all >>>> those issues you >>>> > > saw > Robbin >>>> > > > >>>> > > > Thanks for looking at the webrev and have a great >>>> week! >>>> > > >>>> > > scratching a bit on the surface of this change, so >>>> apologies for >>>> > > rather shallow comments: >>>> > > >>>> > > - macroAssembler_x86.cpp:5604: while this is compiler >>>> code, and I >>>> > > am not sure this is final, please avoid littering the >>>> code with >>>> > > TODO remarks :) They tend to be candidates for later >>>> wtf moments >>>> > > only. >>>> > > >>>> > > Just file a CR for that. >>>> > > >>>> > Newcomer question: what is a CR and not sure I have the >>>> rights to do >>>> > that yet ? :) >>>> >>>> Apologies. CR is a change request, this suggests to file a >>>> bug in the >>>> bug tracker. And you are right, you can't just create a >>>> new account in >>>> the OpenJDK JIRA yourselves. 
:( >>>> >>>> >>>> Ok good to know, I'll continue with my own todo list but I'll >>>> work hard on not letting it slip in the webrevs anymore :) >>>> >>>> >>>> I was mostly referring to the "... but it is a TODO" part >>>> of that >>>> comment in macroassembler_x86.cpp. Comments about the why >>>> of the code >>>> are appreciated. >>>> >>>> [Note that I now understand that this is to some degree >>>> still work in >>>> progress. As long as the final changeset does no contain >>>> TODO's I am >>>> fine (and it's not a hard objection, rather their use in >>>> "final" code >>>> is typically limited in my experience)] >>>> >>>> 5603 // Currently, if this happens, just set back the >>>> actual end to >>>> where it was. >>>> 5604 // We miss a chance to sample here. >>>> >>>> Would be okay, if explaining "this" and the "why" of >>>> missing a chance >>>> to sample here would be best. >>>> >>>> Like maybe: >>>> >>>> // If we needed to refill TLABs, just set the actual end >>>> point to >>>> // the end of the TLAB again. We do not sample here >>>> although we could. >>>> >>>> Done with your comment, it works well in my mind. >>>> >>>> I am not sure whether "miss a chance to sample" meant "we >>>> could, but >>>> consciously don't because it's not that useful" or "it >>>> would be >>>> necessary but don't because it's too complicated to do.". >>>> >>>> Looking at the original comment once more, I am also not >>>> sure if that >>>> comment shouldn't referring to the "end" variable (not >>>> actual_end) >>>> because that's the variable that is responsible for taking >>>> the sampling >>>> path? (Going from the member description of >>>> ThreadLocalAllocBuffer). >>>> >>>> >>>> I've moved this code and it no longer shows up here but the >>>> rationale and answer was: >>>> >>>> So.. Yes, end is the variable provoking the sampling. Actual >>>> end is the actual end of the TLAB. 
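The split between "end" (the limit that provokes sampling) and "actual end" (the real end of the TLAB) can be sketched roughly as follows. This is a minimal, simplified C++ model and not the webrev code: SampledTlab, make_tlab and allocate are hypothetical names, offsets stand in for heap addresses, and a fixed sampling interval replaces the randomized one.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of a TLAB with a sampling limit.
// "end" is what the fast path compares against; "actual_end" is the
// real end of the buffer. Offsets are used instead of heap addresses.
struct SampledTlab {
  std::size_t top;         // next free offset
  std::size_t end;         // fast-path limit, possibly a sample point
  std::size_t actual_end;  // true end of the buffer
  std::size_t samples;     // number of samples taken so far
  std::size_t interval;    // bytes between samples (fixed in this sketch)
};

SampledTlab make_tlab(std::size_t size, std::size_t interval) {
  assert(interval > 0);
  SampledTlab t{0, 0, size, 0, interval};
  // Pull "end" in so that the fast path fails at the first sample point.
  t.end = (interval < size) ? interval : size;
  return t;
}

// Returns true if the allocation fit, false if the buffer is really full
// (the caller would then refill the TLAB).
bool allocate(SampledTlab& t, std::size_t bytes) {
  if (t.top + bytes <= t.end) {  // fast path: no sampling involved
    t.top += bytes;
    return true;
  }
  if (t.top + bytes > t.actual_end) {
    return false;                // the real end was hit, not a sample point
  }
  // Slow path: "end" was only a sample point. Record a sample, allocate,
  // and move "end" to the next sample point, capped at the actual end.
  t.samples++;
  t.top += bytes;
  std::size_t next = t.top + t.interval;
  t.end = (next < t.actual_end) ? next : t.actual_end;
  return true;
}
```

In this model a code path that resets only "end" after a refill, without going through the slow path, would never take a sample, which is the concern discussed in this part of the thread.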
>>>> >>>> What was happening here is that the code is resetting _end to >>>> point towards the end of the new TLAB. Because, we now have the end for >>>> sampling and _actual_end for >>>> the actual end, we need to update the actual_end as well. >>>> >>>> Normally, were we to do the real work here, we would calculate >>>> the (end - start) offset, then do: >>>> >>>> - Set the new end to : start + (old_end - old_start) >>>> - Set the actual end like we do here now where it because it is >>>> the actual end. >>>> >>>> Why is this not done here now anymore? >>>> - I was still debating which path to take: >>>> - Do it in the fast refill code, it has its perks: >>>> - In a world where fast refills are happening all >>>> the time or a lot, we can augment there the code to do the sampling >>>> - Remember what we had as an end before leaving the >>>> slowpath and check on return >>>> - This is what I'm doing now, it removes the need to >>>> go fix up all fast refill paths but if you remain in fast refill paths, you >>>> won't get sampling. I >>>> have to think of the consequences of that, maybe a future >>>> change later on? >>>> - I have the statistics now so I'm going to study >>>> that >>>> -> By the way, though my statistics are >>>> showing I'm missing some samples, if I turn off FastTlabRefill, it is the >>>> same loss so for now, it seems >>>> this does not occur in my simple test. >>>> >>>> >>>> >>>> But maybe I am only confused and it's best to just leave >>>> the comment >>>> away. :) >>>> >>>> Thinking about it some more, doesn't this not-sampling in >>>> this case >>>> mean that sampling does not work in any collector that >>>> does inline TLAB >>>> allocation at the moment? (Or is inline TLAB alloc >>>> automatically >>>> disabled with sampling somehow?) 
>>>> >>>> That would indeed be a bigger TODO then :) >>>> >>>> >>>> Agreed, this remark made me think that perhaps as a first step >>>> the new way of doing it is better but I did have to: >>>> - Remove the const of the ThreadLocalBuffer remaining and >>>> hard_end methods >>>> - Move hard_end out of the header file to have a bit more >>>> logic there >>>> >>>> Please let me know what you think of that and if you prefer it >>>> this way or changing the fast refills. (I prefer this way now because it is >>>> more incremental). >>>> >>>> >>>> > > - calling HeapMonitoring::do_weak_oops() (which should >>>> probably be >>>> > > called weak_oops_do() like other similar methods) only >>>> if string >>>> > > deduplication is enabled (in g1CollectedHeap.cpp:4511) >>>> seems wrong. >>>> > >>>> > The call should be at least around 6 lines up outside >>>> the if. >>>> > >>>> > Preferentially in a method like >>>> process_weak_jni_handles(), including >>>> > additional logging. (No new (G1) gc phase without >>>> minimal logging >>>> > :)). >>>> > Done but really not sure because: >>>> > >>>> > I put for logging: >>>> > log_develop_trace(gc, freelist)("G1ConcRegionFreeing >>>> [other] : heap >>>> > monitoring"); >>>> >>>> I would think that "gc, ref" would be more appropriate log >>>> tags for >>>> this similar to jni handles. >>>> (I am als not sure what weak reference handling has to do >>>> with >>>> G1ConcRegionFreeing, so I am a bit puzzled) >>>> >>>> >>>> I was not sure what to put for the tags or really as the >>>> message. I cleaned it up a bit now to: >>>> log_develop_trace(gc, ref)("HeapSampling [other] : heap >>>> monitoring processing"); >>>> >>>> >>>> >>>> > Since weak_jni_handles didn't have logging for me to be >>>> inspired >>>> > from, I did that but unconvinced this is what should be >>>> done. >>>> >>>> The JNI handle processing does have logging, but only in >>>> ReferenceProcessor::process_discovered_references(). 
In >>>> process_weak_jni_handles() only overall time is measured >>>> (in a G1 >>>> specific way, since only G1 supports disabling reference >>>> procesing) :/ >>>> >>>> The code in ReferenceProcessor prints both time taken >>>> referenceProcessor.cpp:254, as well as the count, but >>>> strangely only in >>>> debug VMs. >>>> >>>> I have no idea why this logging is that unimportant to >>>> only print that >>>> in a debug VM. However there are reviews out for changing >>>> this area a >>>> bit, so it might be useful to wait for that (JDK-8173335). >>>> >>>> >>>> I cleaned it up a bit anyway and now it returns the count of >>>> objects that are in the system. >>>> >>>> >>>> > > - the change doubles the size of >>>> > > CollectedHeap::allocate_from_tlab_slow() above the >>>> "small and nice" >>>> > > threshold. Maybe it could be refactored a bit. >>>> > Done I think, it looks better to me :). >>>> >>>> In ThreadLocalAllocBuffer::handle_sample() I think the >>>> set_back_actual_end()/pick_next_sample() calls could be >>>> hoisted out of >>>> the "if" :) >>>> >>>> >>>> Done! >>>> >>>> >>>> > > - referenceProcessor.cpp:261: the change should add >>>> logging about >>>> > > the number of references encountered, maybe after the >>>> corresponding >>>> > > "JNI weak reference count" log message. >>>> > Just to double check, are you saying that you'd like to >>>> have the heap >>>> > sampler to keep in store how many sampled objects were >>>> encountered in >>>> > the HeapMonitoring::weak_oops_do? >>>> > - Would a return of the method with the number of >>>> handled >>>> > references and logging that work? >>>> >>>> Yes, it's fine if HeapMonitoring::weak_oops_do() only >>>> returned the >>>> number of processed weak oops. >>>> >>>> >>>> Done also (but I admit I have not tested the output yet) :) >>>> >>>> >>>> > - Additionally, would you prefer it in a separate >>>> block with its >>>> > GCTraceTime? >>>> >>>> Yes. 
Both kinds of information is interesting: while the >>>> time taken is >>>> typically more important, the next question would be why, >>>> and the >>>> number of references typically goes a long way there. >>>> >>>> See above though, it is probably best to wait a bit. >>>> >>>> >>>> Agreed that I "could" wait but, if it's ok, I'll just >>>> refactor/remove this when we get closer to something final. Either, >>>> JDK-8173335 >>>> has gone in and I will notice it now or it will soon and I can >>>> change it then. >>>> >>>> >>>> > > - threadLocalAllocBuffer.cpp:331: one more "TODO" >>>> > Removed it and added it to my personal todos to look at. >>>> > > > >>>> > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer >>>> class >>>> > > documentation should be updated about the sampling >>>> additions. I >>>> > > would have no clue what the difference between >>>> "actual_end" and >>>> > > "end" would be from the given information. >>>> > If you are talking about the comments in this file, I >>>> made them more >>>> > clear I hope in the new webrev. If it was somewhere >>>> else, let me know >>>> > where to change. >>>> >>>> Thanks, that's much better. Maybe a note in the comment of >>>> the class >>>> that ThreadLocalBuffer provides some sampling facility by >>>> modifying the >>>> end() of the TLAB to cause "frequent" calls into the >>>> runtime call where >>>> actual sampling takes place. >>>> >>>> >>>> Done, I think it's better now. Added something about the >>>> slow_path_end as well. >>>> >>>> >>>> > > - in heapMonitoring.hpp: there are some random >>>> comments about some >>>> > > code that has been grabbed from >>>> "util/math/fastmath.[h|cc]". I >>>> > > can't tell whether this is code that can be used but I >>>> assume that >>>> > > Noam Shazeer is okay with that (i.e. that's all Google >>>> code). >>>> > Jeremy and I double checked and we can release that as I >>>> thought. I >>>> > removed the comment from that piece of code entirely. >>>> >>>> Thanks. 
>>>> >>>> > > - heapMonitoring.hpp/cpp static constant naming does >>>> not correspond >>>> > > to Hotspot's. Additionally, in Hotspot static methods >>>> are cased >>>> > > like other methods. >>>> > I think I fixed the methods to be cased the same way as >>>> all other >>>> > methods. For static constants, I was not sure. I fixed a >>>> few other >>>> > variables but I could not seem to really see a >>>> consistent trend for >>>> > constants. I made them as variables but I'm not sure now. >>>> >>>> Sorry again, style is a kind of mess. The goal of my >>>> suggestions here >>>> is only to prevent yet another style creeping in. >>>> >>>> > > - in heapMonitoring.cpp there are a few cryptic >>>> comments at the top >>>> > > that seem to refer to internal stuff that should >>>> probably be >>>> > > removed. >>>> > Sorry about that! My personal todos not cleared out. >>>> >>>> I am happy about comments, but I simply did not understand >>>> any of that >>>> and I do not know about other readers as well. >>>> >>>> If you think you will remember removing/updating them >>>> until the review >>>> proper (I misunderstood the review situation a little it >>>> seems). >>>> >>>> > > I did not think through the impact of the TLAB changes >>>> on collector >>>> > > behavior yet (if there are). Also I did not check for >>>> problems with >>>> > > concurrent mark and SATB/G1 (if there are). >>>> > I would love to know your thoughts on this, I think this >>>> is fine. I >>>> >>>> I think so too now. No objects are made live out of thin >>>> air :) >>>> >>>> > see issues with multiple threads right now hitting the >>>> stack storage >>>> > instance. Previous webrevs had a mutex lock here but we >>>> took it out >>>> > for simplificity (and only for now). >>>> >>>> :) When looking at this after some thinking I now assume >>>> for this >>>> review that this code is not MT safe at all. 
There seems >>>> to be more >>>> synchronization missing than just the one for the >>>> StackTraceStorage. So >>>> no comments about this here. >>>> >>>> >>>> I doubled checked a bit (quickly I admit) but it seems that >>>> synchronization in StackTraceStorage is really all you need (all methods >>>> lead to a StackTraceStorage one >>>> and can be multithreaded outside of that). >>>> There is a question about the initialization where the method >>>> HeapMonitoring::initialize_profiling is not thread safe. >>>> It would work (famous last words) and not crash if there was a >>>> race but we could add a synchronization point there as well (and therefore >>>> on the stop as well). >>>> >>>> But anyway I will really check and do this once we add back >>>> synchronization. >>>> >>>> >>>> Also, this would require some kind of specification of >>>> what is allowed >>>> to be called when and where. >>>> >>>> >>>> Would we specify this with the methods in the jvmti.xml file? >>>> We could start by specifying in each that they are not thread safe but I >>>> saw no mention of that for >>>> other methods. >>>> >>>> >>>> One potentially relevant observation about locking here: >>>> depending on >>>> sampling frequency, StackTraceStore::add_trace() may be >>>> rather >>>> frequently called. I assume that you are going to do >>>> measurements :) >>>> >>>> >>>> Though we don't have the TLAB implementation in our code, the >>>> compiler generated sampler uses 2% of overhead with a 512k sampling rate. I >>>> can do real measurements >>>> when the code settles and we can see how costly this is as a >>>> TLAB implementation. >>>> However, my theory is that if the rate is 512k, the >>>> memory/performance overhead should be minimal since it is what we saw with >>>> our code/workloads (though not called >>>> the same way, we call it essentially at the same rate). >>>> If you have a benchmark you'd like me to test, let me know! 
>>>> >>>> Right now, with my really small test, this does use a bit of >>>> overhead even for a 512k sample size. I don't know yet why, I'm going to >>>> see what is going on. >>>> >>>> Finally, I think it is not reasonable to suppose the overhead >>>> to be negligible if the sampling rate used is too low. The user should know >>>> that the lower the rate, >>>> the higher the overhead (documentation TODO?). >>>> >>>> >>>> I am not sure what the expected usage of the API is, but >>>> StackTraceStore::add_trace() seems to be able to grow >>>> without bounds. >>>> Only a GC truncates them to the live ones. That in itself >>>> seems to be >>>> problematic (GCs can be *wide* apart), and of course some >>>> of the API >>>> methods add to that because they duplicate that unbounded >>>> array. Do you >>>> have any concerns/measurements about this? >>>> >>>> >>>> So, the theory is that yes add_trace can be able to grow >>>> without bounds but it grows at a sample per 512k of allocated space. The >>>> stacks it gathers are currently >>>> maxed at 64 (I'd like to expand that to an option to the user >>>> though at some point). So I have no concerns because: >>>> >>>> - If really this is taking a lot of space, that means the job >>>> is keeping a lot of objects in memory as well, therefore the entire heap is >>>> getting huge >>>> - If this is the case, you will be triggering a GC at some >>>> point anyway. >>>> >>>> (I'm putting under the rug the issue of "What if we set the >>>> rate to 1 for example" because as you lower the sampling rate, we cannot >>>> guarantee low overhead; the >>>> idea behind this feature is to have a means of having >>>> meaningful allocated samples at a low overhead) >>>> >>>> I have no measurements really right now but since I now have >>>> some statistics I can poll, I will look a bit more at this question. >>>> >>>> I have the same last sentence than above: the user should >>>> expect this to happen if the sampling rate is too small. 
That probably can >>>> be reflected in the >>>> StartHeapSampling as a note : careful this might impact your >>>> performance. >>>> >>>> >>>> Also, these stack traces might hold on to huge arrays. Any >>>> consideration of that? Particularly it might be the cause >>>> for OOMEs in >>>> tight memory situations. >>>> >>>> >>>> There is a stack size maximum that is set to 64 so it should >>>> not hold huge arrays. I don't think this is an issue but I can double check >>>> with a test or two. >>>> >>>> >>>> - please consider adding a safepoint check in >>>> HeapMonitoring::weak_oops_do to prevent accidental misuse. >>>> >>>> - in struct StackTraceStorage, the public fields may also >>>> need >>>> underscores. At least some files in the runtime directory >>>> have structs >>>> with underscored public members (and some don't). The >>>> runtime team >>>> should probably comment on that. >>>> >>>> >>>> Agreed I did not know. I looked around and a lot of structs did >>>> not have them it seemed so I left it as is. I will happily change it if >>>> someone prefers (I was not >>>> sure if you really preferred or not, your sentence seemed to be >>>> more a note of "this might need to change but I don't know if the runtime >>>> team enforces that", let >>>> me know if I read that wrongly). >>>> >>>> >>>> - In StackTraceStorage::weak_oops_do(), when examining the >>>> StackTraceData, maybe it is useful to consider having a >>>> non-NULL >>>> reference outside of the heap's reserved space an error. >>>> There should >>>> be no oop outside of the heap's reserved space ever. >>>> >>>> Unless you allow storing random values in >>>> StackTraceData::obj, which I >>>> would not encourage. 
>>>> >>>> >>>> I suppose you are talking about this part: >>>> if ((value != NULL && Universe::heap()->is_in_reserved(value)) >>>> && >>>> (is_alive == NULL || is_alive->do_object_b(value))) >>>> { >>>> >>>> What you are saying is that I could have something like: >>>> if (value != my_non_null_reference && >>>> (is_alive == NULL || is_alive->do_object_b(value))) >>>> { >>>> >>>> Is that what you meant? Is there really a reason to do so? When >>>> I look at the code, is_in_reserved seems like a O(1) method call. I'm not >>>> even sure we can have a >>>> NULL value to be honest. I might have to study that to see if >>>> this was not a paranoid test to begin with. >>>> >>>> The is_alive code has now morphed due to the comment below. >>>> >>>> >>>> >>>> - HeapMonitoring::weak_oops_do() does not seem to use the >>>> passed AbstractRefProcTaskExecutor. >>>> >>>> >>>> It did use it: >>>> size_t HeapMonitoring::weak_oops_do( >>>> AbstractRefProcTaskExecutor *task_executor, >>>> BoolObjectClosure* is_alive, >>>> OopClosure *f, >>>> VoidClosure *complete_gc) { >>>> assert(SafepointSynchronize::is_at_safepoint(), "must be >>>> at safepoint"); >>>> >>>> if (task_executor != NULL) { >>>> task_executor->set_single_threaded_mode(); >>>> } >>>> return StackTraceStorage::storage()->weak_oops_do(is_alive, >>>> f, complete_gc); >>>> } >>>> >>>> But due to the comment below, I refactored this, so this is no >>>> longer here. Now I have an always true closure that is passed. >>>> >>>> >>>> - I do not understand allowing to call this method with a >>>> NULL >>>> complete_gc closure. This would mean that objects >>>> referenced from the >>>> object that is referenced by the StackTraceData are not >>>> pulled, meaning >>>> they would get stale. >>>> >>>> - same with is_alive parameter value of NULL >>>> >>>> >>>> So these questions made me look a bit closer at this code. 
This >>>> code I think was written this way to have a very small impact on the file >>>> but you are right, there >>>> is no reason for this here. I've simplified the code by making >>>> in referenceProcessor.cpp a process_HeapSampling method that handles >>>> everything there. >>>> >>>> The code allowed NULLs because it depended on where you were >>>> coming from and how the code was being called. >>>> >>>> - I added a static always_true variable and pass that now to be >>>> more consistent with the rest of the code. >>>> - I moved the complete_gc into process_phaseHeapSampling now >>>> (new method) and handle the task_executor and the complete_gc there >>>> - Newbie question: in our code we did a >>>> set_single_threaded_mode but I see that process_phaseJNI does it right >>>> before its call, do I need to do it for the >>>> process_phaseHeapSample? >>>> That API is much cleaner (in my mind) and is consistent with >>>> what is done around it (again in my mind). >>>> >>>> >>>> - heapMonitoring.cpp:590: I do not completely understand >>>> the purpose of >>>> this code: in the end this results in a fixed value >>>> directly dependent >>>> on the Thread address anyway? In the end this results in a >>>> fixed value >>>> directly dependent on the Thread address anyway? >>>> IOW, what is special about exactly 20 rounds? >>>> >>>> >>>> So we really want a fast random number generator that has a >>>> specific mean (512k is the default we use). The code uses the thread >>>> address as the start number of the >>>> sequence (why not, it is random enough is rationale). Then >>>> instead of just starting there, we prime the sequence and really only start >>>> at the 21st number, it is >>>> arbitrary and I have not done a study to see if we could do >>>> more or less of that. >>>> >>>> As I have the statistics of the system up and running, I'll run >>>> some experiments to see if this is needed, is 20 good, or not. 
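The idea described above (seed from the thread address, discard the first 20 values, then draw intervals around a 512k mean) can be sketched as below. This is a sketch under stated assumptions, not the code in heapMonitoring.cpp: the class name SampleIntervalPicker, the 48-bit linear congruential generator with drand48-style constants, and the choice of an exponential distribution for the intervals are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch of a per-thread sampling interval picker.
// Generator constants and distribution are assumptions, not the webrev's.
class SampleIntervalPicker {
 public:
  // "seed" would typically be derived from the thread's address.
  SampleIntervalPicker(std::uint64_t seed, double mean_bytes)
      : state_(seed & kMask), mean_(mean_bytes) {
    assert(mean_bytes > 0);
    // Prime the sequence: discard the first 20 values so that nearby
    // seeds (such as addresses of neighbouring threads) diverge.
    for (int i = 0; i < 20; ++i) {
      next_random();
    }
  }

  // Number of bytes to allocate before the next sample. Drawn from an
  // exponential distribution whose mean is mean_bytes.
  double next_interval() {
    std::uint64_t r = next_random();
    // Keep the top 26 bits of the 48-bit state; the +1 keeps q in (0, 1]
    // so that log(q) is always finite.
    double q = static_cast<double>((r >> 22) + 1) /
               static_cast<double>(1u << 26);
    return -std::log(q) * mean_;
  }

 private:
  static constexpr std::uint64_t kMask = (std::uint64_t{1} << 48) - 1;

  // 48-bit linear congruential step (drand48-style constants).
  std::uint64_t next_random() {
    state_ = (state_ * 0x5DEECE66DULL + 0xBULL) & kMask;
    return state_;
  }

  std::uint64_t state_;
  double mean_;
};
```

With such a picker, averaging a large number of drawn intervals should land close to the configured mean, which is the property the statistics mentioned in this thread are meant to check. Whether 20 priming rounds are needed at all is exactly the open question raised here; the sketch only mirrors the behaviour as described.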
>>>>
>>>>
>>>> - also I would consider stripping a few bits of the threads' address as
>>>> initialization value for your rng. The last three bits (and probably
>>>> more, check whether the Thread object is allocated on special
>>>> boundaries) are always zero for them.
>>>> Not sure if the given "random" value is random enough before/after
>>>> this method, so just skip that comment if you think this is not
>>>> required.
>>>>
>>>>
>>>> I don't know is the honest answer. I think what is important is that we
>>>> tend towards a mean and that it is random "enough" to avoid the pitfall
>>>> of only sampling a subset of objects due to their allocation order. I
>>>> added that as a test to do, to see if it changes the mean in any way for
>>>> the 512k default value and/or if the first 1000 elements look better.
>>>>
>>>>
>>>> Some more random nits I did not find a place to put anywhere:
>>>>
>>>> - ThreadLocalAllocBuffer::_extra_space does not seem to be used
>>>> anywhere?
>>>>
>>>>
>>>> Good catch :).
>>>>
>>>>
>>>> - Maybe indent the declaration of
>>>> ThreadLocalAllocBuffer::_bytes_until_sample to align below the other
>>>> members of that group.
>>>>
>>>>
>>>> Done, moved it up a bit to have non-static members together and static
>>>> ones separate.
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> Thanks for your review!
>>>> Jc
>>>>
>>>>
>>>>
From jcbeyler at google.com Mon Sep 25 22:02:56 2017
From: jcbeyler at google.com (JC Beyler)
Date: Mon, 25 Sep 2017 15:02:56 -0700
Subject: Low-Overhead Heap Profiling
In-Reply-To:
References: <2af975e6-3827-bd57-0c3d-fadd54867a67@oracle.com> <365499b6-3f4d-a4df-9e7e-e72a739fb26b@oracle.com> <102c59b8-25b6-8c21-8eef-1de7d0bbf629@oracle.com> <1497366226.2829.109.camel@oracle.com> <1498215147.2741.34.camel@oracle.com> <044f8c75-72f3-79fd-af47-7ee875c071fd@oracle.com> <23f4e6f5-c94e-01f7-ef1d-5e328d4823c8@oracle.com>
Message-ID:

Forgot to say that for my numbers:

- Not in the test are the actual numbers I got for the various array
  sizes. I ran the program 30 times and parsed the output; here are the
  averages and standard deviations:
    1000:   1.28% average; 1.13% standard deviation
    10000:  1.59% average; 1.25% standard deviation
    100000: 1.26% average; 1.26% standard deviation

The 1000/10000/100000 are the sizes of the arrays being allocated. These
are allocated 100k times and the sampling rate is 111 times the size of the
array.

Thanks!
Jc

On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler wrote:

> Hi all,
>
> After a bit of a break, I am back working on this :). As before, here are
> two webrevs:
>
> - Full change set: http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/
> - Compared to version 8:
>   http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/
>   (This version is compared to version 8 I last showed but ported to the
>   new folder hierarchy)
>
> In this version I have:
> - Handled Thomas' comments from his email of 07/03:
>     - Merged the logging to be standard
>     - Fixed up the code a bit where asked
>     - Added some notes about the code not being thread-safe yet
> - Removed additional dead code from the version that modifies
>   interpreter/c1/c2
> - Fixed compiler issues so that it compiles with
>   --disable-precompiled-headers
> - Tested with ./configure --with-boot-jdk= --with-debug-level=slowdebug
>   --disable-precompiled-headers
>
> Additionally, I added a test to check the sanity of the sampler:
> HeapMonitorStatCorrectnessTest
> (http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch)
> - This allocates a number of arrays and checks that we obtain the
>   number of samples we want with an accepted error of 5%. I tested it 100
>   times and it passed every time; I can test more if wanted.
> - Not in the test are the actual numbers I got for the various array
>   sizes. I ran the program 30 times and parsed the output; here are the
>   averages and standard deviations:
>     1000:   1.28% average; 1.13% standard deviation
>     10000:  1.59% average; 1.25% standard deviation
>     100000: 1.26% average; 1.26% standard deviation
>
> What this means is that we were always within about 1-2% of the number of
> samples the test expected.
>
> Let me know what you think,
> Jc
>
>
>
> On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler wrote:
>
>> Hi all,
>>
>> I apologize, I have not yet handled your remarks but thought this new
>> webrev would also be useful to see and comment on perhaps.
>>
>> Here is the latest webrev. It is generated slightly differently than the
>> others since now I'm using webrev.ksh without the -N option:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/
>>
>> And the webrev.07 to webrev.08 diff is here:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/
>>
>> (Let me know if it works well)
>>
>> It's a small change between versions but it:
>> - provides a fix that makes the average sample rate correct (more on
>>   that below).
>> - fixes the code to actually have it play nicely with the fast tlab
>>   refill
>> - cleans up the JVMTI text a bit and now uses jvmtiFrameInfo
>> - moves the capability to be onload solo
>>
>> With this webrev, I've done a small study of the random number generator
>> we use here for the sampling rate. I took a small program and it can be
>> simplified to:
>>
>> for (outer loop)
>>   for (inner loop)
>>     int[] tmp = new int[arraySize];
>>
>> - I've fixed the outer and inner loops to 800 iterations each for this
>>   experiment, meaning we allocate an array of the given size 640000
>>   times.
>>
>> - Each program provides the average sample size used for the whole
>>   execution
>>
>> - Then, I ran each variation 30 times and calculated the average of
>>   the average sample size used for various array sizes. I selected the
>>   array size to be one of the following: 1, 10, 100, 1000, 10000.
>>
>> - When compared to 512kb, the average sample size of 30 runs:
>>     1:     4.62% error
>>     10:    3.09% error
>>     100:   0.36% error
>>     1000:  0.1% error
>>     10000: 0.03% error
>>
>> What it shows is that, depending on the number of samples, the average
>> does become better. This is because with an allocation of 1 element per
>> array, it will take longer to hit one of the thresholds. This is seen by
>> looking at the sample count statistic I put in. For the same number of
>> iterations (800 * 800), the different array sizes provoke:
>>     1:     62 samples
>>     10:    125 samples
>>     100:   788 samples
>>     1000:  6166 samples
>>     10000: 57721 samples
>>
>> And of course, the more samples you have, the more sample rates you pick,
>> which means that your average gets closer using that math.
>>
>> Thanks,
>> Jc
>>
>> On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler wrote:
>>
>>> Thanks Robbin,
>>>
>>> This seems to have worked. When I have the next webrev ready, we will
>>> find out but I'm fairly confident it will work!
>>>
>>> Thanks again!
>>> Jc
>>>
>>> On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn wrote:
>>>
>>>> Hi JC,
>>>>
>>>> On 06/29/2017 12:15 AM, JC Beyler wrote:
>>>>
>>>>> B) Incremental changes
>>>>>
>>>>
>>>> I guess the most common work flow here is using mq :
>>>> hg qnew fix_v1
>>>> edit files
>>>> hg qrefresh
>>>> hg qnew fix_v2
>>>> edit files
>>>> hg qrefresh
>>>>
>>>> if you do hg log you will see 2 commits
>>>>
>>>> webrev.ksh -r -2 -o my_inc_v1_v2
>>>> webrev.ksh -o my_full_v2
>>>>
>>>>
>>>> In your .hgrc you might need:
>>>> [extensions]
>>>> mq =
>>>>
>>>> /Robbin
>>>>
>>>>
>>>>> Again another newbie question here...
>>>>>
>>>>> For showing the incremental changes, is there a link that explains how
>>>>> to do that? I apologize for my newbie questions all the time :)
>>>>>
>>>>> Right now, I do:
>>>>>
>>>>> ksh ../webrev.ksh -m -N
>>>>>
>>>>> That generates a webrev.zip and I send it to Chuck Rasbold. He then
>>>>> uploads it to a new webrev.
>>>>>
>>>>> I tried committing my change and adding a small change. Then if I just
>>>>> do ksh ../webrev.ksh without any options, it seems to produce a similar
>>>>> page but now with only the changes I had (so the 06-07 comparison you were
>>>>> talking about) and a changeset that has it all. I imagine that is what you
>>>>> meant.
>>>>> >>>>> Which means that my workflow would become: >>>>> >>>>> 1) Make changes >>>>> 2) Make a webrev without any options to show just the differences with >>>>> the tip >>>>> 3) Amend my changes to my local commit so that I have it done with >>>>> 4) Go to 1 >>>>> >>>>> Does that seem correct to you? >>>>> >>>>> Note that when I do this, I only see the full change of a file in the >>>>> full change set (Side note here: now the page says change set and not >>>>> patch, which is maybe why Serguei was having issues?). >>>>> >>>>> Thanks! >>>>> Jc >>>>> >>>>> >>>>> >>>>> On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn >>>> > wrote: >>>>> >>>>> Hi, >>>>> >>>>> On 06/28/2017 12:04 AM, JC Beyler wrote: >>>>> >>>>> Dear Thomas et al, >>>>> >>>>> Here is the newest webrev: >>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/ < >>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/> >>>>> >>>>> >>>>> >>>>> You have some more bits to in there but generally this looks good >>>>> and really nice with more tests. >>>>> I'll do and deep dive and re-test this when I get back from my >>>>> long vacation with whatever patch version you have then. >>>>> >>>>> Also I think it's time you provide incremental (v06->07 changes) >>>>> as well as complete change-sets. >>>>> >>>>> Thanks, Robbin >>>>> >>>>> >>>>> >>>>> >>>>> Thomas, I "think" I have answered all your remarks. The >>>>> summary is: >>>>> >>>>> - The statistic system is up and provides insight on what the >>>>> heap sampler is doing >>>>> - I've noticed that, though the sampling rate is at the >>>>> right mean, we are missing some samples, I have not yet tracked out why >>>>> (details below) >>>>> >>>>> - I've run a tiny benchmark that is the worse case: it is a >>>>> very tight loop and allocated a small array >>>>> - In this case, I see no overhead when the system is off >>>>> so that is a good start :) >>>>> - I see right now a high overhead in this case when >>>>> sampling is on. 
This is not a really too surprising but I'm going to see if >>>>> this is consistent with our >>>>> internal implementation. The benchmark is really allocation >>>>> stressful so I'm not too surprised but I want to do the due diligence. >>>>> >>>>> - The statistic system up is up and I have a new test >>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/s >>>>> erviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTes >>>>> t.java.patch >>>>> >>>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTe >>>>> st.java.patch> >>>>> - I did a bit of a study about the random generator >>>>> here, more details are below but basically it seems to work well >>>>> >>>>> - I added a capability but since this is the first time >>>>> doing this, I was not sure I did it right >>>>> - I did add a test though for it and the test seems to do >>>>> what I expect (all methods are failing with the >>>>> JVMTI_ERROR_MUST_POSSESS_CAPABILITY error). >>>>> - http://cr.openjdk.java.net/~ra >>>>> sbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonito >>>>> r/MyPackage/HeapMonitorNoCapabilityTest.java.patch >>>>> >>>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapa >>>>> bilityTest.java.patch> >>>>> >>>>> - I still need to figure out what to do about the >>>>> multi-agent vs single-agent issue >>>>> >>>>> - As far as measurements, it seems I still need to look at: >>>>> - Why we do the 20 random calls first, are they necessary? >>>>> - Look at the mean of the sampling rate that the random >>>>> generator does and also what is actually sampled >>>>> - What is the overhead in terms of memory/performance >>>>> when on? >>>>> >>>>> I have inlined my answers, I think I got them all in the new >>>>> webrev, let me know your thoughts. >>>>> >>>>> Thanks again! 
>>>>> Jc >>>>> >>>>> >>>>> On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl < >>>>> thomas.schatzl at oracle.com >>>> thomas.schatzl at oracle.com >>>>> >>>>> >> wrote: >>>>> >>>>> Hi, >>>>> >>>>> On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote: >>>>> > Hi all, >>>>> > >>>>> > First off: Thanks again to Robbin and Thomas for their >>>>> reviews :) >>>>> > >>>>> > Next, I've uploaded a new webrev: >>>>> > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/ >>>>> >>>>> >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/>> >>>>> >>>>> > >>>>> > Here is an update: >>>>> > >>>>> > - @Robbin, I forgot to say that yes I need to look at >>>>> implementing >>>>> > this for the other architectures and testing it before >>>>> it is all >>>>> > ready to go. Is it common to have it working on all >>>>> possible >>>>> > combinations or is there a subset that I should be >>>>> doing first and we >>>>> > can do the others later? >>>>> > - I've tested slowdebug, built and ran the JTreg tests >>>>> I wrote with >>>>> > slowdebug and fixed a few more issues >>>>> > - I've refactored a bit of the code following Thomas' >>>>> comments >>>>> > - I think I've handled all the comments from Thomas >>>>> (I put >>>>> > comments inline below for the specifics) >>>>> >>>>> Thanks for handling all those. >>>>> >>>>> > - Following Thomas' comments on statistics, I want to >>>>> add some >>>>> > quality assurance tests and find that the easiest way >>>>> would be to >>>>> > have a few counters of what is happening in the sampler >>>>> and expose >>>>> > that to the user. >>>>> > - I'll be adding that in the next version if no one >>>>> sees any >>>>> > objections to that. 
>>>>> > - This will allow me to add a sanity test in JTreg >>>>> about number of >>>>> > samples and average of sampling rate >>>>> > >>>>> > @Thomas: I had a few questions that I inlined below but >>>>> I will >>>>> > summarize the "bigger ones" here: >>>>> > - You mentioned constants are not using the right >>>>> conventions, I >>>>> > looked around and didn't see any convention except >>>>> normal naming then >>>>> > for static constants. Is that right? >>>>> >>>>> I looked through https://wiki.openjdk.java.net/ >>>>> display/HotSpot/StyleGui >>>> /display/HotSpot/StyleGui> >>>>> >>>> https://wiki.openjdk.java.net/display/HotSpot/StyleGui>> >>>>> de and the rule is to "follow an existing pattern and >>>>> must have a >>>>> distinct appearance from other names". Which does not >>>>> help a lot I >>>>> guess :/ The GC team started using upper camel case, e.g. >>>>> SomeOtherConstant, but very likely this is probably not >>>>> applied >>>>> consistently throughout. So I am fine with not adding >>>>> another style >>>>> (like kMaxStackDepth with the "k" in front with some >>>>> unknown meaning) >>>>> is fine. >>>>> >>>>> (Chances are you will find that style somewhere used >>>>> anyway too, >>>>> apologies if so :/) >>>>> >>>>> >>>>> Thanks for that link, now I know where to look. I used the >>>>> upper camel case in my code as well then :) I should have gotten them all. >>>>> >>>>> >>>>> > PS: I've also inlined my answers to Thomas below: >>>>> > >>>>> > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl >>>>> >>>> > e.com > wrote: >>>>> > > Hi all, >>>>> > > >>>>> > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote: >>>>> > > > Dear all, >>>>> > > > >>>>> > > > I've continued working on this and have done the >>>>> following >>>>> > > webrev: >>>>> > > > http://cr.openjdk.java.net/~ra >>>>> sbold/8171119/webrev.05/ >>>> asbold/8171119/webrev.05/> >>>>> >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/>> >>>>> >>>>> > > >>>>> > > [...] 
>>>>> > > > Things I still need to do: >>>>> > > > - Have to fix that TLAB case for the >>>>> FastTLABRefill >>>>> > > > - Have to start looking at the data to see that >>>>> it is >>>>> > > consistent and does gather the right samples, right >>>>> frequency, etc. >>>>> > > > - Have to check the GC elements and what that >>>>> produces >>>>> > > > - Run a slowdebug run and ensure I fixed all >>>>> those issues you >>>>> > > saw > Robbin >>>>> > > > >>>>> > > > Thanks for looking at the webrev and have a great >>>>> week! >>>>> > > >>>>> > > scratching a bit on the surface of this change, so >>>>> apologies for >>>>> > > rather shallow comments: >>>>> > > >>>>> > > - macroAssembler_x86.cpp:5604: while this is >>>>> compiler code, and I >>>>> > > am not sure this is final, please avoid littering >>>>> the code with >>>>> > > TODO remarks :) They tend to be candidates for later >>>>> wtf moments >>>>> > > only. >>>>> > > >>>>> > > Just file a CR for that. >>>>> > > >>>>> > Newcomer question: what is a CR and not sure I have >>>>> the rights to do >>>>> > that yet ? :) >>>>> >>>>> Apologies. CR is a change request, this suggests to file >>>>> a bug in the >>>>> bug tracker. And you are right, you can't just create a >>>>> new account in >>>>> the OpenJDK JIRA yourselves. :( >>>>> >>>>> >>>>> Ok good to know, I'll continue with my own todo list but I'll >>>>> work hard on not letting it slip in the webrevs anymore :) >>>>> >>>>> >>>>> I was mostly referring to the "... but it is a TODO" part >>>>> of that >>>>> comment in macroassembler_x86.cpp. Comments about the why >>>>> of the code >>>>> are appreciated. >>>>> >>>>> [Note that I now understand that this is to some degree >>>>> still work in >>>>> progress. 
As long as the final changeset does no contain >>>>> TODO's I am >>>>> fine (and it's not a hard objection, rather their use in >>>>> "final" code >>>>> is typically limited in my experience)] >>>>> >>>>> 5603 // Currently, if this happens, just set back the >>>>> actual end to >>>>> where it was. >>>>> 5604 // We miss a chance to sample here. >>>>> >>>>> Would be okay, if explaining "this" and the "why" of >>>>> missing a chance >>>>> to sample here would be best. >>>>> >>>>> Like maybe: >>>>> >>>>> // If we needed to refill TLABs, just set the actual end >>>>> point to >>>>> // the end of the TLAB again. We do not sample here >>>>> although we could. >>>>> >>>>> Done with your comment, it works well in my mind. >>>>> >>>>> I am not sure whether "miss a chance to sample" meant "we >>>>> could, but >>>>> consciously don't because it's not that useful" or "it >>>>> would be >>>>> necessary but don't because it's too complicated to do.". >>>>> >>>>> Looking at the original comment once more, I am also not >>>>> sure if that >>>>> comment shouldn't referring to the "end" variable (not >>>>> actual_end) >>>>> because that's the variable that is responsible for >>>>> taking the sampling >>>>> path? (Going from the member description of >>>>> ThreadLocalAllocBuffer). >>>>> >>>>> >>>>> I've moved this code and it no longer shows up here but the >>>>> rationale and answer was: >>>>> >>>>> So.. Yes, end is the variable provoking the sampling. Actual >>>>> end is the actual end of the TLAB. >>>>> >>>>> What was happening here is that the code is resetting _end to >>>>> point towards the end of the new TLAB. Because, we now have the end for >>>>> sampling and _actual_end for >>>>> the actual end, we need to update the actual_end as well. 
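To make the distinction between the sampling end and the actual end concrete, here is a minimal sketch, assuming a toy bump-pointer buffer. ToyTlab, its fields, and its methods are hypothetical names chosen for illustration; this is not the webrev's code and it ignores everything a real TLAB has to deal with (alignment, filler objects, refill policy).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of a thread-local allocation buffer.
// "end" is lowered artificially so that the fast path fails at the next
// sampling point; "actual_end" is the real limit of the buffer.
struct ToyTlab {
  size_t start = 0;
  size_t top = 0;
  size_t end = 0;         // limit checked by the fast path
  size_t actual_end = 0;  // real end of the buffer
  size_t samples = 0;     // number of slow-path samples taken

  // Install a new buffer and place the sampling end inside it.
  void fill(size_t new_start, size_t size, size_t bytes_until_sample) {
    start = new_start;
    top = new_start;
    actual_end = new_start + size;
    end = std::min(actual_end, new_start + bytes_until_sample);
  }

  // Returns false only when the buffer is really exhausted.
  bool allocate(size_t size, size_t next_interval, size_t* out) {
    if (top + size <= end) {  // fast path: below the sampling end
      *out = top;
      top += size;
      return true;
    }
    if (top + size > actual_end) {
      return false;           // a refill would be needed
    }
    // Slow path: the allocation crosses the sampling end but still fits,
    // so count a sample and move the sampling end forward.
    samples++;
    *out = top;
    top += size;
    end = std::min(actual_end, top + next_interval);
    return true;
  }
};
```

In this toy model, forgetting to reset both end and actual_end on refill is exactly the kind of mismatch the comment under discussion was about.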
>>>>> >>>>> Normally, were we to do the real work here, we would calculate >>>>> the (end - start) offset, then do: >>>>> >>>>> - Set the new end to : start + (old_end - old_start) >>>>> - Set the actual end like we do here now where it because it >>>>> is the actual end. >>>>> >>>>> Why is this not done here now anymore? >>>>> - I was still debating which path to take: >>>>> - Do it in the fast refill code, it has its perks: >>>>> - In a world where fast refills are happening all >>>>> the time or a lot, we can augment there the code to do the sampling >>>>> - Remember what we had as an end before leaving the >>>>> slowpath and check on return >>>>> - This is what I'm doing now, it removes the need >>>>> to go fix up all fast refill paths but if you remain in fast refill paths, >>>>> you won't get sampling. I >>>>> have to think of the consequences of that, maybe a future >>>>> change later on? >>>>> - I have the statistics now so I'm going to >>>>> study that >>>>> -> By the way, though my statistics are >>>>> showing I'm missing some samples, if I turn off FastTlabRefill, it is the >>>>> same loss so for now, it seems >>>>> this does not occur in my simple test. >>>>> >>>>> >>>>> >>>>> But maybe I am only confused and it's best to just leave >>>>> the comment >>>>> away. :) >>>>> >>>>> Thinking about it some more, doesn't this not-sampling in >>>>> this case >>>>> mean that sampling does not work in any collector that >>>>> does inline TLAB >>>>> allocation at the moment? (Or is inline TLAB alloc >>>>> automatically >>>>> disabled with sampling somehow?) 
>>>>> >>>>> That would indeed be a bigger TODO then :) >>>>> >>>>> >>>>> Agreed, this remark made me think that perhaps as a first step >>>>> the new way of doing it is better but I did have to: >>>>> - Remove the const of the ThreadLocalBuffer remaining and >>>>> hard_end methods >>>>> - Move hard_end out of the header file to have a bit more >>>>> logic there >>>>> >>>>> Please let me know what you think of that and if you prefer it >>>>> this way or changing the fast refills. (I prefer this way now because it is >>>>> more incremental). >>>>> >>>>> >>>>> > > - calling HeapMonitoring::do_weak_oops() (which >>>>> should probably be >>>>> > > called weak_oops_do() like other similar methods) >>>>> only if string >>>>> > > deduplication is enabled (in >>>>> g1CollectedHeap.cpp:4511) seems wrong. >>>>> > >>>>> > The call should be at least around 6 lines up outside >>>>> the if. >>>>> > >>>>> > Preferentially in a method like >>>>> process_weak_jni_handles(), including >>>>> > additional logging. (No new (G1) gc phase without >>>>> minimal logging >>>>> > :)). >>>>> > Done but really not sure because: >>>>> > >>>>> > I put for logging: >>>>> > log_develop_trace(gc, freelist)("G1ConcRegionFreeing >>>>> [other] : heap >>>>> > monitoring"); >>>>> >>>>> I would think that "gc, ref" would be more appropriate >>>>> log tags for >>>>> this similar to jni handles. >>>>> (I am als not sure what weak reference handling has to do >>>>> with >>>>> G1ConcRegionFreeing, so I am a bit puzzled) >>>>> >>>>> >>>>> I was not sure what to put for the tags or really as the >>>>> message. I cleaned it up a bit now to: >>>>> log_develop_trace(gc, ref)("HeapSampling [other] : heap >>>>> monitoring processing"); >>>>> >>>>> >>>>> >>>>> > Since weak_jni_handles didn't have logging for me to be >>>>> inspired >>>>> > from, I did that but unconvinced this is what should be >>>>> done. 
>>>>> >>>>> The JNI handle processing does have logging, but only in >>>>> ReferenceProcessor::process_discovered_references(). In >>>>> process_weak_jni_handles() only overall time is measured >>>>> (in a G1 >>>>> specific way, since only G1 supports disabling reference >>>>> procesing) :/ >>>>> >>>>> The code in ReferenceProcessor prints both time taken >>>>> referenceProcessor.cpp:254, as well as the count, but >>>>> strangely only in >>>>> debug VMs. >>>>> >>>>> I have no idea why this logging is that unimportant to >>>>> only print that >>>>> in a debug VM. However there are reviews out for changing >>>>> this area a >>>>> bit, so it might be useful to wait for that (JDK-8173335). >>>>> >>>>> >>>>> I cleaned it up a bit anyway and now it returns the count of >>>>> objects that are in the system. >>>>> >>>>> >>>>> > > - the change doubles the size of >>>>> > > CollectedHeap::allocate_from_tlab_slow() above the >>>>> "small and nice" >>>>> > > threshold. Maybe it could be refactored a bit. >>>>> > Done I think, it looks better to me :). >>>>> >>>>> In ThreadLocalAllocBuffer::handle_sample() I think the >>>>> set_back_actual_end()/pick_next_sample() calls could be >>>>> hoisted out of >>>>> the "if" :) >>>>> >>>>> >>>>> Done! >>>>> >>>>> >>>>> > > - referenceProcessor.cpp:261: the change should add >>>>> logging about >>>>> > > the number of references encountered, maybe after the >>>>> corresponding >>>>> > > "JNI weak reference count" log message. >>>>> > Just to double check, are you saying that you'd like to >>>>> have the heap >>>>> > sampler to keep in store how many sampled objects were >>>>> encountered in >>>>> > the HeapMonitoring::weak_oops_do? >>>>> > - Would a return of the method with the number of >>>>> handled >>>>> > references and logging that work? >>>>> >>>>> Yes, it's fine if HeapMonitoring::weak_oops_do() only >>>>> returned the >>>>> number of processed weak oops. 
>>>>> >>>>> >>>>> Done also (but I admit I have not tested the output yet) :) >>>>> >>>>> >>>>> > - Additionally, would you prefer it in a separate >>>>> block with its >>>>> > GCTraceTime? >>>>> >>>>> Yes. Both kinds of information is interesting: while the >>>>> time taken is >>>>> typically more important, the next question would be why, >>>>> and the >>>>> number of references typically goes a long way there. >>>>> >>>>> See above though, it is probably best to wait a bit. >>>>> >>>>> >>>>> Agreed that I "could" wait but, if it's ok, I'll just >>>>> refactor/remove this when we get closer to something final. Either, >>>>> JDK-8173335 >>>>> has gone in and I will notice it now or it will soon and I can >>>>> change it then. >>>>> >>>>> >>>>> > > - threadLocalAllocBuffer.cpp:331: one more "TODO" >>>>> > Removed it and added it to my personal todos to look at. >>>>> > > > >>>>> > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer >>>>> class >>>>> > > documentation should be updated about the sampling >>>>> additions. I >>>>> > > would have no clue what the difference between >>>>> "actual_end" and >>>>> > > "end" would be from the given information. >>>>> > If you are talking about the comments in this file, I >>>>> made them more >>>>> > clear I hope in the new webrev. If it was somewhere >>>>> else, let me know >>>>> > where to change. >>>>> >>>>> Thanks, that's much better. Maybe a note in the comment >>>>> of the class >>>>> that ThreadLocalBuffer provides some sampling facility by >>>>> modifying the >>>>> end() of the TLAB to cause "frequent" calls into the >>>>> runtime call where >>>>> actual sampling takes place. >>>>> >>>>> >>>>> Done, I think it's better now. Added something about the >>>>> slow_path_end as well. >>>>> >>>>> >>>>> > > - in heapMonitoring.hpp: there are some random >>>>> comments about some >>>>> > > code that has been grabbed from >>>>> "util/math/fastmath.[h|cc]". 
I >>>>> > > can't tell whether this is code that can be used but >>>>> I assume that >>>>> > > Noam Shazeer is okay with that (i.e. that's all >>>>> Google code). >>>>> > Jeremy and I double checked and we can release that as >>>>> I thought. I >>>>> > removed the comment from that piece of code entirely. >>>>> >>>>> Thanks. >>>>> >>>>> > > - heapMonitoring.hpp/cpp static constant naming does >>>>> not correspond >>>>> > > to Hotspot's. Additionally, in Hotspot static methods >>>>> are cased >>>>> > > like other methods. >>>>> > I think I fixed the methods to be cased the same way as >>>>> all other >>>>> > methods. For static constants, I was not sure. I fixed >>>>> a few other >>>>> > variables but I could not seem to really see a >>>>> consistent trend for >>>>> > constants. I made them as variables but I'm not sure >>>>> now. >>>>> >>>>> Sorry again, style is a kind of mess. The goal of my >>>>> suggestions here >>>>> is only to prevent yet another style creeping in. >>>>> >>>>> > > - in heapMonitoring.cpp there are a few cryptic >>>>> comments at the top >>>>> > > that seem to refer to internal stuff that should >>>>> probably be >>>>> > > removed. >>>>> > Sorry about that! My personal todos not cleared out. >>>>> >>>>> I am happy about comments, but I simply did not >>>>> understand any of that >>>>> and I do not know about other readers as well. >>>>> >>>>> If you think you will remember removing/updating them >>>>> until the review >>>>> proper (I misunderstood the review situation a little it >>>>> seems). >>>>> >>>>> > > I did not think through the impact of the TLAB >>>>> changes on collector >>>>> > > behavior yet (if there are). Also I did not check for >>>>> problems with >>>>> > > concurrent mark and SATB/G1 (if there are). >>>>> > I would love to know your thoughts on this, I think >>>>> this is fine. I >>>>> >>>>> I think so too now. 
No objects are made live out of thin >>>>> air :) >>>>> >>>>> > see issues with multiple threads right now hitting the >>>>> stack storage >>>>> > instance. Previous webrevs had a mutex lock here but we >>>>> took it out >>>>> > for simplificity (and only for now). >>>>> >>>>> :) When looking at this after some thinking I now assume >>>>> for this >>>>> review that this code is not MT safe at all. There seems >>>>> to be more >>>>> synchronization missing than just the one for the >>>>> StackTraceStorage. So >>>>> no comments about this here. >>>>> >>>>> >>>>> I doubled checked a bit (quickly I admit) but it seems that >>>>> synchronization in StackTraceStorage is really all you need (all methods >>>>> lead to a StackTraceStorage one >>>>> and can be multithreaded outside of that). >>>>> There is a question about the initialization where the method >>>>> HeapMonitoring::initialize_profiling is not thread safe. >>>>> It would work (famous last words) and not crash if there was a >>>>> race but we could add a synchronization point there as well (and therefore >>>>> on the stop as well). >>>>> >>>>> But anyway I will really check and do this once we add back >>>>> synchronization. >>>>> >>>>> >>>>> Also, this would require some kind of specification of >>>>> what is allowed >>>>> to be called when and where. >>>>> >>>>> >>>>> Would we specify this with the methods in the jvmti.xml file? >>>>> We could start by specifying in each that they are not thread safe but I >>>>> saw no mention of that for >>>>> other methods. >>>>> >>>>> >>>>> One potentially relevant observation about locking here: >>>>> depending on >>>>> sampling frequency, StackTraceStore::add_trace() may be >>>>> rather >>>>> frequently called. I assume that you are going to do >>>>> measurements :) >>>>> >>>>> >>>>> Though we don't have the TLAB implementation in our code, the >>>>> compiler generated sampler uses 2% of overhead with a 512k sampling rate. 
I >>>>> can do real measurements >>>>> when the code settles and we can see how costly this is as a >>>>> TLAB implementation. >>>>> However, my theory is that if the rate is 512k, the >>>>> memory/performance overhead should be minimal since it is what we saw with >>>>> our code/workloads (though not called >>>>> the same way, we call it essentially at the same rate). >>>>> If you have a benchmark you'd like me to test, let me know! >>>>> >>>>> Right now, with my really small test, this does use a bit of >>>>> overhead even for a 512k sample size. I don't know yet why, I'm going to >>>>> see what is going on. >>>>> >>>>> Finally, I think it is not reasonable to suppose the overhead >>>>> to be negligible if the sampling rate used is too low. The user should know >>>>> that the lower the rate, >>>>> the higher the overhead (documentation TODO?). >>>>> >>>>> >>>>> I am not sure what the expected usage of the API is, but >>>>> StackTraceStore::add_trace() seems to be able to grow >>>>> without bounds. >>>>> Only a GC truncates them to the live ones. That in itself >>>>> seems to be >>>>> problematic (GCs can be *wide* apart), and of course some >>>>> of the API >>>>> methods add to that because they duplicate that unbounded >>>>> array. Do you >>>>> have any concerns/measurements about this? >>>>> >>>>> >>>>> So, the theory is that yes add_trace can be able to grow >>>>> without bounds but it grows at a sample per 512k of allocated space. The >>>>> stacks it gathers are currently >>>>> maxed at 64 (I'd like to expand that to an option to the user >>>>> though at some point). So I have no concerns because: >>>>> >>>>> - If really this is taking a lot of space, that means the job >>>>> is keeping a lot of objects in memory as well, therefore the entire heap is >>>>> getting huge >>>>> - If this is the case, you will be triggering a GC at some >>>>> point anyway. 
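The growth argument above can be put into rough numbers. The following is only a back-of-the-envelope sketch: the function names are hypothetical, and the 16 bytes per recorded frame used in the example below is an assumption, not a figure from the webrev.

```cpp
#include <cassert>
#include <cstdint>

// Expected number of samples taken while allocated_bytes are allocated,
// assuming one sample per interval_bytes on average.
inline uint64_t toy_expected_samples(uint64_t allocated_bytes,
                                     uint64_t interval_bytes) {
  if (interval_bytes == 0) {
    return 0;  // sampling disabled in this toy model
  }
  return allocated_bytes / interval_bytes;
}

// Worst-case frame storage if every sample records max_depth frames.
// bytes_per_frame is an assumed size, not taken from the patch.
inline uint64_t toy_worst_case_frame_bytes(uint64_t samples,
                                           uint64_t max_depth,
                                           uint64_t bytes_per_frame) {
  return samples * max_depth * bytes_per_frame;
}
```

With a 512k interval, 1 GiB of allocation gives about 2048 samples; at a depth of 64 and the assumed 16 bytes per frame that is at most 2 MiB of frame data, roughly 0.2% of the allocated volume. Lowering the interval scales this linearly.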
>>>>> >>>>> (I'm putting under the rug the issue of "What if we set the >>>>> rate to 1 for example" because as you lower the sampling rate, we cannot >>>>> guarantee low overhead; the >>>>> idea behind this feature is to have a means of having >>>>> meaningful allocated samples at a low overhead) >>>>> >>>>> I have no measurements really right now but since I now have >>>>> some statistics I can poll, I will look a bit more at this question. >>>>> >>>>> I have the same last sentence than above: the user should >>>>> expect this to happen if the sampling rate is too small. That probably can >>>>> be reflected in the >>>>> StartHeapSampling as a note : careful this might impact your >>>>> performance. >>>>> >>>>> >>>>> Also, these stack traces might hold on to huge arrays. Any >>>>> consideration of that? Particularly it might be the cause >>>>> for OOMEs in >>>>> tight memory situations. >>>>> >>>>> >>>>> There is a stack size maximum that is set to 64 so it should >>>>> not hold huge arrays. I don't think this is an issue but I can double check >>>>> with a test or two. >>>>> >>>>> >>>>> - please consider adding a safepoint check in >>>>> HeapMonitoring::weak_oops_do to prevent accidental misuse. >>>>> >>>>> - in struct StackTraceStorage, the public fields may also >>>>> need >>>>> underscores. At least some files in the runtime directory >>>>> have structs >>>>> with underscored public members (and some don't). The >>>>> runtime team >>>>> should probably comment on that. >>>>> >>>>> >>>>> Agreed I did not know. I looked around and a lot of structs >>>>> did not have them it seemed so I left it as is. I will happily change it if >>>>> someone prefers (I was not >>>>> sure if you really preferred or not, your sentence seemed to >>>>> be more a note of "this might need to change but I don't know if the >>>>> runtime team enforces that", let >>>>> me know if I read that wrongly). 
>>>>> >>>>> >>>>> - In StackTraceStorage::weak_oops_do(), when examining >>>>> the >>>>> StackTraceData, maybe it is useful to consider having a >>>>> non-NULL >>>>> reference outside of the heap's reserved space an error. >>>>> There should >>>>> be no oop outside of the heap's reserved space ever. >>>>> >>>>> Unless you allow storing random values in >>>>> StackTraceData::obj, which I >>>>> would not encourage. >>>>> >>>>> >>>>> I suppose you are talking about this part: >>>>> if ((value != NULL && Universe::heap()->is_in_reserved(value)) >>>>> && >>>>> (is_alive == NULL || >>>>> is_alive->do_object_b(value))) { >>>>> >>>>> What you are saying is that I could have something like: >>>>> if (value != my_non_null_reference && >>>>> (is_alive == NULL || >>>>> is_alive->do_object_b(value))) { >>>>> >>>>> Is that what you meant? Is there really a reason to do so? >>>>> When I look at the code, is_in_reserved seems like a O(1) method call. I'm >>>>> not even sure we can have a >>>>> NULL value to be honest. I might have to study that to see if >>>>> this was not a paranoid test to begin with. >>>>> >>>>> The is_alive code has now morphed due to the comment below. >>>>> >>>>> >>>>> >>>>> - HeapMonitoring::weak_oops_do() does not seem to use the >>>>> passed AbstractRefProcTaskExecutor. >>>>> >>>>> >>>>> It did use it: >>>>> size_t HeapMonitoring::weak_oops_do( >>>>> AbstractRefProcTaskExecutor *task_executor, >>>>> BoolObjectClosure* is_alive, >>>>> OopClosure *f, >>>>> VoidClosure *complete_gc) { >>>>> assert(SafepointSynchronize::is_at_safepoint(), "must be >>>>> at safepoint"); >>>>> >>>>> if (task_executor != NULL) { >>>>> task_executor->set_single_threaded_mode(); >>>>> } >>>>> return StackTraceStorage::storage()->weak_oops_do(is_alive, >>>>> f, complete_gc); >>>>> } >>>>> >>>>> But due to the comment below, I refactored this, so this is no >>>>> longer here. Now I have an always true closure that is passed. 
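The weak-reference pass described above (drop entries whose object died, keep and possibly update the rest, return the count) can be sketched roughly as follows. This is an illustration only: ToyTrace, toy_weak_oops_do and toy_always_true are hypothetical names, an int stands in for an oop, and std::function stands in for HotSpot's closure classes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for a stored stack trace; obj plays the role of the sampled oop.
struct ToyTrace {
  int obj;
};

// Stand-in for an "always true" liveness closure.
inline bool toy_always_true(int) { return true; }

// Walks the stored traces, keeps those whose object is still alive,
// lets keep_alive update the reference, and returns the surviving count.
inline size_t toy_weak_oops_do(std::vector<ToyTrace>& traces,
                               const std::function<bool(int)>& is_alive,
                               const std::function<void(int&)>& keep_alive) {
  size_t kept = 0;
  for (size_t i = 0; i < traces.size(); i++) {
    if (is_alive(traces[i].obj)) {
      keep_alive(traces[i].obj);   // may rewrite the reference
      traces[kept++] = traces[i];  // compact live entries to the front
    }
  }
  traces.resize(kept);
  return kept;
}
```

Passing a non-null always-true predicate instead of NULL keeps the loop free of null checks, which is the consistency argument made in the reply.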
>>>>> >>>>> >>>>> - I do not understand allowing to call this method with a >>>>> NULL >>>>> complete_gc closure. This would mean that objects >>>>> referenced from the >>>>> object that is referenced by the StackTraceData are not >>>>> pulled, meaning >>>>> they would get stale. >>>>> >>>>> - same with is_alive parameter value of NULL >>>>> >>>>> >>>>> So these questions made me look a bit closer at this code. >>>>> I think this code was written this way to have a very small impact on the >>>>> file but you are right, there >>>>> is no reason for this here. I've simplified the code by adding >>>>> a process_HeapSampling method in referenceProcessor.cpp that handles >>>>> everything there. >>>>> >>>>> The code allowed NULLs because it depended on where you were >>>>> coming from and how the code was being called. >>>>> >>>>> - I added a static always_true variable and pass that now to >>>>> be more consistent with the rest of the code. >>>>> - I moved the complete_gc into process_phaseHeapSampling now >>>>> (new method) and handle the task_executor and the complete_gc there >>>>> - Newbie question: in our code we did a >>>>> set_single_threaded_mode but I see that process_phaseJNI does it right >>>>> before its call, do I need to do it for the >>>>> process_phaseHeapSample? >>>>> That API is much cleaner (in my mind) and is consistent with >>>>> what is done around it (again in my mind). >>>>> >>>>> >>>>> - heapMonitoring.cpp:590: I do not completely understand >>>>> the purpose of >>>>> this code: in the end this results in a fixed value >>>>> directly dependent >>>>> on the Thread address anyway? >>>>> IOW, what is special about exactly 20 rounds? >>>>> >>>>> >>>>> So we really want a fast random number generator that has a >>>>> specific mean (512k is the default we use).
The code uses the thread >>>>> address as the start number of the >>>>> sequence (why not, it is random enough is rationale). Then >>>>> instead of just starting there, we prime the sequence and really only start >>>>> at the 21st number, it is >>>>> arbitrary and I have not done a study to see if we could do >>>>> more or less of that. >>>>> >>>>> As I have the statistics of the system up and running, I'll >>>>> run some experiments to see if this is needed, is 20 good, or not. >>>>> >>>>> >>>>> - also I would consider stripping a few bits of the >>>>> threads' address as >>>>> initialization value for your rng. The last three bits >>>>> (and probably >>>>> more, check whether the Thread object is allocated on >>>>> special >>>>> boundaries) are always zero for them. >>>>> Not sure if the given "random" value is random enough >>>>> before/after, >>>>> this method, so just skip that comment if you think this >>>>> is not >>>>> required. >>>>> >>>>> >>>>> I don't know is the honest answer. I think what is important >>>>> is that we tend towards a mean and it is random "enough" to not fall in >>>>> pitfalls of only sampling a >>>>> subset of objects due to their allocation order. I added that >>>>> as test to do to see if it changes the mean in any way for the 512k default >>>>> value and/or if the first >>>>> 1000 elements look better. >>>>> >>>>> >>>>> Some more random nits I did not find a place to put >>>>> anywhere: >>>>> >>>>> - ThreadLocalAllocBuffer::_extra_space does not seem to >>>>> be used >>>>> anywhere? >>>>> >>>>> >>>>> Good catch :). >>>>> >>>>> >>>>> - Maybe indent the declaration of >>>>> ThreadLocalAllocBuffer::_bytes_until_sample to align below the other >>>>> members of that group. >>>>> >>>>> >>>>> Done moved it up a bit to have non static members together and >>>>> static separate. >>>>> >>>>> Thanks, >>>>> Thomas >>>>> >>>>> >>>>> Thanks for your review! 
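The seeding scheme discussed above (an address as the seed, 20 priming rounds, a target mean of 512k) can be sketched as follows. This is only an illustration under stated assumptions: the 48-bit linear congruential generator with the drand48 constants, the 26 bits of precision, the three stripped low bits, and all toy_* names are choices made for this sketch, not taken from the webrev.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Assumed generator: 48-bit LCG using the drand48 multiplier and increment.
inline uint64_t toy_next_random(uint64_t rnd) {
  const uint64_t mult = 0x5DEECE66DULL;
  const uint64_t add = 0xB;
  const uint64_t mask = (uint64_t(1) << 48) - 1;
  return (mult * rnd + add) & mask;  // wraps mod 2^64, then keeps 48 bits
}

// Seeds from an address with the low alignment bits stripped, then
// discards the first 20 values so the sequence starts at the 21st.
inline uint64_t toy_seed_from_address(const void* p) {
  uint64_t rnd = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(p)) >> 3;
  for (int i = 0; i < 20; i++) {
    rnd = toy_next_random(rnd);
  }
  return rnd;
}

// Advances the state and returns the next sampling interval in bytes,
// drawn from an exponential distribution with the given mean.
inline size_t toy_pick_next_sample(uint64_t* rnd, double mean_bytes) {
  *rnd = toy_next_random(*rnd);
  const double q = static_cast<double>(*rnd >> 22) + 1.0;  // in [1, 2^26]
  const double u = q / 67108864.0;                         // in (0, 1]
  return static_cast<size_t>(-std::log(u) * mean_bytes) + 1;
}
```

A harness like this makes it cheap to try other priming counts or seed transformations and compare the observed mean and the first few hundred intervals, which is the experiment proposed in the reply.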
>>>>> Jc From yasuenag at gmail.com Tue Sep 26 09:31:00 2017 From: yasuenag at gmail.com (Yasumasa Suenaga) Date: Tue, 26 Sep 2017 18:31:00 +0900 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com> <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com> <13b177d1-e04c-dd86-7a86-b55b165d5830@oracle.com> <2c955c17-43b9-906f-408d-f5349d57ca13@gmail.com> Message-ID: Hi David, > You will need to rebase all your patches before they can be sponsored. I uploaded webrev for jdk10/hs: http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.01/ Could you push it? Thanks, Yasumasa 2017-09-21 9:31 GMT+09:00 David Holmes: > On 21/09/2017 9:57 AM, Yasumasa Suenaga wrote: >> >> 2017/09/21 8:35 "David Holmes": >> >> The opening announcement was somewhat premature. They created >> jdk10/hs but we're not quite ready to start accepting changes yet. >> >> >> Where can I get the opening announcement for jdk10/hs? > > > hotspot-dev > >> I will send review request after that. > > > You will need to rebase all your patches before they can be sponsored. > > Thanks, > David > >> >> Thanks, >> >> Yasumasa >> >> >> >> David >> >> >> On 21/09/2017 8:44 AM, Yasumasa Suenaga wrote: >> >> Hi David, >> >> jdk10/hs has been opened [1]. >> Could you push this change? >> >> >> Thanks, >> >> Yasumasa >> >> >> [1] >> >> http://mail.openjdk.java.net/pipermail/jdk10-dev/2017-September/000499.html >> >> >> >> >> On 2017/09/19 12:31, David Holmes wrote: >> >> On 19/09/2017 1:19 PM, Yasumasa Suenaga wrote: >> >> Thanks David, >> >> BTW, can I push this change after jdk10/master is opened? >> I cannot access JPRT.
>> >> >> I think we'd probably prefer this to go into jdk10/hs - once >> it is open - and for that you need a sponsor. >> >> Thanks, >> David >> >> >> Yasumasa >> >> >> 2017/09/19 0:08 "David Holmes": >> >> Hi Yasumasa, >> >> On 19/09/2017 12:55 PM, Yasumasa Suenaga wrote: >> >> Thanks Chris, Robbin, >> >> I'm waiting for reviewer(s) for this change. >> >> >> Reviewed. >> >> This simply reverts the change of 8185102. >> >> Thanks, >> David >> ----- >> >> >> Yasumasa >> >> >> 2017/09/19 7:14 "Chris Plummer": >> >> Hi Yasumasa, >> >> Ok, I see now that CIntegerField is just >> an interface, so >> it's up to >> a class to implement getValue() to fetch >> the field. I'm a bit >> unclear on how that part works, but from >> responses by >> others, it >> seems this is ok. >> >> I've run all the tests I can find that use >> jstack or jhsdb, >> and the >> assert was not triggered. Probably need to >> have a NMethod >> on the >> stack to trigger the code you are fixing. >> >> thanks, >> >> Chris >> >> >> On 9/17/17 1:13 AM, Yasumasa Suenaga wrote: >> >> Hi Chris, >> >> I've tested this issue on Fedora 26 >> x86_64. >> I think we can use CIntegerField at >> this point because >> CIntegerField is not specialized for >> various int size [1]. >> In fact, CIntegerField had been used >> at this point [2], >> and HSDB >> worked fine. >> >> >> Thanks, >> >> Yasumasa >> >> >> [1] >> >> http://hg.openjdk.java.net/jdk10/master/file/fd36993f7bf5/src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/types/CIntegerField.java#l29 >> >> [2] >> http://hg.openjdk.java.net/jdk10/master/rev/cbfdbefc6ea3 >> >> >> On 2017/09/17 3:58, Chris Plummer wrote: >> >> Hi Yasumasa, >> >> Is this on a 32-bit system? I >> don't see how you could >> otherwise call getCIntegerField() >> on a long type.
>> jlong is >> always 64-bit and long is >> (generally) 32-bit on 32-bit >> systems, and 64-bit on 64-bit >> systems, at least >> that seems >> to be the case with linux. >> >> From what I can see, >> _stack_traversal_mark is now >> the only >> long type in vmStructs.cpp. I >> don't know that we have a >> mechanism to safely fetch it on >> both 32-bit and >> 64-bit systems. >> >> _stack_traversal_mark seems to be >> a long because >> _traversals >> is also a long. >> >> static long >> _traversals; // >> Stack scan count, also sweep ID. >> >> This too might be considered a >> bug. I'm not sure >> why you >> would want the size of this field >> to vary between >> 32-bit and >> 64-bit systems (adding >> compiler-dev to help answer >> that). >> >> So, while I would agree that your >> fix is generally >> in the >> right direction, I think we first >> need to revisit >> the use of >> long for these fields. If they can >> be changed to an >> int, >> then your fix is correct (pending >> the changes to >> int). If >> not, then maybe we need >> getCLongField() support. >> >> And lastly, we really should have >> a test to detect >> this bug. >> Maybe we already do, and it is >> failing but is going >> unnoticed for some reason. I'll >> try to look into >> that some >> more on Monday. >> >> thanks, >> >> Chris >> >> On 9/16/17 5:20 AM, Yasumasa >> Suenaga wrote: >> >> Hi all, >> >> I tried to get thread dump via >> jstack command >> on CLHSDB. 
>> But it was failed as below: >> >> ``` >> Caused by: >> sun.jvm.hotspot.types.WrongTypeException: >> field "_stack_traversal_mark" >> in type nmethod >> is not of >> type jlong, but instead of >> type long >> at >> >> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:206) >> >> at >> >> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getField(BasicType.java:212) >> >> at >> >> jdk.hotspot.agent/sun.jvm.hotspot.types.basic.BasicType.getJLongField(BasicType.java:249) >> >> at >> >> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.initialize(NMethod.java:108) >> >> at >> >> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.access$000(NMethod.java:35) >> >> at >> >> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod$1.update(NMethod.java:81) >> >> at >> >> jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:451) >> >> at >> >> jdk.hotspot.agent/sun.jvm.hotspot.code.NMethod.(NMethod.java:79) >> ... 23 more >> ``` >> >> I think this exception is >> caused by JDK-8186837. >> This changeset has changed the >> type of >> >> `nmethod::_stack_traversal_mark` to `long` from >> `jlong`. >> >> SA should follow this change. >> >> I uploaded a webrev for this >> issue. This webrev is >> generated from consolidated >> repo (jdk10/master). >> Could you review it? >> >> >> http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.00/ >> >> >> >> > >> > >> >> > >> >> >> > >> >> >> >> >> I cannot access JPRT. So I >> need reviewer. 
>> >> >> Thanks, >> >> Yasumasa >> >> >> >> >> >> >> > From david.holmes at oracle.com Tue Sep 26 10:55:05 2017 From: david.holmes at oracle.com (David Holmes) Date: Tue, 26 Sep 2017 20:55:05 +1000 Subject: RFR: JDK-8187597: WrongTypeException is occurred at CLHSDB jstack after JDK-8186837 In-Reply-To: References: <39e4a81a-0a12-1ee7-dedb-0b9853d89fce@gmail.com> <6f369ad1-ee23-32b1-37ae-3bb60629960c@gmail.com> <9bfc8e1d-8438-21f2-3c06-e53bb95afb6c@oracle.com> <0ee1ed46-5c27-8693-fb6d-396a3059335e@oracle.com> <13b177d1-e04c-dd86-7a86-b55b165d5830@oracle.com> <2c955c17-43b9-906f-408d-f5349d57ca13@gmail.com> Message-ID: <21df4ebd-823b-cf13-b9f7-24b31be35f0a@oracle.com> Hi Yasumasa, On 26/09/2017 7:31 PM, Yasumasa Suenaga wrote: > Hi David, > >> You will need to rebase all your patches before they can be sponsored. > > I uploaded webrev for jdk10/hs: > > http://cr.openjdk.java.net/~ysuenaga/JDK-8187597/webrev.01/ > > > Could you push it? On its way, but please create the final changeset ready for your sponsor(s) to directly import in the future. Thanks, David > > Thanks, > > Yasumasa > > > 2017-09-21 9:31 GMT+09:00 David Holmes : >> On 21/09/2017 9:57 AM, Yasumasa Suenaga wrote: >>> >>> 2017/09/21 ??8:35 "David Holmes" >> >: >>> >>> The opening announcement was somewhat premature. They created >>> jdk10/hs but we're not quite ready to start accepting changes yet. >>> >>> >>> Where can I get the opening announcement for jdk10/hs? >> >> >> hotspot-dev >> >>> I will send review request after that. >> >> >> You will need to rebase all your patches before they can be sponsored. >> >> Thanks, >> David >> >>> >>> Thanks, >>> >>> Yasumasa >>> >>> >>> >>> David >>> >>> >>> On 21/09/2017 8:44 AM, Yasumasa Suenaga wrote: >>> >>> Hi David, >>> >>> jdk10/hs has been opened [1]. >>> Could you push this change? 
>>> >>> [1] http://mail.openjdk.java.net/pipermail/jdk10-dev/2017-September/000499.html
>>> >>> >>> Thanks, >>> >>> Yasumasa >>> >>> >>> >>> >>> >>> >>> >> From martin.doerr at sap.com Tue Sep 26 15:10:47 2017 From: martin.doerr at sap.com (Doerr, Martin) Date: Tue, 26 Sep 2017 15:10:47 +0000 Subject: RFR(M): 8187573: [s390] z/Architecture Vector Facility Support In-Reply-To: <1C19B275-B5BC-4F2D-9319-F9080B365BD6@sap.com> References: <1C19B275-B5BC-4F2D-9319-F9080B365BD6@sap.com> Message-ID: Hi Lutz, thanks for the contribution. Reviewed and pushed. Best regards, Martin From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz Sent: Montag, 18. September 2017 18:29 To: hotspot-compiler-dev at openjdk.java.net Subject: RFR(M): 8187573: [s390] z/Architecture Vector Facility Support Dear all, I would like to request reviews for this s390-only enhancement: Bug: https://bugs.openjdk.java.net/browse/JDK-8187573 Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8187573.00/ This change is all about providing the instruction definitions and related low-level code emitters for the vector instructions, introduced with z13. The change covers support instructions and integer vector instructions only. It only facilitates code generation. No code is generated by the change itself. Thank you! Lutz -------------- next part -------------- An HTML attachment was scrubbed... URL: From lutz.schmidt at sap.com Wed Sep 27 06:48:59 2017 From: lutz.schmidt at sap.com (Schmidt, Lutz) Date: Wed, 27 Sep 2017 06:48:59 +0000 Subject: RFR(M): 8187573: [s390] z/Architecture Vector Facility Support In-Reply-To: References: <1C19B275-B5BC-4F2D-9319-F9080B365BD6@sap.com> Message-ID: Thank you, Martin! Regards, Lutz On 26.09.2017, 17:10, "Doerr, Martin" > wrote: Hi Lutz, thanks for the contribution. Reviewed and pushed. Best regards, Martin From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz Sent: Montag, 18. 
September 2017 18:29
To: hotspot-compiler-dev at openjdk.java.net
Subject: RFR(M): 8187573: [s390] z/Architecture Vector Facility Support

Dear all,

I would like to request reviews for this s390-only enhancement:

Bug: https://bugs.openjdk.java.net/browse/JDK-8187573
Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8187573.00/

This change is all about providing the instruction definitions and related low-level code emitters for the vector instructions, introduced with z13. The change covers support instructions and integer vector instructions only. It only facilitates code generation. No code is generated by the change itself.

Thank you!
Lutz

From lutz.schmidt at sap.com Wed Sep 27 06:48:59 2017
From: lutz.schmidt at sap.com (Schmidt, Lutz)
Date: Wed, 27 Sep 2017 06:48:59 +0000
Subject: RFR(M): 8187573: [s390] z/Architecture Vector Facility Support
In-Reply-To:
References: <1C19B275-B5BC-4F2D-9319-F9080B365BD6@sap.com>
Message-ID:

Thank you, Martin!

Regards,
Lutz

On 26.09.2017, 17:10, "Doerr, Martin" wrote:

Hi Lutz,

thanks for the contribution. Reviewed and pushed.

Best regards,
Martin

From rwestrel at redhat.com Wed Sep 27 14:48:02 2017
From: rwestrel at redhat.com (Roland Westrelin)
Date: Wed, 27 Sep 2017 16:48:02 +0200
Subject: RFR(S): 8187822: C2 conditonal move optimization might create broken graph
In-Reply-To: <0d0b226d-418f-2344-2ff9-a7682747a0e2@oracle.com>
References: <0d0b226d-418f-2344-2ff9-a7682747a0e2@oracle.com>
Message-ID:

Hi Vladimir,

Thanks for looking at this.

> But as I understand we can't replace such diamond code with cmove
> because If node will not be eliminated if you not adjust control of
> LoadI node.

The LoadI node has no control input but it's only used by the AddI which
is only used by the Phi of the diamond. PhaseIdealLoop schedules it as
late as possible, that is in the branch of the diamond.

The fix I sent for review was actually broken (the controls of the inputs
of the CMoveX wouldn't dominate the control of the CMoveX). What about
this instead:

http://cr.openjdk.java.net/~roland/8187822/webrev.01/

which simply follows dependent data nodes if needed and adjust their
control.

Roland.
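
For readers less familiar with the terminology in the message above, the "diamond" is an if/else whose two arms merge into a single value (the Phi). The following is only a source-level sketch of that shape, with a load and an add that feed nothing but the merged value. It is not the reproducer from the webrev, the class, field and method names are made up for illustration, and no claim is made about the exact graph C2 builds for it or whether it ends up as a conditional move.

```java
class CMoveDiamond {
    // Hypothetical field; stands in for the memory location being loaded.
    static int base = 41;

    // One arm loads a value and adds to it, the other passes a value
    // through unchanged. Both arms merge into r, which plays the role
    // of the Phi at the bottom of the diamond.
    static int select(boolean cond, int fallback) {
        int r;
        if (cond) {
            r = base + 1;   // load + add, used only by the merged result
        } else {
            r = fallback;
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(select(true, 7));
        System.out.println(select(false, 7));
    }
}
```

The point of the shape is that the load and the add have no user outside the merge, so a late scheduler is free to place them inside the taken arm.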
From patric.hedlin at oracle.com Thu Sep 28 08:17:27 2017
From: patric.hedlin at oracle.com (Patric Hedlin)
Date: Thu, 28 Sep 2017 10:17:27 +0200
Subject: JDK10/RFR(M): 8188031: Complement fused mac operations on SPARC.
Message-ID: <2e9e62da-e3b8-9519-2dbf-abbe52de5bee@oracle.com>

Dear all,

I would like to ask for help to review the following change/update:

Issue: https://bugs.openjdk.java.net/browse/JDK-8188031
Webrev: http://cr.openjdk.java.net/~phedlin/tr8188031/

8188031: Complement fused mac operations on SPARC.

Adding a few (FMAf) matcher patterns to the SPARC back-end.

Testing:
Testing on JDK10 (jtreg/mach5/hotspot/precheckin-comp)

Best regards,
Patric

From rickard.backman at oracle.com Thu Sep 28 08:44:36 2017
From: rickard.backman at oracle.com (Rickard Bäckman)
Date: Thu, 28 Sep 2017 10:44:36 +0200
Subject: JDK10/RFR(M): 8188031: Complement fused mac operations on SPARC.
In-Reply-To: <2e9e62da-e3b8-9519-2dbf-abbe52de5bee@oracle.com>
References: <2e9e62da-e3b8-9519-2dbf-abbe52de5bee@oracle.com>
Message-ID: <20170928084436.GD14469@rbackman>

Looks good.

/R

On 09/28, Patric Hedlin wrote:
> Dear all,
>
> I would like to ask for help to review the following change/update:
>
> Issue: https://bugs.openjdk.java.net/browse/JDK-8188031
>
> Webrev: http://cr.openjdk.java.net/~phedlin/tr8188031/
>
>
> 8188031: Complement fused mac operations on SPARC.
>
> Adding a few (FMAf) matcher patterns to the SPARC back-end.
>
>
> Testing:
>
> Testing on JDK10 (jtreg/mach5/hotspot/precheckin-comp)
>
>
> Best regards,
> Patric

From lutz.schmidt at sap.com Thu Sep 28 13:30:58 2017
From: lutz.schmidt at sap.com (Schmidt, Lutz)
Date: Thu, 28 Sep 2017 13:30:58 +0000
Subject: RFR(S): 8187969: [s390] z/Architecture Vector Facility Support.
Part II
Message-ID: <6073FB33-D580-447F-A201-43FB0EB9867C@sap.com>

Dear all,

I would like to request reviews for this s390-only enhancement:

Bug: https://bugs.openjdk.java.net/browse/JDK-8187969
Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8187969.00/index.html

This change is all about providing the instruction definitions and related low-level code emitters for the vector string instructions, introduced with z13. It only facilitates code generation. No code is generated by the change itself.

Thank you!
Lutz

From vladimir.kozlov at oracle.com Thu Sep 28 15:40:31 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Thu, 28 Sep 2017 08:40:31 -0700
Subject: JDK10/RFR(M): 8188031: Complement fused mac operations on SPARC.
In-Reply-To: <20170928084436.GD14469@rbackman>
References: <2e9e62da-e3b8-9519-2dbf-abbe52de5bee@oracle.com> <20170928084436.GD14469@rbackman>
Message-ID:

+1

Thanks,
Vladimir

On 9/28/17 1:44 AM, Rickard Bäckman wrote:
> Looks good.
>
> /R
>
> On 09/28, Patric Hedlin wrote:
>> Dear all,
>>
>> I would like to ask for help to review the following change/update:
>>
>> Issue: https://bugs.openjdk.java.net/browse/JDK-8188031
>>
>> Webrev: http://cr.openjdk.java.net/~phedlin/tr8188031/
>>
>>
>> 8188031: Complement fused mac operations on SPARC.
>>
>> Adding a few (FMAf) matcher patterns to the SPARC back-end.
>>
>>
>> Testing:
>>
>> Testing on JDK10 (jtreg/mach5/hotspot/precheckin-comp)
>>
>>
>> Best regards,
>> Patric

From vladimir.kozlov at oracle.com Thu Sep 28 18:04:42 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Thu, 28 Sep 2017 11:04:42 -0700
Subject: RFR(S): 8187822: C2 conditonal move optimization might create broken graph
In-Reply-To:
References: <0d0b226d-418f-2344-2ff9-a7682747a0e2@oracle.com>
Message-ID: <66e0634c-84a6-f8ef-6451-bf1e6a9be252@oracle.com>

Yes, this version is better.
My only concern now if there is a case when data node has control edge pointing to branch of diamond and depends on If's condition (NULL check). Can this happen?

Thanks,
Vladimir

On 9/27/17 7:48 AM, Roland Westrelin wrote:
>
> Hi Vladimir,
>
> Thanks for looking at this.
>
>> But as I understand we can't replace such diamond code with cmove
>> because If node will not be eliminated if you not adjust control of
>> LoadI node.
>
> The LoadI node has no control input but it's only used by the AddI which
> is only used by the Phi of the diamond. PhaseIdealLoop schedules it as
> late as possible, that is in the branch of the diamond.
>
> The fix I sent for review was actually broken (the controls of the inputs
> of the CMoveX wouldn't dominate the control of the CMoveX). What about
> this instead:
>
> http://cr.openjdk.java.net/~roland/8187822/webrev.01/
>
> which simply follows dependent data nodes if needed and adjust their
> control.
>
> Roland.
>

From vladimir.kozlov at oracle.com Thu Sep 28 23:40:48 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Thu, 28 Sep 2017 16:40:48 -0700
Subject: [10] RFR(S): 8187684 - Intrinsify Math.multiplyHigh(long, long)
In-Reply-To: <3d76ff3c-17d7-a37c-2959-8f5e6dacd854@oracle.com>
References: <5cfb314b-9785-fe47-8797-a899f38643ef@redhat.com> <32a35ee7-9593-bbc4-a540-37c590ec3ba6@redhat.com> <5fa7e391-bde0-9310-7bcd-98cf96b3f158@bell-sw.com> <3d76ff3c-17d7-a37c-2959-8f5e6dacd854@oracle.com>
Message-ID: <1e07d1a7-8145-4a0e-7693-d3816551e65d@oracle.com>

Dmitry,

Please, update changes for new consolidated sources and send new patch/webrev.

Thanks,
Vladimir

On 9/25/17 9:42 AM, Vladimir Kozlov wrote:
> Yes, when repo will be opened.
>
> Please, send patch and add latest webrev link to the RFE.
>
> Thanks,
> Vladimir
>
> On 9/25/17 5:04 AM, Dmitrij Pochepko wrote:
>>
>> On 25.09.2017 14:04, Andrew Haley wrote:
>>> On 20/09/17 14:29, Andrew Haley wrote:
>>>> On 20/09/17 14:08, Dmitrij Pochepko wrote:
>>>>> please review small patch for enhancement: 8187684 - Intrinsify
>>>>> Math.multiplyHigh(long, long)
>>>> OK, thanks.
>>> Dmitrij, do you have a sponsor for this? I'm sure Vladimir would
>>> be happy to help. :-)
>>>
>> Hi,
>>
>> Vladimir, can you sponsor it?
>>
>> Thanks,
>> Dmitrij

From zhongwei.yao at linaro.org Fri Sep 29 08:25:39 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Fri, 29 Sep 2017 16:25:39 +0800
Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID:

Hi, Vladimir,

Sorry for my late response!

And yes, it solves my case.

But I found specjvm2008 doesn't have a stable result, especially for
benchmark case like startup.xxx, scimark.xxx.large etc. And I have
found obvious performance regress in the rest of benchmark cases. What
do you think?

On 21 September 2017 at 00:18, Vladimir Kozlov wrote:
> Nice.
>
> Did you verified that it fixed your case?
>
> Would be nice to run specjvm2008 to make sure performance did not regress.
>
> Thanks,
> Vladimir
>
>
> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>
>> Thanks for your suggestions!
>>
>> I've updated the patch that uses pass_slp and do_unroll_only flags
>> without adding a new flag. Please take a look:
>>
>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>
>>
>> On 20 September 2017 at 01:54, Vladimir Kozlov wrote:
>>>
>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>
>>>> Hi, Vladimir,
>>>>
>>>> On 19 September 2017 at 00:17, Vladimir Kozlov wrote:
>>>>>
>>>>> Why not use existing set_notpassed_slp() instead of
>>>>> mark_slp_vec_failed()?
>>>>
>>>> Due to 2 reasons, I have not chosen existing passed_slp flag:
>>>
>>> My point is that if we don't find vectors in a loop (as in your case) we
>>> should ignore whole SLP analysis.
>>>
>>> In best case scenario SuperWord::unrolling_analysis() should determine if
>>> there are vectors candidates. For example, check if array's index is
>>> depend on loop's index variable.
>>>
>>> An other way is to call SuperWord::unrolling_analysis() only after we did
>>> vector analysis.
>>>
>>> It is more complicated changes and out of scope of this. There is also
>>> side effect I missed before which may prevent using set_notpassed_slp():
>>> LoopMaxUnroll is changed based on SLP analysis before has_passed_slp()
>>> check.
>>>
>>> Note, set_notpassed_slp() is also used to additional unroll already
>>> vectorized loops:
>>>
>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>
>>> May be you should also call mark_do_unroll_only() when you set
>>> set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt
>>> problem you pointed. Can you look on this?
>>>
>>> I am not against adding new is_slp_vec_failed() but I want first to
>>> investigate if we can re-use existing functions.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>
>>>> 1.
If we set_notpassed_slp() when _packset.length() == 0 in
>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>> checking:
>>>>
>>>>     if (cl->has_passed_slp()) {
>>>>       if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>       // Normal case: loop too big
>>>>       return false;
>>>>     }
>>>>
>>>> we will ignore the case: "cl->has_passed_slp() &&
>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()"
>>>> as also exposed in my patch:
>>>>
>>>>     if (cl->has_passed_slp()) {
>>>>       if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>> -     // Normal case: loop too big
>>>> -     return false;
>>>> +     // When SLP vectorization failed, we could do more unrolling
>>>> +     // optimizations if body size is less than limit size. Otherwise,
>>>> +     // return false due to loop is too big.
>>>> +     if (!cl->is_slp_vec_failed()) return false;
>>>>     }
>>>>
>>>> However, I have not found a case to support this condition yet.
>>>>
>>>> 2. As replied below, in:
>>>>>
>>>>> -    } else if (cl->is_main_loop()) {
>>>>> +    } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>        sw.transform_loop(lpt, true);
>>>>
>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>> checking becomes explicit when using SLPAutoVecFailed flag.
>>>>
>>>>>
>>>>> Why you need next additional check?:
>>>>>
>>>>> -    } else if (cl->is_main_loop()) {
>>>>> +    } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>        sw.transform_loop(lpt, true);
>>>>>
>>>>
>>>> The additional check prevents the case that when
>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>> not what we want.
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>>
>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>
>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>
>>>>>> Hi, all,
>>>>>>
>>>>>> Bug:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>
>>>>>> Webrev:
>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>
>>>>>> In the current implementation, the loop unrolling times are determined
>>>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>
>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>> auto-vectorization fails (as following example shows).
>>>>>>
>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>> unrolled until reaching the unroll times limitation.
>>>>>>
>>>>>> Here is one example:
>>>>>>     public static void accessArrayConstants(int[] array) {
>>>>>>         for (int j = 0; j < 1024; j++) {
>>>>>>             array[0]++;
>>>>>>             array[1]++;
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> Before this patch, the loop will be unrolled by 4 times. 4 is
>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>>>>>
>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>
>>>>>> ==== generated code start ====
>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>> ==== generated code end ====
>>>>>>
>>>>>> After applied this patch, it is unrolled 16 times:
>>>>>>
>>>>>> ==== generated code start ====
>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>> ==== generated code end ====
>>>>>>
>>>>>> This patch passes jtreg tests both on AArch64 and X86.
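
The unroll counts mentioned in the quoted message (4 on AArch64, 8 on x86) are simply the vector width divided by the element width. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and this is not HotSpot code, only the division the message describes.

```java
class UnrollFactor {
    // Unroll count as described in the message: vector size in bits
    // divided by the array element size in bits.
    static int unrollCount(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        // int elements are 32 bits wide (Integer.SIZE).
        System.out.println(unrollCount(128, Integer.SIZE)); // 128-bit vectors: 4
        System.out.println(unrollCount(256, Integer.SIZE)); // 256-bit vectors: 8
    }
}
```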
--
Best regards,
Zhongwei

From rwestrel at redhat.com Fri Sep 29 08:26:10 2017
From: rwestrel at redhat.com (Roland Westrelin)
Date: Fri, 29 Sep 2017 10:26:10 +0200
Subject: RFR(S): 8187822: C2 conditonal move optimization might create broken graph
In-Reply-To: <66e0634c-84a6-f8ef-6451-bf1e6a9be252@oracle.com>
References: <0d0b226d-418f-2344-2ff9-a7682747a0e2@oracle.com> <66e0634c-84a6-f8ef-6451-bf1e6a9be252@oracle.com>
Message-ID:

> My only concern now if there is a case when data node has control edge
> pointing to branch of diamond and depends on If's condition (NULL
> check). Can this happen?

These:

  // Check for ops pinned in an arm of the diamond.
  // Can't remove the control flow in this case
  if (lp->outcnt() > 1) return NULL;
  if (rp->outcnt() > 1) return NULL;

prevent it, right?

Roland.

From zhongwei.yao at linaro.org Fri Sep 29 09:22:24 2017
From: zhongwei.yao at linaro.org (Zhongwei Yao)
Date: Fri, 29 Sep 2017 17:22:24 +0800
Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID:

I made a typo in the previous reply.

On 29 September 2017 at 16:25, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> Sorry for my late response!
>
> And yes, it solves my case.
>
> But I found specjvm2008 doesn't have a stable result, especially for
> benchmark case like startup.xxx, scimark.xxx.large etc. And I have
> found obvious performance regress in the rest of benchmark cases. What

And I have NOT found obvious performance regress in the rest of benchmark cases.

> do you think?

--
Best regards,
Zhongwei

From martin.doerr at sap.com Fri Sep 29 15:05:08 2017
From: martin.doerr at sap.com (Doerr, Martin)
Date: Fri, 29 Sep 2017 15:05:08 +0000
Subject: sponsor needed for 8185979: PPC64: Implement SHA2 intrinsic
Message-ID:

Hi,

we need a sponsor for the following PPC64 change:
8185979: PPC64: Implement SHA2 intrinsic
because it touches hotspot tests.
Latest webrev for jdk10/hs is here:
http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.06/

It already has 2 reviews. Can somebody push it through JPRT, please?

Best regards,
Martin

From vladimir.kozlov at oracle.com Fri Sep 29 18:10:10 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Fri, 29 Sep 2017 11:10:10 -0700
Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
In-Reply-To:
References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com>
Message-ID: <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com>

On 9/29/17 1:25 AM, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> Sorry for my late response!
>
> And yes, it solves my case.
>
> But I found specjvm2008 doesn't have a stable result, especially for
> benchmark case like startup.xxx, scimark.xxx.large etc. And I have
> found obvious performance regress in the rest of benchmark cases. What
> do you think?

You know that you can change the run parameters for specjvm2008 to avoid waiting a long time for it to finish. And you need to run on one node, preferably.

Variations in startup are not important in this case. But scimark is important since those benchmarks show the quality of loop optimizations. Is the regression significant? We need more time to investigate it then.

Thanks,
Vladimir

>
> On 21 September 2017 at 00:18, Vladimir Kozlov
> wrote:
>> Nice.
>>
>> Did you verified that it fixed your case?
>>
>> Would be nice to run specjvm2008 to make sure performance did not regress.
>>
>> Thanks,
>> Vladimir
>>
>>
>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>>
>>> Thanks for your suggestions!
>>>
>>> I've updated the patch that uses pass_slp and do_unroll_only flags
>>> without adding a new flag.
Please take a look: >>> >>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/ >>> >>> >>> >>> On 20 September 2017 at 01:54, Vladimir Kozlov >>> wrote: >>>> >>>> >>>> >>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote: >>>>> >>>>> >>>>> Hi, Vladimir, >>>>> >>>>> On 19 September 2017 at 00:17, Vladimir Kozlov >>>>> wrote: >>>>>> >>>>>> >>>>>> Why not use existing set_notpassed_slp() instead of >>>>>> mark_slp_vec_failed()? >>>>> >>>>> >>>>> >>>>> Due to 2 reasons, I have not chosen existing passed_slp flag: >>>> >>>> >>>> >>>> My point is that if we don't find vectors in a loop (as in your case) we >>>> should ignore whole SLP analysis. >>>> >>>> In best case scenario SuperWord::unrolling_analysis() should determine if >>>> there are vectors candidates. For example, check if array's index is >>>> depend >>>> on loop's index variable. >>>> >>>> An other way is to call SuperWord::unrolling_analysis() only after we did >>>> vector analysis. >>>> >>>> It is more complicated changes and out of scope of this. There is also >>>> side >>>> effect I missed before which may prevent using set_notpassed_slp(): >>>> LoopMaxUnroll is changed based on SLP analysis before has_passed_slp() >>>> check. >>>> >>>> Note, set_notpassed_slp() is also used to additional unroll already >>>> vectorized loops: >>>> >>>> >>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421 >>>> >>>> May be you should also call mark_do_unroll_only() when you set >>>> set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt >>>> problem you pointed. Can you look on this? >>>> >>>> I am not against adding new is_slp_vec_failed() but I want first to >>>> investigate if we can re-use existing functions. >>>> >>>> Thanks, >>>> Vladimir >>>> >>>> >>>>> 1. 
If we set_notpassed_slp() when _packset.length() == 0 in >>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll() >>>>> checking: >>>>> >>>>> if (cl->has_passed_slp()) { >>>>> if (slp_max_unroll_factor >= future_unroll_ct) return true; >>>>> // Normal case: loop too big >>>>> return false; >>>>> } >>>>> >>>>> we will ignore the case: "cl->has_passed_slp() && >>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()" >>>>> as alos exposed in my patch: >>>>> >>>>> if (cl->has_passed_slp()) { >>>>> if (slp_max_unroll_factor >= future_unroll_ct) return true; >>>>> - // Normal case: loop too big >>>>> - return false; >>>>> + // When SLP vectorization failed, we could do more unrolling >>>>> + // optimizations if body size is less than limit size. Otherwise, >>>>> + // return false due to loop is too big. >>>>> + if (!cl->is_slp_vec_failed()) return false; >>>>> } >>>>> >>>>> However, I have not found a case to support this condition yet. >>>>> >>>>> 2. As replied below, in: >>>>>> >>>>>> >>>>>> - } else if (cl->is_main_loop()) { >>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >>>>>> sw.transform_loop(lpt, true); >>>>> >>>>> >>>>> I need to check whether cl->is_slp_vec_failed() is true.Such >>>>> checking becomes explicit when using SLPAutoVecFailed flag. >>>>> >>>>>> >>>>>> Why you need next additional check?: >>>>>> >>>>>> - } else if (cl->is_main_loop()) { >>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >>>>>> sw.transform_loop(lpt, true); >>>>>> >>>>> >>>>> The additional check prevents the case that when >>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will >>>>> set_major_progress() at the beginning (because _packset.length() == 0 >>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal >>>>> loop iteration" will not stop untill loop_opts_cnt reachs 0, which is >>>>> not we want. 
>>>> >>>> >>>> >>>> >>>>> >>>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>> >>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev] >>>>>>> >>>>>>> Hi, all, >>>>>>> >>>>>>> Bug: >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601 >>>>>>> >>>>>>> Webrev: >>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00 >>>>>>> >>>>>>> In the current implementation, the loop unrolling times are determined >>>>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is >>>>>>> true (both X86 and aarch64 are true for now). >>>>>>> >>>>>>> This unrolling policy generates less optimized code when SLP >>>>>>> auto-vectorization fails (as following example shows). >>>>>>> >>>>>>> In this patch, I modify the current unrolling policy to do more >>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be >>>>>>> unrolled until reaching the unroll times limitation. >>>>>>> >>>>>>> Here is one example: >>>>>>> public static void accessArrayConstants(int[] array) { >>>>>>> for (int j = 0; j < 1024; j++) { >>>>>>> array[0]++; >>>>>>> array[1]++; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> Before this patch, the loop will be unrolled by 4 times. 4 is >>>>>>> determined by: AArch64's vector size 128 bits / array element size 32 >>>>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8. 
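The unroll counts quoted above (4 on AArch64, 8 on X86) are the vector width divided by the element width. A minimal Java sketch of that arithmetic follows; the class and method names are hypothetical and are not part of HotSpot, and the sketch only restates the division described in the message.

```java
final class SlpUnrollFactor {
    // Hypothetical helper: how many elements of the given width fit in one vector register.
    static int unrollFactor(int vectorBits, int elementBits) {
        if (vectorBits <= 0 || elementBits <= 0 || vectorBits % elementBits != 0) {
            throw new IllegalArgumentException(
                "vector width must be a positive multiple of the element width");
        }
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        // 128-bit vectors with 32-bit int elements, as described for AArch64.
        System.out.println(unrollFactor(128, Integer.SIZE));
        // 256-bit vectors with 32-bit int elements, as described for X86.
        System.out.println(unrollFactor(256, Integer.SIZE));
    }
}
```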
>>>>>>>
>>>>>>> Below is the code generated by C2 on AArch64:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> This patch passes jtreg tests both on AArch64 and X86.
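In both listings the per-iteration increments are folded into a single add per element (#0x4 before the patch, #0x10 after). The same effect can be mimicked in plain Java. This is only a sketch of what unrolling by a given factor does to this particular loop, assuming a hand-written remainder loop; it does not reproduce how C2 structures its pre, main and post loops, and all names are hypothetical.

```java
import java.util.Arrays;

final class UnrollSketch {
    static final int TRIP_COUNT = 1024;

    // The loop from the example in the message, as written.
    static void rolled(int[] array) {
        for (int j = 0; j < TRIP_COUNT; j++) {
            array[0]++;
            array[1]++;
        }
    }

    // The same loop unrolled by `factor`, with the repeated increments
    // folded into one addition per element and per unrolled iteration.
    static void unrolled(int[] array, int factor) {
        if (factor <= 0) {
            throw new IllegalArgumentException("factor must be positive");
        }
        int j = 0;
        for (; j + factor <= TRIP_COUNT; j += factor) {
            array[0] += factor;
            array[1] += factor;
        }
        // Remainder iterations when TRIP_COUNT is not a multiple of factor.
        for (; j < TRIP_COUNT; j++) {
            array[0]++;
            array[1]++;
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = a.clone();
        rolled(a);
        unrolled(b, 16);
        // Both versions must leave the array in the same state.
        System.out.println(Arrays.equals(a, b));
    }
}
```

A larger factor means fewer trips through the loop body, and so fewer compare-and-branch instructions, for the same result.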
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>

From vladimir.kozlov at oracle.com Fri Sep 29 18:11:13 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Fri, 29 Sep 2017 11:11:13 -0700
Subject: RFR(S): 8187822: C2 conditonal move optimization might create broken graph
In-Reply-To:
References: <0d0b226d-418f-2344-2ff9-a7682747a0e2@oracle.com> <66e0634c-84a6-f8ef-6451-bf1e6a9be252@oracle.com>
Message-ID:

On 9/29/17 1:26 AM, Roland Westrelin wrote:
>
>> My only concern now if there is a case when data node has control edge
>> pointing to branch of diamond and depends on If's condition (NULL
>> check). Can this happen?
>
> These:
>
> // Check for ops pinned in an arm of the diamond.
> // Can't remove the control flow in this case
> if (lp->outcnt() > 1) return NULL;
> if (rp->outcnt() > 1) return NULL;
>
> prevent it, right?

Yes. Thanks for looking at it.

Changes are good.

Vladimir

>
> Roland.
>

From vladimir.kozlov at oracle.com Fri Sep 29 18:40:44 2017
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Fri, 29 Sep 2017 11:40:44 -0700
Subject: sponsor needed for 8185979: PPC64: Implement SHA2 intrinsic
In-Reply-To:
References:
Message-ID: <4bd56460-59c6-f95a-7a9a-9a6687d84115@oracle.com>

I will sponsor it.

Vladimir

On 9/29/17 8:05 AM, Doerr, Martin wrote:
> Hi,
>
> we need a sponsor for the following PPC64 change:
>
> 8185979: PPC64: Implement SHA2 intrinsic
>
> because it touches hotspot tests.
>
> Latest webrev for jdk10/hs is here:
>
> http://cr.openjdk.java.net/~mdoerr/8185979_ppc_sha2/webrev.06/
>
> It already has 2 reviews. Can somebody push it through JPRT, please?
> > Best regards, > > Martin > From gustavo.scalet at eldorado.org.br Fri Sep 29 21:25:41 2017 From: gustavo.scalet at eldorado.org.br (Gustavo Serra Scalet) Date: Fri, 29 Sep 2017 21:25:41 +0000 Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics In-Reply-To: <0aaf319e25934903a468542d02f6a734@serv030.corp.eldorado.org.br> References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br> <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com> <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br> <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com> <0ef23b5fcbc54996aea876d4c60e4097@sap.com> <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br> <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com> <6362e4c1e3ab4871b12232580f2971aa@serv031.corp.eldorado.org.br> <1a54e77a4b5e45e2b848da3fccf423dd@serv030.corp.eldorado.org.br> <170eb84ecbdb4c9b9fe8d4481e4c319f@sap.com> <0aaf319e25934903a468542d02f6a734@serv030.corp.eldorado.org.br> Message-ID: <2432cbfebfa342dfb560ecf4d6023581@serv030.corp.eldorado.org.br> Hi Martin and Goetz, A new webrev updated to the new repo structure was requested and can be viewed below: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.05/ PS: changes applied cleanly from old hotspot to new one. Can it be sponsored now? Thanks. > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > Sent: quarta-feira, 6 de setembro de 2017 09:45 > To: Lindenmaier, Goetz ; Doerr, Martin > ; 'hotspot-compiler-dev at openjdk.java.net' > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > SquareToLen intrinsics > > Alright, thanks for the instructions. I'll keep that in mind. 
> > > -----Original Message----- > > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com] > > Sent: quarta-feira, 6 de setembro de 2017 09:44 > > To: Gustavo Serra Scalet ; Doerr, > > Martin ; 'hotspot-compiler-dev at openjdk.java.net' > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > SquareToLen intrinsics > > > > Hi Gustavo, > > > > the repos are all closed. Once they are opened again, you will have to > > merge your change into the new repo structure, post a new webrev and > > only then it can be sponsored. Me or Martin will sponsor it then. > > > > Best regards, > > Goetz. > > > > > -----Original Message----- > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > > Sent: Mittwoch, 6. September 2017 14:32 > > > To: Lindenmaier, Goetz ; Doerr, Martin > > > ; 'hotspot-compiler-dev at openjdk.java.net' > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > Thanks Goetz. > > > > > > Could somebody sponsor this change? > > > > > > THanks > > > > > > > -----Original Message----- > > > > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com] > > > > Sent: quarta-feira, 6 de setembro de 2017 03:30 > > > > To: Gustavo Serra Scalet ; Doerr, > > > > Martin ; 'hotspot-compiler- > > dev at openjdk.java.net' > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > SquareToLen intrinsics > > > > > > > > Hi, > > > > > > > > I had a look at this change and tested it. Reviewed. > > > > > > > > Best regards, > > > > Goetz. > > > > > > > > > -----Original Message----- > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > > Sent: Freitag, 1. 
September 2017 19:12 > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > dev at openjdk.java.net' > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > SquareToLen intrinsics > > > > > > > > > > Hi Martin, > > > > > > > > > > > -----Original Message----- > > > > > > From: Doerr, Martin > > > > > > your first webrev already works on Big Endian. So the only > > > > > > required change is to fix your new code by this trivial patch: > > > > > > --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 > > 17:47:45 > > > > 2017 > > > > > > +0200 > > > > > > +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 > > 17:55:08 > > > > 2017 > > > > > > +0200 > > > > > > @@ -3426,7 +3426,9 @@ > > > > > > __ srdi (product, product, 1); > > > > > > // join them to the same register and store it as Little > > Endian > > > > > > __ orr (product, lplw_s, product); > > > > > > +#ifdef VM_LITTLE_ENDIAN > > > > > > __ rldicl (product, product, 32, 0); > > > > > > +#endif > > > > > > __ stdu (product, 8, out_aux); > > > > > > __ bdnz (LOOP_SQUARE); > > > > > > > > > > > > So please enable it again for Big Endian in vm_version_ppc. > > > > > > Besides that, it looks good to me. We also need a 2nd review. > > > > > > > > > > Great! Thanks for checking it and suggesting the diff. > > > > > > > > > > I changed these things. You can find it below: > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/ > > > > > > > > > > I wonder who could be a 2nd reviewer... Anybody in mind that we > > > > > may > > > > ping? > > > > > Maybe Goetz Lindenmaier? > > > > > > > > > > Best Regards, > > > > > Gustavo Serra Scalet > > > > > > > > > > > > > > > > > Best regards, > > > > > > Martin > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: Gustavo Serra Scalet > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > Sent: Mittwoch, 30. 
August 2017 19:03 > > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > > dev at openjdk.java.net' > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > > SquareToLen intrinsics > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > (webrev at the end) > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Doerr, Martin > > > > > > > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" > > > > > > > > doesn't seem to need further changes as it's being cleared > > > > > > > > with clrldi, which is the same as rldic with no shift. > > > > > > > > Therefore it's treated appropriately as requested for > > "offset" parameter. Do you agree? > > > > > > > > > > > > > > No, I didn't find clrldi for len in generate_mulAdd(). Only > > for k. > > > > > > > > > > > > I'm sorry. I was thinking about "offset" and "k", which are > > > > > > both cleaned on generate_mulAdd(). "len" was not cleaned and > > > > > > it was being used on > > > > > > muladd() directly with cmpdi, which could lead to problems. > > > > > > > > > > > > That is being changed. > > > > > > > > > > > > > Where are in_len and out_len fixed up in > > generate_squareToLen()? > > > > > > > > > > > > They are not. According to your suggestions, I agree it also > > > > > > needs to be done for the same reason. > > > > > > > > > > > > > > You are right. The way I'm building the 64 bits of the > > > > > > > > register depends on which kind of endianness it is run. > > > > > > > > For now it works only on little endian so I'm adding a > > > > > > > > switch (just like I did for SHA) to make it available only > > > > > > > > on > > little endian systems. > > > > > > > > > > > > > > It shouldn't be that hard to get it working on big endian > > > > > > > ;-) Btw., my point was not to replace the 2 4-byte store > > > > > > > instructions by an 8-byte one (though I'm also ok with > that). 
> > > > > > > It was that 2 stwu which update the same pointer doesn't > > > > > > > make sense from > > > > performance point of view. > > > > > > > Please keep something which works on big endian, too. > > > > > > > > > > > > I see. The 2x stwu was being used like that because it was the > > > > > > trivial approach when considering the original java update: > > > > > > z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); > > > > > > z[i++] = (int)(product >>> 1); > > > > > > > > > > > > As you pointed out, that might cause some stall on the > > > > > > pipeline so I made it with 1s stdu (and could improve code by > > > > > > reducing 1 > > > > > > instruction) > > > > > > > > > > > > Now about having a big endian version: I'm not confident in > > > > > > doing so as I don't have access to such a machine at the > moment. > > > > > > You were kind on offering test support but I don't know if > > > > > > it'd work like that. I may support you in checking out which > > > > > > places are endianness-related but I'm not comfortable in > > > > > > sending you untested > > > > code. > > > > > > > > > > > > Would you be interested in doing such a changes for making it > > > > > > work on Big Endian? For this patch, I provided an interesting > > > > > > test that might help you to verify if it worked. > > > > > > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu > > > > > > > > Apr > > > > > > > > 6 > > > > > > > > 14:15:31 2017). The reported performance speedup was > > > > > > > > calculated by running the following test > > (TestSquareToLen.java): > > > > > > > > > > > > > > Seems like JDK-8145913 has not been backported, yet. Sorry > > > > > > > for not checking this earlier. So if you want to make RSA > > > > > > > really fast, it should be so much better to backport that > > > > > > > one. But I can still sponsor this change as it may be used > elsewhere. > > > > > > > > > > > > No problem. 
It's nice to know that I may not need to request a > > > > > > backport of this patch for performance reasons. > > > > > > > > > > > > And at last, but not least, the new webrev with these clrldi > > > > changes: > > > > > > https://gut.github.io/openjdk/webrev/JDK- > > > > > 8185976/webrev.03/index.html > > > > > > > > > > > > Thank you once again, > > > > > > Gustavo Serra Scalet > > > > > > > > > > > > > Best regards, > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Gustavo Serra Scalet > > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > > Sent: Dienstag, 29. August 2017 22:37 > > > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > > > dev at openjdk.java.net' > > > > > > > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd > > > > > > > and SquareToLen intrinsics > > > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > > > New changes: > > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > > > > > > > > > > > > > Check comments below, please. > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Doerr, Martin > > > > > > > > > > > > > > > > 1. Sign extending offset and len Right, sign and zero > > > > > > > > extending is equivalent for offset and len because they > > > > > > > > are guaranteed to be >=0 (by checks in Java). But you can > > > > > > > > only rely on bit 32 (IBM > > > > > > > > notation) to be 0. Bit 0-31 may contain > > > > > > > garbage. > > > > > > > > rldicl was incorrect. My mistake, sorry for that. Correct > > > > > > > > would be rldic which also clears the least significant > bits. > > > > > > > > len should also get fixed e.g. by replacing cmpdi by > > > > > > > > extsw_ in > > > > > > muladd. 
> > > > > > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" > > > > > > > doesn't seem to need further changes as it's being cleared > > > > > > > with clrldi, which is the same as rldic with no shift. > > > > > > > Therefore it's treated appropriately as requested for > "offset" > > parameter. Do you agree? > > > > > > > > > > > > > > > 2. Using 8 byte instructions for int The code which feeds > > > > > > > > stdu is endianess specific. Doesn't work on all > > > > > > > > PPC64 platforms. > > > > > > > > > > > > > > You are right. The way I'm building the 64 bits of the > > > > > > > register depends on which kind of endianness it is run. For > > > > > > > now it works only on little endian so I'm adding a switch > > > > > > > (just like I did for > > > > > > > SHA) to make it available only on little endian systems. > > > > > > > > > > > > > > > 3.Regarding Andrew's point: Superseded by Montgomery? > > > > > > > > The Montgomery change got backported to jdk8u (JDK-8150152 > > > > > > > > in > > > > > > 8u102). > > > > > > > > I'd expect the performance improvement of these intrinsics > > > > > > > > to be irrelevant for crypto.rsa. Did you measure with an > > > > > > > > older jdk8 > > > > > > release? > > > > > > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr > > > > > > > 6 > > > > > > > 14:15:31 2017). 
The reported performance speedup was > > > > > > > calculated by running the following test > > (TestSquareToLen.java): > > > > > > > import java.math.BigInteger; > > > > > > > > > > > > > > public class TestSquareToLen { > > > > > > > > > > > > > > public static void main(String args[]) throws Exception > > > > > > > { > > > > > > > > > > > > > > int n = 10000000; > > > > > > > if (args.length >=1) { > > > > > > > n = Integer.parseInt(args[0]); > > > > > > > } > > > > > > > > > > > > > > BigInteger b1 = new > > > > > > > > > > > > > > > BigInteger("34893980923557359086350514982082503920002298311877320859 > > > > > 99 > > > > > > > 36 > > > > > > > > > > > > > > > 7395594183801021468843071391756049207873137016631559837931214754926 > > > > > 092 > > > > > > > 22 > > > > > > > > > > > > > > > 3780292110207609223272184808289336630057735969423726808520641030118 > > > > > 116 > > > > > > > 51 > > > > > > > > > > > > > > > 6440180488338234823908199478965242076358579845520899779963131131540 > > > > > 166 > > > > > > > 68 718795349783157384006672542605760392289645528307"); > > > > > > > BigInteger b2 = BigInteger.valueOf(0); > > > > > > > BigInteger check = BigInteger.valueOf(1); > > > > > > > for (int i = 0; i < n; i++) { > > > > > > > b2 = b1.multiply(b1); > > > > > > > if (i == 0) > > > > > > > // Didn't JIT yet. Comparing against interpreted > > mode > > > > > > > check = b2; > > > > > > > } > > > > > > > if (b2.compareTo(check) == 0) > > > > > > > System.out.println("Check ok!"); > > > > > > > else > > > > > > > System.out.println("Check failed!"); > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > > > > > > > > I got these results on JDK8 on my POWER8 machine: > > > > > > > $ ./javac TestSquareToLen.java $ sudo perf stat -r 5 ./java > > > > > > > -XX:-UseMulAddIntrinsic -XX:- UseSquareToLenIntrinsic > > > > > > > TestSquareToLen Check ok! > > > > > > > Check ok! > > > > > > > Check ok! > > > > > > > Check ok! > > > > > > > Check ok! 
> > > > > > > > > > > > > > Performance counter stats for './java > > > > > > > -XX:-UseMulAddIntrinsic > > > > > > > -XX:- UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > > > > > > > > > 15148.009557 task-clock (msec) # 1.053 > > CPUs > > > > > > > utilized ( +- 0.48% ) > > > > > > > 2,425 context-switches # 0.160 > > K/sec > > > > > > > ( +- 5.84% ) > > > > > > > 356 cpu-migrations # 0.023 > > K/sec > > > > > > > ( +- 3.01% ) > > > > > > > 5,153 page-faults # 0.340 > > K/sec > > > > > > > ( +- 5.22% ) > > > > > > > 54,536,889,909 cycles # 3.600 > > GHz > > > > > > > ( +- 0.56% ) (66.68%) > > > > > > > 239,554,105 stalled-cycles-frontend # 0.44% > > > > frontend > > > > > > > cycles idle ( +- 4.87% ) (49.90%) > > > > > > > 27,683,316,001 stalled-cycles-backend # 50.76% > > > > backend > > > > > > > cycles idle ( +- 0.56% ) (50.17%) > > > > > > > 102,020,229,733 instructions # 1.87 > > insn > > > > per > > > > > > > cycle > > > > > > > # 0.27 > > > > stalled > > > > > > > cycles per insn ( +- 0.14% ) (66.94%) > > > > > > > 7,706,072,218 branches # 508.718 > > M/sec > > > > > > > ( +- 0.23% ) (50.20%) > > > > > > > 456,051,162 branch-misses # 5.92% > > of > > > > all > > > > > > > branches ( +- 0.09% ) (50.07%) > > > > > > > > > > > > > > 14.390840733 seconds time elapsed ( +- 0.09% ) > > > > > > > > > > > > > > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic - > > > > > > > XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok! > > > > > > > Check ok! > > > > > > > Check ok! > > > > > > > Check ok! > > > > > > > Check ok! 
> > > > > > > > > > > > > > Performance counter stats for './java > > > > > > > -XX:+UseMulAddIntrinsic > > > > > > > - XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > > > > > > > > > 11368.141410 task-clock (msec) # 1.045 > > CPUs > > > > > > > utilized ( +- 0.64% ) > > > > > > > 1,964 context-switches # 0.173 > > K/sec > > > > > > > ( +- 8.93% ) > > > > > > > 338 cpu-migrations # 0.030 > > K/sec > > > > > > > ( +- 7.65% ) > > > > > > > 5,627 page-faults # 0.495 > > K/sec > > > > > > > ( +- 6.15% ) > > > > > > > 41,100,168,967 cycles # 3.615 > > GHz > > > > > > > ( +- 0.50% ) (66.36%) > > > > > > > 309,052,316 stalled-cycles-frontend # 0.75% > > > > frontend > > > > > > > cycles idle ( +- 2.84% ) (49.89%) > > > > > > > 14,188,581,685 stalled-cycles-backend # 34.52% > > > > backend > > > > > > > cycles idle ( +- 0.99% ) (50.34%) > > > > > > > 77,846,029,829 instructions # 1.89 > > insn > > > > per > > > > > > > cycle > > > > > > > # 0.18 > > > > stalled > > > > > > > cycles per insn ( +- 0.29% ) (66.96%) > > > > > > > 8,435,216,989 branches # 742.005 > > M/sec > > > > > > > ( +- 0.28% ) (50.17%) > > > > > > > 339,903,936 branch-misses # 4.03% > > of > > > > all > > > > > > > branches ( +- 0.27% ) (49.90%) > > > > > > > > > > > > > > 10.882357546 seconds time elapsed ( +- 0.24% ) > > > > > > > > > > > > > > > > > > > > > (out of curiosity, these numbers are 15.19s (+- 0.32%) and > > > > > > > 13.42s > > > > > > > (+- > > > > > > > 0.53%) on JDK10) > > > > > > > > > > > > > > I may run for SpecJVM2008's crypto.rsa if you are > interested. > > > > > > > > > > > > > > Thank you once again for reviewing this. > > > > > > > > > > > > > > Best regards, > > > > > > > Gustavo > > > > > > > > > > > > > > > (I think the change is still acceptable as the intrinsics > > > > > > > > could be used elsewhere and the implementation also exists > > > > > > > > on other > > > > > > > > platforms.) 
> > > > > > > > > > > > > > > > Best regards, > > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Gustavo Serra Scalet > > > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > > > Sent: Mittwoch, 16. August 2017 18:50 > > > > > > > > To: Doerr, Martin ; > > > > > > > > 'hotspot-compiler- dev at openjdk.java.net' > > > > > > > > > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd > > > > > > > > and SquareToLen intrinsics > > > > > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > > > > > Thanks for dedicated review. It took me a while to be able > > > > > > > > to work on this but I hope to have your points solved. > > > > > > > > Please check below the review as well as my comments > > > > > > > > quoting > > your email: > > > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01 > > > > > > > > / > > > > > > > > > > > > > > > > > -----Original Message----- First of all, C2 does not > > > > > > > > > perform sign extend when calling > > > > stubs. > > > > > > > > > The int parms need to get zero/sign extended. (Could > > > > > > > > > even be done without extra instructions by replacing > > > > > > > > > sldi -> rldicl, cmpdi -> extsw_ in some > > > > > > > > > cases.) > > > > > > > > > > > > > > > > Does it make a difference on my case? > > > > > > > > > > > > > > > > I guess you are talking about mulAdd preparation code. The > > > > > > > > only aspect I found about him is to force the cast from 32 > > > > > > > > bits -> 64 bits by cleaning higher bits. Offset is a > > > > > > > > signed integer but it can't be > > > > > > > negative anyway. 
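The concern about the upper register bits can be modelled in Java by treating a long as the 64-bit register that carries a 32-bit int argument. The sketch below is only an illustration under that assumption: the names are hypothetical, the garbage value is arbitrary, and it does not model any specific PPC64 instruction. It shows that scaling the raw register gives a wrong byte offset, while clearing the upper 32 bits first gives the expected one.

```java
final class IntArgExtension {
    // Model: a 64-bit register whose low 32 bits hold an int argument and
    // whose high 32 bits hold leftover garbage from an earlier computation.
    static long registerWithGarbage(int arg, int garbage) {
        return ((long) garbage << 32) | (arg & 0xFFFFFFFFL);
    }

    // Clear the high 32 bits (zero extension), then scale by 4 to turn an
    // int[] index into a byte offset.
    static long zeroExtendAndScale(long reg) {
        return (reg & 0xFFFFFFFFL) << 2;
    }

    public static void main(String[] args) {
        int offset = 5;
        long reg = registerWithGarbage(offset, 0xDEADBEEF);

        // With the upper bits cleared, the scaled value is the expected byte offset.
        System.out.println(zeroExtendAndScale(reg) == (long) offset * 4);

        // Shifting the raw register keeps most of the garbage, so the result is wrong.
        System.out.println((reg << 2) == (long) offset * 4);

        // For a non-negative int, sign extension and zero extension agree.
        System.out.println((long) offset == (offset & 0xFFFFFFFFL));
    }
}
```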
> > > > > > > > > > > > > > > > So I changed from: > > > > > > > > sldi (R5_ARG3, R5_ARG3, 2); > > > > > > > > > > > > > > > > to: > > > > > > > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > > > > > > > > > > > > > > > > > > > macroAssembler_ppc.cpp: > > > > > > > > > - Indentation should be 2 spaces. > > > > > > > > > > > > > > > > Done > > > > > > > > > > > > > > > > > > > > > > > > > stubGenerator_ppc:cpp: > > > > > > > > > - or_, addi_ should get replaced by orr, addi when CR0 > > > > > > > > > result is not needed. > > > > > > > > > > > > > > > > Done > > > > > > > > > > > > > > > > > - Where is lplw initialized? > > > > > > > > > > > > > > > > It should be initialized with 0, I missed that... > > > > > > > > > > > > > > > > > - I believe that the updating load/store instructions > e.g. > > > > > > > > > lwzu don't perform well on some processors. At least > > > > > > > > > using stwu 2 times in the loop doesn't make sense. > > > > > > > > > > > > > > > > You are right. I could manipulate the bits differently and > > > > > > > > ended up with a single stdu in the loop. Neat! Although I > > > > > > > > could not reduce the total number of instructions. > > > > > > > > > > > > > > > > > - Note: It should be possible to use 8 byte instead of 4 > > > > > > > > > byte > > > > > > > > > instructions: MacroAssembler::multiply64, addc, adde. > > > > > > > > > But I'm not requesting to change that because I guess it > > > > > > > > > would make the code very complicated, especially when > > > > > > > > > supporting both endianess > > > > > > > versions. > > > > > > > > > > > > > > > > Yes, that would require a new analysis on this code. May > > > > > > > > we consider it next? As you said, I prefer having an > > > > > > > > initial version that looks as simple as the original java > code. > > > > > > > > > > > > > > > > > - The squareToLen stub implementation is very close the > > > > > > > > > Java implementation. 
So it'd be interesting to > > > > > > > > > understand what C2 doesn't do as well as the hand > > > > > > > > > written assembly code. Do you know that? (Not absolutely > > > > > > > > > necessary for accepting this change as long as the stub > > > > > > > > > is measurably > > > > > > > > > faster.) > > > > > > > > > > > > > > > > I don't know either. Basically I chose doing it because I > > > > > > > > noticed some performance gain on SpecJVM2008 when > > > > > > > > analyzing > > > X64. > > > > > > > > Then, taking a closer look, I didn't notice any AVX or > > > > > > > > some special instructions on > > > > > > > > X64 so I decided to try it on ppc64 by using some basic > > > > assembly. > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra > > > > > > > > > Scalet > > > > > > > > > Sent: Donnerstag, 10. 
August 2017 19:22 > > > > > > > > > To: 'hotspot-compiler-dev at openjdk.java.net' > > > > > > > > > > > > > > > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement > > > > > > > > > MulAdd > > > and > > > > > > > > > SquareToLen intrinsics > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev- > > > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra > > > > > > > > > Scalet > > > > > > > > > Sent: terça-feira, 8 de agosto de 2017 17:19 > > > > > > > > > To: ppc-aix-port-dev at openjdk.java.net > > > > > > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd > > > > > > > > > and SquareToLen intrinsics > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > Could you please review this specific PPC64 change to > > hotspot? > > > > > > > > > By implementing these intrinsics I noticed a small > > > > > > > > > improvement with microbenchmarks analysis. On > > > > > > > > > SpecJVM2008's crypto.rsa benchmark, only when > > > > > > > > > backporting to JDK8 an improvement was > > > > noticed.
> > > > > > > > > > > > > > > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976 > > > > > > > > > Webrev: https://gut.github.io/openjdk/webrev/JDK- > > > > > 8185976/webrev/ > > > > > > > > > > > > > > > > > > Motivation for this implementation: > > > > > > > > > https://twitter.com/ijuma/status/698309312498835457 > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > Gustavo Serra Scalet From goetz.lindenmaier at sap.com Fri Sep 29 22:48:17 2017 From: goetz.lindenmaier at sap.com (Lindenmaier, Goetz) Date: Fri, 29 Sep 2017 22:48:17 +0000 Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics In-Reply-To: <2432cbfebfa342dfb560ecf4d6023581@serv030.corp.eldorado.org.br> References: <1f159ee480284095b8e5c3f444dceb96@serv031.corp.eldorado.org.br> <16e8b68451e94eb79cdd7d9cb5d7984c@sap.com> <2425566a8ff74051af485c919a0bf5ee@serv030.corp.eldorado.org.br> <4ec93a6bcbe14cf99c2fa02d50a18965@sap.com> <0ef23b5fcbc54996aea876d4c60e4097@sap.com> <10a918efbd344b1fbf95c56b7beedbc0@serv031.corp.eldorado.org.br> <2badcffdc4fb44c9ba77b5a1c6cc26fb@sap.com> <6362e4c1e3ab4871b12232580f2971aa@serv031.corp.eldorado.org.br> <1a54e77a4b5e45e2b848da3fccf423dd@serv030.corp.eldorado.org.br> <170eb84ecbdb4c9b9fe8d4481e4c319f@sap.com> <0aaf319e25934903a468542d02f6a734@serv030.corp.eldorado.org.br> <2432cbfebfa342dfb560ecf4d6023581@serv030.corp.eldorado.org.br> Message-ID: Hi, I pushed it a few days ago: http://hg.openjdk.java.net/jdk10/hs/rev/122833427b36 Cheers, Goetz. 
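For readers following the stwu/stdu and endianness discussion quoted below, here is a minimal Java sketch of why the two 32-bit stores from the Java loop can be merged into one 64-bit store. It is only an illustration, not the stub code from the webrev: the class and method names (SquareStoreSketch, twoIntStores, joined, storeAsLong) are made up for this example, and a ByteBuffer stands in for the output array in memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class SquareStoreSketch {
    // The two int stores quoted in the thread from the Java squareToLen loop.
    static int[] twoIntStores(int lastProductLowWord, long product) {
        return new int[] {
            (lastProductLowWord << 31) | (int) (product >>> 33),
            (int) (product >>> 1)
        };
    }

    // The same 64 bits built in one register: only bit 0 of
    // lastProductLowWord survives the shift by 63.
    static long joined(int lastProductLowWord, long product) {
        return ((long) lastProductLowWord << 63) | (product >>> 1);
    }

    // Emulates one 8-byte store followed by two 4-byte loads (z[i], z[i+1]).
    // On little endian the two halves have to be swapped first, which is
    // what a rotate by 32 does; on big endian no swap is needed.
    static int[] storeAsLong(long joined, ByteOrder order) {
        long v = (order == ByteOrder.LITTLE_ENDIAN)
                ? Long.rotateLeft(joined, 32)
                : joined;
        ByteBuffer buf = ByteBuffer.allocate(8).order(order);
        buf.putLong(0, v);
        return new int[] { buf.getInt(0), buf.getInt(4) };
    }

    public static void main(String[] args) {
        long product = 0xFFFFFFFFL * 0xFFFFFFFFL;
        int[] expected = twoIntStores(1, product);
        long j = joined(1, product);
        System.out.println(java.util.Arrays.equals(expected, storeAsLong(j, ByteOrder.BIG_ENDIAN)));
        System.out.println(java.util.Arrays.equals(expected, storeAsLong(j, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

Under these assumptions the rotate is needed only in the little-endian case, which matches the shape of the VM_LITTLE_ENDIAN guard in the patch quoted below.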
> -----Original Message----- > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > Sent: Friday, September 29, 2017 11:26 PM > To: Doerr, Martin ; Lindenmaier, Goetz > > Cc: 'hotspot-compiler-dev at openjdk.java.net' dev at openjdk.java.net> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > SquareToLen intrinsics > > Hi Martin and Goetz, > > A new webrev updated to the new repo structure was requested and can be > viewed below: > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.05/ > > PS: changes applied cleanly from old hotspot to new one. > > Can it be sponsored now? > > Thanks. > > > -----Original Message----- > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > Sent: quarta-feira, 6 de setembro de 2017 09:45 > > To: Lindenmaier, Goetz ; Doerr, Martin > > ; 'hotspot-compiler-dev at openjdk.java.net' > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > SquareToLen intrinsics > > > > Alright, thanks for the instructions. I'll keep that in mind. > > > > > -----Original Message----- > > > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com] > > > Sent: quarta-feira, 6 de setembro de 2017 09:44 > > > To: Gustavo Serra Scalet ; Doerr, > > > Martin ; 'hotspot-compiler- > dev at openjdk.java.net' > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > SquareToLen intrinsics > > > > > > Hi Gustavo, > > > > > > the repos are all closed. Once they are opened again, you will have to > > > merge your change into the new repo structure, post a new webrev and > > > only then it can be sponsored. Me or Martin will sponsor it then. > > > > > > Best regards, > > > Goetz. > > > > > > > -----Original Message----- > > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] > > > > Sent: Mittwoch, 6. 
September 2017 14:32 > > > > To: Lindenmaier, Goetz ; Doerr, Martin > > > > ; 'hotspot-compiler-dev at openjdk.java.net' > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > SquareToLen intrinsics > > > > > > > > Thanks Goetz. > > > > > > > > Could somebody sponsor this change? > > > > > > > > Thanks > > > > > > > > > -----Original Message----- > > > > > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com] > > > > > Sent: quarta-feira, 6 de setembro de 2017 03:30 > > > > > To: Gustavo Serra Scalet ; Doerr, > > > > > Martin ; 'hotspot-compiler- > > > dev at openjdk.java.net' > > > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > SquareToLen intrinsics > > > > > > > > > > Hi, > > > > > > > > > > I had a look at this change and tested it. Reviewed. > > > > > > > > > > Best regards, > > > > > Goetz. > > > > > > > > > > > -----Original Message----- > > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet > > > > > > Sent: Freitag, 1. September 2017 19:12 > > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > > dev at openjdk.java.net' > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > > SquareToLen intrinsics > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Doerr, Martin > > > > > > > your first webrev already works on Big Endian.
So the only > > > > > > > required change is to fix your new code by this trivial patch: > > > > > > > --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 > > > 17:47:45 > > > > > 2017 > > > > > > > +0200 > > > > > > > +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 > > > 17:55:08 > > > > > 2017 > > > > > > > +0200 > > > > > > > @@ -3426,7 +3426,9 @@ > > > > > > > __ srdi (product, product, 1); > > > > > > > // join them to the same register and store it as Little > > > Endian > > > > > > > __ orr (product, lplw_s, product); > > > > > > > +#ifdef VM_LITTLE_ENDIAN > > > > > > > __ rldicl (product, product, 32, 0); > > > > > > > +#endif > > > > > > > __ stdu (product, 8, out_aux); > > > > > > > __ bdnz (LOOP_SQUARE); > > > > > > > > > > > > > > So please enable it again for Big Endian in vm_version_ppc. > > > > > > > Besides that, it looks good to me. We also need a 2nd review. > > > > > > > > > > > > Great! Thanks for checking it and suggesting the diff. > > > > > > > > > > > > I changed these things. You can find it below: > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/ > > > > > > > > > > > > I wonder who could be a 2nd reviewer... Anybody in mind that we > > > > > > may > > > > > ping? > > > > > > Maybe Goetz Lindenmaier? > > > > > > > > > > > > Best Regards, > > > > > > Gustavo Serra Scalet > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Gustavo Serra Scalet > > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > > Sent: Mittwoch, 30. 
August 2017 19:03 > > > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > > > dev at openjdk.java.net' > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and > > > > > > > SquareToLen intrinsics > > > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > > > (webrev at the end) > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Doerr, Martin > > > > > > > > > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" > > > > > > > > > doesn't seem to need further changes as it's being cleared > > > > > > > > > with clrldi, which is the same as rldic with no shift. > > > > > > > > > Therefore it's treated appropriately as requested for > > > "offset" parameter. Do you agree? > > > > > > > > > > > > > > > > No, I didn't find clrldi for len in generate_mulAdd(). Only > > > for k. > > > > > > > > > > > > > > I'm sorry. I was thinking about "offset" and "k", which are > > > > > > > both cleaned on generate_mulAdd(). "len" was not cleaned and > > > > > > > it was being used on > > > > > > > muladd() directly with cmpdi, which could lead to problems. > > > > > > > > > > > > > > That is being changed. > > > > > > > > > > > > > > > Where are in_len and out_len fixed up in > > > generate_squareToLen()? > > > > > > > > > > > > > > They are not. According to your suggestions, I agree it also > > > > > > > needs to be done for the same reason. > > > > > > > > > > > > > > > > You are right. The way I'm building the 64 bits of the > > > > > > > > > register depends on which kind of endianness it is run. > > > > > > > > > For now it works only on little endian so I'm adding a > > > > > > > > > switch (just like I did for SHA) to make it available only > > > > > > > > > on > > > little endian systems. 
> > > > > > > > > > > > > > > > It shouldn't be that hard to get it working on big endian > > > > > > > > ;-) Btw., my point was not to replace the 2 4-byte store > > > > > > > > instructions by an 8-byte one (though I'm also ok with > > that). > > > > > > > > It was that 2 stwu which update the same pointer doesn't > > > > > > > > make sense from > > > > > performance point of view. > > > > > > > > Please keep something which works on big endian, too. > > > > > > > > > > > > > > I see. The 2x stwu was being used like that because it was the > > > > > > > trivial approach when considering the original java update: > > > > > > > z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); > > > > > > > z[i++] = (int)(product >>> 1); > > > > > > > > > > > > > > As you pointed out, that might cause some stall on the > > > > > > > pipeline so I made it with 1s stdu (and could improve code by > > > > > > > reducing 1 > > > > > > > instruction) > > > > > > > > > > > > > > Now about having a big endian version: I'm not confident in > > > > > > > doing so as I don't have access to such a machine at the > > moment. > > > > > > > You were kind on offering test support but I don't know if > > > > > > > it'd work like that. I may support you in checking out which > > > > > > > places are endianness-related but I'm not comfortable in > > > > > > > sending you untested > > > > > code. > > > > > > > > > > > > > > Would you be interested in doing such a changes for making it > > > > > > > work on Big Endian? For this patch, I provided an interesting > > > > > > > test that might help you to verify if it worked. > > > > > > > > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu > > > > > > > > > Apr > > > > > > > > > 6 > > > > > > > > > 14:15:31 2017). The reported performance speedup was > > > > > > > > > calculated by running the following test > > > (TestSquareToLen.java): > > > > > > > > > > > > > > > > Seems like JDK-8145913 has not been backported, yet. 
Sorry > > > > > > > > for not checking this earlier. So if you want to make RSA > > > > > > > > really fast, it should be so much better to backport that > > > > > > > > one. But I can still sponsor this change as it may be used > > elsewhere. > > > > > > > > > > > > > > No problem. It's nice to know that I may not need to request a > > > > > > > backport of this patch for performance reasons. > > > > > > > > > > > > > > And at last, but not least, the new webrev with these clrldi > > > > > changes: > > > > > > > https://gut.github.io/openjdk/webrev/JDK- > > > > > > 8185976/webrev.03/index.html > > > > > > > > > > > > > > Thank you once again, > > > > > > > Gustavo Serra Scalet > > > > > > > > > > > > > > > Best regards, > > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Gustavo Serra Scalet > > > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > > > Sent: Dienstag, 29. August 2017 22:37 > > > > > > > > To: Doerr, Martin ; 'hotspot-compiler- > > > > > > > > dev at openjdk.java.net' > > > > > > > > > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd > > > > > > > > and SquareToLen intrinsics > > > > > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > > > > > New changes: > > > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/ > > > > > > > > > > > > > > > > Check comments below, please. > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Doerr, Martin > > > > > > > > > > > > > > > > > > 1. Sign extending offset and len Right, sign and zero > > > > > > > > > extending is equivalent for offset and len because they > > > > > > > > > are guaranteed to be >=0 (by checks in Java). But you can > > > > > > > > > only rely on bit 32 (IBM > > > > > > > > > notation) to be 0. Bit 0-31 may contain > > > > > > > > garbage. > > > > > > > > > rldicl was incorrect. My mistake, sorry for that. 
Correct > > > > > > > > > would be rldic which also clears the least significant > > bits. > > > > > > > > > len should also get fixed e.g. by replacing cmpdi by > > > > > > > > > extsw_ in > > > > > > > muladd. > > > > > > > > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" > > > > > > > > doesn't seem to need further changes as it's being cleared > > > > > > > > with clrldi, which is the same as rldic with no shift. > > > > > > > > Therefore it's treated appropriately as requested for > > "offset" > > > parameter. Do you agree? > > > > > > > > > > > > > > > > > 2. Using 8 byte instructions for int The code which feeds > > > > > > > > > stdu is endianess specific. Doesn't work on all > > > > > > > > > PPC64 platforms. > > > > > > > > > > > > > > > > You are right. The way I'm building the 64 bits of the > > > > > > > > register depends on which kind of endianness it is run. For > > > > > > > > now it works only on little endian so I'm adding a switch > > > > > > > > (just like I did for > > > > > > > > SHA) to make it available only on little endian systems. > > > > > > > > > > > > > > > > > 3.Regarding Andrew's point: Superseded by Montgomery? > > > > > > > > > The Montgomery change got backported to jdk8u (JDK- > 8150152 > > > > > > > > > in > > > > > > > 8u102). > > > > > > > > > I'd expect the performance improvement of these intrinsics > > > > > > > > > to be irrelevant for crypto.rsa. Did you measure with an > > > > > > > > > older jdk8 > > > > > > > release? > > > > > > > > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr > > > > > > > > 6 > > > > > > > > 14:15:31 2017). 
The reported performance speedup was > > > > > > > > calculated by running the following test > > > (TestSquareToLen.java): > > > > > > > > import java.math.BigInteger; > > > > > > > > > > > > > > > > public class TestSquareToLen { > > > > > > > > > > > > > > > > public static void main(String args[]) throws Exception > > > > > > > > { > > > > > > > > > > > > > > > > int n = 10000000; > > > > > > > > if (args.length >=1) { > > > > > > > > n = Integer.parseInt(args[0]); > > > > > > > > } > > > > > > > > > > > > > > > > BigInteger b1 = new > > > > > > > > > > > > > > > > > > > BigInteger("3489398092355735908635051498208250392000229831187732 > 0859 > > > > > > 99 > > > > > > > > 36 > > > > > > > > > > > > > > > > > > > 73955941838010214688430713917560492078731370166315598379312147 > 54926 > > > > > > 092 > > > > > > > > 22 > > > > > > > > > > > > > > > > > > > 37802921102076092232721848082893366300577359694237268085206410 > 30118 > > > > > > 116 > > > > > > > > 51 > > > > > > > > > > > > > > > > > > > 64401804883382348239081994789652420763585798455208997799631311 > 31540 > > > > > > 166 > > > > > > > > 68 718795349783157384006672542605760392289645528307"); > > > > > > > > BigInteger b2 = BigInteger.valueOf(0); > > > > > > > > BigInteger check = BigInteger.valueOf(1); > > > > > > > > for (int i = 0; i < n; i++) { > > > > > > > > b2 = b1.multiply(b1); > > > > > > > > if (i == 0) > > > > > > > > // Didn't JIT yet. 
Comparing against interpreted > > > mode > > > > > > > > check = b2; > > > > > > > > } > > > > > > > > if (b2.compareTo(check) == 0) > > > > > > > > System.out.println("Check ok!"); > > > > > > > > else > > > > > > > > System.out.println("Check failed!"); > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > I got these results on JDK8 on my POWER8 machine: > > > > > > > > $ ./javac TestSquareToLen.java $ sudo perf stat -r 5 ./java > > > > > > > > -XX:-UseMulAddIntrinsic -XX:- UseSquareToLenIntrinsic > > > > > > > > TestSquareToLen Check ok! > > > > > > > > Check ok! > > > > > > > > Check ok! > > > > > > > > Check ok! > > > > > > > > Check ok! > > > > > > > > > > > > > > > > Performance counter stats for './java > > > > > > > > -XX:-UseMulAddIntrinsic > > > > > > > > -XX:- UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > > > > > > > > > > > 15148.009557 task-clock (msec) # 1.053 > > > CPUs > > > > > > > > utilized ( +- 0.48% ) > > > > > > > > 2,425 context-switches # 0.160 > > > K/sec > > > > > > > > ( +- 5.84% ) > > > > > > > > 356 cpu-migrations # 0.023 > > > K/sec > > > > > > > > ( +- 3.01% ) > > > > > > > > 5,153 page-faults # 0.340 > > > K/sec > > > > > > > > ( +- 5.22% ) > > > > > > > > 54,536,889,909 cycles # 3.600 > > > GHz > > > > > > > > ( +- 0.56% ) (66.68%) > > > > > > > > 239,554,105 stalled-cycles-frontend # 0.44% > > > > > frontend > > > > > > > > cycles idle ( +- 4.87% ) (49.90%) > > > > > > > > 27,683,316,001 stalled-cycles-backend # 50.76% > > > > > backend > > > > > > > > cycles idle ( +- 0.56% ) (50.17%) > > > > > > > > 102,020,229,733 instructions # 1.87 > > > insn > > > > > per > > > > > > > > cycle > > > > > > > > # 0.27 > > > > > stalled > > > > > > > > cycles per insn ( +- 0.14% ) (66.94%) > > > > > > > > 7,706,072,218 branches # 508.718 > > > M/sec > > > > > > > > ( +- 0.23% ) (50.20%) > > > > > > > > 456,051,162 branch-misses # 5.92% > > > of > > > > > all > > > > > > > > branches ( +- 0.09% ) 
(50.07%) > > > > > > > > > > > > > > > > 14.390840733 seconds time elapsed ( +- 0.09% ) > > > > > > > > > > > > > > > > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic - > > > > > > > > XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok! > > > > > > > > Check ok! > > > > > > > > Check ok! > > > > > > > > Check ok! > > > > > > > > Check ok! > > > > > > > > > > > > > > > > Performance counter stats for './java > > > > > > > > -XX:+UseMulAddIntrinsic > > > > > > > > - XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs): > > > > > > > > > > > > > > > > 11368.141410 task-clock (msec) # 1.045 > > > CPUs > > > > > > > > utilized ( +- 0.64% ) > > > > > > > > 1,964 context-switches # 0.173 > > > K/sec > > > > > > > > ( +- 8.93% ) > > > > > > > > 338 cpu-migrations # 0.030 > > > K/sec > > > > > > > > ( +- 7.65% ) > > > > > > > > 5,627 page-faults # 0.495 > > > K/sec > > > > > > > > ( +- 6.15% ) > > > > > > > > 41,100,168,967 cycles # 3.615 > > > GHz > > > > > > > > ( +- 0.50% ) (66.36%) > > > > > > > > 309,052,316 stalled-cycles-frontend # 0.75% > > > > > frontend > > > > > > > > cycles idle ( +- 2.84% ) (49.89%) > > > > > > > > 14,188,581,685 stalled-cycles-backend # 34.52% > > > > > backend > > > > > > > > cycles idle ( +- 0.99% ) (50.34%) > > > > > > > > 77,846,029,829 instructions # 1.89 > > > insn > > > > > per > > > > > > > > cycle > > > > > > > > # 0.18 > > > > > stalled > > > > > > > > cycles per insn ( +- 0.29% ) (66.96%) > > > > > > > > 8,435,216,989 branches # 742.005 > > > M/sec > > > > > > > > ( +- 0.28% ) (50.17%) > > > > > > > > 339,903,936 branch-misses # 4.03% > > > of > > > > > all > > > > > > > > branches ( +- 0.27% ) (49.90%) > > > > > > > > > > > > > > > > 10.882357546 seconds time elapsed ( +- 0.24% ) > > > > > > > > > > > > > > > > > > > > > > > > (out of curiosity, these numbers are 15.19s (+- 0.32%) and > > > > > > > > 13.42s > > > > > > > > (+- > > > > > > > > 0.53%) on JDK10) > > > > > > > > > > > > > > > > I may run for SpecJVM2008's 
crypto.rsa if you are > > interested. > > > > > > > > > > > > > > > > Thank you once again for reviewing this. > > > > > > > > > > > > > > > > Best regards, > > > > > > > > Gustavo > > > > > > > > > > > > > > > > > (I think the change is still acceptable as the intrinsics > > > > > > > > > could be used elsewhere and the implementation also exists > > > > > > > > > on other > > > > > > > > > platforms.) > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Gustavo Serra Scalet > > > > > > > > > [mailto:gustavo.scalet at eldorado.org.br] > > > > > > > > > Sent: Mittwoch, 16. August 2017 18:50 > > > > > > > > > To: Doerr, Martin ; > > > > > > > > > 'hotspot-compiler- dev at openjdk.java.net' > > > > > > > > > > > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd > > > > > > > > > and SquareToLen intrinsics > > > > > > > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > > > > > > > Thanks for dedicated review. It took me a while to be able > > > > > > > > > to work on this but I hope to have your points solved. > > > > > > > > > Please check below the review as well as my comments > > > > > > > > > quoting > > > your email: > > > > > > > > > https://gut.github.io/openjdk/webrev/JDK- > 8185976/webrev.01 > > > > > > > > > / > > > > > > > > > > > > > > > > > > > -----Original Message----- First of all, C2 does not > > > > > > > > > > perform sign extend when calling > > > > > stubs. > > > > > > > > > > The int parms need to get zero/sign extended. (Could > > > > > > > > > > even be done without extra instructions by replacing > > > > > > > > > > sldi -> rldicl, cmpdi -> extsw_ in some > > > > > > > > > > cases.) > > > > > > > > > > > > > > > > > > Does it make a difference on my case? > > > > > > > > > > > > > > > > > > I guess you are talking about mulAdd preparation code. 
The > > > > > > > > > only aspect I found about him is to force the cast from 32 > > > > > > > > > bits -> 64 bits by cleaning higher bits. Offset is a > > > > > > > > > signed integer but it can't be > > > > > > > > negative anyway. > > > > > > > > > > > > > > > > > > So I changed from: > > > > > > > > > sldi (R5_ARG3, R5_ARG3, 2); > > > > > > > > > > > > > > > > > > to: > > > > > > > > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive > > > > > > > > > > > > > > > > > > > > > > > > > > > > macroAssembler_ppc.cpp: > > > > > > > > > > - Indentation should be 2 spaces. > > > > > > > > > > > > > > > > > > Done > > > > > > > > > > > > > > > > > > > > > > > > > > > > stubGenerator_ppc:cpp: > > > > > > > > > > - or_, addi_ should get replaced by orr, addi when CR0 > > > > > > > > > > result is not needed. > > > > > > > > > > > > > > > > > > Done > > > > > > > > > > > > > > > > > > > - Where is lplw initialized? > > > > > > > > > > > > > > > > > > It should be initialized with 0, I missed that... > > > > > > > > > > > > > > > > > > > - I believe that the updating load/store instructions > > e.g. > > > > > > > > > > lwzu don't perform well on some processors. At least > > > > > > > > > > using stwu 2 times in the loop doesn't make sense. > > > > > > > > > > > > > > > > > > You are right. I could manipulate the bits differently and > > > > > > > > > ended up with a single stdu in the loop. Neat! Although I > > > > > > > > > could not reduce the total number of instructions. > > > > > > > > > > > > > > > > > > > - Note: It should be possible to use 8 byte instead of 4 > > > > > > > > > > byte > > > > > > > > > > instructions: MacroAssembler::multiply64, addc, adde. > > > > > > > > > > But I'm not requesting to change that because I guess it > > > > > > > > > > would make the code very complicated, especially when > > > > > > > > > > supporting both endianess > > > > > > > > versions. 
> > > > > > > > > > > > > > > > > > Yes, that would require a new analysis on this code. May > > > > > > > > > we consider it next? As you said, I prefer having an > > > > > > > > > initial version that looks as simple as the original java > > code. > > > > > > > > > > > > > > > > > > > - The squareToLen stub implementation is very close the > > > > > > > > > > Java implementation. So it'd be interesting to > > > > > > > > > > understand what C2 doesn't do as well as the hand > > > > > > > > > > written assembly code. Do you know that? (Not absolutely > > > > > > > > > > necessary for accepting this change as long as the stub > > > > > > > > > > is measurably > > > > > > > > > > faster.) > > > > > > > > > > > > > > > > > > I don't know either. Basically I chose doing it because I > > > > > > > > > noticed some performance gain on SpecJVM2008 when > > > > > > > > > analyzing > > > > X64. > > > > > > > > > Then, taking a closer look, I didn't notice any AVX or > > > > > > > > > some special instructions on > > > > > > > > > X64 so I decided to try it on ppc64 by using some basic > > > > > assembly. > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev- > > > > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra > > > > > > > > > > Scalet > > > > > > > > > > Sent: Donnerstag, 10. 
August 2017 19:22 > > > > > > > > > > To: 'hotspot-compiler-dev at openjdk.java.net' > > > > > > > > > > > > > > > > > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement > > > > > > > > > > MulAdd > > > > and > > > > > > > > > > SquareToLen intrinsics > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev- > > > > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra > > > > > > > > > > Scalet > > > > > > > > > > Sent: terça-feira, 8 de agosto de 2017 17:19 > > > > > > > > > > To: ppc-aix-port-dev at openjdk.java.net > > > > > > > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd > > > > > > > > > > and SquareToLen intrinsics > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > Could you please review this specific PPC64 change to > > > hotspot? > > > > > > > > > > By implementing these intrinsics I noticed a small > > > > > > > > > > improvement with microbenchmarks analysis. On > > > > > > > > > > SpecJVM2008's crypto.rsa benchmark, only when > > > > > > > > > > backporting to JDK8 an improvement was > > > > > noticed.
> > > > > > > > > > > > > > > > > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976 > > > > > > > > > > Webrev: https://gut.github.io/openjdk/webrev/JDK- > > > > > > 8185976/webrev/ > > > > > > > > > > > > > > > > > > > > Motivation for this implementation: > > > > > > > > > > https://twitter.com/ijuma/status/698309312498835457 > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > Gustavo Serra Scalet From zhongwei.yao at linaro.org Sat Sep 30 06:37:32 2017 From: zhongwei.yao at linaro.org (Zhongwei Yao) Date: Sat, 30 Sep 2017 14:37:32 +0800 Subject: RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed In-Reply-To: <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com> References: <21f2540e-9d2f-dd29-8100-92b969b6bc22@oracle.com> <06d44e32-0d33-ae78-1516-6c4497adf983@oracle.com> Message-ID: On 30 September 2017 at 02:10, Vladimir Kozlov wrote: > On 9/29/17 1:25 AM, Zhongwei Yao wrote: >> >> Hi, Vladimir, >> >> Sorry for my late response! >> >> And yes, it solves my case. >> >> But I found specjvm2008 doesn't have a stable result, especially for >> benchmark case like startup.xxx, scimark.xxx.large etc. And I have not >> found obvious performance regress in the rest of benchmark cases. What >> do you think? > > > You know that you can change run parameters for specjvm2008 to avoid waiting > for long to finish. > And you need to run on one node preferable. > > Variations in startup is not important in this case. But scimark is > important since they show quality of loop optimizations. > > Does regression significant? We need more time to investigate it then. I see performance data fluctuates in specjvm2008. However, I check the scimark 2.0 (http://math.nist.gov/scimark2/) and see no performance regression in it both on x86 and arm64. > > Thanks, > Vladimir > > >> >> On 21 September 2017 at 00:18, Vladimir Kozlov >> wrote: >>> >>> Nice. >>> >>> Did you verified that it fixed your case? 
>>> >>> It would be nice to run specjvm2008 to make sure performance did not >>> regress. >>> >>> Thanks, >>> Vladimir >>> >>> >>> On 9/20/17 4:07 AM, Zhongwei Yao wrote: >>>> >>>> >>>> Thanks for your suggestions! >>>> >>>> I've updated the patch that uses pass_slp and do_unroll_only flags >>>> without adding a new flag. Please take a look: >>>> >>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/ >>>> >>>> >>>> >>>> On 20 September 2017 at 01:54, Vladimir Kozlov >>>> wrote: >>>>> >>>>> >>>>> >>>>> >>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote: >>>>>> >>>>>> >>>>>> >>>>>> Hi, Vladimir, >>>>>> >>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Why not use existing set_notpassed_slp() instead of >>>>>>> mark_slp_vec_failed()? >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> For two reasons, I have not chosen the existing passed_slp flag: >>>>> >>>>> >>>>> >>>>> >>>>> My point is that if we don't find vectors in a loop (as in your case) >>>>> we >>>>> should ignore the whole SLP analysis. >>>>> >>>>> In the best-case scenario SuperWord::unrolling_analysis() should determine >>>>> if >>>>> there are vector candidates. For example, check if the array's index >>>>> depends >>>>> on the loop's index variable. >>>>> >>>>> Another way is to call SuperWord::unrolling_analysis() only after we >>>>> did >>>>> vector analysis. >>>>> >>>>> These are more complicated changes and out of scope here. There is also a >>>>> side >>>>> effect I missed before which may prevent using set_notpassed_slp(): >>>>> LoopMaxUnroll is changed based on SLP analysis before the has_passed_slp() >>>>> check.
>>>>> >>>>> Note, set_notpassed_slp() is also used to additional unroll already >>>>> vectorized loops: >>>>> >>>>> >>>>> >>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421 >>>>> >>>>> May be you should also call mark_do_unroll_only() when you set >>>>> set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt >>>>> problem you pointed. Can you look on this? >>>>> >>>>> I am not against adding new is_slp_vec_failed() but I want first to >>>>> investigate if we can re-use existing functions. >>>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>> >>>>>> 1. If we set_notpassed_slp() when _packset.length() == 0 in >>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll() >>>>>> checking: >>>>>> >>>>>> if (cl->has_passed_slp()) { >>>>>> if (slp_max_unroll_factor >= future_unroll_ct) return true; >>>>>> // Normal case: loop too big >>>>>> return false; >>>>>> } >>>>>> >>>>>> we will ignore the case: "cl->has_passed_slp() && >>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()" >>>>>> as alos exposed in my patch: >>>>>> >>>>>> if (cl->has_passed_slp()) { >>>>>> if (slp_max_unroll_factor >= future_unroll_ct) return true; >>>>>> - // Normal case: loop too big >>>>>> - return false; >>>>>> + // When SLP vectorization failed, we could do more unrolling >>>>>> + // optimizations if body size is less than limit size. Otherwise, >>>>>> + // return false due to loop is too big. >>>>>> + if (!cl->is_slp_vec_failed()) return false; >>>>>> } >>>>>> >>>>>> However, I have not found a case to support this condition yet. >>>>>> >>>>>> 2. As replied below, in: >>>>>>> >>>>>>> >>>>>>> >>>>>>> - } else if (cl->is_main_loop()) { >>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) { >>>>>>> sw.transform_loop(lpt, true); >>>>>> >>>>>> >>>>>> >>>>>> I need to check whether cl->is_slp_vec_failed() is true.Such >>>>>> checking becomes explicit when using SLPAutoVecFailed flag. 
>>>>>>
>>>>>>>
>>>>>>> Why do you need the next additional check?:
>>>>>>>
>>>>>>> - } else if (cl->is_main_loop()) {
>>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>>     sw.transform_loop(lpt, true);
>>>>>>>
>>>>>>
>>>>>> The additional check prevents the following case: when
>>>>>> cl->is_slp_vec_failed() is true, SuperWord::output() will
>>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>>>> not what we want.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vladimir
>>>>>>>
>>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>>>
>>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>>
>>>>>>>> Hi, all,
>>>>>>>>
>>>>>>>> Bug:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>>
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>>
>>>>>>>> In the current implementation, the loop unrolling times are
>>>>>>>> determined by vector size and element size when
>>>>>>>> SuperWordLoopUnrollAnalysis is true (both X86 and aarch64 are true
>>>>>>>> for now).
>>>>>>>>
>>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>>>
>>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>>> unrolled until reaching the unroll times limitation.
>>>>>>>>
>>>>>>>> Here is one example:
>>>>>>>>   public static void accessArrayConstants(int[] array) {
>>>>>>>>     for (int j = 0; j < 1024; j++) {
>>>>>>>>       array[0]++;
>>>>>>>>       array[1]++;
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> Before this patch, the loop is unrolled 4 times.
>>>>>>>> 4 is determined by: AArch64's vector size 128 bits / array element
>>>>>>>> size 32 bits = 4. On X86, the vector size is 256 bits, so the loop
>>>>>>>> is unrolled 8 times.
>>>>>>>>
>>>>>>>> Below is the code generated by C2 on AArch64:
>>>>>>>>
>>>>>>>> ==== generated code start ====
>>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>>> ==== generated code end ====
>>>>>>>>
>>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>>
>>>>>>>> ==== generated code start ====
>>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>>> ==== generated code end ====
>>>>>>>>
>>>>>>>> This patch passes jtreg tests both on AArch64 and X86.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
Best regards,
Zhongwei
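
For readers following the arithmetic in the original message, here is a rough source-level sketch of it. It is only an illustration, not HotSpot code: the class UnrollSketch and the helpers slpUnrollFactor and accessArrayConstantsUnrolled are hypothetical names invented for this example, and the hand-unrolled loop only approximates the shape of what C2 emits (the folded "add #0x4" / "add #0x10" in the listings).

```java
public class UnrollSketch {
    // Hypothetical helper (not a HotSpot function): the unroll count implied
    // by dividing the vector width by the array element width, as described
    // in the message (128 / 32 = 4 on AArch64, 256 / 32 = 8 on X86).
    static int slpUnrollFactor(int vectorBits, int elementBits) {
        if (vectorBits <= 0 || elementBits <= 0 || vectorBits % elementBits != 0) {
            throw new IllegalArgumentException("widths must be positive and divide evenly");
        }
        return vectorBits / elementBits;
    }

    // The loop from the example, unchanged.
    static void accessArrayConstants(int[] array) {
        for (int j = 0; j < 1024; j++) {
            array[0]++;
            array[1]++;
        }
    }

    // Hand-written approximation of the same loop unrolled by `factor`,
    // with the repeated increments folded into a single addition.
    static void accessArrayConstantsUnrolled(int[] array, int factor) {
        int j = 0;
        for (; j + factor <= 1024; j += factor) {
            array[0] += factor;
            array[1] += factor;
        }
        // Remainder iterations when 1024 is not a multiple of factor.
        for (; j < 1024; j++) {
            array[0]++;
            array[1]++;
        }
    }

    public static void main(String[] args) {
        System.out.println(slpUnrollFactor(128, 32)); // prints 4
        System.out.println(slpUnrollFactor(256, 32)); // prints 8

        int[] a = new int[2];
        int[] b = new int[2];
        accessArrayConstants(a);
        accessArrayConstantsUnrolled(b, 16);
        System.out.println(java.util.Arrays.equals(a, b)); // prints true
    }
}
```

The point of the patch, as described above, is that when no vector packs are found the width-derived factor (4 or 8) is an unnecessarily small cap, so a larger factor such as 16 reduces loop overhead for the same result.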