From vladimir.kozlov at oracle.com Fri May 1 20:54:19 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 01 May 2015 13:54:19 -0700 Subject: RFR(XXS) 8079231: quarantine compiler/jsr292/CallSiteDepContextTest.java Message-ID: <5543E7FB.9010605@oracle.com> http://cr.openjdk.java.net/~kvn/8079231/webrev/ https://bugs.openjdk.java.net/browse/JDK-8079231 Quarantine the test until next bug is fixed: https://bugs.openjdk.java.net/browse/JDK-8079205 Thanks, Vladimir From dean.long at oracle.com Fri May 1 21:14:48 2015 From: dean.long at oracle.com (Dean Long) Date: Fri, 01 May 2015 14:14:48 -0700 Subject: RFR(XXS) 8079231: quarantine compiler/jsr292/CallSiteDepContextTest.java In-Reply-To: <5543E7FB.9010605@oracle.com> References: <5543E7FB.9010605@oracle.com> Message-ID: <5543ECC8.6090505@oracle.com> Looks good. dl On 5/1/2015 1:54 PM, Vladimir Kozlov wrote: > http://cr.openjdk.java.net/~kvn/8079231/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8079231 > > Quarantine the test until next bug is fixed: > > https://bugs.openjdk.java.net/browse/JDK-8079205 > > Thanks, > Vladimir From vladimir.kozlov at oracle.com Fri May 1 21:19:12 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 01 May 2015 14:19:12 -0700 Subject: RFR(XXS) 8079231: quarantine compiler/jsr292/CallSiteDepContextTest.java In-Reply-To: <5543ECC8.6090505@oracle.com> References: <5543E7FB.9010605@oracle.com> <5543ECC8.6090505@oracle.com> Message-ID: <5543EDD0.90407@oracle.com> Thank you, Dean Vladimir On 5/1/15 2:14 PM, Dean Long wrote: > Looks good. 
> > dl > > On 5/1/2015 1:54 PM, Vladimir Kozlov wrote: >> http://cr.openjdk.java.net/~kvn/8079231/webrev/ >> https://bugs.openjdk.java.net/browse/JDK-8079231 >> >> Quarantine the test until next bug is fixed: >> >> https://bugs.openjdk.java.net/browse/JDK-8079205 >> >> Thanks, >> Vladimir > From tobias.hartmann at oracle.com Mon May 4 05:38:53 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 04 May 2015 07:38:53 +0200 Subject: [9] RFR(S): 8078497: C2's superword optimization causes unaligned memory accesses In-Reply-To: <554254A9.6060608@oracle.com> References: <5540D237.8000107@oracle.com> <5541172D.5010200@oracle.com> <5541F0B1.50804@oracle.com> <554254A9.6060608@oracle.com> Message-ID: <554705ED.4050305@oracle.com> Thanks, Vladimir. Best, Tobias On 30.04.2015 18:13, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 4/30/15 2:06 AM, Tobias Hartmann wrote: >> Hi Vladimir, >> >> thanks for the review! >> >> On 29.04.2015 19:38, Vladimir Kozlov wrote: >>> Consider reducing number of iterations in test from 1M so that the test does not timeout on slow platforms. >> >> I reduced the number of iterations to 20k and verified that it is enough to trigger compilation and crash on Sparc. >> >>> lines 511-515 are duplicates of 486-491. Consider moving them before vw%span check. >> >> Thanks, I refactored that part. >> >>> Typo 'iff': >>> + // final offset is a multiple of vw iff init_offset is a multiple. >> >> I used 'iff' to refer to 'if and only if' [1]. I changed it. >> >> Here is the new webrev: >> http://cr.openjdk.java.net/~thartmann/8078497/webrev.01/ >> >> Thanks, >> Tobias >> >> [1] http://en.wikipedia.org/wiki/If_and_only_if >> >>> >>> Thanks, >>> Vladimir >>> >>> On 4/29/15 5:44 AM, Tobias Hartmann wrote: >>>> Hi, >>>> >>>> please review the following patch. 
>>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8078497 >>>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.00/ >>>> >>>> Background information (simplified): >>>> After loop optimizations, C2 tries to vectorize memory operations by finding and merging adjacent memory accesses in the loop body. The superword optimization matches MemNodes with the following pattern as address input: >>>> >>>> ptr + k*iv + constant [+ invar] >>>> >>>> where iv is the loop induction variable, k is a scaling factor that may be 0 and invar is an optional loop invariant value. C2 then picks the memory operation with most similar references and tries to align it in the main loop by setting the number of pre-loop iterations accordingly. Other adjacent memory operations that conform to this alignment are merged into packs. After extending and filtering these packs, final vector operations are emitted. >>>> >>>> The problem is that some architectures (for example, Sparc) require memory accesses to be aligned and in special cases the superword optimization generates code that contains unaligned vector instructions. >>>> >>>> Problems: >>>> (1) If two memory operations have different loop invariant offset values and C2 decides to align the main loop to one of them, we can always set the invariant of the other operation such that we break the alignment constraint. For example, if we have a store to a byte array a[i+inv1] and a load from a byte array b[i+inv2] (where i is the induction variable), C2 may decide to align the main loop such that (i+inv1 % 8) == 0 and replace the adjacent byte stores by a 8-byte double word store. Also the byte loads will be replaced by a double word load. If we then set inv2 = inv1 + 1 at runtime, the load will fail with a SIGBUS because the access is not 8-byte aligned, i.e., (i+inv2 % 8) != 0. Vectorization should not take place in this case. 
>>>> >>>> (2) If C2 decides to align the main loop to a memory operation, the necessary adjustment of the induction variable by the pre-loop is computed such that the resulting offset in the main loop is aligned to the vector width (see 'SuperWord::get_iv_adjustment'). If the offset of a memory operation is independent of the loop induction variable (i.e., scale k is 0), the iv adjustment should be 0. >>>> >>>> (3) If the loop span is greater than the vector width 'vw' of a memory operation, the superword optimization assumes that the pre-loop is never able to align this operation in the main loop (see 'SuperWord::ref_is_alignable'). This is wrong because if the loop span is a multiple of vw and depending on the initial offset, it may very well be possible to align the operation to vw. >>>> >>>> These problems originally only showed up in the string density code base (JDK-8054307) where we have a putChar intrinsic that writes a char value to two entries of a byte array. To reproduce the issue with JDK9 we have to make sure that the memory operations: >>>> (i) are independent, >>>> (ii) have different invariants, >>>> (iii) are not too complex (because then vectorization will not take place). >>>> >>>> To guarantee (i) we either need two different arrays (for example, byte[] and char[]) for the load and store operations or the same array but different offsets. Since the offsets should include a loop invariant part (ii), the superword optimization will not be able to determine that the runtime offset values do not overlap. We therefore use Unsafe.getChar/putChar to read/write a char value from/to a byte/char array and thereby guarantee independence. 
>>>> >>>> I came up with the following test (see 'TestVectorizationWithInvariant.java'): >>>> >>>> byte[] src = new byte[1000]; >>>> char[] dst = new char[1000]; >>>> >>>> for (int i = (int) CHAR_ARRAY_OFFSET; i < 100; i = i + 8) { >>>> // Copy 8 chars from src to dst >>>> unsafe.putChar(dst, i + 0, unsafe.getChar(src, off + 0)); >>>> [...] >>>> unsafe.putChar(dst, i + 14, unsafe.getChar(src, off + 14)); >>>> } >>>> >>>> Problem (1) shows up since the main loop will be aligned to the StoreC[i + 0] which has no invariant. However, the LoadUS[off + 0] has the loop invariant 'off'. Setting off to BYTE_ARRAY_OFFSET + 2 will break the alignment of the emitted double word load and result in a crash. >>>> >>>> The LoadUS[off + 0] in the above example is independent of the loop induction variable and therefore has no scale value. Because of problem (2), the iv adjustment is computed as -4 (see 'SuperWord::get_iv_adjustment'). >>>> >>>> offset = 16 iv_adjust = -4 elt_size = 2 scale = 0 iv_stride = 16 vect_size 8 >>>> >>>> The regression test contains additional test cases that also trigger problem (3). >>>> >>>> Solution: >>>> (1) I added a check to 'SuperWord::find_adjacent_refs' to make sure that memory accesses with different invariants are only vectorized if the architecture supports unaligned memory accesses. >>>> (2) 'SuperWord::get_iv_adjustment' is modified such that it returns 0 if the memory operation is independent of iv. >>>> (3) In 'SuperWord::ref_is_alignable' we need to additionally check if span is divisible by vw. If so, the pre-loop will add multiples of vw to the initial offset and if that initial offset is divisible by vw, the final offset (after the pre-loop) will be divisible as well and we should return true. I added comments to the code describing this in detail.
>>>> >>>> Testing: >>>> - original test with string-density codebase >>>> - regression test >>>> - JPRT >>>> >>>> Thanks, >>>> Tobias >>>> From rickard.backman at oracle.com Mon May 4 13:13:37 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Mon, 4 May 2015 15:13:37 +0200 Subject: RFR (M): 8064458 OopMap class could be more compact In-Reply-To: <554257C6.1010600@oracle.com> References: <20150428100337.GB31204@rbackman> <553FF92E.3010908@oracle.com> <20150429145259.GD31204@rbackman> <5541132C.9020208@oracle.com> <20150430141818.GG31204@rbackman> <554257C6.1010600@oracle.com> Message-ID: <20150504131337.GJ31204@rbackman> Updated again. http://cr.openjdk.java.net/~rbackman/8064458.5 Thanks On 04/30, Vladimir Kozlov wrote: > It is better now. I am fine with inner Mapping class. > > One thing left. I think only Mapping and fields could be private since only ImmutableOopMapBuilder accesses them: > > > + class ImmutableOopMapBuilder { > + public: > + class Mapping; > + > + public: > + const OopMapSet* _set; > > thanks, > Vladimir > > On 4/30/15 7:18 AM, Rickard B?ckman wrote: > >On 04/29, Vladimir Kozlov wrote: > >>On 4/29/15 7:52 AM, Rickard B?ckman wrote: > >>>On 04/28, Vladimir Kozlov wrote: > >>>>You need closed SA changes. > >>> > >>>I have them, just forgot to send the mail. Small changes too. > >>> > >>>> > >>>>Style: > >>>> > >>>>Move fields to the beginning of ImmutableOopMapBuilder and classes. > >>>>Add comments describing each field in ALL new classes. Add comments > >>>>to fields in old class. It will help next persone who will look on > >>>>oop maps later. > >>> > >>>All fields are at the beginning? Just following style and keeping > >>>friends above fields as other classes in the same file has. > >> > >> friends > >> fields > >> methods > >> > >>It is better to see fields before methods which access them. I am > >>not sure what codding style you are talking about. All classes in > >>oopMap.hpp has above layout. 
> >> > >>I am asking to change layout of ImmutableOopMapBuilder and Mapping classes. Also why Mapping is inner class? > > > >Ah, I was looking at oopMap.hpp and didn't see them. Fixed. > >Mapping is an inner class because it is only used by the Builder and I > >didn't want to clutter the namespace with something as general (and bad) > >as Mapping. I can move it out if that is preferred. > > > >> > >> > >>> > >>>> > >>>>Add ResourceMark into ImmutableOopMapSet::build_from() to free memory allocated in ImmutableOopMapBuilder(). > >> > >>no ResourceMark in 8064458.2 > > > >Added. > > > >> > >>>>Why you need ImmutableOopMapBuilder to be friend of class ImmutableOopMap? > >>> > >>>It makes a call to data_addr() which is private. Only for debug builds > >>>so I wrapped it in DEBUG_ONLY. > >> > >>Put ImmutableOopMapBuilder::verify() under #ifdef ASSERT and its declaration in DEBUG_ONLY. > >> > > > >Done. > > > >Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.4 > > > >Thanks > >/R > > > >>Thanks, > >>Vladimir > >> > >>> > >>>> > >>>>I think a simple loop in ImmutableOopMapBuilder::verify() would be faster than calling memcmp. > >>> > >>>Fixed. > >>> > >>>> > >>>>Field _end is not used. > >>> > >>>Removed. > >>> > >>>> > >>>>ImmutableOopMapBuilder() calls reset() and next called heap_size() > >>>>calls reset() again. May be move reset() to the end of heap_size() > >>>>so that you don't need to call it in fill(). > >>>> > >>> > >>>Better yet, the reset() method isn't required anymore. 
> >>> > >>>Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.2 > >>> > >>>/R > >>> > >>>>Thanks, > >>>>Vladimir > >>>> > >>>>On 4/28/15 3:03 AM, Rickard B?ckman wrote: > >>>>>Hi all, > >>>>> > >>>>>can I please have reviews for this change: > >>>>> > >>>>>RFR: http://cr.openjdk.java.net/~rbackman/8064458.1/ > >>>>>RFE: http://bugs.openjdk.java.net/browse/JDK-8064458 > >>>>> > >>>>>While looking at OopMaps a while ago I noticed that there were a couple > >>>>>of different fields that were unused after the OopMaps were finalised. > >>>>> > >>>>>I took some time to investigate and rearrange the OopMaps. Since I > >>>>>didn't want to change how the OopMaps are built I introduced new data > >>>>>structures. ImmutableOopMapSet and ImmutableOopMap. The original OopMap > >>>>>structures are used to build up the OopMaps and when finalised they are > >>>>>copied into the Immutable variants. > >>>>> > >>>>>The ImmutableOopMapSet contains a few fields [size, count] and then a > >>>>>list of [pc, offset]. The offset points to the offset after the list > >>>>>where the ImmutableOopMap is placed. By moving pc out from OopMap to be > >>>>>part of the list we can now have multiple pcs with identical OopMaps > >>>>>point to the same data. > >>>>> > >>>>>We only keep 1 empty OopMap, and the other compaction that is done in > >>>>>this change is to check if the OopMap is identical to the previous one > >>>>>and then reuse that one. So no complete uniqueness check. > >>>>> > >>>>>I ran a couple of small benchmarks and printed the size of the old > >>>>>OopMaps vs the new. The new layout uses about 20 - 25% of the space on > >>>>>the benchmarks I've run. > >>>>> > >>>>>Tested by running through JPRT, running BigApps and NSK.quick.testlist > >>>>> > >>>>>Thanks > >>>>>/R > >>>>> /R -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From rickard.backman at oracle.com Mon May 4 13:16:04 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Mon, 4 May 2015 15:16:04 +0200 Subject: RFR (M): 8064458 OopMap class could be more compact In-Reply-To: <55425AAF.2020801@oracle.com> References: <20150428100337.GB31204@rbackman> <553FF92E.3010908@oracle.com> <20150429145259.GD31204@rbackman> <5541132C.9020208@oracle.com> <20150430141818.GG31204@rbackman> <554257C6.1010600@oracle.com> <94B29370-0478-43B8-8547-F108D5AD29FF@oracle.com> <55425AAF.2020801@oracle.com> Message-ID: <20150504131604.GK31204@rbackman> Vladimir, I think you are right. Personally I don't see much improvement by having TempOopMap/OopMap vs OopMap/ImmutableOopMap (if we had had namespaces we could have had them as compiler::OopMap/code::OopMap). If I see lots of requests for the former before I push... Thanks /R On 04/30, Vladimir Kozlov wrote: > We still use OopMap and OopMapSet when we create oopmap data. > ImmutableOopMap* is used for writing and reading final metadata. > Otherwise changes would be much larger (see for example, > sharedRuntime_.cpp etc). Rickard may correct me if I am wrong. > > Vladimir > > On 4/30/15 9:33 AM, Christian Thalinger wrote: > >One thing I mentioned to Rickard privately is that I would prefer to have the current OopMap be renamed to something else (TempOopMap?) and keep OopMap as the one used. > > > >>On Apr 30, 2015, at 9:26 AM, Vladimir Kozlov wrote: > >> > >>It is better now. I am fine with inner Mapping class. > >> > >>One thing left. 
I think only Mapping and fields could be private since only ImmutableOopMapBuilder accesses them: > >> > >> > >>+ class ImmutableOopMapBuilder { > >>+ public: > >>+ class Mapping; > >>+ > >>+ public: > >>+ const OopMapSet* _set; > >> > >>thanks, > >>Vladimir > >> > >>On 4/30/15 7:18 AM, Rickard B?ckman wrote: > >>>On 04/29, Vladimir Kozlov wrote: > >>>>On 4/29/15 7:52 AM, Rickard B?ckman wrote: > >>>>>On 04/28, Vladimir Kozlov wrote: > >>>>>>You need closed SA changes. > >>>>> > >>>>>I have them, just forgot to send the mail. Small changes too. > >>>>> > >>>>>> > >>>>>>Style: > >>>>>> > >>>>>>Move fields to the beginning of ImmutableOopMapBuilder and classes. > >>>>>>Add comments describing each field in ALL new classes. Add comments > >>>>>>to fields in old class. It will help next persone who will look on > >>>>>>oop maps later. > >>>>> > >>>>>All fields are at the beginning? Just following style and keeping > >>>>>friends above fields as other classes in the same file has. > >>>> > >>>> friends > >>>> fields > >>>> methods > >>>> > >>>>It is better to see fields before methods which access them. I am > >>>>not sure what codding style you are talking about. All classes in > >>>>oopMap.hpp has above layout. > >>>> > >>>>I am asking to change layout of ImmutableOopMapBuilder and Mapping classes. Also why Mapping is inner class? > >>> > >>>Ah, I was looking at oopMap.hpp and didn't see them. Fixed. > >>>Mapping is an inner class because it is only used by the Builder and I > >>>didn't want to clutter the namespace with something as general (and bad) > >>>as Mapping. I can move it out if that is preferred. > >>> > >>>> > >>>> > >>>>> > >>>>>> > >>>>>>Add ResourceMark into ImmutableOopMapSet::build_from() to free memory allocated in ImmutableOopMapBuilder(). > >>>> > >>>>no ResourceMark in 8064458.2 > >>> > >>>Added. > >>> > >>>> > >>>>>>Why you need ImmutableOopMapBuilder to be friend of class ImmutableOopMap? 
> >>>>> > >>>>>It makes a call to data_addr() which is private. Only for debug builds > >>>>>so I wrapped it in DEBUG_ONLY. > >>>> > >>>>Put ImmutableOopMapBuilder::verify() under #ifdef ASSERT and its declaration in DEBUG_ONLY. > >>>> > >>> > >>>Done. > >>> > >>>Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.4 > >>> > >>>Thanks > >>>/R > >>> > >>>>Thanks, > >>>>Vladimir > >>>> > >>>>> > >>>>>> > >>>>>>I think a simple loop in ImmutableOopMapBuilder::verify() would be faster than calling memcmp. > >>>>> > >>>>>Fixed. > >>>>> > >>>>>> > >>>>>>Field _end is not used. > >>>>> > >>>>>Removed. > >>>>> > >>>>>> > >>>>>>ImmutableOopMapBuilder() calls reset() and next called heap_size() > >>>>>>calls reset() again. May be move reset() to the end of heap_size() > >>>>>>so that you don't need to call it in fill(). > >>>>>> > >>>>> > >>>>>Better yet, the reset() method isn't required anymore. > >>>>> > >>>>>Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.2 > >>>>> > >>>>>/R > >>>>> > >>>>>>Thanks, > >>>>>>Vladimir > >>>>>> > >>>>>>On 4/28/15 3:03 AM, Rickard B?ckman wrote: > >>>>>>>Hi all, > >>>>>>> > >>>>>>>can I please have reviews for this change: > >>>>>>> > >>>>>>>RFR: http://cr.openjdk.java.net/~rbackman/8064458.1/ > >>>>>>>RFE: http://bugs.openjdk.java.net/browse/JDK-8064458 > >>>>>>> > >>>>>>>While looking at OopMaps a while ago I noticed that there were a couple > >>>>>>>of different fields that were unused after the OopMaps were finalised. > >>>>>>> > >>>>>>>I took some time to investigate and rearrange the OopMaps. Since I > >>>>>>>didn't want to change how the OopMaps are built I introduced new data > >>>>>>>structures. ImmutableOopMapSet and ImmutableOopMap. The original OopMap > >>>>>>>structures are used to build up the OopMaps and when finalised they are > >>>>>>>copied into the Immutable variants. > >>>>>>> > >>>>>>>The ImmutableOopMapSet contains a few fields [size, count] and then a > >>>>>>>list of [pc, offset]. 
The offset points to the offset after the list > >>>>>>>where the ImmutableOopMap is placed. By moving pc out from OopMap to be > >>>>>>>part of the list we can now have multiple pcs with identical OopMaps > >>>>>>>point to the same data. > >>>>>>> > >>>>>>>We only keep 1 empty OopMap, and the other compaction that is done in > >>>>>>>this change is to check if the OopMap is identical to the previous one > >>>>>>>and then reuse that one. So no complete uniqueness check. > >>>>>>> > >>>>>>>I ran a couple of small benchmarks and printed the size of the old > >>>>>>>OopMaps vs the new. The new layout uses about 20 - 25% of the space on > >>>>>>>the benchmarks I've run. > >>>>>>> > >>>>>>>Tested by running through JPRT, running BigApps and NSK.quick.testlist > >>>>>>> > >>>>>>>Thanks > >>>>>>>/R > >>>>>>> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From tobias.hartmann at oracle.com Mon May 4 14:09:53 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 04 May 2015 16:09:53 +0200 Subject: [9] RFR(S): 8078497: C2's superword optimization causes unaligned memory accesses In-Reply-To: <554705ED.4050305@oracle.com> References: <5540D237.8000107@oracle.com> <5541172D.5010200@oracle.com> <5541F0B1.50804@oracle.com> <554254A9.6060608@oracle.com> <554705ED.4050305@oracle.com> Message-ID: <55477DB1.5050200@oracle.com> Hi, looks like there is still a problem I didn't catch. With my regression test I sometimes hit [1] # Internal Error (/opt/jprt/T/P1/130607.tohartma/s/hotspot/src/share/vm/opto/loopnode.cpp:3231), pid=18188, tid=0x00002b00dacde700 # assert(!had_error) failed: bad dominance I'm not sure if this is related to my fix but I assume it is an existing problem that was just triggered by my regression test. I'll further investigate. 
Thanks, Tobias [1] http://sthjprt.se.oracle.com/archives/2015/05/2015-05-04-130607.tohartma.8078497/logs/linux_x64_2.6-fastdebug-c2-hotspot_compiler_3.log.FAILED.log On 04.05.2015 07:38, Tobias Hartmann wrote: > Thanks, Vladimir. > > Best, > Tobias > > On 30.04.2015 18:13, Vladimir Kozlov wrote: >> Looks good. >> >> Thanks, >> Vladimir >> >> On 4/30/15 2:06 AM, Tobias Hartmann wrote: >>> Hi Vladimir, >>> >>> thanks for the review! >>> >>> On 29.04.2015 19:38, Vladimir Kozlov wrote: >>>> Consider reducing number of iterations in test from 1M so that the test does not timeout on slow platforms. >>> >>> I reduced the number of iterations to 20k and verified that it is enough to trigger compilation and crash on Sparc. >>> >>>> lines 511-515 are duplicates of 486-491. Consider moving them before vw%span check. >>> >>> Thanks, I refactored that part. >>> >>>> Typo 'iff': >>>> + // final offset is a multiple of vw iff init_offset is a multiple. >>> >>> I used 'iff' to refer to 'if and only if' [1]. I changed it. >>> >>> Here is the new webrev: >>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.01/ >>> >>> Thanks, >>> Tobias >>> >>> [1] http://en.wikipedia.org/wiki/If_and_only_if >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>> On 4/29/15 5:44 AM, Tobias Hartmann wrote: >>>>> Hi, >>>>> >>>>> please review the following patch. >>>>> >>>>> https://bugs.openjdk.java.net/browse/JDK-8078497 >>>>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.00/ >>>>> >>>>> Background information (simplified): >>>>> After loop optimizations, C2 tries to vectorize memory operations by finding and merging adjacent memory accesses in the loop body. The superword optimization matches MemNodes with the following pattern as address input: >>>>> >>>>> ptr + k*iv + constant [+ invar] >>>>> >>>>> where iv is the loop induction variable, k is a scaling factor that may be 0 and invar is an optional loop invariant value. 
C2 then picks the memory operation with most similar references and tries to align it in the main loop by setting the number of pre-loop iterations accordingly. Other adjacent memory operations that conform to this alignment are merged into packs. After extending and filtering these packs, final vector operations are emitted. >>>>> >>>>> The problem is that some architectures (for example, Sparc) require memory accesses to be aligned and in special cases the superword optimization generates code that contains unaligned vector instructions. >>>>> >>>>> Problems: >>>>> (1) If two memory operations have different loop invariant offset values and C2 decides to align the main loop to one of them, we can always set the invariant of the other operation such that we break the alignment constraint. For example, if we have a store to a byte array a[i+inv1] and a load from a byte array b[i+inv2] (where i is the induction variable), C2 may decide to align the main loop such that (i+inv1 % 8) == 0 and replace the adjacent byte stores by a 8-byte double word store. Also the byte loads will be replaced by a double word load. If we then set inv2 = inv1 + 1 at runtime, the load will fail with a SIGBUS because the access is not 8-byte aligned, i.e., (i+inv2 % 8) != 0. Vectorization should not take place in this case. >>>>> >>>>> (2) If C2 decides to align the main loop to a memory operation, the necessary adjustment of the induction variable by the pre-loop is computed such that the resulting offset in the main loop is aligned to the vector width (see 'SuperWord::get_iv_adjustment'). If the offset of a memory operation is independent of the loop induction variable (i.e., scale k is 0), the iv adjustment should be 0. >>>>> >>>>> (3) If the loop span is greater than the vector width 'vw' of a memory operation, the superword optimization assumes that the pre-loop is never able to align this operation in the main loop (see 'SuperWord::ref_is_alignable'). 
This is wrong because if the loop span is a multiple of vw and depending on the initial offset, it may very well be possible to align the operation to vw. >>>>> >>>>> These problems originally only showed up in the string density code base (JDK-8054307) where we have a putChar intrinsic that writes a char value to two entries of a byte array. To reproduce the issue with JDK9 we have to make sure that the memory operations: >>>>> (i) are independent, >>>>> (ii) have different invariants, >>>>> (iii) are not too complex (because then vectorization will not take place). >>>>> >>>>> To guarantee (i) we either need two different arrays (for example, byte[] and char[]) for the load and store operations or the same array but different offsets. Since the offsets should include a loop invariant part (ii), the superword optimization will not be able to determine that the runtime offset values do not overlap. We therefore use Unsafe.getChar/putChar to read/write a char value from/to a byte/char array and thereby guarantee independence. >>>>> >>>>> I came up with the following test (see 'TestVectorizationWithInvariant.java'): >>>>> >>>>> byte[] src = new byte[1000]; >>>>> byte[] dst = new char[1000]; >>>>> >>>>> for (int i = (int) CHAR_ARRAY_OFFSET; i < 100; i = i + 8) { >>>>> // Copy 8 chars from src to dst >>>>> unsafe.putChar(dst, i + 0, unsafe.getChar(src, off + 0)); >>>>> [...] >>>>> unsafe.putChar(dst, i + 14, unsafe.getChar(src, off + 14)); >>>>> } >>>>> >>>>> Problem (1) shows up since the main loop will be aligned to the StoreC[i + 0] which has no invariant. However, the LoadUS[off + 0] has the loop invariant 'off'. Setting off to BYTE_ARRAY_OFFSET + 2 will break the alignment of the emitted double word load and result in a crash. >>>>> >>>>> The LoadUS[off + 0] in above example is independent of the loop induction variable and therefore has no scale value. Because of problem (2), the iv adjustment is computed as -4 (see 'SuperWord::get_iv_adjustment'). 
>>>>> >>>>> offset = 16 iv_adjust = -4 elt_size = 2 scale = 0 iv_stride = 16 vect_size 8 >>>>> >>>>> The regression test contains additional test cases that also trigger problem (3). >>>>> >>>>> Solution: >>>>> (1) I added a check to 'SuperWord::find_adjacent_refs' to make sure that memory accesses with different invariants are only vectorized if the architecture supports unaligned memory accesses. >>>>> (2) 'SuperWord::get_iv_adjustment' is modified such that it returns 0 if the memory operation is independent of iv. >>>>> (3) In 'SuperWord::ref_is_alignable' we need to additionally check if span is divisible by vw. If so, the pre-loop will add multiples of vw to the initial offset and if that initial offset is divisible by vm, the final offset (after the pre-loop) will be divisible as well and we should return true. I added comments to the code describing this in detail. >>>>> >>>>> Testing: >>>>> - original test with string-density codebase >>>>> - regression test >>>>> - JPRT >>>>> >>>>> Thanks, >>>>> Tobias >>>>> From aph at redhat.com Mon May 4 16:51:37 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 04 May 2015 17:51:37 +0100 Subject: sync_kit() Message-ID: <5547A399.3040107@redhat.com> I'm trying to get my head around sync_kit(). I don't really know what it does. Well, I know that it synchronizes IdealKit and graphKit, and I know from experience that if I don't call sync_kit() my code doesn't work. But what are the rules about when sync_kit() should be called? Thanks, Andrew. 
From vladimir.kozlov at oracle.com Mon May 4 18:54:58 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 11:54:58 -0700 Subject: RFR (M): 8064458 OopMap class could be more compact In-Reply-To: <20150504131604.GK31204@rbackman> References: <20150428100337.GB31204@rbackman> <553FF92E.3010908@oracle.com> <20150429145259.GD31204@rbackman> <5541132C.9020208@oracle.com> <20150430141818.GG31204@rbackman> <554257C6.1010600@oracle.com> <94B29370-0478-43B8-8547-F108D5AD29FF@oracle.com> <55425AAF.2020801@oracle.com> <20150504131604.GK31204@rbackman> Message-ID: <5547C082.80501@oracle.com> Keep it as you have it now in your changes. thanks, Vladimir On 5/4/15 6:16 AM, Rickard B?ckman wrote: > Vladimir, > > I think you are right. Personally I don't see much improvement by having > TempOopMap/OopMap vs OopMap/ImmutableOopMap (if we had had namespaces we > could have had them as compiler::OopMap/code::OopMap). If I see lots of requests > for the former before I push... > > Thanks > /R > > On 04/30, Vladimir Kozlov wrote: >> We still use OopMap and OopMapSet when we create oopmap data. >> ImmutableOopMap* is used for writing and reading final metadata. >> Otherwise changes would be much larger (see for example, >> sharedRuntime_.cpp etc). Rickard may correct me if I am wrong. >> >> Vladimir >> >> On 4/30/15 9:33 AM, Christian Thalinger wrote: >>> One thing I mentioned to Rickard privately is that I would prefer to have the current OopMap be renamed to something else (TempOopMap?) and keep OopMap as the one used. >>> >>>> On Apr 30, 2015, at 9:26 AM, Vladimir Kozlov wrote: >>>> >>>> It is better now. I am fine with inner Mapping class. >>>> >>>> One thing left. 
I think only Mapping and fields could be private since only ImmutableOopMapBuilder accesses them: >>>> >>>> >>>> + class ImmutableOopMapBuilder { >>>> + public: >>>> + class Mapping; >>>> + >>>> + public: >>>> + const OopMapSet* _set; >>>> >>>> thanks, >>>> Vladimir >>>> >>>> On 4/30/15 7:18 AM, Rickard B?ckman wrote: >>>>> On 04/29, Vladimir Kozlov wrote: >>>>>> On 4/29/15 7:52 AM, Rickard B?ckman wrote: >>>>>>> On 04/28, Vladimir Kozlov wrote: >>>>>>>> You need closed SA changes. >>>>>>> >>>>>>> I have them, just forgot to send the mail. Small changes too. >>>>>>> >>>>>>>> >>>>>>>> Style: >>>>>>>> >>>>>>>> Move fields to the beginning of ImmutableOopMapBuilder and classes. >>>>>>>> Add comments describing each field in ALL new classes. Add comments >>>>>>>> to fields in old class. It will help next persone who will look on >>>>>>>> oop maps later. >>>>>>> >>>>>>> All fields are at the beginning? Just following style and keeping >>>>>>> friends above fields as other classes in the same file has. >>>>>> >>>>>> friends >>>>>> fields >>>>>> methods >>>>>> >>>>>> It is better to see fields before methods which access them. I am >>>>>> not sure what codding style you are talking about. All classes in >>>>>> oopMap.hpp has above layout. >>>>>> >>>>>> I am asking to change layout of ImmutableOopMapBuilder and Mapping classes. Also why Mapping is inner class? >>>>> >>>>> Ah, I was looking at oopMap.hpp and didn't see them. Fixed. >>>>> Mapping is an inner class because it is only used by the Builder and I >>>>> didn't want to clutter the namespace with something as general (and bad) >>>>> as Mapping. I can move it out if that is preferred. >>>>> >>>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>> Add ResourceMark into ImmutableOopMapSet::build_from() to free memory allocated in ImmutableOopMapBuilder(). >>>>>> >>>>>> no ResourceMark in 8064458.2 >>>>> >>>>> Added. >>>>> >>>>>> >>>>>>>> Why you need ImmutableOopMapBuilder to be friend of class ImmutableOopMap? 
>>>>>>> >>>>>>> It makes a call to data_addr() which is private. Only for debug builds >>>>>>> so I wrapped it in DEBUG_ONLY. >>>>>> >>>>>> Put ImmutableOopMapBuilder::verify() under #ifdef ASSERT and its declaration in DEBUG_ONLY. >>>>>> >>>>> >>>>> Done. >>>>> >>>>> Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.4 >>>>> >>>>> Thanks >>>>> /R >>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>>> >>>>>>>> >>>>>>>> I think a simple loop in ImmutableOopMapBuilder::verify() would be faster than calling memcmp. >>>>>>> >>>>>>> Fixed. >>>>>>> >>>>>>>> >>>>>>>> Field _end is not used. >>>>>>> >>>>>>> Removed. >>>>>>> >>>>>>>> >>>>>>>> ImmutableOopMapBuilder() calls reset() and next called heap_size() >>>>>>>> calls reset() again. May be move reset() to the end of heap_size() >>>>>>>> so that you don't need to call it in fill(). >>>>>>>> >>>>>>> >>>>>>> Better yet, the reset() method isn't required anymore. >>>>>>> >>>>>>> Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.2 >>>>>>> >>>>>>> /R >>>>>>> >>>>>>>> Thanks, >>>>>>>> Vladimir >>>>>>>> >>>>>>>> On 4/28/15 3:03 AM, Rickard B?ckman wrote: >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> can I please have reviews for this change: >>>>>>>>> >>>>>>>>> RFR: http://cr.openjdk.java.net/~rbackman/8064458.1/ >>>>>>>>> RFE: http://bugs.openjdk.java.net/browse/JDK-8064458 >>>>>>>>> >>>>>>>>> While looking at OopMaps a while ago I noticed that there were a couple >>>>>>>>> of different fields that were unused after the OopMaps were finalised. >>>>>>>>> >>>>>>>>> I took some time to investigate and rearrange the OopMaps. Since I >>>>>>>>> didn't want to change how the OopMaps are built I introduced new data >>>>>>>>> structures. ImmutableOopMapSet and ImmutableOopMap. The original OopMap >>>>>>>>> structures are used to build up the OopMaps and when finalised they are >>>>>>>>> copied into the Immutable variants. 
>>>>>>>>> >>>>>>>>> The ImmutableOopMapSet contains a few fields [size, count] and then a >>>>>>>>> list of [pc, offset]. The offset points to the offset after the list >>>>>>>>> where the ImmutableOopMap is placed. By moving pc out from OopMap to be >>>>>>>>> part of the list we can now have multiple pcs with identical OopMaps >>>>>>>>> point to the same data. >>>>>>>>> >>>>>>>>> We only keep 1 empty OopMap, and the other compaction that is done in >>>>>>>>> this change is to check if the OopMap is identical to the previous one >>>>>>>>> and then reuse that one. So no complete uniqueness check. >>>>>>>>> >>>>>>>>> I ran a couple of small benchmarks and printed the size of the old >>>>>>>>> OopMaps vs the new. The new layout uses about 20 - 25% of the space on >>>>>>>>> the benchmarks I've run. >>>>>>>>> >>>>>>>>> Tested by running through JPRT, running BigApps and NSK.quick.testlist >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> /R >>>>>>>>> >>> From vladimir.kozlov at oracle.com Mon May 4 18:56:00 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 11:56:00 -0700 Subject: RFR (M): 8064458 OopMap class could be more compact In-Reply-To: <20150504131337.GJ31204@rbackman> References: <20150428100337.GB31204@rbackman> <553FF92E.3010908@oracle.com> <20150429145259.GD31204@rbackman> <5541132C.9020208@oracle.com> <20150430141818.GG31204@rbackman> <554257C6.1010600@oracle.com> <20150504131337.GJ31204@rbackman> Message-ID: <5547C0C0.1080404@oracle.com> Looks good. Thanks, Vladimir On 5/4/15 6:13 AM, Rickard B?ckman wrote: > Updated again. > > http://cr.openjdk.java.net/~rbackman/8064458.5 > > Thanks > > On 04/30, Vladimir Kozlov wrote: >> It is better now. I am fine with inner Mapping class. >> >> One thing left. 
I think only Mapping and fields could be private since only ImmutableOopMapBuilder accesses them: >> >> >> + class ImmutableOopMapBuilder { >> + public: >> + class Mapping; >> + >> + public: >> + const OopMapSet* _set; >> >> thanks, >> Vladimir >> >> On 4/30/15 7:18 AM, Rickard B?ckman wrote: >>> On 04/29, Vladimir Kozlov wrote: >>>> On 4/29/15 7:52 AM, Rickard B?ckman wrote: >>>>> On 04/28, Vladimir Kozlov wrote: >>>>>> You need closed SA changes. >>>>> >>>>> I have them, just forgot to send the mail. Small changes too. >>>>> >>>>>> >>>>>> Style: >>>>>> >>>>>> Move fields to the beginning of ImmutableOopMapBuilder and classes. >>>>>> Add comments describing each field in ALL new classes. Add comments >>>>>> to fields in old class. It will help next persone who will look on >>>>>> oop maps later. >>>>> >>>>> All fields are at the beginning? Just following style and keeping >>>>> friends above fields as other classes in the same file has. >>>> >>>> friends >>>> fields >>>> methods >>>> >>>> It is better to see fields before methods which access them. I am >>>> not sure what codding style you are talking about. All classes in >>>> oopMap.hpp has above layout. >>>> >>>> I am asking to change layout of ImmutableOopMapBuilder and Mapping classes. Also why Mapping is inner class? >>> >>> Ah, I was looking at oopMap.hpp and didn't see them. Fixed. >>> Mapping is an inner class because it is only used by the Builder and I >>> didn't want to clutter the namespace with something as general (and bad) >>> as Mapping. I can move it out if that is preferred. >>> >>>> >>>> >>>>> >>>>>> >>>>>> Add ResourceMark into ImmutableOopMapSet::build_from() to free memory allocated in ImmutableOopMapBuilder(). >>>> >>>> no ResourceMark in 8064458.2 >>> >>> Added. >>> >>>> >>>>>> Why you need ImmutableOopMapBuilder to be friend of class ImmutableOopMap? >>>>> >>>>> It makes a call to data_addr() which is private. Only for debug builds >>>>> so I wrapped it in DEBUG_ONLY. 
>>>> >>>> Put ImmutableOopMapBuilder::verify() under #ifdef ASSERT and its declaration in DEBUG_ONLY. >>>> >>> >>> Done. >>> >>> Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.4 >>> >>> Thanks >>> /R >>> >>>> Thanks, >>>> Vladimir >>>> >>>>> >>>>>> >>>>>> I think a simple loop in ImmutableOopMapBuilder::verify() would be faster than calling memcmp. >>>>> >>>>> Fixed. >>>>> >>>>>> >>>>>> Field _end is not used. >>>>> >>>>> Removed. >>>>> >>>>>> >>>>>> ImmutableOopMapBuilder() calls reset() and next called heap_size() >>>>>> calls reset() again. May be move reset() to the end of heap_size() >>>>>> so that you don't need to call it in fill(). >>>>>> >>>>> >>>>> Better yet, the reset() method isn't required anymore. >>>>> >>>>> Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.2 >>>>> >>>>> /R >>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>> On 4/28/15 3:03 AM, Rickard B?ckman wrote: >>>>>>> Hi all, >>>>>>> >>>>>>> can I please have reviews for this change: >>>>>>> >>>>>>> RFR: http://cr.openjdk.java.net/~rbackman/8064458.1/ >>>>>>> RFE: http://bugs.openjdk.java.net/browse/JDK-8064458 >>>>>>> >>>>>>> While looking at OopMaps a while ago I noticed that there were a couple >>>>>>> of different fields that were unused after the OopMaps were finalised. >>>>>>> >>>>>>> I took some time to investigate and rearrange the OopMaps. Since I >>>>>>> didn't want to change how the OopMaps are built I introduced new data >>>>>>> structures. ImmutableOopMapSet and ImmutableOopMap. The original OopMap >>>>>>> structures are used to build up the OopMaps and when finalised they are >>>>>>> copied into the Immutable variants. >>>>>>> >>>>>>> The ImmutableOopMapSet contains a few fields [size, count] and then a >>>>>>> list of [pc, offset]. The offset points to the offset after the list >>>>>>> where the ImmutableOopMap is placed. 
By moving pc out from OopMap to be >>>>>>> part of the list we can now have multiple pcs with identical OopMaps >>>>>>> point to the same data. >>>>>>> >>>>>>> We only keep 1 empty OopMap, and the other compaction that is done in >>>>>>> this change is to check if the OopMap is identical to the previous one >>>>>>> and then reuse that one. So no complete uniqueness check. >>>>>>> >>>>>>> I ran a couple of small benchmarks and printed the size of the old >>>>>>> OopMaps vs the new. The new layout uses about 20 - 25% of the space on >>>>>>> the benchmarks I've run. >>>>>>> >>>>>>> Tested by running through JPRT, running BigApps and NSK.quick.testlist >>>>>>> >>>>>>> Thanks >>>>>>> /R >>>>>>> > /R > From vladimir.kozlov at oracle.com Mon May 4 21:30:27 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 14:30:27 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC Message-ID: <5547E4F3.4070603@oracle.com> http://cr.openjdk.java.net/~kvn/8073583/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8073583 Contributed by: James Cheng Two methods in the java.util.zip.CRC32C class are intrinsified on SPARC with a single stub routine: java.util.zip.CRC32C.updateBytes() java.util.zip.CRC32C.updateDirectByteBuffer() Two JVM flags, UseCRC32C and UseCRC32CIntrinsics, are added for controlling the usage of the CRC32C intrinsics. They are set to true on SPARC T4 and later processors, or false on other platforms. They can be turned off with -XX:-UseCRC32C or -XX:-UseCRC32CIntrinsics. Performance gains from the CRC32C intrinsics were measured up to 16x with hotspot/test/compiler/intrinsics/crc32c/TestCRC32C.java or the JMH-based ChecksumBenchmarks/CRC32C. Performance gains from the CRC32C intrinsics were also measured with Hadoop TestDFSIO at about 40% for the write test and 80% for the read test. Tested with hotspot/test/ and jdk/test/java/util/ jtreg tests.
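For reference, the two intrinsified entry points above are private; user code reaches them through the public java.util.zip.Checksum API. A minimal sketch exercising both the byte[] path (updateBytes) and the direct-buffer path (updateDirectByteBuffer); the demo class name is ours, while the class, methods and the 0xE3069283 check value are standard JDK 9 / CRC-32C facts:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

public class Crc32cDemo {
    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);

        // byte[] path: backed by CRC32C.updateBytes() when the intrinsic kicks in
        CRC32C c1 = new CRC32C();
        c1.update(data, 0, data.length);

        // direct ByteBuffer path: backed by CRC32C.updateDirectByteBuffer()
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        buf.flip();
        CRC32C c2 = new CRC32C();
        c2.update(buf);

        // 0xE3069283 is the well-known CRC-32C (Castagnoli) check value for "123456789"
        if (c1.getValue() != 0xE3069283L || c2.getValue() != c1.getValue())
            throw new AssertionError("CRC32C mismatch");
        System.out.println(Long.toHexString(c1.getValue())); // prints "e3069283"
    }
}
```

Both checksums must agree regardless of whether the stub or the pure-Java fallback runs, which is what makes the jtreg tests mentioned above meaningful.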
From john.r.rose at oracle.com Tue May 5 00:39:23 2015 From: john.r.rose at oracle.com (John Rose) Date: Mon, 4 May 2015 17:39:23 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <5547E4F3.4070603@oracle.com> References: <5547E4F3.4070603@oracle.com> Message-ID: <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> On May 4, 2015, at 2:30 PM, Vladimir Kozlov wrote: > > http://cr.openjdk.java.net/~kvn/8073583/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8073583 > > Contributed by: James Cheng I took a quick look at the assembly code. I have a few comments. The magic constant 128 (size of parallel decomposition chunk) needs naming; suggest "chunk length" or something like that. When the constant is defined, it needs to be explained also. I'm glad to see newer sparc instructions getting used, especially xmulx to reassociate crc. The magic constants used to shift the CRC codes by 1024 should be named and explained. The code is unnecessarily monolithic. It will be more readable and maintainable if some easy factorizations are applied. The factorizations themselves can add value beyond readability of this new code. Specifically, the sethi/or3/or3 should be replaced by set64 macros. Forget about the clever hoisting of 1<<32 into G1; just improve set64 to cover 34-bit constants in three instructions: sethi/sllx/or3. The sllx shifts up by 3 to make room for a simm13. (If we don't want to disturb set64, then add a new macro set34. But I say go for it. I wish the set64 macro got a little more love.) A couple of places use 'sethi' instead of the 'set' macro; I'd use 'set' to avoid puzzling the reader with details of micro-optimization, although it's a matter of style. The long sequences which perform byte reversal should use a macro; you can name it 'reverse_bytes_32'. If they are truly optimal then the macro will hide the noise. But if they are not, it will be much easier to swap in better instructions. 
The C2 intrinsics called ReverseBytes use ASI_PRIMARY_LITTLE to a stack temp, which is probably better. In any case, the decision needs to be hidden in a macro, not spread around in unrelated code. There are some minor things that would tidy the code. The first two instructions sllx/srlx should be replaced by clruwu (srl 0). The uses of cc's in the "subcc" instructions are all useless and slightly misleading; should be "sub" or (better) "dec". I personally find the use of cmp_and_brx_short peculiar with the constant 32, just because I know 32 doesn't actually work with the intended instruction. But I suppose it is OK, since the macro expands to something that works. -- John From james.cheng at oracle.com Tue May 5 00:57:46 2015 From: james.cheng at oracle.com (james cheng) Date: Mon, 04 May 2015 17:57:46 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> Message-ID: <5548158A.2010103@oracle.com> Hi John, On 5/4/2015 5:39 PM, John Rose wrote: > On May 4, 2015, at 2:30 PM, Vladimir Kozlov > wrote: >> >> http://cr.openjdk.java.net/~kvn/8073583/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8073583 >> >> Contributed by: James Cheng > > I took a quick look at the assembly code. I have a few comments. Many thanks for your quick review and comments. I will make changes as you suggested as soon as I can.
Thanks, -James From vladimir.kozlov at oracle.com Tue May 5 01:13:34 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 18:13:34 -0700 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <55101C52.1090506@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> Message-ID: <5548193E.7060803@oracle.com> I think that was the last mail on the open list. There were discussions in private. Here is the updated webrev: http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ It includes changes in java/math/BigInteger.java which added range checks and moved the intrinsified code into private methods. Andrew Haley has additional suggestions we can discuss now. Andrew, can you show the proposed changes to the java code? I would like to have only simple java changes (methods splitting) and do experiments/changes with word-reversal later. On 4/14/15 10:41 AM, Andrew Haley wrote: > On 04/14/2015 06:22 PM, Vladimir Kozlov wrote: > >> We are discussing how and which checks to add into java code which >> calls intrinsified methods to keep intrinsic simple. > > Yes, good idea. While you're in there, there's a couple of thoughts I'd > like to draw your attention to. > > Montgomery multiplication and squaring are implemented as separate > steps, like so: > > a = multiplyToLen(t, modLen, mult, modLen, a); > a = montReduce(a, mod, modLen, inv); > > a = squareToLen(t, modLen, a); > a = montReduce(a, mod, modLen, inv); > > It is possible to interleave the multiplication and Montgomery > reduction, and this can lead to a useful speedup on some > architectures.
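For readers unfamiliar with the montReduce step being discussed, here is a minimal sketch of Montgomery reduction (REDC) itself, written against BigInteger for clarity. It is not HotSpot's int-array implementation, and the class and variable names are ours; only the mathematics (t * r^-1 mod n without dividing by n) matches what montReduce computes:

```java
import java.math.BigInteger;

public class MontgomeryDemo {
    static final int K = 128;                                  // r = 2^K
    static final BigInteger R = BigInteger.ONE.shiftLeft(K);

    // REDC: for odd n < r and t < n*r, returns t * r^{-1} mod n using only
    // multiplications, a mask and a shift -- no division by n.
    static BigInteger redc(BigInteger t, BigInteger n, BigInteger nPrime) {
        BigInteger m = t.multiply(nPrime).and(R.subtract(BigInteger.ONE)); // m = t * n' mod r
        BigInteger u = t.add(m.multiply(n)).shiftRight(K);                 // (t + m*n) / r, exact
        return u.compareTo(n) >= 0 ? u.subtract(n) : u;
    }

    public static void main(String[] args) {
        // odd modulus below 2^128 (2^127 - 1)
        BigInteger n = new BigInteger("170141183460469231731687303715884105727");
        BigInteger nPrime = n.modInverse(R).negate().mod(R);   // n' = -n^{-1} mod r
        BigInteger a = new BigInteger("123456789123456789");
        BigInteger b = new BigInteger("987654321987654321");

        BigInteger aBar = a.multiply(R).mod(n);                // Montgomery form of a
        BigInteger bBar = b.multiply(R).mod(n);                // Montgomery form of b
        BigInteger prodBar = redc(aBar.multiply(bBar), n, nPrime); // "multiply then montReduce"
        BigInteger prod = redc(prodBar, n, nPrime);            // convert back out

        if (!prod.equals(a.multiply(b).mod(n)))
            throw new AssertionError("Montgomery reduction mismatch");
        System.out.println("ok");
    }
}
```

The interleaving Andrew describes fuses the multiply producing t with the per-word reduction steps, instead of materializing the full double-length t first.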
It would be nice if Montgomery multiplication and > squaring were factored into separate methods, and then they could be > replaced by intrinsics. > > Also, all these word-reversal and misaligned long stores / loads in > the multiplyToLen intrinsic code are a real PITA. If we word-reversed > the arrays so that they were in little-endian form we'd have neither > misaligned long stores / loads nor repeated word-reversals. We could > do the word reversal on the stack: AFAICS it's unusual for > multiplyToLen to be called for huge bignums, and I suppose if it did > happen for a bignum larger than some threshold we could do the word > reversal on the heap. > > Andrew. Thanks, Vladimir On 3/23/15 6:59 AM, Florian Weimer wrote: > On 03/20/2015 11:45 PM, Viswanathan, Sandhya wrote: >> Hi Florian, >> >> My thoughts on this are as follows: >> >> BigInteger.squareToLen is a private method and not a public method. >> The length calculation code in Java version of this method does not have the overflow check and the intrinsic follows the Java code. >> >> private static final int[] squareToLen(int[] x, int len, int[] z) { >> ... >> int zlen = len << 1; >> if (z == null || z.length < zlen) >> z = new int[zlen]; >> ... >> } > > The difference is that the Java code will still perform the bounds > checks on each array access, I think, so even if zlen turns out negative > (and thus no reallocation happens), damage from out-of-bounds accesses > will be non-existent. > >> Also the underlying array in BigInteger cannot be greater than MAX_MAG_LENGTH which is defined as: >> >> private static final int MAX_MAG_LENGTH = Integer.MAX_VALUE / Integer.SIZE + 1; // (1 << 26) >> >> So zlen calculation cannot overflow as int array x and its length len is coming from a BigInteger array. > > Maybe can you add this as a comment to the intrinsic? 
I think this > would be a useful addition, especially if at some point in the future, > someone else uses your code as a template to implement their own intrinsic. > >> Similarly mulAdd is a package private method and its inputs are allocated or verified at call sites. > > Same here. > From john.r.rose at oracle.com Tue May 5 01:21:27 2015 From: john.r.rose at oracle.com (John Rose) Date: Mon, 4 May 2015 18:21:27 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <5548158A.2010103@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> <5548158A.2010103@oracle.com> Message-ID: <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> Thanks, James. One more comment, which is at a higher level: Could we recode the loop control in Java and use unsafe to handle word and byte loads? Then we would only need single instruction intrinsics. ? John > On May 4, 2015, at 5:57 PM, james cheng wrote: > > Hi John, > >> On 5/4/2015 5:39 PM, John Rose wrote: >>> On May 4, 2015, at 2:30 PM, Vladimir Kozlov > wrote: >>> >>> http://cr.openjdk.java.net/~kvn/8073583/webrev.00/ >>> https://bugs.openjdk.java.net/browse/JDK-8073583 >>> >>> Contributed by: James Cheng >> >> I took a quick look at the assembly code. I have a few comments. > > Many thanks for your quick review and comments. I will make changes > as you suggested as soon as I can. 
> > Thanks, > -James From james.cheng at oracle.com Tue May 5 03:04:14 2015 From: james.cheng at oracle.com (James Cheng) Date: Mon, 4 May 2015 20:04:14 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> <5548158A.2010103@oracle.com> <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> Message-ID: <245E1236-F423-45B8-93C0-9A9BB737371C@oracle.com> Hi John, > On May 4, 2015, at 6:21 PM, John Rose wrote: > > One more comment, which is at a higher level: Could we recode the loop control in Java and use unsafe to handle word and byte loads? Then we would only need single instruction intrinsics. We could, I guess, but that means we'd need to rewrite the pure Java CRC32C in JDK. More difficult is how we implement the CRC32C methods so that they are not favoring one platform while hindering others. I am afraid that the CRC32C instructions on different platforms are too different to compromise. Thanks, -James From vladimir.kozlov at oracle.com Tue May 5 03:38:55 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 20:38:55 -0700 Subject: sync_kit() In-Reply-To: <5547A399.3040107@redhat.com> References: <5547A399.3040107@redhat.com> Message-ID: <55483B4F.5040803@oracle.com> There is a pair of sync_kit() methods: one in IdealKit and another in GraphKit. They are needed because IdealKit and GraphKit use different mechanisms to track the current code state (control, memory, io nodes). They are used if you need to intermix calls to IdealKit and GraphKit. If you start from IdealKit graph construction and need to call GraphKit methods to add new nodes into the graph, you need to call GraphKit::sync_kit() first; then, after all the sequential calls to GraphKit methods and before any IdealKit method is called, you need to call IdealKit::sync_kit(). A good example is in LibraryCallKit::insert_pre_barrier().
Regards, Vladimir On 5/4/15 9:51 AM, Andrew Haley wrote: > I'm trying to get my head around sync_kit(). I don't really know what > it does. Well, I know that it synchronizes IdealKit and graphKit, and > I know from experience that if I don't call sync_kit() my code doesn't > work. > > But what are the rules about when sync_kit() should be called? > > Thanks, > Andrew. > From john.r.rose at oracle.com Tue May 5 03:49:11 2015 From: john.r.rose at oracle.com (John Rose) Date: Mon, 4 May 2015 20:49:11 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <245E1236-F423-45B8-93C0-9A9BB737371C@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> <5548158A.2010103@oracle.com> <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> <245E1236-F423-45B8-93C0-9A9BB737371C@oracle.com> Message-ID: <5D45CF93-9A2E-4E00-A25A-9A101A3CDB26@oracle.com> On May 4, 2015, at 8:04 PM, James Cheng wrote: > > Hi John, > >> On May 4, 2015, at 6:21 PM, John Rose wrote: >> >> One more comment, which is at a higher level: Could we recode the loop control in Java and use unsafe to handle word and byte loads? Then we would only need single instruction intrinsics. > > We could, I guess, but that means we?d need to rewrite the pure Java CRC32C in JDK. > More difficult is how we implement the CRC32C methods so that they are not favoring > one platform while hindering others. I am afraid that the CRC32C instructions on different > platforms are too different to compromise. That may be; the vector size is CPU-dependent, for example. But a 64-bit vector is (currently) the sweet spot for writing vectorized code in Java, since 'long' is the biggest bit container in Java. (Note also that HotSpot JVM objects are aligned up to 8 byte boundaries, even after GC.) Another platform with larger vectors would have to use assembly language anyway (which Intel does), but Java code can express 64-bit vectorized loops. 
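As a rough illustration of the "Java code can express 64-bit vectorized loops" point, here is a hedged sketch that uses a plain long as an 8-byte-wide lane, without Unsafe; the example, class and method names are ours, not from any proposed patch:

```java
public class LongLaneDemo {
    // XOR-fold a byte array using a long as an 8-byte "vector" lane.
    static long xorFold(byte[] data) {
        long acc = 0;
        int i = 0;
        // main loop: consume 8 bytes per iteration, packed little-endian into one long
        for (; i + 8 <= data.length; i += 8) {
            long v = 0;
            for (int j = 7; j >= 0; j--) {
                v = (v << 8) | (data[i + j] & 0xFFL);
            }
            acc ^= v;
        }
        // tail loop: fold the remaining bytes into the low lanes one at a time
        for (int j = 0; i < data.length; i++, j++) {
            acc ^= (data[i] & 0xFFL) << (8 * j);
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] d = new byte[13];
        for (int i = 0; i < d.length; i++) d[i] = (byte) (i + 1);
        // scalar reference: each byte XORed into lane (index mod 8)
        long ref = 0;
        for (int i = 0; i < d.length; i++) ref ^= (d[i] & 0xFFL) << (8 * (i % 8));
        if (xorFold(d) != ref) throw new AssertionError("lane fold mismatch");
        System.out.println("ok");
    }
}
```

The machine-specific knobs John mentions (stream count, prefetch distance) are exactly what such a Java loop cannot express, which is the trade-off under discussion.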
For CRC, the desirable number of distinct streams, and the prefetch mode and distance, are also CPU-dependent. For those variations, injecting machine-specific parts into the Java-coded algorithm would get messy. The benefit of coding low-level vectorized loops in Java would be not having to code the loop logic in assembly code. If we could use byte buffers to manage the indexing, and/or had better array notations, it would probably be worthwhile moving from assembly to Java. At present it seems OK to code in assembly, *if* the assembly can be made more readable. We have a chicken and egg problem here: Nobody is going to experiment with Java-coded vector loops until we get single-vector CRC32[C] and XMULX instructions surfaced as C2-supported intrinsics. (We already have bit and byte reverse intrinsics, so that part is OK.) -- John From vladimir.kozlov at oracle.com Tue May 5 04:53:06 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 21:53:06 -0700 Subject: RFR(L): 8076284: Improve vectorization of parallel streams In-Reply-To: <553A936D.2020909@oracle.com> References: <39F83597C33E5F408096702907E6C450E3E586@ORSMSX104.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63334516@FMSMSX112.amr.corp.intel.com> <39F83597C33E5F408096702907E6C450E3E5A4@ORSMSX104.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63334531@FMSMSX112.amr.corp.intel.com> <39F83597C33E5F408096702907E6C450E3E734@ORSMSX104.amr.corp.intel.com> <55303823.1020205@oracle.com> <39F83597C33E5F408096702907E6C450E3F0AC@ORSMSX104.amr.corp.intel.com> <39F83597C33E5F408096702907E6C450E3F413@ORSMSX104.amr.corp.intel.com> <553A936D.2020909@oracle.com> Message-ID: <55484CB2.6060205@oracle.com> I reviewed it and tested in JPRT. If nobody objects I will push it.
Thanks, Vladimir On 4/24/15 12:03 PM, Vladimir Kozlov wrote: > Updated webrev: > > http://cr.openjdk.java.net/~kvn/8076284/webrev.01/ > > Vladimir > > On 4/20/15 11:45 PM, Civlin, Jan wrote: >> Vladimir, >> >> Here is the description and new patch with the changes you recommended (except the last one - see below my explanation). >> >> >> The patch description. >> >> This patch provides on-demand vectorization/SIMD'ing of a method specified in the JVM command line as >> -XX:CompileCommand=option,<method>,Vectorize. >> This optimization may be globally disabled by setting the flag -XX:-AllowVectorizeOnDemand (by default it is true). >> >> For each method that was specified with the Vectorize option we do the following: >> >> 1. On each iteration of loop unroll for a given method (loopopts.cpp) we generate the next _clone_idx (which will be >> common for all the nodes cloned in this iteration); and on each node cloning we hash the _idx of the origin of the node >> that is cloned (_idx_clone_orig ) and the _clone_idx (cm.verify_insert_and_clone). >> CloneMap belongs to Compile and is created in CompilerWrapper. >> >> 2. In SuperWord optimization, after max_depth has been built, we are hoisting the loads. >> For this, for each Load_X (the subject of the hoisting) we find some Load_0 that has the same origin as Load_X but belongs >> to the first iteration, i.e. if the MemNode::Memory input of Load_0 is a memory Phi (collected previously in >> memory_slice) we set this Phi also as the MemNode::Memory input of Load_X. After this rebuild of the graph we restart >> the Superword optimization. >> >> The major routines here: >> - SuperWord::mark_generations: computes _ii_first (the index _clone_idx) of the nodes that have MemNode::Memory input >> coming from a phi node in some slice; computes the list of the nodes in the first and last iterations of the loop.
>> - SuperWord::hoist_loads_in_graph: for each memory slice (a phi node) visits each load that has this phi as a memory >> input and then for each other load that has the same origin makes the memory input coming from the phi. This routine >> does not use marking generations mechanism. >> - SuperWord::pack_parallel - this routine is called only if SuperWord fails to produce any pack after >> extend_packlist(); it is another algorithm for packing instructions into SIMD. >> It goes thru the list of all instructions in the _iteration_first, and if it is a Load, Store, Add or Mul it starts a >> new pack and adds this instruction to this pack. Then the algorithm circulates thru the iterations of the loop (gen < >> _ii_order.length()) and over the instructions list and finds the node with the origin coinciding with the origin of >> the nodes already in the pack - then it adds this node to the pack. Once packs are built, SuperWord returns to the >> normal processing (combine_packs()). >> >> Note, that neither 2 or 3 goes thru the data dependency analysis, since the correctness of parallelization was >> guaranteed by the user. >> Note, that some checks in added code could be omitted. But we are not assume that there is no optimization (now or in >> the future) that can change the graph structure between loop unrolling and the SuperWord, so we prefer to run many >> (probably unneeded) checks in the SuperWord. >> >> >> >>>> Can you also utilize changes done by Michael Berg for reduction >>>> optimization (the code in jdk9/hs-comp already)? I mean marking some >>>> nodes before unrolling and searching Phis. >> Michael Berg and I looked at what you suggested (phi marking before unroll?) and we both think this marking is very >> different than what I do: Michael's phi marking is just one bit per node (in the flags) whereas I collect the >> _idx_clone_orig and the unroll generation (clone_idx). 
>> >> >> -----Original Message----- >> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >> Sent: Thursday, April 16, 2015 3:31 PM >> To: Civlin, Jan; hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR(S): 8076284: Improve vectorization of parallel streams >> >> Hi Jan, >> >> You did not describe your changes in details (what they do). >> >> IgnoreVectorizeMethod flag should positive and enabled by default. >> Rename it to AllowVectorizeOnDemand (or something similar): >> >> + product(bool, AllowVectorizeOnDemand, true, >> \ >> >> Instead of next you should add intrinsic definition to >> and classfile/vmSymbols.hpp and then check method()->intrinsic_id(): >> >> + if (strcmp("forEachRemaining", method()->name()->as_quoted_ascii()) >> == 0 && method()->signature() != 0 >> + && method()->signature()->as_symbol() != 0 && >> method()->signature()->as_symbol()->as_quoted_ascii() != 0 ) { >> + if >> (strstr(method()->signature()->as_symbol()->as_quoted_ascii(),"Ljava/util/function/IntConsumer")) >> { >> + set_do_vector_loop(true); >> + } >> + } >> >> And that should be under flag too because in general forEachRemaining >> should be vectorized only if it is safe. >> >> Can you also utilize changes done by Michael Berg for reduction >> optimization (the code in jdk9/hs-comp already)? I mean marking some >> nodes before unrolling and searching Phis. >> >> Regards, >> Vladimir >> >> On 4/13/15 3:33 AM, Civlin, Jan wrote: >>> Hi All, >>> >>> >>> We would like to contribute the improvement of vectorization of >>> parallel streams from Intel. >>> >>> The contribution Bug ID: 8076284. >>> >>> Please review this patch: >>> >>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8076284 >>> >>> webrev: http://cr.openjdk.java.net/~kvn/8076284/webrev/ >>> >>> >>> *Description* >>> >>> Improve vectorization of the unordered parallel streams (by vectorizing >>> forEachRemaining method). 
>>> For example, this forEach will be vectorized: >>> >>> java.util.stream.IntStream iStream = java.util.stream.IntStream.range(0, >>> RANGE - 1).parallel(); >>> >>> iStream.forEach( id -> c[id] = c[id] + c[id+1] ); >>> >>> It also enables on-demand loop vectorization in a given method (by >>> providing more hints to SuperWord optimization). >>> >>> For example, use -XX:CompileCommand=option,computeCall,Vectorize to >>> vectorize this loop >>> >>> void computeCall(double [] Call, double puByDf, double pdByDf) >>> >>> { >>> >>> for(int i = timeStep; i > 0; i--) >>> >>> for(int j = 0; j <= i - 1; j++) >>> >>> Call[j] = puByDf * Call[j + 1] + pdByDf * Call[j]; >>> >>> } >>> >>> >>> This enhancement is contributed by Intel and sponsored by the hotspot >>> compiler team. >>> >> From aph at redhat.com Tue May 5 09:06:11 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 05 May 2015 10:06:11 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <5548193E.7060803@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> Message-ID: <55488803.4020802@redhat.com> On 05/05/15 02:13, Vladimir Kozlov wrote: > I think that was the last mail on the open list. There were discussions in private. > Here is the updated webrev: > > http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ > > It includes changes in java/math/BigInteger.java which added range > checks and moved the intrinsified code into private methods. > > Andrew Haley has additional suggestions we can discuss now. Andrew, > can you show proposed changes to java code? OK.
The changes I'd like to make to the Java code are fairly minor, but they make possible a better design of the intrinsics. It's not really possible to write a 64-bit version of Montgomery reduction in pure Java (because it uses a 128-bit product internally) but I will produce a JNI version as a demonstration of what I'd like to do in an intrinsic. > I would like to have only simple java changes (methods splitting) > and do experiments/changes with word-reversal later. Sure. The word-reversal is internal to the native intrinsics; it won't affect the Java code at all. Andrew. From tobias.hartmann at oracle.com Tue May 5 13:38:05 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 05 May 2015 15:38:05 +0200 Subject: [9] RFR(S): 8078497: C2's superword optimization causes unaligned memory accesses In-Reply-To: <55477DB1.5050200@oracle.com> References: <5540D237.8000107@oracle.com> <5541172D.5010200@oracle.com> <5541F0B1.50804@oracle.com> <554254A9.6060608@oracle.com> <554705ED.4050305@oracle.com> <55477DB1.5050200@oracle.com> Message-ID: <5548C7BD.7050601@oracle.com> Hi, I verified that this is a problem in the baseline version and not related to my fix. I filed [1] and will take care of it. I will wait to push JDK-8078497 until this is fixed. Best, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8079343 On 04.05.2015 16:09, Tobias Hartmann wrote: > Hi, > > looks like there is still a problem I didn't catch. With my regression test I sometimes hit [1] > > # Internal Error (/opt/jprt/T/P1/130607.tohartma/s/hotspot/src/share/vm/opto/loopnode.cpp:3231), pid=18188, tid=0x00002b00dacde700 > # assert(!had_error) failed: bad dominance > > I'm not sure if this is related to my fix but I assume it is an existing problem that was just triggered by my regression test. > > I'll further investigate.
> > Thanks, > Tobias > > [1] http://sthjprt.se.oracle.com/archives/2015/05/2015-05-04-130607.tohartma.8078497/logs/linux_x64_2.6-fastdebug-c2-hotspot_compiler_3.log.FAILED.log > > On 04.05.2015 07:38, Tobias Hartmann wrote: >> Thanks, Vladimir. >> >> Best, >> Tobias >> >> On 30.04.2015 18:13, Vladimir Kozlov wrote: >>> Looks good. >>> >>> Thanks, >>> Vladimir >>> >>> On 4/30/15 2:06 AM, Tobias Hartmann wrote: >>>> Hi Vladimir, >>>> >>>> thanks for the review! >>>> >>>> On 29.04.2015 19:38, Vladimir Kozlov wrote: >>>>> Consider reducing number of iterations in test from 1M so that the test does not timeout on slow platforms. >>>> >>>> I reduced the number of iterations to 20k and verified that it is enough to trigger compilation and crash on Sparc. >>>> >>>>> lines 511-515 are duplicates of 486-491. Consider moving them before vw%span check. >>>> >>>> Thanks, I refactored that part. >>>> >>>>> Typo 'iff': >>>>> + // final offset is a multiple of vw iff init_offset is a multiple. >>>> >>>> I used 'iff' to refer to 'if and only if' [1]. I changed it. >>>> >>>> Here is the new webrev: >>>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.01/ >>>> >>>> Thanks, >>>> Tobias >>>> >>>> [1] http://en.wikipedia.org/wiki/If_and_only_if >>>> >>>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>> On 4/29/15 5:44 AM, Tobias Hartmann wrote: >>>>>> Hi, >>>>>> >>>>>> please review the following patch. >>>>>> >>>>>> https://bugs.openjdk.java.net/browse/JDK-8078497 >>>>>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.00/ >>>>>> >>>>>> Background information (simplified): >>>>>> After loop optimizations, C2 tries to vectorize memory operations by finding and merging adjacent memory accesses in the loop body. The superword optimization matches MemNodes with the following pattern as address input: >>>>>> >>>>>> ptr + k*iv + constant [+ invar] >>>>>> >>>>>> where iv is the loop induction variable, k is a scaling factor that may be 0 and invar is an optional loop invariant value. 
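[To make the address shape above concrete, here is a small illustrative Java sketch; the values and names are hypothetical and it is not part of the original review. It only checks that an access like a[2*i + inv + 3] really decomposes into ptr + k*iv + constant + invar.]

```java
// Illustrative sketch only: shows how the byte address of a[2*i + inv + 3]
// decomposes into the shape ptr + k*iv + constant [+ invar] that the
// SuperWord optimization pattern-matches. All names are hypothetical.
public class AddressShape {
    static final int ELEM = 4; // element size of int, in bytes

    // ptr + k*iv + constant + invar, all in bytes
    static long decomposed(long ptr, int i, int inv) {
        long k = 2L * ELEM;            // scale from the 2*i term
        long constant = 3L * ELEM;     // from the +3 term
        long invar = (long) inv * ELEM;
        return ptr + k * i + constant + invar;
    }

    // Same address computed directly from the index expression
    static long direct(long ptr, int i, int inv) {
        return ptr + (long) (2 * i + inv + 3) * ELEM;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 16; i++) {
            for (int inv = 0; inv < 8; inv++) {
                if (decomposed(64, i, inv) != direct(64, i, inv)) {
                    throw new AssertionError("mismatch at i=" + i + " inv=" + inv);
                }
            }
        }
        System.out.println("decomposition matches direct address computation");
    }
}
```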
C2 then picks the memory operation with most similar references and tries to align it in the main loop by setting the number of pre-loop iterations accordingly. Other adjacent memory operations that conform to this alignment are merged into packs. After extending and filtering these packs, final vector operations are emitted. >>>>>> >>>>>> The problem is that some architectures (for example, Sparc) require memory accesses to be aligned and in special cases the superword optimization generates code that contains unaligned vector instructions. >>>>>> >>>>>> Problems: >>>>>> (1) If two memory operations have different loop invariant offset values and C2 decides to align the main loop to one of them, we can always set the invariant of the other operation such that we break the alignment constraint. For example, if we have a store to a byte array a[i+inv1] and a load from a byte array b[i+inv2] (where i is the induction variable), C2 may decide to align the main loop such that (i+inv1 % 8) == 0 and replace the adjacent byte stores by a 8-byte double word store. Also the byte loads will be replaced by a double word load. If we then set inv2 = inv1 + 1 at runtime, the load will fail with a SIGBUS because the access is not 8-byte aligned, i.e., (i+inv2 % 8) != 0. Vectorization should not take place in this case. >>>>>> >>>>>> (2) If C2 decides to align the main loop to a memory operation, the necessary adjustment of the induction variable by the pre-loop is computed such that the resulting offset in the main loop is aligned to the vector width (see 'SuperWord::get_iv_adjustment'). If the offset of a memory operation is independent of the loop induction variable (i.e., scale k is 0), the iv adjustment should be 0. >>>>>> >>>>>> (3) If the loop span is greater than the vector width 'vw' of a memory operation, the superword optimization assumes that the pre-loop is never able to align this operation in the main loop (see 'SuperWord::ref_is_alignable'). 
This is wrong because if the loop span is a multiple of vw and depending on the initial offset, it may very well be possible to align the operation to vw. >>>>>> >>>>>> These problems originally only showed up in the string density code base (JDK-8054307) where we have a putChar intrinsic that writes a char value to two entries of a byte array. To reproduce the issue with JDK9 we have to make sure that the memory operations: >>>>>> (i) are independent, >>>>>> (ii) have different invariants, >>>>>> (iii) are not too complex (because then vectorization will not take place). >>>>>> >>>>>> To guarantee (i) we either need two different arrays (for example, byte[] and char[]) for the load and store operations or the same array but different offsets. Since the offsets should include a loop invariant part (ii), the superword optimization will not be able to determine that the runtime offset values do not overlap. We therefore use Unsafe.getChar/putChar to read/write a char value from/to a byte/char array and thereby guarantee independence. >>>>>> >>>>>> I came up with the following test (see 'TestVectorizationWithInvariant.java'): >>>>>> >>>>>> byte[] src = new byte[1000]; >>>>>> char[] dst = new char[1000]; >>>>>> >>>>>> for (int i = (int) CHAR_ARRAY_OFFSET; i < 100; i = i + 8) { >>>>>> // Copy 8 chars from src to dst >>>>>> unsafe.putChar(dst, i + 0, unsafe.getChar(src, off + 0)); >>>>>> [...] >>>>>> unsafe.putChar(dst, i + 14, unsafe.getChar(src, off + 14)); >>>>>> } >>>>>> >>>>>> Problem (1) shows up since the main loop will be aligned to the StoreC[i + 0] which has no invariant. However, the LoadUS[off + 0] has the loop invariant 'off'. Setting off to BYTE_ARRAY_OFFSET + 2 will break the alignment of the emitted double word load and result in a crash. >>>>>> >>>>>> The LoadUS[off + 0] in the above example is independent of the loop induction variable and therefore has no scale value.
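[The failure mode of problem (1) can be sketched in a few lines of illustrative Java; the invariant values are hypothetical, not taken from the actual test. Aligning the store's offset to 8 bytes constrains nothing about the load's offset once the two invariants can differ at runtime.]

```java
// Illustrative sketch only: if the pre-loop chooses the iteration count so
// that (i + inv1) % 8 == 0 in the main loop, an independent invariant
// inv2 = inv1 + 1 leaves the second access misaligned for an 8-byte
// vector access. All values here are hypothetical.
public class AlignmentHazard {
    static boolean aligned8(int byteOffset) {
        return byteOffset % 8 == 0;
    }

    public static void main(String[] args) {
        int inv1 = 3;
        int inv2 = inv1 + 1; // set independently at runtime

        // Find the first i >= 0 that aligns the store address i + inv1.
        int i = 0;
        while (!aligned8(i + inv1)) i++;

        System.out.println("store offset aligned: " + aligned8(i + inv1)); // true
        System.out.println("load offset aligned:  " + aligned8(i + inv2)); // false
    }
}
```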
Because of problem (2), the iv adjustment is computed as -4 (see 'SuperWord::get_iv_adjustment'). >>>>>> >>>>>> offset = 16 iv_adjust = -4 elt_size = 2 scale = 0 iv_stride = 16 vect_size 8 >>>>>> >>>>>> The regression test contains additional test cases that also trigger problem (3). >>>>>> >>>>>> Solution: >>>>>> (1) I added a check to 'SuperWord::find_adjacent_refs' to make sure that memory accesses with different invariants are only vectorized if the architecture supports unaligned memory accesses. >>>>>> (2) 'SuperWord::get_iv_adjustment' is modified such that it returns 0 if the memory operation is independent of iv. >>>>>> (3) In 'SuperWord::ref_is_alignable' we need to additionally check if span is divisible by vw. If so, the pre-loop will add multiples of vw to the initial offset and if that initial offset is divisible by vw, the final offset (after the pre-loop) will be divisible as well and we should return true. I added comments to the code describing this in detail. >>>>>> >>>>>> Testing: >>>>>> - original test with string-density codebase >>>>>> - regression test >>>>>> - JPRT >>>>>> >>>>>> Thanks, >>>>>> Tobias >>>>>> From manasthakur17 at gmail.com Thu May 7 04:27:36 2015 From: manasthakur17 at gmail.com (Manas Thakur) Date: Thu, 7 May 2015 09:57:36 +0530 Subject: Help regarding C2's Ideal IR Message-ID: Hello all Is there a help manual or web source describing the nodes and representation of C2's Ideal IR in detail? The details given on the wiki seem quite sparse. I want to understand the standard analyses and optimizations done by C2, and a detailed description would be very helpful. Regards, Manas -------------- next part -------------- An HTML attachment was scrubbed...
URL: From roland.westrelin at oracle.com Thu May 7 10:25:05 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 7 May 2015 12:25:05 +0200 Subject: RFR 8076276 support for AVX512 In-Reply-To: <553946F5.2090009@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> Message-ID: <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> > http://cr.openjdk.java.net/~kvn/8076276/webrev.02 This looks good to me. A few minor remarks: Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. Not sure why you dropped: 3463 // Float register operands 3473 // Double register operands in x86_64.ad In chaitin.hpp: 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else comment is not aligned with the one below anymore. Roland. From vladimir.kozlov at oracle.com Thu May 7 17:40:07 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 10:40:07 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> Message-ID: <554BA377.8050700@oracle.com> Thanks, Roland On 5/7/15 3:25 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 > > This looks good to me. A few minor remarks: > > Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about?
> > In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. > > Not sure why you dropped: > > 3463 // Float register operands > 3473 // Double register operands > > in x86_64.ad > > In chaitin.hpp: > > 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else > > comment is not aligned with the one below anymore. > > Roland. > I also have questions to Michael. Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? Thanks, Vladimir From roland.westrelin at oracle.com Thu May 7 18:40:01 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 7 May 2015 20:40:01 +0200 Subject: RFR 8076276 support for AVX512 In-Reply-To: <554BA377.8050700@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: >>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >> >> This looks good to me. A few minor remarks: >> >> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. > > Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? I'm talking about: 4101 #ifdef _LP64 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ for instance in x86.ad Why isn't it in x86_64.ad? Roland. > >> >> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters.
>> >> Not sure why you dropped: >> >> 3463 // Float register operands >> 3473 // Double register operands >> >> in x86_64.ad >> >> In chaitin.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> comment is not aligned with the one below anymore. >> >> Roland. >> > > I also have questions to Michael. > > Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? > > Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? > > Thanks, > Vladimir From vladimir.kozlov at oracle.com Thu May 7 19:26:05 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 12:26:05 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: <554BBC4D.6020501@oracle.com> On 5/7/15 11:40 AM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>> >>> This looks good to me. A few minor remarks: >>> >>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >> >> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? I wanted to keep all vector instructions in one file. That is all. Vladimir > > Roland. >> >>> >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters.
>>> >>> Not sure why you dropped: >>> >>> 3463 // Float register operands >>> 3473 // Double register operands >>> >>> in x86_64.ad >>> >>> In chaitin.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> comment is not aligned with the one below anymore. >>> >>> Roland. >>> >> >> I also have questions to Michael. >> >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >> >> Thanks, >> Vladimir > From roland.westrelin at oracle.com Thu May 7 20:07:54 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 7 May 2015 22:07:54 +0200 Subject: RFR 8076276 support for AVX512 In-Reply-To: <554BBC4D.6020501@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BBC4D.6020501@oracle.com> Message-ID: <906035A6-CA25-4EA5-8D91-CC683A0F289B@oracle.com> > I wanted to keep all vector instructions in one file. That is all. Ok. Roland. From michael.c.berg at intel.com Thu May 7 20:38:20 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 7 May 2015 20:38:20 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: Roland, For the usage of LP64 in the x86.ad file, I was following the existing examples, like replicate, which have specific forms for each type LONG usage, if applicable. I wanted to wait to add the 32-bit LONG versions of some of these macros (until later) as they expand to be larger than their 64-bit counterparts and are thought to be of lower performance. 
Currently we make no mention of vector instruction definitions outside of the x86.ad file definitions, so this is in keeping with both methods/attributes of instruction definition.
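[As a rough illustration of the CPUID-style selection these x86.ad definitions rely on, here is an illustrative Java sketch with hypothetical names; it is not the actual adlc-generated code, it only mirrors the idea of choosing the legacy 16-register XMM bank or the 32-register EVEX bank from a runtime feature flag.]

```java
// Illustrative sketch only (hypothetical names): selecting a vector
// register bank based on a runtime CPUID-style feature check, analogous
// to how UseAVX > 2 guards select the EVEX-only definitions.
public class RegBankSelect {
    static final int LEGACY_XMM_COUNT = 16; // xmm0..xmm15
    static final int EVEX_XMM_COUNT   = 32; // xmm0..xmm31 with AVX-512

    // Bit mask with one bit per allocatable vector register.
    static long vectorRegMask(boolean supportsEvex) {
        int n = supportsEvex ? EVEX_XMM_COUNT : LEGACY_XMM_COUNT;
        return (n == 64) ? -1L : (1L << n) - 1;
    }

    public static void main(String[] args) {
        System.out.printf("legacy mask: %x%n", vectorRegMask(false));
        System.out.printf("evex mask:   %x%n", vectorRegMask(true));
    }
}
```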
>> >> Not sure why you dropped: >> >> 3463 // Float register operands >> 3473 // Double register operands >> >> in x86_64.ad >> >> In chaitin.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> comment is not aligned with the one below anymore. >> >> Roland. >> > > I also have questions to Michael. > > Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? > > Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? > > Thanks, > Vladimir From vladimir.kozlov at oracle.com Thu May 7 20:48:41 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 13:48:41 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: <554BCFA9.1070103@oracle.com> You did not answered my questions I think. Or I don't understand your answer: >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >> And Roland's: >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. Thanks, Vladimir On 5/7/15 1:38 PM, Berg, Michael C wrote: > Roland, > > For the usage of LP64 in the x86.ad file, I was following the existing examples, like replicate, which have specific forms for each type LONG usage, if applicable. > I wanted to wait to add the 32-bit LONG versions of some of these macros (until later) as they expand to be larger than their 64-bit counterparts and are thought to be of lower performance. > Currently we no make mention of vector instruction definitions outside of the x86.ad file definitions, so this is in keeping with both methods/attributes of instruction definition. 
> New Instructions: The 64x2, 64x4, 32x4, 32x8 versions of extract/insert, kmov are all new. There are new vector multiply instructions as well as extended forms that do not exist on earlier microarchitectures. > See CPUID guards in the instruction definitions, there many that are AVX3 (UseAVX > 2) only for which are also evident in the assembler layer as supports_evex() instructions. This follows the current AVX512 ISA document in definition (available at: https://software.intel.com/en-us/isa-extensions . By and large the assembler is now savy enough to encode all legacy x86 microarchitecture instructions, while supporting the new ones and the new encoding model. > > Thanks, > Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin > Sent: Thursday, May 07, 2015 11:40 AM > To: Vladimir Kozlov > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > >>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>> >>> This looks good to me. A few minor remarks: >>> >>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >> >> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? > > Roland. >> >>> >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. 
>>> >>> Not sure why you dropped: >>> >>> 3463 // Float register operands >>> 3473 // Double register operands >>> >>> in x86_64.ad >>> >>> In chaitin.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> comment is not aligned with the one below anymore. >>> >>> Roland. >>> >> >> I also have questions to Michael. >> >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >> >> Thanks, >> Vladimir > From michael.c.berg at intel.com Thu May 7 21:07:29 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 7 May 2015 21:07:29 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: Roland, VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. In x86.64.ad: >> 3463 // Float register operands >> 3473 // Double register operands I will re-add the comments. In chaitain.hpp: 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else I will fix the comment location. 
Thanks, Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin Sent: Thursday, May 07, 2015 11:40 AM To: Vladimir Kozlov Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 >>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >> >> This looks good to me. A few minor remarks: >> >> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. > > Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? I'm talking about: 4101 #ifdef _LP64 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ for instance in x86.ad Why isn't it in x86_64.ad? Roland. > >> >> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >> >> Not sure why you dropped: >> >> 3463 // Float register operands >> 3473 // Double register operands >> >> in x86_64.ad >> >> In chaitin.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> comment is not aligned with the one below anymore. >> >> Roland. >> > > I also have questions to Michael. > > Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? > > Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? 
> > Thanks, > Vladimir From vladimir.kozlov at oracle.com Thu May 7 21:27:38 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 14:27:38 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: <554BD8CA.1090409@oracle.com> On 5/7/15 2:07 PM, Berg, Michael C wrote: > Roland, > > VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. > 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. Regards, Vladimir > I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). > If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. > > In x86.64.ad: > >>> 3463 // Float register operands >>> 3473 // Double register operands > > I will re-add the comments. > > In chaitain.hpp: > > 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else > > I will fix the comment location. 
> > Thanks, > Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin > Sent: Thursday, May 07, 2015 11:40 AM > To: Vladimir Kozlov > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > >>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>> >>> This looks good to me. A few minor remarks: >>> >>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >> >> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? > > Roland. >> >>> >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>> >>> Not sure why you dropped: >>> >>> 3463 // Float register operands >>> 3473 // Double register operands >>> >>> in x86_64.ad >>> >>> In chaitin.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> comment is not aligned with the one below anymore. >>> >>> Roland. >>> >> >> I also have questions to Michael. >> >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? 
>> >> Thanks, >> Vladimir > From michael.c.berg at intel.com Thu May 7 22:43:43 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 7 May 2015 22:43:43 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: <554BD8CA.1090409@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> Message-ID: Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, May 07, 2015 2:28 PM To: Berg, Michael C; Roland Westrelin Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 On 5/7/15 2:07 PM, Berg, Michael C wrote: > Roland, > > VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. > 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. Regards, Vladimir > I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). 
> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. > > In x86.64.ad: > >>> 3463 // Float register operands >>> 3473 // Double register operands > > I will re-add the comments. > > In chaitain.hpp: > > 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else > > I will fix the comment location. > > Thanks, > Michael > > -----Original Message----- > From: hotspot-compiler-dev > [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of > Roland Westrelin > Sent: Thursday, May 07, 2015 11:40 AM > To: Vladimir Kozlov > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > >>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>> >>> This looks good to me. A few minor remarks: >>> >>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >> >> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, > regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? > > Roland. >> >>> >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>> >>> Not sure why you dropped: >>> >>> 3463 // Float register operands >>> 3473 // Double register operands >>> >>> in x86_64.ad >>> >>> In chaitin.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> comment is not aligned with the one below anymore. >>> >>> Roland. >>> >> >> I also have questions to Michael. >> >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? 
Do you mean evex is not supported in 32-bit? >> >> Thanks, >> Vladimir > From vladimir.kozlov at oracle.com Thu May 7 23:32:31 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 16:32:31 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> Message-ID: <554BF60F.9000401@oracle.com> You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. Thanks, Vladimir On 5/7/15 3:43 PM, Berg, Michael C wrote: > Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. > > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Thursday, May 07, 2015 2:28 PM > To: Berg, Michael C; Roland Westrelin > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > On 5/7/15 2:07 PM, Berg, Michael C wrote: >> Roland, >> >> VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. >> 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. 
> > > For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. > > Regards, > Vladimir > >> I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). >> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. >> >> In x86.64.ad: >> >>>> 3463 // Float register operands >>>> 3473 // Double register operands >> >> I will re-add the comments. >> >> In chaitain.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> I will fix the comment location. >> >> Thanks, >> Michael >> >> -----Original Message----- >> From: hotspot-compiler-dev >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >> Roland Westrelin >> Sent: Thursday, May 07, 2015 11:40 AM >> To: Vladimir Kozlov >> Cc: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR 8076276 support for AVX512 >> >>>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>>> >>>> This looks good to me. A few minor remarks: >>>> >>>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >>> >>> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? >> >> I'm talking about: >> 4101 #ifdef _LP64 >> 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, >> regF tmp, regF tmp2) %{ >> >> for instance in x86.ad >> Why isn't it in x86_64.ad? >> >> Roland. 
>>> >>>> >>>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>>> >>>> Not sure why you dropped: >>>> >>>> 3463 // Float register operands >>>> 3473 // Double register operands >>>> >>>> in x86_64.ad >>>> >>>> In chaitin.hpp: >>>> >>>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>>> >>>> comment is not aligned with the one below anymore. >>>> >>>> Roland. >>>> >>> >>> I also have questions to Michael. >>> >>> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >>> >>> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >>> >>> Thanks, >>> Vladimir >> From michael.c.berg at intel.com Thu May 7 23:41:07 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 7 May 2015 23:41:07 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: <554BF60F.9000401@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> <554BF60F.9000401@oracle.com> Message-ID: Ok, I will leave that way it is plus the comment. Thanks, -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, May 07, 2015 4:33 PM To: Berg, Michael C; Roland Westrelin Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. Thanks, Vladimir On 5/7/15 3:43 PM, Berg, Michael C wrote: > Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. 
On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. > > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Thursday, May 07, 2015 2:28 PM > To: Berg, Michael C; Roland Westrelin > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > On 5/7/15 2:07 PM, Berg, Michael C wrote: >> Roland, >> >> VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. >> 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. > > > For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. > > Regards, > Vladimir > >> I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). >> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. >> >> In x86.64.ad: >> >>>> 3463 // Float register operands >>>> 3473 // Double register operands >> >> I will re-add the comments. >> >> In chaitain.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> I will fix the comment location. 
>> >> Thanks, >> Michael >> >> -----Original Message----- >> From: hotspot-compiler-dev >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >> Roland Westrelin >> Sent: Thursday, May 07, 2015 11:40 AM >> To: Vladimir Kozlov >> Cc: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR 8076276 support for AVX512 >> >>>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>>> >>>> This looks good to me. A few minor remarks: >>>> >>>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >>> >>> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? >> >> I'm talking about: >> 4101 #ifdef _LP64 >> 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, >> regF tmp, regF tmp2) %{ >> >> for instance in x86.ad >> Why isn't it in x86_64.ad? >> >> Roland. >>> >>>> >>>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>>> >>>> Not sure why you dropped: >>>> >>>> 3463 // Float register operands >>>> 3473 // Double register operands >>>> >>>> in x86_64.ad >>>> >>>> In chaitin.hpp: >>>> >>>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>>> >>>> comment is not aligned with the one below anymore. >>>> >>>> Roland. >>>> >>> >>> I also have questions to Michael. >>> >>> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >>> >>> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? 
>>> >>> Thanks, >>> Vladimir >> From tobias.hartmann at oracle.com Fri May 8 12:18:06 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 08 May 2015 14:18:06 +0200 Subject: [9] RFR(S): 8079343: Crash in PhaseIdealLoop with "assert(!had_error) failed: bad dominance" Message-ID: <554CA97E.8020708@oracle.com> Hi, please review the following patch. https://bugs.openjdk.java.net/browse/JDK-8079343 http://cr.openjdk.java.net/~thartmann/8079343/webrev.00/ Problem: The regression test for JDK-8078497 [1] fails with "assert(!had_error) failed: bad dominance" because during loop optimizations a node in the pre-loop has a control node in the main-loop. The test looks like this: byte[] src = src1; for (int i = ...) { // Copy 8 chars from src to dst ... // Prevent loop invariant code motion of char read. src = (src == src1) ? src2 : src1; } The problem is that the superword optimization tries to vectorize the 8 load char / store char operations in the loop. In 'SuperWord::align_initial_loop_index' the base address of the memory operation we align the loop to ('align_to_ref_p.base()') is not loop invariant because 'src' changes in each iteration. However, we use it's value to correctly align the address if 'vw > ObjectAlignmentInBytes' (see line 2344). This is wrong because the address changes in each iteration. It also causes follow up problems in 'PhaseIdealLoop::verify_dominance' because nodes in the pre-loop have a control in the main-loop. Solution: I agreed with Vladimir K. that we should not try to vectorize such edge cases. I added a check to the constructor of SWPointer to bail out in case the base address is loop variant. 
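For readers who want to reproduce the shape of the failing pattern, here is a minimal, hypothetical Java sketch (not the actual regression test, which unrolls the eight accesses explicitly): the source array alternates every iteration, so the base address of the char reads is loop variant and cannot be used to align the main loop.

```java
public class LoopVariantBase {
    // Copies chars while swapping the source array on every iteration,
    // so the base address of the loads changes from iteration to iteration.
    static char[] copySwapping(char[] src1, char[] src2, char[] dst) {
        char[] src = src1;
        for (int i = 0; i + 8 <= dst.length; i++) {
            // Eight char loads/stores that SuperWord would try to pack.
            for (int j = 0; j < 8; j++) {
                dst[i + j] = src[i + j];
            }
            // Prevents loop invariant code motion: 'src' is loop variant.
            src = (src == src1) ? src2 : src1;
        }
        return dst;
    }
}
```

With the fix described above, SWPointer simply bails out when it sees that the base ('src' here) is defined inside the loop, so no vectorization is attempted for this shape.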
Testing: - Regression test - JPRT (together with fix for JDK-8078497) Thanks, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8078497 From rickard.backman at oracle.com Fri May 8 13:30:42 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Fri, 8 May 2015 15:30:42 +0200 Subject: RFR(XS): 8079797 assert(index >= 0 && index < _count) failed: check Message-ID: <20150508133042.GN31204@rbackman> Hi, can I please have this change reviewed? The recent OopMap change caused this bug when a few methods were renamed: size() used to return the number of OopMaps. In the new implementation size() was used to return the number of bytes, and count the number of Pairs. I renamed the size() method to number_of_bytes() to avoid confusion in the future. Webrev: http://cr.openjdk.java.net/~rbackman/8079797/ Bug: https://bugs.openjdk.java.net/browse/JDK-8079797 The problem was triggered by running java -XX:+PrintAssembly. Verified that java -XX:+PrintAssembly now works. Thanks /R
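The rename in the fix above guards against a common renaming hazard. A hedged Java sketch of that class of bug follows — the names are hypothetical, not HotSpot's actual OopMap code; it only illustrates why a size() that silently changes meaning from "element count" to "byte count" breaks checks like assert(index >= 0 && index < _count).

```java
public class SizeAmbiguity {
    private final int[] entries;

    public SizeAmbiguity(int count) {
        this.entries = new int[count];
    }

    // Unambiguous names: callers can no longer confuse bytes with elements.
    public int numberOfBytes() {
        return entries.length * Integer.BYTES;
    }

    public int count() {
        return entries.length;
    }

    // A bounds check written against the element count; passing
    // numberOfBytes() here by mistake would accept out-of-range indices.
    public boolean validIndex(int index) {
        return index >= 0 && index < count();
    }
}
```

Giving the byte-size accessor an explicit name, as the patch's number_of_bytes() does, makes it impossible to mistake it for the element count at a call site.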
From aph at redhat.com Fri May 8 15:59:04 2015 From: aph at redhat.com (Andrew Haley) Date: Fri, 08 May 2015 16:59:04 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <55488803.4020802@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> Message-ID: <554CDD48.8080500@redhat.com> Here is a prototype of what I propose: http://cr.openjdk.java.net/~aph/rsa-1/ It is a JNI version of the fast algorithm I think we should use for RSA. It doesn't use any intrinsics. But with it, RSA is already twice as fast as the code we have now for 1024-bit keys, and almost three times as fast for 2048-bit keys! This code (montgomery_multiply) is designed to be as easy as I can possibly make it to translate into a hand-coded intrinsic. It will then offer better performance still, with the JNI overhead gone and with some carefully hand-tweaked memory accesses. It has a very regular structure which should make it fairly easy to turn into a software pipeline with overlapped fetching and multiplication, although this will perhaps be difficult on register-starved machines like the x86. The JNI version can still be used for those machines where people don't yet want to write a hand-coded intrinsic. I haven't yet done anything about Montgomery squaring: it's asymptotically 25% faster than Montgomery multiplication. I have only written it for x86_64.
Systems without a 64-bit multiply won't benefit very much from this idea, but they are pretty much legacy anyway: I want to concentrate on 64-bit systems. So, shall we go with this? I don't think you will find any faster way to do it. Andrew. From fweimer at redhat.com Fri May 8 16:38:39 2015 From: fweimer at redhat.com (Florian Weimer) Date: Fri, 08 May 2015 18:38:39 +0200 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CDD48.8080500@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> Message-ID: <554CE68F.20509@redhat.com> On 05/08/2015 05:59 PM, Andrew Haley wrote: > Here is a prototype of what I propose: > > http://cr.openjdk.java.net/~aph/rsa-1/ > > It is a JNI version of the fast algorithm I think we should use for > RSA. It doesn't use any intrinsics. But with it, RSA is already twice > as fast as the code we have now for 1024-bit keys, and almost three > times as fast for 2048-bit keys! > > This code (montgomery_multiply) is designed to be as easy as I can > possibly make it to translate into a hand-coded intrinsic. It will > then offer better performance still, with the JNI overhead gone and > with some carefully hand-tweaked memory accesses. It has a very > regular structure which should make it fairly easy to turn into a > software pipeline with overlapped fetching and multiplication, > although this will perhaps be difficult on register-starved machines > like the x86. Do we want to add side-channel protection as part of this effort (against timing attacks and cache-flushing attacks)? 
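For context on the algorithm under review: the Montgomery multiplication that montgomery_multiply implements word-by-word can be sketched at the BigInteger level as below. This is only an illustration of the arithmetic under the usual assumptions (odd modulus n, R = 2^k > n, n' = -n^{-1} mod R); it is not the code from the webrev and carries no side-channel protection.

```java
import java.math.BigInteger;

public class MontgomerySketch {
    final BigInteger n;      // odd modulus
    final BigInteger r;      // R = 2^k > n, gcd(R, n) = 1
    final BigInteger nPrime; // n' = -n^{-1} mod R
    final int k;

    MontgomerySketch(BigInteger n) {
        this.n = n;
        this.k = n.bitLength();
        this.r = BigInteger.ONE.shiftLeft(k);
        this.nPrime = n.negate().modInverse(r);
    }

    // For a, b in Montgomery form (xR mod n), returns abR mod n.
    BigInteger montMul(BigInteger a, BigInteger b) {
        BigInteger t = a.multiply(b);
        BigInteger m = t.multiply(nPrime).mod(r);          // m = t * n' mod R
        BigInteger u = t.add(m.multiply(n)).shiftRight(k); // (t + m*n) / R, exact
        return u.compareTo(n) >= 0 ? u.subtract(n) : u;    // u < 2n, one subtract
    }

    BigInteger toMont(BigInteger x)   { return x.shiftLeft(k).mod(n); }
    BigInteger fromMont(BigInteger x) { return montMul(x, BigInteger.ONE); }
}
```

A word-by-word (CIOS-style) implementation interleaves the multiply and the reduction so that only O(words) of storage is live at a time, which is what makes the routine regular enough to hand-code as an intrinsic.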
-- Florian Weimer / Red Hat Product Security From vladimir.x.ivanov at oracle.com Fri May 8 17:16:38 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 08 May 2015 20:16:38 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared Message-ID: <554CEF76.8090903@oracle.com> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 https://bugs.openjdk.java.net/browse/JDK-8079205 Recent change in sun.misc.Cleaner behavior broke CallSite context cleanup. CallSite references context class through a Cleaner to avoid its unnecessary retention. The problem is the following: to do a cleanup (invalidate all affected nmethods) VM needs a pointer to a context class. Until Cleaner is cleared (and it was a manual action, since Cleaner extends PhantomReference isn't automatically cleared according to the docs), VM can extract it from CallSite.context.referent field. I experimented with moving cleanup logic into VM [1], but Peter Levart came up with a clever idea and implemented FinalReference-based cleaner-like Finalizator. Classes don't have finalizers, but Finalizator allows to attach a finalization action to them. And it is guaranteed that the referent is alive when finalization happens. Also, Peter spotted another problem with Cleaner-based implementation. Cleaner cleanup action is strongly referenced, since it is registered in Cleaner class. CallSite context cleanup action keeps a reference to CallSite class (it is passed to MHN.invalidateDependentNMethods). Users are free to extend CallSite and many do so. If a context class and a call site class are loaded by a custom class loader, such loader will never be unloaded, causing a memory leak. Finalizator doesn't suffer from that, since the action is referenced only from Finalizator instance. The downside is that cleanup action can be missed if Finalizator becomes unreachable. 
It's not a problem for CallSite context, because a context is always referenced from some CallSite and if a CallSite becomes unreachable, there's no need to perform a cleanup. Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 Contributed-by: plevart, vlivanov Best regards, Vladimir Ivanov PS: frankly speaking, I would move Finalizator from java.lang.ref to java.lang.invoke and call it Context, if there were a way to extend package-private FinalReference from another package :-) [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 From aph at redhat.com Fri May 8 17:19:07 2015 From: aph at redhat.com (Andrew Haley) Date: Fri, 08 May 2015 18:19:07 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CE68F.20509@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> Message-ID: <554CF00B.6000508@redhat.com> On 05/08/2015 05:38 PM, Florian Weimer wrote: > On 05/08/2015 05:59 PM, Andrew Haley wrote: >> Here is a prototype of what I propose: >> >> http://cr.openjdk.java.net/~aph/rsa-1/ >> >> It is a JNI version of the fast algorithm I think we should use for >> RSA. It doesn't use any intrinsics. But with it, RSA is already twice >> as fast as the code we have now for 1024-bit keys, and almost three >> times as fast for 2048-bit keys! >> >> This code (montgomery_multiply) is designed to be as easy as I can >> possibly make it to translate into a hand-coded intrinsic. 
It will >> then offer better performance still, with the JNI overhead gone and >> with some carefully hand-tweaked memory accesses. It has a very >> regular structure which should make it fairly easy to turn into a >> software pipeline with overlapped fetching and multiplication, >> although this will perhaps be difficult on register-starved machines >> like the x86. > > Do we want to add side-channel protection as part of this effort > (against timing attacks and cache-flushing attacks)? I wouldn't have thought so. It might make sense to add an optional path without key-dependent branches, but not as a part of this effort: the goals are completely orthogonal. Andrew. From vladimir.kozlov at oracle.com Fri May 8 18:05:17 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 08 May 2015 11:05:17 -0700 Subject: RFR(XS): 8079797 assert(index >= 0 && index < _count) failed: check In-Reply-To: <20150508133042.GN31204@rbackman> References: <20150508133042.GN31204@rbackman> Message-ID: <554CFADD.20803@oracle.com> Looks good. Thanks, Vladimir On 5/8/15 6:30 AM, Rickard B?ckman wrote: > Hi, > > can I please have this change reviewed? > The recent OopMap change caused this bug when a few methods were > renamed: > > size() used to return the number of OopMaps. > In the new implementation size() was used to return the number of bytes, > and count the number of Pairs. > > I renamed the size() method to number_of_bytes() to avoid confusion in > the future. > > Webrev: http://cr.openjdk.java.net/~rbackman/8079797/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8079797 > > The problem was triggered by running java -XX:+PrintAssembly. > Verified that java -XX:+PrintAssembly now works. 
> > Thanks > /R > From vladimir.kozlov at oracle.com Fri May 8 18:16:43 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 08 May 2015 11:16:43 -0700 Subject: [9] RFR(S): 8079343: Crash in PhaseIdealLoop with "assert(!had_error) failed: bad dominance" In-Reply-To: <554CA97E.8020708@oracle.com> References: <554CA97E.8020708@oracle.com> Message-ID: <554CFD8B.9090904@oracle.com> Looks good. Thanks, Vladimir On 5/8/15 5:18 AM, Tobias Hartmann wrote: > Hi, > > please review the following patch. > > https://bugs.openjdk.java.net/browse/JDK-8079343 > http://cr.openjdk.java.net/~thartmann/8079343/webrev.00/ > > Problem: > The regression test for JDK-8078497 [1] fails with "assert(!had_error) failed: bad dominance" because during loop optimizations a node in the pre-loop has a control node in the main-loop. The test looks like this: > > > byte[] src = src1; > > for (int i = ...) { > > // Copy 8 chars from src to dst > > ... > > // Prevent loop invariant code motion of char read. > > src = (src == src1) ? src2 : src1; > > } > > > > The problem is that the superword optimization tries to vectorize the 8 load char / store char operations in the loop. In 'SuperWord::align_initial_loop_index' the base address of the memory operation we align the loop to ('align_to_ref_p.base()') is not loop invariant because 'src' changes in each iteration. However, we use it's value to correctly align the address if 'vw > ObjectAlignmentInBytes' (see line 2344). This is wrong because the address changes in each iteration. It also causes follow up problems in 'PhaseIdealLoop::verify_dominance' because nodes in the pre-loop have a control in the main-loop. > > > Solution: > I agreed with Vladimir K. that we should not try to vectorize such edge cases. I added a check to the constructor of SWPointer to bail out in case the base address is loop variant. 
> > Testing: > - Regression test > - JPRT (together with fix for JDK-8078497) > > Thanks, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8078497 > From vladimir.kozlov at oracle.com Fri May 8 19:09:20 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 08 May 2015 12:09:20 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> <554BF60F.9000401@oracle.com> Message-ID: <554D09E0.4040904@oracle.com> I applied small "cosmetic" changes sent by Michael and I am pushing this. Here is updated webrev for the record (I also removed trailing spaces in .ad files): http://cr.openjdk.java.net/~kvn/8076276/webrev.03 Thanks, Vladimir On 5/7/15 4:41 PM, Berg, Michael C wrote: > Ok, I will leave that way it is plus the comment. > > Thanks, > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Thursday, May 07, 2015 4:33 PM > To: Berg, Michael C; Roland Westrelin > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. > > To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. > > Thanks, > Vladimir > > On 5/7/15 3:43 PM, Berg, Michael C wrote: >> Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. 
>> >> -Michael >> >> -----Original Message----- >> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >> Sent: Thursday, May 07, 2015 2:28 PM >> To: Berg, Michael C; Roland Westrelin >> Cc: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR 8076276 support for AVX512 >> >> On 5/7/15 2:07 PM, Berg, Michael C wrote: >>> Roland, >>> >>> VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. >>> 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. >> >> >> For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. >> >> Regards, >> Vladimir >> >>> I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). >>> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. >>> >>> In x86.64.ad: >>> >>>>> 3463 // Float register operands >>>>> 3473 // Double register operands >>> >>> I will re-add the comments. >>> >>> In chaitain.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> I will fix the comment location. 
>>> >>> Thanks, >>> Michael >>> >>> -----Original Message----- >>> From: hotspot-compiler-dev >>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >>> Roland Westrelin >>> Sent: Thursday, May 07, 2015 11:40 AM >>> To: Vladimir Kozlov >>> Cc: hotspot-compiler-dev at openjdk.java.net >>> Subject: Re: RFR 8076276 support for AVX512 >>> >>>>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>>>> >>>>> This looks good to me. A few minor remarks: >>>>> >>>>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >>>> >>>> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? >>> >>> I'm talking about: >>> 4101 #ifdef _LP64 >>> 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, >>> regF tmp, regF tmp2) %{ >>> >>> for instance in x86.ad >>> Why isn't it in x86_64.ad? >>> >>> Roland. >>>> >>>>> >>>>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>>>> >>>>> Not sure why you dropped: >>>>> >>>>> 3463 // Float register operands >>>>> 3473 // Double register operands >>>>> >>>>> in x86_64.ad >>>>> >>>>> In chaitin.hpp: >>>>> >>>>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>>>> >>>>> comment is not aligned with the one below anymore. >>>>> >>>>> Roland. >>>>> >>>> >>>> I also have questions to Michael. >>>> >>>> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >>>> >>>> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? 
>>>> >>>> Thanks, >>>> Vladimir >>> From aph at redhat.com Sat May 9 07:03:32 2015 From: aph at redhat.com (Andrew Haley) Date: Sat, 09 May 2015 08:03:32 +0100 Subject: aarch64 AD-file / matching rule In-Reply-To: References: Message-ID: <554DB144.1050705@redhat.com> On 04/23/2015 02:41 PM, Benedikt Wedenik wrote: > One of the patterns I want to optimise is the following: > > 0x0000007f8c2961b4: and w2, w2, #0x7ffff8 0x0000007f8c2961b8: cmp w2, #0x0 0x0000007f8c2961bc: b.eq 0x0000007f8c2968f4 > > > Here I see an opportunity for ands, b.eq. It would make more sense to use cbz. In fact, I have no idea why cbz is not being used here. I can't see any point to this pattern. > [-client -Xcomp -XX:-TieredCompilation] : Here the cmp for #0x0 only occurs about 3 times in the whole disassembly, instead of about 200 times without these flags. This tells you that the cmp for #0x0 is C1, not C2. > My question is now how to find out why the rule does not match / if the rule is correct and how to find the actual rule which emits the code of my desired pattern. 1. Use a debug build. 2. Turn off tiered compilation. Look at a disassembly and see which method and which line emits the code above. You haven't provided enough context. 3. Forget about optimizing C1. It will have no effect on benchmarks. Andrew. From peter.levart at gmail.com Sat May 9 09:33:43 2015 From: peter.levart at gmail.com (Peter Levart) Date: Sat, 09 May 2015 11:33:43 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <554CEF76.8090903@oracle.com> References: <554CEF76.8090903@oracle.com> Message-ID: <554DD477.7000703@gmail.com> Hi Vladimir, On 05/08/2015 07:16 PM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 > https://bugs.openjdk.java.net/browse/JDK-8079205 Your Finalizator touches are good.
Supplier interface is not needed as there is a public Reference superclass that can be used for return type of JavaLangRefAccess.createFinalizator(). You can remove the import for Supplier in Reference now. > > Recent change in sun.misc.Cleaner behavior broke CallSite context > cleanup. > > CallSite references context class through a Cleaner to avoid its > unnecessary retention. > > The problem is the following: to do a cleanup (invalidate all affected > nmethods) VM needs a pointer to a context class. Until Cleaner is > cleared (and it was a manual action, since Cleaner extends > PhantomReference isn't automatically cleared according to the docs), > VM can extract it from CallSite.context.referent field. If PhantomReference.referent wasn't cleared by VM when PhantomReference was equeued, could it happen that the referent pointer was still != null and the referent object's heap memory was already reclaimed by GC? Is that what JDK-8071931 is about? (I cant't see the bug - it's internal). In that respect the Cleaner based solution was broken from the start, as you did dereference the referent after Cleaner was enqueued. You could get a != null pointer which was pointing to reclaimed heap. In Java this dereference is prevented by PhantomReference.get() always returning null (well, nothing prevents one to read the referent field with reflection though). FinalReference(s) are different in that their referent is not reclaimed while it is still reachable through FinalReference, which means that finalizable objects must undergo at least two GC cycles, with processing in the reference handler and finalizer threads inbetween the cycles, to be reclaimed. PhantomReference(s), on the other hand, can be enqueued and their referents reclaimed in the same GC cycle, can't they? > > I experimented with moving cleanup logic into VM [1], What's the downside of that approach? I mean, why is GC-assisted approach better? Simpler? 
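Peter's point about phantom reachability can be made concrete with a small, hypothetical sketch (this is neither sun.misc.Cleaner nor the proposed Finalizator): because PhantomReference.get() always returns null, any state a phantom-based cleanup needs — such as the context pointer in the CallSite case — must travel in the reference object itself rather than be read back through the referent.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

public class PhantomCleanup<T> extends PhantomReference<T> {
    // Cleanup state is carried alongside the reference, never via the referent.
    private final Runnable action;

    public PhantomCleanup(T referent, ReferenceQueue<? super T> queue, Runnable action) {
        super(referent, queue);
        this.action = action;
    }

    // Invoked after the reference has been dequeued from 'queue'.
    public void clean() {
        action.run();
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        PhantomCleanup<Object> ref =
            new PhantomCleanup<>(new Object(), queue, () -> System.out.println("cleaned"));
        // The referent is never visible through a phantom reference:
        System.out.println(ref.get()); // prints null
    }
}
```

A FinalReference-based Finalizator inverts this: the referent is still strongly reachable from the reference when the action runs, which is why the VM can safely dereference the context class at that point.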
> but Peter Levart came up with a clever idea and implemented > FinalReference-based cleaner-like Finalizator. Classes don't have > finalizers, but Finalizator allows to attach a finalization action to > them. And it is guaranteed that the referent is alive when > finalization happens. > > Also, Peter spotted another problem with Cleaner-based implementation. > Cleaner cleanup action is strongly referenced, since it is registered > in Cleaner class. CallSite context cleanup action keeps a reference to > CallSite class (it is passed to MHN.invalidateDependentNMethods). > Users are free to extend CallSite and many do so. If a context class > and a call site class are loaded by a custom class loader, such loader > will never be unloaded, causing a memory leak. > > Finalizator doesn't suffer from that, since the action is referenced > only from Finalizator instance. The downside is that cleanup action > can be missed if Finalizator becomes unreachable. It's not a problem > for CallSite context, because a context is always referenced from some > CallSite and if a CallSite becomes unreachable, there's no need to > perform a cleanup. > > Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 > > Contributed-by: plevart, vlivanov > > Best regards, > Vladimir Ivanov > > PS: frankly speaking, I would move Finalizator from java.lang.ref to > java.lang.invoke and call it Context, if there were a way to extend > package-private FinalReference from another package :-) Modules may have an answer for that. FinalReference could be moved to a package (java.lang.ref.internal or jdk.lang.ref) that is not exported to the world and made public. On the other hand, I don't see a reason why FinalReference couldn't be part of JDK public API. 
I know it's currently just a private mechanism to implement finalization which has issues and many would like to see it gone, but Java has learned to live with it, and, as you see, it can be a solution to some problems too ;-) Regards, Peter > > [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 -------------- next part -------------- An HTML attachment was scrubbed... URL: From tobias.hartmann at oracle.com Mon May 11 05:33:16 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 11 May 2015 07:33:16 +0200 Subject: [9] RFR(S): 8079343: Crash in PhaseIdealLoop with "assert(!had_error) failed: bad dominance" In-Reply-To: <554CFD8B.9090904@oracle.com> References: <554CA97E.8020708@oracle.com> <554CFD8B.9090904@oracle.com> Message-ID: <55503F1C.2070808@oracle.com> Thanks, Vladimir. Best, Tobias On 08.05.2015 20:16, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/8/15 5:18 AM, Tobias Hartmann wrote: >> Hi, >> >> please review the following patch. >> >> https://bugs.openjdk.java.net/browse/JDK-8079343 >> http://cr.openjdk.java.net/~thartmann/8079343/webrev.00/ >> >> Problem: >> The regression test for JDK-8078497 [1] fails with "assert(!had_error) failed: bad dominance" because during loop optimizations a node in the pre-loop has a control node in the main-loop. The test looks like this: >> >> >> byte[] src = src1; >> >> for (int i = ...) { >> >> // Copy 8 chars from src to dst >> >> ... >> >> // Prevent loop invariant code motion of char read. >> >> src = (src == src1) ? src2 : src1; >> >> } >> >> >> >> The problem is that the superword optimization tries to vectorize the 8 load char / store char operations in the loop. In 'SuperWord::align_initial_loop_index' the base address of the memory operation we align the loop to ('align_to_ref_p.base()') is not loop invariant because 'src' changes in each iteration. However, we use its value to correctly align the address if 'vw > ObjectAlignmentInBytes' (see line 2344).
This is wrong because the address changes in each iteration. It also causes follow-up problems in 'PhaseIdealLoop::verify_dominance' because nodes in the pre-loop have a control in the main-loop. >> >> >> Solution: >> I agreed with Vladimir K. that we should not try to vectorize such edge cases. I added a check to the constructor of SWPointer to bail out in case the base address is loop variant. >> >> Testing: >> - Regression test >> - JPRT (together with fix for JDK-8078497) >> >> Thanks, >> Tobias >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8078497 >> From rickard.backman at oracle.com Mon May 11 07:36:25 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Mon, 11 May 2015 09:36:25 +0200 Subject: RFR(XS): 8079797 assert(index >= 0 && index < _count) failed: check In-Reply-To: <554CFADD.20803@oracle.com> References: <20150508133042.GN31204@rbackman> <554CFADD.20803@oracle.com> Message-ID: <20150511073625.GP31204@rbackman> Thanks Vladimir. /R On 05/08, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/8/15 6:30 AM, Rickard Bäckman wrote: > >Hi, > > > >can I please have this change reviewed? > >The recent OopMap change caused this bug when a few methods were > >renamed: > > > >size() used to return the number of OopMaps. > >In the new implementation size() was used to return the number of bytes, > >and count the number of Pairs. > > > >I renamed the size() method to number_of_bytes() to avoid confusion in > >the future. > > > >Webrev: http://cr.openjdk.java.net/~rbackman/8079797/ > >Bug: https://bugs.openjdk.java.net/browse/JDK-8079797 > > > >The problem was triggered by running java -XX:+PrintAssembly. > >Verified that java -XX:+PrintAssembly now works. > > > >Thanks > >/R > > -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From fweimer at redhat.com Mon May 11 15:37:36 2015 From: fweimer at redhat.com (Florian Weimer) Date: Mon, 11 May 2015 17:37:36 +0200 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CF00B.6000508@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> <554CF00B.6000508@redhat.com> Message-ID: <5550CCC0.6000502@redhat.com> On 05/08/2015 07:19 PM, Andrew Haley wrote: >> Do we want to add side-channel protection as part of this effort >> (against timing attacks and cache-flushing attacks)? > > I wouldn't have thought so. It might make sense to add an optional > path without key-dependent branches, but not as a part of this effort: > the goals are completely orthogonal. I'm not well-versed in this kind of side-channel protection for RSA implementations, but my impression is that algorithm changes are needed to mitigate the impact of data-dependent memory fetches (see fixed-width modular exponentiation).
-- Florian Weimer / Red Hat Product Security From aph at redhat.com Mon May 11 16:15:42 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 11 May 2015 17:15:42 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <5550CCC0.6000502@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> <554CF00B.6000508@redhat.com> <5550CCC0.6000502@redhat.com> Message-ID: <5550D5AE.1070602@redhat.com> On 05/11/2015 04:37 PM, Florian Weimer wrote: > On 05/08/2015 07:19 PM, Andrew Haley wrote: > >>> Do we want to add side-channel protection as part of this effort >>> (against timing attacks and cache-flushing attacks)? >> >> I wouldn't have thought so. It might make sense to add an optional >> path without key-dependent branches, but not as a part of this effort: >> the goals are completely orthogonal. > > I'm not well-versed in this kind of side-channel protection for RSA > implementations, but my impression is that algorithm changes are needed > to mitigate the impact of data-dependent memory fetches (see > fixed-width modular exponentiation). RSA can be done with no key- (or data-) dependent timing. No Chinese Remainder Theorem optimizations, no addition chains, no optimized division. Just do everything. Even when the result of a multiplication is not needed, do it anyway. This is really easy to do: just delete all of the optimizations (and therefore most of the code) in oddModPow, leaving only a simple kernel and a much slower algorithm.
> But maybe the necessary changes materialize at a higher level, > beyond the operation which you proposed to intrinsify. They absolutely do. The algorithm I propose to intrinsify has almost no key-dependent timing whatsoever unless the multiplication instruction itself has an early-out optimization. And there's not much anyone can do about that. (Well, almost: it may be necessary to do an extra subtraction to normalize the result of the Montgomery multiplication, and this is data-dependent. I suppose it would be possible to do the subtraction anyway, and throw it away if needs be.) It would make more sense to use one of the techniques to prevent timing attacks by using blinding functions. See e.g. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems, Kocher 1995. But really it seems inappropriate to me to piggyback this conversation on top of a completely different subject, RSA acceleration. Let's continue it elsewhere. Andrew. From fweimer at redhat.com Mon May 11 17:04:13 2015 From: fweimer at redhat.com (Florian Weimer) Date: Mon, 11 May 2015 19:04:13 +0200 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <5550D5AE.1070602@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> <554CF00B.6000508@redhat.com> <5550CCC0.6000502@redhat.com> <5550D5AE.1070602@redhat.com> Message-ID: <5550E10D.7020304@redhat.com> On 05/11/2015 06:15 PM, Andrew Haley wrote: > But really it seems inappropriate to me to piggyback this conversation > on top of a completely different subject, RSA 
acceleration. Let's > continue it elsewhere. Agreed. I apologize for the side discussion. -- Florian Weimer / Red Hat Product Security From vitalyd at gmail.com Tue May 12 00:25:30 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Mon, 11 May 2015 20:25:30 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer Message-ID: Hi guys, Suppose we have the following code: private static final long[] ARR = new long[10]; private static final int ITERATIONS = 500 * 1000 * 1000; static void loop() { int i = ITERATIONS; while(--i >= 0) { ARR[4] = i; } } My expectation would be that loop() effectively compiles to: ARR[4] = 0; ret; However, an actual loop is compiled. This example is a reduction of a slightly more involved case that I had in mind, but I'm curious if I'm missing something here. I tested this with 8u40 and tiered compilation disabled, for what it's worth. Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Tue May 12 06:24:39 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Tue, 12 May 2015 06:24:39 +0000 Subject: Anyway to hack JITTed code manually? Message-ID: Hi All, Is there any simple way to hack JITTed code and relocate at runtime? Following steps are expected. 1. Dump JITTed code (c1/c2) to file or some cached directory. 2. Hack assembly code 3. Run JVM to load the hacked assembly code and relocating in memory Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Tue May 12 06:33:49 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Mon, 11 May 2015 23:33:49 -0700 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Wei, For HotSpot VM as it is now, the answer is no, there isn't a simple way to do that. 
One of the various problems that you'd have with trying to do something like this is relocation: there are a lot of addresses embedded into JIT compiled code as constants -- some could be Klass*, some could be compiled entry points to other compiled methods or stubs. The kind of feature you're asking for is more likely to be present in a VM that supports some certain form of AOT compilation, where the system has to deal with relocation of code anyway. The HotSpot VM we can see in OpenJDK doesn't do that right now. - Kris On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) wrote: > Hi All, > > Is there any simple way to hack JITTed code and relocate at runtime? > Following steps are expected. > > > > 1. Dump JITTed code (c1/c2) to file or some cached directory. > > 2. Hack assembly code > > 3. Run JVM to load the hacked assembly code and relocate it in memory > > > > > > Regards! > > wei > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Tue May 12 06:42:53 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Tue, 12 May 2015 06:42:53 +0000 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Kris, Thanks for quick response. Is there any other hacking way in OpenJDK for fast performance measurement? Or is it possible to invoke another profiling tool such as "perf" instead of hprof in JVM to sample more PMU events. Regards! wei From: Krystal Mok [mailto:rednaxelafx at gmail.com] Sent: Tuesday, May 12, 2015 2:34 PM To: Tangwei (Euler) Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: Anyway to hack JITTed code manually? Hi Wei, For HotSpot VM as it is now, the answer is no, there isn't a simple way to do that.
One of the various problem that you'd have with trying to do something like this is relocation: there are a lot of addresses embedded into JIT compiled code as constants -- some could be Klass*, some could be compiled entry points to other compiled methods or stubs. The kind of feature you're asking for is more likely to be present in a VM that supports some certain form of AOT compilation, where the system has to deal with relocation of code anyway. The HotSpot VM we can see in OpenJDK doesn't do that right now. - Kris On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) > wrote: Hi All, Is there any simple way to hack JITTed code and relocate at runtime? Following steps are expected. 1. Dump JITTed code (c1/c2) to file or some cached directory. 2. Hack assembly code 3. Run JVM to load the hacked assembly code and relocating in memory Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Tue May 12 06:49:36 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Mon, 11 May 2015 23:49:36 -0700 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Wei, HotSpot VM is known to be able to work with other more native profiling tools. HotSpot can work with Linux perf. An easy way to run benchmarks and profiling them is with the JMH framework and with the "perfasm" feature [1][2]. Other profilers like Oracle Solaris Studio Performance Analyzer [3], Intel VTune and AMD CodeAnalyst are also known to work with HotSpot. - Kris [1]: http://cr.openjdk.java.net/~shade/jmh/perfasm-sample.log [2]: http://shipilev.net/blog/2014/java-scala-divided-we-fail/ [3]: http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html On Mon, May 11, 2015 at 11:42 PM, Tangwei (Euler) wrote: > Hi Kris, > > Thanks for quick response. Is there any other hacking way in OpenJDK for > fast performance measurement? 
> > Or is it possible to invoke another profiling tool such as "perf" instead > of hprof in JVM to sample more PMU events. > > > > Regards! > > wei > > > > *From:* Krystal Mok [mailto:rednaxelafx at gmail.com] > *Sent:* Tuesday, May 12, 2015 2:34 PM > *To:* Tangwei (Euler) > *Cc:* hotspot-compiler-dev at openjdk.java.net > *Subject:* Re: Anyway to hack JITTed code manually? > > > > Hi Wei, > > > > For HotSpot VM as it is now, the answer is no, there isn't a simple way to > do that. > > > > One of the various problems that you'd have with trying to do something > like this is relocation: there are a lot of addresses embedded into JIT > compiled code as constants -- some could be Klass*, some could be compiled > entry points to other compiled methods or stubs. > > > > The kind of feature you're asking for is more likely to be present in a VM > that supports some certain form of AOT compilation, where the system has to > deal with relocation of code anyway. The HotSpot VM we can see in OpenJDK > doesn't do that right now. > > > > - Kris > > > > On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) > wrote: > > Hi All, > > Is there any simple way to hack JITTed code and relocate at runtime? > Following steps are expected. > > > > 1. Dump JITTed code (c1/c2) to file or some cached directory. > > 2. Hack assembly code > > 3. Run JVM to load the hacked assembly code and relocate it in memory > > > > > > Regards! > > wei > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Tue May 12 07:05:41 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Tue, 12 May 2015 07:05:41 +0000 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Kris, Sorry for inaccurate question description, the tool I am looking for is to sample a specific function. For example I know some function called "Foo" is hot, but has poor performance. I just want to sample it with some PMU event set.
I am working on ARM, so some X86 only profiler is not suitable to me. Regards! wei From: Krystal Mok [mailto:rednaxelafx at gmail.com] Sent: Tuesday, May 12, 2015 2:50 PM To: Tangwei (Euler) Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: Anyway to hack JITTed code manually? Hi Wei, HotSpot VM is known to be able to work with other more native profiling tools. HotSpot can work with Linux perf. An easy way to run benchmarks and profiling them is with the JMH framework and with the "perfasm" feature [1][2]. Other profilers like Oracle Solaris Studio Performance Analyzer [3], Intel VTune and AMD CodeAnalyst are also known to work with HotSpot. - Kris [1]: http://cr.openjdk.java.net/~shade/jmh/perfasm-sample.log [2]: http://shipilev.net/blog/2014/java-scala-divided-we-fail/ [3]: http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html On Mon, May 11, 2015 at 11:42 PM, Tangwei (Euler) > wrote: Hi Kris, Thanks for quick response. Is there any other hacking way in OpenJDK for fast performance measurement? Or is that possible to invoke other profiling tool such as ?perf? instead of hprof in JVM to sampling more PMU events. Regards! wei From: Krystal Mok [mailto:rednaxelafx at gmail.com] Sent: Tuesday, May 12, 2015 2:34 PM To: Tangwei (Euler) Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: Anyway to hack JITTed code manually? Hi Wei, For HotSpot VM as it is now, the answer is no, there isn't a simple way to do that. One of the various problem that you'd have with trying to do something like this is relocation: there are a lot of addresses embedded into JIT compiled code as constants -- some could be Klass*, some could be compiled entry points to other compiled methods or stubs. The kind of feature you're asking for is more likely to be present in a VM that supports some certain form of AOT compilation, where the system has to deal with relocation of code anyway. 
The HotSpot VM we can see in OpenJDK doesn't do that right now. - Kris On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) > wrote: Hi All, Is there any simple way to hack JITTed code and relocate at runtime? Following steps are expected. 1. Dump JITTed code (c1/c2) to file or some cached directory. 2. Hack assembly code 3. Run JVM to load the hacked assembly code and relocating in memory Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From filipp.zhinkin at gmail.com Tue May 12 09:32:21 2015 From: filipp.zhinkin at gmail.com (Filipp Zhinkin) Date: Tue, 12 May 2015 12:32:21 +0300 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Wei, in order to profile JIT-generated code using perf you can use perf-map-agent [1] which is aimed to generate map files for compiled methods. Regards, Filipp. [1] https://github.com/jrudolph/perf-map-agent On Tue, May 12, 2015 at 10:05 AM, Tangwei (Euler) wrote: > Hi Kris, > > Sorry for inaccurate question description, the tool I am looking for is to > sampling specific function. > > For example I know some function called ?Foo? is hot, but has poor > performance. I just want to sample > > it with some PMU event set. I am working on ARM, so some X86 only profiler > is not suitable to me. > > > > > > Regards! > > wei > > > > From: Krystal Mok [mailto:rednaxelafx at gmail.com] > Sent: Tuesday, May 12, 2015 2:50 PM > > > To: Tangwei (Euler) > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: Anyway to hack JITTed code manually? > > > > Hi Wei, > > > > HotSpot VM is known to be able to work with other more native profiling > tools. > > > > HotSpot can work with Linux perf. An easy way to run benchmarks and > profiling them is with the JMH framework and with the "perfasm" feature > [1][2]. > > > > Other profilers like Oracle Solaris Studio Performance Analyzer [3], Intel > VTune and AMD CodeAnalyst are also known to work with HotSpot. 
> > > > - Kris > > > > [1]: http://cr.openjdk.java.net/~shade/jmh/perfasm-sample.log > > [2]: http://shipilev.net/blog/2014/java-scala-divided-we-fail/ > > [3]: > http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html > > > > On Mon, May 11, 2015 at 11:42 PM, Tangwei (Euler) > wrote: > > Hi Kris, > > Thanks for quick response. Is there any other hacking way in OpenJDK for > fast performance measurement? > > Or is that possible to invoke other profiling tool such as ?perf? instead of > hprof in JVM to sampling more PMU events. > > > > Regards! > > wei > > > > From: Krystal Mok [mailto:rednaxelafx at gmail.com] > Sent: Tuesday, May 12, 2015 2:34 PM > To: Tangwei (Euler) > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: Anyway to hack JITTed code manually? > > > > Hi Wei, > > > > For HotSpot VM as it is now, the answer is no, there isn't a simple way to > do that. > > > > One of the various problem that you'd have with trying to do something like > this is relocation: there are a lot of addresses embedded into JIT compiled > code as constants -- some could be Klass*, some could be compiled entry > points to other compiled methods or stubs. > > > > The kind of feature you're asking for is more likely to be present in a VM > that supports some certain form of AOT compilation, where the system has to > deal with relocation of code anyway. The HotSpot VM we can see in OpenJDK > doesn't do that right now. > > > > - Kris > > > > On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) > wrote: > > Hi All, > > Is there any simple way to hack JITTed code and relocate at runtime? > Following steps are expected. > > > > 1. Dump JITTed code (c1/c2) to file or some cached directory. > > 2. Hack assembly code > > 3. Run JVM to load the hacked assembly code and relocating in memory > > > > > > Regards! 
> > wei > > > > From roland.westrelin at oracle.com Tue May 12 10:40:05 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 12 May 2015 12:40:05 +0200 Subject: RFR(M): 8076188 Optimize arraycopy out for non escaping destination In-Reply-To: References: <2B8622DD-1DA5-4715-B4BF-3801202B588B@oracle.com> <553F9C4A.7040407@oracle.com> Message-ID: <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> jprt found some problems with this change. So I'd like to push: http://cr.openjdk.java.net/~roland/8076188/webrev.00-01/ on top of the reviewed change. Overall change: http://cr.openjdk.java.net/~roland/8076188/webrev.01/ List of fixes: - tests with G1 failed in verification code. I made the change to LoadNode::Ideal() so the new code that looks for a dominating identical load is disabled for raw loads - OptimizePtrCompare is broken in cases like the new test case added to TestEliminateArrayCopy.java: field from non escaping object should have an unknown value when the object is target of an ArrayCopy. - The ifnode change is unrelated (I found the problem when debugging one of the issues above) but it's simple enough that I don't think it needs its own CR. Roland. > On Apr 29, 2015, at 11:30 AM, Roland Westrelin wrote: > > Thanks for the review, Vladimir. > > Roland. > >> On Apr 28, 2015, at 4:42 PM, Vladimir Ivanov wrote: >> >> Looks good.
>> >> Best regards, >> Vladimir Ivanov >> >> On 4/21/15 4:02 PM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~roland/8076188/webrev.00/ >>> >>> This patch tries to eliminate ArrayCopyNodes (for instance clones, array clones, arraycopy and copyOf) when the destination of the copy doesn?t escape: >>> >>> - during escape analysis, ArrayCopyNodes don?t cause the destination of the copy to be marked as escaping anymore >>> - a load to the destination of a copy may be replaced by a load from the source during IGVN >>> - during macro expansion, ArrayCopyNodes don?t stop allocation from being eliminated and can themselves be eliminated >>> >>> Roland. >>> > From vladimir.x.ivanov at oracle.com Tue May 12 10:56:04 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 12 May 2015 13:56:04 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <554DD477.7000703@gmail.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> Message-ID: <5551DC44.9060902@oracle.com> Peter, >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 >> https://bugs.openjdk.java.net/browse/JDK-8079205 > > Your Finalizator touches are good. Supplier interface is not needed as > there is a public Reference superclass that can be used for return type > of JavaLangRefAccess.createFinalizator(). You can remove the import for > Supplier in Reference now. Thanks for spotting that. Will do. >> Recent change in sun.misc.Cleaner behavior broke CallSite context >> cleanup. >> >> CallSite references context class through a Cleaner to avoid its >> unnecessary retention. >> >> The problem is the following: to do a cleanup (invalidate all affected >> nmethods) VM needs a pointer to a context class. 
Until Cleaner is >> cleared (and it was a manual action, since Cleaner extends >> PhantomReference isn't automatically cleared according to the docs), >> VM can extract it from CallSite.context.referent field. > > If PhantomReference.referent wasn't cleared by VM when PhantomReference > was enqueued, could it happen that the referent pointer was still != null > and the referent object's heap memory was already reclaimed by GC? Is > that what JDK-8071931 is about? (I can't see the bug - it's internal). > In that respect the Cleaner based solution was broken from the start, as > you did dereference the referent after Cleaner was enqueued. You could > get a != null pointer which was pointing to reclaimed heap. In Java this > dereference is prevented by PhantomReference.get() always returning null > (well, nothing prevents one from reading the referent field with reflection > though). No, the object isn't reclaimed while the corresponding reference is != null. When Cleaner wasn't automatically cleared, it was guaranteed that the referent is alive when cleanup action happens. That is the assumption CallSite dependency tracking is based on. It's not the case anymore when Cleaner is automatically cleared. Cleanup action can happen when context Class is already GCed. So, I can access neither the context mirror class nor VM-internal InstanceKlass. > FinalReference(s) are different in that their referent is not reclaimed > while it is still reachable through FinalReference, which means that > finalizable objects must undergo at least two GC cycles, with processing > in the reference handler and finalizer threads in between the cycles, to > be reclaimed. PhantomReference(s), on the other hand, can be enqueued > and their referents reclaimed in the same GC cycle, can't they? Since PhantomReference isn't automatically cleared, GC can't reclaim its referent until the reference is manually cleared or PhantomReference becomes unreachable as a result of cleanup action.
So, it's impossible to reclaim it in the same GC cycle (unless it is a concurrent GC cycle?). >> I experimented with moving cleanup logic into VM [1], > > What's the downside of that approach? I mean, why is GC-assisted > approach better? Simpler? IMO the more code on JDK side the better. More safety guarantees are provided in Java code comparing to native/VM code. Also, JVM-based approach suffers from the fact it doesn't get prompt notification when context Class can be GCed. Since it should work with autocleared references, there's no need in a cleanup action anymore. By the time cleanup happens, referent can be already GCed. That's why I switched from Cleaner to WeakReference. When CallSite.context is cleared ("stale" context), VM has to scan all nmethods in the code cache to find all affected nmethods. BTW I had a private discussion with Kim Barrett who suggested an alternative approach which doesn't require full code cache scan. I plan to do some prototyping to understand its feasibility, since it requires non-InstanceKlass-based nmethod dependency tracking machinery. Best regards, Vladimir Ivanov >> but Peter Levart came up with a clever idea and implemented >> FinalReference-based cleaner-like Finalizator. Classes don't have >> finalizers, but Finalizator allows to attach a finalization action to >> them. And it is guaranteed that the referent is alive when >> finalization happens. >> >> Also, Peter spotted another problem with Cleaner-based implementation. >> Cleaner cleanup action is strongly referenced, since it is registered >> in Cleaner class. CallSite context cleanup action keeps a reference to >> CallSite class (it is passed to MHN.invalidateDependentNMethods). >> Users are free to extend CallSite and many do so. If a context class >> and a call site class are loaded by a custom class loader, such loader >> will never be unloaded, causing a memory leak. 
>> >> Finalizator doesn't suffer from that, since the action is referenced >> only from Finalizator instance. The downside is that cleanup action >> can be missed if Finalizator becomes unreachable. It's not a problem >> for CallSite context, because a context is always referenced from some >> CallSite and if a CallSite becomes unreachable, there's no need to >> perform a cleanup. >> >> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 >> >> Contributed-by: plevart, vlivanov >> >> Best regards, >> Vladimir Ivanov >> >> PS: frankly speaking, I would move Finalizator from java.lang.ref to >> java.lang.invoke and call it Context, if there were a way to extend >> package-private FinalReference from another package :-) > > Modules may have an answer for that. FinalReference could be moved to a > package (java.lang.ref.internal or jdk.lang.ref) that is not exported to > the world and made public. > > On the other hand, I don't see a reason why FinalReference couldn't be > part of JDK public API. I know it's currently just a private mechanism > to implement finalization which has issues and many would like to see it > gone, but Java has learned to live with it, and, as you see, it can be a > solution to some problems too ;-) > > Regards, Peter > >> >> [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 > From vladimir.kozlov at oracle.com Tue May 12 15:38:11 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 12 May 2015 08:38:11 -0700 Subject: RFR(M): 8076188 Optimize arraycopy out for non escaping destination In-Reply-To: <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> References: <2B8622DD-1DA5-4715-B4BF-3801202B588B@oracle.com> <553F9C4A.7040407@oracle.com> <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> Message-ID: <55521E63.9030106@oracle.com> Okay. Thanks, Vladimir On 5/12/15 3:40 AM, Roland Westrelin wrote: > > jprt found some problems with this change. 
So I'd like to push: > > http://cr.openjdk.java.net/~roland/8076188/webrev.00-01/ > > on top of the reviewed change. Overall change: > > http://cr.openjdk.java.net/~roland/8076188/webrev.01/ > > List of fixes: > > - tests with G1 failed in verification code. I made the change to LoadNode::Ideal() so the new code that looks for a dominating identical load is disabled for raw loads > > - OptimizePtrCompare is broken in cases like the new test case added to TestEliminateArrayCopy.java: field from non escaping object should have an unknown value when the object is target of an ArrayCopy. > > - The ifnode change is unrelated (I found the problem when debugging one of the issues above) but it's simple enough that I don't think it needs its own CR. > > Roland. > >> On Apr 29, 2015, at 11:30 AM, Roland Westrelin wrote: >> >> Thanks for the review, Vladimir. >> >> Roland. >> >>> On Apr 28, 2015, at 4:42 PM, Vladimir Ivanov wrote: >>> >>> Looks good. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 4/21/15 4:02 PM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~roland/8076188/webrev.00/ >>>> >>>> This patch tries to eliminate ArrayCopyNodes (for instance clones, array clones, arraycopy and copyOf) when the destination of the copy doesn't escape: >>>> >>>> - during escape analysis, ArrayCopyNodes don't cause the destination of the copy to be marked as escaping anymore >>>> - a load to the destination of a copy may be replaced by a load from the source during IGVN >>>> - during macro expansion, ArrayCopyNodes don't stop allocation from being eliminated and can themselves be eliminated >>>> >>>> Roland.
>>>> >> > From aph at redhat.com Tue May 12 16:59:14 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 12 May 2015 17:59:14 +0100 Subject: aarch64 AD-file / matching rule In-Reply-To: References: Message-ID: <55523162.3000401@redhat.com> Hi, On 04/23/2015 02:41 PM, Benedikt Wedenik wrote: > I'm writing compiler-optimisations for the aarch64 port at the moment and I am using specjbb2005 for benchmarking. > One of the patterns I want to optimise is the following: > > 0x0000007f8c2961b4: and w2, w2, #0x7ffff8 > 0x0000007f8c2961b8: cmp w2, #0x0 > 0x0000007f8c2961bc: b.eq 0x0000007f8c2968f4 I do not see this pattern generated anywhere by C2. If you can tell me which routine is being compiled I will be very interested. It is generated a lot by C1, but that's C1: not interesting. Incidentally, if you want to see a profile of specjbb2005, do this: operf java -Xms256m -Xmx256m -agentpath:/usr/lib64/oprofile/libjvmti_oprofile.so spec.jbb.JBBmain -propfile SPECjbb.props then, afterwards: opreport -l Andrew. From vitalyd at gmail.com Tue May 12 17:05:08 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 12 May 2015 13:05:08 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: Message-ID: Anyone? :) Thanks On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich wrote: > Hi guys, > > Suppose we have the following code: > > private static final long[] ARR = new long[10]; > private static final int ITERATIONS = 500 * 1000 * 1000; > > > static void loop() { > int i = ITERATIONS; > while(--i >= 0) { > ARR[4] = i; > } > } > > My expectation would be that loop() effectively compiles to: > ARR[4] = 0; > ret; > > However, an actual loop is compiled. This example is a reduction of a > slightly more involved case that I had in mind, but I'm curious if I'm > missing something here. > > I tested this with 8u40 and tiered compilation disabled, for what it's > worth. 
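(Illustrative sketch, not from the original message: the optimization being asked for amounts to two steps, sinking the loop-invariant-address store out of the loop, then collapsing the now-empty counted loop. In plain Java, reusing the loop() shape quoted above; the class and method names are invented for the example.)

```java
public class StoreSinkSketch {
    private static final long[] ARR = new long[10];
    private static final int ITERATIONS = 500 * 1000 * 1000;

    // Step 1: the address ARR[4] is loop-invariant, so the store can be sunk
    // below the loop; only the last value written is observable. This is legal
    // here because the loop always executes at least once (ITERATIONS > 0) and
    // concurrent readers are racy anyway, so intermediate values need not be
    // made visible.
    static void loopWithSunkStore() {
        int i = ITERATIONS;
        long last = 0;
        while (--i >= 0) {
            last = i;   // loop-variant value stays in a register
        }
        ARR[4] = last;  // single store after the loop
    }

    // Step 2: the loop body is now empty and the trip count is known, so the
    // loop collapses and the final value folds to a constant: the last
    // iteration writes i == 0.
    static void loopCollapsed() {
        ARR[4] = 0;
    }

    public static void main(String[] args) {
        loopWithSunkStore();
        long afterSink = ARR[4];
        loopCollapsed();
        System.out.println(afterSink == ARR[4]);  // prints: true
    }
}
```

Both versions leave ARR[4] == 0, which is the "ARR[4] = 0; ret;" shape expected in the report.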
> > Thanks From vladimir.kozlov at oracle.com Tue May 12 17:12:22 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 12 May 2015 10:12:22 -0700 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: Message-ID: <55523476.90601@oracle.com> I would need to look at the ideal graph to find what is going on. My guess is that the problem is Store node has control edge pointing to loop head so we can't move it from loop. Or stupid stuff like we only do it for index increment and not for decrement (which I doubt). The most likely scenario is that we convert this loop to canonical Counted and split it (creating 3 loops: pre, main, post) before we had a chance to collapse it. Regards, Vladimir On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > Anyone? :) > > Thanks > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > wrote: > > Hi guys, > > Suppose we have the following code: > > private static final long[] ARR = new long[10]; > private static final int ITERATIONS = 500 * 1000 * 1000; > > > static void loop() { > int i = ITERATIONS; > while(--i >= 0) { > ARR[4] = i; > } > } > > My expectation would be that loop() effectively compiles to: > ARR[4] = 0; > ret; > > However, an actual loop is compiled. This example is a reduction of > a slightly more involved case that I had in mind, but I'm curious if > I'm missing something here. > > I tested this with 8u40 and tiered compilation disabled, for what > it's worth. > > Thanks > > From vitalyd at gmail.com Tue May 12 17:16:05 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 12 May 2015 13:16:05 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: <55523476.90601@oracle.com> References: <55523476.90601@oracle.com> Message-ID: Thanks Vladimir. 
I believe I did see the split pre/main/post loop form in the compiled code; I tried it with both increment and decrement, and also a plain counted for loop, but same asm in the end. Clearly the example I gave is just for illustrating the point, but perhaps this type of code can be left behind after some other optimization passes complete. Do you think it's worth handling these types of things? On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov wrote: > I would need to look on ideal graph to find what is going on. > My guess is that the problem is Store node has control edge pointing to > loop head so we can't move it from loop. Or stupid staff like we only do it > for index increment and not for decrement (which I doubt). > > Most possible scenario is we convert this loop to canonical Counted and > split it (creating 3 loops: pre, main, post) before we had a chance to > collapse it. > > Regards, > Vladimir > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > >> Anyone? :) >> >> Thanks >> >> On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > wrote: >> >> Hi guys, >> >> Suppose we have the following code: >> >> private static final long[] ARR = new long[10]; >> private static final int ITERATIONS = 500 * 1000 * 1000; >> >> >> static void loop() { >> int i = ITERATIONS; >> while(--i >= 0) { >> ARR[4] = i; >> } >> } >> >> My expectation would be that loop() effectively compiles to: >> ARR[4] = 0; >> ret; >> >> However, an actual loop is compiled. This example is a reduction of >> a slightly more involved case that I had in mind, but I'm curious if >> I'm missing something here. >> >> I tested this with 8u40 and tiered compilation disabled, for what >> it's worth. >> >> Thanks >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vladimir.kozlov at oracle.com Tue May 12 17:47:51 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 12 May 2015 10:47:51 -0700 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> Message-ID: <55523CC7.5010200@oracle.com> The most important thing we need to do with this code is move Store from the loop. We definitely collapse empty loops. We do move some loop invariant nodes from loops but this case could be different. I think the problem is that the stored value is loop-variant. Yes, it would benefit in general if we could move such stores from loops. Vladimir On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > Thanks Vladimir. I believe I did see the split pre/main/post loop form > in the compiled code; I tried it with both increment and decrement, and > also a plain counted for loop, but same asm in the end. Clearly the > example I gave is just for illustrating the point, but perhaps this type > of code can be left behind after some other optimization passes > complete. Do you think it's worth handling these types of things? > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > wrote: > > I would need to look at the ideal graph to find what is going on. > My guess is that the problem is Store node has control edge pointing > to loop head so we can't move it from loop. Or stupid stuff like we > only do it for index increment and not for decrement (which I doubt). > > The most likely scenario is that we convert this loop to canonical Counted > and split it (creating 3 loops: pre, main, post) before we had a > chance to collapse it. > > Regards, > Vladimir > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > Anyone? 
:) > > Thanks > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > >> wrote: > > Hi guys, > > Suppose we have the following code: > > private static final long[] ARR = new long[10]; > private static final int ITERATIONS = 500 * 1000 * 1000; > > > static void loop() { > int i = ITERATIONS; > while(--i >= 0) { > ARR[4] = i; > } > } > > My expectation would be that loop() effectively compiles to: > ARR[4] = 0; > ret; > > However, an actual loop is compiled. This example is a > reduction of > a slightly more involved case that I had in mind, but I'm > curious if > I'm missing something here. > > I tested this with 8u40 and tiered compilation disabled, > for what > it's worth. > > Thanks > > > From vitalyd at gmail.com Tue May 12 17:56:07 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 12 May 2015 13:56:07 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: <55523CC7.5010200@oracle.com> References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: Well, this seems more about detecting that each iteration of the loop is storing to the same address, which means if the loop is counted (or otherwise determined to end), then only final value can be stored. Even if loop is not removed entirely and the final value computed statically (as one could conceive in this case), I'd think working with a temp inside the loop and then storing the temp value into memory would be good as well. FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ code :). On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov wrote: > The most important thing we need do with this code is move Store from the > loop. We definitely collapse empty loops. > We do move some loop invariant nodes from loops but this case could be > different. I think the problem is stored value is loop variant. > Yes, it would benefit in general if we could move such stores from loops. 
> > Vladimir > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > >> Thanks Vladimir. I believe I did see the split pre/main/post loop form >> in the compiled code; I tried it with both increment and decrement, and >> also a plain counted for loop, but same asm in the end. Clearly the >> example I gave is just for illustrating the point, but perhaps this type >> of code can be left behind after some other optimization passes >> complete. Do you think it's worth handling these types of things? >> >> On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov >> > wrote: >> >> I would need to look on ideal graph to find what is going on. >> My guess is that the problem is Store node has control edge pointing >> to loop head so we can't move it from loop. Or stupid staff like we >> only do it for index increment and not for decrement (which I doubt). >> >> Most possible scenario is we convert this loop to canonical Counted >> and split it (creating 3 loops: pre, main, post) before we had a >> chance to collapse it. >> >> Regards, >> Vladimir >> >> On 5/12/15 10:05 AM, Vitaly Davidovich wrote: >> >> Anyone? :) >> >> Thanks >> >> On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich >> >> >> wrote: >> >> Hi guys, >> >> Suppose we have the following code: >> >> private static final long[] ARR = new long[10]; >> private static final int ITERATIONS = 500 * 1000 * 1000; >> >> >> static void loop() { >> int i = ITERATIONS; >> while(--i >= 0) { >> ARR[4] = i; >> } >> } >> >> My expectation would be that loop() effectively compiles to: >> ARR[4] = 0; >> ret; >> >> However, an actual loop is compiled. This example is a >> reduction of >> a slightly more involved case that I had in mind, but I'm >> curious if >> I'm missing something here. >> >> I tested this with 8u40 and tiered compilation disabled, >> for what >> it's worth. >> >> Thanks >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sandhya.viswanathan at intel.com Tue May 12 23:25:00 2015 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Tue, 12 May 2015 23:25:00 +0000 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CDD48.8080500@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> Message-ID: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> I was wondering if inline assembly can be used in the native code of JRE. Does the JRE build framework support it? Thanks, Sandhya -----Original Message----- From: Andrew Haley [mailto:aph at redhat.com] Sent: Friday, May 08, 2015 8:59 AM To: Vladimir Kozlov; Florian Weimer; Viswanathan, Sandhya; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(L): 8069539: RSA acceleration Here is a prototype of what I propose: http://cr.openjdk.java.net/~aph/rsa-1/ It is a JNI version of the fast algorithm I think we should use for RSA. It doesn't use any intrinsics. But with it, RSA is already twice as fast as the code we have now for 1024-bit keys, and almost three times as fast for 2048-bit keys! This code (montgomery_multiply) is designed to be as easy as I can possibly make it to translate into a hand-coded intrinsic. It will then offer better performance still, with the JNI overhead gone and with some carefully hand-tweaked memory accesses. It has a very regular structure which should make it fairly easy to turn into a software pipeline with overlapped fetching and multiplication, although this will perhaps be difficult on register-starved machines like the x86. 
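(Background sketch, not the code from the webrev: the algorithm referred to is Montgomery multiplication, which computes a*b mod n with no division by n in the hot path. A real implementation such as montgomery_multiply works word-by-word on 64-bit limbs; the BigInteger version below only shows the algebra, and the names R, nPrime, redc are invented for the example.)

```java
import java.math.BigInteger;

public class MontgomerySketch {
    // REDC(t) = t * R^-1 mod n, for 0 <= t < n*R, where R = 2^k.
    // Uses only multiplies, adds, a mask, and a shift: no division by n.
    static BigInteger redc(BigInteger t, BigInteger n, BigInteger nPrime, int k) {
        BigInteger mask = BigInteger.ONE.shiftLeft(k).subtract(BigInteger.ONE);
        BigInteger m = t.multiply(nPrime).and(mask);        // m = t * (-n^-1) mod R
        BigInteger u = t.add(m.multiply(n)).shiftRight(k);  // (t + m*n) / R, exact
        return u.compareTo(n) >= 0 ? u.subtract(n) : u;     // single conditional subtract
    }

    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(101);          // odd modulus
        int k = n.bitLength();                           // R = 2^k > n, gcd(R, n) = 1
        BigInteger R = BigInteger.ONE.shiftLeft(k);
        BigInteger nPrime = R.subtract(n.modInverse(R)); // -n^-1 mod R, precomputed once

        BigInteger a = BigInteger.valueOf(17), b = BigInteger.valueOf(41);
        BigInteger aM = a.multiply(R).mod(n);            // Montgomery form: aR mod n
        BigInteger bM = b.multiply(R).mod(n);

        BigInteger prodM = redc(aM.multiply(bM), n, nPrime, k); // (a*b)R mod n
        BigInteger prod  = redc(prodM, n, nPrime, k);           // back out: a*b mod n
        System.out.println(prod.equals(a.multiply(b).mod(n)));  // prints: true
    }
}
```

The division-free, branch-light reduction step is what makes the routine map well onto a hand-coded intrinsic; in a modular exponentiation the conversions into and out of Montgomery form are paid once per exponentiation, not per multiply.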
The JNI version can still be used for those machines where people don't yet want to write a hand-coded intrinsic. I haven't yet done anything about Montgomery squaring: it's asymptotically 25% faster than Montgomery multiplication. I have only written it for x86_64. Systems without a 64-bit multiply won't benefit very much from this idea, but they are pretty much legacy anyway: I want to concentrate on 64-bit systems. So, shall we go with this? I don't think you will find any faster way to do it. Andrew. From aph at redhat.com Wed May 13 08:34:12 2015 From: aph at redhat.com (Andrew Haley) Date: Wed, 13 May 2015 09:34:12 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> Message-ID: <55530C84.2070801@redhat.com> On 13/05/15 00:25, Viswanathan, Sandhya wrote: > I was wondering if inline assembly can be used in the native code of JRE. Does the JRE build framework support it? For every target which supports it you could define a macro in one of the os/cpu-dependent header files. But it's slightly awkward because it's compiler- as well as cpu-dependent. As long as we assume GCC it's pretty easy for all targets but I'm not sure what other compilers we use. However, I have to stress that this is just a demonstration: even though it is already much better than what we have, I'm not imagining that we really want to use JNI for this. 
It would make more sense to move it into HotSpot and call it as a runtime helper. One interesting thing to try would be to find out if this technique makes things slower on 32-bit targets. If it doesn't then we could use it everywhere. Andrew. From roland.westrelin at oracle.com Wed May 13 09:05:19 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 11:05:19 +0200 Subject: RFR(XS): 8078436: java/util/stream/boottest/java/util/stream/UnorderedTest.java crashed with an assert in ifnode.cpp Message-ID: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> http://cr.openjdk.java.net/~roland/8078436/webrev.00/ The assert is wrong. I mixed the BoolNode of one IfNode with the projection of the other IfNode. Roland. From vladimir.x.ivanov at oracle.com Wed May 13 09:06:47 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 12:06:47 +0300 Subject: RFR(XS): 8078436: java/util/stream/boottest/java/util/stream/UnorderedTest.java crashed with an assert in ifnode.cpp In-Reply-To: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> References: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> Message-ID: <55531427.5030201@oracle.com> Looks good. Best regards, Vladimir Ivanov On 5/13/15 12:05 PM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8078436/webrev.00/ > > The assert is wrong. I mixed the BoolNode of one IfNode with the projection of the other IfNode. > > Roland. > From vladimir.x.ivanov at oracle.com Wed May 13 09:08:26 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 12:08:26 +0300 Subject: RFR(M): 8076188 Optimize arraycopy out for non escaping destination In-Reply-To: <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> References: <2B8622DD-1DA5-4715-B4BF-3801202B588B@oracle.com> <553F9C4A.7040407@oracle.com> <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> Message-ID: <5553148A.2020206@oracle.com> Still looks fine. 
Best regards, Vladimir Ivanov On 5/12/15 1:40 PM, Roland Westrelin wrote: > > jprt found some problems with this change. So I'd like to push: > > http://cr.openjdk.java.net/~roland/8076188/webrev.00-01/ > > on top of the reviewed change. Overall change: > > http://cr.openjdk.java.net/~roland/8076188/webrev.01/ > > List of fixes: > > - tests with G1 failed in verification code. I made the change to LoadNode::Ideal() so the new code that looks for a dominating identical load is disabled for raw loads > > - OptimizePtrCompare is broken in cases like the new test case added to TestEliminateArrayCopy.java: a field of a non-escaping object should have an unknown value when the object is the target of an ArrayCopy. > > - The ifnode change is unrelated (I found the problem when debugging one of the issues above) but it's simple enough that I don't think it needs its own CR. > > Roland. > >> On Apr 29, 2015, at 11:30 AM, Roland Westrelin wrote: >> >> Thanks for the review, Vladimir. >> >> Roland. >> >>> On Apr 28, 2015, at 4:42 PM, Vladimir Ivanov wrote: >>> >>> Looks good. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 4/21/15 4:02 PM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~roland/8076188/webrev.00/ >>>> >>>> This patch tries to eliminate ArrayCopyNodes (for instance clones, array clones, arraycopy and copyOf) when the destination of the copy doesn't escape: >>>> >>>> - during escape analysis, ArrayCopyNodes don't cause the destination of the copy to be marked as escaping anymore >>>> - a load to the destination of a copy may be replaced by a load from the source during IGVN >>>> - during macro expansion, ArrayCopyNodes don't stop allocation from being eliminated and can themselves be eliminated >>>> >>>> Roland. 
>>>> >> > From vladimir.x.ivanov at oracle.com Wed May 13 09:27:59 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 12:27:59 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5551DC44.9060902@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> Message-ID: <5553191F.6080800@oracle.com> I finished experimenting with the idea inspired by private discussion with Kim Barrett: http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ The idea is to use CallSite instance as a context for dependency tracking, instead of the Class CallSite is bound to. It requires extension of nmethod dependencies to be used separately from InstanceKlass. Though it slightly complicates nmethod dependency management (special cases for call_site_target_value are added), it completely eliminates the need for context state transitions. Every CallSite instance has its dedicated context, represented as CallSite.Context which stores an address of the dependency list head (nmethodBucket*). Cleanup action (Cleaner) is attached to CallSite instance and frees all allocated nmethodBucket entries when CallSite becomes unreachable. Though CallSite instance can freely go away, the Cleaner keeps corresponding CallSite.Context from reclamation (all Cleaners are registered in static list in Cleaner class). I like how it finally shaped out. It became much easier to reason about CallSite context. Dependency tracking extension to non-InstanceKlass contexts looks quite natural and can be reused. Testing: extended regression test, jprt, nashorn Thanks! Best regards, Vladimir Ivanov On 5/12/15 1:56 PM, Vladimir Ivanov wrote: > Peter, > >>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 >>> https://bugs.openjdk.java.net/browse/JDK-8079205 >> >> Your Finalizator touches are good. 
Supplier interface is not needed as >> there is a public Reference superclass that can be used for return type >> of JavaLangRefAccess.createFinalizator(). You can remove the import for >> Supplier in Reference now. > Thanks for spotting that. Will do. > >>> Recent change in sun.misc.Cleaner behavior broke CallSite context >>> cleanup. >>> >>> CallSite references context class through a Cleaner to avoid its >>> unnecessary retention. >>> >>> The problem is the following: to do a cleanup (invalidate all affected >>> nmethods) VM needs a pointer to a context class. Until Cleaner is >>> cleared (and it was a manual action, since Cleaner extends >>> PhantomReference isn't automatically cleared according to the docs), >>> VM can extract it from CallSite.context.referent field. >> >> If PhantomReference.referent wasn't cleared by VM when PhantomReference >> was equeued, could it happen that the referent pointer was still != null >> and the referent object's heap memory was already reclaimed by GC? Is >> that what JDK-8071931 is about? (I cant't see the bug - it's internal). >> In that respect the Cleaner based solution was broken from the start, as >> you did dereference the referent after Cleaner was enqueued. You could >> get a != null pointer which was pointing to reclaimed heap. In Java this >> dereference is prevented by PhantomReference.get() always returning null >> (well, nothing prevents one to read the referent field with reflection >> though). > No, the object isn't reclaimed until corresponding reference is != null. > When Cleaner wasn't automatically cleared, it was guaranteed that the > referent is alive when cleanup action happens. That is the assumption > CallSite dependency tracking is based on. > > It's not the case anymore when Cleaner is automatically cleared. Cleanup > action can happen when context Class is already GCed. So, I can access > neither the context mirror class nor VM-internal InstanceKlass. 
> >> FinalReference(s) are different in that their referent is not reclaimed >> while it is still reachable through FinalReference, which means that >> finalizable objects must undergo at least two GC cycles, with processing >> in the reference handler and finalizer threads inbetween the cycles, to >> be reclaimed. PhantomReference(s), on the other hand, can be enqueued >> and their referents reclaimed in the same GC cycle, can't they? > Since PhantomReference isn't automatically cleared, GC can't reclaim its > referent until the reference is manually cleared or PhantomReference > becomes unreachable as a result of cleanup action. > So, it's impossible to reclaim it in the same GC cycle (unless it is a > concurrent GC cycle?). > >>> I experimented with moving cleanup logic into VM [1], >> >> What's the downside of that approach? I mean, why is GC-assisted >> approach better? Simpler? > IMO the more code on JDK side the better. More safety guarantees are > provided in Java code comparing to native/VM code. > > Also, JVM-based approach suffers from the fact it doesn't get prompt > notification when context Class can be GCed. Since it should work with > autocleared references, there's no need in a cleanup action anymore. > By the time cleanup happens, referent can be already GCed. > > That's why I switched from Cleaner to WeakReference. When > CallSite.context is cleared ("stale" context), VM has to scan all > nmethods in the code cache to find all affected nmethods. > > BTW I had a private discussion with Kim Barrett who suggested an > alternative approach which doesn't require full code cache scan. I plan > to do some prototyping to understand its feasibility, since it requires > non-InstanceKlass-based nmethod dependency tracking machinery. > > Best regards, > Vladimir Ivanov > >>> but Peter Levart came up with a clever idea and implemented >>> FinalReference-based cleaner-like Finalizator. 
Classes don't have >>> finalizers, but Finalizator allows to attach a finalization action to >>> them. And it is guaranteed that the referent is alive when >>> finalization happens. >>> >>> Also, Peter spotted another problem with Cleaner-based implementation. >>> Cleaner cleanup action is strongly referenced, since it is registered >>> in Cleaner class. CallSite context cleanup action keeps a reference to >>> CallSite class (it is passed to MHN.invalidateDependentNMethods). >>> Users are free to extend CallSite and many do so. If a context class >>> and a call site class are loaded by a custom class loader, such loader >>> will never be unloaded, causing a memory leak. >>> >>> Finalizator doesn't suffer from that, since the action is referenced >>> only from Finalizator instance. The downside is that cleanup action >>> can be missed if Finalizator becomes unreachable. It's not a problem >>> for CallSite context, because a context is always referenced from some >>> CallSite and if a CallSite becomes unreachable, there's no need to >>> perform a cleanup. >>> >>> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 >>> >>> Contributed-by: plevart, vlivanov >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> PS: frankly speaking, I would move Finalizator from java.lang.ref to >>> java.lang.invoke and call it Context, if there were a way to extend >>> package-private FinalReference from another package :-) >> >> Modules may have an answer for that. FinalReference could be moved to a >> package (java.lang.ref.internal or jdk.lang.ref) that is not exported to >> the world and made public. >> >> On the other hand, I don't see a reason why FinalReference couldn't be >> part of JDK public API. 
I know it's currently just a private mechanism >> to implement finalization which has issues and many would like to see it >> gone, but Java has learned to live with it, and, as you see, it can be a >> solution to some problems too ;-) >> >> Regards, Peter >> >>> >>> [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 >> From vladimir.x.ivanov at oracle.com Wed May 13 09:50:48 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 12:50:48 +0300 Subject: [9] RFR (XS): 8079135: C2 disables some optimizations when a large number of unique nodes exist Message-ID: <55531E78.7070301@oracle.com> http://cr.openjdk.java.net/~vlivanov/8079135/webrev.00/ Incremental inlining produces lots of dead nodes, so # of unique and # of live nodes can differ significantly. But actual IR size is what matters. Performance experiments show that octane doesn't hit the problem (no difference). Testing: octane/nashorn Best regards, Vladimir Ivanov From bonifaido at gmail.com Wed May 13 10:07:02 2015 From: bonifaido at gmail.com (=?UTF-8?B?TsOhbmRvciBJc3R2w6FuIEtyw6Fjc2Vy?=) Date: Wed, 13 May 2015 12:07:02 +0200 Subject: [9] RFR(S): 8068945: Use RBP register as proper frame pointer in JIT compiled code on x86 Message-ID: Hi, this change broke my build for the zero jvm variant, I think the PreserveFramePointer global should be defined in src/cpu/zero/vm/globals_zero.hpp as well, correct me please if I'm wrong. Regards, Nandor -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From peter.levart at gmail.com Wed May 13 10:45:17 2015 From: peter.levart at gmail.com (Peter Levart) Date: Wed, 13 May 2015 12:45:17 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5553191F.6080800@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> Message-ID: <55532B3D.5090202@gmail.com> Hi Vladimir, This is nice. I don't know how space-savvy CallSite should be, but you can save one additional object - the Cleaner's thunk: static class Context implements Runnable { private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* static Context make(CallSite cs) { Context newContext = new Context(); // Cleaner is attached to CallSite instance and it clears native structures allocated for CallSite context. // Though the CallSite can become unreachable, its Context is retained by the Cleaner instance (which is // referenced from Cleaner class) until cleanup is performed. Cleaner.create(cs, newContext); return newContext; } @Override public void run() { MethodHandleNatives.clearCallSiteContext(this); } } Since Context instance does not escape to client code, I think this is safe. Regards, Peter On 05/13/2015 11:27 AM, Vladimir Ivanov wrote: > I finished experimenting with the idea inspired by private discussion > with Kim Barrett: > > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ > > The idea is to use CallSite instance as a context for dependency > tracking, instead of the Class CallSite is bound to. It requires > extension of nmethod dependencies to be used separately from > InstanceKlass. Though it slightly complicates nmethod dependency > management (special cases for call_site_target_value are added), it > completely eliminates the need for context state transitions. 
> > Every CallSite instance has its dedicated context, represented as > CallSite.Context which stores an address of the dependency list head > (nmethodBucket*). Cleanup action (Cleaner) is attached to CallSite > instance and frees all allocated nmethodBucket entries when CallSite > becomes unreachable. Though CallSite instance can freely go away, the > Cleaner keeps corresponding CallSite.Context from reclamation (all > Cleaners are registered in static list in Cleaner class). > > I like how it finally shaped out. It became much easier to reason > about CallSite context. Dependency tracking extension to > non-InstanceKlass contextes looks quite natural and can be reused. > > Testing: extended regression test, jprt, nashorn > > Thanks! > > Best regards, > Vladimir Ivanov > > On 5/12/15 1:56 PM, Vladimir Ivanov wrote: >> Peter, >> >>>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 >>>> https://bugs.openjdk.java.net/browse/JDK-8079205 >>> >>> Your Finalizator touches are good. Supplier interface is not needed as >>> there is a public Reference superclass that can be used for return type >>> of JavaLangRefAccess.createFinalizator(). You can remove the import for >>> Supplier in Reference now. >> Thanks for spotting that. Will do. >> >>>> Recent change in sun.misc.Cleaner behavior broke CallSite context >>>> cleanup. >>>> >>>> CallSite references context class through a Cleaner to avoid its >>>> unnecessary retention. >>>> >>>> The problem is the following: to do a cleanup (invalidate all affected >>>> nmethods) VM needs a pointer to a context class. Until Cleaner is >>>> cleared (and it was a manual action, since Cleaner extends >>>> PhantomReference isn't automatically cleared according to the docs), >>>> VM can extract it from CallSite.context.referent field. 
>>> >>> If PhantomReference.referent wasn't cleared by VM when PhantomReference >>> was equeued, could it happen that the referent pointer was still != >>> null >>> and the referent object's heap memory was already reclaimed by GC? Is >>> that what JDK-8071931 is about? (I cant't see the bug - it's internal). >>> In that respect the Cleaner based solution was broken from the >>> start, as >>> you did dereference the referent after Cleaner was enqueued. You could >>> get a != null pointer which was pointing to reclaimed heap. In Java >>> this >>> dereference is prevented by PhantomReference.get() always returning >>> null >>> (well, nothing prevents one to read the referent field with reflection >>> though). >> No, the object isn't reclaimed until corresponding reference is != null. >> When Cleaner wasn't automatically cleared, it was guaranteed that the >> referent is alive when cleanup action happens. That is the assumption >> CallSite dependency tracking is based on. >> >> It's not the case anymore when Cleaner is automatically cleared. Cleanup >> action can happen when context Class is already GCed. So, I can access >> neither the context mirror class nor VM-internal InstanceKlass. >> >>> FinalReference(s) are different in that their referent is not reclaimed >>> while it is still reachable through FinalReference, which means that >>> finalizable objects must undergo at least two GC cycles, with >>> processing >>> in the reference handler and finalizer threads inbetween the cycles, to >>> be reclaimed. PhantomReference(s), on the other hand, can be enqueued >>> and their referents reclaimed in the same GC cycle, can't they? >> Since PhantomReference isn't automatically cleared, GC can't reclaim its >> referent until the reference is manually cleared or PhantomReference >> becomes unreachable as a result of cleanup action. >> So, it's impossible to reclaim it in the same GC cycle (unless it is a >> concurrent GC cycle?). 
>> >>>> I experimented with moving cleanup logic into VM [1], >>> >>> What's the downside of that approach? I mean, why is GC-assisted >>> approach better? Simpler? >> IMO the more code on JDK side the better. More safety guarantees are >> provided in Java code comparing to native/VM code. >> >> Also, JVM-based approach suffers from the fact it doesn't get prompt >> notification when context Class can be GCed. Since it should work with >> autocleared references, there's no need in a cleanup action anymore. >> By the time cleanup happens, referent can be already GCed. >> >> That's why I switched from Cleaner to WeakReference. When >> CallSite.context is cleared ("stale" context), VM has to scan all >> nmethods in the code cache to find all affected nmethods. >> >> BTW I had a private discussion with Kim Barrett who suggested an >> alternative approach which doesn't require full code cache scan. I plan >> to do some prototyping to understand its feasibility, since it requires >> non-InstanceKlass-based nmethod dependency tracking machinery. >> >> Best regards, >> Vladimir Ivanov >> >>>> but Peter Levart came up with a clever idea and implemented >>>> FinalReference-based cleaner-like Finalizator. Classes don't have >>>> finalizers, but Finalizator allows to attach a finalization action to >>>> them. And it is guaranteed that the referent is alive when >>>> finalization happens. >>>> >>>> Also, Peter spotted another problem with Cleaner-based implementation. >>>> Cleaner cleanup action is strongly referenced, since it is registered >>>> in Cleaner class. CallSite context cleanup action keeps a reference to >>>> CallSite class (it is passed to MHN.invalidateDependentNMethods). >>>> Users are free to extend CallSite and many do so. If a context class >>>> and a call site class are loaded by a custom class loader, such loader >>>> will never be unloaded, causing a memory leak. 
>>>> >>>> Finalizator doesn't suffer from that, since the action is referenced >>>> only from Finalizator instance. The downside is that cleanup action >>>> can be missed if Finalizator becomes unreachable. It's not a problem >>>> for CallSite context, because a context is always referenced from some >>>> CallSite and if a CallSite becomes unreachable, there's no need to >>>> perform a cleanup. >>>> >>>> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 >>>> >>>> Contributed-by: plevart, vlivanov >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> PS: frankly speaking, I would move Finalizator from java.lang.ref to >>>> java.lang.invoke and call it Context, if there were a way to extend >>>> package-private FinalReference from another package :-) >>> >>> Modules may have an answer for that. FinalReference could be moved to a >>> package (java.lang.ref.internal or jdk.lang.ref) that is not >>> exported to >>> the world and made public. >>> >>> On the other hand, I don't see a reason why FinalReference couldn't be >>> part of JDK public API. 
I know it's currently just a private mechanism >>> to implement finalization which has issues and many would like to >>> see it >>> gone, but Java has learned to live with it, and, as you see, it can >>> be a >>> solution to some problems too ;-) >>> >>> Regards, Peter >>> >>>> >>>> [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 >>> From paul.sandoz at oracle.com Wed May 13 11:24:07 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Wed, 13 May 2015 13:24:07 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5553191F.6080800@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> Message-ID: <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> Hi Vladimir, I am not an expert in the HS area but the code mostly made sense to me. I also like Peter's suggestion of Context implementing Runnable. Some minor comments. CallSite.java: 145 private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* It's a little odd to see this field be final, but I think I understand the motivation as it's essentially acting as a memory address pointing to the head of the nbucket list, so in Java code you don't want to assign it to anything other than 0. Is VM access, via java_lang_invoke_CallSite_Context, affected by treating final fields as really final in the j.l.i package? javaClasses.hpp: 1182 static oop context(oop site); 1183 static void set_context(oop site, oop context); Is set_context required? instanceKlass.cpp: 1876 // recording of dependecies. Returns true if the bucket is ready for reclamation. typo "dependecies" Paul.
On May 13, 2015, at 11:27 AM, Vladimir Ivanov wrote: > I finished experimenting with the idea inspired by private discussion with Kim Barrett: > > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ > > The idea is to use CallSite instance as a context for dependency tracking, instead of the Class CallSite is bound to. It requires extension of nmethod dependencies to be used separately from InstanceKlass. Though it slightly complicates nmethod dependency management (special cases for call_site_target_value are added), it completely eliminates the need for context state transitions. > > Every CallSite instance has its dedicated context, represented as CallSite.Context which stores an address of the dependency list head (nmethodBucket*). Cleanup action (Cleaner) is attached to CallSite instance and frees all allocated nmethodBucket entries when CallSite becomes unreachable. Though CallSite instance can freely go away, the Cleaner keeps corresponding CallSite.Context from reclamation (all Cleaners are registered in static list in Cleaner class). > > I like how it finally shaped out. It became much easier to reason about CallSite context. Dependency tracking extension to non-InstanceKlass contextes looks quite natural and can be reused. > > Testing: extended regression test, jprt, nashorn > > Thanks! > > Best regards, > Vladimir Ivanov > > On 5/12/15 1:56 PM, Vladimir Ivanov wrote: >> Peter, >> >>>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 >>>> https://bugs.openjdk.java.net/browse/JDK-8079205 >>> >>> Your Finalizator touches are good. Supplier interface is not needed as >>> there is a public Reference superclass that can be used for return type >>> of JavaLangRefAccess.createFinalizator(). You can remove the import for >>> Supplier in Reference now. >> Thanks for spotting that. Will do. >> >>>> Recent change in sun.misc.Cleaner behavior broke CallSite context >>>> cleanup. 
>>>> >>>> CallSite references context class through a Cleaner to avoid its >>>> unnecessary retention. >>>> >>>> The problem is the following: to do a cleanup (invalidate all affected >>>> nmethods) VM needs a pointer to a context class. Until Cleaner is >>>> cleared (and it was a manual action, since Cleaner extends >>>> PhantomReference isn't automatically cleared according to the docs), >>>> VM can extract it from CallSite.context.referent field. >>> >>> If PhantomReference.referent wasn't cleared by VM when PhantomReference >>> was equeued, could it happen that the referent pointer was still != null >>> and the referent object's heap memory was already reclaimed by GC? Is >>> that what JDK-8071931 is about? (I cant't see the bug - it's internal). >>> In that respect the Cleaner based solution was broken from the start, as >>> you did dereference the referent after Cleaner was enqueued. You could >>> get a != null pointer which was pointing to reclaimed heap. In Java this >>> dereference is prevented by PhantomReference.get() always returning null >>> (well, nothing prevents one to read the referent field with reflection >>> though). >> No, the object isn't reclaimed until corresponding reference is != null. >> When Cleaner wasn't automatically cleared, it was guaranteed that the >> referent is alive when cleanup action happens. That is the assumption >> CallSite dependency tracking is based on. >> >> It's not the case anymore when Cleaner is automatically cleared. Cleanup >> action can happen when context Class is already GCed. So, I can access >> neither the context mirror class nor VM-internal InstanceKlass. >> >>> FinalReference(s) are different in that their referent is not reclaimed >>> while it is still reachable through FinalReference, which means that >>> finalizable objects must undergo at least two GC cycles, with processing >>> in the reference handler and finalizer threads inbetween the cycles, to >>> be reclaimed. 
PhantomReference(s), on the other hand, can be enqueued >>> and their referents reclaimed in the same GC cycle, can't they? >> Since PhantomReference isn't automatically cleared, GC can't reclaim its >> referent until the reference is manually cleared or PhantomReference >> becomes unreachable as a result of cleanup action. >> So, it's impossible to reclaim it in the same GC cycle (unless it is a >> concurrent GC cycle?). >> >>>> I experimented with moving cleanup logic into VM [1], >>> >>> What's the downside of that approach? I mean, why is GC-assisted >>> approach better? Simpler? >> IMO the more code on JDK side the better. More safety guarantees are >> provided in Java code comparing to native/VM code. >> >> Also, JVM-based approach suffers from the fact it doesn't get prompt >> notification when context Class can be GCed. Since it should work with >> autocleared references, there's no need in a cleanup action anymore. >> By the time cleanup happens, referent can be already GCed. >> >> That's why I switched from Cleaner to WeakReference. When >> CallSite.context is cleared ("stale" context), VM has to scan all >> nmethods in the code cache to find all affected nmethods. >> >> BTW I had a private discussion with Kim Barrett who suggested an >> alternative approach which doesn't require full code cache scan. I plan >> to do some prototyping to understand its feasibility, since it requires >> non-InstanceKlass-based nmethod dependency tracking machinery. >> >> Best regards, >> Vladimir Ivanov >> >>>> but Peter Levart came up with a clever idea and implemented >>>> FinalReference-based cleaner-like Finalizator. Classes don't have >>>> finalizers, but Finalizator allows to attach a finalization action to >>>> them. And it is guaranteed that the referent is alive when >>>> finalization happens. >>>> >>>> Also, Peter spotted another problem with Cleaner-based implementation. 
>>>> Cleaner cleanup action is strongly referenced, since it is registered >>>> in Cleaner class. CallSite context cleanup action keeps a reference to >>>> CallSite class (it is passed to MHN.invalidateDependentNMethods). >>>> Users are free to extend CallSite and many do so. If a context class >>>> and a call site class are loaded by a custom class loader, such loader >>>> will never be unloaded, causing a memory leak. >>>> >>>> Finalizator doesn't suffer from that, since the action is referenced >>>> only from Finalizator instance. The downside is that cleanup action >>>> can be missed if Finalizator becomes unreachable. It's not a problem >>>> for CallSite context, because a context is always referenced from some >>>> CallSite and if a CallSite becomes unreachable, there's no need to >>>> perform a cleanup. >>>> >>>> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 >>>> >>>> Contributed-by: plevart, vlivanov >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> PS: frankly speaking, I would move Finalizator from java.lang.ref to >>>> java.lang.invoke and call it Context, if there were a way to extend >>>> package-private FinalReference from another package :-) >>> >>> Modules may have an answer for that. FinalReference could be moved to a >>> package (java.lang.ref.internal or jdk.lang.ref) that is not exported to >>> the world and made public. >>> >>> On the other hand, I don't see a reason why FinalReference couldn't be >>> part of JDK public API. I know it's currently just a private mechanism >>> to implement finalization which has issues and many would like to see it >>> gone, but Java has learned to live with it, and, as you see, it can be a >>> solution to some problems too ;-) >>> >>> Regards, Peter >>> >>>> >>>> [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 >>> -------------- next part -------------- A non-text attachment was scrubbed... 
From roland.westrelin at oracle.com Wed May 13 11:43:12 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 13:43:12 +0200 Subject: RFR(M): 8076188 Optimize arraycopy out for non escaping destination In-Reply-To: <5553148A.2020206@oracle.com> References: <2B8622DD-1DA5-4715-B4BF-3801202B588B@oracle.com> <553F9C4A.7040407@oracle.com> <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> <5553148A.2020206@oracle.com> Message-ID: <6CB8AAD1-2C3C-4766-9C75-CE786C4F94EF@oracle.com> Thanks Vladimir & Vladimir for the re-reviews. Roland. > On May 13, 2015, at 11:08 AM, Vladimir Ivanov wrote: > > Still looks fine. > > Best regards, > Vladimir Ivanov > > On 5/12/15 1:40 PM, Roland Westrelin wrote: >> >> jprt found some problems with this change. So I'd like to push: >> >> http://cr.openjdk.java.net/~roland/8076188/webrev.00-01/ >> >> on top of the reviewed change. Overall change: >> >> http://cr.openjdk.java.net/~roland/8076188/webrev.01/ >> >> List of fixes: >> >> - tests with G1 failed in verification code. I made the change to LoadNode::Ideal() so the new code that looks for a dominating identical load is disabled for raw loads >> >> - OptimizePtrCompare is broken in cases like the new test case added to TestEliminateArrayCopy.java: field from non escaping object should have an unknown value when the object is target of an ArrayCopy. >> >> - The ifnode change is unrelated (I found the problem when debugging one of the issues above) but it's simple enough that I don't think it needs its own CR. >> >> Roland. >> >>> On Apr 29, 2015, at 11:30 AM, Roland Westrelin wrote: >>> >>> Thanks for the review, Vladimir. >>> >>> Roland. >>> >>>> On Apr 28, 2015, at 4:42 PM, Vladimir Ivanov wrote: >>>> >>>> Looks good.
>>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> On 4/21/15 4:02 PM, Roland Westrelin wrote: >>>>> http://cr.openjdk.java.net/~roland/8076188/webrev.00/ >>>>> >>>>> This patch tries to eliminate ArrayCopyNodes (for instance clones, array clones, arraycopy and copyOf) when the destination of the copy doesn't escape: >>>>> >>>>> - during escape analysis, ArrayCopyNodes don't cause the destination of the copy to be marked as escaping anymore >>>>> - a load to the destination of a copy may be replaced by a load from the source during IGVN >>>>> - during macro expansion, ArrayCopyNodes don't stop allocation from being eliminated and can themselves be eliminated >>>>> >>>>> Roland. >>>>> >>> >> From vladimir.x.ivanov at oracle.com Wed May 13 11:59:42 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 14:59:42 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> Message-ID: <55533CAE.3050201@oracle.com> Peter, Paul, thanks for the feedback! Updated the webrev in place: http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 > I am not an expert in the HS area but the code mostly made sense to me. I also like Peter's suggestion of Context implementing Runnable. I agree. Integrated. > Some minor comments. > > CallSite.java: > > 145 private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* > > It's a little odd to see this field be final, but I think I understand the motivation as it's essentially acting as a memory address pointing to the head of the nbucket list, so in Java code you don't want to assign it to anything other than 0.
Yes, my motivation was to forbid reads & writes of the field from Java code. I was experimenting with injecting the field by VM, but it's less attractive. > Is VM access, via java_lang_invoke_CallSite_Context, affected by treating final fields as really final in the j.l.i package? No, field modifiers aren't respected. oopDesc::*_field_[get|put] does a raw read/write using field offset. > javaClasses.hpp: > > 1182 static oop context(oop site); > 1183 static void set_context(oop site, oop context); > > Is set_context required? Good point. Removed. > instanceKlass.cpp: > > 1876 // recording of dependecies. Returns true if the bucket is ready for reclamation. > > typo "dependecies" Fixed. Best regards, Vladimir Ivanov From zoltan.majo at oracle.com Wed May 13 12:41:46 2015 From: zoltan.majo at oracle.com (Zoltán Majó) Date: Wed, 13 May 2015 14:41:46 +0200 Subject: [9] RFR(S): 8068945: Use RBP register as proper frame pointer in JIT compiled code on x86 In-Reply-To: References: Message-ID: <5553468A.6010206@oracle.com> Hi Nándor, On 05/13/2015 12:07 PM, Nándor István Krácser wrote: > Hi, > > this change broke my build for the zero jvm variant, I think > the PreserveFramePointer global should be defined > in src/cpu/zero/vm/globals_zero.hpp as well, correct me please if I'm > wrong. Thanks for noticing and reporting the problem. I've filed 8080281 and will send out an RFR soon.
Best wishes, Zoltán > > Regards, > Nandor From roland.westrelin at oracle.com Wed May 13 13:02:50 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 15:02:50 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed Message-ID:

http://cr.openjdk.java.net/~roland/8078866/webrev.00/

The assert fires when optimizing that code:

static int remi_sumb() {
    Integer j = Integer.valueOf(1);
    for (int i = 0; i < 1000; i++) {
        j = j + 1;
    }
    return j;
}

The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there. Here is the chain of events that leads to the pre loop being optimized out:

1- The CountedLoopNode is created
2- PhiNode::Value() for the induction variable's Phi is called and captures that 0 <= i < 1000
3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive values of j
4- pre-loop, main-loop, post-loop are created
5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of:

public static Integer valueOf(int i) {
    if (i >= IntegerCache.low && i <= IntegerCache.high)
        return IntegerCache.cache[i + (-IntegerCache.low)];
    return new Integer(i);
}

6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i
7- In the preloop, the range of values of i is known because of 2-, so the if of valueOf() can be optimized out
8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop()

SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf.
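For reference, the reproducer above can be wrapped into a self-contained program along these lines (the class name, the main harness, the warm-up count and the expected value 1001 are illustrative additions, not part of the original report):

```java
public class RemiSumbRepro {

    // Same shape as the reproducer: autoboxing inside a counted loop.
    // j starts at 1 and is incremented 1000 times, so the result is 1001.
    static int remi_sumb() {
        Integer j = Integer.valueOf(1);
        for (int i = 0; i < 1000; i++) {
            j = j + 1;
        }
        return j;
    }

    public static void main(String[] args) {
        int result = 0;
        // Call it repeatedly so the method gets compiled; the report notes
        // the problem only shows up when remi_sumb is inlined and compiled.
        for (int i = 0; i < 20_000; i++) {
            result = remi_sumb();
        }
        if (result != 1001) {
            throw new AssertionError("expected 1001, got " + result);
        }
        System.out.println(result);
    }
}
```

The original failure was seen through the jtreg test compiler/eliminateAutobox/6934604/TestIntBoxing.java; whatever VM flags that test uses to steer C2 are not reproduced here.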
The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops. I don't have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don't optimize the loop out at all. Roland. From roland.westrelin at oracle.com Wed May 13 13:47:45 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 15:47:45 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5553191F.6080800@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> Message-ID: <882B8D85-62B4-4734-8CA7-0BFD45DB73B3@oracle.com> > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ The hotspot code looks good to me. Roland. From rickard.backman at oracle.com Wed May 13 13:51:08 2015 From: rickard.backman at oracle.com (Rickard Bäckman) Date: Wed, 13 May 2015 15:51:08 +0200 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet Message-ID: <20150513135108.GQ31204@rbackman> Hi, can I please have this small change reviewed? There was a typo in the name in the SA code which causes SA to fail when looking at OopMaps.
Bug: https://bugs.openjdk.java.net/browse/JDK-8080155 Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ Thanks /R From roland.westrelin at oracle.com Wed May 13 14:01:35 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 16:01:35 +0200 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet In-Reply-To: <20150513135108.GQ31204@rbackman> References: <20150513135108.GQ31204@rbackman> Message-ID: <43BF38E3-3152-41DA-B3E9-BE708E420E86@oracle.com> > Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ Looks good to me. Roland. From rickard.backman at oracle.com Wed May 13 14:07:51 2015 From: rickard.backman at oracle.com (Rickard Bäckman) Date: Wed, 13 May 2015 16:07:51 +0200 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet In-Reply-To: <43BF38E3-3152-41DA-B3E9-BE708E420E86@oracle.com> References: <20150513135108.GQ31204@rbackman> <43BF38E3-3152-41DA-B3E9-BE708E420E86@oracle.com> Message-ID: <20150513140751.GR31204@rbackman> Thank you! /R On 05/13, Roland Westrelin wrote: > > Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ > > Looks good to me. > > Roland. From roland.westrelin at oracle.com Wed May 13 14:08:26 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 16:08:26 +0200 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: Vitaly, I filed: https://bugs.openjdk.java.net/browse/JDK-8080289 Roland.
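For reference, the example being discussed (quoted in the replies below) runs standalone as follows; the class name and the final check are illustrative additions, not part of the original report:

```java
public class LoopStoreRepro {

    private static final long[] ARR = new long[10];
    private static final int ITERATIONS = 500 * 1000 * 1000;

    // Every iteration stores to the same slot, so only the last store
    // (ARR[4] = 0) is observable; the question in the thread is why the
    // JIT keeps the whole loop instead of reducing it to a single store.
    static void loop() {
        int i = ITERATIONS;
        while (--i >= 0) {
            ARR[4] = i;
        }
    }

    public static void main(String[] args) {
        loop();
        if (ARR[4] != 0) {
            throw new AssertionError("expected 0, got " + ARR[4]);
        }
        System.out.println(ARR[4]);
    }
}
```

The report ran on 8u40 with tiered compilation disabled; the interesting artifact is the assembly emitted for loop(), not the program's output.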
> On May 12, 2015, at 7:56 PM, Vitaly Davidovich wrote: > > Well, this seems more about detecting that each iteration of the loop is storing to the same address, which means if the loop is counted (or otherwise determined to end), then only final value can be stored. Even if loop is not removed entirely and the final value computed statically (as one could conceive in this case), I'd think working with a temp inside the loop and then storing the temp value into memory would be good as well. > > FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ code :). > > On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov wrote: > The most important thing we need do with this code is move Store from the loop. We definitely collapse empty loops. > We do move some loop invariant nodes from loops but this case could be different. I think the problem is stored value is loop variant. > Yes, it would benefit in general if we could move such stores from loops. > > Vladimir > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > Thanks Vladimir. I believe I did see the split pre/main/post loop form > in the compiled code; I tried it with both increment and decrement, and > also a plain counted for loop, but same asm in the end. Clearly the > example I gave is just for illustrating the point, but perhaps this type > of code can be left behind after some other optimization passes > complete. Do you think it's worth handling these types of things? > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > wrote: > > I would need to look on ideal graph to find what is going on. > My guess is that the problem is Store node has control edge pointing > to loop head so we can't move it from loop. Or stupid staff like we > only do it for index increment and not for decrement (which I doubt). > > Most possible scenario is we convert this loop to canonical Counted > and split it (creating 3 loops: pre, main, post) before we had a > chance to collapse it. 
> > Regards, > Vladimir > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > Anyone? :) > > Thanks > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > >> wrote: > > Hi guys, > > Suppose we have the following code: > > private static final long[] ARR = new long[10]; > private static final int ITERATIONS = 500 * 1000 * 1000; > > > static void loop() { > int i = ITERATIONS; > while(--i >= 0) { > ARR[4] = i; > } > } > > My expectation would be that loop() effectively compiles to: > ARR[4] = 0; > ret; > > However, an actual loop is compiled. This example is a > reduction of > a slightly more involved case that I had in mind, but I'm > curious if > I'm missing something here. > > I tested this with 8u40 and tiered compilation disabled, > for what > it's worth. > > Thanks > > > > From vitalyd at gmail.com Wed May 13 14:12:46 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 13 May 2015 10:12:46 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: Thanks Roland. Did you by chance confirm this on your end? I wanted to make sure this wasn't some artifact on my part. Vitaly On Wed, May 13, 2015 at 10:08 AM, Roland Westrelin < roland.westrelin at oracle.com> wrote: > Vitaly, > > I filed: > > https://bugs.openjdk.java.net/browse/JDK-8080289 > > Roland. > > > > On May 12, 2015, at 7:56 PM, Vitaly Davidovich > wrote: > > > > Well, this seems more about detecting that each iteration of the loop is > storing to the same address, which means if the loop is counted (or > otherwise determined to end), then only final value can be stored. Even if > loop is not removed entirely and the final value computed statically (as > one could conceive in this case), I'd think working with a temp inside the > loop and then storing the temp value into memory would be good as well. 
> > > > FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ > code :). > > > > On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov < > vladimir.kozlov at oracle.com> wrote: > > The most important thing we need do with this code is move Store from > the loop. We definitely collapse empty loops. > > We do move some loop invariant nodes from loops but this case could be > different. I think the problem is stored value is loop variant. > > Yes, it would benefit in general if we could move such stores from loops. > > > > Vladimir > > > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > > Thanks Vladimir. I believe I did see the split pre/main/post loop form > > in the compiled code; I tried it with both increment and decrement, and > > also a plain counted for loop, but same asm in the end. Clearly the > > example I gave is just for illustrating the point, but perhaps this type > > of code can be left behind after some other optimization passes > > complete. Do you think it's worth handling these types of things? > > > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > > wrote: > > > > I would need to look on ideal graph to find what is going on. > > My guess is that the problem is Store node has control edge pointing > > to loop head so we can't move it from loop. Or stupid staff like we > > only do it for index increment and not for decrement (which I doubt). > > > > Most possible scenario is we convert this loop to canonical Counted > > and split it (creating 3 loops: pre, main, post) before we had a > > chance to collapse it. > > > > Regards, > > Vladimir > > > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > > > Anyone? 
:) > > > > Thanks > > > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > > > >> wrote: > > > > Hi guys, > > > > Suppose we have the following code: > > > > private static final long[] ARR = new long[10]; > > private static final int ITERATIONS = 500 * 1000 * 1000; > > > > > > static void loop() { > > int i = ITERATIONS; > > while(--i >= 0) { > > ARR[4] = i; > > } > > } > > > > My expectation would be that loop() effectively compiles to: > > ARR[4] = 0; > > ret; > > > > However, an actual loop is compiled. This example is a > > reduction of > > a slightly more involved case that I had in mind, but I'm > > curious if > > I'm missing something here. > > > > I tested this with 8u40 and tiered compilation disabled, > > for what > > it's worth. > > > > Thanks > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.westrelin at oracle.com Wed May 13 14:19:18 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 16:19:18 +0200 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: > Thanks Roland. Did you by chance confirm this on your end? I wanted to make sure this wasn't some artifact on my part. Yes, it?s a problem on our side. Roland. > > Vitaly > > On Wed, May 13, 2015 at 10:08 AM, Roland Westrelin wrote: > Vitaly, > > I filed: > > https://bugs.openjdk.java.net/browse/JDK-8080289 > > Roland. > > > > On May 12, 2015, at 7:56 PM, Vitaly Davidovich wrote: > > > > Well, this seems more about detecting that each iteration of the loop is storing to the same address, which means if the loop is counted (or otherwise determined to end), then only final value can be stored. 
Even if loop is not removed entirely and the final value computed statically (as one could conceive in this case), I'd think working with a temp inside the loop and then storing the temp value into memory would be good as well. > > > > FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ code :). > > > > On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov wrote: > > The most important thing we need do with this code is move Store from the loop. We definitely collapse empty loops. > > We do move some loop invariant nodes from loops but this case could be different. I think the problem is stored value is loop variant. > > Yes, it would benefit in general if we could move such stores from loops. > > > > Vladimir > > > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > > Thanks Vladimir. I believe I did see the split pre/main/post loop form > > in the compiled code; I tried it with both increment and decrement, and > > also a plain counted for loop, but same asm in the end. Clearly the > > example I gave is just for illustrating the point, but perhaps this type > > of code can be left behind after some other optimization passes > > complete. Do you think it's worth handling these types of things? > > > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > > wrote: > > > > I would need to look on ideal graph to find what is going on. > > My guess is that the problem is Store node has control edge pointing > > to loop head so we can't move it from loop. Or stupid staff like we > > only do it for index increment and not for decrement (which I doubt). > > > > Most possible scenario is we convert this loop to canonical Counted > > and split it (creating 3 loops: pre, main, post) before we had a > > chance to collapse it. > > > > Regards, > > Vladimir > > > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > > > Anyone? 
:) > > > > Thanks > > > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > > > >> wrote: > > > > Hi guys, > > > > Suppose we have the following code: > > > > private static final long[] ARR = new long[10]; > > private static final int ITERATIONS = 500 * 1000 * 1000; > > > > > > static void loop() { > > int i = ITERATIONS; > > while(--i >= 0) { > > ARR[4] = i; > > } > > } > > > > My expectation would be that loop() effectively compiles to: > > ARR[4] = 0; > > ret; > > > > However, an actual loop is compiled. This example is a > > reduction of > > a slightly more involved case that I had in mind, but I'm > > curious if > > I'm missing something here. > > > > I tested this with 8u40 and tiered compilation disabled, > > for what > > it's worth. > > > > Thanks > > > > > > > > > > From vitalyd at gmail.com Wed May 13 14:19:44 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 13 May 2015 10:19:44 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: Great, thank you On Wed, May 13, 2015 at 10:19 AM, Roland Westrelin < roland.westrelin at oracle.com> wrote: > > Thanks Roland. Did you by chance confirm this on your end? I wanted to > make sure this wasn't some artifact on my part. > > Yes, it?s a problem on our side. > > Roland. > > > > > Vitaly > > > > On Wed, May 13, 2015 at 10:08 AM, Roland Westrelin < > roland.westrelin at oracle.com> wrote: > > Vitaly, > > > > I filed: > > > > https://bugs.openjdk.java.net/browse/JDK-8080289 > > > > Roland. > > > > > > > On May 12, 2015, at 7:56 PM, Vitaly Davidovich > wrote: > > > > > > Well, this seems more about detecting that each iteration of the loop > is storing to the same address, which means if the loop is counted (or > otherwise determined to end), then only final value can be stored. 
Even if > loop is not removed entirely and the final value computed statically (as > one could conceive in this case), I'd think working with a temp inside the > loop and then storing the temp value into memory would be good as well. > > > > > > FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ > code :). > > > > > > On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov < > vladimir.kozlov at oracle.com> wrote: > > > The most important thing we need do with this code is move Store from > the loop. We definitely collapse empty loops. > > > We do move some loop invariant nodes from loops but this case could be > different. I think the problem is stored value is loop variant. > > > Yes, it would benefit in general if we could move such stores from > loops. > > > > > > Vladimir > > > > > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > > > Thanks Vladimir. I believe I did see the split pre/main/post loop form > > > in the compiled code; I tried it with both increment and decrement, and > > > also a plain counted for loop, but same asm in the end. Clearly the > > > example I gave is just for illustrating the point, but perhaps this > type > > > of code can be left behind after some other optimization passes > > > complete. Do you think it's worth handling these types of things? > > > > > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > > > > wrote: > > > > > > I would need to look on ideal graph to find what is going on. > > > My guess is that the problem is Store node has control edge > pointing > > > to loop head so we can't move it from loop. Or stupid staff like we > > > only do it for index increment and not for decrement (which I > doubt). > > > > > > Most possible scenario is we convert this loop to canonical Counted > > > and split it (creating 3 loops: pre, main, post) before we had a > > > chance to collapse it. > > > > > > Regards, > > > Vladimir > > > > > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > > > > > Anyone? 
:) > > > > > > Thanks > > > > > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > > > > > >> wrote: > > > > > > Hi guys, > > > > > > Suppose we have the following code: > > > > > > private static final long[] ARR = new long[10]; > > > private static final int ITERATIONS = 500 * 1000 * 1000; > > > > > > > > > static void loop() { > > > int i = ITERATIONS; > > > while(--i >= 0) { > > > ARR[4] = i; > > > } > > > } > > > > > > My expectation would be that loop() effectively compiles > to: > > > ARR[4] = 0; > > > ret; > > > > > > However, an actual loop is compiled. This example is a > > > reduction of > > > a slightly more involved case that I had in mind, but I'm > > > curious if > > > I'm missing something here. > > > > > > I tested this with 8u40 and tiered compilation disabled, > > > for what > > > it's worth. > > > > > > Thanks > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Wed May 13 14:28:32 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 May 2015 07:28:32 -0700 Subject: RFR(XS): 8078436: java/util/stream/boottest/java/util/stream/UnorderedTest.java crashed with an assert in ifnode.cpp In-Reply-To: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> References: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> Message-ID: <55535F90.9000704@oracle.com> Good. Thanks, Vladimir On 5/13/15 2:05 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8078436/webrev.00/ > > The assert is wrong. I mixed the BoolNode of one IfNode with the projection of the other IfNode. > > Roland. 
> From bertrand.delsart at oracle.com Wed May 13 14:31:01 2015 From: bertrand.delsart at oracle.com (Bertrand Delsart) Date: Wed, 13 May 2015 16:31:01 +0200 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet In-Reply-To: <20150513135108.GQ31204@rbackman> References: <20150513135108.GQ31204@rbackman> Message-ID: <55536025.9070700@oracle.com> On 13/05/2015 15:51, Rickard B?ckman wrote: > Hi, > > can I please have this small change reviewed? There was a typo in the > name in the SA code which causes SA to fail when looking at OopMaps. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080155 > Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ > > Thanks > /R > Looks good, Bertrand (not a Reviewer) -- Bertrand Delsart, Grenoble Engineering Center Oracle, 180 av. de l'Europe, ZIRST de Montbonnot 38330 Montbonnot Saint Martin, FRANCE bertrand.delsart at oracle.com Phone : +33 4 76 18 81 23 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From roland.westrelin at oracle.com Wed May 13 14:36:50 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 16:36:50 +0200 Subject: RFR(XS): 8078436: java/util/stream/boottest/java/util/stream/UnorderedTest.java crashed with an assert in ifnode.cpp In-Reply-To: <55535F90.9000704@oracle.com> References: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> <55535F90.9000704@oracle.com> Message-ID: <6FFA58D0-0D65-4CDA-A2A7-F23AE3D6BC1D@oracle.com> Thanks Vladimir & Vladimir for the reviews. Roland. 
From dmitry.samersoff at oracle.com Wed May 13 14:38:32 2015 From: dmitry.samersoff at oracle.com (Dmitry Samersoff) Date: Wed, 13 May 2015 17:38:32 +0300 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet In-Reply-To: <20150513135108.GQ31204@rbackman> References: <20150513135108.GQ31204@rbackman> Message-ID: <555361E8.2080506@oracle.com> Looks good to me (not a Reviewer) On 2015-05-13 16:51, Rickard Bäckman wrote: > Hi, > > can I please have this small change reviewed? There was a typo in the > name in the SA code which causes SA to fail when looking at OopMaps. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080155 > Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ > > Thanks > /R > -- Dmitry Samersoff Oracle Java development team, Saint Petersburg, Russia * I would love to change the world, but they won't give me the sources. From paul.sandoz at oracle.com Wed May 13 14:56:37 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Wed, 13 May 2015 16:56:37 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <55533CAE.3050201@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> Message-ID: On May 13, 2015, at 1:59 PM, Vladimir Ivanov wrote: > Peter, Paul, thanks for the feedback! > > Updated the webrev in place: > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 > +1 >> I am not an expert in the HS area but the code mostly made sense to me. I also like Peter's suggestion of Context implementing Runnable. > I agree. Integrated. > >> Some minor comments.
>> >> CallSite.java: >> >> 145 private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* >> >> It's a little odd to see this field be final, but i think i understand the motivation as it's essentially acting as a memory address pointing to the head of the nbucket list, so in Java code you don't want to assign it to anything other than 0. > Yes, my motivation was to forbid reads & writes of the field from Java code. I was experimenting with injecting the field by VM, but it's less attractive. > I was wondering if that were possible. >> Is VM access, via java_lang_invoke_CallSite_Context, affected by treating final fields as really final in the j.l.i package? > No, field modifiers aren't respected. oopDesc::*_field_[get|put] does a raw read/write using field offset. > Ok. Paul. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From volker.simonis at gmail.com Wed May 13 16:29:27 2015 From: volker.simonis at gmail.com (Volker Simonis) Date: Wed, 13 May 2015 18:29:27 +0200 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file Message-ID: Hi, could you please review the following, ppc64-only fix which solves a problem with the integer-rotate instructions in C2: https://bugs.openjdk.java.net/browse/JDK-8080190 http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ The problem was that we initially took the "rotate left/right by 8-bit immediate" instructions from x86_64. But on x86_64 the rotate instructions for 32-bit integers do in fact really take 8-bit immediates (seems like the Intel guys had to many spare bits when they designed their instruction set :) On the other hand, the rotate instructions on Power only use 5 bit to encode the rotation distance. 
We do check for this limit, but only in the debug build so you'll see an assertion there, but in the product build we will silently emit a wrong instruction which can lead to anything starting from a crash up to an incorrect computation. I think the simplest solution for this problem is to mask the rotation distance with 0x1f (which is equivalent to taking the distance modulo 32) in the corresponding instructions. This change should also be backported to 8 as fast as possible. Notice that this is a day-zero bug in our ppc64 port because in our internal SAP JVM we already mask the shift amounts in the ideal graph but for some unknown reasons these shared changes haven't been contributed along with our port. Thank you and best regards, Volker From vladimir.kozlov at oracle.com Wed May 13 18:39:05 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 May 2015 11:39:05 -0700 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file In-Reply-To: References: Message-ID: <55539A49.4010906@oracle.com> Is this because you can have negative values? I see that at the bottom of the calls which generate bits the u_field() method is used. Why not mask it there? There are other instructions which can have the same problem (slwi). You would have to guarantee all of those do the masking. That is why I am asking why not mask it at the final code methods, like u_field()? Thanks, Vladimir On 5/13/15 9:29 AM, Volker Simonis wrote: > Hi, > > could you please review the following, ppc64-only fix which solves a > problem with the integer-rotate instructions in C2: > > https://bugs.openjdk.java.net/browse/JDK-8080190 > http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ > > The problem was that we initially took the "rotate left/right by 8-bit > immediate" instructions from x86_64.
But on x86_64 the rotate > instructions for 32-bit integers do in fact really take 8-bit > immediates (seems like the Intel guys had too many spare bits when they > designed their instruction set :) On the other hand, the rotate > instructions on Power only use 5 bit to encode the rotation distance. > > We do check for this limit, but only in the debug build so you'll see > an assertion there, but in the product build we will silently emit a > wrong instruction which can lead to anything starting from a crash up > to an incorrect computation. > > I think the simplest solution for this problem is to mask the > rotation distance with 0x1f (which is equivalent to taking the > distance modulo 32) in the corresponding instructions. > > This change should also be backported to 8 as fast as possible. Notice > that this is a day-zero bug in our ppc64 port because in our internal SAP > JVM we already mask the shift amounts in the ideal graph but for some > unknown reasons these shared changes haven't been contributed along > with our port.
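The modulo-32 equivalence Volker relies on can be sketched in plain Java (an illustrative model, not the actual .ad file change; note that Java's own shift operators already mask their distance the same way, which is why `n == 0` is safe below):

```java
// Illustrative model of the fix: the PPC64 rotate encodings have only a
// 5-bit field for the rotation distance, so the distance must be reduced
// mod 32 before the instruction is emitted; otherwise the extra bits
// silently corrupt the encoding.
public class RotateMaskSketch {
    static int rotl32(int x, int dist) {
        int n = dist & 0x1f;                 // same as dist % 32 for dist >= 0
        return (x << n) | (x >>> (32 - n));  // Java masks shift distances, so n == 0 is safe
    }

    public static void main(String[] args) {
        // a distance of 36 must behave exactly like a distance of 4
        System.out.println(rotl32(0x12345678, 36) == Integer.rotateLeft(0x12345678, 4));  // prints true
    }
}
```

`Integer.rotateLeft` documents the same modular behavior for its distance argument, so it serves as a convenient reference implementation to check the masked rotate against.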
> > Thank you and best regards, > Volker > From sandhya.viswanathan at intel.com Wed May 13 19:29:20 2015 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Wed, 13 May 2015 19:29:20 +0000 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <55530C84.2070801@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> <55530C84.2070801@redhat.com> Message-ID: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336292E@FMSMSX112.amr.corp.intel.com> Thanks for your clarification. There are now two proposals/patches for this and we could benefit from both. The patch that I proposed has intrinsic implementation for squareToLen which would still be exercised from the BigInteger.multiply() path when two inputs are same. It shows ~2x benefit for micro benchmark doing square operations and we see ~1.5x for Crypto RSA. So in my thoughts we should go ahead with the patch from Intel as it is complementary to your proposal where you accelerate oddModPow. Best Regards, Sandhya -----Original Message----- From: Andrew Haley [mailto:aph at redhat.com] Sent: Wednesday, May 13, 2015 1:34 AM To: Viswanathan, Sandhya; Vladimir Kozlov; Florian Weimer; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(L): 8069539: RSA acceleration On 13/05/15 00:25, Viswanathan, Sandhya wrote: > I was wondering if inline assembly can be used in the native code of JRE. Does the JRE build framework support it? 
For every target which supports it you could define a macro in one of the os/cpu-dependent header files. But it's slightly awkward because it's compiler- as well as cpu-dependent. As long as we assume GCC it's pretty easy for all targets but I'm not sure what other compilers we use. However, I have to stress that this is just a demonstration: even though it is already much better than what we have, I'm not imagining that we really want to use JNI for this. It would make more sense to move it into HotSpot and call it as a runtime helper. One interesting thing to try would be to find out if this technique makes things slower on 32-bit targets. If it doesn't then we could use it everywhere. Andrew. From vitalyd at gmail.com Wed May 13 19:36:33 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 13 May 2015 15:36:33 -0400 Subject: C2: Advantage of parse time inlining Message-ID: Hi guys, Could someone please explain the advantage, if any, of parse time inlining in C2? Given that FreqInlineSize is quite large by default, most hot methods will get inlined anyway (well, ones that can be for other reasons). What is the advantage of parse time inlining? Is it quicker time to peak performance if C1 is reached first? Does it ensure that a method is inlined whereas it may not be if it's already compiled into a medium/large method otherwise? Is parse time inlining not susceptible to profile pollution? I suspect it is since the interpreter has already profiled the inlinee either way, but wanted to check. Anything else I'm not thinking about? Thanks -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vladimir.kozlov at oracle.com Wed May 13 21:28:02 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 May 2015 14:28:02 -0700 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: Message-ID: <5553C1E2.3040003@oracle.com> Why do you think that main- and post- loops will be empty too and it is safe to replace them with the pre-loop? Why is it safe? Opaque1 node is used only when RCE and align is requested: cmp_end->set_req(2, peel_only ? pre_limit : pre_opaq); In the peel_only case the pre-loop has only 1 iteration and can be eliminated too. The assert could be hit in such a case too. So your change may not fix the problem. I think eliminating main- and post- loops is a good investigation for a separate RFE. And we need to make sure it is safe. To fix this bug, I think, we should bail out from do_range_check() when we can't find the pre-loop. Typo 'cab': + // post loop cab be removed as well Thanks, Vladimir On 5/13/15 6:02 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8078866/webrev.00/ > > The assert fires when optimizing that code: > > static int remi_sumb() { > Integer j = Integer.valueOf(1); > for (int i = 0; i< 1000; i++) { > j = j + 1; > } > return j; > } > > The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there.
> > Here is the chain of events that leads to the pre loop being optimized out: > > 1- The CountedLoopNode is created > 2- PhiNode::Value() for the induction variable's Phi is called and captures that 0 <= i < 1000 > 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive values of j > 4- pre-loop, main-loop, post-loop are created > 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: > > public static Integer valueOf(int i) { > if (i >= IntegerCache.low && i <= IntegerCache.high) > return IntegerCache.cache[i + (-IntegerCache.low)]; > return new Integer(i); > } > > 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i > 7- In the preloop, the range of values of i is known because of 2-, and so the if of valueOf() can be optimized out > 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() > > SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf. The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main loop depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. > > Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops.
> > I don?t have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don?t optimize the loop out at all. > > Roland. > From vladimir.kozlov at oracle.com Thu May 14 00:22:21 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 May 2015 17:22:21 -0700 Subject: [9] RFR (XS): 8079135: C2 disables some optimizations when a large number of unique nodes exist In-Reply-To: <55531E78.7070301@oracle.com> References: <55531E78.7070301@oracle.com> Message-ID: <5553EABD.8040807@oracle.com> Looks good. Thanks, Vladimir On 5/13/15 2:50 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8079135/webrev.00/ > > Incremental inlining produces lots of dead nodes, so # of unique and # > of live nodes can differ significantly. But actual IR size is what matters. > > Performance experiments show that octane doesn't hit the problem (no > difference). > > Testing: octane/nashorn > > Best regards, > Vladimir Ivanov From michael.c.berg at intel.com Thu May 14 01:26:59 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 14 May 2015 01:26:59 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis Message-ID: Hi Folks, We (Intel) would like to contribute superword loop unrolling analysis to facilitate more default unrolling and larger SIMD vector mapping. Please review this patch and comment as needed: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 webrev: http://cr.openjdk.java.net/~kvn/8080325/webrev/ The design finds the highest common vector supported and implemented on a given machine and applies that to unrolling, iff it is greater than the default. If the user gates unrolling we will still obey the user directive. 
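The selection policy Berg describes (pick the widest supported SIMD mapping and unroll at least that far, unless the user pinned the unroll factor) can be modeled roughly as follows; the names and numbers are illustrative, not the actual HotSpot code, which lives in C2's SuperWord/loop transform passes:

```java
// Illustrative model only, not the real C2 implementation. Idea: compute
// how many elements fit in the widest vector the target supports and
// raise the unroll factor to at least that many lanes, unless the user
// gated unrolling explicitly.
public class UnrollPolicySketch {
    static int chooseUnroll(int defaultUnroll, int maxVectorBytes,
                            int elementBytes, boolean userSetUnroll) {
        if (userSetUnroll) {
            return defaultUnroll;                   // obey the user directive
        }
        int lanes = maxVectorBytes / elementBytes;  // e.g. 32-byte AVX2 / 4-byte int = 8
        return Math.max(defaultUnroll, lanes);      // unroll at least one full vector
    }

    public static void main(String[] args) {
        System.out.println(chooseUnroll(4, 32, 4, false));  // prints 8
        System.out.println(chooseUnroll(4, 32, 4, true));   // prints 4
    }
}
```

The point of taking the maximum is that a larger default unroll factor is never reduced; the vector width only ever raises it, matching "iff it is greater than the default" below.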
It's lightweight: when we get to the analysis part, if we fail, we stop further tries. If we succeed we stop further tries. We generally always analyze only once. We then gate the unroll factor by extending the size of the unroll segment so that the iterative tries will keep unrolling, obtaining larger unrolls of a targeted loop. I see no negative behavior wrt performance, and a modest positive swing in default behavior up to 1.5x for some micros. Vladimir Kozlov has offered to sponsor this patch. -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.c.berg at intel.com Thu May 14 01:31:57 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 14 May 2015 01:31:57 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: Message-ID: Also, for now I have only enabled x86 and leave the analysis of its value to other target developer contributors after the fact. Thanks, Michael From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Berg, Michael C Sent: Wednesday, May 13, 2015 6:27 PM To: hotspot-compiler-dev at openjdk.java.net Subject: RFR 8080325 SuperWord loop unrolling analysis Hi Folks, We (Intel) would like to contribute superword loop unrolling analysis to facilitate more default unrolling and larger SIMD vector mapping. Please review this patch and comment as needed: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 webrev: http://cr.openjdk.java.net/~kvn/8080325/webrev/ The design finds the highest common vector supported and implemented on a given machine and applies that to unrolling, iff it is greater than the default. If the user gates unrolling we will still obey the user directive. It's lightweight: when we get to the analysis part, if we fail, we stop further tries. If we succeed we stop further tries. We generally always analyze only once.
We then gate the unroll factor by extending the size of the unroll segment so that the iterative tries will keep unrolling, obtaining larger unrolls of a targeted loop. I see no negative behavior wrt to performance, and a modest positive swing in default behavior up to 1.5x for some micros. Vladimir Kozlov has offered to sponsor this patch. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aph at redhat.com Thu May 14 10:59:40 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 14 May 2015 11:59:40 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336292E@FMSMSX112.amr.corp.intel.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> <55530C84.2070801@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336292E@FMSMSX112.amr.corp.intel.com> Message-ID: <5554801C.5070208@redhat.com> On 05/13/2015 08:29 PM, Viswanathan, Sandhya wrote: > > Thanks for your clarification. There are now two proposals/patches > for this and we could benefit from both. Yes, indeed. > The patch that I proposed has intrinsic implementation for > squareToLen which would still be exercised from the > BigInteger.multiply() path when two inputs are same. It shows ~2x > benefit for micro benchmark doing square operations and we see ~1.5x > for Crypto RSA. > > So in my thoughts we should go ahead with the patch from Intel as it > is complementary to your proposal where you accelerate oddModPow. I think that's probably right. 
We can track the proposals separately. Even if squareToLen and multiplyToLen don't end up being used in RSA at all they will still be useful for BigInteger operations. I'm not seriously suggesting the JNI route: it's just a demonstration of how powerful the accelerated Montgomery algorithm is. (Or, alternatively, just how suboptimal our current approach is.) Andrew. From vladimir.x.ivanov at oracle.com Thu May 14 11:25:02 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 14 May 2015 14:25:02 +0300 Subject: [9] RFR (XS): 8079135: C2 disables some optimizations when a large number of unique nodes exist In-Reply-To: <5553EABD.8040807@oracle.com> References: <55531E78.7070301@oracle.com> <5553EABD.8040807@oracle.com> Message-ID: <5554860E.2030106@oracle.com> Thanks, Vladimir. Best regards, Vladimir Ivanov On 5/14/15 3:22 AM, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/13/15 2:50 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8079135/webrev.00/ >> >> Incremental inlining produces lots of dead nodes, so # of unique and # >> of live nodes can differ significantly. But actual IR size is what >> matters. >> >> Performance experiments show that octane doesn't hit the problem (no >> difference). >> >> Testing: octane/nashorn >> >> Best regards, >> Vladimir Ivanov From vladimir.x.ivanov at oracle.com Thu May 14 11:43:57 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 14 May 2015 14:43:57 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <882B8D85-62B4-4734-8CA7-0BFD45DB73B3@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <882B8D85-62B4-4734-8CA7-0BFD45DB73B3@oracle.com> Message-ID: <55548A7D.3000606@oracle.com> Thanks, Roland. 
Best regards, Vladimir Ivanov On 5/13/15 4:47 PM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ > > The hotspot code looks good to me. > > Roland. > From vladimir.x.ivanov at oracle.com Thu May 14 12:18:31 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 14 May 2015 15:18:31 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> Message-ID: <55549297.5020002@oracle.com> Small update in JDK code (webrev updated in place): http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 Changes: - hid Context.dependencies field from Reflection - new test case: CallSiteDepContextTest.testHiddenDepField() Best regards, Vladimir Ivanov On 5/13/15 5:56 PM, Paul Sandoz wrote: > > On May 13, 2015, at 1:59 PM, Vladimir Ivanov wrote: > >> Peter, Paul, thanks for the feedback! >> >> Updated the webrev in place: >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >> > > +1 > > >>> I am not an export in the HS area but the code mostly made sense to me. I also like Peter's suggestion of Context implementing Runnable. >> I agree. Integrated. >> >>> Some minor comments. >>> >>> CallSite.java: >>> >>> 145 private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* >>> >>> It's a little odd to see this field be final, but i think i understand the motivation as it's essentially acting as a memory address pointing to the head of the nbucket list, so in Java code you don't want to assign it to anything other than 0. >> Yes, my motivation was to forbid reads & writes of the field from Java code. I was experimenting with injecting the field by VM, but it's less attractive. >> > > I was wondering if that were possible. 
> > >>> Is VM access, via java_lang_invoke_CallSite_Context, affected by treating final fields as really final in the j.l.i package? >> No, field modifiers aren't respected. oopDesc::*_field_[get|put] does a raw read/write using field offset. >> > > Ok. > > Paul. > From tobias.hartmann at oracle.com Thu May 14 15:59:11 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 14 May 2015 17:59:11 +0200 Subject: [9] RFR(XS): 8080420: Compilation of TestVectorizationWithInvariant fails with "error: package com.oracle.java.testlibrary does not exist" Message-ID: <5554C64F.5010405@oracle.com> Hi, please review the following patch that fixes the testlibrary location in TestVectorizationWithInvariant after JDK-8067013 which is in hs but not in hs-comp. https://bugs.openjdk.java.net/browse/JDK-8080420 http://cr.openjdk.java.net/~thartmann/8080420/webrev.00/ This is blocking the hs-comp->hs sync. I will push the fix directly to hs. Thanks, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8067013 From vladimir.kozlov at oracle.com Thu May 14 16:00:48 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 14 May 2015 09:00:48 -0700 Subject: [9] RFR(XS): 8080420: Compilation of TestVectorizationWithInvariant fails with "error: package com.oracle.java.testlibrary does not exist" In-Reply-To: <5554C64F.5010405@oracle.com> References: <5554C64F.5010405@oracle.com> Message-ID: <5554C6B0.5080601@oracle.com> Looks good. Thanks, Vladimir On 5/14/15 8:59 AM, Tobias Hartmann wrote: > Hi, > > please review the following patch that fixes the testlibrary location in TestVectorizationWithInvariant after JDK-8067013 which is in hs but not in hs-comp. > > https://bugs.openjdk.java.net/browse/JDK-8080420 > http://cr.openjdk.java.net/~thartmann/8080420/webrev.00/ > > This is blocking the hs-comp->hs sync. I will push the fix directly to hs. 
> > Thanks, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8067013 > From tobias.hartmann at oracle.com Thu May 14 16:06:01 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 14 May 2015 18:06:01 +0200 Subject: [9] RFR(XS): 8080420: Compilation of TestVectorizationWithInvariant fails with "error: package com.oracle.java.testlibrary does not exist" In-Reply-To: <5554C6B0.5080601@oracle.com> References: <5554C64F.5010405@oracle.com> <5554C6B0.5080601@oracle.com> Message-ID: <5554C7E9.5070305@oracle.com> Thanks, Vladimir. Best, Tobias On 14.05.2015 18:00, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/14/15 8:59 AM, Tobias Hartmann wrote: >> Hi, >> >> please review the following patch that fixes the testlibrary location in TestVectorizationWithInvariant after JDK-8067013 which is in hs but not in hs-comp. >> >> https://bugs.openjdk.java.net/browse/JDK-8080420 >> http://cr.openjdk.java.net/~thartmann/8080420/webrev.00/ >> >> This is blocking the hs-comp->hs sync. I will push the fix directly to hs. >> >> Thanks, >> Tobias >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8067013 >> From vitalyd at gmail.com Thu May 14 16:09:41 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 12:09:41 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: References: Message-ID: Any pointers? Sorry to bug you guys, but just want to make sure I understand this point as I see quite a bit of discussion on core-libs and elsewhere where people are worrying about the 35 bytecode size threshold for parse inlining. On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich wrote: > Hi guys, > > Could someone please explain the advantage, if any, of parse time inlining > in C2? Given that FreqInlineSize is quite large by default, most hot > methods will get inlined anyway (well, ones that can be for other > reasons). What is the advantage of parse time inlining? 
> > Is it quicker time to peak performance if C1 is reached first? > > Does it ensure that a method is inlined whereas it may not be if it's > already compiled into a medium/large method otherwise? > > Is parse time inlining not susceptible to profile pollution? I suspect it > is since the interpreter has already profiled the inlinee either way, but > wanted to check. > > Anything else I'm not thinking about? > > Thanks > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.x.ivanov at oracle.com Thu May 14 16:36:45 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 14 May 2015 19:36:45 +0300 Subject: C2: Advantage of parse time inlining In-Reply-To: References: Message-ID: <5554CF1D.3010601@oracle.com> Vitaly, Can you elaborate your question a bit? What do you compare parse-time inlining with? Mentioning of C1 & profile pollution in this context confuses me. Usually, people care about 35 (= MaxInlineSize), because for methods up to MaxInlineSize their call frequency is ignored. So, fewer chances to end up with non-inlined call. Best regards, Vladimir Ivanov On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > Any pointers? Sorry to bug you guys, but just want to make sure I > understand this point as I see quite a bit of discussion on core-libs > and elsewhere where people are worrying about the 35 bytecode size > threshold for parse inlining. > > On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich > > wrote: > > Hi guys, > > Could someone please explain the advantage, if any, of parse time > inlining in C2? Given that FreqInlineSize is quite large by default, > most hot methods will get inlined anyway (well, ones that can be for > other reasons). What is the advantage of parse time inlining?
> > Is parse time inlining not susceptible to profile pollution? I > suspect it is since the interpreter has already profiled the inlinee > either way, but wanted to check. > > Anything else I'm not thinking about? > > Thanks > > From vitalyd at gmail.com Thu May 14 16:57:26 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 12:57:26 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: <5554CF1D.3010601@oracle.com> References: <5554CF1D.3010601@oracle.com> Message-ID: Vladimir, I'm comparing MaxInlineSize (35) with FreqInlineSize (325). AFAIU, MaxInlineSize drives which methods are inlined at parse time by C2, whereas FreqInlineSize is the threshold for "late" (or what do you guys call inlining after parsing?) inlining. Most of the inlining discussions (or worries, rather) seem to focus around the MaxInlineSize value, and not FreqInlineSize, even if the target method will get hot. Usually, people care about 35 (= MaxInlineSize), because for methods up to > MaxInlineSize their call frequency is ignored. So, fewer chances to end up > with non-inlined call. Ok, so for hot methods then MaxInlineSize isn't really a concern, and FreqInlineSize would be the threshold to worry about (for C2 compiler) then? Why are people worried about inlining in cold paths then? Thanks Vladimir On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov < vladimir.x.ivanov at oracle.com> wrote: > Vitaly, > > Can you elaborate your question a bit? What do you compare parse-time > inlining with? Mentioning of ?1 & profile pollution in this context > confuses me. > > Usually, people care about 35 (= MaxInlineSize), because for methods up to > MaxInlineSize their call frequency is ignored. So, fewer chances to end up > with non-inlined call. > > Best regards, > Vladimir Ivanov > > On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > >> Any pointers? 
Sorry to bug you guys, but just want to make sure I >> understand this point as I see quite a bit of discussion on core-libs >> and elsewhere where people are worrying about the 35 bytecode size >> threshold for parse inlining. >> >> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich > > wrote: >> >> Hi guys, >> >> Could someone please explain the advantage, if any, of parse time >> inlining in C2? Given that FreqInlineSize is quite large by default, >> most hot methods will get inlined anyway (well, ones that can be for >> other reasons). What is the advantage of parse time inlining? >> >> Is it quicker time to peak performance if C1 is reached first? >> >> Does it ensure that a method is inlined whereas it may not be if >> it's already compiled into a medium/large method otherwise? >> >> Is parse time inlining not susceptible to profile pollution? I >> suspect it is since the interpreter has already profiled the inlinee >> either way, but wanted to check. >> >> Anything else I'm not thinking about? >> >> Thanks >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 14 17:03:50 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 13:03:50 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> Message-ID: I should also add that I see how inlining without taking call freq into account could lead to faster time to peak performance for methods that eventually get hot anyway but aren't at parse time. Peak perf will be the same if the method is too big for parse inlining but eventually gets compiled due to reaching hotness. Is that about right? sent from my phone On May 14, 2015 12:57 PM, "Vitaly Davidovich" wrote: > Vladimir, > > I'm comparing MaxInlineSize (35) with FreqInlineSize (325). 
AFAIU, > MaxInlineSize drives which methods are inlined at parse time by C2, whereas > FreqInlineSize is the threshold for "late" (or what do you guys call > inlining after parsing?) inlining. Most of the inlining discussions (or > worries, rather) seem to focus around the MaxInlineSize value, and not > FreqInlineSize, even if the target method will get hot. > > Usually, people care about 35 (= MaxInlineSize), because for methods up to >> MaxInlineSize their call frequency is ignored. So, fewer chances to end up >> with non-inlined call. > > > Ok, so for hot methods then MaxInlineSize isn't really a concern, and > FreqInlineSize would be the threshold to worry about (for C2 compiler) > then? Why are people worried about inlining in cold paths then? > > Thanks Vladimir > > On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov < > vladimir.x.ivanov at oracle.com> wrote: > >> Vitaly, >> >> Can you elaborate your question a bit? What do you compare parse-time >> inlining with? Mentioning of ?1 & profile pollution in this context >> confuses me. >> >> Usually, people care about 35 (= MaxInlineSize), because for methods up >> to MaxInlineSize their call frequency is ignored. So, fewer chances to end >> up with non-inlined call. >> >> Best regards, >> Vladimir Ivanov >> >> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >> >>> Any pointers? Sorry to bug you guys, but just want to make sure I >>> understand this point as I see quite a bit of discussion on core-libs >>> and elsewhere where people are worrying about the 35 bytecode size >>> threshold for parse inlining. >>> >>> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >> > wrote: >>> >>> Hi guys, >>> >>> Could someone please explain the advantage, if any, of parse time >>> inlining in C2? Given that FreqInlineSize is quite large by default, >>> most hot methods will get inlined anyway (well, ones that can be for >>> other reasons). What is the advantage of parse time inlining? 
>>> >>> Is it quicker time to peak performance if C1 is reached first? >>> >>> Does it ensure that a method is inlined whereas it may not be if >>> it's already compiled into a medium/large method otherwise? >>> >>> Is parse time inlining not susceptible to profile pollution? I >>> suspect it is since the interpreter has already profiled the inlinee >>> either way, but wanted to check. >>> >>> Anything else I'm not thinking about? >>> >>> Thanks >>> >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Thu May 14 18:30:35 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 14 May 2015 11:30:35 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> Message-ID: <5554E9CB.3020903@oracle.com> Vitaly, You have small misconception - almost all C2 inlining (normal java methods) is done during parsing in one pass. Recently we changed it to inline jsr292 methods after parsing (to execute IGVN and reduce graph - otherwise they blow up number of ideal nodes and we bailout compilation due to MaxNodeLimit). As parser goes and see a call site it check can it inline or not (see opto/bytecodeinfo.cpp, should_inline() and should_not_inline()). There are several conditions which drive inlining and following (most important) flags controls them: MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, InlineFrequencyRatio. Most tedious flag is InlineSmallCode (size of compiled assembler code which is different from other sizes which are bytecode size) which control inlining of already compiled method. Usually if a call site is hot the callee is compiled before caller. But sometimes you can get caller compiled first (if it has hot loop, for example) so different condition will be used and as result you can get performance variation between runs. 
The difference between MaxInlineSize (35) and FreqInlineSize (325) is FreqInlineSize takes into account how frequent call site is executed relatively to caller invocations: int call_site_count = method()->scale_count(profile.count()); int invoke_count = method()->interpreter_invocation_count(); int freq = call_site_count / invoke_count; int max_inline_size = MaxInlineSize; // bump the max size if the call is frequent if ((freq >= InlineFrequencyRatio) || (call_site_count >= InlineFrequencyCount) || is_unboxing_method(callee_method, C) || is_init_with_ea(callee_method, caller_method, C)) { max_inline_size = FreqInlineSize; And there is additional inlining condition for all methods which size > MaxTrivialSize: if (!callee_method->was_executed_more_than(MinInliningThreshold)) { set_msg("executed < MinInliningThreshold times"); Regards, Vladimir On 5/14/15 10:03 AM, Vitaly Davidovich wrote: > I should also add that I see how inlining without taking call freq into > account could lead to faster time to peak performance for methods that > eventually get hot anyway but aren't at parse time. Peak perf will be > the same if the method is too big for parse inlining but eventually gets > compiled due to reaching hotness. Is that about right? > > sent from my phone > > On May 14, 2015 12:57 PM, "Vitaly Davidovich" > wrote: > > Vladimir, > > I'm comparing MaxInlineSize (35) with FreqInlineSize (325). AFAIU, > MaxInlineSize drives which methods are inlined at parse time by C2, > whereas FreqInlineSize is the threshold for "late" (or what do you > guys call inlining after parsing?) inlining. Most of the inlining > discussions (or worries, rather) seem to focus around the > MaxInlineSize value, and not FreqInlineSize, even if the target > method will get hot. > > Usually, people care about 35 (= MaxInlineSize), because for > methods up to MaxInlineSize their call frequency is ignored. So, > fewer chances to end up with non-inlined call. 
> > > Ok, so for hot methods then MaxInlineSize isn't really a concern, > and FreqInlineSize would be the threshold to worry about (for C2 > compiler) then? Why are people worried about inlining in cold paths > then? > > Thanks Vladimir > > On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov > > > wrote: > > Vitaly, > > Can you elaborate your question a bit? What do you compare > parse-time inlining with? Mentioning of ?1 & profile pollution > in this context confuses me. > > Usually, people care about 35 (= MaxInlineSize), because for > methods up to MaxInlineSize their call frequency is ignored. So, > fewer chances to end up with non-inlined call. > > Best regards, > Vladimir Ivanov > > On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > > Any pointers? Sorry to bug you guys, but just want to make > sure I > understand this point as I see quite a bit of discussion on > core-libs > and elsewhere where people are worrying about the 35 > bytecode size > threshold for parse inlining. > > On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich > > >> wrote: > > Hi guys, > > Could someone please explain the advantage, if any, of > parse time > inlining in C2? Given that FreqInlineSize is quite > large by default, > most hot methods will get inlined anyway (well, ones > that can be for > other reasons). What is the advantage of parse time > inlining? > > Is it quicker time to peak performance if C1 is reached > first? > > Does it ensure that a method is inlined whereas it may > not be if > it's already compiled into a medium/large method otherwise? > > Is parse time inlining not susceptible to profile > pollution? I > suspect it is since the interpreter has already > profiled the inlinee > either way, but wanted to check. > > Anything else I'm not thinking about? 
> > Thanks > > > From john.r.rose at oracle.com Thu May 14 18:42:47 2015 From: john.r.rose at oracle.com (John Rose) Date: Thu, 14 May 2015 11:42:47 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: <5554E9CB.3020903@oracle.com> References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> Message-ID: On May 14, 2015, at 11:30 AM, Vladimir Kozlov wrote: > > As parser goes and see a call site it check can it inline or not (see opto/bytecodeinfo.cpp, should_inline() and should_not_inline()). There are several conditions which drive inlining and following (most important) flags controls them: MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, InlineFrequencyRatio. There's a bug report to catch complaints about these heuristics: https://bugs.openjdk.java.net/browse/JDK-6316156 -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 14 19:02:20 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 15:02:20 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: <5554E9CB.3020903@oracle.com> References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> Message-ID: Thanks Vladimir. I recall seeing changes around incremental inlining, and may have mistakenly thought it happens at some later point in time. Appreciate the clarification. Ok, so based on what you say, I can see a theoretical problem whereby a method is being parsed, is larger than MaxInlineSize, but doesn't happen to be frequent enough yet at this point, and so it won't be inlined; if it turns out to be hot later on, the lack of inlining will not be undone (assuming the caller isn't deopted and recompiled later, with updated frequency info, for other reasons). Thanks again. 
On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov wrote: > Vitaly, > > You have small misconception - almost all C2 inlining (normal java > methods) is done during parsing in one pass. Recently we changed it to > inline jsr292 methods after parsing (to execute IGVN and reduce graph - > otherwise they blow up number of ideal nodes and we bailout compilation due > to MaxNodeLimit). > > As parser goes and see a call site it check can it inline or not (see > opto/bytecodeinfo.cpp, should_inline() and should_not_inline()). There are > several conditions which drive inlining and following (most important) > flags controls them: MaxTrivialSize, MaxInlineSize, FreqInlineSize, > InlineSmallCode, MinInliningThreshold, MaxInlineLevel, > InlineFrequencyCount, InlineFrequencyRatio. > > Most tedious flag is InlineSmallCode (size of compiled assembler code > which is different from other sizes which are bytecode size) which control > inlining of already compiled method. Usually if a call site is hot the > callee is compiled before caller. But sometimes you can get caller compiled > first (if it has hot loop, for example) so different condition will be used > and as result you can get performance variation between runs. 
> > The difference between MaxInlineSize (35) and FreqInlineSize (325) is > FreqInlineSize takes into account how frequent call site is executed > relatively to caller invocations: > > int call_site_count = method()->scale_count(profile.count()); > int invoke_count = method()->interpreter_invocation_count(); > int freq = call_site_count / invoke_count; > int max_inline_size = MaxInlineSize; > // bump the max size if the call is frequent > if ((freq >= InlineFrequencyRatio) || > (call_site_count >= InlineFrequencyCount) || > is_unboxing_method(callee_method, C) || > is_init_with_ea(callee_method, caller_method, C)) { > max_inline_size = FreqInlineSize; > > And there is additional inlining condition for all methods which size > > MaxTrivialSize: > > if (!callee_method->was_executed_more_than(MinInliningThreshold)) { > set_msg("executed < MinInliningThreshold times"); > > Regards, > Vladimir > > On 5/14/15 10:03 AM, Vitaly Davidovich wrote: > >> I should also add that I see how inlining without taking call freq into >> account could lead to faster time to peak performance for methods that >> eventually get hot anyway but aren't at parse time. Peak perf will be >> the same if the method is too big for parse inlining but eventually gets >> compiled due to reaching hotness. Is that about right? >> >> sent from my phone >> >> On May 14, 2015 12:57 PM, "Vitaly Davidovich" > > wrote: >> >> Vladimir, >> >> I'm comparing MaxInlineSize (35) with FreqInlineSize (325). AFAIU, >> MaxInlineSize drives which methods are inlined at parse time by C2, >> whereas FreqInlineSize is the threshold for "late" (or what do you >> guys call inlining after parsing?) inlining. Most of the inlining >> discussions (or worries, rather) seem to focus around the >> MaxInlineSize value, and not FreqInlineSize, even if the target >> method will get hot. >> >> Usually, people care about 35 (= MaxInlineSize), because for >> methods up to MaxInlineSize their call frequency is ignored. 
So, >> fewer chances to end up with non-inlined call. >> >> >> Ok, so for hot methods then MaxInlineSize isn't really a concern, >> and FreqInlineSize would be the threshold to worry about (for C2 >> compiler) then? Why are people worried about inlining in cold paths >> then? >> >> Thanks Vladimir >> >> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >> > >> wrote: >> >> Vitaly, >> >> Can you elaborate your question a bit? What do you compare >> parse-time inlining with? Mentioning of ?1 & profile pollution >> in this context confuses me. >> >> Usually, people care about 35 (= MaxInlineSize), because for >> methods up to MaxInlineSize their call frequency is ignored. So, >> fewer chances to end up with non-inlined call. >> >> Best regards, >> Vladimir Ivanov >> >> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >> >> Any pointers? Sorry to bug you guys, but just want to make >> sure I >> understand this point as I see quite a bit of discussion on >> core-libs >> and elsewhere where people are worrying about the 35 >> bytecode size >> threshold for parse inlining. >> >> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >> >> >> wrote: >> >> Hi guys, >> >> Could someone please explain the advantage, if any, of >> parse time >> inlining in C2? Given that FreqInlineSize is quite >> large by default, >> most hot methods will get inlined anyway (well, ones >> that can be for >> other reasons). What is the advantage of parse time >> inlining? >> >> Is it quicker time to peak performance if C1 is reached >> first? >> >> Does it ensure that a method is inlined whereas it may >> not be if >> it's already compiled into a medium/large method >> otherwise? >> >> Is parse time inlining not susceptible to profile >> pollution? I >> suspect it is since the interpreter has already >> profiled the inlinee >> either way, but wanted to check. >> >> Anything else I'm not thinking about? >> >> Thanks >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vladimir.kozlov at oracle.com Thu May 14 19:20:35 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 14 May 2015 12:20:35 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> Message-ID: <5554F583.7090005@oracle.com> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: > Thanks Vladimir. I recall seeing changes around incremental inlining, > and may have mistakenly thought it happens at some later point in time. > Appreciate the clarification. > > Ok, so based on what you say, I can see a theoretical problem whereby a > method is being parsed, is larger than MaxInlineSize, but doesn't happen > to be frequent enough yet at this point, and so it won't be inlined; if > it turns out to be hot later on, the lack of inlining will not be undone > (assuming the caller isn't deopted and recompiled later, with updated > frequency info, for other reasons). That is correct. Note, that the problem is not how hot is callee (invocation times) but how hot the call site in caller. Usually it does not change during execution. If it is called in a loop C2 will try to inline because freq should be high. There is MinInliningThreshold (250) but since we compile caller when it is executed 10000 times the call site should be on slow path which is executed only 2.5% times. So you will not see the performance difference if we inline it or call callee which is compiled if it is hot. Vladimir > > Thanks again. > > On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov > > wrote: > > Vitaly, > > You have small misconception - almost all C2 inlining (normal java > methods) is done during parsing in one pass. Recently we changed it > to inline jsr292 methods after parsing (to execute IGVN and reduce > graph - otherwise they blow up number of ideal nodes and we bailout > compilation due to MaxNodeLimit). 
> > As parser goes and see a call site it check can it inline or not > (see opto/bytecodeinfo.cpp, should_inline() and > should_not_inline()). There are several conditions which drive > inlining and following (most important) flags controls them: > MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, > MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, > InlineFrequencyRatio. > > Most tedious flag is InlineSmallCode (size of compiled assembler > code which is different from other sizes which are bytecode size) > which control inlining of already compiled method. Usually if a call > site is hot the callee is compiled before caller. But sometimes you > can get caller compiled first (if it has hot loop, for example) so > different condition will be used and as result you can get > performance variation between runs. > > The difference between MaxInlineSize (35) and FreqInlineSize (325) > is FreqInlineSize takes into account how frequent call site is > executed relatively to caller invocations: > > int call_site_count = method()->scale_count(profile.count()); > int invoke_count = method()->interpreter_invocation_count(); > int freq = call_site_count / invoke_count; > int max_inline_size = MaxInlineSize; > // bump the max size if the call is frequent > if ((freq >= InlineFrequencyRatio) || > (call_site_count >= InlineFrequencyCount) || > is_unboxing_method(callee_method, C) || > is_init_with_ea(callee_method, caller_method, C)) { > max_inline_size = FreqInlineSize; > > And there is additional inlining condition for all methods which > size > MaxTrivialSize: > > if (!callee_method->was_executed_more_than(MinInliningThreshold)) { > set_msg("executed < MinInliningThreshold times"); > > Regards, > Vladimir > > On 5/14/15 10:03 AM, Vitaly Davidovich wrote: > > I should also add that I see how inlining without taking call > freq into > account could lead to faster time to peak performance for > methods that > eventually get hot anyway but aren't at parse 
time. Peak perf > will be > the same if the method is too big for parse inlining but > eventually gets > compiled due to reaching hotness. Is that about right? > > sent from my phone > > On May 14, 2015 12:57 PM, "Vitaly Davidovich" > >> wrote: > > Vladimir, > > I'm comparing MaxInlineSize (35) with FreqInlineSize > (325). AFAIU, > MaxInlineSize drives which methods are inlined at parse > time by C2, > whereas FreqInlineSize is the threshold for "late" (or what > do you > guys call inlining after parsing?) inlining. Most of the > inlining > discussions (or worries, rather) seem to focus around the > MaxInlineSize value, and not FreqInlineSize, even if the target > method will get hot. > > Usually, people care about 35 (= MaxInlineSize), > because for > methods up to MaxInlineSize their call frequency is > ignored. So, > fewer chances to end up with non-inlined call. > > > Ok, so for hot methods then MaxInlineSize isn't really a > concern, > and FreqInlineSize would be the threshold to worry about > (for C2 > compiler) then? Why are people worried about inlining in > cold paths > then? > > Thanks Vladimir > > On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov > > >> > wrote: > > Vitaly, > > Can you elaborate your question a bit? What do you compare > parse-time inlining with? Mentioning of ?1 & profile > pollution > in this context confuses me. > > Usually, people care about 35 (= MaxInlineSize), > because for > methods up to MaxInlineSize their call frequency is > ignored. So, > fewer chances to end up with non-inlined call. > > Best regards, > Vladimir Ivanov > > On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > > Any pointers? Sorry to bug you guys, but just want > to make > sure I > understand this point as I see quite a bit of > discussion on > core-libs > and elsewhere where people are worrying about the 35 > bytecode size > threshold for parse inlining. 
> > On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich > > > > >>> wrote: > > Hi guys, > > Could someone please explain the advantage, if > any, of > parse time > inlining in C2? Given that FreqInlineSize is quite > large by default, > most hot methods will get inlined anyway > (well, ones > that can be for > other reasons). What is the advantage of > parse time > inlining? > > Is it quicker time to peak performance if C1 > is reached > first? > > Does it ensure that a method is inlined > whereas it may > not be if > it's already compiled into a medium/large > method otherwise? > > Is parse time inlining not susceptible to profile > pollution? I > suspect it is since the interpreter has already > profiled the inlinee > either way, but wanted to check. > > Anything else I'm not thinking about? > > Thanks > > > > From vitalyd at gmail.com Thu May 14 20:01:29 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 16:01:29 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: <5554F583.7090005@oracle.com> References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Right, thank you. Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say I turn CompileThreshold down to 100 (as an example). On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov wrote: > On 5/14/15 12:02 PM, Vitaly Davidovich wrote: > >> Thanks Vladimir. I recall seeing changes around incremental inlining, >> and may have mistakenly thought it happens at some later point in time. >> Appreciate the clarification. 
>> >> Ok, so based on what you say, I can see a theoretical problem whereby a >> method is being parsed, is larger than MaxInlineSize, but doesn't happen >> to be frequent enough yet at this point, and so it won't be inlined; if >> it turns out to be hot later on, the lack of inlining will not be undone >> (assuming the caller isn't deopted and recompiled later, with updated >> frequency info, for other reasons). >> > > That is correct. > > Note, that the problem is not how hot is callee (invocation times) but how > hot the call site in caller. Usually it does not change during execution. > If it is called in a loop C2 will try to inline because freq should be high. > > There is MinInliningThreshold (250) but since we compile caller when it is > executed 10000 times the call site should be on slow path which is executed > only 2.5% times. So you will not see the performance difference if we > inline it or call callee which is compiled if it is hot. > > Vladimir > > >> Thanks again. >> >> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >> > wrote: >> >> Vitaly, >> >> You have small misconception - almost all C2 inlining (normal java >> methods) is done during parsing in one pass. Recently we changed it >> to inline jsr292 methods after parsing (to execute IGVN and reduce >> graph - otherwise they blow up number of ideal nodes and we bailout >> compilation due to MaxNodeLimit). >> >> As parser goes and see a call site it check can it inline or not >> (see opto/bytecodeinfo.cpp, should_inline() and >> should_not_inline()). There are several conditions which drive >> inlining and following (most important) flags controls them: >> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >> InlineFrequencyRatio. 
>> >> Most tedious flag is InlineSmallCode (size of compiled assembler >> code which is different from other sizes which are bytecode size) >> which control inlining of already compiled method. Usually if a call >> site is hot the callee is compiled before caller. But sometimes you >> can get caller compiled first (if it has hot loop, for example) so >> different condition will be used and as result you can get >> performance variation between runs. >> >> The difference between MaxInlineSize (35) and FreqInlineSize (325) >> is FreqInlineSize takes into account how frequent call site is >> executed relatively to caller invocations: >> >> int call_site_count = method()->scale_count(profile.count()); >> int invoke_count = method()->interpreter_invocation_count(); >> int freq = call_site_count / invoke_count; >> int max_inline_size = MaxInlineSize; >> // bump the max size if the call is frequent >> if ((freq >= InlineFrequencyRatio) || >> (call_site_count >= InlineFrequencyCount) || >> is_unboxing_method(callee_method, C) || >> is_init_with_ea(callee_method, caller_method, C)) { >> max_inline_size = FreqInlineSize; >> >> And there is additional inlining condition for all methods which >> size > MaxTrivialSize: >> >> if (!callee_method->was_executed_more_than(MinInliningThreshold)) { >> set_msg("executed < MinInliningThreshold times"); >> >> Regards, >> Vladimir >> >> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >> >> I should also add that I see how inlining without taking call >> freq into >> account could lead to faster time to peak performance for >> methods that >> eventually get hot anyway but aren't at parse time. Peak perf >> will be >> the same if the method is too big for parse inlining but >> eventually gets >> compiled due to reaching hotness. Is that about right? >> >> sent from my phone >> >> On May 14, 2015 12:57 PM, "Vitaly Davidovich" > >> >> wrote: >> >> Vladimir, >> >> I'm comparing MaxInlineSize (35) with FreqInlineSize >> (325). 
AFAIU, >> MaxInlineSize drives which methods are inlined at parse >> time by C2, >> whereas FreqInlineSize is the threshold for "late" (or what >> do you >> guys call inlining after parsing?) inlining. Most of the >> inlining >> discussions (or worries, rather) seem to focus around the >> MaxInlineSize value, and not FreqInlineSize, even if the >> target >> method will get hot. >> >> Usually, people care about 35 (= MaxInlineSize), >> because for >> methods up to MaxInlineSize their call frequency is >> ignored. So, >> fewer chances to end up with non-inlined call. >> >> >> Ok, so for hot methods then MaxInlineSize isn't really a >> concern, >> and FreqInlineSize would be the threshold to worry about >> (for C2 >> compiler) then? Why are people worried about inlining in >> cold paths >> then? >> >> Thanks Vladimir >> >> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >> > >> > >> >> >> wrote: >> >> Vitaly, >> >> Can you elaborate your question a bit? What do you >> compare >> parse-time inlining with? Mentioning of ?1 & profile >> pollution >> in this context confuses me. >> >> Usually, people care about 35 (= MaxInlineSize), >> because for >> methods up to MaxInlineSize their call frequency is >> ignored. So, >> fewer chances to end up with non-inlined call. >> >> Best regards, >> Vladimir Ivanov >> >> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >> >> Any pointers? Sorry to bug you guys, but just want >> to make >> sure I >> understand this point as I see quite a bit of >> discussion on >> core-libs >> and elsewhere where people are worrying about the 35 >> bytecode size >> threshold for parse inlining. >> >> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >> >> > >> > > >>> wrote: >> >> Hi guys, >> >> Could someone please explain the advantage, if >> any, of >> parse time >> inlining in C2? Given that FreqInlineSize is >> quite >> large by default, >> most hot methods will get inlined anyway >> (well, ones >> that can be for >> other reasons). 
What is the advantage of >> parse time >> inlining? >> >> Is it quicker time to peak performance if C1 >> is reached >> first? >> >> Does it ensure that a method is inlined >> whereas it may >> not be if >> it's already compiled into a medium/large >> method otherwise? >> >> Is parse time inlining not susceptible to >> profile >> pollution? I >> suspect it is since the interpreter has already >> profiled the inlinee >> either way, but wanted to check. >> >> Anything else I'm not thinking about? >> >> Thanks >> >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Thu May 14 21:28:31 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Thu, 14 May 2015 14:28:31 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Yes and no. intx counter_high_value; // Tiered compilation uses a different "high value" than non-tiered compilation. // Determine the right value to use. if (TieredCompilation) { counter_high_value = InvocationCounter::count_limit / 2; } else { counter_high_value = CompileThreshold / 2; } if (!callee_method->was_executed_more_than(MIN2(MinInliningThreshold, counter_high_value))) { set_msg("executed < MinInliningThreshold times"); return true; } So it's not scaling MinInliningThreshold directly, but rather using a min of MinInliningThreshold and counter_high_value (where the latter is calculated from CompileThreshold when not using tiered compilation) to make the actual decision. Because tiered is on by default now, the short answer to your question would probably be a "no". - Kris On Thu, May 14, 2015 at 1:01 PM, Vitaly Davidovich wrote: > Right, thank you. > > Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say I > turn CompileThreshold down to 100 (as an example). 
> > On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov < > vladimir.kozlov at oracle.com> wrote: > >> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: >> >>> Thanks Vladimir. I recall seeing changes around incremental inlining, >>> and may have mistakenly thought it happens at some later point in time. >>> Appreciate the clarification. >>> >>> Ok, so based on what you say, I can see a theoretical problem whereby a >>> method is being parsed, is larger than MaxInlineSize, but doesn't happen >>> to be frequent enough yet at this point, and so it won't be inlined; if >>> it turns out to be hot later on, the lack of inlining will not be undone >>> (assuming the caller isn't deopted and recompiled later, with updated >>> frequency info, for other reasons). >>> >> >> That is correct. >> >> Note, that the problem is not how hot is callee (invocation times) but >> how hot the call site in caller. Usually it does not change during >> execution. If it is called in a loop C2 will try to inline because freq >> should be high. >> >> There is MinInliningThreshold (250) but since we compile caller when it >> is executed 10000 times the call site should be on slow path which is >> executed only 2.5% times. So you will not see the performance difference if >> we inline it or call callee which is compiled if it is hot. >> >> Vladimir >> >> >>> Thanks again. >>> >>> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >>> > wrote: >>> >>> Vitaly, >>> >>> You have small misconception - almost all C2 inlining (normal java >>> methods) is done during parsing in one pass. Recently we changed it >>> to inline jsr292 methods after parsing (to execute IGVN and reduce >>> graph - otherwise they blow up number of ideal nodes and we bailout >>> compilation due to MaxNodeLimit). >>> >>> As parser goes and see a call site it check can it inline or not >>> (see opto/bytecodeinfo.cpp, should_inline() and >>> should_not_inline()). 
There are several conditions which drive >>> inlining and following (most important) flags controls them: >>> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >>> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >>> InlineFrequencyRatio. >>> >>> Most tedious flag is InlineSmallCode (size of compiled assembler >>> code which is different from other sizes which are bytecode size) >>> which control inlining of already compiled method. Usually if a call >>> site is hot the callee is compiled before caller. But sometimes you >>> can get caller compiled first (if it has hot loop, for example) so >>> different condition will be used and as result you can get >>> performance variation between runs. >>> >>> The difference between MaxInlineSize (35) and FreqInlineSize (325) >>> is FreqInlineSize takes into account how frequent call site is >>> executed relatively to caller invocations: >>> >>> int call_site_count = method()->scale_count(profile.count()); >>> int invoke_count = method()->interpreter_invocation_count(); >>> int freq = call_site_count / invoke_count; >>> int max_inline_size = MaxInlineSize; >>> // bump the max size if the call is frequent >>> if ((freq >= InlineFrequencyRatio) || >>> (call_site_count >= InlineFrequencyCount) || >>> is_unboxing_method(callee_method, C) || >>> is_init_with_ea(callee_method, caller_method, C)) { >>> max_inline_size = FreqInlineSize; >>> >>> And there is additional inlining condition for all methods which >>> size > MaxTrivialSize: >>> >>> if (!callee_method->was_executed_more_than(MinInliningThreshold)) >>> { >>> set_msg("executed < MinInliningThreshold times"); >>> >>> Regards, >>> Vladimir >>> >>> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >>> >>> I should also add that I see how inlining without taking call >>> freq into >>> account could lead to faster time to peak performance for >>> methods that >>> eventually get hot anyway but aren't at parse time. 
Peak perf >>> will be >>> the same if the method is too big for parse inlining but >>> eventually gets >>> compiled due to reaching hotness. Is that about right? >>> >>> sent from my phone >>> >>> On May 14, 2015 12:57 PM, "Vitaly Davidovich" >> >>> >> wrote: >>> >>> Vladimir, >>> >>> I'm comparing MaxInlineSize (35) with FreqInlineSize >>> (325). AFAIU, >>> MaxInlineSize drives which methods are inlined at parse >>> time by C2, >>> whereas FreqInlineSize is the threshold for "late" (or what >>> do you >>> guys call inlining after parsing?) inlining. Most of the >>> inlining >>> discussions (or worries, rather) seem to focus around the >>> MaxInlineSize value, and not FreqInlineSize, even if the >>> target >>> method will get hot. >>> >>> Usually, people care about 35 (= MaxInlineSize), >>> because for >>> methods up to MaxInlineSize their call frequency is >>> ignored. So, >>> fewer chances to end up with non-inlined call. >>> >>> >>> Ok, so for hot methods then MaxInlineSize isn't really a >>> concern, >>> and FreqInlineSize would be the threshold to worry about >>> (for C2 >>> compiler) then? Why are people worried about inlining in >>> cold paths >>> then? >>> >>> Thanks Vladimir >>> >>> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >>> >> >>> >> >>> >> >>> wrote: >>> >>> Vitaly, >>> >>> Can you elaborate your question a bit? What do you >>> compare >>> parse-time inlining with? Mentioning of ?1 & profile >>> pollution >>> in this context confuses me. >>> >>> Usually, people care about 35 (= MaxInlineSize), >>> because for >>> methods up to MaxInlineSize their call frequency is >>> ignored. So, >>> fewer chances to end up with non-inlined call. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >>> >>> Any pointers? 
Sorry to bug you guys, but just want >>> to make >>> sure I >>> understand this point as I see quite a bit of >>> discussion on >>> core-libs >>> and elsewhere where people are worrying about the 35 >>> bytecode size >>> threshold for parse inlining. >>> >>> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >>> >>> > >>> >> >> >>> wrote: >>> >>> Hi guys, >>> >>> Could someone please explain the advantage, if >>> any, of >>> parse time >>> inlining in C2? Given that FreqInlineSize is >>> quite >>> large by default, >>> most hot methods will get inlined anyway >>> (well, ones >>> that can be for >>> other reasons). What is the advantage of >>> parse time >>> inlining? >>> >>> Is it quicker time to peak performance if C1 >>> is reached >>> first? >>> >>> Does it ensure that a method is inlined >>> whereas it may >>> not be if >>> it's already compiled into a medium/large >>> method otherwise? >>> >>> Is parse time inlining not susceptible to >>> profile >>> pollution? I >>> suspect it is since the interpreter has already >>> profiled the inlinee >>> either way, but wanted to check. >>> >>> Anything else I'm not thinking about? >>> >>> Thanks >>> >>> >>> >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 14 21:35:21 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 17:35:21 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Thanks Kris. Hmm, this sounds pretty bad for non-tiered compilations with a relatively low CompileThreshold. If I have a (larger than MaxInlineSize) method executed 49% of the time, it'll inline at CompileThreshold=10k but not CompileThreshold=100. Or am I missing something? On Thu, May 14, 2015 at 5:28 PM, Krystal Mok wrote: > Yes and no. 
> > intx counter_high_value; > // Tiered compilation uses a different "high value" than non-tiered > compilation. > // Determine the right value to use. > if (TieredCompilation) { > counter_high_value = InvocationCounter::count_limit / 2; > } else { > counter_high_value = CompileThreshold / 2; > } > if > (!callee_method->was_executed_more_than(MIN2(MinInliningThreshold, > counter_high_value))) { > set_msg("executed < MinInliningThreshold times"); > return true; > } > > So it's not scaling MinInliningThreshold directly, but rather using a min > of MinInliningThreshold and counter_high_value (where the latter is > calculated from CompileThreshold when not using tiered compilation) to make > the actual decision. > > Because tiered is on by default now, the short answer to your question > would probably be a "no". > > - Kris > > On Thu, May 14, 2015 at 1:01 PM, Vitaly Davidovich > wrote: > >> Right, thank you. >> >> Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say I >> turn CompileThreshold down to 100 (as an example). >> >> On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov < >> vladimir.kozlov at oracle.com> wrote: >> >>> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: >>> >>>> Thanks Vladimir. I recall seeing changes around incremental inlining, >>>> and may have mistakenly thought it happens at some later point in time. >>>> Appreciate the clarification. >>>> >>>> Ok, so based on what you say, I can see a theoretical problem whereby a >>>> method is being parsed, is larger than MaxInlineSize, but doesn't happen >>>> to be frequent enough yet at this point, and so it won't be inlined; if >>>> it turns out to be hot later on, the lack of inlining will not be undone >>>> (assuming the caller isn't deopted and recompiled later, with updated >>>> frequency info, for other reasons). >>>> >>> >>> That is correct. >>> >>> Note, that the problem is not how hot is callee (invocation times) but >>> how hot the call site in caller. 
Usually it does not change during >>> execution. If it is called in a loop C2 will try to inline because freq >>> should be high. >>> >>> There is MinInliningThreshold (250) but since we compile caller when it >>> is executed 10000 times the call site should be on slow path which is >>> executed only 2.5% times. So you will not see the performance difference if >>> we inline it or call callee which is compiled if it is hot. >>> >>> Vladimir >>> >>> >>>> Thanks again. >>>> >>>> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >>>> > wrote: >>>> >>>> Vitaly, >>>> >>>> You have small misconception - almost all C2 inlining (normal java >>>> methods) is done during parsing in one pass. Recently we changed it >>>> to inline jsr292 methods after parsing (to execute IGVN and reduce >>>> graph - otherwise they blow up number of ideal nodes and we bailout >>>> compilation due to MaxNodeLimit). >>>> >>>> As parser goes and see a call site it check can it inline or not >>>> (see opto/bytecodeinfo.cpp, should_inline() and >>>> should_not_inline()). There are several conditions which drive >>>> inlining and following (most important) flags controls them: >>>> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >>>> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >>>> InlineFrequencyRatio. >>>> >>>> Most tedious flag is InlineSmallCode (size of compiled assembler >>>> code which is different from other sizes which are bytecode size) >>>> which control inlining of already compiled method. Usually if a call >>>> site is hot the callee is compiled before caller. But sometimes you >>>> can get caller compiled first (if it has hot loop, for example) so >>>> different condition will be used and as result you can get >>>> performance variation between runs. 
>>>> >>>> The difference between MaxInlineSize (35) and FreqInlineSize (325) >>>> is FreqInlineSize takes into account how frequent call site is >>>> executed relatively to caller invocations: >>>> >>>> int call_site_count = method()->scale_count(profile.count()); >>>> int invoke_count = method()->interpreter_invocation_count(); >>>> int freq = call_site_count / invoke_count; >>>> int max_inline_size = MaxInlineSize; >>>> // bump the max size if the call is frequent >>>> if ((freq >= InlineFrequencyRatio) || >>>> (call_site_count >= InlineFrequencyCount) || >>>> is_unboxing_method(callee_method, C) || >>>> is_init_with_ea(callee_method, caller_method, C)) { >>>> max_inline_size = FreqInlineSize; >>>> >>>> And there is additional inlining condition for all methods which >>>> size > MaxTrivialSize: >>>> >>>> if >>>> (!callee_method->was_executed_more_than(MinInliningThreshold)) { >>>> set_msg("executed < MinInliningThreshold times"); >>>> >>>> Regards, >>>> Vladimir >>>> >>>> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >>>> >>>> I should also add that I see how inlining without taking call >>>> freq into >>>> account could lead to faster time to peak performance for >>>> methods that >>>> eventually get hot anyway but aren't at parse time. Peak perf >>>> will be >>>> the same if the method is too big for parse inlining but >>>> eventually gets >>>> compiled due to reaching hotness. Is that about right? >>>> >>>> sent from my phone >>>> >>>> On May 14, 2015 12:57 PM, "Vitaly Davidovich" < >>>> vitalyd at gmail.com >>>> >>>> >> wrote: >>>> >>>> Vladimir, >>>> >>>> I'm comparing MaxInlineSize (35) with FreqInlineSize >>>> (325). AFAIU, >>>> MaxInlineSize drives which methods are inlined at parse >>>> time by C2, >>>> whereas FreqInlineSize is the threshold for "late" (or what >>>> do you >>>> guys call inlining after parsing?) inlining. 
Most of the >>>> inlining >>>> discussions (or worries, rather) seem to focus around the >>>> MaxInlineSize value, and not FreqInlineSize, even if the >>>> target >>>> method will get hot. >>>> >>>> Usually, people care about 35 (= MaxInlineSize), >>>> because for >>>> methods up to MaxInlineSize their call frequency is >>>> ignored. So, >>>> fewer chances to end up with non-inlined call. >>>> >>>> >>>> Ok, so for hot methods then MaxInlineSize isn't really a >>>> concern, >>>> and FreqInlineSize would be the threshold to worry about >>>> (for C2 >>>> compiler) then? Why are people worried about inlining in >>>> cold paths >>>> then? >>>> >>>> Thanks Vladimir >>>> >>>> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >>>> >>> >>>> >>> >>>> >> >>>> wrote: >>>> >>>> Vitaly, >>>> >>>> Can you elaborate your question a bit? What do you >>>> compare >>>> parse-time inlining with? Mentioning of ?1 & profile >>>> pollution >>>> in this context confuses me. >>>> >>>> Usually, people care about 35 (= MaxInlineSize), >>>> because for >>>> methods up to MaxInlineSize their call frequency is >>>> ignored. So, >>>> fewer chances to end up with non-inlined call. >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >>>> >>>> Any pointers? Sorry to bug you guys, but just want >>>> to make >>>> sure I >>>> understand this point as I see quite a bit of >>>> discussion on >>>> core-libs >>>> and elsewhere where people are worrying about the >>>> 35 >>>> bytecode size >>>> threshold for parse inlining. >>>> >>>> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >>>> >>>> > >>>> >>> >>> >>> wrote: >>>> >>>> Hi guys, >>>> >>>> Could someone please explain the advantage, if >>>> any, of >>>> parse time >>>> inlining in C2? Given that FreqInlineSize is >>>> quite >>>> large by default, >>>> most hot methods will get inlined anyway >>>> (well, ones >>>> that can be for >>>> other reasons). 
What is the advantage of >>>> parse time >>>> inlining? >>>> >>>> Is it quicker time to peak performance if C1 >>>> is reached >>>> first? >>>> >>>> Does it ensure that a method is inlined >>>> whereas it may >>>> not be if >>>> it's already compiled into a medium/large >>>> method otherwise? >>>> >>>> Is parse time inlining not susceptible to >>>> profile >>>> pollution? I >>>> suspect it is since the interpreter has >>>> already >>>> profiled the inlinee >>>> either way, but wanted to check. >>>> >>>> Anything else I'm not thinking about? >>>> >>>> Thanks >>>> >>>> >>>> >>>> >>>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Thu May 14 21:50:30 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 14 May 2015 14:50:30 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: <555518A6.4070505@oracle.com> On 5/14/15 1:01 PM, Vitaly Davidovich wrote: > Right, thank you. > > Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say > I turn CompileThreshold down to 100 (as an example). No, it does not scale. Note, with tiered compilation C1 compilation triggered after 100 invocations. So you will get compiled code (not optimal as C2) very early. Vladimir > > On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov > > wrote: > > On 5/14/15 12:02 PM, Vitaly Davidovich wrote: > > Thanks Vladimir. I recall seeing changes around incremental > inlining, > and may have mistakenly thought it happens at some later point > in time. > Appreciate the clarification. 
> > Ok, so based on what you say, I can see a theoretical problem > whereby a > method is being parsed, is larger than MaxInlineSize, but > doesn't happen > to be frequent enough yet at this point, and so it won't be > inlined; if > it turns out to be hot later on, the lack of inlining will not > be undone > (assuming the caller isn't deopted and recompiled later, with > updated > frequency info, for other reasons). > > > That is correct. > > Note, that the problem is not how hot is callee (invocation times) > but how hot the call site in caller. Usually it does not change > during execution. If it is called in a loop C2 will try to inline > because freq should be high. > > There is MinInliningThreshold (250) but since we compile caller when > it is executed 10000 times the call site should be on slow path > which is executed only 2.5% times. So you will not see the > performance difference if we inline it or call callee which is > compiled if it is hot. > > Vladimir > > > Thanks again. > > On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov > > >> wrote: > > Vitaly, > > You have small misconception - almost all C2 inlining > (normal java > methods) is done during parsing in one pass. Recently we > changed it > to inline jsr292 methods after parsing (to execute IGVN and > reduce > graph - otherwise they blow up number of ideal nodes and we > bailout > compilation due to MaxNodeLimit). > > As parser goes and see a call site it check can it inline > or not > (see opto/bytecodeinfo.cpp, should_inline() and > should_not_inline()). There are several conditions which drive > inlining and following (most important) flags controls them: > MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, > MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, > InlineFrequencyRatio. > > Most tedious flag is InlineSmallCode (size of compiled > assembler > code which is different from other sizes which are bytecode > size) > which control inlining of already compiled method. 
Usually > if a call > site is hot the callee is compiled before caller. But > sometimes you > can get caller compiled first (if it has hot loop, for > example) so > different condition will be used and as result you can get > performance variation between runs. > > The difference between MaxInlineSize (35) and > FreqInlineSize (325) > is FreqInlineSize takes into account how frequent call site is > executed relatively to caller invocations: > > int call_site_count = > method()->scale_count(profile.count()); > int invoke_count = > method()->interpreter_invocation_count(); > int freq = call_site_count / invoke_count; > int max_inline_size = MaxInlineSize; > // bump the max size if the call is frequent > if ((freq >= InlineFrequencyRatio) || > (call_site_count >= InlineFrequencyCount) || > is_unboxing_method(callee_method, C) || > is_init_with_ea(callee_method, caller_method, C)) { > max_inline_size = FreqInlineSize; > > And there is additional inlining condition for all methods > which > size > MaxTrivialSize: > > if > (!callee_method->was_executed_more_than(MinInliningThreshold)) { > set_msg("executed < MinInliningThreshold times"); > > Regards, > Vladimir > > On 5/14/15 10:03 AM, Vitaly Davidovich wrote: > > I should also add that I see how inlining without > taking call > freq into > account could lead to faster time to peak performance for > methods that > eventually get hot anyway but aren't at parse time. > Peak perf > will be > the same if the method is too big for parse inlining but > eventually gets > compiled due to reaching hotness. Is that about right? > > sent from my phone > > On May 14, 2015 12:57 PM, "Vitaly Davidovich" > > > > > >>> wrote: > > Vladimir, > > I'm comparing MaxInlineSize (35) with FreqInlineSize > (325). AFAIU, > MaxInlineSize drives which methods are inlined at > parse > time by C2, > whereas FreqInlineSize is the threshold for "late" > (or what > do you > guys call inlining after parsing?) inlining. 
Most > of the > inlining > discussions (or worries, rather) seem to focus > around the > MaxInlineSize value, and not FreqInlineSize, even > if the target > method will get hot. > > Usually, people care about 35 (= MaxInlineSize), > because for > methods up to MaxInlineSize their call > frequency is > ignored. So, > fewer chances to end up with non-inlined call. > > > Ok, so for hot methods then MaxInlineSize isn't > really a > concern, > and FreqInlineSize would be the threshold to worry > about > (for C2 > compiler) then? Why are people worried about > inlining in > cold paths > then? > > Thanks Vladimir > > On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov > > > > > > >>> > wrote: > > Vitaly, > > Can you elaborate your question a bit? What do > you compare > parse-time inlining with? Mentioning of ?1 & > profile > pollution > in this context confuses me. > > Usually, people care about 35 (= MaxInlineSize), > because for > methods up to MaxInlineSize their call > frequency is > ignored. So, > fewer chances to end up with non-inlined call. > > Best regards, > Vladimir Ivanov > > On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > > Any pointers? Sorry to bug you guys, but > just want > to make > sure I > understand this point as I see quite a bit of > discussion on > core-libs > and elsewhere where people are worrying > about the 35 > bytecode size > threshold for parse inlining. > > On Wed, May 13, 2015 at 3:36 PM, Vitaly > Davidovich > > > > >> > > > > > >>>> wrote: > > Hi guys, > > Could someone please explain the > advantage, if > any, of > parse time > inlining in C2? Given that > FreqInlineSize is quite > large by default, > most hot methods will get inlined anyway > (well, ones > that can be for > other reasons). What is the advantage of > parse time > inlining? > > Is it quicker time to peak > performance if C1 > is reached > first? 
> > Does it ensure that a method is inlined > whereas it may > not be if > it's already compiled into a medium/large > method otherwise? > > Is parse time inlining not > susceptible to profile > pollution? I > suspect it is since the interpreter > has already > profiled the inlinee > either way, but wanted to check. > > Anything else I'm not thinking about? > > Thanks > > > > > From rednaxelafx at gmail.com Thu May 14 22:00:20 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Thu, 14 May 2015 15:00:20 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Hi Vitaly, The code I posted comes from should_not_inline(). It's a negative filter, so return true means don't inline. You can take a look at opto/bytecodeInfo.cpp. It's a lot to explain in words, but very obvious from the code. I'm not sure what your 49% means here. In the positive filter, If the frequency of a call site is more than InlineFrequencyRatio (=20, think of a call site in a loop run at least 20 times per invocation of this method), or the profile recorded the call site is called at least InlineFrequencyCount (=100 on x86), or some other heuristics, the max_inline_size for this call site is bumped from MaxInlineSize (=35) to FreqInlineSize (=325 on x86). In the negative filter, the callee has to be executed at least MIN2(MinInliningThreshold, counter_high_value) times in order to be considered candidate for inlining. This is the invocation counter on the callee side, not on the call site side. In tiered mode that would be MIN2(250, 134217728) = 250. It doesn't matter what CompileThreshold is set in this case. In non-tiered mode, it'd be MIN2(250, CompileThreshold/2), for CompileThreshold=100 that's 50. - Kris On Thu, May 14, 2015 at 2:35 PM, Vitaly Davidovich wrote: > Thanks Kris. 
Hmm, this sounds pretty bad for non-tiered compilations with > a relatively low CompileThreshold. If I have a (larger than MaxInlineSize) > method executed 49% of the time, it'll inline at CompileThreshold=10k but > not CompileThreshold=100. Or am I missing something? > > On Thu, May 14, 2015 at 5:28 PM, Krystal Mok > wrote: > >> Yes and no. >> >> intx counter_high_value; >> // Tiered compilation uses a different "high value" than non-tiered >> compilation. >> // Determine the right value to use. >> if (TieredCompilation) { >> counter_high_value = InvocationCounter::count_limit / 2; >> } else { >> counter_high_value = CompileThreshold / 2; >> } >> if >> (!callee_method->was_executed_more_than(MIN2(MinInliningThreshold, >> counter_high_value))) { >> set_msg("executed < MinInliningThreshold times"); >> return true; >> } >> >> So it's not scaling MinInliningThreshold directly, but rather using a min >> of MinInliningThreshold and counter_high_value (where the latter is >> calculated from CompileThreshold when not using tiered compilation) to make >> the actual decision. >> >> Because tiered is on by default now, the short answer to your question >> would probably be a "no". >> >> - Kris >> >> On Thu, May 14, 2015 at 1:01 PM, Vitaly Davidovich >> wrote: >> >>> Right, thank you. >>> >>> Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say >>> I turn CompileThreshold down to 100 (as an example). >>> >>> On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov < >>> vladimir.kozlov at oracle.com> wrote: >>> >>>> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: >>>> >>>>> Thanks Vladimir. I recall seeing changes around incremental inlining, >>>>> and may have mistakenly thought it happens at some later point in time. >>>>> Appreciate the clarification. 
>>>>> >>>>> Ok, so based on what you say, I can see a theoretical problem whereby a >>>>> method is being parsed, is larger than MaxInlineSize, but doesn't >>>>> happen >>>>> to be frequent enough yet at this point, and so it won't be inlined; if >>>>> it turns out to be hot later on, the lack of inlining will not be >>>>> undone >>>>> (assuming the caller isn't deopted and recompiled later, with updated >>>>> frequency info, for other reasons). >>>>> >>>> >>>> That is correct. >>>> >>>> Note, that the problem is not how hot is callee (invocation times) but >>>> how hot the call site in caller. Usually it does not change during >>>> execution. If it is called in a loop C2 will try to inline because freq >>>> should be high. >>>> >>>> There is MinInliningThreshold (250) but since we compile caller when it >>>> is executed 10000 times the call site should be on slow path which is >>>> executed only 2.5% times. So you will not see the performance difference if >>>> we inline it or call callee which is compiled if it is hot. >>>> >>>> Vladimir >>>> >>>> >>>>> Thanks again. >>>>> >>>>> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >>>>> > >>>>> wrote: >>>>> >>>>> Vitaly, >>>>> >>>>> You have small misconception - almost all C2 inlining (normal java >>>>> methods) is done during parsing in one pass. Recently we changed it >>>>> to inline jsr292 methods after parsing (to execute IGVN and reduce >>>>> graph - otherwise they blow up number of ideal nodes and we bailout >>>>> compilation due to MaxNodeLimit). >>>>> >>>>> As parser goes and see a call site it check can it inline or not >>>>> (see opto/bytecodeinfo.cpp, should_inline() and >>>>> should_not_inline()). There are several conditions which drive >>>>> inlining and following (most important) flags controls them: >>>>> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >>>>> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >>>>> InlineFrequencyRatio. 
>>>>> >>>>> Most tedious flag is InlineSmallCode (size of compiled assembler >>>>> code which is different from other sizes which are bytecode size) >>>>> which control inlining of already compiled method. Usually if a >>>>> call >>>>> site is hot the callee is compiled before caller. But sometimes you >>>>> can get caller compiled first (if it has hot loop, for example) so >>>>> different condition will be used and as result you can get >>>>> performance variation between runs. >>>>> >>>>> The difference between MaxInlineSize (35) and FreqInlineSize (325) >>>>> is FreqInlineSize takes into account how frequent call site is >>>>> executed relatively to caller invocations: >>>>> >>>>> int call_site_count = method()->scale_count(profile.count()); >>>>> int invoke_count = method()->interpreter_invocation_count(); >>>>> int freq = call_site_count / invoke_count; >>>>> int max_inline_size = MaxInlineSize; >>>>> // bump the max size if the call is frequent >>>>> if ((freq >= InlineFrequencyRatio) || >>>>> (call_site_count >= InlineFrequencyCount) || >>>>> is_unboxing_method(callee_method, C) || >>>>> is_init_with_ea(callee_method, caller_method, C)) { >>>>> max_inline_size = FreqInlineSize; >>>>> >>>>> And there is additional inlining condition for all methods which >>>>> size > MaxTrivialSize: >>>>> >>>>> if >>>>> (!callee_method->was_executed_more_than(MinInliningThreshold)) { >>>>> set_msg("executed < MinInliningThreshold times"); >>>>> >>>>> Regards, >>>>> Vladimir >>>>> >>>>> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >>>>> >>>>> I should also add that I see how inlining without taking call >>>>> freq into >>>>> account could lead to faster time to peak performance for >>>>> methods that >>>>> eventually get hot anyway but aren't at parse time. Peak perf >>>>> will be >>>>> the same if the method is too big for parse inlining but >>>>> eventually gets >>>>> compiled due to reaching hotness. Is that about right? 
>>>>> >>>>> sent from my phone >>>>> >>>>> On May 14, 2015 12:57 PM, "Vitaly Davidovich" < >>>>> vitalyd at gmail.com >>>>> >>>>> >> wrote: >>>>> >>>>> Vladimir, >>>>> >>>>> I'm comparing MaxInlineSize (35) with FreqInlineSize >>>>> (325). AFAIU, >>>>> MaxInlineSize drives which methods are inlined at parse >>>>> time by C2, >>>>> whereas FreqInlineSize is the threshold for "late" (or >>>>> what >>>>> do you >>>>> guys call inlining after parsing?) inlining. Most of the >>>>> inlining >>>>> discussions (or worries, rather) seem to focus around the >>>>> MaxInlineSize value, and not FreqInlineSize, even if the >>>>> target >>>>> method will get hot. >>>>> >>>>> Usually, people care about 35 (= MaxInlineSize), >>>>> because for >>>>> methods up to MaxInlineSize their call frequency is >>>>> ignored. So, >>>>> fewer chances to end up with non-inlined call. >>>>> >>>>> >>>>> Ok, so for hot methods then MaxInlineSize isn't really a >>>>> concern, >>>>> and FreqInlineSize would be the threshold to worry about >>>>> (for C2 >>>>> compiler) then? Why are people worried about inlining in >>>>> cold paths >>>>> then? >>>>> >>>>> Thanks Vladimir >>>>> >>>>> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >>>>> >>>> >>>>> >>>> >>>>> >> >>>>> wrote: >>>>> >>>>> Vitaly, >>>>> >>>>> Can you elaborate your question a bit? What do you >>>>> compare >>>>> parse-time inlining with? Mentioning of ?1 & profile >>>>> pollution >>>>> in this context confuses me. >>>>> >>>>> Usually, people care about 35 (= MaxInlineSize), >>>>> because for >>>>> methods up to MaxInlineSize their call frequency is >>>>> ignored. So, >>>>> fewer chances to end up with non-inlined call. >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >>>>> >>>>> Any pointers? 
Sorry to bug you guys, but just want >>>>> to make >>>>> sure I >>>>> understand this point as I see quite a bit of >>>>> discussion on >>>>> core-libs >>>>> and elsewhere where people are worrying about the >>>>> 35 >>>>> bytecode size >>>>> threshold for parse inlining. >>>>> >>>>> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >>>>> >>>>> > >>>>> >>>> >>>> >>> wrote: >>>>> >>>>> Hi guys, >>>>> >>>>> Could someone please explain the advantage, >>>>> if >>>>> any, of >>>>> parse time >>>>> inlining in C2? Given that FreqInlineSize is >>>>> quite >>>>> large by default, >>>>> most hot methods will get inlined anyway >>>>> (well, ones >>>>> that can be for >>>>> other reasons). What is the advantage of >>>>> parse time >>>>> inlining? >>>>> >>>>> Is it quicker time to peak performance if C1 >>>>> is reached >>>>> first? >>>>> >>>>> Does it ensure that a method is inlined >>>>> whereas it may >>>>> not be if >>>>> it's already compiled into a medium/large >>>>> method otherwise? >>>>> >>>>> Is parse time inlining not susceptible to >>>>> profile >>>>> pollution? I >>>>> suspect it is since the interpreter has >>>>> already >>>>> profiled the inlinee >>>>> either way, but wanted to check. >>>>> >>>>> Anything else I'm not thinking about? >>>>> >>>>> Thanks >>>>> >>>>> >>>>> >>>>> >>>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 14 22:16:08 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 18:16:08 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Yes, I'll look at that code. 
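For reference, the filter under discussion can be restated as a standalone sketch. The scaffolding here is hypothetical (in particular the tiered `count_limit` value is a stand-in constant, not the real `InvocationCounter::count_limit`); the actual logic lives in opto/bytecodeInfo.cpp:

```cpp
#include <algorithm>

// Standalone sketch of the negative inline filter quoted in this
// thread: the callee must have executed at least
// MIN2(MinInliningThreshold, counter_high_value) times, where the
// "high value" is CompileThreshold/2 in non-tiered mode.
bool executed_too_rarely(long callee_invocations, bool tiered,
                         long compile_threshold) {
    const long MinInliningThreshold = 250;
    const long count_limit = 1L << 30;   // stand-in, not the real constant
    long counter_high_value =
        tiered ? count_limit / 2 : compile_threshold / 2;
    return callee_invocations <
           std::min(MinInliningThreshold, counter_high_value);
}
```

With non-tiered CompileThreshold=10000 the cutoff is min(250, 5000) = 250, so a call site taking 49% of 10000 invocations (4900 executions) passes easily; with CompileThreshold=100 the cutoff drops to min(250, 50) = 50, and the same 49% profile (49 executions) is rejected — exactly the scenario raised above.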
My 49% example (probably poorly explained, I'll try to correct this below) was based on what Vladimir said: There is MinInliningThreshold (250) but since we compile caller when it is > executed 10000 times the call site should be on slow path which is executed > only 2.5% times. So you will not see the performance difference if we > inline it or call callee which is compiled if it is hot. Assume we're talking about a callee (method) whose size is > MaxInlineSize (e.g. 64 bytes just to pull some number out). Using default compile threshold of 10k, the callee will not be inlined if it's executed < 250 times. That's fine since as Vladimir says it's a cold path anyway (< 2.5% of invocations of caller don't touch that callsite), and I agree given 10k compile threshold. If we change CompileThreshold to 100, then the code snippet you showed earlier (for the negative filter) will yield the MIN2(250,50) negative filter. So the "49%" case is the caller is invoked 100 times, triggers compilation, and compiler looks at the call site that's executed 49 times out of those 100 (that's the 49% I'm using). If we use 10k compile threshold, then it's executed 4900 times, clearly above 250; I'm using an example where the code profile is the same, we're only changing compile threshold. So the negative filter rejects this method for inlining now, whereas it would've accepted it had I let the code run to 10k invocations. I guess my overall question is the following: does changing CompileThreshold to a lower value, but keeping execution profile the same, alter inlining decisions. Intuitively I would think the answer should be "no, inlining is the same" as everything would be scaled down appropriately, but it sounds like there are "magic" absolute numbers involved. On Thu, May 14, 2015 at 6:00 PM, Krystal Mok wrote: > Hi Vitaly, > > The code I posted comes from should_not_inline(). It's a negative filter, > so return true means don't inline. > You can take a look at opto/bytecodeInfo.cpp. 
It's a lot to explain in > words, but very obvious from the code. > > I'm not sure what your 49% means here. > > In the positive filter, If the frequency of a call site is more > than InlineFrequencyRatio (=20, think of a call site in a loop run at least > 20 times per invocation of this method), or the profile recorded the call > site is called at least InlineFrequencyCount (=100 on x86), or some other > heuristics, the max_inline_size for this call site is bumped from > MaxInlineSize (=35) to FreqInlineSize (=325 on x86). > > In the negative filter, the callee has to be executed at > least MIN2(MinInliningThreshold, counter_high_value) times in order to be > considered candidate for inlining. This is the invocation counter on the > callee side, not on the call site side. > In tiered mode that would be MIN2(250, 134217728) = 250. It doesn't matter > what CompileThreshold is set in this case. > In non-tiered mode, it'd be MIN2(250, CompileThreshold/2), for > CompileThreshold=100 that's 50. > > - Kris > > On Thu, May 14, 2015 at 2:35 PM, Vitaly Davidovich > wrote: > >> Thanks Kris. Hmm, this sounds pretty bad for non-tiered compilations >> with a relatively low CompileThreshold. If I have a (larger than >> MaxInlineSize) method executed 49% of the time, it'll inline at >> CompileThreshold=10k but not CompileThreshold=100. Or am I missing >> something? >> >> On Thu, May 14, 2015 at 5:28 PM, Krystal Mok >> wrote: >> >>> Yes and no. >>> >>> intx counter_high_value; >>> // Tiered compilation uses a different "high value" than >>> non-tiered compilation. >>> // Determine the right value to use. 
>>> if (TieredCompilation) { >>> counter_high_value = InvocationCounter::count_limit / 2; >>> } else { >>> counter_high_value = CompileThreshold / 2; >>> } >>> if >>> (!callee_method->was_executed_more_than(MIN2(MinInliningThreshold, >>> counter_high_value))) { >>> set_msg("executed < MinInliningThreshold times"); >>> return true; >>> } >>> >>> So it's not scaling MinInliningThreshold directly, but rather using a >>> min of MinInliningThreshold and counter_high_value (where the latter is >>> calculated from CompileThreshold when not using tiered compilation) to make >>> the actual decision. >>> >>> Because tiered is on by default now, the short answer to your question >>> would probably be a "no". >>> >>> - Kris >>> >>> On Thu, May 14, 2015 at 1:01 PM, Vitaly Davidovich >>> wrote: >>> >>>> Right, thank you. >>>> >>>> Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say >>>> I turn CompileThreshold down to 100 (as an example). >>>> >>>> On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov < >>>> vladimir.kozlov at oracle.com> wrote: >>>> >>>>> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: >>>>> >>>>>> Thanks Vladimir. I recall seeing changes around incremental inlining, >>>>>> and may have mistakenly thought it happens at some later point in >>>>>> time. >>>>>> Appreciate the clarification. >>>>>> >>>>>> Ok, so based on what you say, I can see a theoretical problem whereby >>>>>> a >>>>>> method is being parsed, is larger than MaxInlineSize, but doesn't >>>>>> happen >>>>>> to be frequent enough yet at this point, and so it won't be inlined; >>>>>> if >>>>>> it turns out to be hot later on, the lack of inlining will not be >>>>>> undone >>>>>> (assuming the caller isn't deopted and recompiled later, with updated >>>>>> frequency info, for other reasons). >>>>>> >>>>> >>>>> That is correct. >>>>> >>>>> Note, that the problem is not how hot is callee (invocation times) but >>>>> how hot the call site in caller. 
Usually it does not change during >>>>> execution. If it is called in a loop C2 will try to inline because freq >>>>> should be high. >>>>> >>>>> There is MinInliningThreshold (250) but since we compile caller when >>>>> it is executed 10000 times the call site should be on slow path which is >>>>> executed only 2.5% times. So you will not see the performance difference if >>>>> we inline it or call callee which is compiled if it is hot. >>>>> >>>>> Vladimir >>>>> >>>>> >>>>>> Thanks again. >>>>>> >>>>>> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >>>>>> > >>>>>> wrote: >>>>>> >>>>>> Vitaly, >>>>>> >>>>>> You have small misconception - almost all C2 inlining (normal java >>>>>> methods) is done during parsing in one pass. Recently we changed >>>>>> it >>>>>> to inline jsr292 methods after parsing (to execute IGVN and reduce >>>>>> graph - otherwise they blow up number of ideal nodes and we >>>>>> bailout >>>>>> compilation due to MaxNodeLimit). >>>>>> >>>>>> As parser goes and see a call site it check can it inline or not >>>>>> (see opto/bytecodeinfo.cpp, should_inline() and >>>>>> should_not_inline()). There are several conditions which drive >>>>>> inlining and following (most important) flags controls them: >>>>>> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >>>>>> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >>>>>> InlineFrequencyRatio. >>>>>> >>>>>> Most tedious flag is InlineSmallCode (size of compiled assembler >>>>>> code which is different from other sizes which are bytecode size) >>>>>> which control inlining of already compiled method. Usually if a >>>>>> call >>>>>> site is hot the callee is compiled before caller. But sometimes >>>>>> you >>>>>> can get caller compiled first (if it has hot loop, for example) so >>>>>> different condition will be used and as result you can get >>>>>> performance variation between runs. 
>>>>>> >>>>>> The difference between MaxInlineSize (35) and FreqInlineSize (325) >>>>>> is FreqInlineSize takes into account how frequent call site is >>>>>> executed relatively to caller invocations: >>>>>> >>>>>> int call_site_count = method()->scale_count(profile.count()); >>>>>> int invoke_count = >>>>>> method()->interpreter_invocation_count(); >>>>>> int freq = call_site_count / invoke_count; >>>>>> int max_inline_size = MaxInlineSize; >>>>>> // bump the max size if the call is frequent >>>>>> if ((freq >= InlineFrequencyRatio) || >>>>>> (call_site_count >= InlineFrequencyCount) || >>>>>> is_unboxing_method(callee_method, C) || >>>>>> is_init_with_ea(callee_method, caller_method, C)) { >>>>>> max_inline_size = FreqInlineSize; >>>>>> >>>>>> And there is additional inlining condition for all methods which >>>>>> size > MaxTrivialSize: >>>>>> >>>>>> if >>>>>> (!callee_method->was_executed_more_than(MinInliningThreshold)) { >>>>>> set_msg("executed < MinInliningThreshold times"); >>>>>> >>>>>> Regards, >>>>>> Vladimir >>>>>> >>>>>> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >>>>>> >>>>>> I should also add that I see how inlining without taking call >>>>>> freq into >>>>>> account could lead to faster time to peak performance for >>>>>> methods that >>>>>> eventually get hot anyway but aren't at parse time. Peak perf >>>>>> will be >>>>>> the same if the method is too big for parse inlining but >>>>>> eventually gets >>>>>> compiled due to reaching hotness. Is that about right? >>>>>> >>>>>> sent from my phone >>>>>> >>>>>> On May 14, 2015 12:57 PM, "Vitaly Davidovich" < >>>>>> vitalyd at gmail.com >>>>>> >>>>>> >> wrote: >>>>>> >>>>>> Vladimir, >>>>>> >>>>>> I'm comparing MaxInlineSize (35) with FreqInlineSize >>>>>> (325). AFAIU, >>>>>> MaxInlineSize drives which methods are inlined at parse >>>>>> time by C2, >>>>>> whereas FreqInlineSize is the threshold for "late" (or >>>>>> what >>>>>> do you >>>>>> guys call inlining after parsing?) inlining. 
Most of the >>>>>> inlining >>>>>> discussions (or worries, rather) seem to focus around the >>>>>> MaxInlineSize value, and not FreqInlineSize, even if the >>>>>> target >>>>>> method will get hot. >>>>>> >>>>>> Usually, people care about 35 (= MaxInlineSize), >>>>>> because for >>>>>> methods up to MaxInlineSize their call frequency is >>>>>> ignored. So, >>>>>> fewer chances to end up with non-inlined call. >>>>>> >>>>>> >>>>>> Ok, so for hot methods then MaxInlineSize isn't really a >>>>>> concern, >>>>>> and FreqInlineSize would be the threshold to worry about >>>>>> (for C2 >>>>>> compiler) then? Why are people worried about inlining in >>>>>> cold paths >>>>>> then? >>>>>> >>>>>> Thanks Vladimir >>>>>> >>>>>> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >>>>>> >>>>> >>>>>> >>>>> >>>>>> >> >>>>>> wrote: >>>>>> >>>>>> Vitaly, >>>>>> >>>>>> Can you elaborate your question a bit? What do you >>>>>> compare >>>>>> parse-time inlining with? Mentioning of ?1 & profile >>>>>> pollution >>>>>> in this context confuses me. >>>>>> >>>>>> Usually, people care about 35 (= MaxInlineSize), >>>>>> because for >>>>>> methods up to MaxInlineSize their call frequency is >>>>>> ignored. So, >>>>>> fewer chances to end up with non-inlined call. >>>>>> >>>>>> Best regards, >>>>>> Vladimir Ivanov >>>>>> >>>>>> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >>>>>> >>>>>> Any pointers? Sorry to bug you guys, but just >>>>>> want >>>>>> to make >>>>>> sure I >>>>>> understand this point as I see quite a bit of >>>>>> discussion on >>>>>> core-libs >>>>>> and elsewhere where people are worrying about >>>>>> the 35 >>>>>> bytecode size >>>>>> threshold for parse inlining. >>>>>> >>>>>> On Wed, May 13, 2015 at 3:36 PM, Vitaly >>>>>> Davidovich >>>>>> >>>>>> > >>>>>> >>>>> >>>>> >>> wrote: >>>>>> >>>>>> Hi guys, >>>>>> >>>>>> Could someone please explain the advantage, >>>>>> if >>>>>> any, of >>>>>> parse time >>>>>> inlining in C2? 
Given that FreqInlineSize >>>>>> is quite >>>>>> large by default, >>>>>> most hot methods will get inlined anyway >>>>>> (well, ones >>>>>> that can be for >>>>>> other reasons). What is the advantage of >>>>>> parse time >>>>>> inlining? >>>>>> >>>>>> Is it quicker time to peak performance if C1 >>>>>> is reached >>>>>> first? >>>>>> >>>>>> Does it ensure that a method is inlined >>>>>> whereas it may >>>>>> not be if >>>>>> it's already compiled into a medium/large >>>>>> method otherwise? >>>>>> >>>>>> Is parse time inlining not susceptible to >>>>>> profile >>>>>> pollution? I >>>>>> suspect it is since the interpreter has >>>>>> already >>>>>> profiled the inlinee >>>>>> either way, but wanted to check. >>>>>> >>>>>> Anything else I'm not thinking about? >>>>>> >>>>>> Thanks >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.x.ivanov at oracle.com Fri May 15 12:14:59 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 15 May 2015 15:14:59 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <55549297.5020002@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> Message-ID: <5555E343.5050600@oracle.com> After private discussion with Paul, here's an update: http://cr.openjdk.java.net/~vlivanov/8079205/webrev.03 Renamed CallSite$Context to MethodHandleNatives$Context. 
Best regards, Vladimir Ivanov On 5/14/15 3:18 PM, Vladimir Ivanov wrote: > Small update in JDK code (webrev updated in place): > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 > > Changes: > - hid Context.dependencies field from Reflection > - new test case: CallSiteDepContextTest.testHiddenDepField() > > Best regards, > Vladimir Ivanov > > On 5/13/15 5:56 PM, Paul Sandoz wrote: >> >> On May 13, 2015, at 1:59 PM, Vladimir Ivanov >> wrote: >> >>> Peter, Paul, thanks for the feedback! >>> >>> Updated the webrev in place: >>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >>> >> >> +1 >> >> >>>> I am not an export in the HS area but the code mostly made sense to >>>> me. I also like Peter's suggestion of Context implementing Runnable. >>> I agree. Integrated. >>> >>>> Some minor comments. >>>> >>>> CallSite.java: >>>> >>>> 145 private final long dependencies = 0; // Used by JVM to >>>> store JVM_nmethodBucket* >>>> >>>> It's a little odd to see this field be final, but i think i >>>> understand the motivation as it's essentially acting as a memory >>>> address pointing to the head of the nbucket list, so in Java code >>>> you don't want to assign it to anything other than 0. >>> Yes, my motivation was to forbid reads & writes of the field from >>> Java code. I was experimenting with injecting the field by VM, but >>> it's less attractive. >>> >> >> I was wondering if that were possible. >> >> >>>> Is VM access, via java_lang_invoke_CallSite_Context, affected by >>>> treating final fields as really final in the j.l.i package? >>> No, field modifiers aren't respected. oopDesc::*_field_[get|put] does >>> a raw read/write using field offset. >>> >> >> Ok. >> >> Paul. 
>> From john.r.rose at oracle.com Fri May 15 17:06:50 2015 From: john.r.rose at oracle.com (John Rose) Date: Fri, 15 May 2015 10:06:50 -0700 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5555E343.5050600@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> Message-ID: I know injected fields are somewhat hacky, but there are a couple of conditions here which would motivate their use: 1. The field is of a type not known to Java. Usually, and in this case, it is a C pointer to some metadata. We can make space for it with a Java long, but that is a size mismatch on 32-bit systems. Such mismatches have occasionally caused bugs on big-endian systems, although we try to defend against them by using long-wise reads followed by casts. 2. The field is useless to Java code, and would be a vulnerability if somebody found a way to touch it from Java. In both these ways the 'dependencies' field is like the MemberName.vmtarget field. (Suggestion: s/dependencies/vmdependencies/ to follow that pattern.) ? John On May 15, 2015, at 5:14 AM, Vladimir Ivanov wrote: > > After private discussion with Paul, here's an update: > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.03 > > Renamed CallSite$Context to MethodHandleNatives$Context. 
> > Best regards, > Vladimir Ivanov > > On 5/14/15 3:18 PM, Vladimir Ivanov wrote: >> Small update in JDK code (webrev updated in place): >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >> >> Changes: >> - hid Context.dependencies field from Reflection >> - new test case: CallSiteDepContextTest.testHiddenDepField() >> >> Best regards, >> Vladimir Ivanov >> >> On 5/13/15 5:56 PM, Paul Sandoz wrote: >>> >>> On May 13, 2015, at 1:59 PM, Vladimir Ivanov >>> wrote: >>> >>>> Peter, Paul, thanks for the feedback! >>>> >>>> Updated the webrev in place: >>>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >>>> >>> >>> +1 >>> >>> >>>>> I am not an export in the HS area but the code mostly made sense to >>>>> me. I also like Peter's suggestion of Context implementing Runnable. >>>> I agree. Integrated. >>>> >>>>> Some minor comments. >>>>> >>>>> CallSite.java: >>>>> >>>>> 145 private final long dependencies = 0; // Used by JVM to >>>>> store JVM_nmethodBucket* >>>>> >>>>> It's a little odd to see this field be final, but i think i >>>>> understand the motivation as it's essentially acting as a memory >>>>> address pointing to the head of the nbucket list, so in Java code >>>>> you don't want to assign it to anything other than 0. >>>> Yes, my motivation was to forbid reads & writes of the field from >>>> Java code. I was experimenting with injecting the field by VM, but >>>> it's less attractive. >>>> >>> >>> I was wondering if that were possible. >>> >>> >>>>> Is VM access, via java_lang_invoke_CallSite_Context, affected by >>>>> treating final fields as really final in the j.l.i package? >>>> No, field modifiers aren't respected. oopDesc::*_field_[get|put] does >>>> a raw read/write using field offset. >>>> >>> >>> Ok. >>> >>> Paul. 
>>> From christian.thalinger at oracle.com Fri May 15 17:39:28 2015 From: christian.thalinger at oracle.com (Christian Thalinger) Date: Fri, 15 May 2015 10:39:28 -0700 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CDD48.8080500@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> Message-ID: <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> > On May 8, 2015, at 8:59 AM, Andrew Haley wrote: > > Here is a prototype of what I propose: > > http://cr.openjdk.java.net/~aph/rsa-1/ > It is a JNI version of the fast algorithm I think we should use for > RSA. It doesn't use any intrinsics. But with it, RSA is already twice > as fast as the code we have now for 1024-bit keys, and almost three > times as fast for 2048-bit keys! > > This code (montgomery_multiply) is designed to be as easy as I can > possibly make it to translate into a hand-coded intrinsic. I?m curious: did you try to implement this in Java? > It will > then offer better performance still, with the JNI overhead gone and > with some carefully hand-tweaked memory accesses. It has a very > regular structure which should make it fairly easy to turn into a > software pipeline with overlapped fetching and multiplication, > although this will perhaps be difficult on register-starved machines > like the x86. > > The JNI version can still be used for those machines where people > don't yet want to write a hand-coded intrinsic. > > I haven't yet done anything about Montgomery squaring: it's > asymptotically 25% faster than Montgomery multiplication. 
> > I have only written it for x86_64. Systems without a 64-bit multiply > won't benefit very much from this idea, but they are pretty much > legacy anyway: I want to concentrate on 64-bit systems. > > So, shall we go with this? I don't think you will find any faster way > to do it. > > Andrew. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Sat May 16 00:18:00 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 15 May 2015 17:18:00 -0700 Subject: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java Message-ID: <55568CB8.3030102@oracle.com> Small typo fix. Test run commands incorrectly execute SumRed_Double class instead of SumRed_Long. http://cr.openjdk.java.net/~kvn/8080483/webrev/ https://bugs.openjdk.java.net/browse/JDK-8080483 Tested with jtreg. Thanks, Vladimir From igor.veresov at oracle.com Sat May 16 00:38:23 2015 From: igor.veresov at oracle.com (Igor Veresov) Date: Fri, 15 May 2015 17:38:23 -0700 Subject: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java In-Reply-To: <55568CB8.3030102@oracle.com> References: <55568CB8.3030102@oracle.com> Message-ID: <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> Ok. igor > On May 15, 2015, at 5:18 PM, Vladimir Kozlov wrote: > > Small typo fix. Test run commands incorrectly execute SumRed_Double class instead of SumRed_Long. > > http://cr.openjdk.java.net/~kvn/8080483/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8080483 > > Tested with jtreg. 
> > Thanks, > Vladimir From vladimir.kozlov at oracle.com Sat May 16 00:42:46 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 15 May 2015 17:42:46 -0700 Subject: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java In-Reply-To: <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> References: <55568CB8.3030102@oracle.com> <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> Message-ID: <55569286.4010705@oracle.com> Thanks. Vladimir On 5/15/15 5:38 PM, Igor Veresov wrote: > Ok. > > igor > >> On May 15, 2015, at 5:18 PM, Vladimir Kozlov wrote: >> >> Small typo fix. Test run commands incorrectly execute SumRed_Double class instead of SumRed_Long. >> >> http://cr.openjdk.java.net/~kvn/8080483/webrev/ >> https://bugs.openjdk.java.net/browse/JDK-8080483 >> >> Tested with jtreg. >> >> Thanks, >> Vladimir > From aph at redhat.com Sat May 16 09:07:10 2015 From: aph at redhat.com (Andrew Haley) Date: Sat, 16 May 2015 10:07:10 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> Message-ID: <555708BE.8090100@redhat.com> On 15/05/15 18:39, Christian Thalinger wrote: > >> On May 8, 2015, at 8:59 AM, Andrew Haley wrote: >> >> Here is a prototype of what I propose: >> >> http://cr.openjdk.java.net/~aph/rsa-1/ >> It is a JNI version of the fast algorithm I think we should use for >> RSA. It doesn't use any intrinsics. 
But with it, RSA is already twice >> as fast as the code we have now for 1024-bit keys, and almost three >> times as fast for 2048-bit keys! >> >> This code (montgomery_multiply) is designed to be as easy as I can >> possibly make it to translate into a hand-coded intrinsic. > > I?m curious: did you try to implement this in Java? The Montgomery multiplication algorithm I presented is fast because (unlike the Java code in BigInteger) it keeps most of its intermediate state in registers. In contrast, the code in BigInteger spends much time writing digits out to memory and reading them back again. I think it is impossible to write in Java. The inner multiply primitive accumulates into a three-register sum. Like this: t += a * b where a and b are unsigned longs, and t is a triple-precision unsigned long. It's easy to express that in GNU C with inline asm, but not in Java. It's not possible to express it in in portable C either because C has no add with carry. The closest you could get in Java would be an intrinsic into which you passed and returned an object with three longs as the accumulator. But as I said, this JNI code is just a demonstration. I think that Montgomery multiplication itself really needs to be turned into a HotSpot intrinsic. Then, our RSA and Diffie-Hellman performance will be competitive with the best library code. I can do it, but so far I have had no response from Oracle at all. :-( Andrew. 
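The accumulate step Andrew describes — t += a * b with a triple-precision t — can be sketched portably with the GCC/Clang unsigned __int128 extension. This is an approximation: his actual prototype at the webrev link uses GNU C inline asm, and the portable form below need not compile to the same add-with-carry sequence.

```cpp
// Sketch of the inner Montgomery-multiply primitive described above:
// t2:t1:t0 += a * b, where the accumulator t is three 64-bit limbs.
typedef unsigned long long limb;

static void acc_mul(limb& t0, limb& t1, limb& t2, limb a, limb b) {
    unsigned __int128 p = (unsigned __int128)a * b;          // 128-bit product
    unsigned __int128 s = (unsigned __int128)t0 + (limb)p;   // low limb
    t0 = (limb)s;
    s = (unsigned __int128)t1 + (limb)(p >> 64) + (limb)(s >> 64);
    t1 = (limb)s;
    t2 += (limb)(s >> 64);                                   // final carry
}
```

Keeping t0..t2 in registers across the whole inner loop — rather than writing digits out to memory and reading them back, as the Java code in BigInteger must — is where the speedup comes from.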
From aph at redhat.com Sat May 16 09:35:02 2015 From: aph at redhat.com (Andrew Haley) Date: Sat, 16 May 2015 10:35:02 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <555708BE.8090100@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> Message-ID: <55570F46.5000909@redhat.com> There is one other thing I didn't mention: it is possible to turn Montgomery multiplication into a software pipeline if you have enough registers. The idea is that the latency of one multiplication is overlapped by the latency of the load of the operands for the next one, and the accumulation is done on not on the latest multiplication but the previous one, so no operation ever stalls the pipeline. I'm not sure if x86 has enough registers to do this (it might, just) but AArch64 certainly does. An out-of-order CPU can to some extent do the instruction reordering automatically, but you still have to write the code in a way that makes it possible. Andrew. 
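The restructuring Andrew describes can be illustrated with a toy loop: the operand loads for iteration i+1 are issued before the accumulate for iteration i, so the multiply's latency is hidden behind the next loads. This is a simplified illustration of the shape, not his actual kernel:

```cpp
typedef unsigned long long limb;

// Toy software-pipelined sum of 64x64->128-bit products (assumes n >= 1):
// the loads for the next iteration are issued before the current product
// is accumulated, so no operation waits on the multiply's latency.
unsigned __int128 dot(const limb* a, const limb* b, int n) {
    unsigned __int128 acc = 0;
    limb x = a[0], y = b[0];                             // prologue loads
    for (int i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)x * y;  // multiply i
        if (i + 1 < n) { x = a[i + 1]; y = b[i + 1]; }   // loads for i+1
        acc += p;                                        // accumulate i
    }
    return acc;
}
```

On an in-order core this ordering matters directly; an out-of-order core can do some of the reordering automatically, but, as noted above, only if the code is written in a way that leaves it room.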
From roland.westrelin at oracle.com Mon May 18 08:04:27 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 18 May 2015 10:04:27 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <553FFAC4.7030107@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> Message-ID: >>> BTW, should we modify LoadNode::hash() to include _depends_only_on_test and prevent igvning? >> >> If the graph already has a non-pinned LoadNode and we add a pinned LoadNode with the same inputs, it's safe for GVN to replace the pinned LoadNode by the non-pinned LoadNode, otherwise it wouldn't be safe to have a non pinned LoadNode with those inputs in the first place. If the graph already has a pinned LoadNode and we add a non pinned LoadNode with the same inputs, it's safe for GVN to replace the non-pinned LoadNode by the pinned LoadNode. It's also suboptimal but better than 2 LoadNodes... > > Okay. Add comment to _depends_only_on_test field why we don't use it in hash(). I added the comment. Thanks for the review. I had the change go through a performance run to be safe and I found no regression so I'm pushing this. Roland. > > Thanks, > Vladimir > >> >> I guess we could change LoadNode::hash() and then use LoadNode::Identity/Ideal to make sure the pinned LoadNode is always replaced by the non-pinned LoadNode in the scenarios above but that sounds like extra complexity for something that, as far as we know, never happens in practice. >> >> Roland.
>> >> >>> >>> Thanks, >>> Vladimir >>> >>> On 4/24/15 1:03 AM, Roland Westrelin wrote: >>>> >>>>> Vladimir suggested privately to set _depends_only_on_test to true in the constructor and then use an explicit call to a new method set_depends_only_on_test() to set it to false in the rare cases where it's needed. That feels better indeed. What do you think? >>>> >>>> Actually, using a set_depends_only_on_test() method doesn't work well. In LibraryCallKit::inline_unsafe_access() the node returned by make_load() may have been transformed already and we could call set_depends_only_on_test() on a node that doesn't need to be pinned. The call to set_depends_only_on_test() would have to be in LoadNode::make(). I went with default parameters instead to keep the change small: >>>> >>>> http://cr.openjdk.java.net/~roland/8077504/webrev.01/ >>>> >>>> Roland. >>>> >> From zoltan.majo at oracle.com Mon May 18 09:17:42 2015 From: zoltan.majo at oracle.com (Zoltán Majó) Date: Mon, 18 May 2015 11:17:42 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant Message-ID: <5559AE36.4040808@oracle.com> Hi, please review the following small patch. Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 Problem: The changes by 8068945 break building the zero JVM variant. Solution: Add the definition of the PreserveFramePointer to the globals_zero.hpp file.
Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ Testing: - full JPRT run (testset hotspot) - local build of zero JVM variant Thank you and best regards, Zoltan From roland.westrelin at oracle.com Mon May 18 11:53:55 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 18 May 2015 13:53:55 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> Message-ID: <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> My change is broken. GraphKit::make_load() takes 2 boolean arguments: bool depends_only_on_test = true, bool require_atomic_access = false) and Parse::do_get_xxx() passes the value needs_atomic_access as argument depends_only_on_test instead of require_atomic_access. That goes unnoticed because of the default values. It's the exact reason I didn't use default arguments in my first webrev. It's too easy to make a mistake and not notice it. If we keep default arguments, one way to have help from the compiler would be to use a 2 value enum: enum DependsOnlyOnTest { DoesDependOnlyOnTest, DoesNotDependOnlyOnTest }; and use it to pass parameters around. The _depends_only_on_test LoadNode field would still be a boolean. Is this acceptable? Roland. > On May 18, 2015, at 10:04 AM, Roland Westrelin wrote: > >>>> BTW, should we modify LoadNode::hash() to include _depends_only_on_test and prevent igvning? >>> >>> If the graph already has a non-pinned LoadNode and we add a pinned LoadNode with the same inputs, it's safe for GVN to replace the pinned LoadNode by the non-pinned LoadNode, otherwise it wouldn't be safe to have a non pinned LoadNode with those inputs in the first place.
If the graph already has a pinned LoadNode and we add a non pinned LoadNode with the same inputs, it's safe for GVN to replace the non-pinned LoadNode by the pinned LoadNode. It's also suboptimal but better than 2 LoadNodes... >> >> Okay. Add comment to _depends_only_on_test field why we don't use it in hash(). > > I added the comment. Thanks for the review. > I had the change go through a performance run to be safe and I found no regression so I'm pushing this. > > Roland. > >> >> Thanks, >> Vladimir >> >>> >>> I guess we could change LoadNode::hash() and then use LoadNode::Identity/Ideal to make sure the pinned LoadNode is always replaced by the non-pinned LoadNode in the scenarios above but that sounds like extra complexity for something that, as far as we know, never happens in practice. >>> >>> Roland. >>> >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>> On 4/24/15 1:03 AM, Roland Westrelin wrote: >>>>> >>>>>> Vladimir suggested privately to set _depends_only_on_test to true in the constructor and then use an explicit call to a new method set_depends_only_on_test() to set it to false in the rare cases where it's needed. That feels better indeed. What do you think? >>>>> >>>>> Actually, using a set_depends_only_on_test() method doesn't work well. In LibraryCallKit::inline_unsafe_access() the node returned by make_load() may have been transformed already and we could call set_depends_only_on_test() on a node that doesn't need to be pinned. The call to set_depends_only_on_test() would have to be in LoadNode::make(). I went with default parameters instead to keep the change small: >>>>> >>>>> http://cr.openjdk.java.net/~roland/8077504/webrev.01/ >>>>> >>>>> Roland.
>>>>> >>> > From roland.westrelin at oracle.com Mon May 18 12:19:59 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 18 May 2015 14:19:59 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: <5553C1E2.3040003@oracle.com> References: <5553C1E2.3040003@oracle.com> Message-ID: Thanks for looking at this, Vladimir. > Why do you think that main- and post- loops will be empty too and it is safe to replace them with pre-loop? Why is it safe? > > Opaque1 node is used only when RCE and align is requested: > cmp_end->set_req(2, peel_only ? pre_limit : pre_opaq); > > In peel_only case pre-loop has only 1 iteration and can be eliminated too. The assert could be hit in such case too. So your change may not fix the problem. In case of peel_only, we set: if( peel_only ) main_head->set_main_no_pre_loop(); and in IdealLoopTree::policy_range_check(): // If we unrolled with no intention of doing RCE and we later // changed our minds, we got no pre-loop. Either we need to // make a new pre-loop, or we gotta disallow RCE. if (cl->is_main_no_pre_loop()) return false; // Disallowed for now. > I think eliminating main- and post- is a good investigation for a separate RFE. And we need to make sure it is safe. My reasoning is that if the pre loop's limit is still an opaque node, then the compiler has not been able to optimize the pre loop based on its limited number of iterations yet. I don't see what optimization would remove stuff from the pre loop that wouldn't have removed the same stuff from the initial loop, had it not been cloned. I could be missing something, though. Do you see one? Roland. > > To fix this bug, I think, we should bail out from do_range_check() when we can't find pre-loop.
> > Typo 'cab': > + // post loop cab be removed as well > > Thanks, > Vladimir > > On 5/13/15 6:02 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8078866/webrev.00/ >> >> The assert fires when optimizing that code: >> >> static int remi_sumb() { >> Integer j = Integer.valueOf(1); >> for (int i = 0; i < 1000; i++) { >> j = j + 1; >> } >> return j; >> } >> >> The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there. >> >> Here is the chain of events that leads to the pre loop being optimized out: >> >> 1- The CountedLoopNode is created >> 2- PhiNode::Value() for the induction variable's Phi is called and captures that 0 <= i < 1000 >> 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive value of j >> 4- pre-loop, main-loop, post-loop are created >> 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: >> >> public static Integer valueOf(int i) { >> if (i >= IntegerCache.low && i <= IntegerCache.high) >> return IntegerCache.cache[i + (-IntegerCache.low)]; >> return new Integer(i); >> } >> >> 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i >> 7- In the preloop, the range of values of i is known because of 2-, so the if of valueOf() can be optimized out >> 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() >> >> SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf.
The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. >> >> Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops. >> >> I don't have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don't optimize the loop out at all. >> >> Roland. >> From volker.simonis at gmail.com Mon May 18 12:49:19 2015 From: volker.simonis at gmail.com (Volker Simonis) Date: Mon, 18 May 2015 14:49:19 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <5559AE36.4040808@oracle.com> References: <5559AE36.4040808@oracle.com> Message-ID: Hi Zoltan, your change looks good. Thanks for fixing this and best regards, Volker On Mon, May 18, 2015 at 11:17 AM, Zoltán Majó wrote: > Hi, > > > please review the following small patch. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 > > Problem: The changes by 8068945 break building the zero JVM variant. > > Solution: Add the definition of the PreserveFramePointer to the > globals_zero.hpp file.
> > Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ > > Testing: > - full JPRT run (testset hotspot) > - local build of zero JVM variant > > Thank you and best regards, > > > Zoltan > From vladimir.x.ivanov at oracle.com Mon May 18 13:05:41 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 18 May 2015 16:05:41 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> Message-ID: <5559E3A5.8030108@oracle.com> Thanks, John. What do you think about the following version: http://cr.openjdk.java.net/~vlivanov/8079205/webrev.04 Best regards, Vladimir Ivanov On 5/15/15 8:06 PM, John Rose wrote: > I know injected fields are somewhat hacky, but there are a couple of conditions here which would motivate their use: > > 1. The field is of a type not known to Java. Usually, and in this case, it is a C pointer to some metadata. > > We can make space for it with a Java long, but that is a size mismatch on 32-bit systems. Such mismatches have occasionally caused bugs on big-endian systems, although we try to defend against them by using long-wise reads followed by casts. > > 2. The field is useless to Java code, and would be a vulnerability if somebody found a way to touch it from Java. > > In both these ways the 'dependencies' field is like the MemberName.vmtarget field. (Suggestion: s/dependencies/vmdependencies/ to follow that pattern.) > > -- John > > On May 15, 2015, at 5:14 AM, Vladimir Ivanov wrote: >> >> After private discussion with Paul, here's an update: >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.03 >> >> Renamed CallSite$Context to MethodHandleNatives$Context.
>> >> Best regards, >> Vladimir Ivanov >> >> On 5/14/15 3:18 PM, Vladimir Ivanov wrote: >>> Small update in JDK code (webrev updated in place): >>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >>> >>> Changes: >>> - hid Context.dependencies field from Reflection >>> - new test case: CallSiteDepContextTest.testHiddenDepField() >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 5/13/15 5:56 PM, Paul Sandoz wrote: >>>> >>>> On May 13, 2015, at 1:59 PM, Vladimir Ivanov >>>> wrote: >>>> >>>>> Peter, Paul, thanks for the feedback! >>>>> >>>>> Updated the webrev in place: >>>>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >>>>> >>>> >>>> +1 >>>> >>>> >>>>>> I am not an export in the HS area but the code mostly made sense to >>>>>> me. I also like Peter's suggestion of Context implementing Runnable. >>>>> I agree. Integrated. >>>>> >>>>>> Some minor comments. >>>>>> >>>>>> CallSite.java: >>>>>> >>>>>> 145 private final long dependencies = 0; // Used by JVM to >>>>>> store JVM_nmethodBucket* >>>>>> >>>>>> It's a little odd to see this field be final, but i think i >>>>>> understand the motivation as it's essentially acting as a memory >>>>>> address pointing to the head of the nbucket list, so in Java code >>>>>> you don't want to assign it to anything other than 0. >>>>> Yes, my motivation was to forbid reads & writes of the field from >>>>> Java code. I was experimenting with injecting the field by VM, but >>>>> it's less attractive. >>>>> >>>> >>>> I was wondering if that were possible. >>>> >>>> >>>>>> Is VM access, via java_lang_invoke_CallSite_Context, affected by >>>>>> treating final fields as really final in the j.l.i package? >>>>> No, field modifiers aren't respected. oopDesc::*_field_[get|put] does >>>>> a raw read/write using field offset. >>>>> >>>> >>>> Ok. >>>> >>>> Paul. 
>>>> > From aph at redhat.com Mon May 18 13:56:48 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 18 May 2015 14:56:48 +0100 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test Message-ID: <5559EFA0.5060800@redhat.com> This test has been failing with a failed assertion: Exception in thread "main" java.lang.AssertionError: void NullCheckDroppingsTest.testVarClassCast(java.lang.String) was not deoptimized at NullCheckDroppingsTest.checkDeoptimization(NullCheckDroppingsTest.java:331) at NullCheckDroppingsTest.runTest(NullCheckDroppingsTest.java:309) at NullCheckDroppingsTest.main(NullCheckDroppingsTest.java:125) It seems that the reason for the failure is that the nmethod for a deoptimized method (testVarClassCast) is not removed. This is perfectly correct because when testVarClassCast() was trapped, it was with an action of Deoptimization::Action_maybe_recompile. When this occurs, a method is not marked as dead and continues to be invoked, so the test WhiteBox.isMethodCompiled() will still return true. Therefore, I think this code in checkDeoptimization is incorrect: // Check deoptimization event (intrinsic Class.cast() works). if (WHITE_BOX.isMethodCompiled(method) == deopt) { throw new AssertionError(method + " was" + (deopt ? " not" : "") + " deoptimized"); } We either need some code in WhiteBox to check for a deoptimization event properly or we should just remove this altogether. So, thoughts? Just delete the check? Andrew. From denis.kononenko at oracle.com Mon May 18 13:57:12 2015 From: denis.kononenko at oracle.com (denis kononenko) Date: Mon, 18 May 2015 16:57:12 +0300 Subject: RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 Message-ID: <5559EFB8.9000503@oracle.com> Hi All, Could you please review a small change to the TEST.groups file. 
Webrev link: http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ Bug id: JDK-8077620 Testing: automated Description: The following tests cannot be run on compact profiles 1 and 2: compiler/jsr292/RedefineMethodUsedByMultipleMethodHandles.java: requires java.lang.instrument package (JEP-161, available in Compact Profile 3); compiler/rangechecks/TestRangeCheckSmearing.java: requires sun.management package (JEP-161, any management packages available since Compact Profile 3); gc/TestGCLogRotationViaJcmd.java: uses ProcessTools which depends on java.lang.management (JEP-161, available in Compact Profile 3); serviceability/sa/jmap-hashcode/Test8028623.java: the same as above. The fix simply moves these tests into the :needs_compact3 group. Thank you, Denis. From zoltan.majo at oracle.com Mon May 18 14:35:59 2015 From: zoltan.majo at oracle.com (Zoltán Majó) Date: Mon, 18 May 2015 16:35:59 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: References: <5559AE36.4040808@oracle.com> Message-ID: <5559F8CF.9070304@oracle.com> Hi Volker, On 05/18/2015 02:49 PM, Volker Simonis wrote: > Hi Zoltan, > > your change looks good. > > Thanks for fixing this and best regards, thank you for the review! Best wishes, Zoltán > Volker > > > On Mon, May 18, 2015 at 11:17 AM, Zoltán Majó wrote: >> Hi, >> >> >> please review the following small patch. >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 >> >> Problem: The changes by 8068945 break building the zero JVM variant. >> >> Solution: Add the definition of the PreserveFramePointer to the >> globals_zero.hpp file.
>> >> Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ >> >> Testing: >> - full JPRT run (testset hotspot) >> - local build of zero JVM variant >> >> Thank you and best regards, >> >> >> Zoltan >> From sgehwolf at redhat.com Mon May 18 15:07:09 2015 From: sgehwolf at redhat.com (Severin Gehwolf) Date: Mon, 18 May 2015 17:07:09 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <5559AE36.4040808@oracle.com> References: <5559AE36.4040808@oracle.com> Message-ID: <1431961629.3512.14.camel@redhat.com> Hi, On Mon, 2015-05-18 at 11:17 +0200, Zoltán Majó wrote: > Hi, > > > please review the following small patch. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 > > Problem: The changes by 8068945 break building the zero JVM variant. > > Solution: Add the definition of the PreserveFramePointer to the > globals_zero.hpp file. > > Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ > > Testing: > - full JPRT run (testset hotspot) > - local build of zero JVM variant > > Thank you and best regards, Thanks, Zoltan, for being pro-active about this! Patch looks fine to me (Disclaimer: I'm not a hotspot Reviewer). Cheers, Severin > Zoltan > From volker.simonis at gmail.com Mon May 18 16:05:04 2015 From: volker.simonis at gmail.com (Volker Simonis) Date: Mon, 18 May 2015 18:05:04 +0200 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file In-Reply-To: <55539A49.4010906@oracle.com> References: <55539A49.4010906@oracle.com> Message-ID: Hi Vladimir, thanks for the review and sorry for the late reply (we had a public holiday in Germany:). On Wed, May 13, 2015 at 8:39 PM, Vladimir Kozlov wrote: > Is this because you can have negative values? > I see that at the bottom of calls which generate bits u_field() method is > used. Why not mask it there?
u_field() generates immediates in the range (hi_bit - lo_bit + 1) but it can be used to generate not only 5-bit immediates but also bigger or smaller ones (e.g. tbr() calls u_field() to generate a 10-bit immediate). The assertion in u_field() checks that the given value can be encoded in the requested range of bits but we do not want to explicitly mask the value to the corresponding bit range as this could hide other problems. The other problem is that we can have negative values as you have correctly noticed. I think this is more a kind of a conceptual problem. The JLS states that "only the five lowest-order bits of the right-hand operand are used as the shift distance" in a shift operation [1]. The same holds true for the specifications of the shift bytecodes in the JVMS [2]. But for constant shift amounts we do not handle this in C2 (i.e. in the ideal graph) but instead we rely on the corresponding CPU instruction doing "the right thing". On x86 this works because the whole 8-bit immediates are encoded right into the instruction and interpreted as required by the JLS. Another way to solve this would be to do the masking right in the ideal graph as we've previously done in the SAP JVM:

Node *LShiftINode::Ideal(PhaseGVN *phase, bool can_reshape) {
  const Type *t = phase->type( in(2) );
  if( t == Type::TOP ) return NULL;        // Right input is dead
  const TypeInt *t2 = t->isa_int();
  if( !t2 || !t2->is_con() ) return NULL;  // Right input is a constant

  // SAPJVM: mask constant shift count already in ideal graph processing
  const int in2_con = t2->get_con();
  const int con = in2_con & ( BitsPerJavaInteger - 1 ); // masked shift count

  if ( con == 0 ) return NULL; // let Identity() handle 0 shift count

  if( in2_con != con ) {
    set_req( 2, phase->intcon(con) ); // replace shift count with masked value
  }

Damned!
I just wanted to ask you what's your opinion about this solution but while writing all this and doing some software archaeology I just discovered that this problem has already been discussed [3] and solved some time ago! Change "7029017: Additional architecture support for c2 compiler" introduced the "Support (for) masking of shift counts when the processor architecture mandates it". This is exactly what's needed on PowerPC! So I'd like to change my fix as follows: http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v3/ I think that using "Matcher::need_masked_shift_count = true" will make a lot of the masking for shift instructions in the ppc.ad file obsolete, but I'd like to keep these cleanups for a separate change. Regards, Volker [1] https://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.19 [2] https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ishl [3] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-March/thread.html#4944 > There are other instructions which can have the same problem (slwi). You > would have to guarantee all of those do the masking. That is why I am asking > why not mask it at the final code methods, like u_field()? > > Thanks, > Vladimir > > > On 5/13/15 9:29 AM, Volker Simonis wrote: >> >> Hi, >> >> could you please review the following, ppc64-only fix which solves a >> problem with the integer-rotate instructions in C2: >> >> https://bugs.openjdk.java.net/browse/JDK-8080190 >> http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ >> >> The problem was that we initially took the "rotate left/right by 8-bit >> immediate" instructions from x86_64. But on x86_64 the rotate >> instructions for 32-bit integers do in fact really take 8-bit >> immediates (seems like the Intel guys had too many spare bits when they >> designed their instruction set :) On the other hand, the rotate >> instructions on Power only use 5 bits to encode the rotation distance.
>> We do check for this limit, but only in the debug build so you'll see >> an assertion there, but in the product build we will silently emit a >> wrong instruction which can lead to anything starting from a crash up >> to an incorrect computation. >> >> I think the simplest solution for this problem is to mask the >> rotation distance with 0x1f (which is equivalent to taking the >> distance modulo 32) in the corresponding instructions. >> >> This change should also be backported to 8 as fast as possible. Notice >> that this is a day-zero bug in our ppc64 port because in our internal SAP >> JVM we already mask the shift amounts in the ideal graph but for some >> unknown reasons these shared changes haven't been contributed along >> with our port. >> >> Thank you and best regards, >> Volker >> > From vladimir.kozlov at oracle.com Mon May 18 16:42:56 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 09:42:56 -0700 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file In-Reply-To: References: <55539A49.4010906@oracle.com> Message-ID: <555A1690.40704@oracle.com> You need to change the comment in ppc.ad. Otherwise looks good. Thank you for doing archeological research - I forgot about that change. Thanks, Vladimir On 5/18/15 9:05 AM, Volker Simonis wrote: > Hi Vladimir, > > thanks for the review and sorry for the late reply (we had a public > holiday in Germany:). > > > On Wed, May 13, 2015 at 8:39 PM, Vladimir Kozlov > wrote: >> Is this because you can have negative values? >> I see that at the bottom of calls which generate bits u_field() method is >> used. Why not mask it there? >> > > u_field() generates immediates in the range (hi_bit - lo_bit + 1) but > it can be used to generate not only 5-bit immediates but also bigger > or smaller ones (e.g. tbr() calls u_field() to generate a 10-bit > immediate).
The assertion in u_field() checks that the given value can > be encoded in the requested range of bits but we do not want to > explicitly mask the value to the corresponding bit range as this could > hide other problems. > > The other problem is that we can have negative values as you have > correctly noticed. > > I think this is more a kind of a conceptual problem. The JLS states > that "only the five lowest-order bits of the right-hand operand are > used as the shift distance" in a shift operation [1]. The same holds > true for the specifications of the shift bytecodes in the JVMS [2]. > But for constant shift amounts we do not handle this in C2 (i.e. in > the ideal graph) but instead we rely on the corresponding CPU > instruction doing "the right thing". On x86 this works because the > whole 8-bit immediates are encoded right into the instruction and > interpreted as required by the JLS. > > Another way to solve this would be to do the masking right in the > ideal graph as we've previously done in the SAP JVM: > > Node *LShiftINode::Ideal(PhaseGVN *phase, bool can_reshape) { > const Type *t = phase->type( in(2) ); > if( t == Type::TOP ) return NULL; // Right input is dead > const TypeInt *t2 = t->isa_int(); > if( !t2 || !t2->is_con() ) return NULL; // Right input is a constant > > // SAPJVM: mask constant shift count already in ideal graph processing > const int in2_con = t2->get_con(); > const int con = in2_con & ( BitsPerJavaInteger - 1 ); // > masked shift count > > if ( con == 0 ) return NULL; // let Identity() handle 0 shift count > > if( in2_con != con ) { > set_req( 2, phase->intcon(con) ); // replace shift count with masked value > } > > Damned! I just wanted to ask you what's your opinion about this > solution but while writing all this and doing some software > archaeology I just discovered that this problem has already been > discussed [3] and solved some time ago!
Change "7029017: Additional > architecture support for c2 compiler" introduced the "Support (for) > masking of shift counts when the processor architecture mandates it". > This is exactly what's needed on PowerPC! > > So I'd like to change my fix as follows: > > http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v3/ > > I think that using "Matcher::need_masked_shift_count = true" will make > a lot of the masking for shift instructions in the ppc.ad file > obsolete, but I'd like to keep these cleanups for a separate change. > > Regards, > Volker > > [1] https://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.19 > [2] https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ishl > [3] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-March/thread.html#4944 > >> There are other instructions which can have the same problem (slwi). You >> would have to guarantee all of those do the masking. That is why I am asking >> why not mask it at the final code methods, like u_field()? >> >> Thanks, >> Vladimir >> >> >> On 5/13/15 9:29 AM, Volker Simonis wrote: >>> >>> Hi, >>> >>> could you please review the following, ppc64-only fix which solves a >>> problem with the integer-rotate instructions in C2: >>> >>> https://bugs.openjdk.java.net/browse/JDK-8080190 >>> http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ >>> >>> The problem was that we initially took the "rotate left/right by 8-bit >>> immediate" instructions from x86_64. But on x86_64 the rotate >>> instructions for 32-bit integers do in fact really take 8-bit >>> immediates (seems like the Intel guys had to many spare bits when they >>> designed their instruction set :) On the other hand, the rotate >>> instructions on Power only use 5 bit to encode the rotation distance. 
>>> We do check for this limit, but only in the debug build so you'll see >>> an assertion there, but in the product build we will silently emit a >>> wrong instruction which can lead to anything starting from a crash up >>> to an incorrect computation. >>> >>> I think the simplest solution for this problem is to mask the >>> rotation distance with 0x1f (which is equivalent to taking the >>> distance modulo 32) in the corresponding instructions. >>> >>> This change should also be backported to 8 as fast as possible. Notice >>> that this is a day-zero bug in our ppc64 port because in our internal SAP >>> JVM we already mask the shift amounts in the ideal graph but for some >>> unknown reasons these shared changes haven't been contributed along >>> with our port. >>> >>> Thank you and best regards, >>> Volker >>> >> From vladimir.kozlov at oracle.com Mon May 18 16:46:54 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 09:46:54 -0700 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <5559AE36.4040808@oracle.com> References: <5559AE36.4040808@oracle.com> Message-ID: <555A177E.3040900@oracle.com> Looks good. Thanks, Vladimir On 5/18/15 2:17 AM, Zoltán Majó wrote: > Hi, > > > please review the following small patch. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 > > Problem: The changes by 8068945 break building the zero JVM variant. > > Solution: Add the definition of the PreserveFramePointer to the > globals_zero.hpp file.
> > Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ > > Testing: > - full JPRT run (testset hotspot) > - local build of zero JVM variant > > Thank you and best regards, > > > Zoltan > From volker.simonis at gmail.com Mon May 18 16:53:14 2015 From: volker.simonis at gmail.com (Volker Simonis) Date: Mon, 18 May 2015 18:53:14 +0200 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file In-Reply-To: <555A1690.40704@oracle.com> References: <55539A49.4010906@oracle.com> <555A1690.40704@oracle.com> Message-ID: Thanks for the review! I'll update the comment accordingly. Regards, Volker On Mon, May 18, 2015 at 6:42 PM, Vladimir Kozlov wrote: > You need to change the comment in ppc.ad. > Otherwise looks good. Thank you for doing archeological research - I forgot > about that change. > > Thanks, > Vladimir > > > On 5/18/15 9:05 AM, Volker Simonis wrote: >> >> Hi Vladimir, >> >> thanks for the review and sorry for the late reply (we had a public >> holiday in Germany:). >> >> >> On Wed, May 13, 2015 at 8:39 PM, Vladimir Kozlov >> wrote: >>> >>> Is this because you can have negative values? >>> I see that at the bottom of calls which generate bits u_field() method is >>> used. Why not mask it there? >>> >> >> u_field() generates immediates in the range (hi_bit - lo_bit + 1) but >> it can be used to generate not only 5-bit immediates but also bigger >> or smaller ones (e.g. tbr() calls u_field() to generate a 10-bit >> immediate). The assertion in u_field() checks that the given value can >> be encoded in the requested range of bits but we do not want to >> explicitly mask the value to the corresponding bit range as this could >> hide other problems. >> >> The other problem is that we can have negative values as you have >> correctly noticed. >> >> I think this is more a kind of a conceptual problem. The JLS states >> that "only the five lowest-order bits of the right-hand operand are >> used as the shift distance" in a shift operation [1].
The same holds >> true for the specifications of the shift bytecodes in the JVMS [2]. >> But for constant shift amounts we do not handle this in C2 (i.e. in >> the ideal graph) but instead we rely on the corresponding CPU >> instruction doing "the right thing". On x86 this works because the >> whole 8-bit immediates are encoded right into the instruction and >> interpreted as required by the JLS. >> >> Another way to solve this would be to do the masking right in the >> ideal graph as we've previously done in the SAP JVM: >> >> Node *LShiftINode::Ideal(PhaseGVN *phase, bool can_reshape) { >> const Type *t = phase->type( in(2) ); >> if( t == Type::TOP ) return NULL; // Right input is dead >> const TypeInt *t2 = t->isa_int(); >> if( !t2 || !t2->is_con() ) return NULL; // Right input is not a constant >> >> // SAPJVM: mask constant shift count already in ideal graph processing >> const int in2_con = t2->get_con(); >> const int con = in2_con & ( BitsPerJavaInteger - 1 ); // >> masked shift count >> >> if ( con == 0 ) return NULL; // let Identity() handle 0 shift count >> >> if( in2_con != con ) { >> set_req( 2, phase->intcon(con) ); // replace shift count with masked >> value >> } >> >> Damned! I just wanted to ask you what's your opinion about this >> solution but while writing all this and doing some software >> archaeology I just discovered that this problem has already been >> discussed [3] and solved some time ago! Change "7029017: Additional >> architecture support for c2 compiler" introduced the "Support (for) >> masking of shift counts when the processor architecture mandates it". >> This is exactly what's needed on PowerPC! >> >> So I'd like to change my fix as follows: >> >> http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v3/ >> >> I think that using "Matcher::need_masked_shift_count = true" will make >> a lot of the masking for shift instructions in the ppc.ad file >> obsolete, but I'd like to keep these cleanups for a separate change.
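As an aside for readers following the thread: the JLS rule quoted above — for int operands only the five lowest-order bits of the shift distance are used, i.e. the count is taken modulo 32 — is easy to check from plain Java. The snippet below is purely illustrative and is not part of the patch under review:

```java
public class ShiftMaskDemo {
    // Returns true if shifting by s gives the same result as shifting
    // by (s & 0x1f), i.e. the shift count is taken modulo 32 (JLS 15.19).
    static boolean maskedEquals(int x, int s) {
        return (x << s) == (x << (s & 0x1f));
    }

    public static void main(String[] args) {
        int x = -123456;
        System.out.println((x << 33) == (x << 1)); // shift by 33 behaves like shift by 1
        System.out.println((x >> 32) == x);        // 32 & 0x1f == 0, so no shift at all
        System.out.println(maskedEquals(x, 37));   // holds for any shift count
    }
}
```

The same masking applies to the long shifts with a 6-bit mask (0x3f); x86 implements this masking in hardware, which is why the problem described in the thread only shows up on architectures like Power that do not.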
>> >> Regards, >> Volker >> >> [1] >> https://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.19 >> [2] >> https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ishl >> [3] >> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-March/thread.html#4944 >> >>> There are other instructions which can have the same problem (slwi). You >>> would have to guarantee all of those do the masking. That is why I am >>> asking >>> why not mask it at the final code methods, like u_field()? >>> >>> Thanks, >>> Vladimir >>> >>> >>> On 5/13/15 9:29 AM, Volker Simonis wrote: >>>> >>>> >>>> Hi, >>>> >>>> could you please review the following, ppc64-only fix which solves a >>>> problem with the integer-rotate instructions in C2: >>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8080190 >>>> http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ >>>> >>>> The problem was that we initially took the "rotate left/right by 8-bit >>>> immediate" instructions from x86_64. But on x86_64 the rotate >>>> instructions for 32-bit integers do in fact really take 8-bit >>>> immediates (seems like the Intel guys had too many spare bits when they >>>> designed their instruction set :) On the other hand, the rotate >>>> instructions on Power only use 5 bits to encode the rotation distance. >>>> >>>> We do check for this limit, but only in the debug build so you'll see >>>> an assertion there, but in the product build we will silently emit a >>>> wrong instruction which can lead to anything starting from a crash up >>>> to an incorrect computation. >>>> >>>> I think the simplest solution for this problem is to mask the >>>> rotation distance with 0x1f (which is equivalent to taking the >>>> distance modulo 32) in the corresponding instructions. >>>> >>>> This change should also be backported to 8 as fast as possible.
Notice >>>> that this is a day-zero bug in our ppc64 port because in our internal SAP >>>> JVM we already mask the shift amounts in the ideal graph but for some >>>> unknown reasons these shared changes haven't been contributed along >>>> with our port. >>>> >>>> Thank you and best regards, >>>> Volker >>>> >>> > From zoltan.majo at oracle.com Mon May 18 17:18:30 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Mon, 18 May 2015 19:18:30 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <555A177E.3040900@oracle.com> References: <5559AE36.4040808@oracle.com> <555A177E.3040900@oracle.com> Message-ID: <555A1EE6.3090702@oracle.com> Thank you, Vladimir, for the review! Best regards, Zoltan On 05/18/2015 06:46 PM, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/18/15 2:17 AM, Zoltán Majó wrote: >> Hi, >> >> >> please review the following small patch. >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 >> >> Problem: The changes by 8068945 break building the zero JVM variant. >> >> Solution: Add the definition of the PreserveFramePointer to the >> globals_zero.hpp file. >> >> Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ >> >> Testing: >> - full JPRT run (testset hotspot) >> - local build of zero JVM variant >> >> Thank you and best regards, >> >> >> Zoltan >> From zoltan.majo at oracle.com Mon May 18 17:55:34 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Mon, 18 May 2015 19:55:34 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <1431961629.3512.14.camel@redhat.com> References: <5559AE36.4040808@oracle.com> <1431961629.3512.14.camel@redhat.com> Message-ID: <555A2796.1020501@oracle.com> Hi Severin, On 05/18/2015 05:07 PM, Severin Gehwolf wrote: > Hi, > > On Mon, 2015-05-18 at 11:17 +0200, Zoltán Majó wrote: >> Hi, >> >> >> please review the following small patch.
>> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 >> >> Problem: The changes by 8068945 break building the zero JVM variant. >> >> Solution: Add the definition of the PreserveFramePointer to the >> globals_zero.hpp file. >> >> Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ >> >> Testing: >> - full JPRT run (testset hotspot) >> - local build of zero JVM variant >> >> Thank you and best regards, > Thanks, Zoltan, for being pro-active about this! Patch looks fine to me > (Disclaimer: I'm not a hotspot Reviewer). thank you for the review! Best wishes, Zoltan > > Cheers, > Severin > >> Zoltan >> > > From vladimir.kozlov at oracle.com Mon May 18 18:06:18 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 11:06:18 -0700 Subject: RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 In-Reply-To: <5559EFB8.9000503@oracle.com> References: <5559EFB8.9000503@oracle.com> Message-ID: <555A2A1A.2080600@oracle.com> Looks fine. Thanks, Vladimir On 5/18/15 6:57 AM, denis kononenko wrote: > Hi All, > > Could you please review a small change to the TEST.groups file. > > Webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ > Bug id: JDK-8077620 > Testing: automated > Description: > > The following tests cannot be run on compact profiles 1 and 2: > > compiler/jsr292/RedefineMethodUsedByMultipleMethodHandles.java: requires > java.lang.instrument package (JEP-161, available in Compact Profile 3); > compiler/rangechecks/TestRangeCheckSmearing.java: requires > sun.management package (JEP-161, any management packages available since > Compact Profile 3); > gc/TestGCLogRotationViaJcmd.java: uses ProcessTools which depends on > java.lang.management (JEP-161, available in Compact Profile 3); > serviceability/sa/jmap-hashcode/Test8028623.java: the same as above. > > The fix simply moves these tests into :needs_compact3 group. > > Thank you, > Denis.
> From michael.c.berg at intel.com Mon May 18 19:11:27 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Mon, 18 May 2015 19:11:27 +0000 Subject: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java In-Reply-To: <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> References: <55568CB8.3030102@oracle.com> <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> Message-ID: Yes, that will make the test correct. Thanks for catching this. -Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Igor Veresov Sent: Friday, May 15, 2015 5:38 PM To: Vladimir Kozlov Cc: hotspot compiler Subject: Re: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java Ok. igor > On May 15, 2015, at 5:18 PM, Vladimir Kozlov wrote: > > Small typo fix. Test run commands incorrectly execute SumRed_Double class instead of SumRed_Long. > > http://cr.openjdk.java.net/~kvn/8080483/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8080483 > > Tested with jtreg. > > Thanks, > Vladimir From adinn at redhat.com Mon May 18 19:40:22 2015 From: adinn at redhat.com (Andrew Dinn) Date: Mon, 18 May 2015 20:40:22 +0100 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> Message-ID: <555A4026.3010303@redhat.com> On 18/05/15 12:53, Roland Westrelin wrote: > My change is broken. 
> GraphKit::make_load() takes 2 boolean arguments: bool depends_only_on_test = true, bool require_atomic_access = false) > and Parse::do_get_xxx() passes the value needs_atomic_access as argument depends_only_on_test instead of require_atomic_access. That goes unnoticed because of the default values. It's the exact reason I didn't use default arguments in my first webrev. It's too easy to make a mistake and not notice it. > If we keep default arguments, one way to have help from the compiler would be to use a 2 value enum: > > enum DependsOnlyOnTest { > DoesDependOnlyOnTest, > DoesNotDependOnlyOnTest > }; > > and use it to pass parameters around. The _depends_only_on_test LoadNode field would still be a boolean. Is this acceptable? That looks perfectly acceptable to me (not a reviewer but I was the one who set you off down this path). regards, Andrew Dinn ----------- From michael.c.berg at intel.com Mon May 18 20:13:29 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Mon, 18 May 2015 20:13:29 +0000 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: <5553C1E2.3040003@oracle.com> Message-ID: Hi Roland, If you make pre_end->cmp_node() a variable it would remove some overhead too like you did with main_cmp. Regards, Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin Sent: Monday, May 18, 2015 5:20 AM To: hotspot compiler Subject: Re: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed Thanks for looking at this, Vladimir. > Why do you think that main- and post- loops will be empty too and it is safe to replace them with pre-loop? Why it is safe? > > Opaque1 node is used only when RCE and align is requested: > cmp_end->set_req(2, peel_only ?
pre_limit : pre_opaq); > > In peel_only case pre-loop has only 1 iteration and can be eliminated too. The assert could be hit in such case too. So your change may not fix the problem. In case of peel_only, we set: if( peel_only ) main_head->set_main_no_pre_loop(); and in IdealLoopTree::policy_range_check(): // If we unrolled with no intention of doing RCE and we later // changed our minds, we got no pre-loop. Either we need to // make a new pre-loop, or we gotta disallow RCE. if (cl->is_main_no_pre_loop()) return false; // Disallowed for now. > I think eliminating main- and post- is good investigation for separate RFE. And we need to make sure it is safe. My reasoning is that if the pre loop's limit is still an opaque node, then the compiler has not been able to optimize the pre loop based on its limited number of iterations yet. I don't see what optimization would remove stuff from the pre loop that wouldn't have removed the same stuff from the initial loop, had it not been cloned. I could be missing something, though. Do you see one? Roland. > > To fix this bug, I think, we should bailout from do_range_check() when we can't find pre-loop. > > Typo 'cab': > + // post loop cab be removed as well > > Thanks, > Vladimir > > > On 5/13/15 6:02 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8078866/webrev.00/ >> >> The assert fires when optimizing that code: >> >> static int remi_sumb() { >> Integer j = Integer.valueOf(1); >> for (int i = 0; i< 1000; i++) { >> j = j + 1; >> } >> return j; >> } >> >> The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there.
>> >> Here is the chain of events that leads to the pre loop being optimized out: >> >> 1- The CountedLoopNode is created >> 2- PhiNode::Value() for the induction variable's Phi is called and captures that 0 <= i < 1000 >> 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive values of j >> 4- pre-loop, main-loop, post-loop are created >> 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: >> >> public static Integer valueOf(int i) { >> if (i >= IntegerCache.low && i <= IntegerCache.high) >> return IntegerCache.cache[i + (-IntegerCache.low)]; >> return new Integer(i); >> } >> >> 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i >> 7- In the preloop, the range of values of i is known because of 2-, so the if of valueOf() can be optimized out >> 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() >> >> SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf. The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main loop depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. >> >> Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops.
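For readers who are not familiar with C2's loop splitting, the pre-/main-/post-loop structure discussed in this thread can be sketched at the Java source level. This is only a schematic illustration of the shape of the transformation, not the compiler's actual IR; the trip counts and the unroll factor below are made up for the example:

```java
public class LoopSplitSketch {
    // The loop as written: a single counted loop.
    static int original(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];
        return sum;
    }

    // Schematic pre/main/post split: the pre-loop runs a few iterations
    // (e.g. until memory accesses are aligned), the main loop runs the
    // bulk of the iterations unrolled, and the post-loop mops up the rest.
    static int split(int[] a) {
        int sum = 0, i = 0;
        int preLimit = Math.min(a.length, 3);          // illustrative pre-loop limit
        for (; i < preLimit; i++) sum += a[i];         // pre-loop
        int mainLimit = a.length - (a.length - i) % 4;
        for (; i < mainLimit; i += 4)                  // main loop, unrolled by 4
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < a.length; i++) sum += a[i];         // post-loop
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[13];
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(original(a) == split(a));
    }
}
```

In the bug under discussion the pre-loop turned out to be empty and was optimized away while the main loop still expected to find it, which is why the proposed fix removes the main and post loops together with an empty pre-loop.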
>> >> I don't have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don't optimize the loop out at all. >> >> Roland. >> From vladimir.kozlov at oracle.com Mon May 18 22:27:13 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 15:27:13 -0700 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555A4026.3010303@redhat.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> Message-ID: <555A6741.3030002@oracle.com> Acceptable. Thanks, Vladimir On 5/18/15 12:40 PM, Andrew Dinn wrote: > On 18/05/15 12:53, Roland Westrelin wrote: >> My change is broken. >> GraphKit::make_load() takes 2 boolean arguments: bool depends_only_on_test = true, bool require_atomic_access = false) >> and Parse::do_get_xxx() passes the value needs_atomic_access as argument depends_only_on_test instead of require_atomic_access. That goes unnoticed because of the default values. It's the exact reason I didn't use default arguments in my first webrev. It's too easy to make a mistake and not notice it. >> If we keep default arguments, one way to have help from the compiler would be to use a 2 value enum: >> >> enum DependsOnlyOnTest { >> DoesDependOnlyOnTest, >> DoesNotDependOnlyOnTest >> }; >> >> and use it to pass parameters around. The _depends_only_on_test LoadNode field would still be a boolean. Is this acceptable?
> > That looks perfectly acceptable to me (not a reviewer but I was the one > who set you off down this path). > > regards, > > > Andrew Dinn > ----------- > From shrinivas.joshi at oracle.com Mon May 18 23:13:41 2015 From: shrinivas.joshi at oracle.com (Shrinivas Joshi) Date: Mon, 18 May 2015 16:13:41 -0700 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC Message-ID: <555A7224.1080004@oracle.com> Hi, Please review this small change which sets default value of TypeProfileLevel to 111 on the SPARC platform. Bug: https://bugs.openjdk.java.net/browse/JDK-8080308 Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ Tested with JTREG, CTW, NSK and JPRT. Thanks, -Shrinivas From vladimir.kozlov at oracle.com Mon May 18 23:25:19 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 16:25:19 -0700 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC In-Reply-To: <555A7224.1080004@oracle.com> References: <555A7224.1080004@oracle.com> Message-ID: <555A74DF.5040306@oracle.com> Looks fine to me but Roland should do final judgement. Thanks, Vladimir On 5/18/15 4:13 PM, Shrinivas Joshi wrote: > Hi, > > Please review this small change which sets default value of > TypeProfileLevel to 111 on the SPARC platform. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080308 > Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ > > Tested with JTREG, CTW, NSK and JPRT. > > Thanks, > -Shrinivas From shrinivas.joshi at oracle.com Mon May 18 23:56:05 2015 From: shrinivas.joshi at oracle.com (Shrinivas Joshi) Date: Mon, 18 May 2015 16:56:05 -0700 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC In-Reply-To: <555A74DF.5040306@oracle.com> References: <555A7224.1080004@oracle.com> <555A74DF.5040306@oracle.com> Message-ID: <555A7C15.4030208@oracle.com> Hi Vladimir, Thanks for the review. 
-Shrinivas On 5/18/2015 4:25 PM, Vladimir Kozlov wrote: > Looks fine to me but Roland should do final judgement. > > Thanks, > Vladimir > > On 5/18/15 4:13 PM, Shrinivas Joshi wrote: >> Hi, >> >> Please review this small change which sets default value of >> TypeProfileLevel to 111 on the SPARC platform. >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8080308 >> Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ >> >> Tested with JTREG, CTW, NSK and JPRT. >> >> Thanks, >> -Shrinivas > > From roland.westrelin at oracle.com Tue May 19 06:39:43 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 19 May 2015 08:39:43 +0200 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC In-Reply-To: <555A7224.1080004@oracle.com> References: <555A7224.1080004@oracle.com> Message-ID: > Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ Looks good to me. Thanks for finding and fixing that mistake. Roland. From david.holmes at oracle.com Tue May 19 07:49:30 2015 From: david.holmes at oracle.com (David Holmes) Date: Tue, 19 May 2015 17:49:30 +1000 Subject: RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 In-Reply-To: <5559EFB8.9000503@oracle.com> References: <5559EFB8.9000503@oracle.com> Message-ID: <555AEB0A.6090301@oracle.com> Looks fine to me. Thanks, David On 18/05/2015 11:57 PM, denis kononenko wrote: > Hi All, > > Could you please review a small change to the TEST.groups file. 
> > Webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ > Bug id: JDK-8077620 > Testing: automated > Description: > > The following tests cannot be run on compact profiles 1 and 2: > > compiler/jsr292/RedefineMethodUsedByMultipleMethodHandles.java: requires > java.lang.instrument package (JEP-161, available in Compact Profile 3); > compiler/rangechecks/TestRangeCheckSmearing.java: requires > sun.management package (JEP-161, any management packages available since > Compact Profile 3); > gc/TestGCLogRotationViaJcmd.java: uses ProcessTools which depends on > java.lang.management (JEP-161, available in Compact Profile 3); > serviceability/sa/jmap-hashcode/Test8028623.java: the same as above. > > The fix simply moves these tests into :needs_compact3 group. > > Thank you, > Denis. > From andreas.eriksson at oracle.com Tue May 19 10:56:20 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Tue, 19 May 2015 12:56:20 +0200 Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information Message-ID: <555B16D4.80008@oracle.com> Hi, Could I please get two reviews for the following bug fix. https://bugs.openjdk.java.net/browse/JDK-8060036 http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ The problem is with type propagation in C2. A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. 
The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. I've not been able to create a regression test that is reliable enough to check in. Thanks, Andreas From tangwei6 at huawei.com Tue May 19 12:58:08 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Tue, 19 May 2015 12:58:08 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? Message-ID: Hi All, I want to run a JAVA application on a performance simulator, and do a profiling on a hot function JITTed with C2 compiler. In order to make C2 compiler compile hot function as early as possible, I hope to reduce the threshold of function invocation count in interpreter and C1 to drive the JIT compiler transitioned to Level 4 (C2) ASAP. Following is the option list I try, but failed to find a right combination to meet my requirement. Anyone can help to figure out what options I can use? Thanks in advance. -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold -XX:Tier3MinInvocationThreshold -XX:Tier3CompileThreshold -XX:Tier4InvocationThreshold -XX:CompileThreshold -XX:Tier4MinInvocationThreshold -XX:Tier4CompileThreshold Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aleksey.shipilev at oracle.com Tue May 19 13:10:14 2015 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Tue, 19 May 2015 16:10:14 +0300 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: <555B3636.9010100@oracle.com> On 19.05.2015 15:58, Tangwei (Euler) wrote: > In order to make C2 compiler compile hot function as early as possible, > I hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can > help to figure out what options I can use? Cut the C1 from the compilation pipeline by disabling tiered compilation altogether? It will still profile the code with interpreter, and so compilation thresholds are still in effect: -XX:-TieredCompilation The same, but force runtime to compile the methods as soon as it calls into them, plus disable background compilation. This will stall the method execution until the compiled version is available. No profiling is done: -XX:-TieredCompilation -Xbatch -Xcomp But, there is intrinsic tradeoff between the time you spend profiling on lower compilation/interpretation levels and the profile accuracy. This translates to the tradeoff between the compiled code efficiency and time-to-performance. Skipping profiling altogether may and will have a detrimental effect on performance tests. Thanks, -Aleksey -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From vitalyd at gmail.com Tue May 19 13:32:51 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 09:32:51 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? 
In-Reply-To: References: Message-ID: Is your goal specifically to have C2 compile or just to reach peak performance quickly? It sounds like the latter. What values did you specify for the tier thresholds? Also, it may help you to -XX:+PrintCompilation to tune the flags as this will show you which methods are being compiled, when, and at what tier. sent from my phone On May 19, 2015 9:01 AM, "Tangwei (Euler)" wrote: > Hi All, > > I want to run a JAVA application on a performance simulator, and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as possible, I > hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can help > to figure out what options I can use? > > Thanks in advance. > > > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold > > -XX:Tier3MinInvocationThreshold > > -XX:Tier3CompileThreshold > > -XX:Tier4InvocationThreshold > > -XX:CompileThreshold > > -XX:Tier4MinInvocationThreshold > > -XX:Tier4CompileThreshold > > > > Regards! > > wei > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shrinivas.joshi at oracle.com Tue May 19 16:02:29 2015 From: shrinivas.joshi at oracle.com (Shrinivas Joshi) Date: Tue, 19 May 2015 09:02:29 -0700 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC In-Reply-To: References: <555A7224.1080004@oracle.com> Message-ID: <555B5E95.3080403@oracle.com> Hi Roland, Thanks for the review. -Shrinivas On 5/18/2015 11:39 PM, Roland Westrelin wrote: >> Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ > Looks good to me. Thanks for finding and fixing that mistake. > > Roland. 
> > From michael.c.berg at intel.com Tue May 19 18:24:21 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Tue, 19 May 2015 18:24:21 +0000 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555B16D4.80008@oracle.com> References: <555B16D4.80008@oracle.com> Message-ID: Hi Andreas, could you add a comment to the push statement: // Propagate change to user And to Node* p declaration and init: // Propagate changes to uses So that the new code carries all similar information. Else the changes seem ok. -Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson Sent: Tuesday, May 19, 2015 3:56 AM To: hotspot-compiler-dev at openjdk.java.net Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information Hi, Could I please get two reviews for the following bug fix. https://bugs.openjdk.java.net/browse/JDK-8060036 http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ The problem is with type propagation in C2. A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. 
When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. I've not been able to create a regression test that is reliable enough to check in. Thanks, Andreas From michael.c.berg at intel.com Tue May 19 18:25:26 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Tue, 19 May 2015 18:25:26 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: Message-ID: Could I please get two reviews for superword unrolling analysis. Thanks, Michael From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Berg, Michael C Sent: Wednesday, May 13, 2015 6:27 PM To: hotspot-compiler-dev at openjdk.java.net Subject: RFR 8080325 SuperWord loop unrolling analysis Hi Folks, We (Intel) would like to contribute superword loop unrolling analysis to facilitate more default unrolling and larger SIMD vector mapping. Please review this patch and comment as needed: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 webrev: http://cr.openjdk.java.net/~kvn/8080325/webrev/ The design finds the highest common vector supported and implemented on a given machine and applies that to unrolling, iff it is greater than the default. If the user gates unrolling we will still obey the user directive. It's light weight, when we get to the analysis part, if we fail, we stop further tries. If we succeed we stop further tries. We generally always analyze only once. We then gate the unroll factor by extending the size of the unroll segment so that the iterative tries will keep unrolling, obtaining larger unrolls of a targeted loop. 
I see no negative behavior wrt to performance, and a modest positive swing in default behavior up to 1.5x for some micros. Vladimir Kozlov has offered to sponsor this patch. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Wed May 20 00:34:52 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 00:34:52 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555B3636.9010100@oracle.com> References: <555B3636.9010100@oracle.com> Message-ID: > -----Original Message----- > From: Aleksey Shipilev [mailto:aleksey.shipilev at oracle.com] > Sent: Tuesday, May 19, 2015 9:10 PM > To: Tangwei (Euler); hotspot-compiler-dev at openjdk.java.net > Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? > > On 19.05.2015 15:58, Tangwei (Euler) wrote: > > In order to make C2 compiler compile hot function as early as > > possible, I hope to reduce the threshold of function invocation > > > > count in interpreter and C1 to drive the JIT compiler transitioned to > > Level 4 (C2) ASAP. Following is the option list I try, but > > > > failed to find a right combination to meet my requirement. Anyone can > > help to figure out what options I can use? > > Cut the C1 from the compilation pipeline by disabling tiered compilation > altogether? It will still profile the code with interpreter, and so compilation > thresholds are still in effect: > -XX:-TieredCompilation > Is that possible to keep tiered compilation policy, and just change threshold value to reduce transition time between different execution level? > The same, but force runtime to compile the methods as soon as it calls into > them, plus disable background compilation. This will stall the method execution > until the compiled version is available. No profiling is done: > -XX:-TieredCompilation -Xbatch -Xcomp > I think this will compile all methods instead of some hot ones. 
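A minimal, hypothetical workload for experimenting with the flags discussed here (class name and iteration count are invented): the program hammers one small method so that, run with -XX:+PrintCompilation, its transitions through the tiers — or directly to C2 with -XX:-TieredCompilation — can be watched in the log.

```java
// Toy hot method for observing tier transitions, e.g.:
//   java -XX:+PrintCompilation HotLoop
//   java -XX:-TieredCompilation -XX:CompileThreshold=100 -XX:+PrintCompilation HotLoop
// PrintCompilation logs each compilation with its tier level (1-4).
public class HotLoop {
    static long mix(long x) {
        return x * 31 + (x >>> 7);  // cheap, easily inlinable arithmetic
    }

    public static void main(String[] args) {
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) {
            acc = mix(acc + i);  // keeps mix() hot enough to compile
        }
        System.out.println(acc);
    }
}
```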
> But, there is intrinsic tradeoff between the time you spend profiling on lower > compilation/interpretation levels and the profile accuracy. This translates to > the tradeoff between the compiled code efficiency and time-to-performance. > Skipping profiling altogether may and will have a detrimental effect on > performance tests. > > Thanks, > -Aleksey From tangwei6 at huawei.com Wed May 20 00:43:38 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 00:43:38 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: My goal is just to reach peak performance quickly. Following is one tier threshold combination I tried: -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold=3 -XX:Tier3MinInvocationThreshold=2 -XX:Tier3CompileThreshold=2 -XX:Tier4InvocationThreshold=4 -XX:Tier4MinInvocationThreshold=3 -XX:Tier4CompileThreshold=2 Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Tuesday, May 19, 2015 9:33 PM To: Tangwei (Euler) Cc: hotspot compiler Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? Is your goal specifically to have C2 compile or just to reach peak performance quickly? It sounds like the latter. What values did you specify for the tier thresholds? Also, it may help you to -XX:+PrintCompilation to tune the flags as this will show you which methods are being compiled, when, and at what tier. sent from my phone On May 19, 2015 9:01 AM, "Tangwei (Euler)" > wrote: Hi All, I want to run a JAVA application on a performance simulator, and do a profiling on a hot function JITTed with C2 compiler. In order to make C2 compiler compile hot function as early as possible, I hope to reduce the threshold of function invocation count in interpreter and C1 to drive the JIT compiler transitioned to Level 4 (C2) ASAP. Following is the option list I try, but failed to find a right combination to meet my requirement. 
Anyone can help to figure out what options I can use? Thanks in advance. -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold -XX:Tier3MinInvocationThreshold -XX:Tier3CompileThreshold -XX:Tier4InvocationThreshold -XX:CompileThreshold -XX:Tier4MinInvocationThreshold -XX:Tier4CompileThreshold Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Wed May 20 01:12:29 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 19 May 2015 18:12:29 -0700 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: <555BDF7D.4010002@oracle.com> If you want only C2 compilation you can disable tiered compilation (since you don't want to spend to much on profiling anyway) and set low CompileThreshold (you need only this one for non-tiered compile): -XX:-TieredCompilation -XX:CompileThreshold=100 If your method is simple and don't need profiling or inlining the generated code could be same quality as with long profiling. An other approach is to set threshold per method (scale all threasholds (C1, C2, interpreter) by this value). For example to reduce thresholds by half: -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 Vladimir On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > My goal is just to reach peak performance quickly. Following is one tier > threshold combination I tried: > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold=3 > > -XX:Tier3MinInvocationThreshold=2 > > -XX:Tier3CompileThreshold=2 > > -XX:Tier4InvocationThreshold=4 > > -XX:Tier4MinInvocationThreshold=3 > > -XX:Tier4CompileThreshold=2 > > Regards! > > wei > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > *Sent:* Tuesday, May 19, 2015 9:33 PM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* Re: How to change compilation policy to trigger C2 > compilation ASAP? 
> > Is your goal specifically to have C2 compile or just to reach peak > performance quickly? It sounds like the latter. What values did you > specify for the tier thresholds? Also, it may help you to > -XX:+PrintCompilation to tune the flags as this will show you which > methods are being compiled, when, and at what tier. > > sent from my phone > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > wrote: > > Hi All, > > I want to run a JAVA application on a performance simulator, and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as possible, > I hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can > help to figure out what options I can use? > > Thanks in advance. > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold > > -XX:Tier3MinInvocationThreshold > > -XX:Tier3CompileThreshold > > -XX:Tier4InvocationThreshold > > -XX:CompileThreshold > > -XX:Tier4MinInvocationThreshold > > -XX:Tier4CompileThreshold > > Regards! > > wei > From vitalyd at gmail.com Wed May 20 01:15:23 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 21:15:23 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555BDF7D.4010002@oracle.com> References: <555BDF7D.4010002@oracle.com> Message-ID: Vladimir, Is 100 actually generally good/decent based on our conversation from the other day regarding lack of scaling down certain inlining thresholds? 
Thanks sent from my phone On May 19, 2015 9:12 PM, "Vladimir Kozlov" wrote: > If you want only C2 compilation you can disable tiered compilation (since > you don't want to spend to much on profiling anyway) and set low > CompileThreshold (you need only this one for non-tiered compile): > > -XX:-TieredCompilation -XX:CompileThreshold=100 > > If your method is simple and don't need profiling or inlining the > generated code could be same quality as with long profiling. > > An other approach is to set threshold per method (scale all threasholds > (C1, C2, interpreter) by this value). For example to reduce thresholds by > half: > > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >> My goal is just to reach peak performance quickly. Following is one tier >> threshold combination I tried: >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold=3 >> >> -XX:Tier3MinInvocationThreshold=2 >> >> -XX:Tier3CompileThreshold=2 >> >> -XX:Tier4InvocationThreshold=4 >> >> -XX:Tier4MinInvocationThreshold=3 >> >> -XX:Tier4CompileThreshold=2 >> >> Regards! >> >> wei >> >> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >> *Sent:* Tuesday, May 19, 2015 9:33 PM >> *To:* Tangwei (Euler) >> *Cc:* hotspot compiler >> *Subject:* Re: How to change compilation policy to trigger C2 >> compilation ASAP? >> >> Is your goal specifically to have C2 compile or just to reach peak >> performance quickly? It sounds like the latter. What values did you >> specify for the tier thresholds? Also, it may help you to >> -XX:+PrintCompilation to tune the flags as this will show you which >> methods are being compiled, when, and at what tier. >> >> sent from my phone >> >> On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > wrote: >> >> Hi All, >> >> I want to run a JAVA application on a performance simulator, and do a >> profiling on a hot function JITTed with C2 compiler. 
>> >> In order to make C2 compiler compile hot function as early as possible, >> I hope to reduce the threshold of function invocation >> >> count in interpreter and C1 to drive the JIT compiler transitioned to >> Level 4 (C2) ASAP. Following is the option list I try, but >> >> failed to find a right combination to meet my requirement. Anyone can >> help to figure out what options I can use? >> >> Thanks in advance. >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold >> >> -XX:Tier3MinInvocationThreshold >> >> -XX:Tier3CompileThreshold >> >> -XX:Tier4InvocationThreshold >> >> -XX:CompileThreshold >> >> -XX:Tier4MinInvocationThreshold >> >> -XX:Tier4CompileThreshold >> >> Regards! >> >> wei >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 20 01:11:58 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 21:11:58 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: These thresholds are going to effectively compile the world, gated only by backlog that will surely ensue on the compiler thread queues? I would start with using default values, and tuning them down until perf is satisfactory. As you have it configured, I don't think tiered will help; in fact, it's probably worse than simply using C2 and lowering CompileThreshold. I think the threshold for C2 should probably depend on how well the code runs with C1, which is likely to be very application (and hardware, to some extent) specific. Are you using java 8? How many compiler threads for C1 and C2 are you using? By the way, I'm likely to do some tiered experiments myself soon, so this is new territory to me as well. At an initial glance, if default values aren't good, it seems to devolve into the same "black magic" as GC tuning. 
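One crude way to check whether a given flag combination actually shortens time-to-peak, sketched below under invented names: time fixed-size batches of the hot method and watch when the per-batch time levels off, which happens once the method has been JIT-compiled at its final tier. For serious measurements a harness such as JMH is the right tool.

```java
// Crude time-to-peak probe: run the hot method in fixed-size batches and
// print each batch's duration; batch times drop and then level off as the
// method moves through the compilation tiers. Method names are illustrative.
public class WarmupProbe {
    static double work(double x) {
        return Math.sqrt(x) + x * 0.5;  // stand-in for the real hot method
    }

    static double runBatch(int iters) {
        double acc = 0;
        for (int i = 0; i < iters; i++) acc += work(i);
        return acc;  // returned so the loop is not dead-code eliminated
    }

    public static void main(String[] args) {
        for (int batch = 0; batch < 10; batch++) {
            long t0 = System.nanoTime();
            double acc = runBatch(100_000);
            long us = (System.nanoTime() - t0) / 1_000;
            System.out.println("batch " + batch + ": " + us + " us (acc=" + acc + ")");
        }
    }
}
```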
sent from my phone On May 19, 2015 8:43 PM, "Tangwei (Euler)" wrote: > My goal is just to reach peak performance quickly. Following is one tier > threshold combination I tried: > > > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold=3 > > -XX:Tier3MinInvocationThreshold=2 > > -XX:Tier3CompileThreshold=2 > > -XX:Tier4InvocationThreshold=4 > > -XX:Tier4MinInvocationThreshold=3 > > -XX:Tier4CompileThreshold=2 > > > > Regards! > > wei > > > > > > *From:* Vitaly Davidovich [mailto:vitalyd at gmail.com] > *Sent:* Tuesday, May 19, 2015 9:33 PM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* Re: How to change compilation policy to trigger C2 compilation > ASAP? > > > > Is your goal specifically to have C2 compile or just to reach peak > performance quickly? It sounds like the latter. What values did you > specify for the tier thresholds? Also, it may help you to > -XX:+PrintCompilation to tune the flags as this will show you which methods > are being compiled, when, and at what tier. > > sent from my phone > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" wrote: > > Hi All, > > I want to run a JAVA application on a performance simulator, and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as possible, I > hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can help > to figure out what options I can use? > > Thanks in advance. > > > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold > > -XX:Tier3MinInvocationThreshold > > -XX:Tier3CompileThreshold > > -XX:Tier4InvocationThreshold > > -XX:CompileThreshold > > -XX:Tier4MinInvocationThreshold > > -XX:Tier4CompileThreshold > > > > Regards! 
> > wei > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Wed May 20 01:30:32 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 19 May 2015 18:30:32 -0700 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> Message-ID: <555BE3B8.1050801@oracle.com> I had said "if don't need profiling or inlining" :) Of cause low threashold affects quality of generated code. Especially if it has branches and calls. And you should not expect that you will get the same inlining. So profiling result of such code should be considered only as approximation. Vladimir On 5/19/15 6:15 PM, Vitaly Davidovich wrote: > Vladimir, > > Is 100 actually generally good/decent based on our conversation from the > other day regarding lack of scaling down certain inlining thresholds? > > Thanks > > sent from my phone > > On May 19, 2015 9:12 PM, "Vladimir Kozlov" > wrote: > > If you want only C2 compilation you can disable tiered compilation > (since you don't want to spend to much on profiling anyway) and set > low CompileThreshold (you need only this one for non-tiered compile): > > -XX:-TieredCompilation -XX:CompileThreshold=100 > > If your method is simple and don't need profiling or inlining the > generated code could be same quality as with long profiling. > > An other approach is to set threshold per method (scale all > threasholds (C1, C2, interpreter) by this value). For example to > reduce thresholds by half: > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > > My goal is just to reach peak performance quickly. 
Following is > one tier > threshold combination I tried: > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold=3 > > -XX:Tier3MinInvocationThreshold=2 > > -XX:Tier3CompileThreshold=2 > > -XX:Tier4InvocationThreshold=4 > > -XX:Tier4MinInvocationThreshold=3 > > -XX:Tier4CompileThreshold=2 > > Regards! > > wei > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com > ] > *Sent:* Tuesday, May 19, 2015 9:33 PM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* Re: How to change compilation policy to trigger C2 > compilation ASAP? > > Is your goal specifically to have C2 compile or just to reach peak > performance quickly? It sounds like the latter. What values did you > specify for the tier thresholds? Also, it may help you to > -XX:+PrintCompilation to tune the flags as this will show you which > methods are being compiled, when, and at what tier. > > sent from my phone > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > >> wrote: > > Hi All, > > I want to run a JAVA application on a performance simulator, > and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as > possible, > I hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler > transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. > Anyone can > help to figure out what options I can use? > > Thanks in advance. > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold > > -XX:Tier3MinInvocationThreshold > > -XX:Tier3CompileThreshold > > -XX:Tier4InvocationThreshold > > -XX:CompileThreshold > > -XX:Tier4MinInvocationThreshold > > -XX:Tier4CompileThreshold > > Regards! 
> > wei > From vitalyd at gmail.com Wed May 20 01:33:51 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 21:33:51 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555BE3B8.1050801@oracle.com> References: <555BDF7D.4010002@oracle.com> <555BE3B8.1050801@oracle.com> Message-ID: Unfortunately, most non-trivial java code needs profiling and inlining to get decent performance :). sent from my phone On May 19, 2015 9:30 PM, "Vladimir Kozlov" wrote: > I had said "if don't need profiling or inlining" :) > > Of cause low threashold affects quality of generated code. Especially if > it has branches and calls. And you should not expect that you will get the > same inlining. So profiling result of such code should be considered only > as approximation. > > Vladimir > > On 5/19/15 6:15 PM, Vitaly Davidovich wrote: > >> Vladimir, >> >> Is 100 actually generally good/decent based on our conversation from the >> other day regarding lack of scaling down certain inlining thresholds? >> >> Thanks >> >> sent from my phone >> >> On May 19, 2015 9:12 PM, "Vladimir Kozlov" > > wrote: >> >> If you want only C2 compilation you can disable tiered compilation >> (since you don't want to spend to much on profiling anyway) and set >> low CompileThreshold (you need only this one for non-tiered compile): >> >> -XX:-TieredCompilation -XX:CompileThreshold=100 >> >> If your method is simple and don't need profiling or inlining the >> generated code could be same quality as with long profiling. >> >> An other approach is to set threshold per method (scale all >> threasholds (C1, C2, interpreter) by this value). For example to >> reduce thresholds by half: >> >> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 >> >> Vladimir >> >> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >> >> My goal is just to reach peak performance quickly. 
Following is >> one tier >> threshold combination I tried: >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold=3 >> >> -XX:Tier3MinInvocationThreshold=2 >> >> -XX:Tier3CompileThreshold=2 >> >> -XX:Tier4InvocationThreshold=4 >> >> -XX:Tier4MinInvocationThreshold=3 >> >> -XX:Tier4CompileThreshold=2 >> >> Regards! >> >> wei >> >> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com >> ] >> *Sent:* Tuesday, May 19, 2015 9:33 PM >> *To:* Tangwei (Euler) >> *Cc:* hotspot compiler >> *Subject:* Re: How to change compilation policy to trigger C2 >> compilation ASAP? >> >> Is your goal specifically to have C2 compile or just to reach peak >> performance quickly? It sounds like the latter. What values did >> you >> specify for the tier thresholds? Also, it may help you to >> -XX:+PrintCompilation to tune the flags as this will show you >> which >> methods are being compiled, when, and at what tier. >> >> sent from my phone >> >> On May 19, 2015 9:01 AM, "Tangwei (Euler)" > >> >> wrote: >> >> Hi All, >> >> I want to run a JAVA application on a performance simulator, >> and do a >> profiling on a hot function JITTed with C2 compiler. >> >> In order to make C2 compiler compile hot function as early as >> possible, >> I hope to reduce the threshold of function invocation >> >> count in interpreter and C1 to drive the JIT compiler >> transitioned to >> Level 4 (C2) ASAP. Following is the option list I try, but >> >> failed to find a right combination to meet my requirement. >> Anyone can >> help to figure out what options I can use? >> >> Thanks in advance. >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold >> >> -XX:Tier3MinInvocationThreshold >> >> -XX:Tier3CompileThreshold >> >> -XX:Tier4InvocationThreshold >> >> -XX:CompileThreshold >> >> -XX:Tier4MinInvocationThreshold >> >> -XX:Tier4CompileThreshold >> >> Regards! >> >> wei >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From tangwei6 at huawei.com Wed May 20 01:41:44 2015
From: tangwei6 at huawei.com (Tangwei (Euler))
Date: Wed, 20 May 2015 01:41:44 +0000
Subject: How to change compilation policy to trigger C2 compilation ASAP?
In-Reply-To: 
References: 
Message-ID: 

Vitaly,
Yes, I agree with you that tuning is necessary to get peak performance. That's why I want to run the application on the simulator by changing thresholds to collect some performance data to guide optimization in the next step. I am using JAVA 8. The number of compiler threads to run is the default value 2.

Regards!
wei

From: Vitaly Davidovich [mailto:vitalyd at gmail.com]
Sent: Wednesday, May 20, 2015 9:12 AM
To: Tangwei (Euler)
Cc: hotspot compiler
Subject: RE: How to change compilation policy to trigger C2 compilation ASAP?

These thresholds are going to effectively compile the world, gated only by the backlog that will surely ensue on the compiler thread queues. I would start with using default values, and tuning them down until perf is satisfactory. As you have it configured, I don't think tiered will help; in fact, it's probably worse than simply using C2 and lowering CompileThreshold. I think the threshold for C2 should probably depend on how well the code runs with C1, which is likely to be very application (and hardware, to some extent) specific.

Are you using java 8? How many compiler threads for C1 and C2 are you using?

By the way, I'm likely to do some tiered experiments myself soon, so this is new territory to me as well. At an initial glance, if default values aren't good, it seems to devolve into the same "black magic" as GC tuning.
Following is one tier threshold combination I tried: -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold=3 -XX:Tier3MinInvocationThreshold=2 -XX:Tier3CompileThreshold=2 -XX:Tier4InvocationThreshold=4 -XX:Tier4MinInvocationThreshold=3 -XX:Tier4CompileThreshold=2 Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Tuesday, May 19, 2015 9:33 PM To: Tangwei (Euler) Cc: hotspot compiler Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? Is your goal specifically to have C2 compile or just to reach peak performance quickly? It sounds like the latter. What values did you specify for the tier thresholds? Also, it may help you to -XX:+PrintCompilation to tune the flags as this will show you which methods are being compiled, when, and at what tier. sent from my phone On May 19, 2015 9:01 AM, "Tangwei (Euler)" > wrote: Hi All, I want to run a JAVA application on a performance simulator, and do a profiling on a hot function JITTed with C2 compiler. In order to make C2 compiler compile hot function as early as possible, I hope to reduce the threshold of function invocation count in interpreter and C1 to drive the JIT compiler transitioned to Level 4 (C2) ASAP. Following is the option list I try, but failed to find a right combination to meet my requirement. Anyone can help to figure out what options I can use? Thanks in advance. -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold -XX:Tier3MinInvocationThreshold -XX:Tier3CompileThreshold -XX:Tier4InvocationThreshold -XX:CompileThreshold -XX:Tier4MinInvocationThreshold -XX:Tier4CompileThreshold Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 20 01:55:46 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 21:55:46 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? 
In-Reply-To: References: Message-ID: How many CPUs are available to your jvm? Perhaps try to also bump # of compiler threads and also bump ReservedCodeCacheSize for good measure. If you reduce thresholds to something reasonable, more compiler threads (assuming you have cores to run them on in parallel) should theoretically keep the backlog down and not stall compilations. Also, probably best to have more C2 compiler threads than C1 since those compilations take longer. If you do make any interesting discoveries, please do share :). sent from my phone On May 19, 2015 9:41 PM, "Tangwei (Euler)" wrote: > Vitaly, > > Yes, I agree with you that tuning is necessary to get peak performance. > That?s why I want to run the application on simulator by changing threshold > to > > collect some performance data to guide optimization in next step. I am > using JAVA 8. The number of compiler thread to run is default value 2. > > > > Regards! > > wei > > > > > > > > *From:* Vitaly Davidovich [mailto:vitalyd at gmail.com] > *Sent:* Wednesday, May 20, 2015 9:12 AM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* RE: How to change compilation policy to trigger C2 compilation > ASAP? > > > > These thresholds are going to effectively compile the world, gated only by > backlog that will surely ensue on the compiler thread queues? I would start > with using default values, and tuning them down until perf is > satisfactory. As you have it configured, I don't think tiered will help; > in fact, it's probably worse than simply using C2 and lowering > CompileThreshold. I think the threshold for C2 should probably depend on > how well the code runs with C1, which is likely to be very application (and > hardware, to some extent) specific. > > Are you using java 8? How many compiler threads for C1 and C2 are you > using? > > By the way, I'm likely to do some tiered experiments myself soon, so this > is new territory to me as well. 
At an initial glance, if default values > aren't good, it seems to devolve into the same "black magic" as GC tuning. > > sent from my phone > > On May 19, 2015 8:43 PM, "Tangwei (Euler)" wrote: > > My goal is just to reach peak performance quickly. Following is one tier > threshold combination I tried: > > > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold=3 > > -XX:Tier3MinInvocationThreshold=2 > > -XX:Tier3CompileThreshold=2 > > -XX:Tier4InvocationThreshold=4 > > -XX:Tier4MinInvocationThreshold=3 > > -XX:Tier4CompileThreshold=2 > > > > Regards! > > wei > > > > > > *From:* Vitaly Davidovich [mailto:vitalyd at gmail.com] > *Sent:* Tuesday, May 19, 2015 9:33 PM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* Re: How to change compilation policy to trigger C2 compilation > ASAP? > > > > Is your goal specifically to have C2 compile or just to reach peak > performance quickly? It sounds like the latter. What values did you > specify for the tier thresholds? Also, it may help you to > -XX:+PrintCompilation to tune the flags as this will show you which methods > are being compiled, when, and at what tier. > > sent from my phone > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" wrote: > > Hi All, > > I want to run a JAVA application on a performance simulator, and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as possible, I > hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can help > to figure out what options I can use? > > Thanks in advance. 
> > -XX:Tier0ProfilingStartPercentage=0
> > -XX:Tier3InvocationThreshold
> > -XX:Tier3MinInvocationThreshold
> > -XX:Tier3CompileThreshold
> > -XX:Tier4InvocationThreshold
> > -XX:CompileThreshold
> > -XX:Tier4MinInvocationThreshold
> > -XX:Tier4CompileThreshold
> >
> > Regards!
> > wei

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tangwei6 at huawei.com Wed May 20 02:00:39 2015
From: tangwei6 at huawei.com (Tangwei (Euler))
Date: Wed, 20 May 2015 02:00:39 +0000
Subject: How to change compilation policy to trigger C2 compilation ASAP?
In-Reply-To: <555BDF7D.4010002@oracle.com>
References: <555BDF7D.4010002@oracle.com>
Message-ID: 

Vladimir,
What does the 'double' mean in the following command line? Is the CompileThresholdScaling option you mentioned the same as ProfileMaturityPercentage? I cannot find the option in the code. How much does turning off tiered compilation affect function inlining?

> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5

Regards!
wei

> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Wednesday, May 20, 2015 9:12 AM
> To: Tangwei (Euler); Vitaly Davidovich
> Cc: hotspot compiler
> Subject: Re: How to change compilation policy to trigger C2 compilation ASAP?
>
> If you want only C2 compilation you can disable tiered compilation (since you
> don't want to spend to much on profiling anyway) and set low
> CompileThreshold (you need only this one for non-tiered compile):
>
> -XX:-TieredCompilation -XX:CompileThreshold=100
>
> If your method is simple and don't need profiling or inlining the generated code
> could be same quality as with long profiling.
>
> An other approach is to set threshold per method (scale all threasholds (C1, C2,
> interpreter) by this value).
For example to reduce thresholds by half: > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > oldScaling,0.5 > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > > My goal is just to reach peak performance quickly. Following is one > > tier threshold combination I tried: > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > -XX:Tier3InvocationThreshold=3 > > > > -XX:Tier3MinInvocationThreshold=2 > > > > -XX:Tier3CompileThreshold=2 > > > > -XX:Tier4InvocationThreshold=4 > > > > -XX:Tier4MinInvocationThreshold=3 > > > > -XX:Tier4CompileThreshold=2 > > > > Regards! > > > > wei > > > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > > *Sent:* Tuesday, May 19, 2015 9:33 PM > > *To:* Tangwei (Euler) > > *Cc:* hotspot compiler > > *Subject:* Re: How to change compilation policy to trigger C2 > > compilation ASAP? > > > > Is your goal specifically to have C2 compile or just to reach peak > > performance quickly? It sounds like the latter. What values did you > > specify for the tier thresholds? Also, it may help you to > > -XX:+PrintCompilation to tune the flags as this will show you which > > methods are being compiled, when, and at what tier. > > > > sent from my phone > > > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > wrote: > > > > Hi All, > > > > I want to run a JAVA application on a performance simulator, and do > > a profiling on a hot function JITTed with C2 compiler. > > > > In order to make C2 compiler compile hot function as early as > > possible, I hope to reduce the threshold of function invocation > > > > count in interpreter and C1 to drive the JIT compiler transitioned to > > Level 4 (C2) ASAP. Following is the option list I try, but > > > > failed to find a right combination to meet my requirement. Anyone can > > help to figure out what options I can use? > > > > Thanks in advance. 
> > > > -XX:Tier0ProfilingStartPercentage=0 > > > > -XX:Tier3InvocationThreshold > > > > -XX:Tier3MinInvocationThreshold > > > > -XX:Tier3CompileThreshold > > > > -XX:Tier4InvocationThreshold > > > > -XX:CompileThreshold > > > > -XX:Tier4MinInvocationThreshold > > > > -XX:Tier4CompileThreshold > > > > Regards! > > > > wei > > From tangwei6 at huawei.com Wed May 20 02:06:38 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 02:06:38 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: Our real machine has 16 CPU cores, but unfortunately, our simulator can only support one thread now. So it doesn't work for me to increase compiler threads for now. I am glad to share if I find something interesting. Thanks for your help! Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Wednesday, May 20, 2015 9:56 AM To: Tangwei (Euler) Cc: hotspot compiler Subject: RE: How to change compilation policy to trigger C2 compilation ASAP? How many CPUs are available to your JVM? Perhaps try to also bump # of compiler threads and also bump ReservedCodeCacheSize for good measure. If you reduce thresholds to something reasonable, more compiler threads (assuming you have cores to run them on in parallel) should theoretically keep the backlog down and not stall compilations. Also, probably best to have more C2 compiler threads than C1 since those compilations take longer. If you do make any interesting discoveries, please do share :). sent from my phone On May 19, 2015 9:41 PM, "Tangwei (Euler)" > wrote: Vitaly, Yes, I agree with you that tuning is necessary to get peak performance. That's why I want to run the application on the simulator with changed thresholds to collect some performance data to guide optimization in the next step. I am using JAVA 8. The number of compiler threads to run is the default value, 2. Regards! 
wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Wednesday, May 20, 2015 9:12 AM To: Tangwei (Euler) Cc: hotspot compiler Subject: RE: How to change compilation policy to trigger C2 compilation ASAP? These thresholds are going to effectively compile the world, gated only by backlog that will surely ensue on the compiler thread queues? I would start with using default values, and tuning them down until perf is satisfactory. As you have it configured, I don't think tiered will help; in fact, it's probably worse than simply using C2 and lowering CompileThreshold. I think the threshold for C2 should probably depend on how well the code runs with C1, which is likely to be very application (and hardware, to some extent) specific. Are you using java 8? How many compiler threads for C1 and C2 are you using? By the way, I'm likely to do some tiered experiments myself soon, so this is new territory to me as well. At an initial glance, if default values aren't good, it seems to devolve into the same "black magic" as GC tuning. sent from my phone On May 19, 2015 8:43 PM, "Tangwei (Euler)" > wrote: My goal is just to reach peak performance quickly. Following is one tier threshold combination I tried: -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold=3 -XX:Tier3MinInvocationThreshold=2 -XX:Tier3CompileThreshold=2 -XX:Tier4InvocationThreshold=4 -XX:Tier4MinInvocationThreshold=3 -XX:Tier4CompileThreshold=2 Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Tuesday, May 19, 2015 9:33 PM To: Tangwei (Euler) Cc: hotspot compiler Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? Is your goal specifically to have C2 compile or just to reach peak performance quickly? It sounds like the latter. What values did you specify for the tier thresholds? Also, it may help you to -XX:+PrintCompilation to tune the flags as this will show you which methods are being compiled, when, and at what tier. 
sent from my phone On May 19, 2015 9:01 AM, "Tangwei (Euler)" > wrote: Hi All, I want to run a JAVA application on a performance simulator, and do a profiling on a hot function JITTed with C2 compiler. In order to make C2 compiler compile hot function as early as possible, I hope to reduce the threshold of function invocation count in interpreter and C1 to drive the JIT compiler transitioned to Level 4 (C2) ASAP. Following is the option list I try, but failed to find a right combination to meet my requirement. Anyone can help to figure out what options I can use? Thanks in advance. -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold -XX:Tier3MinInvocationThreshold -XX:Tier3CompileThreshold -XX:Tier4InvocationThreshold -XX:CompileThreshold -XX:Tier4MinInvocationThreshold -XX:Tier4CompileThreshold Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 20 02:11:17 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 22:11:17 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> Message-ID: I'll let Vladimir comment on the compile command, but turning off tiered doesn't prevent inlining. What prevents inlining (an otherwise inlineable method), in a nutshell, is lack of profiling information that would indicate the method or callsite is hot enough for inlining. So you can have tiered off and run C2 with standard compilation thresholds, and you'll likely get good inlining because you'll have a long profile. If you turn C2 compile threshold down too far (and too far is going to be app-specific), C2 compilation may not have sufficient info in the shortened profile to decide to inline. Or even worse, the profile collected is actually not reflective of the real profile you want to capture (e.g. app has a phase change after initialization). 
The theoretical advantage of tiered is you can get decent perf quickly, but also since you're getting better than interpreter speed quickly, you can run in C1 tier longer and thus collect an even larger profile, possibly leading to better code than C2 alone. But unfortunately this may require serious tuning. That's my understanding at least. sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" wrote: > Vladimir, > What the 'double' means in following command line? The option > CompileThresholdScaling you mentioned is same as ProfileMaturityPercentage? > I cannot find the option in code. How much effect to function inlining by > turning off tiered compilation? > > > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 > > Regards! > wei > > > -----Original Message----- > > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > > Sent: Wednesday, May 20, 2015 9:12 AM > > To: Tangwei (Euler); Vitaly Davidovich > > Cc: hotspot compiler > > Subject: Re: How to change compilation policy to trigger C2 compilation > ASAP? > > > > If you want only C2 compilation you can disable tiered compilation > (since you > > don't want to spend to much on profiling anyway) and set low > > CompileThreshold (you need only this one for non-tiered compile): > > > > -XX:-TieredCompilation -XX:CompileThreshold=100 > > > > If your method is simple and don't need profiling or inlining the > generated code > > could be same quality as with long profiling. > > > > An other approach is to set threshold per method (scale all threasholds > (C1, C2, > > interpreter) by this value). For example to reduce thresholds by half: > > > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > > oldScaling,0.5 > > > > Vladimir > > > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > > > My goal is just to reach peak performance quickly. 
Following is one > > > tier threshold combination I tried: > > > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > > > -XX:Tier3InvocationThreshold=3 > > > > > > -XX:Tier3MinInvocationThreshold=2 > > > > > > -XX:Tier3CompileThreshold=2 > > > > > > -XX:Tier4InvocationThreshold=4 > > > > > > -XX:Tier4MinInvocationThreshold=3 > > > > > > -XX:Tier4CompileThreshold=2 > > > > > > Regards! > > > > > > wei > > > > > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > > > *Sent:* Tuesday, May 19, 2015 9:33 PM > > > *To:* Tangwei (Euler) > > > *Cc:* hotspot compiler > > > *Subject:* Re: How to change compilation policy to trigger C2 > > > compilation ASAP? > > > > > > Is your goal specifically to have C2 compile or just to reach peak > > > performance quickly? It sounds like the latter. What values did you > > > specify for the tier thresholds? Also, it may help you to > > > -XX:+PrintCompilation to tune the flags as this will show you which > > > methods are being compiled, when, and at what tier. > > > > > > sent from my phone > > > > > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > > wrote: > > > > > > Hi All, > > > > > > I want to run a JAVA application on a performance simulator, and do > > > a profiling on a hot function JITTed with C2 compiler. > > > > > > In order to make C2 compiler compile hot function as early as > > > possible, I hope to reduce the threshold of function invocation > > > > > > count in interpreter and C1 to drive the JIT compiler transitioned to > > > Level 4 (C2) ASAP. Following is the option list I try, but > > > > > > failed to find a right combination to meet my requirement. Anyone can > > > help to figure out what options I can use? > > > > > > Thanks in advance. 
> > > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > > > -XX:Tier3InvocationThreshold > > > > > > -XX:Tier3MinInvocationThreshold > > > > > > -XX:Tier3CompileThreshold > > > > > > -XX:Tier4InvocationThreshold > > > > > > -XX:CompileThreshold > > > > > > -XX:Tier4MinInvocationThreshold > > > > > > -XX:Tier4CompileThreshold > > > > > > Regards! > > > > > > wei > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Wed May 20 02:29:41 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 02:29:41 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> Message-ID: Got it! Thanks a lot! Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Wednesday, May 20, 2015 10:11 AM To: Tangwei (Euler) Cc: hotspot compiler; Vladimir Kozlov Subject: RE: How to change compilation policy to trigger C2 compilation ASAP? I'll let Vladimir comment on the compile command, but turning off tiered doesn't prevent inlining. What prevents inlining (an otherwise inlineable method), in a nutshell, is lack of profiling information that would indicate the method or callsite is hot enough for inlining. So you can have tiered off and run C2 with standard compilation thresholds, and you'll likely get good inlining because you'll have a long profile. If you turn C2 compile threshold down too far (and too far is going to be app-specific), C2 compilation may not have sufficient info in the shortened profile to decide to inline. Or even worse, the profile collected is actually not reflective of the real profile you want to capture (e.g. app has a phase change after initialization). 
The theoretical advantage of tiered is you can get decent perf quickly, but also since you're getting better than interpreter speed quickly, you can run in C1 tier longer and thus collect an even larger profile, possibly leading to better code than C2 alone. But unfortunately this may require serious tuning. That's my understanding at least. sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > wrote: Vladimir, What the 'double' means in following command line? The option CompileThresholdScaling you mentioned is same as ProfileMaturityPercentage? I cannot find the option in code. How much effect to function inlining by turning off tiered compilation? > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 Regards! wei > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, May 20, 2015 9:12 AM > To: Tangwei (Euler); Vitaly Davidovich > Cc: hotspot compiler > Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? > > If you want only C2 compilation you can disable tiered compilation (since you > don't want to spend to much on profiling anyway) and set low > CompileThreshold (you need only this one for non-tiered compile): > > -XX:-TieredCompilation -XX:CompileThreshold=100 > > If your method is simple and don't need profiling or inlining the generated code > could be same quality as with long profiling. > > An other approach is to set threshold per method (scale all threasholds (C1, C2, > interpreter) by this value). For example to reduce thresholds by half: > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > oldScaling,0.5 > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > > My goal is just to reach peak performance quickly. 
Following is one > > tier threshold combination I tried: > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > -XX:Tier3InvocationThreshold=3 > > > > -XX:Tier3MinInvocationThreshold=2 > > > > -XX:Tier3CompileThreshold=2 > > > > -XX:Tier4InvocationThreshold=4 > > > > -XX:Tier4MinInvocationThreshold=3 > > > > -XX:Tier4CompileThreshold=2 > > > > Regards! > > > > wei > > > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > > *Sent:* Tuesday, May 19, 2015 9:33 PM > > *To:* Tangwei (Euler) > > *Cc:* hotspot compiler > > *Subject:* Re: How to change compilation policy to trigger C2 > > compilation ASAP? > > > > Is your goal specifically to have C2 compile or just to reach peak > > performance quickly? It sounds like the latter. What values did you > > specify for the tier thresholds? Also, it may help you to > > -XX:+PrintCompilation to tune the flags as this will show you which > > methods are being compiled, when, and at what tier. > > > > sent from my phone > > > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > >> wrote: > > > > Hi All, > > > > I want to run a JAVA application on a performance simulator, and do > > a profiling on a hot function JITTed with C2 compiler. > > > > In order to make C2 compiler compile hot function as early as > > possible, I hope to reduce the threshold of function invocation > > > > count in interpreter and C1 to drive the JIT compiler transitioned to > > Level 4 (C2) ASAP. Following is the option list I try, but > > > > failed to find a right combination to meet my requirement. Anyone can > > help to figure out what options I can use? > > > > Thanks in advance. > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > -XX:Tier3InvocationThreshold > > > > -XX:Tier3MinInvocationThreshold > > > > -XX:Tier3CompileThreshold > > > > -XX:Tier4InvocationThreshold > > > > -XX:CompileThreshold > > > > -XX:Tier4MinInvocationThreshold > > > > -XX:Tier4CompileThreshold > > > > Regards! 
> > > > wei > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zoltan.majo at oracle.com Wed May 20 06:47:17 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Wed, 20 May 2015 08:47:17 +0200 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555BDF7D.4010002@oracle.com> References: <555BDF7D.4010002@oracle.com> Message-ID: <555C2DF5.1050600@oracle.com> Hi Wei, On 05/20/2015 03:12 AM, Vladimir Kozlov wrote: [...] > An other approach is to set threshold per method (scale all > threasholds (C1, C2, interpreter) by this value). For example to > reduce thresholds by half: > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 > just to complement what Vladimir suggested: If you want to scale thresholds for all methods at once, the CompileThresholdScaling can be also used globally. For example, -XX:CompileThresholdScaling=0.5 would reduce all thresholds by half (both with tiered and non-tiered modes of operation). Best regards, Zoltan > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >> My goal is just to reach peak performance quickly. Following is one tier >> threshold combination I tried: >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold=3 >> >> -XX:Tier3MinInvocationThreshold=2 >> >> -XX:Tier3CompileThreshold=2 >> >> -XX:Tier4InvocationThreshold=4 >> >> -XX:Tier4MinInvocationThreshold=3 >> >> -XX:Tier4CompileThreshold=2 >> >> Regards! >> >> wei >> >> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >> *Sent:* Tuesday, May 19, 2015 9:33 PM >> *To:* Tangwei (Euler) >> *Cc:* hotspot compiler >> *Subject:* Re: How to change compilation policy to trigger C2 >> compilation ASAP? >> >> Is your goal specifically to have C2 compile or just to reach peak >> performance quickly? It sounds like the latter. What values did you >> specify for the tier thresholds? 
Also, it may help you to >> -XX:+PrintCompilation to tune the flags as this will show you which >> methods are being compiled, when, and at what tier. >> >> sent from my phone >> >> On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > wrote: >> >> Hi All, >> >> I want to run a JAVA application on a performance simulator, and do a >> profiling on a hot function JITTed with C2 compiler. >> >> In order to make C2 compiler compile hot function as early as possible, >> I hope to reduce the threshold of function invocation >> >> count in interpreter and C1 to drive the JIT compiler transitioned to >> Level 4 (C2) ASAP. Following is the option list I try, but >> >> failed to find a right combination to meet my requirement. Anyone can >> help to figure out what options I can use? >> >> Thanks in advance. >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold >> >> -XX:Tier3MinInvocationThreshold >> >> -XX:Tier3CompileThreshold >> >> -XX:Tier4InvocationThreshold >> >> -XX:CompileThreshold >> >> -XX:Tier4MinInvocationThreshold >> >> -XX:Tier4CompileThreshold >> >> Regards! >> >> wei >> From roland.westrelin at oracle.com Wed May 20 07:14:25 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 20 May 2015 09:14:25 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5559E3A5.8030108@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> <5559E3A5.8030108@oracle.com> Message-ID: > What do you think about the following version: > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.04 Still looks good to me. Roland. 
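The -XX:+PrintCompilation suggestion from the compilation-policy thread above can be tried with a small stand-alone harness. The class below is an illustrative sketch, not code from the thread; the class name, the hot method, and the iteration count are assumptions chosen for the example:

```java
// Minimal harness for watching JIT activity, meant to be run as:
//
//   java -XX:+PrintCompilation Hotter
//
// With tiered compilation enabled, each printed compilation line includes the
// tier level, so you can see when hotMethod() moves from the C1 levels up to
// level 4 (C2), and compare how quickly that happens under the different
// threshold flags discussed in the thread.
public class Hotter {
    // The method we want to see compiled; kept small so calls are cheap.
    static long hotMethod(long x) {
        return x * 1664525L + 1013904223L;   // one LCG step, just arithmetic
    }

    public static void main(String[] args) {
        long sink = 0;
        // Enough invocations to pass the default compile thresholds.
        for (int i = 0; i < 100_000; i++) {
            sink += hotMethod(i);
        }
        System.out.println(sink);   // use the result so the loop stays live
    }
}
```

Run once with default flags and once with, e.g., -XX:-TieredCompilation -XX:CompileThreshold=100 to see the effect on when the method reaches C2.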
From cnewland at chrisnewland.com Wed May 20 07:46:19 2015 From: cnewland at chrisnewland.com (Chris Newland) Date: Wed, 20 May 2015 08:46:19 +0100 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> Message-ID: <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> Hi Wei, Is there any reason why you need to cold-start your application each time and begin with no profiling information? Depending on your input data, would it be possible to "warm up" the VM by running typical inputs through your code to build a rich profile and have the JIT compilations performed before your real data arrives? One reason *against* doing this would be wide variations in your input data that could result in the wrong optimisations being made and possibly suffering a decompilation event. There are commercial VMs that have technology to mitigate cold-start costs (for example Azul Zing's ReadyNow). Have you examined the HotSpot LogCompilation output to make sure there is nothing you could change at source code level which would result in a better JIT decisions? I'm thinking of things like inlining failures due to call-site megamorphism. There's a free tool called JITWatch that aims to make it easier to understand the LogCompilation output (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch - I'm the author). Regards, Chris @chriswhocodes On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: > I'll let Vladimir comment on the compile command, but turning off tiered > doesn't prevent inlining. What prevents inlining (an otherwise inlineable > method), in a nutshell, is lack of profiling information that would > indicate the method or callsite is hot enough for inlining. So you can > have tiered off and run C2 with standard compilation thresholds, and > you'll likely get good inlining because you'll have a long profile. 
If > you turn C2 compile threshold down too far (and too far is going to be > app-specific), C2 compilation may not have sufficient info in the > shortened profile to decide to inline. Or even worse, the profile > collected is actually not reflective of the real profile you want to > capture (e.g. app has a phase change after initialization). > > The theoretical advantage of tiered is you can get decent perf quickly, > but also since you're getting better than interpreter speed quickly, you > can run in C1 tier longer and thus collect an even larger profile, > possibly leading to better code than C2 alone. But unfortunately this may > require serious tuning. That's my understanding at least. > > sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > wrote: > > >> Vladimir, >> What the 'double' means in following command line? The option >> CompileThresholdScaling you mentioned is same as >> ProfileMaturityPercentage? >> I cannot find the option in code. How much effect to function inlining >> by turning off tiered compilation? >> >>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdS >> caling,0.5 >> >> Regards! >> wei >> >>> -----Original Message----- >>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>> Sent: Wednesday, May 20, 2015 9:12 AM >>> To: Tangwei (Euler); Vitaly Davidovich >>> Cc: hotspot compiler >>> Subject: Re: How to change compilation policy to trigger C2 >>> compilation >> ASAP? >> >>> >>> If you want only C2 compilation you can disable tiered compilation >>> >> (since you >> >>> don't want to spend to much on profiling anyway) and set low >>> CompileThreshold (you need only this one for non-tiered compile): >>> >>> >>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>> >>> >>> If your method is simple and don't need profiling or inlining the >>> >> generated code >>> could be same quality as with long profiling. 
>>> >>> An other approach is to set threshold per method (scale all >>> threasholds >> (C1, C2, >> >>> interpreter) by this value). For example to reduce thresholds by >>> half: >>> >>> >>> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>> oldScaling,0.5 >>> >>> Vladimir >>> >>> >>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>> >>>> My goal is just to reach peak performance quickly. Following is one >>>> tier threshold combination I tried: >>>> >>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>> >>>> -XX:Tier3InvocationThreshold=3 >>>> >>>> >>>> -XX:Tier3MinInvocationThreshold=2 >>>> >>>> >>>> -XX:Tier3CompileThreshold=2 >>>> >>>> >>>> -XX:Tier4InvocationThreshold=4 >>>> >>>> >>>> -XX:Tier4MinInvocationThreshold=3 >>>> >>>> >>>> -XX:Tier4CompileThreshold=2 >>>> >>>> >>>> Regards! >>>> >>>> >>>> wei >>>> >>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>> *To:* Tangwei (Euler) >>>> *Cc:* hotspot compiler >>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>> compilation ASAP? >>>> >>>> Is your goal specifically to have C2 compile or just to reach peak >>>> performance quickly? It sounds like the latter. What values did you >>>> specify for the tier thresholds? Also, it may help you to >>>> -XX:+PrintCompilation to tune the flags as this will show you which >>>> methods are being compiled, when, and at what tier. >>>> >>>> sent from my phone >>>> >>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>> > wrote: >>>> >>>> >>>> Hi All, >>>> >>>> >>>> I want to run a JAVA application on a performance simulator, and do >>>> a profiling on a hot function JITTed with C2 compiler. >>>> >>>> In order to make C2 compiler compile hot function as early as >>>> possible, I hope to reduce the threshold of function invocation >>>> >>>> count in interpreter and C1 to drive the JIT compiler transitioned >>>> to Level 4 (C2) ASAP. 
Following is the option list I try, but >>>> >>>> >>>> failed to find a right combination to meet my requirement. Anyone >>>> can help to figure out what options I can use? >>>> >>>> Thanks in advance. >>>> >>>> >>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>> >>>> -XX:Tier3InvocationThreshold >>>> >>>> >>>> -XX:Tier3MinInvocationThreshold >>>> >>>> >>>> -XX:Tier3CompileThreshold >>>> >>>> >>>> -XX:Tier4InvocationThreshold >>>> >>>> >>>> -XX:CompileThreshold >>>> >>>> >>>> -XX:Tier4MinInvocationThreshold >>>> >>>> >>>> -XX:Tier4CompileThreshold >>>> >>>> >>>> Regards! >>>> >>>> >>>> wei >>>> >> > From tangwei6 at huawei.com Wed May 20 08:07:34 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 08:07:34 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555C2DF5.1050600@oracle.com> References: <555BDF7D.4010002@oracle.com> <555C2DF5.1050600@oracle.com> Message-ID: Hi Zoltan, Is this option supported in OpenJDK8? I cannot find it by grepping the output of the command line 'java -XX:+PrintFlagsFinal'. Regards! wei > -----Original Message----- > From: hotspot-compiler-dev > [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Zoltán > Majó > Sent: Wednesday, May 20, 2015 2:47 PM > To: hotspot-compiler-dev at openjdk.java.net > Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? > > Hi Wei, > > > On 05/20/2015 03:12 AM, Vladimir Kozlov wrote: > [...] > > Another approach is to set the threshold per method (scale all > > thresholds (C1, C2, interpreter) by this value). For example to > > reduce thresholds by half: > > > > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > old > > Scaling,0.5 > > > > just to complement what Vladimir suggested: If you want to scale thresholds > for all methods at once, the CompileThresholdScaling can also be used globally. 
> For example, -XX:CompileThresholdScaling=0.5 would reduce all thresholds by > half (both with tiered and non-tiered modes of operation). > > Best regards, > > > Zoltan > > > > > Vladimir > > > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >> My goal is just to reach peak performance quickly. Following is one > >> tier threshold combination I tried: > >> > >> -XX:Tier0ProfilingStartPercentage=0 > >> > >> -XX:Tier3InvocationThreshold=3 > >> > >> -XX:Tier3MinInvocationThreshold=2 > >> > >> -XX:Tier3CompileThreshold=2 > >> > >> -XX:Tier4InvocationThreshold=4 > >> > >> -XX:Tier4MinInvocationThreshold=3 > >> > >> -XX:Tier4CompileThreshold=2 > >> > >> Regards! > >> > >> wei > >> > >> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >> *Sent:* Tuesday, May 19, 2015 9:33 PM > >> *To:* Tangwei (Euler) > >> *Cc:* hotspot compiler > >> *Subject:* Re: How to change compilation policy to trigger C2 > >> compilation ASAP? > >> > >> Is your goal specifically to have C2 compile or just to reach peak > >> performance quickly? It sounds like the latter. What values did you > >> specify for the tier thresholds? Also, it may help you to > >> -XX:+PrintCompilation to tune the flags as this will show you which > >> methods are being compiled, when, and at what tier. > >> > >> sent from my phone > >> > >> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >> > wrote: > >> > >> Hi All, > >> > >> I want to run a JAVA application on a performance simulator, and > >> do a profiling on a hot function JITTed with C2 compiler. > >> > >> In order to make C2 compiler compile hot function as early as > >> possible, I hope to reduce the threshold of function invocation > >> > >> count in interpreter and C1 to drive the JIT compiler transitioned to > >> Level 4 (C2) ASAP. Following is the option list I try, but > >> > >> failed to find a right combination to meet my requirement. Anyone can > >> help to figure out what options I can use? > >> > >> Thanks in advance. 
> >> > >> -XX:Tier0ProfilingStartPercentage=0 > >> > >> -XX:Tier3InvocationThreshold > >> > >> -XX:Tier3MinInvocationThreshold > >> > >> -XX:Tier3CompileThreshold > >> > >> -XX:Tier4InvocationThreshold > >> > >> -XX:CompileThreshold > >> > >> -XX:Tier4MinInvocationThreshold > >> > >> -XX:Tier4CompileThreshold > >> > >> Regards! > >> > >> wei > >> From zoltan.majo at oracle.com Wed May 20 08:11:17 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Wed, 20 May 2015 10:11:17 +0200 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> <555C2DF5.1050600@oracle.com> Message-ID: <555C41A5.6010308@oracle.com> Hi Wei, On 05/20/2015 10:07 AM, Tangwei (Euler) wrote: > Hi Zoltan, > Is this option supported in OpenJDK8? I cannot find it by greping output of command line 'java -XX:+PrintFlagsFinal'. I checked and it is available only in 9. Best regards, Zoltan > > > Regards! > wei > >> -----Original Message----- >> From: hotspot-compiler-dev >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Zolt?n >> Maj? >> Sent: Wednesday, May 20, 2015 2:47 PM >> To: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? >> >> Hi Wei, >> >> >> On 05/20/2015 03:12 AM, Vladimir Kozlov wrote: >> [...] >>> An other approach is to set threshold per method (scale all >>> threasholds (C1, C2, interpreter) by this value). For example to >>> reduce thresholds by half: >>> >>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> old >>> Scaling,0.5 >>> >> just to complement what Vladimir suggested: If you want to scale thresholds >> for all methods at once, the CompileThresholdScaling can be also used globally. >> For example, -XX:CompileThresholdScaling=0.5 would reduce all thresholds by >> half (both with tiered and non-tiered modes of operation). 
>> >> Best regards, >> >> >> Zoltan >> >>> Vladimir >>> >>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>>> My goal is just to reach peak performance quickly. Following is one >>>> tier threshold combination I tried: >>>> >>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>> -XX:Tier3InvocationThreshold=3 >>>> >>>> -XX:Tier3MinInvocationThreshold=2 >>>> >>>> -XX:Tier3CompileThreshold=2 >>>> >>>> -XX:Tier4InvocationThreshold=4 >>>> >>>> -XX:Tier4MinInvocationThreshold=3 >>>> >>>> -XX:Tier4CompileThreshold=2 >>>> >>>> Regards! >>>> >>>> wei >>>> >>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>> *To:* Tangwei (Euler) >>>> *Cc:* hotspot compiler >>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>> compilation ASAP? >>>> >>>> Is your goal specifically to have C2 compile or just to reach peak >>>> performance quickly? It sounds like the latter. What values did you >>>> specify for the tier thresholds? Also, it may help you to >>>> -XX:+PrintCompilation to tune the flags as this will show you which >>>> methods are being compiled, when, and at what tier. >>>> >>>> sent from my phone >>>> >>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>> > wrote: >>>> >>>> Hi All, >>>> >>>> I want to run a JAVA application on a performance simulator, and >>>> do a profiling on a hot function JITTed with C2 compiler. >>>> >>>> In order to make C2 compiler compile hot function as early as >>>> possible, I hope to reduce the threshold of function invocation >>>> >>>> count in interpreter and C1 to drive the JIT compiler transitioned to >>>> Level 4 (C2) ASAP. Following is the option list I try, but >>>> >>>> failed to find a right combination to meet my requirement. Anyone can >>>> help to figure out what options I can use? >>>> >>>> Thanks in advance. 
>>>>
>>>> -XX:Tier0ProfilingStartPercentage=0
>>>>
>>>> -XX:Tier3InvocationThreshold
>>>>
>>>> -XX:Tier3MinInvocationThreshold
>>>>
>>>> -XX:Tier3CompileThreshold
>>>>
>>>> -XX:Tier4InvocationThreshold
>>>>
>>>> -XX:CompileThreshold
>>>>
>>>> -XX:Tier4MinInvocationThreshold
>>>>
>>>> -XX:Tier4CompileThreshold
>>>>
>>>> Regards!
>>>>
>>>> wei
>>>>
From tangwei6 at huawei.com  Wed May 20 08:17:54 2015
From: tangwei6 at huawei.com (Tangwei (Euler))
Date: Wed, 20 May 2015 08:17:54 +0000
Subject: How to change compilation policy to trigger C2 compilation ASAP?
In-Reply-To: <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net>
References: <555BDF7D.4010002@oracle.com>
	<2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net>
Message-ID: 

Hi Chris,
   Your tool looks cool. I think OpenJDK has no technology to mitigate
cold-start costs; please correct me if I am wrong. I can control some
compilation passes with the -XX:CompileCommand option, but the profiling
data, such as the invocation count and backedge count, has to reach some
threshold before triggering an execution level transition. As a result, the
simulator doesn't trigger C2 compilation within a reasonable time span.

Regards!
wei

> -----Original Message-----
> From: Chris Newland [mailto:cnewland at chrisnewland.com]
> Sent: Wednesday, May 20, 2015 3:46 PM
> To: Vitaly Davidovich
> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler
> Subject: RE: How to change compilation policy to trigger C2 compilation ASAP?
>
> Hi Wei,
>
> Is there any reason why you need to cold-start your application each time and
> begin with no profiling information?
>
> Depending on your input data, would it be possible to "warm up" the VM by
> running typical inputs through your code to build a rich profile and have the JIT
> compilations performed before your real data arrives?
> > One reason *against* doing this would be wide variations in your input data > that could result in the wrong optimisations being made and possibly suffering > a decompilation event. > > There are commercial VMs that have technology to mitigate cold-start costs > (for example Azul Zing's ReadyNow). > > Have you examined the HotSpot LogCompilation output to make sure there is > nothing you could change at source code level which would result in a better JIT > decisions? I'm thinking of things like inlining failures due to call-site > megamorphism. > > There's a free tool called JITWatch that aims to make it easier to understand > the LogCompilation output > (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch - I'm the > author). > > Regards, > > Chris > @chriswhocodes > > On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: > > I'll let Vladimir comment on the compile command, but turning off > > tiered doesn't prevent inlining. What prevents inlining (an otherwise > > inlineable method), in a nutshell, is lack of profiling information > > that would indicate the method or callsite is hot enough for inlining. > > So you can have tiered off and run C2 with standard compilation > > thresholds, and you'll likely get good inlining because you'll have a > > long profile. If you turn C2 compile threshold down too far (and too > > far is going to be app-specific), C2 compilation may not have > > sufficient info in the shortened profile to decide to inline. Or even > > worse, the profile collected is actually not reflective of the real > > profile you want to capture (e.g. app has a phase change after initialization). > > > > The theoretical advantage of tiered is you can get decent perf > > quickly, but also since you're getting better than interpreter speed > > quickly, you can run in C1 tier longer and thus collect an even larger > > profile, possibly leading to better code than C2 alone. But > > unfortunately this may require serious tuning. 
That's my understanding at > least. > > > > sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > > wrote: > > > > > >> Vladimir, > >> What the 'double' means in following command line? The option > >> CompileThresholdScaling you mentioned is same as > >> ProfileMaturityPercentage? > >> I cannot find the option in code. How much effect to function > >> inlining by turning off tiered compilation? > >> > >>> > >> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > ol > >> dS > >> caling,0.5 > >> > >> Regards! > >> wei > >> > >>> -----Original Message----- > >>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > >>> Sent: Wednesday, May 20, 2015 9:12 AM > >>> To: Tangwei (Euler); Vitaly Davidovich > >>> Cc: hotspot compiler > >>> Subject: Re: How to change compilation policy to trigger C2 > >>> compilation > >> ASAP? > >> > >>> > >>> If you want only C2 compilation you can disable tiered compilation > >>> > >> (since you > >> > >>> don't want to spend to much on profiling anyway) and set low > >>> CompileThreshold (you need only this one for non-tiered compile): > >>> > >>> > >>> -XX:-TieredCompilation -XX:CompileThreshold=100 > >>> > >>> > >>> If your method is simple and don't need profiling or inlining the > >>> > >> generated code > >>> could be same quality as with long profiling. > >>> > >>> An other approach is to set threshold per method (scale all > >>> threasholds > >> (C1, C2, > >> > >>> interpreter) by this value). For example to reduce thresholds by > >>> half: > >>> > >>> > >>> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >>> oldScaling,0.5 > >>> > >>> Vladimir > >>> > >>> > >>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >>> > >>>> My goal is just to reach peak performance quickly. 
Following is one > >>>> tier threshold combination I tried: > >>>> > >>>> -XX:Tier0ProfilingStartPercentage=0 > >>>> > >>>> > >>>> -XX:Tier3InvocationThreshold=3 > >>>> > >>>> > >>>> -XX:Tier3MinInvocationThreshold=2 > >>>> > >>>> > >>>> -XX:Tier3CompileThreshold=2 > >>>> > >>>> > >>>> -XX:Tier4InvocationThreshold=4 > >>>> > >>>> > >>>> -XX:Tier4MinInvocationThreshold=3 > >>>> > >>>> > >>>> -XX:Tier4CompileThreshold=2 > >>>> > >>>> > >>>> Regards! > >>>> > >>>> > >>>> wei > >>>> > >>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >>>> *Sent:* Tuesday, May 19, 2015 9:33 PM > >>>> *To:* Tangwei (Euler) > >>>> *Cc:* hotspot compiler > >>>> *Subject:* Re: How to change compilation policy to trigger C2 > >>>> compilation ASAP? > >>>> > >>>> Is your goal specifically to have C2 compile or just to reach peak > >>>> performance quickly? It sounds like the latter. What values did > >>>> you specify for the tier thresholds? Also, it may help you to > >>>> -XX:+PrintCompilation to tune the flags as this will show you which > >>>> methods are being compiled, when, and at what tier. > >>>> > >>>> sent from my phone > >>>> > >>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>> > wrote: > >>>> > >>>> > >>>> Hi All, > >>>> > >>>> > >>>> I want to run a JAVA application on a performance simulator, and do > >>>> a profiling on a hot function JITTed with C2 compiler. > >>>> > >>>> In order to make C2 compiler compile hot function as early as > >>>> possible, I hope to reduce the threshold of function invocation > >>>> > >>>> count in interpreter and C1 to drive the JIT compiler transitioned > >>>> to Level 4 (C2) ASAP. Following is the option list I try, but > >>>> > >>>> > >>>> failed to find a right combination to meet my requirement. Anyone > >>>> can help to figure out what options I can use? > >>>> > >>>> Thanks in advance. 
> >>>> > >>>> > >>>> -XX:Tier0ProfilingStartPercentage=0 > >>>> > >>>> > >>>> -XX:Tier3InvocationThreshold > >>>> > >>>> > >>>> -XX:Tier3MinInvocationThreshold > >>>> > >>>> > >>>> -XX:Tier3CompileThreshold > >>>> > >>>> > >>>> -XX:Tier4InvocationThreshold > >>>> > >>>> > >>>> -XX:CompileThreshold > >>>> > >>>> > >>>> -XX:Tier4MinInvocationThreshold > >>>> > >>>> > >>>> -XX:Tier4CompileThreshold > >>>> > >>>> > >>>> Regards! > >>>> > >>>> > >>>> wei > >>>> > >> > > > From tangwei6 at huawei.com Wed May 20 08:19:49 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 08:19:49 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555C41A5.6010308@oracle.com> References: <555BDF7D.4010002@oracle.com> <555C2DF5.1050600@oracle.com> <555C41A5.6010308@oracle.com> Message-ID: Thanks, we haven't switched to JDK9, and will give a try. Regards! wei > -----Original Message----- > From: Zolt?n Maj? [mailto:zoltan.majo at oracle.com] > Sent: Wednesday, May 20, 2015 4:11 PM > To: Tangwei (Euler); hotspot-compiler-dev at openjdk.java.net > Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? > > Hi Wei, > > > On 05/20/2015 10:07 AM, Tangwei (Euler) wrote: > > Hi Zoltan, > > Is this option supported in OpenJDK8? I cannot find it by greping output > of command line 'java -XX:+PrintFlagsFinal'. > > I checked and it is available only in 9. > > Best regards, > > > Zoltan > > > > > > > Regards! > > wei > > > >> -----Original Message----- > >> From: hotspot-compiler-dev > >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of > >> Zolt?n Maj? > >> Sent: Wednesday, May 20, 2015 2:47 PM > >> To: hotspot-compiler-dev at openjdk.java.net > >> Subject: Re: How to change compilation policy to trigger C2 compilation > ASAP? > >> > >> Hi Wei, > >> > >> > >> On 05/20/2015 03:12 AM, Vladimir Kozlov wrote: > >> [...] 
> >>> An other approach is to set threshold per method (scale all > >>> threasholds (C1, C2, interpreter) by this value). For example to > >>> reduce thresholds by half: > >>> > >>> > >> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> old > >>> Scaling,0.5 > >>> > >> just to complement what Vladimir suggested: If you want to scale > >> thresholds for all methods at once, the CompileThresholdScaling can be also > used globally. > >> For example, -XX:CompileThresholdScaling=0.5 would reduce all > >> thresholds by half (both with tiered and non-tiered modes of operation). > >> > >> Best regards, > >> > >> > >> Zoltan > >> > >>> Vladimir > >>> > >>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >>>> My goal is just to reach peak performance quickly. Following is one > >>>> tier threshold combination I tried: > >>>> > >>>> -XX:Tier0ProfilingStartPercentage=0 > >>>> > >>>> -XX:Tier3InvocationThreshold=3 > >>>> > >>>> -XX:Tier3MinInvocationThreshold=2 > >>>> > >>>> -XX:Tier3CompileThreshold=2 > >>>> > >>>> -XX:Tier4InvocationThreshold=4 > >>>> > >>>> -XX:Tier4MinInvocationThreshold=3 > >>>> > >>>> -XX:Tier4CompileThreshold=2 > >>>> > >>>> Regards! > >>>> > >>>> wei > >>>> > >>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >>>> *Sent:* Tuesday, May 19, 2015 9:33 PM > >>>> *To:* Tangwei (Euler) > >>>> *Cc:* hotspot compiler > >>>> *Subject:* Re: How to change compilation policy to trigger C2 > >>>> compilation ASAP? > >>>> > >>>> Is your goal specifically to have C2 compile or just to reach peak > >>>> performance quickly? It sounds like the latter. What values did > >>>> you specify for the tier thresholds? Also, it may help you to > >>>> -XX:+PrintCompilation to tune the flags as this will show you which > >>>> methods are being compiled, when, and at what tier. 
> >>>>
> >>>> sent from my phone
> >>>>
> >>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)"
> >>>> > wrote:
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I want to run a JAVA application on a performance simulator, and
> >>>> do a profiling on a hot function JITTed with C2 compiler.
> >>>>
> >>>> In order to make C2 compiler compile hot function as early as
> >>>> possible, I hope to reduce the threshold of function invocation
> >>>>
> >>>> count in interpreter and C1 to drive the JIT compiler transitioned
> >>>> to Level 4 (C2) ASAP. Following is the option list I try, but
> >>>>
> >>>> failed to find a right combination to meet my requirement. Anyone
> >>>> can help to figure out what options I can use?
> >>>>
> >>>> Thanks in advance.
> >>>>
> >>>> -XX:Tier0ProfilingStartPercentage=0
> >>>>
> >>>> -XX:Tier3InvocationThreshold
> >>>>
> >>>> -XX:Tier3MinInvocationThreshold
> >>>>
> >>>> -XX:Tier3CompileThreshold
> >>>>
> >>>> -XX:Tier4InvocationThreshold
> >>>>
> >>>> -XX:CompileThreshold
> >>>>
> >>>> -XX:Tier4MinInvocationThreshold
> >>>>
> >>>> -XX:Tier4CompileThreshold
> >>>>
> >>>> Regards!
> >>>>
> >>>> wei
> >>>>
From aph at redhat.com  Wed May 20 08:48:18 2015
From: aph at redhat.com (Andrew Haley)
Date: Wed, 20 May 2015 09:48:18 +0100
Subject: How to change compilation policy to trigger C2 compilation ASAP?
In-Reply-To: 
References: 
Message-ID: <555C4A52.60807@redhat.com>

On 20/05/15 03:06, Tangwei (Euler) wrote:
> Our real machine has 16 CPU cores, but unfortunately, our simulator
> can only support one thread now. So it doesn't work for me to
> increase compiler threads for now.
> I am glad to share if I find something interesting.

The problem with forcing early C2 compilation is that it may not have
enough profile data to do its job well. So, you'll have something you
can measure but it won't be representative of what the compiler will
do on real hardware.

Andrew.
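[Archive note: the flags discussed earlier in this thread can be combined on one
command line. A minimal sketch only — the main class name is a placeholder, and
the exact threshold needs tuning per application:]

```shell
# Disable tiered compilation so only C2 is used, lower its invocation
# threshold, and log each compilation so the promotion of the hot method
# is visible. "SimWorkload" is a placeholder for the real main class.
java -XX:-TieredCompilation \
     -XX:CompileThreshold=100 \
     -XX:+PrintCompilation \
     SimWorkload
```

[As Andrew notes, a profile collected this quickly may not be representative of
steady-state behavior on real hardware.]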
From cnewland at chrisnewland.com Wed May 20 08:58:40 2015 From: cnewland at chrisnewland.com (Chris Newland) Date: Wed, 20 May 2015 09:58:40 +0100 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> Message-ID: <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Hi Wei, Thanks! OpenJDK has no support for persisting compilation profile data between VM runs. For some industries it can be economical to use a commercial VM to get the lowest latencies. One "free" alternative is to add a training mode to your application and feed it the most realistic input data so that you get a good set of JIT optimisations before the inputs you care about arrive. Is there a reason why you are cold-starting your VM? (For example your GC strategy is to use a huge Eden space and reboot daily to avoid any garbage collections). Regards, Chris @chriswhocodes On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: > Hi Chris, > Your tool looks cool. I think OpenJDK has no technology to mitigate > cold-start costs, please correct if I am wrong. I can control some > compilation passes with option -XX:CompileCommand, but the profiling > data, such as invocation count and backedge count, has to reach some > threshold before trigger execution level transition. This causes > simulator doesn't trigger C2 compilation within reasonable time span. > > Regards! > wei > >> -----Original Message----- >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >> Sent: Wednesday, May 20, 2015 3:46 PM >> To: Vitaly Davidovich >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >> Subject: RE: How to change compilation policy to trigger C2 compilation >> ASAP? >> >> >> Hi Wei, >> >> >> Is there any reason why you need to cold-start your application each >> time and begin with no profiling information? 
>> >> Depending on your input data, would it be possible to "warm up" the VM >> by running typical inputs through your code to build a rich profile and >> have the JIT compilations performed before your real data arrives? >> >> One reason *against* doing this would be wide variations in your input >> data that could result in the wrong optimisations being made and >> possibly suffering a decompilation event. >> >> There are commercial VMs that have technology to mitigate cold-start >> costs (for example Azul Zing's ReadyNow). >> >> >> Have you examined the HotSpot LogCompilation output to make sure there >> is nothing you could change at source code level which would result in a >> better JIT decisions? I'm thinking of things like inlining failures due >> to call-site megamorphism. >> >> There's a free tool called JITWatch that aims to make it easier to >> understand the LogCompilation output >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch - >> I'm the >> author). >> >> Regards, >> >> >> Chris >> @chriswhocodes >> >> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >> >>> I'll let Vladimir comment on the compile command, but turning off >>> tiered doesn't prevent inlining. What prevents inlining (an otherwise >>> inlineable method), in a nutshell, is lack of profiling information >>> that would indicate the method or callsite is hot enough for >>> inlining. So you can have tiered off and run C2 with standard >>> compilation thresholds, and you'll likely get good inlining because >>> you'll have a long profile. If you turn C2 compile threshold down too >>> far (and too far is going to be app-specific), C2 compilation may not >>> have sufficient info in the shortened profile to decide to inline. Or >>> even worse, the profile collected is actually not reflective of the >>> real profile you want to capture (e.g. app has a phase change after >>> initialization). 
>>> >>> The theoretical advantage of tiered is you can get decent perf >>> quickly, but also since you're getting better than interpreter speed >>> quickly, you can run in C1 tier longer and thus collect an even >>> larger profile, possibly leading to better code than C2 alone. But >>> unfortunately this may require serious tuning. That's my >>> understanding at >> least. >>> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >>> wrote: >>> >>> >>> >>>> Vladimir, >>>> What the 'double' means in following command line? The option >>>> CompileThresholdScaling you mentioned is same as >>>> ProfileMaturityPercentage? >>>> I cannot find the option in code. How much effect to function >>>> inlining by turning off tiered compilation? >>>> >>>>> >>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> ol >>>> dS caling,0.5 >>>> >>>> Regards! >>>> wei >>>> >>>>> -----Original Message----- >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >>>>> To: Tangwei (Euler); Vitaly Davidovich >>>>> Cc: hotspot compiler >>>>> Subject: Re: How to change compilation policy to trigger C2 >>>>> compilation >>>> ASAP? >>>> >>>> >>>>> >>>>> If you want only C2 compilation you can disable tiered >>>>> compilation >>>>> >>>> (since you >>>> >>>> >>>>> don't want to spend to much on profiling anyway) and set low >>>>> CompileThreshold (you need only this one for non-tiered compile): >>>>> >>>>> >>>>> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>>>> >>>>> >>>>> >>>>> If your method is simple and don't need profiling or inlining the >>>>> >>>>> >>>> generated code >>>>> could be same quality as with long profiling. >>>>> >>>>> An other approach is to set threshold per method (scale all >>>>> threasholds >>>> (C1, C2, >>>> >>>> >>>>> interpreter) by this value). 
For example to reduce thresholds by >>>>> half: >>>>> >>>>> >>>>> >>>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> >>>>> oldScaling,0.5 >>>>> >>>>> Vladimir >>>>> >>>>> >>>>> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>>>> >>>>> >>>>>> My goal is just to reach peak performance quickly. Following is >>>>>> one tier threshold combination I tried: >>>>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3InvocationThreshold=3 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3MinInvocationThreshold=2 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3CompileThreshold=2 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4InvocationThreshold=4 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4MinInvocationThreshold=3 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4CompileThreshold=2 >>>>>> >>>>>> >>>>>> >>>>>> Regards! >>>>>> >>>>>> >>>>>> >>>>>> wei >>>>>> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>>>> *To:* Tangwei (Euler) >>>>>> *Cc:* hotspot compiler >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>>>> compilation ASAP? >>>>>> >>>>>> Is your goal specifically to have C2 compile or just to reach >>>>>> peak performance quickly? It sounds like the latter. What >>>>>> values did you specify for the tier thresholds? Also, it may >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >>>>>> show you which methods are being compiled, when, and at what >>>>>> tier. >>>>>> >>>>>> sent from my phone >>>>>> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>>> > wrote: >>>>>> >>>>>> >>>>>> >>>>>> Hi All, >>>>>> >>>>>> >>>>>> >>>>>> I want to run a JAVA application on a performance simulator, >>>>>> and do a profiling on a hot function JITTed with C2 compiler. 
>>>>>> >>>>>> In order to make C2 compiler compile hot function as early as >>>>>> possible, I hope to reduce the threshold of function invocation >>>>>> >>>>>> count in interpreter and C1 to drive the JIT compiler >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list >>>>>> I try, but >>>>>> >>>>>> >>>>>> >>>>>> failed to find a right combination to meet my requirement. >>>>>> Anyone >>>>>> can help to figure out what options I can use? >>>>>> >>>>>> Thanks in advance. >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3InvocationThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3MinInvocationThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3CompileThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4InvocationThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:CompileThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4MinInvocationThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4CompileThreshold >>>>>> >>>>>> >>>>>> >>>>>> Regards! >>>>>> >>>>>> >>>>>> >>>>>> wei >>>>>> >>>> >>> >> > > From tangwei6 at huawei.com Wed May 20 09:29:52 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 09:29:52 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Hi Chris, Actually, what I want is to run my application on simulator to locate performance bottleneck in JITTed code. I actually don't care about if cold-starting my VM or not, just want to compile hot function earlier to shorten simulator execution time. Regards! 
wei > -----Original Message----- > From: Chris Newland [mailto:cnewland at chrisnewland.com] > Sent: Wednesday, May 20, 2015 4:59 PM > To: Tangwei (Euler) > Cc: hotspot compiler > Subject: RE: How to change compilation policy to trigger C2 compilation ASAP? > > Hi Wei, > > Thanks! OpenJDK has no support for persisting compilation profile data > between VM runs. For some industries it can be economical to use a > commercial VM to get the lowest latencies. > > One "free" alternative is to add a training mode to your application and feed it > the most realistic input data so that you get a good set of JIT optimisations > before the inputs you care about arrive. > > Is there a reason why you are cold-starting your VM? (For example your GC > strategy is to use a huge Eden space and reboot daily to avoid any garbage > collections). > > Regards, > > Chris > @chriswhocodes > > On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: > > Hi Chris, > > Your tool looks cool. I think OpenJDK has no technology to mitigate > > cold-start costs, please correct if I am wrong. I can control some > > compilation passes with option -XX:CompileCommand, but the profiling > > data, such as invocation count and backedge count, has to reach some > > threshold before trigger execution level transition. This causes > > simulator doesn't trigger C2 compilation within reasonable time span. > > > > Regards! > > wei > > > >> -----Original Message----- > >> From: Chris Newland [mailto:cnewland at chrisnewland.com] > >> Sent: Wednesday, May 20, 2015 3:46 PM > >> To: Vitaly Davidovich > >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler > >> Subject: RE: How to change compilation policy to trigger C2 > >> compilation ASAP? > >> > >> > >> Hi Wei, > >> > >> > >> Is there any reason why you need to cold-start your application each > >> time and begin with no profiling information? 
> >> > >> Depending on your input data, would it be possible to "warm up" the > >> VM by running typical inputs through your code to build a rich > >> profile and have the JIT compilations performed before your real data > arrives? > >> > >> One reason *against* doing this would be wide variations in your > >> input data that could result in the wrong optimisations being made > >> and possibly suffering a decompilation event. > >> > >> There are commercial VMs that have technology to mitigate cold-start > >> costs (for example Azul Zing's ReadyNow). > >> > >> > >> Have you examined the HotSpot LogCompilation output to make sure > >> there is nothing you could change at source code level which would > >> result in a better JIT decisions? I'm thinking of things like > >> inlining failures due to call-site megamorphism. > >> > >> There's a free tool called JITWatch that aims to make it easier to > >> understand the LogCompilation output > >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch > >> - I'm the author). > >> > >> Regards, > >> > >> > >> Chris > >> @chriswhocodes > >> > >> > >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: > >> > >>> I'll let Vladimir comment on the compile command, but turning off > >>> tiered doesn't prevent inlining. What prevents inlining (an > >>> otherwise inlineable method), in a nutshell, is lack of profiling > >>> information that would indicate the method or callsite is hot > >>> enough for inlining. So you can have tiered off and run C2 with > >>> standard compilation thresholds, and you'll likely get good inlining > >>> because you'll have a long profile. If you turn C2 compile > >>> threshold down too far (and too far is going to be app-specific), C2 > >>> compilation may not have sufficient info in the shortened profile to > >>> decide to inline. Or even worse, the profile collected is actually > >>> not reflective of the real profile you want to capture (e.g. 
app has > >>> a phase change after initialization). > >>> > >>> The theoretical advantage of tiered is you can get decent perf > >>> quickly, but also since you're getting better than interpreter speed > >>> quickly, you can run in C1 tier longer and thus collect an even > >>> larger profile, possibly leading to better code than C2 alone. But > >>> unfortunately this may require serious tuning. That's my > >>> understanding at > >> least. > >>> > >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > >>> wrote: > >>> > >>> > >>> > >>>> Vladimir, > >>>> What the 'double' means in following command line? The option > >>>> CompileThresholdScaling you mentioned is same as > >>>> ProfileMaturityPercentage? > >>>> I cannot find the option in code. How much effect to function > >>>> inlining by turning off tiered compilation? > >>>> > >>>>> > >>>> > >> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> ol > >>>> dS caling,0.5 > >>>> > >>>> Regards! > >>>> wei > >>>> > >>>>> -----Original Message----- > >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > >>>>> Sent: Wednesday, May 20, 2015 9:12 AM > >>>>> To: Tangwei (Euler); Vitaly Davidovich > >>>>> Cc: hotspot compiler > >>>>> Subject: Re: How to change compilation policy to trigger C2 > >>>>> compilation > >>>> ASAP? > >>>> > >>>> > >>>>> > >>>>> If you want only C2 compilation you can disable tiered compilation > >>>>> > >>>> (since you > >>>> > >>>> > >>>>> don't want to spend to much on profiling anyway) and set low > >>>>> CompileThreshold (you need only this one for non-tiered compile): > >>>>> > >>>>> > >>>>> > >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 > >>>>> > >>>>> > >>>>> > >>>>> If your method is simple and don't need profiling or inlining the > >>>>> > >>>>> > >>>> generated code > >>>>> could be same quality as with long profiling. 
> >>>>> > >>>>> An other approach is to set threshold per method (scale all > >>>>> threasholds > >>>> (C1, C2, > >>>> > >>>> > >>>>> interpreter) by this value). For example to reduce thresholds by > >>>>> half: > >>>>> > >>>>> > >>>>> > >>>>> > >> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> > >>>>> oldScaling,0.5 > >>>>> > >>>>> Vladimir > >>>>> > >>>>> > >>>>> > >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >>>>> > >>>>> > >>>>>> My goal is just to reach peak performance quickly. Following is > >>>>>> one tier threshold combination I tried: > >>>>>> > >>>>>> -XX:Tier0ProfilingStartPercentage=0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3InvocationThreshold=3 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3MinInvocationThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3CompileThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4InvocationThreshold=4 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4MinInvocationThreshold=3 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4CompileThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> Regards! > >>>>>> > >>>>>> > >>>>>> > >>>>>> wei > >>>>>> > >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM > >>>>>> *To:* Tangwei (Euler) > >>>>>> *Cc:* hotspot compiler > >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 > >>>>>> compilation ASAP? > >>>>>> > >>>>>> Is your goal specifically to have C2 compile or just to reach > >>>>>> peak performance quickly? It sounds like the latter. What values > >>>>>> did you specify for the tier thresholds? Also, it may help you > >>>>>> to -XX:+PrintCompilation to tune the flags as this will show you > >>>>>> which methods are being compiled, when, and at what tier. 
> >>>>>> > >>>>>> sent from my phone > >>>>>> > >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>>>> > wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> Hi All, > >>>>>> > >>>>>> > >>>>>> > >>>>>> I want to run a JAVA application on a performance simulator, and > >>>>>> do a profiling on a hot function JITTed with C2 compiler. > >>>>>> > >>>>>> In order to make C2 compiler compile hot function as early as > >>>>>> possible, I hope to reduce the threshold of function invocation > >>>>>> > >>>>>> count in interpreter and C1 to drive the JIT compiler > >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list I > >>>>>> try, but > >>>>>> > >>>>>> > >>>>>> > >>>>>> failed to find a right combination to meet my requirement. > >>>>>> Anyone > >>>>>> can help to figure out what options I can use? > >>>>>> > >>>>>> Thanks in advance. > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier0ProfilingStartPercentage=0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3InvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3MinInvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4InvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4MinInvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> Regards! > >>>>>> > >>>>>> > >>>>>> > >>>>>> wei > >>>>>> > >>>> > >>> > >> > > > > > From andreas.eriksson at oracle.com Wed May 20 11:04:11 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Wed, 20 May 2015 13:04:11 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: References: <555B16D4.80008@oracle.com> Message-ID: <555C6A2B.7090804@oracle.com> Comments added, thanks. Can I have a Reviewer look at this as well? 
New webrev: http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ - Andreas On 2015-05-19 20:24, Berg, Michael C wrote: > Hi Andreas, could you add a comment to the push statement: > > // Propagate change to user > > And to Node* p declaration and init: > > // Propagate changes to uses > > So that the new code carries all similar information. > Else the changes seem ok. > > -Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson > Sent: Tuesday, May 19, 2015 3:56 AM > To: hotspot-compiler-dev at openjdk.java.net > Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information > > Hi, > > Could I please get two reviews for the following bug fix. > > https://bugs.openjdk.java.net/browse/JDK-8060036 > http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ > > The problem is with type propagation in C2. > > A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. > If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. > These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. > This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. > > The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. > At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. 
> > When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. > > This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. > I've not been able to create a regression test that is reliable enough to check in. > > Thanks, > Andreas From paul.sandoz at oracle.com Wed May 20 12:32:26 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Wed, 20 May 2015 14:32:26 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> <5559E3A5.8030108@oracle.com> Message-ID: <59A05F7D-058A-4A9B-BA65-17A71A57254D@oracle.com> On May 20, 2015, at 9:14 AM, Roland Westrelin wrote: >> What do you think about the following version: >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.04 > > Still looks good to me. > And to me. Injected fields seem a little magical but safer. Paul. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From vladimir.x.ivanov at oracle.com Wed May 20 13:03:10 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 20 May 2015 16:03:10 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <59A05F7D-058A-4A9B-BA65-17A71A57254D@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> <5559E3A5.8030108@oracle.com> <59A05F7D-058A-4A9B-BA65-17A71A57254D@oracle.com> Message-ID: <555C860E.1020508@oracle.com> Paul, Roland, thanks for review. Best regards, Vladimir Ivanov On 5/20/15 3:32 PM, Paul Sandoz wrote: > > On May 20, 2015, at 9:14 AM, Roland Westrelin wrote: > >>> What do you think about the following version: >>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.04 >> >> Still looks good to me. >> > > And to me. > > Injected fields seem a little magical but safer. > > Paul. > From vitalyd at gmail.com Wed May 20 13:40:15 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 20 May 2015 09:40:15 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Training mode is also dangerous unless built in from beginning as there's risk some unwanted application effect will take place. This is beside the possibility of feeding it "wrong" profile. What does Azul ReadyNow do? Saves profiles to disk and reads them upon startup? 
sent from my phone On May 20, 2015 4:58 AM, "Chris Newland" wrote: > Hi Wei, > > Thanks! OpenJDK has no support for persisting compilation profile data > between VM runs. For some industries it can be economical to use a > commercial VM to get the lowest latencies. > > One "free" alternative is to add a training mode to your application and > feed it the most realistic input data so that you get a good set of JIT > optimisations before the inputs you care about arrive. > > Is there a reason why you are cold-starting your VM? (For example your GC > strategy is to use a huge Eden space and reboot daily to avoid any garbage > collections). > > Regards, > > Chris > @chriswhocodes > > On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: > > Hi Chris, > > Your tool looks cool. I think OpenJDK has no technology to mitigate > > cold-start costs, please correct if I am wrong. I can control some > > compilation passes with option -XX:CompileCommand, but the profiling > > data, such as invocation count and backedge count, has to reach some > > threshold before trigger execution level transition. This causes > > simulator doesn't trigger C2 compilation within reasonable time span. > > > > Regards! > > wei > > > >> -----Original Message----- > >> From: Chris Newland [mailto:cnewland at chrisnewland.com] > >> Sent: Wednesday, May 20, 2015 3:46 PM > >> To: Vitaly Davidovich > >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler > >> Subject: RE: How to change compilation policy to trigger C2 compilation > >> ASAP? > >> > >> > >> Hi Wei, > >> > >> > >> Is there any reason why you need to cold-start your application each > >> time and begin with no profiling information? > >> > >> Depending on your input data, would it be possible to "warm up" the VM > >> by running typical inputs through your code to build a rich profile and > >> have the JIT compilations performed before your real data arrives? 
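[Editorial aside: the "warm up the VM" idea under discussion can be sketched as below. This is a hedged sketch under invented names, not code from any VM or patch in this thread; whether ~20k iterations suffices depends on the tiered-threshold flags being discussed.]

```java
import java.util.function.IntUnaryOperator;

public class WarmUp {
    // Stand-in for the application's real hot method.
    static int hotPath(int x) {
        return x * 31 + 7;
    }

    // Drive the hot path with representative inputs before real traffic, so
    // the JIT collects a profile (and, ideally, reaches C2) during startup.
    static int train(IntUnaryOperator f, int iterations) {
        int sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += f.applyAsInt(i);  // accumulate so the loop is not dead code
        }
        return sink;
    }

    public static void main(String[] args) {
        // ~20k invocations is the ballpark of the default Tier4 thresholds;
        // run with -XX:+PrintCompilation to watch the level transitions.
        int sink = train(WarmUp::hotPath, 20_000);
        System.out.println("trained, sink=" + sink);
        System.out.println("hotPath(5)=" + hotPath(5));
    }
}
```

As noted later in the thread, the risk of this approach is feeding the JIT a profile that does not match the real workload, which can lead to deoptimization when the real inputs arrive.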
> >> > >> One reason *against* doing this would be wide variations in your input > >> data that could result in the wrong optimisations being made and > >> possibly suffering a decompilation event. > >> > >> There are commercial VMs that have technology to mitigate cold-start > >> costs (for example Azul Zing's ReadyNow). > >> > >> > >> Have you examined the HotSpot LogCompilation output to make sure there > >> is nothing you could change at source code level which would result in a > >> better JIT decisions? I'm thinking of things like inlining failures due > >> to call-site megamorphism. > >> > >> There's a free tool called JITWatch that aims to make it easier to > >> understand the LogCompilation output > >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch - > >> I'm the > >> author). > >> > >> Regards, > >> > >> > >> Chris > >> @chriswhocodes > >> > >> > >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: > >> > >>> I'll let Vladimir comment on the compile command, but turning off > >>> tiered doesn't prevent inlining. What prevents inlining (an otherwise > >>> inlineable method), in a nutshell, is lack of profiling information > >>> that would indicate the method or callsite is hot enough for > >>> inlining. So you can have tiered off and run C2 with standard > >>> compilation thresholds, and you'll likely get good inlining because > >>> you'll have a long profile. If you turn C2 compile threshold down too > >>> far (and too far is going to be app-specific), C2 compilation may not > >>> have sufficient info in the shortened profile to decide to inline. Or > >>> even worse, the profile collected is actually not reflective of the > >>> real profile you want to capture (e.g. app has a phase change after > >>> initialization). 
> >>> > >>> The theoretical advantage of tiered is you can get decent perf > >>> quickly, but also since you're getting better than interpreter speed > >>> quickly, you can run in C1 tier longer and thus collect an even > >>> larger profile, possibly leading to better code than C2 alone. But > >>> unfortunately this may require serious tuning. That's my > >>> understanding at > >> least. > >>> > >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > >>> wrote: > >>> > >>> > >>> > >>>> Vladimir, > >>>> What the 'double' means in following command line? The option > >>>> CompileThresholdScaling you mentioned is same as > >>>> ProfileMaturityPercentage? > >>>> I cannot find the option in code. How much effect to function > >>>> inlining by turning off tiered compilation? > >>>> > >>>>> > >>>> > >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> ol > >>>> dS caling,0.5 > >>>> > >>>> Regards! > >>>> wei > >>>> > >>>>> -----Original Message----- > >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > >>>>> Sent: Wednesday, May 20, 2015 9:12 AM > >>>>> To: Tangwei (Euler); Vitaly Davidovich > >>>>> Cc: hotspot compiler > >>>>> Subject: Re: How to change compilation policy to trigger C2 > >>>>> compilation > >>>> ASAP? > >>>> > >>>> > >>>>> > >>>>> If you want only C2 compilation you can disable tiered > >>>>> compilation > >>>>> > >>>> (since you > >>>> > >>>> > >>>>> don't want to spend to much on profiling anyway) and set low > >>>>> CompileThreshold (you need only this one for non-tiered compile): > >>>>> > >>>>> > >>>>> > >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 > >>>>> > >>>>> > >>>>> > >>>>> If your method is simple and don't need profiling or inlining the > >>>>> > >>>>> > >>>> generated code > >>>>> could be same quality as with long profiling. 
> >>>>> > >>>>> An other approach is to set threshold per method (scale all > >>>>> threasholds > >>>> (C1, C2, > >>>> > >>>> > >>>>> interpreter) by this value). For example to reduce thresholds by > >>>>> half: > >>>>> > >>>>> > >>>>> > >>>>> > >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> > >>>>> oldScaling,0.5 > >>>>> > >>>>> Vladimir > >>>>> > >>>>> > >>>>> > >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >>>>> > >>>>> > >>>>>> My goal is just to reach peak performance quickly. Following is > >>>>>> one tier threshold combination I tried: > >>>>>> > >>>>>> -XX:Tier0ProfilingStartPercentage=0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3InvocationThreshold=3 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3MinInvocationThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3CompileThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4InvocationThreshold=4 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4MinInvocationThreshold=3 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4CompileThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> Regards! > >>>>>> > >>>>>> > >>>>>> > >>>>>> wei > >>>>>> > >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM > >>>>>> *To:* Tangwei (Euler) > >>>>>> *Cc:* hotspot compiler > >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 > >>>>>> compilation ASAP? > >>>>>> > >>>>>> Is your goal specifically to have C2 compile or just to reach > >>>>>> peak performance quickly? It sounds like the latter. What > >>>>>> values did you specify for the tier thresholds? Also, it may > >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will > >>>>>> show you which methods are being compiled, when, and at what > >>>>>> tier. 
> >>>>>> > >>>>>> sent from my phone > >>>>>> > >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>>>> > wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> Hi All, > >>>>>> > >>>>>> > >>>>>> > >>>>>> I want to run a JAVA application on a performance simulator, > >>>>>> and do a profiling on a hot function JITTed with C2 compiler. > >>>>>> > >>>>>> In order to make C2 compiler compile hot function as early as > >>>>>> possible, I hope to reduce the threshold of function invocation > >>>>>> > >>>>>> count in interpreter and C1 to drive the JIT compiler > >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list > >>>>>> I try, but > >>>>>> > >>>>>> > >>>>>> > >>>>>> failed to find a right combination to meet my requirement. > >>>>>> Anyone > >>>>>> can help to figure out what options I can use? > >>>>>> > >>>>>> Thanks in advance. > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier0ProfilingStartPercentage=0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3InvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3MinInvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4InvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4MinInvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> Regards! > >>>>>> > >>>>>> > >>>>>> > >>>>>> wei > >>>>>> > >>>> > >>> > >> > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From roland.westrelin at oracle.com Wed May 20 14:03:53 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 20 May 2015 16:03:53 +0200 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test Message-ID: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It's not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That's what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. Roland. From roland.westrelin at oracle.com Wed May 20 14:19:24 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 20 May 2015 16:19:24 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555A6741.3030002@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> Message-ID: Thanks Andrew and Vladimir. Here is a new webrev with an enum: http://cr.openjdk.java.net/~roland/8077504/webrev.02/ Roland.
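[Editorial aside: the shape Roland describes — an arraycopy whose destination allocation never escapes, so escape analysis eliminates both the allocation and the ArrayCopyNode — corresponds to Java source along these lines. A hedged reduction, not the JCK test from the bug report; names are invented.]

```java
import java.util.Arrays;

public class NonEscapingCopy {
    static int f(int n) {
        int[] src = {1, 2, 3, n};
        // Arrays.copyOf becomes an arraycopy; since dst never escapes this
        // method, escape analysis can scalar-replace the allocation and
        // eliminate the ArrayCopyNode -- at which point its control/src/dest
        // inputs are set to top, as described in the RFR above.
        int[] dst = Arrays.copyOf(src, src.length);
        return dst[3];
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 20_000; i++) {  // enough iterations to reach C2
            sum += f(i % 7);
        }
        System.out.println(sum);
    }
}
```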
From vladimir.kozlov at oracle.com Wed May 20 16:12:36 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 09:12:36 -0700 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> Message-ID: <555CB274.4010409@oracle.com> Looks good. But fix second comment in memnode.hpp - it has '???' instead of '. Thanks, Vladimir On 5/20/15 7:19 AM, Roland Westrelin wrote: > Thanks Andrew and Vladimir. Here is a new webrev with an enum: > > http://cr.openjdk.java.net/~roland/8077504/webrev.02/ > > Roland. > From adinn at redhat.com Wed May 20 16:15:59 2015 From: adinn at redhat.com (Andrew Dinn) Date: Wed, 20 May 2015 17:15:59 +0100 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555CB274.4010409@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> <555CB274.4010409@oracle.com> Message-ID: <555CB33F.6030506@redhat.com> On 20/05/15 17:12, Vladimir Kozlov wrote: > Looks good. But fix second comment in memnode.hpp - it has '???' instead > of '. Yes, looks good (but, of course, I'm not a reviewer). regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in UK and Wales under Company Registration No. 
3798903 Directors: Michael Cunningham (USA), Matt Parson (USA), Charlie Peters (USA), Michael O'Neill (Ireland) From vladimir.kozlov at oracle.com Wed May 20 16:17:37 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 09:17:37 -0700 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555C6A2B.7090804@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> Message-ID: <555CB3A1.2000308@oracle.com> Looks good. I agree with explanation and fix. Thanks, Vladimir On 5/20/15 4:04 AM, Andreas Eriksson wrote: > Comments added, thanks. > > Can I have a Reviewer look at this as well? > > New webrev: > http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ > > - Andreas > > On 2015-05-19 20:24, Berg, Michael C wrote: >> Hi Andreas, could you add a comment to the push statement: >> >> // Propagate change to user >> >> And to Node* p declaration and init: >> >> // Propagate changes to uses >> >> So that the new code carries all similar information. >> Else the changes seem ok. >> >> -Michael >> >> -----Original Message----- >> From: hotspot-compiler-dev >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >> Andreas Eriksson >> Sent: Tuesday, May 19, 2015 3:56 AM >> To: hotspot-compiler-dev at openjdk.java.net >> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type >> information >> >> Hi, >> >> Could I please get two reviews for the following bug fix. >> >> https://bugs.openjdk.java.net/browse/JDK-8060036 >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >> >> The problem is with type propagation in C2. >> >> A CmpU can use type information from nodes two steps up in the graph, >> when input 1 is an AddI or SubI node. >> If the AddI/SubI overflows the CmpU node type computation will set up >> two type ranges based on the inputs to the AddI/SubI instead. 
>> These type ranges will then be compared to input 2 of the CmpU node to >> see if we can make any meaningful observations on the real type of the >> CmpU. >> This means that the type of the AddI/SubI can be bottom, but the CmpU >> can still use the type of the AddI/SubI inputs to get a more specific >> type itself. >> >> The problem with this is that the types of the AddI/SubI inputs can be >> updated at a later pass, but since the AddI/SubI is of type bottom the >> propagation pass will stop there. >> At that point the CmpU node is never made aware of the new types of >> the AddI/SubI inputs, and it is left with a type that is based on old >> type information. >> >> When this happens other optimizations using the type information can >> go very wrong; in this particular case a branch to a range_check trap >> that normally never happened was optimized to always be taken. >> >> This fix adds a check in the type propagation logic for CmpU nodes, >> which makes sure that the CmpU node will be added to the worklist so >> the type can be updated. >> I've not been able to create a regression test that is reliable enough >> to check in. >> >> Thanks, >> Andreas > From vladimir.kozlov at oracle.com Wed May 20 17:24:39 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 10:24:39 -0700 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> Message-ID: <555CC357.9030006@oracle.com> Should it also return false if in(ArrayCopyNode::Src)->is_top()? thanks, Vladimir On 5/20/15 7:03 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080699/webrev.00/ > > When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. 
It's not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That's what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. > > Roland. > From roland.westrelin at oracle.com Wed May 20 18:57:06 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 20 May 2015 20:57:06 +0200 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <555CC357.9030006@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> Message-ID: <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> Thanks for looking at this, Vladimir. > Should it also return false if in(ArrayCopyNode::Src)->is_top()? That doesn't feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn't modify. Also, I don't think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get another chance to optimize it better. Roland.
>> From vladimir.kozlov at oracle.com Wed May 20 19:15:24 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 12:15:24 -0700 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> Message-ID: <555CDD4C.5030707@oracle.com> On 5/20/15 11:57 AM, Roland Westrelin wrote: > Thanks for looking at this, Vladimir. > >> Should it also return false if in(ArrayCopyNode::Src)->is_top()? > > That doesn?t feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn?t modify. Okay, returning true in such case would be conservative. Should we check dest input for top in CallLeafNode::may_modify()? Do we have other similar places? Thanks, Vladimir > Also, I don?t think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get an other chance to optimize it better. > > Roland. > > > >> >> thanks, >> Vladimir >> >> On 5/20/15 7:03 AM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ >>> >>> When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It?s not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That?s what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. >>> >>> Roland. 
>>> > From vladimir.kozlov at oracle.com Wed May 20 19:38:26 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 12:38:26 -0700 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: <5553C1E2.3040003@oracle.com> Message-ID: <555CE2B2.8020305@oracle.com> On 5/18/15 5:19 AM, Roland Westrelin wrote: > Thanks for looking at this, Vladimir. > >> Why do you think that main- and post- loops will be empty too and it is safe to replace them with pre-loop? Why it is safe? >> >> Opaque1 node is used only when RCE and align is requsted: >> cmp_end->set_req(2, peel_only ? pre_limit : pre_opaq); >> >> In peel_only case pre-loop have only 1 iteration and can be eliminated too. The assert could be hit in such case too. So your change may not fix the problem. > > In case of peel_only, we set: > > if( peel_only ) main_head->set_main_no_pre_loop(); > > and in IdealLoopTree::policy_range_check(): > > // If we unrolled with no intention of doing RCE and we later > // changed our minds, we got no pre-loop. Either we need to > // make a new pre-loop, or we gotta disallow RCE. > if (cl->is_main_no_pre_loop()) return false; // Disallowed for now. I missed that. > >> I think eliminating main- and post- is good investigation for separate RFE. And we need to make sure it is safe. > > My reasoning is that if the pre loop?s limit is still an opaque node, then the compiler has not been able to optimize the pre loop based on its limited number of iterations yet. I don?t see what optimization would remove stuff from the pre loop that wouldn?t have removed the same stuff from the initial loop, had it not been cloned. I could be missing something, though. Do you see one? Predicates? Can we have checks moved from pre- but not from main- because of execution order? 
Anyway, I am not strongly against this change but you still need to add check to do_range_check(), I think, in case something went wrong. Thanks, Vladimir > > Roland. > >> >> To fix this bug, I think, we should bailout from do_range_check() when we can't find pre-loop. >> >> Typo 'cab': >> + // post loop cab be removed as well >> >> Thanks, >> Vladimir >> >> >> On 5/13/15 6:02 AM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~roland/8078866/webrev.00/ >>> >>> The assert fires when optimizing that code: >>> >>> static int remi_sumb() { >>> Integer j = Integer.valueOf(1); >>> for (int i = 0; i< 1000; i++) { >>> j = j + 1; >>> } >>> return j; >>> } >>> >>> The compiler tries to find the pre loop from the main loop but fails because the pre loop isn?t there. >>> >>> Here is the chain of events that leads to the pre loop being optimized out: >>> >>> 1- The CountedLoopNode is created >>> 2- PhiNode::Value() for the induction variable?s Phi is called and capture that 0 <= i < 1000 >>> 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive value of j >>> 4- pre-loop, main-loop, post-loop are created >>> 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: >>> >>> public static Integer valueOf(int i) { >>> if (i >= IntegerCache.low && i <= IntegerCache.high) >>> return IntegerCache.cache[i + (-IntegerCache.low)]; >>> return new Integer(i); >>> } >>> >>> 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i >>> 7- In the preloop, the range of values of i are known because of 2- so and the if of valueOf() can be optimized out >>> 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() >>> >>> SplitIf doesn?t find all opportunities for improvements before the pre/main/post loops are built. 
I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf. The valueOf?s if cannot be optimized out in the main loop because the range of values of i in the main depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don?t see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. >>> >>> Instead, I?m proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops. >>> >>> I don?t have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don?t optimize the loop out at all. >>> >>> Roland. >>> > From vladimir.kozlov at oracle.com Wed May 20 22:02:08 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 15:02:08 -0700 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test In-Reply-To: <5559EFA0.5060800@redhat.com> References: <5559EFA0.5060800@redhat.com> Message-ID: <555D0460.4080805@oracle.com> testVarClassCast tests deoptimization for javaMirror == null: void testVarClassCast(String s) { Class cl = (s == null) ? null : String.class; try { ssink = (String)cl.cast(svalue); Which is done in LibraryCallKit::inline_Class_cast() by: mirror = null_check(mirror); which has Deoptimization::Action_make_not_entrant. Unfortunately currently the test also pass because unstable_if is generated for the first line: (s == null) ? 
null : String.class; If you run the test with TraceDoptimization (or LogCompilation) you will see: Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast (@0x000000010b0670d8) thread=26883 reason=unstable_if action=reinterpret unloaded_class_index=-1 In what situation you see it fails? Please, run with TraceDoptimization. thanks, Vladimir On 5/18/15 6:56 AM, Andrew Haley wrote: > This test has been failing with a failed assertion: > > Exception in thread "main" java.lang.AssertionError: void NullCheckDroppingsTest.testVarClassCast(java.lang.String) was not deoptimized > at NullCheckDroppingsTest.checkDeoptimization(NullCheckDroppingsTest.java:331) > at NullCheckDroppingsTest.runTest(NullCheckDroppingsTest.java:309) > at NullCheckDroppingsTest.main(NullCheckDroppingsTest.java:125) > > It seems that the reason for the failure is that the nmethod for a > deoptimized method (testVarClassCast) is not removed. This is > perfectly correct because when testVarClassCast() was trapped, it was > with an action of Deoptimization::Action_maybe_recompile. When this > occurs, a method is not marked as dead and continues to be invoked, so > the test WhiteBox.isMethodCompiled() will still return true. > > Therefore, I think this code in checkDeoptimization is incorrect: > > // Check deoptimization event (intrinsic Class.cast() works). > if (WHITE_BOX.isMethodCompiled(method) == deopt) { > throw new AssertionError(method + " was" + (deopt ? " not" : "") + " deoptimized"); > } > > We either need some code in WhiteBox to check for a deoptimization > event properly or we should just remove this altogether. > > So, thoughts? Just delete the check? > > Andrew. > From rednaxelafx at gmail.com Thu May 21 00:22:40 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Wed, 20 May 2015 17:22:40 -0700 Subject: How to change compilation policy to trigger C2 compilation ASAP? 
In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: ReadyNow does save profiles to disk and read them back in subsequent runs. There are also other changes on the side in the runtime and compilers to make it work smoothly. Cc'ing Doug Hawkins who's the one working on ReadyNow if anybody's interested in more information. - Kris On Wed, May 20, 2015 at 6:40 AM, Vitaly Davidovich wrote: > Training mode is also dangerous unless built in from beginning as there's > risk some unwanted application effect will take place. This is beside the > possibility of feeding it "wrong" profile. > > What does Azul ReadyNow do? Saves profiles to disk and reads them upon > startup? > > sent from my phone > On May 20, 2015 4:58 AM, "Chris Newland" > wrote: > >> Hi Wei, >> >> Thanks! OpenJDK has no support for persisting compilation profile data >> between VM runs. For some industries it can be economical to use a >> commercial VM to get the lowest latencies. >> >> One "free" alternative is to add a training mode to your application and >> feed it the most realistic input data so that you get a good set of JIT >> optimisations before the inputs you care about arrive. >> >> Is there a reason why you are cold-starting your VM? (For example your GC >> strategy is to use a huge Eden space and reboot daily to avoid any garbage >> collections). >> >> Regards, >> >> Chris >> @chriswhocodes >> >> On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: >> > Hi Chris, >> > Your tool looks cool. I think OpenJDK has no technology to mitigate >> > cold-start costs, please correct if I am wrong. I can control some >> > compilation passes with option -XX:CompileCommand, but the profiling >> > data, such as invocation count and backedge count, has to reach some >> > threshold before trigger execution level transition. 
This causes >> > simulator doesn't trigger C2 compilation within reasonable time span. >> > >> > Regards! >> > wei >> > >> >> -----Original Message----- >> >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >> >> Sent: Wednesday, May 20, 2015 3:46 PM >> >> To: Vitaly Davidovich >> >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >> >> Subject: RE: How to change compilation policy to trigger C2 compilation >> >> ASAP? >> >> >> >> >> >> Hi Wei, >> >> >> >> >> >> Is there any reason why you need to cold-start your application each >> >> time and begin with no profiling information? >> >> >> >> Depending on your input data, would it be possible to "warm up" the VM >> >> by running typical inputs through your code to build a rich profile and >> >> have the JIT compilations performed before your real data arrives? >> >> >> >> One reason *against* doing this would be wide variations in your input >> >> data that could result in the wrong optimisations being made and >> >> possibly suffering a decompilation event. >> >> >> >> There are commercial VMs that have technology to mitigate cold-start >> >> costs (for example Azul Zing's ReadyNow). >> >> >> >> >> >> Have you examined the HotSpot LogCompilation output to make sure there >> >> is nothing you could change at source code level which would result in >> a >> >> better JIT decisions? I'm thinking of things like inlining failures due >> >> to call-site megamorphism. >> >> >> >> There's a free tool called JITWatch that aims to make it easier to >> >> understand the LogCompilation output >> >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch >> - >> >> I'm the >> >> author). >> >> >> >> Regards, >> >> >> >> >> >> Chris >> >> @chriswhocodes >> >> >> >> >> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >> >> >> >>> I'll let Vladimir comment on the compile command, but turning off >> >>> tiered doesn't prevent inlining. 
What prevents inlining (an otherwise >> >>> inlineable method), in a nutshell, is lack of profiling information >> >>> that would indicate the method or callsite is hot enough for >> >>> inlining. So you can have tiered off and run C2 with standard >> >>> compilation thresholds, and you'll likely get good inlining because >> >>> you'll have a long profile. If you turn C2 compile threshold down too >> >>> far (and too far is going to be app-specific), C2 compilation may not >> >>> have sufficient info in the shortened profile to decide to inline. Or >> >>> even worse, the profile collected is actually not reflective of the >> >>> real profile you want to capture (e.g. app has a phase change after >> >>> initialization). >> >>> >> >>> The theoretical advantage of tiered is you can get decent perf >> >>> quickly, but also since you're getting better than interpreter speed >> >>> quickly, you can run in C1 tier longer and thus collect an even >> >>> larger profile, possibly leading to better code than C2 alone. But >> >>> unfortunately this may require serious tuning. That's my >> >>> understanding at >> >> least. >> >>> >> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >> >>> wrote: >> >>> >> >>> >> >>> >> >>>> Vladimir, >> >>>> What the 'double' means in following command line? The option >> >>>> CompileThresholdScaling you mentioned is same as >> >>>> ProfileMaturityPercentage? >> >>>> I cannot find the option in code. How much effect to function >> >>>> inlining by turning off tiered compilation? >> >>>> >> >>>>> >> >>>> >> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> >> ol >> >>>> dS caling,0.5 >> >>>> >> >>>> Regards! 
>> >>>> wei >> >>>> >> >>>>> -----Original Message----- >> >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >> >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >> >>>>> To: Tangwei (Euler); Vitaly Davidovich >> >>>>> Cc: hotspot compiler >> >>>>> Subject: Re: How to change compilation policy to trigger C2 >> >>>>> compilation >> >>>> ASAP? >> >>>> >> >>>> >> >>>>> >> >>>>> If you want only C2 compilation you can disable tiered >> >>>>> compilation >> >>>>> >> >>>> (since you >> >>>> >> >>>> >> >>>>> don't want to spend to much on profiling anyway) and set low >> >>>>> CompileThreshold (you need only this one for non-tiered compile): >> >>>>> >> >>>>> >> >>>>> >> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >> >>>>> >> >>>>> >> >>>>> >> >>>>> If your method is simple and don't need profiling or inlining the >> >>>>> >> >>>>> >> >>>> generated code >> >>>>> could be same quality as with long profiling. >> >>>>> >> >>>>> An other approach is to set threshold per method (scale all >> >>>>> threasholds >> >>>> (C1, C2, >> >>>> >> >>>> >> >>>>> interpreter) by this value). For example to reduce thresholds by >> >>>>> half: >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> >> >> >>>>> oldScaling,0.5 >> >>>>> >> >>>>> Vladimir >> >>>>> >> >>>>> >> >>>>> >> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >> >>>>> >> >>>>> >> >>>>>> My goal is just to reach peak performance quickly. 
Following is >> >>>>>> one tier threshold combination I tried: >> >>>>>> >> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3InvocationThreshold=3 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3MinInvocationThreshold=2 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3CompileThreshold=2 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4InvocationThreshold=4 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4MinInvocationThreshold=3 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4CompileThreshold=2 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> Regards! >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> wei >> >>>>>> >> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >> >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >> >>>>>> *To:* Tangwei (Euler) >> >>>>>> *Cc:* hotspot compiler >> >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >> >>>>>> compilation ASAP? >> >>>>>> >> >>>>>> Is your goal specifically to have C2 compile or just to reach >> >>>>>> peak performance quickly? It sounds like the latter. What >> >>>>>> values did you specify for the tier thresholds? Also, it may >> >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >> >>>>>> show you which methods are being compiled, when, and at what >> >>>>>> tier. >> >>>>>> >> >>>>>> sent from my phone >> >>>>>> >> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" > >>>>>> > wrote: >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> Hi All, >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> I want to run a JAVA application on a performance simulator, >> >>>>>> and do a profiling on a hot function JITTed with C2 compiler. >> >>>>>> >> >>>>>> In order to make C2 compiler compile hot function as early as >> >>>>>> possible, I hope to reduce the threshold of function invocation >> >>>>>> >> >>>>>> count in interpreter and C1 to drive the JIT compiler >> >>>>>> transitioned to Level 4 (C2) ASAP. 
Following is the option list >> >>>>>> I try, but >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> failed to find a right combination to meet my requirement. >> >>>>>> Anyone >> >>>>>> can help to figure out what options I can use? >> >>>>>> >> >>>>>> Thanks in advance. >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3InvocationThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3MinInvocationThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3CompileThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4InvocationThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:CompileThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4MinInvocationThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4CompileThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> Regards! >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> wei >> >>>>>> >> >>>> >> >>> >> >> >> > >> > >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 21 00:53:32 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 20 May 2015 20:53:32 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Thanks Kris. Microsoft CLR also has something like that which they call Multicore JIT (i.e. background compilations are started on separate cores against saved profiles on startup). I wonder if anything of the sort is on the roadmap for hotspot? sent from my phone On May 20, 2015 8:22 PM, "Krystal Mok" wrote: > ReadyNow does save profiles to disk and read them back in subsequent runs. > There are also other changes on the side in the runtime and compilers to > make it work smoothly. 
> > Cc'ing Doug Hawkins who's the one working on ReadyNow if anybody's > interested in more information. > > - Kris > > On Wed, May 20, 2015 at 6:40 AM, Vitaly Davidovich > wrote: > >> Training mode is also dangerous unless built in from beginning as there's >> risk some unwanted application effect will take place. This is beside the >> possibility of feeding it "wrong" profile. >> >> What does Azul ReadyNow do? Saves profiles to disk and reads them upon >> startup? >> >> sent from my phone >> On May 20, 2015 4:58 AM, "Chris Newland" >> wrote: >> >>> Hi Wei, >>> >>> Thanks! OpenJDK has no support for persisting compilation profile data >>> between VM runs. For some industries it can be economical to use a >>> commercial VM to get the lowest latencies. >>> >>> One "free" alternative is to add a training mode to your application and >>> feed it the most realistic input data so that you get a good set of JIT >>> optimisations before the inputs you care about arrive. >>> >>> Is there a reason why you are cold-starting your VM? (For example your GC >>> strategy is to use a huge Eden space and reboot daily to avoid any >>> garbage >>> collections). >>> >>> Regards, >>> >>> Chris >>> @chriswhocodes >>> >>> On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: >>> > Hi Chris, >>> > Your tool looks cool. I think OpenJDK has no technology to mitigate >>> > cold-start costs, please correct if I am wrong. I can control some >>> > compilation passes with option -XX:CompileCommand, but the profiling >>> > data, such as invocation count and backedge count, has to reach some >>> > threshold before trigger execution level transition. This causes >>> > simulator doesn't trigger C2 compilation within reasonable time span. >>> > >>> > Regards! 
>>> > wei >>> > >>> >> -----Original Message----- >>> >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >>> >> Sent: Wednesday, May 20, 2015 3:46 PM >>> >> To: Vitaly Davidovich >>> >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >>> >> Subject: RE: How to change compilation policy to trigger C2 >>> compilation >>> >> ASAP? >>> >> >>> >> >>> >> Hi Wei, >>> >> >>> >> >>> >> Is there any reason why you need to cold-start your application each >>> >> time and begin with no profiling information? >>> >> >>> >> Depending on your input data, would it be possible to "warm up" the VM >>> >> by running typical inputs through your code to build a rich profile >>> and >>> >> have the JIT compilations performed before your real data arrives? >>> >> >>> >> One reason *against* doing this would be wide variations in your input >>> >> data that could result in the wrong optimisations being made and >>> >> possibly suffering a decompilation event. >>> >> >>> >> There are commercial VMs that have technology to mitigate cold-start >>> >> costs (for example Azul Zing's ReadyNow). >>> >> >>> >> >>> >> Have you examined the HotSpot LogCompilation output to make sure there >>> >> is nothing you could change at source code level which would result >>> in a >>> >> better JIT decisions? I'm thinking of things like inlining failures >>> due >>> >> to call-site megamorphism. >>> >> >>> >> There's a free tool called JITWatch that aims to make it easier to >>> >> understand the LogCompilation output >>> >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales >>> pitch - >>> >> I'm the >>> >> author). >>> >> >>> >> Regards, >>> >> >>> >> >>> >> Chris >>> >> @chriswhocodes >>> >> >>> >> >>> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >>> >> >>> >>> I'll let Vladimir comment on the compile command, but turning off >>> >>> tiered doesn't prevent inlining. 
What prevents inlining (an >>> otherwise >>> >>> inlineable method), in a nutshell, is lack of profiling information >>> >>> that would indicate the method or callsite is hot enough for >>> >>> inlining. So you can have tiered off and run C2 with standard >>> >>> compilation thresholds, and you'll likely get good inlining because >>> >>> you'll have a long profile. If you turn C2 compile threshold down >>> too >>> >>> far (and too far is going to be app-specific), C2 compilation may not >>> >>> have sufficient info in the shortened profile to decide to inline. >>> Or >>> >>> even worse, the profile collected is actually not reflective of the >>> >>> real profile you want to capture (e.g. app has a phase change after >>> >>> initialization). >>> >>> >>> >>> The theoretical advantage of tiered is you can get decent perf >>> >>> quickly, but also since you're getting better than interpreter speed >>> >>> quickly, you can run in C1 tier longer and thus collect an even >>> >>> larger profile, possibly leading to better code than C2 alone. But >>> >>> unfortunately this may require serious tuning. That's my >>> >>> understanding at >>> >> least. >>> >>> >>> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >>> >>> wrote: >>> >>> >>> >>> >>> >>> >>> >>>> Vladimir, >>> >>>> What the 'double' means in following command line? The option >>> >>>> CompileThresholdScaling you mentioned is same as >>> >>>> ProfileMaturityPercentage? >>> >>>> I cannot find the option in code. How much effect to function >>> >>>> inlining by turning off tiered compilation? >>> >>>> >>> >>>>> >>> >>>> >>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>> >> ol >>> >>>> dS caling,0.5 >>> >>>> >>> >>>> Regards! 
>>> >>>> wei >>> >>>> >>> >>>>> -----Original Message----- >>> >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>> >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >>> >>>>> To: Tangwei (Euler); Vitaly Davidovich >>> >>>>> Cc: hotspot compiler >>> >>>>> Subject: Re: How to change compilation policy to trigger C2 >>> >>>>> compilation >>> >>>> ASAP? >>> >>>> >>> >>>> >>> >>>>> >>> >>>>> If you want only C2 compilation you can disable tiered >>> >>>>> compilation >>> >>>>> >>> >>>> (since you >>> >>>> >>> >>>> >>> >>>>> don't want to spend to much on profiling anyway) and set low >>> >>>>> CompileThreshold (you need only this one for non-tiered compile): >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> If your method is simple and don't need profiling or inlining the >>> >>>>> >>> >>>>> >>> >>>> generated code >>> >>>>> could be same quality as with long profiling. >>> >>>>> >>> >>>>> An other approach is to set threshold per method (scale all >>> >>>>> threasholds >>> >>>> (C1, C2, >>> >>>> >>> >>>> >>> >>>>> interpreter) by this value). For example to reduce thresholds by >>> >>>>> half: >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>> >> >>> >>>>> oldScaling,0.5 >>> >>>>> >>> >>>>> Vladimir >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>> >>>>> >>> >>>>> >>> >>>>>> My goal is just to reach peak performance quickly. 
Following is >>> >>>>>> one tier threshold combination I tried: >>> >>>>>> >>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3InvocationThreshold=3 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3MinInvocationThreshold=2 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3CompileThreshold=2 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4InvocationThreshold=4 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4MinInvocationThreshold=3 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4CompileThreshold=2 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> Regards! >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> wei >>> >>>>>> >>> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>> >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>> >>>>>> *To:* Tangwei (Euler) >>> >>>>>> *Cc:* hotspot compiler >>> >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >>> >>>>>> compilation ASAP? >>> >>>>>> >>> >>>>>> Is your goal specifically to have C2 compile or just to reach >>> >>>>>> peak performance quickly? It sounds like the latter. What >>> >>>>>> values did you specify for the tier thresholds? Also, it may >>> >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >>> >>>>>> show you which methods are being compiled, when, and at what >>> >>>>>> tier. >>> >>>>>> >>> >>>>>> sent from my phone >>> >>>>>> >>> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >> >>>>>> > wrote: >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> Hi All, >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> I want to run a JAVA application on a performance simulator, >>> >>>>>> and do a profiling on a hot function JITTed with C2 compiler. >>> >>>>>> >>> >>>>>> In order to make C2 compiler compile hot function as early as >>> >>>>>> possible, I hope to reduce the threshold of function invocation >>> >>>>>> >>> >>>>>> count in interpreter and C1 to drive the JIT compiler >>> >>>>>> transitioned to Level 4 (C2) ASAP. 
Following is the option list >>> >>>>>> I try, but >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> failed to find a right combination to meet my requirement. >>> >>>>>> Anyone >>> >>>>>> can help to figure out what options I can use? >>> >>>>>> >>> >>>>>> Thanks in advance. >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3InvocationThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3MinInvocationThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3CompileThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4InvocationThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:CompileThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4MinInvocationThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4CompileThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> Regards! >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> wei >>> >>>>>> >>> >>>> >>> >>> >>> >> >>> > >>> > >>> >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Thu May 21 06:53:18 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Wed, 20 May 2015 23:53:18 -0700 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Hi Vitaly, Multicore JIT in the CLR is totally different from what we do in Azul Zing's ReadyNow. The profile that ReadyNow saves and restores is the kind of detailed profile that C2 uses directly, similar to HotSpot's MDOs. The "profile" that Multicore JIT uses is just a list of methods (which can be grouped together under user-specified group names), without any detailed profiles.
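For reference, the two recipes Vladimir gives earlier in this thread can be put on one command line. A minimal sketch follows; the flags are quoted verbatim from the thread, while `Main` is a placeholder entry class and `SomeClass.someMethod` is the thread's own illustrative method name:

```shell
# Sketch of the two approaches suggested earlier in this thread.
# "Main" and "SomeClass.someMethod" are placeholder names.

# Approach 1: turn tiered compilation off and lower the single C2
# threshold, so hot methods go from the interpreter straight to C2
# after ~100 invocations.
NON_TIERED="-XX:-TieredCompilation -XX:CompileThreshold=100"

# Approach 2: keep tiering but scale all thresholds (interpreter,
# C1, C2) for one method by 0.5 via CompileThresholdScaling.
PER_METHOD="-XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5"

# -XX:+PrintCompilation shows which methods compile, when, and at
# what tier, which helps when tuning these values.
echo "java $NON_TIERED -XX:+PrintCompilation Main"
echo "java $PER_METHOD -XX:+PrintCompilation Main"
```

With either command, -XX:+PrintCompilation should show level-4 (C2) entries appearing much earlier than with default thresholds; the exact values are workload-specific, as Vitaly notes above.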
The JIT compilers in CLR can only use detailed profiles when run in MPGO (managed profile-guided optimization) mode, which is a mode used with NGen (native image generator), an AOT compilation. I'll have to leave the HotSpot part of your question to Oracle people to answer. - Kris On Wed, May 20, 2015 at 5:53 PM, Vitaly Davidovich wrote: > Thanks Kris. Microsoft CLR also has something like that which they call > Multicore JIT (i.e. background compilations are started on separate cores > against saved profiles on startup). > > I wonder if anything of the sort is on the roadmap for hotspot? > > sent from my phone > On May 20, 2015 8:22 PM, "Krystal Mok" wrote: > >> ReadyNow does save profiles to disk and read them back in subsequent runs. >> There are also other changes on the side in the runtime and compilers to >> make it work smoothly. >> >> Cc'ing Doug Hawkins who's the one working on ReadyNow if anybody's >> interested in more information. >> >> - Kris >> >> On Wed, May 20, 2015 at 6:40 AM, Vitaly Davidovich >> wrote: >> >>> Training mode is also dangerous unless built in from beginning as >>> there's risk some unwanted application effect will take place. This is >>> beside the possibility of feeding it "wrong" profile. >>> >>> What does Azul ReadyNow do? Saves profiles to disk and reads them upon >>> startup? >>> >>> sent from my phone >>> On May 20, 2015 4:58 AM, "Chris Newland" >>> wrote: >>> >>>> Hi Wei, >>>> >>>> Thanks! OpenJDK has no support for persisting compilation profile data >>>> between VM runs. For some industries it can be economical to use a >>>> commercial VM to get the lowest latencies. >>>> >>>> One "free" alternative is to add a training mode to your application and >>>> feed it the most realistic input data so that you get a good set of JIT >>>> optimisations before the inputs you care about arrive. >>>> >>>> Is there a reason why you are cold-starting your VM? 
(For example your >>>> GC >>>> strategy is to use a huge Eden space and reboot daily to avoid any >>>> garbage >>>> collections). >>>> >>>> Regards, >>>> >>>> Chris >>>> @chriswhocodes >>>> >>>> On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: >>>> > Hi Chris, >>>> > Your tool looks cool. I think OpenJDK has no technology to mitigate >>>> > cold-start costs, please correct if I am wrong. I can control some >>>> > compilation passes with option -XX:CompileCommand, but the profiling >>>> > data, such as invocation count and backedge count, has to reach some >>>> > threshold before trigger execution level transition. This causes >>>> > simulator doesn't trigger C2 compilation within reasonable time span. >>>> > >>>> > Regards! >>>> > wei >>>> > >>>> >> -----Original Message----- >>>> >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >>>> >> Sent: Wednesday, May 20, 2015 3:46 PM >>>> >> To: Vitaly Davidovich >>>> >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >>>> >> Subject: RE: How to change compilation policy to trigger C2 >>>> compilation >>>> >> ASAP? >>>> >> >>>> >> >>>> >> Hi Wei, >>>> >> >>>> >> >>>> >> Is there any reason why you need to cold-start your application each >>>> >> time and begin with no profiling information? >>>> >> >>>> >> Depending on your input data, would it be possible to "warm up" the >>>> VM >>>> >> by running typical inputs through your code to build a rich profile >>>> and >>>> >> have the JIT compilations performed before your real data arrives? >>>> >> >>>> >> One reason *against* doing this would be wide variations in your >>>> input >>>> >> data that could result in the wrong optimisations being made and >>>> >> possibly suffering a decompilation event. >>>> >> >>>> >> There are commercial VMs that have technology to mitigate cold-start >>>> >> costs (for example Azul Zing's ReadyNow). 
>>>> >> >>>> >> >>>> >> Have you examined the HotSpot LogCompilation output to make sure >>>> there >>>> >> is nothing you could change at source code level which would result >>>> in a >>>> >> better JIT decisions? I'm thinking of things like inlining failures >>>> due >>>> >> to call-site megamorphism. >>>> >> >>>> >> There's a free tool called JITWatch that aims to make it easier to >>>> >> understand the LogCompilation output >>>> >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales >>>> pitch - >>>> >> I'm the >>>> >> author). >>>> >> >>>> >> Regards, >>>> >> >>>> >> >>>> >> Chris >>>> >> @chriswhocodes >>>> >> >>>> >> >>>> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >>>> >> >>>> >>> I'll let Vladimir comment on the compile command, but turning off >>>> >>> tiered doesn't prevent inlining. What prevents inlining (an >>>> otherwise >>>> >>> inlineable method), in a nutshell, is lack of profiling >>>> information >>>> >>> that would indicate the method or callsite is hot enough for >>>> >>> inlining. So you can have tiered off and run C2 with standard >>>> >>> compilation thresholds, and you'll likely get good inlining because >>>> >>> you'll have a long profile. If you turn C2 compile threshold down >>>> too >>>> >>> far (and too far is going to be app-specific), C2 compilation may >>>> not >>>> >>> have sufficient info in the shortened profile to decide to inline. >>>> Or >>>> >>> even worse, the profile collected is actually not reflective of the >>>> >>> real profile you want to capture (e.g. app has a phase change after >>>> >>> initialization). >>>> >>> >>>> >>> The theoretical advantage of tiered is you can get decent perf >>>> >>> quickly, but also since you're getting better than interpreter speed >>>> >>> quickly, you can run in C1 tier longer and thus collect an even >>>> >>> larger profile, possibly leading to better code than C2 alone. But >>>> >>> unfortunately this may require serious tuning. 
That's my >>>> >>> understanding at >>>> >> least. >>>> >>> >>>> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >>>> >>> wrote: >>>> >>> >>>> >>> >>>> >>> >>>> >>>> Vladimir, >>>> >>>> What the 'double' means in following command line? The option >>>> >>>> CompileThresholdScaling you mentioned is same as >>>> >>>> ProfileMaturityPercentage? >>>> >>>> I cannot find the option in code. How much effect to function >>>> >>>> inlining by turning off tiered compilation? >>>> >>>> >>>> >>>>> >>>> >>>> >>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>>> >> ol >>>> >>>> dS caling,0.5 >>>> >>>> >>>> >>>> Regards! >>>> >>>> wei >>>> >>>> >>>> >>>>> -----Original Message----- >>>> >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>>> >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >>>> >>>>> To: Tangwei (Euler); Vitaly Davidovich >>>> >>>>> Cc: hotspot compiler >>>> >>>>> Subject: Re: How to change compilation policy to trigger C2 >>>> >>>>> compilation >>>> >>>> ASAP? >>>> >>>> >>>> >>>> >>>> >>>>> >>>> >>>>> If you want only C2 compilation you can disable tiered >>>> >>>>> compilation >>>> >>>>> >>>> >>>> (since you >>>> >>>> >>>> >>>> >>>> >>>>> don't want to spend to much on profiling anyway) and set low >>>> >>>>> CompileThreshold (you need only this one for non-tiered compile): >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> If your method is simple and don't need profiling or inlining the >>>> >>>>> >>>> >>>>> >>>> >>>> generated code >>>> >>>>> could be same quality as with long profiling. >>>> >>>>> >>>> >>>>> An other approach is to set threshold per method (scale all >>>> >>>>> threasholds >>>> >>>> (C1, C2, >>>> >>>> >>>> >>>> >>>> >>>>> interpreter) by this value). 
For example to reduce thresholds by >>>> >>>>> half: >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>>> >> >>>> >>>>> oldScaling,0.5 >>>> >>>>> >>>> >>>>> Vladimir >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>>> >>>>> >>>> >>>>> >>>> >>>>>> My goal is just to reach peak performance quickly. Following is >>>> >>>>>> one tier threshold combination I tried: >>>> >>>>>> >>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3InvocationThreshold=3 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3MinInvocationThreshold=2 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3CompileThreshold=2 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4InvocationThreshold=4 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4MinInvocationThreshold=3 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4CompileThreshold=2 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> Regards! >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> wei >>>> >>>>>> >>>> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>> >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>> >>>>>> *To:* Tangwei (Euler) >>>> >>>>>> *Cc:* hotspot compiler >>>> >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>> >>>>>> compilation ASAP? >>>> >>>>>> >>>> >>>>>> Is your goal specifically to have C2 compile or just to reach >>>> >>>>>> peak performance quickly? It sounds like the latter. What >>>> >>>>>> values did you specify for the tier thresholds? Also, it may >>>> >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >>>> >>>>>> show you which methods are being compiled, when, and at what >>>> >>>>>> tier. 
>>>> >>>>>> >>>> >>>>>> sent from my phone >>>> >>>>>> >>>> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>> >>>>>> > wrote: >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> Hi All, >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> I want to run a JAVA application on a performance simulator, >>>> >>>>>> and do a profiling on a hot function JITTed with C2 compiler. >>>> >>>>>> >>>> >>>>>> In order to make C2 compiler compile hot function as early as >>>> >>>>>> possible, I hope to reduce the threshold of function invocation >>>> >>>>>> >>>> >>>>>> count in interpreter and C1 to drive the JIT compiler >>>> >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list >>>> >>>>>> I try, but >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> failed to find a right combination to meet my requirement. >>>> >>>>>> Anyone >>>> >>>>>> can help to figure out what options I can use? >>>> >>>>>> >>>> >>>>>> Thanks in advance. >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3InvocationThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3MinInvocationThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3CompileThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4InvocationThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:CompileThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4MinInvocationThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4CompileThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> Regards! >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> wei >>>> >>>>>> >>>> >>>> >>>> >>> >>>> >> >>>> > >>>> > >>>> >>>> >>>> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From yekaterina.kantserova at oracle.com Thu May 21 07:33:42 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Thu, 21 May 2015 09:33:42 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 Message-ID: <555D8A56.3050001@oracle.com> Hi, Could I please have a review of this new test. bug: https://bugs.openjdk.java.net/browse/JDK-8080828 webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. Thanks, Katja From staffan.larsen at oracle.com Thu May 21 08:13:04 2015 From: staffan.larsen at oracle.com (Staffan Larsen) Date: Thu, 21 May 2015 10:13:04 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: <555D8A56.3050001@oracle.com> References: <555D8A56.3050001@oracle.com> Message-ID: Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. > On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: > > Hi, > > Could I please have a review of this new test. > > bug: https://bugs.openjdk.java.net/browse/JDK-8080828 > webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 > > This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. > > Thanks, > Katja From staffan.larsen at oracle.com Thu May 21 08:15:29 2015 From: staffan.larsen at oracle.com (Staffan Larsen) Date: Thu, 21 May 2015 10:15:29 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: References: <555D8A56.3050001@oracle.com> Message-ID: I just realized that you also need to add a check if SA can attach on the current platform.
See the test in sa/jmap-hashcode: if (!Platform.shouldSAAttach()) { System.out.println("SA attach not expected to work - test skipped."); return; } Thanks, /Staffan > On 21 maj 2015, at 10:13, Staffan Larsen wrote: > > Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. > >> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >> >> Hi, >> >> Could I please have a review of this new test. >> >> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >> >> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. >> >> Thanks, >> Katja > From yekaterina.kantserova at oracle.com Thu May 21 08:58:10 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Thu, 21 May 2015 10:58:10 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: References: <555D8A56.3050001@oracle.com> Message-ID: <555D9E22.6090601@oracle.com> Staffan, thanks for looking at it! I'll fix the test according to your comments. // Katja On 05/21/2015 10:15 AM, Staffan Larsen wrote: > I just realize that you also need to add a check if SA can attach on the current platform. See the test in sa/jmap-hashcode: > > if (!Platform.shouldSAAttach()) { > System.out.println("SA attach not expected to work - test skipped."); > return; > } > > Thanks, > /Staffan > >> On 21 maj 2015, at 10:13, Staffan Larsen wrote: >> >> Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. >> >>> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >>> >>> Hi, >>> >>> Could I please have a review of this new test.
>>> >>> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >>> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >>> >>> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. >>> >>> Thanks, >>> Katja From andreas.eriksson at oracle.com Thu May 21 09:14:33 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Thu, 21 May 2015 11:14:33 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555CB3A1.2000308@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <555CB3A1.2000308@oracle.com> Message-ID: <555DA1F9.4020207@oracle.com> Thanks Vladimir! On 2015-05-20 18:17, Vladimir Kozlov wrote: > Looks good. I agree with explanation and fix. > > Thanks, > Vladimir > > On 5/20/15 4:04 AM, Andreas Eriksson wrote: >> Comments added, thanks. >> >> Can I have a Reviewer look at this as well? >> >> New webrev: >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ >> >> - Andreas >> >> On 2015-05-19 20:24, Berg, Michael C wrote: >>> Hi Andreas, could you add a comment to the push statement: >>> >>> // Propagate change to user >>> >>> And to Node* p declaration and init: >>> >>> // Propagate changes to uses >>> >>> So that the new code carries all similar information. >>> Else the changes seem ok. >>> >>> -Michael >>> >>> -----Original Message----- >>> From: hotspot-compiler-dev >>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >>> Andreas Eriksson >>> Sent: Tuesday, May 19, 2015 3:56 AM >>> To: hotspot-compiler-dev at openjdk.java.net >>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type >>> information >>> >>> Hi, >>> >>> Could I please get two reviews for the following bug fix. >>> >>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>> >>> The problem is with type propagation in C2. 
>>> >>> A CmpU can use type information from nodes two steps up in the graph, >>> when input 1 is an AddI or SubI node. >>> If the AddI/SubI overflows the CmpU node type computation will set up >>> two type ranges based on the inputs to the AddI/SubI instead. >>> These type ranges will then be compared to input 2 of the CmpU node to >>> see if we can make any meaningful observations on the real type of the >>> CmpU. >>> This means that the type of the AddI/SubI can be bottom, but the CmpU >>> can still use the type of the AddI/SubI inputs to get a more specific >>> type itself. >>> >>> The problem with this is that the types of the AddI/SubI inputs can be >>> updated at a later pass, but since the AddI/SubI is of type bottom the >>> propagation pass will stop there. >>> At that point the CmpU node is never made aware of the new types of >>> the AddI/SubI inputs, and it is left with a type that is based on old >>> type information. >>> >>> When this happens other optimizations using the type information can >>> go very wrong; in this particular case a branch to a range_check trap >>> that normally never happened was optimized to always be taken. >>> >>> This fix adds a check in the type propagation logic for CmpU nodes, >>> which makes sure that the CmpU node will be added to the worklist so >>> the type can be updated. >>> I've not been able to create a regression test that is reliable enough >>> to check in. 
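[Editor's note: the AddI-overflow scenario described above has a simple Java-source counterpart. The class below is not the missing regression test (the mail notes a reliable one could not be written) and all its names are invented for illustration; it only shows how a signed int addition that wraps around changes the outcome of a bounds-style comparison, which is why the compare node must reason about the AddI's two inputs separately once the sum's type collapses to bottom.]

```java
public class OverflowCompareDemo {

    // A bounds-style check of the form (i + k) < limit, computed in
    // 32-bit signed arithmetic. C2 models the addition as an AddI node
    // feeding a compare node.
    public static boolean inBounds(int i, int k, int limit) {
        return i + k < limit;
    }

    public static void main(String[] args) {
        // No overflow: the check means what it appears to mean.
        System.out.println(inBounds(10, 20, 100));               // true

        // Overflow: Integer.MAX_VALUE + 1 wraps to Integer.MIN_VALUE,
        // which is smaller than any positive limit, so the "bounds
        // check" passes for an enormous index. The compare's outcome
        // now depends on the two addends, not on their mathematical sum.
        System.out.println(inBounds(Integer.MAX_VALUE, 1, 100)); // true
    }
}
```

[When the addition may overflow, the AddI's own type is bottom, so a compare that wants a sharper answer has to consult the addends' type ranges directly — the two-steps-up dependency that the fix keeps alive on the worklist.]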
>>> >>> Thanks, >>> Andreas >> From aph at redhat.com Thu May 21 10:03:40 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 21 May 2015 11:03:40 +0100 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test In-Reply-To: <555D0460.4080805@oracle.com> References: <5559EFA0.5060800@redhat.com> <555D0460.4080805@oracle.com> Message-ID: <555DAD7C.2090805@redhat.com> On 05/20/2015 11:02 PM, Vladimir Kozlov wrote: > testVarClassCast tests deoptimization for javaMirror == null: > > void testVarClassCast(String s) { > Class cl = (s == null) ? null : String.class; > try { > ssink = (String)cl.cast(svalue); > > Which is done in LibraryCallKit::inline_Class_cast() by: > > mirror = null_check(mirror); > > which has Deoptimization::Action_make_not_entrant. > > Unfortunately currently the test also pass because unstable_if is > generated for the first line: > > (s == null) ? null : String.class; > > If you run the test with TraceDoptimization (or LogCompilation) you will > see: > > Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast > (@0x000000010b0670d8) thread=26883 reason=unstable_if action=reinterpret > unloaded_class_index=-1 Not quite the same. I get a reason=null_check: Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast (@0x000003ff8d253e54) thread=4396243677696 reason=null_check action=maybe_recompile unloaded_class_index=-1 Which comes from a SEGV here: 0x000003ff89253ca4: ldr x0, [x10,#72] ; implicit exception: dispatches to 0x000003ff89253e40 which is the line ssink = (String)cl.cast(svalue); I don't get a trap for unstable_if because there isn't one. 
I just get 0x000003ff89253c90: cbz x2, 0x000003ff89253e00 (this is java/lang/String s) --> 0x000003ff89253e00: mov x10, xzr 0x000003ff89253e04: b 0x000003ff89253ca0 --> 0x000003ff89253ca0: lsl x11, x11, #3 ;*getstatic svalue ; - NullCheckDroppingsTest::testVarClassCast at 13 (line 181) 0x000003ff89253ca4: ldr x0, [x10,#72] ; implicit exception: dispatches to 0x000003ff89253e40 ... and then the trap for the null pointer exception. Andrew. -------------- next part -------------- A non-text attachment was scrubbed... Name: deopt.log Type: text/x-log Size: 65408 bytes Desc: not available URL: From yekaterina.kantserova at oracle.com Thu May 21 10:55:06 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Thu, 21 May 2015 12:55:06 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: References: <555D8A56.3050001@oracle.com> Message-ID: <555DB98A.7050401@oracle.com> The new webrev can be found here: http://cr.openjdk.java.net/~ykantser/8080828/webrev.01 Thanks, Katja On 05/21/2015 10:15 AM, Staffan Larsen wrote: > I just realize that you also need to add a check if SA can attach on the current platform. See the test in sa/jmap-hashcode: > > if (!Platform.shouldSAAttach()) { > System.out.println("SA attach not expected to work - test skipped."); > return; > } > > Thanks, > /Staffan > >> On 21 maj 2015, at 10:13, Staffan Larsen wrote: >> >> Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. >> >>> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >>> >>> Hi, >>> >>> Could I please have a review of this new test. >>> >>> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >>> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >>> >>> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. 
>>> >>> Thanks, >>> Katja From staffan.larsen at oracle.com Thu May 21 11:04:10 2015 From: staffan.larsen at oracle.com (Staffan Larsen) Date: Thu, 21 May 2015 13:04:10 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: <555DB98A.7050401@oracle.com> References: <555D8A56.3050001@oracle.com> <555DB98A.7050401@oracle.com> Message-ID: <5C6FDB60-85C3-4130-ABF0-52368963E0DB@oracle.com> Looks good! Thanks, /Staffan > On 21 maj 2015, at 12:55, Yekaterina Kantserova wrote: > > The new webrev can be found here: http://cr.openjdk.java.net/~ykantser/8080828/webrev.01 > > Thanks, > Katja > > > > On 05/21/2015 10:15 AM, Staffan Larsen wrote: >> I just realize that you also need to add a check if SA can attach on the current platform. See the test in sa/jmap-hashcode: >> >> if (!Platform.shouldSAAttach()) { >> System.out.println("SA attach not expected to work - test skipped."); >> return; >> } >> >> Thanks, >> /Staffan >> >>> On 21 maj 2015, at 10:13, Staffan Larsen wrote: >>> >>> Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. >>> >>>> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >>>> >>>> Hi, >>>> >>>> Could I please have a review of this new test. >>>> >>>> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >>>> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >>>> >>>> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. >>>> >>>> Thanks, >>>> Katja > From vitalyd at gmail.com Thu May 21 11:11:29 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 21 May 2015 07:11:29 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? 
In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Hi Kris, MS CLR doesn't maintain the rich profile that Hotspot and Zing do, so yes, for them it's pretty much a list of methods. So I'm only mentioning them in terms of doing eager compilation at startup based on profile info, irrespective of what's actually in the profile. Thanks sent from my phone On May 21, 2015 2:53 AM, "Krystal Mok" wrote: > Hi Vitaly, > > Multicore JIT in the CLR is totally different from what we do in Azul > Zing's ReadyNow. > > The profile that ReadyNow saves and restores is the kind of detailed > profile that C2 uses directly, similar to HotSpot's MDOs. > The "profile" that Multicore JIT uses is just a list of methods (which can > be group together under user-specified group names), without any detail > profiles. > > The JIT compilers in CLR can only use detailed profiles when run in MPGO > (managed profile-guided optimization) mode, which is a mode used with NGen > (native image generator), an AOT compilation. > > I'll have to leave the HotSpot part of your question to Oracle people to > answer. > > - Kris > > On Wed, May 20, 2015 at 5:53 PM, Vitaly Davidovich > wrote: > >> Thanks Kris. Microsoft CLR also has something like that which they call >> Multicore JIT (i.e. background compilations are started on separate cores >> against saved profiles on startup). >> >> I wonder if anything of the sort is on the roadmap for hotspot? >> >> sent from my phone >> On May 20, 2015 8:22 PM, "Krystal Mok" wrote: >> >>> ReadyNow does save profiles to disk and read them back in subsequent >>> runs. >>> There are also other changes on the side in the runtime and compilers to >>> make it work smoothly. >>> >>> Cc'ing Doug Hawkins who's the one working on ReadyNow if anybody's >>> interested in more information. 
>>> >>> - Kris >>> >>> On Wed, May 20, 2015 at 6:40 AM, Vitaly Davidovich >>> wrote: >>> >>>> Training mode is also dangerous unless built in from beginning as >>>> there's risk some unwanted application effect will take place. This is >>>> beside the possibility of feeding it "wrong" profile. >>>> >>>> What does Azul ReadyNow do? Saves profiles to disk and reads them upon >>>> startup? >>>> >>>> sent from my phone >>>> On May 20, 2015 4:58 AM, "Chris Newland" >>>> wrote: >>>> >>>>> Hi Wei, >>>>> >>>>> Thanks! OpenJDK has no support for persisting compilation profile data >>>>> between VM runs. For some industries it can be economical to use a >>>>> commercial VM to get the lowest latencies. >>>>> >>>>> One "free" alternative is to add a training mode to your application >>>>> and >>>>> feed it the most realistic input data so that you get a good set of JIT >>>>> optimisations before the inputs you care about arrive. >>>>> >>>>> Is there a reason why you are cold-starting your VM? (For example your >>>>> GC >>>>> strategy is to use a huge Eden space and reboot daily to avoid any >>>>> garbage >>>>> collections). >>>>> >>>>> Regards, >>>>> >>>>> Chris >>>>> @chriswhocodes >>>>> >>>>> On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: >>>>> > Hi Chris, >>>>> > Your tool looks cool. I think OpenJDK has no technology to mitigate >>>>> > cold-start costs, please correct if I am wrong. I can control some >>>>> > compilation passes with option -XX:CompileCommand, but the profiling >>>>> > data, such as invocation count and backedge count, has to reach some >>>>> > threshold before trigger execution level transition. This causes >>>>> > simulator doesn't trigger C2 compilation within reasonable time span. >>>>> > >>>>> > Regards! 
>>>>> > wei >>>>> > >>>>> >> -----Original Message----- >>>>> >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >>>>> >> Sent: Wednesday, May 20, 2015 3:46 PM >>>>> >> To: Vitaly Davidovich >>>>> >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >>>>> >> Subject: RE: How to change compilation policy to trigger C2 >>>>> compilation >>>>> >> ASAP? >>>>> >> >>>>> >> >>>>> >> Hi Wei, >>>>> >> >>>>> >> >>>>> >> Is there any reason why you need to cold-start your application each >>>>> >> time and begin with no profiling information? >>>>> >> >>>>> >> Depending on your input data, would it be possible to "warm up" the >>>>> VM >>>>> >> by running typical inputs through your code to build a rich profile >>>>> and >>>>> >> have the JIT compilations performed before your real data arrives? >>>>> >> >>>>> >> One reason *against* doing this would be wide variations in your >>>>> input >>>>> >> data that could result in the wrong optimisations being made and >>>>> >> possibly suffering a decompilation event. >>>>> >> >>>>> >> There are commercial VMs that have technology to mitigate cold-start >>>>> >> costs (for example Azul Zing's ReadyNow). >>>>> >> >>>>> >> >>>>> >> Have you examined the HotSpot LogCompilation output to make sure >>>>> there >>>>> >> is nothing you could change at source code level which would result >>>>> in a >>>>> >> better JIT decisions? I'm thinking of things like inlining failures >>>>> due >>>>> >> to call-site megamorphism. >>>>> >> >>>>> >> There's a free tool called JITWatch that aims to make it easier to >>>>> >> understand the LogCompilation output >>>>> >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales >>>>> pitch - >>>>> >> I'm the >>>>> >> author). 
>>>>> >> >>>>> >> Regards, >>>>> >> >>>>> >> >>>>> >> Chris >>>>> >> @chriswhocodes >>>>> >> >>>>> >> >>>>> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >>>>> >> >>>>> >>> I'll let Vladimir comment on the compile command, but turning off >>>>> >>> tiered doesn't prevent inlining. What prevents inlining (an >>>>> otherwise >>>>> >>> inlineable method), in a nutshell, is lack of profiling >>>>> information >>>>> >>> that would indicate the method or callsite is hot enough for >>>>> >>> inlining. So you can have tiered off and run C2 with standard >>>>> >>> compilation thresholds, and you'll likely get good inlining because >>>>> >>> you'll have a long profile. If you turn C2 compile threshold down >>>>> too >>>>> >>> far (and too far is going to be app-specific), C2 compilation may >>>>> not >>>>> >>> have sufficient info in the shortened profile to decide to >>>>> inline. Or >>>>> >>> even worse, the profile collected is actually not reflective of the >>>>> >>> real profile you want to capture (e.g. app has a phase change after >>>>> >>> initialization). >>>>> >>> >>>>> >>> The theoretical advantage of tiered is you can get decent perf >>>>> >>> quickly, but also since you're getting better than interpreter >>>>> speed >>>>> >>> quickly, you can run in C1 tier longer and thus collect an even >>>>> >>> larger profile, possibly leading to better code than C2 alone. But >>>>> >>> unfortunately this may require serious tuning. That's my >>>>> >>> understanding at >>>>> >> least. >>>>> >>> >>>>> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >>>>> >>> wrote: >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>>> Vladimir, >>>>> >>>> What the 'double' means in following command line? The option >>>>> >>>> CompileThresholdScaling you mentioned is same as >>>>> >>>> ProfileMaturityPercentage? >>>>> >>>> I cannot find the option in code. How much effect to function >>>>> >>>> inlining by turning off tiered compilation? 
>>>>> >>>> >>>>> >>>>> >>>>> >>>> >>>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>>>> >> ol >>>>> >>>> dS caling,0.5 >>>>> >>>> >>>>> >>>> Regards! >>>>> >>>> wei >>>>> >>>> >>>>> >>>>> -----Original Message----- >>>>> >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>>>> >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >>>>> >>>>> To: Tangwei (Euler); Vitaly Davidovich >>>>> >>>>> Cc: hotspot compiler >>>>> >>>>> Subject: Re: How to change compilation policy to trigger C2 >>>>> >>>>> compilation >>>>> >>>> ASAP? >>>>> >>>> >>>>> >>>> >>>>> >>>>> >>>>> >>>>> If you want only C2 compilation you can disable tiered >>>>> >>>>> compilation >>>>> >>>>> >>>>> >>>> (since you >>>>> >>>> >>>>> >>>> >>>>> >>>>> don't want to spend to much on profiling anyway) and set low >>>>> >>>>> CompileThreshold (you need only this one for non-tiered compile): >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> If your method is simple and don't need profiling or inlining the >>>>> >>>>> >>>>> >>>>> >>>>> >>>> generated code >>>>> >>>>> could be same quality as with long profiling. >>>>> >>>>> >>>>> >>>>> An other approach is to set threshold per method (scale all >>>>> >>>>> threasholds >>>>> >>>> (C1, C2, >>>>> >>>> >>>>> >>>> >>>>> >>>>> interpreter) by this value). For example to reduce thresholds by >>>>> >>>>> half: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>>>> >> >>>>> >>>>> oldScaling,0.5 >>>>> >>>>> >>>>> >>>>> Vladimir >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> My goal is just to reach peak performance quickly. 
Following is >>>>> >>>>>> one tier threshold combination I tried: >>>>> >>>>>> >>>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3InvocationThreshold=3 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3MinInvocationThreshold=2 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3CompileThreshold=2 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4InvocationThreshold=4 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4MinInvocationThreshold=3 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4CompileThreshold=2 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> Regards! >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> wei >>>>> >>>>>> >>>>> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>>> >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>>> >>>>>> *To:* Tangwei (Euler) >>>>> >>>>>> *Cc:* hotspot compiler >>>>> >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>>> >>>>>> compilation ASAP? >>>>> >>>>>> >>>>> >>>>>> Is your goal specifically to have C2 compile or just to reach >>>>> >>>>>> peak performance quickly? It sounds like the latter. What >>>>> >>>>>> values did you specify for the tier thresholds? Also, it may >>>>> >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >>>>> >>>>>> show you which methods are being compiled, when, and at what >>>>> >>>>>> tier. >>>>> >>>>>> >>>>> >>>>>> sent from my phone >>>>> >>>>>> >>>>> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>> >>>>>> > wrote: >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> Hi All, >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> I want to run a JAVA application on a performance simulator, >>>>> >>>>>> and do a profiling on a hot function JITTed with C2 compiler. 
>>>>> >>>>>> >>>>> >>>>>> In order to make C2 compiler compile hot function as early as >>>>> >>>>>> possible, I hope to reduce the threshold of function invocation >>>>> >>>>>> >>>>> >>>>>> count in interpreter and C1 to drive the JIT compiler >>>>> >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list >>>>> >>>>>> I try, but >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> failed to find a right combination to meet my requirement. >>>>> >>>>>> Anyone >>>>> >>>>>> can help to figure out what options I can use? >>>>> >>>>>> >>>>> >>>>>> Thanks in advance. >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3InvocationThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3MinInvocationThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3CompileThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4InvocationThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:CompileThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4MinInvocationThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4CompileThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> Regards! >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> wei >>>>> >>>>>> >>>>> >>>> >>>>> >>> >>>>> >> >>>>> > >>>>> > >>>>> >>>>> >>>>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From roland.westrelin at oracle.com Thu May 21 11:55:43 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 13:55:43 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555CB274.4010409@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> <555CB274.4010409@oracle.com> Message-ID: <27CDBC05-3F22-46F2-871A-7786F19AE8B3@oracle.com> Vladimir, Thanks for spotting the bad comment and thanks for re-reviewing. Roland. > On May 20, 2015, at 6:12 PM, Vladimir Kozlov wrote: > > Looks good. But fix second comment in memnode.hpp - it has '???' instead of '. > > Thanks, > Vladimir > > On 5/20/15 7:19 AM, Roland Westrelin wrote: >> Thanks Andrew and Vladimir. Here is a new webrev with an enum: >> >> http://cr.openjdk.java.net/~roland/8077504/webrev.02/ >> >> Roland. >> From roland.westrelin at oracle.com Thu May 21 12:09:25 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 14:09:25 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555CB33F.6030506@redhat.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> <555CB274.4010409@oracle.com> <555CB33F.6030506@redhat.com> Message-ID: <8D2CAEE0-573C-4AC8-8A60-B522610AC4B0@oracle.com> Andrew, >> Looks good. 
But fix second comment in memnode.hpp - it has '???' instead >> of '. > > Yes, looks good (but, of course, I'm not a reviewer). Thanks for looking at it. Roland. From roland.westrelin at oracle.com Thu May 21 12:55:40 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 14:55:40 +0200 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <555CDD4C.5030707@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> <555CDD4C.5030707@oracle.com> Message-ID: <73E7F064-372B-4303-8FA4-9CB6CF1239D4@oracle.com> >>> Should it also return false if in(ArrayCopyNode::Src)->is_top()? >> >> That doesn't feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn't modify. > > Okay, returning true in such case would be conservative. > > Should we check dest input for top in CallLeafNode::may_modify()? Do we have other similar places? You're right checking for top dest in CallLeafNode::may_modify() is safer. I don't see other similar issues elsewhere in the array copy code. Roland. > > Thanks, > Vladimir > >> Also, I don't think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get another chance to optimize it better. >> >> Roland. >> >> >> >>> >>> thanks, >>> Vladimir >>> >>> On 5/20/15 7:03 AM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ >>>> >>>> When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It's not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs.
That's what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. >>>> >>>> Roland. >>>> >> From roland.westrelin at oracle.com Thu May 21 13:25:29 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 15:25:29 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: <555CE2B2.8020305@oracle.com> References: <5553C1E2.3040003@oracle.com> <555CE2B2.8020305@oracle.com> Message-ID: <5B87D4E4-FA26-411F-BD03-9F9CAA1E0CBF@oracle.com> >>> I think eliminating main- and post- is good investigation for separate RFE. And we need to make sure it is safe. >> >> My reasoning is that if the pre loop's limit is still an opaque node, then the compiler has not been able to optimize the pre loop based on its limited number of iterations yet. I don't see what optimization would remove stuff from the pre loop that wouldn't have removed the same stuff from the initial loop, had it not been cloned. I could be missing something, though. Do you see one? > Predicates? Can we have checks moved from pre- but not from main- because of execution order? It seems we don't do predication with pre/main loops. In PhaseIdealLoop::loop_predication_impl(): if (head->is_valid_counted_loop()) { cl = head->as_CountedLoop(); // do nothing for iteration-splitted loops if (!cl->is_normal_loop()) return false; > Anyway, I am not strongly against this change but you still need to add check to do_range_check(), I think, in case something went wrong. You not being against it is not enough for me to push that change? What other investigation do you think is needed? Roland. > > Thanks, > Vladimir > >> >> Roland. >> >>> >>> To fix this bug, I think, we should bailout from do_range_check() when we can't find pre-loop.
>>> >>> Typo 'cab': >>> + // post loop cab be removed as well >>> >>> Thanks, >>> Vladimir >>> >>> >>> On 5/13/15 6:02 AM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~roland/8078866/webrev.00/ >>>> >>>> The assert fires when optimizing that code: >>>> >>>> static int remi_sumb() { >>>> Integer j = Integer.valueOf(1); >>>> for (int i = 0; i < 1000; i++) { >>>> j = j + 1; >>>> } >>>> return j; >>>> } >>>> >>>> The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there. >>>> >>>> Here is the chain of events that leads to the pre loop being optimized out: >>>> >>>> 1- The CountedLoopNode is created >>>> 2- PhiNode::Value() for the induction variable's Phi is called and capture that 0 <= i < 1000 >>>> 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive value of j >>>> 4- pre-loop, main-loop, post-loop are created >>>> 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: >>>> >>>> public static Integer valueOf(int i) { >>>> if (i >= IntegerCache.low && i <= IntegerCache.high) >>>> return IntegerCache.cache[i + (-IntegerCache.low)]; >>>> return new Integer(i); >>>> } >>>> >>>> 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i >>>> 7- In the preloop, the range of values of i are known because of 2- so the if of valueOf() can be optimized out >>>> 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() >>>> >>>> SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf.
The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. >>>> >>>> Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops. >>>> >>>> I don't have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don't optimize the loop out at all. >>>> >>>> Roland. From roland.westrelin at oracle.com Thu May 21 13:39:24 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 15:39:24 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555C6A2B.7090804@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> Message-ID: <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> > http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ It looks good to me. 1607 if( p->bottom_type() != type(p) ) // If not already bottomed out 1608 worklist.push(p); // Propagate change to user should be: if (p->bottom_type() != type(p)) { // If not already bottomed out worklist.push(p); // Propagate change to user } to follow the hotspot style. https://wikis.oracle.com/display/HotSpotInternals/StyleGuide I can sponsor it if you need. Roland.
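[Editor's note: the snippet under review above is the actual one-line fix — push a CmpU user onto the worklist even when its AddI input stays at bottom. The toy below is a deliberately simplified, hypothetical model, not HotSpot code (C2's real type lattice and CmpU::Value logic are far richer, and all names here are invented): a "compare" that peeks through a bottomed-out "add" at the add's inputs never hears about those inputs narrowing unless it is pushed explicitly.]

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TypePropDemo {
    static final int BOTTOM = Integer.MIN_VALUE; // sentinel: no known bound

    static class Node {
        final List<Node> in = new ArrayList<>();
        final List<Node> uses = new ArrayList<>();
        int hi; // toy "type": an upper bound on the node's value, or BOTTOM

        Node(int hi) { this.hi = hi; }
        void addInput(Node n) { in.add(n); n.uses.add(this); }
        // Leaves (constants, parameters) keep whatever bound they were given.
        int compute() { return hi; }
    }

    // Add: bound is the sum of the input bounds, or BOTTOM when the sum
    // overflows 32 bits (only the positive-overflow case matters in this toy).
    static class Add extends Node {
        Add(Node a, Node b) { super(BOTTOM); addInput(a); addInput(b); }
        @Override int compute() {
            if (in.get(0).hi == BOTTOM || in.get(1).hi == BOTTOM) return BOTTOM;
            long s = (long) in.get(0).hi + in.get(1).hi;
            return (s > Integer.MAX_VALUE) ? BOTTOM : (int) s;
        }
    }

    // Cmp: when its Add input has bottomed out, it peeks *through* the Add
    // at the Add's first input — a dependency on a node two steps up.
    static class Cmp extends Node {
        Cmp(Node x) { super(BOTTOM); addInput(x); }
        @Override int compute() {
            Node x = in.get(0);
            if (x.hi != BOTTOM) return x.hi;
            return (x instanceof Add) ? x.in.get(0).hi : BOTTOM;
        }
    }

    // Recompute nodes; push users only when a node's own type changed —
    // unless 'fixed' is set, which also revisits Cmp users of an unchanged
    // node, mirroring the one-line fix discussed above.
    static void propagate(Deque<Node> work, boolean fixed) {
        while (!work.isEmpty()) {
            Node n = work.pop();
            int t = n.compute();
            boolean changed = (t != n.hi);
            n.hi = t;
            for (Node u : n.uses)
                if (changed || (fixed && u instanceof Cmp)) work.push(u);
        }
    }

    // Narrow one addend after the initial pass and return the Cmp's final bound.
    public static int run(boolean fixed) {
        Node a = new Node(Integer.MAX_VALUE), b = new Node(Integer.MAX_VALUE);
        Add add = new Add(a, b);
        Cmp cmp = new Cmp(add);
        Deque<Node> work = new ArrayDeque<>();
        work.push(cmp);
        work.push(add);
        propagate(work, fixed);   // add overflows -> BOTTOM; cmp peeks at a: MAX_VALUE

        a.hi = 10;                // a later pass narrows a's bound...
        work.push(add);           // ...and revisits only a's direct user
        propagate(work, fixed);
        return cmp.hi;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + run(false)); // stale 2147483647
        System.out.println("with fix:    " + run(true));  // narrowed to 10
    }
}
```

[The real CmpU::Value builds two type ranges from both AddI inputs rather than peeking at one; the toy keeps only the structural point that an unchanged bottom node must still forward type updates to such users.]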
> > - Andreas > > On 2015-05-19 20:24, Berg, Michael C wrote: >> Hi Andreas, could you add a comment to the push statement: >> >> // Propagate change to user >> >> And to Node* p declaration and init: >> >> // Propagate changes to uses >> >> So that the new code carries all similar information. >> Else the changes seem ok. >> >> -Michael >> >> -----Original Message----- >> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >> Sent: Tuesday, May 19, 2015 3:56 AM >> To: hotspot-compiler-dev at openjdk.java.net >> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >> >> Hi, >> >> Could I please get two reviews for the following bug fix. >> >> https://bugs.openjdk.java.net/browse/JDK-8060036 >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >> >> The problem is with type propagation in C2. >> >> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >> >> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. 
>> >> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >> >> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >> I've not been able to create a regression test that is reliable enough to check in. >> >> Thanks, >> Andreas > From andreas.eriksson at oracle.com Thu May 21 14:04:58 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Thu, 21 May 2015 16:04:58 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> Message-ID: <555DE60A.9080003@oracle.com> On 2015-05-21 15:39, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ > It looks good to me. > > 1607 if( p->bottom_type() != type(p) ) // If not already bottomed out > 1608 worklist.push(p); // Propagate change to user > > > should be: > > if (p->bottom_type() != type(p)) { // If not already bottomed out > worklist.push(p); // Propagate change to user > } > > to follow the hotspot style. > https://wikis.oracle.com/display/HotSpotInternals/StyleGuide I did it that way because all the other if (...) worklist.push(p) in that function omit the curly braces. Should I have them in this change anyway? > > I can sponsor it if you need. That would be very helpful, thanks. - Andreas > > Roland. > >> - Andreas >> >> On 2015-05-19 20:24, Berg, Michael C wrote: >>> Hi Andreas, could you add a comment to the push statement: >>> >>> // Propagate change to user >>> >>> And to Node* p declaration and init: >>> >>> // Propagate changes to uses >>> >>> So that the new code carries all similar information. 
>>> Else the changes seem ok. >>> >>> -Michael >>> >>> -----Original Message----- >>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >>> Sent: Tuesday, May 19, 2015 3:56 AM >>> To: hotspot-compiler-dev at openjdk.java.net >>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >>> >>> Hi, >>> >>> Could I please get two reviews for the following bug fix. >>> >>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>> >>> The problem is with type propagation in C2. >>> >>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >>> >>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>> >>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>> >>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >>> I've not been able to create a regression test that is reliable enough to check in. 
>>> >>> Thanks, >>> Andreas From roland.westrelin at oracle.com Thu May 21 14:11:12 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 16:11:12 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555DE60A.9080003@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> <555DE60A.9080003@oracle.com> Message-ID: > I did it that way because all the other > > if (...) > worklist.push(p) > > in that function omit the curly braces. > Should I have them in this change anyway? Some c2 code doesn?t (yet) follow the current style. The usual practice is to: 1) use the current style for new code 2) fix the style around what you?ve modified >> I can sponsor it if you need. > > That would be very helpful, thanks. Can you send me a changeset (hg export) with the properly formatted mercurial comment? http://openjdk.java.net/guide/producingChangeset.html section "Formatting a Changeset Comment? Can you also make sure you?ve run jcheck. Thanks, Roland. > > - Andreas > >> >> Roland. >> >>> - Andreas >>> >>> On 2015-05-19 20:24, Berg, Michael C wrote: >>>> Hi Andreas, could you add a comment to the push statement: >>>> >>>> // Propagate change to user >>>> >>>> And to Node* p declaration and init: >>>> >>>> // Propagate changes to uses >>>> >>>> So that the new code carries all similar information. >>>> Else the changes seem ok. >>>> >>>> -Michael >>>> >>>> -----Original Message----- >>>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >>>> Sent: Tuesday, May 19, 2015 3:56 AM >>>> To: hotspot-compiler-dev at openjdk.java.net >>>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >>>> >>>> Hi, >>>> >>>> Could I please get two reviews for the following bug fix. 
>>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>>> >>>> The problem is with type propagation in C2. >>>> >>>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >>>> >>>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>>> >>>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>>> >>>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >>>> I've not been able to create a regression test that is reliable enough to check in. >>>> >>>> Thanks, >>>> Andreas > From benedikt.wedenik at theobroma-systems.com Thu May 21 14:24:23 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Thu, 21 May 2015 16:24:23 +0200 Subject: Disassembly from c1 or c2 Message-ID: <57E1D081-D9F5-429E-A327-2002A7B777D0@theobroma-systems.com> Hi! 
I?m trying to figure out how to determine whether the generated disassembly (-XX:CompileCommand='print,*Foo.bar') is emitted by c1 or by c2. Is there any easy way like a flag? If not, does anyone know which ?print? in which file to enhance to get the needed information? If there is the possibility to disable the entire output of c1, this would help a lot too :) Thanks in advance, Benedikt W., Theobroma-systems.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From aph at redhat.com Thu May 21 14:26:24 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 21 May 2015 15:26:24 +0100 Subject: Disassembly from c1 or c2 In-Reply-To: <57E1D081-D9F5-429E-A327-2002A7B777D0@theobroma-systems.com> References: <57E1D081-D9F5-429E-A327-2002A7B777D0@theobroma-systems.com> Message-ID: <555DEB10.2000805@redhat.com> On 05/21/2015 03:24 PM, Benedikt Wedenik wrote: > Hi! > > I?m trying to figure out how to determine whether the generated disassembly > (-XX:CompileCommand='print,*Foo.bar') is emitted by c1 or by c2. Yes. Here it is: Loaded disassembler from hsdis-aarch64.so Compiled method (c2) You'll want to build a debug VM for full information. Andrew. From benedikt.wedenik at theobroma-systems.com Thu May 21 14:33:39 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Thu, 21 May 2015 16:33:39 +0200 Subject: Disassembly from c1 or c2 In-Reply-To: <555DEB10.2000805@redhat.com> References: <57E1D081-D9F5-429E-A327-2002A7B777D0@theobroma-systems.com> <555DEB10.2000805@redhat.com> Message-ID: <25DCD7E3-F429-483D-A6F4-8178FD91DF35@theobroma-systems.com> Thank you for the quick response! Benedikt On 21 May 2015, at 16:26, Andrew Haley wrote: > On 05/21/2015 03:24 PM, Benedikt Wedenik wrote: >> Hi! >> >> I?m trying to figure out how to determine whether the generated disassembly >> (-XX:CompileCommand='print,*Foo.bar') is emitted by c1 or by c2. > > Yes. 
Here it is: > > Loaded disassembler from hsdis-aarch64.so > Compiled method (c2) > > You'll want to build a debug VM for full information. > > Andrew. > From andreas.eriksson at oracle.com Thu May 21 15:00:57 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Thu, 21 May 2015 17:00:57 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> <555DE60A.9080003@oracle.com> Message-ID: <555DF329.4090707@oracle.com> On 2015-05-21 16:11, Roland Westrelin wrote: >> I did it that way because all the other >> >> if (...) >> worklist.push(p) >> >> in that function omit the curly braces. >> Should I have them in this change anyway? > Some c2 code doesn?t (yet) follow the current style. The usual practice is to: > > 1) use the current style for new code > 2) fix the style around what you?ve modified > >>> I can sponsor it if you need. >> That would be very helpful, thanks. > Can you send me a changeset (hg export) with the properly formatted mercurial comment? > http://openjdk.java.net/guide/producingChangeset.html > section "Formatting a Changeset Comment? > Can you also make sure you?ve run jcheck. Attached an export, passes jcheck. I changed the surrounding code to add curly braces and remove extra whitespace. (I don't need an extra round of reviews for that, right?) - Andreas > > Thanks, > Roland. > >> - Andreas >> >>> Roland. >>> >>>> - Andreas >>>> >>>> On 2015-05-19 20:24, Berg, Michael C wrote: >>>>> Hi Andreas, could you add a comment to the push statement: >>>>> >>>>> // Propagate change to user >>>>> >>>>> And to Node* p declaration and init: >>>>> >>>>> // Propagate changes to uses >>>>> >>>>> So that the new code carries all similar information. >>>>> Else the changes seem ok. 
>>>>> >>>>> -Michael >>>>> >>>>> -----Original Message----- >>>>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >>>>> Sent: Tuesday, May 19, 2015 3:56 AM >>>>> To: hotspot-compiler-dev at openjdk.java.net >>>>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >>>>> >>>>> Hi, >>>>> >>>>> Could I please get two reviews for the following bug fix. >>>>> >>>>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>>>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>>>> >>>>> The problem is with type propagation in C2. >>>>> >>>>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>>>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>>>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>>>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >>>>> >>>>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>>>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>>>> >>>>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>>>> >>>>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. 
>>>>> I've not been able to create a regression test that is reliable enough to check in. >>>>> >>>>> Thanks, >>>>> Andreas -------------- next part -------------- # HG changeset patch # User aeriksso # Date 1432219751 -7200 # Thu May 21 16:49:11 2015 +0200 # Node ID 10cc51f8aaa98fa79da3941cf4fb3a6fd3056d75 # Parent 67729f5f33c4a798af42d0e2bcbce838834695f9 8060036: C2: CmpU nodes can end up with wrong type information Reviewed-by: mcberg, kvn, roland diff -r 67729f5f33c4 -r 10cc51f8aaa9 src/share/vm/opto/phaseX.cpp --- a/src/share/vm/opto/phaseX.cpp +++ b/src/share/vm/opto/phaseX.cpp @@ -1576,8 +1576,9 @@ if( m->is_Region() ) { // New path to Region? Must recheck Phis too for (DUIterator_Fast i2max, i2 = m->fast_outs(i2max); i2 < i2max; i2++) { Node* p = m->fast_out(i2); // Propagate changes to uses - if( p->bottom_type() != type(p) ) // If not already bottomed out + if(p->bottom_type() != type(p)) { // If not already bottomed out worklist.push(p); // Propagate change to user + } } } // If we changed the receiver type to a call, we need to revisit @@ -1587,12 +1588,30 @@ if (m->is_Call()) { for (DUIterator_Fast i2max, i2 = m->fast_outs(i2max); i2 < i2max; i2++) { Node* p = m->fast_out(i2); // Propagate changes to uses - if (p->is_Proj() && p->as_Proj()->_con == TypeFunc::Control && p->outcnt() == 1) + if (p->is_Proj() && p->as_Proj()->_con == TypeFunc::Control && p->outcnt() == 1) { worklist.push(p->unique_out()); + } } } if( m->bottom_type() != type(m) ) // If not already bottomed out worklist.push(m); // Propagate change to user + + // CmpU nodes can get their type information from two nodes up in the + // graph (instead of from the nodes immediately above). Make sure they + // are added to the worklist if nodes they depend on are updated, since + // they could be missed and get wrong types otherwise. 
+ uint m_op = m->Opcode(); + if (m_op == Op_AddI || m_op == Op_SubI) { + for (DUIterator_Fast i2max, i2 = m->fast_outs(i2max); i2 < i2max; i2++) { + Node* p = m->fast_out(i2); // Propagate changes to uses + if (p->Opcode() == Op_CmpU) { + // Got a CmpU which might need the new type information from node n. + if(p->bottom_type() != type(p)) { // If not already bottomed out + worklist.push(p); // Propagate change to user + } + } + } + } } } } From rickard.backman at oracle.com Thu May 21 15:11:53 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Thu, 21 May 2015 17:11:53 +0200 Subject: RFR(S) 8080692: lots of jstack tests failing in pit Message-ID: <20150521151153.GA2285@rbackman> Hi, can I please have reviews for this change? The base used to calculate the address for getting Pairs and Data was wrong in SA. Also fixes issues with getting the count field. Added more verification in oopMap.cpp to check that offsets and sizes are correct. Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 tmtools jstack tests & jprt. Thanks /R -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From vladimir.x.ivanov at oracle.com Thu May 21 15:20:49 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 21 May 2015 18:20:49 +0300 Subject: RFR(S) 8080692: lots of jstack tests failing in pit In-Reply-To: <20150521151153.GA2285@rbackman> References: <20150521151153.GA2285@rbackman> Message-ID: <555DF7D1.5050206@oracle.com> Looks good. Best regards, Vladimir Ivanov On 5/21/15 6:11 PM, Rickard B?ckman wrote: > Hi, > > can I please have reviews for this change? > The base used to calculate the address for getting Pairs and Data was wrong in SA. > Also fixes issues with getting the count field. 
Added more verification > in oopMap.cpp to check that offsets and sizes are correct. > > Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 > > tmtools jstack tests & jprt. > > Thanks > /R > From vladimir.kozlov at oracle.com Thu May 21 15:54:10 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 21 May 2015 08:54:10 -0700 Subject: RFR(S) 8080692: lots of jstack tests failing in pit In-Reply-To: <20150521151153.GA2285@rbackman> References: <20150521151153.GA2285@rbackman> Message-ID: <555DFFA2.8090303@oracle.com> Looks good. Did you run tmtools tests with -Xcomp? Thanks, Vladimir On 5/21/15 8:11 AM, Rickard B?ckman wrote: > Hi, > > can I please have reviews for this change? > The base used to calculate the address for getting Pairs and Data was wrong in SA. > Also fixes issues with getting the count field. Added more verification > in oopMap.cpp to check that offsets and sizes are correct. > > Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 > > tmtools jstack tests & jprt. > > Thanks > /R > From rickard.backman at oracle.com Thu May 21 16:06:00 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Thu, 21 May 2015 18:06:00 +0200 Subject: RFR(S) 8080692: lots of jstack tests failing in pit In-Reply-To: <555DFFA2.8090303@oracle.com> References: <20150521151153.GA2285@rbackman> <555DFFA2.8090303@oracle.com> Message-ID: <20150521160600.GB2285@rbackman> On 05/21, Vladimir Kozlov wrote: > Looks good. Did you run tmtools tests with -Xcomp? > I hope. They failed with the errors before my fixes. Command line: $COMMON/ute/bin/ute -jdk /home/rbackman/code/oopmap4/build/fastdebug/images/jdk -component vm -testbase $COMMON/vm -stablejavahome /home/rbackman/opt/8u20 -vmflavor Xcomp -dkfl empty -test tmtools/jstack/ Thanks for the review. 
/R > Thanks, > Vladimir > > On 5/21/15 8:11 AM, Rickard B?ckman wrote: > >Hi, > > > >can I please have reviews for this change? > >The base used to calculate the address for getting Pairs and Data was wrong in SA. > >Also fixes issues with getting the count field. Added more verification > >in oopMap.cpp to check that offsets and sizes are correct. > > > >Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ > >Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 > > > >tmtools jstack tests & jprt. > > > >Thanks > >/R > > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From rickard.backman at oracle.com Thu May 21 16:06:22 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Thu, 21 May 2015 18:06:22 +0200 Subject: RFR(S) 8080692: lots of jstack tests failing in pit In-Reply-To: <555DF7D1.5050206@oracle.com> References: <20150521151153.GA2285@rbackman> <555DF7D1.5050206@oracle.com> Message-ID: <20150521160622.GC2285@rbackman> Thanks for the review! /R On 05/21, Vladimir Ivanov wrote: > Looks good. > > Best regards, > Vladimir Ivanov > > On 5/21/15 6:11 PM, Rickard B?ckman wrote: > >Hi, > > > >can I please have reviews for this change? > >The base used to calculate the address for getting Pairs and Data was wrong in SA. > >Also fixes issues with getting the count field. Added more verification > >in oopMap.cpp to check that offsets and sizes are correct. > > > >Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ > >Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 > > > >tmtools jstack tests & jprt. > > > >Thanks > >/R > > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From roland.westrelin at oracle.com Thu May 21 16:07:31 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 18:07:31 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555DF329.4090707@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> <555DE60A.9080003@oracle.com> <555DF329.4090707@oracle.com> Message-ID: <45C554FC-C0CE-4353-A210-4431C1FD29CB@oracle.com> > Attached an export, passes jcheck. > I changed the surrounding code to add curly braces and remove extra whitespace. > (I don't need an extra round of reviews for that, right?) You don?t. Thanks for fixing the formatting. I?m pushing it. Roland. > > - Andreas > >> >> Thanks, >> Roland. >> >>> - Andreas >>> >>>> Roland. >>>> >>>>> - Andreas >>>>> >>>>> On 2015-05-19 20:24, Berg, Michael C wrote: >>>>>> Hi Andreas, could you add a comment to the push statement: >>>>>> >>>>>> // Propagate change to user >>>>>> >>>>>> And to Node* p declaration and init: >>>>>> >>>>>> // Propagate changes to uses >>>>>> >>>>>> So that the new code carries all similar information. >>>>>> Else the changes seem ok. >>>>>> >>>>>> -Michael >>>>>> >>>>>> -----Original Message----- >>>>>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >>>>>> Sent: Tuesday, May 19, 2015 3:56 AM >>>>>> To: hotspot-compiler-dev at openjdk.java.net >>>>>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >>>>>> >>>>>> Hi, >>>>>> >>>>>> Could I please get two reviews for the following bug fix. >>>>>> >>>>>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>>>>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>>>>> >>>>>> The problem is with type propagation in C2. 
>>>>>> >>>>>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>>>>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>>>>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>>>>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >>>>>> >>>>>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>>>>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>>>>> >>>>>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>>>>> >>>>>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >>>>>> I've not been able to create a regression test that is reliable enough to check in. >>>>>> >>>>>> Thanks, >>>>>> Andreas > > <8060036.export> From tobias.hartmann at oracle.com Thu May 21 17:28:50 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 21 May 2015 19:28:50 +0200 Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555B16D4.80008@oracle.com> References: <555B16D4.80008@oracle.com> Message-ID: <555E15D2.4070700@oracle.com> Hi Andreas, I was working on JDK-8080156 [1] and just noticed that this is the same issue as JDK-8060036. 
The bug contains a small reproducer (see Ivan Gerasimov's comment), in case you want to add this as a regression test to your fix. Best, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8080156 On 19.05.2015 12:56, Andreas Eriksson wrote: > Hi, > > Could I please get two reviews for the following bug fix. > > https://bugs.openjdk.java.net/browse/JDK-8060036 > http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ > > The problem is with type propagation in C2. > > A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. > If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. > These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. > This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. > > The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. > At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. > > When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. > > This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. > I've not been able to create a regression test that is reliable enough to check in. 
> > Thanks, > Andreas From vladimir.kozlov at oracle.com Thu May 21 17:51:11 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 21 May 2015 10:51:11 -0700 Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555E15D2.4070700@oracle.com> References: <555B16D4.80008@oracle.com> <555E15D2.4070700@oracle.com> Message-ID: <555E1B0F.9050802@oracle.com> Since this fix 8060036 is already in JPRT queue lets use 8080156 to add the regression test. Thanks, Vladimir On 5/21/15 10:28 AM, Tobias Hartmann wrote: > Hi Andreas, > > I was working on JDK-8080156 [1] and just noticed that this is the same issue as JDK-8060036. The bug contains a small reproducer (see Ivan Gerasimov's comment), in case you want to add this as a regression test to your fix. > > Best, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8080156 > > On 19.05.2015 12:56, Andreas Eriksson wrote: >> Hi, >> >> Could I please get two reviews for the following bug fix. >> >> https://bugs.openjdk.java.net/browse/JDK-8060036 >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >> >> The problem is with type propagation in C2. >> >> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >> >> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. 
>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >> >> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >> >> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >> I've not been able to create a regression test that is reliable enough to check in. >> >> Thanks, >> Andreas From tobias.hartmann at oracle.com Thu May 21 18:51:46 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 21 May 2015 20:51:46 +0200 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE Message-ID: <555E2942.7060404@oracle.com> Hi, this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. https://bugs.openjdk.java.net/browse/JDK-8080156 http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. 
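As a rough illustration of that failure mode, a loop of the following shape shows the kind of pattern involved. This is a hypothetical sketch, not the actual regression test from the webrev; the class and method names here are invented for illustration.

```java
// Hypothetical sketch, not the actual regression test from the webrev:
// Integer.toString(int) is driven in a hot loop so that the JIT compiles it.
// On a VM with the CmpU type-propagation bug, the compiled code could take
// the always-taken range-check trap path and surface a NullPointerException;
// on a fixed VM, every conversion round-trips.
public class ToStringSketch {
    static String convert(int i) {
        return Integer.toString(i);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            int v = i % 1000;
            String s = convert(v);   // the NPE escaped from here on a broken VM
            if (Integer.parseInt(s) != v) {
                throw new AssertionError("bad conversion at i=" + i);
            }
        }
        System.out.println("ok");
    }
}
```

Enough iterations are needed for the method to get JIT-compiled before the check can fire; the loop bound above is an arbitrary illustrative choice.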
Thanks, Tobias [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html From tobias.hartmann at oracle.com Thu May 21 18:52:15 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 21 May 2015 20:52:15 +0200 Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555E1B0F.9050802@oracle.com> References: <555B16D4.80008@oracle.com> <555E15D2.4070700@oracle.com> <555E1B0F.9050802@oracle.com> Message-ID: <555E295F.3040204@oracle.com> On 21.05.2015 19:51, Vladimir Kozlov wrote: > Since this fix 8060036 is already in JPRT queue lets use 8080156 to add the regression test. Okay, I sent a new RFR with the regression test. Thanks, Tobias > > Thanks, > Vladimir > > On 5/21/15 10:28 AM, Tobias Hartmann wrote: >> Hi Andreas, >> >> I was working on JDK-8080156 [1] and just noticed that this is the same issue as JDK-8060036. The bug contains a small reproducer (see Ivan Gerasimov's comment), in case you want to add this as a regression test to your fix. >> >> Best, >> Tobias >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8080156 >> >> On 19.05.2015 12:56, Andreas Eriksson wrote: >>> Hi, >>> >>> Could I please get two reviews for the following bug fix. >>> >>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>> >>> The problem is with type propagation in C2. >>> >>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. 
>>> >>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>> >>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>> >>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >>> I've not been able to create a regression test that is reliable enough to check in. >>> >>> Thanks, >>> Andreas From vladimir.kozlov at oracle.com Thu May 21 19:06:15 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 21 May 2015 12:06:15 -0700 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE In-Reply-To: <555E2942.7060404@oracle.com> References: <555E2942.7060404@oracle.com> Message-ID: <555E2CA7.2040904@oracle.com> Looks good. Thanks, Vladimir On 5/21/15 11:51 AM, Tobias Hartmann wrote: > Hi, > > this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. > > https://bugs.openjdk.java.net/browse/JDK-8080156 > http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ > > The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. 
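To make the failure mode described above concrete, here is a minimal, hypothetical sketch (my own illustration, not the actual TestTypePropagationToCmpU regression test) of the kind of Java source whose bounds check C2 lowers to an unsigned compare, CmpU(AddI(i, k), a.length). When i + k can overflow int, the CmpU type is derived from the AddI inputs, which is where the stale type information discussed in this thread came from. This sketch runs cleanly on a correct VM and does not by itself reproduce the bug:

```java
// Hypothetical sketch, not the actual regression test: the shape of a
// range check that C2 compiles as CmpU(AddI(i, k), a.length). When the
// AddI may overflow, the CmpU type is computed from the AddI *inputs*
// rather than from the AddI node itself.
public class CmpUShape {
    static int sum;

    static void touch(int[] a, int i, int k) {
        int idx = i + k;                    // AddI that may overflow int
        if (idx >= 0 && idx < a.length) {   // lowered to an unsigned compare (CmpU)
            sum += a[idx];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[16];
        a[7] = 1;
        for (int i = 0; i < 20_000; i++) {  // enough iterations to reach C2
            touch(a, i % 8, 4);             // idx == 7 on every 8th call
        }
        System.out.println(sum);            // 20_000 / 8 hits of a[7] -> 2500
    }
}
```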
> > Thanks, > Tobias > > [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html > From tobias.hartmann at oracle.com Thu May 21 19:12:57 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 21 May 2015 21:12:57 +0200 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE In-Reply-To: <555E2CA7.2040904@oracle.com> References: <555E2942.7060404@oracle.com> <555E2CA7.2040904@oracle.com> Message-ID: <555E2E39.8040903@oracle.com> Thanks, Vladimir. Best, Tobias On 21.05.2015 21:06, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/21/15 11:51 AM, Tobias Hartmann wrote: >> Hi, >> >> this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. >> >> https://bugs.openjdk.java.net/browse/JDK-8080156 >> http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ >> >> The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. >> >> Thanks, >> Tobias >> >> [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html >> From yekaterina.kantserova at oracle.com Fri May 22 06:40:45 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Fri, 22 May 2015 08:40:45 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: <5C6FDB60-85C3-4130-ABF0-52368963E0DB@oracle.com> References: <555D8A56.3050001@oracle.com> <555DB98A.7050401@oracle.com> <5C6FDB60-85C3-4130-ABF0-52368963E0DB@oracle.com> Message-ID: <555ECF6D.2030003@oracle.com> Staffan, thanks for the review! On 05/21/2015 01:04 PM, Staffan Larsen wrote: > Looks good! 
> > Thanks, > /Staffan > >> On 21 maj 2015, at 12:55, Yekaterina Kantserova wrote: >> >> The new webrev can be found here: http://cr.openjdk.java.net/~ykantser/8080828/webrev.01 >> >> Thanks, >> Katja >> >> >> >> On 05/21/2015 10:15 AM, Staffan Larsen wrote: >>> I just realize that you also need to add a check if SA can attach on the current platform. See the test in sa/jmap-hashcode: >>> >>> if (!Platform.shouldSAAttach()) { >>> System.out.println("SA attach not expected to work - test skipped."); >>> return; >>> } >>> >>> Thanks, >>> /Staffan >>> >>>> On 21 maj 2015, at 10:13, Staffan Larsen wrote: >>>> >>>> Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. >>>> >>>>> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >>>>> >>>>> Hi, >>>>> >>>>> Could I please have a review of this new test. >>>>> >>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >>>>> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >>>>> >>>>> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. >>>>> >>>>> Thanks, >>>>> Katja From yekaterina.kantserova at oracle.com Fri May 22 07:17:33 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Fri, 22 May 2015 09:17:33 +0200 Subject: 8080855: Create sanity test for JDK-8080692 Message-ID: <555ED80D.5000806@oracle.com> Hi, Could I please have a review of this new test. bug: https://bugs.openjdk.java.net/browse/JDK-8080855 webrev: http://cr.openjdk.java.net/~ykantser/8080855/webrev.00/ This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080692 if running with -Xcomp. 
Thanks, Katja From staffan.larsen at oracle.com Fri May 22 07:20:45 2015 From: staffan.larsen at oracle.com (Staffan Larsen) Date: Fri, 22 May 2015 09:20:45 +0200 Subject: 8080855: Create sanity test for JDK-8080692 In-Reply-To: <555ED80D.5000806@oracle.com> References: <555ED80D.5000806@oracle.com> Message-ID: <2EB33CDA-5402-428E-8E70-0C64581D697A@oracle.com> Looks good! Thanks, /Staffan > On 22 maj 2015, at 09:17, Yekaterina Kantserova wrote: > > Hi, > > Could I please have a review of this new test. > > bug: https://bugs.openjdk.java.net/browse/JDK-8080855 > webrev: http://cr.openjdk.java.net/~ykantser/8080855/webrev.00/ > > This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080692 if running with -Xcomp. > > Thanks, > Katja From roland.westrelin at oracle.com Fri May 22 07:24:44 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 22 May 2015 09:24:44 +0200 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <73E7F064-372B-4303-8FA4-9CB6CF1239D4@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> <555CDD4C.5030707@oracle.com> <73E7F064-372B-4303-8FA4-9CB6CF1239D4@oracle.com> Message-ID: Vladimir, >>>> Should it also return false if in(ArrayCopyNode::Src)->is_top()? >>> >>> That doesn?t feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn?t modify. >> >> Okay, returning true in such case would be conservative. >> >> Should we check dest input for top in CallLeafNode::may_modify()? Do we have other similar places? > > You?re right checking for top dest in CallLeafNode::may_modify() is safer. > I don?t see other similar issues elsewhere in the array copy code. Here is a webrev with the top check in CallLeafNode::may_modify()? http://cr.openjdk.java.net/~roland/8080699/webrev.01/ Can I push it? Roland. > > Roland. 
> > >> >> Thanks, >> Vladimir >> >>> Also, I don?t think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get an other chance to optimize it better. >>> >>> Roland. >>> >>> >>> >>>> >>>> thanks, >>>> Vladimir >>>> >>>> On 5/20/15 7:03 AM, Roland Westrelin wrote: >>>>> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ >>>>> >>>>> When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It?s not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That?s what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. >>>>> >>>>> Roland. >>>>> >>> > From denis.kononenko at oracle.com Fri May 22 12:38:43 2015 From: denis.kononenko at oracle.com (denis kononenko) Date: Fri, 22 May 2015 15:38:43 +0300 Subject: [8u60] RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 Message-ID: <555F2353.1020406@oracle.com> Hi All, Could you please review a backport for JDK-8080693. JDK9 webrev link: http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ JDK9 bug id: 8080693, http://bugs.openjdk.java.net/browse/JDK-8080693 JDK9 review thread: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-May/014780.html The changes cannot be applied cleanly, so I've created a new webrev: 8u60 webrev link: http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.8u60.00/ JDK8 bug id: 8077620, https://bugs.openjdk.java.net/browse/JDK-8077620 Thank you, Denis. 
From andreas.eriksson at oracle.com Fri May 22 13:07:01 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Fri, 22 May 2015 15:07:01 +0200 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE In-Reply-To: <555E2942.7060404@oracle.com> References: <555E2942.7060404@oracle.com> Message-ID: <555F29F5.7090402@oracle.com> Hi Tobias, You have a comma that shouldn't be there in the bugid line: -------------------------------------------------- TEST: compiler/types/TestTypePropagationToCmpU.java TEST RESULT: Error. Parse Exception: Invalid or unrecognized bugid: 8080156, -------------------------------------------------- Otherwise it looks good to me. Thanks for pushing this test. - Andreas On 2015-05-21 20:51, Tobias Hartmann wrote: > Hi, > > this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. > > https://bugs.openjdk.java.net/browse/JDK-8080156 > http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ > > The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. > > Thanks, > Tobias > > [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html From tobias.hartmann at oracle.com Fri May 22 13:10:40 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 22 May 2015 15:10:40 +0200 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE In-Reply-To: <555F29F5.7090402@oracle.com> References: <555E2942.7060404@oracle.com> <555F29F5.7090402@oracle.com> Message-ID: <555F2AD0.2010907@oracle.com> Hi Andreas, thanks, I already noticed while running through JPRT and fixed it. 
Best, Tobias On 22.05.2015 15:07, Andreas Eriksson wrote: > Hi Tobias, > > You have a comma that shouldn't be there in the bugid line: > -------------------------------------------------- > TEST: compiler/types/TestTypePropagationToCmpU.java > TEST RESULT: Error. Parse Exception: Invalid or unrecognized bugid: 8080156, > -------------------------------------------------- > > Otherwise it looks good to me. > Thanks for pushing this test. > > - Andreas > > On 2015-05-21 20:51, Tobias Hartmann wrote: >> Hi, >> >> this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. >> >> https://bugs.openjdk.java.net/browse/JDK-8080156 >> http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ >> >> The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. >> >> Thanks, >> Tobias >> >> [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html > From vladimir.x.ivanov at oracle.com Fri May 22 14:36:43 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 22 May 2015 17:36:43 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks Message-ID: <555F3EFB.60606@oracle.com> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 https://bugs.openjdk.java.net/browse/JDK-8001622 Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from immI8/immI16 to immI, since zero-extending move is used. Thanks! 
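For readers following the matching question, a hedged Java-level sketch (my own illustration, not taken from the webrev) of the pattern these rules cover: a byte load masked with an int constant and widened to long, i.e. (ConvI2L (AndI (LoadUB mem) mask)). The immI8 operand only covers signed 8-bit constants (-128..127), so an 8-bit mask such as 0xF0 (= 240) was previously not matched even though the zero-extending load makes any wider mask safe. This assumes C2's usual idealization of (AndI (LoadB mem) mask) into (AndI (LoadUB mem) mask) when the mask clears the sign-extended bits:

```java
// Illustration only: Java source patterns behind the loadUB2L rules.
public class MaskWiden {
    // Mask fits a signed 8-bit immediate: matched by the old loadUB2L_immI8 rule.
    static long narrowMask(byte[] b, int i) {
        return b[i] & 0x7F;
    }

    // 0xF0 = 240 does not fit a *signed* 8-bit immediate, so the old rule
    // missed it although it is still an 8-bit mask.
    static long missedMask(byte[] b, int i) {
        return b[i] & 0xF0;
    }

    public static void main(String[] args) {
        byte[] b = { (byte) 0xAB };           // 0xAB is -85 as a signed byte
        System.out.println(narrowMask(b, 0)); // 0xAB & 0x7F = 0x2B = 43
        System.out.println(missedMask(b, 0)); // 0xAB & 0xF0 = 0xA0 = 160
    }
}
```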
Best regards, Vladimir Ivanov From vladimir.x.ivanov at oracle.com Fri May 22 14:38:55 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 22 May 2015 17:38:55 +0300 Subject: [8u60] RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 In-Reply-To: <555F2353.1020406@oracle.com> References: <555F2353.1020406@oracle.com> Message-ID: <555F3F7F.5040000@oracle.com> Both changes look good. Best regards, Vladimir Ivanov On 5/22/15 3:38 PM, denis kononenko wrote: > Hi All, > > Could you please review a backport for JDK-8080693. > > JDK9 webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ > JDK9 bug id: 8080693, http://bugs.openjdk.java.net/browse/JDK-8080693 > JDK9 review thread: > http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-May/014780.html > > > The changes cannot be applied cleanly, so I've created a new webrev: > > 8u60 webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.8u60.00/ > JDK8 bug id: 8077620, https://bugs.openjdk.java.net/browse/JDK-8077620 > > Thank you, > Denis. > From roland.westrelin at oracle.com Fri May 22 14:57:18 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 22 May 2015 16:57:18 +0200 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <555F3EFB.60606@oracle.com> References: <555F3EFB.60606@oracle.com> Message-ID: <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 > https://bugs.openjdk.java.net/browse/JDK-8001622 > > Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from immI8/immI16 to immI, since zero-extending move is used. The change on sparc doesn?t look good: __ and3($dst$$Register, $mask$$constant, $dst$$Register); the and instruction cannot encode arbitrary large integer constants. 
Isn't the root of the problem that we want an unsigned 8 bit integer and not a signed one? Roland. From vladimir.x.ivanov at oracle.com Fri May 22 15:14:44 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 22 May 2015 18:14:44 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> Message-ID: <555F47E4.8020507@oracle.com> Thanks for looking into the fix, Roland. On 5/22/15 5:57 PM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >> https://bugs.openjdk.java.net/browse/JDK-8001622 >> >> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from immI8/immI16 to immI, since zero-extending move is used. > > The change on sparc doesn't look good: > __ and3($dst$$Register, $mask$$constant, $dst$$Register); > > the and instruction cannot encode arbitrary large integer constants. Good catch. I missed that it is and3. > Isn't the root of the problem that we want an unsigned 8 bit integer and not a signed one? Yes, unsigned 8-bit integers work fine, but it can be generalized to arbitrary masks as well. I experimented with new operand types (immU8, immU16), but it was more code for no particular benefit. I can use them on sparc. Best regards, Vladimir Ivanov From vladimir.x.ivanov at oracle.com Fri May 22 15:38:37 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 22 May 2015 18:38:37 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <555F47E4.8020507@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> Message-ID: <555F4D7D.7080504@oracle.com> Updated webrev: http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 Introduced immU8/immU16 operands on sparc.
As a cleanup, sorted immI/immU operand declarations. Best regards, Vladimir Ivanov On 5/22/15 6:14 PM, Vladimir Ivanov wrote: > Thanks for looking into the fix, Roland. > > On 5/22/15 5:57 PM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>> >>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>> immI8/immI16 to immI, since zero-extending move is used. >> >> The change on sparc doesn?t look good: >> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >> >> the and instruction cannot encode arbitrary large integer constants. > Good catch. I missed that it is and3. > >> Isn?t the root of the problem that we want an unsigned 8 bit integer >> and not a signed one? > Yes, unsigned 8-bit integer work fine, but it can be generalized to > arbitrary masks as well. > > I experimented with new operand types (immU8, immU16), but it was more > code for no particular benefit. I can use them on sparc. > > Best regards, > Vladimir Ivanov From vladimir.kozlov at oracle.com Fri May 22 15:44:39 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 22 May 2015 08:44:39 -0700 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> <555CDD4C.5030707@oracle.com> <73E7F064-372B-4303-8FA4-9CB6CF1239D4@oracle.com> Message-ID: <555F4EE7.8060305@oracle.com> Good. Thanks, Vladimir On 5/22/15 12:24 AM, Roland Westrelin wrote: > Vladimir, > >>>>> Should it also return false if in(ArrayCopyNode::Src)->is_top()? >>>> >>>> That doesn?t feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn?t modify. >>> >>> Okay, returning true in such case would be conservative. >>> >>> Should we check dest input for top in CallLeafNode::may_modify()? 
Do we have other similar places? >> >> You're right checking for top dest in CallLeafNode::may_modify() is safer. >> I don't see other similar issues elsewhere in the array copy code. > > Here is a webrev with the top check in CallLeafNode::may_modify()… > > http://cr.openjdk.java.net/~roland/8080699/webrev.01/ > > Can I push it? > > Roland. > >> >> Roland.
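For context, a hedged Java sketch (hypothetical, unrelated to the actual JCK test) of the shape Roland describes earlier in the thread: an arraycopy whose destination is a non-escaping allocation, so the ArrayCopyNode can be eliminated, while the copy's exception-processing outputs keep the dead node reachable until the next igvn pass:

```java
// Hypothetical illustration of the eliminated-arraycopy shape: 'tmp'
// never escapes, so C2 can scalar-replace it and eliminate the copy;
// System.arraycopy can still throw (null/bounds/store checks), which is
// why the node retains exception-processing outputs.
public class ArrayCopyShape {
    static int sketch(int[] src) {
        int[] tmp = new int[8];               // non-escaping allocation
        System.arraycopy(src, 0, tmp, 0, 8);  // eliminable arraycopy
        return tmp[3];
    }

    public static void main(String[] args) {
        int[] src = { 0, 1, 2, 3, 4, 5, 6, 7 };
        int r = 0;
        for (int i = 0; i < 20_000; i++) {    // warm up so C2 compiles sketch()
            r = sketch(src);
        }
        System.out.println(r);                // tmp[3] == src[3] == 3
    }
}
```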
> On May 22, 2015, at 5:44 PM, Vladimir Kozlov wrote: > > Good. > > Thanks, > Vladimir > > On 5/22/15 12:24 AM, Roland Westrelin wrote: >> Vladimir, >> >>>>>> Should it also return false if in(ArrayCopyNode::Src)->is_top()? >>>>> >>>>> That doesn?t feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn?t modify. >>>> >>>> Okay, returning true in such case would be conservative. >>>> >>>> Should we check dest input for top in CallLeafNode::may_modify()? Do we have other similar places? >>> >>> You?re right checking for top dest in CallLeafNode::may_modify() is safer. >>> I don?t see other similar issues elsewhere in the array copy code. >> >> Here is a webrev with the top check in CallLeafNode::may_modify()? >> >> http://cr.openjdk.java.net/~roland/8080699/webrev.01/ >> >> Can I push it? >> >> Roland. >> >>> >>> Roland. >>> >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>>> Also, I don?t think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get an other chance to optimize it better. >>>>> >>>>> Roland. >>>>> >>>>> >>>>> >>>>>> >>>>>> thanks, >>>>>> Vladimir >>>>>> >>>>>> On 5/20/15 7:03 AM, Roland Westrelin wrote: >>>>>>> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ >>>>>>> >>>>>>> When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It?s not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That?s what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. >>>>>>> >>>>>>> Roland. 
>>>>>>> >>>>> >>> >> From vladimir.kozlov at oracle.com Fri May 22 17:21:52 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 22 May 2015 10:21:52 -0700 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <555F4D7D.7080504@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> Message-ID: <555F65B0.30307@oracle.com> Good. Thanks, Vladimir K On 5/22/15 8:38 AM, Vladimir Ivanov wrote: > Updated webrev: > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 > > Introduced immU8/immU16 operands on sparc. > > As a cleanup, sorted immI/immU operand declarations. > > Best regards, > Vladimir Ivanov > > On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >> Thanks for looking into the fix, Roland. >> >> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>> >>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>> immI8/immI16 to immI, since zero-extending move is used. >>> >>> The change on sparc doesn?t look good: >>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>> >>> the and instruction cannot encode arbitrary large integer constants. >> Good catch. I missed that it is and3. >> >>> Isn?t the root of the problem that we want an unsigned 8 bit integer >>> and not a signed one? >> Yes, unsigned 8-bit integer work fine, but it can be generalized to >> arbitrary masks as well. >> >> I experimented with new operand types (immU8, immU16), but it was more >> code for no particular benefit. I can use them on sparc. 
>> >> Best regards, >> Vladimir Ivanov From rednaxelafx at gmail.com Fri May 22 22:22:13 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Fri, 22 May 2015 15:22:13 -0700 Subject: Question on ADLC's RegClass destructor Message-ID: Hi compiler team, I came across this piece of code in ADLC, and I'm not sure if it's right or not: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/formsopt.cpp#l236 RegClass::~RegClass() { delete _classid; } Why do we need to delete the _classid field here? If this is needed, then a bunch of other classes in formsopt.hpp would need the same treatment. (I know this virtual destructor was added mainly for the two new subclasses, but I'm just wondering about this one...) Thanks, Kris -------------- next part -------------- An HTML attachment was scrubbed... URL: From dean.long at oracle.com Sat May 23 04:49:30 2015 From: dean.long at oracle.com (Dean Long) Date: Fri, 22 May 2015 21:49:30 -0700 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <555F4D7D.7080504@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> Message-ID: <556006DA.8030202@oracle.com> Isn't this what you want: 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long Register 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); 5660 5661 size(2*4); 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} 5664 ins_encode %{ 5665 __ ldub($mem$$Address, $dst$$Register); 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), $dst$$Register); 5667 %} 5668 ins_pipe(iload_mem); 5669 %} dl On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: > Updated webrev: > 
http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 > > Introduced immU8/immU16 operands on sparc. > > As a cleanup, sorted immI/immU operand declarations. > > Best regards, > Vladimir Ivanov > > On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >> Thanks for looking into the fix, Roland. >> >> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>> >>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>> immI8/immI16 to immI, since zero-extending move is used. >>> >>> The change on sparc doesn?t look good: >>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>> >>> the and instruction cannot encode arbitrary large integer constants. >> Good catch. I missed that it is and3. >> >>> Isn?t the root of the problem that we want an unsigned 8 bit integer >>> and not a signed one? >> Yes, unsigned 8-bit integer work fine, but it can be generalized to >> arbitrary masks as well. >> >> I experimented with new operand types (immU8, immU16), but it was more >> code for no particular benefit. I can use them on sparc. >> >> Best regards, >> Vladimir Ivanov From dawid.weiss at gmail.com Sat May 23 12:51:03 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Sat, 23 May 2015 14:51:03 +0200 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) Message-ID: Hi everyone, I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea builds of JDK 1.9.0. 
Here's some context: - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 build currently fails (Windows, Linux environments, 64-bit), - the bug/ issue is a suspicious AIOOB on: org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) which happens to be the line of code inside this for loop: for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { tt[cftab[ll8[i] & 0xff]++] = i; } Which array access this is exactly is hard to tell, but the *same* bzip input file does not produce the error on any other JVM (or an earlier releases of 1.9ea). This code is deterministic in the test that uses the above routine. - the problem *only* appears from 1.9ea_b64; on earlier releases the same code passes just fine (bisected it back from b45), - I also checked 1.9ea_b65 (which happens to be on the download server but wasn't properly announced yet?). The problem persists. - the problem does reproduce on the build server (Windows and Linux). Interestingly, I couldn't reproduce it locally. The code is proprietary, I couldn't narrow it down yet to something that would reproduce (sigh). I realize this is insufficient information to get started, but perhaps this issue is already known or somebody may have a clue at what is going on (CCing hotspot-compiler-dev)? Dawid From dawid.weiss at gmail.com Sat May 23 19:58:53 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Sat, 23 May 2015 21:58:53 +0200 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) Message-ID: Good news. I have a repro that crashes for me every time and it only contains open-source code (and some data). Bad news: it's probably a compiler bug because everything works just fine with -Xint. I'll put it together into a repro tomorrow, hopefully, and will ask somebody with the right permission to file an issue in Jira. Should be relatively easy to narrow it down by bisecting hs repo commits. 
Dawid On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss wrote: > Hi Rory, everyone, > > I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea > builds of JDK 1.9.0. Here's some context: > > - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 > build currently fails (Windows, Linux environments, 64-bit), > > - the bug/ issue is a suspicious AIOOB on: > > org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) > > which happens to be the line of code inside this for loop: > > for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { > tt[cftab[ll8[i] & 0xff]++] = i; > } > > Which array access this is exactly is hard to tell, but the *same* > bzip input file does not produce the error on any other JVM (or an > earlier releases of 1.9ea). This code is deterministic in the test > that uses the above routine. > > - the problem *only* appears from 1.9ea_b64; on earlier releases the > same code passes just fine (bisected it back from b45), > > - I also checked 1.9ea_b65 (which happens to be on the download server > but wasn't properly announced yet?). The problem persists. > > - the problem does reproduce on the build server (Windows and Linux). > Interestingly, I couldn't reproduce it locally. The code is > proprietary, I couldn't narrow it down yet to something that would > reproduce (sigh). > > I realize this is insufficient information to get started, but perhaps > this issue is already known or somebody may have a clue at what is > going on (CCing hotspot-compiler-dev)? 
> > Dawid From dawid.weiss at gmail.com Sun May 24 07:23:11 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Sun, 24 May 2015 09:23:11 +0200 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) In-Reply-To: References: Message-ID: Hello again, The bug repro code is at the link below: http://download.carrotsearch.com/jvm/repro.zip Definitely something with the compilation because disabling loop unrolling (or running in interpreted mode) doesn't trigger the bug. More information (also included in README.txt) quoted below. Dawid Expected behavior: The code should re-read the gz2 resource, looping and printing (infinitely): Round... Round... Round... Actual behavior (64-Bit Server VM, build 1.9.0-ea-b64, mixed mode): Round... Round... Round... Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 314297 at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:136) at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:111) at bug.Repro.main(Repro.java:15) Notes ----- - Self contained maven project (copied commons compress sources so that one can tweak them if needed). An additional bz2 resource is needed (included). - Build with: mvn package - Run with: java -jar target/Repro-0.0.0.jar - Running in interpreted mode does *not* cause any error: java -Xint -jar target/Repro-0.0.0.jar - Running without loop unrolls does *not* cause any error: java -Xbatch -XX:LoopUnrollLimit=0 -jar target/Repro-0.0.0.jar On Sat, May 23, 2015 at 9:58 PM, Dawid Weiss wrote: > Good news. I have a repro that crashes for me every time and it only > contains open-source code (and some data). Bad news: it's probably a > compiler bug because everything works just fine with -Xint. 
> > I'll put it together into a repro tomorrow, hopefully, and will ask > somebody with the right permission to file an issue in Jira. Should be > relatively easy to narrow it down by bisecting hs repo commits. > > Dawid > > On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss > wrote: >> Hi Rory, everyone, >> >> I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea >> builds of JDK 1.9.0. Here's some context: >> >> - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 >> build currently fails (Windows, Linux environments, 64-bit), >> >> - the bug/ issue is a suspicious AIOOB on: >> >> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >> >> which happens to be the line of code inside this for loop: >> >> for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { >> tt[cftab[ll8[i] & 0xff]++] = i; >> } >> >> Which array access this is exactly is hard to tell, but the *same* >> bzip input file does not produce the error on any other JVM (or an >> earlier releases of 1.9ea). This code is deterministic in the test >> that uses the above routine. >> >> - the problem *only* appears from 1.9ea_b64; on earlier releases the >> same code passes just fine (bisected it back from b45), >> >> - I also checked 1.9ea_b65 (which happens to be on the download server >> but wasn't properly announced yet?). The problem persists. >> >> - the problem does reproduce on the build server (Windows and Linux). >> Interestingly, I couldn't reproduce it locally. The code is >> proprietary, I couldn't narrow it down yet to something that would >> reproduce (sigh). >> >> I realize this is insufficient information to get started, but perhaps >> this issue is already known or somebody may have a clue at what is >> going on (CCing hotspot-compiler-dev)? 
>> >> Dawid From david.holmes at oracle.com Sun May 24 23:41:31 2015 From: david.holmes at oracle.com (David Holmes) Date: Mon, 25 May 2015 09:41:31 +1000 Subject: [8u60] RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 In-Reply-To: <555F2353.1020406@oracle.com> References: <555F2353.1020406@oracle.com> Message-ID: <556261AB.6010000@oracle.com> Looks good. Thanks, David On 22/05/2015 10:38 PM, denis kononenko wrote: > Hi All, > > Could you please review a backport for JDK-8080693. > > JDK9 webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ > JDK9 bug id: 8080693, http://bugs.openjdk.java.net/browse/JDK-8080693 > JDK9 review thread: > http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-May/014780.html > > > The changes cannot be applied cleanly, so I've created a new webrev: > > 8u60 webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.8u60.00/ > JDK8 bug id: 8077620, https://bugs.openjdk.java.net/browse/JDK-8077620 > > Thank you, > Denis. > From dawid.weiss at gmail.com Mon May 25 07:46:54 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Mon, 25 May 2015 09:46:54 +0200 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) In-Reply-To: <55623549.2060502@oracle.com> References: <55623549.2060502@oracle.com> Message-ID: Filed a bug report with Review ID: JI-9021458. Thanks! Dawid On Sun, May 24, 2015 at 10:32 PM, Rory O'Donnell wrote: > Hi Dawid, > > Could you log an incident at bugs.java.com and let us know the incident id. > > Thanks, Rory > > > On 24/05/2015 08:23, Dawid Weiss wrote: >> >> Hello again, >> >> The bug repro code is at the link below: >> http://download.carrotsearch.com/jvm/repro.zip >> >> Definitely something with the compilation because disabling loop >> unrolling (or running in interpreted mode) doesn't trigger the bug. >> More information (also included in README.txt) quoted below. 
>> >> Dawid >> >> Expected behavior: >> The code should re-read the gz2 resource, looping and printing >> (infinitely): >> Round... >> Round... >> Round... >> >> Actual behavior (64-Bit Server VM, build 1.9.0-ea-b64, mixed mode): >> Round... >> Round... >> Round... >> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: >> 314297 >> at >> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >> at >> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:136) >> at >> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:111) >> at bug.Repro.main(Repro.java:15) >> >> Notes >> ----- >> >> - Self contained maven project (copied commons compress sources so that >> one can >> tweak them if needed). An additional bz2 resource is needed (included). >> - Build with: >> mvn package >> - Run with: >> java -jar target/Repro-0.0.0.jar >> - Running in interpreted mode does *not* cause any error: >> java -Xint -jar target/Repro-0.0.0.jar >> - Running without loop unrolls does *not* cause any error: >> java -Xbatch -XX:LoopUnrollLimit=0 -jar target/Repro-0.0.0.jar >> >> On Sat, May 23, 2015 at 9:58 PM, Dawid Weiss >> wrote: >>> >>> Good news. I have a repro that crashes for me every time and it only >>> contains open-source code (and some data). Bad news: it's probably a >>> compiler bug because everything works just fine with -Xint. >>> >>> I'll put it together into a repro tomorrow, hopefully, and will ask >>> somebody with the right permission to file an issue in Jira. Should be >>> relatively easy to narrow it down by bisecting hs repo commits. >>> >>> Dawid >>> >>> On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss >>> wrote: >>>> >>>> Hi Rory, everyone, >>>> >>>> I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea >>>> builds of JDK 1.9.0. 
Here's some context: >>>> >>>> - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 >>>> build currently fails (Windows, Linux environments, 64-bit), >>>> >>>> - the bug/ issue is a suspicious AIOOB on: >>>> >>>> >>>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >>>> >>>> which happens to be the line of code inside this for loop: >>>> >>>> for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { >>>> tt[cftab[ll8[i] & 0xff]++] = i; >>>> } >>>> >>>> Which array access this is exactly is hard to tell, but the *same* >>>> bzip input file does not produce the error on any other JVM (or an >>>> earlier releases of 1.9ea). This code is deterministic in the test >>>> that uses the above routine. >>>> >>>> - the problem *only* appears from 1.9ea_b64; on earlier releases the >>>> same code passes just fine (bisected it back from b45), >>>> >>>> - I also checked 1.9ea_b65 (which happens to be on the download server >>>> but wasn't properly announced yet?). The problem persists. >>>> >>>> - the problem does reproduce on the build server (Windows and Linux). >>>> Interestingly, I couldn't reproduce it locally. The code is >>>> proprietary, I couldn't narrow it down yet to something that would >>>> reproduce (sigh). >>>> >>>> I realize this is insufficient information to get started, but perhaps >>>> this issue is already known or somebody may have a clue at what is >>>> going on (CCing hotspot-compiler-dev)? 
>>>> >>>> Dawid > > > -- > Rgds,Rory O'Donnell > Quality Engineering Manager > Oracle EMEA, Dublin,Ireland > From rory.odonnell at oracle.com Mon May 25 08:00:55 2015 From: rory.odonnell at oracle.com (Rory O'Donnell) Date: Mon, 25 May 2015 09:00:55 +0100 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) In-Reply-To: References: <55623549.2060502@oracle.com> Message-ID: <5562D6B7.2050800@oracle.com> Hi Dawid, Here is the JBS id: https://bugs.openjdk.java.net/browse/JDK-8080976 Thank you for submitting the bug. Rgds, Rory On 25/05/2015 08:46, Dawid Weiss wrote: > Filed a bug report with Review ID: JI-9021458. Thanks! > > Dawid > > On Sun, May 24, 2015 at 10:32 PM, Rory O'Donnell > wrote: >> Hi Dawid, >> >> Could you log an incident at bugs.java.com and let us know the incident id. >> >> Thanks, Rory >> >> >> On 24/05/2015 08:23, Dawid Weiss wrote: >>> Hello again, >>> >>> The bug repro code is at the link below: >>> http://download.carrotsearch.com/jvm/repro.zip >>> >>> Definitely something with the compilation because disabling loop >>> unrolling (or running in interpreted mode) doesn't trigger the bug. >>> More information (also included in README.txt) quoted below. >>> >>> Dawid >>> >>> Expected behavior: >>> The code should re-read the gz2 resource, looping and printing >>> (infinitely): >>> Round... >>> Round... >>> Round... >>> >>> Actual behavior (64-Bit Server VM, build 1.9.0-ea-b64, mixed mode): >>> Round... >>> Round... >>> Round... 
>>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: >>> 314297 >>> at >>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >>> at >>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:136) >>> at >>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:111) >>> at bug.Repro.main(Repro.java:15) >>> >>> Notes >>> ----- >>> >>> - Self contained maven project (copied commons compress sources so that >>> one can >>> tweak them if needed). An additional bz2 resource is needed (included). >>> - Build with: >>> mvn package >>> - Run with: >>> java -jar target/Repro-0.0.0.jar >>> - Running in interpreted mode does *not* cause any error: >>> java -Xint -jar target/Repro-0.0.0.jar >>> - Running without loop unrolls does *not* cause any error: >>> java -Xbatch -XX:LoopUnrollLimit=0 -jar target/Repro-0.0.0.jar >>> >>> On Sat, May 23, 2015 at 9:58 PM, Dawid Weiss >>> wrote: >>>> Good news. I have a repro that crashes for me every time and it only >>>> contains open-source code (and some data). Bad news: it's probably a >>>> compiler bug because everything works just fine with -Xint. >>>> >>>> I'll put it together into a repro tomorrow, hopefully, and will ask >>>> somebody with the right permission to file an issue in Jira. Should be >>>> relatively easy to narrow it down by bisecting hs repo commits. >>>> >>>> Dawid >>>> >>>> On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss >>>> wrote: >>>>> Hi Rory, everyone, >>>>> >>>>> I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea >>>>> builds of JDK 1.9.0. 
Here's some context: >>>>> >>>>> - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 >>>>> build currently fails (Windows, Linux environments, 64-bit), >>>>> >>>>> - the bug/ issue is a suspicious AIOOB on: >>>>> >>>>> >>>>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >>>>> >>>>> which happens to be the line of code inside this for loop: >>>>> >>>>> for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { >>>>> tt[cftab[ll8[i] & 0xff]++] = i; >>>>> } >>>>> >>>>> Which array access this is exactly is hard to tell, but the *same* >>>>> bzip input file does not produce the error on any other JVM (or an >>>>> earlier releases of 1.9ea). This code is deterministic in the test >>>>> that uses the above routine. >>>>> >>>>> - the problem *only* appears from 1.9ea_b64; on earlier releases the >>>>> same code passes just fine (bisected it back from b45), >>>>> >>>>> - I also checked 1.9ea_b65 (which happens to be on the download server >>>>> but wasn't properly announced yet?). The problem persists. >>>>> >>>>> - the problem does reproduce on the build server (Windows and Linux). >>>>> Interestingly, I couldn't reproduce it locally. The code is >>>>> proprietary, I couldn't narrow it down yet to something that would >>>>> reproduce (sigh). >>>>> >>>>> I realize this is insufficient information to get started, but perhaps >>>>> this issue is already known or somebody may have a clue at what is >>>>> going on (CCing hotspot-compiler-dev)? 
>>>>> >>>>> Dawid >> >> -- >> Rgds,Rory O'Donnell >> Quality Engineering Manager >> Oracle EMEA, Dublin,Ireland >> -- Rgds,Rory O'Donnell Quality Engineering Manager Oracle EMEA , Dublin, Ireland From zoltan.majo at oracle.com Tue May 26 07:06:52 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Tue, 26 May 2015 09:06:52 +0200 Subject: Question on ADLC's RegClass destructor In-Reply-To: References: Message-ID: <55641B8C.7080007@oracle.com> Hi Kris, On 05/23/2015 12:22 AM, Krystal Mok wrote: > Hi compiler team, > > I came across this piece of code in ADLC, and I'm not sure if it's > right or not: > > http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/formsopt.cpp#l236 > > RegClass::~RegClass() { > delete _classid; > } > > Why do we need to delete the _classid field here? If this is needed, > then a bunch of other classes in formsopt.hpp would need the same > treatment. If I remember correctly, I added the destructor because the caller of the RegClass constructor allocates memory for _classid, but does not free it. But I'd have to check to be sure. Best regards, Zoltan > > (I know this virtual destructor was added mainly for the two new > subclasses, but I'm just wondering about this one...) 
> > Thanks, > Kris From aph at redhat.com Tue May 26 09:29:20 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 26 May 2015 10:29:20 +0100 Subject: Protection of RSA from timing and cache-flushing attacks [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <5550CCC0.6000502@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> <554CF00B.6000508@redhat.com> <5550CCC0.6000502@redhat.com> Message-ID: <55643CF0.8090100@redhat.com> On 05/11/2015 04:37 PM, Florian Weimer wrote: > On 05/08/2015 07:19 PM, Andrew Haley wrote: > >>> Do we want to add side-channel protection as part of this effort >>> (against timing attacks and cache-flushing attacks)? >> >> I wouldn't have thought so. It might make sense to add an optional >> path without key-dependent branches, but not as a part of this effort: >> the goals are completely orthogonal. > > I'm not well-versed in this kind of side-channel protection for RSA > implementations, but my impression that algorithm changes are needed to > mitigate the impact of data-dependent memory fetches (see fixed-width > modular exponentiation). But maybe the necessary changes materialize at > a higher level, beyond the operation which you proposed to intrinsify. By the way: there is quite a bit of code in sun/security/rsa/RSACore.java to protect against timing attacks. In particular, the patch for "8031346: Enhance RSA key handling" looks quite thorough and there is also extra care taken to make padding operations execute in constant time. Andrew. 
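The constant-time handling Andrew refers to comes down to comparisons and padding checks whose running time does not depend on where the first mismatch occurs. A minimal sketch of the usual pattern follows; it is illustrative only, not the RSACore sources (the JDK's java.security.MessageDigest.isEqual provides an equivalent check for byte arrays):

```java
import java.security.MessageDigest;

public class ConstantTimeCompare {
    // Accumulates differences with XOR/OR instead of returning at the
    // first mismatch, so the loop always scans the full length and the
    // timing does not reveal the position of the first differing byte.
    // (The early return on a length mismatch only leaks the lengths.)
    static boolean ctEquals(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        int diff = 0;
        for (int i = 0; i < a.length; i++) {
            diff |= a[i] ^ b[i];
        }
        return diff == 0;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3};
        byte[] y = {1, 2, 4};
        System.out.println(ctEquals(x, x.clone()));       // true
        System.out.println(MessageDigest.isEqual(x, y));  // false
    }
}
```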
From rory.odonnell at oracle.com Sun May 24 20:32:09 2015 From: rory.odonnell at oracle.com (Rory O'Donnell) Date: Sun, 24 May 2015 21:32:09 +0100 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) In-Reply-To: References: Message-ID: <55623549.2060502@oracle.com> Hi Dawid, Could you log an incident at bugs.java.com and let us know the incident id. Thanks, Rory On 24/05/2015 08:23, Dawid Weiss wrote: > Hello again, > > The bug repro code is at the link below: > http://download.carrotsearch.com/jvm/repro.zip > > Definitely something with the compilation because disabling loop > unrolling (or running in interpreted mode) doesn't trigger the bug. > More information (also included in README.txt) quoted below. > > Dawid > > Expected behavior: > The code should re-read the gz2 resource, looping and printing (infinitely): > Round... > Round... > Round... > > Actual behavior (64-Bit Server VM, build 1.9.0-ea-b64, mixed mode): > Round... > Round... > Round... > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 314297 > at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) > at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:136) > at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:111) > at bug.Repro.main(Repro.java:15) > > Notes > ----- > > - Self contained maven project (copied commons compress sources so that one can > tweak them if needed). An additional bz2 resource is needed (included). 
> - Build with: > mvn package > - Run with: > java -jar target/Repro-0.0.0.jar > - Running in interpreted mode does *not* cause any error: > java -Xint -jar target/Repro-0.0.0.jar > - Running without loop unrolls does *not* cause any error: > java -Xbatch -XX:LoopUnrollLimit=0 -jar target/Repro-0.0.0.jar > > On Sat, May 23, 2015 at 9:58 PM, Dawid Weiss wrote: >> Good news. I have a repro that crashes for me every time and it only >> contains open-source code (and some data). Bad news: it's probably a >> compiler bug because everything works just fine with -Xint. >> >> I'll put it together into a repro tomorrow, hopefully, and will ask >> somebody with the right permission to file an issue in Jira. Should be >> relatively easy to narrow it down by bisecting hs repo commits. >> >> Dawid >> >> On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss >> wrote: >>> Hi Rory, everyone, >>> >>> I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea >>> builds of JDK 1.9.0. Here's some context: >>> >>> - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 >>> build currently fails (Windows, Linux environments, 64-bit), >>> >>> - the bug/ issue is a suspicious AIOOB on: >>> >>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >>> >>> which happens to be the line of code inside this for loop: >>> >>> for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { >>> tt[cftab[ll8[i] & 0xff]++] = i; >>> } >>> >>> Which array access this is exactly is hard to tell, but the *same* >>> bzip input file does not produce the error on any other JVM (or an >>> earlier releases of 1.9ea). This code is deterministic in the test >>> that uses the above routine. 
>>> >>> - the problem *only* appears from 1.9ea_b64; on earlier releases the >>> same code passes just fine (bisected it back from b45), >>> >>> - I also checked 1.9ea_b65 (which happens to be on the download server >>> but wasn't properly announced yet?). The problem persists. >>> >>> - the problem does reproduce on the build server (Windows and Linux). >>> Interestingly, I couldn't reproduce it locally. The code is >>> proprietary, I couldn't narrow it down yet to something that would >>> reproduce (sigh). >>> >>> I realize this is insufficient information to get started, but perhaps >>> this issue is already known or somebody may have a clue at what is >>> going on (CCing hotspot-compiler-dev)? >>> >>> Dawid -- Rgds,Rory O'Donnell Quality Engineering Manager Oracle EMEA, Dublin,Ireland From dawid.weiss at carrotsearch.com Sat May 23 12:19:46 2015 From: dawid.weiss at carrotsearch.com (Dawid Weiss) Date: Sat, 23 May 2015 14:19:46 +0200 Subject: 1.9.0-ea-b64 regression? Message-ID: Hi Rory, everyone, I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea builds of JDK 1.9.0. Here's some context: - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 build currently fails (Windows, Linux environments, 64-bit), - the bug/ issue is a suspicious AIOOB on: org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) which happens to be the line of code inside this for loop: for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { tt[cftab[ll8[i] & 0xff]++] = i; } Which array access this is exactly is hard to tell, but the *same* bzip input file does not produce the error on any other JVM (or an earlier releases of 1.9ea). This code is deterministic in the test that uses the above routine. 
- the problem *only* appears from 1.9ea_b64; on earlier releases the same code passes just fine (bisected it back from b45), - I also checked 1.9ea_b65 (which happens to be on the download server but wasn't properly announced yet?). The problem persists. - the problem does reproduce on the build server (Windows and Linux). Interestingly, I couldn't reproduce it locally. The code is proprietary, I couldn't narrow it down yet to something that would reproduce (sigh). I realize this is insufficient information to get started, but perhaps this issue is already known or somebody may have a clue at what is going on (CCing hotspot-compiler-dev)? Dawid From andreas.eriksson at oracle.com Tue May 26 15:55:29 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Tue, 26 May 2015 17:55:29 +0200 Subject: [8u60] backport RFR: 8060036 and 8080156 Message-ID: <55649771.2050105@oracle.com> Hi, I'd like to backport the fix for 8060036 : C2: CmpU nodes can end up with wrong type information http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/53707cf9a443 And at the same time the test that was pushed by Tobias with 8080156 : Integer.toString(int value) sometimes throws NPE http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/b2e3cbd555fc The changes applied cleanly from the jdk9 changesets. A jprt run with the new test passes. Thanks, Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Tue May 26 19:34:16 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Tue, 26 May 2015 12:34:16 -0700 Subject: Question on ADLC's RegClass destructor In-Reply-To: <55641B8C.7080007@oracle.com> References: <55641B8C.7080007@oracle.com> Message-ID: Hi Zoltan, Thanks for your reply. When I was looking at the call chain that leads to RegClass' constructor, get_ident() is what's passed into RegClass as classid. 
The comment on get_ident_common() says: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/adlparse.cpp#l4560 //------------------------------get_ident_common------------------------------- // Looks for an identifier in the buffer, and turns it into a null terminated // string(still inside the file buffer). Returns a pointer to the string or // NULL if some other token is found instead. char *ADLParser::get_ident_common(bool do_preproc) { So normally, the string returned should be still inside the file buffer (if no preprocessing is needed...), and shouldn't need to be free'd afterwards; but if preprocessing is needed, then yes, there's dynamically allocated memory for the string returned. get_ident() is get_ident_common(true), so it's possible that preprocessing is needed; but from the RegClass constructor, it'd be hard to tell whether the string passed in is from the file buffer or from a dynamically allocated piece of memory. It (the original code before your changes) was an odd piece of code to begin with... - Kris On Tue, May 26, 2015 at 12:06 AM, Zolt?n Maj? wrote: > Hi Kris, > > > On 05/23/2015 12:22 AM, Krystal Mok wrote: > >> Hi compiler team, >> >> I came across this piece of code in ADLC, and I'm not sure if it's right >> or not: >> >> >> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/formsopt.cpp#l236 >> >> RegClass::~RegClass() { >> delete _classid; >> } >> >> Why do we need to delete the _classid field here? If this is needed, then >> a bunch of other classes in formsopt.hpp would need the same treatment. >> > > If I remember correctly, I added the destructor because the caller of the > RegClass constructor allocates memory for _classid, but does not free it. > But I'd have to check to be sure. > > Best regards, > > > Zoltan > > > >> (I know this virtual destructor was added mainly for the two new >> subclasses, but I'm just wondering about this one...) 
>> >> Thanks, >> Kris >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zoltan.majo at oracle.com Wed May 27 08:09:50 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Wed, 27 May 2015 10:09:50 +0200 Subject: Question on ADLC's RegClass destructor In-Reply-To: References: <55641B8C.7080007@oracle.com> Message-ID: <55657BCE.1020306@oracle.com> Hi Kris, On 05/26/2015 09:34 PM, Krystal Mok wrote: > Hi Zoltan, > > Thanks for your reply. When I was looking at the call chain that leads > to RegClass' constructor, get_ident() is what's passed into RegClass > as classid. > > The comment on get_ident_common() says: > http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/adlparse.cpp#l4560 > //------------------------------get_ident_common------------------------------- > // Looks for an identifier in the buffer, and turns it into a null > terminated > // string(still inside the file buffer). Returns a pointer to the > string or > // NULL if some other token is found instead. > char *ADLParser::get_ident_common(bool do_preproc) { > > So normally, the string returned should be still inside the file > buffer (if no preprocessing is needed...), and shouldn't need to be > free'd afterwards; but if preprocessing is needed, then yes, there's > dynamically allocated memory for the string returned. thank you for the detailed explanation and for bringing attention to this issue! I filed JDK-8081288 [1] to track the problem and will take care of it hopefully soon. If you have some time to spare, you are of course welcome to contribute a patch. Best regards, Zoltan [1] https://bugs.openjdk.java.net/browse/JDK-8081288 > get_ident() is get_ident_common(true), so it's possible that > preprocessing is needed; but from the RegClass constructor, it'd be > hard to tell whether the string passed in is from the file buffer or > from a dynamically allocated piece of memory. 
> > It (the original code before your changes) was an odd piece of code to > begin with... > > - Kris > > > On Tue, May 26, 2015 at 12:06 AM, Zolt?n Maj? > wrote: > > Hi Kris, > > > On 05/23/2015 12:22 AM, Krystal Mok wrote: > > Hi compiler team, > > I came across this piece of code in ADLC, and I'm not sure if > it's right or not: > > http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/formsopt.cpp#l236 > > RegClass::~RegClass() { > delete _classid; > } > > Why do we need to delete the _classid field here? If this is > needed, then a bunch of other classes in formsopt.hpp would > need the same treatment. > > > If I remember correctly, I added the destructor because the caller > of the RegClass constructor allocates memory for _classid, but > does not free it. But I'd have to check to be sure. > > Best regards, > > > Zoltan > > > > (I know this virtual destructor was added mainly for the two > new subclasses, but I'm just wondering about this one...) > > Thanks, > Kris > > > From benedikt.wedenik at theobroma-systems.com Wed May 27 10:27:46 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 27 May 2015 12:27:46 +0200 Subject: Implicit null-check in hot-loop? Message-ID: Hey there! I wrote a small Fibonacci-Micro-Benchmark to examine the generated assembly (eg. loop unrolling, pipeline, and so on). This is the relevant Java Code: 2 private static long fib(long n) { 3 if(n == 0) { 4 return 0; 5 } 6 if(n == 1) { 7 return 1; 8 } 9 10 long res = 3; 11 long imm0 = 0; 12 long imm1 = 1; 13 14 for(long i = 1; i < n; ++i) { 15 res = imm0 + imm1; 16 imm0 = imm1; 17 imm1 = res; 18 } 19 return res; 20 } In the hot-loop there is this ?ldr? which looked a little ?strange? at the first glance. I think that this load is a null-check? Is that the case? If so, why does HotSpot needs to check for null in every iteration step? 
I also investigated the generated code on x86 which is quite similar, but
instead of a load, they are using the "test"-instruction which performs an
"and" but only sets the flags discarding the result.
Is there any similar instruction available on aarch64 or is this already
the closest solution?

Here is the relevant assembly:

541 0x0000007fa4188a4c: adrp x23, 0x0000007fb5580000
542                     ; {poll}
543 0x0000007fa4188a50: b 0x0000007fa4188a74
544 0x0000007fa4188a54: nop
545 0x0000007fa4188a58: nop
546 0x0000007fa4188a5c: nop
547 0x0000007fa4188a60: add x19, x19, #0x1 ;*ladd
548                     ; - Fib::fib at 52 (line 14)
549
550 0x0000007fa4188a64: add xlocals, xbcp, xdispatch
551                     ; OopMap{off=104}
552                     ;*goto
553                     ; - Fib::fib at 55 (line 14)
554
555 0x0000007fa4188a68: ldr wzr, [x23] ;*goto
556                     ; - Fib::fib at 55 (line 14)
557                     ; {poll}
558 0x0000007fa4188a6c: mov xbcp, xdispatch
559 0x0000007fa4188a70: mov xdispatch, xlocals ;*lload
560                     ; - Fib::fib at 29 (line 14)
561
562 0x0000007fa4188a74: cmp x19, xesp
563 0x0000007fa4188a78: b.lt 0x0000007fa4188a60 ;*ifge

Btw - why does HotSpot not unroll this loop?

Thanks and best regards,
Benedikt

From aph at redhat.com Wed May 27 10:36:34 2015
From: aph at redhat.com (Andrew Haley)
Date: Wed, 27 May 2015 11:36:34 +0100
Subject: Implicit null-check in hot-loop?
In-Reply-To:
References:
Message-ID: <55659E32.9070007@redhat.com>

On 05/27/2015 11:27 AM, Benedikt Wedenik wrote:
>
> In the hot-loop there is this "ldr" which looked a little "strange" at the first glance.
> I think that this load is a null-check? Is that the case?

No. It's a safepoint check. HotSpot has to insert one of these because
it can't prove that the loop is reasonably short-lived.

> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are
> using the "test"-instruction which performs an "and" but only sets the flags discarding the result.
> Is there any similar instruction available on aarch64 or is this already the closest solution?

Closest to what? A load to XZR is the best solution: it does not hit
the flags. x86 cannot do this.

It looks like great code.

Andrew.

From benedikt.wedenik at theobroma-systems.com Wed May 27 10:46:32 2015
From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik)
Date: Wed, 27 May 2015 12:46:32 +0200
Subject: Implicit null-check in hot-loop?
In-Reply-To: <55659E32.9070007@redhat.com>
References: <55659E32.9070007@redhat.com>
Message-ID: <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com>

Oh!

Thanks :) So then it's not so bad, cause it is (I guess) obligatory.

Do you have any idea why this loop does not get unrolled?

Benedikt.

On 27 May 2015, at 12:36, Andrew Haley wrote:

> On 05/27/2015 11:27 AM, Benedikt Wedenik wrote:
>>
>> In the hot-loop there is this "ldr" which looked a little "strange" at the first glance.
>> I think that this load is a null-check? Is that the case?
>
> No. It's a safepoint check. HotSpot has to insert one of these because
> it can't prove that the loop is reasonably short-lived.
>
>> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are
>> using the "test"-instruction which performs an "and" but only sets the flags discarding the result.
>> Is there any similar instruction available on aarch64 or is this already the closest solution?
>
> Closest to what? A load to XZR is the best solution: it does not hit
> the flags. x86 cannot do this.
>
> It looks like great code.
>
> Andrew.
>

From vitalyd at gmail.com Wed May 27 11:39:48 2015
From: vitalyd at gmail.com (Vitaly Davidovich)
Date: Wed, 27 May 2015 07:39:48 -0400
Subject: Implicit null-check in hot-loop?
In-Reply-To: <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: Cause you're using a long as induction variable; change to int and it should unroll. sent from my phone On May 27, 2015 6:46 AM, "Benedikt Wedenik" < benedikt.wedenik at theobroma-systems.com> wrote: > Oh! > > Thanks :) So then it's not so bad, cause it is (I guess) obligatory. > > Do you have any idea why this loop does not get unrolled? > > Benedikt. > > On 27 May 2015, at 12:36, Andrew Haley wrote: > > > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: > >> > >> In the hot-loop there is this ?ldr? which looked a little ?strange? at > the first glance. > >> I think that this load is a null-check? Is that the case? > > > > No. It's a safepoint check. HotSpot has to insert one of these because > > it can't prove that the loop is reasonably short-lived. > > > >> I also investigated the generated code on x86 which is quite similar, > but instead of a load, they are > >> using the ?test?-instruction which performs an ?and? but only sets the > flags discarding the result. > >> Is there any similar instruction available on aarch64 or is this > already the closest solution? > > > > Closest to what? A load to XZR is the best solution: it does not hit > > the flags. x86 cannot do this. > > > > It looks like great code. > > > > Andrew. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 27 11:44:16 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 27 May 2015 07:44:16 -0400 Subject: Implicit null-check in hot-loop? 
In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: Also, the safepoint check on each iteration should go away when using an int loop; it'll be coalesced into once per unroll factor or may go away entirely if the loop turns into a counted loop. sent from my phone On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: > Cause you're using a long as induction variable; change to int and it > should unroll. > > sent from my phone > On May 27, 2015 6:46 AM, "Benedikt Wedenik" < > benedikt.wedenik at theobroma-systems.com> wrote: > >> Oh! >> >> Thanks :) So then it's not so bad, cause it is (I guess) obligatory. >> >> Do you have any idea why this loop does not get unrolled? >> >> Benedikt. >> >> On 27 May 2015, at 12:36, Andrew Haley wrote: >> >> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >> >> >> >> In the hot-loop there is this ?ldr? which looked a little ?strange? at >> the first glance. >> >> I think that this load is a null-check? Is that the case? >> > >> > No. It's a safepoint check. HotSpot has to insert one of these because >> > it can't prove that the loop is reasonably short-lived. >> > >> >> I also investigated the generated code on x86 which is quite similar, >> but instead of a load, they are >> >> using the ?test?-instruction which performs an ?and? but only sets the >> flags discarding the result. >> >> Is there any similar instruction available on aarch64 or is this >> already the closest solution? >> > >> > Closest to what? A load to XZR is the best solution: it does not hit >> > the flags. x86 cannot do this. >> > >> > It looks like great code. >> > >> > Andrew. >> > >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From benedikt.wedenik at theobroma-systems.com Wed May 27 12:09:22 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 27 May 2015 14:09:22 +0200 Subject: Implicit null-check in hot-loop? 
In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: First of all, thanks for your quick response :) You were absolutely right - but I do not understand why the long counter was the problem. Is there any reason why the loop-unrolling is only available for int loops? I mean I guess so - because else it would probably be implemented already. But still it would be a nice opportunity for an optimisation :) Benedikt. On 27 May 2015, at 13:44, Vitaly Davidovich wrote: > Also, the safepoint check on each iteration should go away when using an int loop; it'll be coalesced into once per unroll factor or may go away entirely if the loop turns into a counted loop. > > sent from my phone > > On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: > Cause you're using a long as induction variable; change to int and it should unroll. > > sent from my phone > > On May 27, 2015 6:46 AM, "Benedikt Wedenik" wrote: > Oh! > > Thanks :) So then it's not so bad, cause it is (I guess) obligatory. > > Do you have any idea why this loop does not get unrolled? > > Benedikt. > > On 27 May 2015, at 12:36, Andrew Haley wrote: > > > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: > >> > >> In the hot-loop there is this ?ldr? which looked a little ?strange? at the first glance. > >> I think that this load is a null-check? Is that the case? > > > > No. It's a safepoint check. HotSpot has to insert one of these because > > it can't prove that the loop is reasonably short-lived. > > > >> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are > >> using the ?test?-instruction which performs an ?and? but only sets the flags discarding the result. > >> Is there any similar instruction available on aarch64 or is this already the closest solution? > > > > Closest to what? A load to XZR is the best solution: it does not hit > > the flags. x86 cannot do this. 
> > > > It looks like great code. > > > > Andrew. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 27 12:25:24 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 27 May 2015 08:25:24 -0400 Subject: Implicit null-check in hot-loop? In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: I don't think there's any fundamental reason, it's just that int loops are more common and so were optimized. sent from my phone On May 27, 2015 8:09 AM, "Benedikt Wedenik" < benedikt.wedenik at theobroma-systems.com> wrote: > First of all, thanks for your quick response :) > You were absolutely right - but I do not understand why the long counter > was the problem. > > Is there any reason why the loop-unrolling is only available for int loops? > I mean I guess so - because else it would probably be implemented already. > But still it would be a nice opportunity for an optimisation :) > > Benedikt. > On 27 May 2015, at 13:44, Vitaly Davidovich wrote: > > Also, the safepoint check on each iteration should go away when using an > int loop; it'll be coalesced into once per unroll factor or may go away > entirely if the loop turns into a counted loop. > > sent from my phone > On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: > >> Cause you're using a long as induction variable; change to int and it >> should unroll. >> >> sent from my phone >> On May 27, 2015 6:46 AM, "Benedikt Wedenik" < >> benedikt.wedenik at theobroma-systems.com> wrote: >> >>> Oh! >>> >>> Thanks :) So then it's not so bad, cause it is (I guess) obligatory. >>> >>> Do you have any idea why this loop does not get unrolled? >>> >>> Benedikt. >>> >>> On 27 May 2015, at 12:36, Andrew Haley wrote: >>> >>> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >>> >> >>> >> In the hot-loop there is this ?ldr? which looked a little ?strange? >>> at the first glance. 
>>> >> I think that this load is a null-check? Is that the case? >>> > >>> > No. It's a safepoint check. HotSpot has to insert one of these >>> because >>> > it can't prove that the loop is reasonably short-lived. >>> > >>> >> I also investigated the generated code on x86 which is quite similar, >>> but instead of a load, they are >>> >> using the ?test?-instruction which performs an ?and? but only sets >>> the flags discarding the result. >>> >> Is there any similar instruction available on aarch64 or is this >>> already the closest solution? >>> > >>> > Closest to what? A load to XZR is the best solution: it does not hit >>> > the flags. x86 cannot do this. >>> > >>> > It looks like great code. >>> > >>> > Andrew. >>> > >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benedikt.wedenik at theobroma-systems.com Wed May 27 12:27:33 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 27 May 2015 14:27:33 +0200 Subject: Implicit null-check in hot-loop? In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: <25BD5D47-934E-4601-BB32-0F6507254C13@theobroma-systems.com> Okay :) Thank you again! Benedikt. On 27 May 2015, at 14:25, Vitaly Davidovich wrote: > I don't think there's any fundamental reason, it's just that int loops are more common and so were optimized. > > sent from my phone > > On May 27, 2015 8:09 AM, "Benedikt Wedenik" wrote: > First of all, thanks for your quick response :) > You were absolutely right - but I do not understand why the long counter was the problem. > > Is there any reason why the loop-unrolling is only available for int loops? > I mean I guess so - because else it would probably be implemented already. > But still it would be a nice opportunity for an optimisation :) > > Benedikt. 
> On 27 May 2015, at 13:44, Vitaly Davidovich wrote: > >> Also, the safepoint check on each iteration should go away when using an int loop; it'll be coalesced into once per unroll factor or may go away entirely if the loop turns into a counted loop. >> >> sent from my phone >> >> On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: >> Cause you're using a long as induction variable; change to int and it should unroll. >> >> sent from my phone >> >> On May 27, 2015 6:46 AM, "Benedikt Wedenik" wrote: >> Oh! >> >> Thanks :) So then it's not so bad, cause it is (I guess) obligatory. >> >> Do you have any idea why this loop does not get unrolled? >> >> Benedikt. >> >> On 27 May 2015, at 12:36, Andrew Haley wrote: >> >> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >> >> >> >> In the hot-loop there is this ?ldr? which looked a little ?strange? at the first glance. >> >> I think that this load is a null-check? Is that the case? >> > >> > No. It's a safepoint check. HotSpot has to insert one of these because >> > it can't prove that the loop is reasonably short-lived. >> > >> >> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are >> >> using the ?test?-instruction which performs an ?and? but only sets the flags discarding the result. >> >> Is there any similar instruction available on aarch64 or is this already the closest solution? >> > >> > Closest to what? A load to XZR is the best solution: it does not hit >> > the flags. x86 cannot do this. >> > >> > It looks like great code. >> > >> > Andrew. >> > >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 27 12:34:37 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 27 May 2015 08:34:37 -0400 Subject: Implicit null-check in hot-loop? 
In-Reply-To: <25BD5D47-934E-4601-BB32-0F6507254C13@theobroma-systems.com> References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> <25BD5D47-934E-4601-BB32-0F6507254C13@theobroma-systems.com> Message-ID: Also, my hunch is that loop unrolling in current production hotspot isn't necessarily a win for long trip counts since there's virtually no vectorization and I'm not sure making the loop code bigger is a net win since OoO CPUs can take care of them well on their own. Eliminating/reducing the safepoint check is probably the biggest win. sent from my phone On May 27, 2015 8:27 AM, "Benedikt Wedenik" < benedikt.wedenik at theobroma-systems.com> wrote: > Okay :) > > Thank you again! > > Benedikt. > On 27 May 2015, at 14:25, Vitaly Davidovich wrote: > > I don't think there's any fundamental reason, it's just that int loops are > more common and so were optimized. > > sent from my phone > On May 27, 2015 8:09 AM, "Benedikt Wedenik" < > benedikt.wedenik at theobroma-systems.com> wrote: > >> First of all, thanks for your quick response :) >> You were absolutely right - but I do not understand why the long counter >> was the problem. >> >> Is there any reason why the loop-unrolling is only available for int >> loops? >> I mean I guess so - because else it would probably be implemented already. >> But still it would be a nice opportunity for an optimisation :) >> >> Benedikt. >> On 27 May 2015, at 13:44, Vitaly Davidovich wrote: >> >> Also, the safepoint check on each iteration should go away when using an >> int loop; it'll be coalesced into once per unroll factor or may go away >> entirely if the loop turns into a counted loop. >> >> sent from my phone >> On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: >> >>> Cause you're using a long as induction variable; change to int and it >>> should unroll. 
>>>
>>> sent from my phone
>>> On May 27, 2015 6:46 AM, "Benedikt Wedenik" <
>>> benedikt.wedenik at theobroma-systems.com> wrote:
>>>
>>>> Oh!
>>>>
>>>> Thanks :) So then it's not so bad, cause it is (I guess) obligatory.
>>>>
>>>> Do you have any idea why this loop does not get unrolled?
>>>>
>>>> Benedikt.
>>>>
>>>> On 27 May 2015, at 12:36, Andrew Haley wrote:
>>>>
>>>> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote:
>>>> >>
>>>> >> In the hot-loop there is this "ldr" which looked a little "strange"
>>>> at the first glance.
>>>> >> I think that this load is a null-check? Is that the case?
>>>> >
>>>> > No. It's a safepoint check. HotSpot has to insert one of these
>>>> because
>>>> > it can't prove that the loop is reasonably short-lived.
>>>> >
>>>> >> I also investigated the generated code on x86 which is quite
>>>> similar, but instead of a load, they are
>>>> >> using the "test"-instruction which performs an "and" but only sets
>>>> the flags discarding the result.
>>>> >> Is there any similar instruction available on aarch64 or is this
>>>> already the closest solution?
>>>> >
>>>> > Closest to what? A load to XZR is the best solution: it does not hit
>>>> > the flags. x86 cannot do this.
>>>> >
>>>> > It looks like great code.
>>>> >
>>>> > Andrew.
>>>> >
>>>>
>>>>
>>
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From roland.westrelin at oracle.com  Wed May 27 14:08:52 2015
From: roland.westrelin at oracle.com (Roland Westrelin)
Date: Wed, 27 May 2015 16:08:52 +0200
Subject: [8u60] backport RFR: 8060036 and 8080156
In-Reply-To: <55649771.2050105@oracle.com>
References: <55649771.2050105@oracle.com>
Message-ID: <255230A8-017F-4C7F-9E10-37D74F973091@oracle.com>

That looks good to me. I'll push it.

Roland.
> On May 26, 2015, at 5:55 PM, Andreas Eriksson wrote:
> 
> Hi,
> 
> I'd like to backport the fix for
> 8060036: C2: CmpU nodes can end up with wrong type information
> http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/53707cf9a443
> 
> And at the same time the test that was pushed by Tobias with
> 8080156: Integer.toString(int value) sometimes throws NPE
> http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/b2e3cbd555fc
> 
> The changes applied cleanly from the jdk9 changesets.
> A jprt run with the new test passes.
> 
> Thanks,
> Andreas

From andreas.eriksson at oracle.com  Wed May 27 14:21:41 2015
From: andreas.eriksson at oracle.com (Andreas Eriksson)
Date: Wed, 27 May 2015 16:21:41 +0200
Subject: [8u60] backport RFR: 8060036 and 8080156
In-Reply-To: <255230A8-017F-4C7F-9E10-37D74F973091@oracle.com>
References: <55649771.2050105@oracle.com> <255230A8-017F-4C7F-9E10-37D74F973091@oracle.com>
Message-ID: <5565D2F5.70406@oracle.com>

Thanks!

On 2015-05-27 16:08, Roland Westrelin wrote:
> That looks good to me. I'll push it.
>
> Roland.
>
>> On May 26, 2015, at 5:55 PM, Andreas Eriksson wrote:
>>
>> Hi,
>>
>> I'd like to backport the fix for
>> 8060036: C2: CmpU nodes can end up with wrong type information
>> http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/53707cf9a443
>>
>> And at the same time the test that was pushed by Tobias with
>> 8080156: Integer.toString(int value) sometimes throws NPE
>> http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/b2e3cbd555fc
>>
>> The changes applied cleanly from the jdk9 changesets.
>> A jprt run with the new test passes.
>>
>> Thanks,
>> Andreas

From aph at redhat.com  Wed May 27 17:11:50 2015
From: aph at redhat.com (Andrew Haley)
Date: Wed, 27 May 2015 18:11:50 +0100
Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration]
In-Reply-To: <555708BE.8090100@redhat.com>
References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com>
Message-ID: <5565FAD6.5010409@redhat.com>

An update: I'm still working on this.

Following last week's revelations [1] it seems to me that a faster
implementation of (integer) D-H is even more important.

I've spent a couple of days tracking down an extremely odd feature
(bug?) in MutableBigInteger which was breaking everything, but I'm
past that now. I'm trying to produce an intrinsic implementation of
the core modular exponentiation which is as fast as any
state-of-the-art implementation while disrupting the common code as
little as possible; this is not easy. I hope to have something which
is faster on all processors, not just those for which we have
hand-coded assembly-language implementations.

I don't think that my work should be any impediment to Sandhya's patch
for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/
being committed. It'll still be useful.

Andrew.
[1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice https://weakdh.org/imperfect-forward-secrecy.pdf From vladimir.x.ivanov at oracle.com Wed May 27 18:19:09 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 27 May 2015 21:19:09 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <556006DA.8030202@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> Message-ID: <55660A9D.6090809@oracle.com> Vladimir, Dean, thanks for the feedback. I like right_n_bits trick. Here's an updated webrev: http://cr.openjdk.java.net/~vlivanov/8001622/webrev.02 Got rid of immU16 since set can encode 32-bit constants. Best regards, Vladimir Ivanov On 5/23/15 7:49 AM, Dean Long wrote: > Isn't this what you want: > > 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long > Register > 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ > 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); > 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); > 5660 > 5661 size(2*4); > 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" > 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} > 5664 ins_encode %{ > 5665 __ ldub($mem$$Address, $dst$$Register); > 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), > $dst$$Register); > 5667 %} > 5668 ins_pipe(iload_mem); > 5669 %} > > dl > > On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: >> Updated webrev: >> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 >> >> Introduced immU8/immU16 operands on sparc. >> >> As a cleanup, sorted immI/immU operand declarations. >> >> Best regards, >> Vladimir Ivanov >> >> On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >>> Thanks for looking into the fix, Roland. 
>>>
>>> On 5/22/15 5:57 PM, Roland Westrelin wrote:
>>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00
>>>>> https://bugs.openjdk.java.net/browse/JDK-8001622
>>>>>
>>>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 rules can be relaxed from
>>>>> immI8/immI16 to immI, since zero-extending move is used.
>>>>
>>>> The change on sparc doesn't look good:
>>>> __ and3($dst$$Register, $mask$$constant, $dst$$Register);
>>>>
>>>> the and instruction cannot encode arbitrary large integer constants.
>>> Good catch. I missed that it is and3.
>>>
>>>> Isn't the root of the problem that we want an unsigned 8 bit integer
>>>> and not a signed one?
>>> Yes, unsigned 8-bit integer work fine, but it can be generalized to
>>> arbitrary masks as well.
>>>
>>> I experimented with new operand types (immU8, immU16), but it was more
>>> code for no particular benefit. I can use them on sparc.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>

From dean.long at oracle.com  Wed May 27 19:04:02 2015
From: dean.long at oracle.com (Dean Long)
Date: Wed, 27 May 2015 12:04:02 -0700
Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks
In-Reply-To: <55660A9D.6090809@oracle.com>
References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com>
Message-ID: <55661522.4010703@oracle.com>

You should be able to use right_n_bits on x86 too, to allow the smaller
encoding with an 8 bit immediate instead of 32.

Also, for this on sparc

5791 __ set($mask$$constant, Rtmp);

I believe

__ set(right_n_bits($mask$$constant, 16), Rtmp)

will give the same result in fewer instructions.

dl

On 5/27/2015 11:19 AM, Vladimir Ivanov wrote:
> Vladimir, Dean, thanks for the feedback.
>
> I like right_n_bits trick.
> Here's an updated webrev: > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.02 > > Got rid of immU16 since set can encode 32-bit constants. > > Best regards, > Vladimir Ivanov > > On 5/23/15 7:49 AM, Dean Long wrote: >> Isn't this what you want: >> >> 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long >> Register >> 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ >> 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); >> 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); >> 5660 >> 5661 size(2*4); >> 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" >> 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} >> 5664 ins_encode %{ >> 5665 __ ldub($mem$$Address, $dst$$Register); >> 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), >> $dst$$Register); >> 5667 %} >> 5668 ins_pipe(iload_mem); >> 5669 %} >> >> dl >> >> On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: >>> Updated webrev: >>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 >>> >>> Introduced immU8/immU16 operands on sparc. >>> >>> As a cleanup, sorted immI/immU operand declarations. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >>>> Thanks for looking into the fix, Roland. >>>> >>>> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>>>> >>>>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>>>> immI8/immI16 to immI, since zero-extending move is used. >>>>> >>>>> The change on sparc doesn?t look good: >>>>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>>>> >>>>> the and instruction cannot encode arbitrary large integer constants. >>>> Good catch. I missed that it is and3. >>>> >>>>> Isn?t the root of the problem that we want an unsigned 8 bit integer >>>>> and not a signed one? 
>>>> Yes, unsigned 8-bit integer work fine, but it can be generalized to >>>> arbitrary masks as well. >>>> >>>> I experimented with new operand types (immU8, immU16), but it was more >>>> code for no particular benefit. I can use them on sparc. >>>> >>>> Best regards, >>>> Vladimir Ivanov >> From sandhya.viswanathan at intel.com Thu May 28 01:27:53 2015 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Thu, 28 May 2015 01:27:53 +0000 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <5565FAD6.5010409@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> Message-ID: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> Hi Tony, Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. Best Regards, Sandhya -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley Sent: Wednesday, May 27, 2015 10:12 AM To: Christian Thalinger Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] An update: I'm still working on this. 
Following last week's revelations [1] it seems to me that a faster
implementation of (integer) D-H is even more important.

I've spent a couple of days tracking down an extremely odd feature
(bug?) in MutableBigInteger which was breaking everything, but I'm
past that now. I'm trying to produce an intrinsic implementation of
the core modular exponentiation which is as fast as any
state-of-the-art implementation while disrupting the common code as
little as possible; this is not easy. I hope to have something which
is faster on all processors, not just those for which we have
hand-coded assembly-language implementations.

I don't think that my work should be any impediment to Sandhya's patch
for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/
being committed. It'll still be useful.

Andrew.

[1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
https://weakdh.org/imperfect-forward-secrecy.pdf

From roland.westrelin at oracle.com  Thu May 28 13:55:10 2015
From: roland.westrelin at oracle.com (Roland Westrelin)
Date: Thu, 28 May 2015 15:55:10 +0200
Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression)
Message-ID: 

http://cr.openjdk.java.net/~roland/8080976/webrev.00/

A loop where the result of the reduction is used in the loop shouldn't be vectorized.

Roland.

From vitalyd at gmail.com  Thu May 28 14:15:18 2015
From: vitalyd at gmail.com (Vitaly Davidovich)
Date: Thu, 28 May 2015 10:15:18 -0400
Subject: Inlining methods with large switch statements
Message-ID: 

Hi guys,

C2 will fail to inline hot methods over FreqInlineSize that contain a large
dense switch statement. The actual machine code here uses a jump table so
native size is small and I'd expect inlining. Is this something that can be
handled better?

Thanks

sent from my phone
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From vladimir.kozlov at oracle.com Thu May 28 16:32:45 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 28 May 2015 09:32:45 -0700 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: References: Message-ID: <5567432D.3040108@oracle.com> Looks good. Thank you for fixing this. Vladimir On 5/28/15 6:55 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080976/webrev.00/ > > A loop where the result of the reduction is used in the loop shouldn?t be vectorized. > > Roland. > From michael.c.berg at intel.com Thu May 28 16:43:30 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 28 May 2015 16:43:30 +0000 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: <5567432D.3040108@oracle.com> References: <5567432D.3040108@oracle.com> Message-ID: I will also review the change, thx, it will be just a bit. -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Kozlov Sent: Thursday, May 28, 2015 9:33 AM To: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) Looks good. Thank you for fixing this. Vladimir On 5/28/15 6:55 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080976/webrev.00/ > > A loop where the result of the reduction is used in the loop shouldn?t be vectorized. > > Roland. 
> From michael.c.berg at intel.com Thu May 28 17:18:34 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 28 May 2015 17:18:34 +0000 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) References: <5567432D.3040108@oracle.com> Message-ID: Roland I would add a condition so that we save a step, and a comment or two: bool ok = false; for (unsigned j = 1; j < def_node->req(); j++) { Node* in = def_node->in(j); if (in == phi) { ok = true; break; } } // do nothing if we did not match the initial criteria if (ok == false) { continue; } // The result of the reduction must not be used in the loop for (DUIterator_Fast imax, i = def_node->fast_outs(imax); i < imax && ok; i++) { Node* u = def_node->fast_out(i); if (has_ctrl(u) && !loop->is_member(get_loop(get_ctrl(u)))) { continue; } if (u == phi) { continue; } ok = false; } // iff the uses conform if (ok) { def_node->add_flag(Node::Flag_is_reduction); } I checked the resultant code for the bug, as is the case for the change in the webrev, the reduction is no longer emitted. I also checked the relevant micros and verified the desired behavior is still present. Regards, -Michael -----Original Message----- From: Berg, Michael C Sent: Thursday, May 28, 2015 9:43 AM To: 'Vladimir Kozlov'; hotspot-compiler-dev at openjdk.java.net Subject: RE: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) I will also review the change, thx, it will be just a bit. -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Kozlov Sent: Thursday, May 28, 2015 9:33 AM To: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) Looks good. Thank you for fixing this. 
Vladimir On 5/28/15 6:55 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080976/webrev.00/ > > A loop where the result of the reduction is used in the loop shouldn?t be vectorized. > > Roland. > From vladimir.x.ivanov at oracle.com Thu May 28 17:40:13 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 28 May 2015 20:40:13 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <55661522.4010703@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com> <55661522.4010703@oracle.com> Message-ID: <556752FD.6060605@oracle.com> Dean, what do you think about the following: http://cr.openjdk.java.net/~vlivanov/8001622/webrev.03 Best regards, Vladimir Ivanov On 5/27/15 10:04 PM, Dean Long wrote: > You should be able to use right_n_bits on x86 too, to allow the smaller > encoding with an 8 bit immediate instead of 32. > > Also, for this on sparc > > 5791 __ set($mask$$constant, Rtmp); > > I believe > > __ set(right_n_bits($mask$$constant, 16), Rtmp) > > will give the same result in fewer instructions. > > dl > > On 5/27/2015 11:19 AM, Vladimir Ivanov wrote: >> Vladimir, Dean, thanks for the feedback. >> >> I like right_n_bits trick. >> Here's an updated webrev: >> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.02 >> >> Got rid of immU16 since set can encode 32-bit constants. 
>> >> Best regards, >> Vladimir Ivanov >> >> On 5/23/15 7:49 AM, Dean Long wrote: >>> Isn't this what you want: >>> >>> 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long >>> Register >>> 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ >>> 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); >>> 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); >>> 5660 >>> 5661 size(2*4); >>> 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" >>> 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} >>> 5664 ins_encode %{ >>> 5665 __ ldub($mem$$Address, $dst$$Register); >>> 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), >>> $dst$$Register); >>> 5667 %} >>> 5668 ins_pipe(iload_mem); >>> 5669 %} >>> >>> dl >>> >>> On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: >>>> Updated webrev: >>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 >>>> >>>> Introduced immU8/immU16 operands on sparc. >>>> >>>> As a cleanup, sorted immI/immU operand declarations. >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >>>>> Thanks for looking into the fix, Roland. >>>>> >>>>> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>>>>> >>>>>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>>>>> immI8/immI16 to immI, since zero-extending move is used. >>>>>> >>>>>> The change on sparc doesn?t look good: >>>>>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>>>>> >>>>>> the and instruction cannot encode arbitrary large integer constants. >>>>> Good catch. I missed that it is and3. >>>>> >>>>>> Isn?t the root of the problem that we want an unsigned 8 bit integer >>>>>> and not a signed one? >>>>> Yes, unsigned 8-bit integer work fine, but it can be generalized to >>>>> arbitrary masks as well. 
>>>>> >>>>> I experimented with new operand types (immU8, immU16), but it was more >>>>> code for no particular benefit. I can use them on sparc. >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>> > From dean.long at oracle.com Thu May 28 17:43:04 2015 From: dean.long at oracle.com (Dean Long) Date: Thu, 28 May 2015 10:43:04 -0700 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <556752FD.6060605@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com> <55661522.4010703@oracle.com> <556752FD.6060605@oracle.com> Message-ID: <556753A8.6030909@oracle.com> Looks good. dl On 5/28/2015 10:40 AM, Vladimir Ivanov wrote: > Dean, what do you think about the following: > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.03 > > Best regards, > Vladimir Ivanov > > On 5/27/15 10:04 PM, Dean Long wrote: >> You should be able to use right_n_bits on x86 too, to allow the smaller >> encoding with an 8 bit immediate instead of 32. >> >> Also, for this on sparc >> >> 5791 __ set($mask$$constant, Rtmp); >> >> I believe >> >> __ set(right_n_bits($mask$$constant, 16), Rtmp) >> >> will give the same result in fewer instructions. >> >> dl >> >> On 5/27/2015 11:19 AM, Vladimir Ivanov wrote: >>> Vladimir, Dean, thanks for the feedback. >>> >>> I like right_n_bits trick. >>> Here's an updated webrev: >>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.02 >>> >>> Got rid of immU16 since set can encode 32-bit constants. 
>>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 5/23/15 7:49 AM, Dean Long wrote: >>>> Isn't this what you want: >>>> >>>> 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long >>>> Register >>>> 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ >>>> 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); >>>> 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); >>>> 5660 >>>> 5661 size(2*4); >>>> 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" >>>> 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} >>>> 5664 ins_encode %{ >>>> 5665 __ ldub($mem$$Address, $dst$$Register); >>>> 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), >>>> $dst$$Register); >>>> 5667 %} >>>> 5668 ins_pipe(iload_mem); >>>> 5669 %} >>>> >>>> dl >>>> >>>> On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: >>>>> Updated webrev: >>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 >>>>> >>>>> Introduced immU8/immU16 operands on sparc. >>>>> >>>>> As a cleanup, sorted immI/immU operand declarations. >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >>>>>> Thanks for looking into the fix, Roland. >>>>>> >>>>>> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>>>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>>>>>> >>>>>>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>>>>>> immI8/immI16 to immI, since zero-extending move is used. >>>>>>> >>>>>>> The change on sparc doesn't look good: >>>>>>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>>>>>> >>>>>>> the and instruction cannot encode arbitrary large integer >>>>>>> constants. >>>>>> Good catch. I missed that it is and3. >>>>>> >>>>>>> Isn't the root of the problem that we want an unsigned 8 bit >>>>>>> integer >>>>>>> and not a signed one?
>>>>>> Yes, unsigned 8-bit integers work fine, but it can be generalized to >>>>>> arbitrary masks as well. >>>>>> >>>>>> I experimented with new operand types (immU8, immU16), but it was >>>>>> more >>>>>> code for no particular benefit. I can use them on sparc. >>>>>> >>>>>> Best regards, >>>>>> Vladimir Ivanov >>>> >> From dawid.weiss at gmail.com Thu May 28 20:41:26 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Thu, 28 May 2015 22:41:26 +0200 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: References: Message-ID: Your quick follow-up to this issue is much appreciated! (I filed the bug via Rory). Dawid On Thu, May 28, 2015 at 3:55 PM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080976/webrev.00/ > > A loop where the result of the reduction is used in the loop shouldn't be vectorized. > > Roland. From roland.westrelin at oracle.com Fri May 29 07:53:31 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 29 May 2015 09:53:31 +0200 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <556752FD.6060605@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com> <55661522.4010703@oracle.com> <556752FD.6060605@oracle.com> Message-ID: > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.03 That looks good to me. Roland.
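The right_n_bits trick reviewed in the thread above rests on a simple identity: after a zero-extending byte load, only the low 8 bits of the loaded value can be set, so AND-ing with the full 32-bit mask and AND-ing with just its low 8 bits produce the same result. A minimal plain-Java sketch of that identity (not OpenJDK code; rightNBits here is only a stand-in for HotSpot's right_n_bits macro):

```java
import java.util.Random;

public class MaskTruncation {
    // Stand-in for HotSpot's right_n_bits(x, n): keep only the low n bits.
    static int rightNBits(int x, int n) {
        return x & ((1 << n) - 1);
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int raw = r.nextInt();
            int mask = r.nextInt();
            int loaded = raw & 0xFF;                  // models the zero-extending byte load
            int full = loaded & mask;                 // AND with the full 32-bit mask
            int trunc = loaded & rightNBits(mask, 8); // AND with only the low 8 mask bits
            if (full != trunc) {
                throw new AssertionError("mismatch for mask " + mask);
            }
        }
        System.out.println("ok");
    }
}
```

The same reasoning is why truncating the mask lets sparc use and3 (which cannot encode arbitrary 32-bit immediates) and lets x86 pick the shorter 8-bit immediate encoding.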
From vladimir.x.ivanov at oracle.com Fri May 29 10:50:29 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 29 May 2015 13:50:29 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com> <55661522.4010703@oracle.com> <556752FD.6060605@oracle.com> Message-ID: <55684475.90905@oracle.com> Thanks for review, Dean, Roland & Vladimir. Best regards, Vladimir Ivanov On 5/29/15 10:53 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.03 > > That looks good to me. > > Roland. > From anthony.scarpino at oracle.com Thu May 28 23:39:59 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Thu, 28 May 2015 16:39:59 -0700 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> Message-ID: <5567A74F.4040105@oracle.com> Personally I think it better to not have implSquareToLenChecks() and implMulAddCheck() as separate methods and to have the range check in squareToLen and mulAdd.
Given these changes are about performance, it seems unnecessary to add an extra call to a method. While we are changing BigInteger, should a range check for multiplyToLen be added? Or is there a different bug for that? Tony On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: > Hi Tony, > > Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: > > http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ > > Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. > > Best Regards, > Sandhya > > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley > Sent: Wednesday, May 27, 2015 10:12 AM > To: Christian Thalinger > Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net > Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] > > An update: > > I'm still working on this. Following last week's revelations [1] it > seems to me that a faster implementation of (integer) D-H is even more > important. > > I've spent a couple of days tracking down an extremely odd feature > (bug?) in MutableBigInteger which was breaking everything, but I'm > past that now. I'm trying to produce an intrinsic implementation of > the core modular exponentiation which is as fast as any state-of-the- > art implementation while disrupting the common code as little as > possible; this is not easy. > > I hope to have something which is faster on all processors, not just > those for which we have hand-coded assembly-language implementations. > > I don't think that my work should be any impediment to Sandhya's patch > for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ > being committed. It'll still be useful. > > Andrew.
> > > [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice > https://weakdh.org/imperfect-forward-secrecy.pdf > From roland.westrelin at oracle.com Fri May 29 14:06:21 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 29 May 2015 16:06:21 +0200 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: References: Message-ID: <9D656012-9AD5-4723-9DE6-5810CAA916D0@oracle.com> > Your quick follow-up to this issue is much appreciated! (I filed the > bug via Rory). Having a reliable way to reproduce it helped a lot: thanks for providing a reproducer. Roland. From roland.westrelin at oracle.com Fri May 29 14:09:58 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 29 May 2015 16:09:58 +0200 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: References: <5567432D.3040108@oracle.com> Message-ID: <03323516-B787-4D8B-AF1F-C3D54C3BE1F1@oracle.com> Michael, > Roland I would add a condition so that we save a step, and a comment or two: Thanks for the suggestions. I'll go with your version. > I checked the resultant code for the bug, as is the case for the change in the webrev, the reduction is no longer emitted. I also checked the relevant micros and verified the desired behavior is still present. Thanks for taking the time to verify the fix is correct. Roland. From Ulf.Zibis at CoSoCo.de Fri May 29 16:48:28 2015 From: Ulf.Zibis at CoSoCo.de (Ulf Zibis) Date: Fri, 29 May 2015 18:48:28 +0200 Subject: Why isn't Object.notify() a synchronized method? In-Reply-To: <5567420A.9020406@redhat.com> References: <55673D97.4010505@CoSoCo.de> <5567420A.9020406@redhat.com> Message-ID: <5568985C.6050507@CoSoCo.de> Thanks for your hint David. That's the only reason I could imagine too. Can somebody tell something about the cost for recursive lock acquisition in comparison to the whole call, couldn't it be eliminated by Hotspot?
As I recently fell into the trap of forgetting the synchronized block around a single notifyAll(), I believe the current situation is just error-prone. Any comments about the Javadoc issue? -Ulf Am 28.05.2015 um 18:27 schrieb David M. Lloyd: > Since most of the time you have to hold the lock anyway for other reasons, I think this would > generally be an unwelcome change since I expect the cost of recursive lock acquisition is nonzero. > > On 05/28/2015 11:08 AM, Ulf Zibis wrote: >> Hi all, >> >> in the Javadoc of notify(), notifyAll() and wait(...) I read that these >> methods should only be used with synchronisation on its instance. >> So I'm wondering why they don't have the synchronized modifier out of >> the box in Object class. >> >> Also I think, the following note should be moved from wait(long,int) to >> wait(long): >> /The current thread must own this object's monitor. The thread releases >> ownership of this monitor and waits until either of the following two >> conditions has occurred:// >> / >> >> * /Another thread notifies threads waiting on this object's monitor to >> wake up either through a >> call to the notify method or the notifyAll method./ >> * /The timeout period, specified by timeout milliseconds plus nanos >> nanoseconds arguments, has >> elapsed. / >> >> >> >> Cheers, >> >> Ulf >> > From vladimir.kozlov at oracle.com Fri May 29 17:38:20 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 29 May 2015 10:38:20 -0700 Subject: Implicit null-check in hot-loop? In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: <5568A40C.9000602@oracle.com> There is reason. We do long arithmetic to prove that int index will not overflow in canonical (counted) loops but we can't do that cheaply for long index. And 99% of loops use int index - so we don't want to spend code on edge case.
Regards, Vladimir On 5/27/15 5:25 AM, Vitaly Davidovich wrote: > I don't think there's any fundamental reason, it's just that int loops are more common and so were optimized. > > sent from my phone > > On May 27, 2015 8:09 AM, "Benedikt Wedenik" > wrote: > > First of all, thanks for your quick response :) > You were absolutely right - but I do not understand why the long counter was the problem. > > Is there any reason why the loop-unrolling is only available for int loops? > I mean I guess so - because else it would probably be implemented already. > But still it would be a nice opportunity for an optimisation :) > > Benedikt. > On 27 May 2015, at 13:44, Vitaly Davidovich > wrote: > >> Also, the safepoint check on each iteration should go away when using an int loop; it'll be coalesced into once >> per unroll factor or may go away entirely if the loop turns into a counted loop. >> >> sent from my phone >> >> On May 27, 2015 7:39 AM, "Vitaly Davidovich" > wrote: >> >> Cause you're using a long as induction variable; change to int and it should unroll. >> >> sent from my phone >> >> On May 27, 2015 6:46 AM, "Benedikt Wedenik" > > wrote: >> >> Oh! >> >> Thanks :) So then it's not so bad, cause it is (I guess) obligatory. >> >> Do you have any idea why this loop does not get unrolled? >> >> Benedikt. >> >> On 27 May 2015, at 12:36, Andrew Haley > wrote: >> >> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >> >> >> >> In the hot-loop there is this 'ldr' which looked a little 'strange' at the first glance. >> >> I think that this load is a null-check? Is that the case? >> > >> > No. It's a safepoint check. HotSpot has to insert one of these because >> > it can't prove that the loop is reasonably short-lived. >> > >> >> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are >> >> using the 'test'-instruction which performs an 'and' but only sets the flags discarding the result.
>> >> Is there any similar instruction available on aarch64 or is this already the closest solution? >> > >> > Closest to what? A load to XZR is the best solution: it does not hit >> > the flags. x86 cannot do this. >> > >> > It looks like great code. >> > >> > Andrew. >> > >> > From vitalyd at gmail.com Fri May 29 17:48:47 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 29 May 2015 13:48:47 -0400 Subject: Implicit null-check in hot-loop? In-Reply-To: <5568A40C.9000602@oracle.com> References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> <5568A40C.9000602@oracle.com> Message-ID: Hi Vladimir, Given java has no unsigned long, wouldn't this be somewhat straightforward to test? I agree that it's unlikely to be worth it given most loops are int. Thanks On Fri, May 29, 2015 at 1:38 PM, Vladimir Kozlov wrote: > There is reason. We do long arithmetic to prove that int index will not > overflow in canonical (counted) loops but we can't do that cheaply for long > index. And 99% of loops use int index - so we don't want to spend code on > edge case. > > Regards, > Vladimir > > > On 5/27/15 5:25 AM, Vitaly Davidovich wrote: > >> I don't think there's any fundamental reason, it's just that int loops >> are more common and so were optimized. >> >> sent from my phone >> >> On May 27, 2015 8:09 AM, "Benedikt Wedenik" < >> benedikt.wedenik at theobroma-systems.com >> > wrote: >> >> First of all, thanks for your quick response :) >> You were absolutely right - but I do not understand why the long >> counter was the problem. >> >> Is there any reason why the loop-unrolling is only available for int >> loops? >> I mean I guess so - because else it would probably be implemented >> already. >> But still it would be a nice opportunity for an optimisation :) >> >> Benedikt. 
>> On 27 May 2015, at 13:44, Vitaly Davidovich > > wrote: >> >> Also, the safepoint check on each iteration should go away when >>> using an int loop; it'll be coalesced into once >>> per unroll factor or may go away entirely if the loop turns into a >>> counted loop. >>> >>> sent from my phone >>> >>> On May 27, 2015 7:39 AM, "Vitaly Davidovich" >> > wrote: >>> >>> Cause you're using a long as induction variable; change to int >>> and it should unroll. >>> >>> sent from my phone >>> >>> On May 27, 2015 6:46 AM, "Benedikt Wedenik" < >>> benedikt.wedenik at theobroma-systems.com >>> > wrote: >>> >>> Oh! >>> >>> Thanks :) So then it's not so bad, cause it is (I guess) >>> obligatory. >>> >>> Do you have any idea why this loop does not get unrolled? >>> >>> Benedikt. >>> >>> On 27 May 2015, at 12:36, Andrew Haley >> > wrote: >>> >>> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >>> >> >>> >> In the hot-loop there is this 'ldr' which looked a little >>> 'strange' at the first glance. >>> >> I think that this load is a null-check? Is that the case? >>> > >>> > No. It's a safepoint check. HotSpot has to insert one of >>> these because >>> > it can't prove that the loop is reasonably short-lived. >>> > >>> >> I also investigated the generated code on x86 which is >>> quite similar, but instead of a load, they are >>> >> using the 'test'-instruction which performs an 'and' but >>> only sets the flags discarding the result. >>> >> Is there any similar instruction available on aarch64 or >>> is this already the closest solution? >>> > >>> > Closest to what? A load to XZR is the best solution: it >>> does not hit >>> > the flags. x86 cannot do this. >>> > >>> > It looks like great code. >>> > >>> > Andrew. >>> > >>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed...
URL: From david.holmes at oracle.com Sun May 31 04:31:25 2015 From: david.holmes at oracle.com (David Holmes) Date: Sun, 31 May 2015 14:31:25 +1000 Subject: Why isn't Object.notify() a synchronized method? In-Reply-To: <5568985C.6050507@CoSoCo.de> References: <55673D97.4010505@CoSoCo.de> <5567420A.9020406@redhat.com> <5568985C.6050507@CoSoCo.de> Message-ID: <556A8E9D.4040407@oracle.com> On 30/05/2015 2:48 AM, Ulf Zibis wrote: > Thanks for your hint David. That's the only reason I could imagine too. > Can somebody tell something about the cost for recursive lock > acquisition in comparison to the whole call, couldn't it be eliminated > by Hotspot? There are a number of fast paths related to recursive locking. Not sure if inlining allows lock elision; that's something for the compiler folk. > As I recently fell into the trap of forgetting the synchronized block > around a single notifyAll(), I believe the current situation is just > error-prone. How is it error-prone? You forgot to acquire the lock and you got an IllegalMonitorStateException when you did notifyAll. That's pointing out your error. > Any comments about the Javadoc issue? I did comment - I don't see any issue. David H. > -Ulf > > > Am 28.05.2015 um 18:27 schrieb David M. Lloyd: >> Since most of the time you have to hold the lock anyway for other >> reasons, I think this would generally be an unwelcome change since I >> expect the cost of recursive lock acquisition is nonzero. >> >> On 05/28/2015 11:08 AM, Ulf Zibis wrote: >>> Hi all, >>> >>> in the Javadoc of notify(), notifyAll() and wait(...) I read that these >>> methods should only be used with synchronisation on its instance. >>> So I'm wondering why they don't have the synchronized modifier out of >>> the box in Object class. >>> >>> Also I think, the following note should be moved from wait(long,int) to >>> wait(long): >>> /The current thread must own this object's monitor.
The thread releases >>> ownership of this monitor and waits until either of the following two >>> conditions has occurred:// >>> / >>> >>> * /Another thread notifies threads waiting on this object's monitor to >>> wake up either through a >>> call to the notify method or the notifyAll method./ >>> * /The timeout period, specified by timeout milliseconds plus nanos >>> nanoseconds arguments, has >>> elapsed. / >>> >>> >>> >>> Cheers, >>> >>> Ulf >>> >> >
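The rule the thread above keeps returning to can be shown in a few lines: wait(), notify() and notifyAll() may only be called while holding the object's monitor, and forgetting the synchronized block fails fast with IllegalMonitorStateException, exactly as David Holmes describes. A minimal sketch (hypothetical example, not code from the thread):

```java
public class NotifyDemo {
    private static final Object lock = new Object();

    public static void main(String[] args) {
        try {
            lock.notifyAll();                 // monitor not held: the JVM rejects this
        } catch (IllegalMonitorStateException e) {
            System.out.println("without lock: IllegalMonitorStateException");
        }
        synchronized (lock) {                 // the synchronized block that was forgotten
            lock.notifyAll();                 // monitor held: succeeds
            System.out.println("with lock: ok");
        }
    }
}
```

Since callers almost always hold the monitor anyway to update the condition being signalled, making notifyAll() itself synchronized would mostly add a recursive lock acquisition, which is the cost David M. Lloyd objects to above.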