From vladimir.kozlov at oracle.com Fri May 1 20:54:19 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 01 May 2015 13:54:19 -0700 Subject: RFR(XXS) 8079231: quarantine compiler/jsr292/CallSiteDepContextTest.java Message-ID: <5543E7FB.9010605@oracle.com> http://cr.openjdk.java.net/~kvn/8079231/webrev/ https://bugs.openjdk.java.net/browse/JDK-8079231 Quarantine the test until next bug is fixed: https://bugs.openjdk.java.net/browse/JDK-8079205 Thanks, Vladimir From dean.long at oracle.com Fri May 1 21:14:48 2015 From: dean.long at oracle.com (Dean Long) Date: Fri, 01 May 2015 14:14:48 -0700 Subject: RFR(XXS) 8079231: quarantine compiler/jsr292/CallSiteDepContextTest.java In-Reply-To: <5543E7FB.9010605@oracle.com> References: <5543E7FB.9010605@oracle.com> Message-ID: <5543ECC8.6090505@oracle.com> Looks good. dl On 5/1/2015 1:54 PM, Vladimir Kozlov wrote: > http://cr.openjdk.java.net/~kvn/8079231/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8079231 > > Quarantine the test until next bug is fixed: > > https://bugs.openjdk.java.net/browse/JDK-8079205 > > Thanks, > Vladimir From vladimir.kozlov at oracle.com Fri May 1 21:19:12 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 01 May 2015 14:19:12 -0700 Subject: RFR(XXS) 8079231: quarantine compiler/jsr292/CallSiteDepContextTest.java In-Reply-To: <5543ECC8.6090505@oracle.com> References: <5543E7FB.9010605@oracle.com> <5543ECC8.6090505@oracle.com> Message-ID: <5543EDD0.90407@oracle.com> Thank you, Dean Vladimir On 5/1/15 2:14 PM, Dean Long wrote: > Looks good. 
> > dl > > On 5/1/2015 1:54 PM, Vladimir Kozlov wrote: >> http://cr.openjdk.java.net/~kvn/8079231/webrev/ >> https://bugs.openjdk.java.net/browse/JDK-8079231 >> >> Quarantine the test until next bug is fixed: >> >> https://bugs.openjdk.java.net/browse/JDK-8079205 >> >> Thanks, >> Vladimir > From tobias.hartmann at oracle.com Mon May 4 05:38:53 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 04 May 2015 07:38:53 +0200 Subject: [9] RFR(S): 8078497: C2's superword optimization causes unaligned memory accesses In-Reply-To: <554254A9.6060608@oracle.com> References: <5540D237.8000107@oracle.com> <5541172D.5010200@oracle.com> <5541F0B1.50804@oracle.com> <554254A9.6060608@oracle.com> Message-ID: <554705ED.4050305@oracle.com> Thanks, Vladimir. Best, Tobias On 30.04.2015 18:13, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 4/30/15 2:06 AM, Tobias Hartmann wrote: >> Hi Vladimir, >> >> thanks for the review! >> >> On 29.04.2015 19:38, Vladimir Kozlov wrote: >>> Consider reducing number of iterations in test from 1M so that the test does not timeout on slow platforms. >> >> I reduced the number of iterations to 20k and verified that it is enough to trigger compilation and crash on Sparc. >> >>> lines 511-515 are duplicates of 486-491. Consider moving them before vw%span check. >> >> Thanks, I refactored that part. >> >>> Typo 'iff': >>> + // final offset is a multiple of vw iff init_offset is a multiple. >> >> I used 'iff' to refer to 'if and only if' [1]. I changed it. >> >> Here is the new webrev: >> http://cr.openjdk.java.net/~thartmann/8078497/webrev.01/ >> >> Thanks, >> Tobias >> >> [1] http://en.wikipedia.org/wiki/If_and_only_if >> >>> >>> Thanks, >>> Vladimir >>> >>> On 4/29/15 5:44 AM, Tobias Hartmann wrote: >>>> Hi, >>>> >>>> please review the following patch. 
>>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8078497 >>>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.00/ >>>> >>>> Background information (simplified): >>>> After loop optimizations, C2 tries to vectorize memory operations by finding and merging adjacent memory accesses in the loop body. The superword optimization matches MemNodes with the following pattern as address input: >>>> >>>> ptr + k*iv + constant [+ invar] >>>> >>>> where iv is the loop induction variable, k is a scaling factor that may be 0 and invar is an optional loop invariant value. C2 then picks the memory operation with most similar references and tries to align it in the main loop by setting the number of pre-loop iterations accordingly. Other adjacent memory operations that conform to this alignment are merged into packs. After extending and filtering these packs, final vector operations are emitted. >>>> >>>> The problem is that some architectures (for example, Sparc) require memory accesses to be aligned and in special cases the superword optimization generates code that contains unaligned vector instructions. >>>> >>>> Problems: >>>> (1) If two memory operations have different loop invariant offset values and C2 decides to align the main loop to one of them, we can always set the invariant of the other operation such that we break the alignment constraint. For example, if we have a store to a byte array a[i+inv1] and a load from a byte array b[i+inv2] (where i is the induction variable), C2 may decide to align the main loop such that (i+inv1 % 8) == 0 and replace the adjacent byte stores by a 8-byte double word store. Also the byte loads will be replaced by a double word load. If we then set inv2 = inv1 + 1 at runtime, the load will fail with a SIGBUS because the access is not 8-byte aligned, i.e., (i+inv2 % 8) != 0. Vectorization should not take place in this case. 
>>>> >>>> (2) If C2 decides to align the main loop to a memory operation, the necessary adjustment of the induction variable by the pre-loop is computed such that the resulting offset in the main loop is aligned to the vector width (see 'SuperWord::get_iv_adjustment'). If the offset of a memory operation is independent of the loop induction variable (i.e., scale k is 0), the iv adjustment should be 0. >>>> >>>> (3) If the loop span is greater than the vector width 'vw' of a memory operation, the superword optimization assumes that the pre-loop is never able to align this operation in the main loop (see 'SuperWord::ref_is_alignable'). This is wrong because if the loop span is a multiple of vw and depending on the initial offset, it may very well be possible to align the operation to vw. >>>> >>>> These problems originally only showed up in the string density code base (JDK-8054307) where we have a putChar intrinsic that writes a char value to two entries of a byte array. To reproduce the issue with JDK9 we have to make sure that the memory operations: >>>> (i) are independent, >>>> (ii) have different invariants, >>>> (iii) are not too complex (because then vectorization will not take place). >>>> >>>> To guarantee (i) we either need two different arrays (for example, byte[] and char[]) for the load and store operations or the same array but different offsets. Since the offsets should include a loop invariant part (ii), the superword optimization will not be able to determine that the runtime offset values do not overlap. We therefore use Unsafe.getChar/putChar to read/write a char value from/to a byte/char array and thereby guarantee independence. 
>>>> >>>> I came up with the following test (see 'TestVectorizationWithInvariant.java'): >>>> >>>> byte[] src = new byte[1000]; >>>> char[] dst = new char[1000]; >>>> >>>> for (int i = (int) CHAR_ARRAY_OFFSET; i < 100; i = i + 8) { >>>> // Copy 8 chars from src to dst >>>> unsafe.putChar(dst, i + 0, unsafe.getChar(src, off + 0)); >>>> [...] >>>> unsafe.putChar(dst, i + 14, unsafe.getChar(src, off + 14)); >>>> } >>>> >>>> Problem (1) shows up since the main loop will be aligned to the StoreC[i + 0] which has no invariant. However, the LoadUS[off + 0] has the loop invariant 'off'. Setting off to BYTE_ARRAY_OFFSET + 2 will break the alignment of the emitted double word load and result in a crash. >>>> >>>> The LoadUS[off + 0] in the above example is independent of the loop induction variable and therefore has no scale value. Because of problem (2), the iv adjustment is computed as -4 (see 'SuperWord::get_iv_adjustment'). >>>> >>>> offset = 16 iv_adjust = -4 elt_size = 2 scale = 0 iv_stride = 16 vect_size 8 >>>> >>>> The regression test contains additional test cases that also trigger problem (3). >>>> >>>> Solution: >>>> (1) I added a check to 'SuperWord::find_adjacent_refs' to make sure that memory accesses with different invariants are only vectorized if the architecture supports unaligned memory accesses. >>>> (2) 'SuperWord::get_iv_adjustment' is modified such that it returns 0 if the memory operation is independent of iv. >>>> (3) In 'SuperWord::ref_is_alignable' we need to additionally check if span is divisible by vw. If so, the pre-loop will add multiples of vw to the initial offset and if that initial offset is divisible by vw, the final offset (after the pre-loop) will be divisible as well and we should return true. I added comments to the code describing this in detail.
>>>> >>>> Testing: >>>> - original test with string-density codebase >>>> - regression test >>>> - JPRT >>>> >>>> Thanks, >>>> Tobias >>>> From rickard.backman at oracle.com Mon May 4 13:13:37 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Mon, 4 May 2015 15:13:37 +0200 Subject: RFR (M): 8064458 OopMap class could be more compact In-Reply-To: <554257C6.1010600@oracle.com> References: <20150428100337.GB31204@rbackman> <553FF92E.3010908@oracle.com> <20150429145259.GD31204@rbackman> <5541132C.9020208@oracle.com> <20150430141818.GG31204@rbackman> <554257C6.1010600@oracle.com> Message-ID: <20150504131337.GJ31204@rbackman> Updated again. http://cr.openjdk.java.net/~rbackman/8064458.5 Thanks On 04/30, Vladimir Kozlov wrote: > It is better now. I am fine with inner Mapping class. > > One thing left. I think only Mapping and fields could be private since only ImmutableOopMapBuilder accesses them: > > > + class ImmutableOopMapBuilder { > + public: > + class Mapping; > + > + public: > + const OopMapSet* _set; > > thanks, > Vladimir > > On 4/30/15 7:18 AM, Rickard B?ckman wrote: > >On 04/29, Vladimir Kozlov wrote: > >>On 4/29/15 7:52 AM, Rickard B?ckman wrote: > >>>On 04/28, Vladimir Kozlov wrote: > >>>>You need closed SA changes. > >>> > >>>I have them, just forgot to send the mail. Small changes too. > >>> > >>>> > >>>>Style: > >>>> > >>>>Move fields to the beginning of ImmutableOopMapBuilder and classes. > >>>>Add comments describing each field in ALL new classes. Add comments > >>>>to fields in old class. It will help next persone who will look on > >>>>oop maps later. > >>> > >>>All fields are at the beginning? Just following style and keeping > >>>friends above fields as other classes in the same file has. > >> > >> friends > >> fields > >> methods > >> > >>It is better to see fields before methods which access them. I am > >>not sure what codding style you are talking about. All classes in > >>oopMap.hpp has above layout. 
> >> > >>I am asking to change layout of ImmutableOopMapBuilder and Mapping classes. Also why Mapping is inner class? > > > >Ah, I was looking at oopMap.hpp and didn't see them. Fixed. > >Mapping is an inner class because it is only used by the Builder and I > >didn't want to clutter the namespace with something as general (and bad) > >as Mapping. I can move it out if that is preferred. > > > >> > >> > >>> > >>>> > >>>>Add ResourceMark into ImmutableOopMapSet::build_from() to free memory allocated in ImmutableOopMapBuilder(). > >> > >>no ResourceMark in 8064458.2 > > > >Added. > > > >> > >>>>Why you need ImmutableOopMapBuilder to be friend of class ImmutableOopMap? > >>> > >>>It makes a call to data_addr() which is private. Only for debug builds > >>>so I wrapped it in DEBUG_ONLY. > >> > >>Put ImmutableOopMapBuilder::verify() under #ifdef ASSERT and its declaration in DEBUG_ONLY. > >> > > > >Done. > > > >Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.4 > > > >Thanks > >/R > > > >>Thanks, > >>Vladimir > >> > >>> > >>>> > >>>>I think a simple loop in ImmutableOopMapBuilder::verify() would be faster than calling memcmp. > >>> > >>>Fixed. > >>> > >>>> > >>>>Field _end is not used. > >>> > >>>Removed. > >>> > >>>> > >>>>ImmutableOopMapBuilder() calls reset() and next called heap_size() > >>>>calls reset() again. May be move reset() to the end of heap_size() > >>>>so that you don't need to call it in fill(). > >>>> > >>> > >>>Better yet, the reset() method isn't required anymore. 
> >>> > >>>Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.2 > >>> > >>>/R > >>> > >>>>Thanks, > >>>>Vladimir > >>>> > >>>>On 4/28/15 3:03 AM, Rickard B?ckman wrote: > >>>>>Hi all, > >>>>> > >>>>>can I please have reviews for this change: > >>>>> > >>>>>RFR: http://cr.openjdk.java.net/~rbackman/8064458.1/ > >>>>>RFE: http://bugs.openjdk.java.net/browse/JDK-8064458 > >>>>> > >>>>>While looking at OopMaps a while ago I noticed that there were a couple > >>>>>of different fields that were unused after the OopMaps were finalised. > >>>>> > >>>>>I took some time to investigate and rearrange the OopMaps. Since I > >>>>>didn't want to change how the OopMaps are built I introduced new data > >>>>>structures. ImmutableOopMapSet and ImmutableOopMap. The original OopMap > >>>>>structures are used to build up the OopMaps and when finalised they are > >>>>>copied into the Immutable variants. > >>>>> > >>>>>The ImmutableOopMapSet contains a few fields [size, count] and then a > >>>>>list of [pc, offset]. The offset points to the offset after the list > >>>>>where the ImmutableOopMap is placed. By moving pc out from OopMap to be > >>>>>part of the list we can now have multiple pcs with identical OopMaps > >>>>>point to the same data. > >>>>> > >>>>>We only keep 1 empty OopMap, and the other compaction that is done in > >>>>>this change is to check if the OopMap is identical to the previous one > >>>>>and then reuse that one. So no complete uniqueness check. > >>>>> > >>>>>I ran a couple of small benchmarks and printed the size of the old > >>>>>OopMaps vs the new. The new layout uses about 20 - 25% of the space on > >>>>>the benchmarks I've run. > >>>>> > >>>>>Tested by running through JPRT, running BigApps and NSK.quick.testlist > >>>>> > >>>>>Thanks > >>>>>/R > >>>>> /R -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From rickard.backman at oracle.com Mon May 4 13:16:04 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Mon, 4 May 2015 15:16:04 +0200 Subject: RFR (M): 8064458 OopMap class could be more compact In-Reply-To: <55425AAF.2020801@oracle.com> References: <20150428100337.GB31204@rbackman> <553FF92E.3010908@oracle.com> <20150429145259.GD31204@rbackman> <5541132C.9020208@oracle.com> <20150430141818.GG31204@rbackman> <554257C6.1010600@oracle.com> <94B29370-0478-43B8-8547-F108D5AD29FF@oracle.com> <55425AAF.2020801@oracle.com> Message-ID: <20150504131604.GK31204@rbackman> Vladimir, I think you are right. Personally I don't see much improvement by having TempOopMap/OopMap vs OopMap/ImmutableOopMap (if we had had namespaces we could have had them as compiler::OopMap/code::OopMap). If I see lots of requests for the former before I push... Thanks /R On 04/30, Vladimir Kozlov wrote: > We still use OopMap and OopMapSet when we create oopmap data. > ImmutableOopMap* is used for writing and reading final metadata. > Otherwise changes would be much larger (see for example, > sharedRuntime_.cpp etc). Rickard may correct me if I am wrong. > > Vladimir > > On 4/30/15 9:33 AM, Christian Thalinger wrote: > >One thing I mentioned to Rickard privately is that I would prefer to have the current OopMap be renamed to something else (TempOopMap?) and keep OopMap as the one used. > > > >>On Apr 30, 2015, at 9:26 AM, Vladimir Kozlov wrote: > >> > >>It is better now. I am fine with inner Mapping class. > >> > >>One thing left. 
I think only Mapping and fields could be private since only ImmutableOopMapBuilder accesses them: > >> > >> > >>+ class ImmutableOopMapBuilder { > >>+ public: > >>+ class Mapping; > >>+ > >>+ public: > >>+ const OopMapSet* _set; > >> > >>thanks, > >>Vladimir > >> > >>On 4/30/15 7:18 AM, Rickard B?ckman wrote: > >>>On 04/29, Vladimir Kozlov wrote: > >>>>On 4/29/15 7:52 AM, Rickard B?ckman wrote: > >>>>>On 04/28, Vladimir Kozlov wrote: > >>>>>>You need closed SA changes. > >>>>> > >>>>>I have them, just forgot to send the mail. Small changes too. > >>>>> > >>>>>> > >>>>>>Style: > >>>>>> > >>>>>>Move fields to the beginning of ImmutableOopMapBuilder and classes. > >>>>>>Add comments describing each field in ALL new classes. Add comments > >>>>>>to fields in old class. It will help next persone who will look on > >>>>>>oop maps later. > >>>>> > >>>>>All fields are at the beginning? Just following style and keeping > >>>>>friends above fields as other classes in the same file has. > >>>> > >>>> friends > >>>> fields > >>>> methods > >>>> > >>>>It is better to see fields before methods which access them. I am > >>>>not sure what codding style you are talking about. All classes in > >>>>oopMap.hpp has above layout. > >>>> > >>>>I am asking to change layout of ImmutableOopMapBuilder and Mapping classes. Also why Mapping is inner class? > >>> > >>>Ah, I was looking at oopMap.hpp and didn't see them. Fixed. > >>>Mapping is an inner class because it is only used by the Builder and I > >>>didn't want to clutter the namespace with something as general (and bad) > >>>as Mapping. I can move it out if that is preferred. > >>> > >>>> > >>>> > >>>>> > >>>>>> > >>>>>>Add ResourceMark into ImmutableOopMapSet::build_from() to free memory allocated in ImmutableOopMapBuilder(). > >>>> > >>>>no ResourceMark in 8064458.2 > >>> > >>>Added. > >>> > >>>> > >>>>>>Why you need ImmutableOopMapBuilder to be friend of class ImmutableOopMap? 
> >>>>> > >>>>>It makes a call to data_addr() which is private. Only for debug builds > >>>>>so I wrapped it in DEBUG_ONLY. > >>>> > >>>>Put ImmutableOopMapBuilder::verify() under #ifdef ASSERT and its declaration in DEBUG_ONLY. > >>>> > >>> > >>>Done. > >>> > >>>Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.4 > >>> > >>>Thanks > >>>/R > >>> > >>>>Thanks, > >>>>Vladimir > >>>> > >>>>> > >>>>>> > >>>>>>I think a simple loop in ImmutableOopMapBuilder::verify() would be faster than calling memcmp. > >>>>> > >>>>>Fixed. > >>>>> > >>>>>> > >>>>>>Field _end is not used. > >>>>> > >>>>>Removed. > >>>>> > >>>>>> > >>>>>>ImmutableOopMapBuilder() calls reset() and next called heap_size() > >>>>>>calls reset() again. May be move reset() to the end of heap_size() > >>>>>>so that you don't need to call it in fill(). > >>>>>> > >>>>> > >>>>>Better yet, the reset() method isn't required anymore. > >>>>> > >>>>>Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.2 > >>>>> > >>>>>/R > >>>>> > >>>>>>Thanks, > >>>>>>Vladimir > >>>>>> > >>>>>>On 4/28/15 3:03 AM, Rickard B?ckman wrote: > >>>>>>>Hi all, > >>>>>>> > >>>>>>>can I please have reviews for this change: > >>>>>>> > >>>>>>>RFR: http://cr.openjdk.java.net/~rbackman/8064458.1/ > >>>>>>>RFE: http://bugs.openjdk.java.net/browse/JDK-8064458 > >>>>>>> > >>>>>>>While looking at OopMaps a while ago I noticed that there were a couple > >>>>>>>of different fields that were unused after the OopMaps were finalised. > >>>>>>> > >>>>>>>I took some time to investigate and rearrange the OopMaps. Since I > >>>>>>>didn't want to change how the OopMaps are built I introduced new data > >>>>>>>structures. ImmutableOopMapSet and ImmutableOopMap. The original OopMap > >>>>>>>structures are used to build up the OopMaps and when finalised they are > >>>>>>>copied into the Immutable variants. > >>>>>>> > >>>>>>>The ImmutableOopMapSet contains a few fields [size, count] and then a > >>>>>>>list of [pc, offset]. 
The offset points to the offset after the list > >>>>>>>where the ImmutableOopMap is placed. By moving pc out from OopMap to be > >>>>>>>part of the list we can now have multiple pcs with identical OopMaps > >>>>>>>point to the same data. > >>>>>>> > >>>>>>>We only keep 1 empty OopMap, and the other compaction that is done in > >>>>>>>this change is to check if the OopMap is identical to the previous one > >>>>>>>and then reuse that one. So no complete uniqueness check. > >>>>>>> > >>>>>>>I ran a couple of small benchmarks and printed the size of the old > >>>>>>>OopMaps vs the new. The new layout uses about 20 - 25% of the space on > >>>>>>>the benchmarks I've run. > >>>>>>> > >>>>>>>Tested by running through JPRT, running BigApps and NSK.quick.testlist > >>>>>>> > >>>>>>>Thanks > >>>>>>>/R > >>>>>>> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From tobias.hartmann at oracle.com Mon May 4 14:09:53 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 04 May 2015 16:09:53 +0200 Subject: [9] RFR(S): 8078497: C2's superword optimization causes unaligned memory accesses In-Reply-To: <554705ED.4050305@oracle.com> References: <5540D237.8000107@oracle.com> <5541172D.5010200@oracle.com> <5541F0B1.50804@oracle.com> <554254A9.6060608@oracle.com> <554705ED.4050305@oracle.com> Message-ID: <55477DB1.5050200@oracle.com> Hi, looks like there is still a problem I didn't catch. With my regression test I sometimes hit [1] # Internal Error (/opt/jprt/T/P1/130607.tohartma/s/hotspot/src/share/vm/opto/loopnode.cpp:3231), pid=18188, tid=0x00002b00dacde700 # assert(!had_error) failed: bad dominance I'm not sure if this is related to my fix but I assume it is an existing problem that was just triggered by my regression test. I'll further investigate. 
Thanks, Tobias [1] http://sthjprt.se.oracle.com/archives/2015/05/2015-05-04-130607.tohartma.8078497/logs/linux_x64_2.6-fastdebug-c2-hotspot_compiler_3.log.FAILED.log On 04.05.2015 07:38, Tobias Hartmann wrote: > Thanks, Vladimir. > > Best, > Tobias > > On 30.04.2015 18:13, Vladimir Kozlov wrote: >> Looks good. >> >> Thanks, >> Vladimir >> >> On 4/30/15 2:06 AM, Tobias Hartmann wrote: >>> Hi Vladimir, >>> >>> thanks for the review! >>> >>> On 29.04.2015 19:38, Vladimir Kozlov wrote: >>>> Consider reducing number of iterations in test from 1M so that the test does not timeout on slow platforms. >>> >>> I reduced the number of iterations to 20k and verified that it is enough to trigger compilation and crash on Sparc. >>> >>>> lines 511-515 are duplicates of 486-491. Consider moving them before vw%span check. >>> >>> Thanks, I refactored that part. >>> >>>> Typo 'iff': >>>> + // final offset is a multiple of vw iff init_offset is a multiple. >>> >>> I used 'iff' to refer to 'if and only if' [1]. I changed it. >>> >>> Here is the new webrev: >>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.01/ >>> >>> Thanks, >>> Tobias >>> >>> [1] http://en.wikipedia.org/wiki/If_and_only_if >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>> On 4/29/15 5:44 AM, Tobias Hartmann wrote: >>>>> Hi, >>>>> >>>>> please review the following patch. >>>>> >>>>> https://bugs.openjdk.java.net/browse/JDK-8078497 >>>>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.00/ >>>>> >>>>> Background information (simplified): >>>>> After loop optimizations, C2 tries to vectorize memory operations by finding and merging adjacent memory accesses in the loop body. The superword optimization matches MemNodes with the following pattern as address input: >>>>> >>>>> ptr + k*iv + constant [+ invar] >>>>> >>>>> where iv is the loop induction variable, k is a scaling factor that may be 0 and invar is an optional loop invariant value. 
C2 then picks the memory operation with most similar references and tries to align it in the main loop by setting the number of pre-loop iterations accordingly. Other adjacent memory operations that conform to this alignment are merged into packs. After extending and filtering these packs, final vector operations are emitted. >>>>> >>>>> The problem is that some architectures (for example, Sparc) require memory accesses to be aligned and in special cases the superword optimization generates code that contains unaligned vector instructions. >>>>> >>>>> Problems: >>>>> (1) If two memory operations have different loop invariant offset values and C2 decides to align the main loop to one of them, we can always set the invariant of the other operation such that we break the alignment constraint. For example, if we have a store to a byte array a[i+inv1] and a load from a byte array b[i+inv2] (where i is the induction variable), C2 may decide to align the main loop such that (i+inv1 % 8) == 0 and replace the adjacent byte stores by a 8-byte double word store. Also the byte loads will be replaced by a double word load. If we then set inv2 = inv1 + 1 at runtime, the load will fail with a SIGBUS because the access is not 8-byte aligned, i.e., (i+inv2 % 8) != 0. Vectorization should not take place in this case. >>>>> >>>>> (2) If C2 decides to align the main loop to a memory operation, the necessary adjustment of the induction variable by the pre-loop is computed such that the resulting offset in the main loop is aligned to the vector width (see 'SuperWord::get_iv_adjustment'). If the offset of a memory operation is independent of the loop induction variable (i.e., scale k is 0), the iv adjustment should be 0. >>>>> >>>>> (3) If the loop span is greater than the vector width 'vw' of a memory operation, the superword optimization assumes that the pre-loop is never able to align this operation in the main loop (see 'SuperWord::ref_is_alignable'). 
This is wrong because if the loop span is a multiple of vw and depending on the initial offset, it may very well be possible to align the operation to vw. >>>>> >>>>> These problems originally only showed up in the string density code base (JDK-8054307) where we have a putChar intrinsic that writes a char value to two entries of a byte array. To reproduce the issue with JDK9 we have to make sure that the memory operations: >>>>> (i) are independent, >>>>> (ii) have different invariants, >>>>> (iii) are not too complex (because then vectorization will not take place). >>>>> >>>>> To guarantee (i) we either need two different arrays (for example, byte[] and char[]) for the load and store operations or the same array but different offsets. Since the offsets should include a loop invariant part (ii), the superword optimization will not be able to determine that the runtime offset values do not overlap. We therefore use Unsafe.getChar/putChar to read/write a char value from/to a byte/char array and thereby guarantee independence. >>>>> >>>>> I came up with the following test (see 'TestVectorizationWithInvariant.java'): >>>>> >>>>> byte[] src = new byte[1000]; >>>>> byte[] dst = new char[1000]; >>>>> >>>>> for (int i = (int) CHAR_ARRAY_OFFSET; i < 100; i = i + 8) { >>>>> // Copy 8 chars from src to dst >>>>> unsafe.putChar(dst, i + 0, unsafe.getChar(src, off + 0)); >>>>> [...] >>>>> unsafe.putChar(dst, i + 14, unsafe.getChar(src, off + 14)); >>>>> } >>>>> >>>>> Problem (1) shows up since the main loop will be aligned to the StoreC[i + 0] which has no invariant. However, the LoadUS[off + 0] has the loop invariant 'off'. Setting off to BYTE_ARRAY_OFFSET + 2 will break the alignment of the emitted double word load and result in a crash. >>>>> >>>>> The LoadUS[off + 0] in above example is independent of the loop induction variable and therefore has no scale value. Because of problem (2), the iv adjustment is computed as -4 (see 'SuperWord::get_iv_adjustment'). 
>>>>> >>>>> offset = 16 iv_adjust = -4 elt_size = 2 scale = 0 iv_stride = 16 vect_size 8 >>>>> >>>>> The regression test contains additional test cases that also trigger problem (3). >>>>> >>>>> Solution: >>>>> (1) I added a check to 'SuperWord::find_adjacent_refs' to make sure that memory accesses with different invariants are only vectorized if the architecture supports unaligned memory accesses. >>>>> (2) 'SuperWord::get_iv_adjustment' is modified such that it returns 0 if the memory operation is independent of iv. >>>>> (3) In 'SuperWord::ref_is_alignable' we need to additionally check if span is divisible by vw. If so, the pre-loop will add multiples of vw to the initial offset and if that initial offset is divisible by vm, the final offset (after the pre-loop) will be divisible as well and we should return true. I added comments to the code describing this in detail. >>>>> >>>>> Testing: >>>>> - original test with string-density codebase >>>>> - regression test >>>>> - JPRT >>>>> >>>>> Thanks, >>>>> Tobias >>>>> From aph at redhat.com Mon May 4 16:51:37 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 04 May 2015 17:51:37 +0100 Subject: sync_kit() Message-ID: <5547A399.3040107@redhat.com> I'm trying to get my head around sync_kit(). I don't really know what it does. Well, I know that it synchronizes IdealKit and graphKit, and I know from experience that if I don't call sync_kit() my code doesn't work. But what are the rules about when sync_kit() should be called? Thanks, Andrew. 
From vladimir.kozlov at oracle.com Mon May 4 18:54:58 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 11:54:58 -0700 Subject: RFR (M): 8064458 OopMap class could be more compact In-Reply-To: <20150504131604.GK31204@rbackman> References: <20150428100337.GB31204@rbackman> <553FF92E.3010908@oracle.com> <20150429145259.GD31204@rbackman> <5541132C.9020208@oracle.com> <20150430141818.GG31204@rbackman> <554257C6.1010600@oracle.com> <94B29370-0478-43B8-8547-F108D5AD29FF@oracle.com> <55425AAF.2020801@oracle.com> <20150504131604.GK31204@rbackman> Message-ID: <5547C082.80501@oracle.com> Keep it as you have it now in your changes. thanks, Vladimir On 5/4/15 6:16 AM, Rickard B?ckman wrote: > Vladimir, > > I think you are right. Personally I don't see much improvement by having > TempOopMap/OopMap vs OopMap/ImmutableOopMap (if we had had namespaces we > could have had them as compiler::OopMap/code::OopMap). If I see lots of requests > for the former before I push... > > Thanks > /R > > On 04/30, Vladimir Kozlov wrote: >> We still use OopMap and OopMapSet when we create oopmap data. >> ImmutableOopMap* is used for writing and reading final metadata. >> Otherwise changes would be much larger (see for example, >> sharedRuntime_.cpp etc). Rickard may correct me if I am wrong. >> >> Vladimir >> >> On 4/30/15 9:33 AM, Christian Thalinger wrote: >>> One thing I mentioned to Rickard privately is that I would prefer to have the current OopMap be renamed to something else (TempOopMap?) and keep OopMap as the one used. >>> >>>> On Apr 30, 2015, at 9:26 AM, Vladimir Kozlov wrote: >>>> >>>> It is better now. I am fine with inner Mapping class. >>>> >>>> One thing left. 
I think only Mapping and fields could be private since only ImmutableOopMapBuilder accesses them: >>>> >>>> >>>> + class ImmutableOopMapBuilder { >>>> + public: >>>> + class Mapping; >>>> + >>>> + public: >>>> + const OopMapSet* _set; >>>> >>>> thanks, >>>> Vladimir >>>> >>>> On 4/30/15 7:18 AM, Rickard B?ckman wrote: >>>>> On 04/29, Vladimir Kozlov wrote: >>>>>> On 4/29/15 7:52 AM, Rickard B?ckman wrote: >>>>>>> On 04/28, Vladimir Kozlov wrote: >>>>>>>> You need closed SA changes. >>>>>>> >>>>>>> I have them, just forgot to send the mail. Small changes too. >>>>>>> >>>>>>>> >>>>>>>> Style: >>>>>>>> >>>>>>>> Move fields to the beginning of ImmutableOopMapBuilder and classes. >>>>>>>> Add comments describing each field in ALL new classes. Add comments >>>>>>>> to fields in old class. It will help next persone who will look on >>>>>>>> oop maps later. >>>>>>> >>>>>>> All fields are at the beginning? Just following style and keeping >>>>>>> friends above fields as other classes in the same file has. >>>>>> >>>>>> friends >>>>>> fields >>>>>> methods >>>>>> >>>>>> It is better to see fields before methods which access them. I am >>>>>> not sure what codding style you are talking about. All classes in >>>>>> oopMap.hpp has above layout. >>>>>> >>>>>> I am asking to change layout of ImmutableOopMapBuilder and Mapping classes. Also why Mapping is inner class? >>>>> >>>>> Ah, I was looking at oopMap.hpp and didn't see them. Fixed. >>>>> Mapping is an inner class because it is only used by the Builder and I >>>>> didn't want to clutter the namespace with something as general (and bad) >>>>> as Mapping. I can move it out if that is preferred. >>>>> >>>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>> Add ResourceMark into ImmutableOopMapSet::build_from() to free memory allocated in ImmutableOopMapBuilder(). >>>>>> >>>>>> no ResourceMark in 8064458.2 >>>>> >>>>> Added. >>>>> >>>>>> >>>>>>>> Why you need ImmutableOopMapBuilder to be friend of class ImmutableOopMap? 
>>>>>>> >>>>>>> It makes a call to data_addr() which is private. Only for debug builds >>>>>>> so I wrapped it in DEBUG_ONLY. >>>>>> >>>>>> Put ImmutableOopMapBuilder::verify() under #ifdef ASSERT and its declaration in DEBUG_ONLY. >>>>>> >>>>> >>>>> Done. >>>>> >>>>> Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.4 >>>>> >>>>> Thanks >>>>> /R >>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>>> >>>>>>>> >>>>>>>> I think a simple loop in ImmutableOopMapBuilder::verify() would be faster than calling memcmp. >>>>>>> >>>>>>> Fixed. >>>>>>> >>>>>>>> >>>>>>>> Field _end is not used. >>>>>>> >>>>>>> Removed. >>>>>>> >>>>>>>> >>>>>>>> ImmutableOopMapBuilder() calls reset() and next called heap_size() >>>>>>>> calls reset() again. May be move reset() to the end of heap_size() >>>>>>>> so that you don't need to call it in fill(). >>>>>>>> >>>>>>> >>>>>>> Better yet, the reset() method isn't required anymore. >>>>>>> >>>>>>> Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.2 >>>>>>> >>>>>>> /R >>>>>>> >>>>>>>> Thanks, >>>>>>>> Vladimir >>>>>>>> >>>>>>>> On 4/28/15 3:03 AM, Rickard B?ckman wrote: >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> can I please have reviews for this change: >>>>>>>>> >>>>>>>>> RFR: http://cr.openjdk.java.net/~rbackman/8064458.1/ >>>>>>>>> RFE: http://bugs.openjdk.java.net/browse/JDK-8064458 >>>>>>>>> >>>>>>>>> While looking at OopMaps a while ago I noticed that there were a couple >>>>>>>>> of different fields that were unused after the OopMaps were finalised. >>>>>>>>> >>>>>>>>> I took some time to investigate and rearrange the OopMaps. Since I >>>>>>>>> didn't want to change how the OopMaps are built I introduced new data >>>>>>>>> structures. ImmutableOopMapSet and ImmutableOopMap. The original OopMap >>>>>>>>> structures are used to build up the OopMaps and when finalised they are >>>>>>>>> copied into the Immutable variants. 
>>>>>>>>> >>>>>>>>> The ImmutableOopMapSet contains a few fields [size, count] and then a >>>>>>>>> list of [pc, offset]. The offset points to the offset after the list >>>>>>>>> where the ImmutableOopMap is placed. By moving pc out from OopMap to be >>>>>>>>> part of the list we can now have multiple pcs with identical OopMaps >>>>>>>>> point to the same data. >>>>>>>>> >>>>>>>>> We only keep 1 empty OopMap, and the other compaction that is done in >>>>>>>>> this change is to check if the OopMap is identical to the previous one >>>>>>>>> and then reuse that one. So no complete uniqueness check. >>>>>>>>> >>>>>>>>> I ran a couple of small benchmarks and printed the size of the old >>>>>>>>> OopMaps vs the new. The new layout uses about 20 - 25% of the space on >>>>>>>>> the benchmarks I've run. >>>>>>>>> >>>>>>>>> Tested by running through JPRT, running BigApps and NSK.quick.testlist >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> /R >>>>>>>>> >>> From vladimir.kozlov at oracle.com Mon May 4 18:56:00 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 11:56:00 -0700 Subject: RFR (M): 8064458 OopMap class could be more compact In-Reply-To: <20150504131337.GJ31204@rbackman> References: <20150428100337.GB31204@rbackman> <553FF92E.3010908@oracle.com> <20150429145259.GD31204@rbackman> <5541132C.9020208@oracle.com> <20150430141818.GG31204@rbackman> <554257C6.1010600@oracle.com> <20150504131337.GJ31204@rbackman> Message-ID: <5547C0C0.1080404@oracle.com> Looks good. Thanks, Vladimir On 5/4/15 6:13 AM, Rickard B?ckman wrote: > Updated again. > > http://cr.openjdk.java.net/~rbackman/8064458.5 > > Thanks > > On 04/30, Vladimir Kozlov wrote: >> It is better now. I am fine with inner Mapping class. >> >> One thing left. 
I think only Mapping and fields could be private since only ImmutableOopMapBuilder accesses them: >> >> >> + class ImmutableOopMapBuilder { >> + public: >> + class Mapping; >> + >> + public: >> + const OopMapSet* _set; >> >> thanks, >> Vladimir >> >> On 4/30/15 7:18 AM, Rickard B?ckman wrote: >>> On 04/29, Vladimir Kozlov wrote: >>>> On 4/29/15 7:52 AM, Rickard B?ckman wrote: >>>>> On 04/28, Vladimir Kozlov wrote: >>>>>> You need closed SA changes. >>>>> >>>>> I have them, just forgot to send the mail. Small changes too. >>>>> >>>>>> >>>>>> Style: >>>>>> >>>>>> Move fields to the beginning of ImmutableOopMapBuilder and classes. >>>>>> Add comments describing each field in ALL new classes. Add comments >>>>>> to fields in old class. It will help next persone who will look on >>>>>> oop maps later. >>>>> >>>>> All fields are at the beginning? Just following style and keeping >>>>> friends above fields as other classes in the same file has. >>>> >>>> friends >>>> fields >>>> methods >>>> >>>> It is better to see fields before methods which access them. I am >>>> not sure what codding style you are talking about. All classes in >>>> oopMap.hpp has above layout. >>>> >>>> I am asking to change layout of ImmutableOopMapBuilder and Mapping classes. Also why Mapping is inner class? >>> >>> Ah, I was looking at oopMap.hpp and didn't see them. Fixed. >>> Mapping is an inner class because it is only used by the Builder and I >>> didn't want to clutter the namespace with something as general (and bad) >>> as Mapping. I can move it out if that is preferred. >>> >>>> >>>> >>>>> >>>>>> >>>>>> Add ResourceMark into ImmutableOopMapSet::build_from() to free memory allocated in ImmutableOopMapBuilder(). >>>> >>>> no ResourceMark in 8064458.2 >>> >>> Added. >>> >>>> >>>>>> Why you need ImmutableOopMapBuilder to be friend of class ImmutableOopMap? >>>>> >>>>> It makes a call to data_addr() which is private. Only for debug builds >>>>> so I wrapped it in DEBUG_ONLY. 
>>>> >>>> Put ImmutableOopMapBuilder::verify() under #ifdef ASSERT and its declaration in DEBUG_ONLY. >>>> >>> >>> Done. >>> >>> Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.4 >>> >>> Thanks >>> /R >>> >>>> Thanks, >>>> Vladimir >>>> >>>>> >>>>>> >>>>>> I think a simple loop in ImmutableOopMapBuilder::verify() would be faster than calling memcmp. >>>>> >>>>> Fixed. >>>>> >>>>>> >>>>>> Field _end is not used. >>>>> >>>>> Removed. >>>>> >>>>>> >>>>>> ImmutableOopMapBuilder() calls reset() and next called heap_size() >>>>>> calls reset() again. May be move reset() to the end of heap_size() >>>>>> so that you don't need to call it in fill(). >>>>>> >>>>> >>>>> Better yet, the reset() method isn't required anymore. >>>>> >>>>> Updated webrev: http://cr.openjdk.java.net/~rbackman/8064458.2 >>>>> >>>>> /R >>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>> On 4/28/15 3:03 AM, Rickard B?ckman wrote: >>>>>>> Hi all, >>>>>>> >>>>>>> can I please have reviews for this change: >>>>>>> >>>>>>> RFR: http://cr.openjdk.java.net/~rbackman/8064458.1/ >>>>>>> RFE: http://bugs.openjdk.java.net/browse/JDK-8064458 >>>>>>> >>>>>>> While looking at OopMaps a while ago I noticed that there were a couple >>>>>>> of different fields that were unused after the OopMaps were finalised. >>>>>>> >>>>>>> I took some time to investigate and rearrange the OopMaps. Since I >>>>>>> didn't want to change how the OopMaps are built I introduced new data >>>>>>> structures. ImmutableOopMapSet and ImmutableOopMap. The original OopMap >>>>>>> structures are used to build up the OopMaps and when finalised they are >>>>>>> copied into the Immutable variants. >>>>>>> >>>>>>> The ImmutableOopMapSet contains a few fields [size, count] and then a >>>>>>> list of [pc, offset]. The offset points to the offset after the list >>>>>>> where the ImmutableOopMap is placed. 
By moving pc out from OopMap to be >>>>>>> part of the list we can now have multiple pcs with identical OopMaps >>>>>>> point to the same data. >>>>>>> >>>>>>> We only keep 1 empty OopMap, and the other compaction that is done in >>>>>>> this change is to check if the OopMap is identical to the previous one >>>>>>> and then reuse that one. So no complete uniqueness check. >>>>>>> >>>>>>> I ran a couple of small benchmarks and printed the size of the old >>>>>>> OopMaps vs the new. The new layout uses about 20 - 25% of the space on >>>>>>> the benchmarks I've run. >>>>>>> >>>>>>> Tested by running through JPRT, running BigApps and NSK.quick.testlist >>>>>>> >>>>>>> Thanks >>>>>>> /R >>>>>>> > /R > From vladimir.kozlov at oracle.com Mon May 4 21:30:27 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 14:30:27 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC Message-ID: <5547E4F3.4070603@oracle.com> http://cr.openjdk.java.net/~kvn/8073583/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8073583 Contributed by: James Cheng Two methods in the java.util.zip.CRC32C class are intrinsified on SPARC with a single stub routine: java.util.zip.CRC32C.updateBytes() java.util.zip.CRC32C.updateDirectByteBuffer() Two JVM flags, UseCRC32C and UseCRC32CIntrinsics, are added for controlling the usage of the CRC32C intrinsics. They are set to true on SPARC T4 and later processors, or false on other platforms. They can be turned off with -XX:-UseCRC32C or -XX:-UseCRC32CIntrinsics. Performance gains from the CRC32C intrinsics were measured up to 16x with hotspot/test/compiler/intrinsics/crc32c/TestCRC32C.java or the JMH-based ChecksumBenchmarks/CRC32C. Performance gains from the CRC32C intrinsics were also measured with Hadoop TestDFSIO at about 40% for the write test and 80% for the read test. Tested with hotspot/test/ and jdk/test/java/util/ jtreg tests.
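For reference, the two intrinsified entry points above are private; user code reaches them through the public java.util.zip.Checksum API. A minimal sketch exercising both the byte[] path (updateBytes) and the direct-buffer path (updateDirectByteBuffer); the demo class name is ours, while the class, methods and the 0xE3069283 check value are standard JDK 9 / CRC-32C facts:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

public class Crc32cDemo {
    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);

        // byte[] path: backed by CRC32C.updateBytes() when the intrinsic kicks in
        CRC32C c1 = new CRC32C();
        c1.update(data, 0, data.length);

        // direct ByteBuffer path: backed by CRC32C.updateDirectByteBuffer()
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        buf.flip();
        CRC32C c2 = new CRC32C();
        c2.update(buf);

        // 0xE3069283 is the well-known CRC-32C (Castagnoli) check value for "123456789"
        if (c1.getValue() != 0xE3069283L || c2.getValue() != c1.getValue())
            throw new AssertionError("CRC32C mismatch");
        System.out.println(Long.toHexString(c1.getValue())); // prints "e3069283"
    }
}
```

Both checksums must agree regardless of whether the stub or the pure-Java fallback runs, which is what makes the jtreg tests mentioned above meaningful.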
From john.r.rose at oracle.com Tue May 5 00:39:23 2015 From: john.r.rose at oracle.com (John Rose) Date: Mon, 4 May 2015 17:39:23 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <5547E4F3.4070603@oracle.com> References: <5547E4F3.4070603@oracle.com> Message-ID: <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> On May 4, 2015, at 2:30 PM, Vladimir Kozlov wrote: > > http://cr.openjdk.java.net/~kvn/8073583/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8073583 > > Contributed by: James Cheng I took a quick look at the assembly code. I have a few comments. The magic constant 128 (size of parallel decomposition chunk) needs naming; suggest "chunk length" or something like that. When the constant is defined, it needs to be explained also. I'm glad to see newer sparc instructions getting used, especially xmulx to reassociate crc. The magic constants used to shift the CRC codes by 1024 should be named and explained. The code is unnecessarily monolithic. It will be more readable and maintainable if some easy factorizations are applied. The factorizations themselves can add value beyond readability of this new code. Specifically, the sethi/or3/or3 should be replaced by set64 macros. Forget about the clever hoisting of 1<<32 into G1; just improve set64 to cover 34-bit constants in three instructions: sethi/sllx/or3. The sllx shifts up by 3 to make room for a simm13. (If we don't want to disturb set64, then add a new macro set34. But I say go for it. I wish the set64 macro got a little more love.) A couple of places use 'sethi' instead of the 'set' macro; I'd use 'set' to avoid puzzling the reader with details of micro-optimization, although it's a matter of style. The long sequences which perform byte reversal should use a macro; you can name it 'reverse_bytes_32'. If they are truly optimal then the macro will hide the noise. But if they are not, it will be much easier to swap in better instructions. 
The C2 intrinsics called ReverseBytes use ASI_PRIMARY_LITTLE to a stack temp, which is probably better. In any case, the decision needs to be hidden in a macro, not spread around in unrelated code. There are some minor things that would tidy the code. The first two instructions sllx/srlx should be replaced by clruwu (srl 0). The uses of cc's in the "subcc" instructions are all useless and slightly misleading; should be "sub" or (better) "dec". I personally find the use of cmp_and_brx_short peculiar with the constant 32, just because I know 32 doesn't actually work with the intended instruction. But I suppose it is OK, since the macro expands to something that works. -- John From james.cheng at oracle.com Tue May 5 00:57:46 2015 From: james.cheng at oracle.com (james cheng) Date: Mon, 04 May 2015 17:57:46 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> Message-ID: <5548158A.2010103@oracle.com> Hi John, On 5/4/2015 5:39 PM, John Rose wrote: > On May 4, 2015, at 2:30 PM, Vladimir Kozlov > wrote: >> >> http://cr.openjdk.java.net/~kvn/8073583/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8073583 >> >> Contributed by: James Cheng > > I took a quick look at the assembly code. I have a few comments. Many thanks for your quick review and comments. I will make changes as you suggested as soon as I can.
Thanks, -James From vladimir.kozlov at oracle.com Tue May 5 01:13:34 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 18:13:34 -0700 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <55101C52.1090506@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> Message-ID: <5548193E.7060803@oracle.com> I think that was the last mail on the open list. There were discussions in private. Here is the updated webrev: http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ It includes changes in java/math/BigInteger.java which added range checks and moved the intrinsified code into private methods. Andrew Haley has additional suggestions we can discuss now. Andrew, can you show the proposed changes to the java code? I would like to have only simple java changes (methods splitting) and do experiments/changes with word-reversal later. On 4/14/15 10:41 AM, Andrew Haley wrote: > On 04/14/2015 06:22 PM, Vladimir Kozlov wrote: > >> We are discussing how and which checks to add into java code which >> calls intrinsified methods to keep intrinsic simple. > > Yes, good idea. While you're in there, there's a couple of thoughts I'd > like to draw your attention to. > > Montgomery multiplication and squaring are implemented as separate > steps, like so: > > a = multiplyToLen(t, modLen, mult, modLen, a); > a = montReduce(a, mod, modLen, inv); > > a = squareToLen(t, modLen, a); > a = montReduce(a, mod, modLen, inv); > > It is possible to interleave the multiplication and Montgomery > reduction, and this can lead to a useful speedup on some > architectures.
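For readers unfamiliar with the montReduce step being discussed, here is a minimal sketch of Montgomery reduction (REDC) itself, written against BigInteger for clarity. It is not HotSpot's int-array implementation, and the class and variable names are ours; only the mathematics (t * r^-1 mod n without dividing by n) matches what montReduce computes:

```java
import java.math.BigInteger;

public class MontgomeryDemo {
    static final int K = 128;                                  // r = 2^K
    static final BigInteger R = BigInteger.ONE.shiftLeft(K);

    // REDC: for odd n < r and t < n*r, returns t * r^{-1} mod n using only
    // multiplications, a mask and a shift -- no division by n.
    static BigInteger redc(BigInteger t, BigInteger n, BigInteger nPrime) {
        BigInteger m = t.multiply(nPrime).and(R.subtract(BigInteger.ONE)); // m = t * n' mod r
        BigInteger u = t.add(m.multiply(n)).shiftRight(K);                 // (t + m*n) / r, exact
        return u.compareTo(n) >= 0 ? u.subtract(n) : u;
    }

    public static void main(String[] args) {
        // odd modulus below 2^128 (2^127 - 1)
        BigInteger n = new BigInteger("170141183460469231731687303715884105727");
        BigInteger nPrime = n.modInverse(R).negate().mod(R);   // n' = -n^{-1} mod r
        BigInteger a = new BigInteger("123456789123456789");
        BigInteger b = new BigInteger("987654321987654321");

        BigInteger aBar = a.multiply(R).mod(n);                // Montgomery form of a
        BigInteger bBar = b.multiply(R).mod(n);                // Montgomery form of b
        BigInteger prodBar = redc(aBar.multiply(bBar), n, nPrime); // "multiply then montReduce"
        BigInteger prod = redc(prodBar, n, nPrime);            // convert back out

        if (!prod.equals(a.multiply(b).mod(n)))
            throw new AssertionError("Montgomery reduction mismatch");
        System.out.println("ok");
    }
}
```

The interleaving Andrew describes fuses the multiply producing t with the per-word reduction steps, instead of materializing the full double-length t first.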
It would be nice if Montgomery multiplication and > squaring were factored into separate methods, and then they could be > replaced by intrinsics. > > Also, all these word-reversal and misaligned long stores / loads in > the multiplyToLen intrinsic code are a real PITA. If we word-reversed > the arrays so that they were in little-endian form we'd have neither > misaligned long stores / loads nor repeated word-reversals. We could > do the word reversal on the stack: AFAICS it's unusual for > multiplyToLen to be called for huge bignums, and I suppose if it did > happen for a bignum larger than some threshold we could do the word > reversal on the heap. > > Andrew. Thanks, Vladimir On 3/23/15 6:59 AM, Florian Weimer wrote: > On 03/20/2015 11:45 PM, Viswanathan, Sandhya wrote: >> Hi Florian, >> >> My thoughts on this are as follows: >> >> BigInteger.squareToLen is a private method and not a public method. >> The length calculation code in Java version of this method does not have the overflow check and the intrinsic follows the Java code. >> >> private static final int[] squareToLen(int[] x, int len, int[] z) { >> ... >> int zlen = len << 1; >> if (z == null || z.length < zlen) >> z = new int[zlen]; >> ... >> } > > The difference is that the Java code will still perform the bounds > checks on each array access, I think, so even if zlen turns out negative > (and thus no reallocation happens), damage from out-of-bounds accesses > will be non-existent. > >> Also the underlying array in BigInteger cannot be greater than MAX_MAG_LENGTH which is defined as: >> >> private static final int MAX_MAG_LENGTH = Integer.MAX_VALUE / Integer.SIZE + 1; // (1 << 26) >> >> So zlen calculation cannot overflow as int array x and its length len is coming from a BigInteger array. > > Maybe can you add this as a comment to the intrinsic? 
I think this > would be a useful addition, especially if at some point in the future, > someone else uses your code as a template to implement their own intrinsic. > >> Similarly mulAdd is a package private method and its inputs are allocated or verified at call sites. > > Same here. > From john.r.rose at oracle.com Tue May 5 01:21:27 2015 From: john.r.rose at oracle.com (John Rose) Date: Mon, 4 May 2015 18:21:27 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <5548158A.2010103@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> <5548158A.2010103@oracle.com> Message-ID: <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> Thanks, James. One more comment, which is at a higher level: Could we recode the loop control in Java and use unsafe to handle word and byte loads? Then we would only need single instruction intrinsics. ? John > On May 4, 2015, at 5:57 PM, james cheng wrote: > > Hi John, > >> On 5/4/2015 5:39 PM, John Rose wrote: >>> On May 4, 2015, at 2:30 PM, Vladimir Kozlov > wrote: >>> >>> http://cr.openjdk.java.net/~kvn/8073583/webrev.00/ >>> https://bugs.openjdk.java.net/browse/JDK-8073583 >>> >>> Contributed by: James Cheng >> >> I took a quick look at the assembly code. I have a few comments. > > Many thanks for your quick review and comments. I will make changes > as you suggested as soon as I can. 
> > Thanks, > -James From james.cheng at oracle.com Tue May 5 03:04:14 2015 From: james.cheng at oracle.com (James Cheng) Date: Mon, 4 May 2015 20:04:14 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> <5548158A.2010103@oracle.com> <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> Message-ID: <245E1236-F423-45B8-93C0-9A9BB737371C@oracle.com> Hi John, > On May 4, 2015, at 6:21 PM, John Rose wrote: > > One more comment, which is at a higher level: Could we recode the loop control in Java and use unsafe to handle word and byte loads? Then we would only need single instruction intrinsics. We could, I guess, but that means we'd need to rewrite the pure Java CRC32C in JDK. More difficult is how we implement the CRC32C methods so that they are not favoring one platform while hindering others. I am afraid that the CRC32C instructions on different platforms are too different to compromise. Thanks, -James From vladimir.kozlov at oracle.com Tue May 5 03:38:55 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 20:38:55 -0700 Subject: sync_kit() In-Reply-To: <5547A399.3040107@redhat.com> References: <5547A399.3040107@redhat.com> Message-ID: <55483B4F.5040803@oracle.com> There is a pair of sync_kit() methods: one in IdealKit and another in GraphKit. They are needed because IdealKit and GraphKit use different mechanisms to track the current code state (control, memory, io nodes). They are used if you need to intermix calls to IdealKit and GraphKit. If you start from IdealKit graph construction and need to call GraphKit methods to add new nodes into the graph, you need to call GraphKit::sync_kit() first; then, after all the sequential calls to GraphKit methods and before any IdealKit method is called, you need to call IdealKit::sync_kit(). A good example is in LibraryCallKit::insert_pre_barrier().
Regards, Vladimir On 5/4/15 9:51 AM, Andrew Haley wrote: > I'm trying to get my head around sync_kit(). I don't really know what > it does. Well, I know that it synchronizes IdealKit and graphKit, and > I know from experience that if I don't call sync_kit() my code doesn't > work. > > But what are the rules about when sync_kit() should be called? > > Thanks, > Andrew. > From john.r.rose at oracle.com Tue May 5 03:49:11 2015 From: john.r.rose at oracle.com (John Rose) Date: Mon, 4 May 2015 20:49:11 -0700 Subject: RFR(L) 8073583: C2 support for CRC32C on SPARC In-Reply-To: <245E1236-F423-45B8-93C0-9A9BB737371C@oracle.com> References: <5547E4F3.4070603@oracle.com> <779F1936-1EE2-419B-9F0A-149FB034BE68@oracle.com> <5548158A.2010103@oracle.com> <0FBC800A-693A-4ECB-A983-D5240FAC59BA@oracle.com> <245E1236-F423-45B8-93C0-9A9BB737371C@oracle.com> Message-ID: <5D45CF93-9A2E-4E00-A25A-9A101A3CDB26@oracle.com> On May 4, 2015, at 8:04 PM, James Cheng wrote: > > Hi John, > >> On May 4, 2015, at 6:21 PM, John Rose wrote: >> >> One more comment, which is at a higher level: Could we recode the loop control in Java and use unsafe to handle word and byte loads? Then we would only need single instruction intrinsics. > > We could, I guess, but that means we?d need to rewrite the pure Java CRC32C in JDK. > More difficult is how we implement the CRC32C methods so that they are not favoring > one platform while hindering others. I am afraid that the CRC32C instructions on different > platforms are too different to compromise. That may be; the vector size is CPU-dependent, for example. But a 64-bit vector is (currently) the sweet spot for writing vectorized code in Java, since 'long' is the biggest bit container in Java. (Note also that HotSpot JVM objects are aligned up to 8 byte boundaries, even after GC.) Another platform with larger vectors would have to use assembly language anyway (which Intel does), but Java code can express 64-bit vectorized loops. 
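As a rough illustration of the "Java code can express 64-bit vectorized loops" point, here is a hedged sketch that uses a plain long as an 8-byte-wide lane, without Unsafe; the example, class and method names are ours, not from any proposed patch:

```java
public class LongLaneDemo {
    // XOR-fold a byte array using a long as an 8-byte "vector" lane.
    static long xorFold(byte[] data) {
        long acc = 0;
        int i = 0;
        // main loop: consume 8 bytes per iteration, packed little-endian into one long
        for (; i + 8 <= data.length; i += 8) {
            long v = 0;
            for (int j = 7; j >= 0; j--) {
                v = (v << 8) | (data[i + j] & 0xFFL);
            }
            acc ^= v;
        }
        // tail loop: fold the remaining bytes into the low lanes one at a time
        for (int j = 0; i < data.length; i++, j++) {
            acc ^= (data[i] & 0xFFL) << (8 * j);
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] d = new byte[13];
        for (int i = 0; i < d.length; i++) d[i] = (byte) (i + 1);
        // scalar reference: each byte XORed into lane (index mod 8)
        long ref = 0;
        for (int i = 0; i < d.length; i++) ref ^= (d[i] & 0xFFL) << (8 * (i % 8));
        if (xorFold(d) != ref) throw new AssertionError("lane fold mismatch");
        System.out.println("ok");
    }
}
```

The machine-specific knobs John mentions (stream count, prefetch distance) are exactly what such a Java loop cannot express, which is the trade-off under discussion.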
For CRC, the desirable number of distinct streams, and the prefetch mode and distance, are also CPU-dependent. For those variations, injecting machine-specific parts into the Java-coded algorithm would get messy. The benefit of coding low-level vectorized loops in Java would be not having to code the loop logic in assembly code. If we could use byte buffers to manage the indexing, and/or had better array notations, it would probably be worthwhile moving from assembly to Java. At present it seems OK to code in assembly, *if* the assembly can be made more readable. We have a chicken and egg problem here: Nobody is going to experiment with Java-coded vector loops until we get single-vector CRC32[C] and XMULX instructions surfaced as C2-supported intrinsics. (We already have bit and byte reverse intrinsics, so that part is OK.) -- John From vladimir.kozlov at oracle.com Tue May 5 04:53:06 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 04 May 2015 21:53:06 -0700 Subject: RFR(L): 8076284: Improve vectorization of parallel streams In-Reply-To: <553A936D.2020909@oracle.com> References: <39F83597C33E5F408096702907E6C450E3E586@ORSMSX104.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63334516@FMSMSX112.amr.corp.intel.com> <39F83597C33E5F408096702907E6C450E3E5A4@ORSMSX104.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63334531@FMSMSX112.amr.corp.intel.com> <39F83597C33E5F408096702907E6C450E3E734@ORSMSX104.amr.corp.intel.com> <55303823.1020205@oracle.com> <39F83597C33E5F408096702907E6C450E3F0AC@ORSMSX104.amr.corp.intel.com> <39F83597C33E5F408096702907E6C450E3F413@ORSMSX104.amr.corp.intel.com> <553A936D.2020909@oracle.com> Message-ID: <55484CB2.6060205@oracle.com> I reviewed it and tested in JPRT. If nobody objects I will push it.
Thanks, Vladimir On 4/24/15 12:03 PM, Vladimir Kozlov wrote: > Updated webrev: > > http://cr.openjdk.java.net/~kvn/8076284/webrev.01/ > > Vladimir > > On 4/20/15 11:45 PM, Civlin, Jan wrote: >> Vladimir, >> >> Here is the description and new patch with the changes you recommended (except the last one - see below my explanation). >> >> >> The patch description. >> >> This patch provides on-demand vectorization/SIMD'ing of a method specified in the JVM command line as >> -XX:CompileCommand=option,<method>,Vectorize. >> This optimization may be globally disabled by setting the flag -XX:-AllowVectorizeOnDemand (by default it is true). >> >> For each method that was specified with the Vectorize option we do the following: >> >> 1. On each iteration of loop unroll for a given method (loopopts.cpp) we generate the next _clone_idx (which will be >> common for all the nodes cloned in this iteration); and on each node cloning we hash the _idx of the origin of the node >> that is cloned (_idx_clone_orig ) and the _clone_idx (cm.verify_insert_and_clone). >> CloneMap belongs to Compile and is created in CompilerWrapper. >> >> 2. In SuperWord optimization, after max_depth has been built, we are hoisting the loads. >> For this, for each Load_X (the subject of the hoisting) we find some Load_0 that has the same origin as Load_X but belongs >> to the first iteration, i.e. if the MemNode::Memory input of Load_0 is a memory Phi (collected previously in >> memory_slice) we set this Phi also as the MemNode::Memory input of Load_X. After this rebuild of the graph we restart >> the Superword optimization. >> >> The major routines here: >> - SuperWord::mark_generations: computes _ii_first (the index _clone_idx) of the nodes that have MemNode::Memory input >> coming from a phi node in some slice; computes the list of the nodes in the first and last iterations of the loop.
>> - SuperWord::hoist_loads_in_graph: for each memory slice (a phi node) visits each load that has this phi as a memory >> input and then for each other load that has the same origin makes the memory input coming from the phi. This routine >> does not use marking generations mechanism. >> - SuperWord::pack_parallel - this routine is called only if SuperWord fails to produce any pack after >> extend_packlist(); it is another algorithm for packing instructions into SIMD. >> It goes thru the list of all instructions in the _iteration_first, and if it is a Load, Store, Add or Mul it starts a >> new pack and adds this instruction to this pack. Then the algorithm circulates thru the iterations of the loop (gen < >> _ii_order.length()) and over the instructions list and finds the node with the origin coinciding with the origin of >> the nodes already in the pack - then it adds this node to the pack. Once packs are built, SuperWord returns to the >> normal processing (combine_packs()). >> >> Note, that neither 2 or 3 goes thru the data dependency analysis, since the correctness of parallelization was >> guaranteed by the user. >> Note, that some checks in added code could be omitted. But we are not assume that there is no optimization (now or in >> the future) that can change the graph structure between loop unrolling and the SuperWord, so we prefer to run many >> (probably unneeded) checks in the SuperWord. >> >> >> >>>> Can you also utilize changes done by Michael Berg for reduction >>>> optimization (the code in jdk9/hs-comp already)? I mean marking some >>>> nodes before unrolling and searching Phis. >> Michael Berg and I looked at what you suggested (phi marking before unroll?) and we both think this marking is very >> different than what I do: Michael's phi marking is just one bit per node (in the flags) whereas I collect the >> _idx_clone_orig and the unroll generation (clone_idx). 
>> >> >> -----Original Message----- >> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >> Sent: Thursday, April 16, 2015 3:31 PM >> To: Civlin, Jan; hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR(S): 8076284: Improve vectorization of parallel streams >> >> Hi Jan, >> >> You did not describe your changes in details (what they do). >> >> IgnoreVectorizeMethod flag should positive and enabled by default. >> Rename it to AllowVectorizeOnDemand (or something similar): >> >> + product(bool, AllowVectorizeOnDemand, true, >> \ >> >> Instead of next you should add intrinsic definition to >> and classfile/vmSymbols.hpp and then check method()->intrinsic_id(): >> >> + if (strcmp("forEachRemaining", method()->name()->as_quoted_ascii()) >> == 0 && method()->signature() != 0 >> + && method()->signature()->as_symbol() != 0 && >> method()->signature()->as_symbol()->as_quoted_ascii() != 0 ) { >> + if >> (strstr(method()->signature()->as_symbol()->as_quoted_ascii(),"Ljava/util/function/IntConsumer")) >> { >> + set_do_vector_loop(true); >> + } >> + } >> >> And that should be under flag too because in general forEachRemaining >> should be vectorized only if it is safe. >> >> Can you also utilize changes done by Michael Berg for reduction >> optimization (the code in jdk9/hs-comp already)? I mean marking some >> nodes before unrolling and searching Phis. >> >> Regards, >> Vladimir >> >> On 4/13/15 3:33 AM, Civlin, Jan wrote: >>> Hi All, >>> >>> >>> We would like to contribute the improvement of vectorization of >>> parallel streams from Intel. >>> >>> The contribution Bug ID: 8076284. >>> >>> Please review this patch: >>> >>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8076284 >>> >>> webrev: http://cr.openjdk.java.net/~kvn/8076284/webrev/ >>> >>> >>> *Description* >>> >>> Improve vectorization of the unordered parallel streams (by vectorizing >>> forEachRemaining method). 
>>> For example, this forEach will be vectorized: >>> >>> java.util.stream.IntStream iStream = java.util.stream.IntStream.range(0, >>> RANGE - 1).parallel(); >>> >>> iStream.forEach( id -> c[id] = c[id] + c[id+1] ); >>> >>> It also enables on-demand loop vectorization in a given method (by >>> providing more hints to SuperWord optimization). >>> >>> For example, use -XX:CompileCommand=option,computeCall,Vectorize to >>> vectorize this loop >>> >>> void computeCall(double [] Call, double puByDf, double pdByDf) >>> >>> { >>> >>> for(int i = timeStep; i > 0; i--) >>> >>> for(int j = 0; j <= i - 1; j++) >>> >>> Call[j] = puByDf * Call[j + 1] + pdByDf * Call[j]; >>> >>> } >>> >>> >>> This enhancement is contributed by Intel and sponsored by the hotspot >>> compiler team. >>> >> From aph at redhat.com Tue May 5 09:06:11 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 05 May 2015 10:06:11 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <5548193E.7060803@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> Message-ID: <55488803.4020802@redhat.com> On 05/05/15 02:13, Vladimir Kozlov wrote: > I think that was the last mail on the open list. There were discussions in private. > Here is the updated webrev: > > http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ > > It includes changes in java/math/BigInteger.java which added range > checks and moved the intrinsified code into private methods. > > Andrew Haley has additional suggestions we can discuss now. Andrew, > can you show proposed changes to java code? OK.
The changes I'd like to make to the Java code are fairly minor, but they make possible a better design of the intrinsics. It's not really possible to write a 64-bit version of Montgomery reduction in pure Java (because it uses a 128-bit product internally) but I will produce a JNI version as a demonstration of what I'd like to do in an intrinsic. > I would like to have only simple java changes (methods splitting) > and do experiments/changes with word-reversal later. Sure. The word-reversal is internal to the native intrinsics; it won't affect the Java code at all. Andrew. From tobias.hartmann at oracle.com Tue May 5 13:38:05 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 05 May 2015 15:38:05 +0200 Subject: [9] RFR(S): 8078497: C2's superword optimization causes unaligned memory accesses In-Reply-To: <55477DB1.5050200@oracle.com> References: <5540D237.8000107@oracle.com> <5541172D.5010200@oracle.com> <5541F0B1.50804@oracle.com> <554254A9.6060608@oracle.com> <554705ED.4050305@oracle.com> <55477DB1.5050200@oracle.com> Message-ID: <5548C7BD.7050601@oracle.com> Hi, I verified that this is a problem in the baseline version and not related to my fix. I filed [1] and will take care of it. I will wait to push JDK-8078497 until this is fixed. Best, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8079343 On 04.05.2015 16:09, Tobias Hartmann wrote: > Hi, > > looks like there is still a problem I didn't catch. With my regression test I sometimes hit [1] > > # Internal Error (/opt/jprt/T/P1/130607.tohartma/s/hotspot/src/share/vm/opto/loopnode.cpp:3231), pid=18188, tid=0x00002b00dacde700 > # assert(!had_error) failed: bad dominance > > I'm not sure if this is related to my fix but I assume it is an existing problem that was just triggered by my regression test. > > I'll further investigate.
> > Thanks, > Tobias > > [1] http://sthjprt.se.oracle.com/archives/2015/05/2015-05-04-130607.tohartma.8078497/logs/linux_x64_2.6-fastdebug-c2-hotspot_compiler_3.log.FAILED.log > > On 04.05.2015 07:38, Tobias Hartmann wrote: >> Thanks, Vladimir. >> >> Best, >> Tobias >> >> On 30.04.2015 18:13, Vladimir Kozlov wrote: >>> Looks good. >>> >>> Thanks, >>> Vladimir >>> >>> On 4/30/15 2:06 AM, Tobias Hartmann wrote: >>>> Hi Vladimir, >>>> >>>> thanks for the review! >>>> >>>> On 29.04.2015 19:38, Vladimir Kozlov wrote: >>>>> Consider reducing number of iterations in test from 1M so that the test does not timeout on slow platforms. >>>> >>>> I reduced the number of iterations to 20k and verified that it is enough to trigger compilation and crash on Sparc. >>>> >>>>> lines 511-515 are duplicates of 486-491. Consider moving them before vw%span check. >>>> >>>> Thanks, I refactored that part. >>>> >>>>> Typo 'iff': >>>>> + // final offset is a multiple of vw iff init_offset is a multiple. >>>> >>>> I used 'iff' to refer to 'if and only if' [1]. I changed it. >>>> >>>> Here is the new webrev: >>>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.01/ >>>> >>>> Thanks, >>>> Tobias >>>> >>>> [1] http://en.wikipedia.org/wiki/If_and_only_if >>>> >>>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>> On 4/29/15 5:44 AM, Tobias Hartmann wrote: >>>>>> Hi, >>>>>> >>>>>> please review the following patch. >>>>>> >>>>>> https://bugs.openjdk.java.net/browse/JDK-8078497 >>>>>> http://cr.openjdk.java.net/~thartmann/8078497/webrev.00/ >>>>>> >>>>>> Background information (simplified): >>>>>> After loop optimizations, C2 tries to vectorize memory operations by finding and merging adjacent memory accesses in the loop body. The superword optimization matches MemNodes with the following pattern as address input: >>>>>> >>>>>> ptr + k*iv + constant [+ invar] >>>>>> >>>>>> where iv is the loop induction variable, k is a scaling factor that may be 0 and invar is an optional loop invariant value. 
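[To make the address shape above concrete, here is a small illustrative Java sketch; the values and names are hypothetical and it is not part of the original review. It only checks that an access like a[2*i + inv + 3] really decomposes into ptr + k*iv + constant + invar.]

```java
// Illustrative sketch only: shows how the byte address of a[2*i + inv + 3]
// decomposes into the shape ptr + k*iv + constant [+ invar] that the
// SuperWord optimization pattern-matches. All names are hypothetical.
public class AddressShape {
    static final int ELEM = 4; // element size of int, in bytes

    // ptr + k*iv + constant + invar, all in bytes
    static long decomposed(long ptr, int i, int inv) {
        long k = 2L * ELEM;            // scale from the 2*i term
        long constant = 3L * ELEM;     // from the +3 term
        long invar = (long) inv * ELEM;
        return ptr + k * i + constant + invar;
    }

    // Same address computed directly from the index expression
    static long direct(long ptr, int i, int inv) {
        return ptr + (long) (2 * i + inv + 3) * ELEM;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 16; i++) {
            for (int inv = 0; inv < 8; inv++) {
                if (decomposed(64, i, inv) != direct(64, i, inv)) {
                    throw new AssertionError("mismatch at i=" + i + " inv=" + inv);
                }
            }
        }
        System.out.println("decomposition matches direct address computation");
    }
}
```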
C2 then picks the memory operation with most similar references and tries to align it in the main loop by setting the number of pre-loop iterations accordingly. Other adjacent memory operations that conform to this alignment are merged into packs. After extending and filtering these packs, final vector operations are emitted. >>>>>> >>>>>> The problem is that some architectures (for example, Sparc) require memory accesses to be aligned and in special cases the superword optimization generates code that contains unaligned vector instructions. >>>>>> >>>>>> Problems: >>>>>> (1) If two memory operations have different loop invariant offset values and C2 decides to align the main loop to one of them, we can always set the invariant of the other operation such that we break the alignment constraint. For example, if we have a store to a byte array a[i+inv1] and a load from a byte array b[i+inv2] (where i is the induction variable), C2 may decide to align the main loop such that (i+inv1 % 8) == 0 and replace the adjacent byte stores by a 8-byte double word store. Also the byte loads will be replaced by a double word load. If we then set inv2 = inv1 + 1 at runtime, the load will fail with a SIGBUS because the access is not 8-byte aligned, i.e., (i+inv2 % 8) != 0. Vectorization should not take place in this case. >>>>>> >>>>>> (2) If C2 decides to align the main loop to a memory operation, the necessary adjustment of the induction variable by the pre-loop is computed such that the resulting offset in the main loop is aligned to the vector width (see 'SuperWord::get_iv_adjustment'). If the offset of a memory operation is independent of the loop induction variable (i.e., scale k is 0), the iv adjustment should be 0. >>>>>> >>>>>> (3) If the loop span is greater than the vector width 'vw' of a memory operation, the superword optimization assumes that the pre-loop is never able to align this operation in the main loop (see 'SuperWord::ref_is_alignable'). 
This is wrong because if the loop span is a multiple of vw and depending on the initial offset, it may very well be possible to align the operation to vw. >>>>>> >>>>>> These problems originally only showed up in the string density code base (JDK-8054307) where we have a putChar intrinsic that writes a char value to two entries of a byte array. To reproduce the issue with JDK9 we have to make sure that the memory operations: >>>>>> (i) are independent, >>>>>> (ii) have different invariants, >>>>>> (iii) are not too complex (because then vectorization will not take place). >>>>>> >>>>>> To guarantee (i) we either need two different arrays (for example, byte[] and char[]) for the load and store operations or the same array but different offsets. Since the offsets should include a loop invariant part (ii), the superword optimization will not be able to determine that the runtime offset values do not overlap. We therefore use Unsafe.getChar/putChar to read/write a char value from/to a byte/char array and thereby guarantee independence. >>>>>> >>>>>> I came up with the following test (see 'TestVectorizationWithInvariant.java'): >>>>>> >>>>>> byte[] src = new byte[1000]; >>>>>> char[] dst = new char[1000]; >>>>>> >>>>>> for (int i = (int) CHAR_ARRAY_OFFSET; i < 100; i = i + 8) { >>>>>> // Copy 8 chars from src to dst >>>>>> unsafe.putChar(dst, i + 0, unsafe.getChar(src, off + 0)); >>>>>> [...] >>>>>> unsafe.putChar(dst, i + 14, unsafe.getChar(src, off + 14)); >>>>>> } >>>>>> >>>>>> Problem (1) shows up since the main loop will be aligned to the StoreC[i + 0] which has no invariant. However, the LoadUS[off + 0] has the loop invariant 'off'. Setting off to BYTE_ARRAY_OFFSET + 2 will break the alignment of the emitted double word load and result in a crash. >>>>>> >>>>>> The LoadUS[off + 0] in the above example is independent of the loop induction variable and therefore has no scale value.
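[The failure mode of problem (1) can be sketched in a few lines of illustrative Java; the invariant values are hypothetical, not taken from the actual test. Aligning the store's offset to 8 bytes constrains nothing about the load's offset once the two invariants can differ at runtime.]

```java
// Illustrative sketch only: if the pre-loop chooses the iteration count so
// that (i + inv1) % 8 == 0 in the main loop, an independent invariant
// inv2 = inv1 + 1 leaves the second access misaligned for an 8-byte
// vector access. All values here are hypothetical.
public class AlignmentHazard {
    static boolean aligned8(int byteOffset) {
        return byteOffset % 8 == 0;
    }

    public static void main(String[] args) {
        int inv1 = 3;
        int inv2 = inv1 + 1; // set independently at runtime

        // Find the first i >= 0 that aligns the store address i + inv1.
        int i = 0;
        while (!aligned8(i + inv1)) i++;

        System.out.println("store offset aligned: " + aligned8(i + inv1)); // true
        System.out.println("load offset aligned:  " + aligned8(i + inv2)); // false
    }
}
```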
Because of problem (2), the iv adjustment is computed as -4 (see 'SuperWord::get_iv_adjustment'). >>>>>> >>>>>> offset = 16 iv_adjust = -4 elt_size = 2 scale = 0 iv_stride = 16 vect_size 8 >>>>>> >>>>>> The regression test contains additional test cases that also trigger problem (3). >>>>>> >>>>>> Solution: >>>>>> (1) I added a check to 'SuperWord::find_adjacent_refs' to make sure that memory accesses with different invariants are only vectorized if the architecture supports unaligned memory accesses. >>>>>> (2) 'SuperWord::get_iv_adjustment' is modified such that it returns 0 if the memory operation is independent of iv. >>>>>> (3) In 'SuperWord::ref_is_alignable' we need to additionally check if span is divisible by vw. If so, the pre-loop will add multiples of vw to the initial offset and if that initial offset is divisible by vw, the final offset (after the pre-loop) will be divisible as well and we should return true. I added comments to the code describing this in detail. >>>>>> >>>>>> Testing: >>>>>> - original test with string-density codebase >>>>>> - regression test >>>>>> - JPRT >>>>>> >>>>>> Thanks, >>>>>> Tobias >>>>>> From manasthakur17 at gmail.com Thu May 7 04:27:36 2015 From: manasthakur17 at gmail.com (Manas Thakur) Date: Thu, 7 May 2015 09:57:36 +0530 Subject: Help regarding C2's Ideal IR Message-ID: Hello all Is there a help manual or web source describing the nodes and representation of C2's Ideal IR in detail? The details given on the wiki seem quite sparse. I want to understand the standard analyses and optimizations done by C2, and a detailed description would be very helpful. Regards, Manas -------------- next part -------------- An HTML attachment was scrubbed...
URL: From roland.westrelin at oracle.com Thu May 7 10:25:05 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 7 May 2015 12:25:05 +0200 Subject: RFR 8076276 support for AVX512 In-Reply-To: <553946F5.2090009@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> Message-ID: <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> > http://cr.openjdk.java.net/~kvn/8076276/webrev.02 This looks good to me. A few minor remarks: Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. Not sure why you dropped: 3463 // Float register operands 3473 // Double register operands in x86_64.ad In chaitin.hpp: 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else comment is not aligned with the one below anymore. Roland. From vladimir.kozlov at oracle.com Thu May 7 17:40:07 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 10:40:07 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> Message-ID: <554BA377.8050700@oracle.com> Thanks, Roland On 5/7/15 3:25 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 > > This looks good to me. A few minor remarks: > > Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about?
> > In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. > > Not sure why you dropped: > > 3463 // Float register operands > 3473 // Double register operands > > in x86_64.ad > > In chaitin.hpp: > > 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else > > comment is not aligned with the one below anymore. > > Roland. > I also have questions to Michael. Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? Thanks, Vladimir From roland.westrelin at oracle.com Thu May 7 18:40:01 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 7 May 2015 20:40:01 +0200 Subject: RFR 8076276 support for AVX512 In-Reply-To: <554BA377.8050700@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: >>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >> >> This looks good to me. A few minor remarks: >> >> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. > > Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? I'm talking about: 4101 #ifdef _LP64 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ for instance in x86.ad Why isn't it in x86_64.ad? Roland. > >> >> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters.
>> >> Not sure why you dropped: >> >> 3463 // Float register operands >> 3473 // Double register operands >> >> in x86_64.ad >> >> In chaitin.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> comment is not aligned with the one below anymore. >> >> Roland. >> > > I also have questions to Michael. > > Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? > > Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? > > Thanks, > Vladimir From vladimir.kozlov at oracle.com Thu May 7 19:26:05 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 12:26:05 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: <554BBC4D.6020501@oracle.com> On 5/7/15 11:40 AM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>> >>> This looks good to me. A few minor remarks: >>> >>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >> >> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? I wanted to keep all vector instructions in one file. That is all. Vladimir > > Roland. >> >>> >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters.
>>> >>> Not sure why you dropped: >>> >>> 3463 // Float register operands >>> 3473 // Double register operands >>> >>> in x86_64.ad >>> >>> In chaitin.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> comment is not aligned with the one below anymore. >>> >>> Roland. >>> >> >> I also have questions to Michael. >> >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >> >> Thanks, >> Vladimir > From roland.westrelin at oracle.com Thu May 7 20:07:54 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 7 May 2015 22:07:54 +0200 Subject: RFR 8076276 support for AVX512 In-Reply-To: <554BBC4D.6020501@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BBC4D.6020501@oracle.com> Message-ID: <906035A6-CA25-4EA5-8D91-CC683A0F289B@oracle.com> > I wanted to keep all vector instructions in one file. That is all. Ok. Roland. From michael.c.berg at intel.com Thu May 7 20:38:20 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 7 May 2015 20:38:20 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: Roland, For the usage of LP64 in the x86.ad file, I was following the existing examples, like replicate, which have specific forms for each type LONG usage, if applicable. I wanted to wait to add the 32-bit LONG versions of some of these macros (until later) as they expand to be larger than their 64-bit counterparts and are thought to be of lower performance. 
Currently we make no mention of vector instruction definitions outside of the x86.ad file definitions, so this is in keeping with both methods/attributes of instruction definition.
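[As a rough illustration of the CPUID-style selection these x86.ad definitions rely on, here is an illustrative Java sketch with hypothetical names; it is not the actual adlc-generated code, it only mirrors the idea of choosing the legacy 16-register XMM bank or the 32-register EVEX bank from a runtime feature flag.]

```java
// Illustrative sketch only (hypothetical names): selecting a vector
// register bank based on a runtime CPUID-style feature check, analogous
// to how UseAVX > 2 guards select the EVEX-only definitions.
public class RegBankSelect {
    static final int LEGACY_XMM_COUNT = 16; // xmm0..xmm15
    static final int EVEX_XMM_COUNT   = 32; // xmm0..xmm31 with AVX-512

    // Bit mask with one bit per allocatable vector register.
    static long vectorRegMask(boolean supportsEvex) {
        int n = supportsEvex ? EVEX_XMM_COUNT : LEGACY_XMM_COUNT;
        return (n == 64) ? -1L : (1L << n) - 1;
    }

    public static void main(String[] args) {
        System.out.printf("legacy mask: %x%n", vectorRegMask(false));
        System.out.printf("evex mask:   %x%n", vectorRegMask(true));
    }
}
```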
>> >> Not sure why you dropped: >> >> 3463 // Float register operands >> 3473 // Double register operands >> >> in x86_64.ad >> >> In chaitin.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> comment is not aligned with the one below anymore. >> >> Roland. >> > > I also have questions to Michael. > > Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? > > Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? > > Thanks, > Vladimir From vladimir.kozlov at oracle.com Thu May 7 20:48:41 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 13:48:41 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: <554BCFA9.1070103@oracle.com> You did not answered my questions I think. Or I don't understand your answer: >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >> And Roland's: >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. Thanks, Vladimir On 5/7/15 1:38 PM, Berg, Michael C wrote: > Roland, > > For the usage of LP64 in the x86.ad file, I was following the existing examples, like replicate, which have specific forms for each type LONG usage, if applicable. > I wanted to wait to add the 32-bit LONG versions of some of these macros (until later) as they expand to be larger than their 64-bit counterparts and are thought to be of lower performance. > Currently we no make mention of vector instruction definitions outside of the x86.ad file definitions, so this is in keeping with both methods/attributes of instruction definition. 
> New Instructions: The 64x2, 64x4, 32x4, 32x8 versions of extract/insert, kmov are all new. There are new vector multiply instructions as well as extended forms that do not exist on earlier microarchitectures. > See CPUID guards in the instruction definitions, there many that are AVX3 (UseAVX > 2) only for which are also evident in the assembler layer as supports_evex() instructions. This follows the current AVX512 ISA document in definition (available at: https://software.intel.com/en-us/isa-extensions . By and large the assembler is now savy enough to encode all legacy x86 microarchitecture instructions, while supporting the new ones and the new encoding model. > > Thanks, > Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin > Sent: Thursday, May 07, 2015 11:40 AM > To: Vladimir Kozlov > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > >>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>> >>> This looks good to me. A few minor remarks: >>> >>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >> >> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? > > Roland. >> >>> >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. 
>>> >>> Not sure why you dropped: >>> >>> 3463 // Float register operands >>> 3473 // Double register operands >>> >>> in x86_64.ad >>> >>> In chaitin.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> comment is not aligned with the one below anymore. >>> >>> Roland. >>> >> >> I also have questions to Michael. >> >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >> >> Thanks, >> Vladimir > From michael.c.berg at intel.com Thu May 7 21:07:29 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 7 May 2015 21:07:29 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: Roland, VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. In x86.64.ad: >> 3463 // Float register operands >> 3473 // Double register operands I will re-add the comments. In chaitain.hpp: 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else I will fix the comment location. 
Thanks, Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin Sent: Thursday, May 07, 2015 11:40 AM To: Vladimir Kozlov Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 >>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >> >> This looks good to me. A few minor remarks: >> >> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. > > Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? I'm talking about: 4101 #ifdef _LP64 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ for instance in x86.ad Why isn't it in x86_64.ad? Roland. > >> >> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >> >> Not sure why you dropped: >> >> 3463 // Float register operands >> 3473 // Double register operands >> >> in x86_64.ad >> >> In chaitin.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> comment is not aligned with the one below anymore. >> >> Roland. >> > > I also have questions to Michael. > > Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? > > Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? 
> > Thanks, > Vladimir From vladimir.kozlov at oracle.com Thu May 7 21:27:38 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 14:27:38 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> Message-ID: <554BD8CA.1090409@oracle.com> On 5/7/15 2:07 PM, Berg, Michael C wrote: > Roland, > > VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. > 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. Regards, Vladimir > I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). > If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. > > In x86.64.ad: > >>> 3463 // Float register operands >>> 3473 // Double register operands > > I will re-add the comments. > > In chaitain.hpp: > > 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else > > I will fix the comment location. 
> > Thanks, > Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin > Sent: Thursday, May 07, 2015 11:40 AM > To: Vladimir Kozlov > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > >>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>> >>> This looks good to me. A few minor remarks: >>> >>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >> >> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? > > Roland. >> >>> >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>> >>> Not sure why you dropped: >>> >>> 3463 // Float register operands >>> 3473 // Double register operands >>> >>> in x86_64.ad >>> >>> In chaitin.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> comment is not aligned with the one below anymore. >>> >>> Roland. >>> >> >> I also have questions to Michael. >> >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? 
>> >> Thanks, >> Vladimir > From michael.c.berg at intel.com Thu May 7 22:43:43 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 7 May 2015 22:43:43 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: <554BD8CA.1090409@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> Message-ID: Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, May 07, 2015 2:28 PM To: Berg, Michael C; Roland Westrelin Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 On 5/7/15 2:07 PM, Berg, Michael C wrote: > Roland, > > VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. > 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. Regards, Vladimir > I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). 
> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. > > In x86.64.ad: > >>> 3463 // Float register operands >>> 3473 // Double register operands > > I will re-add the comments. > > In chaitain.hpp: > > 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else > > I will fix the comment location. > > Thanks, > Michael > > -----Original Message----- > From: hotspot-compiler-dev > [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of > Roland Westrelin > Sent: Thursday, May 07, 2015 11:40 AM > To: Vladimir Kozlov > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > >>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>> >>> This looks good to me. A few minor remarks: >>> >>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >> >> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? > > I'm talking about: > 4101 #ifdef _LP64 > 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, > regF tmp, regF tmp2) %{ > > for instance in x86.ad > Why isn't it in x86_64.ad? > > Roland. >> >>> >>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>> >>> Not sure why you dropped: >>> >>> 3463 // Float register operands >>> 3473 // Double register operands >>> >>> in x86_64.ad >>> >>> In chaitin.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> comment is not aligned with the one below anymore. >>> >>> Roland. >>> >> >> I also have questions to Michael. >> >> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >> >> Why you moved "operand vecS() etc." from x86.ad ? 
Do you mean evex is not supported in 32-bit? >> >> Thanks, >> Vladimir > From vladimir.kozlov at oracle.com Thu May 7 23:32:31 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 07 May 2015 16:32:31 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> Message-ID: <554BF60F.9000401@oracle.com> You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. Thanks, Vladimir On 5/7/15 3:43 PM, Berg, Michael C wrote: > Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. > > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Thursday, May 07, 2015 2:28 PM > To: Berg, Michael C; Roland Westrelin > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > On 5/7/15 2:07 PM, Berg, Michael C wrote: >> Roland, >> >> VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. >> 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. 
> > > For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. > > Regards, > Vladimir > >> I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). >> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. >> >> In x86.64.ad: >> >>>> 3463 // Float register operands >>>> 3473 // Double register operands >> >> I will re-add the comments. >> >> In chaitain.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> I will fix the comment location. >> >> Thanks, >> Michael >> >> -----Original Message----- >> From: hotspot-compiler-dev >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >> Roland Westrelin >> Sent: Thursday, May 07, 2015 11:40 AM >> To: Vladimir Kozlov >> Cc: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR 8076276 support for AVX512 >> >>>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>>> >>>> This looks good to me. A few minor remarks: >>>> >>>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >>> >>> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? >> >> I'm talking about: >> 4101 #ifdef _LP64 >> 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, >> regF tmp, regF tmp2) %{ >> >> for instance in x86.ad >> Why isn't it in x86_64.ad? >> >> Roland. 
>>> >>>> >>>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>>> >>>> Not sure why you dropped: >>>> >>>> 3463 // Float register operands >>>> 3473 // Double register operands >>>> >>>> in x86_64.ad >>>> >>>> In chaitin.hpp: >>>> >>>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>>> >>>> comment is not aligned with the one below anymore. >>>> >>>> Roland. >>>> >>> >>> I also have questions to Michael. >>> >>> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >>> >>> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? >>> >>> Thanks, >>> Vladimir >> From michael.c.berg at intel.com Thu May 7 23:41:07 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 7 May 2015 23:41:07 +0000 Subject: RFR 8076276 support for AVX512 In-Reply-To: <554BF60F.9000401@oracle.com> References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> <554BF60F.9000401@oracle.com> Message-ID: Ok, I will leave that way it is plus the comment. Thanks, -Michael -----Original Message----- From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] Sent: Thursday, May 07, 2015 4:33 PM To: Berg, Michael C; Roland Westrelin Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR 8076276 support for AVX512 You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. Thanks, Vladimir On 5/7/15 3:43 PM, Berg, Michael C wrote: > Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. 
On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. > > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Thursday, May 07, 2015 2:28 PM > To: Berg, Michael C; Roland Westrelin > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > On 5/7/15 2:07 PM, Berg, Michael C wrote: >> Roland, >> >> VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. >> 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. > > > For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. > > Regards, > Vladimir > >> I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). >> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. >> >> In x86.64.ad: >> >>>> 3463 // Float register operands >>>> 3473 // Double register operands >> >> I will re-add the comments. >> >> In chaitain.hpp: >> >> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >> >> I will fix the comment location. 
>> >> Thanks, >> Michael >> >> -----Original Message----- >> From: hotspot-compiler-dev >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >> Roland Westrelin >> Sent: Thursday, May 07, 2015 11:40 AM >> To: Vladimir Kozlov >> Cc: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR 8076276 support for AVX512 >> >>>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>>> >>>> This looks good to me. A few minor remarks: >>>> >>>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >>> >>> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? >> >> I'm talking about: >> 4101 #ifdef _LP64 >> 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, >> regF tmp, regF tmp2) %{ >> >> for instance in x86.ad >> Why isn't it in x86_64.ad? >> >> Roland. >>> >>>> >>>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>>> >>>> Not sure why you dropped: >>>> >>>> 3463 // Float register operands >>>> 3473 // Double register operands >>>> >>>> in x86_64.ad >>>> >>>> In chaitin.hpp: >>>> >>>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>>> >>>> comment is not aligned with the one below anymore. >>>> >>>> Roland. >>>> >>> >>> I also have questions to Michael. >>> >>> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >>> >>> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? 
>>> >>> Thanks, >>> Vladimir >> From tobias.hartmann at oracle.com Fri May 8 12:18:06 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 08 May 2015 14:18:06 +0200 Subject: [9] RFR(S): 8079343: Crash in PhaseIdealLoop with "assert(!had_error) failed: bad dominance" Message-ID: <554CA97E.8020708@oracle.com> Hi, please review the following patch. https://bugs.openjdk.java.net/browse/JDK-8079343 http://cr.openjdk.java.net/~thartmann/8079343/webrev.00/ Problem: The regression test for JDK-8078497 [1] fails with "assert(!had_error) failed: bad dominance" because during loop optimizations a node in the pre-loop has a control node in the main-loop. The test looks like this: byte[] src = src1; for (int i = ...) { // Copy 8 chars from src to dst ... // Prevent loop invariant code motion of char read. src = (src == src1) ? src2 : src1; } The problem is that the superword optimization tries to vectorize the 8 load char / store char operations in the loop. In 'SuperWord::align_initial_loop_index' the base address of the memory operation we align the loop to ('align_to_ref_p.base()') is not loop invariant because 'src' changes in each iteration. However, we use it's value to correctly align the address if 'vw > ObjectAlignmentInBytes' (see line 2344). This is wrong because the address changes in each iteration. It also causes follow up problems in 'PhaseIdealLoop::verify_dominance' because nodes in the pre-loop have a control in the main-loop. Solution: I agreed with Vladimir K. that we should not try to vectorize such edge cases. I added a check to the constructor of SWPointer to bail out in case the base address is loop variant. 
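For readers who want to reproduce the shape of the failing pattern, here is a minimal, hypothetical Java sketch (not the actual regression test, which unrolls the eight accesses explicitly): the source array alternates every iteration, so the base address of the char reads is loop variant and cannot be used to align the main loop.

```java
public class LoopVariantBase {
    // Copies chars while swapping the source array on every iteration,
    // so the base address of the loads changes from iteration to iteration.
    static char[] copySwapping(char[] src1, char[] src2, char[] dst) {
        char[] src = src1;
        for (int i = 0; i + 8 <= dst.length; i++) {
            // Eight char loads/stores that SuperWord would try to pack.
            for (int j = 0; j < 8; j++) {
                dst[i + j] = src[i + j];
            }
            // Prevents loop invariant code motion: 'src' is loop variant.
            src = (src == src1) ? src2 : src1;
        }
        return dst;
    }
}
```

With the fix described above, SWPointer simply bails out when it sees that the base ('src' here) is defined inside the loop, so no vectorization is attempted for this shape.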
Testing: - Regression test - JPRT (together with fix for JDK-8078497) Thanks, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8078497 From rickard.backman at oracle.com Fri May 8 13:30:42 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Fri, 8 May 2015 15:30:42 +0200 Subject: RFR(XS): 8079797 assert(index >= 0 && index < _count) failed: check Message-ID: <20150508133042.GN31204@rbackman> Hi, can I please have this change reviewed? The recent OopMap change caused this bug when a few methods were renamed: size() used to return the number of OopMaps. In the new implementation size() was used to return the number of bytes, and count the number of Pairs. I renamed the size() method to number_of_bytes() to avoid confusion in the future. Webrev: http://cr.openjdk.java.net/~rbackman/8079797/ Bug: https://bugs.openjdk.java.net/browse/JDK-8079797 The problem was triggered by running java -XX:+PrintAssembly. Verified that java -XX:+PrintAssembly now works. Thanks /R
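The rename in the fix above guards against a common renaming hazard. A hedged Java sketch of that class of bug follows — the names are hypothetical, not HotSpot's actual OopMap code; it only illustrates why a size() that silently changes meaning from "element count" to "byte count" breaks checks like assert(index >= 0 && index < _count).

```java
public class SizeAmbiguity {
    private final int[] entries;

    public SizeAmbiguity(int count) {
        this.entries = new int[count];
    }

    // Unambiguous names: callers can no longer confuse bytes with elements.
    public int numberOfBytes() {
        return entries.length * Integer.BYTES;
    }

    public int count() {
        return entries.length;
    }

    // A bounds check written against the element count; passing
    // numberOfBytes() here by mistake would accept out-of-range indices.
    public boolean validIndex(int index) {
        return index >= 0 && index < count();
    }
}
```

Giving the byte-size accessor an explicit name, as the patch's number_of_bytes() does, makes it impossible to mistake it for the element count at a call site.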
From aph at redhat.com Fri May 8 15:59:04 2015 From: aph at redhat.com (Andrew Haley) Date: Fri, 08 May 2015 16:59:04 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <55488803.4020802@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> Message-ID: <554CDD48.8080500@redhat.com> Here is a prototype of what I propose: http://cr.openjdk.java.net/~aph/rsa-1/ It is a JNI version of the fast algorithm I think we should use for RSA. It doesn't use any intrinsics. But with it, RSA is already twice as fast as the code we have now for 1024-bit keys, and almost three times as fast for 2048-bit keys! This code (montgomery_multiply) is designed to be as easy as I can possibly make it to translate into a hand-coded intrinsic. It will then offer better performance still, with the JNI overhead gone and with some carefully hand-tweaked memory accesses. It has a very regular structure which should make it fairly easy to turn into a software pipeline with overlapped fetching and multiplication, although this will perhaps be difficult on register-starved machines like the x86. The JNI version can still be used for those machines where people don't yet want to write a hand-coded intrinsic. I haven't yet done anything about Montgomery squaring: it's asymptotically 25% faster than Montgomery multiplication. I have only written it for x86_64.
Systems without a 64-bit multiply won't benefit very much from this idea, but they are pretty much legacy anyway: I want to concentrate on 64-bit systems. So, shall we go with this? I don't think you will find any faster way to do it. Andrew. From fweimer at redhat.com Fri May 8 16:38:39 2015 From: fweimer at redhat.com (Florian Weimer) Date: Fri, 08 May 2015 18:38:39 +0200 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CDD48.8080500@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> Message-ID: <554CE68F.20509@redhat.com> On 05/08/2015 05:59 PM, Andrew Haley wrote: > Here is a prototype of what I propose: > > http://cr.openjdk.java.net/~aph/rsa-1/ > > It is a JNI version of the fast algorithm I think we should use for > RSA. It doesn't use any intrinsics. But with it, RSA is already twice > as fast as the code we have now for 1024-bit keys, and almost three > times as fast for 2048-bit keys! > > This code (montgomery_multiply) is designed to be as easy as I can > possibly make it to translate into a hand-coded intrinsic. It will > then offer better performance still, with the JNI overhead gone and > with some carefully hand-tweaked memory accesses. It has a very > regular structure which should make it fairly easy to turn into a > software pipeline with overlapped fetching and multiplication, > although this will perhaps be difficult on register-starved machines > like the x86. Do we want to add side-channel protection as part of this effort (against timing attacks and cache-flushing attacks)? 
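For context on the algorithm under review: the Montgomery multiplication that montgomery_multiply implements word-by-word can be sketched at the BigInteger level as below. This is only an illustration of the arithmetic under the usual assumptions (odd modulus n, R = 2^k > n, n' = -n^{-1} mod R); it is not the code from the webrev and carries no side-channel protection.

```java
import java.math.BigInteger;

public class MontgomerySketch {
    final BigInteger n;      // odd modulus
    final BigInteger r;      // R = 2^k > n, gcd(R, n) = 1
    final BigInteger nPrime; // n' = -n^{-1} mod R
    final int k;

    MontgomerySketch(BigInteger n) {
        this.n = n;
        this.k = n.bitLength();
        this.r = BigInteger.ONE.shiftLeft(k);
        this.nPrime = n.negate().modInverse(r);
    }

    // For a, b in Montgomery form (xR mod n), returns abR mod n.
    BigInteger montMul(BigInteger a, BigInteger b) {
        BigInteger t = a.multiply(b);
        BigInteger m = t.multiply(nPrime).mod(r);          // m = t * n' mod R
        BigInteger u = t.add(m.multiply(n)).shiftRight(k); // (t + m*n) / R, exact
        return u.compareTo(n) >= 0 ? u.subtract(n) : u;    // u < 2n, one subtract
    }

    BigInteger toMont(BigInteger x)   { return x.shiftLeft(k).mod(n); }
    BigInteger fromMont(BigInteger x) { return montMul(x, BigInteger.ONE); }
}
```

A word-by-word (CIOS-style) implementation interleaves the multiply and the reduction so that only O(words) of storage is live at a time, which is what makes the routine regular enough to hand-code as an intrinsic.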
-- Florian Weimer / Red Hat Product Security From vladimir.x.ivanov at oracle.com Fri May 8 17:16:38 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 08 May 2015 20:16:38 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared Message-ID: <554CEF76.8090903@oracle.com> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 https://bugs.openjdk.java.net/browse/JDK-8079205 Recent change in sun.misc.Cleaner behavior broke CallSite context cleanup. CallSite references context class through a Cleaner to avoid its unnecessary retention. The problem is the following: to do a cleanup (invalidate all affected nmethods) VM needs a pointer to a context class. Until Cleaner is cleared (and it was a manual action, since Cleaner extends PhantomReference isn't automatically cleared according to the docs), VM can extract it from CallSite.context.referent field. I experimented with moving cleanup logic into VM [1], but Peter Levart came up with a clever idea and implemented FinalReference-based cleaner-like Finalizator. Classes don't have finalizers, but Finalizator allows to attach a finalization action to them. And it is guaranteed that the referent is alive when finalization happens. Also, Peter spotted another problem with Cleaner-based implementation. Cleaner cleanup action is strongly referenced, since it is registered in Cleaner class. CallSite context cleanup action keeps a reference to CallSite class (it is passed to MHN.invalidateDependentNMethods). Users are free to extend CallSite and many do so. If a context class and a call site class are loaded by a custom class loader, such loader will never be unloaded, causing a memory leak. Finalizator doesn't suffer from that, since the action is referenced only from Finalizator instance. The downside is that cleanup action can be missed if Finalizator becomes unreachable. 
It's not a problem for CallSite context, because a context is always referenced from some CallSite and if a CallSite becomes unreachable, there's no need to perform a cleanup. Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 Contributed-by: plevart, vlivanov Best regards, Vladimir Ivanov PS: frankly speaking, I would move Finalizator from java.lang.ref to java.lang.invoke and call it Context, if there were a way to extend package-private FinalReference from another package :-) [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 From aph at redhat.com Fri May 8 17:19:07 2015 From: aph at redhat.com (Andrew Haley) Date: Fri, 08 May 2015 18:19:07 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CE68F.20509@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> Message-ID: <554CF00B.6000508@redhat.com> On 05/08/2015 05:38 PM, Florian Weimer wrote: > On 05/08/2015 05:59 PM, Andrew Haley wrote: >> Here is a prototype of what I propose: >> >> http://cr.openjdk.java.net/~aph/rsa-1/ >> >> It is a JNI version of the fast algorithm I think we should use for >> RSA. It doesn't use any intrinsics. But with it, RSA is already twice >> as fast as the code we have now for 1024-bit keys, and almost three >> times as fast for 2048-bit keys! >> >> This code (montgomery_multiply) is designed to be as easy as I can >> possibly make it to translate into a hand-coded intrinsic. 
It will >> then offer better performance still, with the JNI overhead gone and >> with some carefully hand-tweaked memory accesses. It has a very >> regular structure which should make it fairly easy to turn into a >> software pipeline with overlapped fetching and multiplication, >> although this will perhaps be difficult on register-starved machines >> like the x86. > > Do we want to add side-channel protection as part of this effort > (against timing attacks and cache-flushing attacks)? I wouldn't have thought so. It might make sense to add an optional path without key-dependent branches, but not as a part of this effort: the goals are completely orthogonal. Andrew. From vladimir.kozlov at oracle.com Fri May 8 18:05:17 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 08 May 2015 11:05:17 -0700 Subject: RFR(XS): 8079797 assert(index >= 0 && index < _count) failed: check In-Reply-To: <20150508133042.GN31204@rbackman> References: <20150508133042.GN31204@rbackman> Message-ID: <554CFADD.20803@oracle.com> Looks good. Thanks, Vladimir On 5/8/15 6:30 AM, Rickard B?ckman wrote: > Hi, > > can I please have this change reviewed? > The recent OopMap change caused this bug when a few methods were > renamed: > > size() used to return the number of OopMaps. > In the new implementation size() was used to return the number of bytes, > and count the number of Pairs. > > I renamed the size() method to number_of_bytes() to avoid confusion in > the future. > > Webrev: http://cr.openjdk.java.net/~rbackman/8079797/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8079797 > > The problem was triggered by running java -XX:+PrintAssembly. > Verified that java -XX:+PrintAssembly now works. 
> > Thanks > /R > From vladimir.kozlov at oracle.com Fri May 8 18:16:43 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 08 May 2015 11:16:43 -0700 Subject: [9] RFR(S): 8079343: Crash in PhaseIdealLoop with "assert(!had_error) failed: bad dominance" In-Reply-To: <554CA97E.8020708@oracle.com> References: <554CA97E.8020708@oracle.com> Message-ID: <554CFD8B.9090904@oracle.com> Looks good. Thanks, Vladimir On 5/8/15 5:18 AM, Tobias Hartmann wrote: > Hi, > > please review the following patch. > > https://bugs.openjdk.java.net/browse/JDK-8079343 > http://cr.openjdk.java.net/~thartmann/8079343/webrev.00/ > > Problem: > The regression test for JDK-8078497 [1] fails with "assert(!had_error) failed: bad dominance" because during loop optimizations a node in the pre-loop has a control node in the main-loop. The test looks like this: > > > byte[] src = src1; > > for (int i = ...) { > > // Copy 8 chars from src to dst > > ... > > // Prevent loop invariant code motion of char read. > > src = (src == src1) ? src2 : src1; > > } > > > > The problem is that the superword optimization tries to vectorize the 8 load char / store char operations in the loop. In 'SuperWord::align_initial_loop_index' the base address of the memory operation we align the loop to ('align_to_ref_p.base()') is not loop invariant because 'src' changes in each iteration. However, we use it's value to correctly align the address if 'vw > ObjectAlignmentInBytes' (see line 2344). This is wrong because the address changes in each iteration. It also causes follow up problems in 'PhaseIdealLoop::verify_dominance' because nodes in the pre-loop have a control in the main-loop. > > > Solution: > I agreed with Vladimir K. that we should not try to vectorize such edge cases. I added a check to the constructor of SWPointer to bail out in case the base address is loop variant. 
> > Testing: > - Regression test > - JPRT (together with fix for JDK-8078497) > > Thanks, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8078497 > From vladimir.kozlov at oracle.com Fri May 8 19:09:20 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 08 May 2015 12:09:20 -0700 Subject: RFR 8076276 support for AVX512 In-Reply-To: References: <55258337.2050605@oracle.com> <55259078.1080309@oracle.com> <55271100.8080203@oracle.com> <553946F5.2090009@oracle.com> <7FB7F866-50F5-4A5B-98D7-91E8EE7E460E@oracle.com> <554BA377.8050700@oracle.com> <554BD8CA.1090409@oracle.com> <554BF60F.9000401@oracle.com> Message-ID: <554D09E0.4040904@oracle.com> I applied small "cosmetic" changes sent by Michael and I am pushing this. Here is updated webrev for the record (I also removed trailing spaces in .ad files): http://cr.openjdk.java.net/~kvn/8076276/webrev.03 Thanks, Vladimir On 5/7/15 4:41 PM, Berg, Michael C wrote: > Ok, I will leave that way it is plus the comment. > > Thanks, > -Michael > > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Thursday, May 07, 2015 4:33 PM > To: Berg, Michael C; Roland Westrelin > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR 8076276 support for AVX512 > > You should have said this from the beginning! Add comment to x86_32.ad before operand vecS explaining why it uses vectors_reg_legacy. > > To avoid unneeded runtime code generation by adlc. It produces the same result regardless evex support for these vectors in 32-bit VM. > > Thanks, > Vladimir > > On 5/7/15 3:43 PM, Berg, Michael C wrote: >> Tried it out as a single definition referencing reg_class_dynamic for both 64-bit and 32-bit. On the new machine model it does work correctly for both. However on 32-bit we will now emit more auto generated code through the adlc layer as we will have the same definitions on both evex and non evex, it is why I used separate definitions. 
>> >> -Michael >> >> -----Original Message----- >> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >> Sent: Thursday, May 07, 2015 2:28 PM >> To: Berg, Michael C; Roland Westrelin >> Cc: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR 8076276 support for AVX512 >> >> On 5/7/15 2:07 PM, Berg, Michael C wrote: >>> Roland, >>> >>> VecS like the other forms of Vec{D|X|Y|Z} are defined as needed in each of the 32 and 64 ad files, as they are divergent now. The 32-bit version only has the legacy bank. >>> 64-bit version uses the newly defined reg_class_dynamic definition to provide the appropriate bank of registers based on CPUID. The chunk3 rename is an artifact of removing the kreg bank. >> >> >> For 32-bit reg_class_dynamic will give the same result as legacy because registers after XMM7 will be cut of by #ifdef _LP64. So I don't understand why you need to split Vec{D|X|Y} operands. You do kept vecZ operand the same for 32- and 64-bit. >> >> Regards, >> Vladimir >> >>> I will put it back to the old name (and test it out). Eventually, we will change it back once we formalize the usage of kreg into all facets of code generation ( in a later patch ). >>> If we made a single version out of os_supports_avx_vectors()'s loop, we would have a guard in the loop as the size of the regions are now different, it seems cleaner the way it is. >>> >>> In x86.64.ad: >>> >>>>> 3463 // Float register operands >>>>> 3473 // Double register operands >>> >>> I will re-add the comments. >>> >>> In chaitain.hpp: >>> >>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>> >>> I will fix the comment location. 
>>> >>> Thanks, >>> Michael >>> >>> -----Original Message----- >>> From: hotspot-compiler-dev >>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >>> Roland Westrelin >>> Sent: Thursday, May 07, 2015 11:40 AM >>> To: Vladimir Kozlov >>> Cc: hotspot-compiler-dev at openjdk.java.net >>> Subject: Re: RFR 8076276 support for AVX512 >>> >>>>>> http://cr.openjdk.java.net/~kvn/8076276/webrev.02 >>>>> >>>>> This looks good to me. A few minor remarks: >>>>> >>>>> Shouldn't the #ifdef _LP64 new instruct be in x86_64.ad? I see there are already other #ifdef _LP64 in x86.ad so I'm not sure what the guideline is. >>>> >>>> Why you need #ifdef _LP64 in x86_64.ad were _LP64 is set by default (used only in 64-bit VM)? What new instructions you are talking about? >>> >>> I'm talking about: >>> 4101 #ifdef _LP64 >>> 4102 instruct rvadd2L_reduction_reg(rRegL dst, rRegL src1, vecX src2, >>> regF tmp, regF tmp2) %{ >>> >>> for instance in x86.ad >>> Why isn't it in x86_64.ad? >>> >>> Roland. >>>> >>>>> >>>>> In vm_version_x86.hpp, os_supports_avx_vectors(), you could have a single copy of the loop with different parameters. >>>>> >>>>> Not sure why you dropped: >>>>> >>>>> 3463 // Float register operands >>>>> 3473 // Double register operands >>>>> >>>>> in x86_64.ad >>>>> >>>>> In chaitin.hpp: >>>>> >>>>> 144 uint16_t _num_regs; // 2 for Longs and Doubles, 1 for all else >>>>> >>>>> comment is not aligned with the one below anymore. >>>>> >>>>> Roland. >>>>> >>>> >>>> I also have questions to Michael. >>>> >>>> Why you renamed chunk2 to "alloc_class chunk3(RFLAGS);"? >>>> >>>> Why you moved "operand vecS() etc." from x86.ad ? Do you mean evex is not supported in 32-bit? 
>>>> >>>> Thanks, >>>> Vladimir >>> From aph at redhat.com Sat May 9 07:03:32 2015 From: aph at redhat.com (Andrew Haley) Date: Sat, 09 May 2015 08:03:32 +0100 Subject: aarch64 AD-file / matching rule In-Reply-To: References: Message-ID: <554DB144.1050705@redhat.com> On 04/23/2015 02:41 PM, Benedikt Wedenik wrote: > One of the patterns I want to optimise is the following: > > 0x0000007f8c2961b4: and w2, w2, #0x7ffff8 0x0000007f8c2961b8: cmp w2, #0x0 0x0000007f8c2961bc: b.eq 0x0000007f8c2968f4 > > > Here I see an opportunity for ands, b.eq. It would make more sense to use cbz. In fact, I have no idea why cbz is not being used here. I can't see any point to this pattern. > [-client -Xcomp -XX:-TieredCompilation] : Here the cmp for #0x0 only occurs about 3 times in the whole disassembly, instead of about 200 times without these flags. This tells you that the cmp for #0x0 is C1, not C2. > My question is now how to find out why the rule does not match / if the rule is correct and how to find the actual rule which emits the code of my desired pattern. 1. Use a debug build. 2. Turn off tiered compilation. Look at a disassembly and see which method and which line emits the code above. You haven't provided enough context. 3. Forget about optimizing C1. It will have no effect on benchmarks. Andrew. From peter.levart at gmail.com Sat May 9 09:33:43 2015 From: peter.levart at gmail.com (Peter Levart) Date: Sat, 09 May 2015 11:33:43 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <554CEF76.8090903@oracle.com> References: <554CEF76.8090903@oracle.com> Message-ID: <554DD477.7000703@gmail.com> Hi Vladimir, On 05/08/2015 07:16 PM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 > https://bugs.openjdk.java.net/browse/JDK-8079205 Your Finalizator touches are good.
Supplier interface is not needed as there is a public Reference superclass that can be used for return type of JavaLangRefAccess.createFinalizator(). You can remove the import for Supplier in Reference now. > > Recent change in sun.misc.Cleaner behavior broke CallSite context > cleanup. > > CallSite references context class through a Cleaner to avoid its > unnecessary retention. > > The problem is the following: to do a cleanup (invalidate all affected > nmethods) VM needs a pointer to a context class. Until Cleaner is > cleared (and it was a manual action, since Cleaner extends > PhantomReference isn't automatically cleared according to the docs), > VM can extract it from CallSite.context.referent field. If PhantomReference.referent wasn't cleared by VM when PhantomReference was equeued, could it happen that the referent pointer was still != null and the referent object's heap memory was already reclaimed by GC? Is that what JDK-8071931 is about? (I cant't see the bug - it's internal). In that respect the Cleaner based solution was broken from the start, as you did dereference the referent after Cleaner was enqueued. You could get a != null pointer which was pointing to reclaimed heap. In Java this dereference is prevented by PhantomReference.get() always returning null (well, nothing prevents one to read the referent field with reflection though). FinalReference(s) are different in that their referent is not reclaimed while it is still reachable through FinalReference, which means that finalizable objects must undergo at least two GC cycles, with processing in the reference handler and finalizer threads inbetween the cycles, to be reclaimed. PhantomReference(s), on the other hand, can be enqueued and their referents reclaimed in the same GC cycle, can't they? > > I experimented with moving cleanup logic into VM [1], What's the downside of that approach? I mean, why is GC-assisted approach better? Simpler? 
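Peter's point about phantom reachability can be made concrete with a small, hypothetical sketch (this is neither sun.misc.Cleaner nor the proposed Finalizator): because PhantomReference.get() always returns null, any state a phantom-based cleanup needs — such as the context pointer in the CallSite case — must travel in the reference object itself rather than be read back through the referent.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

public class PhantomCleanup<T> extends PhantomReference<T> {
    // Cleanup state is carried alongside the reference, never via the referent.
    private final Runnable action;

    public PhantomCleanup(T referent, ReferenceQueue<? super T> queue, Runnable action) {
        super(referent, queue);
        this.action = action;
    }

    // Invoked after the reference has been dequeued from 'queue'.
    public void clean() {
        action.run();
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        PhantomCleanup<Object> ref =
            new PhantomCleanup<>(new Object(), queue, () -> System.out.println("cleaned"));
        // The referent is never visible through a phantom reference:
        System.out.println(ref.get()); // prints null
    }
}
```

A FinalReference-based Finalizator inverts this: the referent is still strongly reachable from the reference when the action runs, which is why the VM can safely dereference the context class at that point.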
> but Peter Levart came up with a clever idea and implemented > FinalReference-based cleaner-like Finalizator. Classes don't have > finalizers, but Finalizator allows to attach a finalization action to > them. And it is guaranteed that the referent is alive when > finalization happens. > > Also, Peter spotted another problem with Cleaner-based implementation. > Cleaner cleanup action is strongly referenced, since it is registered > in Cleaner class. CallSite context cleanup action keeps a reference to > CallSite class (it is passed to MHN.invalidateDependentNMethods). > Users are free to extend CallSite and many do so. If a context class > and a call site class are loaded by a custom class loader, such loader > will never be unloaded, causing a memory leak. > > Finalizator doesn't suffer from that, since the action is referenced > only from Finalizator instance. The downside is that cleanup action > can be missed if Finalizator becomes unreachable. It's not a problem > for CallSite context, because a context is always referenced from some > CallSite and if a CallSite becomes unreachable, there's no need to > perform a cleanup. > > Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 > > Contributed-by: plevart, vlivanov > > Best regards, > Vladimir Ivanov > > PS: frankly speaking, I would move Finalizator from java.lang.ref to > java.lang.invoke and call it Context, if there were a way to extend > package-private FinalReference from another package :-) Modules may have an answer for that. FinalReference could be moved to a package (java.lang.ref.internal or jdk.lang.ref) that is not exported to the world and made public. On the other hand, I don't see a reason why FinalReference couldn't be part of JDK public API. 
I know it's currently just a private mechanism to implement finalization which has issues and many would like to see it gone, but Java has learned to live with it, and, as you see, it can be a solution to some problems too ;-) Regards, Peter > > [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 -------------- next part -------------- An HTML attachment was scrubbed... URL: From tobias.hartmann at oracle.com Mon May 11 05:33:16 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 11 May 2015 07:33:16 +0200 Subject: [9] RFR(S): 8079343: Crash in PhaseIdealLoop with "assert(!had_error) failed: bad dominance" In-Reply-To: <554CFD8B.9090904@oracle.com> References: <554CA97E.8020708@oracle.com> <554CFD8B.9090904@oracle.com> Message-ID: <55503F1C.2070808@oracle.com> Thanks, Vladimir. Best, Tobias On 08.05.2015 20:16, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/8/15 5:18 AM, Tobias Hartmann wrote: >> Hi, >> >> please review the following patch. >> >> https://bugs.openjdk.java.net/browse/JDK-8079343 >> http://cr.openjdk.java.net/~thartmann/8079343/webrev.00/ >> >> Problem: >> The regression test for JDK-8078497 [1] fails with "assert(!had_error) failed: bad dominance" because during loop optimizations a node in the pre-loop has a control node in the main-loop. The test looks like this: >> >> >> byte[] src = src1; >> >> for (int i = ...) { >> >> // Copy 8 chars from src to dst >> >> ... >> >> // Prevent loop invariant code motion of char read. >> >> src = (src == src1) ? src2 : src1; >> >> } >> >> >> >> The problem is that the superword optimization tries to vectorize the 8 load char / store char operations in the loop. In 'SuperWord::align_initial_loop_index' the base address of the memory operation we align the loop to ('align_to_ref_p.base()') is not loop invariant because 'src' changes in each iteration. However, we use its value to correctly align the address if 'vw > ObjectAlignmentInBytes' (see line 2344).
This is wrong because the address changes in each iteration. It also causes follow-up problems in 'PhaseIdealLoop::verify_dominance' because nodes in the pre-loop have a control in the main-loop. >> >> >> Solution: >> I agreed with Vladimir K. that we should not try to vectorize such edge cases. I added a check to the constructor of SWPointer to bail out in case the base address is loop variant. >> >> Testing: >> - Regression test >> - JPRT (together with fix for JDK-8078497) >> >> Thanks, >> Tobias >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8078497 >> From rickard.backman at oracle.com Mon May 11 07:36:25 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Mon, 11 May 2015 09:36:25 +0200 Subject: RFR(XS): 8079797 assert(index >= 0 && index < _count) failed: check In-Reply-To: <554CFADD.20803@oracle.com> References: <20150508133042.GN31204@rbackman> <554CFADD.20803@oracle.com> Message-ID: <20150511073625.GP31204@rbackman> Thanks Vladimir. /R On 05/08, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/8/15 6:30 AM, Rickard Bäckman wrote: > >Hi, > > > >can I please have this change reviewed? > >The recent OopMap change caused this bug when a few methods were > >renamed: > > > >size() used to return the number of OopMaps. > >In the new implementation size() was used to return the number of bytes, > >and count the number of Pairs. > > > >I renamed the size() method to number_of_bytes() to avoid confusion in > >the future. > > > >Webrev: http://cr.openjdk.java.net/~rbackman/8079797/ > >Bug: https://bugs.openjdk.java.net/browse/JDK-8079797 > > > >The problem was triggered by running java -XX:+PrintAssembly. > >Verified that java -XX:+PrintAssembly now works. > > > >Thanks > >/R > > -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From fweimer at redhat.com Mon May 11 15:37:36 2015 From: fweimer at redhat.com (Florian Weimer) Date: Mon, 11 May 2015 17:37:36 +0200 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CF00B.6000508@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> <554CF00B.6000508@redhat.com> Message-ID: <5550CCC0.6000502@redhat.com> On 05/08/2015 07:19 PM, Andrew Haley wrote: >> Do we want to add side-channel protection as part of this effort >> (against timing attacks and cache-flushing attacks)? > > I wouldn't have thought so. It might make sense to add an optional > path without key-dependent branches, but not as a part of this effort: > the goals are completely orthogonal. I'm not well-versed in this kind of side-channel protection for RSA implementations, but my impression is that algorithm changes are needed to mitigate the impact of data-dependent memory fetches (see fixed-width modular exponentiation).
-- Florian Weimer / Red Hat Product Security From aph at redhat.com Mon May 11 16:15:42 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 11 May 2015 17:15:42 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <5550CCC0.6000502@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> <554CF00B.6000508@redhat.com> <5550CCC0.6000502@redhat.com> Message-ID: <5550D5AE.1070602@redhat.com> On 05/11/2015 04:37 PM, Florian Weimer wrote: > On 05/08/2015 07:19 PM, Andrew Haley wrote: > >>> Do we want to add side-channel protection as part of this effort >>> (against timing attacks and cache-flushing attacks)? >> >> I wouldn't have thought so. It might make sense to add an optional >> path without key-dependent branches, but not as a part of this effort: >> the goals are completely orthogonal. > > I'm not well-versed in this kind of side-channel protection for RSA > implementations, but my impression is that algorithm changes are needed > to mitigate the impact of data-dependent memory fetches (see > fixed-width modular exponentiation). RSA can be done with no key- (or data-) dependent timing. No Chinese Remainder Theorem optimizations, no addition chains, no optimized division. Just do everything. Even when the result of a multiplication is not needed, do it anyway. This is really easy to do: just delete all of the optimizations (and therefore most of the code) in oddModPow, leaving only a simple kernel and a much slower algorithm.
> But maybe the necessary changes materialize at a higher level, > beyond the operation which you proposed to intrinsify. They absolutely do. The algorithm I propose to intrinsify has almost no key-dependent timing whatsoever unless the multiplication instruction itself has an early-out optimization. And there's not much anyone can do about that. (Well, almost: it may be necessary to do an extra subtraction to normalize the result of the Montgomery multiplication, and this is data-dependent. I suppose it would be possible to do the subtraction anyway, and throw it away if needs be.) It would make more sense to use one of the techniques to prevent timing attacks by using blinding functions. See e.g. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems, Kocher 1995. But really it seems inappropriate to me to piggyback this conversation on top of a completely different subject, RSA acceleration. Let's continue it elsewhere. Andrew. From fweimer at redhat.com Mon May 11 17:04:13 2015 From: fweimer at redhat.com (Florian Weimer) Date: Mon, 11 May 2015 19:04:13 +0200 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <5550D5AE.1070602@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> <554CF00B.6000508@redhat.com> <5550CCC0.6000502@redhat.com> <5550D5AE.1070602@redhat.com> Message-ID: <5550E10D.7020304@redhat.com> On 05/11/2015 06:15 PM, Andrew Haley wrote: > But really it seems inappropriate to me to piggyback this conversation > on top of a completely different subject, RSA 
acceleration. Let's > continue it elsewhere. Agreed. I apologize for the side discussion. -- Florian Weimer / Red Hat Product Security From vitalyd at gmail.com Tue May 12 00:25:30 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Mon, 11 May 2015 20:25:30 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer Message-ID: Hi guys, Suppose we have the following code: private static final long[] ARR = new long[10]; private static final int ITERATIONS = 500 * 1000 * 1000; static void loop() { int i = ITERATIONS; while(--i >= 0) { ARR[4] = i; } } My expectation would be that loop() effectively compiles to: ARR[4] = 0; ret; However, an actual loop is compiled. This example is a reduction of a slightly more involved case that I had in mind, but I'm curious if I'm missing something here. I tested this with 8u40 and tiered compilation disabled, for what it's worth. Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Tue May 12 06:24:39 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Tue, 12 May 2015 06:24:39 +0000 Subject: Anyway to hack JITTed code manually? Message-ID: Hi All, Is there any simple way to hack JITTed code and relocate at runtime? Following steps are expected. 1. Dump JITTed code (c1/c2) to file or some cached directory. 2. Hack assembly code 3. Run JVM to load the hacked assembly code and relocating in memory Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Tue May 12 06:33:49 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Mon, 11 May 2015 23:33:49 -0700 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Wei, For HotSpot VM as it is now, the answer is no, there isn't a simple way to do that. 
One of the various problems that you'd have with trying to do something like this is relocation: there are a lot of addresses embedded into JIT compiled code as constants -- some could be Klass*, some could be compiled entry points to other compiled methods or stubs. The kind of feature you're asking for is more likely to be present in a VM that supports some certain form of AOT compilation, where the system has to deal with relocation of code anyway. The HotSpot VM we can see in OpenJDK doesn't do that right now. - Kris On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) wrote: > Hi All, > > Is there any simple way to hack JITTed code and relocate at runtime? > Following steps are expected. > > > > 1. Dump JITTed code (c1/c2) to file or some cached directory. > > 2. Hack assembly code > > 3. Run JVM to load the hacked assembly code and relocate it in memory > > > > > > Regards! > > wei > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Tue May 12 06:42:53 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Tue, 12 May 2015 06:42:53 +0000 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Kris, Thanks for quick response. Is there any other hacking way in OpenJDK for fast performance measurement? Or is it possible to invoke another profiling tool such as "perf" instead of hprof in JVM to sample more PMU events. Regards! wei From: Krystal Mok [mailto:rednaxelafx at gmail.com] Sent: Tuesday, May 12, 2015 2:34 PM To: Tangwei (Euler) Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: Anyway to hack JITTed code manually? Hi Wei, For HotSpot VM as it is now, the answer is no, there isn't a simple way to do that.
One of the various problem that you'd have with trying to do something like this is relocation: there are a lot of addresses embedded into JIT compiled code as constants -- some could be Klass*, some could be compiled entry points to other compiled methods or stubs. The kind of feature you're asking for is more likely to be present in a VM that supports some certain form of AOT compilation, where the system has to deal with relocation of code anyway. The HotSpot VM we can see in OpenJDK doesn't do that right now. - Kris On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) > wrote: Hi All, Is there any simple way to hack JITTed code and relocate at runtime? Following steps are expected. 1. Dump JITTed code (c1/c2) to file or some cached directory. 2. Hack assembly code 3. Run JVM to load the hacked assembly code and relocating in memory Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Tue May 12 06:49:36 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Mon, 11 May 2015 23:49:36 -0700 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Wei, HotSpot VM is known to be able to work with other more native profiling tools. HotSpot can work with Linux perf. An easy way to run benchmarks and profiling them is with the JMH framework and with the "perfasm" feature [1][2]. Other profilers like Oracle Solaris Studio Performance Analyzer [3], Intel VTune and AMD CodeAnalyst are also known to work with HotSpot. - Kris [1]: http://cr.openjdk.java.net/~shade/jmh/perfasm-sample.log [2]: http://shipilev.net/blog/2014/java-scala-divided-we-fail/ [3]: http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html On Mon, May 11, 2015 at 11:42 PM, Tangwei (Euler) wrote: > Hi Kris, > > Thanks for quick response. Is there any other hacking way in OpenJDK for > fast performance measurement? 
> > Or is it possible to invoke another profiling tool such as "perf" instead > of hprof in JVM to sample more PMU events. > > > > Regards! > > wei > > > > *From:* Krystal Mok [mailto:rednaxelafx at gmail.com] > *Sent:* Tuesday, May 12, 2015 2:34 PM > *To:* Tangwei (Euler) > *Cc:* hotspot-compiler-dev at openjdk.java.net > *Subject:* Re: Anyway to hack JITTed code manually? > > > > Hi Wei, > > > > For HotSpot VM as it is now, the answer is no, there isn't a simple way to > do that. > > > > One of the various problems that you'd have with trying to do something > like this is relocation: there are a lot of addresses embedded into JIT > compiled code as constants -- some could be Klass*, some could be compiled > entry points to other compiled methods or stubs. > > > > The kind of feature you're asking for is more likely to be present in a VM > that supports some certain form of AOT compilation, where the system has to > deal with relocation of code anyway. The HotSpot VM we can see in OpenJDK > doesn't do that right now. > > > > - Kris > > > > On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) > wrote: > > Hi All, > > Is there any simple way to hack JITTed code and relocate at runtime? > Following steps are expected. > > > > 1. Dump JITTed code (c1/c2) to file or some cached directory. > > 2. Hack assembly code > > 3. Run JVM to load the hacked assembly code and relocate it in memory > > > > > > Regards! > > wei > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Tue May 12 07:05:41 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Tue, 12 May 2015 07:05:41 +0000 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Kris, Sorry for inaccurate question description, the tool I am looking for is to sample a specific function. For example I know some function called "Foo" is hot, but has poor performance. I just want to sample it with some PMU event set.
I am working on ARM, so some X86 only profiler is not suitable to me. Regards! wei From: Krystal Mok [mailto:rednaxelafx at gmail.com] Sent: Tuesday, May 12, 2015 2:50 PM To: Tangwei (Euler) Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: Anyway to hack JITTed code manually? Hi Wei, HotSpot VM is known to be able to work with other more native profiling tools. HotSpot can work with Linux perf. An easy way to run benchmarks and profiling them is with the JMH framework and with the "perfasm" feature [1][2]. Other profilers like Oracle Solaris Studio Performance Analyzer [3], Intel VTune and AMD CodeAnalyst are also known to work with HotSpot. - Kris [1]: http://cr.openjdk.java.net/~shade/jmh/perfasm-sample.log [2]: http://shipilev.net/blog/2014/java-scala-divided-we-fail/ [3]: http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html On Mon, May 11, 2015 at 11:42 PM, Tangwei (Euler) > wrote: Hi Kris, Thanks for quick response. Is there any other hacking way in OpenJDK for fast performance measurement? Or is that possible to invoke other profiling tool such as ?perf? instead of hprof in JVM to sampling more PMU events. Regards! wei From: Krystal Mok [mailto:rednaxelafx at gmail.com] Sent: Tuesday, May 12, 2015 2:34 PM To: Tangwei (Euler) Cc: hotspot-compiler-dev at openjdk.java.net Subject: Re: Anyway to hack JITTed code manually? Hi Wei, For HotSpot VM as it is now, the answer is no, there isn't a simple way to do that. One of the various problem that you'd have with trying to do something like this is relocation: there are a lot of addresses embedded into JIT compiled code as constants -- some could be Klass*, some could be compiled entry points to other compiled methods or stubs. The kind of feature you're asking for is more likely to be present in a VM that supports some certain form of AOT compilation, where the system has to deal with relocation of code anyway. 
The HotSpot VM we can see in OpenJDK doesn't do that right now. - Kris On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) > wrote: Hi All, Is there any simple way to hack JITTed code and relocate at runtime? Following steps are expected. 1. Dump JITTed code (c1/c2) to file or some cached directory. 2. Hack assembly code 3. Run JVM to load the hacked assembly code and relocating in memory Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From filipp.zhinkin at gmail.com Tue May 12 09:32:21 2015 From: filipp.zhinkin at gmail.com (Filipp Zhinkin) Date: Tue, 12 May 2015 12:32:21 +0300 Subject: Anyway to hack JITTed code manually? In-Reply-To: References: Message-ID: Hi Wei, in order to profile JIT-generated code using perf you can use perf-map-agent [1] which is aimed to generate map files for compiled methods. Regards, Filipp. [1] https://github.com/jrudolph/perf-map-agent On Tue, May 12, 2015 at 10:05 AM, Tangwei (Euler) wrote: > Hi Kris, > > Sorry for inaccurate question description, the tool I am looking for is to > sampling specific function. > > For example I know some function called ?Foo? is hot, but has poor > performance. I just want to sample > > it with some PMU event set. I am working on ARM, so some X86 only profiler > is not suitable to me. > > > > > > Regards! > > wei > > > > From: Krystal Mok [mailto:rednaxelafx at gmail.com] > Sent: Tuesday, May 12, 2015 2:50 PM > > > To: Tangwei (Euler) > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: Anyway to hack JITTed code manually? > > > > Hi Wei, > > > > HotSpot VM is known to be able to work with other more native profiling > tools. > > > > HotSpot can work with Linux perf. An easy way to run benchmarks and > profiling them is with the JMH framework and with the "perfasm" feature > [1][2]. > > > > Other profilers like Oracle Solaris Studio Performance Analyzer [3], Intel > VTune and AMD CodeAnalyst are also known to work with HotSpot. 
> > > > - Kris > > > > [1]: http://cr.openjdk.java.net/~shade/jmh/perfasm-sample.log > > [2]: http://shipilev.net/blog/2014/java-scala-divided-we-fail/ > > [3]: > http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html > > > > On Mon, May 11, 2015 at 11:42 PM, Tangwei (Euler) > wrote: > > Hi Kris, > > Thanks for quick response. Is there any other hacking way in OpenJDK for > fast performance measurement? > > Or is that possible to invoke other profiling tool such as ?perf? instead of > hprof in JVM to sampling more PMU events. > > > > Regards! > > wei > > > > From: Krystal Mok [mailto:rednaxelafx at gmail.com] > Sent: Tuesday, May 12, 2015 2:34 PM > To: Tangwei (Euler) > Cc: hotspot-compiler-dev at openjdk.java.net > Subject: Re: Anyway to hack JITTed code manually? > > > > Hi Wei, > > > > For HotSpot VM as it is now, the answer is no, there isn't a simple way to > do that. > > > > One of the various problem that you'd have with trying to do something like > this is relocation: there are a lot of addresses embedded into JIT compiled > code as constants -- some could be Klass*, some could be compiled entry > points to other compiled methods or stubs. > > > > The kind of feature you're asking for is more likely to be present in a VM > that supports some certain form of AOT compilation, where the system has to > deal with relocation of code anyway. The HotSpot VM we can see in OpenJDK > doesn't do that right now. > > > > - Kris > > > > On Mon, May 11, 2015 at 11:24 PM, Tangwei (Euler) > wrote: > > Hi All, > > Is there any simple way to hack JITTed code and relocate at runtime? > Following steps are expected. > > > > 1. Dump JITTed code (c1/c2) to file or some cached directory. > > 2. Hack assembly code > > 3. Run JVM to load the hacked assembly code and relocating in memory > > > > > > Regards! 
> > wei > > > > From roland.westrelin at oracle.com Tue May 12 10:40:05 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 12 May 2015 12:40:05 +0200 Subject: RFR(M): 8076188 Optimize arraycopy out for non escaping destination In-Reply-To: References: <2B8622DD-1DA5-4715-B4BF-3801202B588B@oracle.com> <553F9C4A.7040407@oracle.com> Message-ID: <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> jprt found some problems with this change. So I'd like to push: http://cr.openjdk.java.net/~roland/8076188/webrev.00-01/ on top of the reviewed change. Overall change: http://cr.openjdk.java.net/~roland/8076188/webrev.01/ List of fixes: - tests with G1 failed in verification code. I made the change to LoadNode::Ideal() so the new code that looks for a dominating identical load is disabled for raw loads - OptimizePtrCompare is broken in cases like the new test case added to TestEliminateArrayCopy.java: field from non escaping object should have an unknown value when the object is target of an ArrayCopy. - The ifnode change is unrelated (I found the problem when debugging one of the issues above) but it's simple enough that I don't think it needs its own CR. Roland. > On Apr 29, 2015, at 11:30 AM, Roland Westrelin wrote: > > Thanks for the review, Vladimir. > > Roland. > >> On Apr 28, 2015, at 4:42 PM, Vladimir Ivanov wrote: >> >> Looks good.
>> >> Best regards, >> Vladimir Ivanov >> >> On 4/21/15 4:02 PM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~roland/8076188/webrev.00/ >>> >>> This patch tries to eliminate ArrayCopyNodes (for instance clones, array clones, arraycopy and copyOf) when the destination of the copy doesn?t escape: >>> >>> - during escape analysis, ArrayCopyNodes don?t cause the destination of the copy to be marked as escaping anymore >>> - a load to the destination of a copy may be replaced by a load from the source during IGVN >>> - during macro expansion, ArrayCopyNodes don?t stop allocation from being eliminated and can themselves be eliminated >>> >>> Roland. >>> > From vladimir.x.ivanov at oracle.com Tue May 12 10:56:04 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 12 May 2015 13:56:04 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <554DD477.7000703@gmail.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> Message-ID: <5551DC44.9060902@oracle.com> Peter, >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 >> https://bugs.openjdk.java.net/browse/JDK-8079205 > > Your Finalizator touches are good. Supplier interface is not needed as > there is a public Reference superclass that can be used for return type > of JavaLangRefAccess.createFinalizator(). You can remove the import for > Supplier in Reference now. Thanks for spotting that. Will do. >> Recent change in sun.misc.Cleaner behavior broke CallSite context >> cleanup. >> >> CallSite references context class through a Cleaner to avoid its >> unnecessary retention. >> >> The problem is the following: to do a cleanup (invalidate all affected >> nmethods) VM needs a pointer to a context class. 
Until Cleaner is >> cleared (and it was a manual action, since Cleaner extends >> PhantomReference isn't automatically cleared according to the docs), >> VM can extract it from CallSite.context.referent field. > > If PhantomReference.referent wasn't cleared by VM when PhantomReference > was enqueued, could it happen that the referent pointer was still != null > and the referent object's heap memory was already reclaimed by GC? Is > that what JDK-8071931 is about? (I can't see the bug - it's internal). > In that respect the Cleaner based solution was broken from the start, as > you did dereference the referent after Cleaner was enqueued. You could > get a != null pointer which was pointing to reclaimed heap. In Java this > dereference is prevented by PhantomReference.get() always returning null > (well, nothing prevents one from reading the referent field with reflection > though). No, the object isn't reclaimed while the corresponding reference is != null. When Cleaner wasn't automatically cleared, it was guaranteed that the referent is alive when cleanup action happens. That is the assumption CallSite dependency tracking is based on. It's not the case anymore when Cleaner is automatically cleared. Cleanup action can happen when context Class is already GCed. So, I can access neither the context mirror class nor VM-internal InstanceKlass. > FinalReference(s) are different in that their referent is not reclaimed > while it is still reachable through FinalReference, which means that > finalizable objects must undergo at least two GC cycles, with processing > in the reference handler and finalizer threads in between the cycles, to > be reclaimed. PhantomReference(s), on the other hand, can be enqueued > and their referents reclaimed in the same GC cycle, can't they? Since PhantomReference isn't automatically cleared, GC can't reclaim its referent until the reference is manually cleared or PhantomReference becomes unreachable as a result of cleanup action.
So, it's impossible to reclaim it in the same GC cycle (unless it is a concurrent GC cycle?). >> I experimented with moving cleanup logic into VM [1], > > What's the downside of that approach? I mean, why is GC-assisted > approach better? Simpler? IMO the more code on JDK side the better. More safety guarantees are provided in Java code comparing to native/VM code. Also, JVM-based approach suffers from the fact it doesn't get prompt notification when context Class can be GCed. Since it should work with autocleared references, there's no need in a cleanup action anymore. By the time cleanup happens, referent can be already GCed. That's why I switched from Cleaner to WeakReference. When CallSite.context is cleared ("stale" context), VM has to scan all nmethods in the code cache to find all affected nmethods. BTW I had a private discussion with Kim Barrett who suggested an alternative approach which doesn't require full code cache scan. I plan to do some prototyping to understand its feasibility, since it requires non-InstanceKlass-based nmethod dependency tracking machinery. Best regards, Vladimir Ivanov >> but Peter Levart came up with a clever idea and implemented >> FinalReference-based cleaner-like Finalizator. Classes don't have >> finalizers, but Finalizator allows to attach a finalization action to >> them. And it is guaranteed that the referent is alive when >> finalization happens. >> >> Also, Peter spotted another problem with Cleaner-based implementation. >> Cleaner cleanup action is strongly referenced, since it is registered >> in Cleaner class. CallSite context cleanup action keeps a reference to >> CallSite class (it is passed to MHN.invalidateDependentNMethods). >> Users are free to extend CallSite and many do so. If a context class >> and a call site class are loaded by a custom class loader, such loader >> will never be unloaded, causing a memory leak. 
>> >> Finalizator doesn't suffer from that, since the action is referenced >> only from Finalizator instance. The downside is that cleanup action >> can be missed if Finalizator becomes unreachable. It's not a problem >> for CallSite context, because a context is always referenced from some >> CallSite and if a CallSite becomes unreachable, there's no need to >> perform a cleanup. >> >> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 >> >> Contributed-by: plevart, vlivanov >> >> Best regards, >> Vladimir Ivanov >> >> PS: frankly speaking, I would move Finalizator from java.lang.ref to >> java.lang.invoke and call it Context, if there were a way to extend >> package-private FinalReference from another package :-) > > Modules may have an answer for that. FinalReference could be moved to a > package (java.lang.ref.internal or jdk.lang.ref) that is not exported to > the world and made public. > > On the other hand, I don't see a reason why FinalReference couldn't be > part of JDK public API. I know it's currently just a private mechanism > to implement finalization which has issues and many would like to see it > gone, but Java has learned to live with it, and, as you see, it can be a > solution to some problems too ;-) > > Regards, Peter > >> >> [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 > From vladimir.kozlov at oracle.com Tue May 12 15:38:11 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 12 May 2015 08:38:11 -0700 Subject: RFR(M): 8076188 Optimize arraycopy out for non escaping destination In-Reply-To: <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> References: <2B8622DD-1DA5-4715-B4BF-3801202B588B@oracle.com> <553F9C4A.7040407@oracle.com> <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> Message-ID: <55521E63.9030106@oracle.com> Okay. Thanks, Vladimir On 5/12/15 3:40 AM, Roland Westrelin wrote: > > jprt found some problems with this change. 
So I'd like to push: > > http://cr.openjdk.java.net/~roland/8076188/webrev.00-01/ > > on top of the reviewed change. Overall change: > > http://cr.openjdk.java.net/~roland/8076188/webrev.01/ > > List of fixes: > > - tests with G1 failed in verification code. I made the change to LoadNode::Ideal() so the new code that looks for a dominating identical load is disabled for raw loads > > - OptimizePtrCompare is broken in cases like the new test case added to TestEliminateArrayCopy.java: field from non escaping object should have an unknown value when the object is target of an ArrayCopy. > > - The ifnode change is unrelated (I found the problem when debugging one of the issues above) but it's simple enough that I don't think it needs its own CR. > > Roland. > >> On Apr 29, 2015, at 11:30 AM, Roland Westrelin wrote: >> >> Thanks for the review, Vladimir. >> >> Roland. >> >>> On Apr 28, 2015, at 4:42 PM, Vladimir Ivanov wrote: >>> >>> Looks good. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 4/21/15 4:02 PM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~roland/8076188/webrev.00/ >>>> >>>> This patch tries to eliminate ArrayCopyNodes (for instance clones, array clones, arraycopy and copyOf) when the destination of the copy doesn't escape: >>>> >>>> - during escape analysis, ArrayCopyNodes don't cause the destination of the copy to be marked as escaping anymore >>>> - a load to the destination of a copy may be replaced by a load from the source during IGVN >>>> - during macro expansion, ArrayCopyNodes don't stop allocation from being eliminated and can themselves be eliminated >>>> >>>> Roland.
>>>> >> > From aph at redhat.com Tue May 12 16:59:14 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 12 May 2015 17:59:14 +0100 Subject: aarch64 AD-file / matching rule In-Reply-To: References: Message-ID: <55523162.3000401@redhat.com> Hi, On 04/23/2015 02:41 PM, Benedikt Wedenik wrote: > I'm writing compiler-optimisations for the aarch64 port at the moment and I am using specjbb2005 for benchmarking. > One of the patterns I want to optimise is the following: > > 0x0000007f8c2961b4: and w2, w2, #0x7ffff8 > 0x0000007f8c2961b8: cmp w2, #0x0 > 0x0000007f8c2961bc: b.eq 0x0000007f8c2968f4 I do not see this pattern generated anywhere by C2. If you can tell me which routine is being compiled I will be very interested. It is generated a lot by C1, but that's C1: not interesting. Incidentally, if you want to see a profile of specjbb2005, do this: operf java -Xms256m -Xmx256m -agentpath:/usr/lib64/oprofile/libjvmti_oprofile.so spec.jbb.JBBmain -propfile SPECjbb.props then, afterwards: opreport -l Andrew. From vitalyd at gmail.com Tue May 12 17:05:08 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 12 May 2015 13:05:08 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: Message-ID: Anyone? :) Thanks On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich wrote: > Hi guys, > > Suppose we have the following code: > > private static final long[] ARR = new long[10]; > private static final int ITERATIONS = 500 * 1000 * 1000; > > > static void loop() { > int i = ITERATIONS; > while(--i >= 0) { > ARR[4] = i; > } > } > > My expectation would be that loop() effectively compiles to: > ARR[4] = 0; > ret; > > However, an actual loop is compiled. This example is a reduction of a > slightly more involved case that I had in mind, but I'm curious if I'm > missing something here. > > I tested this with 8u40 and tiered compilation disabled, for what it's > worth. 
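(Illustrative sketch, not from the original message: the optimization being asked for amounts to two steps, sinking the loop-invariant-address store out of the loop, then collapsing the now-empty counted loop. In plain Java, reusing the loop() shape quoted above; the class and method names are invented for the example.)

```java
public class StoreSinkSketch {
    private static final long[] ARR = new long[10];
    private static final int ITERATIONS = 500 * 1000 * 1000;

    // Step 1: the address ARR[4] is loop-invariant, so the store can be sunk
    // below the loop; only the last value written is observable. This is legal
    // here because the loop always executes at least once (ITERATIONS > 0) and
    // concurrent readers are racy anyway, so intermediate values need not be
    // made visible.
    static void loopWithSunkStore() {
        int i = ITERATIONS;
        long last = 0;
        while (--i >= 0) {
            last = i;   // loop-variant value stays in a register
        }
        ARR[4] = last;  // single store after the loop
    }

    // Step 2: the loop body is now empty and the trip count is known, so the
    // loop collapses and the final value folds to a constant: the last
    // iteration writes i == 0.
    static void loopCollapsed() {
        ARR[4] = 0;
    }

    public static void main(String[] args) {
        loopWithSunkStore();
        long afterSink = ARR[4];
        loopCollapsed();
        System.out.println(afterSink == ARR[4]);  // prints: true
    }
}
```

Both versions leave ARR[4] == 0, which is the "ARR[4] = 0; ret;" shape expected in the report.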
> > Thanks From vladimir.kozlov at oracle.com Tue May 12 17:12:22 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 12 May 2015 10:12:22 -0700 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: Message-ID: <55523476.90601@oracle.com> I would need to look at the ideal graph to find what is going on. My guess is that the problem is Store node has control edge pointing to loop head so we can't move it from loop. Or stupid stuff like we only do it for index increment and not for decrement (which I doubt). The most likely scenario is that we convert this loop to canonical Counted and split it (creating 3 loops: pre, main, post) before we had a chance to collapse it. Regards, Vladimir On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > Anyone? :) > > Thanks > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > wrote: > > Hi guys, > > Suppose we have the following code: > > private static final long[] ARR = new long[10]; > private static final int ITERATIONS = 500 * 1000 * 1000; > > > static void loop() { > int i = ITERATIONS; > while(--i >= 0) { > ARR[4] = i; > } > } > > My expectation would be that loop() effectively compiles to: > ARR[4] = 0; > ret; > > However, an actual loop is compiled. This example is a reduction of > a slightly more involved case that I had in mind, but I'm curious if > I'm missing something here. > > I tested this with 8u40 and tiered compilation disabled, for what > it's worth. > > Thanks > > From vitalyd at gmail.com Tue May 12 17:16:05 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 12 May 2015 13:16:05 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: <55523476.90601@oracle.com> References: <55523476.90601@oracle.com> Message-ID: Thanks Vladimir. 
I believe I did see the split pre/main/post loop form in the compiled code; I tried it with both increment and decrement, and also a plain counted for loop, but same asm in the end. Clearly the example I gave is just for illustrating the point, but perhaps this type of code can be left behind after some other optimization passes complete. Do you think it's worth handling these types of things? On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov wrote: > I would need to look on ideal graph to find what is going on. > My guess is that the problem is Store node has control edge pointing to > loop head so we can't move it from loop. Or stupid staff like we only do it > for index increment and not for decrement (which I doubt). > > Most possible scenario is we convert this loop to canonical Counted and > split it (creating 3 loops: pre, main, post) before we had a chance to > collapse it. > > Regards, > Vladimir > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > >> Anyone? :) >> >> Thanks >> >> On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > wrote: >> >> Hi guys, >> >> Suppose we have the following code: >> >> private static final long[] ARR = new long[10]; >> private static final int ITERATIONS = 500 * 1000 * 1000; >> >> >> static void loop() { >> int i = ITERATIONS; >> while(--i >= 0) { >> ARR[4] = i; >> } >> } >> >> My expectation would be that loop() effectively compiles to: >> ARR[4] = 0; >> ret; >> >> However, an actual loop is compiled. This example is a reduction of >> a slightly more involved case that I had in mind, but I'm curious if >> I'm missing something here. >> >> I tested this with 8u40 and tiered compilation disabled, for what >> it's worth. >> >> Thanks >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vladimir.kozlov at oracle.com Tue May 12 17:47:51 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 12 May 2015 10:47:51 -0700 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> Message-ID: <55523CC7.5010200@oracle.com> The most important thing we need to do with this code is move Store from the loop. We definitely collapse empty loops. We do move some loop invariant nodes from loops but this case could be different. I think the problem is that the stored value is loop-variant. Yes, it would benefit in general if we could move such stores from loops. Vladimir On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > Thanks Vladimir. I believe I did see the split pre/main/post loop form > in the compiled code; I tried it with both increment and decrement, and > also a plain counted for loop, but same asm in the end. Clearly the > example I gave is just for illustrating the point, but perhaps this type > of code can be left behind after some other optimization passes > complete. Do you think it's worth handling these types of things? > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > wrote: > > I would need to look at the ideal graph to find what is going on. > My guess is that the problem is Store node has control edge pointing > to loop head so we can't move it from loop. Or stupid stuff like we > only do it for index increment and not for decrement (which I doubt). > > The most likely scenario is that we convert this loop to canonical Counted > and split it (creating 3 loops: pre, main, post) before we had a > chance to collapse it. > > Regards, > Vladimir > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > Anyone? 
:) > > Thanks > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > >> wrote: > > Hi guys, > > Suppose we have the following code: > > private static final long[] ARR = new long[10]; > private static final int ITERATIONS = 500 * 1000 * 1000; > > > static void loop() { > int i = ITERATIONS; > while(--i >= 0) { > ARR[4] = i; > } > } > > My expectation would be that loop() effectively compiles to: > ARR[4] = 0; > ret; > > However, an actual loop is compiled. This example is a > reduction of > a slightly more involved case that I had in mind, but I'm > curious if > I'm missing something here. > > I tested this with 8u40 and tiered compilation disabled, > for what > it's worth. > > Thanks > > > From vitalyd at gmail.com Tue May 12 17:56:07 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 12 May 2015 13:56:07 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: <55523CC7.5010200@oracle.com> References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: Well, this seems more about detecting that each iteration of the loop is storing to the same address, which means if the loop is counted (or otherwise determined to end), then only final value can be stored. Even if loop is not removed entirely and the final value computed statically (as one could conceive in this case), I'd think working with a temp inside the loop and then storing the temp value into memory would be good as well. FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ code :). On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov wrote: > The most important thing we need do with this code is move Store from the > loop. We definitely collapse empty loops. > We do move some loop invariant nodes from loops but this case could be > different. I think the problem is stored value is loop variant. > Yes, it would benefit in general if we could move such stores from loops. 
> > Vladimir > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > >> Thanks Vladimir. I believe I did see the split pre/main/post loop form >> in the compiled code; I tried it with both increment and decrement, and >> also a plain counted for loop, but same asm in the end. Clearly the >> example I gave is just for illustrating the point, but perhaps this type >> of code can be left behind after some other optimization passes >> complete. Do you think it's worth handling these types of things? >> >> On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov >> > wrote: >> >> I would need to look on ideal graph to find what is going on. >> My guess is that the problem is Store node has control edge pointing >> to loop head so we can't move it from loop. Or stupid staff like we >> only do it for index increment and not for decrement (which I doubt). >> >> Most possible scenario is we convert this loop to canonical Counted >> and split it (creating 3 loops: pre, main, post) before we had a >> chance to collapse it. >> >> Regards, >> Vladimir >> >> On 5/12/15 10:05 AM, Vitaly Davidovich wrote: >> >> Anyone? :) >> >> Thanks >> >> On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich >> >> >> wrote: >> >> Hi guys, >> >> Suppose we have the following code: >> >> private static final long[] ARR = new long[10]; >> private static final int ITERATIONS = 500 * 1000 * 1000; >> >> >> static void loop() { >> int i = ITERATIONS; >> while(--i >= 0) { >> ARR[4] = i; >> } >> } >> >> My expectation would be that loop() effectively compiles to: >> ARR[4] = 0; >> ret; >> >> However, an actual loop is compiled. This example is a >> reduction of >> a slightly more involved case that I had in mind, but I'm >> curious if >> I'm missing something here. >> >> I tested this with 8u40 and tiered compilation disabled, >> for what >> it's worth. >> >> Thanks >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sandhya.viswanathan at intel.com Tue May 12 23:25:00 2015 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Tue, 12 May 2015 23:25:00 +0000 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CDD48.8080500@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> Message-ID: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> I was wondering if inline assembly can be used in the native code of JRE. Does the JRE build framework support it? Thanks, Sandhya -----Original Message----- From: Andrew Haley [mailto:aph at redhat.com] Sent: Friday, May 08, 2015 8:59 AM To: Vladimir Kozlov; Florian Weimer; Viswanathan, Sandhya; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(L): 8069539: RSA acceleration Here is a prototype of what I propose: http://cr.openjdk.java.net/~aph/rsa-1/ It is a JNI version of the fast algorithm I think we should use for RSA. It doesn't use any intrinsics. But with it, RSA is already twice as fast as the code we have now for 1024-bit keys, and almost three times as fast for 2048-bit keys! This code (montgomery_multiply) is designed to be as easy as I can possibly make it to translate into a hand-coded intrinsic. It will then offer better performance still, with the JNI overhead gone and with some carefully hand-tweaked memory accesses. It has a very regular structure which should make it fairly easy to turn into a software pipeline with overlapped fetching and multiplication, although this will perhaps be difficult on register-starved machines like the x86. 
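(Background sketch, not the code from the webrev: the algorithm referred to is Montgomery multiplication, which computes a*b mod n with no division by n in the hot path. A real implementation such as montgomery_multiply works word-by-word on 64-bit limbs; the BigInteger version below only shows the algebra, and the names R, nPrime, redc are invented for the example.)

```java
import java.math.BigInteger;

public class MontgomerySketch {
    // REDC(t) = t * R^-1 mod n, for 0 <= t < n*R, where R = 2^k.
    // Uses only multiplies, adds, a mask, and a shift: no division by n.
    static BigInteger redc(BigInteger t, BigInteger n, BigInteger nPrime, int k) {
        BigInteger mask = BigInteger.ONE.shiftLeft(k).subtract(BigInteger.ONE);
        BigInteger m = t.multiply(nPrime).and(mask);        // m = t * (-n^-1) mod R
        BigInteger u = t.add(m.multiply(n)).shiftRight(k);  // (t + m*n) / R, exact
        return u.compareTo(n) >= 0 ? u.subtract(n) : u;     // single conditional subtract
    }

    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(101);          // odd modulus
        int k = n.bitLength();                           // R = 2^k > n, gcd(R, n) = 1
        BigInteger R = BigInteger.ONE.shiftLeft(k);
        BigInteger nPrime = R.subtract(n.modInverse(R)); // -n^-1 mod R, precomputed once

        BigInteger a = BigInteger.valueOf(17), b = BigInteger.valueOf(41);
        BigInteger aM = a.multiply(R).mod(n);            // Montgomery form: aR mod n
        BigInteger bM = b.multiply(R).mod(n);

        BigInteger prodM = redc(aM.multiply(bM), n, nPrime, k); // (a*b)R mod n
        BigInteger prod  = redc(prodM, n, nPrime, k);           // back out: a*b mod n
        System.out.println(prod.equals(a.multiply(b).mod(n)));  // prints: true
    }
}
```

The division-free, branch-light reduction step is what makes the routine map well onto a hand-coded intrinsic; in a modular exponentiation the conversions into and out of Montgomery form are paid once per exponentiation, not per multiply.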
The JNI version can still be used for those machines where people don't yet want to write a hand-coded intrinsic. I haven't yet done anything about Montgomery squaring: it's asymptotically 25% faster than Montgomery multiplication. I have only written it for x86_64. Systems without a 64-bit multiply won't benefit very much from this idea, but they are pretty much legacy anyway: I want to concentrate on 64-bit systems. So, shall we go with this? I don't think you will find any faster way to do it. Andrew. From aph at redhat.com Wed May 13 08:34:12 2015 From: aph at redhat.com (Andrew Haley) Date: Wed, 13 May 2015 09:34:12 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> Message-ID: <55530C84.2070801@redhat.com> On 13/05/15 00:25, Viswanathan, Sandhya wrote: > I was wondering if inline assembly can be used in the native code of JRE. Does the JRE build framework support it? For every target which supports it you could define a macro in one of the os/cpu-dependent header files. But it's slightly awkward because it's compiler- as well as cpu-dependent. As long as we assume GCC it's pretty easy for all targets but I'm not sure what other compilers we use. However, I have to stress that this is just a demonstration: even though it is already much better than what we have, I'm not imagining that we really want to use JNI for this. 
It would make more sense to move it into HotSpot and call it as a runtime helper. One interesting thing to try would be to find out if this technique makes things slower on 32-bit targets. If it doesn't then we could use it everywhere. Andrew. From roland.westrelin at oracle.com Wed May 13 09:05:19 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 11:05:19 +0200 Subject: RFR(XS): 8078436: java/util/stream/boottest/java/util/stream/UnorderedTest.java crashed with an assert in ifnode.cpp Message-ID: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> http://cr.openjdk.java.net/~roland/8078436/webrev.00/ The assert is wrong. I mixed the BoolNode of one IfNode with the projection of the other IfNode. Roland. From vladimir.x.ivanov at oracle.com Wed May 13 09:06:47 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 12:06:47 +0300 Subject: RFR(XS): 8078436: java/util/stream/boottest/java/util/stream/UnorderedTest.java crashed with an assert in ifnode.cpp In-Reply-To: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> References: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> Message-ID: <55531427.5030201@oracle.com> Looks good. Best regards, Vladimir Ivanov On 5/13/15 12:05 PM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8078436/webrev.00/ > > The assert is wrong. I mixed the BoolNode of one IfNode with the projection of the other IfNode. > > Roland. > From vladimir.x.ivanov at oracle.com Wed May 13 09:08:26 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 12:08:26 +0300 Subject: RFR(M): 8076188 Optimize arraycopy out for non escaping destination In-Reply-To: <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> References: <2B8622DD-1DA5-4715-B4BF-3801202B588B@oracle.com> <553F9C4A.7040407@oracle.com> <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> Message-ID: <5553148A.2020206@oracle.com> Still looks fine. 
Best regards, Vladimir Ivanov On 5/12/15 1:40 PM, Roland Westrelin wrote: > > jprt found some problems with this change. So I'd like to push: > > http://cr.openjdk.java.net/~roland/8076188/webrev.00-01/ > > on top of the reviewed change. Overall change: > > http://cr.openjdk.java.net/~roland/8076188/webrev.01/ > > List of fixes: > > - tests with G1 failed in verification code. I made the change to LoadNode::Ideal() so the new code that looks for a dominating identical load is disabled for raw loads > > - OptimizePtrCompare is broken in cases like the new test case added to TestEliminateArrayCopy.java: a field of a non-escaping object should have an unknown value when the object is the target of an ArrayCopy. > > - The ifnode change is unrelated (I found the problem when debugging one of the issues above) but it's simple enough that I don't think it needs its own CR. > > Roland. > >> On Apr 29, 2015, at 11:30 AM, Roland Westrelin wrote: >> >> Thanks for the review, Vladimir. >> >> Roland. >> >>> On Apr 28, 2015, at 4:42 PM, Vladimir Ivanov wrote: >>> >>> Looks good. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 4/21/15 4:02 PM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~roland/8076188/webrev.00/ >>>> >>>> This patch tries to eliminate ArrayCopyNodes (for instance clones, array clones, arraycopy and copyOf) when the destination of the copy doesn't escape: >>>> >>>> - during escape analysis, ArrayCopyNodes don't cause the destination of the copy to be marked as escaping anymore >>>> - a load to the destination of a copy may be replaced by a load from the source during IGVN >>>> - during macro expansion, ArrayCopyNodes don't stop allocation from being eliminated and can themselves be eliminated >>>> >>>> Roland. 
>>>> >> > From vladimir.x.ivanov at oracle.com Wed May 13 09:27:59 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 12:27:59 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5551DC44.9060902@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> Message-ID: <5553191F.6080800@oracle.com> I finished experimenting with the idea inspired by private discussion with Kim Barrett: http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ The idea is to use CallSite instance as a context for dependency tracking, instead of the Class CallSite is bound to. It requires extension of nmethod dependencies to be used separately from InstanceKlass. Though it slightly complicates nmethod dependency management (special cases for call_site_target_value are added), it completely eliminates the need for context state transitions. Every CallSite instance has its dedicated context, represented as CallSite.Context which stores an address of the dependency list head (nmethodBucket*). Cleanup action (Cleaner) is attached to CallSite instance and frees all allocated nmethodBucket entries when CallSite becomes unreachable. Though CallSite instance can freely go away, the Cleaner keeps corresponding CallSite.Context from reclamation (all Cleaners are registered in static list in Cleaner class). I like how it finally shaped out. It became much easier to reason about CallSite context. Dependency tracking extension to non-InstanceKlass contexts looks quite natural and can be reused. Testing: extended regression test, jprt, nashorn Thanks! Best regards, Vladimir Ivanov On 5/12/15 1:56 PM, Vladimir Ivanov wrote: > Peter, > >>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 >>> https://bugs.openjdk.java.net/browse/JDK-8079205 >> >> Your Finalizator touches are good. 
Supplier interface is not needed as >> there is a public Reference superclass that can be used for return type >> of JavaLangRefAccess.createFinalizator(). You can remove the import for >> Supplier in Reference now. > Thanks for spotting that. Will do. > >>> Recent change in sun.misc.Cleaner behavior broke CallSite context >>> cleanup. >>> >>> CallSite references context class through a Cleaner to avoid its >>> unnecessary retention. >>> >>> The problem is the following: to do a cleanup (invalidate all affected >>> nmethods) VM needs a pointer to a context class. Until Cleaner is >>> cleared (and it was a manual action, since Cleaner extends >>> PhantomReference isn't automatically cleared according to the docs), >>> VM can extract it from CallSite.context.referent field. >> >> If PhantomReference.referent wasn't cleared by VM when PhantomReference >> was equeued, could it happen that the referent pointer was still != null >> and the referent object's heap memory was already reclaimed by GC? Is >> that what JDK-8071931 is about? (I cant't see the bug - it's internal). >> In that respect the Cleaner based solution was broken from the start, as >> you did dereference the referent after Cleaner was enqueued. You could >> get a != null pointer which was pointing to reclaimed heap. In Java this >> dereference is prevented by PhantomReference.get() always returning null >> (well, nothing prevents one to read the referent field with reflection >> though). > No, the object isn't reclaimed until corresponding reference is != null. > When Cleaner wasn't automatically cleared, it was guaranteed that the > referent is alive when cleanup action happens. That is the assumption > CallSite dependency tracking is based on. > > It's not the case anymore when Cleaner is automatically cleared. Cleanup > action can happen when context Class is already GCed. So, I can access > neither the context mirror class nor VM-internal InstanceKlass. 
> >> FinalReference(s) are different in that their referent is not reclaimed >> while it is still reachable through FinalReference, which means that >> finalizable objects must undergo at least two GC cycles, with processing >> in the reference handler and finalizer threads inbetween the cycles, to >> be reclaimed. PhantomReference(s), on the other hand, can be enqueued >> and their referents reclaimed in the same GC cycle, can't they? > Since PhantomReference isn't automatically cleared, GC can't reclaim its > referent until the reference is manually cleared or PhantomReference > becomes unreachable as a result of cleanup action. > So, it's impossible to reclaim it in the same GC cycle (unless it is a > concurrent GC cycle?). > >>> I experimented with moving cleanup logic into VM [1], >> >> What's the downside of that approach? I mean, why is GC-assisted >> approach better? Simpler? > IMO the more code on JDK side the better. More safety guarantees are > provided in Java code comparing to native/VM code. > > Also, JVM-based approach suffers from the fact it doesn't get prompt > notification when context Class can be GCed. Since it should work with > autocleared references, there's no need in a cleanup action anymore. > By the time cleanup happens, referent can be already GCed. > > That's why I switched from Cleaner to WeakReference. When > CallSite.context is cleared ("stale" context), VM has to scan all > nmethods in the code cache to find all affected nmethods. > > BTW I had a private discussion with Kim Barrett who suggested an > alternative approach which doesn't require full code cache scan. I plan > to do some prototyping to understand its feasibility, since it requires > non-InstanceKlass-based nmethod dependency tracking machinery. > > Best regards, > Vladimir Ivanov > >>> but Peter Levart came up with a clever idea and implemented >>> FinalReference-based cleaner-like Finalizator. 
Classes don't have >>> finalizers, but Finalizator allows to attach a finalization action to >>> them. And it is guaranteed that the referent is alive when >>> finalization happens. >>> >>> Also, Peter spotted another problem with Cleaner-based implementation. >>> Cleaner cleanup action is strongly referenced, since it is registered >>> in Cleaner class. CallSite context cleanup action keeps a reference to >>> CallSite class (it is passed to MHN.invalidateDependentNMethods). >>> Users are free to extend CallSite and many do so. If a context class >>> and a call site class are loaded by a custom class loader, such loader >>> will never be unloaded, causing a memory leak. >>> >>> Finalizator doesn't suffer from that, since the action is referenced >>> only from Finalizator instance. The downside is that cleanup action >>> can be missed if Finalizator becomes unreachable. It's not a problem >>> for CallSite context, because a context is always referenced from some >>> CallSite and if a CallSite becomes unreachable, there's no need to >>> perform a cleanup. >>> >>> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 >>> >>> Contributed-by: plevart, vlivanov >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> PS: frankly speaking, I would move Finalizator from java.lang.ref to >>> java.lang.invoke and call it Context, if there were a way to extend >>> package-private FinalReference from another package :-) >> >> Modules may have an answer for that. FinalReference could be moved to a >> package (java.lang.ref.internal or jdk.lang.ref) that is not exported to >> the world and made public. >> >> On the other hand, I don't see a reason why FinalReference couldn't be >> part of JDK public API. 
I know it's currently just a private mechanism >> to implement finalization which has issues and many would like to see it >> gone, but Java has learned to live with it, and, as you see, it can be a >> solution to some problems too ;-) >> >> Regards, Peter >> >>> >>> [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 >> From vladimir.x.ivanov at oracle.com Wed May 13 09:50:48 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 12:50:48 +0300 Subject: [9] RFR (XS): 8079135: C2 disables some optimizations when a large number of unique nodes exist Message-ID: <55531E78.7070301@oracle.com> http://cr.openjdk.java.net/~vlivanov/8079135/webrev.00/ Incremental inlining produces lots of dead nodes, so # of unique and # of live nodes can differ significantly. But actual IR size is what matters. Performance experiments show that octane doesn't hit the problem (no difference). Testing: octane/nashorn Best regards, Vladimir Ivanov From bonifaido at gmail.com Wed May 13 10:07:02 2015 From: bonifaido at gmail.com (=?UTF-8?B?TsOhbmRvciBJc3R2w6FuIEtyw6Fjc2Vy?=) Date: Wed, 13 May 2015 12:07:02 +0200 Subject: [9] RFR(S): 8068945: Use RBP register as proper frame pointer in JIT compiled code on x86 Message-ID: Hi, this change broke my build for the zero jvm variant, I think the PreserveFramePointer global should be defined in src/cpu/zero/vm/globals_zero.hpp as well, correct me please if I'm wrong. Regards, Nandor -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From peter.levart at gmail.com Wed May 13 10:45:17 2015 From: peter.levart at gmail.com (Peter Levart) Date: Wed, 13 May 2015 12:45:17 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5553191F.6080800@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> Message-ID: <55532B3D.5090202@gmail.com> Hi Vladimir, This is nice. I don't know how space-savvy CallSite should be, but you can save one additional object - the Cleaner's thunk: static class Context implements Runnable { private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* static Context make(CallSite cs) { Context newContext = new Context(); // Cleaner is attached to CallSite instance and it clears native structures allocated for CallSite context. // Though the CallSite can become unreachable, its Context is retained by the Cleaner instance (which is // referenced from Cleaner class) until cleanup is performed. Cleaner.create(cs, newContext); return newContext; } @Override public void run() { MethodHandleNatives.clearCallSiteContext(this); } } Since Context instance does not escape to client code, I think this is safe. Regards, Peter On 05/13/2015 11:27 AM, Vladimir Ivanov wrote: > I finished experimenting with the idea inspired by private discussion > with Kim Barrett: > > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ > > The idea is to use CallSite instance as a context for dependency > tracking, instead of the Class CallSite is bound to. It requires > extension of nmethod dependencies to be used separately from > InstanceKlass. Though it slightly complicates nmethod dependency > management (special cases for call_site_target_value are added), it > completely eliminates the need for context state transitions. 
> > Every CallSite instance has its dedicated context, represented as > CallSite.Context which stores an address of the dependency list head > (nmethodBucket*). Cleanup action (Cleaner) is attached to CallSite > instance and frees all allocated nmethodBucket entries when CallSite > becomes unreachable. Though CallSite instance can freely go away, the > Cleaner keeps corresponding CallSite.Context from reclamation (all > Cleaners are registered in static list in Cleaner class). > > I like how it finally shaped out. It became much easier to reason > about CallSite context. Dependency tracking extension to > non-InstanceKlass contextes looks quite natural and can be reused. > > Testing: extended regression test, jprt, nashorn > > Thanks! > > Best regards, > Vladimir Ivanov > > On 5/12/15 1:56 PM, Vladimir Ivanov wrote: >> Peter, >> >>>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 >>>> https://bugs.openjdk.java.net/browse/JDK-8079205 >>> >>> Your Finalizator touches are good. Supplier interface is not needed as >>> there is a public Reference superclass that can be used for return type >>> of JavaLangRefAccess.createFinalizator(). You can remove the import for >>> Supplier in Reference now. >> Thanks for spotting that. Will do. >> >>>> Recent change in sun.misc.Cleaner behavior broke CallSite context >>>> cleanup. >>>> >>>> CallSite references context class through a Cleaner to avoid its >>>> unnecessary retention. >>>> >>>> The problem is the following: to do a cleanup (invalidate all affected >>>> nmethods) VM needs a pointer to a context class. Until Cleaner is >>>> cleared (and it was a manual action, since Cleaner extends >>>> PhantomReference isn't automatically cleared according to the docs), >>>> VM can extract it from CallSite.context.referent field. 
>>> >>> If PhantomReference.referent wasn't cleared by VM when PhantomReference >>> was equeued, could it happen that the referent pointer was still != >>> null >>> and the referent object's heap memory was already reclaimed by GC? Is >>> that what JDK-8071931 is about? (I cant't see the bug - it's internal). >>> In that respect the Cleaner based solution was broken from the >>> start, as >>> you did dereference the referent after Cleaner was enqueued. You could >>> get a != null pointer which was pointing to reclaimed heap. In Java >>> this >>> dereference is prevented by PhantomReference.get() always returning >>> null >>> (well, nothing prevents one to read the referent field with reflection >>> though). >> No, the object isn't reclaimed until corresponding reference is != null. >> When Cleaner wasn't automatically cleared, it was guaranteed that the >> referent is alive when cleanup action happens. That is the assumption >> CallSite dependency tracking is based on. >> >> It's not the case anymore when Cleaner is automatically cleared. Cleanup >> action can happen when context Class is already GCed. So, I can access >> neither the context mirror class nor VM-internal InstanceKlass. >> >>> FinalReference(s) are different in that their referent is not reclaimed >>> while it is still reachable through FinalReference, which means that >>> finalizable objects must undergo at least two GC cycles, with >>> processing >>> in the reference handler and finalizer threads inbetween the cycles, to >>> be reclaimed. PhantomReference(s), on the other hand, can be enqueued >>> and their referents reclaimed in the same GC cycle, can't they? >> Since PhantomReference isn't automatically cleared, GC can't reclaim its >> referent until the reference is manually cleared or PhantomReference >> becomes unreachable as a result of cleanup action. >> So, it's impossible to reclaim it in the same GC cycle (unless it is a >> concurrent GC cycle?). 
>> >>>> I experimented with moving cleanup logic into VM [1], >>> >>> What's the downside of that approach? I mean, why is GC-assisted >>> approach better? Simpler? >> IMO the more code on JDK side the better. More safety guarantees are >> provided in Java code comparing to native/VM code. >> >> Also, JVM-based approach suffers from the fact it doesn't get prompt >> notification when context Class can be GCed. Since it should work with >> autocleared references, there's no need in a cleanup action anymore. >> By the time cleanup happens, referent can be already GCed. >> >> That's why I switched from Cleaner to WeakReference. When >> CallSite.context is cleared ("stale" context), VM has to scan all >> nmethods in the code cache to find all affected nmethods. >> >> BTW I had a private discussion with Kim Barrett who suggested an >> alternative approach which doesn't require full code cache scan. I plan >> to do some prototyping to understand its feasibility, since it requires >> non-InstanceKlass-based nmethod dependency tracking machinery. >> >> Best regards, >> Vladimir Ivanov >> >>>> but Peter Levart came up with a clever idea and implemented >>>> FinalReference-based cleaner-like Finalizator. Classes don't have >>>> finalizers, but Finalizator allows to attach a finalization action to >>>> them. And it is guaranteed that the referent is alive when >>>> finalization happens. >>>> >>>> Also, Peter spotted another problem with Cleaner-based implementation. >>>> Cleaner cleanup action is strongly referenced, since it is registered >>>> in Cleaner class. CallSite context cleanup action keeps a reference to >>>> CallSite class (it is passed to MHN.invalidateDependentNMethods). >>>> Users are free to extend CallSite and many do so. If a context class >>>> and a call site class are loaded by a custom class loader, such loader >>>> will never be unloaded, causing a memory leak. 
>>>> >>>> Finalizator doesn't suffer from that, since the action is referenced >>>> only from Finalizator instance. The downside is that cleanup action >>>> can be missed if Finalizator becomes unreachable. It's not a problem >>>> for CallSite context, because a context is always referenced from some >>>> CallSite and if a CallSite becomes unreachable, there's no need to >>>> perform a cleanup. >>>> >>>> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 >>>> >>>> Contributed-by: plevart, vlivanov >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> PS: frankly speaking, I would move Finalizator from java.lang.ref to >>>> java.lang.invoke and call it Context, if there were a way to extend >>>> package-private FinalReference from another package :-) >>> >>> Modules may have an answer for that. FinalReference could be moved to a >>> package (java.lang.ref.internal or jdk.lang.ref) that is not >>> exported to >>> the world and made public. >>> >>> On the other hand, I don't see a reason why FinalReference couldn't be >>> part of JDK public API. 
I know it's currently just a private mechanism >>> to implement finalization which has issues and many would like to >>> see it >>> gone, but Java has learned to live with it, and, as you see, it can >>> be a >>> solution to some problems too ;-) >>> >>> Regards, Peter >>> >>>> >>>> [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 >>> From paul.sandoz at oracle.com Wed May 13 11:24:07 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Wed, 13 May 2015 13:24:07 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5553191F.6080800@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> Message-ID: <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> Hi Vladimir, I am not an expert in the HS area but the code mostly made sense to me. I also like Peter's suggestion of Context implementing Runnable. Some minor comments. CallSite.java: 145 private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* It's a little odd to see this field be final, but I think I understand the motivation as it's essentially acting as a memory address pointing to the head of the nbucket list, so in Java code you don't want to assign it to anything other than 0. Is VM access, via java_lang_invoke_CallSite_Context, affected by treating final fields as really final in the j.l.i package? javaClasses.hpp: 1182 static oop context(oop site); 1183 static void set_context(oop site, oop context); Is set_context required? instanceKlass.cpp: 1876 // recording of dependecies. Returns true if the bucket is ready for reclamation. typo "dependecies" Paul.
On May 13, 2015, at 11:27 AM, Vladimir Ivanov wrote: > I finished experimenting with the idea inspired by private discussion with Kim Barrett: > > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ > > The idea is to use CallSite instance as a context for dependency tracking, instead of the Class CallSite is bound to. It requires extension of nmethod dependencies to be used separately from InstanceKlass. Though it slightly complicates nmethod dependency management (special cases for call_site_target_value are added), it completely eliminates the need for context state transitions. > > Every CallSite instance has its dedicated context, represented as CallSite.Context which stores an address of the dependency list head (nmethodBucket*). Cleanup action (Cleaner) is attached to CallSite instance and frees all allocated nmethodBucket entries when CallSite becomes unreachable. Though CallSite instance can freely go away, the Cleaner keeps corresponding CallSite.Context from reclamation (all Cleaners are registered in static list in Cleaner class). > > I like how it finally shaped out. It became much easier to reason about CallSite context. Dependency tracking extension to non-InstanceKlass contextes looks quite natural and can be reused. > > Testing: extended regression test, jprt, nashorn > > Thanks! > > Best regards, > Vladimir Ivanov > > On 5/12/15 1:56 PM, Vladimir Ivanov wrote: >> Peter, >> >>>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.01 >>>> https://bugs.openjdk.java.net/browse/JDK-8079205 >>> >>> Your Finalizator touches are good. Supplier interface is not needed as >>> there is a public Reference superclass that can be used for return type >>> of JavaLangRefAccess.createFinalizator(). You can remove the import for >>> Supplier in Reference now. >> Thanks for spotting that. Will do. >> >>>> Recent change in sun.misc.Cleaner behavior broke CallSite context >>>> cleanup. 
>>>> >>>> CallSite references context class through a Cleaner to avoid its >>>> unnecessary retention. >>>> >>>> The problem is the following: to do a cleanup (invalidate all affected >>>> nmethods) VM needs a pointer to a context class. Until Cleaner is >>>> cleared (and it was a manual action, since Cleaner extends >>>> PhantomReference isn't automatically cleared according to the docs), >>>> VM can extract it from CallSite.context.referent field. >>> >>> If PhantomReference.referent wasn't cleared by VM when PhantomReference >>> was equeued, could it happen that the referent pointer was still != null >>> and the referent object's heap memory was already reclaimed by GC? Is >>> that what JDK-8071931 is about? (I cant't see the bug - it's internal). >>> In that respect the Cleaner based solution was broken from the start, as >>> you did dereference the referent after Cleaner was enqueued. You could >>> get a != null pointer which was pointing to reclaimed heap. In Java this >>> dereference is prevented by PhantomReference.get() always returning null >>> (well, nothing prevents one to read the referent field with reflection >>> though). >> No, the object isn't reclaimed until corresponding reference is != null. >> When Cleaner wasn't automatically cleared, it was guaranteed that the >> referent is alive when cleanup action happens. That is the assumption >> CallSite dependency tracking is based on. >> >> It's not the case anymore when Cleaner is automatically cleared. Cleanup >> action can happen when context Class is already GCed. So, I can access >> neither the context mirror class nor VM-internal InstanceKlass. >> >>> FinalReference(s) are different in that their referent is not reclaimed >>> while it is still reachable through FinalReference, which means that >>> finalizable objects must undergo at least two GC cycles, with processing >>> in the reference handler and finalizer threads inbetween the cycles, to >>> be reclaimed. 
PhantomReference(s), on the other hand, can be enqueued >>> and their referents reclaimed in the same GC cycle, can't they? >> Since PhantomReference isn't automatically cleared, GC can't reclaim its >> referent until the reference is manually cleared or PhantomReference >> becomes unreachable as a result of cleanup action. >> So, it's impossible to reclaim it in the same GC cycle (unless it is a >> concurrent GC cycle?). >> >>>> I experimented with moving cleanup logic into VM [1], >>> >>> What's the downside of that approach? I mean, why is GC-assisted >>> approach better? Simpler? >> IMO the more code on JDK side the better. More safety guarantees are >> provided in Java code comparing to native/VM code. >> >> Also, JVM-based approach suffers from the fact it doesn't get prompt >> notification when context Class can be GCed. Since it should work with >> autocleared references, there's no need in a cleanup action anymore. >> By the time cleanup happens, referent can be already GCed. >> >> That's why I switched from Cleaner to WeakReference. When >> CallSite.context is cleared ("stale" context), VM has to scan all >> nmethods in the code cache to find all affected nmethods. >> >> BTW I had a private discussion with Kim Barrett who suggested an >> alternative approach which doesn't require full code cache scan. I plan >> to do some prototyping to understand its feasibility, since it requires >> non-InstanceKlass-based nmethod dependency tracking machinery. >> >> Best regards, >> Vladimir Ivanov >> >>>> but Peter Levart came up with a clever idea and implemented >>>> FinalReference-based cleaner-like Finalizator. Classes don't have >>>> finalizers, but Finalizator allows to attach a finalization action to >>>> them. And it is guaranteed that the referent is alive when >>>> finalization happens. >>>> >>>> Also, Peter spotted another problem with Cleaner-based implementation. 
>>>> Cleaner cleanup action is strongly referenced, since it is registered >>>> in Cleaner class. CallSite context cleanup action keeps a reference to >>>> CallSite class (it is passed to MHN.invalidateDependentNMethods). >>>> Users are free to extend CallSite and many do so. If a context class >>>> and a call site class are loaded by a custom class loader, such loader >>>> will never be unloaded, causing a memory leak. >>>> >>>> Finalizator doesn't suffer from that, since the action is referenced >>>> only from Finalizator instance. The downside is that cleanup action >>>> can be missed if Finalizator becomes unreachable. It's not a problem >>>> for CallSite context, because a context is always referenced from some >>>> CallSite and if a CallSite becomes unreachable, there's no need to >>>> perform a cleanup. >>>> >>>> Testing: jdk/test/java/lang/invoke, hotspot/test/compiler/jsr292 >>>> >>>> Contributed-by: plevart, vlivanov >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> PS: frankly speaking, I would move Finalizator from java.lang.ref to >>>> java.lang.invoke and call it Context, if there were a way to extend >>>> package-private FinalReference from another package :-) >>> >>> Modules may have an answer for that. FinalReference could be moved to a >>> package (java.lang.ref.internal or jdk.lang.ref) that is not exported to >>> the world and made public. >>> >>> On the other hand, I don't see a reason why FinalReference couldn't be >>> part of JDK public API. I know it's currently just a private mechanism >>> to implement finalization which has issues and many would like to see it >>> gone, but Java has learned to live with it, and, as you see, it can be a >>> solution to some problems too ;-) >>> >>> Regards, Peter >>> >>>> >>>> [1] http://cr.openjdk.java.net/~vlivanov/8079205/webrev.00 >>> -------------- next part -------------- A non-text attachment was scrubbed... 
From roland.westrelin at oracle.com Wed May 13 11:43:12 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 13:43:12 +0200 Subject: RFR(M): 8076188 Optimize arraycopy out for non escaping destination In-Reply-To: <5553148A.2020206@oracle.com> References: <2B8622DD-1DA5-4715-B4BF-3801202B588B@oracle.com> <553F9C4A.7040407@oracle.com> <3F7929F8-11FA-4536-A340-A98266FB2C62@oracle.com> <5553148A.2020206@oracle.com> Message-ID: <6CB8AAD1-2C3C-4766-9C75-CE786C4F94EF@oracle.com> Thanks Vladimir & Vladimir for the re-reviews. Roland. > On May 13, 2015, at 11:08 AM, Vladimir Ivanov wrote: > > Still looks fine. > > Best regards, > Vladimir Ivanov > > On 5/12/15 1:40 PM, Roland Westrelin wrote: >> >> jprt found some problems with this change. So I'd like to push: >> >> http://cr.openjdk.java.net/~roland/8076188/webrev.00-01/ >> >> on top of the reviewed change. Overall change: >> >> http://cr.openjdk.java.net/~roland/8076188/webrev.01/ >> >> List of fixes: >> >> - tests with G1 failed in verification code. I made the change to LoadNode::Ideal() so the new code that looks for a dominating identical load is disabled for raw loads >> >> - OptimizePtrCompare is broken in cases like the new test case added to TestEliminateArrayCopy.java: field from non escaping object should have an unknown value when the object is target of an ArrayCopy. >> >> - The ifnode change is unrelated (I found the problem when debugging one of the issues above) but it's simple enough that I don't think it needs its own CR. >> >> Roland. >> >>> On Apr 29, 2015, at 11:30 AM, Roland Westrelin wrote: >>> >>> Thanks for the review, Vladimir. >>> >>> Roland. >>> >>>> On Apr 28, 2015, at 4:42 PM, Vladimir Ivanov wrote: >>>> >>>> Looks good.
>>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> On 4/21/15 4:02 PM, Roland Westrelin wrote: >>>>> http://cr.openjdk.java.net/~roland/8076188/webrev.00/ >>>>> >>>>> This patch tries to eliminate ArrayCopyNodes (for instance clones, array clones, arraycopy and copyOf) when the destination of the copy doesn't escape: >>>>> >>>>> - during escape analysis, ArrayCopyNodes don't cause the destination of the copy to be marked as escaping anymore >>>>> - a load to the destination of a copy may be replaced by a load from the source during IGVN >>>>> - during macro expansion, ArrayCopyNodes don't stop allocation from being eliminated and can themselves be eliminated >>>>> >>>>> Roland. >>>>> >>> >> From vladimir.x.ivanov at oracle.com Wed May 13 11:59:42 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 13 May 2015 14:59:42 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> Message-ID: <55533CAE.3050201@oracle.com> Peter, Paul, thanks for the feedback! Updated the webrev in place: http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 > I am not an expert in the HS area but the code mostly made sense to me. I also like Peter's suggestion of Context implementing Runnable. I agree. Integrated. > Some minor comments. > > CallSite.java: > > 145 private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* > > It's a little odd to see this field be final, but I think I understand the motivation as it's essentially acting as a memory address pointing to the head of the nbucket list, so in Java code you don't want to assign it to anything other than 0.
Yes, my motivation was to forbid reads & writes of the field from Java code. I was experimenting with injecting the field by VM, but it's less attractive. > Is VM access, via java_lang_invoke_CallSite_Context, affected by treating final fields as really final in the j.l.i package? No, field modifiers aren't respected. oopDesc::*_field_[get|put] does a raw read/write using field offset. > javaClasses.hpp: > > 1182 static oop context(oop site); > 1183 static void set_context(oop site, oop context); > > Is set_context required? Good point. Removed. > instanceKlass.cpp: > > 1876 // recording of dependecies. Returns true if the bucket is ready for reclamation. > > typo "dependecies" Fixed. Best regards, Vladimir Ivanov From zoltan.majo at oracle.com Wed May 13 12:41:46 2015 From: zoltan.majo at oracle.com (Zoltán Majó) Date: Wed, 13 May 2015 14:41:46 +0200 Subject: [9] RFR(S): 8068945: Use RBP register as proper frame pointer in JIT compiled code on x86 In-Reply-To: References: Message-ID: <5553468A.6010206@oracle.com> Hi Nándor, On 05/13/2015 12:07 PM, Nándor István Krácser wrote: > Hi, > > this change broke my build for the zero jvm variant, I think > the PreserveFramePointer global should be defined > in src/cpu/zero/vm/globals_zero.hpp as well, correct me please if I'm > wrong. Thanks for noticing and reporting the problem. I've filed 8080281 and will send out an RFR soon.
Best wishes, Zoltán > > Regards, > Nandor From roland.westrelin at oracle.com Wed May 13 13:02:50 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 15:02:50 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed Message-ID:

http://cr.openjdk.java.net/~roland/8078866/webrev.00/

The assert fires when optimizing that code:

static int remi_sumb() {
    Integer j = Integer.valueOf(1);
    for (int i = 0; i < 1000; i++) {
        j = j + 1;
    }
    return j;
}

The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there. Here is the chain of events that leads to the pre loop being optimized out:

1- The CountedLoopNode is created
2- PhiNode::Value() for the induction variable's Phi is called and captures that 0 <= i < 1000
3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive values of j
4- pre-loop, main-loop, post-loop are created
5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of:

public static Integer valueOf(int i) {
    if (i >= IntegerCache.low && i <= IntegerCache.high)
        return IntegerCache.cache[i + (-IntegerCache.low)];
    return new Integer(i);
}

6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i
7- In the preloop, the range of values of i is known because of 2-, so the if of valueOf() can be optimized out
8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop()

SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf.
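For reference, the reproducer above can be wrapped into a self-contained program along these lines (the class name, the main harness, the warm-up count and the expected value 1001 are illustrative additions, not part of the original report):

```java
public class RemiSumbRepro {

    // Same shape as the reproducer: autoboxing inside a counted loop.
    // j starts at 1 and is incremented 1000 times, so the result is 1001.
    static int remi_sumb() {
        Integer j = Integer.valueOf(1);
        for (int i = 0; i < 1000; i++) {
            j = j + 1;
        }
        return j;
    }

    public static void main(String[] args) {
        int result = 0;
        // Call it repeatedly so the method gets compiled; the report notes
        // the problem only shows up when remi_sumb is inlined and compiled.
        for (int i = 0; i < 20_000; i++) {
            result = remi_sumb();
        }
        if (result != 1001) {
            throw new AssertionError("expected 1001, got " + result);
        }
        System.out.println(result);
    }
}
```

The original failure was seen through the jtreg test compiler/eliminateAutobox/6934604/TestIntBoxing.java; whatever VM flags that test uses to steer C2 are not reproduced here.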
The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops. I don't have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don't optimize the loop out at all. Roland. From roland.westrelin at oracle.com Wed May 13 13:47:45 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 15:47:45 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5553191F.6080800@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> Message-ID: <882B8D85-62B4-4734-8CA7-0BFD45DB73B3@oracle.com> > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ The hotspot code looks good to me. Roland. From rickard.backman at oracle.com Wed May 13 13:51:08 2015 From: rickard.backman at oracle.com (Rickard Bäckman) Date: Wed, 13 May 2015 15:51:08 +0200 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet Message-ID: <20150513135108.GQ31204@rbackman> Hi, can I please have this small change reviewed? There was a typo in the name in the SA code which causes SA to fail when looking at OopMaps.
Bug: https://bugs.openjdk.java.net/browse/JDK-8080155 Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ Thanks /R From roland.westrelin at oracle.com Wed May 13 14:01:35 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 16:01:35 +0200 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet In-Reply-To: <20150513135108.GQ31204@rbackman> References: <20150513135108.GQ31204@rbackman> Message-ID: <43BF38E3-3152-41DA-B3E9-BE708E420E86@oracle.com> > Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ Looks good to me. Roland. From rickard.backman at oracle.com Wed May 13 14:07:51 2015 From: rickard.backman at oracle.com (Rickard Bäckman) Date: Wed, 13 May 2015 16:07:51 +0200 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet In-Reply-To: <43BF38E3-3152-41DA-B3E9-BE708E420E86@oracle.com> References: <20150513135108.GQ31204@rbackman> <43BF38E3-3152-41DA-B3E9-BE708E420E86@oracle.com> Message-ID: <20150513140751.GR31204@rbackman> Thank you! /R On 05/13, Roland Westrelin wrote: > > Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ > > Looks good to me. > > Roland. From roland.westrelin at oracle.com Wed May 13 14:08:26 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 16:08:26 +0200 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: Vitaly, I filed: https://bugs.openjdk.java.net/browse/JDK-8080289 Roland.
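For reference, the example being discussed (quoted in the replies below) runs standalone as follows; the class name and the final check are illustrative additions, not part of the original report:

```java
public class LoopStoreRepro {

    private static final long[] ARR = new long[10];
    private static final int ITERATIONS = 500 * 1000 * 1000;

    // Every iteration stores to the same slot, so only the last store
    // (ARR[4] = 0) is observable; the question in the thread is why the
    // JIT keeps the whole loop instead of reducing it to a single store.
    static void loop() {
        int i = ITERATIONS;
        while (--i >= 0) {
            ARR[4] = i;
        }
    }

    public static void main(String[] args) {
        loop();
        if (ARR[4] != 0) {
            throw new AssertionError("expected 0, got " + ARR[4]);
        }
        System.out.println(ARR[4]);
    }
}
```

The report ran on 8u40 with tiered compilation disabled; the interesting artifact is the assembly emitted for loop(), not the program's output.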
> On May 12, 2015, at 7:56 PM, Vitaly Davidovich wrote: > > Well, this seems more about detecting that each iteration of the loop is storing to the same address, which means if the loop is counted (or otherwise determined to end), then only final value can be stored. Even if loop is not removed entirely and the final value computed statically (as one could conceive in this case), I'd think working with a temp inside the loop and then storing the temp value into memory would be good as well. > > FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ code :). > > On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov wrote: > The most important thing we need do with this code is move Store from the loop. We definitely collapse empty loops. > We do move some loop invariant nodes from loops but this case could be different. I think the problem is stored value is loop variant. > Yes, it would benefit in general if we could move such stores from loops. > > Vladimir > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > Thanks Vladimir. I believe I did see the split pre/main/post loop form > in the compiled code; I tried it with both increment and decrement, and > also a plain counted for loop, but same asm in the end. Clearly the > example I gave is just for illustrating the point, but perhaps this type > of code can be left behind after some other optimization passes > complete. Do you think it's worth handling these types of things? > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > wrote: > > I would need to look on ideal graph to find what is going on. > My guess is that the problem is Store node has control edge pointing > to loop head so we can't move it from loop. Or stupid staff like we > only do it for index increment and not for decrement (which I doubt). > > Most possible scenario is we convert this loop to canonical Counted > and split it (creating 3 loops: pre, main, post) before we had a > chance to collapse it. 
> > Regards, > Vladimir > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > Anyone? :) > > Thanks > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > >> wrote: > > Hi guys, > > Suppose we have the following code: > > private static final long[] ARR = new long[10]; > private static final int ITERATIONS = 500 * 1000 * 1000; > > > static void loop() { > int i = ITERATIONS; > while(--i >= 0) { > ARR[4] = i; > } > } > > My expectation would be that loop() effectively compiles to: > ARR[4] = 0; > ret; > > However, an actual loop is compiled. This example is a > reduction of > a slightly more involved case that I had in mind, but I'm > curious if > I'm missing something here. > > I tested this with 8u40 and tiered compilation disabled, > for what > it's worth. > > Thanks > > > > From vitalyd at gmail.com Wed May 13 14:12:46 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 13 May 2015 10:12:46 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: Thanks Roland. Did you by chance confirm this on your end? I wanted to make sure this wasn't some artifact on my part. Vitaly On Wed, May 13, 2015 at 10:08 AM, Roland Westrelin < roland.westrelin at oracle.com> wrote: > Vitaly, > > I filed: > > https://bugs.openjdk.java.net/browse/JDK-8080289 > > Roland. > > > > On May 12, 2015, at 7:56 PM, Vitaly Davidovich > wrote: > > > > Well, this seems more about detecting that each iteration of the loop is > storing to the same address, which means if the loop is counted (or > otherwise determined to end), then only final value can be stored. Even if > loop is not removed entirely and the final value computed statically (as > one could conceive in this case), I'd think working with a temp inside the > loop and then storing the temp value into memory would be good as well. 
> > > > FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ > code :). > > > > On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov < > vladimir.kozlov at oracle.com> wrote: > > The most important thing we need do with this code is move Store from > the loop. We definitely collapse empty loops. > > We do move some loop invariant nodes from loops but this case could be > different. I think the problem is stored value is loop variant. > > Yes, it would benefit in general if we could move such stores from loops. > > > > Vladimir > > > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > > Thanks Vladimir. I believe I did see the split pre/main/post loop form > > in the compiled code; I tried it with both increment and decrement, and > > also a plain counted for loop, but same asm in the end. Clearly the > > example I gave is just for illustrating the point, but perhaps this type > > of code can be left behind after some other optimization passes > > complete. Do you think it's worth handling these types of things? > > > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > > wrote: > > > > I would need to look on ideal graph to find what is going on. > > My guess is that the problem is Store node has control edge pointing > > to loop head so we can't move it from loop. Or stupid staff like we > > only do it for index increment and not for decrement (which I doubt). > > > > Most possible scenario is we convert this loop to canonical Counted > > and split it (creating 3 loops: pre, main, post) before we had a > > chance to collapse it. > > > > Regards, > > Vladimir > > > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > > > Anyone? 
:) > > > > Thanks > > > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > > > >> wrote: > > > > Hi guys, > > > > Suppose we have the following code: > > > > private static final long[] ARR = new long[10]; > > private static final int ITERATIONS = 500 * 1000 * 1000; > > > > > > static void loop() { > > int i = ITERATIONS; > > while(--i >= 0) { > > ARR[4] = i; > > } > > } > > > > My expectation would be that loop() effectively compiles to: > > ARR[4] = 0; > > ret; > > > > However, an actual loop is compiled. This example is a > > reduction of > > a slightly more involved case that I had in mind, but I'm > > curious if > > I'm missing something here. > > > > I tested this with 8u40 and tiered compilation disabled, > > for what > > it's worth. > > > > Thanks > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.westrelin at oracle.com Wed May 13 14:19:18 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 16:19:18 +0200 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: > Thanks Roland. Did you by chance confirm this on your end? I wanted to make sure this wasn't some artifact on my part. Yes, it?s a problem on our side. Roland. > > Vitaly > > On Wed, May 13, 2015 at 10:08 AM, Roland Westrelin wrote: > Vitaly, > > I filed: > > https://bugs.openjdk.java.net/browse/JDK-8080289 > > Roland. > > > > On May 12, 2015, at 7:56 PM, Vitaly Davidovich wrote: > > > > Well, this seems more about detecting that each iteration of the loop is storing to the same address, which means if the loop is counted (or otherwise determined to end), then only final value can be stored. 
Even if loop is not removed entirely and the final value computed statically (as one could conceive in this case), I'd think working with a temp inside the loop and then storing the temp value into memory would be good as well. > > > > FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ code :). > > > > On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov wrote: > > The most important thing we need do with this code is move Store from the loop. We definitely collapse empty loops. > > We do move some loop invariant nodes from loops but this case could be different. I think the problem is stored value is loop variant. > > Yes, it would benefit in general if we could move such stores from loops. > > > > Vladimir > > > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > > Thanks Vladimir. I believe I did see the split pre/main/post loop form > > in the compiled code; I tried it with both increment and decrement, and > > also a plain counted for loop, but same asm in the end. Clearly the > > example I gave is just for illustrating the point, but perhaps this type > > of code can be left behind after some other optimization passes > > complete. Do you think it's worth handling these types of things? > > > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > > wrote: > > > > I would need to look on ideal graph to find what is going on. > > My guess is that the problem is Store node has control edge pointing > > to loop head so we can't move it from loop. Or stupid staff like we > > only do it for index increment and not for decrement (which I doubt). > > > > Most possible scenario is we convert this loop to canonical Counted > > and split it (creating 3 loops: pre, main, post) before we had a > > chance to collapse it. > > > > Regards, > > Vladimir > > > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > > > Anyone? 
:) > > > > Thanks > > > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > > > >> wrote: > > > > Hi guys, > > > > Suppose we have the following code: > > > > private static final long[] ARR = new long[10]; > > private static final int ITERATIONS = 500 * 1000 * 1000; > > > > > > static void loop() { > > int i = ITERATIONS; > > while(--i >= 0) { > > ARR[4] = i; > > } > > } > > > > My expectation would be that loop() effectively compiles to: > > ARR[4] = 0; > > ret; > > > > However, an actual loop is compiled. This example is a > > reduction of > > a slightly more involved case that I had in mind, but I'm > > curious if > > I'm missing something here. > > > > I tested this with 8u40 and tiered compilation disabled, > > for what > > it's worth. > > > > Thanks > > > > > > > > > > From vitalyd at gmail.com Wed May 13 14:19:44 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 13 May 2015 10:19:44 -0400 Subject: Intermediate writes to array slot not eliminated by optimizer In-Reply-To: References: <55523476.90601@oracle.com> <55523CC7.5010200@oracle.com> Message-ID: Great, thank you On Wed, May 13, 2015 at 10:19 AM, Roland Westrelin < roland.westrelin at oracle.com> wrote: > > Thanks Roland. Did you by chance confirm this on your end? I wanted to > make sure this wasn't some artifact on my part. > > Yes, it?s a problem on our side. > > Roland. > > > > > Vitaly > > > > On Wed, May 13, 2015 at 10:08 AM, Roland Westrelin < > roland.westrelin at oracle.com> wrote: > > Vitaly, > > > > I filed: > > > > https://bugs.openjdk.java.net/browse/JDK-8080289 > > > > Roland. > > > > > > > On May 12, 2015, at 7:56 PM, Vitaly Davidovich > wrote: > > > > > > Well, this seems more about detecting that each iteration of the loop > is storing to the same address, which means if the loop is counted (or > otherwise determined to end), then only final value can be stored. 
Even if > loop is not removed entirely and the final value computed statically (as > one could conceive in this case), I'd think working with a temp inside the > loop and then storing the temp value into memory would be good as well. > > > > > > FWIW, gcc and llvm do the "right" thing here in similarly shaped C++ > code :). > > > > > > On Tue, May 12, 2015 at 1:47 PM, Vladimir Kozlov < > vladimir.kozlov at oracle.com> wrote: > > > The most important thing we need do with this code is move Store from > the loop. We definitely collapse empty loops. > > > We do move some loop invariant nodes from loops but this case could be > different. I think the problem is stored value is loop variant. > > > Yes, it would benefit in general if we could move such stores from > loops. > > > > > > Vladimir > > > > > > On 5/12/15 10:16 AM, Vitaly Davidovich wrote: > > > Thanks Vladimir. I believe I did see the split pre/main/post loop form > > > in the compiled code; I tried it with both increment and decrement, and > > > also a plain counted for loop, but same asm in the end. Clearly the > > > example I gave is just for illustrating the point, but perhaps this > type > > > of code can be left behind after some other optimization passes > > > complete. Do you think it's worth handling these types of things? > > > > > > On Tue, May 12, 2015 at 1:12 PM, Vladimir Kozlov > > > > > wrote: > > > > > > I would need to look on ideal graph to find what is going on. > > > My guess is that the problem is Store node has control edge > pointing > > > to loop head so we can't move it from loop. Or stupid staff like we > > > only do it for index increment and not for decrement (which I > doubt). > > > > > > Most possible scenario is we convert this loop to canonical Counted > > > and split it (creating 3 loops: pre, main, post) before we had a > > > chance to collapse it. > > > > > > Regards, > > > Vladimir > > > > > > On 5/12/15 10:05 AM, Vitaly Davidovich wrote: > > > > > > Anyone? 
:) > > > > > > Thanks > > > > > > On Mon, May 11, 2015 at 8:25 PM, Vitaly Davidovich > > > > > > >> wrote: > > > > > > Hi guys, > > > > > > Suppose we have the following code: > > > > > > private static final long[] ARR = new long[10]; > > > private static final int ITERATIONS = 500 * 1000 * 1000; > > > > > > > > > static void loop() { > > > int i = ITERATIONS; > > > while(--i >= 0) { > > > ARR[4] = i; > > > } > > > } > > > > > > My expectation would be that loop() effectively compiles > to: > > > ARR[4] = 0; > > > ret; > > > > > > However, an actual loop is compiled. This example is a > > > reduction of > > > a slightly more involved case that I had in mind, but I'm > > > curious if > > > I'm missing something here. > > > > > > I tested this with 8u40 and tiered compilation disabled, > > > for what > > > it's worth. > > > > > > Thanks > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Wed May 13 14:28:32 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 May 2015 07:28:32 -0700 Subject: RFR(XS): 8078436: java/util/stream/boottest/java/util/stream/UnorderedTest.java crashed with an assert in ifnode.cpp In-Reply-To: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> References: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> Message-ID: <55535F90.9000704@oracle.com> Good. Thanks, Vladimir On 5/13/15 2:05 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8078436/webrev.00/ > > The assert is wrong. I mixed the BoolNode of one IfNode with the projection of the other IfNode. > > Roland. 
> From bertrand.delsart at oracle.com Wed May 13 14:31:01 2015 From: bertrand.delsart at oracle.com (Bertrand Delsart) Date: Wed, 13 May 2015 16:31:01 +0200 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet In-Reply-To: <20150513135108.GQ31204@rbackman> References: <20150513135108.GQ31204@rbackman> Message-ID: <55536025.9070700@oracle.com> On 13/05/2015 15:51, Rickard B?ckman wrote: > Hi, > > can I please have this small change reviewed? There was a typo in the > name in the SA code which causes SA to fail when looking at OopMaps. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080155 > Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ > > Thanks > /R > Looks good, Bertrand (not a Reviewer) -- Bertrand Delsart, Grenoble Engineering Center Oracle, 180 av. de l'Europe, ZIRST de Montbonnot 38330 Montbonnot Saint Martin, FRANCE bertrand.delsart at oracle.com Phone : +33 4 76 18 81 23 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From roland.westrelin at oracle.com Wed May 13 14:36:50 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 13 May 2015 16:36:50 +0200 Subject: RFR(XS): 8078436: java/util/stream/boottest/java/util/stream/UnorderedTest.java crashed with an assert in ifnode.cpp In-Reply-To: <55535F90.9000704@oracle.com> References: <683673A0-DB76-497C-8772-77CA133AFFAA@oracle.com> <55535F90.9000704@oracle.com> Message-ID: <6FFA58D0-0D65-4CDA-A2A7-F23AE3D6BC1D@oracle.com> Thanks Vladimir & Vladimir for the reviews. Roland. 
From dmitry.samersoff at oracle.com Wed May 13 14:38:32 2015 From: dmitry.samersoff at oracle.com (Dmitry Samersoff) Date: Wed, 13 May 2015 17:38:32 +0300 Subject: RFR(XS): 8080155: field "_pc_offset" not found in type ImmutableOopMapSet In-Reply-To: <20150513135108.GQ31204@rbackman> References: <20150513135108.GQ31204@rbackman> Message-ID: <555361E8.2080506@oracle.com> Looks good to me (not a Reviewer) On 2015-05-13 16:51, Rickard Bäckman wrote: > Hi, > > can I please have this small change reviewed? There was a typo in the > name in the SA code which causes SA to fail when looking at OopMaps. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080155 > Webrev: http://cr.openjdk.java.net/~rbackman/8080155.1/ > > Thanks > /R > -- Dmitry Samersoff Oracle Java development team, Saint Petersburg, Russia * I would love to change the world, but they won't give me the sources. From paul.sandoz at oracle.com Wed May 13 14:56:37 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Wed, 13 May 2015 16:56:37 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <55533CAE.3050201@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> Message-ID: On May 13, 2015, at 1:59 PM, Vladimir Ivanov wrote: > Peter, Paul, thanks for the feedback! > > Updated the webrev in place: > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 > +1 >> I am not an expert in the HS area but the code mostly made sense to me. I also like Peter's suggestion of Context implementing Runnable. > I agree. Integrated. > >> Some minor comments.
>> >> CallSite.java: >> >> 145 private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* >> >> It's a little odd to see this field be final, but i think i understand the motivation as it's essentially acting as a memory address pointing to the head of the nbucket list, so in Java code you don't want to assign it to anything other than 0. > Yes, my motivation was to forbid reads & writes of the field from Java code. I was experimenting with injecting the field by VM, but it's less attractive. > I was wondering if that were possible. >> Is VM access, via java_lang_invoke_CallSite_Context, affected by treating final fields as really final in the j.l.i package? > No, field modifiers aren't respected. oopDesc::*_field_[get|put] does a raw read/write using field offset. > Ok. Paul. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From volker.simonis at gmail.com Wed May 13 16:29:27 2015 From: volker.simonis at gmail.com (Volker Simonis) Date: Wed, 13 May 2015 18:29:27 +0200 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file Message-ID: Hi, could you please review the following, ppc64-only fix which solves a problem with the integer-rotate instructions in C2: https://bugs.openjdk.java.net/browse/JDK-8080190 http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ The problem was that we initially took the "rotate left/right by 8-bit immediate" instructions from x86_64. But on x86_64 the rotate instructions for 32-bit integers do in fact really take 8-bit immediates (seems like the Intel guys had to many spare bits when they designed their instruction set :) On the other hand, the rotate instructions on Power only use 5 bit to encode the rotation distance. 
We do check for this limit, but only in the debug build so you'll see an assertion there, but in the product build we will silently emit a wrong instruction which can lead to anything starting from a crash up to an incorrect computation. I think the simplest solution for this problem is to mask the rotation distance with 0x1f (which is equivalent to taking the distance modulo 32) in the corresponding instructions. This change should also be backported to 8 as fast as possible. Notice that this is a day-zero bug in our ppc64 port because in our internal SAP JVM we already mask the shift amounts in the ideal graph but for some unknown reasons these shared changes haven't been contributed along with our port. Thank you and best regards, Volker From vladimir.kozlov at oracle.com Wed May 13 18:39:05 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 May 2015 11:39:05 -0700 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file In-Reply-To: References: Message-ID: <55539A49.4010906@oracle.com> Is this because you can have negative values? I see that at the bottom of the calls which generate bits the u_field() method is used. Why not mask it there? There are other instructions which can have the same problem (slwi). You would have to guarantee all of those do the masking. That is why I am asking why not mask it at the final code methods, like u_field()? Thanks, Vladimir On 5/13/15 9:29 AM, Volker Simonis wrote: > Hi, > > could you please review the following, ppc64-only fix which solves a > problem with the integer-rotate instructions in C2: > > https://bugs.openjdk.java.net/browse/JDK-8080190 > http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ > > The problem was that we initially took the "rotate left/right by 8-bit > immediate" instructions from x86_64.
But on x86_64 the rotate > instructions for 32-bit integers do in fact really take 8-bit > immediates (seems like the Intel guys had too many spare bits when they > designed their instruction set :) On the other hand, the rotate > instructions on Power only use 5 bit to encode the rotation distance. > > We do check for this limit, but only in the debug build so you'll see > an assertion there, but in the product build we will silently emit a > wrong instruction which can lead to anything starting from a crash up > to an incorrect computation. > > I think the simplest solution for this problem is to mask the > rotation distance with 0x1f (which is equivalent to taking the > distance modulo 32) in the corresponding instructions. > > This change should also be backported to 8 as fast as possible. Notice > that this is a day-zero bug in our ppc64 port because in our internal SAP > JVM we already mask the shift amounts in the ideal graph but for some > unknown reasons these shared changes haven't been contributed along > with our port.
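The modulo-32 equivalence Volker relies on can be sketched in plain Java (an illustrative model, not the actual .ad file change; note that Java's own shift operators already mask their distance the same way, which is why `n == 0` is safe below):

```java
// Illustrative model of the fix: the PPC64 rotate encodings have only a
// 5-bit field for the rotation distance, so the distance must be reduced
// mod 32 before the instruction is emitted; otherwise the extra bits
// silently corrupt the encoding.
public class RotateMaskSketch {
    static int rotl32(int x, int dist) {
        int n = dist & 0x1f;                 // same as dist % 32 for dist >= 0
        return (x << n) | (x >>> (32 - n));  // Java masks shift distances, so n == 0 is safe
    }

    public static void main(String[] args) {
        // a distance of 36 must behave exactly like a distance of 4
        System.out.println(rotl32(0x12345678, 36) == Integer.rotateLeft(0x12345678, 4));  // prints true
    }
}
```

`Integer.rotateLeft` documents the same modular behavior for its distance argument, so it serves as a convenient reference implementation to check the masked rotate against.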
> > Thank you and best regards, > Volker > From sandhya.viswanathan at intel.com Wed May 13 19:29:20 2015 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Wed, 13 May 2015 19:29:20 +0000 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <55530C84.2070801@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> <55530C84.2070801@redhat.com> Message-ID: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336292E@FMSMSX112.amr.corp.intel.com> Thanks for your clarification. There are now two proposals/patches for this and we could benefit from both. The patch that I proposed has intrinsic implementation for squareToLen which would still be exercised from the BigInteger.multiply() path when two inputs are same. It shows ~2x benefit for micro benchmark doing square operations and we see ~1.5x for Crypto RSA. So in my thoughts we should go ahead with the patch from Intel as it is complementary to your proposal where you accelerate oddModPow. Best Regards, Sandhya -----Original Message----- From: Andrew Haley [mailto:aph at redhat.com] Sent: Wednesday, May 13, 2015 1:34 AM To: Viswanathan, Sandhya; Vladimir Kozlov; Florian Weimer; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(L): 8069539: RSA acceleration On 13/05/15 00:25, Viswanathan, Sandhya wrote: > I was wondering if inline assembly can be used in the native code of JRE. Does the JRE build framework support it? 
For every target which supports it you could define a macro in one of the os/cpu-dependent header files. But it's slightly awkward because it's compiler- as well as cpu-dependent. As long as we assume GCC it's pretty easy for all targets but I'm not sure what other compilers we use. However, I have to stress that this is just a demonstration: even though it is already much better than what we have, I'm not imagining that we really want to use JNI for this. It would make more sense to move it into HotSpot and call it as a runtime helper. One interesting thing to try would be to find out if this technique makes things slower on 32-bit targets. If it doesn't then we could use it everywhere. Andrew. From vitalyd at gmail.com Wed May 13 19:36:33 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 13 May 2015 15:36:33 -0400 Subject: C2: Advantage of parse time inlining Message-ID: Hi guys, Could someone please explain the advantage, if any, of parse time inlining in C2? Given that FreqInlineSize is quite large by default, most hot methods will get inlined anyway (well, ones that can be for other reasons). What is the advantage of parse time inlining? Is it quicker time to peak performance if C1 is reached first? Does it ensure that a method is inlined whereas it may not be if it's already compiled into a medium/large method otherwise? Is parse time inlining not susceptible to profile pollution? I suspect it is since the interpreter has already profiled the inlinee either way, but wanted to check. Anything else I'm not thinking about? Thanks -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vladimir.kozlov at oracle.com Wed May 13 21:28:02 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 May 2015 14:28:02 -0700 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: Message-ID: <5553C1E2.3040003@oracle.com> Why do you think that main- and post- loops will be empty too and it is safe to replace them with the pre-loop? Why is it safe? Opaque1 node is used only when RCE and align is requested: cmp_end->set_req(2, peel_only ? pre_limit : pre_opaq); In the peel_only case the pre-loop has only 1 iteration and can be eliminated too. The assert could be hit in such a case too. So your change may not fix the problem. I think eliminating main- and post- loops is a good investigation for a separate RFE. And we need to make sure it is safe. To fix this bug, I think, we should bail out from do_range_check() when we can't find the pre-loop. Typo 'cab': + // post loop cab be removed as well Thanks, Vladimir On 5/13/15 6:02 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8078866/webrev.00/ > > The assert fires when optimizing that code: > > static int remi_sumb() { > Integer j = Integer.valueOf(1); > for (int i = 0; i< 1000; i++) { > j = j + 1; > } > return j; > } > > The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there.
> > Here is the chain of events that leads to the pre loop being optimized out: > > 1- The CountedLoopNode is created > 2- PhiNode::Value() for the induction variable's Phi is called and captures that 0 <= i < 1000 > 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive values of j > 4- pre-loop, main-loop, post-loop are created > 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: > > public static Integer valueOf(int i) { > if (i >= IntegerCache.low && i <= IntegerCache.high) > return IntegerCache.cache[i + (-IntegerCache.low)]; > return new Integer(i); > } > > 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i > 7- In the preloop, the range of values of i is known because of 2-, and so the if of valueOf() can be optimized out > 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() > > SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf. The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main loop depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. > > Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops.
> > I don?t have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don?t optimize the loop out at all. > > Roland. > From vladimir.kozlov at oracle.com Thu May 14 00:22:21 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 13 May 2015 17:22:21 -0700 Subject: [9] RFR (XS): 8079135: C2 disables some optimizations when a large number of unique nodes exist In-Reply-To: <55531E78.7070301@oracle.com> References: <55531E78.7070301@oracle.com> Message-ID: <5553EABD.8040807@oracle.com> Looks good. Thanks, Vladimir On 5/13/15 2:50 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8079135/webrev.00/ > > Incremental inlining produces lots of dead nodes, so # of unique and # > of live nodes can differ significantly. But actual IR size is what matters. > > Performance experiments show that octane doesn't hit the problem (no > difference). > > Testing: octane/nashorn > > Best regards, > Vladimir Ivanov From michael.c.berg at intel.com Thu May 14 01:26:59 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 14 May 2015 01:26:59 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis Message-ID: Hi Folks, We (Intel) would like to contribute superword loop unrolling analysis to facilitate more default unrolling and larger SIMD vector mapping. Please review this patch and comment as needed: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 webrev: http://cr.openjdk.java.net/~kvn/8080325/webrev/ The design finds the highest common vector supported and implemented on a given machine and applies that to unrolling, iff it is greater than the default. If the user gates unrolling we will still obey the user directive. 
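The selection policy Berg describes (pick the widest supported SIMD mapping and unroll at least that far, unless the user pinned the unroll factor) can be modeled roughly as follows; the names and numbers are illustrative, not the actual HotSpot code, which lives in C2's SuperWord/loop transform passes:

```java
// Illustrative model only, not the real C2 implementation. Idea: compute
// how many elements fit in the widest vector the target supports and
// raise the unroll factor to at least that many lanes, unless the user
// gated unrolling explicitly.
public class UnrollPolicySketch {
    static int chooseUnroll(int defaultUnroll, int maxVectorBytes,
                            int elementBytes, boolean userSetUnroll) {
        if (userSetUnroll) {
            return defaultUnroll;                   // obey the user directive
        }
        int lanes = maxVectorBytes / elementBytes;  // e.g. 32-byte AVX2 / 4-byte int = 8
        return Math.max(defaultUnroll, lanes);      // unroll at least one full vector
    }

    public static void main(String[] args) {
        System.out.println(chooseUnroll(4, 32, 4, false));  // prints 8
        System.out.println(chooseUnroll(4, 32, 4, true));   // prints 4
    }
}
```

The point of taking the maximum is that a larger default unroll factor is never reduced; the vector width only ever raises it, matching "iff it is greater than the default" below.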
It's lightweight: when we get to the analysis part, if we fail, we stop further tries. If we succeed we stop further tries. We generally always analyze only once. We then gate the unroll factor by extending the size of the unroll segment so that the iterative tries will keep unrolling, obtaining larger unrolls of a targeted loop. I see no negative behavior wrt performance, and a modest positive swing in default behavior up to 1.5x for some micros. Vladimir Kozlov has offered to sponsor this patch. -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.c.berg at intel.com Thu May 14 01:31:57 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 14 May 2015 01:31:57 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: Message-ID: Also, for now I have only enabled x86 and leave the analysis of its value to other target developer contributors after the fact. Thanks, Michael From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Berg, Michael C Sent: Wednesday, May 13, 2015 6:27 PM To: hotspot-compiler-dev at openjdk.java.net Subject: RFR 8080325 SuperWord loop unrolling analysis Hi Folks, We (Intel) would like to contribute superword loop unrolling analysis to facilitate more default unrolling and larger SIMD vector mapping. Please review this patch and comment as needed: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 webrev: http://cr.openjdk.java.net/~kvn/8080325/webrev/ The design finds the highest common vector supported and implemented on a given machine and applies that to unrolling, iff it is greater than the default. If the user gates unrolling we will still obey the user directive. It's lightweight: when we get to the analysis part, if we fail, we stop further tries. If we succeed we stop further tries. We generally always analyze only once.
We then gate the unroll factor by extending the size of the unroll segment so that the iterative tries will keep unrolling, obtaining larger unrolls of a targeted loop. I see no negative behavior wrt to performance, and a modest positive swing in default behavior up to 1.5x for some micros. Vladimir Kozlov has offered to sponsor this patch. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aph at redhat.com Thu May 14 10:59:40 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 14 May 2015 11:59:40 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336292E@FMSMSX112.amr.corp.intel.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336247C@FMSMSX112.amr.corp.intel.com> <55530C84.2070801@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6336292E@FMSMSX112.amr.corp.intel.com> Message-ID: <5554801C.5070208@redhat.com> On 05/13/2015 08:29 PM, Viswanathan, Sandhya wrote: > > Thanks for your clarification. There are now two proposals/patches > for this and we could benefit from both. Yes, indeed. > The patch that I proposed has intrinsic implementation for > squareToLen which would still be exercised from the > BigInteger.multiply() path when two inputs are same. It shows ~2x > benefit for micro benchmark doing square operations and we see ~1.5x > for Crypto RSA. > > So in my thoughts we should go ahead with the patch from Intel as it > is complementary to your proposal where you accelerate oddModPow. I think that's probably right. 
We can track the proposals separately. Even if squareToLen and multiplyToLen don't end up being used in RSA at all they will still be useful for BigInteger operations. I'm not seriously suggesting the JNI route: it's just a demonstration of how powerful the accelerated Montgomery algorithm is. (Or, alternatively, just how suboptimal our current approach is.) Andrew. From vladimir.x.ivanov at oracle.com Thu May 14 11:25:02 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 14 May 2015 14:25:02 +0300 Subject: [9] RFR (XS): 8079135: C2 disables some optimizations when a large number of unique nodes exist In-Reply-To: <5553EABD.8040807@oracle.com> References: <55531E78.7070301@oracle.com> <5553EABD.8040807@oracle.com> Message-ID: <5554860E.2030106@oracle.com> Thanks, Vladimir. Best regards, Vladimir Ivanov On 5/14/15 3:22 AM, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/13/15 2:50 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8079135/webrev.00/ >> >> Incremental inlining produces lots of dead nodes, so # of unique and # >> of live nodes can differ significantly. But actual IR size is what >> matters. >> >> Performance experiments show that octane doesn't hit the problem (no >> difference). >> >> Testing: octane/nashorn >> >> Best regards, >> Vladimir Ivanov From vladimir.x.ivanov at oracle.com Thu May 14 11:43:57 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 14 May 2015 14:43:57 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <882B8D85-62B4-4734-8CA7-0BFD45DB73B3@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <882B8D85-62B4-4734-8CA7-0BFD45DB73B3@oracle.com> Message-ID: <55548A7D.3000606@oracle.com> Thanks, Roland. 
Best regards, Vladimir Ivanov On 5/13/15 4:47 PM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02/ > > The hotspot code looks good to me. > > Roland. > From vladimir.x.ivanov at oracle.com Thu May 14 12:18:31 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 14 May 2015 15:18:31 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> Message-ID: <55549297.5020002@oracle.com> Small update in JDK code (webrev updated in place): http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 Changes: - hid Context.dependencies field from Reflection - new test case: CallSiteDepContextTest.testHiddenDepField() Best regards, Vladimir Ivanov On 5/13/15 5:56 PM, Paul Sandoz wrote: > > On May 13, 2015, at 1:59 PM, Vladimir Ivanov wrote: > >> Peter, Paul, thanks for the feedback! >> >> Updated the webrev in place: >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >> > > +1 > > >>> I am not an export in the HS area but the code mostly made sense to me. I also like Peter's suggestion of Context implementing Runnable. >> I agree. Integrated. >> >>> Some minor comments. >>> >>> CallSite.java: >>> >>> 145 private final long dependencies = 0; // Used by JVM to store JVM_nmethodBucket* >>> >>> It's a little odd to see this field be final, but i think i understand the motivation as it's essentially acting as a memory address pointing to the head of the nbucket list, so in Java code you don't want to assign it to anything other than 0. >> Yes, my motivation was to forbid reads & writes of the field from Java code. I was experimenting with injecting the field by VM, but it's less attractive. >> > > I was wondering if that were possible. 
> > >>> Is VM access, via java_lang_invoke_CallSite_Context, affected by treating final fields as really final in the j.l.i package? >> No, field modifiers aren't respected. oopDesc::*_field_[get|put] does a raw read/write using field offset. >> > > Ok. > > Paul. > From tobias.hartmann at oracle.com Thu May 14 15:59:11 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 14 May 2015 17:59:11 +0200 Subject: [9] RFR(XS): 8080420: Compilation of TestVectorizationWithInvariant fails with "error: package com.oracle.java.testlibrary does not exist" Message-ID: <5554C64F.5010405@oracle.com> Hi, please review the following patch that fixes the testlibrary location in TestVectorizationWithInvariant after JDK-8067013 which is in hs but not in hs-comp. https://bugs.openjdk.java.net/browse/JDK-8080420 http://cr.openjdk.java.net/~thartmann/8080420/webrev.00/ This is blocking the hs-comp->hs sync. I will push the fix directly to hs. Thanks, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8067013 From vladimir.kozlov at oracle.com Thu May 14 16:00:48 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 14 May 2015 09:00:48 -0700 Subject: [9] RFR(XS): 8080420: Compilation of TestVectorizationWithInvariant fails with "error: package com.oracle.java.testlibrary does not exist" In-Reply-To: <5554C64F.5010405@oracle.com> References: <5554C64F.5010405@oracle.com> Message-ID: <5554C6B0.5080601@oracle.com> Looks good. Thanks, Vladimir On 5/14/15 8:59 AM, Tobias Hartmann wrote: > Hi, > > please review the following patch that fixes the testlibrary location in TestVectorizationWithInvariant after JDK-8067013 which is in hs but not in hs-comp. > > https://bugs.openjdk.java.net/browse/JDK-8080420 > http://cr.openjdk.java.net/~thartmann/8080420/webrev.00/ > > This is blocking the hs-comp->hs sync. I will push the fix directly to hs. 
> > Thanks, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8067013 > From tobias.hartmann at oracle.com Thu May 14 16:06:01 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 14 May 2015 18:06:01 +0200 Subject: [9] RFR(XS): 8080420: Compilation of TestVectorizationWithInvariant fails with "error: package com.oracle.java.testlibrary does not exist" In-Reply-To: <5554C6B0.5080601@oracle.com> References: <5554C64F.5010405@oracle.com> <5554C6B0.5080601@oracle.com> Message-ID: <5554C7E9.5070305@oracle.com> Thanks, Vladimir. Best, Tobias On 14.05.2015 18:00, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/14/15 8:59 AM, Tobias Hartmann wrote: >> Hi, >> >> please review the following patch that fixes the testlibrary location in TestVectorizationWithInvariant after JDK-8067013 which is in hs but not in hs-comp. >> >> https://bugs.openjdk.java.net/browse/JDK-8080420 >> http://cr.openjdk.java.net/~thartmann/8080420/webrev.00/ >> >> This is blocking the hs-comp->hs sync. I will push the fix directly to hs. >> >> Thanks, >> Tobias >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8067013 >> From vitalyd at gmail.com Thu May 14 16:09:41 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 12:09:41 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: References: Message-ID: Any pointers? Sorry to bug you guys, but just want to make sure I understand this point as I see quite a bit of discussion on core-libs and elsewhere where people are worrying about the 35 bytecode size threshold for parse inlining. On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich wrote: > Hi guys, > > Could someone please explain the advantage, if any, of parse time inlining > in C2? Given that FreqInlineSize is quite large by default, most hot > methods will get inlined anyway (well, ones that can be for other > reasons). What is the advantage of parse time inlining? 
> > Is it quicker time to peak performance if C1 is reached first? > > Does it ensure that a method is inlined whereas it may not be if it's > already compiled into a medium/large method otherwise? > > Is parse time inlining not susceptible to profile pollution? I suspect it > is since the interpreter has already profiled the inlinee either way, but > wanted to check. > > Anything else I'm not thinking about? > > Thanks > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.x.ivanov at oracle.com Thu May 14 16:36:45 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 14 May 2015 19:36:45 +0300 Subject: C2: Advantage of parse time inlining In-Reply-To: References: Message-ID: <5554CF1D.3010601@oracle.com> Vitaly, Can you elaborate your question a bit? What do you compare parse-time inlining with? Mentioning of C1 & profile pollution in this context confuses me. Usually, people care about 35 (= MaxInlineSize), because for methods up to MaxInlineSize their call frequency is ignored. So, fewer chances to end up with non-inlined call. Best regards, Vladimir Ivanov On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > Any pointers? Sorry to bug you guys, but just want to make sure I > understand this point as I see quite a bit of discussion on core-libs > and elsewhere where people are worrying about the 35 bytecode size > threshold for parse inlining. > > On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich > > wrote: > > Hi guys, > > Could someone please explain the advantage, if any, of parse time > inlining in C2? Given that FreqInlineSize is quite large by default, > most hot methods will get inlined anyway (well, ones that can be for > other reasons). What is the advantage of parse time inlining?
> > Is parse time inlining not susceptible to profile pollution? I > suspect it is since the interpreter has already profiled the inlinee > either way, but wanted to check. > > Anything else I'm not thinking about? > > Thanks > > From vitalyd at gmail.com Thu May 14 16:57:26 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 12:57:26 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: <5554CF1D.3010601@oracle.com> References: <5554CF1D.3010601@oracle.com> Message-ID: Vladimir, I'm comparing MaxInlineSize (35) with FreqInlineSize (325). AFAIU, MaxInlineSize drives which methods are inlined at parse time by C2, whereas FreqInlineSize is the threshold for "late" (or what do you guys call inlining after parsing?) inlining. Most of the inlining discussions (or worries, rather) seem to focus around the MaxInlineSize value, and not FreqInlineSize, even if the target method will get hot. Usually, people care about 35 (= MaxInlineSize), because for methods up to > MaxInlineSize their call frequency is ignored. So, fewer chances to end up > with non-inlined call. Ok, so for hot methods then MaxInlineSize isn't really a concern, and FreqInlineSize would be the threshold to worry about (for C2 compiler) then? Why are people worried about inlining in cold paths then? Thanks Vladimir On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov < vladimir.x.ivanov at oracle.com> wrote: > Vitaly, > > Can you elaborate your question a bit? What do you compare parse-time > inlining with? Mentioning of ?1 & profile pollution in this context > confuses me. > > Usually, people care about 35 (= MaxInlineSize), because for methods up to > MaxInlineSize their call frequency is ignored. So, fewer chances to end up > with non-inlined call. > > Best regards, > Vladimir Ivanov > > On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > >> Any pointers? 
Sorry to bug you guys, but just want to make sure I >> understand this point as I see quite a bit of discussion on core-libs >> and elsewhere where people are worrying about the 35 bytecode size >> threshold for parse inlining. >> >> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich > > wrote: >> >> Hi guys, >> >> Could someone please explain the advantage, if any, of parse time >> inlining in C2? Given that FreqInlineSize is quite large by default, >> most hot methods will get inlined anyway (well, ones that can be for >> other reasons). What is the advantage of parse time inlining? >> >> Is it quicker time to peak performance if C1 is reached first? >> >> Does it ensure that a method is inlined whereas it may not be if >> it's already compiled into a medium/large method otherwise? >> >> Is parse time inlining not susceptible to profile pollution? I >> suspect it is since the interpreter has already profiled the inlinee >> either way, but wanted to check. >> >> Anything else I'm not thinking about? >> >> Thanks >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 14 17:03:50 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 13:03:50 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> Message-ID: I should also add that I see how inlining without taking call freq into account could lead to faster time to peak performance for methods that eventually get hot anyway but aren't at parse time. Peak perf will be the same if the method is too big for parse inlining but eventually gets compiled due to reaching hotness. Is that about right? sent from my phone On May 14, 2015 12:57 PM, "Vitaly Davidovich" wrote: > Vladimir, > > I'm comparing MaxInlineSize (35) with FreqInlineSize (325). 
AFAIU, > MaxInlineSize drives which methods are inlined at parse time by C2, whereas > FreqInlineSize is the threshold for "late" (or what do you guys call > inlining after parsing?) inlining. Most of the inlining discussions (or > worries, rather) seem to focus around the MaxInlineSize value, and not > FreqInlineSize, even if the target method will get hot. > > Usually, people care about 35 (= MaxInlineSize), because for methods up to >> MaxInlineSize their call frequency is ignored. So, fewer chances to end up >> with non-inlined call. > > > Ok, so for hot methods then MaxInlineSize isn't really a concern, and > FreqInlineSize would be the threshold to worry about (for C2 compiler) > then? Why are people worried about inlining in cold paths then? > > Thanks Vladimir > > On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov < > vladimir.x.ivanov at oracle.com> wrote: > >> Vitaly, >> >> Can you elaborate your question a bit? What do you compare parse-time >> inlining with? Mentioning of ?1 & profile pollution in this context >> confuses me. >> >> Usually, people care about 35 (= MaxInlineSize), because for methods up >> to MaxInlineSize their call frequency is ignored. So, fewer chances to end >> up with non-inlined call. >> >> Best regards, >> Vladimir Ivanov >> >> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >> >>> Any pointers? Sorry to bug you guys, but just want to make sure I >>> understand this point as I see quite a bit of discussion on core-libs >>> and elsewhere where people are worrying about the 35 bytecode size >>> threshold for parse inlining. >>> >>> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >> > wrote: >>> >>> Hi guys, >>> >>> Could someone please explain the advantage, if any, of parse time >>> inlining in C2? Given that FreqInlineSize is quite large by default, >>> most hot methods will get inlined anyway (well, ones that can be for >>> other reasons). What is the advantage of parse time inlining? 
>>> >>> Is it quicker time to peak performance if C1 is reached first? >>> >>> Does it ensure that a method is inlined whereas it may not be if >>> it's already compiled into a medium/large method otherwise? >>> >>> Is parse time inlining not susceptible to profile pollution? I >>> suspect it is since the interpreter has already profiled the inlinee >>> either way, but wanted to check. >>> >>> Anything else I'm not thinking about? >>> >>> Thanks >>> >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Thu May 14 18:30:35 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 14 May 2015 11:30:35 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> Message-ID: <5554E9CB.3020903@oracle.com> Vitaly, You have small misconception - almost all C2 inlining (normal java methods) is done during parsing in one pass. Recently we changed it to inline jsr292 methods after parsing (to execute IGVN and reduce graph - otherwise they blow up number of ideal nodes and we bailout compilation due to MaxNodeLimit). As parser goes and see a call site it check can it inline or not (see opto/bytecodeinfo.cpp, should_inline() and should_not_inline()). There are several conditions which drive inlining and following (most important) flags controls them: MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, InlineFrequencyRatio. Most tedious flag is InlineSmallCode (size of compiled assembler code which is different from other sizes which are bytecode size) which control inlining of already compiled method. Usually if a call site is hot the callee is compiled before caller. But sometimes you can get caller compiled first (if it has hot loop, for example) so different condition will be used and as result you can get performance variation between runs. 
The difference between MaxInlineSize (35) and FreqInlineSize (325) is FreqInlineSize takes into account how frequent call site is executed relatively to caller invocations: int call_site_count = method()->scale_count(profile.count()); int invoke_count = method()->interpreter_invocation_count(); int freq = call_site_count / invoke_count; int max_inline_size = MaxInlineSize; // bump the max size if the call is frequent if ((freq >= InlineFrequencyRatio) || (call_site_count >= InlineFrequencyCount) || is_unboxing_method(callee_method, C) || is_init_with_ea(callee_method, caller_method, C)) { max_inline_size = FreqInlineSize; And there is additional inlining condition for all methods which size > MaxTrivialSize: if (!callee_method->was_executed_more_than(MinInliningThreshold)) { set_msg("executed < MinInliningThreshold times"); Regards, Vladimir On 5/14/15 10:03 AM, Vitaly Davidovich wrote: > I should also add that I see how inlining without taking call freq into > account could lead to faster time to peak performance for methods that > eventually get hot anyway but aren't at parse time. Peak perf will be > the same if the method is too big for parse inlining but eventually gets > compiled due to reaching hotness. Is that about right? > > sent from my phone > > On May 14, 2015 12:57 PM, "Vitaly Davidovich" > wrote: > > Vladimir, > > I'm comparing MaxInlineSize (35) with FreqInlineSize (325). AFAIU, > MaxInlineSize drives which methods are inlined at parse time by C2, > whereas FreqInlineSize is the threshold for "late" (or what do you > guys call inlining after parsing?) inlining. Most of the inlining > discussions (or worries, rather) seem to focus around the > MaxInlineSize value, and not FreqInlineSize, even if the target > method will get hot. > > Usually, people care about 35 (= MaxInlineSize), because for > methods up to MaxInlineSize their call frequency is ignored. So, > fewer chances to end up with non-inlined call. 
> > > Ok, so for hot methods then MaxInlineSize isn't really a concern, > and FreqInlineSize would be the threshold to worry about (for C2 > compiler) then? Why are people worried about inlining in cold paths > then? > > Thanks Vladimir > > On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov > > > wrote: > > Vitaly, > > Can you elaborate your question a bit? What do you compare > parse-time inlining with? Mentioning of ?1 & profile pollution > in this context confuses me. > > Usually, people care about 35 (= MaxInlineSize), because for > methods up to MaxInlineSize their call frequency is ignored. So, > fewer chances to end up with non-inlined call. > > Best regards, > Vladimir Ivanov > > On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > > Any pointers? Sorry to bug you guys, but just want to make > sure I > understand this point as I see quite a bit of discussion on > core-libs > and elsewhere where people are worrying about the 35 > bytecode size > threshold for parse inlining. > > On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich > > >> wrote: > > Hi guys, > > Could someone please explain the advantage, if any, of > parse time > inlining in C2? Given that FreqInlineSize is quite > large by default, > most hot methods will get inlined anyway (well, ones > that can be for > other reasons). What is the advantage of parse time > inlining? > > Is it quicker time to peak performance if C1 is reached > first? > > Does it ensure that a method is inlined whereas it may > not be if > it's already compiled into a medium/large method otherwise? > > Is parse time inlining not susceptible to profile > pollution? I > suspect it is since the interpreter has already > profiled the inlinee > either way, but wanted to check. > > Anything else I'm not thinking about? 
> > Thanks > > > From john.r.rose at oracle.com Thu May 14 18:42:47 2015 From: john.r.rose at oracle.com (John Rose) Date: Thu, 14 May 2015 11:42:47 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: <5554E9CB.3020903@oracle.com> References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> Message-ID: On May 14, 2015, at 11:30 AM, Vladimir Kozlov wrote: > > As parser goes and see a call site it check can it inline or not (see opto/bytecodeinfo.cpp, should_inline() and should_not_inline()). There are several conditions which drive inlining and following (most important) flags controls them: MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, InlineFrequencyRatio. There's a bug report to catch complaints about these heuristics: https://bugs.openjdk.java.net/browse/JDK-6316156 -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 14 19:02:20 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 15:02:20 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: <5554E9CB.3020903@oracle.com> References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> Message-ID: Thanks Vladimir. I recall seeing changes around incremental inlining, and may have mistakenly thought it happens at some later point in time. Appreciate the clarification. Ok, so based on what you say, I can see a theoretical problem whereby a method is being parsed, is larger than MaxInlineSize, but doesn't happen to be frequent enough yet at this point, and so it won't be inlined; if it turns out to be hot later on, the lack of inlining will not be undone (assuming the caller isn't deopted and recompiled later, with updated frequency info, for other reasons). Thanks again. 
On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov wrote: > Vitaly, > > You have small misconception - almost all C2 inlining (normal java > methods) is done during parsing in one pass. Recently we changed it to > inline jsr292 methods after parsing (to execute IGVN and reduce graph - > otherwise they blow up number of ideal nodes and we bailout compilation due > to MaxNodeLimit). > > As parser goes and see a call site it check can it inline or not (see > opto/bytecodeinfo.cpp, should_inline() and should_not_inline()). There are > several conditions which drive inlining and following (most important) > flags controls them: MaxTrivialSize, MaxInlineSize, FreqInlineSize, > InlineSmallCode, MinInliningThreshold, MaxInlineLevel, > InlineFrequencyCount, InlineFrequencyRatio. > > Most tedious flag is InlineSmallCode (size of compiled assembler code > which is different from other sizes which are bytecode size) which control > inlining of already compiled method. Usually if a call site is hot the > callee is compiled before caller. But sometimes you can get caller compiled > first (if it has hot loop, for example) so different condition will be used > and as result you can get performance variation between runs. 
> > The difference between MaxInlineSize (35) and FreqInlineSize (325) is > FreqInlineSize takes into account how frequent call site is executed > relatively to caller invocations: > > int call_site_count = method()->scale_count(profile.count()); > int invoke_count = method()->interpreter_invocation_count(); > int freq = call_site_count / invoke_count; > int max_inline_size = MaxInlineSize; > // bump the max size if the call is frequent > if ((freq >= InlineFrequencyRatio) || > (call_site_count >= InlineFrequencyCount) || > is_unboxing_method(callee_method, C) || > is_init_with_ea(callee_method, caller_method, C)) { > max_inline_size = FreqInlineSize; > > And there is additional inlining condition for all methods which size > > MaxTrivialSize: > > if (!callee_method->was_executed_more_than(MinInliningThreshold)) { > set_msg("executed < MinInliningThreshold times"); > > Regards, > Vladimir > > On 5/14/15 10:03 AM, Vitaly Davidovich wrote: > >> I should also add that I see how inlining without taking call freq into >> account could lead to faster time to peak performance for methods that >> eventually get hot anyway but aren't at parse time. Peak perf will be >> the same if the method is too big for parse inlining but eventually gets >> compiled due to reaching hotness. Is that about right? >> >> sent from my phone >> >> On May 14, 2015 12:57 PM, "Vitaly Davidovich" > > wrote: >> >> Vladimir, >> >> I'm comparing MaxInlineSize (35) with FreqInlineSize (325). AFAIU, >> MaxInlineSize drives which methods are inlined at parse time by C2, >> whereas FreqInlineSize is the threshold for "late" (or what do you >> guys call inlining after parsing?) inlining. Most of the inlining >> discussions (or worries, rather) seem to focus around the >> MaxInlineSize value, and not FreqInlineSize, even if the target >> method will get hot. >> >> Usually, people care about 35 (= MaxInlineSize), because for >> methods up to MaxInlineSize their call frequency is ignored. 
So, >> fewer chances to end up with non-inlined call. >> >> >> Ok, so for hot methods then MaxInlineSize isn't really a concern, >> and FreqInlineSize would be the threshold to worry about (for C2 >> compiler) then? Why are people worried about inlining in cold paths >> then? >> >> Thanks Vladimir >> >> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >> > >> wrote: >> >> Vitaly, >> >> Can you elaborate your question a bit? What do you compare >> parse-time inlining with? Mentioning of ?1 & profile pollution >> in this context confuses me. >> >> Usually, people care about 35 (= MaxInlineSize), because for >> methods up to MaxInlineSize their call frequency is ignored. So, >> fewer chances to end up with non-inlined call. >> >> Best regards, >> Vladimir Ivanov >> >> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >> >> Any pointers? Sorry to bug you guys, but just want to make >> sure I >> understand this point as I see quite a bit of discussion on >> core-libs >> and elsewhere where people are worrying about the 35 >> bytecode size >> threshold for parse inlining. >> >> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >> >> >> wrote: >> >> Hi guys, >> >> Could someone please explain the advantage, if any, of >> parse time >> inlining in C2? Given that FreqInlineSize is quite >> large by default, >> most hot methods will get inlined anyway (well, ones >> that can be for >> other reasons). What is the advantage of parse time >> inlining? >> >> Is it quicker time to peak performance if C1 is reached >> first? >> >> Does it ensure that a method is inlined whereas it may >> not be if >> it's already compiled into a medium/large method >> otherwise? >> >> Is parse time inlining not susceptible to profile >> pollution? I >> suspect it is since the interpreter has already >> profiled the inlinee >> either way, but wanted to check. >> >> Anything else I'm not thinking about? >> >> Thanks >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vladimir.kozlov at oracle.com Thu May 14 19:20:35 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 14 May 2015 12:20:35 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> Message-ID: <5554F583.7090005@oracle.com> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: > Thanks Vladimir. I recall seeing changes around incremental inlining, > and may have mistakenly thought it happens at some later point in time. > Appreciate the clarification. > > Ok, so based on what you say, I can see a theoretical problem whereby a > method is being parsed, is larger than MaxInlineSize, but doesn't happen > to be frequent enough yet at this point, and so it won't be inlined; if > it turns out to be hot later on, the lack of inlining will not be undone > (assuming the caller isn't deopted and recompiled later, with updated > frequency info, for other reasons). That is correct. Note, that the problem is not how hot is callee (invocation times) but how hot the call site in caller. Usually it does not change during execution. If it is called in a loop C2 will try to inline because freq should be high. There is MinInliningThreshold (250) but since we compile caller when it is executed 10000 times the call site should be on slow path which is executed only 2.5% times. So you will not see the performance difference if we inline it or call callee which is compiled if it is hot. Vladimir > > Thanks again. > > On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov > > wrote: > > Vitaly, > > You have small misconception - almost all C2 inlining (normal java > methods) is done during parsing in one pass. Recently we changed it > to inline jsr292 methods after parsing (to execute IGVN and reduce > graph - otherwise they blow up number of ideal nodes and we bailout > compilation due to MaxNodeLimit). 
> > As parser goes and see a call site it check can it inline or not > (see opto/bytecodeinfo.cpp, should_inline() and > should_not_inline()). There are several conditions which drive > inlining and following (most important) flags controls them: > MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, > MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, > InlineFrequencyRatio. > > Most tedious flag is InlineSmallCode (size of compiled assembler > code which is different from other sizes which are bytecode size) > which control inlining of already compiled method. Usually if a call > site is hot the callee is compiled before caller. But sometimes you > can get caller compiled first (if it has hot loop, for example) so > different condition will be used and as result you can get > performance variation between runs. > > The difference between MaxInlineSize (35) and FreqInlineSize (325) > is FreqInlineSize takes into account how frequent call site is > executed relatively to caller invocations: > > int call_site_count = method()->scale_count(profile.count()); > int invoke_count = method()->interpreter_invocation_count(); > int freq = call_site_count / invoke_count; > int max_inline_size = MaxInlineSize; > // bump the max size if the call is frequent > if ((freq >= InlineFrequencyRatio) || > (call_site_count >= InlineFrequencyCount) || > is_unboxing_method(callee_method, C) || > is_init_with_ea(callee_method, caller_method, C)) { > max_inline_size = FreqInlineSize; > > And there is additional inlining condition for all methods which > size > MaxTrivialSize: > > if (!callee_method->was_executed_more_than(MinInliningThreshold)) { > set_msg("executed < MinInliningThreshold times"); > > Regards, > Vladimir > > On 5/14/15 10:03 AM, Vitaly Davidovich wrote: > > I should also add that I see how inlining without taking call > freq into > account could lead to faster time to peak performance for > methods that > eventually get hot anyway but aren't at parse 
time. Peak perf > will be > the same if the method is too big for parse inlining but > eventually gets > compiled due to reaching hotness. Is that about right? > > sent from my phone > > On May 14, 2015 12:57 PM, "Vitaly Davidovich" > >> wrote: > > Vladimir, > > I'm comparing MaxInlineSize (35) with FreqInlineSize > (325). AFAIU, > MaxInlineSize drives which methods are inlined at parse > time by C2, > whereas FreqInlineSize is the threshold for "late" (or what > do you > guys call inlining after parsing?) inlining. Most of the > inlining > discussions (or worries, rather) seem to focus around the > MaxInlineSize value, and not FreqInlineSize, even if the target > method will get hot. > > Usually, people care about 35 (= MaxInlineSize), > because for > methods up to MaxInlineSize their call frequency is > ignored. So, > fewer chances to end up with non-inlined call. > > > Ok, so for hot methods then MaxInlineSize isn't really a > concern, > and FreqInlineSize would be the threshold to worry about > (for C2 > compiler) then? Why are people worried about inlining in > cold paths > then? > > Thanks Vladimir > > On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov > > >> > wrote: > > Vitaly, > > Can you elaborate your question a bit? What do you compare > parse-time inlining with? Mentioning of ?1 & profile > pollution > in this context confuses me. > > Usually, people care about 35 (= MaxInlineSize), > because for > methods up to MaxInlineSize their call frequency is > ignored. So, > fewer chances to end up with non-inlined call. > > Best regards, > Vladimir Ivanov > > On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > > Any pointers? Sorry to bug you guys, but just want > to make > sure I > understand this point as I see quite a bit of > discussion on > core-libs > and elsewhere where people are worrying about the 35 > bytecode size > threshold for parse inlining. 
> > On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich > > > > >>> wrote: > > Hi guys, > > Could someone please explain the advantage, if > any, of > parse time > inlining in C2? Given that FreqInlineSize is quite > large by default, > most hot methods will get inlined anyway > (well, ones > that can be for > other reasons). What is the advantage of > parse time > inlining? > > Is it quicker time to peak performance if C1 > is reached > first? > > Does it ensure that a method is inlined > whereas it may > not be if > it's already compiled into a medium/large > method otherwise? > > Is parse time inlining not susceptible to profile > pollution? I > suspect it is since the interpreter has already > profiled the inlinee > either way, but wanted to check. > > Anything else I'm not thinking about? > > Thanks > > > > From vitalyd at gmail.com Thu May 14 20:01:29 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 16:01:29 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: <5554F583.7090005@oracle.com> References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Right, thank you. Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say I turn CompileThreshold down to 100 (as an example). On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov wrote: > On 5/14/15 12:02 PM, Vitaly Davidovich wrote: > >> Thanks Vladimir. I recall seeing changes around incremental inlining, >> and may have mistakenly thought it happens at some later point in time. >> Appreciate the clarification. 
>> >> Ok, so based on what you say, I can see a theoretical problem whereby a >> method is being parsed, is larger than MaxInlineSize, but doesn't happen >> to be frequent enough yet at this point, and so it won't be inlined; if >> it turns out to be hot later on, the lack of inlining will not be undone >> (assuming the caller isn't deopted and recompiled later, with updated >> frequency info, for other reasons). >> > > That is correct. > > Note, that the problem is not how hot is callee (invocation times) but how > hot the call site in caller. Usually it does not change during execution. > If it is called in a loop C2 will try to inline because freq should be high. > > There is MinInliningThreshold (250) but since we compile caller when it is > executed 10000 times the call site should be on slow path which is executed > only 2.5% times. So you will not see the performance difference if we > inline it or call callee which is compiled if it is hot. > > Vladimir > > >> Thanks again. >> >> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >> > wrote: >> >> Vitaly, >> >> You have small misconception - almost all C2 inlining (normal java >> methods) is done during parsing in one pass. Recently we changed it >> to inline jsr292 methods after parsing (to execute IGVN and reduce >> graph - otherwise they blow up number of ideal nodes and we bailout >> compilation due to MaxNodeLimit). >> >> As parser goes and see a call site it check can it inline or not >> (see opto/bytecodeinfo.cpp, should_inline() and >> should_not_inline()). There are several conditions which drive >> inlining and following (most important) flags controls them: >> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >> InlineFrequencyRatio. 
>> >> Most tedious flag is InlineSmallCode (size of compiled assembler >> code which is different from other sizes which are bytecode size) >> which control inlining of already compiled method. Usually if a call >> site is hot the callee is compiled before caller. But sometimes you >> can get caller compiled first (if it has hot loop, for example) so >> different condition will be used and as result you can get >> performance variation between runs. >> >> The difference between MaxInlineSize (35) and FreqInlineSize (325) >> is FreqInlineSize takes into account how frequent call site is >> executed relatively to caller invocations: >> >> int call_site_count = method()->scale_count(profile.count()); >> int invoke_count = method()->interpreter_invocation_count(); >> int freq = call_site_count / invoke_count; >> int max_inline_size = MaxInlineSize; >> // bump the max size if the call is frequent >> if ((freq >= InlineFrequencyRatio) || >> (call_site_count >= InlineFrequencyCount) || >> is_unboxing_method(callee_method, C) || >> is_init_with_ea(callee_method, caller_method, C)) { >> max_inline_size = FreqInlineSize; >> >> And there is additional inlining condition for all methods which >> size > MaxTrivialSize: >> >> if (!callee_method->was_executed_more_than(MinInliningThreshold)) { >> set_msg("executed < MinInliningThreshold times"); >> >> Regards, >> Vladimir >> >> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >> >> I should also add that I see how inlining without taking call >> freq into >> account could lead to faster time to peak performance for >> methods that >> eventually get hot anyway but aren't at parse time. Peak perf >> will be >> the same if the method is too big for parse inlining but >> eventually gets >> compiled due to reaching hotness. Is that about right? >> >> sent from my phone >> >> On May 14, 2015 12:57 PM, "Vitaly Davidovich" > >> >> wrote: >> >> Vladimir, >> >> I'm comparing MaxInlineSize (35) with FreqInlineSize >> (325). 
AFAIU, >> MaxInlineSize drives which methods are inlined at parse >> time by C2, >> whereas FreqInlineSize is the threshold for "late" (or what >> do you >> guys call inlining after parsing?) inlining. Most of the >> inlining >> discussions (or worries, rather) seem to focus around the >> MaxInlineSize value, and not FreqInlineSize, even if the >> target >> method will get hot. >> >> Usually, people care about 35 (= MaxInlineSize), >> because for >> methods up to MaxInlineSize their call frequency is >> ignored. So, >> fewer chances to end up with non-inlined call. >> >> >> Ok, so for hot methods then MaxInlineSize isn't really a >> concern, >> and FreqInlineSize would be the threshold to worry about >> (for C2 >> compiler) then? Why are people worried about inlining in >> cold paths >> then? >> >> Thanks Vladimir >> >> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >> > >> > >> >> >> wrote: >> >> Vitaly, >> >> Can you elaborate your question a bit? What do you >> compare >> parse-time inlining with? Mentioning of ?1 & profile >> pollution >> in this context confuses me. >> >> Usually, people care about 35 (= MaxInlineSize), >> because for >> methods up to MaxInlineSize their call frequency is >> ignored. So, >> fewer chances to end up with non-inlined call. >> >> Best regards, >> Vladimir Ivanov >> >> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >> >> Any pointers? Sorry to bug you guys, but just want >> to make >> sure I >> understand this point as I see quite a bit of >> discussion on >> core-libs >> and elsewhere where people are worrying about the 35 >> bytecode size >> threshold for parse inlining. >> >> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >> >> > >> > > >>> wrote: >> >> Hi guys, >> >> Could someone please explain the advantage, if >> any, of >> parse time >> inlining in C2? Given that FreqInlineSize is >> quite >> large by default, >> most hot methods will get inlined anyway >> (well, ones >> that can be for >> other reasons). 
What is the advantage of >> parse time >> inlining? >> >> Is it quicker time to peak performance if C1 >> is reached >> first? >> >> Does it ensure that a method is inlined >> whereas it may >> not be if >> it's already compiled into a medium/large >> method otherwise? >> >> Is parse time inlining not susceptible to >> profile >> pollution? I >> suspect it is since the interpreter has already >> profiled the inlinee >> either way, but wanted to check. >> >> Anything else I'm not thinking about? >> >> Thanks >> >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Thu May 14 21:28:31 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Thu, 14 May 2015 14:28:31 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Yes and no. intx counter_high_value; // Tiered compilation uses a different "high value" than non-tiered compilation. // Determine the right value to use. if (TieredCompilation) { counter_high_value = InvocationCounter::count_limit / 2; } else { counter_high_value = CompileThreshold / 2; } if (!callee_method->was_executed_more_than(MIN2(MinInliningThreshold, counter_high_value))) { set_msg("executed < MinInliningThreshold times"); return true; } So it's not scaling MinInliningThreshold directly, but rather using a min of MinInliningThreshold and counter_high_value (where the latter is calculated from CompileThreshold when not using tiered compilation) to make the actual decision. Because tiered is on by default now, the short answer to your question would probably be a "no". - Kris On Thu, May 14, 2015 at 1:01 PM, Vitaly Davidovich wrote: > Right, thank you. > > Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say I > turn CompileThreshold down to 100 (as an example). 
> > On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov < > vladimir.kozlov at oracle.com> wrote: > >> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: >> >>> Thanks Vladimir. I recall seeing changes around incremental inlining, >>> and may have mistakenly thought it happens at some later point in time. >>> Appreciate the clarification. >>> >>> Ok, so based on what you say, I can see a theoretical problem whereby a >>> method is being parsed, is larger than MaxInlineSize, but doesn't happen >>> to be frequent enough yet at this point, and so it won't be inlined; if >>> it turns out to be hot later on, the lack of inlining will not be undone >>> (assuming the caller isn't deopted and recompiled later, with updated >>> frequency info, for other reasons). >>> >> >> That is correct. >> >> Note, that the problem is not how hot is callee (invocation times) but >> how hot the call site in caller. Usually it does not change during >> execution. If it is called in a loop C2 will try to inline because freq >> should be high. >> >> There is MinInliningThreshold (250) but since we compile caller when it >> is executed 10000 times the call site should be on slow path which is >> executed only 2.5% times. So you will not see the performance difference if >> we inline it or call callee which is compiled if it is hot. >> >> Vladimir >> >> >>> Thanks again. >>> >>> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >>> > wrote: >>> >>> Vitaly, >>> >>> You have small misconception - almost all C2 inlining (normal java >>> methods) is done during parsing in one pass. Recently we changed it >>> to inline jsr292 methods after parsing (to execute IGVN and reduce >>> graph - otherwise they blow up number of ideal nodes and we bailout >>> compilation due to MaxNodeLimit). >>> >>> As parser goes and see a call site it check can it inline or not >>> (see opto/bytecodeinfo.cpp, should_inline() and >>> should_not_inline()). 
There are several conditions which drive >>> inlining and following (most important) flags controls them: >>> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >>> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >>> InlineFrequencyRatio. >>> >>> Most tedious flag is InlineSmallCode (size of compiled assembler >>> code which is different from other sizes which are bytecode size) >>> which control inlining of already compiled method. Usually if a call >>> site is hot the callee is compiled before caller. But sometimes you >>> can get caller compiled first (if it has hot loop, for example) so >>> different condition will be used and as result you can get >>> performance variation between runs. >>> >>> The difference between MaxInlineSize (35) and FreqInlineSize (325) >>> is FreqInlineSize takes into account how frequent call site is >>> executed relatively to caller invocations: >>> >>> int call_site_count = method()->scale_count(profile.count()); >>> int invoke_count = method()->interpreter_invocation_count(); >>> int freq = call_site_count / invoke_count; >>> int max_inline_size = MaxInlineSize; >>> // bump the max size if the call is frequent >>> if ((freq >= InlineFrequencyRatio) || >>> (call_site_count >= InlineFrequencyCount) || >>> is_unboxing_method(callee_method, C) || >>> is_init_with_ea(callee_method, caller_method, C)) { >>> max_inline_size = FreqInlineSize; >>> >>> And there is additional inlining condition for all methods which >>> size > MaxTrivialSize: >>> >>> if (!callee_method->was_executed_more_than(MinInliningThreshold)) >>> { >>> set_msg("executed < MinInliningThreshold times"); >>> >>> Regards, >>> Vladimir >>> >>> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >>> >>> I should also add that I see how inlining without taking call >>> freq into >>> account could lead to faster time to peak performance for >>> methods that >>> eventually get hot anyway but aren't at parse time. 
Peak perf >>> will be >>> the same if the method is too big for parse inlining but >>> eventually gets >>> compiled due to reaching hotness. Is that about right? >>> >>> sent from my phone >>> >>> On May 14, 2015 12:57 PM, "Vitaly Davidovich" >> >>> >> wrote: >>> >>> Vladimir, >>> >>> I'm comparing MaxInlineSize (35) with FreqInlineSize >>> (325). AFAIU, >>> MaxInlineSize drives which methods are inlined at parse >>> time by C2, >>> whereas FreqInlineSize is the threshold for "late" (or what >>> do you >>> guys call inlining after parsing?) inlining. Most of the >>> inlining >>> discussions (or worries, rather) seem to focus around the >>> MaxInlineSize value, and not FreqInlineSize, even if the >>> target >>> method will get hot. >>> >>> Usually, people care about 35 (= MaxInlineSize), >>> because for >>> methods up to MaxInlineSize their call frequency is >>> ignored. So, >>> fewer chances to end up with non-inlined call. >>> >>> >>> Ok, so for hot methods then MaxInlineSize isn't really a >>> concern, >>> and FreqInlineSize would be the threshold to worry about >>> (for C2 >>> compiler) then? Why are people worried about inlining in >>> cold paths >>> then? >>> >>> Thanks Vladimir >>> >>> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >>> >> >>> >> >>> >> >>> wrote: >>> >>> Vitaly, >>> >>> Can you elaborate your question a bit? What do you >>> compare >>> parse-time inlining with? Mentioning of ?1 & profile >>> pollution >>> in this context confuses me. >>> >>> Usually, people care about 35 (= MaxInlineSize), >>> because for >>> methods up to MaxInlineSize their call frequency is >>> ignored. So, >>> fewer chances to end up with non-inlined call. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >>> >>> Any pointers? 
Sorry to bug you guys, but just want >>> to make >>> sure I >>> understand this point as I see quite a bit of >>> discussion on >>> core-libs >>> and elsewhere where people are worrying about the 35 >>> bytecode size >>> threshold for parse inlining. >>> >>> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >>> >>> > >>> >> >> >>> wrote: >>> >>> Hi guys, >>> >>> Could someone please explain the advantage, if >>> any, of >>> parse time >>> inlining in C2? Given that FreqInlineSize is >>> quite >>> large by default, >>> most hot methods will get inlined anyway >>> (well, ones >>> that can be for >>> other reasons). What is the advantage of >>> parse time >>> inlining? >>> >>> Is it quicker time to peak performance if C1 >>> is reached >>> first? >>> >>> Does it ensure that a method is inlined >>> whereas it may >>> not be if >>> it's already compiled into a medium/large >>> method otherwise? >>> >>> Is parse time inlining not susceptible to >>> profile >>> pollution? I >>> suspect it is since the interpreter has already >>> profiled the inlinee >>> either way, but wanted to check. >>> >>> Anything else I'm not thinking about? >>> >>> Thanks >>> >>> >>> >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 14 21:35:21 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 17:35:21 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Thanks Kris. Hmm, this sounds pretty bad for non-tiered compilations with a relatively low CompileThreshold. If I have a (larger than MaxInlineSize) method executed 49% of the time, it'll inline at CompileThreshold=10k but not CompileThreshold=100. Or am I missing something? On Thu, May 14, 2015 at 5:28 PM, Krystal Mok wrote: > Yes and no. 
> > intx counter_high_value; > // Tiered compilation uses a different "high value" than non-tiered > compilation. > // Determine the right value to use. > if (TieredCompilation) { > counter_high_value = InvocationCounter::count_limit / 2; > } else { > counter_high_value = CompileThreshold / 2; > } > if > (!callee_method->was_executed_more_than(MIN2(MinInliningThreshold, > counter_high_value))) { > set_msg("executed < MinInliningThreshold times"); > return true; > } > > So it's not scaling MinInliningThreshold directly, but rather using a min > of MinInliningThreshold and counter_high_value (where the latter is > calculated from CompileThreshold when not using tiered compilation) to make > the actual decision. > > Because tiered is on by default now, the short answer to your question > would probably be a "no". > > - Kris > > On Thu, May 14, 2015 at 1:01 PM, Vitaly Davidovich > wrote: > >> Right, thank you. >> >> Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say I >> turn CompileThreshold down to 100 (as an example). >> >> On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov < >> vladimir.kozlov at oracle.com> wrote: >> >>> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: >>> >>>> Thanks Vladimir. I recall seeing changes around incremental inlining, >>>> and may have mistakenly thought it happens at some later point in time. >>>> Appreciate the clarification. >>>> >>>> Ok, so based on what you say, I can see a theoretical problem whereby a >>>> method is being parsed, is larger than MaxInlineSize, but doesn't happen >>>> to be frequent enough yet at this point, and so it won't be inlined; if >>>> it turns out to be hot later on, the lack of inlining will not be undone >>>> (assuming the caller isn't deopted and recompiled later, with updated >>>> frequency info, for other reasons). >>>> >>> >>> That is correct. >>> >>> Note, that the problem is not how hot is callee (invocation times) but >>> how hot the call site in caller. 
Usually it does not change during >>> execution. If it is called in a loop C2 will try to inline because freq >>> should be high. >>> >>> There is MinInliningThreshold (250) but since we compile caller when it >>> is executed 10000 times the call site should be on slow path which is >>> executed only 2.5% times. So you will not see the performance difference if >>> we inline it or call callee which is compiled if it is hot. >>> >>> Vladimir >>> >>> >>>> Thanks again. >>>> >>>> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >>>> > wrote: >>>> >>>> Vitaly, >>>> >>>> You have small misconception - almost all C2 inlining (normal java >>>> methods) is done during parsing in one pass. Recently we changed it >>>> to inline jsr292 methods after parsing (to execute IGVN and reduce >>>> graph - otherwise they blow up number of ideal nodes and we bailout >>>> compilation due to MaxNodeLimit). >>>> >>>> As parser goes and see a call site it check can it inline or not >>>> (see opto/bytecodeinfo.cpp, should_inline() and >>>> should_not_inline()). There are several conditions which drive >>>> inlining and following (most important) flags controls them: >>>> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >>>> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >>>> InlineFrequencyRatio. >>>> >>>> Most tedious flag is InlineSmallCode (size of compiled assembler >>>> code which is different from other sizes which are bytecode size) >>>> which control inlining of already compiled method. Usually if a call >>>> site is hot the callee is compiled before caller. But sometimes you >>>> can get caller compiled first (if it has hot loop, for example) so >>>> different condition will be used and as result you can get >>>> performance variation between runs. 
>>>> >>>> The difference between MaxInlineSize (35) and FreqInlineSize (325) >>>> is FreqInlineSize takes into account how frequent call site is >>>> executed relatively to caller invocations: >>>> >>>> int call_site_count = method()->scale_count(profile.count()); >>>> int invoke_count = method()->interpreter_invocation_count(); >>>> int freq = call_site_count / invoke_count; >>>> int max_inline_size = MaxInlineSize; >>>> // bump the max size if the call is frequent >>>> if ((freq >= InlineFrequencyRatio) || >>>> (call_site_count >= InlineFrequencyCount) || >>>> is_unboxing_method(callee_method, C) || >>>> is_init_with_ea(callee_method, caller_method, C)) { >>>> max_inline_size = FreqInlineSize; >>>> >>>> And there is additional inlining condition for all methods which >>>> size > MaxTrivialSize: >>>> >>>> if >>>> (!callee_method->was_executed_more_than(MinInliningThreshold)) { >>>> set_msg("executed < MinInliningThreshold times"); >>>> >>>> Regards, >>>> Vladimir >>>> >>>> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >>>> >>>> I should also add that I see how inlining without taking call >>>> freq into >>>> account could lead to faster time to peak performance for >>>> methods that >>>> eventually get hot anyway but aren't at parse time. Peak perf >>>> will be >>>> the same if the method is too big for parse inlining but >>>> eventually gets >>>> compiled due to reaching hotness. Is that about right? >>>> >>>> sent from my phone >>>> >>>> On May 14, 2015 12:57 PM, "Vitaly Davidovich" < >>>> vitalyd at gmail.com >>>> >>>> >> wrote: >>>> >>>> Vladimir, >>>> >>>> I'm comparing MaxInlineSize (35) with FreqInlineSize >>>> (325). AFAIU, >>>> MaxInlineSize drives which methods are inlined at parse >>>> time by C2, >>>> whereas FreqInlineSize is the threshold for "late" (or what >>>> do you >>>> guys call inlining after parsing?) inlining. 
Most of the >>>> inlining >>>> discussions (or worries, rather) seem to focus around the >>>> MaxInlineSize value, and not FreqInlineSize, even if the >>>> target >>>> method will get hot. >>>> >>>> Usually, people care about 35 (= MaxInlineSize), >>>> because for >>>> methods up to MaxInlineSize their call frequency is >>>> ignored. So, >>>> fewer chances to end up with non-inlined call. >>>> >>>> >>>> Ok, so for hot methods then MaxInlineSize isn't really a >>>> concern, >>>> and FreqInlineSize would be the threshold to worry about >>>> (for C2 >>>> compiler) then? Why are people worried about inlining in >>>> cold paths >>>> then? >>>> >>>> Thanks Vladimir >>>> >>>> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >>>> >>> >>>> >>> >>>> >> >>>> wrote: >>>> >>>> Vitaly, >>>> >>>> Can you elaborate your question a bit? What do you >>>> compare >>>> parse-time inlining with? Mentioning of ?1 & profile >>>> pollution >>>> in this context confuses me. >>>> >>>> Usually, people care about 35 (= MaxInlineSize), >>>> because for >>>> methods up to MaxInlineSize their call frequency is >>>> ignored. So, >>>> fewer chances to end up with non-inlined call. >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >>>> >>>> Any pointers? Sorry to bug you guys, but just want >>>> to make >>>> sure I >>>> understand this point as I see quite a bit of >>>> discussion on >>>> core-libs >>>> and elsewhere where people are worrying about the >>>> 35 >>>> bytecode size >>>> threshold for parse inlining. >>>> >>>> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >>>> >>>> > >>>> >>> >>> >>> wrote: >>>> >>>> Hi guys, >>>> >>>> Could someone please explain the advantage, if >>>> any, of >>>> parse time >>>> inlining in C2? Given that FreqInlineSize is >>>> quite >>>> large by default, >>>> most hot methods will get inlined anyway >>>> (well, ones >>>> that can be for >>>> other reasons). 
What is the advantage of >>>> parse time >>>> inlining? >>>> >>>> Is it quicker time to peak performance if C1 >>>> is reached >>>> first? >>>> >>>> Does it ensure that a method is inlined >>>> whereas it may >>>> not be if >>>> it's already compiled into a medium/large >>>> method otherwise? >>>> >>>> Is parse time inlining not susceptible to >>>> profile >>>> pollution? I >>>> suspect it is since the interpreter has >>>> already >>>> profiled the inlinee >>>> either way, but wanted to check. >>>> >>>> Anything else I'm not thinking about? >>>> >>>> Thanks >>>> >>>> >>>> >>>> >>>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Thu May 14 21:50:30 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 14 May 2015 14:50:30 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: <555518A6.4070505@oracle.com> On 5/14/15 1:01 PM, Vitaly Davidovich wrote: > Right, thank you. > > Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say > I turn CompileThreshold down to 100 (as an example). No, it does not scale. Note, with tiered compilation C1 compilation triggered after 100 invocations. So you will get compiled code (not optimal as C2) very early. Vladimir > > On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov > > wrote: > > On 5/14/15 12:02 PM, Vitaly Davidovich wrote: > > Thanks Vladimir. I recall seeing changes around incremental > inlining, > and may have mistakenly thought it happens at some later point > in time. > Appreciate the clarification. 
> > Ok, so based on what you say, I can see a theoretical problem > whereby a > method is being parsed, is larger than MaxInlineSize, but > doesn't happen > to be frequent enough yet at this point, and so it won't be > inlined; if > it turns out to be hot later on, the lack of inlining will not > be undone > (assuming the caller isn't deopted and recompiled later, with > updated > frequency info, for other reasons). > > > That is correct. > > Note, that the problem is not how hot is callee (invocation times) > but how hot the call site in caller. Usually it does not change > during execution. If it is called in a loop C2 will try to inline > because freq should be high. > > There is MinInliningThreshold (250) but since we compile caller when > it is executed 10000 times the call site should be on slow path > which is executed only 2.5% times. So you will not see the > performance difference if we inline it or call callee which is > compiled if it is hot. > > Vladimir > > > Thanks again. > > On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov > > >> wrote: > > Vitaly, > > You have small misconception - almost all C2 inlining > (normal java > methods) is done during parsing in one pass. Recently we > changed it > to inline jsr292 methods after parsing (to execute IGVN and > reduce > graph - otherwise they blow up number of ideal nodes and we > bailout > compilation due to MaxNodeLimit). > > As parser goes and see a call site it check can it inline > or not > (see opto/bytecodeinfo.cpp, should_inline() and > should_not_inline()). There are several conditions which drive > inlining and following (most important) flags controls them: > MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, > MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, > InlineFrequencyRatio. > > Most tedious flag is InlineSmallCode (size of compiled > assembler > code which is different from other sizes which are bytecode > size) > which control inlining of already compiled method. 
Usually > if a call > site is hot the callee is compiled before caller. But > sometimes you > can get caller compiled first (if it has hot loop, for > example) so > different condition will be used and as result you can get > performance variation between runs. > > The difference between MaxInlineSize (35) and > FreqInlineSize (325) > is FreqInlineSize takes into account how frequent call site is > executed relatively to caller invocations: > > int call_site_count = > method()->scale_count(profile.count()); > int invoke_count = > method()->interpreter_invocation_count(); > int freq = call_site_count / invoke_count; > int max_inline_size = MaxInlineSize; > // bump the max size if the call is frequent > if ((freq >= InlineFrequencyRatio) || > (call_site_count >= InlineFrequencyCount) || > is_unboxing_method(callee_method, C) || > is_init_with_ea(callee_method, caller_method, C)) { > max_inline_size = FreqInlineSize; > > And there is additional inlining condition for all methods > which > size > MaxTrivialSize: > > if > (!callee_method->was_executed_more_than(MinInliningThreshold)) { > set_msg("executed < MinInliningThreshold times"); > > Regards, > Vladimir > > On 5/14/15 10:03 AM, Vitaly Davidovich wrote: > > I should also add that I see how inlining without > taking call > freq into > account could lead to faster time to peak performance for > methods that > eventually get hot anyway but aren't at parse time. > Peak perf > will be > the same if the method is too big for parse inlining but > eventually gets > compiled due to reaching hotness. Is that about right? > > sent from my phone > > On May 14, 2015 12:57 PM, "Vitaly Davidovich" > > > > > >>> wrote: > > Vladimir, > > I'm comparing MaxInlineSize (35) with FreqInlineSize > (325). AFAIU, > MaxInlineSize drives which methods are inlined at > parse > time by C2, > whereas FreqInlineSize is the threshold for "late" > (or what > do you > guys call inlining after parsing?) inlining. 
Most > of the > inlining > discussions (or worries, rather) seem to focus > around the > MaxInlineSize value, and not FreqInlineSize, even > if the target > method will get hot. > > Usually, people care about 35 (= MaxInlineSize), > because for > methods up to MaxInlineSize their call > frequency is > ignored. So, > fewer chances to end up with non-inlined call. > > > Ok, so for hot methods then MaxInlineSize isn't > really a > concern, > and FreqInlineSize would be the threshold to worry > about > (for C2 > compiler) then? Why are people worried about > inlining in > cold paths > then? > > Thanks Vladimir > > On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov > > > > > > >>> > wrote: > > Vitaly, > > Can you elaborate your question a bit? What do > you compare > parse-time inlining with? Mentioning of ?1 & > profile > pollution > in this context confuses me. > > Usually, people care about 35 (= MaxInlineSize), > because for > methods up to MaxInlineSize their call > frequency is > ignored. So, > fewer chances to end up with non-inlined call. > > Best regards, > Vladimir Ivanov > > On 5/14/15 7:09 PM, Vitaly Davidovich wrote: > > Any pointers? Sorry to bug you guys, but > just want > to make > sure I > understand this point as I see quite a bit of > discussion on > core-libs > and elsewhere where people are worrying > about the 35 > bytecode size > threshold for parse inlining. > > On Wed, May 13, 2015 at 3:36 PM, Vitaly > Davidovich > > > > >> > > > > > >>>> wrote: > > Hi guys, > > Could someone please explain the > advantage, if > any, of > parse time > inlining in C2? Given that > FreqInlineSize is quite > large by default, > most hot methods will get inlined anyway > (well, ones > that can be for > other reasons). What is the advantage of > parse time > inlining? > > Is it quicker time to peak > performance if C1 > is reached > first? 
> > Does it ensure that a method is inlined > whereas it may > not be if > it's already compiled into a medium/large > method otherwise? > > Is parse time inlining not > susceptible to profile > pollution? I > suspect it is since the interpreter > has already > profiled the inlinee > either way, but wanted to check. > > Anything else I'm not thinking about? > > Thanks > > > > > From rednaxelafx at gmail.com Thu May 14 22:00:20 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Thu, 14 May 2015 15:00:20 -0700 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Hi Vitaly, The code I posted comes from should_not_inline(). It's a negative filter, so return true means don't inline. You can take a look at opto/bytecodeInfo.cpp. It's a lot to explain in words, but very obvious from the code. I'm not sure what your 49% means here. In the positive filter, If the frequency of a call site is more than InlineFrequencyRatio (=20, think of a call site in a loop run at least 20 times per invocation of this method), or the profile recorded the call site is called at least InlineFrequencyCount (=100 on x86), or some other heuristics, the max_inline_size for this call site is bumped from MaxInlineSize (=35) to FreqInlineSize (=325 on x86). In the negative filter, the callee has to be executed at least MIN2(MinInliningThreshold, counter_high_value) times in order to be considered candidate for inlining. This is the invocation counter on the callee side, not on the call site side. In tiered mode that would be MIN2(250, 134217728) = 250. It doesn't matter what CompileThreshold is set in this case. In non-tiered mode, it'd be MIN2(250, CompileThreshold/2), for CompileThreshold=100 that's 50. - Kris On Thu, May 14, 2015 at 2:35 PM, Vitaly Davidovich wrote: > Thanks Kris. 
Hmm, this sounds pretty bad for non-tiered compilations with > a relatively low CompileThreshold. If I have a (larger than MaxInlineSize) > method executed 49% of the time, it'll inline at CompileThreshold=10k but > not CompileThreshold=100. Or am I missing something? > > On Thu, May 14, 2015 at 5:28 PM, Krystal Mok > wrote: > >> Yes and no. >> >> intx counter_high_value; >> // Tiered compilation uses a different "high value" than non-tiered >> compilation. >> // Determine the right value to use. >> if (TieredCompilation) { >> counter_high_value = InvocationCounter::count_limit / 2; >> } else { >> counter_high_value = CompileThreshold / 2; >> } >> if >> (!callee_method->was_executed_more_than(MIN2(MinInliningThreshold, >> counter_high_value))) { >> set_msg("executed < MinInliningThreshold times"); >> return true; >> } >> >> So it's not scaling MinInliningThreshold directly, but rather using a min >> of MinInliningThreshold and counter_high_value (where the latter is >> calculated from CompileThreshold when not using tiered compilation) to make >> the actual decision. >> >> Because tiered is on by default now, the short answer to your question >> would probably be a "no". >> >> - Kris >> >> On Thu, May 14, 2015 at 1:01 PM, Vitaly Davidovich >> wrote: >> >>> Right, thank you. >>> >>> Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say >>> I turn CompileThreshold down to 100 (as an example). >>> >>> On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov < >>> vladimir.kozlov at oracle.com> wrote: >>> >>>> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: >>>> >>>>> Thanks Vladimir. I recall seeing changes around incremental inlining, >>>>> and may have mistakenly thought it happens at some later point in time. >>>>> Appreciate the clarification. 
>>>>> >>>>> Ok, so based on what you say, I can see a theoretical problem whereby a >>>>> method is being parsed, is larger than MaxInlineSize, but doesn't >>>>> happen >>>>> to be frequent enough yet at this point, and so it won't be inlined; if >>>>> it turns out to be hot later on, the lack of inlining will not be >>>>> undone >>>>> (assuming the caller isn't deopted and recompiled later, with updated >>>>> frequency info, for other reasons). >>>>> >>>> >>>> That is correct. >>>> >>>> Note, that the problem is not how hot is callee (invocation times) but >>>> how hot the call site in caller. Usually it does not change during >>>> execution. If it is called in a loop C2 will try to inline because freq >>>> should be high. >>>> >>>> There is MinInliningThreshold (250) but since we compile caller when it >>>> is executed 10000 times the call site should be on slow path which is >>>> executed only 2.5% times. So you will not see the performance difference if >>>> we inline it or call callee which is compiled if it is hot. >>>> >>>> Vladimir >>>> >>>> >>>>> Thanks again. >>>>> >>>>> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >>>>> > >>>>> wrote: >>>>> >>>>> Vitaly, >>>>> >>>>> You have small misconception - almost all C2 inlining (normal java >>>>> methods) is done during parsing in one pass. Recently we changed it >>>>> to inline jsr292 methods after parsing (to execute IGVN and reduce >>>>> graph - otherwise they blow up number of ideal nodes and we bailout >>>>> compilation due to MaxNodeLimit). >>>>> >>>>> As parser goes and see a call site it check can it inline or not >>>>> (see opto/bytecodeinfo.cpp, should_inline() and >>>>> should_not_inline()). There are several conditions which drive >>>>> inlining and following (most important) flags controls them: >>>>> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >>>>> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >>>>> InlineFrequencyRatio. 
>>>>> >>>>> Most tedious flag is InlineSmallCode (size of compiled assembler >>>>> code which is different from other sizes which are bytecode size) >>>>> which control inlining of already compiled method. Usually if a >>>>> call >>>>> site is hot the callee is compiled before caller. But sometimes you >>>>> can get caller compiled first (if it has hot loop, for example) so >>>>> different condition will be used and as result you can get >>>>> performance variation between runs. >>>>> >>>>> The difference between MaxInlineSize (35) and FreqInlineSize (325) >>>>> is FreqInlineSize takes into account how frequent call site is >>>>> executed relatively to caller invocations: >>>>> >>>>> int call_site_count = method()->scale_count(profile.count()); >>>>> int invoke_count = method()->interpreter_invocation_count(); >>>>> int freq = call_site_count / invoke_count; >>>>> int max_inline_size = MaxInlineSize; >>>>> // bump the max size if the call is frequent >>>>> if ((freq >= InlineFrequencyRatio) || >>>>> (call_site_count >= InlineFrequencyCount) || >>>>> is_unboxing_method(callee_method, C) || >>>>> is_init_with_ea(callee_method, caller_method, C)) { >>>>> max_inline_size = FreqInlineSize; >>>>> >>>>> And there is additional inlining condition for all methods which >>>>> size > MaxTrivialSize: >>>>> >>>>> if >>>>> (!callee_method->was_executed_more_than(MinInliningThreshold)) { >>>>> set_msg("executed < MinInliningThreshold times"); >>>>> >>>>> Regards, >>>>> Vladimir >>>>> >>>>> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >>>>> >>>>> I should also add that I see how inlining without taking call >>>>> freq into >>>>> account could lead to faster time to peak performance for >>>>> methods that >>>>> eventually get hot anyway but aren't at parse time. Peak perf >>>>> will be >>>>> the same if the method is too big for parse inlining but >>>>> eventually gets >>>>> compiled due to reaching hotness. Is that about right? 
>>>>> >>>>> sent from my phone >>>>> >>>>> On May 14, 2015 12:57 PM, "Vitaly Davidovich" < >>>>> vitalyd at gmail.com >>>>> >>>>> >> wrote: >>>>> >>>>> Vladimir, >>>>> >>>>> I'm comparing MaxInlineSize (35) with FreqInlineSize >>>>> (325). AFAIU, >>>>> MaxInlineSize drives which methods are inlined at parse >>>>> time by C2, >>>>> whereas FreqInlineSize is the threshold for "late" (or >>>>> what >>>>> do you >>>>> guys call inlining after parsing?) inlining. Most of the >>>>> inlining >>>>> discussions (or worries, rather) seem to focus around the >>>>> MaxInlineSize value, and not FreqInlineSize, even if the >>>>> target >>>>> method will get hot. >>>>> >>>>> Usually, people care about 35 (= MaxInlineSize), >>>>> because for >>>>> methods up to MaxInlineSize their call frequency is >>>>> ignored. So, >>>>> fewer chances to end up with non-inlined call. >>>>> >>>>> >>>>> Ok, so for hot methods then MaxInlineSize isn't really a >>>>> concern, >>>>> and FreqInlineSize would be the threshold to worry about >>>>> (for C2 >>>>> compiler) then? Why are people worried about inlining in >>>>> cold paths >>>>> then? >>>>> >>>>> Thanks Vladimir >>>>> >>>>> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >>>>> >>>> >>>>> >>>> >>>>> >> >>>>> wrote: >>>>> >>>>> Vitaly, >>>>> >>>>> Can you elaborate your question a bit? What do you >>>>> compare >>>>> parse-time inlining with? Mentioning of ?1 & profile >>>>> pollution >>>>> in this context confuses me. >>>>> >>>>> Usually, people care about 35 (= MaxInlineSize), >>>>> because for >>>>> methods up to MaxInlineSize their call frequency is >>>>> ignored. So, >>>>> fewer chances to end up with non-inlined call. >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >>>>> >>>>> Any pointers? 
Sorry to bug you guys, but just want >>>>> to make >>>>> sure I >>>>> understand this point as I see quite a bit of >>>>> discussion on >>>>> core-libs >>>>> and elsewhere where people are worrying about the >>>>> 35 >>>>> bytecode size >>>>> threshold for parse inlining. >>>>> >>>>> On Wed, May 13, 2015 at 3:36 PM, Vitaly Davidovich >>>>> >>>>> > >>>>> >>>> >>>> >>> wrote: >>>>> >>>>> Hi guys, >>>>> >>>>> Could someone please explain the advantage, >>>>> if >>>>> any, of >>>>> parse time >>>>> inlining in C2? Given that FreqInlineSize is >>>>> quite >>>>> large by default, >>>>> most hot methods will get inlined anyway >>>>> (well, ones >>>>> that can be for >>>>> other reasons). What is the advantage of >>>>> parse time >>>>> inlining? >>>>> >>>>> Is it quicker time to peak performance if C1 >>>>> is reached >>>>> first? >>>>> >>>>> Does it ensure that a method is inlined >>>>> whereas it may >>>>> not be if >>>>> it's already compiled into a medium/large >>>>> method otherwise? >>>>> >>>>> Is parse time inlining not susceptible to >>>>> profile >>>>> pollution? I >>>>> suspect it is since the interpreter has >>>>> already >>>>> profiled the inlinee >>>>> either way, but wanted to check. >>>>> >>>>> Anything else I'm not thinking about? >>>>> >>>>> Thanks >>>>> >>>>> >>>>> >>>>> >>>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 14 22:16:08 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 14 May 2015 18:16:08 -0400 Subject: C2: Advantage of parse time inlining In-Reply-To: References: <5554CF1D.3010601@oracle.com> <5554E9CB.3020903@oracle.com> <5554F583.7090005@oracle.com> Message-ID: Yes, I'll look at that code. 
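For reference, the filter under discussion can be restated as a standalone sketch. The scaffolding here is hypothetical (in particular the tiered `count_limit` value is a stand-in constant, not the real `InvocationCounter::count_limit`); the actual logic lives in opto/bytecodeInfo.cpp:

```cpp
#include <algorithm>

// Standalone sketch of the negative inline filter quoted in this
// thread: the callee must have executed at least
// MIN2(MinInliningThreshold, counter_high_value) times, where the
// "high value" is CompileThreshold/2 in non-tiered mode.
bool executed_too_rarely(long callee_invocations, bool tiered,
                         long compile_threshold) {
    const long MinInliningThreshold = 250;
    const long count_limit = 1L << 30;   // stand-in, not the real constant
    long counter_high_value =
        tiered ? count_limit / 2 : compile_threshold / 2;
    return callee_invocations <
           std::min(MinInliningThreshold, counter_high_value);
}
```

With non-tiered CompileThreshold=10000 the cutoff is min(250, 5000) = 250, so a call site taking 49% of 10000 invocations (4900 executions) passes easily; with CompileThreshold=100 the cutoff drops to min(250, 50) = 50, and the same 49% profile (49 executions) is rejected — exactly the scenario raised above.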
My 49% example (probably poorly explained, I'll try to correct this below) was based on what Vladimir said: There is MinInliningThreshold (250) but since we compile caller when it is > executed 10000 times the call site should be on slow path which is executed > only 2.5% times. So you will not see the performance difference if we > inline it or call callee which is compiled if it is hot. Assume we're talking about a callee (method) whose size is > MaxInlineSize (e.g. 64 bytes just to pull some number out). Using default compile threshold of 10k, the callee will not be inlined if it's executed < 250 times. That's fine since as Vladimir says it's a cold path anyway (< 2.5% of invocations of caller don't touch that callsite), and I agree given 10k compile threshold. If we change CompileThreshold to 100, then the code snippet you showed earlier (for the negative filter) will yield the MIN2(250,50) negative filter. So the "49%" case is the caller is invoked 100 times, triggers compilation, and compiler looks at the call site that's executed 49 times out of those 100 (that's the 49% I'm using). If we use 10k compile threshold, then it's executed 4900 times, clearly above 250; I'm using an example where the code profile is the same, we're only changing compile threshold. So the negative filter rejects this method for inlining now, whereas it would've accepted it had I let the code run to 10k invocations. I guess my overall question is the following: does changing CompileThreshold to a lower value, but keeping execution profile the same, alter inlining decisions. Intuitively I would think the answer should be "no, inlining is the same" as everything would be scaled down appropriately, but it sounds like there are "magic" absolute numbers involved. On Thu, May 14, 2015 at 6:00 PM, Krystal Mok wrote: > Hi Vitaly, > > The code I posted comes from should_not_inline(). It's a negative filter, > so return true means don't inline. > You can take a look at opto/bytecodeInfo.cpp. 
It's a lot to explain in > words, but very obvious from the code. > > I'm not sure what your 49% means here. > > In the positive filter, If the frequency of a call site is more > than InlineFrequencyRatio (=20, think of a call site in a loop run at least > 20 times per invocation of this method), or the profile recorded the call > site is called at least InlineFrequencyCount (=100 on x86), or some other > heuristics, the max_inline_size for this call site is bumped from > MaxInlineSize (=35) to FreqInlineSize (=325 on x86). > > In the negative filter, the callee has to be executed at > least MIN2(MinInliningThreshold, counter_high_value) times in order to be > considered candidate for inlining. This is the invocation counter on the > callee side, not on the call site side. > In tiered mode that would be MIN2(250, 134217728) = 250. It doesn't matter > what CompileThreshold is set in this case. > In non-tiered mode, it'd be MIN2(250, CompileThreshold/2), for > CompileThreshold=100 that's 50. > > - Kris > > On Thu, May 14, 2015 at 2:35 PM, Vitaly Davidovich > wrote: > >> Thanks Kris. Hmm, this sounds pretty bad for non-tiered compilations >> with a relatively low CompileThreshold. If I have a (larger than >> MaxInlineSize) method executed 49% of the time, it'll inline at >> CompileThreshold=10k but not CompileThreshold=100. Or am I missing >> something? >> >> On Thu, May 14, 2015 at 5:28 PM, Krystal Mok >> wrote: >> >>> Yes and no. >>> >>> intx counter_high_value; >>> // Tiered compilation uses a different "high value" than >>> non-tiered compilation. >>> // Determine the right value to use. 
>>> if (TieredCompilation) { >>> counter_high_value = InvocationCounter::count_limit / 2; >>> } else { >>> counter_high_value = CompileThreshold / 2; >>> } >>> if >>> (!callee_method->was_executed_more_than(MIN2(MinInliningThreshold, >>> counter_high_value))) { >>> set_msg("executed < MinInliningThreshold times"); >>> return true; >>> } >>> >>> So it's not scaling MinInliningThreshold directly, but rather using a >>> min of MinInliningThreshold and counter_high_value (where the latter is >>> calculated from CompileThreshold when not using tiered compilation) to make >>> the actual decision. >>> >>> Because tiered is on by default now, the short answer to your question >>> would probably be a "no". >>> >>> - Kris >>> >>> On Thu, May 14, 2015 at 1:01 PM, Vitaly Davidovich >>> wrote: >>> >>>> Right, thank you. >>>> >>>> Is MinInliningThreshold scaled with XX:CompileThreshold=XXX values? Say >>>> I turn CompileThreshold down to 100 (as an example). >>>> >>>> On Thu, May 14, 2015 at 3:20 PM, Vladimir Kozlov < >>>> vladimir.kozlov at oracle.com> wrote: >>>> >>>>> On 5/14/15 12:02 PM, Vitaly Davidovich wrote: >>>>> >>>>>> Thanks Vladimir. I recall seeing changes around incremental inlining, >>>>>> and may have mistakenly thought it happens at some later point in >>>>>> time. >>>>>> Appreciate the clarification. >>>>>> >>>>>> Ok, so based on what you say, I can see a theoretical problem whereby >>>>>> a >>>>>> method is being parsed, is larger than MaxInlineSize, but doesn't >>>>>> happen >>>>>> to be frequent enough yet at this point, and so it won't be inlined; >>>>>> if >>>>>> it turns out to be hot later on, the lack of inlining will not be >>>>>> undone >>>>>> (assuming the caller isn't deopted and recompiled later, with updated >>>>>> frequency info, for other reasons). >>>>>> >>>>> >>>>> That is correct. >>>>> >>>>> Note, that the problem is not how hot is callee (invocation times) but >>>>> how hot the call site in caller. 
Usually it does not change during >>>>> execution. If it is called in a loop C2 will try to inline because freq >>>>> should be high. >>>>> >>>>> There is MinInliningThreshold (250) but since we compile caller when >>>>> it is executed 10000 times the call site should be on slow path which is >>>>> executed only 2.5% times. So you will not see the performance difference if >>>>> we inline it or call callee which is compiled if it is hot. >>>>> >>>>> Vladimir >>>>> >>>>> >>>>>> Thanks again. >>>>>> >>>>>> On Thu, May 14, 2015 at 2:30 PM, Vladimir Kozlov >>>>>> > >>>>>> wrote: >>>>>> >>>>>> Vitaly, >>>>>> >>>>>> You have small misconception - almost all C2 inlining (normal java >>>>>> methods) is done during parsing in one pass. Recently we changed >>>>>> it >>>>>> to inline jsr292 methods after parsing (to execute IGVN and reduce >>>>>> graph - otherwise they blow up number of ideal nodes and we >>>>>> bailout >>>>>> compilation due to MaxNodeLimit). >>>>>> >>>>>> As parser goes and see a call site it check can it inline or not >>>>>> (see opto/bytecodeinfo.cpp, should_inline() and >>>>>> should_not_inline()). There are several conditions which drive >>>>>> inlining and following (most important) flags controls them: >>>>>> MaxTrivialSize, MaxInlineSize, FreqInlineSize, InlineSmallCode, >>>>>> MinInliningThreshold, MaxInlineLevel, InlineFrequencyCount, >>>>>> InlineFrequencyRatio. >>>>>> >>>>>> Most tedious flag is InlineSmallCode (size of compiled assembler >>>>>> code which is different from other sizes which are bytecode size) >>>>>> which control inlining of already compiled method. Usually if a >>>>>> call >>>>>> site is hot the callee is compiled before caller. But sometimes >>>>>> you >>>>>> can get caller compiled first (if it has hot loop, for example) so >>>>>> different condition will be used and as result you can get >>>>>> performance variation between runs. 
>>>>>> >>>>>> The difference between MaxInlineSize (35) and FreqInlineSize (325) >>>>>> is FreqInlineSize takes into account how frequent call site is >>>>>> executed relatively to caller invocations: >>>>>> >>>>>> int call_site_count = method()->scale_count(profile.count()); >>>>>> int invoke_count = >>>>>> method()->interpreter_invocation_count(); >>>>>> int freq = call_site_count / invoke_count; >>>>>> int max_inline_size = MaxInlineSize; >>>>>> // bump the max size if the call is frequent >>>>>> if ((freq >= InlineFrequencyRatio) || >>>>>> (call_site_count >= InlineFrequencyCount) || >>>>>> is_unboxing_method(callee_method, C) || >>>>>> is_init_with_ea(callee_method, caller_method, C)) { >>>>>> max_inline_size = FreqInlineSize; >>>>>> >>>>>> And there is additional inlining condition for all methods which >>>>>> size > MaxTrivialSize: >>>>>> >>>>>> if >>>>>> (!callee_method->was_executed_more_than(MinInliningThreshold)) { >>>>>> set_msg("executed < MinInliningThreshold times"); >>>>>> >>>>>> Regards, >>>>>> Vladimir >>>>>> >>>>>> On 5/14/15 10:03 AM, Vitaly Davidovich wrote: >>>>>> >>>>>> I should also add that I see how inlining without taking call >>>>>> freq into >>>>>> account could lead to faster time to peak performance for >>>>>> methods that >>>>>> eventually get hot anyway but aren't at parse time. Peak perf >>>>>> will be >>>>>> the same if the method is too big for parse inlining but >>>>>> eventually gets >>>>>> compiled due to reaching hotness. Is that about right? >>>>>> >>>>>> sent from my phone >>>>>> >>>>>> On May 14, 2015 12:57 PM, "Vitaly Davidovich" < >>>>>> vitalyd at gmail.com >>>>>> >>>>>> >> wrote: >>>>>> >>>>>> Vladimir, >>>>>> >>>>>> I'm comparing MaxInlineSize (35) with FreqInlineSize >>>>>> (325). AFAIU, >>>>>> MaxInlineSize drives which methods are inlined at parse >>>>>> time by C2, >>>>>> whereas FreqInlineSize is the threshold for "late" (or >>>>>> what >>>>>> do you >>>>>> guys call inlining after parsing?) inlining. 
Most of the >>>>>> inlining >>>>>> discussions (or worries, rather) seem to focus around the >>>>>> MaxInlineSize value, and not FreqInlineSize, even if the >>>>>> target >>>>>> method will get hot. >>>>>> >>>>>> Usually, people care about 35 (= MaxInlineSize), >>>>>> because for >>>>>> methods up to MaxInlineSize their call frequency is >>>>>> ignored. So, >>>>>> fewer chances to end up with non-inlined call. >>>>>> >>>>>> >>>>>> Ok, so for hot methods then MaxInlineSize isn't really a >>>>>> concern, >>>>>> and FreqInlineSize would be the threshold to worry about >>>>>> (for C2 >>>>>> compiler) then? Why are people worried about inlining in >>>>>> cold paths >>>>>> then? >>>>>> >>>>>> Thanks Vladimir >>>>>> >>>>>> On Thu, May 14, 2015 at 12:36 PM, Vladimir Ivanov >>>>>> >>>>> >>>>>> >>>>> >>>>>> >> >>>>>> wrote: >>>>>> >>>>>> Vitaly, >>>>>> >>>>>> Can you elaborate your question a bit? What do you >>>>>> compare >>>>>> parse-time inlining with? Mentioning of ?1 & profile >>>>>> pollution >>>>>> in this context confuses me. >>>>>> >>>>>> Usually, people care about 35 (= MaxInlineSize), >>>>>> because for >>>>>> methods up to MaxInlineSize their call frequency is >>>>>> ignored. So, >>>>>> fewer chances to end up with non-inlined call. >>>>>> >>>>>> Best regards, >>>>>> Vladimir Ivanov >>>>>> >>>>>> On 5/14/15 7:09 PM, Vitaly Davidovich wrote: >>>>>> >>>>>> Any pointers? Sorry to bug you guys, but just >>>>>> want >>>>>> to make >>>>>> sure I >>>>>> understand this point as I see quite a bit of >>>>>> discussion on >>>>>> core-libs >>>>>> and elsewhere where people are worrying about >>>>>> the 35 >>>>>> bytecode size >>>>>> threshold for parse inlining. >>>>>> >>>>>> On Wed, May 13, 2015 at 3:36 PM, Vitaly >>>>>> Davidovich >>>>>> >>>>>> > >>>>>> >>>>> >>>>> >>> wrote: >>>>>> >>>>>> Hi guys, >>>>>> >>>>>> Could someone please explain the advantage, >>>>>> if >>>>>> any, of >>>>>> parse time >>>>>> inlining in C2? 
Given that FreqInlineSize >>>>>> is quite >>>>>> large by default, >>>>>> most hot methods will get inlined anyway >>>>>> (well, ones >>>>>> that can be for >>>>>> other reasons). What is the advantage of >>>>>> parse time >>>>>> inlining? >>>>>> >>>>>> Is it quicker time to peak performance if C1 >>>>>> is reached >>>>>> first? >>>>>> >>>>>> Does it ensure that a method is inlined >>>>>> whereas it may >>>>>> not be if >>>>>> it's already compiled into a medium/large >>>>>> method otherwise? >>>>>> >>>>>> Is parse time inlining not susceptible to >>>>>> profile >>>>>> pollution? I >>>>>> suspect it is since the interpreter has >>>>>> already >>>>>> profiled the inlinee >>>>>> either way, but wanted to check. >>>>>> >>>>>> Anything else I'm not thinking about? >>>>>> >>>>>> Thanks >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.x.ivanov at oracle.com Fri May 15 12:14:59 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 15 May 2015 15:14:59 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <55549297.5020002@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> Message-ID: <5555E343.5050600@oracle.com> After private discussion with Paul, here's an update: http://cr.openjdk.java.net/~vlivanov/8079205/webrev.03 Renamed CallSite$Context to MethodHandleNatives$Context. 
Best regards, Vladimir Ivanov On 5/14/15 3:18 PM, Vladimir Ivanov wrote: > Small update in JDK code (webrev updated in place): > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 > > Changes: > - hid Context.dependencies field from Reflection > - new test case: CallSiteDepContextTest.testHiddenDepField() > > Best regards, > Vladimir Ivanov > > On 5/13/15 5:56 PM, Paul Sandoz wrote: >> >> On May 13, 2015, at 1:59 PM, Vladimir Ivanov >> wrote: >> >>> Peter, Paul, thanks for the feedback! >>> >>> Updated the webrev in place: >>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >>> >> >> +1 >> >> >>>> I am not an export in the HS area but the code mostly made sense to >>>> me. I also like Peter's suggestion of Context implementing Runnable. >>> I agree. Integrated. >>> >>>> Some minor comments. >>>> >>>> CallSite.java: >>>> >>>> 145 private final long dependencies = 0; // Used by JVM to >>>> store JVM_nmethodBucket* >>>> >>>> It's a little odd to see this field be final, but i think i >>>> understand the motivation as it's essentially acting as a memory >>>> address pointing to the head of the nbucket list, so in Java code >>>> you don't want to assign it to anything other than 0. >>> Yes, my motivation was to forbid reads & writes of the field from >>> Java code. I was experimenting with injecting the field by VM, but >>> it's less attractive. >>> >> >> I was wondering if that were possible. >> >> >>>> Is VM access, via java_lang_invoke_CallSite_Context, affected by >>>> treating final fields as really final in the j.l.i package? >>> No, field modifiers aren't respected. oopDesc::*_field_[get|put] does >>> a raw read/write using field offset. >>> >> >> Ok. >> >> Paul. 
>> From john.r.rose at oracle.com Fri May 15 17:06:50 2015 From: john.r.rose at oracle.com (John Rose) Date: Fri, 15 May 2015 10:06:50 -0700 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5555E343.5050600@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> Message-ID: I know injected fields are somewhat hacky, but there are a couple of conditions here which would motivate their use: 1. The field is of a type not known to Java. Usually, and in this case, it is a C pointer to some metadata. We can make space for it with a Java long, but that is a size mismatch on 32-bit systems. Such mismatches have occasionally caused bugs on big-endian systems, although we try to defend against them by using long-wise reads followed by casts. 2. The field is useless to Java code, and would be a vulnerability if somebody found a way to touch it from Java. In both these ways the 'dependencies' field is like the MemberName.vmtarget field. (Suggestion: s/dependencies/vmdependencies/ to follow that pattern.) ? John On May 15, 2015, at 5:14 AM, Vladimir Ivanov wrote: > > After private discussion with Paul, here's an update: > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.03 > > Renamed CallSite$Context to MethodHandleNatives$Context. 
> > Best regards, > Vladimir Ivanov > > On 5/14/15 3:18 PM, Vladimir Ivanov wrote: >> Small update in JDK code (webrev updated in place): >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >> >> Changes: >> - hid Context.dependencies field from Reflection >> - new test case: CallSiteDepContextTest.testHiddenDepField() >> >> Best regards, >> Vladimir Ivanov >> >> On 5/13/15 5:56 PM, Paul Sandoz wrote: >>> >>> On May 13, 2015, at 1:59 PM, Vladimir Ivanov >>> wrote: >>> >>>> Peter, Paul, thanks for the feedback! >>>> >>>> Updated the webrev in place: >>>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >>>> >>> >>> +1 >>> >>> >>>>> I am not an export in the HS area but the code mostly made sense to >>>>> me. I also like Peter's suggestion of Context implementing Runnable. >>>> I agree. Integrated. >>>> >>>>> Some minor comments. >>>>> >>>>> CallSite.java: >>>>> >>>>> 145 private final long dependencies = 0; // Used by JVM to >>>>> store JVM_nmethodBucket* >>>>> >>>>> It's a little odd to see this field be final, but i think i >>>>> understand the motivation as it's essentially acting as a memory >>>>> address pointing to the head of the nbucket list, so in Java code >>>>> you don't want to assign it to anything other than 0. >>>> Yes, my motivation was to forbid reads & writes of the field from >>>> Java code. I was experimenting with injecting the field by VM, but >>>> it's less attractive. >>>> >>> >>> I was wondering if that were possible. >>> >>> >>>>> Is VM access, via java_lang_invoke_CallSite_Context, affected by >>>>> treating final fields as really final in the j.l.i package? >>>> No, field modifiers aren't respected. oopDesc::*_field_[get|put] does >>>> a raw read/write using field offset. >>>> >>> >>> Ok. >>> >>> Paul. 
>>> From christian.thalinger at oracle.com Fri May 15 17:39:28 2015 From: christian.thalinger at oracle.com (Christian Thalinger) Date: Fri, 15 May 2015 10:39:28 -0700 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <554CDD48.8080500@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> Message-ID: <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> > On May 8, 2015, at 8:59 AM, Andrew Haley wrote: > > Here is a prototype of what I propose: > > http://cr.openjdk.java.net/~aph/rsa-1/ > It is a JNI version of the fast algorithm I think we should use for > RSA. It doesn't use any intrinsics. But with it, RSA is already twice > as fast as the code we have now for 1024-bit keys, and almost three > times as fast for 2048-bit keys! > > This code (montgomery_multiply) is designed to be as easy as I can > possibly make it to translate into a hand-coded intrinsic. I?m curious: did you try to implement this in Java? > It will > then offer better performance still, with the JNI overhead gone and > with some carefully hand-tweaked memory accesses. It has a very > regular structure which should make it fairly easy to turn into a > software pipeline with overlapped fetching and multiplication, > although this will perhaps be difficult on register-starved machines > like the x86. > > The JNI version can still be used for those machines where people > don't yet want to write a hand-coded intrinsic. > > I haven't yet done anything about Montgomery squaring: it's > asymptotically 25% faster than Montgomery multiplication. 
> > I have only written it for x86_64. Systems without a 64-bit multiply > won't benefit very much from this idea, but they are pretty much > legacy anyway: I want to concentrate on 64-bit systems. > > So, shall we go with this? I don't think you will find any faster way > to do it. > > Andrew. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Sat May 16 00:18:00 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 15 May 2015 17:18:00 -0700 Subject: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java Message-ID: <55568CB8.3030102@oracle.com> Small typo fix. Test run commands incorrectly execute SumRed_Double class instead of SumRed_Long. http://cr.openjdk.java.net/~kvn/8080483/webrev/ https://bugs.openjdk.java.net/browse/JDK-8080483 Tested with jtreg. Thanks, Vladimir From igor.veresov at oracle.com Sat May 16 00:38:23 2015 From: igor.veresov at oracle.com (Igor Veresov) Date: Fri, 15 May 2015 17:38:23 -0700 Subject: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java In-Reply-To: <55568CB8.3030102@oracle.com> References: <55568CB8.3030102@oracle.com> Message-ID: <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> Ok. igor > On May 15, 2015, at 5:18 PM, Vladimir Kozlov wrote: > > Small typo fix. Test run commands incorrectly execute SumRed_Double class instead of SumRed_Long. > > http://cr.openjdk.java.net/~kvn/8080483/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8080483 > > Tested with jtreg. 
> > Thanks, > Vladimir From vladimir.kozlov at oracle.com Sat May 16 00:42:46 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 15 May 2015 17:42:46 -0700 Subject: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java In-Reply-To: <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> References: <55568CB8.3030102@oracle.com> <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> Message-ID: <55569286.4010705@oracle.com> Thanks. Vladimir On 5/15/15 5:38 PM, Igor Veresov wrote: > Ok. > > igor > >> On May 15, 2015, at 5:18 PM, Vladimir Kozlov wrote: >> >> Small typo fix. Test run commands incorrectly execute SumRed_Double class instead of SumRed_Long. >> >> http://cr.openjdk.java.net/~kvn/8080483/webrev/ >> https://bugs.openjdk.java.net/browse/JDK-8080483 >> >> Tested with jtreg. >> >> Thanks, >> Vladimir > From aph at redhat.com Sat May 16 09:07:10 2015 From: aph at redhat.com (Andrew Haley) Date: Sat, 16 May 2015 10:07:10 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> Message-ID: <555708BE.8090100@redhat.com> On 15/05/15 18:39, Christian Thalinger wrote: > >> On May 8, 2015, at 8:59 AM, Andrew Haley wrote: >> >> Here is a prototype of what I propose: >> >> http://cr.openjdk.java.net/~aph/rsa-1/ >> It is a JNI version of the fast algorithm I think we should use for >> RSA. It doesn't use any intrinsics. 
But with it, RSA is already twice >> as fast as the code we have now for 1024-bit keys, and almost three >> times as fast for 2048-bit keys! >> >> This code (montgomery_multiply) is designed to be as easy as I can >> possibly make it to translate into a hand-coded intrinsic. > > I?m curious: did you try to implement this in Java? The Montgomery multiplication algorithm I presented is fast because (unlike the Java code in BigInteger) it keeps most of its intermediate state in registers. In contrast, the code in BigInteger spends much time writing digits out to memory and reading them back again. I think it is impossible to write in Java. The inner multiply primitive accumulates into a three-register sum. Like this: t += a * b where a and b are unsigned longs, and t is a triple-precision unsigned long. It's easy to express that in GNU C with inline asm, but not in Java. It's not possible to express it in in portable C either because C has no add with carry. The closest you could get in Java would be an intrinsic into which you passed and returned an object with three longs as the accumulator. But as I said, this JNI code is just a demonstration. I think that Montgomery multiplication itself really needs to be turned into a HotSpot intrinsic. Then, our RSA and Diffie-Hellman performance will be competitive with the best library code. I can do it, but so far I have had no response from Oracle at all. :-( Andrew. 
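The accumulate step Andrew describes — t += a * b with a triple-precision t — can be sketched portably with the GCC/Clang unsigned __int128 extension. This is an approximation: his actual prototype at the webrev link uses GNU C inline asm, and the portable form below need not compile to the same add-with-carry sequence.

```cpp
// Sketch of the inner Montgomery-multiply primitive described above:
// t2:t1:t0 += a * b, where the accumulator t is three 64-bit limbs.
typedef unsigned long long limb;

static void acc_mul(limb& t0, limb& t1, limb& t2, limb a, limb b) {
    unsigned __int128 p = (unsigned __int128)a * b;          // 128-bit product
    unsigned __int128 s = (unsigned __int128)t0 + (limb)p;   // low limb
    t0 = (limb)s;
    s = (unsigned __int128)t1 + (limb)(p >> 64) + (limb)(s >> 64);
    t1 = (limb)s;
    t2 += (limb)(s >> 64);                                   // final carry
}
```

Keeping t0..t2 in registers across the whole inner loop — rather than writing digits out to memory and reading them back, as the Java code in BigInteger must — is where the speedup comes from.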
From aph at redhat.com Sat May 16 09:35:02 2015 From: aph at redhat.com (Andrew Haley) Date: Sat, 16 May 2015 10:35:02 +0100 Subject: RFR(L): 8069539: RSA acceleration In-Reply-To: <555708BE.8090100@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> Message-ID: <55570F46.5000909@redhat.com> There is one other thing I didn't mention: it is possible to turn Montgomery multiplication into a software pipeline if you have enough registers. The idea is that the latency of one multiplication is overlapped by the latency of the load of the operands for the next one, and the accumulation is done on not on the latest multiplication but the previous one, so no operation ever stalls the pipeline. I'm not sure if x86 has enough registers to do this (it might, just) but AArch64 certainly does. An out-of-order CPU can to some extent do the instruction reordering automatically, but you still have to write the code in a way that makes it possible. Andrew. 
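The restructuring Andrew describes can be illustrated with a toy loop: the operand loads for iteration i+1 are issued before the accumulate for iteration i, so the multiply's latency is hidden behind the next loads. This is a simplified illustration of the shape, not his actual kernel:

```cpp
typedef unsigned long long limb;

// Toy software-pipelined sum of 64x64->128-bit products (assumes n >= 1):
// the loads for the next iteration are issued before the current product
// is accumulated, so no operation waits on the multiply's latency.
unsigned __int128 dot(const limb* a, const limb* b, int n) {
    unsigned __int128 acc = 0;
    limb x = a[0], y = b[0];                             // prologue loads
    for (int i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)x * y;  // multiply i
        if (i + 1 < n) { x = a[i + 1]; y = b[i + 1]; }   // loads for i+1
        acc += p;                                        // accumulate i
    }
    return acc;
}
```

On an in-order core this ordering matters directly; an out-of-order core can do some of the reordering automatically, but, as noted above, only if the code is written in a way that leaves it room.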
From roland.westrelin at oracle.com Mon May 18 08:04:27 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 18 May 2015 10:04:27 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <553FFAC4.7030107@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> Message-ID: >>> BTW, should we modify LoadNode::hash() to include _depends_only_on_test and prevent igvning? >> >> If the graph already has a non-pinned LoadNode and we add a pinned LoadNode with the same inputs, it's safe for GVN to replace the pinned LoadNode by the non-pinned LoadNode, otherwise it wouldn't be safe to have a non pinned LoadNode with those inputs in the first place. If the graph already has a pinned LoadNode and we add a non pinned LoadNode with the same inputs, it's safe for GVN to replace the non-pinned LoadNode by the pinned LoadNode. It's also suboptimal but better than 2 LoadNodes... > > Okay. Add comment to _depends_only_on_test field why we don't use it in hash(). I added the comment. Thanks for the review. I had the change go through a performance run to be safe and I found no regression so I'm pushing this. Roland. > > Thanks, > Vladimir > >> >> I guess we could change LoadNode::hash() and then use LoadNode::Identity/Ideal to make sure the pinned LoadNode is always replaced by the non-pinned LoadNode in the scenarios above but that sounds like extra complexity for something that, as far as we know, never happens in practice. >> >> Roland.
>> >> >>> >>> Thanks, >>> Vladimir >>> >>> On 4/24/15 1:03 AM, Roland Westrelin wrote: >>>> >>>>> Vladimir suggested privately to set _depends_only_on_test to true in the constructor and then use an explicit call to a new method set_depends_only_on_test() to set it to false in the rare cases where it's needed. That feels better indeed. What do you think? >>>> >>>> Actually, using a set_depends_only_on_test() method doesn't work well. In LibraryCallKit::inline_unsafe_access() the node returned by make_load() may have been transformed already and we could call set_depends_only_on_test() on a node that doesn't need to be pinned. The call to set_depends_only_on_test() would have to be in LoadNode::make(). I went with default parameters instead to keep the change small: >>>> >>>> http://cr.openjdk.java.net/~roland/8077504/webrev.01/ >>>> >>>> Roland. >>>> >> From zoltan.majo at oracle.com Mon May 18 09:17:42 2015 From: zoltan.majo at oracle.com (Zoltán Majó) Date: Mon, 18 May 2015 11:17:42 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant Message-ID: <5559AE36.4040808@oracle.com> Hi, please review the following small patch. Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 Problem: The changes by 8068945 break building the zero JVM variant. Solution: Add the definition of the PreserveFramePointer to the globals_zero.hpp file.
Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ Testing: - full JPRT run (testset hotspot) - local build of zero JVM variant Thank you and best regards, Zoltan From roland.westrelin at oracle.com Mon May 18 11:53:55 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 18 May 2015 13:53:55 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> Message-ID: <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> My change is broken. GraphKit::make_load() takes 2 boolean arguments: bool depends_only_on_test = true, bool require_atomic_access = false) and Parse::do_get_xxx() passes the value needs_atomic_access as argument depends_only_on_test instead of require_atomic_access. That goes unnoticed because of the default values. It's the exact reason I didn't use default arguments in my first webrev. It's too easy to make a mistake and not notice it. If we keep default arguments, one way to have help from the compiler would be to use a 2 value enum: enum DependsOnlyOnTest { DoesDependOnlyOnTest, DoesNotDependOnlyOnTest }; and use it to pass parameters around. The _depends_only_on_test LoadNode field would still be a boolean. Is this acceptable? Roland. > On May 18, 2015, at 10:04 AM, Roland Westrelin wrote: > >>>> BTW, should we modify LoadNode::hash() to include _depends_only_on_test and prevent igvning? >>> >>> If the graph already has a non-pinned LoadNode and we add a pinned LoadNode with the same inputs, it's safe for GVN to replace the pinned LoadNode by the non-pinned LoadNode, otherwise it wouldn't be safe to have a non pinned LoadNode with those inputs in the first place.
If the graph already has a pinned LoadNode and we add a non pinned LoadNode with the same inputs, it's safe for GVN to replace the non-pinned LoadNode by the pinned LoadNode. It's also suboptimal but better than 2 LoadNodes... >> >> Okay. Add comment to _depends_only_on_test field why we don't use it in hash(). > > I added the comment. Thanks for the review. > I had the change go through a performance run to be safe and I found no regression so I'm pushing this. > > Roland. > >> >> Thanks, >> Vladimir >> >>> >>> I guess we could change LoadNode::hash() and then use LoadNode::Identity/Ideal to make sure the pinned LoadNode is always replaced by the non-pinned LoadNode in the scenarios above but that sounds like extra complexity for something that, as far as we know, never happens in practice. >>> >>> Roland. >>> >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>> On 4/24/15 1:03 AM, Roland Westrelin wrote: >>>>> >>>>>> Vladimir suggested privately to set _depends_only_on_test to true in the constructor and then use an explicit call to a new method set_depends_only_on_test() to set it to false in the rare cases where it's needed. That feels better indeed. What do you think? >>>>> >>>>> Actually, using a set_depends_only_on_test() method doesn't work well. In LibraryCallKit::inline_unsafe_access() the node returned by make_load() may have been transformed already and we could call set_depends_only_on_test() on a node that doesn't need to be pinned. The call to set_depends_only_on_test() would have to be in LoadNode::make(). I went with default parameters instead to keep the change small: >>>>> >>>>> http://cr.openjdk.java.net/~roland/8077504/webrev.01/ >>>>> >>>>> Roland.
>>>>> >>> > From roland.westrelin at oracle.com Mon May 18 12:19:59 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Mon, 18 May 2015 14:19:59 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: <5553C1E2.3040003@oracle.com> References: <5553C1E2.3040003@oracle.com> Message-ID: Thanks for looking at this, Vladimir. > Why do you think that main- and post- loops will be empty too and it is safe to replace them with pre-loop? Why is it safe? > > Opaque1 node is used only when RCE and align is requested: > cmp_end->set_req(2, peel_only ? pre_limit : pre_opaq); > > In peel_only case pre-loop has only 1 iteration and can be eliminated too. The assert could be hit in such case too. So your change may not fix the problem. In case of peel_only, we set: if( peel_only ) main_head->set_main_no_pre_loop(); and in IdealLoopTree::policy_range_check(): // If we unrolled with no intention of doing RCE and we later // changed our minds, we got no pre-loop. Either we need to // make a new pre-loop, or we gotta disallow RCE. if (cl->is_main_no_pre_loop()) return false; // Disallowed for now. > I think eliminating main- and post- is a good investigation for a separate RFE. And we need to make sure it is safe. My reasoning is that if the pre loop's limit is still an opaque node, then the compiler has not been able to optimize the pre loop based on its limited number of iterations yet. I don't see what optimization would remove stuff from the pre loop that wouldn't have removed the same stuff from the initial loop, had it not been cloned. I could be missing something, though. Do you see one? Roland. > > To fix this bug, I think, we should bail out from do_range_check() when we can't find pre-loop.
> > Typo 'cab': > + // post loop cab be removed as well > > Thanks, > Vladimir > > On 5/13/15 6:02 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8078866/webrev.00/ >> >> The assert fires when optimizing that code: >> >> static int remi_sumb() { >> Integer j = Integer.valueOf(1); >> for (int i = 0; i < 1000; i++) { >> j = j + 1; >> } >> return j; >> } >> >> The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there. >> >> Here is the chain of events that leads to the pre loop being optimized out: >> >> 1- The CountedLoopNode is created >> 2- PhiNode::Value() for the induction variable's Phi is called and captures that 0 <= i < 1000 >> 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive value of j >> 4- pre-loop, main-loop, post-loop are created >> 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: >> >> public static Integer valueOf(int i) { >> if (i >= IntegerCache.low && i <= IntegerCache.high) >> return IntegerCache.cache[i + (-IntegerCache.low)]; >> return new Integer(i); >> } >> >> 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i >> 7- In the preloop, the range of values of i is known because of 2-, so the if of valueOf() can be optimized out >> 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() >> >> SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf.
The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. >> >> Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops. >> >> I don't have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don't optimize the loop out at all. >> >> Roland. >> From volker.simonis at gmail.com Mon May 18 12:49:19 2015 From: volker.simonis at gmail.com (Volker Simonis) Date: Mon, 18 May 2015 14:49:19 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <5559AE36.4040808@oracle.com> References: <5559AE36.4040808@oracle.com> Message-ID: Hi Zoltan, your change looks good. Thanks for fixing this and best regards, Volker On Mon, May 18, 2015 at 11:17 AM, Zoltán Majó wrote: > Hi, > > > please review the following small patch. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 > > Problem: The changes by 8068945 break building the zero JVM variant. > > Solution: Add the definition of the PreserveFramePointer to the > globals_zero.hpp file.
> > Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ > > Testing: > - full JPRT run (testset hotspot) > - local build of zero JVM variant > > Thank you and best regards, > > > Zoltan > From vladimir.x.ivanov at oracle.com Mon May 18 13:05:41 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 18 May 2015 16:05:41 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> Message-ID: <5559E3A5.8030108@oracle.com> Thanks, John. What do you think about the following version: http://cr.openjdk.java.net/~vlivanov/8079205/webrev.04 Best regards, Vladimir Ivanov On 5/15/15 8:06 PM, John Rose wrote: > I know injected fields are somewhat hacky, but there are a couple of conditions here which would motivate their use: > > 1. The field is of a type not known to Java. Usually, and in this case, it is a C pointer to some metadata. > > We can make space for it with a Java long, but that is a size mismatch on 32-bit systems. Such mismatches have occasionally caused bugs on big-endian systems, although we try to defend against them by using long-wise reads followed by casts. > > 2. The field is useless to Java code, and would be a vulnerability if somebody found a way to touch it from Java. > > In both these ways the 'dependencies' field is like the MemberName.vmtarget field. (Suggestion: s/dependencies/vmdependencies/ to follow that pattern.) > > -- John > > On May 15, 2015, at 5:14 AM, Vladimir Ivanov wrote: >> >> After private discussion with Paul, here's an update: >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.03 >> >> Renamed CallSite$Context to MethodHandleNatives$Context.
>> >> Best regards, >> Vladimir Ivanov >> >> On 5/14/15 3:18 PM, Vladimir Ivanov wrote: >>> Small update in JDK code (webrev updated in place): >>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >>> >>> Changes: >>> - hid Context.dependencies field from Reflection >>> - new test case: CallSiteDepContextTest.testHiddenDepField() >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 5/13/15 5:56 PM, Paul Sandoz wrote: >>>> >>>> On May 13, 2015, at 1:59 PM, Vladimir Ivanov >>>> wrote: >>>> >>>>> Peter, Paul, thanks for the feedback! >>>>> >>>>> Updated the webrev in place: >>>>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.02 >>>>> >>>> >>>> +1 >>>> >>>> >>>>>> I am not an export in the HS area but the code mostly made sense to >>>>>> me. I also like Peter's suggestion of Context implementing Runnable. >>>>> I agree. Integrated. >>>>> >>>>>> Some minor comments. >>>>>> >>>>>> CallSite.java: >>>>>> >>>>>> 145 private final long dependencies = 0; // Used by JVM to >>>>>> store JVM_nmethodBucket* >>>>>> >>>>>> It's a little odd to see this field be final, but i think i >>>>>> understand the motivation as it's essentially acting as a memory >>>>>> address pointing to the head of the nbucket list, so in Java code >>>>>> you don't want to assign it to anything other than 0. >>>>> Yes, my motivation was to forbid reads & writes of the field from >>>>> Java code. I was experimenting with injecting the field by VM, but >>>>> it's less attractive. >>>>> >>>> >>>> I was wondering if that were possible. >>>> >>>> >>>>>> Is VM access, via java_lang_invoke_CallSite_Context, affected by >>>>>> treating final fields as really final in the j.l.i package? >>>>> No, field modifiers aren't respected. oopDesc::*_field_[get|put] does >>>>> a raw read/write using field offset. >>>>> >>>> >>>> Ok. >>>> >>>> Paul. 
>>>> > From aph at redhat.com Mon May 18 13:56:48 2015 From: aph at redhat.com (Andrew Haley) Date: Mon, 18 May 2015 14:56:48 +0100 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test Message-ID: <5559EFA0.5060800@redhat.com> This test has been failing with a failed assertion: Exception in thread "main" java.lang.AssertionError: void NullCheckDroppingsTest.testVarClassCast(java.lang.String) was not deoptimized at NullCheckDroppingsTest.checkDeoptimization(NullCheckDroppingsTest.java:331) at NullCheckDroppingsTest.runTest(NullCheckDroppingsTest.java:309) at NullCheckDroppingsTest.main(NullCheckDroppingsTest.java:125) It seems that the reason for the failure is that the nmethod for a deoptimized method (testVarClassCast) is not removed. This is perfectly correct because when testVarClassCast() was trapped, it was with an action of Deoptimization::Action_maybe_recompile. When this occurs, a method is not marked as dead and continues to be invoked, so the test WhiteBox.isMethodCompiled() will still return true. Therefore, I think this code in checkDeoptimization is incorrect: // Check deoptimization event (intrinsic Class.cast() works). if (WHITE_BOX.isMethodCompiled(method) == deopt) { throw new AssertionError(method + " was" + (deopt ? " not" : "") + " deoptimized"); } We either need some code in WhiteBox to check for a deoptimization event properly or we should just remove this altogether. So, thoughts? Just delete the check? Andrew. From denis.kononenko at oracle.com Mon May 18 13:57:12 2015 From: denis.kononenko at oracle.com (denis kononenko) Date: Mon, 18 May 2015 16:57:12 +0300 Subject: RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 Message-ID: <5559EFB8.9000503@oracle.com> Hi All, Could you please review a small change to the TEST.groups file. 
Webrev link: http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ Bug id: JDK-8077620 Testing: automated Description: The following tests cannot be run on compact profiles 1 and 2: compiler/jsr292/RedefineMethodUsedByMultipleMethodHandles.java: requires java.lang.instrument package (JEP-161, available in Compact Profile 3); compiler/rangechecks/TestRangeCheckSmearing.java: requires sun.management package (JEP-161, any management packages available since Compact Profile 3); gc/TestGCLogRotationViaJcmd.java: uses ProcessTools which depends on java.lang.management (JEP-161, available in Compact Profile 3); serviceability/sa/jmap-hashcode/Test8028623.java: the same as above. The fix simply moves these tests into the :needs_compact3 group. Thank you, Denis. From zoltan.majo at oracle.com Mon May 18 14:35:59 2015 From: zoltan.majo at oracle.com (Zoltán Majó) Date: Mon, 18 May 2015 16:35:59 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: References: <5559AE36.4040808@oracle.com> Message-ID: <5559F8CF.9070304@oracle.com> Hi Volker, On 05/18/2015 02:49 PM, Volker Simonis wrote: > Hi Zoltan, > > your change looks good. > > Thanks for fixing this and best regards, thank you for the review! Best wishes, Zoltán > Volker > > > On Mon, May 18, 2015 at 11:17 AM, Zoltán Majó wrote: >> Hi, >> >> >> please review the following small patch. >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 >> >> Problem: The changes by 8068945 break building the zero JVM variant. >> >> Solution: Add the definition of the PreserveFramePointer to the >> globals_zero.hpp file.
>> >> Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ >> >> Testing: >> - full JPRT run (testset hotspot) >> - local build of zero JVM variant >> >> Thank you and best regards, >> >> >> Zoltan >> From sgehwolf at redhat.com Mon May 18 15:07:09 2015 From: sgehwolf at redhat.com (Severin Gehwolf) Date: Mon, 18 May 2015 17:07:09 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <5559AE36.4040808@oracle.com> References: <5559AE36.4040808@oracle.com> Message-ID: <1431961629.3512.14.camel@redhat.com> Hi, On Mon, 2015-05-18 at 11:17 +0200, Zoltán Majó wrote: > Hi, > > > please review the following small patch. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 > > Problem: The changes by 8068945 break building the zero JVM variant. > > Solution: Add the definition of the PreserveFramePointer to the > globals_zero.hpp file. > > Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ > > Testing: > - full JPRT run (testset hotspot) > - local build of zero JVM variant > > Thank you and best regards, Thanks, Zoltan, for being pro-active about this! Patch looks fine to me (Disclaimer: I'm not a hotspot Reviewer). Cheers, Severin > Zoltan > From volker.simonis at gmail.com Mon May 18 16:05:04 2015 From: volker.simonis at gmail.com (Volker Simonis) Date: Mon, 18 May 2015 18:05:04 +0200 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file In-Reply-To: <55539A49.4010906@oracle.com> References: <55539A49.4010906@oracle.com> Message-ID: Hi Vladimir, thanks for the review and sorry for the late reply (we had a public holiday in Germany:). On Wed, May 13, 2015 at 8:39 PM, Vladimir Kozlov wrote: > Is this because you can have negative values? > I see that at the bottom of calls which generate bits u_field() method is > used. Why not mask it there?
u_field() generates immediates in the range (hi_bit - lo_bit + 1) but it can be used to generate not only 5-bit immediates but also bigger or smaller ones (e.g. tbr() calls u_field() to generate a 10-bit immediate). The assertion in u_field() checks that the given value can be encoded in the requested range of bits but we do not want to explicitly mask the value to the corresponding bit range as this could hide other problems. The other problem is that we can have negative values as you have correctly noticed. I think this is more a kind of a conceptual problem. The JLS states that "only the five lowest-order bits of the right-hand operand are used as the shift distance" in a shift operation [1]. The same holds true for the specifications of the shift bytecodes in the JVMS [2]. But for constant shift amounts we do not handle this in C2 (i.e. in the ideal graph) but instead we rely on the corresponding CPU instruction doing "the right thing". On x86 this works because the whole 8-bit immediates are encoded right into the instruction and interpreted as required by the JLS. Another way to solve this would be to do the masking right in the ideal graph as we've previously done in the SAP JVM:

Node *LShiftINode::Ideal(PhaseGVN *phase, bool can_reshape) {
  const Type *t = phase->type( in(2) );
  if( t == Type::TOP ) return NULL;        // Right input is dead
  const TypeInt *t2 = t->isa_int();
  if( !t2 || !t2->is_con() ) return NULL;  // Right input is a constant

  // SAPJVM: mask constant shift count already in ideal graph processing
  const int in2_con = t2->get_con();
  const int con = in2_con & ( BitsPerJavaInteger - 1 ); // masked shift count

  if ( con == 0 ) return NULL; // let Identity() handle 0 shift count

  if( in2_con != con ) {
    set_req( 2, phase->intcon(con) ); // replace shift count with masked value
  }

Damned!
I just wanted to ask you what's your opinion about this solution but while writing all this and doing some software archaeology I just discovered that this problem has already been discussed [3] and solved some time ago! Change "7029017: Additional architecture support for c2 compiler" introduced the "Support (for) masking of shift counts when the processor architecture mandates it". This is exactly what's needed on PowerPC! So I'd like to change my fix as follows: http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v3/ I think that using "Matcher::need_masked_shift_count = true" will make a lot of the masking for shift instructions in the ppc.ad file obsolete, but I'd like to keep these cleanups for a separate change. Regards, Volker [1] https://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.19 [2] https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ishl [3] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-March/thread.html#4944 > There are other instructions which can have the same problem (slwi). You > would have to guarantee all of those do the masking. That is why I am asking > why not mask it at the final code methods, like u_field()? > > Thanks, > Vladimir > > > On 5/13/15 9:29 AM, Volker Simonis wrote: >> >> Hi, >> >> could you please review the following, ppc64-only fix which solves a >> problem with the integer-rotate instructions in C2: >> >> https://bugs.openjdk.java.net/browse/JDK-8080190 >> http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ >> >> The problem was that we initially took the "rotate left/right by 8-bit >> immediate" instructions from x86_64. But on x86_64 the rotate >> instructions for 32-bit integers do in fact really take 8-bit >> immediates (seems like the Intel guys had too many spare bits when they >> designed their instruction set :) On the other hand, the rotate >> instructions on Power only use 5 bits to encode the rotation distance.
>> We do check for this limit, but only in the debug build so you'll see >> an assertion there, but in the product build we will silently emit a >> wrong instruction which can lead to anything starting from a crash up >> to an incorrect computation. >> >> I think the simplest solution for this problem is to mask the >> rotation distance with 0x1f (which is equivalent to taking the >> distance modulo 32) in the corresponding instructions. >> >> This change should also be backported to 8 as fast as possible. Notice >> that this is a day-zero bug in our ppc64 port because in our internal SAP >> JVM we already mask the shift amounts in the ideal graph but for some >> unknown reasons these shared changes haven't been contributed along >> with our port. >> >> Thank you and best regards, >> Volker >> > From vladimir.kozlov at oracle.com Mon May 18 16:42:56 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 09:42:56 -0700 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file In-Reply-To: References: <55539A49.4010906@oracle.com> Message-ID: <555A1690.40704@oracle.com> You need to change the comment in ppc.ad. Otherwise looks good. Thank you for doing archeological research - I forgot about that change. Thanks, Vladimir On 5/18/15 9:05 AM, Volker Simonis wrote: > Hi Vladimir, > > thanks for the review and sorry for the late reply (we had a public > holiday in Germany:). > > > On Wed, May 13, 2015 at 8:39 PM, Vladimir Kozlov > wrote: >> Is this because you can have negative values? >> I see that at the bottom of calls which generate bits u_field() method is >> used. Why not mask it there? >> > > u_field() generates immediates in the range (hi_bit - lo_bit + 1) but > it can be used to generate not only 5-bit immediates but also bigger > or smaller ones (e.g. tbr() calls u_field() to generate a 10-bit > immediate).
The assertion in u_field() checks that the given value can > be encoded in the requested range of bits but we do not want to > explicitly mask the value to the corresponding bit range as this could > hide other problems. > > The other problem is that we can have negative values as you have > correctly noticed. > > I think this is more a kind of a conceptual problem. The JLS states > that "only the five lowest-order bits of the right-hand operand are > used as the shift distance" in a shift operation [1]. The same holds > true for the specifications of the shift bytecodes in the JVMS [2]. > But for constant shift amounts we do not handle this in C2 (i.e. in > the ideal graph) but instead we rely on the corresponding CPU > instruction doing "the right thing". On x86 this works because the > whole 8-bit immediates are encoded right into the instruction and > interpreted as required by the JLS. > > Another way to solve this would be to do the masking right in the > ideal graph as we've previously done in the SAP JVM: > > Node *LShiftINode::Ideal(PhaseGVN *phase, bool can_reshape) { > const Type *t = phase->type( in(2) ); > if( t == Type::TOP ) return NULL; // Right input is dead > const TypeInt *t2 = t->isa_int(); > if( !t2 || !t2->is_con() ) return NULL; // Right input is a constant > > // SAPJVM: mask constant shift count already in ideal graph processing > const int in2_con = t2->get_con(); > const int con = in2_con & ( BitsPerJavaInteger - 1 ); // > masked shift count > > if ( con == 0 ) return NULL; // let Identity() handle 0 shift count > > if( in2_con != con ) { > set_req( 2, phase->intcon(con) ); // replace shift count with masked value > } > > Damned! I just wanted to ask you what's your opinion about this > solution but while writing all this and doing some software > archaeology I just discovered that this problem has already been > discussed [3] and solved some time ago!
Change "7029017: Additional > architecture support for c2 compiler" introduced the "Support (for) > masking of shift counts when the processor architecture mandates it". > This is exactly what's needed on PowerPC! > > So I'd like to change my fix as follows: > > http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v3/ > > I think that using "Matcher::need_masked_shift_count = true" will make > a lot of the masking for shift instructions in the ppc.ad file > obsolete, but I'd like to keep these cleanups for a separate change. > > Regards, > Volker > > [1] https://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.19 > [2] https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ishl > [3] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-March/thread.html#4944 > >> There are other instructions which can have the same problem (slwi). You >> would have to guarantee all of those do the masking. That is why I am asking >> why not mask it at the final code methods, like u_field()? >> >> Thanks, >> Vladimir >> >> >> On 5/13/15 9:29 AM, Volker Simonis wrote: >>> >>> Hi, >>> >>> could you please review the following, ppc64-only fix which solves a >>> problem with the integer-rotate instructions in C2: >>> >>> https://bugs.openjdk.java.net/browse/JDK-8080190 >>> http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ >>> >>> The problem was that we initially took the "rotate left/right by 8-bit >>> immediate" instructions from x86_64. But on x86_64 the rotate >>> instructions for 32-bit integers do in fact really take 8-bit >>> immediates (seems like the Intel guys had to many spare bits when they >>> designed their instruction set :) On the other hand, the rotate >>> instructions on Power only use 5 bit to encode the rotation distance. 
>>> We do check for this limit, but only in the debug build so you'll see >>> an assertion there, but in the product build we will silently emit a >>> wrong instruction which can lead to anything starting from a crash up >>> to an incorrect computation. >>> >>> I think the simplest solution for this problem is to mask the >>> rotation distance with 0x1f (which is equivalent to taking the >>> distance modulo 32) in the corresponding instructions. >>> >>> This change should also be backported to 8 as fast as possible. Notice >>> that this is a day-zero bug in our ppc64 port because in our internal SAP >>> JVM we already mask the shift amounts in the ideal graph but for some >>> unknown reasons these shared changes haven't been contributed along >>> with our port. >>> >>> Thank you and best regards, >>> Volker >>> >> From vladimir.kozlov at oracle.com Mon May 18 16:46:54 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 09:46:54 -0700 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <5559AE36.4040808@oracle.com> References: <5559AE36.4040808@oracle.com> Message-ID: <555A177E.3040900@oracle.com> Looks good. Thanks, Vladimir On 5/18/15 2:17 AM, Zoltán Majó wrote: > Hi, > > > please review the following small patch. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 > > Problem: The changes by 8068945 break building the zero JVM variant. > > Solution: Add the definition of the PreserveFramePointer to the > globals_zero.hpp file.
> > Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ > > Testing: > - full JPRT run (testset hotspot) > - local build of zero JVM variant > > Thank you and best regards, > > > Zoltan > From volker.simonis at gmail.com Mon May 18 16:53:14 2015 From: volker.simonis at gmail.com (Volker Simonis) Date: Mon, 18 May 2015 18:53:14 +0200 Subject: RFR(S): 8080190: PPC64: Fix wrong rotate instructions in the .ad file In-Reply-To: <555A1690.40704@oracle.com> References: <55539A49.4010906@oracle.com> <555A1690.40704@oracle.com> Message-ID: Thanks for the review! I'll update the comment accordingly. Regards, Volker On Mon, May 18, 2015 at 6:42 PM, Vladimir Kozlov wrote: > You need to change the comment in ppc.ad. > Otherwise looks good. Thank you for doing archeological research - I forgot > about that change. > > Thanks, > Vladimir > > > On 5/18/15 9:05 AM, Volker Simonis wrote: >> >> Hi Vladimir, >> >> thanks for the review and sorry for the late reply (we had a public >> holiday in Germany:). >> >> >> On Wed, May 13, 2015 at 8:39 PM, Vladimir Kozlov >> wrote: >>> >>> Is this because you can have negative values? >>> I see that at the bottom of calls which generate bits u_field() method is >>> used. Why not mask it there? >>> >> >> u_field() generates immediates in the range (hi_bit - lo_bit + 1) but >> it can be used to generate not only 5-bit immediates but also bigger >> or smaller ones (e.g. tbr() calls u_field() to generate a 10-bit >> immediate). The assertion in u_field() checks that the given value can >> be encoded in the requested range of bits but we do not want to >> explicitly mask the value to the corresponding bit range as this could >> hide other problems. >> >> The other problem is that we can have negative values as you have >> correctly noticed. >> >> I think this is more a kind of a conceptual problem. The JLS states >> that "only the five lowest-order bits of the right-hand operand are >> used as the shift distance" in a shift operation [1].
The same holds >> true for the specifications of the shift bytecodes in the JVMS [2]. >> But for constant shift amounts we do not handle this in C2 (i.e. in >> the ideal graph) but instead we rely on the corresponding CPU >> instruction doing "the right thing". On x86 this works because the >> whole 8-bit immediates are encoded right into the instruction and >> interpreted as required by the JLS. >> >> Another way to solve this would be to do the masking right in the >> ideal graph as we've previously done in the SAP JVM: >> >> Node *LShiftINode::Ideal(PhaseGVN *phase, bool can_reshape) { >> const Type *t = phase->type( in(2) ); >> if( t == Type::TOP ) return NULL; // Right input is dead >> const TypeInt *t2 = t->isa_int(); >> if( !t2 || !t2->is_con() ) return NULL; // Right input is not a constant >> >> // SAPJVM: mask constant shift count already in ideal graph processing >> const int in2_con = t2->get_con(); >> const int con = in2_con & ( BitsPerJavaInteger - 1 ); // >> masked shift count >> >> if ( con == 0 ) return NULL; // let Identity() handle 0 shift count >> >> if( in2_con != con ) { >> set_req( 2, phase->intcon(con) ); // replace shift count with masked >> value >> } >> >> Damned! I just wanted to ask you what's your opinion about this >> solution but while writing all this and doing some software >> archaeology I just discovered that this problem has already been >> discussed [3] and solved some time ago! Change "7029017: Additional >> architecture support for c2 compiler" introduced the "Support (for) >> masking of shift counts when the processor architecture mandates it". >> This is exactly what's needed on PowerPC! >> >> So I'd like to change my fix as follows: >> >> http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v3/ >> >> I think that using "Matcher::need_masked_shift_count = true" will make >> a lot of the masking for shift instructions in the ppc.ad file >> obsolete, but I'd like to keep these cleanups for a separate change.
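As an aside for readers following the thread: the JLS rule quoted above — for int operands only the five lowest-order bits of the shift distance are used, i.e. the count is taken modulo 32 — is easy to check from plain Java. The snippet below is purely illustrative and is not part of the patch under review:

```java
public class ShiftMaskDemo {
    // Returns true if shifting by s gives the same result as shifting
    // by (s & 0x1f), i.e. the shift count is taken modulo 32 (JLS 15.19).
    static boolean maskedEquals(int x, int s) {
        return (x << s) == (x << (s & 0x1f));
    }

    public static void main(String[] args) {
        int x = -123456;
        System.out.println((x << 33) == (x << 1)); // shift by 33 behaves like shift by 1
        System.out.println((x >> 32) == x);        // 32 & 0x1f == 0, so no shift at all
        System.out.println(maskedEquals(x, 37));   // holds for any shift count
    }
}
```

The same masking applies to the long shifts with a 6-bit mask (0x3f); x86 implements this masking in hardware, which is why the problem described in the thread only shows up on architectures like Power that do not.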
>> >> Regards, >> Volker >> >> [1] >> https://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.19 >> [2] >> https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ishl >> [3] >> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-March/thread.html#4944 >> >>> There are other instructions which can have the same problem (slwi). You >>> would have to guarantee all of those do the masking. That is why I am >>> asking >>> why not mask it at the final code methods, like u_field()? >>> >>> Thanks, >>> Vladimir >>> >>> >>> On 5/13/15 9:29 AM, Volker Simonis wrote: >>>> >>>> >>>> Hi, >>>> >>>> could you please review the following, ppc64-only fix which solves a >>>> problem with the integer-rotate instructions in C2: >>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8080190 >>>> http://cr.openjdk.java.net/~simonis/webrevs/2015/8080190.v2/ >>>> >>>> The problem was that we initially took the "rotate left/right by 8-bit >>>> immediate" instructions from x86_64. But on x86_64 the rotate >>>> instructions for 32-bit integers do in fact really take 8-bit >>>> immediates (seems like the Intel guys had too many spare bits when they >>>> designed their instruction set :) On the other hand, the rotate >>>> instructions on Power only use 5 bits to encode the rotation distance. >>>> >>>> We do check for this limit, but only in the debug build so you'll see >>>> an assertion there, but in the product build we will silently emit a >>>> wrong instruction which can lead to anything starting from a crash up >>>> to an incorrect computation. >>>> >>>> I think the simplest solution for this problem is to mask the >>>> rotation distance with 0x1f (which is equivalent to taking the >>>> distance modulo 32) in the corresponding instructions. >>>> >>>> This change should also be backported to 8 as fast as possible.
Notice >>>> that this is a day-zero bug in our ppc64 port because in our internal SAP >>>> JVM we already mask the shift amounts in the ideal graph but for some >>>> unknown reasons these shared changes haven't been contributed along >>>> with our port. >>>> >>>> Thank you and best regards, >>>> Volker >>>> >>> > From zoltan.majo at oracle.com Mon May 18 17:18:30 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Mon, 18 May 2015 19:18:30 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <555A177E.3040900@oracle.com> References: <5559AE36.4040808@oracle.com> <555A177E.3040900@oracle.com> Message-ID: <555A1EE6.3090702@oracle.com> Thank you, Vladimir, for the review! Best regards, Zoltan On 05/18/2015 06:46 PM, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/18/15 2:17 AM, Zoltán Majó wrote: >> Hi, >> >> >> please review the following small patch. >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 >> >> Problem: The changes by 8068945 break building the zero JVM variant. >> >> Solution: Add the definition of the PreserveFramePointer to the >> globals_zero.hpp file. >> >> Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ >> >> Testing: >> - full JPRT run (testset hotspot) >> - local build of zero JVM variant >> >> Thank you and best regards, >> >> >> Zoltan >> From zoltan.majo at oracle.com Mon May 18 17:55:34 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Mon, 18 May 2015 19:55:34 +0200 Subject: [9] RFR(XS): 8080281: 8068945 changes break building the zero JVM variant In-Reply-To: <1431961629.3512.14.camel@redhat.com> References: <5559AE36.4040808@oracle.com> <1431961629.3512.14.camel@redhat.com> Message-ID: <555A2796.1020501@oracle.com> Hi Severin, On 05/18/2015 05:07 PM, Severin Gehwolf wrote: > Hi, > > On Mon, 2015-05-18 at 11:17 +0200, Zoltán Majó wrote: >> Hi, >> >> >> please review the following small patch.
>> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8080281 >> >> Problem: The changes by 8068945 break building the zero JVM variant. >> >> Solution: Add the definition of the PreserveFramePointer to the >> globals_zero.hpp file. >> >> Webrev: http://cr.openjdk.java.net/~zmajo/8080281/webrev.00/ >> >> Testing: >> - full JPRT run (testset hotspot) >> - local build of zero JVM variant >> >> Thank you and best regards, > Thanks, Zoltan, for being pro-active about this! Patch looks fine to me > (Disclaimer: I'm not a hotspot Reviewer). thank you for the review! Best wishes, Zoltan > > Cheers, > Severin > >> Zoltan >> > > From vladimir.kozlov at oracle.com Mon May 18 18:06:18 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 11:06:18 -0700 Subject: RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 In-Reply-To: <5559EFB8.9000503@oracle.com> References: <5559EFB8.9000503@oracle.com> Message-ID: <555A2A1A.2080600@oracle.com> Looks fine. Thanks, Vladimir On 5/18/15 6:57 AM, denis kononenko wrote: > Hi All, > > Could you please review a small change to the TEST.groups file. > > Webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ > Bug id: JDK-8077620 > Testing: automated > Description: > > The following tests cannot be run on compact profiles 1 and 2: > > compiler/jsr292/RedefineMethodUsedByMultipleMethodHandles.java: requires > java.lang.instrument package (JEP-161, available in Compact Profile 3); > compiler/rangechecks/TestRangeCheckSmearing.java: requires > sun.management package (JEP-161, any management packages available since > Compact Profile 3); > gc/TestGCLogRotationViaJcmd.java: uses ProcessTools which depends on > java.lang.management (JEP-161, available in Compact Profile 3); > serviceability/sa/jmap-hashcode/Test8028623.java: the same as above. > > The fix simply moves these tests into :needs_compact3 group. > > Thank you, > Denis.
> From michael.c.berg at intel.com Mon May 18 19:11:27 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Mon, 18 May 2015 19:11:27 +0000 Subject: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java In-Reply-To: <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> References: <55568CB8.3030102@oracle.com> <9C43EB57-487F-4D7E-929A-07F5154550A9@oracle.com> Message-ID: Yes, that will make the test correct. Thanks for catching this. -Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Igor Veresov Sent: Friday, May 15, 2015 5:38 PM To: Vladimir Kozlov Cc: hotspot compiler Subject: Re: [9] RFR(XS) 8080483: Incorrect test execution string at SumRed_Long.java Ok. igor > On May 15, 2015, at 5:18 PM, Vladimir Kozlov wrote: > > Small typo fix. Test run commands incorrectly execute SumRed_Double class instead of SumRed_Long. > > http://cr.openjdk.java.net/~kvn/8080483/webrev/ > https://bugs.openjdk.java.net/browse/JDK-8080483 > > Tested with jtreg. > > Thanks, > Vladimir From adinn at redhat.com Mon May 18 19:40:22 2015 From: adinn at redhat.com (Andrew Dinn) Date: Mon, 18 May 2015 20:40:22 +0100 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> Message-ID: <555A4026.3010303@redhat.com> On 18/05/15 12:53, Roland Westrelin wrote: > My change is broken. 
> GraphKit::make_load() takes 2 boolean arguments: bool depends_only_on_test = true, bool require_atomic_access = false) > and Parse::do_get_xxx() passes the value needs_atomic_access as argument depends_only_on_test instead of require_atomic_access. That goes unnoticed because of the default values. It's the exact reason I didn't use default arguments in my first webrev. It's too easy to make a mistake and not notice it. > If we keep default arguments, one way to have help from the compiler would be to use a 2 value enum: > > enum DependsOnlyOnTest { > DoesDependOnlyOnTest, > DoesNotDependOnlyOnTest > }; > > and use it to pass parameters around. The _depends_only_on_test LoadNode field would still be a boolean. Is this acceptable? That looks perfectly acceptable to me (not a reviewer but I was the one who set you off down this path). regards, Andrew Dinn ----------- From michael.c.berg at intel.com Mon May 18 20:13:29 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Mon, 18 May 2015 20:13:29 +0000 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: <5553C1E2.3040003@oracle.com> Message-ID: Hi Roland, If you make pre_end->cmp_node() a variable it would remove some overhead too like you did with main_cmp. Regards, Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Roland Westrelin Sent: Monday, May 18, 2015 5:20 AM To: hotspot compiler Subject: Re: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed Thanks for looking at this, Vladimir. > Why do you think that main- and post- loops will be empty too and it is safe to replace them with pre-loop? Why it is safe? > > Opaque1 node is used only when RCE and align is requested: > cmp_end->set_req(2, peel_only ?
pre_limit : pre_opaq); > > In peel_only case pre-loop has only 1 iteration and can be eliminated too. The assert could be hit in such case too. So your change may not fix the problem. In case of peel_only, we set: if( peel_only ) main_head->set_main_no_pre_loop(); and in IdealLoopTree::policy_range_check(): // If we unrolled with no intention of doing RCE and we later // changed our minds, we got no pre-loop. Either we need to // make a new pre-loop, or we gotta disallow RCE. if (cl->is_main_no_pre_loop()) return false; // Disallowed for now. > I think eliminating main- and post- is good investigation for separate RFE. And we need to make sure it is safe. My reasoning is that if the pre loop's limit is still an opaque node, then the compiler has not been able to optimize the pre loop based on its limited number of iterations yet. I don't see what optimization would remove stuff from the pre loop that wouldn't have removed the same stuff from the initial loop, had it not been cloned. I could be missing something, though. Do you see one? Roland. > > To fix this bug, I think, we should bailout from do_range_check() when we can't find pre-loop. > > Typo 'cab': > + // post loop cab be removed as well > > Thanks, > Vladimir > > > On 5/13/15 6:02 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8078866/webrev.00/ >> >> The assert fires when optimizing that code: >> >> static int remi_sumb() { >> Integer j = Integer.valueOf(1); >> for (int i = 0; i< 1000; i++) { >> j = j + 1; >> } >> return j; >> } >> >> The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there.
>> >> Here is the chain of events that leads to the pre loop being optimized out: >> >> 1- The CountedLoopNode is created >> 2- PhiNode::Value() for the induction variable's Phi is called and captures that 0 <= i < 1000 >> 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive values of j >> 4- pre-loop, main-loop, post-loop are created >> 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: >> >> public static Integer valueOf(int i) { >> if (i >= IntegerCache.low && i <= IntegerCache.high) >> return IntegerCache.cache[i + (-IntegerCache.low)]; >> return new Integer(i); >> } >> >> 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i >> 7- In the preloop, the range of values of i is known because of 2-, so the if of valueOf() can be optimized out >> 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() >> >> SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf. The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main loop depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. >> >> Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops.
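For readers who are not familiar with C2's loop splitting, the pre-/main-/post-loop structure discussed in this thread can be sketched at the Java source level. This is only a schematic illustration of the shape of the transformation, not the compiler's actual IR; the trip counts and the unroll factor below are made up for the example:

```java
public class LoopSplitSketch {
    // The loop as written: a single counted loop.
    static int original(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];
        return sum;
    }

    // Schematic pre/main/post split: the pre-loop runs a few iterations
    // (e.g. until memory accesses are aligned), the main loop runs the
    // bulk of the iterations unrolled, and the post-loop mops up the rest.
    static int split(int[] a) {
        int sum = 0, i = 0;
        int preLimit = Math.min(a.length, 3);          // illustrative pre-loop limit
        for (; i < preLimit; i++) sum += a[i];         // pre-loop
        int mainLimit = a.length - (a.length - i) % 4;
        for (; i < mainLimit; i += 4)                  // main loop, unrolled by 4
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < a.length; i++) sum += a[i];         // post-loop
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[13];
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(original(a) == split(a));
    }
}
```

In the bug under discussion the pre-loop turned out to be empty and was optimized away while the main loop still expected to find it, which is why the proposed fix removes the main and post loops together with an empty pre-loop.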
>> >> I don't have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don't optimize the loop out at all. >> >> Roland. >> From vladimir.kozlov at oracle.com Mon May 18 22:27:13 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 15:27:13 -0700 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555A4026.3010303@redhat.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> Message-ID: <555A6741.3030002@oracle.com> Acceptable. Thanks, Vladimir On 5/18/15 12:40 PM, Andrew Dinn wrote: > On 18/05/15 12:53, Roland Westrelin wrote: >> My change is broken. >> GraphKit::make_load() takes 2 boolean arguments: bool depends_only_on_test = true, bool require_atomic_access = false) >> and Parse::do_get_xxx() passes the value needs_atomic_access as argument depends_only_on_test instead of require_atomic_access. That goes unnoticed because of the default values. It's the exact reason I didn't use default arguments in my first webrev. It's too easy to make a mistake and not notice it. >> If we keep default arguments, one way to have help from the compiler would be to use a 2 value enum: >> >> enum DependsOnlyOnTest { >> DoesDependOnlyOnTest, >> DoesNotDependOnlyOnTest >> }; >> >> and use it to pass parameters around. The _depends_only_on_test LoadNode field would still be a boolean. Is this acceptable?
> > That looks perfectly acceptable to me (not a reviewer but I was the one > who set you off down this path). > > regards, > > > Andrew Dinn > ----------- > From shrinivas.joshi at oracle.com Mon May 18 23:13:41 2015 From: shrinivas.joshi at oracle.com (Shrinivas Joshi) Date: Mon, 18 May 2015 16:13:41 -0700 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC Message-ID: <555A7224.1080004@oracle.com> Hi, Please review this small change which sets default value of TypeProfileLevel to 111 on the SPARC platform. Bug: https://bugs.openjdk.java.net/browse/JDK-8080308 Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ Tested with JTREG, CTW, NSK and JPRT. Thanks, -Shrinivas From vladimir.kozlov at oracle.com Mon May 18 23:25:19 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 18 May 2015 16:25:19 -0700 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC In-Reply-To: <555A7224.1080004@oracle.com> References: <555A7224.1080004@oracle.com> Message-ID: <555A74DF.5040306@oracle.com> Looks fine to me but Roland should do final judgement. Thanks, Vladimir On 5/18/15 4:13 PM, Shrinivas Joshi wrote: > Hi, > > Please review this small change which sets default value of > TypeProfileLevel to 111 on the SPARC platform. > > Bug: https://bugs.openjdk.java.net/browse/JDK-8080308 > Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ > > Tested with JTREG, CTW, NSK and JPRT. > > Thanks, > -Shrinivas From shrinivas.joshi at oracle.com Mon May 18 23:56:05 2015 From: shrinivas.joshi at oracle.com (Shrinivas Joshi) Date: Mon, 18 May 2015 16:56:05 -0700 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC In-Reply-To: <555A74DF.5040306@oracle.com> References: <555A7224.1080004@oracle.com> <555A74DF.5040306@oracle.com> Message-ID: <555A7C15.4030208@oracle.com> Hi Vladimir, Thanks for the review. 
-Shrinivas On 5/18/2015 4:25 PM, Vladimir Kozlov wrote: > Looks fine to me but Roland should do final judgement. > > Thanks, > Vladimir > > On 5/18/15 4:13 PM, Shrinivas Joshi wrote: >> Hi, >> >> Please review this small change which sets default value of >> TypeProfileLevel to 111 on the SPARC platform. >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8080308 >> Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ >> >> Tested with JTREG, CTW, NSK and JPRT. >> >> Thanks, >> -Shrinivas > > From roland.westrelin at oracle.com Tue May 19 06:39:43 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Tue, 19 May 2015 08:39:43 +0200 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC In-Reply-To: <555A7224.1080004@oracle.com> References: <555A7224.1080004@oracle.com> Message-ID: > Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ Looks good to me. Thanks for finding and fixing that mistake. Roland. From david.holmes at oracle.com Tue May 19 07:49:30 2015 From: david.holmes at oracle.com (David Holmes) Date: Tue, 19 May 2015 17:49:30 +1000 Subject: RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 In-Reply-To: <5559EFB8.9000503@oracle.com> References: <5559EFB8.9000503@oracle.com> Message-ID: <555AEB0A.6090301@oracle.com> Looks fine to me. Thanks, David On 18/05/2015 11:57 PM, denis kononenko wrote: > Hi All, > > Could you please review a small change to the TEST.groups file. 
> > Webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ > Bug id: JDK-8077620 > Testing: automated > Description: > > The following tests cannot be run on compact profiles 1 and 2: > > compiler/jsr292/RedefineMethodUsedByMultipleMethodHandles.java: requires > java.lang.instrument package (JEP-161, available in Compact Profile 3); > compiler/rangechecks/TestRangeCheckSmearing.java: requires > sun.management package (JEP-161, any management packages available since > Compact Profile 3); > gc/TestGCLogRotationViaJcmd.java: uses ProcessTools which depends on > java.lang.management (JEP-161, available in Compact Profile 3); > serviceability/sa/jmap-hashcode/Test8028623.java: the same as above. > > The fix simply moves these tests into :needs_compact3 group. > > Thank you, > Denis. > From andreas.eriksson at oracle.com Tue May 19 10:56:20 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Tue, 19 May 2015 12:56:20 +0200 Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information Message-ID: <555B16D4.80008@oracle.com> Hi, Could I please get two reviews for the following bug fix. https://bugs.openjdk.java.net/browse/JDK-8060036 http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ The problem is with type propagation in C2. A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. 
The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. I've not been able to create a regression test that is reliable enough to check in. Thanks, Andreas From tangwei6 at huawei.com Tue May 19 12:58:08 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Tue, 19 May 2015 12:58:08 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? Message-ID: Hi All, I want to run a JAVA application on a performance simulator, and do a profiling on a hot function JITTed with C2 compiler. In order to make C2 compiler compile hot function as early as possible, I hope to reduce the threshold of function invocation count in interpreter and C1 to drive the JIT compiler transitioned to Level 4 (C2) ASAP. Following is the option list I try, but failed to find a right combination to meet my requirement. Anyone can help to figure out what options I can use? Thanks in advance. -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold -XX:Tier3MinInvocationThreshold -XX:Tier3CompileThreshold -XX:Tier4InvocationThreshold -XX:CompileThreshold -XX:Tier4MinInvocationThreshold -XX:Tier4CompileThreshold Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aleksey.shipilev at oracle.com Tue May 19 13:10:14 2015 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Tue, 19 May 2015 16:10:14 +0300 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: <555B3636.9010100@oracle.com> On 19.05.2015 15:58, Tangwei (Euler) wrote: > In order to make C2 compiler compile hot function as early as possible, > I hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can > help to figure out what options I can use? Cut the C1 from the compilation pipeline by disabling tiered compilation altogether? It will still profile the code with interpreter, and so compilation thresholds are still in effect: -XX:-TieredCompilation The same, but force runtime to compile the methods as soon as it calls into them, plus disable background compilation. This will stall the method execution until the compiled version is available. No profiling is done: -XX:-TieredCompilation -Xbatch -Xcomp But, there is intrinsic tradeoff between the time you spend profiling on lower compilation/interpretation levels and the profile accuracy. This translates to the tradeoff between the compiled code efficiency and time-to-performance. Skipping profiling altogether may and will have a detrimental effect on performance tests. Thanks, -Aleksey -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From vitalyd at gmail.com Tue May 19 13:32:51 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 09:32:51 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? 
In-Reply-To: References: Message-ID: Is your goal specifically to have C2 compile or just to reach peak performance quickly? It sounds like the latter. What values did you specify for the tier thresholds? Also, it may help you to -XX:+PrintCompilation to tune the flags as this will show you which methods are being compiled, when, and at what tier. sent from my phone On May 19, 2015 9:01 AM, "Tangwei (Euler)" wrote: > Hi All, > > I want to run a JAVA application on a performance simulator, and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as possible, I > hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can help > to figure out what options I can use? > > Thanks in advance. > > > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold > > -XX:Tier3MinInvocationThreshold > > -XX:Tier3CompileThreshold > > -XX:Tier4InvocationThreshold > > -XX:CompileThreshold > > -XX:Tier4MinInvocationThreshold > > -XX:Tier4CompileThreshold > > > > Regards! > > wei > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shrinivas.joshi at oracle.com Tue May 19 16:02:29 2015 From: shrinivas.joshi at oracle.com (Shrinivas Joshi) Date: Tue, 19 May 2015 09:02:29 -0700 Subject: RFR (XS) 8080308: Enable JSR292-only type profiling level on SPARC In-Reply-To: References: <555A7224.1080004@oracle.com> Message-ID: <555B5E95.3080403@oracle.com> Hi Roland, Thanks for the review. -Shrinivas On 5/18/2015 11:39 PM, Roland Westrelin wrote: >> Webrev: http://cr.openjdk.java.net/~roland/8080308/webrev.00/ > Looks good to me. Thanks for finding and fixing that mistake. > > Roland. 
> > From michael.c.berg at intel.com Tue May 19 18:24:21 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Tue, 19 May 2015 18:24:21 +0000 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555B16D4.80008@oracle.com> References: <555B16D4.80008@oracle.com> Message-ID: Hi Andreas, could you add a comment to the push statement: // Propagate change to user And to Node* p declaration and init: // Propagate changes to uses So that the new code carries all similar information. Else the changes seem ok. -Michael -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson Sent: Tuesday, May 19, 2015 3:56 AM To: hotspot-compiler-dev at openjdk.java.net Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information Hi, Could I please get two reviews for the following bug fix. https://bugs.openjdk.java.net/browse/JDK-8060036 http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ The problem is with type propagation in C2. A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. 
When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. I've not been able to create a regression test that is reliable enough to check in. Thanks, Andreas From michael.c.berg at intel.com Tue May 19 18:25:26 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Tue, 19 May 2015 18:25:26 +0000 Subject: RFR 8080325 SuperWord loop unrolling analysis In-Reply-To: References: Message-ID: Could I please get two reviews for superword unrolling analysis. Thanks, Michael From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Berg, Michael C Sent: Wednesday, May 13, 2015 6:27 PM To: hotspot-compiler-dev at openjdk.java.net Subject: RFR 8080325 SuperWord loop unrolling analysis Hi Folks, We (Intel) would like to contribute superword loop unrolling analysis to facilitate more default unrolling and larger SIMD vector mapping. Please review this patch and comment as needed: Bug-id: https://bugs.openjdk.java.net/browse/JDK-8080325 webrev: http://cr.openjdk.java.net/~kvn/8080325/webrev/ The design finds the highest common vector supported and implemented on a given machine and applies that to unrolling, iff it is greater than the default. If the user gates unrolling we will still obey the user directive. It's light weight, when we get to the analysis part, if we fail, we stop further tries. If we succeed we stop further tries. We generally always analyze only once. We then gate the unroll factor by extending the size of the unroll segment so that the iterative tries will keep unrolling, obtaining larger unrolls of a targeted loop. 
I see no negative behavior wrt to performance, and a modest positive swing in default behavior up to 1.5x for some micros. Vladimir Kozlov has offered to sponsor this patch. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Wed May 20 00:34:52 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 00:34:52 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555B3636.9010100@oracle.com> References: <555B3636.9010100@oracle.com> Message-ID: > -----Original Message----- > From: Aleksey Shipilev [mailto:aleksey.shipilev at oracle.com] > Sent: Tuesday, May 19, 2015 9:10 PM > To: Tangwei (Euler); hotspot-compiler-dev at openjdk.java.net > Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? > > On 19.05.2015 15:58, Tangwei (Euler) wrote: > > In order to make C2 compiler compile hot function as early as > > possible, I hope to reduce the threshold of function invocation > > > > count in interpreter and C1 to drive the JIT compiler transitioned to > > Level 4 (C2) ASAP. Following is the option list I try, but > > > > failed to find a right combination to meet my requirement. Anyone can > > help to figure out what options I can use? > > Cut the C1 from the compilation pipeline by disabling tiered compilation > altogether? It will still profile the code with interpreter, and so compilation > thresholds are still in effect: > -XX:-TieredCompilation > Is that possible to keep tiered compilation policy, and just change threshold value to reduce transition time between different execution level? > The same, but force runtime to compile the methods as soon as it calls into > them, plus disable background compilation. This will stall the method execution > until the compiled version is available. No profiling is done: > -XX:-TieredCompilation -Xbatch -Xcomp > I think this will compile all methods instead of some hot ones. 
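A minimal, hypothetical workload for experimenting with the flags discussed here (class name and iteration count are invented): the program hammers one small method so that, run with -XX:+PrintCompilation, its transitions through the tiers — or directly to C2 with -XX:-TieredCompilation — can be watched in the log.

```java
// Toy hot method for observing tier transitions, e.g.:
//   java -XX:+PrintCompilation HotLoop
//   java -XX:-TieredCompilation -XX:CompileThreshold=100 -XX:+PrintCompilation HotLoop
// PrintCompilation logs each compilation with its tier level (1-4).
public class HotLoop {
    static long mix(long x) {
        return x * 31 + (x >>> 7);  // cheap, easily inlinable arithmetic
    }

    public static void main(String[] args) {
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) {
            acc = mix(acc + i);  // keeps mix() hot enough to compile
        }
        System.out.println(acc);
    }
}
```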
> But, there is intrinsic tradeoff between the time you spend profiling on lower > compilation/interpretation levels and the profile accuracy. This translates to > the tradeoff between the compiled code efficiency and time-to-performance. > Skipping profiling altogether may and will have a detrimental effect on > performance tests. > > Thanks, > -Aleksey From tangwei6 at huawei.com Wed May 20 00:43:38 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 00:43:38 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: My goal is just to reach peak performance quickly. Following is one tier threshold combination I tried: -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold=3 -XX:Tier3MinInvocationThreshold=2 -XX:Tier3CompileThreshold=2 -XX:Tier4InvocationThreshold=4 -XX:Tier4MinInvocationThreshold=3 -XX:Tier4CompileThreshold=2 Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Tuesday, May 19, 2015 9:33 PM To: Tangwei (Euler) Cc: hotspot compiler Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? Is your goal specifically to have C2 compile or just to reach peak performance quickly? It sounds like the latter. What values did you specify for the tier thresholds? Also, it may help you to -XX:+PrintCompilation to tune the flags as this will show you which methods are being compiled, when, and at what tier. sent from my phone On May 19, 2015 9:01 AM, "Tangwei (Euler)" > wrote: Hi All, I want to run a JAVA application on a performance simulator, and do a profiling on a hot function JITTed with C2 compiler. In order to make C2 compiler compile hot function as early as possible, I hope to reduce the threshold of function invocation count in interpreter and C1 to drive the JIT compiler transitioned to Level 4 (C2) ASAP. Following is the option list I try, but failed to find a right combination to meet my requirement. 
Anyone can help to figure out what options I can use? Thanks in advance. -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold -XX:Tier3MinInvocationThreshold -XX:Tier3CompileThreshold -XX:Tier4InvocationThreshold -XX:CompileThreshold -XX:Tier4MinInvocationThreshold -XX:Tier4CompileThreshold Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Wed May 20 01:12:29 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 19 May 2015 18:12:29 -0700 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: <555BDF7D.4010002@oracle.com> If you want only C2 compilation you can disable tiered compilation (since you don't want to spend to much on profiling anyway) and set low CompileThreshold (you need only this one for non-tiered compile): -XX:-TieredCompilation -XX:CompileThreshold=100 If your method is simple and don't need profiling or inlining the generated code could be same quality as with long profiling. An other approach is to set threshold per method (scale all threasholds (C1, C2, interpreter) by this value). For example to reduce thresholds by half: -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 Vladimir On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > My goal is just to reach peak performance quickly. Following is one tier > threshold combination I tried: > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold=3 > > -XX:Tier3MinInvocationThreshold=2 > > -XX:Tier3CompileThreshold=2 > > -XX:Tier4InvocationThreshold=4 > > -XX:Tier4MinInvocationThreshold=3 > > -XX:Tier4CompileThreshold=2 > > Regards! > > wei > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > *Sent:* Tuesday, May 19, 2015 9:33 PM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* Re: How to change compilation policy to trigger C2 > compilation ASAP? 
> > Is your goal specifically to have C2 compile or just to reach peak > performance quickly? It sounds like the latter. What values did you > specify for the tier thresholds? Also, it may help you to > -XX:+PrintCompilation to tune the flags as this will show you which > methods are being compiled, when, and at what tier. > > sent from my phone > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > wrote: > > Hi All, > > I want to run a JAVA application on a performance simulator, and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as possible, > I hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can > help to figure out what options I can use? > > Thanks in advance. > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold > > -XX:Tier3MinInvocationThreshold > > -XX:Tier3CompileThreshold > > -XX:Tier4InvocationThreshold > > -XX:CompileThreshold > > -XX:Tier4MinInvocationThreshold > > -XX:Tier4CompileThreshold > > Regards! > > wei > From vitalyd at gmail.com Wed May 20 01:15:23 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 21:15:23 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555BDF7D.4010002@oracle.com> References: <555BDF7D.4010002@oracle.com> Message-ID: Vladimir, Is 100 actually generally good/decent based on our conversation from the other day regarding lack of scaling down certain inlining thresholds? 
Thanks sent from my phone On May 19, 2015 9:12 PM, "Vladimir Kozlov" wrote: > If you want only C2 compilation you can disable tiered compilation (since > you don't want to spend to much on profiling anyway) and set low > CompileThreshold (you need only this one for non-tiered compile): > > -XX:-TieredCompilation -XX:CompileThreshold=100 > > If your method is simple and don't need profiling or inlining the > generated code could be same quality as with long profiling. > > An other approach is to set threshold per method (scale all threasholds > (C1, C2, interpreter) by this value). For example to reduce thresholds by > half: > > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >> My goal is just to reach peak performance quickly. Following is one tier >> threshold combination I tried: >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold=3 >> >> -XX:Tier3MinInvocationThreshold=2 >> >> -XX:Tier3CompileThreshold=2 >> >> -XX:Tier4InvocationThreshold=4 >> >> -XX:Tier4MinInvocationThreshold=3 >> >> -XX:Tier4CompileThreshold=2 >> >> Regards! >> >> wei >> >> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >> *Sent:* Tuesday, May 19, 2015 9:33 PM >> *To:* Tangwei (Euler) >> *Cc:* hotspot compiler >> *Subject:* Re: How to change compilation policy to trigger C2 >> compilation ASAP? >> >> Is your goal specifically to have C2 compile or just to reach peak >> performance quickly? It sounds like the latter. What values did you >> specify for the tier thresholds? Also, it may help you to >> -XX:+PrintCompilation to tune the flags as this will show you which >> methods are being compiled, when, and at what tier. >> >> sent from my phone >> >> On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > wrote: >> >> Hi All, >> >> I want to run a JAVA application on a performance simulator, and do a >> profiling on a hot function JITTed with C2 compiler. 
>> >> In order to make C2 compiler compile hot function as early as possible, >> I hope to reduce the threshold of function invocation >> >> count in interpreter and C1 to drive the JIT compiler transitioned to >> Level 4 (C2) ASAP. Following is the option list I try, but >> >> failed to find a right combination to meet my requirement. Anyone can >> help to figure out what options I can use? >> >> Thanks in advance. >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold >> >> -XX:Tier3MinInvocationThreshold >> >> -XX:Tier3CompileThreshold >> >> -XX:Tier4InvocationThreshold >> >> -XX:CompileThreshold >> >> -XX:Tier4MinInvocationThreshold >> >> -XX:Tier4CompileThreshold >> >> Regards! >> >> wei >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 20 01:11:58 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 21:11:58 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: These thresholds are going to effectively compile the world, gated only by backlog that will surely ensue on the compiler thread queues? I would start with using default values, and tuning them down until perf is satisfactory. As you have it configured, I don't think tiered will help; in fact, it's probably worse than simply using C2 and lowering CompileThreshold. I think the threshold for C2 should probably depend on how well the code runs with C1, which is likely to be very application (and hardware, to some extent) specific. Are you using java 8? How many compiler threads for C1 and C2 are you using? By the way, I'm likely to do some tiered experiments myself soon, so this is new territory to me as well. At an initial glance, if default values aren't good, it seems to devolve into the same "black magic" as GC tuning. 
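One crude way to check whether a given flag combination actually shortens time-to-peak, sketched below under invented names: time fixed-size batches of the hot method and watch when the per-batch time levels off, which happens once the method has been JIT-compiled at its final tier. For serious measurements a harness such as JMH is the right tool.

```java
// Crude time-to-peak probe: run the hot method in fixed-size batches and
// print each batch's duration; batch times drop and then level off as the
// method moves through the compilation tiers. Method names are illustrative.
public class WarmupProbe {
    static double work(double x) {
        return Math.sqrt(x) + x * 0.5;  // stand-in for the real hot method
    }

    static double runBatch(int iters) {
        double acc = 0;
        for (int i = 0; i < iters; i++) acc += work(i);
        return acc;  // returned so the loop is not dead-code eliminated
    }

    public static void main(String[] args) {
        for (int batch = 0; batch < 10; batch++) {
            long t0 = System.nanoTime();
            double acc = runBatch(100_000);
            long us = (System.nanoTime() - t0) / 1_000;
            System.out.println("batch " + batch + ": " + us + " us (acc=" + acc + ")");
        }
    }
}
```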
sent from my phone On May 19, 2015 8:43 PM, "Tangwei (Euler)" wrote: > My goal is just to reach peak performance quickly. Following is one tier > threshold combination I tried: > > > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold=3 > > -XX:Tier3MinInvocationThreshold=2 > > -XX:Tier3CompileThreshold=2 > > -XX:Tier4InvocationThreshold=4 > > -XX:Tier4MinInvocationThreshold=3 > > -XX:Tier4CompileThreshold=2 > > > > Regards! > > wei > > > > > > *From:* Vitaly Davidovich [mailto:vitalyd at gmail.com] > *Sent:* Tuesday, May 19, 2015 9:33 PM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* Re: How to change compilation policy to trigger C2 compilation > ASAP? > > > > Is your goal specifically to have C2 compile or just to reach peak > performance quickly? It sounds like the latter. What values did you > specify for the tier thresholds? Also, it may help you to > -XX:+PrintCompilation to tune the flags as this will show you which methods > are being compiled, when, and at what tier. > > sent from my phone > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" wrote: > > Hi All, > > I want to run a JAVA application on a performance simulator, and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as possible, I > hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can help > to figure out what options I can use? > > Thanks in advance. > > > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold > > -XX:Tier3MinInvocationThreshold > > -XX:Tier3CompileThreshold > > -XX:Tier4InvocationThreshold > > -XX:CompileThreshold > > -XX:Tier4MinInvocationThreshold > > -XX:Tier4CompileThreshold > > > > Regards! 
> > wei > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Wed May 20 01:30:32 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 19 May 2015 18:30:32 -0700 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> Message-ID: <555BE3B8.1050801@oracle.com> I had said "if don't need profiling or inlining" :) Of cause low threashold affects quality of generated code. Especially if it has branches and calls. And you should not expect that you will get the same inlining. So profiling result of such code should be considered only as approximation. Vladimir On 5/19/15 6:15 PM, Vitaly Davidovich wrote: > Vladimir, > > Is 100 actually generally good/decent based on our conversation from the > other day regarding lack of scaling down certain inlining thresholds? > > Thanks > > sent from my phone > > On May 19, 2015 9:12 PM, "Vladimir Kozlov" > wrote: > > If you want only C2 compilation you can disable tiered compilation > (since you don't want to spend to much on profiling anyway) and set > low CompileThreshold (you need only this one for non-tiered compile): > > -XX:-TieredCompilation -XX:CompileThreshold=100 > > If your method is simple and don't need profiling or inlining the > generated code could be same quality as with long profiling. > > An other approach is to set threshold per method (scale all > threasholds (C1, C2, interpreter) by this value). For example to > reduce thresholds by half: > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > > My goal is just to reach peak performance quickly. 
Following is > one tier > threshold combination I tried: > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold=3 > > -XX:Tier3MinInvocationThreshold=2 > > -XX:Tier3CompileThreshold=2 > > -XX:Tier4InvocationThreshold=4 > > -XX:Tier4MinInvocationThreshold=3 > > -XX:Tier4CompileThreshold=2 > > Regards! > > wei > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com > ] > *Sent:* Tuesday, May 19, 2015 9:33 PM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* Re: How to change compilation policy to trigger C2 > compilation ASAP? > > Is your goal specifically to have C2 compile or just to reach peak > performance quickly? It sounds like the latter. What values did you > specify for the tier thresholds? Also, it may help you to > -XX:+PrintCompilation to tune the flags as this will show you which > methods are being compiled, when, and at what tier. > > sent from my phone > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > >> wrote: > > Hi All, > > I want to run a JAVA application on a performance simulator, > and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as > possible, > I hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler > transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. > Anyone can > help to figure out what options I can use? > > Thanks in advance. > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold > > -XX:Tier3MinInvocationThreshold > > -XX:Tier3CompileThreshold > > -XX:Tier4InvocationThreshold > > -XX:CompileThreshold > > -XX:Tier4MinInvocationThreshold > > -XX:Tier4CompileThreshold > > Regards! 
> > wei > From vitalyd at gmail.com Wed May 20 01:33:51 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 21:33:51 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555BE3B8.1050801@oracle.com> References: <555BDF7D.4010002@oracle.com> <555BE3B8.1050801@oracle.com> Message-ID: Unfortunately, most non-trivial java code needs profiling and inlining to get decent performance :). sent from my phone On May 19, 2015 9:30 PM, "Vladimir Kozlov" wrote: > I had said "if don't need profiling or inlining" :) > > Of cause low threashold affects quality of generated code. Especially if > it has branches and calls. And you should not expect that you will get the > same inlining. So profiling result of such code should be considered only > as approximation. > > Vladimir > > On 5/19/15 6:15 PM, Vitaly Davidovich wrote: > >> Vladimir, >> >> Is 100 actually generally good/decent based on our conversation from the >> other day regarding lack of scaling down certain inlining thresholds? >> >> Thanks >> >> sent from my phone >> >> On May 19, 2015 9:12 PM, "Vladimir Kozlov" > > wrote: >> >> If you want only C2 compilation you can disable tiered compilation >> (since you don't want to spend to much on profiling anyway) and set >> low CompileThreshold (you need only this one for non-tiered compile): >> >> -XX:-TieredCompilation -XX:CompileThreshold=100 >> >> If your method is simple and don't need profiling or inlining the >> generated code could be same quality as with long profiling. >> >> An other approach is to set threshold per method (scale all >> threasholds (C1, C2, interpreter) by this value). For example to >> reduce thresholds by half: >> >> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 >> >> Vladimir >> >> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >> >> My goal is just to reach peak performance quickly. 
Following is >> one tier >> threshold combination I tried: >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold=3 >> >> -XX:Tier3MinInvocationThreshold=2 >> >> -XX:Tier3CompileThreshold=2 >> >> -XX:Tier4InvocationThreshold=4 >> >> -XX:Tier4MinInvocationThreshold=3 >> >> -XX:Tier4CompileThreshold=2 >> >> Regards! >> >> wei >> >> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com >> ] >> *Sent:* Tuesday, May 19, 2015 9:33 PM >> *To:* Tangwei (Euler) >> *Cc:* hotspot compiler >> *Subject:* Re: How to change compilation policy to trigger C2 >> compilation ASAP? >> >> Is your goal specifically to have C2 compile or just to reach peak >> performance quickly? It sounds like the latter. What values did >> you >> specify for the tier thresholds? Also, it may help you to >> -XX:+PrintCompilation to tune the flags as this will show you >> which >> methods are being compiled, when, and at what tier. >> >> sent from my phone >> >> On May 19, 2015 9:01 AM, "Tangwei (Euler)" > >> >> wrote: >> >> Hi All, >> >> I want to run a JAVA application on a performance simulator, >> and do a >> profiling on a hot function JITTed with C2 compiler. >> >> In order to make C2 compiler compile hot function as early as >> possible, >> I hope to reduce the threshold of function invocation >> >> count in interpreter and C1 to drive the JIT compiler >> transitioned to >> Level 4 (C2) ASAP. Following is the option list I try, but >> >> failed to find a right combination to meet my requirement. >> Anyone can >> help to figure out what options I can use? >> >> Thanks in advance. >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold >> >> -XX:Tier3MinInvocationThreshold >> >> -XX:Tier3CompileThreshold >> >> -XX:Tier4InvocationThreshold >> >> -XX:CompileThreshold >> >> -XX:Tier4MinInvocationThreshold >> >> -XX:Tier4CompileThreshold >> >> Regards! >> >> wei >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From tangwei6 at huawei.com Wed May 20 01:41:44 2015
From: tangwei6 at huawei.com (Tangwei (Euler))
Date: Wed, 20 May 2015 01:41:44 +0000
Subject: How to change compilation policy to trigger C2 compilation ASAP?
In-Reply-To: 
References: 
Message-ID: 

Vitaly,
Yes, I agree with you that tuning is necessary to get peak performance. That's why I want to run the application on the simulator by changing thresholds to collect some performance data to guide optimization in the next step. I am using JAVA 8. The number of compiler threads to run is the default value 2.

Regards!
wei

From: Vitaly Davidovich [mailto:vitalyd at gmail.com]
Sent: Wednesday, May 20, 2015 9:12 AM
To: Tangwei (Euler)
Cc: hotspot compiler
Subject: RE: How to change compilation policy to trigger C2 compilation ASAP?

These thresholds are going to effectively compile the world, gated only by the backlog that will surely ensue on the compiler thread queues. I would start with using default values, and tuning them down until perf is satisfactory. As you have it configured, I don't think tiered will help; in fact, it's probably worse than simply using C2 and lowering CompileThreshold. I think the threshold for C2 should probably depend on how well the code runs with C1, which is likely to be very application (and hardware, to some extent) specific.

Are you using java 8? How many compiler threads for C1 and C2 are you using?

By the way, I'm likely to do some tiered experiments myself soon, so this is new territory to me as well. At an initial glance, if default values aren't good, it seems to devolve into the same "black magic" as GC tuning.
Following is one tier threshold combination I tried: -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold=3 -XX:Tier3MinInvocationThreshold=2 -XX:Tier3CompileThreshold=2 -XX:Tier4InvocationThreshold=4 -XX:Tier4MinInvocationThreshold=3 -XX:Tier4CompileThreshold=2 Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Tuesday, May 19, 2015 9:33 PM To: Tangwei (Euler) Cc: hotspot compiler Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? Is your goal specifically to have C2 compile or just to reach peak performance quickly? It sounds like the latter. What values did you specify for the tier thresholds? Also, it may help you to -XX:+PrintCompilation to tune the flags as this will show you which methods are being compiled, when, and at what tier. sent from my phone On May 19, 2015 9:01 AM, "Tangwei (Euler)" > wrote: Hi All, I want to run a JAVA application on a performance simulator, and do a profiling on a hot function JITTed with C2 compiler. In order to make C2 compiler compile hot function as early as possible, I hope to reduce the threshold of function invocation count in interpreter and C1 to drive the JIT compiler transitioned to Level 4 (C2) ASAP. Following is the option list I try, but failed to find a right combination to meet my requirement. Anyone can help to figure out what options I can use? Thanks in advance. -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold -XX:Tier3MinInvocationThreshold -XX:Tier3CompileThreshold -XX:Tier4InvocationThreshold -XX:CompileThreshold -XX:Tier4MinInvocationThreshold -XX:Tier4CompileThreshold Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 20 01:55:46 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 21:55:46 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? 
In-Reply-To: References: Message-ID: How many CPUs are available to your jvm? Perhaps try to also bump # of compiler threads and also bump ReservedCodeCacheSize for good measure. If you reduce thresholds to something reasonable, more compiler threads (assuming you have cores to run them on in parallel) should theoretically keep the backlog down and not stall compilations. Also, probably best to have more C2 compiler threads than C1 since those compilations take longer. If you do make any interesting discoveries, please do share :). sent from my phone On May 19, 2015 9:41 PM, "Tangwei (Euler)" wrote: > Vitaly, > > Yes, I agree with you that tuning is necessary to get peak performance. > That?s why I want to run the application on simulator by changing threshold > to > > collect some performance data to guide optimization in next step. I am > using JAVA 8. The number of compiler thread to run is default value 2. > > > > Regards! > > wei > > > > > > > > *From:* Vitaly Davidovich [mailto:vitalyd at gmail.com] > *Sent:* Wednesday, May 20, 2015 9:12 AM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* RE: How to change compilation policy to trigger C2 compilation > ASAP? > > > > These thresholds are going to effectively compile the world, gated only by > backlog that will surely ensue on the compiler thread queues? I would start > with using default values, and tuning them down until perf is > satisfactory. As you have it configured, I don't think tiered will help; > in fact, it's probably worse than simply using C2 and lowering > CompileThreshold. I think the threshold for C2 should probably depend on > how well the code runs with C1, which is likely to be very application (and > hardware, to some extent) specific. > > Are you using java 8? How many compiler threads for C1 and C2 are you > using? > > By the way, I'm likely to do some tiered experiments myself soon, so this > is new territory to me as well. 
At an initial glance, if default values > aren't good, it seems to devolve into the same "black magic" as GC tuning. > > sent from my phone > > On May 19, 2015 8:43 PM, "Tangwei (Euler)" wrote: > > My goal is just to reach peak performance quickly. Following is one tier > threshold combination I tried: > > > > -XX:Tier0ProfilingStartPercentage=0 > > -XX:Tier3InvocationThreshold=3 > > -XX:Tier3MinInvocationThreshold=2 > > -XX:Tier3CompileThreshold=2 > > -XX:Tier4InvocationThreshold=4 > > -XX:Tier4MinInvocationThreshold=3 > > -XX:Tier4CompileThreshold=2 > > > > Regards! > > wei > > > > > > *From:* Vitaly Davidovich [mailto:vitalyd at gmail.com] > *Sent:* Tuesday, May 19, 2015 9:33 PM > *To:* Tangwei (Euler) > *Cc:* hotspot compiler > *Subject:* Re: How to change compilation policy to trigger C2 compilation > ASAP? > > > > Is your goal specifically to have C2 compile or just to reach peak > performance quickly? It sounds like the latter. What values did you > specify for the tier thresholds? Also, it may help you to > -XX:+PrintCompilation to tune the flags as this will show you which methods > are being compiled, when, and at what tier. > > sent from my phone > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" wrote: > > Hi All, > > I want to run a JAVA application on a performance simulator, and do a > profiling on a hot function JITTed with C2 compiler. > > In order to make C2 compiler compile hot function as early as possible, I > hope to reduce the threshold of function invocation > > count in interpreter and C1 to drive the JIT compiler transitioned to > Level 4 (C2) ASAP. Following is the option list I try, but > > failed to find a right combination to meet my requirement. Anyone can help > to figure out what options I can use? > > Thanks in advance. 
> > -XX:Tier0ProfilingStartPercentage=0
> > -XX:Tier3InvocationThreshold
> > -XX:Tier3MinInvocationThreshold
> > -XX:Tier3CompileThreshold
> > -XX:Tier4InvocationThreshold
> > -XX:CompileThreshold
> > -XX:Tier4MinInvocationThreshold
> > -XX:Tier4CompileThreshold
> >
> > Regards!
> > wei

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tangwei6 at huawei.com Wed May 20 02:00:39 2015
From: tangwei6 at huawei.com (Tangwei (Euler))
Date: Wed, 20 May 2015 02:00:39 +0000
Subject: How to change compilation policy to trigger C2 compilation ASAP?
In-Reply-To: <555BDF7D.4010002@oracle.com>
References: <555BDF7D.4010002@oracle.com>
Message-ID: 

Vladimir,
What does the 'double' mean in the following command line? Is the CompileThresholdScaling option you mentioned the same as ProfileMaturityPercentage? I cannot find the option in the code. How much does turning off tiered compilation affect function inlining?

> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5

Regards!
wei

> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Wednesday, May 20, 2015 9:12 AM
> To: Tangwei (Euler); Vitaly Davidovich
> Cc: hotspot compiler
> Subject: Re: How to change compilation policy to trigger C2 compilation ASAP?
>
> If you want only C2 compilation you can disable tiered compilation (since you
> don't want to spend to much on profiling anyway) and set low
> CompileThreshold (you need only this one for non-tiered compile):
>
> -XX:-TieredCompilation -XX:CompileThreshold=100
>
> If your method is simple and don't need profiling or inlining the generated code
> could be same quality as with long profiling.
>
> An other approach is to set threshold per method (scale all threasholds (C1, C2,
> interpreter) by this value).
For example to reduce thresholds by half: > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > oldScaling,0.5 > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > > My goal is just to reach peak performance quickly. Following is one > > tier threshold combination I tried: > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > -XX:Tier3InvocationThreshold=3 > > > > -XX:Tier3MinInvocationThreshold=2 > > > > -XX:Tier3CompileThreshold=2 > > > > -XX:Tier4InvocationThreshold=4 > > > > -XX:Tier4MinInvocationThreshold=3 > > > > -XX:Tier4CompileThreshold=2 > > > > Regards! > > > > wei > > > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > > *Sent:* Tuesday, May 19, 2015 9:33 PM > > *To:* Tangwei (Euler) > > *Cc:* hotspot compiler > > *Subject:* Re: How to change compilation policy to trigger C2 > > compilation ASAP? > > > > Is your goal specifically to have C2 compile or just to reach peak > > performance quickly? It sounds like the latter. What values did you > > specify for the tier thresholds? Also, it may help you to > > -XX:+PrintCompilation to tune the flags as this will show you which > > methods are being compiled, when, and at what tier. > > > > sent from my phone > > > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > wrote: > > > > Hi All, > > > > I want to run a JAVA application on a performance simulator, and do > > a profiling on a hot function JITTed with C2 compiler. > > > > In order to make C2 compiler compile hot function as early as > > possible, I hope to reduce the threshold of function invocation > > > > count in interpreter and C1 to drive the JIT compiler transitioned to > > Level 4 (C2) ASAP. Following is the option list I try, but > > > > failed to find a right combination to meet my requirement. Anyone can > > help to figure out what options I can use? > > > > Thanks in advance. 
> > > > -XX:Tier0ProfilingStartPercentage=0 > > > > -XX:Tier3InvocationThreshold > > > > -XX:Tier3MinInvocationThreshold > > > > -XX:Tier3CompileThreshold > > > > -XX:Tier4InvocationThreshold > > > > -XX:CompileThreshold > > > > -XX:Tier4MinInvocationThreshold > > > > -XX:Tier4CompileThreshold > > > > Regards! > > > > wei > > From tangwei6 at huawei.com Wed May 20 02:06:38 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 02:06:38 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: Message-ID: Our real machine has 16 CPU cores, but unfortunately, our simulator can only support one thread now. So it doesn't work for me to increase compiler threads for now. I am glad to share if I find something interesting. Thanks for your help! Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Wednesday, May 20, 2015 9:56 AM To: Tangwei (Euler) Cc: hotspot compiler Subject: RE: How to change compilation policy to trigger C2 compilation ASAP? How many CPUs are available to your JVM? Perhaps try to also bump # of compiler threads and also bump ReservedCodeCacheSize for good measure. If you reduce thresholds to something reasonable, more compiler threads (assuming you have cores to run them on in parallel) should theoretically keep the backlog down and not stall compilations. Also, probably best to have more C2 compiler threads than C1 since those compilations take longer. If you do make any interesting discoveries, please do share :). sent from my phone On May 19, 2015 9:41 PM, "Tangwei (Euler)" > wrote: Vitaly, Yes, I agree with you that tuning is necessary to get peak performance. That's why I want to run the application on the simulator with changed thresholds to collect some performance data to guide optimization in the next step. I am using JAVA 8. The number of compiler threads to run is the default value, 2. Regards! 
wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Wednesday, May 20, 2015 9:12 AM To: Tangwei (Euler) Cc: hotspot compiler Subject: RE: How to change compilation policy to trigger C2 compilation ASAP? These thresholds are going to effectively compile the world, gated only by backlog that will surely ensue on the compiler thread queues? I would start with using default values, and tuning them down until perf is satisfactory. As you have it configured, I don't think tiered will help; in fact, it's probably worse than simply using C2 and lowering CompileThreshold. I think the threshold for C2 should probably depend on how well the code runs with C1, which is likely to be very application (and hardware, to some extent) specific. Are you using java 8? How many compiler threads for C1 and C2 are you using? By the way, I'm likely to do some tiered experiments myself soon, so this is new territory to me as well. At an initial glance, if default values aren't good, it seems to devolve into the same "black magic" as GC tuning. sent from my phone On May 19, 2015 8:43 PM, "Tangwei (Euler)" > wrote: My goal is just to reach peak performance quickly. Following is one tier threshold combination I tried: -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold=3 -XX:Tier3MinInvocationThreshold=2 -XX:Tier3CompileThreshold=2 -XX:Tier4InvocationThreshold=4 -XX:Tier4MinInvocationThreshold=3 -XX:Tier4CompileThreshold=2 Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Tuesday, May 19, 2015 9:33 PM To: Tangwei (Euler) Cc: hotspot compiler Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? Is your goal specifically to have C2 compile or just to reach peak performance quickly? It sounds like the latter. What values did you specify for the tier thresholds? Also, it may help you to -XX:+PrintCompilation to tune the flags as this will show you which methods are being compiled, when, and at what tier. 
sent from my phone On May 19, 2015 9:01 AM, "Tangwei (Euler)" > wrote: Hi All, I want to run a JAVA application on a performance simulator, and do a profiling on a hot function JITTed with C2 compiler. In order to make C2 compiler compile hot function as early as possible, I hope to reduce the threshold of function invocation count in interpreter and C1 to drive the JIT compiler transitioned to Level 4 (C2) ASAP. Following is the option list I try, but failed to find a right combination to meet my requirement. Anyone can help to figure out what options I can use? Thanks in advance. -XX:Tier0ProfilingStartPercentage=0 -XX:Tier3InvocationThreshold -XX:Tier3MinInvocationThreshold -XX:Tier3CompileThreshold -XX:Tier4InvocationThreshold -XX:CompileThreshold -XX:Tier4MinInvocationThreshold -XX:Tier4CompileThreshold Regards! wei -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 20 02:11:17 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 19 May 2015 22:11:17 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> Message-ID: I'll let Vladimir comment on the compile command, but turning off tiered doesn't prevent inlining. What prevents inlining (an otherwise inlineable method), in a nutshell, is lack of profiling information that would indicate the method or callsite is hot enough for inlining. So you can have tiered off and run C2 with standard compilation thresholds, and you'll likely get good inlining because you'll have a long profile. If you turn C2 compile threshold down too far (and too far is going to be app-specific), C2 compilation may not have sufficient info in the shortened profile to decide to inline. Or even worse, the profile collected is actually not reflective of the real profile you want to capture (e.g. app has a phase change after initialization). 
The theoretical advantage of tiered is you can get decent perf quickly, but also since you're getting better than interpreter speed quickly, you can run in C1 tier longer and thus collect an even larger profile, possibly leading to better code than C2 alone. But unfortunately this may require serious tuning. That's my understanding at least. sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" wrote: > Vladimir, > What the 'double' means in following command line? The option > CompileThresholdScaling you mentioned is same as ProfileMaturityPercentage? > I cannot find the option in code. How much effect to function inlining by > turning off tiered compilation? > > > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 > > Regards! > wei > > > -----Original Message----- > > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > > Sent: Wednesday, May 20, 2015 9:12 AM > > To: Tangwei (Euler); Vitaly Davidovich > > Cc: hotspot compiler > > Subject: Re: How to change compilation policy to trigger C2 compilation > ASAP? > > > > If you want only C2 compilation you can disable tiered compilation > (since you > > don't want to spend to much on profiling anyway) and set low > > CompileThreshold (you need only this one for non-tiered compile): > > > > -XX:-TieredCompilation -XX:CompileThreshold=100 > > > > If your method is simple and don't need profiling or inlining the > generated code > > could be same quality as with long profiling. > > > > An other approach is to set threshold per method (scale all threasholds > (C1, C2, > > interpreter) by this value). For example to reduce thresholds by half: > > > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > > oldScaling,0.5 > > > > Vladimir > > > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > > > My goal is just to reach peak performance quickly. 
Following is one > > > tier threshold combination I tried: > > > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > > > -XX:Tier3InvocationThreshold=3 > > > > > > -XX:Tier3MinInvocationThreshold=2 > > > > > > -XX:Tier3CompileThreshold=2 > > > > > > -XX:Tier4InvocationThreshold=4 > > > > > > -XX:Tier4MinInvocationThreshold=3 > > > > > > -XX:Tier4CompileThreshold=2 > > > > > > Regards! > > > > > > wei > > > > > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > > > *Sent:* Tuesday, May 19, 2015 9:33 PM > > > *To:* Tangwei (Euler) > > > *Cc:* hotspot compiler > > > *Subject:* Re: How to change compilation policy to trigger C2 > > > compilation ASAP? > > > > > > Is your goal specifically to have C2 compile or just to reach peak > > > performance quickly? It sounds like the latter. What values did you > > > specify for the tier thresholds? Also, it may help you to > > > -XX:+PrintCompilation to tune the flags as this will show you which > > > methods are being compiled, when, and at what tier. > > > > > > sent from my phone > > > > > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > > wrote: > > > > > > Hi All, > > > > > > I want to run a JAVA application on a performance simulator, and do > > > a profiling on a hot function JITTed with C2 compiler. > > > > > > In order to make C2 compiler compile hot function as early as > > > possible, I hope to reduce the threshold of function invocation > > > > > > count in interpreter and C1 to drive the JIT compiler transitioned to > > > Level 4 (C2) ASAP. Following is the option list I try, but > > > > > > failed to find a right combination to meet my requirement. Anyone can > > > help to figure out what options I can use? > > > > > > Thanks in advance. 
> > > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > > > -XX:Tier3InvocationThreshold > > > > > > -XX:Tier3MinInvocationThreshold > > > > > > -XX:Tier3CompileThreshold > > > > > > -XX:Tier4InvocationThreshold > > > > > > -XX:CompileThreshold > > > > > > -XX:Tier4MinInvocationThreshold > > > > > > -XX:Tier4CompileThreshold > > > > > > Regards! > > > > > > wei > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tangwei6 at huawei.com Wed May 20 02:29:41 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 02:29:41 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> Message-ID: Got it! Thanks a lot! Regards! wei From: Vitaly Davidovich [mailto:vitalyd at gmail.com] Sent: Wednesday, May 20, 2015 10:11 AM To: Tangwei (Euler) Cc: hotspot compiler; Vladimir Kozlov Subject: RE: How to change compilation policy to trigger C2 compilation ASAP? I'll let Vladimir comment on the compile command, but turning off tiered doesn't prevent inlining. What prevents inlining (an otherwise inlineable method), in a nutshell, is lack of profiling information that would indicate the method or callsite is hot enough for inlining. So you can have tiered off and run C2 with standard compilation thresholds, and you'll likely get good inlining because you'll have a long profile. If you turn C2 compile threshold down too far (and too far is going to be app-specific), C2 compilation may not have sufficient info in the shortened profile to decide to inline. Or even worse, the profile collected is actually not reflective of the real profile you want to capture (e.g. app has a phase change after initialization). 
The theoretical advantage of tiered is you can get decent perf quickly, but also since you're getting better than interpreter speed quickly, you can run in C1 tier longer and thus collect an even larger profile, possibly leading to better code than C2 alone. But unfortunately this may require serious tuning. That's my understanding at least. sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > wrote: Vladimir, What the 'double' means in following command line? The option CompileThresholdScaling you mentioned is same as ProfileMaturityPercentage? I cannot find the option in code. How much effect to function inlining by turning off tiered compilation? > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 Regards! wei > -----Original Message----- > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > Sent: Wednesday, May 20, 2015 9:12 AM > To: Tangwei (Euler); Vitaly Davidovich > Cc: hotspot compiler > Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? > > If you want only C2 compilation you can disable tiered compilation (since you > don't want to spend to much on profiling anyway) and set low > CompileThreshold (you need only this one for non-tiered compile): > > -XX:-TieredCompilation -XX:CompileThreshold=100 > > If your method is simple and don't need profiling or inlining the generated code > could be same quality as with long profiling. > > An other approach is to set threshold per method (scale all threasholds (C1, C2, > interpreter) by this value). For example to reduce thresholds by half: > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > oldScaling,0.5 > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > > My goal is just to reach peak performance quickly. 
Following is one > > tier threshold combination I tried: > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > -XX:Tier3InvocationThreshold=3 > > > > -XX:Tier3MinInvocationThreshold=2 > > > > -XX:Tier3CompileThreshold=2 > > > > -XX:Tier4InvocationThreshold=4 > > > > -XX:Tier4MinInvocationThreshold=3 > > > > -XX:Tier4CompileThreshold=2 > > > > Regards! > > > > wei > > > > *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > > *Sent:* Tuesday, May 19, 2015 9:33 PM > > *To:* Tangwei (Euler) > > *Cc:* hotspot compiler > > *Subject:* Re: How to change compilation policy to trigger C2 > > compilation ASAP? > > > > Is your goal specifically to have C2 compile or just to reach peak > > performance quickly? It sounds like the latter. What values did you > > specify for the tier thresholds? Also, it may help you to > > -XX:+PrintCompilation to tune the flags as this will show you which > > methods are being compiled, when, and at what tier. > > > > sent from my phone > > > > On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > >> wrote: > > > > Hi All, > > > > I want to run a JAVA application on a performance simulator, and do > > a profiling on a hot function JITTed with C2 compiler. > > > > In order to make C2 compiler compile hot function as early as > > possible, I hope to reduce the threshold of function invocation > > > > count in interpreter and C1 to drive the JIT compiler transitioned to > > Level 4 (C2) ASAP. Following is the option list I try, but > > > > failed to find a right combination to meet my requirement. Anyone can > > help to figure out what options I can use? > > > > Thanks in advance. > > > > -XX:Tier0ProfilingStartPercentage=0 > > > > -XX:Tier3InvocationThreshold > > > > -XX:Tier3MinInvocationThreshold > > > > -XX:Tier3CompileThreshold > > > > -XX:Tier4InvocationThreshold > > > > -XX:CompileThreshold > > > > -XX:Tier4MinInvocationThreshold > > > > -XX:Tier4CompileThreshold > > > > Regards! 
> > > > wei > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zoltan.majo at oracle.com Wed May 20 06:47:17 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Wed, 20 May 2015 08:47:17 +0200 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555BDF7D.4010002@oracle.com> References: <555BDF7D.4010002@oracle.com> Message-ID: <555C2DF5.1050600@oracle.com> Hi Wei, On 05/20/2015 03:12 AM, Vladimir Kozlov wrote: [...] > An other approach is to set threshold per method (scale all > threasholds (C1, C2, interpreter) by this value). For example to > reduce thresholds by half: > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5 > just to complement what Vladimir suggested: If you want to scale thresholds for all methods at once, the CompileThresholdScaling can be also used globally. For example, -XX:CompileThresholdScaling=0.5 would reduce all thresholds by half (both with tiered and non-tiered modes of operation). Best regards, Zoltan > > Vladimir > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >> My goal is just to reach peak performance quickly. Following is one tier >> threshold combination I tried: >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold=3 >> >> -XX:Tier3MinInvocationThreshold=2 >> >> -XX:Tier3CompileThreshold=2 >> >> -XX:Tier4InvocationThreshold=4 >> >> -XX:Tier4MinInvocationThreshold=3 >> >> -XX:Tier4CompileThreshold=2 >> >> Regards! >> >> wei >> >> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >> *Sent:* Tuesday, May 19, 2015 9:33 PM >> *To:* Tangwei (Euler) >> *Cc:* hotspot compiler >> *Subject:* Re: How to change compilation policy to trigger C2 >> compilation ASAP? >> >> Is your goal specifically to have C2 compile or just to reach peak >> performance quickly? It sounds like the latter. What values did you >> specify for the tier thresholds? 
Also, it may help you to >> -XX:+PrintCompilation to tune the flags as this will show you which >> methods are being compiled, when, and at what tier. >> >> sent from my phone >> >> On May 19, 2015 9:01 AM, "Tangwei (Euler)" > > wrote: >> >> Hi All, >> >> I want to run a JAVA application on a performance simulator, and do a >> profiling on a hot function JITTed with C2 compiler. >> >> In order to make C2 compiler compile hot function as early as possible, >> I hope to reduce the threshold of function invocation >> >> count in interpreter and C1 to drive the JIT compiler transitioned to >> Level 4 (C2) ASAP. Following is the option list I try, but >> >> failed to find a right combination to meet my requirement. Anyone can >> help to figure out what options I can use? >> >> Thanks in advance. >> >> -XX:Tier0ProfilingStartPercentage=0 >> >> -XX:Tier3InvocationThreshold >> >> -XX:Tier3MinInvocationThreshold >> >> -XX:Tier3CompileThreshold >> >> -XX:Tier4InvocationThreshold >> >> -XX:CompileThreshold >> >> -XX:Tier4MinInvocationThreshold >> >> -XX:Tier4CompileThreshold >> >> Regards! >> >> wei >> From roland.westrelin at oracle.com Wed May 20 07:14:25 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 20 May 2015 09:14:25 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <5559E3A5.8030108@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> <5559E3A5.8030108@oracle.com> Message-ID: > What do you think about the following version: > http://cr.openjdk.java.net/~vlivanov/8079205/webrev.04 Still looks good to me. Roland. 
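The -XX:+PrintCompilation suggestion from the compilation-policy thread above can be tried with a small stand-alone harness. The class below is an illustrative sketch, not code from the thread; the class name, the hot method, and the iteration count are assumptions chosen for the example:

```java
// Minimal harness for watching JIT activity, meant to be run as:
//
//   java -XX:+PrintCompilation Hotter
//
// With tiered compilation enabled, each printed compilation line includes the
// tier level, so you can see when hotMethod() moves from the C1 levels up to
// level 4 (C2), and compare how quickly that happens under the different
// threshold flags discussed in the thread.
public class Hotter {
    // The method we want to see compiled; kept small so calls are cheap.
    static long hotMethod(long x) {
        return x * 1664525L + 1013904223L;   // one LCG step, just arithmetic
    }

    public static void main(String[] args) {
        long sink = 0;
        // Enough invocations to pass the default compile thresholds.
        for (int i = 0; i < 100_000; i++) {
            sink += hotMethod(i);
        }
        System.out.println(sink);   // use the result so the loop stays live
    }
}
```

Run once with default flags and once with, e.g., -XX:-TieredCompilation -XX:CompileThreshold=100 to see the effect on when the method reaches C2.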
From cnewland at chrisnewland.com Wed May 20 07:46:19 2015 From: cnewland at chrisnewland.com (Chris Newland) Date: Wed, 20 May 2015 08:46:19 +0100 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> Message-ID: <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> Hi Wei, Is there any reason why you need to cold-start your application each time and begin with no profiling information? Depending on your input data, would it be possible to "warm up" the VM by running typical inputs through your code to build a rich profile and have the JIT compilations performed before your real data arrives? One reason *against* doing this would be wide variations in your input data that could result in the wrong optimisations being made and possibly suffering a decompilation event. There are commercial VMs that have technology to mitigate cold-start costs (for example Azul Zing's ReadyNow). Have you examined the HotSpot LogCompilation output to make sure there is nothing you could change at source code level which would result in a better JIT decisions? I'm thinking of things like inlining failures due to call-site megamorphism. There's a free tool called JITWatch that aims to make it easier to understand the LogCompilation output (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch - I'm the author). Regards, Chris @chriswhocodes On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: > I'll let Vladimir comment on the compile command, but turning off tiered > doesn't prevent inlining. What prevents inlining (an otherwise inlineable > method), in a nutshell, is lack of profiling information that would > indicate the method or callsite is hot enough for inlining. So you can > have tiered off and run C2 with standard compilation thresholds, and > you'll likely get good inlining because you'll have a long profile. 
If > you turn C2 compile threshold down too far (and too far is going to be > app-specific), C2 compilation may not have sufficient info in the > shortened profile to decide to inline. Or even worse, the profile > collected is actually not reflective of the real profile you want to > capture (e.g. app has a phase change after initialization). > > The theoretical advantage of tiered is you can get decent perf quickly, > but also since you're getting better than interpreter speed quickly, you > can run in C1 tier longer and thus collect an even larger profile, > possibly leading to better code than C2 alone. But unfortunately this may > require serious tuning. That's my understanding at least. > > sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > wrote: > > >> Vladimir, >> What the 'double' means in following command line? The option >> CompileThresholdScaling you mentioned is same as >> ProfileMaturityPercentage? >> I cannot find the option in code. How much effect to function inlining >> by turning off tiered compilation? >> >>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdS >> caling,0.5 >> >> Regards! >> wei >> >>> -----Original Message----- >>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>> Sent: Wednesday, May 20, 2015 9:12 AM >>> To: Tangwei (Euler); Vitaly Davidovich >>> Cc: hotspot compiler >>> Subject: Re: How to change compilation policy to trigger C2 >>> compilation >> ASAP? >> >>> >>> If you want only C2 compilation you can disable tiered compilation >>> >> (since you >> >>> don't want to spend to much on profiling anyway) and set low >>> CompileThreshold (you need only this one for non-tiered compile): >>> >>> >>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>> >>> >>> If your method is simple and don't need profiling or inlining the >>> >> generated code >>> could be same quality as with long profiling. 
>>> >>> An other approach is to set threshold per method (scale all >>> threasholds >> (C1, C2, >> >>> interpreter) by this value). For example to reduce thresholds by >>> half: >>> >>> >>> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>> oldScaling,0.5 >>> >>> Vladimir >>> >>> >>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>> >>>> My goal is just to reach peak performance quickly. Following is one >>>> tier threshold combination I tried: >>>> >>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>> >>>> -XX:Tier3InvocationThreshold=3 >>>> >>>> >>>> -XX:Tier3MinInvocationThreshold=2 >>>> >>>> >>>> -XX:Tier3CompileThreshold=2 >>>> >>>> >>>> -XX:Tier4InvocationThreshold=4 >>>> >>>> >>>> -XX:Tier4MinInvocationThreshold=3 >>>> >>>> >>>> -XX:Tier4CompileThreshold=2 >>>> >>>> >>>> Regards! >>>> >>>> >>>> wei >>>> >>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>> *To:* Tangwei (Euler) >>>> *Cc:* hotspot compiler >>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>> compilation ASAP? >>>> >>>> Is your goal specifically to have C2 compile or just to reach peak >>>> performance quickly? It sounds like the latter. What values did you >>>> specify for the tier thresholds? Also, it may help you to >>>> -XX:+PrintCompilation to tune the flags as this will show you which >>>> methods are being compiled, when, and at what tier. >>>> >>>> sent from my phone >>>> >>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>> > wrote: >>>> >>>> >>>> Hi All, >>>> >>>> >>>> I want to run a JAVA application on a performance simulator, and do >>>> a profiling on a hot function JITTed with C2 compiler. >>>> >>>> In order to make C2 compiler compile hot function as early as >>>> possible, I hope to reduce the threshold of function invocation >>>> >>>> count in interpreter and C1 to drive the JIT compiler transitioned >>>> to Level 4 (C2) ASAP. 
Following is the option list I try, but >>>> >>>> >>>> failed to find a right combination to meet my requirement. Anyone >>>> can help to figure out what options I can use? >>>> >>>> Thanks in advance. >>>> >>>> >>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>> >>>> -XX:Tier3InvocationThreshold >>>> >>>> >>>> -XX:Tier3MinInvocationThreshold >>>> >>>> >>>> -XX:Tier3CompileThreshold >>>> >>>> >>>> -XX:Tier4InvocationThreshold >>>> >>>> >>>> -XX:CompileThreshold >>>> >>>> >>>> -XX:Tier4MinInvocationThreshold >>>> >>>> >>>> -XX:Tier4CompileThreshold >>>> >>>> >>>> Regards! >>>> >>>> >>>> wei >>>> >> > From tangwei6 at huawei.com Wed May 20 08:07:34 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 08:07:34 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555C2DF5.1050600@oracle.com> References: <555BDF7D.4010002@oracle.com> <555C2DF5.1050600@oracle.com> Message-ID: Hi Zoltan, Is this option supported in OpenJDK8? I cannot find it by grepping the output of the command line 'java -XX:+PrintFlagsFinal'. Regards! wei > -----Original Message----- > From: hotspot-compiler-dev > [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Zoltán > Majó > Sent: Wednesday, May 20, 2015 2:47 PM > To: hotspot-compiler-dev at openjdk.java.net > Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? > > Hi Wei, > > > On 05/20/2015 03:12 AM, Vladimir Kozlov wrote: > [...] > > Another approach is to set the threshold per method (scale all > > thresholds (C1, C2, interpreter) by this value). For example to > > reduce thresholds by half: > > > > > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > old > > Scaling,0.5 > > > > just to complement what Vladimir suggested: If you want to scale thresholds > for all methods at once, the CompileThresholdScaling can also be used globally. 
> For example, -XX:CompileThresholdScaling=0.5 would reduce all thresholds by > half (both with tiered and non-tiered modes of operation). > > Best regards, > > > Zoltan > > > > > Vladimir > > > > On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >> My goal is just to reach peak performance quickly. Following is one > >> tier threshold combination I tried: > >> > >> -XX:Tier0ProfilingStartPercentage=0 > >> > >> -XX:Tier3InvocationThreshold=3 > >> > >> -XX:Tier3MinInvocationThreshold=2 > >> > >> -XX:Tier3CompileThreshold=2 > >> > >> -XX:Tier4InvocationThreshold=4 > >> > >> -XX:Tier4MinInvocationThreshold=3 > >> > >> -XX:Tier4CompileThreshold=2 > >> > >> Regards! > >> > >> wei > >> > >> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >> *Sent:* Tuesday, May 19, 2015 9:33 PM > >> *To:* Tangwei (Euler) > >> *Cc:* hotspot compiler > >> *Subject:* Re: How to change compilation policy to trigger C2 > >> compilation ASAP? > >> > >> Is your goal specifically to have C2 compile or just to reach peak > >> performance quickly? It sounds like the latter. What values did you > >> specify for the tier thresholds? Also, it may help you to > >> -XX:+PrintCompilation to tune the flags as this will show you which > >> methods are being compiled, when, and at what tier. > >> > >> sent from my phone > >> > >> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >> > wrote: > >> > >> Hi All, > >> > >> I want to run a JAVA application on a performance simulator, and > >> do a profiling on a hot function JITTed with C2 compiler. > >> > >> In order to make C2 compiler compile hot function as early as > >> possible, I hope to reduce the threshold of function invocation > >> > >> count in interpreter and C1 to drive the JIT compiler transitioned to > >> Level 4 (C2) ASAP. Following is the option list I try, but > >> > >> failed to find a right combination to meet my requirement. Anyone can > >> help to figure out what options I can use? > >> > >> Thanks in advance. 
> >> > >> -XX:Tier0ProfilingStartPercentage=0 > >> > >> -XX:Tier3InvocationThreshold > >> > >> -XX:Tier3MinInvocationThreshold > >> > >> -XX:Tier3CompileThreshold > >> > >> -XX:Tier4InvocationThreshold > >> > >> -XX:CompileThreshold > >> > >> -XX:Tier4MinInvocationThreshold > >> > >> -XX:Tier4CompileThreshold > >> > >> Regards! > >> > >> wei > >> From zoltan.majo at oracle.com Wed May 20 08:11:17 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Wed, 20 May 2015 10:11:17 +0200 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> <555C2DF5.1050600@oracle.com> Message-ID: <555C41A5.6010308@oracle.com> Hi Wei, On 05/20/2015 10:07 AM, Tangwei (Euler) wrote: > Hi Zoltan, > Is this option supported in OpenJDK8? I cannot find it by greping output of command line 'java -XX:+PrintFlagsFinal'. I checked and it is available only in 9. Best regards, Zoltan > > > Regards! > wei > >> -----Original Message----- >> From: hotspot-compiler-dev >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Zolt?n >> Maj? >> Sent: Wednesday, May 20, 2015 2:47 PM >> To: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? >> >> Hi Wei, >> >> >> On 05/20/2015 03:12 AM, Vladimir Kozlov wrote: >> [...] >>> An other approach is to set threshold per method (scale all >>> threasholds (C1, C2, interpreter) by this value). For example to >>> reduce thresholds by half: >>> >>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> old >>> Scaling,0.5 >>> >> just to complement what Vladimir suggested: If you want to scale thresholds >> for all methods at once, the CompileThresholdScaling can be also used globally. >> For example, -XX:CompileThresholdScaling=0.5 would reduce all thresholds by >> half (both with tiered and non-tiered modes of operation). 
>> >> Best regards, >> >> >> Zoltan >> >>> Vladimir >>> >>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>>> My goal is just to reach peak performance quickly. Following is one >>>> tier threshold combination I tried: >>>> >>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>> -XX:Tier3InvocationThreshold=3 >>>> >>>> -XX:Tier3MinInvocationThreshold=2 >>>> >>>> -XX:Tier3CompileThreshold=2 >>>> >>>> -XX:Tier4InvocationThreshold=4 >>>> >>>> -XX:Tier4MinInvocationThreshold=3 >>>> >>>> -XX:Tier4CompileThreshold=2 >>>> >>>> Regards! >>>> >>>> wei >>>> >>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>> *To:* Tangwei (Euler) >>>> *Cc:* hotspot compiler >>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>> compilation ASAP? >>>> >>>> Is your goal specifically to have C2 compile or just to reach peak >>>> performance quickly? It sounds like the latter. What values did you >>>> specify for the tier thresholds? Also, it may help you to >>>> -XX:+PrintCompilation to tune the flags as this will show you which >>>> methods are being compiled, when, and at what tier. >>>> >>>> sent from my phone >>>> >>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>> > wrote: >>>> >>>> Hi All, >>>> >>>> I want to run a JAVA application on a performance simulator, and >>>> do a profiling on a hot function JITTed with C2 compiler. >>>> >>>> In order to make C2 compiler compile hot function as early as >>>> possible, I hope to reduce the threshold of function invocation >>>> >>>> count in interpreter and C1 to drive the JIT compiler transitioned to >>>> Level 4 (C2) ASAP. Following is the option list I try, but >>>> >>>> failed to find a right combination to meet my requirement. Anyone can >>>> help to figure out what options I can use? >>>> >>>> Thanks in advance. 
>>>>
>>>> -XX:Tier0ProfilingStartPercentage=0
>>>>
>>>> -XX:Tier3InvocationThreshold
>>>>
>>>> -XX:Tier3MinInvocationThreshold
>>>>
>>>> -XX:Tier3CompileThreshold
>>>>
>>>> -XX:Tier4InvocationThreshold
>>>>
>>>> -XX:CompileThreshold
>>>>
>>>> -XX:Tier4MinInvocationThreshold
>>>>
>>>> -XX:Tier4CompileThreshold
>>>>
>>>> Regards!
>>>>
>>>> wei
>>>>
From tangwei6 at huawei.com  Wed May 20 08:17:54 2015
From: tangwei6 at huawei.com (Tangwei (Euler))
Date: Wed, 20 May 2015 08:17:54 +0000
Subject: How to change compilation policy to trigger C2 compilation ASAP?
In-Reply-To: <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net>
References: <555BDF7D.4010002@oracle.com>
	<2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net>
Message-ID: 

Hi Chris,
   Your tool looks cool. I think OpenJDK has no technology to mitigate
cold-start costs; please correct me if I am wrong. I can control some
compilation passes with the -XX:CompileCommand option, but the profiling
data, such as the invocation count and backedge count, has to reach some
threshold before triggering an execution level transition. As a result, the
simulator doesn't trigger C2 compilation within a reasonable time span.

Regards!
wei

> -----Original Message-----
> From: Chris Newland [mailto:cnewland at chrisnewland.com]
> Sent: Wednesday, May 20, 2015 3:46 PM
> To: Vitaly Davidovich
> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler
> Subject: RE: How to change compilation policy to trigger C2 compilation ASAP?
>
> Hi Wei,
>
> Is there any reason why you need to cold-start your application each time and
> begin with no profiling information?
>
> Depending on your input data, would it be possible to "warm up" the VM by
> running typical inputs through your code to build a rich profile and have the JIT
> compilations performed before your real data arrives?
> > One reason *against* doing this would be wide variations in your input data > that could result in the wrong optimisations being made and possibly suffering > a decompilation event. > > There are commercial VMs that have technology to mitigate cold-start costs > (for example Azul Zing's ReadyNow). > > Have you examined the HotSpot LogCompilation output to make sure there is > nothing you could change at source code level which would result in a better JIT > decisions? I'm thinking of things like inlining failures due to call-site > megamorphism. > > There's a free tool called JITWatch that aims to make it easier to understand > the LogCompilation output > (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch - I'm the > author). > > Regards, > > Chris > @chriswhocodes > > On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: > > I'll let Vladimir comment on the compile command, but turning off > > tiered doesn't prevent inlining. What prevents inlining (an otherwise > > inlineable method), in a nutshell, is lack of profiling information > > that would indicate the method or callsite is hot enough for inlining. > > So you can have tiered off and run C2 with standard compilation > > thresholds, and you'll likely get good inlining because you'll have a > > long profile. If you turn C2 compile threshold down too far (and too > > far is going to be app-specific), C2 compilation may not have > > sufficient info in the shortened profile to decide to inline. Or even > > worse, the profile collected is actually not reflective of the real > > profile you want to capture (e.g. app has a phase change after initialization). > > > > The theoretical advantage of tiered is you can get decent perf > > quickly, but also since you're getting better than interpreter speed > > quickly, you can run in C1 tier longer and thus collect an even larger > > profile, possibly leading to better code than C2 alone. But > > unfortunately this may require serious tuning. 
That's my understanding at > least. > > > > sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > > wrote: > > > > > >> Vladimir, > >> What the 'double' means in following command line? The option > >> CompileThresholdScaling you mentioned is same as > >> ProfileMaturityPercentage? > >> I cannot find the option in code. How much effect to function > >> inlining by turning off tiered compilation? > >> > >>> > >> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > ol > >> dS > >> caling,0.5 > >> > >> Regards! > >> wei > >> > >>> -----Original Message----- > >>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > >>> Sent: Wednesday, May 20, 2015 9:12 AM > >>> To: Tangwei (Euler); Vitaly Davidovich > >>> Cc: hotspot compiler > >>> Subject: Re: How to change compilation policy to trigger C2 > >>> compilation > >> ASAP? > >> > >>> > >>> If you want only C2 compilation you can disable tiered compilation > >>> > >> (since you > >> > >>> don't want to spend to much on profiling anyway) and set low > >>> CompileThreshold (you need only this one for non-tiered compile): > >>> > >>> > >>> -XX:-TieredCompilation -XX:CompileThreshold=100 > >>> > >>> > >>> If your method is simple and don't need profiling or inlining the > >>> > >> generated code > >>> could be same quality as with long profiling. > >>> > >>> An other approach is to set threshold per method (scale all > >>> threasholds > >> (C1, C2, > >> > >>> interpreter) by this value). For example to reduce thresholds by > >>> half: > >>> > >>> > >>> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >>> oldScaling,0.5 > >>> > >>> Vladimir > >>> > >>> > >>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >>> > >>>> My goal is just to reach peak performance quickly. 
Following is one > >>>> tier threshold combination I tried: > >>>> > >>>> -XX:Tier0ProfilingStartPercentage=0 > >>>> > >>>> > >>>> -XX:Tier3InvocationThreshold=3 > >>>> > >>>> > >>>> -XX:Tier3MinInvocationThreshold=2 > >>>> > >>>> > >>>> -XX:Tier3CompileThreshold=2 > >>>> > >>>> > >>>> -XX:Tier4InvocationThreshold=4 > >>>> > >>>> > >>>> -XX:Tier4MinInvocationThreshold=3 > >>>> > >>>> > >>>> -XX:Tier4CompileThreshold=2 > >>>> > >>>> > >>>> Regards! > >>>> > >>>> > >>>> wei > >>>> > >>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >>>> *Sent:* Tuesday, May 19, 2015 9:33 PM > >>>> *To:* Tangwei (Euler) > >>>> *Cc:* hotspot compiler > >>>> *Subject:* Re: How to change compilation policy to trigger C2 > >>>> compilation ASAP? > >>>> > >>>> Is your goal specifically to have C2 compile or just to reach peak > >>>> performance quickly? It sounds like the latter. What values did > >>>> you specify for the tier thresholds? Also, it may help you to > >>>> -XX:+PrintCompilation to tune the flags as this will show you which > >>>> methods are being compiled, when, and at what tier. > >>>> > >>>> sent from my phone > >>>> > >>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>> > wrote: > >>>> > >>>> > >>>> Hi All, > >>>> > >>>> > >>>> I want to run a JAVA application on a performance simulator, and do > >>>> a profiling on a hot function JITTed with C2 compiler. > >>>> > >>>> In order to make C2 compiler compile hot function as early as > >>>> possible, I hope to reduce the threshold of function invocation > >>>> > >>>> count in interpreter and C1 to drive the JIT compiler transitioned > >>>> to Level 4 (C2) ASAP. Following is the option list I try, but > >>>> > >>>> > >>>> failed to find a right combination to meet my requirement. Anyone > >>>> can help to figure out what options I can use? > >>>> > >>>> Thanks in advance. 
> >>>> > >>>> > >>>> -XX:Tier0ProfilingStartPercentage=0 > >>>> > >>>> > >>>> -XX:Tier3InvocationThreshold > >>>> > >>>> > >>>> -XX:Tier3MinInvocationThreshold > >>>> > >>>> > >>>> -XX:Tier3CompileThreshold > >>>> > >>>> > >>>> -XX:Tier4InvocationThreshold > >>>> > >>>> > >>>> -XX:CompileThreshold > >>>> > >>>> > >>>> -XX:Tier4MinInvocationThreshold > >>>> > >>>> > >>>> -XX:Tier4CompileThreshold > >>>> > >>>> > >>>> Regards! > >>>> > >>>> > >>>> wei > >>>> > >> > > > From tangwei6 at huawei.com Wed May 20 08:19:49 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 08:19:49 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <555C41A5.6010308@oracle.com> References: <555BDF7D.4010002@oracle.com> <555C2DF5.1050600@oracle.com> <555C41A5.6010308@oracle.com> Message-ID: Thanks, we haven't switched to JDK9, and will give a try. Regards! wei > -----Original Message----- > From: Zolt?n Maj? [mailto:zoltan.majo at oracle.com] > Sent: Wednesday, May 20, 2015 4:11 PM > To: Tangwei (Euler); hotspot-compiler-dev at openjdk.java.net > Subject: Re: How to change compilation policy to trigger C2 compilation ASAP? > > Hi Wei, > > > On 05/20/2015 10:07 AM, Tangwei (Euler) wrote: > > Hi Zoltan, > > Is this option supported in OpenJDK8? I cannot find it by greping output > of command line 'java -XX:+PrintFlagsFinal'. > > I checked and it is available only in 9. > > Best regards, > > > Zoltan > > > > > > > Regards! > > wei > > > >> -----Original Message----- > >> From: hotspot-compiler-dev > >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of > >> Zolt?n Maj? > >> Sent: Wednesday, May 20, 2015 2:47 PM > >> To: hotspot-compiler-dev at openjdk.java.net > >> Subject: Re: How to change compilation policy to trigger C2 compilation > ASAP? > >> > >> Hi Wei, > >> > >> > >> On 05/20/2015 03:12 AM, Vladimir Kozlov wrote: > >> [...] 
> >>> An other approach is to set threshold per method (scale all > >>> threasholds (C1, C2, interpreter) by this value). For example to > >>> reduce thresholds by half: > >>> > >>> > >> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> old > >>> Scaling,0.5 > >>> > >> just to complement what Vladimir suggested: If you want to scale > >> thresholds for all methods at once, the CompileThresholdScaling can be also > used globally. > >> For example, -XX:CompileThresholdScaling=0.5 would reduce all > >> thresholds by half (both with tiered and non-tiered modes of operation). > >> > >> Best regards, > >> > >> > >> Zoltan > >> > >>> Vladimir > >>> > >>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >>>> My goal is just to reach peak performance quickly. Following is one > >>>> tier threshold combination I tried: > >>>> > >>>> -XX:Tier0ProfilingStartPercentage=0 > >>>> > >>>> -XX:Tier3InvocationThreshold=3 > >>>> > >>>> -XX:Tier3MinInvocationThreshold=2 > >>>> > >>>> -XX:Tier3CompileThreshold=2 > >>>> > >>>> -XX:Tier4InvocationThreshold=4 > >>>> > >>>> -XX:Tier4MinInvocationThreshold=3 > >>>> > >>>> -XX:Tier4CompileThreshold=2 > >>>> > >>>> Regards! > >>>> > >>>> wei > >>>> > >>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >>>> *Sent:* Tuesday, May 19, 2015 9:33 PM > >>>> *To:* Tangwei (Euler) > >>>> *Cc:* hotspot compiler > >>>> *Subject:* Re: How to change compilation policy to trigger C2 > >>>> compilation ASAP? > >>>> > >>>> Is your goal specifically to have C2 compile or just to reach peak > >>>> performance quickly? It sounds like the latter. What values did > >>>> you specify for the tier thresholds? Also, it may help you to > >>>> -XX:+PrintCompilation to tune the flags as this will show you which > >>>> methods are being compiled, when, and at what tier. 
> >>>>
> >>>> sent from my phone
> >>>>
> >>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)"
> >>>> > wrote:
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I want to run a JAVA application on a performance simulator, and
> >>>> do a profiling on a hot function JITTed with C2 compiler.
> >>>>
> >>>> In order to make C2 compiler compile hot function as early as
> >>>> possible, I hope to reduce the threshold of function invocation
> >>>>
> >>>> count in interpreter and C1 to drive the JIT compiler transitioned
> >>>> to Level 4 (C2) ASAP. Following is the option list I try, but
> >>>>
> >>>> failed to find a right combination to meet my requirement. Anyone
> >>>> can help to figure out what options I can use?
> >>>>
> >>>> Thanks in advance.
> >>>>
> >>>> -XX:Tier0ProfilingStartPercentage=0
> >>>>
> >>>> -XX:Tier3InvocationThreshold
> >>>>
> >>>> -XX:Tier3MinInvocationThreshold
> >>>>
> >>>> -XX:Tier3CompileThreshold
> >>>>
> >>>> -XX:Tier4InvocationThreshold
> >>>>
> >>>> -XX:CompileThreshold
> >>>>
> >>>> -XX:Tier4MinInvocationThreshold
> >>>>
> >>>> -XX:Tier4CompileThreshold
> >>>>
> >>>> Regards!
> >>>>
> >>>> wei
> >>>>
From aph at redhat.com  Wed May 20 08:48:18 2015
From: aph at redhat.com (Andrew Haley)
Date: Wed, 20 May 2015 09:48:18 +0100
Subject: How to change compilation policy to trigger C2 compilation ASAP?
In-Reply-To: 
References: 
Message-ID: <555C4A52.60807@redhat.com>

On 20/05/15 03:06, Tangwei (Euler) wrote:
> Our real machine has 16 CPU cores, but unfortunately, our simulator
> can only support one thread now. So it doesn't work for me to
> increase compiler threads for now.
> I am glad to share if I find something interesting.

The problem with forcing early C2 compilation is that it may not have
enough profile data to do its job well. So, you'll have something you
can measure but it won't be representative of what the compiler will
do on real hardware.

Andrew.
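[Archive note: the flags discussed earlier in this thread can be combined on one
command line. A minimal sketch only — the main class name is a placeholder, and
the exact threshold needs tuning per application:]

```shell
# Disable tiered compilation so only C2 is used, lower its invocation
# threshold, and log each compilation so the promotion of the hot method
# is visible. "SimWorkload" is a placeholder for the real main class.
java -XX:-TieredCompilation \
     -XX:CompileThreshold=100 \
     -XX:+PrintCompilation \
     SimWorkload
```

[As Andrew notes, a profile collected this quickly may not be representative of
steady-state behavior on real hardware.]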
From cnewland at chrisnewland.com Wed May 20 08:58:40 2015 From: cnewland at chrisnewland.com (Chris Newland) Date: Wed, 20 May 2015 09:58:40 +0100 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> Message-ID: <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Hi Wei, Thanks! OpenJDK has no support for persisting compilation profile data between VM runs. For some industries it can be economical to use a commercial VM to get the lowest latencies. One "free" alternative is to add a training mode to your application and feed it the most realistic input data so that you get a good set of JIT optimisations before the inputs you care about arrive. Is there a reason why you are cold-starting your VM? (For example your GC strategy is to use a huge Eden space and reboot daily to avoid any garbage collections). Regards, Chris @chriswhocodes On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: > Hi Chris, > Your tool looks cool. I think OpenJDK has no technology to mitigate > cold-start costs, please correct if I am wrong. I can control some > compilation passes with option -XX:CompileCommand, but the profiling > data, such as invocation count and backedge count, has to reach some > threshold before trigger execution level transition. This causes > simulator doesn't trigger C2 compilation within reasonable time span. > > Regards! > wei > >> -----Original Message----- >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >> Sent: Wednesday, May 20, 2015 3:46 PM >> To: Vitaly Davidovich >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >> Subject: RE: How to change compilation policy to trigger C2 compilation >> ASAP? >> >> >> Hi Wei, >> >> >> Is there any reason why you need to cold-start your application each >> time and begin with no profiling information? 
>> >> Depending on your input data, would it be possible to "warm up" the VM >> by running typical inputs through your code to build a rich profile and >> have the JIT compilations performed before your real data arrives? >> >> One reason *against* doing this would be wide variations in your input >> data that could result in the wrong optimisations being made and >> possibly suffering a decompilation event. >> >> There are commercial VMs that have technology to mitigate cold-start >> costs (for example Azul Zing's ReadyNow). >> >> >> Have you examined the HotSpot LogCompilation output to make sure there >> is nothing you could change at source code level which would result in a >> better JIT decisions? I'm thinking of things like inlining failures due >> to call-site megamorphism. >> >> There's a free tool called JITWatch that aims to make it easier to >> understand the LogCompilation output >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch - >> I'm the >> author). >> >> Regards, >> >> >> Chris >> @chriswhocodes >> >> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >> >>> I'll let Vladimir comment on the compile command, but turning off >>> tiered doesn't prevent inlining. What prevents inlining (an otherwise >>> inlineable method), in a nutshell, is lack of profiling information >>> that would indicate the method or callsite is hot enough for >>> inlining. So you can have tiered off and run C2 with standard >>> compilation thresholds, and you'll likely get good inlining because >>> you'll have a long profile. If you turn C2 compile threshold down too >>> far (and too far is going to be app-specific), C2 compilation may not >>> have sufficient info in the shortened profile to decide to inline. Or >>> even worse, the profile collected is actually not reflective of the >>> real profile you want to capture (e.g. app has a phase change after >>> initialization). 
>>> >>> The theoretical advantage of tiered is you can get decent perf >>> quickly, but also since you're getting better than interpreter speed >>> quickly, you can run in C1 tier longer and thus collect an even >>> larger profile, possibly leading to better code than C2 alone. But >>> unfortunately this may require serious tuning. That's my >>> understanding at >> least. >>> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >>> wrote: >>> >>> >>> >>>> Vladimir, >>>> What the 'double' means in following command line? The option >>>> CompileThresholdScaling you mentioned is same as >>>> ProfileMaturityPercentage? >>>> I cannot find the option in code. How much effect to function >>>> inlining by turning off tiered compilation? >>>> >>>>> >>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> ol >>>> dS caling,0.5 >>>> >>>> Regards! >>>> wei >>>> >>>>> -----Original Message----- >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >>>>> To: Tangwei (Euler); Vitaly Davidovich >>>>> Cc: hotspot compiler >>>>> Subject: Re: How to change compilation policy to trigger C2 >>>>> compilation >>>> ASAP? >>>> >>>> >>>>> >>>>> If you want only C2 compilation you can disable tiered >>>>> compilation >>>>> >>>> (since you >>>> >>>> >>>>> don't want to spend to much on profiling anyway) and set low >>>>> CompileThreshold (you need only this one for non-tiered compile): >>>>> >>>>> >>>>> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>>>> >>>>> >>>>> >>>>> If your method is simple and don't need profiling or inlining the >>>>> >>>>> >>>> generated code >>>>> could be same quality as with long profiling. >>>>> >>>>> An other approach is to set threshold per method (scale all >>>>> threasholds >>>> (C1, C2, >>>> >>>> >>>>> interpreter) by this value). 
For example to reduce thresholds by >>>>> half: >>>>> >>>>> >>>>> >>>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> >>>>> oldScaling,0.5 >>>>> >>>>> Vladimir >>>>> >>>>> >>>>> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>>>> >>>>> >>>>>> My goal is just to reach peak performance quickly. Following is >>>>>> one tier threshold combination I tried: >>>>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3InvocationThreshold=3 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3MinInvocationThreshold=2 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3CompileThreshold=2 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4InvocationThreshold=4 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4MinInvocationThreshold=3 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4CompileThreshold=2 >>>>>> >>>>>> >>>>>> >>>>>> Regards! >>>>>> >>>>>> >>>>>> >>>>>> wei >>>>>> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>>>> *To:* Tangwei (Euler) >>>>>> *Cc:* hotspot compiler >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>>>> compilation ASAP? >>>>>> >>>>>> Is your goal specifically to have C2 compile or just to reach >>>>>> peak performance quickly? It sounds like the latter. What >>>>>> values did you specify for the tier thresholds? Also, it may >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >>>>>> show you which methods are being compiled, when, and at what >>>>>> tier. >>>>>> >>>>>> sent from my phone >>>>>> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>>> > wrote: >>>>>> >>>>>> >>>>>> >>>>>> Hi All, >>>>>> >>>>>> >>>>>> >>>>>> I want to run a JAVA application on a performance simulator, >>>>>> and do a profiling on a hot function JITTed with C2 compiler. 
>>>>>> >>>>>> In order to make C2 compiler compile hot function as early as >>>>>> possible, I hope to reduce the threshold of function invocation >>>>>> >>>>>> count in interpreter and C1 to drive the JIT compiler >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list >>>>>> I try, but >>>>>> >>>>>> >>>>>> >>>>>> failed to find a right combination to meet my requirement. >>>>>> Anyone >>>>>> can help to figure out what options I can use? >>>>>> >>>>>> Thanks in advance. >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3InvocationThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3MinInvocationThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier3CompileThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4InvocationThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:CompileThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4MinInvocationThreshold >>>>>> >>>>>> >>>>>> >>>>>> -XX:Tier4CompileThreshold >>>>>> >>>>>> >>>>>> >>>>>> Regards! >>>>>> >>>>>> >>>>>> >>>>>> wei >>>>>> >>>> >>> >> > > From tangwei6 at huawei.com Wed May 20 09:29:52 2015 From: tangwei6 at huawei.com (Tangwei (Euler)) Date: Wed, 20 May 2015 09:29:52 +0000 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Hi Chris, Actually, what I want is to run my application on simulator to locate performance bottleneck in JITTed code. I actually don't care about if cold-starting my VM or not, just want to compile hot function earlier to shorten simulator execution time. Regards! 
wei > -----Original Message----- > From: Chris Newland [mailto:cnewland at chrisnewland.com] > Sent: Wednesday, May 20, 2015 4:59 PM > To: Tangwei (Euler) > Cc: hotspot compiler > Subject: RE: How to change compilation policy to trigger C2 compilation ASAP? > > Hi Wei, > > Thanks! OpenJDK has no support for persisting compilation profile data > between VM runs. For some industries it can be economical to use a > commercial VM to get the lowest latencies. > > One "free" alternative is to add a training mode to your application and feed it > the most realistic input data so that you get a good set of JIT optimisations > before the inputs you care about arrive. > > Is there a reason why you are cold-starting your VM? (For example your GC > strategy is to use a huge Eden space and reboot daily to avoid any garbage > collections). > > Regards, > > Chris > @chriswhocodes > > On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: > > Hi Chris, > > Your tool looks cool. I think OpenJDK has no technology to mitigate > > cold-start costs, please correct if I am wrong. I can control some > > compilation passes with option -XX:CompileCommand, but the profiling > > data, such as invocation count and backedge count, has to reach some > > threshold before trigger execution level transition. This causes > > simulator doesn't trigger C2 compilation within reasonable time span. > > > > Regards! > > wei > > > >> -----Original Message----- > >> From: Chris Newland [mailto:cnewland at chrisnewland.com] > >> Sent: Wednesday, May 20, 2015 3:46 PM > >> To: Vitaly Davidovich > >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler > >> Subject: RE: How to change compilation policy to trigger C2 > >> compilation ASAP? > >> > >> > >> Hi Wei, > >> > >> > >> Is there any reason why you need to cold-start your application each > >> time and begin with no profiling information? 
> >> > >> Depending on your input data, would it be possible to "warm up" the > >> VM by running typical inputs through your code to build a rich > >> profile and have the JIT compilations performed before your real data > arrives? > >> > >> One reason *against* doing this would be wide variations in your > >> input data that could result in the wrong optimisations being made > >> and possibly suffering a decompilation event. > >> > >> There are commercial VMs that have technology to mitigate cold-start > >> costs (for example Azul Zing's ReadyNow). > >> > >> > >> Have you examined the HotSpot LogCompilation output to make sure > >> there is nothing you could change at source code level which would > >> result in a better JIT decisions? I'm thinking of things like > >> inlining failures due to call-site megamorphism. > >> > >> There's a free tool called JITWatch that aims to make it easier to > >> understand the LogCompilation output > >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch > >> - I'm the author). > >> > >> Regards, > >> > >> > >> Chris > >> @chriswhocodes > >> > >> > >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: > >> > >>> I'll let Vladimir comment on the compile command, but turning off > >>> tiered doesn't prevent inlining. What prevents inlining (an > >>> otherwise inlineable method), in a nutshell, is lack of profiling > >>> information that would indicate the method or callsite is hot > >>> enough for inlining. So you can have tiered off and run C2 with > >>> standard compilation thresholds, and you'll likely get good inlining > >>> because you'll have a long profile. If you turn C2 compile > >>> threshold down too far (and too far is going to be app-specific), C2 > >>> compilation may not have sufficient info in the shortened profile to > >>> decide to inline. Or even worse, the profile collected is actually > >>> not reflective of the real profile you want to capture (e.g. 
app has > >>> a phase change after initialization). > >>> > >>> The theoretical advantage of tiered is you can get decent perf > >>> quickly, but also since you're getting better than interpreter speed > >>> quickly, you can run in C1 tier longer and thus collect an even > >>> larger profile, possibly leading to better code than C2 alone. But > >>> unfortunately this may require serious tuning. That's my > >>> understanding at > >> least. > >>> > >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > >>> wrote: > >>> > >>> > >>> > >>>> Vladimir, > >>>> What the 'double' means in following command line? The option > >>>> CompileThresholdScaling you mentioned is same as > >>>> ProfileMaturityPercentage? > >>>> I cannot find the option in code. How much effect to function > >>>> inlining by turning off tiered compilation? > >>>> > >>>>> > >>>> > >> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> ol > >>>> dS caling,0.5 > >>>> > >>>> Regards! > >>>> wei > >>>> > >>>>> -----Original Message----- > >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > >>>>> Sent: Wednesday, May 20, 2015 9:12 AM > >>>>> To: Tangwei (Euler); Vitaly Davidovich > >>>>> Cc: hotspot compiler > >>>>> Subject: Re: How to change compilation policy to trigger C2 > >>>>> compilation > >>>> ASAP? > >>>> > >>>> > >>>>> > >>>>> If you want only C2 compilation you can disable tiered compilation > >>>>> > >>>> (since you > >>>> > >>>> > >>>>> don't want to spend to much on profiling anyway) and set low > >>>>> CompileThreshold (you need only this one for non-tiered compile): > >>>>> > >>>>> > >>>>> > >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 > >>>>> > >>>>> > >>>>> > >>>>> If your method is simple and don't need profiling or inlining the > >>>>> > >>>>> > >>>> generated code > >>>>> could be same quality as with long profiling. 
> >>>>> > >>>>> An other approach is to set threshold per method (scale all > >>>>> threasholds > >>>> (C1, C2, > >>>> > >>>> > >>>>> interpreter) by this value). For example to reduce thresholds by > >>>>> half: > >>>>> > >>>>> > >>>>> > >>>>> > >> > -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> > >>>>> oldScaling,0.5 > >>>>> > >>>>> Vladimir > >>>>> > >>>>> > >>>>> > >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >>>>> > >>>>> > >>>>>> My goal is just to reach peak performance quickly. Following is > >>>>>> one tier threshold combination I tried: > >>>>>> > >>>>>> -XX:Tier0ProfilingStartPercentage=0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3InvocationThreshold=3 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3MinInvocationThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3CompileThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4InvocationThreshold=4 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4MinInvocationThreshold=3 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4CompileThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> Regards! > >>>>>> > >>>>>> > >>>>>> > >>>>>> wei > >>>>>> > >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM > >>>>>> *To:* Tangwei (Euler) > >>>>>> *Cc:* hotspot compiler > >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 > >>>>>> compilation ASAP? > >>>>>> > >>>>>> Is your goal specifically to have C2 compile or just to reach > >>>>>> peak performance quickly? It sounds like the latter. What values > >>>>>> did you specify for the tier thresholds? Also, it may help you > >>>>>> to -XX:+PrintCompilation to tune the flags as this will show you > >>>>>> which methods are being compiled, when, and at what tier. 
> >>>>>> > >>>>>> sent from my phone > >>>>>> > >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>>>> > wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> Hi All, > >>>>>> > >>>>>> > >>>>>> > >>>>>> I want to run a JAVA application on a performance simulator, and > >>>>>> do a profiling on a hot function JITTed with C2 compiler. > >>>>>> > >>>>>> In order to make C2 compiler compile hot function as early as > >>>>>> possible, I hope to reduce the threshold of function invocation > >>>>>> > >>>>>> count in interpreter and C1 to drive the JIT compiler > >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list I > >>>>>> try, but > >>>>>> > >>>>>> > >>>>>> > >>>>>> failed to find a right combination to meet my requirement. > >>>>>> Anyone > >>>>>> can help to figure out what options I can use? > >>>>>> > >>>>>> Thanks in advance. > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier0ProfilingStartPercentage=0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3InvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3MinInvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4InvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4MinInvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> Regards! > >>>>>> > >>>>>> > >>>>>> > >>>>>> wei > >>>>>> > >>>> > >>> > >> > > > > > From andreas.eriksson at oracle.com Wed May 20 11:04:11 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Wed, 20 May 2015 13:04:11 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: References: <555B16D4.80008@oracle.com> Message-ID: <555C6A2B.7090804@oracle.com> Comments added, thanks. Can I have a Reviewer look at this as well? 
New webrev: http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ - Andreas On 2015-05-19 20:24, Berg, Michael C wrote: > Hi Andreas, could you add a comment to the push statement: > > // Propagate change to user > > And to Node* p declaration and init: > > // Propagate changes to uses > > So that the new code carries all similar information. > Else the changes seem ok. > > -Michael > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson > Sent: Tuesday, May 19, 2015 3:56 AM > To: hotspot-compiler-dev at openjdk.java.net > Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information > > Hi, > > Could I please get two reviews for the following bug fix. > > https://bugs.openjdk.java.net/browse/JDK-8060036 > http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ > > The problem is with type propagation in C2. > > A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. > If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. > These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. > This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. > > The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. > At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. 
> > When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. > > This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. > I've not been able to create a regression test that is reliable enough to check in. > > Thanks, > Andreas From paul.sandoz at oracle.com Wed May 20 12:32:26 2015 From: paul.sandoz at oracle.com (Paul Sandoz) Date: Wed, 20 May 2015 14:32:26 +0200 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> <5559E3A5.8030108@oracle.com> Message-ID: <59A05F7D-058A-4A9B-BA65-17A71A57254D@oracle.com> On May 20, 2015, at 9:14 AM, Roland Westrelin wrote: >> What do you think about the following version: >> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.04 > > Still looks good to me. > And to me. Injected fields seem a little magical but safer. Paul. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From vladimir.x.ivanov at oracle.com Wed May 20 13:03:10 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 20 May 2015 16:03:10 +0300 Subject: [9] RFR (M): 8079205: CallSite dependency tracking is broken after sun.misc.Cleaner became automatically cleared In-Reply-To: <59A05F7D-058A-4A9B-BA65-17A71A57254D@oracle.com> References: <554CEF76.8090903@oracle.com> <554DD477.7000703@gmail.com> <5551DC44.9060902@oracle.com> <5553191F.6080800@oracle.com> <72E25593-3E65-48F6-86D4-75B8E6CB8C6B@oracle.com> <55533CAE.3050201@oracle.com> <55549297.5020002@oracle.com> <5555E343.5050600@oracle.com> <5559E3A5.8030108@oracle.com> <59A05F7D-058A-4A9B-BA65-17A71A57254D@oracle.com> Message-ID: <555C860E.1020508@oracle.com> Paul, Roland, thanks for review. Best regards, Vladimir Ivanov On 5/20/15 3:32 PM, Paul Sandoz wrote: > > On May 20, 2015, at 9:14 AM, Roland Westrelin wrote: > >>> What do you think about the following version: >>> http://cr.openjdk.java.net/~vlivanov/8079205/webrev.04 >> >> Still looks good to me. >> > > And to me. > > Injected fields seem a little magical but safer. > > Paul. > From vitalyd at gmail.com Wed May 20 13:40:15 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 20 May 2015 09:40:15 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Training mode is also dangerous unless built in from beginning as there's risk some unwanted application effect will take place. This is beside the possibility of feeding it "wrong" profile. What does Azul ReadyNow do? Saves profiles to disk and reads them upon startup? 
sent from my phone On May 20, 2015 4:58 AM, "Chris Newland" wrote: > Hi Wei, > > Thanks! OpenJDK has no support for persisting compilation profile data > between VM runs. For some industries it can be economical to use a > commercial VM to get the lowest latencies. > > One "free" alternative is to add a training mode to your application and > feed it the most realistic input data so that you get a good set of JIT > optimisations before the inputs you care about arrive. > > Is there a reason why you are cold-starting your VM? (For example your GC > strategy is to use a huge Eden space and reboot daily to avoid any garbage > collections). > > Regards, > > Chris > @chriswhocodes > > On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: > > Hi Chris, > > Your tool looks cool. I think OpenJDK has no technology to mitigate > > cold-start costs, please correct if I am wrong. I can control some > > compilation passes with option -XX:CompileCommand, but the profiling > > data, such as invocation count and backedge count, has to reach some > > threshold before trigger execution level transition. This causes > > simulator doesn't trigger C2 compilation within reasonable time span. > > > > Regards! > > wei > > > >> -----Original Message----- > >> From: Chris Newland [mailto:cnewland at chrisnewland.com] > >> Sent: Wednesday, May 20, 2015 3:46 PM > >> To: Vitaly Davidovich > >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler > >> Subject: RE: How to change compilation policy to trigger C2 compilation > >> ASAP? > >> > >> > >> Hi Wei, > >> > >> > >> Is there any reason why you need to cold-start your application each > >> time and begin with no profiling information? > >> > >> Depending on your input data, would it be possible to "warm up" the VM > >> by running typical inputs through your code to build a rich profile and > >> have the JIT compilations performed before your real data arrives? 
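[Editorial aside: the "warm up the VM" idea under discussion can be sketched as below. This is a hedged sketch under invented names, not code from any VM or patch in this thread; whether ~20k iterations suffices depends on the tiered-threshold flags being discussed.]

```java
import java.util.function.IntUnaryOperator;

public class WarmUp {
    // Stand-in for the application's real hot method.
    static int hotPath(int x) {
        return x * 31 + 7;
    }

    // Drive the hot path with representative inputs before real traffic, so
    // the JIT collects a profile (and, ideally, reaches C2) during startup.
    static int train(IntUnaryOperator f, int iterations) {
        int sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += f.applyAsInt(i);  // accumulate so the loop is not dead code
        }
        return sink;
    }

    public static void main(String[] args) {
        // ~20k invocations is the ballpark of the default Tier4 thresholds;
        // run with -XX:+PrintCompilation to watch the level transitions.
        int sink = train(WarmUp::hotPath, 20_000);
        System.out.println("trained, sink=" + sink);
        System.out.println("hotPath(5)=" + hotPath(5));
    }
}
```

As noted later in the thread, the risk of this approach is feeding the JIT a profile that does not match the real workload, which can lead to deoptimization when the real inputs arrive.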
> >> > >> One reason *against* doing this would be wide variations in your input > >> data that could result in the wrong optimisations being made and > >> possibly suffering a decompilation event. > >> > >> There are commercial VMs that have technology to mitigate cold-start > >> costs (for example Azul Zing's ReadyNow). > >> > >> > >> Have you examined the HotSpot LogCompilation output to make sure there > >> is nothing you could change at source code level which would result in a > >> better JIT decisions? I'm thinking of things like inlining failures due > >> to call-site megamorphism. > >> > >> There's a free tool called JITWatch that aims to make it easier to > >> understand the LogCompilation output > >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch - > >> I'm the > >> author). > >> > >> Regards, > >> > >> > >> Chris > >> @chriswhocodes > >> > >> > >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: > >> > >>> I'll let Vladimir comment on the compile command, but turning off > >>> tiered doesn't prevent inlining. What prevents inlining (an otherwise > >>> inlineable method), in a nutshell, is lack of profiling information > >>> that would indicate the method or callsite is hot enough for > >>> inlining. So you can have tiered off and run C2 with standard > >>> compilation thresholds, and you'll likely get good inlining because > >>> you'll have a long profile. If you turn C2 compile threshold down too > >>> far (and too far is going to be app-specific), C2 compilation may not > >>> have sufficient info in the shortened profile to decide to inline. Or > >>> even worse, the profile collected is actually not reflective of the > >>> real profile you want to capture (e.g. app has a phase change after > >>> initialization). 
> >>> > >>> The theoretical advantage of tiered is you can get decent perf > >>> quickly, but also since you're getting better than interpreter speed > >>> quickly, you can run in C1 tier longer and thus collect an even > >>> larger profile, possibly leading to better code than C2 alone. But > >>> unfortunately this may require serious tuning. That's my > >>> understanding at > >> least. > >>> > >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" > >>> wrote: > >>> > >>> > >>> > >>>> Vladimir, > >>>> What the 'double' means in following command line? The option > >>>> CompileThresholdScaling you mentioned is same as > >>>> ProfileMaturityPercentage? > >>>> I cannot find the option in code. How much effect to function > >>>> inlining by turning off tiered compilation? > >>>> > >>>>> > >>>> > >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> ol > >>>> dS caling,0.5 > >>>> > >>>> Regards! > >>>> wei > >>>> > >>>>> -----Original Message----- > >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] > >>>>> Sent: Wednesday, May 20, 2015 9:12 AM > >>>>> To: Tangwei (Euler); Vitaly Davidovich > >>>>> Cc: hotspot compiler > >>>>> Subject: Re: How to change compilation policy to trigger C2 > >>>>> compilation > >>>> ASAP? > >>>> > >>>> > >>>>> > >>>>> If you want only C2 compilation you can disable tiered > >>>>> compilation > >>>>> > >>>> (since you > >>>> > >>>> > >>>>> don't want to spend to much on profiling anyway) and set low > >>>>> CompileThreshold (you need only this one for non-tiered compile): > >>>>> > >>>>> > >>>>> > >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 > >>>>> > >>>>> > >>>>> > >>>>> If your method is simple and don't need profiling or inlining the > >>>>> > >>>>> > >>>> generated code > >>>>> could be same quality as with long profiling. 
> >>>>> > >>>>> An other approach is to set threshold per method (scale all > >>>>> threasholds > >>>> (C1, C2, > >>>> > >>>> > >>>>> interpreter) by this value). For example to reduce thresholds by > >>>>> half: > >>>>> > >>>>> > >>>>> > >>>>> > >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh > >> > >>>>> oldScaling,0.5 > >>>>> > >>>>> Vladimir > >>>>> > >>>>> > >>>>> > >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: > >>>>> > >>>>> > >>>>>> My goal is just to reach peak performance quickly. Following is > >>>>>> one tier threshold combination I tried: > >>>>>> > >>>>>> -XX:Tier0ProfilingStartPercentage=0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3InvocationThreshold=3 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3MinInvocationThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3CompileThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4InvocationThreshold=4 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4MinInvocationThreshold=3 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4CompileThreshold=2 > >>>>>> > >>>>>> > >>>>>> > >>>>>> Regards! > >>>>>> > >>>>>> > >>>>>> > >>>>>> wei > >>>>>> > >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] > >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM > >>>>>> *To:* Tangwei (Euler) > >>>>>> *Cc:* hotspot compiler > >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 > >>>>>> compilation ASAP? > >>>>>> > >>>>>> Is your goal specifically to have C2 compile or just to reach > >>>>>> peak performance quickly? It sounds like the latter. What > >>>>>> values did you specify for the tier thresholds? Also, it may > >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will > >>>>>> show you which methods are being compiled, when, and at what > >>>>>> tier. 
> >>>>>> > >>>>>> sent from my phone > >>>>>> > >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>>>> > wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> Hi All, > >>>>>> > >>>>>> > >>>>>> > >>>>>> I want to run a JAVA application on a performance simulator, > >>>>>> and do a profiling on a hot function JITTed with C2 compiler. > >>>>>> > >>>>>> In order to make C2 compiler compile hot function as early as > >>>>>> possible, I hope to reduce the threshold of function invocation > >>>>>> > >>>>>> count in interpreter and C1 to drive the JIT compiler > >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list > >>>>>> I try, but > >>>>>> > >>>>>> > >>>>>> > >>>>>> failed to find a right combination to meet my requirement. > >>>>>> Anyone > >>>>>> can help to figure out what options I can use? > >>>>>> > >>>>>> Thanks in advance. > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier0ProfilingStartPercentage=0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3InvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3MinInvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier3CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4InvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4MinInvocationThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> -XX:Tier4CompileThreshold > >>>>>> > >>>>>> > >>>>>> > >>>>>> Regards! > >>>>>> > >>>>>> > >>>>>> > >>>>>> wei > >>>>>> > >>>> > >>> > >> > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From roland.westrelin at oracle.com Wed May 20 14:03:53 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 20 May 2015 16:03:53 +0200 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test Message-ID: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It's not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That's what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. Roland. From roland.westrelin at oracle.com Wed May 20 14:19:24 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 20 May 2015 16:19:24 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555A6741.3030002@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> Message-ID: Thanks Andrew and Vladimir. Here is a new webrev with an enum: http://cr.openjdk.java.net/~roland/8077504/webrev.02/ Roland.
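[Editorial aside: the shape Roland describes — an arraycopy whose destination allocation never escapes, so escape analysis eliminates both the allocation and the ArrayCopyNode — corresponds to Java source along these lines. A hedged reduction, not the JCK test from the bug report; names are invented.]

```java
import java.util.Arrays;

public class NonEscapingCopy {
    static int f(int n) {
        int[] src = {1, 2, 3, n};
        // Arrays.copyOf becomes an arraycopy; since dst never escapes this
        // method, escape analysis can scalar-replace the allocation and
        // eliminate the ArrayCopyNode -- at which point its control/src/dest
        // inputs are set to top, as described in the RFR above.
        int[] dst = Arrays.copyOf(src, src.length);
        return dst[3];
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 20_000; i++) {  // enough iterations to reach C2
            sum += f(i % 7);
        }
        System.out.println(sum);
    }
}
```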
From vladimir.kozlov at oracle.com Wed May 20 16:12:36 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 09:12:36 -0700 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> Message-ID: <555CB274.4010409@oracle.com> Looks good. But fix second comment in memnode.hpp - it has '???' instead of '. Thanks, Vladimir On 5/20/15 7:19 AM, Roland Westrelin wrote: > Thanks Andrew and Vladimir. Here is a new webrev with an enum: > > http://cr.openjdk.java.net/~roland/8077504/webrev.02/ > > Roland. > From adinn at redhat.com Wed May 20 16:15:59 2015 From: adinn at redhat.com (Andrew Dinn) Date: Wed, 20 May 2015 17:15:59 +0100 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555CB274.4010409@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> <555CB274.4010409@oracle.com> Message-ID: <555CB33F.6030506@redhat.com> On 20/05/15 17:12, Vladimir Kozlov wrote: > Looks good. But fix second comment in memnode.hpp - it has '???' instead > of '. Yes, looks good (but, of course, I'm not a reviewer). regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in UK and Wales under Company Registration No. 
3798903 Directors: Michael Cunningham (USA), Matt Parson (USA), Charlie Peters (USA), Michael O'Neill (Ireland) From vladimir.kozlov at oracle.com Wed May 20 16:17:37 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 09:17:37 -0700 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555C6A2B.7090804@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> Message-ID: <555CB3A1.2000308@oracle.com> Looks good. I agree with explanation and fix. Thanks, Vladimir On 5/20/15 4:04 AM, Andreas Eriksson wrote: > Comments added, thanks. > > Can I have a Reviewer look at this as well? > > New webrev: > http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ > > - Andreas > > On 2015-05-19 20:24, Berg, Michael C wrote: >> Hi Andreas, could you add a comment to the push statement: >> >> // Propagate change to user >> >> And to Node* p declaration and init: >> >> // Propagate changes to uses >> >> So that the new code carries all similar information. >> Else the changes seem ok. >> >> -Michael >> >> -----Original Message----- >> From: hotspot-compiler-dev >> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >> Andreas Eriksson >> Sent: Tuesday, May 19, 2015 3:56 AM >> To: hotspot-compiler-dev at openjdk.java.net >> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type >> information >> >> Hi, >> >> Could I please get two reviews for the following bug fix. >> >> https://bugs.openjdk.java.net/browse/JDK-8060036 >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >> >> The problem is with type propagation in C2. >> >> A CmpU can use type information from nodes two steps up in the graph, >> when input 1 is an AddI or SubI node. >> If the AddI/SubI overflows the CmpU node type computation will set up >> two type ranges based on the inputs to the AddI/SubI instead. 
>> These type ranges will then be compared to input 2 of the CmpU node to >> see if we can make any meaningful observations on the real type of the >> CmpU. >> This means that the type of the AddI/SubI can be bottom, but the CmpU >> can still use the type of the AddI/SubI inputs to get a more specific >> type itself. >> >> The problem with this is that the types of the AddI/SubI inputs can be >> updated at a later pass, but since the AddI/SubI is of type bottom the >> propagation pass will stop there. >> At that point the CmpU node is never made aware of the new types of >> the AddI/SubI inputs, and it is left with a type that is based on old >> type information. >> >> When this happens other optimizations using the type information can >> go very wrong; in this particular case a branch to a range_check trap >> that normally never happened was optimized to always be taken. >> >> This fix adds a check in the type propagation logic for CmpU nodes, >> which makes sure that the CmpU node will be added to the worklist so >> the type can be updated. >> I've not been able to create a regression test that is reliable enough >> to check in. >> >> Thanks, >> Andreas > From vladimir.kozlov at oracle.com Wed May 20 17:24:39 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 10:24:39 -0700 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> Message-ID: <555CC357.9030006@oracle.com> Should it also return false if in(ArrayCopyNode::Src)->is_top()? thanks, Vladimir On 5/20/15 7:03 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080699/webrev.00/ > > When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. 
It's not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That's what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. > > Roland. > From roland.westrelin at oracle.com Wed May 20 18:57:06 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Wed, 20 May 2015 20:57:06 +0200 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <555CC357.9030006@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> Message-ID: <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> Thanks for looking at this, Vladimir. > Should it also return false if in(ArrayCopyNode::Src)->is_top()? That doesn't feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn't modify. Also, I don't think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get another chance to optimize it better. Roland.
>> From vladimir.kozlov at oracle.com Wed May 20 19:15:24 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 12:15:24 -0700 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> Message-ID: <555CDD4C.5030707@oracle.com> On 5/20/15 11:57 AM, Roland Westrelin wrote: > Thanks for looking at this, Vladimir. > >> Should it also return false if in(ArrayCopyNode::Src)->is_top()? > > That doesn?t feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn?t modify. Okay, returning true in such case would be conservative. Should we check dest input for top in CallLeafNode::may_modify()? Do we have other similar places? Thanks, Vladimir > Also, I don?t think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get an other chance to optimize it better. > > Roland. > > > >> >> thanks, >> Vladimir >> >> On 5/20/15 7:03 AM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ >>> >>> When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It?s not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That?s what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. >>> >>> Roland. 
>>> > From vladimir.kozlov at oracle.com Wed May 20 19:38:26 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 12:38:26 -0700 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: References: <5553C1E2.3040003@oracle.com> Message-ID: <555CE2B2.8020305@oracle.com> On 5/18/15 5:19 AM, Roland Westrelin wrote: > Thanks for looking at this, Vladimir. > >> Why do you think that main- and post- loops will be empty too and it is safe to replace them with pre-loop? Why it is safe? >> >> Opaque1 node is used only when RCE and align is requsted: >> cmp_end->set_req(2, peel_only ? pre_limit : pre_opaq); >> >> In peel_only case pre-loop have only 1 iteration and can be eliminated too. The assert could be hit in such case too. So your change may not fix the problem. > > In case of peel_only, we set: > > if( peel_only ) main_head->set_main_no_pre_loop(); > > and in IdealLoopTree::policy_range_check(): > > // If we unrolled with no intention of doing RCE and we later > // changed our minds, we got no pre-loop. Either we need to > // make a new pre-loop, or we gotta disallow RCE. > if (cl->is_main_no_pre_loop()) return false; // Disallowed for now. I missed that. > >> I think eliminating main- and post- is good investigation for separate RFE. And we need to make sure it is safe. > > My reasoning is that if the pre loop?s limit is still an opaque node, then the compiler has not been able to optimize the pre loop based on its limited number of iterations yet. I don?t see what optimization would remove stuff from the pre loop that wouldn?t have removed the same stuff from the initial loop, had it not been cloned. I could be missing something, though. Do you see one? Predicates? Can we have checks moved from pre- but not from main- because of execution order? 
Anyway, I am not strongly against this change but you still need to add check to do_range_check(), I think, in case something went wrong. Thanks, Vladimir > > Roland. > >> >> To fix this bug, I think, we should bailout from do_range_check() when we can't find pre-loop. >> >> Typo 'cab': >> + // post loop cab be removed as well >> >> Thanks, >> Vladimir >> >> >> On 5/13/15 6:02 AM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~roland/8078866/webrev.00/ >>> >>> The assert fires when optimizing that code: >>> >>> static int remi_sumb() { >>> Integer j = Integer.valueOf(1); >>> for (int i = 0; i< 1000; i++) { >>> j = j + 1; >>> } >>> return j; >>> } >>> >>> The compiler tries to find the pre loop from the main loop but fails because the pre loop isn?t there. >>> >>> Here is the chain of events that leads to the pre loop being optimized out: >>> >>> 1- The CountedLoopNode is created >>> 2- PhiNode::Value() for the induction variable?s Phi is called and capture that 0 <= i < 1000 >>> 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive value of j >>> 4- pre-loop, main-loop, post-loop are created >>> 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: >>> >>> public static Integer valueOf(int i) { >>> if (i >= IntegerCache.low && i <= IntegerCache.high) >>> return IntegerCache.cache[i + (-IntegerCache.low)]; >>> return new Integer(i); >>> } >>> >>> 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i >>> 7- In the preloop, the range of values of i are known because of 2- so and the if of valueOf() can be optimized out >>> 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() >>> >>> SplitIf doesn?t find all opportunities for improvements before the pre/main/post loops are built. 
I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf. The valueOf?s if cannot be optimized out in the main loop because the range of values of i in the main depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don?t see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. >>> >>> Instead, I?m proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops. >>> >>> I don?t have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don?t optimize the loop out at all. >>> >>> Roland. >>> > From vladimir.kozlov at oracle.com Wed May 20 22:02:08 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 20 May 2015 15:02:08 -0700 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test In-Reply-To: <5559EFA0.5060800@redhat.com> References: <5559EFA0.5060800@redhat.com> Message-ID: <555D0460.4080805@oracle.com> testVarClassCast tests deoptimization for javaMirror == null: void testVarClassCast(String s) { Class cl = (s == null) ? null : String.class; try { ssink = (String)cl.cast(svalue); Which is done in LibraryCallKit::inline_Class_cast() by: mirror = null_check(mirror); which has Deoptimization::Action_make_not_entrant. Unfortunately currently the test also pass because unstable_if is generated for the first line: (s == null) ? 
null : String.class; If you run the test with TraceDoptimization (or LogCompilation) you will see: Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast (@0x000000010b0670d8) thread=26883 reason=unstable_if action=reinterpret unloaded_class_index=-1 In what situation you see it fails? Please, run with TraceDoptimization. thanks, Vladimir On 5/18/15 6:56 AM, Andrew Haley wrote: > This test has been failing with a failed assertion: > > Exception in thread "main" java.lang.AssertionError: void NullCheckDroppingsTest.testVarClassCast(java.lang.String) was not deoptimized > at NullCheckDroppingsTest.checkDeoptimization(NullCheckDroppingsTest.java:331) > at NullCheckDroppingsTest.runTest(NullCheckDroppingsTest.java:309) > at NullCheckDroppingsTest.main(NullCheckDroppingsTest.java:125) > > It seems that the reason for the failure is that the nmethod for a > deoptimized method (testVarClassCast) is not removed. This is > perfectly correct because when testVarClassCast() was trapped, it was > with an action of Deoptimization::Action_maybe_recompile. When this > occurs, a method is not marked as dead and continues to be invoked, so > the test WhiteBox.isMethodCompiled() will still return true. > > Therefore, I think this code in checkDeoptimization is incorrect: > > // Check deoptimization event (intrinsic Class.cast() works). > if (WHITE_BOX.isMethodCompiled(method) == deopt) { > throw new AssertionError(method + " was" + (deopt ? " not" : "") + " deoptimized"); > } > > We either need some code in WhiteBox to check for a deoptimization > event properly or we should just remove this altogether. > > So, thoughts? Just delete the check? > > Andrew. > From rednaxelafx at gmail.com Thu May 21 00:22:40 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Wed, 20 May 2015 17:22:40 -0700 Subject: How to change compilation policy to trigger C2 compilation ASAP? 
In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: ReadyNow does save profiles to disk and read them back in subsequent runs. There are also other changes on the side in the runtime and compilers to make it work smoothly. Cc'ing Doug Hawkins who's the one working on ReadyNow if anybody's interested in more information. - Kris On Wed, May 20, 2015 at 6:40 AM, Vitaly Davidovich wrote: > Training mode is also dangerous unless built in from beginning as there's > risk some unwanted application effect will take place. This is beside the > possibility of feeding it "wrong" profile. > > What does Azul ReadyNow do? Saves profiles to disk and reads them upon > startup? > > sent from my phone > On May 20, 2015 4:58 AM, "Chris Newland" > wrote: > >> Hi Wei, >> >> Thanks! OpenJDK has no support for persisting compilation profile data >> between VM runs. For some industries it can be economical to use a >> commercial VM to get the lowest latencies. >> >> One "free" alternative is to add a training mode to your application and >> feed it the most realistic input data so that you get a good set of JIT >> optimisations before the inputs you care about arrive. >> >> Is there a reason why you are cold-starting your VM? (For example your GC >> strategy is to use a huge Eden space and reboot daily to avoid any garbage >> collections). >> >> Regards, >> >> Chris >> @chriswhocodes >> >> On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: >> > Hi Chris, >> > Your tool looks cool. I think OpenJDK has no technology to mitigate >> > cold-start costs, please correct if I am wrong. I can control some >> > compilation passes with option -XX:CompileCommand, but the profiling >> > data, such as invocation count and backedge count, has to reach some >> > threshold before trigger execution level transition. 
This causes >> > simulator doesn't trigger C2 compilation within reasonable time span. >> > >> > Regards! >> > wei >> > >> >> -----Original Message----- >> >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >> >> Sent: Wednesday, May 20, 2015 3:46 PM >> >> To: Vitaly Davidovich >> >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >> >> Subject: RE: How to change compilation policy to trigger C2 compilation >> >> ASAP? >> >> >> >> >> >> Hi Wei, >> >> >> >> >> >> Is there any reason why you need to cold-start your application each >> >> time and begin with no profiling information? >> >> >> >> Depending on your input data, would it be possible to "warm up" the VM >> >> by running typical inputs through your code to build a rich profile and >> >> have the JIT compilations performed before your real data arrives? >> >> >> >> One reason *against* doing this would be wide variations in your input >> >> data that could result in the wrong optimisations being made and >> >> possibly suffering a decompilation event. >> >> >> >> There are commercial VMs that have technology to mitigate cold-start >> >> costs (for example Azul Zing's ReadyNow). >> >> >> >> >> >> Have you examined the HotSpot LogCompilation output to make sure there >> >> is nothing you could change at source code level which would result in >> a >> >> better JIT decisions? I'm thinking of things like inlining failures due >> >> to call-site megamorphism. >> >> >> >> There's a free tool called JITWatch that aims to make it easier to >> >> understand the LogCompilation output >> >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales pitch >> - >> >> I'm the >> >> author). >> >> >> >> Regards, >> >> >> >> >> >> Chris >> >> @chriswhocodes >> >> >> >> >> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >> >> >> >>> I'll let Vladimir comment on the compile command, but turning off >> >>> tiered doesn't prevent inlining. 
What prevents inlining (an otherwise >> >>> inlineable method), in a nutshell, is lack of profiling information >> >>> that would indicate the method or callsite is hot enough for >> >>> inlining. So you can have tiered off and run C2 with standard >> >>> compilation thresholds, and you'll likely get good inlining because >> >>> you'll have a long profile. If you turn C2 compile threshold down too >> >>> far (and too far is going to be app-specific), C2 compilation may not >> >>> have sufficient info in the shortened profile to decide to inline. Or >> >>> even worse, the profile collected is actually not reflective of the >> >>> real profile you want to capture (e.g. app has a phase change after >> >>> initialization). >> >>> >> >>> The theoretical advantage of tiered is you can get decent perf >> >>> quickly, but also since you're getting better than interpreter speed >> >>> quickly, you can run in C1 tier longer and thus collect an even >> >>> larger profile, possibly leading to better code than C2 alone. But >> >>> unfortunately this may require serious tuning. That's my >> >>> understanding at >> >> least. >> >>> >> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >> >>> wrote: >> >>> >> >>> >> >>> >> >>>> Vladimir, >> >>>> What the 'double' means in following command line? The option >> >>>> CompileThresholdScaling you mentioned is same as >> >>>> ProfileMaturityPercentage? >> >>>> I cannot find the option in code. How much effect to function >> >>>> inlining by turning off tiered compilation? >> >>>> >> >>>>> >> >>>> >> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> >> ol >> >>>> dS caling,0.5 >> >>>> >> >>>> Regards! 
>> >>>> wei >> >>>> >> >>>>> -----Original Message----- >> >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >> >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >> >>>>> To: Tangwei (Euler); Vitaly Davidovich >> >>>>> Cc: hotspot compiler >> >>>>> Subject: Re: How to change compilation policy to trigger C2 >> >>>>> compilation >> >>>> ASAP? >> >>>> >> >>>> >> >>>>> >> >>>>> If you want only C2 compilation you can disable tiered >> >>>>> compilation >> >>>>> >> >>>> (since you >> >>>> >> >>>> >> >>>>> don't want to spend to much on profiling anyway) and set low >> >>>>> CompileThreshold (you need only this one for non-tiered compile): >> >>>>> >> >>>>> >> >>>>> >> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >> >>>>> >> >>>>> >> >>>>> >> >>>>> If your method is simple and don't need profiling or inlining the >> >>>>> >> >>>>> >> >>>> generated code >> >>>>> could be same quality as with long profiling. >> >>>>> >> >>>>> An other approach is to set threshold per method (scale all >> >>>>> threasholds >> >>>> (C1, C2, >> >>>> >> >>>> >> >>>>> interpreter) by this value). For example to reduce thresholds by >> >>>>> half: >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >> >> >> >>>>> oldScaling,0.5 >> >>>>> >> >>>>> Vladimir >> >>>>> >> >>>>> >> >>>>> >> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >> >>>>> >> >>>>> >> >>>>>> My goal is just to reach peak performance quickly. 
Following is >> >>>>>> one tier threshold combination I tried: >> >>>>>> >> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3InvocationThreshold=3 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3MinInvocationThreshold=2 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3CompileThreshold=2 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4InvocationThreshold=4 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4MinInvocationThreshold=3 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4CompileThreshold=2 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> Regards! >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> wei >> >>>>>> >> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >> >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >> >>>>>> *To:* Tangwei (Euler) >> >>>>>> *Cc:* hotspot compiler >> >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >> >>>>>> compilation ASAP? >> >>>>>> >> >>>>>> Is your goal specifically to have C2 compile or just to reach >> >>>>>> peak performance quickly? It sounds like the latter. What >> >>>>>> values did you specify for the tier thresholds? Also, it may >> >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >> >>>>>> show you which methods are being compiled, when, and at what >> >>>>>> tier. >> >>>>>> >> >>>>>> sent from my phone >> >>>>>> >> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" > >>>>>> > wrote: >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> Hi All, >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> I want to run a JAVA application on a performance simulator, >> >>>>>> and do a profiling on a hot function JITTed with C2 compiler. >> >>>>>> >> >>>>>> In order to make C2 compiler compile hot function as early as >> >>>>>> possible, I hope to reduce the threshold of function invocation >> >>>>>> >> >>>>>> count in interpreter and C1 to drive the JIT compiler >> >>>>>> transitioned to Level 4 (C2) ASAP. 
Following is the option list >> >>>>>> I try, but >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> failed to find a right combination to meet my requirement. >> >>>>>> Anyone >> >>>>>> can help to figure out what options I can use? >> >>>>>> >> >>>>>> Thanks in advance. >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3InvocationThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3MinInvocationThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier3CompileThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4InvocationThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:CompileThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4MinInvocationThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -XX:Tier4CompileThreshold >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> Regards! >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> wei >> >>>>>> >> >>>> >> >>> >> >> >> > >> > >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Thu May 21 00:53:32 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 20 May 2015 20:53:32 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Thanks Kris. Microsoft CLR also has something like that which they call Multicore JIT (i.e. background compilations are started on separate cores against saved profiles on startup). I wonder if anything of the sort is on the roadmap for hotspot? sent from my phone On May 20, 2015 8:22 PM, "Krystal Mok" wrote: > ReadyNow does save profiles to disk and read them back in subsequent runs. > There are also other changes on the side in the runtime and compilers to > make it work smoothly. 
> > Cc'ing Doug Hawkins who's the one working on ReadyNow if anybody's > interested in more information. > > - Kris > > On Wed, May 20, 2015 at 6:40 AM, Vitaly Davidovich > wrote: > >> Training mode is also dangerous unless built in from beginning as there's >> risk some unwanted application effect will take place. This is beside the >> possibility of feeding it "wrong" profile. >> >> What does Azul ReadyNow do? Saves profiles to disk and reads them upon >> startup? >> >> sent from my phone >> On May 20, 2015 4:58 AM, "Chris Newland" >> wrote: >> >>> Hi Wei, >>> >>> Thanks! OpenJDK has no support for persisting compilation profile data >>> between VM runs. For some industries it can be economical to use a >>> commercial VM to get the lowest latencies. >>> >>> One "free" alternative is to add a training mode to your application and >>> feed it the most realistic input data so that you get a good set of JIT >>> optimisations before the inputs you care about arrive. >>> >>> Is there a reason why you are cold-starting your VM? (For example your GC >>> strategy is to use a huge Eden space and reboot daily to avoid any >>> garbage >>> collections). >>> >>> Regards, >>> >>> Chris >>> @chriswhocodes >>> >>> On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: >>> > Hi Chris, >>> > Your tool looks cool. I think OpenJDK has no technology to mitigate >>> > cold-start costs, please correct if I am wrong. I can control some >>> > compilation passes with option -XX:CompileCommand, but the profiling >>> > data, such as invocation count and backedge count, has to reach some >>> > threshold before trigger execution level transition. This causes >>> > simulator doesn't trigger C2 compilation within reasonable time span. >>> > >>> > Regards! 
>>> > wei >>> > >>> >> -----Original Message----- >>> >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >>> >> Sent: Wednesday, May 20, 2015 3:46 PM >>> >> To: Vitaly Davidovich >>> >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >>> >> Subject: RE: How to change compilation policy to trigger C2 >>> compilation >>> >> ASAP? >>> >> >>> >> >>> >> Hi Wei, >>> >> >>> >> >>> >> Is there any reason why you need to cold-start your application each >>> >> time and begin with no profiling information? >>> >> >>> >> Depending on your input data, would it be possible to "warm up" the VM >>> >> by running typical inputs through your code to build a rich profile >>> and >>> >> have the JIT compilations performed before your real data arrives? >>> >> >>> >> One reason *against* doing this would be wide variations in your input >>> >> data that could result in the wrong optimisations being made and >>> >> possibly suffering a decompilation event. >>> >> >>> >> There are commercial VMs that have technology to mitigate cold-start >>> >> costs (for example Azul Zing's ReadyNow). >>> >> >>> >> >>> >> Have you examined the HotSpot LogCompilation output to make sure there >>> >> is nothing you could change at source code level which would result >>> in a >>> >> better JIT decisions? I'm thinking of things like inlining failures >>> due >>> >> to call-site megamorphism. >>> >> >>> >> There's a free tool called JITWatch that aims to make it easier to >>> >> understand the LogCompilation output >>> >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales >>> pitch - >>> >> I'm the >>> >> author). >>> >> >>> >> Regards, >>> >> >>> >> >>> >> Chris >>> >> @chriswhocodes >>> >> >>> >> >>> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >>> >> >>> >>> I'll let Vladimir comment on the compile command, but turning off >>> >>> tiered doesn't prevent inlining. 
What prevents inlining (an >>> otherwise >>> >>> inlineable method), in a nutshell, is lack of profiling information >>> >>> that would indicate the method or callsite is hot enough for >>> >>> inlining. So you can have tiered off and run C2 with standard >>> >>> compilation thresholds, and you'll likely get good inlining because >>> >>> you'll have a long profile. If you turn C2 compile threshold down >>> too >>> >>> far (and too far is going to be app-specific), C2 compilation may not >>> >>> have sufficient info in the shortened profile to decide to inline. >>> Or >>> >>> even worse, the profile collected is actually not reflective of the >>> >>> real profile you want to capture (e.g. app has a phase change after >>> >>> initialization). >>> >>> >>> >>> The theoretical advantage of tiered is you can get decent perf >>> >>> quickly, but also since you're getting better than interpreter speed >>> >>> quickly, you can run in C1 tier longer and thus collect an even >>> >>> larger profile, possibly leading to better code than C2 alone. But >>> >>> unfortunately this may require serious tuning. That's my >>> >>> understanding at >>> >> least. >>> >>> >>> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >>> >>> wrote: >>> >>> >>> >>> >>> >>> >>> >>>> Vladimir, >>> >>>> What the 'double' means in following command line? The option >>> >>>> CompileThresholdScaling you mentioned is same as >>> >>>> ProfileMaturityPercentage? >>> >>>> I cannot find the option in code. How much effect to function >>> >>>> inlining by turning off tiered compilation? >>> >>>> >>> >>>>> >>> >>>> >>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>> >> ol >>> >>>> dS caling,0.5 >>> >>>> >>> >>>> Regards! 
>>> >>>> wei >>> >>>> >>> >>>>> -----Original Message----- >>> >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>> >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >>> >>>>> To: Tangwei (Euler); Vitaly Davidovich >>> >>>>> Cc: hotspot compiler >>> >>>>> Subject: Re: How to change compilation policy to trigger C2 >>> >>>>> compilation >>> >>>> ASAP? >>> >>>> >>> >>>> >>> >>>>> >>> >>>>> If you want only C2 compilation you can disable tiered >>> >>>>> compilation >>> >>>>> >>> >>>> (since you >>> >>>> >>> >>>> >>> >>>>> don't want to spend to much on profiling anyway) and set low >>> >>>>> CompileThreshold (you need only this one for non-tiered compile): >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> If your method is simple and don't need profiling or inlining the >>> >>>>> >>> >>>>> >>> >>>> generated code >>> >>>>> could be same quality as with long profiling. >>> >>>>> >>> >>>>> An other approach is to set threshold per method (scale all >>> >>>>> threasholds >>> >>>> (C1, C2, >>> >>>> >>> >>>> >>> >>>>> interpreter) by this value). For example to reduce thresholds by >>> >>>>> half: >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>> >> >>> >>>>> oldScaling,0.5 >>> >>>>> >>> >>>>> Vladimir >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>> >>>>> >>> >>>>> >>> >>>>>> My goal is just to reach peak performance quickly. 
Following is >>> >>>>>> one tier threshold combination I tried: >>> >>>>>> >>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3InvocationThreshold=3 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3MinInvocationThreshold=2 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3CompileThreshold=2 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4InvocationThreshold=4 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4MinInvocationThreshold=3 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4CompileThreshold=2 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> Regards! >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> wei >>> >>>>>> >>> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>> >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>> >>>>>> *To:* Tangwei (Euler) >>> >>>>>> *Cc:* hotspot compiler >>> >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >>> >>>>>> compilation ASAP? >>> >>>>>> >>> >>>>>> Is your goal specifically to have C2 compile or just to reach >>> >>>>>> peak performance quickly? It sounds like the latter. What >>> >>>>>> values did you specify for the tier thresholds? Also, it may >>> >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >>> >>>>>> show you which methods are being compiled, when, and at what >>> >>>>>> tier. >>> >>>>>> >>> >>>>>> sent from my phone >>> >>>>>> >>> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >> >>>>>> > wrote: >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> Hi All, >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> I want to run a JAVA application on a performance simulator, >>> >>>>>> and do a profiling on a hot function JITTed with C2 compiler. >>> >>>>>> >>> >>>>>> In order to make C2 compiler compile hot function as early as >>> >>>>>> possible, I hope to reduce the threshold of function invocation >>> >>>>>> >>> >>>>>> count in interpreter and C1 to drive the JIT compiler >>> >>>>>> transitioned to Level 4 (C2) ASAP. 
Following is the option list >>> >>>>>> I try, but >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> failed to find a right combination to meet my requirement. >>> >>>>>> Anyone >>> >>>>>> can help to figure out what options I can use? >>> >>>>>> >>> >>>>>> Thanks in advance. >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3InvocationThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3MinInvocationThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier3CompileThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4InvocationThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:CompileThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4MinInvocationThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> -XX:Tier4CompileThreshold >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> Regards! >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> wei >>> >>>>>> >>> >>>> >>> >>> >>> >> >>> > >>> > >>> >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Thu May 21 06:53:18 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Wed, 20 May 2015 23:53:18 -0700 Subject: How to change compilation policy to trigger C2 compilation ASAP? In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Hi Vitaly, Multicore JIT in the CLR is totally different from what we do in Azul Zing's ReadyNow. The profile that ReadyNow saves and restores is the kind of detailed profile that C2 uses directly, similar to HotSpot's MDOs. The "profile" that Multicore JIT uses is just a list of methods (which can be grouped together under user-specified group names), without any detailed profiles.
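For reference, the two recipes Vladimir gives earlier in this thread can be put on one command line. A minimal sketch follows; the flags are quoted verbatim from the thread, while `Main` is a placeholder entry class and `SomeClass.someMethod` is the thread's own illustrative method name:

```shell
# Sketch of the two approaches suggested earlier in this thread.
# "Main" and "SomeClass.someMethod" are placeholder names.

# Approach 1: turn tiered compilation off and lower the single C2
# threshold, so hot methods go from the interpreter straight to C2
# after ~100 invocations.
NON_TIERED="-XX:-TieredCompilation -XX:CompileThreshold=100"

# Approach 2: keep tiering but scale all thresholds (interpreter,
# C1, C2) for one method by 0.5 via CompileThresholdScaling.
PER_METHOD="-XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresholdScaling,0.5"

# -XX:+PrintCompilation shows which methods compile, when, and at
# what tier, which helps when tuning these values.
echo "java $NON_TIERED -XX:+PrintCompilation Main"
echo "java $PER_METHOD -XX:+PrintCompilation Main"
```

With either command, -XX:+PrintCompilation should show level-4 (C2) entries appearing much earlier than with default thresholds; the exact values are workload-specific, as Vitaly notes above.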
The JIT compilers in CLR can only use detailed profiles when run in MPGO (managed profile-guided optimization) mode, which is a mode used with NGen (native image generator), an AOT compilation. I'll have to leave the HotSpot part of your question to Oracle people to answer. - Kris On Wed, May 20, 2015 at 5:53 PM, Vitaly Davidovich wrote: > Thanks Kris. Microsoft CLR also has something like that which they call > Multicore JIT (i.e. background compilations are started on separate cores > against saved profiles on startup). > > I wonder if anything of the sort is on the roadmap for hotspot? > > sent from my phone > On May 20, 2015 8:22 PM, "Krystal Mok" wrote: > >> ReadyNow does save profiles to disk and read them back in subsequent runs. >> There are also other changes on the side in the runtime and compilers to >> make it work smoothly. >> >> Cc'ing Doug Hawkins who's the one working on ReadyNow if anybody's >> interested in more information. >> >> - Kris >> >> On Wed, May 20, 2015 at 6:40 AM, Vitaly Davidovich >> wrote: >> >>> Training mode is also dangerous unless built in from beginning as >>> there's risk some unwanted application effect will take place. This is >>> beside the possibility of feeding it "wrong" profile. >>> >>> What does Azul ReadyNow do? Saves profiles to disk and reads them upon >>> startup? >>> >>> sent from my phone >>> On May 20, 2015 4:58 AM, "Chris Newland" >>> wrote: >>> >>>> Hi Wei, >>>> >>>> Thanks! OpenJDK has no support for persisting compilation profile data >>>> between VM runs. For some industries it can be economical to use a >>>> commercial VM to get the lowest latencies. >>>> >>>> One "free" alternative is to add a training mode to your application and >>>> feed it the most realistic input data so that you get a good set of JIT >>>> optimisations before the inputs you care about arrive. >>>> >>>> Is there a reason why you are cold-starting your VM? 
(For example your >>>> GC >>>> strategy is to use a huge Eden space and reboot daily to avoid any >>>> garbage >>>> collections). >>>> >>>> Regards, >>>> >>>> Chris >>>> @chriswhocodes >>>> >>>> On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: >>>> > Hi Chris, >>>> > Your tool looks cool. I think OpenJDK has no technology to mitigate >>>> > cold-start costs, please correct if I am wrong. I can control some >>>> > compilation passes with option -XX:CompileCommand, but the profiling >>>> > data, such as invocation count and backedge count, has to reach some >>>> > threshold before trigger execution level transition. This causes >>>> > simulator doesn't trigger C2 compilation within reasonable time span. >>>> > >>>> > Regards! >>>> > wei >>>> > >>>> >> -----Original Message----- >>>> >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >>>> >> Sent: Wednesday, May 20, 2015 3:46 PM >>>> >> To: Vitaly Davidovich >>>> >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >>>> >> Subject: RE: How to change compilation policy to trigger C2 >>>> compilation >>>> >> ASAP? >>>> >> >>>> >> >>>> >> Hi Wei, >>>> >> >>>> >> >>>> >> Is there any reason why you need to cold-start your application each >>>> >> time and begin with no profiling information? >>>> >> >>>> >> Depending on your input data, would it be possible to "warm up" the >>>> VM >>>> >> by running typical inputs through your code to build a rich profile >>>> and >>>> >> have the JIT compilations performed before your real data arrives? >>>> >> >>>> >> One reason *against* doing this would be wide variations in your >>>> input >>>> >> data that could result in the wrong optimisations being made and >>>> >> possibly suffering a decompilation event. >>>> >> >>>> >> There are commercial VMs that have technology to mitigate cold-start >>>> >> costs (for example Azul Zing's ReadyNow). 
>>>> >> >>>> >> >>>> >> Have you examined the HotSpot LogCompilation output to make sure >>>> there >>>> >> is nothing you could change at source code level which would result >>>> in a >>>> >> better JIT decisions? I'm thinking of things like inlining failures >>>> due >>>> >> to call-site megamorphism. >>>> >> >>>> >> There's a free tool called JITWatch that aims to make it easier to >>>> >> understand the LogCompilation output >>>> >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales >>>> pitch - >>>> >> I'm the >>>> >> author). >>>> >> >>>> >> Regards, >>>> >> >>>> >> >>>> >> Chris >>>> >> @chriswhocodes >>>> >> >>>> >> >>>> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >>>> >> >>>> >>> I'll let Vladimir comment on the compile command, but turning off >>>> >>> tiered doesn't prevent inlining. What prevents inlining (an >>>> otherwise >>>> >>> inlineable method), in a nutshell, is lack of profiling >>>> information >>>> >>> that would indicate the method or callsite is hot enough for >>>> >>> inlining. So you can have tiered off and run C2 with standard >>>> >>> compilation thresholds, and you'll likely get good inlining because >>>> >>> you'll have a long profile. If you turn C2 compile threshold down >>>> too >>>> >>> far (and too far is going to be app-specific), C2 compilation may >>>> not >>>> >>> have sufficient info in the shortened profile to decide to inline. >>>> Or >>>> >>> even worse, the profile collected is actually not reflective of the >>>> >>> real profile you want to capture (e.g. app has a phase change after >>>> >>> initialization). >>>> >>> >>>> >>> The theoretical advantage of tiered is you can get decent perf >>>> >>> quickly, but also since you're getting better than interpreter speed >>>> >>> quickly, you can run in C1 tier longer and thus collect an even >>>> >>> larger profile, possibly leading to better code than C2 alone. But >>>> >>> unfortunately this may require serious tuning. 
That's my >>>> >>> understanding at >>>> >> least. >>>> >>> >>>> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >>>> >>> wrote: >>>> >>> >>>> >>> >>>> >>> >>>> >>>> Vladimir, >>>> >>>> What the 'double' means in following command line? The option >>>> >>>> CompileThresholdScaling you mentioned is same as >>>> >>>> ProfileMaturityPercentage? >>>> >>>> I cannot find the option in code. How much effect to function >>>> >>>> inlining by turning off tiered compilation? >>>> >>>> >>>> >>>>> >>>> >>>> >>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>>> >> ol >>>> >>>> dS caling,0.5 >>>> >>>> >>>> >>>> Regards! >>>> >>>> wei >>>> >>>> >>>> >>>>> -----Original Message----- >>>> >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>>> >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >>>> >>>>> To: Tangwei (Euler); Vitaly Davidovich >>>> >>>>> Cc: hotspot compiler >>>> >>>>> Subject: Re: How to change compilation policy to trigger C2 >>>> >>>>> compilation >>>> >>>> ASAP? >>>> >>>> >>>> >>>> >>>> >>>>> >>>> >>>>> If you want only C2 compilation you can disable tiered >>>> >>>>> compilation >>>> >>>>> >>>> >>>> (since you >>>> >>>> >>>> >>>> >>>> >>>>> don't want to spend to much on profiling anyway) and set low >>>> >>>>> CompileThreshold (you need only this one for non-tiered compile): >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> If your method is simple and don't need profiling or inlining the >>>> >>>>> >>>> >>>>> >>>> >>>> generated code >>>> >>>>> could be same quality as with long profiling. >>>> >>>>> >>>> >>>>> An other approach is to set threshold per method (scale all >>>> >>>>> threasholds >>>> >>>> (C1, C2, >>>> >>>> >>>> >>>> >>>> >>>>> interpreter) by this value). 
For example to reduce thresholds by >>>> >>>>> half: >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>>> >> >>>> >>>>> oldScaling,0.5 >>>> >>>>> >>>> >>>>> Vladimir >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>>> >>>>> >>>> >>>>> >>>> >>>>>> My goal is just to reach peak performance quickly. Following is >>>> >>>>>> one tier threshold combination I tried: >>>> >>>>>> >>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3InvocationThreshold=3 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3MinInvocationThreshold=2 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3CompileThreshold=2 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4InvocationThreshold=4 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4MinInvocationThreshold=3 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4CompileThreshold=2 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> Regards! >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> wei >>>> >>>>>> >>>> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>> >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>> >>>>>> *To:* Tangwei (Euler) >>>> >>>>>> *Cc:* hotspot compiler >>>> >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>> >>>>>> compilation ASAP? >>>> >>>>>> >>>> >>>>>> Is your goal specifically to have C2 compile or just to reach >>>> >>>>>> peak performance quickly? It sounds like the latter. What >>>> >>>>>> values did you specify for the tier thresholds? Also, it may >>>> >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >>>> >>>>>> show you which methods are being compiled, when, and at what >>>> >>>>>> tier. 
>>>> >>>>>> >>>> >>>>>> sent from my phone >>>> >>>>>> >>>> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>> >>>>>> > wrote: >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> Hi All, >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> I want to run a JAVA application on a performance simulator, >>>> >>>>>> and do a profiling on a hot function JITTed with C2 compiler. >>>> >>>>>> >>>> >>>>>> In order to make C2 compiler compile hot function as early as >>>> >>>>>> possible, I hope to reduce the threshold of function invocation >>>> >>>>>> >>>> >>>>>> count in interpreter and C1 to drive the JIT compiler >>>> >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list >>>> >>>>>> I try, but >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> failed to find a right combination to meet my requirement. >>>> >>>>>> Anyone >>>> >>>>>> can help to figure out what options I can use? >>>> >>>>>> >>>> >>>>>> Thanks in advance. >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3InvocationThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3MinInvocationThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier3CompileThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4InvocationThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:CompileThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4MinInvocationThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> -XX:Tier4CompileThreshold >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> Regards! >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> wei >>>> >>>>>> >>>> >>>> >>>> >>> >>>> >> >>>> > >>>> > >>>> >>>> >>>> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From yekaterina.kantserova at oracle.com Thu May 21 07:33:42 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Thu, 21 May 2015 09:33:42 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 Message-ID: <555D8A56.3050001@oracle.com> Hi, Could I please have a review of this new test. bug: https://bugs.openjdk.java.net/browse/JDK-8080828 webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. Thanks, Katja From staffan.larsen at oracle.com Thu May 21 08:13:04 2015 From: staffan.larsen at oracle.com (Staffan Larsen) Date: Thu, 21 May 2015 10:13:04 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: <555D8A56.3050001@oracle.com> References: <555D8A56.3050001@oracle.com> Message-ID: Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. > On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: > > Hi, > > Could I please have a review of this new test. > > bug: https://bugs.openjdk.java.net/browse/JDK-8080828 > webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 > > This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. > > Thanks, > Katja From staffan.larsen at oracle.com Thu May 21 08:15:29 2015 From: staffan.larsen at oracle.com (Staffan Larsen) Date: Thu, 21 May 2015 10:15:29 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: References: <555D8A56.3050001@oracle.com> Message-ID: I just realized that you also need to add a check if SA can attach on the current platform.
See the test in sa/jmap-hashcode: if (!Platform.shouldSAAttach()) { System.out.println("SA attach not expected to work - test skipped."); return; } Thanks, /Staffan > On 21 maj 2015, at 10:13, Staffan Larsen wrote: > > Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. > >> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >> >> Hi, >> >> Could I please have a review of this new test. >> >> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >> >> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. >> >> Thanks, >> Katja > From yekaterina.kantserova at oracle.com Thu May 21 08:58:10 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Thu, 21 May 2015 10:58:10 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: References: <555D8A56.3050001@oracle.com> Message-ID: <555D9E22.6090601@oracle.com> Staffan, thanks for looking at it! I'll fix the test according to your comments. // Katja On 05/21/2015 10:15 AM, Staffan Larsen wrote: > I just realize that you also need to add a check if SA can attach on the current platform. See the test in sa/jmap-hashcode: > > if (!Platform.shouldSAAttach()) { > System.out.println("SA attach not expected to work - test skipped."); > return; > } > > Thanks, > /Staffan > >> On 21 maj 2015, at 10:13, Staffan Larsen wrote: >> >> Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. >> >>> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >>> >>> Hi, >>> >>> Could I please have a review of this new test.
>>> >>> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >>> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >>> >>> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. >>> >>> Thanks, >>> Katja From andreas.eriksson at oracle.com Thu May 21 09:14:33 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Thu, 21 May 2015 11:14:33 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555CB3A1.2000308@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <555CB3A1.2000308@oracle.com> Message-ID: <555DA1F9.4020207@oracle.com> Thanks Vladimir! On 2015-05-20 18:17, Vladimir Kozlov wrote: > Looks good. I agree with explanation and fix. > > Thanks, > Vladimir > > On 5/20/15 4:04 AM, Andreas Eriksson wrote: >> Comments added, thanks. >> >> Can I have a Reviewer look at this as well? >> >> New webrev: >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ >> >> - Andreas >> >> On 2015-05-19 20:24, Berg, Michael C wrote: >>> Hi Andreas, could you add a comment to the push statement: >>> >>> // Propagate change to user >>> >>> And to Node* p declaration and init: >>> >>> // Propagate changes to uses >>> >>> So that the new code carries all similar information. >>> Else the changes seem ok. >>> >>> -Michael >>> >>> -----Original Message----- >>> From: hotspot-compiler-dev >>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of >>> Andreas Eriksson >>> Sent: Tuesday, May 19, 2015 3:56 AM >>> To: hotspot-compiler-dev at openjdk.java.net >>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type >>> information >>> >>> Hi, >>> >>> Could I please get two reviews for the following bug fix. >>> >>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>> >>> The problem is with type propagation in C2. 
>>> >>> A CmpU can use type information from nodes two steps up in the graph, >>> when input 1 is an AddI or SubI node. >>> If the AddI/SubI overflows the CmpU node type computation will set up >>> two type ranges based on the inputs to the AddI/SubI instead. >>> These type ranges will then be compared to input 2 of the CmpU node to >>> see if we can make any meaningful observations on the real type of the >>> CmpU. >>> This means that the type of the AddI/SubI can be bottom, but the CmpU >>> can still use the type of the AddI/SubI inputs to get a more specific >>> type itself. >>> >>> The problem with this is that the types of the AddI/SubI inputs can be >>> updated at a later pass, but since the AddI/SubI is of type bottom the >>> propagation pass will stop there. >>> At that point the CmpU node is never made aware of the new types of >>> the AddI/SubI inputs, and it is left with a type that is based on old >>> type information. >>> >>> When this happens other optimizations using the type information can >>> go very wrong; in this particular case a branch to a range_check trap >>> that normally never happened was optimized to always be taken. >>> >>> This fix adds a check in the type propagation logic for CmpU nodes, >>> which makes sure that the CmpU node will be added to the worklist so >>> the type can be updated. >>> I've not been able to create a regression test that is reliable enough >>> to check in. 
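[Editor's note: the AddI-overflow scenario described above has a simple Java-source counterpart. The class below is not the missing regression test (the mail notes a reliable one could not be written) and all its names are invented for illustration; it only shows how a signed int addition that wraps around changes the outcome of a bounds-style comparison, which is why the compare node must reason about the AddI's two inputs separately once the sum's type collapses to bottom.]

```java
public class OverflowCompareDemo {

    // A bounds-style check of the form (i + k) < limit, computed in
    // 32-bit signed arithmetic. C2 models the addition as an AddI node
    // feeding a compare node.
    public static boolean inBounds(int i, int k, int limit) {
        return i + k < limit;
    }

    public static void main(String[] args) {
        // No overflow: the check means what it appears to mean.
        System.out.println(inBounds(10, 20, 100));               // true

        // Overflow: Integer.MAX_VALUE + 1 wraps to Integer.MIN_VALUE,
        // which is smaller than any positive limit, so the "bounds
        // check" passes for an enormous index. The compare's outcome
        // now depends on the two addends, not on their mathematical sum.
        System.out.println(inBounds(Integer.MAX_VALUE, 1, 100)); // true
    }
}
```

[When the addition may overflow, the AddI's own type is bottom, so a compare that wants a sharper answer has to consult the addends' type ranges directly — the two-steps-up dependency that the fix keeps alive on the worklist.]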
>>> >>> Thanks, >>> Andreas >> From aph at redhat.com Thu May 21 10:03:40 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 21 May 2015 11:03:40 +0100 Subject: 8054492: compiler/intrinsics/classcast/NullCheckDroppingsTest.java is an invalid test In-Reply-To: <555D0460.4080805@oracle.com> References: <5559EFA0.5060800@redhat.com> <555D0460.4080805@oracle.com> Message-ID: <555DAD7C.2090805@redhat.com> On 05/20/2015 11:02 PM, Vladimir Kozlov wrote: > testVarClassCast tests deoptimization for javaMirror == null: > > void testVarClassCast(String s) { > Class cl = (s == null) ? null : String.class; > try { > ssink = (String)cl.cast(svalue); > > Which is done in LibraryCallKit::inline_Class_cast() by: > > mirror = null_check(mirror); > > which has Deoptimization::Action_make_not_entrant. > > Unfortunately currently the test also pass because unstable_if is > generated for the first line: > > (s == null) ? null : String.class; > > If you run the test with TraceDoptimization (or LogCompilation) you will > see: > > Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast > (@0x000000010b0670d8) thread=26883 reason=unstable_if action=reinterpret > unloaded_class_index=-1 Not quite the same. I get a reason=null_check: Uncommon trap occurred in NullCheckDroppingsTest::testVarClassCast (@0x000003ff8d253e54) thread=4396243677696 reason=null_check action=maybe_recompile unloaded_class_index=-1 Which comes from a SEGV here: 0x000003ff89253ca4: ldr x0, [x10,#72] ; implicit exception: dispatches to 0x000003ff89253e40 which is the line ssink = (String)cl.cast(svalue); I don't get a trap for unstable_if because there isn't one. 
I just get 0x000003ff89253c90: cbz x2, 0x000003ff89253e00 (this is java/lang/String s) --> 0x000003ff89253e00: mov x10, xzr 0x000003ff89253e04: b 0x000003ff89253ca0 --> 0x000003ff89253ca0: lsl x11, x11, #3 ;*getstatic svalue ; - NullCheckDroppingsTest::testVarClassCast at 13 (line 181) 0x000003ff89253ca4: ldr x0, [x10,#72] ; implicit exception: dispatches to 0x000003ff89253e40 ... and then the trap for the null pointer exception. Andrew. -------------- next part -------------- A non-text attachment was scrubbed... Name: deopt.log Type: text/x-log Size: 65408 bytes Desc: not available URL: From yekaterina.kantserova at oracle.com Thu May 21 10:55:06 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Thu, 21 May 2015 12:55:06 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: References: <555D8A56.3050001@oracle.com> Message-ID: <555DB98A.7050401@oracle.com> The new webrev can be found here: http://cr.openjdk.java.net/~ykantser/8080828/webrev.01 Thanks, Katja On 05/21/2015 10:15 AM, Staffan Larsen wrote: > I just realize that you also need to add a check if SA can attach on the current platform. See the test in sa/jmap-hashcode: > > if (!Platform.shouldSAAttach()) { > System.out.println("SA attach not expected to work - test skipped."); > return; > } > > Thanks, > /Staffan > >> On 21 maj 2015, at 10:13, Staffan Larsen wrote: >> >> Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. >> >>> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >>> >>> Hi, >>> >>> Could I please have a review of this new test. >>> >>> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >>> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >>> >>> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. 
>>> >>> Thanks, >>> Katja From staffan.larsen at oracle.com Thu May 21 11:04:10 2015 From: staffan.larsen at oracle.com (Staffan Larsen) Date: Thu, 21 May 2015 13:04:10 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: <555DB98A.7050401@oracle.com> References: <555D8A56.3050001@oracle.com> <555DB98A.7050401@oracle.com> Message-ID: <5C6FDB60-85C3-4130-ABF0-52368963E0DB@oracle.com> Looks good! Thanks, /Staffan > On 21 maj 2015, at 12:55, Yekaterina Kantserova wrote: > > The new webrev can be found here: http://cr.openjdk.java.net/~ykantser/8080828/webrev.01 > > Thanks, > Katja > > > > On 05/21/2015 10:15 AM, Staffan Larsen wrote: >> I just realize that you also need to add a check if SA can attach on the current platform. See the test in sa/jmap-hashcode: >> >> if (!Platform.shouldSAAttach()) { >> System.out.println("SA attach not expected to work - test skipped."); >> return; >> } >> >> Thanks, >> /Staffan >> >>> On 21 maj 2015, at 10:13, Staffan Larsen wrote: >>> >>> Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. >>> >>>> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >>>> >>>> Hi, >>>> >>>> Could I please have a review of this new test. >>>> >>>> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >>>> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >>>> >>>> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. >>>> >>>> Thanks, >>>> Katja > From vitalyd at gmail.com Thu May 21 11:11:29 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Thu, 21 May 2015 07:11:29 -0400 Subject: How to change compilation policy to trigger C2 compilation ASAP? 
In-Reply-To: References: <555BDF7D.4010002@oracle.com> <2b2f818ff7b594eda08559c4cd0e7028.squirrel@excalibur.xssl.net> <318b373acf4f165086700bc04c917d95.squirrel@excalibur.xssl.net> Message-ID: Hi Kris, MS CLR doesn't maintain the rich profile that Hotspot and Zing do, so yes, for them it's pretty much a list of methods. So I'm only mentioning them in terms of doing eager compilation at startup based on profile info, irrespective of what's actually in the profile. Thanks sent from my phone On May 21, 2015 2:53 AM, "Krystal Mok" wrote: > Hi Vitaly, > > Multicore JIT in the CLR is totally different from what we do in Azul > Zing's ReadyNow. > > The profile that ReadyNow saves and restores is the kind of detailed > profile that C2 uses directly, similar to HotSpot's MDOs. > The "profile" that Multicore JIT uses is just a list of methods (which can > be group together under user-specified group names), without any detail > profiles. > > The JIT compilers in CLR can only use detailed profiles when run in MPGO > (managed profile-guided optimization) mode, which is a mode used with NGen > (native image generator), an AOT compilation. > > I'll have to leave the HotSpot part of your question to Oracle people to > answer. > > - Kris > > On Wed, May 20, 2015 at 5:53 PM, Vitaly Davidovich > wrote: > >> Thanks Kris. Microsoft CLR also has something like that which they call >> Multicore JIT (i.e. background compilations are started on separate cores >> against saved profiles on startup). >> >> I wonder if anything of the sort is on the roadmap for hotspot? >> >> sent from my phone >> On May 20, 2015 8:22 PM, "Krystal Mok" wrote: >> >>> ReadyNow does save profiles to disk and read them back in subsequent >>> runs. >>> There are also other changes on the side in the runtime and compilers to >>> make it work smoothly. >>> >>> Cc'ing Doug Hawkins who's the one working on ReadyNow if anybody's >>> interested in more information. 
>>> >>> - Kris >>> >>> On Wed, May 20, 2015 at 6:40 AM, Vitaly Davidovich >>> wrote: >>> >>>> Training mode is also dangerous unless built in from beginning as >>>> there's risk some unwanted application effect will take place. This is >>>> beside the possibility of feeding it "wrong" profile. >>>> >>>> What does Azul ReadyNow do? Saves profiles to disk and reads them upon >>>> startup? >>>> >>>> sent from my phone >>>> On May 20, 2015 4:58 AM, "Chris Newland" >>>> wrote: >>>> >>>>> Hi Wei, >>>>> >>>>> Thanks! OpenJDK has no support for persisting compilation profile data >>>>> between VM runs. For some industries it can be economical to use a >>>>> commercial VM to get the lowest latencies. >>>>> >>>>> One "free" alternative is to add a training mode to your application >>>>> and >>>>> feed it the most realistic input data so that you get a good set of JIT >>>>> optimisations before the inputs you care about arrive. >>>>> >>>>> Is there a reason why you are cold-starting your VM? (For example your >>>>> GC >>>>> strategy is to use a huge Eden space and reboot daily to avoid any >>>>> garbage >>>>> collections). >>>>> >>>>> Regards, >>>>> >>>>> Chris >>>>> @chriswhocodes >>>>> >>>>> On Wed, May 20, 2015 09:17, Tangwei (Euler) wrote: >>>>> > Hi Chris, >>>>> > Your tool looks cool. I think OpenJDK has no technology to mitigate >>>>> > cold-start costs, please correct if I am wrong. I can control some >>>>> > compilation passes with option -XX:CompileCommand, but the profiling >>>>> > data, such as invocation count and backedge count, has to reach some >>>>> > threshold before trigger execution level transition. This causes >>>>> > simulator doesn't trigger C2 compilation within reasonable time span. >>>>> > >>>>> > Regards! 
>>>>> > wei >>>>> > >>>>> >> -----Original Message----- >>>>> >> From: Chris Newland [mailto:cnewland at chrisnewland.com] >>>>> >> Sent: Wednesday, May 20, 2015 3:46 PM >>>>> >> To: Vitaly Davidovich >>>>> >> Cc: Tangwei (Euler); Vladimir Kozlov; hotspot compiler >>>>> >> Subject: RE: How to change compilation policy to trigger C2 >>>>> compilation >>>>> >> ASAP? >>>>> >> >>>>> >> >>>>> >> Hi Wei, >>>>> >> >>>>> >> >>>>> >> Is there any reason why you need to cold-start your application each >>>>> >> time and begin with no profiling information? >>>>> >> >>>>> >> Depending on your input data, would it be possible to "warm up" the >>>>> VM >>>>> >> by running typical inputs through your code to build a rich profile >>>>> and >>>>> >> have the JIT compilations performed before your real data arrives? >>>>> >> >>>>> >> One reason *against* doing this would be wide variations in your >>>>> input >>>>> >> data that could result in the wrong optimisations being made and >>>>> >> possibly suffering a decompilation event. >>>>> >> >>>>> >> There are commercial VMs that have technology to mitigate cold-start >>>>> >> costs (for example Azul Zing's ReadyNow). >>>>> >> >>>>> >> >>>>> >> Have you examined the HotSpot LogCompilation output to make sure >>>>> there >>>>> >> is nothing you could change at source code level which would result >>>>> in a >>>>> >> better JIT decisions? I'm thinking of things like inlining failures >>>>> due >>>>> >> to call-site megamorphism. >>>>> >> >>>>> >> There's a free tool called JITWatch that aims to make it easier to >>>>> >> understand the LogCompilation output >>>>> >> (https://github.com/AdoptOpenJDK/jitwatch) (Sorry for the sales >>>>> pitch - >>>>> >> I'm the >>>>> >> author). 
>>>>> >> >>>>> >> Regards, >>>>> >> >>>>> >> >>>>> >> Chris >>>>> >> @chriswhocodes >>>>> >> >>>>> >> >>>>> >> On Wed, May 20, 2015 03:11, Vitaly Davidovich wrote: >>>>> >> >>>>> >>> I'll let Vladimir comment on the compile command, but turning off >>>>> >>> tiered doesn't prevent inlining. What prevents inlining (an >>>>> otherwise >>>>> >>> inlineable method), in a nutshell, is lack of profiling >>>>> information >>>>> >>> that would indicate the method or callsite is hot enough for >>>>> >>> inlining. So you can have tiered off and run C2 with standard >>>>> >>> compilation thresholds, and you'll likely get good inlining because >>>>> >>> you'll have a long profile. If you turn C2 compile threshold down >>>>> too >>>>> >>> far (and too far is going to be app-specific), C2 compilation may >>>>> not >>>>> >>> have sufficient info in the shortened profile to decide to >>>>> inline. Or >>>>> >>> even worse, the profile collected is actually not reflective of the >>>>> >>> real profile you want to capture (e.g. app has a phase change after >>>>> >>> initialization). >>>>> >>> >>>>> >>> The theoretical advantage of tiered is you can get decent perf >>>>> >>> quickly, but also since you're getting better than interpreter >>>>> speed >>>>> >>> quickly, you can run in C1 tier longer and thus collect an even >>>>> >>> larger profile, possibly leading to better code than C2 alone. But >>>>> >>> unfortunately this may require serious tuning. That's my >>>>> >>> understanding at >>>>> >> least. >>>>> >>> >>>>> >>> sent from my phone On May 19, 2015 10:00 PM, "Tangwei (Euler)" >>>>> >>> wrote: >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>>> Vladimir, >>>>> >>>> What the 'double' means in following command line? The option >>>>> >>>> CompileThresholdScaling you mentioned is same as >>>>> >>>> ProfileMaturityPercentage? >>>>> >>>> I cannot find the option in code. How much effect to function >>>>> >>>> inlining by turning off tiered compilation? 
>>>>> >>>> >>>>> >>>>> >>>>> >>>> >>>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>>>> >> ol >>>>> >>>> dS caling,0.5 >>>>> >>>> >>>>> >>>> Regards! >>>>> >>>> wei >>>>> >>>> >>>>> >>>>> -----Original Message----- >>>>> >>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] >>>>> >>>>> Sent: Wednesday, May 20, 2015 9:12 AM >>>>> >>>>> To: Tangwei (Euler); Vitaly Davidovich >>>>> >>>>> Cc: hotspot compiler >>>>> >>>>> Subject: Re: How to change compilation policy to trigger C2 >>>>> >>>>> compilation >>>>> >>>> ASAP? >>>>> >>>> >>>>> >>>> >>>>> >>>>> >>>>> >>>>> If you want only C2 compilation you can disable tiered >>>>> >>>>> compilation >>>>> >>>>> >>>>> >>>> (since you >>>>> >>>> >>>>> >>>> >>>>> >>>>> don't want to spend to much on profiling anyway) and set low >>>>> >>>>> CompileThreshold (you need only this one for non-tiered compile): >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -XX:-TieredCompilation -XX:CompileThreshold=100 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> If your method is simple and don't need profiling or inlining the >>>>> >>>>> >>>>> >>>>> >>>>> >>>> generated code >>>>> >>>>> could be same quality as with long profiling. >>>>> >>>>> >>>>> >>>>> An other approach is to set threshold per method (scale all >>>>> >>>>> threasholds >>>>> >>>> (C1, C2, >>>>> >>>> >>>>> >>>> >>>>> >>>>> interpreter) by this value). For example to reduce thresholds by >>>>> >>>>> half: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >> -XX:CompileCommand=option,SomeClass.someMethod,double,CompileThresh >>>>> >> >>>>> >>>>> oldScaling,0.5 >>>>> >>>>> >>>>> >>>>> Vladimir >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 5/19/15 5:43 PM, Tangwei (Euler) wrote: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> My goal is just to reach peak performance quickly. 
Following is >>>>> >>>>>> one tier threshold combination I tried: >>>>> >>>>>> >>>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3InvocationThreshold=3 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3MinInvocationThreshold=2 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3CompileThreshold=2 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4InvocationThreshold=4 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4MinInvocationThreshold=3 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4CompileThreshold=2 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> Regards! >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> wei >>>>> >>>>>> >>>>> >>>>>> *From:*Vitaly Davidovich [mailto:vitalyd at gmail.com] >>>>> >>>>>> *Sent:* Tuesday, May 19, 2015 9:33 PM >>>>> >>>>>> *To:* Tangwei (Euler) >>>>> >>>>>> *Cc:* hotspot compiler >>>>> >>>>>> *Subject:* Re: How to change compilation policy to trigger C2 >>>>> >>>>>> compilation ASAP? >>>>> >>>>>> >>>>> >>>>>> Is your goal specifically to have C2 compile or just to reach >>>>> >>>>>> peak performance quickly? It sounds like the latter. What >>>>> >>>>>> values did you specify for the tier thresholds? Also, it may >>>>> >>>>>> help you to -XX:+PrintCompilation to tune the flags as this will >>>>> >>>>>> show you which methods are being compiled, when, and at what >>>>> >>>>>> tier. >>>>> >>>>>> >>>>> >>>>>> sent from my phone >>>>> >>>>>> >>>>> >>>>>> On May 19, 2015 9:01 AM, "Tangwei (Euler)" >>>> >>>>>> > wrote: >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> Hi All, >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> I want to run a JAVA application on a performance simulator, >>>>> >>>>>> and do a profiling on a hot function JITTed with C2 compiler. 
>>>>> >>>>>> >>>>> >>>>>> In order to make C2 compiler compile hot function as early as >>>>> >>>>>> possible, I hope to reduce the threshold of function invocation >>>>> >>>>>> >>>>> >>>>>> count in interpreter and C1 to drive the JIT compiler >>>>> >>>>>> transitioned to Level 4 (C2) ASAP. Following is the option list >>>>> >>>>>> I try, but >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> failed to find a right combination to meet my requirement. >>>>> >>>>>> Anyone >>>>> >>>>>> can help to figure out what options I can use? >>>>> >>>>>> >>>>> >>>>>> Thanks in advance. >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier0ProfilingStartPercentage=0 >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3InvocationThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3MinInvocationThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier3CompileThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4InvocationThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:CompileThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4MinInvocationThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> -XX:Tier4CompileThreshold >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> Regards! >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> wei >>>>> >>>>>> >>>>> >>>> >>>>> >>> >>>>> >> >>>>> > >>>>> > >>>>> >>>>> >>>>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From roland.westrelin at oracle.com Thu May 21 11:55:43 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 13:55:43 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555CB274.4010409@oracle.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> <555CB274.4010409@oracle.com> Message-ID: <27CDBC05-3F22-46F2-871A-7786F19AE8B3@oracle.com> Vladimir, Thanks for spotting the bad comment and thanks for re-reviewing. Roland. > On May 20, 2015, at 6:12 PM, Vladimir Kozlov wrote: > > Looks good. But fix second comment in memnode.hpp - it has '???' instead of '. > > Thanks, > Vladimir > > On 5/20/15 7:19 AM, Roland Westrelin wrote: >> Thanks Andrew and Vladimir. Here is a new webrev with an enum: >> >> http://cr.openjdk.java.net/~roland/8077504/webrev.02/ >> >> Roland. >> From roland.westrelin at oracle.com Thu May 21 12:09:25 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 14:09:25 +0200 Subject: RFR(S) 8077504: Unsafe load can loose control dependency and cause crash In-Reply-To: <555CB33F.6030506@redhat.com> References: <42EDECBF-2347-406A-8835-29A98F1CA153@oracle.com> <552FC216.4010503@redhat.com> <95F149F4-7BE2-45C8-A3C0-DEACFE42ACDF@oracle.com> <6AC5258A-A40C-4121-8B1F-0076713345D4@oracle.com> <553A88F3.2010700@oracle.com> <553FFAC4.7030107@oracle.com> <8C19A207-B613-430F-8CC9-C273BDF4A361@oracle.com> <555A4026.3010303@redhat.com> <555A6741.3030002@oracle.com> <555CB274.4010409@oracle.com> <555CB33F.6030506@redhat.com> Message-ID: <8D2CAEE0-573C-4AC8-8A60-B522610AC4B0@oracle.com> Andrew, >> Looks good. 
But fix second comment in memnode.hpp - it has '???' instead >> of '. > > Yes, looks good (but, of course, I'm not a reviewer). Thanks for looking at it. Roland. From roland.westrelin at oracle.com Thu May 21 12:55:40 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 14:55:40 +0200 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <555CDD4C.5030707@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> <555CDD4C.5030707@oracle.com> Message-ID: <73E7F064-372B-4303-8FA4-9CB6CF1239D4@oracle.com> >>> Should it also return false if in(ArrayCopyNode::Src)->is_top()? >> >> That doesn't feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn't modify. > > Okay, returning true in such case would be conservative. > > Should we check dest input for top in CallLeafNode::may_modify()? Do we have other similar places? You're right checking for top dest in CallLeafNode::may_modify() is safer. I don't see other similar issues elsewhere in the array copy code. Roland. > > Thanks, > Vladimir > >> Also, I don't think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get another chance to optimize it better. >> >> Roland. >> >> >> >>> >>> thanks, >>> Vladimir >>> >>> On 5/20/15 7:03 AM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ >>>> >>>> When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It's not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs.
That's what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. >>>> >>>> Roland. >>>> >> From roland.westrelin at oracle.com Thu May 21 13:25:29 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 15:25:29 +0200 Subject: RFR(S): 8078866: compiler/eliminateAutobox/6934604/TestIntBoxing.java assert(p_f->Opcode() == Op_IfFalse) failed In-Reply-To: <555CE2B2.8020305@oracle.com> References: <5553C1E2.3040003@oracle.com> <555CE2B2.8020305@oracle.com> Message-ID: <5B87D4E4-FA26-411F-BD03-9F9CAA1E0CBF@oracle.com> >>> I think eliminating main- and post- is good investigation for separate RFE. And we need to make sure it is safe. >> >> My reasoning is that if the pre loop's limit is still an opaque node, then the compiler has not been able to optimize the pre loop based on its limited number of iterations yet. I don't see what optimization would remove stuff from the pre loop that wouldn't have removed the same stuff from the initial loop, had it not been cloned. I could be missing something, though. Do you see one? > Predicates? Can we have checks moved from pre- but not from main- because of execution order? It seems we don't do predication with pre/main loops. In PhaseIdealLoop::loop_predication_impl(): if (head->is_valid_counted_loop()) { cl = head->as_CountedLoop(); // do nothing for iteration-splitted loops if (!cl->is_normal_loop()) return false; > Anyway, I am not strongly against this change but you still need to add check to do_range_check(), I think, in case something went wrong. You not being against it is not enough for me to push that change? What other investigation do you think is needed? Roland. > > Thanks, > Vladimir > >> >> Roland. >> >>> >>> To fix this bug, I think, we should bailout from do_range_check() when we can't find pre-loop.
>>> >>> Typo 'cab': >>> + // post loop cab be removed as well >>> >>> Thanks, >>> Vladimir >>> >>> >>> On 5/13/15 6:02 AM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~roland/8078866/webrev.00/ >>>> >>>> The assert fires when optimizing that code: >>>> >>>> static int remi_sumb() { >>>> Integer j = Integer.valueOf(1); >>>> for (int i = 0; i < 1000; i++) { >>>> j = j + 1; >>>> } >>>> return j; >>>> } >>>> >>>> The compiler tries to find the pre loop from the main loop but fails because the pre loop isn't there. >>>> >>>> Here is the chain of events that leads to the pre loop being optimized out: >>>> >>>> 1- The CountedLoopNode is created >>>> 2- PhiNode::Value() for the induction variable's Phi is called and capture that 0 <= i < 1000 >>>> 3- SplitIf is applied and moves the LoadI that reads the value from the Integer through the Phi that merges successive value of j >>>> 4- pre-loop, main-loop, post-loop are created >>>> 5- SplitIf is applied again and moves the LoadI of the loop through the Phi that merges both branches of: >>>> >>>> public static Integer valueOf(int i) { >>>> if (i >= IntegerCache.low && i <= IntegerCache.high) >>>> return IntegerCache.cache[i + (-IntegerCache.low)]; >>>> return new Integer(i); >>>> } >>>> >>>> 6- The LoadI is optimized out, the Phi for j is now a Phi that merges integer values, PhaseIdealLoop::replace_parallel_iv() replaces j by an expression that depends on i >>>> 7- In the preloop, the range of values of i are known because of 2- so the if of valueOf() can be optimized out >>>> 8- the preloop is empty and is removed by IdealLoopTree::policy_do_remove_empty_loop() >>>> >>>> SplitIf doesn't find all opportunities for improvements before the pre/main/post loops are built. I considered trying to fix that but that seems complex and not robust enough: some optimizations after SplitIf could uncover more opportunities for SplitIf.
The valueOf's if cannot be optimized out in the main loop because the range of values of i in the main depends on the Phi for the iv of the pre loop and is not [0, 1000[ so I don't see a way to make sure the compiler always optimizes the main loop out when the pre loop is empty. >>>> >>>> Instead, I'm proposing that when IdealLoopTree::policy_do_remove_empty_loop() removes a loop, it checks whether it is a pre-loop and if it is, removes the main and post loops. >>>> >>>> I don't have a test case because it seems this only occurs under well defined conditions that are hard to reproduce with a simple test case: PhiNode::Value() must be called early, SplitIf must do a little work but not too much before pre/main/post loops are built. Actually, this occurs only when remi_sumb is inlined. If remi_sumb is not, we don't optimize the loop out at all. >>>> >>>> Roland. From roland.westrelin at oracle.com Thu May 21 13:39:24 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 15:39:24 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555C6A2B.7090804@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> Message-ID: <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> > http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ It looks good to me. 1607 if( p->bottom_type() != type(p) ) // If not already bottomed out 1608 worklist.push(p); // Propagate change to user should be: if (p->bottom_type() != type(p)) { // If not already bottomed out worklist.push(p); // Propagate change to user } to follow the hotspot style. https://wikis.oracle.com/display/HotSpotInternals/StyleGuide I can sponsor it if you need. Roland.
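[Editor's note: the snippet under review above is the actual one-line fix — push a CmpU user onto the worklist even when its AddI input stays at bottom. The toy below is a deliberately simplified, hypothetical model, not HotSpot code (C2's real type lattice and CmpU::Value logic are far richer, and all names here are invented): a "compare" that peeks through a bottomed-out "add" at the add's inputs never hears about those inputs narrowing unless it is pushed explicitly.]

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TypePropDemo {
    static final int BOTTOM = Integer.MIN_VALUE; // sentinel: no known bound

    static class Node {
        final List<Node> in = new ArrayList<>();
        final List<Node> uses = new ArrayList<>();
        int hi; // toy "type": an upper bound on the node's value, or BOTTOM

        Node(int hi) { this.hi = hi; }
        void addInput(Node n) { in.add(n); n.uses.add(this); }
        // Leaves (constants, parameters) keep whatever bound they were given.
        int compute() { return hi; }
    }

    // Add: bound is the sum of the input bounds, or BOTTOM when the sum
    // overflows 32 bits (only the positive-overflow case matters in this toy).
    static class Add extends Node {
        Add(Node a, Node b) { super(BOTTOM); addInput(a); addInput(b); }
        @Override int compute() {
            if (in.get(0).hi == BOTTOM || in.get(1).hi == BOTTOM) return BOTTOM;
            long s = (long) in.get(0).hi + in.get(1).hi;
            return (s > Integer.MAX_VALUE) ? BOTTOM : (int) s;
        }
    }

    // Cmp: when its Add input has bottomed out, it peeks *through* the Add
    // at the Add's first input — a dependency on a node two steps up.
    static class Cmp extends Node {
        Cmp(Node x) { super(BOTTOM); addInput(x); }
        @Override int compute() {
            Node x = in.get(0);
            if (x.hi != BOTTOM) return x.hi;
            return (x instanceof Add) ? x.in.get(0).hi : BOTTOM;
        }
    }

    // Recompute nodes; push users only when a node's own type changed —
    // unless 'fixed' is set, which also revisits Cmp users of an unchanged
    // node, mirroring the one-line fix discussed above.
    static void propagate(Deque<Node> work, boolean fixed) {
        while (!work.isEmpty()) {
            Node n = work.pop();
            int t = n.compute();
            boolean changed = (t != n.hi);
            n.hi = t;
            for (Node u : n.uses)
                if (changed || (fixed && u instanceof Cmp)) work.push(u);
        }
    }

    // Narrow one addend after the initial pass and return the Cmp's final bound.
    public static int run(boolean fixed) {
        Node a = new Node(Integer.MAX_VALUE), b = new Node(Integer.MAX_VALUE);
        Add add = new Add(a, b);
        Cmp cmp = new Cmp(add);
        Deque<Node> work = new ArrayDeque<>();
        work.push(cmp);
        work.push(add);
        propagate(work, fixed);   // add overflows -> BOTTOM; cmp peeks at a: MAX_VALUE

        a.hi = 10;                // a later pass narrows a's bound...
        work.push(add);           // ...and revisits only a's direct user
        propagate(work, fixed);
        return cmp.hi;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + run(false)); // stale 2147483647
        System.out.println("with fix:    " + run(true));  // narrowed to 10
    }
}
```

[The real CmpU::Value builds two type ranges from both AddI inputs rather than peeking at one; the toy keeps only the structural point that an unchanged bottom node must still forward type updates to such users.]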
> > - Andreas > > On 2015-05-19 20:24, Berg, Michael C wrote: >> Hi Andreas, could you add a comment to the push statement: >> >> // Propagate change to user >> >> And to Node* p declaration and init: >> >> // Propagate changes to uses >> >> So that the new code carries all similar information. >> Else the changes seem ok. >> >> -Michael >> >> -----Original Message----- >> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >> Sent: Tuesday, May 19, 2015 3:56 AM >> To: hotspot-compiler-dev at openjdk.java.net >> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >> >> Hi, >> >> Could I please get two reviews for the following bug fix. >> >> https://bugs.openjdk.java.net/browse/JDK-8060036 >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >> >> The problem is with type propagation in C2. >> >> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >> >> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. 
>> >> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >> >> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >> I've not been able to create a regression test that is reliable enough to check in. >> >> Thanks, >> Andreas > From andreas.eriksson at oracle.com Thu May 21 14:04:58 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Thu, 21 May 2015 16:04:58 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> Message-ID: <555DE60A.9080003@oracle.com> On 2015-05-21 15:39, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.01/ > It looks good to me. > > 1607 if( p->bottom_type() != type(p) ) // If not already bottomed out > 1608 worklist.push(p); // Propagate change to user > > > should be: > > if (p->bottom_type() != type(p)) { // If not already bottomed out > worklist.push(p); // Propagate change to user > } > > to follow the hotspot style. > https://wikis.oracle.com/display/HotSpotInternals/StyleGuide I did it that way because all the other if (...) worklist.push(p) in that function omit the curly braces. Should I have them in this change anyway? > > I can sponsor it if you need. That would be very helpful, thanks. - Andreas > > Roland. > >> - Andreas >> >> On 2015-05-19 20:24, Berg, Michael C wrote: >>> Hi Andreas, could you add a comment to the push statement: >>> >>> // Propagate change to user >>> >>> And to Node* p declaration and init: >>> >>> // Propagate changes to uses >>> >>> So that the new code carries all similar information. 
>>> Else the changes seem ok. >>> >>> -Michael >>> >>> -----Original Message----- >>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >>> Sent: Tuesday, May 19, 2015 3:56 AM >>> To: hotspot-compiler-dev at openjdk.java.net >>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >>> >>> Hi, >>> >>> Could I please get two reviews for the following bug fix. >>> >>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>> >>> The problem is with type propagation in C2. >>> >>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >>> >>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>> >>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>> >>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >>> I've not been able to create a regression test that is reliable enough to check in. 
>>> >>> Thanks, >>> Andreas From roland.westrelin at oracle.com Thu May 21 14:11:12 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 16:11:12 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555DE60A.9080003@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> <555DE60A.9080003@oracle.com> Message-ID: > I did it that way because all the other > > if (...) > worklist.push(p) > > in that function omit the curly braces. > Should I have them in this change anyway? Some c2 code doesn?t (yet) follow the current style. The usual practice is to: 1) use the current style for new code 2) fix the style around what you?ve modified >> I can sponsor it if you need. > > That would be very helpful, thanks. Can you send me a changeset (hg export) with the properly formatted mercurial comment? http://openjdk.java.net/guide/producingChangeset.html section "Formatting a Changeset Comment? Can you also make sure you?ve run jcheck. Thanks, Roland. > > - Andreas > >> >> Roland. >> >>> - Andreas >>> >>> On 2015-05-19 20:24, Berg, Michael C wrote: >>>> Hi Andreas, could you add a comment to the push statement: >>>> >>>> // Propagate change to user >>>> >>>> And to Node* p declaration and init: >>>> >>>> // Propagate changes to uses >>>> >>>> So that the new code carries all similar information. >>>> Else the changes seem ok. >>>> >>>> -Michael >>>> >>>> -----Original Message----- >>>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >>>> Sent: Tuesday, May 19, 2015 3:56 AM >>>> To: hotspot-compiler-dev at openjdk.java.net >>>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >>>> >>>> Hi, >>>> >>>> Could I please get two reviews for the following bug fix. 
>>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>>> >>>> The problem is with type propagation in C2. >>>> >>>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >>>> >>>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>>> >>>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>>> >>>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >>>> I've not been able to create a regression test that is reliable enough to check in. >>>> >>>> Thanks, >>>> Andreas > From benedikt.wedenik at theobroma-systems.com Thu May 21 14:24:23 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Thu, 21 May 2015 16:24:23 +0200 Subject: Disassembly from c1 or c2 Message-ID: <57E1D081-D9F5-429E-A327-2002A7B777D0@theobroma-systems.com> Hi! 
I?m trying to figure out how to determine whether the generated disassembly (-XX:CompileCommand='print,*Foo.bar') is emitted by c1 or by c2. Is there any easy way like a flag? If not, does anyone know which ?print? in which file to enhance to get the needed information? If there is the possibility to disable the entire output of c1, this would help a lot too :) Thanks in advance, Benedikt W., Theobroma-systems.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From aph at redhat.com Thu May 21 14:26:24 2015 From: aph at redhat.com (Andrew Haley) Date: Thu, 21 May 2015 15:26:24 +0100 Subject: Disassembly from c1 or c2 In-Reply-To: <57E1D081-D9F5-429E-A327-2002A7B777D0@theobroma-systems.com> References: <57E1D081-D9F5-429E-A327-2002A7B777D0@theobroma-systems.com> Message-ID: <555DEB10.2000805@redhat.com> On 05/21/2015 03:24 PM, Benedikt Wedenik wrote: > Hi! > > I?m trying to figure out how to determine whether the generated disassembly > (-XX:CompileCommand='print,*Foo.bar') is emitted by c1 or by c2. Yes. Here it is: Loaded disassembler from hsdis-aarch64.so Compiled method (c2) You'll want to build a debug VM for full information. Andrew. From benedikt.wedenik at theobroma-systems.com Thu May 21 14:33:39 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Thu, 21 May 2015 16:33:39 +0200 Subject: Disassembly from c1 or c2 In-Reply-To: <555DEB10.2000805@redhat.com> References: <57E1D081-D9F5-429E-A327-2002A7B777D0@theobroma-systems.com> <555DEB10.2000805@redhat.com> Message-ID: <25DCD7E3-F429-483D-A6F4-8178FD91DF35@theobroma-systems.com> Thank you for the quick response! Benedikt On 21 May 2015, at 16:26, Andrew Haley wrote: > On 05/21/2015 03:24 PM, Benedikt Wedenik wrote: >> Hi! >> >> I?m trying to figure out how to determine whether the generated disassembly >> (-XX:CompileCommand='print,*Foo.bar') is emitted by c1 or by c2. > > Yes. 
Here it is: > > Loaded disassembler from hsdis-aarch64.so > Compiled method (c2) > > You'll want to build a debug VM for full information. > > Andrew. > From andreas.eriksson at oracle.com Thu May 21 15:00:57 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Thu, 21 May 2015 17:00:57 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> <555DE60A.9080003@oracle.com> Message-ID: <555DF329.4090707@oracle.com> On 2015-05-21 16:11, Roland Westrelin wrote: >> I did it that way because all the other >> >> if (...) >> worklist.push(p) >> >> in that function omit the curly braces. >> Should I have them in this change anyway? > Some c2 code doesn?t (yet) follow the current style. The usual practice is to: > > 1) use the current style for new code > 2) fix the style around what you?ve modified > >>> I can sponsor it if you need. >> That would be very helpful, thanks. > Can you send me a changeset (hg export) with the properly formatted mercurial comment? > http://openjdk.java.net/guide/producingChangeset.html > section "Formatting a Changeset Comment? > Can you also make sure you?ve run jcheck. Attached an export, passes jcheck. I changed the surrounding code to add curly braces and remove extra whitespace. (I don't need an extra round of reviews for that, right?) - Andreas > > Thanks, > Roland. > >> - Andreas >> >>> Roland. >>> >>>> - Andreas >>>> >>>> On 2015-05-19 20:24, Berg, Michael C wrote: >>>>> Hi Andreas, could you add a comment to the push statement: >>>>> >>>>> // Propagate change to user >>>>> >>>>> And to Node* p declaration and init: >>>>> >>>>> // Propagate changes to uses >>>>> >>>>> So that the new code carries all similar information. >>>>> Else the changes seem ok. 
>>>>> >>>>> -Michael >>>>> >>>>> -----Original Message----- >>>>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >>>>> Sent: Tuesday, May 19, 2015 3:56 AM >>>>> To: hotspot-compiler-dev at openjdk.java.net >>>>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >>>>> >>>>> Hi, >>>>> >>>>> Could I please get two reviews for the following bug fix. >>>>> >>>>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>>>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>>>> >>>>> The problem is with type propagation in C2. >>>>> >>>>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>>>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>>>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>>>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >>>>> >>>>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>>>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>>>> >>>>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>>>> >>>>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. 
>>>>> I've not been able to create a regression test that is reliable enough to check in. >>>>> >>>>> Thanks, >>>>> Andreas -------------- next part -------------- # HG changeset patch # User aeriksso # Date 1432219751 -7200 # Thu May 21 16:49:11 2015 +0200 # Node ID 10cc51f8aaa98fa79da3941cf4fb3a6fd3056d75 # Parent 67729f5f33c4a798af42d0e2bcbce838834695f9 8060036: C2: CmpU nodes can end up with wrong type information Reviewed-by: mcberg, kvn, roland diff -r 67729f5f33c4 -r 10cc51f8aaa9 src/share/vm/opto/phaseX.cpp --- a/src/share/vm/opto/phaseX.cpp +++ b/src/share/vm/opto/phaseX.cpp @@ -1576,8 +1576,9 @@ if( m->is_Region() ) { // New path to Region? Must recheck Phis too for (DUIterator_Fast i2max, i2 = m->fast_outs(i2max); i2 < i2max; i2++) { Node* p = m->fast_out(i2); // Propagate changes to uses - if( p->bottom_type() != type(p) ) // If not already bottomed out + if(p->bottom_type() != type(p)) { // If not already bottomed out worklist.push(p); // Propagate change to user + } } } // If we changed the receiver type to a call, we need to revisit @@ -1587,12 +1588,30 @@ if (m->is_Call()) { for (DUIterator_Fast i2max, i2 = m->fast_outs(i2max); i2 < i2max; i2++) { Node* p = m->fast_out(i2); // Propagate changes to uses - if (p->is_Proj() && p->as_Proj()->_con == TypeFunc::Control && p->outcnt() == 1) + if (p->is_Proj() && p->as_Proj()->_con == TypeFunc::Control && p->outcnt() == 1) { worklist.push(p->unique_out()); + } } } if( m->bottom_type() != type(m) ) // If not already bottomed out worklist.push(m); // Propagate change to user + + // CmpU nodes can get their type information from two nodes up in the + // graph (instead of from the nodes immediately above). Make sure they + // are added to the worklist if nodes they depend on are updated, since + // they could be missed and get wrong types otherwise. 
+ uint m_op = m->Opcode(); + if (m_op == Op_AddI || m_op == Op_SubI) { + for (DUIterator_Fast i2max, i2 = m->fast_outs(i2max); i2 < i2max; i2++) { + Node* p = m->fast_out(i2); // Propagate changes to uses + if (p->Opcode() == Op_CmpU) { + // Got a CmpU which might need the new type information from node n. + if(p->bottom_type() != type(p)) { // If not already bottomed out + worklist.push(p); // Propagate change to user + } + } + } + } } } } From rickard.backman at oracle.com Thu May 21 15:11:53 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Thu, 21 May 2015 17:11:53 +0200 Subject: RFR(S) 8080692: lots of jstack tests failing in pit Message-ID: <20150521151153.GA2285@rbackman> Hi, can I please have reviews for this change? The base used to calculate the address for getting Pairs and Data was wrong in SA. Also fixes issues with getting the count field. Added more verification in oopMap.cpp to check that offsets and sizes are correct. Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 tmtools jstack tests & jprt. Thanks /R -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From vladimir.x.ivanov at oracle.com Thu May 21 15:20:49 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 21 May 2015 18:20:49 +0300 Subject: RFR(S) 8080692: lots of jstack tests failing in pit In-Reply-To: <20150521151153.GA2285@rbackman> References: <20150521151153.GA2285@rbackman> Message-ID: <555DF7D1.5050206@oracle.com> Looks good. Best regards, Vladimir Ivanov On 5/21/15 6:11 PM, Rickard B?ckman wrote: > Hi, > > can I please have reviews for this change? > The base used to calculate the address for getting Pairs and Data was wrong in SA. > Also fixes issues with getting the count field. 
Added more verification > in oopMap.cpp to check that offsets and sizes are correct. > > Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 > > tmtools jstack tests & jprt. > > Thanks > /R > From vladimir.kozlov at oracle.com Thu May 21 15:54:10 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 21 May 2015 08:54:10 -0700 Subject: RFR(S) 8080692: lots of jstack tests failing in pit In-Reply-To: <20150521151153.GA2285@rbackman> References: <20150521151153.GA2285@rbackman> Message-ID: <555DFFA2.8090303@oracle.com> Looks good. Did you run tmtools tests with -Xcomp? Thanks, Vladimir On 5/21/15 8:11 AM, Rickard B?ckman wrote: > Hi, > > can I please have reviews for this change? > The base used to calculate the address for getting Pairs and Data was wrong in SA. > Also fixes issues with getting the count field. Added more verification > in oopMap.cpp to check that offsets and sizes are correct. > > Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 > > tmtools jstack tests & jprt. > > Thanks > /R > From rickard.backman at oracle.com Thu May 21 16:06:00 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Thu, 21 May 2015 18:06:00 +0200 Subject: RFR(S) 8080692: lots of jstack tests failing in pit In-Reply-To: <555DFFA2.8090303@oracle.com> References: <20150521151153.GA2285@rbackman> <555DFFA2.8090303@oracle.com> Message-ID: <20150521160600.GB2285@rbackman> On 05/21, Vladimir Kozlov wrote: > Looks good. Did you run tmtools tests with -Xcomp? > I hope. They failed with the errors before my fixes. Command line: $COMMON/ute/bin/ute -jdk /home/rbackman/code/oopmap4/build/fastdebug/images/jdk -component vm -testbase $COMMON/vm -stablejavahome /home/rbackman/opt/8u20 -vmflavor Xcomp -dkfl empty -test tmtools/jstack/ Thanks for the review. 
/R > Thanks, > Vladimir > > On 5/21/15 8:11 AM, Rickard B?ckman wrote: > >Hi, > > > >can I please have reviews for this change? > >The base used to calculate the address for getting Pairs and Data was wrong in SA. > >Also fixes issues with getting the count field. Added more verification > >in oopMap.cpp to check that offsets and sizes are correct. > > > >Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ > >Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 > > > >tmtools jstack tests & jprt. > > > >Thanks > >/R > > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From rickard.backman at oracle.com Thu May 21 16:06:22 2015 From: rickard.backman at oracle.com (Rickard =?iso-8859-1?Q?B=E4ckman?=) Date: Thu, 21 May 2015 18:06:22 +0200 Subject: RFR(S) 8080692: lots of jstack tests failing in pit In-Reply-To: <555DF7D1.5050206@oracle.com> References: <20150521151153.GA2285@rbackman> <555DF7D1.5050206@oracle.com> Message-ID: <20150521160622.GC2285@rbackman> Thanks for the review! /R On 05/21, Vladimir Ivanov wrote: > Looks good. > > Best regards, > Vladimir Ivanov > > On 5/21/15 6:11 PM, Rickard B?ckman wrote: > >Hi, > > > >can I please have reviews for this change? > >The base used to calculate the address for getting Pairs and Data was wrong in SA. > >Also fixes issues with getting the count field. Added more verification > >in oopMap.cpp to check that offsets and sizes are correct. > > > >Webrev: http://cr.openjdk.java.net/~rbackman/8080692/ > >Bug: https://bugs.openjdk.java.net/browse/JDK-8080692 > > > >tmtools jstack tests & jprt. > > > >Thanks > >/R > > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: Digital signature URL: From roland.westrelin at oracle.com Thu May 21 16:07:31 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Thu, 21 May 2015 18:07:31 +0200 Subject: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555DF329.4090707@oracle.com> References: <555B16D4.80008@oracle.com> <555C6A2B.7090804@oracle.com> <7E368413-6A30-47DC-9369-58CEDB2E85FD@oracle.com> <555DE60A.9080003@oracle.com> <555DF329.4090707@oracle.com> Message-ID: <45C554FC-C0CE-4353-A210-4431C1FD29CB@oracle.com> > Attached an export, passes jcheck. > I changed the surrounding code to add curly braces and remove extra whitespace. > (I don't need an extra round of reviews for that, right?) You don?t. Thanks for fixing the formatting. I?m pushing it. Roland. > > - Andreas > >> >> Thanks, >> Roland. >> >>> - Andreas >>> >>>> Roland. >>>> >>>>> - Andreas >>>>> >>>>> On 2015-05-19 20:24, Berg, Michael C wrote: >>>>>> Hi Andreas, could you add a comment to the push statement: >>>>>> >>>>>> // Propagate change to user >>>>>> >>>>>> And to Node* p declaration and init: >>>>>> >>>>>> // Propagate changes to uses >>>>>> >>>>>> So that the new code carries all similar information. >>>>>> Else the changes seem ok. >>>>>> >>>>>> -Michael >>>>>> >>>>>> -----Original Message----- >>>>>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andreas Eriksson >>>>>> Sent: Tuesday, May 19, 2015 3:56 AM >>>>>> To: hotspot-compiler-dev at openjdk.java.net >>>>>> Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information >>>>>> >>>>>> Hi, >>>>>> >>>>>> Could I please get two reviews for the following bug fix. >>>>>> >>>>>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>>>>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>>>>> >>>>>> The problem is with type propagation in C2. 
>>>>>> >>>>>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>>>>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>>>>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>>>>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >>>>>> >>>>>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>>>>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>>>>> >>>>>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>>>>> >>>>>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >>>>>> I've not been able to create a regression test that is reliable enough to check in. >>>>>> >>>>>> Thanks, >>>>>> Andreas > > <8060036.export> From tobias.hartmann at oracle.com Thu May 21 17:28:50 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 21 May 2015 19:28:50 +0200 Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555B16D4.80008@oracle.com> References: <555B16D4.80008@oracle.com> Message-ID: <555E15D2.4070700@oracle.com> Hi Andreas, I was working on JDK-8080156 [1] and just noticed that this is the same issue as JDK-8060036. 
The bug contains a small reproducer (see Ivan Gerasimov's comment), in case you want to add this as a regression test to your fix. Best, Tobias [1] https://bugs.openjdk.java.net/browse/JDK-8080156 On 19.05.2015 12:56, Andreas Eriksson wrote: > Hi, > > Could I please get two reviews for the following bug fix. > > https://bugs.openjdk.java.net/browse/JDK-8060036 > http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ > > The problem is with type propagation in C2. > > A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. > If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. > These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. > This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. > > The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. > At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. > > When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. > > This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. > I've not been able to create a regression test that is reliable enough to check in. 
> > Thanks, > Andreas From vladimir.kozlov at oracle.com Thu May 21 17:51:11 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 21 May 2015 10:51:11 -0700 Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555E15D2.4070700@oracle.com> References: <555B16D4.80008@oracle.com> <555E15D2.4070700@oracle.com> Message-ID: <555E1B0F.9050802@oracle.com> Since this fix 8060036 is already in JPRT queue lets use 8080156 to add the regression test. Thanks, Vladimir On 5/21/15 10:28 AM, Tobias Hartmann wrote: > Hi Andreas, > > I was working on JDK-8080156 [1] and just noticed that this is the same issue as JDK-8060036. The bug contains a small reproducer (see Ivan Gerasimov's comment), in case you want to add this as a regression test to your fix. > > Best, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8080156 > > On 19.05.2015 12:56, Andreas Eriksson wrote: >> Hi, >> >> Could I please get two reviews for the following bug fix. >> >> https://bugs.openjdk.java.net/browse/JDK-8060036 >> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >> >> The problem is with type propagation in C2. >> >> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. >> >> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. 
>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >> >> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >> >> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >> I've not been able to create a regression test that is reliable enough to check in. >> >> Thanks, >> Andreas From tobias.hartmann at oracle.com Thu May 21 18:51:46 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 21 May 2015 20:51:46 +0200 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE Message-ID: <555E2942.7060404@oracle.com> Hi, this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. https://bugs.openjdk.java.net/browse/JDK-8080156 http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. 
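As a rough illustration of that failure mode, a loop of the following shape shows the kind of pattern involved. This is a hypothetical sketch, not the actual regression test from the webrev; the class and method names here are invented for illustration.

```java
// Hypothetical sketch, not the actual regression test from the webrev:
// Integer.toString(int) is driven in a hot loop so that the JIT compiles it.
// On a VM with the CmpU type-propagation bug, the compiled code could take
// the always-taken range-check trap path and surface a NullPointerException;
// on a fixed VM, every conversion round-trips.
public class ToStringSketch {
    static String convert(int i) {
        return Integer.toString(i);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            int v = i % 1000;
            String s = convert(v);   // the NPE escaped from here on a broken VM
            if (Integer.parseInt(s) != v) {
                throw new AssertionError("bad conversion at i=" + i);
            }
        }
        System.out.println("ok");
    }
}
```

Enough iterations are needed for the method to get JIT-compiled before the check can fire; the loop bound above is an arbitrary illustrative choice.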
Thanks, Tobias [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html From tobias.hartmann at oracle.com Thu May 21 18:52:15 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 21 May 2015 20:52:15 +0200 Subject: RFR: 8060036: C2: CmpU nodes can end up with wrong type information In-Reply-To: <555E1B0F.9050802@oracle.com> References: <555B16D4.80008@oracle.com> <555E15D2.4070700@oracle.com> <555E1B0F.9050802@oracle.com> Message-ID: <555E295F.3040204@oracle.com> On 21.05.2015 19:51, Vladimir Kozlov wrote: > Since this fix 8060036 is already in JPRT queue lets use 8080156 to add the regression test. Okay, I sent a new RFR with the regression test. Thanks, Tobias > > Thanks, > Vladimir > > On 5/21/15 10:28 AM, Tobias Hartmann wrote: >> Hi Andreas, >> >> I was working on JDK-8080156 [1] and just noticed that this is the same issue as JDK-8060036. The bug contains a small reproducer (see Ivan Gerasimov's comment), in case you want to add this as a regression test to your fix. >> >> Best, >> Tobias >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8080156 >> >> On 19.05.2015 12:56, Andreas Eriksson wrote: >>> Hi, >>> >>> Could I please get two reviews for the following bug fix. >>> >>> https://bugs.openjdk.java.net/browse/JDK-8060036 >>> http://cr.openjdk.java.net/~aeriksso/8060036/webrev.00/ >>> >>> The problem is with type propagation in C2. >>> >>> A CmpU can use type information from nodes two steps up in the graph, when input 1 is an AddI or SubI node. >>> If the AddI/SubI overflows the CmpU node type computation will set up two type ranges based on the inputs to the AddI/SubI instead. >>> These type ranges will then be compared to input 2 of the CmpU node to see if we can make any meaningful observations on the real type of the CmpU. >>> This means that the type of the AddI/SubI can be bottom, but the CmpU can still use the type of the AddI/SubI inputs to get a more specific type itself. 
>>> >>> The problem with this is that the types of the AddI/SubI inputs can be updated at a later pass, but since the AddI/SubI is of type bottom the propagation pass will stop there. >>> At that point the CmpU node is never made aware of the new types of the AddI/SubI inputs, and it is left with a type that is based on old type information. >>> >>> When this happens other optimizations using the type information can go very wrong; in this particular case a branch to a range_check trap that normally never happened was optimized to always be taken. >>> >>> This fix adds a check in the type propagation logic for CmpU nodes, which makes sure that the CmpU node will be added to the worklist so the type can be updated. >>> I've not been able to create a regression test that is reliable enough to check in. >>> >>> Thanks, >>> Andreas From vladimir.kozlov at oracle.com Thu May 21 19:06:15 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 21 May 2015 12:06:15 -0700 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE In-Reply-To: <555E2942.7060404@oracle.com> References: <555E2942.7060404@oracle.com> Message-ID: <555E2CA7.2040904@oracle.com> Looks good. Thanks, Vladimir On 5/21/15 11:51 AM, Tobias Hartmann wrote: > Hi, > > this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. > > https://bugs.openjdk.java.net/browse/JDK-8080156 > http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ > > The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. 
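To make the failure mode described above concrete, here is a minimal, hypothetical sketch (my own illustration, not the actual TestTypePropagationToCmpU regression test) of the kind of Java source whose bounds check C2 lowers to an unsigned compare, CmpU(AddI(i, k), a.length). When i + k can overflow int, the CmpU type is derived from the AddI inputs, which is where the stale type information discussed in this thread came from. This sketch runs cleanly on a correct VM and does not by itself reproduce the bug:

```java
// Hypothetical sketch, not the actual regression test: the shape of a
// range check that C2 compiles as CmpU(AddI(i, k), a.length). When the
// AddI may overflow, the CmpU type is computed from the AddI *inputs*
// rather than from the AddI node itself.
public class CmpUShape {
    static int sum;

    static void touch(int[] a, int i, int k) {
        int idx = i + k;                    // AddI that may overflow int
        if (idx >= 0 && idx < a.length) {   // lowered to an unsigned compare (CmpU)
            sum += a[idx];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[16];
        a[7] = 1;
        for (int i = 0; i < 20_000; i++) {  // enough iterations to reach C2
            touch(a, i % 8, 4);             // idx == 7 on every 8th call
        }
        System.out.println(sum);            // 20_000 / 8 hits of a[7] -> 2500
    }
}
```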
> > Thanks, > Tobias > > [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html > From tobias.hartmann at oracle.com Thu May 21 19:12:57 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 21 May 2015 21:12:57 +0200 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE In-Reply-To: <555E2CA7.2040904@oracle.com> References: <555E2942.7060404@oracle.com> <555E2CA7.2040904@oracle.com> Message-ID: <555E2E39.8040903@oracle.com> Thanks, Vladimir. Best, Tobias On 21.05.2015 21:06, Vladimir Kozlov wrote: > Looks good. > > Thanks, > Vladimir > > On 5/21/15 11:51 AM, Tobias Hartmann wrote: >> Hi, >> >> this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. >> >> https://bugs.openjdk.java.net/browse/JDK-8080156 >> http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ >> >> The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. >> >> Thanks, >> Tobias >> >> [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html >> From yekaterina.kantserova at oracle.com Fri May 22 06:40:45 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Fri, 22 May 2015 08:40:45 +0200 Subject: RFR(XS): 8080828: Create sanity test for JDK-8080155 In-Reply-To: <5C6FDB60-85C3-4130-ABF0-52368963E0DB@oracle.com> References: <555D8A56.3050001@oracle.com> <555DB98A.7050401@oracle.com> <5C6FDB60-85C3-4130-ABF0-52368963E0DB@oracle.com> Message-ID: <555ECF6D.2030003@oracle.com> Staffan, thanks for the review! On 05/21/2015 01:04 PM, Staffan Larsen wrote: > Looks good! 
> > Thanks, > /Staffan > >> On 21 maj 2015, at 12:55, Yekaterina Kantserova wrote: >> >> The new webrev can be found here: http://cr.openjdk.java.net/~ykantser/8080828/webrev.01 >> >> Thanks, >> Katja >> >> >> >> On 05/21/2015 10:15 AM, Staffan Larsen wrote: >>> I just realize that you also need to add a check if SA can attach on the current platform. See the test in sa/jmap-hashcode: >>> >>> if (!Platform.shouldSAAttach()) { >>> System.out.println("SA attach not expected to work - test skipped."); >>> return; >>> } >>> >>> Thanks, >>> /Staffan >>> >>>> On 21 maj 2015, at 10:13, Staffan Larsen wrote: >>>> >>>> Can we add some kind of validation of the output? Something simple, just to verify that we actually ran what we think we ran. >>>> >>>>> On 21 maj 2015, at 09:33, Yekaterina Kantserova wrote: >>>>> >>>>> Hi, >>>>> >>>>> Could I please have a review of this new test. >>>>> >>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8080828 >>>>> webrev: http://cr.openjdk.java.net/~ykantser/8080828/webrev.00 >>>>> >>>>> This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080155 if running with -Xcomp. >>>>> >>>>> Thanks, >>>>> Katja From yekaterina.kantserova at oracle.com Fri May 22 07:17:33 2015 From: yekaterina.kantserova at oracle.com (Yekaterina Kantserova) Date: Fri, 22 May 2015 09:17:33 +0200 Subject: 8080855: Create sanity test for JDK-8080692 Message-ID: <555ED80D.5000806@oracle.com> Hi, Could I please have a review of this new test. bug: https://bugs.openjdk.java.net/browse/JDK-8080855 webrev: http://cr.openjdk.java.net/~ykantser/8080855/webrev.00/ This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080692 if running with -Xcomp. 
Thanks, Katja From staffan.larsen at oracle.com Fri May 22 07:20:45 2015 From: staffan.larsen at oracle.com (Staffan Larsen) Date: Fri, 22 May 2015 09:20:45 +0200 Subject: 8080855: Create sanity test for JDK-8080692 In-Reply-To: <555ED80D.5000806@oracle.com> References: <555ED80D.5000806@oracle.com> Message-ID: <2EB33CDA-5402-428E-8E70-0C64581D697A@oracle.com> Looks good! Thanks, /Staffan > On 22 maj 2015, at 09:17, Yekaterina Kantserova wrote: > > Hi, > > Could I please have a review of this new test. > > bug: https://bugs.openjdk.java.net/browse/JDK-8080855 > webrev: http://cr.openjdk.java.net/~ykantser/8080855/webrev.00/ > > This test will trigger the exception reported in https://bugs.openjdk.java.net/browse/JDK-8080692 if running with -Xcomp. > > Thanks, > Katja From roland.westrelin at oracle.com Fri May 22 07:24:44 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 22 May 2015 09:24:44 +0200 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: <73E7F064-372B-4303-8FA4-9CB6CF1239D4@oracle.com> References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> <555CDD4C.5030707@oracle.com> <73E7F064-372B-4303-8FA4-9CB6CF1239D4@oracle.com> Message-ID: Vladimir, >>>> Should it also return false if in(ArrayCopyNode::Src)->is_top()? >>> >>> That doesn?t feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn?t modify. >> >> Okay, returning true in such case would be conservative. >> >> Should we check dest input for top in CallLeafNode::may_modify()? Do we have other similar places? > > You?re right checking for top dest in CallLeafNode::may_modify() is safer. > I don?t see other similar issues elsewhere in the array copy code. Here is a webrev with the top check in CallLeafNode::may_modify()? http://cr.openjdk.java.net/~roland/8080699/webrev.01/ Can I push it? Roland. > > Roland. 
> > >> >> Thanks, >> Vladimir >> >>> Also, I don?t think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get an other chance to optimize it better. >>> >>> Roland. >>> >>> >>> >>>> >>>> thanks, >>>> Vladimir >>>> >>>> On 5/20/15 7:03 AM, Roland Westrelin wrote: >>>>> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ >>>>> >>>>> When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It?s not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That?s what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. >>>>> >>>>> Roland. >>>>> >>> > From denis.kononenko at oracle.com Fri May 22 12:38:43 2015 From: denis.kononenko at oracle.com (denis kononenko) Date: Fri, 22 May 2015 15:38:43 +0300 Subject: [8u60] RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 Message-ID: <555F2353.1020406@oracle.com> Hi All, Could you please review a backport for JDK-8080693. JDK9 webrev link: http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ JDK9 bug id: 8080693, http://bugs.openjdk.java.net/browse/JDK-8080693 JDK9 review thread: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-May/014780.html The changes cannot be applied cleanly, so I've created a new webrev: 8u60 webrev link: http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.8u60.00/ JDK8 bug id: 8077620, https://bugs.openjdk.java.net/browse/JDK-8077620 Thank you, Denis. 
From andreas.eriksson at oracle.com Fri May 22 13:07:01 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Fri, 22 May 2015 15:07:01 +0200 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE In-Reply-To: <555E2942.7060404@oracle.com> References: <555E2942.7060404@oracle.com> Message-ID: <555F29F5.7090402@oracle.com> Hi Tobias, You have a comma that shouldn't be there in the bugid line: -------------------------------------------------- TEST: compiler/types/TestTypePropagationToCmpU.java TEST RESULT: Error. Parse Exception: Invalid or unrecognized bugid: 8080156, -------------------------------------------------- Otherwise it looks good to me. Thanks for pushing this test. - Andreas On 2015-05-21 20:51, Tobias Hartmann wrote: > Hi, > > this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. > > https://bugs.openjdk.java.net/browse/JDK-8080156 > http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ > > The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. > > Thanks, > Tobias > > [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html From tobias.hartmann at oracle.com Fri May 22 13:10:40 2015 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 22 May 2015 15:10:40 +0200 Subject: [9] RFR(S): 8080156: Integer.toString(int value) sometimes throws NPE In-Reply-To: <555F29F5.7090402@oracle.com> References: <555E2942.7060404@oracle.com> <555F29F5.7090402@oracle.com> Message-ID: <555F2AD0.2010907@oracle.com> Hi Andreas, thanks, I already noticed while running through JPRT and fixed it. 
Best, Tobias On 22.05.2015 15:07, Andreas Eriksson wrote: > Hi Tobias, > > You have a comma that shouldn't be there in the bugid line: > -------------------------------------------------- > TEST: compiler/types/TestTypePropagationToCmpU.java > TEST RESULT: Error. Parse Exception: Invalid or unrecognized bugid: 8080156, > -------------------------------------------------- > > Otherwise it looks good to me. > Thanks for pushing this test. > > - Andreas > > On 2015-05-21 20:51, Tobias Hartmann wrote: >> Hi, >> >> this is a follow up RFR on JDK-8060036 [1] that adds a regression test I used while investigating JDK-8080156 which turned out to be a duplicate. >> >> https://bugs.openjdk.java.net/browse/JDK-8080156 >> http://cr.openjdk.java.net/~thartmann/8080156/webrev.00/ >> >> The test triggers a range check that is implemented with a CmpUNode. If type information is not propagated correctly to the CmpUNode, the range check trap is always taken. Because C2 also removes the unnecessary buffer allocation, a NullPointerException is thrown and the test fails. >> >> Thanks, >> Tobias >> >> [1] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-May/017983.html > From vladimir.x.ivanov at oracle.com Fri May 22 14:36:43 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 22 May 2015 17:36:43 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks Message-ID: <555F3EFB.60606@oracle.com> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 https://bugs.openjdk.java.net/browse/JDK-8001622 Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from immI8/immI16 to immI, since zero-extending move is used. Thanks! 
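For readers following the matching question, a hedged Java-level sketch (my own illustration, not taken from the webrev) of the pattern these rules cover: a byte load masked with an int constant and widened to long, i.e. (ConvI2L (AndI (LoadUB mem) mask)). The immI8 operand only covers signed 8-bit constants (-128..127), so an 8-bit mask such as 0xF0 (= 240) was previously not matched even though the zero-extending load makes any wider mask safe. This assumes C2's usual idealization of (AndI (LoadB mem) mask) into (AndI (LoadUB mem) mask) when the mask clears the sign-extended bits:

```java
// Illustration only: Java source patterns behind the loadUB2L rules.
public class MaskWiden {
    // Mask fits a signed 8-bit immediate: matched by the old loadUB2L_immI8 rule.
    static long narrowMask(byte[] b, int i) {
        return b[i] & 0x7F;
    }

    // 0xF0 = 240 does not fit a *signed* 8-bit immediate, so the old rule
    // missed it although it is still an 8-bit mask.
    static long missedMask(byte[] b, int i) {
        return b[i] & 0xF0;
    }

    public static void main(String[] args) {
        byte[] b = { (byte) 0xAB };           // 0xAB is -85 as a signed byte
        System.out.println(narrowMask(b, 0)); // 0xAB & 0x7F = 0x2B = 43
        System.out.println(missedMask(b, 0)); // 0xAB & 0xF0 = 0xA0 = 160
    }
}
```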
Best regards, Vladimir Ivanov From vladimir.x.ivanov at oracle.com Fri May 22 14:38:55 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 22 May 2015 17:38:55 +0300 Subject: [8u60] RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 In-Reply-To: <555F2353.1020406@oracle.com> References: <555F2353.1020406@oracle.com> Message-ID: <555F3F7F.5040000@oracle.com> Both changes look good. Best regards, Vladimir Ivanov On 5/22/15 3:38 PM, denis kononenko wrote: > Hi All, > > Could you please review a backport for JDK-8080693. > > JDK9 webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ > JDK9 bug id: 8080693, http://bugs.openjdk.java.net/browse/JDK-8080693 > JDK9 review thread: > http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-May/014780.html > > > The changes cannot be applied cleanly, so I've created a new webrev: > > 8u60 webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.8u60.00/ > JDK8 bug id: 8077620, https://bugs.openjdk.java.net/browse/JDK-8077620 > > Thank you, > Denis. > From roland.westrelin at oracle.com Fri May 22 14:57:18 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 22 May 2015 16:57:18 +0200 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <555F3EFB.60606@oracle.com> References: <555F3EFB.60606@oracle.com> Message-ID: <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 > https://bugs.openjdk.java.net/browse/JDK-8001622 > > Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from immI8/immI16 to immI, since zero-extending move is used. The change on sparc doesn?t look good: __ and3($dst$$Register, $mask$$constant, $dst$$Register); the and instruction cannot encode arbitrary large integer constants. 
Isn't the root of the problem that we want an unsigned 8 bit integer and not a signed one? Roland. From vladimir.x.ivanov at oracle.com Fri May 22 15:14:44 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 22 May 2015 18:14:44 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> Message-ID: <555F47E4.8020507@oracle.com> Thanks for looking into the fix, Roland. On 5/22/15 5:57 PM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >> https://bugs.openjdk.java.net/browse/JDK-8001622 >> >> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from immI8/immI16 to immI, since zero-extending move is used. > > The change on sparc doesn't look good: > __ and3($dst$$Register, $mask$$constant, $dst$$Register); > > the and instruction cannot encode arbitrary large integer constants. Good catch. I missed that it is and3. > Isn't the root of the problem that we want an unsigned 8 bit integer and not a signed one? Yes, unsigned 8-bit integers work fine, but it can be generalized to arbitrary masks as well. I experimented with new operand types (immU8, immU16), but it was more code for no particular benefit. I can use them on sparc. Best regards, Vladimir Ivanov From vladimir.x.ivanov at oracle.com Fri May 22 15:38:37 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 22 May 2015 18:38:37 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <555F47E4.8020507@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> Message-ID: <555F4D7D.7080504@oracle.com> Updated webrev: http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 Introduced immU8/immU16 operands on sparc.
As a cleanup, sorted immI/immU operand declarations. Best regards, Vladimir Ivanov On 5/22/15 6:14 PM, Vladimir Ivanov wrote: > Thanks for looking into the fix, Roland. > > On 5/22/15 5:57 PM, Roland Westrelin wrote: >>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>> >>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>> immI8/immI16 to immI, since zero-extending move is used. >> >> The change on sparc doesn?t look good: >> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >> >> the and instruction cannot encode arbitrary large integer constants. > Good catch. I missed that it is and3. > >> Isn?t the root of the problem that we want an unsigned 8 bit integer >> and not a signed one? > Yes, unsigned 8-bit integer work fine, but it can be generalized to > arbitrary masks as well. > > I experimented with new operand types (immU8, immU16), but it was more > code for no particular benefit. I can use them on sparc. > > Best regards, > Vladimir Ivanov From vladimir.kozlov at oracle.com Fri May 22 15:44:39 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 22 May 2015 08:44:39 -0700 Subject: RFR(XS): 8080699: Assert failed: Not a Java pointer in JCK test In-Reply-To: References: <05F9D284-258E-4980-A650-89C756060EB8@oracle.com> <555CC357.9030006@oracle.com> <29CE480B-2136-40DE-8167-E0E093A5F145@oracle.com> <555CDD4C.5030707@oracle.com> <73E7F064-372B-4303-8FA4-9CB6CF1239D4@oracle.com> Message-ID: <555F4EE7.8060305@oracle.com> Good. Thanks, Vladimir On 5/22/15 12:24 AM, Roland Westrelin wrote: > Vladimir, > >>>>> Should it also return false if in(ArrayCopyNode::Src)->is_top()? >>>> >>>> That doesn?t feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn?t modify. >>> >>> Okay, returning true in such case would be conservative. >>> >>> Should we check dest input for top in CallLeafNode::may_modify()? 
Do we have other similar places? >> >> You're right checking for top dest in CallLeafNode::may_modify() is safer. >> I don't see other similar issues elsewhere in the array copy code. > > Here is a webrev with the top check in CallLeafNode::may_modify()… > > http://cr.openjdk.java.net/~roland/8080699/webrev.01/ > > Can I push it? > > Roland. > >> >> Roland.
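For context, a hedged Java sketch (hypothetical, unrelated to the actual JCK test) of the shape Roland describes earlier in the thread: an arraycopy whose destination is a non-escaping allocation, so the ArrayCopyNode can be eliminated, while the copy's exception-processing outputs keep the dead node reachable until the next igvn pass:

```java
// Hypothetical illustration of the eliminated-arraycopy shape: 'tmp'
// never escapes, so C2 can scalar-replace it and eliminate the copy;
// System.arraycopy can still throw (null/bounds/store checks), which is
// why the node retains exception-processing outputs.
public class ArrayCopyShape {
    static int sketch(int[] src) {
        int[] tmp = new int[8];               // non-escaping allocation
        System.arraycopy(src, 0, tmp, 0, 8);  // eliminable arraycopy
        return tmp[3];
    }

    public static void main(String[] args) {
        int[] src = { 0, 1, 2, 3, 4, 5, 6, 7 };
        int r = 0;
        for (int i = 0; i < 20_000; i++) {    // warm up so C2 compiles sketch()
            r = sketch(src);
        }
        System.out.println(r);                // tmp[3] == src[3] == 3
    }
}
```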
> On May 22, 2015, at 5:44 PM, Vladimir Kozlov wrote: > > Good. > > Thanks, > Vladimir > > On 5/22/15 12:24 AM, Roland Westrelin wrote: >> Vladimir, >> >>>>>> Should it also return false if in(ArrayCopyNode::Src)->is_top()? >>>>> >>>>> That doesn?t feel right: ArrayCopyNode::may_modify() would decide what to return based on src that it doesn?t modify. >>>> >>>> Okay, returning true in such case would be conservative. >>>> >>>> Should we check dest input for top in CallLeafNode::may_modify()? Do we have other similar places? >>> >>> You?re right checking for top dest in CallLeafNode::may_modify() is safer. >>> I don?t see other similar issues elsewhere in the array copy code. >> >> Here is a webrev with the top check in CallLeafNode::may_modify()? >> >> http://cr.openjdk.java.net/~roland/8080699/webrev.01/ >> >> Can I push it? >> >> Roland. >> >>> >>> Roland. >>> >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>>> Also, I don?t think src can become top and dest not be top during allocation elimination. If that happens at another time (igvn for instance), this part of the graph is being removed and presumably the compiler will get an other chance to optimize it better. >>>>> >>>>> Roland. >>>>> >>>>> >>>>> >>>>>> >>>>>> thanks, >>>>>> Vladimir >>>>>> >>>>>> On 5/20/15 7:03 AM, Roland Westrelin wrote: >>>>>>> http://cr.openjdk.java.net/~roland/8080699/webrev.00/ >>>>>>> >>>>>>> When an arraycopy node is eliminated (for a non escaping allocation), its control, src and dest arguments are set to top and it is no longer connected to the rest of the graph through its fall through outputs. It?s not entirely removed from the graph until the next igvn though, and it can be reached through its exception processing outputs. That?s what happens here and ArrayCopyNode::may_modify() assumes dest, that is top, is an oop. >>>>>>> >>>>>>> Roland. 
>>>>>>> >>>>> >>> >> From vladimir.kozlov at oracle.com Fri May 22 17:21:52 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 22 May 2015 10:21:52 -0700 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <555F4D7D.7080504@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> Message-ID: <555F65B0.30307@oracle.com> Good. Thanks, Vladimir K On 5/22/15 8:38 AM, Vladimir Ivanov wrote: > Updated webrev: > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 > > Introduced immU8/immU16 operands on sparc. > > As a cleanup, sorted immI/immU operand declarations. > > Best regards, > Vladimir Ivanov > > On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >> Thanks for looking into the fix, Roland. >> >> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>> >>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>> immI8/immI16 to immI, since zero-extending move is used. >>> >>> The change on sparc doesn?t look good: >>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>> >>> the and instruction cannot encode arbitrary large integer constants. >> Good catch. I missed that it is and3. >> >>> Isn?t the root of the problem that we want an unsigned 8 bit integer >>> and not a signed one? >> Yes, unsigned 8-bit integer work fine, but it can be generalized to >> arbitrary masks as well. >> >> I experimented with new operand types (immU8, immU16), but it was more >> code for no particular benefit. I can use them on sparc. 
>> >> Best regards, >> Vladimir Ivanov From rednaxelafx at gmail.com Fri May 22 22:22:13 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Fri, 22 May 2015 15:22:13 -0700 Subject: Question on ADLC's RegClass destructor Message-ID: Hi compiler team, I came across this piece of code in ADLC, and I'm not sure if it's right or not: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/formsopt.cpp#l236 RegClass::~RegClass() { delete _classid; } Why do we need to delete the _classid field here? If this is needed, then a bunch of other classes in formsopt.hpp would need the same treatment. (I know this virtual destructor was added mainly for the two new subclasses, but I'm just wondering about this one...) Thanks, Kris -------------- next part -------------- An HTML attachment was scrubbed... URL: From dean.long at oracle.com Sat May 23 04:49:30 2015 From: dean.long at oracle.com (Dean Long) Date: Fri, 22 May 2015 21:49:30 -0700 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <555F4D7D.7080504@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> Message-ID: <556006DA.8030202@oracle.com> Isn't this what you want: 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long Register 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); 5660 5661 size(2*4); 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} 5664 ins_encode %{ 5665 __ ldub($mem$$Address, $dst$$Register); 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), $dst$$Register); 5667 %} 5668 ins_pipe(iload_mem); 5669 %} dl On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: > Updated webrev: > 
http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 > > Introduced immU8/immU16 operands on sparc. > > As a cleanup, sorted immI/immU operand declarations. > > Best regards, > Vladimir Ivanov > > On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >> Thanks for looking into the fix, Roland. >> >> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>> >>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>> immI8/immI16 to immI, since zero-extending move is used. >>> >>> The change on sparc doesn?t look good: >>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>> >>> the and instruction cannot encode arbitrary large integer constants. >> Good catch. I missed that it is and3. >> >>> Isn?t the root of the problem that we want an unsigned 8 bit integer >>> and not a signed one? >> Yes, unsigned 8-bit integer work fine, but it can be generalized to >> arbitrary masks as well. >> >> I experimented with new operand types (immU8, immU16), but it was more >> code for no particular benefit. I can use them on sparc. >> >> Best regards, >> Vladimir Ivanov From dawid.weiss at gmail.com Sat May 23 12:51:03 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Sat, 23 May 2015 14:51:03 +0200 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) Message-ID: Hi everyone, I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea builds of JDK 1.9.0. 
Here's some context: - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 build currently fails (Windows, Linux environments, 64-bit), - the bug/ issue is a suspicious AIOOB on: org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) which happens to be the line of code inside this for loop: for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { tt[cftab[ll8[i] & 0xff]++] = i; } Which array access this is exactly is hard to tell, but the *same* bzip input file does not produce the error on any other JVM (or an earlier releases of 1.9ea). This code is deterministic in the test that uses the above routine. - the problem *only* appears from 1.9ea_b64; on earlier releases the same code passes just fine (bisected it back from b45), - I also checked 1.9ea_b65 (which happens to be on the download server but wasn't properly announced yet?). The problem persists. - the problem does reproduce on the build server (Windows and Linux). Interestingly, I couldn't reproduce it locally. The code is proprietary, I couldn't narrow it down yet to something that would reproduce (sigh). I realize this is insufficient information to get started, but perhaps this issue is already known or somebody may have a clue at what is going on (CCing hotspot-compiler-dev)? Dawid From dawid.weiss at gmail.com Sat May 23 19:58:53 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Sat, 23 May 2015 21:58:53 +0200 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) Message-ID: Good news. I have a repro that crashes for me every time and it only contains open-source code (and some data). Bad news: it's probably a compiler bug because everything works just fine with -Xint. I'll put it together into a repro tomorrow, hopefully, and will ask somebody with the right permission to file an issue in Jira. Should be relatively easy to narrow it down by bisecting hs repo commits. 
Dawid On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss wrote: > Hi Rory, everyone, > > I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea > builds of JDK 1.9.0. Here's some context: > > - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 > build currently fails (Windows, Linux environments, 64-bit), > > - the bug/ issue is a suspicious AIOOB on: > > org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) > > which happens to be the line of code inside this for loop: > > for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { > tt[cftab[ll8[i] & 0xff]++] = i; > } > > Which array access this is exactly is hard to tell, but the *same* > bzip input file does not produce the error on any other JVM (or an > earlier releases of 1.9ea). This code is deterministic in the test > that uses the above routine. > > - the problem *only* appears from 1.9ea_b64; on earlier releases the > same code passes just fine (bisected it back from b45), > > - I also checked 1.9ea_b65 (which happens to be on the download server > but wasn't properly announced yet?). The problem persists. > > - the problem does reproduce on the build server (Windows and Linux). > Interestingly, I couldn't reproduce it locally. The code is > proprietary, I couldn't narrow it down yet to something that would > reproduce (sigh). > > I realize this is insufficient information to get started, but perhaps > this issue is already known or somebody may have a clue at what is > going on (CCing hotspot-compiler-dev)? 
> > Dawid From dawid.weiss at gmail.com Sun May 24 07:23:11 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Sun, 24 May 2015 09:23:11 +0200 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) In-Reply-To: References: Message-ID: Hello again, The bug repro code is at the link below: http://download.carrotsearch.com/jvm/repro.zip Definitely something with the compilation because disabling loop unrolling (or running in interpreted mode) doesn't trigger the bug. More information (also included in README.txt) quoted below. Dawid Expected behavior: The code should re-read the gz2 resource, looping and printing (infinitely): Round... Round... Round... Actual behavior (64-Bit Server VM, build 1.9.0-ea-b64, mixed mode): Round... Round... Round... Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 314297 at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:136) at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:111) at bug.Repro.main(Repro.java:15) Notes ----- - Self contained maven project (copied commons compress sources so that one can tweak them if needed). An additional bz2 resource is needed (included). - Build with: mvn package - Run with: java -jar target/Repro-0.0.0.jar - Running in interpreted mode does *not* cause any error: java -Xint -jar target/Repro-0.0.0.jar - Running without loop unrolls does *not* cause any error: java -Xbatch -XX:LoopUnrollLimit=0 -jar target/Repro-0.0.0.jar On Sat, May 23, 2015 at 9:58 PM, Dawid Weiss wrote: > Good news. I have a repro that crashes for me every time and it only > contains open-source code (and some data). Bad news: it's probably a > compiler bug because everything works just fine with -Xint. 
> > I'll put it together into a repro tomorrow, hopefully, and will ask > somebody with the right permission to file an issue in Jira. Should be > relatively easy to narrow it down by bisecting hs repo commits. > > Dawid > > On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss > wrote: >> Hi Rory, everyone, >> >> I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea >> builds of JDK 1.9.0. Here's some context: >> >> - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 >> build currently fails (Windows, Linux environments, 64-bit), >> >> - the bug/ issue is a suspicious AIOOB on: >> >> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >> >> which happens to be the line of code inside this for loop: >> >> for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { >> tt[cftab[ll8[i] & 0xff]++] = i; >> } >> >> Which array access this is exactly is hard to tell, but the *same* >> bzip input file does not produce the error on any other JVM (or an >> earlier releases of 1.9ea). This code is deterministic in the test >> that uses the above routine. >> >> - the problem *only* appears from 1.9ea_b64; on earlier releases the >> same code passes just fine (bisected it back from b45), >> >> - I also checked 1.9ea_b65 (which happens to be on the download server >> but wasn't properly announced yet?). The problem persists. >> >> - the problem does reproduce on the build server (Windows and Linux). >> Interestingly, I couldn't reproduce it locally. The code is >> proprietary, I couldn't narrow it down yet to something that would >> reproduce (sigh). >> >> I realize this is insufficient information to get started, but perhaps >> this issue is already known or somebody may have a clue at what is >> going on (CCing hotspot-compiler-dev)? 
>> >> Dawid From david.holmes at oracle.com Sun May 24 23:41:31 2015 From: david.holmes at oracle.com (David Holmes) Date: Mon, 25 May 2015 09:41:31 +1000 Subject: [8u60] RFR(XS): JDK-8077620: [TESTBUG] Some of the hotspot tests require at least compact profile 3 In-Reply-To: <555F2353.1020406@oracle.com> References: <555F2353.1020406@oracle.com> Message-ID: <556261AB.6010000@oracle.com> Looks good. Thanks, David On 22/05/2015 10:38 PM, denis kononenko wrote: > Hi All, > > Could you please review a backport for JDK-8080693. > > JDK9 webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.00/ > JDK9 bug id: 8080693, http://bugs.openjdk.java.net/browse/JDK-8080693 > JDK9 review thread: > http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-May/014780.html > > > The changes cannot be applied cleanly, so I've created a new webrev: > > 8u60 webrev link: > http://cr.openjdk.java.net/~skovalev/dkononen/8077620/webrev.8u60.00/ > JDK8 bug id: 8077620, https://bugs.openjdk.java.net/browse/JDK-8077620 > > Thank you, > Denis. > From dawid.weiss at gmail.com Mon May 25 07:46:54 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Mon, 25 May 2015 09:46:54 +0200 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) In-Reply-To: <55623549.2060502@oracle.com> References: <55623549.2060502@oracle.com> Message-ID: Filed a bug report with Review ID: JI-9021458. Thanks! Dawid On Sun, May 24, 2015 at 10:32 PM, Rory O'Donnell wrote: > Hi Dawid, > > Could you log an incident at bugs.java.com and let us know the incident id. > > Thanks, Rory > > > On 24/05/2015 08:23, Dawid Weiss wrote: >> >> Hello again, >> >> The bug repro code is at the link below: >> http://download.carrotsearch.com/jvm/repro.zip >> >> Definitely something with the compilation because disabling loop >> unrolling (or running in interpreted mode) doesn't trigger the bug. >> More information (also included in README.txt) quoted below. 
>> >> Dawid >> >> Expected behavior: >> The code should re-read the gz2 resource, looping and printing >> (infinitely): >> Round... >> Round... >> Round... >> >> Actual behavior (64-Bit Server VM, build 1.9.0-ea-b64, mixed mode): >> Round... >> Round... >> Round... >> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: >> 314297 >> at >> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >> at >> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:136) >> at >> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:111) >> at bug.Repro.main(Repro.java:15) >> >> Notes >> ----- >> >> - Self contained maven project (copied commons compress sources so that >> one can >> tweak them if needed). An additional bz2 resource is needed (included). >> - Build with: >> mvn package >> - Run with: >> java -jar target/Repro-0.0.0.jar >> - Running in interpreted mode does *not* cause any error: >> java -Xint -jar target/Repro-0.0.0.jar >> - Running without loop unrolls does *not* cause any error: >> java -Xbatch -XX:LoopUnrollLimit=0 -jar target/Repro-0.0.0.jar >> >> On Sat, May 23, 2015 at 9:58 PM, Dawid Weiss >> wrote: >>> >>> Good news. I have a repro that crashes for me every time and it only >>> contains open-source code (and some data). Bad news: it's probably a >>> compiler bug because everything works just fine with -Xint. >>> >>> I'll put it together into a repro tomorrow, hopefully, and will ask >>> somebody with the right permission to file an issue in Jira. Should be >>> relatively easy to narrow it down by bisecting hs repo commits. >>> >>> Dawid >>> >>> On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss >>> wrote: >>>> >>>> Hi Rory, everyone, >>>> >>>> I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea >>>> builds of JDK 1.9.0. 
Here's some context: >>>> >>>> - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 >>>> build currently fails (Windows, Linux environments, 64-bit), >>>> >>>> - the bug/ issue is a suspicious AIOOB on: >>>> >>>> >>>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >>>> >>>> which happens to be the line of code inside this for loop: >>>> >>>> for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { >>>> tt[cftab[ll8[i] & 0xff]++] = i; >>>> } >>>> >>>> Which array access this is exactly is hard to tell, but the *same* >>>> bzip input file does not produce the error on any other JVM (or an >>>> earlier releases of 1.9ea). This code is deterministic in the test >>>> that uses the above routine. >>>> >>>> - the problem *only* appears from 1.9ea_b64; on earlier releases the >>>> same code passes just fine (bisected it back from b45), >>>> >>>> - I also checked 1.9ea_b65 (which happens to be on the download server >>>> but wasn't properly announced yet?). The problem persists. >>>> >>>> - the problem does reproduce on the build server (Windows and Linux). >>>> Interestingly, I couldn't reproduce it locally. The code is >>>> proprietary, I couldn't narrow it down yet to something that would >>>> reproduce (sigh). >>>> >>>> I realize this is insufficient information to get started, but perhaps >>>> this issue is already known or somebody may have a clue at what is >>>> going on (CCing hotspot-compiler-dev)? 
>>>> >>>> Dawid > > > -- > Rgds,Rory O'Donnell > Quality Engineering Manager > Oracle EMEA, Dublin,Ireland > From rory.odonnell at oracle.com Mon May 25 08:00:55 2015 From: rory.odonnell at oracle.com (Rory O'Donnell) Date: Mon, 25 May 2015 09:00:55 +0100 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) In-Reply-To: References: <55623549.2060502@oracle.com> Message-ID: <5562D6B7.2050800@oracle.com> Hi Dawid, Here is the JBS id: https://bugs.openjdk.java.net/browse/JDK-8080976 Thank you for submitting the bug. Rgds, Rory On 25/05/2015 08:46, Dawid Weiss wrote: > Filed a bug report with Review ID: JI-9021458. Thanks! > > Dawid > > On Sun, May 24, 2015 at 10:32 PM, Rory O'Donnell > wrote: >> Hi Dawid, >> >> Could you log an incident at bugs.java.com and let us know the incident id. >> >> Thanks, Rory >> >> >> On 24/05/2015 08:23, Dawid Weiss wrote: >>> Hello again, >>> >>> The bug repro code is at the link below: >>> http://download.carrotsearch.com/jvm/repro.zip >>> >>> Definitely something with the compilation because disabling loop >>> unrolling (or running in interpreted mode) doesn't trigger the bug. >>> More information (also included in README.txt) quoted below. >>> >>> Dawid >>> >>> Expected behavior: >>> The code should re-read the gz2 resource, looping and printing >>> (infinitely): >>> Round... >>> Round... >>> Round... >>> >>> Actual behavior (64-Bit Server VM, build 1.9.0-ea-b64, mixed mode): >>> Round... >>> Round... >>> Round... 
>>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: >>> 314297 >>> at >>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >>> at >>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:136) >>> at >>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:111) >>> at bug.Repro.main(Repro.java:15) >>> >>> Notes >>> ----- >>> >>> - Self contained maven project (copied commons compress sources so that >>> one can >>> tweak them if needed). An additional bz2 resource is needed (included). >>> - Build with: >>> mvn package >>> - Run with: >>> java -jar target/Repro-0.0.0.jar >>> - Running in interpreted mode does *not* cause any error: >>> java -Xint -jar target/Repro-0.0.0.jar >>> - Running without loop unrolls does *not* cause any error: >>> java -Xbatch -XX:LoopUnrollLimit=0 -jar target/Repro-0.0.0.jar >>> >>> On Sat, May 23, 2015 at 9:58 PM, Dawid Weiss >>> wrote: >>>> Good news. I have a repro that crashes for me every time and it only >>>> contains open-source code (and some data). Bad news: it's probably a >>>> compiler bug because everything works just fine with -Xint. >>>> >>>> I'll put it together into a repro tomorrow, hopefully, and will ask >>>> somebody with the right permission to file an issue in Jira. Should be >>>> relatively easy to narrow it down by bisecting hs repo commits. >>>> >>>> Dawid >>>> >>>> On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss >>>> wrote: >>>>> Hi Rory, everyone, >>>>> >>>>> I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea >>>>> builds of JDK 1.9.0. 
Here's some context: >>>>> >>>>> - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 >>>>> build currently fails (Windows, Linux environments, 64-bit), >>>>> >>>>> - the bug/ issue is a suspicious AIOOB on: >>>>> >>>>> >>>>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >>>>> >>>>> which happens to be the line of code inside this for loop: >>>>> >>>>> for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { >>>>> tt[cftab[ll8[i] & 0xff]++] = i; >>>>> } >>>>> >>>>> Which array access this is exactly is hard to tell, but the *same* >>>>> bzip input file does not produce the error on any other JVM (or an >>>>> earlier releases of 1.9ea). This code is deterministic in the test >>>>> that uses the above routine. >>>>> >>>>> - the problem *only* appears from 1.9ea_b64; on earlier releases the >>>>> same code passes just fine (bisected it back from b45), >>>>> >>>>> - I also checked 1.9ea_b65 (which happens to be on the download server >>>>> but wasn't properly announced yet?). The problem persists. >>>>> >>>>> - the problem does reproduce on the build server (Windows and Linux). >>>>> Interestingly, I couldn't reproduce it locally. The code is >>>>> proprietary, I couldn't narrow it down yet to something that would >>>>> reproduce (sigh). >>>>> >>>>> I realize this is insufficient information to get started, but perhaps >>>>> this issue is already known or somebody may have a clue at what is >>>>> going on (CCing hotspot-compiler-dev)? 
>>>>> >>>>> Dawid >> >> -- >> Rgds,Rory O'Donnell >> Quality Engineering Manager >> Oracle EMEA, Dublin,Ireland >> -- Rgds,Rory O'Donnell Quality Engineering Manager Oracle EMEA , Dublin, Ireland From zoltan.majo at oracle.com Tue May 26 07:06:52 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Tue, 26 May 2015 09:06:52 +0200 Subject: Question on ADLC's RegClass destructor In-Reply-To: References: Message-ID: <55641B8C.7080007@oracle.com> Hi Kris, On 05/23/2015 12:22 AM, Krystal Mok wrote: > Hi compiler team, > > I came across this piece of code in ADLC, and I'm not sure if it's > right or not: > > http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/formsopt.cpp#l236 > > RegClass::~RegClass() { > delete _classid; > } > > Why do we need to delete the _classid field here? If this is needed, > then a bunch of other classes in formsopt.hpp would need the same > treatment. If I remember correctly, I added the destructor because the caller of the RegClass constructor allocates memory for _classid, but does not free it. But I'd have to check to be sure. Best regards, Zoltan > > (I know this virtual destructor was added mainly for the two new > subclasses, but I'm just wondering about this one...) 
> > Thanks, > Kris From aph at redhat.com Tue May 26 09:29:20 2015 From: aph at redhat.com (Andrew Haley) Date: Tue, 26 May 2015 10:29:20 +0100 Subject: Protection of RSA from timing and cache-flushing attacks [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <5550CCC0.6000502@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <554CE68F.20509@redhat.com> <554CF00B.6000508@redhat.com> <5550CCC0.6000502@redhat.com> Message-ID: <55643CF0.8090100@redhat.com> On 05/11/2015 04:37 PM, Florian Weimer wrote: > On 05/08/2015 07:19 PM, Andrew Haley wrote: > >>> Do we want to add side-channel protection as part of this effort >>> (against timing attacks and cache-flushing attacks)? >> >> I wouldn't have thought so. It might make sense to add an optional >> path without key-dependent branches, but not as a part of this effort: >> the goals are completely orthogonal. > > I'm not well-versed in this kind of side-channel protection for RSA > implementations, but my impression that algorithm changes are needed to > mitigate the impact of data-dependent memory fetches (see fixed-width > modular exponentiation). But maybe the necessary changes materialize at > a higher level, beyond the operation which you proposed to intrinsify. By the way: there is quite a bit of code in sun/security/rsa/RSACore.java to protect against timing attacks. In particular, the patch for "8031346: Enhance RSA key handling" looks quite thorough and there is also extra care taken to make padding operations execute in constant time. Andrew. 
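The constant-time handling Andrew refers to comes down to comparisons and padding checks whose running time does not depend on where the first mismatch occurs. A minimal sketch of the usual pattern follows; it is illustrative only, not the RSACore sources (the JDK's java.security.MessageDigest.isEqual provides an equivalent check for byte arrays):

```java
import java.security.MessageDigest;

public class ConstantTimeCompare {
    // Accumulates differences with XOR/OR instead of returning at the
    // first mismatch, so the loop always scans the full length and the
    // timing does not reveal the position of the first differing byte.
    // (The early return on a length mismatch only leaks the lengths.)
    static boolean ctEquals(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        int diff = 0;
        for (int i = 0; i < a.length; i++) {
            diff |= a[i] ^ b[i];
        }
        return diff == 0;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3};
        byte[] y = {1, 2, 4};
        System.out.println(ctEquals(x, x.clone()));       // true
        System.out.println(MessageDigest.isEqual(x, y));  // false
    }
}
```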
From rory.odonnell at oracle.com Sun May 24 20:32:09 2015 From: rory.odonnell at oracle.com (Rory O'Donnell) Date: Sun, 24 May 2015 21:32:09 +0100 Subject: 1.9.0-ea-b64 regression (AIOOB thrown where it shouldn't be thrown) In-Reply-To: References: Message-ID: <55623549.2060502@oracle.com> Hi Dawid, Could you log an incident at bugs.java.com and let us know the incident id. Thanks, Rory On 24/05/2015 08:23, Dawid Weiss wrote: > Hello again, > > The bug repro code is at the link below: > http://download.carrotsearch.com/jvm/repro.zip > > Definitely something with the compilation because disabling loop > unrolling (or running in interpreted mode) doesn't trigger the bug. > More information (also included in README.txt) quoted below. > > Dawid > > Expected behavior: > The code should re-read the gz2 resource, looping and printing (infinitely): > Round... > Round... > Round... > > Actual behavior (64-Bit Server VM, build 1.9.0-ea-b64, mixed mode): > Round... > Round... > Round... > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 314297 > at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) > at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:136) > at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.(BZip2CompressorInputStream.java:111) > at bug.Repro.main(Repro.java:15) > > Notes > ----- > > - Self contained maven project (copied commons compress sources so that one can > tweak them if needed). An additional bz2 resource is needed (included). 
> - Build with: > mvn package > - Run with: > java -jar target/Repro-0.0.0.jar > - Running in interpreted mode does *not* cause any error: > java -Xint -jar target/Repro-0.0.0.jar > - Running without loop unrolls does *not* cause any error: > java -Xbatch -XX:LoopUnrollLimit=0 -jar target/Repro-0.0.0.jar > > On Sat, May 23, 2015 at 9:58 PM, Dawid Weiss wrote: >> Good news. I have a repro that crashes for me every time and it only >> contains open-source code (and some data). Bad news: it's probably a >> compiler bug because everything works just fine with -Xint. >> >> I'll put it together into a repro tomorrow, hopefully, and will ask >> somebody with the right permission to file an issue in Jira. Should be >> relatively easy to narrow it down by bisecting hs repo commits. >> >> Dawid >> >> On Sat, May 23, 2015 at 2:19 PM, Dawid Weiss >> wrote: >>> Hi Rory, everyone, >>> >>> I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea >>> builds of JDK 1.9.0. Here's some context: >>> >>> - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 >>> build currently fails (Windows, Linux environments, 64-bit), >>> >>> - the bug/ issue is a suspicious AIOOB on: >>> >>> org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) >>> >>> which happens to be the line of code inside this for loop: >>> >>> for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { >>> tt[cftab[ll8[i] & 0xff]++] = i; >>> } >>> >>> Which array access this is exactly is hard to tell, but the *same* >>> bzip input file does not produce the error on any other JVM (or an >>> earlier releases of 1.9ea). This code is deterministic in the test >>> that uses the above routine. 
>>> >>> - the problem *only* appears from 1.9ea_b64; on earlier releases the >>> same code passes just fine (bisected it back from b45), >>> >>> - I also checked 1.9ea_b65 (which happens to be on the download server >>> but wasn't properly announced yet?). The problem persists. >>> >>> - the problem does reproduce on the build server (Windows and Linux). >>> Interestingly, I couldn't reproduce it locally. The code is >>> proprietary, I couldn't narrow it down yet to something that would >>> reproduce (sigh). >>> >>> I realize this is insufficient information to get started, but perhaps >>> this issue is already known or somebody may have a clue at what is >>> going on (CCing hotspot-compiler-dev)? >>> >>> Dawid -- Rgds,Rory O'Donnell Quality Engineering Manager Oracle EMEA, Dublin,Ireland From dawid.weiss at carrotsearch.com Sat May 23 12:19:46 2015 From: dawid.weiss at carrotsearch.com (Dawid Weiss) Date: Sat, 23 May 2015 14:19:46 +0200 Subject: 1.9.0-ea-b64 regression? Message-ID: Hi Rory, everyone, I've ran into an issue with a suspicious ArrayIndexOutOfBounds on ea builds of JDK 1.9.0. Here's some context: - we run separate builds for 1.7, 1.8 and 1.9ea VMs and only the 1.9 build currently fails (Windows, Linux environments, 64-bit), - the bug/ issue is a suspicious AIOOB on: org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupBlock(BZip2CompressorInputStream.java:820) which happens to be the line of code inside this for loop: for (int i = 0, lastShadow = this.last; i <= lastShadow; i++) { tt[cftab[ll8[i] & 0xff]++] = i; } Which array access this is exactly is hard to tell, but the *same* bzip input file does not produce the error on any other JVM (or an earlier releases of 1.9ea). This code is deterministic in the test that uses the above routine. 
- the problem *only* appears from 1.9ea_b64; on earlier releases the same code passes just fine (bisected it back from b45), - I also checked 1.9ea_b65 (which happens to be on the download server but wasn't properly announced yet?). The problem persists. - the problem does reproduce on the build server (Windows and Linux). Interestingly, I couldn't reproduce it locally. The code is proprietary, I couldn't narrow it down yet to something that would reproduce (sigh). I realize this is insufficient information to get started, but perhaps this issue is already known or somebody may have a clue at what is going on (CCing hotspot-compiler-dev)? Dawid From andreas.eriksson at oracle.com Tue May 26 15:55:29 2015 From: andreas.eriksson at oracle.com (Andreas Eriksson) Date: Tue, 26 May 2015 17:55:29 +0200 Subject: [8u60] backport RFR: 8060036 and 8080156 Message-ID: <55649771.2050105@oracle.com> Hi, I'd like to backport the fix for 8060036 : C2: CmpU nodes can end up with wrong type information http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/53707cf9a443 And at the same time the test that was pushed by Tobias with 8080156 : Integer.toString(int value) sometimes throws NPE http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/b2e3cbd555fc The changes applied cleanly from the jdk9 changesets. A jprt run with the new test passes. Thanks, Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From rednaxelafx at gmail.com Tue May 26 19:34:16 2015 From: rednaxelafx at gmail.com (Krystal Mok) Date: Tue, 26 May 2015 12:34:16 -0700 Subject: Question on ADLC's RegClass destructor In-Reply-To: <55641B8C.7080007@oracle.com> References: <55641B8C.7080007@oracle.com> Message-ID: Hi Zoltan, Thanks for your reply. When I was looking at the call chain that leads to RegClass' constructor, get_ident() is what's passed into RegClass as classid. 
The comment on get_ident_common() says: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/adlparse.cpp#l4560 //------------------------------get_ident_common------------------------------- // Looks for an identifier in the buffer, and turns it into a null terminated // string(still inside the file buffer). Returns a pointer to the string or // NULL if some other token is found instead. char *ADLParser::get_ident_common(bool do_preproc) { So normally, the string returned should be still inside the file buffer (if no preprocessing is needed...), and shouldn't need to be free'd afterwards; but if preprocessing is needed, then yes, there's dynamically allocated memory for the string returned. get_ident() is get_ident_common(true), so it's possible that preprocessing is needed; but from the RegClass constructor, it'd be hard to tell whether the string passed in is from the file buffer or from a dynamically allocated piece of memory. It (the original code before your changes) was an odd piece of code to begin with... - Kris On Tue, May 26, 2015 at 12:06 AM, Zolt?n Maj? wrote: > Hi Kris, > > > On 05/23/2015 12:22 AM, Krystal Mok wrote: > >> Hi compiler team, >> >> I came across this piece of code in ADLC, and I'm not sure if it's right >> or not: >> >> >> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/formsopt.cpp#l236 >> >> RegClass::~RegClass() { >> delete _classid; >> } >> >> Why do we need to delete the _classid field here? If this is needed, then >> a bunch of other classes in formsopt.hpp would need the same treatment. >> > > If I remember correctly, I added the destructor because the caller of the > RegClass constructor allocates memory for _classid, but does not free it. > But I'd have to check to be sure. > > Best regards, > > > Zoltan > > > >> (I know this virtual destructor was added mainly for the two new >> subclasses, but I'm just wondering about this one...) 
>> >> Thanks, >> Kris >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zoltan.majo at oracle.com Wed May 27 08:09:50 2015 From: zoltan.majo at oracle.com (=?UTF-8?B?Wm9sdMOhbiBNYWrDsw==?=) Date: Wed, 27 May 2015 10:09:50 +0200 Subject: Question on ADLC's RegClass destructor In-Reply-To: References: <55641B8C.7080007@oracle.com> Message-ID: <55657BCE.1020306@oracle.com> Hi Kris, On 05/26/2015 09:34 PM, Krystal Mok wrote: > Hi Zoltan, > > Thanks for your reply. When I was looking at the call chain that leads > to RegClass' constructor, get_ident() is what's passed into RegClass > as classid. > > The comment on get_ident_common() says: > http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/adlparse.cpp#l4560 > //------------------------------get_ident_common------------------------------- > // Looks for an identifier in the buffer, and turns it into a null > terminated > // string(still inside the file buffer). Returns a pointer to the > string or > // NULL if some other token is found instead. > char *ADLParser::get_ident_common(bool do_preproc) { > > So normally, the string returned should be still inside the file > buffer (if no preprocessing is needed...), and shouldn't need to be > free'd afterwards; but if preprocessing is needed, then yes, there's > dynamically allocated memory for the string returned. thank you for the detailed explanation and for bringing attention to this issue! I filed JDK-8081288 [1] to track the problem and will take care of it hopefully soon. If you have some time to spare, you are of course welcome to contribute a patch. Best regards, Zoltan [1] https://bugs.openjdk.java.net/browse/JDK-8081288 > get_ident() is get_ident_common(true), so it's possible that > preprocessing is needed; but from the RegClass constructor, it'd be > hard to tell whether the string passed in is from the file buffer or > from a dynamically allocated piece of memory. 
> > It (the original code before your changes) was an odd piece of code to > begin with... > > - Kris > > > On Tue, May 26, 2015 at 12:06 AM, Zolt?n Maj? > wrote: > > Hi Kris, > > > On 05/23/2015 12:22 AM, Krystal Mok wrote: > > Hi compiler team, > > I came across this piece of code in ADLC, and I'm not sure if > it's right or not: > > http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/ac291bc3ece2/src/share/vm/adlc/formsopt.cpp#l236 > > RegClass::~RegClass() { > delete _classid; > } > > Why do we need to delete the _classid field here? If this is > needed, then a bunch of other classes in formsopt.hpp would > need the same treatment. > > > If I remember correctly, I added the destructor because the caller > of the RegClass constructor allocates memory for _classid, but > does not free it. But I'd have to check to be sure. > > Best regards, > > > Zoltan > > > > (I know this virtual destructor was added mainly for the two > new subclasses, but I'm just wondering about this one...) > > Thanks, > Kris > > > From benedikt.wedenik at theobroma-systems.com Wed May 27 10:27:46 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 27 May 2015 12:27:46 +0200 Subject: Implicit null-check in hot-loop? Message-ID: Hey there! I wrote a small Fibonacci-Micro-Benchmark to examine the generated assembly (eg. loop unrolling, pipeline, and so on). This is the relevant Java Code: 2 private static long fib(long n) { 3 if(n == 0) { 4 return 0; 5 } 6 if(n == 1) { 7 return 1; 8 } 9 10 long res = 3; 11 long imm0 = 0; 12 long imm1 = 1; 13 14 for(long i = 1; i < n; ++i) { 15 res = imm0 + imm1; 16 imm0 = imm1; 17 imm1 = res; 18 } 19 return res; 20 } In the hot-loop there is this ?ldr? which looked a little ?strange? at the first glance. I think that this load is a null-check? Is that the case? If so, why does HotSpot needs to check for null in every iteration step? 
I also investigated the generated code on x86 which is quite similar, but
instead of a load, they are using the "test"-instruction which performs an
"and" but only sets the flags discarding the result.
Is there any similar instruction available on aarch64 or is this already
the closest solution?

Here is the relevant assembly:

541 0x0000007fa4188a4c: adrp x23, 0x0000007fb5580000
542                     ; {poll}
543 0x0000007fa4188a50: b 0x0000007fa4188a74
544 0x0000007fa4188a54: nop
545 0x0000007fa4188a58: nop
546 0x0000007fa4188a5c: nop
547 0x0000007fa4188a60: add x19, x19, #0x1 ;*ladd
548                     ; - Fib::fib at 52 (line 14)
549
550 0x0000007fa4188a64: add xlocals, xbcp, xdispatch
551                     ; OopMap{off=104}
552                     ;*goto
553                     ; - Fib::fib at 55 (line 14)
554
555 0x0000007fa4188a68: ldr wzr, [x23] ;*goto
556                     ; - Fib::fib at 55 (line 14)
557                     ; {poll}
558 0x0000007fa4188a6c: mov xbcp, xdispatch
559 0x0000007fa4188a70: mov xdispatch, xlocals ;*lload
560                     ; - Fib::fib at 29 (line 14)
561
562 0x0000007fa4188a74: cmp x19, xesp
563 0x0000007fa4188a78: b.lt 0x0000007fa4188a60 ;*ifge

Btw - why does HotSpot not unroll this loop?

Thanks and best regards,
Benedikt

From aph at redhat.com Wed May 27 10:36:34 2015
From: aph at redhat.com (Andrew Haley)
Date: Wed, 27 May 2015 11:36:34 +0100
Subject: Implicit null-check in hot-loop?
In-Reply-To:
References:
Message-ID: <55659E32.9070007@redhat.com>

On 05/27/2015 11:27 AM, Benedikt Wedenik wrote:
>
> In the hot-loop there is this "ldr" which looked a little "strange" at the first glance.
> I think that this load is a null-check? Is that the case?

No. It's a safepoint check. HotSpot has to insert one of these because
it can't prove that the loop is reasonably short-lived.

> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are
> using the "test"-instruction which performs an "and" but only sets the flags discarding the result.
> Is there any similar instruction available on aarch64 or is this already the closest solution?

Closest to what? A load to XZR is the best solution: it does not hit
the flags. x86 cannot do this.

It looks like great code.

Andrew.

From benedikt.wedenik at theobroma-systems.com Wed May 27 10:46:32 2015
From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik)
Date: Wed, 27 May 2015 12:46:32 +0200
Subject: Implicit null-check in hot-loop?
In-Reply-To: <55659E32.9070007@redhat.com>
References: <55659E32.9070007@redhat.com>
Message-ID: <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com>

Oh!

Thanks :) So then it's not so bad, cause it is (I guess) obligatory.

Do you have any idea why this loop does not get unrolled?

Benedikt.

On 27 May 2015, at 12:36, Andrew Haley wrote:

> On 05/27/2015 11:27 AM, Benedikt Wedenik wrote:
>>
>> In the hot-loop there is this "ldr" which looked a little "strange" at the first glance.
>> I think that this load is a null-check? Is that the case?
>
> No. It's a safepoint check. HotSpot has to insert one of these because
> it can't prove that the loop is reasonably short-lived.
>
>> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are
>> using the "test"-instruction which performs an "and" but only sets the flags discarding the result.
>> Is there any similar instruction available on aarch64 or is this already the closest solution?
>
> Closest to what? A load to XZR is the best solution: it does not hit
> the flags. x86 cannot do this.
>
> It looks like great code.
>
> Andrew.
>

From vitalyd at gmail.com Wed May 27 11:39:48 2015
From: vitalyd at gmail.com (Vitaly Davidovich)
Date: Wed, 27 May 2015 07:39:48 -0400
Subject: Implicit null-check in hot-loop?
In-Reply-To: <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: Cause you're using a long as induction variable; change to int and it should unroll. sent from my phone On May 27, 2015 6:46 AM, "Benedikt Wedenik" < benedikt.wedenik at theobroma-systems.com> wrote: > Oh! > > Thanks :) So then it's not so bad, cause it is (I guess) obligatory. > > Do you have any idea why this loop does not get unrolled? > > Benedikt. > > On 27 May 2015, at 12:36, Andrew Haley wrote: > > > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: > >> > >> In the hot-loop there is this ?ldr? which looked a little ?strange? at > the first glance. > >> I think that this load is a null-check? Is that the case? > > > > No. It's a safepoint check. HotSpot has to insert one of these because > > it can't prove that the loop is reasonably short-lived. > > > >> I also investigated the generated code on x86 which is quite similar, > but instead of a load, they are > >> using the ?test?-instruction which performs an ?and? but only sets the > flags discarding the result. > >> Is there any similar instruction available on aarch64 or is this > already the closest solution? > > > > Closest to what? A load to XZR is the best solution: it does not hit > > the flags. x86 cannot do this. > > > > It looks like great code. > > > > Andrew. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 27 11:44:16 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 27 May 2015 07:44:16 -0400 Subject: Implicit null-check in hot-loop? 
In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: Also, the safepoint check on each iteration should go away when using an int loop; it'll be coalesced into once per unroll factor or may go away entirely if the loop turns into a counted loop. sent from my phone On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: > Cause you're using a long as induction variable; change to int and it > should unroll. > > sent from my phone > On May 27, 2015 6:46 AM, "Benedikt Wedenik" < > benedikt.wedenik at theobroma-systems.com> wrote: > >> Oh! >> >> Thanks :) So then it's not so bad, cause it is (I guess) obligatory. >> >> Do you have any idea why this loop does not get unrolled? >> >> Benedikt. >> >> On 27 May 2015, at 12:36, Andrew Haley wrote: >> >> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >> >> >> >> In the hot-loop there is this ?ldr? which looked a little ?strange? at >> the first glance. >> >> I think that this load is a null-check? Is that the case? >> > >> > No. It's a safepoint check. HotSpot has to insert one of these because >> > it can't prove that the loop is reasonably short-lived. >> > >> >> I also investigated the generated code on x86 which is quite similar, >> but instead of a load, they are >> >> using the ?test?-instruction which performs an ?and? but only sets the >> flags discarding the result. >> >> Is there any similar instruction available on aarch64 or is this >> already the closest solution? >> > >> > Closest to what? A load to XZR is the best solution: it does not hit >> > the flags. x86 cannot do this. >> > >> > It looks like great code. >> > >> > Andrew. >> > >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From benedikt.wedenik at theobroma-systems.com Wed May 27 12:09:22 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 27 May 2015 14:09:22 +0200 Subject: Implicit null-check in hot-loop? 
In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: First of all, thanks for your quick response :) You were absolutely right - but I do not understand why the long counter was the problem. Is there any reason why the loop-unrolling is only available for int loops? I mean I guess so - because else it would probably be implemented already. But still it would be a nice opportunity for an optimisation :) Benedikt. On 27 May 2015, at 13:44, Vitaly Davidovich wrote: > Also, the safepoint check on each iteration should go away when using an int loop; it'll be coalesced into once per unroll factor or may go away entirely if the loop turns into a counted loop. > > sent from my phone > > On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: > Cause you're using a long as induction variable; change to int and it should unroll. > > sent from my phone > > On May 27, 2015 6:46 AM, "Benedikt Wedenik" wrote: > Oh! > > Thanks :) So then it's not so bad, cause it is (I guess) obligatory. > > Do you have any idea why this loop does not get unrolled? > > Benedikt. > > On 27 May 2015, at 12:36, Andrew Haley wrote: > > > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: > >> > >> In the hot-loop there is this ?ldr? which looked a little ?strange? at the first glance. > >> I think that this load is a null-check? Is that the case? > > > > No. It's a safepoint check. HotSpot has to insert one of these because > > it can't prove that the loop is reasonably short-lived. > > > >> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are > >> using the ?test?-instruction which performs an ?and? but only sets the flags discarding the result. > >> Is there any similar instruction available on aarch64 or is this already the closest solution? > > > > Closest to what? A load to XZR is the best solution: it does not hit > > the flags. x86 cannot do this. 
> > > > It looks like great code. > > > > Andrew. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 27 12:25:24 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 27 May 2015 08:25:24 -0400 Subject: Implicit null-check in hot-loop? In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: I don't think there's any fundamental reason, it's just that int loops are more common and so were optimized. sent from my phone On May 27, 2015 8:09 AM, "Benedikt Wedenik" < benedikt.wedenik at theobroma-systems.com> wrote: > First of all, thanks for your quick response :) > You were absolutely right - but I do not understand why the long counter > was the problem. > > Is there any reason why the loop-unrolling is only available for int loops? > I mean I guess so - because else it would probably be implemented already. > But still it would be a nice opportunity for an optimisation :) > > Benedikt. > On 27 May 2015, at 13:44, Vitaly Davidovich wrote: > > Also, the safepoint check on each iteration should go away when using an > int loop; it'll be coalesced into once per unroll factor or may go away > entirely if the loop turns into a counted loop. > > sent from my phone > On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: > >> Cause you're using a long as induction variable; change to int and it >> should unroll. >> >> sent from my phone >> On May 27, 2015 6:46 AM, "Benedikt Wedenik" < >> benedikt.wedenik at theobroma-systems.com> wrote: >> >>> Oh! >>> >>> Thanks :) So then it's not so bad, cause it is (I guess) obligatory. >>> >>> Do you have any idea why this loop does not get unrolled? >>> >>> Benedikt. >>> >>> On 27 May 2015, at 12:36, Andrew Haley wrote: >>> >>> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >>> >> >>> >> In the hot-loop there is this ?ldr? which looked a little ?strange? >>> at the first glance. 
>>> >> I think that this load is a null-check? Is that the case? >>> > >>> > No. It's a safepoint check. HotSpot has to insert one of these >>> because >>> > it can't prove that the loop is reasonably short-lived. >>> > >>> >> I also investigated the generated code on x86 which is quite similar, >>> but instead of a load, they are >>> >> using the ?test?-instruction which performs an ?and? but only sets >>> the flags discarding the result. >>> >> Is there any similar instruction available on aarch64 or is this >>> already the closest solution? >>> > >>> > Closest to what? A load to XZR is the best solution: it does not hit >>> > the flags. x86 cannot do this. >>> > >>> > It looks like great code. >>> > >>> > Andrew. >>> > >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benedikt.wedenik at theobroma-systems.com Wed May 27 12:27:33 2015 From: benedikt.wedenik at theobroma-systems.com (Benedikt Wedenik) Date: Wed, 27 May 2015 14:27:33 +0200 Subject: Implicit null-check in hot-loop? In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: <25BD5D47-934E-4601-BB32-0F6507254C13@theobroma-systems.com> Okay :) Thank you again! Benedikt. On 27 May 2015, at 14:25, Vitaly Davidovich wrote: > I don't think there's any fundamental reason, it's just that int loops are more common and so were optimized. > > sent from my phone > > On May 27, 2015 8:09 AM, "Benedikt Wedenik" wrote: > First of all, thanks for your quick response :) > You were absolutely right - but I do not understand why the long counter was the problem. > > Is there any reason why the loop-unrolling is only available for int loops? > I mean I guess so - because else it would probably be implemented already. > But still it would be a nice opportunity for an optimisation :) > > Benedikt. 
> On 27 May 2015, at 13:44, Vitaly Davidovich wrote: > >> Also, the safepoint check on each iteration should go away when using an int loop; it'll be coalesced into once per unroll factor or may go away entirely if the loop turns into a counted loop. >> >> sent from my phone >> >> On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: >> Cause you're using a long as induction variable; change to int and it should unroll. >> >> sent from my phone >> >> On May 27, 2015 6:46 AM, "Benedikt Wedenik" wrote: >> Oh! >> >> Thanks :) So then it's not so bad, cause it is (I guess) obligatory. >> >> Do you have any idea why this loop does not get unrolled? >> >> Benedikt. >> >> On 27 May 2015, at 12:36, Andrew Haley wrote: >> >> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >> >> >> >> In the hot-loop there is this ?ldr? which looked a little ?strange? at the first glance. >> >> I think that this load is a null-check? Is that the case? >> > >> > No. It's a safepoint check. HotSpot has to insert one of these because >> > it can't prove that the loop is reasonably short-lived. >> > >> >> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are >> >> using the ?test?-instruction which performs an ?and? but only sets the flags discarding the result. >> >> Is there any similar instruction available on aarch64 or is this already the closest solution? >> > >> > Closest to what? A load to XZR is the best solution: it does not hit >> > the flags. x86 cannot do this. >> > >> > It looks like great code. >> > >> > Andrew. >> > >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vitalyd at gmail.com Wed May 27 12:34:37 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Wed, 27 May 2015 08:34:37 -0400 Subject: Implicit null-check in hot-loop? 
In-Reply-To: <25BD5D47-934E-4601-BB32-0F6507254C13@theobroma-systems.com> References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> <25BD5D47-934E-4601-BB32-0F6507254C13@theobroma-systems.com> Message-ID: Also, my hunch is that loop unrolling in current production hotspot isn't necessarily a win for long trip counts since there's virtually no vectorization and I'm not sure making the loop code bigger is a net win since OoO CPUs can take care of them well on their own. Eliminating/reducing the safepoint check is probably the biggest win. sent from my phone On May 27, 2015 8:27 AM, "Benedikt Wedenik" < benedikt.wedenik at theobroma-systems.com> wrote: > Okay :) > > Thank you again! > > Benedikt. > On 27 May 2015, at 14:25, Vitaly Davidovich wrote: > > I don't think there's any fundamental reason, it's just that int loops are > more common and so were optimized. > > sent from my phone > On May 27, 2015 8:09 AM, "Benedikt Wedenik" < > benedikt.wedenik at theobroma-systems.com> wrote: > >> First of all, thanks for your quick response :) >> You were absolutely right - but I do not understand why the long counter >> was the problem. >> >> Is there any reason why the loop-unrolling is only available for int >> loops? >> I mean I guess so - because else it would probably be implemented already. >> But still it would be a nice opportunity for an optimisation :) >> >> Benedikt. >> On 27 May 2015, at 13:44, Vitaly Davidovich wrote: >> >> Also, the safepoint check on each iteration should go away when using an >> int loop; it'll be coalesced into once per unroll factor or may go away >> entirely if the loop turns into a counted loop. >> >> sent from my phone >> On May 27, 2015 7:39 AM, "Vitaly Davidovich" wrote: >> >>> Cause you're using a long as induction variable; change to int and it >>> should unroll. 
>>>
>>> sent from my phone
>>> On May 27, 2015 6:46 AM, "Benedikt Wedenik" <
>>> benedikt.wedenik at theobroma-systems.com> wrote:
>>>
>>>> Oh!
>>>>
>>>> Thanks :) So then it's not so bad, cause it is (I guess) obligatory.
>>>>
>>>> Do you have any idea why this loop does not get unrolled?
>>>>
>>>> Benedikt.
>>>>
>>>> On 27 May 2015, at 12:36, Andrew Haley wrote:
>>>>
>>>> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote:
>>>> >>
>>>> >> In the hot-loop there is this "ldr" which looked a little "strange"
>>>> at the first glance.
>>>> >> I think that this load is a null-check? Is that the case?
>>>> >
>>>> > No. It's a safepoint check. HotSpot has to insert one of these
>>>> because
>>>> > it can't prove that the loop is reasonably short-lived.
>>>> >
>>>> >> I also investigated the generated code on x86 which is quite
>>>> similar, but instead of a load, they are
>>>> >> using the "test"-instruction which performs an "and" but only sets
>>>> the flags discarding the result.
>>>> >> Is there any similar instruction available on aarch64 or is this
>>>> already the closest solution?
>>>> >
>>>> > Closest to what? A load to XZR is the best solution: it does not hit
>>>> > the flags. x86 cannot do this.
>>>> >
>>>> > It looks like great code.
>>>> >
>>>> > Andrew.
>>>> >
>>>>
>>>>
>>
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From roland.westrelin at oracle.com  Wed May 27 14:08:52 2015
From: roland.westrelin at oracle.com (Roland Westrelin)
Date: Wed, 27 May 2015 16:08:52 +0200
Subject: [8u60] backport RFR: 8060036 and 8080156
In-Reply-To: <55649771.2050105@oracle.com>
References: <55649771.2050105@oracle.com>
Message-ID: <255230A8-017F-4C7F-9E10-37D74F973091@oracle.com>

That looks good to me. I'll push it.

Roland.
> On May 26, 2015, at 5:55 PM, Andreas Eriksson wrote:
> 
> Hi,
> 
> I'd like to backport the fix for
> 8060036: C2: CmpU nodes can end up with wrong type information
> http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/53707cf9a443
> 
> And at the same time the test that was pushed by Tobias with
> 8080156: Integer.toString(int value) sometimes throws NPE
> http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/b2e3cbd555fc
> 
> The changes applied cleanly from the jdk9 changesets.
> A jprt run with the new test passes.
> 
> Thanks,
> Andreas

From andreas.eriksson at oracle.com  Wed May 27 14:21:41 2015
From: andreas.eriksson at oracle.com (Andreas Eriksson)
Date: Wed, 27 May 2015 16:21:41 +0200
Subject: [8u60] backport RFR: 8060036 and 8080156
In-Reply-To: <255230A8-017F-4C7F-9E10-37D74F973091@oracle.com>
References: <55649771.2050105@oracle.com> <255230A8-017F-4C7F-9E10-37D74F973091@oracle.com>
Message-ID: <5565D2F5.70406@oracle.com>

Thanks!

On 2015-05-27 16:08, Roland Westrelin wrote:
> That looks good to me. I'll push it.
>
> Roland.
>
>> On May 26, 2015, at 5:55 PM, Andreas Eriksson wrote:
>>
>> Hi,
>>
>> I'd like to backport the fix for
>> 8060036: C2: CmpU nodes can end up with wrong type information
>> http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/53707cf9a443
>>
>> And at the same time the test that was pushed by Tobias with
>> 8080156: Integer.toString(int value) sometimes throws NPE
>> http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/b2e3cbd555fc
>>
>> The changes applied cleanly from the jdk9 changesets.
>> A jprt run with the new test passes.
>>
>> Thanks,
>> Andreas

From aph at redhat.com  Wed May 27 17:11:50 2015
From: aph at redhat.com (Andrew Haley)
Date: Wed, 27 May 2015 18:11:50 +0100
Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration]
In-Reply-To: <555708BE.8090100@redhat.com>
References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com>
Message-ID: <5565FAD6.5010409@redhat.com>

An update: I'm still working on this.

Following last week's revelations [1] it seems to me that a faster
implementation of (integer) D-H is even more important.

I've spent a couple of days tracking down an extremely odd feature
(bug?) in MutableBigInteger which was breaking everything, but I'm
past that now. I'm trying to produce an intrinsic implementation of
the core modular exponentiation which is as fast as any
state-of-the-art implementation while disrupting the common code as
little as possible; this is not easy. I hope to have something which
is faster on all processors, not just those for which we have
hand-coded assembly-language implementations.

I don't think that my work should be any impediment to Sandhya's patch
for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/
being committed. It'll still be useful.

Andrew.
[1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice https://weakdh.org/imperfect-forward-secrecy.pdf From vladimir.x.ivanov at oracle.com Wed May 27 18:19:09 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 27 May 2015 21:19:09 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <556006DA.8030202@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> Message-ID: <55660A9D.6090809@oracle.com> Vladimir, Dean, thanks for the feedback. I like right_n_bits trick. Here's an updated webrev: http://cr.openjdk.java.net/~vlivanov/8001622/webrev.02 Got rid of immU16 since set can encode 32-bit constants. Best regards, Vladimir Ivanov On 5/23/15 7:49 AM, Dean Long wrote: > Isn't this what you want: > > 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long > Register > 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ > 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); > 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); > 5660 > 5661 size(2*4); > 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" > 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} > 5664 ins_encode %{ > 5665 __ ldub($mem$$Address, $dst$$Register); > 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), > $dst$$Register); > 5667 %} > 5668 ins_pipe(iload_mem); > 5669 %} > > dl > > On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: >> Updated webrev: >> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 >> >> Introduced immU8/immU16 operands on sparc. >> >> As a cleanup, sorted immI/immU operand declarations. >> >> Best regards, >> Vladimir Ivanov >> >> On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >>> Thanks for looking into the fix, Roland. 
>>>
>>> On 5/22/15 5:57 PM, Roland Westrelin wrote:
>>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00
>>>>> https://bugs.openjdk.java.net/browse/JDK-8001622
>>>>>
>>>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 rules can be relaxed from
>>>>> immI8/immI16 to immI, since zero-extending move is used.
>>>>
>>>> The change on sparc doesn't look good:
>>>> __ and3($dst$$Register, $mask$$constant, $dst$$Register);
>>>>
>>>> the and instruction cannot encode arbitrary large integer constants.
>>> Good catch. I missed that it is and3.
>>>
>>>> Isn't the root of the problem that we want an unsigned 8 bit integer
>>>> and not a signed one?
>>> Yes, unsigned 8-bit integer work fine, but it can be generalized to
>>> arbitrary masks as well.
>>>
>>> I experimented with new operand types (immU8, immU16), but it was more
>>> code for no particular benefit. I can use them on sparc.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>

From dean.long at oracle.com  Wed May 27 19:04:02 2015
From: dean.long at oracle.com (Dean Long)
Date: Wed, 27 May 2015 12:04:02 -0700
Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks
In-Reply-To: <55660A9D.6090809@oracle.com>
References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com>
Message-ID: <55661522.4010703@oracle.com>

You should be able to use right_n_bits on x86 too, to allow the smaller
encoding with an 8 bit immediate instead of 32.

Also, for this on sparc

5791 __ set($mask$$constant, Rtmp);

I believe

__ set(right_n_bits($mask$$constant, 16), Rtmp)

will give the same result in fewer instructions.

dl

On 5/27/2015 11:19 AM, Vladimir Ivanov wrote:
> Vladimir, Dean, thanks for the feedback.
>
> I like right_n_bits trick.
> Here's an updated webrev: > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.02 > > Got rid of immU16 since set can encode 32-bit constants. > > Best regards, > Vladimir Ivanov > > On 5/23/15 7:49 AM, Dean Long wrote: >> Isn't this what you want: >> >> 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long >> Register >> 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ >> 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); >> 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); >> 5660 >> 5661 size(2*4); >> 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" >> 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} >> 5664 ins_encode %{ >> 5665 __ ldub($mem$$Address, $dst$$Register); >> 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), >> $dst$$Register); >> 5667 %} >> 5668 ins_pipe(iload_mem); >> 5669 %} >> >> dl >> >> On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: >>> Updated webrev: >>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 >>> >>> Introduced immU8/immU16 operands on sparc. >>> >>> As a cleanup, sorted immI/immU operand declarations. >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >>>> Thanks for looking into the fix, Roland. >>>> >>>> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>>>> >>>>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>>>> immI8/immI16 to immI, since zero-extending move is used. >>>>> >>>>> The change on sparc doesn?t look good: >>>>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>>>> >>>>> the and instruction cannot encode arbitrary large integer constants. >>>> Good catch. I missed that it is and3. >>>> >>>>> Isn?t the root of the problem that we want an unsigned 8 bit integer >>>>> and not a signed one? 
>>>> Yes, unsigned 8-bit integer work fine, but it can be generalized to >>>> arbitrary masks as well. >>>> >>>> I experimented with new operand types (immU8, immU16), but it was more >>>> code for no particular benefit. I can use them on sparc. >>>> >>>> Best regards, >>>> Vladimir Ivanov >> From sandhya.viswanathan at intel.com Thu May 28 01:27:53 2015 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Thu, 28 May 2015 01:27:53 +0000 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <5565FAD6.5010409@redhat.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> Message-ID: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> Hi Tony, Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. Best Regards, Sandhya -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley Sent: Wednesday, May 27, 2015 10:12 AM To: Christian Thalinger Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] An update: I'm still working on this. 
Following last week's revelations [1] it seems to me that a faster
implementation of (integer) D-H is even more important.

I've spent a couple of days tracking down an extremely odd feature
(bug?) in MutableBigInteger which was breaking everything, but I'm
past that now. I'm trying to produce an intrinsic implementation of
the core modular exponentiation which is as fast as any
state-of-the-art implementation while disrupting the common code as
little as possible; this is not easy. I hope to have something which
is faster on all processors, not just those for which we have
hand-coded assembly-language implementations.

I don't think that my work should be any impediment to Sandhya's patch
for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/
being committed. It'll still be useful.

Andrew.

[1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
https://weakdh.org/imperfect-forward-secrecy.pdf

From roland.westrelin at oracle.com  Thu May 28 13:55:10 2015
From: roland.westrelin at oracle.com (Roland Westrelin)
Date: Thu, 28 May 2015 15:55:10 +0200
Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression)
Message-ID: 

http://cr.openjdk.java.net/~roland/8080976/webrev.00/

A loop where the result of the reduction is used in the loop shouldn't be vectorized.

Roland.

From vitalyd at gmail.com  Thu May 28 14:15:18 2015
From: vitalyd at gmail.com (Vitaly Davidovich)
Date: Thu, 28 May 2015 10:15:18 -0400
Subject: Inlining methods with large switch statements
Message-ID: 

Hi guys,

C2 will fail to inline hot methods over FreqInlineSize that contain a large
dense switch statement. The actual machine code here uses a jump table so
native size is small and I'd expect inlining. Is this something that can be
handled better?

Thanks

sent from my phone
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From vladimir.kozlov at oracle.com Thu May 28 16:32:45 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 28 May 2015 09:32:45 -0700 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: References: Message-ID: <5567432D.3040108@oracle.com> Looks good. Thank you for fixing this. Vladimir On 5/28/15 6:55 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080976/webrev.00/ > > A loop where the result of the reduction is used in the loop shouldn?t be vectorized. > > Roland. > From michael.c.berg at intel.com Thu May 28 16:43:30 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 28 May 2015 16:43:30 +0000 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: <5567432D.3040108@oracle.com> References: <5567432D.3040108@oracle.com> Message-ID: I will also review the change, thx, it will be just a bit. -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Kozlov Sent: Thursday, May 28, 2015 9:33 AM To: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) Looks good. Thank you for fixing this. Vladimir On 5/28/15 6:55 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080976/webrev.00/ > > A loop where the result of the reduction is used in the loop shouldn?t be vectorized. > > Roland. 
> From michael.c.berg at intel.com Thu May 28 17:18:34 2015 From: michael.c.berg at intel.com (Berg, Michael C) Date: Thu, 28 May 2015 17:18:34 +0000 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) References: <5567432D.3040108@oracle.com> Message-ID: Roland I would add a condition so that we save a step, and a comment or two: bool ok = false; for (unsigned j = 1; j < def_node->req(); j++) { Node* in = def_node->in(j); if (in == phi) { ok = true; break; } } // do nothing if we did not match the initial criteria if (ok == false) { continue; } // The result of the reduction must not be used in the loop for (DUIterator_Fast imax, i = def_node->fast_outs(imax); i < imax && ok; i++) { Node* u = def_node->fast_out(i); if (has_ctrl(u) && !loop->is_member(get_loop(get_ctrl(u)))) { continue; } if (u == phi) { continue; } ok = false; } // iff the uses conform if (ok) { def_node->add_flag(Node::Flag_is_reduction); } I checked the resultant code for the bug, as is the case for the change in the webrev, the reduction is no longer emitted. I also checked the relevant micros and verified the desired behavior is still present. Regards, -Michael -----Original Message----- From: Berg, Michael C Sent: Thursday, May 28, 2015 9:43 AM To: 'Vladimir Kozlov'; hotspot-compiler-dev at openjdk.java.net Subject: RE: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) I will also review the change, thx, it will be just a bit. -----Original Message----- From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Kozlov Sent: Thursday, May 28, 2015 9:33 AM To: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) Looks good. Thank you for fixing this. 
Vladimir On 5/28/15 6:55 AM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080976/webrev.00/ > > A loop where the result of the reduction is used in the loop shouldn?t be vectorized. > > Roland. > From vladimir.x.ivanov at oracle.com Thu May 28 17:40:13 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 28 May 2015 20:40:13 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <55661522.4010703@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com> <55661522.4010703@oracle.com> Message-ID: <556752FD.6060605@oracle.com> Dean, what do you think about the following: http://cr.openjdk.java.net/~vlivanov/8001622/webrev.03 Best regards, Vladimir Ivanov On 5/27/15 10:04 PM, Dean Long wrote: > You should be able to use right_n_bits on x86 too, to allow the smaller > encoding with an 8 bit immediate instead of 32. > > Also, for this on sparc > > 5791 __ set($mask$$constant, Rtmp); > > I believe > > __ set(right_n_bits($mask$$constant, 16), Rtmp) > > will give the same result in fewer instructions. > > dl > > On 5/27/2015 11:19 AM, Vladimir Ivanov wrote: >> Vladimir, Dean, thanks for the feedback. >> >> I like right_n_bits trick. >> Here's an updated webrev: >> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.02 >> >> Got rid of immU16 since set can encode 32-bit constants. 
>> >> Best regards, >> Vladimir Ivanov >> >> On 5/23/15 7:49 AM, Dean Long wrote: >>> Isn't this what you want: >>> >>> 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long >>> Register >>> 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ >>> 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); >>> 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); >>> 5660 >>> 5661 size(2*4); >>> 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" >>> 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} >>> 5664 ins_encode %{ >>> 5665 __ ldub($mem$$Address, $dst$$Register); >>> 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), >>> $dst$$Register); >>> 5667 %} >>> 5668 ins_pipe(iload_mem); >>> 5669 %} >>> >>> dl >>> >>> On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: >>>> Updated webrev: >>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 >>>> >>>> Introduced immU8/immU16 operands on sparc. >>>> >>>> As a cleanup, sorted immI/immU operand declarations. >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >>>>> Thanks for looking into the fix, Roland. >>>>> >>>>> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>>>>> >>>>>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>>>>> immI8/immI16 to immI, since zero-extending move is used. >>>>>> >>>>>> The change on sparc doesn?t look good: >>>>>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>>>>> >>>>>> the and instruction cannot encode arbitrary large integer constants. >>>>> Good catch. I missed that it is and3. >>>>> >>>>>> Isn?t the root of the problem that we want an unsigned 8 bit integer >>>>>> and not a signed one? >>>>> Yes, unsigned 8-bit integer work fine, but it can be generalized to >>>>> arbitrary masks as well. 
>>>>> >>>>> I experimented with new operand types (immU8, immU16), but it was more >>>>> code for no particular benefit. I can use them on sparc. >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>> > From dean.long at oracle.com Thu May 28 17:43:04 2015 From: dean.long at oracle.com (Dean Long) Date: Thu, 28 May 2015 10:43:04 -0700 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <556752FD.6060605@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com> <55661522.4010703@oracle.com> <556752FD.6060605@oracle.com> Message-ID: <556753A8.6030909@oracle.com> Looks good. dl On 5/28/2015 10:40 AM, Vladimir Ivanov wrote: > Dean, what do you think about the following: > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.03 > > Best regards, > Vladimir Ivanov > > On 5/27/15 10:04 PM, Dean Long wrote: >> You should be able to use right_n_bits on x86 too, to allow the smaller >> encoding with an 8 bit immediate instead of 32. >> >> Also, for this on sparc >> >> 5791 __ set($mask$$constant, Rtmp); >> >> I believe >> >> __ set(right_n_bits($mask$$constant, 16), Rtmp) >> >> will give the same result in fewer instructions. >> >> dl >> >> On 5/27/2015 11:19 AM, Vladimir Ivanov wrote: >>> Vladimir, Dean, thanks for the feedback. >>> >>> I like right_n_bits trick. >>> Here's an updated webrev: >>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.02 >>> >>> Got rid of immU16 since set can encode 32-bit constants. 
>>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 5/23/15 7:49 AM, Dean Long wrote: >>>> Isn't this what you want: >>>> >>>> 5656 // Load Unsigned Byte (8 bit UNsigned) with 32-bit mask into Long >>>> Register >>>> 5657 instruct loadUB2L_immI(iRegL dst, memory mem, immI mask) %{ >>>> 5658 match(Set dst (ConvI2L (AndI (LoadUB mem) mask))); >>>> 5659 ins_cost(MEMORY_REF_COST + DEFAULT_COST); >>>> 5660 >>>> 5661 size(2*4); >>>> 5662 format %{ "LDUB $mem,$dst\t# ubyte & 32-bit mask -> long\n\t" >>>> 5663 "AND $dst,right_n_bits($mask, 8),$dst" %} >>>> 5664 ins_encode %{ >>>> 5665 __ ldub($mem$$Address, $dst$$Register); >>>> 5666 __ and3($dst$$Register, right_n_bits($mask$$constant, 8), >>>> $dst$$Register); >>>> 5667 %} >>>> 5668 ins_pipe(iload_mem); >>>> 5669 %} >>>> >>>> dl >>>> >>>> On 5/22/2015 8:38 AM, Vladimir Ivanov wrote: >>>>> Updated webrev: >>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.01 >>>>> >>>>> Introduced immU8/immU16 operands on sparc. >>>>> >>>>> As a cleanup, sorted immI/immU operand declarations. >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> On 5/22/15 6:14 PM, Vladimir Ivanov wrote: >>>>>> Thanks for looking into the fix, Roland. >>>>>> >>>>>> On 5/22/15 5:57 PM, Roland Westrelin wrote: >>>>>>>> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.00 >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8001622 >>>>>>>> >>>>>>>> Mask in loadUB2L_immI8 & loadUS2L_immI16 can be relaxed from >>>>>>>> immI8/immI16 to immI, since zero-extending move is used. >>>>>>> >>>>>>> The change on sparc doesn't look good: >>>>>>> __ and3($dst$$Register, $mask$$constant, $dst$$Register); >>>>>>> >>>>>>> the and instruction cannot encode arbitrary large integer >>>>>>> constants. >>>>>> Good catch. I missed that it is and3. >>>>>> >>>>>>> Isn't the root of the problem that we want an unsigned 8 bit >>>>>>> integer >>>>>>> and not a signed one?
>>>>>> Yes, unsigned 8-bit integers work fine, but it can be generalized to >>>>>> arbitrary masks as well. >>>>>> >>>>>> I experimented with new operand types (immU8, immU16), but it was >>>>>> more >>>>>> code for no particular benefit. I can use them on sparc. >>>>>> >>>>>> Best regards, >>>>>> Vladimir Ivanov >>>> >> From dawid.weiss at gmail.com Thu May 28 20:41:26 2015 From: dawid.weiss at gmail.com (Dawid Weiss) Date: Thu, 28 May 2015 22:41:26 +0200 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: References: Message-ID: Your quick follow-up to this issue is much appreciated! (I filed the bug via Rory). Dawid On Thu, May 28, 2015 at 3:55 PM, Roland Westrelin wrote: > http://cr.openjdk.java.net/~roland/8080976/webrev.00/ > > A loop where the result of the reduction is used in the loop shouldn't be vectorized. > > Roland. From roland.westrelin at oracle.com Fri May 29 07:53:31 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 29 May 2015 09:53:31 +0200 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: <556752FD.6060605@oracle.com> References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com> <55661522.4010703@oracle.com> <556752FD.6060605@oracle.com> Message-ID: > http://cr.openjdk.java.net/~vlivanov/8001622/webrev.03 That looks good to me. Roland.
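The right_n_bits trick reviewed in the thread above rests on a simple identity: after a zero-extending byte load, only the low 8 bits of the loaded value can be set, so AND-ing with the full 32-bit mask and AND-ing with just its low 8 bits produce the same result. A minimal plain-Java sketch of that identity (not OpenJDK code; rightNBits here is only a stand-in for HotSpot's right_n_bits macro):

```java
import java.util.Random;

public class MaskTruncation {
    // Stand-in for HotSpot's right_n_bits(x, n): keep only the low n bits.
    static int rightNBits(int x, int n) {
        return x & ((1 << n) - 1);
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int raw = r.nextInt();
            int mask = r.nextInt();
            int loaded = raw & 0xFF;                  // models the zero-extending byte load
            int full = loaded & mask;                 // AND with the full 32-bit mask
            int trunc = loaded & rightNBits(mask, 8); // AND with only the low 8 mask bits
            if (full != trunc) {
                throw new AssertionError("mismatch for mask " + mask);
            }
        }
        System.out.println("ok");
    }
}
```

The same reasoning is why truncating the mask lets sparc use and3 (which cannot encode arbitrary 32-bit immediates) and lets x86 pick the shorter 8-bit immediate encoding.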
From vladimir.x.ivanov at oracle.com Fri May 29 10:50:29 2015 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 29 May 2015 13:50:29 +0300 Subject: [9] RFR (XS): loadUB2L_immI8 & loadUS2L_immI16 rules don't match some 8-bit/16-bit masks In-Reply-To: References: <555F3EFB.60606@oracle.com> <49A68520-A00C-4F14-A216-2F6E62C6EBCD@oracle.com> <555F47E4.8020507@oracle.com> <555F4D7D.7080504@oracle.com> <556006DA.8030202@oracle.com> <55660A9D.6090809@oracle.com> <55661522.4010703@oracle.com> <556752FD.6060605@oracle.com> Message-ID: <55684475.90905@oracle.com> Thanks for review, Dean, Roland & Vladimir. Best regards, Vladimir Ivanov On 5/29/15 10:53 AM, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~vlivanov/8001622/webrev.03 > > That looks good to me. > > Roland. > From anthony.scarpino at oracle.com Thu May 28 23:39:59 2015 From: anthony.scarpino at oracle.com (Anthony Scarpino) Date: Thu, 28 May 2015 16:39:59 -0700 Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] In-Reply-To: <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> References: <02FCFB8477C4EF43A2AD8E0C60F3DA2B63321E33@FMSMSX112.amr.corp.intel.com> <5502E67C.8080208@oracle.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63322AB0@FMSMSX112.amr.corp.intel.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B63323E86@FMSMSX112.amr.corp.intel.com> <550C004D.1060501@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B633245C8@FMSMSX112.amr.corp.intel.com> <55101C52.1090506@redhat.com> <5548193E.7060803@oracle.com> <55488803.4020802@redhat.com> <554CDD48.8080500@redhat.com> <1CCF381D-1C55-4392-A905-AD8CD457980B@oracle.com> <555708BE.8090100@redhat.com> <5565FAD6.5010409@redhat.com> <02FCFB8477C4EF43A2AD8E0C60F3DA2B6337BA3A@FMSMSX112.amr.corp.intel.com> Message-ID: <5567A74F.4040105@oracle.com> Personally I think it better to not have implSquareToLenChecks() and implMulAddCheck() as separate methods and to have the range check in squareToLen and mulAdd.
Given these changes are about performance, it seems unnecessary to add an extra call to a method. While we are changing BigInteger, should a range check for multiplyToLen be added? Or is there a different bug for that? Tony On 05/27/2015 06:27 PM, Viswanathan, Sandhya wrote: > Hi Tony, > > Please let us know if you are ok with the changes in BigInteger.java (range checks) in patch from Intel: > > http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ > > Per Andrew's email below we could go ahead with this patch and it shouldn't affect his work. > > Best Regards, > Sandhya > > > -----Original Message----- > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Andrew Haley > Sent: Wednesday, May 27, 2015 10:12 AM > To: Christian Thalinger > Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net > Subject: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration] > > An update: > > I'm still working on this. Following last week's revelations [1] it > seems to me that a faster implementation of (integer) D-H is even more > important. > > I've spent a couple of days tracking down an extremely odd feature > (bug?) in MutableBigInteger which was breaking everything, but I'm > past that now. I'm trying to produce an intrinsic implementation of > the core modular exponentiation which is as fast as any state-of-the- > art implementation while disrupting the common code as little as > possible; this is not easy. > > I hope to have something which is faster on all processors, not just > those for which we have hand-coded assembly-language implementations. > > I don't think that my work should be any impediment to Sandhya's patch > for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ > being committed. It'll still be useful. > > Andrew.
> > > [1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice > https://weakdh.org/imperfect-forward-secrecy.pdf > From roland.westrelin at oracle.com Fri May 29 14:06:21 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 29 May 2015 16:06:21 +0200 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: References: Message-ID: <9D656012-9AD5-4723-9DE6-5810CAA916D0@oracle.com> > Your quick follow-up to this issue is much appreciated! (I filed the > bug via Rory). Having a reliable way to reproduce it helped a lot: thanks for providing a reproducer. Roland. From roland.westrelin at oracle.com Fri May 29 14:09:58 2015 From: roland.westrelin at oracle.com (Roland Westrelin) Date: Fri, 29 May 2015 16:09:58 +0200 Subject: RFR(S): 8080976: Unexpected AIOOB thrown from 1.9.0-ea-b64 on (regression) In-Reply-To: References: <5567432D.3040108@oracle.com> Message-ID: <03323516-B787-4D8B-AF1F-C3D54C3BE1F1@oracle.com> Michael, > Roland I would add a condition so that we save a step, and a comment or two: Thanks for the suggestions. I'll go with your version. > I checked the resultant code for the bug, as is the case for the change in the webrev, the reduction is no longer emitted. I also checked the relevant micros and verified the desired behavior is still present. Thanks for taking the time to verify the fix is correct. Roland. From Ulf.Zibis at CoSoCo.de Fri May 29 16:48:28 2015 From: Ulf.Zibis at CoSoCo.de (Ulf Zibis) Date: Fri, 29 May 2015 18:48:28 +0200 Subject: Why isn't Object.notify() a synchronized method? In-Reply-To: <5567420A.9020406@redhat.com> References: <55673D97.4010505@CoSoCo.de> <5567420A.9020406@redhat.com> Message-ID: <5568985C.6050507@CoSoCo.de> Thanks for your hint David. That's the only reason I could imagine too. Can somebody tell something about the cost for recursive lock acquisition in comparison to the whole call, couldn't it be eliminated by Hotspot?
As I recently fell into the trap of forgetting the synchronized block around a single notifyAll(), I believe the current situation is just error-prone. Any comments about the Javadoc issue? -Ulf Am 28.05.2015 um 18:27 schrieb David M. Lloyd: > Since most of the time you have to hold the lock anyway for other reasons, I think this would > generally be an unwelcome change since I expect the cost of recursive lock acquisition is nonzero. > > On 05/28/2015 11:08 AM, Ulf Zibis wrote: >> Hi all, >> >> in the Javadoc of notify(), notifyAll() and wait(...) I read that these >> methods should only be used with synchronisation on its instance. >> So I'm wondering why they don't have the synchronized modifier out of >> the box in Object class. >> >> Also I think, the following note should be moved from wait(long,int) to >> wait(long): >> /The current thread must own this object's monitor. The thread releases >> ownership of this monitor and waits until either of the following two >> conditions has occurred:// >> / >> >> * /Another thread notifies threads waiting on this object's monitor to >> wake up either through a >> call to the notify method or the notifyAll method./ >> * /The timeout period, specified by timeout milliseconds plus nanos >> nanoseconds arguments, has >> elapsed. / >> >> >> >> Cheers, >> >> Ulf >> > From vladimir.kozlov at oracle.com Fri May 29 17:38:20 2015 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 29 May 2015 10:38:20 -0700 Subject: Implicit null-check in hot-loop? In-Reply-To: References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> Message-ID: <5568A40C.9000602@oracle.com> There is reason. We do long arithmetic to prove that int index will not overflow in canonical (counted) loops but we can't do that cheaply for long index. And 99% of loops use int index - so we don't want to spend code on edge case.
Regards, Vladimir On 5/27/15 5:25 AM, Vitaly Davidovich wrote: > I don't think there's any fundamental reason, it's just that int loops are more common and so were optimized. > > sent from my phone > > On May 27, 2015 8:09 AM, "Benedikt Wedenik" > wrote: > > First of all, thanks for your quick response :) > You were absolutely right - but I do not understand why the long counter was the problem. > > Is there any reason why the loop-unrolling is only available for int loops? > I mean I guess so - because else it would probably be implemented already. > But still it would be a nice opportunity for an optimisation :) > > Benedikt. > On 27 May 2015, at 13:44, Vitaly Davidovich > wrote: > >> Also, the safepoint check on each iteration should go away when using an int loop; it'll be coalesced into once >> per unroll factor or may go away entirely if the loop turns into a counted loop. >> >> sent from my phone >> >> On May 27, 2015 7:39 AM, "Vitaly Davidovich" > wrote: >> >> Cause you're using a long as induction variable; change to int and it should unroll. >> >> sent from my phone >> >> On May 27, 2015 6:46 AM, "Benedikt Wedenik" > > wrote: >> >> Oh! >> >> Thanks :) So then it's not so bad, cause it is (I guess) obligatory. >> >> Do you have any idea why this loop does not get unrolled? >> >> Benedikt. >> >> On 27 May 2015, at 12:36, Andrew Haley > wrote: >> >> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >> >> >> >> In the hot-loop there is this 'ldr' which looked a little 'strange' at the first glance. >> >> I think that this load is a null-check? Is that the case? >> > >> > No. It's a safepoint check. HotSpot has to insert one of these because >> > it can't prove that the loop is reasonably short-lived. >> > >> >> I also investigated the generated code on x86 which is quite similar, but instead of a load, they are >> >> using the 'test'-instruction which performs an 'and' but only sets the flags discarding the result.
>> >> Is there any similar instruction available on aarch64 or is this already the closest solution? >> > >> > Closest to what? A load to XZR is the best solution: it does not hit >> > the flags. x86 cannot do this. >> > >> > It looks like great code. >> > >> > Andrew. >> > >> > From vitalyd at gmail.com Fri May 29 17:48:47 2015 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Fri, 29 May 2015 13:48:47 -0400 Subject: Implicit null-check in hot-loop? In-Reply-To: <5568A40C.9000602@oracle.com> References: <55659E32.9070007@redhat.com> <9353696E-19CD-4F65-8EE3-A63FBE4A4CDD@theobroma-systems.com> <5568A40C.9000602@oracle.com> Message-ID: Hi Vladimir, Given java has no unsigned long, wouldn't this be somewhat straightforward to test? I agree that it's unlikely to be worth it given most loops are int. Thanks On Fri, May 29, 2015 at 1:38 PM, Vladimir Kozlov wrote: > There is reason. We do long arithmetic to prove that int index will not > overflow in canonical (counted) loops but we can't do that cheaply for long > index. And 99% of loops use int index - so we don't want to spend code on > edge case. > > Regards, > Vladimir > > > On 5/27/15 5:25 AM, Vitaly Davidovich wrote: > >> I don't think there's any fundamental reason, it's just that int loops >> are more common and so were optimized. >> >> sent from my phone >> >> On May 27, 2015 8:09 AM, "Benedikt Wedenik" < >> benedikt.wedenik at theobroma-systems.com >> > wrote: >> >> First of all, thanks for your quick response :) >> You were absolutely right - but I do not understand why the long >> counter was the problem. >> >> Is there any reason why the loop-unrolling is only available for int >> loops? >> I mean I guess so - because else it would probably be implemented >> already. >> But still it would be a nice opportunity for an optimisation :) >> >> Benedikt. 
>> On 27 May 2015, at 13:44, Vitaly Davidovich > > wrote: >> >> Also, the safepoint check on each iteration should go away when >>> using an int loop; it'll be coalesced into once >>> per unroll factor or may go away entirely if the loop turns into a >>> counted loop. >>> >>> sent from my phone >>> >>> On May 27, 2015 7:39 AM, "Vitaly Davidovich" >> > wrote: >>> >>> Cause you're using a long as induction variable; change to int >>> and it should unroll. >>> >>> sent from my phone >>> >>> On May 27, 2015 6:46 AM, "Benedikt Wedenik" < >>> benedikt.wedenik at theobroma-systems.com >>> > wrote: >>> >>> Oh! >>> >>> Thanks :) So then it's not so bad, cause it is (I guess) >>> obligatory. >>> >>> Do you have any idea why this loop does not get unrolled? >>> >>> Benedikt. >>> >>> On 27 May 2015, at 12:36, Andrew Haley >> > wrote: >>> >>> > On 05/27/2015 11:27 AM, Benedikt Wedenik wrote: >>> >> >>> >> In the hot-loop there is this 'ldr' which looked a little >>> 'strange' at the first glance. >>> >> I think that this load is a null-check? Is that the case? >>> > >>> > No. It's a safepoint check. HotSpot has to insert one of >>> these because >>> > it can't prove that the loop is reasonably short-lived. >>> > >>> >> I also investigated the generated code on x86 which is >>> quite similar, but instead of a load, they are >>> >> using the 'test'-instruction which performs an 'and' but >>> only sets the flags discarding the result. >>> >> Is there any similar instruction available on aarch64 or >>> is this already the closest solution? >>> > >>> > Closest to what? A load to XZR is the best solution: it >>> does not hit >>> > the flags. x86 cannot do this. >>> > >>> > It looks like great code. >>> > >>> > Andrew. >>> > >>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed...
URL: From david.holmes at oracle.com Sun May 31 04:31:25 2015 From: david.holmes at oracle.com (David Holmes) Date: Sun, 31 May 2015 14:31:25 +1000 Subject: Why isn't Object.notify() a synchronized method? In-Reply-To: <5568985C.6050507@CoSoCo.de> References: <55673D97.4010505@CoSoCo.de> <5567420A.9020406@redhat.com> <5568985C.6050507@CoSoCo.de> Message-ID: <556A8E9D.4040407@oracle.com> On 30/05/2015 2:48 AM, Ulf Zibis wrote: > Thanks for your hint David. That's the only reason I could imagine too. > Can somebody tell something about the cost for recursive lock > acquisition in comparison to the whole call, couldn't it be eliminated > by Hotspot? There are a number of fast paths related to recursive locking. Not sure if inlining allows lock elision; that's something for the compiler folk. > As I recently fell into the trap of forgetting the synchronized block > around a single notifyAll(), I believe the current situation is just > error-prone. How is it error-prone? You forgot to acquire the lock and you got an IllegalMonitorStateException when you did notifyAll. That's pointing out your error. > Any comments about the Javadoc issue? I did comment - I don't see any issue. David H. > -Ulf > > > Am 28.05.2015 um 18:27 schrieb David M. Lloyd: >> Since most of the time you have to hold the lock anyway for other >> reasons, I think this would generally be an unwelcome change since I >> expect the cost of recursive lock acquisition is nonzero. >> >> On 05/28/2015 11:08 AM, Ulf Zibis wrote: >>> Hi all, >>> >>> in the Javadoc of notify(), notifyAll() and wait(...) I read that these >>> methods should only be used with synchronisation on its instance. >>> So I'm wondering why they don't have the synchronized modifier out of >>> the box in Object class. >>> >>> Also I think, the following note should be moved from wait(long,int) to >>> wait(long): >>> /The current thread must own this object's monitor.
The thread releases >>> ownership of this monitor and waits until either of the following two >>> conditions has occurred:// >>> / >>> >>> * /Another thread notifies threads waiting on this object's monitor to >>> wake up either through a >>> call to the notify method or the notifyAll method./ >>> * /The timeout period, specified by timeout milliseconds plus nanos >>> nanoseconds arguments, has >>> elapsed. / >>> >>> >>> >>> Cheers, >>> >>> Ulf >>> >> >
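The rule the thread above keeps returning to can be shown in a few lines: wait(), notify() and notifyAll() may only be called while holding the object's monitor, and forgetting the synchronized block fails fast with IllegalMonitorStateException, exactly as David Holmes describes. A minimal sketch (hypothetical example, not code from the thread):

```java
public class NotifyDemo {
    private static final Object lock = new Object();

    public static void main(String[] args) {
        try {
            lock.notifyAll();                 // monitor not held: the JVM rejects this
        } catch (IllegalMonitorStateException e) {
            System.out.println("without lock: IllegalMonitorStateException");
        }
        synchronized (lock) {                 // the synchronized block that was forgotten
            lock.notifyAll();                 // monitor held: succeeds
            System.out.println("with lock: ok");
        }
    }
}
```

Since callers almost always hold the monitor anyway to update the condition being signalled, making notifyAll() itself synchronized would mostly add a recursive lock acquisition, which is the cost David M. Lloyd objects to above.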