[performance] Inline heuristics, scalar replacement and cold prolog of hot loop.

Sat Feb 2 00:52:22 UTC 2019

On 2/1/19 4:16 PM, Sergey Kuksenko wrote:
> is_init_with_ea() does'n help with methods collection::iterator() which prevails over constructors in iteration patterns.

The reason I mentioned is_init_with_ea() is to use it for collection::iterator() which seems simply calls constructors.
We need to add new ciMethod::is_iterator() method similar to is_object_initializer() and check it in is_init_with_ea().

> 
> The amount of loop iterations is equal to size of collection, we pass the whole collection. If collection size is small, 
> amount of back branches is comparable with amount of prolog invocations. In that case prolog methods are inlined. If 
> collection is large, prolog counter is relatively small, but back branch counter is quite large. In that case 
> MinInliningThreshold prevents methods iterator() to be inlined.

Okay, got it. With small size test method is compiled much later when it and called method are relatively hot.

Thanks,
Vladimir

> 
>   org.openjdk.Hash::sumIter (48 bytes)
>                                @ 6   java.util.HashMap::keySet (25 bytes)   executed < MinInliningThreshold times
>                                 \-> TypeProfile (15/15 counts) = java/util/HashMap
> our target==>      @ 11   java.util.HashMap$KeySet::iterator (12 bytes)   executed < MinInliningThreshold times
>                                 \-> TypeProfile (15/15 counts) = java/util/HashMap$KeySet
>                                @ 18 java.util.HashMap$HashIterator::hasNext (13 bytes)   inline (hot)
>                                 \-> TypeProfile (148955/148955 counts) = java/util/HashMap$KeyIterator
>                                @ 27 java.util.HashMap$KeyIterator::next (8 bytes)   inline (hot)
>                                  @ 1 java.util.HashMap$HashIterator::nextNode (100 bytes)   inline (hot)
>                                @ 35   java.lang.Integer::intValue (5 bytes)   accessor
>                                @ 18 java.util.HashMap$HashIterator::hasNext (13 bytes)   inline (hot)
> 
> 
> On 2/1/19 2:38 PM, Vladimir Kozlov wrote:
>> We have special method is_init_with_ea() [1] which allow to inline constructors and boxing methods regardless 
>> MinInliningThreshold. So constructor for Iterator should be inlined.
>>
>> But may be I don't understand your question. Which Iterator you are talking about when you said "iterator is 
>> scalarized". And why it works only for small sizes?
>>
>> There is an other C2's EA flag EliminateAllocationArraySizeLimit which limits which array can be scalarized. But it 
>> seems it is not case here.
>>
>> Vladimir K
>>
>> [1] http://hg.openjdk.java.net/jdk/jdk/file/9c84d2865c2d/src/hotspot/share/opto/bytecodeInfo.cpp#l76
>>
>> On 2/1/19 11:26 AM, Sergey Kuksenko wrote:
>>> Hi, All.
>>>
>>> I'd like to raise some discussion - if it's possible to make C2 inline heuristics better.
>>>
>>> Code (just measure cost of iteration):
>>>
>>> Map<Integer, Integer>map;
>>> ...
>>> for (Iterator<Integer> iterator =map.keySet().iterator(); iterator.hasNext(); ) {
>>>      int v = iterator.next();
>>>      s += v;
>>> }
>>>
>>> For examples, I found that the iterator is scalarized if collection size is quite small (10 or less) and it isn't 
>>> scalarized is collection size is large enough (10000, 1000000 elements). Direct cost of iterator allocation and GC 
>>> pressure is negligible small is our collection is large enough. But, if iterator's state contains object references 
>>> the cost of GC write barriers may be large.
>>>
>>> The reason why the iterator is scalarized for small collections and is not scalarized for large collections is 
>>> MinInliningThreshold option. So we have a hot loop, but relatively cold prolog for the loop and MinInliningThreshold 
>>> prevents inlining methods called from the prolog. At the same moment inlining such prolog methods may have 
>>> significant impact to loop performance, particularly fro such widely used pattern as iteration.
>>>
>>> You can find simple benchmarks for that here: http://cr.openjdk.java.net/~skuksenko/hotspot/inline/
>>>
>>> Here my results.
>>>
>>> default:
>>>
>>> Benchmark           (size)   Mode  Cnt     Score     Error Units
>>> Hash.sumIter         10000  thrpt   15  2929.112 ±  42.927 ops/s
>>> Hash.sumIterHidden   10000  thrpt   15  3379.195 ±  30.589 ops/s
>>> Tree.sumIter         10000  thrpt   15  3618.466 ± 126.785 ops/s
>>> Tree.sumIterHidden   10000  thrpt   15  3576.429 ± 129.217 ops/s
>>>
>>> -XX:MinInliningThreshold=0
>>>
>>> Benchmark           (size)   Mode  Cnt     Score     Error Units
>>> Hash.sumIter         10000  thrpt   15  5392.184 ± 115.039 ops/s
>>> Hash.sumIterHidden   10000  thrpt   15  3357.754 ±  60.636 ops/s
>>> Tree.sumIter         10000  thrpt   15  4128.407 ±  56.748 ops/s
>>> Tree.sumIterHidden   10000  thrpt   15  3601.293 ± 123.135 ops/s
>>>
>>> So G1 write barriers costs 60% performance of HashMap iterator and 14% performance of TreeMap iterator. Other GC also 
>>> have cost, but G1 has highest.
>>>
>>>
>>>