[performance] Inline heuristics, scalar replacement and cold prolog of hot loop.
Vladimir Kozlov
vladimir.kozlov at oracle.com
Sat Feb 2 00:52:22 UTC 2019
On 2/1/19 4:16 PM, Sergey Kuksenko wrote:
> is_init_with_ea() does'n help with methods collection::iterator() which prevails over constructors in iteration patterns.
The reason I mentioned is_init_with_ea() is to use it for collection::iterator() which seems simply calls constructors.
We need to add new ciMethod::is_iterator() method similar to is_object_initializer() and check it in is_init_with_ea().
>
> The amount of loop iterations is equal to size of collection, we pass the whole collection. If collection size is small,
> amount of back branches is comparable with amount of prolog invocations. In that case prolog methods are inlined. If
> collection is large, prolog counter is relatively small, but back branch counter is quite large. In that case
> MinInliningThreshold prevents methods iterator() to be inlined.
Okay, got it. With small size test method is compiled much later when it and called method are relatively hot.
Thanks,
Vladimir
>
> org.openjdk.Hash::sumIter (48 bytes)
> @ 6 java.util.HashMap::keySet (25 bytes) executed < MinInliningThreshold times
> \-> TypeProfile (15/15 counts) = java/util/HashMap
> our target==> @ 11 java.util.HashMap$KeySet::iterator (12 bytes) executed < MinInliningThreshold times
> \-> TypeProfile (15/15 counts) = java/util/HashMap$KeySet
> @ 18 java.util.HashMap$HashIterator::hasNext (13 bytes) inline (hot)
> \-> TypeProfile (148955/148955 counts) = java/util/HashMap$KeyIterator
> @ 27 java.util.HashMap$KeyIterator::next (8 bytes) inline (hot)
> @ 1 java.util.HashMap$HashIterator::nextNode (100 bytes) inline (hot)
> @ 35 java.lang.Integer::intValue (5 bytes) accessor
> @ 18 java.util.HashMap$HashIterator::hasNext (13 bytes) inline (hot)
>
>
> On 2/1/19 2:38 PM, Vladimir Kozlov wrote:
>> We have special method is_init_with_ea() [1] which allow to inline constructors and boxing methods regardless
>> MinInliningThreshold. So constructor for Iterator should be inlined.
>>
>> But may be I don't understand your question. Which Iterator you are talking about when you said "iterator is
>> scalarized". And why it works only for small sizes?
>>
>> There is an other C2's EA flag EliminateAllocationArraySizeLimit which limits which array can be scalarized. But it
>> seems it is not case here.
>>
>> Vladimir K
>>
>> [1] http://hg.openjdk.java.net/jdk/jdk/file/9c84d2865c2d/src/hotspot/share/opto/bytecodeInfo.cpp#l76
>>
>> On 2/1/19 11:26 AM, Sergey Kuksenko wrote:
>>> Hi, All.
>>>
>>> I'd like to raise some discussion - if it's possible to make C2 inline heuristics better.
>>>
>>> Code (just measure cost of iteration):
>>>
>>> Map<Integer, Integer>map;
>>> ...
>>> for (Iterator<Integer> iterator =map.keySet().iterator(); iterator.hasNext(); ) {
>>> int v = iterator.next();
>>> s += v;
>>> }
>>>
>>> For examples, I found that the iterator is scalarized if collection size is quite small (10 or less) and it isn't
>>> scalarized is collection size is large enough (10000, 1000000 elements). Direct cost of iterator allocation and GC
>>> pressure is negligible small is our collection is large enough. But, if iterator's state contains object references
>>> the cost of GC write barriers may be large.
>>>
>>> The reason why the iterator is scalarized for small collections and is not scalarized for large collections is
>>> MinInliningThreshold option. So we have a hot loop, but relatively cold prolog for the loop and MinInliningThreshold
>>> prevents inlining methods called from the prolog. At the same moment inlining such prolog methods may have
>>> significant impact to loop performance, particularly fro such widely used pattern as iteration.
>>>
>>> You can find simple benchmarks for that here: http://cr.openjdk.java.net/~skuksenko/hotspot/inline/
>>>
>>> Here my results.
>>>
>>> default:
>>>
>>> Benchmark (size) Mode Cnt Score Error Units
>>> Hash.sumIter 10000 thrpt 15 2929.112 ± 42.927 ops/s
>>> Hash.sumIterHidden 10000 thrpt 15 3379.195 ± 30.589 ops/s
>>> Tree.sumIter 10000 thrpt 15 3618.466 ± 126.785 ops/s
>>> Tree.sumIterHidden 10000 thrpt 15 3576.429 ± 129.217 ops/s
>>>
>>> -XX:MinInliningThreshold=0
>>>
>>> Benchmark (size) Mode Cnt Score Error Units
>>> Hash.sumIter 10000 thrpt 15 5392.184 ± 115.039 ops/s
>>> Hash.sumIterHidden 10000 thrpt 15 3357.754 ± 60.636 ops/s
>>> Tree.sumIter 10000 thrpt 15 4128.407 ± 56.748 ops/s
>>> Tree.sumIterHidden 10000 thrpt 15 3601.293 ± 123.135 ops/s
>>>
>>> So G1 write barriers costs 60% performance of HashMap iterator and 14% performance of TreeMap iterator. Other GC also
>>> have cost, but G1 has highest.
>>>
>>>
>>>
More information about the hotspot-compiler-dev
mailing list