[performance] Inline heuristics, scalar replacement and cold prolog of hot loop.
Sergey Kuksenko
sergey.kuksenko at oracle.com
Sat Feb 2 00:16:17 UTC 2019
is_init_with_ea() does'n help with methods collection::iterator() which
prevails over constructors in iteration patterns.
The amount of loop iterations is equal to size of collection, we pass
the whole collection. If collection size is small, amount of back
branches is comparable with amount of prolog invocations. In that case
prolog methods are inlined. If collection is large, prolog counter is
relatively small, but back branch counter is quite large. In that case
MinInliningThreshold prevents methods iterator() to be inlined.
org.openjdk.Hash::sumIter (48 bytes)
@ 6 java.util.HashMap::keySet (25
bytes) executed < MinInliningThreshold times
\-> TypeProfile (15/15 counts) =
java/util/HashMap
our target==> @ 11 java.util.HashMap$KeySet::iterator (12
bytes) executed < MinInliningThreshold times
\-> TypeProfile (15/15 counts) =
java/util/HashMap$KeySet
@ 18
java.util.HashMap$HashIterator::hasNext (13 bytes) inline (hot)
\-> TypeProfile (148955/148955 counts) =
java/util/HashMap$KeyIterator
@ 27 java.util.HashMap$KeyIterator::next
(8 bytes) inline (hot)
@ 1
java.util.HashMap$HashIterator::nextNode (100 bytes) inline (hot)
@ 35 java.lang.Integer::intValue (5
bytes) accessor
@ 18
java.util.HashMap$HashIterator::hasNext (13 bytes) inline (hot)
On 2/1/19 2:38 PM, Vladimir Kozlov wrote:
> We have special method is_init_with_ea() [1] which allow to inline
> constructors and boxing methods regardless MinInliningThreshold. So
> constructor for Iterator should be inlined.
>
> But may be I don't understand your question. Which Iterator you are
> talking about when you said "iterator is scalarized". And why it works
> only for small sizes?
>
> There is an other C2's EA flag EliminateAllocationArraySizeLimit which
> limits which array can be scalarized. But it seems it is not case here.
>
> Vladimir K
>
> [1]
> http://hg.openjdk.java.net/jdk/jdk/file/9c84d2865c2d/src/hotspot/share/opto/bytecodeInfo.cpp#l76
>
> On 2/1/19 11:26 AM, Sergey Kuksenko wrote:
>> Hi, All.
>>
>> I'd like to raise some discussion - if it's possible to make C2
>> inline heuristics better.
>>
>> Code (just measure cost of iteration):
>>
>> Map<Integer, Integer>map;
>> ...
>> for (Iterator<Integer> iterator =map.keySet().iterator();
>> iterator.hasNext(); ) {
>> int v = iterator.next();
>> s += v;
>> }
>>
>> For examples, I found that the iterator is scalarized if collection
>> size is quite small (10 or less) and it isn't scalarized is
>> collection size is large enough (10000, 1000000 elements). Direct
>> cost of iterator allocation and GC pressure is negligible small is
>> our collection is large enough. But, if iterator's state contains
>> object references the cost of GC write barriers may be large.
>>
>> The reason why the iterator is scalarized for small collections and
>> is not scalarized for large collections is MinInliningThreshold
>> option. So we have a hot loop, but relatively cold prolog for the
>> loop and MinInliningThreshold prevents inlining methods called from
>> the prolog. At the same moment inlining such prolog methods may have
>> significant impact to loop performance, particularly fro such widely
>> used pattern as iteration.
>>
>> You can find simple benchmarks for that here:
>> http://cr.openjdk.java.net/~skuksenko/hotspot/inline/
>>
>> Here my results.
>>
>> default:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> Hash.sumIter 10000 thrpt 15 2929.112 ± 42.927 ops/s
>> Hash.sumIterHidden 10000 thrpt 15 3379.195 ± 30.589 ops/s
>> Tree.sumIter 10000 thrpt 15 3618.466 ± 126.785 ops/s
>> Tree.sumIterHidden 10000 thrpt 15 3576.429 ± 129.217 ops/s
>>
>> -XX:MinInliningThreshold=0
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> Hash.sumIter 10000 thrpt 15 5392.184 ± 115.039 ops/s
>> Hash.sumIterHidden 10000 thrpt 15 3357.754 ± 60.636 ops/s
>> Tree.sumIter 10000 thrpt 15 4128.407 ± 56.748 ops/s
>> Tree.sumIterHidden 10000 thrpt 15 3601.293 ± 123.135 ops/s
>>
>> So G1 write barriers costs 60% performance of HashMap iterator and
>> 14% performance of TreeMap iterator. Other GC also have cost, but G1
>> has highest.
>>
>>
>>
More information about the hotspot-compiler-dev
mailing list