[performance] Inline heuristics, scalar replacement and cold prolog of hot loop.
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Feb 1 23:51:00 UTC 2019
Another option to consider is to delay normal compilation while
executing in OSRed version. In the worst case (a loop w/ many
iterations), the method is invoked only once before being compiled: the
code around the loop was executed, but it's hard to consider the profile
as representative.
IMO the problem is that when making a decision to issue a normal
compilation, both invocation and backedge counts are considered, but the
latter is overly prioritized (assigned the same weight as invocation
count: i+b > threshold).
It's better to gather more profile and at least get over
MinInliningThreshold for 100% executed code.
Best regards,
Vladimir Ivanov
[1]
http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/tieredThresholdPolicy.cpp#l60
On 01/02/2019 14:38, Vladimir Kozlov wrote:
> We have special method is_init_with_ea() [1] which allow to inline
> constructors and boxing methods regardless MinInliningThreshold. So
> constructor for Iterator should be inlined.
>
> But may be I don't understand your question. Which Iterator you are
> talking about when you said "iterator is scalarized". And why it works
> only for small sizes?
>
> There is an other C2's EA flag EliminateAllocationArraySizeLimit which
> limits which array can be scalarized. But it seems it is not case here.
>
> Vladimir K
>
> [1]
> http://hg.openjdk.java.net/jdk/jdk/file/9c84d2865c2d/src/hotspot/share/opto/bytecodeInfo.cpp#l76
>
>
> On 2/1/19 11:26 AM, Sergey Kuksenko wrote:
>> Hi, All.
>>
>> I'd like to raise some discussion - if it's possible to make C2 inline
>> heuristics better.
>>
>> Code (just measure cost of iteration):
>>
>> Map<Integer, Integer>map;
>> ...
>> for (Iterator<Integer> iterator =map.keySet().iterator();
>> iterator.hasNext(); ) {
>> int v = iterator.next();
>> s += v;
>> }
>>
>> For examples, I found that the iterator is scalarized if collection
>> size is quite small (10 or less) and it isn't scalarized is collection
>> size is large enough (10000, 1000000 elements). Direct cost of
>> iterator allocation and GC pressure is negligible small is our
>> collection is large enough. But, if iterator's state contains object
>> references the cost of GC write barriers may be large.
>>
>> The reason why the iterator is scalarized for small collections and is
>> not scalarized for large collections is MinInliningThreshold option.
>> So we have a hot loop, but relatively cold prolog for the loop and
>> MinInliningThreshold prevents inlining methods called from the prolog.
>> At the same moment inlining such prolog methods may have significant
>> impact to loop performance, particularly fro such widely used pattern
>> as iteration.
>>
>> You can find simple benchmarks for that here:
>> http://cr.openjdk.java.net/~skuksenko/hotspot/inline/
>>
>> Here my results.
>>
>> default:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> Hash.sumIter 10000 thrpt 15 2929.112 ± 42.927 ops/s
>> Hash.sumIterHidden 10000 thrpt 15 3379.195 ± 30.589 ops/s
>> Tree.sumIter 10000 thrpt 15 3618.466 ± 126.785 ops/s
>> Tree.sumIterHidden 10000 thrpt 15 3576.429 ± 129.217 ops/s
>>
>> -XX:MinInliningThreshold=0
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> Hash.sumIter 10000 thrpt 15 5392.184 ± 115.039 ops/s
>> Hash.sumIterHidden 10000 thrpt 15 3357.754 ± 60.636 ops/s
>> Tree.sumIter 10000 thrpt 15 4128.407 ± 56.748 ops/s
>> Tree.sumIterHidden 10000 thrpt 15 3601.293 ± 123.135 ops/s
>>
>> So G1 write barriers costs 60% performance of HashMap iterator and 14%
>> performance of TreeMap iterator. Other GC also have cost, but G1 has
>> highest.
>>
>>
>>
More information about the hotspot-compiler-dev
mailing list