[performance] Inline heuristics, scalar replacement and cold prolog of hot loop.

Fri Feb 1 23:51:00 UTC 2019

Another option to consider is to delay normal compilation while 
executing in the OSR-compiled version. In the worst case (a loop w/ many 
iterations), the method is invoked only once before being compiled: the 
code around the loop has executed just once, so its profile can hardly 
be considered representative.

IMO the problem is that when deciding whether to issue a normal 
compilation, both invocation and backedge counts are considered, but the 
latter is overly prioritized (it is assigned the same weight as the 
invocation count: i+b > threshold) [1].

It would be better to gather more profile data and at least get past 
MinInliningThreshold for code that is executed on every invocation.
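
To make the "i+b > threshold" point concrete, here is a minimal sketch in Java. It is an illustration only, not HotSpot's policy code (which is C++ and has more conditions); the class name, method names, THRESHOLD value and the minInvocations parameter are all hypothetical stand-ins chosen for the example.

```java
// Simplified illustration only; NOT the actual HotSpot tiered policy.
public class CompileTriggerSketch {
    // Hypothetical compile threshold, chosen only for this example.
    static final long THRESHOLD = 10_000;

    // Equal-weight rule as described in the message: i + b > threshold.
    static boolean equalWeight(long invocations, long backedges) {
        return invocations + backedges > THRESHOLD;
    }

    // Hypothetical variant: additionally require a minimum number of
    // invocations, so code outside the loop has some profile before
    // the normal compilation is issued.
    static boolean withInvocationFloor(long invocations, long backedges,
                                       long minInvocations) {
        return invocations >= minInvocations
                && invocations + backedges > THRESHOLD;
    }

    public static void main(String[] args) {
        long i = 1;          // method entered once
        long b = 1_000_000;  // but the loop ran a million iterations
        System.out.println(equalWeight(i, b));              // true
        System.out.println(withInvocationFloor(i, b, 100)); // false
    }
}
```

With a single invocation and a long-running loop, the equal-weight rule fires on backedges alone, while the variant with an invocation floor keeps waiting for more profile.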

Best regards,
Vladimir Ivanov

[1] 
http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/tieredThresholdPolicy.cpp#l60

On 01/02/2019 14:38, Vladimir Kozlov wrote:
> We have a special method, is_init_with_ea() [1], which allows inlining 
> constructors and boxing methods regardless of MinInliningThreshold. So 
> the constructor for Iterator should be inlined.
> 
> But maybe I don't understand your question. Which Iterator are you 
> talking about when you say "iterator is scalarized"? And why does it 
> work only for small sizes?
> 
> There is another C2 EA flag, EliminateAllocationArraySizeLimit, which 
> limits which arrays can be scalarized. But that does not seem to be the 
> case here.
> 
> Vladimir K
> 
> [1] 
> http://hg.openjdk.java.net/jdk/jdk/file/9c84d2865c2d/src/hotspot/share/opto/bytecodeInfo.cpp#l76 
> 
> 
> On 2/1/19 11:26 AM, Sergey Kuksenko wrote:
>> Hi, All.
>>
>> I'd like to start a discussion on whether it's possible to make C2's 
>> inline heuristics better.
>>
>> Code (it just measures the cost of iteration):
>>
>> Map<Integer, Integer> map;
>> ...
>> for (Iterator<Integer> iterator = map.keySet().iterator();
>>      iterator.hasNext(); ) {
>>     int v = iterator.next();
>>     s += v;
>> }
>>
>> For example, I found that the iterator is scalarized if the collection 
>> size is quite small (10 or fewer) and isn't scalarized if the 
>> collection is large enough (10000 or 1000000 elements). The direct cost 
>> of iterator allocation and GC pressure is negligible if the collection 
>> is large enough. But if the iterator's state contains object 
>> references, the cost of GC write barriers may be large.
>>
>> The reason why the iterator is scalarized for small collections and 
>> not for large ones is the MinInliningThreshold option. We have a hot 
>> loop but a relatively cold prolog for the loop, and 
>> MinInliningThreshold prevents inlining of methods called from the 
>> prolog. At the same time, inlining such prolog methods may have a 
>> significant impact on loop performance, particularly for such a widely 
>> used pattern as iteration.
>>
>> You can find simple benchmarks for that here: 
>> http://cr.openjdk.java.net/~skuksenko/hotspot/inline/
>>
>> Here are my results.
>>
>> default:
>>
>> Benchmark           (size)   Mode  Cnt     Score     Error  Units
>> Hash.sumIter         10000  thrpt   15  2929.112 ±  42.927  ops/s
>> Hash.sumIterHidden   10000  thrpt   15  3379.195 ±  30.589  ops/s
>> Tree.sumIter         10000  thrpt   15  3618.466 ± 126.785  ops/s
>> Tree.sumIterHidden   10000  thrpt   15  3576.429 ± 129.217  ops/s
>>
>> -XX:MinInliningThreshold=0
>>
>> Benchmark           (size)   Mode  Cnt     Score     Error  Units
>> Hash.sumIter         10000  thrpt   15  5392.184 ± 115.039  ops/s
>> Hash.sumIterHidden   10000  thrpt   15  3357.754 ±  60.636  ops/s
>> Tree.sumIter         10000  thrpt   15  4128.407 ±  56.748  ops/s
>> Tree.sumIterHidden   10000  thrpt   15  3601.293 ± 123.135  ops/s
>>
>> So G1 write barriers cost 60% of HashMap iterator performance and 14% 
>> of TreeMap iterator performance. Other GCs also have a cost, but G1's 
>> is the highest.
>>
>>
>>
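
The 60% and 14% figures are not derived explicitly in the message. One reading that appears to reproduce them is the throughput ratio of sumIter to sumIterHidden in the -XX:MinInliningThreshold=0 table. The sketch below does that arithmetic; the choice of rows is an assumption, and the class and method names are made up for the example.

```java
// Arithmetic check only; the row pairing is an assumption, not stated
// in the original message.
public class SpeedupSketch {
    // Percentage by which 'fast' throughput exceeds 'slow' throughput.
    static double gainPercent(double fast, double slow) {
        return (fast / slow - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Scores (ops/s) copied from the -XX:MinInliningThreshold=0 table.
        double hash = gainPercent(5392.184, 3357.754); // Hash.sumIter vs Hash.sumIterHidden
        double tree = gainPercent(4128.407, 3601.293); // Tree.sumIter vs Tree.sumIterHidden
        System.out.printf("Hash: %.1f%%, Tree: %.1f%%%n", hash, tree);
    }
}
```

Under this reading the results come out close to the quoted 60% and 14%.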