[performance] Inline heuristics, scalar replacement and cold prolog of hot loop.

Fri Feb 1 19:26:25 UTC 2019

Hi, All.

I'd like to raise some discussion - if it's possible to make C2 inline 
heuristics better.

Code (just measure cost of iteration):

Map<Integer, Integer>map;
...
for (Iterator<Integer> iterator =map.keySet().iterator(); iterator.hasNext(); ) {
     int v = iterator.next();
     s += v;
}

For examples, I found that the iterator is scalarized if collection size 
is quite small (10 or less) and it isn't scalarized is collection size 
is large enough (10000, 1000000 elements). Direct cost of iterator 
allocation and GC pressure is negligible small is our collection is 
large enough. But, if iterator's state contains object references the 
cost of GC write barriers may be large.

The reason why the iterator is scalarized for small collections and is 
not scalarized for large collections is MinInliningThreshold option. So 
we have a hot loop, but relatively cold prolog for the loop and 
MinInliningThreshold prevents inlining methods called from the prolog. 
At the same moment inlining such prolog methods may have significant 
impact to loop performance, particularly fro such widely used pattern as 
iteration.

You can find simple benchmarks for that here: 
http://cr.openjdk.java.net/~skuksenko/hotspot/inline/

Here my results.

default:

Benchmark           (size)   Mode  Cnt     Score     Error  Units
Hash.sumIter         10000  thrpt   15  2929.112 ±  42.927  ops/s
Hash.sumIterHidden   10000  thrpt   15  3379.195 ±  30.589  ops/s
Tree.sumIter         10000  thrpt   15  3618.466 ± 126.785  ops/s
Tree.sumIterHidden   10000  thrpt   15  3576.429 ± 129.217  ops/s

-XX:MinInliningThreshold=0

Benchmark           (size)   Mode  Cnt     Score     Error  Units
Hash.sumIter         10000  thrpt   15  5392.184 ± 115.039  ops/s
Hash.sumIterHidden   10000  thrpt   15  3357.754 ±  60.636  ops/s
Tree.sumIter         10000  thrpt   15  4128.407 ±  56.748  ops/s
Tree.sumIterHidden   10000  thrpt   15  3601.293 ± 123.135  ops/s

So G1 write barriers costs 60% performance of HashMap iterator and 14% 
performance of TreeMap iterator. Other GC also have cost, but G1 has 
highest.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20190201/c86c1f91/attachment.html>