RFR: 8221542: ~15% performance degradation due to less optimized inline decision
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Sat Apr 6 00:47:55 UTC 2019
> I have updated the patch based on your advice.
> Webrev: http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/webrev.01/
What you are proposing is to unconditionally inline constructor calls. I
consider such a change too intrusive.
I suggest focusing on the "profile.count() == 0" check and making it
smarter: when profiling info is scarce, try to prove that the call site
is actually reachable before giving up.
Playing with a small microbenchmark, I observed the following:
(lldb) p caller_method->print()
<ciMethod name=integrate holder=Test signature=(I)D loaded=true
arg_size=1 flags=DEFAULT_ACCESS,static,final ident=1092
address=0x00000001008a3c30>
caller_method->interpreter_invocation_count() == 1
caller_method->method_data() != NULL [1]
caller_method->method_data()->is_mature() == true
caller_method->method_data()->invocation_count() == 0
caller_method->method_data()->backedge_count() == 802816
(lldb) p callee_method->print()
<ciMethod name=<init> holder=java/util/Random signature=(J)V loaded=true
arg_size=3 flags=public ident=1095 address=0x00000001008a5dd0>
callee_method->was_executed_more_than(0) == true
In addition, it's possible to prove the call is always executed by
looking at the CFG or by checking that the start block is the one being parsed.
When "profile.count() == 0" but the call site has been reached before,
the following conditions seem to hold:
caller_method->interpreter_invocation_count() > 0
AND
caller_method->method_data()->invocation_count() == (0 OR 1)
AND
callee_method->was_executed_more_than(0) == true
AND
parse->block() == parse->start_block() [2]
Some of them can be turned into asserts (e.g., invocation_count() == 0).
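Under the assumption that all four conditions must hold together, the combined check could be sketched as below. The `MockMethod`/`MockMethodData`/`MockParse` types are stand-ins for HotSpot's `ciMethod`, `ciMethodData`, and `Parse` (field names mirror the accessors cited above); this is illustrative only, not real VM code:

```cpp
#include <cassert>

// Stand-ins for ciMethodData / ciMethod / Parse; none of this is VM code.
struct MockMethodData {
  int invocation_count;   // ciMethodData::invocation_count()
  int backedge_count;     // ciMethodData::backedge_count()
};

struct MockMethod {
  int interpreter_invocation_count;
  MockMethodData* method_data;
  int executions;         // backs was_executed_more_than()
  bool was_executed_more_than(int n) const { return executions > n; }
};

struct MockParse {
  long current_block;     // Parse::block()
  long start_block;       // Parse::start_block()
};

// Hypothetical "call site was reached even though profile.count() == 0"
// predicate, combining the four conditions listed in the mail.
bool call_site_reached(const MockMethod& caller,
                       const MockMethod& callee,
                       const MockParse& parse) {
  const MockMethodData* md = caller.method_data;
  return caller.interpreter_invocation_count > 0
      && md != nullptr
      && (md->invocation_count == 0 || md->invocation_count == 1)
      && callee.was_executed_more_than(0)
      && parse.current_block == parse.start_block;
}
```

With the values from the lldb session above (caller entered the interpreter once, callee executed 751616 times, current block equal to the start block), the predicate holds.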
Best regards,
Vladimir Ivanov
[1]
p caller_method->method_data()->print()
0 bci: 5 CounterData count(0)
16 bci: 15 BranchData taken(0) displacement(200)
not taken(751616)
48 bci: 19 ciVirtualCallData count(0) entries(1)
java/util/Random(751616)
104 bci: 25 ciVirtualCallData count(0) entries(1)
java/util/Random(751616)
160 bci: 43 BranchData taken(161218) displacement(32)
not taken(590398)
192 bci: 52 JumpData taken(751615) displacement(-176)
--- Extra data:
264 bci: 0 ArgInfoData 0x0
[2]
(lldb) p this
(Parse *) $33 = 0x000070000eacc6e8
(lldb) p start_block()
(Parse::Block *) $31 = 0x00000001008aef00
(lldb) p block()
(Parse::Block *) $32 = 0x00000001008aef00
> Testing:
> - Running scimark.monte_carlo on jdk/x64 and jdk8u/mips64 with
> -XX:-TieredCompilation: no performance drop
> - Running SPECjvm2008 on jdk8u/mips64 with -XX:-TieredCompilation: no
> performance regression
> - Running make test TEST="micro" on jdk/x64: no performance regression
> - Running make test TEST="tier1 tier2 tier3" JTREG="JOBS=3"
> CONF=release on jdk/x64: no regression
>
> Could you please review it and give me some advice?
> Thanks a lot.
>
> Best regards,
> Jie
>
>
> On 2019/3/28 2:21 PM, Vladimir Ivanov wrote:
>> Hi Jie,
>>
>> The heuristic quirk looks very similar to the one Sergey reported
>> recently:
>>
>>
>> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-February/032623.html
>>
>>
>> Overall, tweaking the heuristic to favor inlining doesn't look like
>> the right thing here. profile.count=0 is a sign the profile isn't mature
>> enough and it's likely the callee doesn't have enough profiling info
>> as well. (And that's what Sergey observed on some of the
>> microbenchmarks during his experiments.)
>>
>> In your particular case (Random::<init>), tweaking the heuristic so
>> is_init_with_ea [1] overrules "profile.count > 0" may be a more
>> promising approach. After all, the fact that the call site is being
>> considered for inlining (and not pruned along with the basic block it
>> belongs to) is a strong signal in favor of "profile.count > 0" case.
>> (Though it's not guaranteed due to the immaturity of profile data.)
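The idea of letting `is_init_with_ea` overrule a zero call-site count could be sketched as follows. These are simplified stand-ins, not the real `InlineTree` code in bytecodeInfo.cpp; `treat_as_cold` and `SiteProfile` are hypothetical names introduced here for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the call-site profile; profile.count() in the real code.
struct SiteProfile { uint32_t count; };

// Mirrors the cited is_init_with_ea idea: a constructor (or a method
// reachable from one) is interesting for escape analysis, so inlining it
// may pay off even when the observed call count is zero.
bool is_init_with_ea(bool callee_is_init, bool caller_is_init, bool do_ea) {
  return do_ea && (callee_is_init || caller_is_init);
}

// Hypothetical tweak: treat "count == 0 but EA-relevant constructor" as
// warm instead of cold, giving EA a chance despite the immature profile.
bool treat_as_cold(const SiteProfile& profile, bool init_with_ea) {
  if (profile.count == 0 && init_with_ea) {
    return false;
  }
  return profile.count == 0;
}
```

For `Random::<init>` with a zero count, `treat_as_cold` would then return false and the site would stay eligible for inlining.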
>>
>> But IMO the root problem is that top-tier compilation happens too
>> early: profile data isn't mature enough yet and it will easily lead to
>> similar problems later (during compilation).
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1]
>> http://hg.openjdk.java.net/jdk/jdk/file/9c84d2865c2d/src/hotspot/share/opto/bytecodeInfo.cpp#l81
>>
>>
>> On 27/03/2019 03:15, Jie Fu wrote:
>>> Hi all,
>>>
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8221542
>>> Webrev:
>>> http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/webrev.00/
>>>
>>> ## Symptom
>>> ~15% performance degradation (from 700 ops/m to 600 ops/m) was
>>> observed randomly on x86 while running SPECjvm2008's
>>> scimark.monte_carlo with -XX:-TieredCompilation.
>>>
>>> ## Reproduce
>>> It can always be reproduced with the script [1] in less than 5 minutes.
>>>
>>> ## Reason
>>> The drop was caused by a decision not to inline
>>> spec.benchmarks.scimark.utils.Random::<init> in
>>> spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate.
>>>
>>> ## Fix
>>> It might be better to make a small change to the inlining heuristic [2].
>>>
>>> For callers without loops, the original heuristic works fine.
>>> But for callers with loops, it would be better to be more conservative
>>> about deciding not to inline.
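The quoted idea of being more conservative for callers with loops could be sketched like this. The function name, the ratio test, and the threshold of 100 are all hypothetical, chosen only to illustrate "high backedge count relative to invocations means the zero call-site count may just reflect an immature profile":

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical refinement of the quoted fix: when the caller is dominated
// by a loop (many backedges, few invocations), a zero call-site count is
// weak evidence, so don't veto inlining on it alone. The factor 100 is an
// arbitrary illustrative threshold, not a tuned VM constant.
bool should_veto_on_zero_count(uint32_t site_count,
                               uint32_t caller_invocations,
                               uint32_t caller_backedges) {
  if (site_count != 0) {
    return false;  // the site was observed; no reason to veto
  }
  bool loopy_caller = caller_backedges > 100 * (caller_invocations + 1);
  return !loopy_caller;  // stay conservative only for non-loopy callers
}
```

With the integrate() numbers above (1 interpreter invocation, 802816 backedges), this sketch would decline to veto inlining of `Random::<init>`.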
>>>
>>> ## Testing
>>> - Running scimark.monte_carlo on jdk/x64 with -XX:-TieredCompilation
>>> for about 5000 times, no performance drop
>>> Also on jdk8u/mips64 with -XX:-TieredCompilation, no performance drop
>>> - Running make test TEST="micro" on jdk/x64, no performance regression
>>> - Running SPECjvm2008 on jdk8u/x64 with -XX:-TieredCompilation, no
>>> performance regression
>>>
>>> For more detailed info, please see the JBS.
>>>
>>> Could you please review it?
>>> Thanks a lot.
>>>
>>> Best regards,
>>> Jie
>>>
>>> [1] http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/reproduce.sh
>>> [2]
>>> http://hg.openjdk.java.net/jdk/jdk/file/0a2d73e02076/src/hotspot/share/opto/bytecodeInfo.cpp#l375
>>>
>>>
>>>
>