RFR: 8221542: ~15% performance degradation due to less optimized inline decision
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Sat Apr 6 00:47:55 UTC 2019
> I have updated the patch based on your advice.
> Webrev: http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/webrev.01/
What you are proposing is to unconditionally inline constructor calls. I
consider such a change too intrusive.
I suggest focusing on the "profile.count() == 0" check and making it
smarter: when profiling info is scarce, try to prove that the call site
is actually reachable before giving up.
Playing with a small microbenchmark, I observed the following:
(lldb) p caller_method->print()
<ciMethod name=integrate holder=Test signature=(I)D loaded=true
arg_size=1 flags=DEFAULT_ACCESS,static,final ident=1092
address=0x00000001008a3c30>
caller_method->interpreter_invocation_count() == 1
caller_method->method_data() != NULL [1]
caller_method->method_data()->is_mature() == true
caller_method->method_data()->invocation_count() == 0
caller_method->method_data()->backedge_count() == 802816
(lldb) p callee_method->print()
<ciMethod name=<init> holder=java/util/Random signature=(J)V loaded=true
arg_size=3 flags=public ident=1095 address=0x00000001008a5dd0>
callee_method->was_executed_more_than(0) == true
In addition, it's possible to prove the call is always executed by
looking at the CFG or by checking that the start block is the one being parsed.
When "profile.count() == 0" but the call site has been reached before,
the following conditions seem to hold:
caller_method->interpreter_invocation_count() > 0
AND
caller_method->method_data()->invocation_count() == (0 OR 1)
AND
callee_method->was_executed_more_than(0) == true
AND
parse->block() == parse->start_block() [2]
Some of them can be turned into asserts (e.g., invocation_count() == 0).
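Under the assumption that all four conditions must hold together, the combined check could be sketched as below. The `MockMethod`/`MockMethodData`/`MockParse` types are stand-ins for HotSpot's `ciMethod`, `ciMethodData`, and `Parse` (field names mirror the accessors cited above); this is illustrative only, not real VM code:

```cpp
#include <cassert>

// Stand-ins for ciMethodData / ciMethod / Parse; none of this is VM code.
struct MockMethodData {
  int invocation_count;   // ciMethodData::invocation_count()
  int backedge_count;     // ciMethodData::backedge_count()
};

struct MockMethod {
  int interpreter_invocation_count;
  MockMethodData* method_data;
  int executions;         // backs was_executed_more_than()
  bool was_executed_more_than(int n) const { return executions > n; }
};

struct MockParse {
  long current_block;     // Parse::block()
  long start_block;       // Parse::start_block()
};

// Hypothetical "call site was reached even though profile.count() == 0"
// predicate, combining the four conditions listed in the mail.
bool call_site_reached(const MockMethod& caller,
                       const MockMethod& callee,
                       const MockParse& parse) {
  const MockMethodData* md = caller.method_data;
  return caller.interpreter_invocation_count > 0
      && md != nullptr
      && (md->invocation_count == 0 || md->invocation_count == 1)
      && callee.was_executed_more_than(0)
      && parse.current_block == parse.start_block;
}
```

With the values from the lldb session above (caller entered the interpreter once, callee executed 751616 times, current block equal to the start block), the predicate holds.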
Best regards,
Vladimir Ivanov
[1]
p caller_method->method_data()->print()
0 bci: 5 CounterData count(0)
16 bci: 15 BranchData taken(0) displacement(200)
not taken(751616)
48 bci: 19 ciVirtualCallData count(0) entries(1)
java/util/Random(751616)
104 bci: 25 ciVirtualCallData count(0) entries(1)
java/util/Random(751616)
160 bci: 43 BranchData taken(161218) displacement(32)
not taken(590398)
192 bci: 52 JumpData taken(751615) displacement(-176)
--- Extra data:
264 bci: 0 ArgInfoData 0x0
[2]
(lldb) p this
(Parse *) $33 = 0x000070000eacc6e8
(lldb) p start_block()
(Parse::Block *) $31 = 0x00000001008aef00
(lldb) p block()
(Parse::Block *) $32 = 0x00000001008aef00
> Testing:
> - Running scimark.monte_carlo on jdk/x64 and jdk8u/mips64 with
> -XX:-TieredCompilation: no performance drop
> - Running SPECjvm2008 on jdk8u/mips64 with -XX:-TieredCompilation: no
> performance regression
> - Running make test TEST="micro" on jdk/x64: no performance regression
> - Running make test TEST="tier1 tier2 tier3" JTREG="JOBS=3"
> CONF=release on jdk/x64: no regression
>
> Could you please review it and give me some advice?
> Thanks a lot.
>
> Best regards,
> Jie
>
>
> On 2019/3/28 2:21 PM, Vladimir Ivanov wrote:
>> Hi Jie,
>>
>> The heuristic quirk looks very similar to the one Sergey reported
>> recently:
>>
>>
>> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-February/032623.html
>>
>>
>> Overall, tweaking the heuristic to favor inlining doesn't look like
>> the right thing here. profile.count=0 is a sign the profile isn't mature
>> enough and it's likely the callee doesn't have enough profiling info
>> as well. (And that's what Sergey observed on some of the
>> microbenchmarks during his experiments.)
>>
>> In your particular case (Random::<init>), tweaking the heuristic so
>> is_init_with_ea [1] overrules "profile.count > 0" may be a more
>> promising approach. After all, the fact that the call site is being
>> considered for inlining (and not pruned along with the basic block it
>> belongs to) is a strong signal in favor of "profile.count > 0" case.
>> (Though it's not guaranteed due to the immaturity of profile data.)
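The idea of letting `is_init_with_ea` overrule a zero call-site count could be sketched as follows. These are simplified stand-ins, not the real `InlineTree` code in bytecodeInfo.cpp; `treat_as_cold` and `SiteProfile` are hypothetical names introduced here for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the call-site profile; profile.count() in the real code.
struct SiteProfile { uint32_t count; };

// Mirrors the cited is_init_with_ea idea: a constructor (or a method
// reachable from one) is interesting for escape analysis, so inlining it
// may pay off even when the observed call count is zero.
bool is_init_with_ea(bool callee_is_init, bool caller_is_init, bool do_ea) {
  return do_ea && (callee_is_init || caller_is_init);
}

// Hypothetical tweak: treat "count == 0 but EA-relevant constructor" as
// warm instead of cold, giving EA a chance despite the immature profile.
bool treat_as_cold(const SiteProfile& profile, bool init_with_ea) {
  if (profile.count == 0 && init_with_ea) {
    return false;
  }
  return profile.count == 0;
}
```

For `Random::<init>` with a zero count, `treat_as_cold` would then return false and the site would stay eligible for inlining.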
>>
>> But IMO the root problem is that top-tier compilation happens too
>> early: profile data isn't mature enough yet and it will easily lead to
>> similar problems later (during compilation).
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1]
>> http://hg.openjdk.java.net/jdk/jdk/file/9c84d2865c2d/src/hotspot/share/opto/bytecodeInfo.cpp#l81
>>
>>
>> On 27/03/2019 03:15, Jie Fu wrote:
>>> Hi all,
>>>
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8221542
>>> Webrev:
>>> http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/webrev.00/
>>>
>>> ## Symptom
>>> ~15% performance degradation (from 700 ops/m to 600 ops/m) was
>>> observed randomly on x86 while running SPECjvm2008's
>>> scimark.monte_carlo with -XX:-TieredCompilation.
>>>
>>> ## Reproduce
>>> It can always be reproduced with the script [1] in less than 5 minutes.
>>>
>>> ## Reason
>>> The drop was caused by a decision not to inline
>>> spec.benchmarks.scimark.utils.Random::<init> in
>>> spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate.
>>>
>>> ## Fix
>>> It might be better to make a small change to the inlining heuristic [2].
>>>
>>> For callers without loops, the original heuristic works fine.
>>> But for callers with loops, it would be better to be more conservative
>>> about deciding not to inline.
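The quoted idea of being more conservative for callers with loops could be sketched like this. The function name, the ratio test, and the threshold of 100 are all hypothetical, chosen only to illustrate "high backedge count relative to invocations means the zero call-site count may just reflect an immature profile":

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical refinement of the quoted fix: when the caller is dominated
// by a loop (many backedges, few invocations), a zero call-site count is
// weak evidence, so don't veto inlining on it alone. The factor 100 is an
// arbitrary illustrative threshold, not a tuned VM constant.
bool should_veto_on_zero_count(uint32_t site_count,
                               uint32_t caller_invocations,
                               uint32_t caller_backedges) {
  if (site_count != 0) {
    return false;  // the site was observed; no reason to veto
  }
  bool loopy_caller = caller_backedges > 100 * (caller_invocations + 1);
  return !loopy_caller;  // stay conservative only for non-loopy callers
}
```

With the integrate() numbers above (1 interpreter invocation, 802816 backedges), this sketch would decline to veto inlining of `Random::<init>`.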
>>>
>>> ## Testing
>>> - Running scimark.monte_carlo on jdk/x64 with -XX:-TieredCompilation
>>> for about 5000 times, no performance drop
>>> Also on jdk8u/mips64 with -XX:-TieredCompilation, no performance drop
>>> - Running make test TEST="micro" on jdk/x64, no performance regression
>>> - Running SPECjvm2008 on jdk8u/x64 with -XX:-TieredCompilation, no
>>> performance regression
>>>
>>> For more detailed info, please see the JBS.
>>>
>>> Could you please review it?
>>> Thanks a lot.
>>>
>>> Best regards,
>>> Jie
>>>
>>> [1] http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/reproduce.sh
>>> [2]
>>> http://hg.openjdk.java.net/jdk/jdk/file/0a2d73e02076/src/hotspot/share/opto/bytecodeInfo.cpp#l375
>>>
>>>
>>>
>