RFR(S) 8246019 PerfClassTraceTime slows down VM start-up
David Holmes
david.holmes at oracle.com
Thu Jun 25 04:02:38 UTC 2020
No objections.
Thanks,
David
On 24/06/2020 8:57 am, Ioi Lam wrote:
> I've updated the patch to include just the fix for class initialization:
>
> http://cr.openjdk.java.net/~iklam/jdk16/8246019-avoid-PerfClassTraceTime.v02/
>
>
> Hopefully this part is non-controversial. We are unlikely to make
> call_class_initializer(THREAD) any slower when there's no <clinit>, so I
> didn't add the diagnostic flag as suggested by Claes.
>
> I'll leave the class linking alone for now, as that may change in the
> future.
>
> Meanwhile, I will look at other ways of reducing the effect of the
> performance counters on start-up, under JDK-8246020 (-XX:+UsePerfData is
> enabled by default and slows down VM bootstrap by 6%).
>
> thanks
> - Ioi
>
> On 6/18/20 4:38 AM, Claes Redestad wrote:
>>
>>
>> On 2020-06-17 05:19, Ioi Lam wrote:
>>>
>>>
>>> On 6/16/20 6:20 PM, David Holmes wrote:
>>>> Hi Ioi,
>>>>
>>>> On 17/06/2020 6:14 am, Ioi Lam wrote:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8246019
>>>>> http://cr.openjdk.java.net/~iklam/jdk16/8246019-avoid-PerfClassTraceTime.v01/
>>>>>
>>>>>
>>>>> PerfClassTraceTime is (a rarely used feature) for measuring the
>>>>> time spent during class linking and initialization.
>>>>
>>>> "A special command jcmd <process id/main class> PerfCounter.print
>>>> prints all performance counters in the process."
>>>>
>>>> How do you know this is a "rarely used feature"?
>>> Hi David,
>>>
>>> Sure, the counter will be dumped, but by "rarely used" -- I mean no
>>> one will find this particular counter useful, and no one will be
>>> actively looking at it.
>>>
>>> I changed two parts of the code -- class init and class linking.
>>>
>>> For class initialization, the counter may be useful for people who
>>> want to know how much time is spent in their <clinit> functions, and
>>> my patch doesn't change that. It only avoids using the counter when a
>>> class has no <clinit>, i.e., we know that the counter counts nothing
>>> (except for a logging statement).
>>>
>>> =====
>>>
>>> For class linking, no user code is executed, so it only measures VM
>>> code. If it's useful for anyone, that would be VM engineers like me
>>> who are trying to optimize the speed of class loading. However, due
>>> to the overhead of the counter vs what it's trying to measure, the
>>> results are pretty meaningless.
>>>
>>> Note that I've not disabled the counter altogether. Instead, I
>>> disable it only when linking a CDS shared class, and we know that
>>> very little is happening for this class (e.g., no verification).
>>>
>>> I think the class linking timer might have been useful 15 years ago
>>> when it was introduced, or it might be useful today when CDS is
>>> disabled. But with CDS enabled, we are paying a constant price that
>>> seems to benefit no one.
>>>
>>> I think we should short-circuit it when it seems appropriate. If this
>>> indeed causes problems for our users, it's easy to re-enable it.
>>> That's better than just keeping this forever just because we're
>>> afraid to touch anything.
>>
>> I think this seems like well-rounded approach overall, but this assumes
>> that we're mostly measuring the overhead of measurement here. I don't
>> doubt that's the case for the scenarios you're excluding here and now,
>> but it's hard to guarantee this property hold in the future.
>>
>> Perhaps a diagnostic flag to enable timing unconditionally would be
>> appropriate? With such a flag we could verify that the time deltas of
>> running some applications with and without the flag roughly matches the
>> time delta in reported linking time. If they diverge, we might need to
>> adjust the conditions.
>>
>>>
>>>>
>>>> I find it hard to evaluate whether this short-circuiting of the time
>>>> tracing is reasonable or not. Obviously any monitoring mechanism
>>>> should impose minimal overhead compared to what is being measured,
>>>> and these timers fall short in that regard. But if these stats
>>>> become meaningless then they may as well be removed.
>>>>
>>>> I think the serviceability folk (cc'd) need to evaluate this in the
>>>> context of the M&M tools.
>>
>> As a complement (or even alternative) there might be ways we can reduce
>> time-to-measure overheads. E.g, JFR added
>> FastUnorderedElapsedCounterSource (share/utilities/ticks.hpp) which uses
>> rdtsc if available (x86 - fallback to os::elapsed_counter otherwise).
>>
>> This might be a reasonable alternative for the Perf* timers, which
>> should be short-running events on a single thread.
>>
>> /Claes
>>
>>>>
>>>>> However, it's quite expensive and it needs to start and stop a
>>>>> bunch of timers. With CDS, it's quite often for the overhead of the
>>>>> timer itself to be much more than the time it's trying to measure,
>>>>> giving unreliable measurement.
>>>>>
>>>>> In this patch, when it's clear that the init and linking will be
>>>>> very quick, I disable the timer and count only the number of
>>>>> invocations. This shows a small improvement in start-up
>>>>
>>>> I'm curious if you tried to forcing EagerInitialization to be true
>>>> to see how that improves the baseline. I've always noticed
>>>> eager_init in the code, but hadn't realized it is disabled by default.
>>>>
>>>
>>> I think it cannot be done by default, as it will violate the JLS. A
>>> class can be initialized only when it's touched by bytecodes.
>>>
>>> It can also backfire as we may load many classes without initializing
>>> them. E.g., during bytecode verification, we load many classes and
>>> just check that one is a supertype of another.
>>>
>>> Thanks
>>> - Ioi
>>>
>>>> Cheers,
>>>> David
>>>> -----
>>>>
>>>>> Results of " perf stat -r 100 bin/java -Xshare:on
>>>>> -XX:SharedArchiveFile=jdk2.jsa -Xint -version "
>>>>>
>>>>> 59623970 59341935 (-282035) ----- 41.774 41.591 ( -0.183) -
>>>>> 59623495 59331646 (-291849) ----- 41.696 41.165 ( -0.531) --
>>>>> 59627148 59329526 (-297622) ----- 41.249 41.094 ( -0.155) -
>>>>> 59612439 59340760 (-271679) ---- 41.773 40.657 ( -1.116) -----
>>>>> 59626438 59335681 (-290757) ----- 41.683 40.901 ( -0.782) ----
>>>>> 59618436 59338953 (-279483) ----- 41.861 41.249 ( -0.612) ---
>>>>> 59608782 59340173 (-268609) ---- 41.198 41.508 ( 0.310) +
>>>>> 59614612 59325177 (-289435) ----- 41.397 41.738 ( 0.341) ++
>>>>> 59615905 59344006 (-271899) ---- 41.921 40.969 ( -0.952) ----
>>>>> 59635867 59333147 (-302720) ----- 41.491 40.836 ( -0.655) ---
>>>>> ================================================
>>>>> 59620708 59336100 (-284608) ----- 41.604 41.169 ( -0.434) --
>>>>> instruction delta = -284608 -0.4774%
>>>>> time delta = -0.434 ms -1.0435%
>>>>>
>>>>> The number of PerfClassTraceTime's used is reduced from 564 to 116
>>>>> (so we have an overhead of about 715 instructions per use, yikes!).
>>>
>
More information about the serviceability-dev
mailing list