RFR(S) 8246019 PerfClassTraceTime slows down VM start-up

Thu Jun 25 04:02:38 UTC 2020

No objections.

Thanks,
David

On 24/06/2020 8:57 am, Ioi Lam wrote:
> I've updated the patch to include just the fix for class initialization:
> 
> http://cr.openjdk.java.net/~iklam/jdk16/8246019-avoid-PerfClassTraceTime.v02/ 
> 
> 
> Hopefully this part is non-controversial. We are unlikely to make 
> call_class_initializer(THREAD) any slower when there's no <clinit>, so I 
> didn't add the diagnostic flag as suggested by Claes.
> 
> I'll leave the class linking alone for now, as that may change in the 
> future.
> 
> Meanwhile, I will look at other ways of reducing the effect of the 
> performance counters on start-up, under JDK-8246020 (-XX:+UsePerfData is 
> enabled by default and slows down VM bootstrap by 6%).
> 
> thanks
> - Ioi
> 
> On 6/18/20 4:38 AM, Claes Redestad wrote:
>>
>>
>> On 2020-06-17 05:19, Ioi Lam wrote:
>>>
>>>
>>> On 6/16/20 6:20 PM, David Holmes wrote:
>>>> Hi Ioi,
>>>>
>>>> On 17/06/2020 6:14 am, Ioi Lam wrote:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8246019
>>>>> http://cr.openjdk.java.net/~iklam/jdk16/8246019-avoid-PerfClassTraceTime.v01/ 
>>>>>
>>>>>
>>>>> PerfClassTraceTime is (a rarely used feature) for measuring the 
>>>>> time spent during class linking and initialization.
>>>>
>>>> "A special command jcmd <process id/main class> PerfCounter.print 
>>>> prints all performance counters in the process."
>>>>
>>>> How do you know this is a "rarely used feature"?
>>> Hi David,
>>>
>>> Sure, the counter will be dumped, but by "rarely used" -- I mean no 
>>> one will find this particular counter useful, and no one will be 
>>> actively looking at it.
>>>
>>> I changed two parts of the code -- class init and class linking.
>>>
>>> For class initialization, the counter may be useful for people who 
>>> want to know how much time is spent in their <clinit> functions, and 
>>> my patch doesn't change that. It only avoids using the counter when a 
>>> class has no <clinit>, i.e., we know that the counter counts nothing 
>>> (except for a logging statement).
>>>
>>> =====
>>>
>>> For class linking, no user code is executed, so it only measures VM 
>>> code. If it's useful for anyone, that would be VM engineers like me 
>>> who are trying to optimize the speed of class loading. However, due 
>>> to the overhead of the counter vs what it's trying to measure, the 
>>> results are pretty meaningless.
>>>
>>> Note that I've not disabled the counter altogether. Instead, I 
>>> disable it only when linking a CDS shared class, and we know that 
>>> very little is happening for this class (e.g., no verification).
>>>
>>> I think the class linking timer might have been useful 15 years ago 
>>> when it was introduced, or it might be useful today when CDS is 
>>> disabled. But with CDS enabled, we are paying a constant price that 
>>> seems to benefit no one.
>>>
>>> I think we should short-circuit it when it seems appropriate. If this 
>>> indeed causes problems for our users, it's easy to re-enable it. 
>>> That's better than just keeping this forever just because we're 
>>> afraid to touch anything.
>>
>> I think this seems like well-rounded approach overall, but this assumes
>> that we're mostly measuring the overhead of measurement here. I don't
>> doubt that's the case for the scenarios you're excluding here and now,
>> but it's hard to guarantee this property hold in the future.
>>
>> Perhaps a diagnostic flag to enable timing unconditionally would be
>> appropriate? With such a flag we could verify that the time deltas of
>> running some applications with and without the flag roughly matches the
>> time delta in reported linking time. If they diverge, we might need to
>> adjust the conditions.
>>
>>>
>>>>
>>>> I find it hard to evaluate whether this short-circuiting of the time 
>>>> tracing is reasonable or not. Obviously any monitoring mechanism 
>>>> should impose minimal overhead compared to what is being measured, 
>>>> and these timers fall short in that regard. But if these stats 
>>>> become meaningless then they may as well be removed.
>>>>
>>>> I think the serviceability folk (cc'd) need to evaluate this in the 
>>>> context of the M&M tools.
>>
>> As a complement (or even alternative) there might be ways we can reduce
>> time-to-measure overheads. E.g, JFR added
>> FastUnorderedElapsedCounterSource (share/utilities/ticks.hpp) which uses
>> rdtsc if available (x86 - fallback to os::elapsed_counter otherwise).
>>
>> This might be a reasonable alternative for the Perf* timers, which
>> should be short-running events on a single thread.
>>
>> /Claes
>>
>>>>
>>>>> However, it's quite expensive and it needs to start and stop a 
>>>>> bunch of timers. With CDS, it's quite often for the overhead of the 
>>>>> timer itself to be much more than the time it's trying to measure, 
>>>>> giving unreliable measurement.
>>>>>
>>>>> In this patch, when it's clear that the init and linking will be 
>>>>> very quick, I disable the timer and count only the number of 
>>>>> invocations. This shows a small improvement in start-up
>>>>
>>>> I'm curious if you tried to forcing EagerInitialization to be true 
>>>> to see how that improves the baseline. I've always noticed 
>>>> eager_init in the code, but hadn't realized it is disabled by default.
>>>>
>>>
>>> I think it cannot be done by default, as it will violate the JLS. A 
>>> class can be initialized only when it's touched by bytecodes.
>>>
>>> It can also backfire as we may load many classes without initializing 
>>> them. E.g., during bytecode verification, we load many classes and 
>>> just check that one is a supertype of another.
>>>
>>> Thanks
>>> - Ioi
>>>
>>>> Cheers,
>>>> David
>>>> -----
>>>>
>>>>> Results of " perf stat -r 100 bin/java -Xshare:on 
>>>>> -XX:SharedArchiveFile=jdk2.jsa -Xint -version "
>>>>>
>>>>> 59623970 59341935 (-282035)   -----  41.774  41.591 ( -0.183) -
>>>>> 59623495 59331646 (-291849)   -----  41.696  41.165 ( -0.531) --
>>>>> 59627148 59329526 (-297622)   -----  41.249  41.094 ( -0.155) -
>>>>> 59612439 59340760 (-271679)   ----   41.773  40.657 ( -1.116) -----
>>>>> 59626438 59335681 (-290757)   -----  41.683  40.901 ( -0.782) ----
>>>>> 59618436 59338953 (-279483)   -----  41.861  41.249 ( -0.612) ---
>>>>> 59608782 59340173 (-268609)   ----   41.198  41.508 ( 0.310) +
>>>>> 59614612 59325177 (-289435)   -----  41.397  41.738 ( 0.341) ++
>>>>> 59615905 59344006 (-271899)   ----   41.921  40.969 ( -0.952) ----
>>>>> 59635867 59333147 (-302720)   -----  41.491  40.836 ( -0.655) ---
>>>>> ================================================
>>>>> 59620708 59336100 (-284608)   -----  41.604  41.169 ( -0.434) --
>>>>> instruction delta =      -284608    -0.4774%
>>>>> time        delta =       -0.434 ms -1.0435%
>>>>>
>>>>> The number of PerfClassTraceTime's used is reduced from 564 to 116 
>>>>> (so we have an overhead of about 715 instructions per use, yikes!).
>>>
>