RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

Tue Jul 30 21:35:12 UTC 2019

On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
> Hi David,   "put that whole code (the while loop) in a helper method."   was JC's idea,  and I like the idea .

Regardless I think the way you are using NUM_THREAD_DUMPS is really 
confusing. As an all-caps static you'd expect it to be a constant.

Thanks,
David

> Let's see what others think .
> 
>>
>> Overall tests like this are not very useful, yet very fragile.
>>
> 
> I am also  fine with putting the test on the exclude list.
> 
> Best regards, Matthias
> 
> 
>> -----Original Message-----
>> From: David Holmes <david.holmes at oracle.com>
>> Sent: Dienstag, 30. Juli 2019 14:12
>> To: Baesken, Matthias <matthias.baesken at sap.com>; Jean Christophe
>> Beyler <jcbeyler at google.com>
>> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
>> dev at openjdk.java.net>
>> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast
>> Linux machines with Total safepoint time 0 ms
>>
>> Hi Matthias,
>>
>> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
>>> Hello  JC / David,   here is a second webrev  :
>>>
>>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
>>>
>>> It moves   the  thread dump execution into a  method
>>> executeThreadDumps(long)     , and also adds  while loops   (but with a
>>> limitation  for the number of thread dumps, really don’t
>>> want to cause timeouts etc.).    I removed a check for
>>> MAX_VALUE_FOR_PASS   because we cannot go over Long.MAX_VALUE .
>>
>> I don't think executeThreadDumps is worth factoring out like out.
>>
>> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd rather it
>> remains a constant 100, and then you set a simple loop iteration count
>> limit. Further with the proposed code when you get here:
>>
>>    85         NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
>>
>> you don't even know what value you may be starting with.
>>
>> But I was thinking of simply:
>>
>> long value = 0;
>> do {
>>       Thread.getAllStackTraces();
>>       value = mbean.getTotalSafepointTime();
>> } while (value == 0);
>>
>> We'd only hit a timeout if something is completely broken - which is fine.
>>
>> Overall tests like this are not very useful, yet very fragile.
>>
>> Thanks,
>> David
>>
>>> Hope you like this version  better.
>>>
>>> Best regards, Matthias
>>>
>>> *From:*Jean Christophe Beyler <jcbeyler at google.com>
>>> *Sent:* Dienstag, 30. Juli 2019 05:39
>>> *To:* David Holmes <david.holmes at oracle.com>
>>> *Cc:* Baesken, Matthias <matthias.baesken at sap.com>;
>>> hotspot-dev at openjdk.java.net; serviceability-dev
>>> <serviceability-dev at openjdk.java.net>
>>> *Subject:* Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails
>>> on fast Linux machines with Total safepoint time 0 ms
>>>
>>> Hi Matthias,
>>>
>>> I wonder if you should not do what David is suggesting and then put that
>>> whole code (the while loop) in a helper method. Below you have a
>>> calculation again using value2 (which I wonder what the added value of
>>> it is though) but anyway, that value2 could also be 0 at some point, no?
>>>
>>> So would it not be best to just refactor the getAllStackTraces and
>>> calculate safepoint time in a helper method for both value / value2
>>> variables?
>>>
>>> Thanks,
>>>
>>> Jc
>>>
>>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>> wrote:
>>>
>>>      Hi Matthias,
>>>
>>>      On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
>>>       > Hello , please review this small test fix .
>>>       >
>>>       > The test
>>>
>> test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
>> java
>>>      fails sometimes on fast Linux machines with this error message :
>>>       >
>>>       > java.lang.RuntimeException: Total safepoint time illegal value: 0
>>>      ms (MIN = 1; MAX = 9223372036854775807)
>>>       >
>>>       > looks like the total safepoint time is too low currently on these
>>>      machines, it is < 1 ms.
>>>       >
>>>       > There might be several ways to handle this :
>>>       >
>>>       >    *   Change the test  in a way that it might generate nigher
>>>      safepoint times
>>>       >    *   Allow  safepoint time  == 0 ms
>>>       >    *   Offer an additional interface that gives  safepoint times
>>>      with finer granularity ( currently the HS has safepoint time values
>>>      in ns , see  jdk/src/hotspot/share/runtime/safepoint.cpp
>>>        SafepointTracing::end
>>>       >
>>>       > But it is converted on ms in this code
>>>       >
>>>       > 114jlong RuntimeService::safepoint_time_ms() {
>>>       > 115  return UsePerfData ?
>>>       > 116
>>>      Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -1;
>>>       > 117}
>>>       >
>>>       > 064jlong Management::ticks_to_ms(jlong ticks) {
>>>       > 2065  assert(os::elapsed_frequency() > 0, "Must be non-zero");
>>>       > 2066  return (jlong)(((double)ticks /
>>>      (double)os::elapsed_frequency())
>>>       > 2067                 * (double)1000.0);
>>>       > 2068}
>>>       >
>>>       >
>>>       >
>>>       > Currently I go for  the first attempt (and try to generate
>>>      higher safepoint times in my patch) .
>>>
>>>      Yes that's probably best. Coarse-grained timing on very fast machines
>>>      was bound to eventually lead to problems.
>>>
>>>      But perhaps a more future-proof approach is to just add a do-while loop
>>>      around the stack dumps and only exit when we have a non-zero
>> safepoint
>>>      time?
>>>
>>>      Thanks,
>>>      David
>>>      -----
>>>
>>>       > Bug/webrev :
>>>       >
>>>       > https://bugs.openjdk.java.net/browse/JDK-8228658
>>>       >
>>>       > http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
>>>       >
>>>       >
>>>       > Thanks, Matthias
>>>       >
>>>
>>>
>>> --
>>>
>>> Thanks,
>>>
>>> Jc
>>>