RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

Thu Aug 1 07:56:21 UTC 2019

On 1/08/2019 5:13 pm, Baesken, Matthias wrote:
> Hi David + JC , thanks for the reviews .
> 
> David - I added the  suggested print-outputs  , and  also  the parameter to executeThreadDumps .

Okay - thanks for that.

David
-----

> Best regards, Matthias
> 
> 
> 
>> -----Original Message-----
>> From: David Holmes <david.holmes at oracle.com>
>> Sent: Mittwoch, 31. Juli 2019 23:57
>> To: Baesken, Matthias <matthias.baesken at sap.com>; Jean Christophe
>> Beyler <jcbeyler at google.com>
>> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
>> dev at openjdk.java.net>
>> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast
>> Linux machines with Total safepoint time 0 ms
>>
>> On 1/08/2019 12:01 am, Baesken, Matthias wrote:
>>>
>>> Hi upload works again, now with webrev :
>>>
>>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.2/
>>
>> Could you please add, for diagnostic purposes:
>>
>> System.out.println("Total safepoint time (ms): " + value);
>>
>> after:
>>
>>    60         long value = executeThreadDumps();
>>
>> and
>>
>>    68         long value2 = executeThreadDumps();
>>
>> that way if the test fails we can check logs to see what kind of
>> safepoint times have been observed previously. No need to see an updated
>> webrev just for that.
>>
>> I have one further suggestion, take it or leave it, that
>> executeThreadDumps() takes a parameter to specify the initial value, so
>> we'd have:
>>
>>    60         long value = executeThreadDumps(0);
>>
>> and
>>
>>    68         long value2 = executeThreadDumps(value);
>>
>> This might help detect getTotalSafepointTime() going backwards slightly
>> better than current code.
>>
>> Thanks,
>> David
>>
>>> Best regards, Matthias
>>>
>>>
>>>> -----Original Message-----
>>>> From: Baesken, Matthias
>>>> Sent: Mittwoch, 31. Juli 2019 14:05
>>>> To: 'David Holmes' <david.holmes at oracle.com>; Jean Christophe Beyler
>>>> <jcbeyler at google.com>
>>>> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
>>>> dev at openjdk.java.net>
>>>> Subject: RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
>> fast
>>>> Linux machines with Total safepoint time 0 ms
>>>>
>>>> Hello, here is a version following  the latest proposal of JC .
>>>>
>>>> Unfortunately  attached as patch,  sorry for that - the uploads / pushes
>>>> currently  do not work from here .
>>>>
>>>> Best regards, Matthias
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: David Holmes <david.holmes at oracle.com>
>>>>> Sent: Mittwoch, 31. Juli 2019 05:04
>>>>> To: Jean Christophe Beyler <jcbeyler at google.com>
>>>>> Cc: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-
>>>>> dev at openjdk.java.net; serviceability-dev <serviceability-
>>>>> dev at openjdk.java.net>
>>>>> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
>>>> fast
>>>>> Linux machines with Total safepoint time 0 ms
>>>>>
>>>>> On 31/07/2019 9:08 am, Jean Christophe Beyler wrote:
>>>>>> FWIW, I would have done something like what David was suggesting,
>> just
>>>>>> slightly tweaked:
>>>>>>
>>>>>> public static long executeThreadDumps() {
>>>>>>     long value;
>>>>>>     long initial_value = mbean.getTotalSafepointTime();
>>>>>>     do {
>>>>>>         Thread.getAllStackTraces();
>>>>>>         value = mbean.getTotalSafepointTime();
>>>>>>     } while (value == initial_value);
>>>>>>     return value;
>>>>>> }
>>>>>>
>>>>>> This ensures that the value is a new value as opposed to the current
>>>>>> value and if something goes wrong, as David said, it will timeout; which
>>>>>> is ok.
>>>>>
>>>>> Works for me.
>>>>>
>>>>>> But I come back to not really understanding why we are doing this at
>>>>>> this point of relaxing (just get a new value of safepoint time).
>>>>>> Because, if we accept timeouts now as a failure here, then really the
>>>>>> whole test becomes:
>>>>>>
>>>>>> executeThreadDumps();
>>>>>> executeThreadDumps();
>>>>>>
>>>>>> Since the first call will return when value > 0 and the second call will
>>>>>> return when value2 > value (I still wonder why we want to ensure it
>>>>>> works twice...).
>>>>>
>>>>> The test is trying to sanity check that we are actually recording the
>>>>> time used by safepoints. So first check is that we can get a non-zero
>>>>> value; second check is we get a greater non-zero value. It's just a
>>>>> sanity test to try and catch if something gets unexpectedly broken in
>>>>> the time tracking code.
>>>>>
>>>>>> So both failures and even testing for it is kind of redundant, once you
>>>>>> have a do/while until a change?
>>>>>
>>>>> Yes - the problem with the tests that try to check internal VM behaviour
>>>>> is that we have no specified way to do something, in this case execute
>>>>> safepoints, that relates to internal VM behaviour, so we have to do
>>>>> something we know will currently work even if not specified to do so -
>>>>> e.g. dumping all thread stacks uses a global safepoint. The second
>>>>> problem is that the timer granularity is so coarse that we then have to
>>>>> guess how many times we need to do that something before seeing a
>>>>> change. To make the test robust we can keep doing stuff until we see a
>>>>> change and so the only way that will fail is if the overall timeout of
>>>>> the test kicks in. Or we can try and second guess how long it should
>>>>> take by introducing our own internal timeout - either directly or by
>>>>> limiting the number of loops in this case. That has its own problems and
>>>>> in general we have tried to reduce internal test timeouts (by removing
>>>>> them) and let overall timeouts take charge.
>>>>>
>>>>> No ideal solution. And this has already consumed way too much of
>>>>> everyone's time.
>>>>>
>>>>> Cheers,
>>>>> David
>>>>>
>>>>>> Thanks,
>>>>>> Jc
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 30, 2019 at 2:35 PM David Holmes
>> <david.holmes at oracle.com
>>>>>> <mailto:david.holmes at oracle.com>> wrote:
>>>>>>
>>>>>>       On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
>>>>>>        > Hi David,   "put that whole code (the while loop) in a helper
>>>>>>       method."   was JC's idea,  and I like the idea .
>>>>>>
>>>>>>       Regardless I think the way you are using NUM_THREAD_DUMPS is
>>>> really
>>>>>>       confusing. As an all-caps static you'd expect it to be a constant.
>>>>>>
>>>>>>       Thanks,
>>>>>>       David
>>>>>>
>>>>>>        > Let's see what others think .
>>>>>>        >
>>>>>>        >>
>>>>>>        >> Overall tests like this are not very useful, yet very fragile.
>>>>>>        >>
>>>>>>        >
>>>>>>        > I am also  fine with putting the test on the exclude list.
>>>>>>        >
>>>>>>        > Best regards, Matthias
>>>>>>        >
>>>>>>        >
>>>>>>        >> -----Original Message-----
>>>>>>        >> From: David Holmes <david.holmes at oracle.com
>>>>>>       <mailto:david.holmes at oracle.com>>
>>>>>>        >> Sent: Dienstag, 30. Juli 2019 14:12
>>>>>>        >> To: Baesken, Matthias <matthias.baesken at sap.com
>>>>>>       <mailto:matthias.baesken at sap.com>>; Jean Christophe
>>>>>>        >> Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>>
>>>>>>        >> Cc: hotspot-dev at openjdk.java.net
>>>>>>       <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
>>>>>>       <serviceability-
>>>>>>        >> dev at openjdk.java.net <mailto:dev at openjdk.java.net>>
>>>>>>        >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java
>>>>>>       fails on fast
>>>>>>        >> Linux machines with Total safepoint time 0 ms
>>>>>>        >>
>>>>>>        >> Hi Matthias,
>>>>>>        >>
>>>>>>        >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
>>>>>>        >>> Hello  JC / David,   here is a second webrev  :
>>>>>>        >>>
>>>>>>        >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
>>>>>>        >>>
>>>>>>        >>> It moves   the  thread dump execution into a  method
>>>>>>        >>> executeThreadDumps(long)     , and also adds  while loops
>>>>>>       (but with a
>>>>>>        >>> limitation  for the number of thread dumps, really don’t
>>>>>>        >>> want to cause timeouts etc.).    I removed a check for
>>>>>>        >>> MAX_VALUE_FOR_PASS   because we cannot go over
>>>>> Long.MAX_VALUE .
>>>>>>        >>
>>>>>>        >> I don't think executeThreadDumps is worth factoring out like out.
>>>>>>        >>
>>>>>>        >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd
>> rather
>>>> it
>>>>>>        >> remains a constant 100, and then you set a simple loop iteration
>>>>>>       count
>>>>>>        >> limit. Further with the proposed code when you get here:
>>>>>>        >>
>>>>>>        >>    85         NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
>>>>>>        >>
>>>>>>        >> you don't even know what value you may be starting with.
>>>>>>        >>
>>>>>>        >> But I was thinking of simply:
>>>>>>        >>
>>>>>>        >> long value = 0;
>>>>>>        >> do {
>>>>>>        >>       Thread.getAllStackTraces();
>>>>>>        >>       value = mbean.getTotalSafepointTime();
>>>>>>        >> } while (value == 0);
>>>>>>        >>
>>>>>>        >> We'd only hit a timeout if something is completely broken -
>>>>>>       which is fine.
>>>>>>        >>
>>>>>>        >> Overall tests like this are not very useful, yet very fragile.
>>>>>>        >>
>>>>>>        >> Thanks,
>>>>>>        >> David
>>>>>>        >>
>>>>>>        >>> Hope you like this version  better.
>>>>>>        >>>
>>>>>>        >>> Best regards, Matthias
>>>>>>        >>>
>>>>>>        >>> *From:*Jean Christophe Beyler <jcbeyler at google.com
>>>>>>       <mailto:jcbeyler at google.com>>
>>>>>>        >>> *Sent:* Dienstag, 30. Juli 2019 05:39
>>>>>>        >>> *To:* David Holmes <david.holmes at oracle.com
>>>>>>       <mailto:david.holmes at oracle.com>>
>>>>>>        >>> *Cc:* Baesken, Matthias <matthias.baesken at sap.com
>>>>>>       <mailto:matthias.baesken at sap.com>>;
>>>>>>        >>> hotspot-dev at openjdk.java.net
>>>>>>       <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
>>>>>>        >>> <serviceability-dev at openjdk.java.net
>>>>>>       <mailto:serviceability-dev at openjdk.java.net>>
>>>>>>        >>> *Subject:* Re: RFR: [XS] 8228658: test
>>>>>>       GetTotalSafepointTime.java fails
>>>>>>        >>> on fast Linux machines with Total safepoint time 0 ms
>>>>>>        >>>
>>>>>>        >>> Hi Matthias,
>>>>>>        >>>
>>>>>>        >>> I wonder if you should not do what David is suggesting and then
>>>>>>       put that
>>>>>>        >>> whole code (the while loop) in a helper method. Below you
>> have a
>>>>>>        >>> calculation again using value2 (which I wonder what the added
>>>>>>       value of
>>>>>>        >>> it is though) but anyway, that value2 could also be 0 at some
>>>>>>       point, no?
>>>>>>        >>>
>>>>>>        >>> So would it not be best to just refactor the getAllStackTraces
>> and
>>>>>>        >>> calculate safepoint time in a helper method for both value /
>> value2
>>>>>>        >>> variables?
>>>>>>        >>>
>>>>>>        >>> Thanks,
>>>>>>        >>>
>>>>>>        >>> Jc
>>>>>>        >>>
>>>>>>        >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes
>>>>>>       <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>>>>>>        >>> <mailto:david.holmes at oracle.com
>>>>>>       <mailto:david.holmes at oracle.com>>> wrote:
>>>>>>        >>>
>>>>>>        >>>      Hi Matthias,
>>>>>>        >>>
>>>>>>        >>>      On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
>>>>>>        >>>       > Hello , please review this small test fix .
>>>>>>        >>>       >
>>>>>>        >>>       > The test
>>>>>>        >>>
>>>>>>        >>
>>>>>
>>>>
>> test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
>>>>>>        >> java
>>>>>>        >>>      fails sometimes on fast Linux machines with this error
>>>>>>       message :
>>>>>>        >>>       >
>>>>>>        >>>       > java.lang.RuntimeException: Total safepoint time
>>>>>>       illegal value: 0
>>>>>>        >>>      ms (MIN = 1; MAX = 9223372036854775807)
>>>>>>        >>>       >
>>>>>>        >>>       > looks like the total safepoint time is too low
>>>>>>       currently on these
>>>>>>        >>>      machines, it is < 1 ms.
>>>>>>        >>>       >
>>>>>>        >>>       > There might be several ways to handle this :
>>>>>>        >>>       >
>>>>>>        >>>       >    *   Change the test  in a way that it might generate
>>>>>>       nigher
>>>>>>        >>>      safepoint times
>>>>>>        >>>       >    *   Allow  safepoint time  == 0 ms
>>>>>>        >>>       >    *   Offer an additional interface that gives
>>>>>>       safepoint times
>>>>>>        >>>      with finer granularity ( currently the HS has safepoint
>>>>>>       time values
>>>>>>        >>>      in ns , see  jdk/src/hotspot/share/runtime/safepoint.cpp
>>>>>>        >>>        SafepointTracing::end
>>>>>>        >>>       >
>>>>>>        >>>       > But it is converted on ms in this code
>>>>>>        >>>       >
>>>>>>        >>>       > 114jlong RuntimeService::safepoint_time_ms() {
>>>>>>        >>>       > 115  return UsePerfData ?
>>>>>>        >>>       > 116
>>>>>>        >>>
>>>>>>       Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -
>> 1;
>>>>>>        >>>       > 117}
>>>>>>        >>>       >
>>>>>>        >>>       > 064jlong Management::ticks_to_ms(jlong ticks) {
>>>>>>        >>>       > 2065  assert(os::elapsed_frequency() > 0, "Must be
>>>>>>       non-zero");
>>>>>>        >>>       > 2066  return (jlong)(((double)ticks /
>>>>>>        >>>      (double)os::elapsed_frequency())
>>>>>>        >>>       > 2067                 * (double)1000.0);
>>>>>>        >>>       > 2068}
>>>>>>        >>>       >
>>>>>>        >>>       >
>>>>>>        >>>       >
>>>>>>        >>>       > Currently I go for  the first attempt (and try to generate
>>>>>>        >>>      higher safepoint times in my patch) .
>>>>>>        >>>
>>>>>>        >>>      Yes that's probably best. Coarse-grained timing on very
>>>>>>       fast machines
>>>>>>        >>>      was bound to eventually lead to problems.
>>>>>>        >>>
>>>>>>        >>>      But perhaps a more future-proof approach is to just add a
>>>>>>       do-while loop
>>>>>>        >>>      around the stack dumps and only exit when we have a non-
>> zero
>>>>>>        >> safepoint
>>>>>>        >>>      time?
>>>>>>        >>>
>>>>>>        >>>      Thanks,
>>>>>>        >>>      David
>>>>>>        >>>      -----
>>>>>>        >>>
>>>>>>        >>>       > Bug/webrev :
>>>>>>        >>>       >
>>>>>>        >>>       > https://bugs.openjdk.java.net/browse/JDK-8228658
>>>>>>        >>>       >
>>>>>>        >>>       >
>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
>>>>>>        >>>       >
>>>>>>        >>>       >
>>>>>>        >>>       > Thanks, Matthias
>>>>>>        >>>       >
>>>>>>        >>>
>>>>>>        >>>
>>>>>>        >>> --
>>>>>>        >>>
>>>>>>        >>> Thanks,
>>>>>>        >>>
>>>>>>        >>> Jc
>>>>>>        >>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Thanks,
>>>>>> Jc