RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

David Holmes david.holmes at oracle.com
Wed Jul 31 21:57:07 UTC 2019


On 1/08/2019 12:01 am, Baesken, Matthias wrote:
> 
> Hi upload works again, now with webrev :
> 
> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.2/

Could you please add, for diagnostic purposes:

System.out.println("Total safepoint time (ms): " + value);

after:

  60         long value = executeThreadDumps();

and

  68         long value2 = executeThreadDumps();

that way if the test fails we can check logs to see what kind of 
safepoint times have been observed previously. No need to see an updated 
webrev just for that.

I have one further suggestion, take it or leave it, that 
executeThreadDumps() takes a parameter to specify the initial value, so 
we'd have:

  60         long value = executeThreadDumps(0);

and

  68         long value2 = executeThreadDumps(value);

This might help detect getTotalSafepointTime() going backwards slightly 
better than current code.

Thanks,
David

> Best regards, Matthias
> 
> 
>> -----Original Message-----
>> From: Baesken, Matthias
>> Sent: Mittwoch, 31. Juli 2019 14:05
>> To: 'David Holmes' <david.holmes at oracle.com>; Jean Christophe Beyler
>> <jcbeyler at google.com>
>> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
>> dev at openjdk.java.net>
>> Subject: RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast
>> Linux machines with Total safepoint time 0 ms
>>
>> Hello, here is a version following  the latest proposal of JC .
>>
>> Unfortunately  attached as patch,  sorry for that - the uploads / pushes
>> currently  do not work from here .
>>
>> Best regards, Matthias
>>
>>
>>> -----Original Message-----
>>> From: David Holmes <david.holmes at oracle.com>
>>> Sent: Mittwoch, 31. Juli 2019 05:04
>>> To: Jean Christophe Beyler <jcbeyler at google.com>
>>> Cc: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-
>>> dev at openjdk.java.net; serviceability-dev <serviceability-
>>> dev at openjdk.java.net>
>>> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
>> fast
>>> Linux machines with Total safepoint time 0 ms
>>>
>>> On 31/07/2019 9:08 am, Jean Christophe Beyler wrote:
>>>> FWIW, I would have done something like what David was suggesting, just
>>>> slightly tweaked:
>>>>
>>>> public static long executeThreadDumps() {
>>>>    long value;
>>>>    long initial_value = mbean.getTotalSafepointTime();
>>>>    do {
>>>>        Thread.getAllStackTraces();
>>>>        value = mbean.getTotalSafepointTime();
>>>>    } while (value == initial_value);
>>>>    return value;
>>>> }
>>>>
>>>> This ensures that the value is a new value as opposed to the current
>>>> value and if something goes wrong, as David said, it will timeout; which
>>>> is ok.
>>>
>>> Works for me.
>>>
>>>> But I come back to not really understanding why we are doing this at
>>>> this point of relaxing (just get a new value of safepoint time).
>>>> Because, if we accept timeouts now as a failure here, then really the
>>>> whole test becomes:
>>>>
>>>> executeThreadDumps();
>>>> executeThreadDumps();
>>>>
>>>> Since the first call will return when value > 0 and the second call will
>>>> return when value2 > value (I still wonder why we want to ensure it
>>>> works twice...).
>>>
>>> The test is trying to sanity check that we are actually recording the
>>> time used by safepoints. So first check is that we can get a non-zero
>>> value; second check is we get a greater non-zero value. It's just a
>>> sanity test to try and catch if something gets unexpectedly broken in
>>> the time tracking code.
>>>
>>>> So both failures and even testing for it is kind of redundant, once you
>>>> have a do/while until a change?
>>>
>>> Yes - the problem with the tests that try to check internal VM behaviour
>>> is that we have no specified way to do something, in this case execute
>>> safepoints, that relates to internal VM behaviour, so we have to do
>>> something we know will currently work even if not specified to do so -
>>> e.g. dumping all thread stacks uses a global safepoint. The second
>>> problem is that the timer granularity is so coarse that we then have to
>>> guess how many times we need to do that something before seeing a
>>> change. To make the test robust we can keep doing stuff until we see a
>>> change and so the only way that will fail is if the overall timeout of
>>> the test kicks in. Or we can try and second guess how long it should
>>> take by introducing our own internal timeout - either directly or by
>>> limiting the number of loops in this case. That has its own problems and
>>> in general we have tried to reduce internal test timeouts (by removing
>>> them) and let overall timeouts take charge.
>>>
>>> No ideal solution. And this has already consumed way too much of
>>> everyone's time.
>>>
>>> Cheers,
>>> David
>>>
>>>> Thanks,
>>>> Jc
>>>>
>>>>
>>>> On Tue, Jul 30, 2019 at 2:35 PM David Holmes <david.holmes at oracle.com
>>>> <mailto:david.holmes at oracle.com>> wrote:
>>>>
>>>>      On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
>>>>       > Hi David,   "put that whole code (the while loop) in a helper
>>>>      method."   was JC's idea,  and I like the idea .
>>>>
>>>>      Regardless I think the way you are using NUM_THREAD_DUMPS is
>> really
>>>>      confusing. As an all-caps static you'd expect it to be a constant.
>>>>
>>>>      Thanks,
>>>>      David
>>>>
>>>>       > Let's see what others think .
>>>>       >
>>>>       >>
>>>>       >> Overall tests like this are not very useful, yet very fragile.
>>>>       >>
>>>>       >
>>>>       > I am also  fine with putting the test on the exclude list.
>>>>       >
>>>>       > Best regards, Matthias
>>>>       >
>>>>       >
>>>>       >> -----Original Message-----
>>>>       >> From: David Holmes <david.holmes at oracle.com
>>>>      <mailto:david.holmes at oracle.com>>
>>>>       >> Sent: Dienstag, 30. Juli 2019 14:12
>>>>       >> To: Baesken, Matthias <matthias.baesken at sap.com
>>>>      <mailto:matthias.baesken at sap.com>>; Jean Christophe
>>>>       >> Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>>
>>>>       >> Cc: hotspot-dev at openjdk.java.net
>>>>      <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
>>>>      <serviceability-
>>>>       >> dev at openjdk.java.net <mailto:dev at openjdk.java.net>>
>>>>       >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java
>>>>      fails on fast
>>>>       >> Linux machines with Total safepoint time 0 ms
>>>>       >>
>>>>       >> Hi Matthias,
>>>>       >>
>>>>       >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
>>>>       >>> Hello  JC / David,   here is a second webrev  :
>>>>       >>>
>>>>       >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
>>>>       >>>
>>>>       >>> It moves   the  thread dump execution into a  method
>>>>       >>> executeThreadDumps(long)     , and also adds  while loops
>>>>      (but with a
>>>>       >>> limitation  for the number of thread dumps, really don’t
>>>>       >>> want to cause timeouts etc.).    I removed a check for
>>>>       >>> MAX_VALUE_FOR_PASS   because we cannot go over
>>> Long.MAX_VALUE .
>>>>       >>
>>>>       >> I don't think executeThreadDumps is worth factoring out like out.
>>>>       >>
>>>>       >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd rather
>> it
>>>>       >> remains a constant 100, and then you set a simple loop iteration
>>>>      count
>>>>       >> limit. Further with the proposed code when you get here:
>>>>       >>
>>>>       >>    85         NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
>>>>       >>
>>>>       >> you don't even know what value you may be starting with.
>>>>       >>
>>>>       >> But I was thinking of simply:
>>>>       >>
>>>>       >> long value = 0;
>>>>       >> do {
>>>>       >>       Thread.getAllStackTraces();
>>>>       >>       value = mbean.getTotalSafepointTime();
>>>>       >> } while (value == 0);
>>>>       >>
>>>>       >> We'd only hit a timeout if something is completely broken -
>>>>      which is fine.
>>>>       >>
>>>>       >> Overall tests like this are not very useful, yet very fragile.
>>>>       >>
>>>>       >> Thanks,
>>>>       >> David
>>>>       >>
>>>>       >>> Hope you like this version  better.
>>>>       >>>
>>>>       >>> Best regards, Matthias
>>>>       >>>
>>>>       >>> *From:*Jean Christophe Beyler <jcbeyler at google.com
>>>>      <mailto:jcbeyler at google.com>>
>>>>       >>> *Sent:* Dienstag, 30. Juli 2019 05:39
>>>>       >>> *To:* David Holmes <david.holmes at oracle.com
>>>>      <mailto:david.holmes at oracle.com>>
>>>>       >>> *Cc:* Baesken, Matthias <matthias.baesken at sap.com
>>>>      <mailto:matthias.baesken at sap.com>>;
>>>>       >>> hotspot-dev at openjdk.java.net
>>>>      <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
>>>>       >>> <serviceability-dev at openjdk.java.net
>>>>      <mailto:serviceability-dev at openjdk.java.net>>
>>>>       >>> *Subject:* Re: RFR: [XS] 8228658: test
>>>>      GetTotalSafepointTime.java fails
>>>>       >>> on fast Linux machines with Total safepoint time 0 ms
>>>>       >>>
>>>>       >>> Hi Matthias,
>>>>       >>>
>>>>       >>> I wonder if you should not do what David is suggesting and then
>>>>      put that
>>>>       >>> whole code (the while loop) in a helper method. Below you have a
>>>>       >>> calculation again using value2 (which I wonder what the added
>>>>      value of
>>>>       >>> it is though) but anyway, that value2 could also be 0 at some
>>>>      point, no?
>>>>       >>>
>>>>       >>> So would it not be best to just refactor the getAllStackTraces and
>>>>       >>> calculate safepoint time in a helper method for both value / value2
>>>>       >>> variables?
>>>>       >>>
>>>>       >>> Thanks,
>>>>       >>>
>>>>       >>> Jc
>>>>       >>>
>>>>       >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes
>>>>      <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>>>>       >>> <mailto:david.holmes at oracle.com
>>>>      <mailto:david.holmes at oracle.com>>> wrote:
>>>>       >>>
>>>>       >>>      Hi Matthias,
>>>>       >>>
>>>>       >>>      On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
>>>>       >>>       > Hello , please review this small test fix .
>>>>       >>>       >
>>>>       >>>       > The test
>>>>       >>>
>>>>       >>
>>>
>> test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
>>>>       >> java
>>>>       >>>      fails sometimes on fast Linux machines with this error
>>>>      message :
>>>>       >>>       >
>>>>       >>>       > java.lang.RuntimeException: Total safepoint time
>>>>      illegal value: 0
>>>>       >>>      ms (MIN = 1; MAX = 9223372036854775807)
>>>>       >>>       >
>>>>       >>>       > looks like the total safepoint time is too low
>>>>      currently on these
>>>>       >>>      machines, it is < 1 ms.
>>>>       >>>       >
>>>>       >>>       > There might be several ways to handle this :
>>>>       >>>       >
>>>>       >>>       >    *   Change the test  in a way that it might generate
>>>>      nigher
>>>>       >>>      safepoint times
>>>>       >>>       >    *   Allow  safepoint time  == 0 ms
>>>>       >>>       >    *   Offer an additional interface that gives
>>>>      safepoint times
>>>>       >>>      with finer granularity ( currently the HS has safepoint
>>>>      time values
>>>>       >>>      in ns , see  jdk/src/hotspot/share/runtime/safepoint.cpp
>>>>       >>>        SafepointTracing::end
>>>>       >>>       >
>>>>       >>>       > But it is converted on ms in this code
>>>>       >>>       >
>>>>       >>>       > 114jlong RuntimeService::safepoint_time_ms() {
>>>>       >>>       > 115  return UsePerfData ?
>>>>       >>>       > 116
>>>>       >>>
>>>>      Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -1;
>>>>       >>>       > 117}
>>>>       >>>       >
>>>>       >>>       > 064jlong Management::ticks_to_ms(jlong ticks) {
>>>>       >>>       > 2065  assert(os::elapsed_frequency() > 0, "Must be
>>>>      non-zero");
>>>>       >>>       > 2066  return (jlong)(((double)ticks /
>>>>       >>>      (double)os::elapsed_frequency())
>>>>       >>>       > 2067                 * (double)1000.0);
>>>>       >>>       > 2068}
>>>>       >>>       >
>>>>       >>>       >
>>>>       >>>       >
>>>>       >>>       > Currently I go for  the first attempt (and try to generate
>>>>       >>>      higher safepoint times in my patch) .
>>>>       >>>
>>>>       >>>      Yes that's probably best. Coarse-grained timing on very
>>>>      fast machines
>>>>       >>>      was bound to eventually lead to problems.
>>>>       >>>
>>>>       >>>      But perhaps a more future-proof approach is to just add a
>>>>      do-while loop
>>>>       >>>      around the stack dumps and only exit when we have a non-zero
>>>>       >> safepoint
>>>>       >>>      time?
>>>>       >>>
>>>>       >>>      Thanks,
>>>>       >>>      David
>>>>       >>>      -----
>>>>       >>>
>>>>       >>>       > Bug/webrev :
>>>>       >>>       >
>>>>       >>>       > https://bugs.openjdk.java.net/browse/JDK-8228658
>>>>       >>>       >
>>>>       >>>       > http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
>>>>       >>>       >
>>>>       >>>       >
>>>>       >>>       > Thanks, Matthias
>>>>       >>>       >
>>>>       >>>
>>>>       >>>
>>>>       >>> --
>>>>       >>>
>>>>       >>> Thanks,
>>>>       >>>
>>>>       >>> Jc
>>>>       >>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>> Jc


More information about the serviceability-dev mailing list