RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

Tue Jul 30 23:08:23 UTC 2019

FWIW, I would have done something like what David was suggesting, just
slightly tweaked:

public static long executeThreadDumps() {
 long value;
 long initial_value = mbean.getTotalSafepointTime();
 do {
     Thread.getAllStackTraces();
     value = mbean.getTotalSafepointTime();
 } while (value == initial_value);
 return value;
}

This ensures that the value is a new value as opposed to the current value
and if something goes wrong, as David said, it will timeout; which is ok.

But I come back to not really understanding why we are doing this at this
point of relaxing (just get a new value of safepoint time). Because, if we
accept timeouts now as a failure here, then really the whole test becomes:

executeThreadDumps();
executeThreadDumps();

Since the first call will return when value > 0 and the second call will
return when value2 > value (I still wonder why we want to ensure it works
twice...).

So both failures and even testing for it is kind of redundant, once you
have a do/while until a change?

Thanks,
Jc

On Tue, Jul 30, 2019 at 2:35 PM David Holmes <david.holmes at oracle.com>
wrote:

> On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
> > Hi David,   "put that whole code (the while loop) in a helper method."
>  was JC's idea,  and I like the idea .
>
> Regardless I think the way you are using NUM_THREAD_DUMPS is really
> confusing. As an all-caps static you'd expect it to be a constant.
>
> Thanks,
> David
>
> > Let's see what others think .
> >
> >>
> >> Overall tests like this are not very useful, yet very fragile.
> >>
> >
> > I am also  fine with putting the test on the exclude list.
> >
> > Best regards, Matthias
> >
> >
> >> -----Original Message-----
> >> From: David Holmes <david.holmes at oracle.com>
> >> Sent: Dienstag, 30. Juli 2019 14:12
> >> To: Baesken, Matthias <matthias.baesken at sap.com>; Jean Christophe
> >> Beyler <jcbeyler at google.com>
> >> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
> >> dev at openjdk.java.net>
> >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails
> on fast
> >> Linux machines with Total safepoint time 0 ms
> >>
> >> Hi Matthias,
> >>
> >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
> >>> Hello  JC / David,   here is a second webrev  :
> >>>
> >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
> >>>
> >>> It moves   the  thread dump execution into a  method
> >>> executeThreadDumps(long)     , and also adds  while loops   (but with a
> >>> limitation  for the number of thread dumps, really don’t
> >>> want to cause timeouts etc.).    I removed a check for
> >>> MAX_VALUE_FOR_PASS   because we cannot go over Long.MAX_VALUE .
> >>
> >> I don't think executeThreadDumps is worth factoring out like out.
> >>
> >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd rather it
> >> remains a constant 100, and then you set a simple loop iteration count
> >> limit. Further with the proposed code when you get here:
> >>
> >>    85         NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
> >>
> >> you don't even know what value you may be starting with.
> >>
> >> But I was thinking of simply:
> >>
> >> long value = 0;
> >> do {
> >>       Thread.getAllStackTraces();
> >>       value = mbean.getTotalSafepointTime();
> >> } while (value == 0);
> >>
> >> We'd only hit a timeout if something is completely broken - which is
> fine.
> >>
> >> Overall tests like this are not very useful, yet very fragile.
> >>
> >> Thanks,
> >> David
> >>
> >>> Hope you like this version  better.
> >>>
> >>> Best regards, Matthias
> >>>
> >>> *From:*Jean Christophe Beyler <jcbeyler at google.com>
> >>> *Sent:* Dienstag, 30. Juli 2019 05:39
> >>> *To:* David Holmes <david.holmes at oracle.com>
> >>> *Cc:* Baesken, Matthias <matthias.baesken at sap.com>;
> >>> hotspot-dev at openjdk.java.net; serviceability-dev
> >>> <serviceability-dev at openjdk.java.net>
> >>> *Subject:* Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails
> >>> on fast Linux machines with Total safepoint time 0 ms
> >>>
> >>> Hi Matthias,
> >>>
> >>> I wonder if you should not do what David is suggesting and then put
> that
> >>> whole code (the while loop) in a helper method. Below you have a
> >>> calculation again using value2 (which I wonder what the added value of
> >>> it is though) but anyway, that value2 could also be 0 at some point,
> no?
> >>>
> >>> So would it not be best to just refactor the getAllStackTraces and
> >>> calculate safepoint time in a helper method for both value / value2
> >>> variables?
> >>>
> >>> Thanks,
> >>>
> >>> Jc
> >>>
> >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes <david.holmes at oracle.com
> >>> <mailto:david.holmes at oracle.com>> wrote:
> >>>
> >>>      Hi Matthias,
> >>>
> >>>      On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
> >>>       > Hello , please review this small test fix .
> >>>       >
> >>>       > The test
> >>>
> >> test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
> >> java
> >>>      fails sometimes on fast Linux machines with this error message :
> >>>       >
> >>>       > java.lang.RuntimeException: Total safepoint time illegal
> value: 0
> >>>      ms (MIN = 1; MAX = 9223372036854775807)
> >>>       >
> >>>       > looks like the total safepoint time is too low currently on
> these
> >>>      machines, it is < 1 ms.
> >>>       >
> >>>       > There might be several ways to handle this :
> >>>       >
> >>>       >    *   Change the test  in a way that it might generate nigher
> >>>      safepoint times
> >>>       >    *   Allow  safepoint time  == 0 ms
> >>>       >    *   Offer an additional interface that gives  safepoint
> times
> >>>      with finer granularity ( currently the HS has safepoint time
> values
> >>>      in ns , see  jdk/src/hotspot/share/runtime/safepoint.cpp
> >>>        SafepointTracing::end
> >>>       >
> >>>       > But it is converted on ms in this code
> >>>       >
> >>>       > 114jlong RuntimeService::safepoint_time_ms() {
> >>>       > 115  return UsePerfData ?
> >>>       > 116
> >>>      Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -1;
> >>>       > 117}
> >>>       >
> >>>       > 064jlong Management::ticks_to_ms(jlong ticks) {
> >>>       > 2065  assert(os::elapsed_frequency() > 0, "Must be non-zero");
> >>>       > 2066  return (jlong)(((double)ticks /
> >>>      (double)os::elapsed_frequency())
> >>>       > 2067                 * (double)1000.0);
> >>>       > 2068}
> >>>       >
> >>>       >
> >>>       >
> >>>       > Currently I go for  the first attempt (and try to generate
> >>>      higher safepoint times in my patch) .
> >>>
> >>>      Yes that's probably best. Coarse-grained timing on very fast
> machines
> >>>      was bound to eventually lead to problems.
> >>>
> >>>      But perhaps a more future-proof approach is to just add a
> do-while loop
> >>>      around the stack dumps and only exit when we have a non-zero
> >> safepoint
> >>>      time?
> >>>
> >>>      Thanks,
> >>>      David
> >>>      -----
> >>>
> >>>       > Bug/webrev :
> >>>       >
> >>>       > https://bugs.openjdk.java.net/browse/JDK-8228658
> >>>       >
> >>>       > http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
> >>>       >
> >>>       >
> >>>       > Thanks, Matthias
> >>>       >
> >>>
> >>>
> >>> --
> >>>
> >>> Thanks,
> >>>
> >>> Jc
> >>>
>

-- 

Thanks,
Jc
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20190730/63420f78/attachment-0001.html>