RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

Thu Aug 1 07:13:08 UTC 2019

Hi David + JC , thanks for the reviews .

David - I added the  suggested print-outputs  , and  also  the parameter to executeThreadDumps .

Best regards, Matthias

> -----Original Message-----
> From: David Holmes <david.holmes at oracle.com>
> Sent: Mittwoch, 31. Juli 2019 23:57
> To: Baesken, Matthias <matthias.baesken at sap.com>; Jean Christophe
> Beyler <jcbeyler at google.com>
> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
> dev at openjdk.java.net>
> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast
> Linux machines with Total safepoint time 0 ms
> 
> On 1/08/2019 12:01 am, Baesken, Matthias wrote:
> >
> > Hi upload works again, now with webrev :
> >
> > http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.2/
> 
> Could you please add, for diagnostic purposes:
> 
> System.out.println("Total safepoint time (ms): " + value);
> 
> after:
> 
>   60         long value = executeThreadDumps();
> 
> and
> 
>   68         long value2 = executeThreadDumps();
> 
> that way if the test fails we can check logs to see what kind of
> safepoint times have been observed previously. No need to see an updated
> webrev just for that.
> 
> I have one further suggestion, take it or leave it, that
> executeThreadDumps() takes a parameter to specify the initial value, so
> we'd have:
> 
>   60         long value = executeThreadDumps(0);
> 
> and
> 
>   68         long value2 = executeThreadDumps(value);
> 
> This might help detect getTotalSafepointTime() going backwards slightly
> better than current code.
> 
> Thanks,
> David
> 
> > Best regards, Matthias
> >
> >
> >> -----Original Message-----
> >> From: Baesken, Matthias
> >> Sent: Mittwoch, 31. Juli 2019 14:05
> >> To: 'David Holmes' <david.holmes at oracle.com>; Jean Christophe Beyler
> >> <jcbeyler at google.com>
> >> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
> >> dev at openjdk.java.net>
> >> Subject: RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
> fast
> >> Linux machines with Total safepoint time 0 ms
> >>
> >> Hello, here is a version following  the latest proposal of JC .
> >>
> >> Unfortunately  attached as patch,  sorry for that - the uploads / pushes
> >> currently  do not work from here .
> >>
> >> Best regards, Matthias
> >>
> >>
> >>> -----Original Message-----
> >>> From: David Holmes <david.holmes at oracle.com>
> >>> Sent: Mittwoch, 31. Juli 2019 05:04
> >>> To: Jean Christophe Beyler <jcbeyler at google.com>
> >>> Cc: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-
> >>> dev at openjdk.java.net; serviceability-dev <serviceability-
> >>> dev at openjdk.java.net>
> >>> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
> >> fast
> >>> Linux machines with Total safepoint time 0 ms
> >>>
> >>> On 31/07/2019 9:08 am, Jean Christophe Beyler wrote:
> >>>> FWIW, I would have done something like what David was suggesting,
> just
> >>>> slightly tweaked:
> >>>>
> >>>> public static long executeThreadDumps() {
> >>>>    long value;
> >>>>    long initial_value = mbean.getTotalSafepointTime();
> >>>>    do {
> >>>>        Thread.getAllStackTraces();
> >>>>        value = mbean.getTotalSafepointTime();
> >>>>    } while (value == initial_value);
> >>>>    return value;
> >>>> }
> >>>>
> >>>> This ensures that the value is a new value as opposed to the current
> >>>> value and if something goes wrong, as David said, it will timeout; which
> >>>> is ok.
> >>>
> >>> Works for me.
> >>>
> >>>> But I come back to not really understanding why we are doing this at
> >>>> this point of relaxing (just get a new value of safepoint time).
> >>>> Because, if we accept timeouts now as a failure here, then really the
> >>>> whole test becomes:
> >>>>
> >>>> executeThreadDumps();
> >>>> executeThreadDumps();
> >>>>
> >>>> Since the first call will return when value > 0 and the second call will
> >>>> return when value2 > value (I still wonder why we want to ensure it
> >>>> works twice...).
> >>>
> >>> The test is trying to sanity check that we are actually recording the
> >>> time used by safepoints. So first check is that we can get a non-zero
> >>> value; second check is we get a greater non-zero value. It's just a
> >>> sanity test to try and catch if something gets unexpectedly broken in
> >>> the time tracking code.
> >>>
> >>>> So both failures and even testing for it is kind of redundant, once you
> >>>> have a do/while until a change?
> >>>
> >>> Yes - the problem with the tests that try to check internal VM behaviour
> >>> is that we have no specified way to do something, in this case execute
> >>> safepoints, that relates to internal VM behaviour, so we have to do
> >>> something we know will currently work even if not specified to do so -
> >>> e.g. dumping all thread stacks uses a global safepoint. The second
> >>> problem is that the timer granularity is so coarse that we then have to
> >>> guess how many times we need to do that something before seeing a
> >>> change. To make the test robust we can keep doing stuff until we see a
> >>> change and so the only way that will fail is if the overall timeout of
> >>> the test kicks in. Or we can try and second guess how long it should
> >>> take by introducing our own internal timeout - either directly or by
> >>> limiting the number of loops in this case. That has its own problems and
> >>> in general we have tried to reduce internal test timeouts (by removing
> >>> them) and let overall timeouts take charge.
> >>>
> >>> No ideal solution. And this has already consumed way too much of
> >>> everyone's time.
> >>>
> >>> Cheers,
> >>> David
> >>>
> >>>> Thanks,
> >>>> Jc
> >>>>
> >>>>
> >>>> On Tue, Jul 30, 2019 at 2:35 PM David Holmes
> <david.holmes at oracle.com
> >>>> <mailto:david.holmes at oracle.com>> wrote:
> >>>>
> >>>>      On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
> >>>>       > Hi David,   "put that whole code (the while loop) in a helper
> >>>>      method."   was JC's idea,  and I like the idea .
> >>>>
> >>>>      Regardless I think the way you are using NUM_THREAD_DUMPS is
> >> really
> >>>>      confusing. As an all-caps static you'd expect it to be a constant.
> >>>>
> >>>>      Thanks,
> >>>>      David
> >>>>
> >>>>       > Let's see what others think .
> >>>>       >
> >>>>       >>
> >>>>       >> Overall tests like this are not very useful, yet very fragile.
> >>>>       >>
> >>>>       >
> >>>>       > I am also  fine with putting the test on the exclude list.
> >>>>       >
> >>>>       > Best regards, Matthias
> >>>>       >
> >>>>       >
> >>>>       >> -----Original Message-----
> >>>>       >> From: David Holmes <david.holmes at oracle.com
> >>>>      <mailto:david.holmes at oracle.com>>
> >>>>       >> Sent: Dienstag, 30. Juli 2019 14:12
> >>>>       >> To: Baesken, Matthias <matthias.baesken at sap.com
> >>>>      <mailto:matthias.baesken at sap.com>>; Jean Christophe
> >>>>       >> Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>>
> >>>>       >> Cc: hotspot-dev at openjdk.java.net
> >>>>      <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
> >>>>      <serviceability-
> >>>>       >> dev at openjdk.java.net <mailto:dev at openjdk.java.net>>
> >>>>       >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java
> >>>>      fails on fast
> >>>>       >> Linux machines with Total safepoint time 0 ms
> >>>>       >>
> >>>>       >> Hi Matthias,
> >>>>       >>
> >>>>       >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
> >>>>       >>> Hello  JC / David,   here is a second webrev  :
> >>>>       >>>
> >>>>       >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
> >>>>       >>>
> >>>>       >>> It moves   the  thread dump execution into a  method
> >>>>       >>> executeThreadDumps(long)     , and also adds  while loops
> >>>>      (but with a
> >>>>       >>> limitation  for the number of thread dumps, really don’t
> >>>>       >>> want to cause timeouts etc.).    I removed a check for
> >>>>       >>> MAX_VALUE_FOR_PASS   because we cannot go over
> >>> Long.MAX_VALUE .
> >>>>       >>
> >>>>       >> I don't think executeThreadDumps is worth factoring out like out.
> >>>>       >>
> >>>>       >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd
> rather
> >> it
> >>>>       >> remains a constant 100, and then you set a simple loop iteration
> >>>>      count
> >>>>       >> limit. Further with the proposed code when you get here:
> >>>>       >>
> >>>>       >>    85         NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
> >>>>       >>
> >>>>       >> you don't even know what value you may be starting with.
> >>>>       >>
> >>>>       >> But I was thinking of simply:
> >>>>       >>
> >>>>       >> long value = 0;
> >>>>       >> do {
> >>>>       >>       Thread.getAllStackTraces();
> >>>>       >>       value = mbean.getTotalSafepointTime();
> >>>>       >> } while (value == 0);
> >>>>       >>
> >>>>       >> We'd only hit a timeout if something is completely broken -
> >>>>      which is fine.
> >>>>       >>
> >>>>       >> Overall tests like this are not very useful, yet very fragile.
> >>>>       >>
> >>>>       >> Thanks,
> >>>>       >> David
> >>>>       >>
> >>>>       >>> Hope you like this version  better.
> >>>>       >>>
> >>>>       >>> Best regards, Matthias
> >>>>       >>>
> >>>>       >>> *From:*Jean Christophe Beyler <jcbeyler at google.com
> >>>>      <mailto:jcbeyler at google.com>>
> >>>>       >>> *Sent:* Dienstag, 30. Juli 2019 05:39
> >>>>       >>> *To:* David Holmes <david.holmes at oracle.com
> >>>>      <mailto:david.holmes at oracle.com>>
> >>>>       >>> *Cc:* Baesken, Matthias <matthias.baesken at sap.com
> >>>>      <mailto:matthias.baesken at sap.com>>;
> >>>>       >>> hotspot-dev at openjdk.java.net
> >>>>      <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
> >>>>       >>> <serviceability-dev at openjdk.java.net
> >>>>      <mailto:serviceability-dev at openjdk.java.net>>
> >>>>       >>> *Subject:* Re: RFR: [XS] 8228658: test
> >>>>      GetTotalSafepointTime.java fails
> >>>>       >>> on fast Linux machines with Total safepoint time 0 ms
> >>>>       >>>
> >>>>       >>> Hi Matthias,
> >>>>       >>>
> >>>>       >>> I wonder if you should not do what David is suggesting and then
> >>>>      put that
> >>>>       >>> whole code (the while loop) in a helper method. Below you
> have a
> >>>>       >>> calculation again using value2 (which I wonder what the added
> >>>>      value of
> >>>>       >>> it is though) but anyway, that value2 could also be 0 at some
> >>>>      point, no?
> >>>>       >>>
> >>>>       >>> So would it not be best to just refactor the getAllStackTraces
> and
> >>>>       >>> calculate safepoint time in a helper method for both value /
> value2
> >>>>       >>> variables?
> >>>>       >>>
> >>>>       >>> Thanks,
> >>>>       >>>
> >>>>       >>> Jc
> >>>>       >>>
> >>>>       >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes
> >>>>      <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
> >>>>       >>> <mailto:david.holmes at oracle.com
> >>>>      <mailto:david.holmes at oracle.com>>> wrote:
> >>>>       >>>
> >>>>       >>>      Hi Matthias,
> >>>>       >>>
> >>>>       >>>      On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
> >>>>       >>>       > Hello , please review this small test fix .
> >>>>       >>>       >
> >>>>       >>>       > The test
> >>>>       >>>
> >>>>       >>
> >>>
> >>
> test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
> >>>>       >> java
> >>>>       >>>      fails sometimes on fast Linux machines with this error
> >>>>      message :
> >>>>       >>>       >
> >>>>       >>>       > java.lang.RuntimeException: Total safepoint time
> >>>>      illegal value: 0
> >>>>       >>>      ms (MIN = 1; MAX = 9223372036854775807)
> >>>>       >>>       >
> >>>>       >>>       > looks like the total safepoint time is too low
> >>>>      currently on these
> >>>>       >>>      machines, it is < 1 ms.
> >>>>       >>>       >
> >>>>       >>>       > There might be several ways to handle this :
> >>>>       >>>       >
> >>>>       >>>       >    *   Change the test  in a way that it might generate
> >>>>      nigher
> >>>>       >>>      safepoint times
> >>>>       >>>       >    *   Allow  safepoint time  == 0 ms
> >>>>       >>>       >    *   Offer an additional interface that gives
> >>>>      safepoint times
> >>>>       >>>      with finer granularity ( currently the HS has safepoint
> >>>>      time values
> >>>>       >>>      in ns , see  jdk/src/hotspot/share/runtime/safepoint.cpp
> >>>>       >>>        SafepointTracing::end
> >>>>       >>>       >
> >>>>       >>>       > But it is converted on ms in this code
> >>>>       >>>       >
> >>>>       >>>       > 114jlong RuntimeService::safepoint_time_ms() {
> >>>>       >>>       > 115  return UsePerfData ?
> >>>>       >>>       > 116
> >>>>       >>>
> >>>>      Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -
> 1;
> >>>>       >>>       > 117}
> >>>>       >>>       >
> >>>>       >>>       > 064jlong Management::ticks_to_ms(jlong ticks) {
> >>>>       >>>       > 2065  assert(os::elapsed_frequency() > 0, "Must be
> >>>>      non-zero");
> >>>>       >>>       > 2066  return (jlong)(((double)ticks /
> >>>>       >>>      (double)os::elapsed_frequency())
> >>>>       >>>       > 2067                 * (double)1000.0);
> >>>>       >>>       > 2068}
> >>>>       >>>       >
> >>>>       >>>       >
> >>>>       >>>       >
> >>>>       >>>       > Currently I go for  the first attempt (and try to generate
> >>>>       >>>      higher safepoint times in my patch) .
> >>>>       >>>
> >>>>       >>>      Yes that's probably best. Coarse-grained timing on very
> >>>>      fast machines
> >>>>       >>>      was bound to eventually lead to problems.
> >>>>       >>>
> >>>>       >>>      But perhaps a more future-proof approach is to just add a
> >>>>      do-while loop
> >>>>       >>>      around the stack dumps and only exit when we have a non-
> zero
> >>>>       >> safepoint
> >>>>       >>>      time?
> >>>>       >>>
> >>>>       >>>      Thanks,
> >>>>       >>>      David
> >>>>       >>>      -----
> >>>>       >>>
> >>>>       >>>       > Bug/webrev :
> >>>>       >>>       >
> >>>>       >>>       > https://bugs.openjdk.java.net/browse/JDK-8228658
> >>>>       >>>       >
> >>>>       >>>       >
> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
> >>>>       >>>       >
> >>>>       >>>       >
> >>>>       >>>       > Thanks, Matthias
> >>>>       >>>       >
> >>>>       >>>
> >>>>       >>>
> >>>>       >>> --
> >>>>       >>>
> >>>>       >>> Thanks,
> >>>>       >>>
> >>>>       >>> Jc
> >>>>       >>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> Thanks,
> >>>> Jc