RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms
Jean Christophe Beyler
jcbeyler at google.com
Wed Jul 31 19:07:36 UTC 2019
Hi Matthias,
Looks good to me :)
Jc
On Wed, Jul 31, 2019 at 7:01 AM Baesken, Matthias <matthias.baesken at sap.com>
wrote:
>
> Hi upload works again, now with webrev :
>
> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.2/
>
> Best regards, Matthias
>
>
> > -----Original Message-----
> > From: Baesken, Matthias
> > Sent: Mittwoch, 31. Juli 2019 14:05
> > To: 'David Holmes' <david.holmes at oracle.com>; Jean Christophe Beyler
> > <jcbeyler at google.com>
> > Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
> > dev at openjdk.java.net>
> > Subject: RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
> fast
> > Linux machines with Total safepoint time 0 ms
> >
> > Hello, here is a version following the latest proposal of JC .
> >
> > Unfortunately attached as patch, sorry for that - the uploads / pushes
> > currently do not work from here .
> >
> > Best regards, Matthias
> >
> >
> > > -----Original Message-----
> > > From: David Holmes <david.holmes at oracle.com>
> > > Sent: Mittwoch, 31. Juli 2019 05:04
> > > To: Jean Christophe Beyler <jcbeyler at google.com>
> > > Cc: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-
> > > dev at openjdk.java.net; serviceability-dev <serviceability-
> > > dev at openjdk.java.net>
> > > Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails
> on
> > fast
> > > Linux machines with Total safepoint time 0 ms
> > >
> > > On 31/07/2019 9:08 am, Jean Christophe Beyler wrote:
> > > > FWIW, I would have done something like what David was suggesting,
> just
> > > > slightly tweaked:
> > > >
> > > > public static long executeThreadDumps() {
> > > > long value;
> > > > long initial_value = mbean.getTotalSafepointTime();
> > > > do {
> > > > Thread.getAllStackTraces();
> > > > value = mbean.getTotalSafepointTime();
> > > > } while (value == initial_value);
> > > > return value;
> > > > }
> > > >
> > > > This ensures that the value is a new value as opposed to the current
> > > > value and if something goes wrong, as David said, it will timeout;
> which
> > > > is ok.
> > >
> > > Works for me.
> > >
> > > > But I come back to not really understanding why we are doing this at
> > > > this point of relaxing (just get a new value of safepoint time).
> > > > Because, if we accept timeouts now as a failure here, then really the
> > > > whole test becomes:
> > > >
> > > > executeThreadDumps();
> > > > executeThreadDumps();
> > > >
> > > > Since the first call will return when value > 0 and the second call
> will
> > > > return when value2 > value (I still wonder why we want to ensure it
> > > > works twice...).
> > >
> > > The test is trying to sanity check that we are actually recording the
> > > time used by safepoints. So first check is that we can get a non-zero
> > > value; second check is we get a greater non-zero value. It's just a
> > > sanity test to try and catch if something gets unexpectedly broken in
> > > the time tracking code.
> > >
> > > > So both failures and even testing for it is kind of redundant, once
> you
> > > > have a do/while until a change?
> > >
> > > Yes - the problem with the tests that try to check internal VM
> behaviour
> > > is that we have no specified way to do something, in this case execute
> > > safepoints, that relates to internal VM behaviour, so we have to do
> > > something we know will currently work even if not specified to do so -
> > > e.g. dumping all thread stacks uses a global safepoint. The second
> > > problem is that the timer granularity is so coarse that we then have to
> > > guess how many times we need to do that something before seeing a
> > > change. To make the test robust we can keep doing stuff until we see a
> > > change and so the only way that will fail is if the overall timeout of
> > > the test kicks in. Or we can try and second guess how long it should
> > > take by introducing our own internal timeout - either directly or by
> > > limiting the number of loops in this case. That has its own problems
> and
> > > in general we have tried to reduce internal test timeouts (by removing
> > > them) and let overall timeouts take charge.
> > >
> > > No ideal solution. And this has already consumed way too much of
> > > everyone's time.
> > >
> > > Cheers,
> > > David
> > >
> > > > Thanks,
> > > > Jc
> > > >
> > > >
> > > > On Tue, Jul 30, 2019 at 2:35 PM David Holmes <
> david.holmes at oracle.com
> > > > <mailto:david.holmes at oracle.com>> wrote:
> > > >
> > > > On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
> > > > > Hi David, "put that whole code (the while loop) in a helper
> > > > method." was JC's idea, and I like the idea .
> > > >
> > > > Regardless I think the way you are using NUM_THREAD_DUMPS is
> > really
> > > > confusing. As an all-caps static you'd expect it to be a
> constant.
> > > >
> > > > Thanks,
> > > > David
> > > >
> > > > > Let's see what others think .
> > > > >
> > > > >>
> > > > >> Overall tests like this are not very useful, yet very
> fragile.
> > > > >>
> > > > >
> > > > > I am also fine with putting the test on the exclude list.
> > > > >
> > > > > Best regards, Matthias
> > > > >
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: David Holmes <david.holmes at oracle.com
> > > > <mailto:david.holmes at oracle.com>>
> > > > >> Sent: Dienstag, 30. Juli 2019 14:12
> > > > >> To: Baesken, Matthias <matthias.baesken at sap.com
> > > > <mailto:matthias.baesken at sap.com>>; Jean Christophe
> > > > >> Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>>
> > > > >> Cc: hotspot-dev at openjdk.java.net
> > > > <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
> > > > <serviceability-
> > > > >> dev at openjdk.java.net <mailto:dev at openjdk.java.net>>
> > > > >> Subject: Re: RFR: [XS] 8228658: test
> GetTotalSafepointTime.java
> > > > fails on fast
> > > > >> Linux machines with Total safepoint time 0 ms
> > > > >>
> > > > >> Hi Matthias,
> > > > >>
> > > > >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
> > > > >>> Hello JC / David, here is a second webrev :
> > > > >>>
> > > > >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
> > > > >>>
> > > > >>> It moves the thread dump execution into a method
> > > > >>> executeThreadDumps(long) , and also adds while loops
> > > > (but with a
> > > > >>> limitation for the number of thread dumps, really don’t
> > > > >>> want to cause timeouts etc.). I removed a check for
> > > > >>> MAX_VALUE_FOR_PASS because we cannot go over
> > > Long.MAX_VALUE .
> > > > >>
> > > > >> I don't think executeThreadDumps is worth factoring out like
> out.
> > > > >>
> > > > >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd
> rather
> > it
> > > > >> remains a constant 100, and then you set a simple loop
> iteration
> > > > count
> > > > >> limit. Further with the proposed code when you get here:
> > > > >>
> > > > >> 85 NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
> > > > >>
> > > > >> you don't even know what value you may be starting with.
> > > > >>
> > > > >> But I was thinking of simply:
> > > > >>
> > > > >> long value = 0;
> > > > >> do {
> > > > >> Thread.getAllStackTraces();
> > > > >> value = mbean.getTotalSafepointTime();
> > > > >> } while (value == 0);
> > > > >>
> > > > >> We'd only hit a timeout if something is completely broken -
> > > > which is fine.
> > > > >>
> > > > >> Overall tests like this are not very useful, yet very
> fragile.
> > > > >>
> > > > >> Thanks,
> > > > >> David
> > > > >>
> > > > >>> Hope you like this version better.
> > > > >>>
> > > > >>> Best regards, Matthias
> > > > >>>
> > > > >>> *From:*Jean Christophe Beyler <jcbeyler at google.com
> > > > <mailto:jcbeyler at google.com>>
> > > > >>> *Sent:* Dienstag, 30. Juli 2019 05:39
> > > > >>> *To:* David Holmes <david.holmes at oracle.com
> > > > <mailto:david.holmes at oracle.com>>
> > > > >>> *Cc:* Baesken, Matthias <matthias.baesken at sap.com
> > > > <mailto:matthias.baesken at sap.com>>;
> > > > >>> hotspot-dev at openjdk.java.net
> > > > <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
> > > > >>> <serviceability-dev at openjdk.java.net
> > > > <mailto:serviceability-dev at openjdk.java.net>>
> > > > >>> *Subject:* Re: RFR: [XS] 8228658: test
> > > > GetTotalSafepointTime.java fails
> > > > >>> on fast Linux machines with Total safepoint time 0 ms
> > > > >>>
> > > > >>> Hi Matthias,
> > > > >>>
> > > > >>> I wonder if you should not do what David is suggesting and
> then
> > > > put that
> > > > >>> whole code (the while loop) in a helper method. Below you
> have a
> > > > >>> calculation again using value2 (which I wonder what the
> added
> > > > value of
> > > > >>> it is though) but anyway, that value2 could also be 0 at
> some
> > > > point, no?
> > > > >>>
> > > > >>> So would it not be best to just refactor the
> getAllStackTraces and
> > > > >>> calculate safepoint time in a helper method for both value
> / value2
> > > > >>> variables?
> > > > >>>
> > > > >>> Thanks,
> > > > >>>
> > > > >>> Jc
> > > > >>>
> > > > >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes
> > > > <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
> > > > >>> <mailto:david.holmes at oracle.com
> > > > <mailto:david.holmes at oracle.com>>> wrote:
> > > > >>>
> > > > >>> Hi Matthias,
> > > > >>>
> > > > >>> On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
> > > > >>> > Hello , please review this small test fix .
> > > > >>> >
> > > > >>> > The test
> > > > >>>
> > > > >>
> > >
> > test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
> > > > >> java
> > > > >>> fails sometimes on fast Linux machines with this error
> > > > message :
> > > > >>> >
> > > > >>> > java.lang.RuntimeException: Total safepoint time
> > > > illegal value: 0
> > > > >>> ms (MIN = 1; MAX = 9223372036854775807)
> > > > >>> >
> > > > >>> > looks like the total safepoint time is too low
> > > > currently on these
> > > > >>> machines, it is < 1 ms.
> > > > >>> >
> > > > >>> > There might be several ways to handle this :
> > > > >>> >
> > > > >>> > * Change the test in a way that it might
> generate
> > > > nigher
> > > > >>> safepoint times
> > > > >>> > * Allow safepoint time == 0 ms
> > > > >>> > * Offer an additional interface that gives
> > > > safepoint times
> > > > >>> with finer granularity ( currently the HS has safepoint
> > > > time values
> > > > >>> in ns , see
> jdk/src/hotspot/share/runtime/safepoint.cpp
> > > > >>> SafepointTracing::end
> > > > >>> >
> > > > >>> > But it is converted on ms in this code
> > > > >>> >
> > > > >>> > 114jlong RuntimeService::safepoint_time_ms() {
> > > > >>> > 115 return UsePerfData ?
> > > > >>> > 116
> > > > >>>
> > > > Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -1;
> > > > >>> > 117}
> > > > >>> >
> > > > >>> > 064jlong Management::ticks_to_ms(jlong ticks) {
> > > > >>> > 2065 assert(os::elapsed_frequency() > 0, "Must be
> > > > non-zero");
> > > > >>> > 2066 return (jlong)(((double)ticks /
> > > > >>> (double)os::elapsed_frequency())
> > > > >>> > 2067 * (double)1000.0);
> > > > >>> > 2068}
> > > > >>> >
> > > > >>> >
> > > > >>> >
> > > > >>> > Currently I go for the first attempt (and try to
> generate
> > > > >>> higher safepoint times in my patch) .
> > > > >>>
> > > > >>> Yes that's probably best. Coarse-grained timing on very
> > > > fast machines
> > > > >>> was bound to eventually lead to problems.
> > > > >>>
> > > > >>> But perhaps a more future-proof approach is to just
> add a
> > > > do-while loop
> > > > >>> around the stack dumps and only exit when we have a
> non-zero
> > > > >> safepoint
> > > > >>> time?
> > > > >>>
> > > > >>> Thanks,
> > > > >>> David
> > > > >>> -----
> > > > >>>
> > > > >>> > Bug/webrev :
> > > > >>> >
> > > > >>> > https://bugs.openjdk.java.net/browse/JDK-8228658
> > > > >>> >
> > > > >>> >
> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
> > > > >>> >
> > > > >>> >
> > > > >>> > Thanks, Matthias
> > > > >>> >
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>>
> > > > >>> Thanks,
> > > > >>>
> > > > >>> Jc
> > > > >>>
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Thanks,
> > > > Jc
>
--
Thanks,
Jc
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20190731/3efc7bad/attachment-0001.html>
More information about the serviceability-dev
mailing list