RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms
David Holmes
david.holmes at oracle.com
Wed Jul 31 03:04:21 UTC 2019
On 31/07/2019 9:08 am, Jean Christophe Beyler wrote:
> FWIW, I would have done something like what David was suggesting, just
> slightly tweaked:
>
> public static long executeThreadDumps() {
> long value;
> long initial_value = mbean.getTotalSafepointTime();
> do {
> Thread.getAllStackTraces();
> value = mbean.getTotalSafepointTime();
> } while (value == initial_value);
> return value;
> }
>
> This ensures that the value is a new value as opposed to the current
> value and if something goes wrong, as David said, it will timeout; which
> is ok.
Works for me.
> But I come back to not really understanding why we are doing this at
> this point of relaxing (just get a new value of safepoint time).
> Because, if we accept timeouts now as a failure here, then really the
> whole test becomes:
>
> executeThreadDumps();
> executeThreadDumps();
>
> Since the first call will return when value > 0 and the second call will
> return when value2 > value (I still wonder why we want to ensure it
> works twice...).
The test is trying to sanity check that we are actually recording the
time used by safepoints. So first check is that we can get a non-zero
value; second check is we get a greater non-zero value. It's just a
sanity test to try and catch if something gets unexpectedly broken in
the time tracking code.
> So both failures and even testing for it is kind of redundant, once you
> have a do/while until a change?
Yes - the problem with the tests that try to check internal VM behaviour
is that we have no specified way to do something, in this case execute
safepoints, that relates to internal VM behaviour, so we have to do
something we know will currently work even if not specified to do so -
e.g. dumping all thread stacks uses a global safepoint. The second
problem is that the timer granularity is so coarse that we then have to
guess how many times we need to do that something before seeing a
change. To make the test robust we can keep doing stuff until we see a
change and so the only way that will fail is if the overall timeout of
the test kicks in. Or we can try and second guess how long it should
take by introducing our own internal timeout - either directly or by
limiting the number of loops in this case. That has its own problems and
in general we have tried to reduce internal test timeouts (by removing
them) and let overall timeouts take charge.
No ideal solution. And this has already consumed way too much of
everyone's time.
Cheers,
David
> Thanks,
> Jc
>
>
> On Tue, Jul 30, 2019 at 2:35 PM David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
> On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
> > Hi David, "put that whole code (the while loop) in a helper
> method." was JC's idea, and I like the idea .
>
> Regardless I think the way you are using NUM_THREAD_DUMPS is really
> confusing. As an all-caps static you'd expect it to be a constant.
>
> Thanks,
> David
>
> > Let's see what others think .
> >
> >>
> >> Overall tests like this are not very useful, yet very fragile.
> >>
> >
> > I am also fine with putting the test on the exclude list.
> >
> > Best regards, Matthias
> >
> >
> >> -----Original Message-----
> >> From: David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>>
> >> Sent: Dienstag, 30. Juli 2019 14:12
> >> To: Baesken, Matthias <matthias.baesken at sap.com
> <mailto:matthias.baesken at sap.com>>; Jean Christophe
> >> Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>>
> >> Cc: hotspot-dev at openjdk.java.net
> <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
> <serviceability-
> >> dev at openjdk.java.net <mailto:dev at openjdk.java.net>>
> >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java
> fails on fast
> >> Linux machines with Total safepoint time 0 ms
> >>
> >> Hi Matthias,
> >>
> >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
> >>> Hello JC / David, here is a second webrev :
> >>>
> >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
> >>>
> >>> It moves the thread dump execution into a method
> >>> executeThreadDumps(long) , and also adds while loops
> (but with a
> >>> limitation for the number of thread dumps, really don’t
> >>> want to cause timeouts etc.). I removed a check for
> >>> MAX_VALUE_FOR_PASS because we cannot go over Long.MAX_VALUE .
> >>
> >> I don't think executeThreadDumps is worth factoring out like out.
> >>
> >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd rather it
> >> remains a constant 100, and then you set a simple loop iteration
> count
> >> limit. Further with the proposed code when you get here:
> >>
> >> 85 NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
> >>
> >> you don't even know what value you may be starting with.
> >>
> >> But I was thinking of simply:
> >>
> >> long value = 0;
> >> do {
> >> Thread.getAllStackTraces();
> >> value = mbean.getTotalSafepointTime();
> >> } while (value == 0);
> >>
> >> We'd only hit a timeout if something is completely broken -
> which is fine.
> >>
> >> Overall tests like this are not very useful, yet very fragile.
> >>
> >> Thanks,
> >> David
> >>
> >>> Hope you like this version better.
> >>>
> >>> Best regards, Matthias
> >>>
> >>> *From:*Jean Christophe Beyler <jcbeyler at google.com
> <mailto:jcbeyler at google.com>>
> >>> *Sent:* Dienstag, 30. Juli 2019 05:39
> >>> *To:* David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>>
> >>> *Cc:* Baesken, Matthias <matthias.baesken at sap.com
> <mailto:matthias.baesken at sap.com>>;
> >>> hotspot-dev at openjdk.java.net
> <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
> >>> <serviceability-dev at openjdk.java.net
> <mailto:serviceability-dev at openjdk.java.net>>
> >>> *Subject:* Re: RFR: [XS] 8228658: test
> GetTotalSafepointTime.java fails
> >>> on fast Linux machines with Total safepoint time 0 ms
> >>>
> >>> Hi Matthias,
> >>>
> >>> I wonder if you should not do what David is suggesting and then
> put that
> >>> whole code (the while loop) in a helper method. Below you have a
> >>> calculation again using value2 (which I wonder what the added
> value of
> >>> it is though) but anyway, that value2 could also be 0 at some
> point, no?
> >>>
> >>> So would it not be best to just refactor the getAllStackTraces and
> >>> calculate safepoint time in a helper method for both value / value2
> >>> variables?
> >>>
> >>> Thanks,
> >>>
> >>> Jc
> >>>
> >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes
> <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
> >>> <mailto:david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>>> wrote:
> >>>
> >>> Hi Matthias,
> >>>
> >>> On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
> >>> > Hello , please review this small test fix .
> >>> >
> >>> > The test
> >>>
> >> test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
> >> java
> >>> fails sometimes on fast Linux machines with this error
> message :
> >>> >
> >>> > java.lang.RuntimeException: Total safepoint time
> illegal value: 0
> >>> ms (MIN = 1; MAX = 9223372036854775807)
> >>> >
> >>> > looks like the total safepoint time is too low
> currently on these
> >>> machines, it is < 1 ms.
> >>> >
> >>> > There might be several ways to handle this :
> >>> >
> >>> > * Change the test in a way that it might generate
> nigher
> >>> safepoint times
> >>> > * Allow safepoint time == 0 ms
> >>> > * Offer an additional interface that gives
> safepoint times
> >>> with finer granularity ( currently the HS has safepoint
> time values
> >>> in ns , see jdk/src/hotspot/share/runtime/safepoint.cpp
> >>> SafepointTracing::end
> >>> >
> >>> > But it is converted on ms in this code
> >>> >
> >>> > 114jlong RuntimeService::safepoint_time_ms() {
> >>> > 115 return UsePerfData ?
> >>> > 116
> >>>
> Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -1;
> >>> > 117}
> >>> >
> >>> > 064jlong Management::ticks_to_ms(jlong ticks) {
> >>> > 2065 assert(os::elapsed_frequency() > 0, "Must be
> non-zero");
> >>> > 2066 return (jlong)(((double)ticks /
> >>> (double)os::elapsed_frequency())
> >>> > 2067 * (double)1000.0);
> >>> > 2068}
> >>> >
> >>> >
> >>> >
> >>> > Currently I go for the first attempt (and try to generate
> >>> higher safepoint times in my patch) .
> >>>
> >>> Yes that's probably best. Coarse-grained timing on very
> fast machines
> >>> was bound to eventually lead to problems.
> >>>
> >>> But perhaps a more future-proof approach is to just add a
> do-while loop
> >>> around the stack dumps and only exit when we have a non-zero
> >> safepoint
> >>> time?
> >>>
> >>> Thanks,
> >>> David
> >>> -----
> >>>
> >>> > Bug/webrev :
> >>> >
> >>> > https://bugs.openjdk.java.net/browse/JDK-8228658
> >>> >
> >>> > http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
> >>> >
> >>> >
> >>> > Thanks, Matthias
> >>> >
> >>>
> >>>
> >>> --
> >>>
> >>> Thanks,
> >>>
> >>> Jc
> >>>
>
>
>
> --
>
> Thanks,
> Jc
More information about the serviceability-dev
mailing list