RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms
David Holmes
david.holmes at oracle.com
Thu Aug 1 07:56:21 UTC 2019
On 1/08/2019 5:13 pm, Baesken, Matthias wrote:
> Hi David + JC , thanks for the reviews .
>
> David - I added the suggested print-outputs , and also the parameter to executeThreadDumps .
Okay - thanks for that.
David
-----
> Best regards, Matthias
>
>
>
>> -----Original Message-----
>> From: David Holmes <david.holmes at oracle.com>
>> Sent: Mittwoch, 31. Juli 2019 23:57
>> To: Baesken, Matthias <matthias.baesken at sap.com>; Jean Christophe
>> Beyler <jcbeyler at google.com>
>> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
>> dev at openjdk.java.net>
>> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast
>> Linux machines with Total safepoint time 0 ms
>>
>> On 1/08/2019 12:01 am, Baesken, Matthias wrote:
>>>
>>> Hi upload works again, now with webrev :
>>>
>>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.2/
>>
>> Could you please add, for diagnostic purposes:
>>
>> System.out.println("Total safepoint time (ms): " + value);
>>
>> after:
>>
>> 60 long value = executeThreadDumps();
>>
>> and
>>
>> 68 long value2 = executeThreadDumps();
>>
>> that way if the test fails we can check logs to see what kind of
>> safepoint times have been observed previously. No need to see an updated
>> webrev just for that.
>>
>> I have one further suggestion, take it or leave it, that
>> executeThreadDumps() takes a parameter to specify the initial value, so
>> we'd have:
>>
>> 60 long value = executeThreadDumps(0);
>>
>> and
>>
>> 68 long value2 = executeThreadDumps(value);
>>
>> This might help detect getTotalSafepointTime() going backwards slightly
>> better than current code.
>>
>> Thanks,
>> David
>>
>>> Best regards, Matthias
>>>
>>>
>>>> -----Original Message-----
>>>> From: Baesken, Matthias
>>>> Sent: Mittwoch, 31. Juli 2019 14:05
>>>> To: 'David Holmes' <david.holmes at oracle.com>; Jean Christophe Beyler
>>>> <jcbeyler at google.com>
>>>> Cc: hotspot-dev at openjdk.java.net; serviceability-dev <serviceability-
>>>> dev at openjdk.java.net>
>>>> Subject: RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
>> fast
>>>> Linux machines with Total safepoint time 0 ms
>>>>
>>>> Hello, here is a version following the latest proposal of JC .
>>>>
>>>> Unfortunately attached as patch, sorry for that - the uploads / pushes
>>>> currently do not work from here .
>>>>
>>>> Best regards, Matthias
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: David Holmes <david.holmes at oracle.com>
>>>>> Sent: Mittwoch, 31. Juli 2019 05:04
>>>>> To: Jean Christophe Beyler <jcbeyler at google.com>
>>>>> Cc: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-
>>>>> dev at openjdk.java.net; serviceability-dev <serviceability-
>>>>> dev at openjdk.java.net>
>>>>> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
>>>> fast
>>>>> Linux machines with Total safepoint time 0 ms
>>>>>
>>>>> On 31/07/2019 9:08 am, Jean Christophe Beyler wrote:
>>>>>> FWIW, I would have done something like what David was suggesting,
>> just
>>>>>> slightly tweaked:
>>>>>>
>>>>>> public static long executeThreadDumps() {
>>>>>> long value;
>>>>>> long initial_value = mbean.getTotalSafepointTime();
>>>>>> do {
>>>>>> Thread.getAllStackTraces();
>>>>>> value = mbean.getTotalSafepointTime();
>>>>>> } while (value == initial_value);
>>>>>> return value;
>>>>>> }
>>>>>>
>>>>>> This ensures that the value is a new value as opposed to the current
>>>>>> value and if something goes wrong, as David said, it will timeout; which
>>>>>> is ok.
>>>>>
>>>>> Works for me.
>>>>>
>>>>>> But I come back to not really understanding why we are doing this at
>>>>>> this point of relaxing (just get a new value of safepoint time).
>>>>>> Because, if we accept timeouts now as a failure here, then really the
>>>>>> whole test becomes:
>>>>>>
>>>>>> executeThreadDumps();
>>>>>> executeThreadDumps();
>>>>>>
>>>>>> Since the first call will return when value > 0 and the second call will
>>>>>> return when value2 > value (I still wonder why we want to ensure it
>>>>>> works twice...).
>>>>>
>>>>> The test is trying to sanity check that we are actually recording the
>>>>> time used by safepoints. So first check is that we can get a non-zero
>>>>> value; second check is we get a greater non-zero value. It's just a
>>>>> sanity test to try and catch if something gets unexpectedly broken in
>>>>> the time tracking code.
>>>>>
>>>>>> So both failures and even testing for it is kind of redundant, once you
>>>>>> have a do/while until a change?
>>>>>
>>>>> Yes - the problem with the tests that try to check internal VM behaviour
>>>>> is that we have no specified way to do something, in this case execute
>>>>> safepoints, that relates to internal VM behaviour, so we have to do
>>>>> something we know will currently work even if not specified to do so -
>>>>> e.g. dumping all thread stacks uses a global safepoint. The second
>>>>> problem is that the timer granularity is so coarse that we then have to
>>>>> guess how many times we need to do that something before seeing a
>>>>> change. To make the test robust we can keep doing stuff until we see a
>>>>> change and so the only way that will fail is if the overall timeout of
>>>>> the test kicks in. Or we can try and second guess how long it should
>>>>> take by introducing our own internal timeout - either directly or by
>>>>> limiting the number of loops in this case. That has its own problems and
>>>>> in general we have tried to reduce internal test timeouts (by removing
>>>>> them) and let overall timeouts take charge.
>>>>>
>>>>> No ideal solution. And this has already consumed way too much of
>>>>> everyone's time.
>>>>>
>>>>> Cheers,
>>>>> David
>>>>>
>>>>>> Thanks,
>>>>>> Jc
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 30, 2019 at 2:35 PM David Holmes
>> <david.holmes at oracle.com
>>>>>> <mailto:david.holmes at oracle.com>> wrote:
>>>>>>
>>>>>> On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
>>>>>> > Hi David, "put that whole code (the while loop) in a helper
>>>>>> method." was JC's idea, and I like the idea .
>>>>>>
>>>>>> Regardless I think the way you are using NUM_THREAD_DUMPS is
>>>> really
>>>>>> confusing. As an all-caps static you'd expect it to be a constant.
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>> > Let's see what others think .
>>>>>> >
>>>>>> >>
>>>>>> >> Overall tests like this are not very useful, yet very fragile.
>>>>>> >>
>>>>>> >
>>>>>> > I am also fine with putting the test on the exclude list.
>>>>>> >
>>>>>> > Best regards, Matthias
>>>>>> >
>>>>>> >
>>>>>> >> -----Original Message-----
>>>>>> >> From: David Holmes <david.holmes at oracle.com
>>>>>> <mailto:david.holmes at oracle.com>>
>>>>>> >> Sent: Dienstag, 30. Juli 2019 14:12
>>>>>> >> To: Baesken, Matthias <matthias.baesken at sap.com
>>>>>> <mailto:matthias.baesken at sap.com>>; Jean Christophe
>>>>>> >> Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>>
>>>>>> >> Cc: hotspot-dev at openjdk.java.net
>>>>>> <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
>>>>>> <serviceability-
>>>>>> >> dev at openjdk.java.net <mailto:dev at openjdk.java.net>>
>>>>>> >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java
>>>>>> fails on fast
>>>>>> >> Linux machines with Total safepoint time 0 ms
>>>>>> >>
>>>>>> >> Hi Matthias,
>>>>>> >>
>>>>>> >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
>>>>>> >>> Hello JC / David, here is a second webrev :
>>>>>> >>>
>>>>>> >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
>>>>>> >>>
>>>>>> >>> It moves the thread dump execution into a method
>>>>>> >>> executeThreadDumps(long) , and also adds while loops
>>>>>> (but with a
>>>>>> >>> limitation for the number of thread dumps, really don’t
>>>>>> >>> want to cause timeouts etc.). I removed a check for
>>>>>> >>> MAX_VALUE_FOR_PASS because we cannot go over
>>>>> Long.MAX_VALUE .
>>>>>> >>
>>>>>> >> I don't think executeThreadDumps is worth factoring out like out.
>>>>>> >>
>>>>>> >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd
>> rather
>>>> it
>>>>>> >> remains a constant 100, and then you set a simple loop iteration
>>>>>> count
>>>>>> >> limit. Further with the proposed code when you get here:
>>>>>> >>
>>>>>> >> 85 NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
>>>>>> >>
>>>>>> >> you don't even know what value you may be starting with.
>>>>>> >>
>>>>>> >> But I was thinking of simply:
>>>>>> >>
>>>>>> >> long value = 0;
>>>>>> >> do {
>>>>>> >> Thread.getAllStackTraces();
>>>>>> >> value = mbean.getTotalSafepointTime();
>>>>>> >> } while (value == 0);
>>>>>> >>
>>>>>> >> We'd only hit a timeout if something is completely broken -
>>>>>> which is fine.
>>>>>> >>
>>>>>> >> Overall tests like this are not very useful, yet very fragile.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> David
>>>>>> >>
>>>>>> >>> Hope you like this version better.
>>>>>> >>>
>>>>>> >>> Best regards, Matthias
>>>>>> >>>
>>>>>> >>> *From:*Jean Christophe Beyler <jcbeyler at google.com
>>>>>> <mailto:jcbeyler at google.com>>
>>>>>> >>> *Sent:* Dienstag, 30. Juli 2019 05:39
>>>>>> >>> *To:* David Holmes <david.holmes at oracle.com
>>>>>> <mailto:david.holmes at oracle.com>>
>>>>>> >>> *Cc:* Baesken, Matthias <matthias.baesken at sap.com
>>>>>> <mailto:matthias.baesken at sap.com>>;
>>>>>> >>> hotspot-dev at openjdk.java.net
>>>>>> <mailto:hotspot-dev at openjdk.java.net>; serviceability-dev
>>>>>> >>> <serviceability-dev at openjdk.java.net
>>>>>> <mailto:serviceability-dev at openjdk.java.net>>
>>>>>> >>> *Subject:* Re: RFR: [XS] 8228658: test
>>>>>> GetTotalSafepointTime.java fails
>>>>>> >>> on fast Linux machines with Total safepoint time 0 ms
>>>>>> >>>
>>>>>> >>> Hi Matthias,
>>>>>> >>>
>>>>>> >>> I wonder if you should not do what David is suggesting and then
>>>>>> put that
>>>>>> >>> whole code (the while loop) in a helper method. Below you
>> have a
>>>>>> >>> calculation again using value2 (which I wonder what the added
>>>>>> value of
>>>>>> >>> it is though) but anyway, that value2 could also be 0 at some
>>>>>> point, no?
>>>>>> >>>
>>>>>> >>> So would it not be best to just refactor the getAllStackTraces
>> and
>>>>>> >>> calculate safepoint time in a helper method for both value /
>> value2
>>>>>> >>> variables?
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>>
>>>>>> >>> Jc
>>>>>> >>>
>>>>>> >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes
>>>>>> <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>>>>>> >>> <mailto:david.holmes at oracle.com
>>>>>> <mailto:david.holmes at oracle.com>>> wrote:
>>>>>> >>>
>>>>>> >>> Hi Matthias,
>>>>>> >>>
>>>>>> >>> On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
>>>>>> >>> > Hello , please review this small test fix .
>>>>>> >>> >
>>>>>> >>> > The test
>>>>>> >>>
>>>>>> >>
>>>>>
>>>>
>> test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
>>>>>> >> java
>>>>>> >>> fails sometimes on fast Linux machines with this error
>>>>>> message :
>>>>>> >>> >
>>>>>> >>> > java.lang.RuntimeException: Total safepoint time
>>>>>> illegal value: 0
>>>>>> >>> ms (MIN = 1; MAX = 9223372036854775807)
>>>>>> >>> >
>>>>>> >>> > looks like the total safepoint time is too low
>>>>>> currently on these
>>>>>> >>> machines, it is < 1 ms.
>>>>>> >>> >
>>>>>> >>> > There might be several ways to handle this :
>>>>>> >>> >
>>>>>> >>> > * Change the test in a way that it might generate
>>>>>> nigher
>>>>>> >>> safepoint times
>>>>>> >>> > * Allow safepoint time == 0 ms
>>>>>> >>> > * Offer an additional interface that gives
>>>>>> safepoint times
>>>>>> >>> with finer granularity ( currently the HS has safepoint
>>>>>> time values
>>>>>> >>> in ns , see jdk/src/hotspot/share/runtime/safepoint.cpp
>>>>>> >>> SafepointTracing::end
>>>>>> >>> >
>>>>>> >>> > But it is converted on ms in this code
>>>>>> >>> >
>>>>>> >>> > 114jlong RuntimeService::safepoint_time_ms() {
>>>>>> >>> > 115 return UsePerfData ?
>>>>>> >>> > 116
>>>>>> >>>
>>>>>> Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -
>> 1;
>>>>>> >>> > 117}
>>>>>> >>> >
>>>>>> >>> > 064jlong Management::ticks_to_ms(jlong ticks) {
>>>>>> >>> > 2065 assert(os::elapsed_frequency() > 0, "Must be
>>>>>> non-zero");
>>>>>> >>> > 2066 return (jlong)(((double)ticks /
>>>>>> >>> (double)os::elapsed_frequency())
>>>>>> >>> > 2067 * (double)1000.0);
>>>>>> >>> > 2068}
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> > Currently I go for the first attempt (and try to generate
>>>>>> >>> higher safepoint times in my patch) .
>>>>>> >>>
>>>>>> >>> Yes that's probably best. Coarse-grained timing on very
>>>>>> fast machines
>>>>>> >>> was bound to eventually lead to problems.
>>>>>> >>>
>>>>>> >>> But perhaps a more future-proof approach is to just add a
>>>>>> do-while loop
>>>>>> >>> around the stack dumps and only exit when we have a non-
>> zero
>>>>>> >> safepoint
>>>>>> >>> time?
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> David
>>>>>> >>> -----
>>>>>> >>>
>>>>>> >>> > Bug/webrev :
>>>>>> >>> >
>>>>>> >>> > https://bugs.openjdk.java.net/browse/JDK-8228658
>>>>>> >>> >
>>>>>> >>> >
>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> > Thanks, Matthias
>>>>>> >>> >
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>>
>>>>>> >>> Jc
>>>>>> >>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Thanks,
>>>>>> Jc
More information about the hotspot-dev
mailing list