RFR(S): 8235454: [TESTBUG] Basic test for JFR event streaming for jdk_jfr_sanity

mikhailo.seledtsov at oracle.com
Fri Dec 13 23:50:59 UTC 2019


Hi Igor, Erik,

Igor, thank you for taking a second look at the test. See my comments 
inline:

On 12/12/19 8:47 PM, Igor Ignatyev wrote:
> Hi Misha,
>
> besides the obvious problem w/ the test's current failure mode, which is that all failures will be reported as 'Error. Program java timed out...', a message that tends to be miscategorized as an environment or test bug;
Good point. From a practical standpoint, it does tend to happen that way.
>   you also get a more subtle problem: your test doesn't fully define what it verifies; the most important part is defined by the test harness now. and b/c the test harness doesn't have any knowledge of the test's specifics, it can assert only very broadly -- all test code has to finish in 2 minutes (subject to scaling by the timeout factor). such broad assertions can lead to false-negative results from this test: at one extreme, onEvent takes 2 minutes - delta - epsilon, and onClose takes just delta, which is close to 0 ms, where epsilon is spent on test overhead, e.g. spawning a JVM; at the other extreme, onEvent takes delta, and onClose takes 2 minutes - delta - epsilon. and while the latter might be acceptable, the former sounds like a bug to me.

Right. Also, even though the API does not specify any timing explicitly, 
there is a reasonable expectation that onEvent() and onClose() will arrive. 
That is, if onEvent() does not arrive within, say, 20 or 30 seconds, this 
indicates some sort of an issue.

I am thinking the following may be a good compromise that addresses the 
feedback from both you and Erik:

   - I will create a constant WAIT_TIMEOUT and set it to something 
intentionally larger than the "reasonably expected" arrival time; say, 30 
seconds

   - take the jtreg timeout factor into account: WAIT_TIMEOUT = 30 * 
JTREG_TIMEOUT_FACTOR

   - use this timeout for both the onEvent() latch and the onClose() latch 
(see the sketch below)
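
Roughly, I have in mind something like this (just a sketch; the class and 
helper names are placeholders, and I am assuming the jtreg timeout factor 
is exposed to the test as the "test.timeout.factor" system property):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    class LatchTimeouts {
        // Base timeout of 30 s, intentionally larger than the "reasonably
        // expected" arrival time, scaled by the jtreg timeout factor.
        static final long WAIT_TIMEOUT_SECONDS = (long) (30
                * Double.parseDouble(
                        System.getProperty("test.timeout.factor", "1.0")));

        // await(long, TimeUnit) returns false when the wait times out, so
        // the test can fail with a specific message instead of being killed
        // (and possibly miscategorized) by the harness.
        static void awaitOrFail(CountDownLatch latch, String what)
                throws InterruptedException {
            if (!latch.await(WAIT_TIMEOUT_SECONDS, TimeUnit.SECONDS)) {
                throw new RuntimeException("Timed out after "
                        + WAIT_TIMEOUT_SECONDS + " s waiting for " + what);
            }
            System.out.println(what + " arrived");
        }
    }

The same helper would guard both latches, so a hang in onEvent() and a hang 
in onClose() would each produce a distinct failure message.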

If there are no objections, I will make the necessary updates, retest, and 
post the updated code.

Let me know what you think.


Thank you,

Misha

> with that being said, I'm fine either way; I just wanted to make sure the drawbacks of the current approach are clear and understood.
>
> Thanks,
> -- Igor
>
>> On Dec 10, 2019, at 11:43 AM, mikhailo.seledtsov at oracle.com wrote:
>>
>> Hi Erik, Igor,
>>
>> Pasting Igor's comment as well:
>>
>>> the only problem I have w/ the test is its failure mode: other than by a crash, the test can fail only by timeout, which isn't the nicest way to fail. I understand that there is no specific guarantee when an event is received by the consumer, nor are there such guarantees for onClose, but I'd really like us to find a better way to detect and report failures. for example, you can replace the no-arg CountDownLatch::await calls w/ the one that takes a timeout value (await(long, j.u.c.TimeUnit)), use some reasonable timeout value, and report a test failure if the return value isn't true.
>>
>>
>> IMO, whether to use an explicit timeout in await() depends on the goal of the particular test case. If the goal is to test the timing/responsiveness of the API, we should use a timeout, given that there is an explicit or implicit guarantee for event arrival (specifically, that onEvent() and onClose() should arrive within some specific time). If the goal is simply to test that events arrive, we should not use a timeout.
>>
>> In this case, this is a basic/sanity test, and the goal is not to test specific timing. Hence, I am in favor of leaving the test as is and letting the test harness signal the timeout. The user will see what happened from the post-test diagnostics provided. I can slightly improve the test by adding a printout after each await(), to let the user know whether the await() completed.
>>
>> Let me know what you think.
>>
>>
>> Thanks,
>> Misha
>>
>> On 12/9/19 8:41 PM, Erik Gahlin wrote:
>>> Looks good
>>>
>>> When it comes to the failure mode:
>>>
>>> I think it is better to let the test harness determine what is a reasonable timeout; if needed, it can be changed from the outside for all tests. Also, if the test fails on a timeout, the test harness will (hopefully) take a thread dump, which is useful for troubleshooting.
>>>
>>> Erik
>>>
>>>> On 10 Dec 2019, at 02:36, mikhailo.seledtsov at oracle.com wrote:
>>>>
>>>> Please review this change that introduces a very simple basic/sanity test for JFR streaming, and adds this test to jdk_jfr_sanity (which runs in tier-1).
>>>>
>>>>      JBS: https://bugs.openjdk.java.net/browse/JDK-8235454
>>>>      Webrev: http://cr.openjdk.java.net/~mseledtsov/8235454.00/
>>>>      Testing:
>>>>          1. Ran this test over 200 times on multiple platforms in the test cluster: ALL PASS
>>>>          2. Ran the group jdk_jfr_sanity 80 times: ALL PASS
>>>>
>>>> Thank you,
>>>> Misha
>>>>
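
For context, a basic streaming sanity test of the kind under review might 
look roughly like this (an illustrative sketch only, not the code from the 
webrev above; the event name and class names are made up):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import jdk.jfr.Event;
    import jdk.jfr.Name;
    import jdk.jfr.consumer.RecordingStream;

    public class JfrStreamingSanitySketch {

        @Name("test.Message")
        static class MessageEvent extends Event {
            String text;
        }

        public static void main(String... args) throws Exception {
            CountDownLatch eventSeen = new CountDownLatch(1);
            CountDownLatch closed = new CountDownLatch(1);

            try (RecordingStream rs = new RecordingStream()) {
                rs.enable(MessageEvent.class); // make sure the event is on
                rs.onEvent("test.Message", e -> eventSeen.countDown());
                rs.onClose(closed::countDown);
                rs.startAsync();

                // Emit events until one is observed, so the test does not
                // race against the asynchronous start of the stream.
                while (!eventSeen.await(100, TimeUnit.MILLISECONDS)) {
                    MessageEvent event = new MessageEvent();
                    event.text = "hello";
                    event.commit();
                }
                rs.close();
            }
            closed.await(); // the plain vs. timed await discussed above
            System.out.println("PASSED");
        }
    }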

