RFR(S): 8235454: [TESTBUG] Basic test for JFR event streaming for jdk_jfr_sanity
Erik Gahlin
erik.gahlin at oracle.com
Sat Dec 14 00:43:34 UTC 2019
Hi Misha,
I think the test should be kept as is. We have many other JFR tests that time out this way, so this test would not be unique. Getting a thread dump has helped me many times, and we have explicitly removed Thread.sleep and arbitrary timeouts to force tests into a binary state: either an exception, or a timeout by the test harness.
With WAIT_TIMEOUT, you will always wonder whether it was a bug or whether the timeout was simply too small for the environment, e.g. a debug build, emulated hardware, etc. We don’t want to spend time on false positives and intermittent failures, especially in tier 1. We want to remove indeterminism so it is easy to reason about what happened. If the test is not able to deliver an event in time, it is most likely due to some locking issue, i.e. a deadlock or a livelock, and we want a thread dump so we can see this. Preferably I would like a core dump as well, so you could get the full state of the JVM process.
Thanks
Erik
> On 14 Dec 2019, at 00:50, mikhailo.seledtsov at oracle.com wrote:
>
> Hi Igor, Erik,
>
> Igor, thank you for taking a second look at the test. See my comments inline:
>
> On 12/12/19 8:47 PM, Igor Ignatyev wrote:
>> Hi Misha,
>>
>> besides the obvious problem w/ the test's current failure mode, which is that all the failures will be reported as 'Error. Program java timed out....', and these tend to be miscategorized as environment or test bugs;
> Good point. From the practical side of things, it does tend to happen.
>> you also get a more subtle problem: your test doesn't fully define what it verifies; the most important part is defined by the test harness now. and b/c the test harness doesn't have any knowledge of the test specifics, it can assert only very broadly -- all test code has to finish in 2 minutes (subject to scaling by the timeout factor). such broad assertions can lead to false-negative results from this test: at one extreme, onEvent takes 2 minutes - delta - epsilon and onClose takes just delta, which is close to 0 ms (where epsilon is spent on test overhead, e.g. spawning a JVM); at the other extreme, onEvent takes delta and onClose takes 2 minutes - delta - epsilon. and while the latter might be acceptable, the former sounds like a bug to me.
>
> Right. Also, even though the API does not specify any timing explicitly, there is a reasonable expectation for onEvent() and onClose() to arrive within a bounded time. That is, say, if onEvent() does not arrive after 20 or 30 seconds, this indicates some sort of an issue.
>
> I am thinking the following solution may be a good compromise to address feedback from both you and Erik:
>
> - I will create a constant WAIT_TIMEOUT and set it to something intentionally larger than the "reasonably expected" arrival time, say 30 seconds
>
> - take the jtreg timeout factor into account: WAIT_TIMEOUT = 30 * JTREG_TIMEOUT_FACTOR
>
> - use this timeout for both the onEvent() latch and the onClose() latch, roughly as sketched below.
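>
> Roughly along these lines (just a sketch; exactly how the jtreg timeout factor is picked up is TBD, jdk.test.lib.Utils is one option, and the class and variable names are only for illustration):
>
>     import java.util.concurrent.CountDownLatch;
>     import java.util.concurrent.TimeUnit;
>     import jdk.test.lib.Utils;   // needs @library /test/lib in the jtreg header
>
>     public class StreamingTimeouts {
>         // Intentionally larger than the "reasonably expected" arrival time,
>         // scaled by the jtreg timeout factor (test.timeout.factor).
>         static final long WAIT_TIMEOUT = Utils.adjustTimeout(30_000); // milliseconds
>
>         static void awaitOrFail(CountDownLatch latch, String what) throws InterruptedException {
>             if (!latch.await(WAIT_TIMEOUT, TimeUnit.MILLISECONDS)) {
>                 throw new RuntimeException(what + " was not delivered within " + WAIT_TIMEOUT + " ms");
>             }
>         }
>     }
>
> The test would then call awaitOrFail(eventArrived, "onEvent()") and awaitOrFail(streamClosed, "onClose()") instead of the plain await() calls.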
>
> If there are no objections, I will make the necessary updates, retest, and post the code.
>
> Let me know what you think.
>
>
> Thank you,
>
> Misha
>
>> with that being said, I'm fine either way. I just wanted to make sure the drawbacks of current approach are clear and understood.
>>
>> Thanks,
>> -- Igor
>>
>>> On Dec 10, 2019, at 11:43 AM, mikhailo.seledtsov at oracle.com wrote:
>>>
>>> Hi Erik, Igor,
>>>
>>> Pasting Igor's comment as well:
>>>
>>>> the only problem I have w/ the test is its failure mode: other than by a crash, the test can fail only by timeout, which isn't the nicest way to fail. I understand that there is no specific guarantee on when an event is received by a consumer, nor are there such guarantees for on-close, but I'd really like us to find a better way to detect and report failures. for example, you can replace the no-arg CountDownLatch::await calls w/ the one which takes a timeout value (await(long, j.u.c.TimeUnit)), use some reasonable timeout value, and report a test failure if the return value isn't true.
>>>
>>>
>>> IMO, whether or not to use an explicit timeout in await() in the test depends on the goal of the particular test case. If the goal is to test the timing/responsiveness of the API, we should use a timeout, given that there is an explicit or implicit guarantee for event arrival (specifically, that onEvent() and onClose() should arrive within some specific time). If the goal is simply to test that events arrive, we should not use a timeout.
>>>
>>> In this case, this is a basic/sanity test, and the goal is not to test specific timing. Hence, I am in favor of leaving the test as is and letting the test harness signal the timeout. The user will see what happened from the post-test diagnostics provided. I can slightly improve the test by adding a printout after each await(), to let the user know whether the await() has completed.
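>>>
>>> For example (a sketch; the latch and stream variable names are just placeholders):
>>>
>>>     System.out.println("Awaiting onEvent() ...");
>>>     eventArrived.await();
>>>     System.out.println("onEvent() received; closing the stream and awaiting onClose() ...");
>>>     rs.close();
>>>     streamClosed.await();
>>>     System.out.println("onClose() received");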
>>>
>>> Let me know what you think.
>>>
>>>
>>> Thanks,
>>> Misha
>>>
>>> On 12/9/19 8:41 PM, Erik Gahlin wrote:
>>>> Looks good
>>>>
>>>> When it comes to failure mode.
>>>>
>>>> I think it is better to let the test harness determine what a reasonable timeout is; if needed, it can be changed from the outside for all tests. Also, if the test fails with a timeout, the test harness will (hopefully) take a thread dump, which is useful for troubleshooting.
>>>>
>>>> Erik
>>>>
>>>>> On 10 Dec 2019, at 02:36, mikhailo.seledtsov at oracle.com wrote:
>>>>>
>>>>> Please review this change that introduces a very simple basic/sanity test for JFR streaming, and adds this test to jdk_jfr_sanity (which runs in tier-1).
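>>>>>
>>>>> For context, a minimal sketch of what such a streaming sanity test looks like (illustrative only; the actual test is in the webrev below, and the event, class, and variable names here are placeholders):
>>>>>
>>>>>     import java.util.concurrent.CountDownLatch;
>>>>>     import jdk.jfr.Event;
>>>>>     import jdk.jfr.consumer.RecordingStream;
>>>>>
>>>>>     public class TestStreamingSanity {
>>>>>         // Placeholder event; the actual test may emit a different event.
>>>>>         static class TestEvent extends Event {
>>>>>         }
>>>>>
>>>>>         public static void main(String... args) throws Exception {
>>>>>             CountDownLatch eventArrived = new CountDownLatch(1);
>>>>>             CountDownLatch streamClosed = new CountDownLatch(1);
>>>>>
>>>>>             RecordingStream rs = new RecordingStream();
>>>>>             rs.onEvent(TestEvent.class.getName(), e -> eventArrived.countDown());
>>>>>             rs.onClose(streamClosed::countDown);
>>>>>             rs.startAsync();
>>>>>
>>>>>             new TestEvent().commit();
>>>>>
>>>>>             // Plain no-arg awaits: the test either completes or is timed out by the harness.
>>>>>             eventArrived.await();
>>>>>             rs.close();
>>>>>             streamClosed.await();
>>>>>         }
>>>>>     }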
>>>>>
>>>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8235454
>>>>> Webrev: http://cr.openjdk.java.net/~mseledtsov/8235454.00/
>>>>> Testing:
>>>>> 1. Ran this test over 200 times on multiple platforms in the test cluster: ALL PASS
>>>>>
>>>>> 2. Ran the group jdk_jfr_sanity 80 times: ALL PASS
>>>>>
>>>>> Thank you,
>>>>> Misha
>>>>>