RFR(S): 8229209: [TESTBUG] test for cross-process JFR event streaming

mikhailo.seledtsov at oracle.com mikhailo.seledtsov at oracle.com
Fri Nov 22 17:02:17 UTC 2019


Hi Erik,

   Thank you for the review. I will wait for your fix, rebase on it, and 
use onClose() on the stream. Also, I will loop when reading 
jdk.jfr.repository.


Thanks,

Misha

On 11/21/19 11:56 PM, Erik Gahlin wrote:
> Hi Misha,
>
> Thanks for fixing this.
>
> I think there is a race condition. The child process may not have set the system property jdk.jjfr.repository before the parent process try to get it. You need to loop if you get null as repository.
>
> I will check in the fix for EventStream close fix today, so you could remove waitForEventStreamToClose and just do awaitForTermination(). Checking in after will also avoid the race condition with jdk.jfr.repository having two values during startup.
>
> No need for new webrev,
>
> Thanks
> Erik
>
>
>> On 22 Nov 2019, at 00:49, Mikhailo Seledtsov <mikhailo.seledtsov at oracle.com> wrote:
>>
>> After in-person discussion with Erik and Markus, I made the following changes based on the feedback:
>>   - removed the "crash" scenario; a separate crash test case will be integrated as part of:
>>     "8233700: EventStream not closed"
>>   - removing "sleep()" for each event to better stress the system. Instead of sleeping after each emitted event,
>>     sleeping between every 1000 emitted events. This will better stress the system and closer to a real-world scenario.
>>     The occasional sleep in the test is done in order to avoid flooding the JFR sub-system, and in some way simulates real-world natural pauses
>>      in the program.
>>   - the "main" process now controls the child process exit via the shared file
>>   - generating events not based on time or number of events, but rather based on number of JFR stream flushes
>>
>> Updated webrev: http://cr.openjdk.java.net/~mseledtsov/8229209.10/
>> Testing: ran 20 times on multiple platforms: ALL PASS
>>
>> Misha
>>
>>
>> On 11/15/19, 2:24 PM, Igor Ignatyev wrote:
>>> LGTM
>>> -- Igor
>>>
>>>> On Nov 15, 2019, at 1:51 PM, mikhailo.seledtsov at oracle.com wrote:
>>>>
>>>>
>>>> On 11/14/19 8:32 PM, Igor Ignatyev wrote:
>>>>> reading your code, I remembered that ProcessTools.startProcess can't be used together w/ OutputAnalyzer (see 8233725 for details), i.e. you can't reliably use OutputAnalyzer to check output of the producer process,
>>>> Thanks. Good to know.
>>>>> so you can either use Consumer to check that 'Leaving main' isn't printed, or skip this check.
>>>> I will check the exit value directly from the Process.exitValue(), and skip the log string check.
>>>>
>>>> Here is the updated webrev: http://cr.openjdk.java.net/~mseledtsov/8229209.07/
>>>>
>>>>
>>>> Misha
>>>>
>>>>> sorry that I haven't warned you before.
>>>>>
>>>>> -- Igor
>>>>>
>>>>>> On Nov 14, 2019, at 8:17 PM, mikhailo.seledtsov at oracle.com wrote:
>>>>>>
>>>>>> Here is the updated webrev: http://cr.openjdk.java.net/~mseledtsov/8229209.06/
>>>>>>
>>>>>> On 11/14/19 11:40 AM, Mikhailo Seledtsov wrote:
>>>>>>> On 11/14/19, 11:26 AM, Igor Ignatyev wrote:
>>>>>>>>> On Nov 14, 2019, at 11:23 AM, Mikhailo Seledtsov<mikhailo.seledtsov at oracle.com<mailto:mikhailo.seledtsov at oracle.com>>  wrote:
>>>>>>>>>
>>>>>>>>> Minor update: http://cr.openjdk.java.net/~mseledtsov/8229209.05/
>>>>>>>>> Difference since v04: removed exit code verification from testProducerCrashing(), since crashing process produces different exit codes on different systems, and it is not strictly necessary to verify that.
>>>>>>>> although it's not strictly need to verify the product behavior, checking that a process really crashed is beneficial to catch possible test bugs, you can assert that it's not 0 and  you also might check that there is no 'Leaving main()' in stdout.
>>>>>>> I will do that.
>>>>>>>> the rest looks good to me.
>>>>>>> Thank you,
>>>>>>> Misha
>>>>>>>> -- Igor
>>>>>>>>> Testing: ran multiple times in the test cluster, ALL PASS
>>>>>>>>>
>>>>>>>>> Misha
>>>>>>>>>
>>>>>>>>> On 11/13/19, 4:42 PM, Mikhailo Seledtsov wrote:
>>>>>>>>>> Here is the updated webrev: http://cr.openjdk.java.net/~mseledtsov/8229209.04/
>>>>>>>>>>
>>>>>>>>>> I have incorporated feedback from Igor, as well as feedback from chatting with Erik, and some fixes.
>>>>>>>>>> Feedback from Erik:
>>>>>>>>>>    - use VistualMachine.attach() instead of JCMD - preferred way, simplifies code
>>>>>>>>>>    - rework the crash logic in a producer a bit - simplifies code
>>>>>>>>>>
>>>>>>>>>> Also, I have added exports for child process "--add-exports" for Unsafe, added check for child process exit code, plus comment cleanup.
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Misha
>>>>>>>>>>
>>>>>>>>>> On 11/13/19, 2:12 PM, Erik Gahlin wrote:
>>>>>>>>>>> On 2019-11-13 22:38, Mikhailo Seledtsov wrote:
>>>>>>>>>>>> Hi Igor,
>>>>>>>>>>>>
>>>>>>>>>>>>    Thank you for reviewing the test.
>>>>>>>>>>>>
>>>>>>>>>>>> The original question in your first email: I do not think it is necessary to run the test in othervm mode; I will remove the 'othervm' from at-run. The crashing process is the child of the main test process, so this should not affect the state of the main test VM.
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/12/19, 9:59 PM, Igor Ignatyev wrote:
>>>>>>>>>>>>> Hi Misha,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've finished reading, several more comments:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) in consumeEvents at line #167, it'd be clearer and safer to use TestEvent.class.getName().
>>>>>>>>>>>> Fixed
>>>>>>>>>>>>> 3) instead of asking JVM which you create its system properties in getJfrRepository(), can we just start "producer" process w/ jdk.jfr.repository being set to a well-know location?
>>>>>>>>>>>> AFAIK, JFR sub-system picks the location automatically. Part of location is a base directory, e.g. "/private/var/folders/" on Linux, while the second part is picked at runtime based on time stamp and PID. I would like to leave this as is for now, while I start a discussion with JFR team on how to improve this.
>>>>>>>>>>> It was our initial plan to provide means to set the repository on the command line, but we ran into issues so we decided against it, at least for the time being.
>>>>>>>>>>>
>>>>>>>>>>>   -XX:FligtRecorderOptions:repository=... today means the base repository so it may break code if we change it.
>>>>>>>>>>>
>>>>>>>>>>> Adding a second repository option (what would it be called?) would confuse users, but also force us to deal with a situation where the repository already exist. This can happen if the JVM crashes, or if on Windows and another process is reading a file, and the directory can't be removed. It can also happen, if you start two JVMs with the same repository path.
>>>>>>>>>>>
>>>>>>>>>>> One can imagine a production deployment where a server is restarted if it crashes, but it will not start up again because the repository of the previous process was left behind.
>>>>>>>>>>>
>>>>>>>>>>>> This is a good point. As a consideration, the user should be able to specify this property, or at least the base directory for JFR repositories. Once this is resolved, I will update the test accordingly.
>>>>>>>>>>>>> 4) to make the test more robust, at line #118 I'd compare eventsDeliveredWhileProducerIsAlive w/ 1 instead of 1/2 of eventsDeliveredTotal.
>>>>>>>>>>>> OK.
>>>>>>>>>>>>> 5) at line #110 please use Asserts.assertEqual instead of Assert.assertTrue(a == b) as it not just compare the values but also prints them out in case !(a==b)
>>>>>>>>>>>> Fixed
>>>>>>>>>>>>> 6) at line #118 please use Asserts.assertGreaterThanOrEqual instead of Assert.assertTrue(a>= b) as it not just compare the values but also prints them out in case !(a>=b)
>>>>>>>>>>>> Fixed
>>>>>>>>>>>>> 7) is it really necessary to wait till "producer" process main is started?
>>>>>>>>>>>> This is necessary to make sure that VM has initialized to a point where it can correctly receive JCMD requests. We ran into this type of issue before.
>>>>>>>>>>>>> 7.1) if it is, you should use String::equals at line #204 instead of contains
>>>>>>>>>>>>> 7.2) if the answer of 7) is true, I'd recommend you to use j.u.c.CountDownLatch (w/ count = 1) instead of AtomicBoolean for producerMainStarted (you might need to find a better name, e.g. producerStartedSignal) to convey a signal that "producer" process reached its main, so line#205 will become producerMainStarted.countDown(), and line#142 -- producerMainStarted.await()
>>>>>>>>>>>> Thanks for suggesting during our conversation the use of ProcessTools.startProcess() with wait-until-predicate. I will use this, it will simplify the code and be more declarative.
>>>>>>>>>>>>> 8) I find usage of AtomicIntegeres as a way to return two values from consumeEvents() at a very least confusing, could you please create a "record" class for that purpose instead?
>>>>>>>>>>>> I can do that, as we have discussed.
>>>>>>>>>>>>> 9) local 'id' (line #169) isn't use anywhere. will it be beneficial to check that its value is b/w 0 and (NUMBER_OF_EVENTS_PRODUCED - 1) and, for testProducerExitsNormally case, that sum of all id is ((NUMBER_OF_EVENTS_PRODUCED - 1) * NUMBER_OF_EVENTS_PRODUCED / 2) ?
>>>>>>>>>>>> The id is not necessary. I used it for development and experimentation. I will remove it.
>>>>>>>>>>>>> 9.1) and if EventStream::onEvent is expected to preserve events order, that 'id' is strictly increasing?
>>>>>>>>>>>> No guarantee of that.
>>>>>>>>>>>>
>>>>>>>>>>>> I will make the updates, test, and post the updated webrev.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you,
>>>>>>>>>>>> Misha
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> -- Igor
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 12, 2019, at 9:12 PM, Igor Ignatyev<igor.ignatyev at oracle.com>  wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Misha,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't read the test code thoroughly yet, but I have a question -- is there a reason for TestCrossProcessStreaming to be run in othervm mode?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -- Igor
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Nov 12, 2019, at 4:20 PM, Mikhailo Seledtsov<mikhailo.seledtsov at oracle.com>  wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please review this new JFR test. In this test the producer of JFR events and the consumer run in separate processes.
>>>>>>>>>>>>>>> This is expected to be a common scenario with JFR streaming for monitoring applications and services.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     JBS: https://bugs.openjdk.java.net/browse/JDK-8229209
>>>>>>>>>>>>>>>     Webrev: http://cr.openjdk.java.net/~mseledtsov/8229209.01/
>>>>>>>>>>>>>>>     Testing:
>>>>>>>>>>>>>>>       - ran this new test locally (Mac, Linux-x64) multiple times - PASS
>>>>>>>>>>>>>>>       - running this test on the test cluster - in progress
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you,
>>>>>>>>>>>>>>> Misha
>>>>>>>>>>>>>>>


More information about the hotspot-jfr-dev mailing list