RFR: 8112: Flamegraph model creation performance improvements

Thu Jul 27 15:40:10 UTC 2023

On Tue, 27 Jun 2023 16:06:45 GMT, Vincent Alexander Beelte <duke at openjdk.org> wrote:

> This pull request improves the performance of creating the model that the flame graph visualization is drawn from.
> 
> The first issue this fixes is not actually really "performance" but rather a case where the thread pool used to create the models can be fully saturated with tasks that are already invaliated and blocking the newest useful task.
> The use case where that tends to happen to me goes as follows:
> My jfr file contains about 1.8 million events (1.5 mio "Allocation in new TLAB", 200k "Allocation outside TLAB", 70k "Method Profiling Sample"). It was created by async-profiler during a 5 minute load test.
> When loading this file and switching to the Java Application view (with the Flame Graph visualization already in focus) multiple tasks are generated to create a flamegraph for all event types and all threads. I did not investigate why there would be multiple tasks. "multiple" is "at least two"
> Then without waiting for the graph to load I go on to filter to the thread that I am interested about which generates multiple tasks to generate a flamegraph for all event types in that thread.
> Lastly I filter to the Method Profiling Samples which creates the last task, which then has to wait for enough of the other tasks to finish that one of the 3 thread in the pool can run it.
> All in all with this specific jfr file and the current master (everything including commit 5ace151) in this scenario I am waiting about 55 seconds until I see the graph. (just measured with a stopwatch app)
> 
> My solution to this was to give the StacktraceTreeModel constructor a stop flag that it checks at the start both the inner and the outer loop. So that it can return early. The flag is then implemented by the FlamegraphSwingView.ModelRebuildRunnable.isInvalid field which is already checked at some places to see if the current task is still needed. It does feel strange to do this in a constructor but it was the easiest way to interrupt the most expensive part of creating the StacktraceTreeModel.
> 
> With this flag alone use case above goes from 55 seconds to about 6 seconds and there I am already the limiting factor stopping the clock in time.
> 
> When looking at the flamegraph of a flight recording of jmc where I had it draw me a bunch of different flamegraph with different filters in my 1.8 mio event file it did look like there where some additional low hanging fruit to pick:
> ![streams](https://github.com/openjdk/jmc/assets/917408/892a5851-ed3c-4337-8646-07ee802d3e63)
> This is filtered to on...

merged it and "clean verify" seems to work.
now the ui tests show some other strange behavior even on master which all of the following comments are about.
One problem I could already fix or work around. The tests on linux seem to have issues with desktop scaling. I needed to go back to 100% from 200%. Otherwise it seems that in MCJemmyTestBase.initialize() line 91
`MC.setRecordingAnalysis(false)` the click coordinates do not seem to be computed correctly with 200% scaling. It does not seems to click the checkmark (not in the right place) and thus fails at that point.

I am currently stuck at an issue where some test opens an "import connections" dialog and then inputs something like "7tmp7testExportImport2835912463012025242.xml" where the expected/asserted input is something like "/tmp/testExportImport2835912463012025242.xml". Which looks like the test driver forgot to hold shift when typing the slashes :)

The tests in error for me currently are:
`
ConnectionExportImportTest.testExportImport:93 » TimeoutExpired State 'Waiting...
  ConnectionExportImportTest.testImportNonExistantFile:125 » Jemmy Action 'typin...
  ConnectionExportImportTest.testSetMasterPassword:135 » TimeoutExpired State 'W...
  CustomConnectionsTest.createNewFolder:66 » TimeoutExpired State 'Waiting for 1...
  CustomConnectionsTest.editConnection:122 » TimeoutExpired State 'Waiting for 1...
  CustomConnectionsTest.invalidConnection:113 » TimeoutExpired State 'Waiting fo...
  CustomConnectionsTest.nestedFoldersConnections:79 » TimeoutExpired State 'Wait...
  CustomConnectionsTest.renameConnection:104 » TimeoutExpired State 'Waiting for...
  CustomConnectionsTest.renameFolder:72 » TimeoutExpired State 'Waiting for 1 or...
  CustomConnectionsTest.validConnection:96 » TimeoutExpired State 'Waiting for 1...
`
But I am sure some of that is only because the dialog never closes. I have literally only just now tried that so I am still confident I will find what is going on there :)

On the oca I will hopefully make progress this evening (german time) but I do not really expect that to be resolved even next week.

I think I may have found the issue with the "7tmp7". There is a class called org.jemmy.input.DefaultCharBindingMap which looks like it is specifically taught how to type on an english and I believe swedish keyboard layout. But not (yet) a german one.

edit: Also I just noticed, that's a library and not part of jmc or its own test utilities.

Converted this to a draft because if I understood our opensource practices correctly there will be another new pull request be opened directly from the official accenture github account for this.

As for the uitest keyboard problem I started going the path of least resistance, which seemed to be to just to set up a jmc build on my gaming pc. There I found I needed to only set the OS level keyboard layout to english for the tests to type the correct letters. Settings the OS level keyboard layout did not work on Linux so it might also be the case that I have an actual physical keyboard with english layout plugged into that pc.
Did not run them in full yet though, they seemed to take a while and I did not have the time to wait for them.

Ok it looks like I can proceed with this PR instead of opening a new one. I do not have a response/confirmation yet to my mail about being put on the accenture contributor list but the oca label is gone so thats a good sign I guess. I am going to attempt a "covered" command in a different comment and see if that does something.

On the ui tests:
I have not yet had a single run of them even on master that passed all tests. (about 5-8 runs in total)
Some do not even fail in the same places from run to run.
At least one of them looks like it actually can't succeed because its expectation/setup conflicts with what to me looks like a change in jdk17 maybe.
I will provide specifics once I am back into it a little more.

So reporting on the ui tests:
I do not seem to have broken more of them. In fact I fixed one setup over in #509
But it looks like they are currently reporting actual bugs even on master. There are three tests that are currently failing regularly for me on Windows with a JDK 17.0.7 (specifically Temurin-17.0.7+7)
They are ControlRecordingsTest.modifyEventPeriod, ControlRecordingsTest.modifyEventThreshold and ControlRecordingsTest.modifyRecordingEvents.
ControlRecordingsTest.modifyRecordingEvents is the least likely to get a successful run. It basically never succeeds. The others are sucessfull every now and again.
I clicked through the "test plan" of ControlRecordingsTest.modifyRecordingEvents manually (as in "no junit, no jemmy, just me") and I can confirm that I could reproduce the failure case. I think the details about that should be discussed elsewhere though, to not detract from the topic of this pull request.

-------------

PR Comment: https://git.openjdk.org/jmc/pull/502#issuecomment-1614451482
PR Comment: https://git.openjdk.org/jmc/pull/502#issuecomment-1614749649
PR Comment: https://git.openjdk.org/jmc/pull/502#issuecomment-1617483581
PR Comment: https://git.openjdk.org/jmc/pull/502#issuecomment-1644211007
PR Comment: https://git.openjdk.org/jmc/pull/502#issuecomment-1653863203