RFR: 8314438: NMT: Performance benchmarks are needed to have a baseline for comparison of improvements

Mon Sep 25 07:34:49 UTC 2023

On Tue, 5 Sep 2023 07:53:36 GMT, Afshin Zafari <azafari at openjdk.org> wrote:

> A new benchmark for measuring the NMT overhead in `summary` and `detail` modes.
> The tests are run using: 
> 
> make CONF=debug test TEST="micro:java.util.NMTBenchmark" MICRO="RESULTS_FORMAT=json"
> 
> The results are written to a JSON file that can be visualized using [JMH Visualizer](https://jmh.morethan.io/).
> 
> ### Notes
> A separate [issue](https://bugs.openjdk.org/browse/JDK-8316814) is created for preparing a program for validating and analyzing the JMH outputs.
> Another separate [issue](https://bugs.openjdk.org/browse/JDK-8316813) is created for measuring the virtual memory tracing parts of NMT.

The test will not compile for me unless I add:

`import java.util.concurrent.TimeUnit;`

Marked as reviewed by gziemski (Committer).

These are my results:

Benchmark                                   (N)  (THREADS)  (WITH_THREADS)  Mode  Cnt     Score    Error  Units
NMTBenchmark.NMTDetail.allocateMemory    100000          2               0  avgt   10   130.946 ±  8.761  ms/op
NMTBenchmark.NMTDetail.allocateMemory    100000          2               1  avgt   10     0.399 ±  0.228  ms/op
NMTBenchmark.NMTDetail.allocateMemory    100000          4               0  avgt   10   120.379 ±  8.072  ms/op
NMTBenchmark.NMTDetail.allocateMemory    100000          4               1  avgt   10     0.543 ±  0.207  ms/op
NMTBenchmark.NMTDetail.allocateMemory    100000          8               0  avgt   10   122.144 ±  3.696  ms/op
NMTBenchmark.NMTDetail.allocateMemory    100000          8               1  avgt   10     0.776 ±  0.132  ms/op
NMTBenchmark.NMTDetail.allocateMemory   1000000          2               0  avgt   10  1546.584 ± 24.937  ms/op
NMTBenchmark.NMTDetail.allocateMemory   1000000          2               1  avgt   10     2.757 ±  0.740  ms/op
NMTBenchmark.NMTDetail.allocateMemory   1000000          4               0  avgt   10  1530.738 ± 73.529  ms/op
NMTBenchmark.NMTDetail.allocateMemory   1000000          4               1  avgt   10     2.859 ±  0.624  ms/op
NMTBenchmark.NMTDetail.allocateMemory   1000000          8               0  avgt   10  1609.203 ± 29.717  ms/op
NMTBenchmark.NMTDetail.allocateMemory   1000000          8               1  avgt   10     3.113 ±  0.632  ms/op
NMTBenchmark.NMTOff.allocateMemory       100000          2               0  avgt   10    60.585 ±  4.187  ms/op
NMTBenchmark.NMTOff.allocateMemory       100000          2               1  avgt   10     0.333 ±  0.149  ms/op
NMTBenchmark.NMTOff.allocateMemory       100000          4               0  avgt   10    56.138 ±  2.727  ms/op
NMTBenchmark.NMTOff.allocateMemory       100000          4               1  avgt   10     0.493 ±  0.214  ms/op
NMTBenchmark.NMTOff.allocateMemory       100000          8               0  avgt   10    57.457 ±  2.310  ms/op
NMTBenchmark.NMTOff.allocateMemory       100000          8               1  avgt   10     0.731 ±  0.158  ms/op
NMTBenchmark.NMTOff.allocateMemory      1000000          2               0  avgt   10   815.515 ± 19.232  ms/op
NMTBenchmark.NMTOff.allocateMemory      1000000          2               1  avgt   10     2.674 ±  0.566  ms/op
NMTBenchmark.NMTOff.allocateMemory      1000000          4               0  avgt   10   829.726 ± 31.153  ms/op
NMTBenchmark.NMTOff.allocateMemory      1000000          4               1  avgt   10     2.678 ±  0.652  ms/op
NMTBenchmark.NMTOff.allocateMemory      1000000          8               0  avgt   10   817.929 ± 19.142  ms/op
NMTBenchmark.NMTOff.allocateMemory      1000000          8               1  avgt   10     3.023 ±  0.578  ms/op
NMTBenchmark.NMTSummary.allocateMemory   100000          2               0  avgt   10   101.419 ±  4.231  ms/op
NMTBenchmark.NMTSummary.allocateMemory   100000          2               1  avgt   10     0.329 ±  0.143  ms/op
NMTBenchmark.NMTSummary.allocateMemory   100000          4               0  avgt   10   102.224 ±  3.317  ms/op
NMTBenchmark.NMTSummary.allocateMemory   100000          4               1  avgt   10     0.514 ±  0.230  ms/op
NMTBenchmark.NMTSummary.allocateMemory   100000          8               0  avgt   10   101.127 ±  4.144  ms/op
NMTBenchmark.NMTSummary.allocateMemory   100000          8               1  avgt   10     0.719 ±  0.168  ms/op
NMTBenchmark.NMTSummary.allocateMemory  1000000          2               0  avgt   10  1423.287 ± 56.109  ms/op
NMTBenchmark.NMTSummary.allocateMemory  1000000          2               1  avgt   10     2.682 ±  0.442  ms/op
NMTBenchmark.NMTSummary.allocateMemory  1000000          4               0  avgt   10  1418.999 ± 33.891  ms/op
NMTBenchmark.NMTSummary.allocateMemory  1000000          4               1  avgt   10     2.812 ±  0.510  ms/op
NMTBenchmark.NMTSummary.allocateMemory  1000000          8               0  avgt   10  1431.552 ± 29.915  ms/op
NMTBenchmark.NMTSummary.allocateMemory  1000000          8               1  avgt   10     3.073 ±  0.583  ms/op
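
As a rough way to read the table, the `WITH_THREADS=0` rows can be turned into overhead ratios relative to `NMTOff`. The sketch below hard-codes the three scores from the `N=100000, THREADS=2, WITH_THREADS=0` rows; the helper name `overhead_ratio` is made up for illustration, and the error margins are ignored, so the ratios are only indicative.

```python
# Scores (ms/op) copied from the table above: N=100000, THREADS=2, WITH_THREADS=0.
scores = {
    "NMTOff": 60.585,
    "NMTSummary": 101.419,
    "NMTDetail": 130.946,
}

def overhead_ratio(mode_score, baseline_score):
    """Return how many times slower a mode is than the baseline (hypothetical helper)."""
    return mode_score / baseline_score

for mode in ("NMTSummary", "NMTDetail"):
    ratio = overhead_ratio(scores[mode], scores["NMTOff"])
    print(f"{mode}: {ratio:.2f}x the NMTOff time")
```

For this one row set, that comes to roughly 1.7x for `summary` and 2.2x for `detail`.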

You said `The JSON file can be used for visualising the results.` Can you please explain how to do that exactly?

It looks like a good start.

The test took something like 20 minutes for me.

Is the multithreading mode really useful?

And even if we decide that it is, we can probably skip the entries that create the threads but do not actually use them? For example:

THREADS=2, WITH_THREADS=0
THREADS=4, WITH_THREADS=0
THREADS=8, WITH_THREADS=0

I like it! I do have more follow-ups, though.

I see that we are currently mixing malloc/free. Wouldn't we be better served if we separated those as 2 distinct categories? I can imagine a scenario where either the malloc or the free path could be regressed/improved without affecting the other.

Also, I think we should add realloc to the mix, again keeping it separate. I tried to collapse malloc/realloc into a single code path, but it was rejected, mostly on the basis that it would impact the performance - this microbenchmark would be a perfect tool to offer a definitive answer.

Having tested the code and looked at the output the microbenchmark produces, I think it would be super useful (here or in a follow-up issue) to have a Python or Java app that quickly compares the results of two runs and reports the difference to the user, without them having to calculate it by hand (and risk making a mistake).

The tool should look at `NMTOff` first from both runs and verify that they are the same, to prove that the results are useful (there should be no difference).

Only then, if the previous numbers show no change, should it report the `NMTSummary`, then the `NMTDetail` differences.
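
To make the idea concrete, here is a minimal sketch of such a tool in Python. It is not part of the PR. It assumes the JMH JSON output is a list of entries with `benchmark`, `params` and `primaryMetric.score` fields, that the mode can be recognised from `.NMTOff.`, `.NMTSummary.` or `.NMTDetail.` in the benchmark name, and that "the same" means within a 5% tolerance; the function names and the tolerance are all made up for illustration.

```python
import json

def load_scores(path):
    """Map (benchmark name, sorted params) -> score from a JMH JSON result file."""
    with open(path) as f:
        results = json.load(f)
    scores = {}
    for entry in results:
        params = tuple(sorted(entry.get("params", {}).items()))
        scores[(entry["benchmark"], params)] = float(entry["primaryMetric"]["score"])
    return scores

def compare(before, after, tolerance=0.05):
    """Check the NMTOff baseline first; report other modes only if it is stable.

    Returns (baseline_ok, report), where report is a list of
    (key, relative_change) pairs for NMTSummary, then NMTDetail.
    The 5% tolerance is an assumed threshold, not a value from the PR.
    """
    common = sorted(set(before) & set(after))

    def rel_change(key):
        return (after[key] - before[key]) / before[key]

    off_keys = [k for k in common if ".NMTOff." in k[0]]
    baseline_ok = bool(off_keys) and all(
        abs(rel_change(k)) <= tolerance for k in off_keys
    )

    report = []
    if baseline_ok:
        for mode in ("NMTSummary", "NMTDetail"):
            for k in common:
                if f".{mode}." in k[0]:
                    report.append((k, rel_change(k)))
    return baseline_ok, report
```

A run might look like `ok, report = compare(load_scores("before.json"), load_scores("after.json"))`, where both file names are placeholders. A more careful version would probably also take the reported error margins into account instead of using a fixed tolerance.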

I added `usleep(1)` at the beginning of `MallocTracker::record_malloc()` to make sure the microbenchmark catches that, and it does.

> > The tool should look at `NMTOff` first from both runs and verify that they are the same, to prove that the results are useful (there should be no difference)
> 
> Which measures should be the same between _both runs_? For `NMT_Off`, for example, there are 2 measures per method (number of threads = 0, 4). Consider also that we will have more methods for the virtual memory alloc/dealloc tests.

I think all the available data should be used to verify, so all runs with `NMT_Off`?

This can be done in a follow-up issue, I think?

To me it looks like we have a good starting point, which we can already use (by manually comparing the results), so I think we are OK to push this in.

-------------

Changes requested by gziemski (Committer).

PR Review: https://git.openjdk.org/jdk/pull/15563#pullrequestreview-1614112680
PR Review: https://git.openjdk.org/jdk/pull/15563#pullrequestreview-1629213190
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1709016452
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1709017248
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1709039527
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1711976042
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1717863041
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1721458771