RFR: 8314438: NMT: Performance benchmarks are needed to have a baseline for comparison of improvements
Gerard Ziemski
gziemski at openjdk.org
Mon Sep 25 07:34:49 UTC 2023
On Tue, 5 Sep 2023 07:53:36 GMT, Afshin Zafari <azafari at openjdk.org> wrote:
> A new benchmark for measuring the NMT overhead in `summary` and `detail` modes.
> The tests are run using:
>
> make CONF=debug test TEST="micro:java.util.NMTBenchmark" MICRO="RESULTS_FORMAT=json"
>
> The results are written to a JSON file that can be visualized using [JMH Visualizer](https://jmh.morethan.io/).
>
> ### Notes
> A separate [issue](https://bugs.openjdk.org/browse/JDK-8316814) is created for preparing a program for validating and analyzing the JMH outputs.
> Another separate [issue](https://bugs.openjdk.org/browse/JDK-8316813) is created for measuring the virtual memory tracing parts of NMT.
The test will not compile for me unless I add:
`import java.util.concurrent.TimeUnit;`
Marked as reviewed by gziemski (Committer).
These are my results:
Benchmark (N) (THREADS) (WITH_THREADS) Mode Cnt Score Error Units
NMTBenchmark.NMTDetail.allocateMemory 100000 2 0 avgt 10 130.946 ± 8.761 ms/op
NMTBenchmark.NMTDetail.allocateMemory 100000 2 1 avgt 10 0.399 ± 0.228 ms/op
NMTBenchmark.NMTDetail.allocateMemory 100000 4 0 avgt 10 120.379 ± 8.072 ms/op
NMTBenchmark.NMTDetail.allocateMemory 100000 4 1 avgt 10 0.543 ± 0.207 ms/op
NMTBenchmark.NMTDetail.allocateMemory 100000 8 0 avgt 10 122.144 ± 3.696 ms/op
NMTBenchmark.NMTDetail.allocateMemory 100000 8 1 avgt 10 0.776 ± 0.132 ms/op
NMTBenchmark.NMTDetail.allocateMemory 1000000 2 0 avgt 10 1546.584 ± 24.937 ms/op
NMTBenchmark.NMTDetail.allocateMemory 1000000 2 1 avgt 10 2.757 ± 0.740 ms/op
NMTBenchmark.NMTDetail.allocateMemory 1000000 4 0 avgt 10 1530.738 ± 73.529 ms/op
NMTBenchmark.NMTDetail.allocateMemory 1000000 4 1 avgt 10 2.859 ± 0.624 ms/op
NMTBenchmark.NMTDetail.allocateMemory 1000000 8 0 avgt 10 1609.203 ± 29.717 ms/op
NMTBenchmark.NMTDetail.allocateMemory 1000000 8 1 avgt 10 3.113 ± 0.632 ms/op
NMTBenchmark.NMTOff.allocateMemory 100000 2 0 avgt 10 60.585 ± 4.187 ms/op
NMTBenchmark.NMTOff.allocateMemory 100000 2 1 avgt 10 0.333 ± 0.149 ms/op
NMTBenchmark.NMTOff.allocateMemory 100000 4 0 avgt 10 56.138 ± 2.727 ms/op
NMTBenchmark.NMTOff.allocateMemory 100000 4 1 avgt 10 0.493 ± 0.214 ms/op
NMTBenchmark.NMTOff.allocateMemory 100000 8 0 avgt 10 57.457 ± 2.310 ms/op
NMTBenchmark.NMTOff.allocateMemory 100000 8 1 avgt 10 0.731 ± 0.158 ms/op
NMTBenchmark.NMTOff.allocateMemory 1000000 2 0 avgt 10 815.515 ± 19.232 ms/op
NMTBenchmark.NMTOff.allocateMemory 1000000 2 1 avgt 10 2.674 ± 0.566 ms/op
NMTBenchmark.NMTOff.allocateMemory 1000000 4 0 avgt 10 829.726 ± 31.153 ms/op
NMTBenchmark.NMTOff.allocateMemory 1000000 4 1 avgt 10 2.678 ± 0.652 ms/op
NMTBenchmark.NMTOff.allocateMemory 1000000 8 0 avgt 10 817.929 ± 19.142 ms/op
NMTBenchmark.NMTOff.allocateMemory 1000000 8 1 avgt 10 3.023 ± 0.578 ms/op
NMTBenchmark.NMTSummary.allocateMemory 100000 2 0 avgt 10 101.419 ± 4.231 ms/op
NMTBenchmark.NMTSummary.allocateMemory 100000 2 1 avgt 10 0.329 ± 0.143 ms/op
NMTBenchmark.NMTSummary.allocateMemory 100000 4 0 avgt 10 102.224 ± 3.317 ms/op
NMTBenchmark.NMTSummary.allocateMemory 100000 4 1 avgt 10 0.514 ± 0.230 ms/op
NMTBenchmark.NMTSummary.allocateMemory 100000 8 0 avgt 10 101.127 ± 4.144 ms/op
NMTBenchmark.NMTSummary.allocateMemory 100000 8 1 avgt 10 0.719 ± 0.168 ms/op
NMTBenchmark.NMTSummary.allocateMemory 1000000 2 0 avgt 10 1423.287 ± 56.109 ms/op
NMTBenchmark.NMTSummary.allocateMemory 1000000 2 1 avgt 10 2.682 ± 0.442 ms/op
NMTBenchmark.NMTSummary.allocateMemory 1000000 4 0 avgt 10 1418.999 ± 33.891 ms/op
NMTBenchmark.NMTSummary.allocateMemory 1000000 4 1 avgt 10 2.812 ± 0.510 ms/op
NMTBenchmark.NMTSummary.allocateMemory 1000000 8 0 avgt 10 1431.552 ± 29.915 ms/op
NMTBenchmark.NMTSummary.allocateMemory 1000000 8 1 avgt 10 3.073 ± 0.583 ms/op
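As a rough illustration of how the NMT overhead can be read off these numbers, here is a small Python sketch. The helper name `overhead_pct` is hypothetical, and the three scores are copied by hand from the rows with N=100000, THREADS=2, WITH_THREADS=0 in the table above; the error column is ignored, so treat the percentages as approximate.

```python
def overhead_pct(off_ms, tracked_ms):
    """Relative slowdown of an NMT-enabled run over the NMT-off run, in percent."""
    return (tracked_ms - off_ms) / off_ms * 100.0


# Scores in ms/op for N=100000, THREADS=2, WITH_THREADS=0 (copied from the table).
off, summary, detail = 60.585, 101.419, 130.946

print(f"summary overhead: {overhead_pct(off, summary):.1f}%")
print(f"detail overhead:  {overhead_pct(off, detail):.1f}%")
```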
You said `The JSON file can be used for visualising the results.` Can you please explain how to do that exactly?
It looks like a good start.
The test took something like 20 minutes for me.
Is the multithreading mode really useful?
And even if we decide that it is, we can probably skip the entries that create the threads but do not actually use them? For example:
THREADS=2, WITH_THREADS=0
THREADS=4, WITH_THREADS=0
THREADS=8, WITH_THREADS=0
I like it! I do have more follow-ups, though.
I see that we are currently mixing malloc/free. Wouldn't we be better served if we separated those into 2 distinct categories? I can imagine a scenario where either the malloc or the free path could be regressed/improved without affecting the other.
Also, I think we should add realloc to the mix, again keeping it separate. I tried to collapse malloc/realloc into a single code path, but it was rejected, mostly on the basis that it would impact the performance - this microbenchmark would be a perfect tool to offer a definitive answer.
Testing the code and looking at the output the microbenchmark produces, I think it would be super useful (here or in a follow-up issue) to have a Python or Java app to quickly run and compare the results to give a user the difference, without having to calculate it by hand (and risk making a mistake).
The tool should look at `NMTOff` first from both runs and verify that they are the same, to prove that the results are useful (there should be no difference).
Only then, if the previous numbers show no change, it should report `NMTSummary`, then `NMTDetail` differences.
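To make the idea concrete, here is a minimal Python sketch of such a comparison, not a finished tool. It assumes the JMH JSON output is a list of entries with `benchmark`, `params` and `primaryMetric.score` fields, and that benchmark names contain `.NMTOff.`, `.NMTSummary.` or `.NMTDetail.` as in the table above. The function names and the 5% tolerance for "no difference" are hypothetical choices for illustration only.

```python
import json


def index_scores(results):
    """Map (benchmark name, sorted params) -> score for a parsed JMH JSON result list."""
    scores = {}
    for entry in results:
        params = tuple(sorted(entry.get("params", {}).items()))
        scores[(entry["benchmark"], params)] = entry["primaryMetric"]["score"]
    return scores


def load_scores(path):
    """Read a JMH JSON result file and index it with index_scores()."""
    with open(path) as f:
        return index_scores(json.load(f))


def compare(base, new, tolerance=0.05):
    """Check the NMTOff entries first; report the other modes only if they match.

    Returns (baseline_ok, report), where report is a list of
    (key, relative_difference) pairs for NMTSummary, then NMTDetail.
    The tolerance is an assumed noise threshold, not a value from the PR.
    """
    def rel_diff(key):
        return (new[key] - base[key]) / base[key]

    common = sorted(set(base) & set(new))
    off = [k for k in common if ".NMTOff." in k[0]]
    baseline_ok = bool(off) and all(abs(rel_diff(k)) <= tolerance for k in off)

    report = []
    if baseline_ok:
        for mode in ("NMTSummary", "NMTDetail"):
            report += [(k, rel_diff(k)) for k in common if f".{mode}." in k[0]]
    return baseline_ok, report
```

With two result files (the names here are placeholders), this would be called as `compare(load_scores("before.json"), load_scores("after.json"))`. An empty report with `baseline_ok` set to `False` means the `NMTOff` numbers moved too much for the other differences to be trusted.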
I added `usleep(1)` at the beginning of `MallocTracker::record_malloc()` to make sure the microbenchmark catches that, and it does.
> > The tool should look at `NMTOff` first from both runs and verify that they are the same, to prove that the results are useful (there should be no difference)
>
> Which measures should be the same between _both runs_? For `NMT_Off` for example, there are 2 (no of threads= 0,4) measures per method. Consider also, we will have more methods for virtual memory alloc/dealloc tests.
I think all the available data should be used to verify, so all runs with `NMT_Off`?
This can be done in a follow-up issue, I think?
To me it looks like we have a good starting point, which we can already use (by manually comparing the results), so I think we are OK to push this in.
-------------
Changes requested by gziemski (Committer).
PR Review: https://git.openjdk.org/jdk/pull/15563#pullrequestreview-1614112680
PR Review: https://git.openjdk.org/jdk/pull/15563#pullrequestreview-1629213190
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1709016452
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1709017248
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1709039527
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1711976042
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1717863041
PR Comment: https://git.openjdk.org/jdk/pull/15563#issuecomment-1721458771