[foreign-memaccess+abi] RFR: Add benchmark on ResourceScope::close

Fri Apr 16 15:26:33 UTC 2021

This patch adds a new benchmark for ResourceScope::close. I think it's interesting to measure the tradeoffs among the various configurations; we test 4 different configurations:

* confined scope
* shared scope
* implicit scope
* implicit scope, with periodic calls to System::gc

For each configuration, we create a scope, allocate a segment with it (of fixed size) and then close the scope.
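
To make the create/allocate/close pattern concrete without depending on the incubating API, here is a minimal sketch that models it with plain JDK classes. Everything in it is an assumption made for illustration: `FakeScope` is a hypothetical stand-in (not the real ResourceScope), the "segment" is just a byte array, and the implicit variant is modeled with `java.lang.ref.Cleaner`. It is not the benchmark code from the patch.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

class ScopeLifecycleSketch {
    static final Cleaner CLEANER = Cleaner.create();
    // Counts release actions so callers can observe when cleanup actually ran.
    static final AtomicInteger RELEASED = new AtomicInteger();

    /** Hypothetical stand-in for a scope that owns one fixed-size allocation. */
    static final class FakeScope implements AutoCloseable {
        final byte[] segment;                      // stands in for the off-heap segment
        private final Cleaner.Cleanable cleanable;

        FakeScope(int size) {
            this.segment = new byte[size];
            // The action must not capture 'this', otherwise the scope never becomes unreachable.
            this.cleanable = CLEANER.register(this, RELEASED::incrementAndGet);
        }

        /** Explicit, deterministic release; the action runs at most once. */
        @Override
        public void close() {
            cleanable.clean();
        }
    }

    /** Explicit pattern: create, allocate, close. Returns how many releases ran. */
    static int explicitRounds(int rounds, int size) {
        int before = RELEASED.get();
        for (int i = 0; i < rounds; i++) {
            try (FakeScope scope = new FakeScope(size)) {
                scope.segment[0] = 1;              // touch the allocation
            }
        }
        return RELEASED.get() - before;
    }

    /** Implicit pattern: create, allocate, drop the reference; release timing is up to the GC. */
    static void implicitRounds(int rounds, int size) {
        for (int i = 0; i < rounds; i++) {
            FakeScope scope = new FakeScope(size);
            scope.segment[0] = 1;
        }
    }

    public static void main(String[] args) {
        System.out.println("explicit releases: " + explicitRounds(1000, 100));
        implicitRounds(1000, 100);
        System.out.println("releases observed so far: " + RELEASED.get());
    }
}
```

In the explicit variant every release has happened by the time the loop returns; in the implicit variant the count observed afterwards depends entirely on when the GC gets around to it.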

The benchmark supports a number of stress modes:

* NONE - i.e. no extra work; only the benchmark is run
* MEMORY - this puts additional memory pressure by creating lots of small byte arrays at the start of the benchmark; designed to disrupt implicit scopes
* THREADS - this creates extra threads which are spinning in a busy loop; designed to disrupt explicit shared scopes
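
As a rough illustration of what such stress modes can look like, here is a sketch in plain Java. The names (`StressModeSketch`, `setup`, `tearDown`), the counts and sizes passed in, and the choice to keep the byte arrays reachable for the duration of the run are all assumptions of this sketch; the actual setup code is in the patch linked below.

```java
import java.util.ArrayList;
import java.util.List;

class StressModeSketch {
    enum StressMode { NONE, MEMORY, THREADS }

    // Kept in fields so the arrays stay reachable and the threads can be stopped later.
    final List<byte[]> retained = new ArrayList<>();
    final List<Thread> spinners = new ArrayList<>();
    volatile boolean stop;

    /** Applies the extra pressure for the given mode before the measured code runs. */
    void setup(StressMode mode, int arrays, int arraySize, int threads) {
        switch (mode) {
            case MEMORY:
                // Fill the heap with many small arrays so each GC cycle has more to look at.
                for (int i = 0; i < arrays; i++) {
                    retained.add(new byte[arraySize]);
                }
                break;
            case THREADS:
                // Start daemon threads that spin until told to stop.
                for (int i = 0; i < threads; i++) {
                    Thread t = new Thread(() -> {
                        while (!stop) {
                            Thread.onSpinWait();
                        }
                    });
                    t.setDaemon(true);
                    t.start();
                    spinners.add(t);
                }
                break;
            default:
                break; // NONE: no extra work
        }
    }

    /** Stops the spinner threads and drops the retained arrays. */
    void tearDown() {
        stop = true;
        for (Thread t : spinners) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        spinners.clear();
        retained.clear();
    }
}
```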

Results are as follows:

Benchmark                                                                 (mode)  Mode  Cnt      Score      Error   Units
ResourceScopeClose.confined_close                                           NONE  avgt   30      0.107 ±    0.002   us/op
ResourceScopeClose.confined_close:·gc.time                                  NONE  avgt   30     39.000                 ms
ResourceScopeClose.confined_close                                         MEMORY  avgt   30      0.114 ±    0.006   us/op
ResourceScopeClose.confined_close:·gc.time                                MEMORY  avgt   30     12.000                 ms
ResourceScopeClose.confined_close                                        THREADS  avgt   30      0.115 ±    0.002   us/op
ResourceScopeClose.confined_close:·gc.time                               THREADS  avgt   30     30.000                 ms
ResourceScopeClose.implicit_close                                           NONE  avgt   30      5.654 ±    9.617   us/op
ResourceScopeClose.implicit_close:·gc.time                                  NONE  avgt   30  15540.000                 ms
ResourceScopeClose.implicit_close                                         MEMORY  avgt   30      4.085 ±    4.375   us/op
ResourceScopeClose.implicit_close:·gc.time                                MEMORY  avgt   30  23126.000                 ms
ResourceScopeClose.implicit_close                                        THREADS  avgt   30      2.380 ±    2.585   us/op
ResourceScopeClose.implicit_close:·gc.time                               THREADS  avgt   30  15940.000                 ms
ResourceScopeClose.implicit_close_systemgc                                  NONE  avgt   30     31.301 ±    0.667   us/op
ResourceScopeClose.implicit_close_systemgc:·gc.time                         NONE  avgt   30  14520.000                 ms
ResourceScopeClose.implicit_close_systemgc                                MEMORY  avgt   30   1502.083 ±   28.460   us/op
ResourceScopeClose.implicit_close_systemgc:·gc.time                       MEMORY  avgt   30  23087.000                 ms
ResourceScopeClose.implicit_close_systemgc                               THREADS  avgt   30     30.733 ±    0.704   us/op
ResourceScopeClose.implicit_close_systemgc:·gc.time                      THREADS  avgt   30  14551.000                 ms
ResourceScopeClose.shared_close                                             NONE  avgt   30      8.850 ±    0.936   us/op
ResourceScopeClose.shared_close:·gc.time                                    NONE  avgt   30      6.000                 ms
ResourceScopeClose.shared_close                                           MEMORY  avgt   30      8.401 ±    0.506   us/op
ResourceScopeClose.shared_close                                          THREADS  avgt   30     10.966 ±    0.349   us/op
ResourceScopeClose.shared_close:·gc.time                                 THREADS  avgt   30      4.000                 ms

Of course the confined case comes out on top; very little GC activity there, very good perf, and very low variance.

Shared scopes are second best - performance is not as good as in the confined case (going by the table above, the close is roughly 75-95x slower, i.e. ~8-11 us/op vs ~0.11 us/op) - but variance is low and so is GC activity.

Then there are implicit scopes. In the version without System::gc calls, the scores alone might trick us into thinking that the results are good. In reality, if you look at GC activity, you see that there is a huge amount of time (~15s !!) spent on GC. What's worse, and what cannot be appreciated here (as I didn't find a way to let JMH emit the data), is that, by observing the benchmark process with `top` in the implicit case w/o calls to `System::gc`, we see peaks of resident memory up to 10g (!!).
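
For anyone who wants to sample GC time by hand outside JMH, the cumulative collection time is available through the standard collector MXBeans (as far as I know this is also where JMH's gc profiler gets its `gc.time` figure, but treat that as an assumption). A minimal sketch follows; the class and method names are made up for illustration. Note that this only covers GC time, not resident memory, which still has to be observed externally (e.g. with `top`).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcTimeSketch {
    /** Sums the accumulated collection time (ms) over all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();   // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Returns the GC time (ms) accumulated while the given work was running. */
    static long gcMillisDuring(Runnable work) {
        long before = totalGcMillis();
        work.run();
        return totalGcMillis() - before;
    }

    public static void main(String[] args) {
        long ms = gcMillisDuring(() -> {
            // Throwaway allocations, just to give the collector something to do.
            for (int i = 0; i < 1_000_000; i++) {
                byte[] junk = new byte[128];
                junk[0] = (byte) i;
            }
        });
        System.out.println("GC time during work: " + ms + " ms");
    }
}
```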

The version which periodically calls `System::gc` keeps resident memory under control (it never exceeds 1g that way) - but as you can see, with explicit calls to `System::gc` the cost grows with the amount of heap in use (look at the MEMORY stress mode, where the score goes from ~31 us/op to ~1502 us/op).

Overall, confined and shared segments are very good, compared to what is possible to achieve using cleaners; closing a shared scope is more expensive, yes (while we don't think we can improve these numbers much, we'll keep looking for opportunities), but there's none of the system thrashing that occurs when the benchmark relies only on the GC. Also, this benchmark is rather unrealistic since it basically does nothing with the segment created from the scope; as soon as some real code is added there, the additional cost for closing the segment will likely be washed away in many cases.

-------------

Commit messages:
 - Add benchmark on ResourceScope::close

Changes: https://git.openjdk.java.net/panama-foreign/pull/508/files
 Webrev: https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=508&range=00
  Stats: 132 lines in 1 file changed: 132 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/panama-foreign/pull/508.diff
  Fetch: git fetch https://git.openjdk.java.net/panama-foreign pull/508/head:pull/508

PR: https://git.openjdk.java.net/panama-foreign/pull/508