RFR: 8267532: Try/catch block not optimized as expected

Wed Nov 1 13:48:27 UTC 2023

The issue is essentially that for the Java try-with-resource construct, javac generates multiple calls to `close(`) the resource. One of those calls is inside the hidden exception handler of the try block. The issue for us is that typically the exception handler is never entered (since no exception is thrown), however we don't profile exception handlers at the moment, so the block is not pruned. C2 doesn't inline the `close()` call in the handler due to low call site frequency. As a result, the receiver of that call escapes and can not be scalar replaced, which then leads to a loss in performance.

There has been some discussion on the JBS issue that this could be fixed by profiling catch blocks. And another suggestion that partial escape analysis could help here to prevent the object from escaping. But, I think there are other benefits to being able to prune dead catch blocks, such as general reduction in code size, and other optimizations being possible by dead code being eliminated. So, I've implemented catch block profiling + pruning in this patch.

The implementation is essentially very straightforward: we allocate an extra bit of profiling data for each
exception handler of a method in the `MethodData` for that method (which holds all the profiling
data). Then when looking up the exception handler after an exception is thrown, we mark the
exception handler as entered. When C2 parses the exception handler block, and it sees that it has
never been entered, we emit an uncommon trap instead.

I've also cleaned up the handling of profiling data sections a bit. After adding the extra section of data to MethodData, I was seeing several crashes when ciMethodData was used. The underlying issue seemed to be that the offset of the parameter data was computed based on the total data size - parameter data size (which doesn't work if we add an additional section for exception handler data). I've re-written the code around this a bit to try and prevent issues in the future. Both MethodData and ciMethodData now track offsets of parameter data and exception handler data, and the size of the each data section is derived from the offsets.

Finally, there was an assert firing in `freeze_internal` in `continuationFreezeThaw.cpp`:

    assert(monitors_on_stack(current) == ((current->held_monitor_count() - current->jni_monitor_count()) > 0),
         "Held monitor count and locks on stack invariant: " INT64_FORMAT " JNI: " INT64_FORMAT, (int64_t)current->held_monitor_count(), (int64_t)current->jni_monitor_count());

This assert relies on `has_monitors` being set for a method, which in itself relies on `monitorenter` and `monitorexit` being parsed. However, if we prune untaken exception handlers, we might not see any `monitorexit`, which is a problem for OSR compilations since then we might also not see any `monitorenter` in that case. After some investigation, it turns out that `ciMethod` already tracks whether monitor bytecodes are being used, so we can just piggyback on that instead of relying on `monitorenter` or `monitorexit` being parsed. We can follow the existing pattern for how `has_reserved_stack_access` is being tracked (which I've done). See https://github.com/openjdk/jdk/pull/16416/commits/a48420681549ac9343f625e1a3a26a737fc4921e https://github.com/openjdk/jdk/pull/16416/commits/a33a905689a056ac053ac34df64541b652747076 and https://github.com/openjdk/jdk/pull/16416/commits/d727df704ea092eedb20517bdd696d82d75b00b2

Benchmark with `-XX:-PruneDeadExceptionHandlers`:

Benchmark                                                      Mode  Cnt      Score     Error   Units
ResourceScopeCloseMin.confined_close                           avgt   30     10.458 ±   0.070   ns/op
ResourceScopeCloseMin.confined_close:gc.alloc.rate             avgt   30   9480.988 ±  63.335  MB/sec
ResourceScopeCloseMin.confined_close:gc.alloc.rate.norm        avgt   30    104.000 ±   0.001    B/op
ResourceScopeCloseMin.confined_close:gc.count                  avgt   30    119.000            counts
ResourceScopeCloseMin.confined_close:gc.time                   avgt   30     94.000                ms
ResourceScopeCloseMin.confined_close_notry                     avgt   30      4.691 ±   0.063   ns/op
ResourceScopeCloseMin.confined_close_notry:gc.alloc.rate       avgt   30  11383.693 ± 151.145  MB/sec
ResourceScopeCloseMin.confined_close_notry:gc.alloc.rate.norm  avgt   30     56.000 ±   0.001    B/op
ResourceScopeCloseMin.confined_close_notry:gc.count            avgt   30    120.000            counts
ResourceScopeCloseMin.confined_close_notry:gc.time             avgt   30    104.000                ms

with `-XX:+PruneDeadExceptionHandlers`:

Benchmark                                                      Mode  Cnt      Score     Error   Units
ResourceScopeCloseMin.confined_close                           avgt   30      4.563 ±   0.043   ns/op
ResourceScopeCloseMin.confined_close:gc.alloc.rate             avgt   30  11702.868 ± 108.816  MB/sec
ResourceScopeCloseMin.confined_close:gc.alloc.rate.norm        avgt   30     56.000 ±   0.001    B/op
ResourceScopeCloseMin.confined_close:gc.count                  avgt   30    121.000            counts
ResourceScopeCloseMin.confined_close:gc.time                   avgt   30     93.000                ms
ResourceScopeCloseMin.confined_close_notry                     avgt   30      4.601 ±   0.054   ns/op
ResourceScopeCloseMin.confined_close_notry:gc.alloc.rate       avgt   30  11605.391 ± 134.000  MB/sec
ResourceScopeCloseMin.confined_close_notry:gc.alloc.rate.norm  avgt   30     56.000 ±   0.001    B/op
ResourceScopeCloseMin.confined_close_notry:gc.count            avgt   30    121.000            counts
ResourceScopeCloseMin.confined_close_notry:gc.time             avgt   30    101.000                ms

Note that with the optimization turned on, timing and `gc.alloc.rate.norm` is ~equal.

I also noticed through other experiments that C2's ability to inline improves, due to `inline_instructions_size` being reduced for methods with untaken exception handlers, which might bring the size under `InlineSmallCode`, and allow the method to be inlined again.

Testing : tier 1-6 (ongoing). Local run of `hotspot_compiler` suite with `-XX:+DeoptimizeALot` and with `-XX:+StressPrunedExceptionHandlers`.

-------------

Commit messages:
 - use TWR in foreign benchmarks
 - reduce duplication
 - beef up test a bit
 - improve StressPrunedExceptionHandlers comment
 - Revert "Revert "add exception handler pruning stress flags""
 - also set has_monitors when callee is synchronized
 - Revert "add exception handler pruning stress flags"
 - use has_monitor_bytecodes() in c1 as well
 - use ciMethod::has_monitor_bytecodes
 - add exception handler pruning stress flags
 - ... and 18 more: https://git.openjdk.org/jdk/compare/6f352740...059e306c

Changes: https://git.openjdk.org/jdk/pull/16416/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16416&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8267532
  Stats: 598 lines in 23 files changed: 498 ins; 34 del; 66 mod
  Patch: https://git.openjdk.org/jdk/pull/16416.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16416/head:pull/16416

PR: https://git.openjdk.org/jdk/pull/16416